Beyond Fisher
Last Updated: 2021
Microsoft and R. A. Fisher
Due to the dominance of Microsoft Windows and Intel processors in the
computer industry, for most users computing is synonymous with "Wintel."
The same pattern can be found in the field of research methodology.
Today, when most researchers talk about statistical analysis, they
usually mean hypothesis testing, which is a fusion of two
schools of thought: Fisher and Neyman-Pearson (Lehmann, 1993). Between
the two schools, Fisher (1932, 1935) is arguably the dominant one.
The Fisherian legacy is also known as statistical significance testing,
null hypothesis significance testing, hypothesis testing, traditional
inferential statistics, classical procedures, and parametric testing.
In the future, Microsoft may announce: "Today Microsoft has taken over SAS Institute, IBM SPSS, Stata, SyStat, and others. From now on, only hypothesis testing is
available in all statistical software packages. Resistance is futile."
Limitations of the Fisherian approach
Just as Windows is not suitable for every
application, hypothesis testing has its limitations. There are at least twelve limitations of the Fisherian approach:
- Does not address treatment effectiveness or efficacy: Contrary to popular belief, statistical hypothesis testing does not address the issue of treatment effectiveness (Wasserstein & Lazar, 2016). Very often researchers interpret a significant p value as evidence of treatment effectiveness or efficacy. As a matter of fact, hypothesis testing does not evaluate how likely the hypothesis is to be correct given the data, P(H|D). On the contrary, it assumes that the null hypothesis is true and examines how likely the data would be to occur in the long run given that the null is true, P(D|H) (McClure & Suen, 1994; Cohen, 1994; Loftus, 1996). To convey the magnitude of a treatment effect, the researcher must report the effect size (see the effect-size sketch following this list).
- Easy to reject the null hypothesis: By definition, a null hypothesis denotes no difference (zero effect). Loftus (1996) mocked that "rejecting a typical null hypothesis is like rejecting the proposition that the moon is made of green cheese. The appropriate response would be 'Well, yes, okay ... but so what?'" Indeed, it is very common for experimenters to compare a control group (do nothing) to a treatment group. The winner is the treatment group, of course. For more information, please read the section "maximize variance control" of the webpage Experimental design as variance control.
- Mismatch of null and alternate hypotheses: Statistical hypothesis testing is a fusion of two schools: Fisher and Neyman-Pearson. In the former, the null hypothesis is proposed and the alpha level is associated with the null. In the latter, the alternate hypothesis is proposed, and power and beta are associated with the alternate. Very few people notice that when Neyman and Pearson developed their methodology, they left out the p value altogether. In short, the two theories are inconsistent (Siegfried, 2015).
- Problems with reproducibility: The results yielded by hypothesis testing vary from study to study, and scientists are increasingly concerned about replication problems. A common misinterpretation of significance is that the p value tells you how "right" the conclusion is: if the p value is .01, then supposedly there is only a 1% chance that the conclusion is wrong. Actually, a p value of .01 corresponds to a Type I error rate (false alarm) of at least 11%, and a p value of .05 to an error rate of at least 29% (a sketch following this list shows one way to obtain these figures). Given that the result of any particular study could be wrong, it is not surprising that researchers are unable to replicate the original results (Nuzzo, 2014).
- Reliance on parametric assumptions: Many significance tests are parametric tests, which rely on parametric assumptions. However, the data obtained by the analyst often violate those assumptions. In theory there are many countermeasures against such violations, such as employing non-parametric tests, robust procedures, and data transformations (a sketch following this list shows a simple example). In practice, many researchers do not check the assumptions before running parametric tests, resulting in questionable conclusions.
- Probability as a relative frequency in the long run: Hypothesis testing is based upon probability, which is defined as a relative frequency in the long run. However, it is problematic to apply this frequentist view of probability to a single event or a new event (Carver, 1978). For example, what is the probability that the universe was formed by a big bang? What is the probability that a newly invented super-Java-enabled Web-based instruction program is effective in teaching computer science?
- Theoretical distribution: The findings that result from hypothesis testing provide no information about the form of the underlying pattern of the population. Very often researchers conduct hypothesis testing either with the assumption of a particular population distribution (normality) or without asking the question at all. Indeed, in the parametric-test framework the population to which inferences are made is infinite and unknown. In effect, researchers make inferences from the population to the sample rather than from the sample to the population. This issue has been discussed in the article regarding misconceived relationships among the sample, the sampling distribution, and the population. Basically, in hypothesis testing the test statistic is compared against a "known" distribution. However, this weakly substantiated assumption (knowing the underlying distribution) results in a false sense of certainty. In contrast to Fisher, Neyman, and Pearson, the founder of exploratory data analysis, John Tukey (1986), questioned our reliance on supposedly known distributions; he asserted that progress in statistics can only be made when we move away from certainty, and that we need to give up the idea that a single body of data corresponds to a unique appropriate analysis.
- Point estimate: Unlike a confidence interval (CI), which indicates a possible range for the population parameter, hypothesis testing yields a point estimate. It is based on the conviction that in the population there is one and only one fixed constant that represents the true parameter. However, this belief is challenged by Bayesians because in reality the population is ever-changing, and thus there is no such thing as a fixed parameter.
- Yields a dichotomous answer: The conclusion yielded by a significance test is dichotomous: either the effect is significant or it is not. It misleads information consumers, and even researchers, into seeing the world as black or white (Cumming, 2013). Cumming (2013) suspected that this "fatally-flawed" approach is widely embraced because the simplified decision-making process has a "seductive appeal--the apparent but illusory certainty" (p. 5).
- Arbitrary cutoff: The dichotomous decision is based on conventional alpha levels, such as .05. However, there is nothing "magical" about .05. Actually, using a fixed cutoff, as commonly practiced by many researchers, goes against Fisher's own advice. Fisher (1956) stated, "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas" (p. 41). Rosnow and Rosenthal (1989) stated their objection to blindly using .05 in a humorous way: "God loves the .06 nearly as much as the .05" (p. 1277). Unfortunately, today hypothesis testing has become mechanical rather than a matter of judgment. As a remedy, a partnership of 72 methodologists proposed lowering the commonly used alpha level from .05 to .005 for new discoveries (Benjamin et al., 2018). However, Ioannidis (2018) argued that this approach is only a temporizing fix. Echoing Ioannidis's view, it is the conviction of the author that as long as a fixed cutoff is adopted, we will continue to face a similar question: God loves the .006 nearly as much as the .005.
- Unknown error and circularity: Hypothesis testing is said to balance the Type I and Type II error rates, but in reality it does not. The test starts from the premise that the null hypothesis is true, and thus the output is a p value indicating the probability of observing the statistic given that the null is true. If the null really is true, the Type I error rate is at most 5%; but if the null is false, the Type II error rate is not controlled and can be as high as 95%. Schmidt and Hunter (2014) criticized hypothesis testing for a fundamental circularity: if the researcher does not know whether the null is true or not, then he cannot tell whether the error at stake is Type I or Type II; but if he knows that the null is true, then there is no need to perform the test. Schmidt and Hunter suggested that there is only one way to rectify this situation: give up hypothesis testing and report confidence intervals.
- Incapable of performing big data analytics: Experts on data science predict that the size of digital data will double every two years, a 50-fold growth from 2010 to 2020. As a matter of fact, human- and machine-generated data are increasing ten times faster than traditional data (Ffoulkes, 2017). However, traditional statistical procedures were designed for small-sample studies. When n is very large, virtually any trivial effect is detected as "significant" (see the effect-size sketch following this list).
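To make the effect-size and big-data points concrete, here is a minimal Python sketch. It is my own illustration rather than an analysis from any of the cited studies, and every number in it (group means, standard deviation, sample size, random seed) is an arbitrary assumption. With a million cases per group, a trivially small true effect still yields an extremely small p value, while Cohen's d and the confidence interval expose how negligible the effect is.

    # Hypothetical simulation: a huge sample with a trivially small true effect.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n = 1_000_000                                         # "big data" sample per group
    control = rng.normal(loc=0.00, scale=1.0, size=n)
    treatment = rng.normal(loc=0.02, scale=1.0, size=n)   # true effect: d = 0.02

    t_stat, p_value = stats.ttest_ind(treatment, control)

    # Effect size (Cohen's d) and a 95% CI for the mean difference
    diff = treatment.mean() - control.mean()
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    cohens_d = diff / pooled_sd
    se_diff = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)
    ci_low, ci_high = diff - 1.96 * se_diff, diff + 1.96 * se_diff

    print(f"p = {p_value:.2e}")               # far below .05, hence "significant"
    print(f"Cohen's d = {cohens_d:.3f}")      # yet the effect is negligible
    print(f"95% CI for the mean difference: [{ci_low:.4f}, {ci_high:.4f}]")

Reporting the effect size and the interval alongside the p value keeps the triviality of such an effect visible.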
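The 11% and 29% false-alarm figures quoted in the reproducibility item can be reproduced, at least approximately, with a well-known calibration of p values. The article does not spell out the assumptions behind Nuzzo's (2014) numbers, so the sketch below is only one plausible reconstruction: it applies the minimum Bayes factor bound, -e * p * ln(p), under the added assumption of 50:50 prior odds between the null and the alternative.

    # One common calibration of p values into false-alarm probabilities,
    # assuming 50:50 prior odds; treat this as an illustration, not a derivation
    # taken from the cited sources.
    import math

    def false_alarm_probability(p, prior_null=0.5):
        """Lower bound on P(null is true | the null was rejected with p value p)."""
        if not 0 < p < 1 / math.e:
            raise ValueError("the bound applies only for 0 < p < 1/e")
        min_bayes_factor = -math.e * p * math.log(p)   # best-case evidence ratio for the null
        posterior_odds = min_bayes_factor * prior_null / (1 - prior_null)
        return posterior_odds / (1 + posterior_odds)

    for p in (0.05, 0.01):
        print(f"p = {p}: false-alarm probability of at least {false_alarm_probability(p):.0%}")
    # p = 0.05: false-alarm probability of at least 29%
    # p = 0.01: false-alarm probability of at least 11%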
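For the parametric-assumption item, the following sketch shows one simple, commonly taught workflow: screen each group for normality and fall back to a non-parametric test when the assumption looks doubtful. The simulated data and the .05 screening threshold are arbitrary choices for illustration; robust procedures and data transformations are equally legitimate remedies.

    # Hypothetical example: check normality, then choose a parametric or
    # non-parametric two-sample test accordingly.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    group_a = rng.exponential(scale=1.0, size=40)   # skewed data violate normality
    group_b = rng.exponential(scale=1.5, size=40)

    _, p_norm_a = stats.shapiro(group_a)
    _, p_norm_b = stats.shapiro(group_b)

    if p_norm_a > 0.05 and p_norm_b > 0.05:
        stat, p = stats.ttest_ind(group_a, group_b)
        test_name = "independent-samples t test"
    else:
        stat, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
        test_name = "Mann-Whitney U test"

    print(f"{test_name}: statistic = {stat:.3f}, p = {p:.4f}")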
Alternatives
Methodologists have been questioning over-dependence on and
misuse of the Fisherian statistical approach for many years. In the summer of 1993, the
Journal of Experimental Education devoted an entire issue to the theme
"Statistical significance testing in contemporary practice: Some
proposed alternatives." Several options such as effect sizes,
cross-validation, and resampling were proposed (e.g., Thompson, 1993;
Snyder & Lawson, 1993). In 1996 a task force formed by the Board of
Scientific Affairs of the American Psychological Association (APA) also
suggested that researchers apply a wide variety of statistical
techniques, such as Exploratory Data Analysis and Bayesian inference.
In response to the frequent inappropriate use of hypothesis
testing, in 1999 Dr. Leland Wilkinson led the Task Force on Statistical
Inference, formed by the same APA board, to address the controversy.
In conclusion, the Task Force did not recommend abandoning hypothesis
testing, but suggested using it with caution; researchers were also
urged to use more options, such as confidence intervals and effect sizes.
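As a small illustration of the resampling and interval-estimation options mentioned above (my own sketch with made-up data, not an example taken from the Task Force report), a bootstrap confidence interval for a mean difference can be computed in a few lines of Python:

    # Hypothetical data; group means, sample sizes, resample count, and seed
    # are arbitrary choices for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    control = rng.normal(50, 10, size=60)
    treatment = rng.normal(55, 10, size=60)

    boot_diffs = []
    for _ in range(10_000):
        # Resample each group with replacement and record the mean difference
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(treatment, size=treatment.size, replace=True)
        boot_diffs.append(t.mean() - c.mean())

    low, high = np.percentile(boot_diffs, [2.5, 97.5])
    print(f"Observed mean difference: {treatment.mean() - control.mean():.2f}")
    print(f"95% bootstrap CI: [{low:.2f}, {high:.2f}]")

Unlike a bare p value, the interval conveys both the direction and the plausible magnitude of the effect.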
After the release of the APA Task Force report, effect-size reporting in American Speech-Language-Hearing Association journals
from 1999 to 2003 was reviewed (Meline & Wang, 2004). Reporting of effect size in quantitative studies increased from 5
reports during 1990-1994 to 120 reports during 1999-2003. However, effect
size was reported less than 30% of the time when inferential statistics
were used, and only half of those reports included an interpretation of
the effect size.
In 2016 the American Statistical Association (ASA) issued a
statement warning against the widespread misuse of p values. The ASA
highlighted six main points (Wasserstein & Lazar, 2016):
- “P-values can indicate how incompatible the data are with a specified statistical model” (p.131).
- “P-values do not measure the probability that the studied
hypothesis is true, or the probability that the data were produced by
random chance alone” (p.131).
- “Scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a specific
threshold” (p.131).
- “Proper inference requires full reporting and transparency” (p.131).
- “A p-value, or statistical significance, does not measure the size of an effect or the importance of a result” (p.132).
- “By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis” (p.132).
In conclusion, the ASA recommends that rather than putting all the
weight of evidence on a single index, researchers take good
study design, data visualization, proper understanding, and contextual
interpretation into account while conducting research.
Narrowness of graduate programs
Many researchers tend to follow convention rather than
experiment with alternate strategies. This conservative behavior may
be owing to a lack of comprehensive training in graduate schools.
Aiken, West, Sechrest, and Reno (1990) published an article surveying
the quantitative-methods curricula of graduate psychology programs.
It was found that new and important research methodologies such as
structural equation modeling, confirmatory factor analysis, exploratory
data analysis, and meta-analysis were not taught in the majority of
those programs.
Behrens (1996) held a similar view. To rectify the situation in which graduate
programs overly stress hypothesis testing, Behrens suggested that they
integrate instruction in confirmatory statistics with
alternative data-analytic methods such as meta-analysis, Bayesian analysis,
interval-estimation approaches, and hybrid combinations.
Schield (1998) also complained that traditional statistical training
"covers only half the topic needed for statistical literacy. In
addition to descriptive statistics and inferential statistics,
statistical literacy should include Bayesian statistics and, most of
all, evidential statistics" (p. 2).
Aiken, West, and Millsap's (2008) study, a replication and extension of
Aiken et al.'s (1990) survey, examined whether innovations in quantitative
methodology have diffused into the training of PhDs in psychology. On one hand,
exciting advances had occurred in statistical analysis (e.g.,
multilevel modeling), measurement (e.g., item response theory), and research
design (e.g., propensity scores in observational studies); on the other hand,
many psychology programs still maintained the traditional curriculum. For example,
slightly fewer than half of the departments that responded to the survey offered a
full course on structural equation modeling. Coverage of specialized statistical
methods, such as multilevel modeling, was even sparser. Across PhD programs in
psychology, the measurement requirement occupied a median of only 4.5 weeks.
Even worse, the research design curriculum has largely stagnated.
Up-hill battle
Indeed, the crisis lies not only in the lack of knowledge
of alternate approaches, but also in the poverty of conventional research
skills. Inappropriate use of statistics by researchers across
different disciplines has been well documented (Carver, 1978; Gore et al.,
1977; Gibbons & Freund, 1986; Glass, Peckham, & Sanders, 1972;
Maxwell & Delaney, 1990; Morrison & Henkel, 1970; Pedhazur
& Schmelkin, 1991; Thompson, 1994; Wainer, 1989). The purpose of
alternate methodologies is to compensate for the limitations of the
traditional approach. However, when researchers do not understand the
assumptions and limitations of the Fisherian school, how can they look
for proper tools to remedy the problem?
Even if alternate methods were widely adopted, it is likely that the
implementation of alternate methods would be as careless as in the
conventional approach. There is no guarantee that the quality of
research would be improved by introducing new methods alone. If the
attitude toward research methodology does not improve, I am afraid that
papers like "Common methodology mistakes in exploratory data analysis,"
"A critical assessment to misinterpretation of resampling," and "The
case against Bayesian inference" would be as popular as papers against
statistical testing today.
Many people are aware of the narrowness of statistical analysis and
attempt to counter it, but it is an up-hill battle. For
example, two professors at Michigan State University once reformed the
graduate statistics courses by introducing alternate statistics such as
confidence intervals and effect sizes. The reform, however, was
protested by other faculty members, who worried that graduate
students who received this unorthodox training might not be able to get
their research published (Schmidt, 1996). The norm set by refereed
journals is one of many reasons why many researchers, faculty, and
graduate students blindly follow statistical testing. Similar
objections are found in other institutions: "We must continue to
teach hypothesis testing because 90% of the textbooks and journals are
still using it." I have heard many other reasons:
Tool mastering
"Educational and psychological researchers are not mathematics majors.
We don't have to know the detail of statistics." Medical students are
not chemistry and biology majors. Do they need to learn the deep
knowledge of chemistry and biology? Chemistry and biology for
physicians and statistics for social scientists are the means rather
than the end. But without the proper tool, it is impossible for them to
do a good job. The requirement of "tool mastering" is very common in
humanities. For instance, once a Sinologist told me that to study the
history of Yuan dynasty of China, one must achieve a high proficiency
in the Chinese, Mongolian, and Persian languages in order to read first
hand documents. A philosopher specializing in Buddhism also needs to
learn the Chinese, Japanese, Hindu, as well as several other languages
before conducting any meaningful research. Unfortunately, the concept
of "tool mastering" is absent from the mentality of many social
scientists.
Driving a car and building a car: Applications and research
This argument is very similar to the preceding one: "Psychologists and
educators should focus on the subject matter of their own domain rather
than spending time on irrelevant matters. We are concerned with
applications rather than theoretical abstraction." What does it take to
drive a car? Very simple: just a few days of training is sufficient to
make a qualified driver. But what does it take to build a car? Several
years of training in mechanical and electrical engineering is the
minimum. When an automobile engineer tells consumers that the vehicle
he designed is safe at all speeds, he is responsible for other
people's lives. By the same token, when a psychological researcher
instructs practitioners and policy-makers to follow his theory, he is
accountable, too. He must be sure that the application is firmly
founded on a sound theoretical framework. When psychologists claim that
children raised in non-traditional families do not have a higher
probability of developing maladjusted behaviors, or that legalizing
drugs does not lead to long-term harmful effects on society, I wonder
whether these conclusions result from appropriate implementation of
research methodologies.
By synthesizing the views of Kerlinger and Pedhazur, Daniel (1997)
debunked the myth that statistical knowledge is insignificant in
research:
Because all statistical methods have certain inherent strengths and
limitations and because each method implies certain assumptions about
the data being analyzed, the use of these methods to some degree
influences both the nature and selection of research problems
(Kerlinger, 1969, 1986; Kerlinger & Pedhazur, 1973). Therefore, the
claim that statistical knowledge is unnecessary to good research
practice is unfounded; in fact, Kerlinger noted that "it is almost
impossible to do outstanding research, though one can do acceptable
research, without being something of a methodologist" (p. 622).
Kerlinger and Pedhazur (1973) went so far as to say that the researcher
who lacks a basic knowledge of data analytic strategies is a
"scientific cripple" (p. 369).
Qualitative researchers may argue that the preceding view
is narrow-minded. In my view, statistics is a necessary, but not a
sufficient, condition for good research.
What shall we do?
To overcome the flood of poor research, students in my classes are asked to do the following:
- To know the difference between applying research findings and producing research results.
- To develop a sense of accountability to your fellow citizens.
- To understand that proper means are essential to reach the end. There is no short-cut.
- To learn the traditional Fisherian methodology as well as
alternative techniques such as resampling, Bayesian inference,
exploratory data analysis, data visualization, and data mining. In the
future, learners should understand the merits and shortcomings of each
school and apply the appropriate technique to different research
problems and data structures.
References
- Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of PhD programs in North America. American Psychologist, 45, 721-734.
- Aiken, L., West, S., & Millsap, R. (2008). Doctoral
training in statistics, measurement, and methodology in psychology:
Replication and extension of Aiken, West, Sechrest, and Reno's (1990)
survey of PhD programs in North America. American Psychologist, 63, 32-50.
- Behrens, J. T. (1996). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131-160.
- Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.,... Johnson, V. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6-10. http://doi.org/10.17605/OSF.IO/MKY9J
- Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
- Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
- Cumming, G. (2013). The new statistics: Why and how.
Psychological Science. doi:10.1177/0956797613504966. Retrieved from
http://pss.sagepub.com/content/early/2013/11/07/0956797613504966.
- Daniel, L. G. (1997). Kerlinger's research myths: An overview with implications for educational researchers. Journal of Experimental Education, 65(2), 101-112.
- Ffoulkes, P. (2017). insideBIGDATA: Guide to use of big data on an industrial scale. Retrieved from https://insidebigdata.com/white-paper/guide-big-data-industrial-scale/
- Fisher, R. A. (1932). Statistical methods for research workers (4th ed.). Edinburgh, Scotland: Oliver & Boyd.
- Fisher, R. A. (1935). The design of experiments. Edinburgh, Scotland: Oliver & Boyd.
- Fisher, R. A. (1956). Statistical methods and scientific inference. New York, NY: Hafner.
- Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research, 42, 237-288.
- Gore, S. M., Jones, I. G., & Rytter, E. F. (1977).
Misuse of statistical methods: A critical assessment of articles in BMJ
from January to March, 1976. British Medical Journal, 1, 85-87.
- Ioannidis, J. P. (2018). The proposal to lower p value thresholds to .005. JAMA. doi:10.1001/jama.2018.1536. Retrieved from https://jamanetwork.com/journals/jama/fullarticle/2676503
- Kerlinger, F. N. (1969). Research in education. In R. Ebel, V. Noll, & R. Bauer (Eds.), Encyclopedia of educational research (4th ed., pp. 1127-1134). New York, NY: Macmillan.
- Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). Fort Worth, TX: Holt, Rinehart and Winston.
- Kerlinger, F. N., & Pedhazur, E. J. (1973). Multiple regression in behavioral research. New York: Holt, Rinehart and Winston.
- Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88, 1242-1249.
- Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161-170.
- Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Belmont, CA: Wadsworth.
- Meline, T., & Wang, B. (2004). Effect-size reporting practices in AJSLP and other ASHA journals, 1999-2003. American Journal of Speech-Language Pathology, 13, 202-207.
- McClure, J., & Suen, H. K. (1994). Interpretation of statistical significance testing: A matter of perspective. Topics in Early Childhood Special Education, 14, 88-102.
- Morrison, D. E., & Henkel, R. E. (1970). The significance test controversy--A reader. Chicago, IL: Aldine.
- Nuzzo, R. (2014, February 12). Scientific method: Statistical errors. Nature. Retrieved from http://www.nature.com/news/scientific-method-statistical-errors-1.14700
- Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
- Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.
- Schield, M. (1998). Statistical literacy and evidential statistics. Paper presented at the Annual Meeting of the American Statistical Association. Dallas, TX. Retrieved from http://www.augsburg.edu/ppages/schield/.
- Schmidt, F. L. (1996). Statistical significance testing
and cumulative knowledge in psychology: Implications for training of
researchers. Psychological Methods, 2, 115-129.
- Schmidt, F., & Hunter, J. (2014). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.
- Siegfried, T. (2015, July 2). Science is heroic, with a tragic (statistical) flaw. Science News. Retrieved from https://www.sciencenews.org/blog/context/science-heroic-tragic-statistical-flaw
- Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 350-360.
- Task Force on Statistical Inference. (1996). Initial
report: Task force on statistical Inference. Retrieved
from http://www.apa.org/science/bsaweb-tfsi.html.
- Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361-377.
- Thompson, B. (1994, April). Common methodology mistakes in dissertations, revisited.
Paper presented at the annual meeting of the American Educational
Research Association, New Orleans. (ERIC Document Reproduction Service
No. ED 368 771).
- Tukey, J. W. (1986). The collected works of John W. Tukey, Volume III: Philosophy and principles of data analysis: 1965–1986. L. V. Jones (Ed.). Pacific Grove, CA: Wadsworth.
- Wainer, H. (1989). Eelworms, bullet holes, and Geraldine Ferraro: Some problems with statistical adjustment and some
solutions. Journal of Educational Statistics, 14, 121-140.
- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129-133. DOI: 10.1080/00031305.2016.1154108
- Wilkinson, L., & Task Force on Statistical Inference.
(1999). Statistical methods in psychology journals: Guidelines and
explanations. American Psychologist, 54, 594-604.