Beyond Fisher

Chong-ho Yu, Ph.Ds.

Last Updated: 2021

Microsoft and R. A. Fisher

Due to the dominance of Microsoft Windows and Intel processors in the computer industry, for most users computing is synonymous with "Wintel." The same pattern can be found in the field of research methodology. Today, when most researchers talk about statistical analysis, they usually mean hypothesis testing, which is a fusion of two schools of thought: Fisher and Neyman-Pearson (Lehmann, 1993). Between the two schools, Fisher (1932, 1935) is arguably the dominant one. The Fisherian legacy is also known as statistical significance testing, null hypothesis significance testing, hypothesis testing, traditional inferential statistics, classical procedures, and parametric tests.

In the future, Microsoft may announce: "Today I announce that Microsoft has taken over SAS Institute, IBM SPSS Inc, Stata, SyStat...etc. From now on only hypothesis testing is available in all statistical software packages. Resistance is futile."

Limitations of the Fisherian approach

Just as Windows is not suitable for every application, hypothesis testing has its limitations. There are at least twelve limitations of the Fisherian approach:

Does not address treatment effectiveness or efficacy: Contrary to popular belief, statistical hypothesis testing does not address the issue of treatment effectiveness (Wasserstein & Lazar, 2016). Very often researchers interpret a significant p-value as evidence of treatment effectiveness or efficacy. As a matter of fact, hypothesis testing does not evaluate how likely the hypothesis is to be right given the data, P(H|D). On the contrary, it assumes that the null hypothesis is right and examines how likely the data are to occur in the long run given that the null is true, P(D|H) (McClure & Suen, 1994; Cohen, 1994; Loftus, 1996). In order to report the magnitude of the treatment effect, the researcher must report the effect size.
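
The gap between P(D|H) and P(H|D) can be made concrete with Bayes' rule. The sketch below is only an illustration: it treats D as the event "the test came out significant," and the prior probability of the null (0.9) and the power (0.5) are hypothetical numbers chosen for the example, not values taken from the cited authors.

```python
def posterior_null(prior_null, p_data_given_null, p_data_given_alt):
    """Bayes' rule: P(H0 | D) from P(D | H0), P(D | H1), and the prior P(H0)."""
    prior_alt = 1.0 - prior_null
    # Total probability of observing D under either hypothesis.
    evidence = prior_null * p_data_given_null + prior_alt * p_data_given_alt
    return prior_null * p_data_given_null / evidence


# Hypothetical inputs (assumptions for illustration only):
# D = "the test is significant at alpha = .05"
p_significant_given_null = 0.05   # P(D | H0), the alpha level
p_significant_given_alt = 0.50    # P(D | H1), the power of the study
prior_null = 0.90                 # assumed prior belief that the null is true

p_null_given_significant = posterior_null(
    prior_null, p_significant_given_null, p_significant_given_alt
)
print(f"P(D | H0) = {p_significant_given_null:.2f}")
print(f"P(H0 | D) = {p_null_given_significant:.2f}")
```

Under these assumed inputs the two probabilities are far apart, which is why a small P(D|H) cannot be read as a small P(H|D).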

Easy to reject the null hypothesis: By definition, a null hypothesis denotes no difference (zero effect). Loftus (1996) quipped that "rejecting a typical null hypothesis is like rejecting the proposition that the moon is made of green cheese. The appropriate response would be 'Well, yes, okay ... but so what?'" Indeed it is very common for experimenters to compare a control group (do nothing) to a treatment group. The winner is the treatment group, of course. For more information, please read the section "maximize variance control" of the webpage Experimental design as variance control.

Mismatch of null and alternate hypotheses: Statistical hypothesis testing is a fusion of two schools: Fisher and Neyman/Pearson. In the former the null hypothesis is proposed and the alpha level is associated with the null. In the latter the alternate hypothesis is proposed; power and beta are associated with the alternate. Very few people notice that when Neyman and Pearson developed their methodology, they left out the p value. In short, the two theories are inconsistent (Siegfried, 2015).

Problem in reproducibility: The results yielded from hypothesis testing vary from study to study, and scientists are concerned about replication problems. A common misinterpretation of significance is that the p value tells you how "right" the conclusion is: if the p value is .01, then there is only a 1% chance that the conclusion is wrong. Actually, a p value of .01 corresponds to a false-alarm probability of at least 11%, and a p value of .05 to at least 29%, depending on how plausible the effect was in the first place. Given that the result of one particular study could be wrong, it is not surprising that researchers are often unable to replicate the original result (Nuzzo, 2014).
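
The source does not say how the 11% and 29% figures were derived. One calibration that happens to reproduce them is the lower bound on the Bayes factor proposed by Sellke, Bayarri, and Berger (2001), -e * p * ln(p), combined with 50:50 prior odds that the null is true. The sketch below assumes that calibration and that prior; both are assumptions made here for illustration.

```python
import math


def min_false_alarm_probability(p_value, prior_prob_null=0.5):
    """Lower bound on P(H0 | data) using the -e * p * ln(p) Bayes factor bound.

    The bound is only defined for p values below 1/e.
    """
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("p_value must lie strictly between 0 and 1/e")
    # Smallest Bayes factor in favor of the null that the p value allows.
    bayes_factor_bound = -math.e * p_value * math.log(p_value)
    prior_odds_null = prior_prob_null / (1.0 - prior_prob_null)
    posterior_odds_null = prior_odds_null * bayes_factor_bound
    return posterior_odds_null / (1.0 + posterior_odds_null)


for p in (0.05, 0.01):
    bound = min_false_alarm_probability(p)
    print(f"p = {p:.2f} -> false-alarm probability of at least {bound:.0%}")
```

A more skeptical prior (a larger `prior_prob_null`) pushes the bound higher, which is the sense in which the false-alarm probability depends on how plausible the effect was before the data were collected.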

Count on parametric assumptions: Many significance tests are parametric tests, which rely on parametric assumptions. However, very often the data obtained by the analyst violate these assumptions. In theory there are many countermeasures against such violations, such as employing non-parametric tests, robust procedures, and data transformation. In practice, however, many researchers do not check the assumptions before running parametric tests, resulting in questionable conclusions.

Probability as a relative frequency in the long run: Hypothesis testing is based upon probability, which is defined as a relative frequency in the long run. However, it is problematic to apply this frequentist view of probability to a single event or a new event (Carver, 1978). For example, what is the probability that the universe was formed by a big bang? What is the probability that a newly invented super-Java enabled Web-based instruction is effective in teaching computer science?

Theoretical distribution: The finding resulting from hypothesis testing provides no information about the form of the underlying pattern of the population. Very often researchers conduct hypothesis testing either with the assumption of a particular population distribution (normality) or without asking this question at all. Indeed, in the parametric test framework the population, to which inferences are made, is infinite and unknown. Under this direction, what researchers do is make inferences from the population to the sample, but not from the sample to the population. This issue has been discussed in the article regarding misconceived relationships among sample, sampling distributions and population. Basically, in hypothesis testing the test statistic is compared against a “known” distribution. However, this weakly substantiated assumption (knowing the underlying distribution) results in a false sense of certainty. In contrast to Fisher, Neyman, and Pearson, the founder of exploratory data analysis, John Tukey (1986), questioned our reliance on supposedly known distributions; he asserted that progress in statistics can only be made when we move away from certainty, and that we need to give up the idea that a single body of data corresponds to a unique appropriate analysis.

Point estimate: Unlike a confidence interval (CI), which indicates a possible range of the population parameter, hypothesis testing yields a point estimate. It is based on the conviction that in the population there is one and only one fixed constant that can represent the true parameter. However, this belief is challenged by Bayesians because in reality the population body is ever changing and thus there is no such thing as a fixed parameter.
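
To show what reporting a range rather than a single number looks like, here is a minimal sketch of a confidence interval for a mean. It uses the normal approximation (a z interval) from Python's standard library; for a sample this small a t interval would be more appropriate, and the scores are hypothetical data invented for the example.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    center = mean(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error


# Hypothetical test scores (illustrative data only).
scores = [72, 85, 78, 90, 66, 81, 77, 88, 70, 83, 79, 74]

low, high = mean_confidence_interval(scores)
print(f"point estimate: {mean(scores):.1f}")
print(f"95% CI: ({low:.1f}, {high:.1f})")
```

The interval makes the uncertainty around the estimate visible, which a bare point estimate or a bare p value does not.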

Yield a dichotomous answer: The conclusion yielded from a significance test is dichotomous: either the effect is significant or not. It misleads information consumers and even researchers into seeing the world as black or white (Cumming, 2013). Cumming (2013) suspected that this "fatally-flawed" approach is widely embraced because the simplified decision-making process has a "seductive appeal--the apparent but illusory certainty" (p.5).

Arbitrary cutoff: The dichotomous decision is based on conventional alpha levels, such as .05. However, there is nothing "magical" about .05. Actually, using a fixed cutoff, as commonly practiced by many researchers, goes against Fisher's advice. Fisher (1956) stated, "No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas" (p.41). Rosnow and Rosenthal (1989) stated their objection to blindly using .05 in a humorous way: "God loves the .06 nearly as much as the .05" (p.1277). Unfortunately, today hypothesis testing has become mechanical rather than a matter of judgment. As a remedy, a partnership of 72 methodologists proposed lowering the commonly used alpha level from .05 to .005 for new discoveries (Benjamin et al., 2018). However, Ioannidis (2018) argued that this approach is only a temporizing fix. Echoing Ioannidis's view, it is the conviction of the author that as long as a fixed cutoff is adopted, we will continue facing a similar question: God loves the .006 nearly as much as the .005.

Unknown error and circularity: Hypothesis testing is said to balance the Type I and Type II error rates, but in reality this is untrue. The test starts from the premise that the null hypothesis is true, and thus the output is the p value indicating the probability of observing the test statistic given that the null is true. If the null is really true, the Type I error rate is at most 5%. However, if the null is false, the Type II error rate could be as high as 95%! Schmidt and Hunter (2014) argued that there is a fundamental circularity in hypothesis testing. If the researcher does not know whether the null is true or not, then he cannot tell whether the error is tied to Type I or Type II. But if he knows that the null is true, then there is no need to perform the test. Schmidt and Hunter suggested that there is only one way to rectify this situation: give up hypothesis testing and report the confidence interval.
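
One way to read the 95% figure is that, as the true effect shrinks toward zero, the power of a test at alpha = .05 approaches .05, so the Type II error rate approaches .95. The sketch below illustrates this with the textbook power formula for a two-sided, two-sample z test with known variance. The choice of test, the effect sizes, and the sample sizes are assumptions made here for illustration and are not taken from Schmidt and Hunter.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def power_two_sample_z(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z test for a standardized mean difference."""
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true effect is `effect_size`.
    shift = effect_size * math.sqrt(n_per_group / 2)
    upper_tail = 1 - STANDARD_NORMAL.cdf(z_crit - shift)
    lower_tail = STANDARD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def type_ii_error_rate(effect_size, n_per_group, alpha=0.05):
    """Probability of failing to reject the null when the effect is real."""
    return 1 - power_two_sample_z(effect_size, n_per_group, alpha)


# Hypothetical scenarios: 20 participants per group, shrinking true effects.
for d in (0.8, 0.5, 0.2, 0.1):
    beta = type_ii_error_rate(d, n_per_group=20)
    print(f"true effect d = {d:.1f}: Type II error rate = {beta:.2f}")
```

The nominal 5% Type I rate is fixed by the analyst, whereas the Type II rate depends on an effect size and a sample size that the test itself never reports.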

Incapable of performing big data analytics: Experts on data science predict that the size of digital data will double every two years; this indicates a 50-fold growth from 2010 to 2020. As a matter of fact, human- and machine-generated data are increasing ten times faster than traditional data (Ffoulkes, 2017). However, traditional statistical procedures were designed for small-sample studies only. If the n is too large, virtually any trivial effect will be flagged as statistically significant.
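
The sample-size problem is easy to demonstrate. The sketch below holds a trivial standardized mean difference fixed at 0.01 and computes the two-sided p value of a two-sample z test as the group size grows. The effect size, the sample sizes, and the simplification that the observed difference equals 0.01 exactly are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def two_sided_p_value(standardized_difference, n_per_group):
    """Two-sided p value of a two-sample z test with equal group sizes."""
    z = standardized_difference * math.sqrt(n_per_group / 2)
    # Use the lower tail to avoid subtracting two numbers that are both close to 1.
    return 2 * NormalDist().cdf(-abs(z))


TRIVIAL_EFFECT = 0.01  # one hundredth of a standard deviation (hypothetical)

for n in (100, 10_000, 100_000, 1_000_000):
    p = two_sided_p_value(TRIVIAL_EFFECT, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>9,}: p = {p:.3g} ({verdict})")
```

The effect is identical in every row; only n changes, yet the verdict flips from "not significant" to "significant" once the sample is large enough.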

Alternatives

Methodologists have been questioning over-dependence on and misuse of the Fisherian statistical approach for many years. In the summer of 1993, the Journal of Experimental Education devoted an entire issue to the theme "Statistical significance testing in contemporary practice: Some proposed alternatives." Several options such as effect size, cross-validation, and resampling were proposed (e.g. Thompson, 1993; Snyder & Lawson, 1993). In 1996 a task force formed by the Board of Scientific Affairs of the American Psychological Association (APA) also suggested that researchers should apply a wide variety of statistical techniques, such as Exploratory Data Analysis and Bayesian inference. In response to frequent occurrences of inappropriate use of hypothesis testing, in 1999 Dr. Leland Wilkinson led the Task Force on Statistical Inference formed by the same board of the APA to address the controversy. In conclusion, the Task Force did not recommend abandoning hypothesis testing, but suggested using it with caution; also, researchers were urged to use more options, such as confidence intervals and effect sizes.
After the release of the APA Task Force report, use of effect-size in American Speech-Language-Hearing Association journals from 1999-2003 was reviewed (Meline & Wang, 2004). It was found that reporting of effect size in quantitative studies increased from 5 reports in 1990 to 1994 to 120 reports in 1999 to 2003. However, effect size was reported less than 30% of the time when inferential statistics were used, and only half of those reports included an interpretation of effect size.
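
Two of the alternatives named above, effect size and resampling, can be combined in a few lines. The sketch below computes Cohen's d with a pooled standard deviation and attaches a percentile bootstrap interval to it. The data, the number of resamples, and the fixed seed are hypothetical choices for illustration; this is not the specific procedure recommended by Thompson (1993) or Snyder and Lawson (1993).

```python
import math
import random
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)


def bootstrap_ci(group_a, group_b, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for Cohen's d."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        estimates.append(cohens_d(resample_a, resample_b))
    estimates.sort()
    tail = (1 - confidence) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcome scores (illustrative data only).
treatment = [78, 85, 90, 72, 88, 95, 81, 79, 92, 86]
control = [75, 70, 82, 68, 77, 85, 73, 71, 80, 74]

d = cohens_d(treatment, control)
ci_low, ci_high = bootstrap_ci(treatment, control)
print(f"Cohen's d = {d:.2f}, bootstrap 95% interval = ({ci_low:.2f}, {ci_high:.2f})")
```

Reporting the effect size with an interval tells the reader how large the difference is and how precisely it was estimated, which a p value alone does not.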

In 2016 the American Statistical Association (ASA) issued a statement warning against the widespread misuse of p values. The ASA highlights six main points (Wasserstein & Lazar, 2016).

“P-values can indicate how incompatible the data are with a specified statistical model” (p.131).

“P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone” (p.131).

“Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold” (p.131).

“Proper inference requires full reporting and transparency” (p.131).

“A p-value, or statistical significance, does not measure the size of an effect or the importance of a result” (p.132).

“By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis” (p.132).

In conclusion, the ASA recommends that, rather than putting all the weight of evidence on a single index, researchers should also take good study design, data visualization, proper understanding, and contextual interpretation into account while conducting research.

Narrowness of graduate programs

Many researchers tend to follow the convention rather than experimenting with alternate strategies. This conservative behavior may be owing to the lack of comprehensive training in graduate schools. Aiken, West, Sechrest, and Reno (1990) published an article surveying the curriculum of quantitative methods in graduate psychology programs. It was found that new and important research methodologies such as structural equation modeling, confirmatory factor analysis, exploratory data analysis, and meta-analysis were not taught in the majority of those programs.
Behrens (1996) held a similar view. To rectify the situation that graduate programs overly stress hypothesis testing, Behrens suggested that graduate programs should integrate instruction in confirmatory statistics with alternative data analytic methods such as meta-analysis, Bayesian analysis, interval estimation approaches, and hybrid combinations.
Schield (1998) also argued that traditional statistical training "covers only half the topic needed for statistical literacy. In addition to descriptive statistics and inferential statistics, statistical literacy should include Bayesian statistics and most of all-evidential statistics" (p.2).
Aiken, West, and Millsap's (2008) study, which is a replication and extension of Aiken et al.'s (1990) study, examined whether innovations in quantitative methodology have diffused into the training of PhDs in psychology. On one hand, exciting advancements had happened in the domains of statistical analysis (e.g. multi-level modeling), measurement (e.g. item response theory), and research design (e.g. propensity scores in observational studies), but on the other hand, many psychology programs still maintain the traditional curriculum. For example, slightly fewer than half of all departments that responded to the survey offered a full course on structural equation modeling. Coverage of specialized statistical methods, such as multilevel modeling, was even sparser. In all Ph.D. programs in psychology, the measurement requirement occupies a median of only 4.5 weeks. Even worse, the research design curriculum has largely stagnated.

Up-hill battle

Indeed, the crisis lies not only in the lack of knowledge of alternate approaches, but also in the poverty of conventional research skills. Inappropriate use of statistics among researchers across different disciplines has been well documented (Carver, 1978; Gore et al., 1977; Gibbons & Freund, 1986; Glass, Peckham, & Sanders, 1972; Maxwell & Delaney, 1990; Morrison & Henkel, 1970; Pedhazur & Schmelkin, 1991; Thompson, 1994; Wainer, 1989). The purpose of alternate methodologies is to compensate for the limitations of the traditional approach. However, when researchers do not understand the assumptions and limitations of the Fisherian school, how could they look for other proper tools to remediate the problem?
Even if alternate methods were widely adopted, it is likely that the implementation of alternate methods would be as careless as in the conventional approach. There is no guarantee that the quality of research would be improved by introducing new methods alone. If the attitude toward research methodology does not improve, I am afraid that papers like "Common methodology mistakes in exploratory data analysis," "A critical assessment to misinterpretation of resampling," and "The case against Bayesian inference" would be as popular as papers against statistical testing today.
Many people are aware of the narrowness of statistical analysis and attempt to counter it. But it is an up-hill battle. For example, at Michigan State University two professors once reformed the graduate statistics courses by introducing alternate statistics such as confidence intervals and effect size. The reform, however, was protested by other faculty members because they worried that graduate students who received this unorthodox training might not be able to get their research published (Schmidt, 1996). The norm set by refereed journals is one of many reasons why many researchers, faculty, and graduate students blindly follow statistical testing. Similar objections are also found in other institutions: "We must continue to teach hypothesis testing because 90% of the textbooks and journals are still using it." I have heard many other reasons:

Tool mastering
"Educational and psychological researchers are not mathematics majors. We don't have to know the detail of statistics." Medical students are not chemistry and biology majors. Do they need deep knowledge of chemistry and biology? Chemistry and biology for physicians and statistics for social scientists are the means rather than the end. But without the proper tool, it is impossible for them to do a good job. The requirement of "tool mastering" is very common in the humanities. For instance, a Sinologist once told me that to study the history of the Yuan dynasty of China, one must achieve a high proficiency in the Chinese, Mongolian, and Persian languages in order to read first-hand documents. A philosopher specializing in Buddhism also needs to learn the Chinese, Japanese, Hindu, as well as several other languages before conducting any meaningful research. Unfortunately, the concept of "tool mastering" is absent from the mentality of many social scientists.

Driving a car and building a car: Applications and research
This argument is very similar to the preceding one: "Psychologists and educators should focus on the subject matter of their own domain rather than spending time on irrelevant matters. We are concerned with applications rather than theoretical abstraction." What does it take to drive a car? Very simple. Just a few days of training is sufficient to make a qualified driver. But what does it take to build a car? Several years of training in mechanical and electrical engineering is the minimum. When an automobile engineer tells consumers that the vehicle he designed is safe at all speeds, he is responsible for other people's lives! By the same token, when a psychological researcher instructs practitioners and policy-makers to follow his theory, he is accountable, too. He must be sure that the application is firmly founded on a sound theoretical framework. When psychologists claim that children raised in non-traditional families do not have a higher probability of developing maladjusted behaviors, or that legalizing drugs does not lead to long-term harmful effects on society, I wonder whether these conclusions result from appropriate implementation of research methodologies.
By synthesizing the views of Kerlinger and Pedhazur, Daniel (1997) debunked the myth of insignificance of statistical knowledge in research:

Because all statistical methods have certain inherent strengths and limitations and because each method implies certain assumptions about the data being analyzed, the use of these methods to some degree influences both the nature and selection of research problems (Kerlinger, 1969, 1986; Kerlinger & Pedhazur, 1973). Therefore, the claim that statistical knowledge is unnecessary to good research practice is unfounded; in fact, Kerlinger noted that "it is almost impossible to do outstanding research, though one can do acceptable research, without being something of a methodologist" (p.622). Kerlinger and Pedhazur (1973) went so far as to say that the researcher who lacks a basic knowledge of data analytic strategies is a "scientific cripple" (p.369).
Qualitative researchers may argue that the preceding view is narrow-minded. In my view, statistics is a necessary, but not a sufficient condition of good research.

What shall we do?

To stem the flood of poor research, students in my class are asked to do the following:

To know the difference between applying research findings and producing research results.
To develop a sense of accountability to your fellow citizens.
To understand that proper means are essential to reach the end. There is no short-cut.
To learn the traditional Fisherian methodology as well as alternative techniques such as resampling, Bayesian inference, exploratory data analysis, data visualization, and data mining. In the future, learners should understand merits and shortcomings of each school and apply the appropriate technique according to different research problems and data structures.

References

Aiken, L. S., West, S. G., Sechrest, L., & Reno, P. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of Ph.D. programs in North America. American Psychologist, 45, 721-734.
Aiken, L., West, S., & Millsap, R. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno's (1990) survey of PhD programs in North America. American Psychologist, 63, 32-50.
Behrens, J. T. (1996). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131-160.
Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.,... Johnson, V. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6-10. http://doi.org/10.17605/OSF.IO/MKY9J
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Cohen, J. (1994). The earth is round (P < .05). American Psychologist, 49, 997-1003.
Cumming, G. (2013). The new statistics: Why and how. Psychological Science. doi:10.1177/0956797613504966. Retrieved from http://pss.sagepub.com/content/early/2013/11/07/0956797613504966.
Daniel, K. G. (1997). Kerlinger's research myths: An overview with implications for educational researchers. Journal of Experimental Education, 65(2), 101-112.
Ffoulkes, P. (2017). insideBIGDATA: Guide to use of big data on an industrial scale. Retrieved from https://insidebigdata.com/white-paper/guide-big-data-industrial-scale/
Fisher, R. A. (1932). Statistical methods for research workers (4th ed.). Edinburgh, Scotland: Oliver & Boyd.
Fisher, R. A. (1935). The design of experiments. Edinburgh, Scotland: Oliver & Boyd.
Fisher, R. A. (1956). Statistical methods and scientific inference. New York, NY: Hafner.
Glass, G. V, Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed analysis of variance and covariance. Review of Educational Research, 42, 237-288.
Gore, S. M., Jones, I. G., & Rytter, E. F. (1977). Misuse of statistical methods: A critical assessment of articles in BMJ from January to March, 1976. British Medical Journal, 1, 85-87.
Ioannidis, J. P. (2018). The proposal to lower p value thresholds to .005. Journal of the American Medical Association. doi:10.1001/jama.2018.1536. Retrieved from https://jamanetwork.com/journals/jama/fullarticle/2676503
Kerlinger, F. N. (1969). Research in education. In R. Ebel, V. Noll, & R. Bauer (Eds.), Encyclopedia of educational research (4th ed., pp. 1127-1134). New York, NY: Macmillan.
Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). Fort Worth, TX: Holt, Rinehart and Winston.
Kerlinger, F. N., & Pedhazur, E. J. (1973). Multiple regression in behavioral research. New York: Holt, Rinehart and Winston.
Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88, 1242-1249.
Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161-170.
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Belmont, CA: Wadsworth Publishing Company.
Meline, T., & Wang, B. (2004). Effect-size reporting practices in AJSLP and other ASHA journals, 1999-2003. American Journal of Speech-Language Pathology, 13, 202-207.
McClure, J., & Suen, H. K. (1994). Interpretation of statistical significance testing: A matter of perspective. Topics in Early Childhood Special Education, 14, 88-102.
Morrison, D. E., & Henkel, R. E. (1970). The significance test controversy--A reader. Chicago, IL: Aldine.
Nuzzo, R. (2014, February 12). Scientific method: Statistical errors. Nature. Retrieved from http://www.nature.com/news/scientific-method-statistical-errors-1.14700
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.
Schield, M. (1998). Statistical literacy and evidential statistics. Paper presented at the Annual Meeting of the American Statistical Association. Dallas, TX. Retrieved from http://www.augsburg.edu/ppages/schield/.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.
Schmidt, F., & Hunter, J. (2014). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.
Snyder, P. & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 350-360.
Siegfried, T. (2015, July 2). Science is heroic, with a tragic (statistical) flaw. Science News. Retrieved from https://www.sciencenews.org/blog/context/science-heroic-tragic-statistical-flaw
Task Force on Statistical Inference. (1996). Initial report: Task force on statistical Inference. Retrieved from http://www.apa.org/science/bsaweb-tfsi.html.
Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361-377.
Thompson, B. (1994, April). Common methodology mistakes in dissertations, revisited. Paper presented at the annual meeting of the American Educational Research Association, New Orleans. (ERIC Document Reproduction Service No. ED 368 771).
Tukey, J. W. (1986). The collected works of John W. Tukey, Volume III: Philosophy and principles of data analysis: 1965–1986. L. V. Jones (Ed.). Pacific Grove, CA: Wadsworth.
Wainer, H. (1989). Eelworms, bullet holes, and Geraldine Ferraro: Some problems with statistical adjustment and some solutions. Journal of Educational Statistics, 14, 121-140.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. American Statistician, 70(2), 129-133. doi:10.1080/00031305.2016.1154108
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
