Meta-analysis and effect size
Power analysis and meta-analysis
The previous lesson mentioned that power is a function of effect size, alpha level, and sample size. What is effect size? How is it related to power analysis? How can we determine effect size? Answers to these questions are provided below. It is important to point out that although power analysis requires the effect size yielded from meta-analysis, meta-analysis does not rely on power analysis. Meta-analysis can be a standalone method in its own right.
"Meta" is a Greek prefix meaning "after" or "beyond." Meta-analysis (MA) is a secondary analysis conducted after other researchers have done their own analyses, so that the meta-analyst can go beyond what has been accomplished in the past. Simply put, MA is analysis of analyses. Even if you have never done MA before, you have already done quasi-meta-analysis, namely, literature review, which is typically presented in the following manner:
- According to So and So (2009), Program X can effectively reduce the risk of colon cancer.
- Smith and Smith (2010) found that patients were unresponsive to Treatment X. It is a waste of your money.
- Prior research found that the therapeutic effect of Program X is minimal (Doe and Doe, 2014).
What can be done with the preceding heterogeneous or even contradictory findings? The answer is: One must go beyond literature review by looking into MA and effect size.
What is effect size?
Before discussing the effect size, I would like to introduce a broader concept: comparison in terms of a standard.
Many statistical formulas seem difficult to follow. In fact, many of them
are nothing more than standardized comparisons. Take comparing wealth
as a metaphor. How could we compare the net assets of American IBM
corporation and Japanese Sony corporation? The simplest way is to
compare them in US dollars, the standard currency for international
trade. By the same token, a t-test is a mean comparison in terms of the
standard deviation. Many statistics follow this thread of logic.
Effect size can be conceptualized as a standardized difference.
In its simplest form, effect size, which is denoted by the symbol "d",
is the mean difference between groups in standard score form, i.e., the
ratio of the difference between the means to the standard deviation.
This concept is derived from a school of methodology named Meta-analysis, which was developed by Glass (1976).
P value: Not zero!
Conventionally, researchers draw a conclusion based on the p value alone. If the p value is less than .05 or .01, the effect, difference, or relationship is believed to be significant. The p value, by definition, is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis (no effect, no difference, or no relationship) is true. However, is it sufficient? Consider the following scenarios:
- When the US President asks a four-star general to estimate the potential casualties of a possible war, he replies, "Not zero. Some soldiers will die."
- Your boss wants to give you a raise and asks you how much you expect. You answer, "Not zero. I deserve more compensation."
- After you have gone through a rigorous fitness program, your wife asks you how much weight you lost. You answer, "Not zero. I lost a few pounds."
Do you think the above are satisfactory answers? Of course not. However, this type of answer is exactly what we get from conventional hypothesis testing. When the null hypothesis is rejected, at most the conclusion is: "The null hypothesis is false! Not null! The effect, difference, or correlation is not zero." The p value tells you how unlikely it is that the statistic would be observed in the long run by chance alone; it says nothing about the degree of treatment effectiveness, the magnitude of the association, or the distance of the performance gap. The following figure shows that the difference, indicated in red, can be anything.
Researchers should be concerned not only with whether a null hypothesis
is false, but also with how false it is. In other words, if the
difference is not zero, how large a difference should one expect? By
specifying an effect size, which is the minimum difference that is
worth research attention, researchers can design a study with optimal
power rather than wasting resources on trivial effects. Holding the alpha
level and sample size constant, the larger the effect size (the difference
between the null and alternative means) is, the greater the power of a test is.
Ideally, power analysis employs the population effect size. However, in practice the effect size must be estimated from sample data.
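The relationship between effect size and power can be illustrated with a short calculation. The following is a minimal sketch rather than a full power analysis: it assumes a two-sided, two-sample z-test with equal group sizes (a normal approximation), and the function name `approx_power` and the sample size of 50 per group are chosen only for illustration.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d is the standardized mean difference; equal group sizes are assumed.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)       # critical value under the null
    shift = d * (n_per_group / 2) ** 0.5    # expected value of the test statistic
    # Probability of landing in either rejection region under the alternative
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# With alpha and sample size held constant, power rises with effect size.
for d in (0.2, 0.4, 0.6):
    print(f"d = {d:.1f}: approximate power = {approx_power(d, n_per_group=50):.2f}")
```

When d is zero, the function returns the alpha level itself, which is the rejection rate expected when the null hypothesis is true.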
How can we determine effect size?
Gene Glass's approach
There are several ways to calculate effect size. The three most popular approaches are Gene Glass's approach, Hunter-Schmidt's approach, and Cohen's d. The basic formula of Glass's approach is:
$$\text{Effect size} = \frac{\text{Mean of control group} - \text{Mean of treatment group}}{\text{Standard deviation of the control group}}$$
The control group's standard deviation is used because it is not affected by the treatment (Glass, McGraw, & Smith, 1981).
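As a rough illustration of Glass's formula, the sketch below computes the effect size from raw scores. The scores are hypothetical, the function name `glass_effect_size` is made up for this example, and the subtraction follows the order given in the formula above (control minus treatment).

```python
from statistics import mean, stdev

def glass_effect_size(control, treatment):
    """Mean difference (control minus treatment) divided by the control group's SD."""
    return (mean(control) - mean(treatment)) / stdev(control)

# Hypothetical scores, for illustration only
control = [10, 12, 14, 16, 18]
treatment = [7, 9, 11, 13, 15]
print(f"Glass's effect size = {glass_effect_size(control, treatment):.2f}")
```

Only the control group's standard deviation enters the denominator, so any change in variability caused by the treatment does not affect the standardization.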
Hunter-Schmidt's approach
Hunter and Schmidt (1990) suggested using a pooled within-group standard deviation because it has less sampling error than the control group standard deviation under the condition of equal sample size. In addition, Hunter and Schmidt corrected the effect size for measurement error by dividing the effect size by the square root of the reliability coefficient (r) of the dependent variable:
$$\text{Measurement error correction} = \frac{\text{Effect size}}{\sqrt{r}}$$
Concepts of measurement error and reliability coefficient will be discussed in the section "Measurement."
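The two ideas in Hunter and Schmidt's approach, the pooled within-group standard deviation and the correction for measurement error, can be sketched as follows. This is a simplified illustration and not their full procedure: the data and the reliability value of .81 are hypothetical, the function names are invented, and the sketch assumes the difference is taken as treatment minus control.

```python
from math import sqrt
from statistics import mean, variance

def pooled_sd(group_a, group_b):
    """Pooled within-group standard deviation, weighted by degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return sqrt(pooled_var)

def corrected_effect_size(treatment, control, reliability):
    """Effect size based on the pooled SD, divided by the square root of the reliability."""
    d = (mean(treatment) - mean(control)) / pooled_sd(treatment, control)
    return d / sqrt(reliability)

# Hypothetical scores and reliability coefficient, for illustration only
control = [10, 12, 14, 16, 18]
treatment = [13, 15, 17, 19, 21]
print(f"Corrected effect size = {corrected_effect_size(treatment, control, reliability=0.81):.2f}")
```

Because the reliability coefficient is at most 1, the correction can only leave the effect size unchanged or make it larger in absolute value.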
When there are two independent groups (e.g. control and treatment), Cohen's d
can be obtained by the following formula:
$$d = \frac{\text{Treatment group mean} - \text{Control group mean}}{\sqrt{(\text{Treatment group variance} + \text{Control group variance})/2}}$$
When a study reports a Chi-square test result with one degree of freedom (n=2),
the following formula can be employed to approximate Cohen's d:
$$|d| = 2\sqrt{\frac{\chi^2}{N - \chi^2}}$$
where N is the total sample size.
When a study reports a hit rate (percentage of success after taking the
treatment or no treatment), the following formula can be used:
$$d = \operatorname{arcsine}(p_1) - \operatorname{arcsine}(p_2)$$
where p1 and p2 are the hit rates of the two groups (e.g. control and treatment)
(Poston & Hanson, 2010).
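The three conversions described for Cohen's d can be sketched in a few lines. The function names are invented for this example, and two assumptions are made that the text does not spell out: the hit-rate formula is read as a difference between the two transformed proportions, and "arcsine" is taken to mean the arcsine transformation 2·arcsin(√p), the convention used for Cohen's h.

```python
from math import asin, sqrt
from statistics import mean, variance

def cohens_d(treatment, control):
    """Mean difference divided by the square root of the average of the two variances."""
    return (mean(treatment) - mean(control)) / sqrt((variance(treatment) + variance(control)) / 2)

def d_from_chi_square(chi_square, n_total):
    """Absolute value of d from a 1-df chi-square; requires chi_square < n_total."""
    return 2 * sqrt(chi_square / (n_total - chi_square))

def arcsine_transform(p):
    """Assumed arcsine transformation of a proportion: 2 * arcsin(sqrt(p))."""
    return 2 * asin(sqrt(p))

def d_from_hit_rates(p1, p2):
    """Difference between the arcsine-transformed hit rates of two groups."""
    return arcsine_transform(p1) - arcsine_transform(p2)

# Hypothetical inputs, for illustration only
print(f"{cohens_d([13, 15, 17, 19, 21], [10, 12, 14, 16, 18]):.2f}")
print(f"{d_from_chi_square(chi_square=4.0, n_total=104):.2f}")
print(f"{d_from_hit_rates(0.60, 0.45):.2f}")
```

Equal hit rates give a difference of exactly zero, which is a quick sanity check on the reading of the formula as a difference.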
The conventional values of effect size (Cohen, 1962) are:
- Small: d = .20
- Medium: d = .40
- Large: d = .60
Other researchers may have different values for small, medium, and
large effect sizes. The magnitude of effect size depends on the subject
matter. For example, in medical research d = .05 may be considered a large
effect size, i.e., if the drug can save even five more lives, further
research should be considered.
It is important to point out that Cohen defined .40 as the medium effect size
because it was close to the average observed effect size in his literature
review of the Journal of Abnormal and Social Psychology
during the 1960s. The so-called small, medium, and large effect sizes
are specific to a particular domain (abnormal and social psychology)
and thus they should not be treated as universal guidelines
(Aguinis & Harden, 2009).
Because different subject matters might have different effect sizes,
Welkowitz, Ewen, and Cohen (1982) explicitly stated that one should not use
conventional values if one can specify the effect size that is
appropriate to the specific problem. Moreover, Wilkinson and Task Force
(1999) gave the following advice, "Because power computations are most
meaningful when done before data are collected and examined, it is
important to show how effect-size estimates have been derived from
previous research and theory in order to dispel suspicions that they
might have been taken from data used in the study or, even worse,
constructed to justify a particular sample size."
It is a common practice for researchers to collect articles in their fields and catalog them in
EndNote for future citation. It may be more beneficial to use
this collection to calculate and constantly update the effect size of
the subject matter to be studied.
In practice, it may be difficult to find past research studies related
to your topic, especially when the topic is fairly new. To rectify this
situation, Glass, McGraw, and Smith (1981) suggested looking at studies
in similar domains. For example, if you are not able to locate enough
research papers on Web-based instruction, you can use studies on
hypertext and multimedia. Before the introduction of the World Wide Web,
hypertext and multimedia had been widely employed in computer-based
instruction programmed in HyperCard, Authorware, and Director. Concepts
related to Web-based instruction such as collaboration in chat sessions
and mailing lists can be found in research on collaboration in other
It is noteworthy that not all research studies can be included in your
collection for meta-analysis. Only well-designed studies which conform
to the standards established by Campbell and Stanley (1963) and Cook
and Campbell (1979) should be considered. Criteria of well-designed
studies will be discussed in the section "Design of experiment."
Applications of meta-analysis
As discussed in the section concerning power analysis,
computing effect size is essential to sample size determination.
Nevertheless, meta-analysis can be used not only for synthesizing
results of past research, but also for new research studies. For
example, Baker and Dwyer (2000) conducted eight studies regarding
visualization as an instructional variable (n=2000). If all subjects
were used in one analysis, the study would be over-powered. Instead, the
effect size is computed in each study individually. The findings of the
eight studies are then pooled to draw inferences as to the meaning of a
collective body of research.
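One common way to combine effect sizes that were computed study by study is an inverse-variance weighted average (a fixed-effect summary). The sketch below illustrates that general technique only; it is not necessarily the procedure Baker and Dwyer used, the study values are hypothetical, and the variance formula is the usual large-sample approximation for a standardized mean difference.

```python
from math import sqrt

def d_variance(d, n1, n2):
    """Approximate sampling variance of a standardized mean difference."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

def fixed_effect_summary(studies):
    """Inverse-variance weighted mean effect size and its standard error.

    studies is a list of (d, n1, n2) tuples, one per study.
    """
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    pooled = sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))

# Hypothetical (effect size, group 1 size, group 2 size) for three studies
studies = [(0.30, 120, 130), (0.45, 100, 100), (0.10, 60, 65)]
pooled_d, se = fixed_effect_summary(studies)
print(f"Pooled effect size = {pooled_d:.2f} (SE = {se:.2f})")
```

Each study is summarized first and only the summaries are combined, so larger studies receive more weight without the raw data ever being merged into one sample.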
Besides the risk of overpowering, using all data in one test may lead
to Simpson's paradox. Simpson's paradox is a phenomenon in which the
conclusion drawn from the aggregate data is opposite to the conclusion
drawn from the partitioned contingency tables based upon the same data.
The following example is given by Schwarz (1998). A university
conducted a study to examine whether there is a sex bias in
admission. The admission data of the MBA program and the law school
were analyzed. The first table shows the MBA data:
By looking at the MBA data only, it seems that females are
admitted at a slightly higher rate than males in the MBA program. The
same pattern can be found in the law school data.
Interestingly enough, when the two data sets are pooled, females
seem to be admitted at a lower rate than males!
To avoid Simpson's paradox, Olkin (2000) recommends that researchers
employ meta-analysis rather than pooling. In pooling, data sets are
first combined and then the groups are compared. In meta-analysis,
groups in different data sets are compared first and then the
comparisons are combined.
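The difference between the two orders of operation can be shown with a small numerical sketch. The counts below are hypothetical and were invented only to reproduce the pattern described above; they are not the figures reported by Schwarz (1998). The unweighted mean of the within-program differences is likewise just one simple way of combining the comparisons.

```python
# Hypothetical (admitted, applicants) counts, invented for illustration.
# These are NOT the figures reported by Schwarz (1998).
admissions = {
    "MBA": {"male": (300, 400), "female": (40, 50)},
    "Law": {"male": (10, 50), "female": (100, 400)},
}

def rate(admitted, applicants):
    return admitted / applicants

def pooled_rate(sex):
    """Pooling: combine the data sets first, then compute the admission rate."""
    admitted = sum(groups[sex][0] for groups in admissions.values())
    applicants = sum(groups[sex][1] for groups in admissions.values())
    return admitted / applicants

def within_program_differences():
    """Meta-analytic order: compare the groups within each data set first."""
    return {
        program: rate(*groups["female"]) - rate(*groups["male"])
        for program, groups in admissions.items()
    }

differences = within_program_differences()
for program, diff in differences.items():
    print(f"{program}: female rate minus male rate = {diff:+.3f}")
mean_diff = sum(differences.values()) / len(differences)
print(f"Mean within-program difference = {mean_diff:+.3f}")
print(f"Pooled difference = {pooled_rate('female') - pooled_rate('male'):+.3f}")
```

In these made-up data, females have the higher admission rate within each program, yet the pooled comparison points the other way because most female applicants applied to the program with the lower overall admission rate.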
Merits of meta-analysis
- A single study might lack statistical power due to a small sample size. Nevertheless, when many prior studies are combined, statistical power increases.
- An individual study might over-estimate or under-estimate the effect size. Again, when many studies are pooled together, the precision of the estimation can be substantially improved.
- A single study might have a very narrow focus. Meta-analysis can answer questions not posed by those scattered studies.
- When diverse or even conflicting results are found in previous studies, a meta-analyst is able to resolve the dispute by looking at the forest instead of the trees. However, the meta-analyst must take variation across studies into account (Higgins & Green, 2008).
Limitations of meta-analysis
Like every methodology, meta-analysis also has certain limitations and weaknesses:
Assumption of standardized effects
It is important to point out that in some branches of meta-analysis
computation of effect size is based upon a pooled variance or an
adjusted variance. In response to this practice, Berk and Freedman
(2003) are skeptical of the merits of meta-analysis. In their view, the
claimed merit of meta-analysis is illusory. First, many meta-analyses
use studies from both randomized experiments and observational studies.
In the former, it is usually the case that subjects are not drawn at
random from populations with a common variance. In observational
studies there is no randomization at all. Thus, it is gratuitous to
assume that standardized effects are constant across studies.
While this criticism is valid to some degree, the shortcoming can be
easily fixed by setting a higher bar in the inclusion and exclusion
criteria. For example, in the meta-analysis regarding the effect of
intercessory prayer on the effectiveness of social workers, Hodge
(2007) included only randomized controlled trials. Studies that used
less rigorous designs, such as single-case studies and non-randomized
studies, were excluded. In a similar study, Thompson (2007) started
with 150 potential candidates, but in the end only 23 studies that
employed true experiments were retained in his meta-analysis.
Retrospective observational studies that were designed as
quasi-experiments were removed from his meta-analysis.
Further, Berk and Freedman questioned the assumed independence of
studies for meta-analysis. Researchers are trained in similar ways,
read the same papers, talk to each other, write proposals for the same
funding agencies, and publish their findings in the same pool of
peer-reviewed journals. Earlier studies lead to later studies in the
sense that each generation of doctoral students trains the next. They
questioned whether this social dependence compromises statistical
independence. It is true that in some cultures mentees tend to follow in the exact
footsteps of their mentors. In this case, close relationships among
researchers might be a threat to the validity of meta-analysis.
Nonetheless, today it is very common for researchers to think
independently and to challenge each other. As a matter of fact,
divergence, instead of conformity, is the norm of the academic
Another common problem of meta-analysis is publication bias, also known
as the file-drawer effect: publication bias leads to the censoring of
studies with non-significant results. As a remedy, Keng and Beretvas
(2005) developed a methodology to quantify the effect that publication
bias can have on correlation estimation. The most common methods to
check for publication bias are funnel plots and Egger's test (Sterne &
Egger, 2001).
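The regression behind Egger's test can be sketched with the standard library alone. In its usual form, each study's standardized effect (effect size divided by its standard error) is regressed on its precision (one divided by the standard error), and an intercept far from zero suggests funnel-plot asymmetry. The data below are hypothetical, the function name `egger_intercept` is invented, and the sketch stops short of computing a p value, which would require the t distribution.

```python
from math import sqrt

def egger_intercept(effects, standard_errors):
    """Return the intercept of Egger's regression and its standard error."""
    x = [1 / se for se in standard_errors]                    # precision
    y = [d / se for d, se in zip(effects, standard_errors)]   # standardized effect
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_intercept = sqrt(sse / (n - 2) * (1 / n + x_bar ** 2 / s_xx))
    return intercept, se_intercept

# Hypothetical effect sizes and standard errors: the small (imprecise)
# studies report larger effects, the pattern publication bias tends to produce.
effects = [0.80, 0.55, 0.40, 0.32, 0.30]
standard_errors = [0.40, 0.30, 0.20, 0.12, 0.10]
intercept, se = egger_intercept(effects, standard_errors)
print(f"Egger intercept = {intercept:.2f} (SE = {se:.2f})")
```

If every study estimated the same effect regardless of its precision, the points would fall on a line through the origin and the intercept would be zero.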
Logic of courtroom
Root (2003) challenged the merits of meta-analysis at the philosophical
level. According to Root, standard hypothesis testing is based upon the
logic of physical sciences, in which the researcher must gamble with
the unknown future, in the sense that the prediction derived from the
hypothesis may not be in alignment with the proposed theory. However,
meta-analysis is implicitly tied to the logic of the courtroom, in which
collected evidence is used to explain past events. In a retrospective
methodology such as meta-analysis, the synthesizer has the luxury of
choosing which past studies to include. Using gambling as an analogy,
Root pointed out that computing probabilities based on known facts is
like betting money on a game after the result is known.
The result of meta-analysis is tied to the selection criteria set by the
researcher. In an attempt to resolve the debate concerning whether mammography
can reduce the mortality rate of breast cancer, a research team utilizing
meta-analysis found that there was no reliable evidence to support the claim
that mass screening for breast cancer had a positive effect for any women. In
contrast, the US Preventive Services Task Force, which also employed meta-analysis,
concluded that the use of mammograms significantly enhanced the survival rate of
women from 40 to 74 years of age. Aschengrau and Seage III (2007) asserted that the
preceding contradiction is a result of different criteria for selecting the
"Superrealization bias," a term coined by Cronbach et al. (1980), is germane
to effect size and meta-analysis. Superrealization bias refers to the phenomenon
that in a small-scale study, experimenters are able to monitor the quality of
implementation or create unrealistic conditions, but these ideal conditions
could never be realized in a large-scale study. Slavin (2008) and Slavin and
Smith (2008) asserted
that small studies are not inherently biased, but a collection of small studies
tends to be biased. Thus, Slavin warned against reporting average effect sizes
using a cluster of low n studies.
Varying conditions across studies
It is important to point out that quite a few controversial conclusions in
medical research arise from meta-analyses. For example, based on meta-analyses,
the medical research community asserted that antidepressants are not more
effective than placebos. But critics charged that not all the studies included
in the meta-analyses used the same protocols, definitions, types of patients, and
doses. The alleged safety of Avandia is another example. A meta-analysis from
the combined trials showed that only 55 people in 10,000 had heart attacks when
using Avandia whereas 59 people per 10,000 had heart attacks in comparison
groups. However, after a series of statistical manipulations, this conclusion
was reversed. It was argued that a meta-analysis synthesizing many small-scale
studies is not a good substitute for a single trial with a large sample size
No scientific breakthrough was made through meta-analysis
Skeptics and New Atheism authors tried to discredit meta-analysis
because this method was used for studying supernatural and paranormal
phenomena. For example, Stenger (2007) wrote:
This procedure (meta-analysis) is highly
questionable. I am unaware of any extraordinary discovery in all of
science that was made using meta-analysis. If several, independent
experiments do not find significant evidence for a phenomenon, we
surely cannot expect a purely mathematical manipulation of the combined
data to suddenly produce a major discovery. No doubt parapsychologists
and their supporters will dispute my conclusions. But they cannot deny
the fact that after one hundred and fifty years of attempting to verify
a phenomenon, they have failed to provide any evidence that the
phenomenon exists that has caught the attention of the bulk of the
scientific community. We safely conclude that, after all this effort,
the phenomenon very likely does not exist (Kindle Locations 824-830).
In a similar vein, the Skeptic's Dictionary (2012) website defines meta-analysis as follows:
A meta-analysis is a type of data analysis in
which the results of several studies, none of which need find anything
of statistical significance, are lumped together and analyzed as if
they were the results of one large study.
Is the above the correct definition of meta-analysis? Is it a common
practice for a meta-analyst to put together the results from studies
that show no significant effects and then "mathematically manipulate"
the data to prove a point? Didn't the illustration of Simpson's
paradox clearly indicate that it is possible to reach opposite
conclusions when one analysis combines all data and the other
partitions the data set? In response to Stenger's attack, Bartholomew
(2010) replied that it is not clear what the words “mathematical
manipulation” and “suddenly” mean. The so-called manipulation in
meta-analysis is no more mathematical than other statistical
procedures, such as hypothesis testing. When a criticism, no matter
how sophisticated it sounds, is misguided by a wrong definition and
poor statistical knowledge, it is nothing more than attacking a straw
man (Yu, 2012).
Software for meta-analysis
You can use either all-purpose statistical programs or specialized programs to
conduct meta-analysis. SAS is an example of an all-purpose statistical program
that can perform meta-analysis (Wang & Bushman, 1999). For
specialized programs, one can use those offered by Biostat (2006) or Devilly (2005). On
one hand, StatDirect (2014) is considered an all-purpose statistical
application because it can perform meta-analysis as well as other
statistical procedures, but on the other hand, it can also be viewed as
a specialized package because its features are made for biomedical,
public health, and epidemiological research. The image below is a
screenshot of StatDirect.
The following is a screenshot of Effect size Generator written
by Grant Devilly:
To get a quick overview of effect size, I recommend reading the book chapter on effect size written by Tatsuoka (1993) in A Handbook for data analysis in the behavioral sciences (pp. 461-479), edited by Gideon Keren and Charles Lewis and published by L. Erlbaum Associates (Hillsdale, NJ).
For learning the procedure of conducting meta-analysis, please look at
Liao (1998) as an example.
- Aguinis, H. & Harden, E. E. (2009). Sample size rules of thumb: Evaluating
three common practices. In Charles E. Lance and Robert J. Vandenberg. (Eds.),
Statistical and methodological myths and urban legends: Doctrine, verity and
fable in the organizational and social sciences (pp.267-286). New York :
- Aschengrau, A., & Seage III, G. (2007). Essentials of epidemiology in
public health. Boston, MA: Jones and Bartlett.
- Baker, R., & Dwyer, F. (2000 Feb.). A meta-analytic assessment of the effects of visualized instruction. Paper presented at the 2000 AECT National Convention. Long Beach, CA.
- Bartholomew, D. (2010). Victor Stenger’s scientific critique of Christian Belief. Science and Christian Belief, 22, 117-131.
- Berk, R.A. & Freedman, D. (2003). Statistical assumptions as empirical commitments. In T. G. Blomberg, S. Cohen (Eds.). Law, Punishment, and Social Control: Essays in Honor of Sheldon Messinger (2nd ed) (pp. 235-254). New York: Aldine.
- Biostat (2006). Meta-analysis. [On-line] Available: http://www.meta-analysis.com/
- Campbell, D. & Stanley, J. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand-McNally.
- Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review.
Journal of Abnormal and Social Psychology, 65, 145-153.
- Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin Company.
- Cronbach, L. J., Ambron, S. R., Dornbusch, S. M., Hess, R.O., Hornik, R. C., Phillips, D. C., Walker, D. F., & Weiner, S. S. (1980). Toward reform of program evaluation: Aims, methods, and institutional arrangements. San Francisco: Jossey-Bass.
- Devilly, G. (2005). Effect size generator. Retrieved from
- Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.
- Glass, G. V., McGraw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills: Sage Publications.
- Higgins, J. P., & Green, S. (2008). Cochrane handbook for systematic reviews of interventions. West Sussex, UK: John Wiley & Sons.
- Hodge, D. R. (2007). A systematic review of the empirical literature on intercessory prayer. Research on Social Work Practice, 17, 174-187. DOI: 10.1177/1049731506296170
- Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage Publications.
- Keng, L, & Beretvas, N. (2005 April). The effect of publication bias on correlation estimation. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.
- Liao, Y. C. (1998). Effects of hypermedia versus traditional instruction
on students' achievement: A meta-analysis. Journal of Research on Computing in Education, 30, 341-361.
- Olkin, I. (2000 November). Reconcilable differences: Gleaning insight from independent scientific studies. ASU Phi Beta Kappa Lecturer Program, Tempe, Arizona.
- Poston, J. M, & Hanson, W. E.(2010). Meta-analysis of psychological assessment as a therapeutic intervention.
Psychological Assessment, 22, 203-12.
- Root, D. (2003). Bacon, Boole, the EPA, and scientific standards. Risk Analysis, 23, 663-668.
- Schwarz, C. (1998). Contingency tables - Simpson's paradox. Retrieved from
- Siegfried, T. (2010). Odds are, it's wrong: Science fails to face the shortcomings of statistics.
Science News, 177(7). Retrieved from
- Slavin, R. (2008). Perspectives on evidence-based research in education. Educational Researcher, 37(1), 5-14.
- Slavin, R., & Smith, D. (2008, March). Effects of sample size on effect size in
systematic reviews in education. Paper presented at the annual meetings of the Society for Research on Effective Education, Crystal City, VA.
- Skeptic's Dictionary. (2012). Meta-analysis. Retrieved from http://www.skepdic.com/metaanalysis.html
- StatDirect Inc. (2014). StatDirect [Computer Software]. Cheshire, WA: Author.
- Sterne, J, & Egger M. (2001). Funnel plots for detecting bias in meta-analysis: Guidelines on choice of axis. Journal of Clinical Epidemiology, 54, 1046-1055.
- Stenger, V. J. (2007). God: The failed hypothesis. How science shows that God does not exist (Kindle Edition). Amherst, NY: Prometheus Books.
- Thompson, D. P. (2007). A meta-analysis on the efficacy of prayer.
Fresno, Alliant International University. Unpublished Dissertation.
- Wang, M. C., & Bushman, B. J. (1999). Integrating results through meta-analytic review using SAS software. Cary, NC: SAS Institute.
- Welkowitz, J., Ewen, R. B., & Cohen, J. (1982). Introductory statistics for the behavioral sciences. San Diego, CA: Harcourt Brace Jovanovich, Publishers.
- Wilkinson, L, & Task Force on Statistical Inference.
(1999). Statistical methods in psychology journals: Guidelines and
explanations. American Psychologist, 54, 594-604.
- Yu, C. H. (2012). Can absence of evidence be treated as
evidence of absence: An analysis of Victor Stenger's argument against
the existence of God. China Graduate School of Theology Journal, 52, 133-152.
Last update: 2016