Experiment and Non-experiment
Experimental research and non-experimental research
"Experiment" is a widely misused term. When some people talk about
their "experiment," their study is in fact non-experimental in nature.
The following are the characteristics of experimental and
non-experimental research designs.
It is very common for even experienced researchers to be
confused by random sampling and randomization. For example, Morse
- Random sampling: a sampling method in which each
member of a set has independent
chances to be selected (the notion of "equal chances" is a theoretical
ideal mentioned by many textbooks, but there is always some hidden
bias or disposition in the real world).
- Randomization: randomly assigning subjects to the
control group and the treatment group.
- Experimenter manipulation: directly manipulating
variables to test cause-and-effect relationships, e.g. altering the amount
of drug given to the patients. The researcher manipulates the factor
that she cares about.
- Experimenter control: involves control of all
extraneous variables or conditions that might have an impact on the
dependent variables. The researcher removes the effect that she doesn't
What is wrong with randomization?
Processes of saturation are essential in qualitative inquiry:
saturation ensures replication and validation of data; and it ensures
that our data are valid and reliable. If we select a sample randomly,
the factors that we are interested
in for our study would be normally distributed in our data, and be
represented by some sort of a curve, normal or skewed. Regardless of
the type of curve, we would have lots of data about common events, and
inadequate data about less common events. Given that a qualitative data
set requires a more rectangular distribution to achieve saturation,
with randomization we would have too much data around the mean
(and be swamped with the excess), and not enough data to saturate on
categories in the tails of the distribution (p.234)
(Emphasis added by the author).
As of August 7, 2017, the website of the Department of Statistics
explained the role of sampling in statistical inference as follows:
The use of randomization in sampling
allows for the analysis of results using the methods of statistical
inference. Statistical inference is based on the laws of probability,
and allows analysts to infer conclusions about a given population based
on results observed through random sampling (para. 1)
(Emphasis added by the author).
Again, randomization is concerned with assignment of group membership
after the sample is drawn, whereas random sampling is a subject
Control and manipulation are crucial to experimentation. Without
them, the conclusion drawn from an observed phenomenon could be
completely wrong even if it makes sense. Let's look at an everyday
example: One of my friends has two TV sets. One of them is
Japanese-made while the other is European-made. She insisted that the
Japanese TV is of better quality than the European one because the
former presents a sharper picture. Being skeptical of her claim, I
conducted a small experiment: I simply swapped the locations of the two
TV sets. As a result, the European TV set showed a clearer picture than
the Japanese one. As you see, the factor here is the signal rather than
the electronics. In an experiment, if I put all TVs under study in the
same location, then location as a source of "noise" is under my control.
If I alternate the location for each TV, then location becomes a
variable under my manipulation.
Let's use herbs as another example: A Chinese friend maintained that
some Chinese herbs could heal certain diseases. She even conducted an
experiment to prove it. When her husband suffered a long-term illness,
he took Chinese herbs for one week and his health condition improved
substantially. The next week he stopped taking Chinese herbs and the
condition reversed. When I asked her how many types of Chinese herbs her
husband took, she answered, "Ten." If I feed a patient 10
vitamins, I am sure he will get better, too! Because the chemical
components of the herbs were neither manipulated nor partitioned,
this "experiment" did not tell us which Chinese herb is helpful to
which body function.
However, it is important to note that "control" is not the
core essence of experimentation. The difference between
controlled experiment and randomized experiment will be discussed in a
A quasi-experiment is a research design that does not meet all the
requirements necessary for controlling the influence of extraneous
variables. Usually what is missing is random assignment.
For example, when a researcher studies gender difference in
use, obviously he cannot randomly assign gender (I am happy as a man. I
don't want to be re-assigned).
It is generally agreed that the primary demarcation between
experiments and quasi-experiments is random assignment of group
membership. Nonetheless, some authors consider random selection as a
criterion, too. For example, according to Plichta and Garzon (2009),
"quasi-experimental designs may lack random selection, random
assignments, or both" (p.13). In a similar vein, Moule and Hek (2012)
suggested that convenience sampling is "a part of survey or
quasi-experiment designs" (p.95).
This type of research is very common in
political sciences and communications, in which many variables are not
controllable. For example, if you intend to study how wars affect
people's perception of the quality of policy making, you cannot create
a war or manipulate other world affairs, unless you are the villain in
the movie "Tomorrow Never Dies." Because of this limitation,
researchers send surveys to participants who are exposed to the real
Secondary analysis: Archival research
Archival research is
a subset of secondary data analysis, but the two terms are not
synonymous. Meta-analysis, in
which results of prior research are synthesized, is
also a form of secondary analysis. As the name implies, archival
research utilizes existing raw data archived in databases, but
meta-analysis extracts statistical results from previous studies. If
you don't like the tedious IRB process, go for secondary data
Archival research is popular in economics and educational research,
especially when the research project involves trends or longitudinal
data. For example, if the researcher wants to find out the correlation
between productivity and school performance, he can contact the General
Accounting Office and the Department of Education to obtain the
related data from the last twenty years. The following are some examples
of archival data that are openly accessible:
Obviously, there are advantages of archival data analysis:
- It saves time, effort, and money, because the data are available online
(Most online databases are free, but CCMH requires a full data access
- It provides a basis for comparing the results of
secondary data analysis and your primary data analysis (e.g. national
sample vs. local sample).
- The sample size is much bigger than what you can collect
by yourself. A small-sample study lacks statistical power and the
result might not be stable across different settings. In contrast,
big data can reveal stable patterns.
- Many social science studies are conducted with samples that
are disproportionately drawn from Western, educated, industrialized,
rich, and democratic populations (WEIRD; Henrich, Heine, &
Norenzayan, 2010). Nationwide and international data sets alleviate the
problem of WEIRD.
On the other hand, there are shortcomings and limitations. For example,
you might be interested in analyzing disposable income, but the
available variable is gross income. In other words, your research question is
confined by what you have at hand (Management Study Guide, 2016).
Additionally, it is important to point out that very often there are discrepancies
between different sources of archival data, and thus researchers should
exercise caution in drawing firm conclusions derived from a single data
source. For example, GDP per capita is commonly used in many archival
research studies. Nonetheless, there exist vast differences between the
two different sources indicating GDP per capita of each country,
namely, World Development Indicators (WDI) and Penn World Table 7.1
(PWT) (Ram & Ural, 2014). In addition, based on the 2005 UN Human
Development statistics, Harris (n.d.) pointed out that the most
atheistic societies, including many secular European nations, are the
healthiest. However, in Happy Planet Index none of those secular
European countries is ranked among the top 20. The table below shows
the recent figures of UNHD and HPI side by side.
[Table: UNHD and Happy Planet Index figures by country]
Both natural settings and laboratory-controlled
experiments have pros and cons. On some occasions, things that happen in
real life challenge the results of artificial experiments. For example, in some
lab-controlled benchmark tests, Windows outperforms Mac OS, Linux, and
even UNIX! But computer users tell different stories in real settings.
It is common that experimentation is equated with scientific
methodology, and thus is highly regarded. Actually, certain science
subjects do not rely heavily on experimentation, such as astronomy
(e.g. the Big Bang, quantum tunneling) and physics (e.g. M-theory). In classical
astronomy the major source of knowledge is observation
rather than experimentation (Deese, 1972). For example, you cannot blow
up Mars and see how the absence of Mars affects the gravitational forces
of the Solar System (with modern rocket and nuclear technologies,
humans may be able to do so, but we shouldn't)! And the study of the
origin of the universe cannot count even on observation. Mathematics
is another example. Although today, with the aid of high-powered computers,
several mathematicians are able to conduct "mathematical experiments"
by simulation (Chaitin, 1998), mathematical
theorems basically originate from logical deduction.
Lack of experimentation can also be found in certain areas of biology
such as evolution. Barkow (1989) pointed out that an evolutionary
scenario is speculative in which the usual requirements for empirical
verifiability are relaxed in favor of an emphasis on logic and
Randomization and Simpson's Paradox
Randomization is the major difference between experiment and
quasi-experiment. It is important to point out some common
misconceptions regarding randomization.
Random sampling and randomization
As mentioned before, many people confuse random sampling and
randomization. The former is a sampling process while the latter is
concerned with assignment of group membership. Further, the purpose of
random sampling is to enhance the generalizability of the
results, while the purpose of randomization is to establish cause-effect
interpretations of the results. In other words, random sampling
counteracts threats to external validity whereas
randomization addresses threats to internal validity.
However, the above concepts are easily confused (May & Hunter,
1988). The topic of internal validity and external validity will be
discussed in another write-up entitled Threats
to validity of Research Design.
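The distinction between the two procedures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 10,000 "students", the sample size of 40, and the function names are all hypothetical and do not come from any study cited here.

```python
import random

def random_sample(population, n, rng):
    """Random sampling: decide WHO enters the study (supports generalization)."""
    return rng.sample(population, n)

def randomize(subjects, rng):
    """Randomization: decide WHICH GROUP each subject joins (supports causal claims)."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(2024)  # fixed seed only so the sketch is reproducible
population = [f"student_{i}" for i in range(10_000)]  # hypothetical sampling frame

sample = random_sample(population, 40, rng)   # step 1: draw the sample
control, treatment = randomize(sample, rng)   # step 2: assign group membership

print(len(sample), len(control), len(treatment))
```

Note that the second step works just as well on a convenience sample: `randomize` does not care how `sample` was obtained.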
In practice, randomization plays a more important role than
sampling in research. Let's face it. How often can a researcher draw a
random sample? If the target population consists of all university
students, are you able to draw samples from campuses in states other
than your own? As a matter of fact, most research studies recruit
convenience subjects that are readily available (Frick, 1998). If the
requirement of random sampling were strictly followed, experiments could
hardly be implemented. In fact, Reichardt and Gollob (1999) found that in
a randomized experiment, the use of a t test with a convenience sample
can be justified without reference to a hypothetical infinite
population from which random samples are drawn.
To rectify the situation of non-random sampling,
randomization is used
to spread errors randomly among treatment groups (Fisher, 1971). Pitman
(1937a, 1937b, 1938) went so far as to assert that random sampling is
unnecessary for a valid test of the difference between treatments in a
randomized experiment. Using an example of 40 convenience subjects,
Babbie (1992) conceptualized randomization as treating convenience
samples as probability samples: "It is as though 40 subjects in this
instance are a population from which we select two probability
samples-each consisting the characteristics of the total population, so
the two samples will mirror each other." (p.243)
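The idea that a randomized experiment can be tested without assuming random sampling can be illustrated with a randomization (permutation) test. The sketch below is an assumption-laden illustration rather than the procedure of any author cited above: the scores are made up, and the function name and parameters are hypothetical.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def randomization_test(treatment, control, n_shuffles=10_000, seed=0):
    """Two-sided randomization test for a difference in group means.

    The reference distribution comes from re-shuffling the group labels of
    the subjects actually in the study, not from a hypothetical population.
    """
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # one alternative random assignment
        diff = mean(pooled[:n_t]) - mean(pooled[n_t:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_shuffles

# Hypothetical scores from 16 convenience subjects, 8 per group.
treatment_scores = [78, 85, 90, 88, 92, 81, 87, 95]
control_scores = [70, 75, 80, 72, 78, 74, 69, 77]

observed_diff, p_value = randomization_test(treatment_scores, control_scores)
print(observed_diff, p_value)
```

The p-value here is the share of re-assignments that produce a difference at least as large as the observed one, so the inference concerns these 16 subjects under random assignment.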
However, like random sampling, randomization also encounters
difficulties in implementation. Berk (2005) used the following example
to illustrate one of the problems: Even if the experimenter randomly
assigns prisoners to different treatment programs, the inmates may
fail to show up. This can turn the randomized experiment into an
It is important to emphasize repeatedly that randomization is not a
silver bullet. In addition to the attrition issue mentioned above,
randomization is subject to the threat of Simpson's Paradox, which was
discovered by Dr. E. H. Simpson (1951), not O. J. Simpson or Bart
Simpson. Simpson's Paradox is a phenomenon in which the conclusion drawn
from the aggregate data is opposite to the conclusion drawn from the
contingency table based upon the same data.
If this is too abstract, let's look at an example: In
a 20-year follow-up study conducted to examine the survival rate
and death rate of smokers and non-smokers, the result implied a
significant positive effect of smoking because only 24% of smokers died
compared to 31% of non-smokers. Philip Morris should celebrate,
right? Not yet. When the data were broken down by age group in a
contingency table, it was found that there were more older people in
the non-smoker group (Appleton & French, 1996).
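The reversal can be reproduced with a small worked example. The counts below are hypothetical, chosen only to show the pattern; they are not the Appleton and French data, and the two age bands are an assumed simplification.

```python
# (age group, smoking status) -> (deaths, group size); hypothetical counts.
counts = {
    ("young", "smoker"): (40, 400),
    ("young", "non-smoker"): (10, 200),
    ("old", "smoker"): (60, 100),
    ("old", "non-smoker"): (200, 400),
}

def death_rate(status, age=None):
    """Death rate for a smoking status, overall or within one age group."""
    deaths = total = 0
    for (group_age, group_status), (d, n) in counts.items():
        if group_status == status and (age is None or group_age == age):
            deaths += d
            total += n
    return deaths / total

for age in (None, "young", "old"):
    label = age or "all ages"
    print(label, death_rate("smoker", age), death_rate("non-smoker", age))
```

In the aggregate, smokers appear to do better (20% vs. 35%), yet within each age group they do worse (10% vs. 5% among the young, 60% vs. 50% among the old). The reversal arises because most of the hypothetical non-smokers are old.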
Another example of Simpson's Paradox can be found in a study regarding
student retention conducted at Arizona State University. Although the
initial analysis based on all data (Yu, DiGangi, Jannasch-Pennell,
& Kaprolet, 2010) shows that among the students who stay at the
university, the probability of being a resident (p=.67) is higher than
that of non-residents (p=.33), a seemingly opposite conclusion emerges
when observations are grouped by state in a GIS analysis, as shown in
Figure 1:
Figure 1. Retention rate mapped to student
How is Simpson's Paradox related to randomization?
Obviously, the above
study used non-experimental data. You cannot ask people to become
smokers or non-smokers. Neither can age be assigned (I wish it could be.
If so, I would request to be assigned to the young age group). As a
result, two groups that were non-equivalent in age led to Simpson's
Paradox. Although randomization is said to prevent this from happening,
randomization is not 100% foolproof. By simulation, Hsu (1989) found
that when the sample size is small, randomization tends to make groups
non-equivalent and increases the possibility of Simpson's
Paradox. Thus, after randomization with a small sample size,
researchers should check the group characteristics on different
dimensions (e.g. race, sex, age, academic year, etc.) rather than
blindly trusting randomization.
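A rough simulation can show why small samples deserve this extra check. The sketch below is not Hsu's simulation; it is a simplified illustration that assumes ages are uniformly distributed between 18 and 80 and measures how far apart the mean ages of two randomized groups tend to be. The function name and all parameter values are hypothetical.

```python
import random

def mean_age_gap(n_subjects, n_trials=1000, seed=1):
    """Average absolute difference in mean age between two randomized groups."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_trials):
        ages = [rng.uniform(18, 80) for _ in range(n_subjects)]  # assumed age range
        rng.shuffle(ages)  # random assignment to two groups
        half = n_subjects // 2
        group_a, group_b = ages[:half], ages[half:]
        total_gap += abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    return total_gap / n_trials

for n in (10, 40, 400):
    print(n, round(mean_age_gap(n), 2))
```

The average gap shrinks as the number of subjects grows, which is why a balance check on age, sex, and similar characteristics matters most when only a handful of subjects are randomized.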
Randomized and controlled experiments
Another area of confusion is the
difference between randomized and controlled experiments. Today
"randomized experiment" and "controlled experiment" are often used
synonymously. One of the reasons is that usually an experiment consists
of a control group and a treatment group, and each subject
is randomly assigned to one of the groups. Since
"control" and "randomization" are both perceived as characteristics of
an experiment, it is not surprising that in many texts randomized
experiment and controlled experiment are either used in an interchangeable
fashion or combined into one term such as
"randomized controlled experiment." The latter usage is legitimate as
long as both control and randomization are implemented in the
experiment. However, treating a randomized experiment as "a controlled
experiment" and vice versa is misleading (e.g. "In controlled
experiments, this is accomplished in part through the random assignment
of participants to treatment and control
groups" (Schneider et al., 2008)). Indeed, there is a subtle difference
between the two.
R. A. Fisher is the pioneer of the randomized experiment. In
Fisher's view, even if there is a significant difference between the
control and the treatment group, we may not be able to attribute the
difference to the treatment when there exist many uncontrollable
variables and sampling fluctuations. The objective of randomization is
to differentiate between associations due to causal effects of the
treatment and associations due to some variable that is a common cause
of both the treatment and response variables. If there are influences
resulting from uncontrolled variables, randomization distributes those influences
randomly across the control and treatment groups
even though those variables are not controlled.
On the other hand, the logic of experimentation up to
Fisher's time was that of the controlled experiment. In a controlled
experiment, many variables are experimentally fixed to a constant
value. However, Fisher explicitly stated that it is an inferior method,
because it is impossible to know what variables should be taken into
account. For example, a careful researcher may assign equal numbers of
males and females to each group, but she/he may omit the age and
educational level of the subjects. In Fisher's view, instead of
attempting to put everything under control, the researcher should let
randomization take care of the uncontrollable factors. This is not to
suggest that Fisher did not advocate controlling for other causes in
addition to randomization. Rather, he explicitly recommended that the
researcher should exercise as much control as he can, but he advised that
randomization must be employed as "the second line of defense"
Following the same line of reasoning, the Canadian Task
Force for Preventive Health Care (2003) prefers randomized experiments
to controlled trials without randomization as clinical evidence, as
shown in the following table.
| Evidence from randomized controlled trial(s)
| Evidence from controlled trial(s) without randomization
| Evidence from cohort or case-control analytic studies, preferably from more than one centre or research group
| Evidence from comparisons between times or places with or without the intervention; dramatic results in uncontrolled experiments could be included here
| Opinions of respected authorities, based on clinical experience; descriptive studies or reports of expert committees
Nonetheless, a randomized experiment is not necessarily
superior to a controlled experiment. As mentioned before, when the
sample size is small, randomization tends to make groups become
non-isomorphic and thus may lead to a Simpson's Paradox (Hsu, 1989).
Not surprisingly, when the sample size is small, a controlled
experiment is more advisable.
Smoking does not cause lung cancer, really?
It is important to point out that any dogmatic thinking
is counter-productive to science, which is supposed to be an open
system. R. A. Fisher, the inventor of the randomized experiment, was dead
wrong about the relationship between smoking and lung cancer. Between
1922 and 1947 the rate of deaths attributed to lung cancer
increased fifteenfold across England and Wales. In 1947 Austin Bradford Hill and
Richard Doll were hired by the British Medical Research Council to
investigate the possible cause of this epidemic. Obviously, it would be
unethical to conduct a randomized experiment, such as randomly
assigning 3,000 healthy people to the smoking group and 3,000 to the
control group. Instead, Hill and Doll conducted surveys in the
hospitals of London. Doll was stunned by the fact that people who
smoked tended to die of lung cancer, and in response he gave up smoking
two-thirds of the way through the study. In 1950 Hill and Doll
published their report in the British Medical Journal, suggesting that
there was a causal link between smoking and lung cancer. In 1957
Fisher, who was a smoker, sent a letter to the journal to repudiate
their conclusion. His reasoning was simple: without running a randomized
experiment we cannot assert a cause-and-effect relationship between
tobacco and lung cancer. Fisher held to his position and kept
arguing against his opponents until he died in 1962 (Christopher, 2016).
Nonetheless, at least Fisher was consistent in doing what he said. He
kept smoking until his death!
Australian approach cannot work in America, really?
The previous example shows that the dogmas of randomized
experimentation could hinder researchers from drawing a sound causal
conclusion and delay countermeasures against threats (e.g. the
environmental hazard of DDT pointed out by Silent Spring and climate
change suggested by IPCC). In addition, Berwick (2008) challenged the
assumption that experiments can be applied to all situations. Many years ago the Rapid
Response Team (RRT), an innovative preventative health care approach
introduced by Australian doctors, in which a team of physicians and
nurses monitors patients' vital signs and takes proactive actions,
was implemented in the United States. However, randomized experiments
conducted by American researchers showed that there were no significant
differences between RRT and non-RRT approaches in terms of reducing the
number of unexpected deaths. Berwick questioned the validity of the
conclusion, for it ignored the cultural context and the specific
Similarly, Rawlins disputed the experimental "gold standard"
research by listing the limitations of randomized and controlled
experiments. First, like social scientists, sometimes medical
researchers face a "mission impossible" scenario when the disease under
investigation is extremely rare and thus the number of patients is very
small. Second, on some occasions experimentation is unnecessary,
especially when a treatment produces a "dramatic" benefit, such as
Imatinib (Glivec) for chronic myeloid leukemia. In health science
research there is a stopping rule. When the treatment shows healing
effects, the trial should be stopped early so that the control group
can switch to the more effective treatment. There is no consensus among
statisticians as to how best to handle this situation, but treating
this type of incomplete experiment as invalid would throw out valuable
information (cited in Medical News Today, 2008).
Essock et al. (2003) also observed the discrepancy between the "real
world" and the lab settings. Many drug treatment studies last about
four to eight weeks only. Short-term drug tests may cost less to
implement, but usually these studies do not yield the statistical
significance that is found in long-term experiments. On the other hand,
long-term drug trials have problems in retaining participants long
enough to yield unbiased outcomes. In other words, the so-called causal
conclusions produced in experiments may not reflect what would happen
in the real world.
The dictator game in the real world
The dictator game, which is used very often for studying
morality and cooperative behaviors, is another good example. In an
experiment utilizing the dictator game, the participant is told to decide how
much of a $10 pie he would like to give to an anonymous person who also signed
up for the same experimental session. The game is so named because the decision
made by the giver is final. Most experimental results are encouraging:
participants were willing to share the wealth. However, the result is
different when the dictator game is conducted in a naturalistic
setting. In a study carried out by Winking and Nizer (2013) at a bus stop in Las Vegas, a
researcher told some strangers that he was in a hurry to get to the airport and
therefore he wanted to give away his $20 in casino chips. The researcher
explicitly suggested to the receivers that they share a portion of the money with
another stranger at the bus stop, who was actually a member of the research
team. In contrast to the experimental result, no one in the
naturalistic study gave any portion of the endowment to the stranger. Thus, Winking and Nizer suggested
that in the past the setting of the experimental context induced participants
to choose prosocial options.
The Pepsi challenge
The preceding examples may seem too remote.
Let's look at
some products that we consume every day: Coke and Pepsi. In experimental
settings, most participants prefer Pepsi to Coke. However, Gladwell
(2007) disputed the result by presenting evidence that this so-called
"Pepsi Challenge" is based on the unrealistic "sip test" method. Most
tasters would favor the sweeter of two beverages when they take a
single sip only, but the result is reversed when the entire can or
bottle is consumed. (I am skeptical of this type of taste test,
including wine tests, coffee tests, water tests, etc. Our limited
senses may not be able to distinguish one from another when the
difference is very subtle. In one experiment the researcher tinted the
wine and asked the wine experts to rate the "red
wine." Surprisingly, the experts did not recognize that it was not a
glass of red wine!)
(Similar results are found in coffee tests and water
Other elements and sample size
In educational research, the What Works Clearinghouse (WWC)
still adopts the conventional ranking of study types. Slavin is critical
of this criterion by pointing out that in small, brief, and artificial
studies random assignment does not necessarily guarantee validity;
over-emphasizing randomized studies without taking sample size and
other design elements into account might introduce bias that "can lead
to illogical conclusions" (p.11).
Ruling out rival interpretations in quasi-experiment and
Some statisticians assert that one can never draw causal inferences
without experimental manipulation (e.g. SAS Institute, 1999). Some
researchers argued that causal inferences are weakened in
quasi-experiments (e.g. Keppel & Zedeck, 1989). However,
Christensen (1988) held a more liberal position:
Many causal inferences
are made without
using the experimental framework; they are made by rendering other
rival interpretations implausible. If a friend of yours unknowingly
stepped in front of an oncoming car and was pronounced dead after being
hit by the car, you would probably attribute her death to the moving
vehicle. Your friend might have died as a result of numerous other
causes (a heart attack, for example), but such alternative explanations
are not accepted because they are not plausible. In like manner, the
causal interpretations arrived at from quasi-experimentation analysis
are those that are consistent with the data in situations where rival
interpretations have been shown to be implausible. (p.306)
I would go further than Christensen to assert that even some
observational studies could yield valid causal conclusions. While the
example of a car accident in Christensen's argument is hypothetical,
we can see a similar example in real life. Some researchers assert
that we could still attribute causal factors to effects with
observational data if virtually identical units with two different
outcomes are observed. To attribute causal factors to accidents, in
Georgia 300 accidents were compared to 300 non-accidents involving the
same car, driver, weather condition, and lighting. The non-accidents
occurred one mile back on the same road, a location passed by the
driver minutes earlier en route to the crash site. Researchers found a
substantial excess of roads that curved more than six degrees with
downhill gradients. In another example, to answer the question of
whether helmets reduce the risk of death in motorcycle crashes,
virtually identical units were compared: cases in which two people rode
the same motorcycle, a driver and a passenger, one helmeted and the
other not. Researchers concluded that wearing a helmet reduced the risk
by 40% (Rosenbaum, 2005).
A similar scenario can be found in political and economic studies.
During the Cold War era, the whole world was divided into three camps,
namely, the Communist world led by the Soviet Union and the People's
Republic of China, the Capitalist conglomerate led by the United
States, and the non-aligned countries. Some countries were partitioned
into two political entities due to unresolved ideological differences
embraced by different local parties. Obvious examples include North
Korea and South Korea, Mainland China and the Republic of China
(Taiwan), East Germany and West Germany, as well as North Vietnam and
South Vietnam. This division is not a result of randomization, of
course. Nevertheless, the observational data about the two camps could
still inform us about certain causes and effects. Many years ago
philosopher Margaret Walker (personal communication) argued that there is
no causal relationship between Communist ideology and the horrible
consequences in the Communist countries. I held a different view. As
mentioned before, we could still attribute causal factors to effects
with observational data when virtually identical conditions associated
with the outcomes are observed. In terms of cultural heritage,
language, and racial attributes, the two countries in each pair on the
preceding list share a high degree of resemblance. The major
difference is found in the political and economic systems only. Owing to
self-isolation and the containment policy pursued by the West, the
Communist bloc could "experiment" with central planning, class
struggle, and so on without much outside influence. Needless to say,
after half a century people were disenchanted by the broken economy and
the lack of human rights in those Communist countries (Courtois et al.,
1999). It would be difficult to deny a causal relationship between
Communism and those undesirable consequences (Yu, 2009).
Another good example of a natural experiment is racial diversity before
and after Proposition 209. In 1996, the State of California passed
Proposition 209, which prohibited public institutions from using
race-based admission policies. After Proposition 209 there was a
50-percent reduction in black freshman enrollment and a 25-percent drop
for Hispanics. Nonetheless, although the black and Hispanic enrollment
was reduced at the most prestigious University of California campuses
(-42% at UC Berkeley; -37% at UCLA), other less competitive UC campuses
increased their black and Hispanic enrollment (+22% at UC Irvine, +18%
at UC Santa Cruz; +65% at UC Riverside) (Sander & Taylor, 2012).
Lurking variables, proxy measures, and theoretical causal
variables in correlational studies
Archival research is also called correlational research
because cause-and-effect inferences cannot be directly made. For
example, even though data from the last twenty years show a positive
correlation between productivity and school performance, it would be a
leap of faith to conclude that school performance gain is the cause of
productivity gain or vice versa. Usually another variable, which may be
the true cause, is "lurking" in the background. This variable is called a
lurking variable, and easily goes undetected by a
Even if the researcher is aware of the existence of the lurking
variables, he or she has no control over what data were collected.
Rather, the researcher must go by the existing variables available in
the data bank. Another limitation that hinders the researcher from
drawing a valid causal inference from archival data is the problem of
indirect measurement. On some occasions the variable chosen by the
researcher is a proxy measure
of what the researcher intends to study. For example, the researcher
may be interested in studying the causal relationship between Christian
spirituality and productivity. If the instrument is designed by the
researcher, he or she might insert questions like "how often do you
pray," "how often do you attend church activities" or other questions
specific to Christian spirituality into the survey. However, when
archival data are downloaded from the Internet, the researcher might
use general demographics (e.g. religious affiliation) to indicate
Christian spirituality. In other words, the researcher will make
inferences based on inferences (proxy measures).
Although the problems of lurking variables and proxy
measures can also arise in other types of research, they are
especially severe when the researcher is unable to customize the
There are many jokes about the careless use of correlational studies. For
example, a study once indicated that consumption of alcohol improves
academic performance (the explanation may be something else: when the
overall economy improves, both alcohol consumption and academic
performance go up). A study in Taiwan during the 1970s indicated that the
more woks a household owned, the fewer children the family had. Thus,
the government gave woks to households in an attempt to lower the national
birth rate. The moral of these stories: researchers should select theoretical
causal variables even though the study is correlational.
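The alcohol example can be made concrete with a small simulation. The
following is only a sketch under stated assumptions: the variable names
(economy, alcohol, grades), the unit effect sizes, and the normal noise are
all hypothetical and are not taken from any real study. Both observed
variables are generated from the lurking variable alone, so any correlation
between them is spurious. Removing the linear effect of the lurking variable
from each of them (a partial correlation) should bring the correlation
close to zero.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y, z):
    """Residuals of y after a simple least-squares regression on z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    slope = (sum((a - mz) * (b - my) for a, b in zip(z, y))
             / sum((a - mz) ** 2 for a in z))
    return [b - (my + slope * (a - mz)) for a, b in zip(z, y)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000

# Hypothetical lurking variable: the state of the overall economy.
economy = [rng.gauss(0, 1) for _ in range(n)]
# Both observed variables depend on the economy only, never on each other.
alcohol = [e + rng.gauss(0, 1) for e in economy]
grades = [e + rng.gauss(0, 1) for e in economy]

raw_r = pearson(alcohol, grades)
partial_r = pearson(residuals(alcohol, economy), residuals(grades, economy))

print(f"raw correlation:     {raw_r:.2f}")
print(f"partial correlation: {partial_r:.2f}")
```

Under these assumed settings the raw correlation should be around 0.5,
while the partial correlation should be near zero. In archival research
this adjustment is possible only when the lurking variable happens to be
available in the data bank.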
Nevertheless, Luker, Luker, Jr., Cobb, and Brown (1998) defended the
use of causal inference in correlation/regression frameworks:
In the social and behavioral sciences, experimental randomization and
control are usually not possible. This has led to an awkward condition
in which our work does not permit useful policy recommendations. The
well-intentioned assertion that relationships do not mean causation,
while useful in contesting gross simple-mindedness, is paralyzing and
misleading in the social sciences. Or, as Dewey puts it, the critical
characteristic of all scientific operations is revealing relationships.
Relationships are a necessary condition of causation. We know that X
cannot be a cause of Y unless X and Y are related. The causal analysis
of nonexperimental data, therefore, can only go on through the analysis
of relationships. Causal inference from non-experimental data, then,
requires the testing of theoretical causal variables in a variety of
quasi-experimental or multiple regression frameworks...Statistical
failures of models suggest that we are not on the right track.
Confirmation of the models suggests the possibility of ameliorative
It is noteworthy that the problems faced by experimentation can also be
found in quasi-experiments. The main point is that the "real world" is
more complicated than an experimental setting, in which the treatment
and the outcome, or the cause and the effect, have a one-to-one mapping.
Murnane and Willett (2011) wrote, "Randomized experiments and
quasi-experiments typically provide estimates of the total effect of a
policy intervention on one or more outcomes, not the effects of the
intervention holding constant the levels of other inputs" (p.31). This
issue, which is concerned with internal validity and external validity,
will be discussed in the write-up entitled Threats
to validity of Research Design.
Explicit questions and selection bias in survey research
Whether causal inferences can be drawn from survey research is
debatable. It is true that survey research does not implement any
variable manipulation. However, when a questionnaire includes explicit
questions concerning rationale and motivation, such as "Why do you
choose Web-based instruction over conventional instruction?", it is
difficult to argue that the answers provided by respondents do not
indicate any cause and effect.
Generalizability always comes hand in hand with causal
Survey research is not weaker than experimental research in this regard. In many
situations, survey research tends to obtain a more random sample than
experimental research does. Usually subjects are required to be
physically present in experimental studies, and thus only convenience
samples are recruited from the local campus or the local town. On the other
hand, survey research can break through this limitation by sending
questionnaires to prospective subjects across the country. In the age
of the Internet, the researcher can even set up an online form to reach
potential respondents all over the world.
However, someone may argue that a "cyber-sample" is a self-selected
rather than a random sample. In this case a systematic bias may affect
who responds to the questionnaire and who doesn't. The prediction of
"Dewey defeats Truman" by the Chicago Daily Tribune in the 1948
presidential election is a classic example of selection bias. The
interviewees were polled by phone and thus the sample was confined to
households that owned a telephone. By the same token, when a survey is
posted on the Web, it is likely that respondents are computer literate
and have access to computer equipment. Indeed, the same problem can be
found in experimental research. Subjects could refuse to participate in
the experiment or withdraw from the study even after they have started the
process. In both survey research and experimental research, the
question is not whether there are missing data. Rather, the question
should be: "Are data completely missing at random?"
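The difference between data that are missing completely at random and data
that are missing through self-selection can be illustrated with a short
simulation. This is a minimal sketch, not an analysis of any real survey:
the "computer literacy" score, its distribution, and the response rule
p_respond are hypothetical assumptions chosen only for illustration. In the
first sample every person responds with the same probability; in the second
the chance of responding rises with the score.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def p_respond(score):
    """Assumed response rule: higher scores respond more often (0 to 1)."""
    return min(1.0, max(0.0, (score - 20) / 60))


rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 20000

# Hypothetical population of computer-literacy scores.
population = [rng.gauss(50, 10) for _ in range(n)]

# Missing completely at random: everyone responds with probability 0.3.
mcar_sample = [s for s in population if rng.random() < 0.3]

# Self-selection: the probability of responding depends on the score itself.
self_selected = [s for s in population if rng.random() < p_respond(s)]

pop_mean = mean(population)
mcar_mean = mean(mcar_sample)
selected_mean = mean(self_selected)

print(f"population mean:    {pop_mean:.1f}")
print(f"MCAR sample mean:   {mcar_mean:.1f}")
print(f"self-selected mean: {selected_mean:.1f}")
```

Under these assumptions the random-nonresponse sample should track the
population mean closely even though most people did not respond, whereas the
self-selected sample should overestimate it. The amount of missing data is
similar in spirit in both cases; what differs is the mechanism.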
Nonetheless, if the subject matter to be studied is Web-based
instruction, this should not be considered a selection bias. In an
online survey concerning Web-based instruction, the researcher should
expect that all respondents possess basic computer operation skills and
have access to the Internet (I once helped a researcher post an
online survey on my database server. But several respondents, who used
2400 baud modems, complained that it took five to ten minutes to load a
Research design and statistical analysis
Traditionally, analysis of variance (ANOVA) is said to be appropriate
for data collected in an experiment whereas regression analysis is
considered a proper method for data collected in non-experimental
designs. Keppel and Zedeck (1989) argued that both ANOVA and regression
are suitable for experimental designs while only regression is suited to
most non-experimental designs. In other words, regression is applicable
to both experimental and non-experimental designs when the independent
variables are continuous and/or categorical. For this reason, Pedhazur
and Schmelkin (1991) asserted that regression is superior to ANOVA.
However, Pedhazur and Schmelkin criticized the practice, found in some
non-experimental designs, of converting continuous variables into categorical
variables in order to fit the data into an ANOVA framework as if the design
were experimental. This conversion not only leads to loss of
information, but also changes the nature of the variables and the
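The loss of information caused by converting a continuous variable into
categories can be illustrated with a small simulation. The sketch below is
an assumption-laden example, not Pedhazur and Schmelkin's own demonstration:
the predictor x, the outcome y, the slope of 0.5, and the median split are
all hypothetical choices. It compares the proportion of variance in y
explained by the continuous predictor with the proportion explained after
the predictor is split into "low" and "high" groups. With only two groups,
the squared correlation between the group indicator and the outcome is the
same quantity a one-way ANOVA would report as the proportion of variance
explained.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2)  # fixed seed so the sketch is reproducible
n = 5000

# Hypothetical continuous predictor and an outcome linearly related to it.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * a + rng.gauss(0, 0.75 ** 0.5) for a in x]

# Median split: recode the continuous predictor as low (0) or high (1).
median_x = sorted(x)[n // 2]
x_split = [1 if a > median_x else 0 for a in x]

r2_continuous = pearson(x, y) ** 2  # regression on the original scores
r2_split = pearson(x_split, y) ** 2  # two-group comparison after the split

print(f"variance explained, continuous predictor: {r2_continuous:.3f}")
print(f"variance explained, median split:         {r2_split:.3f}")
```

Under these assumptions the dichotomized predictor should explain noticeably
less variance than the continuous one, although the underlying relationship
is identical in both analyses.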
Kerlinger (1986) and Shadish, Cook, and Campbell (2002) are
two good books for getting started with experimental design, for neither book
requires a strong mathematical or statistical background. Both books
concentrate on the design aspect rather than the analysis aspect.
Montgomery (2012) is an up-to-date and comprehensive book, though it is
written for engineering majors. Readers should be able to follow the
content after taking one or two introductory statistics courses. You
may skip the chapter on response surface because it may not be
applicable to educational and psychological research. Dr. Montgomery is
a professor of Industrial Engineering at Arizona State University.
For intermediate users
Kennedy and Bush's (1985) book was written for graduate students in
education and psychology who have a modest background in both
mathematics and statistics and who are interested in a subject-matter
field rather than statistical methodology. One nice thing about the
book is that it explains the mathematical notation symbols, which are
confusing to many readers.
For beginner and intermediate users
Levine & Parkinson (1994) is a book for both beginners and research
professionals. The first half of the book covers experimental methods
for psychologists in general whereas the second half covers very
detailed examples of experimental methods in cognitive psychology,
social psychology, and clinical psychology. Levine and Parkinson are
professors of psychology at Arizona State University.
For advanced users
Maxwell & Delaney (1990) and Winer, Brown, and Michels (1991) are
considered classics in the field of experimental design. Their books
cover both the design and the analysis aspects. However, their books
require a very strong statistical background.
Last revised: 2017
Last updated: 2018, January.
- Appleton, D. R. & French, J. M. (1996). Ignoring a
covariate: An example of Simpson's paradox. American Statistician,
- Babbie, E. (1992). The practice of social research
(6th ed.). Belmont, CA: Wadsworth.
- Barkow, J. H. (1989). Darwin, sex, and status:
Biological approaches to mind and culture. Toronto: University of
- Berwick, D. (2008, August). Inference and improvement
in health care. Paper presented at the 2008 Joint Statistical
Meeting, Denver, CO.
- Canadian Task Force on Preventive Health Care. (2003).
Canadian Task Force on Preventive Health Care
levels of evidence used to rate research design and quality of
individual studies. Retrieved August 13, 2008, from http://www.ctfphc.org/
- Christopher, B. (2016, September). Why the father of modern statistics didn’t believe smoking caused cancer. Priceonomics. Retrieved from https://priceonomics.com/why-the-father-of-modern-statistics-didnt-believe/
- Berk, R. (2005). Randomized experiments as the bronze
standard. UC Los Angeles: Department of Statistics, UCLA. Retrieved
- Chaitin, G. J. (1998). The limits of mathematics: A
course on information theory and the limits of formal reasoning.
- Christensen, L. B. (1988). Experimental methodology.
Boston, MA : Allyn and Bacon.
- Cohen, J. (1962). The statistical power of
abnormal-social psychological research: A review. Journal of
Abnormal and Social Psychology, 65, 145-153.
- Courtois, S., Kramer, M., Werth, N., Panne, J. L.,
Paczkowski, A., Bartosek, K., & Margolin, J. L. (1999). The black book of Communism: Crimes,
terror, repression. Cambridge, MA: Harvard University Press.
- Deese, J. (1972). Psychology as science and art.
New York, NY: Harcourt Brace Jovanovich, Inc.
- Essock, S., Drake, R., Frank, R., & McGuire, T.
(2003). Randomized controlled trials in evidence-based mental health
care: Getting the right answer to the right question. Schizophrenia
Bulletin, 29(1), 115-123.
- Fisher, R. A. (1971). The design of experiments (9th
ed.). New York, NY: Hafner Publishing Company.
- Frick, R. W. (1998). Interpreting statistical testing:
Process and propensity, not population and random sampling. Behavior
Research Methods, Instruments, & Computers, 30, 527-535.
- Gladwell, M. (2007). Blink: The power of thinking
without thinking. New York, NY: Back Bay Books.
- Harris, S. (n.d.). The myth of secular moral chaos.
Retrieved from http://www.samharris.org/site/full_text/the-myth-of-secular-moral-chaos
- Henrich, J., Heine, S. J., & Norenzayan, A. (2010).
The weirdest people in the world? Behavioral
and Brain Sciences, 33, 61-135. http://dx.doi.org/
- Hsu, L. M. (1989). Random sampling, randomization, and
equivalence of contrasted groups in psychotherapy outcome research. Journal
of Consulting and Clinical Psychology, 57, 131-137.
- Kennedy, J. J. & Bush, A. J. (1984). An
introduction to the design and analysis of experiments. Lanham, MD:
University Press of America, Inc.
- Keppel, G. & Zedeck, S. (1989). Data analysis for
research design: Analysis of variance and multiple
regression/correlation approaches. New York: W. H. Freeman.
- Kerlinger, F. N. (1986). Foundations of behavioral
research. New York, NY: Holt, Rinehart and Winston.
- Levine, G., & Parkinson, S. (1994). Experimental
methods in psychology. Hillsdale, N.J.: L. Erlbaum.
- Luker, B., Luker, B. Jr., Cobb, S. L., & Brown, R.
(1998). Postmodernism, institutionalism, and statistics: Considerations
for an institutionalist statistical method. Journal of Economic
Issues, 32, 449-457.
- Management Study Guide. (2016). Secondary data. Retrieved
- Maxwell, S. E., & Delaney, H. D. (1990). Designing
experiments and analyzing data: A model comparison perspective.
Belmont, CA: Wadsworth Publishing Company.
- May, R. B., & Hunter, M. A. (1988). Interpreting
students' interpretations of research. Teaching of Psychology, 15,
- Medical News Today. (2008). Attack traditional ways of
assessing the evidence of therapeutic interventions. Retrieved from http://www.medicalnewstoday.com/articles/126043.php
- Montgomery, D. C. (2012). Design and analysis of
experiments (8th ed.). New York, NY: Wiley.
- Morse, J. (2007). Sampling in grounded theory. In A.
Bryant & K. Charmaz (Eds.), Sage handbook of grounded theory
(pp. 229-244). Los Angeles, CA: Sage.
- Moule, P., & Hek, G. (2012). Making sense of research: An introduction
for health and social care practitioners. Thousand Oaks, CA: Sage.
- Murnane, R. J., & Willett, J. B. (2010). Methods
matter: Improving causal inference in educational and social science
research. New York, NY: Oxford University Press.
- Pedhazur, E. J. & Schmelkin, L. P. (1991). Measurement,
design, and analysis: An integrated approach. Hillsdale, NJ:
Lawrence Erlbaum Associates.
- Pitman, E. J. G. (1937a). Significance tests which may be
applied to samples from any populations. Journal of the Royal
Statistical Society B, 4, 119-130.
- Pitman, E. J. G. (1937b). Significance tests which may be
applied to samples from any populations II: The correlation
coefficient. Journal of the Royal Statistical Society B, 4,
- Pitman, E. J. G. (1938). Significance tests which may be
applied to samples from any populations III: The analysis of variance
test. Journal of the Royal Statistical Society B, 29,
- Plichta, S., & Garzon, L. (2009). Statistics for nursing and allied health.
New York, NY: Lippincott Williams & Wilkins.
- Ram, R., & Ural, S. (2014). Comparison of GDP per
capita data in Penn World Table and World Development Indicators. Social Indicators Research, 116,
- Reichardt, C. S., & Gollob, H. F. (1999). Justifying
the use and increasing the power of a t
test for a randomized experiment with a convenience sample. Psychological
Methods, 4, 117-128.
- Rosenbaum, P. (2005). Heterogeneity and causality: Unit
heterogeneity and design sensitivity in observational studies. American
Statistician, 59, 147-152.
- Sander, R., & Taylor, S. (2012). Mismatch: How
affirmative action hurts students it's intended to help, and why
universities won't admit it. New York, NY: Basic Books.
- SAS Institute. (1999). Comments on interpreting
regression statistics. Retrieved from http://www.sas.com
- Shadish, W. R., Cook, T. D., & Campbell, D. T.
(2002). Experimental and quasi-experimental designs for generalized
causal inference. New York, NY: Wadsworth Publisher.
- Shipley, B. (2000). Cause and correlation in biology:
A user's guide to path analysis, structural equations and causal
inferences. Cambridge: Cambridge University Press.
- Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W.
H., & Shavelson, R. J. (2008). Estimating causal effects using
experimental and observational designs: A think tank white paper.
Washington D. C.: American Educational Research Association.
- Simpson, E. H. (1951). The interpretation of interaction
in contingency tables. Journal of the Royal Statistical Society,
Ser. B., 13, 238-241.
- Slavin, R. (2008). Perspectives on evidence-based
research in education. Educational Researcher, 37(1), 5-14.
- Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical
principles in experimental design. New York: McGraw-Hill, Inc.
- Winking, J., & Mizer, N. (2013). Natural-field
dictator game shows no altruistic giving. Evolution and Human Behavior,
34, 288-293. http://dx.doi.org/10.1016/j.evolhumbehav.2013.04.002
- Yu, C. H. (2009). Causal inferences and abductive
reasoning: Between automated data mining and latent constructs.
Saarbrücken, Germany: VDM-Verlag.
- Yu, C. H., DiGangi, S., Jannasch-Pennell, A., &
C. (2010). A data mining approach for identifying predictors of student
retention from sophomore to junior year. Journal of Data Science, 8,
307-325. Retrieved from