and performance appraisal
Psychological and educational research often is focused on
treatment outcomes. Appropriate interpretation of treatment outcomes
is based primarily on:
- Studying the efficacy of treatment
- Establishing the validity and reliability of
- Obtaining subjects who represent a normal cross-section of the
desired population to ensure accurate performance appraisal.
However, in many quasi-experimental and non-experimental settings
where experimental control is lacking, often each aspect mentioned
above is studied simultaneously. This convolutes the focus of
research as well as comprises the interpretation of results. For
example, in research on Web-based instruction, test results are often
used as an indicator of both the quality of instruction and the
knowledge level of students. In addition, pilot studies are
implemented to examine the usefulness of the treatment program and to
validate the instrument based upon the same pilot results.
To address this misapplication of methods, this article discusses
the differentiation of the objectives of treatment effectiveness
evaluation, instrument validation, and performance appraisal. In
addition, it points out the pitfalls of circular dependency, proposes
the rectification of the problems inherent in circular dependency,
and presents recommendations for the application of appropriate
Further, a survey result was reported to substantiate these
widespread misconceptions. The data were collected via the Internet
by announcing the online survey to six different ListServ groups and
Newsgroups. The survey consists of five multiple-choice questions,
which are related to the functions of treatment effectiveness
evaluation, instrument validation, and performance appraisal (see
Appendix). Thirty-four graduate students, who had taken on average
4.9 undergraduate and graduate statistics courses, responded to the
survey. The respondents span across twenty-five institutions in six
continents/regions (North America, South America, Europe, Asia,
Africa, New Zealand), twenty-two different undergraduate majors (e.g.
chemistry, English, electrical engineering, history, psychology) and
eighteen different graduate majors (e.g. biology, communication,
computer science, education, finance, industrial engineering,
statistics). Within the United States, the respondents came from
eight different states. On the average, the participants answered 1.6
Consequences of the
Researchers may be confused by the function of specific aspects in
studies of treatment outcomes, instrument reliability and validity,
as well as subjects¹ ability. Table 1 summarizes the differences
of the three functions. Confusing these differences among a study's
treatment, instruments and subjects may lead to grave
Source of variation
Sample size determination
To evaluate the effectiveness of a treatment such as an
Treatment effect and experimental errors
effect size, alpha
To evaluate the reliability and validity of an instrument
such as a test or a survey
To evaluate the performance or/and the attitude of
Test-retest reliability and pre-post
As an example, a common occurrence in the field follows: A
researcher administers an instrument to a group of subjects as a
pretest- treatment - posttest. The correlation between the two
measurements is then examined in an attempt to estimate the
test-retest reliability instead of treatment effectiveness. Indeed, a
dependent t-test should be performed to compute the pre-post
difference, because the source of variation is due to factors other
than the instrument. However, in instrument validation, the source of
variation is the instrument itself (e.g. poorly written items) while
in studying treatment effectiveness, the source of variation is the
treatment effect and the experimental errors. The consequence here is
manifested in the interpretation of the results from such a study:
that the instrument is either valid or invalid, and the treatment
effect either supports or does not support the hypothesis. No matter
the interpretation, the method used is inappropriate, and therefore
any discussion would be invalid.
The survey results indicate that this confusion exists though it
may not be widespread. Question 1 was used to test the subjects¹
ability to distinguish test-retest reliability from pre-post
differences. Thirty-two percent of the respondents failed to
differentiate the two concepts.
Sample size determination
Further confusion of studying treatment effectiveness and
instrument validation is manifested in sample size requirement. In
our teaching, consulting, and research experience, many students and
researchers, after calculating a desirable sample size using power
analysis, use the suggested sample size or less to run pilot studies
for instrument validation. Apparent in this act is the failure to
distinguish the sample size requirement in studying treatment
effectiveness from that in instrument validation. For example, power
is the probability of correctly rejecting the null hypothesis. Thus,
power analysis is inherent in studying treatment effectiveness where
dependent and independent variables are present. Moreover, hypothesis
testing alone does not reveal how likely the result can be replicated
in other studies (Thompson, 1996). In other words, stability across
samples, which is a focus of instrument validation, is not an issue
in sample size determination for studying treatment
In contrast, instrument validation requires no distinction between
dependent and independent variables, and there is no null hypothesis
to be rejected. The sample size criterion for instrument validation
is stability rather than power. To be specific, the larger the sample
size, the more likely that the instrument can be applied to different
samples. To achieve this stability, a very large sample size is
necessary. For example, in factor analysis, which is a procedure of
loading items into latent constructs, 300 subjects are considered a
small sample (Wolins, 1995). It is not unusual for a test developer
to use as many as 5000 subjects for instrument validation (Potter,
1999). But a treatment effectiveness study with 300-5000 subjects may
be considered being overpowered in power analysis. It is apparent,
then, that using the results of a single power analysis to determine
sample size to study treatment efficacy and instrument validation
becomes problematic, and may lead to erroneous, yet over-confident
Question 2 and 3 were used to test the subjects¹ knowledge on
sample size determination. In Question 2, Seventy-six percent of
respondents were confused by the criteria of determining sample size
for treatment effectiveness study and instrument validation.
Eight-two percent of responses to question 3 were incorrect. It is
evident that the majority did not understand the purpose of power
Performance/attitude appraisal is less problematic. In appraisals,
the conclusion is made to individuals. Even if there is only one
subject, the judgment is still valid to that particular person, and
therefore sample size determination is not needed.
Last but not least, this confusion may lead to the logical fallacy
of circular dependency. There are two aspects of circular dependency
in classical test theory. The first is the circular dependency
between the theoretical constructs and the data; the second is the
circular dependency between the instrument and the subjects.
First, when Cronbach and Meehl (1955) proposed the concept of
construct validity, they maintained that hypothetical constructs
drive the nature of data collection. For instance, when we have a
theory about "social presence on the Internet," the questions and the
possible choices for answers in the instrumentation will be framed
within the theoretical model. In turn, the data resulting from the
administration of the instrument are then used to revise the theory
itself. Further, in studying treatment effectiveness, the treatment
is developed based on the theory. In turn, the data are used to
confirm or disconfirm the treatment effectiveness, and eventually,
the theory behind the treatment. This circular dependency may be
viewed as a positive feedback loop if the theory and the treatment
are continuously refined by appropriately analyzing data. However,
what if the research starts with terribly incorrect theoretical
constructs? In this case, the data may give the right answer to the
Embedded within the first circular dependency is as second: Fan
(1998) pointed out that in classical test theory, the quality of an
instrument is evaluated based upon the tester responses. At the same
time, the statistics of examinees are dependent on the quality of the
test items. In other words, in studying instrument validity and
reliability, test responses used to validate an instrument are also
dependent upon a certain degree of the appropriate construction of
that instrument. But what if the initial instrument is terribly
constructed, which may be resulted from a terribly misconceived
construct? The two types of circular dependency are illustrated in
Figure 1. Two types of circular dependency
These two types of circular dependency are amplified when the
three types of evaluation are not clearly delineated and are
mistakenly run in parallel. The relationship between the subjects and
the instrument, as well as that between the instrument and the
treatment can be circular: The instrument is validated by the user
input, treatment efficacy determined by the instrument results, and
the final results used to determine the quality of students. By the
same token, the usefulness of the treatment is judged by the
learners' test performance, and later students' abilities are
measured by how much they have learned from the treatment, as
indicated by the test scores. To rectify this situation, our proposed
guideline is simple: Researchers must know two before knowing the
Rectification of the inherent
problems in circular dependency
Because of the inter-dependency among the treatment, the
instrument, and the subjects, certain assumptions must be made or
certain knowledge must be obtained in order to justify an assertion
to either one of them:
- Given that the treatment is effective and the instrument is
valid, the evaluator is able to make a judgment about students'
- Given that the treatment is effective and the subjects have
the requisite characteristics, the evaluator can find out whether
the instrument is valid or not.
- Given that the learners are ready to learn and the instrument
is valid, the evaluator can determine whether the treatment is
effective or not.
In other words, one must know or assume the quality of at least
two elements in order to unveil the third unknown. Therefore,
problems arise when there are two or more unknown elements. When all
of them are unknown, no conclusion can be made to the treatment, the
instrument, or the subject. In the survey, question 5 addresses this
problem. Only nine percent of respondents gave the correct
Many researchers commit a common logical fallacy by asserting that
"If the treatment works, the test scores are high. Thus, if the test
scores are high, then the treatment works." The notation of this
logical fallacy is: "If P then Q, therefore if Q then P." The first
statement (If P then Q), which is called "Modus ponens" (Hurley,
1988), is a valid inference. But the second one (if Q then P) is
considered the fallacy of affirming the consequence (Kelley, 1998).
Put it in a concrete example: The statements that "if it rains, the
ground is wet. If the ground is wet, it rains" are considered
invalid. To infer from Q to P (a wet ground to raining), one must
rule out other plausible rival explanations such as the city
custodians are cleaning the street, an underground water pipe
explodes, and so on. In other words, the fallacy of affirming the
consequence occurs when "ruling out" procedures are absent. By the
same token, in research it is dangerous to infer from the result (Q)
such as test scores to the cause (P) without ruling out competing
explanations. In the survey, question 4 tests the subjects¹
knowledge of logical fallacy. Fifty-six percent of participants
failed to give the right answer.
The following scenarios demonstrate how competing explanations
If the evaluator knows nothing or very little about the treatment
and the instrument, but he is positive that the subjects are students
who are ready to learn the content, what could be inferred from a low
mean score? Figure 2 pictorially illustrates this scenario:
Figure 2. The treatment and instrument qualities are
In this scenario, there are three possible explanations: either
the treatment is very poor and thus the students failed to acquire
the knowledge, the test is poorly written and thus are incapable of
reflecting their knowledge, or both. Since there are several
competing explanations, any inference made from the result to the
cause would be invalid.
Bias in Test of Economic Literacy (TEL) is a good example to
elucidate the above scenario. TEL is a validated instrument and
widely used in many research studies for assessing the knowledge
level of economics. Using Differential Item Functioning (DIF),
Walstad, W. B. and Denise (1997) found that some items in TEL are
biased against female students even if the men and women who took the
exam have the same skill levels in economics. Without this knowledge
about the instrument, the gender difference may be mistakenly
attributed to the treatment rather than to the instrument. This
finding regarding TEL is built upon the fact that the skill level of
one group is comparable to that of the other. Different groups tend
to respond differently to the same item because of cognitive
abilities and other non-biased factors. Thus, understanding the
instrument and the subjects plays a crucial role in unmasking the
nature of this gender effect.
If the evaluator knows nothing or very little about the treatment
and the students, but he adopts a validated instrument, what does a
low average score of the students mean? Figure 3 illustrates this
Figure 3. The treatment and subjects' qualities are
There are four possibilities: the treatment is ineffective, the
treatment is not well-implemented, the students are not ready, or
both of the above. When a treatment is tested, it is usually assumed
that the treatment is properly implmented and thus the conclusion can
be directly drawn to the quality of the treatment. Sechrest et al.
(1979) warned that the strength and integrity of the treatment may
bias the result. For example, when a new drug was introduced, the
dosage might be mild and therefore the effect was undetected. Also,
when a therapy had just been developed recently, its practitioners
might be inexperienced as opposed to the practitioners of the rival
treatment. Therefore, Sechrest et al asserted that conclusions about
whether a treatment is effective cannot be reached without the full
knowledge of how strong and well-implemented the treatment was.
The low scores could also be explained by the problem in
measurement. It is not uncommon that a low Cronbach Alpha
coefficient, which is an indicator of internal consistency, occurs
when a nationally accepted test is administered to a local group. If
the test is too difficult to a sub-standard local group, random
guessing to difficult items would lead to a low Cronbach Alpha
If the researcher has no knowledge about the learners, nor the
test reliability or validity, but he employs a well-developed
technology-based educational product, how could he interpret a low
average test score? Figure 4 illustrates this scenario.
Figure 4. The instrument and subjects' qualities are
In this scenario, the outcome might be caused by the problems of
the students (the lack of basic skills or the fear of using
technology), the problems of the instrument (low reliability or
insufficient validity), or both of them. If one uses a well-developed
treatment and a validated instrument, there will be less
uncertainties in the evaluation. In this case, a quantitative study
could be applied immediately. However, educational researchers
usually develop a new treatment and a customized instrument. Also, it
is not unusual that a self-made test is used to evaluate both the
treatment and the students concurrently. Inevitably, this approach
risks circular dependency or false assumptions. Conclusions such as
"the treatment works because the participants have performance gain.
The participants improved their skill because of the treatment"
suffer serious logical fallacies.
The following procedures are recommended for reducing the risk of
making an invalid evaluation:
Over-estimation or under-estimation of the learners' ability and
readiness may lead to a conclusion of an ineffective treatment or an
invalid instrument. In the context of psychotherapy, Dickson (1975)
asserted, "the behavior in question must be related to client
baselines, situational expectations, and peer performances. Only
after this information is gathered should intervention take place."
(p.379) This principle can be applied to educational psychology. It
is advisable for the evaluator to conduct an initial survey to the
target audience. The assessment should be qualitative in nature. The
purpose is to understand the need and readiness of the subjects. At
this point numeric data may not be helpful to gain the insight
pertaining to the subjects.During this process, the goal is not only
to find out how much the students have already known, but also how
much they do not know.
It is a common practice that a pretest is administered prior to
the treatment. However, usually the purpose of the pretest is not to
understand the subjects. Rather it is directly applied to studying
treatment effectiveness. i.e. The pretest scores is used as a
covariate or the pretest-posttest difference is computed in a
dependent t-test. To avoid threats against the validity of experiment
such as the ceiling effect and regression to the mean, high and low
pretest scores are simply thrown out. Very few researchers ask these
questions: "Why are some people under-achievers and some
over-achievers? Is it possible that those so-called 'over-achievers'
are 'normal' but all the rest are substandard students who are not
ready for the treatment?"
It is a common practice that evaluators use another validated
instrument as a reference for developing the customized instrument,
and also run several pilot studies to revise the instrument. However,
the researcher should be aware that how the theoretical constructs
are formulated dictates how the instrument is written. Observing and
interviewing subjects in an unstructured fashion may help the
researcher to discover new constructs and lead the study in a new
direction. As Messick (1980, 1988) pointed out, construct validation
is a never-ending process. New findings or new crediable assumptions
may lead to a change in construct interpretation, theory, or
The developer should use another well-established treatment as a
model for developing the new treatment, and beta test the treatment
using the think-aloud protocol (Ericsson, 1993; Ericsson & Smith,
1991). The think aloud protocol, also known as the mental protocol
and the concurrent protocol, is a data collection technique in which
the subject verbally expresses whatever he is thinking when he
performs a task. This technique is particularly useful when the
treatment involves complex mental processing.
As mentioned before, a misconceived theoretical construct could
lead to a treatment addressing the wrong research question and an
instrument evaluating the misdirected target. A think aloud protocol
can help the researcher to gain insight of a complex mental process
and thereby develop appropriate constructs. For instance, in a study
regarding teaching children using object-oriented languages to
program LEGO-made robots (Instructional Support Group, 1999), the
objective of the treatment is to enhance problem solving skills. In
this context, the specific problem solving skill is defined as the
ability of modularizing a task and restructuring the relationships
among objects (components) to form a coherent and functioning model
(Kohler, 1925; 1969). The designs of the treatment and the instrument
are difficult because problem solving in terms of restructuring
objects is an abstract theoretical construct, which involves a
complex mental model. In this situation, the think aloud protocol can
be useful to understand the mental construct.
It is important to note that in research the ultimate goal is
concerned with broader concepts rather than a particular software
package or a product created by the software (Yu, 1998). For example,
the evaluator is interested in not only the instructional value of a
Website, but also the concepts of "hypermedia" and "hypertext." Thus,
the refinement of the treatment must align with the intended media
features. More importantly, the media features must be mapped to the
mental constructs to be studied. If the treatment is packed with many
other features that are unrelated to the research objectives, the
subsequent evaluation may be biased.
The differences among treatment effectiveness study, instrument
validation, and performance appraisal may lead to confusion between
the use of test-retest reliability and dependent t-tests for pre-post
difference. Also, these elements may give a false sense of security
in sample size determination when the goals of power and stability
are not differentiated. More importantly, a serious logical fallacy
of circular dependency may result from the failure to rule out rival
explanations. This paper is an attempt to clarify these
misconceptions and to provide practical solutions.
- Cronbach, L. J., & Meehl, P. E. (1955). Construct validity
in psychological tests. Psychological Bulletin, 52,
- Dickson, C. R. (1975). Role of assessment in behavior therapy.
In P. McReynolds (Ed.), Advances in psychological assessment
(Vol. 3). San Francisco, CA: Jossey-Bass.
- Ericsson, K. A. (1993). Protocol analysis: Verbal reports
as data. Cambridge, MA: MIT Press.
- Ericsson, K. A., & Smith, J. (1991). Toward a theory of
expertise. New York: Cambridge University Press.
- Fan, X. (1998). Item response theory and classical test
theory: An empirical comparison of their item/person statistics.
Educational & Psychological Measurement, 58, 357-384.
- Hurley, P. J. (1988). A concise introduction to logic.
Belmont, CA: Wadsworth Publishing Company.
- Instruction Support Group. (1999) The Conexiones project.
[On-line] Available: http://conexiones.asu.edu.
- Kelley, D. (1998). The art of reasoning (3rd ed.). New
York: W. W. Norton & Company.
- Kohler, W. (1925). The mentality of apes. New York:
- Kohler, W. (1969). The task of Gestalt psychology.
Princeton, N. J.: Princeton University Press.
- Messick, S. (1980). Test validity and the ethics of
assessment. American Psychologist, 35, 1012-1027.
- Messick, S. (1988). The once and future issues of validity:
Assessing the meaning and consequence of measurement. In H. Wainer
& H. I. Vraun (Eds.) Test validity (pp.33-45).
Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
- Potter, C. C. (1999). Measuring children's mental health
functioning: Confirmatory factor analysis of a multidimensional
measure. Social Work Research, 23, 42-55.
- Sechrest, L.; West, S. G.; Philips, M. A.; Redner, R.; &
Yeaton, W. (1979). Some neglected problems in evaluation research:
Strength and integrity of treatments. In L. Sechrest, S. G. West,
M. A. Phillips, R. Redner, & W. Yeaton. (Eds). Evaluation
Studies Review Annual (pp. 15-35). Beverly Hills, CA: Sage
- Thompson, B. (1996). AERA editorial policies regarding
statistical significance testing: Three suggested reforms,
Educational Researcher, 25 (2), 26-30.
- Walstad, W. B., & Robson, D. (1997). Differential ite,
functioning and male-female differences on multiple-choice tests
in economics. Journal of Economic Education, 28, 155-171.
- Wolins, L. (1995). A Monte Carlo study of constrained factor
analysis using maximum likelihood and unweighted least squares,
Educational & Psychological Measurement, 55, 545-559.
- Yu, C. H. (1998). Media evaluation. Retrieved from http://www.creative-wisdom.com/teaching/assessment/media.html.
Q1: A researcher administers an instrument to a group prior to
the experiment, then he administers the same test again after the
experiment. What kind of statistics should he compute for the test
a. test-retest reliability
b. alternate form
c. dependent t-test for pre-post difference
Q2: What criteria should a researcher use to determine the
desirable sample size for instrument validation?
a. How likely the null hypothesis can be rejected given that the
null is false.
b. To what degree the response patterns can be stable across
d. None of the above
Q3: Power analysis for determining sample size can be applied
in which of the following situation(s)?
a. Statistical analysis for treatment effectiveness
b. Instrument validation
d. None of the above
Q4: Given the statement: "If P then Q," which of the following
is a valid logical deduction?
a. If not Q, not P
b. If Q, then P
c. If not P, then not Q
d. All of the above are valid
Q5: If test scores are high after subjects have been exposed to
a treatment, which is more likely to be true?
a. The treatment enhances learning effectively
b. The test is too easy
c. The subjects are too smart
d. (a) and (b)
e. All of the above
this icon to contact Dr. Yu via various channels