Reliability of self-report data
Revised: February 28, 2023
Do the subjects tell the truth?
The reliability of self-report data is an Achilles' heel of survey research. For example, opinion polls indicated that more than 40 percent of Americans attend church every week; however, by examining church attendance records, Hadaway and Marler (2005) concluded that actual attendance was less than 22 percent. In his seminal work "Everybody Lies," Seth Stephens-Davidowitz (2017) found ample evidence that most people do not do what they say and do not say what they do. For example, in response to polls most voters declared that the ethnicity of a candidate was unimportant, but by checking Google search terms Stephens-Davidowitz found otherwise: searches containing the word "Obama" were often accompanied by words related to race.
For research on Web-based instruction, web usage data may be obtained
by parsing the user access log, setting cookies, or uploading the
cache. However, these options may have limited applicability. For
example, the user access log cannot track users who follow links to
other websites. Further, cookie or cache approaches may raise privacy
issues. In these situations, self-reported data collected by surveys
are used. This gives rise to the question: How accurate are
self-reported data? Cook and Campbell (1979) have pointed out that
subjects (a) tend to report what they believe the researcher expects to
see, or (b) report what reflects positively on their own abilities,
knowledge, beliefs, or opinions. Another concern about such data
centers on whether subjects are able to accurately recall past
behaviors. Psychologists have warned that the human memory is fallible
(Loftus, 2016; Schacter, 1999). Sometimes people "remember" events that
never
happened. Thus the reliability of self-reported data is tenuous.
Although statistical software packages are capable of calculating numbers to 16 or 32 decimal places, this precision is meaningless if the data are not accurate even at the integer level. Quite a few scholars have warned researchers that measurement error can cripple statistical analysis (Blalock, 1974) and have suggested that good research practice requires examining the quality of the data collected (Fetters, Stowe, & Owings, 1984).
Fraudulent responses from online participants
The availability of online panels for collecting self-report data is both a blessing and a challenge for researchers. On the one hand, online recruitment provides a fast and affordable way to reach a wide spectrum of participants. On the other hand, without careful screening, participants recruited through MTurk, CloudResearch, and Qualtrics can cause a drastic drop in data quality because they can hide their true identities behind the Internet. For example, driven by financial incentives, some MTurk workers created automated bots to take as many surveys as possible (Dreyfuss, 2018), while others lied about their age, residency, and other demographic attributes in order to meet the inclusion criteria. On some occasions, people identified themselves as lesbian, gay, bisexual, or suicidal just because it was "funny" (Cimpian, 2022). Further, the CASE Fraud Detection Study (Ribeiro, 2022) found that 24% of respondents entered over 25 different surveys within 24 hours. No doubt the quality of these inattentive or even fraudulent responses is problematic.
Bias and Variance
Measurement errors include two components, namely, bias
and variable error.
Bias is a systematic error that pushes the reported scores toward one extreme. For example, several versions of IQ tests have been found to be biased against non-Whites, meaning that Black and Hispanic test-takers tend to receive lower scores regardless of their actual intelligence. A variable error, also known as variance, tends to be random. In other words, the reported scores could fall either above or below the actual scores (Salvucci, Walter, Conley, Fink, & Saba, 1997).
The findings of these two types of measurement errors
have different
implications. For example, in a study comparing self-reported data of
height and weight with directly measured data (Hart & Tomazic, 1999),
it was found that subjects tend to over-report their height but
under-report their weight. Obviously, this kind of error pattern is
bias rather than variance. A possible explanation of this bias is that
most people want to present a better physical image to others. However,
if the measurement error is random, the explanation may be more
complicated.
One may argue that variable errors, which are random in nature, cancel each other out and thus pose no threat to the study. For example, the first user may over-estimate his Internet activities by 10%, while the second user may under-estimate hers by 10%. In this case, the mean might still be correct. However, over-estimation and under-estimation increase the variability of the distribution. In many parametric tests, the within-group variability is used as the error term, and inflated variability definitely affects the significance of the test. Some texts may reinforce the above misconception. For example, Deese (1972) said,
Statistical
theory tells us that the
reliability of observations is proportional to the square root of their
number. The more observations there are, the more random influences
there will be. And statistical theory holds that the more random errors
there are, the more they are likely to cancel one another and produce a
normal distribution (p.55).
First, although it is true that the variance of the distribution decreases as the sample size increases, this does not guarantee that the shape of the distribution approaches normality. Second, reliability (the quality of the data) is tied to measurement, not to sample size determination. A large sample riddled with measurement error, even random error, still inflates the error term of parametric tests.
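The last point can be illustrated with a small simulation. The following is a minimal SAS sketch using hypothetical, randomly generated scores (all variable names and numbers are made up for illustration): the same group difference is tested twice, once on the underlying scores and once after extra random error has been added.

data demo;
  call streaminit(12345);
  do group = 1 to 2;
    do i = 1 to 30;
      score = 50 + 5*(group = 2) + rand('normal', 0, 5);  /* underlying scores with a group difference */
      noisy = score + rand('normal', 0, 10);              /* the same scores plus random measurement error */
      output;
    end;
  end;
run;

proc ttest data=demo;   /* compare the two analyses: score vs. noisy */
  class group;
  var score noisy;
run;

With the added error, the within-group variance is larger, so the t statistic typically shrinks and the p value grows even though the average group difference is unchanged.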
A stem-and-leaf plot or a histogram can be used to
visually examine whether a measurement error is due to systematic bias
or random variance. In the following example, two types of Internet
access (Web browsing and email) are measured by both self-reported
survey and logbook. The difference scores (measurement 1 - measurement
2) are plotted in the following histograms.
The first graph reveals that most difference scores are centered around zero. Under-reporting and over-reporting appear near both ends, suggesting that the measurement error is random error rather than systematic bias.
The second graph clearly indicates a high degree of measurement error because very few difference scores are centered around zero. Moreover, the distribution is negatively skewed, and thus the error is bias rather than variance.
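For readers who wish to produce such diagnostic plots themselves, here is a minimal SAS sketch with hypothetical difference scores; the PLOT option of PROC UNIVARIATE prints a stem-and-leaf plot, and PROC SGPLOT draws a histogram.

data diffs;
  input diff @@;          /* difference scores: self-report minus logbook (hypothetical) */
  cards;
-3 -1 0 0 1 2 0 -2 1 0 3 -1 0 2 -1 1 0 -2 2 0
;
run;

proc univariate data=diffs plot;   /* stem-and-leaf and box plot */
  var diff;
run;

proc sgplot data=diffs;            /* histogram of the difference scores */
  histogram diff;
run;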
How reliable is our memory?
Schacter (1999) warned that human memory is fallible and identified seven flaws of memory:
- Transience: Decreasing
accessibility of information over time.
- Absent-mindedness: Inattentive
or shallow processing that contributes to weak memories.
- Blocking: The temporary
inaccessibility of information that is stored in memory.
- Misattribution: Attributing a recollection or idea to the wrong source.
- Suggestibility: Memories that
are implanted as a result of leading questions or expectations.
- Bias: Retrospective distortions
and unconscious influences that are related to current knowledge and
beliefs.
- Persistence: Pathological remembrances, that is, information or events that we cannot forget even though we wish we could.
"I have no
recollection of these. I don't recall that I signed the document for
Whitewater. I don't remember why the document disappeared but
reappeared later. I don't remember anything."
"I remember landing (in Bosnia) under sniper fire. There was supposed
to be some kind of a greeting ceremony at the airport, but instead we
just ran with our heads down to get into the vehicles to get to our
base."
During the investigation into sending classified information via a personal email server, Clinton told the FBI that she could not "recall" or "remember" anything 39 times.
Caution: A new computer virus named "Clinton" has been discovered. If a computer is infected, it frequently pops up the message "out of memory," even if the machine has adequate RAM.
Q: "If Vernon Jordon has told us that you have an extraordinary memory,
one of the greatest memories he has ever seen in a politician, would
this be something you would care to dispute?"
A:
"I do have a good memory...But I
don't remember whether I was alone with Monica Lewinsky or not. How
could I keep track of so many women in my life?"
Q:
Why did Clinton recommend Lewinsky for a job at Revlon?
A:
He knew she would be good at making things up.
It is important to note that sometimes the reliability of our memory is tied to the desirability of the outcome. For example, when a medical researcher collects relevant data from mothers whose babies are healthy and mothers whose babies are malformed, the data from the latter group are usually more accurate than those from the former. Mothers of malformed babies have typically been reviewing every illness that occurred during the pregnancy, every drug taken, and every detail directly or remotely related to the tragedy in an attempt to find an explanation. In contrast, mothers of healthy infants do not pay much attention to such information (Aschengrau & Seage III, 2008). Inflating GPA is another example of how desirability affects memory accuracy and data integrity. In some situations there is a gender difference in GPA inflation: a study by Caskie et al. (2014) found that, among undergraduate students with lower GPAs, females were more likely than males to report a GPA higher than their actual GPA.
To counteract the problem of memory errors, some researchers suggest collecting data on the momentary thoughts or feelings of the participant rather than asking him or her to recall remote events (Csikszentmihalyi & Larson, 1987; Finnigan & Vazire, 2018). The following survey items from the 2018 Programme for International Student Assessment are examples: "Were you treated with respect all day yesterday?" "Did you smile or laugh a lot yesterday?" "Did you learn or do something interesting yesterday?" (Organisation for Economic Cooperation and Development, 2017). However, the response depends on what happened to the participant around that particular moment, which might not be typical. The fact that a respondent did not smile or laugh a lot yesterday does not necessarily imply that the respondent is always unhappy.
What shall we do?
On some occasions, the issue of questionable data quality, particularly due to social desirability bias, can originate from the researcher's actions. For instance, when an institutional review board (IRB) mandates the disclosure of the study's aim in the informed consent form, and the researcher explicitly mentions goals like "investigating bias and discrimination" or "assessing the impact of excessive social media use on academic performance," it's almost inevitable that participants might tailor their responses to align with socially desirable outcomes. Researchers sometimes employ ambiguous language to obscure the true purpose of the study. However, the survey's title, such as the Perceived Racism Scale (PRS) or the Psychopathic Personality Traits Scale (PPTS), displayed prominently on the survey, can inadvertently reveal the study's focus. A straightforward solution to mitigate this issue is to omit any language from the informed consent that could hint at the research objectives.
Some researchers reject the use of self-reported data because of its alleged poor quality. For example, when a group of researchers investigated whether high religiosity led to less adherence to shelter-in-place directives in the US during the COVID-19 pandemic, they used the number of congregations per 10,000 residents as a proxy measure of the religiosity of a region, instead of self-reported religiosity, which tends to reflect social desirability (DeFranza, Lindow, Harrison, Mishra, & Mishra, 2021).
However, Chan (2009) argued that the so-called poor quality of self-reported data is nothing more than an urban legend. First, although social desirability may lead respondents to give researchers inaccurate data on some occasions, it does not happen all the time; for example, it is unlikely that respondents would lie about demographics such as gender and ethnicity. Second, while it is true that respondents tend to fake their answers in experimental studies, this issue is less serious for measures used in field studies and naturalistic settings. Further, there are numerous well-established self-reported measures of psychological constructs that have obtained construct validity evidence through both convergent and discriminant validation, including the Big Five personality traits, proactive personality, affectivity disposition, self-efficacy, goal orientations, perceived organizational support, and many others.
In the field of epidemiology, Khoury, James, and Erickson (1994) asserted that the effect of recall bias is over-rated, but their conclusion may not generalize well to other fields such as education and psychology. In spite of the threat of data inaccuracy, it is impossible for the researcher to follow every subject with a camcorder and record everything they do. Nonetheless, the researcher can use a subset of subjects to obtain observed data, such as the user access log or a daily hardcopy log of Web access. The results can then be compared with the self-reported data from all subjects to estimate the measurement error. For example,
For example,
- When the user access log is available to the
researcher, he can ask
the subjects to report the frequency of their access to the web server.
The subjects should not be informed that their Internet activities have
been logged by the webmaster as this may affect participant behavior.
- The researcher can ask a subset of users to keep a
log
book of their internet activities for a month. Afterwards, the same
users are asked to fill out a survey regarding their web usage.
Someone may argue that the log book approach is too demanding. Indeed,
in many scientific research studies, subjects are asked for much more
than that. For instance, when scientists studied how deep sleep during
long range space travel would affect human health, participants were
asked to lie in bed for a month. In a study concerning how a closed environment affects human psychology during space travel, subjects were individually locked in a room for a month. Seeking out scientific truths comes at a high cost.
After different sources of data are collected, the
discrepancy between
the log and the self-reported data can be analyzed to estimate the data
reliability. At first glance, this approach looks like test-retest reliability, but it isn't. First, in test-retest reliability the instrument used on two or more occasions should be the same. Second, when test-retest reliability is low, the source of error lies within the instrument. When the source of error is external to the instrument, such as human error, inter-rater reliability is more appropriate.
The procedure suggested above can be conceptualized as a measurement of inter-data reliability, which resembles inter-rater reliability and repeated measures. There are several ways to estimate inter-rater reliability, namely, the Kappa coefficient, the Index of Inconsistency, the intraclass correlation coefficient, repeated measures ANOVA, and regression analysis. The following sections describe how these inter-rater reliability measurements can be used as inter-data reliability measurements.
Kappa coefficient
In psychological and educational research, it is not unusual to employ two or more raters in the measurement process when the assessment involves subjective judgment (e.g., grading essays). Inter-rater reliability, measured by the Kappa coefficient, is used to indicate the reliability of the data. For example, the performance of each participant is graded by two or more raters as "master" or "non-master" (1 or 0). This measurement is usually computed in categorical data analysis procedures such as PROC FREQ in SAS, "measurement of agreement" in SPSS, or an online Kappa calculator (Lowry, 2016); the image below is a screenshot of the VassarStats online calculator.
It is important to note that even if 60 percent of the two data sets concur with each other, it does not mean that the measurements are reliable. Because the outcome is dichotomous, two random classifications with even marginals would agree about 50 percent of the time by chance alone. The Kappa coefficient takes this chance agreement into account and demands a higher degree of matching before the data are judged consistent.
In the context of Web-based instruction, each category of self-reported Website usage can be re-coded as a binary variable. For example, when question one is "How often do you use telnet?" the possible categorical responses are "a: daily," "b: three to five times per week," "c: three to five times per month," "d: rarely," and "e: never." In this case, the five categories can be recoded into five binary variables: Q1A, Q1B, Q1C, Q1D, and Q1E. All these binary variables can then be appended to form an R × 2 table, as shown in the following table. With this data structure, responses can be coded as "1" or "0," so measurement of classification agreement is possible. The agreement can be computed with the Kappa coefficient (see the SAS sketch after the table), and thereby the reliability of the data can be estimated.
Subjects | Log book data | Self-report data
Subject 1 | 1 | 1
Subject 2 | 0 | 0
Subject 3 | 1 | 0
Subject 4 | 0 | 1
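The table above can be analyzed directly with the AGREE option of PROC FREQ, which prints the Kappa coefficient. The following is a minimal SAS sketch using the four hypothetical subjects shown; the variable names logbook and selfreport are assumptions.

data agree;
  input subject $ logbook selfreport;  /* 1 = used, 0 = not used */
  cards;
S1 1 1
S2 0 0
S3 1 0
S4 0 1
;
run;

proc freq data=agree;
  tables logbook*selfreport / agree;   /* AGREE prints Kappa and McNemar's test */
run;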
Index of Inconsistency
Another way to analyze the aforementioned categorical data is the Index of Inconsistency (IOI). In the above example, because there are two measurements (log and self-reported data) and five response options, a 5 × 5 table is formed. The first step in computing the IOI is to divide the R × C table into several 2 × 2 sub-tables. For example, the last option, "never," is treated as one category and all the rest are collapsed into another category, "not never," as shown in the following table.
Log \ Self-reported data | Never | Not never | Total
Never | a | b | a+b
Not never | c | d | c+d
Total | a+c | b+d | n = a+b+c+d
The percent of IOI is computed by the following formula:
IOI% = 100*(b+c)/[2np(1-p)], where p = (a+c)/n
After
the IOI is calculated for each 2X2 sub-table, an average of all indices
is used as an indicator of the inconsistency of the measure. The
criterion to judge whether the data are consistent is as follows:
- An IOI of less than 20 is low variance
- An IOI between 20 and 50 is moderate variance
- An IOI above 50 is high variance
The reliability of the data is expressed in this equation: r = 1 - IOI, with the IOI expressed as a proportion rather than a percentage.
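A minimal SAS data step can carry out this computation for one 2 × 2 sub-table; the cell counts a, b, c, and d below are hypothetical values inserted purely for illustration.

data ioi;
  a = 12; b = 3; c = 5; d = 40;            /* cells of the 2 x 2 sub-table            */
  n = a + b + c + d;                       /* total number of subjects                */
  p = (a + c) / n;                         /* marginal proportion of "Never"          */
  ioi_pct = 100 * (b + c) / (2 * n * p * (1 - p));
  r = 1 - ioi_pct / 100;                   /* reliability, with the IOI as a proportion */
run;

proc print data=ioi;
run;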
Intraclass correlation coefficient
If both data sources yield continuous data, one can compute the intraclass correlation coefficient (ICC) to indicate the reliability of the data. The following is a screenshot of SPSS's ICC options. Under Type there are two options: "consistency" and "absolute agreement." If "consistency" is chosen, then even if one set of numbers is consistently high (e.g., 9, 8, 9, 8, 7...) and the other is consistently low (e.g., 4, 3, 4, 3, 2...), their strong correlation misleadingly suggests that the data are in alignment with each other, as the sketch below illustrates. Hence, it is advisable to choose "absolute agreement."
Repeated measures
The measurement of inter-data reliability
can also be conceptualized and proceduralized as a repeated measures
ANOVA. In a repeated measures ANOVA, measurements are given to the same
subjects several times such as pretest, midterm and posttest. In this
context, the subjects are also measured repeatedly by the web user log,
the log book and the self-reported survey. The following is the SAS
code for a repeated measures ANOVA:
data one;
  input user $ web_log log_book self_report;   /* three measures per volunteer */
  cards;
1 215 260 200
2 178 200 150
3 100 111 120
4 135 172 100
5 139 150 140
6 198 200 230
7 135 150 180
8 120 110 100
9 289 276 300
;
run;

proc glm data=one;
  model web_log log_book self_report = / nouni;  /* the nine users form the between-subject stratum */
  repeated time 3;                               /* the three measures form the repeated (time) factor */
run;
In the above program, the number of Websites visited by nine volunteers is recorded in the user access log, the personal log book, and the self-reported survey. The users are treated as the between-subject factor, while the three measures are regarded as the between-measure (within-subject) factor. The following is a condensed output:
Source of variation | DF | Mean Square
Between-subject (user) | 8 | 10442.50
Between-measure (time) | 2 | 488.93
Residual | 16 | 454.80
Based on the above information, the reliability coefficient can be
calculated using this formula (Fisher, 1946; Horst, 1949):
r = (MS_between-measure - MS_residual) / (MS_between-measure + df_between-people * MS_residual)
Let's plug the numbers into the formula:
r = (488.93 - 454.80) / (488.93 + (8 * 454.80)) = 34.13 / 4127.33
The reliability is about .008, which is extremely low. Therefore, we can go home and forget about the data. Fortunately, it is only a hypothetical data set. But what if it were a real data set? You have to be tough enough to give up poor data rather than publish findings that are totally unreliable.
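As a quick check, the same hand calculation can be reproduced in a short SAS data step; the mean squares and degrees of freedom below are simply copied from the condensed output above.

data rel;
  ms_measure = 488.93;    /* between-measure (time) mean square */
  ms_resid   = 454.80;    /* residual mean square               */
  df_people  = 8;         /* between-people degrees of freedom  */
  r = (ms_measure - ms_resid) / (ms_measure + df_people * ms_resid);
run;

proc print data=rel;
run;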
Correlational and regression analysis
Correlational analysis, which utilizes Pearson's product-moment coefficient, is very simple and is especially useful when the scales of the two measurements are not the same. For example, the Web server log may track the number of pages accessed, while the self-reported data are Likert-scaled (e.g., How often do you browse the Internet? 5=very often, 4=often, 3=sometimes, 2=seldom, 1=never). In this case, the self-reported scores can be used as a predictor to regress against page access.
A similar approach is regression analysis, in which one set of scores (e.g., survey data) is treated as the predictor while another set of scores (e.g., the user daily log) is treated as the dependent variable. If more than two measures are employed, a multiple regression model can be applied; that is, the measure that yields the most accurate results (e.g., the Web server access log) is regarded as the dependent variable and all other measures (e.g., the user daily log and the survey data) are treated as independent variables.
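A minimal SAS sketch of both analyses is shown below, assuming a hypothetical data set named usage with the variables page_access (from the server log), daily_log (from the hardcopy log book), and survey (the Likert-scaled self-report).

proc corr data=usage;              /* Pearson correlation between the two sources */
  var page_access survey;
run;

proc reg data=usage;               /* server log regressed on the other measures */
  model page_access = daily_log survey;
run;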
References
- Aschengrau, A., & Seage III, G. (2008). Essentials
of epidemiology in public health. Boston, MA: Jones and
Bartlett Publishers.
- Blalock, H. M. (1974). (Ed.) Measurement in
the social sciences: Theories and strategies. Chicago,
Illinois: Aldine Publishing Company.
- Caskie, G. I. L., Sutton, M. C., & Eckhardt,
A. G.
(2014). Accuracy of self-reported college GPA: Gender-moderated
differences by achievement level and academic self-efficacy. Journal of College Student
Development, 55, 385-390. https://doi.org/10.1353/csd.2014.0038
- Chan, D. (2009). So why ask me? Are self report data really that bad? In Charles E. Lance and Robert J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (pp. 309-335). New York, NY: Routledge.
- Cimpian, J. (2022, September). How invalid and mischievous survey responses bias estimates between groups. Paper presented at Innovations in Online Research Conference, Online.
- Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation:
Design and analysis issues. Boston, MA: Houghton Mifflin
Company.
- Csikszentmihalyi, M., & Larson, R. (1987). Validity and reliability
of the experience-sampling method. Journal of Nervous and Mental
Disease, 175, 526–536.
https://doi.org/10.1097/00005053-198709000-00004
- Deese, J. (1972). Psychology as science and
art. New York, NY: Harcourt Brace Jovanovich, Inc.
- DeFranza, D., Lindow, M., Harrison, K., Mishra, A.,
&
Mishra, H. (2021). Religion and reactance to COVID-19
mitigation guidelines. American
Psychologist, 76(5), 744–754.
http://dx.doi.org/10.1037/amp0000717.
- Dreyfuss,
E. (2018, August 17). A bot panic hits Amazon's Mechanical Turk:
Concerned social scientists turned their analytical skills onto one of
their most widely used research tools this week: Amazon's Mechanical
Turk. Wired. https://www.wired.com/story/amazon-mechanical-turk-bot-panic/
- Fetters, W., Stowe, P., & Owings, J. (1984). High
School and Beyond. A national longitudinal study for the 1980s, quality
of responses of high school students to questionnaire items. (NCES
84-216).
Washington, D. C.: U.S. Department of Education. Office of Educational
Research and Improvement. National center for Education Statistics.
- Finnigan, K. M., & Vazire, S. (2018). The
incremental
validity of average state self-reports over global self-reports of
personality. Journal
of Personality and Social Psychology, 115, 321–337.
https://doi.org/10.1037/pspp0000136
- Fisher, R. A. (1946). Statistical methods for research workers (10th ed.). Edinburgh, UK: Oliver and Boyd.
- Hadaway, C. K., & Marler, P. L. (2005). How many Americans attend worship each week? An alternative approach to measurement. Journal for the Scientific Study of Religion, 44, 307-322. https://doi.org/10.1111/j.1468-5906.2005.00288.x
- Hart, W., & Tomazic, T. (1999, August). Comparison of percentile distributions for anthropometric measures between three data sets. Paper presented at the Annual Joint Statistical Meeting, Baltimore, MD.
- Horst, P. (1949). A Generalized expression for the
reliability of measures. Psychometrika, 14, 21-31.
- Khoury, M., James, L., & Erickson, J. (1994). On the use of affected controls to address recall bias in case-control studies of birth defects. Teratology, 49, 273-281.
- Loftus, E. (2016, April). The fiction of memory.
Paper presented at the Western Psychological Association Convention.
Long Beach, CA.
- Lowry, R. (2016). Kappa as a measure of concordance
in categorical sorting. Retrieved from http://vassarstats.net/kappa.html
- Organisation for Economic Cooperation and
Development. (2017). Well-being
questionnaire for PISA 2018. Paris: Author. Retrieved from
https://www.oecd.org/pisa/data/2018database/CY7_201710_QST_MS_WBQ_NoNotes_final.pdf
- Ribeiro, E. (2022, September). The unintended consequences of research automation. Paper presented at Innovations in Online Research Conference, Online.
- Schacter, D. L. (1999). The seven sins of memory: Insights from psychology and cognitive neuroscience. American Psychologist, 54, 182-203.
- Salvucci, S., Walter, E., Conley, V., Fink, S., & Saba, M. (1997). Measurement error studies at the National Center for Education Statistics. Washington, D.C.: U.S. Department of Education.
- Stephens-Davidowitz, S. (2017). Everybody lies: Big
data, new data, and what the Internet can tell us about who we really
are. New York, NY: Dey Street Books.