Checking assumptions in regression

Chong Ho Yu, Ph.D.

 

Several assumptions about the data should be met in order to build a valid regression model. Wonnacott and Wonnacott (1981) argued that if the assumptions of linearity, normality, and independence are upheld, additional assumptions such as fixed values of X are not problematic. Berry and Feldman (1985) pointed out that most regression assumptions are concerned with the residuals. This write-up demonstrates how the different assumptions of regression can be examined.

Residuals have constant variance (homoscedasticity)

When the variance of the error term appears constant, the data are considered homoscedastic; otherwise, the data are said to be heteroscedastic. The SAS syntax for outputting and plotting the residuals is as follows:

proc reg;
        model y = x1 x2;
        output out=two
               p=y_hat      /* predicted values */
               r=y_res;     /* residuals */
proc gplot data=two;
        plot y_res * y_hat;
run;


The output is a plot of the residuals against the predicted values. If the residuals are scattered evenly around zero with roughly constant spread, the assumption of homoscedasticity is not violated. If there is a high concentration of residuals above zero or below zero, the variance is not constant and thus a systematic error exists.
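On SAS releases with ODS Graphics, PROC SGPLOT offers a convenient alternative to PROC GPLOT for the same plot; the following is a minimal sketch, assuming the output data set two created by the PROC REG step above.

proc sgplot data=two;
        scatter x=y_hat y=y_res;   /* residuals against predicted values */
        refline 0 / axis=y;        /* horizontal reference line at zero */
run;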

Independence of Residuals

A regression model requires independence of the error terms. Again, a residual plot can be used to check this assumption: random, patternless residuals imply independent errors. Even if the residuals are evenly distributed around zero and the assumption of constant variance is satisfied, the regression model is still questionable when the residuals show a systematic pattern, such as a cyclical or curved trend.
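When the observations have a natural ordering (for example, time), the Durbin-Watson statistic provides a numerical check of serial correlation in the residuals. The following is a minimal sketch, reusing the same y, x1, and x2 as above; the DW option is only meaningful when the data are ordered.

proc reg;
        model y = x1 x2 / dw;   /* DW requests the Durbin-Watson statistic */
run;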

Normality of Residuals

It is important to note that for regression the normality test should be applied to the residuals rather than to the raw scores. There is no general agreement on the best way to test normality. SAS provides four test statistics for detecting non-normality: the Shapiro-Wilk test (Shapiro & Wilk, 1965), the Kolmogorov-Smirnov test, the Cramér-von Mises test, and the Anderson-Darling test. According to the SAS manual, the Kolmogorov-Smirnov test should be used when the sample size exceeds 2000; for smaller samples the Shapiro-Wilk test is better. The null hypothesis of a normality test is that there is no significant departure from normality. When the p value is greater than .05, the test fails to reject the null hypothesis and the assumption is retained.
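To display only the normality tests for the residuals, the corresponding ODS table from PROC UNIVARIATE can be selected. The following is a minimal sketch, assuming the residuals data set two created earlier.

ods select TestsForNormality;    /* show only the table of normality tests */
proc univariate data=two normal;
        var y_res;
run;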

NCSS (NCSS Statistical Software, 2007) provides additional normality tests beyond the Shapiro-Wilk and Kolmogorov-Smirnov tests (see the following table). According to NCSS, the Shapiro-Wilk and Anderson-Darling tests are the best; the Kolmogorov-Smirnov test is included mainly because of its historical popularity, but it is outperformed in almost every way by the other tests.

However, some researchers argue that the Shapiro-Wilk test was originally constructed for samples of up to 50 subjects. To examine normality for sample sizes between 51 and 1999, other tests such as the Anderson-Darling, Martinez-Iglewicz, and D'Agostino tests are recommended.

Normality Test Section

Test Name              Test Value  Prob Level  10% Critical Value  5% Critical Value  Decision (5%)
Shapiro-Wilk W         0.937       0.21                                                Accept Normality
Anderson-Darling       0.443       0.29                                                Accept Normality
Martinez-Iglewicz      1.026                   1.216               1.357               Accept Normality
Kolmogorov-Smirnov     0.148                   0.176               0.192               Accept Normality
D'Agostino Skewness    1.037       0.299       1.645               1.960               Accept Normality
D'Agostino Kurtosis    -0.786      0.432       1.645               1.960               Accept Normality
D'Agostino Omnibus     1.691       0.429       1.645               1.960               Accept Normality

Although many authors have recommended using skewness and kurtosis to examine normality (Looney, 1995), Wilkinson (1999) argued that skewness and kurtosis often fail to detect distributional irregularities in the residuals. By this argument, the D'Agostino skewness and D'Agostino kurtosis tests may be less useful.
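For readers who still wish to inspect these moments directly, skewness and kurtosis can be requested from PROC MEANS. The following is a minimal sketch, assuming the residuals data set two.

proc means data=two n mean skewness kurtosis;   /* sample moments of the residuals */
        var y_res;
run;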

Also, statistical tests depend on sample size, and as sample size increases, the tests tend to reject even innocuous departures from the assumptions. Most normality tests have little statistical power (probability of detecting non-normal data) unless the sample size is large. If the null hypothesis is rejected, the data are definitely non-normal; but if the test fails to reject the null, the conclusion is uncertain, because all you know is that there was not enough evidence to reject the normality assumption (NCSS Statistical Software, 2007).

To address this sample-size problem, Hair, Anderson, Tatham and Black (1992) suggested that with a small sample it is safer to use both a normal probability plot and the test statistics to assess normality. In SAS, submitting the following syntax returns both the normal probability plot and the test statistics.

 

proc univariate normal plot data=two;
        var y_res;
        histogram y_res / normal;
        qqplot y_res;
run;


In a normal probability plot, the normal distribution is represented by a straight diagonal line at 45 degrees. The standardized residuals are plotted against this diagonal line to show any departure. If the residuals fall along the straight line, the departure from normality is slight.
 

Figure: Normal probability plot

Residuals have a mean of zero

The above PROC UNIVARIATE statement returns the mean. One can also use PROC MEANS to obtain the same result. However, when the mean value carries many decimal places, SAS displays it in E-notation. In the following example, the decimal point should be shifted 15 positions to the left, so the mean value is essentially zero (.000000000000001862483).

Figure: Residuals have zero mean
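If the E-notation is hard to read, the display can be fixed to a set number of decimal places instead. The following is a minimal sketch, assuming the residuals data set two; a mean of about 1.86E-15 then prints as 0.00000000, confirming that it is essentially zero.

proc means data=two mean maxdec=8;   /* print the mean with eight decimal places */
        var y_res;
run;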

Linearity

To examine the assumption of linearity, one can use a scatterplot matrix showing all Xs against Y in a pairwise manner. However, this option is not available in SAS, and SPSS's scatterplot matrix is not interactive. I recommend using an interactive scatterplot matrix, which is a feature of DataDesk. In DataDesk's scatterplot matrix, one can assign colors to the data points to detect clusters in different relationships. Take the following graph as an example: the assumption of linearity appears to be violated because there seem to be two clusters among the subjects, so a single linear fit to all data points is not the best fit.

Figure: Scatterplot matrix in DataDesk
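For a static (non-interactive) alternative, newer SAS releases with ODS Graphics do offer PROC SGSCATTER, although it lacks DataDesk's brushing and coloring. The following is a minimal sketch; the data set name one and the variables x1-x3 are placeholders.

proc sgscatter data=one;
        matrix y x1 x2 x3;   /* pairwise scatterplots of y and the Xs */
run;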


Fox (1991) suggested that although it is useful to plot y against each x to examine linearity, these plots are inadequate because they show only the marginal relationship between y and each x, without controlling for the other xs. Therefore, it is desirable to plot the residuals against y. How to output the residuals was illustrated in the section on homoscedasticity; one can modify that step by replacing the predicted values (y hat) with the observed values (y).
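Following that suggestion, the earlier PROC GPLOT step needs only a small change, since the OUTPUT OUT= data set keeps the original variables and therefore still contains y. The following is a minimal sketch.

proc gplot data=two;
        plot y_res * y;   /* residuals against the observed values of y */
run;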

The absence of multicollinearity

Multicollinearity inflates the variances of the estimated regression coefficients. These inflated variances are quite detrimental to regression because some variables add very little or even no new, independent information to the model (Belsley, Kuh & Welsch, 1980). Although Schroeder, Sjoquist and Stephan (1986) asserted that there is no statistical test that can determine whether or not multicollinearity really is a problem, there are several ways to detect it, such as a matrix of bivariate correlations and the regression of each independent variable on all the other independent variables (Berry & Feldman, 1985). Nonetheless, the former approach is insensitive to multiple correlations, while the latter says little about the influence of the regressors on the variances. A better approach is to use the variance inflation factor (VIF). The details of detecting multicollinearity are given in the write-up Multicollinearity, variance inflation factor, and orthogonalization.
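The following is a minimal sketch of requesting VIFs in PROC REG; the data set and variable names are placeholders. A common rule of thumb is to be concerned when a VIF exceeds 10.

proc reg data=one;
        model y = x1 x2 x3 / vif tol;   /* VIF and tolerance for each regressor */
run;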

 

Putting it all together

It is more efficient to check all of the preceding issues at the same time. The following simple SAS macro was written for this purpose.
 

/*
name  = name of the data set
dv    = dependent variable
x1    = first independent variable
xlast = last independent variable
*/

%macro reg (name, dv, x1, xlast);

proc reg data=&name;
        model &dv = &x1 - &xlast / vif;
        output out=two
               p=y_hat
               r=y_res;
title "Check multicollinearity using VIF";

proc gplot data=two;
        plot y_res * y_hat;
title "Check homoscedasticity and independence of residuals";

proc univariate normal plot data=two;
        var y_res;
        histogram y_res / normal;
        probplot y_res;
        qqplot y_res;
title "Check the mean and normality of residuals";

proc gplot data=&name;
        plot &dv * (&x1 - &xlast);
title "Check linearity";

run;

%mend reg;

/*
To invoke the macro, enter the actual names of the data set,
the DV, the first IV, and the last IV:
*/

%reg(name, dv, x1, xlast)
 


References

  • Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley.

  • Berry, W. D., & Feldman, S. (1985). Multiple regression in practice. London: Sage Publications.

  • Fox, J. (1991). Regression diagnostics. Sage Publications.

  • Looney, S. W. (1995). How to use tests for univariate normality to assess multivariate normality. American Statistician, 49, 64-70.

  • NCSS Statistical Software. (2007). NCSS. [Computer Software] Kaysville, UT: Author.

  • Schroeder, L. D., Sjoquist, D. L., & Stephan, P. E. (1986). Understanding regression analysis. Beverly Hills, CA: Sage Publications.

  • Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality. Biometrika, 52, 591-611.

  • Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

  • Wonnacott, T. H., & Wonnacott, R. J. (1981). Regression: A second course in statistics. Wiley.

Last updated: 2024

