Multi-collinearity,
Variance Inflation
and Orthogonalization in Regression
Chong Ho (Alex) Yu, Ph.D., D. Phil. (2022)
Mathematical dependence and logical dependence
A regression model with too many predictors may be
problematic. But even a model with as few as four
independent variables can suffer from collinearity when a composite
score is included. The following is a typical example:
GPA = GRE-verbal + GRE-quantitative +
GRE-analytical + GRE-total
In the above example, GRE-total is simply the sum of the other three
predictors, so it is completely determined by them. Technically
speaking, GRE-total and its components are both mathematically and
logically dependent. Mathematically, the value of GRE-total is computed
from the values of the others. Logically, GRE-total is not a new
concept.
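The mathematical dependence in the GRE example can be checked numerically. The following is a minimal sketch using simulated scores (the score ranges, sample size, and variable names are assumptions for illustration, not real data). It shows that a design matrix containing all three section scores plus their sum has fewer independent columns than it has columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated section scores (hypothetical ranges, not real GRE data).
verbal = rng.integers(200, 801, n).astype(float)
quantitative = rng.integers(200, 801, n).astype(float)
analytical = rng.integers(200, 801, n).astype(float)

# Composite score: the exact sum of the other three predictors.
total = verbal + quantitative + analytical

# Design matrix: intercept column plus the four predictors.
design_gre = np.column_stack(
    [np.ones(n), verbal, quantitative, analytical, total]
)

# The rank counts the linearly independent columns.
rank_gre = np.linalg.matrix_rank(design_gre)
print("columns:", design_gre.shape[1], "rank:", rank_gre)
```

Because the rank is smaller than the number of columns, the regression coefficients for this model are not uniquely determined. Dropping GRE-total (or any one of the section scores) would restore full rank.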
However, the following model is legitimate even though
strong relationships exist among the predictors:
GPA = time spent with family + time spent in
church + (time spent with family * time spent in church)
The researcher created the last variable because he suspected that GPA
is a function of the interplay between family values and Christian
ethics. In this case the predictors are mathematically dependent
but logically independent. Mathematically speaking, the interaction
term is the product of the first two variables, so it certainly
has a strong numeric relationship with them. Conceptually speaking, the
interaction is considered a new variable and thus it is logically
independent of the others. But when a regression model is built, will
these close relationships lead to collinearity and affect the model's
stability?
For Althauser (1971), the answer is "yes," and thus he
questioned the appropriateness of using interaction variables in a
regression model. Actually, when a regression model involves an
interaction effect, the regression plane is no longer flat. Rather, it
is curvilinear, as shown in the following left panel. Let's use the
finger-and-paper analogy again. In the right picture the paper is
curved, and my fingers (data points) are also curved around the paper.
Even though my fingers are close to each other, the plane is still
well-supported.
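The difference between a product term and a composite sum can also be examined numerically. The sketch below uses simulated weekly hours (the ranges, sample size, and variable names are assumptions for illustration). It computes how strongly the product term correlates with each of its components and then checks whether the design matrix is still of full rank.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated weekly hours (hypothetical ranges, not real data).
family = rng.uniform(5, 30, n)
church = rng.uniform(1, 10, n)

# Interaction variable: the product of the two predictors.
interaction = family * church

# Correlation of the product term with each of its components.
r_family = np.corrcoef(family, interaction)[0, 1]
r_church = np.corrcoef(church, interaction)[0, 1]

# Design matrix: intercept, both predictors, and the product term.
design_int = np.column_stack([np.ones(n), family, church, interaction])
rank_int = np.linalg.matrix_rank(design_int)

print("r(family, product):", round(r_family, 2))
print("r(church, product):", round(r_church, 2))
print("columns:", design_int.shape[1], "rank:", rank_int)
```

In this simulation the product term is clearly correlated with both components, yet the rank equals the number of columns. The product is not a linear combination of the other predictors, so the coefficients remain estimable, unlike the case in which a sum of the predictors is added to the model.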
Why is the interaction variable
expressed in the form of a product term?
Once a student asked me, "Why do you multiply
two variables to create an interaction variable?" Good question. When a
variable X is said to interact with another variable Z, it may be that
the relationship between a dependent variable Y and the independent
variable X is conditioned by a moderating variable Z. The following
equations express their relationships:
Y = a + bX + e    [equation 1]
a = c_1 + c_2 Z    [equation 2]
b = d_1 + d_2 Z    [equation 3]
When we substitute [2] and [3] into [1], we have:
Y = (c_1 + c_2 Z) + (d_1 + d_2 Z)X + e    [equation 4]
Y = c_1 + c_2 Z + d_1 X + d_2 ZX + e    [equation 5]
That's why an interaction variable is a product term: the coefficient
on the ZX term captures how the slope of X changes with Z. For
details, please consult Fisher (1988).
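The substitution above can be verified with a small simulation. In this sketch the true values of the four coefficients, the noise level, and the sample size are all hypothetical choices made for illustration. The data are generated from equations 1 to 3, and then the expanded model of equation 5 is fitted by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical true coefficients for equations 2 and 3.
c1, c2, d1, d2 = 1.0, 0.5, 2.0, -0.3

X = rng.normal(0.0, 1.0, n)   # independent variable
Z = rng.normal(0.0, 1.0, n)   # moderating variable
e = rng.normal(0.0, 0.1, n)   # error term

a = c1 + c2 * Z               # equation 2: the intercept depends on Z
b = d1 + d2 * Z               # equation 3: the slope depends on Z
Y = a + b * X + e             # equation 1

# Equation 5: regress Y on an intercept, Z, X, and the product ZX.
design = np.column_stack([np.ones(n), Z, X, Z * X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)

# Estimates are ordered as (c1, c2, d1, d2).
print(np.round(coef, 3))
```

The four estimates should land close to the values chosen for c1, c2, d1, and d2. A model in which the intercept and slope of X vary linearly with Z is therefore the same model as one that includes the product ZX as an additional predictor.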