Variance Inflation Factor

Multiple Regression

In a multiple regression model, we are trying to find a relationship between the dependent variable Y and several independent variables X1, X2, and so on. A typical linear model might be of the form:

Y = C + β_1 X_1 + β_2 X_2 + ⋯

where C is the intercept term and the β_i are the model coefficients. Typically, we use the least squares approach to estimate the coefficients, which means finding the values that minimize the sum of the squared error terms. In matrix form, with X denoting the design matrix (with a leading column of ones for the intercept) and Y the vector of observed responses, the coefficients are estimated by:

β = (X^T X)^(-1) X^T Y
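As a quick illustration, here is a minimal NumPy sketch of this matrix computation; the data values below are made up for the example.

```python
import numpy as np

# Hypothetical data: 5 observations of Y and two independent variables
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.3, 0.8, 3.1, 1.9, 4.4])
Y = np.array([6.1, 6.9, 10.8, 11.2, 15.3])

# Design matrix with a leading column of ones for the intercept C
X = np.column_stack([np.ones_like(X1), X1, X2])

# beta = (X^T X)^(-1) (X^T Y)
beta = np.linalg.inv(X.T @ X) @ (X.T @ Y)
print(beta)  # [C, beta_1, beta_2]
```

In practice, np.linalg.lstsq or a statistics package is preferred over forming the inverse explicitly, since it is numerically more stable.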

In order to fit a multiple regression model, several assumptions need to hold: the relationship is linear, the residuals are normally distributed, the variance is constant, and there is no multicollinearity between the independent variables. In this article we focus on the issue of multicollinearity.

Multicollinearity

Multicollinearity refers to the problem where the independent variables are collinear. Collinearity is a linear relationship between two explanatory variables; two variables are perfectly collinear if there is an exact linear relationship between them. If the independent variables are perfectly collinear, the model becomes singular and it is not possible to uniquely identify the model coefficients mathematically. For example, if we are building a model of gasoline consumption (Y) in miles per gallon, and the explanatory variables include the weight of the car in kg (X1) and the weight of the car in lbs (X2), then the model is singular since the two variables are exactly related (X2 ≈ 2.2 X1, or equivalently X1 ≈ 0.45 X2) and we cannot estimate the model coefficients. Even if two independent variables are not perfectly collinear but nearly so (for example, including age and height as two of the independent variables), it causes stability problems when estimating the model coefficients: the coefficients may not be stable and consistent and can vary widely if we use slightly different data sets or drop terms from the model.

Multicollinearity can also occur when we use dummy variables to encode discrete independent variables and are not careful, including a dummy for every level instead of leaving one out as the reference level. Here are some ways in which multicollinearity can impact the analysis:
  • Model coefficients cannot be computed due to singularity
  • Model coefficients may be inaccurate
  • Model coefficients may vary wildly when we add or remove data points
  • Model coefficients may vary wildly when we add or drop terms
  • Model coefficients may become statistically insignificant
Hence, multicollinearity is a problem that needs to be addressed when building a multiple regression model.

How to detect Multicollinearity?

One way to detect this issue is to check the correlation coefficient between each pair of independent variables: if a correlation coefficient is high (close to +1 or -1), we conclude that the variables may be collinear and need to determine how best to address this before building the multiple regression model. If we have three inputs (X1, X2, and X3), we would check the correlation coefficient between X1 and X2, between X1 and X3, and between X2 and X3. If any of these correlation coefficients is high, we need to address it before building the model. However, the pairwise correlation test works well only when exactly two variables are correlated with each other; in some cases one independent variable may be a linear combination of two or more of the other independent variables, and checking the variables two at a time does not catch the problem. For example, suppose X3 = 0.25 X1 + 0.32 X2. The design matrix is still singular because of the relationship between X1, X2, and X3, even though no single pairwise correlation need be close to ±1, as the simulation below illustrates. To catch these situations, we can use an index called the variance inflation factor (VIF), which gives an indication of multicollinearity.
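Here is a small NumPy simulation (with made-up data) of that example: every pairwise correlation stays well away from ±1, yet the three variables together are perfectly multicollinear.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=100)
X2 = rng.normal(size=100)
X3 = 0.25 * X1 + 0.32 * X2  # exact linear combination of X1 and X2

# Pairwise correlation matrix: no off-diagonal entry is close to +/-1,
# yet the design matrix [X1, X2, X3] is singular
X = np.column_stack([X1, X2, X3])
print(np.corrcoef(X, rowvar=False))
```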

What is VIF?

VIF is an index that provides a measure of how much the variance of an estimated regression coefficient increases due to collinearity. To determine the VIF values, we regress each independent variable on the remaining independent variables. For example, we would fit the following model to estimate the coefficient of determination R_1^2 and use this value to compute the VIF for X1:

X_1 = C + α_2 X_2 + α_3 X_3 + ⋯

VIF_1 = 1 / (1 - R_1^2)

Next, we fit a model between X2 and the other independent variables to estimate the coefficient of determination R_2^2:

X_2 = C + α_1 X_1 + α_3 X_3 + ⋯

VIF_2 = 1 / (1 - R_2^2)
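The sketch below shows one way this could be implemented; the helper function vif is hypothetical, not from any particular library, and simply regresses each column on the others and applies VIF_i = 1/(1 - R_i^2).

```python
import numpy as np

def vif(X):
    """VIF for each column of X (an n x p array without an intercept column)."""
    n, p = X.shape
    result = []
    for i in range(p):
        y = X[:, i]                                # regress column i ...
        others = np.delete(X, i, axis=1)           # ... on the remaining columns
        A = np.column_stack([np.ones(n), others])  # intercept for the auxiliary fit
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        result.append(1.0 / (1.0 - r2))            # VIF_i = 1 / (1 - R_i^2)
    return result
```

statsmodels ships an equivalent helper, variance_inflation_factor, in statsmodels.stats.outliers_influence.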

If all the independent variables are orthogonal to each other, then VIF = 1.0. If there is perfect correlation, then VIF = infinity. A large value of VIF indicates correlation between the variables. If the VIF is 4, the variance of that model coefficient is inflated by a factor of 4 due to the presence of multicollinearity, which means that the standard error of the coefficient is inflated by a factor of 2 (the standard deviation is the square root of the variance). The standard error of a coefficient determines its confidence interval: if the standard error is large, the confidence interval is wide and the model coefficient may come out as non-significant due to the presence of multicollinearity. A general rule of thumb is that VIF > 10 indicates multicollinearity. Note that this is a rough guideline; in some cases we may choose to live with high VIF values if they do not affect the model results, such as when fitting a quadratic or cubic model, and depending on the sample size a large VIF may not necessarily indicate a poor model.
VIF            Conclusion
1              No multicollinearity
4 - 5          Moderate
10 or greater  Severe

What to do if VIF is large?

If VIF is large and multicollinearity affects your analysis results, then you need to take some corrective actions before you can use multiple regression. Here are the various options:
  • One approach is to review the independent variables and eliminate terms that are duplicates or do not add value in explaining the variation in the model. For example, if the inputs measure the weight in kg and in lbs, keep only one of these variables in the model and drop the other. Dropping the term with a large VIF will hopefully fix the VIF for the remaining terms so that all the VIF values fall within the threshold limits; if dropping one term is not enough, more terms may need to be dropped as required.
  • A second approach is to use principal component analysis and determine the optimal set of principal components that best describe your independent variables. Using this approach will get rid of your multicollinearity problem but it may be hard for you to interpret the meaning of these “new” independent variables.
  • The third approach is to increase the sample size. By adding more data points to the model, the confidence intervals for the model coefficients will hopefully become narrow enough to overcome the problems associated with multicollinearity.
  • The fourth approach is to transform the data, for example with a log transformation, so that the independent variables are no longer as strongly correlated with each other.
  • Finally, you can use a different type of model called ridge regression that handles multicollinearity better; a short sketch follows this list.
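As a sketch of that last option, here is how ridge regression could be fit with scikit-learn on simulated, nearly collinear data; the data and the penalty strength alpha are made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X1 = rng.normal(size=50)
X2 = X1 + rng.normal(scale=0.05, size=50)  # nearly collinear with X1
Y = 3 + 2 * X1 + 0.5 * X2 + rng.normal(scale=0.1, size=50)

X = np.column_stack([X1, X2])
model = Ridge(alpha=1.0).fit(X, Y)  # alpha > 0 adds an L2 penalty on the coefficients
print(model.intercept_, model.coef_)
```

The L2 penalty shrinks the coefficients and keeps them stable even when the ordinary least squares solution is nearly singular.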
In conclusion, when you are building a multiple regression model, always check your VIF values for your independent variables and determine if you need to take any corrective action before building the model.

Example

Blood pressure (BP) measurements for several individuals were collected along with a few independent variables, as shown in the table below. The explanatory variables are the age of the person (years), weight (kg), height (feet), duration the person has suffered from hypertension (years), and stress level (score on a scale of 0-100). Develop a multiple regression model between the inputs and the output.
BP    Age   Weight   Height   Years   Stress
108   50    89       5.25     6       32
117   53    99       6.3      8       14
118   53    97       5.94     10      8
120   53    98       6.03     6       98
113   54    93       5.67     10      94
124   49    102      6.75     12      10
125   50    103      6.75     6       41
112   49    91       5.7      7       6
110   53    92       5.49     11      62
115   50    97       6.21     10      35
117   52    95       6.21     6       89
116   50    99       5.94     6       20
118   51    92       6.15     11      45
108   45    90       5.76     9       80
128   52    102      6.57     11      98
117   48    98       5.94     10      94
107   49    90       5.61     8       17
115   47    97       5.7      8       12
111   50    91       5.64     9       97
126   60    98       6.27     10      98

Solution:


Model 1: Let's build a model with all the factors included. The regression model between the output and the inputs is shown below.
[Figure: Model 1 regression output with VIF values]
Looking at the P values, we would conclude that the number of years of hypertension and the stress level are not significant predictors of BP. Looking at the correlation plots between the input variables, it appears that height and weight are correlated while the other factors are not; the two factors have a correlation coefficient of 0.843. From the analysis results, we can see that the VIF values for height and weight are higher than those of the other factors. Since some of the VIF values are large, we cannot be sure at this point that we can trust the model coefficients. Hence, we need to build a better model where the VIF values are not that large. Since height and weight appear to be correlated, we should keep only one of these terms in the model. If we feel it is easier to include weight in the analysis, we would drop the height factor and rebuild the model.
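As a sketch, this workflow could be reproduced in Python with statsmodels; the file name bp_data.csv is a hypothetical file holding the table above.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# bp_data.csv is assumed to hold the table above with columns
# BP, Age, Weight, Height, Years, Stress
df = pd.read_csv("bp_data.csv")

# Model 1: all factors included
model1 = smf.ols("BP ~ Age + Weight + Height + Years + Stress", data=df).fit()
print(model1.summary())

# VIF for each predictor (the intercept column is skipped)
X = sm.add_constant(df[["Age", "Weight", "Height", "Years", "Stress"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```

Model 2 is the same call with Height removed from the formula, and Model 3 additionally drops Years and Stress.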

Model 2: Dropping height from the model gives the following results.
[Figure: Model 2 regression output with VIF values]
Looking at the VIF values, the model no longer exhibits any multicollinearity, so we can now trust the model coefficients. Since the P values for the number of years and stress are not significant, let's drop these terms to build the final model.

Model 3: The final model with only the significant factors is shown below.
[Figure: Model 3 regression output with VIF values]
This model shows that BP increases by 0.53 units with every year of age and by 1.141 units for every kg increase in weight. From Model 1 to Model 3, the adjusted R² value drops from 87% to 81%. Since all the terms are now statistically significant, we can use this model for predictions and/or optimization.
