# Multicollinearity

- Posted by lhmay
- on Apr, 21, 2018
- in Data Science, Blog

In statistics, multicollinearity (also collinearity) refers to predictors that are correlated with other predictors in the model. It is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.

In the presence of high multicollinearity, the confidence intervals of the coefficients tend to become very wide and the t-statistics tend to be very small. It becomes difficult to reject the null hypothesis that a coefficient is zero when multicollinearity is present in the data under study. Severe multicollinearity is a problem because it can increase the variance of the coefficient estimates and make the estimates very sensitive to small changes in the model or data. The result is that the coefficient estimates are unstable and difficult to interpret. Multicollinearity saps the statistical power of the analysis, can cause the coefficients to switch signs, and makes it more difficult to specify the correct model.
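One way to make the variance claim concrete is the standard decomposition of a coefficient's sampling variance in ordinary least squares with an intercept. The notation is introduced here for illustration and is not from the original text: $\sigma^2$ is the error variance, $x_{ij}$ is observation $i$ of predictor $j$, and $R_j^2$ is the R-squared from regressing predictor $j$ on all the other predictors.

$$\operatorname{Var}(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}\cdot\frac{1}{1-R_j^2}$$

As $R_j^2$ approaches 1, the second factor grows without bound, which is why the confidence intervals widen and the t-statistics shrink. The first factor shrinks as the sample size grows, which is why collecting more data can offset part of the inflation.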

**There are certain reasons why multicollinearity occurs:**

- It can be caused by an incorrect use of dummy variables.
- It can be caused by the inclusion of a variable that is computed from other variables in the data set.
- It can result from including the same kind of variable more than once.
- It generally occurs when the predictor variables are highly correlated with each other.
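The first two causes produce *perfect* multicollinearity, which shows up as a rank-deficient design matrix. The sketch below is a minimal illustration using NumPy; the season labels, the variables `a`, `b`, `total`, and all other names are made up for the example.

```python
import numpy as np

# Cause 1: a dummy for every category plus a constant term.
levels = ["summer", "autumn", "winter", "spring"]
seasons = np.array(levels * 5)                      # 20 hypothetical observations
dummies = np.column_stack([(seasons == s).astype(float) for s in levels])
const = np.ones(len(seasons))

X_trap = np.column_stack([const, dummies])          # constant + 4 dummies = 5 columns
X_ok = np.column_stack([const, dummies[:, 1:]])     # drop one dummy as the baseline

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print("columns:", X_trap.shape[1], "rank:", rank_trap)   # rank < columns
print("columns:", X_ok.shape[1], "rank:", rank_ok)       # full column rank

# Cause 2: a variable computed from other variables in the data set.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
total = a + b                                        # exact linear combination
X_computed = np.column_stack([np.ones(50), a, b, total])
rank_computed = np.linalg.matrix_rank(X_computed)
print("columns:", X_computed.shape[1], "rank:", rank_computed)
```

When the rank is smaller than the number of columns, the coefficients are not uniquely determined. Near-collinear predictors keep full rank but behave similarly in practice.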

**Multicollinearity can result in several problems. These problems are as follows:**

- Because of multicollinearity, the partial regression coefficients may not be estimated precisely. The standard errors are likely to be high.
- Multicollinearity can change the signs as well as the magnitudes of the partial regression coefficients from one sample to another.
- Multicollinearity makes it difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

**How do you know if you need to be concerned about multicollinearity in your regression model?**

Here are some things to watch for:

- A regression coefficient is not significant even though, theoretically, that variable should be highly correlated with Y.
- When you add or delete an X variable, the regression coefficients change dramatically.
- You see a negative regression coefficient when your response should *increase* along with X.
- You see a positive regression coefficient when the response should *decrease* as X increases.
- Your X variables have high pairwise correlations.

**Indicators that multicollinearity may be present in a model include the following:**

- Large changes in the estimated regression coefficients when a predictor variable is added or deleted
- Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the joint hypothesis that those coefficients are all zero (using an *F*-test).
- If a multivariable regression finds an insignificant coefficient for a particular explanator, yet a simple linear regression of the explained variable on that explanator shows its coefficient to be significantly different from zero, this indicates multicollinearity in the multivariable regression.
- Some authors have suggested a formal detection measure, the tolerance or the variance inflation factor (VIF):

  $$\mathrm{tolerance} = 1 - R_j^2, \quad \mathrm{VIF} = \frac{1}{\mathrm{tolerance}},$$

  where $R_j^2$ is the coefficient of determination of a regression of explanator *j* on all the other explanators. A tolerance of less than 0.20 or 0.10 and/or a VIF of 5 or 10 and above indicates a multicollinearity problem.[1]
- **Condition number test**: The standard measure of ill-conditioning in a matrix is the condition number. A large value indicates that the inversion of the matrix is numerically unstable with finite-precision numbers (standard computer floats and doubles), that is, that the computed inverse is potentially sensitive to small changes in the original matrix. The condition number is computed as the square root of the largest eigenvalue divided by the smallest eigenvalue of $X^{\mathsf T}X$, where $X$ is the design matrix. If the condition number is above 30, the regression may have significant multicollinearity; multicollinearity exists if, in addition, two or more of the variables related to the high condition number have high proportions of variance explained. One advantage of this method is that it also shows which variables are causing the problem.[2]
- **Farrar–Glauber test**:[3] If the variables are found to be orthogonal, there is no multicollinearity; if the variables are not orthogonal, then at least some degree of multicollinearity is present. C. Robert Wichers has argued that the Farrar–Glauber partial correlation test is ineffective in that a given partial correlation may be compatible with different multicollinearity patterns.[4] The Farrar–Glauber test has also been criticized by other researchers.[5][6]
- **Perturbing the data**:[7] Multicollinearity can be detected by adding random noise to the data, re-running the regression many times, and seeing how much the coefficients change.
- Constructing a correlation matrix among the explanatory variables will indicate how likely it is that any given pair of right-hand-side variables is creating multicollinearity problems. Correlation values (off-diagonal elements) of at least 0.4 are sometimes interpreted as indicating a multicollinearity problem. This procedure is, however, highly problematic and cannot be recommended. Intuitively, correlation describes a bivariate relationship, whereas collinearity is a multivariate phenomenon.
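The tolerance/VIF and condition number checks in the list above can be sketched with NumPy alone. This is an illustration under stated assumptions, not a reference implementation: the helper names `vif` and `condition_number` and the simulated data are made up for the example, and scaling each column to unit length before computing the condition number is a common convention assumed here rather than something the text specifies.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (n_samples x n_predictors, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)                   # VIF = 1 / tolerance
    return out

def condition_number(X):
    """sqrt(largest / smallest eigenvalue) of X'X after unit-length column scaling."""
    X = np.asarray(X, dtype=float)
    Xs = X / np.linalg.norm(X, axis=0)              # assumed scaling convention
    eig = np.linalg.eigvalsh(Xs.T @ Xs)
    return np.sqrt(eig.max() / eig.min())

# Simulated predictors: x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.02 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_collinear = np.column_stack([x1, x2, x3])
X_independent = rng.normal(size=(n, 3))

vifs = vif(X_collinear)
cond_collinear = condition_number(X_collinear)
cond_independent = condition_number(X_independent)
print("VIF:", np.round(vifs, 1))
print("tolerance:", np.round(1.0 / vifs, 4))
print("condition number (collinear):", round(cond_collinear, 1))
print("condition number (independent):", round(cond_independent, 1))
```

The two nearly identical predictors should show VIFs far above the 5 or 10 rule of thumb, while the unrelated predictor stays close to 1.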

One such signal is when the individual coefficient tests are not significant but the overall test of the model is significant. In this instance, the researcher might get a mix of significant and insignificant results that show the presence of multicollinearity. Suppose the researcher, after dividing the sample into two parts, finds that the coefficients estimated from the two parts differ drastically. This indicates that the coefficients are unstable because of multicollinearity. Suppose the researcher observes a drastic change in the model after simply adding or dropping a variable. This also indicates that multicollinearity is present in the data.
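The split-sample check can be sketched as follows. Instead of a single split, this version refits the model on many random halves and reports how much each coefficient moves; that repetition is a choice made for the example, and the simulated data and the helper `ols` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Simulated data: x1 and x2 are nearly collinear, x3 is independent.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x2 + x3 + rng.normal(size=n)

# Refit on 200 random halves of the sample and collect the coefficients.
coefs = []
for _ in range(200):
    idx = rng.permutation(n)[: n // 2]
    coefs.append(ols(X[idx], y[idx]))
coefs = np.array(coefs)

spread = coefs.std(axis=0)          # order: intercept, x1, x2, x3
print("std of coefficients across halves:", np.round(spread, 3))
```

The coefficients of the two collinear predictors should vary far more from one half to the next than the coefficient of the independent predictor.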

Multicollinearity can cause a number of problems. It can sap the significance of a predictor and change its sign. Imagine trying to specify a model with many more potential predictors. If you saw signs that kept changing and misleading p-values, it could be hard to specify the correct model! Stepwise regression does not work as well with multicollinearity.

However, multicollinearity doesn’t affect how well the model fits. If the model satisfies the residual assumptions and has a satisfactory predicted R-squared, even a model with severe multicollinearity can produce great predictions.
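A small simulation can illustrate the difference between unstable coefficients and a good overall fit. Everything below is a made-up example: the true coefficients, the noise levels, and the variable names are assumptions chosen for the sketch.

```python
import numpy as np

# Simulated data where the true coefficients of x1 and x2 are both 1.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # nearly collinear with x1
y = 1.0 + x1 + x2 + 0.5 * rng.normal(size=n)

A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef

# In-sample R-squared of the fitted model.
r2 = 1.0 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
b1, b2 = coef[1], coef[2]

print("b1, b2:", round(b1, 2), round(b2, 2))   # individually poorly determined
print("b1 + b2:", round(b1 + b2, 2))           # the combined effect is well determined
print("R-squared:", round(r2, 3))
```

The individual estimates may land far from 1, but their sum stays close to 2 and the R-squared stays high. The data pin down the combined effect of the two predictors even when they cannot separate the individual effects.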

**How to deal with collinearity?**

**First step:** remove the variable with the highest VIF, then recompute the VIF of the remaining variables. If any remaining VIF is greater than 2.5, repeat the step until every VIF is ≤ 2.5.
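The elimination loop above can be sketched with NumPy. The function names `vif` and `drop_high_vif`, the column names, and the simulated data are hypothetical. One small difference from the wording above: this version checks the threshold before every removal, so it removes nothing when all VIFs are already at or below 2.5.

```python
import numpy as np

def vif(X):
    """VIF of each column of X, from regressing it on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

def drop_high_vif(X, names, threshold=2.5):
    """Repeatedly drop the column with the highest VIF until all are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)   # remove the worst offender
        del names[worst]                  # keep names aligned with columns
    return X, names

# Simulated predictors: x2 nearly duplicates x1, x3 is unrelated.
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

X_kept, kept = drop_high_vif(X, ["x1", "x2", "x3"])
final_vifs = vif(X_kept)
print("kept:", kept)
print("remaining VIFs:", np.round(final_vifs, 2))
```

Which of the two near-duplicates is dropped depends on tiny differences in their VIFs, so subject knowledge is usually a better guide than the loop alone when the candidates are this close.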

In some models, severe multicollinearity is caused primarily by an interaction term. In that case, the problem can often be removed simply by standardizing the predictors. However, when standardizing your predictors doesn’t work, you can try other solutions such as:

- **Remove highly correlated predictors from the model.** If you have two or more factors with a high VIF, remove one from the model. Because they supply redundant information, removing one of the correlated factors usually doesn’t drastically reduce the R-squared. Consider using stepwise regression, best subsets regression, or specialized knowledge of the data set to remove these variables. Select the model that has the highest R-squared value.
- **Linearly combine predictors**, such as adding them together.
- **Use Partial Least Squares Regression (PLS) or Principal Components Analysis**, regression methods that cut the number of predictors to a smaller set of uncorrelated components.
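The component-based approach can be illustrated with a minimal principal components regression built on NumPy's SVD. This is a sketch of the general idea, not PLS and not any particular library's routine; the function `pcr_fit`, the choice of `k = 2` components, and the simulated data are assumptions made for the example.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Regress y on the first k principal components of standardized X."""
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize each predictor
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    Z = Xs @ Vt[:k].T                                # component scores, n x k
    A = np.column_stack([np.ones(len(y)), Z])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)    # intercept + k component coefficients
    return Z, gamma

# Simulated data: x1 and x2 are nearly collinear, x3 is independent.
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x2 + x3 + rng.normal(size=n)

Z, gamma = pcr_fit(X, y, k=2)
score_corr = np.corrcoef(Z, rowvar=False)
print("correlation between component scores:")
print(np.round(score_corr, 6))
print("coefficients on components:", np.round(gamma, 3))
```

The component scores are uncorrelated by construction, so the regression on them does not suffer from collinearity. The trade-off is that each coefficient now refers to a blend of the original predictors rather than to a single one.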

- Make sure you have not fallen into the dummy variable trap; including a dummy variable for every category (e.g., summer, autumn, winter, and spring) and including a constant term in the regression together guarantee perfect multicollinearity.
- Try seeing what happens if you use independent subsets of your data for estimation and apply those estimates to the whole data set. Theoretically you should obtain somewhat higher variance from the smaller datasets used for estimation, but the expectation of the coefficient values should be the same. Naturally, the observed coefficient values will vary, but look at how much they vary.
- Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn’t affect the efficiency of extrapolating the fitted model to new data provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.[9]
- Drop one of the variables. An explanatory variable may be dropped to produce a model with significant coefficients. However, you lose information (because you’ve dropped a variable). Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable.
- Obtain more data, if possible. This is the preferred solution. More data can produce more precise parameter estimates (with lower standard errors), as can be seen from the formula that expresses the variance of a regression coefficient estimate in terms of the sample size and the variance inflation factor.
- Mean-center the predictor variables. Generating polynomial terms can cause some multicollinearity if the variable in question has a limited range (e.g., [2,4]). Mean-centering will eliminate this special kind of multicollinearity. However, in general, this has no effect. It can be useful in overcoming problems arising from rounding and other computational steps if a carefully designed computer program is not used.
- Standardize your independent variables. This may help reduce a false flagging of a condition index above 30.
- It has also been suggested that using the Shapley value, a game theory tool, the model could account for the effects of multicollinearity. The Shapley value assigns a value for each predictor and assesses all possible combinations of importance.[10]
- If the correlated explanators are different lagged values of the same underlying explanator, then a distributed lag technique can be used, imposing a general structure on the relative values of the coefficients to be estimated.
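The mean-centering item in the list above is easy to check numerically. The sketch below uses a hypothetical predictor on an evenly spaced grid over the range [2, 4] mentioned there and compares the correlation between the variable and its square before and after centering.

```python
import numpy as np

# Hypothetical predictor with a limited range, as in the [2, 4] example.
x = np.linspace(2.0, 4.0, 101)

# Raw polynomial term: x and x**2 move almost in lockstep on this range.
corr_raw = np.corrcoef(x, x ** 2)[0, 1]

# Mean-centered version: the linear and squared terms decouple.
xc = x - x.mean()
corr_centered = np.corrcoef(xc, xc ** 2)[0, 1]

print("corr(x, x^2) raw:", round(corr_raw, 4))
print("corr(x, x^2) centered:", round(corr_centered, 4))
```

The centered correlation is essentially zero here because the grid is symmetric around its mean. With skewed data, centering reduces the correlation but does not remove it entirely, and it does nothing for collinearity between two different predictors.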