Screening (multi)collinearity in a regression model - r

I hope this isn't going to be an "ask-and-answer" question... here goes:
(Multi)collinearity refers to extremely high correlations between predictors in a regression model. How to cure it... well, sometimes you don't need to "cure" collinearity, since it doesn't affect the regression model itself, only the interpretation of the effects of individual predictors.
One way to spot collinearity is to treat each predictor in turn as the dependent variable and the remaining predictors as independent variables, compute R2, and if it is larger than .9 (or .95), consider that predictor redundant. This is one "method"... what about other approaches? Some of them are time consuming, like excluding predictors from the model one at a time and watching for changes in the b coefficients - they should be noticeably different.
Of course, we must always bear in mind the specific context/goal of the analysis... Sometimes the only remedy is to repeat the research, but right now I'm interested in the various ways of screening for redundant predictors when (multi)collinearity occurs in a regression model.

The kappa() function can help. Here is a simulated example:
> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001 # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2) # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3) # bad model with near collinearity
> kappa(mm12) # a 'low' kappa is good
[1] 1.166029
> kappa(mm123) # a 'high' kappa indicates trouble
[1] 121530.7
and we go further by making the third regressor more and more collinear:
> x4 <- x1 + 2*x2 + rnorm(100)*0.000001 # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2 # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16
This uses approximations; see help(kappa) for details.
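Note that by default kappa() computes a fast approximation of the condition number; a quick sketch of the exact computation on the same matrices (assuming mm12 and mm123 are still in the workspace):
kappa(mm12, exact = TRUE)   # exact 2-norm condition number, still small
kappa(mm123, exact = TRUE)  # exact value, same order of magnitude as the approximation above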

Just to add to what Dirk said about the condition number method: a rule of thumb is that a condition number > 30 indicates severe collinearity. Other methods, apart from the condition number, include:
1) The determinant of the correlation matrix of the predictors, which ranges from 0 (perfect collinearity) to 1 (no collinearity); a determinant of the covariance matrix that is close to zero likewise signals near-collinearity:
# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09
2) Using the fact that the determinant of a matrix is the product of its eigenvalues, the presence of one or more very small eigenvalues indicates collinearity:
> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184
> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09
3) The Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i on the remaining predictors. Collinearity is present when the VIF of at least one predictor is large; a rule of thumb is that VIF > 10 is of concern. For an implementation in R, see the vif() function in the car package (mentioned below); a hand-rolled version is sketched right after this list. I would also add that the use of R^2 for detecting collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it does exist.
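A minimal sketch of that hand-rolled VIF computation, assuming the simulated mm123 model matrix from the kappa() example above is still in scope:
X <- as.data.frame(mm123[, -1])  # drop the intercept column
sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
  1 / (1 - r2)                   # VIF_i = 1 / (1 - R_i^2)
})
# all three VIFs are enormous, flagging the near-linear dependence x3 ~ x1 + 2*x2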

You might like Vito Ricci's Reference Card "R Functions For Regression Analysis"
http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
It succinctly lists many useful regression-related functions in R, including diagnostic functions.
In particular, it lists the vif() function from the car package, which can assess multicollinearity.
http://en.wikipedia.org/wiki/Variance_inflation_factor
Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/

See also Section 9.4 of the book Practical Regression and Anova Using R [Faraway 2002].
Collinearity can be detected in several ways:
Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.
A regression of x_i on all other predictors gives R^2_i. Repeat for all predictors. R^2_i close to one indicates a problem — the offending linear combination may be found.
Examine the eigenvalues of t(X) %*% X, where X denotes the model matrix; small eigenvalues indicate a problem. The 2-norm condition number can be shown to be the ratio of the largest to the smallest non-zero singular value of the matrix ($\kappa = \sqrt{\lambda_1/\lambda_p}$; see ?kappa); $\kappa \ge 30$ is considered large. A small sketch is given below.
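A small sketch of that eigenvalue route, again using the mm123 model matrix from the earlier simulated example (assumed to be in scope):
ev <- eigen(crossprod(mm123))$values   # eigenvalues of t(X) %*% X
ev                                     # one eigenvalue is tiny, signalling near-collinearity
sqrt(max(ev) / min(ev))                # 2-norm condition number; compare kappa(mm123, exact = TRUE)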

To expand on the VIF approach: a Variance Inflation Factor > 10 usually indicates serious redundancy between predictor variables. The VIF indicates the factor by which the variance of a variable's coefficient is inflated relative to what it would be if that variable were uncorrelated with the other predictors.
vif() is available in the car package and is applied to an object of class lm. It returns the VIF of each predictor x1, x2, ..., xn in the fitted lm object. It is a good idea to exclude variables with VIF > 10, or to transform the variables with VIF > 10.
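A quick example with car::vif(), reusing the simulated x1, x2, x3 from the kappa() answer above (assumed in scope); the response is arbitrary because VIFs depend only on the predictors:
library(car)
fit <- lm(rnorm(100) ~ x1 + x2 + x3)  # arbitrary response, collinear predictors
vif(fit)                              # x1, x2 and x3 all far above 10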

Related

Using Savage-Dickey point null hypotheses for brms models

Let's say I have a brms model of y ~ a*b + (1|group) and I'd really like to compare it against an alternative model y ~ a + b + (1|group) (in my example, a is time and b is experimental condition).
Using hypothesis() I could write:
hypothesis(m1, c("a:b = 0"))
However, the evidence ratio returned for this point null hypothesis is often odd in that it is very dependent on the priors used. When relatively uninformative priors are used, then even when a fairly precise, clearly non-zero estimate for a:b is found, the ER can still be > 1 or even > 3. This is because the prior placed so much weight far away from zero.
I realise I could try using the bridgesampling package to calculate a Bayes factor, but I have always found that quite slow and cumbersome. I wonder instead whether it would be reasonable to compare the ERs for these two hypotheses:
a:b = 0 vs a:b > 0
That is, if I calculated ER(a:b > 0) / ER(a:b = 0), would this give a Bayes factor in favour of a non-zero, positive effect, as compared with a zero effect?
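For concreteness, a minimal sketch of how the two evidence ratios could be pulled out of brms::hypothesis() (assuming a fitted model m1 whose population-level interaction term is named a:b, as in the question):
library(brms)
h <- hypothesis(m1, c("a:b = 0", "a:b > 0"))  # point null and directional hypothesis
h$hypothesis$Evid.Ratio                       # the two evidence ratios to be compared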

How to use Box-Tidwell in R in logistic regression

For a logistic regression, I want to test the assumption of a linear relation between the independent variables and the log-odds of Y.
I have read about using the Box-Tidwell test for this. Applying the test keeps resulting in errors, so I'm hoping to get some assistance with the code.
My dataset consists of about 80 variables. My dependent variable is dichotomous. My independent variables include continuous variables as well as factors.
The first question I have is which Y to use in the Box-Tidwell.
I specified Y by running glm() (family = binomial) on all the predictors and taking the log-odds:
odds = ...$fitted.values / (1-...$fitted.values)
log_odds <- log(odds)
This means that the Y I use for the Box-Tidwell is based on the model without the x*ln(x) interactions. I'm wondering if this is the right way.
I cannot think of another way to specify the model in the Box-Tidwell.
Should I perhaps define the glm inside the Box-Tidwell test, and if yes, how should I do this (i.e. which code should I use)?
The second question I have concerns the specification of x1 and x2.
I specified x1 in the Box-Tidwell by adding a small constant (0.01) to all continuous predictors (since Box-Tidwell cannot handle zeroes), so I first selected all the continuous predictors.
Since I read that x1 and x2 should be matrices, I did the following:
DF_numeric <- Filter(is.numeric, DF)
x <- DF_numeric + 0.01
x_matrix <- as.matrix(x)
As I understand it, x2 should contain the variables in the model that are not candidates for transformation, so:
x2 <- Filter(is.factor, DF)
x2_matrix <- as.matrix(x2)
I get the following error, when running my BT:
BT <- boxTidwell(Y=log_odds, x1=x_matrix, x2=x2_matrix)
Error in `[[<-.data.frame`(`*tmp*`, i, value = c(250L, 250L, 194L, 250L, :
replacement has 480090 rows, data has 6155
The 480090 rows are the 6155 participants times the number of variables, but I don't know what mistake I'm making.
I have tried different possibilities. Since the log_odds I use for Y are not in the same data frame as x1 and x2, I tried binding x1, x2 and log_odds with cbind(). This, however, does not change the error.
Should I use different formats for Y, x1 or x2?
I hope I have given enough information and hope you can help me!
(I first asked this question on Cross Validated, but since it was off-topic there, I hope this is the correct place.)
Thanks in advance!
Simone
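As a hedged aside (not an answer from the thread): the Box-Tidwell idea is often implemented for logistic regression by adding x*log(x) terms to the glm itself and testing them, which avoids passing matrices to boxTidwell() altogether. The names below (y, x, f) are placeholders for the asker's outcome, one continuous predictor and one factor; DF is the asker's data frame:
# x should already be strictly positive (e.g. after adding 0.01 as described above)
fit <- glm(y ~ x + I(x * log(x)) + f, family = binomial, data = DF)
summary(fit)  # a significant I(x * log(x)) term suggests the logit is not linear in x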

Finding model predictor values that maximize the outcome

How do you find the set of values of the model predictors (a mixture of linear and non-linear terms) that yields the highest value of the response?
Example Model:
library(lme4); library(splines)
summary(lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month), data = airquality, REML = F))
Here I am interested in what conditions (predictor values) produce the highest solar radiation (the outcome).
This question seems simple, but I've failed to find a good answer using Google.
If the model were simple, I could take derivatives to find the maximum or minimum. Someone has suggested that if the model function can be extracted, the stats::optim() function might be used. As a last resort, I could simulate all reasonable combinations of input values, plug them into predict(), and look for the maximum value.
That last approach doesn't seem very efficient, and I imagine this is a common enough task (e.g., finding optimal customers for advertising) that someone has built tools for handling it. Any help is appreciated.
There are some conceptual issues here.
For the simple terms (Wind and Temp), the response is a linear (and hence both monotonic and unbounded) function of the predictors. Thus, if these terms have positive parameter estimates, increasing their values toward infinity (Inf) will give you an infinite response (Solar.R); if the coefficients are negative, the values should be as small (as negative) as possible. Practically speaking, then, you want to set these predictors to their minimum or maximum reasonable value depending on whether the parameter estimate is negative or positive.
For the bs() term, I'm not sure what the properties of the B-spline are beyond the boundary knots, but I'm pretty sure the curve heads off to positive or negative infinity, so you've got the same issue. However, in the bs() case it's also possible that there are one or more interior maxima. For that case I would probably try to extract the basis terms and evaluate the spline over the range of the data ...
Alternatively, your mentioning optim makes me think that this is a possibility:
data(airquality)
library(lme4)
library(splines)
m1 <- lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month),
           data = airquality, REML = FALSE)
predval <- function(x) {
    newdata <- data.frame(Ozone = x[1], Wind = x[2], Temp = x[3])
    ## return population-averaged prediction (no Month effect)
    return(predict(m1, newdata = newdata, re.form = ~0))
}
aq <- na.omit(airquality)
sval <- with(aq, c(mean(Ozone), mean(Wind), mean(Temp)))
predval(sval)
opt1 <- optim(fn = predval,
              par = sval,
              lower = with(aq, c(min(Ozone), min(Wind), min(Temp))),
              upper = with(aq, c(max(Ozone), max(Wind), max(Temp))),
              method = "L-BFGS-B",           ## for constrained optimization
              control = list(fnscale = -1))  ## for maximization
## opt1
## $par
## [1] 70.33851 20.70000 97.00000
##
## $value
## [1] 282.9784
As expected, the optimum is intermediate in the range of Ozone (1-168), and at the maximum of the range for both Wind (2.3-20.7) and Temp (57-97).
This brute-force solution could be made much more efficient by automatically fixing the simple terms at their min/max values and optimizing only over the complex (polynomial/spline/etc.) terms, as sketched below.
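A rough sketch of that refinement, assuming (as the optim() run above suggests) that the Wind and Temp coefficients are positive, so both are pinned at their maxima and only Ozone is optimized:
predval_oz <- function(oz) predval(c(oz, max(aq$Wind), max(aq$Temp)))  # one-dimensional wrapper
optimize(predval_oz, range(aq$Ozone), maximum = TRUE)                  # interior maximum over Ozone only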

Finding non-linear correlations in R

I have about 90 variables stored in data[2-90]. I suspect about 4 of them will have a parabola-like correlation with data[1]. I want to identify which ones have the correlation. Is there an easy and quick way to do this?
I have tried building a model like this (which I could do in a loop for each variable i = 2:90):
y <- data$AvgRating
x <- data$Hamming.distance
x2 <- x^2
quadratic.model = lm(y ~ x + x2)
And then look at the R^2/coefficient to get an idea of the correlation. Is there a better way of doing this?
Maybe R could build a regression model with the 90 variables and choose the significant ones itself? Would that be in any way possible? I can do this in JMP for linear regression, but I'm not sure I could do non-linear regression in R with all the variables at once. That's why I was manually trying to see in advance which ones are correlated. It would be helpful if there were a function for that.
You can use the nlcor package in R. This package finds the nonlinear correlation between two data vectors.
There are other approaches to estimating a nonlinear association, such as mutual information (see the infotheo suggestion below). However, nonlinear correlations between two variables can take any shape.
nlcor is robust to most nonlinear shapes and works well in a variety of scenarios.
At a high level, nlcor works by adaptively segmenting the data into linearly correlated segments. The segment correlations are aggregated to yield the nonlinear correlation. The output is a number between 0 and 1, with values close to 1 meaning high correlation. Unlike a Pearson correlation, negative values are not returned, because the sign has no meaning in nonlinear relationships.
More details about the package are in its documentation on GitHub.
To install nlcor, follow these steps:
install.packages("devtools")
library(devtools)
install_github("ProcessMiner/nlcor")
library(nlcor)
After you install it,
# Implementation
x <- seq(0,3*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l")
# linear correlation is small
cor(x,y)
# [1] 6.488616e-17
# nonlinear correlation is more representative
nlcor(x,y, plt = T)
# $cor.estimate
# [1] 0.9774
# $adjusted.p.value
# [1] 1.586302e-09
# $cor.plot
As shown in the example, the linear correlation is close to zero although there is a clear relationship between the variables, which nlcor can detect.
Note: the order of x and y inside nlcor() matters; nlcor(x, y) is different from nlcor(y, x). Here x and y represent the 'independent' and 'dependent' variables, respectively.
Fitting a generalized additive model will help you identify curvature in the relationships between the explanatory variables and the response. Read the example on page 22 here. A minimal sketch is given below.
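A minimal sketch of that GAM check using mgcv and the variable names from the quadratic.model example above (so data, AvgRating and Hamming.distance are assumed to exist):
library(mgcv)
fit <- gam(AvgRating ~ s(Hamming.distance), data = data)
summary(fit)  # an edf well above 1 for s(Hamming.distance) indicates a non-linear relationship
plot(fit)     # visualise the estimated smooth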
Another option would be to compute a mutual information score between each pair of variables. For example, using the mutinformation() function from the infotheo package, you could do:
set.seed(1)
library(infotheo)
# correlated vars (x & y correlated, z is noise)
x <- seq(-10,10, by=0.5)
y <- x^2
z <- rnorm(length(x))
# list of vectors
raw_dat <- list(x, y, z)
# convert to a dataframe and discretize for mutual information
dat <- matrix(unlist(raw_dat), ncol=length(raw_dat))
dat <- discretize(dat)
mutinformation(dat)
Result:
| | V1| V2| V3|
|:--|---------:|---------:|---------:|
|V1 | 1.0980124| 0.4809822| 0.0553146|
|V2 | 0.4809822| 1.0943907| 0.0413265|
|V3 | 0.0553146| 0.0413265| 1.0980124|
By default, mutinformation() computes the discrete empirical mutual information score between two or more variables. The discretize() step is necessary when you are working with continuous data, as it transforms the data into discrete values.
This might be helpful, at least as a first stab at looking for nonlinear relationships between variables such as the one described above.

extract residuals from aov()

I've run an anova using the following code:
aov2 <- aov(amt.eaten ~ salt + Error(bird / salt),data)
If I use View(aov2) I can see the residuals within the structure of aov2, but I would like to extract them in a way that doesn't involve cutting and pasting. Can someone help me out with the syntax?
The various versions of residuals(aov2) I have tried only produce NULL.
I just learned that you can use proj():
x1 <- gl(8, 4)                   # 8-level treatment factor, 4 replicates each
block <- gl(2, 16)               # 2-level blocking factor
y <- as.numeric(x1) + rnorm(length(x1))
d <- data.frame(block, x1, y)
m <- aov(y ~ x1 + Error(block), d)
m.pr <- proj(m)                  # projections onto each error stratum
m.pr[['Within']][, 'Residuals']  # residuals from the Within stratum
The reason that you cannot extract residuals directly from this model is that you have specified a random-effect structure via Error(bird / salt). Here, each unique combination of bird and salt is treated as a random cluster having its own intercept, but a common additive effect associated with a unit difference in salt on the amount eaten.
I can't see why you would want to specify this as a random effect in this model. But in order to analyse residuals sensibly, you may want to calculate fitted differences in each stratum according to the fitted model and its estimated intercepts. I think this is tedious work and not very informative, however.
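As a hedged alternative (not from this thread): refitting the model with lme4 gives an object whose residuals() method works directly. Treating bird as the random grouping factor, i.e. (1 | bird), is an assumption about what the Error(bird / salt) design intends:
library(lme4)
m2 <- lmer(amt.eaten ~ salt + (1 | bird), data = data)  # 'data' is the asker's data frame
resid(m2)                                               # residuals are available directly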
