Removing variables resulting in singular matrix in R regression model

I've been using mnlogit in R to generate a multivariable logistic regression model. My original set of variables produced a singular-matrix error:
Error in solve.default(hessian, gradient, tol = 1e-24) :
system is computationally singular: reciprocal condition number = 7.09808e-25
It turns out that several "sparse" columns (variables that are 0 for most sampled individuals) cause this singularity. I need a systematic way of removing the variables that lead to the singular matrix while retaining those that allow a regression model to be estimated, i.e. something analogous to using step() to select variables that minimize AIC via stepwise addition, but here removing the variables that make the matrix singular.
Is there some way to do this? Checking each variable by hand (there are several hundred predictor variables) would be incredibly tedious.

If X is the design matrix from your model which you can obtain using
X <- model.matrix(formula, data = data)
then you can find a (non-unique) set of variables that would give you a non-singular model using the QR decomposition. For example,
x <- 1:3
X <- model.matrix(~ x + I(x^2) + I(x^3))
QR <- qr(crossprod(X)) # Get the QR decomposition
vars <- QR$pivot[seq_len(QR$rank)] # Variable numbers
names <- rownames(QR$qr)[vars] # Variable names
names
#> [1] "(Intercept)" "x" "I(x^2)"
This is subject to numerical error and may not agree with whatever code you are using, for two reasons.
First, it doesn't do any weighting, whereas logistic regression is normally fitted by iteratively reweighted least squares.
Second, it might not use the same tolerance as the other code. You can change its sensitivity by changing the tol parameter to qr() from the default 1e-07. Bigger values will cause more variables to be omitted from names.
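To apply this to the original problem, here is a minimal sketch (untested, since the data aren't available; it assumes your model is specified by objects named formula and data with a response named y, and note that dummy columns generated from factors would need to be mapped back to their originating variables before refitting):
X <- model.matrix(formula, data = data)               # full design matrix
QR <- qr(crossprod(X))                                # pivoted QR of X'X, as above
keep <- rownames(QR$qr)[QR$pivot[seq_len(QR$rank)]]   # columns giving a full-rank fit
keep <- setdiff(keep, "(Intercept)")                  # the intercept is implicit in a formula
reduced_formula <- reformulate(keep, response = "y")  # formula to refit the model with
reduced_formula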

Related

Fit multiple linear regression without an intercept with the function lm() in R

Can you please help with this question in R? I need to get more than one predictor:
Fit a multiple linear regression without an intercept with the function lm() to the training data,
using y.train as the response variable and the variables in X.mat.train as
predictors. Look at the vector of estimated coefficients of the model and compare it
graphically with the vector of 'true' values beta.vec
(tip: build a plot of the differences of the absolute values of the estimated and true values).
I have already tried it with the code posted at the end, but it gives me only one predictor, and in this exercise I need more than one.
I think the problem is in the first line, but I couldn't find a way to fix it.
I can't post the data set here because it is large, but I have a vector y.train with 190 observations and a matrix X.mat.train with 190 rows. The model should give more than one predictor, but for me it gives only one.
simple.fit = lm(y.train ~ 0 + X.mat.train)  # goal variable ~ no intercept + predictors
summary(simple.fit)  # show the linear regression output
plot(simple.fit)
abline(simple.fit)
n <- summary(simple.fit)$coefficients
estimated_coeff <- n[ , 1]
estimated_coeff
plot(estimated_coeff)
#Coefficients: X.mat.train 0.5018
v <- sum(beta.vec)
#0.5369
plot(beta.vec)
plot(beta.vec, simple.fit)
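No answer was posted for this one, but for reference, here is a minimal sketch of the intended workflow, with simulated data standing in for the real y.train, X.mat.train and beta.vec (which aren't available). lm() returns one coefficient per column of the predictor matrix, provided X.mat.train really is a numeric matrix, so coercing it with as.matrix() is the usual fix when it has been read in as something else.
set.seed(1)
X.mat.train <- matrix(rnorm(190 * 5), nrow = 190,
                      dimnames = list(NULL, paste0("x", 1:5)))  # simulated predictors
beta.vec <- c(1, -0.5, 2, 0, 0.3)                               # simulated 'true' coefficients
y.train <- drop(X.mat.train %*% beta.vec + rnorm(190))          # simulated response
X.mat.train <- as.matrix(X.mat.train)  # make sure it is a matrix, not a data frame
fit <- lm(y.train ~ 0 + X.mat.train)   # no intercept; one coefficient per column
estimated_coeff <- coef(fit)
plot(abs(estimated_coeff) - abs(beta.vec),                      # the 'tip' from the exercise
     xlab = "coefficient index", ylab = "abs(estimated) - abs(true)")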

Calculating VIF for ordinal logistic regression & multicollinearity in R

I am running an ordinal regression model. I have 8 explanatory variables: 4 of them categorical (coded 0 or 1) and 4 continuous. Beforehand I want to be sure there's no multicollinearity, so I use the variance inflation factor (the vif function from the car package):
mod1<-polr(Y ~ X1+X2+X3+X4+X5+X6+X7+X8, Hess = T, data=df)
vif(mod1)
but I get a VIF value of 125 for one of the variables, as well as the following warning:
Warning message: In vif.default(mod1) : No intercept: vifs may not be sensible.
However, when I convert my dependent variable to numeric (instead of a factor) and do the same thing with a linear model:
mod2<-lm(Y ~ X1+X2+X3+X4+X5+X6+X7+X8, data=df)
vif(mod2)
This time all the VIF values are below 3, suggesting that there's no multicollinearity.
I am confused about the vif function. How can it return VIFs > 100 for one model and low VIFs for another? Should I stick with the second result and still fit an ordinal model anyway?
The vif() function uses determinants of the correlation matrix of the parameters (and subsets thereof) to calculate the VIF. In a linear model, this includes just the regression coefficients, excluding the intercept.
The vif() function wasn't intended to be used with ordered logit models. So, when it finds the variance-covariance matrix of the parameters, it includes the threshold parameters (i.e., intercepts), which would normally be excluded by the function in a linear model. This is why you get the warning you get: it doesn't know to look for threshold parameters and remove them.
Since the VIF is really a function of inter-correlations in the design matrix (which doesn't depend on the dependent variable or the non-linear mapping from the linear predictor into the space of the response variable, i.e., the link function in a glm), you should get the right answer with your second solution above, using lm() with a numeric version of your dependent variable.
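To convince yourself of that last point, here is a minimal sketch (assuming a data frame df containing numeric or dummy-coded columns X1 through X8, as in the question): the classical VIFs can be computed directly from the predictors as the diagonal of the inverse of their correlation matrix, with no response variable involved at all.
X <- model.matrix(~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = df)[, -1]  # drop the intercept column
vifs <- diag(solve(cor(X)))  # VIF_j is the j-th diagonal element of the inverse correlation matrix
vifs                         # should agree with car::vif() applied to the lm() fit above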

Memory issue when using factor for fixed effect regression

I am working with a data set with 900,000 observations. There is a categorical variable x with 966 unique values that needs to be used as fixed effects. I am including the fixed effects using factor(x) in the regression, and it gives me an error like this:
Error: cannot allocate vector of size 6.9 Gb
How can I fix this error, or do I need to do something different in the regression for the fixed effects? And how do I then run a regression like this:
rlm(y~x+ factor(fe), data=pd)
The set of dummy variables constructed from a factor has very low information content. For example, considering only the columns of your model matrix corresponding to your 966-level categorical predictor, each row contains exactly one 1 and 965 zeros.
Thus you can generally save a lot of memory by constructing a sparse model matrix using Matrix::sparse.model.matrix() (or MatrixModels::model.Matrix(*, sparse=TRUE) as suggested by the sparse.model.matrix documentation). However, to use this it's necessary for whatever regression machinery you're using to accept a model matrix + response vector rather than requiring a formula (for example, to do linear regression you would need sparse.model.matrix + lm.fit rather than being able to use lm).
In contrast to @RuiBarradas's estimate of 3.5 Gb for a dense model matrix:
m <- Matrix::sparse.model.matrix(~x,
data=data.frame(x=factor(sample(1:966,size=9e5,replace=TRUE))))
format(object.size(m),"Mb")
## [1] "75.6 Mb"
If you are using the rlm function from the MASS package, something like this should work:
library(Matrix)
library(MASS)
mm <- sparse.model.matrix(~x + factor(fe), data=pd)
rlm(y=pd$y, x=mm, ...)
Note that I haven't actually tested this (you didn't give a reproducible example); this should at least get you past the step of creating the model matrix, but I don't know if rlm() does any internal computations that would break and/or make the model matrix non-sparse.

Mixed model starting values for lme4

I am trying to fit a mixed model using the lmer function from the lme4 package. However, I do not understand what should be input to the start parameter.
My aim is to fit a simple linear regression first and use the coefficients estimated there as starting values for the mixed model.
Let's say that my model is the following:
linear_model = lm(y ~ x1 + x2 + x3, data = data)
coef = summary(linear_model)$coefficients[- 1, 1] #I remove the intercept
result = lmer(y ~ x1 + x2 + x3 | x1 + x2 + x3, data = data, start = coef)
This example is an oversimplified version of what I am doing since I won't be able to share my data.
Then I get the following kind of error:
Error during wrapup: incorrect number of theta components (!=105) #105 is the value I get from the real regression I am trying to fit.
I have tried many different solutions, such as providing a list and naming those values theta, as I saw suggested on some forums.
Also, the GitHub code tests whether the length is appropriate, but I can't work out what it refers to:
# Assign the start value to theta
if (is.numeric(start)) {
theta <- start
}
# Check the length of theta
length(theta)!=length(pred$theta)
However, I can't find where pred$theta is defined, so I don't understand where that value of 105 is coming from.
Any help?
A few points:
lmer doesn't in fact fit any of the fixed-effect coefficients explicitly; these are profiled out so that they are solved for implicitly at each step of the nonlinear estimation process. The estimation involves only a nonlinear search over the variance-covariance parameters. This is detailed (rather technically) in one of the lme4 vignettes (eqs. 30-31, p. 15). Thus providing starting values for the fixed-effect coefficients is impossible, and useless ...
glmer does fit fixed-effects coefficients explicitly as part of the nonlinear optimization (as @G.Grothendieck discusses in comments), if nAGQ > 0 ...
it's admittedly rather obscure, but the starting values for the theta parameters (the only ones that are explicitly optimized in lmer fits) are 0 for the off-diagonal elements of the Cholesky factor, 1 for the diagonal elements: this is coded here
ll$theta[] <- is.finite(ll$lower) # initial values of theta are 0 off-diagonal, 1 on
... where you need to know further that, upstream, the values of the lower vector have been coded so that elements of the theta vector corresponding to diagonal elements have a lower bound of 0, off-diagonal elements have a lower bound of -Inf; this is equivalent to starting with an identity matrix for the scaled variance-covariance matrix (i.e., the variance-covariance matrix of the random-effects parameters divided by the residual variance), or a random-effects variance-covariance matrix of (sigma^2 I).
If you have several random effects and big variance-covariance matrices for each, things can get a little hairy. If you want to recover the starting values that lmer will use by default you can use lFormula() as follows:
library(lme4)
ff <- lFormula(Reaction~Days+(Days|Subject),sleepstudy)
(lwr <- ff$reTrms$lower)
## [1] 0 -Inf 0
ifelse(lwr==0,1,0) ## starting values
## [1] 1 0 1
For this model, we have a single 2x2 random-effects variance-covariance matrix. The theta parameters correspond to the lower-triangle Cholesky factor of this matrix, in column-wise order, so the first and third elements are diagonal, and the second element is off-diagonal.
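If you do want to supply these starting values explicitly, here is a minimal sketch continuing the sleepstudy example above; lmer() accepts a list with a theta component as its start argument.
start_theta <- ifelse(lwr == 0, 1, 0)  # the default starting values recovered above
fit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
            start = list(theta = start_theta))
getME(fit, "theta")  # the optimized theta, for comparison with the starting values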
The fact that you have 105 theta parameters worries me; fitting such a large random-effects model will be extremely slow and take an enormous amount of data to fit reliably. (If you know your model makes sense and you have enough data you might want to look into faster options, such as using Doug Bates's MixedModels package for Julia or possibly glmmTMB, which might scale better than lme4 for problems with large theta vectors ...)
your model formula, y ~ x1 + x2 + x3 | x1 + x2 + x3, seems very odd. I can't figure out any context in which it would make sense to have the same variables as random-effect terms and grouping variables in the same model!

Set random forest to classification

I am attempting a random forest on some data where the class variable is binary (either 1 or 0). Here is the code I'm running:
forest.model <- randomForest(x = ticdata2000[,1:85], y = ticdata2000[,86],
                             ntree=500,
                             mtry=9,
                             importance=TRUE,
                             norm.votes=TRUE,
                             na.action=na.roughfix,
                             replace=FALSE)
But when the forest finishes, I get the following warning:
Warning message:
In randomForest.default(x = ticdata2000[, 1:85], y = ticdata2000[, :
The response has five or fewer unique values. Are you sure you want to do regression?
The answer, of course, is no. I don't want to do regression: I have a single discrete variable that only takes on 2 classes. And when I run predictions with this model, I get continuous numbers when I want a list of zeroes and ones. Can someone tell me what I'm doing wrong that makes this run as regression rather than classification?
Change your response column to a factor using as.factor (or just factor). Since you've stored that variable as numeric 0's and 1's, R rightly interprets it as a numeric variable. If you want R to treat it differently, you have to tell it so.
This is mentioned in the documentation under the y argument:
A response vector. If a factor, classification is assumed, otherwise
regression is assumed. If omitted, randomForest will run in
unsupervised mode.
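A minimal sketch of that fix, reusing the call from the question (untested, since the real ticdata2000 data isn't available):
library(randomForest)
forest.model <- randomForest(x = ticdata2000[, 1:85],
                             y = as.factor(ticdata2000[, 86]),  # factor response => classification
                             ntree = 500, mtry = 9,
                             importance = TRUE, norm.votes = TRUE,
                             na.action = na.roughfix, replace = FALSE)
predict(forest.model)  # out-of-bag predictions are now class labels ("0"/"1"), not numbers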
