I am fitting mixed models on a large dataset (500,000 rows).
My model formula looks like this:
Y ~ 0 + num1:factor1 + num1:factor2 + num2:factor3 + factor4 + (0 + num3|subject) + (0 + num4|subject) + (1|subject),
where the num* variables are numeric and the factor* variables are categorical (factors).
Since the categorical variables have many unique levels, the fixed-effects matrix is very sparse (sparsity ~0.9).
Fitting the model when this matrix is handled as dense requires a lot of time and RAM.
I had the same problem with linear regression.
My dense matrix was 20 GB, but when I converted it to sparse format it shrank to only 35 MB.
So I stopped using the lm function and instead used two other functions:
sparse.model.matrix (to create a sparse model/design matrix) and
MatrixModels:::lm.fit.sparse (to fit a sparse matrix and calculate coefficients).
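For reference, my linear-regression workflow looked roughly like this (a sketch with made-up data standing in for my real file):
library(Matrix)

## Made-up data standing in for the real file: one many-level factor.
df <- data.frame(y = rnorm(1000),
                 f = factor(sample(paste0("lev", 1:200), 1000, replace = TRUE)))

## Sparse design matrix instead of the dense one model.matrix() would build.
X <- sparse.model.matrix(~ 0 + f, data = df)

## Coefficients from a fitter that accepts a sparse X directly.
beta <- MatrixModels:::lm.fit.sparse(X, df$y)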
Can I apply a similar approach to mixed models?
What functions / packages can I use to implement this?
That is, my main question is: can mixed models be fitted with sparse matrices at all?
What functions should I use to create X and Z sparse model matrices?
Then, which function should I use for fitting the model with sparse matrices to get coefficients?
I would be very grateful for any help with this!
As of version 1.0.2.1 on CRAN, glmmTMB has a sparseX argument:
sparseX: a named logical vector containing (possibly) elements named
"cond", "zi", "disp" to indicate whether fixed-effect model
matrices for particular model components should be generated
as sparse matrices, e.g. ‘c(cond=TRUE)’. Default is all
‘FALSE’
You would probably want glmmTMB([formula], [data], sparseX=c(cond=TRUE)) (glmmTMB uses family="gaussian" by default).
glmmTMB is not quite as fast for linear mixed models as lme4: I don't know what your mileage will be (but I will be interested to hear). There is also some discussion here about how to hack the equivalent of sparse fixed-effect model matrices in lme4 (by letting the many-level factor be a random effect with a large fixed variance).
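For example, a minimal sketch (using lme4's sleepstudy data just to show where the argument goes; that model matrix isn't actually sparse, so you wouldn't see a memory benefit here):
library(glmmTMB)
data("sleepstudy", package = "lme4")

## family = gaussian() is the default, so this is a linear mixed model;
## sparseX = c(cond = TRUE) makes the conditional-model X matrix sparse.
fit <- glmmTMB(Reaction ~ Days + (Days | Subject),
               data = sleepstudy,
               sparseX = c(cond = TRUE))
fixef(fit)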
Related
I am working with a dataset of 900,000 observations. There is a categorical variable x with 966 unique values that needs to be used as fixed effects. I am including the fixed effects using factor(x) in the regression. It gives me an error like this:
Error: cannot allocate vector of size 6.9 Gb
How can I fix this error? Or do I need to do something different in the regression for the fixed effects?
Then, how do I run a regression like this:
rlm(y ~ x + factor(fe), data = pd)
The set of dummy variables constructed from a factor has very low information content. For example, considering only the columns of your model matrix corresponding to your 966-level categorical predictor, each row contains exactly one 1 and 965 zeros.
Thus you can generally save a lot of memory by constructing a sparse model matrix using Matrix::sparse.model.matrix() (or MatrixModels::model.Matrix(*, sparse=TRUE) as suggested by the sparse.model.matrix documentation). However, to use this, whatever regression machinery you're using must accept a model matrix plus a response vector rather than requiring a formula (for example, for linear regression you would need sparse.model.matrix plus a matrix-based fitter such as MatrixModels:::lm.fit.sparse, rather than being able to use lm).
In contrast to @RuiBarradas's estimate of 3.5 Gb for a dense model matrix:
m <- Matrix::sparse.model.matrix(
  ~ x,
  data = data.frame(x = factor(sample(1:966, size = 9e5, replace = TRUE))))
format(object.size(m), "Mb")
## [1] "75.6 Mb"
If you are using the rlm function from the MASS package, something like this should work:
library(Matrix)
library(MASS)
mm <- sparse.model.matrix(~ x + factor(fe), data = pd)
rlm(y = pd$y, x = mm, ...)
Note that I haven't actually tested this (you didn't give a reproducible example); this should at least get you past the step of creating the model matrix, but I don't know if rlm() does any internal computations that would break and/or make the model matrix non-sparse.
I've been using mnlogit in R to fit a multinomial logistic regression model. My original set of variables generated a singular-matrix error, i.e.
Error in solve.default(hessian, gradient, tol = 1e-24) :
system is computationally singular: reciprocal condition number = 7.09808e-25
It turns out that several "sparse" columns (variables that are 0 for most sampled individuals) cause this singularity error. I need a systematic way of removing those variables that lead to a singularity error while retaining those that allow estimation of a regression model, i.e. something analogous to the use of the function step to select variables minimizing AIC via stepwise addition, but this time removing variables that generate singular matrices.
Is there some way to do this, since checking each variable by hand (there are several hundred predictor variables) would be incredibly tedious?
If X is the design matrix from your model which you can obtain using
X <- model.matrix(formula, data = data)
then you can find a (non-unique) set of variables that would give you a non-singular model using the QR decomposition. For example,
x <- 1:3
X <- model.matrix(~ x + I(x^2) + I(x^3))
QR <- qr(crossprod(X)) # Get the QR decomposition
vars <- QR$pivot[seq_len(QR$rank)] # Variable numbers
names <- rownames(QR$qr)[vars] # Variable names
names
#> [1] "(Intercept)" "x" "I(x^2)"
This is subject to numerical error and may not agree with whatever code you are using, for two reasons.
First, it doesn't do any weighting, whereas logistic regression normally uses iteratively reweighted regression.
Second, it might not use the same tolerance as the other code. You can change its sensitivity by changing the tol parameter to qr() from the default 1e-07. Bigger values will cause more variables to be omitted from names.
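For instance, continuing the toy example above:
## A larger tolerance declares more columns numerically collinear,
## so fewer survive into 'names'.
QR2 <- qr(crossprod(X), tol = 1e-2)
rownames(QR2$qr)[QR2$pivot[seq_len(QR2$rank)]]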
As the title says, I am trying to extract matrices from an lme4 (or other packages?) object. To make clear precisely what I want, it is easiest to refer to the SAS documentation: https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect022.htm
Variance-covariance matrix of random effects
In SAS notation this matrix is called G and is the variance-covariance matrix of the random effect parameter gamma. By using the option "G" in PROC MIXED and the Output Delivery System you obtain G as a matrix.
I am aware that it is relatively simple to construct this matrix manually once I have the variance components and dimensions of gamma. I nevertheless expected there to be an even simpler way.
Mixed model equations solution
In SAS notation these are called C.
By using the option "MMEQSOL" in PROC MIXED and the Output Delivery System you request that a solution to the mixed model equations be produced, as well as the inverted coefficients matrix. It is the latter that I am interested in.
Thanks in advance!
Not a very sensible model (see ?lme4::cake), but reasonable for illustration:
library(lme4)
fm1 <- lmer(angle ~ temperature +
              (1 | recipe) + (1 | replicate),
            data = cake)
The VarCorr() method gives a list of variance-covariance matrices for each term (in this case each one is 1x1), with its own print method:
v <- VarCorr(fm1)
You can combine these into a single matrix by using the bdiag() (block-diagonal) function from Matrix (as.matrix() converts from a sparse matrix to a standard (dense) R matrix object).
as.matrix(Matrix::bdiag(v))
##          [,1]      [,2]
## [1,] 39.21541 0.0000000
## [2,]  0.00000 0.4949681
The C matrix is unfortunately not so easy to get. As discussed in vignette("lmer",package="lme4"), lme4 doesn't use the Henderson-equation formulation. The upper block of C (variance-covariance matrix of fixed effects) is accessible via vcov(), but the variance-covariance matrix of the variances is not so easy: see e.g. here.
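If you do want the full G (the covariance matrix of the whole stacked random-effects vector, in SAS's notation) rather than the per-term blocks, one possible route, sketched here using the notation of the lmer vignette, is lme4's relative covariance factor:
## G = sigma^2 * Lambda %*% t(Lambda); getME() returns the transpose Lambdat.
s2 <- sigma(fm1)^2
Lt <- getME(fm1, "Lambdat")
G  <- s2 * crossprod(Lt)  # crossprod(Lt) = Lambda %*% t(Lambda)
dim(G)                    # q x q, one row/column per random effect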
I'm new to R and statistical modelling, and am looking to use the lmmlasso package in R to fit a mixed-effects model, selecting only the best fixed effects out of ~300 possible variables.
For this model I'd like to include a fixed intercept, a random effect, and a random intercept. Looking at the manual on CRAN, I've come across the following:
x: matrix of dimension ntot x p including the fixed-effects
covariables. An intercept has to be included in the first column as
(1,...,1).
z: random effects matrix of dimension ntot x q. It has to be a matrix,
even if q=1.
While it's obvious what I need to do for the fixed intercept, I'm not quite sure how to include both a random intercept and a random slope. Is it exactly the same as the fixed matrix, where I include (1,...,1) in the first column, as in the sketch below?
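Here is my current guess (a hypothetical sketch; df is my data frame, v the random-slope variable, and fixed_vars the names of my candidate fixed effects, all placeholder names):
## Placeholder names throughout; this is just my guess at the layout.
x <- cbind(1, as.matrix(df[, fixed_vars]))  # fixed intercept in column 1
z <- cbind(1, df$v)                         # random intercept + random slope in v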
In addition to this, I'm looking to validate the resulting model I get with another dataset. For lmmlasso is there a function similar to predict in lme4 that can be used to compute new predictions based on the output I get? Alternatively, is it viable/correct to construct a new model using lmer using the variables with non-zero coefficients returned by lmmlasso, and then use predict on the new model?
Thanks in advance.
I'm currently going through the 'Introduction to Statistical Learning' MOOC by Stanford OpenX. In one of the lab exercises, it suggests creating a model matrix from the test data by explicitly using model.matrix().
Extract from textbook
We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.
test.mat = model.matrix(Salary ~ ., data = Hitters[test, ])
The model.matrix() function is used in many regression packages for building an X matrix from data. Now we run a loop, and for each size i, we extract the coefficients from regfit.best for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.
val.errors = rep(NA, 19)
for (i in 1:19) {
  coefi = coef(regfit.best, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((Hitters$Salary[test] - pred)^2)
}
I understand that model.matrix() expands factor (string) variables into numeric dummy columns, one per non-baseline level, and that functions like lm() do this conversion under the hood.
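For example, a tiny illustration of that expansion:
d <- data.frame(y = c(1, 2, 3), g = factor(c("a", "b", "b")))
model.matrix(y ~ g, data = d)
##   (Intercept) gb
## 1           1  0
## 2           1  1
## 3           1  1
## (attributes omitted)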
However, what are the instances that we would explicitly use model.matrix(), and why?