I am working with a data set of 900,000 observations. There is a categorical variable x with 966 unique values that needs to be used as fixed effects. I am including the fixed effects using factor(x) in the regression, which gives me this error:
Error: cannot allocate vector of size 6.9 Gb
How can I fix this error? Or do I need to do something different in the regression for fixed effects?
Then, how do I run a regression like this:
rlm(y~x+ factor(fe), data=pd)
The set of dummy variables constructed from a factor has very low information content. For example, considering only the columns of your model matrix corresponding to your 966-level categorical predictor, each row contains exactly one 1 and 965 zeros.
Thus you can generally save a lot of memory by constructing a sparse model matrix using Matrix::sparse.model.matrix() (or MatrixModels::model.Matrix(*, sparse=TRUE) as suggested by the sparse.model.matrix documentation). However, to use this it's necessary for whatever regression machinery you're using to accept a model matrix + response vector rather than requiring a formula (for example, to do linear regression you would need sparse.model.matrix plus a matrix-based fitter in the style of lm.fit, such as the unexported MatrixModels:::lm.fit.sparse, rather than being able to use lm).
In contrast to @RuiBarradas's estimate of 3.5Gb for a dense model matrix:
m <- Matrix::sparse.model.matrix(~x,
data=data.frame(x=factor(sample(1:966,size=9e5,replace=TRUE))))
format(object.size(m),"Mb")
## [1] "75.6 Mb"
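To make the "model matrix + response vector" point concrete, here is a minimal sketch of an ordinary least-squares fit done directly on a sparse model matrix by solving the normal equations. The simulated data, the names (d, fe, beta) and the choice of 50 levels are my own assumptions for illustration. It is plain OLS, not a robust fit, and it is compared against lm() only because this toy problem is small enough to fit densely.

```r
library(Matrix)

set.seed(1)
n <- 1e4
# Hypothetical toy data: one numeric predictor and a 50-level factor
d <- data.frame(x  = rnorm(n),
                fe = factor(sample(1:50, n, replace = TRUE)))
d$y <- 2 * d$x + as.numeric(d$fe) / 10 + rnorm(n)

# Sparse design matrix: intercept, x, and 49 dummy columns
X <- sparse.model.matrix(~ x + fe, data = d)

# Solve the normal equations (X'X) beta = X'y without densifying X
XtX  <- crossprod(X)
Xty  <- crossprod(X, d$y)
beta <- as.vector(solve(XtX, Xty))
names(beta) <- colnames(X)

# Dense reference fit, feasible only because the example is small
fit <- lm(y ~ x + fe, data = d)
max_diff <- max(abs(beta - unname(coef(fit))))
```

On a problem of the size in the question you would skip the lm() comparison; the point is only that the coefficients can be obtained from the sparse matrix and the response alone.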
If you are using the rlm function from the MASS package, something like this should work:
library(Matrix)
library(MASS)
mm <- sparse.model.matrix(~x + factor(fe), data=pd)
rlm(y=pd$y, x=mm, ...)
Note that I haven't actually tested this (you didn't give a reproducible example); this should at least get you past the step of creating the model matrix, but I don't know if rlm() does any internal computations that would break and/or make the model matrix non-sparse.
I use mixed models on a large file (500000 rows).
My model formula looks like this:
Y ~ 0 + num1:factor1 + num1:factor2 + num2:factor3 + factor4 + (0 + num3|subject) + (0 + num4|subject) + (1|subject),
where num - numeric variables; factor - categorical variables/factors.
Since categorical variables have many unique levels, the fixed effects matrix is very sparse (sparsity ~0.9).
Fitting such a matrix when it is handled as dense requires a lot of time and RAM.
I had the same problem with linear regression.
My dense matrix was 20GB, but when I converted it to sparse it became only 35 MB.
So I gave up on the lm function and instead used two other functions:
sparse.model.matrix (to create a sparse model/design matrix) and
MatrixModels:::lm.fit.sparse (to fit a sparse matrix and calculate coefficients).
Can I apply a similar approach to mixed models?
What functions / packages can I use to implement this?
That is, my main question is whether it is possible to implement mixed models with sparse matrices?
What functions should I use to create X and Z sparse model matrices?
Then, which function should I use for fitting the model with sparse matrices to get coefficients?
I would be very grateful for any help with this!
As of version 1.0.2.1 on CRAN, glmmTMB has a sparseX argument:
sparseX: a named logical vector containing (possibly) elements named
"cond", "zi", "disp" to indicate whether fixed-effect model
matrices for particular model components should be generated
as sparse matrices, e.g. ‘c(cond=TRUE)’. Default is all
‘FALSE’
You would probably want glmmTMB([formula], [data], sparseX=c(cond=TRUE)) (glmmTMB uses family="gaussian" by default).
glmmTMB is not quite as fast for linear mixed models as lme4 is: I don't know what your mileage will be (but I will be interested to hear). There is also some discussion here about how to hack the equivalent of sparse model matrices in lme4 (by letting the many-level factor be a random effect with a large fixed variance).
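On the narrower question of how to build sparse X and Z matrices by hand, the following is a rough sketch using only sparse.model.matrix. The toy data and the names (Z_int, Z_slope) are hypothetical, and the layout of Z (one block for random intercepts, one for random slopes) is my assumption about what a downstream fitter would want. This only constructs the matrices; it does not fit the mixed model.

```r
library(Matrix)

set.seed(42)
n <- 2000
# Hypothetical toy data with one many-level factor and a grouping factor
d <- data.frame(num1    = rnorm(n),
                num3    = rnorm(n),
                factor1 = factor(sample(1:100, n, replace = TRUE)),
                subject = factor(sample(1:200, n, replace = TRUE)))

# Sparse fixed-effects matrix for a term like num1:factor1
X <- sparse.model.matrix(~ 0 + num1:factor1, data = d)

# Sparse random-effects blocks: indicator columns per subject (random
# intercepts) and num3 times those indicators (random slopes)
Z_int   <- sparse.model.matrix(~ 0 + subject, data = d)
Z_slope <- sparse.model.matrix(~ 0 + num3:subject, data = d)
Z <- cbind(Z_int, Z_slope)

# Memory footprint of the sparse Z versus its dense equivalent
size_sparse <- as.numeric(object.size(Z))
size_dense  <- as.numeric(object.size(as.matrix(Z)))
```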
I've been using mnlogit in R to generate a multivariable logistic regression model. My original set of variables generated a singular matrix error, i.e.
Error in solve.default(hessian, gradient, tol = 1e-24) :
system is computationally singular: reciprocal condition number = 7.09808e-25
It turns out that several "sparse" columns (variables that are 0 for most sampled individuals) cause this singularity error. I need a systematic way of removing those variables that lead to a singularity error while retaining those that allow estimation of a regression model, i.e. something analogous to the use of the function step to select variables minimizing AIC via stepwise addition, but this time removing variables that generate singular matrices.
Is there some way to do this, since checking each variable by hand (there are several hundred predictor variables) would be incredibly tedious?
If X is the design matrix from your model which you can obtain using
X <- model.matrix(formula, data = data)
then you can find a (non-unique) set of variables that would give you a non-singular model using the QR decomposition. For example,
x <- 1:3
X <- model.matrix(~ x + I(x^2) + I(x^3))
QR <- qr(crossprod(X)) # Get the QR decomposition
vars <- QR$pivot[seq_len(QR$rank)] # Variable numbers
names <- rownames(QR$qr)[vars] # Variable names
names
#> [1] "(Intercept)" "x" "I(x^2)"
This is subject to numerical error and may not agree with whatever code you are using, for two reasons.
First, it doesn't do any weighting, whereas logistic regression normally uses iteratively reweighted regression.
Second, it might not use the same tolerance as the other code. You can change its sensitivity by changing the tol parameter to qr() from the default 1e-07. Bigger values will cause more variables to be omitted from names.
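As a slightly larger illustration of the same idea, here is a sketch on made-up data containing an exact linear combination and an all-zero column. The data frame and the names keep and dropped are assumptions of mine, and unlike the snippet above it calls qr() on X itself rather than on crossprod(X); the same caveats about weighting and tolerance apply.

```r
set.seed(1)
n <- 50
# Hypothetical predictors: c duplicates information in a and b, z is all zero
d <- data.frame(a = rnorm(n), b = rnorm(n))
d$c <- d$a + d$b
d$z <- 0

X  <- model.matrix(~ a + b + c + z, data = d)
QR <- qr(X)                                   # default tol = 1e-07

# Columns that span the column space, and the ones left over
keep    <- colnames(X)[QR$pivot[seq_len(QR$rank)]]
dropped <- setdiff(colnames(X), keep)

# Rebuild a right-hand-side formula from the retained predictors
rhs <- reformulate(setdiff(keep, "(Intercept)"))
```

Here dropped holds the candidates for removal, and rhs can be combined with your response to refit the model.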
I'm using the glmnet package and need to run several LASSO analyses to calibrate a large number of variables (% reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I would like to resolve. My provisional code is shown below:
First I split my data into training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model on the training set, then extract and write out the non-zero coefficients at lambda.min. I do this because one of my concerns is to see which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the predict function to apply the object cv.lasso.1 (the model obtained previously) to the variables of the testing set (x.test) to get predictions, and I run the correlation between the predicted and the actual values of y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx = x.test, type = "response", s = "lambda.min")
cor.test(x = c(predict.1.2), y = c(y.test))
This is simplified code and I have had no problems so far. The point is that I would like to loop the whole code (one hundred repetitions) and, for each repetition, get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs. actual values for the testing set. I've tried but couldn't get any clear results. Can someone give me a hint?
thanks!
In general, running the same type of analysis over and over on the same data can be tricky, and in your case it may not be necessary in the way you have outlined.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to find the directions that capture the most variation within and between your predictors. PCA does not consider your outcome at all, though, so what it keeps are the least correlated combinations of your data, which are not necessarily predictive. You should therefore be well aware of all the variables in the set. This is a way of reducing the dimensionality of your data before a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
center = TRUE,
scale. = TRUE)
Scaling and centering are essential to making these models work right: they put your variables on a common scale by setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those as they are. If you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors; keep your target/outcome variable out of the data set.
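As a rough illustration of what comes out of prcomp, here is a sketch on simulated predictors. The data, the 90% threshold and the names var_explained, k and scores are my own assumptions, not something from your data.

```r
set.seed(1)
# Hypothetical predictor matrix: 5 columns, the second nearly copies the first
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
X[, 2] <- X[, 1] + rnorm(200, sd = 0.1)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Smallest number of components reaching 90% of the variance (arbitrary cutoff)
k <- which(cumsum(var_explained) >= 0.9)[1]

# Component scores to use as predictors in a later regression
scores <- pca$x[, seq_len(k), drop = FALSE]
```

The scores are linear combinations of all original columns, so this reduces dimensionality but does not tell you which individual wavebands matter.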
If you have a classification problem with a lot of data, try LDA (Linear Discriminant Analysis), which reduces the variables to the combinations that best separate the classes of the OUTCOME variable; that is, it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ .,
data = yourdata)
You can also set the prior probabilities in LDA if you know the global probability of each class, or you can leave them out, and lda will estimate them from the class proportions in the training set. You can read about that here:
LDA from MASS package
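For what setting the priors looks like in practice, here is a small sketch on the built-in iris data. The prior values are arbitrary numbers picked for illustration, not a recommendation.

```r
library(MASS)

# Priors are given in the order of the factor levels of the outcome;
# these particular values are made up for the example
fit <- lda(Species ~ ., data = iris, prior = c(0.5, 0.25, 0.25))

# Predicted classes and the share of rows classified correctly
# (evaluated on the training data, so optimistic)
pred     <- predict(fit, newdata = iris)$class
accuracy <- mean(pred == iris$Species)
```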
So this gets you headed in the right direction for reducing the complexity of your data via feature selection in a computationally sound way. Building the most robust model via repeated model building is known as cross-validation. There is a cv.glm function in the boot package that can help you take care of this in a safe way.
You can use the following as a rough guide:
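require(boot)
# assumes yourData holds the outcome column plus all predictor columns
yourGLM <- glm(outcome ~ ., data = yourData, family = gaussian)
yourCVGLM <- cv.glm(data = yourData, glmfit = yourGLM, K = 100)
Here K=100 specifies 100-fold cross-validation: the OBSERVATIONS (not the variables) are randomly split into 100 groups, and the model is refit 100 times, each time leaving one group out for testing.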
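A self-contained sketch of the cv.glm function from boot on simulated data may help show what comes back. The data frame, the formula and K = 10 are assumptions chosen to keep the example small.

```r
library(boot)

set.seed(1)
# Hypothetical data: y depends on x1 only, x2 is noise
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)

fit <- glm(y ~ x1 + x2, data = d, family = gaussian)

# 10-fold cross-validation over the observations
cv <- cv.glm(data = d, glmfit = fit, K = 10)

# delta[1] is the raw cross-validated prediction error (mean squared error
# by default), delta[2] a bias-adjusted version
cv_error <- cv$delta
```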
So the process is twofold: reduce the variables using one of the two methods above, then use cross-validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping; it is powerful and available for many different model types.
Not as much code as you might hope for, but it should point you in a decent direction.
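If you do still want the repeated train/test split described in the question, a loop along these lines is one way to collect the selected variables and the test correlation per repetition. This is a sketch on simulated data: the matrix x, the column names w1, w2, ..., and n_reps = 10 are placeholders of mine (set n_reps to 100 and substitute your own x and y).

```r
library(glmnet)

set.seed(123)
n <- 120; p <- 30
# Hypothetical predictors; only the first three truly affect y
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("w", seq_len(p))))
y <- drop(x[, 1:3] %*% c(2, -1.5, 1)) + rnorm(n)

n_reps  <- 10
results <- vector("list", n_reps)

for (r in seq_len(n_reps)) {
  train_ind <- sample(n, floor(0.7 * n))          # fresh 70/30 split
  fit <- cv.glmnet(x[train_ind, ], y[train_ind],
                   family = "gaussian", nfolds = 5, alpha = 1)

  b  <- coef(fit, s = "lambda.min")[, 1]           # named numeric vector
  nz <- setdiff(names(b)[b != 0], "(Intercept)")  # selected variables

  pred <- predict(fit, newx = x[-train_ind, ], s = "lambda.min")
  results[[r]] <- list(selected = nz,
                       cor = cor(as.vector(pred), y[-train_ind]))
}

# Test-set correlations and how often each variable was selected
cors     <- vapply(results, function(z) z$cor, numeric(1))
sel_freq <- sort(table(unlist(lapply(results, `[[`, "selected"))),
                 decreasing = TRUE)
```

Looking at sel_freq across repetitions gives a feel for how stable the selection is, which a single split cannot show.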
I'm new to R and statistical modelling, and am looking to use the lmmlasso library in R to fit a mixed effects model, selecting only the best fixed effects out of ~300 possible variables.
For this model I'd like to include a fixed intercept, a random effect, and a random intercept. Looking at the manual on CRAN, I've come across the following:
x: matrix of dimension ntot x p including the fixed-effects
covariables. An intercept has to be included in the first column as
(1,...,1).
z: random effects matrix of dimension ntot x q. It has to be a matrix,
even if q=1.
While it's obvious what I need to do for the fixed intercept, I'm not quite sure how to include both a random intercept and a random effect. Is it exactly the same as for the fixed matrix, where I include (1,...,1) in my first column?
In addition to this, I'm looking to validate the resulting model I get with another dataset. For lmmlasso is there a function similar to predict in lme4 that can be used to compute new predictions based on the output I get? Alternatively, is it viable/correct to construct a new model using lmer using the variables with non-zero coefficients returned by lmmlasso, and then use predict on the new model?
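To make the matrix question concrete, this is how I would sketch building the two inputs with model.matrix, under the assumption (not confirmed by the manual excerpt above) that z follows the same convention as x, i.e. a column of ones stands for the random intercept and each further column for a random slope. The data frame, the variable names and the grp vector are all hypothetical, and nothing here actually calls lmmlasso.

```r
set.seed(1)
# Hypothetical grouped data: 10 groups with 5 observations each
d <- data.frame(grp  = factor(rep(1:10, each = 5)),
                time = rep(0:4, times = 10),
                v1   = rnorm(50),
                v2   = rnorm(50))

# Fixed-effects matrix: model.matrix puts the intercept (all ones) first
x <- model.matrix(~ v1 + v2 + time, data = d)

# Assumed random-effects matrix: ones for a random intercept,
# time for a random slope
z <- model.matrix(~ 1 + time, data = d)

# With a random intercept only (q = 1), z must still be a one-column matrix
z_intercept_only <- matrix(1, nrow = nrow(d), ncol = 1)

grp <- d$grp
```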
Thanks in advance.
I'm currently going through the 'Introduction to Statistical Learning' MOOC by Stanford OpenX. In one of the lab exercises, it suggests creating a model matrix from the test data by explicitly using model.matrix().
Extract from textbook
We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.
test.mat = model.matrix(Salary ~ ., data = Hitters[test, ])
The model.matrix() function is used in many regression packages for
building an X matrix from data. Now we run a loop, and for each size i, we
extract the coefficients from regfit.best for the best model of that
size, multiply them into the appropriate columns of the test model
matrix to form the predictions, and compute the test MSE.
val.errors = rep(NA, 19)
for (i in 1:19) {
coefi = coef(regfit.best, id = i)
pred = test.mat[, names(coefi)] %*% coefi
val.errors[i] = mean((Hitters$Salary[test] - pred)^2)
}
I understand that model.matrix would convert string variables into values with different levels, and that models like lm() would do the conversions under the hood.
However, in which situations would we explicitly use model.matrix(), and why?
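One situation the excerpt itself points at is computing predictions by hand, when all you have is a coefficient vector and no predict method to call. A minimal sketch of that, using lm on the built-in mtcars data purely so the manual result can be checked against predict (the split and variable names are my own choices):

```r
# Make cyl a factor up front so train and test share the same levels
d <- transform(mtcars, cyl = factor(cyl))
test_idx <- seq(1, nrow(d), by = 4)

fit <- lm(mpg ~ wt + cyl, data = d[-test_idx, ])

# Build the X matrix for the test rows: dummy columns for cyl are created here
X_test <- model.matrix(mpg ~ wt + cyl, data = d[test_idx, ])

# Multiply the matching columns by the coefficients
beta        <- coef(fit)
pred_manual <- drop(X_test[, names(beta)] %*% beta)

# Reference: the same predictions from the built-in method
pred_builtin <- predict(fit, newdata = d[test_idx, ])
```

With a fitted object that has no predict method, only the manual route is available, which is why the model matrix has to be built explicitly.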