I am trying to run a fixed-effects regression in R. When I run the linear model without the fixed-effects factor, it works just fine. But when I add the factor, which is a numeric code for user ID, I get the following error:
Error in rep.int(c(1, numeric(n)), n - 1L) : cannot allocate vector of length 1055470143
I am not sure what the error means but I fear it may be an issue of coding the variable correctly in R.
I think this is more a statistical than a programming problem, for two reasons:
First, I am not sure whether you are using cross-sectional data or panel data. If you are using cross-sectional data, it doesn't make sense to control for 30,000 individuals (of course, they will add to the variation).
Second, if you are using panel data, there are good packages, such as plm, that do this kind of computation.
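For illustration, a minimal sketch of the plm route, assuming panel data and placeholder names (mydata, id, time, y, x are not from the question):

library(plm)
# "within" estimator = fixed effects; the individual effects are swept out
# by demeaning rather than estimated as thousands of dummy coefficients
fe <- plm(y ~ x, data = mydata, index = c("id", "time"), model = "within")
summary(fe)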
An example of what happens with lm and a large factor:
set.seed(42)
# 100,000 observations: a predictor x and a factor id with 1,000 levels
DF <- data.frame(x = rnorm(1e5), id = factor(sample(seq_len(1e3), 1e5, TRUE)))
# y depends on x plus an id-specific shift
DF$y <- 100 * DF$x + 5 + rnorm(1e5, sd = 0.01) + as.numeric(DF$id)^2
# treating id as a fixed-effects factor forces lm to build a huge design matrix
fit <- lm(y ~ x + id, data = DF)
This needs almost 2.5 GB of RAM for the R session (if you add the RAM needed by the OS, this is more than many PCs have available) and takes some time to finish. The result is pretty useless.
If you don't run into RAM limitations you can suffer from limitations of vector length (e.g., if you have even more factor levels), in particular if you use an older version of R.
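For a rough sense of where the memory in the toy example goes, a back-of-the-envelope estimate of its dense design matrix (lm then makes additional copies of comparable size while fitting):

# 1e5 rows times (intercept + x + 999 id dummy columns), 8 bytes per double:
1e5 * (2 + 999) * 8 / 1024^2   # roughly 764 MB for the design matrix alone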
What happens?
One of the first steps in lm is creating the design matrix using the function model.matrix. Here is a smaller example of what happens with factors:
model.matrix(b~a,data=data.frame(a=factor(1:5),b=2))
# (Intercept) a2 a3 a4 a5
# 1 1 0 0 0 0
# 2 1 1 0 0 0
# 3 1 0 1 0 0
# 4 1 0 0 1 0
# 5 1 0 0 0 1
# attr(,"assign")
# [1] 0 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$a
# [1] "contr.treatment"
See how n factor levels result in n-1 dummy variables? If you have many factor levels and many observations, this matrix gets huge.
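In the question's case you can even back out the scale from the quoted error, assuming (an assumption, but a plausible one) that the failing rep.int(c(1, numeric(n)), n - 1L) call is R building the identity-like contrasts matrix for the id factor:

# rep.int(c(1, numeric(n)), n - 1L) has length (n + 1) * (n - 1) = n^2 - 1,
# so a length of 1055470143 corresponds to n = sqrt(1055470144) factor levels:
sqrt(1055470143 + 1)
# [1] 32488
# and a dense n-by-n contrasts matrix alone would need about 7.9 GB:
32488^2 * 8 / 1024^3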
What should you do?
I'm quite sure you should use a mixed-effects model. There are two important packages that implement linear mixed-effects models in R: nlme and the newer lme4.
library(lme4)
# one random intercept per id instead of ~1,000 dummy coefficients
fit.mixed <- lmer(y ~ x + (1 | id), data = DF)
summary(fit.mixed)
Linear mixed model fit by REML
Formula: y ~ x + (1 | id)
Data: DF
AIC BIC logLik deviance REMLdev
1025277 1025315 -512634 1025282 1025269
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 8.9057e+08 29842.472
Residual 1.3875e+03 37.249
Number of obs: 100000, groups: id, 1000
Fixed effects:
Estimate Std. Error t value
(Intercept) 3.338e+05 9.437e+02 353.8
x 1.000e+02 1.180e-01 847.3
Correlation of Fixed Effects:
(Intr)
x 0.000
This needs very little RAM, calculates fast, and is a more correct model.
See how the random intercept accounts for most of the variance?
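One way to quantify that, assuming a reasonably recent lme4 (which provides an as.data.frame method for VarCorr objects):

vc <- as.data.frame(VarCorr(fit.mixed))
# share of the total variance attributed to the by-id random intercept
vc$vcov[vc$grp == "id"] / sum(vc$vcov)
# very close to 1 here, i.e. almost all the variance is between ids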
So, you need to study mixed effects models. There are some nice publications, e.g. Baayen, Davidson, Bates (2008), explaining how to use lme4.
I want to estimate regression parameters of a Cox random effects model. Let us say that I have a categorical variable with two levels, sex for example. Then coding the variable is straightforward: 0 if male and 1 if female for example. The interpretation of the regression coefficient associated to that variable is simple.
Now let us say that I have a categorical variable with three levels. If I just code the variable with 0,1,2 for the three levels (A,B and C), the estimation of the associated regression coefficient would not be what I am looking for. If I want to estimate the risks associated with each "level" wrt the other levels, how should I code the variable ?
What I have done so far:
I define three variables.
I define one variable where I code level A as 1 and the rest (levels B and C) as 0.
I define another variable where I code level B as 1 and the rest (levels A and C) as 0.
Finally, I define a variable where I code level C as 1 and the rest (levels A and B) as 0.
I then estimate the three regression parameters associated with the variables.
Just to be clear, I do not want to use any package such as coxph, coxme, survival, etc.
Is there an easier way to do this?
Your description (one predictor variable that is all ones, and the other two predictor variables as indicator variables for groups B and C) exactly recapitulates the standard treatment contrasts that R uses.
If you want to construct a model matrix with treatment contrasts for a single factor f (inside a data frame d), then model.matrix(~f, data=d) will do it:
d <- data.frame(f=factor(c("A","B","B","C","A")))
model.matrix(~f, data=d)
Results:
(Intercept) fB fC
1 1 0 0
2 1 1 0
3 1 1 0
4 1 0 1
5 1 0 0
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$f
[1] "contr.treatment"
You can use other contrasts if you like; these will change the parameter estimates (and their interpretation!) for your individual variables, but not the overall model fit, e.g.:
model.matrix(~f , data=d, contrasts=list(f=contr.sum))
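And if you really want to build the columns by hand (as sketched in the question), a minimal version for the same factor f; note that with an intercept you keep indicators for only two of the three levels, otherwise the columns are perfectly collinear:

d <- data.frame(f = factor(c("A", "B", "B", "C", "A")))
# treatment coding by hand: A is the reference level (absorbed by the
# intercept), while B and C each get a 0/1 indicator
d$fB <- as.integer(d$f == "B")
d$fC <- as.integer(d$f == "C")
d
#   f fB fC
# 1 A  0  0
# 2 B  1  0
# 3 B  1  0
# 4 C  0  1
# 5 A  0  0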
I have a question relating to the “randomForest” package in R. I am trying to build a model with ecological variables that best explain my species occupancy data for 41 sites in the field (which I have gathered from camera traps). My ultimate goal is to do species occupancy modeling using the “unmarked” package but before I get to that stage I need to select the variables that are best explaining my occupancy, since I have many. To gain some understanding of the randomForest package I generated a fake occupancy dataset and a fake variable dataset (with variables A and D being good predictors of my occupancy and B and C being bad predictors). When I run the randomForest my output looks like this:
0 1 MeanDecreaseAccuracy MeanDecreaseGini
A 25.3537667 27.75533 26.9634018 20.6505920
B 0.9567857 0.00000 0.9665287 0.0728273
C 0.4261638 0.00000 0.4242409 0.1411643
D 32.1889374 35.52439 34.0485837 27.0691574
OOB estimate of error rate: 29.02%
Confusion matrix:
0 1 class.error
0 250 119 0.3224932
1 0 41 0.0000000
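For reference, a minimal sketch of the kind of fake setup that could produce output of this shape (all names, sizes, the data-generating rule, and the weighting choice are assumptions, not the code actually used):

library(randomForest)
set.seed(1)

n <- 410
A <- rnorm(n)    # informative predictor
D <- -A          # its exact inverse, perfectly correlated with A
B <- rnorm(n)    # noise
C <- rnorm(n)    # noise
occ <- factor(as.integer(A + rnorm(n, sd = 0.5) > 1))   # imbalanced 0/1 occupancy

dat <- data.frame(occ,
                  A = as.numeric(scale(A)), B = as.numeric(scale(B)),
                  C = as.numeric(scale(C)), D = as.numeric(scale(D)))

rf <- randomForest(occ ~ A + B + C + D, data = dat,
                   importance = TRUE,
                   classwt = c("0" = 1, "1" = 5))   # extra weight on the 1s
importance(rf)   # per-class and overall MeanDecreaseAccuracy, MeanDecreaseGini
rf$confusion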
I did not make a separate train and test set; I put extra weight on the model to correctly predict the "1"s, and the variables are scaled.
I understand that this output tells me that A and D are important variables because they have high MeanDecreaseAccuracy values. However, D is the inverse of A (they are perfectly correlated), so why does D have a higher MeanDecreaseAccuracy value?
Moreover, when I run the randomForest with only A and D as variables, these values change while the confusion matrix stays the same:
0 1 MeanDecreaseAccuracy MeanDecreaseGini
A 28.79540 29.77911 29.00879 23.58469
D 29.75068 30.79498 29.97520 24.53415
OOB estimate of error rate: 29.02%
Confusion matrix:
0 1 class.error
0 250 119 0.3224932
1 0 41 0.0000000
When I run the model with only one good predictor (A or D), or with one good and one bad predictor (A and B, or C and D), the confusion matrix stays the same but the MeanDecreaseAccuracy values of my predictors change.
Why do these values change and how should I approach the selection of my variables? (I am a beginner in occupancy modeling).
Thanks a lot!
Given two simple sets of data:
head(training_set)
x y
1 1 2.167512
2 2 4.684017
3 3 3.702477
4 4 9.417312
5 5 9.424831
6 6 13.090983
head(test_set)
x y
1 1 2.068663
2 2 4.162103
3 3 5.080583
4 4 8.366680
5 5 8.344651
I want to fit a linear regression line on the training data, and then use that line (or its coefficients) to calculate the "test MSE", i.e. the mean squared error of the residuals when the fitted line is applied to the test data.
model = lm(y~x,data=training_set)
train_MSE = mean(model$residuals^2)
test_MSE = ?
In this case, it is more precise to call it MSPE (mean squared prediction error):
mean((test_set$y - predict.lm(model, test_set)) ^ 2)
This is a more useful measure as all models aim at prediction. We want a model with minimal MSPE.
In practice, if we do have a spare test data set, we can directly compute MSPE as above. However, very often we don't have spare data. In statistics, the leave-one-out cross-validation is an estimate of MSPE from the training dataset.
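For a linear model you don't even have to refit n times: leave-one-out CV has a closed-form shortcut based on the hat values (the PRESS statistic divided by n). A small sketch using the model fitted above:

# LOOCV estimate of the MSPE from the training data alone
loocv_mse <- mean((residuals(model) / (1 - hatvalues(model)))^2)
loocv_mse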
There are also several other statistics for assessing prediction error, like Mallows's Cp and AIC.
I need to do RNA-Seq analysis with limma and I already have normalized count data for 61,810 transcripts in two conditions (no replicates), i.e. a 61810*2 matrix. My "design" model matrix is:
(Intercept) sampletypestest
1 1 0
2 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$sampletypes
[1] "contr.treatment
When I use voom on the data, diff.exp <- voom(data, design), it gives the following error:
Error in approxfun(l, rule = 2) :
need at least two non-NA values to interpolate
Can anyone tell me what the issue is here?
voom (and limma more generally) requires replicates. The whole purpose of voom is to estimate the mean-variance relationship, and that would work if you had any replicates at all in any of the groups. But you don't have any replicates, so you can't estimate any variances, and an error is inevitable.
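For contrast, a minimal sketch of a design where voom can work, assuming two groups with two replicates each and simulated raw counts (not your data):

library(limma)
set.seed(1)

# 1000 genes, 2 control and 2 test libraries of negative-binomial counts
counts <- matrix(rnbinom(1000 * 4, mu = 50, size = 10), nrow = 1000,
                 dimnames = list(NULL, c("ctrl1", "ctrl2", "test1", "test2")))

group  <- factor(c("ctrl", "ctrl", "test", "test"))
design <- model.matrix(~ group)

v   <- voom(counts, design)        # replicates let voom estimate the mean-variance trend
fit <- eBayes(lmFit(v, design))
topTable(fit, coef = "grouptest")  # top genes for the test vs. ctrl comparison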