Error during wrapup: long vectors not supported yet: in glm() function - r

I found several questions on Stack Overflow regarding this topic (some of them without any answer), but so far nothing related to this error in a regression context.
I'm running a probit model in R with (I'm guessing) too many fixed effects (year and place):
myprobit <- glm(factor(Y) ~ factor(T) + factor(X1) + factor(X2) + factor(X3) +
factor(YEAR) + factor(PLACE),
family = binomial(link = "probit"),
data = DT)
The PLACE variable has about 1000 unique values and YEAR 8 values. The dataset DT has 13,099,225 obs and 79 columns.
The error I got is:
Error: cannot allocate vector of size 59.3 Gb
Error during wrapup: long vectors not supported yet: ../include/Rinlinedfuns.h:519
The machine I'm using has 128 GB of RAM.
So I don't know what I can do without changing the function. Does anyone know how to deal with this issue? Thanks!

To close this question, I should mention that Axeman's answer is the only feasible approach for my problem. The whole issue is that there is not enough memory to handle such a huge design matrix.
So, running the probit regression with the biglm package and its bigglm() function is the only solution I have found so far.
Nevertheless, I realize that, because of how the biglm package works (taking chunks of the data iteratively), using factor() variables on the right-hand side is problematic whenever a factor level is not represented in a chunk. In other words, if a factor variable has 5 levels but only 4 of them appear in a given chunk, the estimation throws an error.
There are several questions and comments about this on Stack Overflow; a small sketch of the workaround is below.
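For completeness, here is a minimal sketch of the bigglm() approach; the chunk size, the iteration limit, and the assumption that Y is already coded 0/1 are mine, not from the original answer. Converting the factor columns once, on the full data, means every chunk carries the complete set of levels, which avoids the missing-level error described above:

library(biglm)

## Fix the factor levels on the full data so every chunk produces the same
## design-matrix columns.
for (v in c("T", "X1", "X2", "X3", "YEAR", "PLACE")) DT[[v]] <- factor(DT[[v]])

## Response assumed to be coded 0/1 (recode it first if it is a factor).
myprobit <- bigglm(Y ~ T + X1 + X2 + X3 + YEAR + PLACE,
                   family = binomial(link = "probit"),
                   data = DT,
                   chunksize = 100000,  # rows per pass; tune to available RAM
                   maxit = 20)          # allow more IWLS passes than the default
summary(myprobit)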

Related

glmmLasso error and warning

I am trying to perform variable selection in a generalized linear mixed model using glmmLasso, but am coming up with an error and a warning that I cannot resolve. The dataset is unbalanced, with some participants (PTNO) having more samples than others; there are no missing data. My dependent variable is binary; all other variables (besides the ID variable PTNO) are continuous.
I suspect something very generic is happening, but obviously fail to see it and have not found any solution in the documentation or on the web.
The code, which is basically just adapted from the glmmLasso soccer example is:
glm8 <- glmmLasso(Group ~ NDUFV2_dCTABL + GPER1_dCTABL + ESR1_dCTABL + ESR2_dCTABL + KLF12_dCTABL +
                    SP4_dCTABL + SP1_dCTABL + PGAM1_dCTABL + ANK3_dCTABL + RASGRP1_dCTABL + AKT1_dCTABL +
                    NUDT1_dCTABL + POLG_dCTABL + ADARB1_dCTABL + OGG_dCTABL + PDE4B_dCTABL + GSK3B_dCTABL +
                    APOE_dCTABL + MAPK6_dCTABL,
                  rnd = list(PTNO = ~1), family = poisson(link = log), data = stackdata, lambda = 100,
                  control = list(print.iter = TRUE, start = c(1, rep(0, 29)), q.start = 0.7))
The error message is displayed below. Specifically, I do not believe there are any NAs in the dataset, and I am unsure about the meaning of the warning regarding the factor variable.
Iteration 1
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In Ops.factor(y, Mu) : ‘-’ not meaningful for factors
An abbreviated dataset containing the necessary variables is available in R format and can be downloaded here.
I hope I can be guided a bit as to how to go on with the analysis. Please let me know if there is anything wrong with the dataset or you cannot download it. ANY help is much appreciated.
Just to follow up on @Kristofersen's comment above: it is indeed the start vector that messes up your analysis.
If I run
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL,
rnd = list(PTNO=~1),
family = binomial(),
data = stackdata,
lambda=100,
control = list(print.iter=TRUE))
then everything is fine and dandy (i.e., it converges and produces a solution). You have copied the example, which uses Poisson regression, and you need to tweak the code to your situation. I have no idea whether the output makes sense.
Quick note: I used the binomial distribution in the code above since your outcome is binary. If it makes sense to estimate relative risks, then Poisson may be reasonable (and it also converges), but you need to recode your outcome: the two groups are coded 1 and 2, and that will certainly mess up the Poisson regression.
In other words do a
stackdata$Group <- stackdata$Group-1
before you run the analysis.

Elastic net with Cox regression

I am trying to fit an elastic net with Cox regression on 120 samples with ~100k features.
I tried R with the glmnet package, but R does not seem to support such big matrices (it seems R was not designed for 64-bit indexing). Furthermore, glmnet does support sparse matrices, but for whatever reason sparse matrices + Cox regression have not been implemented.
I am not pushing for R, but it is the only tool I have found so far. Does anyone know what program I can use to compute elastic nets + Cox regression on big models? I did read that I could use a Support Vector Machine, but I would need to compute the model first, and I cannot do that in R due to the restriction above.
Edit:
A bit of clarification: I am not reporting an error in R, as apparently it is normal for R to be limited by how many elements its matrices can hold (as for glmnet not supporting sparse matrices with Cox regression, I have no idea why). I am not pushing for a particular tool, but it would be easier if there were another package or a stand-alone program that can do what I am looking for.
If someone has an idea or has done this before please share your method (R, Matlab, something else).
Edit 2:
Here is what I used to test:
I made a matrix of 100x100000, added class labels, and tried to create the model matrix using model.matrix.
data <- matrix(rnorm(100*100000), 100, 100000)
formula <- as.formula(class ~ .)
x = c(rep('A', 40), rep('B', 30), rep('C', 30))
y = sample(x=1:100, size=100)
class = x[y]
data <- cbind(data, class)
X <- model.matrix(formula, data)
The error I got:
Error: cannot allocate vector of size 37.3 Gb
In addition: Warning messages:
1: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
2: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
3: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
4: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
Thank you in advance! :)
Edit 3:
Thanks to @marbel, I was able to construct a test model that works and does not become too big. It seems my problem came from using cbind in my test.
A few pointers:
a) That's a rather small dataset; R should be more than enough. All you need is a modern computer with a decent amount of RAM; I guess 4 GB should be enough for such a small dataset.
The glmnet package is also available in Julia and Python, but I'm not sure whether that model is available in those ports.
Here and here you have examples of the Cox model with the glmnet package. There is also a package called survival.
There are at least two problems with your code (a short sketch putting these pointers together follows below):
This is not something you would want to do in R: data <- cbind(data, class). It's just not memory efficient. If you need this type of operation, use the data.table package; it allows assignment by reference, check out the := operator.
If all your data is numeric, you don't need model.matrix; just use data.matrix(X).
If you have categorical variables, use model.matrix on them only, then add the resulting columns to the X matrix, perhaps with data.table, one column at a time, using ?data.table::set or the := operator.
Hopefully this can help you debug the code. Good luck!
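A minimal sketch that puts these pointers together, using simulated data of roughly the dimensions in the question (100 samples, 100,000 numeric features); the survival times, censoring indicator, and alpha value are made up for illustration:

library(glmnet)

set.seed(1)
n <- 100; p <- 100000                    # reduce p if you just want a quick test
X <- matrix(rnorm(n * p), n, p)          # already numeric, so no model.matrix needed
y <- cbind(time   = rexp(n, rate = 0.1), # toy survival times
           status = rbinom(n, 1, 0.7))   # 1 = event, 0 = censored

## Elastic net Cox regression; alpha = 1 is the lasso, alpha = 0 is ridge.
fit <- cv.glmnet(X, y, family = "cox", alpha = 0.5)
coef(fit, s = "lambda.min")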

Nested model in R

I'm having a huge problem with a nested model I am trying to fit in R.
I have a response-time experiment with 2 conditions, 46 people per condition, and 32 measures per person. I would like measures to be nested within people and people nested within conditions, but I can't get it to work.
The code I thought should make sense was:
nestedmodel <- lmer(responsetime ~ 1 + condition +
(1|condition:person) + (1|person:measure), data=dat)
However, all I get is an error:
Error in checkNlevels(reTrms$flist, n = n, control) :
number of levels of each grouping factor must be < number of observations
Unfortunately, I do not even know where to start looking for the problem here.
Any ideas? Please, please, please? =)
Cheers!
This might be more appropriate on CrossValidated, but: lme4 is trying to tell you that one or more of your random effects is confounded with the residual variance. As you've described your data, I don't quite see why: you should have 2*46*32=2944 total observations, 2*46=92 combinations of condition and person, and 46*32=1472 combinations of measure and person.
If you do
lf <- lFormula(responsetime ~ 1 + condition +
(1|condition:person) + (1|person:measure), data=dat)
and then
lapply(lf$reTrms$Ztlist,dim)
to look at the transposed random-effect design matrices for each term, what do you get? You should (based on your description of your data) see that these matrices are 1472 by 2944 and 92 by 2944, respectively.
As @MrFlick says, a reproducible example would be nice. Other things you could show us are (a minimal sketch of both follows below):
fit the model anyway, using lmerControl(check.nobs.vs.nRE="ignore") to skip that check, and show us the results (especially the random-effect variances and the reported numbers of groups)
show us the results of with(dat, table(table(interaction(condition, person)))) to give information on the number of replicates per combination (and similarly for measure)
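A minimal sketch of both suggestions, assuming a data frame dat with the columns named in the question (responsetime, condition, person, measure):

library(lme4)

## 1. Fit anyway, ignoring the levels-vs-observations check, and inspect the
##    variance components and the reported numbers of groups.
m <- lmer(responsetime ~ 1 + condition + (1|condition:person) + (1|person:measure),
          data = dat,
          control = lmerControl(check.nobs.vs.nRE = "ignore"))
summary(m)

## 2. Replicates per combination of the grouping factors.
with(dat, table(table(interaction(condition, person))))
with(dat, table(table(interaction(person, measure))))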

computing multiple fixed effects on large dataset

I'm trying to perform a fixed-effects regression with two factor variables on a CSV dataset containing over 4,000,000 rows. These variables can take about 140,000 and 50,000 different integer values, respectively.
I initially attempted to perform the regression using the biglm and ff packages for R, as follows, on a Linux machine with 8 GB of memory; however, this seems to require too much memory, because R complains about having to allocate a vector larger than the maximum on my machine.
library(biglm)
library(ff)
d <- read.csv.ffdf(file='data.csv', header=TRUE)
model = y~factor(a)+factor(b)-1
out <- biglm(model, data=d)
Some research online revealed that since factors are loaded into memory by ff, the latter will not significantly improve memory usage if many factor values are present.
Is anyone aware of some other way to perform the aforementioned regression on a dataset of the magnitude I described without having to resort to a machine with significantly more memory?
You should try the lfe package; it has been designed for exactly this purpose:
library(lfe)
...
out <- felm(y ~ 0|a+b, data=d)
fe <- getfe(out)
A proof of the method can be found here: http://www.sciencedirect.com/science/article/pii/S0167947313001266
Here's an R-journal article about it: http://journal.r-project.org/archive/2013-2/gaure.pdf
You can get the same fixed-effect estimates if you demean the variables (by category). So, instead of estimating a constant per dummy, you demean; and demeaning will be very fast, as it is vectorized (a small sketch follows below).
Edit 1:
See Greene (2012), pp. 400-401, for the mathematical proof.
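A small sketch of the demeaning (within) transformation for the two-factor case, assuming a data frame d with the fixed-effect factors a and b from the earlier code plus a hypothetical continuous regressor x whose coefficient is of interest. With two factors the demeaning has to be alternated until the group means are numerically zero, which is essentially the projection lfe::felm() performs internally:

## Alternating within-transformation for two fixed-effect factors.
demean2 <- function(v, f1, f2, tol = 1e-8, maxit = 100L) {
  for (i in seq_len(maxit)) {
    v <- v - ave(v, f1)   # remove means within levels of the first factor
    v <- v - ave(v, f2)   # remove means within levels of the second factor
    if (max(abs(ave(v, f1))) < tol && max(abs(ave(v, f2))) < tol) break
  }
  v
}

y_w <- demean2(d$y, d$a, d$b)
x_w <- demean2(d$x, d$a, d$b)

## By Frisch-Waugh-Lovell, the slope on x here matches the one from the full
## dummy-variable regression lm(y ~ x + factor(a) + factor(b), data = d).
coef(lm(y_w ~ x_w - 1))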

"system is computationally singular" error when using `gmm` (GMM Estimation)

I am trying to use the gmm package in R to estimate the parameters (a-f) of a linear model:
LEV1 = a*Macro + b*Firm + c*Sector + d*qtr + e*fqtr + f*tax
Macro, Firm, and Sector are matrices with n rows; qtr, fqtr, and tax are vectors with n elements.
I have one large data frame called unconstrd that has all of the data. First, I break that data into separate matrices:
v_LEV1 <- as.matrix(unconstrd$LEV1)
Macro <- as.matrix(cbind(unconstrd$Agg_Corp_Prof,unconstrd$R1000_TR, unconstrd$CP_Spread))
Firm <- as.matrix(cbind(unconstrd$ppe_ratio, unconstrd$op_inc_ratio_avg, unconstrd$selling_exp_avg,
unconstrd$tax_avg, unconstrd$Mark_to_Bk, unconstrd$mc_ratio))
Sector <- as.matrix(cbind(unconstrd$Sect_Flag03,
unconstrd$Sect_Flag04, unconstrd$Sect_Flag05, unconstrd$Sect_Flag06,
unconstrd$Sect_Flag07, unconstrd$Sect_Flag08, unconstrd$Sect_Flag12,
unconstrd$Sect_Flag13, unconstrd$Sect_Flag14, unconstrd$Sect_Flag15,
unconstrd$Sect_Flag17))
v_qtr <- as.matrix(unconstrd$qtr)
v_fqtr <- as.matrix(unconstrd$fqtr)
v_tax <- as.matrix(unconstrd$tax_dummy)
Then, I bind the data together for the x variable called by gmm:
h <- cbind(Macro, Firm, Sector, v_qtr, v_fqtr, v_tax)
Then, I invoke gmm:
gmm1 <- gmm(v_LEV1 ~ Macro + Firm + Sector + v_qtr + v_fqtr + v_tax, x=h)
I get the message:
Error in solve.default(crossprod(hm, xm), crossprod(hm, ym)) :
system is computationally singular: reciprocal condition number = 1.10214e-18
I apologize in advance and admit that I'm a neophyte at R and have never used GMM before. The gmm function is so general that, although I've looked at the examples available on the web, nothing seems specific enough to help my situation.
You are trying to fit a regressor matrix that does not have full rank; try excluding some of the variables and/or looking for errors. We cannot say much more without your data, or at least a sample of it (a quick rank check is sketched below).
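A quick way to see whether the combined regressor matrix really is rank deficient, and which columns are involved, is a pivoted QR decomposition; the intercept column and this check are mine, not from the original answer. One common culprit in a setup like this is a set of dummy columns (e.g. the sector flags) that sum to the intercept:

X <- cbind(Intercept = 1, h)   # gmm adds an intercept, so include one in the check
qrX <- qr(X)
c(rank = qrX$rank, columns = ncol(X))

if (qrX$rank < ncol(X)) {
  ## the pivoted columns beyond the rank are linearly dependent on the others
  colnames(X)[qrX$pivot[(qrX$rank + 1):ncol(X)]]
}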
That's more of a modelling question for Crossvalidated.com than a programming question for StackOverflow.
I was pretty certain there was no linear dependency between my variables, but I went through the exercise of adding one variable at a time to see what was causing the error. In the end, I asked a colleague to run GMM in SAS and it ran perfectly, with no error messages. I'm not sure what the problem with the R version is, but at this point I have a solution and am giving up on gmm in R.
Thanks to everyone who tried to help.

Resources