Elastic net with Cox regression in R

I am trying to perform elastic net with Cox regression on 120 samples with ~100k features.
I tried R with the glmnet package, but R does not support big matrices (it seems R is not designed for 64-bit). Furthermore, the glmnet package does support sparse matrices, but for whatever reason sparse matrix + Cox regression has not been implemented.
I am not pushing for R, but it is the only tool I have found so far. Does anyone know what program I can use to calculate elastic net + Cox regression on big models? I did read that I could use a support vector machine, but I would need to calculate the model first, and I cannot do that in R due to the restriction above.
Edit:
A bit of clarification: I am not reporting an error in R, as apparently it is normal for R to be limited in how many elements a matrix can hold (as for glmnet not supporting sparse matrices with Cox regression, I have no idea). I am not pushing for a particular tool, but it would be easier if there were another package or a standalone program that can do what I am looking for.
If someone has an idea or has done this before please share your method (R, Matlab, something else).
Edit 2:
Here is what I used to test:
I made a 100 x 100,000 matrix, added class labels, and tried to build the design matrix using model.matrix:
data <- matrix(rnorm(100*100000), 100, 100000)   # 100 samples x 100,000 features
formula <- as.formula(class ~ .)
x = c(rep('A', 40), rep('B', 30), rep('C', 30))
y = sample(x=1:100, size=100)
class = x[y]                                     # random class labels
data <- cbind(data, class)                       # note: coerces the whole numeric matrix to character
X <- model.matrix(formula, data)
The error I got:
Error: cannot allocate vector of size 37.3 Gb
In addition: Warning messages:
1: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
2: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
3: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
4: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
Thank you in advance! :)
Edit 3:
Thanks to @marbel I was able to construct a test model that works and does not become too big. It seems my problem came from using cbind in my test.

A few pointers:
a) That's a rather small dataset; R should be more than enough. All you need is a modern computer, meaning a decent amount of RAM. I'd guess 4GB should be enough for such a small dataset.
b) The glmnet package is also available in Julia and Python, but I'm not sure whether the Cox model is implemented there.
c) Here and here you have examples of the Cox model with the glmnet package. There is also a package called survival.
There are at least two problems with your code:
This is not something you would want to do in R: data <- cbind(data, class). It's just not memory efficient. If you need to do this type of operation, use the data.table package; it allows assignment by reference, check out the := operator.
If all your data is numeric you don't need model.matrix; just use data.matrix(X).
If you have categorical variables, use model.matrix on them only, then add the result to the X matrix, perhaps with data.table, one column at a time using data.table::set or the := operator (see the sketch below).
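To make these pointers concrete, here is a minimal hedged sketch with simulated data (all names, sizes and the survival outcome are illustrative, not taken from the question):
library(data.table)
library(glmnet)
set.seed(1)
## assignment by reference with data.table (no full copy, unlike cbind):
dt <- as.data.table(matrix(rnorm(120 * 10), 120, 10))
dt[, class := sample(c("A", "B", "C"), 120, replace = TRUE)]
## elastic net + Cox with glmnet on a plain numeric matrix (no model.matrix needed):
x <- matrix(rnorm(120 * 1000), 120, 1000)                    # 120 samples, 1000 features
y <- cbind(time = rexp(120), status = rbinom(120, 1, 0.7))   # simulated survival outcome
fit <- glmnet(x, y, family = "cox", alpha = 0.5)             # alpha = 0.5 gives an elastic net penalty
cvfit <- cv.glmnet(x, y, family = "cox", alpha = 0.5)        # cross-validated lambda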
Hopefully this can help you debug the code. Good luck!

Related

Error during wrapup: long vectors not supported yet: in glm() function

I found several questions on Stack Overflow regarding this topic (some of them without any answer), but nothing related (so far) to this error in a regression.
I'm running a probit model in R with (I'm guessing) too many fixed effects (year and place):
myprobit <- glm(factor(Y) ~ factor(T) + factor(X1) + factor(X2) + factor(X3) +
factor(YEAR) + factor(PLACE),
family = binomial(link = "probit"),
data = DT)
The PLACE variable has about 1000 unique values and YEAR 8 values. The dataset DT has 13,099,225 obs and 79 columns.
The error I got is:
Error: cannot allocate vector of size 59.3 Gb
Error during wrapup: long vectors not supported yet: ../include/Rinlinedfuns.h:519
The machine I'm using has 128 GB of RAM.
So I don't know what I can do without changing the function. Does anyone know how to deal with this issue? Thanks!
In order to close this question, I have to mention that @Axeman's answer is the only feasible approach for my problem. The whole issue is that there is not enough memory to manage such a huge design matrix.
Therefore, running a probit regression using the biglm package and its bigglm() function is the only solution I have found so far.
Nevertheless, I realize that, because of how the biglm package works (taking chunks of the data iteratively), the use of factor() variables on the RHS is problematic whenever a factor level is not represented in a chunk. In other words, if a factor variable has 5 levels but only 4 of them appear in a given data chunk, the estimation will fail with an error.
There are several questions and comments about this on Stackoverflow.
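For reference, a hedged sketch of the bigglm() route described above (same DT and variable names as in the question; the chunksize value and the up-front factor conversion are my own illustrative suggestions, not from the thread):
library(biglm)
## converting the high-cardinality variables to factors once, before the call, is one
## possible way to keep the per-chunk design matrices consistent (every chunk then
## carries the full set of levels); this is a hedged suggestion, not a guarantee
DT$YEAR  <- factor(DT$YEAR)
DT$PLACE <- factor(DT$PLACE)
myprobit_big <- bigglm(factor(Y) ~ factor(T) + factor(X1) + factor(X2) + factor(X3) +
                         YEAR + PLACE,
                       family = binomial(link = "probit"),
                       data = DT,
                       chunksize = 100000)   # chunksize is illustrative
summary(myprobit_big)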

glmmLasso error and warning

I am trying to perform variable selection in a generalized linear mixed model using glmmLasso, but am running into an error and a warning that I cannot resolve. The dataset is unbalanced, with some participants (PTNO) having more samples than others; there are no missing data. My dependent variable is binary; all other variables (besides the ID variable PTNO) are continuous.
I suspect something very generic is happening, but obviously fail to see it and have not found any solution in the documentation or on the web.
The code, which is basically just adapted from the glmmLasso soccer example is:
glm8 <- glmmLasso(Group ~ NDUFV2_dCTABL + GPER1_dCTABL + ESR1_dCTABL + ESR2_dCTABL +
                    KLF12_dCTABL + SP4_dCTABL + SP1_dCTABL + PGAM1_dCTABL + ANK3_dCTABL +
                    RASGRP1_dCTABL + AKT1_dCTABL + NUDT1_dCTABL + POLG_dCTABL +
                    ADARB1_dCTABL + OGG_dCTABL + PDE4B_dCTABL + GSK3B_dCTABL +
                    APOE_dCTABL + MAPK6_dCTABL,
                  rnd = list(PTNO = ~1),
                  family = poisson(link = log), data = stackdata, lambda = 100,
                  control = list(print.iter = TRUE, start = c(1, rep(0, 29)), q.start = 0.7))
The error message is displayed below. Specifically, I do not believe there are any NAs in the dataset, and I am unsure about the meaning of the warning regarding the factor variable.
Iteration 1
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In Ops.factor(y, Mu) : ‘-’ not meaningful for factors
An abbreviated dataset containing the necessary variables is available in R format and can be downloaded here.
I hope I can be guided a bit as to how to go on with the analysis. Please let me know if there is anything wrong with the dataset or you cannot download it. ANY help is much appreciated.
Just to follow up on @Kristofersen's comment above. It is indeed the start vector that messes up your analysis.
If I run
glm8 <- glmmLasso(Group ~ NDUFV2_dCTABL + GPER1_dCTABL + ESR1_dCTABL + ESR2_dCTABL +
                    KLF12_dCTABL + SP4_dCTABL + SP1_dCTABL + PGAM1_dCTABL + ANK3_dCTABL +
                    RASGRP1_dCTABL + AKT1_dCTABL + NUDT1_dCTABL + POLG_dCTABL +
                    ADARB1_dCTABL + OGG_dCTABL + PDE4B_dCTABL + GSK3B_dCTABL +
                    APOE_dCTABL + MAPK6_dCTABL,
rnd = list(PTNO=~1),
family = binomial(),
data = stackdata,
lambda=100,
control = list(print.iter=TRUE))
then everything is fine and dandy (i.e., it converges and produces a solution). You have copied the Poisson regression example and need to tweak the code to your situation. I have no idea whether the output makes sense.
Quick note: I ran with the binomial distribution in the code above since your outcome is binary. If it makes sense to estimate relative risks then Poisson may be reasonable (and it also converges), but you need to recode your outcome, as the two groups are coded 1 and 2 and that will certainly mess up the Poisson regression.
In other words do a
stackdata$Group <- stackdata$Group-1
before you run the analysis.

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
My vectors are composed of one email address and 30 qualitative variables.
Each qualitative variable has 4 levels: 0, 1, 2 and 3.
So the first thing I do is load the FactoMineR library and my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running a MCA in this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
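As a hedged illustration of that idea (the number of clusters and the sampling parameters are made up), clara() from the cluster package can be applied directly to the MCA coordinates without ever building the full distance matrix:
library(cluster)
coords <- mca.res$ind$coord                 # individuals x retained MCA dimensions
set.seed(1)
cl <- clara(coords, k = 10, samples = 50)   # k = 10 clusters is purely illustrative
table(cl$clustering)                        # cluster sizes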
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a finite number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot shows the clusters based on all 300 individuals (kk = Inf) and on the 30 k-means centres (kk = 30), with the data plotted on the first two MCA axes.
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1,000 k-means centres, perhaps up to 6,000 with 8GB of RAM. If you wanted to use a larger number, you would probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust (see the sketch below).
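A rough, hedged sketch of that more efficient route (the numbers of centres and final clusters are illustrative, and stats::dist is used on the centres for brevity instead of SpatialTools::dist1):
library(bigmemory)
library(biganalytics)                                # provides bigkmeans
library(fastcluster)                                 # drop-in, faster hclust
set.seed(1)
bm   <- as.big.matrix(mca.res$ind$coord)             # MCA coordinates as a big.matrix
km   <- bigkmeans(bm, centers = 1000)                # many small k-means centres
d    <- dist(km$centers)                             # distances between the 1000 centres only
tree <- fastcluster::hclust(d, method = "ward.D2")   # hierarchical tree on the centres
grp  <- cutree(tree, k = 10)                         # cut into a final number of clusters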
That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this within 32-bit R, possibly under Windows? If this is the case, then killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with
rm(mydata, mydata2)
(as well as all other unnecessary R variables) before executing the command which generates the error. However, the ultimate solution in general is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; also see here:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html

computing multiple fixed effects on large dataset

I'm trying to perform a fixed effects regression on two factor variables in a CSV dataset containing over 4,000,000 rows. These variables can respectively take about 140,000 and 50,000 different integer values.
I initially attempted to perform the regression using the biglm and ff packages for R, as follows, on a Linux machine with 8 GB of memory; however, it seems that this requires too much memory because R complains about having to allocate a vector larger than the maximum allowed on my machine.
library(biglm)
library(ff)
d <- read.csv.ffdf(file='data.csv', header=TRUE)
model = y~factor(a)+factor(b)-1
out <- biglm(model, data=d)
Some research online revealed that since factors are loaded into memory by ff, the latter will not significantly improve memory usage if many factor values are present.
Is anyone aware of some other way to perform the aforementioned regression on a dataset of the magnitude I described without having to resort to a machine with significantly more memory?
You should try the lfe package; it has been designed for exactly this purpose:
library(lfe)
...
out <- felm(y ~ 0|a+b, data=d)
fe <- getfe(out)
A proof of the method can be found here: http://www.sciencedirect.com/science/article/pii/S0167947313001266
Here's an R-journal article about it: http://journal.r-project.org/archive/2013-2/gaure.pdf
You can get the same fixed effects estimates if you demean the variables (by category). So, instead of estimating a constant per dummy, you demean the data, and demeaning is very fast since it can be vectorized (see the sketch below).
Edit 1:
See Greene (2012), pp. 400-401, for the mathematical proof.
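A hedged sketch of the demeaning (within-transformation) idea for a single factor; the two-factor case needs alternating projections, which is what lfe::felm does internally. The data and names below are simulated for illustration:
set.seed(1)
d <- data.frame(y = rnorm(1000), x = rnorm(1000),
                a = factor(sample(1:50, 1000, replace = TRUE)))
d$y_dm <- d$y - ave(d$y, d$a)          # subtract the group mean of y within each level of a
d$x_dm <- d$x - ave(d$x, d$a)          # same for the regressor
coef(lm(y_dm ~ x_dm - 1, data = d))    # matches the slope on x from lm(y ~ x + a)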

"system is computationally singular" error when using `gmm` (GMM Estimation)

Trying to use the gmm package in R to estimate the parameters (a-f) of a linear model:
LEV1 = a*Macro + b*Firm + c*Sector + d*qtr + e*fqtr + f*tax
Macro, Firm and Sector are matrices with n number of rows. qtr, fqtr and tax are vectors with n members.
I have one large data frame called unconstrd that has all of the data. First, I break that data into separate matrices:
v_LEV1 <- as.matrix(unconstrd$LEV1)
Macro <- as.matrix(cbind(unconstrd$Agg_Corp_Prof,unconstrd$R1000_TR, unconstrd$CP_Spread))
Firm <- as.matrix(cbind(unconstrd$ppe_ratio, unconstrd$op_inc_ratio_avg, unconstrd$selling_exp_avg,
unconstrd$tax_avg, unconstrd$Mark_to_Bk, unconstrd$mc_ratio))
Sector <- as.matrix(cbind(unconstrd$Sect_Flag03,
unconstrd$Sect_Flag04, unconstrd$Sect_Flag05, unconstrd$Sect_Flag06,
unconstrd$Sect_Flag07, unconstrd$Sect_Flag08, unconstrd$Sect_Flag12,
unconstrd$Sect_Flag13, unconstrd$Sect_Flag14, unconstrd$Sect_Flag15,
unconstrd$Sect_Flag17))
v_qtr <- as.matrix(unconstrd$qtr)
v_fqtr <- as.matrix(unconstrd$fqtr)
v_tax <- as.matrix(unconstrd$tax_dummy)
Then, I bind the data together for the x variable called by gmm:
h=cbind(Macro,Firm,Sector,v_qtr, v_fqtr, v_tax)
Then, I invoke gmm:
gmm1 <- gmm(v_LEV1 ~ Macro + Firm + Sector + v_qtr + v_fqtr + v_tax, x=h)
I get the message:
Error in solve.default(crossprod(hm, xm), crossprod(hm, ym)) :
system is computationally singular: reciprocal condition number = 1.10214e-18
I apologize in advance and admit that I'm a neophyte at R and I've never used GMM before. The gmm function is so general that, although I've looked at the examples available on the web, nothing seems specific enough to help my situation.
You are trying to fit a model on a matrix which does not have full rank; try excluding some of the variables and/or look for errors. We cannot say much more without your data, or at least a sample.
That's more of a modelling question for Crossvalidated.com than a programming question for StackOverflow.
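As a hedged illustration of how one might check that rank-deficiency diagnosis with the instrument matrix h already built in the question (the function choices here are mine, not from the answer):
qr(h)$rank                    # if this is smaller than ncol(h), the columns are linearly dependent
ncol(h)
## with the caret package installed, this identifies which columns are involved:
caret::findLinearCombos(h)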
I was pretty certain there was no linear dependency between my variables, but I went through the exercise of adding one variable at a time to see what was causing the errors. In the end, I asked a colleague to run GMM in SAS and it ran perfectly, with no error messages. I'm not sure what the problem with the R version is, but at this point I have a solution and have given up on GMM in R.
Thanks to everyone who tried to help.
