computing multiple fixed effects on large dataset - r

I'm trying to run a fixed-effects regression with two factor variables on a CSV dataset containing over 4,000,000 rows. The two variables take roughly 140,000 and 50,000 distinct integer values, respectively.
I initially attempted the regression with the biglm and ff packages for R, as follows, on a Linux machine with 8 GB of memory; however, this seems to require too much memory, because R complains about having to allocate a vector larger than the maximum allowed on my machine.
library(biglm)
library(ff)

d <- read.csv.ffdf(file = "data.csv", header = TRUE)
model <- y ~ factor(a) + factor(b) - 1
out <- biglm(model, data = d)
Some research online revealed that, since ff loads factors into memory, it will not significantly improve memory usage when the factors have many levels.
Is anyone aware of some other way to perform the aforementioned regression on a dataset of the magnitude I described without having to resort to a machine with significantly more memory?

You should try the lfe package; it has been designed for exactly this purpose:
library(lfe)
...
out <- felm(y ~ 0|a+b, data=d)
fe <- getfe(out)
A proof of the method can be found here: http://www.sciencedirect.com/science/article/pii/S0167947313001266
Here's an R-journal article about it: http://journal.r-project.org/archive/2013-2/gaure.pdf
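If you want to check the mechanics on something small before pointing it at the full CSV, here is a toy sketch on simulated data (the column names y, a and b follow the question; everything else is made up):

library(lfe)

set.seed(1)
n  <- 1e5
fa <- rnorm(2000)                          # true effects for 2000 levels of a
fb <- rnorm(500)                           # true effects for 500 levels of b
ia <- sample(2000, n, replace = TRUE)
ib <- sample(500,  n, replace = TRUE)
d  <- data.frame(y = fa[ia] + fb[ib] + rnorm(n),
                 a = factor(ia),
                 b = factor(ib))

out <- felm(y ~ 0 | a + b, data = d)       # project out both sets of fixed effects
fe  <- getfe(out)                          # one row per estimated fixed-effect level
head(fe)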

You can get the same fixed-effects estimates if you demean the variables within each category. So, instead of estimating a constant per dummy, you subtract the group means, and demeaning is very fast because it can be vectorized.
Edit 1:
See Greene (2012), pp. 400-401, for the mathematical proof.
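A minimal sketch of the idea, assuming the columns y, a and b from the question plus a hypothetical extra covariate x: by the Frisch-Waugh-Lovell theorem, regressing the demeaned y on the demeaned x gives the same slope as including both full sets of dummies. With two factors the demeaning has to be applied alternately until convergence (the same alternating-projections idea lfe uses).

# Hypothetical sketch: two-way demeaning by alternating projections.
demean2 <- function(v, a, b, tol = 1e-8, max_iter = 100L) {
  for (i in seq_len(max_iter)) {
    v_new <- v - ave(v, a)            # remove group means of factor a
    v_new <- v_new - ave(v_new, b)    # remove group means of factor b
    if (max(abs(v_new - v)) < tol) break
    v <- v_new
  }
  v_new
}

y_dm <- demean2(d$y, d$a, d$b)
x_dm <- demean2(d$x, d$a, d$b)        # x is an assumed extra covariate
coef(lm(y_dm ~ x_dm - 1))             # same slope as y ~ x + factor(a) + factor(b)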

Related

Variance inflation factors in R

I am trying to compute the VIFs from a regression model that has a lot of independent variables (> 100). I am using vif from the car package to do that.
I always get the error: cannot allocate vector of size 13.8 GB. I realize this is a memory issue, but my PC already has a lot of memory. So the question is: can this function be optimized in any way so that it doesn't require so much memory? I am unsure whether this is more of a stats or a computational question. And as my dataset is quite large, I am unsure how to represent this case with a MWE. Basically, what is needed is a lot of independent variables (e.g. 200+) and one arbitrary dependent variable, with around 440 observations per variable. Thanks for any hints.
I just ran a simulated version of what you did, and it worked fine: it took less than a second to run. This is 250 explanatory variables and one response, 500 observations.
For entertainment I pasted together a formula for it, but that isn't really necessary. The vif() were computed easily.
In general, since vif(j) = 1/(1 - R^2_j), where R^2_j is the R-squared from regressing the jth explanatory variable on all the other explanatory variables, the computation should take at most the time of 250 linear regressions with 500 observations and 250 explanatory variables, which is very fast and not at all memory intensive.
You might need to post your code so we can see what went wrong.
library(car)   # provides vif()

resp <- rnorm(500)
X <- data.frame(matrix(rnorm(500 * 250), nrow = 500, ncol = 250))
colnames(X) <- col_names <- paste("x", 1:250, sep = "")
form <- as.formula(paste("resp ~", paste(col_names, collapse = "+")))
hold <- lm(form, data = cbind(resp, X))
summary(vif(hold))
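If it helps to see why this is cheap, the definition can be checked directly against vif(); here is a minimal sketch reusing the objects above:

# Sanity check of vif(j) = 1/(1 - R^2_j) for the first variable.
r2_1 <- summary(lm(x1 ~ . - resp, data = cbind(resp, X)))$r.squared
1 / (1 - r2_1)   # should match vif(hold)["x1"]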

Computational speed of a complex Hierarchical GAM

I have a large dataset (3.5+ million observations) with a binary response variable, and I am trying to fit a hierarchical GAM with a global smoother plus group-level smoothers that share a penalty (model 'GS' in Pedersen et al. 2019). Specifically, I am trying to estimate the following structure: Global > Geographic Zone (N=2) > Bioregion (N=20) > Season (N varies by bioregion). In total, I am trying to estimate 36 different nested parameters.
Here is the code I am currently using:
modGS <- bam(
  outbreak ~
    te(days_diff, NDVI_mean, bs = c("tp", "tp"), k = c(5, 5)) +
    t2(days_diff, NDVI_mean, Zone, Bioregion, Season,
       bs = c("tp", "tp", "re", "re", "re"), k = c(5, 5), m = 2, full = TRUE) +
    s(Latitude, Longitude, k = 50),
  family = binomial(), select = TRUE, data = dat)
My main issue is that it is taking a very long time (5+ days) to fit the model. This nesting structure cannot be discretized, so I cannot compute it in parallel. Further, I have tried gamm4, but I ran into memory limit issues. Here is the gamm4 code:
modGS <- gamm4(
  outbreak ~
    t2(days_diff, NDVI_mean, bs = c("tp", "tp"), k = c(5, 5)) +
    t2(days_diff, NDVI_mean, Zone, Bioregion, Season,
       bs = c("tp", "tp", "re", "re", "re"), k = c(5, 5), m = 2, full = TRUE) +
    s(Latitude, Longitude, k = 50),
  family = binomial(), select = TRUE, data = dat)
What is the best/most computationally feasible way to run this model?
I cut down the computational time by reducing the number of bioregion levels and randomly sampling ca. 60% of the data. This actually allowed me to calculate OOB error for the model.
There is an article I read recently that has a specific section on decreasing computational time. The main things the authors highlight are:
Use the bam function with its fast fREML estimation, which re-factorizes the model matrix to make the calculation faster. It seems you have already done that.
Add the discrete = TRUE argument, which discretizes the covariates to a smaller finite number of unique values for estimation.
Set nthreads in this function so it runs on more than one core in parallel on your computer (see the sketch after this list).
As the authors caution, the second option can reduce the accuracy of your estimates. I fit some large models recently doing this and found the results were not always the same as with the default bam fit, so it's best to use this as a quick inspection rather than as the full result you are looking for.
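Here is a minimal sketch of the second and third suggestions applied to the bam call above (the nthreads value is illustrative, not a recommendation):

library(mgcv)

modGS_fast <- bam(
  outbreak ~
    te(days_diff, NDVI_mean, bs = c("tp", "tp"), k = c(5, 5)) +
    t2(days_diff, NDVI_mean, Zone, Bioregion, Season,
       bs = c("tp", "tp", "re", "re", "re"), k = c(5, 5), m = 2, full = TRUE) +
    s(Latitude, Longitude, k = 50),
  family   = binomial(),
  select   = TRUE,
  data     = dat,
  method   = "fREML",   # fast REML, the default estimation method for bam()
  discrete = TRUE,      # discretize covariates; trades a little accuracy for speed
  nthreads = 4)         # number of cores to use; adjust to your machine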

Error during wrapup: long vectors not supported yet: in glm() function

I found several questions on Stack Overflow on this topic (some of them without any answer), but nothing so far relating this error to regression.
I'm running a probit model in R with (I'm guessing) too many fixed effects (year and place):
myprobit <- glm(factor(Y) ~ factor(T) + factor(X1) + factor(X2) + factor(X3) +
                  factor(YEAR) + factor(PLACE),
                family = binomial(link = "probit"),
                data = DT)
The PLACE variable has about 1000 unique values and YEAR 8 values. The dataset DT has 13,099,225 obs and 79 columns.
The error I got is:
Error: cannot allocate vector of size 59.3 Gb
Error during wrapup: long vectors not supported yet: ../include/Rinlinedfuns.h:519
The machine I'm using has 128 GB of RAM.
So I don't know what I can do without changing the function. Does anyone know how to deal with this issue? Thanks!
In order to close this question, I have to mention that @Axeman's answer is the only feasible approach for my problem. The whole issue is that there is not enough memory to manage such a huge design matrix.
Therefore, running the probit regression with the biglm package and its bigglm() function is the only solution I have found so far.
Nevertheless, I realize that, because biglm works by iteratively taking chunks of the data, using factor() variables on the RHS is problematic whenever a factor level is not represented in a chunk. In other words, if a factor variable has 5 levels but only 4 of them appear in a given data chunk, the estimation throws an error.
There are several questions and comments about this on Stackoverflow.
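One commonly suggested workaround is to declare the full level set of each factor up front, so every chunk builds the same design matrix regardless of which levels happen to appear in it. A minimal sketch with the column names from the question (the chunksize value is illustrative, and Y is assumed to be coded 0/1):

library(biglm)

# Turn the categorical columns into factors once, keeping all levels.
for (v in c("T", "X1", "X2", "X3", "YEAR", "PLACE")) {
  DT[[v]] <- factor(DT[[v]])
}

myprobit <- bigglm(
  Y ~ T + X1 + X2 + X3 + YEAR + PLACE,
  family    = binomial(link = "probit"),
  data      = DT,
  chunksize = 100000)   # rows processed per iteration; tune to available memory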

Elastic net with Cox regression

I am trying to perform elastic net with Cox regression on 120 samples with ~100k features.
I tried R with the glmnet package, but R does not seem to support such big matrices (it seems R is not designed for 64-bit). Furthermore, the glmnet package does support sparse matrices, but for whatever reason sparse matrix + Cox regression has not been implemented.
I am not pushing for R, but it is the only tool I have found so far. Does anyone know what program I can use to calculate elastic net + Cox regression on big models? I did read that I could use a support vector machine, but I need to calculate the model first, and I cannot do that in R due to the above restriction.
Edit:
A bit of clarification: I am not reporting an error in R, as apparently it is normal for R to be limited in how many elements its matrices can hold (as for glmnet not supporting sparse matrix + Cox, I have no idea). I am not pushing for a particular tool, but it would be easier if there were another package or a stand-alone program that can do what I am looking for.
If someone has an idea or has done this before, please share your method (R, Matlab, something else).
Edit 2:
Here is what I used to test:
I made a 100 x 100,000 matrix, added class labels, and tried to create the design matrix using model.matrix.
data <- matrix(rnorm(100*100000), 100, 100000)
formula <- as.formula(class ~ .)
x = c(rep('A', 40), rep('B', 30), rep('C', 30))
y = sample(x=1:100, size=100)
class = x[y]
data <- cbind(data, class)
X <- model.matrix(formula, data)
The error I got:
Error: cannot allocate vector of size 37.3 Gb
In addition: Warning messages:
1: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
2: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
3: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
4: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
Thank you in advance! :)
Edit 3:
Thanks to @marbel I was able to construct a test model that works and does not become too big. It seems my problem came from using cbind in my test.
A few pointers:
That's a rather small dataset; R should be more than enough. All you need is a modern computer with a decent amount of RAM. I guess 4 GB should be enough for such a small dataset.
The package is also available in Julia and Python, but I'm not sure whether the Cox model is available there.
Here and here you have examples of the Cox model with the glmnet package. There is also a package called survival.
There are at least two problems with your code:
This is not something you would want to do in R: data <- cbind(data, class). It's just not memory efficient (and here it also coerces the numeric matrix to a character matrix). If you need to do this type of operation, use the data.table package. It allows assignment by reference; check out the := operator.
If all your data is numeric you don't need to use model.matrix, just use data.matrix(X).
If you have categorical variables, use model.matrix with them only, then add the resulting columns to the X matrix, perhaps using data.table, one column at a time with data.table::set or the := operator.
Hopefully this can help you debug the code. Good luck!
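For reference, here is a minimal sketch of the elastic-net Cox fit at the scale described in the question, using simulated data and skipping model.matrix entirely (the survival times and event indicators are made up):

library(glmnet)

set.seed(1)
n <- 120; p <- 100000
x <- matrix(rnorm(n * p), nrow = n, ncol = p)    # numeric feature matrix, no cbind needed
y <- cbind(time   = rexp(n),                     # made-up survival times
           status = rbinom(n, 1, 0.7))           # made-up event indicator (1 = event)

fit <- glmnet(x, y, family = "cox", alpha = 0.5) # alpha strictly between 0 and 1 gives an elastic net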

Samples from scaled inverse chisquare distribution

I want to generate samples from a scaled inverse chi-squared distribution in R. I know geoR has an R function for this, but I want to use the gamma distribution to generate them.
I think these two are equivalent:
X ~ rinvchisq(100, df=d, scale=s)
1/X ~ rgamma(100, shape=d/2, scale=2/(d*s))
Isn't that right? Can there be any numerical problems due to extreme values?
More specifically you would need X <- rinvchisq(...) and X <- 1/rgamma(...) (the ~ notation works this way in programs such as WinBUGS, and in statistics notation, but not in R). If you look at the code of geoR::rinvchisq, the relevant part is just
return((df * scale)/rchisq(n, df = df))
so if you have problems taking the reciprocal of very large or small chi-squared deviates you'll be in trouble anyway (although rchisq is internally using .External(C_rchisq, n, df), which falls through to C code, presumably for efficiency in this special case, rather than calling rgamma). If I were you I would go ahead and superimpose densities of some test samples just to make sure I hadn't screwed up the arithmetic or parameterization somewhere ...
For what it's worth there are also rinvgamma() functions in a variety of packages (library(sos); findFn("rinvgamma"))
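Here is a minimal sketch of that density comparison, with illustrative values for the degrees of freedom and scale:

library(geoR)   # provides rinvchisq()

set.seed(1)
d <- 5; s <- 2; n <- 1e5                      # illustrative df, scale, sample size

x1 <- rinvchisq(n, df = d, scale = s)
x2 <- 1 / rgamma(n, shape = d / 2, scale = 2 / (d * s))

# Superimpose the two empirical densities; they should agree closely.
plot(density(x1), main = "Scaled inverse chi-squared: two sampling routes")
lines(density(x2), lty = 2)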
