mgcv bam() error: cannot allocate vector of size 99.6 Gb

I am trying to fit an additive mixed model using bam (mgcv library). My dataset has 10^6 observations from a longitudinal study on growth in 2*10^5 children nested in 300 health centers. I am looking for the slope for each center.
The model is
bam(haz ~ s(month, bs = "cc", k = 12)+ sex+ s(age)+ center+ year+ year*center+s(child, bs="re"), data)
Whenever I try to fit the model, the following error message appears:
Error: cannot allocate vector of size 99.6 Gb
In addition: Warning message:
In matrix(by, n, q) : data length exceeds size of matrix
I am working on a cluster with 500 GB of RAM.
Thank you for any help

To diagnose more precisely where the problem is, try fitting your model with various terms left out. There are several terms in the model that could blow up on you:
the fixed effects involving center will blow up to 300 columns * 10^6 rows; depending on whether year is numeric or a factor, the year*center term could blow up to 600 columns or (nyears*300) columns
it's not clear to me whether bam uses sparse matrices for s(.,bs="re") terms; if not, you'll be in big trouble (2*10^5 columns * 10^6 rows)
As an order of magnitude, a vector of 10^6 numeric values (one column of your model matrix) takes about 7.6 MB, so 500 GB / 7.6 MB would allow for approximately 65,000 columns ...
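A quick way to check that arithmetic in R (the 7.6 MB figure is just the object.size() of a length-10^6 numeric vector):
# size of one numeric model-matrix column
print(object.size(numeric(1e6)), units = "Mb")   # ~7.6 Mb
# rough upper bound on the number of such columns fitting in 500 GB
500e3 / 7.6                                      # ~65,000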
Just taking a guess here, but I would try out the gamm4 package. It's not specifically geared for low-memory use, but:
‘gamm4’ is most useful when the random effects are not i.i.d., or
when there are large numbers of random coeffecients [sic] (more than
several hundred), each applying to only a small proportion of the
response data.
I would also make most of the terms into random effects:
gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12)+ sex+ s(age)+
(1|center)+ (1|year)+ (1|year:center)+(1|child), data)
or, if there are not very many years in the data set, treat year as a fixed effect:
gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12)+ sex+ s(age)+
year + (1|center)+ (1|year:center)+(1|child), data)
If there are a small number of years then (year|center) might make sense, to assess among-center variation and covariation among years ... if there are many years, consider making it a smooth term instead ...
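For concreteness, the (year|center) variant might look like this (a sketch only; I have not run it on data of this size):
gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) +
             (year | center) + (1 | child), data)
With many years, a factor smooth such as s(year, center, bs = "fs") in bam() would be one way to let the year effect vary smoothly by center.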

Related

Implementing Longitudinal Random Forest with LongituRF package in R

I have some high-dimensional repeated measures data, and I am interested in fitting a random forest model to investigate the suitability and predictive utility of such models. Specifically, I am trying to implement the methods in the LongituRF package. The methods behind this package are detailed here:
Capitaine, L., et al. Random forests for high-dimensional longitudinal data. Stat Methods Med Res (2020) doi:10.1177/0962280220946080.
Conveniently, the authors provide some useful data-generating functions for testing. So we have
install.packages("LongituRF")
library(LongituRF)
Let's generate some data with DataLongGenerator(), which takes as arguments n = sample size, p = number of predictors, and G = number of predictors with temporal behavior.
my_data <- DataLongGenerator(n=50,p=6,G=6)
my_data is a list of what you'd expect: Y (response vector),
X (matrix of fixed-effects predictors), Z (matrix of random-effects predictors),
id (vector of sample identifiers) and time (vector of time measurements). To fit a random forest model, simply run
model <- REEMforest(X=my_data$X,Y=my_data$Y,Z=my_data$Z,time=my_data$time,
id=my_data$id,sto="BM",mtry=2)
This takes about 50 seconds here, so bear with me.
So far so good. Now I'm clear about all the parameters here except for Z. What is Z when I go to fit this model on my actual data?
Looking at my_data$Z:
dim(my_data$Z)
[1] 471 2
head(my_data$Z)
[,1] [,2]
[1,] 1 1.1128914
[2,] 1 1.0349287
[3,] 1 0.7308948
[4,] 1 1.0976203
[5,] 1 1.3739856
[6,] 1 0.6840415
Each row looks like an intercept term (i.e. 1) plus a value drawn from a uniform distribution via runif().
The documentation of REEMforest() indicates that "Z [matrix]: A Nxq matrix containing the q predictor of the random effects." How is this matrix to be specified when using actual data?
My understanding is that traditionally Z is simply a one-hot (binary) encoding of the group variables (e.g. as described here), so Z from DataLongGenerator() should be an n x G (471 x 6) sparse matrix, no?
Clarity on how to specify the Z parameter with actual data would be appreciated.
EDIT
My specific example is as follows: I have a response variable (Y). Samples (identified with id) were randomly assigned to an intervention (I, intervention or no intervention). There is a high-dimensional set of features (X). Features and response were measured at two timepoints (Time, baseline and endpoint). I am interested in predicting Y using X and I. I am also interested in extracting which features were most important to predicting Y (the same way Capitaine et al. did with HIV in their paper).
I will call REEMforest() as follows
REEMforest(X=cbind(X,I), Y=Y, time=Time, id=id)
What should I use for Z?
When the function DataLongGenerator() creates Z, it's random uniform data in a matrix. The actual coding is
Z <- as.matrix(cbind(rep(1, length(f)), 2 * runif(length(f))))
where f represents the length of the vectors that make up each of the elements. In your example, you used n = 50 participants with p = 6 fixed-effect predictors and G = 6 temporal predictors, which led to a length of 471 here.
From what I can gather, since this function is designed to simulate longitudinal data, this is a simulation of random effects on that data. If you were working with real data, I think it would be a lot easier to understand.
While this example doesn't use RE-EM forests, I thought it was pretty clear, because it uses tangible elements as an example. You can read about random effects in section 1.2.2 Fixed v. Random Effects. https://ademos.people.uic.edu/Chapter17.html#32_fixed_effects
Look at section 3.2 to see examples of random effects that you could intentionally model if you were working with real data.
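To make that concrete for real data, the simplest choice (an assumption on my part, not something the package documentation prescribes) is a random intercept, i.e. a single column of ones, optionally with time added as a second column for a random slope:
# random intercept only: one column of ones, one row per observation
Z <- matrix(1, nrow = length(Y), ncol = 1)
# random intercept plus a random slope on time
Z <- cbind(1, Time)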
Another example: you're running a cancer drug trial. You've collected patient demographics on a weekly basis (weight, temperature, and a CBC panel), and there are different drug-administration groups: 1 unit per day, 2 units per day, and 3 units per day.
In traditional regression, you'd model these variables to determine how accurately the model identifies the outcome. The fixed effects are the explainable variance, or R2. So if R2 is .86, or 86%, then 14% is unexplained. It could be an interaction causing the noise: the unexplained variance between a perfect fit and the outcome the model determined.
Let's say the patients with really low white blood cell counts who were overweight responded far better to the treatment. Or perhaps the patients with red hair responded better, and that's not in your data. In terms of longitudinal data, let's say that the relationship (the interaction) only appears after some measure of time passes.
You can try to model different relationships to evaluate the random interactions in the data. I think you'd be better off with one of the many ways to evaluate interactions systematically than a random attempt to identify random effects, though.
EDITED: I started to write this in the comments with @JustGettinStarted, but it was too much.
Without the background, the easiest way to achieve this would be to run something like REEMtree::REEMtree(), setting the random-effects argument to random = ~1 | time/id. After it runs, extract the random effects it has calculated. You can do it like this:
library(dplyr)
library(tidyr)

# 'fit' is the model returned by REEMtree::REEMtree(..., random = ~1 | time/id)
data2 <- data %>%
  mutate(oOrder = row_number()) %>%   # identify original order of the data
  arrange(time, id) %>%
  mutate(zOrder = row_number())       # the random effects will be in order by time, then id

extRE <- data.frame(time = attributes(fit$RandomEffects[2][["id"]])[["row.names"]]) %>%
  separate(col = time,
           into = c("time", "id"),
           sep = "\\/") %>%
  mutate(Z = fit$RandomEffects[[2]] %>% unlist(),
         id = as.integer(id),
         time = time)                 # convert time's type here if needed to match your dataset

data2 <- data2 %>% left_join(extRE) %>% arrange(oOrder)   # return to original order
Z <- cbind(rep(1, times = nrow(data2)), data2$Z)
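From there, the call would look something like this (a sketch only; X and Y stand for your predictor matrix and response, and sto = "BM" and mtry = 2 are just carried over from the simulated example above):
model <- REEMforest(X = X, Y = Y, Z = Z,
                    id = data2$id, time = data2$time,
                    sto = "BM", mtry = 2)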
Alternatively, I suggest that you just start with randomly generated random effects. The random effects you start with are only a jumping-off point; the random effects at the end will be different.
No matter how many ways I tried to use LongituRF::REEMforest() with real data, I ran into errors. I had an uninvertible matrix failure every time.
I noticed that the data generated by DataLongGenerator() comes in order by id, then time. I tried ordering the data (and Z) that way, but it didn't help. When I extracted all the functionality out of the package LongituRF and used the MERF (mixed-effects random forest) function directly, I had no problems. Even in the research paper, that method was solid. Just thought it was worth mentioning.
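For reference, if you go that route, LongituRF::MERF() appears to take essentially the same arguments as REEMforest(); a sketch mirroring the call above:
model_merf <- MERF(X = X, Y = Y, Z = Z,
                   id = data2$id, time = data2$time,
                   sto = "BM", mtry = 2)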

Computational speed of a complex Hierarchical GAM

I have a large dataset (3.5+ million observations) with a binary response variable, for which I am trying to fit a hierarchical GAM with a global smoother plus group-level smoothers that share a penalty (model 'GS' in Pedersen et al. 2019). Specifically, I am trying to estimate the following structure: Global > Geographic Zone (N=2) > Bioregion (N=20) > Season (N varies by bioregion). In total, I am trying to estimate 36 different nested parameters.
Here is the code I am currently using:
modGS <- bam(
outbreak ~
te(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
My main issue is that it is taking a long time (5+ days) to fit the model. This nesting structure cannot be discretized, so I cannot compute it in parallel. Further, I have tried gamm4, but I ran into memory-limit issues. Here is the gamm4 code:
modGS <- gamm4(
outbreak ~
t2(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
What is the best/most computationally feasible way to run this model?
I cut down the computational time by reducing the number of bioregion levels and randomly sampling ca. 60% of the data. This actually allowed me to calculate OOB error for the model.
There is an article I read recently that has a specific section on decreasing computational time. The main things they highlight are:
Using the bam function with its fREML estimation, which refactorizes the model matrix to make calculation faster. Here it seems you have already done that.
Adding the discrete = TRUE argument, which discretizes the covariates down to a smaller finite number of unique values for estimation.
Setting nthreads in this function so it runs on more than one core in parallel on your computer.
As the authors caution, the second option can reduce the accuracy of your estimates. I fit some large models recently doing this and found that the results were not always the same as with the default bam settings, so it's best to use this as a quick inspection rather than the full result you are looking for.
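Putting those two options together, the call from the question might be adjusted along these lines (a sketch only; nthreads = 4 is a hypothetical core count, and whether discrete = TRUE is acceptable depends on how much approximation you can tolerate):
modGS <- bam(
  outbreak ~
    te(days_diff, NDVI_mean, bs = c("tp", "tp"), k = c(5, 5)) +
    t2(days_diff, NDVI_mean, Zone, Bioregion, Season,
       bs = c("tp", "tp", "re", "re", "re"), k = c(5, 5), m = 2, full = TRUE) +
    s(Latitude, Longitude, k = 50),
  family = binomial(), select = TRUE, data = dat,
  discrete = TRUE,   # discretize covariates: faster but approximate
  nthreads = 4)      # hypothetical; set to the number of cores available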

Is there a way to handle "cannot allocate vector of size" issue without dropping data?

Unlike a previous question about this, my case is different, and that is why I'm asking. I have an already cleaned dataset containing 120,000 observations of 25 variables, and I am supposed to analyze it through logistic regression and random forest. However, I get the error "cannot allocate vector of size 98 GB", whereas my friend doesn't.
The summary says most of it. I even tried reducing the number of observations to 50,000 and the number of variables in the dataset to 15 (using 5 of them in the regression), and it still failed. However, I sent the script with the shortened dataset to a friend, and she could run it. This is odd because I have a 64-bit system with 8 GB of RAM, while she has only 4 GB. So it appears that the problem lies with me.
pd_data <- read.csv2("pd_data_v2.csv")
split <- rsample::initial_split(pd_data, prop = 0.7)
train <- rsample::training(split)
test <- rsample::testing(split)
log_model <- glm(default ~ profit_margin + EBITDA_margin + payment_reminders, data = pd_data, family = "binomial")
log_model
The result should be a logistic model where I can see the coefficients, measure its accuracy, and make adjustments.

Memory usage of imputation with mice in R

I am currently working on the imputation of 10 large datasets with mice in R (after first creating a predictor matrix with a correlation threshold of 0.3, dfpred03), and I am running into a lot of issues like the following:
imptest <- mice(df, m=1, maxit = 1, method='cart',predictorMatrix=dfpred03)
iter imp variable
1 1 VAR1 VAR2 VAR3 VAR4
Error: cannot allocate vector of size 446 Kb
Error during wrapup: cannot allocate vector of size 3.6 Mb
I understand I will have to make some concessions, but because I am not sure what the bottleneck is, I do not really know which concessions to make.
Is there any documentation about how mice in R uses data?
My most important questions:
EDITED: How can I adapt the predictor matrix to make the imputation process less memory-consuming (see the sketch after these questions)? I am thinking of setting variables which have both low row and column sums to zero, but then I am left with NAs. Another thing I could do is remove those variables and other variables with few observations.
If I succeed in doing one iteration of a dataset, would that mean that I can increase maxit and m because the maximum memory usage is in the iteration?
Is most memory used because of the number of predictors or the amount of missingness in observations (and therefore the number of observations)?
More generally, where does the spike in memory usage lie?
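A minimal sketch of the predictor-matrix thinning described in the first (edited) question, assuming dfpred03 was built with mice::quickpred() (an assumption; the post only names the object):
library(mice)
# keep only predictors correlated with the target at >= 0.3
dfpred03 <- quickpred(df, mincor = 0.3)
# drop weak predictors entirely: zero out columns that are rarely used
weak <- colSums(dfpred03) < 5                 # hypothetical threshold
dfpred03[, weak] <- 0
imptest <- mice(df, m = 1, maxit = 1, method = "cart",
                predictorMatrix = dfpred03)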

Consistency between OOB error rate and the confusion matrix

When inspecting the statistics of my models, it looks like the numbers in the confusion matrix are not consistent with those of the OOB error rate in randomForest.
How can I deduce the OOB error rate from the confusion matrix? What is the relationship between them?
In the example below, I print the output from two models, one that was fit with stratified sampling (using a subset of the samples in sampsize) and one that was fit without (i.e. using the default sampling scheme, which I think uses all samples).
I don't have the data public, but here are the function calls:
sumY = summary(Y)
sampsize <- c(sumY["Y0"]/10, sumY["Y1"])
# First model in the image above
strat.rf.model <- randomForest(x=X,y=Y,sampsize=sampsize, strata=Y)
# Second model in the image above
rf.model <- randomForest(x=X,y=Y)
It's not inconsistent, it's just arithmetic:
> 180 / (1699 + 180)
[1] 0.09579564
> 63 / (63 + 58)
[1] 0.5206612
> (180 + 63) / (1699 + 180 + 63 + 58)
[1] 0.1215
The error rate in each class is defined as the proportion of misclassified observations in just that class, whereas the overall misclassification rate is the proportion of misclassified observations for the entire data set.
It is rare for the error rate for each class to exactly match the overall error rate. If you stop and think about it for a second, this makes perfect sense: some classes are going to be harder to identify than others, and then the overall error rate is sort of the "average".
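If you want to reproduce those numbers directly from the fitted object, something like this should work (a sketch; in randomForest, $confusion stores the per-class error in its last column, which has to be dropped first):
cm <- rf.model$confusion[, -ncol(rf.model$confusion)]   # counts only, drop class.error
1 - diag(cm) / rowSums(cm)                              # per-class error rates
1 - sum(diag(cm)) / sum(cm)                             # overall OOB error rate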
