Fitting multivariate (multilevel) model with 200 effect sizes (rma.mv with optimParallel) fails in metafor - r

I have the following data structure, with approximately i = 50 studies, j = 75 experiments, and k = 200 conditions.
On level k I have the dependent measures. For about 20 studies (25 experiments and 65 conditions) I have subject-level data and calculated the variance-covariance matrix from it. For the rest I calculated a variance-covariance matrix from estimated correlations (for subjects and conditions). Finally, I have a complete k x k variance-covariance matrix V.
To respect the multilevel structure of the data, I let every condition in every experiment in every study have its own covariance, using an unstructured variance-covariance matrix (see Details - Specifying Random Effects). Note that I am not 100% sure about this reasoning, or about reasoning for/against assumed variance-covariance structures in multilevel models in general, so I am happy to receive some thoughts/literature on this...
I now want to conduct a multivariate (multilevel) random effects model with:
rma.mv(
  yi = yk
  , V = V
  , random = list(~ exp_j | stu_i,
                  ~ con_k | exp_j)
  , struct = "UN"
  , method = "REML"
  , test = "t"  ## slightly mimics the Knapp-Hartung (knha) adjustment
  , data = dat
  , slab = con_k
  , control = list(optimizer = "optimParallel", ncpus = 32)
)
When run on the complete data set, the calculation reaches 128 GB(!) of RAM within a few minutes, and at some point R simply terminates without an error message.
1) Is this to be expected with the amount of data I have?
Running the same model on a subset of the original data (i.e., i = 20, j = 25 and k = 65; I just grabbed the data without estimated variance-covariance matrices) works fine and peaks at ~20 GB of RAM.
I saw the tips section of the metafor package as well as the optimisation options for rma.mv() in the notes. 2) In my scenario, is switching to Microsoft R Open or to another optimizer (without parallelisation?!) reasonable?
Note that the model above is not the final model I want to fit. No moderators are included yet; additional model(s) should include regression terms for moderators. It will become even more complex, I guess...
I am running R version 3.6.3 (2020-02-29) on x86_64-pc-linux-gnu (64-bit) under: Ubuntu 18.04.5 LTS. Metafor is on Version 2.4-0.
Best
Jonas

Probably not every study has 50 experiments and not every experiment has 200 conditions, but yes, 50 * 75 * 200 (i.e., 750,000) rows of data would be a problem. However, before I address this issue, let's start with the model itself, which makes little sense. With 75 experiments within those 50 studies, using ~ exp_j | stu_i with struct="UN" implies that you are trying to estimate the variances and covariances of a 75 x 75 var-cov matrix. An unstructured p x p matrix has p(p+1)/2 free parameters, so that's 75 * 76 / 2 = 2850 parameters already. The ~ con_k | exp_j part adds yet another 20,000+ parameters (200 * 201 / 2 = 20,100) by my calculation. This is never going to work.
Based on your description, you have a multilevel structure, but there is no inherent link between what experiment 1 in study 1 stands for and what experiment 1 in study 2 stands for. So the experiment identifier is just used here to distinguish the different experiments within studies, but carries no further meaning. Compare this with the situation where you have, for example, outcomes A and B in study 1, outcome A in study 2, outcome B in study 3, and so on. 'A' really stands for 'A' in all studies and is not just used to distinguish the elements.
Another issue is that ~ con_k | exp_j will not automatically be nested within studies. The rma.mv() function also allows for crossed random effects, so if you want to add random effects for conditions within experiments, where experiments are in turn nested within studies, you should create a new variable, for example exp.in.study, that reflects this. You could do this with dat$exp.in.study <- paste0(dat$stu_i, ".", dat$exp_j). Then you can use ~ con_k | exp.in.study to reflect the nesting.
However, based on your description, what I think you really should use is a much simpler model structure, namely random = ~ 1 | stu_i / exp_j / con_k (in that case, the struct argument is not relevant).
Still, if your dataset has 100,000+ rows, then the default way rma.mv() works will become a memory issue, because internally the function juggles matrices of those dimensions. A simple solution is to use sparse=TRUE, in which case the matrices are stored internally as sparse structures. You probably won't even need parallel processing then, but you could try whether optimizer="optimParallel" speeds things up (then ncpus=3 is all you need, because that is the number of variance components that will be estimated by the model as suggested above). A sketch putting these pieces together follows.
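As a minimal sketch (untested on your data; variable names as in the question), the suggested model might look like this:

res <- rma.mv(
  yi = yk
  , V = V
  , random = ~ 1 | stu_i / exp_j / con_k  ## nested random effects: 3 variance components
  , method = "REML"
  , test = "t"
  , data = dat
  , slab = con_k
  , sparse = TRUE  ## store model matrices internally as sparse structures
  , control = list(optimizer = "optimParallel", ncpus = 3)  ## one cpu per variance component
)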

Related

Implementing Longitudinal Random Forest with LongituRF package in R

I have some high-dimensional repeated-measures data, and I am interested in fitting a random forest model to investigate the suitability and predictive utility of such models. Specifically, I am trying to implement the methods in the LongituRF package. The methods behind this package are detailed here:
Capitaine, L., et al. Random forests for high-dimensional longitudinal data. Stat Methods Med Res (2020) doi:10.1177/0962280220946080.
Conveniently the authors provide some useful data generating functions for testing. So we have
install.packages("LongituRF")
library(LongituRF)
Let's generate some data with DataLongGenerator(), which takes as arguments n = sample size, p = number of predictors, and G = number of predictors with temporal behavior.
my_data <- DataLongGenerator(n=50,p=6,G=6)
my_data is a list of what you'd expect: Y (the response vector),
X (a matrix of fixed-effects predictors), Z (a matrix of random-effects predictors),
id (a vector of sample identifiers) and time (a vector of time measurements). To fit a random forest model, simply
model <- REEMforest(X = my_data$X, Y = my_data$Y, Z = my_data$Z,
                    time = my_data$time, id = my_data$id, sto = "BM", mtry = 2)
This takes about 50 seconds here, so bear with me.
So far so good. Now I'm clear about all the parameters here except for Z. What is Z when I go to fit this model on my actual data?
Looking at my_data$Z.
dim(my_data$Z)
[1] 471 2
head(my_data$Z)
[,1] [,2]
[1,] 1 1.1128914
[2,] 1 1.0349287
[3,] 1 0.7308948
[4,] 1 1.0976203
[5,] 1 1.3739856
[6,] 1 0.6840415
Each row looks like an intercept term (i.e., 1) plus a value drawn from a uniform distribution, runif().
The documentation of REEMforest() indicates that "Z [matrix]: A Nxq matrix containing the q predictor of the random effects." How is this matrix to be specified when using actual data?
My understanding is that traditionally Z is simply a one-hot (binary) encoding of the group variables (e.g., as described here), so Z from DataLongGenerator() should be an n x G (471 x 6) sparse matrix, no?
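For reference, here is a tiny sketch of what I mean by a one-hot Z (an illustration of my understanding, not something from the package):

id <- factor(rep(1:3, each = 2))
Z_onehot <- model.matrix(~ 0 + id)  # one indicator column per group
Z_onehot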
Clarity on how to specify the Z parameter with actual data would be appreciated.
EDIT
My specific example is as follows: I have a response variable (Y). Samples (identified with id) were randomly assigned to an intervention (I; intervention or no intervention). There is a high-dimensional set of features (X). Features and response were measured at two timepoints (Time; baseline and endpoint). I am interested in predicting Y using X and I. I am also interested in extracting which features were most important to predicting Y (the same way Capitaine et al. did with HIV in their paper).
I will call REEMforest() as follows
REEMforest(X=cbind(X,I), Y=Y, time=Time, id=id)
What should I use for Z?
When the function DataLongGenerator() creates Z, it's random uniform data in a matrix. The actual coding is
Z <- as.matrix(cbind(rep(1, length(f)), 2 * runif(length(f))))
where f represents the length of the matrices that represent each of the elements. In your example, you used 6 groups of 50 participants with 6 fixed effects, which led to a length of 471.
From what I can gather, since this function is designed to simulate longitudinal data, this is a simulation of random effects on that data. If you were working with real data, I think it would be a lot easier to understand.
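For example, a minimal sketch (my assumption, not something the package documents): with real data, a Z mirroring the generator's structure would be a random intercept plus a random slope on a continuous covariate such as time:

Z <- cbind(rep(1, length(Time)), Time)  # column 1: random intercept; column 2: random slope on time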
While this example doesn't use RE-EM forests, I thought it was pretty clear because it uses tangible elements. You can read about random effects in section 1.2.2, Fixed vs. Random Effects: https://ademos.people.uic.edu/Chapter17.html#32_fixed_effects
Look at section 3.2 to see examples of random effects that you could intentionally model if you were working with real data.
Another example: You're running a cancer drug trial. You've collected patient data on a weekly basis (weight, temperature, and a CBC panel) across different drug-administration groups: 1 unit per day, 2 units per day, and 3 units per day.
In traditional regression, you'd model these variables to determine how accurately the model identifies the outcome. The fixed effects are the explainable variance, or R2. So if you have .86, or 86%, then 14% is unexplained. It could be an interaction causing the noise: the unexplained variance between a perfect prediction and what the model determined the outcome was.
Let's say the patients who had really low white blood cell counts and were overweight responded far better to the treatment. Or perhaps the patients with red hair responded better; that's not in your data. In terms of longitudinal data, let's say that the relationship (the interaction relationship) only appears after some measure of time passes.
You can try to model different relationships to evaluate the random interactions in the data. I think you'd be better off with one of the many ways to evaluate interactions systematically than a random attempt to identify random effects, though.
EDITED: I started to write this in the comments with @JustGettinStarted, but it was too much.
Without the background, the easiest way to achieve this would be to run something like REEMtree::REEMtree(), setting the random-effects argument to random = ~1 | time / id. After it runs, extract the random effects it has calculated. You can do it like this:
library(dplyr)  # for mutate(), arrange(), left_join()
library(tidyr)  # for separate()

data2 <- data %>%
  mutate(oOrder = row_number()) %>%   # identify original order of the data
  arrange(time, id) %>%
  mutate(zOrder = row_number())       # the random effects will be ordered by time, then id

extRE <- data.frame(time = attributes(fit$RandomEffects[2][["id"]])[["row.names"]]) %>%
  separate(col = time,
           into = c("time", "id"),
           sep = "\\/") %>%
  mutate(Z = fit$RandomEffects[[2]] %>% unlist(),
         id = as.integer(id),
         time = time)                 # convert data types here to match your dataset

data2 <- data2 %>% left_join(extRE) %>% arrange(oOrder)  # return to original order

Z <- cbind(rep(1, times = nrow(data2)), data2$Z)
Alternatively, I suggest that you start with randomly generated random effects. The random effects you start with are just a jumping-off point; the random effects at the end will be different.
No matter how many ways I tried to use LongituRF::REEMforest() with real data, I ran into errors. I had an uninvertible matrix failure every time.
I noticed that the data generated by DataLongGenerator() comes ordered by id, then time. I tried ordering the data (and Z) that way, but it didn't help. When I extracted all the functionality out of the LongituRF package, I used the MERF (mixed-effects random forest) function with no problems. Even in the research paper, that method was solid. Just thought it was worth mentioning.

Computational speed of a complex Hierarchical GAM

I have a large dataset (3.5+ million observations) with a binary response variable, and I am trying to fit a hierarchical GAM with a global smoother plus group-level smoothers that share a penalty (the 'GS' model in Pedersen et al. 2019). Specifically, I am trying to estimate the following structure: Global > Geographic Zone (N = 2) > Bioregion (N = 20) > Season (N varies by bioregion). In total, I am trying to estimate 36 different nested parameters.
Here is the code I am currently using:
modGS <- bam(
  outbreak ~
    te(days_diff, NDVI_mean, bs = c("tp", "tp"), k = c(5, 5)) +
    t2(days_diff, NDVI_mean, Zone, Bioregion, Season,
       bs = c("tp", "tp", "re", "re", "re"), k = c(5, 5), m = 2, full = TRUE) +
    s(Latitude, Longitude, k = 50),
  family = binomial(), select = TRUE, data = dat)
My main issue is that it is taking a very long time (5+ days) to fit the model. This nesting structure cannot be discretized, so I cannot compute it in parallel. Further, I have tried gamm4, but I ran into memory-limit issues. Here is the gamm4 code:
modGS <- gamm4(
  outbreak ~
    t2(days_diff, NDVI_mean, bs = c("tp", "tp"), k = c(5, 5)) +
    t2(days_diff, NDVI_mean, Zone, Bioregion, Season,
       bs = c("tp", "tp", "re", "re", "re"), k = c(5, 5), m = 2, full = TRUE) +
    s(Latitude, Longitude, k = 50),
  family = binomial(), select = TRUE, data = dat)
What is the best/most computationally feasible way to run this model?
I cut down the computational time by reducing the number of bioregion levels and randomly sampling ca. 60% of the data. This actually allowed me to calculate OOB error for the model.
There is an article I read recently that has a specific section on decreasing computational time. The main things they highlight are:
Use the bam function with its fast REML (fREML) estimation, which refactorizes the model matrix to make the calculation faster. It seems you have already done that here.
Add the discrete = TRUE argument, which discretizes the covariates down to a smaller, finite number of unique values for estimation.
Increase nthreads in this function so it runs on more than one core in parallel on your computer.
As the authors caution, the second option can reduce the accuracy of your estimates. I fit some large models recently doing this and found that the results were not always the same as with the default bam call, so it's best to use this as a quick inspection rather than as the full result you are looking for. A sketch combining these options is below.
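As a minimal sketch (the nthreads value is an assumption; adjust it to your machine), here is the model from the question with those speed-ups applied:

modGS <- bam(
  outbreak ~
    te(days_diff, NDVI_mean, bs = c("tp", "tp"), k = c(5, 5)) +
    t2(days_diff, NDVI_mean, Zone, Bioregion, Season,
       bs = c("tp", "tp", "re", "re", "re"), k = c(5, 5), m = 2, full = TRUE) +
    s(Latitude, Longitude, k = 50),
  family = binomial(), select = TRUE, data = dat,
  method = "fREML",   # fast REML estimation
  discrete = TRUE,    # discretize covariates; faster but approximate
  nthreads = 4)       # parallel threads; adjust to your hardware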

Code syntax in calculating posterior distribution in WinBUGS

Recently I read "The BUGS Book – A Practical Introduction to Bayesian Analysis" to learn WinBUGS. The way WinBUGS describes the derivation of the posterior distribution confuses me.
Let's take Example 4.1.1 in this book to illustrate:
Suppose we observe the number of deaths y in a given hospital for a
high-risk operation. Let n denote the total number of such
operations performed and suppose we wish to make inferences regarding
the underlying true mortality rate, $\theta$.
The code of WinBUGS is:
y <- 10 # the number of deaths
n <- 100 # the total number of such operations
#########################
y ~ dbin(theta,n) # likelihood, also a parametric sampling distribution
logit(theta) <- logit.theta # normal prior for the logistic transform of theta
logit.theta ~ dnorm(0,0.368) # precision = 1/2.71
The author said that:
The software knows how to derive the posterior distribution and
subsequently sample from it.
My question is:
Which code expresses the logical structure that tells WinBUGS which parameter I want the posterior distribution calculated for?
This question seems silly, but without reading the background first, I truly cannot tell directly from the code above which parameter is the focus (e.g., theta, or y?).
Below are some of my thoughts (as a beginner of WinBUGS):
I think the following three attributes of the WinBUGS code style confuse me:
(1) The code does not follow a specific sequence. For example, why is logit.theta ~ dnorm(0,0.368) not placed before logit(theta) <- logit.theta?
(2) Repeated variables. For example, why can't the last two lines be reduced to one line: logit(theta) ~ dnorm(0,0.368)?
(3) Variables are defined in more than one place. For example, y is defined twice: y <- 10 and y ~ dbin(theta, n). This has been explained in Appendix A of the book (i.e., "However, a check has been built in so that when finding a logical node that also features as a stochastic node, a stochastic node is created with the calculated values as fixed data"), yet I still cannot grasp its meaning.
BUGS is a declarative language. For the most part, statements aren't executed in sequence, they define different parts of the model. BUGS works on models that can be represented by directed acyclic graphs, i.e. those where you put a prior on some components, then conditional distributions on other components given the earlier ones.
It's a fairly simple language, so I think logit(theta) ~ dnorm(0, 0.368) is just too complicated for it.
The language lets you define a complicated probability model and declare observations of certain components in it. Once you declare an observation, the model that BUGS samples from is the original full model conditioned on that observation. y <- 10 defines observed data; y ~ dbin(theta,n) is part of the model.
The statement n <- 100 could be either: for fixed constants like n, it doesn't really matter which way you think of it. Either the model says that n is always 100, or n has an undeclared prior distribution not depending on any other parameter, and an observed value of 100. These two statements are equivalent.
Finally, your big question: nothing in the code above says which parameter you want to look at. BUGS will compute the joint posterior distribution of every parameter. n and y will take on their fixed values; theta and logit.theta will both be simulated from the posterior. In another part of your code (or by using the WinBUGS menus) you decide which of those to look at; a sketch of doing this from R is below.
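For instance, a minimal sketch when driving WinBUGS from R via the R2WinBUGS package (assuming the model block above is saved as "model.txt"; the file name, inits, and iteration settings here are my assumptions):

library(R2WinBUGS)
fit <- bugs(
  data = list(y = 10, n = 100),
  inits = function() list(logit.theta = 0),
  parameters.to.save = c("theta", "logit.theta"),  # which posteriors to monitor
  model.file = "model.txt",
  n.chains = 3, n.iter = 10000)
print(fit)  # summaries for the monitored parameters only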

Treatment of categorical variables in rpart

I wonder how rpart treats categorical variables. There are several references suggesting that for unordered factors it looks through all combinations of levels. Indeed, even the vignette, at the end of section 6.2, states:
(F)or a categorical predictor with m levels, all 2^(m−1) different possible splits are tested.
However, given my experience with the code, I find this difficult to believe. The vignette offers supporting evidence that running
rpart(Reliability ~ ., data=car90)
takes a really long, long time. However, in my case it runs in seconds, despite an unordered factor variable with 30 levels.
To demonstrate the issue further, I created several variables with 52 levels each, meaning that 2^51 − 1 ≈ 2.2 × 10^15 splits would need to be checked if all possibilities were explored. This code runs in about a minute, IMHO proving that not all combinations are checked.
library(rpart)

NROW = 50000
NVAR = 20
rand_letters = data.frame(replicate(NVAR, as.factor(c(
  letters[sample.int(26, floor(NROW/2), replace = TRUE)],
  LETTERS[sample.int(26, ceiling(NROW/2), replace = TRUE)]))))
rand_letters$target = rbinom(n = NROW, size = 1, prob = 0.1)
system.time({
  tree_letter = rpart(target ~ ., data = rand_letters, cp = 0.0003)
})
tree_letter
What combinations of categorical variables are ACTUALLY checked in rpart?
I know it is an old question but I found this link that might answer some of it.
Bottom line is that rpart seems to be applying a simple shortcut:
First, sort the conditional means, p_i = E(Y | X = x_i).
Then compute Gini indices based on the groups obtained from that ordering.
Finally, pick the grouping that gives the maximum of these Gini indices.
So it should not be nearly as computationally expensive; a toy sketch of the ordering trick is below.
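Here is a toy sketch of that ordering trick (my illustration of the described algorithm, not rpart's internal code):

set.seed(1)
x <- factor(sample(letters[1:5], 1000, replace = TRUE))
y <- rbinom(1000, 1, prob = c(0.1, 0.3, 0.5, 0.7, 0.9)[as.integer(x)])
p_i <- tapply(y, x, mean)   # conditional means E(Y | X = x_i)
ord <- names(sort(p_i))     # sort the levels by those means
x_ord <- factor(x, levels = ord, ordered = TRUE)
# now only m - 1 ordered splits need checking instead of 2^(m-1) - 1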
However, I personally have a case with a single categorical variable whose categories are US states, and rpart runs for an extremely long time when trying to use it to produce a classification tree. Creating dummy variables and running rpart with the 51 variables (one per state) works fine.

h2o.GBM taking too long on a small sized dataset

I've got a rather small dataset (162,000 observations with 13 attributes) that I'm trying to use for modelling with h2o.gbm. The response variable is categorical with a large number of levels (~20,000).
The model doesn't run out of memory or give any errors, but it's been going for nearly 24 hours without any progress (the H2O GBM reporting says 0%).
I finally gave in and stopped it.
I'm wondering if there's anything wrong with my hyperparameters, as the data is not particularly large. Here's my code:
library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")
train.h20 <- as.h2o(analdata_train)
gbm1 <- h2o.gbm(
y = response_var
, x = independ_vars
, training_frame = train.h20
, ntrees = 3
, max_depth = 5
, min_rows = 10
, stopping_tolerance = 0.001
, learn_rate = 0.1
, distribution = "multinomial"
)
The way H2O GBM multinomial classification works is that when you ask for 1 tree as a parameter, it actually builds a tree for each level in the response column under the hood.
So 1 tree really means 20,000 trees in your case, 2 trees really mean 40,000, and so on; your ntrees = 3 therefore requests 60,000 trees.
(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)
So... it will probably finish but it could take quite a long time!
It's probably not a good idea to train a classifier with 20,000 classes -- most GBM implementations won't even let you do that. Can you group/cluster the classes into a smaller number of groups so that you can train a model with a smaller number of classes? If so, then you could perform your training in a two-stage process -- the first model would have K classes (assuming you clustered your classes into K groups). Then you can train secondary models that further classify the observations into your original classes.
This type of two-stage process may make sense if your classes represent groups that naturally cluster into a hierarchy, such as zip codes or ICD-10 medical diagnostic codes, for example. A rough sketch of the idea is shown below.
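As a rough sketch of the two-stage idea (the class_group column is a hypothetical mapping of the 20,000 classes into K groups that you would have to define; the other arguments follow the question's code):

# Stage 1: predict the coarse group
gbm_stage1 <- h2o.gbm(y = "class_group", x = independ_vars,
                      training_frame = train.h20,
                      ntrees = 50, distribution = "multinomial")
# Stage 2: within one group, predict the original classes
sub <- train.h20[train.h20$class_group == "group_1", ]
gbm_stage2 <- h2o.gbm(y = response_var, x = independ_vars,
                      training_frame = sub,
                      ntrees = 50, distribution = "multinomial")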
If your use-case really demands that you train a 20,000 class GBM (and there's no way around it), then you should get a bigger cluster of machines to use in your H2O cluster (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.
