glmmLasso function won't finish running - r

I am trying to build a mixed-model Lasso using glmmLasso in RStudio, and I am looking for some assistance.
My model is specified as follows:
glmmModel <- glmmLasso(outcome ~ year + married, list(ID = ~1), lambda = 100, family = gaussian(link = "identity"), data = data1, control = list(print.iter = TRUE))
where outcome is a continuous variable, year is the year the data were collected, and married is a binary indicator (1/0) of whether or not the subject is married. I would eventually like to include more covariates, but for now I am just trying to get this model with two covariates to run successfully. My data1 data frame has 48,000 observations and 57 variables.
When I click Run, however, the model keeps running for many hours (48+) without finishing. The only feedback I get is "ITERATION 1", "ITERATION 2", etc. Is there something I am missing or doing wrong? Please note that I am running this on a machine with only 8 GB of RAM, but I don't think that should be the issue, right? My dataset (48,000 observations) isn't particularly large, at least I don't think so. Any advice on how I can fix this would be appreciated. Thank you!

This is too long for a comment, but I feel you deserve an answer to clear up the confusion.
It is not uncommon to experience "slow" performance; in fact, with many GLMM implementations it is more common than not. Generalized linear mixed-effects models are simply very hard to estimate. For purely Gaussian models (with no penalty) a series of proofs gives us the REML estimator, which can be computed very efficiently, but for generalized models this is not the case. Note also that the random-effects model matrix can become absolutely massive: for every random effect you obtain a block-diagonal matrix, so even for modestly sized data you might have a model matrix with 2,000+ columns that has to go through PIRLS optimization (inversions and so on).
Some packages (glmmTMB, lme4 and to some extent nlme) have very efficient implementations that exploit the block-diagonal structure of the random-effects matrix and use high-performance C/C++ libraries for optimized sparse-matrix calculations, whereas the glmmLasso package (link to source) performs all of its computations in base R. However you go about it, the fact that it neither exploits sparse computations nor implements its code in a compiled language makes it slow.
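To get a feel for the sizes involved, here is a minimal sketch (with a made-up grouping factor of 2,000 subjects and 48,000 rows, loosely matching the data described in the question) comparing a dense base-R random-intercept design matrix with its sparse equivalent from the Matrix package:
library(Matrix)
set.seed(1)
ID <- factor(sample(seq_len(2000), 48000, replace = TRUE))  # hypothetical grouping factor
Z_dense  <- model.matrix(~ 0 + ID)           # dense, as base R builds it
Z_sparse <- sparse.model.matrix(~ 0 + ID)    # sparse, roughly what lme4/glmmTMB work with internally
format(object.size(Z_dense),  units = "MB")  # roughly 700 MB
format(object.size(Z_sparse), units = "MB")  # a few MB at most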
As a side note, my thesis project had about 24,000 observations, with 3 random-effect variables (and some 20-odd fixed effects). Fitting this dataset could take anywhere between 15 minutes and 3 hours, depending on the complexity, and the time was driven primarily by the random-effect structure.
So, the answer from here:
Yes, glmmLasso will be slow. It may take hours, days or even weeks depending on your dataset. I would suggest drawing a stratified (and/or clustered) subsample across independent groups, fitting the model on the smaller dataset (3,000-4,000 observations, maybe?) to obtain starting values, and "hoping" that these are close to the real values, as sketched below. Be patient. If you think neural networks are complex, welcome to the world of generalized mixed-effects models.
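A minimal sketch of that subsampling idea, reusing the variable names from the question (the subsample size and the exact names of glmmLasso's starting-value controls are assumptions; check ?glmmLassoControl):
library(glmmLasso)
set.seed(42)
ids_small  <- sample(unique(data1$ID), 300)     # subset of subjects; the size is a judgment call
data_small <- data1[data1$ID %in% ids_small, ]  # keep whole subjects so the grouping stays intact
fit_small  <- glmmLasso(outcome ~ year + married, rnd = list(ID = ~1),
                        lambda = 100, family = gaussian(link = "identity"),
                        data = data_small, control = list(print.iter = TRUE))
# The estimates from fit_small can then be passed as starting values for the full fit;
# see ?glmmLassoControl (e.g. the start / q_start arguments) for how they are supplied.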

Related

R h2o model sizes on disk

I am using the h2o package to train a GBM for a churn prediction problem.
All I want to know is what influences the size of the fitted model saved on disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.
More specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (i.e. 11 MB, 19 MB and 67 MB). The hyperparameter grid is the same, and the training set sizes are also comparable.
Naturally the resulting optimized hyperparameters are different across the 3 intervals, but I cannot see how this can produce such a difference in the model sizes.
Moreover, when I train the actual models based on those hyperparameter sets, I end up with models of different sizes as well.
Any help is appreciated!
Thank you.
PS: I'm sorry, but I cannot share any dataset to make this reproducible (due to privacy restrictions).
It’s the two things you would expect: the number of trees and their depth.
But it also depends on your data: for a GBM, the trees can be cut short depending on the data.
What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
Note that the 60 MB range does not seem overly large, in general.
If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.
From R, if m is your model, just printing it gives you most of that information. str(m) gives you all the information that is held.
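For instance (m standing for one of the fitted grid models; the MOJO output path below is just a placeholder):
library(h2o)
print(m)                             # number of trees, mean depth, and other summary details
m@model$model_summary                # the same summary as a table
h2o.download_mojo(m, path = "/tmp")  # export the MOJO for the visualization linked above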
I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, and only a few fields can define the trees, whereas the third window of data is more chaotic (in the mathematical sense), and you get some deep trees being made as it tries to split that apart into decision trees.
Looking into that third window more deeply might suggest some data engineering you could do, that would make it easier to learn. Or, it might be a difference in your data. E.g. one column is all NULL in your 2016 and 2017 data, but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.
Finally, maybe the grid hyperparameters are unimportant as regards performance, and this is a difference due to noise. E.g. you have max_depth as a hyperparameter, but its influence on MSE is minor and noise is a large factor. These random differences could let your best model go to depth 5 for two of your data sets (while the 2nd-best model was 0.01% worse but went to depth 20), yet go to depth 30 for your third data set (while the 2nd-best model was 0.01% worse but only went to depth 5).
(If I understood your question correctly, you've eliminated this as a possibility, as you then trained all three data sets on the same hyperparameters? But I thought I'd include it, anyway.)

Best alternative for stepwise regression in R

I know that there are dozens of similar questions and answers, and lots of papers, but please read to the end.
Non-statisticians tend to use stepwise regression, which statisticians strongly argue against. This is something I don't fully understand, but I just go along with them: "OK, this is not a good way to do your modelling."
Here is (was) my model:
b <- lmer(metric1 ~ a + b + c + d + e + f + g + h + i + j + k + l + (1|X/Y) + (1|Z), data = dataset)
drop1(b, test = "Chisq")
(Just a small note: watch out for the random effects in my model; the random effects are Year, Month and Sampling.location; one of my variables is 1/0; I have already log-transformed my variables.)
I am trying to find an exploratory model (using drop1 to reach the final model) and evaluating it with my biological knowledge to see whether the dependent variable ("metric" in this case) really seems to be responding to the predictors. I will repeat this process with 100 metrics just to evaluate which metrics seem to be responding to the environmental variables.
I am searching for an acceptable alternative to stepwise selection, following the suggestions of the statistics gurus.
However, there are lots of alternatives. I have read a lot, but I still feel lost. Some say Lasso, some say elastic net, some say ridge regression... Which one fits my purpose?
Any advice on a better alternative, an easy-to-follow model, a help page for dummies, or (even better) some examples would be much appreciated.
Thanks in advance.

Can I trust a full glmer model that converges ONLY with bobyqa and with contrast sum coding?

I am using R 3.2.0 with lme4 version 1.1.8 to run a mixed-effects logistic regression model on some binomial data (coded as 0 and 1) from a psycholinguistic experiment. There are 2 categorical predictors (one with 2 levels and one with 3 levels) and two random terms (participants and items). I am using sum coding for the predictors (i.e. contr.sum), which gives me the effects and interactions that I am interested in.
I find that the full model (with fixed effects and interactions, plus random intercepts AND slopes for the two random terms) converges ONLY when I specify (optimizer="bobyqa"). If I do not specify the optimizer, the model converges only after simplifying the model drastically. The same thing happens when I use the default treatment coding, even when I specify optimizer="bobyqa".
My first question is: why is this happening, and can I trust the output of the full model?
My second question is whether this might be due to the fact that my data are not fully balanced, in the sense that my conditions do not have exactly the same number of observations. Are there special precautions one must take when the data are not fully balanced? Can anyone suggest any reading on this particular case?
Many thanks
You should take a look at the ?convergence help page of more recent versions of lme4 (or you can read it here). If the two fits using different optimizers give similar estimated parameters (despite one giving convergence warnings and the other not), and the fits with different contrasts give the same log-likelihood, then you probably have a reasonable fit.
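A rough sketch of that check (the formula, data and variable names here are stand-ins for the full model described in the question):
library(lme4)
fit_bobyqa  <- glmer(resp ~ cond1 * cond2 + (1 + cond1 * cond2 | participant) + (1 + cond1 * cond2 | item),
                     data = dat, family = binomial,
                     control = glmerControl(optimizer = "bobyqa"))
fit_default <- update(fit_bobyqa, control = glmerControl())   # default optimizer
logLik(fit_bobyqa); logLik(fit_default)                       # should be essentially equal
all.equal(fixef(fit_bobyqa), fixef(fit_default), tolerance = 1e-4)
# Recent lme4 versions also provide allFit() to refit with every available optimizer:
# summary(allFit(fit_bobyqa))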
In general, lack of balance lowers statistical power and makes fitting more difficult, but mildly to moderately unbalanced data should present no particular problems.

PLM in R with time invariant variable

I am trying to analyze panel data consisting of observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using the plm package in R.
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random", data = data)
My reasoning is that with a time-invariant variable I should be using a random-effects rather than a fixed-effects model.
My question is: Is my model and thinking correct?
Thank you for your help in advance.
You base your decision between the fixed- and random-effects model solely on computational grounds. Please look at the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed- and the random-effects model, but it should not be taken as the definitive answer (any good textbook will have further details).
Pooled OLS could also yield a good model, if its assumptions hold. Computationally, pooled OLS will also give you estimates for time-invariant variables (see the sketch below).
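A short sketch of those checks with plm, reusing the model from the question (note that the within, i.e. fixed-effects, estimator simply drops the time-invariant C):
library(plm)
fixed  <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "within",  data = data)
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random",  data = data)
pooled <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "pooling", data = data)
phtest(fixed, random)  # Hausman test: a small p-value speaks against the random-effects model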

R code: Extracting highly correlated variables and Running multivariate regression model with selected variables

I have a huge dataset with about 2,000 variables and about 10,000 observations.
Initially, I wanted to run a regression model for each variable, with the other 1,999 as independent variables, and then do stepwise model selection.
Therefore, I would have 2,000 models.
However, R unfortunately threw errors because of lack of memory.
So, alternatively, I have tried to remove the independent variables with low correlation values (maybe lower than 0.5).
I would then like to run a regression model with the variables that are highly correlated with each dependent variable.
I tried the following code, but even the melt function doesn't work because of the memory issue.
library(reshape2)  # for melt()
test <- data.frame(X1 = rnorm(50, mean = 50, sd = 10),
                   X2 = rnorm(50, mean = 5, sd = 1.5),
                   X3 = rnorm(50, mean = 200, sd = 25))
test$X1[10] <- 5
test$X2[10] <- 5
test$X3[10] <- 530
corr <- cor(test)
diag(corr) <- NA              # drop self-correlations
corr[upper.tri(corr)] <- NA   # keep each pair only once
melt(corr)                    # long format: Var1, Var2, value
# It doesn't work with my own data because of lack of memory.
Please help me, and thank you so much in advance!
In such a situation it might be worth trying sparsity-inducing techniques such as the Lasso. Here, a sparse subset of variables is selected by constraining the sum of the absolute values of the regression coefficients.
This will give you a reduced subset of variables that are the most relevant (and, due to the nature of the Lasso algorithm, also the most correlated, which is what you were looking for).
In R you can use the LARS package and information about the Lasso can be found here:
http://www-stat.stanford.edu/~tibs/lasso.html
Also a very good resource is: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
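A minimal sketch with the lars package (x and y below are placeholders for your numeric predictor matrix and response vector):
library(lars)
x <- as.matrix(predictors)            # placeholder: the ~2,000 predictor columns
y <- response                         # placeholder: the dependent variable
fit <- lars(x, y, type = "lasso")
cv  <- cv.lars(x, y, type = "lasso")  # cross-validate to choose the penalty
s   <- cv$index[which.min(cv$cv)]
coefs <- coef(fit, s = s, mode = "fraction")
colnames(x)[coefs != 0]               # the variables the Lasso keeps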
