Best alternative for stepwise regression in R - r

I know that there dosens of similar questions/answers, and lots of papers. But please read till the end.
Non-statisticians tend to use stepwise regressions which is strongly argued by statisticians. This is stomething that I don't understand, but I just obey them. "Ok this is not a good way to do your modelling".
Here is (was) my model:
b <- lmer(metric1~a+b+c+d+e+f+g+h+i+j+k+l+(1|X/Y) + (1|Z), data = dataset)
drop1 (b, test="Chisq")
(Just a small note: Watch out for the random effects in my model; random effects are Year, Month, Sampling.location; one of my variables is 1/0: I allready log-transformed my variables)
I am trying to find a exploratory model (with drop1 to reach final model) and evaluating it with my biological knowledge to see if the dependent ("metric" in this case) seems to be responding variables. I will repeat this process with 100 metrics just to evaulate which metrics seems to be responding environmental variables.
I was in the search for an acceptable model instead of stepwise according to the suggestions of statistics gurus.
However, there are lots of alternatives. I read alot, but still feel myself lost. Some say Lasso, some say elastic modelling, some say ridge regression... Which one fits for my purpose?
Any advise for a better alternative and an easy model or a help page for dummies, or examples (that could be better) would be much appreciated.
Thanks in advance.

Related

glmmLasso function won't finish running

I am trying to build a Mixed Model Lasso model using glmmLasso in RStudio. However, I am looking for some assistance.
I have the equation of my model as follows:
glmmModel <- glmmLasso(outcome ~ year + married ,list(ID=~1), lambda = 100, family=gaussian(link="identity"), data=data1,control = list(print.iter=TRUE))
where outcome is a continuous variable, year is the year the data was collected, and married is a binary indicator (1/0) of whether or not the subject is married. I eventually would like to include more covariates in my model, but for the purpose of successfully first getting this to run, right now I am just attempting to run a model with these two covariates. My data1 dataframe is 48000 observations and 57 variables.
When I click run, however, the model runs for many hours (48+) without stopping. The only feedback I am getting is "ITERATION 1," "ITERATION 2," etc... Is there something I am missing or doing wrong? Please note, I am running on a machine with only 8 GB RAM, but I don't think this should be the issue, right? My dataset (48000 observations) isn't particularly large (at least I don't think so). Any advice or thoughts would be appreciated on how I can fix this issue. Thank you!
This is too long to be a comment, but I feel like you deserve an answer to this confusion.
It is not uncommon to experience "slow" performance. In fact in many glmm implementations it is more common than not. The fact is that Generalized Linear Mixed Effect models are very hard to estimate. For purely gaussian models (no penalizer) a series of proofs gives us the REML estimator, which can be estimated very efficiently, but for generalized models this is not the case. As such note that the Random Effect model matrix can become absolutely massive. Remember that for every random effect, you obtain a block-diagonal matrix so even for small sized data, you might have a model matrix with 2000+ columns, that needs to go through optimization through PIRLS (inversions and so on).
Some packages (glmmTMB, lme4 and to some extend nlme) have very efficient implementations that abuse the block-diagonality of the random effect matrix and high-performance C/C++ libraries to perform optimized sparse-matrix calculations, while the glmmLasso (link to source) package uses R-base to perform all of it's computations. No matter how we go about it, the fact that it does not abuse sparse computations and implements it's code in R, causes it to be slow.
As a side-note, my thesis project had about 24000~ observations, with 3 random effect variables (and some odd 20 fixed effects). The fitting process of this dataset could take anywhere between 15 minutes to 3 hours, depending on the complexity, and was primarily decided by the random effect structure.
So the answer from here:
Yes glmmLasso will be slow. It may take hours, days or even weeks depending on your dataset. I would suggest using a stratified (or/and clustered) subsample across independent groups, fit the model using a smaller dataset (3000 - 4000 maybe?), to obtain initial starting points, and "hope" that these are close to the real values. Be patient. If you think neural networks are complex, welcome to the world of generalized mixed effect models.

Random forest to determine most important predictors explaining variation [duplicate]

I am trying to use the random forests package for classification in R.
The Variable Importance Measures listed are:
mean raw importance score of variable x for class 0
mean raw importance score of variable x for class 1
MeanDecreaseAccuracy
MeanDecreaseGini
Now I know what these "mean" as in I know their definitions. What I want to know is how to use them.
What I really want to know is what these values mean in only the context of how accurate they are, what is a good value, what is a bad value, what are the maximums and minimums, etc.
If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini does that mean it is important or unimportant? Also any information on raw scores could be useful too.
I want to know everything there is to know about these numbers that is relevant to the application of them.
An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful then a simpler explanation that didn't involve any discussion of how random forests works.
Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.
An explanation that uses the words 'error', 'summation', or 'permutated'
would be less helpful then a simpler explanation that didn't involve any
discussion of how random forests works.
Like if I wanted someone to explain to me how to use a radio, I wouldn't
expect the explanation to involve how a radio converts radio waves into sound.
How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly parameters and related performance issues with Random Forests are difficult to get your head around even if you understand some technical terms.
Here's my shot at some answers:
-mean raw importance score of variable x for class 0
-mean raw importance score of variable x for class 1
Simplifying from the Random Forest web page, raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.
-MeanDecreaseAccuracy
I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.
-MeanDecreaseGini
Gini is defined as "inequity" when used in describing a society's distribution of income, or a measure of "node impurity" in tree-based classification. A low Gini (i.e. higher descrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.
For your immediate concern: higher values mean the variables are more important. This should be true for all the measures you mention.
Random forests give you pretty complex models, so it can be tricky to interpret the importance measures. If you want to easily understand what your variables are doing, don't use RFs. Use linear models or a (non-ensemble) decision tree instead.
You said:
An explanation that uses the words
'error', 'summation', or 'permutated'
would be less helpful then a simpler
explanation that didn't involve any
discussion of how random forests
works.
It's going to be awfully tough to explain much more than the above unless you dig in and learn what about random forests. I assume you're complaining about either the manual, or the section from Breiman's manual:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
To figure out how important a variable is, they fill it with random junk ("permute" it), then see how much predictive accuracy decreases. MeanDecreaseAccuracy and MeanDecreaseGini work this way. I'm not sure what the raw importance scores are.
Interpretability is kinda tough with Random Forests. While RF is an extremely robust classifier it makes its predictions democratically. By this I mean you build hundreds or thousands of trees by taking a random subset of your variables and a random subset of your data and build a tree. Then make a prediction for all the non-selected data and save the prediction. Its robust because it deals well with the vagaries of your data set, (ie it smooths over randomly high/low values, fortuitous plots/samples, measuring the same thing 4 different ways, etc). However if you have some highly correlated variables, both may seem important as they are not both always included in each model.
One potential approach with random forests may be to help whittle down your predictors then switch to regular CART or try the PARTY package for inference based tree models. However then you must be wary about data mining issues, and making inferences about parameters.

How to train a multiple linear regression model to find the best combination of variables?

I want to run a linear regression model with a large number of variables and I want an R function to iterate on good combinations of these variables and give me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
this type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
it may overfit your data (although using the information criteria listed in the package description will help with this)
There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
summary(lm1 <- lm(Fertility ~ ., data = swiss))
slm1 <- step(lm1)
summary(slm1)
slm1$anova

Can ensemble classifiers underperform the best single classifier?

I have recently run an ensemble classifier in MLR (R) of a multicenter data set. I noticed that the ensemble over three classifiers (that were trained on different data modalities) was worse than the best classifier.
This seemed to be unexpected to me. I was using logistic regressions (without any parameter optimization) as simple classifier and a Partial Least Squares (PLS) Discriminant Analysis as a superlearner, since the base-learner predictions ought to be correlated. I also tested different superlearners like NB, and logistic regression. The results did not change.
Here are my specific questions:
1) Do you know, whether this can in principle occur?
(I also googled a bit and found this blog that seems to indicate that it can:
https://blogs.sas.com/content/sgf/2017/03/10/are-ensemble-classifiers-always-better-than-single-classifiers/)
2) Especially, if you are as surprised as I was, do you know of any checks I could do in mlr to make sure, that there isnt a bug. I have tried to use a different cross-validation scheme (originally I used leave-center-out CV, but since some centers provided very little data, I wasnt sure, whether this might lead to weird model fits of the super learner), but it still holds. I also tried to combine different data modalities and they give me the same phenomenon.
I would be grateful to hear, whether you have experienced this and if not, whether you know what the problem could be.
Thanks in advance!
Yes, this can happen - ensembles do not always guarantee a better result. More details regarding cases where this can happen are discussed also in this cross-validate question

R Random Forests Variable Importance

I am trying to use the random forests package for classification in R.
The Variable Importance Measures listed are:
mean raw importance score of variable x for class 0
mean raw importance score of variable x for class 1
MeanDecreaseAccuracy
MeanDecreaseGini
Now I know what these "mean" as in I know their definitions. What I want to know is how to use them.
What I really want to know is what these values mean in only the context of how accurate they are, what is a good value, what is a bad value, what are the maximums and minimums, etc.
If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini does that mean it is important or unimportant? Also any information on raw scores could be useful too.
I want to know everything there is to know about these numbers that is relevant to the application of them.
An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful then a simpler explanation that didn't involve any discussion of how random forests works.
Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.
An explanation that uses the words 'error', 'summation', or 'permutated'
would be less helpful then a simpler explanation that didn't involve any
discussion of how random forests works.
Like if I wanted someone to explain to me how to use a radio, I wouldn't
expect the explanation to involve how a radio converts radio waves into sound.
How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly parameters and related performance issues with Random Forests are difficult to get your head around even if you understand some technical terms.
Here's my shot at some answers:
-mean raw importance score of variable x for class 0
-mean raw importance score of variable x for class 1
Simplifying from the Random Forest web page, raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.
-MeanDecreaseAccuracy
I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.
-MeanDecreaseGini
Gini is defined as "inequity" when used in describing a society's distribution of income, or a measure of "node impurity" in tree-based classification. A low Gini (i.e. higher descrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.
For your immediate concern: higher values mean the variables are more important. This should be true for all the measures you mention.
Random forests give you pretty complex models, so it can be tricky to interpret the importance measures. If you want to easily understand what your variables are doing, don't use RFs. Use linear models or a (non-ensemble) decision tree instead.
You said:
An explanation that uses the words
'error', 'summation', or 'permutated'
would be less helpful then a simpler
explanation that didn't involve any
discussion of how random forests
works.
It's going to be awfully tough to explain much more than the above unless you dig in and learn what about random forests. I assume you're complaining about either the manual, or the section from Breiman's manual:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
To figure out how important a variable is, they fill it with random junk ("permute" it), then see how much predictive accuracy decreases. MeanDecreaseAccuracy and MeanDecreaseGini work this way. I'm not sure what the raw importance scores are.
Interpretability is kinda tough with Random Forests. While RF is an extremely robust classifier it makes its predictions democratically. By this I mean you build hundreds or thousands of trees by taking a random subset of your variables and a random subset of your data and build a tree. Then make a prediction for all the non-selected data and save the prediction. Its robust because it deals well with the vagaries of your data set, (ie it smooths over randomly high/low values, fortuitous plots/samples, measuring the same thing 4 different ways, etc). However if you have some highly correlated variables, both may seem important as they are not both always included in each model.
One potential approach with random forests may be to help whittle down your predictors then switch to regular CART or try the PARTY package for inference based tree models. However then you must be wary about data mining issues, and making inferences about parameters.

Resources