How to train a multiple linear regression model to find the best combination of variables? - r

I want to run a linear regression model with a large number of variables and I want an R function to iterate on good combinations of these variables and give me the best combination.

The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
this type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
it may overfit your data (although using the information criteria listed in the package description will help with this)

There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
summary(lm1 <- lm(Fertility ~ ., data = swiss))
slm1 <- step(lm1)
summary(slm1)
slm1$anova

Related

Imputation missing data for MLM in R

Maybe anyone can help me with this question. I conducted a follow-up study and obviously now have to face missing data. Now I am considering how to impute the missing data at best using MLM in R (f.e. participants concluded the follow up 2 survey, but not the follow up 1 survey, therefore I am missing L1 predictors for my longitudinal analysis).
I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
Yet, I have troubles understanding it completely. Is there maybe another way to impute missing data in R? Or maybe somebody could illustrate the process of the imputation method a bit more detailed, that would be so great! Do I have to conduct the imputation for every model I built in my MLM? (f.e. when I compared, whether a random intercept versus a random intercept and random slope model fits better for my data, do I have to use the imputation code for every model, or do I use it at the beginning of all my calculations?)
Thank you in advance
Is there maybe another way to impute missing data in R?
There are other packages. mice is the one that I normally use, and it does support multilevel data.
Do I have to conduct the imputation for every model I built in my MLM? (f.e. when I compared, whether a random intercept versus a random intercept and random slope model fits better for my data, do I have to use the imputation code for every model, or do I use it at the beginning of all my calculations?)
You have to specify the imputation model. Basically that means you have to tell the software which variables are predicted by which other variables. Since you are comparing models with the same fixed effect, and only changing the random effects (in particular comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:
perform the imputations;
run the model on all the imputed datasets,
pool the results (typically using Rubin's rules)
So you will need to do this twice, to end up with 2 sets of pooled results - one for each model. The software should provide functionality for doing all of this.
Having said all of that, I would advise against choosing your model based on fit statistics and instead use expert knowledge. If you have strong theoretical reasons for expecting slopes to vary by group, then include random slopes. If not, then don't include them.

How to run a truncated and inflated Poisson model in R?

My data doesn't contain any zeros. The minimum value for my outcome, y, is 1 and that is the value that is inflated. My objective is to run a truncated and inflated Poisson regression model using R.
I already know how to separate way each regression zero truncated and zero inflated. I want to know how to combine the two conditions into one model.
Thanks for you help.
For zero inflated models or zero-hurdle models, the standard approach is to use pscl package. I also wrote a package fitting that kind of models here but it is not yet mature and fully tested. Unless you have voluminous data, I still recommend you to use pscl that is more flexible, robust and documented.
For zero-truncated models, you can have a look at the VGML::vglm function. You might find useful information here.
Note that you are not doing the same distributional assumption so you won't need the same estimation data. Given the description of your dataset, I think you are looking for a zero-truncated model (since you do not observe zeros). With zero-inflated models, you decompose your observed pattern into zeros generated by a selection model and others generated by a count data model. This doesn't look to be a pattern consistent with your dataset.

Comparison of Different Types of Nonlinear Regression Models

Thank you for seeing this post.
Various regression models are being applied to the curve estimating (actual measured ventilation rate).
Comparison was made using the GLM and GAM models including polynomial regression. I use R.
Are there any other types of regression models that can simulate that curve well?
like...bayesian? (In fact, I didn't even understand if it could be applied here)
Sincerely.
loads of non linear methods exist, and many would give similar results, but this is a statistics question. it would fit better on cross validated. also, you have to tell: do you want to do interpolation, even extrapolation? what is your analysis for?
bayesian methods are used when you have knowledge of the phenomenon prior to data, or in some cases when you want to apply regularization or graphical models to data generation processes, I think you should better leave out bayesian statistics here.
edit:
to be short: if you want to obtain a readable formulation of the curve, and you don't have any specific mathematical model in mind, go for a polynomial fit. Other popular choices, which are better for only plotting the curve, instead of reporting it in a mathematical expression, are smoothing splines and LOESS. for further details, maybe ask a new question on stats.stackexchange.com, after studing better the alternatives.

PLM in R with time invariant variable

I am trying to analyze a panel data which includes observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using plm package in R.
random <- plm(Y~log1p(A)+B+C, index=c("state","year"),model="random",data=data)
My reasoning is that with a time invariant variable I should be using random rather than fixed effect model.
My question is: Is my model and thinking correct?
Thank you for your help in advance.
You base your answer about the decision between fixed and random effect soley on computational grounds. Please see the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed and the random effects model, but should not be taken as the definite answer (any good textbook will have further details).
Also pooled OLS could yield a good model, if it applies. Computationally, pooled OLS will also give you estimates for time-invariant variables.

Mixed Logit fitted probabilities in RSGHB

My question has to do with using the RSGHB package for predicting choice probabilities per alternative by applying mixed logit models (variation across respondents) with correlated coefficients.
I understand that the choice probabilities are simulated on an individual level and in order to get preference share an average of the individual shares would do. All the sources I have found treat each prediction as a separate simulation which makes the whole process cumbersome if many predictions are needed.
Since one can save the respondent specific coefficient draws wouldn't it be faster to simply apply the logit transform to each each (vector of) coefficient draw? Once this is done new or existing alternatives could be calculated faster than rerunning a whole simulation process for each required alternative. For the time being using a fitted() approach will not help me understand how prediction actually works.

Resources