I am in my first experience using mixed models in R for my statistical analysis. Due to my data being comprised of binary outcome variables, I have managed to build a logistic model using the glmer function of the lme4 package that I think works as I wanted it to.
I am now aiming to investigate the statistical significance of my model coefficients. I have read that generally, the best approach for generalized mixed models is to bootstrap confidence intervals, but I haven't managed to find a good, clear, explanation of how to do this in R.
Would anyone have any suggestions? Are there any packages in R that expedite this process, or do people generally build their own functions for this? I haven't really done any bootstrapping before so I'd appreciate some more in-depth answers.
If you want to compute parametric bootstrap confidence intervals, the built-in functionality
confint(fitted_model, method = "boot")
should work (see ?confint.merMod)
Also see this answer (which illustrates both parametric and nonparametric bootstrapping for user-defined quantities).
If you have multiple cores, you can speed this up by adding parallel = "multicore", ncpus = parallel::detectCores()-1 (or some other appropriate number of cores to use): see ?lme4::bootMer for details.
Related
The only package I know that does unconditional quantile regression in R is uqr. Unfortunately, it's been removed from CRAN. Even though I can still use it, its functionality is limited (e.g., does not conduct significance tests or allow to compare effects across quantiles). I'm wondering if anyone knows how to conduct UQR in R, with either functions they wrote or some other means.
there are many limitations in terms of test and asymptotic theory regarding unconditional quantile regressions, especially if you are thinking on the version proposed in Firpo, Fortin, and Limeaux (2009) "Unconditional quantile regressions".
The application, however, is straightforward. you need only 2 elements:
the unconditional quantile (estimated with any of your favorite packages).
the density of the outcome at the quantile you got in (1)
After that, you apply the RIF function:
$$RIF(q) = q(t)+\frac{t-1(y<=q(t)}{f(q(t))}$$
Once you have this, you just use that instead of your dep variable, when you write your "lm()" function. And that is it.
HTH
I want to run a linear regression model with a large number of variables and I want an R function to iterate on good combinations of these variables and give me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
this type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
it may overfit your data (although using the information criteria listed in the package description will help with this)
There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
summary(lm1 <- lm(Fertility ~ ., data = swiss))
slm1 <- step(lm1)
summary(slm1)
slm1$anova
I've been using the caret package in R to run some boosted regression tree and random forest models and am hoping to generate prediction intervals for a set of new cases using the inbuilt cross-validation routine.
The trainControl function allows you to save the hold-out predictions at each of the n-folds, but I'm wondering whether unknown cases can also be predicted at each fold using the built-in functions, or whether I need to use a separate loop to build the models n-times.
Any advice much appreciated
Check the R package quantregForest, available at CRAN. It can easily calculate prediction intervals for random forest models. There's a nice paper by the author of the package, explaining the backgrounds of the method. (Sorry, I can't say anything about prediction intervals for BRT models; I'm looking for them by myself...)
I have a problem at hand which I'd think is fairly common amongst
groups were R is being adopted for Analytics in place of SAS.
Users would like to obtain results for logistic regression in R that
they have become accustomed to in SAS.
Towards this end, I was able to propose the Design package in R which
contains many functions to extract the various metrics that SAS
reports.
If you have suggestions pertaining to other packages, or sample code
that replicates some of the SAS outputs for logistic regression, I
would be glad to hear of them.
Some of the requirements are:
Stepwise variable selection for logistic regression
Choose base level for factor variables
The Hosmer-Lemeshow statistic
concordant and discordant
Tau C statistic
Thank you for your suggestions.
Just because SAS does it, doesn't necessarily mean it's good statistical practice. Step-wise regression is particularly problematic.
What I have found so far is that the Design and rms package to be the best (and only) package for these outputs.
What functions do you use in R to fit a curve to your data and test how well that curve fits? What results are considered good?
Just the first part of that question can fill entire books. Just some quick choices:
lm() for standard linear models
glm() for generalised linear models (eg for logistic regression)
rlm() from package MASS for robust linear models
lmrob() from package robustbase for robust linear models
loess() for non-linear / non-parametric models
Then there are domain-specific models as e.g. time series, micro-econometrics, mixed-effects and much more. Several of the Task Views as e.g. Econometrics discuss this in more detail. As for goodness of fit, that is also something one can spend easily an entire book discussing.
The workhorses of canonical curve fitting in R are lm(), glm() and nls(). To me, goodness-of-fit is a subproblem in the larger problem of model selection. Infact, using goodness-of-fit incorrectly (e.g., via stepwise regression) can give rise to seriously misspecified model (see Harrell's book on "Regression Modeling Strategies"). Rather than discussing the issue from scratch, I recommend Harrell's book for lm and glm. Venables and Ripley's bible is terse, but still worth a reading. "Extending the Linear Model with R" by Faraway is comprehensive and readable. nls is not covered in these sources, but "Nonlinear Regression with R" by Ritz & Streibig fills the gap and is very hands-on.
The nls() function (http://sekhon.berkeley.edu/stats/html/nls.html) is pretty standard for nonlinear least-squares curve fitting. Chi squared (the sum of the squared residuals) is the metric that is optimized in that case, but it is not normalized so you can't readily use it to determine how good the fit is. The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that.
The Quick R site has a reasonable good summary of basic functions used for fitting models and testing the fits, along with sample R code:
http://www.statmethods.net/stats/regression.html
The main thing you should ensure is
that your residuals are normally
distributed. Unfortunately I'm not
sure of an automated way to do that.
qqnorm() could probably be modified to find the correlation between the sample quantiles and the theoretical quantiles. Essentially, this would just be a numerical interpretation of the normal quantile plot. Perhaps providing several values of the correlation coefficient for different ranges of quantiles could be useful. For example, if the correlation coefficient is close to 1 for the middle 97% of the data and much lower at the tails, this tells us the distribution of residuals is approximately normal, with some funniness going on in the tails.
Best to keep simple, and see if linear methods work "well enuff". You can judge your goodness of fit GENERALLY by looking at the R squared AND F statistic, together, never separate. Adding variables to your model that have no bearing on your dependant variable can increase R2, so you must also consider F statistic.
You should also compare your model to other nested, or more simpler, models. Do this using log liklihood ratio test, so long as dependant variables are the same.
Jarque–Bera test is good for testing the normality of the residual distribution.