95% CI for the ICC in linear mixed effects model (multilevel model, hierarchical model) - r

I fitted a linear mixed effects model to predict the math score as the outcome, with gender (a nominal or ordinal participant factor) as the fixed effect and schl as the random effect. Then I compared it with the simple linear regression model using compare_performance. The output gives the ICC, but I was not sure how to calculate the 95% CI for it. (For the coefficients I used confint and it did the job.)
lm1<- lm(math~ gender, data= df)
lme1<- lmer(math~gender+(1|schl), data=df)
compare_performance(lm1,lme1)
The ICC was 0.15.

From this gist from Peter Dahlgren, taken in turn from a CrossValidated answer by #Ashe, here is the crux:
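library(lme4)
# ICC = between-group variance / (between-group variance + residual variance).
# The gist names its grouping factor `id`; for the model in the question it is `schl`.
calc.icc <- function(y) {
  sumy <- summary(y)
  (sumy$varcor$schl[1]) / (sumy$varcor$schl[1] + sumy$sigma^2)
}
boot.icc <- bootMer(lme1, calc.icc, nsim = 1000)
# Take the 2.5% and 97.5% quantiles of the bootstrap distribution as the 95% confidence limits
quantile(boot.icc$t, c(0.025, 0.975))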
You can (and should) check that this calc.icc() function gives the same result as your compare_performance() call. Since this uses parametric bootstrapping (PB), you can substitute any ICC function you like, as long as it takes a fitted model as input and returns the ICC as a single numeric value. (Also, because it uses PB, it will be slow; there are potentially faster approximate methods, but PB is reliable and easy to program.)
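To make concrete what the parametric bootstrap is doing, here is a minimal base-R sketch of the same idea. It is not what bootMer() runs internally. It assumes a balanced design, drops the gender fixed effect, and uses one-way ANOVA moment estimators in place of lmer(); the data, est_vc(), icc_of() and sim_y() are all hypothetical and made up for the illustration.

```r
set.seed(42)
k <- 30   # hypothetical number of schools
n <- 20   # hypothetical number of pupils per school (balanced)
schl <- factor(rep(seq_len(k), each = n))

# Moment (one-way ANOVA) estimates of the variance components, balanced case
est_vc <- function(y, g) {
  n_per <- length(y) / nlevels(g)
  ms <- anova(lm(y ~ g))[["Mean Sq"]]        # c(between groups, within groups)
  list(mu = mean(y),
       var_b = max((ms[1] - ms[2]) / n_per, 0),
       var_w = ms[2])
}

# ICC = between-group variance / total variance
icc_of <- function(vc) vc$var_b / (vc$var_b + vc$var_w)

# Simulate one response vector from a random-intercept model
sim_y <- function(mu, var_b, var_w, g) {
  u <- rnorm(nlevels(g), sd = sqrt(var_b))   # one random intercept per group
  mu + u[as.integer(g)] + rnorm(length(g), sd = sqrt(var_w))
}

# Made-up "observed" data with a true ICC of 15 / (15 + 85) = 0.15
math <- sim_y(mu = 50, var_b = 15, var_w = 85, g = schl)
fit <- est_vc(math, schl)
icc_hat <- icc_of(fit)

# Parametric bootstrap: simulate from the fitted model, re-estimate the ICC each time
boot_icc <- replicate(500, {
  y_star <- sim_y(fit$mu, fit$var_b, fit$var_w, schl)
  icc_of(est_vc(y_star, schl))
})
ci <- quantile(boot_icc, c(0.025, 0.975))
print(icc_hat)
print(ci)
```

With a real lmer() fit, bootMer() handles the simulate-and-refit loop for you, including the fixed effects and unbalanced groups that this sketch ignores.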

Related

Optimizing a GAM for Smoothness

I am currently trying to fit a generalized additive model (GAM) in R using a response variable and three predictor variables. One of the predictors is linear, and the dataset consists of 298 observations.
I have run the following code to generate a basic GAM:
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
This produces a model with 18 degrees of freedom and seems to substantially overfit the data. I'm wondering how I might generate a GAM that maximizes smoothness and predictive error. I realize that each of these features is going to come at the expense of the other, but is there a good way to find the optimal model that doesn't overfit?
Additionally, I need to perform leave-one-out cross-validation (LOOCV), and I am not sure how to make sure that gam() in the mgcv package does this. Any help on either of these problems would be greatly appreciated. Thank you.
I've run this to generate a GAM, but it overfits the data.
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
I have also generated 1,000,000 GAMs with varying combinations of smoothing parameters and ranged the maximum degrees of freedom allowed from 10 (as shown in the code below) to 19. The variable "combinations2" is a list of all 1,000,000 combinations of smoothers I selected. This code is designed to try and balance degrees of freedom and AIC score. It does function, but I'm not sure that I'm actually going to be able to find the optimal model from this. I also cannot tell how to make sure that it uses LOOCV.
BestGAM <- gam(response ~ linearpredictor + predictor2 + predictor3, data = data[2:5])
for (i in 1:100000) {
  PotentialGAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3), data = data[2:5],
                      sp = c(combinations2[i, ]$Var1, combinations2[i, ]$Var2))
  if (AIC(PotentialGAM, BestGAM)$df[1] <= 10 & AIC(PotentialGAM, BestGAM)$AIC[1] < AIC(PotentialGAM, BestGAM)$AIC[2]) {
    BestGAM <<- PotentialGAM
    listNumber <- i
  }
}
You are fitting your GAM using generalised cross validation (GCV) smoothness selection. GCV is a way to get around the lack of invariance of ordinary cross validation (OCV; what you also call LOOCV) when estimating GAMs. GCV is the same as OCV on a rotated version of the fitting problem (rotating y - Xβ by Q, any orthogonal matrix). When fitting with GCV, {mgcv} doesn't actually need to do the rotation, and the expected GCV score isn't affected by it, so GCV is in effect just OCV (Wood 2017, p. 260).
It has been shown that GCV can undersmooth (resulting in more wiggly models) as the objective function (GCV profile) can become flat around the optimum. Instead it is preferred to estimate GAMs (with penalized smooths) using REML or ML smoothness selection; add method = "REML" (or "ML") to your gam() call.
If the REML or ML fit is as wiggly as the GCV one with your data, then I'd be likely to presume gam() is not overfitting, but that there is something about your response data that hasn't been explained here (are the data ordered in time, for example?)
As to your question
how I might generate a GAM that maximizes smoothness and [minimize?] predictive error,
you are already doing that using GCV smoothness selection and for a particular definition of "smoothness" (in this case it is squared second derivatives of the estimated smooths, integrated over the range of the covariates, and summed over smooths).
If you want GCV but smoother models, you can increase the gamma argument above 1; gamma = 1.4 is often used, for example, which means that each EDF costs 40% more in the GCV criterion.
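As a rough illustration of the options mentioned above, the following sketch fits the same model three ways and compares the total effective degrees of freedom (EDF). It assumes simulated data from mgcv's gamSim() rather than the asker's data, and the object names are made up, so the numbers say nothing about the original problem.

```r
library(mgcv)

set.seed(1)
# Simulated stand-in data (Gu & Wahba four-term example), not the asker's data
dat <- gamSim(1, n = 300, verbose = FALSE)

form <- y ~ s(x0) + s(x1) + s(x2) + s(x3)

m_gcv   <- gam(form, data = dat)                   # default GCV smoothness selection
m_reml  <- gam(form, data = dat, method = "REML")  # REML smoothness selection
m_gamma <- gam(form, data = dat, gamma = 1.4)      # GCV with each EDF costing 40% more

# Total effective degrees of freedom of each fit; lower means smoother overall
edf_total <- c(GCV = sum(m_gcv$edf),
               REML = sum(m_reml$edf),
               GCV_gamma_1.4 = sum(m_gamma$edf))
print(round(edf_total, 2))
```

If the three totals are close to each other on your own data, that supports the reading that the wiggliness reflects the data and not the smoothness selection method.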
FWIW, you can get the LOOCV (OCV) score for your model without actually fitting 298 GAMs through the use of the influence matrix A. Here's a reproducible example using my {gratia} package:
library("gratia")
library("mgcv")
df <- data_sim("eg1", seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")
A <- influence(m)
r <- residuals(m, type = "response")
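# Each leave-one-out residual is r_i / (1 - A_ii), so the squared version divides by (1 - A_ii)^2
ocv_score <- mean(r^2 / (1 - A)^2)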
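The influence-matrix shortcut can be checked against brute-force refitting. The sketch below does this in base R, with an ordinary linear model standing in for the GAM (for a GAM the identity holds with the smoothing parameters held fixed). The data and object names are hypothetical.

```r
set.seed(1)
n <- 60
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
d <- data.frame(x = x, y = y)

fit <- lm(y ~ poly(x, 3, raw = TRUE), data = d)

# Shortcut: leave-one-out residual_i = residual_i / (1 - A_ii)
A <- hatvalues(fit)   # diagonal of the influence (hat) matrix
r <- residuals(fit)
ocv_shortcut <- mean((r / (1 - A))^2)

# Brute force: drop each observation in turn, refit, predict the dropped point
loo_err <- vapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ poly(x, 3, raw = TRUE), data = d[-i, ])
  as.numeric(d$y[i] - predict(fit_i, newdata = d[i, , drop = FALSE]))
}, numeric(1))
ocv_brute <- mean(loo_err^2)

# The two scores agree up to floating-point error
print(all.equal(ocv_shortcut, ocv_brute))
```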

How do I specify the dispersion parameter when computing the confidence interval for a GLM?

I have a model of exponential decay in the form Y = exp{a + bX + cW}. In R, I represent this as a generalized linear model (GLM) using a gamma random component with log link function.
fitted <- glm(Y ~ X + W, family=Gamma(link='log'))
I know from this post that for the standard errors to really represent an exponential rather than gamma random component, I need to specify the dispersion parameter as being 1 when I call summary.
summary(fitted, dispersion=1)
summary(fitted) # not the same!
Now, I want to find the 95% confidence intervals for my estimates of a, b, c. However, there seems to be no way to specify the dispersion parameter for the confint, even though I know it should affect the confidence interval (because it affects the standard error).
confint(fitted)
confint(fitted, dispersion=1) # same as the last confint :(
So, in order to get the confidence intervals corresponding to an exponential rather than gamma random component, how do I specify the dispersion parameter when computing the confidence interval for a GLM?
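To see how the dispersion enters the interval, here is a sketch that builds Wald intervals by hand from the standard errors reported by summary(fitted, dispersion = 1). This is one possible workaround, not a documented option of confint(). The simulated data are hypothetical, and Wald intervals will generally differ somewhat from the profile-likelihood intervals that confint() reports for a glm.

```r
set.seed(1)
n <- 200
X <- runif(n)
W <- runif(n)
# Hypothetical exponential response with mean exp(a + bX + cW)
Y <- rexp(n, rate = 1 / exp(1 - 2 * X + 0.5 * W))

fitted <- glm(Y ~ X + W, family = Gamma(link = "log"))

# Coefficient table with the dispersion fixed at 1 (exponential case)
coefs <- summary(fitted, dispersion = 1)$coefficients
est <- coefs[, "Estimate"]
se  <- coefs[, "Std. Error"]

# Wald 95% intervals: estimate +/- z * SE
z <- qnorm(0.975)
wald_ci <- cbind(lower = est - z * se, upper = est + z * se)
print(wald_ci)
```

Replacing dispersion = 1 with the default (estimated) dispersion changes se, and hence the interval widths, by a factor of the square root of the estimated dispersion.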

Test of second differences for average marginal effects in logistic regression

I have a question similar to the one here: Testing the difference between marginal effects calculated across factors. I used the same code to generate average marginal effects for two groups. The difference is that I am running a logistic rather than linear regression model. My average marginal effects are on the probability scale, so emmeans will not provide the correct contrast. Does anyone have any suggestions for how to test whether there is a significant difference in the average marginal effects between group 1 and group 2?
Thank you so much,
Ilana
It is a bit unclear what the issue really is, but I'll try. I'm supposing your logistic regression model was fitted using, say, glm:
mod <- glm(cbind(heads, tails) ~ treat, data = mydata, family = binomial())
If you then do
emm <- emmeans(mod, "treat")
emm ### marginal means
pairs(emm) ### differences
Your results will be presented on the logit scale.
If you want them on the probability scale, you can do
summary(emm, type = "response")
summary(pairs(emm), type = "response")
However, the latter will back-transform the differences of logits, thereby producing odds ratios.
If you actually want differences of probabilities rather than ratios of odds, use regrid(), which will construct a new grid of values after back-transforming (and hence it will forget the logit transformation):
pairs(regrid(emm))
It seems possible that two or more factors are present and you want contrasts of contrasts on the probability scale. In that case, extend this idea by calling regrid() on the table of EMMs to put everything on the probability scale, then follow the analogous procedure used in the linked article.
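The distinction between back-transforming a difference of logits and taking a difference of probabilities can be seen with plain arithmetic. This base-R sketch does not use emmeans; the two-group data set is hypothetical and only meant to show which quantity each route produces.

```r
# Hypothetical counts of successes (heads) and failures (tails) in two groups
mydata <- data.frame(treat = factor(c("A", "B")),
                     heads = c(30, 45),
                     tails = c(70, 55))
mod <- glm(cbind(heads, tails) ~ treat, data = mydata, family = binomial())

# Predictions on the logit (link) scale for each group
nd <- data.frame(treat = factor(c("A", "B"), levels = levels(mydata$treat)))
logits <- unname(predict(mod, newdata = nd, type = "link"))

# Route 1: back-transform the difference of logits -> an odds ratio
odds_ratio <- exp(logits[1] - logits[2])

# Route 2: back-transform each logit first, then difference -> difference of probabilities
probs <- plogis(logits)
prob_diff <- probs[1] - probs[2]

print(c(odds_ratio = odds_ratio, prob_diff = prob_diff))
```

The first route corresponds to what summary(pairs(emm), type = "response") reports, and the second to the estimate from pairs(regrid(emm)); the sketch covers only the point estimates, not their standard errors or tests.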

Plotting standard errors for effects

I have a lme4 model I have run for a hierarchical logistic regression, and I'm plotting the effects using the effects package. I would like to create an effects graph with the standard error of the mean as the error bars. I can get the point estimates, 95% confidence intervals, and standard errors into a dataframe. The standard errors, however, seem at odds with the confidence limit parameters, see below for an example in a regular glm.
library(effects)
library(dplyr)
mtcars <- mtcars %>%
mutate(vs = factor(vs))
glm1 <- glm(am ~ vs, mtcars, family = "binomial")
(glm1_eff <- Effect("vs", glm1) %>%
as.data.frame())
vs fit se lower upper
1 0 0.3333333 0.4999999 0.1580074 0.5712210
2 1 0.5000000 0.5345225 0.2596776 0.7403224
My understanding is that the fit column displays the point estimate for the probability that am equals 1 and that lower and upper correspond to the 95% confidence limits for that probability. Note that the standard error does not seem to correspond to the confidence interval (e.g., .33+.49 > .57).
Here's what I am shooting for. As opposed to a 95% confidence interval, I would like to have an effects plot with +- the standard error of the mean.
Are the standard errors in log-odds instead of probability? Is there a simple way to convert them to probabilities and plot them so that I can make the graph?
John Fox shared this helpful response:
From ?Effect: "se: (for "eff" objects) a vector of standard errors for the effect, on the scale of the linear predictor." So the standard errors are on the log-odds scale. You could use the delta method to get standard errors on the probability scale, but that would be very ill-advised, since the approach to asymptotic normality of estimated probabilities will be much slower than that of log-odds. Effect() computes confidence limits on the scale of the linear predictor (log-odds for a logit model) and then inverse-transforms them to the scale of the response (probabilities).
All of the information you need to create a custom plot is in the "eff" object returned by Effect(); the contents of the object are documented in ?Effect.
I agree, by the way, that the as.data.frame.eff() method could be improved, and I'll do that when I have a chance. In particular, it invites misunderstanding to report the effects and confidence limits on the scale of the response but to show standard errors for the linear-predictor scale.
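For reference, the delta-method conversion mentioned in the response is short enough to write out. For a logit link the derivative of the inverse link is p(1 - p), so the approximate standard error on the probability scale is p(1 - p) times the standard error on the log-odds scale. The sketch below plugs in the values from the table in the question. It is shown only to make the calculation explicit, given the caution above about using it.

```r
# Fitted probabilities and link-scale standard errors from the table in the question
p       <- c(1 / 3, 1 / 2)
se_link <- c(0.4999999, 0.5345225)

# Delta method for the logit link: d/d(eta) plogis(eta) = p * (1 - p)
se_prob <- p * (1 - p) * se_link
print(round(se_prob, 4))

# Symmetric intervals built from se_prob (discouraged), for comparison with
# the back-transformed limits that Effect() reports
delta_ci <- cbind(lower = p - qnorm(0.975) * se_prob,
                  upper = p + qnorm(0.975) * se_prob)
print(round(delta_ci, 4))
```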
I'm answering the mystery first, then addressing the "show SE on the plot" question.
Explanation of the SE mystery: All math in a GLM needs to be done on the link scale because this is the additive scale (where stuff can be added up). So...
The values in the column "fit" are the predicted probabilities of success (or the "predictions on the response scale"). Their values are expit(b0) and expit(b0 + b1), where expit() is the inverse logit function. The SEs are on the link scale. An SE on the response scale doesn't make much sense because the response scale is non-linear (although it's kinda weird to have stats on the response and link scale in the same table). "lower" and "upper" are on the response scale, so these are the CIs of the predicted probabilities of success. They are computed as expit(b0 ± 1.96 SE) and expit(b0 + b1 ± 1.96 SE), using the SE from the corresponding row. To recover these values from what is given:
library(boot) # inv.logit and logit functions
expit.pred_0 <- 1/3 # fit 0
expit.pred_1 <- 1/2 # fit 1
se1 <- 1/2
se2 <- .5345225
inv.logit(logit(expit.pred_0) - qnorm(.975)*se1)
inv.logit(logit(expit.pred_0) + qnorm(.975)*se1)
inv.logit(logit(expit.pred_1) - qnorm(.975)*se2)
inv.logit(logit(expit.pred_1) + qnorm(.975)*se2)
> inv.logit(logit(expit.pred_0) - qnorm(.975)*se1)
[1] 0.1580074
> inv.logit(logit(expit.pred_0) + qnorm(.975)*se1)
[1] 0.5712211
> inv.logit(logit(expit.pred_1) - qnorm(.975)*se2)
[1] 0.2596776
> inv.logit(logit(expit.pred_1) + qnorm(.975)*se2)
[1] 0.7403224
Showing an SE computed from a glm on the response (non-additive) scale doesn't make sense because the SE is only additive on the link scale. In other words, multiplying the SE by some quantile on the response scale (the scale of the plot you envision, with probability on the y axis) is meaningless. The CI limits are computed on the link scale and then back-transformed, and so make sense for plotting.
I frequently see researchers plotting SE bars computed from a linear model, like you envision, even though the statistics presented are from a GLM. These SEs are meaningful in a sense, I guess, but they often imply absurd consequences (like probabilities that could be less than zero or greater than one), so... don't do that either.
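If bars narrower than a 95% interval are wanted, the same back-transformation works with a multiplier of 1 in place of 1.96. The base-R sketch below assumes that predict() with se.fit = TRUE is an acceptable stand-in for Effect() on this simple model; the column names lower_1se and upper_1se are made up. The resulting bars are asymmetric and are roughly 68% intervals, not a standard error on the probability scale.

```r
d <- transform(mtcars, vs = factor(vs))
glm1 <- glm(am ~ vs, data = d, family = binomial)

# Predictions and standard errors on the link (log-odds) scale
nd <- data.frame(vs = factor(c("0", "1"), levels = levels(d$vs)))
pr <- predict(glm1, newdata = nd, type = "link", se.fit = TRUE)

# Add and subtract one SE on the link scale, then back-transform with plogis()
bars <- data.frame(
  vs        = nd$vs,
  fit       = plogis(pr$fit),
  lower_1se = plogis(pr$fit - pr$se.fit),
  upper_1se = plogis(pr$fit + pr$se.fit)
)
print(bars)
```

The bars data frame can then be passed to any plotting function that draws error bars from lower and upper columns.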

corr.bias parameter in Random forest regression model in R

I'm using random forest regression in R and I found the parameter corr.bias, which according to the manual is "experimental". My data is nonlinear, and I wonder whether setting this parameter to TRUE can improve the results. I also don't know exactly how it works for nonlinear data, so I would really appreciate it if someone could explain how this bias correction works in the randomForest package and whether it can improve my regression model.
The short answer is that it performs a simple correction based on a linear regression on the actual and fitted values.
From regrf.c:
/* Do simple linear regression of y on yhat for bias correction. */
if (*biasCorr) simpleLinReg(nsample, yptr, y, coef, &errb, nout);
and the first few lines of that function are simply:
void simpleLinReg(int nsample, double *x, double *y, double *coef,
double *mse, int *hasPred) {
/* Compute simple linear regression of y on x, returning the coefficients,
the average squared residual, and the predicted values (overwriting y). */
So when you fit a regression random forest with corr.bias = TRUE, the model object returned will contain a coefs element, which will simply be the two coefficients from the linear regression.
Then when you call predict.randomForest this happens:
## Apply bias correction if needed.
yhat <- rep(NA, length(rn))
names(yhat) <- rn
if (!is.null(object$coefs)) {
yhat[keep] <- object$coefs[1] + object$coefs[2] * ans$ypred
}
The non-linear nature of your data isn't necessarily relevant, but the bias correction may be very poor if the relationship between the fitted and actual values is very far from linear.
You can always fit the model and then plot the fitted vs actual values yourself and see whether a correction based on a linear regression would help or not.
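The correction described above amounts to a two-line calculation, which the following base-R sketch reproduces outside the randomForest package. The "predictions" here are a hypothetical stand-in (true values shrunk toward the mean plus noise), not output from an actual forest, so the sketch only illustrates the mechanics of the linear correction.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n)

# Hypothetical stand-in for model predictions: shrunk toward the mean, plus noise
yhat <- mean(y) + 0.6 * (y - mean(y)) + rnorm(n, sd = 0.5)

# Simple linear regression of the actual values on the fitted values
coefs <- coef(lm(y ~ yhat))

# Apply the correction the same way as in the predict code quoted above
yhat_corrected <- coefs[1] + coefs[2] * yhat

mse_before <- mean((y - yhat)^2)
mse_after  <- mean((y - yhat_corrected)^2)
print(c(before = mse_before, after = mse_after))

# Fitted vs actual: if this scatter is far from a straight line,
# a linear correction is unlikely to help much
plot(yhat, y, xlab = "fitted", ylab = "actual")
abline(coefs, col = "red")
```

On the data used to estimate the two coefficients, the corrected error can never be larger than the uncorrected one, so any real benefit has to be judged on held-out data.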

Resources