standard error & variance of model predictions using merTools::predictInterval - r

I would like to estimate the standard error and variance of predictions from a linear mixed model. I'm using merTools::predictInterval to estimate prediction intervals because I want to include some of the uncertainty in the random effects (in addition to the uncertainty in the fixed effects). Is it acceptable to use the simulations from merTools::predictInterval to estimate the SE and variance of predictions? If so, how should I calculate them? I can think of two ways:
To get variance corresponding to the prediction interval (i.e. including residual variance), I would first get the simulated predictions:
predictions <- merTools::predictInterval(...,
                                         include.resid.var = TRUE,
                                         returnSims = TRUE)
1. Then I could estimate the variance using the normal approximation (divide half the interval width by 1.96 to get the SD, then square):
var1 <- ((predictions$upr - predictions$lwr)/2/1.96)^2
2. Or I could just take the variance of the simulated values:
var2 <- apply(X = attr(x = predictions, which = 'sim.results'), MARGIN = 1, FUN = var)
The SE would then be the square root of the variance. To get the SE and/or variance relating to the confidence interval, I could repeat this with include.resid.var = FALSE in the merTools::predictInterval call.
Is either of these methods acceptable? Is one preferable to the other?
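For concreteness, here is a runnable sketch of both calculations on the built-in sleepstudy data (the model and data here are illustrative stand-ins, not from the original question):

library(lme4)
library(merTools)

# Illustrative model: random intercepts and slopes for Subject
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

predictions <- predictInterval(fit,
                               newdata = sleepstudy,
                               n.sims = 1000,
                               include.resid.var = TRUE,
                               returnSims = TRUE)

# Method 1: normal approximation from the interval half-width
var1 <- ((predictions$upr - predictions$lwr) / 2 / 1.96)^2

# Method 2: empirical variance of the simulations (one row per observation)
var2 <- apply(attr(predictions, "sim.results"), MARGIN = 1, FUN = var)

# The two should agree closely when the simulated predictions are roughly normal
plot(var1, var2); abline(0, 1)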

Related

Confidence Interval of the predicted mean of a LMER object for large dataset

I would like to get the confidence interval (CI) for the predicted mean of a Linear Mixed Effect Model on a large dataset (~40k rows), which is itself a subset of an even larger dataset. This CI is then used for estimating the uncertainty of another calculation that uses the mean and its related CI as input data.
I managed to create a prediction estimate and interval for the full dataset, but a prediction interval is not the same as a CI and is much wider. Besides bootstrapping (which takes far too much time with this much data), I cannot find a method that would allow me to estimate a CI, either because the methods throw errors or because they only offer prediction intervals.
I moved into LME only recently, so I may well have overlooked some obvious method.
Here is what I did so far in more detail:
The input data is confidential and I can therefore unfortunately not share any extract.
But in general, we have one dependent variable (y) representing the probability of an event, two categorical variables (c1 and c2), and two continuous variables (x1 and x2), with a weighting factor (w1). Some values in the dataset are missing. An extract of the first rows of the data could look like the example below:
c1      c2     x1  x2  w1   y
London  small  1   10  NA   NA
London  small  1   20  NA   NA
London  large  2   10  0.2  0.1
Paris   small  1   10  0.2  0.23
Paris   large  2   10  0.3  0.3
Based on this input data, I am then fitting a LMER model in the following form:
lmer1 <- lme4::lmer(y ~ x1 * poly(x2, 5) + ((x1 * poly(x2, 5)) | c1),
                    data = df,
                    weights = w1,
                    control = lme4::lmerControl(check.conv.singular = lme4::.makeCC(action = "ignore", tol = 1e-3)))
This runs for some minutes and returns several warnings:
Warning messages:
1: In optwrap(optimizer, devfun, getStart(start, rho$pp), lower = rho$lower, :
convergence code 5 from nloptwrap: NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached.
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
unable to evaluate scaled gradient
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge: degenerate Hessian with 11 negative eigenvalues
I increased the maxeval parameter, but this did not get rid of the warnings. I found that despite these warnings the model is still fitted, so I went on to try different methods for getting a prediction of the mean for the whole dataset and the related CI for the mean.
predictInterval
I started with creating a Prediction Interval for the full dataset:
predictions <- merTools::predictInterval(lmer1,
                                         newdata = df,
                                         which = "full",
                                         n.sims = 1000,
                                         include.resid.var = FALSE,
                                         level = 0.95,
                                         stat = "mean")
However, as stated above, the Prediction Interval is not the same as the CI (see also https://datascienceplus.com/prediction-interval-the-wider-sister-of-confidence-interval/).
I found that the base predict function has an interval argument that can be set to either "prediction" or "confidence", but this option does not exist in the predict method for merMod objects. I could not find another way to switch from a prediction interval to a CI, even though I would expect the simulation draws to be sufficient for this.
confint
I then saw that there is a function called confint, but running it produces the following error:
prediction_ci = lme4::confint.merMod(lmer1)
Computing profile confidence intervals ...
Error in zeta(shiftpar, start = opt[seqpar1][-w]) : profiling
detected new, lower deviance
In addition: Warning messages:
1: In commonArgs(par, fn, control, environment()) : maxfun < 10 *
length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, x@theta, lower = x@lower, calc.derivs
= TRUE, : convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
I found this thread (Error when estimating CI for GLMM using confint()), which says that I need to reduce the devtol parameter by computing a different profile. But doing so results in the same error:
lmer1_devtol = profile(lmer1, devtol = 1e-7)
Error in zeta(shiftpar, start = opt[seqpar1][-w]) : profiling
detected new, lower deviance
In addition: Warning messages:
1: In commonArgs(par, fn, control, environment()) : maxfun < 10 *
length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, x@theta, lower = x@lower, calc.derivs
= TRUE, : convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
add_ci
I found the ciTools function add_ci(), but this again resulted in another error:
predictions_ci = ciTools::add_ci(df, lmer1,
                                 alpha = 0.05)
Error in levelfun(r, n, allow.new.levels = allow.new.levels) : new
levels detected in newdata
I then set the allow.new.levels parameter to TRUE, as in the description of the predict function, but this parameter does not seem to be passed through:
predictions_ci = ciTools::add_ci(df, lmer1,
                                 alpha = 0.05,
                                 allow.new.levels = TRUE)
Error in levelfun(r, n, allow.new.levels = allow.new.levels) : new
levels detected in newdata
Diag
I found a method to calculate CIs for the sleepstudy data, which uses the model design matrix and diag:
Designmat <- model.matrix(as.formula("y ~ x1 * poly(x2, 5)")[-2], df)
predvar <- diag(Designmat %*% vcov(lmer1) %*% t(Designmat))
# With new data
newdat <- df
newdat$pred <- predict(lmer1, newdat, allow.new.levels = TRUE)
Designmat <- model.matrix(formula(lmer1)[-2], newdat)
But the diag method does not work for such a large dataset, since the matrix product forms an n × n matrix that does not fit in memory for ~40k rows.
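As an aside, the diagonal can be computed without ever forming the n × n matrix, using the identity diag(X V X') = rowSums((X V) * X). A minimal sketch (my addition, assuming lmer1 and newdat from above):
Designmat <- model.matrix(formula(lmer1)[-2], newdat)
V <- as.matrix(vcov(lmer1))  # covariance of the fixed-effect estimates
# rowSums((X %*% V) * X) equals diag(X %*% V %*% t(X)) but needs only O(n * p) memory
predvar <- rowSums((Designmat %*% V) * Designmat)
pred_se <- sqrt(predvar)  # SE of the fixed-effects prediction (CI-type, not PI)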
bootMer
As said earlier, bootstrapping the confidence interval with bootMer takes too much time for this subset of data (I started it a day ago and it is still running). I tried parallel processing with the sleepstudy sample data, but this did not increase the speed dramatically, so I assume it will behave the same on my large dataset.
merBoot <- bootMer(lmer1, predict, nsim = 1000, re.form = NA)
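For reference, a parallelized call (a sketch, not from the original attempt; "multicore" is unavailable on Windows, where "snow" must be used) looks like:
merBoot <- lme4::bootMer(lmer1, predict, nsim = 1000, re.form = NA,
                         parallel = "multicore",
                         ncpus = parallel::detectCores() - 1)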
Others
I have read through all these posts (and more), but none of them helped me get the CI in reasonable time for my case. Maybe I have overlooked something.
https://stats.stackexchange.com/questions/344012/confidence-intervals-from-bootmer-in-r-and-pros-cons-of-different-interval-type
https://stats.stackexchange.com/questions/117641/how-trustworthy-are-the-confidence-intervals-for-lmer-objects-through-effects-pa
How to get coefficients and their confidence intervals in mixed effects models?
Error when estimating CI for GLMM using confint()
https://stats.stackexchange.com/questions/235018/r-extract-and-plot-confidence-intervals-from-a-lmer-object-using-ggplot
How to get confidence intervals for lmer object?
Confidence intervals for the predicted probabilities from glmer object, error with bootMer
https://rdrr.io/cran/ciTools/man/add_ci.lmerMod.html
Error when estimating Confidence interval in lme4
https://fromthebottomoftheheap.net/2018/12/10/confidence-intervals-for-glms/
https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html
https://drewtyre.rbind.io/classes/nres803/week_12/lab_12/
Unsurprising to me but unfortunate for you, the nonconvergence of the mixed-model estimation and the difficulty in generating confidence intervals both stem from misusing a linear model for data with a limited dependent variable. "Despite these warnings, the model is still fitted" is a dangerous practice: estimates from a fit that has not converged should not be used for prediction. As you describe, the dependent variable (y) represents the probability of an event, a continuous variable between zero and one. Using a linear model to predict a probability constitutes a linear probability regression, which requires censoring predicted outcomes (e.g. forcing all predicted values greater than .99 to be .99 and all predicted values smaller than .01 to be .01) and adjusting for heterogeneous variances using weighted least squares (see https://bookdown.org/ccolonescu/RPoE4/heteroskedasticity.html). Letting the continuous variables produce both fixed and random effects also burdens the convergence, while some or all of the random effects of the continuous variables may not be necessary. The use of weights can also be problematic.
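For illustration, the censoring step just described could look like this (a sketch using lmer1 and df from the question; the .01/.99 thresholds are the example values):
pred <- predict(lmer1, newdata = df, allow.new.levels = TRUE)
pred_censored <- pmin(pmax(pred, 0.01), 0.99)  # force predictions into [.01, .99]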
Instead of a linear probability regression, beta regression works best for dependent variables that are proportions or probabilities. Beta regression without random effects is available in betareg::betareg(), and glmmTMB::glmmTMB() handles beta regression with random effects. Start from a simple setting where only the intercept has a random effect, such as
glmmTMB(y ~ 1 + x1 * poly(x2, 5) + c2 + (1 | c1), family = beta_family(link = "logit"), data = df)
You may compare the result with glmer() and lmer():
glmer(y ~ 1 + x1 * poly(x2, 5) + c2 + (1 | c1), family = gaussian(link = "logit"), data = df)
lmer(log(y/(1-y)) ~ 1 + x1 * poly(x2, 5) + c2 + (1 | c1), data = df)
glmer() with a logit link models the mean of y on the logit scale, while lmer() models the logit-transformed response log(y/(1-y)) with normal residuals; the two are similar in spirit but not identical (also note that base R's gaussian() family only provides the identity, log, and inverse links, so the glmer() call above may not run as written). glmmTMB() with beta_family() instead assumes that y follows a beta distribution. lmer() results are easier to explain and receive wider support from other packages, since they are linear models. On the other hand, glmmTMB() may fit better according to AIC, BIC, and log-likelihood. Note that all three require y to lie strictly inside (0, 1). To accommodate occasional zeros and ones, nudge observations at the boundaries inward by a small tolerance, usually half the distance from the boundary to its closest observed value (see https://stats.stackexchange.com/questions/109702 and https://graphworkflow.com/eda/bounded01/), as sketched below. For probabilities with many zeros and/or ones, zero-, one-, and zero-one-inflated beta regression can be fitted via gamlss::gamlss(). See Korosteleva, O. (2019). Advanced Regression Models with SAS and R. CRC Press.
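A sketch of that boundary adjustment (my addition; assumes y has at least one observation away from each boundary):
# half the distance from each boundary to its closest observed value
eps_lo <- min(df$y[df$y > 0], na.rm = TRUE) / 2
eps_hi <- (1 - max(df$y[df$y < 1], na.rm = TRUE)) / 2
# squeeze y into the open interval (0, 1) before beta regression
df$y_adj <- pmin(pmax(df$y, eps_lo), 1 - eps_hi)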
Add random effects for the slopes if warranted by likelihood ratio tests. Make sure there are enough levels in c1 (e.g. more than 10 different cities) to justify a mixed-effects model. The {glmmTMB} package extends glm() and glmer(); the alternative {brms} package is built for a Bayesian approach. Note that the weights argument in glmmTMB(), as in glm(), specifies values inversely proportional to the dispersion; they are not automatically scaled to sum to one, and integer weights are interpreted as numbers of observation units. You therefore need to investigate what w1 stands for and decide how to use it in the model.
merTools::predictInterval() generates many kinds of intervals for mixed models, some comparable to the confidence and prediction intervals of linear models without random effects. However, it supports only models fitted with lme4 (merMod objects). See https://cran.r-project.org/web/packages/merTools/vignettes/merToolsIntro.html and https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html.
predictInterval(lmer1, include.resid.var = FALSE) includes uncertainty from both fixed and random effects of all coefficients including the intercept, but excludes variation from multiple measurements of the same group or individual; this can be considered similar to a prediction interval of a linear model without random effects. predictInterval(lmer1, include.resid.var = FALSE, fix.intercept.variance = TRUE) generates a shorter interval by accounting for the covariance between the fixed and random effects of the intercept. predictInterval(lmer1, include.resid.var = FALSE, ignore.fixed.terms = "(Intercept)") also shortens the interval by removing uncertainty from the fixed effect of the intercept. If there are no random slopes beyond the random intercept, the last two methods are comparable to confidence intervals of linear models without random effects. confint(lmer1) and confint(profile(lmer1)) generate confidence intervals of model parameters such as a slope, so they do not produce confidence intervals of predicted outcomes. (A sketch of these calls follows.)
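A sketch of the three calls, using lmer1 and df from the question:
library(merTools)
# PI-like: fixed- and random-effect uncertainty, no residual variance
pi1 <- predictInterval(lmer1, newdata = df, include.resid.var = FALSE)
# Shorter: accounts for covariance between fixed and random intercepts
pi2 <- predictInterval(lmer1, newdata = df, include.resid.var = FALSE,
                       fix.intercept.variance = TRUE)
# Shorter: drops uncertainty from the fixed effect of the intercept
pi3 <- predictInterval(lmer1, newdata = df, include.resid.var = FALSE,
                       ignore.fixed.terms = "(Intercept)")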
You may also find the following functions and packages useful for generating CIs of mixed effect models.
ggeffect() from {ggeffects}, predictions() from {marginaleffects}, and margins() and prediction() from {margins}.
These can produce predictions averaged over the observed distribution of covariates, instead of predictions made by holding some predictors at specific values such as means or modes, which can be misleading and less useful.
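For example, with {marginaleffects} (a sketch; exact support for merMod objects depends on the installed version):
library(marginaleffects)
# Unit-level predictions with delta-method CIs; re.form = NA uses fixed effects only
preds <- predictions(lmer1, re.form = NA)
# A single averaged prediction with its CI
avg_predictions(lmer1, re.form = NA)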

Unscale predictor coefficients from an lmer model fit with an unscaled response

I have fitted an lmer model, and now I am trying to interpret the coefficients on the scale of the original predictors instead of the scaled ones.
My top model is:
lmer(logcptplus1~scale.t6+scale.logdepth+(1|location) + (1|Fyear),data=cpt, REML=TRUE)
So both predictor variables are scaled, one of them being scaled log values; my response variable is not scaled, just logged.
To scale my predictor variables, I used the scale(data$column, center = TRUE, scale = TRUE) function in R.
The output for my model is:
Fixed effects:
                Estimate Std. Error t value
(int)            3.31363    0.15163  21.853
scale.t6        -0.34400    0.10540  -3.264
scale.logdepth  -0.58199    0.06486  -8.973
So how can I obtain coefficient estimates on the original scale of my predictors from these coefficients, which are based on my scaled predictor variables?
NOTE: I understand how to unscale my predictor variables, just not how to unscale/transform the coefficients
Thanks
The scale function does a z-transform of the data, which means it takes the original values, subtracts the mean, and then divides by the standard deviation.
to_scale <- 1:10
using_scale <- scale(to_scale, center = TRUE, scale = TRUE)
by_hand <- (to_scale - mean(to_scale))/sd(to_scale)
identical(as.numeric(using_scale), by_hand)
[1] TRUE
Therefore, to reverse the model coefficients all you need to do is divide each slope by the standard deviation of its covariate; the mean only enters when recovering the intercept. (If z = (x - mean)/sd, then y = b0 + b1*z = (b0 - b1*mean/sd) + (b1/sd)*x, so the unscaled slope is b1/sd.) The scale function holds onto the mean and sd for you as attributes. So, if we assume that your covariate values are the using_scale vector for the scale.t6 regression coefficient, we can write a function to do the work for us.
get_real <- function(coef, scaled_covariate){
  # collect mean and standard deviation stored as attributes of the scaled covariate
  mean_sd <- unlist(attributes(scaled_covariate)[-1])
  # reverse the z-transformation: unscaled slope = scaled slope / sd
  answer <- coef / mean_sd[2]
  # this value will have a name, remove it
  names(answer) <- NULL
  # return unscaled coefficient
  return(answer)
}
get_real(-0.3440, using_scale)
[1] -0.1136195
The intercept can be recovered in the same spirit by subtracting b1*mean/sd for each scaled covariate from the fitted intercept. Because scaling is just a linear transformation of the predictors, this recovers exactly the coefficients you would have obtained from fitting the model on the unscaled variables.
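A quick sanity check on made-up data (my addition): the same model fitted on raw and scaled predictors should give matching slopes.
set.seed(42)
x <- rnorm(100, mean = 20, sd = 4)
y <- 2 + 3 * x + rnorm(100)
m_raw    <- lm(y ~ x)
m_scaled <- lm(y ~ scale(x))
coef(m_raw)["x"]                       # slope on the original scale
get_real(coef(m_scaled)[2], scale(x))  # should match the raw slope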

How to estimate the odds ratio with CI for X in a logistic regression containing the square of X using R?

I am trying to calculate odds ratios in R for variables that have not only linear but also quadratic terms in a logistic regression. Let's say there are X and X^2 in the model. I know how to get the odds ratio (for a unit change of X) when X takes a specific value, but I don't know how to calculate a confidence interval for this estimate. I found this reference on how it's done in SAS: http://support.sas.com/kb/35/189.html, but I would like to do it in R. Any suggestions?
#BenBolker
Here is an example:
mydata <-read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata <- transform(mydata, gpaSquared=gpa^2,greSquared=gre^2)
model <- glm(admit ~ gpa + gpaSquared + gre , family = binomial(logit), data = mydata)
In this example the odds ratio for gpa depends on the actual value of gpa (e.g. the effect of a unit change in gpa when gpa = 4). I can calculate the log odds at gpa = 5 and gpa = 4 and get the odds ratio from those, but I don't know how to get a CI for the OR. (Please ignore that in the example the squared term is not statistically significant.)
m <- glm(x ~ X1 + I(X1^2) + X2, data = data, family = binomial(link = "logit"))  # quadratic terms need I() inside a formula
summary(m)
confint(m) # 95% CI for the coefficients using profiled log-likelihood
confint.default(m) ## CIs using standard errors
exp(coef(m)) # exponentiated coefficients
exp(confint(m)) # 95% CI for exponentiated coefficients
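To answer the actual question of a CI for the OR at a specific value of X: the log odds ratio for a one-unit increase starting at gpa = g is b_gpa + b_gpaSquared * (2*g + 1), a linear combination of coefficients, so its standard error comes from the coefficient covariance matrix. A hedged sketch using the model fitted above:
g <- 4
b <- coef(model)            # order: (Intercept), gpa, gpaSquared, gre
V <- vcov(model)
L <- c(0, 1, 2 * g + 1, 0)  # weights selecting b_gpa + (2g + 1) * b_gpaSquared
logOR <- sum(L * b)
se <- sqrt(drop(t(L) %*% V %*% L))
# odds ratio and 95% CI for a one-unit increase in gpa, starting at gpa = 4
exp(logOR + c(OR = 0, lwr = -1.96, upr = 1.96) * se)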

Estimating mean time to failure with survreg/flexsurvreg with standard error

I am trying to estimate the mean time to failure for a Weibull distribution fitted to some survival data with flexsurvreg from the flexsurv package. I need to be able to estimate the standard error for use in a simulation model.
Using flexsurvreg with the lung data as an example;
require(flexsurv)
lungS <- Surv(lung$time, lung$status)
lungfit <- flexsurvreg(lungS ~ 1, dist = "weibull")
lungfit
Call:
flexsurvreg(formula = lungS ~ 1, dist = "weibull")
Maximum likelihood estimates:
          est    L95%    U95%
shape    1.32    1.14    1.52
scale  418.00  372.00  469.00
N = 228, Events: 165, Censored: 63
Total time at risk: 69593
Log-likelihood = -1153.851, df = 2
AIC = 2311.702
Now, calculating the mean is just a case of plugging the estimated parameter values into the standard formula, but is there an easy way of getting the standard error of this estimate? Can survreg do this?
In flexsurv version 0.2, if x is the fitted model object, then x$cov is the covariance matrix of the parameter estimates, with positive parameters on the log scale. You could then use the asymptotic normal property of maximum likelihood estimators. Simulate a large number of multivariate normal vectors, with the estimates as means, and this covariance matrix (using e.g. rmvnorm from the mvtnorm package). This gives you replicates of the parameter estimates under sampling uncertainty. Calculate the corresponding mean survival for each replicate, then take the SD or quantiles of the resulting sample to get the standard error or a confidence interval.
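A sketch of that simulation, written against a newer flexsurv where vcov() and the res.t component are available (in version 0.2, use x$cov and the log-scale estimates directly):
library(mvtnorm)
set.seed(1)
n_sim <- 10000
# replicates of (log shape, log scale) from the asymptotic normal of the MLE
sims <- rmvnorm(n_sim, mean = lungfit$res.t[, "est"], sigma = vcov(lungfit))
shape <- exp(sims[, 1])
scale <- exp(sims[, 2])
# Weibull mean time to failure: scale * gamma(1 + 1/shape)
mttf <- scale * gamma(1 + 1/shape)
sd(mttf)                         # standard error of the MTTF estimate
quantile(mttf, c(0.025, 0.975))  # 95% confidence interval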

Any simple way to get regression prediction intervals in R?

I am working on a big data set with over 300K rows, running a regression analysis to estimate a parameter called Rate from the predictor variable Distance. I have the regression equation. Now I want to get the confidence and prediction intervals. I can easily get the confidence intervals for the coefficients with the command:
> confint(W1500.LR1, level = 0.95)
                  2.5 %      97.5 %
(Intercept) 666.2817393 668.0216072
Distance      0.3934499   0.3946572
which gives me the upper and lower bounds of the CI for the coefficients. The only thing I have learnt so far is that I can get prediction intervals for specific values of Distance (say 200, 500, etc.) using the code:
predict(W1500.LR1, newdata, interval="predict")
This is not useful for me because I have over 300K different distance values, which would require running this code for each of them. Is there a simple way to get the prediction intervals for all observations, like the confint command I showed above?
Had to make up my own data but here you go
set.seed(1)  # for reproducibility
x <- rnorm(300000)
y <- jitter(3 * x, 1000)
fit <- lm(y ~ x)
# Prediction intervals at every observed x
pred.int <- predict(fit, interval = "prediction")
# Confidence intervals for the mean at every observed x
conf.int <- predict(fit, interval = "confidence")
fitted.values <- pred.int[, 1]
pred.lower <- pred.int[, 2]
pred.upper <- pred.int[, 3]
# Plot a subset; order by x so the lines are drawn left to right
ord <- order(x[1:1000])
plot(x[1:1000], y[1:1000])
lines(x[1:1000][ord], fitted.values[1:1000][ord], col = "red", lwd = 2)
lines(x[1:1000][ord], pred.lower[1:1000][ord], lwd = 2, col = "blue")
lines(x[1:1000][ord], pred.upper[1:1000][ord], lwd = 2, col = "blue")
As you can see, the prediction interval is for predicting new data values, not for constructing intervals around the beta coefficients. The confidence intervals you actually want are obtained in the same fashion from conf.int.
