Bootstrapping in R: Predict

I am running a program in which I fit an OLS regression and then subtract the fitted values from the actual observations to obtain the residuals.
model1 = lm(data = final, obs ~ day + poly(temp,2) + prpn + school + lag1) # linear model
predfit = predict(model1, final) # predicted values
residuals = data.frame(final$obs - predfit) # obtain residuals
I want to bootstrap my model and then do the same with the bootstrapped coefficients. I try doing this the following way:
lboot <- lm.boot(model1, R = 1000)
predfit = predict(lboot, final)
residuals = data.frame(final$obs - predfit) # obtain residuals
However, that does not work. I also tried:
boot_predict(model1, final, R = 1000, condense = T, comparison = "difference")
and that also does not work.
How can I bootstrap my model and then predict based on that?

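If you're trying to fit the best OLS using bootstrap, I'd use the caret package.
library(caret)
# separate independent and dependent variables
indepVars = final[, setdiff(names(final), "obs")] # drop the response column by name
depVar = final$obs
# train model; the bootstrap resamples are used to estimate performance,
# and the final model is refit on the full data
ols.train = train(indepVars, depVar, method='lm',
trControl = trainControl(method='boot', number=1000))
# make predictions and get residuals (observed minus predicted, as in the question)
ols.pred = predict(ols.train, indepVars)
residuals = final$obs - ols.pred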
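
If what you want is predictions from the bootstrapped coefficients themselves, a plain base-R case bootstrap may be closer to the question. The following is only a minimal sketch under stated assumptions: the data frame is simulated as a stand-in for `final`, the formula is shortened to `day` and `poly(temp, 2)`, and the names `boot_pred`, `boot_mean`, `boot_resid` and `boot_ci` are made up for illustration.

```r
set.seed(42)

# Simulated stand-in for `final` (hypothetical data, not the asker's)
n <- 200
final <- data.frame(day  = rep(1:7, length.out = n),
                    temp = runif(n, 0, 30))
final$obs <- 5 + 0.3 * final$day + 0.2 * final$temp -
  0.005 * final$temp^2 + rnorm(n)

R <- 200                              # number of bootstrap resamples
form <- obs ~ day + poly(temp, 2)     # shortened version of the question's formula

# Each column holds predictions for every row of `final`,
# from a model refit on one resample of the rows (with replacement)
boot_pred <- replicate(R, {
  idx <- sample(nrow(final), replace = TRUE)
  fit <- lm(form, data = final[idx, ])
  predict(fit, newdata = final)
})

boot_mean  <- rowMeans(boot_pred)     # bootstrap-averaged prediction per row
boot_resid <- final$obs - boot_mean   # residuals against that average
# Percentile interval of the prediction for each row
boot_ci <- t(apply(boot_pred, 1, quantile, probs = c(0.025, 0.975)))
```

Each bootstrap fit is an ordinary `lm` object, so `predict()` works on it directly; the resampling loop is the only extra piece.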

Related

How to obtain prediction from an INLA model in R without fitting it?

Let's imagine I want to fit an INLA model to a large dataset and then make predictions from the fitted model. One solution is to append the covariates for which predictions are wanted to the data, assigning NA values to the outcome.
In my case, I want to regularly calculate the predicted outcome based on updated covariates. Ideally, I would like to fit the model once, save it, and use it later to make predictions regularly (as one can for a linear model fitted with 'lm()' in R, using the 'predict()' function).
I haven't found a way to do it in INLA. Below is a simple reproducible example.
library(INLA)
#simulate data
n = 100; a = 1; b = 1; tau = 100
z = rnorm(n)
eta = a + b*z
scale = exp(rnorm(n))
prec = scale*tau
y = rnorm(n, mean = eta, sd = 1/sqrt(prec))
plot(z,y)
#run INLA model
data = list(y=y, z=z)
formula = y ~ 1+z
result = inla(formula, family = "gaussian", data = data)
summary(result)
#define prediction data: original data plus new covariates with NA outcomes
data.pred = list(z = c(data$z, seq(2,4,length.out=100)),
y = c(data$y, rep(NA,100)))
#run INLA model with prediction
result = inla(formula, family = "gaussian", data = data.pred,
control.compute=list(config = TRUE))
summary(result)
#get posterior samples of the predictions
post.samples <- inla.posterior.sample(n = 10, result = result)
pred <- do.call(cbind,
lapply(post.samples,
function(X) X$latent[startsWith(rownames(X$latent), "Pred")]))
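
For comparison, the fit-once, predict-later workflow that the question refers to for `lm()` can be sketched as follows. This is not an INLA solution; it only illustrates the behaviour being asked for. The simulated data, the temporary file path and the object names are assumptions made for the example.

```r
set.seed(1)

# Simulated data, loosely following the question's setup
z <- rnorm(100)
y <- 1 + z + rnorm(100, sd = 0.1)
fit <- lm(y ~ z, data = data.frame(y = y, z = z))

# Save the fitted model once ...
path <- tempfile(fileext = ".rds")   # hypothetical location
saveRDS(fit, path)

# ... and later, possibly in another session, reload it
# and predict for updated covariates without refitting
fit_loaded <- readRDS(path)
new_data <- data.frame(z = seq(2, 4, length.out = 5))
pred <- predict(fit_loaded, newdata = new_data, interval = "prediction")
pred
```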

Getting Confidence Intervals from predicted values from a nlme model from package medrc

I am trying to figure out how to get confidence intervals for predicted values from a model fitted with medrc (an nlme model). The code worked on a regular drc package model, which does not use random effects, so I assume I am doing something wrong with this nlme model, because I am getting errors.
Below is an example data frame of the data I am using:
df <- data.frame(Geno = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,
9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,14,14,14,14),
Treatment = c(3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",
3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",
3,6,9,"MMM",3,6,9,"MMM"),
Temp = c(32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535),
PAM = c(0.62225,0.593,0.35775,0.654,0.60625,0.5846667,0.316,0.60875,0.62275,0.60875,0.32125,
0.63725,0.60275,0.588,0.32275,0.60875,0.65225,0.6185,0.29925,0.64525,0.61925,0.61775,
0.11725,0.596,0.603,0.6065,0.2545,0.59025,0.586,0.5895,0.27025,0.59125,0.6345,0.6135,
0.3755,0.622,0.53375,0.552,0.2485,0.51925,0.6375,0.6256667,0.3575,0.63975,0.59375,0.6055,
0.333,0.64125,0.55275,0.51025,0.319,0.55725,0.6375,0.64725,0.348,0.66125))
df$Geno <- as.factor(df$Geno)
With these data, I am fitting a three-parameter dose-response model: b = slope, d = maximum, e = ED50.
model <- medrm(PAM ~ Temp,
data=df,
random= d + e ~ 1|Geno,
fct=LL.3(),
control=nlmeControl(msMaxIter = 2000, maxIter=2000, minScale=0.00001, tolerance=0.1, pnlsTol=1))
summary(model)
plot(model)
From this model I want to make prediction values for different temperatures along the model
model_preddata = data.frame(Temp = seq(28,39, length.out = 100))
model_pred = as.data.frame(predict(model, newdata = model_preddata, interval = 'confidence'))
With this I get an error, but I can make it predict the PAM values if I add level = 0:
model_pred = as.data.frame(predict(model, newdata = model_preddata, interval = 'confidence', level = 0))
However, this does not give me the lower and upper bound columns that I get when I run this code with other, non-mixed-effects models.
Can anyone help me figure out how to get the CI for the predicted values of this model?

How do I extract standard errors or variation from predicted ordinal logistic regression analyses?

I am undertaking an ordinal logistic regression using the R package MASS.
For example:
library(MASS)
house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(house.plr, digits = 3)
I am using the S3 method predict() to obtain the predicted values:
test_dat <- data.frame(Infl = factor(rep("Low",4)),
Cont = factor(rep("Low",4)),
Type = unique(housing$Type))
predict(house.plr, test_dat, type = "p")
        Low    Medium      High
1 0.3784493 0.2876752 0.3338755
2 0.5190445 0.2605077 0.2204478
3 0.4675584 0.2745383 0.2579033
4 0.6444840 0.2114256 0.1440905
The result is a table of predicted means for each level of Sat given the variables defined in the test_dat.
How might I extract the variation around each of these means in the form of a standard error or standard deviation?
First, your predicted values are the predicted probabilities of each outcome for each observation. They are not predicted means on the response scale.
Second, you can use the marginaleffects package to get the standard errors for the predicted probabilities and then calculate the confidence intervals yourself. Alternatively, you may implement the non-parametric bootstrap. I implement both below. Note that I reordered the columns in the test data to match the training data.
# Packages
library(MASS)
library(marginaleffects)
library(dplyr)
# Create a test set
N <- 4
test_dat <- data.frame(
Infl = factor(rep("Low", N)),
Type = unique(housing$Type),
Cont = factor(rep("Low", N))
)
# Fit ordered logistic regression model
house.plr <- polr(Sat ~ Infl + Type + Cont,
weights = Freq,
data = housing,
Hess = TRUE)
# Demonstrate that predict() doesn't provide any measure of variability
# for the predicted class probabilities, as shown in question
predict(house.plr, test_dat, type = "probs")
# Use the marginaleffects package to get delta method standard errors for
# each predicted probability
probs <- marginaleffects::predictions(house.plr,
newdata = test_dat,
type = "probs")
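# Compute CIs from the standard error using normal approximation
# (the estimate column is called `predicted` in older marginaleffects releases
# and `estimate` in newer ones; use whichever your installed version returns)
probs$predicted - 1.96*probs$std.error
probs$predicted + 1.96*probs$std.error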
# Alternatively, use non-parametric bootstrapped confidence intervals.
# note that this does not adjust the weights to a constant sum for
# each bootstrap, although it is easy to implement. You're free to
# determine how to handle the weights, including resampling based
# on the weights.
# Generate bootstrapped data.frames
set.seed(123)
sims <- 5
samples <- vector(mode = "list", length = sims)
samples <- lapply(samples, function(x){ slice_sample(housing, n = nrow(housing), replace = TRUE)})
# Fit model on each bootstrapped data.frame
models <- lapply(samples, function(x){polr(Sat ~ Infl + Type + Cont,
weights = Freq,
data = x,
Hess = TRUE)})
# Get test predictions into a data.frame
probs_boot <- lapply(models, function(x) {
marginaleffects::predictions(x,
newdata = test_dat,
type = "probs")
})
probs_boot_df <- bind_rows(probs_boot)
# Compute CIs
probs_boot_df %>%
group_by(group, Type.x, Infl, Type.y, Cont) %>%
summarise(ci_low = quantile(predicted, probs = 0.025),
ci_high = quantile(predicted, probs = 0.975))
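
To make the first point concrete, here is a small check, using only MASS and the same model and test data as above, that each row of the `predict()` output is a set of probabilities over the levels of Sat rather than a mean. The object name `p` is just for illustration.

```r
library(MASS)  # provides polr() and the housing data

house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
test_dat <- data.frame(Infl = factor(rep("Low", 4)),
                       Type = unique(housing$Type),
                       Cont = factor(rep("Low", 4)))

# One row per test observation, one column per level of Sat
p <- predict(house.plr, test_dat, type = "probs")

# Probabilities over the three outcome levels sum to 1 within each row
rowSums(p)
```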

Is it normal to get the same Goodness of Fit values for a logit model and an LPM model, based on the same data?

I am using the Stata dataset ANES.dta with information about the 2000 presidential election in the USA. I built two models on this dataset, one logit and one LPM. I want to compare the two models using the following goodness-of-fit measures: accuracy, sensitivity and specificity.
I am new to R (I have mainly used Stata so far), and that's why I'm wondering whether it is normal to get exactly the same values in the confusion matrices for a logit model and an LPM based on the same data. Am I doing something wrong?
rm(list=ls())
library(foreign)
dat <- read.dta("ANES.dta", convert.factors = FALSE)
dat_clear <- na.omit(dat)
head(dat_clear)
#Logit model
m1_logit <- glm(vote ~ gender + income + pro_choice ,
data = dat_clear, family = binomial(link = "logit") ,
na.action = na.omit)
summary(m1_logit)
#LPM
m2_lpm <- lm(vote ~ gender + income + pro_choice,
data = dat_clear, na.action = na.omit)
summary(m2_lpm)
#Confusion matrix for logit model
dat_clear$prediction_log <- predict(m1_logit, newdata = dat_clear, type = "response")
dat_clear$vote_pred_log <- as.numeric(dat_clear$prediction_log > .5)
table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_log)
#Confusion matrix for LPM model
dat_clear$prediction_lpm <- predict(m2_lpm, newdata = dat_clear, type = "response")
dat_clear$vote_pred_lpm <- as.numeric(dat_clear$prediction_lpm > .5)
table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_lpm)
This is what the confusion matrices look like
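
One way to check whether matching confusion matrices are plausible is to repeat the comparison on simulated data. Both models classify with the same 0.5 cutoff, so they only disagree for observations whose fitted values fall on opposite sides of it. The sketch below uses made-up data and variable names (`x1`, `x2`, `sim`), not ANES, and how closely the two models agree will depend on the data.

```r
set.seed(2000)

# Simulated binary outcome with one continuous and one binary predictor
n  <- 2000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
vote <- rbinom(n, 1, plogis(0.3 + 0.8 * x1 - 0.5 * x2))
sim <- data.frame(vote, x1, x2)

# Fit both models on the same data
logit_fit <- glm(vote ~ x1 + x2, data = sim, family = binomial(link = "logit"))
lpm_fit   <- lm(vote ~ x1 + x2, data = sim)

# Classify with the same 0.5 cutoff used in the question
pred_logit <- as.numeric(predict(logit_fit, type = "response") > 0.5)
pred_lpm   <- as.numeric(predict(lpm_fit) > 0.5)

# Cross-tabulate the two sets of classifications
table(logit = pred_logit, lpm = pred_lpm)
agreement <- mean(pred_logit == pred_lpm)  # share of identical classifications
agreement
```

If the off-diagonal cells of that table are empty, the two confusion matrices against the observed outcome will be identical as well.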

Getting estimated means after multiple imputation using the mitml, nlme & geepack R packages

I'm running multilevel multiple imputation with the package mitml (using the panImpute() function) and am fitting linear mixed models and marginal models with the packages nlme and geepack through mitml's with() method.
I can get the estimates, p-values etc. for those through the testEstimates() function, but I'm also looking to get estimated means across my model predictors. I've tried the emmeans package, which I normally use for getting estimated means when running nlme and geepack without multiple imputation, but when I do so emmeans tells me "Can't handle an object of class “mitml.result”".
I'm wondering is there a way to get pooled estimated means from the multiple imputation analyses I've run?
The data frames I'm analyzing are longitudinal/repeated measures and in long format. In the linear mixed model I want to get the estimated means for a 2x2 interaction effect and in the marginal model I'm trying to get estimated means for the 6 levels of 'time' variable. The outcome in all models is continuous.
Here's my code
# mixed model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data=Data, formula=fml, n.burn=50000, n.iter=5000, m=100, group = "treatment")
summary(imp)
plot(imp, trace="all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, lme(Dep ~ time*treatment, random = ~ 1|id, method = "ML", na.action = na.exclude, control = list(opt = "optim")))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))
# marginal model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data=Data, formula=fml, n.burn=50000, n.iter=5000, m=100)
summary(imp)
plot(imp, trace="all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, geeglm(Dep ~ time, id = id, corstr ="unstructured"))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))
is there a way to get pooled estimated means from the multiple imputation analyses I've run?
This is not a reprex without Data, so I can't verify this works for you. But emmeans provides support for mira-class (lists of) models in the mice package. So if you fit your model in with() using the mids rather than mitml.list class object, then you can use that to obtain marginal means of your outcome (and any contrasts or pairwise comparisons afterward).
Using example data found here, which uncomfortably loads an external workspace:
con <- url("https://www.gerkovink.com/mimp/popular.RData")
load(con)
## imputation
library(mice)
ini <- mice(popNCR, maxit = 0)
meth <- ini$meth
meth[c(3, 5, 6, 7)] <- "norm"
pred <- ini$pred
pred[, "pupil"] <- 0
imp <- mice(popNCR, meth = meth, pred = pred, print = FALSE)
## analysis
library(lme4) # fit multilevel model
mod <- with(imp, lmer(popular ~ sex + (1|class)))
library(emmeans) # obtain pooled estimates of means
(em <- emmeans(mod, specs = ~ sex) )
pairs(em) # test comparison
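
As background on what "pooled" means here: results from multiply imputed data sets are conventionally combined with Rubin's rules, which average the per-imputation estimates and add the between-imputation variance to the average within-imputation variance. The sketch below applies those rules by hand to one estimated mean. The numbers are invented for illustration and the variable names are hypothetical.

```r
# Hypothetical per-imputation results for one estimated mean (m = 5)
est <- c(5.10, 5.25, 4.98, 5.17, 5.05)  # estimate from each imputed data set
se  <- c(0.30, 0.28, 0.31, 0.29, 0.30)  # its standard error in each data set

m     <- length(est)
q_bar <- mean(est)                  # pooled estimate
u_bar <- mean(se^2)                 # average within-imputation variance
b     <- var(est)                   # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b    # total variance under Rubin's rules
pooled_se <- sqrt(t_var)

c(estimate = q_bar, se = pooled_se)
```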
