Logistic regression in R: plotting bootstrap confidence intervals using the Titanic dataset - r

I am working on an exercise for an online statistics course. I need to create a logistic regression in R using the Titanic dataset, and I want to apply the bootstrap method to create and plot 95% confidence intervals for the predictions of the logistic regression.
When I run the bootstrap command and try to plot the result, I get the error: "All values of t* are equal to 0.0159971772980342". I also get a bias and standard error of 0, which cannot be true. I guess there is an error in how I set up the bootstrap command, but I unfortunately cannot find it. What can I try?
My Code:
library(boot)

set.seed(50000)

logit_test <- function(data, indices) {
  dt <- data[indices, ]
  fit <- glm(Clean_data$Survived ~ Fare, data = Clean_data, family = "binomial")
  return(coef(fit))
}

boot_strap <- boot(
  data = Clean_data,
  statistic = logit_test,
  R = 100)

boot.ci(boot.out = boot_strap,
        type = c("basic"))

# Now we look at the results and plot them
boot_strap
plot(boot_strap, index = 2)
My Output:
> library(boot)
> 
> set.seed(50000)
> 
> logit_test <- function(data, indices) {
+   dt <- data[indices, ]
+   fit <- glm(Clean_data$Survived ~ Fare, data = Clean_data, family = "binomial")
+   return(coef(fit))
+ }
> boot_strap <- boot(
+   data = Clean_data,
+   statistic = logit_test,
+   R = 100)
> 
> boot.ci(boot.out = boot_strap,
+         type = c("basic"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates

CALL : 
boot.ci(boot.out = boot_strap, type = c("basic"))

Intervals : 
Level      Basic         
95%   (-0.8968, -0.8968 )  
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
> boot_strap

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Clean_data, statistic = logit_test, R = 100)

Bootstrap Statistics :
       original  bias    std. error
t1* -0.89682819       0           0
t2*  0.01599718       0           0
> plot(boot_strap, index = 2)
[1] "All values of t* are equal to 0.0159971772980342"

The problem is that your bootstrap function isn't using the bootstrapped data to fit the model. You have this function:
logit_test <- function(data, indices) {
  dt <- data[indices, ]
  fit <- glm(Clean_data$Survived ~ Fare, data = Clean_data,
             family = binomial)
  return(coef(fit))
}
Note that there are a couple of problems. One is that you should be using dt in the data= argument. You should also not be using Clean_data$Survived as the dependent variable; it should just be Survived, because you want to ensure that you take that variable not from the original data but from the bootstrapped data. Something like this for your bootstrap function should work:
logit_test <- function(data, indices) {
  dt <- data[indices, ]
  fit <- glm(Survived ~ Fare, data = dt, family = binomial)
  return(coef(fit))
}

Related

Bootstrapping in R: Predict

I am running a program where I conduct an OLS regression and then subtract the fitted values from the actual observations to keep the residuals.
model1 = lm(data = final, obs ~ day + poly(temp, 2) + prpn + school + lag1) # linear model
predfit = predict(model1, final)            # predicted values
residuals = data.frame(final$obs - predfit) # obtain residuals
I want to bootstrap my model and then do the same with the bootstrapped coefficients. I tried doing it the following way:
lboot <- lm.boot(model1, R = 1000)  # lm.boot() is from the simpleboot package
predfit = predict(lboot, final)
residuals = data.frame(final$obs - predfit) # obtain residuals
However, that does not work. I also tried:
boot_predict(model1, final, R = 1000, condense = T, comparison = "difference")  # from the finalfit package
and that also does not work.
How can I bootstrap my model and then predict based on that?
If you're trying to fit the best OLS using bootstrap, I'd use the caret package.
library(caret)

# separate independent and dependent variables
indepVars = final[, names(final) != "obs"]
depVar = final$obs

# train the model with 1000 bootstrap resamples
ols.train = train(indepVars, depVar, method = 'lm',
                  trControl = trainControl(method = 'boot', number = 1000))

# make predictions and get residuals
ols.pred = predict(ols.train, indepVars)
residuals = final$obs - ols.pred
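If you would rather stay with base R and the boot package than switch to caret, the same resample-refit-predict idea can be written directly; a minimal sketch, assuming the final data frame and model formula from the question (boot_pred is a hypothetical helper name):
library(boot)

# refit the model on each bootstrap resample and predict on the original data
boot_pred <- function(data, indices) {
  d <- data[indices, ]
  m <- lm(obs ~ day + poly(temp, 2) + prpn + school + lag1, data = d)
  predict(m, newdata = data)
}

set.seed(1)
bt <- boot(data = final, statistic = boot_pred, R = 1000)

pred_means <- colMeans(bt$t)                    # bootstrap-averaged predictions
residuals <- data.frame(final$obs - pred_means) # residuals, as in the question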

Getting estimated means after multiple imputation using the mitml, nlme & geepack R packages

I'm running multilevel multiple imputation through the package mitml (using the panImpute() function) and am fitting linear mixed models and marginal models through the packages nlme and geepack together with the mitml with() function.
I can get the estimates, p-values, etc. for those through the testEstimates() function, but I'm also looking to get estimated means across my model predictors. I've tried the emmeans package, which I normally use for getting estimated means when running nlme & geepack without multiple imputation, but emmeans tells me "Can't handle an object of class “mitml.result”".
I'm wondering: is there a way to get pooled estimated means from the multiple imputation analyses I've run?
The data frames I'm analyzing are longitudinal/repeated measures and in long format. In the linear mixed model I want to get the estimated means for a 2x2 interaction effect, and in the marginal model I'm trying to get estimated means for the 6 levels of the 'time' variable. The outcome in all models is continuous.
Here's my code:
# mixed model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data = Data, formula = fml, n.burn = 50000, n.iter = 5000, m = 100, group = "treatment")
summary(imp)
plot(imp, trace = "all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, lme(Dep ~ time*treatment, random = ~ 1|id, method = "ML", na.action = na.exclude, control = list(opt = "optim")))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))

# marginal model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data = Data, formula = fml, n.burn = 50000, n.iter = 5000, m = 100)
summary(imp)
plot(imp, trace = "all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, geeglm(Dep ~ time, id = id, corstr = "unstructured"))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))
is there a way to get pooled estimated means from the multiple imputation analyses I've run?
This is not a reprex without Data, so I can't verify this works for you. But emmeans provides support for mira-class (lists of) models in the mice package. So if you fit your model in with() using the mids rather than mitml.list class object, then you can use that to obtain marginal means of your outcome (and any contrasts or pairwise comparisons afterward).
Using example data found here, which uncomfortably loads an external workspace:
con <- url("https://www.gerkovink.com/mimp/popular.RData")
load(con)
## imputation
library(mice)
ini <- mice(popNCR, maxit = 0)
meth <- ini$meth
meth[c(3, 5, 6, 7)] <- "norm"
pred <- ini$pred
pred[, "pupil"] <- 0
imp <- mice(popNCR, meth = meth, pred = pred, print = FALSE)
## analysis
library(lme4) # fit multilevel model
mod <- with(imp, lmer(popular ~ sex + (1|class)))
library(emmeans) # obtain pooled estimates of means
(em <- emmeans(mod, specs = ~ sex) )
pairs(em) # test comparison
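If you also want interval estimates for those pooled means, the usual emmeans accessors apply to the pooled emmGrid object; a small follow-up sketch:
confint(em)         # pooled 95% confidence intervals for the marginal means
confint(pairs(em))  # intervals for the pairwise comparison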

Bootstrapping in R using the Nagelkerke R-squared

I am new to R. I am trying to use the boot() function, with the Nagelkerke R-squared as the statistic. I know that I need a function that measures the Nagelkerke R-squared on each resample; however, I have no idea what I should pass as the statistic function.
I know that the Nagelkerke R-squared can be computed from the deviance and null deviance of a logit regression, so I wrote this function to compute it:
NagR2 <- function(Objects){
  n <- nrow(Objects)
  reg <- glm(form,
             family = binomial("logit"), data = datainput)
  mo <- stepAIC(regression, direction = c("backward"), trace = FALSE)
  R2cox <- 1 - exp((mo$deviance - mo$null.deviance)/n)
  R2nag <- R2cox/(1 - exp((-mo$null.deviance)/n))
  R2nag
}
How should I change my NagR2 function so that I can use it as the statistic in the boot() function?
You need to alter the function to take an input data.frame as its first argument and the indices of the data.frame as its second, with any other arguments after that. Changing your existing function a bit:
NagR2 <- function(datainput, ind, form){
  n <- nrow(datainput[ind,])
  reg <- glm(form, family = binomial("logit"), data = datainput[ind,])
  mo <- stepAIC(reg, direction = c("backward"), trace = FALSE)
  R2cox <- 1 - exp((mo$deviance - mo$null.deviance)/n)
  R2nag <- R2cox/(1 - exp((-mo$null.deviance)/n))
  R2nag
}
And applying it to a test dataset:
library(MASS) # stepAIC()
library(boot)

dat = iris
dat$Species = factor(ifelse(dat$Species == "versicolor", "v", "o"))
bo = boot(dat, statistic = NagR2, R = 100, form = as.formula(Species ~ .))
bo

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = dat, statistic = NagR2, R = 100, form = as.formula(Species ~ 
    .))

Bootstrap Statistics :
     original     bias    std. error
t1* 0.3650395 0.01470299   0.0720022
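A confidence interval for the Nagelkerke R-squared then comes straight from the same object; a small sketch:
boot.ci(bo, type = "perc") # 95% percentile interval for the Nagelkerke R-squared
plot(bo)                   # histogram and QQ plot of the replicates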

Bootstrap Multinomial regression in R

I am trying to bootstrap a simple multinomial regression in R, and I am getting an error:
Error in is.data.frame(data) : object 'd' not found
What is really strange is that I am using the same code (adjusted to this particular problem) as in a tutorial for the boot package at Quick-R, and that same code also worked when I used a different function (like lm()). I am surely doing something stupid, but I do not see what. If anyone can help, I would appreciate it a lot.
This is an example:
require(foreign)
require(nnet)
require(boot)
# an example for multinomial logistic regression
ml = read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml = ml[,c(5,7,3)]
bs <- function(formula, data, indices) {
d = data[indices,] # allows boot to select sample
fit = multinom(formula, data=d)
s = summary(fit)
return(list(fit$coefficients, fit$standard.errors))
}
# 5 replications
results = list()
results <- boot(
data=ml, statistic=bs, R=5, parallel='multicore',
formula=prog~write
)
The error happens in the summary() part; note also that the object returned by multinom() does not have coefficients and standard.errors elements. It seems that summary.multinom() in turn calculates the Hessian from your data, d, which for some reason (probably a scoping issue) cannot be found. A quick fix is to add Hess = TRUE so the Hessian is computed at fit time:
bs <- function(formula, data, indices) {
  d = data[indices,] # allows boot to select sample
  fit = multinom(formula, data = d, Hess = TRUE)
  s = summary(fit)
  return(cbind(s$coefficients, s$standard.errors))
}

# 5 replications
results = list()
results <- boot(
  data = ml, statistic = bs, R = 5, parallel = 'multicore',
  formula = prog ~ write
)
Multinomial logistic regression returns a matrix of coefficients from the coef() function. This differs from an lm or glm model, which returns a vector of coefficients.
library(foreign) # read.dta()
library(nnet)    # multinom()
library(boot)    # boot()

# an example for multinomial logistic regression
ml = read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml = ml[, c(5, 7, 3)]
names(ml)

bs <- function(formula, data, indices) {
  d = data[indices,] # allows boot to select sample
  fit = multinom(formula, data = d, maxit = 1000, trace = FALSE)
  estimates <- coef(fit)
  return(t(estimates))
}

# enable parallel processing
library(parallel)
cl <- makeCluster(2)
clusterExport(cl, "multinom")

# 10000 replications
set.seed(1984)
results <- boot(
  data = ml, statistic = bs, R = 10000, parallel = "snow", ncpus = 2, cl = cl,
  formula = prog ~ write
)

# label the estimates
subModelNames <- colnames(results$t0)
varNames <- rownames(results$t0)
results$t0
estNames <- apply(expand.grid(varNames, subModelNames), 1, function(x) paste(x, collapse = "_"))
estNames
colnames(results$t) <- estNames

# summary of results
library(car)
summary(results)
confint(results, level = 0.95, type = "norm")
confint(results, level = 0.95, type = "perc")
confint(results, level = 0.95, type = "bca")

# plot the results
hist(results, legend = "separate")

boot() generating an error on replacement - R

I've written a couple of functions for retrieving statistics (coefficients and p-values) from an lm object, to be bootstrapped over. The coefficient one works; the p-value one fails with the error:
Error in boot(data = data, statistic = bs_p, R = 1000) : 
  number of items to replace is not a multiple of replacement length
I now believe the error is related to the inclusion of a factor variable. Here is an attempt to recreate the problem with easily reproducible data:
L3 <- LETTERS[1:3]
data <- data.frame(cbind(x = 20:69, y = 1:50), fac = sample(L3, 50, replace = TRUE))

bs_p <- function(data, i) {
  d <- data[i,]
  fit <- lm(d$y ~ d$x * d$fac, data = d)
  return(summary(fit)$coefficients[,4])
}
bt <- boot(data = data, statistic = bs_p, R = 1000)
The class "numeric" values returned by each of these appear to be in exactly the same format, to my beginner's eye... but I'm guessing they aren't? I have also cleared the returned bt bootstrap object before running the next function, but that did not solve it. How can I best retrieve bootstrapped p-values? Thanks for any thoughts. (Running R 3.0.1 on Mac OS X.)
I am not sure whether you can meaningfully bootstrap p-values from an lm model (but a solution for doing so is provided below). In your function, you can remove the d$ prefixes from the right-hand side of fit, since you already pass the resampled data via data = d. Here is an example using the mtcars data:
library(boot)

bs <- function(mtcars, i) {
  d <- mtcars[i,]
  fit <- lm(mpg ~ drat + wt, data = d)
  return(coef(fit))
}
bt <- boot(data = mtcars, statistic = bs, R = 1000)
bt

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = bs, R = 1000)

Bootstrap Statistics :
     original        bias    std. error
t1* 30.290370  0.54284222    7.494441
t2*  1.442491 -0.07260619    1.393801
t3* -4.782890 -0.09804271    1.000838
Here are the bootstrapped p-values from lm:
bs_r <- function(mtcars, i) {
  d <- mtcars[i,]
  fit <- lm(mpg ~ drat + wt, data = d)
  return(summary(fit)$coefficients[,4])
}
bt1 <- boot(data = mtcars, statistic = bs_r, R = 1000)
bt1

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = bs_r, R = 1000)

Bootstrap Statistics :
        original         bias    std. error
t1* 2.737824e-04 4.020024e-03 0.0253248217
t2* 3.308544e-01 7.108738e-02 0.2960776146
t3* 1.589075e-06 5.405459e-05 0.0005540412
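If you want interval estimates rather than just the bias and standard error of those bootstrapped p-values, boot.ci() can be applied per coefficient through its index argument; a small sketch using the bt1 object above:
boot.ci(bt1, type = "perc", index = 2) # percentile interval for the drat p-value
boot.ci(bt1, type = "perc", index = 3) # percentile interval for the wt p-value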
