Confidence interval for quantile regression using bootstrap - r

I am trying to get the five types of bootstrap intervals for linear and quantile regression. I was able to bootstrap and find the 5 bootstrap intervals (Quantile, Normal, Basic, Studentized, and BCa) for the linear regression using Boot from car and boot.ci from boot. When I tried to do the same for quantile regression using rq from quantreg, it throws an error. Here is the sample code.
Creating the model
library(car)
library(quantreg)
library(boot)
newdata = Prestige[,c(1:4)]
education.c = scale(newdata$education, center=TRUE, scale=FALSE)
prestige.c = scale(newdata$prestige, center=TRUE, scale=FALSE)
women.c = scale(newdata$women, center=TRUE, scale=FALSE)
new.c.vars = cbind(education.c, prestige.c, women.c)
newdata = cbind(newdata, new.c.vars)
names(newdata)[5:7] = c("education.c", "prestige.c", "women.c" )
mod1 = lm(income ~ education.c + prestige.c + women.c, data=newdata)
mod2 = rq(income ~ education.c + prestige.c + women.c, data=newdata)
Booting linear and quantile regression
mod1.boot <- Boot(mod1, R=999)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
boot.ci(mod2.boot, level = .95, type = "all")
Error in if (ncol(boot.out$t) < max(index)) { :
argument is of length zero
1) Why does boot.ci not work for quantile regression?
2) Using this solution I got from Stack Exchange, I was able to find the quantile (percentile) CI for rq:
t(apply(mod2.boot$B, 2, quantile, c(0.025,0.975)))
How do I obtain the other bootstrap CIs (normal, basic, studentized, BCa)?
3) Also, my boot.ci command for linear regression produces this warning:
Warning message:
In sqrt(tv[, 2L]) : NaNs produced
What does this signify?

Using summary.rq you can calculate bootstrap standard errors of the model coefficients.
Five bootstrap methods (bsmethods) are available (see ?boot.rq).
summary(mod2, se = "boot", bsmethod= "xy")
# Call: rq(formula = income ~ education.c + prestige.c + women.c, data = newdata)
#
# tau: [1] 0.5
#
# Coefficients:
#             Value      Std. Error t value  Pr(>|t|)
# (Intercept) 6542.83599 139.54002  46.88860 0.00000
# education.c  291.57468 117.03314   2.49139 0.01440
# prestige.c    89.68050  22.03406   4.07009 0.00010
# women.c      -48.94856   5.79470  -8.44712 0.00000
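The same call with a different bsmethod gives an alternative set of bootstrap standard errors. A minimal sketch trying a second documented method ("pwy"); the remaining method names are listed in ?boot.rq:
# Parzen-Wei-Ying resampling instead of the default xy-pair resampling
summary(mod2, se = "boot", bsmethod = "pwy")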
To calculate bootstrap confidence intervals, you can use the following trick:
set.seed(1234)
mod1.boot <- Boot(mod1, R=999)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
set.seed(1234)
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
# Create an object with the same structure as mod1.boot
# but with bootstrap replicates given by boot.rq
mod3.boot <- mod1.boot
mod3.boot$R <- 10000
mod3.boot$t0 <- coef(mod2)
mod3.boot$t <- mod2.boot$B
boot.ci(mod3.boot, level = .95, type = "all")
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 10000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = mod3.boot, type = "all", level = 0.95)
#
# Intervals :
# Level      Normal            Basic             Studentized
# 95%   (6293, 6838 )    (6313, 6827 )    (6289, 6941 )
#
# Level      Percentile        BCa
# 95%   (6258, 6772 )    (6275, 6801 )
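Note that boot.ci defaults to index = 1, so the intervals above are for the intercept only. A minimal sketch extending the same trick to every coefficient; studentized intervals are skipped because mod2.boot$B contains only coefficient replicates, not bootstrap variances:
# one set of intervals per coefficient
all.ci <- lapply(seq_along(mod3.boot$t0), function(i)
  boot.ci(mod3.boot, level = .95,
          type = c("norm", "basic", "perc", "bca"), index = i))
names(all.ci) <- names(mod3.boot$t0)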

Thanks to everyone who helped. I was able to figure out the solution myself: I ran a loop calculating the coefficients of the quantile regression and then used boot and boot.ci, respectively. Here is the code (booting commands only; model creation as in the question):
mod3 <- formula(income ~ education.c + prestige.c + women.c)
coefsf <- function(data, ind){
  # refit the median regression on the resampled rows
  rq(mod3, data = data[ind, ])$coef
}
boot.mod <- boot(newdata, coefsf, R = 10000)
myboot.ci <- list()
for (i in 1:ncol(boot.mod$t)){
  myboot.ci[[i]] <- boot.ci(boot.mod, level = .95,
                            type = c("norm", "basic", "perc", "bca"), index = i)
}
I did this as I wanted CIs on all variables, not just the intercept.

Related

Stata vs. R: Delta Method provides different results for relative risk SE's from logit model

I've been trying to estimate the conditional mean treatment effect of covariates in a logit regression (using relative risk) along with their standard errors for inference purposes. The delta method is necessary to calculate the standard errors for these treatment effects. I've been trying to recreate the results of the Stata user-written command adjrr, which calculates the relative risk with standard errors and confidence intervals, in R.
The adjrr command in Stata calculates the adjusted relative risk (the conditional mean treatment effect of interest for my project) and its SEs using the delta method. The deltamethod function in R (msm) should produce the same results; however, this is not the case.
How can I replicate the results from Stata in R?
I used the following self-generated data: (https://migariane.github.io/DeltaMethodEpiTutorial.nb.html).
R code below:
generateData <- function(n, seed){
  set.seed(seed)
  age <- rnorm(n, 65, 5)
  age65p <- ifelse(age >= 65, T, F)
  cmbd <- rbinom(n, size = 1, prob = plogis(1 - 0.05 * age))
  Y <- rbinom(n, size = 1, prob = plogis(1 - 0.02 * age - 0.02 * cmbd))
  data.frame(Y, cmbd, age, age65p)
}
# Describing the data
data <- generateData(n = 1000, seed = 777)
str(data)
logfit <- glm(Y ~ age65p + cmbd, data = data, family = binomial)
summary(logfit)
p1 <- predict(logfit, newdata = data.frame(age65p = T, cmbd = 0), type="response")
p0 <- predict(logfit, newdata = data.frame(age65p = F, cmbd = 0), type="response")
rr <- p1 / p0
rr
# result: 0.8123348
library(msm)
se_rr_delta <- deltamethod( ~(1 + exp(-x1)) / (1 + exp(-x1 -x2)), coef(logfit), vcov(logfit))
se_rr_delta
# result: 0.6314798
Stata Code (using same data):
logit Y i.age65p i.cmbd
adjrr age65p
//results below
R1 = 0.3685 (0.0218) 95% CI (0.3259, 0.4112)
R0 = 0.4524 (0.0222) 95% CI (0.4090, 0.4958)
ARR = 0.8146 (0.0626) 95% CI (0.7006, 0.9471)
ARD = -0.0839 (0.0311) 95% CI (-0.1449, -0.0229)
p-value (R0 = R1): 0.0071
p-value (ln(R1/R0) = 0): 0.0077
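For reference, the delta-method SE can also be computed by hand as sqrt(g' V g), where g is the gradient of the relative risk rr(b) = plogis(b1 + b2)/plogis(b1) and V is the coefficient covariance matrix. A minimal sketch with a numerical gradient (assumes the numDeriv package is available and the coefficient order intercept, age65pTRUE, cmbd from coef(logfit)):
library(numDeriv)  # assumed available, for grad()
rr_fun <- function(b) plogis(b[1] + b[2]) / plogis(b[1])  # RR at cmbd = 0
g <- grad(rr_fun, coef(logfit)[1:2])                      # gradient of the RR
se_rr <- sqrt(t(g) %*% vcov(logfit)[1:2, 1:2] %*% g)      # delta-method SE
This is one way to cross-check which of the two reported SEs the delta method actually implies for this relative risk.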

Error when bootstrapping a beta regression model in R with {betareg}

I need to bootstrap a beta regression model to check its robustness (because of a data point with a large Cook's distance) with the boot package (other suggestions welcome).
I get the following error:
Error in t.star[r, ] <- res[[r]] :
incorrect number of subscripts on matrix
Here's a reproducible example:
library(betareg)
library(boot)
fake_data <- data.frame(diet = as.factor(c(rep("A", 10), rep("B", 10))),
                        fat = c(runif(10, .1, .5), runif(10, .4, .9)))
plot(fat ~ diet, data = fake_data)
my_beta_reg <- function(data, i){
  data_i <- data[i, ]
  mod <- betareg(data_i[, "fat"] ~ data_i[, "diet"])
  return(mod$coef)
}
b = boot(fake_data, statistic = my_beta_reg, R = 50)
Error in t.star[r, ] <- res[[r]] :
incorrect number of subscripts on matrix
What's the issue?
Thanks in advance.
The issue is that mod$coef is a list:
betareg(fat ~ diet, data = fake_data)$coef
#$mean
#(Intercept)       dietB
#  -1.275793    2.490126
#
#$precision
#   (phi)
#20.59014
You need to unlist it or, preferably, use coef(), the function intended for extracting coefficients:
my_beta_reg <- function(data, i){
  mod <- betareg(fat ~ diet, data = data[i, ])
  # unlist(mod$coef) would also work
  coef(mod)
}
b = boot(fake_data, statistic = my_beta_reg, R= 50)
print(b)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = fake_data, statistic = my_beta_reg, R = 50)
#
#
#Bootstrap Statistics :
#      original         bias    std. error
#t1* -1.275793 -0.019847377   0.2003523
#t2*  2.490126  0.009008892   0.2314521
#t3* 20.590142  8.265394485  17.2271497
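With the corrected statistic, confidence intervals follow from boot.ci as usual. A minimal sketch of percentile intervals for each coefficient (index selects the column of b$t; R = 50 is small, so use more replicates in practice):
# percentile CIs; index 1 = mean-model intercept, 2 = dietB, 3 = phi
lapply(1:3, function(i) boot.ci(b, type = "perc", index = i))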

Bootstrap t test in R without reliance on MKinfer package

I would like to get CIs for a paired t-test using the bootstrap in R. Unfortunately, I don't have the privileges to install MKinfer. Using MKinfer I would do it like this (from T-test with bootstrap in R):
boot.t.test(
  x = iris["Petal.Length"],
  y = iris["Sepal.Length"],
  alternative = c("two.sided"),
  mu = 0,
  # paired = TRUE,
  conf.level = 0.95,
  R = 9999
)
How would I do this for paired data, with CIs and p-values, without relying on MKinfer (relying on boot would be fine)?
Here is an example using boot with R = 1000 bootstrap replicates:
library(boot)
x <- iris$Petal.Length
y <- iris$Sepal.Length
change_in_mean <- function(df, indices) t.test(
  df[indices, 1], df[indices, 2], paired = TRUE, var.equal = FALSE)$estimate
model <- boot(
  data = cbind(x, y),
  statistic = change_in_mean,
  R = 1000)
We can calculate the confidence interval of the estimated change in the mean using boot.ci:
boot.ci(model, type = "norm")
#BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#Based on 1000 bootstrap replicates
#
#CALL :
#boot.ci(boot.out = model, type = "norm")
#
#Intervals :
#Level Normal
#95% (-2.262, -1.905 )
#Calculations and Intervals on Original Scale
Note that this is very close to the CI reported by t.test:
t.test(x, y, paired = TRUE, var.equal = FALSE)
#
# Paired t-test
#
#data: x and y
#t = -22.813, df = 149, p-value < 2.2e-16
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -2.265959 -1.904708
#sample estimates:
#mean of the differences
# -2.085333
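The CI comes from boot.ci above; for a p-value without MKinfer, one common approach (not part of the original answer) is to center the bootstrap distribution at the null value mu = 0 and count how often a resampled estimate is at least as extreme as the observed one:
# approximate two-sided bootstrap p-value via the shift method
t_obs <- model$t0
p_boot <- mean(abs(model$t - mean(model$t)) >= abs(t_obs))
p_boot  # with a difference this large, expect essentially 0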

Bootstrapping coefficients of multiple regression on distance matrices (MRM)

I want to estimate frequency distributions of MRM coefficients to generate a 95% CI. Below is the initial code:
library(ecodist)
dat <- data.frame(matrix(rnorm(3*25), ncol = 3))
names(dat) <- c('Pred', 'Var1', 'Var2')
mod <- MRM(dist(Pred) ~ dist(Var1) + dist(Var2), data = dat, nperm = 100)
slopes<-mod$coef
How can I bootstrap the coefficient values?
You can use the boot function from the boot library. I am not familiar with ecodist::MRM, but here is a close-to-copy-paste example from the help page of boot. It shows how to do a non-parametric bootstrap of the coefficient estimates of an lm model and get bias and confidence intervals.
library(boot)
nuke <- nuclear[, c(1, 2, 5, 7, 8, 10, 11)]
nuke.lm <- lm(log(cost) ~ date + log(cap) + ne + ct + log(cum.n) + pt, data = nuke)

nuke.fun <- function(dat, inds){
  lm.b <- lm(log(cost) ~ date + log(cap) + ne + ct + log(cum.n) + pt,
             data = dat[inds, ])
  coef(lm.b)
}

set.seed(45282964)
nuke.boot <- boot(nuke, nuke.fun, R = 999)
nuke.boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = nuke, statistic = nuke.fun, R = 999)
Bootstrap Statistics :
        original         bias    std. error
t1* -13.26031434 -0.482810992    4.93147203
t2*   0.21241460  0.006775883    0.06480161
t3*   0.72340795  0.001842262    0.14160523
t4*   0.24902491 -0.004979272    0.08857604
t5*   0.14039305  0.009209543    0.07253596
t6*  -0.08757642  0.002417516    0.05489876
t7*  -0.22610341  0.006136044    0.12140501
boot.ci(nuke.boot, index = 2)  # pick the covariate index you want
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = nuke.boot, index = 2)
Intervals :
Level      Normal             Basic
95%   ( 0.0786, 0.3326 )  ( 0.0518, 0.3215 )

Level     Percentile          BCa
95%   ( 0.1033, 0.3730 )  ( 0.0982, 0.3688 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(nuke.boot, index = 2) :
bootstrap variances needed for studentized intervals
See Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press, for details of the above output. You should also consider what you want to achieve with the bootstrap and which bootstrap procedure is appropriate.
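Adapting that pattern to MRM might look like the sketch below. This is untested against ecodist and rests on two assumptions: that resampling rows of the raw data (recomputing the distance matrices on each resample) is sensible for your design, and that the first column of MRM's $coef holds the coefficient estimates:
library(boot)
library(ecodist)
mrm_fun <- function(dat, inds){
  d <- dat[inds, ]  # resample observations, then rebuild the distance matrices
  fit <- MRM(dist(Pred) ~ dist(Var1) + dist(Var2), data = d, nperm = 10)
  fit$coef[, 1]  # assumed: first column holds the coefficient estimates
}
mrm.boot <- boot(dat, mrm_fun, R = 999)
boot.ci(mrm.boot, index = 2, type = c("norm", "basic", "perc"))
Keep in mind that resampled rows can repeat, which produces zero distances, and that pairwise distances are not independent, so treat such intervals with caution.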

Bootstrap parameter estimate of non-linear optimization in R: Why is it different than the regular parameter estimate?

Here's the short version of my question. The code is below.
I calculated the parameters for the non-linear von Bertalanffy growth equation in R using optim(), and now I am trying to add 95% confidence intervals to the von B growth coefficient K by bootstrapping. For at least one of the years of data, when I summarize the bootstrapped output of the growth coefficient K, the mean and median parameter estimates from bootstrapping are quite different from the estimated parameter:
summary(temp.store)  # summary of bootstrap values
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# 0.002449 0.005777 0.010290 0.011700 0.016970 0.056720
est.K
# [1] 0.01655956  # point estimate from the optimization
I suspect the discrepancy arises because errors in some of the bootstrap draws bias the result, although I have used try() to stop the optimization from crashing when a combination of input values causes an error. So I would like to know how to fix that issue. I think I'm doing things correctly otherwise, because the fitted curve looks right.
Also, I have run this code on data from other years, and in at least one other year the bootstrap estimate and the regular estimate are very close.
Long-winded version:
The von Bertalanffy growth curve (VBGC) for length is given by:
L(t) = L.inf * [1 - exp(-K*(t-t0))] (Eq. 3.1.0.1, from FAO)
where L(t) is the fish's length, L.inf is the asymptotic maximum length, K is the growth coefficient, t is the time step, and t0 is when growth began. L(t) and t are the observed data. Usually time or age is measured in years, but here I am looking at juvenile fish data, so I have made t the day of the year ("doy"), starting with January 1 = 1.
To estimate the starting parameters for the optimization, I have used a linearization of the VBGC equation.
doy <- c(156,205,228,276,319,380)
len <- c(36,56,60,68,68,71)
data06 <- data.frame(doy,len)
Function to get starting parameters for the optimization:
get.init <- function(dframe){  # linearization of the von B
  l.inf <- 80  # by eyeballing max juvenile fish
  # make a response variable and store it in the data frame:
  # Eqn. 3.3.3.1 in FAO document
  dframe$vonb.y <- -log(1 - (dframe$len)/l.inf)
  lin.vonb <- lm(vonb.y ~ doy, data = dframe)
  icept <- lin.vonb$coef[1]           # intercept (0.01534013 here)
  slope <- k.lin <- lin.vonb$coef[2]  # slope is the K param
  t0 <- -icept/slope  # from this relationship: intercept = -K * t0
  pars <- c(l.inf, as.numeric(slope), as.numeric(t0))
}
Sums of squares for von Bertalanffy growth equation
vbl.ssq <- function(theta, data){
  linf = theta[1]; k = theta[2]; t0 = theta[3]
  # name variables for ease of use
  obs.length = data$len
  age = data$doy
  # von B equation
  pred.length = linf*(1 - exp(-k*(age - t0)))
  # sums of squares
  ssq = sum((obs.length - pred.length)^2)
}
Estimate parameters
#Get starting parameter values
theta_init <- get.init(dframe=data06)
# optimize VBGC by minimizing sums of square differences
len.fit <- optim(par=theta_init, fn=vbl.ssq, method="BFGS", data=data06)
est.linf <- len.fit$par[1] # vonB len-infinite
est.K <- len.fit$par[2] # vonB K
est.t0 <- len.fit$par[3] # vonB t0
Bootstrapping
# set up for bootstrap loop
tmp.frame <- data.frame()
temp.store <- vector()
# bootstrap to get 95% conf ints on growth coef K
for (j in 1:1000){
  # choose indices at random, with replacement
  indices <- sample(1:length(data06[,1]), replace = T)
  # values from original data corresponding to those indices
  new.len <- data06$len[indices]
  new.doy <- data06$doy[indices]
  tmp.frame <- data.frame(new.doy, new.len)
  colnames(tmp.frame) <- c("doy", "len")
  init.par <- get.init(tmp.frame)
  # now get the vonB params for the randomly selected samples
  # using try() to keep optimization errors from crashing the program
  try(len.fit.bs <- optim(par = init.par, fn = vbl.ssq,
                          method = "BFGS", data = tmp.frame))
  tmp.k <- len.fit.bs$par[2]
  temp.store[j] <- tmp.k
}
95% confidence interval for K parameter
k.ci <- quantile(temp.store, c(0.025, 0.975))
#        2.5%       97.5%
# 0.004437702 0.019784178
Here's the problem:
# > summary(temp.store)
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# 0.002449 0.005777 0.010290 0.011700 0.016970 0.056720
#
# est.K
# [1] 0.01655956
Example of error:
Error in optim(par = init.par, fn = vbl.ssq, method = "BFGS", data = tmp.frame) :
non-finite finite-difference value [2]
I don't believe I am making any errors with the optimization because the VBGC fit looks reasonable. Here are the plots:
plot(x = data06$doy, y = data06$len, xlim = c(0, 550), ylim = c(0, 100))
legend(x = "topleft", legend = paste("Length curve 2006"), bty = "n")
curve(est.linf*(1 - exp(-est.K*(x - est.t0))), add = T, type = "l")
plot(x = 2006, y = est.K, main = "von B growth coefficient for length; 95% CIs",
     ylim = c(0, 0.025))
arrows(x0 = 2006, y0 = k.ci[1], x1 = 2006, y1 = k.ci[2], code = 3,
       angle = 90, length = 0.1)
First of all, you have a very small number of values, possibly too few to trust the bootstrap method. Then, a high proportion of fits fail for the classic bootstrap because, due to the resampling, you often don't have enough distinct x values.
Here is an implementation using nls with a self-starting model and the boot package.
doy <- c(156,205,228,276,319,380)
len <- c(36,56,60,68,68,71)
data06 <- data.frame(doy,len)
plot(len ~ doy, data = data06)
fit <- nls(len ~ SSasympOff(doy, Asym, lrc, c0), data = data06)
summary(fit)
# profiling CI
proCI <- confint(fit)
#           2.5%      97.5%
# Asym 68.290477  75.922174
# lrc  -4.453895  -3.779994
# c0   94.777335 126.112523
curve(predict(fit, newdata = data.frame(doy = x)), add = TRUE)
#classic bootstrap
library(boot)
set.seed(42)
boot1 <- boot(data06, function(DF, i) {
  tryCatch(coef(nls(len ~ SSasympOff(doy, Asym, lrc, c0), data = DF[i, ])),
           error = function(e) c(Asym = NA, lrc = NA, c0 = NA))
}, R = 1e3)
#proportion of unsuccessful fits
mean(is.na(boot1$t[, 1]))
#[1] 0.256
#bootstrap CI
boot1CI <- apply(boot1$t, 2, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE)
#          [,1]      [,2]      [,3]
# 2.5%  69.70360 -4.562608  67.60152
# 50%   71.56527 -4.100148 113.9287
# 97.5% 74.79921 -3.697461 151.03541
#bootstrap of the residuals
data06$res <- residuals(fit)
data06$fit <- fitted(fit)
set.seed(42)
boot2 <- boot(data06, function(DF, i) {
  DF$lenboot <- DF$fit + DF[i, "res"]
  tryCatch(coef(nls(lenboot ~ SSasympOff(doy, Asym, lrc, c0), data = DF)),
           error = function(e) c(Asym = NA, lrc = NA, c0 = NA))
}, R = 1e3)
#proportion of unsuccessful fits
mean(is.na(boot2$t[, 1]))
#[1] 0
#(residuals) bootstrap CI
boot2CI <- apply(boot2$t, 2, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE)
#          [,1]      [,2]     [,3]
# 2.5%  70.19380 -4.255165 106.3136
# 50%   71.56527 -4.100148 113.9287
# 97.5% 73.37461 -3.969012 119.2380
CIs_k <- data.frame(lwr = c(exp(proCI[2, 1]),
                            exp(boot1CI[1, 2]),
                            exp(boot2CI[1, 2])),
                    upr = c(exp(proCI[2, 2]),
                            exp(boot1CI[3, 2]),
                            exp(boot2CI[3, 2])),
                    med = c(NA,
                            exp(boot1CI[2, 2]),
                            exp(boot2CI[2, 2])),
                    estimate = exp(coef(fit)[2]),
                    method = c("profile", "boot", "boot res"))
library(ggplot2)
ggplot(CIs_k, aes(y = estimate, ymin = lwr, ymax = upr, x = method)) +
  geom_errorbar() +
  geom_point(aes(color = "estimate"), size = 5) +
  geom_point(aes(y = med, color = "boot median"), size = 5) +
  ylab("k") + xlab("") +
  scale_color_brewer(name = "", type = "qual", palette = 2) +
  theme_bw(base_size = 22)
As you can see, the bootstrap CI is wider than the profile CI, and bootstrapping the residuals results in a narrower estimated CI. All of them are almost symmetric, and the medians are close to the point estimates.
As a first step in investigating what goes wrong in your code, you should look at the proportion of failed fits from your procedure.
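In the loop from the question, that check requires recording failures explicitly: as written, a failed try() leaves len.fit.bs holding the previous successful fit, so failed draws silently re-enter temp.store. A minimal sketch along those lines, reusing get.init() and vbl.ssq() from the question:
temp.store <- rep(NA_real_, 1000)
for (j in 1:1000){
  indices <- sample(nrow(data06), replace = TRUE)
  tmp.frame <- data06[indices, c("doy", "len")]
  # NA on failure instead of reusing the previous fit
  fit.bs <- try(optim(par = get.init(tmp.frame), fn = vbl.ssq,
                      method = "BFGS", data = tmp.frame), silent = TRUE)
  if (!inherits(fit.bs, "try-error")) temp.store[j] <- fit.bs$par[2]
}
mean(is.na(temp.store))                              # proportion of failed fits
quantile(temp.store, c(0.025, 0.975), na.rm = TRUE)  # CI from successful fits only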
