Standard error and bias zero with bootstrapping - r

I want to take my dataset bodyfat_trimmed and use bootstrapping to retrieve the mean and the standard errors. However, I seem to be using the same data all the time and therefore get zero standard error and bias. How can I solve this?
bsfunc <- function(data) {
set.seed(1)
x <- model.matrix(reduced_BIC_fit)[, -1]
y <- data$density
bootdata <- sample(1:nrow(x), nrow(x)/2)
x.train <- x[bootdata, ]
y.train <- y[bootdata]
bootframe <- data.frame(bodyfat_trimmed[train, ])
fit <- lm(density ~ age + abdomen + wrist, data = bootframe)
stats <- coef(summary(fit))[, "Estimate"]
return(stats)}
strap <- boot(data = bodyfat_trimmed, sim = "parametric", statistic = bsfunc, R=1000)
strap
Output:
PARAMETRIC BOOTSTRAP
Call:
boot(data = bodyfat_trimmed, statistic = bsfunc, R = 1000, sim = "parametric")
Bootstrap Statistics :
original bias std. error
t1* 1.1360858253 0 0
t2* -0.0000889957 0 0
t3* -0.0018446625 0 0
t4* 0.0050609837 0 0

If the seed is within the function the sample function will be somewhat repetitive.
bsfunc<-function(){set.seed(1); sample(1:10,1)}
bsfunc()
[1] 3
bsfunc()
[1] 3
bsfunc()
[1] 3
PS
Your bsfunc is also misconceived. As written, train (i.e. bootframe <- data.frame(bodyfat_trimmed[train, ])) doesn't come from within this function. And normally the whole point of boot is to do the bootstrap resampling. whilst bsfunc should just be a straight statistic.

Related

How to perform bootstrapping for estimation and inference of quantile regression using multiply imputed data in R?

I am trying to manually pool results from quantile regression models run on multiply imputed data in R using mice. I make use of a bootstrapping procedure to get 95% CIs and P values of the model terms, in which model parameters and their standard errors are obtained after sampling a certain number of rows that is equal to the unique number of participants in my data set. This procedure is repeated 500 times for each of the m imputed data sets. Then, as a last step, I pool the estimated coefficients and their standard errors of the resulting 500 * m regression models according to Rubin's rules (1987) (see e.g. https://bookdown.org/mwheymans/bookmi/rubins-rules.html). To speed things up, I use foreach to split up the analyses over multiple processor cores and for to loop over the m imputed data sets.
However, there seems to be a flaw in the part wherein the results are pooled. When I look at the pooled results, I observe that the P values are not in accordance with the 95% CIs (e.g. P < 0.05 when 0 is included in the 95% CI).
To illustrate this issue, I have made a reproducible example, using these publicly available data: https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv
Because there are no missing data in this data set, I introduce them myself and impute the data (m = 10 multiply imputed data sets with 20 iterations). I use set.seed for reproducibility.
Note that I use lm instead of quantreg::rq in this example.
# load data
projdir <- "my_directory"
d <- read.csv(file = file.path(projdir, 'heart_failure_clinical_records_dataset.csv'))
#### introduce missing values
set.seed(1)
# age
age_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$age[age_miss_tag == 1] <- NA # MCAR
# serum creatinine
creat_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$serum_creatinine[creat_miss_tag == 1 & d$anaemia == 0] <- NA # MAR
# CK
CK_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$creatinine_phosphokinase[CK_miss_tag & d$platelets > median(d$platelets)] <- NA # MAR
# platelets
platelets_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$platelets[platelets_miss_tag == 1] <- NA # MCAR
library(mice); library(mitml); library(miceadds); library(splines); library(foreach); library(doParallel)
# impute data
imp <- mice(d, maxit = 20, m = 10, seed = 2)
# log creatinine
implong <- complete(imp, 'long', include = FALSE)
implong$log_creat <- log(implong$serum_creatinine)
imp <- miceadds::datlist2mids(split(implong, implong$.imp))
# compute values for Boundary.knots
temp <- complete(imp, 'long', include = FALSE)
B_knots <- rowMeans(sapply(split(temp, temp$.imp), function(x) {
quantile(x$age, c(0.1, 0.9))
}))
# Convert mids object into a datlist
longlist <- miceadds::mids2datlist(imp)
# fit model based on origial data and use the terms in the below foreach loop
# in order to fix the position of the inner knots
fit_orig <- lm(log_creat ~
# Main effects
ns(age, df = 2, B = c(B_knots[1], B_knots[2])) * sex,
data = longlist[[1]])
To further speed things up, I use OLS instead of quantile regression here and parallelize the process.
# make cluster used in foreach
cores_2_use <- detectCores() - 1
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, iseed = 9956)
registerDoParallel(cl)
# No. of bootstrap samples to be taken
n_iter <- 500
boot.1 <- c()
for(k in seq_along(longlist)){
boot.1[[k]] <- foreach(i = seq_len(n_iter),
.combine = rbind,
.packages = c('mice', 'mitml', 'splines')) %dopar% {
# store data from which rows can be samples
longlist0 <- longlist[[k]]
# set seed for reproducibility
set.seed(i)
# sample rows
boot_dat <- longlist0[sample(1:nrow(longlist0), replace = TRUE), ]
# linear regression model based on sampled rows
fit1 <- lm(terms(fit_orig), data = boot_dat)
# save coefficients
fit1$coef
}
}
stopCluster(cl)
As a last step, I pool the results according to Rubin's rules.
n_cols <- dim(boot.1[[1]])[2]
list <- c()
for(i in seq_len(n_cols)) {
# extract coefficients
parameter <- lapply(boot.1, function(x){
x[,i]
})
m <- length(parameter)
for(k in seq_len(m)) {
names(parameter[[k]]) <- NULL
}
Q <- sapply(parameter, mean)
U <- sapply(parameter, var) # (standard error of estimate)^2
#### Pooling
# Pooled univariate estimate
qbar <- mean(Q)
# Mean of the variances (i.e. the pooled within-imputation variance)
ubar <- mean(U)
# Between-imputation variance
btw_var <- var(Q)
# Total variance of the pooled estimated
tot_var <- ubar + btw_var + (btw_var / m)
# Relative increase in variance due to non-response
r_var <- (btw_var + (btw_var / m)) / ubar
# Fraction of missing information
lambda <- (btw_var + (btw_var / m)) / tot_var
# degrees of freedom for the t-distribution according to Rubin (1987)
df_old <- (m - 1) / lambda^2
# sample size in the imputed data sets
n_sample <- nrow(longlist[[1]])
# observed degrees of freedom
df_observed <- (((n_sample - n_cols) + 1) / ((n_sample - n_cols) + 3)) *
(n_sample - n_cols) * (1 - lambda)
# adjusted degrees of freedom according to Barnard & Rubin (1999)
df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
# 95% confidence interval of qbar
lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
# F statistic
q <- ((0 - qbar)^2 / tot_var)^2
# Significance level associated with the null value Q[0]
p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
list[[i]] <- cbind(qbar, lwr, upr, p_value)
}
names(list) <- colnames(boot.1[[1]])
list
Obviously, the P value shown below is not in accordance with the 95% CI (as 0 is included in the CI, so the P value should be ≥0.05).
> list
$`(Intercept)`
qbar lwr upr p_value
[1,] 0.06984595 -0.02210231 0.1617942 0.008828337
EDIT (29 Dec 2021)
As #Gerko Vink notes in his answer, multiple imputation and bootstrapping both induce variance. The variance induced by imputation is taken care of by Rubin's rules, the bootstrap variance is not. Unfortunately, mice::pool will not work with the output returned by quantreg::rq.
I am aware of constructing bootstrap CIs based on a naive percentile-based approach as shown in this post, but I am inclined to think this is not the correct approach to proceed with.
Does anyone know how to appropriately take care of the extra variance induced by bootstrapping when using rq?
EDIT (30 Dec 2021)
Inspired by this recent post, I decided not to hit the road of bootstrapping anymore, but instead manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules. I have posted this approach as answer below. Any input on how to appropriately take care of the extra variance induced by bootstrapping when using rq is still very welcome though!
Bootstrapping and multiple imputation both induce variance. The imputation variance is taken care of by Rubin's rules for parameters with normal sampling distributions. The bootstrap variance is not.
Two remarks:
First, there is a small error in your code. You're calculating the bootstrap variance about Q in U <- sapply(parameter, var). No need for U <- U/n_iter. U is already the variance and sapply(parameter, sd) would yield the bootstrapped standard error.
Second, you're using bootstrap parameters to calculate a parametric interval and p-value. That seems needlessly complicated and, as you can see, potentially problematic. Why not calculate the bootstrap CI?
See also this link for some inspiration with respect to different means of calculating the CIs and their respective validity.
A small sim that demonstrates that you cannot expect both to be identical for a finite set of bootstrap replications.
library(purrr)
library(magrittr)
#fix seed
set.seed(123)
#some data
n = 1000
d <- rnorm(n, 0, 1)
# ci function
fun <- function(x){
se <- var(x)/length(x)
lwr <- mean(x) - 1.96 * se
upr <- mean(x) + 1.96 * se
ci <- c(lwr, upr)
return(ci)
}
# bootstrap
boot <- replicate(500,
d[sample(1:1000, 1000, replace = TRUE)],
simplify = FALSE)
# bootstrapped ci's based on parameters
boot.param.ci <- boot %>%
map(~.x %>% fun) %>%
do.call("rbind", args = .)
# bootstrap CI
boot.ci <- boot %>%
map(~.x %>% mean) %>%
unlist %>%
quantile(c(.025, .975))
# Overview
data.frame(param = fun(d),
boot.param = boot.param.ci %>% colMeans,
boot.ci = boot.ci)
#> param boot.param boot.ci
#> 2.5% 0.01420029 0.01517527 -0.05035913
#> 97.5% 0.01805545 0.01904181 0.07245449
Created on 2021-12-22 by the reprex package (v2.0.1)
The following reprex also demonstrates that the bootstrap applied to the imputed data yields different variance estimates under the same pooling rules.
library(purrr)
library(magrittr)
library(mice)
#fix seed
set.seed(123)
imp <- mice(boys,
m = 10,
printFlag = FALSE)
imp %>%
complete("all") %>%
map(~.x %$%
lm(age ~ hgt + hc)) %>%
pool %>%
summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9601179 0.809167659 -2.422388 682.5182 0.01567825 -3.5488747
#> 2 hgt 0.1690468 0.002784939 60.700342 572.1861 0.00000000 0.1635768
#> 3 hc -0.2138941 0.021843724 -9.792018 639.0432 0.00000000 -0.2567883
#> 97.5 %
#> 1 -0.3713610
#> 2 0.1745167
#> 3 -0.1710000
imp %>%
complete("all") %>%
map(~.x %>%
.[sample(1:748, 748, replace = TRUE), ] %$%
lm(age ~ hgt + hc)) %>%
pool %>%
summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9810146 1.253312293 -1.580623 22.57546 1.278746e-01 -4.5763892
#> 2 hgt 0.1689181 0.004124538 40.954423 24.47123 0.000000e+00 0.1604141
#> 3 hc -0.2133606 0.033793045 -6.313743 22.29686 2.217445e-06 -0.2833890
#> 97.5 %
#> 1 0.6143599
#> 2 0.1774221
#> 3 -0.1433322
Created on 2021-12-22 by the reprex package (v2.0.1)
For quantile regression, mice::pool will not work with the output returned by quantreg::rq, because (according to this post) there is no agreed upon method to calculate standard errors, which are required to pool results under multiple imputation.
An ad hoc solution would be to manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules.
First, a reprex using lm to verify whether results from the manual approach and mice::pool are identical.
library(mice)
imp <- mice(nhanes, print = FALSE, seed = 123)
# fit linear model
fit <- with(imp, lm(bmi ~ chl + hyp))
# manually pool univariate estimates using Rubin's rules
pool_manual <- \(model_object) {
m <- length(model_object$analyses)
Q <- sapply(model_object$analyses, \(x) summary(x)$coefficients[, 'Estimate'])
U <- sapply(model_object$analyses, \(x) (summary(x)$coefficients[, 'Std. Error'])^2)
qbar <- rowMeans(Q)
ubar <- rowMeans(U)
btw_var <- apply(Q, 1, var)
tot_var <- ubar + btw_var + (btw_var / m)
lambda <- (btw_var + (btw_var / m)) / tot_var
df_old <- (m - 1) / lambda^2
n_sample <- length(residuals(model_object$analyses[[1]]))
n_cols <- dim(Q)[1]
df_com <- n_sample - n_cols
df_observed <- ((df_com + 1) / (df_com + 3)) * df_com * (1 - lambda)
df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
q <- (0 - qbar)^2 / tot_var
p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
df <- data.frame(noquote(rownames(Q)), qbar, lwr, upr, p_value)
rownames(df) <- NULL
names(df) <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
return(df)
}
Verify.
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
> extract <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
> summary(pool(fit), conf.int = TRUE)[, extract]
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
Quantile regression
Now, let's pool results from rq for the expected median of the outcome.
library(quantreg)
# fit quantile regression model
fit <- with(imp, rq(bmi ~ chl + hyp, tau = 0.5))
To be able to pool results from rq, only the summary method used to extract point estimates and variances from each of the imputed data sets needs to be adjusted in pool_manual.
Q <- sapply(model_object$analyses, \(x) summary.rq(x, covariance = TRUE)$coefficients[, 'Value'])
U <- sapply(model_object$analyses, \(x) (summary.rq(x, covariance = TRUE)$coefficients[, 'Std. Error'])^2)
Result
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 22.23452856 0.8551626 43.6138945 0.04461337
2 chl 0.03487894 -0.0857199 0.1554778 0.47022312
3 hyp -1.43636147 -6.0666990 3.1939761 0.52455041
> summary(pool(fit), conf.int = TRUE)[, extract]
Error in rq.fit.br(x, y, tau = tau, ci = TRUE, ...) :
unused arguments (effects = "fixed", parametric = TRUE, exponentiate = FALSE)

Error when Bootstraping a Beta regression model in R with {betareg}

I need to bootstrap a beta regression model to check its robustness - because of a data point with a large cook's distance - with the boot package (other suggestions welcomed).
I have the following error:
Error in t.star[r, ] <- res[[r]] :
incorrect number of subscripts on matrix
Here's a reproductible example:
library(betareg)
library(boot)
fake_data <- data.frame(diet = as.factor(c(rep("A",10),rep("B",10))),
fat = c(runif(10,.1,.5),runif(10,.4,.9)) )
plot(fat~diet, data = fake_data)
my_beta_reg <- function(data,i){
data_i <- data[i,]
mod <- betareg(data_i[,"fat"] ~ data_i[,"diet"])
return(mod$coef)
}
b = boot(fake_data, statistic = my_beta_reg, R= 50)
Error in t.star[r, ] <- res[[r]] :
incorrect number of subscripts on matrix
What's the issue?
Thanks in advance.
The issue is that mod$coef is a list:
betareg(fat ~ diet, data = fake_data)$coef
#$mean
#(Intercept) dietB
# -1.275793 2.490126
#
#$precision
# (phi)
#20.59014
You need to unlist it or preferably use the function you are supposed to use for extraction of coefficients:
my_beta_reg <- function(data,i){
mod <- betareg(fat ~ diet, data = data[i,])
#unlist(mod$coef)
coef(mod)
}
b = boot(fake_data, statistic = my_beta_reg, R= 50)
print(b)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = fake_data, statistic = my_beta_reg, R = 50)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* -1.275793 -0.019847377 0.2003523
#t2* 2.490126 0.009008892 0.2314521
#t3* 20.590142 8.265394485 17.2271497

Confidence interval for quantile regression using bootstrap

I am trying to get the five types of bootstrap intervals for linear and quantile regression. I was able to bootstrap and find the 5 boostrap intervals (Quantile,Normal,Basic,Studentized and BCa) for the linear regression using Boot from car and boot.ci from boot. When i tried to do the same for quantile regression using rq from quantreg, it throws up an error. Here is the sample code
Creating the model
library(car)
library(quantreg)
library(boot)
newdata = Prestige[,c(1:4)]
education.c = scale(newdata$education, center=TRUE, scale=FALSE)
prestige.c = scale(newdata$prestige, center=TRUE, scale=FALSE)
women.c = scale(newdata$women, center=TRUE, scale=FALSE)
new.c.vars = cbind(education.c, prestige.c, women.c)
newdata = cbind(newdata, new.c.vars)
names(newdata)[5:7] = c("education.c", "prestige.c", "women.c" )
mod1 = lm(income ~ education.c + prestige.c + women.c, data=newdata)
mod2 = rq(income ~ education.c + prestige.c + women.c, data=newdata)
Booting linear and quantile regression
mod1.boot <- Boot(mod1, R=999)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
boot.ci(mod2.boot, level = .95, type = "all")
Error in if (ncol(boot.out$t) < max(index)) { :
argument is of length zero
1) Why does boot.ci not work for quantile regression
2)Using this solution I got from stackexchange, I was able to find the quantile CI.
Solution for quantile(percentile CI) for rq
t(apply(mod2.boot$B, 2, quantile, c(0.025,0.975)))
how do i obtain other CI for bootstrap (normal, basic, studentized, BCa).
3) Also, my boot.ci command for linear regression produces this warning
Warning message:
In sqrt(tv[, 2L]) : NaNs produced
What does this signify?
Using summary.rq you can calculate boostrap standard errors of model coefficients.
Five boostrap methods (bsmethods) are available (see ?boot.rq).
summary(mod2, se = "boot", bsmethod= "xy")
# Call: rq(formula = income ~ education.c + prestige.c + women.c, data = newdata)
#
# tau: [1] 0.5
#
# Coefficients:
# Value Std. Error t value Pr(>|t|)
# (Intercept) 6542.83599 139.54002 46.88860 0.00000
# education.c 291.57468 117.03314 2.49139 0.01440
# prestige.c 89.68050 22.03406 4.07009 0.00010
# women.c -48.94856 5.79470 -8.44712 0.00000
To calculate bootstrap confidence intervals, you can use the following trick:
mod1.boot <- Boot(mod1, R=999)
set.seed(1234)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
set.seed(1234)
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
# Create an object with the same structure of mod1.boot
# but with boostrap replicates given by boot.rq
mod3.boot <- mod1.boot
mod3.boot$R <- 10000
mod3.boot$t0 <- coef(mod2)
mod3.boot$t <- mod2.boot$B
boot.ci(mod3.boot, level = .95, type = "all")
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 10000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = mod3.boot, type = "all", level = 0.95)
#
# Intervals :
# Level Normal Basic Studentized
# 95% (6293, 6838 ) (6313, 6827 ) (6289, 6941 )
#
# Level Percentile BCa
# 95% (6258, 6772 ) (6275, 6801 )
Thanks for everyone who helped. I was able to figure out the solution myself. I ran a loop calculating the coefficients of the quantile regression and then used boot and boot.ci respectively. Here is the code
Booting commands only, model creation from question
mod3 <- formula(income ~ education.c + prestige.c + women.c)
coefsf <- function(data,ind){
rq(mod3, data=newdata[ind,])$coef
}
boot.mod <- boot(newdata,coefsf,R=10000)
myboot.ci <- list()
for (i in 1:ncol(boot.mod$t)){
myboot.ci[[i]] <- boot.ci(boot.mod, level = .95, type =
c("norm","basic","perc", "bca"),index = i)
}
I did this as I wanted CI on all variables not just the intercept.

Explanation about difference in original and bootstrap value

I have the following function; (1) to calculate the deviance difference for each variable I have and (2) to do bootstrap for the deviance difference that I calculated in the first step
set.seed(1001)
xfunction <- function(d,i)
{
glm.snp1 <- glm(PHENOTYPE~d[i], family="binomial", data=training1)
null <- glm.snp1$null.deviance
residual <- glm.snp1$deviance
dDeviance <- null-residual
return(dDeviance)
}
myboot <- function(d)
{
boot(d,xfunction, R=1000)
}
result <- lapply(training1,function(x)myboot(x))
So basically from the result I got the values for original dDeviance (without bootstrap) and I can calculate the mean(dDeviance) from the bootstrap. I need help in explaining why the original and the mean bootstrap values are too different? For example for one of the variable, the original value of dDeviance is 0.024 while the bootstrap mean of dDeviance is 0.000412.
As pointed out in the comments, it's is better to subset the data.frame indices, to make it compatible with boot. If you want to iterate through different variables, and apply this function, it's better to do it within the iterate.
For every bootstrap, you fit the different models and extract the values.
So we re-write the function, and it takes in an extra formula parameter, indep_var which specifies the columns to regress PHENOTYPE against:
xfunction <- function(d,ind,indep_var){
dDeviance = sapply(indep_var,function(V){
f = reformulate(V,response="PHENOTYPE")
glm.snp1 <- glm(f, family="binomial", data=d[ind,])
glm.snp1$null.deviance-glm.snp1$deviance
})
return(dDeviance)
}
We can set up an example dataset:
set.seed(111)
training1 = data.frame(PHENOTYPE=rbinom(100,1,0.5),matrix(rnorm(500),ncol=5))
head(training1)
PHENOTYPE X1 X2 X3 X4 X5
1 1 0.1916634 -0.09152026 -0.9685094 -1.0503824 -0.9137690
2 1 1.5525443 -1.87430581 0.9119331 0.3251424 0.1126909
3 0 0.9142423 -0.66416202 0.0928326 -2.1048716 -2.3597249
4 1 0.3586254 0.20341282 -0.5329309 -0.9551027 -1.5693983
5 0 0.1750956 -2.59444339 -1.6669055 -0.5306399 1.2274569
6 0 -0.8472678 -0.09373393 -0.5743455 0.8274405 0.7620480
And try this on data:
xfunction(training1,1:nrow(training1),indep_var=c("X1","X2","X3"))
X1 X2 X3
0.1189847 0.2512539 0.2684927
Then using bootstrap:
library(boot)
boot(training1,xfunction,R=1000,indep_var=c("X1","X2","X3"))
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = training1, statistic = xfunction, R = 1000, indep_var = c("X1",
"X2", "X3"))
Bootstrap Statistics :
original bias std. error
t1* 0.1189847 1.033286 1.564971
t2* 0.2512539 1.047503 1.863012
t3* 0.2684927 1.005062 1.747959

Block bootstrap from subject list

I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.
I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.
Each firm can potentially be selected multiple times, so I need to include it data multiple times in each iteration's data set.
Using a loop and subset approach, like below, seems computationally burdensome.
Note that for my real data frame, n, and the number iterations is much larger than the example below.
My thoughts initially are to break the existing data frame into a list by subject using the split() command. From there, use
sample(unique(df1$subject),n,replace=TRUE)
to get the new list, then perhaps implement quickdf from the plyr package to construct a new data frame.
Example slow code:
require(plm)
data("Grunfeld", package="plm")
firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()
for(j in 1:iterations){
v = sample(length(firms),n,replace=TRUE)
newdata = NULL
for(i in 1:n){
newdata = rbind(newdata,subset(Grunfeld, firm == v[i]))
}
reg1 = lm(value ~ inv + capital, data = newdata)
mybootresults[[j]] = coefficients(reg1)
}
mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults
(Intercept) inv capital
1 373.8591 6.981309 -0.9801547
2 370.6743 6.633642 -1.4526338
3 528.8436 6.960226 -1.1597901
4 331.6979 6.239426 -1.0349230
5 507.7339 8.924227 -2.8661479
...
...
How about something like this:
myfit <- function(x, i) {
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
firms <- unique(Grunfeld$firm)
b0 <- boot(firms, myfit, 999)
You can also use the tsboot function in the boot package with fixed block resampling scheme.
require(plm)
require(boot)
data(Grunfeld)
### each firm is of length 20
table(Grunfeld$firm)
## 1 2 3 4 5 6 7 8 9 10
## 20 20 20 20 20 20 20 20 20 20
blockboot <- function(data)
{
coefficients(lm(value ~ inv + capital, data = data))
}
### fixed length (every 20 obs, so for each different firm) block bootstrap
set.seed(321)
boot.1 <- tsboot(Grunfeld, blockboot, R = 99, l = 20, sim = "fixed")
boot.1
## Bootstrap Statistics :
## original bias std. error
## t1* 410.81557 -25.785972 174.3766
## t2* 5.75981 0.451810 2.0261
## t3* -0.61527 0.065322 0.6330
dim(boot.1$t)
## [1] 99 3
head(boot.1$t)
## [,1] [,2] [,3]
## [1,] 522.11 7.2342 -1.453204
## [2,] 626.88 4.6283 0.031324
## [3,] 479.74 3.2531 0.637298
## [4,] 557.79 4.5284 0.161462
## [5,] 568.72 5.4613 -0.875126
## [6,] 379.04 7.0707 -1.092860
Here is a method that should typically be faster than the accepted answer, returns the same results and does not rely on additional packages (except boot). The key here is to use which and integer indexing to construct each data.frame replicate rather than split/subset and do.call/rbind.
# get function for boot
myIndex <- function(x, i) {
# select the observations to subset. Likely repeated observations
blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
# run regression for given replicate, return estimated coefficients
coefficients(lm(value~ inv + capital, data=Grunfeld[blockObs,]))
}
now, bootstrap
# get result
library(boot)
set.seed(1234)
b1 <- boot(firms, myIndex, 200)
Run the accepted answer
set.seed(1234)
b0 <- boot(firms, myfit, 200)
Let's eyeball a comparison
using indexing
b1
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myIndex, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
Original version
b0
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myfit, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
These look pretty close. Now, a bit more checking
identical(b0$t, b1$t)
[1] TRUE
and
identical(summary(b0), summary(b1))
[1] TRUE
Finally, we'll do a quick benchmark
library(microbenchmark)
microbenchmark(index={b1 <- boot(firms, myIndex, 200)},
rbind={b0 <- boot(firms, myfit, 200)})
On my computer, this returns
Unit: milliseconds
expr min lq mean median uq max neval
index 292.5770 296.3426 303.5444 298.4836 301.1119 395.1866 100
rbind 712.1616 720.0428 729.6644 724.0777 731.0697 833.5759 100
So, direct indexing is more than 2 times faster at every level of the distribution.
note on missing fixed effects
As with most of the answers, the issue of missing "fixed effects" may emerge. Commonly, fixed effects are used as controls and the researcher is interested in one or a couple of variables that will be included with every selected observation. In this dominant case, there is no (or very little) harm in restricting the returned result of the myIndex or myfit function to only include the variables of interest in the returned vector.
The solution needs to be modified to manage fixed effects.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
## Get the Grunfeld firm data (10 firms, each for 20 years, 1935-1954)
data("Grunfeld", package="plm")
## Create dataframe with unique firm identifier (one line per firm)
firms <- data.frame(firm=unique(Grunfeld$firm),junk=1)
## for boot(), X is the firms dataframe; i index the sampled firms
myfit <- function(X, i) {
## join the sampled firms to their firm-year data
mydata <- left_join(X[i,], Grunfeld, by="firm")
## Distinguish between multiple resamples of the same firm
## Otherwise they have the same id in the fixed effects regression
## And trouble ensues
mydata <- mutate(group_by(mydata,firm,year),
firm_uniq4boot = paste(firm,"+",row_number())
)
## Run regression with and without firm fixed effects
c(coefficients(lm(value ~ inv + capital, data = mydata)),
coefficients(lm(value ~ inv + capital + factor(firm_uniq4boot), data = mydata)))
}
set.seed(1)
system.time(b <- boot(firms, myfit, 1000))
summary(b)
summary(lm(value ~ inv + capital, data=Grunfeld))
summary(lm(value ~ inv + capital + factor(firm), data=Grunfeld))
I found a method using dplyr::left_join that is a bit more concise, only takes about 60% as long, and gives the same results as in the answer by Sean. Here's a complete self-contained example.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
# First get the data
data("Grunfeld", package="plm")
firms <- unique(Grunfeld$firm)
myfit1 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
myfit2 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- left_join(data.frame(firm=x[i]), Grunfeld, by="firm")
coefficients(lm(value ~ inv + capital, data = mydata))
}
# rbind method
set.seed(1)
system.time(b1 <- boot(firms, myfit1, 5000))
## user system elapsed
## 13.51 0.01 13.62
# left_join method
set.seed(1)
system.time(b2 <- boot(firms, myfit2, 5000))
## user system elapsed
## 8.16 0.02 8.26
b1
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191
b2
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191

Resources