Explaining the difference between original and bootstrap values in R

I have the following functions: (1) to calculate the deviance difference for each variable I have, and (2) to bootstrap the deviance difference calculated in the first step.
library(boot)
set.seed(1001)
xfunction <- function(d, i) {
  glm.snp1 <- glm(PHENOTYPE ~ d[i], family = "binomial", data = training1)
  null <- glm.snp1$null.deviance
  residual <- glm.snp1$deviance
  dDeviance <- null - residual
  return(dDeviance)
}
myboot <- function(d) {
  boot(d, xfunction, R = 1000)
}
result <- lapply(training1, function(x) myboot(x))
So from the result I get the original dDeviance values (without bootstrap), and I can calculate mean(dDeviance) from the bootstrap. I need help explaining why the original and the mean bootstrap values are so different. For example, for one of the variables the original dDeviance is 0.024, while the bootstrap mean of dDeviance is 0.000412.

As pointed out in the comments, it is better to subset the data.frame by row indices, which is what boot expects. In your version, boot resamples only the predictor vector (d[i]) while PHENOTYPE inside training1 stays in its original order, so every replicate fits a model in which the predictor-outcome pairing is broken; the bootstrapped dDeviance therefore no longer estimates the same quantity as the original value, which is why the two numbers are so different.
If you want to iterate through different variables and apply this function, it is better to do the iteration inside the statistic: for every bootstrap replicate, you fit the different models and extract the values.
So we rewrite the function to take an extra parameter, indep_var, which specifies the columns to regress PHENOTYPE against:
xfunction <- function(d, ind, indep_var) {
  dDeviance <- sapply(indep_var, function(V) {
    f <- reformulate(V, response = "PHENOTYPE")
    glm.snp1 <- glm(f, family = "binomial", data = d[ind, ])
    glm.snp1$null.deviance - glm.snp1$deviance
  })
  return(dDeviance)
}
We can set up an example dataset:
set.seed(111)
training1 = data.frame(PHENOTYPE=rbinom(100,1,0.5),matrix(rnorm(500),ncol=5))
head(training1)
PHENOTYPE X1 X2 X3 X4 X5
1 1 0.1916634 -0.09152026 -0.9685094 -1.0503824 -0.9137690
2 1 1.5525443 -1.87430581 0.9119331 0.3251424 0.1126909
3 0 0.9142423 -0.66416202 0.0928326 -2.1048716 -2.3597249
4 1 0.3586254 0.20341282 -0.5329309 -0.9551027 -1.5693983
5 0 0.1750956 -2.59444339 -1.6669055 -0.5306399 1.2274569
6 0 -0.8472678 -0.09373393 -0.5743455 0.8274405 0.7620480
And try it on the data:
xfunction(training1,1:nrow(training1),indep_var=c("X1","X2","X3"))
X1 X2 X3
0.1189847 0.2512539 0.2684927
Then using bootstrap:
library(boot)
boot(training1,xfunction,R=1000,indep_var=c("X1","X2","X3"))
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = training1, statistic = xfunction, R = 1000, indep_var = c("X1",
"X2", "X3"))
Bootstrap Statistics :
original bias std. error
t1* 0.1189847 1.033286 1.564971
t2* 0.2512539 1.047503 1.863012
t3* 0.2684927 1.005062 1.747959
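From the returned boot object you can also pull out both the original statistics and their bootstrap means for comparison. A small sketch, assuming the call above is saved to an object (res is a hypothetical name):
res <- boot(training1, xfunction, R = 1000, indep_var = c("X1", "X2", "X3"))
res$t0           # original statistics
colMeans(res$t)  # bootstrap means of the statistics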

How to bootstrap correlation using vectorised function applied to large matrix?

I understand how to bootstrap using the "boot" package in R, through the PDF for the package and also from these two examples on Stack, Bootstrapped correlation with more than 2 variables in R and Bootstrapped p-value for a correlation coefficient on R.
However, this is for small datasets (2 variables or a matrix with 5 variables). I have a very large matrix (1000+ columns), and the code I use to compute the correlation between every metabolite pair (removing duplicates and correlations of a metabolite with itself) is:
library(dplyr)  # for %>% and mutate
x <- colnames(dat)
GetCor = function(x, y) cor(dat[, x], dat[, y], method = "spearman")
GetCor = Vectorize(GetCor)
out <- data.frame(t(combn(x, 2)), stringsAsFactors = F) %>%
  mutate(v = GetCor(X1, X2))
I'm not sure how I can then alter this into the function I pass as statistic to boot, i.e.
boot_res <- boot(dat, ?, R=1000)
Or would I just need to obtain a matrix of the bootstrapped p-values or estimates, depending on the function code (colMeans(boot_res$t)), and get rid of the upper or lower triangle?
I'm curious to know the most efficient way of going about the problem.
Something like this? It follows more or less the same lines as my answer to the 2nd question you link to.
Note that I have simplified the correlation code: cor accepts a data.frame or a matrix, so pass a two-column one and keep one of the off-diagonal elements of the correlation matrix.
library(boot)
bootPairwiseCor <- function(data, i) {
  d <- data[i,]  # rows resampled by boot
  # one Spearman correlation per pair of columns
  combn(d, 2, \(x) cor(x, method="spearman")[1,2])
}
dat <- iris[-5]
nms <- combn(colnames(dat), 2, paste, collapse = "_")
R <- 100L
b <- boot(dat, bootPairwiseCor, R)
b
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = dat, statistic = bootPairwiseCor, R = R)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* -0.1667777 0.0037142908 0.070552718
#> t2* 0.8818981 -0.0002851683 0.017783297
#> t3* 0.8342888 0.0006306610 0.021509280
#> t4* -0.3096351 0.0047809612 0.075976067
#> t5* -0.2890317 0.0045689001 0.069929108
#> t6* 0.9376668 -0.0014838117 0.009632318
data.frame(variables = nms, correlations = colMeans(b$t))
#> variables correlations
#> 1 Sepal.Length_Sepal.Width -0.1630634
#> 2 Sepal.Length_Petal.Length 0.8816130
#> 3 Sepal.Length_Petal.Width 0.8349194
#> 4 Sepal.Width_Petal.Length -0.3048541
#> 5 Sepal.Width_Petal.Width -0.2844628
#> 6 Petal.Length_Petal.Width 0.9361830
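If you also want interval estimates for each pair, boot.ci can be applied to the same object one statistic at a time. A sketch; type = "perc" is just one common choice:
boot.ci(b, type = "perc", index = 1)  # percentile CI for the first column pair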
You may want to use cor.test to get theoretical t-values. We will use them for comparison with the B bootstrap t-values. (Recall: The p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.)
Here is a similar function to yours, but applying cor.test and extracting statistics.
corr_cmb <- \(X, boot=FALSE) {
  stts <- c('estimate', 'statistic', 'p.value')
  cmbn <- combn(colnames(X), 2, simplify=FALSE)
  a <- lapply(cmbn, \(x) as.data.frame(cor.test(X[, x[1]], X[, x[2]])[stts])) |>
    do.call(what=rbind) |>
    `rownames<-`(sapply(cmbn, paste, collapse=':'))
  if (boot) {
    a <- a[, 'statistic']
  }
  a
}
We run it once on the data to get the theoretical solution.
rhat <- corr_cmb(dat)
head(rhat, 3)
# estimate statistic p.value
# V1:V2 0.06780426 2.1469547 0.03203729
# V1:V3 0.03471587 1.0973752 0.27274212
# V1:V4 0.05301563 1.6771828 0.09381987
Bootstrap
We can assume from the start that the bootstrap with 1000 columns will run for a while (choose(1000, 2) gives 499500 combinations), so it makes sense to think about a parallelized solution right away.
To bootstrap, we repeatedly apply corr_cmb to samples of the rows drawn with replacement.
We also measure the time, to estimate the runtime needed for 1000 variables.
## setup clusters
library(parallel)
CL <- makeCluster(detectCores() - 1)
clusterExport(CL, c('corr_cmb', 'dat'))
t0 <- Sys.time() ## timestamp before run
B <- 1099L
clusterSetRNGStream(CL, 42)
boot_res <- parSapply(CL, 1:B, \(i) corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
t1 <- Sys.time() ## timestamp after run
stopCluster(CL)
After the bootstrap, we calculate the proportion of replicates in which the absolute centered bootstrap test statistics exceed the absolute theoretical ones (Ref.),
boot_p <- rowMeans(abs(boot_res - rowMeans(boot_res)) > abs(rhat$statistic))
and cbind the bootstrap p-values to the theoretical result.
cbind(rhat, boot_p)
# estimate statistic p.value boot_p
# V1:V2 0.06780426 2.1469547 0.03203729 0.03003003
# V1:V3 0.03471587 1.0973752 0.27274212 0.28028028
# V1:V4 0.05301563 1.6771828 0.09381987 0.08208208
# V1:V5 -0.01018682 -0.3218300 0.74764890 0.73473473
# V2:V3 0.03730133 1.1792122 0.23859474 0.23323323
# V2:V4 0.07203911 2.2817257 0.02271539 0.01201201
# V2:V5 0.03098230 0.9792363 0.32770055 0.30530531
# V3:V4 0.02364486 0.7471768 0.45513283 0.47547548
# V3:V5 -0.02864165 -0.9051937 0.36558126 0.38938939
# V4:V5 0.03415689 1.0796851 0.28054328 0.29329329
Note that the data used here is fairly normally distributed; if your data is not, the bootstrap p-values will deviate more from the theoretical ones.
To conclude, here is an estimate of the time needed for your 1000 variables (m = 5 is defined in the Data block below; t0 and t1 come from the run above).
d <- as.numeric(difftime(t1, t0, units='mins'))
n_est <- 1000
t_est <- d/(choose(m, 2))*choose(n_est, 2)
cat(sprintf('est. runtime for %s variables: %s mins\n', n_est, round(t_est, 1)))
# est. runtime for 1000 variables: 1485.8 mins
(For the sake of completeness, a single-threaded version for smaller problems:)
## singlethreaded version
# set.seed(42)
# B <- 1099L
# boot_res <- replicate(B, corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
Data:
library(MASS)
n <- 1e3; m <- 5
Sigma <- matrix(.5, m, m)
diag(Sigma) <- 1
set.seed(42)
M <- mvrnorm(n, runif(m), Sigma)
M <- M + rnorm(length(M), sd=6)
dat <- as.data.frame(M)

Categorical variable coding with 0s and 1s giving different outputs in R

What I did: I ran a linear mixed-effects model analysis in R with the nlme library. I have a categorical fixed variable, Blurriness, with 2 levels: B standing for Blurred, N standing for Non-Blurred. Upon suggestion, I changed them into 1 (for B) and 0 (for N).
Problem: I re-ran the model and got different p-values/results (I don't mean that positive values became negative; the numbers themselves changed).
What I did to solve it: I then reversed the coding (0 for B, 1 for N) to see if it changed anything, and I got the same p-values and coefficients as when I coded it as B and N (great!). But do you have any idea why that might be?
Edit: I add here a reproducible example: the data with only 80 rows: https://home.mycloud.com/action/share/dedef0a3-794c-4ccc-b245-f93559de1f33
katilimci = factor(dENEME$Participants)
resimler = factor(dENEME$ImageID)
bugu0 = factor(dENEME$Blurriness)
sira0 = factor(dENEME$TheOrderofTheImages)
cekicilik0 = factor(dENEME$TargetAttractiveness)
bugu1 = factor(dENEME$Blurriness2)
sira1= factor(dENEME$TheOrderofTheImages2)
cekicilik1 = factor(dENEME$TargetAttractiveness2)
library(nlme)
myModel1 = lme(Ratings~bugu0+sira0+cekicilik0+bugu0:cekicilik0+bugu0:sira0+sira0:cekicilik0+bugu0:cekicilik0:sira0,data = dENEME, random=list(katilimci=~1, resimler=~1),na.action = na.exclude)
myModel2 = lme(Ratings~bugu1+sira1+cekicilik1+bugu1:cekicilik1+bugu1:sira1+sira1:cekicilik1+bugu1:cekicilik1:sira1,data = dENEME, random=list(katilimci=~1, resimler=~1),na.action = na.exclude)
summary(myModel1)
summary(myModel2)
The resulting p-values are different and I could not find the reason why...
Edit 2: Another reproducible example:
library(nlme)
#fixed factors:
variable1<-as.factor(rep(c("A","B"),each=20))
variable2<-as.factor(sample(rep(c("A","B"),each=20)))
variable3<-as.factor(sample(rep(c("A","B"),each=20)))
#y variable:
ratings<-c(rnorm(20,0,2),rnorm(20,1,6))
#random factor:
ID<-as.factor(paste("ID",rep(1:20,times=2),sep=""))
#symmetrical contrast matrices:
contrasts(variable1)<-c(0,1)
#here, for variable1, level A becomes 0 and level B becomes 1.
contrasts(variable2)<-c(0,1)
contrasts(variable3)<-c(0,1)
#model1:
m1<-lme(ratings~variable1*variable2*variable3,random=~1|ID)
contrasts(variable1)<-c(1,0)
#here, for variable1, level A becomes 1 and level B becomes 0, so all the fixed variables mirror their earlier coding in the data that we created.
contrasts(variable2)<-c(1,0)
contrasts(variable3)<-c(1,0)
#model2:
m2<-lme(ratings~variable1*variable2*variable3,random=~1|ID)
summary(m1)
summary(m2)
#we bind the parameters of the 2 models to see them together for comparison:
rbind(
summary(m1)[[20]][,1],
summary(m2)[[20]][,1]
)
I've had a go at making a reproducible example since there isn't one in the question.
require(nlme)
df <- data.frame(dv = c(rnorm(20, 0), rnorm(20, 1)),
                 Blurriness = factor(c(rep("B", 20), rep("N", 20))),
                 Random = factor(rep(rep(c("x", "y"), each = 5), 2)),
                 Blurriness_1_0 = rep(1:0, each = 20),
                 Blurriness_0_1 = rep(0:1, each = 20))
m <- list()
m[[1]] <- lme(dv ~ Blurriness, random = ~ Blurriness | Random, data = df)
m[[2]] <- lme(dv ~ Blurriness_1_0, random = ~ Blurriness_1_0 | Random, data = df)
m[[3]] <- lme(dv ~ Blurriness_0_1, random = ~ Blurriness_0_1 | Random, data = df)
models <- lapply(m, function(x) summary(x)$tTable)
This gives 3 models which hopefully show the behaviour you describe:
models
#> [[1]]
#> Value Std.Error DF t-value p-value
#> (Intercept) -0.138797 0.2864303 37 -0.4845752 0.630834098
#> BlurrinessN 1.008572 0.3451909 37 2.9217817 0.005901891
#>
#> [[2]]
#> Value Std.Error DF t-value p-value
#> (Intercept) 0.8697753 0.2864293 37 3.036614 0.004366652
#> Blurriness_1_0 -1.0085723 0.3451909 37 -2.921781 0.005901898
#>
#> [[3]]
#> Value Std.Error DF t-value p-value
#> (Intercept) -0.138797 0.2864303 37 -0.4845752 0.630834098
#> Blurriness_0_1 1.008572 0.3451909 37 2.9217817 0.005901891
In this example, the p values are different only for the intercepts, which is what you would expect (it just tells us that the means of the two fixed-effects groups sit at different numbers of standard deviations from 0).
Perhaps this is not what you meant though - it's difficult to tell from your question without a reproducible example.
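One way to see why flipping the coding can matter in richer models: with treatment (0/1) coding, the intercept is roughly the mean of whichever level is coded 0, and in a model with interactions every lower-order coefficient is likewise a simple effect at the 0 level of the other factors, so recoding changes what those coefficients (and their p-values) are testing. A minimal check on the df above, comparing raw group means with the intercepts of the three models:
tapply(df$dv, df$Blurriness, mean)  # compare with the intercepts reported above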

Standard error and bias zero with bootstrapping

I want to take my dataset bodyfat_trimmed and use bootstrapping to retrieve the mean and the standard errors. However, I seem to be using the same data all the time and therefore get zero standard error and bias. How can I solve this?
bsfunc <- function(data) {
  set.seed(1)
  x <- model.matrix(reduced_BIC_fit)[, -1]
  y <- data$density
  bootdata <- sample(1:nrow(x), nrow(x)/2)
  x.train <- x[bootdata, ]
  y.train <- y[bootdata]
  bootframe <- data.frame(bodyfat_trimmed[train, ])
  fit <- lm(density ~ age + abdomen + wrist, data = bootframe)
  stats <- coef(summary(fit))[, "Estimate"]
  return(stats)
}
strap <- boot(data = bodyfat_trimmed, sim = "parametric", statistic = bsfunc, R=1000)
strap
Output:
PARAMETRIC BOOTSTRAP
Call:
boot(data = bodyfat_trimmed, statistic = bsfunc, R = 1000, sim = "parametric")
Bootstrap Statistics :
original bias std. error
t1* 1.1360858253 0 0
t2* -0.0000889957 0 0
t3* -0.0018446625 0 0
t4* 0.0050609837 0 0
If the seed is set inside the function, sample will return the same draw on every call:
bsfunc <- function() { set.seed(1); sample(1:10, 1) }
bsfunc()
[1] 3
bsfunc()
[1] 3
bsfunc()
[1] 3
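Moving the seed outside the function restores variation between calls. A minimal sketch (bsfunc2 is a hypothetical name):
set.seed(1)  # seed once, before any resampling
bsfunc2 <- function() sample(1:10, 1)
bsfunc2(); bsfunc2()  # successive calls now give different draws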
PS: Your bsfunc is also misconceived. As written, train (in bootframe <- data.frame(bodyfat_trimmed[train, ])) doesn't come from within this function. And normally the whole point of boot is to do the bootstrap resampling itself, whilst bsfunc should just be a straight statistic computed on the rows that boot hands it.
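A sketch of how the statistic is usually written for boot: it takes the data plus an index vector, refits on data[i, ], and boot does the resampling (no sim = "parametric", no seed inside the function). The variable names come from the question; this is one plausible repair, not the only one:
library(boot)
bsfunc <- function(data, i) {
  fit <- lm(density ~ age + abdomen + wrist, data = data[i, ])  # refit on resampled rows
  coef(fit)
}
set.seed(1)  # seed once, outside the statistic
strap <- boot(data = bodyfat_trimmed, statistic = bsfunc, R = 1000)
strap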

Get coefficients estimated by maximum likelihood into a stargazer table

Stargazer produces very nice LaTeX tables for lm (and other) objects. Suppose I've fit a model by maximum likelihood. I'd like stargazer to produce an lm-like table for my estimates. How can I do this?
Although it's a bit hacky, one way might be to create a "fake" lm object containing my estimates -- I think this would work as long as summary(my.fake.lm.object) works. Is that easily doable?
An example:
library(stargazer)
N <- 200
df <- data.frame(x=runif(N, 0, 50))
df$y <- 10 + 2 * df$x + 4 * rt(N, 4) # True params
plot(df$x, df$y)
model1 <- lm(y ~ x, data=df)
stargazer(model1, title="A Model") # I'd like to produce a similar table for the model below
ll <- function(params) {
  ## Log likelihood for y ~ x + student's t errors
  params <- as.list(params)
  return(sum(dt((df$y - params$const - params$beta*df$x) / params$scale,
                df=params$degrees.freedom, log=TRUE) - log(params$scale)))
}
model2 <- optim(par=c(const=5, beta=1, scale=3, degrees.freedom=5),
                lower=c(-Inf, -Inf, 0.1, 0.1),
                fn=ll, method="L-BFGS-B", control=list(fnscale=-1), hessian=TRUE)
model2.coefs <- data.frame(coefficient=names(model2$par), value=as.numeric(model2$par),
                           se=as.numeric(sqrt(diag(solve(-model2$hessian)))))
stargazer(model2.coefs, title="Another Model", summary=FALSE) # Works, but how can I mimic what stargazer does with lm objects?
To be more precise: with lm objects, stargazer nicely prints the dependent variable at the top of the table, includes SEs in parentheses below the corresponding estimates, and has the R^2 and number of observations at the bottom of the table. Is there a(n easy) way to obtain the same behavior with a "custom" model estimated by maximum likelihood, as above?
Here are my feeble attempts at dressing up my optim output as an lm object:
model2.lm <- list() # Mimic an lm object
class(model2.lm) <- c(class(model2.lm), "lm")
model2.lm$rank <- model1$rank # Problematic?
model2.lm$coefficients <- model2$par
names(model2.lm$coefficients)[1:2] <- names(model1$coefficients)
model2.lm$fitted.values <- model2$par["const"] + model2$par["beta"]*df$x
model2.lm$residuals <- df$y - model2.lm$fitted.values
model2.lm$model <- df
model2.lm$terms <- model1$terms # Problematic?
summary(model2.lm) # Not working
I was just having this problem and overcame it through the use of the coef, se, and omit arguments within stargazer, e.g.
stargazer(regressions, ...,
          coef = list(... list of coefs ...),
          se = list(... list of standard errors ...),
          omit = c(sequence),
          covariate.labels = c("new names"),
          dep.var.labels.include = FALSE,
          notes.append = FALSE)
You need to first instantiate a dummy lm object, then dress it up:
#...
model2.lm = lm(y ~ ., data.frame(y=runif(5), beta=runif(5), scale=runif(5), degrees.freedom=runif(5)))
model2.lm$coefficients <- model2$par
model2.lm$fitted.values <- model2$par["const"] + model2$par["beta"]*df$x
model2.lm$residuals <- df$y - model2.lm$fitted.values
stargazer(model2.lm, se = list(model2.coefs$se), summary=FALSE, type='text')
# ===============================================
# Dependent variable:
# ---------------------------
# y
# -----------------------------------------------
# const 10.127***
# (0.680)
#
# beta 1.995***
# (0.024)
#
# scale 3.836***
# (0.393)
#
# degrees.freedom 3.682***
# (1.187)
#
# -----------------------------------------------
# Observations 200
# R2 0.965
# Adjusted R2 0.858
# Residual Std. Error 75.581 (df = 1)
# F Statistic 9.076 (df = 3; 1)
# ===============================================
# Note: *p<0.1; **p<0.05; ***p<0.01
(and then of course make sure the remaining summary stats are correct)
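Since the R2, F statistic, and residual standard error computed from the dressed-up object are meaningless here, one option is simply to suppress them. A sketch using stargazer's omit.stat keywords:
stargazer(model2.lm, se = list(model2.coefs$se),
          omit.stat = c("rsq", "adj.rsq", "ser", "f"),
          type = "text")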
I don't know how committed you are to using stargazer, but you can try using the broom and xtable packages; the problem is that this won't give you the standard errors for the optim model.
library(broom)
library(xtable)
xtable(tidy(model1))
xtable(tidy(model2))
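To get the standard errors into a table anyway, one workaround (a sketch, reusing the model2.coefs data frame already built in the question) is to pass that hand-made coefficient table to xtable directly:
xtable(model2.coefs, digits = 3)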

Block bootstrap from subject list

I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.
I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.
Each firm can potentially be selected multiple times, so I need to include its data multiple times in each iteration's data set.
Using a loop and subset approach, like below, seems computationally burdensome.
Note that for my real problem, the data frame, n, and the number of iterations are all much larger than in the example below.
My initial thought is to break the existing data frame into a list by subject using the split() command. From there, use
sample(unique(df1$subject),n,replace=TRUE)
to get the new list, then perhaps use quickdf from the plyr package to construct a new data frame.
Example slow code:
require(plm)
data("Grunfeld", package="plm")
firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()
for(j in 1:iterations){
  v = sample(length(firms), n, replace=TRUE)
  newdata = NULL
  for(i in 1:n){
    newdata = rbind(newdata, subset(Grunfeld, firm == v[i]))
  }
  reg1 = lm(value ~ inv + capital, data = newdata)
  mybootresults[[j]] = coefficients(reg1)
}
mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults
(Intercept) inv capital
1 373.8591 6.981309 -0.9801547
2 370.6743 6.633642 -1.4526338
3 528.8436 6.960226 -1.1597901
4 331.6979 6.239426 -1.0349230
5 507.7339 8.924227 -2.8661479
...
...
How about something like this:
library(boot)
myfit <- function(x, i) {
  # stack the observations for the sampled firms (repeats allowed)
  mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
  coefficients(lm(value ~ inv + capital, data = mydata))
}
firms <- unique(Grunfeld$firm)
b0 <- boot(firms, myfit, 999)
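The resampled coefficients then sit in b0$t, one row per replicate, so the distribution you are after is just that matrix:
head(b0$t)  # bootstrap draws of (Intercept), inv, capital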
You can also use the tsboot function in the boot package with a fixed block resampling scheme.
require(plm)
require(boot)
data(Grunfeld)
### each firm is of length 20
table(Grunfeld$firm)
## 1 2 3 4 5 6 7 8 9 10
## 20 20 20 20 20 20 20 20 20 20
blockboot <- function(data) {
  coefficients(lm(value ~ inv + capital, data = data))
}
### fixed length (every 20 obs, so for each different firm) block bootstrap
set.seed(321)
boot.1 <- tsboot(Grunfeld, blockboot, R = 99, l = 20, sim = "fixed")
boot.1
## Bootstrap Statistics :
## original bias std. error
## t1* 410.81557 -25.785972 174.3766
## t2* 5.75981 0.451810 2.0261
## t3* -0.61527 0.065322 0.6330
dim(boot.1$t)
## [1] 99 3
head(boot.1$t)
## [,1] [,2] [,3]
## [1,] 522.11 7.2342 -1.453204
## [2,] 626.88 4.6283 0.031324
## [3,] 479.74 3.2531 0.637298
## [4,] 557.79 4.5284 0.161462
## [5,] 568.72 5.4613 -0.875126
## [6,] 379.04 7.0707 -1.092860
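The standard errors reported in the bootstrap table are just the column SDs of the replicate matrix, so you can also pull them out directly (a minimal sketch):
apply(boot.1$t, 2, sd)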
Here is a method that should typically be faster than the accepted answer, returns the same results and does not rely on additional packages (except boot). The key here is to use which and integer indexing to construct each data.frame replicate rather than split/subset and do.call/rbind.
# get function for boot
myIndex <- function(x, i) {
  # select the observations to subset; likely repeated observations
  blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
  # run regression for given replicate, return estimated coefficients
  coefficients(lm(value ~ inv + capital, data=Grunfeld[blockObs,]))
}
now, bootstrap
# get result
library(boot)
set.seed(1234)
b1 <- boot(firms, myIndex, 200)
Run the accepted answer
set.seed(1234)
b0 <- boot(firms, myfit, 200)
Let's eyeball a comparison
using indexing
b1
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myIndex, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
Original version
b0
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myfit, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
These look pretty close. Now, a bit more checking
identical(b0$t, b1$t)
[1] TRUE
and
identical(summary(b0), summary(b1))
[1] TRUE
Finally, we'll do a quick benchmark
library(microbenchmark)
microbenchmark(index={b1 <- boot(firms, myIndex, 200)},
rbind={b0 <- boot(firms, myfit, 200)})
On my computer, this returns
Unit: milliseconds
expr min lq mean median uq max neval
index 292.5770 296.3426 303.5444 298.4836 301.1119 395.1866 100
rbind 712.1616 720.0428 729.6644 724.0777 731.0697 833.5759 100
So, direct indexing is more than 2 times faster at every level of the distribution.
note on missing fixed effects
As with most of the answers, the issue of missing "fixed effects" may emerge. Commonly, fixed effects are used as controls, and the researcher is interested in one or a couple of variables that will be included with every selected observation. In this common case, there is little or no harm in restricting the result returned by the myIndex or myfit function to only the variables of interest, as sketched below.
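For example, the return line of myIndex could keep only the coefficients of interest, so the returned vector has a fixed length even when a fixed-effects term is added (a minimal sketch):
coefficients(lm(value ~ inv + capital + factor(firm),
                data = Grunfeld[blockObs, ]))[c("inv", "capital")]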
The solution needs to be modified to manage fixed effects.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
## Get the Grunfeld firm data (10 firms, each for 20 years, 1935-1954)
data("Grunfeld", package="plm")
## Create dataframe with unique firm identifier (one line per firm)
firms <- data.frame(firm=unique(Grunfeld$firm),junk=1)
## for boot(), X is the firms dataframe; i index the sampled firms
myfit <- function(X, i) {
## join the sampled firms to their firm-year data
mydata <- left_join(X[i,], Grunfeld, by="firm")
## Distinguish between multiple resamples of the same firm
## Otherwise they have the same id in the fixed effects regression
## And trouble ensues
mydata <- mutate(group_by(mydata,firm,year),
firm_uniq4boot = paste(firm,"+",row_number())
)
## Run regression with and without firm fixed effects
c(coefficients(lm(value ~ inv + capital, data = mydata)),
coefficients(lm(value ~ inv + capital + factor(firm_uniq4boot), data = mydata)))
}
set.seed(1)
system.time(b <- boot(firms, myfit, 1000))
summary(b)
summary(lm(value ~ inv + capital, data=Grunfeld))
summary(lm(value ~ inv + capital + factor(firm), data=Grunfeld))
I found a method using dplyr::left_join that is a bit more concise, only takes about 60% as long, and gives the same results as in the answer by Sean. Here's a complete self-contained example.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
# First get the data
data("Grunfeld", package="plm")
firms <- unique(Grunfeld$firm)
myfit1 <- function(x, i) {
  # x is the vector of firms
  # i are the indexes into x
  mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
  coefficients(lm(value ~ inv + capital, data = mydata))
}
myfit2 <- function(x, i) {
  # x is the vector of firms
  # i are the indexes into x
  mydata <- left_join(data.frame(firm=x[i]), Grunfeld, by="firm")
  coefficients(lm(value ~ inv + capital, data = mydata))
}
# rbind method
set.seed(1)
system.time(b1 <- boot(firms, myfit1, 5000))
## user system elapsed
## 13.51 0.01 13.62
# left_join method
set.seed(1)
system.time(b2 <- boot(firms, myfit2, 5000))
## user system elapsed
## 8.16 0.02 8.26
b1
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191
b2
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191
