Bootstrapping a vector of results, by group in R - r

Question: How can I use a boostrap to get confidence intervals for a collection of
statistics calculated on the eigenvalues of covariance matrices, separately for
each group (factor level) in a data frame?
Problem: I can't quite work out the data
structure I need to contain these results suitable for the boot function, or a way to "map" the bootstrap over groups and obtain confidence intervals in a form suitable for plotting.
Context:
In the heplots package, boxM calculates Box's M test of equality of covariance matrices.
There is a plot method that produces a useful plot of the log determinants that go into this
test. The confidence intervals in this plot are based on an asymptotic theory approximation.
> library(heplots)
> iris.boxm <- boxM(iris[, 1:4], iris[, "Species"])
> iris.boxm
Box's M-test for Homogeneity of Covariance Matrices
data: iris[, 1:4]
Chi-Sq (approx.) = 140.94, df = 20, p-value < 2.2e-16
> plot(iris.boxm, gplabel="Species")
The plot method can also display other functions of the eigenvalues, but no theoretical
confidence intervals are available in this case.
op <- par(mfrow=c(2,2), mar=c(5,4,1,1))
plot(iris.boxm, gplabel="Species", which="product")
plot(iris.boxm, gplabel="Species", which="sum")
plot(iris.boxm, gplabel="Species", which="precision")
plot(iris.boxm, gplabel="Species", which="max")
par(op)
Thus, I would like to be able to calculate these CIs using a boostrap, and display them in the corresponding plots.
What I've tried:
Below are functions that boostrap these statistics, but for the total
sample, not taking group (Species) into account.
cov_stat_fun <- function(data, indices,
stats=c("logdet", "prod", "sum", "precision", "max")
) {
dat <- data[indices,]
cov <- cov(dat, use="complete.obs")
eigs <- eigen(cov)$values
res <- c(
"logdet" = log(det(cov)),
"prod" = prod(eigs),
"sum" = sum(eigs),
"precision" = 1/ sum(1/eigs),
"max" = max(eigs)
)
}
boot_cov_stat <- function(data, R=500, ...) {
boot(data, cov_stat_fun, R=R, ...)
}
This works, but I need the results by group (and also for the total sample)
> iris.boot <- boot_cov_stat(iris[,1:4])
>
> iris.ci <- boot.ci(iris.boot)
> iris.ci
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-6.622, -5.702 ) (-6.593, -5.653 ) (-6.542, -5.438 )
Level Percentile BCa
95% (-6.865, -5.926 ) (-6.613, -5.678 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
>
I also have written a function that calculates the separate covariance matrices for each group, but I can't see how to use this in my bootstrap functions. Can someone help?
# calculate covariance matrices by group and pooled
covs <- function(Y, group) {
Y <- as.matrix(Y)
gname <- deparse(substitute(group))
if (!is.factor(group)) group <- as.factor(as.character(group))
valid <- complete.cases(Y, group)
if (nrow(Y) > sum(valid))
warning(paste(nrow(Y) - sum(valid)), " cases with missing data have been removed.")
Y <- Y[valid,]
group <- group[valid]
nlev <- nlevels(group)
lev <- levels(group)
mats <- aux <- list()
for(i in 1:nlev) {
mats[[i]] <- cov(Y[group == lev[i], ])
}
names(mats) <- lev
pooled <- cov(Y)
c(mats, "pooled"=pooled)
}
Edit:
In a seemingly related question, Bootstrap by groups, it is suggested that an answer is provided by using the strata argument to boot(), but there is no example of what this gives. [Ah: the strata argument just assures that strata are represented in the bootstrap sample in relation to their frequencies in the data.]
Trying this for my problem, I am not further enlightened, because what I want to get is separate confidence intervals for each Species.
> iris.boot.strat <- boot_cov_stat(iris[,1:4], strata=iris$Species)
>
> boot.ci(iris.boot.strat, conf=0.95, type=c("basic", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot.strat, conf = 0.95, type = c("basic",
"bca"))
Intervals :
Level Basic BCa
95% (-6.587, -5.743 ) (-6.559, -5.841 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
>

If I understand your question, you can run your bootstrap function by group as follows:
library(boot)
library(tidyverse)
# Pooled
iris.boot <- boot_cov_stat(iris[,1:4])
iris.ci <- boot.ci(iris.boot)
# By Species
boot.list = setNames(unique(iris$Species), unique(iris$Species)) %>%
map(function(group) {
iris.boot = boot_cov_stat(iris[iris$Species==group, 1:4])
boot.ci(iris.boot)
})
# Combine pooled and by-Species results
boot.list = c(boot.list, list(Pooled=iris.ci))
boot.list
$setosa
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-13.69, -11.86 ) (-13.69, -11.79 ) (-13.52, -10.65 )
Level Percentile BCa
95% (-14.34, -12.44 ) (-13.65, -11.99 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
$versicolor
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-11.37, -9.81 ) (-11.36, -9.78 ) (-11.25, -8.97 )
Level Percentile BCa
95% (-11.97, -10.39 ) (-11.35, -10.09 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
$virginica
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-9.467, -7.784 ) (-9.447, -7.804 ) (-9.328, -6.959 )
Level Percentile BCa
95% (-10.050, -8.407 ) ( -9.456, -8.075 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
$Pooled
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-6.620, -5.714 ) (-6.613, -5.715 ) (-6.556, -5.545 )
Level Percentile BCa
95% (-6.803, -5.906 ) (-6.624, -5.779 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

I think the best general answer will be an extension of what #eipi10 proposed, using some method to extract the required confidence intervals from the bootci objects. This is lacking from the broom package.
As an instructive alternative, I tried using broom::tidy() on the results of the bootstrap directly. Rather than the (typically asymmetric) confidence intervals, it gives the bootstrap estimate as statistic, a bias and a std.error. However, from the results I get (see below), I have doubts about whether broom::tidy() gives correct results in this case.
# try just using tidy on the bootstrap results
## pooled
iris.boot <- boot_cov_stat(iris[,1:4])
iris.pooled <- tidy(iris.boot)
Giving:
> iris.pooled
term statistic bias std.error
1 logdet -6.25922391 -0.0906294902 0.2469587430
2 prod 0.00191273 -0.0001120317 0.0004485317
3 sum 4.57295705 -0.0382145128 0.2861790776
4 precision 0.01692092 -0.0005047993 0.0016818910
5 max 4.22824171 -0.0329408193 0.2815648589
>
Now, use the method described in the other answer to map this over groups,
and combine:
## individual groups
boot.list2 = setNames(unique(iris$Species), unique(iris$Species)) %>%
map(function(group) {
iris.boot = boot_cov_stat(iris[iris$Species==group, 1:4])
tidy(iris.boot)
})
# Combine pooled and by-Species results
boot.list <- c(boot.list2, list(Pooled=iris.pooled))
Transform to a data frame:
## transform this list to a data frame, with a group variable
result <- bind_rows(boot.list) %>%
mutate(group = rep(c( levels(iris$Species), "Pooled"), 5)) %>%
arrange(term)
> result
term statistic bias std.error group
1 logdet -1.306736e+01 -3.240621e-01 4.660334e-01 setosa
2 logdet -1.087433e+01 -2.872073e-01 3.949917e-01 versicolor
3 logdet -8.927058e+00 -2.925485e-01 4.424367e-01 virginica
4 logdet -6.259224e+00 -9.062949e-02 2.469587e-01 Pooled
5 max 2.364557e-01 -6.696719e-03 4.426305e-02 setosa
6 max 4.878739e-01 -6.798321e-03 8.662880e-02 versicolor
7 max 6.952548e-01 -6.517223e-03 1.355433e-01 virginica
8 max 4.228242e+00 -3.294082e-02 2.815649e-01 Pooled
9 precision 5.576122e-03 -5.928678e-04 8.533907e-04 Pooled
10 precision 7.338788e-03 -6.894908e-04 1.184594e-03 setosa
11 precision 1.691212e-02 -1.821494e-03 2.000718e-03 versicolor
12 precision 1.692092e-02 -5.047993e-04 1.681891e-03 virginica
13 prod 2.113088e-06 -4.158518e-07 7.850009e-07 versicolor
14 prod 1.893828e-05 -3.605691e-06 6.100376e-06 virginica
15 prod 1.327479e-04 -2.381536e-05 4.792428e-05 Pooled
16 prod 1.912730e-03 -1.120317e-04 4.485317e-04 setosa
17 sum 3.092041e-01 -1.005543e-02 4.623437e-02 virginica
18 sum 6.248245e-01 -1.238896e-02 8.536621e-02 Pooled
19 sum 8.883673e-01 -1.500578e-02 1.409230e-01 setosa
20 sum 4.572957e+00 -3.821451e-02 2.861791e-01 versicolor
>
This gives something that can now be plotted, supposedly corresponding to the plot in the original question shown without error bars:
result %>% mutate(Pooled = group == "Pooled") %>%
filter (term != "logdet") %>%
ggplot(aes(y=statistic, x=group, color=Pooled)) +
geom_point(size=2.5) +
geom_errorbar(aes(ymin=statistic-2*std.error,
ymax=statistic+2*std.error), width=0.4) +
facet_wrap( ~ term, scales="free") +
coord_flip() + guides(color=FALSE)
However, this "tidy plot" seems evidently WRONG. Theory says that the result for the Polled sample must in every case be intermediate between those for the separate groups, because it is in some sense a 'convex combination` over groups. Compare the plot below with that given in the original question. (It is possible that I did something wrong here, but I can't see a flaw.)

Related

Computing the standard error when dividing coefficients of different regressions in R

Consider the following two regressions from the same dataset mtcars.
#load the data
data(mtcars)
# Run the regression
model1<-lm(mpg~cyl+gear+drat, data = mtcars)
model2<-lm(wt~cyl+gear+drat, data = mtcars)
summary(model1)
summary(model2)
#Calculate ratio of coefficients
g<-model1$coefficients[2] / model2$coefficients[2]
#calculate clustered standard errors
vcov<-cluster.vcov(model1, mtcars$vs)
coeftest(model1, vcov)
vcov<-cluster.vcov(model2, mtcars$vs)
coeftest(model2, vcov)
We observe that the ratio of the cyl variable in the two regressions is equal to -8.16. Now I would like to calculate the standard error that corresponds with this ratio (the clustering of my standard errors here does not make much sense, but it just provides us with a variance covariance matrix for both models, which we may need). Stata has a command called "nlcom" that can do this, but I cannot find a similar command in R. Does anyone of you know whether it exists? If not, then how should I do this? I appreciate any help.
As #MDEWITT suggests, you could use the delta method in the article, though I think you would probably need to estimate the model differently - you would probably need a single multivariate model rather than two independent regression models because you need the covariance of the two coefficients, which doesn't exist unless you estimate a joint model:
library(msm)
data(mtcars)
# Run the regression
model<-lm(cbind(mpg, wt)~cyl+gear+drat, data = mtcars)
b <- c(coef(model))
v <- vcov(model)
## calcualte se
est <- b[2]/b[6]
se <- deltamethod(g = ~ x2/x6, b, vcov(model))
> est
# [1] -8.160363
> se
# [1] 1.770336
There are other methods, too. A non-parametric bootstap could be used:
## write a function to calculate the statistic of interest
boot.fun <- function(data, inds){
m <- lm(cbind(mpg, wt) ~ cyl + gear + drat, data=data[inds, ])
# return the appropraite ratio
coef(m)[2,1]/coef(m)[2,2]
}
library(boot)
## bootstrap the function
out <- boot(mtcars, statistic=boot.fun, R=5000)
## calculate confidence intervals
boot.ci(out, type=c("perc", "bca"))
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 5000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = out, type = c("perc", "bca"))
#
# Intervals :
# Level Percentile BCa
# 95% (-12.993, -5.152 ) (-12.330, -4.838 )
# Calculations and Intervals on Original Scale
You could also use a non-parametric boostrap
## draw coefficients from implied sampling distributions
B <- MASS::mvrnorm(5000, c(coef(model)), vcov(model))
## calculate the ratio for each draw
rat <- B[,2]/B[,6]
## calculate the confidence itnerval
round(c(mean(rat), quantile(rat, c(.025,.975))), 3)
# 2.5% 97.5%
# -8.559 -14.418 -5.554
The three methods generate these three confidence intervals:
q1 <- est + qt(c(.025,.975), df=model$df.residual)*se
q2 <- boot.ci(out, type="perc")$percent[1,4:5]
q3 <- quantile(rat, c(.025,.975))
cis <- rbind(q1, q2, q3)
colnames(cis) <- c("lwr", "upr")
rownames(cis) <- c("Delta Method", "BS (non-parametric)", "BS (parametric)")
cis
# lwr upr
# Delta Method -11.78673 -4.533994
# BS (non-parametric) -12.99338 -5.151545
# BS (parametric) -14.41812 -5.553773
Both the parametric bootstrap distributions (these are for the middle 99% of the data) are skewed left, so a normal-theory confidence interval based on the estimate and standard error may well be inappropriate.
par(mfrow=c(1,2))
hist(out$t[which(out$t > quantile(out$t, .005) & out$t < quantile(out$t, .995))], main="Non-parametric Bootstrap", xlab="ratio")
hist(rat[which(rat < quantile(rat, .995) & rat > quantile(rat, .005))], main="Parametric Boostratp", xlab="ratio")

How to compute confidence intervall for Krippendorf's Alpha in R?

I am sure this is realted to Bootstrapping Krippendorff's Alpha. But I didn't understand the question nor the answers there. And it looks like that even the answers and comments are contradicting each other.
set.seed(0)
df <- data.frame(a = rep(sample(1:4),10), b = rep(sample(1:4),10))
kripp.alpha(t(df))
This is the output.
Krippendorff's alpha
Subjects = 40
Raters = 2
alpha = 0.342
How can I compute the confidence interval here?
You are right it is connected to bootstrapping. You could compute the confidence interval the following way:
library(irr)
library(boot)
alpha.boot <- function(d,w) {
data <- t(d[w,])
kripp.alpha(data)$value
}
b <- boot(data = df, statistic = alpha.boot, R = 1000)
b
plot(b)
boot.ci(b, type = "perc")
This is the output:
Bootstrap Statistics :
original bias std. error
t1* 0.3416667 -0.01376158 0.1058123
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b, type = "perc")
Intervals :
Level Percentile
95% ( 0.1116, 0.5240 )
Calculations and Intervals on Original Scale
there is also a R script from Zapf et al. 2016 look for Additional file 3 at the bottom of the page just before the references
Or you could use the kripp.boot function available on github MikeGruz/kripp.boot

Confidence Interval from hierarchical bootstrap

I would like to calculate a BCa confidence interval for multi-stage bootstrap using boot.ci(). Here is an example from: Non-parametric bootstrapping on the highest level of clustered data using boot() function from {boot} in R
which uses the boot command.
# creating example df
rho <- 0.4
dat <- expand.grid(
trial=factor(1:5),
subject=factor(1:3)
)
sig <- rho * tcrossprod(model.matrix(~ 0 + subject, dat))
diag(sig) <- 1
set.seed(17); dat$value <- chol(sig) %*% rnorm(15, 0, 1)
# function for resampling
resamp.mean <- function(dat,
indices,
cluster = c('subject', 'trial'),
replace = TRUE){
cls <- sample(unique(dat[[cluster[1]]]), replace=replace)
sub <- lapply(cls, function(b) subset(dat, dat[[cluster[1]]]==b))
sub <- do.call(rbind, sub)
mean(sub$value)
}
dat.boot <- boot(dat, resamp.mean, 4) # produces and estimated statistic
boot.ci(data.boot) # produces errors
How can I use boot.ci on the boot output?
You have used too few bootstrap resamples. When you call boot.ci, influence measures are needed, and if not provided they are obtained from empinf, which may fail with too few observations. See here for an explanation along similar lines.
Try
dat.boot <- boot(dat, resamp.mean, 1000)
boot.ci(dat.boot, type = "bca")
which gives:
> boot.ci(dat.boot, type = "bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = dat.boot, type = "bca")
Intervals :
Level BCa
95% (-0.2894, 1.2979 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
As an alternative, you can provide L (the influence measures) yourself.
# proof of concept, use appropriate value for L!
> dat.boot <- boot(dat, resamp.mean, 4)
> boot.ci(dat.boot, type = "bca", L = 0.2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 4 bootstrap replicates
CALL :
boot.ci(boot.out = dat.boot, type = "bca", L = 0.2)
Intervals :
Level BCa
95% ( 0.1322, 1.2979 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable

Bootstrapping to estimate the mean of a geometric sample

So I need to run the boot strap method for the geometric mean of the population. The 6 measurements are 1,2,2,4,6,6 and the estimate of the population geometric mean based on this sample is gm= (1*2*2*4*6*6)*(1/6). I need to compute the 95% confidence limit on the population geometric mean.
So far I have:
set.seed(13254)
gmsample <- c(1,2,2,4,6,6)
gmsample
n<- length(gmsample)
gm.hat <- prod(gmsample) ** (1/6)
gm.hat
for(b in 1:B){
inx.boot<- sample(1:n, replace=TRUE)
gmboot<- gmsample[idx.boot]
print(gmboot)
rboot[b] <- prod(gmboot) ** (1/n)
print(r.boot)
}
boot.sd <- sd(r.boot)
boot.sd
I got this from collecting info from the internet, and I'm extremely new to R so any help would be great.
I would suggest using the boot package instead of rolling your own bootstrapping method. For instance, computing 1000 bootstrap replicates of the geometric mean can be done with:
gmsample <- c(1,2,2,4,6,6)
library(boot)
b <- boot(gmsample, function(d, i) prod(d[i])^(1/length(i)), 1000)
Now you can use the boot.ci method to compute confidence intervals. For instance, if you wanted to use the 95% percentile confidence interval, you could do:
boot.ci(b, 0.95, "perc")
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 1000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = b, conf = 0.95, type = "perc")
#
# Intervals :
# Level Percentile
# 95% ( 1.701, 4.670 )
# Calculations and Intervals on Original Scale
There are many other types of confidence intervals, which you can read about with ?boot.ci.

Bootstrap Confidence Intervals for more than one statistics through boot.ci function

I want to get bootstrap confidence intervals for more than one statistics through boot.ci function. Here is my MWE.
I've two statistics in out and want to find the bootstrap confidence intervals for these two statistics. However, boot.ci function is providing the bootstrap confidence intervals for only first statistic (t1*) but not for the second statistic (t2*).
set.seed(12345)
df <- rnorm(n=10, mean = 0, sd = 1)
Boot.fun <-
function(data, idx) {
data1 <- sample(data[idx], replace=TRUE)
m1 <- mean(data1)
sd1 <- sd(data1)
out <- cbind(m1, sd1)
return(out)
}
Boot.fun(data = df)
library(boot)
boot.out <- boot(df, Boot.fun, R = 20)
boot.out
RDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = Boot.fun, R = 20)
Bootstrap Statistics :
original bias std. error
t1* -0.4815861 0.3190424 0.2309631
t2* 0.9189246 -0.1998455 0.2499412
boot.ci(boot.out=boot.out, conf = 0.95, type = c("norm", "basic", "perc", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 20 bootstrap replicates
CALL :
boot.ci(boot.out = boot.out, conf = 0.95, type = c("norm", "basic",
"perc", "bca"))
Intervals :
Level Normal Basic
95% (-1.2533, -0.3479 ) (-1.1547, -0.4790 )
Level Percentile BCa
95% (-0.4842, 0.1916 ) (-0.4842, -0.4629 )
Calculations and Intervals on Original Scale
Warning : Basic Intervals used Extreme Quantiles
Some basic intervals may be unstable
Warning : Percentile Intervals used Extreme Quantiles
Some percentile intervals may be unstable
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
Warning messages:
1: In norm.inter(t, (1 + c(conf, -conf))/2) :
extreme order statistics used as endpoints
2: In norm.inter(t, alpha) : extreme order statistics used as endpoints
3: In norm.inter(t, adj.alpha) :
extreme order statistics used as endpoints
The boot package is (IMO) a little clunky for regular use. The short answer is that you need to specify index (default value is 1) to boot.ci, e.g. boot.ci(boot.out,index=2). The long answer is that it would certainly be convenient to get the bootstrap CIs for all of the bootstrap statistics at once!
Get all CI for a specified result slot:
getCI <- function(x,w) {
b1 <- boot.ci(x,index=w)
## extract info for all CI types
tab <- t(sapply(b1[-(1:3)],function(x) tail(c(x),2)))
## combine with metadata: CI method, index
tab <- cbind(w,rownames(tab),as.data.frame(tab))
colnames(tab) <- c("index","method","lwr","upr")
tab
}
## do it for both parameters
do.call(rbind,lapply(1:2,getCI,x=boot.out))
Results (maybe not what you want, but easy to reshape):
index method lwr upr
normal 1 normal -1.2533079 -0.3479490
basic 1 basic -1.1547310 -0.4789996
percent 1 percent -0.4841726 0.1915588
bca 1 bca -0.4841726 -0.4628899
normal1 2 normal 0.6288945 1.6086459
basic1 2 basic 0.5727462 1.4789105
percent1 2 percent 0.3589388 1.2651031
bca1 2 bca 0.6819394 1.2651031
Alternatively, if you can live with getting one bootstrap method at a time, my version of the broom package on Github has this capability (I've submitted a pull request)
## devtools::install_github("bbolker/broom")
library(broom)
tidy(boot.out,conf.int=TRUE,conf.method="perc")

Resources