Bootstrapping to estimate the mean of a geometric sample - r

So I need to run the boot strap method for the geometric mean of the population. The 6 measurements are 1,2,2,4,6,6 and the estimate of the population geometric mean based on this sample is gm= (1*2*2*4*6*6)*(1/6). I need to compute the 95% confidence limit on the population geometric mean.
So far I have:
set.seed(13254)
gmsample <- c(1,2,2,4,6,6)
gmsample
n<- length(gmsample)
gm.hat <- prod(gmsample) ** (1/6)
gm.hat
for(b in 1:B){
inx.boot<- sample(1:n, replace=TRUE)
gmboot<- gmsample[idx.boot]
print(gmboot)
rboot[b] <- prod(gmboot) ** (1/n)
print(r.boot)
}
boot.sd <- sd(r.boot)
boot.sd
I got this from collecting info from the internet, and I'm extremely new to R so any help would be great.

I would suggest using the boot package instead of rolling your own bootstrapping method. For instance, computing 1000 bootstrap replicates of the geometric mean can be done with:
gmsample <- c(1,2,2,4,6,6)
library(boot)
b <- boot(gmsample, function(d, i) prod(d[i])^(1/length(i)), 1000)
Now you can use the boot.ci method to compute confidence intervals. For instance, if you wanted to use the 95% percentile confidence interval, you could do:
boot.ci(b, 0.95, "perc")
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 1000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = b, conf = 0.95, type = "perc")
#
# Intervals :
# Level Percentile
# 95% ( 1.701, 4.670 )
# Calculations and Intervals on Original Scale
There are many other types of confidence intervals, which you can read about with ?boot.ci.

Related

Bootstrapping a vector of results, by group in R

Question: How can I use a boostrap to get confidence intervals for a collection of
statistics calculated on the eigenvalues of covariance matrices, separately for
each group (factor level) in a data frame?
Problem: I can't quite work out the data
structure I need to contain these results suitable for the boot function, or a way to "map" the bootstrap over groups and obtain confidence intervals in a form suitable for plotting.
Context:
In the heplots package, boxM calculates Box's M test of equality of covariance matrices.
There is a plot method that produces a useful plot of the log determinants that go into this
test. The confidence intervals in this plot are based on an asymptotic theory approximation.
> library(heplots)
> iris.boxm <- boxM(iris[, 1:4], iris[, "Species"])
> iris.boxm
Box's M-test for Homogeneity of Covariance Matrices
data: iris[, 1:4]
Chi-Sq (approx.) = 140.94, df = 20, p-value < 2.2e-16
> plot(iris.boxm, gplabel="Species")
The plot method can also display other functions of the eigenvalues, but no theoretical
confidence intervals are available in this case.
op <- par(mfrow=c(2,2), mar=c(5,4,1,1))
plot(iris.boxm, gplabel="Species", which="product")
plot(iris.boxm, gplabel="Species", which="sum")
plot(iris.boxm, gplabel="Species", which="precision")
plot(iris.boxm, gplabel="Species", which="max")
par(op)
Thus, I would like to be able to calculate these CIs using a boostrap, and display them in the corresponding plots.
What I've tried:
Below are functions that boostrap these statistics, but for the total
sample, not taking group (Species) into account.
cov_stat_fun <- function(data, indices,
stats=c("logdet", "prod", "sum", "precision", "max")
) {
dat <- data[indices,]
cov <- cov(dat, use="complete.obs")
eigs <- eigen(cov)$values
res <- c(
"logdet" = log(det(cov)),
"prod" = prod(eigs),
"sum" = sum(eigs),
"precision" = 1/ sum(1/eigs),
"max" = max(eigs)
)
}
boot_cov_stat <- function(data, R=500, ...) {
boot(data, cov_stat_fun, R=R, ...)
}
This works, but I need the results by group (and also for the total sample)
> iris.boot <- boot_cov_stat(iris[,1:4])
>
> iris.ci <- boot.ci(iris.boot)
> iris.ci
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-6.622, -5.702 ) (-6.593, -5.653 ) (-6.542, -5.438 )
Level Percentile BCa
95% (-6.865, -5.926 ) (-6.613, -5.678 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
>
I also have written a function that calculates the separate covariance matrices for each group, but I can't see how to use this in my bootstrap functions. Can someone help?
# calculate covariance matrices by group and pooled
covs <- function(Y, group) {
Y <- as.matrix(Y)
gname <- deparse(substitute(group))
if (!is.factor(group)) group <- as.factor(as.character(group))
valid <- complete.cases(Y, group)
if (nrow(Y) > sum(valid))
warning(paste(nrow(Y) - sum(valid)), " cases with missing data have been removed.")
Y <- Y[valid,]
group <- group[valid]
nlev <- nlevels(group)
lev <- levels(group)
mats <- aux <- list()
for(i in 1:nlev) {
mats[[i]] <- cov(Y[group == lev[i], ])
}
names(mats) <- lev
pooled <- cov(Y)
c(mats, "pooled"=pooled)
}
Edit:
In a seemingly related question, Bootstrap by groups, it is suggested that an answer is provided by using the strata argument to boot(), but there is no example of what this gives. [Ah: the strata argument just assures that strata are represented in the bootstrap sample in relation to their frequencies in the data.]
Trying this for my problem, I am not further enlightened, because what I want to get is separate confidence intervals for each Species.
> iris.boot.strat <- boot_cov_stat(iris[,1:4], strata=iris$Species)
>
> boot.ci(iris.boot.strat, conf=0.95, type=c("basic", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot.strat, conf = 0.95, type = c("basic",
"bca"))
Intervals :
Level Basic BCa
95% (-6.587, -5.743 ) (-6.559, -5.841 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
>
If I understand your question, you can run your bootstrap function by group as follows:
library(boot)
library(tidyverse)
# Pooled
iris.boot <- boot_cov_stat(iris[,1:4])
iris.ci <- boot.ci(iris.boot)
# By Species
boot.list = setNames(unique(iris$Species), unique(iris$Species)) %>%
map(function(group) {
iris.boot = boot_cov_stat(iris[iris$Species==group, 1:4])
boot.ci(iris.boot)
})
# Combine pooled and by-Species results
boot.list = c(boot.list, list(Pooled=iris.ci))
boot.list
$setosa
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-13.69, -11.86 ) (-13.69, -11.79 ) (-13.52, -10.65 )
Level Percentile BCa
95% (-14.34, -12.44 ) (-13.65, -11.99 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
$versicolor
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-11.37, -9.81 ) (-11.36, -9.78 ) (-11.25, -8.97 )
Level Percentile BCa
95% (-11.97, -10.39 ) (-11.35, -10.09 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
$virginica
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-9.467, -7.784 ) (-9.447, -7.804 ) (-9.328, -6.959 )
Level Percentile BCa
95% (-10.050, -8.407 ) ( -9.456, -8.075 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
$Pooled
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = iris.boot)
Intervals :
Level Normal Basic Studentized
95% (-6.620, -5.714 ) (-6.613, -5.715 ) (-6.556, -5.545 )
Level Percentile BCa
95% (-6.803, -5.906 ) (-6.624, -5.779 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
I think the best general answer will be an extension of what #eipi10 proposed, using some method to extract the required confidence intervals from the bootci objects. This is lacking from the broom package.
As an instructive alternative, I tried using broom::tidy() on the results of the bootstrap directly. Rather than the (typically asymmetric) confidence intervals, it gives the bootstrap estimate as statistic, a bias and a std.error. However, from the results I get (see below), I have doubts about whether broom::tidy() gives correct results in this case.
# try just using tidy on the bootstrap results
## pooled
iris.boot <- boot_cov_stat(iris[,1:4])
iris.pooled <- tidy(iris.boot)
Giving:
> iris.pooled
term statistic bias std.error
1 logdet -6.25922391 -0.0906294902 0.2469587430
2 prod 0.00191273 -0.0001120317 0.0004485317
3 sum 4.57295705 -0.0382145128 0.2861790776
4 precision 0.01692092 -0.0005047993 0.0016818910
5 max 4.22824171 -0.0329408193 0.2815648589
>
Now, use the method described in the other answer to map this over groups,
and combine:
## individual groups
boot.list2 = setNames(unique(iris$Species), unique(iris$Species)) %>%
map(function(group) {
iris.boot = boot_cov_stat(iris[iris$Species==group, 1:4])
tidy(iris.boot)
})
# Combine pooled and by-Species results
boot.list <- c(boot.list2, list(Pooled=iris.pooled))
Transform to a data frame:
## transform this list to a data frame, with a group variable
result <- bind_rows(boot.list) %>%
mutate(group = rep(c( levels(iris$Species), "Pooled"), 5)) %>%
arrange(term)
> result
term statistic bias std.error group
1 logdet -1.306736e+01 -3.240621e-01 4.660334e-01 setosa
2 logdet -1.087433e+01 -2.872073e-01 3.949917e-01 versicolor
3 logdet -8.927058e+00 -2.925485e-01 4.424367e-01 virginica
4 logdet -6.259224e+00 -9.062949e-02 2.469587e-01 Pooled
5 max 2.364557e-01 -6.696719e-03 4.426305e-02 setosa
6 max 4.878739e-01 -6.798321e-03 8.662880e-02 versicolor
7 max 6.952548e-01 -6.517223e-03 1.355433e-01 virginica
8 max 4.228242e+00 -3.294082e-02 2.815649e-01 Pooled
9 precision 5.576122e-03 -5.928678e-04 8.533907e-04 Pooled
10 precision 7.338788e-03 -6.894908e-04 1.184594e-03 setosa
11 precision 1.691212e-02 -1.821494e-03 2.000718e-03 versicolor
12 precision 1.692092e-02 -5.047993e-04 1.681891e-03 virginica
13 prod 2.113088e-06 -4.158518e-07 7.850009e-07 versicolor
14 prod 1.893828e-05 -3.605691e-06 6.100376e-06 virginica
15 prod 1.327479e-04 -2.381536e-05 4.792428e-05 Pooled
16 prod 1.912730e-03 -1.120317e-04 4.485317e-04 setosa
17 sum 3.092041e-01 -1.005543e-02 4.623437e-02 virginica
18 sum 6.248245e-01 -1.238896e-02 8.536621e-02 Pooled
19 sum 8.883673e-01 -1.500578e-02 1.409230e-01 setosa
20 sum 4.572957e+00 -3.821451e-02 2.861791e-01 versicolor
>
This gives something that can now be plotted, supposedly corresponding to the plot in the original question shown without error bars:
result %>% mutate(Pooled = group == "Pooled") %>%
filter (term != "logdet") %>%
ggplot(aes(y=statistic, x=group, color=Pooled)) +
geom_point(size=2.5) +
geom_errorbar(aes(ymin=statistic-2*std.error,
ymax=statistic+2*std.error), width=0.4) +
facet_wrap( ~ term, scales="free") +
coord_flip() + guides(color=FALSE)
However, this "tidy plot" seems evidently WRONG. Theory says that the result for the Polled sample must in every case be intermediate between those for the separate groups, because it is in some sense a 'convex combination` over groups. Compare the plot below with that given in the original question. (It is possible that I did something wrong here, but I can't see a flaw.)

How to compute confidence intervall for Krippendorf's Alpha in R?

I am sure this is realted to Bootstrapping Krippendorff's Alpha. But I didn't understand the question nor the answers there. And it looks like that even the answers and comments are contradicting each other.
set.seed(0)
df <- data.frame(a = rep(sample(1:4),10), b = rep(sample(1:4),10))
kripp.alpha(t(df))
This is the output.
Krippendorff's alpha
Subjects = 40
Raters = 2
alpha = 0.342
How can I compute the confidence interval here?
You are right it is connected to bootstrapping. You could compute the confidence interval the following way:
library(irr)
library(boot)
alpha.boot <- function(d,w) {
data <- t(d[w,])
kripp.alpha(data)$value
}
b <- boot(data = df, statistic = alpha.boot, R = 1000)
b
plot(b)
boot.ci(b, type = "perc")
This is the output:
Bootstrap Statistics :
original bias std. error
t1* 0.3416667 -0.01376158 0.1058123
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b, type = "perc")
Intervals :
Level Percentile
95% ( 0.1116, 0.5240 )
Calculations and Intervals on Original Scale
there is also a R script from Zapf et al. 2016 look for Additional file 3 at the bottom of the page just before the references
Or you could use the kripp.boot function available on github MikeGruz/kripp.boot

Confidence Interval from hierarchical bootstrap

I would like to calculate a BCa confidence interval for multi-stage bootstrap using boot.ci(). Here is an example from: Non-parametric bootstrapping on the highest level of clustered data using boot() function from {boot} in R
which uses the boot command.
# creating example df
rho <- 0.4
dat <- expand.grid(
trial=factor(1:5),
subject=factor(1:3)
)
sig <- rho * tcrossprod(model.matrix(~ 0 + subject, dat))
diag(sig) <- 1
set.seed(17); dat$value <- chol(sig) %*% rnorm(15, 0, 1)
# function for resampling
resamp.mean <- function(dat,
indices,
cluster = c('subject', 'trial'),
replace = TRUE){
cls <- sample(unique(dat[[cluster[1]]]), replace=replace)
sub <- lapply(cls, function(b) subset(dat, dat[[cluster[1]]]==b))
sub <- do.call(rbind, sub)
mean(sub$value)
}
dat.boot <- boot(dat, resamp.mean, 4) # produces and estimated statistic
boot.ci(data.boot) # produces errors
How can I use boot.ci on the boot output?
You have used too few bootstrap resamples. When you call boot.ci, influence measures are needed, and if not provided they are obtained from empinf, which may fail with too few observations. See here for an explanation along similar lines.
Try
dat.boot <- boot(dat, resamp.mean, 1000)
boot.ci(dat.boot, type = "bca")
which gives:
> boot.ci(dat.boot, type = "bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = dat.boot, type = "bca")
Intervals :
Level BCa
95% (-0.2894, 1.2979 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
As an alternative, you can provide L (the influence measures) yourself.
# proof of concept, use appropriate value for L!
> dat.boot <- boot(dat, resamp.mean, 4)
> boot.ci(dat.boot, type = "bca", L = 0.2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 4 bootstrap replicates
CALL :
boot.ci(boot.out = dat.boot, type = "bca", L = 0.2)
Intervals :
Level BCa
95% ( 0.1322, 1.2979 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable

Bootstrap Confidence Intervals for more than one statistics through boot.ci function

I want to get bootstrap confidence intervals for more than one statistics through boot.ci function. Here is my MWE.
I've two statistics in out and want to find the bootstrap confidence intervals for these two statistics. However, boot.ci function is providing the bootstrap confidence intervals for only first statistic (t1*) but not for the second statistic (t2*).
set.seed(12345)
df <- rnorm(n=10, mean = 0, sd = 1)
Boot.fun <-
function(data, idx) {
data1 <- sample(data[idx], replace=TRUE)
m1 <- mean(data1)
sd1 <- sd(data1)
out <- cbind(m1, sd1)
return(out)
}
Boot.fun(data = df)
library(boot)
boot.out <- boot(df, Boot.fun, R = 20)
boot.out
RDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = Boot.fun, R = 20)
Bootstrap Statistics :
original bias std. error
t1* -0.4815861 0.3190424 0.2309631
t2* 0.9189246 -0.1998455 0.2499412
boot.ci(boot.out=boot.out, conf = 0.95, type = c("norm", "basic", "perc", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 20 bootstrap replicates
CALL :
boot.ci(boot.out = boot.out, conf = 0.95, type = c("norm", "basic",
"perc", "bca"))
Intervals :
Level Normal Basic
95% (-1.2533, -0.3479 ) (-1.1547, -0.4790 )
Level Percentile BCa
95% (-0.4842, 0.1916 ) (-0.4842, -0.4629 )
Calculations and Intervals on Original Scale
Warning : Basic Intervals used Extreme Quantiles
Some basic intervals may be unstable
Warning : Percentile Intervals used Extreme Quantiles
Some percentile intervals may be unstable
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
Warning messages:
1: In norm.inter(t, (1 + c(conf, -conf))/2) :
extreme order statistics used as endpoints
2: In norm.inter(t, alpha) : extreme order statistics used as endpoints
3: In norm.inter(t, adj.alpha) :
extreme order statistics used as endpoints
The boot package is (IMO) a little clunky for regular use. The short answer is that you need to specify index (default value is 1) to boot.ci, e.g. boot.ci(boot.out,index=2). The long answer is that it would certainly be convenient to get the bootstrap CIs for all of the bootstrap statistics at once!
Get all CI for a specified result slot:
getCI <- function(x,w) {
b1 <- boot.ci(x,index=w)
## extract info for all CI types
tab <- t(sapply(b1[-(1:3)],function(x) tail(c(x),2)))
## combine with metadata: CI method, index
tab <- cbind(w,rownames(tab),as.data.frame(tab))
colnames(tab) <- c("index","method","lwr","upr")
tab
}
## do it for both parameters
do.call(rbind,lapply(1:2,getCI,x=boot.out))
Results (maybe not what you want, but easy to reshape):
index method lwr upr
normal 1 normal -1.2533079 -0.3479490
basic 1 basic -1.1547310 -0.4789996
percent 1 percent -0.4841726 0.1915588
bca 1 bca -0.4841726 -0.4628899
normal1 2 normal 0.6288945 1.6086459
basic1 2 basic 0.5727462 1.4789105
percent1 2 percent 0.3589388 1.2651031
bca1 2 bca 0.6819394 1.2651031
Alternatively, if you can live with getting one bootstrap method at a time, my version of the broom package on Github has this capability (I've submitted a pull request)
## devtools::install_github("bbolker/broom")
library(broom)
tidy(boot.out,conf.int=TRUE,conf.method="perc")

Fitting a lognormal distribution to truncated data in R

For a brief background, I am insterested in describing a distribution of fire sizes, which is presumed to follow a lognormal distribution (many small fires and few large fires). For my specific application I am only interested in the fires that fall within a certain range of sizes (> min, < max). So, I am attempting to fit a lognormal distribution to a data set that has been censored on both ends. In essence, I want to find the parameters of the lognormal distribution (mu and sigma) that best fits the full distribution prior to censoring. Can I fit the distribution taking into account that I know I am only looking a a portion of the distribution?
I have done some experimentation, but have become stumped. Here's an example:
# Generate data #
D <- rlnorm(1000,meanlog = -0.75, sdlog = 1.5)
# Censor data #
min <- 0.10
max <- 20
Dt <- D[D > min]
Dt <- Dt[Dt <= max]
If I fit the non-censored data (D) using either fitdistr (MASS) or fitdist (fitdistrplus) I obviously get approximately the same parameter values as I entered. But if I fit the censored data (Dt) then the parameter values do not match, as expected. The question is how to incorporate the known censoring. I have seen some references elsewhere to using upper and lower within fitdistr, but I encounter an error that I'm not sure how to resolve:
> fitt <- fitdist(Dt, "lognormal", lower = min, upper = max)
Error in fitdist(Dt, "lognormal", lower = min, upper = max) :
The dlognormal function must be defined
I will appreciate any advice, first on whether this is the appropriate way to fit a censored distribution, and if so, how to go about defining the dlognormal function so that I can make this work. Thanks!
Your data is not censored (that would mean that observations outside the interval
are there, but you do not know their exact value)
but truncated (those observations have been discarded).
You just have to provide fitdist with the density and the cumulative distribution function
of your truncated distribution.
library(truncdist)
dtruncated_log_normal <- function(x, meanlog, sdlog)
dtrunc(x, "lnorm", a=.10, b=20, meanlog=meanlog, sdlog=sdlog)
ptruncated_log_normal <- function(q, meanlog, sdlog)
ptrunc(q, "lnorm", a=.10, b=20, meanlog=meanlog, sdlog=sdlog)
library(fitdistrplus)
fitdist(Dt, "truncated_log_normal", start = list(meanlog=0, sdlog=1))
# Fitting of the distribution ' truncated_log_normal ' by maximum likelihood
# Parameters:
# estimate Std. Error
# meanlog -0.7482085 0.08390333
# sdlog 1.4232373 0.0668787

Resources