Bootstrapping sample means in R using the boot package: creating the statistic function for boot()

I have a data set with 15 density calculations, each from a different transect. I would like to resample these with replacement, taking 15 randomly selected samples of the 15 transects and then getting the mean of each resample. Each transect should have its own probability of being sampled during this process. This should be done 5000 times. I have code which does this without using the boot function, but if I want to calculate the BCa 95% CI using the boot package, the bootstrapping has to be done through the boot function first.
I have been trying to create a function but I can't get any to work. I want the bootstrap to sample from a certain column (data$xs), and the probabilities to be used are in the column data$prob.
The function I thought might work was:
library(boot)
meanfun <- function(data, i) {
  d <- data[i, ]
  return(mean(d))
}
bo <- boot(data$xs, statistic = meanfun, R = 5000)
# boot.ci(bo, conf = 0.95, type = "bca")  # obviously `bo` was never created
But this gave me the error 'incorrect number of dimensions'.
I understand how to write a function in the normal sense, but the way the function works in boot seems strange. Since the function is only given to boot by name, with no specification of the arguments to pass into it, I seem limited to whatever boot itself passes in as arguments (for example, I am unable to pass data$xs in as the data argument, or data$prob as a probability argument, and so on). This seems to really limit what can be done. Perhaps I am missing something, though?
Thanks for any and all help

The reason for this error is that data$xs returns a vector, which you then try to subset with data[i, ].
One way to solve this is to change it to data[i], or to use data[, "xs", drop = FALSE] instead. The drop = FALSE avoids type coercion, i.e., it keeps the object a data.frame.
We try:
data <- data.frame(xs = rnorm(15, 2))
library(boot)
meanfun <- function(data, i) {
  d <- data[i, ]
  return(mean(d))
}
bo <- boot(data[, "xs", drop = FALSE], statistic = meanfun, R = 5000)
boot.ci(bo, conf = 0.95, type = "bca")
and obtain:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
CALL :
boot.ci(boot.out = bo, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% ( 1.555, 2.534 )
Calculations and Intervals on Original Scale
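The question also asked about per-transect sampling probabilities, which the example above does not use. boot() has a weights argument which, for sim = "ordinary", supplies importance resampling probabilities, and that should serve this purpose; a hedged sketch, assuming the data frame also carries the prob column from the question:
bo.w <- boot(data[, "xs", drop = FALSE], statistic = meanfun, R = 5000,
             weights = data$prob)  # per-observation resampling probabilities (assumed column)
boot.ci(bo.w, conf = 0.95, type = "bca")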

One can use boot.array to extract all or a subset of the resampled sets. With indices = TRUE it returns a matrix whose rows hold the indices of the observations drawn in each replicate (rather than the data values themselves). In this case:
bo.ci <- boot.ci(boot.out = bo, conf = 0.95, type = "bca")
resampled.data <- boot.array(bo, indices = TRUE)
To extract the first and second sets of resampled data:
resample.1 <- resampled.data[1, ]
resample.2 <- resampled.data[2, ]
Then proceed to extract whatever statistic you want from any subset. For instance, if you assume normality you could run a Student's t-test on the first subset:
t.test(resample.1)
Which for this example and particular seed value(s) gives:
data: resample.1
t = 6.5216, df = 14, p-value = 1.353e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.234781 10.365219
sample estimates:
mean of x
7.8
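Note that resample.1 contains indices, so the t-test above is computed on the index values. To work with the resampled data values themselves, one can map the indices back into the original data; a small sketch using the data frame from the answer above:
values.1 <- data$xs[resample.1]   # the actual resampled observations
mean(values.1)                    # equals bo$t[1], the first bootstrap replicate of the mean
t.test(values.1)                  # t-test on the resampled values rather than on the indices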


Confusion about the matrix "B" returned by `quantreg::boot.rq`

When invoking boot.rq like this:
b_10 = boot.rq(x, y, tau = .1, bsmethod = "xy", cov = TRUE, R = reps, mofn = mofn)
what does the B matrix (size R x p) in b_10 contain: bootstrapped coefficient estimates or bootstrapped standard errors?
The Value section in the documentation says:
A list consisting of two elements: A matrix B of dimension R by p is returned with the R resampled estimates of the vector of quantile regression parameters. [...]
So, it seems to be the coefficient estimates. But the Description section says:
These functions can be used to construct standard errors, confidence intervals and tests of hypotheses regarding quantile regression models.
So it seems to be bootstrapped standard errors.
What is it really?
Edit:
I also wonder what difference the option cov = TRUE makes. Thanks!
It stores the bootstrap coefficients. Each row of B is a sample of coefficients, and you have R rows.
These samples are the basis for further inference. We can compute various statistics from them. For example, to compute the bootstrap means and standard errors, we can do:
colMeans(B)
apply(B, 2, sd)
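A self-contained sketch of the whole workflow, using the engel data set shipped with quantreg (the tau and R values here are arbitrary choices, not from the question):
library(quantreg)
data(engel)                                 # food expenditure data in quantreg
X <- cbind(1, engel$income)                 # design matrix with an intercept
set.seed(1)
fit <- boot.rq(X, engel$foodexp, tau = 0.1, bsmethod = "xy", R = 500)
dim(fit$B)                                  # R x p: one coefficient vector per row
colMeans(fit$B)                             # bootstrap means of the coefficients
apply(fit$B, 2, sd)                         # bootstrap standard errors
apply(fit$B, 2, quantile, c(0.025, 0.975))  # percentile 95% confidence intervals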
Do you also happen to know what difference the option cov = TRUE makes?
Are you sure that cov = TRUE works? First of all, boot.rq itself has no such argument. It may be passed in via .... However, ... is forwarded to boot.rq.pxy (if bsmethod = "pxy") or boot.rq.pwxy (if bsmethod = "pwxy"), neither of which deals with a cov argument. Furthermore, you use bsmethod = "xy", so ... will be silently ignored. As far as I could see, cov = TRUE has no effect at all.
It works in the sense that R doesn't throw me an error.
That is what "silently ignored" means. You can pass whatever into .... They are just ignored.
The bootstrapped values are different depending on whether I use cov = TRUE or not. The code was written by someone else so I'm not sure why that option was put there.
Random sampling won't give identical results on different runs. I suggest you fix a random seed then do testing:
set.seed(0); ans1 <- boot.rq(..., cov = FALSE)
set.seed(0); ans2 <- boot.rq(..., cov = TRUE)
all.equal(ans1$B, ans2$B)
If you don't get TRUE, come back to me.
You're right. It's just because of the different seeds. Thanks!!

Struggling to run moveHMM using lognormal function in parallelised routines

I am attempting to run a two-state HMM using a lognormal distribution. I have read Michelot and Langrock (2019) on choosing starting parameters by inspecting the data in a histogram and then running iterations in parallel, which has worked for my gamma distribution. Identifying the starting parameters for the lognormal distribution is troubling me, however. Do I plot the log of my step-length distribution and then attempt to extract starting parameters, or do I use the same starting parameters as my gamma distribution and rely on stepDist = "lnorm"?
My code for the lognormal attempt currently looks like this:
ncores <- detectCores() - 1
cl <- makeCluster(getOption("cl.cores", ncores))
clusterExport(cl, list("data", "fitHMM"))
niter <- 20
allPar0 <- lapply(as.list(1:niter), function(x) {
  stepMean0 <- runif(2, min = c(x, y), max = c(y, z))
  stepSD0 <- runif(2, min = c(x, y), max = c(y, z))
  angleMean0 <- c(0, 0)
  angleCon0 <- runif(2, min = c(a, b), max = c(a, b))
  stepPar0 <- c(stepMean0, stepSD0)
  anglePar0 <- c(angleMean0, angleCon0)
  return(list(step = stepPar0, angle = anglePar0))
})
# Fit the niter models in parallel
logP <- parLapply(cl = cl, X = allPar0, fun = function(par0) {
  m <- fitHMM(data = data, nbStates = 2, stepDist = "lnorm",
              stepPar0 = par0$step, anglePar0 = par0$angle)
  return(m)
})
# Extract likelihoods of fitted models
likelihoodL <- unlist(lapply(logP, function(m) m$mod$minimum))
likelihoodL
# Index of best fitting model (smallest negative log-likelihood)
whichbestpL <- which.min(likelihoodL)
bestL <- logP[[whichbestpL]]
bestL
If I use negative values taken from a plot of the log step lengths, then I get the error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: Check the step parameters bounds (the initial parameters should be strictly between the bounds of their parameter space).
If I use the same starting parameter values that I used for my gamma distribution, then I get the error:
Error in unserialize(node$con) :
embedded nul in string: 'X\n\0\0\0\003\0\004\002\0\0\003\005\0\0\0'
Please could someone shed some light on how I'm failing at this?
Thank you!
Unfortunately, I can't tell for sure what the problem is from the code you included. If you don't get an error when you run fitHMM outside of parLapply, then it suggests that the problem is in how you choose the values of x, y, and z in your code.
The first parameter of the log-normal distribution can be negative or positive, and it is actually the mean of the logarithm of the step length. So, to find good starting values for this, you should look at a histogram of the log step lengths (e.g., following the dedicated moveHMM vignette). The second parameter is the standard deviation of the log step lengths, and this should be strictly positive (but could also be chosen based on the spread of the histogram of log step lengths).
To summarise, you should choose all the initial values based on plots of the log step lengths (rather than the step lengths themselves), and you should not use the same ranges of values for stepMean0 and stepSD0 (because the former can be negative or positive, whereas the latter is positive). Hopefully, this should help you choose x, y, and z.
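A minimal sketch of that advice (assuming your data came from moveHMM::prepData and has a step column; the quantile- and sd-based bounds below are just one plausible way to pick the ranges):
# Inspect the log step lengths to choose starting values for stepDist = "lnorm"
logsteps <- log(data$step[!is.na(data$step) & data$step > 0])
hist(logsteps)  # location suggests the means, spread suggests the SDs
# Location parameter (mean of the log step lengths) may be negative or positive
stepMean0 <- runif(2, min = quantile(logsteps, 0.1), max = quantile(logsteps, 0.9))
# Scale parameter (SD of the log step lengths) must be strictly positive
stepSD0 <- runif(2, min = 0.5 * sd(logsteps), max = 2 * sd(logsteps))
stepPar0 <- c(stepMean0, stepSD0)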

Why do the results of bootstrapping methods differ when the same seed is used?

I want to generate 95% confidence intervals for the R2 of a linear model. While developing the code and using the same seed for both approaches, I found that doing the bootstrap manually doesn't give me the same results as using the boot function from the boot package. I am wondering now if I am doing something wrong, or why this is happening?
On the other hand, in order to calculate the 95% CI I was trying to use the confint function, but I'm getting the error "$ operator is invalid for atomic vectors". Is there a way to avoid this error?
Here is a reproducible example to explain my concerns:
# creating the dataframe
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF <- data.frame(a, b)
# bootstrapping manually
set.seed(123)
x <- length(DF$a)
B_manually <- data.frame(replicate(100, summary(lm(a ~ b, data = DF[sample(x, replace = T), ]))$r.squared))
names(B_manually)[1] <- "r_squared"
# bootstrapping using the boot function from the boot package
set.seed(123)
library(boot)
B_boot <- boot(DF, function(data, indices)
  summary(lm(a ~ b, data[indices, ]))$r.squared, R = 100)
head(B_manually) == head(B_boot$t)
r_squared
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
# Why do the results of the manual vs. boot-function approach differ if I'm using the same seed?
# 2nd question (using the confint function to determine the 95% CI gives an error)
confint(B_manually$r_squared, level = 0.95, method = "quantile")
confint(B_boot$t, level = 0.95, method = "quantile")
# Error: $ operator is invalid for atomic vectors
# NOTE: I already used boot.ci to determine the 95% confidence interval, as well as the
# quantile function, but the results of these CIs differ from each other
# and I just wanted to compare with the confint function.
quantile(B_boot$t, c(0.025, 0.975))
boot.ci(B_boot, index = 1, type = "perc")
Thanks in advance for any help!
The boot package does not use replicate with sample to generate the indices. Check the index-generating functions (e.g. importance.array) in the source code for boot: they generate all the indices in one go. So there is no reason to assume that you will end up with the same indices or the same result. Take a step back: the purpose of the bootstrap is to use random resampling to obtain an estimate of your parameters, and you should get similar estimates from different implementations of the bootstrap.
For example, you can see that the distributions of R^2 are very similar:
set.seed(111)
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF <- data.frame(a, b)
set.seed(123)
x <- length(DF$a)
B_manually <- data.frame(replicate(999, summary(lm(a ~ b, data = DF[sample(x, replace = T), ]))$r.squared))
library(boot)
B_boot <- boot(DF, function(data, indices)
  summary(lm(a ~ b, data[indices, ]))$r.squared, R = 999)
par(mfrow = c(2, 1))
hist(B_manually[, 1], breaks = seq(0, 0.4, 0.01), main = "dist of R2 manual")
hist(B_boot$t, breaks = seq(0, 0.4, 0.01), main = "dist of R2 boot")
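As for why the two streams of random draws differ even under the same seed, here is a sketch of the difference (illustrative, not the exact boot source code):
set.seed(123)
n <- nrow(DF); R <- 100
# boot builds the entire R x n index matrix in a single draw...
idx <- matrix(sample.int(n, n * R, replace = TRUE), nrow = R)
# ...whereas the manual loop interleaves one sample() call per replicate with
# the model fitting, so the two approaches consume the RNG stream differently
# and produce different resamples even when they start from the same seed.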
The confint function you are using is meant for model objects such as lm, and it estimates a confidence interval for the coefficients; see the help page. It takes the standard error of a coefficient and multiplies it by the critical t-value to give you the confidence interval. You can check out this book page for the formula. The objects from your bootstrapping are not lm objects, so this function doesn't work; it is not meant for arbitrary estimates.
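For contrast, a short sketch of what confint is actually for, using the lm fit from the question's data:
fit <- lm(a ~ b, data = DF)
confint(fit, level = 0.95)  # t-based CIs for the regression coefficients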

Function to select different columns when calculating a correlation and confidence interval using the bootstrap in R

Here is the problem I am currently facing: I have a data frame (let's call it A) of 200 observations (rows) and 12 variables (columns), and I am trying to find a confidence interval, using the bootstrap, for the correlation between two variables in the data frame.
My Data:
library(boot)
library(tidyverse)
library(psychometric)
hsb2 <- read.table("https://stats.idre.ucla.edu/stat/data/hsb2.csv", sep=",", header=T)
Here I am trying to find the confidence interval using a bootstrap-based correlation approach. I wrote code for that and it works:
orig.cor <- cor(hsb2$math, hsb2$write, method = "spearman")  # original correlation (needed by CIr below)
orig.cor
k <- CIr(r = orig.cor, n = 21, level = .95)
k
n <- length(hsb2$math)
#n
B <- 5000
boot.cor.all <- NULL
for (i in 1:B) {
  index <- sample(1:n, replace = T)
  boot.v2 <- hsb2$math[index]
  boot.v1 <- hsb2$write[index]
  boot.cor <- cor(boot.v1, boot.v2, method = "spearman")
  boot.cor.all <- c(boot.cor.all, boot.cor)
}
ci_boot <- quantile(boot.cor.all, prob = c(0.025, 0.975))
ci_boot
Result:
[1] 0.6439442
[1] 0.2939780 0.8416635
2.5% 97.5%
0.5556964 0.7211145
Here is the actual problem I am facing: I have to write a function that produces the result for other pairs of variables, but this function is not working:
bo <- function(v1, v2, df) {
  orig.cor <- cor(df$v1, df$v2, method = "spearman")
  orig.ci <- CIr(r = orig.cor, n = 21, level = .95)
  B <- 5000
  n <- length(df$v1)
  boot.cor.all <- NULL
  for (i in 1:B) {
    index <- sample(1:n, replace = T)
    boot.hvltt2 <- df$v1[index]
    boot.hvltt <- df$v2[index]
    boot.cor <- cor(boot.hvltt2, boot.hvltt, method = "spearman")
    boot.cor.all <- c(boot.cor.all, boot.cor)
  }
  ci_boot <- quantile(boot.cor.all, prob = c(0.025, 0.975))
  return(orig.cor, orig.ci, ci_boot)
}
After calling this function:
bo(math,write,hsb2)
bo(math,read,hsb2)
bo(female,write,hsb2)
bo(female,read,hsb2)
I am getting this error:
Error in cor(df$v1, df$v2, method = "spearman") : supply both 'x' and 'y' or a matrix-like 'x'
How do I write the function correctly?
I want the result of each function call to be stored in a data frame like the one below:
Variable1  Variable2  Orig Cor   Orig CI                  Bootstrap CI
math       write       0.643      0.2939780  0.8416635     0.5556964   0.7211145
math       read        0.66       0.3242639  0.8511580     0.5736904   0.7400174
female     read       -0.059     -0.4787978  0.3820967    -0.20432743  0.08176896
female     write
science    write
science    read
The logic was right; I just had to make some changes to how you access the elements of df. R doesn't recognize the objects math and write because they are columns inside the data.frame. One way to pass them as arguments to the function is to define them as strings, v1 = "math", and then access them with df[, v1]:
bo <- function(v1, v2, df) {
  orig.cor <- cor(df[, v1], df[, v2], method = "spearman")
  orig.ci <- CIr(r = orig.cor, n = 21, level = .95)
  B <- 5000
  n <- nrow(df)  # changed length to nrow
  boot.cor.all <- NULL
  for (i in 1:B) {
    index <- sample(1:n, replace = T)
    boot.hvltt2 <- df[index, v1]
    boot.hvltt <- df[index, v2]
    boot.cor <- cor(boot.hvltt2, boot.hvltt, method = "spearman")
    boot.cor.all <- c(boot.cor.all, boot.cor)
  }
  ci_boot <- quantile(boot.cor.all, prob = c(0.025, 0.975))
  return(list(orig.cor, orig.ci, ci_boot))  # wrap the returns in a list
}
bo("math","write",hsb2)

Summary statistics for imputed data from Zelig & Amelia

I'm using Amelia to impute the missing values.
While I'm able to use Zelig and Amelia to do some calculations...
How do I use these packages to find the pooled means and standard deviations of the newly imputed data?
library(Amelia)
library(Zelig)
n <- 100
x1 <- rnorm(n, 0, 1)                          # random normal distribution
x2 <- .4 * x1 + rnorm(n, 0, sqrt(1 - .4^2))   # x2 is correlated with x1, r = .4
x1 <- ifelse(rbinom(n, 1, .2) == 1, NA, x1)   # randomly creating missing values
d <- data.frame(cbind(x1, x2))
m <- 5                                        # set 5 imputed data frames
d.imp <- amelia(d, m = m)                     # imputed data
summary(d.imp)                                # provides summary of imputation process
I couldn't figure out how to format the code in a comment so here it is.
foo <- function(x, fcn) apply(x, 2, fcn)
lapply(d.imp$imputations, foo, fcn = mean)
lapply(d.imp$imputations, foo, fcn = sd)
d.imp$imputations gives a list of all the imputed data sets. You can work with that list however you are comfortable with to get out the means and sds by column and then pool as you see fit. Same with correlations.
lapply(d.imp$imputations, cor)
Edit: After some discussion in the comments, I see that what you are looking for is how to combine results (for example, the means of the imputed data sets generated by Amelia) using Rubin's rules. I think you should clarify in the title and body of your post that you want to combine results over imputations to get appropriate standard errors with Rubin's rules after imputing with the Amelia package. This was not clear from the title or original description. "Pooling" can mean different things, particularly w.r.t. variances.
The mi.meld function expects a q matrix of estimates from each imputation, an se matrix of the corresponding SE estimates, and a logical byrow argument. See ?mi.meld for an example. In your case, you want the sample means and their estimated standard errors for each of your imputed data sets in the q and se matrices passed to mi.meld, respectively.
q <- t(sapply(d.imp$imputations, foo, fcn = mean))
se <- t(sapply(d.imp$imputations, foo, fcn = sd)) / sqrt(100)
output <- mi.meld(q = q, se = se, byrow = TRUE)
should get you what you're looking for. For other statistics than the mean, you will need to get an SE either analytically, if available, or by, say, bootstrapping, if not.
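If I remember the return value correctly, the melded results then live in output$q.mi (pooled point estimates) and output$se.mi (pooled standard errors). For reference, a sketch of Rubin's rules written out by hand, which should match what mi.meld computes:
m.imp <- nrow(q)
qbar <- colMeans(q)              # pooled estimate: average over imputations
Ubar <- colMeans(se^2)           # mean within-imputation variance
B <- apply(q, 2, var)            # between-imputation variance
sqrt(Ubar + (1 + 1/m.imp) * B)   # total (pooled) standard error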
