Bootstrapping function with data.table - r

I have been trying to write a function that takes the results from a simple regression model and calculates the Glass's delta effect size. That was easy.
The problem now is that I would like to calculate confidence intervals for this value and I keep getting an error when I use it with the boot library.
I have tried to follow this answer but with no success.
As an example I am going to use a Stata dataset:
library(data.table)
webclass <- readstata13::read.dta13("http://www.stata.com/videos13/data/webclass.dta")
#estimate impact
M0<-lm(formula = math ~ treated ,data = webclass)
######################################
##### Effect Size ######
## Glass's delta=(M1-M2)/SD2 ##
####################################
ESdelta <- function(regmodel, yvar, tvar, msg = TRUE) {
  Data <- regmodel$model
  setDT(Data)
  meanT <- mean(Data[get(tvar) == "Treated", get(yvar)])
  meanC <- mean(Data[get(tvar) == "Control", get(yvar)])
  sdC <- sd(Data[get(tvar) == "Control", get(yvar)])
  ESDelta <- (meanT - meanC) / sdC
  if (msg == TRUE) {
    cat(paste("the average scores of the variable-", yvar, "-differ by approximately", round(ESDelta, 2), "standard deviations"))
  }
  return(ESDelta)
}
ESdelta(M0,"math","treated",msg = F)
#0.7635896
Now, when I try to use the boot function, I get the following error:
boot::boot(M0, statistic=ESdelta, R=50,"math","treated")
#Error in match.arg(stype) : 'arg' should be one of “i”, “f”, “w”
Thanks

In the boot manual (type ?boot):
statistic: [...] The first argument passed will always be the original
data. The second will be a vector of indices, frequencies or weights
which define the bootstrap sample.
You cannot bootstrap a fitted model object, so modify your function to take the data.table as its first argument and the index vector as its second. Any other arguments to the function must come after those two:
library(boot)
ESdelta <- function(Data, inds, yvar, tvar, msg = TRUE) {
  # boot passes the resampled row indices as the second argument
  Data <- Data[inds, ]
  meanT <- mean(Data[get(tvar) == "Treated", get(yvar)])
  meanC <- mean(Data[get(tvar) == "Control", get(yvar)])
  sdC <- sd(Data[get(tvar) == "Control", get(yvar)])
  ESDelta <- (meanT - meanC) / sdC
  if (msg == TRUE) {
    cat(paste("the average scores of the variable-", yvar, "-differ by approximately", round(ESDelta, 2), "standard deviations"))
  }
  return(ESDelta)
}
Dat <- setDT(M0$model)
bo <- boot(Dat, statistic = ESdelta, R = 50, yvar = "math", tvar = "treated", msg = FALSE)
> bo
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Dat, statistic = ESdelta, R = 50, yvar = "math",
tvar = "treated", msg = FALSE)
Bootstrap Statistics :
original bias std. error
t1* 0.7635896 0.05685514 0.4058304
You can get the confidence intervals with:
boot.ci(bo)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 50 bootstrap replicates
CALL :
boot.ci(boot.out = bo)
Intervals :
Level Normal Basic
95% (-0.0887, 1.5021 ) (-0.8864, 1.5398 )
Level Percentile BCa
95% (-0.0126, 2.4136 ) (-0.1924, 1.7579 )
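The same pattern can be tried without downloading the Stata file. The following is only a sketch on simulated data: the data frame sim, the function glass_delta and the choice R = 2000 are illustrative assumptions, not part of the original answer. It uses a plain data.frame rather than a data.table, and a larger number of replicates, since 50 replicates is very few for percentile or BCa intervals.

```r
library(boot)

set.seed(1)
# Simulated stand-in for the webclass data (assumption, not the real dataset)
sim <- data.frame(
  treated = rep(c("Control", "Treated"), each = 100),
  math    = c(rnorm(100, mean = 50, sd = 10), rnorm(100, mean = 57, sd = 10))
)

# boot supplies the first two arguments (data, indices);
# yvar and tvar are forwarded through boot's `...`
glass_delta <- function(data, inds, yvar, tvar) {
  d <- data[inds, ]
  y_t <- d[[yvar]][d[[tvar]] == "Treated"]
  y_c <- d[[yvar]][d[[tvar]] == "Control"]
  (mean(y_t) - mean(y_c)) / sd(y_c)
}

bo_sim <- boot(sim, statistic = glass_delta, R = 2000,
               yvar = "math", tvar = "treated")

# Percentile and BCa intervals from the same set of replicates
ci_sim <- boot.ci(bo_sim, type = c("perc", "bca"))
ci_sim
```

The statistic here returns a single number, so boot.ci needs no index argument.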

Related

Getting Error Bootstrapping to test predictive model

rsq <- function(formula, Data1, indices) {
d <- Data1[indices,] # allows boot to select sample
fit <- lm(formula, Data1=d)
return(summary(fit)$r.square)
}
results = boot(data = Data1, statistic = rsq, R = 500)
When I execute the code, I get the following error:
Error in Data1[indices,] : incorrect number of dimensions
Background info: I am creating a predictive model using Linear Regressions. I would like to test my Predictive Model and through some research, I decided to use the Bootstrapping Method.
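Credit goes to @Rui Barradas, check comments for original post.
If you read the help page for function boot::boot you will see that the statistic function it calls receives the data as its first argument, then the indices, then any others. So change the order of your function definition to rsq <- function(Data1, indices, formula) and pass the formula to boot as an extra argument. Note also that lm expects its data through the data argument, not Data1.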
Another problem that I had was that I didn't define the Function.
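Put together, the corrected function might look like the sketch below. Since Data1 is not available here, it uses the built-in mtcars data and the formula mpg ~ wt + disp purely as stand-ins.

```r
library(boot)

# Data first, indices second, any extra arguments (here the formula) after
rsq <- function(data, indices, formula) {
  d <- data[indices, ]            # rows selected by boot for this replicate
  fit <- lm(formula, data = d)    # lm's argument is `data`
  summary(fit)$r.squared
}

set.seed(42)
# `formula` is not an argument of boot, so it is forwarded to rsq via `...`
results <- boot(data = mtcars, statistic = rsq, R = 500,
                formula = mpg ~ wt + disp)
results
```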

vegan::ordiR2step() doesn't find best-fit model

The vegan package includes the ordiR2step() function for model building, which can be used to identify the most important variables using the R2 and the p-value as goodness of fit measures. However for the dataset I was recently working with the function doesn't provide the best-fit model.
# data
RIKZ <- read.table("http://www.uni-koblenz-landau.de/en/campus-landau/faculty7/environmental-sciences/landscape-ecology/Teaching/RIKZ_data/at_download/file", header = TRUE)
# data preparation
Species <- RIKZ[ ,2:5]
ExplVar <- RIKZ[ , 9:15]
Species_fin <- Species[ rowSums(Species) > 0, ]
ExplVar_fin <- ExplVar[ rowSums(Species) > 0, ]
# load vegan before calling rda()
require(vegan)
# rda
RIKZ_rda <- rda(Species_fin ~ . , data = ExplVar_fin, scale = TRUE)
# stepwise model building: ordiR2step()
step_both_R2 <- ordiR2step(rda(Species_fin ~ salinity, data = ExplVar_fin, scale = TRUE),
scope = formula(RIKZ_rda),
direction = "both", R2scope = TRUE, Pin = 0.05,
steps = 1000)
Why does ordiR2step() not add the variable exposure to the model, although it would increase the explained variance?
If R2scope is set to FALSE and the p-value criterion is increased (Pin = 0.15), it adds the variable exposure correctly but throws the following error:
Error in terms.formula(tmp, simplify = TRUE) :
invalid model formula in ExtractVars
If R2scope is set to TRUE (Pin = 0.15), exposure is not added.
Note: This might seem more like a statistics question and therefore more suitable for CV. However, I think the problem is rather technical and better off here on SO.
Please read the ordiR2step documentation: it will tell you why exposure is not added to the model. The help page says that ordiR2step has three stopping criteria. The second criterion is that "the adjusted R2 of the ‘scope’ is exceeded". This happens with exposure and therefore it was not added. This second criterion is ignored if you set R2scope = FALSE (also documented). So the function works as documented.

How can I use pre bootstrapped data to obtain a BCa CI?

I have bootstrapped two variables (one which is already in the "Impala.csv" file) using a function which resamples and reports the mean for a sample the size of nrow(data) for 5000 repetitions. The code is as follows:
data<-read.csv("Impala.csv")
allo<-data$distance
data2<-read.csv("2010 - IM.csv")
pro<-data2$pro
n1<-nrow(data2)
boot4000 <- c()
for(i in 1:5000){
s <- sample(data2$xs,n1,replace=T,prob = data2$pro)
boot4000[i] <- mean(s)
}
I then combine the two outputs in a formula, giving me 5000 new values.
d<-(pi/2)*(boot4000*(1/allo))
Now I wish to find the BCa confidence intervals for this, but as I understand, the boot function will require me to make a new set of resamples, but I do not want this as the bootstrapping is complete. All I want now is a function which will take my bootstrapped data as is and determine the BCa confidence interval.
Here are the data files I have used:
http://www.filedropper.com/impala
http://www.filedropper.com/2010-im
Also, I have tried to create an object imitating a 'boot' object using the following
den<-as.matrix(d, ncol=1)
outs<-list(t0=mean(d), t=den, R=5000, L=3)
boot.ci(outs, type="bca")
This spits out the error:
Error in if (as.character (boot.out$call[1L]) == "tsboot") warning
("BCa intervals not defined for time series bootstraps") else output
<- C (output,: argument is of length zero
outs <- list(t0=mean(d), t=den, R=5000, sim="ordinary",
stype="i", weights=rep(0.0002,5000), statistic=meanfun,
data=d, call=boot(data=d, statistic = meanfun,R=5000),
strata = rep(1,5000), attr="boot", seed=.Random.seed)
This is how one can build a list with the components of an object of class "boot", which is what boot.ci expects as its boot.out argument.

How to calculate panel bootstrapped standard errors with R?

I recently switched from Stata to R and am struggling to find some corresponding commands. I would like to get panel bootstrapped standard errors from a fixed effects model using the plm library, as described here for Stata users.
My questions concern the approach in general (whether boot is the appropriate library, or library(meboot)) and how to solve this particular error using boot.
First get some panel data:
library(plm)
data(EmplUK) # from plm library
test<-function(data, i) coef(plm(wage~emp+sector,data = data[i,],
index=c("firm","year"),model="within"))
Second:
library(boot)
boot<-boot(EmplUK, test, R = 100)
> boot<-boot(EmplUK, test, R = 100)
duplicate couples (time-id)
Error in pdim.default(index[[1]], index[[2]]) :
Called from: top level
Browse[1]>
boot resamples rows with replacement, so the index vector it passes to the statistic (original here) contains duplicated values, and plm then sees duplicate (firm, year) pairs. Remove the duplicated values so that the index is unique before passing the data to plm.
test <- function(data,original) {
coef(plm(wage~emp+sector,data = data[unique(original),],
index=c("firm","year"),model="within"))
}
boot(EmplUK, test, R = 100)
## ORDINARY NONPARAMETRIC BOOTSTRAP
## Call:
## boot(data = EmplUK, statistic = test, R = 100)
## Bootstrap Statistics :
## original bias std. error
## t1* -0.1198127 -0.01255009 0.05269375

bootstrap proportion confidence interval

I would like to produce confidence intervals for proportions using the boot package if possible.
I have a vector and I would like to set a threshold and then calculate the proportions below the specified level.
After that I would like to use the bootstrap function in the boot package to calculate the confidence intervals for the proportions.
Simple example of what I have so far:
library(boot)
vec <- abs(rnorm(1000)*10) #generate example vector
data_to_tb <- vec
tb <- function(data) {
sum(data < 10, na.rm = FALSE)/length(data) #function for generating the proportion
}
tb(data_to_tb)
boot(data = data_to_tb, statistic = tb, R = 999)
quantile(boot.out$t, c(.025,.975))
However, I get this error message:
> boot(data = data_to_tb, statistic = tb, R = 999)
Error in statistic(data, original, ...) : unused argument (original)
I can not get it to work though, help appreciated
Your problem is your function tb: it needs two arguments, the data and the indices that define the bootstrap sample. From the help file ?boot:
statistic A function which when applied to data returns a vector
containing the statistic(s) of interest. When sim = "parametric", the
first argument to statistic must be the data. For each replicate a
simulated dataset returned by ran.gen will be passed. In all other
cases statistic must take at least two arguments.
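A minimal sketch of the corrected version follows. The name boot_out and the seed are illustrative choices. The proportion is written as a mean of a logical vector, which is equivalent to sum(...)/length(...) in the question.

```r
library(boot)

set.seed(123)
vec <- abs(rnorm(1000) * 10)  # example vector, as in the question

# Two arguments: the data and the resampling indices supplied by boot
tb <- function(data, indices) {
  mean(data[indices] < 10)    # proportion of resampled values below 10
}

boot_out <- boot(data = vec, statistic = tb, R = 999)

# Percentile interval by hand, and the same type via boot.ci
quantile(boot_out$t, c(0.025, 0.975))
boot.ci(boot_out, type = "perc")
```

Note that the result of boot has to be assigned to an object before its replicates (the t component) can be used.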
