bootstrap proportion confidence interval

I would like to produce confidence intervals for proportions, using the boot package if possible.
I have a vector, and I would like to set a threshold and calculate the proportion of values below it.
Then I would like to use the boot function in the boot package to calculate a confidence interval for that proportion.
A simple example of what I have so far:
library(boot)
vec <- abs(rnorm(1000)*10) #generate example vector
data_to_tb <- vec
tb <- function(data) {
  sum(data < 10, na.rm = FALSE)/length(data) # proportion of values below the threshold
}
tb(data_to_tb)
boot(data = data_to_tb, statistic = tb, R = 999)
quantile(boot.out$t, c(.025,.975))
However, I get this error message:
> boot(data = data_to_tb, statistic = tb, R = 999)
Error in statistic(data, original, ...) : unused argument (original)
I cannot get it to work; any help is appreciated.

The problem is your function tb: boot() requires the statistic to take two arguments, the data and a vector of indices. From the help file ?boot:
statistic A function which when applied to data returns a vector
containing the statistic(s) of interest. When sim = "parametric", the
first argument to statistic must be the data. For each replicate a
simulated dataset returned by ran.gen will be passed. In all other
cases statistic must take at least two arguments.
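Applied to the example in the question, a minimal sketch of the fix might look like this. The argument name idx and the set.seed() call are my additions; boot() only cares about the position of the two arguments.

```r
library(boot)

set.seed(1)                       # added only so the sketch is reproducible
vec <- abs(rnorm(1000) * 10)      # example vector, as in the question

# The statistic receives the original data and a vector of resampled indices.
tb <- function(data, idx) {
  mean(data[idx] < 10)            # proportion of resampled values below 10
}

boot.out <- boot(data = vec, statistic = tb, R = 999)

# Percentile interval taken directly from the bootstrap replicates
quantile(boot.out$t, c(0.025, 0.975))

# The package's own percentile interval
boot.ci(boot.out, type = "perc")
```

Assigning the result to boot.out is what makes boot.out$t available; the two intervals should be close but need not match exactly.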

Related

Error with svyglm function in survey package in R: "all variables must be in design=argument"

New to stackoverflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.

Using some make-believe sample data I was able to get your model to run by setting rescale = TRUE. The documentation states
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use FALSE to not rescale
weights.
So one solution may be simply to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
                         ElUnder = sample(0:1, 1000, replace = TRUE),
                         NHISPID = paste0("id", 1:1000),
                         SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with rescale = TRUE
SAMPdesign <- svydesign(ids = ~NHISPID,
                        data = Under_narm,
                        weights = ~SAMPWEIGHT)
El_under_glm <- svyglm(formula = ElUnder ~ SO2,
                       design = SAMPdesign,
                       family = quasibinomial(), # this family avoids warnings
                       rescale = TRUE) # weights rescaled to sum to the sample size
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking at the code for this function's method with survey:::svyglm.survey.design, it seems there may be a bug. I could be wrong, but by my reading, when rescale is FALSE, .survey.prob.weights never gets assigned a value.
if (is.null(g$weights))
g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
There may be a workaround: assign a numeric vector to .survey.prob.weights in the global environment. I have no idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs one value per row of the design data. That is 2000 here, double the 1000 values I sampled, because data.frame() recycles the shorter columns to match SO2.)
SAMPdesign <- svydesign(ids = ~NHISPID,
                        data = Under_narm,
                        weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm <- svyglm(formula = ElUnder ~ SO2,
                       design = SAMPdesign,
                       family = quasibinomial(),
                       rescale = FALSE)
summary(El_under_glm, correlation = TRUE)

R is only returning non-zero coefficient estimates when using the "poly" function to generate predictors. How do I get the zero values into a vector?

I'm using regsubsets from the leaps library to perform best subset selection. For each number of predictors, I need to compare the coefficients it generates to the "true" coefficients I specified when simulating the data (by comparison I mean the square root of the sum of the squared differences between them).
Since regsubsets generated 16 different models, I use a loop to do this automatically. It would work, except that when I extract the coefficients from the best model with a given number of predictors, I only get the non-zero coefficients of the polynomial fit. This makes the coefi vector shorter than the truecoef vector of true coefficients.
If I could somehow force the model to return all coefficients, I wouldn't have an issue, but after looking extensively I don't know how to do that.
Alternative ways of solving this problem would also be appreciated.
library(leaps)
regfit.train=regsubsets(y ~ poly(x,25, raw = TRUE), data=mydata[train,], nvmax=25)
truecoef = c(3,0,-7,4,-2,8,0,-5,0,2,0,4,5,6,3,2,2,0,3,1,1)
coef.errors = rep(NA, 16)
for (i in 1:16) {
coefi = coef(regfit.train, id=i)
coef.errors[i] = mean((truecoef-coefi)^2)
}
The quantity I'm trying to estimate, where j indexes the coefficients and r refers to the best model containing r coefficients:
$$\sqrt{\sum_{j} \left(\beta_j - \hat{\beta}_j^{r}\right)^2}$$
Thanks!

This is how I ended up solving it (with some help):
The loop checks which coefficients are present in each model and performs the subtraction for those; coefficients that are absent are assumed to be zero.
truecoef = c(3,0,-7,4,-2,8,0,-5,0,2,0,4,5,6,3,2,2,0,3,1,1)
val.errors = rep(NA, 16)
x_cols = colnames(x, do.NULL = FALSE, prefix = "x.")
for (i in 1:16) {
  coefis = coef(regfit.train, id = i)
  # squared error over the selected coefficients, plus the squared true values
  # of the coefficients the model dropped (their estimates are taken as zero)
  val.errors[i] = sqrt(sum((truecoef[x_cols %in% names(coefis)] -
    coefis[names(coefis) %in% x_cols])^2) + sum(truecoef[!(x_cols %in% names(coefis))]^2))
}
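An alternative way to express the same idea is to pad each coefficient vector with zeros by name, so it always lines up with the true coefficients. The sketch below uses made-up names and values (all_names, truecoef and coefi are hypothetical stand-ins, with coefi playing the role of coef(regfit.train, id = i)), so it runs without leaps.

```r
# Hypothetical full set of coefficient names and their true values
all_names <- c("(Intercept)", paste0("x", 1:5))
truecoef  <- setNames(c(3, 0, -7, 4, -2, 8), all_names)

# Stand-in for coef(regfit.train, id = i): only the selected terms are returned
coefi <- c("(Intercept)" = 2.9, x2 = -6.8, x5 = 7.7)

# Start from a zero vector with every name, then fill in the estimated terms
full <- setNames(numeric(length(all_names)), all_names)
full[names(coefi)] <- coefi

# Root of the summed squared differences over all coefficients
err <- sqrt(sum((truecoef - full)^2))
err
```

Because the matching is done by name, the order in which the model reports its coefficients does not matter.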

Bootstrapping function with data.table

I have been trying to write a function that takes the results of a simple regression model and calculates Glass's delta effect size. That was easy.
The problem is that I would now like to calculate confidence intervals for this value, and I keep getting an error when I use the function with the boot library.
I have tried to follow this answer, but with no success.
As an example I am going to use a Stata dataset:
library(data.table)
webclass <- readstata13::read.dta13("http://www.stata.com/videos13/data/webclass.dta")
#estimate impact
M0<-lm(formula = math ~ treated ,data = webclass)
######################################
##### Effect Size ######
## Glass's delta=(M1-M2)/SD2 ##
####################################
ESdelta<-function(regmodel,yvar,tvar,msg=TRUE){
  Data<-regmodel$model
  setDT(Data)
  meanT<-mean(Data[get(tvar)=="Treated",get(yvar)])
  meanC<-mean(Data[get(tvar)=="Control",get(yvar)])
  sdC<-sd(Data[get(tvar)=="Control",get(yvar)])
  ESDelta<-(meanT-meanC)/sdC
  if (msg==TRUE) {
    cat(paste("the average scores of the variable-",yvar,"-differ by approximately",round(ESDelta,2),"standard deviations"))
  }
  return(ESDelta)
}
ESdelta(M0,"math","treated",msg = F)
#0.7635896
Now when I try to use the boot function I get the following error:
boot::boot(M0, statistic=ESdelta, R=50,"math","treated")
#Error in match.arg(stype) : 'arg' should be one of “i”, “f”, “w”
Thanks

In the boot manual (type ?boot):
statistic: [...] The first argument passed will always be the original
data. The second will be a vector of indices, frequencies or weights
which define the bootstrap sample.
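You cannot bootstrap a model object directly, so modify your function to take the data.table and a vector of indices as its first two arguments. Any other arguments to the function must come after these and be passed to boot() by name; unnamed extras such as "math" and "treated" are matched positionally to boot()'s own sim and stype arguments, which is where the match.arg(stype) error comes from.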
library(boot)
library(data.table)
ESdelta<-function(Data,inds,yvar,tvar,msg=TRUE){
  Data = Data[inds,] # keep only the rows drawn for this bootstrap replicate
  meanT<-mean(Data[get(tvar)=="Treated",get(yvar)])
  meanC<-mean(Data[get(tvar)=="Control",get(yvar)])
  sdC<-sd(Data[get(tvar)=="Control",get(yvar)])
  ESDelta<-(meanT-meanC)/sdC
  if (msg==TRUE) {
    cat(paste("the average scores of the variable-",yvar,"-differ by approximately",round(ESDelta,2),"standard deviations"))
  }
  return(ESDelta)
}
Dat <- setDT(M0$model)
bo = boot(Dat, statistic=ESdelta, R=50,yvar="math",tvar="treated",msg=FALSE)
> bo
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Dat, statistic = ESdelta, R = 50, yvar = "math",
tvar = "treated", msg = FALSE)
Bootstrap Statistics :
original bias std. error
t1* 0.7635896 0.05685514 0.4058304
You can get the confidence intervals with:
boot.ci(bo)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 50 bootstrap replicates
CALL :
boot.ci(boot.out = bo)
Intervals :
Level Normal Basic
95% (-0.0887, 1.5021 ) (-0.8864, 1.5398 )
Level Percentile BCa
95% (-0.0126, 2.4136 ) (-0.1924, 1.7579 )
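For readers who want to try the pattern without downloading the Stata file, here is a rough equivalent using a plain data.frame and simulated scores. The data, the function name glass_delta and the choice of R = 500 are all assumptions made for illustration; only the calling convention (data first, indices second, extra arguments passed by name) comes from the answer above.

```r
library(boot)

set.seed(42)
# Simulated stand-in for the webclass data (hypothetical values)
dat <- data.frame(
  treated = rep(c("Control", "Treated"), each = 100),
  math    = c(rnorm(100, mean = 50, sd = 10), rnorm(100, mean = 57, sd = 10))
)

# Glass's delta: difference in group means divided by the control-group SD
glass_delta <- function(d, idx, yvar, tvar) {
  d   <- d[idx, ]                              # rows drawn for this replicate
  y_t <- d[[yvar]][d[[tvar]] == "Treated"]
  y_c <- d[[yvar]][d[[tvar]] == "Control"]
  (mean(y_t) - mean(y_c)) / sd(y_c)
}

# Extra arguments are passed by name so they reach the statistic via ...
bo <- boot(dat, statistic = glass_delta, R = 500, yvar = "math", tvar = "treated")
boot.ci(bo, type = "perc")
```

With only 50 replicates, as in the output above, the intervals are very unstable; a larger R is usually worth the extra run time.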

Error when bootstrapping large n with boot package (error: integer overflow)

Why can't I bootstrap a statistic with large n using the boot package? 150,000 observations is not that large, so I don't know why this isn't working.
Example
library(boot)
bs <- boot(rnorm(150000), sum, R = 1000)
bs
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = rnorm(150000), statistic = sum, R = 1000)
Bootstrap Statistics :
WARNING: All values of t1* are NA
Error Message
In statistic(data, i[r, ], ...) : integer overflow - use sum(as.numeric(.))

You're not using boot() as documented (which is, admittedly, surprisingly complex). From ?boot:
In all other cases ‘statistic’
must take at least two arguments. The first argument passed
will always be the original data. The second will be a
vector of indices, frequencies or weights which define the
bootstrap sample.
I think you want:
bsum <- function(x,i) sum(x[i])
bs <- boot(rnorm(150000), bsum, R = 1000)
I haven't taken the time to figure out what boot() is actually doing in your case; it is almost certainly not what you want, though.
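One plausible reading of the original failure, offered as an assumption rather than something checked against the boot source: since boot() calls statistic(data, indices), passing sum directly adds the integer indices to the data. With 150,000 indices per replicate, each averaging around 75,000, that integer total is roughly 1e10, well beyond the 32-bit integer limit, which would fit the overflow warning. The sketch below shows the effect on a tiny input and then the corrected statistic; the seed and the small R are my choices to keep it quick.

```r
library(boot)

# What statistic = sum effectively computes: the data plus the indices
sum(c(0.5, 0.25), 1:3)   # 0.75 + 6 = 6.75

set.seed(1)
x <- rnorm(150000)

# Corrected statistic: use the indices to subset the data, then sum
bsum <- function(x, i) sum(x[i])

bs <- boot(x, bsum, R = 100)   # R kept small here only to keep the sketch quick
range(bs$t)
```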

How can I use pre bootstrapped data to obtain a BCa CI?

I have bootstrapped two variables (one of which is already in the "Impala.csv" file) using a loop that resamples and records the mean of a sample the size of nrow(data), for 5000 repetitions. The code is as follows:
data<-read.csv("Impala.csv")
allo<-data$distance
data2<-read.csv("2010 - IM.csv")
pro<-data2$pro
n1<-nrow(data2)
boot4000 <- c()
for(i in 1:5000){
  s <- sample(data2$xs,n1,replace=T,prob = data2$pro)
  boot4000[i] <- mean(s)
}
I then combine the two outputs in a formula, giving me 5000 new values.
d<-(pi/2)*(boot4000*(1/allo))
Now I wish to find the BCa confidence interval for this. As I understand it, the boot function would require me to generate a new set of resamples, which I do not want because the bootstrapping is already complete. All I want is a function that takes my bootstrapped data as is and determines the BCa confidence interval.
Here are the data files I have used:
http://www.filedropper.com/impala
http://www.filedropper.com/2010-im
Also, I have tried to create an object imitating a 'boot' object using the following
den<-as.matrix(d, ncol=1)
outs<-list(t0=mean(d), t=den, R=5000, L=3)
boot.ci(outs, type="bca")
This spits out the error:
Error in if (as.character(boot.out$call[1L]) == "tsboot") warning("BCa intervals not defined for time series bootstraps") else output <- c(output, : argument is of length zero

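outs <- list(t0=mean(d), t=den, R=5000, sim="ordinary",
             stype="i", weights=rep(0.0002,5000), statistic=meanfun,
             data=d, call=quote(boot(data=d, statistic=meanfun, R=5000)),
             strata=rep(1,5000), seed=.Random.seed)
class(outs) <- "boot"
This is one way to build an object of class "boot" by hand. Note that meanfun must already be defined as a statistic taking the data and an index vector, and that call is stored as an unevaluated call so that boot() is not run again.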
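If the goal is only the interval, another option is to compute BCa limits directly from the existing replicates with the standard bias-correction and acceleration formulas, without going through boot.ci. The sketch below is written under stated assumptions: the sample x, the statistic stat and the helper bca_from_replicates are hypothetical, the acceleration is estimated by a leave-one-out jackknife (so the original sample and statistic must still be available), and the resampling is unweighted, unlike the prob = weighting in the question.

```r
set.seed(123)
x    <- rexp(200)                     # hypothetical original sample
stat <- function(v) mean(v)           # hypothetical statistic
t0   <- stat(x)                       # observed value

# Stand-in for replicates that were computed earlier
t_boot <- replicate(5000, stat(sample(x, replace = TRUE)))

# BCa limits from existing replicates plus jackknife values of the statistic
bca_from_replicates <- function(t_boot, t0, jack, conf = 0.95) {
  alpha <- c((1 - conf) / 2, 1 - (1 - conf) / 2)
  z0 <- qnorm(mean(t_boot < t0))              # bias-correction constant
  d  <- mean(jack) - jack
  a  <- sum(d^3) / (6 * sum(d^2)^1.5)         # acceleration constant
  z  <- qnorm(alpha)
  adj <- pnorm(z0 + (z0 + z) / (1 - a * (z0 + z)))  # adjusted quantile levels
  quantile(t_boot, adj, names = FALSE)
}

# Leave-one-out jackknife values of the statistic
jack <- vapply(seq_along(x), function(i) stat(x[-i]), numeric(1))

ci <- bca_from_replicates(t_boot, t0, jack)
ci
```

For a statistic built from two resampled quantities, as in the question, the jackknife step would need to be adapted to however the original data enter that statistic.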