Bootstrap t test in R without reliance on MKinfer package - r

I would like to get CI for a paired t-test using bootstrap in R. Unfortunately, I dont have privileges to install the MKinfer. Using MKinfer I would (like here: T-test with bootstrap in R):
boot.t.test(
x = iris["Petal.Length"],
y = iris["Sepal.Length"],
alternative = c("two.sided"),
mu = 0,
#paired = TRUE,
conf.level = 0.95,
R = 9999
)
How would I do this for paired data with CI's and p-values not relying on MKinfer (relying on boot would be fine)?

Here is an example using boot using R = 1000 bootstrap replicates
library(boot)
x <- iris$Petal.Length
y <- iris$Sepal.Length
change_in_mean <- function(df, indices) t.test(
df[indices, 1], df[indices, 2], paired = TRUE, var.equal = FALSE)$estimate
model <- boot(
data = cbind(x, y),
statistic = change_in_mean,
R = 1000)
We can calculate the confidence interval of the estimated change in the mean using boot.ci
boot.ci(model, type = "norm")
#BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#Based on 1000 bootstrap replicates
#
#CALL :
#boot.ci(boot.out = model, type = "norm")
#
#Intervals :
#Level Normal
#95% (-2.262, -1.905 )
#Calculations and Intervals on Original Scale
Note that this is very close to the CI reported by t.test
t.test(x, y, paired = TRUE, var.equal = FALSE)
#
# Paired t-test
#
#data: x and y
#t = -22.813, df = 149, p-value < 2.2e-16
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -2.265959 -1.904708
#sample estimates:
#mean of the differences
# -2.085333

Related

Confidence interval for quantile regression using bootstrap

I am trying to get the five types of bootstrap intervals for linear and quantile regression. I was able to bootstrap and find the 5 boostrap intervals (Quantile,Normal,Basic,Studentized and BCa) for the linear regression using Boot from car and boot.ci from boot. When i tried to do the same for quantile regression using rq from quantreg, it throws up an error. Here is the sample code
Creating the model
library(car)
library(quantreg)
library(boot)
newdata = Prestige[,c(1:4)]
education.c = scale(newdata$education, center=TRUE, scale=FALSE)
prestige.c = scale(newdata$prestige, center=TRUE, scale=FALSE)
women.c = scale(newdata$women, center=TRUE, scale=FALSE)
new.c.vars = cbind(education.c, prestige.c, women.c)
newdata = cbind(newdata, new.c.vars)
names(newdata)[5:7] = c("education.c", "prestige.c", "women.c" )
mod1 = lm(income ~ education.c + prestige.c + women.c, data=newdata)
mod2 = rq(income ~ education.c + prestige.c + women.c, data=newdata)
Booting linear and quantile regression
mod1.boot <- Boot(mod1, R=999)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
boot.ci(mod2.boot, level = .95, type = "all")
Error in if (ncol(boot.out$t) < max(index)) { :
argument is of length zero
1) Why does boot.ci not work for quantile regression
2)Using this solution I got from stackexchange, I was able to find the quantile CI.
Solution for quantile(percentile CI) for rq
t(apply(mod2.boot$B, 2, quantile, c(0.025,0.975)))
how do i obtain other CI for bootstrap (normal, basic, studentized, BCa).
3) Also, my boot.ci command for linear regression produces this warning
Warning message:
In sqrt(tv[, 2L]) : NaNs produced
What does this signify?
Using summary.rq you can calculate boostrap standard errors of model coefficients.
Five boostrap methods (bsmethods) are available (see ?boot.rq).
summary(mod2, se = "boot", bsmethod= "xy")
# Call: rq(formula = income ~ education.c + prestige.c + women.c, data = newdata)
#
# tau: [1] 0.5
#
# Coefficients:
# Value Std. Error t value Pr(>|t|)
# (Intercept) 6542.83599 139.54002 46.88860 0.00000
# education.c 291.57468 117.03314 2.49139 0.01440
# prestige.c 89.68050 22.03406 4.07009 0.00010
# women.c -48.94856 5.79470 -8.44712 0.00000
To calculate bootstrap confidence intervals, you can use the following trick:
mod1.boot <- Boot(mod1, R=999)
set.seed(1234)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
set.seed(1234)
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
# Create an object with the same structure of mod1.boot
# but with boostrap replicates given by boot.rq
mod3.boot <- mod1.boot
mod3.boot$R <- 10000
mod3.boot$t0 <- coef(mod2)
mod3.boot$t <- mod2.boot$B
boot.ci(mod3.boot, level = .95, type = "all")
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 10000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = mod3.boot, type = "all", level = 0.95)
#
# Intervals :
# Level Normal Basic Studentized
# 95% (6293, 6838 ) (6313, 6827 ) (6289, 6941 )
#
# Level Percentile BCa
# 95% (6258, 6772 ) (6275, 6801 )
Thanks for everyone who helped. I was able to figure out the solution myself. I ran a loop calculating the coefficients of the quantile regression and then used boot and boot.ci respectively. Here is the code
Booting commands only, model creation from question
mod3 <- formula(income ~ education.c + prestige.c + women.c)
coefsf <- function(data,ind){
rq(mod3, data=newdata[ind,])$coef
}
boot.mod <- boot(newdata,coefsf,R=10000)
myboot.ci <- list()
for (i in 1:ncol(boot.mod$t)){
myboot.ci[[i]] <- boot.ci(boot.mod, level = .95, type =
c("norm","basic","perc", "bca"),index = i)
}
I did this as I wanted CI on all variables not just the intercept.

R : Calculating p-value using simulations

I wrote this code to run a test statistic on two randomly distributed observations x and y
mean.test <- function(x, y, B=10000,
alternative=c("two.sided","less","greater"))
{
p.value <- 0
alternative <- match.arg(alternative)
s <- replicate(B, (mean(sample(c(x,y), B, replace=TRUE))-mean(sample(c(x,y), B, replace=TRUE))))
t <- mean(x) - mean(y)
p.value <- 2*(1- pnorm(abs(quantile(T,0.01)), mean = 0, sd = 1, lower.tail =
TRUE, log.p = FALSE)) #try to calculate p value
data.name <- deparse(substitute(c(x,y)))
names(t) <- "difference in means"
zero <- 0
names(zero) <- "difference in means"
return(structure(list(statistic = t, p.value = p.value,
method = "mean test", data.name = data.name,
observed = c(x,y), alternative = alternative,
null.value = zero),
class = "htest"))
}
the code uses a Monte-Carlo simulations to generate the distribution function of the test statistic mean(x) - mean(y) and then calculates the p-value, but apparently i miss defined this p-value because for :
> set.seed(0)
> mean.test(rnorm(1000,3,2),rnorm(2000,4,3))
the output should look like:
mean test
data: c(rnorm(1000, 3, 2), rnorm(2000, 4, 3))
difference in means = -1.0967, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
but i got this instead:
mean test
data: c(rnorm(1000, 3, 2), rnorm(2000, 4, 3))
difference in means = -1.0967, p-value = 0.8087
alternative hypothesis: true difference in means is not equal to 0
can someone explain the bug to me ?
As far as I can tell, your code has numerous mistakes and errors in it:
quantile(T, 0.01) - here T == TRUE, so you're calculating the quantile of 1.
The object s is never used.
mean(sample(c(x,y), B, replace=TRUE)) What are you trying to do here? The c() function combines x and y. Sampling makes no sense since you don't know what population they come from
When you calculate the test statistic t, it should depend on the variance (and sample size).

Bootstrap parameter estimate of non-linear optimization in R: Why is it different than the regular parameter estimate?

Here's the short version of my question. The code is below.
I calculated the parameters for the non-linear von Bertalanffy growth equation in R using optim(), and now I am trying to add 95% confidence intervals to the von B growth coefficient K by bootstrapping. For at least one of the years of data, when I summarize the bootstrapped output of the growth coefficient K, the mean and median parameter estimates from bootstrapping are quite different than the estimated parameter:
>summary(temp.store) # summary of bootstrap values
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.002449 0.005777 0.010290 0.011700 0.016970 0.056720
> est.K [1] 0.01655956 # point-estimate from the optimization
I suspect the discrepancy is because there are errors in the bootstrap of the random draw that bias the result, although I have used try() to stop the optimization from crashing when there is a combination of input values that cause an error. So I would like to know what to do to fix that issue. I think I'm doing things correctly, because the fitted curve looks right.
Also, I have run this code for data from other years, and in at least one other year, the bootrap estimate and the regular estimate are very close.
Long-winded version:
The von Bertalanffy growth curve (VBGC) for length is given by:
L(t) = L.inf * [1 - exp(-K*(t-t0))] (Eq. 3.1.0.1, from FAO)
where L(t) is the fish's length, L.inf is the asymptotic maximum length, K is the growth coefficient, t is the time step and t0 is when growth began. L(t) and t are the observed data. Usually time or age is measured in years, but here I am looking at juvenile fish data and I have made t the day the of year ("doy") starting with January 1 = 1.
To estimate the starting parameters for the optimization, I have used a linearization of the VBGC equation.
doy <- c(156,205,228,276,319,380)
len <- c(36,56,60,68,68,71)
data06 <- data.frame(doy,len)
Function to get starting parameters for the optimization:
get.init <-function(dframe){ # linearization of the von B
l.inf <- 80 # by eyeballing max juvenile fish
# make a response variable and store it in the data frame:
# Eqn. 3.3.3.1 in FAO document
dframe$vonb.y <- - log(1 - (dframe$len)/l.inf )
lin.vonb <- lm(vonb.y ~ doy, data=dframe)
icept <- lin.vonb$coef[1] # 0.01534013 # intercept is a
slope <- k.lin <- lin.vonb$coef[2] # slope is the K param
t0 <- - icept/slope # get t0 from this relship: intercept= -K * t0
pars <- c(l.inf,as.numeric(slope),as.numeric(t0))
}
Sums of squares for von Bertalanffy growth equation
vbl.ssq <- function(theta, data){
linf=theta[1]; k=theta[2]; t0=theta[3]
# name variables for ease of use
obs.length=data$len
age=data$doy
#von B equation
pred.length=linf*(1-exp(-k*(age-t0)))
#sums of squares
ssq=sum((obs.length-pred.length)^2)
}
Estimate parameters
#Get starting parameter values
theta_init <- get.init(dframe=data06)
# optimize VBGC by minimizing sums of square differences
len.fit <- optim(par=theta_init, fn=vbl.ssq, method="BFGS", data=data06)
est.linf <- len.fit$par[1] # vonB len-infinite
est.K <- len.fit$par[2] # vonB K
est.t0 <- len.fit$par[3] # vonB t0
Bootstrapping
# set up for bootstrap loop
tmp.frame <- data.frame()
temp.store <- vector()
# bootstrap to get 95% conf ints on growth coef K
for (j in 1:1000){
# choose indices at random, with replacement
indices <- sample(1:length(data06[,1]),replace=T)
# values from original data corresponding to those indices
new.len <- data06$len[indices]
new.doy <- data06$doy[indices]
tmp.frame <- data.frame(new.doy,new.len)
colnames(tmp.frame) <- c("doy","len")
init.par <- get.init(tmp.frame)
# now get the vonB params for the randomly selected samples
# using try() to keep optimizing errors from crashing the program
try( len.fit.bs <- optim(par=init.par, fn=vbl.ssq, method="BFGS", data=tmp.frame))
tmp.k <- len.fit.bs$par[2]
temp.store[j] <- tmp.k
}
95% confidence interval for K parameter
k.ci <- quantile(temp.store,c(0.025,0.975))
# 2.5% 97.5%
#0.004437702 0.019784178
Here's the problem:
#>summary(temp.store)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.002449 0.005777 0.010290 0.011700 0.016970 0.056720
#
# est.K [1] 0.01655956
Example of error:
Error in optim(par = init.par, fn = vbl.ssq, method = "BFGS", data = tmp.frame) :
non-finite finite-difference value [2]
I don't believe I am making any errors with the optimization because the VBGC fit looks reasonable. Here are the plots:
plot(x=data06$doy,y=data06$len,xlim=c(0,550),ylim=c(0,100))
legend(x="topleft",legend=paste("Length curve 2006"), bty="n")
curve(est.linf*(1-exp(-est.K*(x-est.t0))), add=T,type="l")
plot(x=2006,y=est.K, main="von B growth coefficient for length; 95% CIs",
ylim=c(0,0.025))
arrows(x0=2006,y0=k.ci[1],x1=2006,y1=k.ci[2], code=3,
angle=90,length=0.1)
First of all, you have a very small number of values, possibly too few to trust the bootstrap method. Then a high proportion of fits fails for the classic bootstrap, because due to the resampling you often have not enough distinct x values.
Here is an implementation using nls with a selfstarting model and the boot package.
doy <- c(156,205,228,276,319,380)
len <- c(36,56,60,68,68,71)
data06 <- data.frame(doy,len)
plot(len ~ doy, data = data06)
fit <- nls(len ~ SSasympOff(doy, Asym, lrc, c0), data = data06)
summary(fit)
#profiling CI
proCI <- confint(fit)
# 2.5% 97.5%
#Asym 68.290477 75.922174
#lrc -4.453895 -3.779994
#c0 94.777335 126.112523
curve(predict(fit, newdata = data.frame(doy = x)), add = TRUE)
#classic bootstrap
library(boot)
set.seed(42)
boot1 <- boot(data06, function(DF, i) {
tryCatch(coef(nls(len ~ SSasympOff(doy, Asym, lrc, c0), data = DF[i,])),
error = function(e) c(Asym = NA, lrc = NA, c0 = NA))
}, R = 1e3)
#proportion of unsuccessful fits
mean(is.na(boot1$t[, 1]))
#[1] 0.256
#bootstrap CI
boot1CI <- apply(boot1$t, 2, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE)
# [,1] [,2] [,3]
#2.5% 69.70360 -4.562608 67.60152
#50% 71.56527 -4.100148 113.9287
#97.5% 74.79921 -3.697461 151.03541
#bootstrap of the residuals
data06$res <- residuals(fit)
data06$fit <- fitted(fit)
set.seed(42)
boot2 <- boot(data06, function(DF, i) {
DF$lenboot <- DF$fit + DF[i, "res"]
tryCatch(coef(nls(lenboot ~ SSasympOff(doy, Asym, lrc, c0), data = DF)),
error = function(e) c(Asym = NA, lrc = NA, c0 = NA))
}, R = 1e3)
#proportion of unsuccessful fits
mean(is.na(boot2$t[, 1]))
#[1] 0
#(residuals) bootstrap CI
boot2CI <- apply(boot2$t, 2, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE)
# [,1] [,2] [,3]
#2.5% 70.19380 -4.255165 106.3136
#50% 71.56527 -4.100148 113.9287
#97.5% 73.37461 -3.969012 119.2380
proCI[2,1]
CIs_k <- data.frame(lwr = c(exp(proCI[2, 1]),
exp(boot1CI[1, 2]),
exp(boot2CI[1, 2])),
upr = c(exp(proCI[2, 2]),
exp(boot1CI[3, 2]),
exp(boot2CI[3, 2])),
med = c(NA,
exp(boot1CI[2, 2]),
exp(boot2CI[2, 2])),
estimate = exp(coef(fit)[2]),
method = c("profile", "boot", "boot res"))
library(ggplot2)
ggplot(CIs_k, aes(y = estimate, ymin = lwr, ymax = upr, x = method)) +
geom_errorbar() +
geom_point(aes(color = "estimate"), size = 5) +
geom_point(aes(y = med, color = "boot median"), size = 5) +
ylab("k") + xlab("") +
scale_color_brewer(name = "", type = "qual", palette = 2) +
theme_bw(base_size = 22)
As you see, the bootstrap CI is wider than the profile CI and bootstrapping the residuals results in a more narrow estimated CI. All of them are almost symmetric. Furthermore, the medians are close to the point estimates.
As a first step of investigating what goes wrong in your code, you should look at the proportion of failed fits from your procedure.

Kolmogorov-Smirnov Tests for Laplace destribution R

I have a problem with ks function in R. I have a Laplace Distribution:
ldes <- function(y, a) {
if(y < 0.5) 1/a*log(2*y, 2)
else 1/a*log(2*(1-y), 2)
}
a <- 1
set.seed(1)
y = runif(1000, 0, 1)
ld <- ldes(y, a)
So, I need to do the ks test, but can't find anything about second parameter that should be in there, like:
ks.test(my_lnorm, **plnorm**, mean = -5, sd = 5)
for Lognormal Destribution or:
ks.test(my_log, **plogis**, location = 2, scale = 3)
for Logistics Destribution
Thanks.
You can try some package for the laplace distribution, for example disclap (if it satisfies our need, otherwise some continuous analog).
library(disclap)
ks.test(ld, "pdisclap", 0.5) # choose the right value of parameter p (p=0.5 is arbitrary)
One-sample Kolmogorov-Smirnov test
data: ld
D = 0.3333, p-value < 2.2e-16
alternative hypothesis: two-sided
As can be seen from the result of the hypothesis test, the null hypothesis (that the samples are drawn from the same population distribution) is rejected.
y2 <- rdisclap(1000, p=0.5) # generate some simulated datapoints
plot(ecdf(ld), xlim = range(c(ld, y2))) # compare ecdfs
plot(ecdf(y2), add = TRUE, lty = "dashed")

Confidence Intervals for Lethal Dose (LD) for Logistic Regression in R

I want to find Lethal Dose (LD50) with its confidence interval in R. Other softwares line Minitab, SPSS, SAS provide three different versions of such confidence intervals. I could not find such intervals in any package in R (I also used findFn function from sos package).
How can I find such intervals? I coded for one type of intervals based on Delta method (as not sure about it correctness) but would like to use any established function from R package. Thanks
MWE:
dose <- c(10.2, 7.7, 5.1, 3.8, 2.6, 0)
total <- c(50, 49, 46, 48, 50, 49)
affected <- c(44, 42, 24, 16, 6, 0)
finney71 <- data.frame(dose, total, affected)
fm1 <- glm(cbind(affected, total-affected) ~ log(dose),
family=binomial(link = logit), data=finney71[finney71$dose != 0, ])
summary(fm1)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.886912 0.6429272 -7.601035 2.937717e-14
log(dose) 3.103545 0.3877178 8.004650 1.198070e-15
library(MASS)
xp <- dose.p(fm1, p=c(0.50, 0.90, 0.95)) # from MASS
xp.ci <- xp + attr(xp, "SE") %*% matrix(qnorm(1 - 0.05/2)*c(-1,1), nrow=1)
zp.est <- exp(cbind(xp, attr(xp, "SE"), xp.ci[,1], xp.ci[,2]))
dimnames(zp.est)[[2]] <- c("LD", "SE", "LCL","UCL")
zp.est
LD SE LCL UCL
p = 0.50: 4.828918 1.053044 4.363708 5.343724
p = 0.90: 9.802082 1.104050 8.073495 11.900771
p = 0.95: 12.470382 1.133880 9.748334 15.952512
From the package drc, you can get the ED50 (same calculation), along with confidence intervals.
library(drc) # Directly borrowed from the drc manual
mod <- drm(affected/total ~ dose, weights = total,
data = finney71[finney71$dose != 0, ], fct = LL2.2(), type = "binomial")
#intervals on log scale
ED(mod, c(50, 90, 95), interval = "fls", reference = "control")
Estimated effective doses
(Back-transformed from log scale-based confidence interval(s))
Estimate Lower Upper
1:50 4.8289 4.3637 5.3437
1:90 9.8021 8.0735 11.9008
1:95 12.4704 9.7483 15.9525
Which matches the manual output.
The "finney71" data is included in this package, and your calculation of confidence intervals exactly matches the example given by the drc folks, down to the "# from MASS" comment. You should give credit to them, rather than claiming you wrote the code.
There's a few other ways to figure this out. One is using parametric bootstrap, which is conveniently available through the boot package.
First, we'll refit the model.
library(boot)
finney71 <- finney71[finney71$dose != 0,] # pre-clean data
fm1 <- glm(cbind(affected, total-affected) ~ log(dose),
family=binomial(link = logit),
data=finney71)
And for illustration, we can figure out the LD50 and LD75.
statfun <- function(dat, ind) {
mod <- update(fm1, data = dat[ind,])
coefs <- coef(mod)
c(exp(-coefs[1]/coefs[2]),
exp((log(0.75/0.25) - coefs[2])/coefs[1]))
}
boot_out <- boot(data = finney71, statistic = statfun, R = 1000)
The boot.ci function can work out a variety of confidence intervals for us, using this object.
boot.ci(boot_out, index = 1, type = c('basic', 'perc', 'norm'))
##BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
##Based on 999 bootstrap replicates
##
##CALL :
##boot.ci(boot.out = boot_out, type = c("basic", "perc", "norm"),
## index = 1)
##Intervals :
##Level Normal Basic Percentile
##95% ( 3.976, 5.764 ) ( 4.593, 5.051 ) ( 4.607, 5.065 )
The confidence intervals using the normal approximation are thrown off quite a bit by a few extreme values, which the basic and percentile-based intervals are more robust to.
One interesting thing to note: if the sign of the slope is sufficiently unclear, we can get some rather extreme values (simulated as in this answer, and discussed more thoroughly in this blog post by Andrew Gelman).
set.seed(1)
x <- rnorm(100)
z = 0.05 + 0.1*x*rnorm(100, 0, 0.05) # small slope and more noise
pr = 1/(1+exp(-z))
y = rbinom(1000, 1, pr)
sim_dat <- data.frame(x, y)
sim_mod <- glm(y ~ x, data = sim_dat, family = 'binomial')
statfun <- function(dat, ind) {
mod <- update(sim_mod, data = dat[ind,])
-coef(mod)[1]/coef(mod)[2]
}
sim_boot <- boot(data = sim_dat, statistic = statfun, R = 1000)
hist(sim_boot$t[,1], breaks = 100,
main = "Bootstrap of simulated model")
The delta method above gives us mean = 6.448, lower ci = -36.22, and upper ci = 49.12, and all of the bootstrap CIs give us similarly extreme estimates.
##Level Normal Basic Percentile
##95% (-232.19, 247.76 ) ( -20.17, 45.13 ) ( -32.23, 33.06 )

Resources