I want to estimate frequency distributions of MRM coefficients to generate a 95% CI. Below is the initial code:
library(ecodist)
dat <- data.frame(matrix(rnorm(3 * 25), ncol = 3))
names(dat) <- c('Pred', 'Var1', 'Var2')
mod <- MRM(dist(Pred) ~ dist(Var1) + dist(Var2), data = dat, nperm = 100)
slopes <- mod$coef
How can I bootstrap the coefficient values?
You can use the boot function from the boot package. I do not know ecodist::MRM, but here is a close-to-copy-paste example from the help page of boot that shows how to do a non-parametric bootstrap of the coefficient estimates for an lm model and get bias and confidence intervals.
> library(boot)
> nuke <- nuclear[, c(1, 2, 5, 7, 8, 10, 11)]
> nuke.lm <- lm(log(cost) ~ date+log(cap)+ne+ct+log(cum.n)+pt, data = nuke)
>
> nuke.fun <- function(dat, inds)
+ {
+     # refit the model on the resampled rows and return the coefficients
+     lm.b <- lm(log(cost) ~ date+log(cap)+ne+ct+log(cum.n)+pt,
+                data = dat[inds, ])
+     coef(lm.b)
+ }
>
> set.seed(45282964)
> nuke.boot <- boot(nuke, nuke.fun, R = 999)
> nuke.boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = nuke, statistic = nuke.fun, R = 999)
Bootstrap Statistics :
original bias std. error
t1* -13.26031434 -0.482810992 4.93147203
t2* 0.21241460 0.006775883 0.06480161
t3* 0.72340795 0.001842262 0.14160523
t4* 0.24902491 -0.004979272 0.08857604
t5* 0.14039305 0.009209543 0.07253596
t6* -0.08757642 0.002417516 0.05489876
t7* -0.22610341 0.006136044 0.12140501
>
> boot.ci(nuke.boot, index = 2) # pick the covariate index you want
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = nuke.boot, index = 2)
Intervals :
Level Normal Basic
95% ( 0.0786, 0.3326 ) ( 0.0518, 0.3215 )
Level Percentile BCa
95% ( 0.1033, 0.3730 ) ( 0.0982, 0.3688 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(nuke.boot, index = 2) :
bootstrap variances needed for studentized intervals
See Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press, for details of the above output. You should consider what you want to achieve with the bootstrap and which bootstrap procedure is appropriate for it.
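If you want to apply the same recipe to your MRM model, a minimal sketch could look like the one below. This assumes (as your slopes <- mod$coef suggests) that ecodist::MRM stores the coefficients in the first column of mod$coef; note also that case-resampling the rows of a distance-based regression duplicates observations, giving zero distances between the duplicates, so treat this as a starting point rather than an established procedure.

library(ecodist)
library(boot)

mrm.fun <- function(d, inds) {
  # rebuild the distance matrices from the resampled rows and refit MRM;
  # nperm = 1 because only the coefficients are needed, not permutation tests
  b <- d[inds, ]
  m <- MRM(dist(Pred) ~ dist(Var1) + dist(Var2), data = b, nperm = 1)
  m$coef[, 1]
}

set.seed(42)
mrm.boot <- boot(dat, mrm.fun, R = 999)
boot.ci(mrm.boot, index = 2, type = c("norm", "basic", "perc"))  # CI for Var1's slope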
I need to bootstrap a beta regression model with the boot package (other suggestions welcome) to check its robustness, because of a data point with a large Cook's distance.
I have the following error:
Error in t.star[r, ] <- res[[r]] :
incorrect number of subscripts on matrix
Here's a reproducible example:
library(betareg)
library(boot)
fake_data <- data.frame(diet = as.factor(c(rep("A", 10), rep("B", 10))),
                        fat = c(runif(10, .1, .5), runif(10, .4, .9)))
plot(fat~diet, data = fake_data)
my_beta_reg <- function(data, i){
  data_i <- data[i, ]
  mod <- betareg(data_i[, "fat"] ~ data_i[, "diet"])
  return(mod$coef)
}
b = boot(fake_data, statistic = my_beta_reg, R= 50)
Error in t.star[r, ] <- res[[r]] :
incorrect number of subscripts on matrix
What's the issue?
Thanks in advance.
The issue is that mod$coef is a list:
betareg(fat ~ diet, data = fake_data)$coef
#$mean
#(Intercept) dietB
# -1.275793 2.490126
#
#$precision
# (phi)
#20.59014
You need to unlist it, or preferably use the standard extractor function for coefficients:
my_beta_reg <- function(data, i){
  mod <- betareg(fat ~ diet, data = data[i, ])
  #unlist(mod$coef)
  coef(mod)
}
b = boot(fake_data, statistic = my_beta_reg, R= 50)
print(b)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = fake_data, statistic = my_beta_reg, R = 50)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* -1.275793 -0.019847377 0.2003523
#t2* 2.490126 0.009008892 0.2314521
#t3* 20.590142 8.265394485 17.2271497
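A possible follow-up (not part of the original answer) is to pass b to boot.ci for an interval, e.g. a percentile CI for the dietB coefficient (index 2 in the table above). With only R = 50 replicates boot.ci will likely warn that the intervals are unstable, so increase R for real use:

boot.ci(b, index = 2, type = "perc")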
I'm trying to write a for loop that creates bootstrap samples from the weather data in the rattle.data package (RainTomorrow is the target column), chooses the class with maximum probability for each bootstrap sample, and then predicts the class with the maximum votes.
With this code I keep getting an error.
if(!require(rpart)) install.packages("rpart")
if(!require(rpart.plot)) install.packages("rpart.plot")
if(!require(caret)) install.packages("caret")
if(!require(rattle.data)) install.packages("rattle.data")
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(ipred)) install.packages("ipred")
if(!require(Metrics)) install.packages("Metrics")
library(rpart)
library(rpart.plot)
library(rattle.data)
library(tidyverse)
library(caret)
library(ipred)
library(Metrics)
set.seed(500)
data <- weather
# creating train and test data
index <- createDataPartition(data$RainTomorrow, p = .6, list = FALSE)
train_data <- data[ index, ]
test_data <- data[-index, ]
## b ukol -> error in for each loop
nBoot = 10 #nr bootstrap samples
#create empty matrix [nr test data x nr bootstrap samples]to store bootstrap predictions
pred = matrix(data = NA, nrow = nrow(test_data), ncol = nBoot)
train_controls = rpart.control(minsplit = 6, maxdepth = 3)
for(b in 1:nBoot){
  # create bootstrap sample
  index.boot = sample(x = nrow(train_data), replace = TRUE, size = nrow(train_data))
  data_boot = train_data[index.boot, ]
  # fit a tree to the bootstrap sample
  boot.model = rpart(RainTomorrow ~ .,
                     data = data_boot,
                     method = "anova",
                     control = train_controls)
  #rpart.plot(boot.model)
  # save prediction for bootstrap
  pred[, b] = predict(boot.model, newdata = test_data)
}
#calculate prediction as mean of bootstrap predictions
pred.bagged = rowMeans(pred)
print(rmse(actual = test_data$RainTomorrow, predicted = pred.bagged))
but running this code gives me back a warning message:
In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
and I cannot for the life of me figure out the reason (newbie in machine learning).
EDIT: still looking for a valid answer
The warning occurs because you are trying to calculate RMSE from a factor:
pred.bagged = rowMeans(pred)
class(pred.bagged)
[1] "numeric"
class(test_data$RainTomorrow)
[1] "factor"
You can convert the factor to numeric, which is effectively what rpart did when you specified method = "anova", and calculate RMSE:
rmse(actual = as.numeric(test_data$RainTomorrow), predicted = pred.bagged)
RMSE is normally used for regression, and it does not make much sense for a classification model. For classification you would use method = "class" and evaluate with accuracy, F1, or Cohen's kappa; you can see an example below with confusionMatrix from caret:
for(b in 1:nBoot){
  # create bootstrap sample
  index.boot = sample(x = nrow(train_data), replace = TRUE)
  data_boot = train_data[index.boot, ]
  # fit a classification tree to the bootstrap sample
  boot.model = rpart(RainTomorrow ~ .,
                     data = data_boot,
                     method = "class",
                     control = train_controls)
  #rpart.plot(boot.model)
  # save prediction for bootstrap
  pred[, b] = as.character(predict(boot.model, newdata = test_data, type = "class"))
}
# very crude way to get majority vote
pred.bagged = apply(pred, 1, function(i){
  names(sort(table(factor(i, levels = c("No", "Yes")))))[2]
})
# convert to a factor, same levels as test_data$RainTomorrow
pred.bagged = factor(pred.bagged,levels=c("No","Yes"))
confusionMatrix(pred.bagged, test_data$RainTomorrow)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 120 0
Yes 0 26
Accuracy : 1
95% CI : (0.9751, 1)
No Information Rate : 0.8219
P-Value [Acc > NIR] : 3.672e-13
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.8219
Detection Rate : 0.8219
Detection Prevalence : 0.8219
Balanced Accuracy : 1.0000
'Positive' Class : No
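If you only want the headline accuracy rather than the full table, a one-line check using the same pred.bagged and test_data objects as above is:

mean(pred.bagged == test_data$RainTomorrow)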
I want to take my dataset bodyfat_trimmed and use bootstrapping to retrieve the mean and the standard errors. However, I seem to be using the same data all the time and therefore get zero standard error and bias. How can I solve this?
bsfunc <- function(data) {
  set.seed(1)
  x <- model.matrix(reduced_BIC_fit)[, -1]
  y <- data$density
  bootdata <- sample(1:nrow(x), nrow(x)/2)
  x.train <- x[bootdata, ]
  y.train <- y[bootdata]
  bootframe <- data.frame(bodyfat_trimmed[train, ])
  fit <- lm(density ~ age + abdomen + wrist, data = bootframe)
  stats <- coef(summary(fit))[, "Estimate"]
  return(stats)
}
strap <- boot(data = bodyfat_trimmed, sim = "parametric", statistic = bsfunc, R=1000)
strap
Output:
PARAMETRIC BOOTSTRAP
Call:
boot(data = bodyfat_trimmed, statistic = bsfunc, R = 1000, sim = "parametric")
Bootstrap Statistics :
original bias std. error
t1* 1.1360858253 0 0
t2* -0.0000889957 0 0
t3* -0.0018446625 0 0
t4* 0.0050609837 0 0
If the seed is set inside the function, sample() will return the same draw every time:
bsfunc<-function(){set.seed(1); sample(1:10,1)}
bsfunc()
[1] 3
bsfunc()
[1] 3
bsfunc()
[1] 3
PS
Your bsfunc is also misconceived. As written, train (in bootframe <- data.frame(bodyfat_trimmed[train, ])) does not come from within the function. And normally the whole point of boot is to do the bootstrap resampling, while bsfunc should just compute the statistic on the sample it is handed.
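A minimal corrected sketch (assuming bodyfat_trimmed contains the density, age, abdomen and wrist columns your code uses) lets boot do the resampling and sets the seed once, outside the statistic:

library(boot)
bsfunc <- function(data, ind) {
  # fit the model on the bootstrap sample that boot hands us
  fit <- lm(density ~ age + abdomen + wrist, data = data[ind, ])
  coef(fit)
}
set.seed(1)  # once, outside the function
strap <- boot(data = bodyfat_trimmed, statistic = bsfunc, R = 1000)
strap

Note that sim = "parametric" is dropped: with that option boot does no resampling of its own (it expects you to supply a ran.gen data generator), which is another reason every replicate in your run was identical.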
I am trying to get the five types of bootstrap intervals for linear and quantile regression. I was able to bootstrap and find the five bootstrap intervals (quantile, normal, basic, studentized and BCa) for the linear regression using Boot from car and boot.ci from boot. When I tried to do the same for quantile regression using rq from quantreg, it threw an error. Here is the sample code.
Creating the model
library(car)
library(quantreg)
library(boot)
newdata = Prestige[,c(1:4)]
education.c = scale(newdata$education, center=TRUE, scale=FALSE)
prestige.c = scale(newdata$prestige, center=TRUE, scale=FALSE)
women.c = scale(newdata$women, center=TRUE, scale=FALSE)
new.c.vars = cbind(education.c, prestige.c, women.c)
newdata = cbind(newdata, new.c.vars)
names(newdata)[5:7] = c("education.c", "prestige.c", "women.c" )
mod1 = lm(income ~ education.c + prestige.c + women.c, data=newdata)
mod2 = rq(income ~ education.c + prestige.c + women.c, data=newdata)
Booting linear and quantile regression
mod1.boot <- Boot(mod1, R=999)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
boot.ci(mod2.boot, level = .95, type = "all")
Error in if (ncol(boot.out$t) < max(index)) { :
argument is of length zero
1) Why does boot.ci not work for quantile regression?
2) Using this solution I got from Stack Exchange, I was able to find the quantile (percentile) CI for rq:
t(apply(mod2.boot$B, 2, quantile, c(0.025,0.975)))
How do I obtain the other bootstrap CIs (normal, basic, studentized, BCa)?
3) Also, my boot.ci command for linear regression produces this warning
Warning message:
In sqrt(tv[, 2L]) : NaNs produced
What does this signify?
Using summary.rq you can calculate bootstrap standard errors of the model coefficients.
Five bootstrap methods are available via the bsmethod argument (see ?boot.rq).
summary(mod2, se = "boot", bsmethod= "xy")
# Call: rq(formula = income ~ education.c + prestige.c + women.c, data = newdata)
#
# tau: [1] 0.5
#
# Coefficients:
# Value Std. Error t value Pr(>|t|)
# (Intercept) 6542.83599 139.54002 46.88860 0.00000
# education.c 291.57468 117.03314 2.49139 0.01440
# prestige.c 89.68050 22.03406 4.07009 0.00010
# women.c -48.94856 5.79470 -8.44712 0.00000
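If all you need from those standard errors is a quick interval, a rough normal-approximation sketch (assuming the printed table is stored in the coefficients component of the summary object, with "Value" and "Std. Error" columns) is the following; the trick below then gets you the full set of boot.ci intervals instead:

s <- summary(mod2, se = "boot", bsmethod = "xy")
est <- s$coefficients[, "Value"]
se <- s$coefficients[, "Std. Error"]
# normal-approximation 95% confidence intervals for each coefficient
cbind(lower = est - qnorm(0.975) * se, upper = est + qnorm(0.975) * se)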
To calculate bootstrap confidence intervals, you can use the following trick:
mod1.boot <- Boot(mod1, R=999)
set.seed(1234)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
set.seed(1234)
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
# Create an object with the same structure as mod1.boot
# but with bootstrap replicates given by boot.rq
mod3.boot <- mod1.boot
mod3.boot$R <- 10000
mod3.boot$t0 <- coef(mod2)
mod3.boot$t <- mod2.boot$B
boot.ci(mod3.boot, level = .95, type = "all")
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 10000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = mod3.boot, type = "all", level = 0.95)
#
# Intervals :
# Level Normal Basic Studentized
# 95% (6293, 6838 ) (6313, 6827 ) (6289, 6941 )
#
# Level Percentile BCa
# 95% (6258, 6772 ) (6275, 6801 )
Thanks to everyone who helped. I was able to figure out the solution myself: I ran a loop calculating the coefficients of the quantile regression and then used boot and boot.ci respectively. Here is the code.
Booting commands only, model creation from question
mod3 <- formula(income ~ education.c + prestige.c + women.c)
coefsf <- function(data, ind){
  rq(mod3, data = data[ind, ])$coef
}
boot.mod <- boot(newdata, coefsf, R = 10000)
myboot.ci <- list()
for (i in 1:ncol(boot.mod$t)){
  myboot.ci[[i]] <- boot.ci(boot.mod, level = .95,
                            type = c("norm", "basic", "perc", "bca"),
                            index = i)
}
I did this as I wanted CIs on all variables, not just the intercept.
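Each element of myboot.ci then holds all the requested intervals for one coefficient, e.g. (assuming the columns of boot.mod$t follow the model's term order):

myboot.ci[[2]]  # intervals for education.c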
I want to find the lethal dose (LD50) with its confidence interval in R. Other software such as Minitab, SPSS, and SAS provides three different versions of such confidence intervals. I could not find such intervals in any R package (I also used the findFn function from the sos package).
How can I find such intervals? I coded one type of interval based on the delta method (and am not sure about its correctness), but would like to use an established function from an R package. Thanks.
MWE:
dose <- c(10.2, 7.7, 5.1, 3.8, 2.6, 0)
total <- c(50, 49, 46, 48, 50, 49)
affected <- c(44, 42, 24, 16, 6, 0)
finney71 <- data.frame(dose, total, affected)
fm1 <- glm(cbind(affected, total-affected) ~ log(dose),
           family = binomial(link = logit), data = finney71[finney71$dose != 0, ])
summary(fm1)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.886912 0.6429272 -7.601035 2.937717e-14
log(dose) 3.103545 0.3877178 8.004650 1.198070e-15
library(MASS)
xp <- dose.p(fm1, p=c(0.50, 0.90, 0.95)) # from MASS
xp.ci <- xp + attr(xp, "SE") %*% matrix(qnorm(1 - 0.05/2)*c(-1,1), nrow=1)
zp.est <- exp(cbind(xp, attr(xp, "SE"), xp.ci[,1], xp.ci[,2]))
dimnames(zp.est)[[2]] <- c("LD", "SE", "LCL","UCL")
zp.est
LD SE LCL UCL
p = 0.50: 4.828918 1.053044 4.363708 5.343724
p = 0.90: 9.802082 1.104050 8.073495 11.900771
p = 0.95: 12.470382 1.133880 9.748334 15.952512
From the package drc, you can get the ED50 (same calculation), along with confidence intervals.
library(drc) # Directly borrowed from the drc manual
mod <- drm(affected/total ~ dose, weights = total,
           data = finney71[finney71$dose != 0, ], fct = LL2.2(), type = "binomial")
#intervals on log scale
ED(mod, c(50, 90, 95), interval = "fls", reference = "control")
Estimated effective doses
(Back-transformed from log scale-based confidence interval(s))
Estimate Lower Upper
1:50 4.8289 4.3637 5.3437
1:90 9.8021 8.0735 11.9008
1:95 12.4704 9.7483 15.9525
Which matches the manual output.
The "finney71" data is included in this package, and your calculation of confidence intervals exactly matches the example given by the drc folks, down to the "# from MASS" comment. You should give credit to them, rather than claiming you wrote the code.
There are a few other ways to figure this out. One is using the non-parametric (case-resampling) bootstrap, which is conveniently available through the boot package.
First, we'll refit the model.
library(boot)
finney71 <- finney71[finney71$dose != 0,] # pre-clean data
fm1 <- glm(cbind(affected, total-affected) ~ log(dose),
           family = binomial(link = logit),
           data = finney71)
And for illustration, we can figure out the LD50 and LD75.
statfun <- function(dat, ind) {
  mod <- update(fm1, data = dat[ind, ])
  coefs <- coef(mod)
  # invert logit(p) = b0 + b1*log(dose): dose_p = exp((logit(p) - b0)/b1)
  c(exp(-coefs[1]/coefs[2]),                     # LD50
    exp((log(0.75/0.25) - coefs[1])/coefs[2]))   # LD75
}
boot_out <- boot(data = finney71, statistic = statfun, R = 999)
The boot.ci function can work out a variety of confidence intervals for us, using this object.
boot.ci(boot_out, index = 1, type = c('basic', 'perc', 'norm'))
##BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
##Based on 999 bootstrap replicates
##
##CALL :
##boot.ci(boot.out = boot_out, type = c("basic", "perc", "norm"),
## index = 1)
##Intervals :
##Level Normal Basic Percentile
##95% ( 3.976, 5.764 ) ( 4.593, 5.051 ) ( 4.607, 5.065 )
The confidence intervals using the normal approximation are thrown off quite a bit by a few extreme values, which the basic and percentile-based intervals are more robust to.
One interesting thing to note: if the sign of the slope is sufficiently unclear, we can get some rather extreme values (simulated as in this answer, and discussed more thoroughly in this blog post by Andrew Gelman).
set.seed(1)
x <- rnorm(100)
z = 0.05 + 0.1*x + rnorm(100, 0, 0.05) # small slope and more noise
pr = 1/(1+exp(-z))
y = rbinom(100, 1, pr)
sim_dat <- data.frame(x, y)
sim_mod <- glm(y ~ x, data = sim_dat, family = 'binomial')
statfun <- function(dat, ind) {
  mod <- update(sim_mod, data = dat[ind, ])
  -coef(mod)[1]/coef(mod)[2]
}
sim_boot <- boot(data = sim_dat, statistic = statfun, R = 1000)
hist(sim_boot$t[,1], breaks = 100,
     main = "Bootstrap of simulated model")
The delta method above gives us mean = 6.448, lower CI bound = -36.22, and upper CI bound = 49.12, and all of the bootstrap CIs give similarly extreme estimates.
##Level Normal Basic Percentile
##95% (-232.19, 247.76 ) ( -20.17, 45.13 ) ( -32.23, 33.06 )