boot() generating an error on replacement in R

I've written a couple of functions for retrieving statistics (coefficients and p-values) from an lm object, to be bootstrapped. The coefficient one works; the p-value one fails with this error:
Error in boot(data = data, statistic = bs_p, R = 1000) :
number of items to replace is not a multiple of replacement length
I now believe the error is related to the inclusion of a factor variable. Here is an attempt to recreate the problem with easily reproducible data:
L3 <- LETTERS[1:3]
data <- data.frame(cbind(x = 20:69, y = 1:50), fac = sample(L3, 50, replace = TRUE))
bs_p <- function(data, i) {
  d <- data[i, ]
  fit <- lm(d$y ~ d$x * d$fac, data = d)
  return(summary(fit)$coefficients[, 4])
}
bt <- boot(data = data, statistic = bs_p, R = 1000)
The class "numeric" values returned from each of these appears to be in exactly the same format, to my beginner's eye... but I'm guessing it isn't? I have also cleared the returned bt bootstrap object before running the next function, but that did not solve it. How could I best retrieve boot-strapped p-values? Thanks for any thoughts. (Running R 3.0.1 on Mac OSX.)

I am not sure that bootstrapping p-values from an lm model is meaningful, but a solution is provided below anyway. Your error most likely occurs because some resamples omit a level of the factor (or produce a rank-deficient fit), so summary() drops the NA coefficient rows and the vector returned by the statistic changes length between replicates. Also, in your statistic function you can remove the d$ prefixes on the right-hand side of the formula, since you already pass data = d. Here is an example using the mtcars data:
library(boot)
bs <- function(data, i) {
  d <- data[i, ]                        # resample selected by boot
  fit <- lm(mpg ~ drat + wt, data = d)
  return(coef(fit))
}
bt <- boot(data = mtcars, statistic = bs, R = 1000)
bt
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = bs, R = 1000)

Bootstrap Statistics :
      original        bias    std. error
t1* 30.290370  0.54284222    7.494441
t2*  1.442491 -0.07260619    1.393801
t3* -4.782890 -0.09804271    1.000838
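From here, boot.ci() can pull a confidence interval for any individual coefficient via its index argument (a small follow-up sketch, not part of the original answer):
# percentile CI for the wt coefficient, i.e. the third statistic (t3*)
boot.ci(bt, type = "perc", index = 3)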
And here is the same approach returning the bootstrapped p-values from the lm fit:
bs_r <- function(data, i) {
  d <- data[i, ]
  fit <- lm(mpg ~ drat + wt, data = d)
  return(summary(fit)$coefficients[, 4])   # column 4 holds the p-values
}
bt1 <- boot(data = mtcars, statistic = bs_r, R = 1000)
bt1
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = bs_r, R = 1000)

Bootstrap Statistics :
        original         bias    std. error
t1* 2.737824e-04 4.020024e-03  0.0253248217
t2* 3.308544e-01 7.108738e-02  0.2960776146
t3* 1.589075e-06 5.405459e-05  0.0005540412
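Coming back to the original error: if the factor is the culprit, one possible workaround (beyond what this answer covered) is stratified resampling via boot's strata argument, which resamples within each factor level so that no level can vanish from a replicate and the statistic keeps a constant length. A sketch on the question's reproducible data:
library(boot)
L3 <- LETTERS[1:3]
data <- data.frame(x = 20:69, y = 1:50, fac = sample(L3, 50, replace = TRUE))
bs_p <- function(data, i) {
  d <- data[i, ]
  fit <- lm(y ~ x * fac, data = d)
  summary(fit)$coefficients[, 4]
}
# strata = data$fac keeps every level of fac in every resample
bt <- boot(data = data, statistic = bs_p, R = 1000, strata = data$fac)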

Logistic regression in R: plotting a bootstrap using the Titanic dataset

I am working on an exercise for an online statistics course. I need to create a logistic regression in R using the Titanic dataset, and I want to apply the bootstrap method to create and plot 95% confidence intervals for the predictions of the logistic regression.
When I run the bootstrap command and try to plot it, I get the error: "All values of t* are equal to 0.0159971772980342". I also get a bias and standard error of 0, which cannot be right. I guess there is an error in how I set up the bootstrap command, but I unfortunately cannot find it. What can I try?
My Code:
library(boot)
set.seed(50000)

logit_test <- function(data, indices) {
  dt <- data[indices, ]
  fit <- glm(Clean_data$Survived ~ Fare, data = Clean_data, family = "binomial")
  return(coef(fit))
}

boot_strap <- boot(
  data = Clean_data,
  statistic = logit_test,
  R = 100)

boot.ci(boot.out = boot_strap,
        type = c("basic"))

# Now we look at the results and plot them
boot_strap
plot(boot_strap, index = 2)
My Output:
> library(boot)
>
> set.seed(50000)
>
> logit_test <- function(data, indices) {
+ dt <- data[indices,]
+ fit <- glm(Clean_data$Survived ~ Fare, data = Clean_data, family = "binomial")
+ return(coef(fit))
+ }
> boot_strap <- boot(
+ data = Clean_data,
+ statistic = logit_test,
+ R = 100)
>
> boot.ci(boot.out = boot_strap,
+ type = c("basic"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates
CALL :
boot.ci(boot.out = boot_strap, type = c("basic"))
Intervals :
Level Basic
95% (-0.8968, -0.8968 )
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
> boot_strap
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Clean_data, statistic = logit_test, R = 100)
Bootstrap Statistics :
       original  bias    std. error
t1* -0.89682819      0            0
t2*  0.01599718      0            0
> plot(boot_strap, index=2)
[1] "All values of t* are equal to 0.0159971772980342"
The problem is that your bootstrap function isn't using the bootstrapped data to fit the model. You have this function:
logit_test <- function(data, indices) {
  dt <- data[indices, ]
  fit <- glm(Clean_data$Survived ~ Fare, data = Clean_data,
             family = "binomial")
  return(coef(fit))
}
Note that there are a couple of problems. One is that you should be using dt in the data = argument; the other is that you should not be using Clean_data$Survived as the dependent variable. It should just be Survived, because you want to ensure you're taking that variable not from the original data but from the bootstrapped data. Something like this for your bootstrap function should work:
logit_test <- function(data, indices) {
  dt <- data[indices, ]
  fit <- glm(Survived ~ Fare, data = dt, family = "binomial")
  return(coef(fit))
}
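With the corrected statistic, the original calls should now produce nonzero bias and standard error (a usage sketch; Clean_data stands in for the asker's data frame):
boot_strap <- boot(data = Clean_data, statistic = logit_test, R = 100)
boot.ci(boot.out = boot_strap, type = "basic", index = 2)  # interval for the Fare slope
plot(boot_strap, index = 2)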

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a generalized additive model with a random site-level effect, implemented in the mgcv package for R. I had been doing this with the function gam(); however, to speed things up I need to switch to the bam() framework, which is basically the same as gam() but faster. I sped up fitting further by passing the options bam(nthreads = N, discrete = T), where nthreads is the number of cores on my machine. However, when I use the discretization option and then try to make predictions with my model on new data, while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)

# generate data
N <- 10000
x <- runif(N, 0, 1)
y <- (0.5 * x / (x + 0.2)) + rnorm(N) * 0.1  # non-linear relationship between x and y

# uninformative random effect
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))

# fit models
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're'))  # this one takes ~1 minute to fit, the rest are faster
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)

# make predictions on new data
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata = newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata = newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata = newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model, which uses discretization, throws this error (the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in newdata, and mgcv then looking for random.x and finding it in the global environment. You should really gather those variables into a data frame and use the data argument when fitting your models, and try not to leave similarly named objects lying around in your global environment.
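Putting the suggested workaround together, the predict call is unchanged apart from the placeholder column for the random effect:
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
test3 <- predict(fit3, newdata = newdat, exclude = c("s(random.x)"),
                 newdata.guaranteed = T)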

Bootstrapping in R using the Nagelkerke R-squared

I am new to R. I am trying to use the boot() function with the Nagelkerke R-squared as the statistic. I know that I need a function that measures the Nagelkerke R-squared of each resample, but I have no idea what to supply as the statistic function.
I know that the Nagelkerke R-squared can be computed from the deviance and null deviance of a logit regression, and I have written a function to compute it:
NagR2 <- function(Objects){
  n <- nrow(Objects)
  reg <- glm(form,
             family = binomial("logit"), data = datainput)
  mo <- stepAIC(regression, direction = c("backward"), trace = FALSE)
  R2cox <- 1 - exp((mo$deviance - mo$null.deviance)/n)
  R2nag <- R2cox/(1 - exp((-mo$null.deviance)/n))
  R2nag
}
How should I change my NagR2 function so that I can use it as the statistic in the boot() function?
You need to alter the function to take an input data.frame as its first argument and the indices of the data.frame as its second, followed by any other arguments. Changing your existing function a bit:
NagR2 <- function(datainput, ind, form){
  n <- nrow(datainput[ind, ])
  reg <- glm(form, family = binomial("logit"), data = datainput[ind, ])
  mo <- stepAIC(reg, direction = c("backward"), trace = FALSE)  # stepAIC() is from MASS
  R2cox <- 1 - exp((mo$deviance - mo$null.deviance)/n)
  R2nag <- R2cox/(1 - exp((-mo$null.deviance)/n))
  R2nag
}
And applying it to a test dataset:
library(MASS)  # stepAIC()
library(boot)
dat <- iris
dat$Species <- factor(ifelse(dat$Species == "versicolor", "v", "o"))
bo <- boot(dat, statistic = NagR2, R = 100, form = as.formula(Species ~ .))
bo

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = dat, statistic = NagR2, R = 100, form = as.formula(Species ~ .))

Bootstrap Statistics :
     original       bias    std. error
t1* 0.3650395  0.01470299   0.0720022
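If you also want an interval around the bootstrapped R-squared, boot.ci() works on the result as usual (a follow-up sketch, not part of the original answer):
boot.ci(bo, type = "perc")  # percentile CI for the Nagelkerke R-squared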

Bootstrap Multinomial regression in R

I am trying to bootstrap a simple multinomial regression in R, and I am getting an error:
Error in is.data.frame(data) : object 'd' not found
What is really strange is that I am using the same code (adjusted to this particular problem) as in a tutorial for the boot package at Quick-R, and that same code also worked when I used a different function (like lm()). Surely I am doing something stupid, but I do not see what. Please, if anyone can help, I would appreciate it a lot.
This is an example:
require(foreign)
require(nnet)
require(boot)

# an example for multinomial logistic regression
ml <- read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml <- ml[, c(5, 7, 3)]

bs <- function(formula, data, indices) {
  d <- data[indices, ]  # allows boot to select sample
  fit <- multinom(formula, data = d)
  s <- summary(fit)
  return(list(fit$coefficients, fit$standard.errors))
}

# 5 replications
results <- list()
results <- boot(
  data = ml, statistic = bs, R = 5, parallel = 'multicore',
  formula = prog ~ write
)
The error happens in the summary() part; also, the object returned by multinom() does not have coefficients and standard.errors components. It seems that summary.multinom() in turn calculates the Hessian from your data d, which for some reason (probably a scoping issue) cannot be found. A quick fix is to add Hess = TRUE:
bs <- function(formula, data, indices) {
  d <- data[indices, ]  # allows boot to select sample
  fit <- multinom(formula, data = d, Hess = TRUE)
  s <- summary(fit)
  return(cbind(s$coefficients, s$standard.errors))
}

# 5 replications
results <- boot(
  data = ml, statistic = bs, R = 5, parallel = 'multicore',
  formula = prog ~ write
)
Multinomial logistic regression returns a matrix of coefficients from the coef() function. This differs from an lm or glm model, which returns a vector of coefficients.
library(foreign)  # read.dta()
library(nnet)     # multinom()
library(boot)     # boot()

# an example for multinomial logistic regression
ml <- read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml <- ml[, c(5, 7, 3)]
names(ml)

bs <- function(formula, data, indices) {
  d <- data[indices, ]  # allows boot to select sample
  fit <- multinom(formula, data = d, maxit = 1000, trace = FALSE)
  estimates <- coef(fit)  # a matrix: one row per sub-model
  return(t(estimates))    # boot flattens this into one vector per replicate
}

# enable parallel processing
library(parallel)
cl <- makeCluster(2)
clusterExport(cl, "multinom")  # make multinom available on the workers

# 10000 replications
set.seed(1984)
results <- boot(
  data = ml, statistic = bs, R = 10000, parallel = "snow", ncpus = 2, cl = cl,
  formula = prog ~ write
)

# label the estimates
subModelNames <- colnames(results$t0)
varNames <- rownames(results$t0)
results$t0
estNames <- apply(expand.grid(varNames, subModelNames), 1,
                  function(x) paste(x, collapse = "_"))
estNames
colnames(results$t) <- estNames

# summary of results (car provides methods for boot objects)
library(car)
summary(results)
confint(results, level = 0.95, type = "norm")
confint(results, level = 0.95, type = "perc")
confint(results, level = 0.95, type = "bca")

# plot the results
hist(results, legend = "separate")

Why are the parametric bootstrap bias and standard error zero here?

I'm performing parametric bootstrapping in R for a simple problem, and the bias and standard error are always zero. What am I doing wrong?
set.seed(12345)
df <- rnorm(n = 10, mean = 0, sd = 1)

Boot.fun <- function(data) {
  m1 <- mean(data)
  return(m1)
}
Boot.fun(data = df)

library(boot)
out <- boot(df, Boot.fun, R = 20, sim = "parametric")
out
PARAMETRIC BOOTSTRAP

Call:
boot(data = df, statistic = Boot.fun, R = 20, sim = "parametric")

Bootstrap Statistics :
      original  bias    std. error
t1* -0.1329441       0           0
You need to add a line of code to do the sampling inside the statistic, i.e.
Boot.fun <- function(data) {
  data <- sample(data, replace = TRUE)  # draw the resample yourself
  m1 <- mean(data)
  return(m1)
}
This is needed because you didn't supply a function to the ran.gen argument to generate random values. As discussed in the documentation for ?boot, if sim = "parametric" and you don't supply a generating function, the original data are passed to statistic unchanged, and you need to do the sampling in that function yourself. Since every replicate of your simulation was run on the same data, the bias and standard error are zero.
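Alternatively, you can let boot handle the simulation itself by supplying the ran.gen and mle arguments (a minimal sketch, assuming a normal model summarised by its mean and standard deviation):
library(boot)
set.seed(12345)
df <- rnorm(n = 10, mean = 0, sd = 1)

# with ran.gen supplied, the statistic receives each freshly simulated dataset
Boot.fun <- function(data) mean(data)

out <- boot(df, Boot.fun, R = 20, sim = "parametric",
            ran.gen = function(d, p) rnorm(length(d), mean = p[1], sd = p[2]),
            mle = c(mean(df), sd(df)))
out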
