Bootstrap multinomial regression in R

I am trying to bootstrap a simple multinomial regression in R, and I am getting an error:
Error in is.data.frame(data) : object 'd' not found
What is really strange is that I am using the same code (adjusted to this particular problem) as in a tutorial for the boot package at Quick-R, and the same code also worked when I used a different fitting function (like lm()). Surely I am doing something stupid, but I do not see what. If anyone can help, I would appreciate it a lot.
This is an example:
require(foreign)
require(nnet)
require(boot)

# an example for multinomial logistic regression
ml <- read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml <- ml[, c(5, 7, 3)]

bs <- function(formula, data, indices) {
  d <- data[indices, ]  # allows boot to select a sample
  fit <- multinom(formula, data = d)
  s <- summary(fit)
  return(list(fit$coefficients, fit$standard.errors))
}

# 5 replications
results <- boot(
  data = ml, statistic = bs, R = 5, parallel = 'multicore',
  formula = prog ~ write
)
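For contrast, the same pattern with lm() (as in the Quick-R tutorial) does run without error; a minimal sketch with built-in data (mtcars is only a stand-in here):
# the same bootstrap pattern with lm() works fine
library(boot)
bs_lm <- function(formula, data, indices) {
  d <- data[indices, ]          # allows boot to select a sample
  fit <- lm(formula, data = d)
  return(coef(fit))
}
boot(data = mtcars, statistic = bs_lm, R = 100, formula = mpg ~ wt)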

The error happens in the summary() part; note also that the object returned by multinom() does not have coefficients and standard.errors components. It seems that summary.multinom() in turn calculates the Hessian from your data d, which for some reason (probably a scoping issue) cannot be found. A quick fix is to add Hess = TRUE:
bs <- function(formula, data, indices) {
  d <- data[indices, ]  # allows boot to select a sample
  fit <- multinom(formula, data = d, Hess = TRUE)
  s <- summary(fit)
  return(cbind(s$coefficients, s$standard.errors))
}

# 5 replications
results <- boot(
  data = ml, statistic = bs, R = 5, parallel = 'multicore',
  formula = prog ~ write
)

Multinomial logistic regression returns a matrix of coefficients from the coef() function, with one row per non-reference outcome level. This differs from an lm or glm model, which returns a vector of coefficients.
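For illustration, a minimal sketch of that shape difference using built-in data (mtcars is only a stand-in here):
library(nnet)
fit_vec <- glm(am ~ mpg, data = mtcars, family = binomial)
coef(fit_vec)  # named vector
fit_mat <- multinom(factor(gear) ~ mpg, data = mtcars, trace = FALSE)
coef(fit_mat)  # matrix: one row per non-reference level of gear
That matrix shape is why the statistic below transposes coef(fit): boot() expects each replicate to flatten into one row of results.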
library(foreign) # read.dta()
library(nnet)    # multinom()
library(boot)    # boot()

# an example for multinomial logistic regression
ml <- read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml <- ml[, c(5, 7, 3)]
names(ml)

bs <- function(formula, data, indices) {
  d <- data[indices, ]  # allows boot to select a sample
  fit <- multinom(formula, data = d, maxit = 1000, trace = FALSE)
  estimates <- coef(fit)  # matrix of coefficients
  return(t(estimates))    # transpose so boot can flatten it consistently
}

# enable parallel processing
library(parallel)
cl <- makeCluster(2)
clusterExport(cl, "multinom")

# 10000 replications
set.seed(1984)
results <- boot(
  data = ml, statistic = bs, R = 10000, parallel = "snow", ncpus = 2, cl = cl,
  formula = prog ~ write
)
stopCluster(cl)  # release the workers when done

# label the estimates
subModelNames <- colnames(results$t0)
varNames <- rownames(results$t0)
results$t0
estNames <- apply(expand.grid(varNames, subModelNames), 1,
                  function(x) paste(x, collapse = "_"))
estNames
colnames(results$t) <- estNames

# summary of results
library(car) # summary(), confint() and hist() methods for boot objects
summary(results)
confint(results, level = 0.95, type = "norm")
confint(results, level = 0.95, type = "perc")
confint(results, level = 0.95, type = "bca")

# plot the results
hist(results, legend = "separate")

Related

Depth and OOB error of a randomForest and randomForestSRC

Here is my code for randomForest and rfsrc in R. Is there any way to include n_estimators and max_depth like the sklearn version in my R code? Also, how can I plot an OOB error vs. number of trees plot like this?
library(tictoc)          # tic()/toc()
library(randomForestSRC) # rfsrc.fast()
library(randomForest)    # randomForest()
library(caret)           # varImp(), confusionMatrix()

set.seed(2234)
tic("Time to train RFSRC fast")
fast.o <- rfsrc.fast(Label ~ ., data = train[1:50000, ], forest = TRUE)
toc()
print(fast.o)

set.seed(2367)
tic("Time to test RFSRC fast")
fast.pred <- predict(fast.o, test[1:50000, ])
toc()
print(fast.pred)

set.seed(3)
tic("RF model fitting without parallelization")
rf <- randomForest(Label ~ ., data = train[1:50000, ])
toc()
print(rf)
plot(rf)
varImp(rf, sort = TRUE)
varImpPlot(rf, sort = TRUE, n.var = 10, main = "Variable Importance", pch = 16)
rf_pred <- predict(rf, newdata = test[1:50000, ])
confMatrix <- confusionMatrix(rf_pred, test[1:50000, ]$Label)
confMatrix
I appreciate your time.
You need to set block.size = 1, and also note that the sampling is without replacement by default; you can check the vignette for rfsrc:
Unlike Breiman's random forests, the default action here is sampling without replacement. Thus out-of-bag (OOB) technically means out-of-sample, but for legacy reasons we retain the term OOB.
So using an example dataset,
library(mlbench)          # Sonar data
library(randomForestSRC)

data(Sonar)
set.seed(911)
trn <- sample(nrow(Sonar), 150)
rf <- rfsrc(Class ~ ., data = Sonar[trn, ], ntree = 500, block.size = 1, importance = TRUE)
pred <- predict(rf, Sonar[-trn, ], block.size = 1)

plot(rf$err.rate[, 1], type = "l", col = "steelblue", xlab = "ntrees", ylab = "err.rate",
     ylim = c(0, 0.5))
lines(pred$err.rate[, 1], col = "orange")
legend("topright", fill = c("steelblue", "orange"), c("OOB.train", "test"))
In randomForest:
library(randomForest)
rf <- randomForest(Class ~ ., data = Sonar[trn, ], ntree = 500)
pred <- predict(rf, Sonar[-trn, ], predict.all = TRUE)
Not very sure if there's an easier way to get the error by number of trees:
# for each i, take the majority vote over the first i trees for every test case
err_by_tree <- sapply(1:ncol(pred$individual), function(i) {
  apply(pred$individual[, 1:i, drop = FALSE], 1,
        function(x) with(rle(sort(x)), values[which.max(lengths)]))  # modal vote
})
err_by_tree <- colMeans(err_by_tree != Sonar$Class[-trn])
Then plot:
plot(rf$err.rate[, 1], type = "l", col = "steelblue", xlab = "ntrees", ylab = "err.rate",
     ylim = c(0, 0.5))
lines(err_by_tree, col = "orange")
legend("topright", fill = c("steelblue", "orange"), c("OOB.train", "test"))

Getting estimated means after multiple imputation using the mitml, nlme & geepack R packages

I'm running multilevel multiple imputation through the package mitml (using the panImpute() function) and am fitting linear mixed models and marginal models through the packages nlme and geepack together with mitml's with() function.
I can get the estimates, p-values, etc. through the testEstimates() function, but I'm also looking to get estimated means across my model predictors. I've tried the emmeans package, which I normally use for getting estimated means when running nlme & geepack without multiple imputation, but emmeans tells me "Can't handle an object of class “mitml.result”".
I'm wondering: is there a way to get pooled estimated means from the multiple imputation analyses I've run?
The data frames I'm analyzing are longitudinal/repeated measures and in long format. In the linear mixed model I want the estimated means for a 2x2 interaction effect, and in the marginal model I'm trying to get estimated means for the 6 levels of the 'time' variable. The outcome in all models is continuous.
Here's my code:
# mixed model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data = Data, formula = fml, n.burn = 50000, n.iter = 5000, m = 100, group = "treatment")
summary(imp)
plot(imp, trace = "all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, lme(Dep ~ time * treatment, random = ~ 1|id, method = "ML",
                         na.action = na.exclude, control = list(opt = "optim")))
testEstimates(fit, var.comp = TRUE)
confint(testEstimates(fit, var.comp = TRUE))

# marginal model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data = Data, formula = fml, n.burn = 50000, n.iter = 5000, m = 100)
summary(imp)
plot(imp, trace = "all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, geeglm(Dep ~ time, id = id, corstr = "unstructured"))
testEstimates(fit, var.comp = TRUE)
confint(testEstimates(fit, var.comp = TRUE))
is there a way to get pooled estimated means from the multiple imputation analyses I've run?
This is not a reprex without Data, so I can't verify this works for you. But emmeans provides support for mira-class (lists of) models from the mice package. So if you fit your model in with() using a mids object rather than a mitml.list object, you can use that to obtain marginal means of your outcome (and any contrasts or pairwise comparisons afterward).
Using example data found here, which uncomfortably loads an external workspace:
con <- url("https://www.gerkovink.com/mimp/popular.RData")
load(con)

## imputation
library(mice)
ini <- mice(popNCR, maxit = 0)
meth <- ini$meth
meth[c(3, 5, 6, 7)] <- "norm"
pred <- ini$pred
pred[, "pupil"] <- 0
imp <- mice(popNCR, meth = meth, pred = pred, print = FALSE)

## analysis
library(lme4) # fit multilevel model
mod <- with(imp, lmer(popular ~ sex + (1|class)))

library(emmeans) # obtain pooled estimates of means
(em <- emmeans(mod, specs = ~ sex))
pairs(em) # test comparison

ROC curves comparing logistic regression and neural network predictions in R

I am trying to compare the prediction accuracy on a dataset between a logistic regression model and a neural network. Looking at the confusion matrices of the two methods, the ANN model gives better output than the logistic regression model. However, when plotting the ROC curves for the two methods, it seems that the logistic regression model is better. I am wondering if there is something wrong with my code for the ROC curves.
For context, I am explaining my procedure. First, I divided the dataset into training and testing data.
data <- read.csv("heart.csv", header = TRUE)
set.seed(300)
samplesize <- floor(0.75 * nrow(data)) # samplesize was not defined in the original; a 75/25 split is assumed here
index <- sample(seq_len(nrow(data)), size = samplesize) # for logistic
train <- data[index, ]
test <- data[-index, ]

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
scaled <- as.data.frame(lapply(data, normalize))
index <- sample(seq_len(nrow(scaled)), size = samplesize) # for ANN
trainset <- scaled[index, ]
testset <- scaled[-index, ]
The response variable is "target", so I fit the following GLM:
glm.fit <- glm(target ~ ., data = train, family = binomial(link = "logit"), control = list(maxit = 50))
For the ANN, I used R's neuralnet package and did the following:
library(neuralnet)
nn <- neuralnet(target ~ ., data = trainset, hidden = c(3, 2), act.fct = "logistic",
                err.fct = "sse", linear.output = FALSE, threshold = 0.01)
For my ROC curves, I did the following:
For ANN:
prob <- compute(nn, testset[, -ncol(testset)])
prob.result <- prob$net.result
detach(package:neuralnet, unload = TRUE)

library(ROCR)
nn.pred <- prediction(prob.result, testset$target)
pref <- performance(nn.pred, "tpr", "fpr")
plot(pref)
And for logistic regression:
prob <- predict(glm.fit, type = "response")
library(ROCR)
pred <- prediction(prob, test$target)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col = rainbow(7), main = "ROC curve Admissions", xlab = "Specificity",
     ylab = "Sensitivity")
I would just like some guidance in understanding why the plots seem to suggest that the logistic regression model is better when the confusion matrix suggests otherwise, and in understanding what I am doing wrong.
Thank you for any input.
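One thing to check in the code above: predict(glm.fit, type = "response") with no newdata argument returns fitted values for the training data, while the ANN curve is computed on testset, so the two ROC curves are not built on the same observations. A minimal sketch of scoring the logistic model on the held-out data instead (same objects as above):
# score the logistic model on the test set so both ROC curves use held-out data
library(ROCR)
prob_test <- predict(glm.fit, newdata = test, type = "response")
pred_lr <- prediction(prob_test, test$target)
perf_lr <- performance(pred_lr, measure = "tpr", x.measure = "fpr")
plot(perf_lr, main = "ROC curve (logistic, test set)")
Note also that performance(pred, "tpr", "fpr") puts the false positive rate on the x-axis, so labeling that axis "Specificity" is misleading; it is 1 - specificity.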

How to speed up my logistic regression bootstrapped R function?

I am bootstrapping a logistic regression by looping over a function many times and storing the results.
Can I make this run faster? I had a loop within the function, which is now gone, but it is still not fast enough. Any suggestions?
# Bootstrapping a logistic regression
# First create a function which takes a random sample (of size 1000) from our data and
# returns the predictions for the validation set using this sampled model
boot_fun <- function() {
  train_data <- bankdata[train_set, ]
  rows_sample <- sample(1:nrow(train_data), 1000, TRUE)
  LR_model <- glm(y ~ ., data = train_data[rows_sample, ], family = 'binomial')
  pred <- predict(LR_model, type = 'response', newdata = bankdata[valid_set, ])
  pred <- cbind(pred, rep(0, length(pred)))
  pred[which(pred[, 1] > 0.5), ] <- 1
  return(pred[, 2]) # 0/1 class prediction per validation case
}

# We now run this function as many times as we want to bootstrap.
# First create an output dataframe.
# The error-handling function tryCatch had to be used: some samples didn't include
# December, for example, and predicting from a model with a nonexistent December
# level generated an error.
boot_distribution <- data.frame(rep(NA, length(valid_set)))
n <- 10000
for (i in 1:n) {
  tryCatch({
    cat('processing.....at.....', round(i/n, digits = 3) * 100, ' % ', '\n')
    boot_distribution <- cbind(boot_distribution, boot_fun())
  }, error = function(e) {
    cat('One of the regressions was unsuccessful..', conditionMessage(e), '\n')
  })
}
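Two generic speedups apply to the loop above, sketched under the same assumptions (bankdata, train_set, valid_set defined as in the question): growing boot_distribution with cbind() copies the whole matrix on every iteration, so collecting results in a list and binding once is much cheaper; and since the replicates are independent, they can run in parallel with the base parallel package.
# sketch: collect replicates in a list (one cbind at the end) and run them in parallel
library(parallel)
n <- 10000
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("boot_fun", "bankdata", "train_set", "valid_set"))
res <- parLapply(cl, seq_len(n), function(i) {
  # failed fits return NAs instead of aborting the run
  tryCatch(boot_fun(), error = function(e) rep(NA_real_, length(valid_set)))
})
stopCluster(cl)
boot_distribution <- do.call(cbind, res) # bind all replicates in one step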

How do you resample LDA in R?

I thought that using the bootstrap would resample my LDA, but I am not sure. Additionally, if the bootstrap does work, I am not sure how to code a bootstrap in R.
Here is my LDA code:
library(MASS)
n <- nrow(iris)
train <- sample(n, size = floor(n * 0.75), replace = FALSE)
train.species <- iris$Species[train] # Species must be qualified; iris is not attached
test.species <- iris$Species[-train]
lda.fit <- lda(Species ~ ., data = iris, subset = train)
The code below uses the boot library to perform a bootstrap on the iris dataset, using LDA to get the standard errors of the LD1 and LD2 coefficients. The initial part of the code shows the LDA fit without the bootstrap, which yields the same coefficients.
# libraries
library(MASS)
library(boot)

# get data
data(iris)
names(iris) <- gsub("\\.", "", names(iris)) # remove dots from column names

# split data into train and test sets
train_index <- sample(seq(nrow(iris)), floor(nrow(iris) * 0.75))
train <- iris[train_index, ]
test <- iris[-train_index, ]
test_Y <- test[, c('Species')]
test_X <- subset(test, select = -c(Species))

#### LDA without bootstrap:
# fit LDA to train data
lda.fit <- lda(Species ~ ., data = train)
lda.fit
# predict test_Y based on lda.fit above
lda.pred <- predict(lda.fit, test_X)
lda.class <- lda.pred$class
# confusion matrix
table(lda.class, test_Y)

#### LDA with bootstrap: to get standard errors for the coefficients
set.seed(1)
boot.fn <- function(data, index) {
  return(coefficients(lda(Species ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
                          data = data, subset = index)))
}
# call boot(): this returns LD1 and LD2 coefficients for each predictor
boot(train, boot.fn, 1000)
# NOTE: in the Bootstrap Statistics output, t1* to t4* are LD1 coefficients
# and t5* to t8* are LD2 coefficients
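As a possible follow-up (not part of the original answer): to turn the bootstrap output into a confidence interval for a single coefficient, the boot object can be stored and passed to boot.ci() with an index:
# store the boot object, then pull a percentile CI for one coefficient
boot.out <- boot(train, boot.fn, 1000)
boot.ci(boot.out, type = "perc", index = 1) # CI for t1*, the first LD1 coefficient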
