How to speed up my bootstrapped logistic regression function in R?

I am bootstrapping a logistic regression by looping over a function many times and storing the results.
Can I make this run faster? I removed a loop that used to be inside the function, but it is still not fast enough. Any suggestions?
# Bootstrapping a logistic regression
# First create a function which fits the model on a random sample of size 1000 from our data
# and returns the predictions for the validation set from this sampled model
boot_fun <- function(){
  train_data <- bankdata[train_set, ]
  rows_sample <- sample(1:nrow(train_data), 1000, replace = TRUE)
  LR_model <- glm(y ~ ., data = train_data[rows_sample, ], family = 'binomial')
  pred <- predict(LR_model, type = 'response', newdata = bankdata[valid_set, ])
  # threshold the predicted probabilities at 0.5 to get 0/1 class labels
  as.numeric(pred > 0.5)
}
# We now run this function as many times as we want to bootstrap
# First create an output data frame
# The error-handling function tryCatch had to be used: some samples did not include December,
# for example, and predicting with a model that had never seen December generated an error
boot_distribution <- data.frame(rep(NA, length(valid_set)))
n <- 10000
for (i in 1:n) {
  tryCatch({
    cat('processing.....at.....', round(i/n, digits = 3)*100, ' % ', '\n')
    boot_distribution <- cbind(boot_distribution, boot_fun())
  }, error = function(e){cat('One of the regressions was unsuccessful..', conditionMessage(e), '\n')})
}
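A sketch of one way to speed this up, assuming the boot_fun() and objects above: besides the glm() fits themselves, a large share of the loop's cost comes from growing boot_distribution with cbind() on every iteration, so collect the replicates in a list and bind once at the end; on macOS/Linux the fits can also be forked across cores (mc.cores = 2 here is an arbitrary choice):
library(parallel)
n <- 10000
boot_list <- mclapply(seq_len(n), function(i) {
  # keep the dimensions consistent when a resample errors out
  # (e.g. a month level missing from the sample)
  tryCatch(boot_fun(), error = function(e) rep(NA, length(valid_set)))
}, mc.cores = 2)  # mclapply() forks; on Windows use parLapply() instead
boot_distribution <- do.call(cbind, boot_list)  # one bind instead of 10000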

Related

Loop linear regression different predictor and outcome variables

I'm new to R but am slowly learning it to analyse a data set.
Let's say I have a data frame which contains 8 variables and 20 observations. Of the 8 variables, V1 - V3 are predictors and V4 - V8 are outcomes.
B <- matrix(1:160,
            nrow = 20,
            ncol = 8)
df <- as.data.frame(B)
To perform a simple linear regression and display its summary and confidence intervals (confint() is in base R; the car package is not needed here):
fit <- lm(V4 ~ V1, data = df)
summary(fit)
confint(fit)
How can I write code (a loop or an apply call) so that R regresses each outcome on each predictor individually and extracts the coefficients and confidence intervals? I realise I'm probably trying to run before I can walk, but any help would be really appreciated.
You could wrap your lines in an lapply call and fit a linear model for each of your predictors (excluding the target, of course).
my.target <- 4
my.predictors <- (1:8)[-my.target]  # parentheses matter: 1:8[-my.target] would not drop the target
lapply(my.predictors, function(i){
  fit <- lm(df[, my.target] ~ df[, i])
  list(summary = summary(fit), confint = confint(fit))
})
You obtain a list of lists.
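For example, to pull individual pieces out of the result (res below is just a hypothetical name for the lapply output above):
res <- lapply(my.predictors, function(i){
  fit <- lm(df[, my.target] ~ df[, i])
  list(summary = summary(fit), confint = confint(fit))
})
res[[1]]$confint                                # confidence intervals for the first predictor
sapply(res, function(x) coef(x$summary)[2, 1])  # slope estimate for each predictor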
So, the code in my own data that returns the error is:
my.target <- metabdata[c(34)]
my.predictors <- metabdata[c(18:23)]
lapply(my.predictors, function(i){
  fit <- lm(metabdata[, my.target] ~ metabdata[, i])
  list(summary = summary(fit), confint = confint(fit))
})
Returns:
Error: Unsupported index type: tbl_df
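The error comes from indexing with a tibble: metabdata[c(34)] returns a one-column tbl_df, not a column position, so metabdata[, my.target] fails. A minimal sketch of a fix, assuming the column positions from the code above, is to keep plain numeric indices and extract vectors with [[:
my.target <- 34       # a numeric column position, not metabdata[c(34)]
my.predictors <- 18:23
lapply(my.predictors, function(i){
  fit <- lm(metabdata[[my.target]] ~ metabdata[[i]])  # [[ extracts a plain vector from a tibble
  list(summary = summary(fit), confint = confint(fit))
})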

How do I use predict() on new data for lme4::glmer model?

I have been trying to establish the predictive performance (AUC ROC) of a glmer model. When I try to use the predict() function on a test data set, the output of this function has the length of my training data set.
folds = 10
glmerperf = rep(0, folds); glmperf = glmerperf
TB_Train.glmer.subset <- TB_Train.glmer %>% select(one_of(subset.vars), IDNO)
TB_Train.glmer.fs <- TB_Train.glmer.subset[, c(1:7, 22)]
TB_Train.glmer.ns <- TB_Train.glmer.subset[, 8:21]
TB_Train.glmer.cns <- TB_Train.glmer.ns %>% scale(center=TRUE, scale=TRUE) %>% cbind(TB_Train.glmer.fs)
foldsamples = caret::createFolds(TB_Train.glmer.cns$Case.Status, k = folds, list = TRUE, returnTrain = FALSE)
for (n in 1:folds) {
  testdata = TB_Train.glmer.cns[foldsamples[[n]], ]
  traindata = TB_Train.glmer.cns[-foldsamples[[n]], ]
  GLMER <- lme4::glmer(Case.Status ~ . + (1 | IDNO), data = traindata, family = "binomial",
                       control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1000000)))
  glmer.probs <- predict(GLMER, newdata = testdata$Non.TB.Case, type = "response")
  glmer.ROC <- roc(predictor = glmer.probs, response = testdata$Case.Status,
                   levels = rev(levels(testdata$Case.Status)))
  glmerperf[n] <- glmer.ROC$auc
}
prob <- predict(GLMER, newdata = TB_Test.glmer$Non.TB.Case, type = "response", re.form = ~(1|IDNO))
print(sprintf('Mean AUC ROC of model on test set for GLMER %f', mean(glmerperf)))
Both the prob and glmer.probs objects have the length of the traindata object, despite the newdata argument being specified. I have noticed issues with the predict function in the past, but none as specific as this one.
Also, when the model is run, I get several errors about needing to scale my data (which I already have) and that the model fails to converge. Any ideas on how to fix this? I have already bumped up the iterations and selected a new optimizer.
Figured out that the error was arising from using the "." shortcut to specify all predictors in the model formula.
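Building on that, a minimal sketch of the corrected call: spell out the fixed effects instead of "." and pass the whole test data frame to newdata (the predictor names x1, x2, x3 are placeholders, not the real column names):
# placeholder predictor names; substitute the actual columns of traindata
GLMER <- lme4::glmer(Case.Status ~ x1 + x2 + x3 + (1 | IDNO),
                     data = traindata, family = "binomial",
                     control = glmerControl(optimizer = "bobyqa",
                                            optCtrl = list(maxfun = 1000000)))
# newdata must be a data frame holding every model variable,
# not a single column such as testdata$Non.TB.Case
glmer.probs <- predict(GLMER, newdata = testdata, type = "response",
                       allow.new.levels = TRUE)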

Leave-One-Out CV implementation for lin. regression

I am building a linear regression on the cafe dataset and I want to validate the results by calculating leave-one-out cross-validation.
I wrote my own function for that, which works when I fit lm() on all the data, but when I use a subset of the columns (from stepwise regression), I get an error. Consider the following code:
cafe <- read.table("C:/.../cafedata.txt", header=T)
cafe$Date <- as.Date(cafe$Date, format="%d/%m/%Y")
# delete row 34
cafe <- cafe[-c(34), ]
# won't need the Date column
cafe <- cafe[, -1]
library(DAAG)
# scale each column by its maximum (note: center = FALSE, so this does not center the data)
cafe.c <- data.frame(lapply(cafe[, 2:15], function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE))))
cafe.c$Day.of.Week <- cafe$Day.of.Week
cafe.c$Sales <- cafe$Sales
# Leave-one-out cross-validation function
LOOCV <- function(fit, dataset){
  # Arguments:
  # ------------------------------
  # fit: fitted model
  # dataset: dataset to be used
  # ------------------------------
  # Returns the mean of the squared errors over the folds - MSE
  MSEP_ <- c()
  for (idx in 1:nrow(dataset)){
    train <- dataset[-c(idx), ]
    test <- dataset[idx, ]
    MSEP_[idx] <- (predict(fit, newdata = test) - dataset[idx, ]$Sales)^2
  }
  return(mean(MSEP_))
}
Then when I fit the simple linear model and call the function, it works:
#----------------Simple Linear regression with all attributes-----------------
fit.all.c <- lm(cafe.c$Sales ~., data=cafe.c)
#MSE:258
LOOCV(fit.all.c, cafe.c)
However when I fit the same lm() only with subset of columns, I get an error:
#-------------------------Linear Stepwise regression--------------------------
library(MASS)  # stepAIC()
step <- stepAIC(fit.all.c, direction = "both")
fit.step <- lm(cafe.c$Sales ~ cafe.c$Bread.Sand.Sold + cafe.c$Bread.Sand.Waste
               + cafe.c$Wraps.Waste + cafe.c$Muffins.Sold
               + cafe.c$Muffins.Waste + cafe.c$Fruit.Cup.Sold
               + cafe.c$Chips + cafe.c$Sodas + cafe.c$Coffees
               + cafe.c$Day.of.Week, data = cafe.c)
LOOCV(fit.step, cafe.c)
5495.069
There were 50 or more warnings (use warnings() to see the first 50)
If I look closer:
idx <- 1
train <- cafe.c[-c(idx)]
test <- cafe.c[idx]
(predict(fit.step, newdata = test) -cafe.c[idx]$Sales)^2
I get the MSE for all rows and a warning:
Warning message:
'newdata' had 1 row but variables found have 47 rows
EDIT
I found this question about the error, which says it occurs when the columns are given different names; however, that is not the case here.
Change your code to the following:
fit.step <- lm(Sales ~ Bread.Sand.Sold + Bread.Sand.Waste
               + Wraps.Waste + Muffins.Sold
               + Muffins.Waste + Fruit.Cup.Sold
               + Chips + Sodas + Coffees
               + Day.of.Week, data = cafe.c)
LOOCV(fit.step, cafe.c)
# [1] 278.8984
idx <- 1
train <- cafe.c[-c(idx), ]
test <- cafe.c[idx, ]  # need to select the row, not the column
(predict(fit.step, newdata = test) - cafe.c[idx, ]$Sales)^2
# 1
# 51.8022
Also, your LOOCV implementation is not correct. You must fit a new model every time on the training portion of each leave-one-out fold. Right now you train the model once on the entire dataset and use that same single model to compute the MSE on the held-out observation of every fold, but each fold should have its own model trained on its own training set, as sketched below.
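A minimal sketch of a corrected version that refits inside each fold, taking a formula rather than an already fitted model (it hard-codes Sales as the response, matching the data above):
LOOCV <- function(formula, dataset){
  MSEP_ <- numeric(nrow(dataset))
  for (idx in seq_len(nrow(dataset))){
    train <- dataset[-idx, ]
    test  <- dataset[idx, , drop = FALSE]
    fit   <- lm(formula, data = train)  # refit on each leave-one-out training set
    MSEP_[idx] <- (predict(fit, newdata = test) - test$Sales)^2
  }
  mean(MSEP_)
}
LOOCV(Sales ~ ., cafe.c)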

svm {e1071} predict creates larger array of predicted values than expected

I'm using a Support Vector Machine (SVM, package e1071) in R to build a classification model and predict a 7-level class out of sample.
The problem is that when I use the predict function, I obtain an array much larger than the number of rows in the validation set. See the code and results below.
Any suggestions about what is going wrong? Am I misinterpreting the predict function in the e1071 package?
install.packages(c("e1071", "caret"))
library(e1071)
library(caret)
data <- data.frame(replicate(10, sample(0:6, 1000, rep = TRUE)))
trainIndex <- createDataPartition(data[, 1], p = 0.8,
                                  list = FALSE,
                                  times = 1)
trainset <- data[trainIndex, 2:10]
validationset <- data[-trainIndex, 2:10]
trainlabel <- data[trainIndex, 1]
validationlabel <- data[-trainIndex, 1]
svmModel <- svm(x = trainset,
                y = trainlabel,
                type = "C-classification",
                kernel = "radial")
# Predict
svmPred <- predict(svmModel, x = validationset)
length(svmPred)
# 800, expected 200 since validationset has nrow = 200.
It's because predict() has no x argument: the unmatched argument is ignored, and with no newdata supplied the fitted values for the training set (hence length 800) are returned.
Try:
svmPred <- predict(svmModel, validationset)
length(svmPred)

Bootstrap Multinomial regression in R

I am trying to bootstrap a simple multinomial regression in R, and I am getting an error:
Error in is.data.frame(data) : object 'd' not found
What is really strange is that I am using the same code (adjusted to this particular problem) as in a tutorial for the boot package at Quick-R, and the same code also worked when I used a different function (like lm()). Surely I am doing something stupid, but I do not see what. Please, if anyone can help, I would appreciate it a lot.
This is an example:
require(foreign)
require(nnet)
require(boot)
# an example for multinomial logistic regression
ml = read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml = ml[,c(5,7,3)]
bs <- function(formula, data, indices) {
  d = data[indices, ] # allows boot to select sample
  fit = multinom(formula, data = d)
  s = summary(fit)
  return(list(fit$coefficients, fit$standard.errors))
}
# 5 replications
results = list()
results <- boot(
  data = ml, statistic = bs, R = 5, parallel = 'multicore',
  formula = prog ~ write
)
The error happens in the summary() part; also, the object returned by multinom() does not have coefficients and standard.errors components. It seems that summary.multinom() in turn computes the Hessian from your data d, which for some reason (probably a scoping issue) cannot be found. A quick fix is to add Hess = TRUE:
bs <- function(formula, data, indices) {
  d = data[indices, ] # allows boot to select sample
  fit = multinom(formula, data = d, Hess = TRUE)
  s = summary(fit)
  return(cbind(s$coefficients, s$standard.errors))
}
# 5 replications
results = list()
results <- boot(
  data = ml, statistic = bs, R = 5, parallel = 'multicore',
  formula = prog ~ write
)
Multinomial logistic regression returns a matrix of coefficients from the coef() function. This differs from an lm or glm model, which returns a vector of coefficients.
library(foreign) # read.dta()
library(nnet) # multinom()
require(boot) # boot()
# an example for multinomial logistic regression
ml = read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml = ml[,c(5,7,3)]
names(ml)
bs <- function(formula, data, indices) {
  d = data[indices, ] # allows boot to select sample
  fit = multinom(formula, data = d, maxit = 1000, trace = FALSE)
  #s = summary(fit)
  #return(list(fit$coefficients, fit$standard.errors))
  estimates <- coef(fit)
  return(t(estimates))
}
# enable parallel processing
library(parallel)
cl <- makeCluster(2)
clusterExport(cl, "multinom")
# 10000 replications
set.seed(1984)
results <- boot(
  data = ml, statistic = bs, R = 10000, parallel = "snow", ncpus = 2, cl = cl,
  formula = prog ~ write
)
# label the estimates
subModelNames <- colnames(results$t0)
varNames <- rownames(results$t0)
results$t0
estNames <- apply(expand.grid(varNames, subModelNames), 1, function(x) paste(x, collapse = "_"))
estNames
colnames(results$t) <- estNames
# summary of results
library(car) # provides summary(), confint() and hist() methods for "boot" objects
summary(results)
confint(results, level=0.95, type="norm")
confint(results, level=0.95, type="perc")
confint(results, level=0.95, type="bca")
# plot the results
hist(results, legend="separate")
