Stepwise Regression & Cross-validation in R | Code explanation

I'm a novice in R and I'd like to perform some feature selection using stepwise regression.
To do so, I'd like to apply the following code using the caret package:
library(caret)
# Set up 10-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(Fertility ~ ., data = swiss,
                    method = "lmStepAIC",
                    trControl = train.control,
                    trace = FALSE)
# Model accuracy
step.model$results
# Final model coefficients
step.model$finalModel
# Summary of the model
summary(step.model$finalModel)
However, I don't quite understand the "connection" between the cross-validation and lmStepAIC (which, I know, returns the best-performing model as determined by the AIC criterion). How are the two linked by trControl, i.e. how does this work?
Any help is greatly appreciated!
Thank you very much in advance.

Related

R: error when doing backward feature selection with rms::fastbw on caret model

I want to perform backward feature selection using the function fastbw from the rms package. I use the sample dataset PimaIndiansDiabetes as shown below:
library(mlbench)
data(PimaIndiansDiabetes)
library(caret)
trControl <- trainControl(method = "repeatedcv",
                          repeats = 3,
                          classProbs = TRUE,
                          number = 10,
                          savePredictions = TRUE,
                          summaryFunction = twoClassSummary)
caret_model <- train(diabetes ~ .,
                     data = PimaIndiansDiabetes,
                     method = "glm",
                     trControl = trControl)
library(rms)
reduced_model <- fastbw(caret_model$finalModel)
This gives me an error:
Error in fastbw(caret_model$finalModel) : fit does not have design information
May I know what this means and how to resolve it?
You're probably stuck. fastbw() works only with models from rms, i.e. ?fastbw says:
fit: fit object with ‘Varcov(fit)’ defined (e.g., from ‘ols’,
‘lrm’, ‘cph’, ‘psm’, ‘glmD’)
I tried your fit with method="lrm" (lrm is rms's logistic regression tool), but got
Error: Model lrm is not in caret's built-in library
I think you're going to have to find another way to do stepwise regression, e.g. see this question: use library(MASS) with method="glmStepAIC" (within caret), or call stepAIC() directly (from scratch).
It's not obvious to me why you're training a model and then doing stepwise regression ...
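For illustration, here is a minimal sketch of that glmStepAIC route on the same PimaIndiansDiabetes data (my own example, not part of the original answer; choices such as metric = "ROC" and the resampling settings are assumptions):
library(mlbench)
library(caret)
library(MASS)   # supplies stepAIC(), which method = "glmStepAIC" calls internally
data(PimaIndiansDiabetes)
trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                          classProbs = TRUE, summaryFunction = twoClassSummary)
# AIC-based stepwise logistic regression, refit inside each resample
step_model <- train(diabetes ~ ., data = PimaIndiansDiabetes,
                    method = "glmStepAIC",
                    family = binomial(link = "logit"),
                    metric = "ROC",
                    trace = FALSE,
                    trControl = trControl)
step_model$finalModel   # the model selected on the full training set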

How to calculate R-squared after using bagging function to develop CART decision trees?

I am using the following bagging function with ipred to bootstrap the sample 500 times in R in order to develop decision trees:
baggedsample <- bagging(p ~ ., data, nbagg = 500, coob = TRUE,
                        control = list(minbucket = 5))
After this, I would like to know the R-squared.
I notice that if I do the bagging with caret's train function, R-squared is calculated automatically, as follows:
# Specify 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# CV bagged model
baggedsample <- train(p ~ .,
                      data,
                      method = "treebag",
                      trControl = ctrl,
                      importance = TRUE)
# assess results
baggedsample
##     RMSE  Rsquared      MAE
## 36477.25 0.7001783 24059.85
Appreciate any guidance on this issue, thanks.
Since you do not provide any data, I will illustrate using the built-in iris data.
You can compute R-squared directly from its definition, R^2 = 1 - SS_res / SS_tot (one minus the residual sum of squares divided by the total sum of squares).
library(ipred)
attach(iris)
BAG = bagging(Sepal.Length ~ ., data = iris)
R2 = 1 - sum((Sepal.Length - predict(BAG))^2) /
         sum((Sepal.Length - mean(Sepal.Length))^2)
R2
[1] 0.824782

Automate variable selection based on varimp in R

In R, I have a logistic regression model as follows:
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result ~ ., data = df,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))
calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I use multiple datasets that run through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables whose overall importance is below some threshold n (e.g. 1), so that I can automate removing those variables and rerun the model?
I was unable to extract this information from the 'calculatedVarImp' object by subsetting on the 'Overall' value:
lowVarImp <- subset(calculatedVarImp, importance$Overall < 1)
Also, is there a better way of doing variable selection?
Thanks in advance
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It iteratively adds and/or removes predictors one at a time and stops at the model with the lowest AIC (a greedy search, not an exhaustive search over every subset of predictors).
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y ~ ., data = train_data,
                     trControl = train_control,
                     method = "glmStepAIC",
                     family = binomial(link = "logit"),
                     na.action = na.omit)
logit_Model$finalModel
This is as automated as it gets, but it may be worth reading this answer about the downsides of this method.
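If you would rather keep your varImp-based filter, the importance scores are stored in a data frame inside the returned object, so something along these lines should work (a sketch against the objects from your question; note that for factor predictors the row names are dummy-variable names and may not map one-to-one onto columns of df):
imp <- calculatedVarImp$importance                 # data frame with an "Overall" column
lowImpVars <- rownames(imp)[imp$Overall < 1]       # variables below the threshold
lowImpVars
df_reduced <- df[, !(names(df) %in% lowImpVars)]   # drop them and refit
logit_Model2 <- train(result ~ ., data = df_reduced,
                      trControl = train_control,
                      method = "glm",
                      family = binomial(link = "logit"))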

Obtaining training Error using Caret package in R

I am using the caret package in order to train a k-nearest neighbors algorithm. For this, I am running this code:
Control <- trainControl(method = "cv", summaryFunction = twoClassSummary, classProbs = TRUE)
tGrid <- data.frame(k = 1:100)
trainingInfo <- train(Formula, data = trainData, method = "knn", tuneGrid = tGrid,
                      trControl = Control, metric = "ROC")
As you can see, I am interested in obtaining the AUC of the ROC curve. This code works well, but it returns the test error (which the algorithm uses to tune the k parameter of the model) as the mean of the error across the cross-validation folds. In addition to the test error, I would also like it to return the training error (the mean, across folds, of the error obtained on the training data). How can I do that?
Thank you
What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: you will have perfect predictions every time.
To do this, you will need to run train again and modify the index and indexOut objects:
library(caret)
set.seed(1)
dat <- twoClassSim(200)

set.seed(2)
folds <- createFolds(dat$Class, returnTrain = TRUE)

Control <- trainControl(method = "cv",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        index = folds,
                        indexOut = folds)
tGrid <- data.frame(k = 1:100)

set.seed(3)
a_bad_idea <- train(Class ~ ., data = dat,
                    method = "knn",
                    tuneGrid = tGrid,
                    trControl = Control, metric = "ROC")
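With index and indexOut pointing at the same folds, the resampling statistics that caret reports are computed on the very data each model was fit on, so they reflect the (over-optimistic) training performance; you can inspect them as usual (assuming the object above):
# per-k ROC/Sens/Spec, now measured on the training folds themselves
head(a_bad_idea$results)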
Max

How to track progress while building a model with the caret package?

I am trying to build a model using the train function from the caret package:
model <- train(class ~ ., data = training, method = "nb")
The training set contains about 20K observations, and each observation has over 100 variables. I would like to know whether building a model from that dataset will take hours or days.
How can I estimate the time needed to train a model on this data? How can I track the progress of the training process when using functions from the caret package?
Assuming that you are training the model with an expanded grid of tuning parameters (all combinations of the tuning parameters) and a resampling technique of your choice (cross-validation, bootstrap, etc.), you could set
trainctrl <- trainControl(verboseIter = TRUE)
and pass it to the trControl argument of the train function to track the training progress:
model <- train(class ~ ., data = training, method = 'nb', trControl = trainctrl)
This prints the progress to the console at each resampling stage and lets you gauge how far along the training/parameter tuning is.
To estimate the total running time, you could fit the model once to see how long a single fit takes, and then estimate the total time by multiplying according to your resampling scheme and the number of parameter combinations. This can be done by setting trainControl to method = 'none' and tuneLength to 1:
trainctrl <- trainControl(method = 'none')
model <- train(class ~ ., data = training, method = 'nb', trControl = trainctrl, tuneLength = 1)
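For example, you could wrap that single fit in base R's system.time() and scale up (a rough sketch; the multiplication only gives a ballpark estimate):
timing <- system.time(
  train(class ~ ., data = training, method = 'nb',
        trControl = trainctrl, tuneLength = 1)
)
timing["elapsed"]
# rough total ~ elapsed * (number of resamples) * (number of parameter combinations)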
Hope this helps! :)
