R: Feature Selection with Cross-Validation using Caret on Logistic Regression

I am currently learning how to implement logistic regression in R. I have taken a data set, split it into a training and a test set, and wish to implement forward selection, backward selection, and best subset selection using cross-validation to select the best features. I am using caret to run cross-validation on the training set and then testing the predictions on the test data.
I have seen the rfe function in caret and have also looked at the documentation on the caret website, as well as following the links in the question How to use wrapper feature selection with algorithms in R?. It isn't apparent to me how to change the type of feature selection, as it seems to default to backward selection. Can anyone help me with my workflow? Below is a reproducible example:
library("caret")
# Create an Example Dataset from German Credit Card Dataset
mydf <- GermanCredit
# Create Train and Test Sets 80/20 split
trainIndex <- createDataPartition(mydf$Class, p = .8,
list = FALSE,
times = 1)
train <- mydf[ trainIndex,]
test <- mydf[-trainIndex,]
ctrl <- trainControl(method = "repeatedcv",
number = 10,
savePredictions = TRUE)
mod_fit <- train(Class~., data=train,
method="glm",
family="binomial",
trControl = ctrl,
tuneLength = 5)
# Check out Variable Importance
varImp(mod_fit)
summary(mod_fit)
# Test the new model on new and unseen Data for reproducibility
pred = predict(mod_fit, newdata=test)
accuracy <- table(pred, test$Class)
sum(diag(accuracy))/sum(accuracy)

You can request the feature selection directly in the train call that produces mod_fit. For backward stepwise selection, the code below is sufficient:
trControl <- trainControl(method="cv",
number = 5,
savePredictions = T,
classProbs = T,
summaryFunction = twoClassSummary)
caret_model <- train(Class~.,
train,
method="glmStepAIC", # This method fits best model stepwise.
family="binomial",
direction="backward", # Direction
trControl=trControl)
Note the settings in trControl:

method = "cv",                    # no need for "repeatedcv" here; number defines the k of the k-fold CV
classProbs = TRUE,
summaryFunction = twoClassSummary # gives back the ROC, sensitivity and specificity of the chosen model
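The direction argument is simply handed on to MASS::stepAIC, so it can be changed. Below is a minimal sketch (my own, untested) using direction = "both", which lets the search add and drop terms, reusing the train/test split and the trControl object from above. As far as I can tell, caret's glmStepAIC always starts from the full model, so a pure forward search is not meaningful here without also supplying a reduced starting scope.

caret_model_both <- train(Class ~ .,
                          data = train,
                          method = "glmStepAIC",
                          family = "binomial",
                          direction = "both",   # stepAIC may add or drop a term at each step
                          trace = FALSE,        # silence the step-by-step output
                          metric = "ROC",       # matches summaryFunction = twoClassSummary
                          trControl = trControl)

# Evaluate on the held-out test set
pred <- predict(caret_model_both, newdata = test)
confusionMatrix(pred, test$Class)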

Related

Keep "CV" method in trainControl consistent in R

I am very new to machine learning and was told to run a series of methods in order to predict a variable in my study. I am trying to predict a variable using "ranger", "ctree" and "xgbTree", setting up trainControl ahead of time with method = "cv".
library(randomForest)
library(caret)
library(ranger)

# 70/30 split of the iris data
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train <- iris[ind == 1, ]
test  <- iris[ind == 2, ]

set.seed(222)
ctrl <- trainControl(method = "cv", number = 10)
new.rf <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl)
My issue is that I want the method to use the same parameters/conditions each time my data gets a new split. I only want my "ind" variable to be randomized; everything else in trainControl and in the subsequent train and predict calls should stay consistent. I have included a sample of my code using the iris data built into R. I plan to split my data into 70% training and 30% validation.
#### This was how I was checking whether the training models were consistent
ctrl1 <- trainControl(method = "cv", number = 10)
ctrl2 <- trainControl(method = "cv", number = 10)
new.rf1 <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl1)
new.rf2 <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl2)
rf.pred1 <- predict(new.rf1, test)
rf.pred2 <- predict(new.rf2, test)
rf.pred1
rf.pred2
I am sure my wording is not quite right in many, if not all, parts. The purpose of this task is to see how different the outcomes become with different random splits of my sample data.
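This is not from the thread, but one way to keep everything except ind fixed is to set the seed immediately before each train() call, or to pre-compute the fold indices once and pass them via trainControl's index argument. A rough sketch:

# Same seed before each call -> same CV folds and the same ranger fit for a given
# training set, so differences between runs come only from the split in `ind`.
set.seed(222)
new.rf1 <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl1)
set.seed(222)
new.rf2 <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl2)

# Alternative: build the folds once and reuse them in every trainControl
folds <- createFolds(train$Sepal.Length, k = 10, returnTrain = TRUE)
ctrl_fixed <- trainControl(method = "cv", number = 10, index = folds)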

How do I get the training accuracies for each fold in k-fold cross validation in R?

I would like to evaluate whether the logistic regression model I created is overfit. I'd like to compare the accuracies of each training fold to the test fold, but I don't know how to view these in R. This is the k-fold cross validation code:
library(caret)
levels(habitatdata$outcome) <- c("absent", "present")  # rename factor levels

set.seed(12)
cvIndex <- createFolds(factor(habitatdata$outcome), 5, returnTrain = TRUE)  # create stratified folds

ctrlspecs <- trainControl(index = cvIndex,
                          method = "cv",
                          number = 5,
                          savePredictions = "all",
                          classProbs = TRUE)  # specify training method

set.seed(123)
model1 <- train(outcome ~ ist + hwt,
                data = habitatdata,
                method = "glm",
                family = binomial,
                trControl = ctrlspecs)  # specify model
How do I view the training accuracies of each fold?
Look at model1$resample - it should give you a table with Accuracy (and Kappa) for each fold.
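For example (a short sketch, assuming model1 was fit as above):

model1$resample                                      # Accuracy and Kappa on the held-out part of each fold
colMeans(model1$resample[, c("Accuracy", "Kappa")])  # the averages reported in model1$results

Note that these are accuracies on the held-out part of each fold; caret does not store the corresponding training-fold accuracies, but since savePredictions = "all" is set, the held-out predictions are kept in model1$pred if you want to recompute metrics yourself.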

How to train data manually per fold with k-fold CV in R?

I have the following code segment which works for me and I get the model result:
library(base)
library(caret)
library(tidyverse)

dataset <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names = FALSE)
X <- dataset[, 1:8]
Y <- as.factor(ifelse(dataset$X9 == 1, 'diabetes', 'nondiabetes'))

set.seed(88)
nfolds <- 3
cvIndex <- createFolds(Y, nfolds, returnTrain = TRUE)

fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = nfolds,
                            classProbs = TRUE,
                            savePredictions = TRUE,
                            verboseIter = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)

model <- caret::train(X, Y,
                      method = "svmLinear",
                      trControl = fit.control,
                      preProcess = c("center", "scale"),
                      tuneLength = 10)
Using this I can access the final model as model$finalModel. However, in this case, instead of one final model I actually want three models, one per fold: the model trained after the first fold, then after the second fold, and lastly after the third fold, which corresponds to the actual final model. Any ideas on how to achieve this in R? Note that I am not tied to caret; if you can do it with mlr, that is also welcome.
The train function in caret streamlines model evaluation and training (https://cran.r-project.org/web/packages/caret/vignettes/caret.html):

"evaluate, using resampling, the effect of model tuning parameters on performance
choose the 'optimal' model across these parameters
estimate model performance from a training set"

So the model it gives back is the optimal final model. There is normally no reason to use the models trained on each fold, and I'm not aware of a way to extract them from caret.
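That said, caret does keep the per-fold performance and, because savePredictions = TRUE was set, the held-out predictions, which is often what the fold-level models would be used for. A quick sketch:

model$resample    # ROC, Sens and Spec for each of the 3 folds
head(model$pred)  # row-level held-out predictions, with a Resample column identifying the fold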
Here is an approach using the mlr package:

library(mlr)
library(base)
library(tidyverse)

dataset <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names = FALSE)
X <- dataset[, 1:8]
Y <- as.factor(ifelse(dataset$X9 == 1, 'diabetes', 'nondiabetes'))

Create an mlr task:

mlr_task <- makeClassifTask(data = data.frame(X, Y),
                            target = "Y",
                            positive = "diabetes")
Define the resampling:

set.seed(7)
cv3 <- makeResampleInstance(makeResampleDesc("CV", iters = 3),
                            task = mlr_task)

Define the type of hyperparameter search:

ctrl <- makeTuneControlRandom(maxit = 10L)

Define a learner:

lrn <- makeLearner("classif.ksvm", predict.type = "prob")

Optionally, check the learner parameters to see which ones to tune:

mlr::getLearnerParamSet(lrn)

Define the search space (vanilladot is the linear kernel in the kernlab package, which is called internally for "classif.ksvm"). More info on the learners integrated in mlr: https://mlr.mlr-org.com/articles/tutorial/integrated_learners.html

ps <- makeParamSet(makeDiscreteParam("kernel", "vanilladot"),
                   makeNumericParam("C", lower = 2e-6, upper = 2e-6))
Tune the hyperparameters. I just picked some arbitrary measures; the first one listed is used to evaluate performance, the others are there just for show:

res <- tuneParams(lrn,
                  mlr_task,
                  cv3,
                  measures = list(auc, bac, f1),
                  par.set = ps,
                  control = ctrl)

Set the optimal hyperparameters on the learner:

lrn <- setHyperPars(lrn, par.vals = res$x)

Resample with models = TRUE:

rsmpls <- resample(lrn,
                   mlr_task,
                   cv3,
                   measures = list(auc, bac, f1),
                   models = TRUE)

The models are in:

rsmpls$models[[1]]$learner.model
rsmpls$models[[2]]$learner.model
rsmpls$models[[3]]$learner.model
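Each rsmpls$models[[i]] is a fitted mlr WrappedModel (learner.model above is the underlying kernlab object), so, as a sketch, you can predict with it directly and score the result:

p1 <- predict(rsmpls$models[[1]], task = mlr_task)  # predictions from the first fold's model
performance(p1, measures = auc)                     # score them, here with AUC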
What this does is first tune the hyperparameters and then perform another round of cross-validation with the tuned parameters on the same folds.
An alternative, and in my opinion a better approach, is to pick the hyperparameters in the inner folds of a nested cross-validation and to evaluate on the outer folds, keeping the outer-fold models to fiddle with:
lrn <- makeLearner("classif.ksvm", predict.type = "prob")

Define an inner resampling strategy:

cv3_inner <- makeResampleDesc("CV", iters = 3)

Create a tune wrapper, which defines what happens in the inner cross-validation loop:

lrn <- makeTuneWrapper(lrn,
                       resampling = cv3_inner,
                       measures = list(auc, bac, f1),
                       par.set = ps,
                       control = ctrl)

Perform the outer cross-validation:

rsmpls <- resample(lrn,
                   mlr_task,
                   cv3,
                   measures = list(auc, bac, f1),
                   models = TRUE)
This performs 3-fold CV in the outer loop; within each outer training set, another 3-fold CV is performed to tune the hyperparameters, and a model is then fit on the whole outer training set with the optimal hyperparameters. These models are evaluated on the outer-loop test sets. This is done to reduce evaluation bias. See also: https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html
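The per-fold results can then be inspected directly on the ResampleResult, for example:

rsmpls$measures.test   # auc, bac and f1 on each outer test fold
rsmpls$aggr            # aggregated (mean) performance across the outer folds
rsmpls$models[[1]]     # the first outer-fold model, fitted with its tuned parameters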
I'm not a caret or machine learning expert, but why not just train the model on random samples and store the results in a list?
data <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names = FALSE)

train_multiple_models <- function(data, kfolds) {
  resultlist <- list()
  for (i in 1:kfolds) {
    # Draw a fresh random 75% sample for each model (random subsampling rather than CV folds)
    sample <- sample.int(n = nrow(data), size = floor(0.75 * nrow(data)), replace = FALSE)
    train <- data[sample, ]
    X <- train[, 1:8]
    Y <- as.factor(ifelse(train$X9 == 1, 'diabetes', 'nondiabetes'))
    model <- caret::train(X, Y,
                          method = "svmLinear",
                          preProcess = c("center", "scale"),
                          tuneLength = 10)
    resultlist[[i]] <- model
  }
  return(resultlist)
}

result <- train_multiple_models(data, kfolds = 3)
> result[[1]]$finalModel
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
Linear (vanilla) kernel function.
Number of Support Vectors : 307
Objective Function Value : -302.065
Training error : 0.230903
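The stored models can then be applied like any other caret fit; a sketch, where newdat is just a placeholder for whatever data you want to score:

newdat <- data[1:10, 1:8]                            # placeholder: any data with columns X1..X8
preds  <- lapply(result, predict, newdata = newdat)
preds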

Stepwise Logistic Regression, stopping at best N features

I'm interested in exploring what shakes out of a stepwise logistic regression when it is limited to the top N variables, whether that is 5 or 15 depending on my preference.
I've tried to play around with the caret package:
set.seed(23)
library(caret)
library(mlbench)
data(Sonar)

traincontrol <- trainControl(method = "cv",
                             number = 5,
                             returnResamp = "all",
                             savePredictions = "all",
                             classProbs = TRUE,
                             summaryFunction = twoClassSummary)

glmstep_mod <- train(Class ~ .,
                     data = Sonar,
                     method = "glmStepAIC",
                     trControl = traincontrol,
                     metric = "ROC",
                     trace = FALSE)
But this spits back a large set of variables for the final model.
Are there packages that let me do this, code I could write myself, or parameters to these functions that I'm missing, so that I could say max_variables = N? And give it multiple tries to see the trade-off?
I normally experiment with lasso and other model types, and I'm aware of the advantages and disadvantages that stepwise provides.
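One option outside caret, sketched under the assumption that capping the number of steps is acceptable: run MASS::stepAIC forward from an intercept-only model with steps = N, so at most N variables can enter.

library(MASS)      # stepAIC
library(mlbench)
data(Sonar)

N <- 5                                                          # cap on the number of variables added
null_mod   <- glm(Class ~ 1, data = Sonar, family = binomial)
upper_form <- reformulate(setdiff(names(Sonar), "Class"), response = "Class")

step_mod <- stepAIC(null_mod,
                    scope = list(lower = ~ 1, upper = upper_form),
                    direction = "forward",
                    steps = N,                                  # each forward step adds one term
                    trace = FALSE)
formula(step_mod)                                               # the (at most) N selected predictors

This fits on the full data set, so you would still want to wrap it in your own resampling loop (or rerun it for N = 5, 10, 15) to see the trade-off.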

trainControl in caret package

In the caret package, there is a function called trainControl that allows us to perform a variety of cross-validation schemes. To perform 10-fold cross-validation, one would use

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitJ48_10_fold <- train(x = x, y = y, method = "J48", trControl = fitControl)

while for the training set alone, it is

fitControl <- trainControl(method = "none")
fitJ48train <- train(x = x, y = y, method = "J48", trControl = fitControl)
However, the confusion matrices of these models are identical for both the 10-fold and the training-only fit.
Activity <- predict(fitJ48_10_fold, newdata = Train)
confusionMatrix(Activity, Train$Activity)
Activity <- predict(fitJ48train, newdata = Train)
confusionMatrix(Activity, Train$Activity)
I used the Weka classifier GUI, and there the performance of J48 from 10-fold cross-validation is indeed lower than that on the training set. Am I wrong to suspect that trainControl from caret isn't working, or that I am passing it in the wrong way?
Am I wrong to suspect that trainControl from caret isn't working, or that I am passing it in the wrong way?
A little. For J48 there is a tuning parameter, but the default grid only fits a single value, C = 0.25. The final model will be the same no matter what value of method you use in trainControl, so the confusion matrices will always be the same.
Max
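If you want the resampling results to actually drive the model choice, have train evaluate more than one candidate, for example via tuneLength. A sketch (not part of the original answer; method = "J48" needs the RWeka package, and the exact tuning parameters depend on your caret version):

fitControl   <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitJ48_tuned <- train(x = x, y = y,
                      method = "J48",
                      tuneLength = 5,   # evaluate several candidate values of the pruning parameter(s)
                      trControl = fitControl)
fitJ48_tuned$results                    # resampled performance for each candidate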
