Stepwise Logistic Regression, stopping at best N features (R)

I'm interested in exploring what shakes out of a stepwise logistic regression when it is limited to the top N variables, whether N is 5 or 15, depending on my preference.
I've tried to play around with the caret package:
set.seed(23)
library(caret)
library(mlbench)
data(Sonar)

traincontrol <- trainControl(method = "cv", number = 5, returnResamp = "all",
                             savePredictions = "all", classProbs = TRUE,
                             summaryFunction = twoClassSummary)

glmstep_mod <- train(Class ~ .,
                     data = Sonar,
                     method = "glmStepAIC",
                     trControl = traincontrol,
                     metric = "ROC",
                     trace = FALSE)
But this spits back however many variables the stepwise procedure settles on for the final model.
Are there any packages that let me do this, code I could write myself, or parameters to these functions that I'm missing? Something that would let me say max_variables = N, and try a few values of N to see the trade-off?
I normally experiment with the lasso or other model types, and I'm aware of the advantages/disadvantages of stepwise selection.
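Not stepwise AIC exactly, but one way to put a hard cap on the number of retained predictors is caret's recursive feature elimination (rfe), which takes the candidate subset sizes up front. A minimal sketch on the same Sonar data; the sizes (5, 10, 15) are just illustrative choices, and lrFuncs is caret's built-in set of logistic-regression helper functions for rfe.

library(caret)
library(mlbench)
data(Sonar)

set.seed(23)
rfe_ctrl <- rfeControl(functions = lrFuncs,   # logistic-regression helpers for RFE
                       method = "cv",
                       number = 5)

rfe_fit <- rfe(x = Sonar[, names(Sonar) != "Class"],   # predictors only
               y = Sonar$Class,
               sizes = c(5, 10, 15),                   # candidate numbers of features
               rfeControl = rfe_ctrl)

rfe_fit              # cross-validated performance at each subset size
predictors(rfe_fit)  # variables kept in the winning subset

Comparing the performance at each subset size gives the trade-off view the question asks about, though RFE ranks and eliminates rather than optimising AIC.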

Related

Keep "CV" method in trainControl consistent in R

I am very new to machine learning and was told to run a series of methods in order to predict a variable in my study. I am trying to predict a variable using "ranger", "ctree" and "xgbTree", setting up trainControl ahead of time with "cv".
library(randomForest)
library(caret)
library(ranger)

ind   <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train <- iris[ind == 1, ]
test  <- iris[ind == 2, ]

set.seed(222)
ctrl   <- trainControl(method = "cv", number = 10)
new.rf <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl)
My issue is that I want the method to use the same parameters/conditions each time my data gets a new split. I only want my "ind" variable to be randomized; everything else in trainControl and the subsequent train and predict calls should stay consistent. I have included a sample of my code using the iris data built into R. I plan to split my data into 70% training and 30% validation.
#### This was how I was checking to see if the training models were being consistent
ctrl1 <- trainControl(method = "cv", number = 10)
ctrl2 <- trainControl(method = "cv", number = 10)

new.rf1 <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl1)
new.rf2 <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl2)

rf.pred1 <- predict(new.rf1, test)
rf.pred2 <- predict(new.rf2, test)
rf.pred1
rf.pred2
I am sure my wording is off in many, if not all, places. The point of this task is to see how different the outcomes become with different random splits of my sample data.
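For reference, a minimal sketch of one way to get this behaviour, assuming the iris setup above (objects renamed train_df/test_df here so nothing earlier is clobbered): call set.seed() with the same value immediately before each train(), so the CV folds and tuning are identical across fits and the train/test split is the only thing left random.

library(caret)
library(ranger)

ind      <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))  # only this stays random
train_df <- iris[ind == 1, ]
test_df  <- iris[ind == 2, ]

ctrl <- trainControl(method = "cv", number = 10)

set.seed(222)
fit1 <- caret::train(Sepal.Length ~ ., data = train_df, method = "ranger", trControl = ctrl)

set.seed(222)  # same seed again -> same folds and same tuning path
fit2 <- caret::train(Sepal.Length ~ ., data = train_df, method = "ranger", trControl = ctrl)

all.equal(predict(fit1, test_df), predict(fit2, test_df))  # should be TRUE

trainControl also has a seeds argument for fixing the resampling seeds explicitly, which achieves the same thing without relying on the global seed.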

Memory issues in R when running models in caret

I am new to R and trying to generate predictions of a "yes"/"no" variable using an ensemble model. To do so, I am using caret to generate predictions with a random forest (ranger), a LASSO (glmnet) and a gradient boosted regression tree (xgbLinear) model. My dataset contains around 600k observations and 500 variables (a mix of continuous and binary variables, weighing 322 MB), of which I am using 30% to train the models.
Here is the code I am using for this:
train_control_final <- trainControl(method = "none", savePredictions = TRUE,
                                    allowParallel = TRUE, classProbs = TRUE,
                                    summaryFunction = twoClassSummary)

rff_final <- train(training_y ~ ., data = training_final, method = "ranger",
                   tuneGrid = rfgrid_final, num.trees = 250, metric = "ROC",
                   sample.fraction = 0.1, replace = TRUE,
                   trControl = train_control_final, maximize = FALSE, na.action = na.omit)

rboost_final <- train(training_y ~ ., data = training_final, method = "xgbLinear",
                      tuneGrid = boostgrid_final, metric = "ROC", subsample = 0.1,
                      trControl = train_control_final, maximize = FALSE, na.action = na.omit)

rlasso_final <- train(training_y ~ ., data = training_final, method = "glmnet",
                      tuneGrid = lassogrid_final, metric = "ROC",
                      trControl = train_control_final, maximize = FALSE, na.action = na.omit)
As a first step, I use a different 10% of the sample to tune the parameters for each model (using 3-fold CV), with the results stored in rfgrid_final / lassogrid_final / boostgrid_final (using the same code structure).
The problem is that every time I run this code to generate the predictions, R crashes because of memory issues. The code works without problems when tuning the parameters and when including fewer variables (around 60). So my question is: what can I try to make this work with the full dataset? Or is there an alternative way to accomplish what I want (generating predictions with these three algorithms) without running into memory issues?
Thanks a lot in advance!
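Not a full answer, but a sketch of one avenue that is commonly suggested (and echoed in the downsampling answer further down): pass x and y to train() directly instead of using the formula interface, which builds a full model matrix and is a known memory and speed problem with many predictors. The object names (training_final, training_y, rfgrid_final, train_control_final) are taken from the question; shown for the ranger fit only.

# predictors and outcome as separate objects, no formula
x_train <- training_final[, setdiff(names(training_final), "training_y")]
y_train <- training_final$training_y

rff_final <- train(x = x_train, y = y_train,
                   method = "ranger",
                   tuneGrid = rfgrid_final,
                   num.trees = 250,
                   metric = "ROC",
                   trControl = train_control_final)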

When using cross validation, is there a way to ensure each fold somehow contains at least several instances of the true class?

I'm fitting a model using k-fold cross-validation with caret:
library(caret)

## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)

linear_model <- train(
  x = select(training_data, Avg_Load_Time),
  y = target,
  trControl = train_control,
  method = "glm",      # logistic regression
  family = "binomial",
  metric = "ROC"
)
The trouble is that out of ~5K rows I have only ~120 true cases. This throws a warning when fitting the GLM via caret: "glm.fit: fitted probabilities numerically 0 or 1 occurred".
Is there a parameter I can set or some approach to ensuring each fold has some of the true cases?
It's easier when you shuffle the data and have enough examples of each class.
If you don't have enough examples, you can increase the size of the minority class using SMOTE (Synthetic Minority Oversampling Technique); see the smotefamily package in R.
Then you will be able to do 5- or 10-fold cross-validation without raising any issues.
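On the stratification part of the question: caret's createFolds() samples within the levels of a factor outcome, so you can build class-balanced folds yourself and hand them to trainControl() via index. A minimal sketch, assuming target and my_summary are the objects from the question:

set.seed(123)
folds <- createFolds(target, k = 5, returnTrain = TRUE)  # stratified on the class labels

train_control <- trainControl(
  index = folds,              # use the pre-built, class-balanced folds
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)

Each fold then keeps roughly the same proportion of true cases as the full data; caret's built-in "cv" resampling already stratifies on the outcome, but building the folds explicitly lets you verify it.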

How to downsample using r-caret?

I'd like to downsample my data given that I have a significant class imbalance. Without downsampling, my GBM model performs reasonably well; however, with r-caret's downSample, accuracy = 0.5. I applied the same downsampling to another GBM model and got exactly the same results. What gives?
set.seed(1914)
down_train_my_gbm <- downSample(x = combined_features,
                                y = combined_features$label)
down_train_my_gbm$label <- NULL

my_gbm_combined_downsampled <- train(Class ~ .,
                                     data = down_train_my_gbm,
                                     method = "gbm",
                                     trControl = trainControl(method = "repeatedcv",
                                                              number = 10, repeats = 3,
                                                              classProbs = TRUE),
                                     preProcess = c("range"),
                                     verbose = FALSE)
I suspected that the issue might have to do with classProbs = TRUE. Changing this to FALSE skyrockets the accuracy to >0.95, but I get the exact same results for multiple models (which do not produce the same accuracy without downsampling). I'm baffled by this. What am I doing wrong here?
caret's train function lets you downsample, upsample and more via the trainControl options. Following the guide Subsampling During Resampling, in your case it would be:
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     ## new option here:
                     sampling = "down")

model_with_down_sample <- train(Class ~ ., data = imbal_train,
                                method = "gbm",
                                preProcess = c("range"),
                                verbose = FALSE,
                                trControl = ctrl)
As a side note, avoid the formula interface (e.g. Class ~ .) and pass the columns directly; the formula interface has been shown to have memory and speed issues when many predictors are used (https://github.com/topepo/caret/issues/263). See the sketch below.
Hope it helps.
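For completeness, a sketch of what the non-formula call mentioned in the side note looks like, assuming (as in the question) that imbal_train has the outcome in a Class column and everything else is a predictor:

model_with_down_sample <- train(
  x = imbal_train[, setdiff(names(imbal_train), "Class")],
  y = imbal_train$Class,
  method = "gbm",
  preProcess = c("range"),
  verbose = FALSE,
  trControl = ctrl
)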

R: Feature Selection with Cross Validation using Caret on Logistic Regression

I am currently learning how to implement logistic regression in R.
I have taken a data set and split it into a training and test set and wish to implement forward selection, backward selection and best subset selection using cross validation to select the best features.
I am using caret to implement cross-validation on the training data set and then testing the predictions on the test data.
I have seen the rfe control in caret and have also had a look at the documentation on the caret website, as well as following the links in the question How to use wrapper feature selection with algorithms in R?. It isn't apparent to me how to change the type of feature selection, as it seems to default to backward selection. Can anyone help me with my workflow? Below is a reproducible example.
library("caret")
# Create an Example Dataset from German Credit Card Dataset
mydf <- GermanCredit
# Create Train and Test Sets 80/20 split
trainIndex <- createDataPartition(mydf$Class, p = .8,
list = FALSE,
times = 1)
train <- mydf[ trainIndex,]
test <- mydf[-trainIndex,]
ctrl <- trainControl(method = "repeatedcv",
number = 10,
savePredictions = TRUE)
mod_fit <- train(Class~., data=train,
method="glm",
family="binomial",
trControl = ctrl,
tuneLength = 5)
# Check out Variable Importance
varImp(mod_fit)
summary(mod_fit)
# Test the new model on new and unseen Data for reproducibility
pred = predict(mod_fit, newdata=test)
accuracy <- table(pred, test$Class)
sum(diag(accuracy))/sum(accuracy)
You can request it directly in the train call. For backward stepwise selection, the code below is sufficient:
trControl <- trainControl(method = "cv",
                          number = 5,
                          savePredictions = TRUE,
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary)

caret_model <- train(Class ~ .,
                     data = train,
                     method = "glmStepAIC",   # this method fits the best model stepwise (by AIC)
                     family = "binomial",
                     direction = "backward",  # direction of the stepwise search
                     trControl = trControl)
Note that in trControl:

method = "cv"                       # no need for "repeatedcv" here; number sets the k in k-fold CV
classProbs = TRUE
summaryFunction = twoClassSummary   # gives back the ROC, sensitivity and specificity of the chosen model
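As a follow-up (not part of the original answer), you can inspect which variables the stepwise search kept and score the held-out set with the usual caret helpers:

summary(caret_model$finalModel)    # coefficients of the AIC-selected model
pred <- predict(caret_model, newdata = test)
confusionMatrix(pred, test$Class)  # accuracy, sensitivity and specificity on the test set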
