Does sbf() use the metric argument to optimize the model? - r

Passing ROC as the metric argument value to the caretSBF function
Our objective is to use the ROC summary metric for model selection while running the Selection By Filtering function, sbf(), for feature selection.
The BreastCancer dataset from the mlbench package was used as a reproducible example to run train() and sbf() with metric = "Accuracy" and metric = "ROC".
We want to make sure that sbf() uses the metric argument to optimize the model in the same way that train() and rfe() do. To this end, we used train() together with sbf(): the caretSBF$fit function makes a call to train(), and caretSBF is passed to sbfControl.
From the output, it seems that the metric argument is used only for the inner resampling and not for the sbf part; that is, for the outer resampling, the metric argument is not applied the way train() and rfe() apply it.
Because caretSBF uses train(), it appears that the metric argument is limited in scope to train() and is not used by sbf() itself.
We would appreciate clarification on whether sbf() uses the metric argument to optimize the model, i.e. for the outer resampling.
Here is our reproducible example, showing that train() uses the metric argument with both Accuracy and ROC; for sbf() we are not sure.
I. DATA SECTION
## Loading required packages
library(mlbench)
library(caret)
## Loading `BreastCancer` Dataset from *mlbench* package
data("BreastCancer")
## Data cleaning for missing values
# Remove rows/observation with NA Values in any of the columns
BrC1 <- BreastCancer[complete.cases(BreastCancer),]
# Removing Class and Id Column and keeping just Numeric Predictors
Num_Pred <- BrC1[,2:10]
II. CUSTOMIZED SUMMARY FUNCTION
Defining fiveStats summary function
fiveStats <- function(...) c(twoClassSummary(...),
defaultSummary(...))
III. TRAIN SECTION
Defining trControl
trCtrl <- trainControl(method="repeatedcv", number=10,
repeats=1, classProbs = TRUE, summaryFunction = fiveStats)
TRAIN + METRIC = "Accuracy"
set.seed(1)
TR_acc <- train(Num_Pred,BrC1$Class, method="rf",metric="Accuracy",
trControl = trCtrl,tuneGrid=expand.grid(.mtry=c(2,3,4,5)))
TR_acc
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 615, 614, 614, 614, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9936532 0.9729798 0.9833333 0.9765772 0.9490311
# 3 0.9936544 0.9729293 0.9791667 0.9750853 0.9457534
# 4 0.9929957 0.9684343 0.9750000 0.9706948 0.9361373
# 5 0.9922907 0.9684343 0.9666667 0.9677536 0.9295782
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 2.
TRAIN + METRIC = "ROC"
set.seed(1)
TR_roc <- train(Num_Pred,BrC1$Class, method="rf",metric="ROC",
trControl = trCtrl,tuneGrid=expand.grid(.mtry=c(2,3,4,5)))
TR_roc
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 615, 614, 614, 614, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9936532 0.9729798 0.9833333 0.9765772 0.9490311
# 3 0.9936544 0.9729293 0.9791667 0.9750853 0.9457534
# 4 0.9929957 0.9684343 0.9750000 0.9706948 0.9361373
# 5 0.9922907 0.9684343 0.9666667 0.9677536 0.9295782
#
# ROC was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 3.
IV. EDITING caretSBF
Editing caretSBF summary Function
caretSBF$summary <- fiveStats
V. SBF SECTION
Defining sbfControl
sbfCtrl <- sbfControl(functions=caretSBF,
method="repeatedcv", number=10, repeats=1,
verbose=TRUE, saveDetails = TRUE)
SBF + METRIC = "Accuracy"
set.seed(1)
sbf_acc <- sbf(Num_Pred, BrC1$Class,
sbfControl = sbfCtrl,
trControl = trCtrl, method="rf", metric="Accuracy")
## sbf_acc
sbf_acc
# Selection By Filter
#
# Outer resampling method: Cross-Validated (10 fold, repeated 1 times)
#
# Resampling performance:
#
# ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
# 0.9931 0.973 0.9833 0.9766 0.949 0.006272 0.0231 0.02913 0.01226 0.02646
#
# Using the training set, 9 variables were selected:
# Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
#
# During resampling, the top 5 selected variables (out of a possible 9):
# Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
#
# On average, 9 variables were selected (min = 9, max = 9)
## Class of sbf_acc
class(sbf_acc)
# [1] "sbf"
## Names of elements of sbf_acc
names(sbf_acc)
# [1] "pred" "variables" "results" "fit" "optVariables"
# [6] "call" "control" "resample" "metrics" "times"
# [11] "resampledCM" "obsLevels" "dots"
## sbf_acc fit element
sbf_acc$fit
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 614, 614, 615, 615, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9933176 0.9706566 0.9833333 0.9751492 0.9460717
# 5 0.9920034 0.9662121 0.9791667 0.9707801 0.9363708
# 9 0.9914825 0.9684343 0.9708333 0.9693308 0.9327662
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 2.
## Elements of sbf_acc fit
names(sbf_acc$fit)
# [1] "method" "modelInfo" "modelType" "results" "pred"
# [6] "bestTune" "call" "dots" "metric" "control"
# [11] "finalModel" "preProcess" "trainingData" "resample" "resampledCM"
# [16] "perfNames" "maximize" "yLimits" "times" "levels"
## sbf_acc fit final Model
sbf_acc$fit$finalModel
# Call:
# randomForest(x = x, y = y, mtry = param$mtry)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 2.34%
# Confusion matrix:
# benign malignant class.error
# benign 431 13 0.02927928
# malignant 3 236 0.01255230
## sbf_acc metric
sbf_acc$fit$metric
# [1] "Accuracy"
## sbf_acc fit best Tune
sbf_acc$fit$bestTune
# mtry
# 1 2
SBF + METRIC = "ROC"
set.seed(1)
sbf_roc <- sbf(Num_Pred, BrC1$Class,
sbfControl = sbfCtrl,
trControl = trCtrl, method="rf", metric="ROC")
## sbf_roc
sbf_roc
# Selection By Filter
#
# Outer resampling method: Cross-Validated (10 fold, repeated 1 times)
#
# Resampling performance:
#
# ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
# 0.9931 0.973 0.9833 0.9766 0.949 0.006272 0.0231 0.02913 0.01226 0.02646
#
# Using the training set, 9 variables were selected:
# Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
#
# During resampling, the top 5 selected variables (out of a possible 9):
# Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
#
# On average, 9 variables were selected (min = 9, max = 9)
## Class of sbf_roc
class(sbf_roc)
# [1] "sbf"
## Names of elements of sbf_roc
names(sbf_roc)
# [1] "pred" "variables" "results" "fit" "optVariables"
# [6] "call" "control" "resample" "metrics" "times"
# [11] "resampledCM" "obsLevels" "dots"
## sbf_roc fit element
sbf_roc$fit
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 614, 614, 615, 615, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9933176 0.9706566 0.9833333 0.9751492 0.9460717
# 5 0.9920034 0.9662121 0.9791667 0.9707801 0.9363708
# 9 0.9914825 0.9684343 0.9708333 0.9693308 0.9327662
#
# ROC was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 2.
## Elements of sbf_roc fit
names(sbf_roc$fit)
# [1] "method" "modelInfo" "modelType" "results" "pred"
# [6] "bestTune" "call" "dots" "metric" "control"
# [11] "finalModel" "preProcess" "trainingData" "resample" "resampledCM"
# [16] "perfNames" "maximize" "yLimits" "times" "levels"
## sbf_roc fit final Model
sbf_roc$fit$finalModel
# Call:
# randomForest(x = x, y = y, mtry = param$mtry)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 2.34%
# Confusion matrix:
# benign malignant class.error
# benign 431 13 0.02927928
# malignant 3 236 0.01255230
## sbf_roc metric
sbf_roc$fit$metric
# [1] "ROC"
## sbf_roc fit best Tune
sbf_roc$fit$bestTune
# mtry
# 1 2
Does sbf() use the metric argument to optimize the model? If yes, which metric does sbf() use by default? And if sbf() does use the metric argument, how can it be set to ROC?
Thanks.

sbf doesn't use the metric to optimize anything (unlike rfe); all sbf does is run a feature selection step before calling the model. Of course, you define the filters, but there is no way to tune the filter using sbf, so no metric is needed to guide that step.
Using sbf(x, y, metric = "ROC") will pass metric = "ROC" to whatever modeling function you are using (it is designed to work with train when caretSBF is used). This happens because sbf has no metric argument of its own, so metric ends up in the ... and is forwarded:
> names(formals(caret:::sbf.default))
[1] "x" "y" "sbfControl" "..."
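To make the forwarding concrete, here is a minimal base R sketch of the same pattern. It is not caret's code: filter_then_fit and inner_fit are hypothetical stand-ins for sbf and train, and the ANOVA p-value cutoff of 0.05 is an assumed filter chosen only for illustration.

```r
# Hypothetical stand-in for train(): this is the function that owns `metric`.
inner_fit <- function(x, y, metric = "Accuracy") {
  list(predictors = colnames(x), metric = metric)
}

# Hypothetical stand-in for sbf(): no `metric` formal, only `...`.
filter_then_fit <- function(x, y, ...) {
  # Univariate filter: one-way ANOVA p-value of each predictor against the class.
  p_values <- vapply(
    x,
    function(col) anova(lm(col ~ y))[["Pr(>F)"]][1],
    numeric(1)
  )
  keep <- names(p_values)[p_values < 0.05]
  # Everything in `...` (including metric) is forwarded untouched;
  # the filter step itself never looks at it.
  inner_fit(x[, keep, drop = FALSE], y, ...)
}

# Two-class subset of iris as toy data
two_class <- iris$Species != "setosa"
x <- iris[two_class, 1:4]
y <- droplevels(iris$Species[two_class])

res <- filter_then_fit(x, y, metric = "ROC")
res$metric                        # the value only reaches the inner function
names(formals(filter_then_fit))   # no `metric` here, just x, y and ...
```

The point of the sketch is that the filter runs the same way whatever metric is supplied; only the inner model fit can react to it.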

Related

How to plot RMSE vs number of trees in bagging when using train() and cross-validation in R

I am studying the bagging method on this website: https://bradleyboehmke.github.io/HOML/bagging.html
I am going to use the train() function with cross-validation for bagging, something like the code below.
As far as I understand, nbagg = 200 tells R to try 200 trees, calculate the RMSE for each, and return the number of trees (here 80) for which the best RMSE is achieved.
Now, how can I see what RMSE the other nbagg values produced in this model, like the RMSE vs number of trees plot on that website (shown before the CV method and the train() function are introduced)?
ames_bag2 <- train(
Sale_Price ~ .,
data = ames_train,
method = "treebag",
trControl = trainControl(method = "cv", number = 10),
nbagg = 200,
control = rpart.control(minsplit = 2, cp = 0)
)
ames_bag2
## Bagged CART
##
## 2054 samples
## 80 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1848, 1848, 1849, 1849, 1847, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 26957.06 0.8900689 16713.14
As the example you shared is not completely reproducible, I have used the mtcars dataset instead to illustrate how you can do it. You can extend that to your data.
Note: The RMSE shown here is the average of the 10 per-fold RMSEs, since the CV number is 10, so that is the only value we store. The relevant libraries are included in the example, and the maximum number of trees is set to 15 just for illustration.
library(ipred)
library(caret)
library(rpart)
library(dplyr)
data("mtcars")
error_df <- data.frame()
# Refit the bagged model for each ensemble size and record the CV RMSE
for (n_trees in 1:15) {
  ames_bag2 <- train(
    mpg ~ .,
    data = mtcars,
    method = "treebag",
    trControl = trainControl(method = "cv", number = 10),
    nbagg = n_trees,
    control = rpart.control(minsplit = 2, cp = 0)
  )
  # Average the 10 per-fold RMSEs for this number of trees
  error_df <- bind_rows(
    error_df,
    data.frame(trees = n_trees, rmse = mean(ames_bag2[["resample"]]$RMSE))
  )
}
error_df will show the output.
> error_df
trees rmse
1 1 2.493117
2 2 3.052958
3 3 2.052801
4 4 2.239841
5 5 2.500279
6 6 2.700347
7 7 2.642525
8 8 2.497162
9 9 2.263527
10 10 2.379366
11 11 2.447560
12 12 2.314433
13 13 2.423648
14 14 2.192112
15 15 2.256778
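Since the question asked for a plot, here is a small base R sketch that draws RMSE against the number of trees. To keep it self-contained, it rebuilds error_df from the values printed above rather than refitting the models; the axis labels are my own wording.

```r
# RMSE values copied from the error_df output shown above
error_df <- data.frame(
  trees = 1:15,
  rmse = c(2.493117, 3.052958, 2.052801, 2.239841, 2.500279,
           2.700347, 2.642525, 2.497162, 2.263527, 2.379366,
           2.447560, 2.314433, 2.423648, 2.192112, 2.256778)
)

# Line-and-point plot of cross-validated RMSE vs ensemble size
plot(error_df$trees, error_df$rmse, type = "b",
     xlab = "Number of trees (nbagg)",
     ylab = "Cross-validated RMSE")

# Mark the ensemble size with the lowest RMSE
best <- error_df[which.min(error_df$rmse), ]
abline(v = best$trees, lty = 2)
best
```

With only 15 trees on a dataset this small the curve is noisy, so the marked minimum should be read as an illustration rather than a recommendation.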

Repeated CV in an mlr3 ensemble model

I have a beautiful mlr3 ensemble model (combined glmnet and glm) for binary prediction, see details here
library("mlr3verse")
library("dplyr")
# get example data
data(PimaIndiansDiabetes, package="mlbench")
data <- PimaIndiansDiabetes
# add an additional predictor "superdoc" which is not entered in the glmnet but in the final glm
set.seed(2323)
data %>%
rowwise() %>%
mutate(superdoc=case_when(diabetes=="pos" ~ as.numeric(sample(0:2,1)), TRUE~ 0)) %>%
ungroup -> data
# make a rather small train set
set.seed(23)
test.data <- sample_n(data,70,replace=FALSE)
# create elastic net regression
glmnet_lrn = lrn("classif.cv_glmnet", predict_type = "prob")
# create the learner out-of-bag predictions
glmnet_cv1 = po("learner_cv", glmnet_lrn, id = "glmnet")
# PipeOp that drops 'superdoc', i.e. selects all except 'superdoc'
# (ID given to avoid ID clash with other selector)
drop_superdoc = po("select", id = "drop.superdoc",
selector = selector_invert(selector_name("superdoc")))
# PipeOp that selects 'superdoc' (and drops all other columns)
select_superdoc = po("select", id = "select.superdoc",
selector = selector_name("superdoc"))
# superdoc along one path, the fitted model along the other
stacking_layer = gunion(list(
select_superdoc,
drop_superdoc %>>% glmnet_cv1
)) %>>% po("featureunion", id = "union1")
# final logistic regression
log_reg_lrn = lrn("classif.log_reg", predict_type = "prob")
# combine ensemble model
ensemble = stacking_layer %>>% log_reg_lrn
#define tests
train.task <- TaskClassif$new("test.data", test.data, target = "diabetes")
# make ensemble learner
elearner = as_learner(ensemble)
ensemble$plot(html = FALSE)
If I train it with different seeds (set.seed), I get different coefficients.
I think this is mainly caused by the rather small amount of training data entering the glmnet model, and that it could be mitigated by repeated cross-validation.
# Train the Learner:
# seed 1
elearner = as_learner(ensemble)
set.seed(22521136)
elearner$train(train.task) -> seed1
# seed 2
elearner = as_learner(ensemble)
set.seed(12354)
elearner$train(train.task) -> seed2
# different coefficients of the glmnet model
coef(seed1$model$glmnet$model, s ="lambda.min")
#> 9 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) -6.238598277
#> age .
#> glucose 0.023462376
#> insulin -0.001007037
#> mass 0.055587740
#> pedigree 0.322911217
#> pregnant 0.137419564
#> pressure .
#> triceps .
coef(seed2$model$glmnet$model, s ="lambda.min")
#> 9 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) -6.876802620
#> age .
#> glucose 0.025601712
#> insulin -0.001500856
#> mass 0.063029550
#> pedigree 0.464369417
#> pregnant 0.155971123
#> pressure .
#> triceps .
# different coefficients of the final regression model
seed1$model$classif.log_reg$model$coefficients
#> (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
#> -9.438452 23.710923 8.726956 NA
seed2$model$classif.log_reg$model$coefficients
#> (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
#> 0.3698143 23.5362542 -5.5514365 NA
Question:
Where and how could repeated cross-validation be added to my mlr3 ensemble model to mitigate these varying results? Any help is much appreciated.
Thanks to missuse's comment, his marvellous tutorial (Tuning a stacked learner) and mb706's comments, I think I have solved my question.
Replace "classif.cv_glmnet" with "classif.glmnet"
# Add tuning
resampling = rsmp("repeated_cv")
resampling$param_set$values = list(repeats = 10, folds=5)
ps_ens = ParamSet$new(
list(
ParamDbl$new("glmnet.alpha", 0, 1),
ParamDbl$new("glmnet.s", 0, 1)))
auto1 = AutoTuner$new(
learner = elearner,
resampling = resampling,
measure = msr("classif.auc"),
search_space = ps_ens,
terminator = trm("evals", n_evals = 5), # to limit running time
tuner = tnr("random_search")
)
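Train with different seeds and get the same coefficients
# Train with different set.seed
# first
set.seed(22521136)
at1 = auto1  # note: R6 objects are not copied on assignment, so at1 is auto1 itself
at1$train(train.task) -> seed1
# second
set.seed(12354)
at2 = auto1  # the same object again; auto1$clone(deep = TRUE) would give an independent copy
at2$train(train.task) -> seed2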
# Compare coefficients of the learners
# classif.log_reg
seed1$model$learner$model$classif.log_reg$model$coefficients
# (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
# 2.467855 21.570766 -6.966693 NA
seed2$model$learner$model$classif.log_reg$model$coefficients
# (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
# 2.467855 21.570766 -6.966693 NA
#classif.glmnet
coef(at1$learner$model$glmnet$model, alpha=at1$tuning_result$glmnet.alpha,s=at1$tuning_result$glmnet.s)
# 9 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -3.3066981659
# age 0.0076392198
# glucose 0.0077516975
# insulin 0.0003389759
# mass 0.0133955320
# pedigree 0.3256754612
# pregnant 0.0686746156
# pressure 0.0081338885
# triceps -0.0054976030
coef(at2$learner$model$glmnet$model, alpha=at2$tuning_result$glmnet.alpha,s=at2$tuning_result$glmnet.s)
# 9 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -3.3066981659
# age 0.0076392198
# glucose 0.0077516975
# insulin 0.0003389759
# mass 0.0133955320
# pedigree 0.3256754612
# pregnant 0.0686746156
# pressure 0.0081338885
# triceps -0.0054976030
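One caveat when reading the identical coefficients above: AutoTuner is an R6 object, and R6 objects are built on environments, which are not copied on assignment. The base R sketch below mimics that with a plain environment; fake_train is a hypothetical stand-in for $train(), not mlr3 code.

```r
# A plain environment standing in for an R6 learner object
auto <- new.env()
auto$model <- NULL

# Hypothetical stand-in for $train(): modifies the object in place
# and returns it invisibly, as R6 methods commonly do.
fake_train <- function(obj, seed) {
  set.seed(seed)
  obj$model <- rnorm(1)
  invisible(obj)
}

a1 <- auto                 # no copy: a1 is another name for auto
s1 <- fake_train(a1, 1)
a2 <- auto                 # still the same object
s2 <- fake_train(a2, 2)

identical(s1, s2)              # both names point at one environment
identical(s1$model, s2$model)  # so they show the second fit, whatever the seeds
```

Under that assumption, seed1 and seed2 in the answer refer to one and the same trained object, so matching output does not by itself show that the seeds no longer matter. Training two clones (auto1$clone(deep = TRUE)) would be a fairer comparison.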

Out-of-fold vs training error in caret

Using cross-validation in model tuning, I get different error rates from caret::train's results object than when I calculate the error myself from its pred object. I'd like to understand why they differ, and ideally how to use out-of-fold error rates for model selection, plotting model performance, etc.
The pred object contains out-of-fold predictions. The docs are pretty clear that trainControl(..., savePredictions = "final") saves out-of-fold predictions for the best hyperparameter values: "an indicator of how much of the hold-out predictions for each resample should be saved... "final" saves the predictions for the optimal tuning parameters." (Keeping "all" predictions and then filtering to the best tuning values doesn't resolve the issue.)
The train docs say that the results object is "a data frame the training error rate..." I'm not sure what that means, but the values for the best row are consistently different from the metrics calculated on pred. Why do they differ and how can I make them line up?
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
m
#> Random Forest
#>
#> 50 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 38, 36, 38, 38
#> Resampling results across tuning parameters:
#>
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 1 2 maxstat 0.5981673 0.6724245 0.4993722
#> 3 1 extratrees 0.5861116 0.7010012 0.4938035
#> 4 2 maxstat 0.6017491 0.6661093 0.4999057
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#> extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394
Created on 2018-04-09 by the reprex package (v0.2.0).
caret does not calculate the cross-validation RMSE the way you show (on all held-out predictions pooled together); it calculates the RMSE within each fold and then averages across folds. Full example:
set.seed(1)
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
set.seed(1)
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#output
#> Random Forest
#>
#> 50 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 37, 38, 37, 38
#> Resampling results across tuning parameters:
#>
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 8 1 extratrees 0.6106390 0.4360609 0.4926629
#> 12 2 extratrees 0.6156636 0.4294237 0.4954481
#> 19 2 variance 0.6472539 0.3889372 0.5217369
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.
RMSE for best model is 0.6106390
Now calculate the RMSE for each fold and average:
library(dplyr)
# per-fold RMSE with caret's helper, then the mean across folds
m$pred %>%
  group_by(Resample) %>%
  summarise(rmse = caret::RMSE(pred, obs)) %>%
  pull(rmse) %>%
  mean
#output
0.610639
# same calculation with the MLmetrics helper
m$pred %>%
  group_by(Resample) %>%
  summarise(rmse = MLmetrics::RMSE(pred, obs)) %>%
  pull(rmse) %>%
  mean
#output
0.610639
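The gap between the two calculations can be seen without caret at all. The toy sketch below uses made-up predictions in two folds and a hand-written rmse helper; the column names mirror m$pred, but the numbers are invented purely to make the arithmetic easy to follow.

```r
# Hand-written RMSE helper (same formula as caret::RMSE for complete data)
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Invented hold-out predictions: small errors in Fold1, large errors in Fold2
pred_df <- data.frame(
  Resample = c("Fold1", "Fold1", "Fold2", "Fold2"),
  obs      = c(0, 0, 0, 0),
  pred     = c(1, -1, 3, -3)
)

# Pooled: one RMSE over all hold-out rows (what the question computed)
pooled <- rmse(pred_df$pred, pred_df$obs)

# Per fold, then averaged (what the results table reports)
per_fold <- sapply(split(pred_df, pred_df$Resample),
                   function(d) rmse(d$pred, d$obs))
averaged <- mean(per_fold)

c(pooled = pooled, averaged = averaged)
```

Here the fold RMSEs are 1 and 3, so their mean is 2, while the pooled value is sqrt(5). Because squaring happens before averaging in the pooled version, folds with large errors weigh more heavily there, and the two numbers only coincide when every fold has the same RMSE.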
I get different results. This is apparently a random process.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
[1] 0.5824464
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
[1] 0.5271595
If you want a random (more accurately, pseudo-random) process to be reproducible, use set.seed immediately before the call.
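A minimal base R illustration of that advice, using rnorm in place of the model fit (the variable names are arbitrary):

```r
set.seed(1)
first <- rnorm(3)

# No reseeding: the generator continues from where it left off
second <- rnorm(3)

# Reseeding immediately before the call reproduces the first draw
set.seed(1)
third <- rnorm(3)

identical(first, third)   # same seed, same call, same numbers
identical(first, second)  # different position in the random stream
```

The same applies to the question's code: the data simulation and the train() call each consume random numbers, so both need a seed set right before them for the whole example to be repeatable.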

Caret leap forward for linear regression (feature selection) change nvmax

I have been playing around with the leapForward method from the leaps package in conjunction with caret and found that it only provides 5 variables. According to the leaps package, you can change nvmax to whatever number of subsets you wish.
I cannot see where to fit this into the caret wrapper. I have tried putting it in the train statement, as well as creating an expand.grid line, and it does not seem to work. Any help would be appreciated!
My code:
library(caret)
data <- read.csv(file="C:/mydata.csv", header=TRUE, sep=",")
fitControl <- trainControl(method = "loocv")
x <- data[, -19]
y <- data[, 19]
lmFit <- train(x=x, y=y,'leapForward', trControl = fitControl)
summary(lmFit)
By default, caret evaluates a small built-in grid of tuning parameter values
(tuneLength = 3, which for leapForward gives nvmax = 2, 3, 4).
You can specify whatever grid of parameters you like with the tuneGrid option.
Here is a reproducible example with the BloodBrain dataset. NB: I had to
transform the predictors with a PCA to avoid problems of multicollinearity.
library(caret)
data(BloodBrain, package = "caret")
dim(bbbDescr)
#> [1] 208 134
X <- princomp(bbbDescr)$scores[,1:131]
Y <- logBBB
fitControl <- trainControl(method = "cv")
Default: caret's built-in grid of parameters
lmFit <- train(y = Y, x = X,'leapForward', trControl = fitControl)
lmFit
#> Linear Regression with Forward Selection
#>
#> 208 samples
#> 131 predictors
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 187, 188, 187, 187, 187, 187, ...
#> Resampling results across tuning parameters:
#>
#> nvmax RMSE Rsquared MAE
#> 2 0.6682545 0.2928583 0.5286758
#> 3 0.7008359 0.2652202 0.5527730
#> 4 0.6781190 0.3026475 0.5215527
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was nvmax = 2.
With a grid search of your choice.
NB: expand.grid is not necessary here. It is useful when you combine
several tuning parameters.
lmFit <- train(y = Y, x = X,'leapForward', trControl = fitControl,
tuneGrid = expand.grid(nvmax = seq(1, 30, 2)))
lmFit
#> Linear Regression with Forward Selection
#>
#> 208 samples
#> 131 predictors
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 188, 188, 188, 186, 187, 187, ...
#> Resampling results across tuning parameters:
#>
#> nvmax RMSE Rsquared MAE
#> 1 0.7649633 0.07840817 0.5919515
#> 3 0.6952295 0.27147443 0.5250173
#> 5 0.6482456 0.35953363 0.4828406
#> 7 0.6509919 0.37800159 0.4865292
#> 9 0.6721529 0.35899937 0.5104467
#> 11 0.6541945 0.39316037 0.4979497
#> 13 0.6355383 0.42654189 0.4794705
#> 15 0.6493433 0.41823974 0.4911399
#> 17 0.6645519 0.37338055 0.5105887
#> 19 0.6575950 0.39628133 0.5084652
#> 21 0.6663806 0.39156852 0.5124487
#> 23 0.6744933 0.38746853 0.5143484
#> 25 0.6709936 0.39228681 0.5025907
#> 27 0.6919163 0.36565876 0.5209107
#> 29 0.7015347 0.35397968 0.5272448
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was nvmax = 13.
plot(lmFit)
Created on 2018-03-08 by the reprex package (v0.2.0).
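As a small aside on the expand.grid remark, the base R sketch below shows that for a single tuning parameter a plain data.frame carries the same values, and that expand.grid only adds something once there are several parameters to cross. The names param_a and param_b are hypothetical placeholders, not parameters of any particular caret method.

```r
# Single tuning parameter: both constructions hold the same column
grid_a <- expand.grid(nvmax = seq(1, 30, 2))
grid_b <- data.frame(nvmax = seq(1, 30, 2))
identical(grid_a$nvmax, grid_b$nvmax)
nrow(grid_b)

# Several parameters: expand.grid builds every combination
grid_c <- expand.grid(param_a = 1:2, param_b = c(1, 5))
grid_c
```

Either grid_a or grid_b could be passed as tuneGrid in the train() call above, since tuneGrid expects a data frame with one column per tuning parameter.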

Feature selection error with a logical regressor using `rfe` from `caret`

I was performing feature selection using rfe from the caret package for a linear regression.
One of my regressors is a logical variable; when I do feature selection with this variable, I always
get Error in { : task 1 failed - "undefined columns selected".
How can I do feature selection with logical variables using rfe?
Is it necessary to convert it to a 0/1 dummy variable?
Below is a reproducible example:
library(caret)
x <- mtcars[-1]
y <- mtcars$mpg
set.seed(2017)
ctrl <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)
lmProfile1 <- rfe(x, y, sizes = 1:5, rfeControl = ctrl)
# > lmProfile1
#
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#
# Resampling performance over subset size:
#
# Variables RMSE Rsquared RMSESD RsquaredSD Selected
# 1 3.503 0.8338 1.627 0.2393
# 2 3.197 0.8841 1.347 0.1783
# 3 3.214 0.8788 1.327 0.1815
# 4 3.050 0.8861 1.341 0.1603 *
# 5 3.063 0.8842 1.254 0.1670
# 10 3.332 0.8638 1.404 0.1926
#
# The top 4 variables (out of 4):
# wt, am, qsec, hp
# am is one of the best features; now I turn it into a logical variable
x <- mtcars[-1]
x$am <- x$am == 1
y <- mtcars$mpg
set.seed(2017)
ctrl <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)
lmProfile2 <- rfe(x, y, sizes = 1:5, rfeControl = ctrl)
# Error in { : task 1 failed - "undefined columns selected"
# > packageVersion('caret')
# [1] ‘6.0.73’
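A possible explanation, offered as a sketch rather than a confirmed diagnosis: lm() names the coefficient of a logical predictor amTRUE, not am. If the ranking step in lmFuncs uses coefficient names to pick columns (an assumption I have not checked against the caret 6.0.73 source), that name would not exist in x, which would match the "undefined columns selected" message. The base R code below only demonstrates the naming mismatch and the 0/1 conversion that avoids it.

```r
# Logical version of am, as in the failing example
x <- mtcars[-1]
x$am <- x$am == 1
fit_logical <- lm(mpg ~ ., data = cbind(mpg = mtcars$mpg, x))

# Coefficient names (minus the intercept) that are not column names of x
mismatch_logical <- setdiff(names(coef(fit_logical))[-1], names(x))
mismatch_logical

# Converting the logical to numeric 0/1 keeps the original column name
x$am <- as.numeric(x$am)
fit_numeric <- lm(mpg ~ ., data = cbind(mpg = mtcars$mpg, x))
mismatch_numeric <- setdiff(names(coef(fit_numeric))[-1], names(x))
mismatch_numeric
```

If the assumption holds, converting logical columns with as.numeric() before calling rfe() should sidestep the error, which is effectively the 0/1 dummy coding the question asks about.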
