How to do recursive feature elimination with logistic regression in R?

Can someone provide a detailed example of using caret's rfe function with a glm or glmnet model? I tried something like this:
rfe_records <- Example_data_frame
rfe_ctrl <- rfeControl(functions = caretFuncs,
                       method = "repeatedcv",
                       repeats = 5,
                       verbose = TRUE,
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)
number_predictors <- dim(rfe_records)[2] - 1
x <- dplyr::select(rfe_records, -outcomeVariable)
y <- as.numeric(rfe_records$outcomeVariable)
glmProfile <- rfe(x, y,
                  rfeControl = rfe_ctrl,
                  sizes = c(1:number_predictors),
                  method = "glmnet",
                  preProc = c("center", "scale"),
                  metric = "Accuracy")
print(glmProfile)
But the results I'm getting are not what I needed. I specified Accuracy as the metric but I got:
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables RMSE Rsquared RMSESD RsquaredSD Selected
1 0.5047 0.10830 0.04056 0.11869 *
2 0.5058 0.09386 0.04728 0.11332
3 0.5117 0.08565 0.04999 0.10211
4 0.5139 0.07490 0.05042 0.10048
5 0.5166 0.07678 0.05456 0.09966
6 0.5202 0.08203 0.06174 0.10822
7 0.5187 0.08471 0.06207 0.10893
8 0.5168 0.07850 0.05939 0.09697
9 0.5175 0.08228 0.05966 0.10068
10 0.5176 0.08180 0.05980 0.10042
11 0.5179 0.08015 0.05950 0.09905
The top 1 variables (out of 1):
varName

According to this page, caret uses the class of the outcome variable to decide whether to run regression or classification with a function like glmnet that can do either. In your code you converted the outcome variable to numeric with as.numeric(), so glmnet did regression, not the classification you intended. Specify the outcome variable as a two-level factor to get classification instead.
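As an illustrative sketch (keeping the question's object names rfe_records and outcomeVariable, and assuming the outcome has exactly two classes), only the construction of y needs to change; the classProbs/twoClassSummary options from the original call are omitted here for simplicity, since Accuracy is computed by the default summary function:

library(caret)
library(dplyr)

x <- dplyr::select(rfe_records, -outcomeVariable)
# Two-level factor instead of as.numeric(): caret now dispatches glmnet
# to classification, so Accuracy/Kappa are reported instead of RMSE.
y <- factor(rfe_records$outcomeVariable)

rfe_ctrl <- rfeControl(functions = caretFuncs,
                       method = "repeatedcv",
                       repeats = 5,
                       verbose = TRUE)

glmProfile <- rfe(x, y,
                  sizes = seq_len(ncol(x)),
                  rfeControl = rfe_ctrl,
                  method = "glmnet",
                  preProc = c("center", "scale"),
                  metric = "Accuracy")
print(glmProfile)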

Related

Setting hidden layers and neurons in neuralnet and caret (R)

I would like to cross-validate a neural network using the package neuralnet and caret.
The data df can be copied from this post.
When running the neuralnet() function, there is an argument called hidden where you can set the hidden layers and neurons in each. Let's say I want 2 hidden layers with 3 and 2 neurons respectively. It would be written as hidden = c(3, 2).
However, as I want to cross-validate it, I decided to use the fantastic caret package. But when using the function train(), I do not know how to set the number of layers and neurons.
Does anyone know where I can add these numbers?
This is the code I ran:
nn <- caret::train(DC1 ~ ., data = df,
                   method = "neuralnet",
                   # tuneGrid = tune.grid.neuralnet,
                   metric = "RMSE",
                   trControl = trainControl(
                     method = "cv", number = 10,
                     verboseIter = TRUE
                   ))
By the way, I am getting some warnings with the previous code:
predictions failed for Fold01: layer1=3, layer2=0, layer3=0 Error in cbind(1, pred) %*% weights[[num_hidden_layers + 1]] :
requires numeric/complex matrix/vector arguments
Ideas on how to solve it?
When using the neuralnet model in caret, you can specify the number of hidden units in each of the three supported layers with the tuning parameters layer1, layer2 and layer3. I found this out by checking the source code.
library(caret)
grid <- expand.grid(layer1 = c(32, 16),
                    layer2 = c(32, 16),
                    layer3 = 8)
Use case with BostonHousing data:
library(mlbench)
data(BostonHousing)
Let's just select the numeric columns to keep the example simple:
BostonHousing[,sapply(BostonHousing, is.numeric)] -> df
nn <- train(medv ~ .,
            data = df,
            method = "neuralnet",
            tuneGrid = grid,
            metric = "RMSE",
            preProc = c("center", "scale", "nzv"), # good idea with neural nets - your error is due to unscaled data
            trControl = trainControl(
              method = "cv",
              number = 5,
              verboseIter = TRUE)
            )
The part
preProc = c("center", "scale", "nzv")
is essential for the algorithm to converge; neural nets don't like unscaled features.
It's quite slow, though.
nn
#output
Neural Network
506 samples
12 predictor
Pre-processing: centered (12), scaled (12)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 405, 404, 404, 405, 406
Resampling results across tuning parameters:
layer1 layer2 RMSE Rsquared MAE
16 16 NaN NaN NaN
16 32 4.177368 0.8113711 2.978918
32 16 3.978955 0.8275479 2.822114
32 32 3.923646 0.8266605 2.783526
Tuning parameter 'layer3' was held constant at a value of 8
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were layer1 = 32, layer2 = 32 and layer3 = 8.
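Once trained, the returned train object can be used like any other caret model. As an illustrative sketch only (reusing the nn and df objects defined above), in-sample predictions can be obtained with predict(), which applies the stored preprocessing automatically:

# In-sample predictions; caret applies the stored center/scale/nzv
# preprocessing to the new data before feeding it to the network.
preds <- predict(nn, newdata = df)
caret::RMSE(preds, df$medv)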

How to compare different models using caret, tuning different parameters?

I'm trying to implement some functions to compare five different machine learning models to predict some values in a regression problem.
My intention is to build a suite of functions that train the different models and organize the results. The models I selected are: Lasso, Random Forest, SVM, Linear Model and Neural Network. To tune some of the models I intend to use Max Kuhn's reference: https://topepo.github.io/caret/available-models.html.
However, since each model requires different tuning parameters, I'm unsure how to set them.
First I set up the grid for tuning the 'nnet' model, selecting different numbers of nodes in the hidden layer and values of the decay coefficient:
my.grid <- expand.grid(size=seq(from = 1, to = 10, by = 1), decay = seq(from = 0.1, to = 0.5, by = 0.1))
Then I construct the function that will run the five models, 5 times each in a 6-fold cross-validation configuration:
my_list_model <- function(model) {
  set.seed(1)
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                repeats = 5,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configurations of the machine learning models:
  set.seed(1)
  fit_m <- train(ST1 ~ .,
                 data = train,       # my original data frame, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,         # linear activation function output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = my.grid) # Here is how I pass the 'nnet' tuning grid
  return(fit_m)
}
Lastly, I execute the five models:
lapply(list(Lass = "lasso",
            RF   = "rf",
            SVM  = "svmLinear",
            OLS  = "lm",
            NN   = "nnet"),
       my_list_model) -> model_list
However, when I run this, it shows:
Error: The tuning parameter grid should not have columns fraction
From what I understand, I'm not specifying the tuning parameters correctly. If I drop the 'nnet' model and replace it, for example, with an XGBoost model in the penultimate line, everything works and results are calculated. So the problem seems to be with the 'nnet' tuning parameters.
So I think my real question is: how do I configure these different model parameters, in particular for the 'nnet' model? Also, since I didn't set up parameters for the lasso, random forest, svmLinear and linear models, how were they tuned by the caret package?
my_list_model <- function(model, grd = NULL) {
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configurations of the machine learning models:
  set.seed(1)
  fit_m <- train(Y ~ .,
                 data = df,          # my original data frame, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,         # linear activation function output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = grd)     # pass the model-specific tuning grid
  return(fit_m)
}
First, run the code below to see all of a model's tuning parameters:
modelLookup('rf')
Now make a grid for each model, based on the lookup above:
svmGrid <- expand.grid(C=c(3,2,1))
rfGrid <- expand.grid(mtry=c(5,10,15))
Create a list of all the models' grids, making sure each list element's name matches the model name:
grd_all <- list(svmLinear = svmGrid,
                rf        = rfGrid)
model_list <- lapply(c("rf", "svmLinear"),
                     function(x) my_list_model(x, grd_all[[x]]))
model_list
[[1]]
Random Forest
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
5 63.54864 0.5247415 55.72074
10 63.70247 0.5255311 55.35263
15 62.13805 0.5765130 54.53411
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 15.
[[2]]
Support Vector Machines with Linear Kernel
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
1 59.83309 0.5879396 52.26890
2 66.45247 0.5621379 58.74603
3 67.28742 0.5576000 59.55334
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was C = 1.
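To tie this back to the original 'nnet' question, the same pattern could be extended with a grid for nnet as well. A hedged sketch follows; nnetGrid and the extended grd_all are names I'm introducing here, and size/decay are nnet's two tuning parameters as reported by modelLookup('nnet'):

# Hypothetical extension of the grid list to cover the 'nnet' model;
# size and decay are the two tuning parameters listed by modelLookup('nnet').
nnetGrid <- expand.grid(size  = seq(from = 1, to = 10, by = 1),
                        decay = seq(from = 0.1, to = 0.5, by = 0.1))

grd_all <- list(svmLinear = svmGrid,
                rf        = rfGrid,
                nnet      = nnetGrid)

# Models without an entry in grd_all get NULL and fall back to caret's
# default tuning grid inside my_list_model().
model_list <- lapply(c("rf", "svmLinear", "nnet"),
                     function(x) my_list_model(x, grd_all[[x]]))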

Feature selection with caret rfe and training with another method

Right now, I'm trying to use caret's rfe function to perform feature selection, because I'm in a situation with p >> n and most regression techniques that don't involve some sort of regularisation can't be used well. I have already used a few techniques with regularisation (lasso), but what I want to try now is to reduce the number of features so that I can run, at least decently, any kind of regression algorithm on them.
control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, rfeControl=control)
predict(model, testX)
Right now, if I do it like this, a feature selection algorithm using random forest will be run, and then the model with the best set of features, according to the 5-fold cross-validation, will be used for the prediction, right?
I'm curious about two things here:
1) Is there an easy way to take the set of features and train another model on it than the one used for the feature selection? For example, reducing the number of features from 500 to the 20 or so that seem most important and then applying k-nearest neighbours.
I'm imagining an easy way to do it that would look like this:
control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, method = "knn", rfeControl=control)
predict(model, testX)
2) Is there a way to tune the parameters of the feature selection algorithm? I would like to have some control over the values of mtry, the same way you can pass a grid of values when using caret's train function. Is there a way to do such a thing with rfe?
Here is a short example of how to perform rfe with an inbuilt model:
library(caret)
library(mlbench) # for the data
data(Sonar)

rctrl1 <- rfeControl(method = "cv",
                     number = 3,
                     returnResamp = "all",
                     functions = caretFuncs,
                     saveDetails = TRUE)

model <- rfe(Class ~ ., data = Sonar,
             sizes = c(1, 5, 10, 15),
             method = "knn",
             trControl = trainControl(method = "cv",
                                      classProbs = TRUE),
             tuneGrid = data.frame(k = 1:10),
             rfeControl = rctrl1)
model
model
#output
Recursive feature selection
Outer resampling method: Cross-Validated (3 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
1 0.6006 0.1984 0.06783 0.14047
5 0.7113 0.4160 0.04034 0.08261
10 0.7357 0.4638 0.01989 0.03967
15 0.7741 0.5417 0.05981 0.12001 *
60 0.7696 0.5318 0.06405 0.13031
The top 5 variables (out of 15):
V11, V12, V10, V49, V9
model$fit$results
#output
k Accuracy Kappa AccuracySD KappaSD
1 1 0.8082684 0.6121666 0.07402575 0.1483508
2 2 0.8089610 0.6141450 0.10222599 0.2051025
3 3 0.8173377 0.6315411 0.07004865 0.1401424
4 4 0.7842208 0.5651094 0.08956707 0.1761045
5 5 0.7941775 0.5845479 0.07367886 0.1482536
6 6 0.7841775 0.5640338 0.06729946 0.1361090
7 7 0.7932468 0.5821317 0.07545889 0.1536220
8 8 0.7687229 0.5333385 0.05164023 0.1051902
9 9 0.7982468 0.5918922 0.07461116 0.1526814
10 10 0.8030087 0.6024680 0.06117471 0.1229467
For more customization, see:
https://topepo.github.io/caret/recursive-feature-elimination.html
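For the first question (reusing the selected features with a different learner), one possible approach, sketched here with the Sonar data from above, is to pull the selected variable names from the fitted rfe object with predictors() (or model$optVariables) and pass that subset to a separate train() call:

# Sketch: reuse the rfe-selected predictors with a different learner.
# predictors() on an rfe object returns the names of the selected variables.
sel_vars <- predictors(model)

knn_fit <- train(x = Sonar[, sel_vars, drop = FALSE],
                 y = Sonar$Class,
                 method = "knn",
                 tuneGrid = data.frame(k = 1:10),
                 trControl = trainControl(method = "cv",
                                          classProbs = TRUE))
knn_fit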

Issues with Caret predict function when using caretStack object

I have been using the caretEnsemble and caret packages for stacking. My data is a document-term matrix with some additional features like POS tags, and the goal is to perform sentiment analysis with two classes. sentitr denotes the vector of sentiment labels for the training observations, sentitest the one for the test set.
I use a 60:40 train/test split:
control <- trainControl(method = "cv", number = 10,
                        savePredictions = "final", classProbs = TRUE,
                        summaryFunction = twoClassSummary,
                        index = createResample(sentitr, 10))

algorithmList <- c('pda', 'nnet', 'gbm', 'svmLinear', 'rf', 'C5.0', 'glmnet')

models <- caretList(trainset, sentitr, trControl = control, methodList = algorithmList)

# some model info
summary(models)
res <- resamples(models)
summary(res)
modelCor(res)
# lda and nnet extremely closely correlated

stackcontrol <- trainControl(method = "cv", number = 5,
                             savePredictions = "final", classProbs = TRUE,
                             summaryFunction = twoClassSummary)

# stacks
stack.c5.0 <- caretStack(models, method = "C5.0", metric = "ROC", trControl = stackcontrol)
summary(stack.c5.0)

stack.c50.pred <- predict(stack.c5.0, newdata = testset, type = "raw")
stackc50.conf <- confusionMatrix(stack.c50.pred, sentitest)
I ran the model 10 times, each time randomly partitioning my data into a 60/40 training/test split. I got the following classification accuracies on the test set (extracted from the confusion matrix):
X0.3225
1 0.3225
2 0.2550
3 0.7500
4 0.2675
5 0.2950
6 0.7825
7 0.2575
8 0.2875
9 0.2900
10 0.3275
These are the outputs. As you can see, an accuracy of around 75-80% is achieved on two of the iterations. This is expected and mirrors the results I get from fitting single models. But on the remaining iterations the accuracy is extremely bad; it almost seems as if the model randomly swaps the test error and the accuracy.
Any ideas what causes this behaviour?
On every iteration where the prediction comes up with such terrible accuracy, I get the following warnings when training the caretStack:
2: In predict.C5.0(modelFit, newdata, trial = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials
3: In predict.C5.0(modelFit, newdata, type = "prob", trials = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials
4: In predict.C5.0(modelFit, newdata, trial = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials
5: In predict.C5.0(modelFit, newdata, type = "prob", trials = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials

Is there a discrepancy between createMultiFolds behavior and the resampling summary of a caret object?

I encountered a strange issue using custom folds for cross-validation with caret.
Here is an MWE (in which the use of createMultiFolds doesn't really make sense):
library(caret) # version 6.0-47
data(iris)

set.seed(1)
train.idx <- createDataPartition(iris$Species, p = .75,
                                 list = FALSE,
                                 times = 1)
train_1 <- iris[train.idx, ]

# I create specific folds
set.seed(1)
id_1 <- createMultiFolds(train_1$Species, k = 10, times = 10)

# And use them in my cross-validation
cvCtrl_2 <- trainControl(method = "repeatedcv",
                         index = id_1,
                         classProbs = TRUE)

trainX <- train_1[, names(train_1) != "Species"]

# I fit my model
set.seed(1111)
rfTune2 <- train(trainX, train_1$Species,
                 method = "rf",
                 trControl = cvCtrl_2)
rfTune2
And my model summary is the following:
##Random Forest
...
##Resampling: Cross-Validated (10 fold, repeated 1 times)
id_1 is a list of 100 index vectors, for a 10-fold cross-validation repeated 10 times, and I ask trainControl to do the resampling using this list.
So why does my model summary describe the resampling as
(10 fold, repeated 1 times)
when length(rfTune2$control$index) is equal to 100, which makes me assume my model was correctly trained using all the folds?
Should I post an issue on GitHub, or did I miss something obvious about how trainControl works?
The defaults of trainControl are
number = ifelse(grepl("cv", method), 10, 25),
repeats = ifelse(grepl("cv", method), 1, number)
If you supply index, the code has no idea what type of resampling is being used. You will have to specify number and repeats yourself to get the label correct.
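For example, keeping the folds from the MWE above, specifying number and repeats explicitly should make the printed label match the resampling that was actually performed (a sketch, reusing id_1, trainX and train_1 from the question):

# Same custom folds, but with number/repeats stated explicitly so the
# summary prints "10 fold, repeated 10 times".
cvCtrl_3 <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 10,
                         index = id_1,
                         classProbs = TRUE)

set.seed(1111)
rfTune3 <- train(trainX, train_1$Species,
                 method = "rf",
                 trControl = cvCtrl_3)
rfTune3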
