caret: RFE with variable tuneGrid

I'm trying to use caret to fit a PLS model while optimizing the number of components 'ncomps':
library("caret")
set.seed(342)
train <- as.data.frame ( matrix( rnorm(1e4) , 100, 100 ) )
ctrl <- rfeControl(functions = caretFuncs,
method = "repeatedcv",
number=2,
repeats=1,
verbose =TRUE
)
pls.fit.rfe <- rfe(V1 ~ .,
data = train,
method = "pls",
sizes = 6,
tuneGrid = data.frame(ncomp = 7),
rfeControl = ctrl
)
Error in { :
task 1 failed - "final tuning parameters could not be determined"
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Invalid number of components, ncomp
Setting ncomp to 6 fixes the problem. It makes sense that I get an error when min(sizes) < max(ncomp), but is there a way to vary ncomp depending on the number of features used in the RFE iteration, i.e. the sizes variable? I would simply like to optimize over a wide range of sizes and numbers of components at the same time.

Try using tuneLength = 7 instead of tuneGrid. The former is more flexible and will use an appropriate ncomp given the size of the data set:
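For example, a minimal adjustment of the call from the question (same train data and ctrl object):
pls.fit.rfe <- rfe(V1 ~ .,
                   data = train,
                   method = "pls",
                   sizes = 6,
                   tuneLength = 7,
                   rfeControl = ctrl)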
> pls.fit.rfe
Recursive feature selection
Outer resampling method: Cross-Validated (2 fold, repeated 1 times)
Resampling performance over subset size:
Variables RMSE Rsquared RMSESD RsquaredSD Selected
6 1.0229 0.01684 0.04192 0.0155092
99 0.9764 0.00746 0.01096 0.0008339 *
The top 5 variables (out of 99):
If you'd rather not do that, you can always write your own fit function too.
Max
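For the second suggestion (writing your own fit function), here is a rough sketch of my own, not part of the answer above; it caps ncomp at the number of predictors left in each RFE iteration:
plsFuncs <- caretFuncs
plsFuncs$fit <- function(x, y, first, last, ...) {
  # Tune ncomp from 1 up to at most 7, but never beyond the current subset size
  train(x, y,
        method = "pls",
        tuneGrid = data.frame(ncomp = seq_len(min(7, ncol(x)))),
        ...)
}
ctrl <- rfeControl(functions = plsFuncs,
                   method = "repeatedcv",
                   number = 2,
                   repeats = 1,
                   verbose = TRUE)
pls.fit.rfe <- rfe(V1 ~ ., data = train, sizes = 6, rfeControl = ctrl)
This way the candidate number of components shrinks automatically with the sizes variable, so min(sizes) no longer has to exceed max(ncomp).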

Related

Setting hidden layers and neurons in neuralnet and caret (R)

I would like to cross-validate a neural network using the package neuralnet and caret.
The data df can be copied from this post.
When running the neuralnet() function, there is an argument called hidden where you can set the hidden layers and neurons in each. Let's say I want 2 hidden layers with 3 and 2 neurons respectively. It would be written as hidden = c(3, 2).
However, as I want to cross-validate it, I decided to use the fantastic caret package. But when using the function train(), I do not know how to set the number of layers and neurons.
Does anyone know where can I add these numbers?
This is the code I ran:
nn <- caret::train(DC1 ~ ., data = df,
                   method = "neuralnet",
                   # tuneGrid = tune.grid.neuralnet,
                   metric = "RMSE",
                   trControl = trainControl(
                     method = "cv", number = 10,
                     verboseIter = TRUE
                   ))
By the way, I am getting some warnings with the previous code:
predictions failed for Fold01: layer1=3, layer2=0, layer3=0 Error in cbind(1, pred) %*% weights[[num_hidden_layers + 1]] :
requires numeric/complex matrix/vector arguments
Ideas on how to solve it?
When using the neuralnet model in caret, you can specify the number of hidden units in each of the three supported layers with the tuning parameters layer1, layer2 and layer3. I found this out by checking the source.
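You can confirm the parameter names with modelLookup:
library(caret)
modelLookup("neuralnet") # should list layer1, layer2 and layer3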
library(caret)
grid <- expand.grid(layer1 = c(32, 16),
                    layer2 = c(32, 16),
                    layer3 = 8)
Use case with BostonHousing data:
library(mlbench)
data(BostonHousing)
Let's just select the numerical columns to keep the example simple:
BostonHousing[,sapply(BostonHousing, is.numeric)] -> df
nn <- train(medv ~ .,
            data = df,
            method = "neuralnet",
            tuneGrid = grid,
            metric = "RMSE",
            preProc = c("center", "scale", "nzv"), # good idea to do this with neural nets - your error is due to non-scaled data
            trControl = trainControl(
              method = "cv",
              number = 5,
              verboseIter = TRUE))
The part
preProc = c("center", "scale", "nzv")
is essential for the algorithm to converge; neural nets don't like unscaled features.
It's super slow, though.
nn
#output
Neural Network
506 samples
12 predictor
Pre-processing: centered (12), scaled (12)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 405, 404, 404, 405, 406
Resampling results across tuning parameters:
layer1 layer2 RMSE Rsquared MAE
16 16 NaN NaN NaN
16 32 4.177368 0.8113711 2.978918
32 16 3.978955 0.8275479 2.822114
32 32 3.923646 0.8266605 2.783526
Tuning parameter 'layer3' was held constant at a value of 8
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were layer1 = 32, layer2 = 32 and layer3 = 8.

How to compare different models using caret, tuning different parameters?

I'm trying to implement some functions to compare five different machine learning models to predict some values in a regression problem.
My intention is to build a suite of functions that train the different models and organize the results. The models I selected are: Lasso, Random Forest, SVM, Linear Model and Neural Network. To tune some of the models I intend to use Max Kuhn's reference: https://topepo.github.io/caret/available-models.html.
However, since each model requires different tuning parameters, I'm unsure how to set them.
First I set up the grid for tuning the 'nnet' model. Here I selected different numbers of nodes in the hidden layer and different decay coefficients:
my.grid <- expand.grid(size = seq(from = 1, to = 10, by = 1),
                       decay = seq(from = 0.1, to = 0.5, by = 0.1))
Then I construct the function that will run the five models, each with 6-fold cross-validation repeated 5 times:
my_list_model <- function(model) {
  set.seed(1)
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                repeats = 5,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configurations of the machine learning models:
  set.seed(1)
  fit_m <- train(ST1 ~ .,
                 data = train, # my original data frame, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,   # linear activation function output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = my.grid) # Here is how I pass the 'nnet' tuning grid
  return(fit_m)
}
Lastly, I execute the five models:
lapply(list(Lass = "lasso",
            RF = "rf",
            SVM = "svmLinear",
            OLS = "lm",
            NN = "nnet"),
       my_list_model) -> model_list
However, when I run this, it shows:
Error: The tuning parameter grid should not have columns fraction
From what I understand, I'm not specifying the tuning parameters correctly. If I drop the 'nnet' model and swap in, for example, an XGBoost model in the penultimate line, it seems to work and results are calculated. So the problem appears to be with the 'nnet' tuning parameters.
I think my real question is then: how do I configure the different tuning parameters of these models, especially the 'nnet' model? In addition, since I didn't set up parameters for lasso, random forest, svmLinear and the linear model, how were they tuned by the caret package?
my_list_model <- function(model, grd = NULL) {
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configurations of the machine learning models:
  set.seed(1)
  fit_m <- train(Y ~ .,
                 data = df, # my original data frame, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,   # linear activation function output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = grd) # Here is how each model's tuning grid is passed
  return(fit_m)
}
First, run the code below to see all the related tuning parameters:
modelLookup('rf')
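The same lookup works for the other models in the question; the parameter names in the comments below are what I would expect caret to report, so verify them yourself:
modelLookup('nnet')      # size, decay
modelLookup('lasso')     # fraction
modelLookup('svmLinear') # C
modelLookup('lm')        # intercept (nothing that usually needs tuning)
When you don't supply a grid for a model, train falls back on a default grid built from tuneLength (3 by default) values of each parameter, which is how models without an explicit grid get tuned.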
Now build a grid for each model based on the lookup above:
svmGrid <- expand.grid(C=c(3,2,1))
rfGrid <- expand.grid(mtry=c(5,10,15))
Create a list of all the models' grids, making sure each model's name matches its name in the list:
grd_all <- list(svmLinear = svmGrid,
                rf = rfGrid)
model_list <- lapply(c("rf", "svmLinear"),
                     function(x) my_list_model(x, grd_all[[x]]))
model_list
[[1]]
Random Forest
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
5 63.54864 0.5247415 55.72074
10 63.70247 0.5255311 55.35263
15 62.13805 0.5765130 54.53411
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 15.
[[2]]
Support Vector Machines with Linear Kernel
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
1 59.83309 0.5879396 52.26890
2 66.45247 0.5621379 58.74603
3 67.28742 0.5576000 59.55334
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was C = 1.

How to do recursive feature elimination with logistic regression?

Can someone provide me a detailed example of using caret's rfe function with the glm or glmnet model? I tried something like this:
rfe_records <- Example_data_frame
rfe_ctrl <- rfeControl(functions = caretFuncs, method = "repeatedcv", repeats = 5,
                       verbose = TRUE, classProbs = TRUE, summaryFunction = twoClassSummary)
number_predictors <- dim(rfe_records)[2] - 1
x <- dplyr::select(rfe_records, -outcomeVariable)
y <- as.numeric(rfe_records$outcomeVariable)
glmProfile <- rfe(x, y,
                  rfeControl = rfe_ctrl,
                  sizes = c(1:number_predictors),
                  method = "glmnet",
                  preProc = c("center", "scale"),
                  metric = "Accuracy")
print(glmProfile)
But the results I'm getting are not what I expected. I specified Accuracy as the metric, but I got:
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables RMSE Rsquared RMSESD RsquaredSD Selected
1 0.5047 0.10830 0.04056 0.11869 *
2 0.5058 0.09386 0.04728 0.11332
3 0.5117 0.08565 0.04999 0.10211
4 0.5139 0.07490 0.05042 0.10048
5 0.5166 0.07678 0.05456 0.09966
6 0.5202 0.08203 0.06174 0.10822
7 0.5187 0.08471 0.06207 0.10893
8 0.5168 0.07850 0.05939 0.09697
9 0.5175 0.08228 0.05966 0.10068
10 0.5176 0.08180 0.05980 0.10042
11 0.5179 0.08015 0.05950 0.09905
The top 1 variables (out of 1):
varName
According to this page, caret uses the class of the outcome variable to decide whether to do regression or classification with a function like glmnet that can do either. Your code specifies the outcome variable as numeric with as.numeric(), so glmnet chose to do regression, not classification as you intended. Specify your outcome variable as a two-level factor to get classification instead.
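A minimal sketch of that fix (the level labels "No"/"Yes" are my assumption; use whatever your two classes are):
# Make the outcome a two-level factor so glmnet performs classification
y <- factor(rfe_records$outcomeVariable, labels = c("No", "Yes"))
glmProfile <- rfe(x, y,
                  rfeControl = rfe_ctrl,
                  sizes = 1:number_predictors,
                  method = "glmnet",
                  preProc = c("center", "scale"),
                  metric = "Accuracy")
The rfe printout should then report classification metrics instead of RMSE and Rsquared.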

How can I use caret to train models and give the classification metrics over a validation set?

I have a training set, a validation set and a test set. I want to know how I can train a model over different parameters (defined by a grid in caret), but with the classification metrics calculated over the validation set.
If I have the following syntax...
TARGET <- iris$Species
trainX <- iris[, -5]
ctrl <- trainControl(method = "cv")
svm.tune <- train(x = trainX,
                  y = TARGET,
                  method = "svmRadial",
                  tuneLength = 9,
                  preProc = c("center", "scale"),
                  metric = "ROC",
                  trControl = ctrl)
svm.tune
Is there a direct way to obtain the metrics over the validation set, like the printout of svm.tune? Or should I call 'predict' by hand for each considered fit?
As I'm new to the caret grammar, I know how to obtain the metrics from cross-validation, but I would like to redirect those computations to this validation set. Which parameters should I use?
EDIT: Is there a way to show the classification metrics for each set of parameters of the grid using a validation set instead of cross-validation?
You can do this by specifying index and indexOut arguments to trainControl. I will use an example on the diamonds data from the ggplot2 package to highlight.
library(caret)
data(diamonds, package = "ggplot2")
# create a mock training and validation set
training = diamonds[1:10000,]
validation = diamonds[10001:11000,]
Then use the createFolds function to create some cross-validation folds for each model fit. The default returnTrain = FALSE would return the held-out rows rather than the rows kept in, hence setting it to TRUE.
trainIndex = createFolds(training$price, returnTrain = TRUE)
Now we will create one data frame that contains both the training and validation sets, and create a list of hold-out indices whose length equals the number of training folds. Note these indices simply correspond to the rows of the data that form the validation set.
dat = rbind(training,validation)
valIndex = lapply(trainIndex,function(i) 10001:11000)
Then, when specifying the trainControl object, we pass these two lists of indices to the arguments index and indexOut (the indices to fit on and to evaluate on, respectively) and train our model ("lm" here for speed).
trControl = trainControl(method = "cv",
                         index = trainIndex,
                         indexOut = valIndex)
train(price ~ ., method = "lm", data = dat, trControl = trControl)
## Linear Regression
##
## 11000 samples
## 9 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 8999, 8999, 9000, 9000, 8999, 9000, ...
##
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 508.0062 0.9539221 2.54004 0.0002948073
You can convince yourself that you are indeed doing what you intend, either by keeping all the resampling information and re-fitting one fold manually (you know the indices used for fitting, so you can do this; a sketch follows below), or by checking that we get different resampling results when we only use the training data. (Since the folds were fixed in advance, we would expect the same results if the validation set were not being used, because rerunning train involves no new randomness.)
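A rough sketch of the first check (my own, not part of the original answer), using the first fold:
# Re-fit on the first fold's training indices and score on the validation rows
fold1 <- trainIndex[[1]]
fit1  <- lm(price ~ ., data = dat[fold1, ])
preds <- predict(fit1, newdata = dat[10001:11000, ])
sqrt(mean((dat$price[10001:11000] - preds)^2)) # should be close to the resampled RMSE above
And the second check, using only the training data: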
train(price ~ ., method = "lm", data = training,
      trControl = trainControl(method = "cv", index = trainIndex))
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 337.6474 0.9074643 9.916053 0.008115761
Hope that helps.
Edit:
OK, I just noticed the OP asked about a classification example; however, the answer holds for both.

Is there a discrepancy between createMultiFolds behavior and the resampling summary of a caret object?

I encountered a strange issue using custom folds for the cross-validation with caret.
A MWE (in which the use of createMultiFolds doesn't really make sense)
library(caret) # version 6.0-47
data(iris)
set.seed(1)
train.idx <- createDataPartition(iris$Species, p = .75,
                                 list = FALSE,
                                 times = 1)
train_1 <- iris[train.idx, ]
# I create specific folds
set.seed(1)
id_1 <- createMultiFolds(train_1$Species, k = 10, times = 10)
# And use them in my cross-validation
cvCtrl_2 <- trainControl(method = "repeatedcv",
                         index = id_1,
                         classProbs = TRUE)
trainX <- train_1[, names(train_1) != "Species"]
# I fit my model
set.seed(1111)
rfTune2 <- train(trainX, train_1$Species,
                 method = "rf",
                 trControl = cvCtrl_2)
rfTune2
And my model summary is the following:
##Random Forest
...
##Resampling: Cross-Validated (10 fold, repeated 1 times)
id_1 is a list of 100 index vectors, for a 10-fold cross-validation repeated 10 times, and I ask trainControl to do the resampling using this list.
So why does my model summary describe the resampling as
(10 fold, repeated 1 times)
when length(rfTune2$control$index) is equal to 100, so I assume the model was correctly trained using all the folds?
Should I post an issue on GitHub, or did I miss something obvious about how trainControl works?
The defaults of trainControl include
number = ifelse(grepl("cv", method), 10, 25),
repeats = ifelse(grepl("cv", method), 1, number)
If you supply index, the code has no idea what type of resampling is used. You will have to specify number and repeats yourself to get the label correct.
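A minimal sketch of the corrected call (reusing the id_1 folds from the question):
cvCtrl_2 <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 10,
                         index = id_1,
                         classProbs = TRUE)
With these set, the printed summary should read "Cross-Validated (10 fold, repeated 10 times)".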
