How the tuneLength parameter works in caret - r

I'm using the following code to fit an elastic net in R:
model <- train(
Sales ~., data = train_data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
I'm confused about the tuneLength parameter. On CRAN I see this:
To change the candidate values of the tuning parameter, either of the
tuneLength or tuneGrid arguments can be used. The train function can
generate a candidate set of parameter values and the tuneLength
argument controls how many are evaluated. In the case of PLS, the
function uses a sequence of integers from 1 to tuneLength. If we want
to evaluate all integers between 1 and 15, setting tuneLength = 15
would achieve this
But the train function takes the dependent and independent variables from my data, so how is it using the tuneLength parameter? Can you please help me understand?

In caret the train() function has a number of arguments to help select the "optimal" tuning parameters for your chosen model.
Model tuning is explained in detail in package documentation here.
Users can customize the tuning process by specifying a grid of candidate parameter values for train to evaluate.
For some models, tuneLength is an alternative to specifying a tuneGrid.
For example, one method of searching for the 'optimal' model parameters is using random selection. In this case the tuneLength argument is used to control the number of combinations generated by this random tuning parameter search.
To use random search, another option is available in trainControl called search. Possible values of this argument are "grid" and "random". The built-in models contained in caret contain code to generate random tuning parameter combinations. The total number of unique combinations is specified by the tuneLength option to train.
It is covered in more detail here:
http://topepo.github.io/caret/random-hyperparameter-search.html
It is important to check the model you are using in the train function and look at which tuning parameters are used for that model. It will then be easier to understand how to correctly customize the model fitting process.
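As a minimal sketch of that check, caret's modelLookup() lists the tuning parameters that train() varies for a given method. The variable name glmnet_params is only illustrative.

```r
library(caret)

# List the tuning parameters that train() varies for method = "glmnet"
glmnet_params <- modelLookup("glmnet")
print(glmnet_params[, c("parameter", "label")])
```

For glmnet this should show alpha and lambda, which are the two columns a tuneGrid for this method needs.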
For your example of using method = 'glmnet' here is a comparison using tuneGrid and tuneLength (taken from package tests):
cctrl1 <- trainControl(method = "cv", number = 3, returnResamp = "all",
classProbs = TRUE, summaryFunction = twoClassSummary)
test_class_cv_model <- train(trainX, trainY,
method = "glmnet",
trControl = cctrl1,
metric = "ROC",
preProc = c("center", "scale"),
tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15),
.lambda = c((1:5)/10)))
cctrlR <- trainControl(method = "cv", number = 3, returnResamp = "all", search = "random")
test_class_rand <- train(trainX, trainY,
method = "glmnet",
trControl = cctrlR,
tuneLength = 4)
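To make the difference concrete, the tuneGrid from the first call can be built on its own and counted. This is only a sketch in base R; the name grid is illustrative.

```r
# Same grid as in the tuneGrid call above: 15 alpha values x 5 lambda values
grid <- expand.grid(.alpha = seq(.05, 1, length = 15),
                    .lambda = c((1:5) / 10))

# Every row is one candidate (alpha, lambda) pair that train() evaluates
n_candidates <- nrow(grid)
print(n_candidates)  # 75
```

With tuneGrid, train() evaluates exactly these 75 combinations. In the random-search call, tuneLength = 4 instead asks caret to generate 4 combinations itself.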

Related

gamSpline Caret package

How do I choose the optimal df(degrees of freedom) for my splines?
I used Poisson regression with splines to adjust for nonlinear changes. Using the caret package, I called the train function with method = "gamSpline", which tested only 3 df.
model <- train(
RBC ~ elapsed,
obgyn_aleph,
method = "gamSpline",
trControl = trainControl(
method = "cv",
number = 10,
verboseIter = TRUE
)
)
Aggregating results
Selecting tuning parameters
Fitting df = 3 on full training set
Is this the default? If so, how can I change it?
Tnx,
Daniel
The tuneGrid argument allows the user to specify a custom grid of tuning parameters, in this case df:
model <- train(
RBC ~ elapsed,
obgyn_aleph,
method = "gamSpline",
trControl = trainControl(
method = "cv",
number = 10,
verboseIter = TRUE
),
tuneGrid = data.frame(df=seq(2,20,by=2))
)
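As for whether this is the default: when no grid is given, train() uses tuneLength = 3, and each method's own grid function turns that length into candidate values. The sketch below asks the gamSpline model definition for its default candidates. It assumes the grid element returned by getModelInfo() can be called directly as shown; x and y are placeholder data, and gam_info and default_grid are illustrative names.

```r
library(caret)

# Model definition that train() uses for method = "gamSpline"
gam_info <- getModelInfo("gamSpline", regex = FALSE)[[1]]

# Ask the model's own grid function for 3 candidates, mirroring
# train()'s default tuneLength = 3. x and y are placeholders only.
default_grid <- gam_info$grid(x = iris[, 2:4], y = iris$Sepal.Length, len = 3)
print(default_grid)
```

Passing a larger tuneLength to train(), or a tuneGrid as above, replaces these candidates.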

Does using the same trainControl object for cross-validation when training multiple models with caret allow for accurate model comparison?

I have been delving into the R package caret recently, and have a question about reproducibility and comparison of models during training that I haven't quite been able to pin down.
My intention is that each train call, and thus each resulting model, uses the same cross validation splits so that the initial stored results from the cross-validation are comparable from the out-of-sample estimations of the model that are calculated during building.
One method I've seen is that you can specify the seed prior to each train call as such:
set.seed(1)
model <- train(..., trControl = trainControl(...))
set.seed(1)
model2 <- train(..., trControl = trainControl(...))
set.seed(1)
model3 <- train(..., trControl = trainControl(...))
However, does sharing a trainControl object between the train calls mean that they generally use the same resampling indexes, or do I have to explicitly pass the seeds argument into the function? Does the trainControl object call random functions when it is used, or are they set on declaration?
My current method has been:
set.seed(1)
train_control <- trainControl(method="cv", ...)
model1 <- train(..., trControl = train_control)
model2 <- train(..., trControl = train_control)
model3 <- train(..., trControl = train_control)
Are these train calls going to be using the same splits and be comparable, or do I have to take further steps to ensure that? i.e. specifying seeds when the trainControl object is made, or calling set.seed before each train? Or both?
Hopefully this has made some sense, and isn't a complete load of rubbish. Any help would be appreciated.
My code project that I'm querying about can be found here. It might be easier to read it and you'll understand.
The CV folds are not created when trainControl is defined unless they are supplied explicitly through the index argument, which I recommend. They can be created using one of the specialized caret functions:
createFolds
createMultiFolds
createTimeSlices
groupKFold
That being said, setting a specific seed before defining trainControl will not result in the same CV folds.
Example:
library(caret)
library(tidyverse)
set.seed(1)
trControl = trainControl(method = "cv",
returnResamp = "final",
savePredictions = "final")
Create two models:
knnFit1 <- train(iris[,1:4], iris[,5],
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trControl)
ldaFit2 <- train(iris[,1:4], iris[,5],
method = "lda",
tuneLength = 10,
trControl = trControl)
Check if the same indexes are in the same folds:
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
#FALSE
If you set the same seed prior to each train call:
set.seed(1)
knnFit1 <- train(iris[,1:4], iris[,5],
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trControl)
set.seed(1)
ldaFit2 <- train(iris[,1:4], iris[,5],
method = "lda",
tuneLength = 10,
trControl = trControl)
set.seed(1)
rangerFit3 <- train(iris[,1:4], iris[,5],
method = "ranger",
tuneLength = 10,
trControl = trControl)
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
knnFit1$pred %>%
left_join(rangerFit3$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
the same indexes will be used in the folds. However, I would not rely on this method when using parallel computation. Therefore, to ensure the same data splits are used, it is best to define them manually using the index/indexOut arguments to trainControl.
When you set the index argument manually the folds will be the same; however, this does not ensure that models made by the same method will be the same, since most methods include some sort of stochastic process. So to be fully reproducible it is advisable to also set the seed prior to each train call. When running in parallel, the seeds argument to trainControl needs to be set to get fully reproducible models.
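A minimal sketch of the manual approach, assuming the iris data used above and 10-fold CV. The names folds, ctrl, knn_fit and lda_fit are illustrative, and lda needs the MASS package to be installed.

```r
library(caret)

set.seed(1)
# returnTrain = TRUE returns, per fold, the row numbers used for training,
# which is the form the index argument of trainControl expects
folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)

# One control object carrying the fixed folds, shared by every model
ctrl <- trainControl(method = "cv", index = folds, savePredictions = "final")

set.seed(1)
knn_fit <- train(iris[, 1:4], iris$Species, method = "knn",
                 preProcess = c("center", "scale"), trControl = ctrl)
set.seed(1)
lda_fit <- train(iris[, 1:4], iris$Species, method = "lda", trControl = ctrl)

# Compare the training indices each fitted model stored
same_folds <- identical(knn_fit$control$index, lda_fit$control$index)
print(same_folds)
```

Because the folds live in the control object rather than in the random number stream, the splits do not depend on where set.seed is called.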

Manual Summary Function in Caret - Make all predictions = Fail

I am having some issues understanding how a custom summary function in caret works. I have written a simple function that maximizes the number of "fail" predictions. But for some reason the model doesn't predict all instances as fail (on the training dataset).
See below for the code:
Summary function to maximize "fail" predictions:
BS <- function (data, lev = NULL, model = NULL) {
negpredictions <- sum(data$pred == "fail")
names(negpredictions) <- c("Min_Precision")
negpredictions
}
Training Script:
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
#sampling = "smote",
summaryFunction = BS,
search = "grid")
tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1))
cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)
random.forest.orig <- train(pass ~ manufacturer+meter.type+premise+size+age+avg.winter+totalizer,
data = meter.train,
method = "rf",
tuneGrid = tune.grid,
metric = "Min_Precision",
maximize = TRUE,
trControl = train.control)
stopCluster(cl)
The metric specified in caret is not a loss function but rather the metric used to choose the optimal model (in most cases the optimal combination of hyperparameters). So by specifying the BS function you are merely selecting the mtry that maximizes the number of "fail" predictions.
From the function help:
metric
A string that specifies what summary metric will be used to select the optimal model. By default, possible values are "RMSE" and "Rsquared" for regression and "Accuracy" and "Kappa" for classification. If custom performance metrics are used (via the summaryFunction argument in trainControl, the value of metric should match one of the arguments. If it does not, a warning is issued and the first metric given by the summaryFunction is used. (NOTE: If given, this argument must be named.)
If you check
random.forest.orig$bestTune
you will see the best tune is the one that maximized the BS function. However, this does not change the model's native loss function.
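To see why, it helps to call the summary function by hand. In the sketch below, held_out is a made-up data frame standing in for the held-out predictions that caret passes to the function for each resample; BS is the function from the question, written more compactly.

```r
# Summary function from the question: counts "fail" predictions
BS <- function(data, lev = NULL, model = NULL) {
  c(Min_Precision = sum(data$pred == "fail"))
}

# Made-up held-out predictions (obs = truth, pred = model output)
held_out <- data.frame(
  obs  = factor(c("fail", "pass", "pass", "fail"), levels = c("fail", "pass")),
  pred = factor(c("fail", "fail", "pass", "pass"), levels = c("fail", "pass"))
)

score <- BS(held_out)
print(score)  # Min_Precision = 2
```

The function only scores predictions that an already fitted forest produced. train() compares these scores across mtry values, but nothing here feeds back into how each forest is grown.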

Selecting a Different ROC Set Point in Caret

Is it possible to select a different ROC set point in the caret train function instead of using metric = "ROC" (which I believe maximizes the AUC)?
For example:
random.forest.orig <- train(pass ~ x+y,
data = meter.train,
method = "rf",
tuneGrid = tune.grid,
metric = "ROC",
trControl = train.control)
Specifically, I have a two-class problem (fail or pass) and I want to maximize the fail predictions while still maintaining a fail accuracy (or negative predictive value) of >80%, i.e. of every 10 fails I predict, at least 8 are correct.
You can customize the caret::trainControl() object to use AUC, instead of accuracy, to tune the parameters of your models. Please check the caret documentation for details. (The built-in function, twoClassSummary, will compute the sensitivity, specificity and area under the ROC curve).
Note: In order to compute class probabilities, the pass variable must be a factor.
Below is an example using 5-fold CV:
fitControl <- caret::trainControl(
method = "cv",
number = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE
)
So your code will be adjusted a bit:
random.forest.orig <- train(pass ~ x+y,
data = meter.train,
method = "rf",
tuneGrid = tune.grid,
metric = "ROC",
trControl = fitControl)
# Print model to console to examine the output
random.forest.orig
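twoClassSummary reports ROC, Sens and Spec, none of which is the precision of the fail predictions that the question asks about. One possible direction, shown only as a sketch, is a custom summary function that reports that precision so it can be inspected per tuning value. failPrecisionSummary and mock are hypothetical names, and the function does not enforce the 80% target by itself.

```r
# Hypothetical summary function: precision of "fail" predictions
# and how many observations were predicted as "fail"
failPrecisionSummary <- function(data, lev = NULL, model = NULL) {
  predicted_fail <- data$pred == "fail"
  n_predicted <- sum(predicted_fail)
  precision <- if (n_predicted > 0) {
    sum(data$obs[predicted_fail] == "fail") / n_predicted
  } else {
    NA_real_  # undefined when nothing was predicted as "fail"
  }
  c(FailPrecision = precision, FailPredicted = n_predicted)
}

# Made-up held-out predictions to exercise the function
mock <- data.frame(
  obs  = factor(c("fail", "fail", "pass", "pass", "fail"), levels = c("fail", "pass")),
  pred = factor(c("fail", "fail", "fail", "pass", "pass"), levels = c("fail", "pass"))
)

result <- failPrecisionSummary(mock)
print(result)  # FailPrecision = 2/3, FailPredicted = 3
```

It could be supplied as summaryFunction in trainControl with metric = "FailPrecision". As with any summary function, this only changes which tuning value is selected, not how the forest is fitted.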

Issues with tuneGrid parameter in random forest

I've been dealing with some extremely imbalanced data and I would like to use stratified sampling to create more balanced random forests.
Right now, I'm using the caret package, mainly for tuning the random forests.
So I tried to set up a tuneGrid to pass the mtry and sampsize parameters into the caret train method as follows.
mtryGrid <- data.frame(.mtry = 100),.sampsize=80)
rfTune<- train(x = trainX,
y = trainY,
method = "rf",
trControl = ctrl,
metric = "Kappa",
ntree = 1000,
tuneGrid = mtryGrid,
importance = TRUE)
When I run this example, I get the following error
The tuning parameter grid should have columns mtry
I've come across discussions like this suggesting that passing in these parameters should be possible.
On the other hand, this page suggests that the only parameter that can be passed in is mtry.
Can I even pass sampsize into the random forests via caret?
It looks like there is a bracket issue with your mtryGrid. Alternatively, you can also use expand.grid to give the different values of mtry you want to try.
By default, the only parameter you can tune for a random forest is mtry. However, you can still pass the other parameters to train; they will have a fixed value and so won't be tuned by train. You can also still ask for a stratified sample in train. Below is how I would do it, assuming that trainY is a boolean variable by which you want to stratify your samples, and that you want samples of size 80 for each category:
mtryGrid <- expand.grid(mtry = 100) # you can put different values for mtry
rfTune<- train(x = trainX,
y = trainY,
method = "rf",
trControl = ctrl,
metric = "Kappa",
ntree = 1000,
tuneGrid = mtryGrid,
strata = factor(trainY),
sampsize = c(80, 80),
importance = TRUE)
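The statement that mtry is the only tunable parameter for method = "rf" can be checked with caret's modelLookup(). A short sketch, with rf_params as an illustrative name:

```r
library(caret)

# Tuning parameters that train() accepts in tuneGrid for method = "rf"
rf_params <- modelLookup("rf")
print(rf_params[, c("parameter", "label")])
```

Any column in tuneGrid other than the ones listed here, such as sampsize, triggers the "tuning parameter grid should have columns mtry" error from the question.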
I doubt one can pass sampsize and strata directly to train. But from here I believe the solution is to use trainControl(). That is,
mtryGrid <- data.frame(mtry = 100)
rfTune<- train(x = trainX,
y = trainY,
method = "rf",
trControl = trainControl(sampling=X),
metric = "Kappa",
ntree = 1000,
tuneGrid = mtryGrid,
importance = TRUE)
where X can be one of c("up","down","smote","rose").
