Fine Tune Model with Caret Package - r

I am using CARET package to fine tune random forest mtry parameter. In the package, tunelength parameter can be used to automate search for best mtry parameter. But the problem is the "tunelength" works when i set minimum 2 folds in crossvalidation. It does not work when i do not want cross validation.
ctrl <- trainControl(method = "cv", classProbs = TRUE, summaryFunction = twoClassSummary, number = 2)
set.seed(2)
trained <- train(Y ~ . , data = mydata, method = "rf", ntree = 500, tunelength = 10, metric = "ROC", trControl = ctrl, importance = TRUE)
And do anyone know the default setting of tunelength? I mean which value of mtry , it would start with.

I think you don't understand what does parameter tuning mean. You want to select the best combination of parameters that improve some quality measure. The thing is that this quality measure can't be computed on the training set itself, because this would lead to over fitting. Crossvalidation precisely gives you an unbiased estimate of your quality measure.

But the problem is the "tunelength" works when i set minimum 2 folds in crossvalidation. It does not work when i do not want cross validation.
I'm not sure what "does not work" means. If you are not resampling, there are not many ways for determining mtry. You could use method = "OOB" in trainControl and use the internal random forest estimate and set tuneLength the same way you did before (see these two pages for more details).
Again, I'm not sure if this answers your question.
Max

Related

Get accuracy from Random Forest using 'train' model

I have the following code to fetch accuracy from RandomForest model with 5-fold cross validation:
traincontrol = trainControl(method="cv", number = 5, search = "random", savePredictions = T)
tuningGrid <- expand.grid(mtry=c(2,4,6,8))
all_accuracies <- c()
model = train(label~., data=training_data, method="rf", trControl = traincontrol,
tuneGrid = tuningGrid, ntree = 25)
I plan to run this model 15 times and record the best accuracy for each time in all_accuracies. Is there any way to fetch the accuracy with code instead of manually noting it? Since if I can do that, I'll just use for loop and record every accuracy in the all_accuracies vector.
Right now, I have to write 15 lines of the same code and then record the best accuracy manually.
I figured how I can do it.
I can calculate maximum accuracy of a model by
max(model$results$accuracy)

How tunelength parameter works in caret

I'm using following code to implement elastic net using R
model <- train(
Sales ~., data = train_data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
I'm confused about tunelength paramater. In Cran I'm seeing that
To change the candidate values of the tuning parameter, either of the
tuneLength or tuneGrid arguments can be used. The train function can
generate a candidate set of parameter values and the tuneLength
argument controls how many are evaluated. In the case of PLS, the
function uses a sequence of integers from 1 to tuneLength. If we want
to evaluate all integers between 1 and 15, setting tuneLength = 15
would achieve this
But train function is taking dependent & independent variable from my data then how it's using tuneLength parameter? Can you please help me understand?
In caret the train() function has a number of arguments to help select the "optimal" tuning parameters for your chosen model.
Model tuning is explained in detail in package documentation here.
Users can customize the tuning process by specifying a grid of possible parameter values that the model will use when training the model.
For some models, the use of tuneLength is an alternative to specifying a tuneGird.
For example, one method of searching for the 'optimal' model parameters is using random selection. In this case the tuneLength argument is used to control the number of combinations generated by this random tuning parameter search.
To use random search, another option is available in trainControl called search. Possible values of this argument are "grid" and "random". The built-in models contained in caret contain code to generate random tuning parameter combinations. The total number of unique combinations is specified by the tuneLength option to train.
It is covered in more detail here:
http://topepo.github.io/caret/random-hyperparameter-search.html
It is important to check the model you are using in the train function and look at which tuning parameters are used for that model. It will then be easier to understand how to correctly customize the model fitting process.
For your example of using method = 'glmnet' here is a comparison using tuneGrid and tuneLength (taken from package tests):
cctrl1 <- trainControl(method = "cv", number = 3, returnResamp = "all",
classProbs = TRUE, summaryFunction = twoClassSummary)
test_class_cv_model <- train(trainX, trainY,
method = "glmnet",
trControl = cctrl1,
metric = "ROC",
preProc = c("center", "scale"),
tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15),
.lambda = c((1:5)/10)))
cctrlR <- trainControl(method = "cv", number = 3, returnResamp = "all", search = "random")
test_class_rand <- train(trainX, trainY,
method = "glmnet",
trControl = cctrlR,
tuneLength = 4)

When using cross validation, is there a way to ensure each fold somehow contains at least several instances of the true class?

I'm fitting a model using cross fold validation with caret:
library(caret)
## tuning & parameters
set.seed(123)
train_control <- trainControl(
method = "cv",
number = 5,
savePredictions = TRUE,
verboseIter = TRUE,
classProbs = TRUE,
summaryFunction = my_summary
)
linear_model = train(
x = select(training_data, Avg_Load_Time),
y = target,
trControl = train_control,
method = "glm", # logistic regression
family = "binomial",
metric = "ROC"
)
The trouble is that out of ~5K rows I have only ~120 true cases. This is throwing a warning message when using GLM via caret "glm.fit: fitted probabilities numerically 0 or 1 occurred".
Is there a parameter I can set or some approach to ensuring each fold has some of the true cases?
It's easier when you shuffle data and have enough examples of each class.
If you don't have enough examples, you can increase the size of the minority class using SMOTE (Synthetic Minority Oversampling Technique). Package smotefamily in R.
Then you will be able to do 5 or 10 fold Cross Validation without raising any issues.

Pre-Processing Data in Caret and Making Predictions on an Unknown Data Set

I am using the Caret package train function to fit a model and then predict to predict values on an unknown data set (which I then get feedback on so I know the quality of my predictions). I'm having problems and I'm convinced it has to do with preprocessing the unknown data.
Briefly and simply, this is what I'm doing:
Pre-Process Training Data:
preproc = preProcess(train_num,method = c("center", "scale"))
train_standardized <- predict(preproc, train_num)
Train the Model:
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
n.trees = c(100,500),
shrinkage = 0.1,
n.minobsinnode = 20)
train.boost = train(x=train_standardized[,-length(train_standardized)],
y=train_standardized$response,
method = "gbm",
metric = "ROC",
maximize = FALSE,
tuneGrid= gbmGrid,
trControl = trainControl(method="cv",
number=5,
classProbs = TRUE,
verboseIter = TRUE,
summaryFunction=twoClassSummary,
savePredictions = TRUE))
Prepare unknown data for predictions:
...
unknown_standardized <- predict(preproc, unknown_num)
...
Make the actual prediction on the unknown data:
preds <- predict(train.boost,newdata=unknown_standardized,type="prob")
Note that the "preproc" object is the same one resulting from analysis of the training set and used to make the centered/standardized predictions on which the model was trained.
When I get my evaluation back my evaluation on the unknown data it is substantially worse than what was predicted using the training set (ROC using training data via cross validation is about .83, ROC using the unknown data that I get back from the evaluating party is about .70).
Do I have the process right? What am I doing wrong?
Thanks in advance.
In one sense, you are not doing anything wrong at all.
A predictor is likely to do better on a training sample as it has used that data to build the model.
The whole point of the training set is to see how well that model generalizes. It is likely to "overfit" to the training data to a greater or lesser extent and to do somewhat worse on new data.
At least once you have your score against new data, you know the true accuracy of the model. If that accuracy is sufficient for your purposes, then the model will be useable and (because you have done the training/test) robust to new data.
Now, it is possible that the model could be better if it was trained on a wider variety of data. So to increase real accuracy, it might be worth using cross-validation to train it on multiple slices of the data - k fold cross-validation. Caret has a nice facility for that. http://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/

How to track a progress while building model with the caret package?

I am trying to build model using train function from caret package:
model <- train(training$class ~ .,data=training, method = "nb")
Training set contains about 20K observations, each observation has above 100 variables. I would like to know if building a model from that dataset will take hours or days.
How to estimate time needed to train model from data? How track a progress of training process when using functions from caret package?
Assuming that you are training the model with
an expanded grid of tuning parameters (all combinations of the tuning parameters)
and a resampling technique of your choice (cross validation, bootstrap etc)
You could set
trainctrl <- trainControl(verboseIter = TRUE)
and set it in the trControl argument of the train function to track the training progress
model <- train(training$class ~ .,data=training, method = 'nb', trControl = trainctrl)
This prints out the progress out to the console at each resampling stage, and allows you to gauge the progress of the training/parameter tuning.
To estimate the total running time, you could run the model once to see how long it runs, and estimate the total time by multiplying accordingly based on your resampling scheme and number of parameter combinations. This can be done by setting the trainControl again, and setting the tuneLength to 1:
trainctrl <- trainControl(method = 'none')
model <- train(training$class ~ ., data = training, method = 'nb', trControl = trainctrl, tuneLength = 1)
Hope this helps! :)

Resources