I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomForest() model using caret’s train() function for method="rf". Unfortunately, I haven’t been able to get matching results, and I’m wondering what I’m overlooking.
I’ll also add that given that randomForest uses bootstrapping to generate samples to fit each of the ntrees, and estimates error based on out-of-bag predictions, I’m a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl function call. These options generate different results, but neither matches the randomForest() model.
Although I’ve read the caret package website (http://topepo.github.io/caret/index.html), as well as various StackOverflow questions that seem potentially relevant, I haven’t been able to figure out why the caret method = "rf" model produces different results from randomForest(). Thank you very much for any insight you might be able to offer.
Here’s a replicable example, using the CO2 dataset from the MASS package.
library(MASS)
data(CO2)
library(randomForest)
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry=2,
importance=TRUE,
metric="RMSE")
library(caret)
set.seed(1)
caret.oob.model <- train(uptake ~ .,
data = CO2,
method="rf",
ntree=50,
tuneGrid=data.frame(mtry=2),
nodesize = 5,
importance=TRUE,
metric="RMSE",
trControl = trainControl(method="oob"),
allowParallel=FALSE)
set.seed(1)
caret.boot.model <- train(uptake ~ .,
data = CO2,
method="rf",
ntree=50,
tuneGrid=data.frame(mtry=2),
nodesize = 5,
importance=TRUE,
metric="RMSE",
trControl=trainControl(method="boot", number=50),
allowParallel=FALSE)
print(rf.model)
print(caret.oob.model$finalModel)
print(caret.boot.model$finalModel)
Produces the following:
print(rf.model)
Mean of squared residuals: 9.380421
% Var explained: 91.88
print(caret.oob.model$finalModel)
Mean of squared residuals: 38.3598
% Var explained: 66.81
print(caret.boot.model$finalModel)
Mean of squared residuals: 42.56646
% Var explained: 63.16
And the code to look at variable importance:
importance(rf.model)
importance(caret.oob.model$finalModel)
importance(caret.boot.model$finalModel)
Using the formula interface in train converts factors to dummy variables. To compare results from caret with randomForest, you should use the non-formula interface.
In your case, you should also provide a seed inside trainControl to get the same result as in randomForest.
In the training section of the caret website there are some notes on reproducibility that explain how to use seeds.
library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry = 2,
importance = TRUE,
metric = "RMSE")
library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "oob", seed = 1),
allowParallel = FALSE)
If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.
In the following example, the last seed is for the final model and I set it to 1.
seeds <- as.vector(c(1:26), mode = "list")
# For the final model
seeds[[26]] <- 1
caret.boot.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "boot", seeds = seeds),
allowParallel = FALSE)
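For reference, here is a more general way to build that seeds object, following the pattern shown in ?trainControl (a sketch assuming 25 bootstrap resamples, the trainControl default, and a single tuning combination; for exact reproducibility of the answer above, use the fixed values 1:25 instead of random draws):
B <- 25        # number of resamples (trainControl default for "boot")
n_tune <- 1    # number of tuning parameter combinations (only mtry = 2 here)
seeds <- vector(mode = "list", length = B + 1)
for (i in seq_len(B)) seeds[[i]] <- sample.int(10000, n_tune)  # one seed per tuning combination, per resample
seeds[[B + 1]] <- 1  # single integer seed for the final model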
Defining the non-formula interface correctly with caret and setting the seed in trainControl, you will get the same results from all three models:
rf.model
caret.oob.model$final
caret.boot.model$final
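As a quick sanity check (not part of the original answer), you can compare the final OOB error and the variable importance directly; with the seeds set as described, these should match across the three fits:
# assumes the three models above have been fit with matching seeds
c(rf   = tail(rf.model$mse, 1),
  oob  = tail(caret.oob.model$finalModel$mse, 1),
  boot = tail(caret.boot.model$finalModel$mse, 1))
all.equal(importance(rf.model), importance(caret.oob.model$finalModel))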
I am trying to build an SVM model using the caret package. After tuning the parameters, how can we build the model using the optimal parameters, so that we don't need to tune the parameters again in the future when we use the model? Thanks.
library(caret)
data("mtcars")
set.seed(100)
mydata = mtcars[, -c(8,9)]
model_svmr <- train(
hp ~ .,
data = mydata,
tuneLength = 10,
method = "svmRadial",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 2,
verboseIter = TRUE
)
)
model_svmr$bestTune
The results show that sigma=0.1263203, C=4. How can we build a SVM model using the tuned parameters?
From this page in the caret package's documentation:
In cases where the model tuning values are known, train can be used to fit the model to the entire training set without any resampling or parameter tuning, using the method = "none" option in trainControl.
In your case, that would look like:
library(caret)
data("mtcars")
set.seed(100)
mydata2 <- mtcars[, -c(8, 9)]
model_svmr <- train(
hp ~ .,
data = mydata2,
method = "svmRadial",
trControl = trainControl(method = "none"), # Telling caret not to re-tune
tuneGrid = data.frame(sigma=0.1263203, C=4) # Specifying the parameters
)
where we have removed the parameters relating to tuning, namely tuneLength, metric and preProcess.
Note that plot.train, resamples, confusionMatrix.train and several other functions will not work with this object but predict.train and others will.
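For instance, prediction works as usual (a minimal illustration, reusing the mydata2 data frame defined above):
predict(model_svmr, newdata = head(mydata2))  # predictions from the single fitted model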
I'd like to downsample my data given that I have a significant class imbalance. Without downsampling, my GBM model performs reasonably well; however, with r-caret's downSample, accuracy = 0.5. I applied the same downsampling to another GBM model and got exactly the same results. What gives?
set.seed(1914)
down_train_my_gbm <- downSample(x = combined_features,
y = combined_features$label)
down_train_my_gbm$label <- NULL
my_gbm_combined_downsampled <- train(Class ~ .,
data = down_train_my_gbm,
method = "gbm",
trControl = trainControl(method="repeatedcv",
number=10, repeats=3,
classProbs = TRUE),
preProcess = c("range"),
verbose = FALSE)
I suspected that the issue might have to do with classProbs=TRUE. Changing this to FALSE skyrockets the accuracy to >0.95...but I get the exact same results for multiple models (which do not result in the same accuracy without downsampling). I'm baffled by this. What am I doing wrong here?
The caret train function allows you to downsample, upsample, and more via the trainControl options; see the Subsampling During Resampling section of the guide. In your case it would be:
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
## new option here:
sampling = "down")
model_with_down_sample <- train(Class ~ ., data = imbal_train,
method = "gbm",
preProcess = c("range"),
verbose = FALSE,
trControl = ctrl)
As a side note, avoid the formula style (e.g. Class ~ .) and use the direct columns instead: the formula interface has been shown to have issues with memory and speed when many predictors are used (https://github.com/topepo/caret/issues/263).
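For example, the same call with the non-formula interface would look roughly like this (a sketch assuming imbal_train contains the predictors plus the Class column, as above):
predictors <- setdiff(names(imbal_train), "Class")
model_with_down_sample <- train(x = imbal_train[, predictors],
                                y = imbal_train$Class,
                                method = "gbm",
                                preProcess = c("range"),
                                verbose = FALSE,
                                trControl = ctrl)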
Hope it helps.
I have trained a model on a dataset with the rf method. For example:
ctrl <- trainControl(
method = "LGOCV",
repeats = 3,
savePred=TRUE,
verboseIter = TRUE,
preProcOptions = list(thresh = 0.95)
)
preProcessInTrain<-c("center", "scale")
metric_used<-"Accuracy"
model <- train(
Output ~ ., data = training,
method = "rf",
trControl = ctrl,
metric=metric_used,
tuneLength = 10,
preProc = preProcessInTrain
)
After that, I want to plot the decision tree, but when I write plot(model), I don't get a tree (plot omitted).
If I write plot(model$finalModel), I don't get a tree either (plot omitted).
I would like to plot the decision tree...
How can I do that?
Thanks :)
The model you are using is a random forest, which is not a single decision tree but an ensemble of a large number of trees. Plotting the final model plots the error rates as the number of trees is increased, which is what you are seeing, not the structure of a tree.
If you want a single decision tree instead, you may like to train a CART model like the following:
model <- train(
Species ~ ., data = training,
method = "rpart",
trControl = ctrl,
metric=metric_used,
tuneLength = 10,
preProc = preProcessInTrain
)
library(rpart.plot)
rpart.plot(model$finalModel)
Now plotting the final model as above will plot the decision tree for you.
I've been dealing with some extremely imbalanced data and I would like to use stratified sampling to create more balanced random forests.
Right now, I'm using the caret package, mainly for tuning the random forests.
So I try to set up a tuneGrid to pass the mtry and sampsize parameters into the caret train method as follows.
mtryGrid <- data.frame(.mtry = 100),.sampsize=80)
rfTune<- train(x = trainX,
y = trainY,
method = "rf",
trControl = ctrl,
metric = "Kappa",
ntree = 1000,
tuneGrid = mtryGrid,
importance = TRUE)
When I run this example, I get the following error
The tuning parameter grid should have columns mtry
I've come across discussions like this suggesting that passing these parameters in should be possible.
On the other hand, this page suggests that the only parameter that can be passed in is mtry
Can I even pass in sampsize into the random forests via caret?
It looks like there is a bracket issue with your mtryGrid. Alternatively, you can use expand.grid to give the different values of mtry you want to try.
By default the only parameter you can tune for a random forest is mtry. However, you can still pass the other parameters to train; they will have a fixed value and so won't be tuned by train. You can also ask for a stratified sample in train. Below is how I would do it, assuming that trainY is a boolean variable according to which you want to stratify your samples, and that you want samples of size 80 for each category:
mtryGrid <- expand.grid(mtry = 100) # you can put different values for mtry
rfTune<- train(x = trainX,
y = trainY,
method = "rf",
trControl = ctrl,
metric = "Kappa",
ntree = 1000,
tuneGrid = mtryGrid,
strata = factor(trainY),
sampsize = c(80, 80),
importance = TRUE)
I doubt one can directly pass sampsize and strata to train. But from here I believe the solution is to use the sampling option of trainControl(). That is,
mtryGrid <- data.frame(mtry = 100)
rfTune<- train(x = trainX,
y = trainY,
method = "rf",
trControl = trainControl(sampling=X),
metric = "Kappa",
ntree = 1000,
tuneGrid = mtryGrid,
importance = TRUE)
where X can be one of c("up","down","smote","rose").
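For example, a minimal sketch with down-sampling (assuming trainX and trainY from the question):
ctrl_down <- trainControl(method = "cv", number = 5, sampling = "down")
rfTune_down <- train(x = trainX,
                     y = trainY,
                     method = "rf",
                     trControl = ctrl_down,
                     metric = "Kappa",
                     ntree = 1000,
                     tuneGrid = data.frame(mtry = 100),
                     importance = TRUE)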
I'd like to know if it is possible to estimate the RMSE variance of the model trained in each K-fold on the test data.
I'm using the crantastic caret package.
For example:
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 20,
allowParallel = TRUE)
gbmGrid <- expand.grid(interaction.depth = seq(9, 11, by = 1),
n.trees = seq(700, 900, by = 25),
shrinkage = c(0.01, 0.025, 0.05))
set.seed(100)
gbmFit <- train(R ~ .,
data = trainSet,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE,
trControl = ctrl)
gbmFit$resample gives me the RMSE of each K-fold but I'd like to have the RMSE of the trained model of each K-fold relative to the test data. Is it possible?
To illustrate this further: based on the best model out of train, one can predict the values of the output variable for a test set:
predict(gbmFit, testSet)
However, it would be nice to do this for the i x k models (20 x 5 = 100) used in train and not only for the best model. For models which are very sensitive to the tuning parameters this is a necessity. Given the two best models out of CV, they can have very different accuracy values for the test set; perhaps the second-best model out of CV is the overall best model. See http://ai.stanford.edu/~ang/papers/cv-final.pdf
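As a rough sketch of what that could look like (not from the original thread; it refits the best-tuned parameters on each fold's training rows, assuming gbmFit$control$index holds the resampling indices caret used and that testSet has the same columns as trainSet):
fold_rmse <- sapply(gbmFit$control$index, function(idx) {
  fold_fit <- train(R ~ ., data = trainSet[idx, ],
                    method = "gbm",
                    tuneGrid = gbmFit$bestTune,
                    trControl = trainControl(method = "none"),
                    verbose = FALSE)
  pred <- predict(fold_fit, testSet)
  sqrt(mean((pred - testSet$R)^2))  # RMSE of this fold's model on the external test set
})
summary(fold_rmse)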