Plot decision tree in R (Caret) - r

I have trained a dataset with rf method. For example:
ctrl <- trainControl(
method = "LGOCV",
repeats = 3,
savePred=TRUE,
verboseIter = TRUE,
preProcOptions = list(thresh = 0.95)
)
preProcessInTrain<-c("center", "scale")
metric_used<-"Accuracy"
model <- train(
Output ~ ., data = training,
method = "rf",
trControl = ctrl,
metric=metric_used,
tuneLength = 10,
preProc = preProcessInTrain
)
After thath, I want to plot the decission tree, but when I wirte plot(model), I get this: plot(model).
If I write plot(model$finalModel), I get this : plot(model$finalModel)
I would like to plot the decission tree...
How can I do that?
Thanks :)

The model you are using is random forest, which is not a single decision tree, but an ensemble of a large number of trees. Plotting the final model will plot the error rates on the training and test datasets as # of trees are increased, something like the following.
If you want a single decision tree instead, you may like to train a CART model like the following:
model <- train(
Species ~ ., data = training,
method = "rpart",
trControl = ctrl,
metric=metric_used,
tuneLength = 10,
preProc = preProcessInTrain
)
library(rpart.plot)
rpart.plot(model$finalModel)
Now plotting the final model as above will plot the decision tree for you.

Related

How to build a model using tuned (existing) parameters in caret?

I am trying to build a SVM model using the caret package. After tuning the parameters, how can we build the model using the optimal parameters so we don't need to tune the parameters in the future when we use the model? Thanks.
library(caret)
data("mtcars")
set.seed(100)
mydata = mtcars[, -c(8,9)]
model_svmr <- train(
hp ~ .,
data = mydata,
tuneLength = 10,
method = "svmRadial",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 2,
verboseIter = TRUE
)
)
model_svmr$bestTune
The results show that sigma=0.1263203, C=4. How can we build a SVM model using the tuned parameters?
From this page in the caret package's documentation:
In cases where the model tuning values are known, train can be used to fit the model to the entire training set without any resampling or parameter tuning. Using the method = "none" option in trainControl can be used.
In your case, that would look like:
library(caret)
data("mtcars")
set.seed(100)
mydata2 <- mtcars[, -c(8, 9)]
model_svmr <- train(
hp ~ .,
data = mydata,
method = "svmRadial",
trControl = trainControl(method = "none"), # Telling caret not to re-tune
tuneGrid = data.frame(sigma=0.1263203, C=4) # Specifying the parameters
)
where we have removed any parameters relating to the tuning, namely tunelength, metric and preProcess.
Note that plot.train, resamples, confusionMatrix.train and several other functions will not work with this object but predict.train and others will.

How to check if Random Forest model performed better than chance? (T-Test)

I perform 15-fold cross validation on random forest. Like this:
trainControl = trainControl(method="cv", number = 15, search = "random", savePredictions = T)
tuningGrid <- expand.grid(mtry=c(2,4,6,8))
model = train(Classification~., data=data, method="rf", trControl = trainControl,
tuneGrid = tuningGrid, ntree = 25)
model
How can I use T-Test to show that the model performs better than chance based on the 15 accuracies?
I don't know what to put for x in t.test

How to compute AUC under ROC in R (caret, random forest , svm)

I am using random forest and support vector machine method in caret package in R. I want to calculate AUC under ROC for both cases; however, I do not know how to do it in this particular case. My outcome is coded as 0 and 1. Here is the example of code I am using :
set.seed(123)
cvCtrl <- trainControl(method = "cv", number = 10)
rf_moded<-train(readm30~.,data=train,method="rf", trControl=cvCtrl)
Do you want to train the model with ROC? Then you need the following:
For trainControl:
control <- trainControl(method = 'cv', number = 10,
savePredictions = 'final', classProbs = TRUE, summaryFunction = twoClassSummary)
And in train:
train(
outcome ~ .,
data = data,
method = method,
trControl = control,
metric = "ROC"
)

Plot ROC curve for bootstrapped caret model

I have a model like the following:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = T,
summaryFunction = twoClassSummary
)
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
How do I plot the ROC curve for this model? As I understand it, the probabilities must be saved (which I did in trainControl), but because of the random sampling which bootstrapping uses to generate a 'test' set, I am not sure how caret calculates the ROC value and how to generate a curve.
To isolate the class probabilities for the best performing parameters, I am doing:
for (a in 1:length(model$bestTune))
{model$pred <-
model$pred[model$pred[, paste(colnames(model$bestTune)[a])] == model$bestTune[1, a], ]}
Please advise.
Thanks!
First an explanation:
If you are not going to check how each possible hyper parameter combination predicted on each sample in each re-sample you can set savePredictions = "final" in trainControl to save space:
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = "final",
summaryFunction = twoClassSummary
)
after running the model:
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
results of interest are in model$pred
here you can check how many samples were tested in each re-sample (I set 25 repetitions)
nrow(model$pred[model$pred$Resample == "Resample01",])
#83
caret always provides prediction from rows not used in the model build.
nrow(my_data) #208
83/208 makes sense for the test samples for boot632
Now to build the ROC curve. You may opt for several options here:
-average the probability for each sample and use that (this is usual for CV since you have all samples repeated the same number of times, but it can be done with boot also).
-plot all as is without averaging
-plot ROC for each re-sample.
I will show you the second approach:
Create a data frame of class probabilities and true outcomes:
for_lift = data.frame(Class = model$pred$obs, xgbTree = model$pred$R)
plot ROC:
pROC::plot.roc(pROC::roc(response = for_lift$Class,
predictor = for_lift$xgbTree,
levels = c("M", "R")),
lwd=1.5)
You can also do this with ggplot, to do so I find it easiest to make a lift object using caret function lift
lift_obj = lift(Class ~ xgbTree, data = for_lift, class = "R")
specify which class the probability was used ^.
library(ggplot2)
ggplot(lift_obj$data)+
geom_line(aes(1-Sp , Sn, color = liftModelVar))+
scale_color_discrete(guide = guide_legend(title = "method"))

Different results with randomForest() and caret's randomForest (method = "rf")

I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomForest() model using caret’s train() function for method="rf". Unfortunately, I haven’t been able to get matching results, and I’m wondering what I’m overlooking.
I’ll also add that given that randomForest uses bootstrapping to generate samples to fit each of the ntrees, and estimates error based on out-of-bag predictions, I’m a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl function call. These options generate different results, but neither matches the randomForest() model.
Although I’ve read the caret Package website (http://topepo.github.io/caret/index.html), as well as various StackOverflow questions that seem potentially relevant, but I haven’t been able to figure out why the caret method = "rf" model produces different results from randomForest(). Thank you very much for any insight you might be able to offer.
Here’s a replicable example, using the CO2 dataset from the MASS package.
library(MASS)
data(CO2)
library(randomForest)
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry=2,
importance=TRUE,
metric="RMSE")
library(caret)
set.seed(1)
caret.oob.model <- train(uptake ~ .,
data = CO2,
method="rf",
ntree=50,
tuneGrid=data.frame(mtry=2),
nodesize = 5,
importance=TRUE,
metric="RMSE",
trControl = trainControl(method="oob"),
allowParallel=FALSE)
set.seed(1)
caret.boot.model <- train(uptake ~ .,
data = CO2,
method="rf",
ntree=50,
tuneGrid=data.frame(mtry=2),
nodesize = 5,
importance=TRUE,
metric="RMSE",
trControl=trainControl(method="boot", number=50),
allowParallel=FALSE)
print(rf.model)
print(caret.oob.model$finalModel)
print(caret.boot.model$finalModel)
Produces the following:
print(rf.model)
Mean of squared residuals: 9.380421
% Var explained: 91.88
print(caret.oob.model$finalModel)
Mean of squared residuals: 38.3598
% Var explained: 66.81
print(caret.boot.model$finalModel)
Mean of squared residuals: 42.56646
% Var explained: 63.16
And the code to look at variable importance:
importance(rf.model)
importance(caret.oob.model$finalModel)
importance(caret.boot.model$finalModel)
Using formula interface in train converts factors to dummy. To compare results from caret with randomForest you should use the non-formula interface.
In your case, you should provide a seed inside trainControl to get the same result as in randomForest.
Section training in caret webpage, there are some notes on reproducibility where it explains how to use seeds.
library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry = 2,
importance = TRUE,
metric = "RMSE")
library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "oob", seed = 1),
allowParallel = FALSE)
If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.
In the following example, the last seed is for the final model and I set it to 1.
seeds <- as.vector(c(1:26), mode = "list")
# For the final model
seeds[[26]] <- 1
caret.boot.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "boot", seeds = seeds),
allowParallel = FALSE)
Definig correctly the non-formula interface with caret and seed in trainControl you will get the same results in all three models:
rf.model
caret.oob.model$final
caret.boot.model$final

Resources