The preferred way to train well-known ML models in R is to use the caret package and its generic train function. My question: what is the relationship between the tuneGrid and trControl parameters? They are clearly related, but I can't work out how from the documentation. For example:
library(caret)
# train and choose best model using cross validation
df <- ... # contains input data
control <- trainControl(method = "cv", number = 10, p = .9, allowParallel = TRUE)
fit <- train(y ~ ., method = "knn",
data = df,
tuneGrid = data.frame(k = seq(9, 71, 2)),
trControl = control)
If I run the code above, what happens? How are the 10 CV folds, each containing 90% of the data as per the trainControl definition, combined with the 32 levels of k?
More concretely:
I have 32 levels for the parameter k.
I also have 10 CV folds.
Is the k-nearest neighbors model trained 32 * 10 times, or some other number of times?
Yes, you are correct. The training data are partitioned into 10 sets, say 1..10. Starting with set 1, the model is trained on sets 2..10 (90% of the training data) and tested on set 1. This is repeated for set 2, set 3, and so on, 10 times in total. With 32 values of k to test, that makes 32 * 10 = 320 fits during resampling. After that, train refits the model once more on the full training set using the best k.
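The count can be reproduced with a few lines of base R. This is only an arithmetic sketch: the variable names are illustrative, and the extra final fit reflects the fact that train refits the selected model on all of the training data once resampling is done.

```r
# Tuning grid from the question: odd values of k from 9 to 71
k_grid <- data.frame(k = seq(9, 71, 2))
n_folds <- 10  # number = 10 in trainControl

# One model fit per (fold, k) pair during resampling
n_resample_fits <- nrow(k_grid) * n_folds

# Plus one final fit on the full training set with the selected k
n_total_fits <- n_resample_fits + 1

cat("grid size:", nrow(k_grid), "\n")
cat("resampling fits:", n_resample_fits, "\n")
cat("total fits:", n_total_fits, "\n")
```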
You can also pull out these CV results using the returnResamp argument of trainControl. I simplify to 3-fold CV and 4 values of k below:
df <- mtcars
set.seed(100)
control <- trainControl(method = "cv", number = 3, p = .9,returnResamp="all")
fit <- train(mpg ~ ., method = "knn",
data = mtcars,
tuneGrid = data.frame(k = 2:5),
trControl = control)
resample_results = fit$resample
resample_results
RMSE Rsquared MAE k Resample
1 3.502321 0.7772086 2.483333 2 Fold1
2 3.807011 0.7636239 2.861111 3 Fold1
3 3.592665 0.8035741 2.697917 4 Fold1
4 3.682105 0.8486331 2.741667 5 Fold1
5 2.473611 0.8665093 1.995000 2 Fold2
6 2.673429 0.8128622 2.210000 3 Fold2
7 2.983224 0.7120910 2.645000 4 Fold2
8 2.998199 0.7207914 2.608000 5 Fold2
9 2.094039 0.9620830 1.610000 2 Fold3
10 2.551035 0.8717981 2.113333 3 Fold3
11 2.893192 0.8324555 2.482500 4 Fold3
12 2.806870 0.8700533 2.368333 5 Fold3
# we manually calculate the mean RMSE for each parameter
tapply(resample_results$RMSE,resample_results$k,mean)
2 3 4 5
2.689990 3.010492 3.156360 3.162392
# and we can see it corresponds to the final fit result
fit$results
k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 2 2.689990 0.8686003 2.029444 0.7286489 0.09245494 0.4376844
2 3 3.010492 0.8160947 2.394815 0.6925154 0.05415954 0.4067066
3 4 3.156360 0.7827069 2.608472 0.3805227 0.06283697 0.1122577
4 5 3.162392 0.8131593 2.572667 0.4601396 0.08070670 0.1891581
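The RMSESD column can be checked the same way: it appears to be the standard deviation of the per-fold RMSE values. The sketch below uses only base R and the numbers copied from the fit$resample output above; the data frame is retyped by hand, so it does not need caret to run.

```r
# Per-fold RMSE values copied from fit$resample above
resample_results <- data.frame(
  RMSE = c(3.502321, 3.807011, 3.592665, 3.682105,
           2.473611, 2.673429, 2.983224, 2.998199,
           2.094039, 2.551035, 2.893192, 2.806870),
  k = rep(2:5, times = 3),
  Resample = rep(c("Fold1", "Fold2", "Fold3"), each = 4)
)

# Mean and standard deviation of RMSE across the folds, one row per k
summary_by_k <- do.call(rbind, lapply(
  split(resample_results, resample_results$k),
  function(d) data.frame(k = d$k[1], RMSE = mean(d$RMSE), RMSESD = sd(d$RMSE))
))
print(summary_by_k)
```

Up to rounding, the RMSE and RMSESD columns agree with the corresponding columns of fit$results.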
I am studying the bagging chapter at https://bradleyboehmke.github.io/HOML/bagging.html
I am going to use the train() function with cross-validation for bagging, something like the code below.
As far as I understand, nbagg = 200 tells R to try 200 trees, calculate the RMSE for each, and return the number of trees (here 80) for which the best RMSE is achieved.
Now, how can I see what RMSE the other nbagg values produced in this model, like the RMSE vs. number of trees plot on that website (before it introduces the CV method and the train() function)?
ames_bag2 <- train(
Sale_Price ~ .,
data = ames_train,
method = "treebag",
trControl = trainControl(method = "cv", number = 10),
nbagg = 200,
control = rpart.control(minsplit = 2, cp = 0)
)
ames_bag2
## Bagged CART
##
## 2054 samples
## 80 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1848, 1848, 1849, 1849, 1847, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 26957.06 0.8900689 16713.14
As the example you shared is not completely reproducible, I have taken a different example from the mtcars dataset to illustrate how you can do it. You can extend that for your data.
Note: The RMSE shown here is the average of 10 RMSEs, since the CV number is 10, so that is the only value we store. I have also added the relevant libraries to the example and set the maximum number of trees to 15, just for illustration.
library(ipred)
library(caret)
library(rpart)
library(dplyr)
data("mtcars")
n_trees <-1
error_df <- data.frame()
while (n_trees <= 15) {
ames_bag2 <- train(
mpg ~.,
data = mtcars,
method = "treebag",
trControl = trainControl(method = "cv", number = 10),
nbagg = n_trees,
control = rpart.control(minsplit = 2, cp = 0)
)
error_df %>%
bind_rows(data.frame(trees=n_trees, rmse=mean(ames_bag2[["resample"]]$RMSE)))-> error_df
n_trees <- n_trees+1
}
error_df will show the output.
> error_df
trees rmse
1 1 2.493117
2 2 3.052958
3 3 2.052801
4 4 2.239841
5 5 2.500279
6 6 2.700347
7 7 2.642525
8 8 2.497162
9 9 2.263527
10 10 2.379366
11 11 2.447560
12 12 2.314433
13 13 2.423648
14 14 2.192112
15 15 2.256778
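To get the RMSE vs. number of trees picture asked for, the stored values can be plotted directly. The sketch below retypes the error_df output above so that it runs on its own with base R. Note that each train() call in the loop draws its own folds and bootstrap samples, so part of the variation between neighbouring points is resampling noise rather than an effect of the number of trees.

```r
# RMSE values copied from the error_df output above
error_df <- data.frame(
  trees = 1:15,
  rmse = c(2.493117, 3.052958, 2.052801, 2.239841, 2.500279,
           2.700347, 2.642525, 2.497162, 2.263527, 2.379366,
           2.447560, 2.314433, 2.423648, 2.192112, 2.256778)
)

# Cross-validated RMSE against the number of bagged trees
plot(error_df$trees, error_df$rmse, type = "b",
     xlab = "Number of trees (nbagg)", ylab = "CV RMSE")

# Row with the lowest cross-validated RMSE
best <- error_df[which.min(error_df$rmse), ]
print(best)
```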
I'm a beginner trying to learn some basic machine learning techniques.
I want to use leave-one-out cross-validation and the train() function to train a model. My function seems to work as it should. However, I'm not able to see the model's test-set predictions. How would you do this given the following model?
# Create custom trainControl: myControl
myControl <- trainControl(
method = "loocv",
verboseIter = TRUE
)
# Fit glmnet model: model
model <- train(
y ~ .,
data,
method = "glmnet",
trControl = myControl,
preProcess = c("center", "scale", "pca")
)
You can set savePredictions=TRUE in trainControl:
myControl <- trainControl(
method = "loocv",
savePredictions=TRUE
)
model <- train(
mpg ~ .,
data = mtcars,  # assumption: the answer's data appears to be mtcars (mpg outcome, obs of 21 below)
method = "glmnet",
trControl = myControl,
preProcess = c("center", "scale", "pca"),
tuneGrid = expand.grid(alpha = c(0.1,0.01),lambda = c(0.1,0.01))
)
You can look at the predictions for each parameter combination in model$pred:
pred obs rowIndex alpha lambda Resample
1 22.56265 21 1 0.10 0.10 Fold01
2 22.59835 21 1 0.10 0.01 Fold01
3 22.57767 21 1 0.01 0.10 Fold01
4 22.59717 21 1 0.01 0.01 Fold01
5 22.12174 21 2 0.10 0.10 Fold02
6 22.14886 21 2 0.10 0.01 Fold02
7 22.13080 21 2 0.01 0.10 Fold02
8 22.14667 21 2 0.01 0.01 Fold02
I tested 4 combinations of lambda and alpha, so for each left-out observation you can see above its prediction under every combination.
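From a table like this, an error per parameter combination can be computed by grouping on alpha and lambda. The sketch below is base R only and uses just the eight rows printed above (the first two left-out observations), so the resulting numbers are illustrative and not the model's actual LOOCV error. The name pred_df is a stand-in for model$pred.

```r
# The eight rows of model$pred shown above (first two left-out observations)
pred_df <- data.frame(
  pred = c(22.56265, 22.59835, 22.57767, 22.59717,
           22.12174, 22.14886, 22.13080, 22.14667),
  obs = 21,
  rowIndex = rep(1:2, each = 4),
  alpha = rep(c(0.10, 0.10, 0.01, 0.01), times = 2),
  lambda = rep(c(0.10, 0.01), times = 4)
)

# Squared error per row, then RMSE per (alpha, lambda) combination
pred_df$sq_err <- (pred_df$pred - pred_df$obs)^2
rmse_by_combo <- aggregate(sq_err ~ alpha + lambda, data = pred_df,
                           FUN = function(x) sqrt(mean(x)))
names(rmse_by_combo)[3] <- "RMSE"
print(rmse_by_combo)
```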
Answering my own follow-up question in case anyone is interested:
myControl <- trainControl(
method = "loocv",
savePredictions = "final"
)
model <- train(
y ~ .,
data,
method = "glmnet",
trControl = myControl,
preProcess = c("center", "scale", "pca")
)
data$pred <- model$pred[ , "pred"]
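The assignment above relies on the rows of model$pred being in the same order as the rows of data. The rowIndex column records which row each prediction belongs to, so indexing by it is a safer way to attach the predictions. A minimal sketch with hypothetical stand-ins (pred_table for model$pred, dat for the original data):

```r
# Hypothetical stand-in for model$pred, deliberately out of order
pred_table <- data.frame(
  pred = c(30.1, 10.2, 20.3),
  obs = c(30, 10, 20),
  rowIndex = c(3, 1, 2)
)

# Hypothetical stand-in for the original data
dat <- data.frame(y = c(10, 20, 30))

# Place each prediction in the row it was made for
dat$pred <- NA_real_
dat$pred[pred_table$rowIndex] <- pred_table$pred
print(dat)
```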
Using cross validation in model tuning, I get different error rates from caret::train's results object and calculating the error myself on its pred object. I'd like to understand why they differ, and ideally how to use out-of-fold error rates for model selection, plotting model performance, etc.
The pred object contains out-of-fold predictions. The docs are pretty clear that trainControl(..., savePredictions = "final") saves out-of-fold predictions for the best hyperparameter values: "an indicator of how much of the hold-out predictions for each resample should be saved... "final" saves the predictions for the optimal tuning parameters." (Keeping "all" predictions and then filtering to the best tuning values doesn't resolve the issue.)
The train docs say that the results object is "a data frame the training error rate..." I'm not sure what that means, but the values for the best row are consistently different from the metrics calculated on pred. Why do they differ and how can I make them line up?
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
m
#> Random Forest
#>
#> 50 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 38, 36, 38, 38
#> Resampling results across tuning parameters:
#>
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 1 2 maxstat 0.5981673 0.6724245 0.4993722
#> 3 1 extratrees 0.5861116 0.7010012 0.4938035
#> 4 2 maxstat 0.6017491 0.6661093 0.4999057
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#> extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394
Created on 2018-04-09 by the reprex package (v0.2.0).
The RMSE for cross validation is not calculated the way you show, but rather for each fold and then averaged. Full example:
set.seed(1)
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
set.seed(1)
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#output
Random Forest
50 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 37, 38, 37, 38
Resampling results across tuning parameters:
min.node.size mtry splitrule RMSE Rsquared MAE
8 1 extratrees 0.6106390 0.4360609 0.4926629
12 2 extratrees 0.6156636 0.4294237 0.4954481
19 2 variance 0.6472539 0.3889372 0.5217369
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.
RMSE for best model is 0.6106390
Now calculate the RMSE for each fold and average:
library(dplyr)  # for %>%, group_by, summarise, pull
m$pred %>%
group_by(Resample) %>%
summarise(rmse = caret::RMSE(pred, obs)) %>%  # one RMSE per fold
pull(rmse) %>%
mean()
#output
0.610639
m$pred %>%
group_by(Resample) %>%
summarise(rmse = MLmetrics::RMSE(pred, obs)) %>%  # one RMSE per fold
pull(rmse) %>%
mean()
#output
0.610639
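The reason the two numbers differ is that the square root is not linear: averaging per-fold RMSEs is not the same as taking one RMSE over all pooled predictions. A toy base R illustration with made-up predictions (the data frame and the rmse helper are hypothetical, chosen so the arithmetic is easy to follow):

```r
# Made-up out-of-fold predictions: small errors in Fold1, larger ones in Fold2
preds <- data.frame(
  Resample = c("Fold1", "Fold1", "Fold2", "Fold2"),
  obs = c(0, 0, 0, 0),
  pred = c(1, -1, 3, -3)
)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# What train reports: RMSE per fold, then the mean of those values
per_fold <- sapply(split(preds, preds$Resample),
                   function(d) rmse(d$pred, d$obs))
mean_of_folds <- mean(per_fold)

# What a single call on m$pred computes: one RMSE over all rows
pooled <- rmse(preds$pred, preds$obs)

cat("mean of per-fold RMSE:", mean_of_folds, "\n")
cat("pooled RMSE:", pooled, "\n")
```

Here the per-fold RMSEs are 1 and 3, so their mean is 2, while the pooled RMSE is sqrt(5).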
I get different results. This is apparently a random process.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
[1] 0.5824464
> MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
[1] 0.5271595
If you want a random (more accurately, a pseudo-random) process to be reproducible, use set.seed immediately prior to the call.
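A minimal base R illustration of the point: resetting the seed right before each call makes the random draws identical, while a call made without resetting it continues the random stream and gives different values. The seed value 42 is arbitrary.

```r
# Same seed immediately before each call: identical draws
set.seed(42)
a <- rnorm(3)
set.seed(42)
b <- rnorm(3)

# No reset before this call: the stream continues and the draws differ
d <- rnorm(3)

print(identical(a, b))
print(identical(a, d))
```

The same applies to caret: the fold assignment is drawn when train is called, so set.seed has to come right before that call.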
While using the caret package for model tuning I ran into this strange behavior: given a specific combination of tuning parameters T*, the value of the metric (here Cohen's Kappa, K) associated with T* changes depending on whether T* is evaluated alone or as part of a grid of possible combinations. In the practical example that follows, caret is used to interface with the gbm package.
# Load libraries and data
library (caret)
data<-read.csv("mydata.csv")
data$target<-as.factor(data$target)
# data are available at https://www.dropbox.com/s/1bglmqd14g840j1/mydata.csv?dl=0
Procedure 1: T* evaluated alone
#Define 5-fold cv as validation settings
fitControl <- trainControl(method = "cv",number = 5)
# Define the combination of tuning parameter for this example T*
gbmGrid <- expand.grid(.interaction.depth = 1,
.n.trees = 1000,
.shrinkage = 0.1, .n.minobsinnode=1)
# Fit a gbm with T* as model parameters and K as scoring metric.
set.seed(825)
gbmFit1 <- train(target ~ ., data = data,
method = "gbm",
distribution="adaboost",
trControl = fitControl,
tuneGrid=gbmGrid,
verbose=F,
metric="Kappa")
# The results show that T* is associated with Kappa = 0.47. Remember this result and the confusion matrix.
testPred<-predict(gbmFit1, newdata = data)
confusionMatrix(testPred, data$target)
# output selection
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 832 34
1 0 16
Kappa : 0.4703
Procedure 2: T* evaluated along with other tuning profiles
Here everything is the same as in procedure 1 except for the fact that several combinations of tuning parameters {T} are considered:
# Notice that the original T* is included in {T}!!
gbmGrid2 <- expand.grid(.interaction.depth = 1,
.n.trees = seq(100,1000,by=100),
.shrinkage = 0.1, .n.minobsinnode=1)
# Fit the gbm
set.seed(825)
gbmFit2 <- train(target ~ ., data = data,
method = "gbm",
distribution="adaboost",
trControl = fitControl,
tuneGrid=gbmGrid2,
verbose=F,
metric="Kappa")
# Caret should pick the model with the highest Kappa.
# Since T* is in {T} I would expect the best model to have K >= 0.47
testPred<-predict(gbmFit2, newdata = data)
confusionMatrix(testPred, data$target)
# output selection
Reference
Prediction 0 1
0 831 47
1 1 3
Kappa : 0.1036
The results are inconsistent with my expectations: the best model in {T} scores K = 0.10. How is that possible, given that T* has K = 0.47 and is included in {T}? Additionally, according to the plot, K for T* as evaluated in Procedure 2 is now around 0.01. Any idea what is going on? Am I missing something?
I am getting consistent resampling results from your data and code.
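The first model has a resampled Kappa of 0.00943 (the 0.47 from your confusionMatrix call is computed by predicting the training data itself, not the held-out folds):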
gbmFit1$results
interaction.depth n.trees shrinkage n.minobsinnode Accuracy Kappa AccuracySD KappaSD
1 1 1000 0.1 1 0.9331022 0.009430576 0.004819004 0.0589132
The second model has the same results for n.trees = 1000
gbmFit2$results
shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa AccuracySD KappaSD
1 0.1 1 1 100 0.9421803 -0.002075765 0.002422952 0.004641553
2 0.1 1 1 200 0.9387776 -0.008326896 0.002468351 0.004654972
3 0.1 1 1 300 0.9365049 -0.012187900 0.002625886 0.003978702
4 0.1 1 1 400 0.9353749 -0.013950906 0.003077431 0.004837097
5 0.1 1 1 500 0.9353685 -0.013961221 0.003244201 0.004878259
6 0.1 1 1 600 0.9342322 -0.015486214 0.005202656 0.007469843
7 0.1 1 1 700 0.9319658 -0.018574633 0.007033402 0.009470466
8 0.1 1 1 800 0.9319658 -0.018574633 0.007033402 0.009470466
9 0.1 1 1 900 0.9342386 0.010955568 0.003144850 0.057825336
10 0.1 1 1 1000 0.9331022 0.009430576 0.004819004 0.058913202
Note that the best model in your second run has n.trees = 900
gbmFit2$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
9 900 1 0.1 1
Since train picks the "best" model based on your metric, your second prediction is using a different model (n.trees of 900 instead of 1000).
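The selection step can be mimicked on the results table itself. The sketch below retypes the n.trees and Kappa columns of gbmFit2$results from above and picks the row with the largest resampled Kappa. It is a base R illustration of "largest value of the metric wins", not caret's internal code.

```r
# n.trees and resampled Kappa copied from gbmFit2$results above
res <- data.frame(
  n.trees = seq(100, 1000, by = 100),
  Kappa = c(-0.002075765, -0.008326896, -0.012187900, -0.013950906,
            -0.013961221, -0.015486214, -0.018574633, -0.018574633,
            0.010955568, 0.009430576)
)

# Row with the largest cross-validated Kappa
best <- res[which.max(res$Kappa), ]
print(best)
```

The row for n.trees = 900 wins narrowly over n.trees = 1000, which matches gbmFit2$bestTune.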
I have used caret package's train function with 10-fold cross validation. I also have got class probabilities for predicted classes by setting classProbs = TRUE in trControl, as follows:
myTrainingControl <- trainControl(method = "cv",
number = 10,
savePredictions = TRUE,
classProbs = TRUE,
verboseIter = TRUE)
randomForestFit = train(x = input[3:154],
y = as.factor(input$Target),
method = "rf",
trControl = myTrainingControl,
preProcess = c("center","scale"),
ntree = 50)
The output predictions I am getting are as follows.
pred obs 0 1 rowIndex mtry Resample
1 0 1 0.52 0.48 28 12 Fold01
2 0 0 0.58 0.42 43 12 Fold01
3 0 1 0.58 0.42 51 12 Fold01
4 0 0 0.68 0.32 55 12 Fold01
5 0 0 0.62 0.38 59 12 Fold01
6 0 1 0.92 0.08 71 12 Fold01
Now I want to calculate the ROC curve and the area under it (AUC) using this data. How would I achieve this?
A sample example for AUC:
rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 10001, proximity=TRUE, sampsize=sampsizes)
library(ROCR)
predictions=as.vector(rf_output$votes[,2])
pred=prediction(predictions,target)
perf_AUC=performance(pred,"auc") #Calculate the AUC value
AUC=perf_AUC@y.values[[1]]  # y.values is an S4 slot, so it is accessed with @
perf_ROC=performance(pred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE)))
or using pROC and caret
library(caret)
library(pROC)
data(iris)
iris <- iris[iris$Species == "virginica" | iris$Species == "versicolor", ]
iris$Species <- factor(iris$Species) # setosa should be removed from factor
samples <- sample(NROW(iris), NROW(iris) * .5)
data.train <- iris[samples, ]
data.test <- iris[-samples, ]
forest.model <- train(Species ~., data.train)
result.predicted.prob <- predict(forest.model, data.test, type="prob") # Prediction
result.roc <- roc(data.test$Species, result.predicted.prob$versicolor) # Draw ROC curve.
plot(result.roc, print.thres="best", print.thres.best.method="closest.topleft")
result.coords <- coords(result.roc, "best", best.method="closest.topleft", ret=c("threshold", "accuracy"))
print(result.coords)#to get threshold and accuracy
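For the out-of-fold predictions in the question, the AUC can also be computed by hand from the obs column and the probability column for class "1". The sketch below is base R only and uses just the six rows printed in the question, so the value is illustrative. It pools predictions across folds, which is one possible choice (computing the AUC per fold and averaging is another), and with a full pred table you would first filter to the selected mtry. The function name auc_rank is hypothetical.

```r
# The six out-of-fold rows shown in the question:
# observed class and predicted probability of class "1"
obs <- c(1, 0, 1, 0, 0, 1)
prob_1 <- c(0.48, 0.42, 0.42, 0.32, 0.38, 0.08)

# Rank-based AUC: the probability that a randomly chosen positive gets a
# higher score than a randomly chosen negative, with ties counted as one half
auc_rank <- function(score, label) {
  r <- rank(score)  # average ranks for tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(prob_1, obs)
print(auc)
```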
Update 2019. This is what MLeval was written for (https://cran.r-project.org/web/packages/MLeval/index.html). It works with the caret train output object to make ROC curves, PR curves, and calibration curves, and to calculate metrics such as ROC-AUC, sensitivity, and specificity. It does all of this in one line, which is helpful for my analyses and may be of interest.
library(caret)
library(MLeval)
data(Sonar, package = "mlbench")  # the Sonar data set ships with the mlbench package
myTrainingControl <- trainControl(method = "cv",
number = 10,
savePredictions = TRUE,
classProbs = TRUE,
verboseIter = TRUE)
randomForestFit = train(x = Sonar[,1:60],
y = as.factor(Sonar$Class),
method = "rf",
trControl = myTrainingControl,
preProcess = c("center","scale"),
ntree = 50)
##
x <- evalm(randomForestFit)
## get roc curve plotted in ggplot2
x$roc
## get AUC and other metrics
x$stdres