Out-of-fold vs training error in caret - r

Using cross-validation for model tuning, I get different error rates from caret::train's results object and from calculating the error myself on its pred object. I'd like to understand why they differ, and ideally how to use out-of-fold error rates for model selection, plotting model performance, etc.
The pred object contains out-of-fold predictions. The docs are pretty clear that trainControl(..., savePredictions = "final") saves out-of-fold predictions for the best hyperparameter values: "an indicator of how much of the hold-out predictions for each resample should be saved... "final" saves the predictions for the optimal tuning parameters." (Keeping "all" predictions and then filtering to the best tuning values doesn't resolve the issue.)
The train docs say that the results object is "a data frame the training error rate..." I'm not sure what that means, but the values for the best row are consistently different from the metrics calculated on pred. Why do they differ and how can I make them line up?
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
                                     number = 4,
                                     search = "random",
                                     savePredictions = "final")
m <- caret::train(x = d[, -1],
                  y = d$y,
                  method = "ranger",
                  trControl = train_control,
                  tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
m
#> Random Forest
#>
#> 50 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 38, 36, 38, 38
#> Resampling results across tuning parameters:
#>
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 1 2 maxstat 0.5981673 0.6724245 0.4993722
#> 3 1 extratrees 0.5861116 0.7010012 0.4938035
#> 4 2 maxstat 0.6017491 0.6661093 0.4999057
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#> extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394
Created on 2018-04-09 by the reprex package (v0.2.0).

The cross-validation RMSE is not calculated the way you show (pooling all of the out-of-fold predictions at once); it is calculated for each fold separately and then averaged. Full example:
set.seed(1)
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
                                     number = 4,
                                     search = "random",
                                     savePredictions = "final")
set.seed(1)
m <- caret::train(x = d[, -1],
                  y = d$y,
                  method = "ranger",
                  trControl = train_control,
                  tuneLength = 3)
#output
Random Forest
50 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 37, 38, 37, 38
Resampling results across tuning parameters:
min.node.size mtry splitrule RMSE Rsquared MAE
8 1 extratrees 0.6106390 0.4360609 0.4926629
12 2 extratrees 0.6156636 0.4294237 0.4954481
19 2 variance 0.6472539 0.3889372 0.5217369
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.
RMSE for best model is 0.6106390
Now calculate the RMSE for each fold and average:
library(dplyr)
m$pred %>%
  group_by(Resample) %>%
  mutate(rmse = caret::RMSE(pred, obs)) %>%
  summarise(mean = mean(rmse)) %>%
  pull(mean) %>%
  mean
#output
0.610639
m$pred %>%
  group_by(Resample) %>%
  mutate(rmse = MLmetrics::RMSE(pred, obs)) %>%
  summarise(mean = mean(rmse)) %>%
  pull(mean) %>%
  mean
#output
0.610639
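As a cross-check (not in the original answer): with caret's default returnResamp = "final", train() also keeps the per-fold metrics for the selected tuning parameters in m$resample, so the same average is available without regrouping m$pred:
m$resample             # per-fold RMSE / Rsquared / MAE for the chosen parameters
mean(m$resample$RMSE)  # should match the 0.6106390 reported in m$results above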

I get different results. This is apparently a random process.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
[1] 0.5824464
> MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
[1] 0.5271595
If you want a random (more accurately, a pseudo-random) process to be reproducible, use set.seed() immediately prior to the call.
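For example, a minimal sketch of where the seed goes, re-using the objects defined above:
# Seeding right before train() fixes the fold assignment, the random
# hyperparameter search, and the ranger fits, so a second run should
# reproduce the first one's metrics.
set.seed(1)
m2 <- caret::train(x = d[, -1], y = d$y, method = "ranger",
                   trControl = train_control, tuneLength = 3)
all.equal(m$results$RMSE, m2$results$RMSE)  # expected TRUE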

Related

Training, Tuning, Cross-Validating, and Testing Ranger (Random Forest) Quantile Regression Model? [closed]

Can someone share how to train, tune (hyperparameters), cross-validate, and test a ranger quantile regression model, along with error evaluation? With the iris or Boston housing dataset?
The reason I ask is that I have not been able to find many examples or walkthroughs of quantile regression on Kaggle, blogs, or YouTube; most problems I encounter are classification problems.
I am currently using a quantile regression model, but I am hoping to see other examples, in particular with hyperparameter tuning.
There are a lot of parameters for this function. Since this isn't a forum for what they all mean, I really suggest that you hit up Cross Validated with questions on the how and why. (Or look for questions that may already be answered.)
library(tidyverse)
library(ranger)
library(caret)
library(funModeling)
data(iris)
#----------- setup data -----------
# this doesn't include exploration or cleaning which are both necessary
summary(iris)
df_status(iris)
#----------------- create training sample ----------------
set.seed(395280469) # for replicability
# create training sample partition (80/20 split)
tr <- createDataPartition(iris$Species,
                          p = .8,
                          list = F)
There are a lot of ways to split the data, but I tend to prefer caret, because it works to even out factor levels if that's what you feed it.
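As a quick illustrative check (not part of the original code), you can confirm that createDataPartition kept the Species proportions roughly equal on both sides of the split:
# proportions of each Species in the training rows and the hold-out rows
prop.table(table(iris$Species[tr]))
prop.table(table(iris$Species[-tr]))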
#--------- First model ---------
fit.r <- ranger(Sepal.Length ~ .,
                data = iris[tr, ],
                write.forest = TRUE,
                importance = 'permutation',
                quantreg = TRUE,
                keep.inbag = TRUE,
                replace = FALSE)
fit.r
# Ranger result
#
# Call:
# ranger(Sepal.Length ~ ., data = iris[tr, ], write.forest = TRUE,
# importance = "permutation", quantreg = TRUE, keep.inbag = TRUE,
# replace = FALSE)
#
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 4
# Mtry: 2
# Target node size: 5
# Variable importance mode: permutation
# Splitrule: variance
# OOB prediction error (MSE): 0.1199364
# R squared (OOB): 0.8336928
p.r <- predict(fit.r, iris[-tr, -1],
type = 'quantiles')
It defaults to .1, .5, and .9:
postResample(p.r$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.5165946 0.7659124 0.4036667
postResample(p.r$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3750556 0.7587326 0.3133333
postResample(p.r$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6488991 0.7461830 0.5703333
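If you need quantiles other than the defaults, predict.ranger() takes a quantiles argument; a minimal sketch:
# ask for the quartiles instead of the default c(.1, .5, .9)
p.q <- predict(fit.r, iris[-tr, -1],
               type = 'quantiles',
               quantiles = c(0.25, 0.5, 0.75))
head(p.q$predictions)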
To see what this looks like in practice:
# this performance is the best so far, let's see what it looks like visually
ggplot(data.frame(p.Q1 = p.r$predictions[, 1],
                  p.Q5 = p.r$predictions[, 2],
                  p.Q9 = p.r$predictions[, 3],
                  Actual = iris[-tr, 1])) +
  geom_point(aes(x = Actual, y = p.Q1, color = "P.Q1")) +
  geom_point(aes(x = Actual, y = p.Q5, color = "P.Q5")) +
  geom_point(aes(x = Actual, y = p.Q9, color = "P.Q9")) +
  geom_line(aes(Actual, Actual, color = "Actual")) +
  scale_color_viridis_d(end = .8, "Error",
                        direction = -1) +
  theme_bw()
# zoom in on one quantile (here the .9 quantile) to see the individual errors
ggplot(data.frame(p.Q9 = p.r$predictions[, 3],
                  Actual = iris[-tr, 1])) +
  geom_point(aes(x = Actual, y = p.Q9, color = "P.Q9")) +
  geom_segment(aes(x = Actual, xend = Actual,
                   y = Actual, yend = p.Q9)) +
  geom_line(aes(Actual, Actual, color = "Actual")) +
  scale_color_viridis_d(end = .8, "Error",
                        direction = -1) +
  theme_bw()
#------------ ranger model with options --------------
# last call used default
# splitrule: variance, use "extratrees" (only 2 for this one)
# mtry = 2, use 3 this time
# min.node.size = 5, using 6 this time
# using num.threads = 15 ** this is the number of cores on YOUR device
# change accordingly --- if you don't know, drop this one
set.seed(326)
fit.r2 <- ranger(Sepal.Length ~ .,
                 data = iris[tr, ],
                 write.forest = TRUE,
                 importance = 'permutation',
                 quantreg = TRUE,
                 keep.inbag = TRUE,
                 replace = FALSE,
                 splitrule = "extratrees",
                 mtry = 3,
                 min.node.size = 6,
                 num.threads = 15)
fit.r2
# Ranger result
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 4
# Mtry: 3
# Target node size: 6
# Variable importance mode: permutation
# Splitrule: extratrees
# Number of random splits: 1
# OOB prediction error (MSE): 0.1107299
# R squared (OOB): 0.8464588
This model performed similarly.
p.r2 <- predict(fit.r2, iris[-tr, -1],
type = 'quantiles')
postResample(p.r2$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.4932883 0.8144309 0.4000000
postResample(p.r2$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3610171 0.7643744 0.3100000
postResample(p.r2$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6555939 0.8141144 0.5603333
The prediction was pretty similar overall, as well.
This isn't a very large data set, and there are only a few predictors.
How much do they contribute?
importance(fit.r2)
# Sepal.Width Petal.Length Petal.Width Species
# 0.06138883 0.71052453 0.22956522 0.18082998
#------------ ranger model with options --------------
# drop a predictor, lower mtry, min.node.size
set.seed(326)
fit.r3 <- ranger(Sepal.Length ~ .,
                 data = iris[tr, -4], # dropped Petal.Width
                 write.forest = TRUE,
                 importance = 'permutation',
                 quantreg = TRUE,
                 keep.inbag = TRUE,
                 replace = FALSE,
                 splitrule = "extratrees",
                 mtry = 2, # has to change (var count lower)
                 min.node.size = 4, # lowered
                 num.threads = 15)
fit.r3
# Ranger result
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 3
# Mtry: 2
# Target node size: 6
# Variable importance mode: permutation
# Splitrule: extratrees
# Number of random splits: 1
# OOB prediction error (MSE): 0.1050143
# R squared (OOB): 0.8543842
The second most important predictor was removed, and the OOB error improved.
p.r3 <- predict(fit.r3, iris[-tr, -c(1, 4)],
type = 'quantiles')
postResample(p.r3$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.4760952 0.8089810 0.3800000
postResample(p.r3$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3738315 0.7769388 0.3250000
postResample(p.r3$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6085584 0.8032592 0.5170000
importance(fit.r3)
# almost everything relies on Petal.Length
# Sepal.Width Petal.Length Species
# 0.08008264 0.95440333 0.32570147
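The question also asked about tuning with cross-validation, and the models above were tuned by hand. A hedged sketch of running the same kind of search through caret::train (assuming, per caret's documentation for method = "ranger", that mtry, splitrule and min.node.size are the tunable parameters and that extra arguments such as quantreg and importance are passed through to ranger()):
#------------ cross-validated tuning via caret (sketch) --------------
set.seed(326)
tune_grid <- expand.grid(mtry = 2:4,
                         splitrule = c("variance", "extratrees"),
                         min.node.size = c(4, 6, 8))
fit.cv <- train(Sepal.Length ~ .,
                data = iris[tr, ],
                method = "ranger",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = tune_grid,
                importance = "permutation",  # passed through to ranger()
                quantreg = TRUE,             # keep quantile forests
                keep.inbag = TRUE)
fit.cv$bestTune  # hyperparameters chosen by cross-validated RMSE
fit.cv$results   # RMSE / Rsquared / MAE for every grid point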

how to plot RMSE vs number of trees tries in bagging when using train() and cross validation in r

I am studying this page about the bagging method: https://bradleyboehmke.github.io/HOML/bagging.html
I am going to use the train() function with cross-validation for bagging, something like below.
As far as I understand, nbagg = 200 tells R to try 200 trees, calculate the RMSE for each, and return the number of trees (here 80) for which the best RMSE is achieved.
Now how can I see what RMSE other nbagg values produce in this model, like the RMSE vs. number-of-trees plot on that website (before introducing the CV method and the train() function)?
ames_bag2 <- train(
  Sale_Price ~ .,
  data = ames_train,
  method = "treebag",
  trControl = trainControl(method = "cv", number = 10),
  nbagg = 200,
  control = rpart.control(minsplit = 2, cp = 0)
)
ames_bag2
## Bagged CART
##
## 2054 samples
## 80 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1848, 1848, 1849, 1849, 1847, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 26957.06 0.8900689 16713.14
As the example you shared is not completely reproducible, I have taken a different example from the mtcars dataset to illustrate how you can do it. You can extend that for your data.
Note: the RMSE shown here is the average of 10 fold-level RMSEs, since the CV number is 10, so that is the value we store. The relevant libraries are added in the example, and the maximum number of trees is set to 15 just for the example.
library(ipred)
library(caret)
library(rpart)
library(dplyr)
data("mtcars")
n_trees <- 1
error_df <- data.frame()
while (n_trees <= 15) {
  ames_bag2 <- train(
    mpg ~ .,
    data = mtcars,
    method = "treebag",
    trControl = trainControl(method = "cv", number = 10),
    nbagg = n_trees,
    control = rpart.control(minsplit = 2, cp = 0)
  )
  error_df %>%
    bind_rows(data.frame(trees = n_trees,
                         rmse = mean(ames_bag2[["resample"]]$RMSE))) -> error_df
  n_trees <- n_trees + 1
}
error_df will show the output.
> error_df
trees rmse
1 1 2.493117
2 2 3.052958
3 3 2.052801
4 4 2.239841
5 5 2.500279
6 6 2.700347
7 7 2.642525
8 8 2.497162
9 9 2.263527
10 10 2.379366
11 11 2.447560
12 12 2.314433
13 13 2.423648
14 14 2.192112
15 15 2.256778
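To get a plot like the one on that page, you can plot error_df directly; a minimal sketch, assuming ggplot2 is available:
library(ggplot2)
ggplot(error_df, aes(x = trees, y = rmse)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of bagged trees (nbagg)",
       y = "Cross-validated RMSE")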

How can I get the same output as the lm() function after using the train() function of caret?

I'm trying to run some tests on my linear regression model with functions such as ols_vif_tol(), ols_test_normality() or durbinWatsonTest(), which only work with lm() objects. However, I built my model using the train() function of the caret package.
> fitcontrol = trainControl( method = "repeatedcv", number = floor(0.4*nrow(TrainData)), repeats = RepeatsTC, returnResamp = "all", savePredictions = "all")
> BestModel = train(Formula2, data = TrainData, trControl = fitcontrol, method = "lm", metric = "RMSE")
At the end I get this output:
> BestModel
Linear Regression
10 samples
1 predictor
No pre-processing
Resampling: Cross-Validated (4 fold, repeated 100 times)
Summary of sample sizes: 7, 8, 8, 7, 7, 8, ...
Resampling results:
RMSE Rsquared MAE
10.75823 0.8911761 9.660638
Tuning parameter 'intercept' was held constant at a value of TRUE
What I want is to have this output:
> GoodModel = lm(Formula2, data = FinalData)
> GoodModel
Call:
lm(formula = Formula2, data = FinalData)
Coefficients:
(Intercept) Evol.INDUS.PROD
4.089 3.908
So, even though I used method = "lm", I don't get the same kind of output, which gives me an error when I run my tests.
> ols_test_normality(BestModel)
Error in ols_test_normality.default(BestModel) : y must be numeric
> ols_test_normality(GoodModel)
-----------------------------------------------
Test Statistic pvalue
-----------------------------------------------
Shapiro-Wilk 0.9042 0.1528
Kolmogorov-Smirnov 0.1904 0.6661
Cramer-von Mises 1.1026 0.0010
Anderson-Darling 0.4615 0.2156
-----------------------------------------------
I know there is an as.lm function, but I tried it and I don't have a version that can use it.
Does someone know how to get the same form as the lm() function after using train or a way to use the output of BestModel to do those tests?
EDIT
Here is a simpler case that gives rise to the same error and where you can try different tests.
install.packages("olsrr")
install.packages("caret")
library(olsrr)
library(caret)
first = sample(1:10, 10, rep = TRUE)
second = sample(10:20, 10, rep = TRUE)
third = sample(20:30, 10, rep = TRUE)
Df = data.frame(first, second, third)
Df
#Create a model with lm
Model1 = lm(first ~ second + third, data = Df)
Model1
summary(Model1)
ols_test_normality(Model1)
#Create a model with caret::train
Fold = sample(1:nrow(Df) ,size = 0.8*nrow(Df), replace = FALSE)
TrainData = Df[Fold,]
TestData = Df[-Fold,]
fitcontrol = trainControl(method = "repeatedcv", number = 2, repeats = 10)
Model2 = train(first ~ second + third, data = TrainData, trControl = fitcontrol, method = "lm")
Model2
summary(Model2)
ols_test_normality(Model2)
Thank you
Your Model2 is a train object, so ols_test_normality will not work on it:
class(Model2)
[1] "train" "train.formula"
The final lm model is stored under finalModel:
class(Model2$finalModel)
[1] "lm"
ols_test_normality(Model2$finalModel)
-----------------------------------------------
Test Statistic pvalue
-----------------------------------------------
Shapiro-Wilk 0.9843 0.9809
Kolmogorov-Smirnov 0.149 0.9822
Cramer-von Mises 0.4212 0.0611
Anderson-Darling 0.1677 0.9004
-----------------------------------------------
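Since finalModel is an ordinary lm object, the other diagnostics mentioned in the question should accept it as well; a sketch (assuming olsrr and car are installed):
summary(Model2$finalModel)               # the usual lm() coefficient table
olsrr::ols_vif_tol(Model2$finalModel)    # collinearity diagnostics
car::durbinWatsonTest(Model2$finalModel) # residual autocorrelation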

manually making a random forest model doesn't give the same results

I tried recreating a random forest model using caret and I appear to get slightly different results.
##Set up data
library(psych)  # provides the sat.act data set
library(dplyr)
library(caret)
data("sat.act")
attach(sat.act)
sat.act <- na.omit(sat.act)
#rename outcome and make as factor
sat.act <- sat.act %>% mutate(gender = ifelse(gender == 1, "male", "female"))
sat.act$gender <- as.factor(sat.act$gender)
#create train and test
set.seed(123)
indexes<-createDataPartition(y=sat.act$gender,p=0.7,list=FALSE)
train<-sat.act[indexes,]
test<-sat.act[-indexes,]
Create a model using 5-fold cv to find the best mtry
set.seed(123)
ctrl <- trainControl(method = "cv",
                     number = 5,
                     savePredictions = TRUE,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
model <- train(gender ~ ., data = train,
               trControl = ctrl,
               method = "rf",
               preProc = c("center", "scale"),
               metric = "ROC",
               importance = TRUE)
> model$finalModel
#Call:
# randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
# Type of random forest: classification
# Number of trees: 500
#No. of variables tried at each split: 2
# OOB estimate of error rate: 39%
#Confusion matrix:
# female male class.error
#female 238 72 0.2322581
#male 116 56 0.6744186
Cross-validation showed the best mtry is 2. Make another model, set mtry = 2, and see the results.
set.seed(123)
ctrl_other <- trainControl(method="none", savePredictions = TRUE, summaryFunction=twoClassSummary, classProbs=TRUE)
model_other <- train(gender ~., data=train, trControl=ctrl_other, importance=TRUE, tuneGrid = data.frame(mtry = 2))
> model_other$finalModel
#Call:
# randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
# Type of random forest: classification
# Number of trees: 500
#No. of variables tried at each split: 2
#
# OOB estimate of error rate: 37.34%
#Confusion matrix:
# female male class.error
#female 245 65 0.2096774
#male 115 57 0.6686047
So you can see what appear to be two of the same models (both with mtry = 2 and ntree = 500), but you get different results for the final model. Why?
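A hedged note on the likely cause: a random forest fit depends on the RNG state at the moment it is built, and with method = "cv" train() spends random numbers on fold assignment and the resampled fits before it builds the final model, whereas with method = "none" the final model is built straight after set.seed(123) (the two calls also differ in preProc and metric). A small sketch of the RNG effect using randomForest directly:
library(randomForest)
set.seed(123)
rf_a <- randomForest(gender ~ ., data = train, mtry = 2, importance = TRUE)
set.seed(123)
rf_b <- randomForest(gender ~ ., data = train, mtry = 2, importance = TRUE)
# same seed immediately before each fit: identical OOB error
rf_a$err.rate[500, "OOB"] == rf_b$err.rate[500, "OOB"]
set.seed(123)
runif(10)  # consume some random numbers first, as resampling does
rf_c <- randomForest(gender ~ ., data = train, mtry = 2, importance = TRUE)
# different RNG state at fit time: usually a (slightly) different OOB error
rf_c$err.rate[500, "OOB"]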

R: knn + pca, undefined columns selected

I am trying to use knn for prediction but would like to conduct principal component analysis first to reduce dimensionality.
However, after I generate the principal components and apply them in knn, it produces an error saying
"Error in [.data.frame(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected"
as well as warnings:
"In addition: Warning message: In nominalTrainWorkflow(x = x, y = y,
wts = weights, info = trainInfo, : There were missing values in
resampled performance measures."
Here is my sample:
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
data.frame()
The first 15 in the training set
train1 = sample[1:15, ]
test = sample[16:20, ]
Eliminate dependent variable
pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
#select the first 2 principal components
pca.tr = pca.tr[, 1:2]
train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)
k = train(train1[,1] ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:5),
          trControl = train.control, preProcess='scale',
          metric = "RMSE",
          data = cbind(train1[,1], pca.tr))
Any advice is appreciated!
Use better column names and a formula without subscripts.
You really should try to post a reproducible example. Some of your code was wrong.
Also, there is a "pca" method for preProc that does the appropriate thing by recomputing the PCA scores inside of resampling.
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
set.seed(55)
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
data.frame()
train1 = sample[1:15, ]
test = sample[16:20, ]
pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
#select the first 2 principal components
pca.tr = pca.tr[, 1:2]
dat <- cbind(train1[,1], pca.tr) %>%
  # give the columns usable names
  setNames(c("y", "True", "PC1"))
train.ct = trainControl(method = "repeatedcv", number = 3, repeats = 1)
set.seed(356)
k = train(y ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:5),
          trControl = train.ct, # this argument was wrong in your code
          preProcess = 'scale',
          metric = "RMSE",
          data = dat)
k
#> k-Nearest Neighbors
#>
#> 15 samples
#> 2 predictor
#>
#> Pre-processing: scaled (2)
#> Resampling: Cross-Validated (3 fold, repeated 1 times)
#> Summary of sample sizes: 11, 10, 9
#> Resampling results across tuning parameters:
#>
#> k RMSE Rsquared MAE
#> 1 4.979826 0.4332661 3.998205
#> 2 5.347236 0.3970251 4.312809
#> 3 5.016606 0.5977683 3.939470
#> 4 4.504474 0.8060368 3.662623
#> 5 5.612582 0.5104171 4.500768
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 4.
# or
set.seed(356)
train(X1 ~ .,
      method = "knn",
      tuneGrid = expand.grid(k = 1:5),
      trControl = train.ct,
      preProcess = c('pca', 'scale'),
      metric = "RMSE",
      data = train1)
#> k-Nearest Neighbors
#>
#> 15 samples
#> 5 predictor
#>
#> Pre-processing: principal component signal extraction (5), scaled
#> (5), centered (5)
#> Resampling: Cross-Validated (3 fold, repeated 1 times)
#> Summary of sample sizes: 11, 10, 9
#> Resampling results across tuning parameters:
#>
#> k RMSE Rsquared MAE
#> 1 13.373189 0.2450736 10.592047
#> 2 10.217517 0.2952671 7.973258
#> 3 9.030618 0.2727458 7.639545
#> 4 8.133807 0.1813067 6.445518
#> 5 8.083650 0.2771067 6.551053
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.
Created on 2019-04-15 by the reprex package (v0.2.1)
These look worse in terms of RMSE but the previous run underestimates RMSE since it assumes that there is no variation in the PCA scores.
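As a possible follow-up (a sketch, not part of the original answer): because preProcess = "pca" is applied inside train(), predict() on the fitted object applies the same projection to new data, so no manual PCA step is needed for the held-out test rows:
set.seed(356)
k2 <- train(X1 ~ .,
            method = "knn",
            tuneGrid = expand.grid(k = 1:5),
            trControl = train.ct,
            preProcess = c('pca', 'scale'),
            metric = "RMSE",
            data = train1)
# the centering/scaling/PCA learned on train1 is applied to test automatically
postResample(predict(k2, newdata = test), test$X1)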
