Caret leap forward for linear regression (feature selection) change nvmax - r

I have been playing around with the leapForward method from the package leaps in conjunction with caret and found that it only provides 5 variables . according to the leaps package you can change nvmax to whatever number of subsets you wish.
I cannot seem where to fit this into the caret wrapper. I have tried putting it in the train statement, as well as creating an expand.grid line, and ti does not seem to work. Any help would be appreciated!
my code:
library(caret)
data <- read.csv(file="C:/mydata.csv", header=TRUE, sep=",")
fitControl <- trainControl(method = "loocv")
x <- data[, -19]
y <- data[, 19]
lmFit <- train(x=x, y=y,'leapForward', trControl = fitControl)
summary(lmFit)

The default behavior of caret is a random search over the tuning parameters.
You can specify a grid of parameters as you like, with the tuneGrid option.
Here is a reproducible example with the BloodBrain dataset. NB : I had to
transform the predictors with a PCA to avoid problems of multicolinearity
library(caret)
data(BloodBrain, package = "caret")
dim(bbbDescr)
#> [1] 208 134
X <- princomp(bbbDescr)$scores[,1:131]
Y <- logBBB
fitControl <- trainControl(method = "cv")
Default : random search of parameters
lmFit <- train(y = Y, x = X,'leapForward', trControl = fitControl)
lmFit
#> Linear Regression with Forward Selection
#>
#> 208 samples
#> 131 predictors
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 187, 188, 187, 187, 187, 187, ...
#> Resampling results across tuning parameters:
#>
#> nvmax RMSE Rsquared MAE
#> 2 0.6682545 0.2928583 0.5286758
#> 3 0.7008359 0.2652202 0.5527730
#> 4 0.6781190 0.3026475 0.5215527
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was nvmax = 2.
With a grid search of your choice.
NB : expand.grid is not necessary here. it is useful when you combine
several tuning parameters
lmFit <- train(y = Y, x = X,'leapForward', trControl = fitControl,
tuneGrid = expand.grid(nvmax = seq(1, 30, 2)))
lmFit
#> Linear Regression with Forward Selection
#>
#> 208 samples
#> 131 predictors
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 188, 188, 188, 186, 187, 187, ...
#> Resampling results across tuning parameters:
#>
#> nvmax RMSE Rsquared MAE
#> 1 0.7649633 0.07840817 0.5919515
#> 3 0.6952295 0.27147443 0.5250173
#> 5 0.6482456 0.35953363 0.4828406
#> 7 0.6509919 0.37800159 0.4865292
#> 9 0.6721529 0.35899937 0.5104467
#> 11 0.6541945 0.39316037 0.4979497
#> 13 0.6355383 0.42654189 0.4794705
#> 15 0.6493433 0.41823974 0.4911399
#> 17 0.6645519 0.37338055 0.5105887
#> 19 0.6575950 0.39628133 0.5084652
#> 21 0.6663806 0.39156852 0.5124487
#> 23 0.6744933 0.38746853 0.5143484
#> 25 0.6709936 0.39228681 0.5025907
#> 27 0.6919163 0.36565876 0.5209107
#> 29 0.7015347 0.35397968 0.5272448
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was nvmax = 13.
plot(lmFit)
Created on 2018-03-08 by the reprex package (v0.2.0).

Related

how to plot RMSE vs number of trees tries in bagging when using train() and cross validation in r

I am studying this website about bagging method. https://bradleyboehmke.github.io/HOML/bagging.html
I am going to use train() function with cross validation for bagging. something like below.
as far as I realized nbagg=200 tells r to try 200 trees, calculate RMSE for each and return the number of trees ( here 80 ) for which the best RMSE is achieved.
now how can I see what RMSE other nbagg values have produced in this model. like RMSE vs number of trees plot in that website ( begore introdicing cv method and train() function like plot below)
ames_bag2 <- train(
Sale_Price ~ .,
data = ames_train,
method = "treebag",
trControl = trainControl(method = "cv", number = 10),
nbagg = 200,
control = rpart.control(minsplit = 2, cp = 0)
)
ames_bag2
## Bagged CART
##
## 2054 samples
## 80 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1848, 1848, 1849, 1849, 1847, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 26957.06 0.8900689 16713.14
As the example you shared is not completely reproducible, I have taken a different example from the mtcars dataset to illustrate how you can do it. You can extend that for your data.
Note: The RMSE showed here is the average of 10 RMSEs as the CV number is 10 here. So we will store that only. Adding the relevant libraries too in the example here. And setting the maximum number of trees as 15, just for the example.
library(ipred)
library(caret)
library(rpart)
library(dplyr)
data("mtcars")
n_trees <-1
error_df <- data.frame()
while (n_trees <= 15) {
ames_bag2 <- train(
mpg ~.,
data = mtcars,
method = "treebag",
trControl = trainControl(method = "cv", number = 10),
nbagg = n_trees,
control = rpart.control(minsplit = 2, cp = 0)
)
error_df %>%
bind_rows(data.frame(trees=n_trees, rmse=mean(ames_bag2[["resample"]]$RMSE)))-> error_df
n_trees <- n_trees+1
}
error_df will show the output.
> error_df
trees rmse
1 1 2.493117
2 2 3.052958
3 3 2.052801
4 4 2.239841
5 5 2.500279
6 6 2.700347
7 7 2.642525
8 8 2.497162
9 9 2.263527
10 10 2.379366
11 11 2.447560
12 12 2.314433
13 13 2.423648
14 14 2.192112
15 15 2.256778

Can the out of bag error for a random forests model in R's TidyModel's framework be obtained?

If you directly use the ranger function, one can obtain the out-of-bag error from the resulting ranger class object.
If instead, one proceeds by way of setting up a recipe, model specification/engine, with tuning parameters, etc., how can we extract that same error? The Tidymodels approach doesn't seem to hold on to that data.
If you want to access the ranger object inside of the parsnip object, it is there as $fit:
library(tidymodels)
data("ad_data", package = "modeldata")
rf_spec <-
rand_forest() %>%
set_engine("ranger", oob.error = TRUE) %>%
set_mode("classification")
rf_fit <- rf_spec %>%
fit(Class ~ ., data = ad_data)
rf_fit
#> parsnip model object
#>
#> Fit time: 158ms
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, oob.error = ~TRUE, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 500
#> Sample size: 333
#> Number of independent variables: 130
#> Mtry: 11
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.1340793
class(rf_fit)
#> [1] "_ranger" "model_fit"
class(rf_fit$fit)
#> [1] "ranger"
rf_fit$fit$prediction.error
#> [1] 0.1340793
Created on 2021-03-11 by the reprex package (v1.0.0)

In the train method what's the relationship between tuneGrid and trControl?

The preferred method in R to train known ML models is to use the caret package and its generic train method. My question is what's the relationship between the tuneGrid and trControl parameters? as they are undoubtedly related and I can't figure out their relationship by reading the documentation ... for example:
library(caret)
# train and choose best model using cross validation
df <- ... # contains input data
control <- trainControl(method = "cv", number = 10, p = .9, allowParallel = TRUE)
fit <- train(y ~ ., method = "knn",
data = df,
tuneGrid = data.frame(k = seq(9, 71, 2)),
trControl = control)
If I run the code above what's happening? how do the 10 CV folds each containing 90% of the data as per the trainControl definition are combined with the 32 levels of k?
More concretely:
I have 32 levels for the parameter k.
I also have 10 CV folds.
Is the k-nearest neighbors model trained 32*10 times? or otherwise?
Yes, you are correct. You partition your training data into 10 sets, say 1..10. Starting with set 1, you train your model using all of 2..10 (90% of the training data) and test it on set 1. This is repeated again for set2, set3.. It's a total of 10 times, and you have 32 values of k to test, hence 32 * 10 = 320.
You can also pull out this cv results using the returnResamp function in trainControl. I simplify it to 3-fold and 4 values of k below:
df <- mtcars
set.seed(100)
control <- trainControl(method = "cv", number = 3, p = .9,returnResamp="all")
fit <- train(mpg ~ ., method = "knn",
data = mtcars,
tuneGrid = data.frame(k = 2:5),
trControl = control)
resample_results = fit$resample
resample_results
RMSE Rsquared MAE k Resample
1 3.502321 0.7772086 2.483333 2 Fold1
2 3.807011 0.7636239 2.861111 3 Fold1
3 3.592665 0.8035741 2.697917 4 Fold1
4 3.682105 0.8486331 2.741667 5 Fold1
5 2.473611 0.8665093 1.995000 2 Fold2
6 2.673429 0.8128622 2.210000 3 Fold2
7 2.983224 0.7120910 2.645000 4 Fold2
8 2.998199 0.7207914 2.608000 5 Fold2
9 2.094039 0.9620830 1.610000 2 Fold3
10 2.551035 0.8717981 2.113333 3 Fold3
11 2.893192 0.8324555 2.482500 4 Fold3
12 2.806870 0.8700533 2.368333 5 Fold3
# we manually calculate the mean RMSE for each parameter
tapply(resample_results$RMSE,resample_results$k,mean)
2 3 4 5
2.689990 3.010492 3.156360 3.162392
# and we can see it corresponds to the final fit result
fit$results
k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 2 2.689990 0.8686003 2.029444 0.7286489 0.09245494 0.4376844
2 3 3.010492 0.8160947 2.394815 0.6925154 0.05415954 0.4067066
3 4 3.156360 0.7827069 2.608472 0.3805227 0.06283697 0.1122577
4 5 3.162392 0.8131593 2.572667 0.4601396 0.08070670 0.1891581

R: knn + pca, undefined columns selected

I am trying to use knn in prediction but would like to conduct principal component analysis first to reduce dimensionality.
However, after I generated principal components and apply them on knn, it generates errors saying
"Error in [.data.frame(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected"
as well as warnings:
"In addition: Warning message: In nominalTrainWorkflow(x = x, y = y,
wts = weights, info = trainInfo, : There were missing values in
resampled performance measures."
Here is my sample:
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
data.frame()
The first 15 in the training set
train1 = sample[1:15, ]
test = sample[16:20, ]
Eliminate dependent variable
pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
#select the first 2 principal components
pca.tr = pca.tr[, 1:2]
train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)
k = train(train1[,1] ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:5),
trControl = train.control, preProcess='scale',
metric = "RMSE",
data = cbind(train1[,1], pca.tr))
Any advice is appreciated!
Use better column names and a formula without subscripts.
You really should try to post a reproducible example. Some of your code was wrong.
Also, there is a "pca" method for preProc that does the appropriate thing by recomputing the PCA scores inside of resampling.
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(55)
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
data.frame()
train1 = sample[1:15, ]
test = sample[16:20, ]
pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
#select the first 2 principal components
pca.tr = pca.tr[, 1:2]
dat <- cbind(train1[,1], pca.tr) %>%
# This
setNames(c("y", "True", "PC1"))
train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)
set.seed(356)
k = train(y ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:5),
trControl = train.ct, # this argument was wrong in your code
preProcess='scale',
metric = "RMSE",
data = dat)
k
#> k-Nearest Neighbors
#>
#> 15 samples
#> 2 predictor
#>
#> Pre-processing: scaled (2)
#> Resampling: Cross-Validated (3 fold, repeated 1 times)
#> Summary of sample sizes: 11, 10, 9
#> Resampling results across tuning parameters:
#>
#> k RMSE Rsquared MAE
#> 1 4.979826 0.4332661 3.998205
#> 2 5.347236 0.3970251 4.312809
#> 3 5.016606 0.5977683 3.939470
#> 4 4.504474 0.8060368 3.662623
#> 5 5.612582 0.5104171 4.500768
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 4.
# or
set.seed(356)
train(X1 ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:5),
trControl = train.ct,
preProcess= c('pca', 'scale'),
metric = "RMSE",
data = train1)
#> k-Nearest Neighbors
#>
#> 15 samples
#> 5 predictor
#>
#> Pre-processing: principal component signal extraction (5), scaled
#> (5), centered (5)
#> Resampling: Cross-Validated (3 fold, repeated 1 times)
#> Summary of sample sizes: 11, 10, 9
#> Resampling results across tuning parameters:
#>
#> k RMSE Rsquared MAE
#> 1 13.373189 0.2450736 10.592047
#> 2 10.217517 0.2952671 7.973258
#> 3 9.030618 0.2727458 7.639545
#> 4 8.133807 0.1813067 6.445518
#> 5 8.083650 0.2771067 6.551053
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.
Created on 2019-04-15 by the reprex package (v0.2.1)
These look worse in terms of RMSE but the previous run underestimates RMSE since it assumes that there is no variation in the PCA scores.

Out-of-fold vs training error in caret

Using cross validation in model tuning, I get different error rates from caret::train's results object and calculating the error myself on its pred object. I'd like to understand why they differ, and ideally how to use out-of-fold error rates for model selection, plotting model performance, etc.
The pred object contains out-of-fold predictions. The docs are pretty clear that trainControl(..., savePredictions = "final") saves out-of-fold predictions for the best hyperparameter values: "an indicator of how much of the hold-out predictions for each resample should be saved... "final" saves the predictions for the optimal tuning parameters." (Keeping "all" predictions and then filtering to the best tuning values doesn't resolve the issue.)
The train docs say that the results object is "a data frame the training error rate..." I'm not sure what that means, but the values for the best row are consistently different from the metrics calculated on pred. Why do they differ and how can I make them line up?
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
m
#> Random Forest
#>
#> 50 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 38, 36, 38, 38
#> Resampling results across tuning parameters:
#>
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 1 2 maxstat 0.5981673 0.6724245 0.4993722
#> 3 1 extratrees 0.5861116 0.7010012 0.4938035
#> 4 2 maxstat 0.6017491 0.6661093 0.4999057
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#> extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394
Created on 2018-04-09 by the reprex package (v0.2.0).
The RMSE for cross validation is not calculated the way you show, but rather for each fold and then averaged. Full example:
set.seed(1)
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
set.seed(1)
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#output
Random Forest
50 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 37, 38, 37, 38
Resampling results across tuning parameters:
min.node.size mtry splitrule RMSE Rsquared MAE
8 1 extratrees 0.6106390 0.4360609 0.4926629
12 2 extratrees 0.6156636 0.4294237 0.4954481
19 2 variance 0.6472539 0.3889372 0.5217369
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.
RMSE for best model is 0.6106390
Now calculate the RMSE for each fold and average:
m$pred %>%
group_by(Resample) %>%
mutate(rmse = caret::RMSE(pred, obs)) %>%
summarise(mean = mean(rmse)) %>%
pull(mean) %>%
mean
#output
0.610639
m$pred %>%
group_by(Resample) %>%
mutate(rmse = MLmetrics::RMSE(pred, obs)) %>%
summarise(mean = mean(rmse)) %>%
pull(mean) %>%
mean
#output
0.610639
I get different results. This is apparently a random process.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
[1] 0.5824464
> MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
[1] 0.5271595
If you want a random (more accurately a pseudo-random process to be reproducible, then use set.seed immediately prior to the call.

Resources