R: knn + pca, undefined columns selected

I am trying to use knn for prediction, but I would like to run principal component analysis first to reduce dimensionality.
However, after I generate the principal components and feed them to knn, training fails with this error:
"Error in [.data.frame(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected"
along with this warning:
"In addition: Warning message: In nominalTrainWorkflow(x = x, y = y,
wts = weights, info = trainInfo, : There were missing values in
resampled performance measures."
Here is my sample:
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
data.frame()
The first 15 rows go into the training set:
train1 = sample[1:15, ]
test = sample[16:20, ]
Eliminate the dependent variable:
pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
#select the first 2 principal components
pca.tr = pca.tr[, 1:2]
train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)
k = train(train1[,1] ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:5),
trControl = train.control, preProcess='scale',
metric = "RMSE",
data = cbind(train1[,1], pca.tr))
Any advice is appreciated!

Use better column names and a formula without subscripts.
You really should try to post a reproducible example. Some of your code was wrong.
Also, there is a "pca" method for preProc that does the appropriate thing by recomputing the PCA scores inside of resampling.
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(55)
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
data.frame()
train1 = sample[1:15, ]
test = sample[16:20, ]
pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
#select the first 2 principal components
pca.tr = pca.tr[, 1:2]
dat <- cbind(train1[,1], pca.tr) %>%
# rename the columns so the formula interface can use them
setNames(c("y", "True", "PC1"))
train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)
set.seed(356)
k = train(y ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:5),
trControl = train.ct, # this argument was wrong in your code
preProcess='scale',
metric = "RMSE",
data = dat)
k
#> k-Nearest Neighbors
#>
#> 15 samples
#> 2 predictor
#>
#> Pre-processing: scaled (2)
#> Resampling: Cross-Validated (3 fold, repeated 1 times)
#> Summary of sample sizes: 11, 10, 9
#> Resampling results across tuning parameters:
#>
#> k RMSE Rsquared MAE
#> 1 4.979826 0.4332661 3.998205
#> 2 5.347236 0.3970251 4.312809
#> 3 5.016606 0.5977683 3.939470
#> 4 4.504474 0.8060368 3.662623
#> 5 5.612582 0.5104171 4.500768
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 4.
# or
set.seed(356)
train(X1 ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:5),
trControl = train.ct,
preProcess= c('pca', 'scale'),
metric = "RMSE",
data = train1)
#> k-Nearest Neighbors
#>
#> 15 samples
#> 5 predictor
#>
#> Pre-processing: principal component signal extraction (5), scaled
#> (5), centered (5)
#> Resampling: Cross-Validated (3 fold, repeated 1 times)
#> Summary of sample sizes: 11, 10, 9
#> Resampling results across tuning parameters:
#>
#> k RMSE Rsquared MAE
#> 1 13.373189 0.2450736 10.592047
#> 2 10.217517 0.2952671 7.973258
#> 3 9.030618 0.2727458 7.639545
#> 4 8.133807 0.1813067 6.445518
#> 5 8.083650 0.2771067 6.551053
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.
Created on 2019-04-15 by the reprex package (v0.2.1)
These look worse in terms of RMSE, but the previous run underestimates RMSE since it assumes that there is no variation in the PCA scores.
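If you want to keep exactly two components, as in the original code, the number of retained components can be fixed through trainControl's preProcOptions, which caret passes along to preProcess(). A minimal sketch (not run here; train.ct2 is just a new name for the control object):
# sketch: keep the PCA at 2 components while still recomputing the rotation
# inside each resample
train.ct2 <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                          preProcOptions = list(pcaComp = 2))
set.seed(356)
train(X1 ~ .,
      method = "knn",
      tuneGrid = expand.grid(k = 1:5),
      trControl = train.ct2,
      preProcess = c("pca", "scale"),
      metric = "RMSE",
      data = train1)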

Related

How do I calculate quadratic weighted Kappa in a Tidymodels pipeline?

I have the following code as a simple example.
library(tidymodels)
library(bonsai)
train_folds <- vfold_cv(data = train, strata = target)
train_rec <- recipe(formula = target ~ ., data = train) %>%
update_role(Id, new_role = "ID")
gb_mod <- boost_tree(engine = "lightgbm",
mtry = 11,
mode = "classification",
trees = 100)
gb_workflow <- workflow(preprocessor = train_rec,
spec = gb_mod)
model_fit <- gb_workflow %>% fit_resamples(train_folds,
metrics = metric_set(kap, roc_auc, accuracy))
model_fit %>% collect_metrics()
The kap function calculates the Kappa metric, which is unweighted by default. To calculate quadratic weighted Kappa, you must pass weighting = "quadratic" as an argument, but metric_set() doesn't seem to accept it. How can I include QWK in the metrics output?
Apologies if this has been answered already but I couldn't find any examples.
You need to make an alternate function (just by wrapping the original):
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data(hpc_cv, package = "modeldata")
# See example in ?metric_set examples
kap_quad <- function(data, truth, estimate, na_rm = TRUE, ...) {
kap(
data = data,
truth = !! rlang::enquo(truth),
estimate = !! rlang::enquo(estimate),
# set weighting = "quadratic"
weighting = "quadratic",
na_rm = na_rm,
...
)
}
kap_quad <- new_numeric_metric(kap_quad, "maximize")
met <- metric_set(kap_quad)
hpc_cv %>%
met(obs, estimate = pred)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 kap multiclass 0.692
# no weighting
hpc_cv %>%
kap(obs, estimate = pred)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 kap multiclass 0.508
Created on 2023-01-31 by the reprex package (v2.0.1)
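To get the weighted Kappa into the resampling output from the question, the wrapper can go into the metric_set() handed to fit_resamples(). This is a sketch rather than a tested recipe: because kap compares two factor columns, registering the wrapper with new_class_metric() should let it sit alongside accuracy and roc_auc in a single metric_set; gb_workflow and train_folds are the objects from the question, and kap_quad_cls is just an illustrative name.
# sketch: register the wrapper as a class metric so it can be combined with
# accuracy (a class metric) and roc_auc (a probability metric)
kap_quad_cls <- new_class_metric(kap_quad, "maximize")

gb_workflow %>%
  fit_resamples(
    train_folds,
    metrics = metric_set(kap_quad_cls, accuracy, roc_auc)
  ) %>%
  collect_metrics()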

Can the out of bag error for a random forests model in R's TidyModel's framework be obtained?

If you use the ranger function directly, you can obtain the out-of-bag error from the resulting ranger object.
If, instead, you proceed by setting up a recipe, a model specification/engine, tuning parameters, etc., how can you extract that same error? The tidymodels approach doesn't seem to hold on to that data.
If you want to access the ranger object inside of the parsnip object, it is there as $fit:
library(tidymodels)
data("ad_data", package = "modeldata")
rf_spec <-
rand_forest() %>%
set_engine("ranger", oob.error = TRUE) %>%
set_mode("classification")
rf_fit <- rf_spec %>%
fit(Class ~ ., data = ad_data)
rf_fit
#> parsnip model object
#>
#> Fit time: 158ms
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, oob.error = ~TRUE, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 500
#> Sample size: 333
#> Number of independent variables: 130
#> Mtry: 11
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.1340793
class(rf_fit)
#> [1] "_ranger" "model_fit"
class(rf_fit$fit)
#> [1] "ranger"
rf_fit$fit$prediction.error
#> [1] 0.1340793
Created on 2021-03-11 by the reprex package (v1.0.0)
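If the ranger model is fit inside a workflow (for example with a recipe or formula preprocessor), the underlying engine object can be pulled out in much the same way. A minimal sketch, assuming a recent version of the workflows package that provides extract_fit_engine():
# sketch: the same spec wrapped in a workflow
rf_wflow <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)

rf_wflow_fit <- fit(rf_wflow, data = ad_data)

# extract_fit_engine() returns the ranger object, so the OOB error is there too
extract_fit_engine(rf_wflow_fit)$prediction.error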

recipes::step_dummy + caret::train -> Error: Not all variables in the recipe are present

I am getting the following error when using recipes::step_dummy with caret::train (first attempt at combining the two packages):
Error: Not all variables in the recipe are present in the supplied
training set
I'm not sure what is causing the error or how best to debug it. Any help getting the model to train would be much appreciated.
library(caret)
library(tidyverse)
library(recipes)
library(rsample)
data("credit_data")
## Split the data into training (75%) and test sets (25%)
set.seed(100)
train_test_split <- initial_split(credit_data)
credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)
# Create recipe for data pre-processing
rec_obj <- recipe(Status ~ ., data = credit_train) %>%
step_knnimpute(all_predictors()) %>%
#step_other(Home, Marital, threshold = .2, other = "other") %>%
#step_other(Job, threshold = .2, other = "others") %>%
step_dummy(Records) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep(training = credit_train, retain = TRUE)
train_data <- juice(rec_obj)
test_data <- bake(rec_obj, credit_test)
set.seed(1055)
# the glm function models the second factor level.
lrfit <- train(rec_obj, data = train_data,
method = "glm",
trControl = trainControl(method = "repeatedcv",
repeats = 5))
Don't prep the recipe before giving it to train and use the original training set:
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
library(rsample)
data("credit_data")
## Split the data into training (75%) and test sets (25%)
set.seed(100)
train_test_split <- initial_split(credit_data)
credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)
# Create recipe for data pre-processing
rec_obj <-
recipe(Status ~ ., data = credit_train) %>%
step_knnimpute(all_predictors()) %>%
#step_other(Home, Marital, threshold = .2, other = "other") %>%
#step_other(Job, threshold = .2, other = "others") %>%
step_dummy(Records) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric())
set.seed(1055)
# the glm function models the second factor level.
lrfit <- train(rec_obj, data = credit_train,
method = "glm",
trControl = trainControl(method = "repeatedcv",
repeats = 5))
lrfit
#> Generalized Linear Model
#>
#> 3341 samples
#> 13 predictor
#> 2 classes: 'bad', 'good'
#>
#> Recipe steps: knnimpute, dummy, center, scale
#> Resampling: Cross-Validated (10 fold, repeated 5 times)
#> Summary of sample sizes: 3006, 3008, 3007, 3007, 3007, 3007, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.7965349 0.4546223
Created on 2019-03-20 by the reprex package (v0.2.1)
It seems that you still need the formula in the train function (even though it is already specified in the recipe)?...
glmfit <- train(Status ~ ., data = juice(rec_obj),
method = "glm",
trControl = trainControl(method = "repeatedcv", repeats = 5))
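Since the unprepped recipe travels inside the train object, new data can also be scored in its raw form; caret preps and bakes the recipe before predicting. A minimal sketch, assuming the lrfit and credit_test objects above:
# sketch: the recipe's imputation, dummy coding, centering and scaling are
# applied to credit_test automatically before prediction
test_pred <- predict(lrfit, newdata = credit_test)
head(test_pred)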

Out-of-fold vs training error in caret

Using cross-validation for model tuning, I get different error rates from caret::train's results object and from calculating the error myself on its pred object. I'd like to understand why they differ, and ideally how to use out-of-fold error rates for model selection, plotting model performance, etc.
The pred object contains out-of-fold predictions. The docs are pretty clear that trainControl(..., savePredictions = "final") saves out-of-fold predictions for the best hyperparameter values: "an indicator of how much of the hold-out predictions for each resample should be saved... "final" saves the predictions for the optimal tuning parameters." (Keeping "all" predictions and then filtering to the best tuning values doesn't resolve the issue.)
The train docs say that the results object is "a data frame the training error rate..." I'm not sure what that means, but the values for the best row are consistently different from the metrics calculated on pred. Why do they differ and how can I make them line up?
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
m
#> Random Forest
#>
#> 50 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 38, 36, 38, 38
#> Resampling results across tuning parameters:
#>
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 1 2 maxstat 0.5981673 0.6724245 0.4993722
#> 3 1 extratrees 0.5861116 0.7010012 0.4938035
#> 4 2 maxstat 0.6017491 0.6661093 0.4999057
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#> extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394
Created on 2018-04-09 by the reprex package (v0.2.0).
The cross-validation RMSE is not calculated the way you show; it is computed for each fold and then averaged. Full example:
set.seed(1)
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
set.seed(1)
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#output
Random Forest
50 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 37, 38, 37, 38
Resampling results across tuning parameters:
min.node.size mtry splitrule RMSE Rsquared MAE
8 1 extratrees 0.6106390 0.4360609 0.4926629
12 2 extratrees 0.6156636 0.4294237 0.4954481
19 2 variance 0.6472539 0.3889372 0.5217369
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.
RMSE for best model is 0.6106390
Now calculate the RMSE for each fold and average:
library(dplyr)  # for the pipe and group_by/summarise
m$pred %>%
group_by(Resample) %>%
mutate(rmse = caret::RMSE(pred, obs)) %>%
summarise(mean = mean(rmse)) %>%
pull(mean) %>%
mean
#output
0.610639
m$pred %>%
group_by(Resample) %>%
mutate(rmse = MLmetrics::RMSE(pred, obs)) %>%
summarise(mean = mean(rmse)) %>%
pull(mean) %>%
mean
#output
0.610639
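caret also stores the fold-level metrics for the chosen tuning parameters in the resample element of the train object, so the same average can be read off without recomputing anything. A short sketch, assuming the m object above:
# sketch: per-fold performance for the final tuning parameters
m$resample
# averaging the fold-level RMSE reproduces the value in the results table
mean(m$resample$RMSE)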
I get different results. This is apparently a random process.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
[1] 0.5824464
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
[1] 0.5271595
If you want a random (more accurately, a pseudo-random) process to be reproducible, use set.seed immediately prior to the call.

Caret leap forward for linear regression (feature selection) change nvmax

I have been playing around with the leapForward method from the leaps package in conjunction with caret and found that it only considers 5 variables. According to the leaps documentation, you can change nvmax to whatever number of subsets you wish.
I cannot see where this fits into the caret wrapper. I have tried putting it in the train statement, as well as creating an expand.grid line, and it does not seem to work. Any help would be appreciated!
my code:
library(caret)
data <- read.csv(file="C:/mydata.csv", header=TRUE, sep=",")
fitControl <- trainControl(method = "loocv")
x <- data[, -19]
y <- data[, 19]
lmFit <- train(x=x, y=y,'leapForward', trControl = fitControl)
summary(lmFit)
By default, caret evaluates only a small, automatically chosen grid of tuning parameter values (three values of nvmax here, controlled by tuneLength).
You can specify whatever grid of parameters you like with the tuneGrid option.
Here is a reproducible example with the BloodBrain dataset. NB: I had to
transform the predictors with a PCA to avoid problems of multicollinearity.
library(caret)
data(BloodBrain, package = "caret")
dim(bbbDescr)
#> [1] 208 134
X <- princomp(bbbDescr)$scores[,1:131]
Y <- logBBB
fitControl <- trainControl(method = "cv")
Default: the automatic grid of parameter values
lmFit <- train(y = Y, x = X,'leapForward', trControl = fitControl)
lmFit
#> Linear Regression with Forward Selection
#>
#> 208 samples
#> 131 predictors
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 187, 188, 187, 187, 187, 187, ...
#> Resampling results across tuning parameters:
#>
#> nvmax RMSE Rsquared MAE
#> 2 0.6682545 0.2928583 0.5286758
#> 3 0.7008359 0.2652202 0.5527730
#> 4 0.6781190 0.3026475 0.5215527
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was nvmax = 2.
With a tuning grid of your choice.
NB: expand.grid is not strictly necessary here; it is useful when you combine
several tuning parameters.
lmFit <- train(y = Y, x = X,'leapForward', trControl = fitControl,
tuneGrid = expand.grid(nvmax = seq(1, 30, 2)))
lmFit
#> Linear Regression with Forward Selection
#>
#> 208 samples
#> 131 predictors
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 188, 188, 188, 186, 187, 187, ...
#> Resampling results across tuning parameters:
#>
#> nvmax RMSE Rsquared MAE
#> 1 0.7649633 0.07840817 0.5919515
#> 3 0.6952295 0.27147443 0.5250173
#> 5 0.6482456 0.35953363 0.4828406
#> 7 0.6509919 0.37800159 0.4865292
#> 9 0.6721529 0.35899937 0.5104467
#> 11 0.6541945 0.39316037 0.4979497
#> 13 0.6355383 0.42654189 0.4794705
#> 15 0.6493433 0.41823974 0.4911399
#> 17 0.6645519 0.37338055 0.5105887
#> 19 0.6575950 0.39628133 0.5084652
#> 21 0.6663806 0.39156852 0.5124487
#> 23 0.6744933 0.38746853 0.5143484
#> 25 0.6709936 0.39228681 0.5025907
#> 27 0.6919163 0.36565876 0.5209107
#> 29 0.7015347 0.35397968 0.5272448
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was nvmax = 13.
plot(lmFit)
Created on 2018-03-08 by the reprex package (v0.2.0).
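To see which predictors end up in the selected subset, the finalModel slot of the fitted object holds the regsubsets fit from leaps, so its coef() method can be queried at the chosen size. A sketch, assuming the lmFit object above:
# sketch: coefficients of the best subset of the size chosen by caret
coef(lmFit$finalModel, id = lmFit$bestTune$nvmax)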
