tidymodels: Novel levels found in column

I am using tidymodels to create a random forest prediction. My test data contains a factor level that is not present in the training data, which results in these errors:
1: Novel levels found in column 'Siblings': '4'. The levels have been removed, and values have been coerced to 'NA'.
2: There are new levels in a factor: NA
> test_predict
Error: object 'test_predict' not found
I tried to include a step_novel and step_dummy on the "Siblings" column, but this does not resolve the error. How should I deal with new factor levels that are not present in the training data?
library(tidyverse)
library(tidymodels)
data <-
  data.frame(
    Survived = as.factor(c(0,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0)),
    Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,3)),
    Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
    Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s"))
  )
test <-
  data.frame(
    Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,4)), # New factor level
    Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
    Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s"))
  )
# Model
rf_model <-
  rand_forest() %>%
  set_args(
    mtry = 3,
    trees = 1000,
    min_n = 15
  ) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
# Recipe
data_recipe <-
  recipe(Survived ~ Siblings + Class + Embarked, data = data) %>%
  step_novel(Siblings) %>%
  step_dummy(Siblings)
# Workflow
rf_workflow <-
  workflow() %>%
  add_recipe(data_recipe) %>%
  add_model(rf_model)
final_model <- fit(rf_workflow, data)
final_model
test_predict <- predict(final_model, test)
test_predict

If you look at the documentation for step_novel(), it says:
When fitting a model that can deal with new factor levels, consider using workflows::add_recipe() with allow_novel_levels = TRUE set in hardhat::default_recipe_blueprint(). This will allow your model to handle new levels at prediction time, instead of throwing warnings or errors.
So you want to do that:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data <-
  data.frame(
    Survived = as.factor(c(0,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0)),
    Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,3)),
    Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
    Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s"))
  )
test <-
  data.frame(
    Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,4)), # New factor level
    Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
    Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s"))
  )
# Model
rf_model <-
  rand_forest() %>%
  set_args(
    mtry = 3,
    trees = 1000,
    min_n = 15
  ) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
# Recipe
data_recipe <-
  recipe(Survived ~ Siblings + Class + Embarked, data = data) %>%
  step_novel(Siblings) %>%
  step_dummy(Siblings)
# Workflow
rf_workflow <-
  workflow() %>%
  add_recipe(
    data_recipe,
    blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)
  ) %>%
  add_model(rf_model)
final_model <- fit(rf_workflow, data)
final_model
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 2 Recipe Steps
#>
#> • step_novel()
#> • step_dummy()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3, x), num.trees = ~1000, min.node.size = min_rows(~15, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 1000
#> Sample size: 16
#> Number of independent variables: 5
#> Mtry: 3
#> Target node size: 15
#> Variable importance mode: impurity
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.254242
test_predict <- predict(final_model, test)
test_predict
#> # A tibble: 16 x 1
#> .pred_class
#> <fct>
#> 1 0
#> 2 1
#> 3 0
#> 4 1
#> 5 0
#> 6 0
#> 7 0
#> 8 0
#> 9 0
#> 10 1
#> 11 0
#> 12 1
#> 13 0
#> 14 0
#> 15 0
#> 16 0
Created on 2021-07-09 by the reprex package (v2.0.0)
The workflows functions are very strict about factor levels and other aspects of the new data, ensuring that they match up with the training data.

To answer my own question:
We need to apply step_novel followed by step_unknown. step_novel adds a placeholder level ("new") to the factor at training time, so that levels first encountered at prediction time can be recoded to "new"; this also makes such rows easy to spot when inspecting the baked data. step_unknown then converts missing (NA) values, including values that the workflow layer has coerced to NA because their levels were unseen during training (warning 1 above), to an "unknown" level when the model is applied:
data_recipe <-
  recipe(Survived ~ Siblings + Class + Embarked, data = data) %>%
  step_novel(Siblings) %>%
  step_unknown(Siblings)
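A quick way to verify the fix, reusing the rf_model spec and the data/test frames from the question (the names rf_workflow2 and final_model2 are just for illustration):
rf_workflow2 <- workflow() %>%
  add_recipe(data_recipe) %>%
  add_model(rf_model)
final_model2 <- fit(rf_workflow2, data)
# the unseen level "4" is coerced to NA by the workflow layer (with a warning),
# then mapped to the "unknown" level by step_unknown, so prediction succeeds
predict(final_model2, test)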

Related

Using tidymodels in R, my BART workflow changes after I fit it once. Why?

I have been trying to train a BART model using the tidymodels framework, but I am running into some problems.
I can declare the model, the recipe, and the workflow alright, but once I fit the workflow, two unwanted things happen:
1. The original model object (bart_mod below), initially stored correctly, becomes "call: NULL", even though I don't touch the model object directly (I assign nothing to the same object name).
2. I am not able to retrieve any information about the fitted model. The bart_fit contains nothing, and there seems to be no tidy method associated with it. All this is true even though I am able to predict values using the fitted model! (See the last line of code in the reprex.)
This may very well come from a misunderstanding of how all this works on my end; I am fairly new to tidymodels.
I would appreciate any help! Thank you.
library(tidyverse)
library(tidymodels)
set.seed(2022)
# Parameters --------------------------------------------------------------
n <- 5000
coef_x_var_1 <- 1
coef_x_var_2 <- 2
coef_x_var_3 <- 3
gen_y_1 <- function(data = dataset) {
  return(data$y_0 +
           data$x_var_1 * coef_x_var_1 +
           data$x_var_2 * coef_x_var_2 +
           data$x_var_3 * coef_x_var_3 +
           rnorm(n = nrow(data), mean = 0, sd = 3))
}
# Data generation ---------------------------------------------------------
dataset <- matrix(NA, nrow = n, ncol = 3)
# Generate the unit-level moderators
dataset[, 1] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[, 2] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[, 3] <- rnorm(mean = rnorm(n = 1), n = n)
# Change into dataframe
colnames(dataset) <- c("x_var_1", "x_var_2", "x_var_3")
dataset <- as_tibble(dataset)
# Make sure the variable format is numeric (except for the identifiers)
dataset$x_var_1 <- as.numeric(dataset$x_var_1)
dataset$x_var_2 <- as.numeric(dataset$x_var_2)
dataset$x_var_3 <- as.numeric(dataset$x_var_3)
# Generate the untreated potential outcomes
P0_coefs <- rdunif(n = 6, 1, 15)
dataset$y_0 <-
  dataset$x_var_1 * P0_coefs[4] +
  dataset$x_var_2 * P0_coefs[5] +
  dataset$x_var_3 * P0_coefs[6] +
  rnorm(n = nrow(dataset), mean = 0, sd = 3)
dataset$y_1 <- gen_y_1(data = dataset)
# Create a variable to indicate treatment
treatment_group <- sample(1:nrow(dataset), size = nrow(dataset) / 2)
# Indicate which potential outcome you observe
obs_dataset <- dataset |>
  mutate(treated = ifelse(row_number() %in% treatment_group, 1, 0),
         obs_y = ifelse(treated, y_1, y_0))
y1_obs_dataset <- obs_dataset |> filter(treated == 1)
y0_obs_dataset <- obs_dataset |> filter(treated == 0)
# Analysis ----------------------------------------------------------------
covariates <- c("x_var_1", "x_var_2", "x_var_3")
bart_formula <- as.formula(paste0("obs_y ~ ", paste(covariates, collapse = " + ")))
# Create the workflow
bart_mod <- bart() |>
  set_engine("dbarts") |>
  set_mode("regression")
bart_recipe <- recipe(bart_formula, data = obs_dataset) |>
  step_zv(all_predictors())
bart_workflow <-
  workflow() |>
  add_model(bart_mod) |>
  add_recipe(bart_recipe)
# The workflow first looks right
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> BART Model Specification (regression)
#>
#> Computational engine: dbarts
# Once I fit it though, the model part becomes call: NULL
bart_fit <- bart_workflow |>
  fit(y1_obs_dataset)
# Nothing is stored in the fit
bart_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> `NULL`()
# The content of this object has changed!
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> NULL
bart_fit |>
  extract_fit_parsnip()
#> parsnip model object
#>
#>
#> Call:
#> `NULL`()
# And yet, I am able to run a prediction using the fit!
predict(bart_fit, y0_obs_dataset)
#> # A tibble: 2,500 × 1
#> .pred
#> <dbl>
#> 1 -4.67
#> 2 -6.23
#> 3 6.35
#> 4 10.7
#> 5 4.90
#> 6 -13.8
#> 7 4.70
#> 8 19.6
#> 9 -0.907
#> 10 5.38
#> # … with 2,490 more rows
Created on 2022-12-24 with reprex v2.0.2
First, stripping Martin's code down to a smaller script:
library(tidyverse)
library(tidymodels)
set.seed(2022)
obs_dataset <- structure(
  list(
    x_var_1 = c(-0.273203786163623, 0.0026566250757164, -0.544359413888551,
                0.569128408034224, -2.00048700105319, -0.159113741655834),
    obs_y = c(-8.14952415680873, 1.91364235165124, -7.68391811408719,
              -9.01497463720505, -18.5017189874949, -13.505685812581)
  ),
  row.names = c(NA, -6L),
  class = c("tbl_df", "tbl", "data.frame")
)
bart_formula <- as.formula("obs_y ~ x_var_1")
# Create the workflow
bart_mod <- bart() |>
  set_engine("dbarts") |>
  set_mode("regression")
bart_recipe <- recipe(bart_formula, data = obs_dataset)
bart_workflow <-
  workflow() |>
  add_model(bart_mod) |>
  add_recipe(bart_recipe)
The workflow at first looks right:
bart_workflow
> ══ Workflow ════════════════════════════════════════════════════════════════
> Preprocessor: Recipe
> Model: bart()
>
> ── Preprocessor ──────────────────────────────────────────────────────────────
> 0 Recipe Steps
>
> ── Model ─────────────────────────────────────────────────────────────────────
> BART Model Specification (regression)
>
> Computational engine: dbarts
but this changes after fitting:
bart_fit <- bart_workflow |>
  fit(obs_dataset)
bart_fit
The workflow now displays NULL for the call, as does the model object:
bart_workflow
bart_mod
══ Workflow [trained] ══════════════════════════════════════════════════════
Preprocessor: Recipe
Model: bart()

── Preprocessor ──────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ─────────────────────────────────────────────────────────────────────
Call:
`NULL`()
All these display values:
required_pkgs(bart_mod)
print_model_spec(bart_mod)
bart_mod[["engine"]]
bart_mod[["mode"]]
extract_recipe(bart_fit)
extract_preprocessor(bart_fit)
extract_mold(bart_fit)
bart_fit[["fit"]][["fit"]][["spec"]][["engine"]]
bart_fit[["fit"]][["fit"]][["spec"]][["mode"]]
These display NULL:
print(bart_mod)
print(bart_workflow)
print(bart_fit)
extract_fit_engine(bart_fit)
extract_fit_parsnip(bart_fit)
extract_model(bart_fit)
So it seems that the model data is still present in the objects and is usable, but the print methods do not display it, and neither do the extract functions.
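Given that, one workaround is to bypass the broken print methods and pull the stored engine fit out of the workflow directly; a sketch, assuming the bart_fit object from above (the name engine_fit is just for illustration):
engine_fit <- bart_fit$fit$fit$fit  # the raw dbarts fit inside the workflow
class(engine_fit)                   # should be the dbarts fit class ("bart")
str(engine_fit, max.level = 1)      # the components are populated; only print is broken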

DALEX and step_pca

I would like to look at the combined feature importance of the principal components with DALEX model_parts, but I am also interested in the extent to which the results are driven by variation in a specific variable within a principal component. I can look at the influence of individual features very neatly when using model_profile, but then I cannot investigate the feature importance of the PCA variables. Is it possible to get the best of both worlds and look at the combined feature importance of a principal component while using model_profile partial dependence plots of individual factors, as shown below?
Data:
library(tidymodels)
library(parsnip)
library(DALEXtra)
set.seed(1)
x1 <- rbinom(1000, 5, .1)
x2 <- rbinom(1000, 5, .4)
x3 <- rbinom(1000, 5, .9)
x4 <- rbinom(1000, 5, .6)
id <- 1:1000
y <- as.factor(rbinom(1000, 5, .5))
df <- tibble(y, x1, x2, x3, x4, id)
df[, c("x1", "x2", "x3", "x4", "id")] <- sapply(df[, c("x1", "x2", "x3", "x4", "id")], as.numeric)
Model
# create training and test set
set.seed(20)
split_dat <- initial_split(df, prop = 0.8)
train <- training(split_dat)
test <- testing(split_dat)
# use cross-validation
kfolds <- vfold_cv(df)
# recipe
rec_pca <- recipe(y ~ ., data = train) %>%
  update_role(id, new_role = "id variable") %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(x1, x2, x3, threshold = 0.9)
# parsnip engine
boost_model <- boost_tree() %>%
  set_mode("classification") %>%
  set_engine("xgboost")
# create wf
boosted_wf <-
  workflow() %>%
  add_model(boost_model) %>%
  add_recipe(rec_pca)
final_boosted <- generics::fit(boosted_wf, df)
# create an explanation object
explainer_xgb <- DALEXtra::explain_tidymodels(final_boosted,
                                              data = df[, -1],
                                              y = df$y)
# feature importance
model_parts(explainer_xgb) %>% plot()
This gives me the following plot, even though I have reduced x1, x2 and x3 to one component in step_pca above.
I know I could reduce the dimensions manually, bind the result to the df like so, and then look at the feature importance:
rec_pca_2 <- df %>%
  select(x1, x2, x3) %>%
  recipe() %>%
  step_pca(all_numeric(), num_comp = 1)
df <- bind_cols(df, prep(rec_pca_2) %>% juice())
df
> df
# A tibble: 1,000 × 6
y x1 x2 x3 x4 PC1
<fct> <int> <int> <int> <int> <dbl>
1 2 0 2 4 2 -4.45
2 3 0 3 3 3 -3.95
3 0 0 2 4 4 -4.45
4 2 1 4 5 3 -6.27
5 4 0 1 5 2 -4.94
6 2 1 0 5 1 -4.63
7 3 2 2 5 4 -5.56
8 3 1 2 5 3 -5.45
9 2 1 3 5 2 -5.86
10 2 0 2 5 1 -5.35
# … with 990 more rows
I could then estimate a model with PC1 as a covariate. Yet, in that case, it would be difficult to interpret what variation in PC1 substantively means when using model_profile, since everything would be collapsed into one component.
model_profile(explainer_xgb) %>% plot()
Thus, my key question is: how can I look at the feature importance of components without compromising on the interpretability of the partial dependence plot?
You may be interested in the discussion here on how to get explainability for the original predictors vs. the features that have been created via feature engineering (like PCA components). We don't have a super fluent interface for this yet, so you have to do it a bit manually:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(parsnip)
library(DALEX)
#> Welcome to DALEX (version: 2.4.0).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
#>
#> Attaching package: 'DALEX'
#> The following object is masked from 'package:dplyr':
#>
#> explain
set.seed(1)
x1 <- rbinom(1000, 5, .1)
x2 <- rbinom(1000, 5, .4)
x3 <- rbinom(1000, 5, .9)
x4 <- rbinom(1000, 5, .6)
y <- as.factor(sample(c("yes", "no"), size = 1000, replace = TRUE))
df <- tibble(y, x1, x2, x3, x4) %>% mutate(across(where(is.integer), as.numeric))
# create training and test set
set.seed(20)
split_dat <- initial_split(df, prop = 0.8)
train <- training(split_dat)
test <- testing(split_dat)
# use cross-validation
kfolds <- vfold_cv(df)
# recipe
rec_pca <- recipe(y ~ ., data = train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(x1, x2, x3, threshold = 0.9)
# parsnip engine
boost_model <- boost_tree() %>%
  set_mode("classification") %>%
  set_engine("xgboost")
# create wf
boosted_wf <-
  workflow() %>%
  add_model(boost_model) %>%
  add_recipe(rec_pca)
final_boosted <- generics::fit(boosted_wf, df)
#> [14:00:11] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Notice that here I use regular DALEX (not DALEXtra), and that I manually extract the xgboost model from inside the workflow and apply the feature engineering to the data myself:
# create an explanation object
explainer_xgb <-
  DALEX::explain(
    extract_fit_parsnip(final_boosted),
    data = rec_pca %>% prep() %>% bake(new_data = NULL, all_predictors()),
    y = as.integer(train$y)
  )
#> Preparation of a new explainer is initiated
#> -> model label : model_fit ( default )
#> -> data : 800 rows 4 cols
#> -> data : tibble converted into a data.frame
#> -> target variable : 800 values
#> -> predict function : yhat.model_fit will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package parsnip , ver. 0.1.7 , task classification ( default )
#> -> predicted values : numerical, min = 0.1157353 , mean = 0.4626758 , max = 0.8343955
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = 0.1860582 , mean = 0.9985742 , max = 1.884265
#> A new explainer has been created!
model_parts(explainer_xgb) %>% plot()
Created on 2022-03-11 by the reprex package (v2.0.1)
The only behavior supported right now in DALEXtra is based on using the original predictors, so if you want to look at engineered features like these, you need to do it yourself. You may be interested in this chapter of our book.
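For completeness, once the explainer is built on the baked (PCA-space) data as above, partial dependence profiles are available for the components themselves; a sketch assuming the explainer_xgb from the reprex and the default component name PC1:
# partial dependence of the model on the principal component itself
model_profile(explainer_xgb, variables = "PC1") %>% plot()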

RMSE value on the example of randomForest

I am watching one of the solutions for the House Prices Kaggle competition. I would like to know how you get the RMSE value from this:
Subset the train rows and selected features
dt.train <- fulldt %>%
  filter(Set == "Train") %>%
  select("Id", "OverallQual", "TotalArea", "AreaAbvground", "GarageArea",
         "TotalBaths", "YearBuilt", "Neighborhood", "MSSubClass", "FireplaceQu",
         "ExterQual", "KitchenQual", "BsmtQual", "HouseStyle") %>%
  mutate(SalePrice = log(raw.train$SalePrice))
Same for the test features
dt.test <- fulldt %>%
  filter(Set == "Test") %>%
  select("Id", "OverallQual", "TotalArea", "AreaAbvground", "GarageArea",
         "TotalBaths", "YearBuilt", "Neighborhood", "MSSubClass", "FireplaceQu",
         "ExterQual", "KitchenQual", "BsmtQual", "HouseStyle")
Random Forest model
fit <- randomForest(SalePrice ~ ., data = dt.train, importance = T)
Use new model to predict SalePrice values from the test set
pred <- exp(predict(fit , newdata = dt.test))
How do you get the RMSE value from pred?
Let's calculate the RMSE for the training and test rows, using the iris data as a minimal example:
library(tibble)
library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
train_df <- head(iris, 100)
test_df <- tail(iris, 50)
model <- randomForest(Sepal.Length ~ ., data = train_df, importance = T)
# Test RMSE
tibble(
  truth = test_df$Sepal.Length,
  estimate = predict(model, newdata = test_df)
) %>%
  rmse(truth, estimate)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard       0.836
# Train RMSE
tibble(
  truth = train_df$Sepal.Length,
  estimate = predict(model, newdata = train_df)
) %>%
  rmse(truth, estimate)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard       0.265
Created on 2021-12-13 by the reprex package (v2.0.1)
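For reference, rmse() here computes the usual sqrt(mean((estimate - truth)^2)), so you can check the test value directly in base R, using the model and test_df from above:
# base R equivalent of the test RMSE computed with yardstick above
sqrt(mean((predict(model, newdata = test_df) - test_df$Sepal.Length)^2))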

Tuning with classification_cost and custom cost matrix in Tidymodels

I am using tidymodels for building a model where false negatives are more costly than false positives. Hence I'd like to use the yardstick::classification_cost metric for hyperparameter tuning, but with a custom classification cost matrix that reflects this fact.
Doing this after fitting a model is simple enough:
library(tidymodels)
# load simulated prediction output
data("two_class_example")
# cost matrix penalizing false negatives
cost_matrix <- tribble(
  ~truth,   ~estimate, ~cost,
  "Class1", "Class2",  2,
  "Class2", "Class1",  1
)
# use function on simulated prediction output
classification_cost(
  data = two_class_example,
  truth = truth,
  # target class probability
  Class1,
  # supply the function with the cost matrix
  costs = cost_matrix
)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 classification_cost binary 0.260
Created on 2021-11-01 by the reprex package (v2.0.1)
But using this function during hyperparameter tuning is where I run into problems. The documentation states that for setting options, the metric should be wrapped in a custom function. Here's my attempt and the resulting error. Note how this wrapper works fine for evaluating a fitted model, but throws an error when used for tuning:
library(tidymodels)
# load data
data("two_class_example")
data("two_class_dat")
# create custom metric penalizing false negatives
classification_cost_penalized <- function(data,
                                          truth,
                                          class_proba,
                                          na_rm = TRUE) {
  # cost matrix penalizing false negatives
  cost_matrix <- tribble(
    ~truth,   ~estimate, ~cost,
    "Class1", "Class2",  2,
    "Class2", "Class1",  1
  )
  classification_cost(
    data = data,
    truth = !!rlang::enquo(truth),
    # supply the function with the class probabilities
    !!rlang::enquo(class_proba),
    # supply the function with the cost matrix
    costs = cost_matrix,
    na_rm = na_rm
  )
}
# Use `new_prob_metric()` to formalize this new metric function
classification_cost_penalized <- new_prob_metric(classification_cost_penalized, "minimize")
# test if this works on the simulated estimates
two_class_example %>%
  classification_cost_penalized(truth = truth, class_proba = Class1)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 classification_cost binary 0.260
# test if this works with hyperparameter tuning
# specify a RF model
my_model <-
  rand_forest(mtry = tune(),
              min_n = tune(),
              trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")
# specify recipe
my_recipe <- recipe(Class ~ A + B, data = two_class_dat)
# bundle to workflow
my_wf <- workflow() %>%
  add_model(my_model) %>%
  add_recipe(my_recipe)
# start tuning
tuned_rf <- my_wf %>%
  # set up tuning grid
  tune_grid(
    resamples = vfold_cv(two_class_dat, v = 5),
    grid = 5,
    metrics = metric_set(classification_cost_penalized)
  )
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> x Fold1: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> x Fold2: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> x Fold3: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> x Fold4: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> x Fold5: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> Warning: All models failed. See the `.notes` column.
Created on 2021-11-01 by the reprex package (v2.0.1)
Unnesting the notes shows that there are unused arguments: "internal: Error: In metric: `classification_cost_penalized`\nunused arguments (estimator = ~prob_estimator, event_level = ~event_level)". But apparently the yardstick_event_level() function, which is how event_level should be set according to this documentation, does not exist? No function by that name shows up when searching for it.
I don't know how to proceed here.
Thank you for your time.
When you are tweaking an existing yardstick metric, it is much easier to use the metric_tweak() function, which allows you to hard-code certain optional arguments (like costs) while keeping everything else the same. It is sort of like purrr::partial(), but for yardstick metrics.
library(tidymodels)
# load data
data("two_class_example")
data("two_class_dat")
cost_matrix <- tribble(
  ~truth,   ~estimate, ~cost,
  "Class1", "Class2",  2,
  "Class2", "Class1",  1
)
classification_cost_penalized <- metric_tweak(
  .name = "classification_cost_penalized",
  .fn = classification_cost,
  costs = cost_matrix
)
# test if this works on the simulated estimates
two_class_example %>%
  classification_cost_penalized(truth = truth, class_prob = Class1)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 classification_cost_penalized binary 0.260
# specify a RF model
my_model <-
  rand_forest(
    mtry = tune(),
    min_n = tune(),
    trees = 500
  ) %>%
  set_engine("ranger") %>%
  set_mode("classification")
# specify recipe
my_recipe <- recipe(Class ~ A + B, data = two_class_dat)
# bundle to workflow
my_wf <- workflow() %>%
  add_model(my_model) %>%
  add_recipe(my_recipe)
# start tuning
tuned_rf <- my_wf %>%
  tune_grid(
    resamples = vfold_cv(two_class_dat, v = 5),
    grid = 5,
    metrics = metric_set(classification_cost_penalized)
  )
#> i Creating pre-processing data to finalize unknown parameter: mtry
collect_metrics(tuned_rf)
#> # A tibble: 5 × 8
#> mtry min_n .metric .estimator mean n std_err .config
#> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1 35 classification_cost… binary 0.407 5 0.0162 Preprocessor1…
#> 2 1 23 classification_cost… binary 0.403 5 0.0146 Preprocessor1…
#> 3 1 10 classification_cost… binary 0.403 5 0.0137 Preprocessor1…
#> 4 2 27 classification_cost… binary 0.396 5 0.0166 Preprocessor1…
#> 5 2 6 classification_cost… binary 0.401 5 0.0161 Preprocessor1…
Created on 2021-11-03 by the reprex package (v2.0.1)
Is there an alternative metric or approach you would recommend for a situation in which costs are different? The cost matrix doesn't seem to do a whole lot for evaluating which model is best, even when extreme differences between false-positive and false-negative costs are provided. An example based on your code above:
library(tidymodels)
# load data
data("two_class_example")
data("two_class_dat")
cost_matrix_1 <- tribble(
  ~truth,   ~estimate, ~cost,
  "Class1", "Class2",  10,
  "Class2", "Class1",  1
)
cost_matrix_2 <- tribble(
  ~truth,   ~estimate, ~cost,
  "Class1", "Class2",  1,
  "Class2", "Class1",  10
)
classification_cost_penalized_1 <- metric_tweak(
  .name = "classification_cost_penalized_1",
  .fn = classification_cost,
  costs = cost_matrix_1
)
classification_cost_penalized_2 <- metric_tweak(
  .name = "classification_cost_penalized_2",
  .fn = classification_cost,
  costs = cost_matrix_2
)
# test if this works on the simulated estimates
two_class_example %>%
  classification_cost_penalized_1(truth = truth, class_prob = Class1)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 classification_cost_penalized binary 0.260
two_class_example %>%
  classification_cost_penalized_2(truth = truth, class_prob = Class1)
# specify a RF model
my_model <-
  rand_forest(
    mtry = tune(),
    min_n = tune(),
    trees = 500
  ) %>%
  set_engine("ranger") %>%
  set_mode("classification")
# specify recipe
my_recipe <- recipe(Class ~ A + B, data = two_class_dat)
# bundle to workflow
my_wf <- workflow() %>%
  add_model(my_model) %>%
  add_recipe(my_recipe)
# start tuning
tuned_rf <- my_wf %>%
  tune_grid(
    resamples = vfold_cv(two_class_dat, v = 5),
    grid = 50,
    metrics = metric_set(classification_cost_penalized_1, classification_cost_penalized_2)
  )
#> i Creating pre-processing data to finalize unknown parameter: mtry
# seems to always be increasing or decreasing, with not much difference even when the cost differences are large
collect_metrics(tuned_rf) %>%
  ggplot(aes(x = min_n, y = mean, color = .metric)) +
  geom_line() +
  facet_grid(rows = "mtry")
[plot: visualization of metrics overview]

Can the out-of-bag error for a random forest model in R's tidymodels framework be obtained?

If you use the ranger() function directly, you can obtain the out-of-bag error from the resulting ranger object.
If instead you proceed by setting up a recipe, a model specification/engine, tuning parameters, and so on, how can you extract that same error? The tidymodels approach doesn't seem to hold on to that data.
If you want to access the ranger object inside of the parsnip object, it is there as $fit:
library(tidymodels)
data("ad_data", package = "modeldata")
rf_spec <-
  rand_forest() %>%
  set_engine("ranger", oob.error = TRUE) %>%
  set_mode("classification")
rf_fit <- rf_spec %>%
  fit(Class ~ ., data = ad_data)
rf_fit
#> parsnip model object
#>
#> Fit time: 158ms
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, oob.error = ~TRUE, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 500
#> Sample size: 333
#> Number of independent variables: 130
#> Mtry: 11
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.1340793
class(rf_fit)
#> [1] "_ranger" "model_fit"
class(rf_fit$fit)
#> [1] "ranger"
rf_fit$fit$prediction.error
#> [1] 0.1340793
Created on 2021-03-11 by the reprex package (v1.0.0)
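The same value is also reachable from a trained workflow; a sketch assuming the rf_spec and ad_data from above (the name rf_wflow_fit is just for illustration):
rf_wflow_fit <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec) %>%
  fit(data = ad_data)
# extract_fit_engine() returns the underlying ranger object
extract_fit_engine(rf_wflow_fit)$prediction.error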
