I have the following code as a simple example.
library(tidymodels)
library(bonsai)
train_folds <- vfold_cv(data = train, strata = target)
train_rec <- recipe(formula = target ~ ., data = train) %>%
update_role(Id, new_role = "ID")
gb_mod <- boost_tree(engine = "lightgbm",
mtry = 11,
mode = "classification",
trees = 100)
gb_workflow <- workflow(preprocessor = train_rec,
spec = gb_mod)
model_fit <- gb_workflow %>% fit_resamples(train_folds,
metrics = metric_set(kap, roc_auc, accuracy))
model_fit %>% collect_metrics()
The kap function calculates the Kappa metric which has no weighting by default. To calculate quadratic weighted Kappa you must add weighting = "quadratic" as a parameter, which metric_set() doesn't seem to accept. How can I include QWK in the metrics output?
Apologies if this has been answered already but I couldn't find any examples.
You need to make an alternate function (just by wrapping the original):
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data(hpc_cv, package = "modeldata")
# See example in ?metric_set examples
kap_quad <- function(data, truth, estimate, na_rm = TRUE, ...) {
kap(
data = data,
truth = !! rlang::enquo(truth),
estimate = !! rlang::enquo(estimate),
# set weighting = "quadratic"
weighting = "quadratic",
na_rm = na_rm,
...
)
}
kap_quad <- new_numeric_metric(kap_quad, "maximize")
met <- metric_set(kap_quad)
hpc_cv %>%
met(obs, estimate = pred)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 kap multiclass 0.692
# no weighting
hpc_cv %>%
kap(obs, estimate = pred)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 kap multiclass 0.508
Created on 2023-01-31 by the reprex package (v2.0.1)
I have been trying to train a BART model using the tidymodels framework but I am running into some problems.
I can declare the model, the recipe, and the workflow alright, but once I fit the workflow, two unwanted things happen:
The original model object (bart_mod below), initially correctly stored, becomes "call: NULL", even though I don't touch the model object directly (I assign nothing to the same object name).
I am not able to retrieve any information about the fitted model. The bart_fit contains nothing and there seems to be no tidy method associated to it. All this is true even though I am able to predict values using the fitted model! (See last line of code in the reprex).
This may very well come from a misunderstanding of how all this works on my end, I am fairly new to tidymodels.
I would appreciate any help! Thank you.
library(tidyverse)
library(tidymodels)
set.seed(2022)
# Parameters --------------------------------------------------------------
n <- 5000
coef_x_var_1 <- 1
coef_x_var_2 <- 2
coef_x_var_3 <- 3
gen_y_1 <- function(data = dataset) {
return(data$y_0 +
data$x_var_1*coef_x_var_1 +
data$x_var_2*coef_x_var_2 +
data$x_var_3*coef_x_var_3 +
rnorm(n = nrow(data), mean = 0, sd = 3)
)}
# Data generation ---------------------------------------------------------
dataset <- matrix(NA, nrow = n, ncol = 3)
# Generate the unit-level moderators
dataset[,1] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[,2] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[,3] <- rnorm(mean = rnorm(n = 1), n = n)
# Change into dataframe
colnames(dataset) <- c("x_var_1", "x_var_2", "x_var_3")
dataset <- as_tibble(dataset)
# Make sure the variable format is numeric (except for the identifiers)
dataset$x_var_1 <- as.numeric(dataset$x_var_1)
dataset$x_var_2 <- as.numeric(dataset$x_var_2)
dataset$x_var_3 <- as.numeric(dataset$x_var_3)
# Generate the untreated potential outcomes
P0_coefs <- rdunif(n = 6, 1, 15)
dataset$y_0 <-
dataset$x_var_1*P0_coefs[4] +
dataset$x_var_2*P0_coefs[5] +
dataset$x_var_3*P0_coefs[6] +
rnorm(n = nrow(dataset), mean = 0, sd = 3)
dataset$y_1 <- gen_y_1(data = dataset)
# Create a variable to indicate treatment
treatment_group <- sample(1:nrow(dataset), size = nrow(dataset)/2)
# Indicate which potential outcome you observe
obs_dataset <- dataset |>
mutate(treated = ifelse(row_number() %in% treatment_group, 1, 0),
obs_y = ifelse(treated, y_1, y_0))
y1_obs_dataset <- obs_dataset |> filter(treated == 1)
y0_obs_dataset <- obs_dataset |> filter(treated == 0)
# Analysis ----------------------------------------------------------------
covariates <- c("x_var_1", "x_var_2", "x_var_3")
bart_formula <- as.formula(paste0("obs_y ~ ", paste(covariates, collapse = " + ")))
# Create the workflow
bart_mod <- bart() |>
set_engine("dbarts") |>
set_mode("regression")
bart_recipe <- recipe(bart_formula, data = obs_dataset) |>
step_zv(all_predictors())
bart_workflow <-
workflow() |>
add_model(bart_mod) |>
add_recipe(bart_recipe)
# The workflow first looks right
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> BART Model Specification (regression)
#>
#> Computational engine: dbarts
# Once I fit it though, the model part becomes call: NULL
bart_fit <- bart_workflow |>
fit(y1_obs_dataset)
# Nothing is stored in the fit
bart_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> `NULL`()
# The content of this object has changed!
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> NULL
bart_fit |>
extract_fit_parsnip(bart_fit)
#> parsnip model object
#>
#>
#> Call:
#> `NULL`()
# And yet, I am able to run a prediction using the fit!
predict(bart_fit, y0_obs_dataset)
#> # A tibble: 2,500 × 1
#> .pred
#> <dbl>
#> 1 -4.67
#> 2 -6.23
#> 3 6.35
#> 4 10.7
#> 5 4.90
#> 6 -13.8
#> 7 4.70
#> 8 19.6
#> 9 -0.907
#> 10 5.38
#> # … with 2,490 more rows
Created on 2022-12-24 with reprex v2.0.2
First stripping Martin's code down to a smaller script:
library(tidyverse)
library(tidymodels)
set.seed(2022)
obs_dataset <- structure(list(x_var_1 = c(-0.273203786163623, 0.0026566250757164,
-0.544359413888551, 0.569128408034224, -2.00048700105319, -0.159113741655834
), obs_y = c(-8.14952415680873, 1.91364235165124, -7.68391811408719,
-9.01497463720505, -18.5017189874949, -13.505685812581)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
bart_formula <- as.formula("obs_y ~ x_var_1")
# Create the workflow
bart_mod <- bart() |>
set_engine("dbarts") |>
set_mode("regression")
bart_recipe <- recipe(bart_formula, data = obs_dataset)
bart_workflow <-
workflow() |>
add_model(bart_mod) |>
add_recipe(bart_recipe)
The workflow at first looks right
bart_workflow
> ══ Workflow
> ════════════════════════════════════════════════════════════════
> Preprocessor: Recipe Model: bart()
>
> ── Preprocessor
> ────────────────────────── 0 Recipe Steps
>
> ── Model
> ─────────────────────────────────────────────────────────
> BART Model Specification (regression)
>
> Computational engine: dbarts
but this changes after fitting:
bart_fit <- bart_workflow |>
fit(obs_dataset)
bart_fit
The workflow now displays NULL for the call, as does the model object.
bart_workflow
bart_mod
══ Workflow [trained] ══════════════════════════════════════════════════════
Preprocessor: Recipe
Model: bart()
── Preprocessor ─────────────────────────────────
0 Recipe Steps
── Model ────────────────────────────────────────────────
Call:
`NULL`()
All these display values:
required_pkgs(bart_mod)
print_model_spec(bart_mod)
bart_mod[["engine"]]
bart_mod[["mode"]]
extract_recipe(bart_fit)
extract_preprocessor(bart_fit)
extract_mold(bart_fit)
bart_fit[["fit"]][["fit"]][["spec"]][["engine"]]
bart_fit[["fit"]][["fit"]][["spec"]][["mode"]]
These display NULL:
print(bart_mod)
print(bart_workflow)
print(bart_fit)
extract_fit_engine(bart_fit)
extract_fit_parsnip(bart_fit)
extract_model(bart_fit)
So, it seems that the model data is still in the objects,
and is useable,
but the print calls do not display it,
and the extract functions do not display it.
I am watching one of the solutions for House Prices Kaggle competition. I would like to know how do you get RMSE value from this:
Subset the train rows and selected features
dt.train <- fulldt %>% filter(Set == "Train") %>% select("Id", "OverallQual", "TotalArea", "AreaAbvground", "GarageArea", "TotalBaths", "YearBuilt", "Neighborhood", "MSSubClass", "FireplaceQu", "ExterQual", "KitchenQual", "BsmtQual", "HouseStyle") %>% mutate(SalePrice = log(raw.train$SalePrice))
Same for the test features
dt.test <- fulldt %>% filter(Set == "Test") %>%
select("Id", "OverallQual", "TotalArea", "AreaAbvground", "GarageArea", "TotalBaths", "YearBuilt",
"Neighborhood", "MSSubClass", "FireplaceQu", "ExterQual", "KitchenQual", "BsmtQual", "HouseStyle")
Random Forest model
fit <- randomForest(SalePrice ~ ., data = dt.train, importance = T)
Use new model to predict SalePrice values from the test set
pred <- exp(predict(fit , newdata = dt.test))
How do you get RMSE value from pred ?
Let's calculate the RMSE of the training and test rows based on the minimal example iris data:
library(tibble)
library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
train_df <- head(iris, 100)
test_df <- tail(iris, 50)
model <- randomForest(Sepal.Length ~ ., data = train_df, importance = T)
# Test RMSE
tibble(
truth = predict(model, newdata = test_df),
predicted = test_df$Sepal.Length
) %>%
rmse(truth, predicted)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 rmse standard 0.836
# Train RMSE
tibble(
truth = predict(model, newdata = train_df),
predicted = train_df$Sepal.Length
) %>%
rmse(truth, predicted)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 rmse standard 0.265
Created on 2021-12-13 by the reprex package (v2.0.1)
I am using tidymodels to create a Random Forrest prediction. I have test data that contains a new factor level not present in the training data which results in the error:
1: Novel levels found in column 'Siblings': '4'. The levels have been removed, and values have been coerced to 'NA'.
2: There are new levels in a factor: NA
> test_predict
Fehler: Objekt 'test_predict' nicht gefunden
I tried to include a step_novel and step_dummy on the "Siblings" column but this does not resolve the error. How should I deal with new factors not present in training data?
library(tidyverse)
library(tidymodels)
data <-
data.frame(
Survived = as.factor(c(0,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0)),
Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,3)),
Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s"))
)
test <-
data.frame(
Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,4)), #New factor level
Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s"))
)
#Model
rf_model <-
rand_forest() %>%
set_args(
mtry = 3,
trees = 1000,
min_n = 15
) %>%
set_engine("ranger",
importance = "impurity") %>%
set_mode("classification")
#Recipe
data_recipe <-
recipe(Survived ~Siblings + Class + Embarked, data=data) %>%
step_novel(Siblings) %>%
step_dummy(Siblings)
#Workflow
rf_workflow <-
workflow() %>%
add_recipe(data_recipe) %>%
add_model(rf_model)
final_model <- fit(rf_workflow, data)
final_model
test_predict <- predict(final_model, test)
test_predict
If you notice in the documentation for step_novel(), it says:
When fitting a model that can deal with new factor levels, consider using workflows::add_recipe() with allow_novel_levels = TRUE set in hardhat::default_recipe_blueprint(). This will allow your model to handle new levels at prediction time, instead of throwing warnings or errors.
So you want to do that:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data <-
data.frame(
Survived = as.factor(c(0,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0)),
Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,3)),
Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s"))
)
test <-
data.frame(
Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,4)), #New factor level
Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s"))
)
#Model
rf_model <-
rand_forest() %>%
set_args(
mtry = 3,
trees = 1000,
min_n = 15
) %>%
set_engine("ranger",
importance = "impurity") %>%
set_mode("classification")
#Recipe
data_recipe <-
recipe(Survived ~Siblings + Class + Embarked, data=data) %>%
step_novel(Siblings) %>%
step_dummy(Siblings)
#Workflow
rf_workflow <-
workflow() %>%
add_recipe(data_recipe,
blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>%
add_model(rf_model)
final_model <- fit(rf_workflow, data)
final_model
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 2 Recipe Steps
#>
#> • step_novel()
#> • step_dummy()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3, x), num.trees = ~1000, min.node.size = min_rows(~15, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 1000
#> Sample size: 16
#> Number of independent variables: 5
#> Mtry: 3
#> Target node size: 15
#> Variable importance mode: impurity
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.254242
test_predict <- predict(final_model, test)
test_predict
#> # A tibble: 16 x 1
#> .pred_class
#> <fct>
#> 1 0
#> 2 1
#> 3 0
#> 4 1
#> 5 0
#> 6 0
#> 7 0
#> 8 0
#> 9 0
#> 10 1
#> 11 0
#> 12 1
#> 13 0
#> 14 0
#> 15 0
#> 16 0
Created on 2021-07-09 by the reprex package (v2.0.0)
The workflows functions are very strict about factor levels and other aspects of the new data, ensuring that they match up with the training data.
To answer my own question:
We need to apply step_novel followed by step_unknown. As far as I understand from the documentation step_novel labels any new factors occurring in the data with "new". This can be used to easily identify such factors when the data is inspected after applying the recipe. step_unknown removes any such factors from the data and converts the values to NA when the model is applied:
data_recipe <-
recipe(Survived ~Siblings + Class + Embarked, data=data) %>%
step_novel(Siblings) %>%
step_unknown(Siblings)
I'm struggling with how the obtain the AUC from a logistic regression model using tidymodels.
Here's an example using the built-in mpg dataset.
library(tidymodels)
library(tidyverse)
# Use mpg dataset
df <- mpg
# Create an indicator variable for class="suv"
df$is_suv <- as.factor(df$class == "suv")
# Create the split object
df_split <- initial_split(df, prop=1/2)
# Create the training and testing sets
df_train <- training(df_split)
df_test <- testing(df_split)
# Create workflow
rec <-
recipe(is_suv ~ cty + hwy + cyl, data=df_train)
glm_spec <-
logistic_reg() %>%
set_engine(engine = "glm")
glm_wflow <-
workflow() %>%
add_recipe(rec) %>%
add_model(glm_spec)
# Fit the model
model1 <- fit(glm_wflow, df_train)
# Attach predictions to training dataset
training_results <- bind_cols(df_train, predict(model1, df_train))
# Calculate accuracy
accuracy(training_results, truth = is_suv, estimate = .pred_class)
# Calculate AUC??
roc_auc(training_results, truth = is_suv, estimate = .pred_class)
The last line returns this error:
> roc_auc(training_results, truth = is_suv, estimate = .pred_class)
Error in metric_summarizer(metric_nm = "roc_auc", metric_fn = roc_auc_vec, :
formal argument "estimate" matched by multiple actual arguments
Since you are doing binary classification, roc_auc() is expecting a vector of class probabilities corresponding to the "relevant" class, not the predicted class.
You can get this using predict(model1, df_train, type = "prob"). Alternatively, if you are using workflows version 0.2.2 or newer you can use the augment() to get class predictions and probabilities without using bind_cols().
library(tidymodels)
library(tidyverse)
# Use mpg dataset
df <- mpg
# Create an indicator variable for class="suv"
df$is_suv <- as.factor(df$class == "suv")
# Create the split object
df_split <- initial_split(df, prop=1/2)
# Create the training and testing sets
df_train <- training(df_split)
df_test <- testing(df_split)
# Create workflow
rec <-
recipe(is_suv ~ cty + hwy + cyl, data=df_train)
glm_spec <-
logistic_reg() %>%
set_engine(engine = "glm")
glm_wflow <-
workflow() %>%
add_recipe(rec) %>%
add_model(glm_spec)
# Fit the model
model1 <- fit(glm_wflow, df_train)
# Attach predictions to training dataset
training_results <- augment(model1, df_train)
# Calculate accuracy
accuracy(training_results, truth = is_suv, estimate = .pred_class)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.795
# Calculate AUC
roc_auc(training_results, truth = is_suv, estimate = .pred_FALSE)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 roc_auc binary 0.879
Created on 2021-04-12 by the reprex package (v1.0.0)