tidymodels: ranger with cross validation - r

The dataset can be found here:
https://www.kaggle.com/mlg-ulb/creditcardfraud
I am trying to use tidymodels to run ranger with 5 fold cross validation on this dataset.
I have 2 code blocks. The first code block is the original code with the full data. The second code block is almost identical to the first, except that I have subset a portion of the data so the code runs faster; it is just to make sure my code works before I run it on the original dataset.
Here is the first code block with the full data:
#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
#load data
df <- read.csv("~creditcard.csv")
#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)
#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)
#ranger model with 5-fold cross validation
rf_spec <-
  rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
all_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)
cv_folds <- vfold_cv(df_train, v = 5)
cv_folds
rf_results <-
  all_wf %>%
  fit_resamples(resamples = cv_folds)
rf_results %>%
  collect_metrics()
Here is the second code block with 1,000 rows:
#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
#load data
df <- read.csv("~creditcard.csv")
###################################################################################
#Testing area#
df <- df %>% arrange(-Class) %>% head(1000)
###################################################################################
#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)
#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)
#ranger model with 5-fold cross validation
rf_spec <-
  rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
all_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)
cv_folds <- vfold_cv(df_train, v = 5)
cv_folds
rf_results <-
  all_wf %>%
  fit_resamples(resamples = cv_folds)
rf_results %>%
  collect_metrics()
1) With the first code block, I can assign and print cv_folds in the console. The Global Environment pane says cv_folds has 5 obs. of 2 variables. When I View(cv_folds), I see columns labeled splits and id, but no rows and no data. When I use str(cv_folds), R just hangs ("thinking"), and there is no red STOP icon I can press; the only thing I can do is force quit RStudio. Maybe I just need to wait longer? I am not sure. When I do the same thing with the smaller second code block, str() works fine.
2) My overall goal for this project is to split the dataset into training and testing sets, partition the training data with 5-fold cross validation, and train a ranger model on it. Next, I want to examine the metrics of my model on the training data, then test my model on the testing set and view the metrics. Eventually, I want to swap out ranger for something like xgboost. Please give me advice on what parts of my code I can add/modify to improve. I am still missing the portion that tests my model on the testing set.
I think the Predictions portion of this article might be what I'm aiming for.
https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/
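Based on that article, I think the missing test-set piece might look something like this (an untested sketch, reusing the all_wf workflow and df_split objects from above):
# fit on the training portion of df_split and evaluate once on the testing portion
final_fit <- last_fit(all_wf, df_split)
collect_metrics(final_fit)      # test-set metrics
collect_predictions(final_fit)  # test-set predictions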
3) When I use rf_results %>% collect_metrics(), it only shows accuracy and roc_auc. How do I get sensitivity, specificity, precision, and recall?
4) How do I plot importance? Would I use something like this?
rf_fit <- get_tree_fit(all_wf)
vip::vip(rf_fit, geom = "point")
5) How can I drastically reduce the amount of time it takes to train the model? Last time I ran ranger with 5-fold cross validation using caret on this dataset, it took 8+ hours (6 cores, 4.0 GHz, 16 GB RAM, SSD, GTX 1060). I am open to anything (i.e. restructuring the code, AWS computing, parallelization, etc.).
Edit: This is another way I have tried to set this up
#ranger model with 5-fold cross validation
rf_recipe <- recipe(Class ~ ., data = df_train)
rf_engine <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
rf_grid <- grid_random(
  mtry() %>% range_set(c(1, 20)),
  trees() %>% range_set(c(500, 1000)),
  min_n() %>% range_set(c(2, 10)),
  size = 30)
all_wf <-
  workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_engine)
cv_folds <- vfold_cv(df_train, v = 5)
cv_folds
#####
rf_fit <- tune_grid(
  all_wf,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = metric_set(roc_auc),
  control = control_grid(save_pred = TRUE)
)
collect_metrics(rf_fit)
rf_fit_best <- select_best(rf_fit)
(wf_rf_best <- finalize_workflow(all_wf, rf_fit_best))

I started with your last block of code and made some edits to get a functional workflow. I have answered your questions along the way, and I have taken the liberty of giving you some advice and reformatting your code.
## Packages, seed and data
library(tidyverse)
library(tidymodels)
set.seed(123)
df <- read_csv("creditcard.csv")
df <-
  df %>%
  arrange(-Class) %>%
  head(1000) %>%
  mutate(Class = as_factor(Class))
## Modelling
# Initial split
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
You can see that df_split returns <750/250/1000> (see below).
2) To tune the xgboost model, you have very little to change.
# Models
model_rf <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
model_xgboost <-
  boost_tree(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
Here you choose your hyperparameter grid. I advise you to use a non-random grid to cover the hyperparameter space in a more optimal way.
# Grid of hyperparameters
grid_rf <-
  grid_max_entropy(
    mtry(range = c(1, 20)),
    trees(range = c(500, 1000)),
    min_n(range = c(2, 10)),
    size = 30)
These are your workflows; as you can see, there is virtually nothing to change.
# Workflow
wkfl_rf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(model_rf)
wkfl_xgboost <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(model_xgboost)
1) <600/150/750> means that you have 600 observations in the training (analysis) set of a fold, 150 in the validation (assessment) set, and a total of 750 observations in the data being resampled. Please note that, here, 600 + 150 = 750, but this is not always the case (e.g. with bootstrap resampling methods).
# Cross validation method
cv_folds <- vfold_cv(df_train, v = 5)
cv_folds
3) Here you choose which metrics you want to collect during tuning, using the yardstick package.
# Choose metrics
my_metrics <- metric_set(roc_auc, accuracy, sens, spec, precision, recall)
Then you can fit the different models across the grid. For the control parameters, I would not save predictions, and I would print progress (imho).
# Tuning
rf_fit <- tune_grid(
  wkfl_rf,
  resamples = cv_folds,
  grid = grid_rf,
  metrics = my_metrics,
  control = control_grid(verbose = TRUE) # don't save predictions (imho)
)
These are some useful functions for working with the rf_fit object.
# Inspect tuning
rf_fit
collect_metrics(rf_fit)
autoplot(rf_fit, metric = "accuracy")
show_best(rf_fit, metric = "accuracy", maximize = TRUE)
select_best(rf_fit, metric = "accuracy", maximize = TRUE)
Finally, you can fit your model with the best parameters.
# Fit best model
tuned_model <-
  wkfl_rf %>%
  finalize_workflow(select_best(rf_fit, metric = "accuracy", maximize = TRUE)) %>%
  fit(data = df_train)
predict(tuned_model, df_train)
predict(tuned_model, df_test)
4) Unfortunately, methods for dealing with randomForest objects are usually not available for parsnip outputs.
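That said, because importance = "impurity" is set in the ranger engine, one thing that may work (see also the importance question further down) is pulling the underlying parsnip fit out of the fitted workflow and passing it to vip; an untested sketch:
library(vip)
tuned_model %>%
  pull_workflow_fit() %>%  # extract_fit_parsnip() in newer versions of workflows
  vip(geom = "point")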
5) You can have a look at the vignette about parallelization.
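For example, with a doParallel backend (a sketch; tune should pick up a registered foreach backend automatically when you call tune_grid() or fit_resamples()):
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores() - 1)  # leave one core free
registerDoParallel(cl)
# ... run tune_grid() / fit_resamples() as above ...
stopCluster(cl)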

Related

How to Get Variable/Feature Importance From Tidymodels ranger object?

I have a ranger object from the tidymodels rand_forest function:
rf <- rand_forest(mode = "regression", trees = 1000) %>% fit(pay_rate ~ age+profession)
I want to get the feature importance of each variable (I have many more than in this example). I've tried things like rf$variable.importance, or importance(rf), but the former returns NULL and the latter function doesn't exist. I tried using the vip package, but that doesn't work for a ranger object. How can I extract feature importances from this object?
You need to add importance = "impurity" when you set the engine for ranger. This will provide variable importance scores. Once this is set, you can use extract_fit_parsnip with vip to plot the variable importance.
small example:
library(tidymodels)
library(vip)
rf_mod <- rand_forest(mode = "regression", trees = 100) %>%
  set_engine("ranger", importance = "impurity")
rf_recipe <-
  recipe(mpg ~ ., data = mtcars)
rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe)
rf_workflow %>%
  fit(mtcars) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
More information is available in the tidymodels Get Started guide.

All values of AUC ROC Curve 1 using tidymodels

I am trying to fit a LASSO model with a binary outcome using tidymodels. I have essentially copied the case study from the tidymodels webpage (https://www.tidymodels.org/start/case-study/, the hotel stays dataset) and applied it to my own data, but for some reason all of the values of my area under the ROC curve are coming out at 1 (as you can see from the graph below). The only thing I have changed is the recipe (to try and suit my data):
recipe(outcome ~ ., data = df_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_medianimpute(all_predictors())
So I don't know if my recipe is incorrect or my data is not suitable for some reason. As mentioned, I have a binary outcome and 68 predictors (59 factors and 9 numeric); some do have missing data, but I thought that step_medianimpute would deal with that. Many thanks for any help anyone can offer.
[Figure: my AUC ROC curve, with all values equal to 1]
Without seeing the data it is hard to know for sure, but your results indicate a couple of things.
Firstly, the ROC AUC of 1. A ROC AUC of 1 for a binary classification model indicates that the model is able to separate the two classes perfectly. This could either be a case of overfitting or mean that you simply have linearly separable classes.
Secondly, the constant metric value for different values of penalty. For a LASSO model, as the penalty increases, more and more variables are shrunk to zero. In your case, for all the values of the penalty (if you are following the post, 10^(-4) through 10^(-1)) you are seeing the same performance. That means that even with a penalty of 10^(-1) you still haven't shrunk enough predictors to hurt/change the performance. Reprex below:
set.seed(1234)
library(tidymodels)
response <- rep(c(0, 10), length.out = 1000)
data <- bind_cols(
  response = factor(response),
  map_dfc(seq_len(50), ~ rnorm(1000, response))
)
data_split <- initial_split(data)
data_train <- training(data_split)
data_test <- testing(data_split)
lasso_spec <- logistic_reg(mixture = 1, penalty = tune()) %>%
  set_engine("glmnet")
lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_formula(response ~ .)
data_folds <- vfold_cv(data_train)
param_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
tune_res <- tune_grid(
  lasso_wf,
  resamples = data_folds,
  grid = param_grid
)
autoplot(tune_res)
What you can do is expand the range of penalties until the performance changes. Below we see that once the penalty is high enough, the last important predictors get shrunk to zero and we lose performance.
param_grid <- tibble(penalty = 10^seq(-1, 0, length.out = 30))
tune_res <- tune_grid(
  lasso_wf,
  resamples = data_folds,
  grid = param_grid
)
autoplot(tune_res)
To verify, we fit the model using one of the good performance penalties and we get perfect predictions.
lasso_final <- finalize_workflow(lasso_wf, select_best(tune_res))
lasso_final_fit <- fit(lasso_final, data = data_train)
augment(lasso_final_fit, new_data = data_train) %>%
  conf_mat(truth = response, estimate = .pred_class)
#>           Truth
#> Prediction   0  10
#>         0  375   0
#>         10   0 375
Created on 2021-05-08 by the reprex package (v2.0.0)

How to get beta estimates when I convert from R caret to tidymodels

I don't see an easy way to get the parameter estimates out of a crossvalidated model using the tidymodels ecosystem. How do I do it?
With caret I can do a cross validated model and get the parameter estimates like this:
library(caret)
library(tidyverse)
library(tidymodels)
data(ames)
set.seed(123)
mod_wo_Garage_Cars <- train(
  Sale_Price ~ .,
  data = select(ames, -Garage_Cars),
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
summary(mod_wo_Garage_Cars) %>%
  broom::tidy() %>%
  filter(term == "Garage_Area")
I have a workflow that I think does the same modeling (give or take differences in train() vs. vfold_cv() resamples):
library(tidyverse)
library(tidymodels)
data(ames)
set.seed(123)
folds <- vfold_cv(ames, v = 10)
the_rec <-
  recipe(Sale_Price ~ ., data = ames) %>%
  step_rm(Garage_Cars)
the_lm_model <-
  linear_reg() %>%
  set_engine("lm")
the_workflow <-
  workflow() %>%
  add_recipe(the_rec) %>%
  add_model(the_lm_model)
mod_wo_Garage_Cars <-
  fit_resamples(the_workflow, folds)
I can see how to get the RMSE with show_best(mod_wo_Garage_Cars, metric = "rmse"). How do I get the overall model estimates of the betas out of this workflow?
You need to pull out the coefficients from your fitted model and then tidy it.
best_rmse <- the_workflow %>%
  fit_resamples(folds) %>%
  select_best('rmse')
the_workflow %>%
  finalize_workflow(best_rmse) %>%
  fit(ames) %>%
  pull_workflow_fit() %>%
  tidy()

How to incorporate tidymodels PCA into the workflow of a model and make predictions

I am trying to incorporate a tidymodels PCA step into the workflow of a model. I want to have a predictive model that uses PCA as a preprocessing step and then make predictions with that model.
I have tried the following approach:
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <- vfold_cv(diamonds_train)
diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = T) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())
preprocesados <- prep(diamonds_recipe)
linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")
pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)
lr_fitted_workflow <- pca_workflow %>% # option A: workflow, full dataset
  last_fit(diamonds_split)
performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes column.”
Following other tutorials, I tried this other approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:
pca_fit <- juice(preprocesados) %>% # option C: no workflow at all
  lm(price ~ ., data = .)
prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
  select(price)
ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans), ans, truths)
ggplot(data = tib) +
  geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
  geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it prints something that seems reasonable, but by this point I have lost confidence and some guidance would be much appreciated. :D
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R, the function for fitting your model is fit(), and for it to work you'll have to provide the data for the fitting process (here the diamonds training data). The trade-off is that you don't have to prep() your recipe, as the workflow will take care of this itself.
So reducing your code slightly, the example below will work.
library(tidymodels)
data(diamonds)
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = T) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())
linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")
pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)
pca_fit <- fit(pca_workflow, data = diamonds_train)
As for crossvalidation, one has to use fit_resamples and should resample the training set, not the testing set. But here I am currently getting the same error (my answer will be updated if I figure out why).
Edit
Now I've done a bit of digging, and the problem with crossvalidation stems from the engine being glmnet. I am guessing that, of the many different aspects, this one has somehow been missed. I've filed a possible issue on the workflows package GitHub site. Often the answers are quick in coming, so one of the developers will likely reply soon.
As for crossvalidation, if you instead fit using any of the other engines described in ?linear_reg, then we could do this as:
linear_model_base <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
pca_workflow <- update_model(pca_workflow, linear_model_base)
folds <- vfold_cv(diamonds_train, 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)
and, in the case where metrics are of interest, these can indeed be collected as you did using collect_metrics:
pca_folds_fit %>% collect_metrics()
If we are interested in the predictions, you'll have to tell the model that you want to save these during the fitting process and then use collect_predictions:
pca_folds_fit <- fit_resamples(
  pca_workflow,
  resamples = folds,
  control = control_resamples(save_pred = TRUE)
)
collect_predictions(pca_folds_fit)
Note, however, that the output from this is the predictions from each fold, as you are literally fitting 10 models.
Usually, crossvalidation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The best model based on crossvalidation performance (collect_metrics) would then be selected, and the test dataset would be used to evaluate this model's accuracy.
This is all described in TMwR chapters 10 & 11.
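For that final evaluation, a sketch of what it could look like with last_fit(), using diamonds_split and the (lm-based) pca_workflow from above. Note that because step_log() on the outcome uses skip = TRUE, the test-set price stays on the original scale, so you may prefer to log the outcome before splitting instead:
final_res <- last_fit(pca_workflow, diamonds_split)  # refits on the training set, evaluates on the test set
collect_metrics(final_res)
collect_predictions(final_res)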

Create a list column with just one item in it (no group by)

Here is a workflow that trains an XGB model using tidyr list columns, rsample folds and purrr map:
library(rsample)
library(xgboost)
library(Metrics)
# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)
train_cv <- vfold_cv(training_data, 5) %>%
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))
# xgb across each fold
mod.xgb <- train_cv %>%
  # convert regression data to a dmatrix for xgb. Just simple price ~ carat for here and now
  mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price)),
         validate_dmatrix = map(validate, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price))) %>%
  mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>% # fit the model
  mutate(predictions = map2(.x = regression, .y = validate_dmatrix, ~predict(.x, .y))) %>% # predictions
  mutate(validation_actuals = map(validate, ~.x$price)) %>% # get the actual prices for computing evaluation metrics
  mutate(mae = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::mae(actual = .x, predicted = .y))) %>% # mae
  mutate(rmse = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::rmse(actual = .x, predicted = .y))) # rmse
My actual script and data use crossing() and other models with their own hyperparameters in order to pick the best model. So the real block the above is based on actually contains several models, which allows me to compare them.
I like this workflow because using dplyr verbs and the pipe operator, I can make changes as needed while progressing through each step, then apply them to each fold using map functions.
Now that I'm past the cross validation phase and at the test phase, I'd like to emulate that 'flow', except that I do not have folds, so there is no need for map_* functions.
However, I still need to make transformations such as the one above, adding an xgb.DMatrix, since I am using xgboost.
For example, below is what I created to test my chosen xgb model:
library(rsample)
library(xgboost)
library(Metrics)
# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)
# create xgb.DMatrix'
training_data_xgb_matrix <- xgb.DMatrix(training_data %>% select(-price) %>% as.matrix(), label = training_data$price)
test_data_xgb_matrix <- xgb.DMatrix(testing_data %>% select(-price) %>% as.matrix(), label = testing_data$price)
# create a regression
model_xgb <- xgboost(training_data_xgb_matrix, nrounds = 100, objective = "reg:squarederror")
# predict on test data
xgb_predictions <- predict(model_xgb, test_data_xgb_matrix)
# evaluate using rmse
test_rmse <- rmse(actual = testing_data$price, predicted = xgb_predictions)
test_rmse
# 1370.185
So, that is doing it step by step. My question is: can I somehow do this in a way similar to the approach I used above during cross validation, particularly by just adding a new column to an existing df / list column?
What is the 'tidy' way of evaluating a model on test data? Is it possible to start with training_data, append the test data in a new column, and start a workflow to reach the same result, with rmse in its own column following a call to mutate()?
training_data %>%
(add test data in a new column) %>%
mutate(convert training data to a xgb.DMatrix) %>%
mutate(convert test data to a xgb.DMatrix) %>%
mutate(fit a regression model based on the training data xgb.DMatrix) %>%
mutate(predict with the regression model on test data xgb.DMatrix) %>%
mutate(calculate rmse)
Is this possible?
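For what it's worth, one way I imagine this could look is wrapping the single train/test pair in a one-row tibble with list columns (an untested sketch, reusing training_data and testing_data from above):
library(dplyr)
library(purrr)
library(tibble)
final_eval <-
  tibble(train = list(training_data), test = list(testing_data)) %>%  # one row, two list columns
  mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(-price) %>% as.matrix(), label = .x$price)),
         test_dmatrix  = map(test,  ~xgb.DMatrix(.x %>% select(-price) %>% as.matrix(), label = .x$price))) %>%
  mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>%
  mutate(predictions = map2(regression, test_dmatrix, ~predict(.x, .y))) %>%
  mutate(rmse = map2_dbl(test, predictions, ~Metrics::rmse(actual = .x$price, predicted = .y)))
final_eval$rmse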
