Create SHAP plots for tidymodel objects

This question refers to Obtaining summary shap plot for catboost model with tidymodels in R. Given the comment below the question, the OP found a solution but has not yet shared it with the community.
I want to analyze my tree ensembles fitted with the tidymodels package using SHAP value plots: force plots for single observations and summary plots of the effect of all features in my dataset.
DALEXtra provides a function to create SHAP values for tidymodels, explain_tidymodels(). force_plot() from the fastshap package provides a wrapper for the plot function of the underlying Python package shap. But I can't understand how to make that function work with the output of explain_tidymodels().
Question: How can one generate such SHAP plots in R using tidymodels and explain_tidymodels()?
MWE (for SHAP values with explain_tidymodels())
library(MASS)
library(tidyverse)
library(tidymodels)
library(parsnip)
library(treesnip)
library(catboost)
library(fastshap)
library(DALEXtra)
set.seed(1337)
rec <- recipe(crim ~ ., data = Boston)
split <- initial_split(Boston)
train_data <- training(split)
test_data <- testing(split) %>% dplyr::select(-crim) %>% as.matrix()
model_default <-
  parsnip::boost_tree(mode = "regression") %>%
  set_engine(engine = 'catboost', loss_function = 'RMSE')
# Sometimes catboost is not loaded correctly; the following two lines
# prevent fitting errors.
# https://github.com/curso-r/treesnip/issues/21 (the error is mentioned in the last post)
set_dependency("boost_tree", eng = "catboost", "catboost")
set_dependency("boost_tree", eng = "catboost", "treesnip")
model_fit_wf <- workflow() %>%
  add_model(model_default) %>%
  add_recipe(rec) %>%
  {parsnip::fit(object = ., data = train_data)}
SHAP_wf <- explain_tidymodels(model_fit_wf,
                              data = train_data %>% dplyr::select(-crim),
                              y = train_data$crim,
                              new_data = test_data)

Perhaps this will help. At the very least, it is a step in the right direction.
First, ensure you have fastshap and reticulate installed (e.g., install.packages("...")). Next, set up a virtual environment and install shap (pip install ...). Also install matplotlib 3.2.2 for the dependence plots (check out the GitHub issues on this -- an older version of matplotlib is necessary).
RStudio has great information on virtual environment setup. That said, virtual environment setup requires more or less troubleshooting depending on the IDE in use. (Sadly, some work settings restrict the use of open-source RStudio due to licensing.)
Docs for library(fastshap) are also helpful on this front.
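For example, a minimal environment-setup sketch with reticulate (the environment name "r-shap" is just an illustrative choice):
library(reticulate)
# Create a dedicated virtual environment and install the Python packages
virtualenv_create("r-shap")
virtualenv_install("r-shap", packages = c("shap", "matplotlib==3.2.2"))
# Point this R session at that environment before importing shap
use_virtualenv("r-shap", required = TRUE)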
Here's a workflow for lightgbm (from treesnip docs, lightly modified).
library(tidymodels)
library(treesnip)

data("diamonds", package = "ggplot2")
diamonds <- diamonds %>% sample_n(1000)

# vfold resamples
diamonds_splits <- vfold_cv(diamonds, v = 5)

# model specs
model_spec <- boost_tree(mtry = 5, trees = 500) %>% set_mode("regression")
lightgbm_model <- model_spec %>%
  set_engine("lightgbm", nthread = 6)

# workflows
lightgbm_wf <- workflow() %>%
  add_model(lightgbm_model)

rec_ordered <- recipe(price ~ ., data = diamonds)

lightgbm_fit_ordered <- fit_resamples(
  add_recipe(lightgbm_wf, rec_ordered),
  resamples = diamonds_splits
)
Prior to prediction, we want to fit our workflow:
fit_workflow <- lightgbm_wf %>%
  add_recipe(rec_ordered) %>%
  fit(data = diamonds)
Now we have a fitted workflow and can predict. To use the fastshap::explain function, we need to create a prediction wrapper (this doesn't always hold: depending on the engine used, it may or may not work out of the box -- see the docs).
predict_function_gbm <- function(model, newdata) {
  predict(model, newdata) %>% pluck(1)
}
Let's get the mean prediction value (used below) while we're at it. This also serves as a check that the function works.
mean_preds <- mean(
  predict_function_gbm(fit_workflow, diamonds %>% select(-price))
)
Now we create our explanations (SHAP values). Note the pred_wrapper and X arguments here (see the fastshap GitHub issues for other examples -- e.g., glmnet).
fastshap::explain(
  fit_workflow,
  X = as.data.frame(diamonds %>% select(-price)),
  pred_wrapper = predict_function_gbm,
  nsim = 10
) -> explanations_gbm
This should produce a force plot.
fastshap::force_plot(
  object = explanations_gbm[1, ],
  feature_values = as.data.frame(diamonds %>% select(-price))[1, ],
  display = "viewer",
  baseline = mean_preds)
This allows multiple observations, vertically stacked:
fastshap::force_plot(
  object = explanations_gbm[1:20, ],
  feature_values = as.data.frame(diamonds %>% select(-price))[1:20, ],
  display = "viewer",
  baseline = mean_preds)
Add link = "logit" for classification. Change display to "html" for Rmarkdown rendering.
Now for the summary and dependence plots.
The trick is using reticulate to access the functions directly. Note that the same logic holds for libraries like transformers, numpy, etc.
First, the dependence plot.
library(reticulate)
shap <- import("shap")
np <- import("numpy")

shap$dependence_plot(
  "rank(3)",
  data.matrix(explanations_gbm),
  data.matrix(diamonds %>% select(-price))
)
See the shap docs for an explanation of "rank(3)" -- "rank(1)", etc., will also work.
Unfortunately, it threw an error when I attempted naming the feature directly (e.g., "cut").
Now for the summary plot:
shap$summary_plot(
  data.matrix(explanations_gbm),
  data.matrix(diamonds %>% select(-price))
)
Final note: rendering the plot repeatedly will produce buggy visualizations. Hopefully this provides a point of departure for catboost visualizations.

Related

Monitoring stacked ensemble models with vetiver

I developed a stacked ensemble model using the tidymodels workflow and I want to monitor the performance of this model from time to time using vetiver. However, it seems the stacked model object isn't supported yet.
Please see the code snippet below:
library(tidymodels)
library(vetiver)
library(pins)
library(arrow)
library(tidyverse)
library(bonsai)
library(stacks)
library(lubridate)
library(magrittr)
b <- board_folder(path = "pins-r/")

model <- vetiver_pin_read(board = b, name = "dcp_ibese_truck_arrival",
                          version = "20230110T094207Z-69661")

trips <- read_parquet("../IbeseLivePosition/ml_data/data_to_monitor_model.parquet")

trips %<>%
  mutate(Date = as.Date(DateTimeReceived))

original_metrics <-
  vetiver::augment(model, new_data = trips)
Error: No augment method for objects of class butchered_linear_stack

Problem: `.x` is empty in pammtools package

I am trying to replicate the example code in Bender and Scheipl for Piece-wise exponential Additive Mixed modeling tools (pammtools), specifically a survival exercise with time-varying effects.
https://arxiv.org/pdf/1806.01042.pdf
library(dplyr); library(tidyr); library(purrr); library(ggplot2)
library(survival); library(mgcv); library(pammtools)
data("pbc", package="survival")
# event time information
pbc <- pbc %>%
  filter(id <= 312) %>%
  mutate(status = ifelse(status == 0, 0, 1)) %>%
  select(id:status, trt:sex, bili, protime)
pbc %>% slice(1:6)
pbc_ped <- as_ped(
  data = list(pbc, pbcseq),
  formula = Surv(pbc$time, pbc$status) ~ sex | concurrent(bili, protime, tz_var = "day"),
  id = "id")
I always get the error:
Error: `.x` is empty, and no `.init` supplied
I installed and checked Rtools, and I tried different (older) versions of purrr, which is sometimes related to this error. I also tried to run the code on https://rdrr.io/snippets/.
Any ideas? Thank you very much.
You have not used the code in that vignette. And you added pbc$ to the arguments in Surv(), a common mistake but generally not a productive strategy.
# Need to narrow the material from pbcseq
pbcseq <- pbcseq %>% select(id, day, bili, protime)
# I would have given it a different name

#------ Error when using "|" rather than "+"
pbc_ped <- as_ped(
  data = list(pbc, pbcseq),
  formula = Surv(time, status) ~ sex | concurrent(bili, protime, tz_var = "day"),
  id = "id")
# Error: `.x` is empty, and no `.init` supplied
#________________

pbc_ped <- as_ped(
  data = list(pbc, pbcseq),
  formula = Surv(time, status) ~ sex + concurrent(bili, protime, tz_var = "day"),
  id = "id")  # No error
I think there may be an error in the vignette. I don't see any examples using the construct ...
Surv(time, status) ~ variates | special(.)
They all use a "+" sign for adding the time-dependent covariates. If you go to https://adibender.github.io/pammtools//articles/data-transformation.html you see them using a "+" rather than a "|". I think there is some sloppiness in that package's documentation. But your additions only made the problem worse.

R Tidymodels: What objects to save for use in production after fitting a recipe-based workflow utilizing pre-processing?

After designing a Tidymodels recipe-based workflow, which is tuned then fitted to some training data, I'm not clear what objects (fitted "workflow", "recipe", ..etc) should be saved to disk for use in predicting new data in production. I understand I can use saveRDS()/readRDS(), write_rds()/read_rds(), or other options to actually do the saving/loading of these objects, but which ones?
In a clean R environment I will have incoming new raw data which will need to be pre-processed using the "recipe" I used in training the model. I then want to make predictions based on that data after it has been pre-processed. If I intend to use the prep() and bake() functions to pre-process the new data as I did the training data, then it seems I will minimally need the recipe and the original training data to get prep() to work. Plus, I also need the fitted model/workflow to make predictions. So three objects, it seems. If I save the workflow object to disk in SESSION 1, then I have the ability to extract the recipe and model from it in SESSION 2 with pull_workflow_prepped_recipe() and pull_workflow_fit() respectively. But prep() seems to require the original training data, which I can keep in the workflow with an earlier use of retain = TRUE... but then that gets stripped out of the workflow after a call to fit(). Hear my cries for help! :)
So, imagine two different R sessions, where the first session I am doing all the training and model building, and the second session is some running production app that uses what was learned from the first session. I need help at the arrows in the bottom of SESSION1, and in multiple places in SESSION 2. I used the Tidymodels Get Started as the base for this example.
SESSION 1
library(tidymodels)
library(nycflights13)
library(readr)
set.seed(123)
flight_data <-
  head(flights, 500) %>%
  mutate(
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    date = as.Date(time_hour)
  ) %>%
  inner_join(weather, by = c("origin", "time_hour")) %>%
  select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
  na.omit() %>%
  mutate_if(is.character, as.factor)

set.seed(555)
data_split <- initial_split(flight_data, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)

flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())

lr_mod <-
  logistic_reg() %>%
  set_engine("glm")

flights_wflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(flights_rec)

flights_fit <-
  flights_wflow %>%
  fit(data = train_data)

predict(flights_fit, test_data)

### SAVE ONE OR MORE OBJECTS HERE FOR NEXT SESSION <------------
# What to save? workflow (pre or post fit()?), recipe, training data...etc.
write_rds(flights_wflow, "flights_wflow.rds") # Not fitted workflow
write_rds(flights_fit, "flights_fit.rds") # Fitted workflow
SESSION 2
### READ ONE OR MORE OBJECTS HERE FROM PRIOR SESSION <------------
flights_wflow <- read_rds("flights_wflow.rds")
flights_fit <- read_rds("flights_fit.rds")
# Acquire new data, do some basic transforms as before
new_flight_data <-
  tail(flights, 500) %>%
  mutate(
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    date = as.Date(time_hour)
  ) %>%
  inner_join(weather, by = c("origin", "time_hour")) %>%
  select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
  na.omit() %>%
  mutate_if(is.character, as.factor)

# Something here to preprocess the data with recipe as in SESSION 1 <----------
# new_flight_data_prep <- prep(??)
# new_flight_data_preprocessed <- bake(??)

# Predict new data
predict(flights_fit, new_data = new_flight_data_preprocessed)
You have some flexibility in how you approach this, depending on your constraints, but generally I would recommend saving/serializing the fitted workflow, perhaps after using butcher to reduce its size. You can see an example model fitting script in this repo that shows at the end how I save the fitted workflow.
When you go to predict with this workflow, there are some things to keep in mind. I have an example Plumber API in the same repo that demonstrates what is needed to predict for that particular workflow. Notice how the packages needed for prediction are loaded/attached for this API. I didn't use all of tidymodels, but instead only the specific packages I need, for better performance and a smaller container.
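As a rough sketch of that approach, using the objects from the question above (the file name is just illustrative; note that a fitted workflow carries its prepped recipe, so new raw data can be passed straight to predict() without a separate prep()/bake() step):
library(butcher)

# SESSION 1: shrink the fitted workflow and write it to disk
flights_fit_small <- butcher(flights_fit)
readr::write_rds(flights_fit_small, "flights_fit.rds")

# SESSION 2: read it back and predict on new raw data;
# the workflow applies the recipe's preprocessing internally
flights_fit <- readr::read_rds("flights_fit.rds")
predict(flights_fit, new_data = new_flight_data)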
Saving the fitted workflow did not work for me. When trying to predict with new data, it asks for the target variable (this is a churn model):
predict(churn_model, the_data)
Error: Problem with `mutate()` column `churn`.
i `churn = dplyr::if_else(churn == 1, "yes", "no")`.
x object 'churn' not found
I still don't get why it is asking for a column that should not be present in the data, as it is the variable I am trying to predict...
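One likely cause, assuming the recipe contains a step that recodes the outcome (the recipe itself isn't shown): recipe steps are re-applied to new data at prediction time unless they are marked with skip = TRUE, so a step that mutates churn will fail when churn is absent. A hypothetical sketch:
# Hypothetical recipe excerpt; `churn_train` and the recoding step are assumed.
# Marking the step with skip = TRUE keeps it from being baked on new data.
churn_rec <- recipe(churn ~ ., data = churn_train) %>%
  step_mutate(churn = dplyr::if_else(churn == 1, "yes", "no"), skip = TRUE)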

How to save a parsnip model fit (from ranger)?

I have a parsnip model (from ranger), roughly from here:
# install.packages("tidymodels")
data(cells, package = "modeldata")
rf_mod <-
  rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")

set.seed(123)
cell_split <- initial_split(cells %>% select(-case), strata = class)
cell_train <- training(cell_split)

rf_fit <-
  rf_mod %>%
  fit(class ~ ., data = cell_train)

> class(rf_fit)
[1] "_ranger"   "model_fit"
How do I save it to disk so that I can load it at a later time?
I tried dput, and that gets an error:
dput(rf_fit, file="rf_fit.R")
rf_fit2 <- dget("rf_fit.R")
Error in missing_arg() : could not find function "missing_arg"
It's true, the model_fit.R file has a couple of missing_arg calls in it, which appears to be some sort of way to mark missing args. However, that's a side line. I don't need to use dput, I just want to be able to save and load a model.
Try this option. The save() and load() functions allow you to store the model and then invoke it again. Here is the code:
data(cells, package = "modeldata")

rf_mod <-
  rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")

set.seed(123)
cell_split <- initial_split(cells %>% select(-case), strata = class)
cell_train <- training(cell_split)

rf_fit <-
  rf_mod %>%
  fit(class ~ ., data = cell_train)

# Export option
save(rf_fit, file = "Mymod.RData")
load("Mymod.RData")
The other option would be using saveRDS() to save the model and then readRDS() to load it, but the result needs to be assigned to an object:
#Export option 2
saveRDS(rf_fit, file = "Mymod.rds")
# Restore the object
rf_fit <- readRDS(file = "Mymod.rds")
For others that may come across this post in the future:
Some model fits in R, including several that parsnip supports as a modeling backend, require native serialization methods to be saved and reloaded in a new R session properly. saveRDS() and readRDS() will do the trick most of the time, though they fall short for model objects from packages that require native serialization.
The folks from the tidymodels team put together a new package, bundle, to provide a consistent interface for native serialization of model objects. The bundle() verb prepares a model object for serialization, and then you can safely saveRDS() + readRDS() and pass between R sessions as you wish, and then unbundle() in the new session. With a parsnip model fit mod:
mod_bundle <- bundle(mod)
saveRDS(mod_bundle, file = "path/to/file.rds")
# in a new R session:
mod_bundle <- readRDS("path/to/file.rds")
mod_new <- unbundle(mod_bundle)
As Duck mentioned, saveRDS() and readRDS() can be used to save/load any R object. Also save() & load() can be used for the same purpose. There are many online discussions/blogs comparing the two approaches.

feature_spec in TFdatasets with multiple response variables

I'm looking to predict a set of responses using a common set of features using the tensorflow package in R. I've worked through a couple of single response regression examples, and am attempting to modify that code to accommodate my data. Here I'm using the hearts dataset to demonstrate.
library(keras)
library(tensorflow)
library(tfdatasets)
library(tfprobability)
data(hearts)
head(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
#This works
spec <- feature_spec(hearts,
                     x = c(age, sex, cp, trestbps, chol, fbs, restecg,
                           exang, oldpeak, slope, ca, thal, thalach),
                     y = target)
#This doesn't
spec <- feature_spec(hearts,
                     x = c(age, sex, cp, trestbps, chol, fbs, restecg,
                           exang, oldpeak, slope, ca, thal, thalach),
                     y = c(target, thalach))
So, how can I pass multiple response variables to feature_spec? It seems like this is possible in Python from this post (https://towardsdatascience.com/bayesian-neural-networks-with-tensorflow-probability-fbce27d6ef6).
See the code chunk under 'Data Handling'. Is this just not an option in R?
