How to save a parsnip model fit (from ranger)?

I have a parsnip model (from ranger), roughly from here:
# install.packages("tidymodels")
library(tidymodels)
data(cells, package = "modeldata")
rf_mod <-
  rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")
set.seed(123)
cell_split <- initial_split(cells %>% select(-case), strata = class)
cell_train <- training(cell_split)
rf_fit <-
  rf_mod %>%
  fit(class ~ ., data = cell_train)
> class(rf_fit)
[1] "_ranger" "model_fit"
How do I save it to disk so that I can load it at a later time?
I tried dput, and that gets an error:
dput(rf_fit, file="rf_fit.R")
rf_fit2 <- dget("rf_fit.R")
Error in missing_arg() : could not find function "missing_arg"
It's true, the rf_fit.R file has a couple of missing_arg() calls in it, which appear to be some way of marking missing arguments. But that's a side issue: I don't need to use dput, I just want to be able to save and load a model.

Try this option: the save() and load() functions allow you to store the model and then invoke it again. Here is the code:
data(cells, package = "modeldata")
rf_mod <-
  rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")
set.seed(123)
cell_split <- initial_split(cells %>% select(-case), strata = class)
cell_train <- training(cell_split)
rf_fit <-
  rf_mod %>%
  fit(class ~ ., data = cell_train)
# Export option
save(rf_fit, file = 'Mymod.RData')
load('Mymod.RData')  # restores the object under its original name, rf_fit
The other option would be using saveRDS() to save the model and then readRDS() to load it, but the result needs to be assigned to an object:
# Export option 2
saveRDS(rf_fit, file = "Mymod.rds")
# Restore the object
rf_fit <- readRDS(file = "Mymod.rds")

For others that may come across this post in the future:
Some model fits in R, including several that parsnip supports as a modeling backend, require native serialization methods to be saved and reloaded in a new R session properly. saveRDS() and readRDS() will do the trick most of the time, though they fall short for model objects from packages that require native serialization.
The folks from the tidymodels team put together a new package, bundle, to provide a consistent interface for native serialization of model objects. The bundle() verb prepares a model object for serialization; you can then safely saveRDS() + readRDS() and pass the result between R sessions as you wish, and unbundle() in the new session. With a parsnip model fit mod:
library(bundle)
mod_bundle <- bundle(mod)
saveRDS(mod_bundle, file = "path/to/file.rds")
# in a new R session:
mod_bundle <- readRDS("path/to/file.rds")
mod_new <- unbundle(mod_bundle)

As Duck mentioned, saveRDS() and readRDS() can be used to save/load any R object, and save() and load() can be used for the same purpose. There are many online discussions/blogs comparing the two approaches.
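A short sketch of the practical difference (the file names here are just examples): readRDS() returns the object, so you pick the name at read time, while load() restores the object under the name it was saved with.
# saveRDS()/readRDS(): you choose the name when reading back
saveRDS(rf_fit, "rf_fit.rds")
my_fit <- readRDS("rf_fit.rds")
# save()/load(): the object reappears under its original name, rf_fit
save(rf_fit, file = "rf_fit.RData")
load("rf_fit.RData")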

Related

Monitoring stacked ensemble models with vetiver

I developed a stacked ensemble model using the tidymodels workflow and I want to monitor the performance of this model from time to time using vetiver. However, it seems the stacked model object isn't supported yet.
Please see the code snippet below:
library(tidymodels)
library(vetiver)
library(pins)
library(arrow)
library(tidyverse)
library(bonsai)
library(stacks)
library(lubridate)
library(magrittr)
b <- board_folder(path = "pins-r/")
model <- vetiver_pin_read(board = b, name = "dcp_ibese_truck_arrival",
                          version = "20230110T094207Z-69661")
trips <- read_parquet("../IbeseLivePosition/ml_data/data_to_monitor_model.parquet")
trips %<>%
  mutate(Date = as.Date(DateTimeReceived))
original_metrics <-
  vetiver::augment(model, new_data = trips)
Error: No augment method for objects of class butchered_linear_stack

Create SHAP plots for tidymodel objects

This question refers to Obtaining summary shap plot for catboost model with tidymodels in R. Judging by the comment below that question, the OP found a solution but has not shared it with the community so far.
I want to analyze my tree ensembles fitted with the tidymodels package with SHAP value plots: force plots for single observations, and summary plots of the effect of all features of my dataset.
DALEXtra provides a function to create SHAP values for tidymodels, explain_tidymodels(). force_plot() from the fastshap package provides a wrapper for the plot function of the underlying Python package shap. But I can't understand how to make the function work with the output of the explain_tidymodels() function.
Question: How can one generate such SHAP plots in R using tidymodels and explain_tidymodels()?
MWE (for SHAP values with explain_tidymodels)
library(MASS)
library(tidyverse)
library(tidymodels)
library(parsnip)
library(treesnip)
library(catboost)
library(fastshap)
library(DALEXtra)
set.seed(1337)
rec <- recipe(crim ~ ., data = Boston)
split <- initial_split(Boston)
train_data <- training(split)
test_data <- testing(split) %>% dplyr::select(-crim) %>% as.matrix()
model_default <-
  parsnip::boost_tree(
    mode = "regression"
  ) %>%
  set_engine(engine = 'catboost', loss_function = 'RMSE')
# Sometimes catboost is not loaded correctly; the following two lines
# prevent fitting errors
# https://github.com/curso-r/treesnip/issues/21 (the error is mentioned in the last post)
set_dependency("boost_tree", eng = "catboost", "catboost")
set_dependency("boost_tree", eng = "catboost", "treesnip")
model_fit_wf <- workflow() %>%
  add_model(model_default) %>%
  add_recipe(rec) %>%
  fit(data = train_data)
SHAP_wf <- explain_tidymodels(model_fit_wf,
                              data = dplyr::select(train_data, -crim),
                              y = train_data$crim)
Perhaps this will help. At the very least, it is a step in the right direction.
First, ensure you have fastshap and reticulate installed (i.e., install.packages("...")). Next, set up a virtual environment and install shap (pip install ...). Also, install matplotlib 3.2.2 for the dependency plots (check out GitHub issues on this -- an older version of matplotlib is necessary).
RStudio has great information on virtual environment setup. That said, virtual environment setup requires more or less troubleshooting depending on the IDE in use. (Sadly, some work settings restrict the use of open source RStudio due to licensing.)
Docs for library(fastshap) are also helpful on this front.
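For example, a minimal sketch of the Python setup with reticulate (the environment name "r-shap" is an assumption):
library(reticulate)
virtualenv_create("r-shap")
# an older matplotlib is needed for the dependence plots (see the GitHub issues)
virtualenv_install("r-shap", packages = c("shap", "matplotlib==3.2.2"))
use_virtualenv("r-shap", required = TRUE)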
Here's a workflow for lightgbm (from treesnip docs, lightly modified).
library(tidymodels)
library(treesnip)
data("diamonds", package = "ggplot2")
diamonds <- diamonds %>% sample_n(1000)
# vfold resamples
diamonds_splits <- vfold_cv(diamonds, v = 5)
# model specs
model_spec <- boost_tree(mtry = 5, trees = 500) %>% set_mode("regression")
lightgbm_model <- model_spec %>%
  set_engine("lightgbm", nthread = 6)
# workflows
lightgbm_wf <- workflow() %>%
  add_model(
    lightgbm_model
  )
rec_ordered <- recipe(
  price ~ .,
  data = diamonds
)
lightgbm_fit_ordered <- fit_resamples(
  add_recipe(
    lightgbm_wf, rec_ordered
  ),
  resamples = diamonds_splits
)
Prior to prediction, we want to fit our workflow:
fit_workflow <- lightgbm_wf %>%
  add_recipe(rec_ordered) %>%
  fit(data = diamonds)
Now we have a fit workflow and can predict. To use the fastshap::explain function, we need to create a predict function (this doesn't always hold: depending on the engine used it may or may not work out of the box -- see docs).
predict_function_gbm <- function(model, newdata) {
  # predict() on a workflow returns a tibble; extract the .pred column as a vector
  predict(model, newdata) %>% pluck(1)
}
Let's get the mean prediction value (used below) while we're at it. This also serves as a check that the function works.
mean_preds <- mean(
  predict_function_gbm(
    fit_workflow, diamonds %>% select(-price)
  )
)
Now we create our explanations (SHAP values). Note the pred_wrapper and X arguments here (see the fastshap GitHub issues for other examples, e.g. glmnet).
explanations_gbm <- fastshap::explain(
  fit_workflow,
  X = as.data.frame(diamonds %>% select(-price)),
  pred_wrapper = predict_function_gbm,
  nsim = 10
)
This should produce a force plot.
fastshap::force_plot(
  object = explanations_gbm[1, ],
  feature_values = as.data.frame(diamonds %>% select(-price))[1, ],
  display = "viewer",
  baseline = mean_preds)
This allows multiple, vertically stacked force plots:
fastshap::force_plot(
  object = explanations_gbm[1:20, ],
  feature_values = as.data.frame(diamonds %>% select(-price))[1:20, ],
  display = "viewer",
  baseline = mean_preds)
Add link = "logit" for classification. Change display to "html" for Rmarkdown rendering.
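For instance, a sketch of the same call adapted as just described (assuming a classification fit rendered in an R Markdown document):
fastshap::force_plot(
  object = explanations_gbm[1:20, ],
  feature_values = as.data.frame(diamonds %>% select(-price))[1:20, ],
  display = "html",
  link = "logit",  # explanations are on the logit scale for classification
  baseline = mean_preds)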
Now for summary plots and dependency plots.
The trick is using reticulate to access the functions directly. Note that the same logic holds for libraries like transformers, numpy, etc.
First, the dependence plot.
library(reticulate)
shap <- import("shap")
np <- import("numpy")
shap$dependence_plot(
  "rank(3)",
  data.matrix(explanations_gbm),
  data.matrix(diamonds %>% select(-price))
)
See the shap docs for an explanation of rank(3); rank(1), etc. will also work.
Unfortunately, it threw an error when I attempted to name the feature directly (e.g., "cut").
Now for the summary plot:
shap$summary_plot(
  data.matrix(explanations_gbm),
  data.matrix(diamonds %>% select(-price))
)
Final note: rendering the plot repeatedly will produce buggy visualizations. Hopefully this provides a point of departure for catboost visualizations.

R Tidymodels: What objects to save for use in production after fitting a recipe-based workflow utilizing pre-processing?

After designing a Tidymodels recipe-based workflow, which is tuned then fitted to some training data, I'm not clear what objects (fitted "workflow", "recipe", ..etc) should be saved to disk for use in predicting new data in production. I understand I can use saveRDS()/readRDS(), write_rds()/read_rds(), or other options to actually do the saving/loading of these objects, but which ones?
In a clean R environment I will have incoming new raw data which will need to be pre-processed using the same recipe I used in training the model. I then want to make predictions on that data after it has been pre-processed. If I intend to use the prep() and bake() functions to pre-process the new data as I did the training data, then it seems I will minimally need the recipe and the original training data to get prep() to work, plus the fitted model/workflow to make predictions. So three objects, it seems. If I save the workflow object to disk in SESSION 1, then in SESSION 2 I can extract the recipe and model from it with pull_workflow_prepped_recipe() and pull_workflow_fit() respectively. But prep() seems to require the original training data, which I can keep in the workflow with an earlier use of retain = TRUE... but then that gets stripped out of the workflow after a call to fit(). Hear my cries for help! :)
So, imagine two different R sessions, where in the first session I do all the training and model building, and the second session is some running production app that uses what was learned in the first session. I need help at the arrows at the bottom of SESSION 1, and in multiple places in SESSION 2. I used the Tidymodels Get Started guide as the base for this example.
SESSION 1
library(tidymodels)
library(nycflights13)
library(readr)
set.seed(123)
flight_data <-
  head(flights, 500) %>%
  mutate(
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    date = as.Date(time_hour)
  ) %>%
  inner_join(weather, by = c("origin", "time_hour")) %>%
  select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
  na.omit() %>%
  mutate_if(is.character, as.factor)
set.seed(555)
data_split <- initial_split(flight_data, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())
lr_mod <-
  logistic_reg() %>%
  set_engine("glm")
flights_wflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(flights_rec)
flights_fit <-
  flights_wflow %>%
  fit(data = train_data)
predict(flights_fit, test_data)
### SAVE ONE OR MORE OBJECTS HERE FOR NEXT SESSION <------------
# What to save? workflow (pre or post fit()?), recipe, training data...etc.
write_rds(flights_wflow, "flights_wflow.rds") # Not fitted workflow
write_rds(flights_fit, "flights_fit.rds") # Fitted workflow
SESSION 2
### READ ONE OR MORE OBJECTS HERE FROM PRIOR SESSION <------------
flights_wflow <- read_rds("flights_wflow.rds")
flights_fit <- read_rds("flights_fit.rds")
# Acquire new data, do some basic transforms as before
new_flight_data <-
  tail(flights, 500) %>%
  mutate(
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    date = as.Date(time_hour)
  ) %>%
  inner_join(weather, by = c("origin", "time_hour")) %>%
  select(dep_time, flight, origin, dest, air_time, distance, carrier, date, arr_delay, time_hour) %>%
  na.omit() %>%
  mutate_if(is.character, as.factor)
# Something here to preprocess the data with recipe as in SESSION 1 <----------
# new_flight_data_prep <- prep(??)
# new_flight_data_preprocessed <- bake(??)
# Predict new data
predict(flights_fit, new_data = new_flight_data_preprocessed)
You have some flexibility in how you approach this, depending on your constraints, but generally I would recommend saving/serializing the fitted workflow, perhaps after using butcher to reduce its size. You can see an example model fitting script in this repo that shows at the end how I save the fitted workflow.
When you go to predict with this workflow, there are some things to keep in mind. I have an example Plumber API in the same repo that demonstrates what is needed to predict for that particular workflow. Notice how the packages needed for prediction are loaded/attached for this API. I didn't use all of tidymodels, but instead only the specific packages I need, for better performance and a smaller container.
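As a minimal sketch of that recommendation applied to the question's objects (the file name is just an example):
library(butcher)
# trim fitted-workflow components that aren't needed for prediction
flights_fit_small <- butcher(flights_fit)
saveRDS(flights_fit_small, "flights_fit.rds")
# SESSION 2: a fitted workflow applies its recipe to new data itself,
# so no separate prep()/bake() step is needed
flights_fit <- readRDS("flights_fit.rds")
predict(flights_fit, new_data = new_flight_data)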
Saving the fitted workflow did not work for me. When trying to predict with new data, it asks for the target variable (it's a churn model):
predict(churn_model, the_data)
Error: Problem with `mutate()` column `churn`.
i `churn = dplyr::if_else(churn == 1, "yes", "no")`.
x object 'churn' not found
I still don't get why it is asking for a column that should not be present in the data, as it is the variable I am trying to predict...
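One likely cause (an assumption based on the error message, not confirmed by the poster): a recipe step that transforms the outcome, such as the step_mutate() shown in the error, is applied to new data at prediction time unless it is marked with skip = TRUE:
# hypothetical recipe step; skip = TRUE makes it run only on training data
step_mutate(churn = dplyr::if_else(churn == 1, "yes", "no"), skip = TRUE)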

Using package specific functions with parsnip

I'm trying to learn R's modeling framework tidymodels. After creating the model and specifying the package (engine) I want to use, I now try to use some specific functions from the engine I chose. In this case it is the randomForest package and the varImpPlot() function. However, when I try to execute it, an error shows up saying that the object has to be a randomForest object in order to use the function. Well, this is obvious, but my question is: is there some way to translate the parsnip object to the object of the engine I chose, or some way to use these functions from the package I have chosen? Thanks for the help!
model_rand_forest <- rand_forest() %>%
  set_engine("randomForest") %>%
  set_mode("regression") %>%
  translate()
training_workflow <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(model_rand_forest)
training_workflow_fit <- training_workflow %>% fit(data = train)
training_workflow_fit %>% varImpPlot()
Error in varImpPlot(.) :
  This function only works for objects of class `randomForest'
You can extract the randomForest object from the workflow with $fit$fit$fit. In your example, this should work:
training_workflow_fit$fit$fit$fit %>% varImpPlot()
Or you could use the syntax below, which may be neater:
training_workflow_fit %>%
  chuck("fit") %>%
  chuck("fit") %>%
  chuck("fit") %>%
  varImpPlot()
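Newer versions of the workflows package also provide a dedicated extractor, which avoids relying on the object's internal structure:
# extract_fit_engine() returns the underlying engine fit (here, a randomForest object)
training_workflow_fit %>%
  extract_fit_engine() %>%
  varImpPlot()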

Error - No tidy method for objects of class lmerMod

I am using a dataset from an online practice tutorial and the code can be found at the bottom of Page 4 (https://tomhouslay.files.wordpress.com/2017/02/indivvar_mv_tutorial_asreml.pdf)
In the tutorial, they get the function to work using the code listed below, but in my R session, I get an error that says:
No tidy method for objects of class lmerMod.
I have tried using the package "parsnip" as well as restarting my R session, and I have tried requiring broom as suggested in other answers to similar questions.
The haggis practice csv file can be downloaded from here: https://figshare.com/articles/Haggis_data_behavioural_syndromes/4702540
library(asreml)
library(nadiv)
library(tidytext)
library(tidyverse)
library(broom)
require(broom)
library(lme4)
library(data.table)
library(parsnip)
HData <- read_csv("haggis practice.csv")
lmer_b <- lmer(boldness ~ scale(assay_rep, scale = FALSE) +
                 scale(body_size) +
                 (1 | ID),
               data = HData)
plot(lmer_b)
qqnorm(residuals(lmer_b))
hist(residuals(lmer_b))
summary(lmer_b)
rep_bold <- tidy(lmer_b, effects = "ran_pars", scales = "vcov") %>%
  select(group, estimate) %>%
  spread(group, estimate) %>%
  mutate(repeatability = ID / (ID + Residual))
Providing an answer (from the comments).
The tidy methods for multilevel/mixed-type models (e.g. from lme4, brms, MCMCglmm, ...) were moved to broom.mixed. You can either install/load the broom.mixed package, or use the broomExtra package, which is a "meta-package" that looks for methods in both broom and broom.mixed ...
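A minimal sketch of the fix, reusing the question's own code:
# broom.mixed supplies the tidy() method for lmerMod objects
library(broom.mixed)
rep_bold <- tidy(lmer_b, effects = "ran_pars", scales = "vcov") %>%
  select(group, estimate) %>%
  spread(group, estimate) %>%
  mutate(repeatability = ID / (ID + Residual))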
