I am trying to create a node-link diagram (decision tree) using parsnip and tidymodels. I am building a decision tree model for the Stack Overflow dataset with the tidymodels package and rpart as the model engine. The model should predict whether a developer will work remotely (variable remote) based on the number of years of programming experience (years_coded_job), degree of career satisfaction (career_satisfaction), whether the job title is "Data Scientist" (data_scientist), and the size of the employing company (company_size_number).
My pipeline:
library(tidyverse)
library(tidymodels)
library(rpart.plot)
library(rpart)
library(rattle)
so <- read_rds(here::here("stackoverflow.rds"))
fit <- rpart(remote ~ years_coded_job + career_satisfaction + data_scientist + company_size_number,
             data = so,
             control = rpart.control(minsplit = 20, minbucket = 2))
fancyRpartPlot(fit, sub = "")
The plot I obtain:
I want to know whether this is the correct approach for determining the predictors. Since I am not building the model with parsnip itself, is this the right way?
If you are going to use tidymodels and parsnip to fit your model, it's better to use that actual fitted model for any visualizations like this. You can get the underlying engine object from a parsnip model using $fit.
library(tidyverse)
library(tidymodels)
library(rattle)
#> Loading required package: bitops
#> Rattle: A free graphical interface for data science with R.
#> Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
#> Type 'rattle()' to shake, rattle, and roll your data.
data(kyphosis, package = "rpart")
tree_fit <- decision_tree(min_n = 20) %>%
  set_engine("rpart") %>%
  set_mode("classification") %>%
  fit(Kyphosis ~ Age + Number + Start,
      data = kyphosis)
fancyRpartPlot(tree_fit$fit, sub = "")
Created on 2021-05-25 by the reprex package (v2.0.0)
For some kinds of visualizations, you will need to use repair_call().
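For example, rpart.plot() wants to retrieve the call and training data from the fit. A minimal sketch continuing from the tree_fit object above (my addition, not part of the original answer):
library(rpart.plot)
# repair_call() restores the call in the underlying engine fit so that
# rpart-based tooling can find the training data again.
tree_fit <- repair_call(tree_fit, data = kyphosis)
rpart.plot(tree_fit$fit, roundint = TRUE)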
Related
I developed a stacked ensemble model using the tidymodels workflow, and I want to monitor the performance of this model from time to time using vetiver. However, it seems the stacked model object isn't supported yet.
Please see the code snippet below:
library(tidymodels)
library(vetiver)
library(pins)
library(arrow)
library(tidyverse)
library(bonsai)
library(stacks)
library(lubridate)
library(magrittr)
b <- board_folder(path = "pins-r/")
model <- vetiver_pin_read(board = b, name = "dcp_ibese_truck_arrival",
                          version = "20230110T094207Z-69661")
trips <- read_parquet("../IbeseLivePosition/ml_data/data_to_monitor_model.parquet")
trips %<>%
  mutate(Date = as.Date(DateTimeReceived))
original_metrics <-
  vetiver::augment(model, new_data = trips)
Error: No augment method for objects of class butchered_linear_stack
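A hedged workaround sketch (my addition, not from the original post, and assuming the pinned vetiver_model keeps the fitted stack in its $model element and that predict() still works on the butchered stack): build the augmented frame manually instead of calling augment().
# Hypothetical fallback: predict with the underlying stacked model and
# bind the predictions to the monitoring data by hand.
preds <- predict(model$model, new_data = trips)
original_metrics <- bind_cols(trips, preds)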
I'm trying out a package for double machine learning (https://rdrr.io/github/yixinsun1216/crossfit/). When I run its main function dml() on the example data, I get the error "Error: No tidy method for objects of class dgCMatrix". Looking through the documentation (https://rdrr.io/github/yixinsun1216/crossfit/src/R/dml.R), I can't find anything wrong with how tidy() is used. Does anyone have any idea what could be going wrong here?
R version 4.2.1
I have already tried installing broom.mixed (broomExtra doesn't seem to be available for my R version), and the same problem occurs. Code used below:
install.packages("remotes")
remotes::install_github("yixinsun1216/crossfit", force = TRUE)
library("remotes")
library("crossfit")
library("broom.mixed")
library("broom")
# Effect of temperature and precipitation on corn yield in the presence of
# time and locational effects
data(corn_yield)
library(magrittr)
dml_yield <- "logcornyield ~ lower + higher + prec_lo + prec_hi | year + fips" %>%
  as.formula() %>%
  dml(corn_yield, "linear", n = 5, ml = "lasso", poly_degree = 3, score = "finite")
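For context, an illustration I added (not from the original question): the error means broom has no tidy() method registered for sparse matrices, so any dgCMatrix that reaches tidy() inside dml() will fail regardless of how tidy() is called there. A minimal reproduction:
library(Matrix)
library(broom)
# A rectangular general sparse matrix; Matrix() stores it as a dgCMatrix.
m <- Matrix(c(1, 2, 3, 0, 0, 4), nrow = 2, sparse = TRUE)
class(m)                     # "dgCMatrix"
# tidy(m)                    # would reproduce the "No tidy method" error
as.data.frame(as.matrix(m))  # dense coercion as a manual fallback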
This question refers to Obtaining summary shap plot for catboost model with tidymodels in R. Given the comment below the question, the OP found a solution but has not shared it with the community so far.
I want to analyze my tree ensembles fitted with the tidymodels package with SHAP value plots, such as force plots for single observations and summary plots of the effect of all features of my dataset.
DALEXtra provides explain_tidymodels(), a function to create an explainer for tidymodels objects from which SHAP values can be computed. force_plot from the fastshap package provides a wrapper for the plot function of the underlying Python package shap. But I can't understand how to make these functions work with the output of explain_tidymodels().
Question: How can one generate such SHAP plots in R using tidymodels and explain_tidymodels()?
MWE (for SHAP values with explain_tidymodels())
library(MASS)
library(tidyverse)
library(tidymodels)
library(parsnip)
library(treesnip)
library(catboost)
library(fastshap)
library(DALEXtra)
set.seed(1337)
rec <- recipe(crim ~ ., data = Boston)
split <- initial_split(Boston)
train_data <- training(split)
test_data <- testing(split) %>% dplyr::select(-crim) %>% as.matrix()
model_default <-
  parsnip::boost_tree(mode = "regression") %>%
  set_engine(engine = "catboost", loss_function = "RMSE")
# sometimes catboost is not loaded correctly; the following two lines
# prevent fitting errors
# https://github.com/curso-r/treesnip/issues/21 (the error is mentioned in the last post)
set_dependency("boost_tree", eng = "catboost", "catboost")
set_dependency("boost_tree", eng = "catboost", "treesnip")
model_fit_wf <- workflow() %>%
  add_model(model_default) %>%
  add_recipe(rec) %>%
  fit(data = train_data)
SHAP_wf <- explain_tidymodels(model_fit_wf,
                              data = dplyr::select(train_data, -crim),
                              y = train_data$crim)
Perhaps this will help. At the very least, it is a step in the right direction.
First, ensure you have fastshap and reticulate installed (i.e., install.packages("...")). Next, set up a virtual environment and install shap (pip install ...). Also, install matplotlib 3.2.2 for the dependency plots (check out GitHub issues on this -- an older version of matplotlib is necessary).
RStudio has great information on virtual environment setup. That said, virtual environment setup requires more or less troubleshooting depending on the IDE of use. (Sadly, some work settings restrict the use of open source RStudio due to licensing.)
Docs for library(fastshap) are also helpful on this front.
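A minimal setup sketch using reticulate's virtualenv helpers (the environment name "r-shap" is my example, not from the original answer):
library(reticulate)
# Create an isolated Python environment and pin the matplotlib version.
virtualenv_create("r-shap")
virtualenv_install("r-shap", packages = c("shap", "matplotlib==3.2.2"))
use_virtualenv("r-shap", required = TRUE)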
Here's a workflow for lightgbm (from treesnip docs, lightly modified).
library(tidymodels)
library(treesnip)
data("diamonds", package = "ggplot2")
diamonds <- diamonds %>% sample_n(1000)
# vfold resamples
diamonds_splits <- vfold_cv(diamonds, v = 5)
model_spec <- boost_tree(mtry = 5, trees = 500) %>% set_mode("regression")
# model specs
lightgbm_model <- model_spec %>%
set_engine("lightgbm", nthread = 6)
#workflows
lightgbm_wf <- workflow() %>%
  add_model(lightgbm_model)
rec_ordered <- recipe(price ~ ., data = diamonds)
lightgbm_fit_ordered <- fit_resamples(
  add_recipe(lightgbm_wf, rec_ordered),
  resamples = diamonds_splits
)
Prior to prediction, we want to fit our workflow:
fit_workflow <- lightgbm_wf %>%
  add_recipe(rec_ordered) %>%
  fit(data = diamonds)
Now we have a fit workflow and can predict. To use the fastshap::explain function, we need to create a predict function (this doesn't always hold: depending on the engine used, it may or may not work out of the box -- see the docs).
predict_function_gbm <- function(model, newdata) {
  predict(model, newdata) %>% pluck(1)
}
Let's get the mean prediction value (used below) while we're at it. This also serves as a check that the function works.
mean_preds <- mean(
  predict_function_gbm(fit_workflow, diamonds %>% select(-price))
)
Now we create our explanations (SHAP values). Note the pred_wrapper and X arguments here (see the fastshap GitHub issues for other examples -- e.g., glmnet).
explanations_gbm <- fastshap::explain(
  fit_workflow,
  X = as.data.frame(diamonds %>% select(-price)),
  pred_wrapper = predict_function_gbm,
  nsim = 10
)
This should produce a force plot.
fastshap::force_plot(
  object = explanations_gbm[1, ],
  feature_values = as.data.frame(diamonds %>% select(-price))[1, ],
  display = "viewer",
  baseline = mean_preds
)
Passing multiple rows gives multiple force plots, vertically stacked:
fastshap::force_plot(
  object = explanations_gbm[1:20, ],
  feature_values = as.data.frame(diamonds %>% select(-price))[1:20, ],
  display = "viewer",
  baseline = mean_preds
)
Add link = "logit" for classification. Change display to "html" for Rmarkdown rendering.
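For instance, the stacked plot above rendered for an R Markdown document (a sketch; link = "logit" is omitted here because this diamonds model is a regression):
fastshap::force_plot(
  object = explanations_gbm[1:20, ],
  feature_values = as.data.frame(diamonds %>% select(-price))[1:20, ],
  display = "html",
  baseline = mean_preds
)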
Now for summary plots and dependency plots.
The trick is using reticulate to access the Python functions directly. Note that the same logic holds for libraries like transformers, numpy, etc.
First, for dependency plot.
library(reticulate)
shap = import("shap")
np = import("numpy")
shap$dependence_plot(
"rank(3)",
data.matrix(explanations_gbm),
data.matrix(diamond %>% select(-price))
)
See shap docs for explanation of rank(3) -- rank(1) etc will also work.
Unfortunately, it threw an error when I attempted naming the feature directly (i.e., "cut").
Now for the summary plot:
shap$summary_plot(
  data.matrix(explanations_gbm),
  data.matrix(diamonds %>% select(-price))
)
Final note: rendering the plot repeatedly will produce buggy visualizations. Hopefully this provides a point of departure for catboost visualizations.
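One hedged mitigation for the repeated-rendering issue (my addition, not from the original answer): close matplotlib's open figures between calls so each plot starts from a clean state.
library(reticulate)
plt <- import("matplotlib.pyplot")
plt$close("all")  # drop all open figures before re-plotting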
I've been trying to use the rpart.plot package to plot a ctree from the partykit library, because the default plot method is terrible when the tree is deep. In my case, max_depth = 5.
I really like rpart.plot's output, as it displays deep trees much better. Here is how the output looks for a simple example:
rpart
library(partykit)
library(rpart)
library(rpart.plot)
df_test <- cu.summary[complete.cases(cu.summary),]
multi.class.model <- rpart(Reliability~., data = df_test)
rpart.plot(multi.class.model)
I would like to get this output from the partykit model using ctree
ctree
multi.class.model <- ctree(Reliability~., data = df_test)
rpart.plot(multi.class.model)
Error: the object passed to prp is not an rpart object
Is there some way one could coerce the ctree object to rpart so this would run?
To the best of my knowledge, all the other packages for visualizing rpart trees are really rpart-specific and not based on the agnostic party class for representing trees/recursive partitions. Also, we haven't tried to implement an as.rpart() method for party objects because the rpart class is really not well suited for this.
But you can try to tweak the partykit visualizations, which are customizable through panel functions for almost all aspects of the tree. One thing that might be helpful is to compute a simpleparty object, which has all sorts of simple summary information in the $info of each node. This can then be used in the node_terminal() panel function for printing information in the tree display. Consider the following simple example for predicting one of three school types in the German Socio-Economic Panel. To achieve the desired depth, I essentially switch significance testing off:
library("partykit")
data("GSOEP9402", package = "AER")
ct <- ctree(school ~ ., data = GSOEP9402, maxdepth = 5, alpha = 0.5)
The default plot(ct) on a sufficiently big device gives you:
When turning the tree into a simpleparty you get a textual summary by default:
st <- as.simpleparty(ct)
plot(st)
This still has overlapping labels, so we could set up a small convenience function that extracts the interesting bits from the $info of each node and puts them into a longer character vector with narrower entries:
myfun <- function(i) c(
  as.character(i$prediction),
  paste("n =", i$n),
  format(round(i$distribution/i$n, digits = 3), nsmall = 3)
)
plot(st, tp_args = list(FUN = myfun), ep_args = list(justmin = 20))
In addition to the arguments of the terminal panel function (tp_args) I have tweaked the arguments of the edge panel function (ep_args) to avoid some of the overplotting in the edges.
Of course, you could also change the entire panel function and roll your own...
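As a hedged sketch of that last point (my addition, not from the original answer): a fully custom terminal panel can be passed via terminal_panel, and node_terminal() accepts a FUN argument that receives each node's $info, just like tp_args above.
# A sparser panel: show only the predicted class and the node size.
myfun2 <- function(i) c(as.character(i$prediction), paste("n =", i$n))
plot(st, terminal_panel = node_terminal(st, FUN = myfun2))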
I'm using the plm package for panel data to do instrumental variable estimation. However, it seems that calculating cluster-robust standard errors with the vcovHC() function is not supported.
More specifically, when I use the vcovHC() function, the following error message is displayed:
Error in vcovG.plm(x, type = type, cluster = cluster, l = 0, inner = inner, :
  Method not available for IV
Example:
data("Wages", package = "plm")
IV <- plm(lwage ~ south + exp | wks + south,
          data = Wages, model = "pooling", index = 595)
vcvIV <- vcovHC(IV)
According to this thread, someone worked on a fix two years ago. Is there any progress on the issue? I know that the packages lfe and ivpack allow computing cluster-robust standard errors for IV estimation, but neither of them allows for random effects/intercepts.
In fact, it's not implemented. However, you can use Schrimpf's clustered-errors function cl.plm(), which is applied directly to an object of class plm.
Using your example:
library(plm)
data("Wages", package = "plm")
IV <- plm(lwage ~ south + exp | wks + south, data = Wages, model = "pooling", index = 595)
Wages$id <- rep(1:595, each = 7)
cl.plm(Wages, IV, Wages$id)
Here I'm using Wages$id as the panel's first dimension, around which clusters will be formed. You may want to compare these results with those obtained from other software. Anyway, the code is simple and allows for some tweaking. The cl.plm function is based on Arai's clustering notes, which can help you further.
You can obtain the same result as cl.plm by doing this in Stata:
ivregress 2sls lwage south (exp = wks), vce(cluster id) small
Or for the within model:
xtset id time, generic
xtivreg2 lwage south (exp = wks), fe small cluster(id)
Note, however, that I used the small-sample formulation in Stata, which is not a big deal. More about this here. Anyway, cl.plm properly deals with the plm class object.
For the sake of completeness: as suggested by @Helix123, you can use the development version (1.6-1) of the plm package and proceed as you did in your question.
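A minimal sketch of that route (assuming a plm version at or above 1.6-1, where vcovHC() gained support for IV models):
library(plm)
data("Wages", package = "plm")
IV <- plm(lwage ~ south + exp | wks + south, data = Wages,
          model = "pooling", index = 595)
# Cluster-robust covariance, clustering on the individual ("group") dimension:
vcvIV <- vcovHC(IV, cluster = "group")
lmtest::coeftest(IV, vcov = vcvIV)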