Plotting decision tree results from tidymodels - r

I have managed to build a decision tree model using the tidymodels package, but I am unsure how to pull the results and plot the tree. I know I can use the rpart and rpart.plot packages to achieve the same thing, but I would rather use tidymodels, as that is what I am learning. Below is an example using the mtcars data.
library(tidymodels)
library(rpart)
library(rpart.plot)
library(dplyr) # mtcars itself ships with base R's datasets package

# data
df <- mtcars %>%
  mutate(gear = factor(gear))

# train/test
set.seed(1234)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)

df_recipe <- recipe(gear ~ ., data = df) %>%
  step_normalize(all_numeric())

# building model
tree <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# workflow
tree_wf <- workflow() %>%
  add_recipe(df_recipe) %>%
  add_model(tree) %>%
  fit(df_train) # results are found here

rpart.plot(tree_wf$fit$fit) # error is here
The error I get says Error in rpart.plot(tree_wf$fit$fit) : Not an rpart object, which makes sense, but I don't know whether there is a package or step I am missing to convert the results into a format that rpart.plot will accept. This might not be possible, but any help would be much appreciated.

You can also use the workflows::pull_workflow_fit() function. It makes the code a little bit more elegant.
tree_fit <- tree_wf %>%
  pull_workflow_fit()

rpart.plot(tree_fit$fit)
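Note that in more recent releases of workflows, pull_workflow_fit() has been deprecated in favour of extract_fit_parsnip(), and extract_fit_engine() goes one step further by returning the underlying rpart object directly. A minimal sketch, assuming a current workflows version:
tree_wf %>%
  extract_fit_engine() %>% # returns the raw rpart object, not the parsnip wrapper
  rpart.plot()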

The following works (note the extra $fit):
rpart.plot(tree_wf$fit$fit$fit)
Not a very elegant solution, but it does plot the tree.
Tested with parsnip 0.1.3 and rpart.plot 3.0.8.

Related

How to Get Variable/Feature Importance From Tidymodels ranger object?

I have a ranger object from the tidymodels rand_forest function:
rf <- rand_forest(mode = "regression", trees = 1000) %>%
  fit(pay_rate ~ age + profession)
I want to get the feature importance of each variable (I have many more than in this example). I've tried things like rf$variable.importance, or importance(rf), but the former returns NULL and the latter function doesn't exist. I tried using the vip package, but that doesn't work for a ranger object. How can I extract feature importances from this object?
You need to add importance = "impurity" when you set the engine for ranger. This will provide variable importance scores. Once this is set, you can use extract_fit_parsnip with vip to plot the variable importance.
A small example:
library(tidymodels)
library(vip)

rf_mod <- rand_forest(mode = "regression", trees = 100) %>%
  set_engine("ranger", importance = "impurity")

rf_recipe <- recipe(mpg ~ ., data = mtcars)

rf_workflow <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe)

rf_workflow %>%
  fit(mtcars) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
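If you want the importance scores as data rather than a plot, vip::vi() returns them as a tibble; a small sketch reusing the fitted workflow from above:
rf_workflow %>%
  fit(mtcars) %>%
  extract_fit_parsnip() %>%
  vi() # tibble with Variable and Importance columns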
More information is available in the tidymodels Get Started guide.

Error: No tidy method for objects of class function :: broom.mixed

I am trying to perform a linear regression fit using tidymodels/parsnip but encounter the following error:
Error: No tidy method for objects of class function
My routine:
library(tidymodels)
library(parsnip)
library(broom.mixed)

linear_reg() %>%
  set_engine("lm") %>%
  fit(formula = cnt ~ temp_raw, data = bikeshare)

fit %>% tidy()
fit %>% glance()
Having read this post: "Tidy function gives this error: No tidy method for objects of class lmerMod. It will not work on my computer, but works in pdf with same code",
I also tried broom.mixed, but the error persists.
The main issue is that you need to assign the fitted model to an object; in your case it would also be called fit.
There are two other points to consider:
- It's confusing, and not best practice, to assign variables the same name as R functions (i.e. you might want to call your fit fit0 or my_fit or something rather than fit); you can usually get away with it, but it breaks, confusingly, in some contexts.
- broom.mixed is a red herring: the broom package is what handles lm fits, and you don't need to load it explicitly, since tidymodels apparently attaches it (and parsnip) automatically.
library(tidymodels)

fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(formula = mpg ~ cyl, data = mtcars)

fit %>% tidy()
fit %>% glance()
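Following the naming advice above, a minimal variant that avoids shadowing the fit() generic (my_fit is just an arbitrary name):
library(tidymodels)

# Same model as above, assigned to a name that doesn't clash with fit()
my_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ cyl, data = mtcars)

tidy(my_fit)   # coefficient table
glance(my_fit) # model-level summary statistics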

error checking glmnet model from parsnip object using easystats: $ operator is invalid

I tried to run check_model() on a very simple glmnet classification task.
I took some code from here:
Extract plain model from tidymodel object
library(magrittr)
library(tidymodels)
library(performance)

data(two_class_dat)

glm_spec <- logistic_reg() %>%
  set_engine("glmnet")

norm_rec <- recipe(Class ~ A + B, data = two_class_dat) %>%
  step_normalize(all_predictors())

glm_fit <- workflow() %>%
  add_recipe(norm_rec) %>%
  add_model(glm_spec) %>%
  fit(two_class_dat) %>%
  pull_workflow_fit()

performance::check_model(glm_fit)

Error: $ operator is invalid for atomic vectors
performance::check_model() is a generic function. This means that specific functions (called the methods) need to be written for check_model() to work on different types of objects. I think that they would need to write a method for workflow objects.
In your example, you probably wouldn't want to run check_model() on the fitted model since it doesn't know about the pre-processing in your recipe.
It looks like this SO question was converted to an issue here.
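If you want to check for yourself which classes check_model() supports, you can list its registered S3 methods (the output will vary with your version of performance):
library(performance)
methods(check_model) # at the time of writing, no method for workflow or model_fit objects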

How to incorporate tidy models PCA into the workflow of a model and make predictions

I am trying to incorporate a tidymodels PCA step into the workflow of a model. I want a predictive model that uses PCA as a preprocessing step and then makes predictions with that model.
I have tried the following approach:
library(tidymodels)

diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <- vfold_cv(diamonds_train)

diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())

preprocesados <- prep(diamonds_recipe)

linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

lr_fitted_workflow <- pca_workflow %>% # option A: workflow, full dataset
  last_fit(diamonds_split)

performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes column.”
Following other tutorials, I tried this other approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:
pca_fit <- juice(preprocesados) %>% # option C: no workflow at all
  lm(price ~ ., data = .)

prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
  select(price)

ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans), ans, truths)

ggplot(data = tib) +
  geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
  geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it prints something that seems reasonable, but by this point I have lost confidence and some guidance would be much appreciated. :D
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R, the function for fitting your model is fit(), and for it to work you have to provide the data for the fitting process (here diamonds_train). The tradeoff is that you don't have to prep() your recipe, as the workflow takes care of this itself.
So reducing your code slightly, the example below will work.
library(tidymodels)

data(diamonds)
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())

linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

pca_fit <- fit(pca_workflow, data = diamonds_train)
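To actually make predictions on the held-out data (the original goal), predict() can be called on the fitted workflow, and the recipe is applied to the raw new data automatically. Two caveats: the glmnet engine needs a single penalty value in the spec before it can predict (this is the error the resamples hit), and step_log() on the outcome means predictions come back on the log scale. A minimal sketch, where penalty = 0.001 is an arbitrary illustrative value:
linear_model_pen <- linear_reg(penalty = 0.001) %>% # arbitrary value; normally tuned
  set_engine("glmnet") %>%
  set_mode("regression")

pca_fit_pen <- update_model(pca_workflow, linear_model_pen) %>%
  fit(data = diamonds_train)

# New data goes in raw; the workflow normalizes and applies PCA before predicting
predict(pca_fit_pen, new_data = diamonds_test) # .pred is on the log scale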
As for cross-validation, one has to use fit_resamples(), and it is the training set, not the testing set, that should be split into folds. But here I am currently getting the same error (my answer will be updated if I figure out why).
Edit
Now I've done a bit of digging, and the problem with cross-validation stems from the engine being glmnet. I am guessing that, among the many moving parts, this one has somehow been missed. I've added a possible issue on the workflows package GitHub site. Answers there are often quick in coming, so one of the developers will likely reply soon.
As for cross-validation, assume you instead fit using any of the other engines described in ?linear_reg; then we could do this as follows:
linear_model_base <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

pca_workflow <- update_model(pca_workflow, linear_model_base)
folds <- vfold_cv(diamonds_train, 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)
In the case where metrics are of interest, these can indeed be collected as you did, using collect_metrics():
pca_folds_fit %>% collect_metrics()
If you are interested in the predictions, you'll have to tell the model to save these during the fitting process, and then use collect_predictions():
pca_folds_fit <- fit_resamples(
  pca_workflow,
  resamples = folds,
  control = control_resamples(save_pred = TRUE)
)
collect_predictions(pca_folds_fit)
Note, however, that the output from this contains the predictions from each fold, as you are literally fitting 10 models.
Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (collect_metrics()) would then be selected, and the test dataset would be used to evaluate that model's accuracy, as sketched below.
This is all described in chapters 10 and 11 of Tidy Modeling with R.
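For that final evaluation step, last_fit() does exactly this: it fits the workflow on the training data and computes metrics on the held-out test set. A minimal sketch, reusing the lm-engine workflow and diamonds_split from above:
final_fit <- last_fit(pca_workflow, diamonds_split) # fits on training, evaluates on testing
collect_metrics(final_fit)     # metrics on the held-out test set
collect_predictions(final_fit) # test-set predictions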

How to reverse a `recipe` from `recipes` package?

Suppose I am creating a recipe for my machine learning model, and I need to preprocess my outcome.
How do I reverse the preprocessing of my outcome or my predictors?
If I preprocess my outcome, how do I transform the output of a model back to the original scale?
library(recipes)
library(modeldata) # the biomass data lives in modeldata

biomass <- biomass

rec <- biomass %>%
  recipe(carbon ~ hydrogen) %>%
  step_BoxCox(all_outcomes()) %>%
  prep()

biomass_box <- rec %>% bake(biomass)
In this example I have applied a Box-Cox transformation to my outcome. How do I get biomass_box$carbon back to its original values? recipes may have an easy way of undoing it, but I've been unable to find it.
Have you tried using step_inverse()?
library(recipes)
library(modeldata)

biomass <- biomass

rec <- biomass %>%
  recipe(carbon ~ hydrogen) %>%
  step_BoxCox(all_outcomes()) %>%
  prep() %>%
  step_inverse(all_predictors())
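One caveat: step_inverse() takes reciprocals (1/x), which is not the mathematical inverse of a Box-Cox transform. As a manual alternative, the lambda estimated by step_BoxCox() can be pulled out of the prepped recipe with tidy() and the transform undone by hand. A sketch using rec and biomass_box from the question, where inv_boxcox() is a helper defined here, not a recipes function:
# Inverse Box-Cox: exp(y) when lambda is 0, (lambda * y + 1)^(1 / lambda) otherwise
inv_boxcox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) exp(y) else (lambda * y + 1)^(1 / lambda)
}

lambda <- tidy(rec, number = 1)$value # lambda estimated by step_BoxCox()
carbon_original <- inv_boxcox(biomass_box$carbon, lambda)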
