Extracting estimates with ranger decision trees in R

I am getting the error message Error: No tidy method for objects of class ranger when trying to extract the estimates for a regression model built with the ranger package in R.
Here is my code:
# libraries
library(tidymodels)
library(textrecipes)
library(LiblineaR)
library(ranger)
library(tidytext)
# create the recipe
comments.rec <- recipe(year ~ comments, data = oa.comments) %>%
  step_tokenize(comments, token = "ngrams", options = list(n = 2, n_min = 1)) %>%
  step_tokenfilter(comments, max_tokens = 1e3) %>%
  step_stopwords(comments, stopword_source = "stopwords-iso") %>%
  step_tfidf(comments) %>%
  step_normalize(all_predictors())
# workflow with recipe
comments.wf <- workflow() %>%
  add_recipe(comments.rec)
# create the regression model using support vector machine
svm.spec <- svm_linear() %>%
  set_engine("LiblineaR") %>%
  set_mode("regression")

svm.fit <- comments.wf %>%
  add_model(svm.spec) %>%
  fit(data = oa.comments)

# extract the estimates for the support vector machine model
svm.fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(-estimate)
Below is the table of estimates for each tokenized term in the data set (this is a dirty data set for demo purposes)
   term                     estimate
   <chr>                       <dbl>
 1 Bias                     2015.
 2 tfidf_comments_2021         0.877
 3 tfidf_comments_2019         0.851
 4 tfidf_comments_2020         0.712
 5 tfidf_comments_2018         0.641
 6 tfidf_comments_https        0.596
 7 tfidf_comments_plan s       0.462
 8 tfidf_comments_plan         0.417
 9 tfidf_comments_2017         0.399
10 tfidf_comments_libraries    0.286
However, when using the ranger engine to create a regression model from random forests, I have no such luck and get the error message above:
# create the regression model using random forests
rf.spec <- rand_forest(trees = 50) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf.fit <- comments.wf %>%
  add_model(rf.spec) %>%
  fit(data = oa.comments)

# extract the estimates for the random forests model
rf.fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(-estimate)

To put this back to you in a simpler form that I think highlights the issue: if you had a decision tree model, how would you produce coefficients for the columns in your dataset, and what would they mean?
I think what you are looking for here is some form of attribution for each column. There are tools for this built into tidymodels, but you should read up on what they are actually reporting.
In your case, you can get a basic idea of what those numbers would look like by using the vip package, though the resulting numbers are definitely not directly comparable to your SVM ones.
install.packages('vip')
library(vip)

rf.fit %>%
  pull_workflow_fit() %>%
  vip(geom = "point") +
  labs(title = "Random forest variable importance")
This produces a plot of relative importance scores. To get the numbers themselves:
rf.fit %>%
  pull_workflow_fit() %>%
  vi()
tidymodels has a decent walkthrough of this here, and given you have a model that can estimate importance scores, you should be good to go.
Tidymodels tutorial page - 'a case study'
Edit: if you haven't done this already, you may need to refit your initial model with an extra argument passed in the set_engine() step that tells ranger what kind of importance scores you are looking for and how they should be computed; see the sketch below.
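A minimal sketch of that change, reusing the rf.spec and workflow from the question and assuming permutation importance is wanted (ranger also accepts "impurity"):
# refit with ranger configured to compute importance scores
rf.spec <- rand_forest(trees = 50) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("regression")

rf.fit <- comments.wf %>%
  add_model(rf.spec) %>%
  fit(data = oa.comments)

# vi() can now return those scores as a tibble
rf.fit %>%
  pull_workflow_fit() %>%
  vi()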

Related

How to Get Variable/Feature Importance From Tidymodels ranger object?

I have a ranger object from the tidymodels rand_forest function:
rf <- rand_forest(mode = "regression", trees = 1000) %>% fit(pay_rate ~ age+profession)
I want to get the feature importance of each variable (I have many more than in this example). I've tried things like rf$variable.importance, or importance(rf), but the former returns NULL and the latter function doesn't exist. I tried using the vip package, but that doesn't work for a ranger object. How can I extract feature importances from this object?
You need to add importance = "impurity" when you set the engine for ranger. This will provide variable importance scores. Once this is set, you can use extract_fit_parsnip with vip to plot the variable importance.
A small example:
library(tidymodels)
library(vip)

rf_mod <- rand_forest(mode = "regression", trees = 100) %>%
  set_engine("ranger", importance = "impurity")

rf_recipe <-
  recipe(mpg ~ ., data = mtcars)

rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe)

rf_workflow %>%
  fit(mtcars) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
More information is available in the tidymodels Get Started guide.
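For the original non-workflow formulation, a minimal sketch (the data frame name pay_data is a placeholder for your actual data): once importance is set in set_engine(), the underlying ranger object sits in the parsnip fit's $fit slot and stores the scores in variable.importance.
rf <- rand_forest(mode = "regression", trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  fit(pay_rate ~ age + profession, data = pay_data)  # pay_data is a placeholder

# named numeric vector of importance scores from the underlying ranger object
rf$fit$variable.importance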

How to reduce the memory used by a trained workflow

I have trained a workflow with the tidymodels framework. The fit uses a random forest and its size is ~2.06 GB. After calling the butcher::butcher() function on it, it still occupies around 2 GB of memory. I get better results with extract_fit_parsnip(), which reduces the object size to ~1.86 GB.
Is there a way to reduce the memory used by the model any further? I only need a minimal object to make predictions, one that includes the recipe instructions for preprocessing.
To reproduce the example, you will need to download the data from this Kaggle competition.
library(tidymodels)

set.seed(112)
splits <- initial_split(bacteria)
bacteria_train <- training(splits)
bacteria_test <- testing(splits)

preprocessing <- recipe(target ~ ., data = bacteria_train) %>%
  update_role(row_id, new_role = "id") %>%
  step_zv(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors())

rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_wf <- workflow() %>%
  add_recipe(preprocessing) %>%
  add_model(rf_spec)

# Training takes around 13 minutes
rf_fit <- fit(rf_wf, data = bacteria_train)

rf_fit %>% lobstr::obj_size()
rf_fit %>% butcher::butcher() %>% lobstr::obj_size()
rf_fit %>% extract_fit_parsnip() %>% butcher::butcher() %>% lobstr::obj_size()
A random forest model can get very big, especially when you have a lot of predictors. This is just the nature of a random forest, because all the trees you trained must be stored, with all their nodes, etc.
You can try reducing the number of trees. In fact, you can tune the number of trees, and find the smallest number that works well.
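A minimal sketch of what that tuning could look like, reusing the recipe and training data from the question (the candidate tree counts and the number of folds are illustrative assumptions):
rf_spec_tune <- rand_forest(trees = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_wf_tune <- workflow() %>%
  add_recipe(preprocessing) %>%
  add_model(rf_spec_tune)

folds <- vfold_cv(bacteria_train, v = 5)

# compare a few tree counts and keep the smallest that performs well
rf_tuned <- tune_grid(
  rf_wf_tune,
  resamples = folds,
  grid = tibble(trees = c(100, 250, 500, 1000))
)

collect_metrics(rf_tuned)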
If you are using a random forest, you likely do not need to normalize, create PC components, etc. Random forests are known for being low-maintenance when it comes to feature engineering. Try using only your raw predictors and see how much improvement that feature engineering really gets you.
You can also try out other types of models, which in my experience often have smaller memory footprints than random forest. Some ideas to try include a bagged tree model and an xgboost model.
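For instance, a quick sketch of an xgboost specification that could be swapped into the same workflow (the tree count here is a placeholder, and this assumes the xgboost engine is installed):
xgb_spec <- boost_tree(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("xgboost")

xgb_wf <- update_model(rf_wf, xgb_spec)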

Consistent "Error: All columns selected for the step should be numeric" in attempted LASSO model within R tidymodels

Trying to run my first LASSO model and running into a few issues. I have a medical dataset where I am trying to predict a dichotomous outcome (the disease) from about 60 predictors. I get as far as tuning the grid before I get the error "All columns selected for the step should be numeric", despite having already converted them all to dummy variables during the recipe stage. I have reduced the number of predictors to see if that changes anything, but it doesn't seem to fix it. The outcome is uncommon and is seen in about 3% of cases, so I don't know if this is affecting anything.
Code as follows
Splitting into testing and training data and stratifying by disease
set.seed(123)
df_split <- initial_split(df, strata = disease)
df_train <- training(df_split)
df_test <- testing(df_split)
Creating validation set
set.seed(234)
validation_set <- validation_split(df_train,
                                   strata = dfPyVAN,
                                   prop = 0.8)
Building the model
df_model <-
  logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
Creating the recipe
df_recipe <-
  recipe(dfPyVAN ~ ., data = df_train) %>%
  step_medianimpute(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
Create workflow
df_workflow <-
  workflow() %>%
  add_model(df_model) %>%
  add_recipe(df_recipe)
Grid of penalty values to tune
df_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
Train and tune the model - this is the step where it breaks down and I get the constant error
df_res <-
  df_workflow %>%
  tune_grid(validation_set,
            grid = df_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
I have tried multiple variations with the same result - would be very grateful if anyone could offer any help,
Many thanks
The error you are getting is coming from step_medianimpute(). step_medianimpute() requires all the variables to be numeric but it is being passed factor variables with all_predictors().
One way to fix this problem is by rearranging your recipe to create dummy variables before you impute.
library(recipes)
library(modeldata)
data(ames)

df_recipe <-
  recipe(Central_Air ~ ., data = ames) %>%
  step_medianimpute(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

prep(df_recipe)
#> Error: All columns selected for the step should be numeric

df_recipe <-
  recipe(Central_Air ~ ., data = ames) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_medianimpute(all_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

prep(df_recipe)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 73
#>
#> Training data contained 2930 data points and no missing data.
#>
#> Operations:
#>
#> Dummy variables from MS_SubClass, MS_Zoning, Street, Alley, ... [trained]
#> Median Imputation for Lot_Frontage, Lot_Area, ... [trained]
#> Zero variance filter removed 2 items [trained]
#> Centering and scaling for Lot_Frontage, Lot_Area, ... [trained]
Created on 2021-04-27 by the reprex package (v1.0.0)
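As a side note, and assuming you are on a newer recipes release (an assumption about your setup): step_medianimpute() has since been renamed step_impute_median(), so the working version of the recipe would read:
df_recipe <-
  recipe(Central_Air ~ ., data = ames) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_impute_median(all_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())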

Fit a Mean forecasting model using tsibble, fable in R

(Using Orange dataset from library(Ecdat) for reproducibility.)
I am trying to fit a mean forecasting model in R using the tsibble and fable packages. The code below is pretty simple; however, I get the error Error in NCOL(x) : object 'value' not found when I run the last (model) part, even though value is a column name in o_ts, and I am not sure why that would be. I am following RJH's tutorial here (https://robjhyndman.com/hyndsight/fable/).
I would also appreciate any clarification on whether an ARIMA and a mean forecasting model are the same; if not, what function should I be using instead of Arima?
library(Ecdat)
library(tsibble)
library(feasts)
library(tidyverse)
library(fable)

o <- Orange
o_ts <- o %>% as_tsibble()

o_ts %>%
  filter(key == "priceoj") %>%
  model(
    arima = arima(value)
  )
arima is from the stats package. I believe you want ARIMA from fable.
o_ts %>%
  filter(key == "priceoj") %>%
  model(
    arima = ARIMA(value)
  )
#> # A mable: 1 x 2
#> # Key:     key [1]
#>   key                         arima
#>   <chr>                     <model>
#> 1 priceoj <ARIMA(1,1,0)(0,0,1)[12]>
If by a mean forecasting model you are referring to taking the mean of the last X observations (a moving average), then you should be using MEAN.
While ARIMA does include a moving-average component (AutoRegressive Integrated Moving Average), that refers to a weighted moving average of past forecast errors rather than of the observations themselves - you can read more here: 9.4 Moving average models in Forecasting: Principles and Practice.
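For reference, the MA(q) model that section describes writes the series as a constant plus a weighted combination of recent forecast errors:
y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}
where the \varepsilon terms are white-noise errors and the \theta coefficients are the weights, which is quite different from simply averaging past observations.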
o <- Orange
o_ts <- o %>% as_tsibble()

o_ts %>%
  filter(key == "priceoj") %>%
  model(mean = MEAN(value))
If you want to specify the number of observations to take the mean of, you need to add the special ~ window(size = X); otherwise all observations are used.
o_ts %>%
  filter(key == "priceoj") %>%
  model(mean = MEAN(value ~ window(size = 3)))

How to incorporate tidymodels PCA into the workflow of a model and make predictions

I am trying to incorporate a tidymodels PCA step into the workflow of a model. I want to have a predictive model that uses PCA as a preprocessing step and then make predictions with that model.
I have tried the following approach:
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <- vfold_cv(diamonds_train)

diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = T) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())

preprocesados <- prep(diamonds_recipe)

linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

lr_fitted_workflow <- pca_workflow %>% # option A: workflow, full dataset
  last_fit(diamonds_split)

performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes column.”
Following other tutorials I tried this other approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:
pca_fit <- juice(preprocesados) %>% # option C: no workflow at all
  lm(price ~ ., data = .)

prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
  select(price)

ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans), ans, truths)

ggplot(data = tib) +
  geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
  geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it produces something that seems reasonable, but by this point I have lost confidence and some guidance would be much appreciated. :D
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R, the function for fitting your model is fit(), and for it to work you'll have to provide the data for the fitting process (here diamonds). The tradeoff is that you don't have to prep() your recipe, as the workflow will take care of this itself.
So reducing your code slightly, the example below will work.
library(tidymodels)
data(diamonds)

diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = T) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())

linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

pca_fit <- fit(pca_workflow, data = diamonds_train)
As for cross-validation, one has to use fit_resamples() and should resample the training set, not the testing set. But here I am currently getting the same error (my answer will be updated if I figure out why).
Edit
Now I've done a bit of digging, and the problem with cross-validation stems from the engine being glmnet. I am guessing that, among the many different aspects, this one has somehow been missed. I've opened a possible issue on the workflows package GitHub site. The answers there are often quick in coming, so one of the developers will likely reply soon.
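If you do want to keep the glmnet engine, one workaround worth trying (my assumption based on the error text, not something confirmed in that issue) is to give linear_reg() a single numeric penalty, since glmnet fits a whole regularization path and needs one value picked out at prediction time:
linear_model_glmnet <-
  linear_reg(penalty = 0.001) %>%   # 0.001 is an arbitrary placeholder value
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow_glmnet <- update_model(pca_workflow, linear_model_glmnet)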
As for cross-validation, assuming you instead fit using any of the other engines described in ?linear_reg, we could do this as:
linear_model_base <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

pca_workflow <- update_model(pca_workflow, linear_model_base)

folds <- vfold_cv(diamonds_train, 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)
In the case where metrics are of interest, these can indeed be collected as you did, using collect_metrics():
pca_folds_fit %>% collect_metrics()
If we are interested in the predictions, you'll have to tell the model to save these during the fitting process and then use collect_predictions():
pca_folds_fit <- fit_resamples(pca_workflow,
                               resamples = folds,
                               control = control_resamples(save_pred = TRUE))

collect_predictions(pca_folds_fit)
Note, however, that the output from this is the predictions from each fold, as you are literally fitting 10 models.
Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (collect_metrics) would then be selected for use, and the test dataset would be used to evaluate that model's accuracy.
This is all described in TMwR chapters 10 & 11.
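To tie this back to your original goal of predicting on new data in its raw (non-PCA) form, here is a minimal sketch using the lm-engine workflow from above. The fitted workflow re-applies the recipe to new rows automatically; note that because of step_log(all_outcomes(), skip = TRUE) the predictions come back on the log scale of price.
final_fit <- fit(pca_workflow, data = diamonds_train)

# the recipe (normalization + PCA) is applied to the raw test rows automatically;
# .pred is log(price) because of step_log(all_outcomes(), skip = TRUE)
predict(final_fit, new_data = diamonds_test) %>%
  bind_cols(diamonds_test %>% select(price)) %>%
  head()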
