How to reduce the memory used by a trained workflow in R

I have trained a workflow with the tidymodels framework. The fit uses a random forest and its size is ~2.06 GB. After calling butcher::butcher() on it, it still occupies around 2 GB of memory. I get better results with extract_fit_parsnip(), which reduces the object size to ~1.86 GB.
Is there a way to reduce the memory used by the model any further? I only need a minimal object to make predictions, one that includes the recipe instructions for preprocessing.
To reproduce the example, you will need to download the data from this Kaggle competition.
library(tidymodels)

set.seed(112)
splits <- initial_split(bacteria)
bacteria_train <- training(splits)
bacteria_test <- testing(splits)

preprocessing <- recipe(target ~ ., data = bacteria_train) %>%
  update_role(row_id, new_role = "id") %>%
  step_zv(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors())

rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_wf <- workflow() %>%
  add_recipe(preprocessing) %>%
  add_model(rf_spec)

# Training takes around 13 minutes
rf_fit <- fit(rf_wf, data = bacteria_train)

rf_fit %>% lobstr::obj_size()
rf_fit %>% butcher::butcher() %>% lobstr::obj_size()
rf_fit %>% extract_fit_parsnip() %>% butcher::butcher() %>% lobstr::obj_size()

A random forest model can get very big, especially when you have a lot of predictors. This is just the nature of a random forest: every tree you trained must be stored, with all of its nodes and splits.
You can try reducing the number of trees. In fact, you can tune the number of trees and find the smallest forest that still works well.
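For instance, here is a minimal sketch of tuning trees over a small grid; the grid values, fold count, and object names are illustrative, not from the original post:

rf_spec_tune <- rand_forest(trees = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_wf_tune <- workflow() %>%
  add_recipe(preprocessing) %>%
  add_model(rf_spec_tune)

# try progressively smaller forests and keep the smallest that holds up
trees_grid <- tibble(trees = c(100, 250, 500, 1000))
folds <- vfold_cv(bacteria_train, v = 5)

rf_tuned <- tune_grid(rf_wf_tune, resamples = folds, grid = trees_grid)
show_best(rf_tuned, metric = "roc_auc")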
If you are using a random forest, you likely do not need to normalize, create principal components, etc. Random forests are known for being low-maintenance when it comes to feature engineering. Try using only your raw predictors and see how much improvement that feature engineering really gets you.
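For example, a bare-bones recipe that keeps the raw predictors (a sketch; only the id-role step is retained from the original recipe):

preprocessing_raw <- recipe(target ~ ., data = bacteria_train) %>%
  update_role(row_id, new_role = "id")

rf_wf_raw <- workflow() %>%
  add_recipe(preprocessing_raw) %>%
  add_model(rf_spec)

# compare resampled performance against the full-recipe workflow
rf_raw_res <- fit_resamples(rf_wf_raw, resamples = vfold_cv(bacteria_train, v = 5))
collect_metrics(rf_raw_res)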
You can also try out other types of models, which in my experience often have smaller memory footprints than a random forest. Some ideas to try include a bagged tree model and an xgboost model.
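Illustrative specs for those alternatives (a sketch; bag_tree() lives in the baguette package, and boost_tree() with the "xgboost" engine is part of parsnip):

library(baguette)

bag_spec <- bag_tree() %>%
  set_mode("classification") %>%
  set_engine("rpart")

xgb_spec <- boost_tree(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("xgboost")

Either spec can be swapped into the existing workflow with update_model() and refit, then compared with lobstr::obj_size().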

Related

Extracting estimates with ranger decision trees

I am getting the error Error: No tidy method for objects of class ranger when trying to extract the estimates for a regression model built with the ranger package in R.
Here is my code:
# libraries
library(tidymodels)
library(textrecipes)
library(LiblineaR)
library(ranger)
library(tidytext)

# create the recipe
comments.rec <- recipe(year ~ comments, data = oa.comments) %>%
  step_tokenize(comments, token = "ngrams", options = list(n = 2, n_min = 1)) %>%
  step_tokenfilter(comments, max_tokens = 1e3) %>%
  step_stopwords(comments, stopword_source = "stopwords-iso") %>%
  step_tfidf(comments) %>%
  step_normalize(all_predictors())

# workflow with recipe
comments.wf <- workflow() %>%
  add_recipe(comments.rec)

# create the regression model using support vector machine
svm.spec <- svm_linear() %>%
  set_engine("LiblineaR") %>%
  set_mode("regression")

svm.fit <- comments.wf %>%
  add_model(svm.spec) %>%
  fit(data = oa.comments)

# extract the estimates for the support vector machine model
svm.fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(-estimate)
Below is the table of estimates for each tokenized term in the data set (this is a dirty data set for demo purposes):
   term                     estimate
   <chr>                       <dbl>
 1 Bias                        2015.
 2 tfidf_comments_2021         0.877
 3 tfidf_comments_2019         0.851
 4 tfidf_comments_2020         0.712
 5 tfidf_comments_2018         0.641
 6 tfidf_comments_https        0.596
 7 tfidf_comments_plan s       0.462
 8 tfidf_comments_plan         0.417
 9 tfidf_comments_2017         0.399
10 tfidf_comments_libraries    0.286
However, when using the ranger engine to create a regression model from random forests, I have no such luck and get the error message above.
# create the regression model using random forests
rf.spec <- rand_forest(trees = 50) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf.fit <- comments.wf %>%
  add_model(rf.spec) %>%
  fit(data = oa.comments)

# extract the estimates for the random forests model
rf.fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(-estimate)
To put this back to you in a simpler form that I think highlights the issue: if you had a decision tree model, how would you produce coefficients for the data in the dataset? What would they mean?
I think what you are looking for here is some form of attribution to each column. There are tools to do this built into tidymodels, but you should read up on what they actually report.
You can get a basic idea of what those numbers would look like by using the vip package, though the numbers it produces are definitely not directly comparable to your SVM ones.
install.packages("vip")
library(vip)

rf.fit %>%
  pull_workflow_fit() %>%
  vip(geom = "point") +
  labs(title = "Random forest variable importance")
You'll produce a plot with relative importance scores. To get the numbers themselves:
rf.fit %>%
  pull_workflow_fit() %>%
  vi()
tidymodels has a decent walkthrough of doing this here, and given that you have a model that can estimate importance scores, you should be good to go:
Tidymodels tutorial page - 'a case study'
Edit: if you haven't done this, you may need to rerun your initial model with an extra argument passed during the set_engine() step of your code that tells ranger what kind of importance scores you are looking for and how they should be computed.
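For example (a sketch, not from the original answer; "impurity" and "permutation" are the two main importance modes ranger accepts):

rf.spec <- rand_forest(trees = 50) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("regression")

rf.fit <- comments.wf %>%
  add_model(rf.spec) %>%
  fit(data = oa.comments)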

All values of AUC ROC Curve 1 using tidymodels

Trying to fit a LASSO model with a binary outcome using tidymodels, I have essentially copied the case study from the tidymodels webpage (https://www.tidymodels.org/start/case-study/, the hotel stays dataset) and applied it to my own data, but for some reason all of the values on my area under the ROC curve come out at 1 (as you can see from the graph below). The only thing I have changed is the recipe (to try and suit my data):
recipe(outcome ~ ., data = df_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_medianimpute(all_predictors())
So I don't know if my recipe is incorrect or my data is unsuitable for some reason. As mentioned, I have a binary outcome and 68 predictors (59 factors and 9 numeric); some do have missing data, but I thought step_medianimpute would deal with that. Many thanks for any help anyone can offer.
[Plot: my AUC ROC curve]
Without seeing the data it is hard to know for sure, but your results indicate a couple of things.
Firstly, the AUC ROC of 1. An AUC ROC of 1 for a binary classification model indicates that the model can separate the two classes perfectly. This could either be a case of overfitting, or you may simply have linearly separable classes.
Secondly, the constant metric value across different values of penalty. For a LASSO model, as the penalty increases, more and more variables are shrunk to zero. In your case, for all the values of the penalty (if you are following the post, that is 10^(-4) through 10^(-1)) you see the same performance. That means that even with a penalty of 10^(-1) you still haven't shrunk enough predictors to hurt/change the performance. Reprex below.
set.seed(1234)
library(tidymodels)

response <- rep(c(0, 10), length.out = 1000)

data <- bind_cols(
  response = factor(response),
  map_dfc(seq_len(50), ~ rnorm(1000, response))
)

data_split <- initial_split(data)
data_train <- training(data_split)
data_test <- testing(data_split)

lasso_spec <- logistic_reg(mixture = 1, penalty = tune()) %>%
  set_engine("glmnet")

lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_formula(response ~ .)

data_folds <- vfold_cv(data_train)

param_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))

tune_res <- tune_grid(
  lasso_wf,
  resamples = data_folds,
  grid = param_grid
)

autoplot(tune_res)
What you can do is expand the range of penalties until the performance changes. Below we see that once the penalty is high enough, the last important predictors get shrunk to zero and we lose performance.
param_grid <- tibble(penalty = 10^seq(-1, 0, length.out = 30))

tune_res <- tune_grid(
  lasso_wf,
  resamples = data_folds,
  grid = param_grid
)

autoplot(tune_res)
To verify, we fit the model using one of the well-performing penalties, and we get perfect predictions.
lasso_final <- finalize_workflow(lasso_wf, select_best(tune_res))
lasso_final_fit <- fit(lasso_final, data = data_train)

augment(lasso_final_fit, new_data = data_train) %>%
  conf_mat(truth = response, estimate = .pred_class)
#>           Truth
#> Prediction   0  10
#>         0  375   0
#>         10   0 375
Created on 2021-05-08 by the reprex package (v2.0.0)

How to incorporate tidy models PCA into the workflow of a model and make predictions

I am trying to incorporate tidymodels PCA into the workflow of a model. I want a predictive model that uses PCA as a preprocessing step and then makes predictions with that model.
I have tried the following approach:
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <- vfold_cv(diamonds_train)

diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())

preprocesados <- prep(diamonds_recipe)

linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

lr_fitted_workflow <- pca_workflow %>% # option A: workflow, full dataset
  last_fit(diamonds_split)

performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes column.”
Following other tutorials, I tried this other approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:
pca_fit <- juice(preprocesados) %>% # option C: no workflow at all
  lm(price ~ ., data = .)

prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
  select(price)

ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans), ans, truths)

ggplot(data = tib) +
  geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
  geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it prints something that seems reasonable, but by this point I have lost confidence, and some guidance would be much appreciated. :D
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R, the function for fitting your model is fit(), and for it to work you'll have to provide the data for the fitting process (here diamonds). The trade-off is that you don't have to prep() your recipe, as the workflow will take care of this itself.
So, reducing your code slightly, the example below will work.
library(tidymodels)

data(diamonds)
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())

linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

pca_fit <- fit(pca_workflow, data = diamonds_train)
As for cross-validation, one has to use fit_resamples and should split the training set, not the testing set. But here I am currently getting the same error (my answer will be updated if I figure out why).
Edit
Now I've done a bit of digging, and the problem with cross-validation stems from the engine being glmnet. I am guessing that, of the many different aspects, this one has somehow been missed. I've added a possible issue to the workflows package GitHub site. The answers are often quick in coming, so one of the developers will likely reply soon.
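A likely workaround, based on my reading of that error message (not part of the original answer): the glmnet engine wants a single penalty value, so supplying one up front may let fit_resamples() run:

# the value 0.001 is an arbitrary placeholder penalty for illustration
linear_model_glmnet <-
  linear_reg(penalty = 0.001) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow_glmnet <- update_model(pca_workflow, linear_model_glmnet)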
As for cross-validation, assume you instead fit using any of the other engines described in ?linear_reg; then we could do this as:
linear_model_base <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

pca_workflow <- update_model(pca_workflow, linear_model_base)

folds <- vfold_cv(diamonds_train, 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)
And in cases where metrics are of interest, these can indeed be collected as you did, using collect_metrics:
pca_folds_fit %>% collect_metrics()
If we are interested in the predictions, you'll have to tell the model to save these during the fitting process, and then use collect_predictions:
pca_folds_fit <- fit_resamples(
  pca_workflow,
  resamples = folds,
  control = control_resamples(save_pred = TRUE)
)
collect_predictions(pca_folds_fit)
Note, however, that the output from this is the predictions from each fold, as you are literally fitting 10 models.
Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (collect_metrics) would then be selected for use, and the test dataset would be used to evaluate this model's accuracy.
This is all described in TMwR chapters 10 & 11; a sketch of that final step follows.
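A sketch of that last step in this example's terms (my addition, assuming the lm workflow came out on top in the comparison):

# fit on the full training set, evaluate once on the held-out test set
final_fit <- last_fit(pca_workflow, split = diamonds_split)
collect_metrics(final_fit)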

How to reverse a `recipe` from `recipes` package?

Suppose I am creating a recipe for my machine learning model, and I need to preprocess my outcome.
How do I reverse the preprocessing of my outcome or my predictors?
If I preprocess my outcome, how do I return the output of a model to the original scale?
library(recipes)

biomass <- biomass

rec <- biomass %>%
  recipe(carbon ~ hydrogen) %>%
  step_BoxCox(all_outcomes()) %>%
  prep()

biomass_box <- rec %>% bake(biomass)
In this example I have applied a Box-Cox transformation to my outcome. How do I get biomass_box$carbon back to its original values? recipes may have an easy way of undoing it, but I've been unable to find it.
Have you tried using step_inverse()?
library(recipes)

biomass <- biomass

rec <- biomass %>%
  recipe(carbon ~ hydrogen) %>%
  step_BoxCox(all_outcomes()) %>%
  step_inverse(all_predictors()) %>%
  prep()
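Another option, not from the original answer: invert the Box-Cox transform by hand, assuming the estimated lambda can be pulled out of the prepped step with tidy().

# inverse Box-Cox: (lambda * y + 1)^(1 / lambda) for lambda != 0, exp(y) for lambda == 0
lambda <- tidy(rec, number = 1)$value
carbon_original <- if (abs(lambda) < 1e-6) {
  exp(biomass_box$carbon)
} else {
  (lambda * biomass_box$carbon + 1)^(1 / lambda)
}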

Fitting several regression models after group_by with dplyr and applying the resulting models into test sets

I have a big dataset that I want to partition based on the values of a particular variable (in my case lifetime), and then run logistic regression on each partition. Following the answer of @tchakravarty in Fitting several regression models with dplyr, I wrote the following code:
lifetimemodels <- data %>%
  group_by(lifetime) %>%
  sample_frac(0.7) %>%
  do(lifeModel = glm(churn ~ ., x = TRUE, family = binomial(link = "logit"), data = .))
My question now is: how can I use the resulting logistic models to compute the AUC on the rest of the data (the 0.3 fraction that was not chosen), which should again be grouped by lifetime?
Thanks a lot in advance!
You could adapt your dplyr approach to use the tidyr and purrr framework. Look at grouping/nesting, and at the mutate and map functions, to create list frames that store the pieces of your workflow.
The test/training split you are looking for is part of modelr, a package built to assist modelling within the purrr framework, specifically the crossv_mc and crossv_kfold functions.
A toy example using mtcars (just to illustrate the framework).
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

analysis <- mtcars %>%
  nest(-cyl) %>%
  unnest(map(data, ~ crossv_mc(.x, 1, test = 0.3))) %>%
  mutate(model = map(train, ~ lm(mpg ~ wt, data = .x))) %>%
  mutate(pred = map2(model, train, predict)) %>%
  mutate(error = map2_dbl(model, test, rmse))
This:
takes mtcars,
nests it into a list frame called data by cyl,
separates each data element into a training and test set by mapping crossv_mc onto each element, then uses unnest to make the test and train list columns,
maps the lm model to each train set, storing that in model,
maps the predict function to model and train, storing the result in pred,
maps the rmse function to model and the test sets, storing the result in error.
There are probably users out there more familiar than me with this workflow, so please correct/elaborate.
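To bring it back to the asker's setting (logistic models and AUC), here is a hedged sketch along the same lines; it assumes a binary churn column and uses pROC for the AUC computation, which is one package choice among several:

library(pROC)

auc_by_lifetime <- data %>%
  nest(-lifetime) %>%
  unnest(map(data, ~ crossv_mc(.x, 1, test = 0.3))) %>%
  mutate(model = map(train, ~ glm(churn ~ ., family = binomial(link = "logit"),
                                  data = as.data.frame(.x)))) %>%
  mutate(auc = map2_dbl(model, test, ~ as.numeric(
    auc(as.data.frame(.y)$churn,
        predict(.x, newdata = as.data.frame(.y), type = "response"))
  )))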
