The code below works correctly and has no errors that I know of, but I want to add more to it.
The two things I want to add are:
1 - The model's predictions on the training data, added to the final plot. I want to run collect_predictions() on the model fitted to the training data.
2 - Code to view the model's metrics on the training data. I want to run collect_metrics() on the model fitted to the training data.
How do I get this information?
# Setup
library(tidyverse)
library(tidymodels)
parks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-22/parks.csv')
modeling_df <- parks %>%
select(pct_near_park_data, spend_per_resident_data, med_park_size_data) %>%
rename(nearness = "pct_near_park_data",
spending = "spend_per_resident_data",
acres = "med_park_size_data") %>%
mutate(nearness = (parse_number(nearness)/100)) %>%
mutate(spending = parse_number(spending))
# Start building models
set.seed(123)
park_split <- initial_split(modeling_df)
park_train <- training(park_split)
park_test <- testing(park_split)
tree_rec <- recipe(nearness ~., data = park_train)
tree_prep <- prep(tree_rec)
juiced <- juice(tree_prep)
tune_spec <- rand_forest(
mtry = tune(),
trees = 1000,
min_n = tune()
) %>%
set_mode("regression") %>%
set_engine("ranger")
tune_wf <- workflow() %>%
add_recipe(tree_rec) %>%
add_model(tune_spec)
set.seed(234)
park_folds <- vfold_cv(park_train)
# Make a grid of various different models
doParallel::registerDoParallel()
set.seed(345)
tune_res <- tune_grid(
tune_wf,
resamples = park_folds,
grid = 20,
control = control_grid(verbose = TRUE)
)
best_rmse <- select_best(tune_res, "rmse")
# Finalize a model with the best grid
final_rf <- finalize_model(
tune_spec,
best_rmse
)
final_wf <- workflow() %>%
add_recipe(tree_rec) %>%
add_model(final_rf)
final_res <- final_wf %>%
last_fit(park_split)
# Visualize the performance
# My issue here is that this is only the testing data
# How can I also get this model's performance on the training data?
# I want to plot both with a facet_wrap or a color indication, as well as numerically see the difference with collect_metrics()
final_res %>%
collect_predictions() %>%
ggplot(aes(nearness, .pred)) +
geom_point() +
geom_abline()
What you can do is pull out the trained workflow object from final_res and use that to create predictions on the training data set.
final_model <- final_res$.workflow[[1]]
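(On recent versions of tune, extract_workflow() should pull out the same fitted workflow and reads a little more clearly:)
final_model <- extract_workflow(final_res)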
Now you can use augment() on the test and training data sets to visualize the performance.
final_model %>%
augment(new_data = park_test) %>%
ggplot(aes(nearness, .pred)) +
geom_point() +
geom_abline()
final_model %>%
augment(new_data = park_train) %>%
ggplot(aes(nearness, .pred)) +
geom_point() +
geom_abline()
You can also combine the results with bind_rows() so you can compare more easily.
all_predictions <- bind_rows(
augment(final_model, new_data = park_train) %>%
mutate(type = "train"),
augment(final_model, new_data = park_test) %>%
mutate(type = "test")
)
all_predictions %>%
ggplot(aes(nearness, .pred)) +
geom_point() +
geom_abline() +
facet_wrap(~type)
All the yardstick metric functions work on grouped data frames as well.
all_predictions %>%
group_by(type) %>%
metrics(nearness, .pred)
#> # A tibble: 6 x 4
#> type .metric .estimator .estimate
#> <chr> <chr> <chr> <dbl>
#> 1 test rmse standard 0.0985
#> 2 train rmse standard 0.0473
#> 3 test rsq standard 0.725
#> 4 train rsq standard 0.943
#> 5 test mae standard 0.0706
#> 6 train mae standard 0.0350
Created on 2021-06-24 by the reprex package (v2.0.0)
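If you specifically want output shaped like collect_metrics() for the training data, one option is to build a yardstick metric set that mirrors the default regression metrics (rmse and rsq) and apply it to the augmented training set; a small sketch using the objects above:
reg_metrics <- metric_set(rmse, rsq)
final_model %>%
  augment(new_data = park_train) %>%
  reg_metrics(truth = nearness, estimate = .pred)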
Related
I am fitting a random forest model using tidymodels in R, and an error occurs when I try to predict the test set using the tuned model: Each element of splits must be an rsplit object.
# Data splitting
data(Sacramento, package = "modeldata")
set.seed(123)
data_split <- initial_split(Sacramento, prop = 0.75, strata = price)
Sac_train <- training(data_split)
Sac_test <- testing(data_split)
# Build the model
rf_mod <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("regression")
# Create the recipe
Sac_recipe <- recipe(price ~ ., data = Sac_train) %>%
step_rm(zip, latitude, longitude) %>%
step_corr(all_numeric_predictors(), threshold = 0.85) %>%
step_zv(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
# Create the workflow
rf_workflow <- workflow() %>%
add_model(rf_mod) %>%
add_recipe(Sac_recipe)
# Train and Tune the model
set.seed(123)
Sac_folds <- vfold_cv(Sac_train, v = 10, repeats = 2, strata = price)
rf_res <- rf_workflow %>%
tune_grid(grid = 2*2,
resamples = Sac_folds,
control = control_grid(save_pred = TRUE),
metrics = metric_set(rmse))
# Extract the best model
rf_best <- rf_res %>%
select_best(metric = "rmse")
# Last fit
last_rf_workflow <- rf_workflow %>%
finalize_workflow(rf_best)
last_rf_fit <- last_rf_workflow %>%
last_fit(Sac_train)
# Error: Each element of `splits` must be an `rsplit` object.
predict(last_rf_fit, Sac_test, type = "conf_int")
The error comes from these lines:
last_rf_fit <- last_rf_workflow %>%
last_fit(Sac_train)
Now, from the documentation of last_fit():
# S3 method for workflow
last_fit(object, split, ..., metrics = NULL, control = control_last_fit())
So a workflow object is passed to last_fit() as the first argument via %>%, and Sac_train is passed to the split parameter.
But from the docs, the split argument needs to be:
An rsplit object created from rsample::initial_split()
So instead, try this:
last_rf_fit <- last_rf_workflow %>%
last_fit(data_split)
Then, to collect the predictions, following the docs:
collect_predictions(last_rf_fit)
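And if you want predictions on data beyond what collect_predictions() returns, you can pull the fitted workflow out of the last_fit() result and predict with it directly (whether type = "conf_int" is available depends on the engine and how it was fitted); a small sketch:
fitted_wf <- extract_workflow(last_rf_fit)
predict(fitted_wf, Sac_test)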
I am working on a classification model to predict building age. I want to train my random forest models by groups (suburbs) within the larger dataset.
I've used this as the basis of the code below.
My question is: how should I write the code to train and record the hyperparameters for each suburb?
age.rf <- rand_forest(
mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_mode("classification") %>%
set_engine("ranger")
age.workflow <- workflow() %>%
add_model(age.rf)
### function for model fitting and predicting
age.predict <- function(df) {
# split the dataset
set.seed(1)
split <- initial_split(df)
train_df <- training(df)
test_df <- testing(df)
# create recipe
age.recipe <- recipe(decade_built ~ .,
data = train_df) %>%
update_role(bld_index, new_role = "ID") %>%
step_dummy(all_nominal_predictors(), -has_role("ID")) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
prep()
# hyperparameters
age.randgrid_rf <- grid_random(mtry(c(1,20)),
trees(),
min_n(),
size = 10)
ctrl <- control_grid(save_pred = T, extract = extract_model)
age_folds <- vfold_cv(train_df, strata = "suburb", v = 10)
age.tunerandom_rf <- age.workflow %>%
tune_grid(resamples = age_folds,
grid = age.randgrid_rf,
control = ctrl)
# best parameters
age.params_rf <- select_best(age.tunerandom_rf)
# finalise model
age.final_rf <- finalize_model(age.spec_rf, age.params_rf)
age.workflowfinal_rf <- workflow() %>%
add_recipe(age.recipe) %>%
add_model(age.final_rf)
# predict on test data
predict(age.workflowfinal_rf, test_df)
}
age_nested <- final.df %>%
group_by(suburb) %>%
nest()
age.preds <- age_nested %>%
mutate(prediction = map(data, possibly(age.predict, otherwise = NA)))
I've mapped out the dataset using the nest() function, and followed the workflow based on Julia's post on another page.
Any help identifying how to get the hyperparameters, as well as how to apply them to the individual models for each group, would be much appreciated.
At the moment, my output is NA.
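For what it's worth, a couple of things in age.predict() look like they would error and then get swallowed by possibly(), which is why you end up with NA: training()/testing() need the rsplit object rather than the raw data frame, the workflow needs a recipe added before it can be tuned, and the finalized workflow has to be fitted before predict(). A rough sketch of the function that also returns the selected hyperparameters alongside the predictions (a sketch only, not tested on your data):
age.predict <- function(df) {
  set.seed(1)
  split <- initial_split(df)
  train_df <- training(split)  # training()/testing() take the rsplit, not the data frame
  test_df <- testing(split)
  age.recipe <- recipe(decade_built ~ ., data = train_df) %>%
    update_role(bld_index, new_role = "ID") %>%
    step_dummy(all_nominal_predictors(), -has_role("ID")) %>%
    step_zv(all_predictors()) %>%
    step_normalize(all_numeric_predictors())
  age.randgrid_rf <- grid_random(mtry(c(1, 20)), trees(), min_n(), size = 10)
  age_folds <- vfold_cv(train_df, v = 10)  # no strata = "suburb": each nested data frame holds one suburb
  age.tunerandom_rf <- age.workflow %>%
    add_recipe(age.recipe) %>%  # the workflow needs a preprocessor before tuning
    tune_grid(resamples = age_folds, grid = age.randgrid_rf)
  # best hyperparameters for this suburb
  age.params_rf <- select_best(age.tunerandom_rf, metric = "accuracy")
  age.final_wf <- finalize_workflow(age.workflow %>% add_recipe(age.recipe), age.params_rf)
  age.fit <- fit(age.final_wf, data = train_df)
  # return the tuned hyperparameters together with the predictions
  list(params = age.params_rf, preds = predict(age.fit, test_df))
}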
I am using tidymodels to fit multiple Random Forest models. I then followed along with this tutorial to compare the model results. The problem is that I get the error:
Error in UseMethod("anova") : no applicable method for 'anova' applied to an object of class "ranger"
As an example:
set.seed(123)
iris <- iris %>% mutate(
is_versicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")) %>%
mutate(is_versicolor = factor(is_versicolor, levels = c("versicolor", "not_versicolor")))
iris_split <- initial_split(iris, strata = is_versicolor, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
rec_normal <- recipe(is_versicolor ~ Petal.Width + Species, data = iris_train)
rec_interaction <- rec_normal %>%
step_interact(~ Petal.Width:starts_with("Species"))
iris_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification")
# normal workflow
iris_wf <- workflow() %>%
add_model(iris_model) %>%
add_recipe(rec_normal)
# interaction workflow
iris_wf_interaction <- iris_wf %>%
update_recipe(rec_interaction)
# fit models
iris_normal_lf <- last_fit(iris_wf, split = iris_split)
iris_inter_lf <- last_fit(iris_wf_interaction, split = iris_split)
normalmodel <- iris_normal_lf %>% extract_fit_engine()
intermodel <- iris_inter_lf %>% extract_fit_engine()
anova(normalmodel, intermodel) %>% tidy()
How can I run an ANOVA or ANOVA-type comparison of these models, to see if one is significantly better?
Just using your code, and adapting Julia Silge's blog on workflowsets:
Predict #TidyTuesday giant pumpkin weights with workflowsets
As ANOVA is not available for ranger, instead generate folds to resample:
set.seed(234)
iris_folds <- vfold_cv(iris_train)
iris_folds
Combine your recipes into a workflowset:
iris_set <-
workflow_set(
list(rec_normal, rec_interaction),
list(iris_model),
cross = TRUE
)
iris_set
Set up parallel processing:
doParallel::registerDoParallel()
set.seed(2021)
Fit using the folds:
iris_rs <-
workflow_map(
iris_set,
"fit_resamples",
resamples = iris_folds
)
autoplot(iris_rs)
This chart would usually address your question of how to compare models.
However, as Species is on the right-hand side of both recipe formulas, and the response is_versicolor is calculated from Species, the models are completely accurate.
Finish off the output:
collect_metrics(iris_rs)
final_fit <-
extract_workflow(iris_rs, "recipe_2_rand_forest") %>%
fit(iris_train)
There is no tidier for ranger models.
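You can still inspect the underlying ranger fit directly if you need to, for example to see its out-of-bag prediction error:
final_fit %>%
  extract_fit_engine()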
In your code, if you change to:
rec_normal <- recipe(is_versicolor ~ Sepal.Length + Sepal.Width, data = iris_train)
rec_interaction <- recipe(is_versicolor ~ Petal.Width + Petal.Length, data = iris_train)
you can have some fun!
Hope this helps Adam. Just learning the wonderful Tidymodels like you, and look forward to comments. :-)
You could compare your random forest models by comparing their accuracies using the aov() function. First, collect the accuracies with collect_metrics(), save them in a data frame, and run a model with aov() to get the results. Here is a reproducible example:
library(tidymodels)
set.seed(123)
iris <- iris %>% mutate(
is_versicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")) %>%
mutate(is_versicolor = factor(is_versicolor, levels = c("versicolor", "not_versicolor")))
iris_split <- initial_split(iris, strata = is_versicolor, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
rec_normal <- recipe(is_versicolor ~ Petal.Width + Species, data = iris_train)
rec_interaction <- rec_normal %>%
step_interact(~ Petal.Width:starts_with("Species"))
iris_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification")
# normal workflow
iris_wf <- workflow() %>%
add_model(iris_model) %>%
add_recipe(rec_normal)
# interaction workflow
iris_wf_interaction <- iris_wf %>%
update_recipe(rec_interaction)
# fit models
iris_normal_lf <- last_fit(iris_wf, split = iris_split)
iris_inter_lf <- last_fit(iris_wf_interaction, split = iris_split)
#> ! train/test split: preprocessor 1/1: Categorical variables used in `step_interact` should probably be avoided...
normalmodel <- iris_normal_lf %>% extract_fit_engine()
intermodel <- iris_inter_lf %>% extract_fit_engine()
# Check confusion matrix
iris_normal_lf %>%
collect_predictions() %>%
conf_mat(is_versicolor, .pred_class)
#> Truth
#> Prediction versicolor not_versicolor
#> versicolor 10 0
#> not_versicolor 0 20
iris_inter_lf %>%
collect_predictions() %>%
conf_mat(is_versicolor, .pred_class)
#> Truth
#> Prediction versicolor not_versicolor
#> versicolor 10 0
#> not_versicolor 0 20
# Extract accuracy of models and create dataset
acc_normalmodel <- iris_normal_lf %>% collect_metrics() %>% select(.estimate) %>% slice(1)
acc_intermodel <- iris_inter_lf %>% collect_metrics() %>% select(.estimate) %>% slice(1)
results = data.frame(model = c("normalmodel", "intermodel"),
accuracy = c(acc_normalmodel$.estimate, acc_intermodel$.estimate))
# perform ANOVA on the classification accuracy
aov_results <- aov(accuracy ~ model, data = results)
summary(aov_results)
#> Df Sum Sq Mean Sq
#> model 1 4.93e-32 4.93e-32
Created on 2022-12-15 with reprex v2.0.2
As you can see, the results don't show a p-value, because the degrees of freedom are too low: with a single accuracy value per model there are no residual degrees of freedom left (why do I not get a p-value from this anova in r).
You could also use aov() on the predictions of both models and compare their performance. Here is a reproducible example:
# Get predictions of both models for not_versicolor
normalmodel_pred<-as.data.frame(normalmodel$predictions)$not_versicolor
intermodel_pred<-as.data.frame(intermodel$predictions)$not_versicolor
summary(aov(normalmodel_pred~intermodel_pred))
#> Df Sum Sq Mean Sq F value Pr(>F)
#> intermodel_pred 1 25.032 25.032 9392 <2e-16 ***
#> Residuals 118 0.314 0.003
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2022-12-17 with reprex v2.0.2
As you can see, the p-value is less than 0.05, which suggests that there is a difference between the predictions of the two models; this also matches what you see if you look at the predicted probabilities themselves.
For more information about ANOVA, check this:
Chapter 7 Understanding ANOVA in R
Here is an approach using a different model pair, comparing the models on classification accuracy using resamples. It is easily extended to other metrics.
library(dplyr)
library(tibble)
library(ggplot2)
library(tidyr)
library(rsample)
library(recipes)
library(parsnip)
library(workflows)
library(tune)
library(yardstick)
library(workflowsets)
set.seed(123)
iris <- iris %>% mutate(
is_versicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")) %>%
mutate(is_versicolor = factor(is_versicolor, levels = c("versicolor", "not_versicolor")))
iris_split <- initial_split(iris, strata = is_versicolor, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
# replacing normal and interaction recipes with models
# that give less than 100% accuracy.
rec_normal <- recipe(is_versicolor ~ Sepal.Width, data = iris_train)
rec_alternative <- recipe(is_versicolor ~ Sepal.Length, data = iris_train)
iris_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification")
# Create folds
set.seed(234)
iris_folds <- vfold_cv(iris_train)
iris_folds
# Combine models into set
iris_set <-
workflow_set(
list(rec_normal, rec_alternative),
list(iris_model),
cross = TRUE
)
doParallel::registerDoParallel()
set.seed(2021)
# fit models
iris_rs <-
workflow_map(
iris_set,
"fit_resamples",
resamples = iris_folds
)
# Visualise model performance
autoplot(iris_rs)
# Extract resample accuracies
model_1_rs <- iris_rs[1,][[4]][[1]]$.metrics
model_2_rs <- iris_rs[2,][[4]][[1]]$.metrics
model_acc <- tibble(model_1 = NA, model_2 = NA)
for (i in 1:10) {
model_acc[i, 1] <- model_1_rs[[i]][[".estimate"]][1]
model_acc[i, 2] <- model_2_rs[[i]][[".estimate"]][1]
}
model_acc <- model_acc |> pivot_longer(cols = everything(), names_to = "model", values_to = "acc")
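As an aside, you can usually get the same per-fold accuracies without indexing into the workflow set object, via collect_metrics() with summarize = FALSE; a sketch (the model names will be the wflow_id values rather than model_1/model_2):
iris_rs %>%
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "accuracy") %>%
  select(model = wflow_id, acc = .estimate)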
# Do ANOVA
aov_results <- aov(acc ~ model, data = model_acc)
summary(aov_results)
ggplot(data = model_acc, aes(fill = model)) +
geom_density(aes(x = acc), alpha = 0.2) +
labs(x = "accuracy")
Giving the p values:
> summary(aov_results)
Df Sum Sq Mean Sq F value Pr(>F)
model 1 0.0281 0.02813 1.378 0.256
Residuals 18 0.3674 0.02041
Looking at the p values of the model accuracies using a different lens:
First visualise the variation:
model_acc |> ggplot(aes(x = model, y = acc)) +
geom_boxplot() +
labs(y = 'accuracy')
Then calculate a test statistic (specify(), calculate(), and the other functions below come from the infer package):
library(infer)
observed_statistic <- model_acc %>%
specify(acc ~ model) %>%
calculate(stat = "diff in means", order = c("model_1", "model_2"))
observed_statistic
Then do a simulation of the distribution:
null_dist_2_sample <- model_acc %>%
specify(acc ~ model) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means" ,order = c("model_1", "model_2"))
and plot:
null_dist_2_sample %>%
visualize() +
shade_p_value(observed_statistic,
direction = "two-sided") +
labs(x = "test statistic")
and get the p value:
p_value_2_sample <- null_dist_2_sample %>%
get_p_value(obs_stat = observed_statistic,
direction = "two-sided")
p_value_2_sample
# A tibble: 1 × 1
p_value
<dbl>
1 0.228
This is almost the same as the p-value from the aov.
Note that, consistent with the accuracies of the two models being close, the p-value is high.
bos <- read_csv("boston_train.csv") %>% clean_names()
bos %>%
mutate_if(is.character, factor) -> bos
Then I split the data and created the k-folds:
# -- set a random seed for repeatablity
set.seed(42)
# -- performs our train / test split
split <- initial_split(bos, prop = 0.7)
# -- extract the training data from our banana split
train <- training(split)
# -- extract the test data
test <- testing(split)
tree_fold <- vfold_cv(train, 10)
sprintf("Train PCT : %1.2f%%", nrow(train)/ nrow(bos) * 100)
sprintf("Test PCT : %1.2f%%", nrow(test)/ nrow(bos) * 100)
My target variable is continuous, so I need my random forest to do regression.
# recipe
rf_recipe <- recipe(av_total ~ ., data=train) %>%
step_rm(pid, zipcode) %>%
step_meanimpute(all_numeric(), -all_outcomes()) %>%
step_log(all_numeric()) %>%
step_modeimpute(all_nominal(),-all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
#tuning parameters
rf_model <- rand_forest(
mtry = tune(),
trees = 10,
min_n= tune()
) %>%
set_engine("ranger",
importance = "permutation") %>%
set_mode("regression")
rf_wf <- workflow() %>%
add_recipe(rf_recipe) %>%
add_model(rf_model)
rf_grid <- grid_random(mtry(c(5,7)),
min_n(c(15,20)),
size = 10)
# do parallel
all_cores <- detectCores(logical = TRUE)
sprintf("# of Logical Cores: %d", all_cores)
cl <- makeCluster(all_cores)
registerDoParallel(cl)
Then I got the error; no matter how I changed my recipe or tuning process, it was still there:
set.seed(52)
rf_tune_rs <- rf_wf %>%
tune_grid(
resamples = tree_fold,
grid = rf_grid,
control = control_resamples(save_pred = TRUE)
)
I fixed it by adding a step_unknown() term to my recipe.
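For reference, a minimal sketch of where that step might slot into the recipe above, assuming the error came from missing values in the nominal predictors (step_unknown() recodes missing factor values to an "unknown" level):
rf_recipe <- recipe(av_total ~ ., data = train) %>%
  step_rm(pid, zipcode) %>%
  step_unknown(all_nominal(), -all_outcomes()) %>%  # recode missing factor values to "unknown"
  step_meanimpute(all_numeric(), -all_outcomes()) %>%
  step_log(all_numeric()) %>%
  step_modeimpute(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())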
I'm using tidymodels to fit a PLS model but I'm struggling to find the PLS variable importance scores or coefficients.
This is what I've tried so far; the example data is from AppliedPredictiveModeling package.
Model fitting
data(ChemicalManufacturingProcess)
split <- ChemicalManufacturingProcess %>% initial_split(prop = 0.7)
train <- training(split)
test <- testing(split)
tidy_rec <- recipe(Yield ~ ., data = train) %>%
step_knnimpute(all_predictors()) %>%
step_BoxCox(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_predictors())
boots <- bootstraps(times = 25, data = train)
tidy_model <- plsmod::pls(num_comp = tune()) %>%
set_mode("regression") %>%
set_engine("mixOmics")
tidy_grid <- expand.grid(num_comp = seq(from = 1, to = 48, by = 5))
tidy_tune <- tidy_model %>% tune_grid(
preprocessor = tidy_rec,
grid = tidy_grid,
resamples = boots,
metrics = metric_set(mae, rmse, rsq)
)
tidy_best <- tidy_tune %>% select_best("rsq")
Final_model <- tidy_model %>% finalize_model(tidy_best)
tidy_wf <- workflow() %>%
add_model(Final_model) %>%
add_recipe(tidy_rec)
Fit_PLS <- tidy_wf %>% fit(data = train)
# check the most important predictors
tidy_info <- Fit_PLS %>% pull_workflow_fit()
loadings <- tidy_info$fit$loadings$X
PLS variable importance
tidy_load <- loadings %>% as.data.frame() %>% rownames_to_column() %>%
select(rowname, comp1, comp2, comp3) %>%
pivot_longer(-rowname) %>%
rename(predictors = rowname)
tidy_load %>% mutate(Sing = if_else(value < 0, "neg", "pos")) %>%
mutate(absvalue = abs(value)) %>% group_by(predictors) %>% summarise(Importance = sum(absvalue)) %>%
mutate(predictors = fct_reorder(predictors, Importance)) %>%
slice_head(n = 15) %>%
ggplot(aes(Importance, predictors, fill = predictors)) + geom_col(show.legend = F)
Thanks! The vi() function from the vip package is not available for this model.
You can directly tidy() the output of the PLS model to get the coefficients:
library(tidymodels)
library(tidyverse)
library(plsmod)
data(ChemicalManufacturingProcess, package = "AppliedPredictiveModeling")
split <- initial_split(ChemicalManufacturingProcess, prop = 0.7)
train <- training(split)
test <- testing(split)
chem_rec <- recipe(Yield ~ ., data = train) %>%
step_knnimpute(all_predictors()) %>%
step_BoxCox(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_predictors())
pls_spec <- pls(num_comp = 4) %>% ## can tune instead to find the optimal number
set_mode("regression") %>%
set_engine("mixOmics")
wf <- workflow() %>%
add_recipe(chem_rec) %>%
add_model(pls_spec)
pls_fit <- fit(wf, train)
## tidy the fitted model
tidy_pls <- pls_fit %>%
pull_workflow_fit() %>%
tidy()
tidy_pls
#> # A tibble: 192 x 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 BiologicalMaterial01 0.193 predictors 1
#> 2 BiologicalMaterial01 -0.247 predictors 2
#> 3 BiologicalMaterial01 0.00969 predictors 3
#> 4 BiologicalMaterial01 0.0228 predictors 4
#> 5 BiologicalMaterial03 0.249 predictors 1
#> 6 BiologicalMaterial03 -0.00118 predictors 2
#> 7 BiologicalMaterial03 0.0780 predictors 3
#> 8 BiologicalMaterial03 -0.0866 predictors 4
#> 9 BiologicalMaterial04 0.217 predictors 1
#> 10 BiologicalMaterial04 -0.192 predictors 2
#> # … with 182 more rows
tidy_pls %>%
filter(term != "Y") %>%
group_by(component) %>%
slice_max(abs(value), n = 10) %>%
ungroup() %>%
ggplot(aes(value, fct_reorder(term, value), fill = factor(component))) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, scales = "free_y") +
labs(y = NULL)
Created on 2020-10-19 by the reprex package (v0.3.0.9001)
I showed this without tuning the number of components, but it works about the same with tuning.
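If you do tune num_comp, the only extra step is to finalize the workflow with the selected value before fitting and tidying; roughly (here pls_tune stands in for the result of tune_grid() on this workflow):
best_pls <- select_best(pls_tune, "rsq")  # pls_tune: placeholder for your tune_grid() results
final_wf <- finalize_workflow(wf, best_pls)
fit(final_wf, train) %>%
  pull_workflow_fit() %>%
  tidy()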