I am using tidymodels to fit multiple random forest models. I then followed along with this tutorial to compare the model results. The problem is that I get the error:
Error in UseMethod("anova") :
no applicable method for 'anova' applied to an object of class "ranger"
As an example:
library(tidymodels)
set.seed(123)
iris <- iris %>% mutate(
is_versicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")) %>%
mutate(is_versicolor = factor(is_versicolor, levels = c("versicolor", "not_versicolor")))
iris_split <- initial_split(iris, strata = is_versicolor, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
rec_normal <- recipe(is_versicolor ~ Petal.Width + Species, data = iris_train)
rec_interaction <- rec_normal %>%
step_interact(~ Petal.Width:starts_with("Species"))
iris_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification")
# normal workflow
iris_wf <- workflow() %>%
add_model(iris_model) %>%
add_recipe(rec_normal)
# interaction workflow
iris_wf_interaction <- iris_wf %>%
update_recipe(rec_interaction)
# fit models
iris_normal_lf <- last_fit(iris_wf, split = iris_split)
iris_inter_lf <- last_fit(iris_wf_interaction, split = iris_split)
normalmodel <- iris_normal_lf %>% extract_fit_engine()
intermodel <- iris_inter_lf %>% extract_fit_engine()
anova(normalmodel, intermodel) %>% tidy()
How can I run an ANOVA or ANOVA-type comparison of these models, to see if one is significantly better?
Just using your code, and adapting Julia Silge's blog on workflowsets:
Predict #TidyTuesday giant pumpkin weights with workflowsets
As ANOVA is not available for ranger, instead generate folds to resample:
set.seed(234)
iris_folds <- vfold_cv(iris_train)
iris_folds
Combine your recipes into a workflowset:
iris_set <-
workflow_set(
list(rec_normal, rec_interaction),
list(iris_model),
cross = TRUE
)
iris_set
Set up parallel processing:
doParallel::registerDoParallel()
set.seed(2021)
Fit using the folds:
iris_rs <-
workflow_map(
iris_set,
"fit_resamples",
resamples = iris_folds
)
autoplot(iris_rs)
This chart would usually address your question of how the models compare.
As Species is on the right-hand side of both recipe formulas, and the response is_versicolor is calculated from Species, both models are perfectly accurate.
Finish off the output:
collect_metrics(iris_rs)
final_fit <-
extract_workflow(iris_rs, "recipe_2_rand_forest") %>%
fit(iris_train)
There is no tidier for ranger models.
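If you want to inspect the fitted model anyway, you can reach the underlying ranger object directly; a minimal sketch (extract_fit_engine() returns the raw ranger fit):
final_ranger <- extract_fit_engine(final_fit)
final_ranger$prediction.error  # out-of-bag prediction error
final_ranger$num.trees         # number of trees grown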
In your code, if you change to:
rec_normal <- recipe(is_versicolor ~ Sepal.Length + Sepal.Width, data = iris_train)
rec_interaction <- recipe(is_versicolor ~ Petal.Width + Petal.Length, data = iris_train)
you can have some fun!
Hope this helps, Adam. I'm just learning the wonderful tidymodels like you, and I look forward to comments. :-)
You could compare your random forest models by comparing their accuracies with the aov function. First, collect the accuracies with collect_metrics(), save them in a data frame, and then fit a model with aov() to get the results. Here is a reproducible example:
library(tidymodels)
set.seed(123)
iris <- iris %>% mutate(
is_versicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")) %>%
mutate(is_versicolor = factor(is_versicolor, levels = c("versicolor", "not_versicolor")))
iris_split <- initial_split(iris, strata = is_versicolor, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
rec_normal <- recipe(is_versicolor ~ Petal.Width + Species, data = iris_train)
rec_interaction <- rec_normal %>%
step_interact(~ Petal.Width:starts_with("Species"))
iris_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification")
# normal workflow
iris_wf <- workflow() %>%
add_model(iris_model) %>%
add_recipe(rec_normal)
# interaction workflow
iris_wf_interaction <- iris_wf %>%
update_recipe(rec_interaction)
# fit models
iris_normal_lf <- last_fit(iris_wf, split = iris_split)
iris_inter_lf <- last_fit(iris_wf_interaction, split = iris_split)
#> ! train/test split: preprocessor 1/1: Categorical variables used in `step_interact` should probably be avoided...
normalmodel <- iris_normal_lf %>% extract_fit_engine()
intermodel <- iris_inter_lf %>% extract_fit_engine()
# Check confusion matrix
iris_normal_lf %>%
collect_predictions() %>%
conf_mat(is_versicolor, .pred_class)
#> Truth
#> Prediction versicolor not_versicolor
#> versicolor 10 0
#> not_versicolor 0 20
iris_inter_lf %>%
collect_predictions() %>%
conf_mat(is_versicolor, .pred_class)
#> Truth
#> Prediction versicolor not_versicolor
#> versicolor 10 0
#> not_versicolor 0 20
# Extract accuracy of models and create dataset
acc_normalmodel <- iris_normal_lf %>% collect_metrics() %>% select(.estimate) %>% slice(1)
acc_intermodel <- iris_inter_lf %>% collect_metrics() %>% select(.estimate) %>% slice(1)
results <- data.frame(model = c("normalmodel", "intermodel"),
                      accuracy = c(acc_normalmodel$.estimate, acc_intermodel$.estimate))
# perform ANOVA on the classification accuracy
aov_results <- aov(accuracy ~ model, data = results)
summary(aov_results)
#> Df Sum Sq Mean Sq
#> model 1 4.93e-32 4.93e-32
Created on 2022-12-15 with reprex v2.0.2
As you can see, the results don't show a p-value, because there are no residual degrees of freedom: with a single accuracy value per model there is one observation per group, leaving nothing with which to estimate the error variance (see: why do I not get a p-value from this anova in r).
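To illustrate (with made-up numbers): once there are several accuracy values per model, for example from resampling, aov() can estimate the error and produce an F test:
toy <- data.frame(model = rep(c("normalmodel", "intermodel"), each = 5),
                  accuracy = c(0.91, 0.93, 0.90, 0.92, 0.94,
                               0.89, 0.88, 0.90, 0.87, 0.91))
summary(aov(accuracy ~ model, data = toy))  # now has residual df and a p-value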
You could also use aov() on the predictions of both models and compare their performance. Here is a reproducible example:
# Get predictions of both models for not_versicolor
normalmodel_pred <- as.data.frame(normalmodel$predictions)$not_versicolor
intermodel_pred <- as.data.frame(intermodel$predictions)$not_versicolor
summary(aov(normalmodel_pred~intermodel_pred))
#> Df Sum Sq Mean Sq F value Pr(>F)
#> intermodel_pred 1 25.032 25.032 9392 <2e-16 ***
#> Residuals 118 0.314 0.003
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2022-12-17 with reprex v2.0.2
As you can see, the p-value is less than 0.05, which indicates a strong relationship between the two models' predictions; this matches what you see if you compare the predicted probabilities directly.
For more information about ANOVA, see:
Chapter 7 Understanding ANOVA in R
Here I use a different model pair and compare the models on classification accuracy across resamples; this is easily extended to other metrics.
library(dplyr)
library(tibble)
library(ggplot2)
library(tidyr)
library(rsample)
library(recipes)
library(parsnip)
library(workflows)
library(tune)
library(yardstick)
library(workflowsets)
set.seed(123)
iris <- iris %>% mutate(
is_versicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")) %>%
mutate(is_versicolor = factor(is_versicolor, levels = c("versicolor", "not_versicolor")))
iris_split <- initial_split(iris, strata = is_versicolor, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
# replacing normal and interaction recipes with models
# that give less than 100% accuracy.
rec_normal <- recipe(is_versicolor ~ Sepal.Width, data = iris_train)
rec_alternative <- recipe(is_versicolor ~ Sepal.Length, data = iris_train)
iris_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification")
# Create folds
set.seed(234)
iris_folds <- vfold_cv(iris_train)
iris_folds
# Combine models into set
iris_set <-
workflow_set(
list(rec_normal, rec_alternative),
list(iris_model),
cross = TRUE
)
doParallel::registerDoParallel()
set.seed(2021)
# fit models
iris_rs <-
workflow_map(
iris_set,
"fit_resamples",
resamples = iris_folds
)
# Visualise model performance
autoplot(iris_rs)
# Extract resample accuracies
model_1_rs <- iris_rs[1,][[4]][[1]]$.metrics
model_2_rs <- iris_rs[2,][[4]][[1]]$.metrics
model_acc <- tibble(model_1 = NA, model_2 = NA)
for (i in 1:10) {
model_acc[i, 1] <- model_1_rs[[i]][[".estimate"]][1]
model_acc[i, 2] <- model_2_rs[[i]][[".estimate"]][1]
}
model_acc <- model_acc |> pivot_longer(cols = everything(), names_to = "model", values_to = "acc")
# Do ANOVA
aov_results <- aov(acc ~ model, data = model_acc)
summary(aov_results)
ggplot(data = model_acc, aes(fill = model)) +
geom_density(aes(x = acc), alpha = 0.2) +
labs(x = "accuracy")
Giving the p values:
> summary(aov_results)
Df Sum Sq Mean Sq F value Pr(>F)
model 1 0.0281 0.02813 1.378 0.256
Residuals 18 0.3674 0.02041
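As an aside, the fold-level accuracies can also be collected without the manual indexing above; a sketch, assuming collect_metrics() with summarize = FALSE returns one row per resample (note the model labels will then be the wflow_id values, e.g. "recipe_1_rand_forest", rather than model_1/model_2):
model_acc_alt <- collect_metrics(iris_rs, summarize = FALSE) %>%
  filter(.metric == "accuracy") %>%
  select(model = wflow_id, acc = .estimate)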
Looking at the p-values of the model accuracies through a different lens, using the infer package:
First visualise the variation:
model_acc |> ggplot(aes(x = model, y = acc)) +
geom_boxplot() +
labs(y = 'accuracy')
Then calculate a test statistic:
library(infer)
observed_statistic <- model_acc %>%
specify(acc ~ model) %>%
calculate(stat = "diff in means", order = c("model_1", "model_2"))
observed_statistic
Then simulate the null distribution:
null_dist_2_sample <- model_acc %>%
specify(acc ~ model) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("model_1", "model_2"))
and plot:
null_dist_2_sample %>%
visualize() +
shade_p_value(observed_statistic,
direction = "two-sided") +
labs(x = "test statistic")
and get the p value:
p_value_2_sample <- null_dist_2_sample %>%
get_p_value(obs_stat = observed_statistic,
direction = "two-sided")
p_value_2_sample
# A tibble: 1 × 1
p_value
<dbl>
1 0.228
This is almost the same as the p-value from the aov.
Note that, consistent with the two models' accuracies being close, the p-value is high.
I am working on a classification model to predict building age. I want to train my random forest models by groups (suburbs) within the larger dataset.
I've used this as the basis of the code below.
My question is: how should I write the code to train and record the hyperparameters for each suburb?
age.rf <- rand_forest(
mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_mode("classification") %>%
set_engine("ranger")
age.workflow <- workflow() %>%
add_model(age.rf)
### function for model fitting and predicting
age.predict <- function(df) {
# split the dataset
set.seed(1)
split <- initial_split(df)
train_df <- training(split)
test_df <- testing(split)
# create recipe
age.recipe <- recipe(decade_built ~ .,
data = train_df) %>%
update_role(bld_index, new_role = "ID") %>%
step_dummy(all_nominal_predictors(), -has_role("ID")) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
prep()
# hyperparameters
age.randgrid_rf <- grid_random(mtry(c(1,20)),
trees(),
min_n(),
size = 10)
ctrl <- control_grid(save_pred = T, extract = extract_model)
age_folds <- vfold_cv(train_df, strata = "suburb", v = 10)
age.tunerandom_rf <- age.workflow %>%
tune_grid(resamples = age_folds,
grid = age.randgrid_rf,
control = ctrl)
# best parameters
age.params_rf <- select_best(age.tunerandom_rf)
# finalise model
age.final_rf <- finalize_model(age.rf, age.params_rf)
age.workflowfinal_rf <- workflow() %>%
add_recipe(age.recipe) %>%
add_model(age.final_rf)
# predict on test data
predict(age.workflowfinal_rf, test_df)
}
age_nested <- final.df %>%
group_by(suburb) %>%
nest()
age.preds <- age_nested %>%
mutate(prediction = map(data, possibly(age.predict, otherwise = NA)))
I've mapped out the dataset using the nest() function, and followed the workflow based on Julia's post on another page.
Any help identifying how to get the hyperparameters, as well as how to apply them to the individual models for each group, would be much appreciated.
At the moment, my output is NA.
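As a debugging aid (a sketch): swapping possibly() for purrr's safely() keeps the error message instead of silently returning NA, which shows where each group's fit fails:
age.preds_dbg <- age_nested %>%
  mutate(prediction = map(data, safely(age.predict)))
age.preds_dbg$prediction[[1]]$error  # the error raised for the first suburb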
How to add a step to remove a column with constant value?
I am facing a related problem, so I am referencing the previous article above. I used step_zv() in my recipe but I still get the following error: Error in `bake()`: ! Only one factor level in column 'X33': "TRUE"
library(tidymodels)
library(readr)
library(broom.mixed)
library(dotwhisker)
library(skimr)
library(rpart.plot)
library(vip)
library(glmnet)
library(naniar)
library(tidyr)
library(dplyr)
library(textrecipes)
# Data cleaning
skool <-
read_csv("/Users/riddhimaagupta/Desktop/log1.csv")
skool_v1 <-
select(skool, -c(...1, id, npsn, public, cert_est, cert_ops, name_clean, name, muh1, muh2, muh, chr1, chr2, chr3, chr, hindu, nu1, nu2, nu_klaten, nu_sby, nu, it1, it, other_swas_international))
skool_v2 <-
filter(skool_v1, afiliasi != 99)
skool_v2.1 <- replace_with_na(skool_v2,
replace = list(village = c("-")))
skool_v2.2 <- replace_with_na(skool_v2.1,
replace = list(area = c("0")))
skool_v2.3 <- replace_with_na(skool_v2.2,
replace = list(date_est = c("-")))
skool_v2.3$date_est <- as.Date(skool_v2.3$date_est, format = '%Y-%m-%d')
skool_v2.3$date_ops <- as.Date(skool_v2.3$date_ops, format = '%Y-%m-%d')
skool_v2.3$latlon <- gsub(".*\\[", "", skool_v2.3$latlon)
skool_v2.3$latlon <- gsub("\\].*", "", skool_v2.3$latlon)
skool_v2.4 <- skool_v2.3 %>%
separate(latlon, c("latitude", "longitude"), ",")
skool_v2.4$latitude <- as.numeric(skool_v2.4$latitude)
skool_v2.4$longitude <- as.numeric(skool_v2.4$longitude)
skool_v3 <- skool_v2.4 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, as.factor)
skool_v4 <- skool_v3 %>%
mutate_if(is.logical, as.factor)
skool_v4$afiliasi <- as.factor(skool_v4$afiliasi)
glimpse(skool_v4)
# Data splitting
set.seed(123)
splits <- initial_split(skool_v4 , strata = afiliasi)
school_train <- training(splits)
school_test <- testing(splits)
set.seed(234)
val_set <- validation_split(skool_v4,
strata = afiliasi,
prop = 0.80)
# Penalised logistic regression
lr_mod <-
logistic_reg(penalty = tune(), mixture = 0.5) %>%
set_engine("glmnet")
lr_recipe <-
recipe(afiliasi ~ ., data = school_train) %>%
step_date(date_est, date_ops) %>%
step_rm(date_est, date_ops) %>%
textrecipes::step_clean_levels(village) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
lr_workflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe)
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
lr_reg_grid %>% top_n(-5)
lr_reg_grid %>% top_n(5)
lr_res <-
lr_workflow %>%
tune_grid(val_set,
grid = lr_reg_grid,
control = control_grid(save_pred = TRUE, verbose = TRUE),
metrics = metric_set(roc_auc))
The console says
x validation: preprocessor 1/1: Error in `bake()`:
! Only one factor...
Warning message:
All models failed. See the `.notes` column.
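The underlying error messages can be recovered from the tuning result; a sketch, assuming a recent version of tune (on older versions, inspect lr_res$.notes directly):
collect_notes(lr_res)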
This error comes from step_dummy(), because the variable X33 has only one factor level, "TRUE". The easiest way to deal with this in your problem is to use step_zv() on the nominal predictors before step_dummy().
This would make your recipe look like
lr_recipe <-
recipe(afiliasi ~ ., data = school_train) %>%
step_date(date_est, date_ops) %>%
step_rm(date_est, date_ops) %>%
textrecipes::step_clean_levels(village) %>%
step_zv(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
Reprex showing what is happening:
library(recipes)
mtcars$fac1 <- "h"
mtcars$fac2 <- rep(c("a", "b"), length.out = nrow(mtcars))
recipe(mpg ~ ., data = mtcars) %>%
step_dummy(all_nominal_predictors()) %>%
prep()
#> Error in `bake()`:
#> ! Only one factor level in fac1: h
recipe(mpg ~ ., data = mtcars) %>%
step_zv(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
prep()
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 12
#>
#> Training data contained 32 data points and no missing data.
#>
#> Operations:
#>
#> Zero variance filter removed fac1 [trained]
#> Dummy variables from fac2 [trained]
Here's an example with mtcars:
# Add a column with only one value
mtcars$constant_col <- 1
# Remove any columns with only one value
mtcars[sapply(mtcars, function(x) length(unique(x)) == 1)] <- NULL
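Within a recipes pipeline, you can also confirm which columns a step removed by tidying the prepped recipe; a minimal sketch (tidy() on a trained step_zv lists the dropped columns):
library(recipes)
mtcars$constant_col <- 1
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_zv(all_predictors()) %>%
  prep()
tidy(rec, number = 1)  # terms removed by the zero-variance filter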
The code below works correctly and has no errors that I know of, but I want to add more to it.
The two things I want to add are:
1 - Predictions of the model on the training data, added to the final plot. I want to run collect_predictions() on the model fitted to the training data.
2 - Code to view the metrics of the model on the training data. I want to run collect_metrics() on the model fitted to the training data.
How do I get this information?
# Setup
library(tidyverse)
library(tidymodels)
parks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-22/parks.csv')
modeling_df <- parks %>%
select(pct_near_park_data, spend_per_resident_data, med_park_size_data) %>%
rename(nearness = "pct_near_park_data",
spending = "spend_per_resident_data",
acres = "med_park_size_data") %>%
mutate(nearness = (parse_number(nearness)/100)) %>%
mutate(spending = parse_number(spending))
# Start building models
set.seed(123)
park_split <- initial_split(modeling_df)
park_train <- training(park_split)
park_test <- testing(park_split)
tree_rec <- recipe(nearness ~., data = park_train)
tree_prep <- prep(tree_rec)
juiced <- juice(tree_prep)
tune_spec <- rand_forest(
mtry = tune(),
trees = 1000,
min_n = tune()
) %>%
set_mode("regression") %>%
set_engine("ranger")
tune_wf <- workflow() %>%
add_recipe(tree_rec) %>%
add_model(tune_spec)
set.seed(234)
park_folds <- vfold_cv(park_train)
# Make a grid of various different models
doParallel::registerDoParallel()
set.seed(345)
tune_res <- tune_grid(
tune_wf,
resamples = park_folds,
grid = 20,
control = control_grid(verbose = TRUE)
)
best_rmse <- select_best(tune_res, "rmse")
# Finalize a model with the best grid
final_rf <- finalize_model(
tune_spec,
best_rmse
)
final_wf <- workflow() %>%
add_recipe(tree_rec) %>%
add_model(final_rf)
final_res <- final_wf %>%
last_fit(park_split)
# Visualize the performance
# My issue here is that this is only the testing data
# How can I also get this model's performance on the training data?
# I want to plot both with a facet_wrap or color indication, as well as numerically see the difference with collect_metrics
final_res %>%
collect_predictions() %>%
ggplot(aes(nearness, .pred)) +
geom_point() +
geom_abline()
What you can do is pull out the trained workflow object from final_res and use that to create predictions on the training data set.
final_model <- final_res$.workflow[[1]]
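On recent versions of tune there is also a dedicated extractor, which may be preferable to indexing into the tibble; a sketch, assuming extract_workflow() supports last_fit() results:
final_model <- extract_workflow(final_res)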
Now you can use augment() on the test and training data sets to visualize the performance.
final_model %>%
augment(new_data = park_test) %>%
ggplot(aes(nearness, .pred)) +
geom_point() +
geom_abline()
final_model %>%
augment(new_data = park_train) %>%
ggplot(aes(nearness, .pred)) +
geom_point() +
geom_abline()
You can also combine the results with bind_rows() so you can compare more easily.
all_predictions <- bind_rows(
augment(final_model, new_data = park_train) %>%
mutate(type = "train"),
augment(final_model, new_data = park_test) %>%
mutate(type = "test")
)
all_predictions %>%
ggplot(aes(nearness, .pred)) +
geom_point() +
geom_abline() +
facet_wrap(~type)
All the yardstick metric functions work on grouped data frames as well.
all_predictions %>%
group_by(type) %>%
metrics(nearness, .pred)
#> # A tibble: 6 x 4
#> type .metric .estimator .estimate
#> <chr> <chr> <chr> <dbl>
#> 1 test rmse standard 0.0985
#> 2 train rmse standard 0.0473
#> 3 test rsq standard 0.725
#> 4 train rsq standard 0.943
#> 5 test mae standard 0.0706
#> 6 train mae standard 0.0350
Created on 2021-06-24 by the reprex package (v2.0.0)
Question:
What factors may cause the prediction interval to have wider coverage than would be expected? Particularly with regard to quantile regression forests with the ranger package?
Specific Context + REPREX:
I am using quantile regression forests through parsnip and the tidymodels suite of packages with ranger to generate prediction intervals. I was reviewing an example using the ames housing data and was surprised to see in the example below that my 90% prediction intervals had an empirical coverage of ~97% when evaluated on a hold-out dataset (coverage on the training data was even higher).
This was even more surprising given that my model performance is substantially worse on the hold-out set than on the training set; hence I would have guessed the coverage would be less than expected, not greater.
Load libraries and data, and set up the split:
```{r}
library(tidyverse)
library(tidymodels)
library(AmesHousing)
ames <- make_ames() %>%
mutate(Years_Old = Year_Sold - Year_Built,
Years_Old = ifelse(Years_Old < 0, 0, Years_Old))
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
```
Specify model workflow:
```{r}
rf_recipe <-
recipe(
Sale_Price ~ Lot_Area + Neighborhood + Years_Old + Gr_Liv_Area + Overall_Qual + Total_Bsmt_SF + Garage_Area,
data = ames_train
) %>%
step_log(Sale_Price, base = 10) %>%
step_other(Neighborhood, Overall_Qual, threshold = 50) %>%
step_novel(Neighborhood, Overall_Qual) %>%
step_dummy(Neighborhood, Overall_Qual)
rf_mod <- rand_forest() %>%
set_engine("ranger", importance = "impurity", seed = 63233, quantreg = TRUE) %>%
set_mode("regression")
set.seed(63233)
rf_wf <- workflows::workflow() %>%
add_model(rf_mod) %>%
add_recipe(rf_recipe) %>%
fit(ames_train)
```
Make predictions on training and hold-out datasets:
```{r}
rf_preds_train <- predict(
rf_wf$fit$fit$fit,
workflows::pull_workflow_prepped_recipe(rf_wf) %>% bake(ames_train),
type = "quantiles",
quantiles = c(0.05, 0.50, 0.95)
) %>%
with(predictions) %>%
as_tibble() %>%
set_names(paste0(".pred", c("_lower", "", "_upper"))) %>%
mutate(across(contains(".pred"), ~10^.x)) %>%
bind_cols(ames_train)
rf_preds_test <- predict(
rf_wf$fit$fit$fit,
workflows::pull_workflow_prepped_recipe(rf_wf) %>% bake(ames_test),
type = "quantiles",
quantiles = c(0.05, 0.50, 0.95)
) %>%
with(predictions) %>%
as_tibble() %>%
set_names(paste0(".pred", c("_lower", "", "_upper"))) %>%
mutate(across(contains(".pred"), ~10^.x)) %>%
bind_cols(ames_test)
```
Show that the coverage rate is far higher for both the training and hold-out data than the expected 90% (empirically ~98% and ~97%, respectively):
```{r}
rf_preds_train %>%
mutate(covered = ifelse(Sale_Price >= .pred_lower & Sale_Price <= .pred_upper, 1, 0)) %>%
summarise(n = n(),
n_covered = sum(
covered
),
covered_prop = n_covered / n,
stderror = sd(covered) / sqrt(n)) %>%
mutate(min_coverage = covered_prop - 2 * stderror,
max_coverage = covered_prop + 2 * stderror)
# # A tibble: 1 x 6
# n n_covered covered_prop stderror min_coverage max_coverage
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2199 2159 0.982 0.00285 0.976 0.988
rf_preds_test %>%
mutate(covered = ifelse(Sale_Price >= .pred_lower & Sale_Price <= .pred_upper, 1, 0)) %>%
summarise(n = n(),
n_covered = sum(
covered
),
covered_prop = n_covered / n,
stderror = sd(covered) / sqrt(n)) %>%
mutate(min_coverage = covered_prop - 2 * stderror,
max_coverage = covered_prop + 2 * stderror)
# # A tibble: 1 x 6
# n n_covered covered_prop stderror min_coverage max_coverage
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 731 706 0.966 0.00673 0.952 0.979
```
Guesses:
Something about the ranger package or quantile regression forests is overly extreme in the way it estimates quantiles, or I am overfitting in the 'extreme' direction somehow, leading to my highly conservative prediction intervals
This is a quirk specific to this dataset / model
I am missing something or setting something up incorrectly
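One quick check on the first guess (a sketch, reusing the fitted workflow and data from above) is to compute the empirical coverage at several nominal levels and see whether the intervals are uniformly conservative:
```{r}
baked_test <- workflows::pull_workflow_prepped_recipe(rf_wf) %>% bake(ames_test)
coverage_at <- function(level) {
  a <- (1 - level) / 2
  q <- predict(rf_wf$fit$fit$fit, baked_test,
               type = "quantiles", quantiles = c(a, 1 - a))$predictions
  # compare on the original dollar scale, as above
  mean(ames_test$Sale_Price >= 10^q[, 1] & ames_test$Sale_Price <= 10^q[, 2])
}
sapply(c(0.5, 0.8, 0.9), coverage_at)
```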
I'm using tidymodels to fit a PLS model but I'm struggling to find the PLS variable importance scores or coefficients.
This is what I've tried so far; the example data is from the AppliedPredictiveModeling package.
Model fitting
data(ChemicalManufacturingProcess)
split <- ChemicalManufacturingProcess %>% initial_split(prop = 0.7)
train <- training(split)
test <- testing(split)
tidy_rec <- recipe(Yield ~ ., data = train) %>%
step_knnimpute(all_predictors()) %>%
step_BoxCox(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_predictors())
boots <- bootstraps(train, times = 25)
tidy_model <- plsmod::pls(num_comp = tune()) %>%
set_mode("regression") %>%
set_engine("mixOmics")
tidy_grid <- expand.grid(num_comp = seq(from = 1, to = 48, by = 5))
tidy_tune <- tidy_model %>% tune_grid(
preprocessor = tidy_rec,
grid = tidy_grid,
resamples = boots,
metrics = metric_set(mae, rmse, rsq)
)
tidy_best <- tidy_tune %>% select_best("rsq")
Final_model <- tidy_model %>% finalize_model(tidy_best)
tidy_wf <- workflow() %>%
add_model(Final_model) %>%
add_recipe(tidy_rec)
Fit_PLS <- tidy_wf %>% fit(data = train)
# check the most important predictors
tidy_info <- Fit_PLS %>% pull_workflow_fit()
loadings <- tidy_info$fit$loadings$X
PLS variable importance
tidy_load <- loadings %>% as.data.frame() %>% rownames_to_column() %>%
select(rowname, comp1, comp2, comp3) %>%
pivot_longer(-rowname) %>%
rename(predictors = rowname)
tidy_load %>% mutate(Sign = if_else(value < 0, "neg", "pos")) %>%
mutate(absvalue = abs(value)) %>% group_by(predictors) %>% summarise(Importance = sum(absvalue)) %>%
mutate(predictors = fct_reorder(predictors, Importance)) %>%
slice_max(Importance, n = 15) %>%
ggplot(aes(Importance, predictors, fill = predictors)) + geom_col(show.legend = FALSE)
Thanks! The vi() function from the vip package is not available for this model.
You can directly tidy() the output of the PLS model to get the coefficients:
library(tidymodels)
library(tidyverse)
library(plsmod)
data(ChemicalManufacturingProcess, package = "AppliedPredictiveModeling")
split <- initial_split(ChemicalManufacturingProcess, prop = 0.7)
train <- training(split)
test <- testing(split)
chem_rec <- recipe(Yield ~ ., data = train) %>%
step_knnimpute(all_predictors()) %>%
step_BoxCox(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_predictors())
pls_spec <- pls(num_comp = 4) %>% ## can tune instead to find the optimal number
set_mode("regression") %>%
set_engine("mixOmics")
wf <- workflow() %>%
add_recipe(chem_rec) %>%
add_model(pls_spec)
pls_fit <- fit(wf, train)
## tidy the fitted model
tidy_pls <- pls_fit %>%
pull_workflow_fit() %>%
tidy()
tidy_pls
#> # A tibble: 192 x 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 BiologicalMaterial01 0.193 predictors 1
#> 2 BiologicalMaterial01 -0.247 predictors 2
#> 3 BiologicalMaterial01 0.00969 predictors 3
#> 4 BiologicalMaterial01 0.0228 predictors 4
#> 5 BiologicalMaterial03 0.249 predictors 1
#> 6 BiologicalMaterial03 -0.00118 predictors 2
#> 7 BiologicalMaterial03 0.0780 predictors 3
#> 8 BiologicalMaterial03 -0.0866 predictors 4
#> 9 BiologicalMaterial04 0.217 predictors 1
#> 10 BiologicalMaterial04 -0.192 predictors 2
#> # … with 182 more rows
tidy_pls %>%
filter(term != "Y") %>%
group_by(component) %>%
slice_max(abs(value), n = 10) %>%
ungroup() %>%
ggplot(aes(value, fct_reorder(term, value), fill = factor(component))) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, scales = "free_y") +
labs(y = NULL)
Created on 2020-10-19 by the reprex package (v0.3.0.9001)
I showed this without tuning the number of components, but it works about the same with tuning.
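For reference, a sketch of the tuned path (reusing chem_rec from above; the object names here are hypothetical): tune num_comp, finalize the workflow with the best value, refit, and tidy as before.
pls_tune_spec <- pls(num_comp = tune()) %>%
  set_mode("regression") %>%
  set_engine("mixOmics")
wf_tune <- workflow() %>%
  add_recipe(chem_rec) %>%
  add_model(pls_tune_spec)
set.seed(123)
pls_res <- tune_grid(wf_tune, resamples = bootstraps(train, times = 25),
                     grid = tibble(num_comp = 1:6))
final_pls <- finalize_workflow(wf_tune, select_best(pls_res, "rsq")) %>%
  fit(train)
tidy(pull_workflow_fit(final_pls))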