Tidymodels - Help evaluating regression models made via recipes - r

I am working with the current tidytuesday data about salaries and trying to create a model with tidymodels and recipes. I want to predict salary with many of the other factors present using the recipes code, but I run into an issue.
Issue 1 - My recipe says there are empty rows, but I do not know how to figure out how. This does not give an error, so maybe it is not a problem.
Issue 2 - Understanding what my models actually did and how to visualize the performance. I want to plot the models performance on the initial data. Here is an example of my goal: https://indescribled.files.wordpress.com/2021/05/image-17.png?w=782
I do not understand exactly how to use the predict function with my recipe. juice(rec) is less than 1000 rows while the testing data is about 6000. Perhaps I am reading it backwards, but can someone try to point me in the right direction?
The code below should be an exact reproduction of mine.
library(tidymodels)
library(tidyverse)
salary_raw <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-18/survey.csv')
# Could not figure out tidy way to do this
salary_raw$other_monetary_comp[is.na(salary_raw$other_monetary_comp)] <- 0
salary_raw$other_monetary_comp <- as.numeric(salary_raw$other_monetary_comp)
# Filter and convert to USD
# The mutations to industry were because of other errors, they may not be needed
salary_modeling <- salary_raw %>%
filter(
how_old_are_you %in% c("55-64","45-54","35-44","25-34","18-24"),
currency %in% c("AUD/NZD","CAD","EUR","GBP","USD")
) %>%
mutate(annual_salary = case_when(
currency == "USD" ~ annual_salary * 1.00,
currency == "GBP" ~ annual_salary * 1.42,
currency == "AUD/NZD" ~ annual_salary * 0.75,
currency == "CAD" ~ annual_salary * 0.83,
currency == "EUR" ~ annual_salary * 1.22
)) %>%
mutate(other_monetary_comp = case_when(
currency == "USD" ~ other_monetary_comp * 1.00,
currency == "GBP" ~ other_monetary_comp * 1.42,
currency == "AUD/NZD" ~ other_monetary_comp * 0.75,
currency == "CAD" ~ other_monetary_comp * 0.83,
currency == "EUR" ~ other_monetary_comp * 1.22
)) %>%
rename(age = how_old_are_you,
prof_exp = overall_years_of_professional_experience,
field_exp = years_of_experience_in_field,
education = highest_level_of_education_completed
) %>%
mutate(total_comp = annual_salary + other_monetary_comp) %>%
filter(total_comp > 10000,
total_comp < 350000)%>%
mutate(gender = case_when(
gender == "Prefer not to answer" ~ "Other or prefer not to answer",
TRUE ~ gender
)) %>%
mutate(industry = case_when(
industry == "Biotech pharmaceuticals" ~ "Biotech",
industry == "Consumer Packaged Goods" ~ "Consumer packaged goods ",
industry == "Real Estate Development" ~ "Real Estate",
TRUE ~ industry
))
# Create initial splits
set.seed(123)
salary_split <- initial_split(salary_modeling)
salary_train <- training(salary_split)
salary_test <- testing(salary_split)
# I want to predict total comp with many of the other variables, listed below. Here is my logic
# downsample is because there are a lot more women than men in the data, unsure if necessary
# log is to many data more interpretable, not necessary
# an error message told me to use novel
# unknown is to change NA to unknown as far as I understand
# other is to change values that are less than 5% of the total dataset to "Other"
# unsure what the purpose of dummy is, but it seems to be necessary for modeling
rec <- salary_train %>%
recipe(total_comp ~ age + gender + field_exp + race + industry + job_title) %>%
themis::step_downsample(gender) %>%
step_log(total_comp, base = 2) %>%
step_novel(race, industry) %>%
step_unknown(race, industry, gender) %>%
step_other(race, industry, job_title, threshold = .005) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
prep()
# ISSUE 1 - Running rec says that there are 19,081 data points and 226 incomplete rows. I do not know how to fix the incomplete rows
test_proc <- bake(rec, new_data = salary_test)
# Linear model ------------------------------------------------------------
lm_spec <- linear_reg() %>%
set_engine("lm")
lm_fitted <- lm_spec %>%
fit(total_comp ~ ., data = juice(rec))
tidy(lm_fitted)
# RF MODEL ----------------------------------------------------------------
rf_spec <- rand_forest(mode = "regression", trees = 1500) %>%
set_engine("ranger")
rf_fit <- rf_spec %>%
fit(total_comp ~ .,
data = juice(rec))
rf_fit
# QUESTIONS BEGIN HERE --------------------------------------------------------------------------------------------------------------------------------------------------
# Need to figure out what new data is for this portion
# I think it is juice(rec), but it seems weird to me
# juice(rec) is only about 900 rows while test_proc is multiple thousand. testing data should be smaller than training
asdf <- juice(rec)
results_train <- lm_fitted %>%
predict(new_data = asdf) %>%
mutate(
truth = asdf$total_comp,
model = "lm"
) %>%
bind_rows(rf_fit %>%
predict(new_data = asdf) %>%
mutate(
truth = asdf$total_comp,
model = "rf"
))
results_train
# Is the newdata and test proc correct?
results_test <- lm_fitted %>%
predict(new_data = test_proc) %>%
mutate(
truth = test_proc$total_comp,
model = "lm"
) %>%
bind_rows(rf_fit %>%
predict(new_data = test_proc) %>%
mutate(
truth = test_proc$total_comp,
model = "rf"
))
results_test
# Goal is to run the following code to visualize the predictions, the code below probably will do nothing right now unless the two dataframes above are correct
results_test %>%
mutate(train = "testing") %>%
bind_rows(results_train %>%
mutate(train = "training")) %>%
ggplot(aes(truth, .pred, color = model)) +
geom_abline(lty = 2, color = "gray80", size = 1.5) +
geom_point(alpha = .75) +
facet_wrap(~train)

Looks like you have things pretty well along!
library(tidymodels)
library(tidyverse)
salary_raw <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-18/survey.csv')
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> timestamp = col_character(),
#> how_old_are_you = col_character(),
#> industry = col_character(),
#> job_title = col_character(),
#> additional_context_on_job_title = col_character(),
#> annual_salary = col_double(),
#> other_monetary_comp = col_character(),
#> currency = col_character(),
#> currency_other = col_character(),
#> additional_context_on_income = col_character(),
#> country = col_character(),
#> state = col_character(),
#> city = col_character(),
#> overall_years_of_professional_experience = col_character(),
#> years_of_experience_in_field = col_character(),
#> highest_level_of_education_completed = col_character(),
#> gender = col_character(),
#> race = col_character()
#> )
salary_modeling <- salary_raw %>%
replace_na(list(other_monetary_comp = 0)) %>%
filter(
how_old_are_you %in% c("55-64","45-54","35-44","25-34","18-24"),
currency %in% c("AUD/NZD","CAD","EUR","GBP","USD")
) %>%
mutate(annual_salary = case_when(
currency == "USD" ~ annual_salary * 1.00,
currency == "GBP" ~ annual_salary * 1.42,
currency == "AUD/NZD" ~ annual_salary * 0.75,
currency == "CAD" ~ annual_salary * 0.83,
currency == "EUR" ~ annual_salary * 1.22
)) %>%
mutate(other_monetary_comp = parse_number(other_monetary_comp),
other_monetary_comp = case_when(
currency == "USD" ~ other_monetary_comp * 1.00,
currency == "GBP" ~ other_monetary_comp * 1.42,
currency == "AUD/NZD" ~ other_monetary_comp * 0.75,
currency == "CAD" ~ other_monetary_comp * 0.83,
currency == "EUR" ~ other_monetary_comp * 1.22
)) %>%
rename(age = how_old_are_you,
prof_exp = overall_years_of_professional_experience,
field_exp = years_of_experience_in_field,
education = highest_level_of_education_completed
) %>%
mutate(total_comp = annual_salary + other_monetary_comp) %>%
filter(total_comp > 10000,
total_comp < 350000) %>%
mutate(gender = case_when(
gender == "Prefer not to answer" ~ "Other or prefer not to answer",
TRUE ~ gender
)) %>%
mutate(industry = case_when(
industry == "Biotech pharmaceuticals" ~ "Biotech",
industry == "Consumer Packaged Goods" ~ "Consumer packaged goods ",
industry == "Real Estate Development" ~ "Real Estate",
TRUE ~ industry
))
set.seed(123)
salary_split <- initial_split(salary_modeling)
salary_train <- training(salary_split)
salary_test <- testing(salary_split)
rec <- salary_train %>%
recipe(total_comp ~ age + gender + field_exp + race + industry + job_title) %>%
themis::step_downsample(gender) %>%
step_log(total_comp, base = 2) %>%
step_novel(race, industry) %>%
step_unknown(race, industry, gender) %>%
step_other(race, industry, job_title, threshold = 0.005) %>%
step_dummy(all_nominal_predictors())
The recipe here says the training data has incomplete rows because there is missing data; this is why you used step_unknown(), I am guessing.
prep(rec)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 6
#>
#> Training data contained 19080 data points and 235 incomplete rows.
#>
#> Operations:
#>
#> Down-sampling based on gender [trained]
#> Log transformation on total_comp [trained]
#> Novel factor level assignment for race, industry [trained]
#> Unknown factor level assignment for race, industry, gender [trained]
#> Collapsing factor levels for race, industry, job_title [trained]
#> Dummy variables from age, gender, field_exp, race, industry, job_title [trained]
The processed training set here does not have so many observations any more because of the downsampling; we don't apply downsampling to a test set because we want to compute metrics on test set as they would appear "in the wild".
train_proc <- rec %>% prep() %>% bake(new_data = NULL)
test_proc <- rec %>% prep() %>% bake(new_data = salary_test)
dim(train_proc)
#> [1] 878 57
dim(test_proc)
#> [1] 6361 57
lm_fitted <- linear_reg() %>%
set_engine("lm") %>%
fit(total_comp ~ ., data = train_proc)
rf_fitted <- rand_forest(mode = "regression", trees = 1500) %>%
set_engine("ranger") %>%
fit(total_comp ~ ., data = train_proc)
I'd try using the augment() function to make your visualization
bind_rows(
augment(lm_fitted, new_data = train_proc) %>% mutate(model = "lm", train = "train"),
augment(rf_fitted, new_data = train_proc) %>% mutate(model = "rf", train = "train"),
augment(lm_fitted, new_data = test_proc) %>% mutate(model = "lm", train = "test"),
augment(rf_fitted, new_data = test_proc) %>% mutate(model = "rf", train = "test")
) %>%
ggplot(aes(total_comp, .pred, color = model)) +
geom_abline(lty = 2, color = "gray80", size = 1.5) +
geom_point(alpha = .5) +
facet_wrap(~ train)
Created on 2021-05-24 by the reprex package (v2.0.0)

Related

How to add a column of statistics for only one group in gtsummary(tbl_summary)

For example, now I have two groups of data, Drug A and Drug B. I would like to add a column of the number of observations of each variable for only Drug A, how can I do that? I don't find a way using add_n.
The code for producing example table:
tbl_summary_ex2 <- trial %>% select(age, grade, response, trt) %>%
tbl_summary(
by = trt,
label = list(age ~ "Patient Age"),
statistic = list(all_continuous() ~ "{mean} ({sd})"),
digits = list(age ~ c(0, 1))
)
Here is one way to do it:
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.6.1'
# build table with only Drug A
tbl_summary_ex1 <-
trial %>%
dplyr::filter(trt == "Drug A") %>%
select(age, grade, response) %>%
tbl_summary(
label = list(age ~ "Patient Age"),
statistic = list(all_continuous() ~ "{mean} ({sd})"),
digits = list(age ~ c(0, 1))
) %>%
add_n(col_label = "**Drug A N**") %>%
modify_column_hide(all_stat_cols())
# build table split by treatment
tbl_summary_ex2 <-
trial %>%
select(age, grade, response, trt) %>%
tbl_summary(
by = trt,
label = list(age ~ "Patient Age"),
statistic = list(all_continuous() ~ "{mean} ({sd})"),
digits = list(age ~ c(0, 1))
)
# merge tables together
tbl_final <-
list(tbl_summary_ex1, tbl_summary_ex2) %>%
tbl_merge(tab_spanner = FALSE)
Created on 2022-08-19 by the reprex package (v2.0.1)

How to present interaction variables horizontally in tbl_regression in R

I want to present the coefficient of interaction horizontally rather than vertically with tbl_regression:
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.6.0'
tbl1 <- trial %>%
mutate(early=case_when(stage%in%c("T1","T2")~1,T~0)) %>%
glm(response ~ age * early , family = binomial, data=.) %>%
tbl_regression(
exponentiate = TRUE )
tbl1
tbl2 <- trial %>%
mutate(late=case_when(stage%in%c("T3","T4")~1,T~0)) %>%
glm(response ~ age * late, family = binomial, data=.) %>%
tbl_regression(
exponentiate = TRUE )
tbl2
tbl_stack (list(tbl1,tbl2))
Created on 2022-07-20 by the reprex package (v2.0.1)
I want to have the variables presented horizontally (similar to tbl_merg, but the values are from within the same module), in order to stack different values beneath it.
This requires some customization, but is certainly possible.
library(gtsummary)
#> #BlackLivesMatter
packageVersion("gtsummary")
#> [1] '1.6.1'
tbl1 <-
trial %>%
mutate(early = dplyr::case_when(stage %in% c("T1","T2") ~ 1, TRUE ~ 0)) %>%
glm(response ~ age * early , family = binomial, data=.) %>%
tbl_regression(exponentiate = TRUE)
tbl_final <-
1:3 %>%
purrr::map(
function(i) {
tbl1 %>%
modify_table_body(
~ .x %>%
dplyr::mutate(
label = label[1],
variable = variable[1],
row_type = row_type[1]
) %>%
dplyr::filter(dplyr::row_number() %in% i)
)
}
) %>%
tbl_merge(tab_spanner = c("**Variable**", "**Early**", "**Interaction**"))
Created on 2022-07-20 by the reprex package (v2.0.1)

gtsummmary::modify_header() with overall column

I would like to modify the column header formatting in a {gtsummary} table with a categorical varible (two levels) and an overall column, as in With gtsummary, is it possible to have N on a separate row to the column name?. I'm having trouble figuring out the correct gtsummary variable name to access the various column names. I currently have a workaround in which I first modify the categorical levels headers, and then add the overall and modify it. However, I'm wondering if there is a better way to do this in a single modify_header line. Reprex with various attempts below.
library(gtsummary)
library(dplyr)
# Shorten trial for examples
trial <- select(trial, trt, age)
# Modify headers
trial %>%
tbl_summary(by = trt) %>%
modify_header(update = all_stat_cols() ~ "**{level}**<br>N = {N}") %>%
as_kable()
Characteristic
Drug AN = 200
Drug BN = 200
Age
46 (37, 59)
48 (39, 56)
Unknown
7
4
# Adding overall before modifying stats columns doesn't work with {level} or {label}
try(
trial %>%
tbl_summary(by = trt) %>%
add_overall() %>%
modify_header(update = all_stat_cols() ~ "**{level}**<br>N = {N}") %>%
as_kable()
)
#> Error in eval(parse(text = text, keep.source = FALSE), envir) :
#> object 'level' not found
try(
trial %>%
tbl_summary(by = trt) %>%
add_overall() %>%
modify_header(update = all_stat_cols() ~ "**{label}**<br>N = {N}") %>%
as_kable()
)
#> Error in eval(parse(text = text, keep.source = FALSE), envir) :
#> object 'label' not found
# Adding overall before modifying stats columns does work with plain text
trial %>%
tbl_summary(by = trt) %>%
add_overall() %>%
modify_header(update = all_stat_cols() ~ "**THIS WORKS**<br>N = {N}") %>%
as_kable()
Characteristic
THIS WORKSN = 200
THIS WORKSN = 200
THIS WORKSN = 200
Age
47 (38, 57)
46 (37, 59)
48 (39, 56)
Unknown
11
7
4
# And with {column} but then gives the gtsummary backend column name
trial %>%
tbl_summary(by = trt) %>%
add_overall() %>%
modify_header(update = all_stat_cols() ~ "**{column}**<br>N = {N}") %>%
as_kable()
Characteristic
stat_0N = 200
stat_1N = 200
stat_2N = 200
Age
47 (38, 57)
46 (37, 59)
48 (39, 56)
Unknown
11
7
4
# Adding overall after modifying stats columns does work, but need to change label separately
trial %>%
tbl_summary(by = trt) %>%
modify_header(update = all_stat_cols() ~ "**{level}**<br>N = {N}") %>%
add_overall(col_label = "**Overall**<br>N = {N}") %>%
as_kable()
Characteristic
OverallN = 200
Drug AN = 200
Drug BN = 200
Age
47 (38, 57)
46 (37, 59)
48 (39, 56)
Unknown
11
7
4
Created on 2021-08-20 by the reprex package (v2.0.0)
The issue you are running into is that all_stat_cols(), by default, selects the overall column and the other columns. Depending on whether you're assigning a label to the overall or split columns, you'll want to use slightly different syntax.
Example below!
library(gtsummary)
tbl <-
trial %>%
select(trt, age) %>%
tbl_summary(by = trt, missing = "no") %>%
add_overall() %>%
modify_header(
update = list(all_stat_cols(FALSE) ~ "**{level}**<br>N = {n}",
stat_0 ~ "**Overall**<br>N = {N}"))
show_header_names(tbl)
#> i As a usage guide, the code below re-creates the current column headers.
#> modify_header(update = list(
#> label ~ "**Characteristic**",
#> stat_0 ~ "**Overall**<br>N = 200",
#> stat_1 ~ "**Drug A**<br>N = 98",
#> stat_2 ~ "**Drug B**<br>N = 102"
#> ))
#>
#>
#> Column Name Column Header
#> ------------ -----------------------
#> label **Characteristic**
#> stat_0 **Overall**<br>N = 200
#> stat_1 **Drug A**<br>N = 98
#> stat_2 **Drug B**<br>N = 102
Created on 2021-08-20 by the reprex package (v2.0.1)

Trying to predict the probability of a binary variable being equal to 1 using tidymodels

I am trying to predict the probability of two_year_recid by estimating a logit regression (with no penalty) that includes a flexible list of controls excluding decile_score and race_factor, but I keep getting an error saying
Error in eval_tidy(f[[2]], dat) : object '.' not found
this shows up on the line that starts with fit_full of the code chunk bellow
rec_full <- recipe(
two_year_recid ~ .,
data = train
) %>%
step_dummy(all_nominal()) %>%
step_interact(~ all_predictors() * all_predictors()) %>%
step_poly(age, degree = 3) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors())
mod_lm <- logistic_reg() %>%
set_engine('glm')
wf_full <- workflow() %>%
add_recipe(rec_full) %>%
add_model(mod_lm)
fit_full <- wf_full %>% fit(data = train)
test <- test %>%
select(two_year_recid) %>%
bind_cols(predict(fit_full, new_data = test) %>% rename(full = .pred))
The data I am using and the cleaning I did
raw <- read_csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv")
## Main working data
df <- raw %>%
filter(days_b_screening_arrest <= 30) %>%
filter(days_b_screening_arrest >= -30) %>%
filter(is_recid != -1) %>%
filter(c_charge_degree != "O") %>%
filter(score_text != 'N/A')
## clean main working data a bit more
df <- df %>%
mutate(length_of_stay = as.numeric(as.Date(df$c_jail_out) - as.Date(df$c_jail_in)),
charge_factor = fct_explicit_na(c_charge_desc),
race_factor = fct_explicit_na(race),
race_factor = fct_relevel(race_factor, "Caucasian"),
charge_factor = fct_lump_min(charge_factor, 30),
sex_factor = factor(sex, levels = c("Female","Male")),
priors_factor = ifelse(priors_count > 20, 20, priors_count),
priors_factor = factor(priors_factor),
two_year_recid = factor(two_year_recid)) %>%
select(two_year_recid, age, sex_factor , juv_fel_count , juv_misd_count , juv_other_count , priors_count , c_charge_degree , charge_factor, race_factor, decile_score, length_of_stay)
feature_names <- names(df)[-c(1,10,11)]
dfn = subset(df, select = -c(decile_score, race_factor))
set.seed(5281110)
split <- initial_split(dfn, p = 0.75)
train <- training(split)
test <- testing(split)
And the libraries I am using
library(tidyverse)
library(tidymodels)
library(AER)
When you added the step step_dummy(all_nominal()), that selected your outcome two_year_recid and turned it into a dummy variable, because it is a nominal variable. Be sure to say you do not want to select it, either by adding it explicitly via -two_year_recid or by using -all_outcomes(). Then your model will fit and predict:
library(tidymodels)
library(tidyverse)
raw <- read_csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv")
#> Warning: Duplicated column names deduplicated: 'decile_score' =>
#> 'decile_score_1' [40], 'priors_count' => 'priors_count_1' [49]
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> .default = col_character(),
#> id = col_double(),
#> compas_screening_date = col_date(format = ""),
#> dob = col_date(format = ""),
#> age = col_double(),
#> juv_fel_count = col_double(),
#> decile_score = col_double(),
#> juv_misd_count = col_double(),
#> juv_other_count = col_double(),
#> priors_count = col_double(),
#> days_b_screening_arrest = col_double(),
#> c_jail_in = col_datetime(format = ""),
#> c_jail_out = col_datetime(format = ""),
#> c_offense_date = col_date(format = ""),
#> c_arrest_date = col_date(format = ""),
#> c_days_from_compas = col_double(),
#> is_recid = col_double(),
#> r_days_from_arrest = col_double(),
#> r_offense_date = col_date(format = ""),
#> r_jail_in = col_date(format = ""),
#> r_jail_out = col_date(format = "")
#> # ... with 14 more columns
#> )
#> ℹ Use `spec()` for the full column specifications.
## Main working data
df <- raw %>%
filter(days_b_screening_arrest <= 30) %>%
filter(days_b_screening_arrest >= -30) %>%
filter(is_recid != -1) %>%
filter(c_charge_degree != "O") %>%
filter(score_text != 'N/A')
## clean main working data a bit more
df <- df %>%
mutate(length_of_stay = as.numeric(as.Date(df$c_jail_out) - as.Date(df$c_jail_in)),
charge_factor = fct_explicit_na(c_charge_desc),
race_factor = fct_explicit_na(race),
race_factor = fct_relevel(race_factor, "Caucasian"),
charge_factor = fct_lump_min(charge_factor, 30),
sex_factor = factor(sex, levels = c("Female","Male")),
priors_factor = ifelse(priors_count > 20, 20, priors_count),
priors_factor = factor(priors_factor),
two_year_recid = factor(two_year_recid)) %>%
select(two_year_recid, age, sex_factor , juv_fel_count , juv_misd_count , juv_other_count , priors_count , c_charge_degree , charge_factor, race_factor, decile_score, length_of_stay)
feature_names <- names(df)[-c(1,10,11)]
dfn = subset(df, select = -c(decile_score, race_factor))
set.seed(5281110)
split <- initial_split(dfn, p = 0.75)
train <- training(split)
test <- testing(split)
rec_full <- recipe(
two_year_recid ~ .,
data = train
) %>%
step_dummy(all_nominal(), -two_year_recid) %>%
step_interact(~ all_predictors() * all_predictors()) %>%
step_poly(age, degree = 3) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors())
mod_lm <- logistic_reg() %>%
set_engine('glm')
wf_full <- workflow() %>%
add_recipe(rec_full) %>%
add_model(mod_lm)
fit_full <- wf_full %>% fit(data = train)
test %>%
select(two_year_recid) %>%
bind_cols(predict(fit_full, new_data = test) %>% rename(full = .pred_class))
#> Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
#> prediction from a rank-deficient fit may be misleading
#> # A tibble: 1,543 x 2
#> two_year_recid full
#> <fct> <fct>
#> 1 1 0
#> 2 0 0
#> 3 0 0
#> 4 1 1
#> 5 1 1
#> 6 1 1
#> 7 1 1
#> 8 1 0
#> 9 0 0
#> 10 1 0
#> # … with 1,533 more rows
Created on 2020-12-09 by the reprex package (v0.3.0.9001)

Predictor importance for PLS model trained with tidymodels

I'm using tidymodels to fit a PLS model but I'm struggling to find the PLS variable importance scores or coefficients.
This is what I've tried so far; the example data is from AppliedPredictiveModeling package.
Modeling fitting
data(ChemicalManufacturingProcess)
split <- ChemicalManufacturingProcess %>% initial_split(prop = 0.7)
train <- training(split)
test <- testing(split)
tidy_rec <- recipe(Yield ~ ., data = train) %>%
step_knnimpute(all_predictors()) %>%
step_BoxCox(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_predictors())
boots <- bootstraps(time = 25, data = train)
tidy_model <- plsmod::pls(num_comp = tune()) %>%
set_mode("regression") %>%
set_engine("mixOmics")
tidy_grid <- expand.grid(num_comp = seq(from = 1, to = 48, by = 5))
tidy_tune <- tidy_model %>% tune_grid(
preprocessor = tidy_rec,
grid = tidy_grid,
resamples = boots,
metrics = metric_set(mae, rmse, rsq)
)
tidy_best <- tidy_tune %>% select_best("rsq")
Final_model <- tidy_model %>% finalize_model(tidy_best)
tidy_wf <- workflow() %>%
add_model(Final_model) %>%
add_recipe(tidy_rec)
Fit_PLS <- tidy_wf %>% fit(data = train)
# check the most important predictors
tidy_info <- Fit_PLS %>% pull_workflow_fit()
loadings <- tidy_info$fit$loadings$X
PLS variable importance
tidy_load <- loadings %>% as.data.frame() %>% rownames_to_column() %>%
select(rowname, comp1, comp2, comp3) %>%
pivot_longer(-rowname) %>%
rename(predictors = rowname)
tidy_load %>% mutate(Sing = if_else(value < 0, "neg", "pos")) %>%
mutate(absvalue = abs(value)) %>% group_by(predictors) %>% summarise(Importance = sum(absvalue)) %>%
mutate(predictors = fct_reorder(predictors, Importance)) %>%
slice_head(n = 15) %>%
ggplot(aes(Importance, predictors, fill = predictors)) + geom_col(show.legend = F)
Thanks! The vi() function from the vip package is not available for this model.
You can directly tidy() the output of the PLS model to get the coefficients:
library(tidymodels)
library(tidyverse)
library(plsmod)
data(ChemicalManufacturingProcess, package = "AppliedPredictiveModeling")
split <- initial_split(ChemicalManufacturingProcess, prop = 0.7)
train <- training(split)
test <- testing(split)
chem_rec <- recipe(Yield ~ ., data = train) %>%
step_knnimpute(all_predictors()) %>%
step_BoxCox(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_predictors())
pls_spec <- pls(num_comp = 4) %>% ## can tune instead to find the optimal number
set_mode("regression") %>%
set_engine("mixOmics")
wf <- workflow() %>%
add_recipe(chem_rec) %>%
add_model(pls_spec)
pls_fit <- fit(wf, train)
## tidy the fitted model
tidy_pls <- pls_fit %>%
pull_workflow_fit()
tidy()
tidy_pls
#> # A tibble: 192 x 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 BiologicalMaterial01 0.193 predictors 1
#> 2 BiologicalMaterial01 -0.247 predictors 2
#> 3 BiologicalMaterial01 0.00969 predictors 3
#> 4 BiologicalMaterial01 0.0228 predictors 4
#> 5 BiologicalMaterial03 0.249 predictors 1
#> 6 BiologicalMaterial03 -0.00118 predictors 2
#> 7 BiologicalMaterial03 0.0780 predictors 3
#> 8 BiologicalMaterial03 -0.0866 predictors 4
#> 9 BiologicalMaterial04 0.217 predictors 1
#> 10 BiologicalMaterial04 -0.192 predictors 2
#> # … with 182 more rows
tidy_pls %>%
filter(term != "Y") %>%
group_by(component) %>%
slice_max(abs(value), n = 10) %>%
ungroup() %>%
ggplot(aes(value, fct_reorder(term, value), fill = factor(component))) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, scales = "free_y") +
labs(y = NULL)
Created on 2020-10-19 by the reprex package (v0.3.0.9001)
I showed this without tuning the number of components, but it works about the same with tuning.

Resources