How to extract PLSR coefficients as for glmnet using tidymodels

I tuned a glmnet regression model and extracted the coefficients as described here. That works wonderfully. However, when I use the same form of coefficient extraction for PLSR with the mixOmics engine, I obtain single values per term and component, as demonstrated here. For further external use I need the PLSR coefficients in the first form. I can achieve this by passing the optimal hyperparameter set to the plsr() function from the pls package and then extracting the coefficients with coef(), as shown at the end of the code below. However, I would like to avoid this extra step because I cannot pass parameters like predictor_prop to plsr(), and thus the results may vary.
Is there a more elegant way to extract the overall model coefficients of the PLSR, as for glmnet, or can I calculate them from the component values?
library(tidymodels)
library(plsmod)
data(Chicago)
Chicago <- Chicago %>% select(ridership, Clark_Lake, Austin, Harlem)
# create cross-validation dataset
folds <- vfold_cv(Chicago)
# create recipe
rec <- recipe(ridership ~ ., Chicago) %>%
  step_normalize(all_predictors()) %>%
  prep(training = Chicago)
# define model
mod <- parsnip::pls(mode = "regression",
                    num_comp = tune(),
                    predictor_prop = tune()) %>%
  set_engine("mixOmics")
# define workflow
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)
# run grid tuning
set.seed(123)
res <- tune_grid(wf, resamples = folds, grid = 5)
# get best model
res_best <- res %>% select_best("rmse")
# fit best model and extract coefficients
wf %>%
  finalize_workflow(res_best) %>%
  fit(Chicago) %>%
  extract_fit_parsnip() %>%
  tidy()
# extracting coefficients using plsr from pls package and coef function
p <- pls::plsr(ridership ~ ., data = Chicago, scale = TRUE, center = TRUE, ncomp = 3)
coef(p, intercept = TRUE)
Thank you for the awesome tidymodels framework and everyone who makes it what it is!

As far as I can tell, you are doing the correct thing.
The coef() function only shows you the result for 3 components, but you can get the same result by adding filter(component == 3) to the following code:
wf %>%
  finalize_workflow(res_best) %>%
  fit(Chicago) %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(component == 3)
#> # A tibble: 4 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 0 predictors 3
#> 2 Austin 1 predictors 3
#> 3 Harlem 0 predictors 3
#> 4 Y 1 outcomes 3
The reason you are getting 0s and 1s is that the tuned value of the hyperparameter predictor_prop is quite low, giving you a sparse representation:
library(tidymodels)
library(plsmod)
data(Chicago)
Chicago <- Chicago %>% select(ridership, Clark_Lake, Austin, Harlem)
# create cross-validation dataset
folds <- vfold_cv(Chicago)
# create recipe
rec <- recipe(ridership ~ ., Chicago) %>%
step_normalize(all_predictors()) %>%
prep(training = Chicago)
# define model
mod <- parsnip::pls(mode = "regression",
num_comp = tune(),
predictor_prop = tune()) %>%
set_engine("mixOmics")
# define workflow
wf <- workflow() %>%
add_recipe(rec) %>%
add_model(mod)
# run grid tuning
set.seed(123)
res <- tune_grid(wf, resamples = folds, grid = 5)
# get best model
res_best <- res %>% select_best("rmse")
res_best
#> # A tibble: 1 × 3
#> predictor_prop num_comp .config
#> <dbl> <int> <chr>
#> 1 0.0869 3 Preprocessor1_Model1
# fit best model and extract coefficients
wf %>%
finalize_workflow(res_best) %>%
fit(Chicago) %>%
extract_fit_parsnip() %>%
tidy()
#> # A tibble: 12 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 1 predictors 1
#> 2 Clark_Lake 0 predictors 2
#> 3 Clark_Lake 0 predictors 3
#> 4 Austin 0 predictors 1
#> 5 Austin 0 predictors 2
#> 6 Austin 1 predictors 3
#> 7 Harlem 0 predictors 1
#> 8 Harlem -1 predictors 2
#> 9 Harlem 0 predictors 3
#> 10 Y 1 outcomes 1
#> 11 Y 1 outcomes 2
#> 12 Y 1 outcomes 3
wf %>%
finalize_workflow(
tibble(predictor_prop = 0, num_comp = 3)
) %>%
fit(Chicago) %>%
extract_fit_parsnip() %>%
tidy()
#> # A tibble: 12 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 1 predictors 1
#> 2 Clark_Lake 0 predictors 2
#> 3 Clark_Lake 0 predictors 3
#> 4 Austin 0 predictors 1
#> 5 Austin 0 predictors 2
#> 6 Austin 1 predictors 3
#> 7 Harlem 0 predictors 1
#> 8 Harlem -1 predictors 2
#> 9 Harlem 0 predictors 3
#> 10 Y 1 outcomes 1
#> 11 Y 1 outcomes 2
#> 12 Y 1 outcomes 3
wf %>%
finalize_workflow(
tibble(predictor_prop = 0.5, num_comp = 3)
) %>%
fit(Chicago) %>%
extract_fit_parsnip() %>%
tidy()
#> # A tibble: 12 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 0.908 predictors 1
#> 2 Clark_Lake 0 predictors 2
#> 3 Clark_Lake 0 predictors 3
#> 4 Austin 0.419 predictors 1
#> 5 Austin -0.859 predictors 2
#> 6 Austin -0.406 predictors 3
#> 7 Harlem 0 predictors 1
#> 8 Harlem -0.513 predictors 2
#> 9 Harlem 0.914 predictors 3
#> 10 Y 1 outcomes 1
#> 11 Y 1 outcomes 2
#> 12 Y 1 outcomes 3
wf %>%
finalize_workflow(
tibble(predictor_prop = 1, num_comp = 3)
) %>%
fit(Chicago) %>%
extract_fit_parsnip() %>%
tidy()
#> # A tibble: 12 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 0.593 predictors 1
#> 2 Clark_Lake 0.738 predictors 2
#> 3 Clark_Lake 0.321 predictors 3
#> 4 Austin 0.576 predictors 1
#> 5 Austin -0.111 predictors 2
#> 6 Austin -0.810 predictors 3
#> 7 Harlem 0.562 predictors 1
#> 8 Harlem -0.665 predictors 2
#> 9 Harlem 0.491 predictors 3
#> 10 Y 1 outcomes 1
#> 11 Y 1 outcomes 2
#> 12 Y 1 outcomes 3
Created on 2022-09-08 by the reprex package (v2.0.1)
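On the second part of the question (whether overall coefficients can be calculated from per-component values), textbook PLS1 gives the coefficient vector for the standardized predictors as B = W (P'W)^-1 q, where W holds the per-component weight vectors, P the X-loadings and q the regressions of the outcome on the scores. The base-R sketch below illustrates that identity on mtcars. Note the assumptions: pls1_coef is a hypothetical helper written for this illustration, not part of plsmod or mixOmics; it covers the non-sparse case only (the equivalent of predictor_prop = 1); and whether the values returned by tidy() correspond exactly to W for the mixOmics engine should be checked against that package's documentation.

```r
# Textbook NIPALS PLS1 in base R (illustrative sketch, not the mixOmics internals).
# pls1_coef is a hypothetical helper name.
pls1_coef <- function(X, y, ncomp) {
  X <- scale(X)                           # centre and scale the predictors
  x_center <- attr(X, "scaled:center")
  x_scale  <- attr(X, "scaled:scale")
  y_mean <- mean(y)
  yc <- y - y_mean                        # centre the outcome

  n_pred <- ncol(X)
  W <- P <- matrix(0, n_pred, ncomp)      # weights and X-loadings, one column per component
  q <- numeric(ncomp)                     # outcome loadings
  Xa <- X
  for (a in seq_len(ncomp)) {
    w <- crossprod(Xa, yc)                # direction of maximal covariance with y
    w <- w / sqrt(sum(w^2))
    scores <- Xa %*% w
    ss <- sum(scores^2)
    P[, a] <- crossprod(Xa, scores) / ss
    q[a]   <- sum(yc * scores) / ss
    W[, a] <- w
    Xa <- Xa - scores %*% t(P[, a])       # deflate X before the next component
  }

  # Combine the per-component pieces into one coefficient per predictor
  b_scaled <- drop(W %*% solve(crossprod(P, W), q))
  b <- b_scaled / x_scale                 # back to the original predictor units
  names(b) <- colnames(X)
  c("(Intercept)" = y_mean - sum(b * x_center), b)
}

X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
pls1_coef(X, mtcars$mpg, ncomp = 2)

# Sanity check: with as many components as predictors, PLS reproduces least squares
all.equal(
  unname(pls1_coef(X, mtcars$mpg, ncomp = 3)),
  unname(coef(lm(mpg ~ wt + hp + disp, data = mtcars))),
  tolerance = 1e-6
)
```

The result has the same shape as coef(p, intercept = TRUE) from pls::plsr(): one intercept plus one coefficient per predictor, for a chosen number of components.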

Related

How to tune a model using grid search and a single validation fold with tidymodels?

I have just learnt about the KNN algorithm and machine learning. It is a lot for me to take in and we are using tidymodels in R to practice.
Now, I know how to implement a grid search using k-fold cross-validation as follows:
hist_data_split <- initial_split(hist_data, strata = fraud)
hist_data_train <- training(hist_data_split)
hist_data_test <- testing(hist_data_split)
folds <- vfold_cv(hist_data_train, strata = fraud)
nearest_neighbor_grid <- grid_regular(neighbors(range = c(1, 500)), levels = 25)
knn_rec_1 <- recipe(fraud ~ ., data = hist_data_train)
knn_spec_1 <- nearest_neighbor(mode = "classification", engine = "kknn", neighbors = tune(), weight_func = "rectangular")
knn_wf_1 <- workflow(preprocessor = knn_rec_1, spec = knn_spec_1)
knn_fit_1 <- tune_grid(knn_wf_1, resamples = folds, metrics = metric_set(accuracy, sens, spec, roc_auc), control = control_resamples(save_pred = T), grid = nearest_neighbor_grid)
In the above case, I am essentially running a 10-fold cross-validated grid search to tune my model. However, the size of hist_data is 169173, which gives an optimal K of about 411 and with a 10-fold cross-validation, the tuning is going to take forever, so the hint given is to use a single validation fold instead of cross-validation.
Thus, I am wondering how I can tweak my code to implement this. When I add the argument v = 1 in vfold_cv, R throws me an error which says, "At least one row should be selected for the analysis set." Should I instead change resamples = folds in tune_grid to resamples = 1?
Any intuitive suggestions will be greatly appreciated :)
P.S. I did not include an MWE in the sense that the data is not provided because I feel like this is a really trivial question which can be answered as is!
If you are not able to do a cross validation split, for whatever reason, you can do a validation split which conceptually is very close to a v = 1 cross validation.
library(tidymodels)
hist_data_split <- initial_split(ames, strata = Street)
hist_data_train <- training(hist_data_split)
hist_data_test <- testing(hist_data_split)
folds <- validation_split(hist_data_train, strata = Street)
nearest_neighbor_grid <- grid_regular(
  neighbors(range = c(1, 500)),
  levels = 25
)
knn_rec_1 <- recipe(Street ~ ., data = ames)
knn_spec_1 <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn") %>%
  set_args(weight_func = "rectangular")
knn_wf_1 <- workflow(preprocessor = knn_rec_1, spec = knn_spec_1)
knn_fit_1 <- tune_grid(
  knn_wf_1,
  resamples = folds,
  metrics = metric_set(accuracy, sens, spec, roc_auc),
  control = control_resamples(save_pred = TRUE),
  grid = nearest_neighbor_grid
)
knn_fit_1
#> # Tuning results
#> # Validation Set Split (0.75/0.25) using stratification
#> # A tibble: 1 × 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [1647/550]> validation <tibble [100 × 5]> <tibble [0 × 3]> <tibble>
knn_fit_1 %>%
collect_metrics()
#> # A tibble: 100 × 7
#> neighbors .metric .estimator mean n std_err .config
#> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1 accuracy binary 0.996 1 NA Preprocessor1_Model01
#> 2 1 roc_auc binary 0.5 1 NA Preprocessor1_Model01
#> 3 1 sens binary 0 1 NA Preprocessor1_Model01
#> 4 1 spec binary 1 1 NA Preprocessor1_Model01
#> 5 21 accuracy binary 0.996 1 NA Preprocessor1_Model02
#> 6 21 roc_auc binary 0.495 1 NA Preprocessor1_Model02
#> 7 21 sens binary 0 1 NA Preprocessor1_Model02
#> 8 21 spec binary 1 1 NA Preprocessor1_Model02
#> 9 42 accuracy binary 0.996 1 NA Preprocessor1_Model03
#> 10 42 roc_auc binary 0.486 1 NA Preprocessor1_Model03
#> # … with 90 more rows
Created on 2022-09-06 by the reprex package (v2.0.1)
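A note for newer installations: as far as I know, rsample 1.2.0 and later deprecate validation_split() in favour of a three-way split made with initial_validation_split(), which validation_set() then turns into a single-resample object. The sketch below uses mtcars and arbitrary proportions purely for illustration; it is an assumption that your installed rsample version is recent enough to provide these two functions.

```r
library(rsample)

set.seed(123)
# 60% training, 20% validation, 20% testing (illustrative proportions)
split <- initial_validation_split(mtcars, prop = c(0.6, 0.2))
train_data <- training(split)
test_data  <- testing(split)

# An rset with exactly one resample; it can be passed as `resamples =`
# to tune_grid() in place of the `folds` object used above
val_set <- validation_set(split)
val_set
```

Because there is only one resample, collect_metrics() will still report n = 1 and std_err = NA, just as in the output above.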

R calculate most abundant taxa using phyloseq object

I would like to know whether my approach to calculating the average relative abundance of a taxon is correct.
Specifically, is the following a correct way to calculate the relative abundance (in percent) of each family (or any other taxon) in a phyloseq object (GlobalPatterns)?
data("GlobalPatterns")
T <- GlobalPatterns %>%
  tax_glom(., "Family") %>%
  transform_sample_counts(function(x) 100 * x / sum(x)) %>%
  psmelt() %>%
  arrange(OTU) %>%
  rename(OTUsID = OTU) %>%
  select(OTUsID, Family, Sample, Abundance) %>%
  spread(Sample, Abundance)
T$Mean <- rowMeans(T[, c(3:ncol(T))])
FAM <- T[, c("Family", "Mean")]
# order data frame
FAM <- FAM[order(dplyr::desc(FAM$Mean)), ]
rownames(FAM) <- NULL
head(FAM)
Family Mean
1 Bacteroidaceae 7.490944
2 Ruminococcaceae 6.038956
3 Lachnospiraceae 5.758200
4 Flavobacteriaceae 5.016402
5 Desulfobulbaceae 3.341026
6 ACK-M1 3.242808
In this case, Bacteroidaceae was the most abundant family across all samples of GlobalPatterns (26 samples and 19216 OTUs), with an average relative abundance of 7.49% over the 26 samples.
Is it correct to use T$Mean <- rowMeans(T[, c(3:ncol(T))]) to calculate the average abundance of any given taxon?
Bacteroidaceae has the highest abundance, if all samples were pooled together.
However, it has the highest abundance in only 2 samples.
Nevertheless, there is no other taxon having a higher abundance in an average sample.
Let's use dplyr verbs for all the steps to get more descriptive and consistent code:
library(tidyverse)
library(phyloseq)
#> Creating a generic function for 'nrow' from package 'base' in package 'biomformat'
#> Creating a generic function for 'ncol' from package 'base' in package 'biomformat'
#> Creating a generic function for 'rownames' from package 'base' in package 'biomformat'
#> Creating a generic function for 'colnames' from package 'base' in package 'biomformat'
data(GlobalPatterns)
data <-
GlobalPatterns %>%
tax_glom("Family") %>%
transform_sample_counts(function(x)100* x / sum(x)) %>%
psmelt() %>%
as_tibble()
# highest abundance: all samples pooled together
data %>%
group_by(Family) %>%
summarise(Abundance = mean(Abundance)) %>%
arrange(-Abundance)
#> # A tibble: 334 × 2
#> Family Abundance
#> <chr> <dbl>
#> 1 Bacteroidaceae 7.49
#> 2 Ruminococcaceae 6.04
#> 3 Lachnospiraceae 5.76
#> 4 Flavobacteriaceae 5.02
#> 5 Desulfobulbaceae 3.34
#> 6 ACK-M1 3.24
#> 7 Streptococcaceae 2.77
#> 8 Nostocaceae 2.62
#> 9 Enterobacteriaceae 2.55
#> 10 Spartobacteriaceae 2.45
#> # … with 324 more rows
# sanity check: is total abundance of each sample 100%?
data %>%
group_by(Sample) %>%
summarise(Abundance = sum(Abundance)) %>%
pull(Abundance) %>%
`==`(100) %>%
all()
#> [1] TRUE
# get most abundant family for each sample individually
data %>%
group_by(Sample) %>%
arrange(-Abundance) %>%
slice(1) %>%
select(Family) %>%
ungroup() %>%
count(Family, name = "n_samples") %>%
arrange(-n_samples)
#> Adding missing grouping variables: `Sample`
#> # A tibble: 18 × 2
#> Family n_samples
#> <chr> <int>
#> 1 Desulfobulbaceae 3
#> 2 Bacteroidaceae 2
#> 3 Crenotrichaceae 2
#> 4 Flavobacteriaceae 2
#> 5 Lachnospiraceae 2
#> 6 Ruminococcaceae 2
#> 7 Streptococcaceae 2
#> 8 ACK-M1 1
#> 9 Enterobacteriaceae 1
#> 10 Moraxellaceae 1
#> 11 Neisseriaceae 1
#> 12 Nostocaceae 1
#> 13 Solibacteraceae 1
#> 14 Spartobacteriaceae 1
#> 15 Sphingomonadaceae 1
#> 16 Synechococcaceae 1
#> 17 Veillonellaceae 1
#> 18 Verrucomicrobiaceae 1
Created on 2022-06-10 by the reprex package (v2.0.0)
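One caveat worth keeping in mind when reading these numbers: averaging per-sample percentages (what rowMeans() and mean(Abundance) do here) weights every sample equally, which is not the same as pooling raw counts when sequencing depth differs between samples. The toy example below uses made-up counts for two hypothetical taxa and two samples, in base R only, to show how far the two summaries can diverge.

```r
# Made-up counts: rows are taxa, columns are samples with very different depths
counts <- matrix(
  c(90, 10,       # sample S1, depth 100
    100, 900),    # sample S2, depth 1000
  nrow = 2,
  dimnames = list(c("TaxonA", "TaxonB"), c("S1", "S2"))
)

# Per-sample relative abundance in percent (each column sums to 100)
rel <- prop.table(counts, margin = 2) * 100

# Mean of per-sample percentages: every sample counts equally
mean_rel <- rowMeans(rel)
mean_rel
#> TaxonA TaxonB
#>     50     50

# Pooled abundance: deep samples dominate
pooled <- rowSums(counts) / sum(counts) * 100
round(pooled, 2)
#> TaxonA TaxonB
#>  17.27  82.73
```

Which summary is appropriate depends on the question being asked; the approach in this thread corresponds to the first one.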

How can I unscale and understand glmnet coefficients while using tidymodels?

I'm a bit confused with how I should interpret the coefficients from the elastic net model that I'm getting through tidymodels and glmnet. Ideally, I'd like to produce unscaled coefficients for maximum interpretability.
My issue is that I'm honestly not sure how to unscale the coefficients that the model is yielding because I can't quite figure out what's being done in the first place.
It's a bit tricky for me to post the data one would need to reproduce my results, but here's my code:
library(tidymodels)
library(tidyverse)
# preps data for model
myrecipe <- mydata %>%
recipe(transactionrevenue ~ sessions + channelgrouping + month + new_user_pct + is_weekend) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(month, channelgrouping, one_hot = TRUE) %>%
step_zv(all_predictors()) %>%
step_normalize(sessions, new_user_pct) %>%
step_interact(terms = ~ sessions:starts_with("channelgrouping") + new_user_pct:starts_with("channelgrouping"))
# creates the model
mymodel <- linear_reg(penalty = 10, mixture = 0.2) %>%
set_engine("glmnet", standardize = FALSE)
wf <- workflow() %>%
add_recipe(myrecipe)
model_fit <- wf %>%
add_model(mymodel) %>%
fit(data = mydata)
# posts coefficients
tidy(model_fit)
If it would help, here's some information that might be useful:
The variable that I'm really focusing on is "sessions."
In the model, the coefficient for sessions is 2543.094882, and the intercept is 1963.369782. The penalty is also 10.
The unscaled mean for sessions is 725.2884 and the standard deviation is 1035.381.
I just can't seem to figure out what units the coefficients are in and how/if it's even possible to unscale the coefficients back to the original units.
Any insight would be very much appreciated.
You can use tidy() on a lot of different components of a workflow. The default is to tidy() the model, but you can also get out the recipe and even individual recipe steps. This is where the information you are interested in lives.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(bivariate)
biv_rec <-
recipe(Class ~ ., data = bivariate_train) %>%
step_BoxCox(all_predictors())%>%
step_normalize(all_predictors())
svm_spec <- svm_linear(mode = "classification")
biv_fit <- workflow(biv_rec, svm_spec) %>% fit(bivariate_train)
## tidy the *model*
tidy(biv_fit)
#> # A tibble: 3 × 2
#> term estimate
#> <chr> <dbl>
#> 1 A -1.15
#> 2 B 1.17
#> 3 Bias 0.328
## tidy the *recipe*
extract_recipe(biv_fit) %>%
tidy()
#> # A tibble: 2 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step BoxCox TRUE FALSE BoxCox_ZRpI2
#> 2 2 step normalize TRUE FALSE normalize_DGmtN
## tidy the *recipe step*
extract_recipe(biv_fit) %>%
tidy(number = 1)
#> # A tibble: 2 × 3
#> terms value id
#> <chr> <dbl> <chr>
#> 1 A -0.857 BoxCox_ZRpI2
#> 2 B -1.09 BoxCox_ZRpI2
## tidy the other *recipe step*
extract_recipe(biv_fit) %>%
tidy(number = 2)
#> # A tibble: 4 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 A mean 1.16 normalize_DGmtN
#> 2 B mean 0.909 normalize_DGmtN
#> 3 A sd 0.00105 normalize_DGmtN
#> 4 B sd 0.00260 normalize_DGmtN
Created on 2021-08-05 by the reprex package (v2.0.0)
You can read more about tidying a recipe here.
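Once the mean and sd of a normalized predictor are known, the unit conversion itself is simple arithmetic: step_normalize() replaces x with (x - mean) / sd, so a coefficient on the normalized variable is the change in the outcome per one standard deviation of x. Dividing by the sd gives the change per original unit, and the intercept shifts by coefficient * mean / sd. The sketch below checks this with a plain lm() on mtcars (not glmnet, and without a penalty), purely to illustrate the algebra.

```r
# Normalize one predictor by hand, as step_normalize() would
x_mean <- mean(mtcars$wt)
x_sd   <- sd(mtcars$wt)
scaled <- data.frame(mpg = mtcars$mpg, wt_z = (mtcars$wt - x_mean) / x_sd)

# Fit on the normalized predictor
fit_z <- lm(mpg ~ wt_z, data = scaled)
b0_z <- coef(fit_z)[["(Intercept)"]]
b1_z <- coef(fit_z)[["wt_z"]]

# Convert back to original units
slope_raw     <- b1_z / x_sd
intercept_raw <- b0_z - b1_z * x_mean / x_sd

# Compare with a fit on the untransformed predictor
fit_raw <- lm(mpg ~ wt, data = mtcars)
all.equal(c(intercept_raw, slope_raw), unname(coef(fit_raw)))
```

Applied to the numbers in the question, 2543.094882 / 1035.381 is about 2.456 per additional session for the main effect. Because the recipe also creates interactions between sessions and the channelgrouping dummies, the full effect of sessions depends on those interaction coefficients as well, so treat this figure as the main-effect part only.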

How to convert normalized numeric variable (library(recipes)) back to original value in R

I normalized the numeric variables with library(recipes) in R before fitting decision tree models to predict an outcome. Now I have a decision tree, and age is one of the important variables in the nodes, with splits like >1.5 and < 1.5. I want to convert that -1.5 back into a non-normalized value so that I can give it a practical meaning (like age > 50 or <= 50 years old). I have searched and cannot find the answer.
library(recipes)
recipe_obj <- dataset %>%
  recipe(formula = anyaki ~ .) %>%  # specify formula
  step_center(all_numeric()) %>%    # center data (0 mean)
  step_scale(all_numeric()) %>%     # std = 1
  prep(training = dataset)          # prep() takes the data via `training`
dataset_scaled <- bake(recipe_obj, new_data = dataset)
Age is one of the variables that have been normalized with the recipes package in R. Now I am struggling to convert the normalized values in the final model back into non-normalized values so that I can give them a practical meaning. How can I do this?
You can access these kinds of estimated values using the tidy() method for recipes and recipe steps. Check out more details here and here.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(penguins)
penguin_rec <- recipe(~ ., data = penguins) %>%
step_other(all_nominal(), threshold = 0.2, other = "another") %>%
step_normalize(all_numeric()) %>%
step_dummy(all_nominal())
tidy(penguin_rec)
#> # A tibble: 3 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step other FALSE FALSE other_ZNJ2R
#> 2 2 step normalize FALSE FALSE normalize_ogEvZ
#> 3 3 step dummy FALSE FALSE dummy_YVCBo
tidy(penguin_rec, number = 1)
#> # A tibble: 1 × 3
#> terms retained id
#> <chr> <chr> <chr>
#> 1 all_nominal() <NA> other_ZNJ2R
penguin_prepped <- prep(penguin_rec, training = penguins)
#> Warning: There are new levels in a factor: NA
tidy(penguin_prepped)
#> # A tibble: 3 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step other TRUE FALSE other_ZNJ2R
#> 2 2 step normalize TRUE FALSE normalize_ogEvZ
#> 3 3 step dummy TRUE FALSE dummy_YVCBo
tidy(penguin_prepped, number = 1)
#> # A tibble: 6 × 3
#> terms retained id
#> <chr> <chr> <chr>
#> 1 species Adelie other_ZNJ2R
#> 2 species Gentoo other_ZNJ2R
#> 3 island Biscoe other_ZNJ2R
#> 4 island Dream other_ZNJ2R
#> 5 sex female other_ZNJ2R
#> 6 sex male other_ZNJ2R
tidy(penguin_prepped, number = 2)
#> # A tibble: 8 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 bill_length_mm mean 43.9 normalize_ogEvZ
#> 2 bill_depth_mm mean 17.2 normalize_ogEvZ
#> 3 flipper_length_mm mean 201. normalize_ogEvZ
#> 4 body_mass_g mean 4202. normalize_ogEvZ
#> 5 bill_length_mm sd 5.46 normalize_ogEvZ
#> 6 bill_depth_mm sd 1.97 normalize_ogEvZ
#> 7 flipper_length_mm sd 14.1 normalize_ogEvZ
#> 8 body_mass_g sd 802. normalize_ogEvZ
Created on 2021-08-07 by the reprex package (v2.0.0)
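With the mean and sd in hand, a value on the normalized scale converts back via original = z * sd + mean. The sketch below shows this with mtcars and step_normalize(); the split point of -1.5 is a hypothetical value standing in for a tree threshold. With separate step_center() and step_scale() steps, as in the question, the mean would come from tidying the centering step and the sd from tidying the scaling step.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars)
rec <- step_normalize(rec, all_numeric_predictors())
rec <- prep(rec, training = mtcars)

# Means and standard deviations estimated by the normalization step
stats <- tidy(rec, number = 1)
wt_mean <- stats$value[stats$terms == "wt" & stats$statistic == "mean"]
wt_sd   <- stats$value[stats$terms == "wt" & stats$statistic == "sd"]

# Hypothetical split point on the normalized scale, converted to original units
z_split <- -1.5
z_split * wt_sd + wt_mean

# Round trip: back-transforming the baked column recovers the raw data
wt_z <- bake(rec, new_data = NULL)$wt
all.equal(wt_z * wt_sd + wt_mean, mtcars$wt)
```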

Getting error when trying to apply tidymodels recipe from train data to resamples in r?

I am new to tidymodels and somewhat new to R as well. I am trying to replicate David Robinson's code from his YouTube tidytuesday/Sliced customer churn video, but I am facing issues applying recipe changes to cross-validated data / resamples.
Issue: When I perform step_mutate() on the train data it works, but when I apply the same recipe to the cross-validated train_5fold data it gives the error: Error: All of the models failed. See the .notes column.
To recreate the issue (download the data using the code below):
train <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/train_object.csv"))
train_5fold Cross validated resamples data can be downloaded from: https://github.com/johnsnow09/covid19-df_stack-code/blob/main/train_5fold.RDS
train_5fold <- readRDS("train_5fold.RDS")
Code:
library(tidyverse)
library(tidymodels)
mset <- metric_set(mn_log_loss)
control <- control_grid(save_workflow = TRUE,
                        save_pred = TRUE,
                        extract = extract_model)
xg_spec <- parsnip::boost_tree(
  trees = tune(),
  mtry = tune(),
  learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
factor_to_ordinal <- function(x){
  ifelse(x == "Unknown", NA, as.integer(x))
}
xg_rec_4 <- recipe(churned ~ ., data = train) %>%
  update_role(id, new_role = "ID") %>%
  step_mutate(income_category = factor_to_ordinal(income_category),
              education_level = factor_to_ordinal(education_level)) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
xg_wf_4 <- workflow() %>%
  add_recipe(xg_rec_4) %>%
  add_model(xg_spec)
xg_res_4 <- xg_wf_4 %>%
  tune_grid(
    resamples = train_5fold,
    metrics = mset,
    control = control,
    grid = crossing(trees = seq(200, 800, 20),
                    mtry = c(2, 4, 6, 8, 10),
                    learn_rate = c(0.02))
  )
autoplot(xg_res_4)
Error: All of the models failed. See the .notes column.
In .notes I get
.notes
<chr>
preprocessor 1/1: Error: Problem with `mutate()` column `income_category`.\ni `income_category = factor_to_ordinal(income_category)`.\nx could not find function "factor_to_ordinal"
Cross checking:
xg_rec_4 %>% prep() %>% juice()
# A tibble: 5,316 x 15
id customer_age education_level income_category total_relationship~ months_inactive_1~ credit_limit
<dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>
1 9168 46 3 5 3 3 2171
2 2187 51 4 4 3 1 11373
3 5659 48 3 4 4 2 14322
4 447 57 6 2 5 3 12291
5 6342 39 4 5 5 2 1862
6 496 56 6 5 4 3 3219
7 7064 33 4 1 6 3 27499
8 3978 48 4 4 1 2 34516
9 13 41 4 5 4 3 2372
10 8242 46 3 2 4 3 3115
# ... with 5,306 more rows, and 8 more variables: total_revolving_bal <dbl>, total_amt_chng_q4_q1 <dbl>,
# total_trans_amt <dbl>, total_trans_ct <dbl>, total_ct_chng_q4_q1 <dbl>, avg_utilization_ratio <dbl>,
# churned <fct>, gender_M <dbl>
colSums(xg_rec_4 %>% prep() %>% juice() %>% select_if(is.numeric) %>% is.na())
id customer_age education_level income_category
0 0 0 0
total_relationship_count months_inactive_12_mon credit_limit total_revolving_bal
0 0 0 0
total_amt_chng_q4_q1 total_trans_amt total_trans_ct total_ct_chng_q4_q1
0 0 0 0
avg_utilization_ratio gender_M
0 0
Whereas in the video it worked for David Robinson:
