Tidymodels / XGBoost error in last_fit with rsplit value - r

I am trying to follow this tutorial here - https://juliasilge.com/blog/xgboost-tune-volleyball/
I am using it on the most recent Tidy Tuesday dataset about Great Lakes fishing, trying to predict agency based on many other values.
All of the code below works except the final line, where I get the following error:
> final_res <- last_fit(final_xgb, stock_folds)
Error: Each element of `splits` must be an `rsplit` object.
I searched for that error and came to this page - https://github.com/tidymodels/rsample/issues/175
That issue describes it as a bug that appears to have been fixed, but it concerns initial_time_split, not the initial_split that I am using. I would rather not change the split, because then I would have to rerun the xgboost tuning that took 9 hours. What went wrong here?
# Setup ----
library(tidyverse)
library(tidymodels)
stocked <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-08/stocked.csv')
stocked_modeling <- stocked %>%
mutate(AGENCY = case_when(
AGENCY != "OMNR" ~ "other",
TRUE ~ AGENCY
)) %>%
select(-SID, -MONTH, -DAY, -LATITUDE, -LONGITUDE, -GRID, -STRAIN, -AGEMONTH,
-MARK_EFF, -TAG_NO, -TAG_RET, -LENGTH, -WEIGHT, - CONDITION, -LOT_CODE,
-NOTES, - VALIDATION, -LS_MGMT, -STAT_DIST, -ST_SITE, -YEAR_CLASS, -STOCK_METH) %>%
mutate_if(is.character, factor) %>%
drop_na()
# Start making model ----
set.seed(123)
stock_split <- initial_split(stocked_modeling, strata = AGENCY)
stock_train <- training(stock_split)
stock_test <- testing(stock_split)
xgb_spec <- boost_tree(
trees = 1000,
tree_depth = tune(), min_n = tune(), loss_reduction = tune(),
sample_size = tune(), mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), stock_train),
learn_rate(),
size = 20
)
xgb_workflow <- workflow() %>%
add_formula(AGENCY ~ .) %>%
add_model(xgb_spec)
set.seed(123)
stock_folds <- vfold_cv(stock_train, strata = AGENCY)
doParallel::registerDoParallel()
# BEWARE, THIS CODE BELOW TOOK 9 HOURS TO RUN
set.seed(234)
xgb_res <- tune_grid(
xgb_workflow,
resamples = stock_folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
# Explore results
best_auc <- select_best(xgb_res, "roc_auc")
final_xgb <- finalize_workflow(
xgb_workflow,
best_auc)
final_res <- last_fit(final_xgb, stock_folds)

The documentation of last_fit() says that the split argument must be
an rsplit object created from rsample::initial_split().
You passed the cross-validation folds object stock_folds as split, but last_fit() needs the rsplit object stock_split instead. Because final_xgb is already finalized with the best parameters, this does not require rerunning tune_grid():
final_res <- last_fit(final_xgb, stock_split)
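To see the difference between the two kinds of object, here is a minimal sketch on mtcars rather than the stocking data. The linear model, the formula, and the names demo_split, demo_folds, demo_wf, and demo_res are assumptions chosen only to keep the example fast; they are not part of the original question.

```r
library(tidymodels)

set.seed(123)
demo_split <- initial_split(mtcars)    # a single train/test split
demo_folds <- vfold_cv(mtcars, v = 5)  # a table of resamples

# last_fit() expects the single split, not the table of folds
inherits(demo_split, "rsplit")
inherits(demo_folds, "rsplit")
# each individual fold stored in the table is itself an rsplit
inherits(demo_folds$splits[[1]], "rsplit")

# A deliberately cheap workflow (assumption: plain linear regression)
demo_wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Fit on the training portion and evaluate on the testing portion
demo_res <- last_fit(demo_wf, demo_split)
collect_metrics(demo_res)
```

The folds are for tuning and resampling; the initial split is what last_fit() uses to fit once on the training set and evaluate once on the test set.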

Related

How can I tune the minority_prop argument of the ROSE upsampling algorithm using tidymodels?

I have an imbalanced data set and am using the tidymodels framework to build predictive models. To correct for the imbalance, I use the upsampling ROSE algorithm, which has two arguments I'd like to tune, namely over_ratio and minority_prop.
To do so, I set each argument to tune() in the recipe step, and then I built a CV grid with the corresponding names. However, the minority_prop argument is not recognized when I run the CV search.
# data
set.seed(20)
y <- rbinom(100, 1, 0.1)
X <- MASS::mvrnorm(100, c(1,2), diag(2))
dat <- cbind(y,X)
dat <- data.frame(dat)
dat$y <- as.factor(dat$y)
# define the recipe
my_recipe <-
recipe(y ~ ., data = dat) |>
step_rose(y, over_ratio = tune(), minority_prop = tune(),
skip = TRUE) %>%
step_normalize(all_numeric_predictors(), skip = FALSE)
# MODEL
mod <-
svm_rbf(mode = "classification", cost = tune(),
rbf_sigma = tune()) %>%
set_engine("kernlab")
# set the workflow
svc_workflow <- workflow() %>%
# add the recipe
add_recipe(my_recipe) %>%
# add the model
add_model(mod)
grid_svc <- expand.grid(rbf_sigma = seq(0, 10, 2), cost = seq(0,10,2),
over_ratio = seq(0.5,1.5,0.5), minority_prop = seq(0.5,0.8,0.15))
# cv tuning
doParallel::registerDoParallel()
cv_tuning <- tune_grid(svc_workflow,
resamples = vfold_cv(dat),
grid = grid_svc,
metrics = metric_set(f_meas, precision, recall,
accuracy, pr_auc))
I then receive the following error.
Error in `check_grid()`:
! The provided `grid` has the following parameter columns that have not been marked for tuning by `tune()`: 'minority_prop'.
Run `rlang::last_error()` to see where the error occurred.
I tried tuning only over over_ratio without minority_prop and it worked. What am I doing wrong?

Tidymodels Predict Error in R while predict on test

I am using the code below to build a model and predict with tidymodels. I am fairly new to tidymodels, so maybe my approach is completely wrong, but here is the problem.
When the input data types in the test dataset differ from those in the training dataset, I get the error shown below. Otherwise the code works fine (that is, when the train and test data structures are identical). I assumed that the preprocessing step would handle this when processing the test data.
If anyone knows or has encountered this problem, please let me know a possible solution.
I searched for this issue but haven't found anything like it.
Thanks for looking into it.
Code:
library(tidymodels)
library(dplyr)
mt1 <- mtcars ## assume this is the train data
mt2 <- mtcars ## assume this is the test data
mt2$mpg <- as.character(mt2$mpg) ## just forcing them to be character to reproduce the problem in my actual data
mt2$qsec <- as.character(mt2$qsec)
dp_pipe <- recipe(am ~ .,data=mt1) %>%
update_role(cyl,vs,new_role = "drop_vars") %>%
update_role(mpg,
disp,
drat,wt, qsec, new_role="to_numeric") %>%
step_rm(has_role("drop_vars")) %>%
step_mutate_at(has_role(match = "to_numeric"),fn = as.numeric)
# Cross folds
folds = vfold_cv(mt1, v = 10)
# define parameter grid to be tuned
my_grid = tibble(penalty = 10^seq(-2, -1, length.out = 10))
# define lasso model
lasso_mod = linear_reg(mode = "regression",
penalty = tune(),
mixture = 1) %>%
set_engine("glmnet")
# add everything to a workflow
wf = workflow() %>%
add_model(lasso_mod) %>%
add_recipe(dp_pipe)
# tune the workflow
my_res <- wf %>%
tune_grid(resamples = folds,
grid = my_grid,
control = control_grid(verbose = FALSE, save_pred = TRUE),
metrics = metric_set(rmse))
best_mod = my_res %>% select_best("rmse")
best_mod
final_fitted = finalize_workflow(wf, best_mod) %>% fit(data=mt1)
# predicted for train
final_fitted %>%
predict(mt1)
final_fitted %>%
predict(mt2)
Error at my end:
> Error: ! Can't convert `data$mpg` <character> to match type of `mpg`
> <double>. Run `rlang::last_error()` to see where the error occurred.

How does the tune package handle the mtry hyperparameter when applying step_dummy

I'm defining the grid for an xgboost model with grid_latin_hypercube(). I understand that the mtry hyperparameter should be finalized either with the finalize() function or manually with the range parameter of mtry().
Assuming that I have a dataframe with 10 variables: 1 id, 1 outcome, 7 numeric predictors and 1 categorical predictor with 10 equally frequent classes, and I run the following code:
folds <- vfold_cv(train_data)
xgb_spec <- boost_tree(
trees = 2000,
mtry = tune()
) %>%
set_mode('regression') %>%
set_engine('xgboost')
xgb_grid <- grid_latin_hypercube(
finalize(mtry(), train_data),
size = 5
)
xgb_rec <- recipe(outcome ~ ., data = train_data) %>%
update_role(id, new_role = 'id') %>%
step_dummy(all_nominal_predictors())
xgb_wflow <- workflow() %>%
add_model(xgb_spec) %>%
add_recipe(xgb_rec)
xgb_results <- tune_grid(
xgb_wflow,
resamples = folds,
grid = xgb_grid,
metrics = metric_set(rmse),
control = control_grid(save_pred = TRUE, save_workflow = TRUE)
)
How does tune handle the new set of predictors (9 - 2 + 10 = 17) generated by the recipe, given that the grid hyperparameter was finalized with a dataframe with 10 variables?
When finalizing manually, you'll have to run the recipe's prep()/bake() cycle first to generate the preprocessed data, and then finalize mtry() on that data.
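The prep()/bake() cycle mentioned above can be sketched as follows. The simulated train_data is a hypothetical stand-in matching the description in the question (an id, an outcome, 7 numeric predictors, and one factor with 10 equally frequent levels), and baked_predictors and mtry_final are illustrative names. Note that with step_dummy()'s default settings a 10-level factor becomes 9 indicator columns, so this sketch ends up with 16 predictors.

```r
library(tidymodels)

# Hypothetical stand-in for train_data (assumption, not the asker's data)
set.seed(1)
n <- 100
train_data <- tibble(
  id = seq_len(n),
  outcome = rnorm(n),
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n),
  x5 = rnorm(n), x6 = rnorm(n), x7 = rnorm(n),
  category = factor(rep(letters[1:10], length.out = n))
)

xgb_rec <- recipe(outcome ~ ., data = train_data) %>%
  update_role(id, new_role = "id") %>%
  step_dummy(all_nominal_predictors())

# Estimate the recipe, then extract only the processed predictor columns
baked_predictors <- xgb_rec %>%
  prep() %>%
  bake(new_data = NULL, all_predictors())

# Number of predictors the model actually sees after preprocessing
ncol(baked_predictors)

# Finalize mtry() against the processed predictors, not the raw data frame
mtry_final <- finalize(mtry(), baked_predictors)
mtry_final
```

The finalized mtry_final object can then be passed to the grid function in place of finalize(mtry(), train_data).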

tune_bayes() function error - All of the models failed--see the .notes column

I am trying to replicate the examples of hyperparameter tuning using Bayesian searching from this site: https://www.r-bloggers.com/2020/05/bayesian-hyperparameters-optimization/ , and when running my code, received the following error: Error: All of the models failed. See the .notes column. Run rlang::last_error() to see where the error occurred.
Here is my current code. The error occurs when running the code starting on the tuned_PI line. Please let me know if you have any suggestions. I am very new to the tidymodels package and hyperparameter tuning.
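# Randomly pick 70% of the row indices for training; the remaining rows form the test set
training_index <- sample(nrow(data), size = floor(nrow(data) * 0.70))
test_index <- setdiff(seq_len(nrow(data)), training_index)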
# Get the training data and test data
training_data <- data[training_index, ]
test_data <- data[test_index, ]
model_tune <- rand_forest(mtry = tune(), min_n = tune(), trees = tune()) %>%
set_engine("ranger", seed=222) %>%
set_mode("classification")
set.seed(1234)
folds <- vfold_cv(training_data, v=5, strata = DEATH_EVENT)
tune_wf <- workflow() %>%
add_model(model_tune) %>%
add_formula(DEATH_EVENT~.)
tuned_PI <- tune_wf %>%
tune_bayes(resamples = folds,
param_info=parameters(mtry(range = c(1,10)), min_n(range = c(1,10)), trees(range = c(480,540))),
metrics=metric_set(sensitivity),
objective=prob_improve(trade_off = 0.01))

Why is tidymodels with a ranger engine so much slower than ranger?

I'm taking a first look at tidymodels. My alternative for the current project would be non-tidyfied ranger. On a test run, classification random forest with tidymodels using the ranger engine is much slower than hand-held ranger (approximately ten times slower) when run on the classic iris dataset. Why is that?
library(tidymodels)
library(ranger)
# Make example data
data("iris")
mydata <- iris[sample(1:nrow(iris), 600, replace=T),]
# Recipe
myrecipe <- mydata %>% recipe( Species ~ . )
# Setting a Ranger RF model
myRF <- rand_forest( trees = 300, mtry = 3, min_n = 1) %>%
set_mode("classification") %>%
set_engine("ranger")
# Setting a workflow
myworkflow <- workflow() %>%
add_model(myRF) %>%
add_recipe(myrecipe)
# Compare base ranger and tidy setup
time <- Sys.time()
fit_ranger <- ranger( Species ~ . , data = mydata, probability = TRUE,
mtry = 3, num.trees = 300, min.node.size = 1)
# units must be named: the third positional argument of difftime() is tz
ranger_time <- difftime( Sys.time(), time, units = "secs")
time <- Sys.time()
fit_tidy <- myworkflow %>%
fit(data= mydata)
tidy_time <- difftime( Sys.time(), time, units = "secs")
tidy_time
ranger_time
