Weird behavior when wrapping purrr::map within dplyr::mutate - r

I am running into an error I do not fully understand when calling purrr::map inside dplyr::mutate. The reproducible code is as follows:
# data
test_dset <- structure(list(genus = c("Aureitalea", "Aureivirga", "Auricoccucs"),
t_count = c(0L, 0L, 0L), n = c(1L, 1L, 1L),
ncbi_id = list("1176327", "1433990", character(0)),
g_test = list(c(`1176327` = 0),
c(`1433990` = 0),
structure(numeric(0), .Names = character(0)))),
class = c("rowwise_df", "tbl_df", "tbl", "data.frame"),
row.names = c(NA, -3L),
groups = structure(list(.rows = structure(list(1L, 2L, 3L),
ptype = integer(0),
class = c("vctrs_list_of","vctrs_vctr", "list"))),
row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame")))
#> # A tibble: 3 × 5
#> # Rowwise:
#> genus t_count n ncbi_id g_test
#> <chr> <int> <int> <list> <list>
#> 1 Aureitalea 0 1 <chr [1]> <dbl [1]>
#> 2 Aureivirga 0 1 <chr [1]> <dbl [1]>
#> 3 Auricoccucs 0 1 <chr [0]> <dbl [0]>
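# process a vector of pvals: names of significant entries, else NA_character_
proc_gtest <- function(pvals){
  if (length(pvals) == 0){
    return(NA_character_)
  }
  sig <- which(pvals < 0.05)
  if (length(sig) == 0){
    NA_character_
  } else {
    names(pvals)[sig]
  }
}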
# returns errors
test_dset |> mutate(ncbi_filt = map(g_test, proc_gtest))
#> Error: Problem with `mutate()` column `ncbi_filt`.
#> ℹ `ncbi_filt = map(g_test, proc_gtest)`.
#> ℹ `ncbi_filt` must be size 1, not 0.
#> ℹ Did you mean: `ncbi_filt = list(map(g_test, proc_gtest))` ?
#> ℹ The error occurred in row 3.
# this is okay
map(test_dset$g_test, proc_gtest)
#> [[1]]
#> [1] "1176327"
#> [[2]]
#> [1] "1433990"
#> [[3]]
#> [1] NA
# adding list doesn't work because it returns a list of NULL
# with names as the quantities I wanted.
test_dset |> mutate(ncbi_filt = list(map(g_test, proc_gtest))) |> pull(ncbi_filt)
#> [[1]]
#> [[1]]$`1176327`
#> [[2]]
#> [[2]]$`1433990`
#> [[3]]
#> named list()
Created on 2021-10-13 by the reprex package (v2.0.1)
My understanding is that the error arises because the mapped function returns nothing at row 3, and the fix dplyr suggests is to wrap everything in list().
However, I am using plain map, which should already return a list (other tutorials that use map to transform list columns of a tibble do not wrap the call in list() either). Wrapping the call in list() returns a list of NULL elements, with the values I want to extract set as the names of this new list.
My function does return a value even if the element in the list is empty (it returns NA_character_).
As seen in the reprex, calling map directly works and returns a list of length 3, with the empty row assigned NA as per the logic of the custom function. Right now I'm working around this by generating a separate list and attaching it to the data frame later, but I would love to understand what I'm looking at!

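The issue is the rowwise grouping. In a rowwise data frame, mutate() evaluates the expression once per row, and g_test refers to that row's element (a numeric vector) rather than to the whole list column. map() then iterates over the elements of that vector, so for row 3 (length 0) it returns an empty list, which has size 0 instead of the required size 1. As we are already looping over each element with map, just ungroup first:
test_dset %>%
  ungroup() %>%
  mutate(ncbi_filt = map(g_test, proc_gtest))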
# A tibble: 3 × 6
genus t_count n ncbi_id g_test ncbi_filt
<chr> <int> <int> <list> <list> <list>
1 Aureitalea 0 1 <chr [1]> <dbl [1]> <chr [1]>
2 Aureivirga 0 1 <chr [1]> <dbl [1]> <chr [1]>
3 Auricoccucs 0 1 <chr [0]> <dbl [0]> <chr [1]>
Or use map_chr to return a character vector instead of a list (as the function returns a single value per row):
test_dset %>%
  ungroup() %>%
  mutate(ncbi_filt = map_chr(g_test, proc_gtest))
# A tibble: 3 × 6
genus t_count n ncbi_id g_test ncbi_filt
<chr> <int> <int> <list> <list> <chr>
1 Aureitalea 0 1 <chr [1]> <dbl [1]> 1176327
2 Aureivirga 0 1 <chr [1]> <dbl [1]> 1433990
3 Auricoccucs 0 1 <chr [0]> <dbl [0]> <NA>
If we keep the rowwise grouping, we can apply the function directly, because g_test is already the single element for that row. Wrap the result in list() if the output can have length > 1 or a different structure per row:
test_dset %>%
  mutate(ncbi_filt = list(proc_gtest(g_test)))
# A tibble: 3 × 6
# Rowwise:
genus t_count n ncbi_id g_test ncbi_filt
<chr> <int> <int> <list> <list> <list>
1 Aureitalea 0 1 <chr [1]> <dbl [1]> <chr [1]>
2 Aureivirga 0 1 <chr [1]> <dbl [1]> <chr [1]>
3 Auricoccucs 0 1 <chr [0]> <dbl [0]> <chr [1]>
Here the function returns a single value per row, so the list() wrapper is not needed either:
test_dset %>%
  mutate(ncbi_filt = proc_gtest(g_test))
# A tibble: 3 × 6
# Rowwise:
genus t_count n ncbi_id g_test ncbi_filt
<chr> <int> <int> <list> <list> <chr>
1 Aureitalea 0 1 <chr [1]> <dbl [1]> 1176327
2 Aureivirga 0 1 <chr [1]> <dbl [1]> 1433990
3 Auricoccucs 0 1 <chr [0]> <dbl [0]> <NA>
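To see the difference between the two modes on a smaller example, here is a minimal sketch using a made-up three-row tibble (df, id, x and len are illustrative names, not from the question). It assumes dplyr 1.0 or later, where rowwise() exposes each list-column element directly:

```r
library(dplyr)
library(purrr)

# Hypothetical data: a list column whose third element is empty.
df <- tibble(
  id = 1:3,
  x  = list(c(a = 0.01), c(b = 0.2), numeric(0))
)

# Ungrouped: `x` is the whole list column, so map_int() visits one row per step.
ungrouped <- df %>%
  mutate(len = map_int(x, length))

# Rowwise: `x` is the single element for the current row, so no map is needed.
by_row <- df %>%
  rowwise() %>%
  mutate(len = length(x)) %>%
  ungroup()

ungrouped$len
by_row$len
```

Both approaches give the same len column; mixing them (map inside a rowwise mutate) is what iterates one level too deep.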


How to enable parallelization in tidymodels stacks::control_stack_grid()

I am attempting to use the tidymodels stacks package to perform ensemble modeling. Following the instructions provided in their article, I was able to reproduce the example successfully.
However, when I added parallelization during hyperparameter tuning for the "knn_res" section of the code:
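cls <- makePSOCKcluster(parallelly::availableCores())
registerDoParallel(cls)  # assumed: backend registered via doParallel
knn_res <-
  tune_grid(
    knn_wflow,
    resamples = folds,
    metrics = metric,
    grid = 4,
    control = ctrl_grid
  )
stopCluster(cls)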
I encountered an error when running the "tree_frogs_model_st" section of the code:
tree_frogs_model_st <-
  tree_frogs_data_st %>%
  blend_predictions()
The error message states:
Error in summary.connection(connection) : invalid connection
I believe this issue may be related to the stacks::control_stack_grid() function, but I am unsure how to resolve it. Please advise.
UPDATE (full reprex)
I excluded the linear model for brevity.
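library(tidymodels)
library(stacks)
library(doParallel)  # assumed; provides registerDoParallel() and attaches parallel
# subset the data
tree_frogs <- tree_frogs %>%
  filter(!is.na(latency)) %>%
  select(-c(clutch, hatched))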
# some setup: resampling and a basic recipe
tree_frogs_split <- initial_split(tree_frogs)
tree_frogs_train <- training(tree_frogs_split)
tree_frogs_test <- testing(tree_frogs_split)
folds <- rsample::vfold_cv(tree_frogs_train, v = 5)
tree_frogs_rec <-
recipe(latency ~ ., data = tree_frogs_train)
metric <- metric_set(rmse)
ctrl_grid <- control_stack_grid()
ctrl_res <- control_stack_resamples()
# create a model definition
knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = tune("k")
  ) %>%
  set_engine("kknn")
#> K-Nearest Neighbor Model Specification (regression)
#> Main Arguments:
#> neighbors = tune("k")
#> Computational engine: kknn
knn_rec <-
  tree_frogs_rec %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())
#> Recipe
#> Inputs:
#> role #variables
#> outcome 1
#> predictor 4
#> Operations:
#> Dummy variables from all_nominal_predictors()
#> Zero variance filter on all_predictors()
#> Mean imputation for all_numeric_predictors()
#> Centering and scaling for all_numeric_predictors()
knn_wflow <-
  workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_rec)
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: nearest_neighbor()
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#> • step_dummy()
#> • step_zv()
#> • step_impute_mean()
#> • step_normalize()
#> ── Model ───────────────────────────────────────────────────────────────────────
#> K-Nearest Neighbor Model Specification (regression)
#> Main Arguments:
#> neighbors = tune("k")
#> Computational engine: kknn
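cls <- makePSOCKcluster(parallelly::availableCores())
registerDoParallel(cls)  # assumed: backend registered via doParallel
knn_res <-
  tune_grid(
    knn_wflow,
    resamples = folds,
    metrics = metric,
    grid = 4,
    control = ctrl_grid
  )
stopCluster(cls)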
#> # Tuning results
#> # 5-fold cross-validation
#> # A tibble: 5 × 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [343/86]> Fold1 <tibble [4 × 5]> <tibble [0 × 3]> <tibble [344 × 5]>
#> 2 <split [343/86]> Fold2 <tibble [4 × 5]> <tibble [0 × 3]> <tibble [344 × 5]>
#> 3 <split [343/86]> Fold3 <tibble [4 × 5]> <tibble [0 × 3]> <tibble [344 × 5]>
#> 4 <split [343/86]> Fold4 <tibble [4 × 5]> <tibble [0 × 3]> <tibble [344 × 5]>
#> 5 <split [344/85]> Fold5 <tibble [4 × 5]> <tibble [0 × 3]> <tibble [340 × 5]>
# create a model definition -----
svm_spec <-
  svm_rbf(
    cost = tune("cost"),
    rbf_sigma = tune("sigma")
  ) %>%
  set_engine("kernlab") %>%
  set_mode("regression")
# extend the recipe
svm_rec <-
  tree_frogs_rec %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_corr(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
# add both to a workflow
svm_wflow <-
  workflow() %>%
  add_model(svm_spec) %>%
  add_recipe(svm_rec)
# tune cost and sigma and fit to the 5-fold cv
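cls <- makePSOCKcluster(parallelly::availableCores())
registerDoParallel(cls)  # assumed: backend registered via doParallel
svm_res <-
  tune_grid(
    svm_wflow,
    resamples = folds,
    grid = 6,
    metrics = metric,
    control = ctrl_grid
  )
stopCluster(cls)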
#> # Tuning results
#> # 5-fold cross-validation
#> # A tibble: 5 × 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [343/86]> Fold1 <tibble [6 × 6]> <tibble [0 × 3]> <tibble [516 × 6]>
#> 2 <split [343/86]> Fold2 <tibble [6 × 6]> <tibble [0 × 3]> <tibble [516 × 6]>
#> 3 <split [343/86]> Fold3 <tibble [6 × 6]> <tibble [0 × 3]> <tibble [516 × 6]>
#> 4 <split [343/86]> Fold4 <tibble [6 × 6]> <tibble [0 × 3]> <tibble [516 × 6]>
#> 5 <split [344/85]> Fold5 <tibble [6 × 6]> <tibble [0 × 3]> <tibble [510 × 6]>
tree_frogs_data_st <-
  stacks() %>%
  add_candidates(knn_res) %>%
  add_candidates(svm_res)
#> # A data stack with 2 model definitions and 10 candidate members:
#> # knn_res: 4 model configurations
#> # svm_res: 6 model configurations
#> # Outcome: latency (numeric)
tree_frogs_model_st <-
  tree_frogs_data_st %>%
  blend_predictions()
#> Error in summary.connection(connection): invalid connection
#> Error in eval(expr, envir, enclos): object 'tree_frogs_model_st' not found
Created on 2023-01-27 by the reprex package (v2.0.1)
I can reproduce the issue.
A parallel backend was registered, and stacks picks up on that.
The problem is that the cluster is stopped before the blending step, which then tries to use it. If you move stopCluster(cls) to the end of the script, it works.
We should be able to handle the case where some parts are run in parallel and others are not; I'll file a bug report for that.
The blending and member training can also be done in parallel, so for the time being, keep the cluster running and call stopCluster(cls) at the end of the script.
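The failure mode can be reproduced without stacks. The following is a minimal sketch, assuming the error comes from foreach still pointing at a cluster that has already been stopped; the two-worker cluster and the variable names are illustrative. The registerDoSEQ() call at the end is an alternative that is not part of the answer above: it switches foreach back to sequential execution instead of keeping the cluster alive.

```r
library(doParallel)  # attaches foreach and parallel

cls <- makePSOCKcluster(2)
registerDoParallel(cls)

# Works: the registered cluster is alive.
while_running <- foreach(i = 1:2, .combine = c) %dopar% sqrt(i)

stopCluster(cls)

# The backend is still registered, but its workers are gone, so the next
# %dopar% call fails (typically with an "invalid connection" error).
after_stop <- try(
  foreach(i = 1:2, .combine = c) %dopar% sqrt(i),
  silent = TRUE
)
inherits(after_stop, "try-error")

# Re-registering the sequential backend makes later foreach calls safe again.
registerDoSEQ()
sequential <- foreach(i = 1:2, .combine = c) %dopar% sqrt(i)
```

In the reprex, the equivalent fix is either to delay stopCluster(cls) until after blend_predictions() and any member fitting, or to reset the backend before those steps.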

R: {modeltime}'s ts_split_indices object appears not to qualify as an rsplit object

In the below reprex I believe I followed the example from exactly. However, I get an error when trying to extract the training data from the split object saying that the ts_split_indices object needs to be an rsplit object. Does anyone know why this might be the case?
Thank you in advance.
data_tbl <- walmart_sales_weekly %>%
select(id, Date, Weekly_Sales) %>%
set_names(c("id", "date", "value"))
#> # A tibble: 1,001 × 3
#> id date value
#> <fct> <date> <dbl>
#> 1 1_1 2010-02-05 24924.
#> 2 1_1 2010-02-12 46039.
#> 3 1_1 2010-02-19 41596.
#> 4 1_1 2010-02-26 19404.
#> 5 1_1 2010-03-05 21828.
#> 6 1_1 2010-03-12 21043.
#> 7 1_1 2010-03-19 22137.
#> 8 1_1 2010-03-26 26229.
#> 9 1_1 2010-04-02 57258.
#> 10 1_1 2010-04-09 42961.
#> # … with 991 more rows
data_tbl %>%
  group_by(id) %>%
  plot_time_series(
    date, value, .interactive = F, .facet_ncol = 2
  )
nested_data_tbl <- data_tbl %>%
  # 1. Extending: We'll predict 52 weeks into the future.
  extend_timeseries(
    .id_var = id,
    .date_var = date,
    .length_future = 52
  ) %>%
  # 2. Nesting: We'll group by id, and create a future dataset
  # that forecasts 52 weeks of extended data and
  # an actual dataset that contains 104 weeks (2-years of data)
  nest_timeseries(
    .id_var = id,
    .length_future = 52,
    .length_actual = 52*2
  ) %>%
  # 3. Splitting: We'll take the actual data and create splits
  # for accuracy and confidence interval estimation of 52 weeks (test)
  # and the rest is training data
  split_nested_timeseries(
    .length_test = 52
  )
#> # A tibble: 7 × 4
#> id .actual_data .future_data .splits
#> <fct> <list> <list> <list>
#> 1 1_1 <tibble [104 × 2]> <tibble [52 × 2]> <split [52|52]>
#> 2 1_3 <tibble [104 × 2]> <tibble [52 × 2]> <split [52|52]>
#> 3 1_8 <tibble [104 × 2]> <tibble [52 × 2]> <split [52|52]>
#> 4 1_13 <tibble [104 × 2]> <tibble [52 × 2]> <split [52|52]>
#> 5 1_38 <tibble [104 × 2]> <tibble [52 × 2]> <split [52|52]>
#> 6 1_93 <tibble [104 × 2]> <tibble [52 × 2]> <split [52|52]>
#> 7 1_95 <tibble [104 × 2]> <tibble [52 × 2]> <split [52|52]>
rec_prophet <- recipe(value ~ date, training(nested_data_tbl$.splits[[1]]))
#> Error in `analysis()`:
#> ! `x` should be an `rsplit` object
#> Backtrace:
#> ▆
#> 1. ├─recipes::recipe(value ~ date, training(nested_data_tbl$.splits[[1]]))
#> 2. ├─recipes:::recipe.formula(value ~ date, training(nested_data_tbl$.splits[[1]]))
#> 3. │ └─recipes:::form2args(formula, data, ...)
#> 4. │ └─tibble::is_tibble(data)
#> 5. └─rsample::training(nested_data_tbl$.splits[[1]])
#> 6. └─rsample::analysis(x)
#> 7. └─rlang::abort("`x` should be an `rsplit` object")
#> [1] "ts_split_indicies"
#> Error in `analysis()`:
#> ! `x` should be an `rsplit` object
#> Backtrace:
#> ▆
#> 1. └─rsample::training(nested_data_tbl$.splits[[1]])
#> 2. └─rsample::analysis(x)
#> 3. └─rlang::abort("`x` should be an `rsplit` object")
Created on 2022-11-29 with reprex v2.0.2
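It turns out the solution to my question is the extract_nested_train_split() function. The splits stored in the .splits column are not rsplit objects, so rsample::training() rejects them. That is, rather than using training(nested_data_tbl$.splits[[1]]), I would just use extract_nested_train_split(nested_data_tbl).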
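For completeness, a minimal sketch of the working version, assuming a modeltime release that exports extract_nested_train_split() and that walmart_sales_weekly comes from timetk. The .row_id argument selects which nested series to extract; train_tbl is an illustrative name:

```r
library(tidymodels)
library(modeltime)
library(timetk)

# Rebuild the nested table from the question.
nested_data_tbl <- walmart_sales_weekly %>%
  select(id, date = Date, value = Weekly_Sales) %>%
  extend_timeseries(.id_var = id, .date_var = date, .length_future = 52) %>%
  nest_timeseries(.id_var = id, .length_future = 52, .length_actual = 52 * 2) %>%
  split_nested_timeseries(.length_test = 52)

# Extract the training portion of the first series as a plain tibble.
train_tbl <- extract_nested_train_split(nested_data_tbl, .row_id = 1)

# The recipe can now be built on ordinary data, with no rsplit involved.
rec_prophet <- recipe(value ~ date, data = train_tbl)
```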

tidymodel recipe and `step_lag()`: Error when using `predict()`

This may be a usage misunderstanding, but I expect the following toy example to work. I want to have a lagged predictor in my recipe, but once I include it in the recipe, and try to predict on the same data using a workflow with the recipe, it doesn't recognize the column foo and cannot compute its lag.
Now, I can get this to work if I:
1. Pull the fit out of the workflow that has been fit.
2. Independently prep and bake the data I want to predict on.
I code this workaround after the failed workflow predict, and it succeeds. According to the documentation, I should be able to put a workflow fit in the predict slot:
I am probably fundamentally misunderstanding how workflow is supposed to operate. I have what I consider a workaround, but I do not understand why the failed statement isn't working in the way the workaround is. I expected the failed workflow construct to work under the covers like the workaround I have.
In short, if work_df is a dataframe, the_rec is a recipe based off work_df, rf_mod is a model, and you create the workflow rf_workflow, then should I expect the predict() function to work identically in the two predict() calls below?
## Workflow
rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(the_rec)
## fit
rf_workflow_fit <-
rf_workflow %>%
fit(data = work_df)
## Predict with workflow. I expect since a workflow has a fit model and
## a recipe as part of it, it should know how to do the following:
predict(rf_workflow_fit, work_df)
#> Error: Problem with `mutate()` input `lag_1_foo`.
#> x object 'foo' not found
#> i Input `lag_1_foo` is `dplyr::lag(x = foo, n = 1L, default = NA)`.
## Predict by explicitly prepping and baking the data, and pulling out the
## fit from the workflow:
predict(
  rf_workflow_fit %>%
    pull_workflow_fit(),
  bake(prep(the_rec), work_df))
#> # A tibble: 995 x 1
#> .pred
#> <dbl>
#> 1 2.24
#> 2 0.595
#> 3 0.262
Full reprex example below.
### Create autocorrelated timeseries:
work_df <-
  tibble(
    foo = stats::filter(rnorm(1000), filter=rep(1,5), circular=TRUE) %>%
      as.numeric()  # assumed: the original conversion step was lost
  )
# plot(work_df$foo)
#> # A tibble: 1,000 x 1
#> foo
#> <dbl>
#> 1 -0.00375
#> 2 0.589
#> 3 0.968
#> 4 3.24
#> 5 3.93
#> 6 1.11
#> 7 0.353
#> 8 -0.222
#> 9 -0.713
#> 10 -0.814
#> # ... with 990 more rows
## Recipe
the_rec <-
  recipe(foo ~ ., data = work_df) %>%
  step_lag(foo, lag=1:5) %>%
  step_naomit(all_predictors())  # assumed: drops the 5 rows with NA lags
the_rec %>% prep() %>% juice()
#> # A tibble: 995 x 6
#> foo lag_1_foo lag_2_foo lag_3_foo lag_4_foo lag_5_foo
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.11 3.93 3.24 0.968 0.589 -0.00375
#> 2 0.353 1.11 3.93 3.24 0.968 0.589
#> 3 -0.222 0.353 1.11 3.93 3.24 0.968
#> 4 -0.713 -0.222 0.353 1.11 3.93 3.24
#> 5 -0.814 -0.713 -0.222 0.353 1.11 3.93
#> 6 0.852 -0.814 -0.713 -0.222 0.353 1.11
#> 7 1.65 0.852 -0.814 -0.713 -0.222 0.353
#> 8 1.54 1.65 0.852 -0.814 -0.713 -0.222
#> 9 2.10 1.54 1.65 0.852 -0.814 -0.713
#> 10 2.24 2.10 1.54 1.65 0.852 -0.814
#> # ... with 985 more rows
## Model
rf_mod <-
  rand_forest(
    mtry = 4,
    trees = 1000,
    min_n = 13) %>%
  set_mode("regression") %>%
  set_engine("ranger")  # assumed engine
## Workflow
rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(the_rec)
## fit
rf_workflow_fit <-
rf_workflow %>%
fit(data = work_df)
## Predict
predict(rf_workflow_fit, work_df)
#> Error: Problem with `mutate()` input `lag_1_foo`.
#> x object 'foo' not found
#> i Input `lag_1_foo` is `dplyr::lag(x = foo, n = 1L, default = NA)`.
## Perhaps I just need to pull off the fit and work with that?... Nope.
predict(rf_workflow_fit %>% pull_workflow_fit(), work_df)
#> Error: Can't subset columns that don't exist.
#> x Columns `lag_1_foo`, `lag_2_foo`, `lag_3_foo`, `lag_4_foo`, and `lag_5_foo` don't exist.
## Maybe I need to bake it first... and that works.
## But doesn't that defeat the purpose of a workflow?
predict(
  rf_workflow_fit %>%
    pull_workflow_fit(),
  bake(prep(the_rec), work_df))
#> # A tibble: 995 x 1
#> .pred
#> <dbl>
#> 1 2.24
#> 2 0.595
#> 3 0.262
#> 4 -0.977
#> 5 -1.24
#> 6 -0.140
#> 7 1.36
#> 8 1.30
#> 9 1.78
#> 10 2.42
#> # ... with 985 more rows
Created on 2020-10-13 by the reprex package (v0.3.0)
The reason you are experiencing an error is that you have created a predictor variable from the outcome. When it comes time to predict on new data, the outcome is not available; we are predicting the outcome for new data, not assuming that it is there already.
This is a fairly strong assumption of the tidymodels framework, for either modeling or preprocessing, to protect against information leakage. You can read about this a bit more here.
You may already know about these, but if you are working with time series models, I'd suggest checking out the following resources:
Resampling for time series
Using timetk for time series preprocessing
Using modeltime for time series modeling
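One way to stay within that assumption, sketched here rather than taken from the answer, is to build the lagged columns before the recipe so that they are ordinary predictors by the time the workflow sees them. The sketch uses two lags and linear_reg() instead of the random forest to keep dependencies small; lagged_df, wf and wf_fit are illustrative names:

```r
library(tidymodels)

set.seed(123)
work_df <- tibble(
  foo = as.numeric(stats::filter(rnorm(200), filter = rep(1, 5), circular = TRUE))
)

# Create the lags up front; the recipe then never derives anything from the outcome.
lagged_df <- work_df %>%
  mutate(
    lag_1_foo = dplyr::lag(foo, 1),
    lag_2_foo = dplyr::lag(foo, 2)
  ) %>%
  drop_na()

wf <- workflow() %>%
  add_recipe(recipe(foo ~ ., data = lagged_df)) %>%
  add_model(linear_reg())

wf_fit <- fit(wf, data = lagged_df)

# predict() on the workflow now works, because only predictor columns are needed.
preds <- predict(wf_fit, lagged_df)
```

For genuinely new data, the lag columns have to be computed from known past values before calling predict(), which is the step the time series resources above cover in more depth.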

Tidymodels tune_grid: "Can't subset columns that don't exist" when not using formula

I've put together a data preprocessing recipe for the recent coffee dataset featured on TidyTuesday. My intention is to generate a workflow, and then from there tune a hyperparameter. I'm specifically interested in manually declaring predictors and outcomes through the various update_role() functions, rather than using a formula, since I have some great plans for this style of variable selection (it's a really great idea!).
The example below produces a recipe that works just fine with prep and bake(coffee_test). It even works if I deselect the outcome column, eg. coffee_recipe %>% bake(select(coffee_test, -cupper_points)). However, when I run the workflow through tune_grid I get the errors as shown. It looks like tune_grid can't find the variables that don't have the "predictor" role, even though bake does just fine.
Now, if I instead do things the normal way with a formula and step_rm the variables I don't care about, then things mostly work --- I get a few warnings for rows with missing country_of_origin values, which I find strange since I should be imputing those. It's entirely possible I've misunderstood the purpose of roles and how to use them.
coffee <- tidytuesdayR::tt_load(2020, week = 28)$coffee_ratings
#> --- Download complete ---
#> [1] "total_cup_points" "species" "owner"
#> [4] "country_of_origin" "farm_name" "lot_number"
#> [7] "mill" "ico_number" "company"
#> [10] "altitude" "region" "producer"
#> [13] "number_of_bags" "bag_weight" "in_country_partner"
#> [16] "harvest_year" "grading_date" "owner_1"
#> [19] "variety" "processing_method" "aroma"
#> [22] "flavor" "aftertaste" "acidity"
#> [25] "body" "balance" "uniformity"
#> [28] "clean_cup" "sweetness" "cupper_points"
#> [31] "moisture" "category_one_defects" "quakers"
#> [34] "color" "category_two_defects" "expiration"
#> [37] "certification_body" "certification_address" "certification_contact"
#> [40] "unit_of_measurement" "altitude_low_meters" "altitude_high_meters"
#> [43] "altitude_mean_meters"
coffee_split <- initial_split(coffee, prop = 0.8)
coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)
coffee_recipe <- recipe(coffee_train) %>%
  update_role(cupper_points, new_role = "outcome") %>%
  update_role(
    variety, processing_method, country_of_origin,
    aroma, flavor, aftertaste, acidity, sweetness, altitude_mean_meters,
    new_role = "predictor"
  ) %>%
  step_string2factor(all_nominal(), -all_outcomes()) %>%
  step_knnimpute(
    country_of_origin, altitude_mean_meters,
    impute_with = imp_vars(
      in_country_partner, company, region, farm_name, certification_body
    )
  ) %>%
  step_unknown(variety, processing_method, new_level = "Unknown") %>%
  step_other(country_of_origin, threshold = 0.01) %>%
  step_other(processing_method, threshold = 0.10) %>%
  step_other(variety, threshold = 0.10)
#> Data Recipe
#> Inputs:
#> role #variables
#> outcome 1
#> predictor 9
#> 33 variables with undeclared roles
#> Operations:
#> Factor variables from all_nominal(), -all_outcomes()
#> K-nearest neighbor imputation for country_of_origin, altitude_mean_meters
#> Unknown factor level assignment for variety, processing_method
#> Collapsing factor levels for country_of_origin
#> Collapsing factor levels for processing_method
#> Collapsing factor levels for variety
# This works just fine
coffee_recipe %>%
  prep(coffee_train) %>%
  bake(select(coffee_test, -cupper_points)) %>%
  head()
#> # A tibble: 6 x 42
#> total_cup_points species owner country_of_orig… farm_name lot_number mill
#> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 90.6 Arabica meta… Ethiopia metad plc <NA> meta…
#> 2 87.9 Arabica cqi … other <NA> <NA> <NA>
#> 3 87.9 Arabica grou… United States (… <NA> <NA> <NA>
#> 4 87.3 Arabica ethi… Ethiopia <NA> <NA> <NA>
#> 5 87.2 Arabica cqi … other <NA> <NA> <NA>
#> 6 86.9 Arabica ethi… Ethiopia <NA> <NA> <NA>
#> # … with 35 more variables: ico_number <fct>, company <fct>, altitude <fct>,
#> # region <fct>, producer <fct>, number_of_bags <dbl>, bag_weight <fct>,
#> # in_country_partner <fct>, harvest_year <fct>, grading_date <fct>,
#> # owner_1 <fct>, variety <fct>, processing_method <fct>, aroma <dbl>,
#> # flavor <dbl>, aftertaste <dbl>, acidity <dbl>, body <dbl>, balance <dbl>,
#> # uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>, moisture <dbl>,
#> # category_one_defects <dbl>, quakers <dbl>, color <fct>,
#> # category_two_defects <dbl>, expiration <fct>, certification_body <fct>,
#> # certification_address <fct>, certification_contact <fct>,
#> # unit_of_measurement <fct>, altitude_low_meters <dbl>,
#> # altitude_high_meters <dbl>, altitude_mean_meters <dbl>
# Now let's try putting it into a workflow and running tune_grid
coffee_model <- rand_forest(trees = 500, mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")
#> Random Forest Model Specification (regression)
#> Main Arguments:
#> mtry = tune()
#> trees = 500
#> Computational engine: ranger
coffee_workflow <- workflow() %>%
  add_recipe(coffee_recipe) %>%
  add_model(coffee_model)
#> ══ Workflow ═══════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#> ── Preprocessor ───────────────────────────────────────────────────────────────────────────────
#> 6 Recipe Steps
#> ● step_string2factor()
#> ● step_knnimpute()
#> ● step_unknown()
#> ● step_other()
#> ● step_other()
#> ● step_other()
#> ── Model ──────────────────────────────────────────────────────────────────────────────────────
#> Random Forest Model Specification (regression)
#> Main Arguments:
#> mtry = tune()
#> trees = 500
#> Computational engine: ranger
coffee_grid <- expand_grid(mtry = c(2, 5))
coffee_folds <- vfold_cv(coffee_train, v = 5)
coffee_workflow %>%
  tune_grid(
    resamples = coffee_folds,
    grid = coffee_grid
  )
#> x Fold1: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold1: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold2: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold2: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold3: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold3: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold4: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold4: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold5: model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> x Fold5: model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x...
#> Warning: All models failed in tune_grid(). See the `.notes` column.
#> Warning: This tuning result has notes. Example notes on model fitting include:
#> model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x Columns `species`, `owner`, `farm_name`, `lot_number`, `mill`, etc. don't exist.
#> model 1/2 (predictions): Error: Can't subset columns that don't exist.
#> x Columns `species`, `owner`, `farm_name`, `lot_number`, `mill`, etc. don't exist.
#> model 2/2 (predictions): Error: Can't subset columns that don't exist.
#> x Columns `species`, `owner`, `farm_name`, `lot_number`, `mill`, etc. don't exist.
#> # Tuning results
#> # 5-fold cross-validation
#> # A tibble: 5 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [857/215]> Fold1 <NULL> <tibble [2 × 1]>
#> 2 <split [857/215]> Fold2 <NULL> <tibble [2 × 1]>
#> 3 <split [858/214]> Fold3 <NULL> <tibble [2 × 1]>
#> 4 <split [858/214]> Fold4 <NULL> <tibble [2 × 1]>
#> 5 <split [858/214]> Fold5 <NULL> <tibble [2 × 1]>
Created on 2020-07-21 by the reprex package (v0.3.0)
Try setting the role for all of your nominal variables before picking out the outcomes and predictors.
coffee_recipe <- recipe(coffee_train) %>%
  update_role(all_nominal(), new_role = "id") %>% ## ADD THIS
  update_role(cupper_points, new_role = "outcome") %>%
  update_role(
    variety, processing_method, country_of_origin,
    aroma, flavor, aftertaste, acidity, sweetness, altitude_mean_meters,
    new_role = "predictor"
  ) %>%
  step_string2factor(all_nominal(), -all_outcomes()) %>%
  step_knnimpute(
    country_of_origin, altitude_mean_meters,
    impute_with = imp_vars(
      in_country_partner, company, region, farm_name, certification_body
    )
  ) %>%
  step_unknown(variety, processing_method, new_level = "Unknown") %>%
  step_other(country_of_origin, threshold = 0.01) %>%
  step_other(processing_method, threshold = 0.10) %>%
  step_other(variety, threshold = 0.10)
After I do this, this mostly runs fine, with only some failures to impute altitude. It might be tough to impute both of those things at the same time.
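To check which columns still have undeclared roles before tuning, summary() on the recipe lists every variable with its role. A minimal sketch on a made-up three-column tibble (df, y, x and note are illustrative names, not the coffee data):

```r
library(recipes)

# Hypothetical data: one outcome, one numeric predictor, one text column.
df <- tibble::tibble(
  y    = c(1, 2, 3),
  x    = c(0.5, 1.5, 2.5),
  note = c("a", "b", "c")
)

rec <- recipe(df) %>%
  update_role(all_nominal(), new_role = "id") %>%  # give text columns a role first
  update_role(y, new_role = "outcome") %>%
  update_role(x, new_role = "predictor")

# One row per variable; a role of NA would indicate an undeclared variable.
roles <- summary(rec)
roles
```

Any variable that still shows NA in the role column here corresponds to the "variables with undeclared roles" line in the printed recipe.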

R, iml, mlr. Feature Importance always returns 1 for every feature

I'm doing something with the mlr framework that causes FeatureImp to return 1 for every feature and I can't put my finger on it. Here's an example:
library(caret)  # createDataPartition()
library(mlr)
library(iml)
iris = iris[iris$Species != 'setosa',]
iris$Species = ifelse(iris$Species == 'virginica', 1, 0)
iris$Species = as.factor(iris$Species)
ind=createDataPartition(iris$Species, times=1, p=0.8, list=FALSE)
train = iris[ind, ]
test = iris[-ind, ]
train.task=makeClassifTask(data=train, target = 'Species', positive = 1)
test.task=makeClassifTask(data=test, target = 'Species', positive = 1)
learner = list(
  xgboost = makeLearner("classif.xgboost",predict.type = "prob"),
  ksvm = makeLearner("classif.ksvm",predict.type = "prob"),
  nnet = makeLearner("classif.nnet",predict.type = "prob"),
  randomForest = makeLearner("classif.randomForest",predict.type = "prob")
)
model = lapply(learner, function(x) train(x, train.task))
#> # weights: 19
#> initial value 57.506055
#> iter 10 value 52.109027
#> iter 20 value 7.798098
#> iter 30 value 5.401193
#> iter 40 value 4.707935
#> iter 50 value 4.702049
#> final value 4.701710
#> converged
prediction = lapply(model, function(x) predict(x, test.task))
ensemble = makeStackedLearner(learner, super.learner = 'classif.randomForest', predict.type = 'prob',
method = "", use.feat = FALSE)
model$ensemble = train(ensemble, train.task)
#> # weights: 19
#> initial value 43.712841
#> iter 10 value 5.444287
#> iter 20 value 4.536990
#> iter 30 value 4.527489
#> iter 40 value 4.481401
#> iter 50 value 4.481221
#> iter 50 value 4.481221
#> iter 50 value 4.481221
#> final value 4.481221
#> converged
#> # weights: 19
#> initial value 52.864011
#> iter 10 value 33.347827
#> iter 20 value 2.926847
#> iter 30 value 0.011104
#> final value 0.000055
#> converged
#> # weights: 19
#> initial value 44.627604
#> iter 10 value 31.360597
#> iter 20 value 5.798769
#> iter 30 value 4.290623
#> iter 40 value 3.751202
#> iter 50 value 3.547856
#> iter 60 value 3.469366
#> iter 70 value 3.373487
#> iter 80 value 3.317680
#> iter 90 value 3.310354
#> iter 100 value 3.301115
#> final value 3.301115
#> stopped after 100 iterations
#> # weights: 19
#> initial value 46.410266
#> iter 10 value 29.975896
#> iter 20 value 1.266423
#> iter 30 value 0.004667
#> final value 0.000052
#> converged
#> # weights: 19
#> initial value 52.665930
#> final value 44.361399
#> converged
#> # weights: 19
#> initial value 60.471973
#> iter 10 value 50.475349
#> iter 20 value 7.580138
#> iter 30 value 4.828646
#> iter 40 value 4.543112
#> iter 50 value 2.995374
#> iter 60 value 2.636710
#> iter 70 value 2.539857
#> iter 80 value 2.497281
#> iter 90 value 2.427158
#> iter 100 value 2.370383
#> final value 2.370383
#> stopped after 100 iterations
prediction$ensemble = predict(model$ensemble, test.task)
predictor = Predictor$new(model$ensemble,
data = train.task$env$data[which(names(train.task$env$data) != "Species")],
y = as.numeric(train.task$env$data$Species)-1)
imp = FeatureImp$new(predictor, loss = "ce")
#> feature importance.05 importance importance.95 permutation.error
#> 1 Sepal.Length 1 1 1 1
#> 2 Sepal.Width 1 1 1 1
#> 3 Petal.Length 1 1 1 1
#> 4 Petal.Width 1 1 1 1
Created on 2020-01-23 by the reprex package (v0.3.0)
This seems to be fixed in the development version of {iml}.
I could reproduce your issue with the current CRAN version, but with the dev version the importances differ across features:
#> Loading required package: lattice
#> Loading required package: ggplot2
#> Loading required package: ParamHelpers
#> 'mlr' is in maintenance mode since July 2019. Future development
#> efforts will go into its successor 'mlr3' (<>).
#> Attaching package: 'mlr'
#> The following object is masked from 'package:caret':
#> train
iris = iris[iris$Species != "setosa", ]
iris$Species = ifelse(iris$Species == "virginica", 1, 0)
iris$Species = as.factor(iris$Species)
ind = createDataPartition(iris$Species, times = 1, p = 0.8, list = FALSE)
train = iris[ind, ]
test = iris[-ind, ]
train.task = makeClassifTask(data = train, target = "Species", positive = 1)
test.task = makeClassifTask(data = test, target = "Species", positive = 1)
learner = list(
  xgboost = makeLearner("classif.xgboost", predict.type = "prob"),
  ksvm = makeLearner("classif.ksvm", predict.type = "prob"),
  nnet = makeLearner("classif.nnet", predict.type = "prob"),
  randomForest = makeLearner("classif.randomForest", predict.type = "prob")
)
model = lapply(learner, function(x) train(x, train.task))
#> # weights: 19
#> initial value 59.040647
#> iter 10 value 54.908003
#> iter 20 value 8.784817
#> iter 30 value 2.906017
#> iter 40 value 0.187334
#> iter 50 value 0.000610
#> final value 0.000059
#> converged
prediction = lapply(model, function(x) predict(x, test.task))
ensemble = makeStackedLearner(learner,
super.learner = "classif.randomForest", predict.type = "prob",
method = "", use.feat = FALSE)
model$ensemble = train(ensemble, train.task)
#> # weights: 19
#> initial value 44.537254
#> iter 10 value 6.716784
#> iter 20 value 4.750452
#> iter 30 value 4.487501
#> iter 40 value 4.481250
#> final value 4.481222
#> converged
#> # weights: 19
#> initial value 54.135701
#> iter 10 value 13.081961
#> iter 20 value 1.676063
#> iter 30 value 0.002261
#> final value 0.000044
#> converged
#> # weights: 19
#> initial value 42.621635
#> iter 10 value 5.201573
#> iter 20 value 2.878946
#> iter 30 value 1.133911
#> iter 40 value 0.002784
#> iter 50 value 0.000726
#> final value 0.000037
#> converged
#> # weights: 19
#> initial value 43.795663
#> iter 10 value 4.478310
#> iter 20 value 1.811306
#> iter 30 value 0.027775
#> iter 40 value 0.004873
#> iter 50 value 0.001480
#> iter 60 value 0.000230
#> iter 70 value 0.000221
#> final value 0.000089
#> converged
#> # weights: 19
#> initial value 44.433321
#> iter 10 value 7.252874
#> iter 20 value 1.200457
#> iter 30 value 0.001668
#> final value 0.000063
#> converged
#> # weights: 19
#> initial value 67.012204
#> final value 55.451774
#> converged
prediction$ensemble = predict(model$ensemble, test.task)
predictor = Predictor$new(model$ensemble,
data = train.task$env$data[which(names(train.task$env$data) != "Species")],
y = as.numeric(train.task$env$data$Species) - 1)
imp = FeatureImp$new(predictor, loss = "ce")
#> feature importance.05 importance importance.95 permutation.error
#> 1 Petal.Width 11.1 12.0 14.2 0.3000
#> 2 Petal.Length 10.3 11.5 13.1 0.2875
#> 3 Sepal.Length 3.3 4.5 6.3 0.1125
#> 4 Sepal.Width 2.1 3.5 4.0 0.0875
Created on 2020-01-23 by the reprex package (v0.3.0)
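As a sanity check on what the numbers mean: assuming the importance is reported as the ratio of the error after permuting a feature to the original error, a value of exactly 1 for every feature says that shuffling never changed the error at all. The following hand-rolled sketch illustrates that ratio in base R with a plain logistic regression in place of the stacked learner; it is not iml's implementation, and fit, ce and ratio are illustrative names:

```r
set.seed(1)

# Same two-class subset of iris as in the question, with a 0/1 outcome.
dat <- iris[iris$Species != "setosa", ]
dat$y <- as.integer(dat$Species == "virginica")
dat$Species <- NULL

fit <- glm(y ~ ., data = dat, family = binomial())

# Classification error of the fitted model on a data frame.
ce <- function(d) {
  mean((predict(fit, newdata = d, type = "response") > 0.5) != d$y)
}

base_err <- ce(dat)
features <- setdiff(names(dat), "y")

# Shuffle one feature at a time and compare the error to the baseline.
ratio <- sapply(features, function(f) {
  shuffled <- dat
  shuffled[[f]] <- sample(shuffled[[f]])
  ce(shuffled) / base_err
})

ratio
```

A feature the model relies on should give a ratio well above 1, which matches the pattern in the dev-version output above, where the petal measurements rank highest.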
