Combine nesting and rolling_origin from Tidymodels in R - r

I am trying to train a random forest using rolling_origin from the Tidymodels suite. I would like the folds to be exactly the months of the year. Nesting looks like it could do the trick, but tune_grid is not able to find the variables when the data is nested. How can I make this work? I put a reproducible example below.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(yardstick))
# Create dummy data ====================================================================================================
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = 'day' )
l <- length(dates)
set.seed(1)
data_set <- data.frame(
date = dates,
v1 = rnorm(l),
v2 = rnorm(l),
v3 = rnorm(l),
y = rnorm(l)
)
# Random Forest Model =================================================================================================
model <-
parsnip::rand_forest(
mode = "regression",
trees = tune()) %>%
set_engine("ranger")
# grid specification
params <-
dials::parameters(
trees()
)
# Set up grid and model workflow =======================================================================================
grid <-
dials::grid_max_entropy(
params,
size = 2
)
form <- as.formula(paste("y ~ v1 + v2 + v3"))
model_workflow <-
workflows::workflow() %>%
add_model(model) %>%
add_formula(form)
# Tuning on the normal data set works ====================================================================================================
data_ro_day <- data_set %>%
rolling_origin(
initial = 304,
assess = 30,
cumulative = TRUE,
skip = 30
)
results <- tune_grid(
model_workflow,
grid = grid,
resamples = data_ro_day,
param_info = params,
metrics = metric_set(mae, mape, rmse, rsq),
control = control_grid(verbose = TRUE))
results %>% show_best("mape", n = 2)
# Tuning on the nested data set doesn't work =========================================================================================
data_ro_month <- data_set %>%
mutate(year_month = format(date, "%Y-%m")) %>%
nest(-year_month) %>%
rolling_origin(
initial = 10,
assess = 1,
cumulative = TRUE
)
results <- tune_grid(
model_workflow,
grid = grid,
resamples = data_ro_month,
param_info = params,
metrics = metric_set(mae, mape, rmse, rsq),
control = control_grid(verbose = TRUE))
results$.notes ```

I'm not entirely clear on how you want to divide up your data for tuning, but I would recommend looking into some of the other rsample functions like sliding_window() and especially sliding_period(). They let you create experimental designs for tuning where you can fit on certain months of data and then asses on another month, sliding along all the months you have available:
library(tidymodels)
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = 'day' )
l <- length(dates)
set.seed(1)
data_set <- tibble(
date = dates,
v1 = rnorm(l),
v2 = rnorm(l),
v3 = rnorm(l),
y = rnorm(l)
)
month_folds <- data_set %>%
sliding_period(
date,
"month",
lookback = Inf,
skip = 4
)
month_folds
#> # Sliding period resampling
#> # A tibble: 7 x 2
#> splits id
#> <list> <chr>
#> 1 <split [151/30]> Slice1
#> 2 <split [181/31]> Slice2
#> 3 <split [212/31]> Slice3
#> 4 <split [243/30]> Slice4
#> 5 <split [273/31]> Slice5
#> 6 <split [304/30]> Slice6
#> 7 <split [334/31]> Slice7
I used skip = 4 here to only keep slices where you will have more data for training. Each of these slices will training on several months of data and assess on a new, last month. The resamples slide forward through your dataset. Since I used lookback = Inf it always includes all past data, but you can change that.
When you have your resampling approach set up however is appropriate for your domain problem, you can then make a model specification and tune it:
rf_spec <-
rand_forest(
mode = "regression",
trees = tune()) %>%
set_engine("ranger")
rf_wf <-
workflow() %>%
add_model(rf_spec) %>%
add_formula(y ~ v1 + v2 + v3)
tune_grid(rf_wf, resamples = month_folds)
#> # Tuning results
#> # Sliding period resampling
#> # A tibble: 7 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [151/30]> Slice1 <tibble [20 × 5]> <tibble [0 × 1]>
#> 2 <split [181/31]> Slice2 <tibble [20 × 5]> <tibble [0 × 1]>
#> 3 <split [212/31]> Slice3 <tibble [20 × 5]> <tibble [0 × 1]>
#> 4 <split [243/30]> Slice4 <tibble [20 × 5]> <tibble [0 × 1]>
#> 5 <split [273/31]> Slice5 <tibble [20 × 5]> <tibble [0 × 1]>
#> 6 <split [304/30]> Slice6 <tibble [20 × 5]> <tibble [0 × 1]>
#> 7 <split [334/31]> Slice7 <tibble [20 × 5]> <tibble [0 × 1]>
Created on 2020-11-15 by the reprex package (v0.3.0.9001)

Related

Creating multiple training subsets using sample() in R

I have a training dataset that consists of 60,000 observations that I want to create 9 subset training sets from. I want to sample randomly without replacement; I need 3 datasets of 500 observations, 3 datasets of 1,000 observations, and 3 datasets of 2,000 observations.
How can I do this using sample() in R?
Given your data.frame is named df you do:
sample_sizes <- c(rep(500,3), rep(1000,3), rep(2000,3))
sampling <- sample(60000, sum(sample_sizes))
training_sets <- split(df[sampling,], rep(1:9, sample_sizes))
This do sampling without replacement over all dataset.
If you want sampling without replacement in each training set (but not through all training sets):
sample_sizes <- c(rep(500,3), rep(1000,3), rep(2000,3))
sampling <- do.call(c, lapply(sample_sizes, function(i) sample(60000, i)))
training_sets <- split(df[sampling,], rep(1:9, sample_sizes))
I'm not positive if you want the output to look like the screenshot, but if so, here you go:
library(tidyverse)
df <- tibble(rand = runif(6e4))
tibble(`Sample Size` = rep(c(500,1000,2000), each = 3)) |>
mutate(name = rep(paste(c("First", "Second", "Third"), "Random Sample"), 3),
samp = map2(`Sample Size`, row_number(),
\(x,y) {set.seed(y); df[sample(1:nrow(df), size = x),]})) |>
pivot_wider(names_from = name, values_from = samp)
#> # A tibble: 3 x 4
#> `Sample Size` `First Random Sample` `Second Random Sample` Third Random Samp~1
#> <dbl> <list> <list> <list>
#> 1 500 <tibble [500 x 1]> <tibble [500 x 1]> <tibble [500 x 1]>
#> 2 1000 <tibble [1,000 x 1]> <tibble [1,000 x 1]> <tibble>
#> 3 2000 <tibble [2,000 x 1]> <tibble [2,000 x 1]> <tibble>
#> # ... with abbreviated variable name 1: `Third Random Sample`

LASSO regression - Force variables in glmnet with tidymodels

I am doing feature selection using LASSO regression with tidymodels and glmnet.
It is possible to force variables in glmnet by using the penalty.factors argument (see here and here, for example).
Is it possible to do the same using tidymodels ?
library(tidymodels)
library(vip)
library(forcats)
library(dplyr)
library(ggplot2)
library(data.table)
# Define data split
datasplit = rsample::initial_split(mtcars, prop=0.8)
data_training = rsample::training(datasplit)
data_testing = rsample::testing(datasplit)
# Model specifications - should penalty.factors go here?
model_spec = parsnip::linear_reg(penalty = tune::tune(),
mixture = 1) %>%
parsnip::set_engine("glmnet")
# Model recipe
rec = recipe(mpg ~ ., mtcars)
# Model workflow
wf = workflows::workflow() %>%
workflows::add_recipe(rec) %>%
workflows::add_model(model_spec)
# Resampling
data_resample = rsample::vfold_cv(data_training,
repeats = 3,
v = 2)
hyperparam_grid = dials::grid_regular(dials::penalty(),
levels = 100)
# Define metrics
metrics = yardstick::metric_set(yardstick::rsq,
yardstick::mape,
yardstick::mpe)
# Tune the model
tune_grid_results = tune::tune_grid(
wf,
resamples = data_resample,
grid = hyperparam_grid,
metrics = metrics
)
# Collect and finalise best model
selected_model = tune_grid_results %>%
tune::select_best("mape")
final_model = tune::finalize_workflow(wf, selected_model)
final_model_fit = final_model %>%
parsnip::fit(data_training) %>%
workflows::extract_fit_parsnip()
# Plot variables importance
t_importance = final_model_fit %>%
vip::vi(lambda = selected_model$penalty) %>%
dplyr::mutate(
Importance = Importance,
Variable = forcats::fct_reorder(Variable, Importance)
) %>%
data.table() %>%
setorder( - Importance)
t_importance %>%
ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
geom_col() +
scale_x_continuous(expand = c(0, 0)) +
labs(y = NULL) +
theme_minimal()
Created on 2022-03-14 by the reprex package (v2.0.1)
As mentioned in the comment above, you can pass engine-specific arguments like penalty.factor in set_engine():
library(tidyverse)
library(tidymodels)
library(vip)
#>
#> Attaching package: 'vip'
#> The following object is masked from 'package:utils':
#>
#> vi
datasplit <- initial_split(mtcars, prop = 0.8)
car_train <- training(datasplit)
car_test <- testing(datasplit)
car_folds <- vfold_cv(car_train, repeats = 3, v = 2)
You can pass penalty.factor here to the model specification as an engine-specific argument:
glmnet_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet", penalty.factor = c(0, rep(1, 7), 0, 0))
car_wf <- workflow(mpg ~ ., glmnet_spec)
glmnet_res <- tune_grid(car_wf, resamples = car_folds, grid = 5)
glmnet_res
#> # Tuning results
#> # 2-fold cross-validation repeated 3 times
#> # A tibble: 6 × 5
#> splits id id2 .metrics .notes
#> <list> <chr> <chr> <list> <list>
#> 1 <split [12/13]> Repeat1 Fold1 <tibble [10 × 5]> <tibble [0 × 3]>
#> 2 <split [13/12]> Repeat1 Fold2 <tibble [10 × 5]> <tibble [0 × 3]>
#> 3 <split [12/13]> Repeat2 Fold1 <tibble [10 × 5]> <tibble [0 × 3]>
#> 4 <split [13/12]> Repeat2 Fold2 <tibble [10 × 5]> <tibble [0 × 3]>
#> 5 <split [12/13]> Repeat3 Fold1 <tibble [10 × 5]> <tibble [0 × 3]>
#> 6 <split [13/12]> Repeat3 Fold2 <tibble [10 × 5]> <tibble [0 × 3]>
best_penalty <- select_best(glmnet_res, "rmse")
final_fit <- car_wf %>%
finalize_workflow(best_penalty) %>%
fit(data = car_train) %>%
extract_fit_parsnip()
final_fit %>%
vi(lambda = best_penalty$penalty) %>%
mutate(Variable = fct_reorder(Variable, Importance)) %>%
ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
geom_col() +
scale_x_continuous(expand = c(0, 0)) +
labs(y = NULL) +
theme_minimal()
Created on 2022-03-14 by the reprex package (v2.0.1)
This does require that you know the number of predictors when you create the model specification, which can become challenging for a complex recipe including many feature engineering steps.

Tune recipe in workflow set with custom range (or value)

I'm trying to use workflow_set() function in tidymodels to evaluate a batch of models.
I've understand that is possible to modify some model specification in order to change the search range so, for example, given this specification:
spec_lin <- linear_reg( penalty = tune(),
mixture = tune() ) %>%
set_engine('glmnet')
I can modify the range using:
rec_base <- recipe( price ~ feat_1) %>%
step_novel(feat_1) %>%
step_other(feat_1,threshold=.2 ) %>%
step_dummy(feat_1)
rec_adv_param <- rec_base %>%
parameters() %>%
update ( mixture = mixture(c(0.1,0.01)) )
My attempt is to do the same but with the parameters in the recipe. For example:
rec_tuned <- recipe( price ~ feat_1) %>%
step_novel(feat_1) %>%
step_other(feat_1,threshold=tune() ) %>%
step_dummy(feat_1)
followed by
rec_adv_param <- rec_tuned %>%
parameters() %>%
update ( threshold = threshold(c(0.1,0.2)) )
However when I try to use it in the workflow_set() definition if I use something like
wf_set <- workflow_set(recipes, models, cross = TRUE )
option_add(param_info = rec_adv_param, id = "rec_tuned_spec_lin")
The finale "wf_set" lost his original tuning parameters the has been changed with the
threshold = threshold(c(0.1,0.2)
Is there a way to add the parameters specification for the recipe in all workflow_set models?
Thanks
You can add the parameters for a recipe via option_add(), either for a single workflow by id for all workflows if you leave id = NULL. When you go to tune or fit on resampled data, these options will be used.
For example, if we want to try 0 to 20 PCA components (instead of the default):
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(Chicago)
data("chi_features_set")
time_val_split <-
sliding_period(
Chicago,
date,
"month",
lookback = 38,
assess_stop = 1
)
## notice that there are no options; defaults will be used
chi_features_set
#> # A workflow set/tibble: 3 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 date_lm <tibble [1 × 4]> <opts[0]> <list [0]>
#> 2 plus_holidays_lm <tibble [1 × 4]> <opts[0]> <list [0]>
#> 3 plus_pca_lm <tibble [1 × 4]> <opts[0]> <list [0]>
## make new params
pca_param <-
parameters(num_comp()) %>%
update(num_comp = num_comp(c(0, 20)))
## add new params to workflowset like this:
chi_features_set %>%
option_add(param_info = pca_param, id = "plus_pca_lm")
#> # A workflow set/tibble: 3 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 date_lm <tibble [1 × 4]> <opts[0]> <list [0]>
#> 2 plus_holidays_lm <tibble [1 × 4]> <opts[0]> <list [0]>
#> 3 plus_pca_lm <tibble [1 × 4]> <opts[1]> <list [0]>
## now these new parameters can be used by `workflow_map()`:
chi_features_set %>%
option_add(param_info = pca_param, id = "plus_pca_lm") %>%
workflow_map(resamples = time_val_split, grid = 21, seed = 1)
#> # A workflow set/tibble: 3 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 date_lm <tibble [1 × 4]> <opts[2]> <rsmp[+]>
#> 2 plus_holidays_lm <tibble [1 × 4]> <opts[2]> <rsmp[+]>
#> 3 plus_pca_lm <tibble [1 × 4]> <opts[3]> <tune[+]>
Created on 2021-07-30 by the reprex package (v2.0.0)

How to map a nested dataframe, and store multiple columns as output

I have a data structure as follows:
test <- data.frame(
id= rep(1:3, each=20),
count = rnorm(60, mean=5, sd=1),
covar1 = rnorm(60, mean=10, sd=3),
covar2 = rnorm(60, mean=95, sd=5),
covar3 = rnorm(60, mean=30, sd=5)
)
Then I nest it by id:
test <- test %>% nest(-id)
I want to apply a model to each data covar column, for a given id. Then I want to store the result in a separate column. I can do this as follows:
test <- test %>% mutate(covar1_lm = map(data, ~lm(count ~ covar1, data=.x)),
covar2_lm = map(data, ~lm(count ~ covar2, data=.x)),
covar3_lm = map(data, ~lm(count ~ covar3, data=.x)))
Which gives the output I want:
> test
# A tibble: 3 x 5
id data covar1_lm covar2_lm covar3_lm
<int> <list> <list> <list> <list>
1 1 <tibble [20 × 4]> <lm> <lm> <lm>
2 2 <tibble [20 × 4]> <lm> <lm> <lm>
3 3 <tibble [20 × 4]> <lm> <lm> <lm>
The problem is my real data has a large number of covar columns, and so I'd like to reduce the boilerplate code. So I'm guessing I need some concept of dynamic variable names, but I cant figure out how to map over a dynamic set of column names??
You can pivot_longer() the dataset first, so that there is one observation (row) for each covariate for each dataset. Then you perform the model within each covariate.
test %>%
pivot_longer(starts_with("covar"),
names_to = "covariate") %>%
group_by(id, covariate) %>%
summarize(model = list(lm(count ~ value)))
You now have one observation for each combination of ID and covariate.
# A tibble: 9 x 3
# Groups: id [3]
id covariate model
<int> <chr> <list>
1 1 covar1 <lm>
2 1 covar2 <lm>
3 1 covar3 <lm>
4 2 covar1 <lm>
5 2 covar2 <lm>
6 2 covar3 <lm>
7 3 covar1 <lm>
8 3 covar2 <lm>
9 3 covar3 <lm>
If you want to turn that into the same kind of result, you could pipe this to pivot_wider(names_from = covariate, values_from = model). (But note that having one row for each model could make it easier to explore and visualize the models, especially if you tidy each with broom::tidy() and unnested them).
An alternative to the group_by()/summarize() above would be to nest them :
test %>%
pivot_longer(starts_with("covar"),
names_to = "covariate") %>%
group_by(id, covariate) %>%
nest() %>%
mutate(model = map(data, ~ lm(count ~ value, data = .x)))

Use purrr to map to 2 functions

I have data of the following form
date data
<chr> <list>
1 2012-01-05 <tibble [796 x 5]>
2 2012-01-12 <tibble [831 x 5]>
3 2012-01-19 <tibble [820 x 5]>
... ...
I would like to use something analogous to map() to calculate the mean and standard deviation.
I can currently use the following separately, but it is possible to calculate both at the same time.
mutate(stats = map(data, ~ sd(.$metric)))
mutate(stats = map(data, ~ mean(.$metric)))
Another alternative is to make a function that is like summary, which returns quartiles and the mean. but calculate the mean and sd instead. then I could use that new function in map as follows:
mutate(stats = map(data, ~ new_function(.$metric)))
Is there a better alternative?
A simple option to add multiple columns is to just make another list column of the desired summary statistics and unnest it:
library(tidyverse)
set.seed(47)
df <- data_frame(date = seq(as.Date('1970-01-01'), by = 1, length = 4),
data = map(date, ~data_frame(metric = rnorm(10))))
df
#> # A tibble: 4 x 2
#> date data
#> <date> <list>
#> 1 1970-01-01 <tibble [10 × 1]>
#> 2 1970-01-02 <tibble [10 × 1]>
#> 3 1970-01-03 <tibble [10 × 1]>
#> 4 1970-01-04 <tibble [10 × 1]>
df %>%
mutate(stats = map(data, ~data.frame(mean = mean(.x$metric),
sd = sd(.x$metric)))) %>%
unnest(stats)
#> # A tibble: 4 x 4
#> date data mean sd
#> <date> <list> <dbl> <dbl>
#> 1 1970-01-01 <tibble [10 × 1]> -0.106 0.992
#> 2 1970-01-02 <tibble [10 × 1]> -0.102 0.875
#> 3 1970-01-03 <tibble [10 × 1]> -0.833 0.979
#> 4 1970-01-04 <tibble [10 × 1]> 0.184 0.671
A more programmatic approach (which may scale better) is to iterate within the anonymous function over a list of functions. lst will automatically name them, so the results will be named, and map_dfc will cbind them into a data frame:
df %>%
mutate(stats = map(data,
~map_dfc(lst(mean, sd),
function(.fun) .fun(.x$metric)))) %>%
unnest(stats)
purrr has a purpose-built function for iterating over functions/parameters like this: invoke_map. If you want the function or parameters to be recycled, they have to be in a length-1 list. Since parameters should already be collected in a list, here it has to be a nested list.
df %>%
mutate(stats = map(data,
~invoke_map_dfc(lst(mean, sd),
list(list(.x$metric))))) %>%
unnest(stats)
All approaches return the same thing.

Resources