Workflow Tidymodels Formula Object - r

I am fairly new to R, and I am following a guide to learn how to build an Expected Goals model for my hockey league. When I run the code below, I get the error at the bottom. Is there something simple that I am missing?
It seems like it's trying to use a formula in the model portion of the workflow, but I already have a recipe in there. Thanks in advance for any help anyone can offer me! The guide is here: https://www.thesignificantgame.com/portfolio/expected-goals-model-with-tidymodels/
library(tidymodels)
library(tidyverse)
library(dplyr)
set.seed(1972)
train_test_split <- initial_split(data = EXPECTED_GOALS_MODEL, prop = 0.80)
train_data <- train_test_split %>% training()
test_data <- train_test_split %>% testing()
xg_recipe <- recipe(Goal ~ DistanceC + Angle + Home + Hand + AgeDec31 + GoalieAgeDec31 + NewX + NewY, data = train_data) %>% update_role(NewX, NewY, new_role = "ID")
model <- logistic_reg() %>% set_engine("glm")
xg_wflow <- workflow() %>% add_model(model) %>% add_recipe(xg_recipe)
xg_wflow
xg_fit <- xg_wflow %>% fit(data = train_data)
Error in validObject(.Object) :
invalid class “model” object: invalid object for slot "formula" in class "model": got class "workflow", should be or extend class "formula"
In addition: Warning message:
In fit(., data = train_data) :
fit failed: Error in as.matrix(y) : argument "y" is missing, with no default
fit(x = ., data = train_data)

It's difficult to tell exactly what the issue is without a reproducible example, though this error brings a few questions up for me:
Does the EXPECTED_GOALS_MODEL data indeed have a column called Goal in it, with two unique levels? Have you also spelled the remainder of the column names correctly?
Are your tidymodels package installs up to date?
Does this error persist if you run generics::fit(data = train_data) specifically, instead of fit(data = train_data)? This almost looks like a different fit() being dispatched.
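A quick way to check that last point (a minimal sketch, reusing the objects above) is to see every fit() visible in your session and then call the generic with an explicit namespace:
# List every definition of `fit` on the search path; anything other than
# the generics/parsnip generic can mask the one workflows needs.
getAnywhere("fit")
# Call the generic explicitly so the workflow method is dispatched:
xg_fit <- generics::fit(xg_wflow, data = train_data)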
Here's a place to start with a reprex:
library(tidymodels)
data(ames)
set.seed(1972)
ames <- ames %>% rowid_to_column()
train_test_split <- initial_split(data = ames, prop = 0.80)
train_data <- train_test_split %>% training()
test_data <- train_test_split %>% testing()
xg_recipe <- recipe(Sale_Price ~ ., data = train_data) %>% update_role(rowid, new_role = "ID")
model <- linear_reg() %>% set_engine("glm")
xg_wflow <- workflow() %>% add_model(model) %>% add_recipe(xg_recipe)
xg_fit <- xg_wflow %>% fit(data = train_data)
xg_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 0 Recipe Steps
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call: stats::glm(formula = ..y ~ ., family = stats::gaussian, data = data)
#>
#> Coefficients:
#> (Intercept)
#> -2.583e+07
#> MS_SubClassOne_Story_1945_and_Older
#> 7.419e+03
#> MS_SubClassOne_Story_with_Finished_Attic_All_Ages
#> 1.562e+04
#> MS_SubClassOne_and_Half_Story_Unfinished_All_Ages
#> 1.060e+04
#> MS_SubClassOne_and_Half_Story_Finished_All_Ages
#> 8.413e+03
#> MS_SubClassTwo_Story_1946_and_Newer
#> 3.007e+03
#> MS_SubClassTwo_Story_1945_and_Older
#> 1.793e+04
#> MS_SubClassTwo_and_Half_Story_All_Ages
#> -3.909e+03
#> MS_SubClassSplit_or_Multilevel
#> -1.098e+04
#> MS_SubClassSplit_Foyer
#> -4.038e+03
#> MS_SubClassDuplex_All_Styles_and_Ages
#> -2.004e+04
#> MS_SubClassOne_Story_PUD_1946_and_Newer
#> -2.335e+04
#> MS_SubClassOne_and_Half_Story_PUD_All_Ages
#> -2.482e+04
#> MS_SubClassTwo_Story_PUD_1946_and_Newer
#> -1.794e+04
#> MS_SubClassPUD_Multilevel_Split_Level_Foyer
#> -2.098e+04
#> MS_SubClassTwo_Family_conversion_All_Styles_and_Ages
#> 6.903e+03
#> MS_ZoningResidential_High_Density
#> -3.853e+03
#> MS_ZoningResidential_Low_Density
#> -3.661e+03
#> MS_ZoningResidential_Medium_Density
#> -8.240e+03
#> MS_ZoningA_agr
#> -3.824e+03
#> MS_ZoningC_all
#> -1.800e+04
#> MS_ZoningI_all
#> -3.299e+04
#> Lot_Frontage
#> 1.336e+01
#>
#> ...
#> and 506 more lines.
Created on 2022-09-28 by the reprex package (v2.0.1)
Hope this helps!
Simon, tidymodels team

Related

Using tidymodels in R, my BART workflow changes after I fit it once. Why?

I have been trying to train a BART model using the tidymodels framework but I am running into some problems.
I can declare the model, the recipe, and the workflow alright, but once I fit the workflow, two unwanted things happen:
The original model object (bart_mod below), which is initially stored correctly, becomes "call: NULL", even though I don't touch the model object directly (I assign nothing to the same object name).
I am not able to retrieve any information about the fitted model. The bart_fit contains nothing, and there seems to be no tidy method associated with it. All this is true even though I am able to predict values using the fitted model! (See the last line of code in the reprex.)
This may very well come from a misunderstanding on my end of how all this works; I am fairly new to tidymodels.
I would appreciate any help! Thank you.
library(tidyverse)
library(tidymodels)
set.seed(2022)
# Parameters --------------------------------------------------------------
n <- 5000
coef_x_var_1 <- 1
coef_x_var_2 <- 2
coef_x_var_3 <- 3
gen_y_1 <- function(data = dataset) {
return(data$y_0 +
data$x_var_1*coef_x_var_1 +
data$x_var_2*coef_x_var_2 +
data$x_var_3*coef_x_var_3 +
rnorm(n = nrow(data), mean = 0, sd = 3)
)}
# Data generation ---------------------------------------------------------
dataset <- matrix(NA, nrow = n, ncol = 3)
# Generate the unit-level moderators
dataset[,1] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[,2] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[,3] <- rnorm(mean = rnorm(n = 1), n = n)
# Change into dataframe
colnames(dataset) <- c("x_var_1", "x_var_2", "x_var_3")
dataset <- as_tibble(dataset)
# Make sure the variable format is numeric (except for the identifiers)
dataset$x_var_1 <- as.numeric(dataset$x_var_1)
dataset$x_var_2 <- as.numeric(dataset$x_var_2)
dataset$x_var_3 <- as.numeric(dataset$x_var_3)
# Generate the untreated potential outcomes
P0_coefs <- rdunif(n = 6, 1, 15)
dataset$y_0 <-
dataset$x_var_1*P0_coefs[4] +
dataset$x_var_2*P0_coefs[5] +
dataset$x_var_3*P0_coefs[6] +
rnorm(n = nrow(dataset), mean = 0, sd = 3)
dataset$y_1 <- gen_y_1(data = dataset)
# Create a variable to indicate treatment
treatment_group <- sample(1:nrow(dataset), size = nrow(dataset)/2)
# Indicate which potential outcome you observe
obs_dataset <- dataset |>
mutate(treated = ifelse(row_number() %in% treatment_group, 1, 0),
obs_y = ifelse(treated, y_1, y_0))
y1_obs_dataset <- obs_dataset |> filter(treated == 1)
y0_obs_dataset <- obs_dataset |> filter(treated == 0)
# Analysis ----------------------------------------------------------------
covariates <- c("x_var_1", "x_var_2", "x_var_3")
bart_formula <- as.formula(paste0("obs_y ~ ", paste(covariates, collapse = " + ")))
# Create the workflow
bart_mod <- bart() |>
set_engine("dbarts") |>
set_mode("regression")
bart_recipe <- recipe(bart_formula, data = obs_dataset) |>
step_zv(all_predictors())
bart_workflow <-
workflow() |>
add_model(bart_mod) |>
add_recipe(bart_recipe)
# The workflow first looks right
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> BART Model Specification (regression)
#>
#> Computational engine: dbarts
# Once I fit it though, the model part becomes call: NULL
bart_fit <- bart_workflow |>
fit(y1_obs_dataset)
# Nothing is stored in the fit
bart_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> `NULL`()
# The content of this object has changed!
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> NULL
bart_fit |>
extract_fit_parsnip(bart_fit)
#> parsnip model object
#>
#>
#> Call:
#> `NULL`()
# And yet, I am able to run a prediction using the fit!
predict(bart_fit, y0_obs_dataset)
#> # A tibble: 2,500 × 1
#> .pred
#> <dbl>
#> 1 -4.67
#> 2 -6.23
#> 3 6.35
#> 4 10.7
#> 5 4.90
#> 6 -13.8
#> 7 4.70
#> 8 19.6
#> 9 -0.907
#> 10 5.38
#> # … with 2,490 more rows
Created on 2022-12-24 with reprex v2.0.2
First stripping Martin's code down to a smaller script:
library(tidyverse)
library(tidymodels)
set.seed(2022)
obs_dataset <- structure(list(x_var_1 = c(-0.273203786163623, 0.0026566250757164,
-0.544359413888551, 0.569128408034224, -2.00048700105319, -0.159113741655834
), obs_y = c(-8.14952415680873, 1.91364235165124, -7.68391811408719,
-9.01497463720505, -18.5017189874949, -13.505685812581)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
bart_formula <- as.formula("obs_y ~ x_var_1")
# Create the workflow
bart_mod <- bart() |>
set_engine("dbarts") |>
set_mode("regression")
bart_recipe <- recipe(bart_formula, data = obs_dataset)
bart_workflow <-
workflow() |>
add_model(bart_mod) |>
add_recipe(bart_recipe)
The workflow at first looks right
bart_workflow
> ══ Workflow ════════════════════════════════════════════════════════════════
> Preprocessor: Recipe
> Model: bart()
>
> ── Preprocessor ──────────────────────────────────────────
> 0 Recipe Steps
>
> ── Model ─────────────────────────────────────────────────
> BART Model Specification (regression)
>
> Computational engine: dbarts
but this changes after fitting:
bart_fit <- bart_workflow |>
fit(obs_dataset)
bart_fit
The workflow now displays NULL for the call, as does the model object.
bart_workflow
bart_mod
══ Workflow [trained] ══════════════════════════════════════════════════════
Preprocessor: Recipe
Model: bart()
── Preprocessor ─────────────────────────────────
0 Recipe Steps
── Model ────────────────────────────────────────────────
Call:
`NULL`()
All these display values:
required_pkgs(bart_mod)
print_model_spec(bart_mod)
bart_mod[["engine"]]
bart_mod[["mode"]]
extract_recipe(bart_fit)
extract_preprocessor(bart_fit)
extract_mold(bart_fit)
bart_fit[["fit"]][["fit"]][["spec"]][["engine"]]
bart_fit[["fit"]][["fit"]][["spec"]][["mode"]]
These display NULL:
print(bart_mod)
print(bart_workflow)
print(bart_fit)
extract_fit_engine(bart_fit)
extract_fit_parsnip(bart_fit)
extract_model(bart_fit)
So it seems that the model data is still in the objects and is usable, but neither the print calls nor the extract functions display it.
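One way to confirm that (a minimal sketch, reusing the stripped-down objects above) is to inspect the extracted engine object directly instead of relying on its print method:
# The engine fit is present even though printing it shows a `NULL` call:
engine_fit <- extract_fit_engine(bart_fit)
class(engine_fit)                 # should report the dbarts fit class
str(engine_fit, max.level = 1)    # its fitted components are all there
# Predictions work because the underlying object is intact:
predict(bart_fit, obs_dataset)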

How to extract predictors from parsnip fit object

I have the following prediction model:
library(tidymodels)
data(ames)
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rec <- recipe(Sale_Price ~ ., data = ames_train)
norm_trans <- rec %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.1)
# Preprocessing
norm_obj <- prep(norm_trans, training = ames_train)
rf_ames_train <- bake(norm_obj, ames_train) %>%
dplyr::select(Sale_Price, everything()) %>%
as.data.frame()
dim(rf_ames_train )
rf_xy_fit <- rand_forest(mode = "regression") %>%
set_engine("ranger") %>%
fit_xy(
x = rf_ames_train,
y = log10(rf_ames_train$Sale_Price)
)
Note that after the preprocessing step, the number of features is reduced from 74 to 33.
dim(rf_ames_train )
# 33
Currently, I have to explicitly pass the predictors in the function:
preds <- colnames(rf_ames_train)
my_pred_function <- function (fit = NULL, test_data = NULL, predictors = NULL) {
test_results <- test_data %>%
select(Sale_Price) %>%
mutate(Sale_Price = log10(Sale_Price)) %>%
bind_cols(
predict(fit, new_data = ames_test[, predictors])
)
test_results
}
my_pred_function(fit = rf_xy_fit, test_data = ames_test, predictors = preds)
This is shown as predictors = preds in the function call above.
In practice, I have to save rf_xy_fit and preds as two RDS files and then read them in again, which is error-prone and troublesome.
I would like to bypass this explicit passing. Is there a way I can extract the predictors from rf_xy_fit directly?
This is a case where you would benefit from using the workflows package, which allows you to combine the preprocessing code with the model fitting code:
library(tidymodels)
data(ames)
set.seed(4595)
# Notice how I did the log transformation before doing the splitting, to ensure it is applied consistently to both the testing and training data sets.
ames <- ames %>%
mutate(Sale_Price = log10(Sale_Price))
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rec <- recipe(Sale_Price ~ ., data = ames_train)
norm_trans <- rec %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.1)
rf_spec <- rand_forest(mode = "regression") %>%
set_engine("ranger")
rf_wf <- workflow() %>%
add_recipe(norm_trans) %>%
add_model(rf_spec)
rf_fit <- fit(rf_wf, ames_train)
predict(rf_fit, new_data = ames_train)
#> # A tibble: 2,197 × 1
#> .pred
#> <dbl>
#> 1 5.09
#> 2 5.12
#> 3 5.01
#> 4 4.99
#> 5 5.12
#> 6 5.07
#> 7 4.90
#> 8 5.09
#> 9 5.13
#> 10 5.08
#> # … with 2,187 more rows
Created on 2022-11-21 with reprex v2.0.2
Supplementing Emil's answer based on your comment:
Keep in mind that most R modeling functions will expect the original feature set, even if some of them are not at all used. This is a by-product of R’s formula/model.matrix() machinery.
For recipes, it depends on which steps you use.
You could refit the final model without them, but you might not get exactly the same model; in many cases, the process of getting to the subset of features depends on how many were originally passed.
I'm working on a tidymodels API for this, but caret has a function to get the list of predictors that were actually used by the model. See the example:
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(tidymodels)
tidymodels_prefer()
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)
data(ames)
set.seed(4595)
ames <- ames %>%
mutate(Sale_Price = log10(Sale_Price))
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rec <- recipe(Sale_Price ~ ., data = ames_train)
norm_trans <- rec %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.1)
rf_spec <- rand_forest(mode = "regression") %>%
set_engine("ranger")
rf_wf <- workflow() %>%
add_recipe(norm_trans) %>%
add_model(rf_spec)
rf_fit <- fit(rf_wf, ames_train)
# get predictor set:
rf_features <-
rf_fit %>%
extract_fit_engine() %>%
predictors() # <- the caret function
head(rf_features)
#> [1] "MS_SubClass" "MS_Zoning" "Lot_Frontage" "Lot_Shape" "Lot_Config"
#> [6] "Neighborhood"
# You get an error here:
ames_test %>%
select(all_of(rf_features)) %>%
predict(rf_fit, new_data = .)
#> Error in `validate_column_names()`:
#> ! The following required columns are missing: 'Lot_Area',
#> 'Street', 'Alley', 'Land_Contour', 'Utilities', 'Land_Slope',
#> 'Condition_2', 'Year_Built', 'Year_Remod_Add', 'Roof_Matl',
#> 'Mas_Vnr_Area', 'Bsmt_Cond', 'BsmtFin_SF_1', 'BsmtFin_Type_2',
#> 'BsmtFin_SF_2', 'Bsmt_Unf_SF', 'Total_Bsmt_SF', 'Heating',
#> 'First_Flr_SF', 'Second_Flr_SF', 'Gr_Liv_Area', 'Bsmt_Full_Bath',
#> 'Full_Bath', 'Half_Bath', 'Bedroom_AbvGr', 'Kitchen_AbvGr',
#> 'TotRms_AbvGrd', 'Functional', 'Fireplaces', 'Garage_Cars',
#> 'Garage_Area', 'Wood_Deck_SF', 'Open_Porch_SF', 'Enclosed_Porch',
#> 'Three_season_porch', 'Screen_Porch', 'Pool_Area', 'Pool_QC',
#> 'Misc_Feature', 'Misc_Val', 'Mo_Sold', 'Latitude'.
Created on 2022-11-21 by the reprex package (v2.0.1)
This error comes from the workflows package but the underlying modeling package would also error.
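Put differently (a small sketch reusing the fitted workflow from above), since the fitted workflow already stores the recipe, the simplest route is to pass the full original column set and let the workflow do the subsetting internally:
# The recipe inside the fitted workflow selects and transforms the
# columns it needs, so the raw test data can be passed directly:
predict(rf_fit, new_data = ames_test)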

gtsummary and add_glance: Error: No glance method for objects of class mira

I am trying to add r-squared and the number of observations to my regression tables using gtsummary and multiply imputed data from mice. I can still extract r-squared using glance if I pool the model, but mipo objects are not compatible with tbl_regression since pooling is already part of the tidying process. Any help with this issue is appreciated.
library(tidyverse)
library(gtsummary)
library(mice)
library(broom)
library(broom.helpers)
set.seed(123)
mice::version(pkg = "tidyverse")
#> [1] "tidyverse 1.3.1.9000 /Library/Frameworks/R.framework/Versions/4.1/Resources/library"
mice::version(pkg = "gtsummary")
#> [1] "gtsummary 1.5.2 /Library/Frameworks/R.framework/Versions/4.1/Resources/library"
mice::version(pkg = "mice")
#> [1] "mice 3.14.3 2021-12-06 /Library/Frameworks/R.framework/Versions/4.1/Resources/library"
mice::version(pkg = "broom")
#> [1] "broom 0.7.12.9000 /Library/Frameworks/R.framework/Versions/4.1/Resources/library"
mice::version(pkg = "broom.helpers")
#> [1] "broom.helpers 1.6.0.9000 /Library/Frameworks/R.framework/Versions/4.1/Resources/library"
R.Version()$version.string
#> [1] "R version 4.1.1 (2021-08-10)"
# imputed data
data(nhanes)
imp <- mice(nhanes, m = 3, print = FALSE)
mod <- with(imp, lm(age ~ bmi + chl))
# is it broom ?
broom::glance(mod)
#> Error: No glance method for objects of class mira
broom::glance(pool(mod))
#> nimp nobs r.squared adj.r.squared
#> 1 3 25 0.5245108 0.4799119
# regular tbl_regression
tbl_regression(mod) %>% as_kable() # kable so it prints nicely here
#> pool_and_tidy_mice(): Tidying mice model with
#> `mice::pool(x) %>% mice::tidy(exponentiate = FALSE, conf.int = TRUE, conf.level = 0.95)`
Characteristic   Beta    95% CI          p-value
bmi              -0.12   -0.21, -0.03    0.012
chl              0.01    0.00, 0.02      0.008
# tbl_regression with add_glance_source_note
tbl_regression(mod) %>% add_glance_source_note(include = c(r.squared, nobs) )
#> pool_and_tidy_mice(): Tidying mice model with
#> `mice::pool(x) %>% mice::tidy(exponentiate = FALSE, conf.int = TRUE, conf.level = 0.95)`
#> Error: No glance method for objects of class mira
# tbl_regression with add_glance_table
tbl_regression(mod) %>% add_glance_table(include = c(r.squared, nobs) )
#> pool_and_tidy_mice(): Tidying mice model with
#> `mice::pool(x) %>% mice::tidy(exponentiate = FALSE, conf.int = TRUE, conf.level = 0.95)`
#> Error: No glance method for objects of class mira
# tbl_regression with pooled data
pool <- mice::pool(mod)
tbl_regression(pool)
#> x Please pass the 'mice' model to `tbl_regression()` before models
#> have been combined with `mice::pool()`. The default tidier, `pool_and_tidy_mice()`, will both pool and tidy the regression model.
#> mice::mice(trial, m = 2) %>%
#> with(lm(age ~ marker + grade)) %>%
#> tbl_regression()
Thank you for the detailed question and reproducible example.
You diagnosed the issue well. tbl_regression() expects the unpooled results and broom::glance() expects the pooled results. You can get what you need by passing a custom glance() function.
Example below!
library(tidyverse)
library(gtsummary)
library(mice)
#>
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#>
#> filter
#> The following objects are masked from 'package:base':
#>
#> cbind, rbind
set.seed(123)
# imputed data
data(nhanes)
imp <- mice(nhanes, m = 3, print = FALSE)
mod <- with(imp, lm(age ~ bmi + chl))
broom::glance(pool(mod))
#> nimp nobs r.squared adj.r.squared
#> 1 3 25 0.5245108 0.4799119
tbl <-
tbl_regression(mod) %>%
add_glance_source_note(
include = c(r.squared, nobs),
# need to modify the glance function slightly to handle the regression object
glance_fun = function(x) broom::glance(pool(x))
)
#> pool_and_tidy_mice(): Tidying mice model with
#> `mice::pool(x) %>% mice::tidy(exponentiate = FALSE, conf.int = TRUE, conf.level = 0.95)`
Created on 2022-01-29 by the reprex package (v2.0.1)
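The same pooled glance function should also work with add_glance_table(), which the question tried; this is a sketch that assumes its glance_fun argument behaves like the one in add_glance_source_note():
# Variant: put R-squared and N in the table body rather than a source
# note, using the same pooled glance function (assumed to be accepted
# by add_glance_table() as well).
tbl_regression(mod) %>%
  add_glance_table(
    include = c(r.squared, nobs),
    glance_fun = function(x) broom::glance(pool(x))
  )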

Tidymodels, All models failed; error in model.frame.default and prediction from a rank-deficient fit may be misleading

I am having problems with the tidymodels tuning, which gives the following error and warning:
warning: prediction from a rank-deficient fit may be misleading
Error: Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = ob...
Note 1: I am performing the tuning for both a normal CV fold and a spatial CV fold.
Note 2: I wanted to include the data, but Stack Overflow gives me: "Body is limited to 30000 characters; you entered 143552." I can send you the data if you wish!
Defining lm model
lm_fit_spatcv <- fit_resamples(
lm_wf,
resamples = spatial_cv_fold,
control = model.control,
metrics = multi.metric)
Defining glm model
glm_fit_spatcv <- fit_resamples(
glm_wf,
resamples = spatial_cv_fold,
control = model.control,
metrics = multi.metric)
I have looked into it a bit here and here and here, but I still do not really understand which of my pre-processing steps might be causing these issues.
I made a reprex:
# Loading packages
library(tidyverse)
library(parallelMap)
library(parallelly)
library(parallel)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(treesnip)
#> Error in library(treesnip): there is no package called 'treesnip'
library(kknn)
library(kernlab)
#>
#> Attaching package: 'kernlab'
#> The following object is masked from 'package:scales':
#>
#> alpha
#> The following object is masked from 'package:purrr':
#>
#> cross
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
library(ranger)
library(datapasta)
library(spatialsample)
library(stacks)
# DATA
# agrofor.biophys.modelling.data <- read.csv(file = here::here("DATA","agrofor.biophys.modelling.data.csv"))
# Creating sample data
# agrofor.biophys.modelling.data <- agrofor.biophys.modelling.data %>%
# dplyr::slice_sample(n = 100, replace = FALSE) %>%
# as_tibble()
# making a tibble::tribble dataset using dpaste() from the datapasta package
# datapasta::dpasta(agrofor.biophys.modelling.data)
# Here was a tibble::tribble dataset. I can send you the data if you wish!
# Removing observations with NAs from the data
ml.data.clean <- data.table::copy(agrofor.biophys.modelling.data) %>%
drop_na()
ml.data.clean.na.check <- ml.data.clean %>%
select(everything()) %>% # replace to your needs
summarise_all(funs(sum(is.na(.))))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
# Checking for na
#sapply(ml.data.clean.na.check, function(x) sum(is.na(x)))
# STEP 1: Splitting data, defining resampling techniques and setting global model metrics
## Splitting data in training and testing sets
set.seed(234)
# Splitting data
af.split <- initial_split(ml.data.clean, prop = 0.80, strata = logRR)
#> Warning: The number of observations in each quantile is below the recommended
#> threshold of 20. Stratification will be done with 3 breaks instead.
af.train <- training(af.split)
af.test <- testing(af.split)
## Defining resampling techniques
# Re-sample technique(s)
boostrap_df <- bootstraps(af.train, times = 10, strata = logRR)
#> Warning: The number of observations in each quantile is below the recommended
#> threshold of 20. Stratification will be done with 2 breaks instead.
cv_fold <- vfold_cv(af.train, v = 10, repeats = 10)
spatial_cv_fold <- spatial_clustering_cv(af.train, coords = c("Longitude", "Latitude"), v = 20)
## Setting global metrics
# Metrics
multi.metric <- metric_set(rmse, rsq, ccc, mae)
model.control <- control_stack_grid()
# STEP 2: Model recipes - pre-processing steps
# Linear model - lm recipe
lm_recipe <-
recipe(formula = logRR ~ ., data = af.train) %>%
update_role(Site.Type, new_role = "predictor") %>%
update_role(Latitude,
Longitude,
Tree,
new_role = "sample ID") %>%
step_novel(Site.Type, -all_outcomes()) %>%
step_dummy(Site.Type, one_hot = TRUE, naming = partial(dummy_names,sep = "_")) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
step_nzv(all_numeric(), -all_outcomes()) %>%
step_corr(all_numeric_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
# Generalised linear model recipe
glm_recipe <-
recipe(formula = logRR ~ ., data = af.train) %>%
update_role(Site.Type, new_role = "predictor") %>%
update_role(Latitude,
Longitude,
Tree,
new_role = "sample ID") %>%
step_novel(Site.Type, -all_outcomes()) %>%
step_dummy(Site.Type, one_hot = TRUE, naming = partial(dummy_names,sep = "_")) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
step_nzv(all_numeric(), -all_outcomes()) %>%
step_corr(all_numeric_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors()) %>%
step_lincomb(all_numeric(), -all_outcomes())
# STEP 3: Setting model specifications
lm_model <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
glm_model <- linear_reg(
mode = "regression",
penalty = 0.1,
mixture = 0
) %>%
set_engine("glmnet")
# STEP 4: Defining model workflows
lm_wf <- workflow() %>%
add_model(lm_model) %>%
add_recipe(lm_recipe)
glm_wf <- workflow() %>%
add_model(glm_model) %>%
add_recipe(glm_recipe)
# STEP 5: Model (hyper)-parameter tuning
# Initializing parallel processing
parallelStartSocket(cpus = detectCores())
#> Starting parallelization in mode=socket with cpus=8.
##########################################################################
# Spatial k-fold cross validation
##########################################################################
lm_fit_spatcv <- fit_resamples(
lm_wf,
resamples = spatial_cv_fold,
control = model.control,
metrics = multi.metric)
#> ! Fold01: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold02: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
glm_fit_spatcv <- fit_resamples(
glm_wf,
resamples = spatial_cv_fold,
control = model.control,
metrics = multi.metric)
#> x Fold01: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y, we...
#> x Fold02: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y,
#> Warning: All models failed. See the `.notes` column.
##########################################################################
# Normal/random k-fold cross validation (CV-fold)
##########################################################################
lm_fit_cv <- fit_resamples(
lm_wf,
resamples = cv_fold,
control = model.control,
metrics = multi.metric)
#> ! Fold01, Repeat01: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold02, Repeat01: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold03, Repeat01: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
glm_fit_cv <- fit_resamples(
glm_wf,
resamples = cv_fold,
control = model.control,
metrics = multi.metric)
#> x Fold01, Repeat01: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y, we...
#> x Fold02, Repeat01: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y, we...
#> x Fold03, Repeat01: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y, we...
#> Warning: All models failed. See the `.notes` column.
# Stopping parallel session
parallelStop()
#> Stopped parallelization. All cleaned up.
Created on 2021-09-03 by the reprex package (v2.0.1)
Solved!
What a waste of time. I had forgotten to deal with these extra ID-type columns, which were still being treated as predictors; the fixed recipe assigns them the "sample ID" role:
lm_recipe <-
recipe(formula = logRR ~ ., data = af.train) %>%
update_role(Site.Type, new_role = "predictor") %>%
update_role(PrName,
Out.SubInd,
Out.SubInd.Code,
Product,
Latitude,
Longitude,
Tree,
new_role = "sample ID") %>%
step_novel(Site.Type, -all_outcomes()) %>%
step_dummy(Site.Type, one_hot = TRUE, naming = partial(dummy_names,sep = "_")) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
step_nzv(all_numeric(), -all_outcomes()) %>%
step_corr(all_numeric_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
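For anyone hitting the same "All models failed" warning, the full error text behind the truncated messages can be read by unnesting the .notes list-column of the resampling results (a minimal sketch using the objects from the reprex above):
# Each resample stores its warnings/errors in the .notes list-column;
# unnesting it shows the complete messages instead of the truncated ones.
glm_fit_spatcv %>%
  select(id, .notes) %>%
  unnest(.notes)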

Tidymodels + Spark

I'm trying to develop a simple logistic regression model using tidymodels with the Spark engine. My code works fine when I use set_engine("glm"), but it fails when I attempt to set the engine to "spark". Any advice would be much appreciated!
library(tidyverse)
library(sparklyr)
library(tidymodels)
train.df <- titanic::titanic_train
train.df <- train.df %>%
mutate(Survived = factor(ifelse(Survived == 1, 'Y', 'N')),
Sex = factor(Sex),
Pclass = factor(Pclass))
skimr::skim(train.df)
# Just working with Spark locally.
sc <- spark_connect(master = 'local', version = '3.1')
train.spark.df <- copy_to(sc, train.df)
logistic.regression.recipe <-
recipe(Survived ~ PassengerId + Sex + Age + Pclass, data = train.spark.df) %>%
update_role(PassengerId, new_role = 'ID') %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_impute_linear(all_predictors())
logistic.regression.recipe
summary(logistic.regression.recipe)
logistic.regression.model <-
logistic_reg() %>%
set_mode("classification") %>%
set_engine("spark")
logistic.regression.model
logistic.regression.workflow <-
workflow() %>%
add_recipe(logistic.regression.recipe) %>%
add_model(logistic.regression.model)
logistic.regression.workflow
logistic.regression.final.model <-
logistic.regression.workflow %>%
fit(data = train.spark.df)
logistic.regression.final.model
Error: `data` must be a data.frame or a matrix, not a tbl_spark.
Thanks for reading!
So, the support for Spark in tidymodels is not even across all the parts of a modeling analysis. The support for modeling in parsnip is good, but we don't have fully featured support for feature engineering in recipes or for putting those building blocks together in workflows. So, for example, you can fit just the logistic regression model:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(sparklyr)
#>
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:purrr':
#>
#> invoke
#> The following object is masked from 'package:stats':
#>
#> filter
sc <- spark_connect(master = "local")
train_sp <- copy_to(sc, titanic::titanic_train, overwrite = TRUE)
log_spec <- logistic_reg() %>% set_engine("spark")
log_spec %>%
fit(Survived ~ Sex + Fare + Pclass, data = train_sp)
#> parsnip model object
#>
#> Fit time: 5.1s
#> Formula: Survived ~ Sex + Fare + Pclass
#>
#> Coefficients:
#> (Intercept) Sex_male Fare Pclass
#> 3.143731639 -2.630648858 0.001450218 -0.917173436
Created on 2021-07-09 by the reprex package (v2.0.0)
But you can't use recipes and workflows out of the box. You might consider trying something like using spark_apply() but that may be a challenge at the current stage of maturity in tidymodels' integration with Spark.
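If the training data fits in local memory, one workaround (a sketch, not part of the original answer: it rebuilds the pieces on the local data frame and swaps the engine to "glm") is to keep the recipe and workflow entirely local, reserving the Spark engine for the parsnip-only path shown above:
library(tidymodels)

train_local <- titanic::titanic_train %>%
  mutate(Survived = factor(ifelse(Survived == 1, "Y", "N")),
         Sex = factor(Sex),
         Pclass = factor(Pclass))

# Same preprocessing as in the question, but defined on the local data:
local_recipe <-
  recipe(Survived ~ PassengerId + Sex + Age + Pclass, data = train_local) %>%
  update_role(PassengerId, new_role = "ID") %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_impute_linear(all_predictors())

# A local engine lets recipes + workflows work end to end:
local_wflow <- workflow() %>%
  add_recipe(local_recipe) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

local_fit <- fit(local_wflow, data = train_local)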
