I'm trying to develop a simple logistic regression model using Tidymodels with the Spark engine. My code works fine when I specify set_engine = "glm", but fails when I attempt to set the engine to spark. Any advice would be much appreciated!
library(tidyverse)
library(sparklyr)
library(tidymodels)
train.df <- titanic::titanic_train
train.df <- train.df %>%
mutate(Survived = factor(ifelse(Survived == 1, 'Y', 'N')),
Sex = factor(Sex),
Pclass = factor(Pclass))
skimr::skim(train.df)
# Just working with Spark locally.
sc <- spark_connect(master = 'local', version = '3.1')
train.spark.df <- copy_to(sc, train.df)
logistic.regression.recipe <-
recipe(Survived ~ PassengerId + Sex + Age + Pclass, data = train.spark.df) %>%
update_role(PassengerId, new_role = 'ID') %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_impute_linear(all_predictors())
logistic.regression.recipe
summary(logistic.regression.recipe)
logistic.regression.model <-
logistic_reg() %>%
set_mode("classification") %>%
set_engine("spark")
logistic.regression.model
logistic.regression.workflow <-
workflow() %>%
add_recipe(logistic.regression.recipe) %>%
add_model(logistic.regression.model)
logistic.regression.workflow
logistic.regression.final.model <-
logistic.regression.workflow %>%
fit(data = train.spark.df)
logistic.regression.final.model
Error: `data` must be a data.frame or a matrix, not a tbl_spark.
Thanks for reading!
So the support for Spark in tidymodels is not even across all the parts of a modeling analysis. The support for modeling in parsnip is good, but we don't have fully featured support for feature engineering in recipes or putting those building blocks together in workflows. So for example, you can fit just the logistic regression model:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(sparklyr)
#>
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:purrr':
#>
#> invoke
#> The following object is masked from 'package:stats':
#>
#> filter
sc <- spark_connect(master = "local")
train_sp <- copy_to(sc, titanic::titanic_train, overwrite = TRUE)
log_spec <- logistic_reg() %>% set_engine("spark")
log_spec %>%
fit(Survived ~ Sex + Fare + Pclass, data = train_sp)
#> parsnip model object
#>
#> Fit time: 5.1s
#> Formula: Survived ~ Sex + Fare + Pclass
#>
#> Coefficients:
#> (Intercept) Sex_male Fare Pclass
#> 3.143731639 -2.630648858 0.001450218 -0.917173436
Created on 2021-07-09 by the reprex package (v2.0.0)
But you can't use recipes and workflows out of the box. You might consider trying something like using spark_apply() but that may be a challenge at the current stage of maturity in tidymodels' integration with Spark.
Related
I have the following prediction model:
library(tidymodels)
data(ames)
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rec <- recipe(Sale_Price ~ ., data = ames_train)
norm_trans <- rec %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.1)
# Preprocessing
norm_obj <- prep(norm_trans, training = ames_train)
rf_ames_train <- bake(norm_obj, ames_train) %>%
dplyr::select(Sale_Price, everything()) %>%
as.data.frame()
dim(rf_ames_train )
rf_xy_fit <- rand_forest(mode = "regression") %>%
set_engine("ranger") %>%
fit_xy(
x = rf_ames_train,
y = log10(rf_ames_train$Sale_Price)
)
Note that after preprocessing step the number of features are reduced from 74 to 33.
dim(rf_ames_train )
# 33
Currently, I have to explicitly pass the predictors in the function:
preds <- colnames(rf_ames_train)
my_pred_function <- function (fit = NULL, test_data = NULL, predictors = NULL) {
test_results <- test_data %>%
select(Sale_Price) %>%
mutate(Sale_Price = log10(Sale_Price)) %>%
bind_cols(
predict(fit, new_data = ames_test[, predictors])
)
test_results
}
my_pred_function(fit = rf_xy_fit, test_data = ames_test, predictors = preds)
Shown as predictors = preds in the function call above.
In practice, I have to save the rf_xy_fit and preds as two RDS files, then read them again. This is prone to error and troublesome.
I would like to by-pass this explicit passing. Is there a way I can extract that from rf_xy_fit directly?
This is a case where you would benefit from using the workflows package. This allows you to combine the preprocessing code with the model fitting code
library(tidymodels)
data(ames)
set.seed(4595)
# Notice how I did log transformation before doing the splitting to assure that it is not on both testing and training data sets.
ames <- ames %>%
mutate(Sale_Price = log10(Sale_Price))
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rec <- recipe(Sale_Price ~ ., data = ames_train)
norm_trans <- rec %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.1)
rf_spec <- rand_forest(mode = "regression") %>%
set_engine("ranger")
rf_wf <- workflow() %>%
add_recipe(norm_trans) %>%
add_model(rf_spec)
rf_fit <- fit(rf_wf, ames_train)
predict(rf_fit, new_data = ames_train)
#> # A tibble: 2,197 × 1
#> .pred
#> <dbl>
#> 1 5.09
#> 2 5.12
#> 3 5.01
#> 4 4.99
#> 5 5.12
#> 6 5.07
#> 7 4.90
#> 8 5.09
#> 9 5.13
#> 10 5.08
#> # … with 2,187 more rows
Created on 2022-11-21 with reprex v2.0.2
Supplementing Emils answer based on your comment...
Keep in mind that most R modeling functions will expect the original feature set, even if some of them are not at all used. This is a by-product of R’s formula/model.matrix() machinery.
For recipes, it depends on which steps that you use.
You could refit the final model without them but you might not get exactly the same model. In many cases, the process to getting to the subset of features depends on how many were originally passed.
I’m working on a tidymodels api for this but caret has one to get the list of predictors that were actually used by the model. See the example:
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(tidymodels)
tidymodels_prefer()
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)
data(ames)
set.seed(4595)
ames <- ames %>%
mutate(Sale_Price = log10(Sale_Price))
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rec <- recipe(Sale_Price ~ ., data = ames_train)
norm_trans <- rec %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.1)
rf_spec <- rand_forest(mode = "regression") %>%
set_engine("ranger")
rf_wf <- workflow() %>%
add_recipe(norm_trans) %>%
add_model(rf_spec)
rf_fit <- fit(rf_wf, ames_train)
# get predictor set:
rf_features <-
rf_fit %>%
extract_fit_engine() %>%
predictors() #<- the caret funciton
head(rf_features)
#> [1] "MS_SubClass" "MS_Zoning" "Lot_Frontage" "Lot_Shape" "Lot_Config"
#> [6] "Neighborhood"
# You get an error here:
ames_test %>%
select(all_of(rf_features)) %>%
predict(rf_fit, new_data = .)
#> Error in `validate_column_names()`:
#> ! The following required columns are missing: 'Lot_Area',
#> 'Street', 'Alley', 'Land_Contour', 'Utilities', 'Land_Slope',
#> 'Condition_2', 'Year_Built', 'Year_Remod_Add', 'Roof_Matl',
#> 'Mas_Vnr_Area', 'Bsmt_Cond', 'BsmtFin_SF_1', 'BsmtFin_Type_2',
#> 'BsmtFin_SF_2', 'Bsmt_Unf_SF', 'Total_Bsmt_SF', 'Heating',
#> 'First_Flr_SF', 'Second_Flr_SF', 'Gr_Liv_Area', 'Bsmt_Full_Bath',
#> 'Full_Bath', 'Half_Bath', 'Bedroom_AbvGr', 'Kitchen_AbvGr',
#> 'TotRms_AbvGrd', 'Functional', 'Fireplaces', 'Garage_Cars',
#> 'Garage_Area', 'Wood_Deck_SF', 'Open_Porch_SF', 'Enclosed_Porch',
#> 'Three_season_porch', 'Screen_Porch', 'Pool_Area', 'Pool_QC',
#> 'Misc_Feature', 'Misc_Val', 'Mo_Sold', 'Latitude'.
Created on 2022-11-21 by the reprex package (v2.0.1)
This error comes from the workflows package but the underlying modeling package would also error.
I am fairly new using R and I am following a guide and learning to build an Expected Goals Model for my hockey league. When I run the code below, I get the error at the bottom. Is there something simple that I am missing?
Seems like its trying to use a formula in the model portion of the workflow but I already have a recipe in there. Thanks in advance for any help anyone can offer me! The guide is here https://www.thesignificantgame.com/portfolio/expected-goals-model-with-tidymodels/
library(tidymodels)
library(tidyverse)
library(dplyr)
set.seed(1972)
train_test_split <- initial_split(data = EXPECTED_GOALS_MODEL, prop = 0.80)
train_data <- train_test_split %>% training()
test_data <- train_test_split %>% testing()
xg_recipe <- recipe(Goal ~ DistanceC + Angle + Home + Hand + AgeDec31 + GoalieAgeDec31 + NewX + NewY, data = train_data) %>% update_role(NewX, NewY, new_role = "ID")
model <- logistic_reg() %>% set_engine("glm")
xg_wflow <- workflow() %>% add_model(model) %>% add_recipe(xg_recipe)
xg_wflow
xg_fit <- xg_wflow %>% fit(data = train_data)
Error in validObject(.Object) :
invalid class “model” object: invalid object for slot "formula" in class "model": got class "workflow", should be or extend class "formula"
In addition: Warning message:
In fit(., data = train_data) :
fit failed: Error in as.matrix(y) : argument "y" is missing, with no default
fit(x = ., data = train_data)
It's difficult to tell exactly what the issue is without a reproducible example, though this error brings up a few questions up for me:
Does the EXPECTED_GOALS_MODEL data indeed have a column called Goal in it, with two unique levels? Have you also spelled the remainder of the column names correctly?
Are your tidymodels package installs up to date?
Does this error persist if you run specifically generics::fit(data = train_data) instead of fit(data = train_data)? This almost looks like a different fit() is being dispatched to.
Here's a place to start with a reprex:
library(tidymodels)
data(ames)
set.seed(1972)
ames <- ames %>% rowid_to_column()
train_test_split <- initial_split(data = ames, prop = 0.80)
train_data <- train_test_split %>% training()
test_data <- train_test_split %>% testing()
xg_recipe <- recipe(Sale_Price ~ ., data = train_data) %>% update_role(rowid, new_role = "ID")
model <- linear_reg() %>% set_engine("glm")
xg_wflow <- workflow() %>% add_model(model) %>% add_recipe(xg_recipe)
xg_fit <- xg_wflow %>% fit(data = train_data)
xg_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 0 Recipe Steps
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call: stats::glm(formula = ..y ~ ., family = stats::gaussian, data = data)
#>
#> Coefficients:
#> (Intercept)
#> -2.583e+07
#> MS_SubClassOne_Story_1945_and_Older
#> 7.419e+03
#> MS_SubClassOne_Story_with_Finished_Attic_All_Ages
#> 1.562e+04
#> MS_SubClassOne_and_Half_Story_Unfinished_All_Ages
#> 1.060e+04
#> MS_SubClassOne_and_Half_Story_Finished_All_Ages
#> 8.413e+03
#> MS_SubClassTwo_Story_1946_and_Newer
#> 3.007e+03
#> MS_SubClassTwo_Story_1945_and_Older
#> 1.793e+04
#> MS_SubClassTwo_and_Half_Story_All_Ages
#> -3.909e+03
#> MS_SubClassSplit_or_Multilevel
#> -1.098e+04
#> MS_SubClassSplit_Foyer
#> -4.038e+03
#> MS_SubClassDuplex_All_Styles_and_Ages
#> -2.004e+04
#> MS_SubClassOne_Story_PUD_1946_and_Newer
#> -2.335e+04
#> MS_SubClassOne_and_Half_Story_PUD_All_Ages
#> -2.482e+04
#> MS_SubClassTwo_Story_PUD_1946_and_Newer
#> -1.794e+04
#> MS_SubClassPUD_Multilevel_Split_Level_Foyer
#> -2.098e+04
#> MS_SubClassTwo_Family_conversion_All_Styles_and_Ages
#> 6.903e+03
#> MS_ZoningResidential_High_Density
#> -3.853e+03
#> MS_ZoningResidential_Low_Density
#> -3.661e+03
#> MS_ZoningResidential_Medium_Density
#> -8.240e+03
#> MS_ZoningA_agr
#> -3.824e+03
#> MS_ZoningC_all
#> -1.800e+04
#> MS_ZoningI_all
#> -3.299e+04
#> Lot_Frontage
#> 1.336e+01
#>
#> ...
#> and 506 more lines.
Created on 2022-09-28 by the reprex package (v2.0.1)
Hope this helps!
Simon, tidymodels team
I have a binary classification problem and used a random forest and a logistic regression.
From the results of conf_mat, the collect_metrics() and collect_predictions I want to change my models to classify as TRUE only if the model is "sure" say 75% or a even higher probability. I just don't know where to specify this change. Would be amazing if someone can give me a hint. My intuition tells me that it should be somewhere in the model specification e.g. somewhere here, but maybe I am wrong.
canc_rf_model <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = 500) %>%
set_engine("ranger") %>%
set_mode("classification")
canc_log_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Thank you very much in advance!
M.
The hard class predictions come from the underlying ranger::predictions() function, not from a tidymodels function so there's not much to be done in the fitting itself.
However, you can pretty fluently change this if you like after fitting. Let's make an example classification model:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data("ad_data")
alz <- ad_data
# data splitting
set.seed(100)
alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_train <- training(alz_split)
alz_test <- testing(alz_split)
# data resampling
set.seed(100)
alz_folds <-
vfold_cv(alz_train, v = 10, strata = Class)
rf_mod <-
rand_forest(trees = 1e3) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_wf <-
workflow() %>%
add_formula(Class ~ .) %>%
add_model(rf_mod)
set.seed(100)
rf_preds <- rf_wf %>%
fit_resamples(
resamples = alz_folds,
control = control_resamples(save_pred = TRUE)) %>%
collect_predictions()
Here is the default confusion matrix:
rf_preds %>%
conf_mat(Class, .pred_class)
#> Truth
#> Prediction Impaired Control
#> Impaired 37 5
#> Control 45 213
You can use the probably package to post-process your class probability estimates and just overwrite the default values:
library(probably)
#>
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#>
#> as.factor, as.ordered
rf_preds %>%
mutate(.pred_class = make_two_class_pred(.pred_Impaired,
levels(rf_preds$Class),
threshold = 0.75),
.pred_class = factor(.pred_class, levels = levels(rf_preds$Class))) %>%
conf_mat(Class, .pred_class)
#> Truth
#> Prediction Impaired Control
#> Impaired 0 0
#> Control 82 218
Created on 2021-03-23 by the reprex package (v1.0.0)
I am trying to fit some lme4::lmer models in drake plan, but am getting an error
'data' not found, and some variables missing from formula environment
If I substitute an lm model, it works.
Here is a reproducible example
library(drake)
library(lme4)
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following object is masked from 'package:drake':
#>
#> expand
plan_lm <- drake_plan(
dat = iris,
mod = lm(Sepal.Length ~ Petal.Length, data = dat)
)
make(plan_lm)
#> ℹ Consider drake::r_make() to improve robustness.
#> ▶ target dat
#> ▶ target mod
plan_lmer <- drake_plan(
dat1 = iris,
mod1 = lmer(Sepal.Length ~ Petal.Length, data = dat1)
)
make(plan_lmer)
#> ▶ target dat1
#> ▶ target mod1
#> x fail mod1
#> Error: target mod1 failed.
#> diagnose(mod1)$error$message:
#> 'data' not found, and some variables missing from formula environment
#> diagnose(mod1)$error$calls:
#> lme4::lFormula(formula = Sepal.Length ~ Petal.Length, data = dat1,
#> control = list("nloptwrap", TRUE, 1e-05, TRUE, FALSE, list(
#> "ignore", "stop", "ignore", "stop", "stop", "message+drop.cols",
#> "warning", "stop"), list(list("warning", 0.002, NULL),
#> list("message", 1e-04), list("warning", 1e-06)), list()))
#> lme4:::checkFormulaData(formula, data, checkLHS = control$check.formula.LHS ==
#> "stop")
#> base::stop("'data' not found, and some variables missing from formula environment",
#> call. = FALSE)
Created on 2020-07-29 by the reprex package (v0.3.0)
Any suggestions?
This edge case is an instance of https://github.com/ropensci/drake/issues/1012 and https://github.com/ropensci/drake/issues/1163. drake creates its own environments to run commands, so the environment with dat is different from the environment where the model actually runs. There are good reasons drake does this, and the behavior is not going to change, so this issue is unfortunately permanent unless lme4 changes. The best workaround I can offer is to create the formula in the target's environment at runtime, something like the reprex below. You have to manually force the data and the formula to be in the same environment. I recommend writing a custom function to do this.
library(drake)
suppressPackageStartupMessages(library(lme4))
fit_lmer <- function(dat) {
envir <- environment()
envir$dat <- dat
f <- as.formula("Reaction ~ Days + (Days | Subject)", env = envir)
lme4::lmer(f, data = dat)
}
plan <- drake_plan(
dat = sleepstudy,
mod = fit_lmer(dat)
)
make(plan)
#> ▶ target dat
#> ▶ target mod
Created on 2020-07-29 by the reprex package (v0.3.0)
By the way, please consider avoiding the iris dataset if you can: https://armchairecology.blog/iris-dataset/
I can work around the problem by reassigning within the new target
plan <- drake_plan(
dat = sleepstudy,
mod = {dat <- dat
lmer(Reaction ~ Days + (Days | Subject), dat)
}
)
make(plan)
Or following https://github.com/ropensci/drake/issues/1163, using readd(dat)
I am getting the following error when using recipes::step_dummy with caret::train (first attempt at combining the two packages):
Error: Not all variables in the recipe are present in the supplied
training set
Not sure what is causing the error nor the best way to debug. Help to train model would be much appreciated.
library(caret)
library(tidyverse)
library(recipes)
library(rsample)
data("credit_data")
## Split the data into training (75%) and test sets (25%)
set.seed(100)
train_test_split <- initial_split(credit_data)
credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)
# Create recipe for data pre-processing
rec_obj <- recipe(Status ~ ., data = credit_train) %>%
step_knnimpute(all_predictors()) %>%
#step_other(Home, Marital, threshold = .2, other = "other") %>%
#step_other(Job, threshold = .2, other = "others") %>%
step_dummy(Records) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep(training = credit_train, retain = TRUE)
train_data <- juice(rec_obj)
test_data <- bake(rec_obj, credit_test)
set.seed(1055)
# the glm function models the second factor level.
lrfit <- train(rec_obj, data = train_data,
method = "glm",
trControl = trainControl(method = "repeatedcv",
repeats = 5))
Don't prep the recipe before giving it to train and use the original training set:
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
library(rsample)
data("credit_data")
## Split the data into training (75%) and test sets (25%)
set.seed(100)
train_test_split <- initial_split(credit_data)
credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)
# Create recipe for data pre-processing
rec_obj <-
recipe(Status ~ ., data = credit_train) %>%
step_knnimpute(all_predictors()) %>%
#step_other(Home, Marital, threshold = .2, other = "other") %>%
#step_other(Job, threshold = .2, other = "others") %>%
step_dummy(Records) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric())
set.seed(1055)
# the glm function models the second factor level.
lrfit <- train(rec_obj, data = credit_train,
method = "glm",
trControl = trainControl(method = "repeatedcv",
repeats = 5))
lrfit
#> Generalized Linear Model
#>
#> 3341 samples
#> 13 predictor
#> 2 classes: 'bad', 'good'
#>
#> Recipe steps: knnimpute, dummy, center, scale
#> Resampling: Cross-Validated (10 fold, repeated 5 times)
#> Summary of sample sizes: 3006, 3008, 3007, 3007, 3007, 3007, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.7965349 0.4546223
Created on 2019-03-20 by the reprex package (v0.2.1)
It seems that you still need the formula in the train function (despite being listed in the recipe)?...
glmfit <- train(Status ~ ., data = juice(rec_obj),
method = "glm",
trControl = trainControl(method = "repeatedcv", repeats = 5))