I am fairly new to R and am following a guide to build an Expected Goals model for my hockey league. When I run the code below, I get the error at the bottom. Is there something simple that I am missing?
It seems like it's trying to use a formula in the model portion of the workflow, but I already have a recipe in there. Thanks in advance for any help anyone can offer! The guide is here: https://www.thesignificantgame.com/portfolio/expected-goals-model-with-tidymodels/
library(tidymodels)
library(tidyverse)
library(dplyr)
set.seed(1972)
train_test_split <- initial_split(data = EXPECTED_GOALS_MODEL, prop = 0.80)
train_data <- train_test_split %>% training()
test_data <- train_test_split %>% testing()
xg_recipe <- recipe(Goal ~ DistanceC + Angle + Home + Hand + AgeDec31 + GoalieAgeDec31 + NewX + NewY, data = train_data) %>% update_role(NewX, NewY, new_role = "ID")
model <- logistic_reg() %>% set_engine("glm")
xg_wflow <- workflow() %>% add_model(model) %>% add_recipe(xg_recipe)
xg_wflow
xg_fit <- xg_wflow %>% fit(data = train_data)
Error in validObject(.Object) :
invalid class “model” object: invalid object for slot "formula" in class "model": got class "workflow", should be or extend class "formula"
In addition: Warning message:
In fit(., data = train_data) :
fit failed: Error in as.matrix(y) : argument "y" is missing, with no default
fit(x = ., data = train_data)
It's difficult to tell exactly what the issue is without a reproducible example, though this error brings up a few questions for me:
Does the EXPECTED_GOALS_MODEL data indeed have a column called Goal in it, with two unique levels? Have you also spelled the remainder of the column names correctly?
Are your tidymodels package installs up to date?
Does this error persist if you run generics::fit(data = train_data) specifically, instead of fit(data = train_data)? It looks like the call may be dispatching to a different fit().
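One quick way to check for masking (a sketch, reusing the xg_wflow and train_data objects from your code):
# If another attached package exports its own fit(), the pipe can dispatch
# there instead of to the workflows method. Calling through generics::
# rules that out:
xg_fit <- generics::fit(xg_wflow, data = train_data)
# You can also inspect what the bare name currently points to; for the
# tidymodels method it should print <environment: namespace:generics>:
environment(fit)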
Here's a place to start with a reprex:
library(tidymodels)
data(ames)
set.seed(1972)
ames <- ames %>% rowid_to_column()
train_test_split <- initial_split(data = ames, prop = 0.80)
train_data <- train_test_split %>% training()
test_data <- train_test_split %>% testing()
xg_recipe <- recipe(Sale_Price ~ ., data = train_data) %>% update_role(rowid, new_role = "ID")
model <- linear_reg() %>% set_engine("glm")
xg_wflow <- workflow() %>% add_model(model) %>% add_recipe(xg_recipe)
xg_fit <- xg_wflow %>% fit(data = train_data)
xg_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 0 Recipe Steps
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call: stats::glm(formula = ..y ~ ., family = stats::gaussian, data = data)
#>
#> Coefficients:
#> (Intercept)
#> -2.583e+07
#> MS_SubClassOne_Story_1945_and_Older
#> 7.419e+03
#> MS_SubClassOne_Story_with_Finished_Attic_All_Ages
#> 1.562e+04
#> MS_SubClassOne_and_Half_Story_Unfinished_All_Ages
#> 1.060e+04
#> MS_SubClassOne_and_Half_Story_Finished_All_Ages
#> 8.413e+03
#> MS_SubClassTwo_Story_1946_and_Newer
#> 3.007e+03
#> MS_SubClassTwo_Story_1945_and_Older
#> 1.793e+04
#> MS_SubClassTwo_and_Half_Story_All_Ages
#> -3.909e+03
#> MS_SubClassSplit_or_Multilevel
#> -1.098e+04
#> MS_SubClassSplit_Foyer
#> -4.038e+03
#> MS_SubClassDuplex_All_Styles_and_Ages
#> -2.004e+04
#> MS_SubClassOne_Story_PUD_1946_and_Newer
#> -2.335e+04
#> MS_SubClassOne_and_Half_Story_PUD_All_Ages
#> -2.482e+04
#> MS_SubClassTwo_Story_PUD_1946_and_Newer
#> -1.794e+04
#> MS_SubClassPUD_Multilevel_Split_Level_Foyer
#> -2.098e+04
#> MS_SubClassTwo_Family_conversion_All_Styles_and_Ages
#> 6.903e+03
#> MS_ZoningResidential_High_Density
#> -3.853e+03
#> MS_ZoningResidential_Low_Density
#> -3.661e+03
#> MS_ZoningResidential_Medium_Density
#> -8.240e+03
#> MS_ZoningA_agr
#> -3.824e+03
#> MS_ZoningC_all
#> -1.800e+04
#> MS_ZoningI_all
#> -3.299e+04
#> Lot_Frontage
#> 1.336e+01
#>
#> ...
#> and 506 more lines.
Created on 2022-09-28 by the reprex package (v2.0.1)
Hope this helps!
Simon, tidymodels team
In base R it is easy to compare two models with the anova() function and get an F test.
library(MASS)
lm.fit1 <- lm(medv ~ . , data = Boston)
lm.fit1a <- update(lm.fit1, ~ . - age - black)
anova(lm.fit1a, lm.fit1)
If I am working with tidymodels workflows, how do I do the same comparison? I have code like this:
library(tidymodels)
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
the_rec <- recipe(medv ~ ., data = Boston)
the_workflow <- workflow() %>%
add_recipe(the_rec) %>%
add_model(lm_spec)
the_workflow_fit1 <-
fit(the_workflow, data = Boston)
tidy(the_workflow_fit1)
the_workflow_fit1a <-
the_workflow_fit1 %>%
update_recipe(the_rec %>% step_rm(age, black)) %>%
fit(data = Boston)
tidy(the_workflow_fit1a)
I don't know how to extract the right object (thingy) to feed a statement like this:
anova(the_workflow_fit1a$thingy, the_workflow_fit1$thingy)
What is the thingy I need? Is there an elegant way to do this inside of the tidymodels ecosystem?
Many hours later, and thanks to a post from @juliasilge (https://github.com/tidymodels/workflows/issues/54) that introduced me to pull_workflow_fit() (since renamed extract_fit_parsnip()), I have a tidymodels solution.
The base R code:
library(MASS)
lm.fit1 <- lm(medv ~ . , data = Boston)
lm.fit1a <- update(lm.fit1, ~ . - age - black)
anova(lm.fit1a, lm.fit1)
Can be done in tidymodels with:
library(tidymodels)
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
the_rec <- recipe(medv ~ ., data = Boston)
the_workflow <- workflow() %>%
add_recipe(the_rec) %>%
add_model(lm_spec)
the_workflow_fit1 <-
fit(the_workflow, data = Boston) %>%
extract_fit_parsnip()
the_workflow_fit1a <-
the_workflow %>%
update_recipe(
the_rec %>% step_rm(age, black)
) %>%
fit(data = Boston) %>%
extract_fit_parsnip()
anova(the_workflow_fit1a$fit, the_workflow_fit1$fit)
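As a side note, newer versions of workflows also provide extract_fit_engine(), which returns the underlying lm object directly, so the $fit subsetting can be dropped (a sketch, using the same objects as above):
# extract_fit_engine() goes one level deeper than extract_fit_parsnip(),
# handing back the engine (lm) object that anova() wants:
fit1 <- fit(the_workflow, data = Boston) %>% extract_fit_engine()
fit1a <- the_workflow %>%
  update_recipe(the_rec %>% step_rm(age, black)) %>%
  fit(data = Boston) %>%
  extract_fit_engine()
anova(fit1a, fit1)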
I am not fully familiar with the tidymodels ecosystem, so I am not sure this is the elegant solution you are looking for.
I dug into the the_workflow_fit1a object and saw that subsetting with .$fit$fit$fit yields the lm object that anova() needs.
So a solution along these lines can be considered:
models <- list(the_workflow_fit1, the_workflow_fit1a)
models2 <- lapply(models, function(x) x$fit$fit$fit)
anova(models2[[1]], models2[[2]])
Output:
  Res.Df    RSS    Df `Sum of Sq`     F `Pr(>F)`
   <dbl>  <dbl> <dbl>       <dbl> <dbl>    <dbl>
1    492 11079.    NA          NA NA         NA
2    494 11351.    -2       -272.  6.05   0.00254
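The tibble-style table above suggests the anova result was tidied; broom does this directly (a sketch, reusing the models2 list):
# broom::tidy() converts the anova() comparison into a tibble like the one above:
broom::tidy(anova(models2[[1]], models2[[2]]))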
I am having problems with tidymodels tuning, which gives me this warning and error:
warning: prediction from a rank-deficient fit may be misleading
Error: Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = ob...
Note 1: I am performing the tuning for both a normal CV fold and a spatial CV fold.
Note 2: I wanted to include the data, but Stack Overflow tells me: "Body is limited to 30000 characters; you entered 143552." I can send you the data if you wish!
Resampling the lm workflow:
lm_fit_spatcv <- fit_resamples(
lm_wf,
resamples = spatial_cv_fold,
control = model.control,
metrics = multi.metric)
Resampling the glm workflow:
glm_fit_spatcv <- fit_resamples(
glm_wf,
resamples = spatial_cv_fold,
control = model.control,
metrics = multi.metric)
I have looked into it a bit here, here, and here, but I still do not really understand which of my pre-processing steps might cause these issues.
I made a reprex:
# Loading packages
library(tidyverse)
library(parallelMap)
library(parallelly)
library(parallel)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(treesnip)
#> Error in library(treesnip): there is no package called 'treesnip'
library(kknn)
library(kernlab)
#>
#> Attaching package: 'kernlab'
#> The following object is masked from 'package:scales':
#>
#> alpha
#> The following object is masked from 'package:purrr':
#>
#> cross
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
library(ranger)
library(datapasta)
library(spatialsample)
library(stacks)
# DATA
# agrofor.biophys.modelling.data <- read.csv(file = here::here("DATA","agrofor.biophys.modelling.data.csv"))
# Creating sample data
# agrofor.biophys.modelling.data <- agrofor.biophys.modelling.data %>%
# dplyr::slice_sample(n = 100, replace = FALSE) %>%
# as_tibble()
# making a tibble::tribble dataset using dpaste() from the datapasta package
# datapasta::dpasta(agrofor.biophys.modelling.data)
# Here was a tibble::tribble dataset. I can send you the data if you wish!
# Removing observations with NAs from the data
ml.data.clean <- data.table::copy(agrofor.biophys.modelling.data) %>%
drop_na()
ml.data.clean.na.check <- ml.data.clean %>%
select(everything()) %>% # adjust to your needs
summarise_all(funs(sum(is.na(.))))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
# Checking for na
#sapply(ml.data.clean.na.check, function(x) sum(is.na(x)))
# STEP 1: Splitting data, defining resampling techniques and setting global model metrics
## Splitting data in training and testing sets
set.seed(234)
# Splitting data
af.split <- initial_split(ml.data.clean, prop = 0.80, strata = logRR)
#> Warning: The number of observations in each quantile is below the recommended
#> threshold of 20. Stratification will be done with 3 breaks instead.
af.train <- training(af.split)
af.test <- testing(af.split)
## Defining resampling techniques
# Re-sample technique(s)
boostrap_df <- bootstraps(af.train, times = 10, strata = logRR)
#> Warning: The number of observations in each quantile is below the recommended
#> threshold of 20. Stratification will be done with 2 breaks instead.
cv_fold <- vfold_cv(af.train, v = 10, repeats = 10)
spatial_cv_fold <- spatial_clustering_cv(af.train, coords = c("Longitude", "Latitude"), v = 20)
## Setting global metrics
# Metrics
multi.metric <- metric_set(rmse, rsq, ccc, mae)
model.control <- control_stack_grid()
# STEP 2: Model recipes - pre-processing steps
# Linear model - lm recipe
lm_recipe <-
recipe(formula = logRR ~ ., data = af.train) %>%
update_role(Site.Type, new_role = "predictor") %>%
update_role(Latitude,
Longitude,
Tree,
new_role = "sample ID") %>%
step_novel(Site.Type, -all_outcomes()) %>%
step_dummy(Site.Type, one_hot = TRUE, naming = partial(dummy_names,sep = "_")) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
step_nzv(all_numeric(), -all_outcomes()) %>%
step_corr(all_numeric_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
# Generalised linear model recipe
glm_recipe <-
recipe(formula = logRR ~ ., data = af.train) %>%
update_role(Site.Type, new_role = "predictor") %>%
update_role(Latitude,
Longitude,
Tree,
new_role = "sample ID") %>%
step_novel(Site.Type, -all_outcomes()) %>%
step_dummy(Site.Type, one_hot = TRUE, naming = partial(dummy_names,sep = "_")) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
step_nzv(all_numeric(), -all_outcomes()) %>%
step_corr(all_numeric_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors()) %>%
step_lincomb(all_numeric(), -all_outcomes())
# STEP 3: Setting model specifications
lm_model <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
glm_model <- linear_reg(
mode = "regression",
penalty = 0.1,
mixture = 0
) %>%
set_engine("glmnet")
# STEP 4: Defining model workflows
lm_wf <- workflow() %>%
add_model(lm_model) %>%
add_recipe(lm_recipe)
glm_wf <- workflow() %>%
add_model(glm_model) %>%
add_recipe(glm_recipe)
# STEP 5: Model (hyper)-parameter tuning
# Initializing parallel processing
parallelStartSocket(cpus = detectCores())
#> Starting parallelization in mode=socket with cpus=8.
##########################################################################
# Spatial k-fold cross validation
##########################################################################
lm_fit_spatcv <- fit_resamples(
lm_wf,
resamples = spatial_cv_fold,
control = model.control,
metrics = multi.metric)
#> ! Fold01: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold02: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
glm_fit_spatcv <- fit_resamples(
glm_wf,
resamples = spatial_cv_fold,
control = model.control,
metrics = multi.metric)
#> x Fold01: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y, we...
#> x Fold02: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y,
#> Warning: All models failed. See the `.notes` column.
##########################################################################
# Normal/random k-fold cross validation (CV-fold)
##########################################################################
lm_fit_cv <- fit_resamples(
lm_wf,
resamples = cv_fold,
control = model.control,
metrics = multi.metric)
#> ! Fold01, Repeat01: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold02, Repeat01: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold03, Repeat01: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
glm_fit_cv <- fit_resamples(
glm_wf,
resamples = cv_fold,
control = model.control,
metrics = multi.metric)
#> x Fold01, Repeat01: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y, we...
#> x Fold02, Repeat01: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y, we...
#> x Fold03, Repeat01: preprocessor 1/1, model 1/1: Error in elnet(xd, is.sparse, ix, jx, y, we...
#> Warning: All models failed. See the `.notes` column.
# Stopping parallel session
parallelStop()
#> Stopped parallelization. All cleaned up.
Created on 2021-09-03 by the reprex package (v2.0.1)
Solved! Apologies for the wasted time: I had forgotten to handle these nominal columns. Giving them an ID role, so they are no longer passed to the model as predictors, fixed the errors:
lm_recipe <-
recipe(formula = logRR ~ ., data = af.train) %>%
update_role(Site.Type, new_role = "predictor") %>%
update_role(PrName,
Out.SubInd,
Out.SubInd.Code,
Product,
Latitude,
Longitude,
Tree,
new_role = "sample ID") %>%
step_novel(Site.Type, -all_outcomes()) %>%
step_dummy(Site.Type, one_hot = TRUE, naming = partial(dummy_names,sep = "_")) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
step_nzv(all_numeric(), -all_outcomes()) %>%
step_corr(all_numeric_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
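For what it's worth, a less error-prone variant (a hedged sketch, assuming the same column names as above) is to select every remaining nominal predictor at once instead of listing columns by hand, so nothing slips through to glmnet un-encoded:
# Sketch only: selector-based recipe so every nominal predictor left after
# the role assignments gets dummy-encoded, whatever its name:
glm_recipe_alt <-
  recipe(formula = logRR ~ ., data = af.train) %>%
  update_role(PrName, Out.SubInd, Out.SubInd.Code, Product,
              Latitude, Longitude, Tree, new_role = "sample ID") %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())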
I'm struggling with how to obtain the AUC from a logistic regression model using tidymodels.
Here's an example using the built-in mpg dataset.
library(tidymodels)
library(tidyverse)
# Use mpg dataset
df <- mpg
# Create an indicator variable for class="suv"
df$is_suv <- as.factor(df$class == "suv")
# Create the split object
df_split <- initial_split(df, prop=1/2)
# Create the training and testing sets
df_train <- training(df_split)
df_test <- testing(df_split)
# Create workflow
rec <-
recipe(is_suv ~ cty + hwy + cyl, data=df_train)
glm_spec <-
logistic_reg() %>%
set_engine(engine = "glm")
glm_wflow <-
workflow() %>%
add_recipe(rec) %>%
add_model(glm_spec)
# Fit the model
model1 <- fit(glm_wflow, df_train)
# Attach predictions to training dataset
training_results <- bind_cols(df_train, predict(model1, df_train))
# Calculate accuracy
accuracy(training_results, truth = is_suv, estimate = .pred_class)
# Calculate AUC??
roc_auc(training_results, truth = is_suv, estimate = .pred_class)
The last line returns this error:
> roc_auc(training_results, truth = is_suv, estimate = .pred_class)
Error in metric_summarizer(metric_nm = "roc_auc", metric_fn = roc_auc_vec, :
formal argument "estimate" matched by multiple actual arguments
Since you are doing binary classification, roc_auc() is expecting a vector of class probabilities corresponding to the "relevant" class, not the predicted class.
You can get this using predict(model1, df_train, type = "prob"). Alternatively, if you are using workflows version 0.2.2 or newer, you can use augment() to get class predictions and probabilities without using bind_cols().
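As a minimal sketch of the two options (reusing model1 and df_train from your code):
# Option 1: bind class probabilities yourself
probs <- predict(model1, df_train, type = "prob") # gives .pred_FALSE / .pred_TRUE
training_results <- bind_cols(df_train, probs)
# Option 2: augment() adds class predictions and probabilities in one call
training_results <- augment(model1, df_train)
The full reprex: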
library(tidymodels)
library(tidyverse)
# Use mpg dataset
df <- mpg
# Create an indicator variable for class="suv"
df$is_suv <- as.factor(df$class == "suv")
# Create the split object
df_split <- initial_split(df, prop=1/2)
# Create the training and testing sets
df_train <- training(df_split)
df_test <- testing(df_split)
# Create workflow
rec <-
recipe(is_suv ~ cty + hwy + cyl, data=df_train)
glm_spec <-
logistic_reg() %>%
set_engine(engine = "glm")
glm_wflow <-
workflow() %>%
add_recipe(rec) %>%
add_model(glm_spec)
# Fit the model
model1 <- fit(glm_wflow, df_train)
# Attach predictions to training dataset
training_results <- augment(model1, df_train)
# Calculate accuracy
accuracy(training_results, truth = is_suv, estimate = .pred_class)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.795
# Calculate AUC
roc_auc(training_results, truth = is_suv, estimate = .pred_FALSE)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 roc_auc binary 0.879
Created on 2021-04-12 by the reprex package (v1.0.0)
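One caveat worth knowing: yardstick treats the first factor level as the event by default, which is why .pred_FALSE is the right column above (FALSE sorts before TRUE). If you want the TRUE class scored as the event instead, recent yardstick versions let you flip this (a sketch, same training_results as above):
# Score the second factor level (TRUE) as the event of interest:
roc_auc(training_results, truth = is_suv, .pred_TRUE, event_level = "second")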
I have a binary classification problem and used a random forest and a logistic regression.
From the results of conf_mat(), collect_metrics(), and collect_predictions(), I want to change my models to classify an observation as TRUE only if the model is "sure", say with a probability of 75% or even higher. I just don't know where to specify this change. It would be amazing if someone could give me a hint. My intuition tells me it should be somewhere in the model specification, e.g. somewhere here, but maybe I am wrong.
canc_rf_model <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = 500) %>%
set_engine("ranger") %>%
set_mode("classification")
canc_log_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Thank you very much in advance!
M.
The hard class predictions come from the underlying ranger::predictions() function, not from a tidymodels function, so there's not much to be done in the fitting itself.
However, you can pretty fluently change this if you like after fitting. Let's make an example classification model:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data("ad_data")
alz <- ad_data
# data splitting
set.seed(100)
alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_train <- training(alz_split)
alz_test <- testing(alz_split)
# data resampling
set.seed(100)
alz_folds <-
vfold_cv(alz_train, v = 10, strata = Class)
rf_mod <-
rand_forest(trees = 1e3) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_wf <-
workflow() %>%
add_formula(Class ~ .) %>%
add_model(rf_mod)
set.seed(100)
rf_preds <- rf_wf %>%
fit_resamples(
resamples = alz_folds,
control = control_resamples(save_pred = TRUE)) %>%
collect_predictions()
Here is the default confusion matrix:
rf_preds %>%
conf_mat(Class, .pred_class)
#> Truth
#> Prediction Impaired Control
#> Impaired 37 5
#> Control 45 213
You can use the probably package to post-process your class probability estimates and just overwrite the default values:
library(probably)
#>
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#>
#> as.factor, as.ordered
rf_preds %>%
mutate(.pred_class = make_two_class_pred(.pred_Impaired,
levels(rf_preds$Class),
threshold = 0.75),
.pred_class = factor(.pred_class, levels = levels(rf_preds$Class))) %>%
conf_mat(Class, .pred_class)
#> Truth
#> Prediction Impaired Control
#> Impaired 0 0
#> Control 82 218
Created on 2021-03-23 by the reprex package (v1.0.0)
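If 0.75 turns out to be too aggressive (as the all-Control matrix above suggests), probably also provides threshold_perf(), which sweeps candidate thresholds and computes performance at each one; a sketch using the same rf_preds:
# Evaluate sensitivity, specificity, and J-index over a grid of thresholds
# to find a cutoff less drastic than 0.75:
rf_preds %>%
  threshold_perf(Class, .pred_Impaired, thresholds = seq(0.05, 0.95, by = 0.05))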