I have fitted a regression model to text data using the LiblineaR engine, and I want to `tidy()` my results. I have also installed the dev version of `broom`, but I always get an error: `ERROR: No tidy method for objects of class LiblineaR`
> svm_fit %>%
+ pull_workflow_fit() %>%
+ tidy()
ERROR: No tidy method for objects of class LiblineaR
We just merged support for the tidy() method for parsnip models fitted with the LiblineaR engine, so if you install parsnip from GitHub, you should be able to use this feature now:
devtools::install_github("tidymodels/parsnip")
Here is a demo of how it works:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(two_class_dat, package = "modeldata")
example_split <- initial_split(two_class_dat, prop = 0.99)
example_train <- training(example_split)
example_test <- testing(example_split)
rec <- recipe(Class ~ ., data = example_train) %>%
step_normalize(all_numeric_predictors())
spec1 <- svm_linear() %>%
set_engine("LiblineaR") %>%
set_mode("classification")
spec2 <- logistic_reg(penalty = 0.1, mixture = 1) %>%
set_engine("LiblineaR") %>%
set_mode("classification")
wf <- workflow() %>%
add_recipe(rec)
wf %>%
add_model(spec1) %>%
fit(example_train) %>%
tidy()
#> # A tibble: 3 x 2
#> term estimate
#> <chr> <dbl>
#> 1 A 0.361
#> 2 B -0.966
#> 3 Bias 0.113
wf %>%
add_model(spec2) %>%
fit(example_train) %>%
tidy()
#> # A tibble: 3 x 2
#> term estimate
#> <chr> <dbl>
#> 1 A 1.06
#> 2 B -2.76
#> 3 Bias 0.329
svm_linear() %>%
set_engine("LiblineaR") %>%
set_mode("regression") %>%
fit(mpg ~ ., data = mtcars) %>%
tidy()
#> # A tibble: 11 x 2
#> term estimate
#> <chr> <dbl>
#> 1 cyl 0.141
#> 2 disp -0.0380
#> 3 hp 0.0415
#> 4 drat 0.226
#> 5 wt 0.0757
#> 6 qsec 1.06
#> 7 vs 0.0648
#> 8 am 0.0479
#> 9 gear 0.219
#> 10 carb 0.00861
#> 11 Bias 0.0525
Created on 2021-04-22 by the reprex package (v2.0.0)
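With the updated parsnip installed, your original pipeline of pulling the parsnip fit out of the workflow and then tidying it should work as well (svm_fit below stands in for your fitted workflow object):
svm_fit %>%
  pull_workflow_fit() %>%   # superseded by extract_fit_parsnip() in newer workflows releases
  tidy()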
I'm a bit confused about how I should interpret the coefficients from the elastic net model that I'm getting through tidymodels and glmnet. Ideally, I'd like to produce unscaled coefficients for maximum interpretability.
My issue is that I'm honestly not sure how to unscale the coefficients that the model is yielding because I can't quite figure out what's being done in the first place.
It's a bit tricky for me to post the data one would need to reproduce my results, but here's my code:
library(tidymodels)
library(tidyverse)
# preps data for model
myrecipe <- mydata %>%
recipe(transactionrevenue ~ sessions + channelgrouping + month + new_user_pct + is_weekend) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(month, channelgrouping, one_hot = TRUE) %>%
step_zv(all_predictors()) %>%
step_normalize(sessions, new_user_pct) %>%
step_interact(terms = ~ sessions:starts_with("channelgrouping") + new_user_pct:starts_with("channelgrouping"))
# creates the model
mymodel <- linear_reg(penalty = 10, mixture = 0.2) %>%
set_engine("glmnet", standardize = FALSE)
wf <- workflow() %>%
add_recipe(myrecipe)
model_fit <- wf %>%
add_model(mymodel) %>%
fit(data = mydata)
# posts coefficients
tidy(model_fit)
If it would help, here's some information that might be useful:
The variable that I'm really focusing on is "sessions."
In the model, the coefficient for sessions is 2543.094882, and the intercept is 1963.369782. The penalty is also 10.
The unscaled mean for sessions is 725.2884 and the standard deviation is 1035.381.
I just can't seem to figure out what units the coefficients are in and how/if it's even possible to unscale the coefficients back to the original units.
Any insight would be very much appreciated.
You can use tidy() on a lot of different components of a workflow. The default is to tidy() the model, but you can also get out the recipe and even individual recipe steps. That is where the information it sounds like you are interested in lives.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(bivariate)
biv_rec <-
recipe(Class ~ ., data = bivariate_train) %>%
step_BoxCox(all_predictors())%>%
step_normalize(all_predictors())
svm_spec <- svm_linear(mode = "classification")
biv_fit <- workflow(biv_rec, svm_spec) %>% fit(bivariate_train)
## tidy the *model*
tidy(biv_fit)
#> # A tibble: 3 × 2
#> term estimate
#> <chr> <dbl>
#> 1 A -1.15
#> 2 B 1.17
#> 3 Bias 0.328
## tidy the *recipe*
extract_recipe(biv_fit) %>%
tidy()
#> # A tibble: 2 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step BoxCox TRUE FALSE BoxCox_ZRpI2
#> 2 2 step normalize TRUE FALSE normalize_DGmtN
## tidy the *recipe step*
extract_recipe(biv_fit) %>%
tidy(number = 1)
#> # A tibble: 2 × 3
#> terms value id
#> <chr> <dbl> <chr>
#> 1 A -0.857 BoxCox_ZRpI2
#> 2 B -1.09 BoxCox_ZRpI2
## tidy the other *recipe step*
extract_recipe(biv_fit) %>%
tidy(number = 2)
#> # A tibble: 4 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 A mean 1.16 normalize_DGmtN
#> 2 B mean 0.909 normalize_DGmtN
#> 3 A sd 0.00105 normalize_DGmtN
#> 4 B sd 0.00260 normalize_DGmtN
Created on 2021-08-05 by the reprex package (v2.0.0)
You can read more about tidying a recipe here.
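To connect this back to the original units question: step_normalize() centers and scales each predictor using statistics learned on the training data, and those statistics are exactly what tidy()-ing the trained step returns. Here is a minimal sketch of back-transforming the sessions coefficient, assuming model_fit is the fitted workflow from the question and that step_normalize() is the recipe's 4th step (adjust `number` to match your recipe):
## Pull the mean/sd that step_normalize() learned during prep()
norm_stats <- extract_recipe(model_fit) %>%
  tidy(number = 4) %>%
  filter(terms == "sessions")

sessions_sd <- norm_stats %>%
  filter(statistic == "sd") %>%
  pull(value)

## A coefficient on a centered/scaled predictor is "change in outcome per 1 SD",
## so dividing by the training SD gives "change in outcome per original unit".
tidy(model_fit) %>%
  filter(term == "sessions") %>%
  mutate(estimate_per_session = estimate / sessions_sd)
Note that because the recipe also creates interaction terms involving sessions, the main-effect coefficient alone does not capture the full effect of sessions.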
(Updated at the end based on Julia's reply. TL;DR: This seems to be an issue with the underlying kknn package, instead of with tidymodels)
I'm doing some k-nearest neighbours regression models with tidymodels. This is through the nearest_neighbor() function. I want to see what the difference is between the results with and without normalization of the features.
Now set_engine("kknn") uses the kknn::train.kknn() function under the hood, which has a normalization argument scale = TRUE. I want to compare models with scale = FALSE to scale = TRUE (actually, I want to do that in a recipe, but that is not possible, as I'll explain below).
But it does not seem as if I am able to reliably set scale = FALSE through tidymodels. Below is a reprex showing what I see.
My questions, in short: am I doing something wrong, or is this a bug? If it is a bug, is it known and can I read about it somewhere? I'd be very grateful if someone can shed light on this.
Set up for the reprex
Here I'll use mtcars:
library(tidymodels)
data("mtcars")
A train-test split is:
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)
Here is a common recipe I'll use:
mtcars_recipe <- recipe(mpg ~ disp + wt, data = mtcars)
Here is model 1 (called knn_FALSE) where scale = FALSE:
knn_FALSE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
Here is model 2 (called knn_TRUE) where scale = TRUE:
knn_TRUE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
I bundle these two models into two workflows:
## Workflow with scale = FALSE
wf_FALSE <- workflow() %>%
add_model(knn_FALSE) %>%
add_recipe(mtcars_recipe)
## Workflow with scale = TRUE
wf_TRUE <- workflow() %>%
add_model(knn_TRUE) %>%
add_recipe(mtcars_recipe)
Using fit(), it is possible to have scale = FALSE
It does seem to be possible to have one version with scale = TRUE and one with scale = FALSE when using fit() on a workflow.
For example, for scale = TRUE I get:
wf_TRUE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~TRUE)
Type of response variable: continuous
minimal mean absolute error: 2.09425
Minimal mean squared error: 7.219114
Best kernel: optimal
Best k: 5
Whereas for scale = FALSE I have:
wf_FALSE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~FALSE)
Type of response variable: continuous
minimal mean absolute error: 2.1665
Minimal mean squared error: 6.538769
Best kernel: optimal
Best k: 5
The results are clearly different, which comes from the difference in the scale parameter.
But the plot thickens.
No difference with last_fit()
When using last_fit(), however, the results for scale = TRUE and scale = FALSE are identical.
For scale = TRUE:
wf_TRUE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
Whereas for scale = FALSE:
wf_FALSE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
These are clearly --- and unexpectedly --- the same.
There is also no difference when tuning using tune_grid()
If I do tuning with tune_grid() and a validation_split(), there is also no difference between the results for scale = TRUE and scale = FALSE.
Here is the code for that:
## Tune grid
knn_grid <- tibble(neighbors = c(5, 15))
## Tune Model 1: kNN regression with no scaling in train.kknn
knn_FALSE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
## Model 2: kNN regression with scaling in train.kknn
knn_TRUE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
## Workflow with scale = FALSE
wf_FALSE_tune <- workflow() %>%
add_model(knn_FALSE_tune) %>%
add_recipe(mtcars_recipe)
## Workflow with scale = TRUE
wf_TRUE_tune <- workflow() %>%
add_model(knn_TRUE_tune) %>%
add_recipe(mtcars_recipe)
## Validation split
mtcars_val <- validation_split(mtcars)
## Tune results: Without scaling
wf_FALSE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
## Tune results: With scaling
wf_TRUE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
The result when scale = FALSE:
> wf_FALSE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
The results when scale = TRUE:
> wf_TRUE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
Question
Am I misunderstanding (or missing my own bug), or are the last_fit() and tune_grid() functions not respecting my choice for scale?
I'm new to tidymodels, so I might have missed something. Answers much appreciated.
I was hoping to use step_normalize() in a recipe to do the normalization myself, but since I cannot reliably set scale = FALSE in the underlying engine, I have not been able to experiment with that.
Update after Julia's reply
As Julia shows, train.kknn() gives the same predictions for scale = FALSE and scale = TRUE, so this isn't a tidymodels issue. Rather, the kknn:::predict.train.kknn() function does not respect all of the parameters passed to train.kknn() when predicting.
Consider the following output which uses kknn() instead of train.kknn():
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
test = testing(mtcars_split), k = 5, scale = FALSE) %>%
predict(newdata = testing(mtcars_split))
## [1] 21.276 21.276 16.860 16.276 21.276 16.404 29.680 15.700 16.020
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
test = testing(mtcars_split), k = 5, scale = TRUE) %>%
predict(newdata = testing(mtcars_split))
## [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
These are different, as they should be. The problem is that kknn:::predict.train.kknn() calls kknn() without passing along scale (and some other optional arguments):
function (object, newdata, ...)
{
if (missing(newdata))
return(predict(object, ...))
res <- kknn(formula(terms(object)), object$data, newdata,
k = object$best.parameters$k, kernel = object$best.parameters$kernel,
distance = object$distance)
return(predict(res, ...))
}
<bytecode: 0x55e2304fba10>
<environment: namespace:kknn>
I think you don't have a bug or problem but are just misunderstanding what last_fit() and friends are predicting on to estimate performance.
library(tidymodels)
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)
knn_FALSE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
knn_FALSE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#>
#> Main Arguments:
#> neighbors = 5
#>
#> Engine-Specific Arguments:
#> scale = FALSE
#>
#> Computational engine: kknn
#>
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
#> ks = min_rows(5, data, 5), scale = FALSE)
knn_TRUE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
knn_TRUE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#>
#> Main Arguments:
#> neighbors = 5
#>
#> Engine-Specific Arguments:
#> scale = TRUE
#>
#> Computational engine: kknn
#>
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
#> ks = min_rows(5, data, 5), scale = TRUE)
Notice that both parsnip models are correctly passing the scale parameter to the underlying engine.
We can now add these two parsnip models to a workflow(), with a formula preprocessor (a recipe would be fine too).
wf_FALSE <- workflow() %>%
add_model(knn_FALSE) %>%
add_formula(mpg ~ disp + wt)
## Workflow with scale = TRUE
wf_TRUE <- workflow() %>%
add_model(knn_TRUE) %>%
add_formula(mpg ~ disp + wt)
The function last_fit() fits on the training data and predicts on the testing data. We can do that manually with our workflows. Importantly, notice that for these examples in the testing set, the predictions are the same, so the metrics you would get are the same.
wf_TRUE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
wf_FALSE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
The same thing is true for fitting the models directly:
knn_TRUE %>%
fit(mpg ~ disp + wt, data = training(mtcars_split)) %>%
predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
knn_FALSE %>%
fit(mpg ~ disp + wt, data = training(mtcars_split)) %>%
predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
And in fact the same is true if we fit the underlying kknn model directly:
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split),
ks = 5, scale = FALSE) %>%
predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split),
ks = 5, scale = TRUE) %>%
predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
Created on 2020-11-12 by the reprex package (v0.3.0.9001)
The scale parameter is correctly being passed to the underlying engine; it just doesn't change the prediction for these test cases.
Grouped regression runs well in model1 using do(). But do() is now superseded, and the suggestion is to use across(), yet no example of that is given in the help file. Model2 comes from the do() help file and runs fine without map() or across(); I don't understand how the regression loops over the groups without map(). When I try using map() in model3, I get an error. Model4 comes from Hadley's book, R for Data Science, uses split(), and works well. How do I tell map() to iterate over the list-column "data"? Any suggestions?
library(purrr)
#> Warning: package 'purrr' was built under R version 3.6.3
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.6.3
#> Warning: package 'ggplot2' was built under R version 3.6.3
#> Warning: package 'tidyr' was built under R version 3.6.3
#> Warning: package 'dplyr' was built under R version 3.6.3
#> Warning: package 'stringr' was built under R version 3.6.3
#> Warning: package 'forcats' was built under R version 3.6.3
model1 = mtcars %>%
group_by(cyl) %>%
do(mod = lm(mpg ~ disp, data = .))
model1
#> # A tibble: 3 x 2
#> # Rowwise:
#> cyl mod
#> <dbl> <list>
#> 1 4 <lm>
#> 2 6 <lm>
#> 3 8 <lm>
## from "do" help file
model2 = mtcars %>%
nest_by(cyl) %>%
mutate(mod = list(lm(mpg ~ disp, data = data)))
model2
#> # A tibble: 3 x 3
#> # Rowwise: cyl
#> cyl data mod
#> <dbl> <list<tbl_df[,10]>> <list>
#> 1 4 [11 x 10] <lm>
#> 2 6 [7 x 10] <lm>
#> 3 8 [14 x 10] <lm>
## using map
model3 = mtcars %>% nest_by(cyl) %>%
mutate(fit = map(data, ~lm(mpg ~ disp, data = .)))
#> Error: Problem with `mutate()` input `fit`.
#> x numeric 'envir' arg not of length one
#> i Input `fit` is `map(data, ~lm(mpg ~ disp, data = .))`.
#> i The error occured in row 1.
##model4
model4 = mtcars %>%
split(.$cyl) %>%
map(~lm(mpg ~ disp, data = .))
model4
#> $`4`
#>
#> Call:
#> lm(formula = mpg ~ disp, data = .)
#>
#> Coefficients:
#> (Intercept) disp
#> 40.8720 -0.1351
#>
#>
#> $`6`
#>
#> Call:
#> lm(formula = mpg ~ disp, data = .)
#>
#> Coefficients:
#> (Intercept) disp
#> 19.081987 0.003605
#>
#>
#> $`8`
#>
#> Call:
#> lm(formula = mpg ~ disp, data = .)
#>
#> Coefficients:
#> (Intercept) disp
#> 22.03280 -0.01963
Created on 2020-08-02 by the reprex package (v0.3.0)
It is an issue with the rowwise attribute that nest_by() creates: inside a rowwise mutate(), data refers to the single nested tibble for that row, so map(data, ...) iterates over that tibble's columns instead of over the list-column. We can ungroup() to remove the rowwise grouping, and then map() works as expected:
library(dplyr)
library(purrr)
mtcars %>%
nest_by(cyl) %>% # // creates the rowwise attribute
ungroup %>% # // remove the rowwise
mutate(fit = map(data, ~lm(mpg ~ disp, data = .)))
# A tibble: 3 x 3
# cyl data fit
# <dbl> <list<tbl_df[,10]>> <list>
#1 4 [11 × 10] <lm>
#2 6 [7 × 10] <lm>
#3 8 [14 × 10] <lm>
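Once the models live in a list-column, you can keep using map() on that column, for instance to get tidy coefficients per group. A small sketch, assuming broom and tidyr are also available:
library(broom)
library(tidyr)

mtcars %>%
  nest_by(cyl) %>%
  ungroup() %>%                                    # remove the rowwise grouping
  mutate(fit   = map(data, ~ lm(mpg ~ disp, data = .x)),
         coefs = map(fit, tidy)) %>%
  select(cyl, coefs) %>%
  unnest(coefs)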
I am working with the uscrime dataset, but this question applies to any well-known dataset like cars.
After some googling, I found it extremely useful to standardize my data, considering that PCA finds new directions based on the covariance matrix of the original variables, and the covariance matrix is sensitive to the scaling of those variables.
Nevertheless, I also found the claim that "it is not necessary to standardize the variables if all the variables are on the same scale."
To standardize the variables I am using:
z_uscrime <- (uscrime - mean(uscrime)) / sd(uscrime)
Before standardizing my data, how can I check whether all the variables are on the same scale?
Proving my point that you can standardize your data however many times you want
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
simple_recipe <- recipe(mpg ~ .,data = mtcars) %>%
step_center(everything()) %>%
step_scale(everything())
mtcars2 <- simple_recipe %>%
prep() %>%
juice()
simple_recipe2 <- recipe(mpg ~ .,data = mtcars2) %>%
step_center(everything()) %>%
step_scale(everything())
mtcars3 <- simple_recipe2 %>%
prep() %>%
juice()
all.equal(mtcars2,mtcars3)
#> [1] TRUE
mtcars2 %>%
summarise(across(everything(),.fns = list(mean = ~ mean(.x),sd = ~sd(.x)))) %>%
pivot_longer(everything(),names_pattern = "(.*)_(.*)",names_to = c("stat", ".value"))
#> # A tibble: 11 x 3
#> stat mean sd
#> <chr> <dbl> <dbl>
#> 1 cyl -1.47e-17 1
#> 2 disp -9.08e-17 1
#> 3 hp 1.04e-17 1
#> 4 drat -2.92e-16 1
#> 5 wt 4.68e-17 1.00
#> 6 qsec 5.30e-16 1
#> 7 vs 6.94e-18 1.00
#> 8 am 4.51e-17 1
#> 9 gear -3.47e-18 1.00
#> 10 carb 3.17e-17 1.00
#> 11 mpg 7.11e-17 1
mtcars3 %>%
summarise(across(everything(),.fns = list(mean = ~ mean(.x),sd = ~sd(.x)))) %>%
pivot_longer(everything(),names_pattern = "(.*)_(.*)",names_to = c("stat", ".value"))
#> # A tibble: 11 x 3
#> stat mean sd
#> <chr> <dbl> <dbl>
#> 1 cyl -1.17e-17 1
#> 2 disp -1.95e-17 1
#> 3 hp 9.54e-18 1
#> 4 drat 1.17e-17 1
#> 5 wt 3.26e-17 1
#> 6 qsec 1.37e-17 1
#> 7 vs 4.16e-17 1
#> 8 am 4.51e-17 1
#> 9 gear 0. 1
#> 10 carb 2.60e-18 1
#> 11 mpg 4.77e-18 1
Created on 2020-06-07 by the reprex package (v0.3.0)
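As for the original question of checking whether the variables are on a similar scale in the first place, a quick way is to look at the per-column spread of the raw data before any preprocessing. A sketch, using mtcars in place of uscrime:
library(tidyverse)

## If the sd (or range) differs by orders of magnitude across columns,
## the variables are not on the same scale and standardizing before PCA is advisable.
mtcars %>%
  summarise(across(everything(),
                   .fns = list(mean = ~ mean(.x), sd = ~ sd(.x)))) %>%
  pivot_longer(everything(),
               names_pattern = "(.*)_(.*)",
               names_to = c("variable", ".value"))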
This is the same issue as Predict with step_naomit and retain ID using tidymodels, but even though there is an accepted answer, the OP's last comment notes that the problem remains: the "id variable" is being used as a predictor, as can be seen by looking at model$fit$variable.importance.
I have a dataset with "id variables" I would like to keep.
I thought I would be able to achieve this with a recipe() specification.
library(tidymodels)
# label is an identifier variable I want to keep even though it's not
# a predictor
df <- tibble(label = 1:50,
x = rnorm(50, 0, 5),
f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )
df_split <- initial_split(df, prop = 0.70)
# Make up any recipe: just note I specify 'label' as "id variable"
rec <- recipe(training(df_split)) %>%
update_role(label, new_role = "id variable") %>%
update_role(y, new_role = "outcome") %>%
update_role(x, new_role = "predictor") %>%
update_role(f, new_role = "predictor") %>%
step_corr(all_numeric(), -all_outcomes()) %>%
step_dummy(all_predictors(),-all_numeric()) %>%
step_meanimpute(all_numeric(), -all_outcomes())
train_juiced <- prep(rec, training(df_split)) %>% juice()
logit_fit <- logistic_reg(mode = "classification") %>%
set_engine(engine = "glm") %>%
fit(y ~ ., data = train_juiced)
# Why is label a variable in the model ?
logit_fit[['fit']][['coefficients']]
#> (Intercept) label x f_b f_c
#> 1.03664140 -0.01405316 0.22357266 -1.80701531 -1.66285399
Created on 2020-01-27 by the reprex package (v0.3.0)
But even though I specified that label is an id variable, it is being used as a predictor.
So maybe I can list just the terms I want in the recipe formula and then add label as an id variable separately.
rec <- recipe(training(df_split), y ~ x + f) %>%
update_role(label, new_role = "id variable") %>%
step_corr(all_numeric(), -all_outcomes()) %>%
step_dummy(all_predictors(),-all_numeric()) %>%
step_meanimpute(all_numeric(), -all_outcomes())
#> Error in .f(.x[[i]], ...): object 'label' not found
Created on 2020-01-27 by the reprex package (v0.3.0)
I can try not mentioning label
rec <- recipe(training(df_split), y ~ x + f) %>%
step_corr(all_numeric(), -all_outcomes()) %>%
step_dummy(all_predictors(),-all_numeric()) %>%
step_meanimpute(all_numeric(), -all_outcomes())
train_juiced <- prep(rec, training(df_split)) %>% juice()
logit_fit <- logistic_reg(mode = "classification") %>%
set_engine(engine = "glm") %>%
fit(y ~ ., data = train_juiced)
# Why is label a variable in the model ?
logit_fit[['fit']][['coefficients']]
#> (Intercept) x f_b f_c
#> -0.98950228 0.03734093 0.98945339 1.27014824
train_juiced
#> # A tibble: 35 x 4
#> x y f_b f_c
#> <dbl> <fct> <dbl> <dbl>
#> 1 -0.928 Y 1 0
#> 2 4.54 N 0 0
#> 3 -1.14 N 1 0
#> 4 -5.19 N 1 0
#> 5 -4.79 N 0 0
#> 6 -6.00 N 0 0
#> 7 3.83 N 0 1
#> 8 -8.66 Y 1 0
#> 9 -0.0849 Y 1 0
#> 10 -3.57 Y 0 1
#> # ... with 25 more rows
Created on 2020-01-27 by the reprex package (v0.3.0)
OK, so the model works, but I have lost my label.
How should I do this?
The main issue/conceptual problem you are running into is that once you juice() the recipe, it is just data, i.e. just literally a dataframe. When you use that to fit a model, there's no way for the model to know that some of the variables had special roles.
library(tidymodels)
# label is an identifier variable to keep even though it's not a predictor
df <- tibble(label = 1:50,
x = rnorm(50, 0, 5),
f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )
df_split <- initial_split(df, prop = 0.70)
rec <- recipe(y ~ ., training(df_split)) %>%
update_role(label, new_role = "id variable") %>%
step_corr(all_numeric(), -all_outcomes()) %>%
step_dummy(all_predictors(),-all_numeric()) %>%
step_meanimpute(all_numeric(), -all_outcomes()) %>%
prep()
train_juiced <- juice(rec)
train_juiced
#> # A tibble: 35 x 5
#> label x y f_b f_c
#> <int> <dbl> <fct> <dbl> <dbl>
#> 1 1 1.80 N 1 0
#> 2 3 1.45 N 0 0
#> 3 5 -5.00 N 0 0
#> 4 6 -4.15 N 1 0
#> 5 7 1.37 Y 0 1
#> 6 8 1.62 Y 0 1
#> 7 10 -1.77 Y 1 0
#> 8 11 -3.15 N 0 1
#> 9 12 -2.02 Y 0 1
#> 10 13 2.65 Y 0 1
#> # … with 25 more rows
Notice that train_juiced is just literally a regular tibble. If you train a model on this tibble using fit(), it won't know anything about the recipe used to transform the data.
The tidymodels framework does have a way to train models using the role information from the recipe. Probably the easiest way to do that is using workflows.
logit_spec <- logistic_reg(mode = "classification") %>%
set_engine(engine = "glm")
wf <- workflow() %>%
add_model(logit_spec) %>%
add_recipe(rec)
logit_fit <- fit(wf, training(df_split))
# No more label in the model
logit_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#>
#> ● step_corr()
#> ● step_dummy()
#> ● step_meanimpute()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>
#> Call: stats::glm(formula = formula, family = stats::binomial, data = data)
#>
#> Coefficients:
#> (Intercept) x f_b f_c
#> 0.42331 -0.04234 -0.04991 0.64728
#>
#> Degrees of Freedom: 34 Total (i.e. Null); 31 Residual
#> Null Deviance: 45
#> Residual Deviance: 44.41 AIC: 52.41
Created on 2020-02-15 by the reprex package (v0.3.0)
No more labels in the model!
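Because the workflow carries the recipe's role information, prediction also works on new data that still contains label, so you can bind the identifier back onto the predictions. A quick sketch using the test set from the split above:
## Predict with the trained workflow and keep the id column alongside the predictions
predict(logit_fit, new_data = testing(df_split)) %>%
  bind_cols(testing(df_split) %>% select(label, y))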