I am trying to fit many survival models using tidymodels, workflows, and purrr.
I can get this approach to work for other models, e.g., linear regression, but not survival.
I have loaded the survival extension to parsnip.
Here is code to:
1. Generate a small dataset.
2. Demonstrate that the usual coxph() fit works fine.
3. Run a linear regression with tidymodels and workflows, which works fine (models1).
However, models2 and models3 result in this error message:
Error in `fit_xy()`:
! Models for censored regression must use the formula interface.
This message does not help me much. I suppose it has something to do with the in-line Surv() object, but I haven't been able to see how to construct it another way.
I realize that parsnip support for survival models may still be a work in progress, but it is unclear to me whether this has been implemented and should work or not.
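One quick sanity check, assuming the censored extension is attached, is to ask parsnip which engines and modes are registered for the model type:
library(censored)
# lists the registered engine/mode combinations for proportional_hazards();
# a "censored regression" mode should appear once censored is loaded
parsnip::show_engines("proportional_hazards")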
Appreciate any help,
Thanks.
library(tidyverse)
library(tidymodels)
library(survival)
library(censored)
set.seed(1973)
df <- tibble(id = seq(1:1000))
df <- df %>% mutate( survtime = floor(100*rgamma(n=1000, shape =1 , rate=1)) ,
fail = runif(n=1000) >0.33 ,
a1 = runif(n=1000) >0.1,
a2 = runif(n=1000) >0.5)
head(df)
cox1 <- coxph(data = df, Surv(survtime, event=fail) ~a1)
summary(cox1)
cox2 <- coxph(data = df, Surv(survtime, event=fail) ~a2)
summary(cox2)
A <- c("a1", "a2")
models1 <- map(A,
~workflow() %>%
add_model(linear_reg()) %>%
add_formula(reformulate(.x, response = 'survtime')) %>%
fit(df)
)
models1
#not working
models2 <- map(A,
~workflow() %>%
add_model(proportional_hazards()) %>%
add_formula(reformulate(.x, response = 'Surv(survtime, event=fail)')) %>%
fit(df)
)
#not working
models3 <- map(A,
~workflow() %>%
add_model(proportional_hazards()) %>%
add_formula(as.formula(paste0('Surv(survtime, event=fail) ~ ', .x))) %>%
fit(df)
)
models3
EDITED to change variables X, x1, and x2 to A, a1, and a2 for clarity.
The error message is the same:
Error in `fit_xy()`:
! Models for censored regression must use the formula interface.
However, the problem may be the version of R or of the packages.
The error happens with R 4.1.2 with
survival 3.4-0
parsnip 1.0.1
but the code works on another system with
R 4.2.2
parsnip 1.0.3
survival 3.3-1
Unfortunately, I am at my sysadmin's mercy for updates, so I did not check the versions and cannot easily troubleshoot.
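For reference, the versions in play can be reported straight from each session, e.g.:
# report R and package versions for comparison across systems
R.version.string
packageVersion("parsnip")
packageVersion("survival")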
Uppercase "X" instead of lower case "x" is the issue.
Your not the first to do this (I stared at it for a few minutes to see it) so don't feel bad.
However... this example is exactly why we have the reprex package. That would have told you the issue right away since it starts with a fresh session (and would show us the error message)
Knowing's half the battle :-)
library(tidyverse)
library(tidymodels)
library(survival)
library(censored)
set.seed(1973)
df <- tibble(id = seq(1:1000))
df <- df %>% mutate( survtime = floor(100*rgamma(n=1000, shape =1 , rate=1)) ,
fail = runif(n=1000) >0.33 ,
x1 = runif(n=1000) >0.1,
x2 = runif(n=1000) >0.5)
head(df)
#> # A tibble: 6 × 5
#> id survtime fail x1 x2
#> <int> <dbl> <lgl> <lgl> <lgl>
#> 1 1 27 FALSE TRUE FALSE
#> 2 2 77 FALSE TRUE TRUE
#> 3 3 180 FALSE TRUE FALSE
#> 4 4 127 FALSE TRUE TRUE
#> 5 5 69 TRUE TRUE TRUE
#> 6 6 20 FALSE TRUE TRUE
cox1 <- coxph(data = df, Surv(survtime, event=fail) ~x1)
summary(cox1)
#> Call:
#> coxph(formula = Surv(survtime, event = fail) ~ x1, data = df)
#>
#> n= 1000, number of events= 692
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> x1TRUE 0.2854 1.3303 0.1250 2.284 0.0224 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> x1TRUE 1.33 0.7517 1.041 1.699
#>
#> Concordance= 0.509 (se = 0.007 )
#> Likelihood ratio test= 5.62 on 1 df, p=0.02
#> Wald test = 5.22 on 1 df, p=0.02
#> Score (logrank) test = 5.25 on 1 df, p=0.02
cox2 <- coxph(data = df, Surv(survtime, event=fail) ~x2)
summary(cox2)
#> Call:
#> coxph(formula = Surv(survtime, event = fail) ~ x2, data = df)
#>
#> n= 1000, number of events= 692
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> x2TRUE 0.13210 1.14122 0.07655 1.726 0.0844 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> x2TRUE 1.141 0.8763 0.9822 1.326
#>
#> Concordance= 0.513 (se = 0.011 )
#> Likelihood ratio test= 2.98 on 1 df, p=0.08
#> Wald test = 2.98 on 1 df, p=0.08
#> Score (logrank) test = 2.98 on 1 df, p=0.08
X <- c("x1", "x2")
models1 <- map(X,
~workflow() %>%
add_model(linear_reg()) %>%
add_formula(reformulate(.x, response = 'survtime')) %>%
fit(df)
)
models1
#> [[1]]
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> survtime ~ x1
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) x1TRUE
#> 121.24 -23.58
#>
#>
#> [[2]]
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> survtime ~ x2
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) x2TRUE
#> 104.055 -7.706
#not working
models2 <- map(x,
~workflow() %>%
add_model(proportional_hazards()) %>%
add_formula(reformulate(.x, response = 'Surv(survtime, event=fail)')) %>%
fit(df)
)
#> Error in vctrs_vec_compat(.x, .purrr_user_env): object 'x' not found
#not working
models3 <- map(x,
~workflow() %>%
add_model(proportional_hazards()) %>%
add_formula(as.formula(paste0('Surv(survtime, event=fail) ~ ', .x))) %>%
fit(df)
)
#> Error in vctrs_vec_compat(.x, .purrr_user_env): object 'x' not found
models3
#> Error in eval(expr, envir, enclos): object 'models3' not found
Created on 2023-01-06 with reprex v2.0.2
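For completeness, here is a sketch of the corrected call. The only change is mapping over the X vector that was actually defined; this assumes a recent parsnip/censored, where the in-line Surv() formula in a workflow is supported.
# map over X (not x); otherwise identical to models3 above
models3 <- map(X,
  ~workflow() %>%
    add_model(proportional_hazards()) %>%
    add_formula(as.formula(paste0('Surv(survtime, event=fail) ~ ', .x))) %>%
    fit(df)
)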
Related
I'm a bit confused about how I should interpret the coefficients from the elastic net model that I'm getting through tidymodels and glmnet. Ideally, I'd like to produce unscaled coefficients for maximum interpretability.
My issue is that I'm honestly not sure how to unscale the coefficients that the model is yielding because I can't quite figure out what's being done in the first place.
It's a bit tricky for me to post the data one would need to reproduce my results, but here's my code:
library(tidymodels)
library(tidyverse)
# preps data for model
myrecipe <- mydata %>%
recipe(transactionrevenue ~ sessions + channelgrouping + month + new_user_pct + is_weekend) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(month, channelgrouping, one_hot = TRUE) %>%
step_zv(all_predictors()) %>%
step_normalize(sessions, new_user_pct) %>%
step_interact(terms = ~ sessions:starts_with("channelgrouping") + new_user_pct:starts_with("channelgrouping"))
# creates the model
mymodel <- linear_reg(penalty = 10, mixture = 0.2) %>%
set_engine("glmnet", standardize = FALSE)
wf <- workflow() %>%
add_recipe(myrecipe)
model_fit <- wf %>%
add_model(mymodel) %>%
fit(data = mydata)
# posts coefficients
tidy(model_fit)
If it would help, here's some information that might be useful:
The variable that I'm really focusing on is "sessions."
In the model, the coefficient for sessions is 2543.094882, and the intercept is 1963.369782. The penalty is also 10.
The unscaled mean for sessions is 725.2884 and the standard deviation is 1035.381.
I just can't seem to figure out what units the coefficients are in and how/if it's even possible to unscale the coefficients back to the original units.
Any insight would be very much appreciated.
You can use tidy() on a lot of different components of a workflow. The default is to tidy() the model, but you can also get out the recipe and even individual recipe steps. That is where the information you are interested in lives.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(bivariate)
biv_rec <-
recipe(Class ~ ., data = bivariate_train) %>%
step_BoxCox(all_predictors())%>%
step_normalize(all_predictors())
svm_spec <- svm_linear(mode = "classification")
biv_fit <- workflow(biv_rec, svm_spec) %>% fit(bivariate_train)
## tidy the *model*
tidy(biv_fit)
#> # A tibble: 3 × 2
#> term estimate
#> <chr> <dbl>
#> 1 A -1.15
#> 2 B 1.17
#> 3 Bias 0.328
## tidy the *recipe*
extract_recipe(biv_fit) %>%
tidy()
#> # A tibble: 2 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step BoxCox TRUE FALSE BoxCox_ZRpI2
#> 2 2 step normalize TRUE FALSE normalize_DGmtN
## tidy the *recipe step*
extract_recipe(biv_fit) %>%
tidy(number = 1)
#> # A tibble: 2 × 3
#> terms value id
#> <chr> <dbl> <chr>
#> 1 A -0.857 BoxCox_ZRpI2
#> 2 B -1.09 BoxCox_ZRpI2
## tidy the other *recipe step*
extract_recipe(biv_fit) %>%
tidy(number = 2)
#> # A tibble: 4 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 A mean 1.16 normalize_DGmtN
#> 2 B mean 0.909 normalize_DGmtN
#> 3 A sd 0.00105 normalize_DGmtN
#> 4 B sd 0.00260 normalize_DGmtN
Created on 2021-08-05 by the reprex package (v2.0.0)
You can read more about tidying a recipe here.
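To get sessions back into its original units, you can combine the scaled coefficient with the mean and standard deviation stored in the trained step_normalize(). Below is a minimal sketch, assuming model_fit is the fitted workflow from the question and that step_normalize() is the fourth step of the recipe (adjust number to match your recipe); the step_interact() terms built from the standardized sessions would need the same treatment, so this only covers the main effect.
library(tidymodels)
# mean and sd that step_normalize() learned for `sessions`
norm_stats <- extract_recipe(model_fit) %>%
  tidy(number = 4) %>%
  filter(terms == "sessions") %>%
  select(statistic, value) %>%
  pivot_wider(names_from = statistic, values_from = value)
# coefficient for `sessions` on the standardized scale
scaled_coef <- tidy(model_fit) %>%
  filter(term == "sessions") %>%
  pull(estimate)
# for z = (x - mean) / sd, the slope in original units is beta / sd,
# and the intercept shifts by -beta * mean / sd
unscaled_coef <- scaled_coef / norm_stats$sd
intercept_shift <- -scaled_coef * norm_stats$mean / norm_stats$sd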
I have fitted a regression model on text data with the LiblineaR engine, and I want to `tidy()` my results. I have also installed the dev version of `broom`.
But I always get an error. `ERROR: No tidy method for objects of class LiblineaR`
> svm_fit %>%
+ pull_workflow_fit() %>%
+ tidy()
ERROR: No tidy method for objects of class LiblineaR
We just merged in support for the tidy() method for parsnip models fitted with the LiblineaR engine, so if you install from GitHub, you should be able to have this feature now:
devtools::install_github("tidymodels/parsnip")
Here is a demo of how it works:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(two_class_dat, package = "modeldata")
example_split <- initial_split(two_class_dat, prop = 0.99)
example_train <- training(example_split)
example_test <- testing(example_split)
rec <- recipe(Class ~ ., data = example_train) %>%
step_normalize(all_numeric_predictors())
spec1 <- svm_linear() %>%
set_engine("LiblineaR") %>%
set_mode("classification")
spec2 <- logistic_reg(penalty = 0.1, mixture = 1) %>%
set_engine("LiblineaR") %>%
set_mode("classification")
wf <- workflow() %>%
add_recipe(rec)
wf %>%
add_model(spec1) %>%
fit(example_train) %>%
tidy()
#> # A tibble: 3 x 2
#> term estimate
#> <chr> <dbl>
#> 1 A 0.361
#> 2 B -0.966
#> 3 Bias 0.113
wf %>%
add_model(spec2) %>%
fit(example_train) %>%
tidy()
#> # A tibble: 3 x 2
#> term estimate
#> <chr> <dbl>
#> 1 A 1.06
#> 2 B -2.76
#> 3 Bias 0.329
svm_linear() %>%
set_engine("LiblineaR") %>%
set_mode("regression") %>%
fit(mpg ~ ., data = mtcars) %>%
tidy()
#> # A tibble: 11 x 2
#> term estimate
#> <chr> <dbl>
#> 1 cyl 0.141
#> 2 disp -0.0380
#> 3 hp 0.0415
#> 4 drat 0.226
#> 5 wt 0.0757
#> 6 qsec 1.06
#> 7 vs 0.0648
#> 8 am 0.0479
#> 9 gear 0.219
#> 10 carb 0.00861
#> 11 Bias 0.0525
Created on 2021-04-22 by the reprex package (v2.0.0)
(Updated at the end based on Julia's reply. TL;DR: this seems to be an issue with the underlying kknn package rather than with tidymodels.)
I'm doing some k-nearest neighbours regression models with tidymodels. This is through the nearest_neighbor() function. I want to see what the difference is between the results with and without normalization of the features.
Now set_engine("kknn") uses the kknn::train.kknn() function under the hood, which has a normalization argument scale = TRUE. I want to compare models with scale = FALSE to scale = TRUE (actually, I want to do that in a recipe, but that is not possible, as I'll explain below).
But it does not seem as if I am able to reliably set scale = FALSE through tidymodels. Below is a reprex showing what I see.
My questions, in short: am I doing something wrong, or is this a bug? If it is a bug, is it known, and can I read about it somewhere? I'd be very grateful if someone can shed light on this.
Set up for the reprex
Here I'll use mtcars:
library(tidymodels)
data("mtcars")
A train-test split is:
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)
Here is a common recipe I'll use:
mtcars_recipe <- recipe(mpg ~ disp + wt, data = mtcars)
Here is model 1 (called knn_FALSE) where scale = FALSE:
knn_FALSE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
Here is model 2 (called knn_TRUE) where scale = TRUE:
knn_TRUE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
I bundle these two models into two workflows:
## Workflow with scale = FALSE
wf_FALSE <- workflow() %>%
add_model(knn_FALSE) %>%
add_recipe(mtcars_recipe)
## Workflow with scale = TRUE
wf_TRUE <- workflow() %>%
add_model(knn_TRUE) %>%
add_recipe(mtcars_recipe)
Using fit(), it is possible to have scale = FALSE
It does seem to be possible to have one version with scale = TRUE and one with scale = FALSE when using fit() on a workflow.
For example, for scale = TRUE I get:
wf_TRUE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~TRUE)
Type of response variable: continuous
minimal mean absolute error: 2.09425
Minimal mean squared error: 7.219114
Best kernel: optimal
Best k: 5
Whereas for scale = FALSE I have:
wf_FALSE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~FALSE)
Type of response variable: continuous
minimal mean absolute error: 2.1665
Minimal mean squared error: 6.538769
Best kernel: optimal
Best k: 5
The results are clearly different, which comes from the difference in the scale parameter.
But the plot thickens.
No difference with last_fit()
When using last_fit(), however, the results for scale = TRUE and scale = FALSE are identical.
For scale = TRUE:
wf_TRUE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
Whereas for scale = FALSE:
wf_FALSE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
These are clearly --- and unexpectedly --- the same.
There is also no difference when tuning using tune_grid()
If I do tuning with tune_grid() and a validation_split(), there is also no difference between the results for scale = TRUE and scale = FALSE.
Here is the code for that:
## Tune grid
knn_grid <- tibble(neighbors = c(5, 15))
## Tune Model 1: kNN regression with no scaling in train.kknn
knn_FALSE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
## Model 2: kNN regression with scaling in train.kknn
knn_TRUE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
## Workflow with scale = FALSE
wf_FALSE_tune <- workflow() %>%
add_model(knn_FALSE_tune) %>%
add_recipe(mtcars_recipe)
## Workflow with scale = TRUE
wf_TRUE_tune <- workflow() %>%
add_model(knn_TRUE_tune) %>%
add_recipe(mtcars_recipe)
## Validation split
mtcars_val <- validation_split(mtcars)
## Tune results: Without scaling
wf_FALSE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
## Tune results: With scaling
wf_TRUE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
The result when scale = FALSE:
> wf_FALSE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
The results when scale = TRUE:
> wf_TRUE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
Question
Am I misunderstanding (or missing my own bug), or are the last_fit() and tune_grid() functions not respecting my choice for scale?
I'm new to tidymodels, so I might have missed something. Answers much appreciated.
I was hoping to use step_normalize() in a recipe to do the normalization myself, but since I cannot reliably set scale = FALSE in the underlying engine, I have not been able to experiment with that.
Update after Julia's reply
As Julia shows, train.kknn() gives the same predictions for scale = FALSE and scale = TRUE. So this isn't a tidymodels issue. Rather, the kknn:::predict.train.kknn() function does not respect all parameters passed to train.kknn() when predicting.
Consider the following output which uses kknn() instead of train.kknn():
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
test = testing(mtcars_split), k = 5, scale = FALSE) %>%
predict(newdata = testing(mtcars_split))
## [1] 21.276 21.276 16.860 16.276 21.276 16.404 29.680 15.700 16.020
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
test = testing(mtcars_split), k = 5, scale = TRUE) %>%
predict(newdata = testing(mtcars_split))
## [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
These are different, as they should be. The problem is that kknn:::predict.train.kknn() calls kknn(), but without passing along scale (and some other optional arguments):
function (object, newdata, ...)
{
if (missing(newdata))
return(predict(object, ...))
res <- kknn(formula(terms(object)), object$data, newdata,
k = object$best.parameters$k, kernel = object$best.parameters$kernel,
distance = object$distance)
return(predict(res, ...))
}
<bytecode: 0x55e2304fba10>
<environment: namespace:kknn>
I think you don't have a bug or problem but are just misunderstanding what last_fit() and friends are predicting on to estimate performance.
library(tidymodels)
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)
knn_FALSE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
knn_FALSE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#>
#> Main Arguments:
#> neighbors = 5
#>
#> Engine-Specific Arguments:
#> scale = FALSE
#>
#> Computational engine: kknn
#>
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
#> ks = min_rows(5, data, 5), scale = FALSE)
knn_TRUE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
knn_TRUE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#>
#> Main Arguments:
#> neighbors = 5
#>
#> Engine-Specific Arguments:
#> scale = TRUE
#>
#> Computational engine: kknn
#>
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
#> ks = min_rows(5, data, 5), scale = TRUE)
Notice that both parsnip models are correctly passing the scale parameter to the underlying engine.
We can now add these two parsnip models to a workflow(), with a formula preprocessor (a recipe would be fine too).
wf_FALSE <- workflow() %>%
add_model(knn_FALSE) %>%
add_formula(mpg ~ disp + wt)
## Workflow with scale = TRUE
wf_TRUE <- workflow() %>%
add_model(knn_TRUE) %>%
add_formula(mpg ~ disp + wt)
The function last_fit() fits on the training data and predicts on the testing data. We can do that manually with our workflows. Importantly, notice that for these examples in the testing set, the predictions are the same, so the metrics you would get are the same.
wf_TRUE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
wf_FALSE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
The same thing is true for fitting the models directly:
knn_TRUE %>%
fit(mpg ~ disp + wt, data = training(mtcars_split)) %>%
predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
knn_FALSE %>%
fit(mpg ~ disp + wt, data = training(mtcars_split)) %>%
predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
And in fact the same is true if we fit the underlying kknn model directly:
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split),
ks = 5, scale = FALSE) %>%
predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split),
ks = 5, scale = TRUE) %>%
predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
Created on 2020-11-12 by the reprex package (v0.3.0.9001)
The scale parameter is correctly being passed to the underlying engine; it just doesn't change the prediction for these test cases.
This is the same issue as Predict with step_naomit and retain ID using tidymodels, but even though there is an accepted answer, the OP's last comment states that the issue remains: the "id variable" is being used as a predictor, as can be seen by looking at model$fit$variable.importance.
I have a dataset with "id variables" I would like to keep.
I thought I would be able to achieve this with a recipe() specification.
library(tidymodels)
# label is an identifier variable I want to keep even though it's not
# a predictor
df <- tibble(label = 1:50,
x = rnorm(50, 0, 5),
f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )
df_split <- initial_split(df, prop = 0.70)
# Make up any recipe: just note I specify 'label' as "id variable"
rec <- recipe(training(df_split)) %>%
update_role(label, new_role = "id variable") %>%
update_role(y, new_role = "outcome") %>%
update_role(x, new_role = "predictor") %>%
update_role(f, new_role = "predictor") %>%
step_corr(all_numeric(), -all_outcomes()) %>%
step_dummy(all_predictors(),-all_numeric()) %>%
step_meanimpute(all_numeric(), -all_outcomes())
train_juiced <- prep(rec, training(df_split)) %>% juice()
logit_fit <- logistic_reg(mode = "classification") %>%
set_engine(engine = "glm") %>%
fit(y ~ ., data = train_juiced)
# Why is label a variable in the model ?
logit_fit[['fit']][['coefficients']]
#> (Intercept) label x f_b f_c
#> 1.03664140 -0.01405316 0.22357266 -1.80701531 -1.66285399
Created on 2020-01-27 by the reprex package (v0.3.0)
But even though I did specify label was an id variable, it is being used as a predictor.
So maybe I can use the specific terms I want in the formula and specifically add label as an id variable.
rec <- recipe(training(df_split), y ~ x + f) %>%
update_role(label, new_role = "id variable") %>%
step_corr(all_numeric(), -all_outcomes()) %>%
step_dummy(all_predictors(),-all_numeric()) %>%
step_meanimpute(all_numeric(), -all_outcomes())
#> Error in .f(.x[[i]], ...): object 'label' not found
Created on 2020-01-27 by the reprex package (v0.3.0)
I can try not mentioning label
rec <- recipe(training(df_split), y ~ x + f) %>%
step_corr(all_numeric(), -all_outcomes()) %>%
step_dummy(all_predictors(),-all_numeric()) %>%
step_meanimpute(all_numeric(), -all_outcomes())
train_juiced <- prep(rec, training(df_split)) %>% juice()
logit_fit <- logistic_reg(mode = "classification") %>%
set_engine(engine = "glm") %>%
fit(y ~ ., data = train_juiced)
# Why is label a variable in the model ?
logit_fit[['fit']][['coefficients']]
#> (Intercept) x f_b f_c
#> -0.98950228 0.03734093 0.98945339 1.27014824
train_juiced
#> # A tibble: 35 x 4
#> x y f_b f_c
#> <dbl> <fct> <dbl> <dbl>
#> 1 -0.928 Y 1 0
#> 2 4.54 N 0 0
#> 3 -1.14 N 1 0
#> 4 -5.19 N 1 0
#> 5 -4.79 N 0 0
#> 6 -6.00 N 0 0
#> 7 3.83 N 0 1
#> 8 -8.66 Y 1 0
#> 9 -0.0849 Y 1 0
#> 10 -3.57 Y 0 1
#> # ... with 25 more rows
Created on 2020-01-27 by the reprex package (v0.3.0)
OK, so the model works, but I have lost my label.
How should I do this?
The main issue/conceptual problem you are running into is that once you juice() the recipe, it is just data, i.e. just literally a dataframe. When you use that to fit a model, there's no way for the model to know that some of the variables had special roles.
library(tidymodels)
# label is an identifier variable to keep even though it's not a predictor
df <- tibble(label = 1:50,
x = rnorm(50, 0, 5),
f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )
df_split <- initial_split(df, prop = 0.70)
rec <- recipe(y ~ ., training(df_split)) %>%
update_role(label, new_role = "id variable") %>%
step_corr(all_numeric(), -all_outcomes()) %>%
step_dummy(all_predictors(),-all_numeric()) %>%
step_meanimpute(all_numeric(), -all_outcomes()) %>%
prep()
train_juiced <- juice(rec)
train_juiced
#> # A tibble: 35 x 5
#> label x y f_b f_c
#> <int> <dbl> <fct> <dbl> <dbl>
#> 1 1 1.80 N 1 0
#> 2 3 1.45 N 0 0
#> 3 5 -5.00 N 0 0
#> 4 6 -4.15 N 1 0
#> 5 7 1.37 Y 0 1
#> 6 8 1.62 Y 0 1
#> 7 10 -1.77 Y 1 0
#> 8 11 -3.15 N 0 1
#> 9 12 -2.02 Y 0 1
#> 10 13 2.65 Y 0 1
#> # … with 25 more rows
Notice that train_juiced is just literally a regular tibble. If you train a model on this tibble using fit(), it won't know anything about the recipe used to transform the data.
The tidymodels framework does have a way to train models using the role information from the recipe. Probably the easiest way to do that is using workflows.
logit_spec <- logistic_reg(mode = "classification") %>%
set_engine(engine = "glm")
wf <- workflow() %>%
add_model(logit_spec) %>%
add_recipe(rec)
logit_fit <- fit(wf, training(df_split))
# No more label in the model
logit_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#>
#> ● step_corr()
#> ● step_dummy()
#> ● step_meanimpute()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>
#> Call: stats::glm(formula = formula, family = stats::binomial, data = data)
#>
#> Coefficients:
#> (Intercept) x f_b f_c
#> 0.42331 -0.04234 -0.04991 0.64728
#>
#> Degrees of Freedom: 34 Total (i.e. Null); 31 Residual
#> Null Deviance: 45
#> Residual Deviance: 44.41 AIC: 52.41
Created on 2020-02-15 by the reprex package (v0.3.0)
No more labels in the model!
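If you then want the label back alongside predictions (say, on the test set), one option is to predict with the fitted workflow and bind the id column back on. A small sketch, assuming the logit_fit workflow and df_split from above:
test_data <- testing(df_split)
# the workflow applies the recipe for you; bind the id (and outcome) back on
predict(logit_fit, new_data = test_data) %>%
  bind_cols(test_data %>% select(label, y))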
I am trying to write a custom function to run a one-way within-subjects ANOVA using rlang + ez.
An example of the output I am expecting:
# setup
set.seed(123)
library(WRS2)
library(ez)
library(tidyverse)
# getting data in format that `ez` expects
df <- WRS2::WineTasting %>%
dplyr::mutate_if(
.tbl = .,
.predicate = purrr::is_bare_character,
.funs = as.factor
) %>%
dplyr::mutate(.data = ., Taster = as.factor(Taster))
# this works
ez::ezANOVA(
data = df,
dv = Taste,
wid = Taster,
within = Wine,
detailed = TRUE,
return_aov = TRUE
)
#> $ANOVA
#> Effect DFn DFd SSn SSd F p
#> 1 (Intercept) 1 21 2.005310e+03 4.2186364 9982.254929 1.311890e-29
#> 2 Wine 2 42 9.371212e-02 0.3129545 6.288308 4.084101e-03
#> p<.05 ges
#> 1 * 0.99774530
#> 2 * 0.02026075
#>
#> $`Mauchly's Test for Sphericity`
#> Effect W p p<.05
#> 2 Wine 0.7071776 0.03128132 *
#>
#> $`Sphericity Corrections`
#> Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05
#> 2 Wine 0.7735015 0.008439799 * 0.8233709 0.007188822 *
#>
#> $aov
#>
#> Call:
#> aov(formula = formula(aov_formula), data = data)
#>
#> Grand Mean: 5.512121
#>
#> Stratum 1: Taster
#>
#> Terms:
#> Residuals
#> Sum of Squares 4.218636
#> Deg. of Freedom 21
#>
#> Residual standard error: 0.4482047
#>
#> Stratum 2: Taster:Wine
#>
#> Terms:
#> Wine Residuals
#> Sum of Squares 0.09371212 0.31295455
#> Deg. of Freedom 2 42
#>
#> Residual standard error: 0.08632091
#> Estimated effects may be unbalanced
Now here is a custom function I have written to do the same but using non-standard evaluation implemented in rlang:
# custom function
aov_fun <- function(data, x, y, id) {
# getting data in format that `ez` expects
df <- data %>%
dplyr::mutate_if(
.tbl = .,
.predicate = purrr::is_bare_character,
.funs = as.factor
) %>%
dplyr::mutate(.data = ., {{ id }} := as.factor({{ id }})) %>%
tibble::as_tibble(.)
# print the dataframe to see if it was cleaned as expected
print(df)
# running anova
ez::ezANOVA(
data = df,
dv = {{ y }},
wid = {{ id }},
within = {{ x }},
detailed = TRUE,
return_aov = TRUE
)
}
But this doesn't work. Note that the dataframe is getting cleaned properly, so that's not where the error lies.
# using the function
aov_fun(WRS2::WineTasting, Wine, Taste, Taster)
#> # A tibble: 66 x 3
#> Taste Wine Taster
#> <dbl> <fct> <fct>
#> 1 5.4 Wine A 1
#> 2 5.5 Wine B 1
#> 3 5.55 Wine C 1
#> 4 5.85 Wine A 2
#> 5 5.7 Wine B 2
#> 6 5.75 Wine C 2
#> 7 5.2 Wine A 3
#> 8 5.6 Wine B 3
#> 9 5.5 Wine C 3
#> 10 5.55 Wine A 4
#> # ... with 56 more rows
#> Error in ezANOVA_main(data = data, dv = dv, wid = wid, within = within, : "{
#> y
#> }" is not a variable in the data frame provided.
Instead of dv = {{ y }}, I have also tried:
dv = rlang::as_string(y)
dv = rlang::as_name(y)
dv = rlang::enquo(y)
But none of these work.
Whenever I want to bridge rlang's NSE with functions that don't explicitly support it,
I find that dividing the procedure into these two steps (at least conceptually) is always helpful:
1. Create the final expression I'd like using rlang functions.
2. Evaluate it, either with rlang::eval_tidy() if quosures are involved, or with base::eval() otherwise.
In your case, you can probably finish your function with something like:
# running anova
rlang::eval_tidy(rlang::expr(ez::ezANOVA(
data = df,
dv = {{ y }},
wid = {{ id }},
within = {{ x }},
detailed = TRUE,
return_aov = TRUE
)))
expr creates the expression and obviously supports rlang's NSE,
and eval_tidy simply evaluates the expression.
Oh and BTW, if ezANOVA (or any other function you want to use NSE with) supported strings instead of expressions as input,
you'd need something like rlang::as_string(rlang::enexpr(param)),
first capturing the expression of what the user wrote as param,
and then using as_string to transform that expression.
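For instance, here is a minimal sketch of that string-based pattern (capture_as_string() is just an illustrative helper, not part of rlang):
library(rlang)
# capture whatever the caller typed for `param` and convert that symbol to a string
capture_as_string <- function(param) {
  as_string(enexpr(param))
}
capture_as_string(Taste)
#> [1] "Taste"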
This can also be corrected with match.call() and do.call():
aov_fun <- function(data, x, y, id) {
lst1 <- as.list(match.call()[-1])
names(lst1)<- c("data", "dv", "wid", "within")[match(names(lst1),
c("data", "y", "id", "x"))]
df <- data %>%
dplyr::mutate_if(
.tbl = .,
.predicate = purrr::is_bare_character,
.funs = as.factor
) %>%
dplyr::mutate(.data = ., {{ id }} := as.factor({{ id }})) %>%
tibble::as_tibble(.)
do.call(getFromNamespace("ezANOVA", "ez"),
c(lst1, detailed = TRUE, return_aov = TRUE))
}
Testing:
aov_fun(WRS2::WineTasting, x = Wine,y = Taste, id = Taster)
#$ANOVA
# Effect DFn DFd SSn SSd F p p<.05 ges
# 1 (Intercept) 1 21 2.005310e+03 4.2186364 9982.254929 1.311890e-29 * 0.99774530
# 2 Wine 2 42 9.371212e-02 0.3129545 6.288308 4.084101e-03 * 0.02026075
# $`Mauchly's Test for Sphericity`
# Effect W p p<.05
# 2 Wine 0.7071776 0.03128132 *
# $`Sphericity Corrections`
# Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05
# 2 Wine 0.7735015 0.008439799 * 0.8233709 0.007188822 *
# $aov
# Call:
# aov(formula = formula(aov_formula), data = data)
# Grand Mean: 5.512121
# Stratum 1: Taster
# Terms:
# Residuals
# Sum of Squares 4.218636
# Deg. of Freedom 21
# Residual standard error: 0.4482047
# Stratum 2: Taster:Wine
# Terms:
# Wine Residuals
# Sum of Squares 0.09371212 0.31295455
# Deg. of Freedom 2 42
# Residual standard error: 0.08632091
# Estimated effects may be unbalanced
This is a great application for using_bang from @moody_mudskipper's tags package:
aov_fun <- function(data, x, y, id) {
# ...
# code as before
# running anova
tags::using_bang$ezANOVA(
data = df,
dv = {{y}},
wid = {{id}},
within = {{x}},
detailed = TRUE,
return_aov = TRUE
)
}