How can I unscale and understand glmnet coefficients while using tidymodels?

I'm a bit confused about how to interpret the coefficients from the elastic net model that I'm fitting through tidymodels and glmnet. Ideally, I'd like to produce unscaled coefficients for maximum interpretability.
My issue is that I'm not sure how to unscale the coefficients the model is yielding, because I can't quite figure out what scaling was applied in the first place.
It's a bit tricky for me to post the data one would need to reproduce my results, but here's my code:
library(tidymodels)
library(tidyverse)
# preps data for model
myrecipe <- mydata %>%
  recipe(transactionrevenue ~ sessions + channelgrouping + month + new_user_pct + is_weekend) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(month, channelgrouping, one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  step_normalize(sessions, new_user_pct) %>%
  step_interact(terms = ~ sessions:starts_with("channelgrouping") + new_user_pct:starts_with("channelgrouping"))
# creates the model
mymodel <- linear_reg(penalty = 10, mixture = 0.2) %>%
  set_engine("glmnet", standardize = FALSE)
wf <- workflow() %>%
  add_recipe(myrecipe)
model_fit <- wf %>%
  add_model(mymodel) %>%
  fit(data = mydata)
# posts coefficients
tidy(model_fit)
If it would help, here's some information that might be useful:
The variable that I'm really focusing on is "sessions."
In the model, the coefficient for sessions is 2543.094882, and the intercept is 1963.369782. The penalty is also 10.
The unscaled mean for sessions is 725.2884 and the standard deviation is 1035.381.
I just can't seem to figure out what units the coefficients are in and how/if it's even possible to unscale the coefficients back to the original units.
Any insight would be very much appreciated.

You can use tidy() on a lot of different components of a workflow. The default is to tidy() the model, but you can also tidy the recipe and even individual recipe steps. That is where the information it sounds like you are interested in lives.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(bivariate)
biv_rec <-
  recipe(Class ~ ., data = bivariate_train) %>%
  step_BoxCox(all_predictors()) %>%
  step_normalize(all_predictors())
svm_spec <- svm_linear(mode = "classification")
biv_fit <- workflow(biv_rec, svm_spec) %>% fit(bivariate_train)
## tidy the *model*
tidy(biv_fit)
#> # A tibble: 3 × 2
#> term estimate
#> <chr> <dbl>
#> 1 A -1.15
#> 2 B 1.17
#> 3 Bias 0.328
## tidy the *recipe*
extract_recipe(biv_fit) %>%
tidy()
#> # A tibble: 2 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step BoxCox TRUE FALSE BoxCox_ZRpI2
#> 2 2 step normalize TRUE FALSE normalize_DGmtN
## tidy the *recipe step*
extract_recipe(biv_fit) %>%
tidy(number = 1)
#> # A tibble: 2 × 3
#> terms value id
#> <chr> <dbl> <chr>
#> 1 A -0.857 BoxCox_ZRpI2
#> 2 B -1.09 BoxCox_ZRpI2
## tidy the other *recipe step*
extract_recipe(biv_fit) %>%
tidy(number = 2)
#> # A tibble: 4 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 A mean 1.16 normalize_DGmtN
#> 2 B mean 0.909 normalize_DGmtN
#> 3 A sd 0.00105 normalize_DGmtN
#> 4 B sd 0.00260 normalize_DGmtN
Created on 2021-08-05 by the reprex package (v2.0.0)
You can read more about tidying a recipe here.
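Once you have the mean and standard deviation from the normalize step, you can translate the glmnet coefficient back to the original units by hand. Here is a minimal sketch using the numbers quoted in the question (and ignoring the interaction terms, which also involve the standardized sessions):
# step_normalize() creates sessions_scaled = (sessions - mean) / sd, so a model
# fit as y = b0 + b1 * sessions_scaled can be rewritten in the original units as
#   y = (b0 - b1 * mean / sd) + (b1 / sd) * sessions
b1_scaled <- 2543.094882   # coefficient for standardized sessions, from tidy(model_fit)
b0_scaled <- 1963.369782   # intercept, from tidy(model_fit)
mu        <- 725.2884      # mean used by step_normalize() (from tidying that recipe step)
sigma     <- 1035.381      # sd used by step_normalize()
b1_orig <- b1_scaled / sigma                   # revenue change per one additional raw session
b0_orig <- b0_scaled - b1_scaled * mu / sigma
c(slope = b1_orig, intercept = b0_orig)
Keep in mind that the back-transformed values are still penalized estimates: the penalty of 10 was applied on the standardized scale, so these are shrunken coefficients, not what an unpenalized lm() would give you.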

Related

How to extract PLSR coefficients as for glmnet using tidymodels

I tuned a glmnet regression model and extracted the coefficients as described here. That works wonderfully. However, when I use the same form of coefficient extraction for PLSR with the mixOmics engine, I obtain single values per term and component, as demonstrated here. For further external use I need the PLSR coefficients in the first form. I can achieve this by using the optimal hyperparameter set with the plsr() function from the pls package and then extracting the coefficients with coef(), as shown at the end of the code below. However, I would like to avoid this extra step because I cannot pass parameters like predictor_prop to plsr, and thus the results may vary.
Is there a more elegant way to extract the overall model coefficients of the PLSR as for glmnet or can I calculate them from the component values?
library(tidymodels)
library(plsmod)
data(Chicago)
Chicago <- Chicago %>% select(ridership, Clark_Lake, Austin, Harlem)
# create cross-validation dataset
folds <- vfold_cv(Chicago)
# create recipe
rec <- recipe(ridership ~ ., Chicago) %>%
  step_normalize(all_predictors()) %>%
  prep(training = Chicago)
# define model
mod <- parsnip::pls(mode = "regression",
                    num_comp = tune(),
                    predictor_prop = tune()) %>%
  set_engine("mixOmics")
# define workflow
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)
# run grid tuning
set.seed(123)
res <- tune_grid(wf, resamples = folds, grid = 5)
# get best model
res_best <- res %>% select_best("rmse")
# fit best model and extract coefficients
wf %>%
  finalize_workflow(res_best) %>%
  fit(Chicago) %>%
  extract_fit_parsnip() %>%
  tidy()
# extracting coefficients using plsr from pls package and coef function
p <- pls::plsr(ridership ~ ., data = Chicago, scale = T, center = T, ncomp = 3)
coef(p, intercept = T)
Thank you for the awesome tidymodels framework and everyone who makes it what it is!
As far as I can tell, you are doing the correct thing.
The coef() function only shows you the result for 3 components, but you can get the same result by adding filter(component == 3), as in the following code:
wf %>%
  finalize_workflow(res_best) %>%
  fit(Chicago) %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(component == 3)
#> # A tibble: 4 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 0 predictors 3
#> 2 Austin 1 predictors 3
#> 3 Harlem 0 predictors 3
#> 4 Y 1 outcomes 3
The reason you are getting 0s and 1s is that the tuned value of predictor_prop is quite low, giving you a sparse representation:
library(tidymodels)
library(plsmod)
data(Chicago)
Chicago <- Chicago %>% select(ridership, Clark_Lake, Austin, Harlem)
# create cross-validation dataset
folds <- vfold_cv(Chicago)
# create recipe
rec <- recipe(ridership ~ ., Chicago) %>%
  step_normalize(all_predictors()) %>%
  prep(training = Chicago)
# define model
mod <- parsnip::pls(mode = "regression",
                    num_comp = tune(),
                    predictor_prop = tune()) %>%
  set_engine("mixOmics")
# define workflow
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)
# run grid tuning
set.seed(123)
res <- tune_grid(wf, resamples = folds, grid = 5)
# get best model
res_best <- res %>% select_best("rmse")
res_best
#> # A tibble: 1 × 3
#> predictor_prop num_comp .config
#> <dbl> <int> <chr>
#> 1 0.0869 3 Preprocessor1_Model1
# fit best model and extract coefficients
wf %>%
finalize_workflow(res_best) %>%
fit(Chicago) %>%
extract_fit_parsnip() %>%
tidy()
#> # A tibble: 12 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 1 predictors 1
#> 2 Clark_Lake 0 predictors 2
#> 3 Clark_Lake 0 predictors 3
#> 4 Austin 0 predictors 1
#> 5 Austin 0 predictors 2
#> 6 Austin 1 predictors 3
#> 7 Harlem 0 predictors 1
#> 8 Harlem -1 predictors 2
#> 9 Harlem 0 predictors 3
#> 10 Y 1 outcomes 1
#> 11 Y 1 outcomes 2
#> 12 Y 1 outcomes 3
wf %>%
finalize_workflow(
tibble(predictor_prop = 0, num_comp = 3)
) %>%
fit(Chicago) %>%
extract_fit_parsnip() %>%
tidy()
#> # A tibble: 12 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 1 predictors 1
#> 2 Clark_Lake 0 predictors 2
#> 3 Clark_Lake 0 predictors 3
#> 4 Austin 0 predictors 1
#> 5 Austin 0 predictors 2
#> 6 Austin 1 predictors 3
#> 7 Harlem 0 predictors 1
#> 8 Harlem -1 predictors 2
#> 9 Harlem 0 predictors 3
#> 10 Y 1 outcomes 1
#> 11 Y 1 outcomes 2
#> 12 Y 1 outcomes 3
wf %>%
finalize_workflow(
tibble(predictor_prop = 0.5, num_comp = 3)
) %>%
fit(Chicago) %>%
extract_fit_parsnip() %>%
tidy()
#> # A tibble: 12 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 0.908 predictors 1
#> 2 Clark_Lake 0 predictors 2
#> 3 Clark_Lake 0 predictors 3
#> 4 Austin 0.419 predictors 1
#> 5 Austin -0.859 predictors 2
#> 6 Austin -0.406 predictors 3
#> 7 Harlem 0 predictors 1
#> 8 Harlem -0.513 predictors 2
#> 9 Harlem 0.914 predictors 3
#> 10 Y 1 outcomes 1
#> 11 Y 1 outcomes 2
#> 12 Y 1 outcomes 3
wf %>%
finalize_workflow(
tibble(predictor_prop = 1, num_comp = 3)
) %>%
fit(Chicago) %>%
extract_fit_parsnip() %>%
tidy()
#> # A tibble: 12 × 4
#> term value type component
#> <chr> <dbl> <chr> <dbl>
#> 1 Clark_Lake 0.593 predictors 1
#> 2 Clark_Lake 0.738 predictors 2
#> 3 Clark_Lake 0.321 predictors 3
#> 4 Austin 0.576 predictors 1
#> 5 Austin -0.111 predictors 2
#> 6 Austin -0.810 predictors 3
#> 7 Harlem 0.562 predictors 1
#> 8 Harlem -0.665 predictors 2
#> 9 Harlem 0.491 predictors 3
#> 10 Y 1 outcomes 1
#> 11 Y 1 outcomes 2
#> 12 Y 1 outcomes 3
Created on 2022-09-08 by the reprex package (v2.0.1)
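If you would rather have a single coefficient per predictor without refitting with pls::plsr(), one option is to work with the underlying mixOmics fit itself. The sketch below is an assumption-laden illustration rather than part of the answer above: it assumes the engine fit is a mixOmics pls/spls object and that its predict() method returns a B.hat array of per-component regression coefficients (as described in the mixOmics documentation), so check it against your installed version:
final_fit <- wf %>%
  finalize_workflow(res_best) %>%
  fit(Chicago)
# pull out the raw mixOmics model underneath the parsnip/workflows wrappers
pls_engine <- extract_fit_engine(final_fit)
# predict() on a mixOmics pls/spls fit also returns B.hat, the implied regression
# coefficients for 1, 2, ..., num_comp components; note these are on the scale of
# the recipe-normalized predictors
preds <- predict(pls_engine,
                 newdata = as.matrix(Chicago[, c("Clark_Lake", "Austin", "Harlem")]))
preds$B.hat[, , 3]   # overall coefficients when all 3 components are used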

How to tune a model using grid search and a single validation fold with tidymodels?

I have just learnt about the KNN algorithm and machine learning. It is a lot for me to take in and we are using tidymodels in R to practice.
Now, I know how to implement a grid search using k-fold cross-validation as follows:
hist_data_split <- initial_split(hist_data, strata = fraud)
hist_data_train <- training(hist_data_split)
hist_data_test <- testing(hist_data_split)
folds <- vfold_cv(hist_data_train, strata = fraud)
nearest_neighbor_grid <- grid_regular(neighbors(range = c(1, 500)), levels = 25)
knn_rec_1 <- recipe(fraud ~ ., data = hist_data_train)
knn_spec_1 <- nearest_neighbor(mode = "classification", engine = "kknn", neighbors = tune(), weight_func = "rectangular")
knn_wf_1 <- workflow(preprocessor = knn_rec_1, spec = knn_spec_1)
knn_fit_1 <- tune_grid(knn_wf_1, resamples = folds, metrics = metric_set(accuracy, sens, spec, roc_auc), control = control_resamples(save_pred = T), grid = nearest_neighbor_grid)
In the above case, I am essentially running a 10-fold cross-validated grid search to tune my model. However, hist_data has 169,173 rows, which gives an optimal K of about 411 (roughly the square root of n), and with 10-fold cross-validation the tuning is going to take forever, so the hint given is to use a single validation fold instead of cross-validation.
Thus, I am wondering how I can tweak my code to implement this. When I add the argument v = 1 in vfold_cv, R throws me an error which says, "At least one row should be selected for the analysis set." Should I instead change resamples = folds in tune_grid to resamples = 1?
Any intuitive suggestions will be greatly appreciated :)
P.S. I did not include an MWE in the sense that the data is not provided because I feel like this is a really trivial question which can be answered as is!
If you are not able to do a cross-validation split, for whatever reason, you can do a validation split, which is conceptually very close to a v = 1 cross-validation.
library(tidymodels)
hist_data_split <- initial_split(ames, strata = Street)
hist_data_train <- training(hist_data_split)
hist_data_test <- testing(hist_data_split)
folds <- validation_split(hist_data_train, strata = Street)
nearest_neighbor_grid <- grid_regular(
  neighbors(range = c(1, 500)),
  levels = 25
)
knn_rec_1 <- recipe(Street ~ ., data = ames)
knn_spec_1 <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn") %>%
  set_args(weight_func = "rectangular")
knn_wf_1 <- workflow(preprocessor = knn_rec_1, spec = knn_spec_1)
knn_fit_1 <- tune_grid(
  knn_wf_1,
  resamples = folds,
  metrics = metric_set(accuracy, sens, spec, roc_auc),
  control = control_resamples(save_pred = T),
  grid = nearest_neighbor_grid
)
knn_fit_1
#> # Tuning results
#> # Validation Set Split (0.75/0.25) using stratification
#> # A tibble: 1 × 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [1647/550]> validation <tibble [100 × 5]> <tibble [0 × 3]> <tibble>
knn_fit_1 %>%
collect_metrics()
#> # A tibble: 100 × 7
#> neighbors .metric .estimator mean n std_err .config
#> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1 accuracy binary 0.996 1 NA Preprocessor1_Model01
#> 2 1 roc_auc binary 0.5 1 NA Preprocessor1_Model01
#> 3 1 sens binary 0 1 NA Preprocessor1_Model01
#> 4 1 spec binary 1 1 NA Preprocessor1_Model01
#> 5 21 accuracy binary 0.996 1 NA Preprocessor1_Model02
#> 6 21 roc_auc binary 0.495 1 NA Preprocessor1_Model02
#> 7 21 sens binary 0 1 NA Preprocessor1_Model02
#> 8 21 spec binary 1 1 NA Preprocessor1_Model02
#> 9 42 accuracy binary 0.996 1 NA Preprocessor1_Model03
#> 10 42 roc_auc binary 0.486 1 NA Preprocessor1_Model03
#> # … with 90 more rows
Created on 2022-09-06 by the reprex package (v2.0.1)
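As a side note, if you are on a newer release of rsample, the three-way split helpers express the same idea a bit more directly. This is a sketch based on the current rsample documentation (initial_validation_split() and validation_set() are assumptions about your installed version, so double-check they are available):
library(tidymodels)
# one call that carves out training / validation / testing sets (proportions illustrative)
ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2), strata = Street)
ames_train <- training(ames_val_split)
ames_test  <- testing(ames_val_split)
# validation_set() packages the training/validation pieces as a single-resample rset,
# which tune_grid() accepts just like the validation_split() object above
folds <- validation_set(ames_val_split)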

No tidy method for objects of class LiblineaR

I have fit a regression model to text data using the LiblineaR engine, and I want to `tidy()` my results. I have also installed the dev version of `broom`.
But I always get an error: `ERROR: No tidy method for objects of class LiblineaR`
> svm_fit %>%
+ pull_workflow_fit() %>%
+ tidy()
ERROR: No tidy method for objects of class LiblineaR
We just merged in support for the tidy() method for parsnip models fitted with the LiblineaR engine, so if you install from GitHub, you should be able to have this feature now:
devtools::install_github("tidymodels/parsnip")
Here is a demo of how it works:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(two_class_dat, package = "modeldata")
example_split <- initial_split(two_class_dat, prop = 0.99)
example_train <- training(example_split)
example_test <- testing(example_split)
rec <- recipe(Class ~ ., data = example_train) %>%
step_normalize(all_numeric_predictors())
spec1 <- svm_linear() %>%
set_engine("LiblineaR") %>%
set_mode("classification")
spec2 <- logistic_reg(penalty = 0.1, mixture = 1) %>%
set_engine("LiblineaR") %>%
set_mode("classification")
wf <- workflow() %>%
add_recipe(rec)
wf %>%
add_model(spec1) %>%
fit(example_train) %>%
tidy()
#> # A tibble: 3 x 2
#> term estimate
#> <chr> <dbl>
#> 1 A 0.361
#> 2 B -0.966
#> 3 Bias 0.113
wf %>%
add_model(spec2) %>%
fit(example_train) %>%
tidy()
#> # A tibble: 3 x 2
#> term estimate
#> <chr> <dbl>
#> 1 A 1.06
#> 2 B -2.76
#> 3 Bias 0.329
svm_linear() %>%
set_engine("LiblineaR") %>%
set_mode("regression") %>%
fit(mpg ~ ., data = mtcars) %>%
tidy()
#> # A tibble: 11 x 2
#> term estimate
#> <chr> <dbl>
#> 1 cyl 0.141
#> 2 disp -0.0380
#> 3 hp 0.0415
#> 4 drat 0.226
#> 5 wt 0.0757
#> 6 qsec 1.06
#> 7 vs 0.0648
#> 8 am 0.0479
#> 9 gear 0.219
#> 10 carb 0.00861
#> 11 Bias 0.0525
Created on 2021-04-22 by the reprex package (v2.0.0)
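(This tidy() support has since made it into the released version of parsnip, so by now a regular CRAN install should be enough:)
install.packages("parsnip")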

tidymodels does not respect fixed set_engine parameters

(Updated at the end based on Julia's reply. TL;DR: This seems to be an issue with the underlying kknn package, instead of with tidymodels)
I'm doing some k-nearest neighbours regression models with tidymodels. This is through the nearest_neighbor() function. I want to see what the difference is between the results with and without normalization of the features.
Now set_engine("kknn") uses the kknn::train.kknn() function under the hood, which has a normalization argument scale = TRUE. I want to compare models with scale = FALSE to scale = TRUE (actually, I want to do that in a recipe, but that is not possible, as I'll explain below).
But it does not seem as if I am able to reliably set scale = FALSE through tidymodels. Below is a reprex showing what I see.
So, in short, the questions: Am I doing something wrong, or is this a bug? If it is a bug, is it known and can I read about it somewhere? I'd be very grateful if someone can shed light on this.
Set up for the reprex
Here I'll use mtcars:
library(tidymodels)
data("mtcars")
A train-test split is:
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)
Here is a common recipe I'll use:
mtcars_recipe <- recipe(mpg ~ disp + wt, data = mtcars)
Here is model 1 (called knn_FALSE) where scale = FALSE:
knn_FALSE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
Here is model 2 (called knn_TRUE) where scale = TRUE:
knn_TRUE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
I bundle these two models into two workflows:
## Workflow with scale = FALSE
wf_FALSE <- workflow() %>%
add_model(knn_FALSE) %>%
add_recipe(mtcars_recipe)
## Worflow with scale = TRUE
wf_TRUE <- workflow() %>%
add_model(knn_TRUE) %>%
add_recipe(mtcars_recipe)
Using fit(), it is possible to have scale = FALSE
It does seem to be possible to have one version with scale = TRUE and one with scale = FALSE when using fit() on a workflow.
For example, for scale = TRUE I get:
wf_TRUE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~TRUE)
Type of response variable: continuous
minimal mean absolute error: 2.09425
Minimal mean squared error: 7.219114
Best kernel: optimal
Best k: 5
Whereas for scale = FALSE I have:
wf_FALSE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~FALSE)
Type of response variable: continuous
minimal mean absolute error: 2.1665
Minimal mean squared error: 6.538769
Best kernel: optimal
Best k: 5
The results are clearly different, which comes from the difference in the scale parameter.
But the plot thickens.
No difference with last_fit()
When using last_fit(), however, the results for scale = TRUE and scale = FALSE are identical.
For scale = TRUE:
wf_TRUE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
Whereas for scale = FALSE:
wf_FALSE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
These are clearly, and unexpectedly, the same.
There is also no difference when tuning using tune_grid()
If I do tuning with tune_grid() and a validation_split(), there is also no difference between the results for scale = TRUE and scale = FALSE.
Here is the code for that:
## Tune grid
knn_grid <- tibble(neighbors = c(5, 15))
## Tune Model 1: kNN regression with no scaling in train.kknn
knn_FALSE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
## Model 2: kNN regression with scaling in train.kknn
knn_TRUE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
## Workflow with scale = FALSE
wf_FALSE_tune <- workflow() %>%
add_model(knn_FALSE_tune) %>%
add_recipe(mtcars_recipe)
## Worflow with scale = TRUE
wf_TRUE_tune <- workflow() %>%
add_model(knn_TRUE_tune) %>%
add_recipe(mtcars_recipe)
## Validation split
mtcars_val <- validation_split(mtcars)
## Tune results: Without scaling
wf_FALSE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
## Tune results: With scaling
wf_TRUE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
The result when scale = FALSE:
> wf_FALSE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
The results when scale = TRUE:
> wf_TRUE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
Question
Am I misunderstanding (or missing my own bug), or are the last_fit() and tune_grid() functions not respecting my choice for scale?
I'm new to tidymodels, so I might have missed something. Answers much appreciated.
I was hoping to use step_normalize() in a recipe to do the normalization myself, but since I cannot reliably set scale = FALSE in the underlying engine, I have not been able to experiment with that.
Update after Julia's reply
As Julia shows, train.kknn() produces the same predictions whether scale = FALSE or scale = TRUE, so this isn't a tidymodels issue. Rather, the kknn:::predict.train.kknn() function does not respect all parameters passed to train.kknn() when predicting.
Consider the following output which uses kknn() instead of train.kknn():
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
           test = testing(mtcars_split), k = 5, scale = FALSE) %>%
  predict(newdata = testing(mtcars_split))
## [1] 21.276 21.276 16.860 16.276 21.276 16.404 29.680 15.700 16.020
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
           test = testing(mtcars_split), k = 5, scale = TRUE) %>%
  predict(newdata = testing(mtcars_split))
## [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
These are different, as they should be. The problem is that kknn:::predict.train.kknn() calls kknn() without passing along scale (and some other optional arguments):
function (object, newdata, ...)
{
if (missing(newdata))
return(predict(object, ...))
res <- kknn(formula(terms(object)), object$data, newdata,
k = object$best.parameters$k, kernel = object$best.parameters$kernel,
distance = object$distance)
return(predict(res, ...))
}
<bytecode: 0x55e2304fba10>
<environment: namespace:kknn>
I think you don't have a bug or problem but are just misunderstanding what last_fit() and friends are predicting on to estimate performance.
library(tidymodels)
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)
knn_FALSE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
knn_FALSE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#>
#> Main Arguments:
#> neighbors = 5
#>
#> Engine-Specific Arguments:
#> scale = FALSE
#>
#> Computational engine: kknn
#>
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
#> ks = min_rows(5, data, 5), scale = FALSE)
knn_TRUE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
knn_TRUE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#>
#> Main Arguments:
#> neighbors = 5
#>
#> Engine-Specific Arguments:
#> scale = TRUE
#>
#> Computational engine: kknn
#>
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
#> ks = min_rows(5, data, 5), scale = TRUE)
Notice that both parsnip models are correctly passing the scale parameter to the underlying engine.
We can now add these two parsnip models to a workflow(), with a formula preprocessor (a recipe would be fine too).
wf_FALSE <- workflow() %>%
add_model(knn_FALSE) %>%
add_formula(mpg ~ disp + wt)
## Worflow with scale = TRUE
wf_TRUE <- workflow() %>%
add_model(knn_TRUE) %>%
add_formula(mpg ~ disp + wt)
The function last_fit() fits on the training data and predicts on the testing data. We can do that manually with our workflows. Importantly, notice that for these examples in the testing set, the predictions are the same, so the metrics you would get are the same.
wf_TRUE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
wf_FALSE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
The same thing is true for fitting the models directly:
knn_TRUE %>%
fit(mpg ~ disp + wt, data = training(mtcars_split)) %>%
predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
knn_FALSE %>%
fit(mpg ~ disp + wt, data = training(mtcars_split)) %>%
predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#> .pred
#> <dbl>
#> 1 21.0
#> 2 21.8
#> 3 16.7
#> 4 16.1
#> 5 21.3
#> 6 16.4
#> 7 26.3
#> 8 16.1
#> 9 15.6
And in fact the same is true if we fit the underlying kknn model directly:
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split),
ks = 5, scale = FALSE) %>%
predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split),
ks = 5, scale = TRUE) %>%
predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
Created on 2020-11-12 by the reprex package (v0.3.0.9001)
The scale parameter is correctly being passed to the underlying engine; it just doesn't change the prediction for these test cases.

Run a aov test through a tibble in a tidy way

I want to run a linear regression on a data frame using the same dependent variable. A similar question was solved here. The problem is that the aov() function used to implement the ANOVA doesn't accept x and y as arguments (as far as I know). Is there a way to implement the analysis in a tidy way? So far I've tried something like:
library(tidyverse)
iris %>%
  as_tibble() %>%
  select(Sepal.Length, Species) %>%
  mutate(foo_a = as_factor(sample(c("a", "b", "c"), nrow(.), replace = T)),
         foo_b = as_factor(sample(c("d", "e", "f"), nrow(.), replace = T))) %>%
  map(~aov(Sepal.Length ~ .x, data = .))
Created on 2019-02-12 by the reprex package (v0.2.1)
The desired output is three analyses: Sepal.Length vs. Species, Sepal.Length vs. foo_a, and Sepal.Length vs. foo_b. Is this possible, or am I totally wrong?
One approach is to make this into a long-shaped data frame, group by the independent variable of interest, and use the "many models" approach. I usually prefer something like this over trying to do tidyeval across multiple columns—it just gives me a clearer sense of what's going on.
To save space, I'm working with iris_foo, which is your data as you created it up through the 2 mutate lines. Putting it into a long format gives you a key of the names of those three columns that will be used as independent variables in each of the aov calls.
library(tidyverse)
iris_foo %>%
gather(key, value, -Sepal.Length)
#> # A tibble: 450 x 3
#> Sepal.Length key value
#> <dbl> <chr> <chr>
#> 1 5.1 Species setosa
#> 2 4.9 Species setosa
#> 3 4.7 Species setosa
#> 4 4.6 Species setosa
#> 5 5 Species setosa
#> 6 5.4 Species setosa
#> 7 4.6 Species setosa
#> 8 5 Species setosa
#> 9 4.4 Species setosa
#> 10 4.9 Species setosa
#> # … with 440 more rows
From there, nest by key and create a new list-column of ANOVA models. This will be a list of aov objects. For simplicity with getting your models back out, you can drop the data column.
aov_models <- iris_foo %>%
gather(key, value, -Sepal.Length) %>%
group_by(key) %>%
nest() %>%
mutate(model = map(data, ~aov(Sepal.Length ~ value, data = .))) %>%
select(-data)
aov_models
#> # A tibble: 3 x 2
#> key model
#> <chr> <list>
#> 1 Species <S3: aov>
#> 2 foo_a <S3: aov>
#> 3 foo_b <S3: aov>
From there, you can work with the models however you like. They're accessible in the list aov_models$model. Printed, they look how you'd expect. For example, the first model:
aov_models$model[[1]]
#> Call:
#> aov(formula = Sepal.Length ~ value, data = .)
#>
#> Terms:
#> value Residuals
#> Sum of Squares 63.21213 38.95620
#> Deg. of Freedom 2 147
#>
#> Residual standard error: 0.5147894
#> Estimated effects may be unbalanced
To see all the models, call aov_models$model %>% map(print). You might also want to use broom functions, such as broom::tidy or broom::glance, depending on how you need to present the models.
aov_models$model %>%
map(broom::tidy)
#> [[1]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 63.2 31.6 119. 1.67e-31
#> 2 Residuals 147 39.0 0.265 NA NA
#>
#> [[2]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.281 0.141 0.203 0.817
#> 2 Residuals 147 102. 0.693 NA NA
#>
#> [[3]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.756 0.378 0.548 0.579
#> 2 Residuals 147 101. 0.690 NA NA
Or tidying all the models into a single data frame, which keeps the key column, you could do:
aov_models %>%
mutate(model_tidy = map(model, broom::tidy)) %>%
unnest(model_tidy)
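If you are on a newer tidyr, the same pipeline also works with pivot_longer() in place of the superseded gather(). This is a sketch assuming iris_foo as defined in the question; values_transform coerces the factor columns to character so they can share one value column:
aov_models <- iris_foo %>%
  pivot_longer(-Sepal.Length, names_to = "key", values_to = "value",
               values_transform = list(value = as.character)) %>%
  group_by(key) %>%
  nest() %>%
  mutate(model = map(data, ~ aov(Sepal.Length ~ value, data = .x))) %>%
  select(-data)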
