fit_xy() usage for cross-validation in tidymodels - r

I am new to tidymodels and liking it so far, but I have a question about using a non-formula interface for resampling/cross-validation. The way I understand it so far, in order to apply resampling/cross-validation, I should write a
recipe with a formula (outcome ~ predictors):
rf_rec <-
  recipe(y_graduated ~ .,
         data = trainDat_predSet)
specify a model
# Setting Random Forest Model Specifications
rf_model <-
  rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification") %>%
  set_args(mtry = 3,
           trees = 50,
           min_n = 5)
create folds
set.seed(1234)
trainDatFolds <-
  rsample::vfold_cv(data = trainDat, v = 5)
put recipe and model specification in a workflow
rf_workflow <-
  workflow() %>%
  add_recipe(rf_rec) %>%
  add_model(rf_model)
Then fit the resampling.
rf_workflow %>%
  fit_resamples(resamples = trainDatFolds,
                metrics = metric_set(roc_auc, pr_auc, accuracy),
                control = control_resamples(save_pred = TRUE))
For my purposes, it is far more convenient to use a non-formula interface rather than outcome ~ predictors.
Without the recipe step, if I were fitting a single model rather than resampling, I could easily use fit_xy() to specify y (the outcome) and x (the predictor set).
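For reference, that single-model fit would look something like this (a sketch, assuming trainDat holds the outcome column y_graduated as a factor):
rf_model %>%
  fit_xy(x = dplyr::select(trainDat, -y_graduated),  # predictors only
         y = trainDat$y_graduated)                   # outcome vector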
Is that an option for fitting in resampling?
Thanks a lot!

There is not an x/y interface for resampling, but there is an easy way to get there without a formula:
library(recipes)
rec <- recipe(mtcars)
summary(rec)
#> # A tibble: 11 x 4
#>    variable type    role  source
#>    <chr>    <chr>   <lgl> <chr>
#>  1 mpg      numeric NA    original
#>  2 cyl      numeric NA    original
#>  3 disp     numeric NA    original
#>  4 hp       numeric NA    original
#>  5 drat     numeric NA    original
#>  6 wt       numeric NA    original
#>  7 qsec     numeric NA    original
#>  8 vs       numeric NA    original
#>  9 am       numeric NA    original
#> 10 gear     numeric NA    original
#> 11 carb     numeric NA    original
# now add roles
rec <-
  rec %>%
  update_role(mpg, new_role = "outcome") %>%
  update_role(-mpg, new_role = "predictor")
summary(rec)
#> # A tibble: 11 x 4
#>    variable type    role      source
#>    <chr>    <chr>   <chr>     <chr>
#>  1 mpg      numeric outcome   original
#>  2 cyl      numeric predictor original
#>  3 disp     numeric predictor original
#>  4 hp       numeric predictor original
#>  5 drat     numeric predictor original
#>  6 wt       numeric predictor original
#>  7 qsec     numeric predictor original
#>  8 vs       numeric predictor original
#>  9 am       numeric predictor original
#> 10 gear     numeric predictor original
#> 11 carb     numeric predictor original
Created on 2020-11-06 by the reprex package (v0.3.0)
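Plugging this approach into the question's setup, the rest of the pipeline stays the same (a sketch reusing the object names from the question):
# a sketch: build the recipe without a formula, then reuse the
# question's workflow pieces (object names assumed from the question)
rf_rec_xy <-
  recipe(trainDat) %>%
  update_role(y_graduated, new_role = "outcome") %>%
  update_role(-y_graduated, new_role = "predictor")

rf_workflow <-
  workflow() %>%
  add_recipe(rf_rec_xy) %>%
  add_model(rf_model)

rf_workflow %>%
  fit_resamples(resamples = trainDatFolds)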

Related

Summary statistics for continuous variable by levels of factor variables AND stratified by a second categorical variable

I am trying to generate summary statistics (for a sort of epidemiological Table 1) for a continuous variable (systolic blood pressure, sysbp) by levels of several categorical variables (sex, age category, BMI category, etc.), all stratified by race/ethnicity. However, all the information I can find is about creating summary tables for a categorical outcome by levels of other variables.
Some simplified example data:
set.seed(42)
sex <- sample(c("Male", "Female"), size=100, replace=TRUE)
bmicat <- sample(c("<18.5", "18.5-24", "25-29", ">=30"), size=100, replace=TRUE)
smoker_ever <- sample(c("Ever", "Never"), size=100, replace=TRUE)
agecat <- sample(c("<25", "25-44", "45-64", ">=65"), size=100, replace=TRUE)
race_ethnicity <- sample(c("African", "Hispanic/Latino", "Asian"), size=100, replace=TRUE)
sysbp <- rnorm(n=100, mean=140, sd=10)
bio <- data.frame(sex, bmicat, agecat, smoker_ever, race_ethnicity, sysbp)
So far all I've come up with is manually calculating each mean and SD with stat.desc (from the pastecs package), like this, for each combination of sex/age category/BMI category/smoking status + race/ethnicity:
stat.desc(bio$sysbp[bio$sex == 'Male' & bio$race_ethnicity == 'Hispanic/Latino'], basic = T)
and manually entering the resulting mean and sd into my table, but this is obviously very inefficient.
So the goal is to have a table with columns for each race/ethnicity category and rows for each of the categorical variables listed above, with the summary measures being for sysbp of the entries for each combination of categorical variable level + race/ethnicity category (as well as a "total/overall" column and row). Is there any way to do this simply?
How about this:
library(tidyverse)
data(mtcars)

mtcars %>%
  group_by(am, cyl) %>%
  summarise(across(everything(), list(m = mean, s = sd))) %>%
  pivot_longer(-c("am", "cyl"),
               names_to = c("variable", ".value"),
               names_pattern = "(.*)_([ms])$")
#> `summarise()` has grouped output by 'am'. You can override using the `.groups`
#> argument.
#> # A tibble: 54 × 5
#> # Groups:   am [2]
#>       am   cyl variable      m      s
#>    <dbl> <dbl> <chr>     <dbl>  <dbl>
#>  1     0     4 mpg       22.9   1.45
#>  2     0     4 disp     136.   14.0
#>  3     0     4 hp        84.7  19.7
#>  4     0     4 drat       3.77  0.13
#>  5     0     4 wt         2.94  0.408
#>  6     0     4 qsec      21.0   1.67
#>  7     0     4 vs         1     0
#>  8     0     4 gear       3.67  0.577
#>  9     0     4 carb       1.67  0.577
#> 10     0     6 mpg       19.1   1.63
#> # … with 44 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Created on 2022-10-13 by the reprex package (v2.0.1)
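Adapting that pattern to the question's bio data is mostly a pivot_longer(), so each categorical variable becomes rows (a sketch using the column names from the example data above; not run):
library(tidyverse)

# stack the categorical variables, then summarise sysbp for each
# level within each race/ethnicity group
bio %>%
  pivot_longer(c(sex, bmicat, agecat, smoker_ever),
               names_to = "variable", values_to = "level") %>%
  group_by(race_ethnicity, variable, level) %>%
  summarise(mean = mean(sysbp), sd = sd(sysbp), .groups = "drop")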

How can I unscale and understand glmnet coefficients while using tidymodels?

I'm a bit confused with how I should interpret the coefficients from the elastic net model that I'm getting through tidymodels and glmnet. Ideally, I'd like to produce unscaled coefficients for maximum interpretability.
My issue is that I'm honestly not sure how to unscale the coefficients that the model is yielding because I can't quite figure out what's being done in the first place.
It's a bit tricky for me to post the data one would need to reproduce my results, but here's my code:
library(tidymodels)
library(tidyverse)

# preps data for model
myrecipe <- mydata %>%
  recipe(transactionrevenue ~ sessions + channelgrouping + month + new_user_pct + is_weekend) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(month, channelgrouping, one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  step_normalize(sessions, new_user_pct) %>%
  step_interact(terms = ~ sessions:starts_with("channelgrouping") + new_user_pct:starts_with("channelgrouping"))

# creates the model
mymodel <- linear_reg(penalty = 10, mixture = 0.2) %>%
  set_engine("glmnet", standardize = FALSE)

wf <- workflow() %>%
  add_recipe(myrecipe)

model_fit <- wf %>%
  add_model(mymodel) %>%
  fit(data = mydata)

# posts coefficients
tidy(model_fit)
If it would help, here's some information that might be useful:
The variable that I'm really focusing on is "sessions."
In the model, the coefficient for sessions is 2543.094882, and the intercept is 1963.369782. The penalty is also 10.
The unscaled mean for sessions is 725.2884 and the standard deviation is 1035.381.
I just can't seem to figure out what units the coefficients are in and how/if it's even possible to unscale the coefficients back to the original units.
Any insight would be very much appreciated.
You can use tidy() on a lot of different components of a workflow. The default is to tidy() the model, but you can also get out the recipe and even individual recipe steps. That is where the information you're interested in lives.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip

data(bivariate)

biv_rec <-
  recipe(Class ~ ., data = bivariate_train) %>%
  step_BoxCox(all_predictors()) %>%
  step_normalize(all_predictors())

svm_spec <- svm_linear(mode = "classification")

biv_fit <- workflow(biv_rec, svm_spec) %>% fit(bivariate_train)

## tidy the *model*
tidy(biv_fit)
#> # A tibble: 3 × 2
#>   term  estimate
#>   <chr>    <dbl>
#> 1 A       -1.15
#> 2 B        1.17
#> 3 Bias     0.328

## tidy the *recipe*
extract_recipe(biv_fit) %>%
  tidy()
#> # A tibble: 2 × 6
#>   number operation type      trained skip  id
#>    <int> <chr>     <chr>     <lgl>   <lgl> <chr>
#> 1      1 step      BoxCox    TRUE    FALSE BoxCox_ZRpI2
#> 2      2 step      normalize TRUE    FALSE normalize_DGmtN

## tidy the *recipe step*
extract_recipe(biv_fit) %>%
  tidy(number = 1)
#> # A tibble: 2 × 3
#>   terms  value id
#>   <chr>  <dbl> <chr>
#> 1 A     -0.857 BoxCox_ZRpI2
#> 2 B     -1.09  BoxCox_ZRpI2

## tidy the other *recipe step*
extract_recipe(biv_fit) %>%
  tidy(number = 2)
#> # A tibble: 4 × 4
#>   terms statistic   value id
#>   <chr> <chr>       <dbl> <chr>
#> 1 A     mean      1.16    normalize_DGmtN
#> 2 B     mean      0.909   normalize_DGmtN
#> 3 A     sd        0.00105 normalize_DGmtN
#> 4 B     sd        0.00260 normalize_DGmtN
Created on 2021-08-05 by the reprex package (v2.0.0)
You can read more about tidying a recipe here.
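For the question's recipe specifically, step_normalize() is the fourth step (if I'm counting the steps right), so tidy(number = 4) holds the means and SDs it used. Since a normalized predictor enters the model as (x - mean)/sd, dividing its coefficient by the recorded sd gives the slope per original unit, e.g. 2543.094882 / 1035.381 ≈ 2.46 revenue units per additional session. A sketch, not run against the original data (interaction terms built from the scaled predictor would need the same treatment):
# pull the means/sds recorded by step_normalize() (assumed step 4)
norm_stats <- extract_recipe(model_fit) %>%
  tidy(number = 4) %>%
  pivot_wider(names_from = statistic, values_from = value)

# divide each normalized term's coefficient by its sd; the intercept
# would also shift by -sum(estimate * mean / sd) over those terms
tidy(model_fit) %>%
  left_join(norm_stats, by = c("term" = "terms")) %>%
  mutate(unscaled = ifelse(is.na(sd), estimate, estimate / sd))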

PCA - how to visualize whether all the variables are on different / the same scale

I am working with the uscrime dataset, but this question applies to any well-known dataset like cars.
After googling, I found it extremely useful to standardize my data, considering that PCA finds new directions based on the covariance matrix of the original variables, and the covariance matrix is sensitive to the scaling of the variables.
Nevertheless, I also found: "It is not necessary to standardize the variables, if all the variables are in same scale."
To standardize the variables I am using:
z_uscrime <- (uscrime - mean(uscrime)) / sd(uscrime)
Prior to standardizing my data, how can I check whether all the variables are on the same scale or not?
Proving my point that you can standardize your data however many times you want:
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#>     fixed
#> The following object is masked from 'package:stats':
#>
#>     step

simple_recipe <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(everything()) %>%
  step_scale(everything())
mtcars2 <- simple_recipe %>%
  prep() %>%
  juice()

simple_recipe2 <- recipe(mpg ~ ., data = mtcars2) %>%
  step_center(everything()) %>%
  step_scale(everything())

mtcars3 <- simple_recipe2 %>%
  prep() %>%
  juice()

all.equal(mtcars2, mtcars3)
#> [1] TRUE
mtcars2 %>%
  summarise(across(everything(), .fns = list(mean = ~ mean(.x), sd = ~ sd(.x)))) %>%
  pivot_longer(everything(),
               names_pattern = "(.*)_(.*)",
               names_to = c("stat", ".value"))
#> # A tibble: 11 x 3
#>    stat       mean    sd
#>    <chr>     <dbl> <dbl>
#>  1 cyl   -1.47e-17  1
#>  2 disp  -9.08e-17  1
#>  3 hp     1.04e-17  1
#>  4 drat  -2.92e-16  1
#>  5 wt     4.68e-17  1.00
#>  6 qsec   5.30e-16  1
#>  7 vs     6.94e-18  1.00
#>  8 am     4.51e-17  1
#>  9 gear  -3.47e-18  1.00
#> 10 carb   3.17e-17  1.00
#> 11 mpg    7.11e-17  1
mtcars3 %>%
  summarise(across(everything(), .fns = list(mean = ~ mean(.x), sd = ~ sd(.x)))) %>%
  pivot_longer(everything(),
               names_pattern = "(.*)_(.*)",
               names_to = c("stat", ".value"))
#> # A tibble: 11 x 3
#>    stat       mean    sd
#>    <chr>     <dbl> <dbl>
#>  1 cyl   -1.17e-17     1
#>  2 disp  -1.95e-17     1
#>  3 hp     9.54e-18     1
#>  4 drat   1.17e-17     1
#>  5 wt     3.26e-17     1
#>  6 qsec   1.37e-17     1
#>  7 vs     4.16e-17     1
#>  8 am     4.51e-17     1
#>  9 gear   0.            1
#> 10 carb   2.60e-18     1
#> 11 mpg    4.77e-18     1
Created on 2020-06-07 by the reprex package (v0.3.0)
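As for the actual question of checking scales before standardizing: comparing per-variable standard deviations (or ranges) is usually enough; wildly different values mean the variables are not on a common scale. A sketch using mtcars (the same works for uscrime):
library(tidyverse)

# per-variable standard deviations, largest first
mtcars %>%
  summarise(across(everything(), sd)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "sd") %>%
  arrange(desc(sd))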

Coerce model coefficients to clean, 2-column dataframe

I am fitting an elastic net with cross-validation and I am looking at how big the coefficients are for each predictor:
lambda <- cv.glmnet(x = features_training, y = outcomes_training, alpha = 0)
elnet <- lambda$glmnet.fit
coefs <- coef(elnet, s = lambda$lambda.min, digits = 3)
The coefs variable contains a dgCMatrix:
                        1
(Intercept) -1.386936e-16
ret          4.652863e-02
ind30       -2.419878e-03
spyvol       1.570406e-02
Is there a quick way to turn this into a dataframe with 2 columns (one for the predictor name and the other for the coefficient value)? as.data.frame, as.matrix, or chaining both did not work. I would also like to sort the rows according to the second column.
broom::tidy has a nice method for coercing dgCMatrix objects to long-form data frames (a bit like as.data.frame.table), which works well here:
mod <- glmnet::cv.glmnet(model.matrix(~ ., mtcars[-1]), mtcars$mpg, alpha = 0)
broom::tidy(coef(mod$glmnet.fit, s = mod$lambda.min, digits = 3))
#>            row column        value
#> 1  (Intercept)      1 21.171285892
#> 2          cyl      1 -0.368057153
#> 3         disp      1 -0.005179902
#> 4           hp      1 -0.011713150
#> 5         drat      1  1.053216800
#> 6           wt      1 -1.264212476
#> 7         qsec      1  0.164975032
#> 8           vs      1  0.756163432
#> 9           am      1  1.655635460
#> 10        gear      1  0.546651086
#> 11        carb      1 -0.559817882
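Since the question asks for exactly two columns, sorted, you can drop the constant column index, rename, and arrange (a sketch continuing from the output above):
broom::tidy(coef(mod$glmnet.fit, s = mod$lambda.min)) %>%
  dplyr::select(predictor = row, coefficient = value) %>%
  dplyr::arrange(dplyr::desc(abs(coefficient)))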
Another way, with no hacks through the attributes() function: extract the row names and the matrix values directly. (attributes(class(coefs)) shows that dgCMatrix is a sparse matrix class from the Matrix package.)
data.frame(predict_names = rownames(coefs),
           coef_vals = matrix(coefs))
#    predict_names     coef_vals
# 1    (Intercept)  21.117339411
# 2    (Intercept)   0.000000000
# 3            cyl  -0.371338786
# 4           disp  -0.005254534
# 5             hp  -0.011613216
# 6           drat   1.054768651
# 7             wt  -1.234201216
# 8           qsec   0.162451314
# 9             vs   0.771959823
# 10            am   1.623812912
# 11          gear   0.544171362
# 12          carb  -0.547415029
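And to sort this version by coefficient magnitude, as the question asks (a sketch):
coef_df <- data.frame(predict_names = rownames(coefs),
                      coef_vals = matrix(coefs))
coef_df[order(abs(coef_df$coef_vals), decreasing = TRUE), ]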

How can I use summarise_each for correlations in dplyr?

A dataframe has 20 columns, and I want to find the correlation of column "a" with the rest of the columns.
How can I do it using dplyr?
I know how to do individual correlations such as this:
test %>%
  dplyr::summarize(cor(a, b))
I also know how to use summarise_each for the mean. But how can I do the same for correlations?
Two use cases:
1. Where it calculates correlations with every other column in the dataframe.
2. Where it calculates correlations with columns I mention.
The corrr package uses dplyr as a backend (and so easily works with it) to do just this via correlate() %>% focus():
library(corrr)
mtcars %>%
  correlate() %>%
  focus(mpg)
#> # A tibble: 10 × 2
#>    rowname        mpg
#>    <chr>        <dbl>
#>  1 cyl     -0.8521620
#>  2 disp    -0.8475514
#>  3 hp      -0.7761684
#>  4 drat     0.6811719
#>  5 wt      -0.8676594
#>  6 qsec     0.4186840
#>  7 vs       0.6640389
#>  8 am       0.5998324
#>  9 gear     0.4802848
#> 10 carb    -0.5509251
mtcars %>%
  select(mpg, disp, hp) %>%
  correlate() %>%
  focus(mpg)
#> # A tibble: 2 × 2
#>   rowname        mpg
#>   <chr>        <dbl>
#> 1 disp    -0.8475514
#> 2 hp      -0.7761684
focus() acts like dplyr::select(), except that it excludes any remaining columns from the rows. If interested, take a look at focus_.cor_df() on GitHub here.
I do not quite understand the two use cases (I think you may need the combn function for those), but for:
I want to find the correlation of column "a" with rest of the columns.
you can do something like the following: directly pass column a as one of the parameters to the cor function, and use . to represent each of the remaining columns:
library(dplyr)
df <- data.frame(a = rnorm(5), b = rnorm(5), c = rnorm(5))
df %>% summarise_each(funs(cor(., df$a)), -a)
#           b          c
# 1 0.1997687 -0.3541925
If there are non-numeric columns and you are only interested in the numeric ones, you may need the summarise_if function with the condition is.numeric, in which case only numeric columns will be summarized and their correlation coefficients calculated:
df <- data.frame(a = rnorm(5), b = rnorm(5), c = rnorm(5), d = letters[1:5])
df %>% summarise_if(is.numeric, funs(cor(., df$a)))
#   a         b           c
# 1 1 0.1153882 -0.03117205
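Note that summarise_each(), summarise_if(), and funs() are deprecated or superseded in current dplyr; an equivalent with across() might look like this (a sketch):
library(dplyr)

# correlate every numeric column except a with a itself
df %>%
  summarise(across(where(is.numeric) & !a, ~ cor(.x, a)))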
