Is it possible to extract, say, a model of class glm from a tidymodels model built with a recipe and logistic_reg() %>% set_engine("glm")?
I'd like to use packages from the easystats project, which require "normal", non-tidy models. The workflow extractor function pull_workflow_fit() returns an object of class c("_glm", "model_fit"), which doesn't seem to be compatible.
I understand I can fit a model using glm() and the same formula as in the recipe, but it seems to me that the fitted parameters differ.
Thanks!
The easystats package suite has supported tidymodels since its most recent updates:
library(tidymodels)
data(two_class_dat)
glm_spec <- logistic_reg() %>%
set_engine("glm")
norm_rec <- recipe(Class ~ A + B, data = two_class_dat) %>%
step_normalize(all_predictors())
workflow() %>%
add_recipe(norm_rec) %>%
add_model(glm_spec) %>%
fit(two_class_dat) %>%
pull_workflow_fit() %>%
parameters::model_parameters()
#> Parameter | Log-Odds | SE | 95% CI | z | p
#> ---------------------------------------------------------------
#> (Intercept) | -0.35 | 0.10 | [-0.54, -0.16] | -3.55 | < .001
#> A | -1.11 | 0.17 | [-1.44, -0.79] | -6.64 | < .001
#> B | 2.80 | 0.21 | [ 2.40, 3.22] | 13.33 | < .001
workflow() %>%
add_recipe(norm_rec) %>%
add_model(glm_spec) %>%
fit(two_class_dat) %>%
pull_workflow_fit() %>%
parameters::model_parameters() %>%
plot()
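(One note on the plot() call above: as far as I know, the plot() methods for easystats objects are provided by the see package, so it needs to be attached first:)
library(see) # assumed requirement: provides plot() methods for model_parameters() output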
workflow() %>%
add_recipe(norm_rec) %>%
add_model(glm_spec) %>%
fit(two_class_dat) %>%
pull_workflow_fit() %>%
parameters::model_parameters() %>%
parameters::print_md()
| Parameter   | Log-Odds |   SE | 95% CI         |     z |      p |
|-------------|---------:|-----:|----------------|------:|-------:|
| (Intercept) |    -0.35 | 0.10 | (-0.54, -0.16) | -3.55 | < .001 |
| A           |    -1.11 | 0.17 | (-1.44, -0.79) | -6.64 | < .001 |
| B           |     2.80 | 0.21 | (2.40, 3.22)   | 13.33 | < .001 |
workflow() %>%
add_recipe(norm_rec) %>%
add_model(glm_spec) %>%
fit(two_class_dat) %>%
pull_workflow_fit() %>%
performance::model_performance()
#> # Indices of model performance
#>
#> AIC | BIC | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP
#> ----------------------------------------------------------------------------------------------
#> 679.950 | 693.970 | 0.460 | 0.362 | 0.925 | 0.426 | -Inf | 0.003 | 0.733
Created on 2021-04-25 by the reprex package (v2.0.0)
You can extract the underlying model object (whether it was created by glm, ranger, keras, or anything else) from a parsnip object using $fit.
library(tidymodels)
data(two_class_dat)
glm_spec <- logistic_reg() %>%
set_engine("glm")
norm_rec <- recipe(Class ~ A + B, data = two_class_dat) %>%
step_normalize(all_predictors())
glm_fit <- workflow() %>%
add_recipe(norm_rec) %>%
add_model(glm_spec) %>%
fit(two_class_dat) %>%
pull_workflow_fit()
What is in that fitted object?
## this is a parsnip object
glm_fit
#> parsnip model object
#>
#> Fit time: 5ms
#>
#> Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#>
#> Coefficients:
#> (Intercept) A B
#> -0.3491 -1.1063 2.7966
#>
#> Degrees of Freedom: 790 Total (i.e. Null); 788 Residual
#> Null Deviance: 1088
#> Residual Deviance: 673.9 AIC: 679.9
## this is a glm object
glm_fit$fit
#>
#> Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#>
#> Coefficients:
#> (Intercept) A B
#> -0.3491 -1.1063 2.7966
#>
#> Degrees of Freedom: 790 Total (i.e. Null); 788 Residual
#> Null Deviance: 1088
#> Residual Deviance: 673.9 AIC: 679.9
Created on 2021-02-04 by the reprex package (v1.0.0)
The fitted parameters will definitely not differ from calling the model directly. If you think you are finding different fitted parameters, then something may be going awry in how you are calling the model (for example, fitting on raw predictors while the recipe normalizes them).
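A quick way to convince yourself, continuing from the code above: refit with plain glm() on predictors normalized by hand and compare coefficients. This is a sketch (manual_dat and manual_fit are illustrative names); scale() stands in for step_normalize(), since both center and scale by the mean and standard deviation:
manual_dat <- two_class_dat
manual_dat$A <- as.numeric(scale(manual_dat$A))
manual_dat$B <- as.numeric(scale(manual_dat$B))
manual_fit <- glm(Class ~ A + B, data = manual_dat, family = binomial)
# the two fits should agree to numerical precision
all.equal(coef(glm_fit$fit), coef(manual_fit))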
I am running a mixed effects model on my dataset:
library(lme4)
data(cake)
Each dataset is a subset of a larger dataset:
subset(cake, recipe=="A")
subset(cake, recipe=="B")
subset(cake, recipe=="C")
I am using dlply to run my mixed effects model on each subset
MxM1 <- plyr::dlply(cake,
"recipe",
function(x)
lmer(angle ~ 1+ (1|replicate)+ temperature,
data=x))
This gives me a list of summaries, one per subset of the data.
I know how to display the summaries one at a time using the gtsummary package:
lm_cake$A %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
lm_cake$B %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
lm_cake$C %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
I am not sure how to combine the results from all 3 objects (lm_cake$A, lm_cake$B, lm_cake$C) to display them as one summary table.
                 Model: A      Model: B      Model: C
Temperature      Beta  SE      Beta  SE      Beta  SE
Temperature.L
Temperature.Q
Temperature.C
Temperature^4
Temperature^5
Any suggestions or help are much appreciated. Thanks.
You can also merge two or more gtsummary tables using the gtsummary::tbl_merge() function. Example below!
library(gtsummary)
#> #StandWithUkraine
library(lme4)
#> Loading required package: Matrix
data(cake)
MxM1 <-
plyr::dlply(
cake,
"recipe",
function(x) {
lmer(angle ~ 1+ (1|replicate)+ temperature, data=x) %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
}
)
# Merge all model summaries together with `tbl_merge()`
tbl <-
MxM1 %>%
tbl_merge(
tab_spanner = c("**A**", "**B**", "**C**")
)
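The merged object prints like any gtsummary table; if you need a gt object for further styling, it can be converted with as_gt() (a small usage sketch):
tbl                   # print the merged table
gtsummary::as_gt(tbl) # convert to gt for additional theming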
Created on 2022-12-17 with reprex v2.0.2
Update:
While the answer by @Daniel D. Sjoberg is perfect and the desired one, here is an answer to the OP's follow-up question from the comments:
"How can I convert the final results from long format to wide, by each recipe?"
After filtering for the temperature terms, we can use pivot_wider() and some tweaking thereafter.
Note that we have to use the broom.mixed package to tidy lmer models.
library(lme4)
library(tidyverse)
#library(broom)
library(broom.mixed)
cake %>%
mutate(recipe = as_factor(recipe)) %>%
group_by(recipe) %>%
group_split() %>%
map_dfr(.f = function(df){
lmer(angle ~ 1 + (1|replicate) + temperature,
data=df) %>%
tidy() %>%
add_column(recipe = unique(df$recipe), .before = 1)
}) %>%
filter(str_detect(term, "temperature")) %>%
select(recipe, term, Beta = estimate, SE = std.error) %>%
pivot_wider(names_from = recipe,
            values_from = c(Beta, SE)) %>%
# swap the prefixes (e.g. "Beta_A" -> "A_Beta") so the columns sort by recipe
rename_with(~ str_replace(., "(.*)_(.*)", "\\2_\\1"), -1) %>%
select(term, sort(colnames(.)))
term A_Beta A_SE B_Beta B_SE C_Beta C_SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 temperature.L 6.43 1.22 6.88 1.16 6.52 1.12
2 temperature.Q -0.713 1.22 -0.946 1.16 0.502 1.12
3 temperature.C -2.33 1.22 0.368 1.16 0.313 1.12
4 temperature^4 -3.35 1.22 -0.328 1.16 -0.214 1.12
5 temperature^5 -0.151 1.22 -0.815 1.16 -1.78 1.12
First answer:
Do you need something like this?
library(lme4)
data(cake)
library(dplyr)
library(purrr)    # map_dfr()
library(forcats)  # as_factor()
library(broom)
library(broom.mixed)
cake %>%
mutate(recipe = as_factor(recipe)) %>%
group_by(recipe) %>%
group_split() %>%
map_dfr(.f = function(df){
lmer(angle ~ 1 + (1|replicate) + temperature,
data=df) %>%
tidy() %>%
add_column(recipe = unique(df$recipe), .before = 1)
})
# A tibble: 24 × 7
recipe effect group term estimate std.error statistic
<fct> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 A fixed NA (Intercept) 33.1 1.42 23.3
2 A fixed NA temperature.L 6.43 1.22 5.26
3 A fixed NA temperature.Q -0.713 1.22 -0.583
4 A fixed NA temperature.C -2.33 1.22 -1.90
5 A fixed NA temperature^4 -3.35 1.22 -2.74
6 A fixed NA temperature^5 -0.151 1.22 -0.124
7 A ran_pars replicate sd__(Intercept) 5.16 NA NA
8 A ran_pars Residual sd__Observation 4.73 NA NA
9 B fixed NA (Intercept) 31.6 1.81 17.5
10 B fixed NA temperature.L 6.88 1.16 5.93
# … with 14 more rows
# ℹ Use `print(n = ...)` to see more rows
I want to preface this with heaps of appreciation for gtsummary -- a wonderful package.
After using tidymodels, GLM, and gtsummary for a while, I've been trying to understand gtsummary's computations for GLM model performance and confidence intervals.
Can anyone (and/or Dr. Sjoberg and the gtsummary team) explain the following two questions?
Question 1: Why are the standard errors different when using broom::tidy() vs. parameters::model_parameters() to extract model results?
(Bold text in the printouts marks the differences.)
library(gtsummary)
library(parameters)
library(rsample)
library(broom)
library(dplyr)  # select()
library(tidyr)  # drop_na()
trial2 <- trial %>%
  select(age, grade, response, trt) %>%
  drop_na()
model_trial2 <- glm(response ~ age + grade + trt,
data = trial2,
family=binomial(link="logit"))
broom::tidy(model_trial2, exponentiate = TRUE)
# # A tibble: 5 × 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 0.184 **0.630** -2.69 0.00715
# 2 age 1.02 0.0114 1.67 0.0952
# 3 gradeII 0.852 **0.395** -0.406 0.685
# 4 gradeIII 1.01 0.385 0.0199 0.984
# 5 trtDrug B 1.13 **0.321** 0.387 0.699
preadmission_model_parameters <- model_trial2 %>% parameters::model_parameters(exponentiate = TRUE)
preadmission_model_parameters
# Parameter | Odds Ratio | SE | 95% CI | z | p
# ---------------------------------------------------------------
# (Intercept) | 0.18 | **0.12** | [0.05, 0.61] | -2.69 | 0.007
# age | 1.02 | 0.01 | [1.00, 1.04] | 1.67 | 0.095
# grade [II] | 0.85 | **0.34** | [0.39, 1.85] | -0.41 | 0.685
# grade [III] | 1.01 | 0.39 | [0.47, 2.15] | 0.02 | 0.984
# trt [Drug B] | 1.13 | **0.36** | [0.60, 2.13] | 0.39 | 0.699
Question 2: (a) What method does gtsummary use to produce confidence intervals? (b) Can the user define (stratified or unstratified) k-fold cross-validation or bootstraps to produce confidence intervals?
(Bold text marks the differences between the bootstrapped confidence intervals from reg_intervals() and the confidence intervals from gtsummary's tbl_regression(), whose method I don't know.)
library(gtsummary)
library(parameters)
library(rsample)
library(broom)
trial2 <- trial %>% select(age, grade, response, trt) %>%
drop_na()
bootstraps(trial2, times = 10)
trial_bootrapped_confidence_intervals <- reg_intervals(response ~ age + grade + trt,
data = trial2,
model_fn = "glm",
keep_reps = TRUE,
family=binomial(link="logit"))
trial_bootrapped_confidence_intervals_exp <- trial_bootrapped_confidence_intervals %>%
select(term:.alpha) %>%
mutate(across(.cols = c(.lower, .estimate, .upper), ~exp(.))) %>%
as_tibble()
trial_bootrapped_confidence_intervals_exp
# # A tibble: 4 × 5
# term .lower .estimate .upper .alpha
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 age 0.997 1.02 1.04 0.05
# 2 gradeII **0.400** 0.846 **1.86** 0.05
# 3 gradeIII 0.473 1.01 2.10 0.05
# 4 trtDrug B 0.600 1.14 2.22 0.05
model_trial2_tbl_regression <-
glm(response ~ age + grade + trt,
data = trial2,
family=binomial(link="logit")) %>%
tbl_regression(
exponentiate = T
) %>%
add_global_p()
model_trial2_tbl_regression_metrics <- model_trial2_tbl_regression$table_body %>%
select(
label,
estimate,
std.error,
statistic,
conf.low ,
conf.high,
p.value
)
model_trial2_tbl_regression_metrics
# A tibble: 8 × 7
# label estimate std.error statistic conf.low conf.high p.value
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Age 1.02 0.0114 1.67 0.997 1.04 0.0909
# 2 Grade NA NA NA NA NA 0.894
# 3 I NA NA NA NA NA NA
# 4 II 0.852 0.395 -0.406 **0.389** **1.85** NA
# 5 III 1.01 0.385 0.0199 0.472 2.15 NA
# 6 Chemotherapy Treatment NA NA NA NA NA 0.699
# 7 Drug A NA NA NA NA NA NA
# 8 Drug B 1.13 0.321 0.387 0.603 2.13 NA
The issue is with the exponentiation (applied because the family is binomial): broom::tidy() does not exponentiate the standard errors, but parameters::model_parameters() does. You can see this by comparing broom::tidy(model_trial2, exponentiate = TRUE) and broom::tidy(model_trial2, exponentiate = FALSE), which return the same standard errors, whereas parameters::model_parameters(exponentiate = TRUE) and parameters::model_parameters(exponentiate = FALSE) return different standard errors. When exponentiate = FALSE for parameters, the standard errors match. This is discussed in "Check exponentiate behavior in tidy methods" (#422).
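A minimal sketch to see this directly, reusing model_trial2 from above (I believe parameters transforms the SE by the delta method when exponentiating, but the comparison itself is the point):
# broom: identical std.error whether or not estimates are exponentiated
broom::tidy(model_trial2, exponentiate = TRUE)$std.error
broom::tidy(model_trial2, exponentiate = FALSE)$std.error
# parameters: the SE column changes with exponentiate
parameters::model_parameters(model_trial2, exponentiate = TRUE)$SE
parameters::model_parameters(model_trial2, exponentiate = FALSE)$SE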
To create a custom tidier for gtsummary, see its FAQ + Gallery.
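If you want the table itself to reflect parameters-style results, one option is to pass a custom tidier via tbl_regression()'s tidy_fun argument. A sketch, assuming the broom.helpers package is installed (its tidy_parameters() wraps parameters::model_parameters()):
glm(response ~ age + grade + trt,
    data = trial2,
    family = binomial(link = "logit")) %>%
  tbl_regression(exponentiate = TRUE,
                 tidy_fun = broom.helpers::tidy_parameters)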
I recently picked up tidymodels after having used R for a few months at school.
I was trying to build my first model using the Titanic dataset from Kaggle, but ran into some issues when fitting the model. Could someone help me?
titanic_rec <- recipe(Survived ~ Sex + Age + Pclass + Embarked + Family_Size + Name, data = titanic_train) %>%
step_impute_knn(all_predictors(), k = 3) %>%
step_dummy(Sex, Pclass, Embarked, Family_Size, Name) %>%
step_interact(~ Sex:Age + Sex:Pclass + Pclass:Age)
log_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
fitted_log_model <- workflow() %>%
add_model(log_model) %>%
add_recipe(titanic_rec) %>%
fit(data = titanic_train) %>%
pull_workflow_fit() %>%
tidy()
Every feature is stored as a factor except Age and Survived, which are doubles. The error seems to appear once I add the fit(data = ...) step onwards:
Error: Can't rename variables in this context. Run `rlang::last_error()` to see where the error occurred.
24. stop(fallback)
23. signal_abort(cnd)
22. abort("Can't rename variables in this context.")
21. eval_select_recipes(to_impute, training, info)
20. impute_var_lists(to_impute = x$terms, impute_using = x$impute_with, training = training, info = info)
19. prep.step_impute_knn(x$steps[[i]], training = training, info = x$term_info)
18. prep(x$steps[[i]], training = training, info = x$term_info)
17. prep.recipe(blueprint$recipe, training = data, fresh = blueprint$fresh)
16. recipes::prep(blueprint$recipe, training = data, fresh = blueprint$fresh)
15. blueprint$mold$process(blueprint = blueprint, data = data)
14. run_mold.recipe_blueprint(blueprint, data)
13. run_mold(blueprint, data)
12. mold.recipe(recipe, data, blueprint = blueprint)
11. hardhat::mold(recipe, data, blueprint = blueprint)
10. fit.action_recipe(action, workflow = workflow, data = data)
9. fit(action, workflow = workflow, data = data)
8. .fit_pre(workflow, data)
7. fit.workflow(., data = titanic_train)
6. fit(., data = titanic_train)
5. is_workflow(x)
4. validate_is_workflow(x)
3. pull_workflow_fit(.)
2. tidy(.)
1. workflow() %>% add_model(log_model) %>% add_recipe(titanic_rec) %>% fit(data = titanic_train) %>% pull_workflow_fit() %>% tidy()
The posted error comes from step_impute_knn(), where the number of neighbors should be specified with the neighbors argument. Secondly, I would advise against using Name as a predictor, since it creates a separate dummy variable for each name, which would mess with the fit.
The final error comes from step_interact(). You can't use step_interact(~ Sex:Age) after step_dummy(Sex) because there won't be any column named Sex once step_dummy() is done; instead there will be Sex_male (since female is part of the intercept). A way to catch all the created dummy variables is to use starts_with() inside step_interact().
library(tidymodels)
titanic_train <- readr::read_csv("your/path/to/data/train.csv")
titanic_train <- titanic_train %>%
mutate(Survived = factor(Survived),
Pclass = factor(Pclass),
Family_Size = SibSp + Parch + 1)
titanic_rec <- recipe(Survived ~ Sex + Age + Pclass + Embarked + Family_Size,
data = titanic_train) %>%
step_impute_knn(all_predictors(), neighbors = 3) %>%
step_dummy(Sex, Pclass, Embarked) %>%
step_interact(~ starts_with("Sex_"):Age +
starts_with("Sex_"):starts_with("Pclass_") +
starts_with("Pclass_"):Age)
log_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
fitted_log_model <- workflow() %>%
add_model(log_model) %>%
add_recipe(titanic_rec) %>%
fit(data = titanic_train) %>%
pull_workflow_fit() %>%
tidy()
fitted_log_model
#> # A tibble: 13 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 3.85 0.921 4.18 0.0000289
#> 2 Age 0.0117 0.0226 0.516 0.606
#> 3 Family_Size -0.226 0.0671 -3.36 0.000769
#> 4 Sex_male -2.22 0.886 -2.50 0.0124
#> 5 Pclass_X2 1.53 1.16 1.31 0.189
#> 6 Pclass_X3 -2.42 0.884 -2.74 0.00615
#> 7 Embarked_Q -0.0461 0.368 -0.125 0.900
#> 8 Embarked_S -0.548 0.243 -2.26 0.0241
#> 9 Sex_male_x_Age -0.0488 0.0199 -2.46 0.0140
#> 10 Sex_male_x_Pclass_X2 -1.28 0.879 -1.46 0.144
#> 11 Sex_male_x_Pclass_X3 1.48 0.699 2.11 0.0347
#> 12 Age_x_Pclass_X2 -0.0708 0.0263 -2.69 0.00714
#> 13 Age_x_Pclass_X3 -0.0341 0.0209 -1.63 0.103
Created on 2021-07-01 by the reprex package (v2.0.0)
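One caveat worth adding: the pipeline above ends with tidy(), so only the coefficient table is kept. To predict on new data later, store the fitted workflow first (a sketch; fitted_wf is an illustrative name):
fitted_wf <- workflow() %>%
  add_model(log_model) %>%
  add_recipe(titanic_rec) %>%
  fit(data = titanic_train)
fitted_wf %>% pull_workflow_fit() %>% tidy() # same coefficient table as above
predict(fitted_wf, new_data = titanic_train) # class predictions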
I want to separate two columns in a stargazer regression table with a vertical line.
So far I have not found a suitable solution, hence my question here.
Here is example code to create a stargazer table with two columns:
mod <- lm(data=iris,Sepal.Length~Species)
mod1 <- lm(data=iris,Sepal.Length~Petal.Width+Species)
stargazer(mod,mod1, type = "latex")
R Markdown renders the standard output with the two model columns side by side, but I would like to have the columns separated by a vertical line.
Can anyone help me with this issue?
I assume I have to use LaTeX code to change the output; I have not found a suitable option in stargazer.
Thanks in advance!
Here is one approach:
mod <- lm(data=iris,Sepal.Length~Species)
mod1 <- lm(data=iris,Sepal.Length~Petal.Width+Species)
mod1_sg <- capture.output(stargazer::stargazer(mod, mod1, type = "text"))
library(stringr)
# splice a "|" separator between the two model columns of the text table
# (the split at characters 44/46 is specific to this table's width)
mod1_sg[6:25] <- paste(str_sub(mod1_sg[6:25], 1, 44), str_sub(mod1_sg[6:25], 46, 68), sep = "|")
mod1_df <- setNames(as.data.frame(noquote(mod1_sg)[-1]),"")
print(mod1_df, row.names=FALSE)
#>
#> ====================================================================
#> Dependent variable:
#> ------------------------------------------------
#> Sepal.Length
#> (1) | (2)
#> --------------------------------------------|-----------------------
#> Petal.Width | 0.917***
#> | (0.194)
#> |
#> Speciesversicolor 0.930*** | -0.060
#> (0.103) | (0.230)
#> |
#> Speciesvirginica 1.582*** | -0.050
#> (0.103) | (0.358)
#> |
#> Constant 5.006*** | 4.780***
#> (0.073) | (0.083)
#> |
#> --------------------------------------------|-----------------------
#> Observations 150 | 150
#> R2 0.619 | 0.669
#> Adjusted R2 0.614 | 0.663
#> Residual Std. Error 0.515 (df = 147) | 0.481 (df = 146)
#> F Statistic 119.265*** (df = 2; 147)|98.525*** (df = 3; 146)
#> ====================================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
# Created on 2021-02-15 by the reprex package (v0.3.0.9001)
UPDATE
For LaTeX output:
mod <- lm(data=iris,Sepal.Length~Species)
mod1 <- lm(data=iris,Sepal.Length~Petal.Width+Species)
mod1_sg <- capture.output(stargazer::stargazer(mod, mod1, type = "latex"))
mod1_sg <- sub("lcc", "lc|c", mod1_sg)  # insert a vertical rule into the tabular column spec
writeLines(mod1_sg)
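A usage note (assuming an R Markdown-to-PDF workflow): emit the patched lines from a chunk with results='asis' so the LaTeX passes through verbatim:
```{r, results='asis'}
writeLines(mod1_sg)
```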
I am running a linear regression of the type below:
y <- lm(x ~ z, data)
I want the coefficient of z fixed at 0.8, and then I want to be able to extract the resulting estimate for z from the model output using the tidy() function. I have had a look at offset(), but I am unable to see the z estimate in the model output, which I need for a summary table. Does it suffice to simply include I(z*0.8)? This would result in the code below:
y <- lm(x ~ I(z*0.8), data)
I would recommend ggeffects. For example:
library(ggeffects)
#> Warning: package 'ggeffects' was built under R version 3.6.2
library(ggplot2)
#> Registered S3 methods overwritten by 'ggplot2':
#> method from
#> [.quosures rlang
#> c.quosures rlang
#> print.quosures rlang
data(efc)
fit <- lm(barthtot ~ c12hour + neg_c_7 + c161sex + c172code, data = efc)
mydf <- ggpredict(fit, terms = c("c12hour [30:80]", "c172code [1,3]"))
mydf
#> # Predicted values of Total score BARTHEL INDEX
#> # x = average number of hours of care per week
#>
#> # c172code = low level of education
#>
#> x | Predicted | 95% CI
#> -------------------------------
#> 30 | 67.15 | [64.04, 70.26]
#> 38 | 65.12 | [62.06, 68.18]
#> 47 | 62.84 | [59.81, 65.88]
#> 55 | 60.81 | [57.78, 63.85]
#> 63 | 58.79 | [55.72, 61.85]
#> 80 | 54.48 | [51.28, 57.68]
#>
#> # c172code = high level of education
#>
#> x | Predicted | 95% CI
#> -------------------------------
#> 30 | 68.58 | [65.42, 71.75]
#> 38 | 66.56 | [63.39, 69.73]
#> 47 | 64.28 | [61.08, 67.47]
#> 55 | 62.25 | [59.01, 65.50]
#> 63 | 60.23 | [56.91, 63.54]
#> 80 | 55.92 | [52.39, 59.45]
#>
#> Adjusted for:
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()
Created on 2020-12-04 by the reprex package (v0.3.0)
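ggeffects also ships its own plot() method for these objects, so a comparable figure is available without writing the ggplot() call by hand:
plot(mydf)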