Get confidence intervals and exp with broom from nested coxph-models - r

Data and libraries:
library(tidyverse)
library(broom)
library(survival)
test <- tibble(start = c(1,2,5,2,1,7,3,4,8,8),
               age   = c(2,3,6,7,8,9,9,9,14,17),
               event = c(1,1,0,1,1,1,1,0,0,0),
               x     = c(1,0,0,1,0,1,1,1,0,0),
               sex   = c(0,0,0,0,0,1,1,1,1,1))
I want to nest several grouped tibbles and create coxph objects and extract and nest data with tidy and glance (from broom package). In the tidy output I also want the data to be exponentiated and with confidence intervals. This works:
coxph_obj <- (coxph(Surv(start, event) ~ x + sex + age, test))
tidy(coxph_obj, exponentiate = TRUE, conf.int = TRUE)
However, I don't know how to get exponentiate = TRUE, conf.int = TRUE to work in tidied = map(fit, tidy) below:
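test %>%
  nest(data = -sex) %>%
  mutate(
    # fit on each group's own rows (.x); sex is constant within a group, so it is dropped from the formula
    # (with only five rows per group, coxph() may warn about convergence)
    fit = map(data, ~ coxph(Surv(start, event) ~ x + age, data = .x)),
    tidied = map(fit, tidy),
    glanced = map(fit, glance)
  ) %>%
  unnest(c(tidied, glanced), names_repair = "universal")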

Answer provided by Ben in a comment:
"What does using tidied = map(fit, tidy, exponentiate = TRUE, conf.int = TRUE) give you in your mutate?"
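The comment relies on map() forwarding any extra arguments to the function it calls, so exponentiate and conf.int reach tidy() for every model. A minimal sketch of that idea follows. It uses the lung data shipped with survival instead of the question's small tibble (an assumption made only so that each group has enough rows to fit), and the covariates age and ph.ecog are chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(survival)

# One Cox model per sex; extra arguments after `tidy` are passed on to tidy()
res <- lung %>%
  nest(data = -sex) %>%
  mutate(
    fit    = map(data, ~ coxph(Surv(time, status) ~ age + ph.ecog, data = .x)),
    tidied = map(fit, tidy, exponentiate = TRUE, conf.int = TRUE)
  ) %>%
  select(sex, tidied) %>%
  unnest(tidied)

# estimate is now a hazard ratio, with conf.low / conf.high on the same scale
res
```

The lambda form map(fit, ~ tidy(.x, exponentiate = TRUE, conf.int = TRUE)) should give the same result.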

Related

Change `gtsummary::tbl_regression` columns

I would like to reformat the column in gtsummary::tbl_regression similar to tbl_summary using the statistic argument. However, I cannot find the corresponding argument to make this adjustment. Thank you for your help pointing me to the argument!
For example, instead of:
library(dplyr)
library(gtsummary)
glm(response ~ age, trial, family = binomial(link = "logit")) %>%
  tbl_regression(exponentiate = TRUE)
Created on 2021-07-13 by the reprex package (v0.3.0)
I would like:
Characteristic    OR (95% CI; p value)
Age               1.02 (1.00,1.04; 0.10)
You can merge columns in gtsummary, but this feature is not documented for users because it is still being thought out, and its implementation may change slightly in a future release. Example below!
library(gtsummary)
glm(response ~ age, trial, family = binomial(link = "logit")) %>%
  tbl_regression(exponentiate = TRUE) %>%
  # merge the estimate, CI and p-value into the estimate column
  modify_table_styling(
    columns = estimate,
    rows = !is.na(estimate),
    cols_merge_pattern = "{estimate} ({ci}; {p.value})",
    label = "**OR (95% CI; p value)**"
  ) %>%
  modify_footnote(estimate ~ "OR = Odds Ratio, CI = Confidence Interval",
                  abbreviation = TRUE)

How to combine odds ratios and confidence intervals

I am trying to combine the ORs and confidence intervals in one column, so that a cell reads, for example, 1.10 (0.52, 2.29).
library(gtsummary)
trial %>%
  select(response, grade) %>%
  tbl_uvregression(
    method = glm,
    y = response,
    method.args = list(family = binomial),
    exponentiate = TRUE
  )
You can use the modify_table_styling() function to merge two or more columns. Example below!
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.4.0'
tbl <-
  trial %>%
  select(response, grade) %>%
  tbl_uvregression(
    method = glm,
    y = response,
    method.args = list(family = binomial),
    exponentiate = TRUE
  ) %>%
  modify_table_styling(
    columns = estimate,
    rows = !is.na(ci),
    cols_merge_pattern = "{estimate} ({ci})"
  ) %>%
  modify_header(estimate ~ "**OR (95% CI)**") %>%
  modify_footnote(estimate ~ "OR = Odds Ratio, CI = Confidence Interval",
                  abbreviation = TRUE)
Created on 2021-05-03 by the reprex package (v2.0.0)
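To make clear what the merged cell contains, the same "estimate (CI)" string can be built by hand. The sketch below does not use gtsummary at all: it is an illustration with broom and sprintf() on the built-in mtcars data, and the model am ~ wt and the column name or_ci are assumptions chosen for the example.

```r
library(broom)

# Logistic regression on a built-in data set (illustrative model only)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Odds ratios with confidence limits on the exponentiated scale
or_tbl <- tidy(fit, exponentiate = TRUE, conf.int = TRUE)

# Paste estimate and interval into one "OR (lower, upper)" string
or_tbl$or_ci <- sprintf("%.2f (%.2f, %.2f)",
                        or_tbl$estimate, or_tbl$conf.low, or_tbl$conf.high)

or_tbl[, c("term", "or_ci")]
```

This only reproduces the cell text; the gtsummary answer above is still what produces a formatted table.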

How to use augment with a model on new data

It is fairly straightforward to use the augment function from the broom package in R to add predictions back into a tibble. Viz.
df <- iris %>%
  nest(data = everything()) %>%
  mutate(model = map(data, function(x) lm(Sepal.Length ~ Sepal.Width, data = x)),
         pred = map2(model, data, ~augment(.x, newdata = .y))) %>%
  unnest(pred)
However, when I take a linear model trained on one set of data and try and predict on new data I receive the following error.
mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)
df2 <- iris %>%
  mutate(Sepal.Width = Sepal.Width + rnorm(1)) %>%
  nest(data = everything()) %>%
  mutate(pred = map2(mod, data, ~augment(.x, newdata = .y)))
# Error: Problem with `mutate()` input `pred`.
# x No augment method for objects of class numeric
# i Input `pred` is `map2(mod, data, ~augment(.x, newdata = .y))`.
How should I use augment to fit new data? Is using an external model object (in the example above this is mod) the best practice or is there a more elegant way?
Since there is only one model we can do this without using map.
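library(dplyr)
# mod is the lm object fitted in the question
# Note: in dplyr >= 1.1.0, summarise() warns when a result has more than one row; reframe() is the replacement
df1 <- iris %>%
  mutate(Sepal.Width = Sepal.Width + rnorm(1)) %>%
  tidyr::nest(data = everything()) %>%
  summarise(pred = broom::augment(mod, newdata = data[[1]]),
            mod = list(mod),
            data = data)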
Having just posted the question, I think I have an answer. I won't accept the answer for 48 hours just in case someone contradicts or provides a more comprehensive one.
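In the example, map2 iterates over its first argument element by element. An lm object is itself a list, so passing mod directly makes map2 loop over the model's components (the first is the numeric coefficient vector), which is why augment complains about class numeric. Putting mod into the tibble as a list column makes map2 see a single model, which removes the error and calculates the predictions correctly.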
mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)
df2 <- iris %>%
  mutate(Sepal.Width = Sepal.Width + rnorm(1)) %>%
  nest(data = everything()) %>%
  mutate(mod = list(mod)) %>% #! this is the additional step
  mutate(pred = map2(mod, data, ~augment(.x, newdata = .y))) %>%
  unnest(pred)
Alternatively, wrap the external model object in a list inside the call.
...
mutate(pred = map2(list(mod), data, ~augment(.x, newdata = .y))) %>%
...
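The cause of the error can be checked directly. The sketch below shows that an lm object is a list whose first component is numeric, and that augment() accepts new data without any nesting; the data frame new_data is made up for illustration.

```r
library(broom)

mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# An lm object is a list, so map2(mod, ...) would iterate over its components
is.list(mod)
class(mod[[1]])   # the coefficient vector, which has no augment() method

# With a single model, augment() can be called on the new data directly
new_data <- data.frame(Sepal.Width = c(2.5, 3.0, 3.5))  # hypothetical new observations
preds <- augment(mod, newdata = new_data)
preds
```

When there is only one model and one data set, the direct call is probably the simplest option; the list-column approach becomes useful once several models or data sets sit in the same tibble.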

Fit models with robust standard errors

I am using the following R code to run several linear regression models and extract results to dataframe:
library(tidyverse)
library(broom)
data <- mtcars
outcomes <- c("wt", "mpg", "hp", "disp")
exposures <- c("gear", "vs", "am")
models <- expand.grid(outcomes, exposures) %>%
  group_by(Var1) %>% rowwise() %>%
  summarise(frm = paste0(Var1, "~factor(", Var2, ")")) %>%
  group_by(model_id = row_number(), frm) %>%
  do(tidy(lm(.$frm, data = data))) %>%
  mutate(lci = estimate - (1.96 * std.error),
         uci = estimate + (1.96 * std.error))
How can I modify my code to use robust standard errors similar to Stata?
* example of using robust standard errors in Stata
regress y x, robust
There is a comprehensive discussion of robust standard errors for lm models on Stack Exchange.
You can update your code in the following way:
library(sandwich)
models <- expand.grid(outcomes, exposures) %>%
  group_by(Var1) %>% rowwise() %>%
  summarise(frm = paste0(Var1, "~factor(", Var2, ")")) %>%
  group_by(model_id = row_number(), frm) %>%
  do(cbind(
    tidy(lm(.$frm, data = data)),
    robSE = sqrt(diag(vcovHC(lm(.$frm, data = data), type = "HC1")))
  )) %>%
  mutate(
    lci = estimate - (1.96 * std.error),
    uci = estimate + (1.96 * std.error),
    lciR = estimate - (1.96 * robSE),
    uciR = estimate + (1.96 * robSE)
  )
The important line is this:
sqrt(diag(vcovHC(lm(.$frm, data = data), type="HC1")))
The function vcovHC returns the covariance matrix of the coefficients. You need to extract the variances on its diagonal with diag and take their square root with sqrt.
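The same calculation for a single model, outside the pipeline, may be easier to follow. This is a minimal sketch on one of the formulas from the question (mpg ~ factor(gear)); the names rob_se and res are chosen for illustration, and the 1.96 multiplier is the same normal approximation the question uses.

```r
library(sandwich)

fit <- lm(mpg ~ factor(gear), data = mtcars)

# Heteroskedasticity-consistent covariance matrix, HC1 variant
vc <- vcovHC(fit, type = "HC1")

# Robust standard errors: square roots of the diagonal variances
rob_se <- sqrt(diag(vc))

res <- data.frame(
  term     = names(coef(fit)),
  estimate = unname(coef(fit)),
  robSE    = unname(rob_se)
)
res$lciR <- res$estimate - 1.96 * res$robSE
res$uciR <- res$estimate + 1.96 * res$robSE
res
```

Fitting the model once and reusing fit also avoids calling lm() twice per formula, as the pipeline version does.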

Iterating through slightly different models in purrr

I have the following code comparing the rmse of models that differ only in the polynomial term.
library(tidyverse)
library(modelr)  # crossv_kfold() and rmse(); not attached by library(tidyverse)
data(mtcars)
cv_mtcars = mtcars %>%
  crossv_kfold(k = 10)
cv_mtcars %>%
  mutate(model1 = map(train, ~lm(disp ~ wt, data = .)),
         model2 = map(train, ~lm(disp ~ I(wt^2), data = .)),
         model3 = map(train, ~lm(disp ~ I(wt^3), data = .)),
         model4 = map(train, ~lm(disp ~ I(wt^4), data = .)),
         model5 = map(train, ~lm(disp ~ I(wt^5), data = .)),
         model6 = map(train, ~lm(disp ~ I(wt^6), data = .)),
         order1 = map2_dbl(model1, test, ~rmse(.x, .y)),
         order2 = map2_dbl(model2, test, ~rmse(.x, .y)),
         order3 = map2_dbl(model3, test, ~rmse(.x, .y)),
         order4 = map2_dbl(model4, test, ~rmse(.x, .y)),
         order5 = map2_dbl(model5, test, ~rmse(.x, .y)),
         order6 = map2_dbl(model6, test, ~rmse(.x, .y))) %>%
  select(order1, order2, order3, order4, order5, order6) %>%
  gather(1:6, key = model, value = value) %>%
  ggplot() +
  geom_point(aes(x = factor(model), y = value)) +
  labs(y = "rmse", x = "polynomial", title = "Model Assessment", subtitle = "disp~I(wt^x)")
Is there a more efficient way to iterate through my models? I feel like I am writing more code than I need to.
You can iterate through the models with an outer call to map to iterate over the polynomial orders and an inner call to map to iterate over the 10 folds. In the code below, I've used poly(wt, i) instead of I(wt^i), because I(wt^i) generates a polynomial with only the highest-order term, while poly(wt, i) generates a polynomial with terms of all orders up to the highest order. I've saved the rmse for each fold in the model_cv object, but you can, of course, pipe it directly into ggplot instead.
set.seed(50)
model_cv = setNames(1:6, 1:6) %>%
  map_df(function(i) {
    # inner loop: fit on each training fold, score on the matching test fold
    map2_dbl(cv_mtcars[["train"]], cv_mtcars[["test"]], function(train, test) {
      model = lm(disp ~ poly(wt, i), data = train)
      rmse(model, test)
    })
  }) %>%
  gather(`Polynomial Order`, rmse)
ggplot(model_cv, aes(`Polynomial Order`, rmse)) +
  geom_point() +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = "point", pch = "_", colour = "red", size = 7) +
  labs(title = "Model Assessment", subtitle = "disp ~ poly(wt, order)")
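The nested iteration can also be written as a single pass over a grid of (polynomial degree, fold) pairs. The sketch below is an alternative under stated assumptions, not the answer's method: it builds the folds by hand instead of using modelr, uses 5 folds rather than 10, and the helper rmse_for() and the object names are hypothetical.

```r
library(purrr)

set.seed(50)
k <- 5
# Assign each row of mtcars to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Fit a polynomial of the given degree on all folds but one, score on the held-out fold
rmse_for <- function(degree, fold) {
  train <- mtcars[folds != fold, ]
  test  <- mtcars[folds == fold, ]
  fit   <- lm(disp ~ poly(wt, degree), data = train)
  sqrt(mean((test$disp - predict(fit, newdata = test))^2))
}

# Every combination of degree and fold, evaluated in one map2_dbl() call
grid <- expand.grid(degree = 1:6, fold = seq_len(k))
grid$rmse <- map2_dbl(grid$degree, grid$fold, rmse_for)

# Mean RMSE per polynomial degree
mean_rmse <- tapply(grid$rmse, grid$degree, mean)
mean_rmse
```

Keeping the results in one long data frame means grid can be passed to ggplot directly, with degree on the x axis and rmse on the y axis.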

Resources