gtsummary::modify_header() with overall column - r

I would like to modify the column header formatting in a {gtsummary} table with a categorical variable (two levels) and an overall column, as in "With gtsummary, is it possible to have N on a separate row to the column name?". I'm having trouble figuring out the correct gtsummary variable name to access the various column names. My current workaround is to first modify the headers for the categorical levels, then add the overall column and set its header separately. Is there a better way to do this in a single modify_header() call? Reprex with various attempts below.
library(gtsummary)
library(dplyr)
# Shorten trial for examples
trial <- select(trial, trt, age)
# Modify headers
trial %>%
tbl_summary(by = trt) %>%
modify_header(update = all_stat_cols() ~ "**{level}**<br>N = {N}") %>%
as_kable()
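|**Characteristic** |**Drug A**<br>N = 200 |**Drug B**<br>N = 200 |
|:------------------|:---------------------|:---------------------|
|Age                |46 (37, 59)           |48 (39, 56)           |
|Unknown            |7                     |4                     |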
# Adding overall before modifying stats columns doesn't work with {level} or {label}
try(
trial %>%
tbl_summary(by = trt) %>%
add_overall() %>%
modify_header(update = all_stat_cols() ~ "**{level}**<br>N = {N}") %>%
as_kable()
)
#> Error in eval(parse(text = text, keep.source = FALSE), envir) :
#> object 'level' not found
try(
trial %>%
tbl_summary(by = trt) %>%
add_overall() %>%
modify_header(update = all_stat_cols() ~ "**{label}**<br>N = {N}") %>%
as_kable()
)
#> Error in eval(parse(text = text, keep.source = FALSE), envir) :
#> object 'label' not found
# Adding overall before modifying stats columns does work with plain text
trial %>%
tbl_summary(by = trt) %>%
add_overall() %>%
modify_header(update = all_stat_cols() ~ "**THIS WORKS**<br>N = {N}") %>%
as_kable()
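|**Characteristic** |**THIS WORKS**<br>N = 200 |**THIS WORKS**<br>N = 200 |**THIS WORKS**<br>N = 200 |
|:------------------|:-------------------------|:-------------------------|:-------------------------|
|Age                |47 (38, 57)               |46 (37, 59)               |48 (39, 56)               |
|Unknown            |11                        |7                         |4                         |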
# And with {column} but then gives the gtsummary backend column name
trial %>%
tbl_summary(by = trt) %>%
add_overall() %>%
modify_header(update = all_stat_cols() ~ "**{column}**<br>N = {N}") %>%
as_kable()
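|**Characteristic** |**stat_0**<br>N = 200 |**stat_1**<br>N = 200 |**stat_2**<br>N = 200 |
|:------------------|:---------------------|:---------------------|:---------------------|
|Age                |47 (38, 57)           |46 (37, 59)           |48 (39, 56)           |
|Unknown            |11                    |7                     |4                     |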
# Adding overall after modifying stats columns does work, but need to change label separately
trial %>%
tbl_summary(by = trt) %>%
modify_header(update = all_stat_cols() ~ "**{level}**<br>N = {N}") %>%
add_overall(col_label = "**Overall**<br>N = {N}") %>%
as_kable()
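|**Characteristic** |**Overall**<br>N = 200 |**Drug A**<br>N = 200 |**Drug B**<br>N = 200 |
|:------------------|:----------------------|:---------------------|:---------------------|
|Age                |47 (38, 57)            |46 (37, 59)           |48 (39, 56)           |
|Unknown            |11                     |7                     |4                     |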
Created on 2021-08-20 by the reprex package (v2.0.0)

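The issue you are running into is that all_stat_cols(), by default, selects the overall column (stat_0) as well as the by-level columns (stat_1, stat_2), and the overall column has no {level} to substitute. Use all_stat_cols(FALSE) to select only the by-level columns, and refer to stat_0 directly for the overall column. Note also that {n} gives the number of observations in each column, whereas {N} gives the total.
Example below!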
library(gtsummary)
tbl <-
  trial %>%
  select(trt, age) %>%
  tbl_summary(by = trt, missing = "no") %>%
  add_overall() %>%
  modify_header(
    update = list(all_stat_cols(FALSE) ~ "**{level}**<br>N = {n}",
                  stat_0 ~ "**Overall**<br>N = {N}"))
show_header_names(tbl)
#> i As a usage guide, the code below re-creates the current column headers.
#> modify_header(update = list(
#> label ~ "**Characteristic**",
#> stat_0 ~ "**Overall**<br>N = 200",
#> stat_1 ~ "**Drug A**<br>N = 98",
#> stat_2 ~ "**Drug B**<br>N = 102"
#> ))
#>
#>
#> Column Name Column Header
#> ------------ -----------------------
#> label **Characteristic**
#> stat_0 **Overall**<br>N = 200
#> stat_1 **Drug A**<br>N = 98
#> stat_2 **Drug B**<br>N = 102
Created on 2021-08-20 by the reprex package (v2.0.1)

Related

Summarize information by group in data table in R

I'm trying to get multiple summary statistics in R grouped by Team. I used the code below, but the output is not what I want.
Please point me in a better direction. Thanks!
set.seed(77)
data <- data.frame(Team =sample(c("A","B"),30, replace=TRUE),
gender=sample(c("female","male"),30, replace=TRUE),
Age =sample(c(0:100),30, replace=T))
dat <- data %>%
group_by(Team, gender) %>%
dplyr::summarize_all(list(my_mean = mean,
my_sum = sum,
my_sd = sd)) %>%
as.data.frame()
df <- data %>%
group_by(Team) %>%
summarize(total = n(gender),
mean = mean(Age),
Max_Age = max(Age),
Min_Age = min(Age),
sd = sd(Age),
)
I want to get like this pic.
You may need to create one data frame with the summary statistics of Age per Team (age_summary in the example below) and another with the count of team members per gender and Team (gender_summary in the example below), and then merge them into a single data frame (summary_df).
library(tidyverse)
set.seed(77)
data <- data.frame(
Team = sample(c("A", "B"), 30, replace = TRUE),
gender = sample(c("female", "male"), 30, replace = TRUE),
Age = sample(c(0:100), 30, replace = T)
)
age_summary <- data %>%
group_by(Team) %>%
summarize(
mean = mean(Age),
Max = max(Age),
Min = min(Age),
sd = sd(Age)
) %>%
column_to_rownames("Team") %>%
t() %>%
as_tibble(
rownames = "age_summary"
)
gender_summary <- data %>%
group_by(Team) %>%
count(gender) %>%
ungroup() %>%
pivot_wider(names_from = Team, values_from = n)
summary_df <- full_join(
age_summary,
gender_summary
) %>%
mutate(
"item" = if_else(
is.na(gender),
"Age",
"Sex"
)
) %>%
unite("summary", c(age_summary, gender), na.rm = TRUE, remove = FALSE) %>%
relocate(item, .before = 1) %>%
select(-c(age_summary, gender))
# # A tibble: 6 × 4
#   item  summary     A     B
#   <chr> <chr>   <dbl> <dbl>
# 1 Age   mean     45.6  57.8
# 2 Age   Max      92    82
# 3 Age   Min       5    14
# 4 Age   sd       30.1  22.1
# 5 Sex   female    8     9
# 6 Sex   male      7     6
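As a side note on the question's second attempt: dplyr's n() takes no arguments, so n(gender) fails. A minimal sketch of a one-row-per-team summary is shown below, assuming a plain count per team plus a count of one gender is what was intended (the column names total and n_female are illustrative, not from the question).

```r
library(dplyr)

set.seed(77)
data <- data.frame(
  Team   = sample(c("A", "B"), 30, replace = TRUE),
  gender = sample(c("female", "male"), 30, replace = TRUE),
  Age    = sample(0:100, 30, replace = TRUE)
)

team_summary <- data %>%
  group_by(Team) %>%
  summarize(
    total    = n(),                      # n() counts rows in the current group
    n_female = sum(gender == "female"),  # conditional count within the group
    mean     = mean(Age),
    Max_Age  = max(Age),
    Min_Age  = min(Age),
    sd       = sd(Age)
  )

team_summary
```

This keeps teams in rows; the answer above is still the way to go if the statistics should be in rows and the teams in columns.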

fct_reorder by function for only one group

I have a df of public and private schools within counties, and each has an assigned value. I want to use forcats::fct_reorder to rearrange the counties by the median value, but only for the private schools. Using default forcats::fct_reorder arranges by total median, which is less useful for what I'm doing.
Reprex here:
# make df
set.seed(1)
df <-
data.frame(
county = rep(c("Bexar","Travis","Tarrant","Aransas"), each=20),
type = rep(c("public","private"), each=10)
) %>%
mutate(value = case_when(type == "public" ~ runif(80,0,1),
type == "private" ~ runif(80, 0, 10)))
# private values are way higher than public
# relevel by median value
df %>%
mutate(county = forcats::fct_reorder(county, value, .fun=median)) %>%
# this rearranges counties by total median, but I only want to arrange by median of the private schools
# plot
ggplot(aes(x=county, y = value, color = type)) +
geom_point(position = position_dodge(
width=.75
)) +
geom_boxplot(alpha=.5)
Desired output would order them by increasing median of private schools only: Aransas, Travis, Tarrant, Bexar.
thanks!
library(tidyverse)
set.seed(1)
df <-
data.frame(
county = rep(c("Bexar","Travis","Tarrant","Aransas"), each=20),
type = rep(c("public","private"), each=10)
) %>%
mutate(value = case_when(type == "public" ~ runif(80,0,1),
type == "private" ~ runif(80, 0, 10)))
private_medians <-
df %>%
filter(type == "private") %>%
group_by(county) %>%
summarise(median = median(value)) %>%
arrange(median)
private_medians
#> # A tibble: 4 x 2
#> county median
#> <chr> <dbl>
#> 1 Aransas 3.91
#> 2 Travis 4.39
#> 3 Tarrant 5.68
#> 4 Bexar 6.24
# add other counties at the end in case they do not appear in the private subset
levels <- private_medians$county %>% union(df$county %>% unique())
df %>%
mutate(county = county %>% factor(levels = levels)) %>%
ggplot(aes(x=county, y = value, color = type)) +
geom_point(position = position_dodge(
width=.75
)) +
geom_boxplot(alpha=.5)
Created on 2021-10-18 by the reprex package (v2.0.1)
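A more compact variant of the same idea, sketched here with base R's reorder() rather than forcats: mask the public values with NA so that only private schools contribute to the median, and pass na.rm = TRUE through to median(). The helper vector private_value is an illustrative name, and the data is rebuilt with ifelse() instead of case_when() so the block has no package dependencies.

```r
set.seed(1)
df <- data.frame(
  county = rep(c("Bexar", "Travis", "Tarrant", "Aransas"), each = 20),
  type   = rep(c("public", "private"), each = 10)
)
df$value <- ifelse(df$type == "public", runif(80, 0, 1), runif(80, 0, 10))

# Keep only the private values; public rows become NA and are ignored by median()
private_value <- ifelse(df$type == "private", df$value, NA_real_)

# reorder() sorts the levels by the per-county summary of private_value
df$county <- reorder(df$county, private_value, FUN = median, na.rm = TRUE)

levels(df$county)
```

A county with no private schools would get an NA median here and be placed last, which is similar in spirit to the union() step above.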

Predictor importance for PLS model trained with tidymodels

I'm using tidymodels to fit a PLS model but I'm struggling to find the PLS variable importance scores or coefficients.
This is what I've tried so far; the example data is from AppliedPredictiveModeling package.
Model fitting
data(ChemicalManufacturingProcess)
split <- ChemicalManufacturingProcess %>% initial_split(prop = 0.7)
train <- training(split)
test <- testing(split)
tidy_rec <- recipe(Yield ~ ., data = train) %>%
step_knnimpute(all_predictors()) %>%
step_BoxCox(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_predictors())
boots <- bootstraps(times = 25, data = train)
tidy_model <- plsmod::pls(num_comp = tune()) %>%
set_mode("regression") %>%
set_engine("mixOmics")
tidy_grid <- expand.grid(num_comp = seq(from = 1, to = 48, by = 5))
tidy_tune <- tidy_model %>% tune_grid(
preprocessor = tidy_rec,
grid = tidy_grid,
resamples = boots,
metrics = metric_set(mae, rmse, rsq)
)
tidy_best <- tidy_tune %>% select_best("rsq")
Final_model <- tidy_model %>% finalize_model(tidy_best)
tidy_wf <- workflow() %>%
add_model(Final_model) %>%
add_recipe(tidy_rec)
Fit_PLS <- tidy_wf %>% fit(data = train)
# check the most important predictors
tidy_info <- Fit_PLS %>% pull_workflow_fit()
loadings <- tidy_info$fit$loadings$X
PLS variable importance
tidy_load <- loadings %>% as.data.frame() %>% rownames_to_column() %>%
select(rowname, comp1, comp2, comp3) %>%
pivot_longer(-rowname) %>%
rename(predictors = rowname)
tidy_load %>% mutate(Sing = if_else(value < 0, "neg", "pos")) %>%
mutate(absvalue = abs(value)) %>% group_by(predictors) %>% summarise(Importance = sum(absvalue)) %>%
mutate(predictors = fct_reorder(predictors, Importance)) %>%
slice_head(n = 15) %>%
ggplot(aes(Importance, predictors, fill = predictors)) + geom_col(show.legend = F)
Thanks! The vi() function from the vip package is not available for this model.
You can directly tidy() the output of the PLS model to get the coefficients:
library(tidymodels)
library(tidyverse)
library(plsmod)
data(ChemicalManufacturingProcess, package = "AppliedPredictiveModeling")
split <- initial_split(ChemicalManufacturingProcess, prop = 0.7)
train <- training(split)
test <- testing(split)
chem_rec <- recipe(Yield ~ ., data = train) %>%
step_knnimpute(all_predictors()) %>%
step_BoxCox(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_predictors())
pls_spec <- pls(num_comp = 4) %>% ## can tune instead to find the optimal number
set_mode("regression") %>%
set_engine("mixOmics")
wf <- workflow() %>%
add_recipe(chem_rec) %>%
add_model(pls_spec)
pls_fit <- fit(wf, train)
## tidy the fitted model
tidy_pls <- pls_fit %>%
  pull_workflow_fit() %>%
  tidy()
tidy_pls
#> # A tibble: 192 x 4
#>    term                    value type       component
#>    <chr>                   <dbl> <chr>          <dbl>
#>  1 BiologicalMaterial01  0.193   predictors         1
#>  2 BiologicalMaterial01 -0.247   predictors         2
#>  3 BiologicalMaterial01  0.00969 predictors         3
#>  4 BiologicalMaterial01  0.0228  predictors         4
#>  5 BiologicalMaterial03  0.249   predictors         1
#>  6 BiologicalMaterial03 -0.00118 predictors         2
#>  7 BiologicalMaterial03  0.0780  predictors         3
#>  8 BiologicalMaterial03 -0.0866  predictors         4
#>  9 BiologicalMaterial04  0.217   predictors         1
#> 10 BiologicalMaterial04 -0.192   predictors         2
#> # … with 182 more rows
tidy_pls %>%
filter(term != "Y") %>%
group_by(component) %>%
slice_max(abs(value), n = 10) %>%
ungroup() %>%
ggplot(aes(value, fct_reorder(term, value), fill = factor(component))) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, scales = "free_y") +
labs(y = NULL)
Created on 2020-10-19 by the reprex package (v0.3.0.9001)
I showed this without tuning the number of components, but it works about the same with tuning.
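If a single ranking per predictor is wanted, as in the question, one rough option is to sum the absolute loadings across components. This is only a heuristic that mirrors the question's approach, not the formal VIP score. In the sketch below, tidy_pls_demo is a hypothetical stand-in for tidy_pls built from the first rows printed above, so the block runs without fitting the model.

```r
library(dplyr)

# Hypothetical stand-in for `tidy_pls`, same columns as the tidy() output
tidy_pls_demo <- data.frame(
  term      = rep(c("BiologicalMaterial01", "BiologicalMaterial03"), each = 4),
  value     = c(0.193, -0.247, 0.00969, 0.0228,
                0.249, -0.00118, 0.0780, -0.0866),
  type      = "predictors",
  component = rep(1:4, times = 2)
)

importance <- tidy_pls_demo %>%
  filter(type == "predictors") %>%                        # drop outcome loadings
  group_by(term) %>%
  summarise(importance = sum(abs(value)), .groups = "drop") %>%
  arrange(desc(importance))                               # largest score first

importance
```

Sorting with arrange(desc(importance)) before taking the top rows matters: slice_head() on its own keeps the first rows in their existing order, not the highest-scoring ones.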

Mutate a column of models: "Error: Problem with `mutate()` input `model`. x Input `model` must be a vector, not a `lm` object."

I have a data frame with a column that contains model formula definitions. I would like to mutate a new column where each row is a model fitted from the corresponding row's formula.
Some data:
# Set up
library(tidyverse)
library(lubridate)
# Create data
mydf <- data.frame(
cohort = seq(ymd('2019-01-01'), ymd('2019-12-31'), by = '1 days'),
n = rnorm(365, 1000, 50) %>% round,
cohort_cost = rnorm(365, 800, 50)
) %>%
crossing(tenure_days = 0:365) %>%
mutate(activity_date = cohort + days(tenure_days)) %>%
mutate(daily_revenue = rnorm(nrow(.), 20, 1)) %>%
group_by(cohort) %>%
arrange(activity_date) %>%
mutate(cumulative_revenue = cumsum(daily_revenue)) %>%
arrange(cohort, activity_date) %>%
mutate(payback_velocity = round(cumulative_revenue / cohort_cost, 2)) %>%
select(cohort, n, cohort_cost, activity_date, tenure_days, everything())
## wider data
mydf_wide <- mydf %>%
select(cohort, n, cohort_cost, tenure_days, payback_velocity) %>%
group_by(cohort, n, cohort_cost) %>%
pivot_wider(names_from = tenure_days, values_from = payback_velocity, names_prefix = 'velocity_day_')
Now, the final problem code block. It fails on the very last line:
models <- data.frame(
from = mydf$tenure_days %>% unique,
to = mydf$tenure_days %>% unique
) %>%
expand.grid %>%
filter(to > from) %>%
filter(from > 0) %>%
arrange(from) %>%
mutate(mod_formula = paste0('velocity_day_', to, ' ~ velocity_day_', from)) %>%
mutate(model = lm(as.formula(mod_formula), data = mydf_wide))
Error: Problem with mutate() input model.
x Input model must be a vector, not a lm object.
ℹ Input model is lm(as.formula(mod_formula), data = mydf_wide).
If I run the last code block minus the last line and take a look at the resulting data frame 'models' it looks like this:
models %>% head
from to mod_formula
1 1 2 velocity_day_2 ~ velocity_day_1
2 1 3 velocity_day_3 ~ velocity_day_1
3 1 4 velocity_day_4 ~ velocity_day_1
4 1 5 velocity_day_5 ~ velocity_day_1
5 1 6 velocity_day_6 ~ velocity_day_1
6 1 7 velocity_day_7 ~ velocity_day_1
I tried making it a list column, but to do that as far as I'm aware I need to group by. But in this case I need to group by everything. I amended the last code block:
models <- data.frame(
from = mydf$tenure_days %>% unique,
to = mydf$tenure_days %>% unique
) %>%
expand.grid %>%
filter(to > from) %>%
filter(from > 0) %>%
arrange(from) %>%
mutate(mod_formula = paste0('velocity_day_', to, ' ~ velocity_day_', from)) %>%
group_by_all() %>%
nest() %>%
mutate(model = lm(as.formula(mod_formula), data = mydf_wide))
However this results in the same error.
How can I add a new column onto 'models' that contains a linear model for each row based on the formula in field 'mod_formula'?
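lm() is not vectorized, and the lm object it returns is not a vector that mutate() can store directly. Add rowwise() to fit one model per row, and wrap the result in list() so it is stored in a list column.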
library(dplyr)
models <- data.frame(
from = mydf$tenure_days %>% unique,
to = mydf$tenure_days %>% unique
) %>%
expand.grid %>%
filter(to > from) %>%
filter(from > 0) %>%
arrange(from) %>%
mutate(mod_formula = paste0('velocity_day_', to, ' ~ velocity_day_', from)) %>%
rowwise() %>%
mutate(model = list(lm(as.formula(mod_formula), data = mydf_wide)))
models
# from to mod_formula model
# <int> <int> <chr> <list>
#1 1 2 velocity_day_2 ~ velocity_day_1 <lm>
#2 1 3 velocity_day_3 ~ velocity_day_1 <lm>
#3 1 4 velocity_day_4 ~ velocity_day_1 <lm>
#4 1 5 velocity_day_5 ~ velocity_day_1 <lm>
#5 1 6 velocity_day_6 ~ velocity_day_1 <lm>
#6 1 7 velocity_day_7 ~ velocity_day_1 <lm>
#...
#...
You can also use map instead of rowwise.
mutate(model = purrr::map(mod_formula, ~lm(.x, data = mydf_wide)))
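Once the models sit in a list column, summaries can be pulled out per row in the same way. The sketch below uses base R's lapply() and vapply() with mtcars as stand-in data (the formulas and the slope and r_squared column names are illustrative, not from the question), so it runs without the data built above.

```r
# Formula strings, one per row, as in the `mod_formula` column above
models <- data.frame(
  mod_formula = c("mpg ~ wt", "mpg ~ hp", "mpg ~ disp"),
  stringsAsFactors = FALSE
)

# Fit one model per formula; the result is a list, stored as a list column
models$model <- lapply(
  models$mod_formula,
  function(f) lm(as.formula(f), data = mtcars)
)

# Extract one number per model; vapply() checks each result is a single double
models$slope     <- vapply(models$model, function(m) unname(coef(m)[2]), numeric(1))
models$r_squared <- vapply(models$model, function(m) summary(m)$r.squared, numeric(1))

models[, c("mod_formula", "slope", "r_squared")]
```

The same extraction works inside mutate() with purrr::map_dbl() if you prefer to stay in a pipeline.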

Add frequency counts to 2x2 prop.table

How do I add frequency counts to a 2x2 prop.table? So here 'dataset' contains two categorical variables.
dataset %>% prop.table(margin = 2) %>% '*' (100) %>% round(2)
I would like the counts in addition to percentages of each category.
Sorry for the dopey example, but it should look like this, except the sum doesn't need to be reported in every cell.
A reproducible example and solution:
library(dplyr)
tab <- iris %>%
  mutate(size = factor(1 + (Sepal.Length > median(iris$Sepal.Length)), levels = 1:2, labels = c('S', 'L'))) %>%
  select(Species, size) %>%
  table()
prop <- prop.table(tab, margin = 2) %>% '*' (100) %>% round(2)
matrix(paste(tab, prop), nrow = nrow(tab), dimnames = dimnames(tab))
gives
size
Species S L
setosa "50 62.5" "0 0"
versicolor "24 30" "26 37.14"
virginica "6 7.5" "44 62.86"
or another solution:
iris %>% mutate(size = factor(1+(Sepal.Length>median(iris$Sepal.Length)),levels = 1:2, labels = c('S','L'))) %>%
group_by(Species, size) %>%
summarise(n = n()) %>%
group_by(size) %>%
mutate(p = paste(n,round(n/sum(n)*100,2))) %>%
select(-n) %>%
spread(size,p,fill = paste(0,0))
gives
# A tibble: 3 x 3
Species S L
<fct> <chr> <chr>
1 setosa 50 62.5 0 0
2 versicolor 24 30 26 37.14
3 virginica 6 7.5 44 62.86
addmargins applied to your table might do what you want.
set.seed(34)
n <- 20
tab <- table(sample(1:3, n, replace = TRUE), sample(c("A", "B"), n, replace = TRUE))
addmargins(tab)
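The two ideas can be combined: format each cell as "count (percent%)" and append the column totals as one extra row, so the sum is not repeated in every cell. A minimal base R sketch, using mtcars as stand-in data since the original dataset is not shown:

```r
# 2-way table of counts (cylinders by transmission type)
tab <- table(cyl = mtcars$cyl, am = mtcars$am)

# Column percentages, matching margin = 2 in the question
pct <- prop.table(tab, margin = 2) * 100

# One string per cell: count followed by the column percentage
cells <- sprintf("%d (%.1f%%)", as.vector(tab), as.vector(pct))
out <- matrix(cells, nrow = nrow(tab), dimnames = dimnames(tab))

# Append the column totals once, as a final row
out <- rbind(out, Total = as.character(colSums(tab)))

out
```

The result is a character matrix, so it is meant for display rather than further arithmetic.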
