Multiple comparisons with gtsummary - r

Since my question is similar to one that's been asked before, I'll steal the reprex (also below), for consistency's sake, from
Summary Table (mean + std.error) with p-values for 2-way anova
I'm curious how to integrate a post hoc means comparison (i.e. multcomp) and display letter groupings, like those the compact letter display function cld() would provide, directly in the gtsummary table.
Check out this table as an example of what I'm trying to achieve. But ideally, I'd like to use superscripts to denote letter groupings:
Wine grape example
library(gtsummary)
library(titanic)
library(tidyverse)
library(plotrix) #has a std.error function
packageVersion("gtsummary")
#> [1] '1.4.0'
# create smaller version of the dataset
df <-
titanic_train %>%
select(Sex, Embarked, Age, Fare) %>%
filter(Embarked != "") # deleting empty Embarked status
# first, write a little function to get the 2-way ANOVA p-values in a table
# function to get 2-way ANOVA p-values in tibble
twoway_p <- function(variable) {
paste(variable, "~ Sex * Embarked") %>%
as.formula() %>%
aov(data = df) %>%
broom::tidy() %>%
select(term, p.value) %>%
filter(complete.cases(.)) %>%
pivot_wider(names_from = term, values_from = p.value) %>%
mutate(
variable = .env$variable,
row_type = "label"
)
}
# add all results to a single table (will be merged with gtsummary table in next step)
twoway_results <-
bind_rows(
twoway_p("Age"),
twoway_p("Fare")
)
twoway_results
#> # A tibble: 2 x 5
#> Sex Embarked `Sex:Embarked` variable row_type
#> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 0.00823 3.97e- 1 0.611 Age label
#> 2 0.0000000191 4.27e-16 0.0958 Fare label
tbl <-
# first build a stratified `tbl_summary()` table to get summary stats by two variables
df %>%
tbl_strata(
strata = Sex,
.tbl_fun =
~.x %>%
tbl_summary(
by = Embarked,
missing = "no",
statistic = all_continuous() ~ "{mean} ({std.error})",
digits = everything() ~ 1
) %>%
modify_header(all_stat_cols() ~ "**{level}**")
) %>%
# merge the 2way ANOVA results into tbl_summary table
modify_table_body(
~.x %>%
left_join(
twoway_results,
by = c("variable", "row_type")
)
) %>%
# by default the new columns are hidden, add a header to unhide them
modify_header(list(
Sex ~ "**Sex**",
Embarked ~ "**Embarked**",
`Sex:Embarked` ~ "**Sex * Embarked**"
)) %>%
# adding spanning header to analysis results
modify_spanning_header(c(Sex, Embarked, `Sex:Embarked`) ~ "**Two-way ANOVA p-values**") %>%
# format the p-values with a pvalue formatting function
modify_fmt_fun(c(Sex, Embarked, `Sex:Embarked`) ~ style_pvalue) %>%
# update the footnote to be nicer looking
modify_footnote(all_stat_cols() ~ "Mean (SE)")

Related

Stratified Table 1 using a svydesign object and tbl_svysummary?

I'm trying to create a Table 1 for NHANES survey data, first stratified by a binary variable for obese vs non-obese, then stratified again by a binary variable for control/trt group status ("wlp_yn"). I want to get counts (%) for categorical characteristics and means (SE) for continuous baseline characteristics. For these counts and means, I am trying to get p-values as well.
I've tried using tbl_svysummary(), svyby(), tbl_strata(), and CreateTableOne() without any success.
In the code below, I subset the full dataset into a smaller dataset of only control group data ("obese_adults") to divide up the table first. I am also starting out with age for the characteristics ("age_group" is categorical version of "RIDAGEYR" continuous variable). I couldn't figure it out, but I'm curious if there's another way to code this?
add_p_svysummary_ex1 <-
obese_adults %>%
tbl_svysummary(by = wlp_yn, percent = "row", include = c(age_group, RIDAGEYR),
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>%
add_p()
add_p_svysummary_ex1
svyby(~RIDAGEYR, ~age_group+wlp_yn, obese_adults, svymean) # avg age of each age group
Thanks in advance! Would really appreciate any help.
Edit: This is a simplified version of the code for reproducibility
# DEMO
demo <- nhanes('DEMO')
demo_vars <- names(demo)
demo2 <- nhanesTranslate('DEMO', demo_vars, data = demo)
# PRESCRIPTION MEDICATIONS
rxq_rx <- nhanes('RXQ_RX')
rxq_rx_vars <- names(rxq_rx)
rxq_rx2 <- nhanesTranslate('RXQ_RX', rxq_rx_vars, data = rxq_rx)
rxq_rx2 <- rxq_rx2 %>% select("SEQN", "RXD240B") %>% filter(!is.na(RXD240B)) %>% group_by(SEQN) %>% dplyr::summarise(across(everything(), ~toString(na.omit(.))))
nhanesAnalysis = join_all(list(demo2, rxq_rx2), by = "SEQN", type = "full")
# Reconstructing survey weights for combining 1999-2018 - Combining ten survey cycles (twenty years)
nhanesAnalysis$wtint20yr <- ifelse(nhanesAnalysis$SDDSRVYR %in% c(1,2), (2/10 * nhanesAnalysis$WTINT4YR), # for 1999-2002
(1/10 * nhanesAnalysis$WTINT2YR)) # for 2003-2018
# sample weights
nhanesDesign <- svydesign(id = ~SDMVPSU,
strata = ~SDMVSTRA,
weights = ~wtint20yr,
nest = TRUE,
data = nhanesAnalysis)
# subset
obese_adults <- subset(nhanesDesign, (obesity == 1 & !is.na(BMXBMI) & RIDAGEYR >= 60))
Is this what you are looking for? A double split, shown here on dummy data:
library(gtsummary)
library(tidyverse)
data(mtcars)
mtcars %>%
select(am, cyl, hp, vs) %>%
dplyr::mutate(
vs = factor(vs, labels = c("Obese", "Non-Obese")),
am = factor(am, labels = c("Control", "Treatment")),
cyl = paste(cyl, "Cylinder")
) %>%
tbl_strata(
strata = vs,
~.x %>%
tbl_summary(
by = am,
type = where(is.numeric) ~ "continuous"
) %>%
modify_header(all_stat_cols() ~ "**{level}**")
)
I'm not sure why you would like to use tbl_svysummary() here; it's made for data with survey weights.
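If survey weights do matter for the means and standard errors, svyby() from the survey package (which the question already tries) returns a weighted mean and SE for every combination of two grouping variables. The sketch below uses the api example data that ships with the survey package as a stand-in for NHANES; the design and variable names come from that example dataset, not from the question.

```r
library(survey)

# Example data bundled with the survey package (stand-in for NHANES)
data(api)

# One-stage cluster sample design, as in the survey package examples
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Weighted mean and standard error of api00 for each stype x sch.wide cell
means_by_cell <- svyby(~api00, ~stype + sch.wide, dclus1, svymean)
means_by_cell
```

Each row of the result is one cell of the two-way split, with the weighted mean in the api00 column and its standard error in the se column.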

R: Create summary table using T1 data for all groups, and T2 data for one group only

I am trying to create a summary table in R of baseline (T1) scores for all participants, grouped by another column with three levels (Group1, Group2, Group3), as well as outcome scores for Group3 only, as Group1 and Group2 do not have this data. I would like the table to look something like this (I have T1 and T2 there as headers above headers but I can't figure out how to do this here):
          T1                            T2
          Group 1   Group 2   Group 3   Group 1
measure1  Score     Score     Score     Score
measure2  Score     Score     Score     Score
measure3  Score     Score     Score     Score
measure4  Score     Score     Score     Score
My data are currently in wide format, but I've also transformed them into long format to see if it would be achievable this way, with no success yet using any method I've chosen.
My variables in wide format would be: group, measure1_t1, measure2_t1, measure3_t1, measure4_t1, measure1_t2, measure2_t2, measure3_t2, measure4_t2.
In long format, these would be: group, time, measure1, measure2, measure3, measure4.
Would anyone have any advice on how I could achieve this? I can't seem to get it without including columns for group2 and group3 for the measures for T2. So far, I've tried using gtsummary and dplyr::summarise with no success, but I'm open to using other packages/functions.
Alternatively, if there's a way to combine two tables to achieve this instead of doing one table only I'm happy to explore that option
Thanks
The {gtsummary} package exports a function tbl_strata() just for this purpose. https://www.danieldsjoberg.com/gtsummary/reference/tbl_strata.html
tbl_strata_ex1 <-
trial %>%
select(age, grade, stage, trt) %>%
mutate(grade = paste("Grade", grade)) %>%
tbl_strata(
strata = grade,
.tbl_fun =
~ .x %>%
tbl_summary(by = trt, missing = "no") %>%
add_n(),
.header = "**{strata}**, N = {n}"
)
I was able to achieve this by creating two separate tables with tbl_summary() and using tbl_merge() to bring them together.
First table:
# Filtering dataset to include only T1 data, and selecting relevant variables.
t1_prep <- df %>%
dplyr::filter(time == "t1") %>%
dplyr::select(group, measure1, measure2, measure3, measure4)
#Creating the summary table
t1_sum <- t1_prep %>%
tbl_summary(by = group,
statistic = list(all_continuous() ~ "{mean} ({sd})"),
missing_text = "Missing"
) %>%
add_overall() %>%
add_n() %>%
modify_header(label ~ "**Variable**") %>%
modify_footnote(
all_stat_cols() ~ "Mean (SD)"
)
Second table
# Filtering dataset, this time to specify at T2, and group 1 only. In select, no longer choose the group variable as this is no longer wanted.
t2_prep <- df %>%
dplyr::filter(time == "t2", group == "Group1") %>%
dplyr::select(measure1, measure2, measure3, measure4)
t2_sum <- t2_prep %>%
tbl_summary(by = NULL,
statistic = list(all_continuous() ~ "{mean} ({sd})"),
missing_text = "Missing") %>%
add_n() %>%
#Note that must specify the group and number included in that group below for consistency when merging the tables - this would only state N = 65 otherwise
modify_header(label ~ "**Variable**", all_stat_cols() ~ "**Completer** N = 65") %>%
modify_footnote(
all_stat_cols() ~ "Mean (SD)"
)
Merging
t1_t2_merge <-
tbl_merge(
list(t1_sum, t2_sum),
tab_spanner = c("**Pre-Intervention**", "**Post-Intervention**")
) %>%
as_gt() %>% # this and the lines below are not necessary, but I wanted to add styling
gt::tab_options(table.font.names = "Times New Roman") %>%
gt::opt_row_striping()
t1_t2_merge
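Since the question did not include data, here is a self-contained sketch of the same two-table approach on simulated long-format data. The data frame, its values, and the object names (t1_tbl, t2_tbl, merged) are made up for illustration; only two measures are used to keep it short, and the T2 group is set to Group1 to match the filter in the answer above.

```r
library(dplyr)
library(gtsummary)

set.seed(1)
# Simulated long-format data: 3 groups x 2 time points, 20 rows per cell
df <- data.frame(
  group = rep(c("Group1", "Group2", "Group3"), each = 20, times = 2),
  time = rep(c("t1", "t2"), each = 60),
  measure1 = rnorm(120, mean = 50, sd = 10),
  measure2 = rnorm(120, mean = 30, sd = 5)
)

# T1: every group, one statistic column per group
t1_tbl <- df %>%
  filter(time == "t1") %>%
  select(group, measure1, measure2) %>%
  tbl_summary(by = group, statistic = all_continuous() ~ "{mean} ({sd})")

# T2: one group only, so no `by` variable and a single statistic column
t2_tbl <- df %>%
  filter(time == "t2", group == "Group1") %>%
  select(measure1, measure2) %>%
  tbl_summary(statistic = all_continuous() ~ "{mean} ({sd})")

# Place the two tables side by side under spanning headers
merged <- tbl_merge(
  tbls = list(t1_tbl, t2_tbl),
  tab_spanner = c("**T1**", "**T2**")
)
merged
```

tbl_merge() lines the rows up by variable, so both tables need to summarise the same measures for the columns to match.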

How to use gtsummary::tbl_svysummary() to display confidence intervals for levels of a factor variable?

I am using survey data from the National Electronic Injury Surveillance System (https://www.cpsc.gov/Research--Statistics/NEISS-Injury-Data) to research trends in consumer product injuries.
Using gtsummary and tbl_svysummary(), my goal is to create a descriptive table of summary measures of injuries. Since this is survey data, I want to display the 95% confidence interval associated with each summary measure.
This previous post provides a solution to generating confidence intervals for two level factor variables (Using (gtsummary) tbl_svysummaary function to display confidence intervals for survey.design object?), however, I am looking for a solution to produce confidence intervals for factor variables with >=2 levels.
I am borrowing a reproducible example from the previous post:
library(gtsummary)
library(survey)
svy_trial <-
svydesign(~1, data = trial %>% select(trt, response, death), weights = ~1)
ci <- function(variable, by, data, ...) {
svyby(as.formula( paste0( "~" , variable)) , by = as.formula( paste0( "~" , by)), data, svyciprop, vartype="ci") %>%
tibble::as_tibble() %>%
dplyr::mutate_at(vars(ci_l, ci_u), ~style_number(., scale = 100) %>% paste0("%")) %>%
dplyr::mutate(ci = stringr::str_glue("{ci_l}, {ci_u}")) %>%
dplyr::select(all_of(c(by, "ci"))) %>%
tidyr::pivot_wider(names_from = all_of(by), values_from = ci) %>%
set_names(paste0("add_stat_", seq_len(ncol(.))))
}
ci("response", "trt", svy_trial)
#> # A tibble: 1 x 2
#> add_stat_1 add_stat_2
#> <glue> <glue>
#> 1 21%, 40% 25%, 44%
svy_trial %>%
tbl_svysummary(by = "trt", missing = "no") %>%
add_stat(everything() ~ "ci") %>%
modify_table_body(
dplyr::relocate, add_stat_1, .after = stat_1
) %>%
modify_header(starts_with("add_stat_") ~ "**95% CI**") %>%
modify_footnote(everything() ~ NA)
[Screenshot: table from previous post]
In the above example, the factor variables have two levels and summary data from 1 level is shown.
How can I tweak the above approach so that both levels of factor variables are displayed with their respective confidence intervals?
How can this solution be generalized to factor variables with >2 levels (e.g., an age variable binned as follows: <18 years, 18-25 years, 26-50 years, etc)?
Lastly, how could this desired solution also accommodate generating confidence intervals for continuous variables in the same column as the confidence intervals for factor variables?
Here is an example of the table I am trying to produce:
[Screenshot: desired table output]
Apologies if this request for help doesn't follow good stack overflow etiquette (I'm fairly new to this community) and your time and assistance is much appreciated!
I have a prepared example for factors with >=2 levels, but not with a by= variable (although the approach is similar). FYI, we have an open issue to support survey objects more thoroughly with a new function add_ci.tbl_svysummary() that will calculate CIs for both categorical and continuous variables. You can click the "subscribe" link here to be alerted when this feature is implemented https://github.com/ddsjoberg/gtsummary/issues/965
In the meantime, here is a code example:
library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.5.0'
svy <- survey::svydesign(~1, data = as.data.frame(Titanic), weights = ~Freq)
# put the CI in a tibble with the variable name
# first create a data frame with each variable and its values
df_result <-
tibble(variable = c("Class", "Sex", "Age", "Survived")) %>%
# get the levels of each variable in a new column
# adding them as a list to allow for different variable classes
rowwise() %>%
mutate(
# level to be used to construct call
level = unique(svy$variables[[variable]]) %>% as.list() %>% list(),
# character version to be merged into table
label = unique(svy$variables[[variable]]) %>% as.character() %>% as.list() %>% list()
) %>%
unnest(c(level, label)) %>%
mutate(
label = unlist(label)
)
# construct call to svyciprop
df_result$svyciprop <-
map2(
df_result$variable, df_result$label,
function(variable, level) rlang::inject(survey::svyciprop(~I(!!rlang::sym(variable) == !!level), svy))
)
# round/format the 95% CI
df_result <-
df_result %>%
rowwise() %>%
mutate(
ci =
svyciprop %>%
attr("ci") %>%
style_sigfig(scale = 100) %>%
paste0("%", collapse = ", ")
) %>%
ungroup() %>%
# keep variables needed in tbl
select(variable, label, ci)
# construct gtsummary table with CI
tbl <-
svy %>%
tbl_svysummary() %>%
# merge in CI
modify_table_body(
~.x %>%
left_join(
df_result,
by = c("variable", "label")
)
) %>%
# add a header
modify_header(ci = "**95% CI**")
Created on 2021-12-04 by the reprex package (v2.0.1)
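The core of the answer above is one svyciprop() call per factor level. Stripped of the rlang machinery, a single call looks roughly like the sketch below; it reuses the Titanic design from the answer, and the choice of the level "Female" and the object names (est, ci) are just for illustration.

```r
library(survey)

# Same design as in the answer: one row per cell, weighted by its frequency
svy <- svydesign(~1, data = as.data.frame(Titanic), weights = ~Freq)

# Proportion of passengers in one level of one factor, with its 95% CI
est <- svyciprop(~I(Sex == "Female"), svy)

# The interval is stored as an attribute on the estimate
ci <- attr(est, "ci")

as.numeric(est)  # point estimate
ci               # lower and upper bound
```

The answer repeats this call for every variable/level pair and then joins the formatted intervals onto the table by variable and label.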

gtsummary: multiple continuous variables as columns and than stratify by two categorial variables

I am trying to summarize three continuous variable by one categorical variable.
Here is some dummy data :
test <-
data.frame(
score_1= sample(c("low","medium","high"),50, replace = T),
land=rnorm(50,5,1),
water=rnorm(50,300,1),
fire=rnorm(50,3,1)
)
I can easily stratify the data by tertile:
table<- test %>%
tbl_summary(
by=score_1,
statistic = all_continuous()~ "{mean} ({sd})"
) %>%
print()
Which will make this table:
However I need to transpose this table: the continuous variables need to be the columns.
The reason is that I actually have two more scores to add, so the data actually look like this:
test2 <-
data.frame(
score_1= sample(c("low","medium","high"),50, replace = T),
score_2= sample(c("low","medium","high"),50, replace = T),
score_3= sample(c("low","medium","high"),50, replace = T),
land=rnorm(50,5,1),
water=rnorm(50,300,1),
fire=rnorm(50,3,1)
)
I thought of creating three tables, one for each score (with the continuous variables as columns), and then merging the three using tbl_stack. But I don't know how to make the first table (or whether that is even possible with gtsummary).
Hope that makes sense.
In the next release of gtsummary (v.1.5.0) the package will have a function designed to create tables just like the one you're requesting. While that new function is being vetted, you can use a similar (but not as easy to use) function from the bstfun package (on GitHub). bstfun is a package where some gtsummary functions are born, and when they mature they are migrated to gtsummary. Example code below!
# remotes::install_github("ddsjoberg/bstfun")
library(gtsummary)
test <-
data.frame(
score_1= sample(c("low","medium","high"),50, replace = T),
land=rnorm(50,5,1),
water=rnorm(50,300,1),
fire=rnorm(50,3,1),
all_one = 1L
)
df_pvalues <-
c("land", "water", "fire") %>%
purrr::imap_dfc(
~aov(
formula = glue::glue("{.x} ~ score_1") %>% as.formula(),
data = test
) %>%
broom::tidy() %>%
dplyr::slice(1) %>%
dplyr::select(p.value) %>%
dplyr::mutate_all(style_pvalue) %>%
setNames(glue::glue("stat_1_{.y}"))
) %>%
mutate(label = "ANOVA p-value")
df_pvalues
#> # A tibble: 1 x 4
#> stat_1_1 stat_1_2 stat_1_3 label
#> <chr> <chr> <chr> <chr>
#> 1 0.12 0.6 >0.9 ANOVA p-value
tbl <-
c("land", "water", "fire") %>%
purrr::map(
~test %>%
bstfun::tbl_2way_summary(score_1, all_one, con = all_of(.x),
statistic = "{mean} ({sd})") %>%
modify_header(all_stat_cols() ~ paste0("**", .x, "**"))
) %>%
tbl_merge() %>%
modify_spanning_header(everything() ~ NA) %>%
modify_table_body(
~.x %>%
dplyr::bind_rows(df_pvalues)
)
Created on 2021-08-25 by the reprex package (v2.0.1)
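If a plain data frame is enough (no gtsummary styling), the transposed layout with the continuous variables as columns can also be sketched with dplyr alone. This is an alternative to the bstfun approach above, not a reproduction of it; the seed and the object name wide are my own additions.

```r
library(dplyr)

set.seed(42)
test <- data.frame(
  score_1 = sample(c("low", "medium", "high"), 50, replace = TRUE),
  land = rnorm(50, 5, 1),
  water = rnorm(50, 300, 1),
  fire = rnorm(50, 3, 1)
)

# One row per score level, one "mean (sd)" column per continuous variable
wide <- test %>%
  group_by(score_1) %>%
  summarise(
    across(c(land, water, fire), ~ sprintf("%.1f (%.1f)", mean(.x), sd(.x)))
  )
wide
```

Repeating this for score_2 and score_3 and stacking the results with bind_rows() would give one block of rows per score.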

Creating a versatile descriptives table using dplyr

I'm trying to create a simple code that I can reuse over and over (with minimal adjustments) to be able to print a table of summary statistics.
A reproducible example creates a table with M and SD for the variable V1 broken down by group:
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
library(dplyr)
descriptives <- data %>% group_by(Group) %>%
summarize(
Mean = mean(V1)
, SD = sd(V1)
)
descriptives
I'd like to modify this function so that it will compute M and SD for all variables in my dataset.
I'd like to be able to replace the call to V1 with something like vars which is just a list of all the variables in my dataset; in this example, V1 and V2. But usually I have like 100 variables.
The reason I'd like it to work this way is so that I can do something very easy like:
vars <- names(data[3:4])
and very quickly select the columns for which I want summary statistics.
A few things for my wishlist:
M and SD for a given variable should be next to each other and I'd like to add a column above each pair with the variable name.
I'd like the end product to look something like
I'd like to use dplyr, but I'm open to other options.
I'd also like to learn how I could switch the rows and columns of the table so that the variables are on separate rows and each group has a column (or two columns, one for M and one for SD). Like this:
Close, but no cigar:
The newish summarise(across()) kind of helps:
dplyr::group_by(data, Group) %>%
dplyr::summarise(dplyr::across(.cols = c(V1, V2), .fns = c(mean, sd)))
But I don't know how to scale it without making multiple tables and using rbind() to stack them.
I really like the format of table1() (vignette), but from what I can tell I can only stratify the column M/SDs by another variable. I really wish I could just add additional grouping variables on.
The column order is a limitation here, but with select() we can reorder the columns by the variable-name prefix of each column name:
library(dplyr)
library(stringr)
data %>%
group_by(Group) %>%
summarise_at(vars(vars), list(Mean = mean, SD = sd)) %>%
select(Group, order(str_remove(names(.)[-1], "_.*")) + 1)
# A tibble: 2 x 5
# Group V1_Mean V1_SD V2_Mean V2_SD
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.165 0.915 0.146 1.16
#2 2 0.308 1.31 -0.00711 0.854
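The same result can be sketched with across(), which supersedes summarise_at() in current dplyr. Here all_of() takes the character vector vars from the question, and .names keeps each variable's Mean and SD next to each other, so no reordering step is needed. The seed and the balanced Group column are my own choices to keep the example reproducible.

```r
library(dplyr)

set.seed(1)
data <- data.frame(
  ID = 1:100,
  Group = rep(1:2, times = 50),
  V1 = rnorm(100),
  V2 = rnorm(100)
)

# Character vector of the columns to summarise, as in the question
vars <- names(data[3:4])

descriptives <- data %>%
  group_by(Group) %>%
  summarise(
    across(all_of(vars), list(Mean = mean, SD = sd), .names = "{.col}_{.fn}")
  )
descriptives
```

across() loops over the columns first and the functions second, which is why the output columns come out as V1_Mean, V1_SD, V2_Mean, V2_SD.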
I had a similar question here, and got some really useful and simple answers using tidyverse. In the end a really robust approach was made, which I wrapped in a function and use regularly.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
}
It takes three arguments, data, variables and the grouping_var - all of which are rather self explanatory.
Here is a test using mtcars with a 2 level and 3 level grouping var.
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "am"
)
# A tibble: 2 x 3
variable `0` `1`
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4` `6` `8`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
It works out of the box and is applicable to any data; below I use iris:
baseline_table(
data = iris,
variables = c("Sepal.Length", "Sepal.Width"),
grouping_var = "Species"
)
# A tibble: 2 x 4
variable setosa versicolor virginica
<chr> <chr> <chr> <chr>
1 Sepal.Length 5.01(±0.35) 5.94(±0.52) 6.59(±0.64)
2 Sepal.Width 3.43(±0.38) 2.77(±0.31) 2.97(±0.32)
Of course, some grouping variables are not directly suited for this, cyl among them, although it still serves as a good example. You can recode your grouping variables accordingly:
baseline_table(
data = mtcars %>% mutate(cyl = paste(cyl, "Cylinders", sep = " ")),
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
You can also modify the function to include descriptive strings, about the values,
baseline_table <- function(data, variables, grouping_var) {
# Generate the table;
tmpTable <- data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
# Generate Descriptives dynamically
tmpDesc <- tmpTable[1,] %>% mutate(
across(.fns = ~ paste("Mean (±SD)"))
) %>% mutate(
variable = ""
)
bind_rows(
tmpDesc,
tmpTable
)
}
Granted, this extension is a bit awkward - but it is nonetheless still robust. The output is,
# A tibble: 3 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 "" Mean (±SD) Mean (±SD) Mean (±SD)
2 "mpg" 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
3 "hp" 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
Update: I've rewritten the function for added flexibility, as noted in the comments.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!!syms(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% unite(
"grouping",
all_of(grouping_var)
) %>% pivot_longer(
cols = -"grouping",
names_to = "variables"
) %>% pivot_wider(
names_from = "grouping"
)
}
It works in the same way, and outputs the same, unless there is more than one grouping_var,
baseline_table(
mtcars,
variables = c("hp", "mpg"),
grouping_var = c("am", "cyl")
)
# A tibble: 2 x 7
variables `0_4` `0_6` `0_8` `1_4` `1_6` `1_8`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 hp 84.67(±19.66) 115.25(±9.18) 194.17(±33.36) 81.88(±22.66) 131.67(±37.53) 299.5(±50.2)
2 mpg 22.9(±1.45) 19.12(±1.63) 15.05(±2.77) 28.08(±4.48) 20.57(±0.75) 15.4(±0.57)
In the updated function I used unite with the default separator. Clearly, you can modify this to suit your needs so that the column names say, for example, 4 Cylinder (Automatic), 6 Cylinder (Automatic), etc.
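As a rough sketch of that kind of relabelling, the grouping label can be built with paste0() before summarising instead of relying on unite(). This is a standalone variation, not a change to baseline_table() itself; it assumes the usual mtcars coding of am (0 = automatic, 1 = manual), and the object name wide is hypothetical.

```r
library(dplyr)
library(tidyr)

wide <- mtcars %>%
  # Build a readable label such as "4 Cylinder (Automatic)"
  mutate(
    grouping = paste0(
      cyl, " Cylinder (", ifelse(am == 0, "Automatic", "Manual"), ")"
    )
  ) %>%
  group_by(grouping) %>%
  summarise(
    across(c(hp, mpg), ~ sprintf("%.2f (±%.2f)", mean(.x), sd(.x)))
  ) %>%
  # Variables to rows, grouping labels to columns
  pivot_longer(cols = -grouping, names_to = "variables") %>%
  pivot_wider(names_from = grouping, values_from = value)
wide
```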
Slight variation of your original code, you could use across() more simply/flexibly if you specify you don't want the ID (or the already-grouped Group) column, but rather everything else:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}"))
# A tibble: 2 x 5
Group V1_Mean V1_SD V2_Mean V2_SD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.0167 0.979 0.145 1.02
2 2 0.119 1.11 -0.277 1.05
EDIT:
If you want to create your (first) goal exactly, you can use the gt package to make an html table with column spanners:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}")) %>%
gt::gt() %>%
gt::tab_spanner_delim("_") %>%
gt::fmt_number(-Group, decimals = 2)
As to your other question, you could alternately do something like this to get the combined & transposed variation:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = ~paste0(
sprintf("%.2f", mean(.x)),
sprintf(" (%.2f)", sd(.x))))) %>%
t() %>%
as.data.frame()
V1 V2
Group 1 2
V1 -0.02 (0.98) 0.12 (1.11)
V2 0.15 (1.02) -0.28 (1.05)
Outside dplyr, you could use the tables package, which lets you build summary statistics from a table formula:
library(tables)
vars <- c("V1","V2")
vars <- paste(vars, collapse="+")
table <- as.formula(paste("(group = factor(Group)) ~ (", vars ,")*(mean+sd)"))
table
# (group = factor(Group)) ~ (V1 + V2) * (mean + sd)
tables::tabular(table, data = data)
# V1 V2
# group mean sd mean sd
# 1 -0.15759 0.9771 0.1405 1.0697
# 2 0.05084 0.9039 -0.1470 0.9949
One way to make a nice summary table is to use the gtsummary package (FYI, I am a co-author of this package). Below I format the data a little in data2 and drop the ID variable. After that it takes a two-line call to gtsummary to summarize the data. The by argument is what stratifies the table, and in the statistic argument I simply ask for the mean and SD; by default gtsummary shows the median and q1-q3. This table can be rendered in all markdown output formats (Word, PDF, HTML).
library(dplyr)
library(gtsummary)
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
data2 <- data %>%
mutate(Group = ifelse(Group == 1, "Group Var1","Group Var2")) %>%
select(-ID)
tbl_summary(data2, by = Group,
statistic = all_continuous()~ "{mean} ({sd})")
If you want more than one stratum but do not want to use tbl_strata, you can combine two variables into one column and use that in the by argument. You can unite() as many variables as you want (although that may not be recommended).
trial %>%
tidyr::unite(col = "trt_grade", trt, grade, sep = ", ") %>%
select(age, marker,stage,trt_grade) %>%
tbl_summary(by = c(trt_grade))
A data.table option
library(data.table)
dcast(
setDT(data)[,
c(
.(Meas = c("M", "Sd")),
lapply(.SD, function(x) c(mean(x), sd(x)))
),
Group,
.SDcols = patterns("V\\d")
], Group ~ Meas,
value.var = c("V1", "V2")
)
gives
Group V1_M V1_Sd V2_M V2_Sd
1: 1 -0.2392583 1.097343 -0.08048455 0.7851212
2: 2 0.1059716 1.011769 -0.23356373 0.9927975
You can also use base R:
# using do.call to make the result a data.frame
do.call(
data.frame
# here you aggregate for all the functions you need
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x))))
)
This leads to something like this:
Group V1.mn V1.sd V2.mn V2.sd
1 1 0.1239868 1.008214 0.07215481 1.026059
2 2 -0.2324611 1.048230 0.11348897 1.071467
If you want a fancier table, kableExtra could really help. Note that %>% should also be available once kableExtra is loaded; if it is not, from R 4.1 onward you can use the native pipe |> instead:
library(kableExtra)
# data manipulation as above, note the [,-1] to remove the Group column
do.call(
data.frame
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x)))))[,-1] %>%
# here you define as a kable, and give the names you want to columns
kbl(col.names = rep(c('mean','sd'),2) ) %>%
# some formatting
kable_paper() %>%
# adding the first header
add_header_above(c( "Group 1" = 2, "Group 2" = 2)) %>%
# another header if you need it
add_header_above(c( "Big group" = 4))
kableExtra offers many more options for making great tables.
Alternatively, you can try something like this:
do.call(data.frame,
aggregate(. ~ Group, data = data[,-1], FUN = function(x) paste0(round(mean(x),2),' (', round(sd(x),2),')'))
) %>%
kbl() %>%
kable_paper()
That leads to:
