Adding differences to a summary-table created without iteration - r

Based on my first question found here about creating a summary table without iteration, i.e. without using map, I wrote the following code based on the formidable answers,
library(tidyverse)
sum_variables <- c("mpg", "hp", "disp")
# Create grouping var; ####
mtcars <- mtcars %>% mutate(
am_factor = case_when(
am == 0 ~ "Automatic",
TRUE ~ "Manual"
)
)
# Create summary table; ####
mtcars %>%
group_by(am_factor) %>%
summarise(
across(
all_of(sum_variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -"am_factor"
) %>% pivot_wider(
names_from = "am_factor"
)
Which gives me the following output,
# A tibble: 3 x 3
name Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
Using paste0 here has the benefit of reducing the amount of code needed, but it complicates further additions to the table. If I, for example, want to add differences to this table, my current solution is the following,
mtcars %>%
group_by(am_factor) %>%
summarise(
across(
all_of(sum_variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -"am_factor"
) %>% pivot_wider(
names_from = "am_factor"
) %>% mutate(
difference = str_extract(Automatic, "[:digit:].?[:digit:]") %>% as.numeric() -
str_extract(Manual, "[:digit:].?[:digit:]") %>% as.numeric()
)
Which gives the desired output,
# A tibble: 3 × 4
name Automatic Manual difference
<chr> <chr> <chr> <dbl>
1 mpg 17.15(±3.83) 24.39(±6.17) -7
2 hp 160.26(±53.91) 126.85(±84.06) 34
3 disp 290.38(±110.17) 143.53(±87.2) 147
Although it works, it defeats the purpose of keeping the code simple.
How do I create a summary of my data in a simple manner? It has to be a tidyverse-solution, preferably without iteration.

This is not necessarily simpler or shorter but I would prefer to do the mathematical calculations on numbers directly rather than extracting them from strings.
library(dplyr)
library(tidyr)
mtcars %>%
group_by(am_factor) %>%
summarise(across(all_of(sum_variables), list(mean = mean,
sd = ~sprintf('%.2f (± %.2f)', mean(.), sd(.))))) %>%
pivot_longer(cols = -am_factor,
names_to = c('measure', '.value'),
names_sep = '_') %>%
group_by(measure) %>%
mutate(difference = -diff(mean)) %>%
ungroup %>%
select(-mean) %>%
pivot_wider(names_from = am_factor, values_from = sd)
# measure difference Automatic Manual
# <chr> <dbl> <chr> <chr>
#1 mpg -7.24 17.15 (± 3.83) 24.39 (± 6.17)
#2 hp 33.4 160.26 (± 53.91) 126.85 (± 84.06)
#3 disp 147. 290.38 (± 110.17) 143.53 (± 87.20)
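The same idea can be pushed one step further by keeping every statistic numeric until the very last step. The following is only a minimal sketch, assuming the `am_factor` recoding and `sum_variables` from the question; the names `summary_tbl` and `measure` are illustrative. Because the difference is taken from the unrounded means, it is not affected by how the strings are formatted (the string-based version in the question yields -7 for mpg, the numeric one -7.24).

```r
library(dplyr)
library(tidyr)

sum_variables <- c("mpg", "hp", "disp")

summary_tbl <- mtcars %>%
  # Same recoding as in the question, done inline so the sketch is self-contained
  mutate(am_factor = if_else(am == 0, "Automatic", "Manual")) %>%
  select(am_factor, all_of(sum_variables)) %>%
  pivot_longer(all_of(sum_variables), names_to = "measure") %>%
  group_by(measure, am_factor) %>%
  summarise(mean = mean(value), sd = sd(value), .groups = "drop") %>%
  # One row per measure with numeric columns mean_Automatic, sd_Automatic, ...
  pivot_wider(names_from = am_factor, values_from = c(mean, sd)) %>%
  transmute(
    measure,
    Automatic = sprintf("%.2f (\u00B1 %.2f)", mean_Automatic, sd_Automatic),
    Manual = sprintf("%.2f (\u00B1 %.2f)", mean_Manual, sd_Manual),
    # Difference computed on the unrounded means, not on the formatted strings
    difference = mean_Automatic - mean_Manual
  )

summary_tbl
```

The trade-off is that the group names appear explicitly in the last step, so this variant fits best when the grouping variable has a small, known set of levels.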

Related

Summarising multiple variables without iteration

Consider this data that needs the summary measures mean and sd on multiple variables,
# Create grouping var; ####
mtcars <- mtcars %>% mutate(
am = case_when(
am == 0 ~ "Automatic",
TRUE ~ "Manual"
)
)
With the following custom function and purrr, I can create a baseline table,
# Summarising function; ####
sum_foo <- function(data, var) {
data %>%
group_by(am) %>%
summarise(
mean = mean( !!sym(var) , na.rm = TRUE),
sd = sd( !!sym(var) , na.rm = TRUE)
) %>%
mutate(across(where(is.double), round, 2)) %>%
group_by(am) %>%
transmute(
value = paste(mean, "(±", sd, ")", sep = ""),
variable = var
) %>%
pivot_wider(
names_from = "am"
)
}
# Execute Function; ####
sum_variables <- c("mpg", "hp", "disp")
sum_variables %>% map(
sum_foo,
data = mtcars
) %>% reduce(
bind_rows
)
Which gives the following output,
# A tibble: 3 x 3
variable Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
I want to get the output without using map and reduce, i.e. without iterating through the variables with rowwise or map.
I'm looking for an alternative tidyverse-solution!
Maybe you could use this solution:
library(dplyr)
library(tidyr)
library(tibble)
sum_variables %>%
enframe() %>%
rowwise() %>%
mutate(output = list(sum_foo(mtcars, value))) %>%
select(output) %>%
unnest(cols = output)
# A tibble: 3 x 3
variable Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
Updated
Or you could even modify your function in the following way:
sum_foo2 <- function(data, var) {
data %>%
group_by(am) %>%
summarise(across(all_of(var), list(Mean = mean, sd = sd))) %>%
mutate(across(where(is.double), round, 2)) %>%
group_by(am) %>%
summarise(across(ends_with("Mean"), ~ paste(.x, "(±", get(gsub("_Mean", "_sd", cur_column())), ")", sep = ""))) %>%
pivot_longer(!am, names_to = "Mean", values_to = "Val") %>%
pivot_wider(names_from = "am", values_from = "Val")
}
sum_foo2(mtcars, sum_variables)
# A tibble: 3 x 3
Mean Automatic Manual
<chr> <chr> <chr>
1 mpg_Mean 17.15(±3.83) 24.39(±6.17)
2 hp_Mean 160.26(±53.91) 126.85(±84.06)
3 disp_Mean 290.38(±110.17) 143.53(±87.2)
If I am to trim the function above into a more concise version:
sum_foo2 <- function(data, var) {
data %>%
group_by(am) %>%
summarise(across(all_of(var), ~ paste0(round(mean(.x), 2), "(±", round(sd(.x), 2), ")"))) %>%
pivot_longer(!am, names_to = "Mean", values_to = "Val") %>%
pivot_wider(names_from = "am", values_from = "Val")
}
sum_foo2(mtcars, sum_variables)
Without using the function that you wrote, which requires iteration (i.e. rowwise/map), you could simply do:
sum_variables <- c("mpg", "hp", "disp")
mtcars %>%
group_by(am) %>%
summarise(across(all_of(sum_variables),
~sprintf('%.2f(\u00B1%.2f)', mean(.x), sd(.x))), .groups = 'drop') %>%
data.table::transpose(keep.names = 'variable', make.names = TRUE)
variable Automatic Manual
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.20)
So: Please be gentle. My solution without using map and reduce:
library(dplyr)
library(tidyr)
library(stringr)
# mtcars with `am` recoded to "Automatic"/"Manual" as in the question
mtcars %>%
group_by(am) %>%
summarise(across(c(mpg, hp, disp), list(mean = mean, sd = sd), .names = "{.col}_{.fn}")) %>%
pivot_longer (
cols = 2:7,
names_to = "variable",
values_to = "values"
) %>%
pivot_wider(
names_from = am,
values_from = values
) %>%
mutate(variable = str_extract(variable, "[^_]*")) %>%
mutate(across(c(Automatic, Manual), lead, .names = "{.col}_{.fn}")) %>%
filter(row_number() %% 2 == 1) %>%
mutate(across(where(is.numeric), round, 2)) %>%
mutate(Automatic = paste0(Automatic,"(±",Automatic_1,")"), .keep="unused") %>%
mutate(Manual = paste0(Manual,"(±",Manual_1,")"), .keep="unused")
Output:
variable Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
With the formidable answers that I got, this is the final tidyverse-solution without iteration or map that was born,
mtcars %>%
group_by(am) %>%
summarise(
across(
all_of(sum_variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -"am"
) %>% pivot_wider(
names_from = "am"
)
Which gives the following output,
# A tibble: 3 x 3
name Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)

Extract data labels in R from attr and add as column to correspond to the variable/column name

I have a very large data set with variable names that are super abbreviated and it would help immensely if the label in the attr(*, "label") section was extracted and showed up in the column beside the corresponding variable.
label(mtcars[["mpg"]]) <- "Miles/(US) gallon"
label(mtcars[["hp"]]) <- "Gross horsepower"
label(mtcars[["wt"]]) <- "Weight (1000lbs)"
My current code just gets the mean/sd from the entire data set:
mtcars %>% select(mpg, hp, wt) %>% pivot_longer(everything()) %>% group_by(name) %>% summarise(mean=mean(value, na.rm = TRUE), sd=sd(value, na.rm=TRUE))
But I want a column with the label of the variables so it's easier to tell:
name mean sd label
hp 147. 68.6 Gross horsepower
mpg 20.1 6.03 Miles/(US) gallon
wt 3.22 0.978 Weight (1000lbs)
I found a thread that sort of gets to what I want, but if I add mutate(labels=label(mtcars)[name]) at the end of the code, I get a column with NA instead of the labels.
We can use imap
library(purrr)
library(dplyr)
library(Hmisc)
# .x is the whole column and .y its name; summarise the full vector
imap_dfr(mtcars[c('hp', 'mpg', 'wt')], ~
  tibble(name = .y, mean = mean(.x, na.rm = TRUE),
         sd = sd(.x, na.rm = TRUE),
         label = attr(.x, 'label')))
If we use the OP's method, we can also use summarise_all and then do the pivot_longer
library(tidyr)
mtcars %>%
dplyr::select(mpg, hp, wt) %>%
summarise_all(list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
label = ~attr(., 'label'))) %>%
mutate(rn = 1) %>%
pivot_longer(cols = -rn, names_to = c('name', '.value'), names_sep="_") %>%
select(-rn)
# name mean sd label
#1 mpg 20.09062 6.0269481 Miles/(US) gallon
#2 hp 146.68750 68.5628685 Gross horsepower
#3 wt 3.21725 0.9784574 Weight (1000lbs)
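Another option is to read the labels into a small lookup table first and join it onto the summary, so the labels never have to pass through the reshaping step. This is only a sketch under assumptions: the labels are set with base `attr()` instead of Hmisc's `label()` to keep it self-contained, and `cars`, `vars`, `label_lookup` and `stats` are illustrative names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Work on a copy and attach the labels as plain attributes (assumption: no Hmisc)
cars <- mtcars
attr(cars$mpg, "label") <- "Miles/(US) gallon"
attr(cars$hp, "label") <- "Gross horsepower"
attr(cars$wt, "label") <- "Weight (1000lbs)"

vars <- c("mpg", "hp", "wt")

# One row per variable holding its label
label_lookup <- tibble(
  name = vars,
  label = vapply(cars[vars], function(x) attr(x, "label"), character(1),
                 USE.NAMES = FALSE)
)

stats <- cars %>%
  summarise(across(all_of(vars),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE)))) %>%
  # Columns are named mpg_mean, mpg_sd, ...; split them into name + statistic
  pivot_longer(everything(), names_to = c("name", ".value"), names_sep = "_") %>%
  left_join(label_lookup, by = "name")

stats
```

Splitting on `_` relies on the variable names themselves not containing an underscore, the same assumption the `names_sep = "_"` call above makes.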

Summary Tables using Nested Tibbles

I am trying to generate a table of summary statistics using purrr/tibble methods. I am able to calculate group-wise mean (sd) and counts using the following:
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
mtcars %>%
gather(variable, value, -vs, -am) %>%
group_by(vs, am, variable) %>%
nest() %>%
filter(variable %in% c("mpg", "hp")) %>%
mutate(
mean = map_dbl(data, ~mean(.$value, na.rm = TRUE)),
sd = map_dbl(data, ~sd(.$value, na.rm = TRUE)),
n = map_dbl(data, ~sum(!is.na(.$value)))
) %>%
select(vs:variable, mean:n) %>%
mutate_at(vars(mean, sd), round, 3) %>%
mutate(mean_sd = paste0(mean, " (", sd, ")"),
var_group = paste(vs, am, variable, sep = "_")) %>%
select(n:var_group) %>%
nest(n, mean_sd, .key = "summary") %>%
spread(key = var_group, value = summary) %>%
unnest()
My immediate question is, how do I retain the column names as seen in spread(key = var_group, value = summary) in the unnest()-ed output?
edit: Thanks to all for the responses.
https://stackoverflow.com/a/55912326/5745045 has the advantages of being easier to read and not storing a temporary variable. A disadvantage is the change of numeric to character in the n columns.
The final goal is to replace the column names with formatted text within the context of a grouped kable table.
By storing the "nested" tibble as a temporary variable (#1) and using its colnames (#2), we can achieve what you desire. See below:
mtcars %>%
gather(variable, value, -vs, -am) %>%
group_by(vs, am, variable) %>%
nest() %>%
filter(variable %in% c("mpg", "hp")) %>%
mutate(
mean = map_dbl(data, ~mean(.$value, na.rm = TRUE)),
sd = map_dbl(data, ~sd(.$value, na.rm = TRUE)),
n = map_dbl(data, ~sum(!is.na(.$value)))
) %>%
select(vs:variable, mean:n) %>%
mutate_at(vars(mean, sd), round, 3) %>%
mutate(mean_sd = paste0(mean, " (", sd, ")"),
var_group = paste(vs, am, variable, sep = "_")) %>%
select(n:var_group) %>%
nest(n, mean_sd, .key = "summary") %>%
spread(key = var_group, value = summary) %>%
#1: storing the temporary nested variable
{. ->> temptibble} %>%
unnest() %>%
#2: renaming the columns of unnested output and removing temporary variable
rename_all(funs(paste0(., "_", rep(colnames(temptibble), each=2)))); rm(temptibble)
# # A tibble: 1 x 16
# n_0_0_hp mean_sd_0_0_hp n1_0_0_mpg mean_sd1_0_0_mpg n2_0_1_hp mean_sd2_0_1_hp n3_0_1_mpg mean_sd3_0_1_mpg
# <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
# 1 12 194.167 (33.36) 12 15.05 (2.774) 6 180.833 (98.816) 6 19.75 (4.009)
# n4_1_0_hp mean_sd4_1_0_hp n5_1_0_mpg mean_sd5_1_0_mpg n6_1_1_hp mean_sd6_1_1_hp n7_1_1_mpg mean_sd7_1_1_mpg
# <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
# 1 7 102.143 (20.932) 7 20.743 (2.471) 7 80.571 (24.144) 7 28.371 (4.758)
Here's another method that doesn't require creating a temporary variable. Instead of nesting the data at the end, I used gather() and unite() to restructure the data so that it ends up as one key and value pair.
library(tidyverse)
mtcars %>%
gather(variable, value, -vs, -am) %>%
group_by(vs, am, variable) %>%
nest() %>%
filter(variable %in% c("mpg", "hp")) %>%
mutate(
mean = map_dbl(data, ~mean(.$value, na.rm = TRUE)),
sd = map_dbl(data, ~sd(.$value, na.rm = TRUE)),
n = map_dbl(data, ~sum(!is.na(.$value)))
) %>%
select(vs:variable, mean:n) %>%
mutate_at(vars(mean, sd), round, 3) %>%
mutate(mean_sd = paste0(mean, " (", sd, ")"),
var_group = paste(vs, am, variable, sep = "_")) %>%
select(n:var_group) %>%
gather(key, value, -var_group) %>%
unite(var_group_key, var_group, key) %>%
spread(var_group_key, value)
#> # A tibble: 1 x 16
#> `0_0_hp_mean_sd` `0_0_hp_n` `0_0_mpg_mean_s… `0_0_mpg_n` `0_1_hp_mean_sd`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 194.167 (33.36) 12 15.05 (2.774) 12 180.833 (98.816)
#> # … with 11 more variables: `0_1_hp_n` <chr>, `0_1_mpg_mean_sd` <chr>,
#> # `0_1_mpg_n` <chr>, `1_0_hp_mean_sd` <chr>, `1_0_hp_n` <chr>,
#> # `1_0_mpg_mean_sd` <chr>, `1_0_mpg_n` <chr>, `1_1_hp_mean_sd` <chr>,
#> # `1_1_hp_n` <chr>, `1_1_mpg_mean_sd` <chr>, `1_1_mpg_n` <chr>
Created on 2019-04-29 by the reprex package (v0.2.1)
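With more recent tidyr versions the same wide layout can probably be reached without nesting at all. The following is a sketch rather than a drop-in replacement: it assumes a tidyr version that provides `pivot_longer()` and `pivot_wider()` with the `names_glue` argument, `wide_summary` is an illustrative name, and `sprintf("%.3f")` pads trailing zeros, so the strings differ slightly from the `round()`-based ones above. The `n` columns stay numeric because they are never mixed with the character column.

```r
library(dplyr)
library(tidyr)

wide_summary <- mtcars %>%
  select(vs, am, mpg, hp) %>%
  pivot_longer(c(mpg, hp), names_to = "variable") %>%
  group_by(vs, am, variable) %>%
  summarise(
    n = sum(!is.na(value)),
    mean_sd = sprintf("%.3f (%.3f)",
                      mean(value, na.rm = TRUE), sd(value, na.rm = TRUE)),
    .groups = "drop"
  ) %>%
  # Build names such as 0_0_hp_n and 0_0_hp_mean_sd in one step
  pivot_wider(
    names_from = c(vs, am, variable),
    values_from = c(n, mean_sd),
    names_glue = "{vs}_{am}_{variable}_{.value}"
  )

wide_summary
```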

reformatting dplyr summarise_at() output

I am using summarise_at() to obtain the mean and standard error of multiple variables by group.
The output has 1 row for each group, and 1 column for each calculated quantity, per group. I'd like to have a table with 1 row for each variable, and 1 column for each calculated quantity:
data <- mtcars
data$condition <- as.factor(c(rep("control", 16), rep("treat", 16)))
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt),
funs(mean = mean, se=sd(.)/sqrt(n())))
# A tibble: 2 x 7
condition mpg_mean cyl_mean wt_mean mpg_se cyl_se wt_se
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 control 18.2 6.5 3.56 1.04 0.387 0.204
2 treat 22.0 5.88 2.87 1.77 0.499 0.257
Here's what I think would be more useful (the numbers are not meaningful):
# MEAN.control, MEAN.treat, SE.control, SE.treat
# mpg 1.5 2.4 .30 .45
# cyl 3.2 1.9 .20 .60
# disp 12.3 17.8 .20 .19
Any ideas? New to the tidyverse, so sorry if this is too basic.
funs() is deprecated in dplyr; use list() in summarise_at/mutate_at instead. After the summarise step, gather the data into 'long' format, separate the 'key' column into 'key1' and 'key2' by splitting at the delimiter _, then unite 'key2' and 'condition' into 'cond' (the statistic names are already upper case because the functions are named MEAN and SE), spread it to 'wide' format and, if needed, move the column 'key1' to the row names.
library(tidyverse)
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(MEAN = ~ mean(.),
SE = ~sd(.)/sqrt(n()))) %>%
gather(key, val, -condition) %>%
separate(key, into = c("key1", "key2")) %>%
unite(cond, key2, condition, sep=".") %>%
spread(cond, val) %>%
column_to_rownames('key1')
# MEAN.control MEAN.treat SE.control SE.treat
#cyl 6.500000 5.875000 0.3872983 0.4989572
#mpg 18.200000 21.981250 1.0369024 1.7720332
#wt 3.560875 2.873625 0.2044885 0.2571034
A different possibility could be:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
separate(var, c("vars", "var2")) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
vars MEAN_control MEAN_treat SE_control SE_treat
<chr> <dbl> <dbl> <dbl> <dbl>
1 cyl 6.5 5.88 0.387 0.499
2 mpg 18.2 22.0 1.04 1.77
3 wt 3.56 2.87 0.204 0.257
Here, after your initial steps, it performs a wide-to-long data transformation, excluding the "condition" column. Second, it separates the variable names into two columns. Third, it combines the metric and the condition, with the metric being upper case. Finally, it removes the redundant variable and returns it to the desired format.
Or you can avoid separate() by using some regex:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = gsub("_.*$", "", var),
var2 = gsub(".*\\_", "", var)) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or with strsplit():
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = sapply(strsplit(var, "_"), function(x) x[1]),
var2 = sapply(strsplit(var, "_"), function(x) x[2])) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or you can completely rewrite it to:
data %>%
select(mpg, cyl, wt, condition) %>%
gather(vars, val, -condition) %>%
group_by(condition, vars) %>%
summarise(mean = mean(val),
se = sd(val)/sqrt(n())) %>%
ungroup() %>%
gather(var2, val, -c(condition, vars)) %>%
mutate(var2 = paste(toupper(var2), condition, sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
In this case, it first selects the variables of interest. Second, it performs a transformation from wide to long format, excluding the "condition" column. Third, it groups by condition and variable name and calculates the metrics. Fourth, it performs a second wide-to-long transformation, excluding the "condition" column and the column with the original variable names. Finally, it combines the metric (upper case) and the condition, removes the redundant variable and spreads the data to the desired format.
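For newer dplyr/tidyr versions, the gather/separate/unite/spread chain can likely be condensed. This is a sketch only, assuming `across()`, `pivot_longer()` and `pivot_wider()` are available; `res`, `variable` and `stat` are illustrative names, and the split on `_` assumes the variable names contain no underscore.

```r
library(dplyr)
library(tidyr)

data <- mtcars
data$condition <- factor(rep(c("control", "treat"), each = 16))

res <- data %>%
  group_by(condition) %>%
  summarise(across(c(mpg, cyl, wt),
                   list(MEAN = mean, SE = ~ sd(.x) / sqrt(n())))) %>%
  # mpg_MEAN, mpg_SE, ... become one row per variable and statistic
  pivot_longer(-condition, names_to = c("variable", "stat"), names_sep = "_") %>%
  # Combine statistic and condition into column names such as MEAN.control
  pivot_wider(names_from = c(stat, condition), values_from = value,
              names_sep = ".")

res
```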

How to add totals as well as group_by statistics in R

When computing any statistic using summarise and group_by we only get the summary statistic per-category, and not the value for all the population (Total). How to get both?
I am looking for something clean and short. Until now I can only think of:
bind_rows(
iris %>% group_by(Species) %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1))
,
iris %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1)) %>%
mutate(Species = "Total")
)
But I would like something more compact. In particular, I don't want to type the code (for summarize) twice, once for each group and once for the total.
You can simplify it if you untangle what you're trying to do: you have iris data that has several species, and you want that summarized along with data for all species. You don't need to calculate those summary stats before you can bind. Instead, bind iris with a version of iris that's been set to Species = "Total", then group and summarize.
library(tidyverse)
bind_rows(
iris,
iris %>% mutate(Species = "Total")
) %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Width),
Median = median(Sepal.Width),
sd = sd(Sepal.Width),
p10 = quantile(Sepal.Width, probs = 0.1))
#> # A tibble: 4 x 5
#> Species Mean Median sd p10
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 3.4 0.379 3
#> 2 Total 3.06 3 0.436 2.5
#> 3 versicolor 2.77 2.8 0.314 2.3
#> 4 virginica 2.97 3 0.322 2.59
I like the caution in the comments above, though I have to do this sort of calculation for work enough that I have a similar shorthand function in a personal package. It perhaps makes less sense for things like standard deviations, but it's something I need to do a lot for adding up totals of demographic groups, etc. (If it's useful, that function is here).
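A shorthand of that kind could look roughly like the sketch below. It is not the function referred to above; `add_total_rows()` and its arguments are hypothetical names. It stacks the data on top of a copy whose grouping column is overwritten with a total label, so the summary code only has to be written once.

```r
library(dplyr)

# Hypothetical helper: append a copy of `df` in which `group_col` is "Total"
add_total_rows <- function(df, group_col, total_label = "Total") {
  original <- df
  # Convert to character so factor levels do not clash with the new label
  original[[group_col]] <- as.character(original[[group_col]])
  total <- df
  total[[group_col]] <- total_label
  bind_rows(original, total)
}

res <- iris %>%
  add_total_rows("Species") %>%
  group_by(Species) %>%
  summarise(Mean = mean(Sepal.Width),
            Median = median(Sepal.Width),
            sd = sd(Sepal.Width),
            p10 = quantile(Sepal.Width, probs = 0.1, names = FALSE))

res
```

As with the bind_rows approach above, every observation is counted twice in the stacked data, so the helper should only feed grouped summaries, never an overall one.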
A bit shorter, though quite similar to bind_rows:
q10 <- function(x){quantile(x , probs=0.1)}
iris %>%
select(Species,Sepal.Width)%>%
group_by(Species) %>%
summarise_all(c("mean", "sd", "q10")) %>%
t() %>%
cbind(c("total", iris %>% select(Sepal.Width) %>% summarise_all(c("mean", "sd", "q10")))) %>%
t()
Probably cleaner:
bind_rows(
iris %>%
group_by(Species) %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10"))
,
iris %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10")) %>%
mutate(Species = "Total")
)

Resources