Summarise multiple variables by one group at a time - r

There are a number of questions and answers about summarising multiple variables by one or more groups (e.g., Means multiple columns by multiple groups). I don't think this is a duplicate.
Here's what I'm trying to do: I want to calculate the mean for four variables by Displacement, then calculate the mean for those same four by Horsepower, and so on. I don't want to group by vs, am, gear, and carb simultaneously (i.e., I'm not simply looking for mydata %>% group_by(vs, am, gear, carb) %>% summarise_if(...)).
How can I calculate the means for a set of variables by Displacement, then calculate the means for that same set of variables by Horsepower, etc., then place in a table side by side?
I tried to come up with a reproducible example but couldn't. Here is a tibble from mtcars that shows what I'm ultimately looking for (data is made up):
tibble(Item = c("vs", "am" ,"gear", "carb"),
"Displacement (mean)" = c(2.4, 1.4, 5.5, 1.3),
"Horsepower (mean)" = c(155, 175, 300, 200))

Perhaps something like this using purrr::map and some rlang syntax?
library(dplyr)
library(purrr)
grps <- list("cyl", "vs")
map(setNames(grps, unlist(grps)), function(x)
mtcars %>%
group_by(!!rlang::sym(x)) %>%
summarise(mean.mpg = mean(mpg), mean.disp = mean(disp)) %>%
rename(id.val = 1)) %>%
bind_rows(.id = "id")
## A tibble: 5 x 4
# id id.val mean.mpg mean.disp
# <chr> <dbl> <dbl> <dbl>
#1 cyl 4. 26.7 105.
#2 cyl 6. 19.7 183.
#3 cyl 8. 15.1 353.
#4 vs 0. 16.6 307.
#5 vs 1. 24.6 132.

With so few groupings, why not do each set of means one at a time:
out1 <- mydata %>% group_by(Var1) %>%
summarise(mean_1a = mean(var_a), mean_1b = mean(var_b))
out2 <- mydata %>% group_by(Var2) %>%
summarise(mean_2a = mean(var_a), mean_2b = mean(var_b))
out3 <- mydata %>% group_by(Var3) %>%
summarise(mean_3a = mean(var_a), mean_3b = mean(var_b))
If it makes sense to place the results side by side, and each result has the same number of rows, you could do so with something like:
result <- cbind(out1, out2, out3)
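Because grouping variables usually have different numbers of levels, the separate results will often have different row counts, and a stacked (long) layout may be easier to work with than side-by-side columns. The following is only a sketch, assuming mtcars with vs, am, gear and carb as the grouping columns and disp and hp as the variables to average; the names group_var, level, mean_disp and mean_hp are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Stack the grouping columns so that each row carries one
# (group_var, level) pair, then summarise once instead of
# once per grouping variable.
means_by_group <- mtcars %>%
  pivot_longer(c(vs, am, gear, carb),
               names_to = "group_var", values_to = "level") %>%
  group_by(group_var, level) %>%
  summarise(mean_disp = mean(disp), mean_hp = mean(hp), .groups = "drop")

means_by_group
```

Each grouping variable is still summarised on its own; the variables are never crossed with each other, which is the behaviour the question asks for.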

Related

Combine two dfs with different column names and then melt

I want to combine two data frames but melt them into different columns, based on the data below:
treatment<-c('control','noise')
weight<-c(0.01872556,0.01575400)
sd<-c(0.008540041,0.007460524)
df1<-data.frame(treatment,weight,sd)
treatment2<-c('control','noise')
area<-c(0.79809444,0.68014667)
sd2<-c(0.337949414,0.294295847)
df2<-data.frame(treatment2,area,sd2)
And I wanted to combine them and create a data frame which should look like this:
treatment  var     sum         sd
control    area    0.79809444  0.337949414
noise      area    0.68014667  0.294295847
control    weight  0.01872556  0.008540041
noise      weight  0.01575400  0.007460524
I tried this in various ways, searched for solutions, and ended up exporting each data frame to a CSV, combining them in Excel, and re-importing the result into R for analysis.
Is there a simpler solution?
You could use
library(tidyr)
library(dplyr)
df2 %>%
rename(sd = sd2, treatment = treatment2) %>%
pivot_longer(area, names_to = "var", values_to = "sum") %>%
bind_rows(pivot_longer(df1, weight, names_to = "var", values_to = "sum")) %>%
select(treatment, var, sum, sd)
to get
# A tibble: 4 x 4
treatment var sum sd
<chr> <chr> <dbl> <dbl>
1 control area 0.798 0.338
2 noise area 0.680 0.294
3 control weight 0.0187 0.00854
4 noise weight 0.0158 0.00746
You could do this using functions from {purrr} and {dplyr}:
map(list(df2, df1), ~ mutate(., var = colnames(.)[2])) %>%
map(~ set_names(., nm = c("treatment", "sum", "sd", "var"))) %>%
bind_rows() %>%
relocate("var", .before = "sum")
Output:
treatment var sum sd
1 control area 0.79809444 0.337949414
2 noise area 0.68014667 0.294295847
3 control weight 0.01872556 0.008540041
4 noise weight 0.01575400 0.007460524
Here is a dplyr solution. The strategy is to first reshape each data frame into the desired format and then merge them:
df1 <- df1 %>%
dplyr::mutate(var = "weight") %>%
dplyr::rename(sum = weight)
df2 <- df2 %>%
dplyr::mutate(var = "area") %>%
dplyr::rename(treatment = treatment2,
sd = sd2,
sum = area)
dplyr::bind_rows(df1, df2)
# output
treatment sum sd var
1 control 0.01872556 0.008540041 weight
2 noise 0.01575400 0.007460524 weight
3 control 0.79809444 0.337949414 area
4 noise 0.68014667 0.294295847 area
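For comparison, the same reshaping can be sketched in base R without any packages. This assumes both data frames have their columns in the order (treatment, value, sd), as in the question; to_long is a hypothetical helper name, not something from the answers above.

```r
df1 <- data.frame(treatment = c("control", "noise"),
                  weight = c(0.01872556, 0.01575400),
                  sd = c(0.008540041, 0.007460524))
df2 <- data.frame(treatment2 = c("control", "noise"),
                  area = c(0.79809444, 0.68014667),
                  sd2 = c(0.337949414, 0.294295847))

# Hypothetical helper: assumes the column order (treatment, value, sd)
# and uses the name of the value column as the "var" label.
to_long <- function(df) {
  data.frame(treatment = as.character(df[[1]]),
             var = names(df)[2],
             sum = df[[2]],
             sd = df[[3]],
             stringsAsFactors = FALSE)
}

combined <- rbind(to_long(df2), to_long(df1))
combined
```

Relying on column position keeps the helper short, but it will silently mislabel data if a data frame has a different column order, so selecting by name is safer for anything beyond a quick script.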

function referencing inside sdf_pivot

all:
I am having trouble referencing simple functions inside sdf_pivot. Can anyone help? Thanks!
This is the code that works, but not exactly what I need:
library(sparklyr)
library(dplyr)
spark_disconnect_all()
sc <- spark_connect(master = "yarn-client")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
mtcars_tbl %>%
mutate(mpg = ifelse(mpg > 30, "High", "Low" )) %>%
sdf_pivot(mpg+cyl ~ gear, fun.aggregate = list(hp = "mean"))
I want to calculate the mean with NAs removed, and also the median, ideally with NAs removed as well. I cannot get the following code to work, though:
mtcars_tbl %>%
mutate(mpg = ifelse(mpg > 30, "High", "Low" )) %>%
sdf_pivot(mpg+cyl ~ gear, fun.aggregate = list(hp = "mean(na.rm=TRUE)"))
mtcars_tbl %>%
mutate(mpg = ifelse(mpg > 30, "High", "Low" )) %>%
sdf_pivot(mpg+cyl ~ gear, fun.aggregate = list(hp = "percentile(0.5)"))
This is the result I need:
mpg cyl `3.0` `4.0` `5.0`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Low 8 194. NaN 300.
2 High 4 NaN 61 113
3 Low 4 97 85 91
4 Low 6 108. 116. 175
My data has 800 million rows, and I'm using a simple example here only for easy replication. In reality, I cannot collect the data into a data frame in R; all the calculation has to happen on Spark. A lot of things stop working on Spark, and the median function is one of them. I can get the percentile function to work but not median, and I can't figure out how to supply the extra parameters in this specific setting.

Creating a versatile descriptives table using dplyr

I'm trying to create a simple piece of code that I can reuse over and over (with minimal adjustments) to print a table of summary statistics.
A reproducible example creates a table with M and SD for the variable V1 broken down by group:
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
library(dplyr)
descriptives <- data %>% group_by(Group) %>%
summarize(
Mean = mean(V1),
SD = sd(V1)
)
descriptives
I'd like to modify this function so that it will compute M and SD for all variables in my dataset.
I'd like to be able to replace the call to V1 with something like vars which is just a list of all the variables in my dataset; in this example, V1 and V2. But usually I have like 100 variables.
The reason I'd like it to work this way is so that I can do something very easy like:
vars <- names(data[3:4])
and very quickly select the columns for which I want summary statistics.
A few things for my wishlist:
M and SD for a given variable should be next to each other, and I'd like to add a column above each pair with the variable name.
I'd like the end product to look something like
I'd like to use dplyr, but I'm open to other options.
I'd also like to learn how I could switch the rows and columns of the table so that the variables are on separate rows and each group has a column (or two columns, one for M and one for SD). Like this:
Close, but no cigar:
The newish summarise(across()) kind of helps:
dplyr::group_by(data, Group) %>%
dplyr::summarise(dplyr::across(.cols = c(V1, V2), .fns = c(mean, sd)))
But I don't know how to scale it without making multiple tables and using rbind() to stack them.
I really like the format of table1() (vignette), but from what I can tell I can only stratify the column M/SDs by another variable. I really wish I could just add additional grouping variables on.
The column ordering is limited, but with select we can reorder the columns based on the variable-name prefix of each column name:
library(dplyr)
library(stringr)
data %>%
group_by(Group) %>%
# vars is the character vector of column names, e.g. names(data[3:4])
summarise_at(vars, list(Mean = mean, SD = sd)) %>%
select(Group, order(str_remove(names(.)[-1], "_.*")) + 1)
# A tibble: 2 x 5
# Group V1_Mean V1_SD V2_Mean V2_SD
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.165 0.915 0.146 1.16
#2 2 0.308 1.31 -0.00711 0.854
I had a similar question here, and got some really useful and simple answers using tidyverse. In the end a really robust approach was made, which I wrapped in a function and use regularly.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -all_of(grouping_var),
names_to = "variable"
) %>% pivot_wider(
names_from = all_of(grouping_var)
)
}
It takes three arguments, data, variables and the grouping_var - all of which are rather self explanatory.
Here is a test using mtcars with a 2 level and 3 level grouping var.
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "am"
)
# A tibble: 2 x 3
variable `0` `1`
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4` `6` `8`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
It works out of the box and is applicable to any data. Below I used iris:
baseline_table(
data = iris,
variables = c("Sepal.Length", "Sepal.Width"),
grouping_var = "Species"
)
# A tibble: 2 x 4
variable setosa versicolor virginica
<chr> <chr> <chr> <chr>
1 Sepal.Length 5.01(±0.35) 5.94(±0.52) 6.59(±0.64)
2 Sepal.Width 3.43(±0.38) 2.77(±0.31) 2.97(±0.32)
Of course, some grouping variables are not directly suited for this, cyl being one of them, although it does serve as a good example. You can recode your grouping variables accordingly:
baseline_table(
data = mtcars %>% mutate(cyl = paste(cyl, "Cylinders", sep = " ")),
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
You can also modify the function to include a descriptive row about the values:
baseline_table <- function(data, variables, grouping_var) {
# Generate the table;
tmpTable <- data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -all_of(grouping_var),
names_to = "variable"
) %>% pivot_wider(
names_from = all_of(grouping_var)
)
# Generate Descriptives dynamically
tmpDesc <- tmpTable[1,] %>% mutate(
across(everything(), .fns = ~ paste("Mean (±SD)"))
) %>% mutate(
variable = ""
)
bind_rows(
tmpDesc,
tmpTable
)
}
Granted, this extension is a bit awkward, but it is nonetheless robust. The output is:
# A tibble: 3 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 "" Mean (±SD) Mean (±SD) Mean (±SD)
2 "mpg" 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
3 "hp" 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
Update: I've rewritten the function for added flexibility, as noted in the comments.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!!syms(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% unite(
"grouping",
all_of(grouping_var)
) %>% pivot_longer(
cols = -"grouping",
names_to = "variables"
) %>% pivot_wider(
names_from = "grouping"
)
}
It works the same way and gives the same output, unless there is more than one grouping_var:
baseline_table(
mtcars,
variables = c("hp", "mpg"),
grouping_var = c("am", "cyl")
)
# A tibble: 2 x 7
variables `0_4` `0_6` `0_8` `1_4` `1_6` `1_8`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 hp 84.67(±19.66) 115.25(±9.18) 194.17(±33.36) 81.88(±22.66) 131.67(±37.53) 299.5(±50.2)
2 mpg 22.9(±1.45) 19.12(±1.63) 15.05(±2.77) 28.08(±4.48) 20.57(±0.75) 15.4(±0.57)
In the updated function I used unite with the default separator. You can modify this to suit your needs so that the column names read, for example, 4 Cylinder (Automatic), 6 Cylinder (Automatic), etc.
As a slight variation on your original code, you could use across() more simply and flexibly by specifying that you don't want the ID column (the grouping column Group is excluded automatically), but rather everything else:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}"))
# A tibble: 2 x 5
Group V1_Mean V1_SD V2_Mean V2_SD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.0167 0.979 0.145 1.02
2 2 0.119 1.11 -0.277 1.05
EDIT:
If you want to create your (first) goal exactly, you can use the gt package to make an html table with column spanners:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}")) %>%
gt::gt() %>%
gt::tab_spanner_delim("_") %>%
gt::fmt_number(-Group, decimals = 2)
As to your other question, you could alternately do something like this to get the combined & transposed variation:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = ~paste0(
sprintf("%.2f", mean(.x)),
sprintf(" (%.2f)", sd(.x))))) %>%
t() %>%
as.data.frame()
V1 V2
Group 1 2
V1 -0.02 (0.98) 0.12 (1.11)
V2 0.15 (1.02) -0.28 (1.05)
Outside dplyr, you could use the tables package, which lets you create summary statistics from a table formula:
library(tables)
vars <- c("V1","V2")
vars <- paste(vars, collapse="+")
table <- as.formula(paste("(group = factor(Group)) ~ (", vars ,")*(mean+sd)"))
table
# (group = factor(Group)) ~ (V1 + V2) * (mean + sd)
tables::tabular(table, data = data)
# V1 V2
# group mean sd mean sd
# 1 -0.15759 0.9771 0.1405 1.0697
# 2 0.05084 0.9039 -0.1470 0.9949
One way to make a nice summary table is to use the gtsummary package (note that I am a co-author of this package, just as an FYI). Below I format the data a little in data2 and drop the ID variable. Then it is a two-line call to gtsummary to summarize your data. The by argument is what stratifies the table, and in the statistic argument I am simply telling it to show the mean and SD; by default gtsummary shows the median and Q1-Q3. This table can be rendered in all markdown output formats (Word, PDF, HTML).
library(dplyr)
library(gtsummary)
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
data2 <- data %>%
mutate(Group = ifelse(Group == 1, "Group Var1","Group Var2")) %>%
select(-ID)
tbl_summary(data2, by = Group,
statistic = all_continuous()~ "{mean} ({sd})")
If you want more than one stratum but do not want to use tbl_strata, you can combine two variables into one column and use that in the by argument. You can unite() as many variables as you want (although this may not be recommended).
trial %>%
tidyr::unite(col = "trt_grade", trt, grade, sep = ", ") %>%
select(age, marker,stage,trt_grade) %>%
tbl_summary(by = c(trt_grade))
A data.table option:
library(data.table)
dcast(
setDT(data)[,
c(
.(Meas = c("M", "Sd")),
lapply(.SD, function(x) c(mean(x), sd(x)))
),
Group,
.SDcols = patterns("V\\d")
], Group ~ Meas,
value.var = c("V1", "V2")
)
gives
Group V1_M V1_Sd V2_M V2_Sd
1: 1 -0.2392583 1.097343 -0.08048455 0.7851212
2: 2 0.1059716 1.011769 -0.23356373 0.9927975
You can also use base R:
# using do.call to make the result a data.frame
do.call(
data.frame
# here you aggregate for all the functions you need
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x))))
)
This leads to something like this:
Group V1.mn V1.sd V2.mn V2.sd
1 1 0.1239868 1.008214 0.07215481 1.026059
2 2 -0.2324611 1.048230 0.11348897 1.071467
If you want a fancier table, kableExtra can really help. Note that %>% should be available once kableExtra is loaded; in any case, from R 4.1 onwards you can use the native pipe |> instead:
library(kableExtra)
# data manipulation as above, note the [,-1] to remove the Group column
do.call(
data.frame
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x)))))[,-1] %>%
# here you define as a kable, and give the names you want to columns
kbl(col.names = rep(c('mean','sd'),2) ) %>%
# some formatting
kable_paper() %>%
# adding the first header (after dropping Group, each pair of columns belongs to one variable)
add_header_above(c( "V1" = 2, "V2" = 2)) %>%
# another header if you need it
add_header_above(c( "Big group" = 4))
There is much more you can do to make great tables.
Alternatively, you can try something like this:
do.call(data.frame,
aggregate(. ~ Group, data = data[,-1], FUN = function(x) paste0(round(mean(x),2),' (', round(sd(x),2),')'))
) %>%
kbl() %>%
kable_paper()
That leads to:

Pass a list of variable names to a function using {{foo}}

Problem
I would like to know how to pass a list of variable names to a purrr::map2 function for the purpose of iterating over a separate data frame.
The input_table$key variable below contains mpg and disp from the mtcars dataset. I think the names of the variables are being passed as character strings rather than variable names. The question is how I can change that so that my function recognises that they are variable names(?).
In this example I am trying to sum all of the values in the mtcars variables mpg and disp that fall below a set of numeric thresholds. Those variables from mtcars and the relevant thresholds are contained in input_table (below).
Ideal result
percentile key value sum_y
<fct> <chr> <dbl> <dbl>
1 0.5 mpg 19.2 266.5
2 0.9 mpg 30.1 515.8
3 0.99 mpg 33.4 609.0
4 1 mpg 33.9 642.9
5 ... ... ... ...
Attempt
library(dplyr)
library(purrr)
library(tidyr)
# Arrange a generic example
# Replicating my data structure
input_table <- mtcars %>%
as_tibble() %>%
select(mpg, disp) %>%
map_df(quantile, probs = c(0.5, 0.90, 0.99, 1)) %>%
mutate(
percentile = factor(c(0.5, 0.90, 0.99, 1))
) %>%
select(
percentile, mpg, disp
) %>%
gather(key, value, -percentile)
# Defining the function
test_func <- function(label_desc, threshold) {
mtcars %>%
select({{label_desc}}) %>%
filter({{label_desc}} <= {{threshold}}) %>%
summarise(
sum_y = sum(as.numeric({{label_desc}}), na.rm = TRUE)
)
}
# Demo'ing that it works for a single variable and threshold value
test_func(label_desc = mpg, threshold = 19.2)
# This is where I am having trouble
# Trying to iterate over multiple (mpg, disp) variables
map2(input_table$key, input_table$value, ~test_func(label_desc = .x, threshold = .y))
The issue is that curly-curly ({{ }}) is meant for unquoted variables, as in your first attempt. In your second attempt you are passing quoted variables (character strings), for which the curly-curly operator does not work. A simple fix is to use the _at variants in dplyr, which accept quoted arguments.
test_func <- function(label_desc, threshold) {
mtcars %>%
filter_at(label_desc, any_vars(. <= threshold)) %>%
summarise_at(label_desc, sum)
}
purrr::map2(input_table$key, input_table$value, test_func)
#[[1]]
# mpg
#1 266.5
#[[2]]
# mpg
#1 515.8
#[[3]]
# mpg
#1 609
#[[4]]
# mpg
#1 642.9
#[[5]]
# disp
#1 1956.7
#.....
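Another option for character column names is the .data pronoun, which also makes it straightforward to return one tidy table rather than a list. This is a minimal sketch, not the answer's method: sum_below is a hypothetical helper name, and the small input_table here is a hand-written stand-in for the quantile table built in the question.

```r
library(dplyr)
library(purrr)

# Hand-written stand-in for the question's input_table:
# one row per (variable name, threshold) pair.
input_table <- tibble(
  key   = c("mpg", "mpg", "disp"),
  value = c(19.2, 33.9, 196.3)
)

# Hypothetical helper: label_desc is a column name given as a string,
# so it is looked up with .data[[...]] instead of {{ }}.
sum_below <- function(label_desc, threshold) {
  mtcars %>%
    filter(.data[[label_desc]] <= threshold) %>%
    summarise(sum_y = sum(.data[[label_desc]], na.rm = TRUE)) %>%
    pull(sum_y)
}

# map2_dbl returns a plain numeric vector, so the sums can be
# attached as a new column next to their thresholds.
result <- input_table %>%
  mutate(sum_y = map2_dbl(key, value, sum_below))

result
```

Keeping the sums in the same table as the thresholds gives the layout shown under "Ideal result" in the question.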

R: How to cut a numeric variable in grouped tbl_df using dynamic variable length breaks

I wonder how to create a categorical variable using dynamic breaks for a grouped numeric variable using dplyr.
Here is a toy example using the mtcars data: I want to categorize cars into low and high mpg classes when grouped by vs and am. A car is classified as a low-mpg car if its mpg is below the mean mpg of its group. Here is my way of doing this:
library(tidyverse)
mtcars %>%
as_tibble() %>%
group_by(vs, am) %>%
mutate(lowMPG = ifelse(mpg < mean(mpg), "Yes", "No"))
However, my actual problem is more general: the breaks can be a vector instead of a scalar for each group. Also, the function used to compute the breaks might come from an external source. So you might have the following object stored in brk in R to cut the mpg variable:
vs am breakPoint_1 breakPoint_2 breakPoint_3
0 0 14.0 15.0 17.0
0 1 17.0 19.0
1 0 19.0 21.0
1 1 28.4
Any help will be highly appreciated
You can use dplyr and pmap from purrr. The main thing is to first create the break points for every unique combination of vs and am.
brk_point <- tibble(vs = c(0,0,1,1),
am = c(0,1,0,1),
brk = list(c(-Inf, 14,15,17, Inf),
c(-Inf, 17,19, Inf),
c(-Inf, 19,21, Inf),
c(-Inf, 28.4, Inf)))
foo <- mtcars %>%
as_tibble() %>%
left_join(., brk_point, by = c("vs", "am"))
foo_cut <- foo %>%
dplyr::mutate(cut_mpg = purrr::pmap(list(.$mpg,.$brk),
cut,
include.lowest = TRUE))
You can also use unnest to organise it.
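If a plain character column is enough, the list-column and the unnest step can be avoided by cutting one value at a time. This is a sketch under the assumption that the break points are those in the question's table, padded with -Inf and Inf so every mpg value falls in some interval; mpg_classes is an illustrative name.

```r
library(dplyr)
library(purrr)

# Break points per (vs, am) group, taken from the question's table
# and padded with -Inf / Inf.
brk_point <- tibble(
  vs  = c(0, 0, 1, 1),
  am  = c(0, 1, 0, 1),
  brk = list(c(-Inf, 14, 15, 17, Inf),
             c(-Inf, 17, 19, Inf),
             c(-Inf, 19, 21, Inf),
             c(-Inf, 28.4, Inf))
)

mpg_classes <- mtcars %>%
  as_tibble() %>%
  left_join(brk_point, by = c("vs", "am")) %>%
  # cut() each mpg value with its own group's breaks and keep the
  # interval label as text, giving an ordinary character column.
  mutate(cut_mpg = map2_chr(mpg, brk,
                            ~ as.character(cut(.x, breaks = .y)))) %>%
  select(-brk)

mpg_classes
```

Storing the labels as text sidesteps the problem that each group's factor would otherwise have a different set of levels.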

Resources