I am using summarise_at() to obtain the mean and standard error of multiple variables by group.
The output has 1 row for each group, and 1 column for each calculated quantity, per group. I'd like to have a table with 1 row for each variable, and 1 column for each calculated quantity:
data <- mtcars
data$condition <- as.factor(c(rep("control", 16), rep("treat", 16)))
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt),
funs(mean = mean, se=sd(.)/sqrt(n())))
# A tibble: 2 x 7
condition mpg_mean cyl_mean wt_mean mpg_se cyl_se wt_se
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 control 18.2 6.5 3.56 1.04 0.387 0.204
2 treat 22.0 5.88 2.87 1.77 0.499 0.257
Here's what I think would be more useful (the numbers are not meaningful):
# MEAN.control, MEAN.treat, SE.control, SE.treat
# mpg 1.5 2.4 .30 .45
# cyl 3.2 1.9 .20 .60
# disp 12.3 17.8 .20 .19
Any ideas? New to the tidyverse, so sorry if this is too basic.
funs() has been deprecated since dplyr 0.8.0. Instead, pass a list of functions or lambdas to summarise_at()/mutate_at(). After the summarise step, gather the data into 'long' format, separate the 'key' column into 'key1' and 'key2' by splitting at the delimiter _, unite 'key2' and 'condition' into a single 'cond' column (the statistics are already named in upper case in the list), spread it to 'wide' format and, if needed, turn the 'key1' column into row names.
library(tidyverse)
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(MEAN = ~ mean(.),
SE = ~sd(.)/sqrt(n()))) %>%
gather(key, val, -condition) %>%
separate(key, into = c("key1", "key2")) %>%
unite(cond, key2, condition, sep=".") %>%
spread(cond, val) %>%
column_to_rownames('key1')
# MEAN.control MEAN.treat SE.control SE.treat
#cyl 6.500000 5.875000 0.3872983 0.4989572
#mpg 18.200000 21.981250 1.0369024 1.7720332
#wt 3.560875 2.873625 0.2044885 0.2571034
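In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(), and across() covers what summarise_at() does. The following is a minimal sketch of the same reshaping under those assumptions (dplyr >= 1.0 and tidyr >= 1.0); the column names variable and stat are illustrative and not part of the answer above.

```r
library(dplyr)
library(tidyr)

data <- mtcars
data$condition <- factor(rep(c("control", "treat"), each = 16))

result <- data %>%
  group_by(condition) %>%
  # One MEAN and one SE column per variable, named e.g. mpg_MEAN
  summarise(across(c(mpg, cyl, wt),
                   list(MEAN = ~ mean(.x), SE = ~ sd(.x) / sqrt(n())),
                   .names = "{.col}_{.fn}"),
            .groups = "drop") %>%
  # Split "mpg_MEAN" into variable = "mpg" and stat = "MEAN"
  pivot_longer(-condition,
               names_to = c("variable", "stat"),
               names_sep = "_") %>%
  # One column per stat/condition pair, e.g. MEAN.control
  pivot_wider(names_from = c(stat, condition),
              values_from = value,
              names_sep = ".")

print(result)
```

This returns a tibble with the variable names in a regular column; column_to_rownames('variable') could be appended if row names are really needed.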
A different possibility could be:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
separate(var, c("vars", "var2")) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
vars MEAN_control MEAN_treat SE_control SE_treat
<chr> <dbl> <dbl> <dbl> <dbl>
1 cyl 6.5 5.88 0.387 0.499
2 mpg 18.2 22.0 1.04 1.77
3 wt 3.56 2.87 0.204 0.257
Here, after your initial steps, the code first reshapes the data from wide to long, leaving the "condition" column out of the reshaping. Second, it separates the variable names into two columns. Third, it combines the metric (converted to upper case) with the condition. Finally, it removes the now redundant "condition" column and spreads the result into the desired format.
Or you can avoid separate() by using some regex:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = gsub("_.*$", "", var),
var2 = gsub(".*\\_", "", var)) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or with strsplit():
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = sapply(strsplit(var, "_"), function(x) x[1]),
var2 = sapply(strsplit(var, "_"), function(x) x[2])) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or you can completely rewrite it to:
data %>%
select(mpg, cyl, wt, condition) %>%
gather(vars, val, -condition) %>%
group_by(condition, vars) %>%
summarise(mean = mean(val),
se = sd(val)/sqrt(n())) %>%
ungroup() %>%
gather(var2, val, -c(condition, vars)) %>%
mutate(var2 = paste(toupper(var2), condition, sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
In this case, it first selects the variables of interest. Second, it performs a transformation from wide to long format, excluding the "condition" column. Third, it groups by condition and variable name and calculates the metrics. Fourth, it performs a second wide-to-long transformation, excluding the "condition" column and the column holding the original variable names. Finally, it combines the metric (in upper case) with the condition, removes the redundant variable and spreads the result into the desired format.
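The reshape-first steps described above can also be sketched with pivot_longer() and pivot_wider(), assuming tidyr >= 1.0 and dplyr >= 1.0. Because pivot_wider() accepts several value columns at once, the second wide-to-long pass and the paste() step are not needed in this variant. The object name long_first is only illustrative.

```r
library(dplyr)
library(tidyr)

data <- mtcars
data$condition <- factor(rep(c("control", "treat"), each = 16))

long_first <- data %>%
  select(mpg, cyl, wt, condition) %>%
  # One row per observation and variable
  pivot_longer(-condition, names_to = "vars", values_to = "val") %>%
  # Metrics per condition and variable
  group_by(condition, vars) %>%
  summarise(MEAN = mean(val), SE = sd(val) / sqrt(n()), .groups = "drop") %>%
  # Spread both metrics across the conditions in one call;
  # column names are built as <metric>_<condition>
  pivot_wider(names_from = condition, values_from = c(MEAN, SE))

print(long_first)
```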
I'm looking to report the min, max, and mean of certain columns (price, age, and dist) from the houses data set using pipes in a concise tibble. For now, I have the following code, which produces a rather inelegant solution with a 1x9 tibble:
houses %>%
select(price, age, dist) %>%
summarize_each(list(min = min, max = max, mean = mean))
I was hoping to create a more organized solution using pipes with the selected data as rows and the summary stats (min, max, mean) as columns resulting in a 3x3 tibble. Any ideas?
You may first get the data in long format and then calculate summary statistics for each column. Here is an example with mtcars dataset.
library(dplyr)
library(tidyr)
mtcars %>%
select(mpg, disp, cyl) %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(min = min(value, na.rm = TRUE),
max = max(value, na.rm = TRUE),
mean = mean(value, na.rm = TRUE))
# name min max mean
# <chr> <dbl> <dbl> <dbl>
#1 cyl 4 8 6.19
#2 disp 71.1 472 231.
#3 mpg 10.4 33.9 20.1
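An alternative sketch summarises first and reshapes afterwards, using across() together with the special ".value" name in pivot_longer(). It assumes dplyr >= 1.0 and tidyr >= 1.0, and that none of the column names contains an underscore, since "_" is used as the separator here. mtcars again stands in for the houses data.

```r
library(dplyr)
library(tidyr)

stats <- mtcars %>%
  # 1 x 9 summary with names such as mpg_min, mpg_max, mpg_mean
  summarise(across(c(mpg, disp, cyl),
                   list(min = min, max = max, mean = mean),
                   .names = "{.col}_{.fn}")) %>%
  # The part before "_" becomes the row label, the part after it
  # becomes the column name (that is what ".value" requests)
  pivot_longer(everything(),
               names_to = c("name", ".value"),
               names_sep = "_")

print(stats)
```

The result has one row per variable and the columns name, min, max and mean.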
A possible solution to output a dataframe:
library(dplyr)
houses %>%
summarise(across(c(price,age,dist),c(max,min,mean))) %>%
matrix(ncol = 3, byrow = T) %>%
as.data.frame() %>%
rename(Max=V1, Min=V2, Mean=V3)
A possible solution to output a tibble:
library(dplyr)
houses %>%
summarise(across(c(price,age,dist),c(max,min,mean))) %>%
matrix(ncol = 3, byrow = T) %>%
tibble(Max=unlist(.[,1]),Min=unlist(.[,2]),Mean=unlist(.[,3])) %>%
select(Max,Min,Mean)
EDIT (2021-10-01)
A very short solution:
library(dplyr)
library(purrr)
map_dfc(c("Max","Min","Mean"),
~ tibble(!!sym(.x) := apply(select(houses, price, age, dist),2,tolower(.x))))
Based on my first question found here about creating a summary table without iteration, i.e. without using map, I made the following algorithm based on the formidable answers,
library(tidyverse)
sum_variables <- c("mpg", "hp", "disp")
# Create grouping var; ####
mtcars <- mtcars %>% mutate(
am_factor = case_when(
am == 0 ~ "Automatic",
TRUE ~ "Manual"
)
)
# Create summary table; ####
mtcars %>%
group_by(am_factor) %>%
summarise(
across(
all_of(sum_variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -"am_factor"
) %>% pivot_wider(
names_from = "am_factor"
)
Which gives me the following output,
# A tibble: 3 x 3
name Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
Using paste0 here has the benefit of reducing the amount of code needed in the algorithm, but it complicates further additions to the table. If I, for example, want to add differences to this table, my current solution is the following,
mtcars %>%
group_by(am_factor) %>%
summarise(
across(
all_of(sum_variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -"am_factor"
) %>% pivot_wider(
names_from = "am_factor"
) %>% mutate(
difference = str_extract(Automatic, "[:digit:].?[:digit:]") %>% as.numeric() -
str_extract(Manual, "[:digit:].?[:digit:]") %>% as.numeric()
)
Which gives the desired output,
# A tibble: 3 × 4
name Automatic Manual difference
<chr> <chr> <chr> <dbl>
1 mpg 17.15(±3.83) 24.39(±6.17) -7
2 hp 160.26(±53.91) 126.85(±84.06) 34
3 disp 290.38(±110.17) 143.53(±87.2) 147
Although it works, it defeats the purpose of keeping the algorithm simple.
How do I create a summary of my data in a simple manner? It has to be a tidyverse-solution, preferably without iteration.
This is not necessarily simpler or shorter but I would prefer to do the mathematical calculations on numbers directly rather than extracting them from strings.
library(dplyr)
library(tidyr)
mtcars %>%
group_by(am_factor) %>%
summarise(across(all_of(sum_variables), list(mean = mean,
sd = ~sprintf('%.2f (± %.2f)', mean(.), sd(.))))) %>%
pivot_longer(cols = -am_factor,
names_to = c('measure', '.value'),
names_sep = '_') %>%
group_by(measure) %>%
mutate(difference = -diff(mean)) %>%
ungroup %>%
select(-mean) %>%
pivot_wider(names_from = am_factor, values_from = sd)
# measure difference Automatic Manual
# <chr> <dbl> <chr> <chr>
#1 mpg -7.24 17.15 (± 3.83) 24.39 (± 6.17)
#2 hp 33.4 160.26 (± 53.91) 126.85 (± 84.06)
#3 disp 147. 290.38 (± 110.17) 143.53 (± 87.20)
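The differences in the question's table are whole numbers (-7, 34, 147), while the numeric calculation above gives -7.24, 33.4 and 147, which suggests that the extraction pattern picks up only the integer part of each mean. If parsing the formatted strings cannot be avoided, a base R sketch along the following lines keeps the decimals. The example cells are copied from the table above, and the result is still limited to the two decimals that were printed.

```r
# Formatted cells as they appear in the summary table (mpg row)
cells <- c("17.15(\u00b13.83)", "24.39(\u00b16.17)")

# Drop everything from the opening parenthesis onward, keeping the mean
means <- as.numeric(sub("\\(.*$", "", cells))

difference <- means[1] - means[2]
print(difference)
```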
Consider this data that needs the summary measures mean and sd on multiple variables,
# Create grouping var; ####
mtcars <- mtcars %>% mutate(
am = case_when(
am == 0 ~ "Automatic",
TRUE ~ "Manual"
)
)
With the following custom function and purrr, I can create a baseline table,
# Summarising function; ####
sum_foo <- function(data, var) {
data %>%
group_by(am) %>%
summarise(
mean = mean( !!sym(var) , na.rm = TRUE),
sd = sd( !!sym(var) , na.rm = TRUE)
) %>%
mutate(across(where(is.double), round, 2)) %>%
group_by(am) %>%
transmute(
value = paste(mean, "(±", sd, ")", sep = ""),
variable = var
) %>%
pivot_wider(
names_from = "am"
)
}
# Execute Function; ####
sum_variables <- c("mpg", "hp", "disp")
sum_variables %>% map(
sum_foo,
data = mtcars
) %>% reduce(
bind_rows
)
Which gives the following output,
# A tibble: 3 x 3
variable Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
I want to get the output without using map and reduce, i.e. without iterating through the variables with rowwise or map.
I'm looking for an alternative tidyverse-solution!
Maybe you could use this solution:
library(dplyr)
library(tidyr)
library(tibble)
sum_variables %>%
enframe() %>%
rowwise() %>%
mutate(output = list(sum_foo(mtcars, value))) %>%
select(output) %>%
unnest(cols = output)
# A tibble: 3 x 3
variable Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
Updated
Or you could even modify your function in the following way:
sum_foo2 <- function(data, var) {
data %>%
group_by(am) %>%
summarise(across(all_of(var), list(Mean = mean, sd = sd))) %>%
mutate(across(where(is.double), round, 2)) %>%
group_by(am) %>%
summarise(across(ends_with("Mean"), ~ paste(.x, "(±", get(gsub("_Mean", "_sd", cur_column())), ")", sep = ""))) %>%
pivot_longer(!am, names_to = "Mean", values_to = "Val") %>%
pivot_wider(names_from = "am", values_from = "Val")
}
sum_foo2(mtcars, sum_variables)
# A tibble: 3 x 3
Mean Automatic Manual
<chr> <chr> <chr>
1 mpg_Mean 17.15(±3.83) 24.39(±6.17)
2 hp_Mean 160.26(±53.91) 126.85(±84.06)
3 disp_Mean 290.38(±110.17) 143.53(±87.2)
The function above can be trimmed into a more concise version:
sum_foo2 <- function(data, var) {
data %>%
group_by(am) %>%
summarise(across(all_of(var), ~ paste0(round(mean(.x), 2), "(±", round(sd(.x), 2), ")"))) %>%
pivot_longer(!am, names_to = "Mean", values_to = "Val") %>%
pivot_wider(names_from = "am", values_from = "Val")
}
sum_foo2(mtcars, sum_variables)
Without the function you wrote, which requires iteration (i.e. rowwise/map), you could simply do:
sum_variables <- c("mpg", "hp", "disp")
mtcars %>%
group_by(am) %>%
summarise(across(all_of(sum_variables),
~sprintf('%.2f(\u00B1%.2f)', mean(.x), sd(.x))), .groups = 'drop') %>%
data.table::transpose(keep.names = 'variable', make.names = TRUE)
variable Automatic Manual
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.20)
So: Please be gentle. My solution without using map and reduce:
library(dplyr)
library(tidyr)
library(stringr)
mtcars %>%
group_by(am) %>%
summarise(across(c(mpg, hp, disp), list(mean = mean, sd = sd), .names = "{.col}_{.fn}")) %>%
pivot_longer (
cols = 2:7,
names_to = "variable",
values_to = "values"
) %>%
pivot_wider(
names_from = am,
values_from = values
) %>%
mutate(variable = str_extract(variable, "[^_]*")) %>%
mutate(across(c(Automatic, Manual), lead, .names = "{.col}_{.fn}")) %>%
filter(row_number() %% 2 == 1) %>%
mutate(across(where(is.numeric), round, 2)) %>%
mutate(Automatic = paste0(Automatic,"(±",Automatic_1,")"), .keep="unused") %>%
mutate(Manual = paste0(Manual,"(±",Manual_1,")"), .keep="unused")
Output:
variable Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
With the formidable answers that I got, this is the final tidyverse solution, without iteration or map, that came out of them,
mtcars %>%
group_by(am) %>%
summarise(
across(
all_of(sum_variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -"am"
) %>% pivot_wider(
names_from = "am"
)
Which gives the following output,
# A tibble: 3 x 3
name Automatic Manual
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
3 disp 290.38(±110.17) 143.53(±87.2)
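One detail in the output above: the last cell reads 143.53(±87.2) because round() followed by paste0() drops trailing zeros, whereas the sprintf()-based answers print 87.20. A small sketch of the difference, using an illustrative value that rounds to 87.2 and a hypothetical helper fmt_mean_sd():

```r
x <- 87.204  # illustrative value that rounds to 87.2

with_round <- paste0(round(x, 2))    # trailing zero is dropped
with_sprintf <- sprintf("%.2f", x)   # always two decimals

# Hypothetical helper giving fixed-width "mean(±sd)" cells
fmt_mean_sd <- function(v) sprintf("%.2f(\u00b1%.2f)", mean(v), sd(v))

print(c(with_round, with_sprintf, fmt_mean_sd(c(1, 2, 3))))
```

Swapping the paste0() lambda inside across() for ~ fmt_mean_sd(.x) would keep the column alignment consistent, at the cost of one extra function definition.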
I have a very large data set with variable names that are super abbreviated and it would help immensely if the label in the attr(*, "label") section was extracted and showed up in the column beside the corresponding variable.
label(mtcars[["mpg"]]) <- "Miles/(US) gallon"
label(mtcars[["hp"]]) <- "Gross horsepower"
label(mtcars[["wt"]]) <- "Weight (1000lbs)"
My current code just gets the mean/sd from the entire data set:
mtcars %>%
select(mpg, hp, wt) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(mean = mean(value, na.rm = TRUE), sd = sd(value, na.rm = TRUE))
But I want a column with the label of the variables so it's easier to tell:
name mean sd label
hp 147. 68.6 Gross horsepower
mpg 20.1 6.03 Miles/(US) gallon
wt 3.22 0.978 Weight (1000lbs)
I found a thread that sort of gets to what I want, but if I add mutate(labels=label(mtcars)[name]) at the end of the code, I get a column with NA instead of the labels.
We can use imap
library(purrr)
library(dplyr)
library(Hmisc)
# .x is the whole column and .y its name, so summarise .x directly
imap_dfr(mtcars[c('hp', 'mpg', 'wt')], ~
tibble(name = .y, mean = mean(.x, na.rm = TRUE),
sd = sd(.x, na.rm = TRUE),
label = attr(.x, 'label')))
If we use the OP's method, we can also use summarise_all and then do the pivot_longer
library(tidyr)
mtcars %>%
dplyr::select(mpg, hp, wt) %>%
summarise_all(list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
label = ~attr(., 'label'))) %>%
mutate(rn = 1) %>%
pivot_longer(cols = -rn, names_to = c('name', '.value'), names_sep="_") %>%
select(-rn)
# name mean sd label
#1 mpg 20.09062 6.0269481 Miles/(US) gallon
#2 hp 146.68750 68.5628685 Gross horsepower
#3 wt 3.21725 0.9784574 Weight (1000lbs)
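For comparison, here is a base R sketch that reads the labels straight from each column before any reshaping happens. It assumes the labels are stored in each column's "label" attribute, as the attr(*, "label") display in the question suggests; here they are set with attr<- so the example runs without Hmisc. The names df, vars and summary_tbl are illustrative.

```r
df <- mtcars
attr(df$mpg, "label") <- "Miles/(US) gallon"
attr(df$hp, "label") <- "Gross horsepower"
attr(df$wt, "label") <- "Weight (1000lbs)"

vars <- c("mpg", "hp", "wt")

# Compute each statistic column by column, then read the label
# attribute from the same columns
summary_tbl <- data.frame(
  name = vars,
  mean = unname(vapply(df[vars], mean, numeric(1), na.rm = TRUE)),
  sd = unname(vapply(df[vars], sd, numeric(1), na.rm = TRUE)),
  label = unname(vapply(df[vars], function(col) attr(col, "label"), character(1)))
)

print(summary_tbl)
```

The labels are looked up on the original columns because reshaping steps such as pivot_longer() combine several columns into one and cannot carry a per-column attribute along.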
When computing any statistic using summarise and group_by we only get the summary statistic per-category, and not the value for all the population (Total). How to get both?
I am looking for something clean and short. Until now I can only think of:
bind_rows(
iris %>% group_by(Species) %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1))
,
iris %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1)) %>%
mutate(Species = "Total")
)
But I would like something more compact. In particular, I don't want to type the code (for summarize) twice, once for each group and once for the total.
You can simplify it if you untangle what you're trying to do: you have iris data that has several species, and you want that summarized along with data for all species. You don't need to calculate those summary stats before you can bind. Instead, bind iris with a version of iris that's been set to Species = "Total", then group and summarize.
library(tidyverse)
bind_rows(
iris,
iris %>% mutate(Species = "Total")
) %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Width),
Median = median(Sepal.Width),
sd = sd(Sepal.Width),
p10 = quantile(Sepal.Width, probs = 0.1))
#> # A tibble: 4 x 5
#> Species Mean Median sd p10
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 3.4 0.379 3
#> 2 Total 3.06 3 0.436 2.5
#> 3 versicolor 2.77 2.8 0.314 2.3
#> 4 virginica 2.97 3 0.322 2.59
I like the caution in the comments above, though I have to do this sort of calculation for work enough that I have a similar shorthand function in a personal package. It perhaps makes less sense for things like standard deviations, but it's something I need to do a lot for adding up totals of demographic groups, etc. (If it's useful, that function is here).
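A minimal sketch of what such a shorthand could look like follows. The helper with_total() and its arguments are hypothetical and are not the function from the package mentioned above.

```r
library(dplyr)

# Hypothetical helper: stack the data on top of a copy whose grouping
# column is overwritten with one label, so a single group_by() covers
# both the individual groups and the overall total.
with_total <- function(df, group_col, total_label = "Total") {
  original <- df
  original[[group_col]] <- as.character(original[[group_col]])
  total <- df
  total[[group_col]] <- total_label
  bind_rows(original, total)
}

totals <- with_total(iris, "Species") %>%
  group_by(Species) %>%
  summarise(Mean = mean(Sepal.Width),
            Median = median(Sepal.Width))

print(totals)
```

Note that the stacked data contains every row twice, so this only suits statistics computed within each group; anything computed across the whole stacked table, such as an overall row count, would be doubled.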
A bit shorter, though quite similar to bind_rows:
q10 <- function(x){quantile(x , probs=0.1)}
iris %>%
select(Species,Sepal.Width)%>%
group_by(Species) %>%
summarise_all(c("mean", "sd", "q10")) %>%
t() %>%
cbind(c("total", iris %>% select(Sepal.Width) %>% summarise_all(c("mean", "sd", "q10")))) %>%
t()
Probably cleaner:
bind_rows(
iris %>%
group_by(Species) %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10"))
,
iris %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10")) %>%
mutate(Species = "Total")
)
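To avoid typing the summary twice, the summarise step can be wrapped in a small function and applied to both the grouped and the ungrouped data. This is only a sketch: summarise_width() is a hypothetical name, and Species is converted to character first so that the "Total" row binds without a type mismatch.

```r
library(dplyr)

q10 <- function(x) unname(quantile(x, probs = 0.1))

# Hypothetical wrapper: the statistics are written once and reused
summarise_width <- function(df) {
  summarise(df,
            mean = mean(Sepal.Width),
            sd = sd(Sepal.Width),
            q10 = q10(Sepal.Width))
}

combined <- bind_rows(
  iris %>%
    mutate(Species = as.character(Species)) %>%
    group_by(Species) %>%
    summarise_width(),
  iris %>%
    summarise_width() %>%
    mutate(Species = "Total")
)

print(combined)
```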