Creating a table from summarized data using dplyr - r

I want to create a summary table from summarized data using dplyr.
library(dplyr)
mtcars %>% group_by(cyl, gear) %>% summarise(avg_wt = mean(wt))
Here's the output:
# A tibble: 8 x 3
# Groups: cyl [3]
cyl gear avg_wt
<dbl> <dbl> <dbl>
1 4 3 2.46
2 4 4 2.38
3 4 5 1.83
4 6 3 3.34
5 6 4 3.09
6 6 5 2.77
7 8 3 4.10
8 8 5 3.37
How can I generate the following output, with cyl as the columns and gear as the rows?
4 6 8
3 2.46 3.34 4.10
4 2.38 3.09 NA
5 1.83 2.77 3.37

library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
group_by(cyl, gear) %>%
summarise(avg_wt = mean(wt)) %>%
pivot_wider(
id_cols = "gear",
names_from = "cyl",
values_from = "avg_wt"
) %>%
column_to_rownames("gear")
#> 4 6 8
#> 3 2.465000 3.33750 4.104083
#> 4 2.378125 3.09375 NA
#> 5 1.826500 2.77000 3.370000

Try this:
mytable <- mtcars %>% group_by(cyl, gear) %>% summarise(avg_wt = mean(wt))
tidyr::spread(mytable, cyl, avg_wt)
You should get the following:
gear `4` `6` `8`
<dbl> <dbl> <dbl> <dbl>
1 3 2.46 3.34 4.10
2 4 2.38 3.09 NA
3 5 1.83 2.77 3.37
Hope this helps you.
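If a plain matrix is acceptable instead of a tibble, base R can produce the same layout. This is only a sketch: tapply() computes the mean of wt for every gear and cyl combination and fills combinations that never occur with NA. The name avg_wt_table is illustrative and not from the answers above.

```r
# Rows are gear, columns are cyl; combinations with no cars come back as NA
avg_wt_table <- with(mtcars, tapply(wt, list(gear, cyl), mean))

# Round only for display; the stored values keep full precision
round(avg_wt_table, 2)
```

The result carries gear and cyl as dimnames, so no separate row-name step is needed.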

Related

How do I build a dplyr summarize statement programmatically?

I'm trying to do some dplyr programming and having trouble. I'd like to group_by an arbitrary number of variables (thus, across), and then summarize based on arbitrary length (but all the same length) vectors of:
The column to apply the function to
The function to apply
The name of the new column
So, like in a map or apply statement, I want to execute code that ends up looking like:
data %>%
group_by(group_column) %>%
summarize(new_name_1 = function_1(column_1),
          new_name_2 = function_2(column_2))
Here's an example of what I want and my best shot so far. I know I can use the names argument to clean those up if I use across, but I'm not confident that across is the correct way. Finally, I'll be applying this to fairly large dataframes, so I'd rather not calculate the extra columns.
Desired result
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd))) %>%
select(cyl, carb, disp_mean, hp_sd)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
What I get
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 6
#> # Groups: cyl [3]
#> cyl carb disp_mean disp_sd hp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 21.4 77.4 16.1
#> 2 4 2 117. 27.1 87 24.9
#> 3 6 1 242. 23.3 108. 3.54
#> 4 6 4 164. 4.39 116. 7.51
#> 5 6 6 145 NA 175 NA
#> 6 8 2 346. 43.4 162. 14.4
#> 7 8 3 276. 0 180 0
#> 8 8 4 406. 57.8 234 21.7
#> 9 8 8 301 NA 335 NA
With different functions on different columns, an option is to use collap from collapse
library(collapse)
collap(mtcars, ~ cyl + carb, custom = list(fmean = 3, fsd = 4))
-output
cyl disp hp carb
1 4 91.38 16.133815 1
2 4 116.60 24.859606 2
3 6 241.50 3.535534 1
4 6 163.80 7.505553 4
5 6 145.00 NA 6
6 8 345.50 14.433757 2
7 8 275.80 0.000000 3
8 8 405.50 21.725561 4
9 8 301.00 NA 8
Or the index can be dynamically generated with match
collap(mtcars, ~ cyl + carb, custom = list(fmean =
match('disp', names(mtcars)), fsd = match('hp', names(mtcars))))
With tidyverse, an option is to loop over the column names of interest and the functions in map2 and do a join later
library(dplyr)
library(purrr)
library(stringr)
map2(c("disp", "hp"), c("mean", "sd"), ~
mtcars %>%
group_by(across(c('cyl', 'carb'))) %>%
summarise(across(all_of(.x), match.fun(.y),
.names = str_c("{.col}_", .y)), .groups = 'drop')) %>%
reduce(inner_join)
-output
# A tibble: 9 x 4
cyl carb disp_mean hp_sd
<dbl> <dbl> <dbl> <dbl>
1 4 1 91.4 16.1
2 4 2 117. 24.9
3 6 1 242. 3.54
4 6 4 164. 7.51
5 6 6 145 NA
6 8 2 346. 14.4
7 8 3 276. 0
8 8 4 406. 21.7
9 8 8 301 NA
I have a package on GitHub, {dplyover},
which can help with this kind of task. In this case we could use over2 to
loop over two character vectors simultaneously. The first vector contains the
variable names as strings, which is why we have to wrap .x in sym() when
applying a function to it. The second vector contains the function names,
which we use as .y in a do.call. over2 creates the desired names automatically.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(over2(c("disp", "hp"),
c("mean", "sd"),
~ do.call(.y, list(sym(.x)))
))
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
An alternative way building on the same logic is to use purrr::map2. However,
here we have to put some effort into creating vectors with the desired names.
library(purrr)
# setup vectors and names
myfuns <- c("mean", "sd")
myvars <- c("disp", "hp") %>%
set_names(., paste(., myfuns, sep = "_"))
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(map2(myvars,
myfuns,
~ do.call(.y, list(sym(.x)))
) %>% bind_cols()
)
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
Created on 2021-08-20 by the reprex package (v2.0.1)
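Another way to get exactly the requested columns, without computing the extra ones, is to build the summary expressions first and splice them into summarise() with !!!. This is a sketch rather than a definitive answer; the names group_cols, target_cols, fun_names and summary_exprs are hypothetical helpers, and it assumes dplyr 1.0 or later with rlang installed.

```r
library(dplyr)

group_cols  <- c("cyl", "carb")
target_cols <- c("disp", "hp")   # column each function is applied to
fun_names   <- c("mean", "sd")   # function applied to the matching column

# Build one unevaluated call per pair, e.g. mean(disp) and sd(hp)
summary_exprs <- Map(
  function(col, fun) rlang::call2(fun, rlang::sym(col)),
  target_cols, fun_names
)
# The list names become the output column names
names(summary_exprs) <- paste(target_cols, fun_names, sep = "_")

result <- mtcars %>%
  group_by(across(all_of(group_cols))) %>%
  summarise(!!!summary_exprs, .groups = "drop")

result
```

Because each call is built per pair, only disp_mean and hp_sd are ever calculated, which matters on large data frames.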

Compare means of 2 levels of group_by in one data set in R

Take for example the mtcars data set.
I would like to compute the ratio of the mean mpg of cars grouped by cyl only to the mean mpg of cars grouped by both cyl and carb.
The problem is that grouping the data set with dplyr fixes one level of granularity, which makes it impossible to compare against a different level of grouping.
So I created two new data sets, one for each grouping, and then joined them to compare the two means in a mutated column, as below.
This worked, but it seems like a roundabout and inefficient way to code it. What is the proper way to do this?
my code:
cyl_only <- mtcars %>%
group_by(cyl) %>%
summarise(cyl_only_mean= mean(mpg))
cyl_carb <- mtcars %>%
group_by(cyl,carb) %>%
summarise(cyl_carb_mean= mean(mpg))
cyl_carb_join <- cyl_only %>%
left_join(cyl_carb,by="cyl")
mtcars_result <- mutate(cyl_carb_join,ratio= cyl_only_mean/cyl_carb_mean)
You can accomplish this without doing the joining by bringing along the summary information needed to calculate a mean.
mtcars %>%
group_by(cyl, carb) %>%
summarise(sum_mpg = sum(mpg),
count = n(),
cyl_carb_mean = mean(mpg)) %>%
group_by(cyl) %>%
mutate(cyl_only_mean = sum(sum_mpg) / sum(count),
ratio = cyl_only_mean/cyl_carb_mean)
cyl carb sum_mpg count cyl_carb_mean cyl_only_mean ratio
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 4 1 138. 5 27.6 26.7 0.967
2 4 2 155. 6 25.9 26.7 1.03
3 6 1 39.5 2 19.8 19.7 1.00
4 6 4 79 4 19.8 19.7 1.00
5 6 6 19.7 1 19.7 19.7 1.00
6 8 2 68.6 4 17.2 15.1 0.880
7 8 3 48.9 3 16.3 15.1 0.926
8 8 4 78.9 6 13.2 15.1 1.15
9 8 8 15 1 15 15.1 1.01
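A variation on the same idea, sketched under the assumption that dplyr 1.0 or later is available (for the .add argument): attach the cyl-level mean to every row with mutate() first, then add carb to the grouping and summarise. The name ratio_tbl is illustrative.

```r
library(dplyr)

ratio_tbl <- mtcars %>%
  group_by(cyl) %>%
  mutate(cyl_only_mean = mean(mpg)) %>%        # mean per cyl, repeated on each row
  group_by(carb, .add = TRUE) %>%              # now grouped by cyl and carb
  summarise(cyl_carb_mean = mean(mpg),
            cyl_only_mean = first(cyl_only_mean),  # constant within each group
            .groups = "drop") %>%
  mutate(ratio = cyl_only_mean / cyl_carb_mean)

ratio_tbl
```

This avoids both the join and the intermediate sum and count columns, at the cost of carrying one extra column through the pipeline.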

Use summarise and summarise_at in same dplyr chain

Suppose I want to summarise a data frame after grouping with differing functions. How can I do that?
mtcars %>% group_by(cyl) %>% summarise(size = n())
# A tibble: 3 x 2
cyl size
<dbl> <int>
1 4 11
2 6 7
3 8 14
But if I try:
mtcars %>% group_by(cyl) %>% summarise(size = n()) %>% summarise_at(vars(c(mpg, am:carb)), mean)
Error in is_string(y) : object 'carb' not found
How can I get first the size of each group with n() and then the mean of the other chosen features?
Here is one way using a dplyr::inner_join() on the two summarize operations by the grouping variable:
mtcars %>%
group_by(cyl) %>%
summarise(size = n()) %>%
inner_join(
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(c(mpg, am:carb)), mean),
by='cyl' )
Output is:
# A tibble: 3 x 6
cyl size mpg am gear carb
<dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 4 11 26.7 0.727 4.09 1.55
2 6 7 19.7 0.429 3.86 3.43
3 8 14 15.1 0.143 3.29 3.5
Since summarise drops every column that is neither a grouping column nor a summary, the first summarise(size = n()) leaves only cyl and size, which is why carb is not found afterwards. An alternative is to first add the group size as a new column with mutate (so that all other columns are kept as they are) and then include that column in the summarise_at calculation.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(n = n()) %>%
summarise_at(vars(mpg, am:carb, n), mean)
# A tibble: 3 x 6
# cyl mpg am gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.7 0.727 4.09 1.55 11
#2 6 19.7 0.429 3.86 3.43 7
#3 8 15.1 0.143 3.29 3.5 14
We can use data.table methods
library(data.table)
as.data.table(mtcars)[, n := .N, cyl][, lapply(.SD, mean), cyl,
.SDcols = c("mpg", "am", "gear", "carb", "n")]
# cyl mpg am gear carb n
#1: 6 19.74286 0.4285714 3.857143 3.428571 7
#2: 4 26.66364 0.7272727 4.090909 1.545455 11
#3: 8 15.10000 0.1428571 3.285714 3.500000 14
Or with tidyverse
library(tidyverse)
mtcars %>%
add_count(cyl) %>%
group_by(cyl) %>%
summarise_at(vars(mpg, am:carb, n), mean)
# A tibble: 3 x 6
# cyl mpg am gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.7 0.727 4.09 1.55 11
#2 6 19.7 0.429 3.86 3.43 7
#3 8 15.1 0.143 3.29 3.5 14
Or using base R
nm1 <- c("mpg", "am", "gear", "carb", "cyl")
transform(aggregate(.~ cyl, mtcars[nm1], mean), n = as.vector(table(mtcars$cyl)))
# cyl mpg am gear carb n
#1 4 26.66364 0.7272727 4.090909 1.545455 11
#2 6 19.74286 0.4285714 3.857143 3.428571 7
#3 8 15.10000 0.1428571 3.285714 3.500000 14
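With dplyr 1.0 or later, the same result can be sketched in a single summarise() call, because across() can sit next to ordinary summary expressions. This assumes a dplyr version that provides across(), which postdates summarise_at; the name size_and_means is illustrative.

```r
library(dplyr)

size_and_means <- mtcars %>%
  group_by(cyl) %>%
  summarise(size = n(),                        # group size, stays an integer
            across(c(mpg, am:carb), mean))     # means of the selected columns

size_and_means
```

Unlike the mutate-based versions, size is not passed through mean() here, so it keeps its integer type.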

unquote a list of functions inside R dplyr functions

I was trying to pass a list of functions into dplyr's summarize_at function and got a warning:
library(tidyverse)
library(purrr)
p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~paste0(.x*100, "%"))
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
set_names(nm = p_names)
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), funs(!!!p_funs))
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> please use list() instead
#>
#> # Before:
#> funs(name = f(.)
#>
#> # After:
#> list(name = ~f(.))
#> This warning is displayed once per session.
#> # A tibble: 3 x 4
#> cyl `20%` `50%` `80%`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 22.8 26 30.4
#> 2 6 18.3 19.7 21
#> 3 8 13.9 15.2 16.8
I then changed the funs to list but couldn't find a way to unquote the list of funs.
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), list(~ !!!p_funs))
#> Error in !p_funs: invalid argument type
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), list(~ {{p_funs}}))
#> Error: Column `mpg` must be length 1 (a summary value), not 3
list() doesn't support splicing (!!!); use rlang::list2() or lst() instead:
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), rlang::list2(!!!p_funs))
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), lst(!!!p_funs))
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
Though here the simplest option is to pass the named list directly:
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), p_funs)
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
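For newer code, a sketch of the same idea with across(), assuming dplyr 1.0 or later: a named list of functions is passed as is, and .names = "{.fn}" names the output columns after the list entries. Here the quantile functions are built with a small hypothetical helper, make_quantile_fun, instead of purrr::partial.

```r
library(dplyr)

p <- c(0.2, 0.5, 0.8)

# Hypothetical helper: returns a function computing one fixed quantile
make_quantile_fun <- function(prob) {
  force(prob)  # evaluate the argument now so each closure keeps its own value
  function(x) quantile(x, probs = prob, na.rm = TRUE)
}

p_funs <- lapply(p, make_quantile_fun)
names(p_funs) <- paste0(p * 100, "%")

quantile_tbl <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(mpg, p_funs, .names = "{.fn}"))

quantile_tbl
```

No unquoting is needed at all, since the list is an ordinary argument rather than something spliced into a call.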

change several column names() in data.frame() with str_replace_all()

I read this question and practiced matching patterns, but I still can't figure it out.
I have a panel with the same measure, several times per year. Now, I want to rename them in a logical way. My raw data looks a bit like this,
set.seed(667)
dta <- data.frame(id = 1:6,
R1213 = runif(6),
R1224 = runif(6, 1, 2),
R1255 = runif(6, 2, 3),
R1235 = runif(6, 3, 4))
# install.packages(c("tidyverse"), dependencies = TRUE)
require(tidyverse)
(tbl <- dta %>% as_tibble())
#> # A tibble: 6 x 5
#> id R1213 R1224 R1255 R1235
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.488 1.60 2.07 3.07
#> 2 2 0.692 1.42 2.76 3.19
#> 3 3 0.262 1.34 2.33 3.82
#> 4 4 0.330 1.77 2.61 3.93
#> 5 5 0.582 1.92 2.15 3.86
#> 6 6 0.930 1.88 2.56 3.59
Now, I use str_replace_all() to rename them. Here only one replacement is built with paste0, and everything is fine (if this can be optimized in other ways, please feel free to let me know),
names(tbl) <- tbl %>% names() %>%
str_replace_all('^R1.[125].$', 'A') %>%
str_replace_all('^R1.[3].$', paste0('A.2018.', 1))
tbl
#> # A tibble: 6 x 5
#> id A A A A.2018.1
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.488 1.60 2.07 3.07
#> 2 2 0.692 1.42 2.76 3.19
#> 3 3 0.262 1.34 2.33 3.82
#> 4 4 0.330 1.77 2.61 3.93
#> 5 5 0.582 1.92 2.15 3.86
#> 6 6 0.930 1.88 2.56 3.59
Everything called A is actually from the same year, say 2017, but the suffixes .1, .2, etc. need to be appended. I start over and again use paste0('A.2017.', 1:3), this time with three suffixes,
tbl <- dta %>% as_tibble()
names(tbl) <- tbl %>% names() %>%
str_replace_all('^R1.[125].$', paste0('A.2017.', 1:3)) %>%
str_replace_all('^R1.[7].$', paste0('A.2018.', 1))
tbl
#> Warning message:
#> In stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
#> longer object length is not a multiple of shorter object length
#> > tbl
#> # A tibble: 6 x 5
#> id A.2017.2 A.2017.3 A.2017.1 R1235
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.488 1.60 2.07 3.07
#> 2 2 0.692 1.42 2.76 3.19
#> 3 3 0.262 1.34 2.33 3.82
#> 4 4 0.330 1.77 2.61 3.93
#> 5 5 0.582 1.92 2.15 3.86
#> 6 6 0.930 1.88 2.56 3.59
This does run, but the order is wrong and I am warned that longer object length is not a multiple of shorter object length. Isn't 3 the right length? I am looking for a cleaner and simpler way to do this. I also don't really like names(tbl) <-, so a more elegant alternative would be welcome.
Building on David's suggestion - how about something like the following using dplyr::rename_at?
library(dplyr)
## Get data
set.seed(667)
dta <- data.frame(id = 1:6,
R1213 = runif(6),
R1224 = runif(6, 1, 2),
R1255 = runif(6, 2, 3),
R1235 = runif(6, 3, 4)) %>%
as_tibble()
## Rename
dta <- dta %>%
rename_at(.vars = grep('^R1.[125].$', names(.)),
.funs = ~paste0("A.2017.", 1:length(.)))
dta
#> # A tibble: 6 x 5
#> id A.2017.1 A.2017.2 A.2017.3 R1235
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.196 1.74 2.51 3.49
#> 2 2 0.478 1.85 2.06 3.69
#> 3 3 0.780 1.32 2.21 3.26
#> 4 4 0.705 1.49 2.49 3.33
#> 5 5 0.942 1.59 2.66 3.58
#> 6 6 0.906 1.90 2.87 3.93
Vectorised solution for multiple patterns
For a complete solution that can be used for multiple patterns and replacements, we can make use of purrr::map2_dfc as follows.
library(dplyr)
library(purrr)
## Get data
set.seed(667)
dta <- data.frame(id = 1:6,
R1213 = runif(6),
R1224 = runif(6, 1, 2),
R1255 = runif(6, 2, 3),
R1235 = runif(6, 3, 4)) %>%
as_tibble()
## Define a function to keep a hold out data set, then rename iteratively for each pattern and replacement.
rename_multiple_years <- function(df, patterns,
replacements,
hold_out_var = "id") {
hold_out_df <- df %>%
select_at(.vars = hold_out_var)
rename_df <- map2_dfc(patterns, replacements, function(pattern, replacement) {
df %>%
rename_at(.vars = grep(pattern, names(.)),
.funs = ~paste0(replacement, 1:length(.))) %>%
select_at(.vars = grep(replacement, names(.)))
})
final_df <- bind_cols(hold_out_df, rename_df)
return(final_df)
}
## Call function on specified patterns and replacements
renamed_dta <- dta %>%
rename_multiple_years(patterns = c("^R1.[125].$", "^R1.[3].$"),
replacements = c("A.2017.", "A.2018."))
renamed_dta
#> # A tibble: 6 x 5
#> id A.2017.1 A.2017.2 A.2017.3 A.2018.1
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.196 1.74 2.51 3.49
#> 2 2 0.478 1.85 2.06 3.69
#> 3 3 0.780 1.32 2.21 3.26
#> 4 4 0.705 1.49 2.49 3.33
#> 5 5 0.942 1.59 2.66 3.58
#> 6 6 0.906 1.90 2.87 3.93
Towards tidy data
Now that the variables have been renamed you might find it useful to have your data in a tidy format. The following using tidyr::gather might be useful.
library(tidyr)
library(dplyr)
# Gather all variables into long format, split the names on ".", and drop the A column (or keep it if it is a measurement id)
renamed_dta %>%
gather(key = "measure", value = "value", -id) %>%
separate(measure, c("A", "year", "measure"), "[[.]]") %>%
select(-A)
#> # A tibble: 24 x 4
#> id year measure value
#> <int> <chr> <chr> <dbl>
#> 1 1 2017 1 0.196
#> 2 2 2017 1 0.478
#> 3 3 2017 1 0.780
#> 4 4 2017 1 0.705
#> 5 5 2017 1 0.942
#> 6 6 2017 1 0.906
#> 7 1 2017 2 1.74
#> 8 2 2017 2 1.85
#> 9 3 2017 2 1.32
#> 10 4 2017 2 1.49
#> # ... with 14 more rows
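On the warning in the question: str_replace_all() is vectorised over the strings and the replacements together, so a length-3 replacement is recycled along the five column names (positions 1, 2, 3, 1, 2), not along the three matches. That is why R1213, in position 2, became A.2017.2, and why lengths 5 and 3 trigger the recycling warning. One way to sidestep this, sketched here assuming dplyr 1.0 or later, is to build a named lookup vector of the form new_name = "old_name" and pass it to rename() through all_of(). The names lookup and renamed are illustrative.

```r
library(dplyr)

set.seed(667)
dta <- data.frame(id = 1:6,
                  R1213 = runif(6),
                  R1224 = runif(6, 1, 2),
                  R1255 = runif(6, 2, 3),
                  R1235 = runif(6, 3, 4))

# Find the old names per year, then number them in order of appearance
old_2017 <- grep("^R1.[125].$", names(dta), value = TRUE)
old_2018 <- grep("^R1.[3].$", names(dta), value = TRUE)

# Named vector: names are the new column names, values the old ones
lookup <- c(
  setNames(old_2017, paste0("A.2017.", seq_along(old_2017))),
  setNames(old_2018, paste0("A.2018.", seq_along(old_2018)))
)

renamed <- rename(dta, all_of(lookup))
names(renamed)
```

Since the suffixes are generated from the matches themselves, the numbering always follows the column order and nothing is recycled.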
