Add baseline/grand total with group_by() in dplyr - r

When I've grouped my data by certain attributes, I want to add a "grand total" line that gives a baseline of comparison. Let's group mtcars by cylinders and carburetors, for example:
by_cyl_carb <- mtcars %>%
group_by(cyl, carb) %>%
summarize(median_mpg = median(mpg),
avg_mpg = mean(mpg),
count = n())
...yields these results:
> by_cyl_carb
# A tibble: 9 x 5
# Groups: cyl [?]
cyl carb median_mpg avg_mpg count
<dbl> <dbl> <dbl> <dbl> <int>
1 4 1 27.3 27.6 5
2 4 2 25.2 25.9 6
3 6 1 19.8 19.8 2
4 6 4 20.1 19.8 4
5 6 6 19.7 19.7 1
6 8 2 17.1 17.2 4
7 8 3 16.4 16.3 3
8 8 4 13.8 13.2 6
9 8 8 15 15 1
What is the code I need to make it provide a baseline or grand total that would sum (or mean or median) over all of the data? The desired data would be something like this:
cyl carb median_mpg avg_mpg count
<chr> <chr> <dbl> <dbl> <int>
1 4 1 27.3 27.6 5
2 4 2 25.2 25.9 6
3 6 1 19.8 19.8 2
4 6 4 20.1 19.8 4
5 6 6 19.7 19.7 1
6 8 2 17.1 17.2 4
7 8 3 16.4 16.3 3
8 8 4 13.8 13.2 6
9 8 8 15 15 1
10 ttl ttl 19.2 20.1 32
A twist on this would be to manipulate the output so that the sub-grouped data is rolled up. For example:
11 ttl 1 13.8 13.2 6
12 ttl 2 15 15 1
13 ttl 3 19.3 20.4 32
14 ... etc ...
The real-life example I am using this for is median sale price of homes by geography by year. Hence I want to report out the median sale price for each geography-year I'm interested in, but I want a baseline comparison for each year regardless of geography.
Edit: Solved with two solutions
@camille referenced this link, which solved the problem, and @MKR offered a solution as well. Here is one version of the code that works:
by_cyl_carb <- mtcars %>%
mutate_at(vars(c(cyl,carb)), funs(as.character(.))) %>%
bind_rows(mutate(., cyl = "All cylinders")) %>%
bind_rows(mutate(., carb = "All carburetors")) %>%
group_by(cyl, carb) %>%
summarize(median_mpg = median(mpg),
avg_mpg = mean(mpg),
count = n())
> by_cyl_carb
# A tibble: 19 x 5
# Groups: cyl [?]
cyl carb median_mpg avg_mpg count
<chr> <chr> <dbl> <dbl> <int>
1 4 1 27.3 27.6 5
2 4 2 25.2 25.9 6
3 4 All carburetors 26 26.7 11
4 6 1 19.8 19.8 2
5 6 4 20.1 19.8 4
6 6 6 19.7 19.7 1
7 6 All carburetors 19.7 19.7 7
8 8 2 17.1 17.2 4
9 8 3 16.4 16.3 3
10 8 4 13.8 13.2 6
11 8 8 15 15 1
12 8 All carburetors 15.2 15.1 14
13 All cylinders 1 22.8 25.3 7
14 All cylinders 2 22.1 22.4 10
15 All cylinders 3 16.4 16.3 3
16 All cylinders 4 15.2 15.8 10
17 All cylinders 6 19.7 19.7 1
18 All cylinders 8 15 15 1
19 All cylinders All carburetors 19.2 20.1 32
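For the real-life case described above (a baseline per year regardless of geography), only one of the two grouping columns needs a rolled-up level. A minimal sketch of the same duplication trick, assuming dplyr 1.0 or later and using carb as a stand-in for "year" and cyl as a stand-in for "geography":

```r
library(dplyr)

with_baseline <- mtcars %>%
  mutate(cyl = as.character(cyl)) %>%
  # Stack a relabelled copy of the data so every carb level also
  # belongs to an "All cylinders" group.
  bind_rows(mutate(., cyl = "All cylinders")) %>%
  group_by(carb, cyl) %>%
  summarise(median_mpg = median(mpg),
            avg_mpg = mean(mpg),
            count = n(),
            .groups = "drop")
```

Because every row appears twice before grouping, the "All cylinders" counts add up to the full 32 rows of mtcars. This relies on the magrittr pipe (`%>%`), since the `.` placeholder is used inside the nested mutate() call.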

A solution using dplyr::bind_rows and mutate_at can be achieved as:
library(tidyverse)
mtcars %>%
group_by(cyl, carb) %>%
summarize(median_mpg = median(mpg),
avg_mpg = mean(mpg),
count = n()) %>%
ungroup() %>%
mutate_at(vars(cyl:carb), funs(as.character(.))) %>%
bind_rows(summarise(cyl = "ttl", carb = "ttl", mtcars, median_mpg = median(mpg),
avg_mpg = mean(mpg),
count = n()))
# # A tibble: 10 x 5
# cyl carb median_mpg avg_mpg count
# <chr> <chr> <dbl> <dbl> <int>
# 1 4 1 27.3 27.6 5
# 2 4 2 25.2 25.9 6
# 3 6 1 19.8 19.8 2
# 4 6 4 20.1 19.8 4
# 5 6 6 19.7 19.7 1
# 6 8 2 17.1 17.2 4
# 7 8 3 16.4 16.3 3
# 8 8 4 13.8 13.2 6
# 9 8 8 15.0 15.0 1
#10 ttl ttl 19.2 20.1 32
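mutate_at() and funs() still run but are deprecated or superseded in recent dplyr releases. The following is a sketch of the same idea with across(), assuming dplyr 1.0 or later; summarise_mpg is a hypothetical helper introduced here only to avoid writing the summary twice.

```r
library(dplyr)

# Hypothetical helper: one summary definition shared by the
# per-group rows and the grand-total row.
summarise_mpg <- function(df) {
  summarise(df,
            median_mpg = median(mpg),
            avg_mpg = mean(mpg),
            count = n(),
            .groups = "drop")
}

by_group <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise_mpg() %>%
  # Convert the keys to character so they can share a column with "ttl".
  mutate(across(c(cyl, carb), as.character))

grand_total <- mtcars %>%
  summarise_mpg() %>%
  mutate(cyl = "ttl", carb = "ttl")

# bind_rows() matches columns by name, so the column order of
# grand_total does not matter.
result <- bind_rows(by_group, grand_total)
```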

Related

Output separate dataframes in a list using purrr::map_dfr()

I'm looking to sequentially read in data and then transform it in two disparate scripts, then combine the results into a list of dataframes:
library(tidyverse)
dat_list <- list(as_tibble(mtcars),as_tibble(mtcars),as_tibble(mtcars))
test_func <- function(x) {
dat <- x
gear_avg <- dat %>%
group_by(gear) %>%
summarize(value=mean(mpg))
carb_avg <- dat %>%
group_by(carb) %>%
summarize(value=mean(mpg))
df_list <- list(as_tibble(gear_avg),as_tibble(carb_avg))
return(df_list)
}
test_dat <- map_dfr(dat_list, test_func)
desired_output <-
list(
test_dat %>% filter(!is.na(gear)) %>% select(-carb),
test_dat %>% filter(!is.na(carb)) %>% select(-gear)
)
This is what I would expect to work but it just outputs a single dataframe.
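map_dfr() row-binds everything into one data frame, so build the nested list with map() instead and then use purrr::transpose:
test_dat <- map(dat_list, test_func)
map(transpose(test_dat), bind_rows)
transpose() turns a list of lists inside out: here, a list of three (gear, carb) pairs becomes a pair of three-element lists, and bind_rows() then stacks each of those into one data frame.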
Also, test_func does not return anything. So, in your reprex you should add the following as the last line: return(df_list)
Output
[[1]]
# A tibble: 9 x 2
gear value
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
4 3 16.1
5 4 24.5
6 5 21.4
7 3 16.1
8 4 24.5
9 5 21.4
[[2]]
# A tibble: 18 x 2
carb value
<dbl> <dbl>
1 1 25.3
2 2 22.4
3 3 16.3
4 4 15.8
5 6 19.7
6 8 15
7 1 25.3
8 2 22.4
9 3 16.3
10 4 15.8
11 6 19.7
12 8 15
13 1 25.3
14 2 22.4
15 3 16.3
16 4 15.8
17 6 19.7
18 8 15
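Putting the pieces together, here is a self-contained sketch of the approach. It assumes that the nested list is built with map() rather than map_dfr(); the element names gear and carb are added here for readability and are not in the original code.

```r
library(dplyr)
library(purrr)

dat_list <- list(as_tibble(mtcars), as_tibble(mtcars), as_tibble(mtcars))

test_func <- function(dat) {
  gear_avg <- dat %>% group_by(gear) %>% summarize(value = mean(mpg))
  carb_avg <- dat %>% group_by(carb) %>% summarize(value = mean(mpg))
  # The last expression is the return value: a named pair of tibbles.
  list(gear = gear_avg, carb = carb_avg)
}

# map() keeps one (gear, carb) pair per input instead of row-binding.
nested <- map(dat_list, test_func)

# transpose() regroups the pairs by element; bind_rows() stacks each group.
combined <- map(transpose(nested), bind_rows)
```

combined$gear then has 9 rows (3 gears for each of the 3 inputs) and combined$carb has 18. In recent purrr releases transpose() is, as far as I know, superseded by list_transpose(), so that may be worth checking for new code.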

How do I build a dplyr summarize statement programmatically?

I'm trying to do some dplyr programming and having trouble. I'd like to group_by an arbitrary number of variables (thus, across), and then summarize based on arbitrary length (but all the same length) vectors of:
The column to apply the function to
The function to apply
The name of the new column
So, like in a map or apply statement, I want to execute code that ends up looking like:
data %>%
group_by(group_column) %>%
summarize(new_name_1 = function_1(column_1),
new_name_2 = function_2(column_2))
Here's an example of what I want and my best shot so far. I know I can use the names argument to clean those up if I use across, but I'm not confident that across is the correct way. Finally, I'll be applying this to fairly large dataframes, so I'd rather not calculate the extra columns.
Desired result
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd))) %>%
select(cyl, carb, disp_mean, hp_sd)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
What I get
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 6
#> # Groups: cyl [3]
#> cyl carb disp_mean disp_sd hp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 21.4 77.4 16.1
#> 2 4 2 117. 27.1 87 24.9
#> 3 6 1 242. 23.3 108. 3.54
#> 4 6 4 164. 4.39 116. 7.51
#> 5 6 6 145 NA 175 NA
#> 6 8 2 346. 43.4 162. 14.4
#> 7 8 3 276. 0 180 0
#> 8 8 4 406. 57.8 234 21.7
#> 9 8 8 301 NA 335 NA
With different functions on different columns, an option is to use collap from collapse
library(collapse)
collap(mtcars, ~ cyl + carb, custom = list(fmean = 3, fsd = 4)) # disp is column 3 and hp is column 4 of mtcars
-output
cyl disp hp carb
1 4 91.38 16.133815 1
2 4 116.60 24.859606 2
3 6 241.50 3.535534 1
4 6 163.80 7.505553 4
5 6 145.00 NA 6
6 8 345.50 14.433757 2
7 8 275.80 0.000000 3
8 8 405.50 21.725561 4
9 8 301.00 NA 8
Or the index can be dynamically generated with match
collap(mtcars, ~ cyl + carb, custom = list(fmean =
match('disp', names(mtcars)), fsd = match('hp', names(mtcars))))
With tidyverse, an option is to loop over the column names of interest and the functions in map2 and do a join later
library(dplyr)
library(purrr)
library(stringr)
map2(c("disp", "hp"), c("mean", "sd"), ~
mtcars %>%
group_by(across(c('cyl', 'carb'))) %>%
summarise(across(all_of(.x), match.fun(.y),
.names = str_c("{.col}_", .y)), .groups = 'drop')) %>%
reduce(inner_join)
-output
# A tibble: 9 x 4
cyl carb disp_mean hp_sd
<dbl> <dbl> <dbl> <dbl>
1 4 1 91.4 16.1
2 4 2 117. 24.9
3 6 1 242. 3.54
4 6 4 164. 7.51
5 6 6 145 NA
6 8 2 346. 14.4
7 8 3 276. 0
8 8 4 406. 21.7
9 8 8 301 NA
I have a package on GitHub, {dplyover},
which can help with this kind of task. In this case we could use over2 to
loop over two character vectors simultaneously. The first vector contains the
variable names as strings, which is why we have to wrap .x in sym() when
applying a function to it. The second vector contains the function names,
which we use as .y in a do.call. over2 creates the desired names automatically.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(over2(c("disp", "hp"),
c("mean", "sd"),
~ do.call(.y, list(sym(.x)))
))
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
An alternative way building on the same logic is to use purrr::map2. However,
here we have to put some effort into creating vectors with the desired names.
library(purrr)
# setup vectors and names
myfuns <- c("mean", "sd")
myvars <- c("disp", "hp") %>%
set_names(., paste(., myfuns, sep = "_"))
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(map2(myvars,
myfuns,
~ do.call(.y, list(sym(.x)))
) %>% bind_cols()
)
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
Created on 2021-08-20 by the reprex package (v2.0.1)
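Another way to get there, not used in the answers above, is to build the summarise() arguments as a named list of unevaluated calls and splice them in with `!!!`. This is only a sketch, assuming dplyr 1.0 or later with rlang installed; the vector names cols, fun_names and new_names are made up for the example.

```r
library(dplyr)
library(purrr)

cols      <- c("disp", "hp")
fun_names <- c("mean", "sd")
new_names <- paste(cols, fun_names, sep = "_")

# Build one call per (function, column) pair: mean(disp), sd(hp).
summary_exprs <- map2(fun_names, cols,
                      ~ rlang::call2(.x, rlang::sym(.y))) %>%
  set_names(new_names)

result <- mtcars %>%
  group_by(across(all_of(c("cyl", "carb")))) %>%
  # `!!!` splices the named list in as if it had been typed by hand.
  summarise(!!!summary_exprs, .groups = "drop")
```

Only the requested columns are computed, so nothing has to be dropped afterwards with select().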

Subset by multiple single-variable conditions in one step?

I hope I can explain what I'm trying to do sufficiently. I'm working in R and for a dataset I'm trying to keep only observations where for one variable, another variable satisfies two conditions.
Specifically, I want to keep only rows where for a particular "cyl", there is at least one mpg value >20, and at least one <20. Here is some example data from mtcars similar to what I'm working with.
mpg cyl
1 21.0 6
2 21.0 6
3 22.8 4
4 21.4 6
5 18.7 8
6 18.1 6
7 14.3 8
8 24.4 4
9 22.8 4
10 19.2 6
11 17.8 6
12 16.4 8
13 17.3 8
14 15.2 8
15 10.4 8
16 10.4 8
17 14.7 8
18 32.4 4
19 30.4 4
20 33.9 4
Ideally, my output for the above example would be what's below.
mpg cyl
1 21.0 6
2 21.0 6
4 21.4 6
6 18.1 6
10 19.2 6
11 17.8 6
Thanks in advance!
Assuming your dataframe input is DF, try this:
library(dplyr)
DF %>%
group_by(cyl) %>%
filter(any(mpg > 20) & any(mpg < 20)) # "at least one" on each side of 20
# A tibble: 7 x 2
# Groups: cyl [1]
# mpg cyl
# <dbl> <dbl>
# 1 21 6
# 2 21 6
# 3 21.4 6
# 4 18.1 6
# 5 19.2 6
# 6 17.8 6
# 7 19.7 6
data
DF <- mtcars[,1:2]
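To check the "at least one above and at least one below" rule against the 20-row sample from the question, here is a small sketch. It assumes the sample is simply the first 20 rows of mtcars, which matches the values listed.

```r
library(dplyr)

# First 20 rows of mtcars, mpg and cyl only (same values as the sample).
DF <- head(mtcars[, c("mpg", "cyl")], 20)

kept <- DF %>%
  group_by(cyl) %>%
  # Keep a cyl group only if it has values on both sides of 20.
  filter(any(mpg > 20), any(mpg < 20)) %>%
  ungroup()
```

Only the six cyl == 6 rows survive, because every 4-cylinder car in the sample is above 20 mpg and every 8-cylinder car is below it.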

Select top rows in R using add_tally and top_n functions

I would like to select the top n rows in a data frame for which I
calculated a column n that represents the sum of a variable. For example,
using the mtcars data, I would like to filter to keep only the two cyl
with the greatest sum of mpg. In the following example, I was expecting
to select all rows where cyl == 4 and cyl == 8. It must be simple, but
I can not figure out my mistake.
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(sum(mpg))
#> # A tibble: 3 x 2
#> cyl `sum(mpg)`
#> <dbl> <dbl>
#> 1 4 293.
#> 2 6 138.
#> 3 8 211.
mtcars %>%
group_by(cyl) %>% # Calculate the sum of mpg for each cyl
add_tally(mpg, sort = TRUE) %>%
ungroup() %>%
top_n(2, n)
#> # A tibble: 11 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb n
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 293.
#> 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 293.
#> 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 293.
#> 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 293.
#> 5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 293.
#> 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 293.
#> 7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 293.
#> 8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 293.
#> 9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 293.
#> 10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 293.
#> 11 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 293.
Created on 2019-07-26 by the reprex package (v0.3.0)
It seems that top_n returns the top n rows after ordering the dataframe and returns more than n rows if there are ties. It does not return rows with distinct top n values.
From documentation -
Usage
top_n(x, n, wt)
Arguments
x: a tbl() to filter
n: number of rows to return. If x is grouped,
this is the number of rows per group. Will include more than n rows if
there are ties. If n is positive, selects the top n rows. If negative,
selects the bottom n rows.
You need, as suggested by @tmfmnk -
mtcars %>%
group_by(cyl) %>%
add_tally(mpg, sort = TRUE) %>%
ungroup() %>%
filter(dense_rank(desc(n)) < 3)
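An alternative that avoids ranking a repeated column is to pick the top groups first and then keep their rows with a join. This is a sketch assuming dplyr 1.0 or later (for slice_max()); total_mpg and top_cyl are names chosen for the example.

```r
library(dplyr)

# One row per cyl, then keep the two largest totals.
top_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(total_mpg = sum(mpg)) %>%
  slice_max(total_mpg, n = 2)

# semi_join() keeps the rows of mtcars whose cyl appears in top_cyl.
result <- semi_join(mtcars, top_cyl, by = "cyl")
```

Here ties are resolved at the group level, so the result contains all 25 rows for cyl 4 and cyl 8.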

custom grouped dplyr function (sample_n)

I am trying to apply a sampling function in a grouped fashion to a data frame, where it should sample n samples from each group, or all group members if the group size is smaller than n.
Using dplyr, I first tried
library(dplyr)
mtcars %>% group_by(cyl) %>% sample_n(2)
This works when n is smaller than all the group sizes but does not take the full group when I choose n larger than the group size (note that there are 7 cars in one of the cyl groups):
mtcars %>% group_by(cyl) %>% sample_n(8)
Error: `size` must be less or equal than 7 (size of data),
set `replace` = TRUE to use sampling with replacement
I tried to solve this by creating an adapted group_n function like so:
sample_n_or_all <- function(tbl, n) {
if (nrow(tbl) < n)return(tbl)
sample_n(tbl, n)
}
but using my custom function (mtcars %>% group_by(cyl) %>% sample_n_or_all(8)) generates the same error.
Any suggestions how I can adapt my function so I can apply it to each of the groups? Or another solution to the problem?
We could check the number of rows in the group and pass the value to sample_n accordingly.
library(dplyr)
n <- 8
temp <- mtcars %>% group_by(cyl) %>% sample_n(if(n() < n) n() else n)
temp
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 4 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 5 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 7 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 8 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 9 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#10 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
# … with 13 more rows
We can check number of rows in each group after that.
table(temp$cyl)
#4 6 8
#8 7 8
table(mtcars$cyl)
# 4 6 8
#11 7 14
We can do this without using a logical condition with pmin
library(dplyr)
n <- 8 # requested sample size per group
tmp <- mtcars %>%
group_by(cyl) %>%
sample_n(pmin(n(), n))
# A tibble: 23 x 11
# Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 4 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 5 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
# 6 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
# 7 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 8 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 9 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#10 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# … with 13 more rows
-checking
table(tmp$cyl)
# 4 6 8
# 8 7 8
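The custom sample_n_or_all() fails because nrow(tbl) on a grouped table counts all 32 rows, not the rows of each group. With dplyr 1.0 or later, slice_sample() may be the simplest route: according to its documentation, n is silently truncated to the group size when sampling without replacement. A minimal sketch (the seed is only there to make the draw repeatable):

```r
library(dplyr)

set.seed(1)

sampled <- mtcars %>%
  group_by(cyl) %>%
  # Asks for 8 rows per group; the 7-row cyl == 6 group is returned whole.
  slice_sample(n = 8) %>%
  ungroup()

table(sampled$cyl)
```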
