R: What is the difference between dplyr::group_keys() and summarise()?

Suppose I group a data.frame() using dplyr::group_by(). Is there any scenario where passing this to group_keys() or summarise() would produce different results? I was surprised to see a group_keys() function.
library(dplyr)
df <- data.frame(x = rep(1:2, 10), y = rep(1:10,2))
df_grouped <- df %>% group_by(x,y)
# group_keys
df_grouped %>% group_keys()
# summarise
df_grouped %>% summarise()

summarise() without arguments will strip one level of grouping, returning
a grouped data frame if there are multiple grouping columns:
library(dplyr)
mtcars %>%
group_by(am, vs) %>%
summarise()
#> `summarise()` has grouped output by 'am'. You can override using the `.groups`
#> argument.
#> # A tibble: 4 x 2
#> # Groups: am [2]
#> am vs
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 1
#> 3 1 0
#> 4 1 1
group_keys() does not return a grouped data frame, and is more idiomatic
for the task:
mtcars %>%
group_by(am, vs) %>%
group_keys()
#> # A tibble: 4 x 2
#> am vs
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 1
#> 3 1 0
#> 4 1 1
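The difference can also be checked programmatically. The following is a minimal sketch, assuming dplyr 1.0.0 or later (for the `.groups` argument); the object names `grouped`, `keys` and `summ` are made up for illustration.

```r
library(dplyr)

grouped <- mtcars %>% group_by(am, vs)

keys <- grouped %>% group_keys()
# "drop_last" is spelled out here only to silence the regrouping message
summ <- grouped %>% summarise(.groups = "drop_last")

# group_keys() gives back a plain tibble, summarise() a tibble still grouped by am
is_grouped_df(keys)
group_vars(summ)

# The key columns themselves should hold the same values in both results
all.equal(keys$am, summ$am)
all.equal(keys$vs, summ$vs)
```

So the rows agree, but any later grouped verb (mutate(), filter(), slice()) would behave differently on the two results unless the summarise() output is ungrouped first.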

Related

Calculate proportions according to different groups [duplicate]

Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative frequency of number of gears by am (automatic/manual) in one go with dplyr?
library(dplyr)
data(mtcars)
mtcars <- as_tibble(mtcars)  # tbl_df() is deprecated in current dplyr
# count frequency
mtcars %>%
group_by(am, gear) %>%
summarise(n = n())
# am gear n
# 0 3 15
# 0 4 4
# 1 4 8
# 1 5 5
What I would like to achieve:
am gear n rel.freq
0 3 15 0.7894737
0 4 4 0.2105263
1 4 8 0.6153846
1 5 5 0.3846154
Try this:
mtcars %>%
group_by(am, gear) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
From the dplyr vignette:
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.
Thus, after the summarise, the last grouping variable specified in group_by, 'gear', is peeled off. In the mutate step, the data is grouped by the remaining grouping variable(s), here 'am'. You may check grouping in each step with groups.
The outcome of the peeling of course depends on the order of the grouping variables in the group_by call. You may wish to do a subsequent group_by(am) to make your code more explicit.
For rounding and prettification, please refer to the nice answer by @Tyler Rinker.
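To see the peeling for yourself, you can inspect the grouping between the steps. This is a small sketch, assuming dplyr 1.0.0 or later; `counts` and `freq` are illustrative names, and group_vars() is used as a compact alternative to groups().

```r
library(dplyr)

counts <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n(), .groups = "drop_last")

# Grouping that is left after summarise() has peeled off 'gear'
group_vars(counts)

freq <- counts %>%
  group_by(am) %>%             # explicit, so the result does not rely on peeling
  mutate(freq = n / sum(n)) %>%
  ungroup()                    # hand back a plain tibble

freq
```

With the explicit group_by(am), the relative frequencies stay per-am even if someone later reorders the variables in the first group_by() call.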
You can use the count() function, which, however, behaves differently depending on the version of dplyr:
dplyr 0.7.1: returns an ungrouped table: you need to group again by am
dplyr < 0.7.1: returns a grouped table, so no need to group again, although you might want to ungroup() for later manipulations
dplyr 0.7.1
mtcars %>%
count(am, gear) %>%
group_by(am) %>%
mutate(freq = n / sum(n))
dplyr < 0.7.1
mtcars %>%
count(am, gear) %>%
mutate(freq = n / sum(n))
This results in a grouped table; if you want to use it for further analysis, it might be useful to remove the grouping with ungroup().
@Henrik's answer is better for usability, as this approach makes the column character rather than numeric, but it matches what you asked for:
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
EDIT Because Spacedman asked for it :-)
as.rel_freq <- function(x, rel_freq_col = "rel.freq", ...) {
class(x) <- c("rel_freq", class(x))
attributes(x)[["rel_freq_col"]] <- rel_freq_col
x
}
print.rel_freq <- function(x, ...) {
freq_col <- attributes(x)[["rel_freq_col"]]
x[[freq_col]] <- paste0(round(100 * x[[freq_col]], 0), "%")
class(x) <- class(x)[!class(x)%in% "rel_freq"]
print(x)
}
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
as.rel_freq()
## Source: local data frame [4 x 4]
## Groups: am
##
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
Despite the many answers, here is one more approach, which uses prop.table() in combination with dplyr or data.table.
library(dplyr)
mtcars %>%
group_by(am, gear) %>%
tally() %>%
mutate(freq = prop.table(n))
#> # A tibble: 4 × 4
#> # Groups: am [2]
#> am gear n freq
#> <dbl> <dbl> <int> <dbl>
#> 1 0 3 15 0.789
#> 2 0 4 4 0.211
#> 3 1 4 8 0.615
#> 4 1 5 5 0.385
library(data.table)
cars_dt <- as.data.table(mtcars)
cars_dt[, .(n = .N), keyby = .(am, gear)][, freq := prop.table(n), by = "am"][]
#> am gear n freq
#> 1: 0 3 15 0.7894737
#> 2: 0 4 4 0.2105263
#> 3: 1 4 8 0.6153846
#> 4: 1 5 5 0.3846154
Created on 2022-10-22 with reprex v2.0.2
I wrote a small function for this recurring task:
count_pct <- function(df) {
return(
df %>%
tally %>%
mutate(n_pct = 100*n/sum(n))
)
}
I can then use it like:
mtcars %>%
group_by(cyl) %>%
count_pct
It returns:
# A tibble: 3 x 3
cyl n n_pct
<dbl> <int> <dbl>
1 4 11 34.4
2 6 7 21.9
3 8 14 43.8
For the sake of completeness of this popular question: since dplyr 1.0.0, the .groups parameter controls the grouping structure of the summarise() output after group_by() (see the summarise() help page).
With .groups = "drop_last", summarise() drops the last level of grouping. This was the only behaviour available before version 1.0.0.
library(dplyr)
library(scales)
original <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
original
#> # A tibble: 4 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 1 4 8 61.5%
#> 4 1 5 5 38.5%
new_drop_last <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop_last") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(original, new_drop_last)
#> [1] TRUE
With .groups = "drop", all levels of grouping are dropped. The result is an independent tibble with no trace of the previous group_by(), so sum(n) below is taken over all rows:
# .groups = "drop"
new_drop <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_drop
#> # A tibble: 4 x 4
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 46.9%
#> 2 0 4 4 12.5%
#> 3 1 4 8 25.0%
#> 4 1 5 5 15.6%
With .groups = "keep", the output has the same grouping structure as .data (here, mtcars grouped by am and gear): summarise() does not peel off any variable used in the group_by().
Finally, with .groups = "rowwise", each row is its own group. It is equivalent to "keep" in this situation, because each (am, gear) group already occupies a single row.
# .groups = "keep"
new_keep <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "keep") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_keep
#> # A tibble: 4 x 4
#> # Groups: am, gear [4]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 100.0%
#> 2 0 4 4 100.0%
#> 3 1 4 8 100.0%
#> 4 1 5 5 100.0%
# .groups = "rowwise"
new_rowwise <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "rowwise") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(new_keep, new_rowwise)
#> [1] TRUE
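The four options can also be compared by asking each result for its grouping variables. A minimal sketch, assuming dplyr 1.0.0 or later; `groups_after()` is a hypothetical helper written only for this illustration.

```r
library(dplyr)

grouped <- mtcars %>% group_by(am, gear)

# Hypothetical helper: summarise with the given .groups value
# and report which grouping variables survive
groups_after <- function(option) {
  out <- summarise(grouped, n = n(), .groups = option)
  group_vars(out)
}

groups_after("drop_last")  # only 'am' is left
groups_after("drop")       # no grouping variables at all
groups_after("keep")       # both 'am' and 'gear' remain

# "rowwise" yields a row-wise tibble rather than a grouped_df
rowwise_out <- summarise(grouped, n = n(), .groups = "rowwise")
inherits(rowwise_out, "rowwise_df")
```

This makes it easier to see why the percentages differ: the denominator sum(n) is computed within whatever groups remain after summarise().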
Another point of interest: after applying group_by() and summarise(), a subtotal line can sometimes help readability.
# create a subtotal line to help readability
subtotal_am <- mtcars %>%
group_by (am) %>%
summarise (n=n()) %>%
mutate(gear = NA, rel.freq = 1)
#> `summarise()` ungrouping output (override with `.groups` argument)
mtcars %>% group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
bind_rows(subtotal_am) %>%
arrange(am, gear) %>%
mutate(rel.freq = scales::percent(rel.freq, accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 0 NA 19 100.0%
#> 4 1 4 8 61.5%
#> 5 1 5 5 38.5%
#> 6 1 NA 13 100.0%
Created on 2020-11-09 by the reprex package (v0.3.0)
Hope you find this answer useful.
Here is a general function implementing Henrik's solution on dplyr 0.7.1.
freq_table <- function(x,
group_var,
prop_var) {
group_var <- enquo(group_var)
prop_var <- enquo(prop_var)
x %>%
group_by(!!group_var, !!prop_var) %>%
summarise(n = n()) %>%
mutate(freq = n /sum(n)) %>%
ungroup
}
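On newer versions the same idea can be written more compactly. The sketch below is a variant, not the function above: it assumes dplyr 1.0.0 or later (where the `{{ }}` embrace operator is available), uses count() instead of group_by() plus summarise(), and the name `freq_table2()` is made up for illustration.

```r
library(dplyr)

# Variant of the function above using the {{ }} embrace operator
freq_table2 <- function(x, group_var, prop_var) {
  x %>%
    count({{ group_var }}, {{ prop_var }}) %>%   # ungrouped counts
    group_by({{ group_var }}) %>%                # explicit denominator group
    mutate(freq = n / sum(n)) %>%
    ungroup()
}

# Usage: relative frequency of gear within each level of am
freq_table2(mtcars, am, gear)
```

Grouping explicitly by group_var avoids relying on which level summarise() peels off.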
Also, try add_count() (to get around pesky group_by .groups).
mtcars %>%
count(am, gear) %>%
add_count(am, wt = n, name = "nn") %>%
mutate(proportion = n / nn)
Here is a base R answer using aggregate() and ave():
df1 <- with(mtcars, aggregate(list(n = mpg), list(am = am, gear = gear), length))
df1$prop <- with(df1, n/ave(n, am, FUN = sum))
#Also with prop.table
#df1$prop <- with(df1, ave(n, am, FUN = prop.table))
df1
# am gear n prop
#1 0 3 15 0.7894737
#2 0 4 4 0.2105263
#3 1 4 8 0.6153846
#4 1 5 5 0.3846154
We can also use prop.table but the output displays differently.
prop.table(table(mtcars$am, mtcars$gear), 1)
# 3 4 5
# 0 0.7894737 0.2105263 0.0000000
# 1 0.0000000 0.6153846 0.3846154
This answer is based upon Matifou's answer.
First I modified it to ensure that I don't get the freq column returned as a scientific notation column by using the scipen option.
Then I multiply the result by 100 to get a percentage rather than a decimal, which makes the freq column easier to read.
getOption("scipen")
options("scipen"=10)
mtcars %>%
count(am, gear) %>%
mutate(freq = (n / sum(n)) * 100)

Does dplyr `rowwise()` group in the same way `group_by()` groups?

library(tidyverse)
mtcars %>% group_by(cyl) %>% is_grouped_df()
#> [1] TRUE
I can group a data frame by a variable and confirm if it is grouped using the is_grouped_df() function (shown above).
I can run the same analysis on the dplyr rowwise() function and it appears that rowwise() does not group data sets by row. I have a question and a reading of the help page (?rowwise) does not clearly answer the question for me.
Group input by rows
Description: rowwise() allows you to compute on a data frame a row-at-a-time. This is most useful when a vectorised function doesn't exist.
A row-wise tibble maintains its row-wise status until explicitly removed by group_by(), ungroup(), or as_tibble().
My question: After calling the rowwise() function do I need to call the ungroup() function later in my pipe to ungroup my data set? Or is this done by default? The following pipe suggests that a pipe containing rowwise() is not grouped:
mtcars %>% rowwise() %>% is_grouped_df()
#> [1] FALSE
This sentence confuses me: "A row-wise tibble maintains its row-wise status until explicitly removed by... ungroup()...". Why would I need to ungroup() a tibble that is already ungrouped?
Interesting observation. This might be a bug in is_grouped_df(), unless it's somehow a feature that I don't know about. But I DO think it's important to ungroup, considering the testing done below (see comments):
library(tidyverse)
mtcars %>% select(1:3) %>% rowwise() %>% head(2)
#> Source: local data frame [2 x 3]
#> Groups: <by row>
##### ^ THIS DOES HAVE A GROUP ####
#>
#> # A tibble: 2 x 3
#> mpg cyl disp
#> <dbl> <dbl> <dbl>
#> 1 21 6 160
#> 2 21 6 160
mtcars %>% select(1:3) %>% rowwise() %>% mutate(n()) %>% head(2)
#> Source: local data frame [2 x 4]
#> Groups: <by row>
#>
#> # A tibble: 2 x 4
#> mpg cyl disp `n()`
#> <dbl> <dbl> <dbl> <int>
#> 1 21 6 160 1
#> 2 21 6 160 1
mtcars %>% select(1:3) %>% mutate(n()) %>% head(2)
#> mpg cyl disp n()
#> 1 21 6 160 32
#> 2 21 6 160 32
##### ^ THIS IS EXPECTED AND THE n BEHAVES DIFFERENTLY WHEN THE ROWWISE() IS APPLIED ####
##### IF WE WANT TO RESTORE "NORMAL" BEHAVIOR, IT'S PROBABLY WISE TO UNGROUP IN ORDER TO LOSE THE ROWWISE OPERATIONS #####
mtcars %>% select(1:3) %>% rowwise() %>% ungroup %>% mutate(n()) %>% head(2)
#> # A tibble: 2 x 4
#> mpg cyl disp `n()`
#> <dbl> <dbl> <dbl> <int>
#> 1 21 6 160 32
#> 2 21 6 160 32
## ^ NORMAL AFTER UNGROUP
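One plausible explanation for the FALSE result, offered as an assumption rather than a statement from the help page: in recent dplyr versions, is_grouped_df() appears to test only for the "grouped_df" class, while rowwise() returns an object of class "rowwise_df". The following sketch checks the classes directly; `rw` is an illustrative name.

```r
library(dplyr)

rw <- rowwise(mtcars)

# A row-wise tibble carries its own class, separate from grouped_df
class(rw)
inherits(rw, "rowwise_df")
is_grouped_df(rw)

# ungroup() removes the row-wise status again
inherits(ungroup(rw), "rowwise_df")
```

If that reading is right, is_grouped_df() is not the tool for detecting a row-wise tibble; checking inherits(x, "rowwise_df") is, and ungroup() is still needed to get back to whole-table behaviour.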

Inconsistent relative frequency outputs with summarise and mutate

I want to compute the relative frequency of a group of values with respect to the remaining groups. For example, compute the relative frequency of gear==3 in am==0. I computed it in the following way.
library(dplyr)
mtcars %>%
select(am, gear) %>%
group_by(am, gear) %>%
summarise(N = n()) %>%
group_by(am) %>%
mutate(freq = N / sum(N))
# Source: local data frame [4 x 4]
# Groups: am [2]
#
# # A tibble: 4 x 4
# am gear N freq
# <dbl> <dbl> <int> <dbl>
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
The above output is as expected. However, I would like the freq values as a new column in the original dataset, with the same values. I tried the approach below for calculating the count N and then the relative frequency freq.
mtcars %>%
select(am, gear) %>%
group_by(am, gear) %>%
mutate(N = n()) %>%
group_by(am) %>%
mutate(freq = N / sum(N))
# Source: local data frame [32 x 4]
# Groups: am [2]
#
# # A tibble: 32 x 4
# am gear N freq
# <dbl> <dbl> <int> <dbl>
# 1 1 4 8 0.08988764
# 2 1 4 8 0.08988764
# 3 1 4 8 0.08988764
# 4 0 3 15 0.06224066
# 5 0 3 15 0.06224066
# 6 0 3 15 0.06224066
# 7 0 3 15 0.06224066
# 8 0 4 4 0.01659751
# 9 0 4 4 0.01659751
# 10 0 4 4 0.01659751
# # ... with 22 more rows
Now, it gives a different output. What might be the reason?
A better option would be a left_join() with the summarised output (the result of the first pipeline in the question, stored as 'res'):
mtcars %>%
select(am, gear) %>%
left_join(., res)
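In the mutate() version, sum(N) is larger than the group size because N is repeated on every row of the group. For am == 1, for example, sum(N) is 8 * 8 + 5 * 5 = 89 rather than 13, which gives 8 / 89 = 0.0899 instead of 8 / 13 = 0.615.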
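A self-contained version of the join might look like the sketch below. It assumes dplyr 0.8.0 or later for the `name` argument of count(); `res` and `joined` are illustrative names, and the `by` argument is spelled out only to make the join keys explicit.

```r
library(dplyr)

# Summarised relative frequencies, one row per (am, gear) combination
res <- mtcars %>%
  count(am, gear, name = "N") %>%
  group_by(am) %>%
  mutate(freq = N / sum(N)) %>%
  ungroup()

# Attach N and freq to every row of the original data
joined <- mtcars %>%
  select(am, gear) %>%
  left_join(res, by = c("am", "gear"))

head(joined)
```

Each of the 32 original rows then carries the frequency of its own (am, gear) combination, computed once on the summarised table.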
You need to recalculate the N size for the am group as well:
mtcars %>%
select(am, gear) %>%
group_by(am, gear) %>%
mutate(N = n()) %>%
group_by(am) %>%
mutate(freq = N / n())
This gives the expected results.
