Related
Suppose I group a data.frame() using dplyr::group_by(). Is there any scenario where passing this to group_keys() or summarise() would produce different results? Was surprised to see a group_keys function.
library(dplyr)
df <- data.frame(x = rep(1:2, 10), y = rep(1:10,2))
df_grouped <- df %>% group_by(x,y)
# group_keys
df_grouped %>% group_keys()
# summarise
df_grouped %>% summarise()
summarise() without arguments will strip one level of grouping, returning
a grouped data frame if there are multiple grouping columns:
library(dplyr)
mtcars %>%
group_by(am, vs) %>%
summarise()
#> `summarise()` has grouped output by 'am'. You can override using the `.groups`
#> argument.
#> # A tibble: 4 x 2
#> # Groups: am [2]
#> am vs
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 1
#> 3 1 0
#> 4 1 1
group_keys() does not return a grouped data frame, and is more idiomatic
for the task:
mtcars %>%
group_by(am, vs) %>%
group_keys()
#> # A tibble: 4 x 2
#> am vs
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 1
#> 3 1 0
#> 4 1 1
Can someone help me please?
I Have Column A, Column B and Column C, I want to get the top value of column C, grouped by A, but also have the information of B for those top values
Max <-X %>% select(A,B,C) %>% group_by(A) %>% summarise(top = max(C))
But this code only show me the top values of each unique A data, so I dont know whats the B value assigned to that. (Important, making group_by(A,B) doesnt work, because it doesnt give the top values for each unique A value, it returns the same as the data base X)
This could be achieved via dplyr::top_n or ? dplyr::slice_max like so:
library(dplyr)
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% top_n(1, hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% slice_max(hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
So, in your case it should be:
Max <-X %>% select(A,B,C) %>% group_by(A) %>% slice_max(C)
Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative frequency of number of gears by am (automatic/manual) in one go with dplyr?
library(dplyr)
data(mtcars)
mtcars <- tbl_df(mtcars)
# count frequency
mtcars %>%
group_by(am, gear) %>%
summarise(n = n())
# am gear n
# 0 3 15
# 0 4 4
# 1 4 8
# 1 5 5
What I would like to achieve:
am gear n rel.freq
0 3 15 0.7894737
0 4 4 0.2105263
1 4 8 0.6153846
1 5 5 0.3846154
Try this:
mtcars %>%
group_by(am, gear) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
From the dplyr vignette:
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.
Thus, after the summarise, the last grouping variable specified in group_by, 'gear', is peeled off. In the mutate step, the data is grouped by the remaining grouping variable(s), here 'am'. You may check grouping in each step with groups.
The outcome of the peeling is of course dependent of the order of the grouping variables in the group_by call. You may wish to do a subsequent group_by(am), to make your code more explicit.
For rounding and prettification, please refer to the nice answer by #Tyler Rinker.
You can use count() function, which has however a different behaviour depending on the version of dplyr:
dplyr 0.7.1: returns an ungrouped table: you need to group again by am
dplyr < 0.7.1: returns a grouped table, so no need to group again, although you might want to ungroup() for later manipulations
dplyr 0.7.1
mtcars %>%
count(am, gear) %>%
group_by(am) %>%
mutate(freq = n / sum(n))
dplyr < 0.7.1
mtcars %>%
count(am, gear) %>%
mutate(freq = n / sum(n))
This results into a grouped table, if you want to use it for further analysis, it might be useful to remove the grouped attribute with ungroup().
#Henrik's is better for usability as this will make the column character and no longer numeric but matches what you asked for...
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
EDIT Because Spacedman asked for it :-)
as.rel_freq <- function(x, rel_freq_col = "rel.freq", ...) {
class(x) <- c("rel_freq", class(x))
attributes(x)[["rel_freq_col"]] <- rel_freq_col
x
}
print.rel_freq <- function(x, ...) {
freq_col <- attributes(x)[["rel_freq_col"]]
x[[freq_col]] <- paste0(round(100 * x[[freq_col]], 0), "%")
class(x) <- class(x)[!class(x)%in% "rel_freq"]
print(x)
}
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
as.rel_freq()
## Source: local data frame [4 x 4]
## Groups: am
##
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
Despite the many answers, one more approach which uses prop.table in combination with dplyr or data.table.
library(dplyr)
mtcars %>%
group_by(am, gear) %>%
tally() %>%
mutate(freq = prop.table(n))
#> # A tibble: 4 × 4
#> # Groups: am [2]
#> am gear n freq
#> <dbl> <dbl> <int> <dbl>
#> 1 0 3 15 0.789
#> 2 0 4 4 0.211
#> 3 1 4 8 0.615
#> 4 1 5 5 0.385
library(data.table)
cars_dt <- as.data.table(mtcars)
cars_dt[, .(n = .N), keyby = .(am, gear)][, freq := prop.table(n), by = "am"][]
#> am gear n freq
#> 1: 0 3 15 0.7894737
#> 2: 0 4 4 0.2105263
#> 3: 1 4 8 0.6153846
#> 4: 1 5 5 0.3846154
Created on 2022-10-22 with reprex v2.0.2
I wrote a small function for this repeating task:
count_pct <- function(df) {
return(
df %>%
tally %>%
mutate(n_pct = 100*n/sum(n))
)
}
I can then use it like:
mtcars %>%
group_by(cyl) %>%
count_pct
It returns:
# A tibble: 3 x 3
cyl n n_pct
<dbl> <int> <dbl>
1 4 11 34.4
2 6 7 21.9
3 8 14 43.8
For the sake of completeness of this popular question, since version 1.0.0 of dplyr, parameter .groups controls the grouping structure of the summarise function after group_by summarise help.
With .groups = "drop_last", summarise drops the last level of grouping. This was the only result obtained before version 1.0.0.
library(dplyr)
library(scales)
original <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
original
#> # A tibble: 4 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 1 4 8 61.5%
#> 4 1 5 5 38.5%
new_drop_last <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop_last") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(original, new_drop_last)
#> [1] TRUE
With .groups = "drop", all levels of grouping are dropped. The result is turned into an independent tibble with no trace of the previous group_by
# .groups = "drop"
new_drop <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_drop
#> # A tibble: 4 x 4
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 46.9%
#> 2 0 4 4 12.5%
#> 3 1 4 8 25.0%
#> 4 1 5 5 15.6%
If .groups = "keep", same grouping structure as .data (mtcars, in this case). summarise does not peel off any variable used in the group_by.
Finally, with .groups = "rowwise", each row is it's own group. It is equivalent to "keep" in this situation
# .groups = "keep"
new_keep <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "keep") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_keep
#> # A tibble: 4 x 4
#> # Groups: am, gear [4]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 100.0%
#> 2 0 4 4 100.0%
#> 3 1 4 8 100.0%
#> 4 1 5 5 100.0%
# .groups = "rowwise"
new_rowwise <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "rowwise") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(new_keep, new_rowwise)
#> [1] TRUE
Another point that can be of interest is that sometimes, after applying group_by and summarise, a summary line can help.
# create a subtotal line to help readability
subtotal_am <- mtcars %>%
group_by (am) %>%
summarise (n=n()) %>%
mutate(gear = NA, rel.freq = 1)
#> `summarise()` ungrouping output (override with `.groups` argument)
mtcars %>% group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
bind_rows(subtotal_am) %>%
arrange(am, gear) %>%
mutate(rel.freq = scales::percent(rel.freq, accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 0 NA 19 100.0%
#> 4 1 4 8 61.5%
#> 5 1 5 5 38.5%
#> 6 1 NA 13 100.0%
Created on 2020-11-09 by the reprex package (v0.3.0)
Hope you find this answer useful.
Here is a general function implementing Henrik's solution on dplyr 0.7.1.
freq_table <- function(x,
group_var,
prop_var) {
group_var <- enquo(group_var)
prop_var <- enquo(prop_var)
x %>%
group_by(!!group_var, !!prop_var) %>%
summarise(n = n()) %>%
mutate(freq = n /sum(n)) %>%
ungroup
}
Also, try add_count() (to get around pesky group_by .groups).
mtcars %>%
count(am, gear) %>%
add_count(am, wt = n, name = "nn") %>%
mutate(proportion = n / nn)
Here is a base R answer using aggregate and ave :
df1 <- with(mtcars, aggregate(list(n = mpg), list(am = am, gear = gear), length))
df1$prop <- with(df1, n/ave(n, am, FUN = sum))
#Also with prop.table
#df1$prop <- with(df1, ave(n, am, FUN = prop.table))
df1
# am gear n prop
#1 0 3 15 0.7894737
#2 0 4 4 0.2105263
#3 1 4 8 0.6153846
#4 1 5 5 0.3846154
We can also use prop.table but the output displays differently.
prop.table(table(mtcars$am, mtcars$gear), 1)
# 3 4 5
# 0 0.7894737 0.2105263 0.0000000
# 1 0.0000000 0.6153846 0.3846154
This answer is based upon Matifou's answer.
First I modified it to ensure that I don't get the freq column returned as a scientific notation column by using the scipen option.
Then I multiple the answer by 100 to get a percent rather than decimal to make the freq column easier to read as a percentage.
getOption("scipen")
options("scipen"=10)
mtcars %>%
count(am, gear) %>%
mutate(freq = (n / sum(n)) * 100)
I started getting a new message (see post title) when running group_by and summarise() after updating to dplyr development version 0.8.99.9003.
Here is an example to recreate the output:
library(tidyverse)
library(hablar)
df <- read_csv("year, week, rat_house_females, rat_house_males, mouse_wild_females, mouse_wild_males
2018,10,1,1,1,1
2018,10,1,1,1,1
2018,11,2,2,2,2
2018,11,2,2,2,2
2019,10,3,3,3,3
2019,10,3,3,3,3
2019,11,4,4,4,4
2019,11,4,4,4,4") %>%
convert(chr(year,week)) %>%
mutate(total_rodents = rowSums(select_if(., is.numeric))) %>%
convert(num(year,week)) %>%
group_by(year,week) %>% summarise(average = mean(total_rodents))
The output tibble is correct, but this message appears:
summarise() regrouping output by 'year' (override with .groups argument)
How should this be interpreted? Why does it report regrouping only by 'year' when I grouped by both year and week? Also, what does it mean to override and why would I want to do that?
I don't think the message indicates a problem because it appears throughout the dplyr vignette:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I believe it is a new message because it has only appeared on very recent SO questions such as How to melt pairwise.wilcox.test output using dplyr? and R Aggregate over multiple columns (neither of which addresses the regrouping/override message).
Thank you!
It is just a friendly warning message. By default, if there is any grouping before the summarise, it drops one group variable i.e. the last one specified in the group_by. If there is only one grouping variable, there won't be any grouping attribute after the summarise and if there are more than one i.e. here it is two, so, the attribute for grouping is reduce to 1 i.e. the data would have the 'year' as grouping attribute. As a reproducible example
library(dplyr)
mtcars %>%
group_by(am) %>%
summarise(mpg = sum(mpg))
#`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
# am mpg
#* <dbl> <dbl>
#1 0 326.
#2 1 317.
The message is that it is ungrouping i.e when there is a single group_by, it drops that grouping after the summarise
mtcars %>%
group_by(am, vs) %>%
summarise(mpg = sum(mpg))
#`summarise()` regrouping output by 'am' (override with `.groups` argument)
# A tibble: 4 x 3
# Groups: am [2]
# am vs mpg
# <dbl> <dbl> <dbl>
#1 0 0 181.
#2 0 1 145.
#3 1 0 118.
#4 1 1 199.
Here, it drops the last grouping and regroup with the 'am'
If we check the ?summarise, there is .groups argument which by default is "drop_last" and the other options are "drop", "keep", "rowwise"
.groups - Grouping structure of the result.
"drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.
"drop": All levels of grouping are dropped.
"keep": Same grouping structure as .data.
"rowwise": Each row is its own group.
When .groups is not specified, you either get "drop_last" when all the results are size 1, or "keep" if the size varies. In addition, a message informs you of that choice, unless the option "dplyr.summarise.inform" is set to FALSE.
i.e. if we change the .groups in summarise, we don't get the message because the group attributes are removed
mtcars %>%
group_by(am) %>%
summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 2 x 2
# am mpg
#* <dbl> <dbl>
#1 0 326.
#2 1 317.
mtcars %>%
group_by(am, vs) %>%
summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 4 x 3
# am vs mpg
#* <dbl> <dbl> <dbl>
#1 0 0 181.
#2 0 1 145.
#3 1 0 118.
#4 1 1 199.
mtcars %>%
group_by(am, vs) %>%
summarise(mpg = sum(mpg), .groups = 'drop') %>%
str
#tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
# $ am : num [1:4] 0 0 1 1
# $ vs : num [1:4] 0 1 0 1
# $ mpg: num [1:4] 181 145 118 199
Previously, this warning was not issued and it could lead to situations where the OP does a mutate or something else assuming there is no grouping and results in unexpected output. Now, the warning gives the user an indication that we should be careful that there is a grouping attribute
NOTE: The .groups right now is experimental in its lifecycle. So, the behaviour could be modified in the future releases
Depending upon whether we need any transformation of the data based on the same grouping variable (or not needed), we could select the different options in .groups.
Paraphrasing the accepted answer, it is just a friendly confusing warning.
summarise() has grouped output by 'xxx'
should be read: the output is OK and contains all grouping columns as attributes, only the grouping keys may be limited.
Example of grouping mtcars by cyl, am calculating mean(mpg)
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 6 x 3
# Groups: cyl [3]
cyl am avg_mpg
<dbl> <dbl> <dbl>
1 4 0 22.9
2 4 1 28.1
3 6 0 19.1
4 6 1 20.6
5 8 0 15.0
6 8 1 15.4
The warning is saying that in the output only the first of the original grouping keys was preserved using the default .groups = "drop_last". See the line # Groups: cyl [3].
Nevertheless, the attributes are complete, both cyl and am are defined.
Here a quick overview of the available option showing the result with the function group_keys()
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg)) %>% group_keys()
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 3 x 1
cyl
<dbl>
1 4
2 6
3 8
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg), .groups = "keep") %>% group_keys()
# A tibble: 6 x 2
cyl am
<dbl> <dbl>
1 4 0
2 4 1
3 6 0
4 6 1
5 8 0
6 8 1
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg), .groups = "drop") %>% group_keys()
# A tibble: 1 x 0
The only visible consequence is while using a cascading summarization - the example below produce only one summary row as the group key were dropped.
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg), .groups = "drop") %>% summarise(min_avg_mpg = min(avg_mpg))
# A tibble: 1 x 1
min_avg_mpg
<dbl>
1 15.0
But as the grouping attributes are all available, it should be not a problem to reset the group keys as required using group_by(cyl, am) before the subsequent summarization.
The answer is explained in ?summarise:
"When .groups is not specified, it is chosen based on the number of rows of the results:
If all the results have 1 row, you get "drop_last".
If the number of rows varies, you get "keep".".
Basically, you get such message when there is more than one option to be used as .groups= argument. The message warns you that one option has been used in the calculation of the statistics following the condition above: "drop_last" or "keep" for results with 1 or more rows, respectively.
Let's say that in your pipeline for some reason you applied two or more grouping criteria but you still need to summarise the data all across values regarless grouping, this can be done by setting .group = 'drop'. Unfortunately, this is only in theory, because, as you can see in #akrun's example, statistic values remain de same, no matter which option was set in .group = (I applied these different options to one of my datasets and obtained same results and same dataframe structure ('grouping structure is controlled by the .group= argument...'). However, by specifying the argument .group, no message is printed.
The bottom line is that when using summarise, if not grouping criteria is used, the output statistic is calculated across all rows and therefore 'results have 1 row'. When one or more grouping criteria are used, the output statistic is calculated within each group and therefore 'the number of rows varies' depending on the number of groups in data frame.
To solve this use summarise(avg_mpg = mean(mpg), .groups = "drop"),
dplyr actually interprets the result table as grouped, thats why he shows you that warning.
This can be as a result of summarise_all() vs summarise(across(everything()... when you have 2 or more grouping columns
> tibble(gr1=c(1,1,2), gr2=c(1,1,2), val=1:3) %>%
group_by(gr1, gr2) %>%
summarise(across(everything(), mean))
#`summarise()` has grouped output by 'gr1'.
# You can override using the #`.groups` argument.
# A tibble: 2 x 3
# Groups: gr1 [2]
gr1 gr2 val
<dbl> <dbl> <dbl>
1 1 1 1.5
2 2 2 3
> tibble(gr1=c(1,1,2), gr2=c(1,1,2), val=1:3) %>%
+ group_by(gr1, gr2) %>%
+ summarise_all(mean)
# No warnings here
# A tibble: 2 x 3
# Groups: gr1 [2]
gr1 gr2 val
<dbl> <dbl> <dbl>
1 1 1 1.5
2 2 2 3
So, the warning meaning: despite everything(), some of the columns will be skipped (grouping ones) in summarise()
Using mtcars data, I want to calculate proportion of mpg for each group of cyl and am. How to calc it?
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n(mpg)) %>%
mutate(mpg.gr = mpg/(sum(mpg))
Thanks in advance!
If I understand you correctly, you want the proportion of records for each combination of cyl and am. If so, then I believe your code isn't working because n() doesn't accept an argument. You also need to ungroup() before calculating your proportions.
You could simply do:
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n()) %>%
ungroup() %>%
mutate(mpg.gr = mpg/(sum(mpg))
#> # A tibble: 6 x 4
#> cyl am mpg mpg.gr
#> <dbl> <dbl> <int> <dbl>
#> 1 4 0 3 0.0938
#> 2 4 1 8 0.25
#> 3 6 0 4 0.125
#> 4 6 1 3 0.0938
#> 5 8 0 12 0.375
#> 6 8 1 2 0.0625
Note that thanks to ungroup(), the proportions are calculated using the counts of all records, not just those within the cyl group, as before.