dplyr: Arrange not behaving as expected after group_by and summarize - r

I must be missing something with how group_by levels in dplyr get peeled off. In the example below, I group by 2 columns, summarize values into a single variable, then sort by that new variable:
mtcars %>% group_by( cyl, gear ) %>%
summarize( hp_range = max(hp) - min(mpg)) %>%
arrange( desc(hp_range) )
# Source: local data frame [8 x 3]
# Groups: cyl [3]
#
# cyl gear hp_range
# (dbl) (dbl) (dbl)
#1 4 4 87.6
#2 4 5 87.0
#3 4 3 75.5
#4 6 5 155.3
#5 6 4 105.2
#6 6 3 91.9
#7 8 5 320.0
#8 8 3 234.6
Obviously this is not sorted by hp_range as intended. What am I missing?
EDIT: The example works as expected without the call to desc in arrange. Still unclear why?

Ok, just got to the bottom of this:
The call to desc had no effect; it was only by chance that the example appeared to work without it.
The key is that the result of summarize is still grouped by cyl (see the Groups: cyl [3] line), and in the dplyr version used here arrange sorts within those groups. To sort the entire data table, you must first ungroup and then arrange. (Current dplyr releases make arrange ignore grouping unless .by_group = TRUE is set.)
mtcars %>% group_by( cyl, gear ) %>%
summarize( hp_range = max(hp) - min(mpg)) %>%
ungroup() %>%
arrange( hp_range )
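As a minimal sketch of the same fix, assuming dplyr 1.0.0 or later (where summarise() accepts a .groups argument), the grouping can be dropped inside summarise() itself rather than with a separate ungroup(). The name res is only illustrative, and desc() is restored to match the descending order asked for in the question.

```r
library(dplyr)

# Same summary as in the question; .groups = "drop" returns an ungrouped
# tibble, so arrange() sorts the whole table rather than each cyl group.
res <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(hp_range = max(hp) - min(mpg), .groups = "drop") %>%
  arrange(desc(hp_range))

res
```

Either route gives the same table; ungroup() has the advantage of also working on dplyr versions that predate .groups.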

Related

How to count() each variable automatically

I am cleaning some data and like to use the count() function in dplyr to look at unique values of every variable.
Is there a way to do this automatically? Right now I am using this method:
df %>% count(variable1)
df %>% count(variable2)
df %>% count(variable3)
...
I would like something that returns all of them without me having to repeat the line of code and type in each variable. I thought about trying to have R recognize all the column names and automatically fill them in but I'm not sure where to start. If I just add variables together, say
df %>% count(variable1, variable2)
I get counts by both of those variables when I want individual tables for each variable.
Assume that you want to count am, gear, and carb from mtcars. You can apply the function table() on each variable by map(), which returns a list object.
library(dplyr)
library(purrr)
mtcars %>%
select(am, gear, carb) %>%
map(table)
# $am
# 0 1
# 19 13
#
# $gear
# 3 4 5
# 15 12 5
#
# $carb
# 1 2 3 4 6 8
# 7 10 3 10 1 1
Base R version:
lapply(mtcars[c("am", "gear", "carb")], table)
In addition, you can use summary(), which counts factor variables.
mtcars %>%
select(am, gear, carb) %>%
mutate(across(everything(), as.factor)) %>%
summary()
# am gear carb
# 0:19 3:15 1: 7
# 1:13 4:12 2:10
# 5: 5 3: 3
# 4:10
# 6: 1
# 8: 1
It looks like you can use a tidyverse approach to solve your issue. You want the counts for each variable in your dataset (next time, please add a sample of df). You can get something close to what you want with the data in long format. I will show an example with the mtcars data, choosing some variables that take only a few distinct values so that they can be summarised with counts. Here is the code:
library(tidyverse)
#Data
data("mtcars")
I will select some categorical variables with the next code, then reshape to long format. Finally, I will use summarise() and n() (used for counting) with group_by() to determine the counts:
#Code
mtcars %>% select(cyl,vs,am,gear,carb) %>%
#Format to long
pivot_longer(cols = everything()) %>%
#Group and summarise
group_by(name,value) %>%
summarise(N=n())
Output:
# A tibble: 16 x 3
# Groups: name [5]
name value N
<chr> <dbl> <int>
1 am 0 19
2 am 1 13
3 carb 1 7
4 carb 2 10
5 carb 3 3
6 carb 4 10
7 carb 6 1
8 carb 8 1
9 cyl 4 11
10 cyl 6 7
11 cyl 8 14
12 gear 3 15
13 gear 4 12
14 gear 5 5
15 vs 0 18
16 vs 1 14
As you can see, all the variables are shown with their respective values and counts.
A simple solution would be to use sapply or lapply with table:
sapply(df,table)
This will return a list of count tables, one for each column of df (sapply may simplify the result to a matrix when every column has the same number of distinct values; lapply always returns a list). You can always pass in a subsetted data frame to get the counts for your variables of interest.
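If you would rather keep count() itself and get one tibble-style result per column, one possible sketch is below. It assumes dplyr 1.0.0 or later (for across() inside count()); the vector vars and the list counts are illustrative names, and mtcars stands in for df.

```r
library(dplyr)

vars <- c("am", "gear", "carb")   # illustrative choice of columns

# One count() result per column, returned as a named list.
counts <- lapply(setNames(vars, vars), function(v) {
  count(mtcars, across(all_of(v)))
})

counts$gear
```

Replacing vars with names(df) would run the same count over every column of a data frame.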

How to interpret dplyr message `summarise()` regrouping output by 'x' (override with `.groups` argument)?

I started getting a new message (see post title) when running group_by and summarise() after updating to dplyr development version 0.8.99.9003.
Here is an example to recreate the output:
library(tidyverse)
library(hablar)
df <- read_csv("year, week, rat_house_females, rat_house_males, mouse_wild_females, mouse_wild_males
2018,10,1,1,1,1
2018,10,1,1,1,1
2018,11,2,2,2,2
2018,11,2,2,2,2
2019,10,3,3,3,3
2019,10,3,3,3,3
2019,11,4,4,4,4
2019,11,4,4,4,4") %>%
convert(chr(year,week)) %>%
mutate(total_rodents = rowSums(select_if(., is.numeric))) %>%
convert(num(year,week)) %>%
group_by(year,week) %>% summarise(average = mean(total_rodents))
The output tibble is correct, but this message appears:
summarise() regrouping output by 'year' (override with .groups argument)
How should this be interpreted? Why does it report regrouping only by 'year' when I grouped by both year and week? Also, what does it mean to override and why would I want to do that?
I don't think the message indicates a problem because it appears throughout the dplyr vignette:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I believe it is a new message because it has only appeared on very recent SO questions such as How to melt pairwise.wilcox.test output using dplyr? and R Aggregate over multiple columns (neither of which addresses the regrouping/override message).
Thank you!
It is just a friendly informational message, not an error. By default, if the data is grouped before the summarise, the result drops one grouping variable: the last one specified in group_by. If there is only one grouping variable, the result of summarise is no longer grouped. If there is more than one (two here), the grouping is reduced by one level, so the data still carries 'year' as its grouping attribute. As a reproducible example:
library(dplyr)
mtcars %>%
group_by(am) %>%
summarise(mpg = sum(mpg))
#`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
# am mpg
#* <dbl> <dbl>
#1 0 326.
#2 1 317.
The message says that the output is ungrouped: with a single variable in group_by, that grouping is dropped after the summarise.
mtcars %>%
group_by(am, vs) %>%
summarise(mpg = sum(mpg))
#`summarise()` regrouping output by 'am' (override with `.groups` argument)
# A tibble: 4 x 3
# Groups: am [2]
# am vs mpg
# <dbl> <dbl> <dbl>
#1 0 0 181.
#2 0 1 145.
#3 1 0 118.
#4 1 1 199.
Here, it drops the last grouping variable and regroups by 'am'.
If we check ?summarise, there is a .groups argument, which by default is "drop_last"; the other options are "drop", "keep", and "rowwise":
.groups - Grouping structure of the result.
"drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.
"drop": All levels of grouping are dropped.
"keep": Same grouping structure as .data.
"rowwise": Each row is its own group.
When .groups is not specified, you either get "drop_last" when all the results are size 1, or "keep" if the size varies. In addition, a message informs you of that choice, unless the option "dplyr.summarise.inform" is set to FALSE.
That is, if we set .groups explicitly in summarise, we don't get the message; with .groups = 'drop' the grouping attribute is also removed from the result:
mtcars %>%
group_by(am) %>%
summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 2 x 2
# am mpg
#* <dbl> <dbl>
#1 0 326.
#2 1 317.
mtcars %>%
group_by(am, vs) %>%
summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 4 x 3
# am vs mpg
#* <dbl> <dbl> <dbl>
#1 0 0 181.
#2 0 1 145.
#3 1 0 118.
#4 1 1 199.
mtcars %>%
group_by(am, vs) %>%
summarise(mpg = sum(mpg), .groups = 'drop') %>%
str
#tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
# $ am : num [1:4] 0 0 1 1
# $ vs : num [1:4] 0 1 0 1
# $ mpg: num [1:4] 181 145 118 199
Previously, this message was not shown, which could lead to situations where the user does a mutate or some other step assuming there is no grouping and gets unexpected output. Now the message signals that a grouping attribute may still be present and needs care.
NOTE: The .groups argument is currently marked experimental in its lifecycle, so its behaviour could change in future releases.
Depending on whether later transformations need the same grouping variables, we can choose among the different .groups options.
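To make the pitfall concrete, here is a small sketch, assuming dplyr 1.0.0 or later; the names by_am_vs, share_within_am, and share_overall are illustrative. The same mutate() gives different numbers depending on whether the leftover grouping by am is still in place.

```r
library(dplyr)

# "drop_last" is what the default does here: grouping by vs is dropped,
# grouping by am remains.
by_am_vs <- mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg), .groups = "drop_last")

# sum(mpg) is evaluated within each am group, so shares add up to 1 per group.
share_within_am <- by_am_vs %>%
  mutate(share = mpg / sum(mpg))

# After ungroup(), sum(mpg) covers all rows, so shares add up to 1 overall.
share_overall <- by_am_vs %>%
  ungroup() %>%
  mutate(share = mpg / sum(mpg))
```

Neither result is wrong in itself; the point is that the choice should be deliberate rather than inherited from the default.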
Paraphrasing the accepted answer, it is just a friendly, if confusing, message.
summarise() has grouped output by 'xxx'
should be read as: the output is fine and still contains all the grouping columns; only the set of grouping keys has been reduced.
Example: grouping mtcars by cyl and am and calculating mean(mpg):
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 6 x 3
# Groups: cyl [3]
cyl am avg_mpg
<dbl> <dbl> <dbl>
1 4 0 22.9
2 4 1 28.1
3 6 0 19.1
4 6 1 20.6
5 8 0 15.0
6 8 1 15.4
The message says that only the first of the original grouping keys was preserved in the output, because of the default .groups = "drop_last". See the line # Groups: cyl [3].
Nevertheless, the columns are complete: both cyl and am are present.
Here is a quick overview of the available options, showing the result with the function group_keys():
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg)) %>% group_keys()
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 3 x 1
cyl
<dbl>
1 4
2 6
3 8
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg), .groups = "keep") %>% group_keys()
# A tibble: 6 x 2
cyl am
<dbl> <dbl>
1 4 0
2 4 1
3 6 0
4 6 1
5 8 0
6 8 1
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg), .groups = "drop") %>% group_keys()
# A tibble: 1 x 0
The only visible consequence appears in cascading summarisation: the example below produces only one summary row because the group keys were dropped.
mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg), .groups = "drop") %>% summarise(min_avg_mpg = min(avg_mpg))
# A tibble: 1 x 1
min_avg_mpg
<dbl>
1 15.0
But as the grouping columns are all still available, it is not a problem to reset the group keys as required with group_by(cyl, am) before the subsequent summarisation.
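For contrast, a sketch of the same cascade with the default-style "drop_last" (dplyr 1.0.0 or later assumed; min_by_cyl is an illustrative name). Because the intermediate result stays grouped by cyl, the second summarise() returns one row per cyl value instead of a single row.

```r
library(dplyr)

# With "drop_last" the first summarise() leaves the result grouped by cyl,
# so the second summarise() is evaluated once per cyl value.
min_by_cyl <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop_last") %>%
  summarise(min_avg_mpg = min(avg_mpg))

min_by_cyl
```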
The answer is explained in ?summarise:
"When .groups is not specified, it is chosen based on the number of rows of the results:
If all the results have 1 row, you get "drop_last".
If the number of rows varies, you get "keep"."
Basically, you get this message when more than one option could apply as the .groups argument. The message tells you which option was chosen according to the condition above: "drop_last" when every result has 1 row, "keep" when the number of rows varies.
Suppose your pipeline groups by two or more variables but you then need to summarise across all rows regardless of grouping; in principle you might expect .groups = 'drop' to do this. In practice, as you can see in @akrun's example, the statistic values remain the same no matter which option is set in .groups: I applied the different options to one of my datasets and obtained the same results and the same data frame structure, because the argument only controls the grouping structure of the output ('grouping structure is controlled by the .groups= argument...'). To summarise across all rows you would need to ungroup() before calling summarise. However, when .groups is specified, no message is printed.
The bottom line is that when summarise is used without any grouping, the statistic is calculated across all rows. When one or more grouping variables are used, the statistic is calculated within each group. Note that 'all the results have 1 row' refers to the number of rows each group's summary returns, not to the total number of rows in the output: a summary such as mean() returns one row per group, which gives "drop_last".
To suppress the message, use summarise(avg_mpg = mean(mpg), .groups = "drop").
dplyr treats the result table as still grouped, which is why it shows you that message.
You may see this message after switching from summarise_all() to summarise(across(everything(), ...)) when you have 2 or more grouping columns:
tibble(gr1=c(1,1,2), gr2=c(1,1,2), val=1:3) %>%
group_by(gr1, gr2) %>%
summarise(across(everything(), mean))
#`summarise()` has grouped output by 'gr1'.
# You can override using the `.groups` argument.
# A tibble: 2 x 3
# Groups: gr1 [2]
gr1 gr2 val
<dbl> <dbl> <dbl>
1 1 1 1.5
2 2 2 3
tibble(gr1=c(1,1,2), gr2=c(1,1,2), val=1:3) %>%
group_by(gr1, gr2) %>%
summarise_all(mean)
# No message here
# A tibble: 2 x 3
# Groups: gr1 [2]
gr1 gr2 val
<dbl> <dbl> <dbl>
1 1 1 1.5
2 2 2 3
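So the two results are identical: in both cases the grouping columns are skipped by the summary function despite everything(), and the output is still grouped by gr1. The only difference is that summarise() reports the remaining grouping, while summarise_all() does not.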

Calculate mean by groups in R with two group variables [duplicate]

I want to start using dplyr in place of ddply but I can't get a handle on how it works (I've read the documentation).
For example, why when I try to mutate() something does the "group_by" function not work as it's supposed to?
Looking at mtcars:
library(dplyr) # mtcars itself ships with R's datasets package
Say I make a data.frame which is a summary of mtcars, grouped by "cyl" and "gear":
df1 <- mtcars %>%
group_by(cyl, gear) %>%
summarise(
newvar = sum(wt)
)
Then say I want to further summarise this dataframe. With ddply, it'd be straightforward, but when I try to do this with dplyr, it's not actually "grouping by":
df2 <- df1 %>%
group_by(cyl) %>%
mutate(
newvar2 = newvar + 5
)
Still yields an ungrouped output:
cyl gear newvar newvar2
1 6 3 6.675 11.675
2 4 4 19.025 24.025
3 6 4 12.375 17.375
4 6 5 2.770 7.770
5 4 3 2.465 7.465
6 8 3 49.249 54.249
7 4 5 3.653 8.653
8 8 5 6.740 11.740
Am I doing something wrong with the syntax?
Edit:
If I were to do this with plyr and ddply:
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
and then to get the second df:
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
But that same approach, with sum(newvar) + 5 in the summarise() function doesn't work with dplyr...
I had a similar problem. I found that simply detaching plyr solved it:
detach(package:plyr)
library(dplyr)
Taking Dickoa's answer one step further -- as Hadley says "summarise peels off a single layer of grouping". It peels off grouping from the reverse order in which you applied it so you can just use
mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt)) %>%
summarise(newvar2 = sum(newvar) + 5)
Note that this will give a different answer if you use group_by(gear, cyl) in the second line.
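A short sketch of that difference, assuming dplyr 1.0.0 or later so that .groups can be spelled out; by_cyl and by_gear are illustrative names. Only the order of the variables in group_by() changes, yet the final table is keyed by a different column.

```r
library(dplyr)

# gear is the last grouping variable, so it is peeled off first;
# the final result has one row per cyl.
by_cyl <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt), .groups = "drop_last") %>%
  summarise(newvar2 = sum(newvar) + 5)

# cyl is the last grouping variable here, so the final result
# has one row per gear instead.
by_gear <- mtcars %>%
  group_by(gear, cyl) %>%
  summarise(newvar = sum(wt), .groups = "drop_last") %>%
  summarise(newvar2 = sum(newvar) + 5)
```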
And to get your first attempt working:
df1 <- mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt))
df2 <- df1 %>%
group_by(cyl) %>%
summarise(newvar2 = sum(newvar)+5)
If you translate your plyr code into dplyr using summarise instead of mutate you get the same results.
library(plyr)
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
df2
## cyl newvar2
## 1 4 30.143
## 2 6 26.820
## 3 8 60.989
detach(package:plyr)
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt)) %>%
group_by(cyl) %>%
summarise(newvar2 = sum(newvar) + 5)
## cyl newvar2
## 1 4 30.143
## 2 8 60.989
## 3 6 26.820
EDIT
Since summarise drops the last group (gear) you can skip the second group_by (see @hadley's comment below)
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt)) %>%
summarise(newvar2 = sum(newvar) + 5)
## cyl newvar2
## 1 4 30.143
## 2 8 60.989
## 3 6 26.820
Detaching plyr is one way to solve the problem so you can use dplyr functions as desired... but what if you need other functions from plyr to complete other tasks in your code?
(In this example, I've got both dplyr and plyr libraries loaded)
Suppose we have a simple data.frame and we want to compute the groupwise sum of the variable value, when grouped by different levels of gname
> dx<-data.frame(gname=c(1,1,1,2,2,2,3,3,3), value = c(2,2,2,4,4,4,5,6,7))
> dx
gname value
1 1 2
2 1 2
3 1 2
4 2 4
5 2 4
6 2 4
7 3 5
8 3 6
9 3 7
But when we try to use what we believe will produce a dplyr grouped sum, here's what happens:
dx %>% group_by(gname) %>% mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname
gname value mysum
1 1 2 36
2 1 2 36
3 1 2 36
4 2 4 36
5 2 4 36
6 2 4 36
7 3 5 36
8 3 6 36
9 3 7 36
It doesn't give us the desired answer. The likely cause is masking: when plyr is attached after dplyr, plyr's mutate (which ignores dplyr grouping) masks dplyr's. We could detach plyr, but another way is to call the dplyr versions of group_by and mutate explicitly:
dx %>% dplyr::group_by(gname) %>% dplyr::mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname
gname value mysum
1 1 2 6
2 1 2 6
3 1 2 6
4 2 4 12
5 2 4 12
6 2 4 12
7 3 5 18
8 3 6 18
9 3 7 18
Now we see that this works as expected.
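One way to check which package a bare function name currently resolves to is sketched below. It uses only base R's environment() and environmentName(); the variable names are illustrative, and the example attaches only dplyr, so "dplyr" is the expected owner.

```r
library(dplyr)

# Which package does the bare name `mutate` resolve to right now?
# "plyr" here would mean plyr was attached later and masks dplyr's version.
mutate_owner <- environmentName(environment(mutate))
print(mutate_owner)

dx <- data.frame(gname = rep(1:3, each = 3),
                 value = c(2, 2, 2, 4, 4, 4, 5, 6, 7))

# Namespaced calls do not depend on the order in which packages were attached.
res <- dx %>%
  dplyr::group_by(gname) %>%
  dplyr::mutate(mysum = sum(value))
```

Running the check before a long pipeline can save some debugging when both packages are needed in the same session.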
dplyr is working as you should expect in your example. Mutate, as you specified it, will just add 5 to each value of newvar as it creates newvar2. This would look the same if you group or not. If, however, you specify something that differs by group you will get something different. For example:
df1 %>%
group_by(cyl) %>%
mutate(
newvar2 = newvar + mean(cyl)
)

Summary of proportions by group [duplicate]

This question already has answers here: Relative frequencies / proportions with dplyr
What would be the best tool/package to use to calculate proportions by subgroups? I thought I could try something like this:
data(mtcars)
library(plyr)
ddply(mtcars, .(cyl), transform, Pct = gear/length(gear))
But the output is not what I want, as I would want something with a number of rows equal to the number of cyl values. Even if I change it to summarise I still get the same problem.
I am open to other packages, but I thought plyr would be best as I would eventually like to build a function around this. Any ideas?
I'd appreciate any help just solving a basic problem like this.
library(dplyr)
mtcars %>%
count(cyl, gear) %>%
mutate(prop = prop.table(n))
See ?count. Basically, count is a wrapper for summarise with n(), but it does the group by for you. Look at the output of just mtcars %>% count(cyl, gear). Then we add a variable named prop with mutate, which is the result of calling prop.table() on the n variable created by count(cyl, gear).
You could wrap this in a function using the SE version of count(), that is count_() (since deprecated in favour of tidy evaluation). Look at the vignette for Non-Standard Evaluation in the dplyr package.
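A possible modern equivalent of such a function is sketched here, assuming dplyr 1.0.0 or later. prop_by() is a hypothetical helper, not part of dplyr; it takes the grouping columns as a character vector and, for ungrouped input, returns each combination's share of all rows.

```r
library(dplyr)

# Hypothetical helper: counts combinations of the columns named in `vars`
# and adds each combination's share of all rows.
prop_by <- function(data, vars) {
  data %>%
    count(across(all_of(vars))) %>%
    mutate(prop = n / sum(n))
}

res <- prop_by(mtcars, c("cyl", "gear"))
res
```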
Here's a nice github gist addressing lots of cross-tabulation variants with dplyr and other packages.
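To get frequency within a group (this relies on count returning output grouped by cyl, as in the dplyr version shown here; in current versions count returns an ungrouped result for ungrouped input, so Freq would be the overall proportion):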
library(dplyr)
mtcars %>% count(cyl, gear) %>% mutate(Freq = n/sum(n))
# Source: local data frame [8 x 4]
# Groups: cyl [3]
#
# cyl gear n Freq
# (dbl) (dbl) (int) (dbl)
# 1 4 3 1 0.09090909
# 2 4 4 8 0.72727273
# 3 4 5 2 0.18181818
# 4 6 3 2 0.28571429
# 5 6 4 4 0.57142857
# 6 6 5 1 0.14285714
# 7 8 3 12 0.85714286
# 8 8 5 2 0.14285714
or equivalently,
mtcars %>% group_by(cyl, gear) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
Careful of what the grouping is at each stage, or your numbers will be off.
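To make the grouping explicit rather than relying on what count() or summarise() leaves behind, a sketch along these lines may help (within_cyl and overall are illustrative names; any recent dplyr should do).

```r
library(dplyr)

# Proportion of each gear within its cyl group: group explicitly by cyl.
within_cyl <- mtcars %>%
  count(cyl, gear) %>%
  group_by(cyl) %>%
  mutate(Freq = n / sum(n)) %>%
  ungroup()

# Proportion of each (cyl, gear) combination among all rows: no grouping.
overall <- mtcars %>%
  count(cyl, gear) %>%
  ungroup() %>%
  mutate(Freq = n / sum(n))
```

The first version sums to 1 within every cyl value, the second sums to 1 over the whole table.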

Concatenating words without quotes in R for groupby in dplyr

I have datasets that involve large number of column joins (8-12) and at the same time depending upon the circumstance 1-3 of these columns may not be needed.
Presently I have been writing out these long group-bys using dplyr but with so many columns and changing situations, it is easy to misspell or forget a column.
I'd like to somehow create a variable that goes along with doing this, but I haven't been able to figure out how to due to the quotes that are present when I try to use paste. Can anyone show me a quick example of how to do this?
For example:
library(dplyr)
# I want this group-list not to have quotes so I can drop in my group_by below
my_group_list = paste0("vs"," ","am") #quotes get in the way
mtcars %>% group_by(my_group_list) %>% summarise(countofvalues = n())
If there are many columns, we can specify the grouping columns by subsetting the column names directly. In that case, use group_by_ (deprecated in later dplyr releases):
library(dplyr)
mtcars %>%
group_by_(.dots=names(.)[8:9]) %>%
summarise(countofvalues = n())
# vs am countofvalues
# (dbl) (dbl) (int)
#1 0 0 12
#2 0 1 6
#3 1 0 7
#4 1 1 7
The above also works if we have a vector of values
my_group_list <- c("vs", "am")
mtcars %>%
group_by_(.dots = my_group_list) %>%
summarise(countofvalues = n())
# vs am countofvalues
# (dbl) (dbl) (int)
#1 0 0 12
#2 0 1 6
#3 1 0 7
#4 1 1 7
As the OP mentioned that it is not doing the grouping, we can test it by uniting the 'vs' and 'am' columns, using the result as the grouping variable, and then calling n().
library(tidyr)
mtcars %>%
unite(vs_am, vs, am) %>%
group_by(vs_am) %>%
summarise(countofvalues = n())
# vs_am countofvalues
# (chr) (int)
#1 0_0 12
#2 0_1 6
#3 1_0 7
#4 1_1 7
I know this is a pretty stale thread, but I happened on to it and found a more recent answer. You can use group_by_at() and tidy select helpers (I found it on this dplyr issue). For example:
my_group_list <- c("vs", "am")
mtcars %>%
group_by_at(all_of(my_group_list)) %>%
summarise(countofvalues = n())
# `summarise()` regrouping output by 'vs' (override with `.groups` argument)
# A tibble: 4 x 3
# Groups: vs [2]
# vs am countofvalues
# <dbl> <dbl> <int>
# 1 0 0 12
# 2 0 1 6
# 3 1 0 7
# 4 1 1 7
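On dplyr 1.0.0 or later, the same idea can also be written with across() inside a plain group_by(). The sketch below assumes that version; res is an illustrative name, and .groups = "drop" is added only to return an ungrouped result without the message.

```r
library(dplyr)

my_group_list <- c("vs", "am")

# all_of() turns the character vector into a column selection,
# and across() lets group_by() accept that selection.
res <- mtcars %>%
  group_by(across(all_of(my_group_list))) %>%
  summarise(countofvalues = n(), .groups = "drop")

res
```

Dropping a column from my_group_list is then enough to change the grouping, which addresses the original concern about long, hand-typed group_by() calls.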
