Summarise for multiple group_by variables combined and individually

I am using dplyr's group_by and summarise to get a mean by each group_by variable combined, but also want to get the mean by each group_by variable individually.
For example if I run
mtcars %>%
group_by(cyl, vs) %>%
summarise(new = mean(wt))
I get
cyl vs new
<dbl> <dbl> <dbl>
4 0 2.140000
4 1 2.300300
6 0 2.755000
6 1 3.388750
8 0 3.999214
But I want to get
cyl vs new
<dbl> <dbl> <dbl>
4 0 2.140000
4 1 2.300300
4 NA 2.285727
6 0 2.755000
6 1 3.388750
6 NA 3.117143
8 0 3.999214
NA 0 3.688556
NA 1 2.611286
I.e. get the mean for the variables both combined and individually
Edit
Jaap marked this as duplicate and pointed me in the direction of Using aggregate to apply several functions on several variables in one call. I looked at jaap's answer there which referenced dplyr but I can't see how that answers my question? You say to use summarise_each, but I still don't see how I can use that to get the mean of each of my group by variables individually? Apologies if I am being stupid...

Here is an idea using bind_rows,
library(dplyr)
mtcars %>%
group_by(cyl, vs) %>%
summarise(new = mean(wt)) %>%
bind_rows(.,
mtcars %>% group_by(cyl) %>% summarise(new = mean(wt)) %>% mutate(vs = NA),
mtcars %>% group_by(vs) %>% summarise(new = mean(wt)) %>% mutate(cyl = NA)) %>%
arrange(cyl) %>%
ungroup()
# A tibble: 10 × 3
# cyl vs new
# <dbl> <dbl> <dbl>
#1 4 0 2.140000
#2 4 1 2.300300
#3 4 NA 2.285727
#4 6 0 2.755000
#5 6 1 3.388750
#6 6 NA 3.117143
#7 8 0 3.999214
#8 8 NA 3.999214
#9 NA 0 3.688556
#10 NA 1 2.611286
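The same idea can also be written with each summary built separately, which may be easier to follow. This is only a sketch: it assumes dplyr 1.0.0 or later (for the .groups argument), and the intermediate names (combined, by_cyl, by_vs, result) are made up for illustration. NA_real_ is used so the placeholder columns have the same dbl type as cyl and vs.

```r
library(dplyr)

# Means for every (cyl, vs) combination
combined <- mtcars %>%
  group_by(cyl, vs) %>%
  summarise(new = mean(wt), .groups = "drop")

# Means by cyl alone; vs is filled with a typed NA placeholder
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(new = mean(wt)) %>%
  mutate(vs = NA_real_)

# Means by vs alone; cyl is filled with a typed NA placeholder
by_vs <- mtcars %>%
  group_by(vs) %>%
  summarise(new = mean(wt)) %>%
  mutate(cyl = NA_real_)

# Stack the three summaries; arrange() sorts NA values last
result <- bind_rows(combined, by_cyl, by_vs) %>%
  arrange(cyl, vs)
```

Because combined is already ungrouped, no trailing ungroup() is needed here.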

Related

Get top values of multiple group bys

I've tried a few ways to achieve this (do, row_number) but I'm still stuck.
I have 3 groups: month, city, and gender.
I would like to get only the top 5 counts across these 3 grouping variables.
This code works fine only with 2 groups:
df_top5_2grp <- df %>%
group_by(month, city) %>%
tally() %>%
top_n(n = 5, wt = n) %>%
arrange(month, desc(n))
However, it won't return the top 5 count if I add an additional group:
df_top5_3grp <- df %>%
group_by(month, city, gender) %>%
tally() %>%
top_n(n = 5, wt = n) %>%
arrange(month, gender, desc(n))
It returns all rows instead. The only difference is I added gender.
Any help is appreciated. Thanks!
You probably need an ungroup() in there.
The first example below returns all the rows: after tally() there are 7 rows, still grouped by cyl and vs (5 groups), and no group has more than 5 rows. So taking the top 5 within each group returns every row.
mtcars %>%
group_by(cyl, vs, am) %>% # grouping across three variables
tally() %>% # tally is a summarization that removes the last grouping
top_n(n = 5, wt = n)
# A tibble: 7 x 4
# Groups: cyl, vs [5] # NOTE! This reminds us the data is still grouped
cyl vs am n
<dbl> <dbl> <dbl> <int>
1 4 0 1 1
2 4 1 0 3
3 4 1 1 7
4 6 0 1 3
5 6 1 0 4
6 8 0 0 12
7 8 0 1 2
Adding ungroup makes it so the top 5 filtering happens across all the summarized groups, not within each group.
mtcars %>%
group_by(cyl, vs, am) %>%
tally() %>%
ungroup() %>%
top_n(n = 5, wt = n)
# A tibble: 5 x 4
cyl vs am n
<dbl> <dbl> <dbl> <int>
1 4 1 0 3
2 4 1 1 7
3 6 0 1 3
4 6 1 0 4
5 8 0 0 12
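To check which grouping is still in place after tally(), group_vars() can be used. The sketch below also uses slice_max(), which newer dplyr versions (1.0.0 and later, an assumption about your installation) offer as the successor to top_n(). The names tallied, remaining_groups and top5 are illustrative only.

```r
library(dplyr)

tallied <- mtcars %>%
  group_by(cyl, vs, am) %>%
  tally()

# tally() drops only the last grouping variable (am),
# so the result is still grouped by cyl and vs
remaining_groups <- group_vars(tallied)

# Ungroup first so that the top 5 is taken across all rows
top5 <- tallied %>%
  ungroup() %>%
  slice_max(n, n = 5)
```

Note that slice_max() returns rows sorted by descending n, whereas top_n() keeps the original row order.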

how to calculate proportion by another variable (not by frequency) in dplyr in R

Using mtcars data, I want to calculate proportion of mpg for each group of cyl and am. How to calc it?
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n(mpg)) %>%
mutate(mpg.gr = mpg / sum(mpg))
Thanks in advance!
If I understand you correctly, you want the proportion of records for each combination of cyl and am. If so, then I believe your code isn't working because n() doesn't accept an argument. You also need to ungroup() before calculating your proportions.
You could simply do:
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n()) %>%
ungroup() %>%
mutate(mpg.gr = mpg / sum(mpg))
#> # A tibble: 6 x 4
#> cyl am mpg mpg.gr
#> <dbl> <dbl> <int> <dbl>
#> 1 4 0 3 0.0938
#> 2 4 1 8 0.25
#> 3 6 0 4 0.125
#> 4 6 1 3 0.0938
#> 5 8 0 12 0.375
#> 6 8 1 2 0.0625
Note that, thanks to ungroup(), each proportion is calculated against the total count of all records. Without it, the data would still be grouped by cyl and each proportion would be relative to its own cyl group only.
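The difference between the two denominators can be seen side by side. This is a minimal sketch; counts, overall and within_cyl are names chosen for illustration, and count() is used here as shorthand for group_by() plus summarise(n = n()).

```r
library(dplyr)

# One row per (cyl, am) combination with its record count in column n
counts <- count(mtcars, cyl, am)

# Proportion of all 32 records
overall <- counts %>%
  mutate(prop = n / sum(n))

# Proportion within each cyl group
within_cyl <- counts %>%
  group_by(cyl) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
```

The overall proportions sum to 1, while the within-group proportions sum to 1 per cyl value.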

Dplyr Unique count AND a general count in the same data frame

Is there a way to do this in one line of code resulting in one dataframe instead of two as seen below:
df1 <- mtcars %>% group_by(gear, carb) %>%
distinct(gear, cyl, am) %>%
summarise(UniqCnt = n())
df2 <- mtcars %>% group_by(gear, carb) %>%
summarise(Cnt = n())
I attempted this
attempt1 <- mtcars %>% (group_by(gear, carb) %>%
distinct(gear, cyl,am) %>%
summarise(UniqCnt = n())) %>%
(group_by(gear, carb) %>%
summarise(Cnt = n()))
but it did not work. I can rbind the two but I would prefer not to.
Thank you in advance.
You can use the n_distinct() function in your summarize(). For example
mtcars %>% group_by(gear, carb) %>%
summarize(UniqCnt = n_distinct(am), Cnt=n())
# gear carb UniqCnt Cnt
# <dbl> <dbl> <int> <int>
# 1 3 1 1 3
# 2 3 2 1 4
# 3 3 3 1 3
# 4 3 4 1 5
# 5 4 1 1 4
# 6 4 2 2 4
# 7 4 4 2 4
# 8 5 2 1 2
# 9 5 4 1 1
# 10 5 6 1 1
# 11 5 8 1 1
How about this:
df1 <- mtcars %>%
group_by(gear, carb) %>%
summarise(UniqCnt = n_distinct(gear, cyl, am),
Cnt = n())
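One way to convince yourself that n_distinct() over several columns matches the distinct()-then-count approach from the question is to compute both and compare. This is a sketch assuming dplyr 1.0.0 or later; one_pass and two_pass are illustrative names.

```r
library(dplyr)

# Single pipeline: distinct (cyl, am) combinations and row count per group
one_pass <- mtcars %>%
  group_by(gear, carb) %>%
  summarise(UniqCnt = n_distinct(cyl, am),
            Cnt = n(),
            .groups = "drop")

# Two-step equivalent for the unique count only
two_pass <- mtcars %>%
  distinct(gear, carb, cyl, am) %>%
  count(gear, carb, name = "UniqCnt")

# Both approaches should give the same unique counts per group
same_counts <- identical(one_pass$UniqCnt, two_pass$UniqCnt)
```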

Dplyr to count means by group and then quantiles for each

I have a problem with dplyr, or I just can't figure out how to code the quantile-part right.
I have data that I want to group by x and y, and then compute the mean of a in each group:
dmean %>%
group_by(x,y) %>%
summarise(mean=mean(a))
This part works, no problem.
How do I continue the code to get the 10th and 90th percentiles of each group?
You can put several expressions inside summarise, like so:
library(dplyr)
mtcars %>%
group_by(cyl, am) %>%
summarise(mean = mean(mpg),
quantile_10 = quantile(mpg, 0.1),
quantile_90 = quantile(mpg, 0.9))
# A tibble: 6 x 5
# Groups: cyl [?]
cyl am mean quantile_10 quantile_90
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 0 22.90000 21.76 24.08
2 4 1 28.07500 22.38 32.85
3 6 0 19.12500 17.89 20.74
4 6 1 20.56667 19.96 21.00
5 8 0 15.05000 10.69 18.56
6 8 1 15.40000 15.08 15.72

Count number of rows by group using dplyr

I am using the mtcars dataset. I want to find the number of records for a particular combination of data. Something very similar to the count(*) group by clause in SQL. ddply() from plyr is working for me
library(plyr)
ddply(mtcars, .(cyl,gear),nrow)
has output
cyl gear V1
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Using this code
library(dplyr)
g <- group_by(mtcars, cyl, gear)
summarise(g, length(gear))
has output
length(gear)
1 32
I found various functions to pass in to summarise() but none seem to work for me. One function I found is sum(G), which returned
Error in eval(expr, envir, enclos) : object 'G' not found
Tried using n(), which returned
Error in n() : This function should not be called directly
What am I doing wrong? How can I get group_by() / summarise() to work for me?
There's a special function n() in dplyr to count rows (potentially within groups):
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
summarise(n = n())
#Source: local data frame [8 x 3]
#Groups: cyl [?]
#
# cyl gear n
# (dbl) (dbl) (int)
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
But dplyr also offers a handy count function which does exactly the same with less typing:
count(mtcars, cyl, gear) # or mtcars %>% count(cyl, gear)
#Source: local data frame [8 x 3]
#Groups: cyl [?]
#
# cyl gear n
# (dbl) (dbl) (int)
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
I think what you are looking for is as follows.
cars_by_cylinders_gears <- mtcars %>%
group_by(cyl, gear) %>%
summarise(count = n())
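This uses the dplyr package. It is essentially the longhand version of the count() solution provided by docendo discimus.
Another approach is to call the functions with the dplyr:: prefix (double colons), which helps when another loaded package such as plyr masks summarise():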
mtcars %>%
dplyr::group_by(cyl, gear) %>%
dplyr::summarise(length(gear))
Another option, not necessarily more elegant, but one that does not require referring to a specific column:
mtcars %>%
group_by(cyl, gear) %>%
do(data.frame(nrow=nrow(.)))
This is equivalent to using count():
library(dplyr, warn.conflicts = FALSE)
all.equal(mtcars %>%
group_by(cyl, gear) %>%
do(data.frame(n=nrow(.))) %>%
ungroup(),
count(mtcars, cyl, gear), check.attributes=FALSE)
#> [1] TRUE
Another option is using the function tally from dplyr. Here is a reproducible example:
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
tally()
#> # A tibble: 8 × 3
#> # Groups: cyl [3]
#> cyl gear n
#> <dbl> <dbl> <int>
#> 1 4 3 1
#> 2 4 4 8
#> 3 4 5 2
#> 4 6 3 2
#> 5 6 4 4
#> 6 6 5 1
#> 7 8 3 12
#> 8 8 5 2
Created on 2022-09-11 with reprex v2.0.2
