Get top values of multiple group bys

I've tried a few ways to achieve this (do, row_number) but I'm still stuck.
I have 3 groups: month, city, and gender.
I would like to get only the top 5 counts across these 3 grouping variables.
This code works fine only with 2 groups:
df_top5_2grp <- df %>%
group_by(month, city) %>%
tally() %>%
top_n(n = 5, wt = n) %>%
arrange(month, desc(n))
However, it won't return the top 5 count if I add an additional group:
df_top5_3grp <- df %>%
group_by(month, city, gender) %>%
tally() %>%
top_n(n = 5, wt = n) %>%
arrange(month, gender, desc(n))
It returns all rows instead. The only difference is I added gender.
Any help is appreciated. Thanks!

You probably need an ungroup() in there.
In the first example below, it returns all the rows. tally() drops only the last grouping variable, so the result is still grouped by cyl and vs: 5 groups with at most two rows each. Returning the top 5 within each of those groups therefore returns all 7 rows.
mtcars %>%
group_by(cyl, vs, am) %>% # grouping across three variables
tally() %>% # tally is a summarization that removes the last grouping
top_n(n = 5, wt = n)
# A tibble: 7 x 4
# Groups: cyl, vs [5] # NOTE! This reminds us the data is still grouped
cyl vs am n
<dbl> <dbl> <dbl> <int>
1 4 0 1 1
2 4 1 0 3
3 4 1 1 7
4 6 0 1 3
5 6 1 0 4
6 8 0 0 12
7 8 0 1 2
Adding ungroup() makes the top 5 filtering happen across all the summarized rows, not within each remaining group.
mtcars %>%
group_by(cyl, vs, am) %>%
tally() %>%
ungroup() %>%
top_n(n = 5, wt = n)
# A tibble: 5 x 4
cyl vs am n
<dbl> <dbl> <dbl> <int>
1 4 1 0 3
2 4 1 1 7
3 6 0 1 3
4 6 1 0 4
5 8 0 0 12
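As a hedged aside, top_n() has been superseded by slice_max() since dplyr 1.0.0, and count() on an ungrouped data frame returns an ungrouped result, so no explicit ungroup() is needed. The sketch below assumes dplyr >= 1.0.0; the name top5 is only illustrative.

```r
library(dplyr)

# count() is shorthand for group_by() + tally() + ungroup(),
# so the result carries no leftover grouping.
top5 <- mtcars %>%
  count(cyl, vs, am) %>%
  # Keep the 5 largest counts overall; ties at the cutoff are kept by default.
  slice_max(n, n = 5)

top5
```

If the top 5 is wanted within each level of one variable instead, adding a group_by() on that variable before slice_max() should give the per-group behaviour.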

Related

how to calculate proportion by another variable (not by frequency) in dplyr in R

Using the mtcars data, I want to calculate the proportion of mpg for each group of cyl and am. How can I calculate it?
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n(mpg)) %>%
mutate(mpg.gr = mpg/(sum(mpg))
Thanks in advance!

If I understand you correctly, you want the proportion of records for each combination of cyl and am. If so, then I believe your code isn't working because n() doesn't accept an argument. You also need to ungroup() before calculating your proportions.
You could simply do:
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n()) %>%
ungroup() %>%
mutate(mpg.gr = mpg / sum(mpg))
#> # A tibble: 6 x 4
#> cyl am mpg mpg.gr
#> <dbl> <dbl> <int> <dbl>
#> 1 4 0 3 0.0938
#> 2 4 1 8 0.25
#> 3 6 0 4 0.125
#> 4 6 1 3 0.0938
#> 5 8 0 12 0.375
#> 6 8 1 2 0.0625
Note that thanks to ungroup(), the proportions are calculated using the counts of all records, not just those within the cyl group, as before.
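For comparison, here is a minimal sketch of the same calculation using count(), which returns ungrouped output when its input is ungrouped. The column names n and prop and the object name props are illustrative, not taken from the question.

```r
library(dplyr)

props <- mtcars %>%
  # One row per (cyl, am) combination, with the number of records in n.
  count(cyl, am) %>%
  # No grouping remains, so sum(n) is the total number of records.
  mutate(prop = n / sum(n))

props
```

To get proportions within each cyl group instead, a group_by(cyl) before the mutate() would change the denominator to the per-cyl total.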

Dplyr Unique count AND a general count in the same data frame

Is there a way to do this in one line of code resulting in one dataframe instead of two as seen below:
df1 <- mtcars %>% group_by(gear, carb) %>%
distinct(gear, cyl, am) %>%
summarise(UniqCnt = n())
df2 <- mtcars %>% group_by(gear, carb) %>%
summarise(Cnt = n())
I attempted this
attempt1 <- mtcars %>% (group_by(gear, carb) %>%
distinct(gear, cyl,am) %>%
summarise(UniqCnt = n())) %>%
(group_by(gear, carb) %>%
summarise(Cnt = n()))
but it did not work. I can rbind the two but I would prefer not to.
Thank you in advance.

You can use the n_distinct() function in your summarize(). For example:
mtcars %>% group_by(gear, carb) %>%
summarize(UniqCnt = n_distinct(am), Cnt=n())
# gear carb UniqCnt Cnt
# <dbl> <dbl> <int> <int>
# 1 3 1 1 3
# 2 3 2 1 4
# 3 3 3 1 3
# 4 3 4 1 5
# 5 4 1 1 4
# 6 4 2 2 4
# 7 4 4 2 4
# 8 5 2 1 2
# 9 5 4 1 1
# 10 5 6 1 1
# 11 5 8 1 1

How about this:
df1 <- mtcars %>%
group_by(gear, carb) %>%
summarise(UniqCnt = n_distinct(gear, cyl, am),
Cnt = n())
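A quick way to check that the one-step version matches the original two-step approach is sketched below. It assumes dplyr >= 1.0.0 (for the .groups argument); the names combined and two_step are made up for the comparison. Because gear is constant within each group, n_distinct(cyl, am) counts the same combinations as n_distinct(gear, cyl, am).

```r
library(dplyr)

# One pass: distinct (cyl, am) combinations and row count per (gear, carb).
combined <- mtcars %>%
  group_by(gear, carb) %>%
  summarise(UniqCnt = n_distinct(cyl, am), Cnt = n(), .groups = "drop")

# Two-step reference: de-duplicate first, then count the remaining rows.
two_step <- mtcars %>%
  distinct(gear, carb, cyl, am) %>%
  count(gear, carb, name = "UniqCnt")

# Both results are sorted by gear and carb, so the columns line up.
identical(combined$UniqCnt, two_step$UniqCnt)
```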

Inconsistent relative frequency outputs with summarise and mutate

I want to compute relative frequency of a group of values with respect to the remaining groups. For example, compute the relative frequency of gear==3 in am==0. I have computed using the following way.
library(dplyr)
mtcars %>%
select(am, gear) %>%
group_by(am, gear) %>%
summarise(N = n()) %>%
group_by(am) %>%
mutate(freq = N / sum(N))
# Source: local data frame [4 x 4]
# Groups: am [2]
#
# # A tibble: 4 x 4
# am gear N freq
# <dbl> <dbl> <int> <dbl>
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
The above output is as expected. However, I would like the freq values as a new column in the original dataset, with the same values. I tried the approach below for calculating the count N and then the relative frequency freq.
mtcars %>%
select(am, gear) %>%
group_by(am, gear) %>%
mutate(N = n()) %>%
group_by(am) %>%
mutate(freq = N / sum(N))
# Source: local data frame [32 x 4]
# Groups: am [2]
#
# # A tibble: 32 x 4
# am gear N freq
# <dbl> <dbl> <int> <dbl>
# 1 1 4 8 0.08988764
# 2 1 4 8 0.08988764
# 3 1 4 8 0.08988764
# 4 0 3 15 0.06224066
# 5 0 3 15 0.06224066
# 6 0 3 15 0.06224066
# 7 0 3 15 0.06224066
# 8 0 4 4 0.01659751
# 9 0 4 4 0.01659751
# 10 0 4 4 0.01659751
# # ... with 22 more rows
Now, it gives a different output. What might be the reason?

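A better option would be a left_join with the summarised output (here called 'res', i.e. the result of the summarise() pipeline in the question):
mtcars %>%
select(am, gear) %>%
left_join(., res)
The mutate() version gives a different result because N is repeated on every row, so sum(N) within each am group adds it once per row instead of once per (am, gear) combination. For am == 0 that is 15 * 15 + 4 * 4 = 241 rather than 15 + 4 = 19.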

You need to recalculate the N size for the am group as well:
mtcars %>%
select(am, gear) %>%
group_by(am, gear) %>%
mutate(N = n()) %>%
group_by(am) %>%
mutate(freq = N / n())
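This gets the expected results, because n() within the am group is the number of rows for that am value (19 for am == 0, 13 for am == 1).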
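The same row-level result can also be sketched with add_count(), which appends a group count as a column without collapsing rows. The column name N_am and the object name row_level are illustrative assumptions.

```r
library(dplyr)

row_level <- mtcars %>%
  select(am, gear) %>%
  # Rows per (am, gear) combination, repeated on every matching row.
  add_count(am, gear, name = "N") %>%
  # Rows per am value, used as the denominator.
  add_count(am, name = "N_am") %>%
  mutate(freq = N / N_am)

head(row_level)
```

All 32 rows are kept, and every row with am == 0 and gear == 3 carries freq = 15 / 19, matching the summarised output.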

Summarise for multiple group_by variables combined and individually

I am using dplyr's group_by and summarise to get a mean by each group_by variable combined, but also want to get the mean by each group_by variable individually.
For example if I run
mtcars %>%
group_by(cyl, vs) %>%
summarise(new = mean(wt))
I get
cyl vs new
<dbl> <dbl> <dbl>
4 0 2.140000
4 1 2.300300
6 0 2.755000
6 1 3.388750
8 0 3.999214
But I want to get
cyl vs new
<dbl> <dbl> <dbl>
4 0 2.140000
4 1 2.300300
4 NA 2.285727
6 0 2.755000
6 1 3.388750
6 NA 3.117143
8 0 3.999214
NA 0 3.688556
NA 1 2.611286
I.e. get the mean for the variables both combined and individually
Edit
Jaap marked this as duplicate and pointed me in the direction of Using aggregate to apply several functions on several variables in one call. I looked at jaap's answer there which referenced dplyr but I can't see how that answers my question? You say to use summarise_each, but I still don't see how I can use that to get the mean of each of my group by variables individually? Apologies if I am being stupid...

Here is an idea using bind_rows:
library(dplyr)
mtcars %>%
group_by(cyl, vs) %>%
summarise(new = mean(wt)) %>%
bind_rows(.,
mtcars %>% group_by(cyl) %>% summarise(new = mean(wt)) %>% mutate(vs = NA),
mtcars %>% group_by(vs) %>% summarise(new = mean(wt)) %>% mutate(cyl = NA)) %>%
arrange(cyl) %>%
ungroup()
# A tibble: 10 × 3
# cyl vs new
# <dbl> <dbl> <dbl>
#1 4 0 2.140000
#2 4 1 2.300300
#3 4 NA 2.285727
#4 6 0 2.755000
#5 6 1 3.388750
#6 6 NA 3.117143
#7 8 0 3.999214
#8 8 NA 3.999214
#9 NA 0 3.688556
#10 NA 1 2.611286

dplyr cross tab with missing values

I would like to make a cross tab in R using dplyr. I have good reasons for not just using the base table() command.
table(mtcars$cyl, mtcars$gear)
3 4 5
4 1 8 2
6 2 4 1
8 12 0 2
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl, gear) %>%
tally() %>%
spread(gear, n, fill = 0)
Source: local data frame [3 x 4]
cyl 3 4 5
1 4 1 8 2
2 6 2 4 1
3 8 12 0 2
This is all well and good. But it seems to fall apart when there are missing values in the group_by() variables.
mtcars %>%
mutate(
cyl = ifelse(cyl > 6, NA, cyl),
gear = ifelse(gear > 4, NA, gear)
) %>%
group_by(cyl, gear) %>%
tally()
Source: local data frame [8 x 3]
Groups: cyl
cyl gear n
1 4 3 1
2 4 4 8
3 4 NA 2
4 6 3 2
5 6 4 4
6 6 NA 1
7 NA 3 12
8 NA NA 2
# DITTO # %>%
spread(gear, n)
Error in if (any(names2(x) == "")) { :
missing value where TRUE/FALSE needed
I guess what I would like is an NA column, like the one you get with table(..., useNA = "always"). Any tips?

One option is to replace the NAs with a label. This can be accomplished easily with mutate_each:
mtcars %>%
mutate(
cyl = ifelse(cyl > 6, NA, cyl),
gear = ifelse(gear > 4, NA, gear)
) %>%
group_by(cyl, gear) %>%
tally() %>%
ungroup() %>%
mutate_each(funs(replace(., is.na(.), 'missing'))) %>%
spread(gear, n)
# cyl 3 4 missing
# 1 4 1 8 2
# 2 6 2 4 1
# 3 missing 12 NA 2

Agreed that the permanent solution to this should be a tidyr bug fix, but in the meantime, this can be worked around by dropping the dplyr tbl_df format:
mtcars %>%
mutate(
cyl = ifelse(cyl > 6, NA, cyl),
gear = ifelse(gear > 4, NA, gear)
) %>%
group_by(cyl, gear) %>%
tally() %>%
data.frame() %>% ### <-- go from tbl_df to data.frame
spread(gear, n)
cyl 3 4 NA
1 4 1 8 2
2 6 2 4 1
3 NA 12 NA 2
The addition of the data.frame() call allows your code to run, though it produces a column named NA so this is probably best suited for exploratory analyses that print to the console.

Here's an updated answer that works with current dplyr (1.1.0) and tidyr (1.3.0) in 2023.
library(tidyr); library(dplyr)
mtcars %>%
mutate(
cyl = ifelse(cyl > 6, NA, cyl),
gear = ifelse(gear > 4, NA, gear)
) %>%
count(cyl, gear) %>%
mutate(across(everything(), ~coalesce(as.character(.), "missing"))) %>%
pivot_wider(names_from = gear, values_from = n)
# A tibble: 3 × 4
cyl `3` `4` missing
<chr> <chr> <chr> <chr>
1 4 1 8 2
2 6 2 4 1
3 missing 12 NA 2
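Note that across(everything(), ...) above also converts the counts to character. A hedged variant that relabels only the two grouping columns and fills empty cells with zero is sketched below, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0; the name crosstab is illustrative.

```r
library(dplyr)
library(tidyr)

crosstab <- mtcars %>%
  mutate(
    cyl = ifelse(cyl > 6, NA, cyl),
    gear = ifelse(gear > 4, NA, gear)
  ) %>%
  count(cyl, gear) %>%
  # Relabel NA only in the grouping columns; n stays an integer.
  mutate(across(c(cyl, gear), ~ coalesce(as.character(.x), "missing"))) %>%
  # Combinations that never occur get 0 instead of NA.
  pivot_wider(names_from = gear, values_from = n, values_fill = 0L)

crosstab
```

Keeping the counts numeric means row or column totals can still be computed directly from the result.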
