Use group size (`group_size`) in `summarise` in `dplyr`

I want to use the size of a group as part of a groupwise operation in dplyr::summarise.
For example, to calculate the proportion of manuals by cylinder, I group the mtcars data by cyl and divide the number of manuals by the size of the group:
mtcars %>%
  group_by(cyl) %>%
  summarise(zz = sum(am)/group_size(.))
But (I think) because group_size expects a grouped tbl_df and . is ungrouped, this returns
Error in mutate_impl(.data, dots) : basic_string::resize
Is there a way to do this?

You can use n() to get the number of rows in each group:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/n())
# cyl zz
# <dbl> <dbl>
#1 4.00 0.727
#2 6.00 0.429
#3 8.00 0.143
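
To make the difference between the two functions concrete, here is a minimal sketch, assuming a reasonably recent dplyr. group_size() is called on the whole grouped data frame and returns one size per group, while n() is evaluated inside each group. The object names by_cyl, sizes and props are only for illustration.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# group_size() takes the grouped data frame and returns a vector,
# one element per group (here: cyl = 4, 6, 8)
sizes <- group_size(by_cyl)

# n() is evaluated within each group, so inside summarise() it is
# a single number per group
props <- by_cyl %>%
  summarise(size = n(), zz = sum(am) / n())

sizes
props
```

So inside summarise() it is n() that gives the per-group row count; group_size() is more useful when you want all the group sizes at once.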

Because am is coded as 0/1, this is just a group-wise mean:
mtcars %>%
group_by(cyl) %>%
summarise(zz = mean(am))
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
If we need to use group_size, we can nest the data and apply it to each group's data frame:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest %>%
mutate(zz = map_dbl(data, ~ sum(.x$am)/group_size(.x))) %>%
arrange(cyl) %>%
select(-data)
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
Or using do():
mtcars %>%
group_by(cyl) %>%
do(data.frame(zz = sum(.$am)/group_size(.)))
# A tibble: 3 x 2
# Groups: cyl [3]
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143

Related

Add Another Column Info to Results of groupby r

Can someone help me, please?
I have columns A, B and C. I want to get the top value of column C, grouped by A, but also keep the value of B that goes with each top value.
Max <- X %>% select(A, B, C) %>% group_by(A) %>% summarise(top = max(C))
But this code only shows the top value for each unique value of A, so I don't know which B value belongs to it. (Important: group_by(A, B) doesn't work, because it doesn't give the top value for each unique A; it returns the same rows as the original data X.)
This can be achieved with dplyr::top_n or dplyr::slice_max (top_n is superseded by slice_max as of dplyr 1.0.0), like so:
library(dplyr)
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% top_n(1, hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% slice_max(hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
So, in your case it should be:
Max <-X %>% select(A,B,C) %>% group_by(A) %>% slice_max(C)
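
One thing to watch for is ties. The sketch below uses a made-up data frame X (the asker's real data is not shown, so the values are purely hypothetical) and assumes dplyr 1.0.0 or later. By default slice_max() keeps every row that ties for the maximum; with_ties = FALSE keeps exactly one row per group.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data frame X
X <- tibble(
  A = c("a", "a", "a", "b", "b"),
  B = c("p", "q", "r", "s", "t"),
  C = c(1, 5, 5, 2, 3)
)

# Default behaviour: ties are kept, so group "a" contributes two rows
with_ties <- X %>%
  group_by(A) %>%
  slice_max(C, n = 1) %>%
  ungroup()

# with_ties = FALSE returns exactly one row per group
one_each <- X %>%
  group_by(A) %>%
  slice_max(C, n = 1, with_ties = FALSE) %>%
  ungroup()

with_ties
one_each
```

Which variant is right depends on whether you want to see all B values that share the top C, or just one row per A.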

set the name order in pivot_wider()

I am trying to do the same thing as below, except that the order of the names changes. I got the code from here
mtcars; rownames(mtcars) <- NULL
df <- mtcars[,c(2,8,9)]
head(df)
(df
%>% pivot_longer(-cyl) ## spread out variables (vs, am)
%>% group_by(cyl,name)
%>% dplyr::mutate(n=n()) ## obs per cyl/var combo
%>% group_by(cyl,name,value)
%>% dplyr::summarise(prop=n()/n) ## proportion of 0/1 per cyl/var
%>% unique() ## not sure why I need this?
%>% pivot_wider(id_cols=c(cyl,name),names_from=value,values_from=prop)
)
Expected answer
cyl name `0` `1`
4 vs 0.0909 0.909
4 am 0.273 0.727
6 vs 0.429 0.571
6 am 0.571 0.429
8 vs 1 NA
8 am 0.857 0.143
One possible solution is to add three lines to the end of your code.
Basically, you convert the name variable to a factor whose levels are given in the desired order, so that it is internally coded as 1, 2, ...
Then you group by cyl and sort by name within each group:
(df
%>% pivot_longer(-cyl) ## spread out variables (vs, am)
%>% group_by(cyl,name)
%>% dplyr::mutate(n=n()) ## obs per cyl/var combo
%>% group_by(cyl,name,value)
%>% dplyr::summarise(prop=n()/n) ## proportion of 0/1 per cyl/var
%>% unique() ## not sure why I need this?
%>% pivot_wider(id_cols=c(cyl,name),names_from=value,values_from=prop)
%>% mutate(name = factor(name, levels = c("vs", "am")))
%>% group_by(cyl)
%>% arrange(name, .by_group = TRUE)
)
# A tibble: 6 x 4
# Groups: cyl [3]
cyl name `0` `1`
<dbl> <fct> <dbl> <dbl>
1 4 vs 0.0909 0.909
2 4 am 0.273 0.727
3 6 vs 0.429 0.571
4 6 am 0.571 0.429
5 8 vs 1 NA
6 8 am 0.857 0.143
A different take:
df %>% pivot_longer(!cyl) %>% group_by(cyl, name, value) %>% mutate(cnt = n()) %>%
ungroup() %>% group_by(cyl, name) %>% mutate(prop = cnt/n()) %>% distinct() %>%
pivot_wider(id_cols = c(cyl, name), names_from = value, values_from = prop) %>%
arrange(cyl, desc(name))
# A tibble: 6 x 4
# Groups: cyl, name [6]
cyl name `0` `1`
<dbl> <chr> <dbl> <dbl>
1 4 vs 0.0909 0.909
2 4 am 0.273 0.727
3 6 vs 0.429 0.571
4 6 am 0.571 0.429
5 8 vs 1 NA
6 8 am 0.857 0.143
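
As a further variation, here is a sketch that sets the factor levels before counting and uses count() instead of summarise(). Since count() returns one row per cyl/name/value combination, no unique() or distinct() step is needed. This assumes dplyr and tidyr are installed; the result name res is only for illustration.

```r
library(dplyr)
library(tidyr)

df <- mtcars[, c("cyl", "vs", "am")]

res <- df %>%
  pivot_longer(-cyl) %>%
  # fix the desired order of `name` up front: vs first, then am
  mutate(name = factor(name, levels = c("vs", "am"))) %>%
  # one row per cyl/name/value combination, with its count in `n`
  count(cyl, name, value) %>%
  group_by(cyl, name) %>%
  # proportion of 0s and 1s within each cyl/name pair
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(cyl, name), names_from = value, values_from = prop) %>%
  # arrange() sorts a factor by its levels, not alphabetically
  arrange(cyl, name)

res
```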

how to calculate proportion by another variable (not by frequency) in dplyr in R

Using the mtcars data, I want to calculate the proportion of mpg for each group of cyl and am. How can I calculate it?
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n(mpg)) %>%
mutate(mpg.gr = mpg/(sum(mpg))
Thanks in advance!
If I understand you correctly, you want the proportion of records for each combination of cyl and am. If so, then I believe your code isn't working because n() doesn't accept an argument. You also need to ungroup() before calculating your proportions.
You could simply do:
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n()) %>%
ungroup() %>%
mutate(mpg.gr = mpg / sum(mpg))
#> # A tibble: 6 x 4
#> cyl am mpg mpg.gr
#> <dbl> <dbl> <int> <dbl>
#> 1 4 0 3 0.0938
#> 2 4 1 8 0.25
#> 3 6 0 4 0.125
#> 4 6 1 3 0.0938
#> 5 8 0 12 0.375
#> 6 8 1 2 0.0625
Note that, thanks to ungroup(), the proportions are calculated over all records rather than only within each cyl group, which is what would happen without it.
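
To see the effect of the grouping side by side, here is a small sketch (the column name n_cars and the object names are just for illustration). The first version divides by the total over all rows, the second by the total within each cyl group.

```r
library(dplyr)

# One row per cyl/am combination with its number of cars
counts <- mtcars %>% count(cyl, am, name = "n_cars")

# Share of all cars: the denominator is the grand total
overall <- counts %>%
  mutate(share = n_cars / sum(n_cars))

# Share within each cyl group: the denominator is the per-cyl total
within_cyl <- counts %>%
  group_by(cyl) %>%
  mutate(share = n_cars / sum(n_cars)) %>%
  ungroup()

overall
within_cyl
```

The overall shares sum to 1 across the whole table, while the within-cyl shares sum to 1 inside each cyl group.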

Custom function with dplyr mutate or summarise for different levels within a factor?

Here is some example data:
library(car)
library(dplyr)
df1 <- mtcars %>%
group_by(cyl, gear) %>%
summarise(
newvar = sum(wt)
)
# A tibble: 8 x 3
# Groups: cyl [?]
cyl gear newvar
<dbl> <dbl> <dbl>
1 4 3 2.46
2 4 4 19.0
3 4 5 3.65
4 6 3 6.68
5 6 4 12.4
6 6 5 2.77
7 8 3 49.2
8 8 5 6.74
What if I then wanted to apply a custom function calculating the difference between the newvar values for cars with 3 or 5 gears for each level of cylinder?
df2 <- df1 %>% mutate(Diff = newvar[gear == "3"] - newvar[gear == "5"])
or with summarise?
df2 <- df1 %>% summarise(Diff = newvar[gear == "3"] - newvar[gear == "5"])
There must be a way to apply functions for different levels within different factors?
Any help appreciated!
Your example code is most of the way there. You can do:
df1 %>%
mutate(Diff = newvar[gear == "3"] - newvar[gear == "5"])
Or:
df1 %>%
summarise(Diff = newvar[gear == "3"] - newvar[gear == "5"])
Logical subsetting still works in mutate() and summarise() calls like with any other vector.
Note that this works because, after the summarise() call in your example code, df1 is still grouped by cyl; otherwise you would need a group_by() call to create the correct grouping.
Another option is to spread the data into 'wide' format and then take the difference:
library(tidyverse)
df1 %>%
filter(gear %in% c(3, 5) ) %>%
spread(gear, newvar) %>%
transmute(newvar = `3` - `5`)
# A tibble: 3 x 2
# Groups: cyl [3]
# cyl newvar
# <dbl> <dbl>
#1 4 -1.19
#2 6 3.90
#3 8 42.5
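
spread() has since been superseded by pivot_wider() in tidyr, so the same idea can be sketched as below, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument). The gear_ prefix and the name diffs are my own choices for illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild df1 from the question, dropping the grouping explicitly
df1 <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt), .groups = "drop")

diffs <- df1 %>%
  filter(gear %in% c(3, 5)) %>%
  # one column per gear level: gear_3 and gear_5
  pivot_wider(names_from = gear, values_from = newvar, names_prefix = "gear_") %>%
  transmute(cyl, Diff = gear_3 - gear_5)

diffs
```

The prefix avoids having to refer to columns with backticks such as `3` and `5`.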

Using dplyr summarise_at with column index

I noticed that when supplying column indices to dplyr::summarize_at the column to be summarized is determined excluding the grouping column(s). I wonder if that is how it's supposed to be since by this design, using the correct column index depends on whether the summarising column(s) are positioned before or after the grouping columns.
Here's an example:
library(dplyr)
data("mtcars")
# grouping column after summarise columns
mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
# gear disp hp
# <dbl> <dbl> <dbl>
#1 3 326.3000 176.1333
#2 4 123.0167 89.5000
#3 5 202.4800 195.6000
# grouping columns before summarise columns
mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
# cyl hp drat
# <dbl> <dbl> <dbl>
#1 4 82.63636 4.070909
#2 6 122.28571 3.585714
#3 8 209.21429 3.229286
# no grouping columns
mtcars %>% summarise_at(3:4, mean)
# disp hp
#1 230.7219 146.6875
# actual third & fourth columns
names(mtcars)[3:4]
#[1] "disp" "hp"
packageVersion("dplyr")
#[1] ‘0.7.2’
Notice how the summarised columns change depending on grouping and position of the grouping column.
Is this the same on other platforms? Is it a bug or a feature?
With version 0.7.5, this behavior can no longer be reproduced:
library(dplyr)
mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
# # A tibble: 3 x 3
# gear disp hp
# <dbl> <dbl> <dbl>
# 1 3 326. 176.
# 2 4 123. 89.5
# 3 5 202. 196.
mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
# # A tibble: 3 x 3
# cyl disp hp
# <dbl> <dbl> <dbl>
# 1 4 105. 82.6
# 2 6 183. 122.
# 3 8 353. 209.
@docendodiscimus, thanks for pointing this out: even if this behavior was intentional, the documentation doesn't explain it explicitly, and in my case it could be a source of errors. Actually, this problem was solved before answering on the other question, and my comment above does it properly with the same logic.
For now, a possible solution is to provide names instead of indexes. You can still use indexes by adding a few characters, .vars = names(.)[3:4] (or colnames(.)[3:4]), as below:
mtcars %>%
group_by(cyl) %>%
summarise_at( .vars = colnames(.)[3:4] , mean)
mtcars %>%
group_by(cyl) %>%
summarise_at( .vars = names(.)[3:4] , mean)
## A tibble: 3 x 3
# cyl disp hp
# <dbl> <dbl> <dbl>
#1 4 105.1364 82.63636
#2 6 183.3143 122.28571
#3 8 353.1000 209.21429
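
In current dplyr (1.0.0 and later) the scoped verbs such as summarise_at() are superseded by across(). A sketch of the same name-based approach, with the positions resolved against the full data frame before grouping (the variable name vars is just for illustration):

```r
library(dplyr)

# Look up the names by position in the ungrouped data, so the
# grouping column cannot shift the indices
vars <- names(mtcars)[3:4]   # the third and fourth columns: disp and hp

res <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(all_of(vars), mean))

res
```

Because the columns are selected by name, the result does not depend on where the grouping column sits in the data frame.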
