How to count rows by group with n() inside dplyr::across()? - r

In previous versions of dplyr, if I wanted to get row counts in addition to other summary values using summarise(), I could do something like
library(tidyverse)
df <- tibble(
group = c("A", "A", "B", "B", "C"),
value = c(1, 2, 3, 4, 5)
)
df %>%
group_by(group) %>%
summarise(total = sum(value), count = n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
group total count
<chr> <dbl> <int>
1 A 3 2
2 B 7 2
3 C 5 1
My instinct to get the same output using the new across() function would be
df %>%
group_by(group) %>%
summarise(across(value, list(sum = sum, count = n)))
Error: Problem with `summarise()` input `..1`.
x unused argument (col)
ℹ Input `..1` is `across(value, list(sum = sum, count = n))`.
ℹ The error occurred in group 1: group = "A".
The issue is specific to n(); calling sum() on its own works as expected:
df %>%
group_by(group) %>%
summarise(across(value, list(sum = sum)))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
group value_sum
<chr> <dbl>
1 A 3
2 B 7
3 C 5
I've tried out various syntactic variations (using lambdas, experimenting with cur_group(), etc.), to no avail. How would I get the desired result within across()?

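We can use a lambda function for n(), while sum can be passed by name as long as no other arguments need to be specified. across() calls each function with the current column as its first argument, and n() takes no arguments, hence the "unused argument" error.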
library(dplyr)
df %>%
group_by(group) %>%
summarise(across(value, list(sum = sum, count = ~ n())), .groups = 'drop')
-output
# A tibble: 3 x 3
# group value_sum value_count
# <chr> <dbl> <int>
#1 A 3 2
#2 B 7 2
#3 C 5 1
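Since the row count does not depend on any particular column, another option is to keep n() outside across(). The following is a minimal sketch under that assumption, reusing the df from the question; the column is then named count rather than value_count.

```r
library(dplyr)

df <- tibble(
  group = c("A", "A", "B", "B", "C"),
  value = c(1, 2, 3, 4, 5)
)

# across() handles the per-column summaries; n() sits next to it
# because it needs no column as input.
res <- df %>%
  group_by(group) %>%
  summarise(
    across(value, list(sum = sum)),
    count = n(),
    .groups = "drop"
  )
res
```

This keeps across() for functions that really take a column and avoids wrapping n() in a lambda.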

Related

When I use diff(), it breaks summarise()

One of the columns I want to have after using summarise() is the difference between the (two) values. Each group is ALWAYS going to have two or fewer rows, in my case. The function I found online was diff(). However, I ran into a problem.
Look at this example:
df <- data.frame(value = runif(198),
id = c(
sample(1:100, 99),
sample(1:100, 99))
)
find <- df %>%
group_by(id) %>%
summarise(count = n()) %>%
filter(count != 2)
find
In this case, since I'm not using diff(), I get this:
> find
# A tibble: 2 x 2
id count
<int> <int>
1 14 1
2 39 1
It works fine. Now, if I include diff():
> find <- df %>%
+ group_by(id) %>%
+ summarise(diference = diff(value), count = n()) %>%
+ filter(count != 2)
`summarise()` regrouping output by 'id' (override with `.groups` argument)
> find
# A tibble: 0 x 3
# Groups: id [0]
# … with 3 variables: id <int>, diference <dbl>, count <int>
It comes up with nothing. If I don't filter (it was a relatively short data frame, so I went one by one), I see those rows disappear. In a shorter example, it would be:
> df <- data.frame(value = runif(10),
+ id = c(
+ sample(1:6, 5),
+ sample(1:6, 5))
+ )
> find <- df %>%
+ group_by(id) %>%
+ summarise(diference = diff(value), count = n())
`summarise()` regrouping output by 'id' (override with `.groups` argument)
> find
# A tibble: 4 x 3
# Groups: id [4]
id diference count
<int> <dbl> <int>
1 2 -0.309 2
2 3 0.474 2
3 4 -0.148 2
4 6 0.291 2
As you can see, the rows with id 1 and 5 disappeared. I believe applying diff() causes it, since without it, that doesn't happen:
> find <- df %>%
+ group_by(id) %>%
+ summarise(count = n())
`summarise()` ungrouping output (override with `.groups` argument)
> find
# A tibble: 6 x 2
id count
<int> <int>
1 1 1
2 2 2
3 3 2
4 4 2
5 5 1
6 6 2
That was the exact same data.
However, if I do it manually, diff() gives me an output, though in a slightly different way:
> diff(5)
numeric(0)
> diff(c(5, 4))
[1] -1
My question, then, is whether or not there is a better function to do this, or just some way for me to get the output without it erasing the one-item groups, like this:
id count diference
1 1 1 1
2 58 1 58
I know the difference will be the same as the id, but the reason I'm interested in this is that it is just one of the conditions I will put in filter(). It will be: filter(diference != 0 | count != 2) (once again, this isn't my original data).
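Maybe this is what the question wants. It uses ifelse() so that diff() only matters for groups with two rows; groups with a single row keep value unchanged. Note that count is a single number per group, so ifelse() returns only the first element of c(0, diff(value)), which is 0. Two-row groups therefore get diference = 0, and only the one-row groups survive the filter.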
library(dplyr)
set.seed(2020)
df <- data.frame(value = runif(10),
id = c(
sample(1:6, 5),
sample(1:6, 5))
)
find <- df %>%
group_by(id) %>%
summarise(count = n(),
diference = ifelse(count > 1, c(0, diff(value)), value),
.groups = 'drop') %>%
filter(count != 2 | diference != 0)
find
## A tibble: 2 x 3
# id count diference
# <int> <int> <dbl>
#1 1 1 0.647
#2 6 1 0.0674
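If the actual difference is needed for the two-row groups, a plain if/else inside summarise() is one way to get it. This is only a sketch: the small data set is made up so the result is easy to check, and returning NA for one-row groups is an assumption (the question does not fix what those rows should contain).

```r
library(dplyr)

# Hypothetical data: id 1 has one row, id 2 has two equal values,
# id 3 has two different values.
df <- data.frame(
  id    = c(1, 2, 2, 3, 3),
  value = c(10, 4, 4, 7, 9)
)

res <- df %>%
  group_by(id) %>%
  summarise(
    count = n(),
    # if/else evaluates a single branch, so diff() is only called when it
    # returns exactly one value (groups with two rows).
    diference = if (count == 2) diff(value) else NA_real_,
    .groups = "drop"
  ) %>%
  # TRUE | NA is TRUE, so one-row groups are kept despite the NA.
  filter(count != 2 | diference != 0)
res
```

Here id 2 is dropped (two rows, difference 0), while id 1 (one row) and id 3 (difference 2) are kept.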

How to get grouped Sum with nest in one step

I can only get a grouped sum with nest() in two steps, not in a single one. How can I use map() to loop over the data column in the output of nest()? Please also suggest a way to include the output column in the existing data frame.
suppressWarnings(library(tidyverse))
tmp_df <-
data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5))
tmp_df1 = tmp_df %>% group_by(group) %>% nest() #step1
map(tmp_df1$data, sum) #step 2
#> [[1]]
#> [1] -5
#>
#> [[2]]
#> [1] 5
I know how to get the sum using group_by.
suppressWarnings(library(tidyverse))
tmp_df <-
data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5))
tmp_df %>%
group_by(group) %>%
summarise(sum = sum(b))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> group sum
#> <int> <dbl>
#> 1 1 5
#> 2 2 -5
Created on 2020-08-04 by the reprex package (v0.3.0)
We can use c_across
library(dplyr)
tmp_df %>%
nest_by(group) %>%
mutate(sum = sum(c_across(data)))
-output
# A tibble: 2 x 3
# Rowwise: group
# group data sum
# <int> <list<tbl_df[,1]>> <dbl>
#1 1 [5 × 1] 5
#2 2 [5 × 1] -5
Or just
tmp_df %>%
nest_by(group) %>%
mutate(sum = sum(data))
If you want to use nest(), you can try map_dbl():
library(tidyverse)
tmp_df %>%
group_by(group) %>%
nest() %>%
mutate(sum = map_dbl(data, ~.x %>% sum))
# group data sum
# <int> <list> <dbl>
#1 2 <tibble [5 × 1]> -5
#2 1 <tibble [5 × 1]> 5
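For the second part of the question (adding the sum to the existing data frame), nesting is not strictly needed. A minimal sketch, assuming a per-row copy of the group sum is what is wanted:

```r
library(dplyr)

tmp_df <-
  data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5))

# mutate() on a grouped data frame computes sum(b) per group and
# repeats it on every row of that group, keeping all original rows.
res <- tmp_df %>%
  group_by(group) %>%
  mutate(sum = sum(b)) %>%
  ungroup()
res
```

The result has the same 10 rows as tmp_df plus a sum column.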

R: How to summarize and group by variables as column names

I have a wide data frame with about 200 columns and want to summarize it over various columns. I cannot figure out the syntax for this. I think it should work with .data$ and .env$, but I don't get it. Here's an example:
> library(dplyr)
> df = data.frame('A'= c('X','X','X','Y','Y'), 'B'= 1:5, 'C' = 6:10)
> df
A B C
1 X 1 6
2 X 2 7
3 X 3 8
4 Y 4 9
5 Y 5 10
> df %>% group_by(A) %>% summarise(sum(B), sum(C))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
A `sum(B)` `sum(C)`
<chr> <int> <int>
1 X 6 21
2 Y 9 19
But I want to be able to do something like this:
columns_to_sum = c('B','C')
columns_to_group = c('A')
df %>% group_by(columns_to_group) %>% summarise(sum(columns_to_sum))
We can use across(), available since dplyr 1.0.0, together with all_of() to select columns from character vectors
library(dplyr)
df %>%
group_by(across(all_of(columns_to_group))) %>%
summarise(across(all_of(columns_to_sum), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
# A tibble: 2 x 3
# A B C
# <chr> <int> <int>
#1 X 6 21
#2 Y 9 19
In earlier versions of dplyr, we could use group_by_at along with summarise_at
df %>%
group_by_at(columns_to_group) %>%
summarise_at(vars(columns_to_sum), sum, na.rm = TRUE)
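With around 200 columns, the across() pattern may be easier to reuse when wrapped in a small function. The helper below, sum_by(), is a hypothetical name introduced only for illustration; it is a sketch that assumes both arguments are character vectors of existing column names.

```r
library(dplyr)

# Hypothetical helper: group by the columns named in group_cols and
# sum the columns named in sum_cols.
sum_by <- function(data, group_cols, sum_cols) {
  data %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(
      across(all_of(sum_cols), ~ sum(.x, na.rm = TRUE)),
      .groups = "drop"
    )
}

df <- data.frame(A = c("X", "X", "X", "Y", "Y"), B = 1:5, C = 6:10)
res <- sum_by(df, group_cols = "A", sum_cols = c("B", "C"))
res
```

all_of() raises an error if a requested column is missing, which helps catch misspelled names early.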

How to summarise with sum dependent on another column - using dplyr

I am looking to perform a group by on id, code1 and then summarise. I want the summarise to do several conditional sums i.e. sum of the count column when code2 == "B". I know how to do this by creating an intermediary binary column but I was wondering if there is quicker method where this can all be performed in the summarise statement.
Here is some test data:
id <- c(1,1,1)
code1 <- c("M", "M", "M")
code2 <- c("B", "B", "U")
code3 <- c("H", "N", "N")
count <- c(15, 2, 1)
x <- data.frame(id, code1, code2, code3, count)
Desired output:
id | code1 | Total | B_count | U_count | H_count | N_count
1  | M     | 18    | 17      | 1       | 15      | 3
We can use the conditions inside the summarise call:
library(dplyr)
x %>%
group_by(id, code1) %>%
summarise(total = sum(count),
B_count = sum(count[code2 == "B"]),
U_count = sum(count[code2 == "U"]),
H_count = sum(count[code3 == "H"]),
N_count = sum(count[code3 == "N"]))
`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: id [1]
id code1 total B_count U_count H_count N_count
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 M 18 17 1 15 3
This solution is very complicated but it gets the job done.
library(dplyr)
library(tidyr)
x %>%
pivot_longer(
cols = matches('code[2-9]'),
names_to = 'vars',
values_to = 'code'
) %>%
dplyr::select(-vars) %>%
group_by(id, code1, code) %>%
summarise(count = sum(count), .groups = "rowwise") %>%
pivot_wider(
id_cols = c(id, code1),
names_from = code,
values_from = count
) %>%
left_join(
x %>%
group_by(id, code1) %>%
summarise(Total = sum(count), .groups = "rowwise"),
by = c("id", "code1")
) %>%
select(id, code1, Total, everything())
## A tibble: 1 x 7
# id code1 Total B H N U
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 M 18 17 15 3 1

dplyr number of rows across groups after filtering

I want the count and proportion (of all elements) of each group in a data frame (after filtering). This code produces the desired output:
library(dplyr)
df <- tibble(id = sample(letters[1:3], 100, replace = TRUE),
value = rnorm(100))
summary <- filter(df, value > 0) %>%
group_by(id) %>%
summarize(count = n()) %>%
ungroup() %>%
mutate(proportion = count / sum(count))
> summary
# A tibble: 3 x 3
id count proportion
<chr> <int> <dbl>
1 a 17 0.3695652
2 b 13 0.2826087
3 c 16 0.3478261
Is there an elegant way to avoid the ungroup() and mutate() steps? Something like:
summary <- filter(df, value > 0) %>%
group_by(id) %>%
summarize(count = n(),
proportion = n() / [?TOTAL_ROWS()?])
I couldn't find such a function in the documentation, but I must be missing something obvious. Thanks!
You can use nrow() on ., which with the %>% pipe refers to the entire data frame piped into summarise() (here, the filtered and grouped data):
df %>%
filter(value > 0) %>%
group_by(id) %>%
summarise(count = n(), proportion = count / nrow(.))
# A tibble: 3 x 3
# id count proportion
# <chr> <int> <dbl>
#1 a 14 0.2592593
#2 b 22 0.4074074
#3 c 18 0.3333333
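An alternative that does not rely on the . placeholder is count() followed by mutate(). This is a sketch rather than a drop-in replacement: it still takes two steps, and set.seed(1) is added here only to make the random data reproducible.

```r
library(dplyr)

set.seed(1)
df <- tibble(id = sample(letters[1:3], 100, replace = TRUE),
             value = rnorm(100))

# count() groups, tallies and ungroups in one call; the proportion is
# then each count divided by the total number of rows kept by filter().
res <- df %>%
  filter(value > 0) %>%
  count(id, name = "count") %>%
  mutate(proportion = count / sum(count))
res
```

Because count() returns an ungrouped result, sum(count) is the total over all groups and the proportions add up to 1.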
