When I use diff(), it breaks summarise() - r

One of the columns I want after using summarise() is the difference between the (two) values. Each group is ALWAYS going to have two or fewer rows, in my case. The function I found online was diff(). However, I ran into a problem.
Look at this example:
df <- data.frame(value = runif(198),
id = c(
sample(1:100, 99),
sample(1:100, 99))
)
find <- df %>%
group_by(id) %>%
summarise(count = n()) %>%
filter(count != 2)
find
In this case, since I'm not using diff(), I get this:
> find
# A tibble: 2 x 2
id count
<int> <int>
1 14 1
2 39 1
It works fine. Now, if I include diff():
> find <- df %>%
+ group_by(id) %>%
+ summarise(diference = diff(value), count = n()) %>%
+ filter(count != 2)
`summarise()` regrouping output by 'id' (override with `.groups` argument)
> find
# A tibble: 0 x 3
# Groups: id [0]
# … with 3 variables: id <int>, diference <dbl>, count <int>
It comes up with nothing. If I don't filter (it was a relatively short data frame, so I went one by one), I see those rows disappear. In a shorter example, it would be:
> df <- data.frame(value = runif(10),
+ id = c(
+ sample(1:6, 5),
+ sample(1:6, 5))
+ )
> find <- df %>%
+ group_by(id) %>%
+ summarise(diference = diff(value), count = n())
`summarise()` regrouping output by 'id' (override with `.groups` argument)
> find
# A tibble: 4 x 3
# Groups: id [4]
id diference count
<int> <dbl> <int>
1 2 -0.309 2
2 3 0.474 2
3 4 -0.148 2
4 6 0.291 2
As you can see, the rows for ids 1 and 5 disappeared. I believe applying diff() causes it, since without it that doesn't happen:
> find <- df %>%
+ group_by(id) %>%
+ summarise(count = n())
`summarise()` ungrouping output (override with `.groups` argument)
> find
# A tibble: 6 x 2
id count
<int> <int>
1 1 1
2 2 2
3 3 2
4 4 2
5 5 1
6 6 2
That was the exact same data.
However, if I do it manually, diff() gives me an output, though in a slightly different way:
> diff(5)
numeric(0)
> diff(c(5, 4))
[1] -1
My question, then, is whether or not there is a better function to do this, or just some way for me to get the output without it erasing the one-item groups, like this:
id count diference
1 1 1 1
2 58 1 58
I know the difference will be the same as the id, but the reason I'm interested in this is that it is just one of the conditions I will put in filter(). It will be: filter(diference != 0 | count != 2) (once again, this isn't my original data).

Maybe this is what the question wants. It uses ifelse to compute the difference between values only if the groups have 2 rows, else it returns value unchanged.
library(dplyr)
set.seed(2020)
df <- data.frame(value = runif(10),
id = c(
sample(1:6, 5),
sample(1:6, 5))
)
find <- df %>%
group_by(id) %>%
summarise(count = n(),
diference = ifelse(count > 1, c(0, diff(value)), value),
.groups = 'drop') %>%
filter(count != 2 | diference != 0)
find
## A tibble: 2 x 3
# id count diference
# <int> <int> <dbl>
#1 1 1 0.647
#2 6 1 0.0674
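As a note on why the rows vanish: diff() on a single value returns numeric(0), and in the dplyr version shown in the question summarise() produces no row for a group whose result has length zero. Also, because count > 1 is a single TRUE or FALSE per group, ifelse() returns only the first element of c(0, diff(value)), which is 0, so two-row groups never show their real difference. The following is a minimal sketch of one alternative using a plain if/else. The small data frame is made up for illustration, and the sketch assumes no group has more than two rows.

```r
library(dplyr)

# Made-up data: id 1 has a single row, ids 2 and 3 have two rows each.
df_small <- data.frame(id = c(1, 2, 2, 3, 3),
                       value = c(5, 4, 7, 2, 2))

res <- df_small %>%
  group_by(id) %>%
  summarise(count = n(),
            # scalar if/else: diff() only runs when the group has two rows,
            # otherwise the single value is returned unchanged
            diference = if (n() > 1) diff(value) else value,
            .groups = "drop")

# Keep one-row groups and two-row groups whose values differ.
kept <- res %>%
  filter(count != 2 | diference != 0)
```

With this data, the filter keeps ids 1 and 2 and drops id 3, whose two values are equal.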

Related

Iterating name of a field with dplyr::summarise function

This is my first time here, so I'll try to explain my problem as clearly as possible.
I'm working on erosion data for farms, stored as pixels (e.g. 1 farm = 10 pixels, so 10 rows in my df). I have 4 df in a list, and I would like to calculate the mean erosion for each farm. I thought about looping over the name of the erosion field, but my df don't all use the same name for it (either ERO13 or ERO17). I don't want to rely on the position of the field, because it could change between the df; I only want to use the name, which varies.
Here's an example:
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
cur_df <- df
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}
I tried with
for (df in lst_df){
cur_df <- df
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(cur_camp = mean(cur_camp))
}
but this doesn't work: cur_camp is a character string, not the column that the string names, and it also relies on the column's position.
How can I build the current_name_of_erosion_field here?
We may convert it to a symbol and evaluate it (!!), or pass the string to across(). Also, as we are using a for loop, make sure to create a list to store the output. To use the stored name on the left-hand side of the assignment, use := with !!
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(!!cur_camp := mean(!! sym(cur_camp)))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
Or we may use across()
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(across(all_of(cur_camp), mean))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
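Since the question asks not to depend on the column's position, here is a minimal sketch that looks the erosion column up by name instead of using names(cur_df)[2]. It assumes each data frame has exactly one column whose name starts with "ERO" (that pattern is an assumption, not something stated in the question), and it uses lapply() with the .data pronoun rather than a for loop.

```r
library(dplyr)

df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))
lst_df <- list(df1, df2)

out <- lapply(lst_df, function(cur_df) {
  # Find the erosion column by name pattern (assumed: one "ERO..." column).
  cur_camp <- grep("^ERO", names(cur_df), value = TRUE)
  cur_df %>%
    group_by(ID) %>%
    # "{cur_camp}" := names the output column; .data[[cur_camp]] reads the input
    summarise("{cur_camp}" := mean(.data[[cur_camp]]), .groups = "drop")
})
```

lapply() returns the list directly, so there is no need to pre-allocate out.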
A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.
library(tidyverse)
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
bind_rows(df1, df2) %>%
pivot_longer(starts_with('ERO'),
names_to = 'ERO',
values_drop_na = TRUE) %>%
group_by(ID, ERO) %>%
summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups: ID [4]
#> ID ERO value
#> <dbl> <chr> <dbl>
#> 1 1 ERO13 3
#> 2 2 ERO13 6
#> 3 4 ERO17 4.5
#> 4 6 ERO17 12
Created on 2022-01-14 by the reprex package (v2.0.0)

How to count rows by group with n() inside dplyr::across()?

In previous versions of dplyr, if I wanted to get row counts in addition to other summary values using summarise(), I could do something like
library(tidyverse)
df <- tibble(
group = c("A", "A", "B", "B", "C"),
value = c(1, 2, 3, 4, 5)
)
df %>%
group_by(group) %>%
summarise(total = sum(value), count = n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
group total count
<chr> <dbl> <int>
1 A 3 2
2 B 7 2
3 C 5 1
My instinct to get the same output using the new across() function would be
df %>%
group_by(group) %>%
summarise(across(value, list(sum = sum, count = n)))
Error: Problem with `summarise()` input `..1`.
x unused argument (col)
ℹ Input `..1` is `across(value, list(sum = sum, count = n))`.
ℹ The error occurred in group 1: group = "A".
The issue is specific to the n() function, just calling sum() works as expected:
df %>%
group_by(group) %>%
summarise(across(value, list(sum = sum)))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
group value_sum
<chr> <dbl>
1 A 3
2 B 7
3 C 5
I've tried out various syntactic variations (using lambdas, experimenting with cur_group(), etc.), to no avail. How would I get the desired result within across()?
We can use a lambda function for n(), while sum can be passed by name when there are no other arguments to specify
library(dplyr)
df %>%
group_by(group) %>%
summarise(across(value, list(sum = sum, count = ~ n())), .groups = 'drop')
-output
# A tibble: 3 x 3
# group value_sum value_count
# <chr> <dbl> <int>
#1 A 3 2
#2 B 7 2
#3 C 5 1
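The original call fails because across() calls each function with the column as its first argument, and n() accepts no arguments. If the count does not need the value_ prefix, another option is to leave n() outside across(). A minimal sketch, reusing the data from the question:

```r
library(dplyr)

df_counts <- tibble(
  group = c("A", "A", "B", "B", "C"),
  value = c(1, 2, 3, 4, 5)
)

res_counts <- df_counts %>%
  group_by(group) %>%
  summarise(across(value, list(sum = sum)),  # per-column summaries
            count = n(),                     # per-group row count, computed once
            .groups = "drop")
```

This gives the columns group, value_sum and count, and the count is not repeated for every column selected in across().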

applying function to each group using dplyr and return specified dataframe

I used group_map for the first time, and I think I am using it correctly. This is my code:
library(REAT)
df <- data.frame(value = c(1,1,1, 1,0.5,0.1, 0,0,0,1), group = c(1,1,1, 2,2,2, 3,3,3,3))
haves <- df %>%
group_by(group) %>%
group_map(~gini(.x$value, coefnorm = TRUE))
The thing is that haves is a list rather than a data frame. What would I have to do to obtain this df?
wants <- data.frame(group = c(1,2,3), gini = c(0,0.5625,1))
group gini
1 0.0000
2 0.5625
3 1.0000
Thanks!
You can use dplyr::summarize:
df %>%
group_by(group) %>%
summarize(gini = gini(value, coefnorm = TRUE))
#> # A tibble: 3 x 2
#> group gini
#> <dbl> <dbl>
#> 1 1 0
#> 2 2 0.562
#> 3 3 1
According to the documentation, group_map always produces a list. group_modify is an alternative that produces a tibble if the function does, but gini just outputs a vector. So, you could do something like this...
df %>%
group_by(group) %>%
group_modify(~tibble(gini = gini(.x$value, coefnorm = TRUE)))
# A tibble: 3 x 2
# Groups: group [3]
group gini
<dbl> <dbl>
1 1 0
2 2 0.562
3 3 1
Using data.table
library(data.table)
setDT(df)[, .(gini = gini(value, coefnorm = TRUE)), group]
For grouped datasets, we can use the .data pronoun if we don't want to refer to columns by bare (unquoted) names
library(dplyr)
df %>%
group_by(group) %>%
summarize(gini = gini(.data$value, coefnorm = TRUE))
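If the list returned by group_map() is already at hand, it can also be turned into a data frame by pairing it with group_keys(). The sketch below uses value_range(), a hypothetical stand-in for gini() so that it runs without the REAT package; any function that returns a single number per group would fit the same pattern.

```r
library(dplyr)

df_g <- data.frame(value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
                   group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# Hypothetical stand-in for REAT::gini(): returns one number per group.
value_range <- function(x) max(x) - min(x)

grouped <- df_g %>% group_by(group)

# group_map() returns a plain list with one element per group,
# in the same order as the rows of group_keys().
stat_list <- group_map(grouped, ~ value_range(.x$value))

haves_df <- group_keys(grouped) %>%
  mutate(stat = unlist(stat_list))
```

The summarize() approach is shorter; this form is mainly useful when the per-group results have already been computed as a list.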

R: How to summarize and group by variables as column names

I have a wide dataframe with about 200 columns and want to summarize it over various columns. I cannot figure out the syntax for this. I think it should work with .data$ and .env$, but I don't get it. Here's an example:
> library(dplyr)
> df = data.frame('A'= c('X','X','X','Y','Y'), 'B'= 1:5, 'C' = 6:10)
> df
A B C
1 X 1 6
2 X 2 7
3 X 3 8
4 Y 4 9
5 Y 5 10
> df %>% group_by(A) %>% summarise(sum(B), sum(C))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
A `sum(B)` `sum(C)`
<chr> <int> <int>
1 X 6 21
2 Y 9 19
But I want to be able to do something like this:
columns_to_sum = c('B','C')
columns_to_group = c('A')
df %>% group_by(columns_to_group) %>% summarise(sum(columns_to_sum))
We can use across from the new version of dplyr
library(dplyr)
df %>%
group_by(across(all_of(columns_to_group))) %>%
summarise(across(all_of(columns_to_sum), sum, na.rm = TRUE), .groups = 'drop')
# A tibble: 2 x 3
# A B C
# <chr> <int> <int>
#1 X 6 21
#2 Y 9 19
In the previous version, we could use group_by_at along with summarise_at
df %>%
group_by_at(columns_to_group) %>%
summarise_at(vars(columns_to_sum), sum, na.rm = TRUE)
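As far as I know, passing extra arguments such as na.rm through across() is deprecated in recent dplyr releases (1.1.0 and later) in favour of a lambda. A minimal sketch of the same summary in that style, assuming both character vectors hold existing column names:

```r
library(dplyr)

df_wide <- data.frame(A = c("X", "X", "X", "Y", "Y"), B = 1:5, C = 6:10)

columns_to_sum <- c("B", "C")
columns_to_group <- c("A")

res_wide <- df_wide %>%
  # all_of() makes it explicit that the names come from a character vector
  group_by(across(all_of(columns_to_group))) %>%
  # the lambda carries na.rm instead of passing it through across()
  summarise(across(all_of(columns_to_sum), ~ sum(.x, na.rm = TRUE)),
            .groups = "drop")
```

all_of() also raises an error if a listed column is missing, which is helpful with a wide data frame of around 200 columns.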

dplyr number of rows across groups after filtering

I want the count and proportion (as a share of all elements) of each group in a data frame (after filtering). This code produces the desired output:
library(dplyr)
df <- tibble(id = sample(letters[1:3], 100, replace = TRUE),
value = rnorm(100))
summary <- filter(df, value > 0) %>%
group_by(id) %>%
summarize(count = n()) %>%
ungroup() %>%
mutate(proportion = count / sum(count))
> summary
# A tibble: 3 x 3
id count proportion
<chr> <int> <dbl>
1 a 17 0.3695652
2 b 13 0.2826087
3 c 16 0.3478261
Is there an elegant solution that avoids the ungroup() and mutate() steps? Something like:
summary <- filter(df, value > 0) %>%
group_by(id) %>%
summarize(count = n(),
proportion = n() / [?TOTAL_ROWS()?])
I couldn't find such a function in the documentation, but I must be missing something obvious. Thanks!
You can use nrow on . which refers to the entire data frame piped in:
df %>%
filter(value > 0) %>%
group_by(id) %>%
summarise(count = n(), proportion = count / nrow(.))
# A tibble: 3 x 3
# id count proportion
# <chr> <int> <dbl>
#1 a 14 0.2592593
#2 b 22 0.4074074
#3 c 18 0.3333333
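Note that the . placeholder belongs to the magrittr pipe (%>%), so nrow(.) is not available with the native |> pipe. A minimal sketch of an alternative that does not rely on it uses count() followed by mutate(); the small data frame here is made up so the result is easy to check by hand.

```r
library(dplyr)

# Made-up data: after filtering on value > 0, a has 1 row, b has 1, c has 2.
df_prop <- tibble(id = c("a", "a", "b", "c", "c", "c"),
                  value = c(1, -1, 2, 3, 4, -5))

summary_tbl <- df_prop %>%
  filter(value > 0) %>%
  count(id, name = "count") %>%               # ungrouped per-id counts
  mutate(proportion = count / sum(count))     # share of all remaining rows
```

count() returns an ungrouped result when the input is ungrouped, so sum(count) is the total number of rows left after filtering.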

Resources