dplyr::summarise with filtering inside - r

Inside of dplyr::summarise, how can I apply filters based on different rows than the one I'm summarising?
Example:
t = data.frame(
  x = c(1,1,1,1,2,2,2,2,3,3, 3, 3),
  y = c(1,2,3,4,5,6,7,8,9,10,11,12),
  z = c(1,2,1,2,1,2,1,2,1,2, 1, 2)
)
t %>%
  dplyr::group_by(x) %>%
  dplyr::summarise(
    mall = mean(y), # this should include all rows in each group
    ma = mean(y), # this should only include rows where z == 1
    mb = mean(y) # this should only include rows where z == 2
  )
So, the problem here is to apply a summary function to one column, while filtering based on another, all within summarise.
One idea was double-grouping, i.e. applying group_by on both x and z, but I don't want all summary columns to be based on double-grouping; some (like mall in the example above) should be based on single-grouping only.

One quick option would be to use ifelse to keep the rows you need, set the rest to NA, and pass na.rm = TRUE so that the missing values are ignored, as in the example below.
t %>%
  dplyr::group_by(x) %>%
  dplyr::summarise(
    mall = mean(y), # this should include all rows in each group
    ma = mean(ifelse(z == 1, y, NA), na.rm = TRUE), # this should only include rows where z == 1
    mb = mean(ifelse(z == 2, y, NA), na.rm = TRUE) # this should only include rows where z == 2
  )
# A tibble: 3 x 4
x mall ma mb
<dbl> <dbl> <dbl> <dbl>
1 1 2.5 2 3
2 2 6.5 6 7
3 3 10.5 10 11

While the answer by @Colin H is certainly the way to go for this specific example, a more flexible approach is to work within the subsets produced by the first grouping operation. This could be implemented with dplyr::group_split plus a subsequent purrr::map_dfr, but there is also dplyr::group_modify to do this in one step.
Note this relevant sentence from the documentation of dplyr::group_modify:
Use group_modify() when summarize() is too limited, in terms of what you need to do and return for each group.
So here is a solution for the example provided above:
t = data.frame(
x = c(1,1,1,1,2,2,2,2,3,3, 3, 3),
y = c(1,2,3,4,5,6,7,8,9,10,11,12),
z = c(1,2,1,2,1,2,1,2,1,2, 1, 2)
)
t %>%
  dplyr::group_by(x) %>%
  dplyr::group_modify(function(x, ...) {
    x %>% dplyr::mutate(
      mall = mean(y)
    ) %>%
      dplyr::group_by(z, mall) %>%
      dplyr::summarise(
        m = mean(y),
        .groups = "drop"
      )
  }) %>%
  dplyr::ungroup()
# A tibble: 6 x 4
x z mall m
<dbl> <dbl> <dbl> <dbl>
1 1 1 2.5 2
2 1 2 2.5 3
3 2 1 6.5 6
4 2 2 6.5 7
5 3 1 10.5 10
6 3 2 10.5 11
group_modify applies a function to each subset tibble after grouping by x. This function has two arguments:
The subset of the data for the group, exposed as .x.
The key, a tibble with exactly one row and columns for each grouping variable, exposed as .y.
Within our function here we use mutate to cover the requested mall-case first. We do not need any further grouping for that, because that is already covered by the wrapping group_modify. Then we apply another group_by + summarise to cover the different iterations of z. Note that this solution is independent of the number of cases in z we want to consider. While the two cases in this example can be easily handled manually, this might change if there are more.
If the wide output format with individual columns for the cases in z is required, then you can further modify the output of my code with tidyr::pivot_wider.
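The reshaping step mentioned above could look roughly like the following. This is a minimal sketch, not part of the original answer: the names long and wide and the "m_z" column prefix are chosen here purely for illustration, and tidyr is assumed to be installed.

```r
library(dplyr)
library(tidyr)

t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = 1:12,
  z = rep(c(1, 2), times = 6)
)

# Long result: one row per (x, z) combination, as in the answer above
long <- t %>%
  group_by(x) %>%
  group_modify(function(.x, .y) {
    .x %>%
      mutate(mall = mean(y)) %>%
      group_by(z, mall) %>%
      summarise(m = mean(y), .groups = "drop")
  }) %>%
  ungroup()

# Wide result: one column per value of z ("m_z" is an illustrative prefix)
wide <- long %>%
  pivot_wider(names_from = z, values_from = m, names_prefix = "m_z")

wide
```

Because the column names are derived from the values of z, this keeps working when z has more than two distinct values.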

Another option, perhaps a little more concise, is subsetting:
t %>%
group_by(x) %>%
summarise(mall = mean(y),
ma = mean(y[z == 1]),
mb = mean(y[z == 2]))
# A tibble: 3 x 4
x mall ma mb
* <dbl> <dbl> <dbl> <dbl>
1 1 2.5 2 3
2 2 6.5 6 7
3 3 10.5 10 11

Here is another generic way (just like group_modify) to perform custom filtering on a group's data while summarising. It uses one of dplyr's context-dependent expressions, cur_data(), which makes the current group's data available inside dplyr verbs such as mutate and summarise:
t %>%
  dplyr::group_by(x) %>%
  dplyr::summarize(
    mall = mean(y),
    ma = mean(dplyr::cur_data() %>% as.data.frame() %>% dplyr::filter(z == 1) %>% dplyr::pull(y)),
    mb = mean(dplyr::cur_data() %>% as.data.frame() %>% dplyr::filter(z == 2) %>% dplyr::pull(y))
  )
The benefit of using cur_data() is that you can perform any complex filtering or munging before returning the final summary. For more information refer to: https://dplyr.tidyverse.org/reference/context.html
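In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(). The following is a rough equivalent of the code above, sketched under the assumption that dplyr >= 1.1.0 is installed; the result name res is only illustrative.

```r
library(dplyr)  # pick() assumes dplyr >= 1.1.0

t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = 1:12,
  z = rep(c(1, 2), times = 6)
)

res <- t %>%
  group_by(x) %>%
  summarise(
    mall = mean(y),
    # pick(everything()) returns the current group's non-grouping columns as a tibble
    ma = mean(filter(pick(everything()), z == 1)$y),
    mb = mean(filter(pick(everything()), z == 2)$y)
  )

res
```

As with cur_data(), any filtering or reshaping can be applied to the picked tibble before the final summary value is computed.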

Related

Elegant Way to Get the Minimum of More than One Column of a Data Frame and Their Corresponding Values in Another Column

Given a data frame df, I want a new data frame that will keep the minimum values of columns Y and Z in a column and their corresponding values on the X column in another column using R.
df <- read.table(text =
"X Y Z
1 2 3 1.4
2 4 5 1.7
3 6 7 1.2
4 8 9 2.1
5 10 11 3.2",
header = TRUE)
Trial
Here is what I have tried using R which is labour intensive.
data.frame(
x_min = c(df[df$Y == min(df[,"Y"]), "X"], df[df$Z == min(df[,"Z"]), "X"]),
min_Y_Z = c(min(df[,"Y"]), min(df[,"Z"]))
)
I know that apply(df, 2, min) will only work if I am to find the minimum of each and every column in the data frame, so no need to look toward the apply() family function.
The Result
x_min min_Y_Z
1 2 3.0
2 6 1.2
What I Want
I want an R-elegant way to write the same solution. I will not mind using packages in R
One way to do this is to combine base R's subset() with mutate() from dplyr (loaded here via the tidyverse package):
library(tidyverse)
df_new <- df %>%
subset(Y == min(Y) | Z == min(Z)) %>%
mutate(min_Y_Z = c(min(Y), min(Z)))
This gives you this output:
X Y Z min_Y_Z
1 2 3 1.4 3.0
3 6 7 1.2 1.2
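If needed, removing the old 'Y' and 'Z' columns is pretty simple to do as well. Note that this relies on the row holding the minimum of Y coming before the row holding the minimum of Z; otherwise the values in min_Y_Z would be assigned to the wrong rows.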
In base R, you could use lapply. This version is similar to your current method, but does not rely on the explicit column names as much, since "Y" and "Z" are abstracted by lapply. This version also places the original column names in the rownames.
lst <- lapply(df[c('Y', 'Z')], function(i) {
  min_index <- which.min(i)
  return(data.frame(x_min = df$X[min_index], min_Y_Z = i[min_index]))
})
result <- do.call(rbind, lst)
x_min min_Y_Z
Y 2 3.0
Z 6 1.2
Or another tidyverse solution:
result <- df %>%
  summarize(across(c(Y, Z), which.min)) %>%
  pivot_longer(everything(), values_to = 'idx') %>%
  rowwise() %>%
  mutate(
    x_min = df[['X']][idx],
    min_Y_Z = df[[name]][idx]
  ) %>%
  select(-name, -idx)
x_min min_Y_Z
<int> <dbl>
1 2 3
2 6 1.2
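A further variant, offered as a sketch rather than as one of the original answers, reshapes the data to long form first and then keeps the smallest value per original column with slice_min(). The column name "column" is introduced here only for illustration, and dplyr >= 1.0.0 plus tidyr are assumed.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, written out directly
df <- data.frame(
  X = c(2, 4, 6, 8, 10),
  Y = c(3, 5, 7, 9, 11),
  Z = c(1.4, 1.7, 1.2, 2.1, 3.2)
)

result <- df %>%
  # One row per (X, original column) pair
  pivot_longer(c(Y, Z), names_to = "column", values_to = "min_Y_Z") %>%
  group_by(column) %>%
  # Keep only the row with the smallest value for each original column
  slice_min(min_Y_Z, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(column, x_min = X, min_Y_Z)

result
```

This does not depend on the order of the rows, and it extends to more columns by widening the selection passed to pivot_longer().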

summarise_all with additional parameter that is a vector

Say I have a data frame:
df <- data.frame(a = 1:10,
b = 1:10,
c = 1:10)
I'd like to apply several summary functions to each column, so I use dplyr::summarise_all
library(dplyr)
df %>% summarise_all(.funs = c(mean, sum))
# a_fn1 b_fn1 c_fn1 a_fn2 b_fn2 c_fn2
# 1 5.5 5.5 5.5 55 55 55
This works great! Now, say I have a function that takes an extra parameter. For example, this function calculates the number of elements in a column above a threshold. (Note: this is a toy example and not the real function.)
n_above_threshold <- function(x, threshold) sum(x > threshold)
So, the function works like this:
n_above_threshold(1:10, 5)
#[1] 5
I can apply it to all columns like before, but this time passing the additional parameter, like so:
df %>% summarise_all(.funs = c(mean, n_above_threshold), threshold = 5)
# a_fn1 b_fn1 c_fn1 a_fn2 b_fn2 c_fn2
# 1 5.5 5.5 5.5 5 5 5
But, say I have a vector of thresholds where each element corresponds to a column. Say, c(1, 5, 7) for my example above. Of course, I can't simply do this, as it doesn't make any sense:
df %>% summarise_all(.funs = c(mean, n_above_threshold), threshold = c(1, 5, 7))
If I was using base R, I might do this:
mapply(n_above_threshold, df, c(1, 5, 7))
# a b c
# 9 5 3
Is there a way of getting this result as part of a dplyr piped workflow like I was using for the simpler cases?
dplyr provides a bunch of context-dependent functions. One is cur_column(). You can use it in summarise to look up the threshold for a given column.
library("tidyverse")
df <- data.frame(
a = 1:10,
b = 1:10,
c = 1:10
)
n_above_threshold <- function(x, threshold) sum(x > threshold)
# Pair the parameters with the columns
thresholds <- c(1, 5, 7)
names(thresholds) <- colnames(df)
df %>%
  summarise(
    across(
      everything(),
      # Use `cur_column()` to access each column name in turn
      list(count = ~ n_above_threshold(.x, thresholds[cur_column()]),
           mean = mean)
    )
  )
#> a_count a_mean b_count b_mean c_count c_mean
#> 1 9 5.5 5 5.5 3 5.5
This returns NA silently if the current column name doesn't have a known threshold. This is something that you might or might not want to happen.
df %>%
  # Add extra column to show what happens if we don't know the threshold for a column
  mutate(
    x = 1:10
  ) %>%
  summarise(
    across(
      everything(),
      # Use `cur_column()` to access each column name in turn
      list(count = ~ n_above_threshold(.x, thresholds[cur_column()]),
           mean = mean)
    )
  )
#> a_count a_mean b_count b_mean c_count c_mean x_count x_mean
#> 1 9 5.5 5 5.5 3 5.5 NA 5.5
Created on 2022-03-11 by the reprex package (v2.0.1)
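If the silent NA is not wanted, one possible guard is to look the threshold up through a small helper that stops when a column has no entry. The sketch below is only an illustration: get_threshold() is a hypothetical helper written for this example, not a dplyr function.

```r
library(dplyr)

df <- data.frame(a = 1:10, b = 1:10, c = 1:10, x = 1:10)
n_above_threshold <- function(x, threshold) sum(x > threshold)
thresholds <- c(a = 1, b = 5, c = 7)

# Hypothetical helper: fail loudly instead of returning NA for unknown columns
get_threshold <- function(col) {
  if (!col %in% names(thresholds)) {
    stop("No threshold defined for column '", col, "'", call. = FALSE)
  }
  thresholds[[col]]
}

# Restrict the summary to the columns that actually have a threshold
counts <- df %>%
  summarise(
    across(
      all_of(names(thresholds)),
      ~ n_above_threshold(.x, get_threshold(cur_column()))
    )
  )

counts
```

Selecting with all_of(names(thresholds)) skips the extra column x entirely, whereas everything() would now raise an error for it instead of producing NA.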

Add summarize variable in multiple statements using dplyr?

In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
  summ <- data %>% summarize(n=n(), n_unique=n_distinct(val))
} else if (summarize_num) {
  summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
  summ <- data %>% summarize(n_unique=n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
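Another way to add the summary variables independently is to collect unevaluated expressions in a named list and splice them into a single summarise() call with !!!. This is a minimal sketch: build_summary() is a hypothetical helper written for this illustration, and it has only been reasoned through for a local data frame, not verified against a database backend with dbplyr.

```r
library(dplyr)

data <- data.frame(val = c(1, 2, 2))

# Hypothetical helper: each flag independently contributes one summary expression
build_summary <- function(data, summarize_num, summarize_num_distinct) {
  summaries <- list()
  if (summarize_num) summaries$n <- rlang::expr(n())
  if (summarize_num_distinct) summaries$n_unique <- rlang::expr(n_distinct(val))
  # Splice the named expressions into one summarise() call
  data %>% summarise(!!!summaries)
}

summ <- build_summary(data, summarize_num = TRUE, summarize_num_distinct = FALSE)
summ_both <- build_summary(data, summarize_num = TRUE, summarize_num_distinct = TRUE)

summ
summ_both
```

Each condition adds one line, so the number of clauses grows linearly rather than combinatorially with the number of conditions.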

Accessing grouped subset in dplyr

I have the feeling this was already asked several times, but I cannot make it run in my case. Don't know why.
I group_by my data frame and calculate a mean from values. Additionally, I marked a specific row and I want to calculate the ratio of my freshly calculated mean to the value of my highlighted row of the subset.
library(dplyr)
df <- data.frame(int=c(5:1,4:1),
highlight=c(T,F,F,F,F,F,T,F,F),
exp=c('a','a','a','a','a','b','b','b','b'))
df %>%
group_by(exp) %>%
summarise(mean=mean(int),
l1=nrow(.),
ratio_mean=.[.$highlight, 'int']/mean)
But for some reason, . is not the subset of group_by but the complete input. Am I missing something here?
My expected output would be
exp mean ratio_mean
<fct> <dbl> <dbl>
1 a 3 1.67
2 b 2.5 1.2
This works:
df %>%
group_by(exp) %>%
summarise(mean = mean(int),
l1 = n(),
ratio_mean = int[highlight] / mean)
But what's going wrong with your solution?
nrow(.) counts the number of rows of your whole input dataframe, whereas n() counts only the rows per group.
.[.$highlight, 'int']/mean here again uses the whole input dataframe and subsets it using the highlight column, but it gets divided by the correct group mean. Actually you are returning two values here, as two rows of your original df have highlight = TRUE. This causes a nasty NA column name.
To save it, we could use do() as suggested by @MikkoMarttila, but this gets a little bit clunky:
df %>%
group_by(exp) %>%
do(summarise(., mean = mean(.$int),
l1 = nrow(.),
ratio_mean = .$int[.$highlight] / mean))
Original output
df %>%
group_by(exp) %>%
summarise(mean=mean(int),
l1=nrow(.),
ratio_mean=.[.$highlight, 'int']/mean)
# A tibble: 2 x 4
# exp mean l1 ratio_mean$ NA
# <fct> <dbl> <int> <dbl> <dbl>
# 1 a 3 9 1.67 2
# 2 b 2.5 9 1 1.2
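The difference between n() and nrow(.) described above can be checked side by side. This is a small sketch using the same data; the column names rows_in_group and rows_in_dot are chosen here only for illustration.

```r
library(dplyr)

df <- data.frame(
  int = c(5:1, 4:1),
  highlight = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  exp = c("a", "a", "a", "a", "a", "b", "b", "b", "b")
)

res <- df %>%
  group_by(exp) %>%
  summarise(
    rows_in_group = n(),    # evaluated once per group
    rows_in_dot = nrow(.)   # `.` is the whole piped-in data frame
  )

res
```

Here rows_in_group differs per group, while rows_in_dot is the full row count of df in every row, which is why expressions built on `.` ignore the grouping.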

Why won't dplyr::summarise work with my custom function?

I would like use a custom function within dplyr's function summarise(), as follows:
library(dplyr)
# Define custom function for calculating standard error
se <- function(x) sd(x) / sqrt(length(x))
# Create a dummy data table with two groups
d <- tibble(gp = sample(c("A", "B"), 20, replace = T),
x = ifelse(gp == "A", rnorm(20), rnorm(20) + 1))
# Summarise data
d %>%
group_by(gp) %>%
summarise(x = mean(x),
se = se(x))
Why do I get NA values in the output rather than the correct values of standard error?
# A tibble: 2 × 3
gp x se
<chr> <dbl> <lgl>
1 A -0.4060173 NA
2 B 0.2999004 NA
I'm aware of some possible alternatives. For example, using the base package:
tapply(d$x, d$gp, se)
But I don't understand why the first version gives the result that it does.
summarize evaluates each expression in turn, so when your first line does
x = mean(x)
The x column (within each group) is replaced by a single value, mean(x). Your next line calls sd on that constant x, and the sd of a single value is NA.
As @joran says in the comments, if you just choose a different name for your mean column, everything will work.
d %>%
group_by(gp) %>%
summarise(avg = mean(x),
se = se(x))
# # A tibble: 2 × 3
# gp avg se
# <chr> <dbl> <dbl>
# 1 A -0.2879016 0.2264810
# 2 B 0.8804859 0.2625018
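The effect of sequential evaluation can also be seen by changing only the order of the two expressions. The sketch below uses its own fixed-size dummy data (an assumption made so the groups are never empty); the names good and bad are illustrative.

```r
library(dplyr)

se <- function(x) sd(x) / sqrt(length(x))

set.seed(1)
d <- tibble(gp = rep(c("A", "B"), each = 10), x = rnorm(20))

# se() sees the original column because x has not been overwritten yet
good <- d %>%
  group_by(gp) %>%
  summarise(se = se(x), x = mean(x))

# x is already a single value per group when se() runs, so sd() returns NA
bad <- d %>%
  group_by(gp) %>%
  summarise(x = mean(x), se = se(x))

good
bad
```

Reordering works, but giving the mean its own name, as in the answer above, is usually the less fragile fix.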
Note that this sequential evaluation is a well-considered feature of dplyr. The practical difference between dplyr::mutate and base::transform is exactly that.
dd = data.frame(x = 1:3)
base::transform(dd, x = 0, y = x * 2)
# x y
# 1 0 2
# 2 0 4
# 3 0 6
dplyr::mutate(dd, x = 0, y = x * 2)
# x y
# 1 0 0
# 2 0 0
# 3 0 0
This is called out in the Introduction to dplyr vignette:
dplyr::mutate() works the same way as plyr::mutate() and similarly to base::transform(). The key difference between mutate() and transform() is that mutate allows you to refer to columns that you’ve just created.

Resources