Using dplyr summarize with different operations for multiple columns - r

Well, I know that there are already tons of related questions, but none gave an answer to my particular need.
I want to use dplyr "summarize" on a table with 50 columns, and I need to apply different summary functions to these.
"Summarize_all" and "summarize_at" both seem to have the disadvantage that it's not possible to apply different functions to different subgroups of variables.
As an example, let's assume the iris dataset had 50 columns, so we do not want to address columns by name. I want the sum over the first two columns, the mean over the third and the first value for all remaining columns (after a group_by(Species)). How could I do this?

Fortunately, there is a much simpler way available now.
With the new dplyr 1.0.0 coming out soon, you can leverage the across function for this purpose.
All you need to type is:
iris %>%
  group_by(Species) %>%
  summarize(
    # I want the sum over the first two columns,
    across(c(1, 2), sum),
    # the mean over the third
    across(3, mean),
    # the first value for all remaining columns (after a group_by(Species))
    across(-c(1:3), first)
  )
Great, isn't it?
I first thought across was not necessary, since the scoped variants worked just fine, but this use case is exactly where the across function proves very useful.
You can get the latest version of dplyr by devtools::install_github("tidyverse/dplyr")
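For readers who prefer selecting columns by name rather than by position, a roughly equivalent call using tidyselect helpers (a sketch, assuming the standard iris column names) would be:
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarize(
    across(starts_with("Sepal"), sum),
    across(Petal.Length, mean),
    across(Petal.Width, first)
  )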

As other people have mentioned, this is normally done by calling summarize_each / summarize_at / summarize_if for every group of columns that you want to apply a summarizing function to. As far as I know, you would have to create a custom function that performs the summarization on each subset. You can, for example, set the column names in such a way that you can use the select helpers (e.g. contains()) to filter just the columns that you want to apply the function to. If not, then you can specify the column numbers that you want to summarize.
For the example you mentioned, you could try the following:
summarizer <- function(tb, colsone, colstwo, colsthree,
                       funsone, funstwo, funsthree, group_name) {
  return(bind_cols(
    summarize_all(select(tb, colsone), .funs = funsone),
    summarize_all(select(tb, colstwo), .funs = funstwo) %>%
      ungroup() %>% select(-matches(group_name)),
    summarize_all(select(tb, colsthree), .funs = funsthree) %>%
      ungroup() %>% select(-matches(group_name))
  ))
}
# With column names
iris %>% as_tibble() %>%
  group_by(Species) %>%
  summarizer(colsone = contains("Sepal"),
             colstwo = matches("Petal.Length"),
             colsthree = c(-contains("Sepal"), -matches("Petal.Length")),
             funsone = "sum",
             funstwo = "mean",
             funsthree = "first",
             group_name = "Species")

# With column indexes
iris %>% as_tibble() %>%
  group_by(Species) %>%
  summarizer(colsone = 1:2,
             colstwo = 3,
             colsthree = 4,
             funsone = "sum",
             funstwo = "mean",
             funsthree = "first",
             group_name = "Species")

You could summarise the data with each function separately and then join the data later if needed.
So something like this for the iris example:
sums <- iris %>% group_by(Species) %>% summarise_at(1:2, sum)
means <- iris %>% group_by(Species) %>% summarise_at(3, mean)
firsts <- iris %>% group_by(Species) %>% summarise_at(4, first)
full_join(sums, means) %>% full_join(firsts)
Though I would try to think of something else if there are more than a handful of summarising functions you need to use.
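If more than a handful of summary blocks are needed, one way to keep this pattern manageable (a sketch using purrr, not part of the original answer) is to collect the per-function summaries in a list and fold them together with reduce():
library(dplyr)
library(purrr)

list(
  iris %>% group_by(Species) %>% summarise_at(1:2, sum),
  iris %>% group_by(Species) %>% summarise_at(3, mean),
  iris %>% group_by(Species) %>% summarise_at(4, first)
) %>%
  reduce(full_join, by = "Species")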

Try this:
library(plyr)
library(dplyr)
dataframe <- data.frame(var = c(1,1,1,2,2,2),
                        var2 = c(10,9,8,7,6,5),
                        var3 = c(2,3,4,5,6,7),
                        var4 = c(5,5,3,2,4,2))
dataframe
#  var var2 var3 var4
#1   1   10    2    5
#2   1    9    3    5
#3   1    8    4    3
#4   2    7    5    2
#5   2    6    6    4
#6   2    5    7    2

funnames <- c(sum, mean, first)
colnums <- c(2, 3, 4)

ddply(.data = dataframe, .variables = "var",
      function(x, funcs, inds) {
        mapply(function(func, ind) {
          func(x[, ind])
        }, funcs, inds)
      }, funnames, colnums)
#  var V1 V2 V3
#1   1 27  3  5
#2   2 18  6  2
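If the generic V1/V2/V3 names bother you, they can be renamed after the fact (a small sketch continuing from the objects defined above; the new names are just illustrative):
res <- ddply(.data = dataframe, .variables = "var",
             function(x, funcs, inds) {
               mapply(function(func, ind) func(x[, ind]), funcs, inds)
             }, funnames, colnums)
names(res)[-1] <- c("var2_sum", "var3_mean", "var4_first")
res
#  var var2_sum var3_mean var4_first
#1   1       27         3          5
#2   2       18         6          2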

See this - feature coming soon

Related

R: What is the expected output of passing a character vector to dplyr::all_of()?

I am trying to understand the expected output of dplyr::group_by() in conjunction with dplyr::all_of(). My understanding is that using dplyr::all_of() should convert character vectors containing variable names into the bare names so that group_by() can use them, but this doesn't appear to happen.
Below, I generate some fake data, pass different objects to group_by() with(out) all_of() and calculate the number of observations in each group. In the example, passing a single bare column name without dplyr::all_of() produces the correct output: one row per unique value of the column. However, passing character vectors or using dplyr::all_of() produces incorrect output: one row regardless of the number of values in a column.
What is the expected behaviour when using all_of(), and how might I alternatively pass a character vector to group_by() so it is processed as a vector of bare names?
library(dplyr)

# Create a data.frame with
# 2 variables, each with 2 unique values.
df <- data.frame(var = rep(c("a", "b"), 10),
                 bar = rep(c(1, 2), 20))

# Output 1: 2x2 tibble - GOOD
df %>% group_by(var) %>% summarize(n = n())

# Output 2: 1x2 tibble - BAD
foo <- "var"
df %>% group_by(all_of(foo)) %>% summarize(n = n())

# Output 3: 1x2 tibble
df %>% group_by("var") %>% summarize(n = n())

# Output 4: Error: in_var not found - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
  df %>%
    group_by(in_var) %>%
    summarize(n = n())
})

# Output 5: list of length 2 where
# each element is a 1x2 tibble - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
  df %>%
    group_by(all_of(in_var)) %>%
    summarize(n = n())
})
We can use group_by_at:
lapply(foo2, function(in_var) df %>%
         group_by_at(all_of(in_var)) %>%
         summarise(n = n()))
Output:
#[[1]]
# A tibble: 2 x 2
#  var       n
#* <chr> <int>
#1 a        20
#2 b        20

#[[2]]
# A tibble: 2 x 2
#    bar     n
#* <dbl> <int>
#1     1    20
#2     2    20
As across replaces some of the functionality of group_by_at, we can use it instead together with all_of:
lapply(foo2, function(in_var) df %>%
         group_by(across(all_of(in_var))) %>%
         summarise(n = n()))
Or convert to a symbol and evaluate (!!):
lapply(foo2, function(in_var) df %>%
         group_by(!! rlang::sym(in_var)) %>%
         summarise(n = n()))
Or use map from purrr:
library(purrr)
map(foo2, ~ df %>%
      group_by(!! rlang::sym(.x)) %>%
      summarise(n = n()))
Or, instead of group_by, use count:
map(foo2, ~ df %>%
      count(across(all_of(.x))))
To add to @akrun's answer showing multiple ways to achieve the desired output: my understanding of all_of() is that it is a helper for selecting variables stored as character vectors inside dplyr functions, and it uses vctrs underneath. any_of() is a less strict version of all_of() with some convenient use cases of its own.
Reading ?tidyselect::all_of() is helpful. This page is also useful for keeping up with changes in dplyr and tidy evaluation: https://dplyr.tidyverse.org/articles/programming.html.
The scoped dplyr verbs are being superseded by across(), based on decisions by the developers at RStudio; see ?group_by_at() or the other *_if, *_at, *_all documentation. So it really depends on which version of dplyr you are using in your workflow and what works best for you.
This SO post also gives context on how solutions for passing character vectors into dplyr functions have changed over time, and there are probably more posts out there.
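A quick illustration of the all_of() / any_of() distinction mentioned above (a small sketch; the missing column name is just an example):
library(dplyr)

dat <- data.frame(var = c("a", "b"), bar = c(1, 2))

dat %>% select(all_of(c("var", "missing")))  # errors: the column "missing" doesn't exist
dat %>% select(any_of(c("var", "missing")))  # silently keeps only the columns that do exist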

Dplyr top_n returns multiple rows

dplyr provides the function top_n(); however, in case of equal values it returns all tied rows (more than one). I would like to return exactly one row per group. See the example below.
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
df %>% group_by(id1) %>% top_n(n=1)
You can use a combination of arrange and slice
df %>%
  group_by(id1) %>%
  arrange(desc(id2)) %>%
  slice(1)
Use desc() inside arrange() if you want the largest element; otherwise leave it out.
Apparently slice_head is also a newer function that does what you are looking for; after arranging, take a single row per group:
df %>%
  group_by(id1) %>%
  arrange(desc(id2)) %>%
  slice_head(n = 1)
Use slice_max() with the argument with_ties = FALSE:
library(dplyr)

df %>%
  group_by(id1) %>%
  slice_max(id2, with_ties = FALSE)

# A tibble: 3 x 2
# Groups:   id1 [3]
  id1     id2
  <chr> <dbl>
1 A         8
2 B         7
3 C         5
If you don't want to remember so many {dplyr} function names, which are prone to change anyway, I can recommend the {data.table} package for such tasks. Plus, it's faster.
require(data.table)
df <- data.frame(id1 = c(rep("A",3), rep("B",3), rep("C",3)),
                 id2 = c(8,8,4,7,7,4,5,5,5))
setDT(df)
df[ , .(id2_head = head(id2, 1)), by = id1]
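Note that head(id2, 1) simply takes the first row of each group, which only returns the largest value here because the example data happen to be sorted in decreasing order within each group. For unsorted data you would order first (a sketch in the same data.table style, continuing from the df above):
df[order(-id2), .(id2_max = head(id2, 1)), by = id1]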

R: rowwise nth element ordered_by row values

I have this input:
t <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
And I want the row-wise nth-lowest element of the data frame, ordered by the values within each row, so that the output is something like this (example for nth_element = 2):
[1] 2 3 5 4
I tried a function like this:
apply(t, 1, nth, n=1, order_by = .)
But this does not work. Two questions:
What should I type in the order_by argument to make this function work?
What is the best way to summarise rows with my own summary function if I don't want to mention the column names in the rowwise summary function?
Sidenote:
I don't want to mention the column names specifically, I want the function to use all rows in the dataset.
I tried the rownth function from the Rfast package, but it only returns one result. Does anybody know what I am doing wrong?
We can use apply and sort to do this.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
nth_lowest <- 2
apply(d, 1, FUN = function(x) sort(x)[nth_lowest])
# [1] 2 3 5 4
Note that I am calling the data d instead of t. t is already a reserved name in R (matrix transpose function).
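If you need this more than once, the same idea wraps up neatly into a small helper function (a sketch; the function name is just illustrative, and it reuses the d defined above):
nth_lowest_by_row <- function(data, n) {
  apply(data, 1, function(row) sort(row)[n])
}

nth_lowest_by_row(d, 2)
# [1] 2 3 5 4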
Not as elegant as @bouncyball's answer, but using dplyr (and tidyr), one possibility is to do:
library(dplyr)
library(tidyr)
t %>% mutate(Row = row_number()) %>%
  pivot_longer(-Row, names_to = "Col", values_to = "Val") %>%
  group_by(Row) %>%
  arrange(Val) %>%
  slice(2) %>%
  select(Val)

Adding missing grouping variables: `Row`
# A tibble: 4 x 2
# Groups:   Row [4]
    Row   Val
  <int> <dbl>
1     1     2
2     2     3
3     3     5
4     4     4
Using Rfast you could reduce the run time for big matrices (it works on matrices only).
d <- data.frame(x = c(1,2,8,4), y = c(2,3,4,5), k = c(3,4,5,1))
d <- Rfast::data.frame.to_matrix(d)
nth_lowests <- rep(2, ncol(d))
Rfast::rownth(d, nth_lowests)
# [1] 2 3 5 4
You could also use the parallel version of Rfast::rownth

Add summarize variable in multiple statements using dplyr?

In dplyr, group_by has a parameter add; if it is TRUE, the new variables are added to the existing grouping. For example:
data <- data.frame(a = c('a','b','c'), b = c(1,2,3), c = c(4,5,6))
data <- data %>% group_by(a, add = TRUE)
data <- data %>% group_by(b, add = TRUE)
data %>% summarize(sum_c = sum(c))
Output:
  a     b sum_c
1 a     1     4
2 b     2     5
3 c     3     6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where, if x = TRUE, I want to add the variable x_v to the summary.
I see several related Stack Overflow questions, but none that addresses this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE

data <- data.frame(val = c(1, 2, 2))

if (summarize_num && summarize_num_distinct) {
  summ <- data %>% summarize(n = n(), n_unique = n_distinct(val))
} else if (summarize_num) {
  summ <- data %>% summarize(n = n())
} else if (summarize_num_distinct) {
  summ <- data %>% summarize(n_unique = n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val = c(1, 2, 2), grp = c(1, 1, 2)) # To show it works within groups

summ <- data %>% group_by(grp)
if (summarize_num) { summ = mutate(summ, n = n()) }
if (summarize_num_distinct) { summ = mutate(summ, n_unique = n_distinct(val)) }
summ = slice(summ, 1) %>% ungroup() %>% select(-val)

## A tibble: 2 x 3
#    grp     n n_unique
#  <dbl> <int>    <int>
#1     1     2        2
#2     2     1        1
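Another option that avoids the repeated per-row work and should still be translatable by dbplyr is to build a named list of quoted expressions and splice it into a single summarize() call with !!! (a sketch using rlang, not from the original answers):
library(dplyr)
library(rlang)

summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val = c(1, 2, 2), grp = c(1, 1, 2))

summary_exprs <- list()
if (summarize_num)          summary_exprs$n        <- expr(n())
if (summarize_num_distinct) summary_exprs$n_unique <- expr(n_distinct(val))

data %>%
  group_by(grp) %>%
  summarize(!!!summary_exprs)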
The summarise_at() function takes a list of functions as a parameter, so we can do:
data <- data.frame(val = c(1, 2, 2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
  summarise_at(.vars = "val", fcts)
  n_unique n
1        2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
  summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
  n
1 3
So the idea is to define a list of possible aggregation functions and then dynamically select which aggregations to compute. Even the order of the columns in the aggregate can be controlled:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
  summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
  max n      avg sum
1   2 3 1.666667   5
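With dplyr 1.0.0 or later, the same idea carries over to across(), which also accepts a named list of functions; setting .names = "{.fn}" keeps the column names identical to the summarise_at() output (a sketch):
library(dplyr)

summarize_num_distinct <- FALSE
summarize_num <- TRUE

data <- data.frame(val = c(1, 2, 2))
fcts <- list(n_unique = n_distinct, n = length)

data %>%
  summarise(across(val, fcts[c(summarize_num_distinct, summarize_num)], .names = "{.fn}"))
#  n
#1 3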

Can't split dataframe into equal buckets preserving order without introducing Xn. prefix

I am trying to split an ordered data frame into 10 equal buckets. The following works but it introduces an X1., X2., X3. ... prefix to each bucket, which prevents me from iterating over the buckets to sum them.
num_dfs <- 10
buckets <- split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
This produces a list of 10 data frames; element 10 looks like:
$`10`
       predicted_duration actual_duration
177188         23.7402944               6
466561         23.7402663              12
479556         23.7401721               5
147585         23.7401666              48
Here's the crude code I am using to try to sum the groups.
for (i in c(1,2,3,4,5,6,7,8,9,10)) {
  p <- sum(as.data.frame(df[i], row.names = NULL)$X1.actual_duration) # X1., X2., ...
  print(paste(i, "=", p))
}
How do I remove the Xn. grouping prefix or programmatically reference it using the index i?
Here's a similar reproducible example:
df <- data.frame(actual_duration = sample(100))
num_dfs <- 10
df_grouped <- as.data.frame(split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs))))

for (i in c(1,2,3,4,5,6,7,8,9,10)) {
  p <- sum(df_grouped[i]$actual_duration) # does not work because the prefix X1., X2., ... was added by R
  print(paste(p))
}
I'm not entirely clear on what your issue is, but if you are just trying to get the sum by group couldn't you use
library(tidyverse)

df <- data.frame(actual_duration = sample(100))

df %>%
  arrange(actual_duration) %>%
  mutate(group = rep(1:10, each = 10)) %>%
  group_by(group) %>%
  summarise(sums = sum(actual_duration))
Alternatively, if you want to keep the list format:
df %>%
  arrange(actual_duration) %>%
  mutate(group = factor(rep(1:10, each = 10))) %>%
  split(., .$group) %>%
  map(., function(x) sum(x$actual_duration))
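To address the original question more directly: the Xn. prefix comes from wrapping split() in as.data.frame(), which flattens the list of buckets into one wide data frame. If you keep the result of split() as a list, you can sum each bucket without worrying about the names (a sketch):
df <- data.frame(actual_duration = sample(100))
num_dfs <- 10

buckets <- split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
sapply(buckets, function(b) sum(b$actual_duration))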
