Is dplyr::left_join equivalent to base::merge(..., all.x=TRUE)? - r

I have a lot of old R code using the following syntax to perform what I think are left joins (or left outer joins if you prefer the SQL name):
merge(a, b, by="id", all.x=TRUE)
From my point of view, this is completely equivalent to using dplyr's dedicated function:
left_join(a, b, by="id")
I'm wondering if this is always the case or if the two can in some cases lead to different results. Please feel free to provide examples of when they could be considered equivalent and when not.
In this silly example, the two seems to yield the same result
require(dplyr)
a = data.frame(id=1:4, c(letters[1:3], NA)) %>% as_tibble()
b = data.frame(id=1:2) %>% as_tibble()
all_equal(left_join(b, a, by="id"), merge(b, a, by='id', all.x = T))
# TRUE
Why am I asking this question?
I'm asking this because, for instance, stats::aggregate and dplyr::group_by, if used with default arguments are not equivalent:
a %>% group_by(letter) %>% summarise(mean(id))
# # A tibble: 4 x 2
# letter `mean(id)`
# <fct> <dbl>
# 1 a 1.00
# 2 b 2.00
# 3 c 3.00
# 4 <NA> 4.00
aggregate(id ~ letter, data = a, FUN = mean)
# letter id
# 1 a 1
# 2 b 2
# 3 c 3
That, is they give the same result if you omit NAs from the dplyr's data (because the default for aggregate is na.omit). I'm asking also because when working with big datasets it's hard to spot at a glance why something is happening (especially when dealing with some code that was not written by you) and if you have to do some maintenance work, harmless sostitutions like those presented above can cause significant changes in the output.
EDIT: I'm using dplyr 0.7.4 and R 3.4.1.

The tidyverse functions uses de NA as a part of data, because it should explain some aspects of information that can't be explained by "identified" data. In other words you must use a especific function to drop NA values. In your example there are many ways to perform the same process with equivalent results. For example, consider the na.omit() function:
library(tidyverse) #This include dplyr package
a = data.frame(id = 1:4,
letter = c(letters[1:3], NA)
) %>% as_tibble()
all.equal(
a %>%
na.omit(letter) %>% #This drop NA values in the column "letter"
group_by(letter) %>%
summarise(id = mean(id)),
aggregate(id ~ letter,
data = a,
FUN = mean ))
#>[1] TRUE
Other example is using filter() function:
all.equal(
a %>%
filter(!is.na(letter)) %>% #This drop NA values in the column "letter"
group_by(letter) %>%
summarise(id = mean(id)),
aggregate(id ~ letter,
data = a,
FUN = mean ))
#>[1] TRUE
Hope is can help you!

Related

Dplyr syntax selecting columns and converting them to a single list

I am starting to learn how to use dplyr's pipe (%>%) command for manipulating data frames. I like that it seems much more streemlined. However, I just encountered a problem that I could not solve with only pipes.
I have a data frame which holds relationship (network) data which looks like this:
The first two columns indicate what items (genes) there is a relationship between, and the third column contains information about that relationship:
a b c
1 Gene_1 Gene_2 X
2 Gene_2 Gene_3 R
3 Gene_1 Gene_4 X
My goal is to get a list of unique genes that share the same attribute. If the attribute X in col 3 is selected, I would get this data frame:
a b c
1 Gene_1 Gene_2 X
3 Gene_1 Gene_4 X
And I would want to end with this list of unique genes:
genes = c("Gene_1" "Gene_2" "Gene_4")
It does not matter if the item (Gene) comes from the first column or the second, I just want a unique list. I came up with this solution:
library(tidyr)
net = tibble(a = c("Gene_1", "Gene_2", "Gene_1"),
b = c("Gene_2", "Gene_3", "Gene_4"),
c = c("X", "R", "X"))
df = net %>%
filter(c == "X") %>%
select(c(1,2))
genes = unique(c(df$a, df$b))
but am not satisfied, as I was not able to do everything within the dplyr pipe commands. I had to make a list outside of the pipe commands, and then call unique on it.
Is there a way to accomplish this task with a call to another pipe? I could not find anyway to do this. Thanks.
1) Use {...} like this:
net %>%
filter(c == "X") %>%
select(c(1,2)) %>%
{ unique(c(.$a, .$b)) }
## [1] "Gene_1" "Gene_3" "Gene_2" "Gene_5"
2) or use magrittr's %$% pipe:
library(magrittr)
net %>%
filter(c == "X") %>%
select(c(1,2)) %$%
unique(c(a, b))
## [1] "Gene_1" "Gene_3" "Gene_2" "Gene_5"
3) or use with:
net %>%
filter(c == "X") %>%
select(c(1,2)) %>%
with(unique(c(a, b)))
## [1] "Gene_1" "Gene_3" "Gene_2" "Gene_5"
Since the result is not a data frame best not call it df.
The unlist() function is probably what you are looking for.
Quoting from the built in documentation for ?unlist: "Given a list structure x, unlist simplifies it to produce a vector which contains all the atomic components which occur in x."
Since R data frames (and tibbles) are implemented as lists of column vectors with equal lengths, the unlist function will effectively convert a data frame into a vector.
Subset for the desired rows and columns with filter and select, then pipe the result through unlist() and then unique(). The result will be a vector with the distinct elements.
library(dplyr)
# The example data
tibble(a = c("Gene_1", "Gene_2", "Gene_1"),
b = c("Gene_2", "Gene_3", "Gene_4"),
c = c("X", "R", "X")) %>%
# Subset data for desired feature
filter(c == "X") %>%
# Select identifier columns
select(a, b) %>%
# convert to a vector
unlist() %>%
# derive unique elements
unique()
Result
[1] "Gene_1" "Gene_2" "Gene_4"
I would suggest using tidyr::pivot_longer to reshape the multiple columns of potential matches from the two distinct gene columns, to a value column (which we care about) and a name column (referencing the original column name, which we don't care about and can ignore). Then distinct to get unique matches, and finally the match to column c:
net %>%
pivot_longer(-c) %>%
distinct(c, value) %>%
filter(c == "X")
If you want the result as a vector, you could add %>% pull(value).
One benefit of this approach is that we already have every distinct set of genes for every column c value calculated, and the last filter step just narrows it to one example c value.
Result
c value
<chr> <chr>
1 X Gene_1
2 X Gene_2
3 X Gene_4
[Note: I made a = c("Gene_1", "Gene_2", "Gene_1") and b = c("Gene_2", "Gene_3", "Gene_4") to match example.]
I realize this question has several answers, but I would have gone a slightly different way with it. Perhaps it will be useful to someone?
I created a data set to demonstrate, as well.
library(tidyverse)
library(stringi) # only used in data generation
# data set creation 100 rows
a = paste0("Gene_",1:100)
b = paste0("Gene_",round(runif(100, 10, 99),digits = 0))
cC = paste0(stringi::stri_rand_strings(100, 1, '[A-Z]'))
# put it together and strip the information
data.frame(a = a, b = b, cC = cC) %>% # collect the data
filter(cC == "X") %>% # filter for attribute
select(-cC) %>% # remove attribute field
unlist() %>% # collapse the data frame into a vector
unique() # show me what's unique
# output example
# [1] "Gene_10" "Gene_12" "Gene_28" "Gene_77" "Gene_22" "Gene_41" "Gene_75"
# [8] "Gene_19"
library(tidyverse)
net <- tibble(
a = c("Gene_1", "Gene_1", "Gene_3"),
b = c("Gene_2", "Gene_4", "Gene_5"),
c = c("X", "R", "X")
)
df <- net %>%
filter(c == "X") %>%
select(a, b)
df
#> # A tibble: 2 x 2
#> a b
#> <chr> <chr>
#> 1 Gene_1 Gene_2
#> 2 Gene_3 Gene_5
genes <- net %>%
select(-c) %>%
unlist() %>%
unique()
genes
#> [1] "Gene_1" "Gene_3" "Gene_2" "Gene_4" "Gene_5"
Though many enlightening answers have been proposed and accepted by OP too, I just want to add that in case, you want it simultaneously for all values in c, do this
library(tidyverse)
net %>%
group_split(c, .keep = F) %>%
setNames(unique(net$c)) %>%
map(~ (.x %>% unlist() %>% unique()))
$X
[1] "Gene_2" "Gene_3"
$R
[1] "Gene_1" "Gene_2" "Gene_4"

Use shiny input to select variable for dplyr group_by [duplicate]

I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
# plyr - works
ddply(data, columns, summarize, value=mean(value))
# dplyr - raises error
data %.%
group_by(columns) %.%
summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds
What am I missing to translate the plyr example into a dplyr-esque syntax?
Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.
Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# Columns you want to group by
grp_cols <- names(df)[-3]
# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)
# Perform frequency counts
df %>%
group_by_(.dots=dots) %>%
summarise(n = n())
output:
Source: local data frame [9 x 3]
Groups: asihckhdoydk
asihckhdoydk a30mvxigxkgh n
1 A A 10
2 A B 10
3 A C 13
4 B A 14
5 B B 10
6 B C 12
7 C A 9
8 C B 12
9 C C 10
Since this question was posted, dplyr added scoped versions of group_by (documentation here). This lets you use the same functions you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometime catch people by suprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup to your pipeline after you summarize.
The support for this in dplyr is currently pretty weak, eventually I think the syntax will be something like:
df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))
But that probably won't be there for a while (because I need to think through all the consequences).
In the meantime, you can use regroup(), which takes a list of symbols:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
df %.%
regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
summarise(n = n())
If you have have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():
vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)
df %.% regroup(vars2) %.% summarise(n = n())
String specification of columns in dplyr are now supported through variants of the dplyr functions with names finishing in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.
The following snippet cleanly solves the problem that #sharoz originally posed (note the need to write out the .dots argument):
# Given data and columns from the OP
data %>%
group_by_(.dots = columns) %>%
summarise(Value = mean(value))
(Note that dplyr now uses the %>% operator, and %.% is deprecated).
Update with across() from dplyr 1.0.0
All the answers above are still working, and the solutions with the .dots argument are intruiging.
BUT if you look for a solution that is easier to remember, the new across() comes in handy. It was published 2020-04-03 by Hadley Wickham and can be used in mutate() and summarise() and replace the scoped variants like _at or _all. Above all, it replaces very elegantly the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().
So the solution with across looks very readable:
data %>%
group_by(across(all_of(columns))) %>%
summarize(Value = mean(value))
Until dplyr has full support for string arguments, perhaps this gist is useful:
https://gist.github.com/skranz/9681509
It contains bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc that use string arguments. You can mix them with the normal dplyr functions. For example
cols = c("cyl","gear")
mtcars %.%
s_group_by(cols) %.%
s_summarise("avdisp=mean(disp), max(disp)") %.%
arrange(avdisp)
It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:
df %.%
group_by(asdfgfTgdsx, asdfk30v0ja) %.%
summarise(Value = mean(value))
> df %.%
+ group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+ summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx
asdfgfTgdsx asdfk30v0ja Value
1 A C 0.046538002
2 C B -0.286359899
3 B A -0.305159419
4 C A -0.004741504
5 B B 0.520126476
6 C C 0.086805492
7 B C -0.052613078
8 A A 0.368410146
9 A B 0.088462212
where df was your data.
?group_by says:
...: variables to group by. All tbls accept variable names, some
will also accept functons of variables. Duplicated groups
will be silently dropped.
which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.
#Arun also mentions that you can do:
df %.%
group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
summarise(Value = mean(value))
But you can't pass in something that unevaluated is not a name of a variable in the data object.
I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.
data = data.frame(
my.a = sample(LETTERS[1:3], 100, replace=TRUE),
my.b = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))
One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:
library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>%
# 1. create quantized versions of base variables
mutate_each(
funs(Quantized = . > 0)
) %>%
# 2. group_by the indicator variables
group_by_(
.dots = grep("Quantized", names(.), value = TRUE)
) %>%
# 3. summarize the base variables
summarize_each(
funs(sum(., na.rm = TRUE)), contains("X_")
)
This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
General example on using the .dots argument as character vector input to the dplyr::group_by function :
iris %>%
group_by(.dots ="Species") %>%
summarise(meanpetallength = mean(Petal.Length))
Or without a hard coded name for the grouping variable (as asked by the OP):
iris %>%
group_by(.dots = names(iris)[5]) %>%
summarise_at("Petal.Length", mean)
With the example of the OP:
data %>%
group_by(.dots =names(data)[-3]) %>%
summarise_at("value", mean)
See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.

Reference Column Names in R Functions [duplicate]

I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
# plyr - works
ddply(data, columns, summarize, value=mean(value))
# dplyr - raises error
data %.%
group_by(columns) %.%
summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds
What am I missing to translate the plyr example into a dplyr-esque syntax?
Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.
Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# Columns you want to group by
grp_cols <- names(df)[-3]
# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)
# Perform frequency counts
df %>%
group_by_(.dots=dots) %>%
summarise(n = n())
output:
Source: local data frame [9 x 3]
Groups: asihckhdoydk
asihckhdoydk a30mvxigxkgh n
1 A A 10
2 A B 10
3 A C 13
4 B A 14
5 B B 10
6 B C 12
7 C A 9
8 C B 12
9 C C 10
Since this question was posted, dplyr added scoped versions of group_by (documentation here). This lets you use the same functions you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometime catch people by suprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup to your pipeline after you summarize.
The support for this in dplyr is currently pretty weak, eventually I think the syntax will be something like:
df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))
But that probably won't be there for a while (because I need to think through all the consequences).
In the meantime, you can use regroup(), which takes a list of symbols:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
df %.%
regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
summarise(n = n())
If you have have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():
vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)
df %.% regroup(vars2) %.% summarise(n = n())
String specification of columns in dplyr are now supported through variants of the dplyr functions with names finishing in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.
The following snippet cleanly solves the problem that #sharoz originally posed (note the need to write out the .dots argument):
# Given data and columns from the OP
data %>%
group_by_(.dots = columns) %>%
summarise(Value = mean(value))
(Note that dplyr now uses the %>% operator, and %.% is deprecated).
Update with across() from dplyr 1.0.0
All the answers above are still working, and the solutions with the .dots argument are intruiging.
BUT if you look for a solution that is easier to remember, the new across() comes in handy. It was published 2020-04-03 by Hadley Wickham and can be used in mutate() and summarise() and replace the scoped variants like _at or _all. Above all, it replaces very elegantly the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().
So the solution with across looks very readable:
data %>%
group_by(across(all_of(columns))) %>%
summarize(Value = mean(value))
Until dplyr has full support for string arguments, perhaps this gist is useful:
https://gist.github.com/skranz/9681509
It contains bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc that use string arguments. You can mix them with the normal dplyr functions. For example
cols = c("cyl","gear")
mtcars %.%
s_group_by(cols) %.%
s_summarise("avdisp=mean(disp), max(disp)") %.%
arrange(avdisp)
It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:
df %.%
group_by(asdfgfTgdsx, asdfk30v0ja) %.%
summarise(Value = mean(value))
> df %.%
+ group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+ summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx
asdfgfTgdsx asdfk30v0ja Value
1 A C 0.046538002
2 C B -0.286359899
3 B A -0.305159419
4 C A -0.004741504
5 B B 0.520126476
6 C C 0.086805492
7 B C -0.052613078
8 A A 0.368410146
9 A B 0.088462212
where df was your data.
?group_by says:
...: variables to group by. All tbls accept variable names, some
will also accept functons of variables. Duplicated groups
will be silently dropped.
which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.
#Arun also mentions that you can do:
df %.%
group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
summarise(Value = mean(value))
But you can't pass in something that unevaluated is not a name of a variable in the data object.
I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.
data = data.frame(
my.a = sample(LETTERS[1:3], 100, replace=TRUE),
my.b = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))
One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:
library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>%
# 1. create quantized versions of base variables
mutate_each(
funs(Quantized = . > 0)
) %>%
# 2. group_by the indicator variables
group_by_(
.dots = grep("Quantized", names(.), value = TRUE)
) %>%
# 3. summarize the base variables
summarize_each(
funs(sum(., na.rm = TRUE)), contains("X_")
)
This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
General example on using the .dots argument as character vector input to the dplyr::group_by function :
iris %>%
group_by(.dots ="Species") %>%
summarise(meanpetallength = mean(Petal.Length))
Or without a hard coded name for the grouping variable (as asked by the OP):
iris %>%
group_by(.dots = names(iris)[5]) %>%
summarise_at("Petal.Length", mean)
With the example of the OP:
data %>%
group_by(.dots =names(data)[-3]) %>%
summarise_at("value", mean)
See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.

dplyr summarise when function return is vector-valued?

The dplyr::summarize() function can apply arbitrary functions over the data, but it seems that function must return a scalar value. I'm curious if there is a reasonable way to handle functions that return a vector value without making multiple calls to the function.
Here's a somewhat silly minimal example. Consider a function that gives multiple values, such as:
f <- function(x,y){
coef(lm(x ~ y, data.frame(x=x,y=y)))
}
and data that looks like:
df <- data.frame(group=c('A','A','A','A','B','B','B','B','C','C','C','C'), x=rnorm(12,1,1), y=rnorm(12,1,1))
I'd like to do something like:
df %>%
group_by(group) %>%
summarise(f(x,y))
and get back a table that has 2 columns added for each of the returned values instead of the usual 1 column. Instead, this errors with: Expecting single value
Of course we can get multiple values from dlpyr::summarise() by giving the function argument multiple times:
f1 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[1]]
f2 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[2]]
df %>%
group_by(group) %>%
summarise(a = f1(x,y), b = f2(x,y))
This gives the desired output:
group a b
1 A 1.7957245 -0.339992915
2 B 0.5283379 -0.004325209
3 C 1.0797647 -0.074393457
but coding in this way is ridiculously crude and ugly.
data.table handles this case more succinctly:
dt <- as.data.table(df)
dt[, f(x,y), by="group"]
but creates an output that extend the table using additional rows instead of additional columns, resulting in an output that is both confusing and harder to work with:
group V1
1: A 1.795724536
2: A -0.339992915
3: B 0.528337890
4: B -0.004325209
5: C 1.079764710
6: C -0.074393457
Of course there are more classic apply strategies we could use here,
sapply(levels(df$group), function(x) coef(lm(x~y, df[df$group == x, ])))
A B C
(Intercept) 1.7957245 0.528337890 1.07976471
y -0.3399929 -0.004325209 -0.07439346
but this sacrifices both the elegance and I suspect the speed of the grouping. In particular, note that we cannot use our pre-defined function f in this case, but have to hard code the grouping into the function definition.
Is there a dplyr function for handling this case? If not, is there a more elegant way to handle this process of evaluating vector-valued functions over a data.frame by group?
You could try do
library(dplyr)
df %>%
group_by(group) %>%
do(setNames(data.frame(t(f(.$x, .$y))), letters[1:2]))
# group a b
#1 A 0.8983217 -0.04108092
#2 B 0.8945354 0.44905220
#3 C 1.2244023 -1.00715248
The output based on f1 and f2 are
df %>%
group_by(group) %>%
summarise(a = f1(x,y), b = f2(x,y))
# group a b
#1 A 0.8983217 -0.04108092
#2 B 0.8945354 0.44905220
#3 C 1.2244023 -1.00715248
Update
If you are using data.table, the option to get similar result is
library(data.table)
setnames(setDT(df)[, as.list(f(x,y)) , group], 2:3, c('a', 'b'))[]
This is why I still love plyr::ddply():
library(plyr)
f <- function(z) setNames(coef(lm(x ~ y, z)), c("a", "b"))
ddply(df, ~ group, f)
# group a b
# 1 A 0.5213133 0.04624656
# 2 B 0.3020656 0.01450137
# 3 C 0.2189537 0.22998823

Group by multiple columns in dplyr, using string vector input

I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
# plyr - works
ddply(data, columns, summarize, value=mean(value))
# dplyr - raises error
data %.%
group_by(columns) %.%
summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds
What am I missing to translate the plyr example into a dplyr-esque syntax?
Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.
Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# Columns you want to group by
grp_cols <- names(df)[-3]
# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)
# Perform frequency counts
df %>%
group_by_(.dots=dots) %>%
summarise(n = n())
output:
Source: local data frame [9 x 3]
Groups: asihckhdoydk
asihckhdoydk a30mvxigxkgh n
1 A A 10
2 A B 10
3 A C 13
4 B A 14
5 B B 10
6 B C 12
7 C A 9
8 C B 12
9 C C 10
Since this question was posted, dplyr added scoped versions of group_by (documentation here). This lets you use the same functions you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometime catch people by suprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup to your pipeline after you summarize.
The support for this in dplyr is currently pretty weak, eventually I think the syntax will be something like:
df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))
But that probably won't be there for a while (because I need to think through all the consequences).
In the meantime, you can use regroup(), which takes a list of symbols:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
df %.%
regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
summarise(n = n())
If you have have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():
vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)
df %.% regroup(vars2) %.% summarise(n = n())
String specification of columns in dplyr are now supported through variants of the dplyr functions with names finishing in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.
The following snippet cleanly solves the problem that #sharoz originally posed (note the need to write out the .dots argument):
# Given data and columns from the OP
data %>%
group_by_(.dots = columns) %>%
summarise(Value = mean(value))
(Note that dplyr now uses the %>% operator, and %.% is deprecated).
Update with across() from dplyr 1.0.0
All the answers above are still working, and the solutions with the .dots argument are intruiging.
BUT if you look for a solution that is easier to remember, the new across() comes in handy. It was published 2020-04-03 by Hadley Wickham and can be used in mutate() and summarise() and replace the scoped variants like _at or _all. Above all, it replaces very elegantly the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().
So the solution with across looks very readable:
data %>%
group_by(across(all_of(columns))) %>%
summarize(Value = mean(value))
Until dplyr has full support for string arguments, perhaps this gist is useful:
https://gist.github.com/skranz/9681509
It contains bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc that use string arguments. You can mix them with the normal dplyr functions. For example
cols = c("cyl","gear")
mtcars %.%
s_group_by(cols) %.%
s_summarise("avdisp=mean(disp), max(disp)") %.%
arrange(avdisp)
It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:
df %.%
group_by(asdfgfTgdsx, asdfk30v0ja) %.%
summarise(Value = mean(value))
> df %.%
+ group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+ summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx
asdfgfTgdsx asdfk30v0ja Value
1 A C 0.046538002
2 C B -0.286359899
3 B A -0.305159419
4 C A -0.004741504
5 B B 0.520126476
6 C C 0.086805492
7 B C -0.052613078
8 A A 0.368410146
9 A B 0.088462212
where df was your data.
?group_by says:
...: variables to group by. All tbls accept variable names, some
will also accept functons of variables. Duplicated groups
will be silently dropped.
which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.
#Arun also mentions that you can do:
df %.%
group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
summarise(Value = mean(value))
But you can't pass in something that unevaluated is not a name of a variable in the data object.
I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.
data = data.frame(
my.a = sample(LETTERS[1:3], 100, replace=TRUE),
my.b = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))
One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:
library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>%
# 1. create quantized versions of base variables
mutate_each(
funs(Quantized = . > 0)
) %>%
# 2. group_by the indicator variables
group_by_(
.dots = grep("Quantized", names(.), value = TRUE)
) %>%
# 3. summarize the base variables
summarize_each(
funs(sum(., na.rm = TRUE)), contains("X_")
)
This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
General example on using the .dots argument as character vector input to the dplyr::group_by function :
iris %>%
group_by(.dots ="Species") %>%
summarise(meanpetallength = mean(Petal.Length))
Or without a hard coded name for the grouping variable (as asked by the OP):
iris %>%
group_by(.dots = names(iris)[5]) %>%
summarise_at("Petal.Length", mean)
With the example of the OP:
data %>%
group_by(.dots =names(data)[-3]) %>%
summarise_at("value", mean)
See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.

Resources