Related
I would like to generate overview tables for the same statistics (e.g., n, mean, sd) across multiple variables.
I started with combining the dyplr summarise and across function. See follwing example:
df <- data.frame(
var1 = 1:10,
var2 = 11:20
)
VarSum <- df %>% summarise(across(c(var1, var2), list(n = length, mean = mean, sd = sd)))
The output is of course given as one row (1x6) with three colums for each variable in this example. What I would like to achieve is to get the output rowise for each variable (2x3). Is that even possible with my approach? Would appriciate any suggestions.
You can pivot first:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything()) %>%
summarise(across(value, list(n = length, mean = mean, sd = sd)), .by = name)
name value_n value_mean value_sd
<chr> <int> <dbl> <dbl>
1 var1 10 5.5 3.03
2 var2 10 15.5 3.03
I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
# plyr - works
ddply(data, columns, summarize, value=mean(value))
# dplyr - raises error
data %.%
group_by(columns) %.%
summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds
What am I missing to translate the plyr example into a dplyr-esque syntax?
Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.
Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# Columns you want to group by
grp_cols <- names(df)[-3]
# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)
# Perform frequency counts
df %>%
group_by_(.dots=dots) %>%
summarise(n = n())
output:
Source: local data frame [9 x 3]
Groups: asihckhdoydk
asihckhdoydk a30mvxigxkgh n
1 A A 10
2 A B 10
3 A C 13
4 B A 14
5 B B 10
6 B C 12
7 C A 9
8 C B 12
9 C C 10
Since this question was posted, dplyr added scoped versions of group_by (documentation here). This lets you use the same functions you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometime catch people by suprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup to your pipeline after you summarize.
The support for this in dplyr is currently pretty weak, eventually I think the syntax will be something like:
df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))
But that probably won't be there for a while (because I need to think through all the consequences).
In the meantime, you can use regroup(), which takes a list of symbols:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
df %.%
regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
summarise(n = n())
If you have have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():
vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)
df %.% regroup(vars2) %.% summarise(n = n())
String specification of columns in dplyr are now supported through variants of the dplyr functions with names finishing in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.
The following snippet cleanly solves the problem that #sharoz originally posed (note the need to write out the .dots argument):
# Given data and columns from the OP
data %>%
group_by_(.dots = columns) %>%
summarise(Value = mean(value))
(Note that dplyr now uses the %>% operator, and %.% is deprecated).
Update with across() from dplyr 1.0.0
All the answers above are still working, and the solutions with the .dots argument are intruiging.
BUT if you look for a solution that is easier to remember, the new across() comes in handy. It was published 2020-04-03 by Hadley Wickham and can be used in mutate() and summarise() and replace the scoped variants like _at or _all. Above all, it replaces very elegantly the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().
So the solution with across looks very readable:
data %>%
group_by(across(all_of(columns))) %>%
summarize(Value = mean(value))
Until dplyr has full support for string arguments, perhaps this gist is useful:
https://gist.github.com/skranz/9681509
It contains bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc that use string arguments. You can mix them with the normal dplyr functions. For example
cols = c("cyl","gear")
mtcars %.%
s_group_by(cols) %.%
s_summarise("avdisp=mean(disp), max(disp)") %.%
arrange(avdisp)
It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:
df %.%
group_by(asdfgfTgdsx, asdfk30v0ja) %.%
summarise(Value = mean(value))
> df %.%
+ group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+ summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx
asdfgfTgdsx asdfk30v0ja Value
1 A C 0.046538002
2 C B -0.286359899
3 B A -0.305159419
4 C A -0.004741504
5 B B 0.520126476
6 C C 0.086805492
7 B C -0.052613078
8 A A 0.368410146
9 A B 0.088462212
where df was your data.
?group_by says:
...: variables to group by. All tbls accept variable names, some
will also accept functons of variables. Duplicated groups
will be silently dropped.
which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.
#Arun also mentions that you can do:
df %.%
group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
summarise(Value = mean(value))
But you can't pass in something that unevaluated is not a name of a variable in the data object.
I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.
data = data.frame(
my.a = sample(LETTERS[1:3], 100, replace=TRUE),
my.b = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))
One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:
library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>%
# 1. create quantized versions of base variables
mutate_each(
funs(Quantized = . > 0)
) %>%
# 2. group_by the indicator variables
group_by_(
.dots = grep("Quantized", names(.), value = TRUE)
) %>%
# 3. summarize the base variables
summarize_each(
funs(sum(., na.rm = TRUE)), contains("X_")
)
This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
General example on using the .dots argument as character vector input to the dplyr::group_by function :
iris %>%
group_by(.dots ="Species") %>%
summarise(meanpetallength = mean(Petal.Length))
Or without a hard coded name for the grouping variable (as asked by the OP):
iris %>%
group_by(.dots = names(iris)[5]) %>%
summarise_at("Petal.Length", mean)
With the example of the OP:
data %>%
group_by(.dots =names(data)[-3]) %>%
summarise_at("value", mean)
See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.
I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
# plyr - works
ddply(data, columns, summarize, value=mean(value))
# dplyr - raises error
data %.%
group_by(columns) %.%
summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds
What am I missing to translate the plyr example into a dplyr-esque syntax?
Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.
Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# Columns you want to group by
grp_cols <- names(df)[-3]
# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)
# Perform frequency counts
df %>%
group_by_(.dots=dots) %>%
summarise(n = n())
output:
Source: local data frame [9 x 3]
Groups: asihckhdoydk
asihckhdoydk a30mvxigxkgh n
1 A A 10
2 A B 10
3 A C 13
4 B A 14
5 B B 10
6 B C 12
7 C A 9
8 C B 12
9 C C 10
Since this question was posted, dplyr added scoped versions of group_by (documentation here). This lets you use the same functions you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometime catch people by suprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup to your pipeline after you summarize.
The support for this in dplyr is currently pretty weak, eventually I think the syntax will be something like:
df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))
But that probably won't be there for a while (because I need to think through all the consequences).
In the meantime, you can use regroup(), which takes a list of symbols:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
df %.%
regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
summarise(n = n())
If you have have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():
vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)
df %.% regroup(vars2) %.% summarise(n = n())
String specification of columns in dplyr are now supported through variants of the dplyr functions with names finishing in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.
The following snippet cleanly solves the problem that #sharoz originally posed (note the need to write out the .dots argument):
# Given data and columns from the OP
data %>%
group_by_(.dots = columns) %>%
summarise(Value = mean(value))
(Note that dplyr now uses the %>% operator, and %.% is deprecated).
Update with across() from dplyr 1.0.0
All the answers above are still working, and the solutions with the .dots argument are intruiging.
BUT if you look for a solution that is easier to remember, the new across() comes in handy. It was published 2020-04-03 by Hadley Wickham and can be used in mutate() and summarise() and replace the scoped variants like _at or _all. Above all, it replaces very elegantly the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().
So the solution with across looks very readable:
data %>%
group_by(across(all_of(columns))) %>%
summarize(Value = mean(value))
Until dplyr has full support for string arguments, perhaps this gist is useful:
https://gist.github.com/skranz/9681509
It contains bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc that use string arguments. You can mix them with the normal dplyr functions. For example
cols = c("cyl","gear")
mtcars %.%
s_group_by(cols) %.%
s_summarise("avdisp=mean(disp), max(disp)") %.%
arrange(avdisp)
It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:
df %.%
group_by(asdfgfTgdsx, asdfk30v0ja) %.%
summarise(Value = mean(value))
> df %.%
+ group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+ summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx
asdfgfTgdsx asdfk30v0ja Value
1 A C 0.046538002
2 C B -0.286359899
3 B A -0.305159419
4 C A -0.004741504
5 B B 0.520126476
6 C C 0.086805492
7 B C -0.052613078
8 A A 0.368410146
9 A B 0.088462212
where df was your data.
?group_by says:
...: variables to group by. All tbls accept variable names, some
will also accept functons of variables. Duplicated groups
will be silently dropped.
which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.
#Arun also mentions that you can do:
df %.%
group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
summarise(Value = mean(value))
But you can't pass in something that unevaluated is not a name of a variable in the data object.
I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.
data = data.frame(
my.a = sample(LETTERS[1:3], 100, replace=TRUE),
my.b = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))
One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:
library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>%
# 1. create quantized versions of base variables
mutate_each(
funs(Quantized = . > 0)
) %>%
# 2. group_by the indicator variables
group_by_(
.dots = grep("Quantized", names(.), value = TRUE)
) %>%
# 3. summarize the base variables
summarize_each(
funs(sum(., na.rm = TRUE)), contains("X_")
)
This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
General example on using the .dots argument as character vector input to the dplyr::group_by function :
iris %>%
group_by(.dots ="Species") %>%
summarise(meanpetallength = mean(Petal.Length))
Or without a hard coded name for the grouping variable (as asked by the OP):
iris %>%
group_by(.dots = names(iris)[5]) %>%
summarise_at("Petal.Length", mean)
With the example of the OP:
data %>%
group_by(.dots =names(data)[-3]) %>%
summarise_at("value", mean)
See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.
I want to collapse multiple columns across groups such that the remaining summary statistic is the difference between the column values for each group. I have two methods but I have a feeling that there is a better way I should be doing this.
Example data
library(dplyr)
library(tidyr)
test <- data.frame(year = rep(2010:2011, each = 2),
id = c("A","B"),
val = 1:4,
val2 = 2:5,
stringsAsFactors = F)
Using summarize_each
test %>%
group_by(year) %>%
summarize_each(funs(.[id == "B"] - .[id == "A"]), val, val2)
Using tidyr
test %>%
gather(key,val,val:val2) %>%
spread(id,val) %>%
mutate(B.less.A = B - A) %>%
select(-c(A,B)) %>%
spread(key,B.less.A)
The summarize_each way seems relatively simple but I feel like there is a way to do this by grouping on id somehow? Is there a way that could ignore NA values in the columns?
We can use data.table
library(data.table)
setDT(test)[, lapply(.SD, diff), by = year, .SDcols = val:val2]
# year val val2
#1: 2010 1 1
#2: 2011 1 1
I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
# plyr - works
ddply(data, columns, summarize, value=mean(value))
# dplyr - raises error
data %.%
group_by(columns) %.%
summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds
What am I missing to translate the plyr example into a dplyr-esque syntax?
Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.
Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# Columns you want to group by
grp_cols <- names(df)[-3]
# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)
# Perform frequency counts
df %>%
group_by_(.dots=dots) %>%
summarise(n = n())
output:
Source: local data frame [9 x 3]
Groups: asihckhdoydk
asihckhdoydk a30mvxigxkgh n
1 A A 10
2 A B 10
3 A C 13
4 B A 14
5 B B 10
6 B C 12
7 C A 9
8 C B 12
9 C C 10
Since this question was posted, dplyr added scoped versions of group_by (documentation here). This lets you use the same functions you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometime catch people by suprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup to your pipeline after you summarize.
The support for this in dplyr is currently pretty weak, eventually I think the syntax will be something like:
df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))
But that probably won't be there for a while (because I need to think through all the consequences).
In the meantime, you can use regroup(), which takes a list of symbols:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
df %.%
regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
summarise(n = n())
If you have have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():
vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)
df %.% regroup(vars2) %.% summarise(n = n())
String specification of columns in dplyr are now supported through variants of the dplyr functions with names finishing in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.
The following snippet cleanly solves the problem that #sharoz originally posed (note the need to write out the .dots argument):
# Given data and columns from the OP
data %>%
group_by_(.dots = columns) %>%
summarise(Value = mean(value))
(Note that dplyr now uses the %>% operator, and %.% is deprecated).
Update with across() from dplyr 1.0.0
All the answers above are still working, and the solutions with the .dots argument are intruiging.
BUT if you look for a solution that is easier to remember, the new across() comes in handy. It was published 2020-04-03 by Hadley Wickham and can be used in mutate() and summarise() and replace the scoped variants like _at or _all. Above all, it replaces very elegantly the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().
So the solution with across looks very readable:
data %>%
group_by(across(all_of(columns))) %>%
summarize(Value = mean(value))
Until dplyr has full support for string arguments, perhaps this gist is useful:
https://gist.github.com/skranz/9681509
It contains bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc that use string arguments. You can mix them with the normal dplyr functions. For example
cols = c("cyl","gear")
mtcars %.%
s_group_by(cols) %.%
s_summarise("avdisp=mean(disp), max(disp)") %.%
arrange(avdisp)
It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:
df %.%
group_by(asdfgfTgdsx, asdfk30v0ja) %.%
summarise(Value = mean(value))
> df %.%
+ group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+ summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx
asdfgfTgdsx asdfk30v0ja Value
1 A C 0.046538002
2 C B -0.286359899
3 B A -0.305159419
4 C A -0.004741504
5 B B 0.520126476
6 C C 0.086805492
7 B C -0.052613078
8 A A 0.368410146
9 A B 0.088462212
where df was your data.
?group_by says:
...: variables to group by. All tbls accept variable names, some
will also accept functons of variables. Duplicated groups
will be silently dropped.
which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.
#Arun also mentions that you can do:
df %.%
group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
summarise(Value = mean(value))
But you can't pass in something that unevaluated is not a name of a variable in the data object.
I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.
data = data.frame(
my.a = sample(LETTERS[1:3], 100, replace=TRUE),
my.b = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))
One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:
library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>%
# 1. create quantized versions of base variables
mutate_each(
funs(Quantized = . > 0)
) %>%
# 2. group_by the indicator variables
group_by_(
.dots = grep("Quantized", names(.), value = TRUE)
) %>%
# 3. summarize the base variables
summarize_each(
funs(sum(., na.rm = TRUE)), contains("X_")
)
This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
General example on using the .dots argument as character vector input to the dplyr::group_by function :
iris %>%
group_by(.dots ="Species") %>%
summarise(meanpetallength = mean(Petal.Length))
Or without a hard coded name for the grouping variable (as asked by the OP):
iris %>%
group_by(.dots = names(iris)[5]) %>%
summarise_at("Petal.Length", mean)
With the example of the OP:
data %>%
group_by(.dots =names(data)[-3]) %>%
summarise_at("value", mean)
See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.