I have some R code that looks like this:
library(dplyr)
library(datasets)
iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()
Giving:
Source: local data frame [6 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.6 3.6 1.0 0.2 setosa
3 5.0 2.3 3.3 1.0 versicolor
4 5.1 2.5 3.0 1.1 versicolor
5 4.9 2.5 4.5 1.7 virginica
6 6.0 3.0 4.8 1.8 virginica
This groups by species and, for each group, keeps only the two rows with the shortest Petal.Length. I have some duplication in my code, because I do this several times for different columns and numbers. For example:
iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(Petal.Width, ties.method = 'random')<=3) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Width, ties.method = 'random')<=3) %.% ungroup()
I want to extract this into a function. The naive approach doesn't work:
keep_min_n_by_species <- function(expr, n) {
iris %.% group_by(Species) %.% filter(rank(expr, ties.method = 'random') <= n) %.% ungroup()
}
keep_min_n_by_species(Petal.Width, 2)
Error in filter_impl(.data, dots(...), environment()) :
object 'Petal.Width' not found
As I understand it, the expression rank(Petal.Length, ties.method = 'random') <= 2 is evaluated in a different context, introduced by the filter function, that provides a meaning for the Petal.Length expression. I can't just swap in a variable for Petal.Length, because it will be evaluated in the wrong context. I've tried using different combinations of substitute and eval, having read this page: Non-standard evaluation. I can't figure out an appropriate combination. I think the problem might be that I don't just want to pass through an expression from the caller (Petal.Length) through to filter for it to evaluate - I want to construct a new bigger expression (rank(Petal.Length, ties.method = 'random') <= 2) and then pass that whole expression through to filter for it to evaluate.
How can I refactor this expression into a function?
More generally, how should I go about extracting an R expression into a function?
Even more generally, am I approaching this with the wrong mindset? In more mainstream languages I'm familiar with (e.g. Python, C++, C#), this is a relatively straightforward operation that I want to do all the time to remove duplication in my code. In R it seems (to me, at least) that non-standard evaluation can make it a very non-obvious operation. Should I be doing something else entirely?
dplyr version 0.3 is beginning to address this using the lazyeval package, as @baptiste mentioned, and a new family of functions that use standard evaluation (same function names as the NSE versions, but ending in _). There is a vignette here: https://github.com/hadley/dplyr/blob/master/vignettes/nse.Rmd
All that being said, I don't know best practices for what you're trying to do (though I'm trying to do the same thing). I have something working, but like I said, I don't know if it's the best way to do it. Note the use of filter_() instead of filter(), and passing in the argument as a quoted character string:
devtools::install_github("hadley/dplyr")
devtools::install_github("hadley/lazyeval")
library(dplyr)
library(lazyeval)
keep_min_n_by_species <- function(expr, n, rev = FALSE) {
iris %>%
group_by(Species) %>%
filter_(interp(~rank(if (rev) -x else x, ties.method = 'random') <= y, # filter_, not filter
x = as.name(expr), y = n)) %>%
ungroup()
}
keep_min_n_by_species("Petal.Width", 3) # "Petal.Width" as character string
keep_min_n_by_species("Petal.Width", 3, rev = TRUE)
Update based on @hadley's comment:
keep_min_n_by_species <- function(expr, n) {
expr <- lazy(expr)
formula <- interp(~rank(x, ties.method = 'random') <= y,
x = expr, y = n)
iris %>%
group_by(Species) %>%
filter_(formula) %>%
ungroup()
}
keep_min_n_by_species(Petal.Width, 3)
keep_min_n_by_species(-Petal.Width, 3)
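For readers on a much later dplyr than the 0.3 release discussed here, the same helper can probably be written without lazyeval. The sketch below assumes dplyr 1.0 or newer, where the {{ }} operator from rlang forwards an unevaluated argument into filter(). That version requirement is an assumption, not something this answer was written against.

```r
library(dplyr)

# Sketch only: assumes dplyr >= 1.0 (with the rlang {{ }} operator).
keep_min_n_by_species <- function(expr, n) {
  iris %>%
    group_by(Species) %>%
    # {{ expr }} injects the caller's expression, so it is evaluated
    # against the columns of each group rather than in the calling scope.
    filter(rank({{ expr }}, ties.method = "random") <= n) %>%
    ungroup()
}

set.seed(1)  # ties are broken at random, so fix the seed for repeatability
smallest <- keep_min_n_by_species(Petal.Width, 2)
largest  <- keep_min_n_by_species(-Petal.Width, 2)
```

Because ties are broken at random, each species contributes exactly n rows, but which of several tied rows is kept can change between runs unless the seed is fixed.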
How about
keep_min_n_by_species <- function(expr, n) {
mc <- match.call()
fx <- bquote(rank(.(mc$expr), ties.method = 'random') <= .(mc$n))
iris %.% group_by(Species) %.% filter(fx) %.% ungroup()
}
That seems to allow all the statements to run without error
keep_min_n_by_species(Petal.Width, 2)
keep_min_n_by_species(-Petal.Width, 2)
keep_min_n_by_species(Petal.Width, 3)
keep_min_n_by_species(-Petal.Width, 3)
The idea is that we use match.call() to capture the unevaluated expressions passed to the function. Then we use bquote() to build the filter as a call object.
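To make the mechanism concrete without depending on a particular dplyr version, here is a base R sketch of the same idea. It uses substitute() instead of match.call() to capture the argument, and it replaces group_by/filter with split() and eval(). Those substitutions, and the name keep_min_n_base, are assumptions made for illustration rather than part of the answer above.

```r
# Base R sketch: build the filter condition as a call, then evaluate it
# inside each group's data frame.
keep_min_n_base <- function(expr, n) {
  col_expr <- substitute(expr)   # the caller's expression, unevaluated
  # .() splices values into the quoted template, giving for example
  # rank(Petal.Width, ties.method = "random") <= 2
  cond <- bquote(rank(.(col_expr), ties.method = "random") <= .(n))
  pieces <- lapply(split(iris, iris$Species), function(grp) {
    grp[eval(cond, grp), , drop = FALSE]   # columns of grp are in scope
  })
  do.call(rbind, pieces)
}

set.seed(1)  # ties are broken at random
res <- keep_min_n_base(-Petal.Width, 2)
```

The key point is that the condition is constructed as a single call object first and only evaluated once a data frame is available to supply the column names.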
I want to parameterise the following computation using dplyr that finds which values of Sepal.Length are associated with more than one value of Sepal.Width:
library(dplyr)
iris %>%
group_by(Sepal.Length) %>%
summarise(n.uniq=n_distinct(Sepal.Width)) %>%
filter(n.uniq > 1)
Normally I would write something like this:
not.uniq.per.group <- function(data, group.var, uniq.var) {
data %>%
group_by(group.var) %>%
summarise(n.uniq=n_distinct(uniq.var)) %>%
filter(n.uniq > 1)
}
However, this approach throws errors because dplyr uses non-standard evaluation. How should this function be written?
You need to use the standard evaluation versions of the dplyr functions (just append '_' to the function names, i.e. group_by_ and summarise_) and pass strings to your function, which you then need to turn into symbols. To parameterise the argument of summarise_, you will need to use interp(), which is defined in the lazyeval package. Concretely:
library(dplyr)
library(lazyeval)
not.uniq.per.group <- function(df, grp.var, uniq.var) {
df %>%
group_by_(grp.var) %>%
summarise_( n_uniq=interp(~n_distinct(v), v=as.name(uniq.var)) ) %>%
filter(n_uniq > 1)
}
not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
Note that in recent versions of dplyr the standard evaluation versions of the dplyr functions have been "soft deprecated" in favor of non-standard evaluation.
See the Programming with dplyr vignette for more information on working with non-standard evaluation.
Like the old dplyr versions up to 0.5, the new dplyr has facilities for both standard evaluation (SE) and nonstandard evaluation (NSE). But they are expressed differently than before.
If you want an NSE function, you pass bare expressions and use enquo to capture them as quosures. If you want an SE function, just pass quosures (or symbols) directly, then unquote them in the dplyr calls. Here is the SE solution to the question:
library(tidyverse)
library(rlang)
f1 <- function(df, grp.var, uniq.var) {
df %>%
group_by(!!grp.var) %>%
summarise(n_uniq = n_distinct(!!uniq.var)) %>%
filter(n_uniq > 1)
}
a <- f1(iris, quo(Sepal.Length), quo(Sepal.Width))
b <- f1(iris, sym("Sepal.Length"), sym("Sepal.Width"))
identical(a, b)
#> [1] TRUE
Note how the SE version enables you to work with string arguments - just turn them into symbols first using sym(). For more information, see the programming with dplyr vignette.
In the devel version of dplyr (soon to be released 0.6.0), we can also make use of slightly different syntax for passing the variables.
f1 <- function(df, grp.var, uniq.var) {
grp.var <- enquo(grp.var)
uniq.var <- enquo(uniq.var)
df %>%
group_by(!!grp.var) %>%
summarise(n_uniq = n_distinct(!!uniq.var)) %>%
filter(n_uniq >1)
}
res2 <- f1(iris, Sepal.Length, Sepal.Width)
res1 <- not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
identical(res1, res2)
#[1] TRUE
Here enquo captures each argument as a quosure (similar to substitute in base R) without evaluating it. Inside group_by and summarise, we then unquote the quosure (with !! or UQ) so that it is evaluated against the data.
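The comparison with substitute() can be made concrete with a base R analogue. This is only a sketch of the capture-then-evaluate pattern, not how dplyr implements it. The function name count_distinct_by is made up for the illustration, and sapply() over split() stands in for group_by() plus summarise().

```r
# Base R analogue of enquo() + !!: capture the argument expressions
# unevaluated, then evaluate them with the data frame's columns in scope.
count_distinct_by <- function(df, grp.var, uniq.var) {
  grp_expr  <- substitute(grp.var)    # e.g. the symbol Sepal.Length
  uniq_expr <- substitute(uniq.var)
  grp  <- eval(grp_expr,  df, parent.frame())
  uniq <- eval(uniq_expr, df, parent.frame())
  # one count of distinct values per group, named by the group value
  n_uniq <- sapply(split(uniq, grp), function(v) length(unique(v)))
  n_uniq[n_uniq > 1]                  # keep groups with several values
}

res <- count_distinct_by(iris, Sepal.Length, Sepal.Width)
head(res)
```

Unlike a quosure, the expression captured by substitute() does not remember the environment it came from, which is why parent.frame() is passed to eval() explicitly here.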
Here's the way to do it from rlang 0.4 onwards, using the curly-curly {{ }} pseudo-operator:
library(dplyr)
not.uniq.per.group <- function(data, group.var, uniq.var) {
data %>%
group_by({{ group.var }}) %>%
summarise(n.uniq = n_distinct({{ uniq.var }})) %>%
filter(n.uniq > 1)
}
iris %>% not.uniq.per.group(Sepal.Length, Sepal.Width)
#> # A tibble: 25 x 2
#> Sepal.Length n.uniq
#> <dbl> <int>
#> 1 4.4 3
#> 2 4.6 4
#> 3 4.8 3
#> 4 4.9 5
#> 5 5 8
#> 6 5.1 6
#> 7 5.2 4
#> 8 5.4 4
#> 9 5.5 6
#> 10 5.6 5
#> # ... with 15 more rows
In the current version of dplyr (0.7.4) the use of the standard evaluation function versions (appended '_' to the function name, e.g. group_by_) is deprecated.
Instead you should rely on tidyeval when writing functions.
Here's an example of how your function would look then:
# definition of your function
not.uniq.per.group <- function(data, group.var, uniq.var) {
# enquotes variables to be used with dplyr-functions
group.var <- enquo(group.var)
uniq.var <- enquo(uniq.var)
# use '!!' before parameter names in dplyr-functions
data %>%
group_by(!!group.var) %>%
summarise(n.uniq=n_distinct(!!uniq.var)) %>%
filter(n.uniq > 1)
}
# call of your function
not.uniq.per.group(iris, Sepal.Length, Sepal.Width)
If you want to learn all about the details, there's an excellent vignette by the dplyr-team on how this works.
I've written a function in the past that does something similar to what you're doing, except that it explores all the columns outside the primary key and looks for multiple unique values per group.
find_dups = function(.table, ...) {
require(dplyr)
require(tidyr)
# get column names of primary key
pk <- .table %>% select(...) %>% names
other <- names(.table)[!(names(.table) %in% pk)]
# group by primary key,
# get number of rows per unique combo,
# filter for duplicates,
# get number of distinct values in each column,
# gather to get df of 1 row per primary key, other column,
# filter for where a columns have more than 1 unique value,
# order table by primary key
.table %>%
group_by(...) %>%
mutate(cnt = n()) %>%
filter(cnt > 1) %>%
select(-cnt) %>%
summarise_each(funs(n_distinct)) %>%
gather_('column', 'unique_vals', other) %>%
filter(unique_vals > 1) %>%
arrange(...) %>%
return
# Final dataframe:
## One row per primary key and column that creates duplicates.
## Last column indicates how many unique values of
## the given column exist for each primary key.
}
This function also works with the piping operator:
dat %>% find_dups(key1, key2)
You can avoid lazyeval by using do to call an anonymous function and then using get. This solution can be used more generally to employ multiple aggregations. I usually write the function separately.
library(dplyr)
not.uniq.per.group <- function(df, grp.var, uniq.var) {
df %>%
group_by_(grp.var) %>%
do((function(., uniq.var) {
with(., data.frame(n_uniq = n_distinct(get(uniq.var))))
}
)(., uniq.var)) %>%
filter(n_uniq > 1)
}
not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
This is a fairly simple question, but I couldn't find the answer via Google or Stack Exchange, or in the magrittr documentation.
How do you assign the result of a chain of functions connected via %>% to an object, for example to create a vector?
What I saw most people do is:
a <-
data.frame( x = c(1:3), y = (4:6)) %>%
sum()
But is there also a solution where I can pipe the result forward into an object, maybe via an alias or something of the like, somewhat like this:
data.frame( x = c(1:3), y = (4:6)) %>%
sum() %>%
a <- ()
This would help with keeping all of the code in the same logic of feeding results forward "down the pipe".
Try this:
data.frame( x = c(1:3), y = (4:6)) %>% sum -> a
You can do it like so:
data.frame( x = c(1:3), y = (4:6)) %>%
sum %>%
assign(x="a",value=.,pos=1)
A couple of things to note:
You can use "." to tell magrittr which argument the object being brought forward belongs in. By default it is the first, but here I use the . to indicate that I want it in the second value argument instead.
Second I had to use the pos=1 argument to make the assignment in the global environment.
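A small sketch of the placeholder rule described above. The helper show_args is hypothetical and exists only to make the argument positions visible. envir = globalenv() is used in place of pos = 1, since both name the global environment.

```r
library(magrittr)

# Hypothetical helper that reports which argument received which value.
show_args <- function(first, second) {
  paste0("first=", first, ", second=", second)
}

default_slot  <- "lhs" %>% show_args("other")     # lhs becomes `first`
explicit_slot <- "lhs" %>% show_args("other", .)  # lhs becomes `second`

# Same idea with assign(): the piped value goes to `value`, not `x`.
c(1, 2, 3) %>% sum %>% assign("total", ., envir = globalenv())
```

Naming the environment explicitly matters because assign() would otherwise write into whatever environment the pipe happens to evaluate its right-hand side in.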
You can also use the <<- operator:
data.frame( x = c(1:3), y = (4:6)) %>%
sum() %>%
`<<-`(a,.)
Edit: I think John Paul's is the safest suggestion, and you could keep going with the chain doing different assignments of partial results. For example:
data.frame( x = c(1:3), y = (4:6)) %>%
sum %>%
assign(x="a",value=., pos=1) %>%
exp %>%
assign(x="b",value=., pos=1) %>%
sqrt %>%
assign(x="c", value=., pos=1)
This will correctly create a, b and c.
Using pipeR's %>>% this should be very easy.
library(pipeR)
data.frame( x = c(1:3), y = (4:6)) %>>%
sum %>>%
(~ a)
The pipeR tutorial may be helpful: http://renkun.me/pipeR-tutorial/
For assignment: http://renkun.me/pipeR-tutorial/Pipe-operator/Pipe-with-assignment.html
What I like to do (and I found this trick somewhere I can't remember) is to use {.} -> obj at the end of my pipe-chain. This way I can add extra steps to the end of the chain by just inserting a new line, and not have to re-position the -> assignment operator.
You can also use (.) instead of {.}, but it looks a bit odd.
For example, instead of this:
iris %>%
ddply(.(Species), summarise,
mean.petal = mean(Petal.Length),
mean.sepal = mean(Sepal.Length)) -> summary
Do this:
iris %>%
ddply(.(Species), summarise,
mean.petal = mean(Petal.Length),
mean.sepal = mean(Sepal.Length)) %>%
{.} -> summary
It makes it easier to see where your piped data ends up. Also, while it doesn't seem like a big deal, it's easier to add another final step as you don't need to move the -> down to a new line, just add a new line before the {.} and add the step.
Like so:
iris %>%
ddply(.(Species), summarise,
mean.petal = mean(Petal.Length),
mean.sepal = mean(Sepal.Length)) %>%
arrange(desc(mean.petal)) %>% # just add a step here
{.} -> summary
This doesn't help with saving intermediate results though. John Paul's answer to use assign() is nice, but it's a bit long to type. You need to use the . since the data isn't the first argument, you have to put the name of the new object in quotes, and you have to specify the environment (pos = 1). It seems lazy on my part, but using %>% is about speed.
So I wrapped the assign() in a little function which speeds it up a bit:
keep <- function(x, name) {assign(as.character(substitute(name)), x, pos = 1)}
So now you can do this:
keep <- function(x, name) {assign(as.character(substitute(name)), x, pos = 1)}
iris %>%
ddply(.(Species), summarise,
mean.petal = mean(Petal.Length),
mean.sepal = mean(Sepal.Length)) %>% keep(unsorted.data) %>% # keep this step
arrange(mean.petal) %>%
{.} -> sorted.data
sorted.data
# Species mean.petal mean.sepal
#1 setosa 1.462 5.006
#2 versicolor 4.260 5.936
#3 virginica 5.552 6.588
unsorted.data
# Species mean.petal mean.sepal
#1 setosa 1.462 5.006
#2 versicolor 4.260 5.936
#3 virginica 5.552 6.588
I was expecting to see the same results from these two runs, but they are different. It makes me question whether I really understand how the dplyr code is working (I have read pretty much everything I can find about dplyr in the package and online). Can anyone explain why the results are different, or how to obtain similar results?
library(dplyr)
x <- iris
x <- x %.%
group_by(Species, Sepal.Width) %.%
summarise (freq=n()) %.%
summarise (mean_by_group = mean(Sepal.Width))
print(x)
x <- iris
x <- tapply(x$Sepal.Width, x$Species, mean)
print(x)
Update: I don't think this is the most efficient way to do this, but the following code gives a result that matches the tapply approach. Per Hadley's suggestion, I scrutinized the results line by line, and this is the best I could come up with using dplyr
library(dplyr)
x <- iris
x <- x %.%
group_by(Species, Sepal.Width) %.%
summarise (freq=n()) %.%
mutate (mean_by_group = sum(Sepal.Width*freq)/sum(freq))
print(x)
Update: for some reason I thought I had to group all variables I wanted to analyse, which is what sent things in the wrong direction. This is all I needed, which is closer to the examples in the package.
x <- iris %.%
group_by(Species) %.%
summarise(Sepal.Width = mean(Sepal.Width))
print(x)
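The difference between the first attempt and tapply() can be reproduced in base R. After grouping by both Species and Sepal.Width, each distinct width appears once per species, so a plain mean over those rows ignores how often each width occurs. The sketch below uses table() as a stand-in for group_by() plus summarise(freq = n()), and the variable names are illustrative.

```r
# One row per (Species, Sepal.Width) pair with its frequency.
freq <- as.data.frame(
  table(Species = iris$Species, Sepal.Width = iris$Sepal.Width),
  stringsAsFactors = FALSE
)
freq <- freq[freq$Freq > 0, ]                      # drop pairs that never occur
freq$Sepal.Width <- as.numeric(freq$Sepal.Width)   # table() stored them as text

# Mean of the distinct widths: what the first attempt computes.
unweighted <- tapply(freq$Sepal.Width, freq$Species, mean)

# Frequency-weighted mean: what the first update computes.
weighted <- sapply(split(freq, freq$Species), function(d) {
  sum(d$Sepal.Width * d$Freq) / sum(d$Freq)
})

# Mean over all rows: what tapply() on the raw data computes.
direct <- tapply(iris$Sepal.Width, iris$Species, mean)
```

Here weighted agrees with direct, while unweighted does not, which is the discrepancy described in the question.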
Maybe this...
- dplyr:
require(dplyr)
iris %>% group_by(Species) %>% summarise(mean_width = mean(Sepal.Width))
# Source: local data frame [3 x 2]
#
# Species mean_width
# 1 setosa 3.428
# 2 versicolor 2.770
# 3 virginica 2.974
- tapply:
tapply(iris$Sepal.Width, iris$Species, mean)
# setosa versicolor virginica
# 3.428 2.770 2.974
NOTE: tapply() simplifies output by default whereas summarise() does not:
typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=TRUE))
# [1] "double"
It returns a list otherwise:
typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=FALSE))
# [1] "list"
So to actually get the same type of output from tapply() you would need:
tbl_df(
data.frame(
mean_width = tapply( iris$Sepal.Width,
iris$Species,
mean )))
# Source: local data frame [3 x 1]
#
# mean_width
# setosa 3.428
# versicolor 2.770
# virginica 2.974
And this still isn't the same, because the species names are row names (an attribute) here and not a column of the data frame.
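If the goal is a result shaped like the summarise() output, with the species as an ordinary column, one option is to move the names of the tapply() result into a column by hand. A minimal base R sketch (the object names are illustrative):

```r
means <- tapply(iris$Sepal.Width, iris$Species, mean)

# names(means) holds the group labels; turn them into a real column.
by_species <- data.frame(
  Species    = names(means),
  mean_width = as.vector(means),
  row.names  = NULL
)
by_species
```

This gives one row per species with two columns, which can then be wrapped in tbl_df() as above if a dplyr table is wanted.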