Consider this dplyr treatment to a data frame:
existing.df <- filter(existing.df, justanEx > 0) %>%
arrange(desc(justanEx)) %>%
mutate(mean = mean(justanEx),
median = median(justanEx),
rank = seq_len(length(anotherVar)))
I have to do this a lot in a job I'm doing, so I tried making a function for it:
df.overZ <- function(data, var){
df <- data %>% filter(var > 0) %>%
arrange_(desc((var))) %>%
mutate(mean = mean(var),
median = median(var),
rank = seq_len(length(anotherVar)))
df
}
and then
existing.df <- df.overZ(existing.df, "realVar")
but this gives me this error:
Error in arrange_impl(.data, dots) :
incorrect size (1), expecting : 50000
If I try:
existing.df <- df.overZ(existing.df, realVar)
I get this error:
Error in filter_impl(.data, dots) : obj 'realVar' not found
I have already tried filter_, arrange_ and mutate_,
but nothing seems to work.
Can this work?
The following function works, though:
make.df <- function(var, n){
df <- orign.df %>% filter(!is.na(var)) %>%
select(1:2,n,3:6)
df
}
existing.df <- make.df("oneVar",7)
With the devel version of dplyr (soon to be released as 0.6.0), we can make use of quosures:
library(dplyr)
df.overZ <- function(data, Var){
Var <- enquo(Var)
data %>%
filter(UQ(Var) > 0) %>%
arrange(desc(UQ(Var))) %>%
mutate(Mean = mean(UQ(Var)),
Median = median(UQ(Var)),
rank = row_number())
}
df.overZ(iris, Sepal.Length)
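On newer installations the same idea can be sketched with the embrace operator `{{ }}`, which captures and unquotes the argument in one step. This assumes dplyr 1.0 or later with rlang 0.4.0 or later; the name `df_over_zero` is made up for illustration and is not from the original answer.

```r
library(dplyr)

# Hypothetical rewrite of df.overZ using {{ }} instead of enquo() + UQ()
df_over_zero <- function(data, var) {
  data %>%
    filter({{ var }} > 0) %>%            # keep rows where the column is positive
    arrange(desc({{ var }})) %>%         # sort descending by that column
    mutate(Mean = mean({{ var }}),
           Median = median({{ var }}),
           rank = row_number())          # rank follows the sorted order
}

res <- df_over_zero(iris, Sepal.Length)
head(res, 3)
```

The column is still passed unquoted, exactly as in `df.overZ(iris, Sepal.Length)`.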
We can extend this function to have a group_by option as well
df.overZ2 <- function(data, Var, grpVar){
Var <- enquo(Var)
grpVar <- enquo(grpVar)
newVar <- paste(quo_name(Var), c("Mean", "Median", "Rank"), sep="_")
data %>%
filter(UQ(Var) > 0) %>%
arrange(desc(UQ(Var))) %>%
group_by(UQ(grpVar)) %>%
summarise(UQ(newVar[1]) := mean(UQ(Var)),
UQ(newVar[2]) := median(UQ(Var)),
UQ(newVar[3]) := n())
}
df.overZ2(iris, Sepal.Length, Species)
# A tibble: 3 × 4
# Species Sepal.Length_Mean Sepal.Length_Median Sepal.Length_Rank
# <fctr> <dbl> <dbl> <int>
#1 setosa 5.006 5.0 50
#2 versicolor 5.936 5.9 50
#3 virginica 6.588 6.5 50
Here, enquo does a job similar to substitute from base R: it takes the input argument and converts it to a quosure. Within the dplyr functions (filter/arrange/mutate/summarise/group_by) we then unquote it (!! or UQ) so that it gets evaluated. We can also name the new columns by unquoting a name string on the lhs of the assignment (:=).
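As a sketch of the same naming trick in more recent versions: assuming rlang 0.4.3 or later, the lhs of `:=` can be a glue-style string that embeds the captured argument, so the `paste()` step is not needed. The helper name `summarise_var` is hypothetical.

```r
library(dplyr)

# Hypothetical helper: output columns are named after the captured variable
summarise_var <- function(data, var, grp) {
  data %>%
    filter({{ var }} > 0) %>%
    group_by({{ grp }}) %>%
    summarise("{{ var }}_Mean"   := mean({{ var }}),
              "{{ var }}_Median" := median({{ var }}),
              "{{ var }}_N"      := n())
}

res <- summarise_var(iris, Sepal.Length, Species)
names(res)
```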
Say I have a function that returns two scalars, and I want to use it with summarize, e.g.
fn = function(x) {
list(mean(x), sd(x))
}
iris %>%
summarize(fn(Petal.Length)) # Error: Column `fn(Petal.Length)` must be length 1 (a summary value), not 2
iris %>%
summarize(c("a","b") := fn(Petal.Length))
# Error: The LHS of `:=` must be a string or a symbol Run `rlang::last_error()` to see where the error occurred.
I tried both ways, but can't figure it out.
However, this can be done with data.table
library(data.table)
iris1 = copy(iris)
setDT(iris1)[, fn(Petal.Length)]
Is there a way to do this in dplyr?
Yes, you can save them as a list in a column and then use unnest_wider to separate them in different columns.
fn = function(x) {
list(mean = mean(x),sd = sd(x))
}
library(dplyr)
library(tidyr)
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_wider(temp)
# A tibble: 1 x 2
# mean sd
# <dbl> <dbl>
#1 3.76 1.77
Or unnest_longer to have them in separate rows
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_longer(temp)
# temp temp_id
# <dbl> <chr>
#1 3.76 mean
#2 1.77 sd
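Another option, assuming dplyr 1.0.0 or later: if the function returns a one-row data frame and the result is left unnamed inside summarise, its columns are unpacked directly, so no tidyr step is needed. The name `fn_df` is made up for this sketch.

```r
library(dplyr)

# Hypothetical variant of fn that returns a one-row tibble instead of a list
fn_df <- function(x) {
  tibble(mean = mean(x), sd = sd(x))
}

# The unnamed data-frame result is spliced into separate columns
res <- iris %>%
  summarise(fn_df(Petal.Length))
res
```

The same call works after a group_by, giving one row of mean and sd per group.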
So I'm trying to do some programming in dplyr and I am having some trouble with the enquo and !! evaluations.
Basically I would like to mutate a column to a dynamic column name, and then be able to further manipulate that column (i.e. summarize). For instance:
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1)
}
my_function(iris, Petal.Length)
This works great and returns a column called "Petal.Length_adjusted" which is just Petal.Length increased by one.
However I can't seem to summarize this new column.
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
mean_col <- paste0(quo_column, "_meanAdjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1) %>%
group_by(Species) %>%
summarize(!!mean_col := mean(!!new_col))
}
my_function(iris, Petal.Length)
This results in a warning stating the argument "Petal.Length_adjusted" is not numeric or logical, although the output from the mutate call gives a numeric column.
How do I reference this dynamically generated column name to pass it in further dplyr functions?
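Unlike quo_column, which is a quosure, new_col and mean_col are strings. Unquoting a string with !! just inserts the string itself, so mean() receives a character value rather than the column. We therefore convert the string to a symbol using sym (from rlang) and then unquote it: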
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
mean_col <- paste0(quo_column, "_meanAdjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1) %>%
group_by(Species) %>%
summarise(!!mean_col := mean(!! rlang::sym(new_col)))
}
head(my_function(iris, Petal.Length))
# A tibble: 3 x 2
# Species Petal.Length_meanAdjusted
# <fct> <dbl>
#1 setosa 2.46
#2 versicolor 5.26
#3 virginica 6.55
I am trying to use dplyr's group_by in a local function, example:
testFunction <- function(df, x) {
df %>%
group_by(x) %>%
summarize(mean.Petal.Width = mean(Petal.Width))
}
testFunction(iris, Species)
and I get an error "... unknown variable to group by: x"
I've tried group_by_ and it gives me a summary of the entire dataset.
Anybody have a clue how I can fix this?
Thanks in advance!
Here is one way to do it with the new enquo from dplyr: enquo captures the unquoted argument and converts it to a quosure, which then gets evaluated by unquoting (UQ or !!) inside group_by, mutate, summarise etc.
library(dplyr)
testFunction <- function(df, x) {
x <- enquo(x)
df %>%
group_by(!! x) %>%
summarize(mean.Petal.Width = mean(Petal.Width))
}
testFunction(iris, Species)
# A tibble: 3 x 2
# Species mean.Petal.Width
# <fctr> <dbl>
#1 setosa 0.246
#2 versicolor 1.326
#3 virginica 2.026
I got it to work like this:
testFunction <- function(df, x) {
df %>%
group_by(get(x)) %>%
summarize(mean.Petal.Width = mean(Petal.Width))
}
testFunction(iris,"Species")
I changed x to get(x), and Species to "Species" in testFunction(iris,...).
I want to use the dplyr::group_by function inside another function, but I do not know how to pass the arguments to this function.
Can someone provide a working example?
library(dplyr)
data(iris)
iris %.% group_by(Species) %.% summarise(n = n()) #
## Source: local data frame [3 x 2]
## Species n
## 1 virginica 50
## 2 versicolor 50
## 3 setosa 50
mytable0 <- function(x, ...) x %.% group_by(...) %.% summarise(n = n())
mytable0(iris, "Species") # OK
## Source: local data frame [3 x 2]
## Species n
## 1 virginica 50
## 2 versicolor 50
## 3 setosa 50
mytable1 <- function(x, key) x %.% group_by(as.name(key)) %.% summarise(n = n())
mytable1(iris, "Species") # Wrong!
# Error: unsupported type for column 'as.name(key)' (SYMSXP)
mytable2 <- function(x, key) x %.% group_by(key) %.% summarise(n = n())
mytable2(iris, "Species") # Wrong!
# Error: index out of bounds
For programming, group_by_ is the counterpart to group_by:
library(dplyr)
mytable <- function(x, ...) x %>% group_by_(...) %>% summarise(n = n())
mytable(iris, "Species")
# or iris %>% mytable("Species")
which gives:
Species n
1 setosa 50
2 versicolor 50
3 virginica 50
Update At the time this was written dplyr used %.% which is what was originally used above but now %>% is favored so have changed above to that to keep this relevant.
Update 2 regroup is now deprecated, use group_by_ instead.
Update 3 group_by_(list(...)) now becomes group_by_(...) in new version of dplyr as per Roberto's comment.
Update 4 Added minor variation suggested in comments.
Update 5: With rlang/tidyeval it is now possible to do this:
library(rlang)
mytable <- function(x, ...) {
group_ <- syms(c(...))  # syms() takes a single vector or list of names
x %>%
group_by(!!!group_) %>%
summarise(n = n())
}
mytable(iris, "Species")
or passing Species unevaluated, i.e. no quotes around it:
library(rlang)
mytable <- function(x, ...) {
group_ <- enquos(...)
x %>%
group_by(!!!group_) %>%
summarise(n = n())
}
mytable(iris, Species)
Update 6: There is now a {{...}} notation that works if there is just one grouping variable:
mytable <- function(x, group) {
x %>%
group_by({{group}}) %>%
summarise(n = n())
}
mytable(iris, Species)
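For more than one unquoted grouping variable, a minimal sketch is to forward `...` straight to group_by, with no quoting or unquoting step. The function name `mytable_dots` and the use of `mtcars` are just for illustration; the `.groups` argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical helper: dots are passed through to group_by unchanged
mytable_dots <- function(x, ...) {
  x %>%
    group_by(...) %>%
    summarise(n = n(), .groups = "drop")
}

res <- mytable_dots(mtcars, cyl, gear)
res
```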
UPDATE: As of dplyr 0.7.0 you can use tidy eval to accomplish this.
See http://dplyr.tidyverse.org/articles/programming.html for more details.
library(tidyverse)
data("iris")
my_table <- function(df, group_var) {
group_var <- enquo(group_var) # Create quosure
df %>%
group_by(!!group_var) %>% # Use !! to unquote the quosure
summarise(n = n())
}
my_table(iris, Species)
# A tibble: 3 x 2
Species n
<fctr> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
As a complement to Update 6 in the answer by @G. Grothendieck: if you want to pass the grouping variable as a string, then instead of embracing the argument with doubled braces ({{), you should use the .data pronoun, as described in the Programming vignette under "Loop over multiple variables":
mytable <- function( x, group ) {
x %>%
group_by( .data[[group]] ) %>%
summarise( n = n() )
}
group_string <- 'Species'
mytable( iris, group_string )
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
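If several column names arrive as a character vector, one way to sketch it is with `across(all_of(...))` inside group_by. This assumes dplyr 1.0 or later; `mytable_chr` and the `mtcars` example are not from the original answer.

```r
library(dplyr)

# Hypothetical helper: `groups` is a character vector of column names
mytable_chr <- function(x, groups) {
  x %>%
    group_by(across(all_of(groups))) %>%   # all_of() errors on unknown names
    summarise(n = n(), .groups = "drop")
}

group_strings <- c("cyl", "gear")
res <- mytable_chr(mtcars, group_strings)
res
```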
Ugly as they come, but she works:
mytable3 <- function(x, key) {
my.call <- bquote(summarise(group_by(.(substitute(x)), NULL), n = n()))
my.call[[2]][[3]] <- as.name(key)
eval(my.call, parent.frame())
}
mytable3(iris, "Species")
# Source: local data frame [3 x 2]
#
# Species n
# 1 virginica 50
# 2 versicolor 50
# 3 setosa 50
There are almost certainly cases that will cause this to break, but you get the idea. I don't think you can get around messing with the call. One other thing that did work but was even uglier is:
mytable4 <- function(x, key) summarise(group_by(x, x[[key]]), n = n())