Using dplyr group_by in a function - r

I am trying to use dplyr's group_by in a local function, example:
testFunction <- function(df, x) {
df %>%
group_by(x) %>%
summarize(mean.Petal.Width = mean(Petal.Width))
}
testFunction(iris, Species)
and I get an error "... unknown variable to group by: x"
I've tried group_by_ and it gives me a summary of the entire dataset.
Anybody have a clue how I can fix this?
Thanks in advance!

Here is one way to do it with the new enquo from dplyr: enquo captures the expression passed as the argument (here the bare column name, not a string) and wraps it in a quosure, which is then evaluated by unquoting it (UQ or !!) inside group_by, mutate, summarise, etc.
library(dplyr)
testFunction <- function(df, x) {
x <- enquo(x)
df %>%
group_by(!! x) %>%
summarize(mean.Petal.Width = mean(Petal.Width))
}
testFunction(iris, Species)
# A tibble: 3 x 2
# Species mean.Petal.Width
# <fctr> <dbl>
#1 setosa 0.246
#2 versicolor 1.326
#3 virginica 2.026

I got it to work like this:
testFunction <- function(df, x) {
df %>%
group_by(get(x)) %>%
summarize(mean.Petal.Width = mean(Petal.Width))
}
testFunction(iris,"Species")
I changed x to get(x), and Species to "Species" in testFunction(iris,...).
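A minimal sketch of another string-based variant, assuming a dplyr version that provides the .data pronoun (0.7.0 or later). With get(x), the grouping column in the result is labelled get(x); indexing the pronoun with .data[[x]] should keep the original column name instead. The function name mean_petal_width_by is made up for this illustration.

```r
library(dplyr)

# Hypothetical helper: x is a column name given as a string
mean_petal_width_by <- function(df, x) {
  df %>%
    group_by(.data[[x]]) %>%   # look the column up by its name
    summarize(mean.Petal.Width = mean(Petal.Width))
}

res <- mean_petal_width_by(iris, "Species")
res
```

The call site stays the same as in the get() version: the column name is passed in quotes.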

Related

Pass a variable name to a user written function that uses dplyr

I am trying to write a function that refers to variables by name.
In particular, in my function I use mutate to re-encode a variable without changing its name. Does anyone know how I can refer to a variable on the left-hand side of mutate?
Here is an example
library(tidyverse)
# first create relevant dataset
iris <- iris%>% group_by(Species) %>% mutate(mean_Length=mean(Sepal.Length))
# second create my function
userfunction <- function(var){
newdata <- iris %>%
select(mean_Length,{var}) %>% distinct() %>%
mutate(get(var)= # this is what causes my function to fail. How can i refer to the `var` here?
factor(get(var),get(var))) %>%
arrange(get(var)) #
return(newdata)
}
# this function produces the following error # Error: unexpected '}' in "}"
#note that if I change the reference to its original string the function works
userfunction2 <- function(var){
newdata <- iris %>%
select(mean_Length,{var}) %>% distinct() %>%
mutate(Species= # without reference it works, but I am unable to use the function for multiple variables.
factor(get(var),get(var))) %>%
arrange(get(var)) #
return(newdata)
}
encodedata<- userfunction2("Species")
Thanks a lot in advance for your help
Best
Here is a working example that goes into a similar direction as Limey's answer:
iris <- datasets::iris %>%
group_by(Species) %>%
mutate(mean_Length=mean(Sepal.Length)) %>%
ungroup()
userfunction <- function(var){
iris %>%
transmute(mean_Length, "temp" = iris[[var]]) %>%
distinct() %>%
mutate("{var}" := factor(temp)) %>%
arrange(temp) %>%
select(-temp)
}
userfunction("Petal.Length")
I don't think var is your problem. I think it's the =. If you have an enquoted variable on the left-hand side of the assignment (which is effectively what you do have with get()), you need :=, not =.
See here for more details.
I would have written your function slightly differently:
userfunction <- function(data, var){
qVar <- enquo(var)
newdata <- data %>%
select(mean_Length, !! qVar) %>% distinct() %>%
mutate(!! qVar := factor(!! qVar, !! qVar)) %>%
arrange(!! qVar)
return(newdata)
}
The inclusion of the data parameter means you can include it in a pipe:
encodedata <- iris %>% userfunction(Species)
encodedata
# A tibble: 3 x 2
# Groups: Species [3]
mean_Length Species
<dbl> <fct>
1 5.01 setosa
2 5.94 versicolor
3 6.59 virginica
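If the column name has to stay a string, as in the original userfunction("Species") call, the two answers above can be combined. This is only a sketch, assuming dplyr 1.0.0 or later (for the "{var}" := naming syntax already used above); encode_column and iris2 are hypothetical names chosen for this example.

```r
library(dplyr)

# Same preparation as in the question, kept in a separate object
iris2 <- datasets::iris %>%
  group_by(Species) %>%
  mutate(mean_Length = mean(Sepal.Length)) %>%
  ungroup()

# Hypothetical helper: var is a column name given as a string
encode_column <- function(data, var) {
  data %>%
    select(mean_Length, all_of(var)) %>%        # pick the column by name
    distinct() %>%
    mutate("{var}" := factor(.data[[var]])) %>% # := allows a computed name on the left
    arrange(.data[[var]])
}

encoded <- encode_column(iris2, "Species")
encoded
```

all_of() handles the selection and .data[[var]] handles the lookup inside mutate and arrange, so no temporary column is needed.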

Summarizing by dynamic column name in dplyr

So I'm trying to do some programming in dplyr and I am having some trouble with the enquo and !! evaluations.
Basically I would like to mutate a column to a dynamic column name, and then be able to further manipulate that column (i.e. summarize). For instance:
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1)
}
my_function(iris, Petal.Length)
This works great and returns a column called "Petal.Length_adjusted" which is just Petal.Length increased by one.
However I can't seem to summarize this new column.
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
mean_col <- paste0(quo_column, "_meanAdjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1) %>%
group_by(Species) %>%
summarize(!!mean_col := mean(!!new_col))
}
my_function(iris, Petal.Length)
This results in a warning stating the argument "Petal.Length_adjusted" is not numeric or logical, although the output from the mutate call gives a numeric column.
How do I reference this dynamically generated column name to pass it in further dplyr functions?
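Unlike quo_column, which is a quosure, new_col and mean_col are strings. Unquoting a string on the right-hand side just inserts the string itself, so mean() receives a character value; we convert new_col to a symbol with sym (from rlang) and then unquote that.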
my_function <- function(data, column) {
quo_column <- enquo(column)
new_col <- paste0(quo_column, "_adjusted")[2]
mean_col <- paste0(quo_column, "_meanAdjusted")[2]
data %>%
mutate(!!new_col := (!!quo_column) + 1) %>%
group_by(Species) %>%
summarise(!!mean_col := mean(!! rlang::sym(new_col)))
}
head(my_function(iris, Petal.Length))
# A tibble: 3 x 2
# Species Petal.Length_meanAdjusted
# <fct> <dbl>
#1 setosa 2.46
#2 versicolor 5.26
#3 virginica 6.55
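The same idea can be written without indexing the output of paste0 on a quosure. The sketch below assumes dplyr 1.0.0 or later (for the `.groups` argument and `{{ }}`) and uses rlang::as_name to get the column name as a plain string; the function name adjust_and_summarise is invented for this example.

```r
library(dplyr)

# Hypothetical rewrite of my_function()
adjust_and_summarise <- function(data, column) {
  # as_name() turns the captured column into a string, e.g. "Petal.Length"
  col_name <- rlang::as_name(rlang::enquo(column))
  new_col  <- paste0(col_name, "_adjusted")
  mean_col <- paste0(col_name, "_meanAdjusted")

  data %>%
    mutate(!!new_col := {{ column }} + 1) %>%
    group_by(Species) %>%
    # .data[[new_col]] looks up the freshly created column by its string name
    summarise(!!mean_col := mean(.data[[new_col]]), .groups = "drop")
}

res <- adjust_and_summarise(iris, Petal.Length)
res
```

Here .data[[new_col]] plays the role of !! rlang::sym(new_col): both turn a string into a reference to a column.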

R dplyr methods inside own function

Consider this dplyr treatment to a data frame:
existing.df <- filter(existing.df, justanEx > 0) %>%
arrange(desc(justanEx)) %>%
mutate(mean = mean(justanEx),
median = median(justanEx),
rank = seq_len(length(anotherVar)))
I have to do this a lot on a job I'm doing, so I tried making a function for it:
df.overZ <- function(data, var){
df <- data %>% filter(var > 0) %>%
arrange_(desc((var))) %>%
mutate(mean = mean(var),
median = median(var),
rank = seq_len(length(anotherVar)))
df
}
and then
existing.df <- df.overZ(existing.df, "realVar")
but this gives me this error:
Error in arrange_impl(.data, dots) :
incorrect size (1), expecting : 50000
If I try:
existing.df <- df.overZ(existing.df, realVar)
I get this error:
Error in filter_impl(.data, dots) : obj 'realVar' not found
I have already tried filter_, arrange_ and mutate_,
but nothing seems to work.
Can this work?
The following function works, though:
make.df <- function(var, n){
df <- orign.df %>% filter(!is.na(var)) %>%
select(1:2,n,3:6)
df
}
existing.df <- make.df("oneVar",7)
With the devel version of dplyr (announced as 0.6.0, released as 0.7.0), we can make use of quosures
library(dplyr)
df.overZ <- function(data, Var){
Var <- enquo(Var)
data %>%
filter(UQ(Var) > 0) %>%
arrange(desc(UQ(Var))) %>%
mutate(Mean = mean(UQ(Var)),
Median = median(UQ(Var)),
rank = row_number())
}
df.overZ(iris, Sepal.Length)
We can extend this function to have a group_by option as well
df.overZ2 <- function(data, Var, grpVar){
Var <- enquo(Var)
grpVar <- enquo(grpVar)
newVar <- paste(quo_name(Var), c("Mean", "Median", "Rank"), sep="_")
data %>%
filter(UQ(Var) > 0) %>%
arrange(desc(UQ(Var))) %>%
group_by(UQ(grpVar)) %>%
summarise(UQ(newVar[1]) := mean(UQ(Var)),
UQ(newVar[2]) := median(UQ(Var)),
UQ(newVar[3]) := n())
}
df.overZ2(iris, Sepal.Length, Species)
# A tibble: 3 × 4
# Species Sepal.Length_Mean Sepal.Length_Median Sepal.Length_Rank
# <fctr> <dbl> <dbl> <int>
#1 setosa 5.006 5.0 50
#2 versicolor 5.936 5.9 50
#3 virginica 6.588 6.5 50
Here, enquo does a job similar to substitute from base R: it captures the expression supplied as the argument and wraps it in a quosure. Inside the verbs (filter/arrange/mutate/summarise/group_by) we then unquote it (!!, or the older UQ spelling) so that it is evaluated in the data. We can also set the column names dynamically by unquoting a string on the lhs of the assignment (:=)
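For comparison, here is a sketch of the grouped version written with the later `{{ }}` (embrace) notation, assuming dplyr 1.0.0 or later and rlang 0.4.3 or later, where a "{{ Var }}_Mean" string on the left of := builds the column name from the argument. The function name df_over_zero and the _N suffix are choices made for this illustration, not part of the answer above.

```r
library(dplyr)

# Hypothetical variant of df.overZ2() using {{ }} instead of enquo() + UQ()
df_over_zero <- function(data, Var, grpVar) {
  data %>%
    filter({{ Var }} > 0) %>%
    group_by({{ grpVar }}) %>%
    summarise(
      "{{ Var }}_Mean"   := mean({{ Var }}),    # name built from the argument
      "{{ Var }}_Median" := median({{ Var }}),
      "{{ Var }}_N"      := n(),                # group size
      .groups = "drop"
    )
}

res <- df_over_zero(iris, Sepal.Length, Species)
res
```

The arrange step is left out because sorting before summarise does not change the mean, median, or count.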

Is dplyr easier than data.table to be used within functions and loops? [duplicate]

I want to use use the dplyr::group_by function inside another function, but I do not know how to pass the arguments to this function.
Can someone provide a working example?
library(dplyr)
data(iris)
iris %.% group_by(Species) %.% summarise(n = n()) #
## Source: local data frame [3 x 2]
## Species n
## 1 virginica 50
## 2 versicolor 50
## 3 setosa 50
mytable0 <- function(x, ...) x %.% group_by(...) %.% summarise(n = n())
mytable0(iris, "Species") # OK
## Source: local data frame [3 x 2]
## Species n
## 1 virginica 50
## 2 versicolor 50
## 3 setosa 50
mytable1 <- function(x, key) x %.% group_by(as.name(key)) %.% summarise(n = n())
mytable1(iris, "Species") # Wrong!
# Error: unsupported type for column 'as.name(key)' (SYMSXP)
mytable2 <- function(x, key) x %.% group_by(key) %.% summarise(n = n())
mytable2(iris, "Species") # Wrong!
# Error: index out of bounds
For programming, group_by_ is the counterpart to group_by:
library(dplyr)
mytable <- function(x, ...) x %>% group_by_(...) %>% summarise(n = n())
mytable(iris, "Species")
# or iris %>% mytable("Species")
which gives:
Species n
1 setosa 50
2 versicolor 50
3 virginica 50
Update At the time this was written dplyr used %.% which is what was originally used above but now %>% is favored so have changed above to that to keep this relevant.
Update 2 regroup is now deprecated, use group_by_ instead.
Update 3 group_by_(list(...)) now becomes group_by_(...) in new version of dplyr as per Roberto's comment.
Update 4 Added minor variation suggested in comments.
Update 5: With rlang/tidyeval it is now possible to do this:
library(rlang)
mytable <- function(x, ...) {
group_ <- syms(c(...)) # syms() takes one vector of names; c(...) also allows several strings
x %>%
group_by(!!!group_) %>%
summarise(n = n())
}
mytable(iris, "Species")
or passing Species unevaluated, i.e. no quotes around it:
library(rlang)
mytable <- function(x, ...) {
group_ <- enquos(...)
x %>%
group_by(!!!group_) %>%
summarise(n = n())
}
mytable(iris, Species)
Update 6: There is now a {{...}} notation that works if there is just one grouping variable:
mytable <- function(x, group) {
x %>%
group_by({{group}}) %>%
summarise(n = n())
}
mytable(iris, Species)
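For more than one grouping variable, two patterns are commonly used; the sketch below assumes dplyr 1.0.0 or later. Bare column names can be forwarded through the dots, and a character vector of names can be selected with across(all_of()). The function names count_by and count_by_names are made up for this example, and mtcars is used only because it has several columns suitable for grouping.

```r
library(dplyr)

# Bare column names: forward the dots straight to group_by()
count_by <- function(x, ...) {
  x %>%
    group_by(...) %>%
    summarise(n = n(), .groups = "drop")
}

# Column names stored as strings: select them with across(all_of())
count_by_names <- function(x, cols) {
  x %>%
    group_by(across(all_of(cols))) %>%
    summarise(n = n(), .groups = "drop")
}

res_bare    <- count_by(mtcars, cyl, am)
res_strings <- count_by_names(mtcars, c("cyl", "am"))
res_bare
```

Both calls should give one row per combination of cyl and am.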
UPDATE: As of dplyr 0.7.0 you can use tidy eval to accomplish this.
See http://dplyr.tidyverse.org/articles/programming.html for more details.
library(tidyverse)
data("iris")
my_table <- function(df, group_var) {
group_var <- enquo(group_var) # Create quosure
df %>%
group_by(!!group_var) %>% # Use !! to unquote the quosure
summarise(n = n())
}
my_table(iris, Species)
# A tibble: 3 x 2
Species n
<fctr> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
As a complement to Update 6 in the answer by @G. Grothendieck: if you want to pass the grouping column as a string, then instead of embracing the argument with doubled braces ({{) you should use the .data pronoun, as described in the Programming vignette under "Loop over multiple variables":
mytable <- function( x, group ) {
x %>%
group_by( .data[[group]] ) %>%
summarise( n = n() )
}
group_string <- 'Species'
mytable( iris, group_string )
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
Ugly as they come, but she works:
mytable3 <- function(x, key) {
my.call <- bquote(summarise(group_by(.(substitute(x)), NULL), n = n()))
my.call[[2]][[3]] <- as.name(key)
eval(my.call, parent.frame())
}
mytable3(iris, "Species")
# Source: local data frame [3 x 2]
#
# Species n
# 1 virginica 50
# 2 versicolor 50
# 3 setosa 50
There are almost certainly cases that will cause this to break, but you get the idea. I don't think you can get around messing with the call. One other thing that did work but was even uglier is:
mytable4 <- function(x, key) summarise(group_by(x, x[[key]]), n = n())
