The function box_m from library(rstatix) currently requires that its first argument NOT include the grouping variable, and that its second argument be only the grouping variable. For example: box_m(d[-1], d$Group).
I'm trying to rewrite this function so that box_m2 would work like: box_m2(d, Group).
I have tried the following without success; is there a way to achieve this?
library(rstatix)
library(tidyverse)
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/v/main/memory.csv")[-1]
box_m(d[-1], d$Group) # How the function currently works
# box_m2(d, Group) # How I would like the function to work
# My trial without success to achieve `box_m2`:
box_m2 <- function(data, group){
  dat <- dplyr::select(data, -vars(group))
  box_m(dat, one_of(group))
}
# New function
box_m2(d, Group)
You can write the function with the help of the curly-curly ({{ }}) operator.
library(rstatix)
library(dplyr)
box_m2 <- function(data, group){
  dat <- dplyr::select(data, -{{group}})
  box_m(dat, data %>% pull({{group}}))
}
identical(box_m(d[-1], d$Group), box_m2(d, Group))
#[1] TRUE
We could also convert the argument to a symbol with ensym and evaluate it with !!. That way, the function is flexible enough to accept either unquoted or quoted arguments.
library(rstatix)
library(dplyr)
box_m2 <- function(data, group){
  group <- ensym(group)
  dat <- dplyr::select(data, -!!group)
  box_m(dat, data %>% pull(!!group))
}
# testing
identical(box_m(d[-1], d$Group), box_m2(d, Group))
#[1] TRUE
identical(box_m(d[-1], d$Group), box_m2(d, "Group"))
#[1] TRUE
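For completeness, if you prefer a plain string interface without any tidy evaluation, base subsetting is enough. A minimal sketch (box_m3 is a hypothetical name, not part of rstatix):
library(rstatix)
# String-based sketch: take the grouping column as a character name and
# use base subsetting instead of tidy evaluation.
box_m3 <- function(data, group) {
  box_m(data[setdiff(names(data), group)], data[[group]])
}
box_m3(d, "Group")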
I want to parameterise the following dplyr computation, which finds which values of Sepal.Length are associated with more than one value of Sepal.Width:
library(dplyr)
iris %>%
  group_by(Sepal.Length) %>%
  summarise(n.uniq = n_distinct(Sepal.Width)) %>%
  filter(n.uniq > 1)
Normally I would write something like this:
not.uniq.per.group <- function(data, group.var, uniq.var) {
  data %>%
    group_by(group.var) %>%
    summarise(n.uniq = n_distinct(uniq.var)) %>%
    filter(n.uniq > 1)
}
However, this approach throws errors because dplyr uses non-standard evaluation. How should this function be written?
You need to use the standard evaluation versions of the dplyr functions (just append '_' to the function names, i.e. group_by_ and summarise_) and pass strings to your function, which you then need to turn into symbols. To parameterise the argument of summarise_, you need interp(), which is defined in the lazyeval package. Concretely:
library(dplyr)
library(lazyeval)
not.uniq.per.group <- function(df, grp.var, uniq.var) {
  df %>%
    group_by_(grp.var) %>%
    summarise_(n_uniq = interp(~n_distinct(v), v = as.name(uniq.var))) %>%
    filter(n_uniq > 1)
}
not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
Note that in recent versions of dplyr the standard evaluation versions of the dplyr functions have been "soft deprecated" in favor of non-standard evaluation.
See the Programming with dplyr vignette for more information on working with non-standard evaluation.
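For example, a modern string-based equivalent of the function above might use the .data pronoun from rlang instead of the underscore verbs (a sketch; not.uniq.per.group2 is a hypothetical name):
library(dplyr)
# Column names are passed as plain strings and looked up via .data[[ ]].
not.uniq.per.group2 <- function(df, grp.var, uniq.var) {
  df %>%
    group_by(.data[[grp.var]]) %>%
    summarise(n_uniq = n_distinct(.data[[uniq.var]])) %>%
    filter(n_uniq > 1)
}
not.uniq.per.group2(iris, "Sepal.Length", "Sepal.Width")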
Like the old dplyr versions up to 0.5, the new dplyr has facilities for both standard evaluation (SE) and non-standard evaluation (NSE), but they are expressed differently than before.
If you want an NSE function, you pass bare expressions and use enquo to capture them as quosures. If you want an SE function, just pass quosures (or symbols) directly, then unquote them in the dplyr calls. Here is the SE solution to the question:
library(tidyverse)
library(rlang)
f1 <- function(df, grp.var, uniq.var) {
  df %>%
    group_by(!!grp.var) %>%
    summarise(n_uniq = n_distinct(!!uniq.var)) %>%
    filter(n_uniq > 1)
}
a <- f1(iris, quo(Sepal.Length), quo(Sepal.Width))
b <- f1(iris, sym("Sepal.Length"), sym("Sepal.Width"))
identical(a, b)
#> [1] TRUE
Note how the SE version enables you to work with string arguments: just turn them into symbols first using sym(). For more information, see the programming with dplyr vignette.
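For instance, a thin wrapper (f1_chr is a hypothetical name) that accepts strings and converts them before delegating to f1 as defined above:
# Convert character column names to symbols, then call f1.
f1_chr <- function(df, grp.var, uniq.var) {
  f1(df, sym(grp.var), sym(uniq.var))
}
f1_chr(iris, "Sepal.Length", "Sepal.Width")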
In the devel version of dplyr (soon to be released as 0.6.0), we can also make use of a slightly different syntax for passing the variables.
f1 <- function(df, grp.var, uniq.var) {
  grp.var <- enquo(grp.var)
  uniq.var <- enquo(uniq.var)
  df %>%
    group_by(!!grp.var) %>%
    summarise(n_uniq = n_distinct(!!uniq.var)) %>%
    filter(n_uniq > 1)
}
res2 <- f1(iris, Sepal.Length, Sepal.Width)
res1 <- not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
identical(res1, res2)
#[1] TRUE
Here enquo captures the function arguments lazily and returns them as quosures (similar to substitute in base R); inside group_by and summarise we unquote them with !! (or UQ) so that they get evaluated.
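To see what enquo actually captures, here is a small sketch (show_quo is a hypothetical name): a quosure bundles the unevaluated expression with the environment where it should be evaluated, and rlang exposes accessors for both.
library(rlang)
# quo_get_expr() returns the captured expression; quo_get_env() returns
# the environment in which it would be evaluated.
show_quo <- function(x) {
  q <- enquo(x)
  list(expr = quo_get_expr(q), env = quo_get_env(q))
}
show_quo(Sepal.Length)  # expr is the bare symbol Sepal.Length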
Here's how to do it as of rlang 0.4, using the curly-curly {{ }} pseudo-operator:
library(dplyr)
not.uniq.per.group <- function(data, group.var, uniq.var) {
  data %>%
    group_by({{ group.var }}) %>%
    summarise(n.uniq = n_distinct({{ uniq.var }})) %>%
    filter(n.uniq > 1)
}
iris %>% not.uniq.per.group(Sepal.Length, Sepal.Width)
#> # A tibble: 25 x 2
#> Sepal.Length n.uniq
#> <dbl> <int>
#> 1 4.4 3
#> 2 4.6 4
#> 3 4.8 3
#> 4 4.9 5
#> 5 5 8
#> 6 5.1 6
#> 7 5.2 4
#> 8 5.4 4
#> 9 5.5 6
#> 10 5.6 5
#> # ... with 15 more rows
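Under the hood, {{ var }} is essentially shorthand for !!enquo(var); a quick sketch showing that the two spellings agree (f_curly and f_enquo are hypothetical names):
library(dplyr)
# Both functions capture the bare column name and splice it back in.
f_curly <- function(data, var) {
  data %>% summarise(n = n_distinct({{ var }}))
}
f_enquo <- function(data, var) {
  var <- enquo(var)
  data %>% summarise(n = n_distinct(!!var))
}
identical(f_curly(iris, Sepal.Width), f_enquo(iris, Sepal.Width))
#> [1] TRUE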
In the current version of dplyr (0.7.4), the use of the standard evaluation function versions (with '_' appended to the function name, e.g. group_by_) is deprecated.
Instead you should rely on tidyeval when writing functions.
Here's an example of how your function would look then:
# definition of your function
not.uniq.per.group <- function(data, group.var, uniq.var) {
  # enquote the variables to be used with dplyr functions
  group.var <- enquo(group.var)
  uniq.var <- enquo(uniq.var)
  # use '!!' before the parameter names in dplyr functions
  data %>%
    group_by(!!group.var) %>%
    summarise(n.uniq = n_distinct(!!uniq.var)) %>%
    filter(n.uniq > 1)
}
# call of your function
not.uniq.per.group(iris, Sepal.Length, Sepal.Width)
If you want to learn all about the details, there's an excellent vignette by the dplyr-team on how this works.
I've written a function in the past that does something similar to what you're doing, except that it explores all the columns outside the primary key and looks for multiple unique values per group.
find_dups <- function(.table, ...) {
  require(dplyr)
  require(tidyr)
  # get column names of the primary key
  pk <- .table %>% select(...) %>% names()
  other <- names(.table)[!(names(.table) %in% pk)]
  # group by primary key,
  # get number of rows per unique combo,
  # filter for duplicates,
  # get number of distinct values in each column,
  # gather to get a df of 1 row per (primary key, other column),
  # filter for where a column has more than 1 unique value,
  # order table by primary key
  .table %>%
    group_by(...) %>%
    mutate(cnt = n()) %>%
    filter(cnt > 1) %>%
    select(-cnt) %>%
    summarise_each(funs(n_distinct)) %>%
    gather_('column', 'unique_vals', other) %>%
    filter(unique_vals > 1) %>%
    arrange(...) %>%
    return
  # Final dataframe:
  ## One row per (primary key, column) that creates duplicates.
  ## The last column indicates how many unique values of
  ## the given column exist for each primary key.
}
This function also works with the piping operator:
dat %>% find_dups(key1, key2)
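On a current dplyr/tidyr, summarise_each() and gather_() have since been retired, so here is a modernized sketch of the same idea (find_dups2 is a hypothetical name) using across() and pivot_longer():
library(dplyr)
library(tidyr)
find_dups2 <- function(.table, ...) {
  pk <- .table %>% select(...) %>% names()
  .table %>%
    group_by(...) %>%
    filter(n() > 1) %>%  # keep only duplicated key combinations
    summarise(across(everything(), n_distinct), .groups = "drop") %>%
    pivot_longer(-all_of(pk), names_to = "column",
                 values_to = "unique_vals") %>%
    filter(unique_vals > 1) %>%
    arrange(across(all_of(pk)))
}
iris %>% find_dups2(Sepal.Length)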
You can avoid lazyeval by using do to call an anonymous function and then using get. This solution generalises to multiple aggregations. I usually write the anonymous function separately.
library(dplyr)
not.uniq.per.group <- function(df, grp.var, uniq.var) {
  df %>%
    group_by_(grp.var) %>%
    do((function(., uniq.var) {
      with(., data.frame(n_uniq = n_distinct(get(uniq.var))))
    })(., uniq.var)) %>%
    filter(n_uniq > 1)
}
not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
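To illustrate the multiple-aggregations point, a sketch that returns two summaries per group from the same anonymous function (summarise_two is a hypothetical name; group_by_ is kept only for consistency with the answer above and is deprecated in recent dplyr):
summarise_two <- function(df, grp.var, v1, v2) {
  df %>%
    group_by_(grp.var) %>%
    do((function(., a, b) {
      with(., data.frame(n_uniq1 = n_distinct(get(a)),
                         n_uniq2 = n_distinct(get(b))))
    })(., v1, v2))
}
summarise_two(iris, "Species", "Sepal.Length", "Sepal.Width")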
I'm trying to put together a function that creates a subset of my original data frame and then uses dplyr's select and mutate to give me the number of large/small entries, based on the sum of the width and length of sepals/petals.
filter <- function(spp, LENGTH, WIDTH) {
  d <- subset(iris, subset = iris$Species == spp)  # This part seems to work just fine
  large <- d %>%
    select(LENGTH, WIDTH) %>%  # This is where the problem arises.
    mutate(sum = LENGTH + WIDTH)
  big_samples <- which(large$sum > 4)
  return(length(big_samples))
}
Basically, I want the function to return the number of large flowers. However, when I run the function I get the following error:
filter("virginica", "Sepal.Length", "Sepal.Width")
Error: All select() inputs must resolve to integer column positions.
The following do not:
* LENGTH
* WIDTH
What am I doing wrong?
You are running into NSE/SE problems; see the vignette for more info.
Briefly, dplyr uses non-standard evaluation (NSE) of names, and passing column names into functions breaks it unless you use the standard evaluation (SE) versions.
The SE versions of the dplyr functions end in _. You can see that select_ works nicely with your original arguments.
However, things get more complicated when functions appear inside the verbs. We can use lazyeval::interp to convert most function arguments into column names; see the conversion of the mutate call to mutate_ in your function below and, more generally, the help page ?lazyeval::interp.
Try:
filter <- function(spp, LENGTH, WIDTH) {
  d <- subset(iris, subset = iris$Species == spp)
  large <- d %>%
    select_(LENGTH, WIDTH) %>%
    mutate_(sum = lazyeval::interp(~X + Y, X = as.name(LENGTH), Y = as.name(WIDTH)))
  big_samples <- which(large$sum > 4)
  return(length(big_samples))
}
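If I'm not mistaken, calling the SE version with string column names then works and matches the tidy-eval result shown further down (every virginica flower has Sepal.Length + Sepal.Width well above 4):
filter("virginica", "Sepal.Length", "Sepal.Width")
#> [1] 50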
UPDATE: As of dplyr 0.7.0 you can use tidy eval to accomplish this.
See http://dplyr.tidyverse.org/articles/programming.html for more details.
filter_big <- function(spp, LENGTH, WIDTH) {
  LENGTH <- enquo(LENGTH)  # Create quosure
  WIDTH <- enquo(WIDTH)    # Create quosure
  iris %>%
    filter(Species == spp) %>%
    select(!!LENGTH, !!WIDTH) %>%            # Use !! to unquote the quosure
    mutate(sum = (!!LENGTH) + (!!WIDTH)) %>% # Use !! to unquote the quosure
    filter(sum > 4) %>%
    nrow()
}
filter_big("virginica", Sepal.Length, Sepal.Width)
> filter_big("virginica", Sepal.Length, Sepal.Width)
[1] 50
If quosures and quasiquotation are too much for you, use either .data[[ ]] or the rlang {{ }} (curly-curly) operator instead. See Hadley Wickham's 5-minute video on tidy evaluation and (maybe) the tidy evaluation section of Hadley's Advanced R book for more information.
library(rlang)
library(dplyr)
filter_data <- function(df, spp, LENGTH, WIDTH) {
  res <- df %>%
    filter(Species == spp) %>%
    select(.data[[LENGTH]], .data[[WIDTH]]) %>%
    mutate(sum = .data[[LENGTH]] + .data[[WIDTH]]) %>%
    filter(sum > 4) %>%
    nrow()
  return(res)
}
filter_data(iris, "virginica", "Sepal.Length", "Sepal.Width")
#> [1] 50
filter_rlang <- function(df, spp, LENGTH, WIDTH) {
  res <- df %>%
    filter(Species == spp) %>%
    select({{LENGTH}}, {{WIDTH}}) %>%
    mutate(sum = {{LENGTH}} + {{WIDTH}}) %>%
    filter(sum > 4) %>%
    nrow()
  return(res)
}
filter_rlang(iris, "virginica", Sepal.Length, Sepal.Width)
#> [1] 50
Created on 2019-11-10 by the reprex package (v0.3.0)