Pass a variable name to a user-written function that uses dplyr - r

I am trying to write a function that indexes variable names.
In particular, my function uses mutate to encode a variable without changing its name. Does anyone know how I can index a variable on the left-hand side of mutate?
Here is an example:
library(tidyverse)
# first create relevant dataset
iris <- iris %>% group_by(Species) %>% mutate(mean_Length = mean(Sepal.Length))
# second create my function
userfunction <- function(var){
  newdata <- iris %>%
    select(mean_Length, {var}) %>% distinct() %>%
    mutate(get(var) = # this is what causes my function to fail. How can I refer to `var` here?
             factor(get(var), get(var))) %>%
    arrange(get(var))
  return(newdata)
}
# this function produces the following error # Error: unexpected '}' in "}"
# note that if I replace the reference with the original variable name, the function works
userfunction2 <- function(var){
  newdata <- iris %>%
    select(mean_Length, {var}) %>% distinct() %>%
    mutate(Species = # without the reference it works, but then I cannot use the function for multiple variables.
             factor(get(var), get(var))) %>%
    arrange(get(var))
  return(newdata)
}
encodedata <- userfunction2("Species")
Thanks a lot in advance for your help
Best

Here is a working example that goes in a similar direction to Limey's answer:
iris <- datasets::iris %>%
  group_by(Species) %>%
  mutate(mean_Length = mean(Sepal.Length)) %>%
  ungroup()
userfunction <- function(var){
  iris %>%
    transmute(mean_Length, "temp" = iris[[var]]) %>%
    distinct() %>%
    mutate("{var}" := factor(temp)) %>%
    arrange(temp) %>%
    select(-temp)
}
userfunction("Petal.Length")

I don't think var is your problem. I think it's the =. If you have a quoted variable on the left-hand side of the assignment (which is effectively what you do have with get()), you need :=, not =.
See here for more details.
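As a minimal illustration of why = fails here (a sketch; mean_width is a hypothetical column name):
library(dplyr)
new_name <- "mean_width"
# iris %>% mutate(!!new_name = mean(Sepal.Width))  # syntax error: `=` cannot take a computed left-hand side
iris %>% mutate(!!new_name := mean(Sepal.Width))   # `:=` accepts an injected left-hand side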
I would have written your function slightly differently:
userfunction <- function(data, var){
  qVar <- enquo(var)
  newdata <- data %>%
    select(mean_Length, !!qVar) %>% distinct() %>%
    mutate(!!qVar := factor(!!qVar, !!qVar)) %>%
    arrange(!!qVar)
  return(newdata)
}
The inclusion of the data parameter means you can include it in a pipe:
encodedata <- iris %>% userfunction(Species)
encodedata
# A tibble: 3 x 2
# Groups:   Species [3]
  mean_Length Species
        <dbl> <fct>
1        5.01 setosa
2        5.94 versicolor
3        6.59 virginica

Related

How do I add checks to a function created using tidy eval framework?

Say I have created a function using the tidy eval framework -
library(tidyverse)
library(rlang)
my_function <- function(data, var){
  var_expr <- enquo(var)
  data %>%
    group_by(!!var_expr) %>%
    summarise(count = n()) %>%
    ungroup()
}
When I run the following call, I get the result shown below it:
my_function(mtcars, cyl)
# A tibble: 3 x 2
    cyl count
  <dbl> <int>
1     4    11
2     6     7
3     8    14
How do I add the following checks to this function -
Check if data is a dataframe. If not, return the error data should be a dataframe
Check if var is missing. If so return the error var is missing
You can make the following modifications.
To check whether the input data is of a particular class, we can inspect its class attribute; whether it is a data frame or a tibble, both contain the class data.frame.
As for missing(), it is normally used inside functions to check whether an argument was supplied, often in order to compute a default value. In your case we instead terminate execution of the function (you can also check the source code of base functions such as sample to see how a default is computed for the size argument when it is missing).
You can use base::stop in place of rlang::abort, as suggested by @akrun.
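As a quick aside, a toy sketch of how missing() behaves (base R; f is a hypothetical function):
f <- function(x, size) {
  if (missing(size)) size <- length(x)  # compute a default only when the caller omits size
  size
}
f(1:5)    # 5
f(1:5, 2) # 2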
library(rlang)
library(dplyr)
my_function <- function(data, var){
  if(!"data.frame" %in% attr(data, "class")) {
    abort("data should be a data frame")
  }
  if(missing(var)) {
    abort("var is missing")
  }
  var_expr <- enquo(var)
  data %>%
    group_by(!!var_expr) %>%
    summarise(count = n()) %>%
    ungroup()
}
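Hypothetical calls illustrating the two checks (the error text comes from the abort() messages above):
my_function(1:10, cyl)    # Error: data should be a data frame
my_function(mtcars)       # Error: var is missing
my_function(mtcars, cyl)  # works as before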
Special thanks to @27 ϕ 9 for bringing this valuable point to my attention. We can also customize the error message in stopifnot(), which is another way of checking input arguments:
my_function <- function(data, var){
  stopifnot("The input data is not of class data frame" = "data.frame" %in% attr(data, "class"),
            "var is missing" = !missing(var))
  var_expr <- enquo(var)
  data %>%
    group_by(!!var_expr) %>%
    summarise(count = n()) %>%
    ungroup()
}
Special thanks to @IceCreamToucan for presenting yet another option: using the inherits function in lieu of attr. If the input data does not include data.frame among its class attributes, it returns FALSE:
my_function <- function(data, var){
  if(!inherits(data, "data.frame")) {
    stop("data is not of class data.frame")
  }
  if(missing(var)) {
    stop("var is missing")
  }
  var_expr <- enquo(var)
  data %>%
    group_by(!!var_expr) %>%
    summarise(count = n()) %>%
    ungroup()
}
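A quick sketch of why inherits() works for both cases (assuming the tibble package is installed):
inherits(mtcars, "data.frame")                     # TRUE
inherits(tibble::as_tibble(mtcars), "data.frame")  # TRUE - tibbles also carry class data.frame
inherits(1:10, "data.frame")                       # FALSE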

Moving from mutate_all to across() in dplyr 1.0

With the new release of dplyr I am refactoring quite a lot of code and removing functions that are now retired or deprecated. I had a function that is as follows:
processingAggregatedLoad <- function (df) {
  defined <- ls()
  passed <- names(as.list(match.call())[-1])
  if (any(!defined %in% passed)) {
    stop(paste("Missing values for the following arguments:", paste(setdiff(defined, passed), collapse = ", ")))
  }
  df_isolated_load <- df %>% select(matches("snsr_val")) %>% mutate(global_demand = rowSums(.)) # we get isolated load
  df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind")) # we get isolated quality
  df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate_all(~ factor(.), colnames(df_isolated_load_qlty)) %>%
    mutate_each(funs(as.numeric(.)), colnames(df_isolated_load_qlty)) # we convert the qlty to factors and then to numeric
  df_isolated_load_qlty[df_isolated_load_qlty[] == 1] <- 1 # 1 is bad
  df_isolated_load_qlty[df_isolated_load_qlty[] == 2] <- 0 # 0 is good; we mask to calculate the global quality index
  df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate(global_quality = rowSums(.)) %>% select(global_quality)
  df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
  return(df)
}
Basically the function does the following:
1. It selects all of the values of a pivoted dataframe and aggregates them.
2. It selects the quality indicator (character) of the pivoted dataframe.
3. It converts the quality characters to factors and then to numeric to get the 2 levels (1 or 2).
4. It replaces the numeric values of each individual column with 0 or 1 depending on the level.
5. It row-sums the individual qualities: I get 0 if all of the values are good, otherwise the global quality is bad.
The problem is that I am getting the following messages:
1: `funs()` is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
2: `mutate_each_()` is deprecated as of dplyr 0.7.0.
Please use `across()` instead.
I made several attempts, for instance:
df_isolated_load_qlty %>% mutate(across(.fns = ~ as.factor(), .names = colnames(df_isolated_load_qlty)))
Error: Problem with `mutate()` input `..1`.
x All unnamed arguments must be length 1
ℹ Input `..1` is `across(.fns = ~as.factor(), .names = colnames(df_isolated_load_qlty))`.
But I am still a bit confused by the new dplyr syntax. Could someone guide me toward the right way of doing this?
mutate_each has long been deprecated and was replaced with mutate_all.
mutate_all is now replaced with across.
across has everything() as its default .cols, which means it behaves like mutate_all by default (as here) when .cols is not mentioned explicitly.
You can apply multiple functions in the same mutate call, so here factor and as.numeric can be applied together.
Considering all this, you can change your existing function to:
library(dplyr)
processingAggregatedLoad <- function (df) {
  defined <- ls()
  passed <- names(as.list(match.call())[-1])
  if (any(!defined %in% passed)) {
    stop(paste("Missing values for the following arguments:",
               paste(setdiff(defined, passed), collapse = ", ")))
  }
  df_isolated_load <- df %>%
    select(matches("snsr_val")) %>%
    mutate(global_demand = rowSums(.))
  df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind"))
  df_isolated_load_qlty <- df_isolated_load_qlty %>%
    mutate(across(.fns = ~ as.numeric(factor(.))))
  df_isolated_load_qlty[df_isolated_load_qlty == 1] <- 1
  df_isolated_load_qlty[df_isolated_load_qlty == 2] <- 0
  df_isolated_load_qlty <- df_isolated_load_qlty %>%
    mutate(global_quality = rowSums(.)) %>%
    select(global_quality)
  df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
  return(df)
}
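As a sanity check, here is a small sketch (toy data, assuming dplyr 1.0) showing that across() with its default .cols reproduces mutate_all():
library(dplyr)
toy <- tibble::tibble(a = c("x", "y"), b = c("y", "x"))
identical(
  toy %>% mutate(across(.fns = ~ as.numeric(factor(.)))),
  toy %>% mutate_all(~ as.numeric(factor(.)))
)
# TRUE - mutate_all() still runs; it is merely superseded by across()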

Problems passing column name as variable in R [duplicate]

I want to parameterise the following computation using dplyr that finds which values of Sepal.Length are associated with more than one value of Sepal.Width:
library(dplyr)
iris %>%
  group_by(Sepal.Length) %>%
  summarise(n.uniq = n_distinct(Sepal.Width)) %>%
  filter(n.uniq > 1)
Normally I would write something like this:
not.uniq.per.group <- function(data, group.var, uniq.var) {
  data %>%
    group_by(group.var) %>%
    summarise(n.uniq = n_distinct(uniq.var)) %>%
    filter(n.uniq > 1)
}
However, this approach throws errors because dplyr uses non-standard evaluation. How should this function be written?
You need to use the standard evaluation versions of the dplyr functions (just append '_' to the function names, i.e. group_by_ and summarise_) and pass strings to your function, which you then need to turn into symbols. To parameterise the argument of summarise_, you need to use interp(), which is defined in the lazyeval package. Concretely:
library(dplyr)
library(lazyeval)
not.uniq.per.group <- function(df, grp.var, uniq.var) {
  df %>%
    group_by_(grp.var) %>%
    summarise_(n_uniq = interp(~n_distinct(v), v = as.name(uniq.var))) %>%
    filter(n_uniq > 1)
}
not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
Note that in recent versions of dplyr the standard evaluation versions of the dplyr functions have been "soft deprecated" in favor of non-standard evaluation.
See the Programming with dplyr vignette for more information on working with non-standard evaluation.
Like the old dplyr versions up to 0.5, the new dplyr has facilities for both standard evaluation (SE) and nonstandard evaluation (NSE). But they are expressed differently than before.
If you want an NSE function, you pass bare expressions and use enquo to capture them as quosures. If you want an SE function, just pass quosures (or symbols) directly, then unquote them in the dplyr calls. Here is the SE solution to the question:
library(tidyverse)
library(rlang)
f1 <- function(df, grp.var, uniq.var) {
  df %>%
    group_by(!!grp.var) %>%
    summarise(n_uniq = n_distinct(!!uniq.var)) %>%
    filter(n_uniq > 1)
}
a <- f1(iris, quo(Sepal.Length), quo(Sepal.Width))
b <- f1(iris, sym("Sepal.Length"), sym("Sepal.Width"))
identical(a, b)
#> [1] TRUE
Note how the SE version enables you to work with string arguments - just turn them into symbols first using sym(). For more information, see the programming with dplyr vignette.
In the devel version of dplyr (soon to be released 0.6.0), we can also make use of slightly different syntax for passing the variables.
f1 <- function(df, grp.var, uniq.var) {
  grp.var <- enquo(grp.var)
  uniq.var <- enquo(uniq.var)
  df %>%
    group_by(!!grp.var) %>%
    summarise(n_uniq = n_distinct(!!uniq.var)) %>%
    filter(n_uniq > 1)
}
res2 <- f1(iris, Sepal.Length, Sepal.Width)
res1 <- not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
identical(res1, res2)
#[1] TRUE
Here enquo takes the argument and returns its value as a quosure (similar to substitute in base R) by capturing the function argument lazily; inside the summarise, we unquote it (!! or UQ) so that it gets evaluated.
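A short sketch of the quo()/enquo() distinction referred to above (rlang attached via tidyverse):
q1 <- quo(Sepal.Length)     # captures the expression you type at the call site
f  <- function(x) enquo(x)  # captures the expression the *caller* supplied for x
q2 <- f(Sepal.Length)
# q1 and q2 are both quosures wrapping the symbol Sepal.Length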
Here's the way to do it from rlang 0.4, using the curly-curly {{ }} pseudo-operator:
library(dplyr)
not.uniq.per.group <- function(data, group.var, uniq.var) {
  data %>%
    group_by({{ group.var }}) %>%
    summarise(n.uniq = n_distinct({{ uniq.var }})) %>%
    filter(n.uniq > 1)
}
iris %>% not.uniq.per.group(Sepal.Length, Sepal.Width)
#> # A tibble: 25 x 2
#>    Sepal.Length n.uniq
#>           <dbl>  <int>
#>  1          4.4      3
#>  2          4.6      4
#>  3          4.8      3
#>  4          4.9      5
#>  5          5        8
#>  6          5.1      6
#>  7          5.2      4
#>  8          5.4      4
#>  9          5.5      6
#> 10          5.6      5
#> # ... with 15 more rows
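Under the hood, {{ var }} is shorthand for !!enquo(var), so this definition is equivalent to the enquo()-based one. A sketch:
f_curly <- function(data, var) data %>% group_by({{ var }}) %>% summarise(n = n())
f_enquo <- function(data, var) data %>% group_by(!!enquo(var)) %>% summarise(n = n())
identical(f_curly(iris, Species), f_enquo(iris, Species))
# TRUE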
In the current version of dplyr (0.7.4) the use of the standard evaluation function versions (appended '_' to the function name, e.g. group_by_) is deprecated.
Instead you should rely on tidyeval when writing functions.
Here's an example of how your function would look then:
# definition of your function
not.uniq.per.group <- function(data, group.var, uniq.var) {
  # enquote variables to be used with dplyr functions
  group.var <- enquo(group.var)
  uniq.var <- enquo(uniq.var)
  # use '!!' before parameter names in dplyr functions
  data %>%
    group_by(!!group.var) %>%
    summarise(n.uniq = n_distinct(!!uniq.var)) %>%
    filter(n.uniq > 1)
}
# call of your function
not.uniq.per.group(iris, Sepal.Length, Sepal.Width)
If you want to learn all about the details, there's an excellent vignette by the dplyr team on how this works.
I've written a function in the past that does something similar to what you're doing, except that it explores all the columns outside the primary key and looks for multiple unique values per group.
find_dups <- function(.table, ...) {
  require(dplyr)
  require(tidyr)
  # get column names of primary key
  pk <- .table %>% select(...) %>% names()
  other <- names(.table)[!(names(.table) %in% pk)]
  # group by primary key,
  # get number of rows per unique combo,
  # filter for duplicates,
  # get number of distinct values in each column,
  # gather to get a df of 1 row per primary key and other column,
  # filter for where a column has more than 1 unique value,
  # order table by primary key
  .table %>%
    group_by(...) %>%
    mutate(cnt = n()) %>%
    filter(cnt > 1) %>%
    select(-cnt) %>%
    summarise_each(funs(n_distinct)) %>%
    gather_('column', 'unique_vals', other) %>%
    filter(unique_vals > 1) %>%
    arrange(...) %>%
    return
  # Final dataframe:
  ## One row per primary key and column that creates duplicates.
  ## Last column indicates how many unique values of
  ## the given column exist for each primary key.
}
This function also works with the piping operator:
dat %>% find_dups(key1, key2)
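Note that summarise_each()/funs() and gather_() are deprecated in current dplyr/tidyr; here is a hedged sketch of the same idea with non-deprecated verbs (assumes dplyr >= 1.0 and tidyr >= 1.0; dat, key1, key2 are the same hypothetical names as above):
library(dplyr)
library(tidyr)
find_dups2 <- function(.table, ...) {
  pk <- .table %>% select(...) %>% names()
  other <- setdiff(names(.table), pk)
  .table %>%
    group_by(...) %>%
    filter(n() > 1) %>%                              # keep duplicated key combinations
    summarise(across(everything(), n_distinct)) %>%  # distinct values per remaining column
    pivot_longer(all_of(other), names_to = "column", values_to = "unique_vals") %>%
    filter(unique_vals > 1) %>%
    arrange(...) %>%
    ungroup()
}
dat %>% find_dups2(key1, key2)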
You can avoid lazyeval by using do to call an anonymous function and then using get. This solution can be used more generally to employ multiple aggregations. I usually write the function separately.
library(dplyr)
not.uniq.per.group <- function(df, grp.var, uniq.var) {
  df %>%
    group_by_(grp.var) %>%
    do((function(., uniq.var) {
      with(., data.frame(n_uniq = n_distinct(get(uniq.var))))
    })(., uniq.var)) %>%
    filter(n_uniq > 1)
}
not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
