Major dplyr functions in a function

I have seen a couple of posts of how to write one's own function with dplyr functions. For example, you can see how you can use group_by (regroup) and summarise in this post. I thought that it would be interesting to see if I can write a function using major dplyr functions. My hope is that we can further understand how to write functions using dplyr functions.
DATA
country <- rep(c("UK", "France"), each = 5)
id <- rep(letters[1:5], times = 2)
value <- runif(10, 50, 100)
foo <- data.frame(country, id, value, stringsAsFactors = FALSE)
GOAL
I wanted to write the following process in a function.
foo %>%
mutate(new = ifelse(value > 60, 1, 0)) %>%
filter(id %in% c("a", "b", "d")) %>%
group_by(country) %>%
summarize(whatever = sum(value))
TRY
### Here is a function which does the same process
myFun <- function(x, ana, bob, cathy) x %>%
mutate(new = ifelse(ana > 60, 1, 0)) %>%
filter(bob %in% c("a", "b", "d")) %>%
regroup(as.list(cathy)) %>%
summarize(whatever = sum(ana))
myFun(foo, value, id, "country")
Source: local data frame [2 x 2]
country whatever
1 France 233.1384
2 UK 245.5400
You may notice that arrange() is missing. This is the one I am struggling with. Here are two observations. The first experiment was successful: the order of the countries changed from UK-France to France-UK. The second experiment was not.
### Experiment 1: This works for arrange()
myFun <- function(x, ana) x %>%
arrange(ana)
myFun(foo, country)
country id value
1 France a 90.12723
2 France b 86.64229
3 France c 74.93320
4 France d 80.69495
5 France e 72.60077
6 UK a 84.28033
7 UK b 67.01209
8 UK c 94.24756
9 UK d 79.49848
10 UK e 63.51265
### Experiment 2: This was not successful.
myFun <- function(x, ana, bob) x %>%
filter(ana %in% c("a", "b", "d")) %>%
arrange(bob)
myFun(foo, id, country)
Error: incorrect size (10), expecting :6
### This works, by the way.
foo %>%
filter(id %in% c("a", "b", "d")) %>%
arrange(country)
Given that the first experiment was successful, I have a hard time understanding why the second one failed. There may be something else one has to do in the second experiment. Does anybody have an idea? Thank you for taking your time.

I installed dplyr 0.3 and lazyeval once issue 352 was closed to see how it might work to use dplyr functions in another function. After reading the vignette on non-standard evaluation, it looks like interp from lazyeval combined with the new functions ending in _ is one option. Notice group_by_ now replaces regroup.
set.seed(16)
foo = data.frame(country = rep(c("UK", "France"), each = 5),
id = rep(letters[1:5], times = 2),
value = runif(10, 50, 100), stringsAsFactors = FALSE)
First the code/results outside the function:
library(lazyeval)
library(dplyr)
foo %>%
mutate(new = ifelse(value > 60, 1, 0)) %>%
filter(id %in% c("a", "b", "d")) %>%
group_by(country) %>%
summarize(whatever = sum(value))
Source: local data frame [2 x 2]
country whatever
1 France 213.0009
2 UK 207.8331
Then turn the above process into a function:
myFun = function(x, ana, bob, cathy) {
x %>%
mutate_(new = interp(~ifelse(var > 60 , 1, 0), var = as.name(ana))) %>%
filter_(interp(~var %in% c("a", "b", "d"), var = as.name(bob))) %>%
group_by_(cathy) %>%
summarize_(whatever = interp(~sum(var), var = as.name(ana)))
}
Which gives the desired results.
myFun(foo, "value", "id", "country")
Source: local data frame [2 x 2]
country whatever
1 France 213.0009
2 UK 207.8331
For your second problem with arrange, I tried
myfun2 = function(x, ana, bob) x%>%
filter_(interp(~var %in% c("a", "b", "d"), var = as.name(ana))) %>%
arrange_(as.name(bob))
myfun2(foo, "id", "country")

Actually, none of your experiments work; all of them have scoping problems. They appear to work because you defined the vectors country, id, and value in the global environment and did not remove them. So when you call your functions, they use the vectors from the global environment, not the columns of foo.
To show this, let's remove those vectors before calling your functions:
Creating the vectors and data.frame:
library(dplyr)
country <- rep(c("UK", "France"), each = 5)
id <- rep(letters[1:5], times = 2)
value <- runif(10, 50, 100)
foo <- data.frame(country, id, value, stringsAsFactors = FALSE)
Defining your first function:
myFun <- function(x, ana, bob, cathy) x %>%
mutate(new = ifelse(ana > 60, 1, 0)) %>%
filter(bob %in% c("a", "b", "d")) %>%
regroup(as.list(cathy)) %>%
summarize(whatever = sum(ana))
Calling without removing the vectors (it will look like it works, but it is actually using the vectors from the global env):
myFun(foo, value, id, "country")
Source: local data frame [2 x 2]
country whatever
1 France 208.1008
2 UK 192.4287
Now removing the vectors and calling your function (and now it does not work, for it can't find the vectors):
rm(country, id, value)
myFun(foo, value, id, "country")
Error in mutate_impl(.data, named_dots(...), environment()) :
object 'value' not found
That explains why your second arrange experiment failed while the others appeared to work. arrange() was using the vector country from the global environment, which has 10 elements, but after filter() the data frame has only 6 rows, so arrange() expected 6 elements.
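The same lookup behaviour can be reproduced without dplyr. The following is a minimal base R sketch, not the original code: foo_small and sum_arg are hypothetical names, and with() stands in for the data-masking that mutate() and arrange() perform.

```r
# Hypothetical miniature of the problem, using only base R.
foo_small <- data.frame(value = c(10, 20, 30))
value <- c(1, 2, 3)  # global vector that shares its name with the column

# `ana` is not a column of x, so the lookup falls back to the function's
# argument, which is evaluated in the caller's environment.
sum_arg <- function(x, ana) with(x, sum(ana))

# Sums the global vector (1 + 2 + 3), not foo_small$value.
uses_global <- sum_arg(foo_small, value)

# Once the global vector is gone, the same call fails.
rm(value)
after_rm <- tryCatch(
  sum_arg(foo_small, value),
  error = function(e) conditionMessage(e)
)
```

Here uses_global comes from the global vector, and after_rm holds the "object not found" message, mirroring what happens after rm(country, id, value) above.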
There are different strategies for making your functions work. For example, take a look at this answer by G. Grothendieck for some insight on how to do it. Or just wait a little: as Hadley pointed out, programming with dplyr is a feature that is coming soon.
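For readers on a later setup, here is a sketch of how the function might look with the embrace operator {{ }}. It assumes dplyr 1.0 or later with rlang 0.4 or later, which postdate this answer; my_fun and my_fun2 are illustrative names, not part of the original question.

```r
library(dplyr)

set.seed(16)
foo <- data.frame(
  country = rep(c("UK", "France"), each = 5),
  id = rep(letters[1:5], times = 2),
  value = runif(10, 50, 100),
  stringsAsFactors = FALSE
)

# {{ }} forwards the bare column name typed by the caller into the verb,
# so the lookup happens inside the data frame, not the global environment.
my_fun <- function(x, ana, bob, cathy) {
  x %>%
    mutate(new = ifelse({{ ana }} > 60, 1, 0)) %>%
    filter({{ bob }} %in% c("a", "b", "d")) %>%
    group_by({{ cathy }}) %>%
    summarise(whatever = sum({{ ana }}))
}

# The filter-then-arrange case from Experiment 2.
my_fun2 <- function(x, ana, bob) {
  x %>%
    filter({{ ana }} %in% c("a", "b", "d")) %>%
    arrange({{ bob }})
}

res <- my_fun(foo, value, id, country)
res2 <- my_fun2(foo, id, country)
```

Because the columns are resolved inside foo, these calls should not depend on any vectors named country, id, or value existing in the global environment.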

Related

R: Finding indirect links between colleagues. Code works with string ids, but not numeric ids

I am trying to extract the set of indirect colleagues of doctors. I call doctors who work together in the same hospital colleagues. An indirect colleague is a doctor who works with a colleague of a doctor in another hospital. In the example below, doctor "a" works with doctor "b" in hospital 1, who in turn works with doctor "c" in hospital 2. Therefore "c" is an indirect colleague of "a".
The code below works well when the physician ids are strings (df0) or low numeric values (df1), but not when they are high numeric values (df2). I would like to fix the code to work with high numeric values (while keeping the original physician ids).
df0 <- tribble(
~hospital, ~doctors,
1, c("a", "b"),
2, c("b", "c"),
3, c("a", "d"),
) %>%
unnest(doctors)
# Below, I replaced doctor id with numeric values
df1 <- tribble(
~hospital, ~doctors,
1, c(1, 2),
2, c(2, 3),
3, c(1, 4),
) %>%
unnest(doctors)
# Now I added +5 to each physician id
df2 <- tribble(
~hospital, ~doctors,
1, c(6, 7),
2, c(7, 8),
3, c(6, 9)
) %>%
unnest(doctors)
df <- df2 # The code only works with df0 and df1, not with df2
colleagues <- full_join(df, df, by = c("hospital")) %>%
rename(doctor = doctors.x, colleagues = doctors.y) %>%
filter(doctor != colleagues) %>%
distinct(doctor, colleagues) %>%
chop(colleagues) %>%
deframe()
colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
mutate(ego_colleagues = map(ego, ~ colleagues[[.x]]),
alter_colleagues = map(alter, ~ colleagues[[.x]]),
alter_colleague_only = map2(alter_colleagues, ego_colleagues, ~ .x[!(.x %in% .y)])) %>%
unnest(alter_colleague_only) %>%
filter(ego != alter_colleague_only) %>%
select(ego, alter, alter_colleague_only)
The issue is in your map calls. With df2, ego is numeric, so in map(ego, ~ colleagues[[.x]]) the expression colleagues[[.x]] indexes by position, not by name. With character ids, [[ looks elements up by name. With numeric ids 1, 2, 3, 4, position and name happen to coincide, so it works by luck. But colleagues is a list of 4, so colleagues[[6]] gives a subscript out of bounds error. If that's not totally clear, print these:
colleagues[[1]] vs. colleagues[[6]] vs. colleagues$`6` .
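As an illustration, the difference can be seen on a small named list. This is a sketch with a hand-built stand-in for colleagues (named colleagues_demo here), filled in from the df2 hospitals above, not the object produced by the pipeline.

```r
# Hand-built stand-in for the `colleagues` list derived from df2.
colleagues_demo <- list(`6` = c(7, 9), `7` = c(6, 8), `8` = 7, `9` = 6)

by_position <- colleagues_demo[[1]]    # first element, whatever its name
by_name     <- colleagues_demo[["6"]]  # element whose name is "6"

# A numeric 6 is treated as a position, and the list has only 4 elements.
out_of_bounds <- tryCatch(
  colleagues_demo[[6]],
  error = function(e) conditionMessage(e)
)
```

Converting the id to character before indexing is what makes [[ switch from position lookup to name lookup.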
A quick fix would be to wrap the first part of those map statements in as.character like this:
colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
mutate(ego_colleagues = map(as.character(ego), ~ colleagues[[.x]]),
alter_colleagues = map(as.character(alter), ~ colleagues[[.x]]),
alter_colleague_only = map2(alter_colleagues, ego_colleagues, ~ .x[!(.x %in% .y)])) %>%
unnest(alter_colleague_only) %>%
filter(ego != alter_colleague_only) %>%
select(ego, alter, alter_colleague_only)
UPDATE:
Depending on your setup, you could try using the furrr package with future_map and future_map2, but at least in this minimal example that was a much slower approach. I don't know if that holds true on your real data.
Here's another option. While ugly because it has a lot of intermediate objects, it may be helpful. It uses matrices and leverages the fact that you have these reciprocal relationships (if I'm interpreting correctly). I benchmarked it and it takes half as long.
t1 <- colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
filter(!duplicated(paste0(pmax(ego, alter), pmin(ego, alter)))) %>%
as.matrix()
t2 <- t1 %>%
rbind(t1[1:nrow(t1),c(2,1)])
alter_colleague_only <- t2[match(t2[,2], t2[,1]), "alter"]
t3 <- cbind(t2, alter_colleague_only)
t4 <- t3[which(t2[,1] != t3[,3]),]
t5 <- t4[,c(3,2,1)]
t6 <- rbind(t4, t5) %>%
as_tibble() %>%
arrange(ego)

Summarise multiple functions at once using tidyeval in dplyr 1.0

Say we have a data frame,
library(tidyverse)
library(rlang)
df <- tibble(id = rep(c(1:2), 10),
grade = sample(c("A", "B", "C"), 20, replace = TRUE))
we would like to get the proportion of each grade, grouped by id,
df %>%
group_by(id) %>%
summarise(
n = n(),
mu_A = mean(grade == "A"),
mu_B = mean(grade == "B"),
mu_C = mean(grade == "C")
)
I am handling a case where there are multiple conditions (many grades in this case) and would like to make my code more robust. How can we simplify this using tidy evaluation in dplyr 1.0?
I am talking about the idea of generating multiple column names by passing all grades at once, without breaking the flow of piping in dplyr, something like
# how to get the mean of A, B, C all at once?
mu_{grade} := mean(grade == {grade})
I actually found the answer to my own question from a post that I wrote 2 years ago...
I am just going to post the code right below hoping to help anybody that comes across the same problem.
make_expr <- function(x) {
x %>%
map( ~ parse_expr(str_glue("mean(grade == '{.x}')")))
}
# generate multiple expressions
grades <- c("A", "B", "C")
exprs <- grades %>% make_expr() %>% set_names(paste0("mu_", grades))
# we can 'top up' something extra by adding named element
exprs <- c(n = parse_expr("n()"), exprs)
# using the big bang operator `!!!` to force expressions in data frame
df %>% group_by(id) %>% summarise(!!!exprs)
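A variation on the same idea builds the expressions with rlang::expr() and !! instead of parsing strings, which avoids quoting issues if a grade value contains a quote character. This is only a sketch under the same setup; the set.seed() call and the names exprs2 and res are added here for illustration.

```r
library(dplyr)
library(purrr)
library(rlang)

set.seed(1)
df <- tibble(
  id = rep(1:2, 10),
  grade = sample(c("A", "B", "C"), 20, replace = TRUE)
)

grades <- c("A", "B", "C")

# Build one unevaluated call per grade; !! inlines the grade value.
exprs2 <- map(grades, function(g) expr(mean(grade == !!g))) %>%
  set_names(paste0("mu_", grades))

# Splice the named expressions into summarise() alongside a regular argument.
res <- df %>%
  group_by(id) %>%
  summarise(n = n(), !!!exprs2)
```

The result has one mu_ column per entry in grades, so adding a grade only requires extending the grades vector.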

How to splice a tidyselect-style list of column names into a call of my function

I am trying to write a function that deduplicates my grouped data frame. It asserts that the values in each group are all the same and then keeps only the first row of the group. I am trying to give it tidyselect-like semantics, as seen in pivot_longer(), because I just need to forward the column names into a summarise(a = n_distinct(...)) call.
So for an example table
test <- tribble(
~G, ~F, ~v1, ~v2,
"A", "a", 1, 2,
"A", "b", 1, 2,
"B", "a", 3, 3,
"B", "b", 3, 3) %>%
group_by(G)
I expect the call remove_duplicates(test, c(v1, v2)) (using the tidyselect helper c()) to return
G F v1 v2
A a 1 2
B a 3 3
but I get
Error: `arg` must be a symbol
I tried to use the new "embrace" syntax to solve this (see function code below), which fails with the message shown above.
# Assert that values in each group are identical and keep the first row of each
# group
# tab: A grouped tibble
# vars: <tidy-select> Columns expected to be constant throughout the group
remove_duplicates <- function(tab, vars){
# Assert identical results for identical models and keep only the first per group.
tab %>%
summarise(a = n_distinct({{{vars}}}) == 1, .groups = "drop") %>%
{stopifnot(all(.$a))}
# Remove duplicates
tab <- tab %>%
slice(1) %>%
ungroup()
return(tab)
}
I think that I somehow would need to specify that the evaluation context of the expression vars must be changed to the sub-data-frame of tab that is currently under evaluation by substitute.
So something like
tab %>%
summarise(a = do.call(n_distinct, TIDYSELECT_TO_LIST_OF_VECTORS(vars, context = CURRENT_GROUP)))
but I do not understand the technical details enough to really make this work...
This works as expected if you first enquos your vars then use the curly-curly operator on the result:
remove_duplicates <- function(tab, vars){
vars <- enquos(vars)
tab %>%
summarise(a = n_distinct({{vars}}) == 1, .groups = "drop") %>%
{stopifnot(all(.$a))}
tab %>% slice(1) %>% ungroup()
}
So now
remove_duplicates(test, c(v1, v2))
#> # A tibble: 2 x 4
#> G F v1 v2
#> <chr> <chr> <dbl> <dbl>
#> 1 A a 1 2
#> 2 B a 3 3
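If the intent is to check that the selected columns hold exactly one distinct combination of values per group, another way to express it is with pick(), which turns a tidyselect specification into a data frame for the current group. This is a sketch that assumes dplyr 1.1.0 or later; remove_duplicates2 is a hypothetical name, not the function from the answer above.

```r
library(dplyr)
library(tibble)

remove_duplicates2 <- function(tab, vars) {
  # For each group, count the distinct rows among the selected columns.
  check <- tab %>%
    summarise(
      a = nrow(distinct(pick({{ vars }}))) == 1,
      .groups = "drop"
    )
  stopifnot(all(check$a))

  # Keep the first row of each group.
  tab %>%
    slice(1) %>%
    ungroup()
}

test <- tribble(
  ~G,  ~F,  ~v1, ~v2,
  "A", "a", 1,   2,
  "A", "b", 1,   2,
  "B", "a", 3,   3,
  "B", "b", 3,   3
) %>%
  group_by(G)

res <- remove_duplicates2(test, c(v1, v2))
```

With this version, selecting a column that varies within a group, such as F, should trip the stopifnot() check instead of passing silently.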

mutate_at with two sets of variables

I just asked a question about generating multiple columns at once with dplyr, and I'm a bonehead and oversimplified the problem and have another question. I'd like to find a dplyr method for dynamically generating columns based on other columns.
cols <- c("x", "y")
foo <- c("a", "b")
bar <- c("c", "d")
df <- data.frame(a = 1, b = 2, c = 10, d = 20)
df[cols] <- df[foo] * df[bar]
In my first iteration of the question, I included only one set of previously defined columns, so the following worked:
df %>%
mutate_at(vars(foo), list(new = ~ . * 5)) %>%
rename_at(vars(matches('new')), ~ c('x', 'y'))
However, as the first few lines of code suggest, I would like to instead multiply two existing columns together, and am unable to figure out how to do this. I have tried:
df %>%
mutate_at(c(vars(foo), vars(bar)),
function(x,y) {x * y})
which returns the error:
Error in (function (x, y) : argument "y" is missing, with no default
Is it possible to reference multiple sets of columns to be used on each other with mutate_at?
Well as you want to work with two columns, I think purrr::map2 is the function to work with:
library(purrr)
library(dplyr)
map2(foo, bar, ~ df[[.x]] * df[[.y]]) %>%
set_names(cols) %>%
bind_cols(df, .)
#> a b c d x y
#> 1 1 2 10 20 10 40
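To stay inside a single mutate() call, one option is to build the products as unevaluated calls and splice them in. The sketch below assumes rlang's syms(), call2() and !!! are available alongside dplyr; products and res are illustrative names.

```r
library(dplyr)
library(purrr)
library(rlang)

cols <- c("x", "y")
foo <- c("a", "b")
bar <- c("c", "d")
df <- data.frame(a = 1, b = 2, c = 10, d = 20)

# Pair the columns up and build the calls a * c and b * d,
# named after the new columns they should create.
products <- map2(syms(foo), syms(bar), ~ call2("*", .x, .y)) %>%
  set_names(cols)

# Splice the named calls into mutate().
res <- df %>% mutate(!!!products)
```

Here res keeps a, b, c, and d and gains x and y, matching the base R line df[cols] <- df[foo] * df[bar].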

Creating and using new variables in function in R: NSE programing error in the tidyverse

After reading and re-reading the many "programing with dplyr" guides, I still cannot find a way to solve my particular case.
I understand that the use of group_by_, mutate_ and such "string-friendly" versions of tidyverse functions is heading toward deprecation, and that enquo is the way to go.
However, my case is somewhat different, and I'm struggling to find a neat way to solve it in a tidy way.
Indeed, my aim is to create and manipulate dataframes within a function. Creating (mutating) new variables based on others, using them, etc.
However, no matter how hard I try, my code either errors or returns some warnings upon package check, such as no visible binding for global variable ....
Here's a reproducible example:
Here's what I want to do:
df <- data.frame(X=c("A", "B", "C", "D", "E"),
Y=c(1, 2, 3, 1, 1))
new_df <- df %>%
group_by(Y) %>%
summarise(N=n()) %>%
mutate(Y=factor(Y, levels=1:5)) %>%
complete(Y, fill=list(N = 0)) %>%
arrange(Y) %>%
rename(newY=Y) %>%
mutate(Y=as.integer(newY))
These are common dplyr manipulations; the expected result is:
# A tibble: 5 x 3
newY N Y
<fctr> <dbl> <int>
1 1 3 1
2 2 1 2
3 3 1 3
4 4 0 4
5 5 0 5
I would like this piece of code to work quietly inside a function. The following was my best attempt to deal with the NSE issues:
myfunction <- function(){
df <- data.frame(X=c("A", "B", "C", "D", "E"),
Y=c(1, 2, 3, 1, 1))
new_df <- df %>%
group_by_("Y") %>%
summarise(!!"N":=n()) %>%
mutate(!!"Y":=factor(Y, levels=1:5)) %>%
complete_("Y", fill=list(N = 0)) %>%
arrange_("Y") %>%
rename(!!"newY":="Y") %>%
mutate(!!"Y":=as.integer(newY))
}
Unfortunately, I still got the following messages:
myfunction: no visible global function definition for ':='
myfunction: no visible binding for global variable 'Y'
myfunction: no visible binding for global variable 'newY'
Undefined global functions or variables:
:= Y n.Factors n_optimal newY
Is there a way to solve it? Thanks a lot!
EDIT: I'm using R 3.4.1, dplyr_0.7.4, tidyr_0.7.2 and tidyverse_1.1.1
ANSWER
Thanks to the comments I've managed to solve it, here's the working solution:
myfunction <- function(){
df <- data.frame(X=c("A", "B", "C", "D", "E"),
Y=c(1, 2, 3, 1, 1))
new_df <- df %>%
group_by_("Y") %>%
summarise_("N"=~n()) %>%
mutate_("Y"= ~factor(Y, levels=1:5)) %>%
complete_("Y", fill=list(N = 0)) %>%
arrange_("Y") %>%
rename_("newY"=~Y) %>%
mutate_("Y"=~as.integer(newY))
}
Thanks A LOT :)
The answer wasn't in the "programing with dplyr" guides because your issue is more general. Although your code deals with non-standard evaluation, your case does not need it. If you remove the code that deals with non-standard evaluation, you will reduce the number of problems you need to fix.
Still, some important issues remain: issues of NAMESPACE. You deal with NAMESPACE any time you use functions from other packages inside functions of your own package. NAMESPACE is not an easy topic, but if you are writing packages it will pay off to learn a bit. I recommend reading r-pkgs.had.co.nz/namespace.html: find the section "Imports" and read its introduction and the subheading "R functions". That will help you understand the steps, code and comments that I post below.
Follow these steps to fix your problem:
- Add dplyr, magrittr, and tidyr to DESCRIPTION.
- Refer to functions as PACKAGE::FUNCTION().
- Remove all !! and := because in this case you don't need them.
- Import and export the pipe from magrittr.
- Import .data from rlang.
- Pass global variables to utils::globalVariables().
- Rebuild, reload, recheck.
# I make your function shorter to focus on the important details.
myfunction <- function(){
df <- data.frame(
X = c("A", "B", "C", "D", "E"),
Y = c(1, 2, 3, 1, 1)
)
df %>%
dplyr::group_by(.data$Y) %>%
dplyr::summarise(N = n())
}
# Fix check() notes
#' @importFrom magrittr %>%
#' @export
magrittr::`%>%`
#' @importFrom rlang .data
NULL
utils::globalVariables(c(".data", "n"))
You can use rlang::sym() (or base::as.name()) to convert characters to symbols, so let me add an alternative answer.
Note that I don't mean to force you to throw away these deprecated functions. Use whichever is easier for you to understand. (I believe sym() is more useful, though.)
Case 1: basic usage of rlang::sym()
This code
group_by_("Y") %>%
can be written as
group_by(!! rlang::sym("Y"))
or you can even assign the symbol to a variable beforehand.
col_Y <- rlang::sym("Y")
df %>%
group_by(!! col_Y)
Case 2: Lefthand-side symbols
This code is totally fine.
summarise(!!"N":=n())
Both characters and symbols are permitted for LHS. So this is also fine:
col_N <- rlang::sym("N")
# ...
summarise(!! col_N := n())
Case 3: select semantics
select() and rename() have different semantics from other functions like mutate(): they allow characters in addition to symbols. This may be a somewhat advanced topic. You can find a more detailed explanation in a vignette.
More precisely, both forms below are permitted:
rename(new = old)
rename(new = "old")
So, this code is fine.
rename(!! "newY" := "Y")
(Example)
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-12
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
df <- data.frame(X=c("A", "B", "C", "D", "E"),
Y=c(1, 2, 3, 1, 1))
col_Y <- rlang::sym("Y")
col_N <- rlang::sym("N")
col_newY <- rlang::sym("newY")
df %>%
group_by(!! col_Y) %>%
summarise(!! col_N := n()) %>%
mutate(!! col_Y := factor(!! col_Y, levels=1:5)) %>%
complete(!! col_Y, fill = list(N = 0)) %>%
arrange(!! col_Y) %>%
rename(!! col_newY := !! col_Y) %>%
mutate(!! col_Y := as.integer(!! col_newY))
#> # A tibble: 5 x 3
#> newY N Y
#> <fctr> <dbl> <int>
#> 1 1 3 1
#> 2 2 1 2
#> 3 3 1 3
#> 4 4 0 4
#> 5 5 0 5
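When the column names arrive as strings, later releases also offer the .data pronoun and glue-style names on the left-hand side of :=. The following sketch assumes dplyr 1.0 or later; count_by and its n_name argument are hypothetical and only cover the first two steps of the pipeline above.

```r
library(dplyr)

# `col` and `n_name` are plain strings supplied by the caller.
count_by <- function(data, col, n_name = "N") {
  data %>%
    group_by(.data[[col]]) %>%                    # look the column up by name
    summarise("{n_name}" := n(), .groups = "drop") # name the count column from a string
}

df <- data.frame(X = c("A", "B", "C", "D", "E"),
                 Y = c(1, 2, 3, 1, 1))

res <- count_by(df, "Y")
```

In package code, .data still has to be imported from rlang to keep R CMD check quiet about undefined globals.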
