I have the following data frame:
dat <- data.frame(time = runif(20),
group1 = rep(1:2, times = 10),
group2 = rep(1:2, each = 10),
group3 = rep(3:4, each = 10))
I'm now writing a function my_function that takes the following form:
my_function(data, time_var = time, group_vars = c(group1, group2))
If I'm not mistaken, I'm passing the group_vars as symbols to my function, right?
However, within my function I first want to check whether the variables passed to the function exist in the data. For the time variable I was successful, but I don't know how I can turn my group_vars list into a vector of strings so that it looks like c("group1", "group2").
My current function looks like:
my_function <- function (data, time_var = NULL, group_vars = NULL)
{
time_var <- enquo(time_var)
time_var_string <- as_label(time_var)
group_vars <- enquos(group_vars)
# is "time" variable part of the dataset?
if (!time_var_string %in% colnames(data))
{
stop(paste0("The variable '", time_var_string, "' doesn't exist in your data set. Please check for typos."))
}
}
And I want to extend the latter part so that I can also do some checks in the form of !group_vars %in% colnames(data). I know I could pass the group_var variables already as a vector of strings to the function, but I don't want to do that for other reasons.
enquos is the wrong function here: it operates on multiple arguments, but you’re only passing a single argument. Just use enquo. However, either way the result isn’t directly usable, because you don’t get a vector of unevaluated names — you get an unevaluated c call.
Working with this is a bit more convoluted, I’m afraid:
# group_vars here is the quosure captured with enquo(group_vars); assumes rlang is attached
group_vars_expr = quo_squash(group_vars)
group_var_names = if (is_symbol(group_vars_expr)) {
as_name(group_vars_expr)
} else {
stopifnot(is_call(group_vars_expr))
stopifnot(identical(group_vars_expr[[1L]], sym('c')))
stopifnot(all(purrr::map_lgl(group_vars_expr[-1L], is_symbol)))
purrr::map_chr(group_vars_expr[-1L], as_name)
}
stopifnot(all(group_var_names %in% colnames(data)))
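Once this fragment sits inside my_function after the enquo() call, both call forms from the question should pass the check, e.g.:
my_function(dat, time_var = time, group_vars = group1)
my_function(dat, time_var = time, group_vars = c(group1, group2))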
If you'd like to use c() in this way, chances are you need selections. One easy way to take selections in an argument is to interface with dplyr::select():
my_function <- function(data, group_vars = NULL) {
group_vars <- names(dplyr::select(data, {{ group_vars }}))
group_vars
}
mtcars %>% my_function(c(cyl, mpg))
#> [1] "cyl" "mpg"
mtcars %>% my_function(starts_with("d"))
#> [1] "disp" "drat"
I have a function which takes a dataframe and its columns and processes it in various ways (left out for simplicity). We can put in column names as arguments or transform columns directly inside function arguments (like here). I need to find out what object(s) are passed in the function.
Reproducible example:
df <- data.frame(x= 1:10, y=1:10)
myfun <- function(data, col){
col_new <- eval(substitute(col), data)
# magic part
object_name <- ...
# magic part
plot(col_new, main= object_name)
}
For instance, the expected output for myfun(data= df, x*x) is the plot produced by plot(df$x*df$x, main= "x"). So the title is x, not x*x. What I have got so far is this:
myfun <- function(data, col){
colname <- tryCatch({eval(substitute(col))}, error= function(e) {geterrmessage()})
colname <- gsub("' not found", "", gsub("object '", "", colname))
plot(eval(substitute(col), data), main= colname)
}
This function gives the expected output, but there must be some more elegant way to find out to which object the input refers. The answer must be with base R.
Use substitute to get the expression passed as col and then use eval and all.vars to get the values and name.
myfun <- function(data, col){
s <- substitute(col)
plot(eval(s, data), main = all.vars(s), type = "o", ylab = "")
}
myfun(df, x * x)
Another possibility is to pass a one-sided formula.
myfun2 <- function(formula, data){
plot(eval(formula[[2]], data), main = all.vars(formula), type = "o", ylab = "")
}
myfun2(~ x * x, df)
The rlang package can be very powerful when you get the hang of it. Does something like this do what you want?
library(rlang)
myfun <- function (data, col){
.col <- enexpr(col)
unname(sapply(call_args(.col), as_string))
}
This gives you back "wt", the name of the column.
myfun(mtcars, as.factor(wt))
# [1] "wt"
I am not sure of your use case, but this would also work for multiple inputs.
myfun(mtcars, sum(x, y))
# [1] "x" "y"
And finally, it is possible you might not even need to do this, but rather store the expression and operate directly on the data. The tidyeval framework can help with that as well.
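For instance, a minimal sketch of that idea (myfun_tidy is a made-up name; assumes rlang is loaded):
library(rlang)
myfun_tidy <- function(data, col) {
  .col <- enexpr(col)                   # capture the expression unevaluated
  values <- eval_tidy(.col, data)       # evaluate it inside the data
  plot(values, main = paste(all.vars(.col), collapse = ", "))
}
myfun_tidy(df, x * x)   # plots df$x * df$x with the title "x"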
# (uses dplyr, rlang, purrr and stringr; check_args_exist() and filter_arg() are my own helpers, defined elsewhere)
create_c <- function(df, line_number = NA, prior_trt, line_name, biomarker, ...) {
if (!"data.frame" %in% class(df)) {
stop("First input must be dataframe")
}
# handle extra arguments
args <- enquos(...)
names(args) <- tolower(names(args))
# check for unknown argument - cols that do not exist in df
check_args_exist(df, args)
# argument to expression
ex_args <- unname(imap(args, function(expr, name) quo(!!sym(name) == !!expr)))
# special case arguments
if (!missing(line_number)) {
df <- df %>% filter(line_number %in% (!!line_number))
if (!missing(prior_trt)) {
df <- filter_arg(df. = df, arg = prior_trt, col = "prior_trt_", val = "y")
}
}
if (!missing(biomarker)) {
df <- filter_arg(df. = df, arg = biomarker, col = "has_", val = "positive")
}
if (!missing(line_name)) {
ln <- list()
if (!!str_detect(line_name[1], "or")) {
line_name <- str_split(line_name, " or ", simplify = TRUE)
}
for (i in 1:length(line_name)) {
ln[[i]] <- paste(tolower(sort(strsplit(line_name[i], "\\+")[[1]])), collapse = ",")
}
df <- df %>% filter(line_name %in% (ln))
}
df <- df %>%
group_by(patient_id) %>%
slice(which.min(line_number)) %>%
ungroup()
df <- df %>% filter(!!!ex_args)
invisible(df)
}
I have this function where I am basically filtering various columns based on parameters users pass. I want the users to be able to pass logical operators like >,<, != for some of the parameters. Right now my function is not able to handle any other operators besides '='. Is there a way to accomplish this?
create_c(df = bsl_all_nsclc,
line_number > 2)
create_c(df, biomarker != "positive")
Error in tolower(arg) : object 'biomarker' not found
Certainly there is a way: operators are regular functions in R, you can pass them around like any other function.
The only complication is that the operators have non-syntactic names, so you can’t just pass them “as is”; that would confuse the parser. Instead, you need to wrap them in backticks to make their use syntactically valid where a name would be expected:
filter_something = function (value, op) {
op(value, 13)
}
filter_something(cars$speed, `>`)
filter_something(cars$speed, `<`)
filter_something(cars$speed, `==`)
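Each of these calls returns a logical vector (speed compared against 13), which can then be used for subsetting, e.g.:
cars[filter_something(cars$speed, `>`), ]   # rows of cars where speed > 13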
And since R also supports non-standard evaluation of function arguments, you can also pass unevaluated expressions — this gets slightly more complicated, since you’d want to evaluate them in the correct context. ‘rlang’/‘dplyr’ uses data masking for this.
How exactly you need to apply this depends entirely on the context in which the expression is to be used. In many cases, you can simply dispatch them to the corresponding ‘dplyr’ functions, e.g.
filter_something2 = function (.data, expr) {
.data %>%
filter({{expr}})
}
filter_something2(cars, speed < 13)
The “secret sauce” here is the {{…}} syntax. This works because filter from ‘dplyr’ accepts unevaluated arguments and handles {{expr}} specially by transforming it into (effectively) !!enexpr(expr). That is, expr is first “defused”: it is explicitly marked as unevaluated, and the name expr is replaced by the unevaluated expression it binds to (speed < 13 in the above). Next, this unevaluated expression is unquoted: the wrapper is “peeled off”, and the expression itself is handled inside filter as if it had been passed as filter(.data, speed < 13). In other words, the name expr is substituted with speed < 13 in the call expression.
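Spelled out, the {{expr}} pattern above behaves like this sketch (filter_something3 is a made-up name):
filter_something3 = function (.data, expr) {
    .data %>%
        filter(!!rlang::enexpr(expr))
}
filter_something3(cars, speed < 13)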
For a more thorough explanation, please refer to the Programming with dplyr vignette.
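Applied back to the question, a minimal sketch of how create_c could accept arbitrary conditions such as line_number > 2 or biomarker != "positive" (create_c_sketch is a made-up name and omits all the other argument handling; assumes dplyr and rlang):
create_c_sketch <- function(df, ...) {
  conditions <- rlang::enquos(...)   # defuse every condition the caller passes
  dplyr::filter(df, !!!conditions)   # splice them into filter()
}
# create_c_sketch(bsl_all_nsclc, line_number > 2, biomarker != "positive")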
I am working on a function that tries to give me the top answers of a column. In the example below there is just a part of my whole function. My final goal is to run the function over a loop. I have noticed something weird: why does print(df_col_indicator) change the result when I define "df_col_indicator" externally rather than directly in the function call? With print(df_col_indicator) my function actually does exactly what I want.
library(dplyr)
library(tidyverse)
remove(list = ls())
dataframe_test <- data.frame(
county_name = c("a", "b","c", "d","e", "f", "g", "h"),
column_test1 = c(100,100,100,100,100,100,50,50),
column_test2 = c(40,90,50,40,40,100,13,14),
column_test3 = c(100,90,50,40,30,40,100,50),
month = c("2020-09-01", "2020-09-01" ,"2020-09-01" ,"2020-09-01" ,"2020-09-01" ,"2020-09-01" ,"2020-08-01","2020-08-01"))
choose_top_5 <- function(df, df_col_indicator, df_col_month, char_month, numb_top, df_col_county) {
### this here changes output of my function
#print(df_col_indicator) # changes output of my function depending on included or excluded
### enquo / ensym / deparse
df_col_indicator_ensym <- ensym(df_col_indicator)
df_col_month_ensym <- ensym(df_col_month)
### filter month and top 5 observations
df_top <- df %>%
filter(!!df_col_month_ensym == char_month) %>%
slice_max(!!df_col_indicator_ensym, n = numb_top) %>%
select(!!df_col_county, !!df_col_month_ensym, !!df_col_indicator_ensym)
return(df_top)
}
### define "df_col_indicator" within the function
a = choose_top_5(df = dataframe_test, df_col_indicator = "column_test3",
df_col_month = "month", char_month = "2020-09-01", numb_top = 5,
df_col_county = "county_name")
a
### define "df_col_indicator" externally
external = "column_test3"
b = choose_top_5(df = dataframe_test, df_col_indicator = external,
df_col_month = "month", char_month = "2020-09-01", numb_top = 5,
df_col_county = "county_name")
b
### goal is to run function over loop
external <- c("column_test1","column_test2","column_test3")
my_list <- list()
for (i in external) {
my_list[[i]] <- choose_top_5(df = dataframe_test, df_col_indicator = i,
df_col_month = "month", char_month = "2020-09-01", numb_top = 5,
df_col_county = "county_name")
}
my_list
Your example is quite lengthy. Let's boil it down to a minimal reproducible example with two very similar functions. These both take a single argument and simply print the passed variable to the console, and return the result of calling ensym on the same variable.
The only difference between the two is the order in which the calls to print and ensym are made.
library(rlang)
test_ensym1 <- function(x)
{
result <- ensym(x)
print(x)
return(result)
}
test_ensym2 <- function(x)
{
print(x)
result <- ensym(x)
return(result)
}
Now we might expect these two functions to do exactly the same thing, and indeed when we pass a string directly to them, they both give the same result:
test_ensym1("hello")
#> [1] "hello"
#> hello
test_ensym2("hello")
#> [1] "hello"
#> hello
But look what happens when we use an external variable to pass in our string:
y <- "hello"
test_ensym1(y)
#> [1] "hello"
#> y
test_ensym2(y)
#> [1] "hello"
#> hello
The functions both still print "hello", as expected, but they return a different result. When we called ensym first, the function returned the symbol y, and when we called print first it returned the symbol hello.
The reason for this is that when you call a function in R, the symbols you pass as parameters are not evaluated immediately. Instead, they are interpreted as promise objects and evaluated as required in the body of the function. It is this lazy evaluation that allows for some of the tidyverse trickery.
The difference between the two functions above is that calling print(x) forces the evaluation of x. Before that point, x is an unevaluated symbol. Afterwards, it behaves just like any other variable you would use interactively in the console, so when you call ensym, you are calling it on this evaluated variable, not as an unevaluated promise.
ensym, on the other hand, does not evaluate x, so if ensym is called first, it will return the unevaluated symbol that was passed to the function.
So actually, the easiest way to fix your problem is to move print to after the ensym call.
You also have to change ensym to as.symbol.
Consider a function like this
f <- function(x) ensym(x)
myvar <- "some string"
You will find that
> f("some string")
`some string`
> f(myvar)
myvar
This is because ensym only looks at the expression supplied directly to the argument: it attempts to convert whatever it finds there into a symbol and returns that (note that if what it finds is neither a string nor a variable, you will get an error). As such, in your first example, ensym returns column_test3; in your second one, it returns external.
As far as I can tell, what you want to do is get the value that df_col_indicator represents and then convert that value into a symbol. This means you have to first evaluate df_col_indicator and then convert. as.symbol does what you need.
g <- function(x) as.symbol(x)
myvar <- "some string"
Some tests
> g("some string")
`some string`
> g(myvar)
`some string`
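Put together, a minimal sketch of the question's choose_top_5 with that change (assumes dplyr is loaded; everything else as in the question):
choose_top_5 <- function(df, df_col_indicator, df_col_month, char_month, numb_top, df_col_county) {
  ind <- as.symbol(df_col_indicator)   # evaluate the string first, then convert it to a symbol
  mon <- as.symbol(df_col_month)
  df %>%
    filter(!!mon == char_month) %>%
    slice_max(!!ind, n = numb_top) %>%
    select(all_of(df_col_county), !!mon, !!ind)
}
This works whether the column name is given directly as a string, comes from an external variable, or is supplied by a loop index.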
I am trying to perform an inner join of two tables using dplyr, and I think I'm getting tripped up by non-standard evaluation rules. When using the by=c("a" = "b") argument, everything works as expected when "a" and "b" are actual strings. Here's a toy example that works:
library(dplyr)
data(iris)
inner_join(iris, iris, by=c("Sepal.Length" = "Sepal.Width"))
But let's say I was putting inner_join in a function:
library(dplyr)
data(iris)
myfn <- function(xname, yname) {
data(iris)
inner_join(iris, iris, by=c(xname = yname))
}
myfn("Sepal.Length", "Sepal.Width")
This returns the following error:
Error: cannot join on columns 'xname' x 'Sepal.Width': index out of bounds
I suspect there is some fancy expression, deparsing, quoting, or unquoting that I could do to make this work, but I'm a bit murky on those details.
You can use
myfn <- function(xname, yname) {
data(iris)
inner_join(iris, iris, by=setNames(yname, xname))
}
The suggested syntax in the ?inner_join documentation of
by = c("a"="b") # same as by = c(a="b")
is slightly misleading, because those aren't plain character values: you're actually creating a named character vector. Dynamically setting the values to the left of the equals sign (the names) is different from setting those on the right, and you can use setNames() to set the names of the vector dynamically.
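To see what setNames() builds here:
setNames("Sepal.Width", "Sepal.Length")
#>  Sepal.Length 
#> "Sepal.Width"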
I like MrFlick's answer and fber's addendum, but I prefer structure. For me setNames feels like something at the end of a pipe, not like an on-the-fly constructor. On another note, both setNames and structure enable the use of variables in the function call.
myfn <- function(xnames, ynames) {
data(iris)
inner_join(iris, iris, by = structure(names = xnames, .Data = ynames))
}
x <- "Sepal.Length"
myfn(x, "Sepal.Width")
A named vector argument would run into problems here:
myfn <- function(byvars) {
data(iris)
inner_join(iris, iris, by = byvars)
}
x <- "Sepal.Length"
myfn(c(x = "Sepal.Width"))
You could solve that, though, by using setNames or structure in the function call.
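A quick illustration of the difference: the name x in c(x = "Sepal.Width") is taken literally, whereas setNames() evaluates its nm argument:
x <- "Sepal.Length"
c(x = "Sepal.Width")
#>             x 
#> "Sepal.Width"
setNames("Sepal.Width", x)
#>  Sepal.Length 
#> "Sepal.Width"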
I know I'm late to the party, but how about:
myfn <- function(byvar) {
data(iris)
inner_join(iris, iris, by=byvar)
}
This way you can do what you want with:
myfn(c("Sepal.Length"="Sepal.Width"))
I faced a nearly identical challenge as #Peter, but needed to pass multiple different sets of by = join parameters at one time. I chose to use the map() function from the tidyverse package, purrr.
This is the subset of the tidyverse that I used.
library(magrittr)
library(dplyr)
library(rlang)
library(purrr)
First, I adapted myfn to use map() for the case posted by Peter. 42's comment and Felipe Gerard's answer made it clear that the by argument can take a named vector. map() requires a list over which to iterate.
myfn_2 <- function(xname, yname) {
by_names <- list(setNames(nm = xname, yname ))
data(iris)
# map() returns a single-element list. We index to retrieve dataframe.
map( .x = by_names,
.f = ~inner_join(x = iris,
y = iris,
by = .x)) %>%
`[[`(1)
}
myfn_2("Sepal.Length", "Sepal.Width")
I found that I didn't need quo_name() / !! in building the function.
Then, I adapted the function to take a list of by parameters. For each by_i in by_grps, we could extend x and y to add named values on which to join.
by_grps <- list( by_1 = list(x = c("Sepal.Length"), y = c("Sepal.Width")),
by_2 = list(x = c("Sepal.Width"), y = c("Petal.Width"))
)
myfn_3 <- function(by_grps_list, nm_dataset) {
by_named_vectors_list <- lapply(by_grps_list,
function(by_grp) setNames(object = by_grp$y,
nm = by_grp$x))
map(.x = by_named_vectors_list,
.f = ~inner_join(nm_dataset, nm_dataset, by = .x))
}
myfn_3(by_grps, iris)
I'm working with dplyr and created code to compute new data that is plotted with ggplot.
I want to create a function with this code. It should take the name of a column of the data frame that is manipulated by dplyr. However, trying to work with column names does not work. Please consider the minimal example below:
df <- data.frame(A = seq(-5, 5, 1), B = seq(0,10,1))
library(dplyr)
foo <- function (x) {
df %>%
filter(x < 1)
}
foo(B)
Error in filter_impl(.data, dots(...), environment()) :
object 'B' not found
Is there any solution to use the name of a column as a function argument?
If you want to create a function which accepts the string "B" as an argument (as in your question's title)
foo_string <- function (x) {
eval(substitute(df %>% filter(xx < 1),list(xx=as.name(x))))
}
foo_string("B")
If you want to create a function which captures B as an argument (as in dplyr)
foo_nse <- function (x) {
# capture the argument without evaluating it
x <- substitute(x)
eval(substitute(df %>% filter(xx < 1),list(xx=x)))
}
foo_nse(B)
You can find more information in Advanced R
Edit
dplyr makes things easier in version 0.3. Functions with the "_" suffix accept a string or an expression as an argument
foo_string <- function (x) {
# construct the string
string <- paste(x,"< 1")
# use filter_ instead of filter
df %>% filter_(string)
}
foo_string("B")
foo_nse <- function (x) {
# capture the argument without evaluating it
x <- substitute(x)
# construct the expression
expression <- lazyeval::interp(quote(xx < 1), xx = x)
# use filter_ instead of filter
df %>% filter_(expression)
}
foo_nse(B)
You can find more information in this vignette
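Note that in current dplyr the underscore verbs (filter_() and friends) are deprecated; the same idea can be written with enquo() and !! (a sketch; foo_nse_modern is a made-up name, df and B are from the question):
foo_nse_modern <- function (x) {
  x <- rlang::enquo(x)
  df %>% filter(!!x < 1)
}
foo_nse_modern(B)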
I remember a similar question which was answered by #Richard Scriven. I think you need to write something like this.
foo <- function(x,...)filter(x,...)
What #Richard Scriven mentioned was that you need to use ... here. If you look at ?dplyr::filter, you will find this: filter(.data, ...). I think you replace .data with x or whatever. If you want to pick up rows which have values smaller than 1 in B in your df, it will be like this.
foo <- function (x,...) filter(x,...)
foo(df, B < 1)