Writing a function with split in R [duplicate] - r

I'm trying to write a function to accept a data.frame (x) and a column from it. The function performs some calculations on x and later returns another data.frame. I'm stuck on the best-practices method to pass the column name to the function.
The two minimal examples fun1 and fun2 below produce the desired result, being able to perform operations on x$column, using max() as an example. However, both rely on two things that seem (at least to me) inelegant:
1. a call to substitute() and possibly eval()
2. the need to pass the column name as a character vector.
fun1 <- function(x, column){
do.call("max", list(substitute(x[a], list(a = column))))
}
fun2 <- function(x, column){
max(eval((substitute(x[a], list(a = column)))))
}
df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")
I would like to be able to call the function as fun(df, B), for example. Other options I have considered but have not tried:
1. Pass column as an integer of the column number. I think this would avoid substitute(). Ideally, the function could accept either.
2. with(x, get(column)), but, even if it works, I think this would still require substitute.
3. Make use of formula() and match.call(), neither of which I have much experience with.
Subquestion: Is do.call() preferred over eval()?

This answer will cover many of the same elements as existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covered things a little more comprehensively.
Suppose we have a very simple data frame:
dat <- data.frame(x = 1:4,
y = 5:8)
and we'd like to write a function that creates a new column z that is the sum of columns x and y.
A very common stumbling block here is that a natural (but incorrect) attempt often looks like this:
foo <- function(df,col_name,col1,col2){
df$col_name <- df$col1 + df$col2
df
}
#Call foo() like this:
foo(dat,z,x,y)
The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks for a column in df literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) Objects".
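To make the difference concrete, here is a minimal sketch using the same dat as above. The variable names literal_lookup and evaluated_lookup are only illustrative.

```r
dat <- data.frame(x = 1:4, y = 5:8)
col1 <- "x"

# `$` treats its right-hand side as a literal name, so this looks for a
# column that is actually called "col1" and finds none.
literal_lookup <- dat$col1

# `[[` evaluates its argument, so the string stored in col1 is used.
evaluated_lookup <- dat[[col1]]

is.null(literal_lookup)   # TRUE
evaluated_lookup          # the values of column x
```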
The simplest and most often recommended solution is simply to switch from $ to [[ and pass the function arguments as strings:
new_column1 <- function(df,col_name,col1,col2){
#Create new column col_name as sum of col1 and col2
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column1(dat,"z","x","y")
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.
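A side benefit of [[ is that it accepts either a column name or a column position, which covers the "ideally, the function could accept either" wish from the question. A small sketch, where col_max is a hypothetical helper and not part of any package:

```r
# Hypothetical helper: `column` may be a single name or a single position.
col_max <- function(df, column) {
  stopifnot(is.data.frame(df), length(column) == 1)
  max(df[[column]])
}

dat <- data.frame(x = 1:4, y = 5:8)
by_name <- col_max(dat, "y")     # select by name
by_position <- col_max(dat, 2)   # select by position
```

Both calls pick the same column, so by_name and by_position are equal.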
The following two options are more advanced. Many popular packages make use of these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. This section of Hadley's Advanced R book is an excellent reference for some of these issues.
If you really want to save the user from typing all those quotes, one option might be to convert bare, unquoted column names to strings using deparse(substitute()):
new_column2 <- function(df,col_name,col1,col2){
col_name <- deparse(substitute(col_name))
col1 <- deparse(substitute(col1))
col2 <- deparse(substitute(col2))
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column2(dat,z,x,y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is, frankly, a bit silly probably, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.
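One practical drawback of deparse(substitute()) is worth seeing: it captures whatever the caller typed, not the value behind it. The sketch below uses a hypothetical one-argument helper, capture_name, to isolate that behaviour.

```r
# Hypothetical helper that returns the text of its argument.
capture_name <- function(col) deparse(substitute(col))

dat <- data.frame(x = 1:4, y = 5:8)
first <- "x"                            # a column name stored in a variable

captured <- capture_name(first)         # the text "first", not "x"
captured_call <- capture_name(dat$x)    # the text "dat$x"

# Neither string is a column of dat, so [[ finds nothing.
lookup <- dat[[captured]]
```

So a function written this way cannot be called with a column name held in a variable, which is a common need when looping over columns.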
Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow for other combinations of two variables. In that case we'd likely resort to using eval() on an expression involving the two columns:
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:
> new_column3(dat,z,x+y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
x y z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
x y z
1 1 5 5
2 2 6 12
3 3 7 21
4 4 8 32
So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval, substitute, etc. if you really know what you're doing.

You can just use the column name directly:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))
There's no need to use substitute, eval, etc.
You can even pass the desired function as a parameter:
fun1 <- function(x, column, fn) {
fn(x[,column])
}
fun1(df, "B", max)
Alternatively, using [[ also works for selecting a single column at a time:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[[column]])
}
fun1(df, "B")
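The two forms differ in what they return, which matters once the function does more than call max(). A short sketch of the behaviour for a plain data.frame (tibbles behave differently: there, [ with a single column still returns a tibble):

```r
df <- data.frame(A = 1:10, B = 2:11, C = 3:12)

one_col <- df[, "B"]                   # single column: dropped to a vector
one_col_df <- df[, "B", drop = FALSE]  # single column kept as a data frame
two_cols <- df[, c("B", "A")]          # several columns: always a data frame
double_bracket <- df[["B"]]            # always a single column as a vector
```

If the function must always receive a vector, [[ is the more predictable choice.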

Personally I think that passing the column as a string is pretty ugly. I like to do something like:
get.max <- function(column,data=NULL){
column<-eval(substitute(column),data, parent.frame())
max(column)
}
which will yield:
> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5
Notice how the specification of a data.frame is optional. You can even work with functions of your columns:
> get.max(1/mpg,mtcars)
[1] 0.09615385
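One thing to watch for with this pattern: eval() looks in data first and then falls back to the calling environment, so a misspelled or missing column can silently resolve to an unrelated variable. A small sketch with made-up data (scores and weight are illustrative names, not from the answer above):

```r
get.max <- function(column, data = NULL) {
  column <- eval(substitute(column), data, parent.frame())
  max(column)
}

scores <- data.frame(score = c(3, 9, 4))
weight <- c(100, 200)   # unrelated vector in the calling environment

from_data <- get.max(score, scores)   # found as a column of scores
from_env <- get.max(weight, scores)   # not a column: falls back to the vector above
```

The second call returns a value without any warning, even though scores has no weight column.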

Another way is to use the tidy evaluation approach. It makes it pretty straightforward to pass columns of a data frame either as strings or as bare column names. See more about tidyeval here.
library(rlang)
library(tidyverse)
set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))
Use column names as strings
fun3 <- function(x, ...) {
# capture strings and create variables
dots <- ensyms(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun3(df, "B")
#> B
#> 1 1.715065
fun3(df, "B", "D")
#> B D
#> 1 1.715065 1.786913
Use bare column names
fun4 <- function(x, ...) {
# capture expressions and create quosures
dots <- enquos(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun4(df, B)
#> B
#> 1 1.715065
fun4(df, B, D)
#> B D
#> 1 1.715065 1.786913
#>
Created on 2019-03-01 by the reprex package (v0.2.1.9000)

With dplyr it's now also possible to access a specific column of a dataframe by simply using double curly braces {{...}} around the desired column name within the function body, e.g. for col_name:
library(tidyverse)
fun <- function(df, col_name){
df %>%
filter({{col_name}} == "test_string")
}
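A runnable version of the same idea, assuming dplyr is installed (with an rlang recent enough to support {{ }}). The function name filter_equal, the target argument, and the people data frame are made up for illustration.

```r
library(dplyr)

# Hypothetical wrapper: keep the rows where the given column equals `target`.
filter_equal <- function(df, col_name, target) {
  df %>%
    filter({{ col_name }} == target)
}

people <- data.frame(
  name  = c("a", "b", "c"),
  group = c("test_string", "other", "test_string")
)

matched <- filter_equal(people, group, "test_string")
```

Note that target is looked up in the data first, so this sketch assumes df has no column called target.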

As an extra thought, if you need to pass the column name unquoted to the custom function, match.call() could be useful as well, as an alternative to deparse(substitute()):
df <- data.frame(A = 1:10, B = 2:11)
fun <- function(x, column){
arg <- match.call()
max(x[[arg$column]])
}
fun(df, A)
#> [1] 10
fun(df, B)
#> [1] 11
If there is a typo in the column name, it would be safer to stop with an error:
fun <- function(x, column) max(x[[match.call()$column]])
fun(df, typo)
#> Warning in max(x[[match.call()$column]]): no non-missing arguments to max;
#> returning -Inf
#> [1] -Inf
# Stop with error in case of typo
fun <- function(x, column){
arg <- match.call()
if (is.null(x[[arg$column]])) stop("Wrong column name")
max(x[[arg$column]])
}
fun(df, typo)
#> Error in fun(df, typo): Wrong column name
fun(df, A)
#> [1] 10
Created on 2019-01-11 by the reprex package (v0.2.1)
I do not think I would use this approach, since it involves more typing and complexity than just passing the quoted column name as pointed out in the answers above, but it is an option.

Tung's answer and mgrund's answer presented tidy evaluation. In this answer I'll show how we can use these concepts to do something similar to joran's answer (specifically his function new_column3). The objective is to make it easier to see the differences between base evaluation and tidy evaluation, and also to see the different syntaxes that can be used in tidy evaluation. You will need rlang and dplyr for this.
Using base evaluation tools (joran's answer):
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
In the first line, substitute captures col_name as an unevaluated expression, more specifically a symbol (also sometimes called a name), instead of evaluating it to an object. rlang's replacements for substitute are:
ensym - turns it into a symbol;
enexpr - turns it into an expression;
enquo - turns it into a quosure, an expression that also points to the environment where R should look for the variables to evaluate it.
Most of the time, you want to have that pointer to the environment. When you don't specifically need it, having it rarely causes problems. Thus, most of the time you can use enquo. In this case, you can use ensym to make the code easier to read, as it makes it clearer what col_name is.
Also in the first line, deparse is turning the expression/symbol into a string. You could also use as.character or rlang::as_string.
In the second line, the substitute is turning expr into a 'full' expression (not a symbol), so ensym is not an option anymore.
Also in the second line, we can now change eval to rlang::eval_tidy. eval would still work with enexpr, but not with a quosure. When you have a quosure, you don't need to pass the environment to the evaluation function (as joran did with parent.frame()).
One combination of the substitutions suggested above might be:
new_column3 <- function(df,col_name,expr){
col_name <- as_string(ensym(col_name))
df[[col_name]] <- eval_tidy(enquo(expr), df)
df
}
We can also use the dplyr operators, which allow for data-masking (evaluating a column in a data frame as a variable, calling it by its name). Instead of converting the symbol to a string and subsetting df with [[, we can use mutate:
new_column3 <- function(df,col_name,expr){
col_name <- ensym(col_name)
df %>% mutate(!!col_name := eval_tidy(enquo(expr), df))
}
To keep the new column from being named "col_name", we force it to be evaluated early (as opposed to lazy evaluation, R's default) with the bang-bang !! operator. Because the left-hand side is now an expression instead of a literal name, we can't use the 'normal' = and must use the := syntax.
The common operation of turning a column name into a symbol and then forcing its evaluation with bang-bang has a shortcut: the curly-curly {{ operator:
new_column3 <- function(df,col_name,expr){
df %>% mutate({{col_name}} := eval_tidy(enquo(expr), df))
}
I'm not an expert in evaluation in R and might have oversimplified or used a wrong term, so please correct me in the comments. I hope to have helped in comparing the different tools used in the answers to this question.

If you are trying to build this function within an R package or simply want to reduce complexity, you can do the following (note that with=FALSE is data.table syntax, so df must be a data.table here):
test_func <- function(df, column) {
# df is assumed to be a data.table; a plain data.frame does not accept with=FALSE
if (column %in% colnames(df)) {
return(max(df[, column, with=FALSE]))
} else {
# stop() needs the message itself; cat() would print it and return NULL
stop(column, " not in data.frame columns.")
}
}
The argument with=FALSE "disables the ability to refer to columns as if they are variables, thereby restoring the 'data.frame mode'" (per the data.table documentation on CRAN). The if statement is a quick way to check whether the column name provided is in the data.frame. You could also use tryCatch error handling here.
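For a plain data.frame, where with=FALSE is not available, the same check can be sketched with [[ and combined with tryCatch on the calling side. The name safe_col_max and the wording of the error message are made up for this example.

```r
# Hypothetical base R variant: validate the name, then select with [[.
safe_col_max <- function(df, column) {
  if (!column %in% colnames(df)) {
    stop(sprintf("'%s' is not a column of the data frame.", column), call. = FALSE)
  }
  max(df[[column]])
}

df <- data.frame(A = 1:10, B = 2:11)

ok <- safe_col_max(df, "B")

# The caller can recover from a bad name instead of aborting.
msg <- tryCatch(
  safe_col_max(df, "typo"),
  error = function(e) conditionMessage(e)
)
```

Here msg holds the error text instead of a result, which the caller can log or report.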

Related

Passing a DataFrame column as argument to a function [duplicate]

I'm trying to write a function to accept a data.frame (x) and a column from it. The function performs some calculations on x and later returns another data.frame. I'm stuck on the best-practices method to pass the column name to the function.
The two minimal examples fun1 and fun2 below produce the desired result, being able to perform operations on x$column, using max() as an example. However, both rely on the seemingly (at least to me) inelegant
call to substitute() and possibly eval()
the need to pass the column name as a character vector.
fun1 <- function(x, column){
do.call("max", list(substitute(x[a], list(a = column))))
}
fun2 <- function(x, column){
max(eval((substitute(x[a], list(a = column)))))
}
df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")
I would like to be able to call the function as fun(df, B), for example. Other options I have considered but have not tried:
Pass column as an integer of the column number. I think this would avoid substitute(). Ideally, the function could accept either.
with(x, get(column)), but, even if it works, I think this would still require substitute
Make use of formula() and match.call(), neither of which I have much experience with.
Subquestion: Is do.call() preferred over eval()?
This answer will cover many of the same elements as existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covered things a little more comprehensively.
Suppose we have a very simple data frame:
dat <- data.frame(x = 1:4,
y = 5:8)
and we'd like to write a function that creates a new column z that is the sum of columns x and y.
A very common stumbling block here is that a natural (but incorrect) attempt often looks like this:
foo <- function(df,col_name,col1,col2){
df$col_name <- df$col1 + df$col2
df
}
#Call foo() like this:
foo(dat,z,x,y)
The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks for a column in df literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) Objects".
The simplest, and most often recommended solution is simply switch from $ to [[ and pass the function arguments as strings:
new_column1 <- function(df,col_name,col1,col2){
#Create new column col_name as sum of col1 and col2
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column1(dat,"z","x","y")
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.
The following two options are more advanced. Many popular packages make use of these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. This section of Hadley's Advanced R book is an excellent reference for some of these issues.
If you really want to save the user from typing all those quotes, one option might be to convert bare, unquoted column names to strings using deparse(substitute()):
new_column2 <- function(df,col_name,col1,col2){
col_name <- deparse(substitute(col_name))
col1 <- deparse(substitute(col1))
col2 <- deparse(substitute(col2))
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column2(dat,z,x,y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is, frankly, a bit silly probably, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.
Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow for other combinations of two variables. In that case we'd likely resort to using eval() on an expression involving the two columns:
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:
> new_column3(dat,z,x+y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
x y z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
x y z
1 1 5 5
2 2 6 12
3 3 7 21
4 4 8 32
So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval, substitute, etc. if you really know what you're doing.
You can just use the column name directly:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))
There's no need to use substitute, eval, etc.
You can even pass the desired function as a parameter:
fun1 <- function(x, column, fn) {
fn(x[,column])
}
fun1(df, "B", max)
Alternatively, using [[ also works for selecting a single column at a time:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[[column]])
}
fun1(df, "B")
Personally I think that passing the column as a string is pretty ugly. I like to do something like:
get.max <- function(column,data=NULL){
column<-eval(substitute(column),data, parent.frame())
max(column)
}
which will yield:
> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5
Notice how the specification of a data.frame is optional. you can even work with functions of your columns:
> get.max(1/mpg,mtcars)
[1] 0.09615385
Another way is to use tidy evaluation approach. It is pretty straightforward to pass columns of a data frame either as strings or bare column names. See more about tidyeval here.
library(rlang)
library(tidyverse)
set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))
Use column names as strings
fun3 <- function(x, ...) {
# capture strings and create variables
dots <- ensyms(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun3(df, "B")
#> B
#> 1 1.715065
fun3(df, "B", "D")
#> B D
#> 1 1.715065 1.786913
Use bare column names
fun4 <- function(x, ...) {
# capture expressions and create quosures
dots <- enquos(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun4(df, B)
#> B
#> 1 1.715065
fun4(df, B, D)
#> B D
#> 1 1.715065 1.786913
#>
Created on 2019-03-01 by the reprex package (v0.2.1.9000)
With dplyr it's now also possible to access a specific column of a dataframe by simply using double curly braces {{...}} around the desired column name within the function body, e.g. for col_name:
library(tidyverse)
fun <- function(df, col_name){
df %>%
filter({{col_name}} == "test_string")
}
As an extra thought, if is needed to pass the column name unquoted to the custom function, perhaps match.call() could be useful as well in this case, as an alternative to deparse(substitute()):
df <- data.frame(A = 1:10, B = 2:11)
fun <- function(x, column){
arg <- match.call()
max(x[[arg$column]])
}
fun(df, A)
#> [1] 10
fun(df, B)
#> [1] 11
If there is a typo in the column name, then would be safer to stop with an error:
fun <- function(x, column) max(x[[match.call()$column]])
fun(df, typo)
#> Warning in max(x[[match.call()$column]]): no non-missing arguments to max;
#> returning -Inf
#> [1] -Inf
# Stop with error in case of typo
fun <- function(x, column){
arg <- match.call()
if (is.null(x[[arg$column]])) stop("Wrong column name")
max(x[[arg$column]])
}
fun(df, typo)
#> Error in fun(df, typo): Wrong column name
fun(df, A)
#> [1] 10
Created on 2019-01-11 by the reprex package (v0.2.1)
I do not think I would use this approach since there is extra typing and complexity than just passing the quoted column name as pointed in the above answers, but well, is an approach.
Tung's answer and mgrund's answer presented tidy evaluation. In this answer I'll show how we can use these concepts to do something similar to joran's answer (specifically his function new_column3). The objective to this is to make it easier to see the differences between base evaluation and tidy one, and also to see the different syntaxes that can be used in tidy evaluation. You will need rlang and dplyr for this.
Using base evaluation tools (joran's answer):
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
In the first line, substitute is making us evaluate col_name as an expression, more specifically a symbol (also sometimes called a name), not an object. rlang's substitutes can be:
ensym - turns it into a symbol;
enexpr - turns it into a expression;
enquo - turns it into a quosure, an expression that also points the environment where R should look for the variables to evaluate it.
Most of the time, you want to have that pointer to the environment. When you don't specifically need it, having it rarely causes problems. Thus, most of the time you can use enquo. In this case, you can use ensym to make the code easier to read, as it makes it clearer what col_name is.
Also in the first line, deparse is turning the expression/symbol into a string. You could also use as.character or rlang::as_string.
In the second line, the substitute is turning expr into a 'full' expression (not a symbol), so ensym is not an option anymore.
Also in the second line, we can now change eval to rlang::eval_tidy. Eval would still work with enexpr, but not with a quosure. When you have a quosure, you don't need to pass the environment to the evaluation function (as joran did with parent.frame()).
One combination of the substitutions suggested above might be:
new_column3 <- function(df,col_name,expr){
col_name <- as_string(ensym(col_name))
df[[col_name]] <- eval_tidy(enquo(expr), df)
df
}
We can also use the dplyr operators, which allow for data-masking (evaluating a column in a data frame as a variable, calling it by its name). We can change the method of transforming the symbol to character + subsetting df using [[ with mutate:
new_column3 <- function(df,col_name,expr){
col_name <- ensym(col_name)
df %>% mutate(!!col_name := eval_tidy(enquo(expr), df))
}
To avoid the new column to be named "col_name", we anxious-evaluate it (as opposed to lazy-evaluate, the default of R) with the bang-bang !! operator. Because we made an operation to the left hand side, we can't use 'normal' =, and must use the new syntax :=.
The common operation of turning a column name into a symbol, then anxious-evaluating it with bang-bang has a shortcut: the curly-curly {{ operator:
new_column3 <- function(df,col_name,expr){
df %>% mutate({{col_name}} := eval_tidy(enquo(expr), df))
}
I'm not an expert in evaluation in R and might have done an over simplification, or used a wrong term, so please correct me in the comments. I hope to have helped in comparing the different tools used in the answers to this question.
If you are trying to build this function within an R package or simply want to reduce complexity, you can do the following:
test_func <- function(df, column) {
if (column %in% colnames(df)) {
return(max(df[, column, with=FALSE]))
} else {
stop(cat(column, "not in data.frame columns."))
}
}
The argument with=FALSE "disables the ability to refer to columns as if they are variables, thereby restoring the “data.frame mode” (per CRAN documentation). The if statement is a quick way to catch if the column name provided is within the data.frame. Could also use tryCatch error handling here.

R : Can I use Mydata$variable[i] in my for loop? [duplicate]

I'm trying to write a function to accept a data.frame (x) and a column from it. The function performs some calculations on x and later returns another data.frame. I'm stuck on the best-practices method to pass the column name to the function.
The two minimal examples fun1 and fun2 below produce the desired result, being able to perform operations on x$column, using max() as an example. However, both rely on the seemingly (at least to me) inelegant
call to substitute() and possibly eval()
the need to pass the column name as a character vector.
fun1 <- function(x, column){
do.call("max", list(substitute(x[a], list(a = column))))
}
fun2 <- function(x, column){
max(eval((substitute(x[a], list(a = column)))))
}
df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")
I would like to be able to call the function as fun(df, B), for example. Other options I have considered but have not tried:
Pass column as an integer of the column number. I think this would avoid substitute(). Ideally, the function could accept either.
with(x, get(column)), but, even if it works, I think this would still require substitute
Make use of formula() and match.call(), neither of which I have much experience with.
Subquestion: Is do.call() preferred over eval()?
This answer will cover many of the same elements as existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covered things a little more comprehensively.
Suppose we have a very simple data frame:
dat <- data.frame(x = 1:4,
y = 5:8)
and we'd like to write a function that creates a new column z that is the sum of columns x and y.
A very common stumbling block here is that a natural (but incorrect) attempt often looks like this:
foo <- function(df,col_name,col1,col2){
df$col_name <- df$col1 + df$col2
df
}
#Call foo() like this:
foo(dat,z,x,y)
The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks for a column in df literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) Objects".
The simplest, and most often recommended solution is simply switch from $ to [[ and pass the function arguments as strings:
new_column1 <- function(df,col_name,col1,col2){
#Create new column col_name as sum of col1 and col2
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column1(dat,"z","x","y")
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.
The following two options are more advanced. Many popular packages make use of these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. This section of Hadley's Advanced R book is an excellent reference for some of these issues.
If you really want to save the user from typing all those quotes, one option might be to convert bare, unquoted column names to strings using deparse(substitute()):
new_column2 <- function(df,col_name,col1,col2){
col_name <- deparse(substitute(col_name))
col1 <- deparse(substitute(col1))
col2 <- deparse(substitute(col2))
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column2(dat,z,x,y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is, frankly, a bit silly probably, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.
Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow for other combinations of two variables. In that case we'd likely resort to using eval() on an expression involving the two columns:
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:
> new_column3(dat,z,x+y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
x y z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
x y z
1 1 5 5
2 2 6 12
3 3 7 21
4 4 8 32
So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval, substitute, etc. if you really know what you're doing.
You can just use the column name directly:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))
There's no need to use substitute, eval, etc.
You can even pass the desired function as a parameter:
fun1 <- function(x, column, fn) {
fn(x[,column])
}
fun1(df, "B", max)
Alternatively, using [[ also works for selecting a single column at a time:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[[column]])
}
fun1(df, "B")
Personally I think that passing the column as a string is pretty ugly. I like to do something like:
get.max <- function(column,data=NULL){
column<-eval(substitute(column),data, parent.frame())
max(column)
}
which will yield:
> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5
Notice how the specification of a data.frame is optional. you can even work with functions of your columns:
> get.max(1/mpg,mtcars)
[1] 0.09615385
Another way is to use tidy evaluation approach. It is pretty straightforward to pass columns of a data frame either as strings or bare column names. See more about tidyeval here.
library(rlang)
library(tidyverse)
set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))
Use column names as strings
fun3 <- function(x, ...) {
# capture strings and create variables
dots <- ensyms(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun3(df, "B")
#> B
#> 1 1.715065
fun3(df, "B", "D")
#> B D
#> 1 1.715065 1.786913
Use bare column names
fun4 <- function(x, ...) {
# capture expressions and create quosures
dots <- enquos(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun4(df, B)
#> B
#> 1 1.715065
fun4(df, B, D)
#> B D
#> 1 1.715065 1.786913
#>
Created on 2019-03-01 by the reprex package (v0.2.1.9000)
With dplyr it's now also possible to access a specific column of a dataframe by simply using double curly braces {{...}} around the desired column name within the function body, e.g. for col_name:
library(tidyverse)
fun <- function(df, col_name){
df %>%
filter({{col_name}} == "test_string")
}
As an extra thought, if is needed to pass the column name unquoted to the custom function, perhaps match.call() could be useful as well in this case, as an alternative to deparse(substitute()):
df <- data.frame(A = 1:10, B = 2:11)
fun <- function(x, column){
arg <- match.call()
max(x[[arg$column]])
}
fun(df, A)
#> [1] 10
fun(df, B)
#> [1] 11
If there is a typo in the column name, then would be safer to stop with an error:
fun <- function(x, column) max(x[[match.call()$column]])
fun(df, typo)
#> Warning in max(x[[match.call()$column]]): no non-missing arguments to max;
#> returning -Inf
#> [1] -Inf
# Stop with error in case of typo
fun <- function(x, column){
arg <- match.call()
if (is.null(x[[arg$column]])) stop("Wrong column name")
max(x[[arg$column]])
}
fun(df, typo)
#> Error in fun(df, typo): Wrong column name
fun(df, A)
#> [1] 10
Created on 2019-01-11 by the reprex package (v0.2.1)
I do not think I would use this approach since there is extra typing and complexity than just passing the quoted column name as pointed in the above answers, but well, is an approach.
Tung's answer and mgrund's answer presented tidy evaluation. In this answer I'll show how we can use these concepts to do something similar to joran's answer (specifically his function new_column3). The objective to this is to make it easier to see the differences between base evaluation and tidy one, and also to see the different syntaxes that can be used in tidy evaluation. You will need rlang and dplyr for this.
Using base evaluation tools (joran's answer):
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
In the first line, substitute is making us evaluate col_name as an expression, more specifically a symbol (also sometimes called a name), not an object. rlang's substitutes can be:
ensym - turns it into a symbol;
enexpr - turns it into a expression;
enquo - turns it into a quosure, an expression that also points the environment where R should look for the variables to evaluate it.
Most of the time, you want to have that pointer to the environment. When you don't specifically need it, having it rarely causes problems. Thus, most of the time you can use enquo. In this case, you can use ensym to make the code easier to read, as it makes it clearer what col_name is.
Also in the first line, deparse is turning the expression/symbol into a string. You could also use as.character or rlang::as_string.
In the second line, substitute() captures expr as a 'full' expression (a call, not a symbol), so ensym is not an option anymore.
Also in the second line, we can now change eval to rlang::eval_tidy. Base eval would still work with enexpr, but not with a quosure. When you have a quosure, you don't need to pass the environment to the evaluation function (as joran did with parent.frame()).
One combination of the substitutions suggested above might be:
new_column3 <- function(df,col_name,expr){
col_name <- as_string(ensym(col_name))
df[[col_name]] <- eval_tidy(enquo(expr), df)
df
}
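To make the difference between the three capture functions concrete, here is a small sketch. It assumes rlang is installed; capture_demo, eval_in, and offset are hypothetical names used only for illustration:

```r
library(rlang)

# Capture the same argument in three ways without evaluating it.
capture_demo <- function(arg) {
  list(
    sym  = ensym(arg),   # a symbol; errors if arg is not a name or a string
    expr = enexpr(arg),  # the expression as typed, no environment attached
    quo  = enquo(arg)    # a quosure: the expression plus its environment
  )
}

out <- capture_demo(x)
is_symbol(out$sym)     # TRUE
is_quosure(out$quo)    # TRUE
as_string(out$sym)     # "x"

# A quosure remembers where it was written, so variables that are not
# columns (offset here) are still found without passing parent.frame().
eval_in <- function(df, expr) eval_tidy(enquo(expr), df)

dat <- data.frame(x = 1:4, y = 5:8)
offset <- 10
res <- eval_in(dat, x + y + offset)
res
```

The data frame is searched first and the quosure's environment second, which mirrors what eval(substitute(expr), df, parent.frame()) does in base R.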
We can also use the dplyr operators, which allow for data-masking (evaluating a column in a data frame as a variable, calling it by its name). Instead of converting the symbol to a string and subsetting df with [[, we can use mutate:
new_column3 <- function(df,col_name,expr){
col_name <- ensym(col_name)
df %>% mutate(!!col_name := eval_tidy(enquo(expr), df))
}
To avoid naming the new column "col_name", we force it to be evaluated early (unquoting it, as opposed to R's default lazy evaluation) with the bang-bang !! operator. Because we performed an operation on the left-hand side, we can't use the 'normal' = and must use the := syntax instead.
The common operation of capturing a function argument and then unquoting it with bang-bang has a shortcut: the curly-curly {{ operator:
new_column3 <- function(df,col_name,expr){
df %>% mutate({{col_name}} := eval_tidy(enquo(expr), df))
}
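As a variation on the functions above, the expression itself can also be embraced, which removes the explicit eval_tidy() call. This is a sketch, assuming dplyr (1.0 or later) and rlang are installed; new_column4 is a hypothetical name, not one of the functions from the answers:

```r
library(dplyr)
library(rlang)

new_column4 <- function(df, col_name, expr) {
  col_name <- ensym(col_name)
  # !! unquotes the captured name on the left of :=,
  # {{ }} forwards the captured expression to mutate() on the right.
  df %>% mutate(!!col_name := {{ expr }})
}

dat <- data.frame(x = 1:4, y = 5:8)
res <- new_column4(dat, z, x * y)
res
```

Here mutate() does the data-masked evaluation itself, so the function only has to capture and forward its arguments.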
I'm not an expert in evaluation in R and might have oversimplified something or used a wrong term, so please correct me in the comments. I hope this helps in comparing the different tools used in the answers to this question.
If you are trying to build this function within an R package or simply want to reduce complexity, you can do the following:
test_func <- function(df, column) {
if (column %in% colnames(df)) {
return(max(df[, column, with=FALSE]))
} else {
stop(column, " not in data.frame columns.")
}
}
The argument with=FALSE "disables the ability to refer to columns as if they are variables, thereby restoring the “data.frame mode”" (per the data.table documentation on CRAN). Note that with= is a data.table argument, so the function above expects df to be a data.table. The if statement is a quick way to check whether the column name provided is in the data.frame. You could also use tryCatch for error handling here.
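For a plain data.frame, the same check can be sketched with [[ in place of with=FALSE. The name max_of_column is hypothetical, and the tryCatch() call only illustrates how a caller might handle the error:

```r
# Hypothetical base R variant: validate the name, then select with [[.
max_of_column <- function(df, column) {
  stopifnot(is.character(column), length(column) == 1L)
  if (!column %in% colnames(df)) {
    stop("'", column, "' not in data.frame columns.")
  }
  max(df[[column]])
}

df <- data.frame(A = 1:10, B = 2:11)
max_of_column(df, "B")

# The caller decides what a missing column means, here NA.
safe <- tryCatch(max_of_column(df, "typo"), error = function(e) NA)
safe
```

Note that stop() builds its message from its arguments directly. Wrapping them in cat() would print the text and leave the error message empty.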

R: Why this function does not work when using $ sign in the code? [duplicate]

I'm trying to write a function to accept a data.frame (x) and a column from it. The function performs some calculations on x and later returns another data.frame. I'm stuck on the best-practices method to pass the column name to the function.
The two minimal examples fun1 and fun2 below produce the desired result, being able to perform operations on x$column, using max() as an example. However, both rely on the seemingly (at least to me) inelegant
call to substitute() and possibly eval()
the need to pass the column name as a character vector.
fun1 <- function(x, column){
do.call("max", list(substitute(x[a], list(a = column))))
}
fun2 <- function(x, column){
max(eval((substitute(x[a], list(a = column)))))
}
df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")
I would like to be able to call the function as fun(df, B), for example. Other options I have considered but have not tried:
Pass column as an integer of the column number. I think this would avoid substitute(). Ideally, the function could accept either.
with(x, get(column)), but, even if it works, I think this would still require substitute
Make use of formula() and match.call(), neither of which I have much experience with.
Subquestion: Is do.call() preferred over eval()?
This answer will cover many of the same elements as existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covered things a little more comprehensively.
Suppose we have a very simple data frame:
dat <- data.frame(x = 1:4,
y = 5:8)
and we'd like to write a function that creates a new column z that is the sum of columns x and y.
A very common stumbling block here is that a natural (but incorrect) attempt often looks like this:
foo <- function(df,col_name,col1,col2){
df$col_name <- df$col1 + df$col2
df
}
#Call foo() like this:
foo(dat,z,x,y)
The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks for a column in df literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) Objects".
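The difference can be seen directly at the console. This is a small sketch reusing dat from above, with col1 as an ordinary variable holding a column name:

```r
dat <- data.frame(x = 1:4, y = 5:8)
col1 <- "x"

dat$col1      # NULL: $ looks for a column literally named "col1"
dat[[col1]]   # 1 2 3 4: [[ evaluates col1 first and uses its value, "x"
```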
The simplest and most often recommended solution is simply to switch from $ to [[ and pass the function arguments as strings:
new_column1 <- function(df,col_name,col1,col2){
#Create new column col_name as sum of col1 and col2
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column1(dat,"z","x","y")
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.
The following two options are more advanced. Many popular packages make use of these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. This section of Hadley's Advanced R book is an excellent reference for some of these issues.
If you really want to save the user from typing all those quotes, one option might be to convert bare, unquoted column names to strings using deparse(substitute()):
new_column2 <- function(df,col_name,col1,col2){
col_name <- deparse(substitute(col_name))
col1 <- deparse(substitute(col1))
col2 <- deparse(substitute(col2))
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column2(dat,z,x,y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is, frankly, a bit silly probably, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.
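One of the subtle points of failure mentioned above shows up as soon as such a function is called from another function. A sketch, with get_col and wrapper as hypothetical names:

```r
# Returns a column given its bare name.
get_col <- function(df, col) df[[deparse(substitute(col))]]

# A wrapper that simply passes its argument on.
wrapper <- function(df, col) get_col(df, col)

dat <- data.frame(x = 1:4, y = 5:8)
get_col(dat, x)   # 1 2 3 4
wrapper(dat, x)   # NULL: substitute() sees the symbol `col`, not `x`
```

Inside get_col(), substitute() only sees what was typed in the immediate call, which is `col` when the call comes from wrapper(). The string version (new_column1) does not have this problem.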
Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow for other combinations of two variables. In that case we'd likely resort to using eval() on an expression involving the two columns:
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:
> new_column3(dat,z,x+y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
x y z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
x y z
1 1 5 5
2 2 6 12
3 3 7 21
4 4 8 32
So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval, substitute, etc. if you really know what you're doing.
You can just use the column name directly:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))
There's no need to use substitute, eval, etc.
You can even pass the desired function as a parameter:
fun1 <- function(x, column, fn) {
fn(x[,column])
}
fun1(df, "B", max)
Alternatively, using [[ also works for selecting a single column at a time:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[[column]])
}
fun1(df, "B")
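Since [[ accepts either a name or a position, the same function also covers the "column number" option raised in the question. A short sketch, with col_max as a hypothetical name:

```r
col_max <- function(x, column) max(x[[column]])

df <- data.frame(A = 1:10, B = 2:11, C = 3:12)
col_max(df, "B")  # select by name
col_max(df, 2)    # select by position, same column
```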
Personally I think that passing the column as a string is pretty ugly. I like to do something like:
get.max <- function(column,data=NULL){
column<-eval(substitute(column),data, parent.frame())
max(column)
}
which will yield:
> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5
Notice how the specification of a data.frame is optional. You can even work with functions of your columns:
> get.max(1/mpg,mtcars)
[1] 0.09615385
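One consequence of this design is worth seeing in a sketch: when the name is not a column of data, eval() falls back to the calling environment. The variable mpg defined below is a hypothetical stand-in for anything that happens to exist in the caller's workspace:

```r
get.max <- function(column, data = NULL) {
  column <- eval(substitute(column), data, parent.frame())
  max(column)
}

mpg <- c(1, 2, 3)      # an unrelated variable in the calling environment

get.max(mpg, mtcars)   # 33.9: the column in mtcars takes precedence
get.max(mpg)           # 3: no data given, so the variable is used
get.max(mpg, iris)     # 3: iris has no mpg column, silent fallback
```

The last call returns a value instead of failing, so a mistyped column name can go unnoticed if a variable with the same name exists.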
Another way is to use the tidy evaluation approach. It is pretty straightforward to pass columns of a data frame either as strings or bare column names. See more about tidyeval here.
library(rlang)
library(tidyverse)
set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))
Use column names as strings
fun3 <- function(x, ...) {
# capture strings and create variables
dots <- ensyms(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun3(df, "B")
#> B
#> 1 1.715065
fun3(df, "B", "D")
#> B D
#> 1 1.715065 1.786913
Use bare column names
fun4 <- function(x, ...) {
# capture expressions and create quosures
dots <- enquos(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun4(df, B)
#> B
#> 1 1.715065
fun4(df, B, D)
#> B D
#> 1 1.715065 1.786913
#>
Created on 2019-03-01 by the reprex package (v0.2.1.9000)
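In dplyr 1.0.0 and later, the same two functions can be sketched with across(), which supersedes summarise_at(). The function names max_bare and max_strings and the small data frame are hypothetical:

```r
library(dplyr)

# Bare column names: embrace the argument and let across() select.
max_bare <- function(x, cols) {
  summarise(x, across({{ cols }}, ~ max(.x, na.rm = TRUE)))
}

# Column names as a character vector: wrap them in all_of().
max_strings <- function(x, cols) {
  summarise(x, across(all_of(cols), ~ max(.x, na.rm = TRUE)))
}

df <- data.frame(B = c(1, 5, 3), D = c(2, 9, 4))
res_bare <- max_bare(df, c(B, D))
res_strings <- max_strings(df, c("B", "D"))
res_bare
```

Keeping the two calling conventions in separate functions makes it explicit which one a caller is expected to use.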
With dplyr it's now also possible to access a specific column of a dataframe by simply using double curly braces {{...}} around the desired column name within the function body, e.g. for col_name:
library(tidyverse)
fun <- function(df, col_name){
df %>%
filter({{col_name}} == "test_string")
}
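A usage sketch for this pattern, assuming dplyr 1.0 or later. The names filter_by, target, and animals are hypothetical:

```r
library(dplyr)

filter_by <- function(df, col_name, target) {
  # {{ }} forwards the bare column name into filter()
  df %>%
    filter({{ col_name }} == target)
}

animals <- data.frame(kind = c("cat", "dog", "cat"), n = 1:3)
cats <- filter_by(animals, kind, "cat")
cats
```

One thing to keep in mind is that target is looked up in the data first, so a column that is itself called target would take precedence over the argument.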
As an extra thought, if the column name needs to be passed unquoted to the custom function, perhaps match.call() could be useful as well in this case, as an alternative to deparse(substitute()):
df <- data.frame(A = 1:10, B = 2:11)
fun <- function(x, column){
arg <- match.call()
max(x[[arg$column]])
}
fun(df, A)
#> [1] 10
fun(df, B)
#> [1] 11
If there is a typo in the column name, it would be safer to stop with an error:
fun <- function(x, column) max(x[[match.call()$column]])
fun(df, typo)
#> Warning in max(x[[match.call()$column]]): no non-missing arguments to max;
#> returning -Inf
#> [1] -Inf
# Stop with error in case of typo
fun <- function(x, column){
arg <- match.call()
if (is.null(x[[arg$column]])) stop("Wrong column name")
max(x[[arg$column]])
}
fun(df, typo)
#> Error in fun(df, typo): Wrong column name
fun(df, A)
#> [1] 10
Created on 2019-01-11 by the reprex package (v0.2.1)
I do not think I would use this approach myself, since it involves more typing and complexity than simply passing the quoted column name, as pointed out in the answers above, but it is an option.
Tung's answer and mgrund's answer presented tidy evaluation. In this answer I'll show how we can use these concepts to do something similar to joran's answer (specifically his function new_column3). The objective is to make it easier to see the differences between base evaluation and tidy evaluation, and also to see the different syntaxes that can be used in tidy evaluation. You will need rlang and dplyr for this.
Using base evaluation tools (joran's answer):
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
In the first line, substitute() captures col_name as an unevaluated expression, more specifically a symbol (also sometimes called a name), instead of evaluating it as an object. rlang's counterparts to substitute() are:
ensym - turns it into a symbol;
enexpr - turns it into an expression;
enquo - turns it into a quosure, an expression that also points to the environment where R should look for the variables when evaluating it.
Most of the time, you want to have that pointer to the environment. When you don't specifically need it, having it rarely causes problems. Thus, most of the time you can use enquo. In this case, you can use ensym to make the code easier to read, as it makes it clearer what col_name is.
Also in the first line, deparse is turning the expression/symbol into a string. You could also use as.character or rlang::as_string.
In the second line, substitute() captures expr as a 'full' expression (a call, not a symbol), so ensym is not an option anymore.
Also in the second line, we can now change eval to rlang::eval_tidy. Base eval would still work with enexpr, but not with a quosure. When you have a quosure, you don't need to pass the environment to the evaluation function (as joran did with parent.frame()).
One combination of the substitutions suggested above might be:
new_column3 <- function(df,col_name,expr){
col_name <- as_string(ensym(col_name))
df[[col_name]] <- eval_tidy(enquo(expr), df)
df
}
We can also use the dplyr operators, which allow for data-masking (evaluating a column in a data frame as a variable, calling it by its name). Instead of converting the symbol to a string and subsetting df with [[, we can use mutate:
new_column3 <- function(df,col_name,expr){
col_name <- ensym(col_name)
df %>% mutate(!!col_name := eval_tidy(enquo(expr), df))
}
To avoid naming the new column "col_name", we force it to be evaluated early (unquoting it, as opposed to R's default lazy evaluation) with the bang-bang !! operator. Because we performed an operation on the left-hand side, we can't use the 'normal' = and must use the := syntax instead.
The common operation of capturing a function argument and then unquoting it with bang-bang has a shortcut: the curly-curly {{ operator:
new_column3 <- function(df,col_name,expr){
df %>% mutate({{col_name}} := eval_tidy(enquo(expr), df))
}
I'm not an expert in evaluation in R and might have oversimplified something or used a wrong term, so please correct me in the comments. I hope this helps in comparing the different tools used in the answers to this question.
If you are trying to build this function within an R package or simply want to reduce complexity, you can do the following:
test_func <- function(df, column) {
if (column %in% colnames(df)) {
return(max(df[, column, with=FALSE]))
} else {
stop(column, " not in data.frame columns.")
}
}
The argument with=FALSE "disables the ability to refer to columns as if they are variables, thereby restoring the “data.frame mode”" (per the data.table documentation on CRAN). Note that with= is a data.table argument, so the function above expects df to be a data.table. The if statement is a quick way to check whether the column name provided is in the data.frame. You could also use tryCatch for error handling here.


Using a string to create a variable and put it in a data frame [duplicate]

I'm trying to write a function to accept a data.frame (x) and a column from it. The function performs some calculations on x and later returns another data.frame. I'm stuck on the best-practices method to pass the column name to the function.
The two minimal examples fun1 and fun2 below produce the desired result, being able to perform operations on x$column, using max() as an example. However, both rely on the seemingly (at least to me) inelegant
call to substitute() and possibly eval()
the need to pass the column name as a character vector.
fun1 <- function(x, column){
do.call("max", list(substitute(x[a], list(a = column))))
}
fun2 <- function(x, column){
max(eval((substitute(x[a], list(a = column)))))
}
df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")
I would like to be able to call the function as fun(df, B), for example. Other options I have considered but have not tried:
Pass column as an integer of the column number. I think this would avoid substitute(). Ideally, the function could accept either.
with(x, get(column)), but, even if it works, I think this would still require substitute
Make use of formula() and match.call(), neither of which I have much experience with.
Subquestion: Is do.call() preferred over eval()?
This answer will cover many of the same elements as existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covered things a little more comprehensively.
Suppose we have a very simple data frame:
dat <- data.frame(x = 1:4,
y = 5:8)
and we'd like to write a function that creates a new column z that is the sum of columns x and y.
A very common stumbling block here is that a natural (but incorrect) attempt often looks like this:
foo <- function(df,col_name,col1,col2){
df$col_name <- df$col1 + df$col2
df
}
#Call foo() like this:
foo(dat,z,x,y)
The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks for a column in df literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) Objects".
The simplest, and most often recommended solution is simply switch from $ to [[ and pass the function arguments as strings:
new_column1 <- function(df,col_name,col1,col2){
#Create new column col_name as sum of col1 and col2
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column1(dat,"z","x","y")
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.
The following two options are more advanced. Many popular packages make use of these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. This section of Hadley's Advanced R book is an excellent reference for some of these issues.
If you really want to save the user from typing all those quotes, one option might be to convert bare, unquoted column names to strings using deparse(substitute()):
new_column2 <- function(df,col_name,col1,col2){
col_name <- deparse(substitute(col_name))
col1 <- deparse(substitute(col1))
col2 <- deparse(substitute(col2))
df[[col_name]] <- df[[col1]] + df[[col2]]
df
}
> new_column2(dat,z,x,y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
This is, frankly, a bit silly probably, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.
Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow for other combinations of two variables. In that case we'd likely resort to using eval() on an expression involving the two columns:
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:
> new_column3(dat,z,x+y)
x y z
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
x y z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
x y z
1 1 5 5
2 2 6 12
3 3 7 21
4 4 8 32
So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval, substitute, etc. if you really know what you're doing.
You can just use the column name directly:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))
There's no need to use substitute, eval, etc.
You can even pass the desired function as a parameter:
fun1 <- function(x, column, fn) {
fn(x[,column])
}
fun1(df, "B", max)
Alternatively, using [[ also works for selecting a single column at a time:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[[column]])
}
fun1(df, "B")
Personally I think that passing the column as a string is pretty ugly. I like to do something like:
get.max <- function(column,data=NULL){
column<-eval(substitute(column),data, parent.frame())
max(column)
}
which will yield:
> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5
Notice how the specification of a data.frame is optional. you can even work with functions of your columns:
> get.max(1/mpg,mtcars)
[1] 0.09615385
Another way is to use tidy evaluation approach. It is pretty straightforward to pass columns of a data frame either as strings or bare column names. See more about tidyeval here.
library(rlang)
library(tidyverse)
set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))
Use column names as strings
fun3 <- function(x, ...) {
# capture strings and create variables
dots <- ensyms(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun3(df, "B")
#> B
#> 1 1.715065
fun3(df, "B", "D")
#> B D
#> 1 1.715065 1.786913
Use bare column names
fun4 <- function(x, ...) {
# capture expressions and create quosures
dots <- enquos(...)
# unquote to evaluate inside dplyr verbs
summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun4(df, B)
#> B
#> 1 1.715065
fun4(df, B, D)
#> B D
#> 1 1.715065 1.786913
#>
Created on 2019-03-01 by the reprex package (v0.2.1.9000)
With dplyr it's now also possible to access a specific column of a dataframe by simply using double curly braces {{...}} around the desired column name within the function body, e.g. for col_name:
library(tidyverse)
fun <- function(df, col_name){
df %>%
filter({{col_name}} == "test_string")
}
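The same embracing syntax works in other dplyr verbs. As a sketch closer to the max() example from the question (col_max is an illustrative name, and mtcars is used only as convenient built-in data):

```r
library(dplyr)

# Hypothetical helper: returns the maximum of a bare column name
col_max <- function(df, col_name) {
  df %>%
    summarise(max = max({{ col_name }}, na.rm = TRUE)) %>%
    pull(max)
}

col_max(mtcars, mpg)  # 33.9
```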
As an extra thought, if you need to pass the column name unquoted to the custom function, match.call() could also be useful here, as an alternative to deparse(substitute()):
df <- data.frame(A = 1:10, B = 2:11)
fun <- function(x, column){
arg <- match.call()
max(x[[arg$column]])
}
fun(df, A)
#> [1] 10
fun(df, B)
#> [1] 11
If there is a typo in the column name, it would be safer to stop with an error:
fun <- function(x, column) max(x[[match.call()$column]])
fun(df, typo)
#> Warning in max(x[[match.call()$column]]): no non-missing arguments to max;
#> returning -Inf
#> [1] -Inf
# Stop with error in case of typo
fun <- function(x, column){
arg <- match.call()
if (is.null(x[[arg$column]])) stop("Wrong column name")
max(x[[arg$column]])
}
fun(df, typo)
#> Error in fun(df, typo): Wrong column name
fun(df, A)
#> [1] 10
Created on 2019-01-11 by the reprex package (v0.2.1)
I do not think I would use this approach, since it involves more typing and complexity than just passing the quoted column name as shown in the answers above, but it is an option.
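For comparison, a base R sketch that accepts the column either bare or quoted can be built on as.character(substitute()). This is an illustration rather than a recommendation: col_max is a hypothetical name, and a variable that holds a column name would be read literally as the name of that variable.

```r
# Hypothetical helper: works with col_max(df, A) and col_max(df, "A")
col_max <- function(x, column) {
  col <- as.character(substitute(column))  # symbol A -> "A"; string "A" stays "A"
  stopifnot(length(col) == 1)
  if (!col %in% names(x)) stop("Column '", col, "' not found in data frame")
  max(x[[col]])
}

df <- data.frame(A = 1:10, B = 2:11)
col_max(df, A)    # 10
col_max(df, "B")  # 11
```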
Tung's answer and mgrund's answer presented tidy evaluation. In this answer I'll show how we can use these concepts to do something similar to joran's answer (specifically his function new_column3). The objective is to make it easier to see the differences between base evaluation and tidy evaluation, and also to see the different syntaxes that can be used in tidy evaluation. You will need rlang and dplyr for this.
Using base evaluation tools (joran's answer):
new_column3 <- function(df,col_name,expr){
col_name <- deparse(substitute(col_name))
df[[col_name]] <- eval(substitute(expr),df,parent.frame())
df
}
In the first line, substitute captures col_name as an unevaluated expression, more specifically a symbol (also sometimes called a name), rather than evaluating it as an object. rlang's replacements for substitute are:
ensym - turns it into a symbol;
enexpr - turns it into an expression;
enquo - turns it into a quosure, an expression that also points to the environment where R should look for the variables to evaluate it.
Most of the time, you want to have that pointer to the environment. When you don't specifically need it, having it rarely causes problems. Thus, most of the time you can use enquo. In this case, you can use ensym to make the code easier to read, as it makes it clearer what col_name is.
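To make the three capture functions more concrete, here is a small sketch of what each returns for the same argument. The helper names capture and only_sym are hypothetical.

```r
library(rlang)

# Hypothetical helper: capture the same argument in three ways
capture <- function(arg) {
  list(sym = ensym(arg), expr = enexpr(arg), quo = enquo(arg))
}

out <- capture(mpg)
is_symbol(out$sym)    # TRUE
is_symbol(out$expr)   # TRUE: a bare name is also a valid expression
is_quosure(out$quo)   # TRUE: expression plus the environment it came from

# ensym() refuses anything that is not a single name (or string)
only_sym <- function(arg) ensym(arg)
inherits(try(only_sym(x + y), silent = TRUE), "try-error")  # TRUE
```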
Also in the first line, deparse is turning the expression/symbol into a string. You could also use as.character or rlang::as_string.
In the second line, the substitute is turning expr into a 'full' expression (not a symbol), so ensym is not an option anymore.
Also in the second line, we can now change eval to rlang::eval_tidy. eval would still work with the result of enexpr, but not with a quosure. When you have a quosure, you don't need to pass the environment to the evaluation function (as joran did with parent.frame()), because the quosure already carries it.
One combination of the substitutions suggested above might be:
new_column3 <- function(df,col_name,expr){
col_name <- as_string(ensym(col_name))
df[[col_name]] <- eval_tidy(enquo(expr), df)
df
}
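A quick usage sketch of this version, using the small data frame from joran's answer and an illustrative variable offset that lives in the calling environment:

```r
library(rlang)

new_column3 <- function(df, col_name, expr) {
  col_name <- as_string(ensym(col_name))
  # The quosure remembers where `expr` was written, so no parent.frame() is needed
  df[[col_name]] <- eval_tidy(enquo(expr), df)
  df
}

dat <- data.frame(x = 1:4, y = 5:8)
offset <- 10                          # not a column; found through the quosure's environment
res <- new_column3(dat, z, x + y + offset)
res$z                                 # 16 18 20 22
```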
We can also use the dplyr operators, which allow for data-masking (evaluating a column in a data frame as a variable, calling it by its name). Instead of converting the symbol to a character string and subsetting df with [[, we can use mutate:
new_column3 <- function(df,col_name,expr){
col_name <- ensym(col_name)
df %>% mutate(!!col_name := eval_tidy(enquo(expr), df))
}
To avoid the new column being named "col_name", we unquote it, forcing it to be evaluated early (as opposed to R's default lazy evaluation), with the bang-bang !! operator. Because we performed an operation on the left-hand side, we can't use the 'normal' = and must use the new syntax := instead.
The common operation of capturing a function argument and then unquoting it with bang-bang has a shortcut: the curly-curly {{ operator:
new_column3 <- function(df,col_name,expr){
df %>% mutate({{col_name}} := eval_tidy(enquo(expr), df))
}
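Since mutate() already data-masks its arguments, the expression can also be embraced directly instead of being evaluated with eval_tidy. This is a sketch of that variant, assuming a dplyr and rlang recent enough to support {{ (rlang 0.4.0 or later); add_column is a hypothetical name.

```r
library(dplyr)

# Hypothetical variant: both the new name and the expression are embraced
add_column <- function(df, col_name, expr) {
  df %>% mutate({{ col_name }} := {{ expr }})
}

dat <- data.frame(x = 1:4, y = 5:8)
res <- add_column(dat, z, x + y)
res$z  # 6 8 10 12
```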
I'm not an expert in evaluation in R and might have oversimplified something or used a wrong term, so please correct me in the comments. I hope to have helped in comparing the different tools used in the answers to this question.
If you are trying to build this function within an R package or simply want to reduce complexity, you can do the following (note that with=FALSE is a data.table argument, so df must be a data.table here):
# Assumes df is a data.table; a plain data.frame does not accept with = FALSE
test_func <- function(df, column) {
if (column %in% colnames(df)) {
return(max(df[, column, with=FALSE]))
} else {
# stop() pastes its arguments into the message; cat() would print and return NULL
stop(column, " not in data.frame columns.")
}
}
The argument with=FALSE "disables the ability to refer to columns as if they are variables, thereby restoring the “data.frame mode”" (per the data.table documentation on CRAN). The if statement is a quick way to check whether the column name provided is within the data.frame. You could also use tryCatch error handling here.
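For a plain data.frame, the same check can be written with [[ and combined with tryCatch on the calling side. A minimal sketch, where safe_max is a hypothetical name and returning NA on failure is only one possible choice:

```r
# Hypothetical helper for a plain data.frame (no data.table needed)
safe_max <- function(df, column) {
  if (!column %in% colnames(df)) {
    stop("'", column, "' is not a column of the data frame.")
  }
  max(df[[column]])
}

df <- data.frame(A = 1:10, B = 2:11)
safe_max(df, "B")  # 11

# The caller decides how to recover from a bad column name
result <- tryCatch(
  safe_max(df, "typo"),
  error = function(e) {
    message("Caught: ", conditionMessage(e))
    NA
  }
)
result  # NA
```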

Resources