How to use str_glue in data.table (environments question) - r

In a real life project I am trying to create a function to be used in data.table (j) which relies on string formatting using the str_glue function. I get an error related to environments.
What follows below is a trivial example (converting mpg to l/100km in the mtcars dataset) to highlight the behavior. My example might seem convoluted (e.g. a list holding only 1 element) but this is to better reflect my real-life problem which is of course more complex.
01 - Setting up the data.table and the list holding the conversion factor
conversion_rates = list(mpg_lkm = 282.5)
dt = as.data.table(mtcars, keep.rownames = "rn")
02 - Creating a function to calculate liters per 100km
mpg_to_lkm = function(mpg, cr) {
cri = cr$mpg_lkm
return(cri / mpg)
}
03 - using the function in j, by passing the column as the first argument, and the conversion rates from the global environment as the second
dt[, lkm := mpg_to_lkm(mpg, conversion_rates)] # passing conversion_rates works
head(dt)
So far so good. I seem to have established that you can pass arguments from the data.table scope and global scope without any problem.
04 - defining a function that will create a 'character' column which explains the formula
explain_formula = function(mpg, cr) {
print(ls(envir = environment())) # proves that mpg & cr are available in my envir
cri = cr$mpg_lkm
result = paste("lkm obtained by dividing {cri} by", mpg) # curly brace notation
result = purrr::map_chr(result, stringr::str_glue) # str_glue is not vectorised so I use map
return(result)
}
05 - testing my function shows that it runs smoothly
testcol = dt$mpg
result = explain_formula(testcol, conversion_rates)
result # --> output exactly as expected
06 - now trying it from j in data.table
dt[, cr := explain_formula(mpg, conversion_rates)]
throws error
Error in eval(parse(text = text, keep.source = FALSE), envir) :
object 'cri' not found
str_glue has as a default argument envir = parent.frame(). The parent frame of a function evaluation is the environment in which the function was called. In my case this has to be the environment of "explain_formula" right?
I have found out that 2 things work solve my issue:
creating cri in the global environment -> not usefull for my real life problem
passing .envir = environment() to str_glue -> this is my current solution
Can anyone explain me where my reasoning is bad? Why is '06' not giving the desired result?

Related

R: cannot make use of `data.table`'s environment, base environment and global environment all at once

Consider the dummy example below: I want to run a model on a range of subsets of the data.table in a loop, and want to specify the exact line to iterate as a string (with an iterator i)
library(data.table)
DT <- data.table(X = runif(100), Y = runif(100))
f1 <- function(code) {
for (i in c(20,30,50)) {
eval(parse(text = code))
}
}
f1("lm(X ~ Y, data = DT[sample(.N, i)])")
Obviously this doesn't return any output as lm() is merely evaluated in the background 3 times. The actual use case is more convoluted, but this is meant to be a theoretical simplification of it.
The example above, nonetheless, works fine. The problems begin when the function f1 is included in the package, instead of being defined in the global environment. If I'm not mistaken, in this case f1 is defined in the package's base env. Then, calling f1 from global env gives the error: Error in [.data.frame(x, i) : undefined columns selected. R can correctly access iterator i in its base env and DT in the global env, but cannot access the column by name inside data.table's square brackets.
I tried experimenting by setting envir and enclos arguments to eval() to baseenv(), globalenv(), parent.frame(), but haven't managed to find a combination that works.
For example, setting envir = globalenv() seems to result in accessing DT and i, but not X and Y from the DT inside lm(). Setting envir = baseenv() we lose the global env and cannot access DT (envir = baseenv(), enclos = globalenv() doesn't change it). Using envir = list(baseenv(), globalenv()) results in not being able to access anything inside data.table's square brackets, I think, error message: "Error in [.data.frame(x, i) : undefined columns selected".
The problem is that variables are resolved lexicographically. You could try passing in the expression and the substituting the value of i specifically before evaluating. This would take care of eliminating the need for explicit parsing.
f1 <- function(code) {
code <- substitute(code)
for (i in c(20,30,50)) {
cmd <- do.call("substitute", list(code, list(i=i)))
print(cmd)
result <- eval.parent(cmd)
print(result)
}
}
f1(lm(X ~ Y, data = DT[sample(.N, i)]))

force the evaluation of .SD in data.table

In order to debug j in data.table I prefer to interactively inspect the resulting -by- dt´s with browser(). SO 2013 adressed this issue and I understand that .SD must be invoked in j in order for all columns to be evaluated. I use Rstudio and using the SO 2013 method, there are two problems:
The environment pane is not updated reflecting the browser environment
I often encounter the following error msg
Error: option error has NULL value
In addition: Warning message:
In get(object, envir = currentEnv, inherits = TRUE) :
restarting interrupted promise evaluation
I can get around this by doing:
f <- function(sd=force(.SD),.env = parent.frame(n = 1)) {
by = .env$.BY;
i = .env$.I;
sd = .env$.SD;
grp = .env$.GRP;
N = .env$.N;
browser()
}
library (data.table)
setDT(copy(mtcars))[,f(.SD),by=.(gear)]
But - in the data.table spirit of keeping things short and sweet- can I somehow force (the force in f does not work) the evaluation of .SD in the call to f so that the final code could run:
setDT(copy(mtcars))[,f(),by=.(gear)]
As far as I know,
data.table needs to explicitly see .SD somewhere in the code passed to j,
otherwise it won't even expose it in the environment it creates for the execution.
See for example this question and its comments.
Why don't you create a different helper function that always specifies .SD in j?
Something like:
dt_debugger <- function(dt, ...) {
f <- function(..., .caller_env = parent.frame()) {
by <- .caller_env$.BY;
i <- .caller_env$.I;
sd <- .caller_env$.SD;
grp <- .caller_env$.GRP;
N <- .caller_env$.N;
browser()
}
dt[..., j = f(.SD)]
}
dt_debugger(as.data.table(mtcars), by = .(gear))

Check expression argument of function

When writing functions it is important to check for the type of arguments. For example, take the following (not necessarily useful) function which is performing subsetting:
data_subset = function(data, date_col) {
if (!TRUE %in% (is.character(date_col) | is.expression(date_col))){
stop("Input variable date is of wrong format")
}
if (is.character(date_col)) {
x <- match(date_col, names(data))
} else x <- match(deparse(substitute(date_col)), names(data))
sub <- data[,x]
}
I would like to allow the user to provide the column which should be extracted as character or expression (e.g. a column called "date" vs. just date). At the beginning I would like to check that the input for date_col is really either a character value or an expression. However, 'is.expression' does not work:
Error in match(x, table, nomatch = 0L) : object '...' not found
Since deparse(substitute)) works if one provides expressions I thought 'is.expression' has to work as well.
What is wrong here, can anyone give me a hint?
I think you are not looking for is.expression but for is.name.
The tricky part is to get the type of date_col and to check if it is of type character only if it is not of type name. If you called is.character when it's a name, then it would get evaluated, typically resulting in an error because the object is not defined.
To do this, short circuit evaluation can be used: In
if(!(is.name(substitute(date_col)) || is.character(date_col)))
is.character is only called if is.name returns FALSE.
Your function boils down to:
data_subset = function(data, date_col) {
if(!(is.name(substitute(date_col)) || is.character(date_col))) {
stop("Input variable date is of wrong format")
}
date_col2 <- as.character(substitute(date_col))
return(data[, date_col2])
}
Of course, you could use if(is.name(…)) to convert only to character when date_col is a name.
This works:
testDF <- data.frame(col1 = rnorm(10), col2 = rnorm(10, mean = 10), col3 = rnorm(10, mean = 50), rnorm(10, mean = 100))
data_subset(testDF, "col1") # ok
data_subset(testDF, col1) # ok
data_subset(testDF, 1) # Error in data_subset(testDF, 1) : Input variable date is of wrong format
However, I don't think you should do this. Consider the following example:
var <- "col1"
data_subset(testDF, var) # Error in `[.data.frame`(data, , date_col2) : undefined columns selected
col1 <- "col2"
data_subset(testDF, col1) # Gives content of column 1, not column 2.
Though this "works as designed", it is confusing because unless carefully reading your function's documentation one would expect to get col1 in the first case and col2 in the second case.
Abusing a famous quote:
Some people, when confronted with a problem, think “I know, I'll use non-standard evaluation.” Now they have two problems.
Hadley Wickham in Non-standard evaluation:
Non-standard evaluation allows you to write functions that are extremely powerful. However, they are harder to understand and to program with. As well as always providing an escape hatch, carefully consider both the costs and benefits of NSE before using it in a new domain.
Unless you expect large benefits from allowing to skip the quotes around the name of the column, don't do it.

Passing expression through functions

I'm using data.table package and trying to write a function (shown below):
require(data.table)
# Function definition
f = function(path, key) {
table = data.table(read.delim(path, header=TRUE))
e = substitute(key)
setkey(table, e) # <- Error in setkeyv(x, cols, verbose = verbose) : some columns are not in the data.table: e
return(table)
}
# Usage
f("table.csv", ID)
Here I try to pass an expression to the function. Why this code doesn't work?
I've already tried different combinations of substitute(), quote() and eval(). So, it'd be great if you could also explain how to get this to work.
First, let's look at how the setkey function does things from the data.table package:
# setkey function
function (x, ..., verbose = getOption("datatable.verbose"))
{
if (is.character(x))
stop("x may no longer be the character name of the data.table. The possibility was undocumented and has been removed.")
cols = getdots()
if (!length(cols))
cols = colnames(x)
else if (identical(cols, "NULL"))
cols = NULL
setkeyv(x, cols, verbose = verbose)
}
So, when you do:
require(data.table)
dt <- data.table(ID=c(1,1,2,2,3), y = 1:5)
setkey(dt, ID)
It calls the function getdots which is internal to data.table (that is, it's not exported). Let's have a look at that function:
# data.table:::getdots
function ()
{
as.character(match.call(sys.function(-1), call = sys.call(-1),
expand.dots = FALSE)$...)
}
So, what does this do? It takes the parameter you entered in setkey and it uses match.call to extract the arguments separately. That is, the match.call argument for this example case would be:
setkey(x = dt, ... = list(ID))
and since it's a list, you can access the ... parameter with $... to get a list of 1 element with its value ID and converting to this list to a character with as.character results in "ID" (a character vector). And then setkey passes this to setkeyv internally to set the keys.
Now why doesn't this work when you write setkey(table, key) inside your function?
This is precisely because of the way setkey/getdots is. The setkey function is designed to take any argument after the first argument (which is a data.table) and then return the ... argument as a character.
That is, if you give setkey(dt, key) then it'll return cols <- "key". If you give setkey(dt, e), it'll give back cols <- "e". It doesn't look for if "key" is an existing variable and then if so substitute the value of the variable. All it does is convert the value you provide (whether it be a symbol or character) back to a character.
Of course this won't work in your case because you want the value in key = ID to be provided in setkey. At least I can't think of a way to do this.
How to get around this?
As #agstudy already mentions, the best/easiest way is to pass "ID" and use setkeyv. But, if you really insist on using f("table.csv", ID) then, this is what you could do:
f <- function(path, key) {
table = data.table(read.delim(path, header=TRUE))
e = as.character(match.call(f)$key)
setkeyv(table, e)
return(table)
}
Here, you first use match.call to get the value corresponding to argument key and then convert it to a character and then pass that to setkeyv.
In short, setkey internally uses setkeyv. And imho, setkey is a convenient function to be used when you already know the column name of the data.table for which you need to set the key. Hope this helps.
I can't tell from your code what you're trying to achieve, so I'll answer the question the title asks instead; "How to pass an expression through a function?"
If you want to do this (this should be avoided where possible), you can do the following:
f <- function(expression) {
return(eval(parse(text=expression)))
}
For example:
f("a <- c(1,2,3); sum(a)")
# [1] 6

Disable assignment via = in R

R allows for assignment via <- and =.
Whereas there a subtle differences between both assignment operators, there seems to be a broad consensus that <- is the better choice than =, as = is also used as operator mapping values to arguments and thus its use may lead to ambiguous statements. The following exemplifies this:
> system.time(x <- rnorm(10))
user system elapsed
0 0 0
> system.time(x = rnorm(10))
Error in system.time(x = rnorm(10)) : unused argument(s) (x = rnorm(10))
In fact, the Google style code disallows using = for assignment (see comments to this answer for a converse view).
I also almost exclusively use <- as assignment operator. However, the almost in the previous sentence is the reason for this question. When = acts as assignment operator in my code it is always accidental and if it leads to problems these are usually hard to spot.
I would like to know if there is a way to turn off assignment via = and let R throw an error any time = is used for assignment.
Optimally this behavior would only occur for code in the Global Environment, as there may well be code in attached namespaces that uses = for assignment and should not break.
(This question was inspired by a discussion with Jonathan Nelson)
Here's a candidate:
`=` <- function(...) stop("Assignment by = disabled, use <- instead")
# seems to work
a = 1
Error in a = 1 : Assignment by = disabled, use <- instead
# appears not to break named arguments
sum(1:2,na.rm=TRUE)
[1] 3
I'm not sure, but maybe simply overwriting the assignment of = is enough for you. After all, `=` is a name like any other—almost.
> `=` <- function() { }
> a = 3
Error in a = 3 : unused argument(s) (a, 3)
> a <- 3
> data.frame(a = 3)
a
1 3
So any use of = for assignment will result in an error, whereas using it to name arguments remains valid. Its use in functions might go unnoticed unless the line in question actually gets executed.
The lint package (CRAN) has a style check for that, so assuming you have your code in a file, you can run lint against it and it will warn you about those line numbers with = assignments.
Here is a basic example:
temp <- tempfile()
write("foo = function(...) {
good <- 0
bad = 1
sum(..., na.rm = TRUE)
}", file = temp)
library(lint)
lint(file = temp, style = list(styles.assignment.noeq))
# Lint checking: C:\Users\flodel\AppData\Local\Temp\RtmpwF3pZ6\file19ac3b66b81
# Lint: Equal sign assignemnts: found on lines 1, 3
The lint package comes with a few more tests you may find interesting, including:
warns against right assignments
recommends spaces around =
recommends spaces after commas
recommends spaces between infixes (a.k.a. binary operators)
warns against tabs
possibility to warn against a max line width
warns against assignments inside function calls
You can turn on or off any of the pre-defined style checks and you can write your own. However the package is still in its infancy: it comes with a few bugs (https://github.com/halpo/lint) and the documentation is a bit hard to digest. The author is responsive though and slowly making improvements.
If you don't want to break existing code, something like this (printing a warning not an error) might work - you give the warning then assign to the parent.frame using <- (to avoid any recursion)
`=` <- function(...){
.what <- as.list(match.call())
.call <- sprintf('%s <- %s', deparse(.what[[2]]), deparse(.what[[3]]))
mess <- 'Use <- instead of = for assigment '
if(getOption('warn_assign', default = T)) {
stop (mess) } else {
warning(mess)
eval(parse(text =.call), envir = parent.frame())
}
}
If you set options(warn_assign = F), then = will warn and assign. Anything else will throw an error and not assign.
examples in use
# with no option set
z = 1
## Error in z = 1 : Use <- instead of = for assigment
options(warn_assign = T)
z = 1
## Error in z = 1 : Use <- instead of = for assigment
options(warn_assign = F)
z = 1
## Warning message:
## In z = 1 : Use <- instead of = for assigment
Better options
I think formatR or lint and code formatting are better approaches.

Resources