R - How to extract object names from expression - r

Given an rlang expression:
expr1 <- rlang::expr({
d <- a + b
})
How to retrieve the names of the objects refered to within the expression ?
> extractObjects(expr1)
[1] "d" "a" "b"
Better yet, how to retrieve the object names and categorise them by "required"(input) and "created"(output) ?
> extractObjects(expr1)
$created
[1] "d"
$required
[1] "a" "b"

The base function all.vars does this:
〉all.vars(expr1)
[1] "d" "a" "b"
Alternatively, you can use all.names to get all names in the expression rather than just those that aren’t used as calls or operators:
〉all.names(expr1)
[1] "{" "<-" "d" "+" "a" "b"
Don’t be misled: this result is correct! All of these appear in the expression, not just a, b and d.
But it may not be what you want.
In fact, I’m assuming what you want corresponds to the leaf tokens in the abstract syntax tree (AST) — in other words, everything except function calls (and operators, which are also function calls).
The syntax tree for your expression looks as follows:1
{
|
<-
/\
d +
/ \
a b
Getting this information means walking the AST:
leaf_nodes = function (expr) {
if(is.call(expr)) {
unlist(lapply(as.list(expr)[-1L], leaf_nodes))
} else {
as.character(expr)
}
}
〉leaf_nodes(expr1)
[1] "d" "a" "b"
Thanks to the AST representation we can also find inputs and outputs:
is_assignment = function (expr) {
is.call(expr) && as.character(expr[[1L]]) %in% c('=', '<-', '<<-', 'assign')
}
vars_in_assign = function (expr) {
if (is.call(expr) && identical(expr[[1L]], quote(`{`))) {
vars_in_assign(expr[[2L]])
} else if (is_assignment(expr)) {
list(created = all.vars(expr[[2L]]), required = all.vars(expr[[3L]]))
} else {
stop('Expression is not an assignment')
}
}
〉vars_in_assign(expr1)
$created
[1] "d"
$required
[1] "a" "b"
Note that this function does not handle complex assignments (i.e. stuff like d[x] <- a + b or f(d) <- a + b very well.
1 lobstr::ast shows the syntax tree differently, namely as
█─`{`
└─█─`<-`
├─d
└─█─`+`
├─a
└─b
… but the above representation is more conventional outside R, and I find it more intuitive.

Another solution is to extract the abstract symbolic tree explicitly:
getAST <- function(ee) purrr::map_if(as.list(ee), is.call, getAST)
str(getAST(expr1))
# List of 2
# $ : symbol {
# $ :List of 3
# ..$ : symbol <-
# ..$ : symbol d
# ..$ :List of 3
# .. ..$ : symbol +
# .. ..$ : symbol a
# .. ..$ : symbol b
Then traverse the AST to find the assignment(s):
extractObjects <- function(ast)
{
## Ensure that there is at least one node
if( length(ast) == 0 ) stop("Provide an AST")
## If we are working with the assigment
if( identical(ast[[1]], as.name("<-")) ) {
## Separate the LHS and RHS
list(created = as.character(ast[[2]]),
required = sapply(unlist(ast[[3]]), as.character))
} else {
## Otherwise recurse to find all assignments
rc <- purrr::map(ast[-1], extractObjects)
## If there was only one assignment, simplify reporting
if( length(rc) == 1 ) purrr::flatten(rc)
else rc
}
}
extractObjects( getAST(expr1) )
# $created
# [1] "d"
#
# $required
# [1] "+" "a" "b"
You may then filter math operators out, if needed.

This is an interesting one. I think that conceptually, it might not be clear in ALL possible expressions what exactly is input and output. If you look at the so called abstract syntax tree (AST), which you can visualize with lobstr::ast(), it looks like this.
So in simple cases when you always have LHS <- operations on RHS variables, if you iterate over the AST, you will always get the LST right after the <- operator. If you assign z <- rlang::expr(d <- a+b), then z behaves like a list and you can for example do the following:
z <- rlang::expr(d <- a+b)
for (i in 1:length(z)) {
if (is.symbol(z[[i]])) {
print(paste("Element", i, "of z:", z[[i]], "is of type", typeof(z[[i]])))
if (grepl("[[:alnum:]]", z[[i]])) {print(paste("Seems like", z[[i]], "is a variable"))}
} else {
for (j in 1:length(z[[i]])){
print(paste("Element", j, paste0("of z[[",i,"]]:"), z[[i]][[j]], "is of type", typeof(z[[i]][[j]])))
if (grepl("[[:alnum:]]", z[[i]][[j]])) {print(paste("Seems like", z[[i]][[j]], "is a variable"))}
}
}
}
#> [1] "Element 1 of z: <- is of type symbol"
#> [1] "Element 2 of z: d is of type symbol"
#> [1] "Seems like d is a variable"
#> [1] "Element 1 of z[[3]]: + is of type symbol"
#> [1] "Element 2 of z[[3]]: a is of type symbol"
#> [1] "Seems like a is a variable"
#> [1] "Element 3 of z[[3]]: b is of type symbol"
#> [1] "Seems like b is a variable"
Created on 2020-09-03 by the reprex package (v0.3.0)
As you can see these trees can quickly get complicated and nested. So in a simple case like in your example, assuming that variables are using alphanumeric representations, we can kind of identify what the "objects" (as you call them) are and what are operators (which don't match the [[:alnum:]] regex). As you can see the type cannot be used to distinguish between objects and operators since it is always symbol (btw z below is a language as is z[[3]] which is why we can condition on whether z[[i]] is a symbol or not and if not, dig a level deeper). You could then (at your peril) try to classify that the objects that appear immediately after a <- are "outputs" and the rest are "inputs" but I don't have too much confidence in this, especially for more complex expressions.
In short, this is all very speculative.

Related

strsplit(rquote, split = "")[[1]] in R

rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
This question has been asked before on this forum and has one answer on it but I couldn't understand anything from that answer, so here I am asking this question again.
In the above code what is the meaning of [[1]] ?
The program that I'm trying to run:
rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
rcount <- 0
for (char in chars) {
if (char == "r") {
rcount <- rcount + 1
}
if (char == "u") {
break
}
}
print(rcount)
When I don't use [[1]] I get the following warning message in for loop and I get a wrong output of 1 for rcount instead of 5:
Warning message: the condition has length > 1 and only the first element will be used
strsplit is vectorized. That means it splits each element of a vector into a vectors. To handle this vector of vectors it returns a list in which a slot (indexed by [[) corresponds to a element of the input vector.
If you use the function on a one element vector (single string as you do), you get a one-slot list. Using [[1]] right after strsplit() selects the first slot of the list - the anticipated vector.
Unfortunately, your list chars works in a for loop - you have one iteration with the one slot. In if you compare the vector of letters against "r" which throws the warning. Since the first element of the comparison is TRUE, the condition holds and rcount is rised by 1 = your result. Since you are not indexing the letters but the one phrase, the cycle stops there.
Maybe if you run something like strsplit(c("one", "two"), split="") , the outcome will be more straightforward.
> strsplit(c("one", "two"), split="")
[[1]]
[1] "o" "n" "e"
[[2]]
[1] "t" "w" "o"
> strsplit(c("one", "two"), split="")[[1]]
[1] "o" "n" "e"
> strsplit(c("one"), split="")[[1]][2]
[1] "n"
We'll start with the below as data, without [[1]]:
rquote <- "r's internals are irrefutably intriguing"
chars2 <- strsplit(rquote, split = "")
class(chars2)
[1] "list"
It is always good to have an estimate of your return value, your above '5'. We have both length and lengths.
length(chars2)
[1] 1 # our list
lengths(chars2)
[1] 40 # elements within our list
We'll use lengths in our for loop for counter, and, as you did, establish a receiver vector outside the loop,
rcount2 <- 0
for (i in 1:lengths(chars2)) {
if (chars2[[1]][i] == 'r') {
rcount2 <- rcount2 +1
}
if (chars2[[1]][i] == 'u') {
break
}
}
print(rcount2)
[1] 6
length(which(chars2[[1]] == 'r')) # as a check, and another way to estimate
[1] 6
Now supposing, rather than list, we have a character vector:
chars1 <- strsplit(rquote, split = '')[[1]]
length(chars1)
[1] 40
rcount1 <- 0
for(i in 1:length(chars1)) {
if(chars1[i] == 'r') {
rcount1 <- rcount1 +1
}
if (chars1[i] == 'u') {
break
}
}
print(rcount1)
[1] 5
length(which(chars1 == 'r'))
[1] 6
Hey, there's your '5'. What's going on here? Head scratch...
all.equal(chars1, unlist(chars2))
[1] TRUE
That break should just give us 5 'r' before a 'u' is encountered. What's happening when it's a list (or does that matter...?), how does the final r make it into rcount2?
And this is where the fun begins. Jeez. break for coffee and thinking. Runs okay. Usual morning hallucination. They come and go. But, as a final note, when you really want to torture yourself, put browser() inside your for loop and step thru.
Browse[1]> i
[1] 24
Browse[1]> n
debug at #7: break
Browse[1]> chars2[[1]][i] == 'u'
[1] TRUE
Browse[1]> n
> rcount2
[1] 5

Does R use applicative order reduction?

I came across this example from an R tutorial recently and I found this syntax really odd because it hints towards normal order reduction where arguments are wrapped / delayed. In applicative order reduction something like this should result in all the strings printing.
switch(grade,
"A" = print("Great"),
"B" = print("Good"),
"C" = print("Ok"),
"D" = print("Bad"),
"F" = print("Terrible"),
print("No Such Grade"))
Was wondering if anyone is privy and familiar with how R implements this?
Arguments to functions, including switch, are passed as promises which are forced, i.e. evaluated, only when actually used. See https://cran.r-project.org/doc/manuals/R-ints.html#Argument-evaluation
A promise has several parts. Its value slot is filled in the first time it is forced (i.e. accessed). Until then it just exists as unevaluated code and its environment as well as components which indicate that it has not been evaluated.
f does not force x:
library(pryr)
f <- function(x) promise_info(x)
f(3+pi)
giving:
$code
3 + pi
$env
<environment: R_GlobalEnv>
$evaled
[1] FALSE
$value
NULL
g forces x:
g <- function(x) { force(x); promise_info(x) }
g(3 + pi)
giving:
$code
3 + pi
$env
NULL
$evaled
[1] TRUE
$value
[1] 6.141593

R - Checking if a string is a valid mathematical expression using non-standard evaluation

I would like to check if the strings below are valid mathematical expressions:
s1 = 'sin(x)'
s2 = 'sin(x*m)'
s3 = 'sin'
s4 = 'sin(xm)'
By 'valid', I mean the expression is a combination of
operators (must be used in conjunction with variables or constants)
variables x and/or m
constants.
By this definition s1 and s2 are valid while s3 and s4 are not.
To identify if a string is valid, I wrote a function checkFxn that first attempts to convert the string into a call or one of its parts. If successful, it then recurses through the call-tree and checks for the above conditions. If the conditions are satisfied, then the call is returned as-is. If not, an error is thrown.
checkFxn <- function(x) {
lang <- str2lang(x)
checkFxn2 <- function(y) {
if(is.name(y)) {
stopifnot(deparse(y) %in% c('x', 'm'))
} else if(is.call(y)) {
stopifnot(is.function(eval(y[[1]])) | is.primitive(eval(y[[1]])))
lapply(y[-1], checkFxn2)
} else {
stopifnot(is.logical(y) | is.numeric(y) | is.complex(y))
}
return(y)
}
checkFxn2(lang)
}
#Applying checkFxn to s1-4
lapply(list(s1,s2,s3,s4), function(x) {try(checkFxn(x), silent = T)})
[[1]]
sin(x)
[[2]]
sin(x * m)
[[3]]
[1] "Error in checkFxn2(lang) : deparse(y) %in% c(\"x\", \"m\") is not TRUE\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in checkFxn2(lang): deparse(y) %in% c("x", "m") is not TRUE>
[[4]]
[1] "Error in FUN(X[[i]], ...) : deparse(y) %in% c(\"x\", \"m\") is not TRUE\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in FUN(X[[i]], ...): deparse(y) %in% c("x", "m") is not TRUE>
It seems to work as expected but I'm wary of my use of eval and was wondering if someone could suggest an alternative to using it? I know that it follows the usual lexical scoping rules, so I'm worried about it evaluating variables in the gobal environment - is there a way to restrict its scope? I've read the chapter on non-standard evaluation but I can't figure it out.
Also, is there a way to identify if a base function or primitive is a mathematical operator? I would like to use something more specific than is.function and is.primitive.
Step 1: Decide what constitutes a "mathematical operator". One option is to retrieve relevant groups from the S4 generics. For example,
mathOps <- unlist(lapply( c("Arith","Compare","Math"), getGroupMembers ))
# [1] "+" "-" "*" "^" "%%" "%/%"
# [7] "/" "==" ">" "<" "!=" "<="
# [13] ">=" "abs" "sign" "sqrt" "ceiling" "floor"
# [19] "trunc" "cummax" "cummin" "cumprod" "cumsum" "exp"
# [25] "expm1" "log" "log10" "log2" "log1p" "cos"
# [31] "cosh" "sin" "sinh" "tan" "tanh" "acos"
# [37] "acosh" "asin" "asinh" "atan" "atanh" "cospi"
# [43] "sinpi" "tanpi" "gamma" "lgamma" "digamma" "trigamma"
Step 2: Decompose your expressions into abstract syntax trees.
getAST <- function( ee )
lapply( as.list(ee), function(x) `if`(is.call(x), getAST(x), x) )
# Example usage
getAST( quote(sin(x+5)) )
# [[1]]
# sin
#
# [[2]]
# [[2]][[1]]
# `+`
#
# [[2]][[2]]
# x
#
# [[2]][[3]]
# [1] 5
Step 3: Traverse the ASTs based on your definition of "validity"
checkFxn <- function( ast, validOps )
{
## Terminal nodes of an AST will not be lists
## Wrap them into a list of length 1 to keep the recursion flow
if( !is.list(ast) ) ast <- list(ast)
## Operators must be called with one or more arguments
if( as.character(ast[[1]]) %in% validOps )
return( `if`(length(ast) < 2, FALSE,
all(sapply(ast[-1], checkFxn, validOps))) )
## Variables x and m are OK
if( identical(ast[[1]], quote(x)) || identical(ast[[1]], quote(m)) )
return(TRUE)
## Constants are OK
if( is.numeric(ast[[1]]) ) return(TRUE)
## Everything else is invalid
FALSE
}
Putting it all together
exprs <- lapply( list(s1,s2,s3,s4), str2lang ) # Convert strings to expressions
asts <- lapply( exprs, getAST ) # Build ASTs
sapply( asts, checkFxn, mathOps ) # Evaluate validity
# [1] TRUE TRUE FALSE FALSE
Alternative to ASTs
As pointed out by #Moody_Mudskipper, one can also use all.names to retrieve the list of symbols occurring inside an arbitrary expression. While this doesn't preserve the relative structure of those symbols, the names can be compared directly against mathOps.

Get the argument names of an R function

For an arbitrary function
f <- function(x, y = 3){
z <- x + y
z^2
}
I want to be able take the argument names of f
> argument_names(f)
[1] "x" "y"
Is this possible?
formalArgs and formals are two functions that would be useful in this case. If you just want the parameter names then formalArgs will be more useful as it just gives the names and ignores any defaults. formals gives a list as the output and provides the parameter name as the name of the element in the list and the default as the value of the element.
f <- function(x, y = 3){
z <- x + y
z^2
}
> formalArgs(f)
[1] "x" "y"
> formals(f)
$x
$y
[1] 3
My first inclination was to just suggest formals and if you just wanted the names of the parameters you could use names like names(formals(f)). The formalArgs function just is a wrapper that does that for you so either way works.
Edit: Note that technically primitive functions don't have "formals" so this method will return NULL if used on primitives. A way around that is to first wrap the function in args before passing to formalArgs. This works regardless of it the function is primitive or not.
> # formalArgs will work for non-primitives but not primitives
> formalArgs(f)
[1] "x" "y"
> formalArgs(sum)
NULL
> # But wrapping the function in args first will work in either case
> formalArgs(args(f))
[1] "x" "y"
> formalArgs(args(sum))
[1] "..." "na.rm"

What/Where are the attributes of a function object?

By playing around with a function in R, I found out there are more aspects to it than meets the eye.
Consider ths simple function assignment, typed directly in the console:
f <- function(x)x^2
The usual "attributes" of f, in a broad sense, are (i) the list of formal arguments, (ii) the body expression and (iii) the environment that will be the enclosure of the function evaluation frame. They are accessible via:
> formals(f)
$x
> body(f)
x^2
> environment(f)
<environment: R_GlobalEnv>
Moreover, str returns more info attached to f:
> str(f)
function (x)
- attr(*, "srcref")=Class 'srcref' atomic [1:8] 1 6 1 19 6 19 1 1
.. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x00000000145a3cc8>
Let's try to reach them:
> attributes(f)
$srcref
function(x)x^2
This is being printed as a text, but it's stored as a numeric vector:
> c(attributes(f)$srcref)
[1] 1 6 1 19 6 19 1 1
And this object also has its own attributes:
> attributes(attributes(f)$srcref)
$srcfile
$class
[1] "srcref"
The first one is an environment, with 3 internal objects:
> mode(attributes(attributes(f)$srcref)$srcfile)
[1] "environment"
> ls(attributes(attributes(f)$srcref)$srcfile)
[1] "filename" "fixedNewlines" "lines"
> attributes(attributes(f)$srcref)$srcfile$filename
[1] ""
> attributes(attributes(f)$srcref)$srcfile$fixedNewlines
[1] TRUE
> attributes(attributes(f)$srcref)$srcfile$lines
[1] "f <- function(x)x^2" ""
There you are! This is the string used by R to print attributes(f)$srcref.
So the questions are:
Are there any other objects linked to f? If so, how to reach them?
If we strip f of its attributes, using attributes(f) <- NULL, it doesn't seem to affect the function. Are there any drawbacks of doing this?
As far as I know, srcref is the only attribute typically attached to S3 functions. (S4 functions are a different matter, and I wouldn't recommend messing with their sometimes numerous attributes).
The srcref attribute is used for things like enabling printing of comments included in a function's source code, and (for functions that have been sourced in from a file) for setting breakpoints by line number, using utils::findLineNum() and utils::setBreakpoint().
If you don't want your functions to carry such additional baggage, you can turn off recording of srcref by doing options(keep.source=FALSE). From ?options (which also documents the related keep.source.pkgs option):
‘keep.source’: When ‘TRUE’, the source code for functions (newly
defined or loaded) is stored internally allowing comments to
be kept in the right places. Retrieve the source by printing
or using ‘deparse(fn, control = "useSource")’.
Compare:
options(keep.source=TRUE)
f1 <- function(x) {
## This function is needlessly commented
x
}
options(keep.source=FALSE)
f2 <- function(x) {
## This one is too
x
}
length(attributes(f1))
# [1] 1
f1
# function(x) {
# ## This function is needlessly commented
# x
# }
length(attributes(f2))
# [1] 0
f2
# function (x)
# {
# x
# }
I jst figured out an attribute that compiled functions (package compiler) have that is not available with attributes or str. It's the bytecode.
Example:
require(compiler)
f <- function(x){ y <- 0; for(i in 1:length(x)) y <- y + x[i]; y }
g <- cmpfun(f)
The result is:
> print(f, useSource=FALSE)
function (x)
{
y <- 0
for (i in 1:length(x)) y <- y + x[i]
y
}
> print(g, useSource=FALSE)
function (x)
{
y <- 0
for (i in 1:length(x)) y <- y + x[i]
y
}
<bytecode: 0x0000000010eb29e0>
However, this doesn't show with normal commands:
> identical(f, g)
[1] TRUE
> identical(f, g, ignore.bytecode=FALSE)
[1] FALSE
> identical(body(f), body(g), ignore.bytecode=FALSE)
[1] TRUE
> identical(attributes(f), attributes(g), ignore.bytecode=FALSE)
[1] TRUE
It seems to be accessible only via .Internal(bodyCode(...)):
> .Internal(bodyCode(f))
{
y <- 0
for (i in 1:length(x)) y <- y + x[i]
y
}
> .Internal(bodyCode(g))
<bytecode: 0x0000000010eb29e0>

Resources