When does R evaluate operands to ~?

When does R evaluate the formula argument passed into lm()? For example:
tmp <- rnorm(1000)
y <- rnorm(1000) + tmp
df = data.frame(x=tmp)
lm(y~x, data=df)$coefficients
returns
(Intercept) x
0.01713098 0.98073687
This suggests that y~x is evaluated after control is passed to lm() because there's no x variable in the calling context. However, it also suggests it is evaluated before control is passed to lm() because there's no y column in df.
In most languages the arguments are fully evaluated before being passed down the stack. I'm probably missing something subtle here. Any help would be appreciated.

The formula indicator ~ is really just a function that captures unevaluated symbols. It's a lot like quote() where you can run quote(f(x)+1) and get back the unevaluated R expression for the syntax you pass in. The main difference is that ~ allows you to have two sets of unevaluated expressions, and that the formula will keep track of the environment where it was created. So these are the same
a ~ b
`~`(a, b)
So you can pass any valid R syntax into a formula; it doesn't have to be a variable name. The values are stored as unevaluated symbols or calls. So the formula call itself is evaluated, but its parameters are not.
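For example, the pieces of a formula can be inspected directly. This is just a small illustration at the prompt:
f <- log(y) ~ x + I(x^2)
class(f)         # "formula"
f[[1]]           # `~`
f[[2]]           # log(y)    -- an unevaluated call
f[[3]]           # x + I(x^2)
environment(f)   # the environment where the formula was created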
Other functions can use these formulas however they like. The lm() function for example will pass the formula to model.matrix. Then model.matrix will attempt to evaluate the variables found in the formula in the data.frame that was passed to the function, and, if not present there, the environment where the formula was created. Different functions may choose to evaluate formulas in different ways so it's really not possible to say with certainty when the evaluation will happen.
It usually happens something like this:
myformula <- a~b
b <- 4
dd <- data.frame(a=1:3)
eval(myformula[[2]], dd, environment(myformula)) + # a
eval(myformula[[3]], dd, environment(myformula)) # b
# [1] 5 6 7
The symbols are looked up in the environment/list passed to the evaluator.
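In the question's lm() example the lookup happens along these lines: model.frame() is the helper lm() uses to collect the variables, finding x in the data frame and y in the environment attached to the formula (here, the global environment). A small sketch with a shorter vector:
tmp <- rnorm(10)
y <- rnorm(10) + tmp
df <- data.frame(x = tmp)
mf <- model.frame(y ~ x, data = df)
head(mf, 3)   # a data frame with columns y and x, ready for model.matrix()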
The environment part is also relevant because you can have situations like this:
b <- 10
foo <- function(myformula = a~b) {
b <- 4
dd <- data.frame(a=1:3)
eval(myformula[[2]], dd, environment(myformula)) +
eval(myformula[[3]], dd, environment(myformula))
}
foo()
# [1] 5 6 7
foo(a~b)
# [1] 11 12 13
If you call foo without a parameter, the default formula is created inside the function body, so it will look for the value of b there. But if you call foo with a formula, that formula is created in the calling environment, so R will look for the value of b outside the function. Again, this is just the convention followed by most base modeling functions; non-base R packages may choose to do something else.

Related

What is the calling function in R?

One of the most important things to know about the evaluation of arguments to a function is that supplied arguments and default arguments are treated differently. The supplied arguments to a function are evaluated in the evaluation frame of the calling function. The default arguments to a function are evaluated in the evaluation frame of the function.
I don't quite understand what is meant by calling function. Is it the function that is invoked (as when, in an interactive session, you type the name of a function and hit enter)? If so, how does the evaluation frame of the calling function differ from the evaluation frame of the function?
First change to standard terms. The arguments that are used in the function definition are the formal arguments and the arguments that are passed to the function when calling it are the actual arguments. (The quoted passage in the question is referring to the actual arguments when it uses the nonstandard term, supplied arguments.)
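As a small illustration of the terminology (h is just a throwaway example):
h <- function(x, y = 2) x + y   # x and y are the formal arguments; y has a default
h(10)                           # 10 is the actual (supplied) argument for x
## [1] 12
formals(h)                      # lists the formal arguments and their defaults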
Consider two cases via example.
Case 1
Below, f has the formal argument x, and when f is called in the last line of code there are no actual arguments, so the default x = a is used. x gets the value 2 because x is not evaluated until it is used, and when it is used, a is looked up within the function, where it has the value 2, not in the caller, where it has the value 1.
a <- 1
f <- function(x = a) {
a <- 2
x
}
f()
## [1] 2
Case 2
On the other hand, the actual arguments are evaluated in the caller. In the last line of code below, x is set to 1 because that is the value of b in the caller. Again, x is not evaluated until it is used, but even though b has been set to 2 inside the function, this has no effect on x: x is 1, not 2.
b <- 1
g <- function(x) { b <- 2; x + b }
g(b)
## [1] 3
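The two cases can also be combined in a single call. This is just a sketch; it assumes a is defined only as below:
a <- 1
h <- function(x = a, y) {
a <- 2
c(x, y)
}
h(y = a)   # default x is evaluated inside h (a is 2); supplied y is evaluated in the caller (a is 1)
## [1] 2 1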
Other
Although this covers the two cases in the quote, note that there is another case: a variable that is referred to in a function but is not defined in it. In the code below, a is a free variable in g, since a is neither an argument of g nor otherwise defined in g. In this case, when gg (which equals g) is called, R tries to look up a inside g and fails; the next place it looks is not the caller (where a is 1) but the environment in which the function was defined, i.e. the environment where the word function appears, and a is 2 in that environment.
a <- 1
f <- function() {
a <- 2
g <- function() a
}
gg <- f()
gg()
## [1] 2
This is referred to as lexical scoping since one can tell where the free variables are looked up by simply looking at the function definitions.
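To see that the caller really is skipped, note that even a caller which defines its own a does not change the result. This sketch reuses gg from above:
h <- function() {
a <- 100   # the caller of gg has its own a
gg()
}
h()
## [1] 2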

What's the difference between substitute and quote in R

In the official docs, it says:
substitute returns the parse tree for the (unevaluated) expression
expr, substituting any variables bound in env.
quote simply returns its argument. The argument is not evaluated and
can be any R expression.
But when I try:
> x <- 1
> substitute(x)
x
> quote(x)
x
It looks like both quote and substitute return the expression that's passed as an argument to them.
So my question is, what's the difference between substitute and quote, and what does it mean to "substituting any variables bound in env"?
Here's an example that may help you to easily see the difference between quote() and substitute(), in one of the settings (processing function arguments) where substitute() is most commonly used:
f <- function(argX) {
list(quote(argX),
substitute(argX),
argX)
}
suppliedArgX <- 100
f(argX = suppliedArgX)
# [[1]]
# argX
#
# [[2]]
# suppliedArgX
#
# [[3]]
# [1] 100
R has lazy evaluation, so the identity of a variable name token is a little less clear than in other languages. This is used in libraries like dplyr where you can write, for instance:
summarise(mtcars, total_cyl = sum(cyl))
We can ask what each of these tokens means: summarise and sum are defined functions, mtcars is a defined data frame, total_cyl is a keyword argument for the function summarise. But what is cyl?
> cyl
Error: object 'cyl' not found
It isn't anything! Well, not yet. R doesn't evaluate it right away; it treats it as an expression to be evaluated later, in a context different from the global environment your command line is working in, specifically one where the columns of mtcars are defined. Somewhere in the guts of dplyr, something like this is happening:
> substitute(cyl, mtcars)
[1] 6 6 4 6 8 ...
Suddenly cyl means something. That's what substitute is for.
So what is quote for? Well sometimes you want your lazily-evaluated expression to be represented somewhere else before it's evaluated, i.e. you want to display the actual code you're writing without any (or only some) values substituted. The docs you quoted explain this is common for "informative labels for data sets and plots".
So, for example, you could create a quoted expression, and then both print the unevaluated expression (say, as a label on a chart, to show how the value was calculated) and actually carry out the calculation by evaluating the expression.
expr <- quote(x + y)
print(expr) # x + y
eval(expr, list(x = 1, y = 2)) # 3
Note that substitute can do this expression trick also while giving you the option to parse only part of it. So its features are a superset of quote.
expr <- substitute(x + y, list(x = 1))
print(expr) # 1 + y
eval(expr, list(y = 2)) # 3
Maybe this section of the documentation will help somewhat:
Substitution takes place by examining each component of the parse tree
as follows: If it is not a bound symbol in env, it is unchanged. If it
is a promise object, i.e., a formal argument to a function or
explicitly created using delayedAssign(), the expression slot of the
promise replaces the symbol. If it is an ordinary variable, its value
is substituted, unless env is .GlobalEnv in which case the symbol is
left unchanged.
Note the final bit, and consider this example:
e <- new.env()
assign(x = "a", value = 1, envir = e)
> substitute(a, env = e)
[1] 1
Compare that with:
> quote(a)
a
So there are two basic situations when the substitution will occur: when we're using it on an argument of a function, and when env is some environment other than .GlobalEnv. So that's why your particular example was confusing.
For another comparison with quote, consider modifying the myplot function in the examples section to be:
myplot <- function(x, y)
plot(x, y, xlab = deparse(quote(x)),
ylab = deparse(quote(y)))
and you'll see that quote really doesn't do any substitution.
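For contrast, here is the same function sketched with substitute(), so the labels show the expressions the caller supplied rather than the literal names x and y:
myplot2 <- function(x, y)
plot(x, y, xlab = deparse(substitute(x)),
ylab = deparse(substitute(y)))
myplot2(mtcars$wt, mtcars$mpg)   # axis labels read "mtcars$wt" and "mtcars$mpg"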
Regarding your question why GlobalEnv is treated as an exception for substitute, it is just a heritage of S. From The R language definition (https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Substitutions):
The special exception for substituting at the top level is admittedly peculiar. It has been inherited from S and the rationale is most likely that there is no control over which variables might be bound at that level so that it would be better to just make substitute act as quote.
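A quick way to see the top-level exception in action, just typed at the prompt:
a <- 1
substitute(a)            # a      -- at top level substitute() acts like quote()
f <- function(a) substitute(a)
f(1 + 2)                 # 1 + 2  -- inside a function the promise's expression is substituted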

What are the advantages of using with() vs. calling vectors?

I am curious if there are any advantages of using with() rather than calling the vector name (aside from using fewer keystrokes)?
For example, is with(d,x1) always equivalent to d$x1?
where d is
structure(list(x1 = c(-1.96300839219158, -1.7799470435444, -0.247433477421076,
-0.333402872895705, -1.37145403620246, -0.23484024054114, -0.808080155419075,
-0.359895157796401, 0.54316873679816, -0.687429214935226), x2 = c(-0.619089899920824,
-0.0716448494478719, -0.136643798928645, 2.58777656543295, 0.758900617148999,
0.687980864291582, 0.442931351818574, -0.734342463692198, 2.55862689249189,
1.30677108261702)), .Names = c("x1", "x2"), row.names = c(NA,
-10L), class = "data.frame")
If you're just referencing an item in a list, e.g. a column in a data frame, then d$x1 and with(d, x1) will both return x1 from d. However, on its own the latter is rather unusual, as that's not really the intended purpose of with(); extracting a value from a list is what $ is for.
The advantage of using with() is to evaluate expressions in the context of a single environment, without worrying about global variables or attached data frames making references to variables ambiguous.
The $ syntax does not support expressions, so to perform a calculation involving multiple variables in a data frame, you would need to use d$x1, d$x2, etc. which is inconvenient. But for otherwise simply extracting an item from a list, $ is preferred.
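For example, the intended use of with() with the d from the question would be something like:
with(d, x1 + 2 * x2)   # evaluate an expression involving several columns of d
# the same thing written with $:
d$x1 + 2 * d$x2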
A notable case in which the two methods are not equivalent is as follows. Suppose d is defined as
d <- data.frame(x1=c(1, 2, 3))
Now define y <- "x1". What happens when we try to reference x1 using y?
> d$y
NULL
> with(d, y)
[1] "x1"
> d[, y]
[1] 1 2 3
d$y returns NULL since there is no column y in d, so there's nothing to extract.
And since there's no column y in d, with(d, y) falls back to the enclosing environment, which in this case is the global environment. So y is evaluated in the global environment and "x1" is returned. Even though there's nothing to extract, there is something to evaluate: y does exist, just not in d.
Now d[, y] gets us what we want. This first evaluates y, which turns this into d[, "x1"], which is the correct syntax for extracting x1 from d using another variable.
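For completeness, [[ is another extraction method that evaluates its argument, so it also works with a column name stored in a variable:
d[[y]]
# [1] 1 2 3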
Some finer detail courtesy of David Arenburg:
Note that with() is actually a generic function that performs method dispatch, whereas $ is a primitive. An inspection of base:::with.default is illuminating:
function(data, expr, ...)
eval(substitute(expr), data, enclos = parent.frame())
This serves to confirm that with() is for evaluation.
Since $ is a primitive, it calls .Primitive("$"), which means that it calls an entry point in compiled internal code. Doing a bit of hunting shows that $ goes to an entry point called do_subset3 in subset.c. The comment immediately preceding that piece of C code is equally illuminating:
/* The $ subset operator.
We need to be sure to only evaluate the first argument.
The second will be a symbol that needs to be matched, not evaluated.
*/
This serves to confirm that $ is for extraction, not evaluation.
So in short, as David put it so well in a comment, with() and $ have different purposes which in certain circumstances can overlap.

parameter passing mechanism in R

The following function is used to multiply a sequence 1:x by y
f1<-function(x,y){return (lapply(1:x, function(a,b) b*a, b=y))}
Looks like a is used to represent the element in the sequence 1:x, but I do not know how to understand this parameter passing mechanism. In other OO languages, like Java or C++, there is call by reference or call by value.
Short answer: R is call by value. Long answer: it can do both.
Call By Value, Lazy Evaluation, and Scoping
You'll want to read through: the R language definition for more details.
R mostly uses call by value but this is complicated by its lazy evaluation:
So you can have a function:
f <- function(x, y) {
x * 3
}
If you pass in two big matrixes to x and y, only x will be copied into the callee environment of f, because y is never used.
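Because y is never used, even an argument that would error if evaluated is harmless. Continuing with the f defined above:
f(2, stop("never evaluated"))   # the promise for y is never forced, so no error
## [1] 6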
But you can also access variables in parent environments of f:
y <- 5
f <- function(x) {
x * y
}
f(3) # 15
Or even:
y <- 5
f <- function() {
x <- 3
g <- function() {
x * y
}
}
f() # returns function g()
f()() # returns 15
Call By Reference
There are two ways for doing call by reference in R that I know of.
One is by using Reference Classes, one of the three object oriented paradigms of R (see also: Advanced R programming: Object Oriented Field Guide)
The other is to use the bigmemory and bigmatrix packages (see The bigmemory project). This allows you to create matrices in memory (underlying data is stored in C), returning a pointer to the R session. This allows you to do fun things like accessing the same matrix from multiple R sessions.
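As a minimal sketch of reference semantics using Reference Classes (the class name Counter and the function bump are made up for illustration):
Counter <- setRefClass("Counter", fields = list(n = "numeric"))
bump <- function(cnt) cnt$n <- cnt$n + 1   # modifies the object the caller passed in
x <- Counter$new(n = 0)
bump(x)
x$n   # the caller sees the modification
## [1] 1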
To multiply a vector x by a constant y just do
x * y
The (some prefix)apply functions works very similar to each other, you want to map a function to every element of your vector, list, matrix and so on:
x = 1:10
x.squared = sapply(x, function(elem)elem * elem)
print(x.squared)
[1] 1 4 9 16 25 36 49 64 81 100
It gets better with matrices and data frames because you can now apply a function over all rows or columns, and collect the output. Like this:
m = matrix(1:9, ncol = 3)
# The 1 below means apply over rows, 2 would mean apply over cols
row.sums = apply(m, 1, function(some.row) sum(some.row))
print(row.sums)
[1] 12 15 18
If you're looking for a simple way to multiply a sequence by a constant, definitely use @Fernando's answer or something similar. I'm assuming you're just trying to determine how parameters are being passed in this code.
lapply calls its second argument (in your case function(a, b) b*a) with each of the values of its first argument 1, 2, ..., x. Those values will be passed as the first parameter to the second argument (so, in your case, they will be argument a).
Any additional parameters to lapply after the first two, in your case b=y, are passed to the function by name. So if you called your inner function fxn and y were 4, your invocation of lapply would make calls like fxn(1, b=4), fxn(2, b=4), and so on. The parameters are passed by value.
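Making those calls explicit (assuming y is 4, just for illustration):
y <- 4
fxn <- function(a, b) b * a
lapply(1:3, fxn, b = y)   # calls fxn(1, b = 4), fxn(2, b = 4), fxn(3, b = 4)
# [[1]]
# [1] 4
#
# [[2]]
# [1] 8
#
# [[3]]
# [1] 12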
You should read the help page for lapply to understand how it works. Read this excellent answer to get a good explanation of the different *apply family functions.
From the help of lapply:
lapply(X, FUN, ...)
Here FUN is applied to each element of X, and ... refers to:
... optional arguments to FUN.
Since FUN has an optional argument b, we replace the ... with b=y.
You can see it as syntactic sugar, and it emphasizes that argument b is optional compared to argument a. If the two arguments are symmetric, maybe it is better to use mapply.
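For comparison, a small mapply() sketch where the two arguments are treated symmetrically (the shorter one is recycled):
mapply(function(a, b) a * b, 1:3, 4)
# [1] 4 8 12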

Why are arguments to replacement functions not evaluated lazily?

Consider the following simple function:
f <- function(x, value){print(x);print(substitute(value))}
Argument x will eventually be evaluated by print, but value never will. So we can get results like this:
> f(a, a)
Error in print(x) : object 'a' not found
> f(3, a)
[1] 3
a
> f(1+1, 1+1)
[1] 2
1 + 1
> f(1+1, 1+"one")
[1] 2
1 + "one"
Everything as expected.
Now consider the same function body in a replacement function:
`g<-` <- function(x, value){print(x); print(substitute(value))}
Let's try it:
> x <- 3
> g(x) <- 4
[1] 3
[1] 4
Nothing unusual so far...
> g(x) <- a
Error: object 'a' not found
This is unexpected. The name a should have been printed as a language object.
> g(x) <- 1+1
[1] 4
1 + 1
This is ok, as x's former value is 4. Notice the expression was passed unevaluated.
The final test:
> g(x) <- 1+"one"
Error in 1 + "one" : non-numeric argument to binary operator
Wait a minute... Why did it try to evaluate this expression?
Well the question is: bug or feature? What is going on here? I hope some guru users will shed some light about promises and lazy evaluation on R. Or we may just conclude it's a bug.
We can reduce the problem to a slightly simpler example:
g <- function(x, value) x
`g<-` <- function(x, value) x
x <- 3
# Works
g(x, a)
`g<-`(x, a)
# Fails
g(x) <- a
This suggests that R is doing something special when evaluating a replacement function: I suspect it evaluates all arguments. I'm not sure why, but the comments in the C code (https://github.com/wch/r-source/blob/trunk/src/main/eval.c#L1656 and https://github.com/wch/r-source/blob/trunk/src/main/eval.c#L1181) suggest it may be to make sure other intermediate variables are not accidentally modified.
Luke Tierney has a long comment about the drawbacks of the current approach, and illustrates some of the more complicated ways replacement functions can be used:
There are two issues with the approach here:
A complex assignment within a complex assignment, like
f(x, y[] <- 1) <- 3, can cause the value temporary
variable for the outer assignment to be overwritten and
then removed by the inner one. This could be addressed by
using multiple temporaries or using a promise for this
variable as is done for the RHS. Printing of the
replacement function call in error messages might then need
to be adjusted.
With assignments of the form f(g(x, z), y) <- w the value
of z will be computed twice, once for a call to g(x, z)
and once for the call to the replacement function g<-. It
might be possible to address this by using promises.
Using more temporaries would not work as it would mess up
replacement functions that use substitute and/or
nonstandard evaluation (and there are packages that do
that -- igraph is one).
I think the key may be found in this comment beginning at line 1682 of "eval.c" (and immediately followed by the evaluation of the assignment operation's RHS):
/* It's important that the rhs get evaluated first because
assignment is right associative i.e. a <- b <- c is parsed as
a <- (b <- c). */
PROTECT(saverhs = rhs = eval(CADR(args), rho));
We expect that if we do g(x) <- a <- b <- 4 + 5, both a and b will be assigned the value 9; this is in fact what happens.
Apparently, the way that R ensures this consistent behavior is to always evaluate the RHS of an assignment first, before carrying out the rest of the assignment. If that evaluation fails (as when you try something like g(x) <- 1 + "a"), an error is thrown and no assignment takes place.
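A quick check of the chained-assignment claim, using a made-up replacement function h<- that simply returns its first argument unchanged:
`h<-` <- function(x, value) x
x <- 3
h(x) <- a <- b <- 4 + 5   # the RHS is evaluated first, assigning 9 to b and then to a
c(a, b, x)
## [1] 9 9 3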
I'm going to go out on a limb here, so please, folks with more knowledge feel free to comment/edit.
Note that when you run
`g<-` <- function(x, value){print(x); print(substitute(value))}
x <- 1
g(x) <- 5
a side effect is that 5 is assigned to x. Hence, both must be evaluated. But if you then run
`g<-`(x, 10)
both the values of x and 10 are printed, but the value of x remains the same.
Speculation:
So the evaluator is distinguishing between whether you call g<- in the course of making an actual assignment, and when you simply call g<- directly.

Resources