Modifying R variable assigned to a literal through C leading to strange behavior - r

Does anyone know what is going on with the following R code?
library(inline)
increment <- cfunction(c(a = "integer"), "
INTEGER(a)[0]++;
return R_NilValue;
")
print_one = function(){
one = 0L
increment(one)
print(one)
}
print_one() # prints 1
print_one() # prints 2
The printed results are 1, 2. Replacing one = 0L with one = integer(1) gives result 1, 1.
Just to clarify. I know I'm passing the variable one by reference to the C function, so its value should change (become 1). What I don't understand is that, how come resetting one = 0L seem to have no effect at all after the first call to print_one (the second call to print_one prints 2 instead of 1).

This really (as the last comment hinted) have little to do with Rcpp. It's really mostly about the .C() interface used by the old inline package and here by cfunction approach in the question.
That leads to two answers.
First, the consensus among R developers is that .C() is deprecated and should no longer be used. Statements to that effect can be found on the r-devel and r-package-devel lists. .C() uses plain old types as pointers in the interface so here the integer value is passed as int* by reference and can be altered.
If we switch to Rcpp uses, and hence to the underlying .Call() interface using only SEXP types for input and output, then an int no passes by reference. So the code behaves and prints only 0:
Rcpp::cppFunction("void increment2(int a) { a++; }")
print_two <- function(){
two <- 0L
increment2(two)
print(two)
}
print_two() # prints 0
print_two() # prints 0
Lastly, Rcpp (capital R) is of course not the "sucessor" to inline (as it does a whole lot more than inline but it among all its functionality is (since around 2013) a quasi-replacement for inline in Rcpp Attributes. So with Rcpp 'as-is' since about 2013 you no longer need the examples and approach from inline.

Related

why does as.vector deep copy a matrix?

Using top, I manually measured the following memory usages at the specific points designated in the comments of the following code block:
x <- matrix(rnorm(1e9),nrow=1e4)
#~15gb
gc()
# ~7gb after gc()
y <- as.vector(x)
gc()
#~15gb after gc()
It's pretty clear that rnorm(1e9) is a ~7gb vector that's then copied to create the matrix. gc() removes the original vector since it's not assigned to anything. as.vector(x) then coerces and copies the data to vector.
My question is, why can't these three objects all point to the same memory block (at least until one is modified)? Isn't a matrix really just a vector with some additional metadata?
This is in R version 3.6.2
edit: also tested in 4.0.3, same results.
The question you're asking is to the reasoning. That seems more suited for R-devel, and I am assuming the answer in return is "no one knows". The relevant function from R-source is the do_asvector function.
Going down the source code of a call to as.vector(matrix(...)), it is important to note that the default argument for mode is any. This translates to ANYSXP (see R internals). This lets us find the evil culprit (line 1524) of the copy-behaviour.
// source reference: do_asvector
...
if(type == ANYSXP || TYPEOF(x) == type) {
switch(TYPEOF(x)) {
case LGLSXP:
case INTSXP:
case REALSXP:
case CPLXSXP:
case STRSXP:
case RAWSXP:
if(ATTRIB(x) == R_NilValue) return x;
ans = MAYBE_REFERENCED(x) ? duplicate(x) : x; // <== evil culprit
CLEAR_ATTRIB(ans);
return ans;
case EXPRSXP:
case VECSXP:
return x;
default:
;
}
...
Going one step further, we can find the definition for MAYBE_REFERENCED in src/include/Rinternals.h, and by digging a bit we can find that it checks whether sxpinfo.named is equal to 0 (false) or not (true). What I am guessing here is that the assignment operator <- increments the sxpinfo.named counter and thus MAYBE_REFERENCED(x) returns TRUE and we get a duplicate (deep copy).
However, Is this behaviour necessary?
That is a great question. If we had given an argument to mode other than any or class(x) (same as our input class), we skip the duplicate line, and we continue down the function, until we hit a ascommon. So I dug a bit extra and took a look at the source code for ascommon, we can see that if we were to try and convert to list manually (setting mode = "list"), ascommon only calls shallowDuplicate.
// Source reference: ascommon
---
if ((type == LISTSXP) &&
!(TYPEOF(u) == LANGSXP || TYPEOF(u) == LISTSXP ||
TYPEOF(u) == EXPRSXP || TYPEOF(u) == VECSXP)) {
if (MAYBE_REFERENCED(v)) v = shallow_duplicate(v); // <=== ascommon duplication behaviour
CLEAR_ATTRIB(v);
}
return v;
}
---
So one could imagine that the call to duplicate in do_asvector could be replaced by a call to shallow_duplicate. Perhaps a "better safe than sorry" strategy was chosen when the code was originally implemented (prior to R-2.13.0 according to a comment in the source code), or perhaps there is a scenario in one of the types not handled by ascommon that requires a deep-copy.
For now I would test if the function does a deep-copy if we set mode='list' or pass the list without assignment. In either case it might not be a bad idea to send a follow-up question to the R-devel mailing list.
Edit: <- behaviour
I took the liberty to confirm my suspicion, and looked at the source code for <-. I previously stated that I assumed that <- incremented sxpinfo.named, and we can confirm this by looking at do_set (the c source code for <-). When assigning as x <- ... x is a SYMSXP, and this we can see that the source code calls INCREMENT_NAMED which in turn calls SET_NAMED(x, NAMED(X) + 1). So everything else equal we should see a copy behaviour for x <- matrix(...); y <- as.vector(x) while we shouldn't for y <- as.vector(matrix(...)).
At the final gc(), you have x pointing to a vector with a dim attribute, and y pointing to a vector without any dim attribute. The data is an intrinsic part of the object, it's not an attribute, so those two vectors have to be different.
If matrices had been implemented as lists, e.g.
x <- list(data = rnorm(1e9), dim = c(1e4, 1e5))
then a shallow copy would be possible, but that's not how it was done. You can read the details of the internal structure of objects in the R Internals manual. For the current release, that's here: https://cloud.r-project.org/doc/manuals/r-release/R-ints.html#SEXPs .
You may wonder why things were implemented this way. I suspect it's intended to be efficient for the common use cases. Converting a matrix to a vector isn't generally necessary (you can treat x as a vector already, e.g. x[100000] and y[100000] will give the same value), so there's no need for "convert to vector" to be efficient. On the other hand, extracting elements is very common, so you don't want to have an extra pointer dereference slowing that down.

return value of if statement in r

So, I'm brushing up on how to work with data frames in R and I came across this little bit of code from https://cloud.r-project.org/web/packages/data.table/vignettes/datatable-intro.html:
input <- if (file.exists("flights14.csv")) {
"flights14.csv"
} else {
"https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
}
Apparently, this assigns the strings (character vectors?) in the if and else statements to input based on the conditional. How is this working? It seems like magic. I am hoping to find somewhere in the official R documentation that explains this.
From other languages I would have just done:
if (file.exists("flights14.csv")) {
input <- "flights14.csv"
} else {
input <- "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
}
or in R there is ifelse which also seems designed to do exactly this, but somehow that first example also works. I can memorize that this works but I'm wondering if I'm missing the opportunity to understand the bigger picture about how R works.
From the documentation on the ?Control help page under "Value"
if returns the value of the expression evaluated, or NULL invisibly if none was (which may happen if there is no else).
So the if statement is kind of like a function that returns a value. The value that's returned is the result of either evaulating the if or the then block. When you have a block in R (code between {}), the brackets are also like a function that just return the value of the last expression evaluated in the block. And a string literal is a valid expression that returns itself
So these are the same
x <- "hello"
x <- {"hello"}
x <- {"dropped"; "hello"}
x <- if(TRUE) {"hello"}
x <- if(TRUE) {"dropped"; "hello"}
x <- if(TRUE) {"hello"} else {"dropped"}
And you only really need blocks {} with if/else statements when you have more than one expression to run or when spanning multiple lines. So you could also do
x <- if(TRUE) "hello" else "dropped"
x <- if(FALSE) "dropped" else "hello"
These all store "hello" in x
You are not really missing anything about the "big picture" in R. The R if function is atypical compared both to other languages as well as to R's typical behavior. Unlike most functions in R which do require assignment of their output to a "symbol", i.e a proper R name, if allows assignments that occur within its consequent or alternative code blocks to occur within the global environment. Most functions would return only the final evaluation, while anything else that occurred inside the function body would be garbage collected.
The other common atypical function is for. R for-loops only
retain these interior assignments and always return NULL. The R Language Definition calls these atypical R functions "control structures". See section 3.3. On my machine (and I suspect most Linux boxes) that document is installed at: http://127.0.0.1:10731/help/doc/manual/R-lang.html#Control-structures. If you are on another OS then there is probably a pulldown Help menu in your IDE that will have a pointer to it. Thew help document calls them "control flow constructs" and the help page is at ?Control. Note that it is necessary to quote these terms when you wnat to access that help page using one of those names since they are "reserved words". So you would need ?'if' rather than typing ?if. The other reserved words are described in the ?Reserved page.
?Control
?'if' ; ?'for'
?Reserved
# When you just type:
?if # and hit <return>
# you will see a "+"-sign which indicateds an incomplete expression.
# you nthen need to hit <escape> to get back to a regular R interaction.
In R, functions don't need explicit return. If not specified the last line of the function is automatically returned. Consider this example :
a <- 5
b <- 1
result <- if(a == 5) {
a <- a + 1
b <- b + 1
a
} else {b}
result
#[1] 6
The last line in if block was saved in result. Similarly, in your case the string values are "returned" implicitly.

How does R evaluate these weird expressions?

I was trying to make Python 3-style assignment unpacking possible in R (e.g., a, *b, c = [1,2,3], "C"), and although I got so close (you can check out my code here), I ultimately ran into a few (weird) problems.
My code is meant to work like this:
a %,*% b %,% c <- c(1,2,3,4,5)
and will assign a = 1, b = c(2,3,4) and c = 5 (my code actually does do this, but with one small snag I will get to later).
In order for this to do anything, I have to define:
`%,%` <- function(lhs, rhs) {
...
}
and
`%,%<-` <- function(lhs, rhs, value) {
...
}
(as well as %,*% and %,*%<-, which are slight variants of the previous functions).
First issue: why R substitutes *tmp* for the lhs argument
As far as I can tell, R evaluates this code from left to right at first (i.e., going from a to c, until it reaches the last %,%, where upon, it goes back from right to left, assigning values along the way. But the first weird thing I noticed is that when I do match.call() or substitute(lhs) in something like x %infix% y <- z, it says that the input into the lhs argument in %infix% is *tmp*, instead of say, a or x.
This is bizarre to me, and I couldn't find any mention of it in the R manual or docs. I actually make use of this weird convention in my code (i.e., it doesn't show this behavior on the righthand side of the assignment, so I can use the presence of the *tmp* input to make %,% behave differently on this side of the assignment), but I don't know why it does this.
Second issue: why R checks for object existence before anything else
My second problem is what makes my code ultimately not work. I noticed that if you start with a variable name on the lefthand side of any assignment, R doesn't seem to even start evaluating the expression---it returns the error object '<variable name>' not found. I.e., if x is not defined, x %infix% y <- z won't evaluate, even if %infix% doesn't actually use or evaluate x.
Why does R behave like this, and can I change it or get around it? If I could to run the code in %,% before R checks to see if x exists, I could probably hack it so that I wouldn't be a problem, and my Python unpacking code would be useful enough to actually share. But as it is now, the first variable needs to already exist, which is just too limiting in my opinion. I know that I could probably do something by changing the <- to a custom infix operator like %<-%, but then my code would be so similar to the zeallot package, that I wouldn't consider it worth it. (It's already very close in what it does, but I like my style better.)
Edit:
Following Ben Bolker's excellent advice, I was able to find a way around the problem... by overwriting <-.
`<-` <- function(x, value) {
base::`<-`(`=`, base::`=`)
find_and_assign(match.call(), parent.frame())
do.call(base::`<-`, list(x = substitute(x), value = substitute(value)),
quote = FALSE, envir = parent.frame())
}
find_and_assign <- function(expr, envir) {
base::`<-`(`<-`, base::`<-`)
base::`<-`(`=`, base::`=`)
while (is.call(expr)) expr <- expr[[2]]
if (!rlang::is_symbol(expr)) return()
var <- rlang::as_string(expr) # A little safer than `as.character()`
if (!exists(var, envir = envir)) {
assign(var, NULL, envir = envir)
}
}
I'm pretty sure that this would be a mortal sin though, right? I can't exactly see how it would mess anything up, but the tingling of my programmer senses tells me this would not be appropriate to share in something like a package... How bad would this be?
For your first question, about *tmp* (and maybe related to your second question):
From Section 3.4.4 of the R Language definition:
Assignment to subsets of a structure is a special case of a general mechanism for complex assignment:
x[3:5] <- 13:15
The result of this command is as if the following had been executed
`*tmp*` <- x
x <- "[<-"(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)
Note that the index is first converted to a numeric index and then the elements are replaced sequentially along the numeric index, as if a for loop had been used. Any existing variable called *tmp* will be overwritten and deleted, and this variable name should not be used in code.
The same mechanism can be applied to functions other than [. The replacement function has the same name with <- pasted on. Its last argument, which must be called value, is the new value to be assigned.
I can imagine that your second problem has to do with the first step of the "as if" code: if R is internally trying to evaluate *tmp* <- x, it may be impossible to prevent from trying to evaluate x at this point ...
If you want to dig farther, I think the internal evaluation code used to deal with "complex assignment" (as it seems to be called in the internal comments) is around here ...

R: Enriched debugging for linear code chains

I am trying to figure out if it is possible, with a sane amount of programming, to create a certain debugging function by using R's metaprogramming features.
Suppose I have a block of code, such that each line uses as all or part of its input the output from thee line before -- the sort of code you might build with pipes (though no pipe is used here).
{
f1(args1) -> out1
f2(out1, args2) -> out2
f3(out2, args3) -> out3
...
fn(out<n-1>, args<n>) -> out<n>
}
Where for example it might be that:
f1 <- function(first_arg, second_arg, ...){my_body_code},
and you call f1 in the block as:
f1(second_arg = 1:5, list(a1 ="A", a2 =1), abc = letters[1:3], fav = foo_foo)
where foo_foo is an object defined in the calling environment of f1.
I would like a function I could wrap around my block that would, for each line of code, create an entry in a list. Each entry would be named (line1, line2) and each line entry would have a sub-entry for each argument and for the function output. the argument entries would consist, first, of the name of the formal, to which the actual argument is matched, second, the expression or name supplied to that argument if there is one (and a placeholder if the argument is just a constant), and third, the value of that expression as if it were immediately forced on entry into the function. (I'd rather have the value as of the moment the promise is first kept, but that seems to me like a much harder problem, and the two values will most often be the same).
All the arguments assigned to the ... (if any) would go in a dots = list() sublist, with entries named if they have names and appropriately labeled (..1, ..2, etc.) if they are assigned positionally. The last element of each line sublist would be the name of the output and its value.
The point of this is to create a fairly complete record of the operation of the block of code. I think of this as analogous to an elaborated version of purrr::safely that is not confined to iteration and keeps a more detailed record of each step, and indeed if a function exits with an error you would want the error message in the list entry as well as as much of the matched arguments as could be had before the error was produced.
It seems to me like this would be very useful in debugging linear code like this. This lets you do things that are difficult using just the RStudio debugger. For instance, it lets you trace code backwards. I may not know that the value in out2 is incorrect until after I have seen some later output. Single-stepping does not keep intermediate values unless you insert a bunch of extra code to do so. In addition, this keeps the information you need to track down matching errors that occur before promises are even created. By the time you see output that results from such errors via single-stepping, the matching information has likely evaporated.
I have actually written code that takes a piped function and eliminates the pipes to put it in this format, just using text manipulation. (Indeed, it was John Mount's "Bizarro pipe" that got me thinking of this). And if I, or we, or you, can figure out how to do this, I would hope to make a serious run on a second version where each function calls the next, supplying it with arguments internally rather than externally -- like a traceback where you get the passed argument values as well as the function name and and formals. Other languages have debugging environments like that (e.g. GDB), and I've been wishing for one for R for at least five years, maybe 10, and this seems like a step toward it.
Just issue the trace shown for each function that you want to trace.
f <- function(x, y) {
z <- x + y
z
}
trace(f, exit = quote(print(returnValue())))
f(1,2)
giving the following which shows the function name, the input and output. (The last 3 is from the function itself.)
Tracing f(1, 2) on exit
[1] 3
[1] 3

Why the "=" R operator should not be used in functions?

The manual states:
The operator ‘<-’ can be used anywhere,
whereas the operator ‘=’ is only allowed at the top level (e.g.,
in the complete expression typed at the command prompt) or as one
of the subexpressions in a braced list of expressions.
The question here mention the difference when used in the function call. But in the function definition, it seems to work normally:
a = function ()
{
b = 2
x <- 3
y <<- 4
}
a()
# (b and x are undefined here)
So why the manual mentions that the operator ‘=’ is only allowed at the top level??
There is nothing about it in the language definition (there is no = operator listed, what a shame!)
The text you quote says at the top level OR in a braced list of subexpressions. You are using it in a braced list of subexpressions. Which is allowed.
You have to go to great lengths to find an expression which is neither toplevel nor within braces. Here is one. You sometimes want to wrap an assignment inside a try block: try( x <- f() ) is fine, but try( x = f(x) ) is not -- you need to either change the assignment operator or add braces.
Expressions not at the top level include usage in control structures like if. For example, the following programming error is illegal.
> if(x = 0) 1 else x
Error: syntax error
As mentioned here: https://stackoverflow.com/a/4831793/210673
Also see http://developer.r-project.org/equalAssign.html
Other than some examples such as system.time as others have shown where <- and = have different results, the main difference is more philisophical. Larry Wall, the creater of Perl, said something along the lines of "similar things should look similar, different things should look different", I have found it interesting in different languages to see what things are considered "similar" and which are considered "different". Now for R assignment let's compare 2 commands:
myfun( a <- 1:10 )
myfun( a = 1:10 )
Some would argue that in both cases we are assigning 1:10 to a so what we are doing is similar.
The other argument is that in the first call we are assigning to a variable a that is in the same environment from which myfun is being called and in the second call we are assigning to a variable a that is in the environment created when the function is called and is local to the function and those two a variables are different.
So which to use depends on whether you consider the assignments "similar" or "different".
Personally, I prefer <-, but I don't think it is worth fighting a holy war over.

Resources