Lazy sequences in R

In Clojure, it's easy to create infinite sequences using the lazy sequence constructor. For example,
(def N (iterate inc 0))
returns a data object N which is equivalent to the infinite sequence
(0 1 2 3 ...)
Evaluating the value N results in an infinite loop. Evaluating (take 20 N) returns the first 20 numbers. Since the sequence is lazy, the inc function is only iterated when you ask it to. Since Clojure is homoiconic, the lazy sequence is stored recursively.
In R, is it possible to do something similar? Can you present some sample R code which produces a data object N that is equivalent to the full infinite sequence of natural numbers? Evaluating the full object N should result in a loop, but something like head(N) should return just the leading numbers.
Note: I am really more interested in lazy sequences rather than the natural numbers per se.
Edit: Here is the Clojure source for lazy-seq:
(defmacro lazy-seq
  "Takes a body of expressions that returns an ISeq or nil, and yields
  a Seqable object that will invoke the body only the first time seq
  is called, and will cache the result and return it on all subsequent
  seq calls. See also - realized?"
  {:added "1.0"}
  [& body]
  (list 'new 'clojure.lang.LazySeq (list* '^{:once true} fn* [] body)))
I'm looking for a macro with the same functionality in R.

Alternate Implementation
Introduction
I've had occasion to be working in R more frequently since this post, so I offer an alternate base R implementation. Again, I suspect you could get much better performance by dropping down to the C extension level. Original answer follows after the break.
The first challenge in base R is the lack of a true cons (exposed at the R level, that is). R uses c for a hybrid cons/concat operation, but it creates not a linked list but a new vector populated with the elements of both arguments. In particular, the length of both arguments has to be known, which is not the case with lazy sequences. Additionally, successive c operations exhibit quadratic performance rather than constant time. Now, you could use "lists" (which are really vectors, not linked lists) of length two to emulate cons cells, but...
The second challenge is the forcing of promises in data structures. R has some lazy evaluation semantics via implicit promises, but these are second-class citizens. The function for returning an explicit promise, delay, has been deprecated in favor of the implicit delayedAssign, which is executed only for its side effect -- "Unevaluated promises should never be visible." Function arguments are implicit promises, so you can get your hands on them, but you cannot place a promise into a data structure without it being forced.
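A minimal sketch of that last point, using base R's delayedAssign (the cat call is just to make the forcing visible): merely placing the promise into a list forces it.
delayedAssign("p", { cat("forcing p...\n"); 42 })
lst <- list(p)   # evaluating p here forces the promise
# forcing p...
lst[[1]]
# [1] 42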
CS 101
It turns out these two challenges can be solved by thinking back to Computer Science 101. Data structures can be implemented via closures.
cons <- function(h,t) function(x) if(x) h else t
first <- function(kons) kons(TRUE)
rest <- function(kons) kons(FALSE)
Now, thanks to R's lazy function-argument semantics, our cons is already capable of building lazy sequences.
fibonacci <- function(a,b) cons(a, fibonacci(b, a+b))
fibs <- fibonacci(1,1)
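We can realize a few elements by hand to check that the recursion is deferred:
first(fibs)
# [1] 1
first(rest(fibs))
# [1] 1
first(rest(rest(fibs)))
# [1] 2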
In order to be useful though we need a suite of lazy sequence processing functions. In Clojure, the sequence processing functions that are part of the core language work naturally with lazy sequences as well. R's sequence functions, on the other hand, would not be immediately compatible. Many rely on knowing ahead of time the (finite) sequence length. Let's define a few capable of working on lazy sequences instead.
filterz <- function(pred, s) {
  if (is.null(s)) return(NULL)
  f <- first(s)
  r <- rest(s)
  if (pred(f)) cons(f, filterz(pred, r)) else filterz(pred, r) }

take_whilez <- function(pred, s) {
  if (is.null(s) || !pred(first(s))) return(NULL)
  cons(first(s), take_whilez(pred, rest(s))) }

reduce <- function(f, init, s) {
  r <- init
  while (!is.null(s)) {
    r <- f(r, first(s))
    s <- rest(s) }
  return(r) }
Let's use what we've created to sum all of the even Fibonacci numbers less than 4 million (Project Euler, Problem 2):
reduce(`+`, 0, filterz(function(x) x %% 2 == 0, take_whilez(function(x) x < 4e6, fibs)))
# [1] 4613732
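If you also want to realize a prefix of one of these closure-based sequences into an ordinary vector, a small eager helper along these lines works (takez is my name for it, not part of the code above):
takez <- function(n, s) {
  r <- c()
  while (n > 0 && !is.null(s)) {
    r <- c(r, first(s))
    s <- rest(s)
    n <- n - 1 }
  r }
takez(10, fibs)
# [1]  1  1  2  3  5  8 13 21 34 55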
Original Answer
I'm very rusty with R, but since (1) I'm familiar with Clojure, and (2) I don't think you are getting your point across to the R users, I'll attempt a sketch based on my illustration of how Clojure lazy sequences work. This is for example purposes only, and not tuned for performance in any way. It would probably be better implemented as a C extension (if one does not exist already).
A lazy sequence has the rest of the sequence generating calculation in a thunk. It is not immediately called. As each element (or chunk of elements as the case may be) is requested, a call to the next thunk is made to retrieve the value(s). That thunk may create another thunk to represent the tail of the sequence if it continues. The magic is that (1) these special thunks implement a sequence interface and can transparently be used as such and (2) each thunk is only called once -- its value is cached -- so the realized portion is a sequence of values.
Standard examples first
Natural numbers
numbers <- function(x) as.LazySeq(c(x, numbers(x+1)))
nums <- numbers(1)
take(10,nums)
#=> [1] 1 2 3 4 5 6 7 8 9 10
#Slow, but does not overflow the stack (level stack consumption)
sum(take(100000,nums))
#=> [1] 5000050000
Fibonacci sequence
fibonacci <- function(a, b) {
  as.LazySeq(c(a, fibonacci(b, a+b)))}
fibs <- fibonacci(1,1)
take(10, fibs)
#=> [1] 1 1 2 3 5 8 13 21 34 55
nth(fibs, 20)
#=> [1] 6765
Followed by naive R implementation
Lazy sequence class
is.LazySeq <- function(x) inherits(x, "LazySeq")
as.LazySeq <- function(s) {
  cache <- NULL
  value <- function() {
    if (is.null(cache)) {
      cache <<- force(s)
      while (is.LazySeq(cache)) cache <<- cache()}
    cache}
  structure(value, class="LazySeq")}
Some generic sequence methods with implementations for LazySeq
first <- function(x) UseMethod("first", x)
rest <- function(x) UseMethod("rest", x)
first.default <- function(s) s[1]
rest.default <- function(s) s[-1]
first.LazySeq <- function(s) s()[[1]]
rest.LazySeq <- function(s) s()[[-1]]
nth <- function(s, n) {
  while (n > 1) {
    n <- n - 1
    s <- rest(s) }
  first(s) }

# Note: Clojure's take is also lazy; this one is "eager"
take <- function(n, s) {
  r <- NULL
  while (n > 0) {
    n <- n - 1
    r <- c(r, first(s))
    s <- rest(s) }
  r}

The iterators library might be able to achieve what you're looking for:
library(iterators)
i <- icount()
nextElem(i)
# 1
nextElem(i)
# 2
You can keep calling nextElem forever.
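If you want the first n values as a vector, a tiny helper like this should do (take_n is just an illustrative name, not part of the iterators API):
library(iterators)
take_n <- function(it, n) sapply(seq_len(n), function(i) nextElem(it))
take_n(icount(), 5)
# [1] 1 2 3 4 5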

Since R works best on vectors, it seems like one often wants an iterator that returns a vector, for instance
library(iterators)
ichunk <- function(n, ..., chunkSize) {
  it <- icount(n) # FIXME: scale by chunkSize
  chunk <- seq_len(chunkSize)
  structure(list(nextElem = function() {
    (nextElem(it) - 1L) * chunkSize + chunk
  }), class = c("abstractiter", "iter"))
}
with typical use being with chunkSize in the millions range. At least this speeds up the approach to forever.
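For example, with the definition above each call returns a chunk of chunkSize consecutive integers:
it <- ichunk(3, chunkSize = 4)
nextElem(it)
# [1] 1 2 3 4
nextElem(it)
# [1] 5 6 7 8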

You didn't state your intended goal, so I'll point out that, in any language,
x <- 0
j <- 1
while(TRUE) { x[j+1] <- x[j] + 1; j <- j + 1 }
will give you an infinite sequence. What do you actually want to do with an iterator?

Related

Why does this function work?

With this function you can calculate the Fibonacci sequence recursively, but I am not sure why it works. I marked the position where I struggled. Can someone explain this code to me?
fib <- function(n){
  if (n == 0) return(0)
  if (n == 1) return(1)
  seq <- integer(n) # at this point i didnt understand much at all
  seq[1:2] <- 1
  calc <- function(n) {
    if (seq[n] != 0) return(seq[n])
    seq[n] <<- calc(n-1) + calc(n-2)
    seq[n]
  }
  calc(n)
}
In the course of the recursive function evaluation, fib() ends up getting called many times with the same n. One way to make this computation faster is to use memoization, which saves a record of values for which the function has previously been called, and the return values. Using #AnoushArivanR's fib function:
system.time(fib(30))
## user system elapsed
## 3.987 0.000 3.987
library(memoise)
fib <- memoise(fib)
system.time(fib(30))
## user system elapsed
## 0.004 0.000 0.004
In fact, now that I look at your code above, I believe it's doing exactly this — but it's definitely hard to understand! (Your question might have been better received if you explained that this was what you're trying to do ...)
Note
This implementation of the fib function is cited from Mastering Software Development in R by Dr. Roger D. Peng, who taught me so much; I am forever grateful to him.
As mentioned above, the original code is unnecessarily complicated. Here is a simpler version. We first check that n is not less than 1. Then, since the first two elements of the sequence are needed to start the series (each later element being the sum of the two previous elements), we set them to 0 and 1 for n == 1 and n == 2 respectively. Then we use recursion, a technique in which a function calls itself from its own body, creating a series of repeated computations until the base cases are reached. For example, for n == 3 the function calls fib(1) and fib(2), both of which are already defined. For fib(4) the function calls fib(3) and fib(2); fib(2) is already defined, and fib(3) is calculated by summing fib(2) and fib(1), and so on.
I hope this explanation has helped you get your mind around the idea.
fib <- function(n){
  stopifnot(n > 0)
  if(n == 1) {
    return(0)
  } else if(n == 2) {
    return(1)
  } else {
    fib(n - 1) + fib(n - 2)
  }
}
fib(7)
# [1] 8
But because some of the values are computed more than once (for example, both fib(6) and fib(5) compute fib(4)), the function gets slower as n grows. For the sake of optimization, your code saves every fib(n) output into a vector called seq, so before any computation takes place it checks whether seq[n] has already been computed. If so, it is reused; if not, it is computed and stored. This technique is called memoization. Whenever a new seq[n] is calculated, it is stored as the nth element of the seq vector, and for this purpose we use <<-, the superassignment operator, because we are modifying an object in the enclosing environment of the function.
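To see how much work memoization saves, here is a sketch that counts calls to the naive version (fib_slow and calls are illustrative names):
calls <- 0
fib_slow <- function(n){
  calls <<- calls + 1
  if(n == 1) return(0)
  if(n == 2) return(1)
  fib_slow(n - 1) + fib_slow(n - 2)
}
fib_slow(20)
# [1] 4181
calls
# [1] 13529
The memoized version computes each seq[n] exactly once, so it makes only on the order of n calls.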

How does R know not to use the old 'f'?

Into the R console, type:
#First code snippet
x <- 0
x <- x+1
x
You'll get '1'. That makes sense: the idea is that the 'x' in 'x+1' is the current value of x, namely 0, and this is used to compute the value of x+1, namely 1, which is then shoveled into the container x. So far, so good.
Now type:
#Second code snippet
f <- function(n) {n^2}
f <- function(n) {if (n >= 1) {n*f(n-1)} else {1}}
f(5)
You'll get '120', which is 5 factorial.
I find this perplexing. Following the logic of the first code snippet, we might expect the 'f' in the expression
if (n >= 1) {n*f(n-1)} else {1}
to be interpreted as the current value of f, namely
function(n) {n^2}
Following this reasoning, the value of f(5) should be 5*(5-1)^2 = 80. But that's not what we get.
Question. What's really going on here? How does R know not to use the old 'f'?
we might expect the 'f' in the expression
if (n >= 1) {n*f(n-1)} else {1}
to be interpreted as the current value of f
— Yes, we might expect that. And we would be correct.
But what is the “current value of f”? Or, more precisely, what is “current”?
“Current” is when the function is executed, not when it is defined. That is, by the time you execute f(5), it has already been redefined. So now the execution enters the function, looks up inside the function what f refers to — and also finds the current (= new) definition, not the old one.
In other words: the objects associated with names are looked up when they are actually accessed. And inside a function this means that names are accessed when the function is executed, not when it’s defined.
The same is true for all objects. Let’s say f is using a global object that’s not a function:
n = 5
f = function() n ^ 2
n = 1
f() # = 1
To understand the difference between your first and second example, consider the following case which involved functions, yet behaves like your first case (i.e. it uses the “old” value of f).
To make the example work, we need a little helper: a function that modifies other functions. In the following, twice is a function which takes a function as an argument and returns a new function. That new function is the same as the old function, only it runs twice when invoked:
twice = function (original_function) {
  force(original_function)
  function (...) {
    original_function(original_function(...))
  }
}
To illustrate what twice does, let’s invoke it on an example function:
plus1 = function (n) n + 1
plus2 = twice(plus1)
plus2(3) # = 5
Neat — R allows us to handle functions like any other object!
Now let’s modify your f:
f = function(n) {n^2}
f = twice(f)
f(5) # 625
… and here we have it: in the statement f = twice(f), the second f refers to the current (= old) definition. Only after that line does f refer to the new, modified function.
Here's a simple example illustrating my comment on Konrad's excellent answer:
a <- 2
f <- function() a*b
e <- new.env()
assign("b",5,e)
environment(f) <- e
> f()
[1] 10
b <- 10
> f()
[1] 10
So we've manually altered the environment for f so that it always first looks in e for b. Theoretically, one could even lock that binding (see ?lockBinding) to make sure it never changes without throwing an error.
This sort of thing could get complicated, though, as in general you'd want to make sure that you set the parent environment of e correctly based on where the function f is actually being created. In this example f is created in the global environment, but if f were being created inside another function, you'd want e's parent environment to reflect that.
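Continuing the example above with lockBinding (a sketch; the exact error wording may vary by R version):
lockBinding("b", e)
assign("b", 6, e)
# Error: cannot change value of locked binding for 'b'
f()
# [1] 10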

what is the purpose of 'NULL' in processing loops?

sqr = seq(1, 100, by=2)
sqr.squared = NULL
for (n in 1:50) {
  sqr.squared[n] = sqr[n]^2
}
I came across the loop above; for a beginner it was simple enough. To further understand R: what is the precise purpose of the second line? From my research I gather it has something to do with resetting the vector. If someone could elaborate, it'd be much appreciated.
sqr.squared <- NULL
is one of many ways to initialize the empty vector sqr.squared prior to filling it in a loop. In general, when the length of the resulting vector is known in advance, it is much better practice to preallocate a vector of that length. So here,
sqr.squared <- vector("integer", 50)
would be much better practice, and faster too. This way you are not growing a new vector inside the loop. But since ^ is vectorized, you could also simply do
sqr[1:50] ^ 2
and ditch the loop all together.
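To make the preallocation point concrete, here is a rough sketch for timing both versions yourself (exact numbers will vary by machine; the preallocated version should come out noticeably faster):
n <- 1e5
system.time({ r <- NULL;       for (i in 1:n) r[i] <- i^2 })  # grown element by element
system.time({ r <- numeric(n); for (i in 1:n) r[i] <- i^2 })  # preallocated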
Another way to think about it is to remember that everything in R is a function call, and functions need input (usually).
Say you calculated y and want to store that value somewhere. You can do x <- y without initializing an x object (R does this for you, unlike other languages, C for example), but say you want to store it at a specific position in x.
So note that <- (or = in your example) is a function
y <- 1
x[2] <- y
# Error in x[2] <- y : object 'x' not found
This fails because assigning to an index uses a different function than plain <-. Since you want to put y at x[2], you need the replacement function [<-
`[<-`(x, 2, y)
# Error: object 'x' not found
But this still doesn't work because we need the object x to use this function, so initialize x to something.
(x <- numeric(5))
# [1] 0 0 0 0 0
# and now use the function
`[<-`(x, 2, y)
# [1] 0 1 0 0 0
This prefix notation is easier for computers to parse (e.g., + 1 1) but harder for humans (me at least), so we prefer infix notation (e.g., 1 + 1). R makes such functions easier to use: you write x[2] <- y rather than calling [<- directly as I did above.
The first answer is correct: when you assign a NULL value to a variable, the purpose is to initialize a vector. In many cases, when you are checking numbers or working with different types of variables, you will need to initialize these arrays, matrices, etc. as NULL.
For example, if you want to create some kind of container, you often need to create it empty before putting something inside it; this is the purpose of NULL. In addition, sometimes you will want NA instead of NULL.

parameter passing mechanism in R

The following function is used to multiply a sequence 1:x by y
f1<-function(x,y){return (lapply(1:x, function(a,b) b*a, b=y))}
It looks like a is used to represent each element in the sequence 1:x, but I do not know how to understand this parameter-passing mechanism. Other OO languages, like Java or C++, have call by reference or call by value.
Short answer: R is call by value. Long answer: it can do both.
Call By Value, Lazy Evaluation, and Scoping
You'll want to read through: the R language definition for more details.
R mostly uses call by value but this is complicated by its lazy evaluation:
So you can have a function:
f <- function(x, y) {
  x * 3
}
If you pass in two big matrixes to x and y, only x will be copied into the callee environment of f, because y is never used.
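You can see the laziness directly: since y is never used, even an expression that would error can be passed without harm (using the f above):
f(2, stop("never evaluated"))
# [1] 6   (y's promise is never forced, so stop() never runs)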
But you can also access variables in parent environments of f:
y <- 5
f <- function(x) {
  x * y
}
f(3) # 15
Or even:
y <- 5
f <- function() {
  x <- 3
  g <- function() {
    x * y
  }
}
f() # returns function g()
f()() # returns 15
Call By Reference
There are two ways for doing call by reference in R that I know of.
One is by using Reference Classes, one of the three object oriented paradigms of R (see also: Advanced R programming: Object Oriented Field Guide)
The other is to use the bigmemory package (see The bigmemory project). This allows you to create matrices in memory (the underlying data is stored in C), returning a pointer to the R session. This allows you to do fun things like accessing the same matrix from multiple R sessions.
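A minimal Reference Class sketch showing the reference (mutable) semantics, with illustrative names:
Counter <- setRefClass("Counter",
  fields  = list(n = "numeric"),
  methods = list(bump = function() n <<- n + 1))
a <- Counter$new(n = 0)
b <- a      # b points at the same object; no copy is made
b$bump()
a$n
# [1] 1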
To multiply a vector x by a constant y just do
x * y
The *apply functions work very similarly to each other: you want to map a function over every element of your vector, list, matrix, and so on:
x = 1:10
x.squared = sapply(x, function(elem)elem * elem)
print(x.squared)
[1] 1 4 9 16 25 36 49 64 81 100
It gets better with matrices and data frames because you can now apply a function over all rows or columns, and collect the output. Like this:
m = matrix(1:9, ncol = 3)
# The 1 below means apply over rows, 2 would mean apply over cols
row.sums = apply(m, 1, function(some.row) sum(some.row))
print(row.sums)
[1] 12 15 18
If you're looking for a simple way to multiply a sequence by a constant, definitely use #Fernando's answer or something similar. I'm assuming you're just trying to determine how parameters are being passed in this code.
lapply calls its second argument (in your case function(a, b) b*a) with each of the values of its first argument 1, 2, ..., x. Those values will be passed as the first parameter to the second argument (so, in your case, they will be argument a).
Any additional parameters to lapply after the first two, in your case b=y, are passed to the function by name. So if you called your inner function fxn, then your invocation of lapply is making calls like fxn(1, b=4), fxn(2, b=4), .... The parameters are passed by value.
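Spelled out, the lapply call in question behaves like this (unrolling it by hand):
f1 <- function(x, y) lapply(1:x, function(a, b) b * a, b = y)
unlist(f1(3, 10))
# [1] 10 20 30
# equivalent to: list(fxn(1, b = 10), fxn(2, b = 10), fxn(3, b = 10))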
You should read the help of lapply to understand how it works. Read this excellent answer to get a good explanation of the different *apply family functions.
From the help of lapply:
lapply(X, FUN, ...)
Here FUN is applied to each element of X, and ... refers to:
... optional arguments to FUN.
Since FUN has an optional argument b, we replace the ... by b=y.
You can see it as syntactic sugar that emphasizes that argument b is held fixed while argument a varies over 1:x. If the two arguments are symmetric, maybe it is better to use mapply, as shown below.
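For instance, when both arguments vary in parallel:
mapply(function(a, b) a * b, 1:3, 4:6)
# [1]  4 10 18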

Why are arguments to replacement functions not evaluated lazily?

Consider the following simple function:
f <- function(x, value){print(x);print(substitute(value))}
Argument x will eventually be evaluated by print, but value never will. So we can get results like this:
> f(a, a)
Error in print(x) : object 'a' not found
> f(3, a)
[1] 3
a
> f(1+1, 1+1)
[1] 2
1 + 1
> f(1+1, 1+"one")
[1] 2
1 + "one"
Everything as expected.
Now consider the same function body in a replacement function:
`g<-` <- function(x, value){print(x);print(substitute(value))}
Let's try it:
> x <- 3
> g(x) <- 4
[1] 3
[1] 4
Nothing unusual so far...
> g(x) <- a
Error: object 'a' not found
This is unexpected. The name a should have been printed as a language object.
> g(x) <- 1+1
[1] 4
1 + 1
This is ok, as x's former value is 4. Notice the expression passed unevaluated.
The final test:
> g(x) <- 1+"one"
Error in 1 + "one" : non-numeric argument to binary operator
Wait a minute... Why did it try to evaluate this expression?
Well the question is: bug or feature? What is going on here? I hope some guru users will shed some light about promises and lazy evaluation on R. Or we may just conclude it's a bug.
We can reduce the problem to a slightly simpler example:
g <- function(x, value) x
`g<-` <- function(x, value) x
x <- 3
# Works
g(x, a)
`g<-`(x, a)
# Fails
g(x) <- a
This suggests that R is doing something special when evaluating a replacement function: I suspect it evaluates all arguments. I'm not sure why, but the comments in the C code (https://github.com/wch/r-source/blob/trunk/src/main/eval.c#L1656 and https://github.com/wch/r-source/blob/trunk/src/main/eval.c#L1181) suggest it may be to make sure other intermediate variables are not accidentally modified.
Luke Tierney has a long comment about the drawbacks of the current approach, and illustrates some of the more complicated ways replacement functions can be used:
There are two issues with the approach here:
A complex assignment within a complex assignment, like
f(x, y[] <- 1) <- 3, can cause the value temporary
variable for the outer assignment to be overwritten and
then removed by the inner one. This could be addressed by
using multiple temporaries or using a promise for this
variable as is done for the RHS. Printing of the
replacement function call in error messages might then need
to be adjusted.
With assignments of the form f(g(x, z), y) <- w the value
of z will be computed twice, once for a call to g(x, z)
and once for the call to the replacement function g<-. It
might be possible to address this by using promises.
Using more temporaries would not work as it would mess up
replacement functions that use substitute and/or
nonstandard evaluation (and there are packages that do
that -- igraph is one).
I think the key may be found in this comment beginning at line 1682 of "eval.c" (and immediately followed by the evaluation of the assignment operation's RHS):
/* It's important that the rhs get evaluated first because
assignment is right associative i.e. a <- b <- c is parsed as
a <- (b <- c). */
PROTECT(saverhs = rhs = eval(CADR(args), rho));
We expect that if we do g(x) <- a <- b <- 4 + 5, both a and b will be assigned the value 9; this is in fact what happens.
Apparently, the way that R ensures this consistent behavior is to always evaluate the RHS of an assignment first, before carrying out the rest of the assignment. If that evaluation fails (as when you try something like g(x) <- 1 + "a"), an error is thrown and no assignment takes place.
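For reference, the R Language Definition describes the expansion of a replacement call roughly like this (a sketch; note the RHS value is computed before `g<-` runs):
# g(x) <- value   expands to approximately:
`*tmp*` <- x
x <- `g<-`(`*tmp*`, value = value)
rm(`*tmp*`)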
I'm going to go out on a limb here, so please, folks with more knowledge feel free to comment/edit.
Note that when you run
`g<-` <- function(x, value){print(x);print(substitute(value))}
x <- 1
g(x) <- 5
a side effect is that 5 is assigned to x. Hence, both must be evaluated. But if you then run
`g<-`(x, 10)
both the values of x and 10 are printed, but the value of x remains the same.
Speculation:
So the parser is distinguishing between whether you call g<- in the course of making an actual assignment, and when you simply call g<- directly.
