I'm working on a script that finds text in large PDFs, and I have the bare bones script written out. I'm trying to refactor my code to encapsulate the main while loop in a function, so I can run sapply() on it with a list of the PDFs. Some of the functions that I call within the main loop require values from that main loop: here's a stripped down, pseudo-version of my code:
pdfParse <- function() {
  N <- sample(1:50, 1) * 2
  n <- N / 2
  i <- 0
  while (i <= N) {
    what <- whatP(n)
    i <- i + length(what)
    if (!length(what)) {
      break
    } else {
      n <- N / 2 - i
    }
  }
  n
}

res <- sample(0:1, N, replace = TRUE)
r <- 1

whatP <- function(t) {
  r <- r * 2
  if (t %% 3) {
    if (t %% 5) {
      return(res[(n / r):n])
    } else {
      whatP((rev(t)[1]):(rev(t)[1] + r))
    }
  } else {
    return(rep(NaN, 2))
  }
}
So my question is, how do I access the variable n that I've defined in the pdfParse function within the function it calls? Even if it's possible, I'd like to avoid assigning it as a global variable. I've read a bit into closures, but I'm not sure if that's an applicable solution here.
Edit: For clarification, whatP(n) starts out with n as its initial argument, but it's recursive, so depending on whether certain conditions are fulfilled, it may end up operating on a vector that doesn't even include n. But I still want to return something that depends on the original n I defined in pdfParse.
The simplest (and probably safest, given that your whatP function is recursive) approach is to make n an argument of whatP.
whatP <- function(t, n) {
  ...
}
and then call it from pdfParse with two arguments instead of one.
If for some reason you don't want to do this, then you have two options
(a) you can actually just use n as though it were in scope, provided you define whatP inside pdfParse. R's rules for where it looks up a variable are very different from, say, C(++): R uses lexical scoping, so the search follows where a function was defined, not where it was called from. In order, R searches in
the environment of the current function
its enclosing environment (the one where the function was defined)
the enclosure of that environment, and so on
the global environment
the environments of loaded packages, in the same order they appear in search().
So if you move the definition of whatP inside pdfParse, it will find the appropriate value of n under the second bullet, no matter how deep the recursion goes.
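For instance, a minimal toy sketch of option (a), not your actual code: because whatP is defined inside pdfParse here, lexical scoping lets it read n from its enclosing environment.

pdfParse <- function() {
  n <- 10
  whatP <- function(t) {
    t + n  # n is found in pdfParse's environment, where whatP was defined
  }
  whatP(5)
}
pdfParse()
## [1] 15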
(b) you can use get with a suitable (negative) value of pos, corresponding to the parent frame (alternatively, use sys.frame or parent.frame). This is not recommended here, as it's tricky to get right with recursive functions, but it can be useful in other situations (and it will bypass any n you might have redefined in the meantime in another, closer scope).
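And a minimal sketch of option (b), with hypothetical function names; parent.frame() gives the caller's environment, which is exactly why this gets tricky once the function starts calling itself:

f <- function() {
  n <- 42
  g()
}
g <- function() {
  n <- get("n", envir = parent.frame())  # fetch n from whoever called g()
  n * 2
}
f()
## [1] 84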
Related
This thread discusses two basic approaches to using functions inside other functions in R: What are the benefits of defining and calling a function inside another function in R?
The top answer says the second approach, defining the inner function externally and just calling it by name in the outer function, is faster: "f2 needs to be redefined every time you call f1, which adds some overhead (not very much overhead, but definitely there)". My question is: is this overhead caused by the assignment itself, or by the call passing through an extra function?
For example, consider this third option besides the two in that thread:
# Approach 1
fun1a <- function(x) {
  fun1b <- function(y) { return(y^2) }
  return(fun1b(x))
}

# Approach 2
fun2a <- function(y) { return(y^2) }
fun2b <- function(x) { return(fun2a(x)) }

# Approach 3: as Approach 1, but the inner function stays anonymous and is
# called immediately instead of being assigned to a name first
fun3 <- function(x) {
  return((function(y) { return(y^2) })(x))
}
It was confirmed that Approach 2 is faster than Approach 1, because Approach 1 needs to redefine fun1b inside the function repeatedly. But if you use Approach 3 (basically Approach 1, but without assigning the inner function to a name every time you run it), is that always faster?
If so, why would anyone not just use Approach 3 for everything? I.e., what disadvantages does it have compared to Approach 2 (or 1)?
Some of these (but not all) are already mentioned in the link in the question but here is a longer list.
Visibility Functions defined within functions are not visible outside that function, which increases the modularity of the software if that function is not also used elsewhere. It provides a sort of poor man's namespace. For example, an alternative to using an anonymous function in a lapply appearing within a function would be to define it as a named function within the outer function, keeping it from being visible outside the outer function. The name might also serve as a sort of documentation for the inner function.
Scope Functions defined within functions can access variables defined in the outer function without passing them as arguments.
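A minimal illustration of this (toy names):

outerFn <- function() {
  a <- 1
  innerFn <- function() a + 1  # a is found in outerFn's environment
  innerFn()
}
outerFn()
## [1] 2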
Cache Functions defined within functions and passed back out can use the outer function to cache results so that they are remembered the next time the passed out function is run. Here makeIncr is a factory function which constructs a new counter function each time it is run. The counter functions return the next number in sequence each time they are run.
makeIncr <- function(init) function() { init <<- init + 1; init }
counter1 <- makeIncr(0)
counter1()
## [1] 1
counter1()
## [1] 2
counter2 <- makeIncr(0)
counter2()
## [1] 1
Object Orientation Functions defined within functions can be used to emulate a limited form of object orientation. See an example by running: demo(scoping)
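As a taste of what demo(scoping) shows, here is a minimal sketch of a "bank account" object emulated with closures (hypothetical names):

makeAccount <- function(balance = 0) {
  list(
    deposit = function(x) balance <<- balance + x,  # <<- updates the enclosing balance
    balance = function() balance
  )
}
acct <- makeAccount(100)
acct$deposit(50)
acct$balance()
## [1] 150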
Debugging can be a bit more awkward with functions within functions. For example, debug(makeIncr) using makeIncr above does not debug the counters which would have to be debugged separately.
I am not sure that the performance issue discussed is really material since the functions would be byte compiled the first time the outer function is run. In most cases you would want to make a decision based on other factors.
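If you want to check the overhead yourself, here is a rough timing sketch (numbers will vary by machine and R version; the definitions are repeated from the question):

fun1a <- function(x) { fun1b <- function(y) y^2; fun1b(x) }
fun2a <- function(y) y^2
fun2b <- function(x) fun2a(x)
system.time(for (i in 1:1e6) fun1a(2))
system.time(for (i in 1:1e6) fun2b(2))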
I understand how "function(x)" works, but what is the role of "function()" here?
z <- function() {
  y <- 2
  function(x) {
    x + y
  }
}
function is a keyword which is part of the creation of a function (in the programming sense that Gilles describes in his answer). The other parts are the argument list (in parentheses) and the function body (in braces).
In your example, z is a function which takes no arguments. It returns a function which takes 1 argument (named x) (since R returns the last evaluated statement as the return value by default). That function returns its argument x plus 2.
When z is called (with no arguments: z()) it assigns 2 to y (inside the function's variable scope, an additional concept that I'm not going to get into). Then it creates a function (without a name) which takes a single argument named x, which, when itself called, returns its argument x plus 2. That anonymous function is returned from the call to z and, presumably, stored so that it can be called later.
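For instance, assuming the definition of z from the question, one would presumably use it like this:

add2 <- z()  # add2 is the anonymous function returned by z
add2(3)
## [1] 5
z()(3)       # or call the returned function immediately
## [1] 5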
See https://github.com/hadley/devtools/wiki/Functions and https://github.com/hadley/devtools/wiki/Functionals for more discussion on passing around functions as objects.
The word “function” means somewhat different things in mathematics and in programming. In mathematics, a function is a correspondence between each possible value of the parameters and a result. In programming, a function is a sequence of instructions to compute the result from the parameters.
In mathematics, a function with no argument is a constant. In programming, this is not the case, because functions can have side effects, such as printing something. So you will encounter many functions with no arguments in programs.
Here the function function(x) { x + y } depends on the variable y. There are no side effects, so this function is very much like the mathematical function defined by $f(x) = x + y$. However, this definition is only complete for a given value of y. The previous instruction sets y to 2, so
function() {
  y <- 2
  function(x) {
    x + y
  }
}
is equivalent to
function() {
  function(x) {
    x + 2
  }
}
in the sense that both definitions produce the same results when applied to the same value. They are, however, computed in slightly different ways.
That function is given the name z. When you call z (with no argument, so you write z()), this builds the function function (x) { x + 2 }, or something equivalent: z() is a function of one argument that adds 2 to its argument. So you can write something like z()(3) — the result is 5.
This is obviously a toy example. As you progress in your lectures, you'll see progressively more complex examples where such function building is mixed with other features to achieve useful things.
With some help I've picked out a few examples of functions without formal arguments to help you understand why they could be useful.
Functions which have side-effects
plot.new(), for instance, initializes a graphics device.
Want to update the console buffer? flush.console() has your back.
Functions which have a narrow purpose
This is probably the majority of the cases.
Want to know the date/time? Call date().
Want to know the version of R? Call getRversion().
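A toy example of the side-effect kind, just to make the shape concrete (the name is made up):

greet <- function() {
  cat("hello\n")  # the side effect; no arguments are needed
  invisible(NULL)
}
greet()
## hello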
There's a conditional debugging flag I miss from Matlab: dbstop if infnan described here. If set, this condition will stop code execution when an Inf or NaN is encountered (IIRC, Matlab doesn't have NAs).
How might I achieve this in R in a more efficient manner than testing all objects after every assignment operation?
At the moment, the only ways I see to do this are via hacks like the following:
Manually insert a test after all places where these values might be encountered (e.g. a division, where division by 0 may occur). The testing would be to use is.finite(), described in this Q & A, on every element.
Use body() to modify the code to call a separate function, after each operation or possibly just each assignment, which tests all of the objects (and possibly all objects in all environments).
Modify R's source code (?!?)
Attempt to use tracemem to identify those variables that have changed, and check only these for bad values.
(New - see note 2) Use some kind of call handlers / callbacks to invoke a test function.
The 1st option is what I am doing at present. This is tedious, because I can't guarantee I've checked everything. The 2nd option will test everything, even if an object hasn't been updated. That is a massive waste of time. The 3rd option would involve modifying assignments of NA, NaN, and infinite values (+/- Inf), so that an error is produced. That seems like it's better left to R Core. The 4th option is like the 2nd - I'd need a call to a separate function listing all of the memory locations, just to ID those that have changed, and then check the values; I'm not even sure this will work for all objects, as a program may do an in-place modification, which seems like it would not invoke the duplicate function.
Is there a better approach that I'm missing? Maybe some clever tool by Mark Bravington, Luke Tierney, or something relatively basic - something akin to an options() parameter or a flag when compiling R?
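For reference, the 1st option in my list above usually amounts to a tiny helper like this (hypothetical name):

assertFinite <- function(x) {
  if (!all(is.finite(x))) stop("non-finite values detected")
  invisible(x)
}
z <- assertFinite(1 / c(1, 2))  # passes
# assertFinite(1 / 0)           # would stop with an error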
Example code Here is some very simple example code to test with, incorporating the addTaskCallback function proposed by Josh O'Brien. The code isn't interrupted, but an error does occur in the first scenario, while no error occurs in the second case (i.e. badDiv(0,0,FALSE) doesn't abort). I'm still investigating callbacks, as this looks promising.
badDiv <- function(x, y, flag) {
  z <- x / y
  if (flag == TRUE) {
    return(z)
  } else {
    return(FALSE)
  }
}

addTaskCallback(stopOnNaNs)
badDiv(0, 0, TRUE)   # returns NaN, so the callback throws an error

addTaskCallback(stopOnNaNs)  # re-register: the callback was removed when it errored
badDiv(0, 0, FALSE)  # the NaN never leaves badDiv, so no error occurs
Note 1. I'd be satisfied with a solution for standard R operations, though a lot of my calculations involve objects used via data.table or bigmemory (i.e. disk-based memory mapped matrices). These appear to have somewhat different memory behaviors than standard matrix and data.frame operations.
Note 2. The callbacks idea seems a bit more promising, as this doesn't require me to write functions that mutate R code, e.g. via the body() idea.
Note 3. I don't know whether or not there is some simple way to test the presence of non-finite values, e.g. meta information about objects that indexes where NAs, Infs, etc. are stored in the object, or if these are stored in place. So far, I've tried Simon Urbanek's inspect package, and have not found a way to divine the presence of non-finite values.
Follow-up: Simon Urbanek has pointed out in a comment that such information is not available as meta information for objects.
Note 4. I'm still testing the ideas presented. Also, as suggested by Simon, testing for the presence of non-finite values should be fastest in C/C++; that should surpass even compiled R code, but I'm open to anything. For large datasets, e.g. on the order of 10-50GB, this should be a substantial savings over copying the data. One may get further improvements via use of multiple cores, but that's a bit more advanced.
The idea sketched below (and its implementation) is very imperfect. I'm hesitant to even suggest it, but: (a) I think it's kind of interesting, even in all of its ugliness; and (b) I can think of situations where it would be useful. Given that it sounds like you are right now manually inserting a check after each computation, I'm hopeful that your situation is one of those.
Mine is a two-step hack. First, I define a function nanDetector() which is designed to detect NaNs in several of the object types that might be returned by your calculations. Then, I use addTaskCallback() to call nanDetector() on .Last.value after each top-level task/calculation is completed. When it finds an NaN in one of those returned values, it throws an error, which you can use to avoid any further computations.
Among its shortcomings:
If you do something like setting options(error = recover), it's hard to tell where the error was triggered, since the error is always thrown from inside of stopOnNaNs().
When it throws an error, stopOnNaNs() is terminated before it can return TRUE. As a consequence, it is removed from the task list, and you'll need to reset it with addTaskCallback(stopOnNaNs) if you want to use it again. (See the 'Arguments' section of ?addTaskCallback for more details.)
Without further ado, here it is:
# Sketch of a function that tests for NaNs in several types of objects
nanDetector <- function(X) {
  # To examine data frames
  if (is.data.frame(X)) {
    return(any(unlist(sapply(X, is.nan))))
  }
  # To examine vectors, matrices, or arrays
  if (is.numeric(X)) {
    return(any(is.nan(X)))
  }
  # To examine lists, including nested lists
  if (is.list(X)) {
    return(any(rapply(X, is.nan)))
  }
  return(FALSE)
}

# Set up the taskCallback
stopOnNaNs <- function(...) {
  if (nanDetector(.Last.value)) { stop("NaNs detected!\n") }
  return(TRUE)
}
addTaskCallback(stopOnNaNs)
# Try it out
j <- 1:100
y <- rnorm(99)
l <- list(a=1:4, b=list(j=1:4, k=NaN))
# Error in function (...) : NaNs detected!
# Subsequent time consuming code that could be avoided if the
# error thrown above is used to stop its evaluation.
I fear there is no such shortcut. In theory on unix there is SIGFPE that you could trap on, but in practice
there is no standard way to enable FP operations to trap it (even C99 doesn't include a provision for that) - it is highly system-specific (e.g. feenableexcept on Linux, fp_enable_all on AIX etc.) or requires the use of assembler for your target CPU
FP operations are nowadays often done in vector units like SSE, so you can't even be sure that the FPU is involved, and
R intercepts some operations on things like NaNs and NAs and handles them separately, so they won't make it to the FP code
That said, you could hack yourself an R that will catch some exceptions for your platform and CPU if you tried hard enough (disable SSE etc.). It is not something we would consider building into R, but for a special purpose it may be doable.
However, it would still not catch NaN/NA operations unless you change R internal code. In addition, you would have to check every single package you are using since they may be using FP operations in their C code and may also handle NA/NaN separately.
If you are only worried about things like division by zero or over/underflows, the above will work and is probably the closest to something like a solution.
Just checking your results may not be very reliable, because you don't know whether a result is based on some intermediate NaN calculation that changed an aggregated value which need not itself be NaN. If you are willing to discard such cases, then you could simply walk recursively through your result objects or the workspace. That should not be extremely inefficient, because you only need to worry about REALSXP and not anything else (unless you don't like NAs either - then you'd have more work).
This is an example code that could be used to traverse R object recursively:
static int do_isFinite(SEXP x) {
    /* recurse into generic vectors (lists) */
    if (TYPEOF(x) == VECSXP) {
        int n = LENGTH(x);
        for (int i = 0; i < n; i++)
            if (!do_isFinite(VECTOR_ELT(x, i))) return 0;
    }
    /* recurse into pairlists */
    if (TYPEOF(x) == LISTSXP) {
        while (x != R_NilValue) {
            if (!do_isFinite(CAR(x))) return 0;
            x = CDR(x);
        }
        return 1;
    }
    /* I wouldn't bother with attributes except for S4
       where attributes are slots */
    if (IS_S4_OBJECT(x) && !do_isFinite(ATTRIB(x))) return 0;
    /* check reals */
    if (TYPEOF(x) == REALSXP) {
        int n = LENGTH(x);
        double *d = REAL(x);
        for (int i = 0; i < n; i++) if (!R_finite(d[i])) return 0;
    }
    return 1;
}

SEXP isFinite(SEXP x) { return ScalarLogical(do_isFinite(x)); }
# in R: .Call("isFinite", x)
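To try this out, one presumed workflow (the file names are assumptions) is to save the C code above as isFinite.c, build it with R CMD SHLIB isFinite.c, and then:

dyn.load("isFinite.so")  # isFinite.dll on Windows
.Call("isFinite", 1:10)
## [1] TRUE
.Call("isFinite", list(1, 2, c(3, NaN)))
## [1] FALSE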
I am searching for a way to terminate an apply function early on some condition. Using a for loop, something like:
FDP_HCFA <- function(FaultMatrix, TestCosts, GenerateNeighbors, RandomSeed) {
  set.seed(RandomSeed)
  ## number of tests, mind the summary column
  nT <- ncol(FaultMatrix) - 1
  StartingSequence <- sample(1:nT)
  BestAPFD <- APFD_C(StartingSequence, FaultMatrix, TestCosts)
  BestPrioritization <- StartingSequence
  MakingProgress <- TRUE
  NumberOfIterations <- 0
  while (MakingProgress) {
    BestPrioritizationBefore <- BestPrioritization
    AllCurrentNeighbors <- GenerateNeighbors(BestPrioritization)
    for (CurrentNeighbor in AllCurrentNeighbors) {
      CurrentAPFD <- APFD_C(CurrentNeighbor, FaultMatrix, TestCosts)
      if (CurrentAPFD > BestAPFD) {
        BestAPFD <- CurrentAPFD
        BestPrioritization <- CurrentNeighbor
        break
      }
    }
    if (length(union(list(BestPrioritizationBefore),
                     list(BestPrioritization))) == 1)
      MakingProgress <- FALSE
    NumberOfIterations <- NumberOfIterations + 1
  }
}
I would like to rewrite this function using some derivation of apply. In particular, I want to terminate the evaluation at the first individual with increased fitness, thereby avoiding the cost of considering the rest of the population.
I reckon that you don't really grasp the apply family and its purpose. Contrary to the general idea, they're not the equivalent of any for-loop. One can say that most for-loops are the equivalent of an apply, but that's another matter.
Apply does exactly as it says: it applies a function on a number of similar arguments sequentially, and returns the result. Hence, by definition you cannot break out of an apply. You're not operating in the global environment any more, so in principle you cannot keep global counters, check after each execution some condition and adapt the loop. You can access the global environment and even change variables using assign or <<-, but this is pretty dangerous.
To understand the difference, don't read sapply(1:3, afunc) as for(i in 1:3) afunc(i), but as
afunc(1)
afunc(2)
afunc(3)
in one (block) statement. That reflects better what you're doing exactly. An equivalent for break in an apply simply doesn't make sense, as it is more a block of code than a loop.
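That said, if the goal is just "stop at the first element that passes a test", base R's Position(), which is itself implemented as a loop, already short-circuits its scan. A minimal sketch with a made-up predicate:

improves <- function(candidate) candidate > 10
neighbors <- c(3, 7, 12, 5, 20)
idx <- Position(improves, neighbors)  # stops scanning at the first TRUE
neighbors[idx]
## [1] 12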
Aside from getting your sample code to work, I think this is a clear case where a loop is the right choice. Although R can apply a function to a whole vector of variables [EDIT: but you have to decide what they are before applying], in this case I'd use a while loop to avoid the cost of running unnecessary repetitions. Caveat: I know for loops have compared favorably with apply in timing tests, but I have not seen a similar test for while. Check out some of the options at http://cran.r-project.org/doc/manuals/R-lang.html#Control-structures.
while ( *statement1* ) *statement2*
So I have this function that I'm trying to convert from a recursive algorithm to an iterative algorithm. I'm not even sure if I have the right subproblems, but this seems to determine what I need correctly. Recursion can't be used, though: I need to use dynamic programming, so I have to change it to iterative bottom-up or top-down dynamic programming.
The basic recursive function looks like this:
Recursion(i, j) {
    if (i > j) {
        return 0;
    }
    else {
        // This finds the maximum value for all possible
        // subproblems and returns that for this problem
        for (int x = i; x < j; x++) {
            if (some subsection i to x plus Recursion(x+1, j) is > current max) {
                max = some subsection i to x plus Recursion(x+1, j)
            }
        }
    }
}
This is the general idea, but since recursions typically don't have for loops in them I'm not sure exactly how I would convert this to iterative. Does anyone have any ideas?
You have a recursive function that can be summarised as this:
recursive(i, j):
    if stopping condition:
        return value
    loop:
        if test current value involving recursive call passes:
            set value based on recursive call
    return value  # this appears to be missing from your example
(I am going to be pretty loose with the pseudo code here, to emphasize the structure of the code rather than the specific implementation)
And you want to flatten it to a purely iterative approach. First it would be good to describe exactly what this involves in the general case, as you seem to be interested in that. Then we can move on to flattening the pseudo code above.
Now flattening a primitive recursive function is quite straightforward. When you are given code that is like:
simple(i):
    if i has reached the limit:  # stopping condition
        return value
    # body of method here
    return simple(i + 1)  # recursive call
You can quickly see that the recursive calls will continue until i reaches the predefined limit. When this happens the value will be returned. The iterative form of this is:
simple_iterative(start):
    for (i = start; i < limit; i++):
        # body here
    return value
This works because the recursive calls form the following call tree:
simple(1)
  -> simple(2)
      -> simple(3)
          ...
          -> simple(N):
                 return value
I would describe that call tree as a piece of string. It has a beginning, a middle, and an end. The different calls occur at different points on the string.
A string of calls like that is very like a for loop - all of the work done by the function is passed to the next invocation and the final result of the recursion is just passed back. The for loop version just takes the values that would be passed into the different calls and runs the body code on them.
Simple so far!
Now your method is more complex in two ways:
There are multiple separate statements that make recursive calls
Those statements themselves are within a for loop
So your call tree is something like:
recursive(i, j):
for (v in 1, 2, ... N):
-> first_recursive_call(i + v, j):
-> ... inner calls ...
-> potential second recursive call(i + v, j):
-> ... inner calls ...
As you can see this is not at all like a string. Instead it really is like a tree (or a bush) in that each call results in two more calls. At this point it is actually very hard to turn this back into an entirely iterative function.
This is because of the fundamental relationship between loops and recursion. Any loop can be restated as a recursive call. However not all recursive calls can be transformed into loops.
The class of recursive calls that can be transformed into loops are called primitive recursion. Your function initially appears to have transcended that. If this is the case then you will not be able to transform it into a purely iterative function (short of actually implementing a call stack and similar within your function).
This video explains the difference between primitive recursion and fundamentally recursive functions:
https://www.youtube.com/watch?v=i7sm9dzFtEI
I would add that your condition and the value that you assign to max appear to be the same. If this is the case then you can remove one of the recursive calls, allowing your function to become an instance of primitive recursion wrapped in a loop. If you did so then you might be able to flatten it.
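To make that concrete, here is a hypothetical bottom-up translation (sketched in R; the same shape works in any language). It assumes j stays fixed at n throughout, that value(i, x) stands for "some subsection i to x" in your pseudocode, and it extends the inner loop to x <= j so that every subproblem has a defined value (your original loop leaves Recursion(j, j) undefined, which is related to the missing return noted above):

iterativeDP <- function(n, value) {
  best <- numeric(n + 1)  # best[i] holds Recursion(i, n); best[n + 1] is the i > j base case (0)
  for (i in n:1) {        # bottom-up: solve the rightmost subproblems first
    m <- -Inf
    for (x in i:n) {
      cand <- value(i, x) + best[x + 1]
      if (cand > m) m <- cand
    }
    best[i] <- m
  }
  best[1]                 # the answer for Recursion(1, n)
}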
Well, unless there is an issue with the logic not included yet, it should be fine.
for and while loops are OK in recursion.
Just make sure you return in every case that may occur.