R writing style - require vs. ::

OK, we're all familiar with the double colon operator in R. Whenever I'm about to write a function, I use require(<pkgname>), but I've always wondered about using :: instead. Using require in custom functions is better practice than library, since require returns a warning and FALSE, unlike library, which throws an error if you give it the name of a non-existent package.
On the other hand, the :: operator gets a single variable from the package, while require loads the whole package (at least I hope so), so speed differences were the first thing that came to my mind: :: must be faster than require.
So I did some analysis to check that. I wrote two simple functions that load the read.systat function from the foreign package, with require and :: respectively, then import the Iris.syd dataset that ships with foreign, replicated each function 1000 times (which was shamelessly arbitrary), and... crunched some numbers.
Strangely (or not), I found significant differences in terms of user CPU and elapsed time, while there were no significant differences in terms of system CPU. And a yet stranger conclusion: :: is actually slower! The documentation for :: is very blunt, and just by looking at the sources it's obvious that :: should perform better!
require
#!/usr/local/bin/r
## with require
fn1 <- function() {
  require(foreign)
  read.systat("Iris.syd", to.data.frame = TRUE)
}
## times
n <- 1e3
sink("require.txt")
print(t(replicate(n, system.time(fn1()))))
sink()
double colon
#!/usr/local/bin/r
## with ::
fn2 <- function() {
  foreign::read.systat("Iris.syd", to.data.frame = TRUE)
}
## times
n <- 1e3
sink("double_colon.txt")
print(t(replicate(n, system.time(fn2()))))
sink()
Grab CSV data here. Some stats:
user CPU: W = 475366, p-value = 0.04738, MRr = 975.866, MRc = 1025.134
system CPU: W = 503312.5, p-value = 0.7305, MRr = 1003.8125, MRc = 997.1875
elapsed time: W = 403299.5, p-value < 2.2e-16, MRr = 903.7995, MRc = 1097.2005
MRr is the mean rank for require, MRc the mean rank for ::. I must have done something wrong here. It just doesn't make any sense... execution with :: actually comes out slower! Then again, I may have screwed something up, so you shouldn't discard that option...
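For completeness, the numbers above look like Wilcoxon rank-sum tests; here is a sketch of how they could be reproduced from the two timing files written above (assuming the column names returned by system.time(), i.e. user.self, sys.self and elapsed):
req <- read.table("require.txt", header = TRUE)
dc  <- read.table("double_colon.txt", header = TRUE)
wilcox.test(req$user.self, dc$user.self)   # user CPU
wilcox.test(req$sys.self,  dc$sys.self)    # system CPU
wilcox.test(req$elapsed,   dc$elapsed)     # elapsed time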
OK... I've wasted my time only to see that there is some difference, and I've carried out a completely useless analysis, so, back to the question:
"Why should one prefer require over :: when writing a function?"
=)

"Why should one prefer require over ::
when writing a function?"
I usually prefer require due to the nice TRUE/FALSE return value, which lets me deal with the possibility of the package not being available up front, before getting into the code. Crash as early as possible instead of halfway through your analysis.
I only use :: when I need to make sure I am using the correct version of a function, not a version from some other package that is masking the name.
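For example, a minimal sketch of that pattern, using the foreign package from the question:
fn <- function(file) {
  # Fail early, before any real work, if the dependency is missing
  if (!require(foreign)) {
    stop("The 'foreign' package is needed for this function; please install it.")
  }
  read.systat(file, to.data.frame = TRUE)
}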
On the other hand, the :: operator gets a single variable from the package, while require loads the whole package (at least I hope so), so speed differences were the first thing that came to my mind: :: must be faster than require.
I think you may be ignoring the effects of lazy loading, which is used by the foreign package according to the first page of its manual. Essentially, packages that use lazy loading defer the loading of objects, such as functions, until the objects are called upon for the first time. So your argument that ":: must be faster than require" is not necessarily true, as foreign does not load all of its contents into memory when you attach it with require. For full details on lazy loading, see Prof. Ripley's article in R News, Volume 4, Issue 2.

Since the time to load a package is almost always small compared to the time you spend trying to figure out what the code you wrote six months ago was about, in this case coding for clarity is the most important thing.
For scripts, having a call to require or library at the start lets you know which packages you need straight away.
Similarly, calling require (or a wrapper like requirePackage in Hmisc or try_require in ggplot2) at the start of a function is the most unambiguous way of showing that you need to use that package.
:: should be reserved for cases when you have naming conflicts between packages – compare, e.g.,
Hmisc::is.discrete
and
plyr::is.discrete
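To see the masking in action, a minimal sketch (assuming both packages are installed and, as above, both export is.discrete):
library(Hmisc)
library(plyr)                  # plyr's is.discrete now masks Hmisc's version
is.discrete(letters)           # whichever package was attached last wins
Hmisc::is.discrete(letters)    # unambiguous
plyr::is.discrete(letters)     # unambiguous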

Related

Rf_allocVector only allocates and does not zero out memory

The original motivation behind this is that I have a dynamically sized array of floats that I want to pass to R through Rcpp without incurring either the cost of zeroing it out or the cost of a deep copy.
Originally I had thought that there might be some way to take a heap-allocated array, make it known to R's GC system, and then wrap it with other data to create an Rcpp::NumericVector, but it seems like that's not possible - or at least not doable with my current knowledge.
However - and correct me if I'm wrong - it looks like simply constructing a NumericVector of size N and then using it as an N-sized allocation calls R's Rf_allocVector, which itself does not zero out the allocated array. I tested this with a small C program that gets dyn.load()ed into R, and the values look like garbage. I also took a peek at the assembly and there doesn't seem to be any zeroing out.
Can anyone confirm this or offer any alternate solution?
Welcome to StackOverflow.
You tagged this rcpp, but Rf_allocVector is a function from the C API of R -- whereas the Rcpp API offers you constructors which do in fact set the memory to zero:
> Rcpp::cppFunction("NumericVector goodVec(int n) { return NumericVector(n); }")
> sum(goodVec(1e7))
[1] 0
>
This creates a dynamically allocated vector using R's memory functions. The vector is indistinguishable from R's own, and it has its memory set to zero as we use R_Calloc, which is documented in Writing R Extensions as setting the memory to zero. (We may also use memcpy() explicitly; you can check the sources.)
So in short, you have just confused yourself over what the C API of R and the Rcpp API each offer, and which is easiest to use when. Keep reading the documentation, running and writing examples, and studying existing code. It's all out there!
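For contrast, here is a sketch of what the question describes: returning the result of Rf_allocVector() directly instead of going through the Rcpp constructor. As the question observes, that memory is not initialised, so its contents are unpredictable (freshly mapped pages often happen to contain zeros, but nothing guarantees it):
# Hypothetical counterpart to goodVec(): allocate via R's C API and return as-is
Rcpp::cppFunction("SEXP rawVec(int n) { return Rf_allocVector(REALSXP, n); }")
head(rawVec(1e7))   # whatever happened to be in memory; often zero, not guaranteed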

CRAN package submission: "Error: C stack usage is too close to the limit"

Right up front: this is an issue I encountered when submitting an R package to CRAN, so I
don't have control over the stack size (the issue occurred on one of CRAN's platforms)
can't provide a reproducible example (I don't know the exact configuration on CRAN)
Problem
When trying to submit the cSEM.DGP package to CRAN, the automatic pretest (for Debian x86_64-pc-linux-gnu; not for Windows!) failed with the NOTE: C stack usage 7975520 is too close to the limit.
I know this is caused by a function with three arguments whose body is about 800 lines long. The body consists of additions and multiplications of these arguments. It is the function varzeta6(), which you can find here (from row 647 onwards).
How can I address this?
Things I can't do:
provide a reproducible example (at least I would not know how)
change the stack size
Things I am thinking of:
try to break the function into smaller pieces. But I don't know how best to do that.
somehow precompile(?) the function (to be honest, I am just guessing) so CRAN doesn't complain?
Let me know your ideas!
Details / Background
The reason why varzeta6() (and varzeta4()/varzeta5(), and even more so varzeta7()) are so long and R-inefficient is that they are essentially copy-pasted from Mathematica (after simplifying the Mathematica code as much as possible and adapting it to be valid R code). Hence the code is by no means R-optimized (as @MauritsEvers rightly pointed out).
Why do we need Mathematica? Because what we need is the general form of the model-implied construct correlation matrix of a recursive structural equation model with up to 8 constructs, as a function of the parameters of the model equations. In addition, there are constraints.
To get a feel for the problem, let's take a system of two equations that can be solved recursively:
Y2 = beta1*Y1 + zeta1
Y3 = beta2*Y1 + beta3*Y2 + zeta2
What we are interested in are the covariances E(Y1*Y2), E(Y1*Y3), and E(Y2*Y3) as functions of beta1, beta2 and beta3, under the constraints that
E(Y1) = E(Y2) = E(Y3) = 0,
E(Y1^2) = E(Y2^2) = E(Y3^2) = 1,
E(Yi*zeta_j) = 0 (with i = 1, 2, 3 and j = 1, 2).
For such a simple model, this is rather trivial:
E(Y1*Y2) = E(Y1*(beta1*Y1 + zeta1)) = beta1*E(Y1^2) + E(Y1*zeta1) = beta1
E(Y1*Y3) = E(Y1*(beta2*Y1 + beta3*(beta1*Y1 + zeta1) + zeta2)) = beta2 + beta3*beta1
E(Y2*Y3) = ...
But you can see how quickly this gets messy when you add Y4, Y5, and so on up to Y8.
In general, the model-implied construct correlation matrix can be written as (the expression actually looks more complicated because we also allow for up to 5 exogenous constructs as well; this is why varzeta1() already looks complicated, but ignore that for now):
V(Y) = (I - B)^-1 V(zeta)(I - B)'^-1
where I is the identity matrix, B is a lower-triangular matrix of model parameters (the betas), and V(zeta) is a diagonal matrix. The functions varzeta1(), varzeta2(), ..., varzeta7() compute the main diagonal elements. Since we constrain Var(Yi) to always be 1, the variances of the zetas follow. Take for example the equation Var(Y2) = beta1^2*Var(Y1) + Var(zeta1) --> Var(zeta1) = 1 - beta1^2. This looks simple here, but it becomes extremely complicated when we take the variance of, say, the 6th equation in such a chain of recursive equations, because Var(zeta6) depends on all previous covariances between Y1, ..., Y5, which are themselves dependent on their respective previous covariances.
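For the two-equation example above, this matrix expression can be checked directly in R with ordinary matrix operations (a sketch with arbitrary parameter values; Y1 is treated as the single exogenous construct, so its error term has variance 1):
beta1 <- 0.3; beta2 <- 0.4; beta3 <- 0.2
## B: lower-triangular matrix of path coefficients for (Y1, Y2, Y3)
B <- matrix(c(0,     0,     0,
              beta1, 0,     0,
              beta2, beta3, 0), nrow = 3, byrow = TRUE)
## V(zeta): diagonal, chosen so that every Var(Yi) comes out as 1
Vzeta <- diag(c(1,
                1 - beta1^2,
                1 - beta2^2 - beta3^2 - 2*beta1*beta2*beta3))
IB <- solve(diag(3) - B)       # (I - B)^-1
V  <- IB %*% Vzeta %*% t(IB)   # model-implied correlation matrix
V[1, 2]   # beta1
V[1, 3]   # beta2 + beta3*beta1
diag(V)   # all 1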
OK, I don't know if that makes things any clearer. Here are the main points:
The code for varzeta1(), ..., varzeta7() is copy-pasted from Mathematica and hence not R-optimized.
Mathematica is required because, as far as I know, R cannot handle symbolic calculations.
I could R-optimize "by hand", but that would be extremely tedious.
I think the structure of the varzetaX() functions must be taken as given. The question therefore is: can I somehow use these functions anyway?
One conceivable approach is to try to convince the CRAN maintainers that there's no easy way for you to fix the problem. This is a NOTE, not a WARNING; the CRAN Repository Policy says:
In principle, packages must pass R CMD check without warnings or significant notes to be admitted to the main CRAN package area. If there are warnings or notes you cannot eliminate (for example because you believe them to be spurious) send an explanatory note as part of your covering email, or as a comment on the submission form
So, you could take a chance that your well-reasoned explanation (in the comments field on the submission form) will convince the CRAN maintainers. In the long run it would be best to find a way to simplify the computations, but it might not be necessary to do it before submission to CRAN.
This is a bit too long for a comment, but hopefully it will give you some ideas for optimising the code of the varzeta* functions; or at the very least, some food for thought.
There are a few things that confuse me:
All varzeta* functions have arguments beta, gamma and phi, which seem to be matrices. However, in varzeta1 you don't use beta, yet beta is the first function argument.
I struggle to link the details you give at the bottom of your post with the code for the varzeta* functions. You don't explain where the gamma and phi matrices come from, nor what they denote. Furthermore, seeing that the betas are the model's parameter estimates, I don't understand why beta should be a matrix.
As I mentioned in my earlier comment, I would be very surprised if these expressions couldn't be simplified. R can do a lot of matrix operations quite comfortably; there shouldn't really be a need to pre-calculate individual terms.
For example, you can use crossprod and tcrossprod to calculate cross products, and %*% implements matrix multiplication.
Secondly, a lot of mathematical operations in R are vectorised. I already mentioned that you can simplify
1 - gamma[1,1]^2 - gamma[1,2]^2 - gamma[1,3]^2 - gamma[1,4]^2 - gamma[1,5]^2
as
1 - sum(gamma[1, ]^2)
since the ^ operator is vectorised.
Perhaps more fundamentally, this seems somewhat of an XY problem to me where it might help to take a step back. Not knowing the full details of what you're trying to model (as I said, I can't link the details you give to the cSEM.DGP code), I would start by exploring how to solve the recursive SEM in R. I don't really see the need for Mathematica here. As I said earlier, matrix operations are very standard in R; analytically solving a set of recursive equations is also possible in R. Since you seem to come from the Mathematica realm, it might be good to discuss this with a local R coding expert.
If you must use those scary varzeta* functions (and I really doubt that), an option may be to rewrite them in C++ and then compile them with Rcpp to turn them into R functions. Perhaps that will avoid the C stack usage limit?

R copies for no apparent reason

Some R functions will make R copy the object AFTER the function call, like nrow, while others don't, like sum.
For example the following code:
x = as.double(1:1e8)
system.time(x[1] <- 100)
y = sum(x)
system.time(x[1] <- 200) ## Fast (takes 0s), after calling sum
foo = function(x) {
  return(sum(x))
}
y = foo(x)
system.time(x[1] <- 300) ## Slow (takes 0.35s), after calling foo
Calling foo is NOT slow, because x isn't copied. However, changing x again afterwards is very slow, because x is copied then. My guess is that calling foo leaves a reference to x, so that when I change x afterwards, R makes another copy.
Does anyone know why R does this, even when the function doesn't change x at all? Thanks.
I definitely recommend Hadley's Advanced R book, as it digs into some of the internals that you will likely find interesting and relevant. Most relevant to your question (and as mentioned by @joran and @lmo), the reason for the slow-down is an additional reference that forces copy-on-modify.
An excerpt from the Memory chapter (section on modification) that might be beneficial:
There are two possibilities:
R modifies x in place.
R makes a copy of x to a new location, modifies the copy, and then uses the name x to point to the new location.
It turns out that R can do either depending on the circumstances. In the example above, it will modify in place. But if another variable also points to x, then R will copy it to a new location. To explore what's going on in greater detail, we use two tools from the pryr package. Given the name of a variable, address() will tell us the variable's location in memory and refs() will tell us how many names point to that location.
Also of interest are the sections on R's C interface and Performance. The pryr package also has tools for working with these sorts of internals in an easier fashion.
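As a rough illustration, here is a sketch using base R's tracemem() (available in the standard CRAN builds of R). Note that the exact behaviour depends on your R version: R 4.0 replaced the old NAMED mechanism with true reference counting, so a recent R may avoid this particular copy altogether.
x <- as.double(1:1e6)
tracemem(x)              # print a message whenever x gets duplicated
x[1] <- 100              # no duplication reported: x has a single reference
y <- sum(x)              # sum() is a primitive and leaves no extra reference behind
x[1] <- 200              # still no duplication
foo <- function(x) sum(x)
y <- foo(x)
x[1] <- 300              # under the old NAMED scheme the argument in foo() left a
                         # reference behind, so this modification triggers a copy
untracemem(x)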
One last note from Hadley's book (same Memory section) that might be helpful:
While determining that copies are being made is not hard, preventing such behaviour is. If you find yourself resorting to exotic tricks to avoid copies, it may be time to rewrite your function in C++, as described in Rcpp.

General information or tips about eval(parse( [duplicate]

There are several questions on how to avoid using eval(parse(...))
r-evalparse-is-often-suboptimal
avoiding-the-infamous-evalparse-construct
Which sparks the questions:
Why specifically should eval(parse()) be avoided?
And most importantly, what are the dangers?
Are there any dangers if the code is not used in production? (I'm thinking of the danger of getting back unintended results. Clearly, if you are not careful about what you are parsing, you will have issues. But is that any more dangerous than being sloppy with get()?)
Most of the arguments against eval(parse(...)) arise not because of security concerns, after all, no claims are made about R being a safe interface to expose to the Internet, but rather because such code is generally doing things that can be accomplished using less obscure methods, i.e. methods that are both quicker and more human parse-able. The R language is supposed to be high-level, so the preference of the cognoscenti (and I do not consider myself in that group) is to see code that is both compact and expressive.
So the danger is that eval(parse(..)) is a backdoor method of getting around lack of knowledge and the hope in raising that barrier is that people will improve their use of the R language. The door remains open but the hope is for more expressive use of other features. Carl Witthoft's question earlier today illustrated not knowing that the get function was available, and the question he linked to exposed a lack of understanding of how the [[ function behaved (and how $ was more limited than [[). In both cases an eval(parse(..)) solution could be constructed, but it was clunkier and less clear than the alternative.
The security concerns only really arise if you start calling eval on strings that another user has passed to you. This is a big deal if you are creating an application that runs R in the background, but for data analysis where you are writing code to be run by yourself, then you shouldn't need to worry about the effect of eval on security.
There are some other problems with eval(parse(...)), though.
Firstly, code using eval(parse()) is usually much harder to debug than unparsed code, which is problematic because debugging software is twice as difficult as writing it in the first place.
Here's a function with a mistake in it.
std <- function()
{
  mean(1to10)
}
Silly me, I've forgotten about the colon operator and created my vector wrongly. If I try to source this function, R notices the problem and throws an error, pointing me at my mistake.
Here's the eval-parse version.
ep <- function()
{
  eval(parse(text = "mean(1to10)"))
}
This will source without complaint, because the error is hidden inside a perfectly valid string. It is only later, when we come to run the code, that the error is thrown. So by using eval(parse()), we've lost the source-time error-checking capability.
I also think that this second version of the function is much more difficult to read.
The other problem with eval-parse is that it is much slower than directly executed code. Compare
system.time(for(i in seq_len(1e4)) mean(1:10))
user system elapsed
0.08 0.00 0.07
and
system.time(for(i in seq_len(1e4)) eval(parse(text = "mean(1:10)")))
user system elapsed
1.54 0.14 1.69
Usually there's a better way of 'computing on the language' than working with code as strings; in my experience, eval(parse())-heavy code needs a lot of safeguarding to guarantee sensible output.
The same task can usually be solved by working on R code as a language object directly; Hadley Wickham has a useful guide on meta-programming in R here:
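For instance, a minimal sketch of building and evaluating a call object directly, with no character strings involved:
expr <- quote(mean(1:10))   # a call object, not a string
class(expr)                 # "call"
eval(expr)                  # [1] 5.5
## language objects can be manipulated like lists before evaluation
expr[[1]] <- as.name("sum")
eval(expr)                  # [1] 55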
The defmacro() function in the gtools library is my favourite substitute (no half-assed R pun intended) for the eval(parse()) construct:
require(gtools)
# both action_to_take & predicate will be subbed with code
F <- defmacro(predicate, action_to_take, expr =
  if (predicate) action_to_take)
F(1 != 1, action_to_take = print('arithmetic doesnt work!'))
F(pi > 3, action_to_take = return('good!'))
[1] "good!"
# the raw code for F
print(F)
function (predicate = stop("predicate not supplied"), action_to_take = stop("action_to_take not supplied"))
{
    tmp <- substitute(if (predicate) action_to_take)
    eval(tmp, parent.frame())
}
<environment: 0x05ad5d3c>
The benefit of this method is that you are guaranteed to get back syntactically-legal R code. More on this useful function can be found here:
Hope that helps!
In some programming languages, eval() is a function which evaluates a string as though it were an expression and returns a result; in others, it executes multiple lines of code as though they had been included instead of the line including the eval. The input to eval is not necessarily a string; in languages that support syntactic abstractions (like Lisp), eval's input will consist of abstract syntactic forms.
http://en.wikipedia.org/wiki/Eval
There are all kinds of exploits that one can take advantage of if eval is used improperly.
An attacker could supply a program with the string "session.update(authenticated=True)" as data, which would update the session dictionary to set an authenticated key to be True. To remedy this, all data which will be used with eval must be escaped, or it must be run without access to potentially harmful functions.
http://en.wikipedia.org/wiki/Eval
In other words, the biggest danger of eval() is the potential for code injection into your application. The use of eval() can also cause performance issues in some languages, depending on what it is used for.
Specifically in R, it's probably because you can use get() in place of eval(parse()) and get the same results without having to resort to eval().
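A small sketch of that equivalence:
fun_name <- "mean"
eval(parse(text = paste0(fun_name, "(1:10)")))   # [1] 5.5
get(fun_name)(1:10)                              # [1] 5.5, same result, no parsing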

How to not fall into R's 'lazy evaluation trap'

"R passes promises, not values. The promise is forced when it is first evaluated, not when it is passed.", see this answer by G. Grothendieck. Also see this question referring to Hadley's book.
In simple examples such as
> funs <- lapply(1:10, function(i) function() print(i))
> funs[[1]]()
[1] 10
> funs[[2]]()
[1] 10
it is possible to take such unintuitive behaviour into account.
However, I find myself frequently falling into this trap during daily development. I follow a rather functional programming style, which means that I often have a function A returning a function B, where B in some way depends on the parameters with which A was called. The dependency is not as easy to see as in the example above, since the calculations are complex and there are multiple parameters.
Overlooking such an issue leads to hard-to-debug problems, since all calculations run smoothly - except that the result is incorrect. Only an explicit validation of the results reveals the problem.
What makes it worse is that even when I have noticed such a problem, I am never really sure which variables I need to force and which I don't.
How can I make sure not to fall into this trap? Are there any programming patterns that prevent this or that at least make sure that I notice that there is a problem?
You are creating functions with implicit parameters, which isn't necessarily best practice. In your example, the implicit parameter is i. Another way to rework it would be:
library(functional)
myprint <- function(x) print(x)
funs <- lapply(1:10, function(i) Curry(myprint, i))
funs[[1]]()
# [1] 1
funs[[2]]()
# [1] 2
Here, we explicitly specify the parameters to the function by using Curry. Note we could have curried print directly but didn't here for illustrative purposes.
Curry creates a new version of the function with parameters pre-specified. This makes the parameter specification explicit and avoids the potential issues you are running into because Curry forces evaluations (there is a version that doesn't, but it wouldn't help here).
Another option is to capture the entire environment of the parent function, copy it, and make it the parent env of your new function:
funs2 <- lapply(
  1:10, function(i) {
    fun.res <- function() print(i)
    environment(fun.res) <- list2env(as.list(environment())) # force parent env copy
    fun.res
  }
)
funs2[[1]]()
# [1] 1
funs2[[2]]()
# [1] 2
but I don't recommend this since you will be potentially copying a whole bunch of variables you may not even need. Worse, this gets a lot more complicated if you have nested layers of functions that create functions. The only benefit of this approach is that you can continue your implicit parameter specification, but again, that seems like bad practice to me.
As others pointed out, this might not be the best style of programming in R. But one simple option is to just get into the habit of forcing everything. If you do this, realise that you don't need to actually call force; just evaluating the symbol will do it. To make it less ugly, you could make it a practice to start functions like this:
myfun <- function(x, y, z) {
  x; y; z
  ## code
}
There is some work in progress to improve R's higher-order functions, such as the apply functions and Reduce, in handling situations like these. Whether this makes it into R 3.2.0, to be released in a few weeks, depends on how disruptive the changes turn out to be. It should become clear in a week or so.
R has a function that helps safeguard against lazy evaluation, in situations like closure creation: forceAndCall().
From the online R help documentation:
forceAndCall is intended to help defining higher order functions like apply to behave more reasonably when the result returned by the function applied is a closure that captured its arguments.
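A minimal sketch of using it directly when building closures (note that the apply family has used forceAndCall internally since R 3.2.0, so a plain lapply already avoids the trap shown at the top of this question):
make_printer <- function(i) function() print(i)
# forceAndCall(1, FUN, ...) forces FUN's first argument before FUN runs,
# so each closure captures the value of i at creation time
funs <- lapply(1:3, function(i) forceAndCall(1, make_printer, i))
funs[[1]]()
# [1] 1
funs[[3]]()
# [1] 3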
