I came across the following code and am wondering if there is a reason for it, given that the num variable comes from a numeric column (integer or double).
num_val <- eval(parse(text = num %>% as.character()))
Could it be a pattern that is (was) useful for something in older versions of R?
No, that hasn't really changed for a very long time.
It allows strings like "1+1" to be evaluated and stored as the result (i.e. 2).
If num was already numeric, it would have the effect of rounding it to the value that as.character() produces. Typically that value is not affected by the options("digits") setting and is much more precise than what normally gets printed, but it is not quite the full precision used internally.
If num held an integer, that would convert it to a double.
If num held a vector of values, that would evaluate all of them, then only keep the last one.
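A quick sketch of those three behaviours (the values here are arbitrary, just for illustration):
x <- eval(parse(text = as.character(pi)))
identical(x, pi)                   # FALSE: only the 15 significant digits that
                                   # as.character() keeps survive the round trip
is.integer(eval(parse(text = as.character(5L))))   # FALSE: the integer comes back as a double
eval(parse(text = as.character(c(1, 4, 6))))       # each element parses as a separate
                                                   # expression; only the last value, 6, is returned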
I'd have to see the context, but I can't think of any useful reason to use code like that if num was known to be numeric. Only the first reason (allowing users to enter expressions instead of numbers) would make sense.
A very good general piece of advice is to never use the eval(parse( combination in your own code. From the fortunes package:
> library(fortunes)
> fortune(106)
If the answer is parse() you should usually rethink the question.
-- Thomas Lumley
R-help (February 2005)
There is almost always a better alternative in R; the above code looks like a more complicated, harder-to-understand version of:
num_val <- as.numeric(as.character(num))
If the values in num are numbers, or a factor with numeric labels, then this works (there is a slightly more efficient version in the FAQ). The eval-parse method would work if num contained something like "c(1,4,6)" or "1:10", but in those cases it would probably be better to figure out why num ends up containing strings like that in the first place and to fix the workflow upstream.
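For reference, the slightly more efficient FAQ version for a factor with numeric labels converts each level once and then indexes into the result (a small sketch):
f <- factor(c("2.5", "7", "2.5"))    # a factor with numeric labels
as.numeric(as.character(f))          # works, but converts every element
as.numeric(levels(f))[f]             # the FAQ version: converts each level once, then indexes
# [1] 2.5 7.0 2.5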
Using eval and parse can be dangerous (if num contains some attack code, then this could cause major problems) and very hard to debug (google for "debug action at a distance").
That said, there is code out there where the programmers used parse and eval as a quick and dirty (but not best) way to do something; that code was then copied and modified, and people took the quick-and-dirty approach to be a reasonable one, so things like this are out there. But your best approach (for writing good code and learning to be a better R programmer) is to find better options and never use eval(parse(.
My own contribution to fortunes on this topic:
> fortune(181)
Personally I have never regretted trying not to underestimate my own future
stupidity.
-- Greg Snow (explaining why eval(parse(...)) is often suboptimal, answering a
question triggered by the infamous fortune(106))
R-help (January 2007)
Related
There are several questions on how to avoid using eval(parse(...))
r-evalparse-is-often-suboptimal
avoiding-the-infamous-evalparse-construct
Which sparks the questions:
Why specifically should eval(parse()) be avoided?
And most importantly, what are the dangers?
Are there any dangers if the code is not used in production? (I'm thinking of the danger of getting back unintended results. Clearly if you are not careful about what you are parsing, you will have issues. But is that any more dangerous than being sloppy with get()?)
Most of the arguments against eval(parse(...)) arise not because of security concerns (after all, no claims are made about R being a safe interface to expose to the Internet), but rather because such code is generally doing things that can be accomplished using less obscure methods, i.e. methods that are both quicker and more human-parseable. The R language is supposed to be high-level, so the preference of the cognoscenti (and I do not consider myself in that group) is to see code that is both compact and expressive.
So the danger is that eval(parse(..)) is a backdoor method of getting around a lack of knowledge, and the hope in raising that barrier is that people will improve their use of the R language. The door remains open, but the hope is for more expressive use of other features. Carl Witthoft's question earlier today illustrated not knowing that the get function was available, and the question he linked to exposed a lack of understanding of how [[ behaves (and how $ is more limited than [[). In both cases an eval(parse(..)) solution could be constructed, but it was clunkier and less clear than the alternative.
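To make that concrete, here is a sketch of the kind of substitution meant (the data frame and column name are made up for illustration):
df <- data.frame(speed = 1:5, dist = 6:10)
wanted <- "dist"                              # the column name arrives as a string
eval(parse(text = paste0("df$", wanted)))     # the string-building version
df[[wanted]]                                  # clearer: [[ accepts a character index,
                                              # which df$wanted cannot do
# [1]  6  7  8  9 10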
The security concerns only really arise if you start calling eval on strings that another user has passed to you. This is a big deal if you are creating an application that runs R in the background, but for data analysis where you are writing code to be run by yourself, then you shouldn't need to worry about the effect of eval on security.
There are some other problems with eval(parse( though.
Firstly, code using eval-parse is usually much harder to debug than non-parsed code, which is problematic because debugging software is twice as difficult as writing it in the first place.
Here's a function with a mistake in it.
std <- function()
{
mean(1to10)
}
Silly me, I've forgotten about the colon operator and created my vector wrongly. If I try and source this function, then R notices the problem and throws an error, pointing me at my mistake.
Here's the eval-parse version.
ep <- function()
{
eval(parse(text = "mean(1to10)"))
}
This will source, because the error is inside a valid string. It is only later, when we come to run the code that the error is thrown. So by using eval-parse, we've lost the source-time error checking capability.
I also think that this second version of the function is much more difficult to read.
The other problem with eval-parse is that it is much slower than directly executed code. Compare
system.time(for(i in seq_len(1e4)) mean(1:10))
user system elapsed
0.08 0.00 0.07
and
system.time(for(i in seq_len(1e4)) eval(parse(text = "mean(1:10)")))
user system elapsed
1.54 0.14 1.69
Usually there's a better way of 'computing on the language' than working with code-strings; eval(parse())-heavy code needs a lot of safeguarding to guarantee sensible output, in my experience.
The same task can usually be solved by working on R code as a language object directly; Hadley Wickham has a useful guide on meta-programming in R here:
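As a small sketch of that language-object approach (plain base R; the variable and function names here are just for illustration), you can build the call you want with bquote() and evaluate it, rather than pasting strings together:
x <- c(1, 4, 6)
fun_name <- quote(mean)            # a symbol, not the string "mean"
expr <- bquote(.(fun_name)(x))     # constructs the call mean(x) as a language object
eval(expr)
# [1] 3.666667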
The defmacro() function in the gtools library is my favourite substitute (no half-assed R pun intended) for the eval(parse()) construct
require(gtools)
# both action_to_take & predicate will be subbed with code
F <- defmacro(predicate, action_to_take, expr =
  if(predicate) action_to_take)
F(1 != 1, action_to_take = print('arithmetic doesnt work!'))
F(pi > 3, action_to_take = return('good!'))
[1] 'good!'
# the raw code for F
print(F)
function (predicate = stop("predicate not supplied"), action_to_take = stop("action_to_take not supplied"))
{
tmp <- substitute(if (predicate) action_to_take)
eval(tmp, parent.frame())
}
<environment: 0x05ad5d3c>
The benefit of this method is that you are guaranteed to get back syntactically-legal R code. More on this useful function can be found here:
Hope that helps!
In some programming languages, eval() is a function which evaluates
a string as though it were an expression and returns a result; in
others, it executes multiple lines of code as though they had been
included instead of the line including the eval. The input to eval is
not necessarily a string; in languages that support syntactic
abstractions (like Lisp), eval's input will consist of abstract
syntactic forms.
http://en.wikipedia.org/wiki/Eval
There are all kinds of exploits that one can take advantage of if eval is used improperly.
An attacker could supply a program with the string
"session.update(authenticated=True)" as data, which would update the
session dictionary to set an authenticated key to be True. To remedy
this, all data which will be used with eval must be escaped, or it
must be run without access to potentially harmful functions.
http://en.wikipedia.org/wiki/Eval
In other words, the biggest danger of eval() is the potential for code injection into your application. The use of eval() can also cause performance issues in some languages, depending on what it is being used for.
Specifically in R, it's probably because you can use get() in place of eval(parse()) and your results will be the same without having to resort to eval().
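A minimal sketch of that substitution (the variable name is made up for illustration):
speed_limit <- 70
varname <- "speed_limit"
eval(parse(text = varname))   # the string-parsing version
# [1] 70
get(varname)                  # looks the object up by name, no parsing involved
# [1] 70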
"R passes promises, not values. The promise is forced when it is first evaluated, not when it is passed.", see this answer by G. Grothendieck. Also see this question referring to Hadley's book.
In simple examples such as
> funs <- lapply(1:10, function(i) function() print(i))
> funs[[1]]()
[1] 10
> funs[[2]]()
[1] 10
it is possible to take such unintuitive behaviour into account.
However, I find myself frequently falling into this trap during daily development. I follow a rather functional programming style, which means that I often have a function A returning a function B, where B in some way depends on the parameters with which A was called. The dependency is not as easy to see as in the above example, since the calculations are complex and there are multiple parameters.
Overlooking such an issue leads to problems that are difficult to debug, since all calculations run smoothly - except that the result is incorrect. Only an explicit validation of the results reveals the problem.
On top of that, even when I have noticed such a problem, I am never really sure which variables I need to force and which I don't.
How can I make sure not to fall into this trap? Are there any programming patterns that prevent this or that at least make sure that I notice that there is a problem?
You are creating functions with implicit parameters, which isn't necessarily best practice. In your example, the implicit parameter is i. Another way to rework it would be:
library(functional)
myprint <- function(x) print(x)
funs <- lapply(1:10, function(i) Curry(myprint, i))
funs[[1]]()
# [1] 1
funs[[2]]()
# [1] 2
Here, we explicitly specify the parameters to the function by using Curry. Note we could have curried print directly but didn't here for illustrative purposes.
Curry creates a new version of the function with parameters pre-specified. This makes the parameter specification explicit and avoids the potential issues you are running into because Curry forces evaluations (there is a version that doesn't, but it wouldn't help here).
Another option is to capture the entire environment of the parent function, copy it, and make it the parent env of your new function:
funs2 <- lapply(
  1:10, function(i) {
    fun.res <- function() print(i)
    environment(fun.res) <- list2env(as.list(environment()))  # force parent env copy
    fun.res
  }
)
funs2[[1]]()
# [1] 1
funs2[[2]]()
# [1] 2
but I don't recommend this since you will be potentially copying a whole bunch of variables you may not even need. Worse, this gets a lot more complicated if you have nested layers of functions that create functions. The only benefit of this approach is that you can continue your implicit parameter specification, but again, that seems like bad practice to me.
As others pointed out, this might not be the best style of programming in R. But one simple option is to just get into the habit of forcing everything. If you do this, realize you don't need to actually call force; just evaluating the symbol will do it. To make it less ugly, you could make it a practice to start functions like this:
myfun <- function(x, y, z) {
  x; y; z
  ## code
}
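Applied to the lapply example from the question, the same idea looks like this (a sketch; force(i) just evaluates the symbol, so a bare i would do the same):
funs <- lapply(1:10, function(i) {
  force(i)              # evaluate i now, before the closure is returned
  function() print(i)
})
funs[[1]]()
# [1] 1
funs[[2]]()
# [1] 2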
There is some work in progress to improve R's higher order functions like the apply functions, Reduce, and such in handling situations like these. Whether this makes it into R 3.2.0, to be released in a few weeks, depends on how disruptive the changes turn out to be. It should become clear in a week or so.
R has a function that helps safeguard against lazy evaluation, in situations like closure creation: forceAndCall().
From the online R help documentation:
forceAndCall is intended to help define higher order functions like apply to behave more reasonably when the result returned by the function applied is a closure that captured its arguments.
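A minimal sketch of using it when writing your own higher-order function (make_printers is a made-up name for illustration):
make_printers <- function(FUN, xs) {
  # forceAndCall(1, FUN, x) forces FUN's first argument before calling it,
  # so a closure returned by FUN captures the value rather than an unforced promise
  lapply(xs, function(x) forceAndCall(1, FUN, x))
}
printers <- make_printers(function(x) function() print(x), 1:10)
printers[[1]]()
# [1] 1
printers[[2]]()
# [1] 2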
I'm maintaining code for a mathematical algorithm that came from a book, with references in the comments. Is it better to have variable names that are descriptive of what the variables represent, or should the variables match what is in the book?
For a simple example, I may see this code, which reflects the variable in the book.
A_c = v*v/r
I could rewrite it as
centripetal_acceleration = velocity*velocity/radius
The advantage of the latter is that anyone looking at the code could understand it. However, the advantage of the former is that it is easier to compare the code with what is in the book. I may do this in order to double check the implementation of the algorithms, or I may want to add additional calculations.
Perhaps I am over-thinking this, and should simply use comments to describe what the variables are. I tend to favor self-documenting code however (use descriptive variable names instead of adding comments to describe what they are), but maybe this is a case where comments would be very helpful.
I know this question can be subjective, but I wondered if anyone had any guiding principles in order to make a decision, or had links to guidelines for coding math algorithms.
I would prefer to use the more descriptive variable names. You can't guarantee everyone that is going to look at the code has access to "the book". You may leave and take your copy, it may go out of print, etc. In my opinion it's better to be descriptive.
We use a lot of mathematical reference books in our work, and we reference them in comments, but we rarely use the same mathematically abbreviated variable names.
A common practice is to summarise all your variables, indexes and descriptions in a comment header before starting the code proper, e.g.
// A_c = Centripetal Acceleration
// v = Velocity
// r = Radius
A_c = (v^2)/r
I write a lot of mathematical software. IF I can insert in the comments a very specific reference to a book or a paper or (best) web site that explains the algorithm and defines the variable names, then I will use the SHORT names like a = v * v / r because it makes the formulas easier to read and write and verify visually.
IF not, then I will write very verbose code with lots of comments and long descriptive variable names. Essentially, my code becomes a paper that describes the algorithm (anyone remember Knuth's "Literate Programming" efforts, years ago? Though the technology for it never took off, I emulate the spirit of that effort). I use a LOT of ascii art in my comments, with box-and-arrow diagrams and other descriptive graphics. I use Jave.de -- the Java Ascii Vmumble Editor.
I will sometimes write my math with short, angry little variable names, easier to read and write for ME because I know the math, then use REFACTOR to replace the names with longer, more descriptive ones at the end, but only for code that is much more informal.
I think it depends almost entirely upon the audience for whom you're writing -- and don't ever mistake the compiler for the audience either. If your code is likely to be maintained by more or less "general purpose" programmers who may not/probably won't know much about physics so they won't recognize what v and r mean, then it's probably better to expand them to be recognizable for non-physicists. If they're going to be physicists (or, for another example, game programmers) for whom the textbook abbreviations are clear and obvious, then use the abbreviations. If you don't know/can't guess which, it's probably safer to err on the side of the names being longer and more descriptive.
I vote for the "book" version. 'v' and 'r' etc. are pretty well understood as acronyms for velocity and radius, and are more compact.
How far would you take it?
Most (non-greek :-)) keyboards don't provide easy access to Δ, but it's valid as part of an identifier in some languages (e.g. C#):
int Δv;
int Δx;
Anyone coming afterwards and maintaining the code may curse you every day. Similarly for a lot of other symbols used in maths. So if you're not going to use those actual symbols (and I'd encourage you not to), I'd argue you ought to translate the rest, where it doesn't make for code that's too verbose.
In addition, what if you need to combine algorithms, and those algorithms have conflicting usage of variables?
A compromise could be to code and debug as contained in the book, and then perform a global search and replace for all of your variables towards the end of your development, so that it is easier to read. If you do this I would change the names of the variables slightly so that it is easier to change them later.
e.g. A_c# = v#*v#/r#
One thing I want to do all the time in my R code is to test whether certain conditions hold for a vector, such as whether it contains any or all values equal to some specified value. The Rish way to do this is to create a boolean vector and use any or all, for example:
any(is.na(my_big_vector))
all(my_big_vector == my_big_vector[[1]])
...
It seems really inefficient to me to allocate a big vector and fill it with values, just to throw it away (especially since the any() or all() call could be short-circuited after testing only a couple of the values). Is there a better way to do this, or should I just give up on my desire to write code that is both efficient and succinct when working in R?
"Cheap, fast, reliable: pick any two" is a dry way of saying that you sometimes need to order your priorities when building or designing systems.
It is rather similar here: the cost of the concise expression is the fact that memory gets allocated behind the scenes. If that really is a problem, then you can always write a (compiled?) routine that runs (quickly) along the vector and uses only a pair of values at a time.
You can trade off memory usage versus performance versus expressiveness, but it is difficult to hit all three at the same time.
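As a rough sketch of what such a routine looks like (written in plain R here just to show the short-circuit idea; in practice you would want it compiled, e.g. via Rcpp, for it to actually be fast):
any_na_shortcircuit <- function(x) {
  for (v in x) {
    if (is.na(v)) return(TRUE)   # stop at the first NA; no logical vector is built
  }
  FALSE
}
any_na_shortcircuit(c(1, 2, NA, rep(0, 1e6)))
# [1] TRUE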
which(is.na(my_big_vector))
which(my_big_vector == 5)
which(my_big_vector < 3)
And if you want to count them...
length(which(is.na(my_big_vector)))
I think it is not a good idea -- R is a very high-level language, so what you should do is follow standards. This way R developers know what to optimize. You should also remember that while R is a functional and lazy language, it is even possible that a statement like
any(is.na(a))
can be recognized and executed as something like
.Internal(is_any_na,a)
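(For what it's worth, base R now does ship a primitive along these lines: anyNA(x) is documented as a faster equivalent of any(is.na(x)) and avoids building the intermediate logical vector, so for the NA case the succinct spelling and the efficient one coincide.)
anyNA(my_big_vector)   # equivalent to any(is.na(my_big_vector)), without the intermediate vector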