unit tests and checks in package function: do we do checks in both? - r

I'm a new to R and package development so bear with me. I am writing test cases to keep package is line with standard practices. But I'm confused if I do the checks in testthat, should I not perform if/else checks in the package function?
my_function<-function(dt_genetic, dt_gene, dt_snpBP){
if((is.data.table(dt_genetic) & is.data.table(dt_gene) & is.data.table(dt_snpBP))== FALSE){
stop("data format unacceptable")
}
## similary more checks on column names and such
} ## function ends
In my test-data_integrity.R
## create sample data.table
test_gene_coord<-data.table(GENE=c("ABC","XYG","alpha"),"START"=c(10,200,320),"END"=c(101,250,350))
test_snp_pos<-data.table(SNP=c("SNP1","SNP2","SNP3"),"BP"=c(101,250,350))
test_snp_gene<-data.table(SNP=c("SNP1","SNP2","SNP3"),"GENE"=c("ABC","BRCA1","gamma"))
## check data type
test_that("data types correct works", {
expect_is(test_data_table,'data.table')
expect_is(test_gene_coord,'data.table')
expect_is(test_snp_pos,'data.table')
expect_is(test_snp_gene,'data.table')
expect_is(test_gene_coord$START, 'numeric')
expect_is(test_gene_coord$END, 'numeric')
expect_is(test_snp_pos$BP, 'numeric')
})
## check column names
test_that("column names works", {
expect_named(test_gene_coord, c("GENE","START","END"))
expect_named(test_snp_pos, c("SNP","BP"))
expect_named(test_snp_gene, c("SNP","GENE"))
})
when I run devtools::test() all tests are passed, but does it mean that I should not test within my function?
Pardon me if this seems naive but this is confusing as this is completely alien to me.
Edited: data.table if check.

(This is an expansion on my comments on the question. My comments are from a quasi-professional programmer; some of what I say here may be good "in general" but not perfectly complete from a theoretical standpoint.)
There are many "types" of tests, but I'll focus on distinguishing between "unit-tests" and "assertions". For me, the main difference is that unit-tests are typically run by the developer(s) only, and assertions are run at run-time.
Assertions
When you mention adding tests to your function, which to me sounds like assertions: a programmatic statement that an object meets specific property assumptions. This is often necessary when the data is provided by the user or from an external source (database), where the size or quality of the data is previously unknown.
There are "formal" packages for assertions, including assertthat, assertr, and assertive; while I have little experience with any of them, there is also sufficient support in base R that these aren't strictly required. The most basic method is
if (!inherits(mtcars, "data.table")) {
stop("'obj' is not 'data.table'")
}
# Error: 'obj' is not 'data.table'
which gives you absolute control at the expense of several lines of code. There's another function which shortens this a little:
stopifnot(inherits(mtcars, "data.table"))
# Error: inherits(mtcars, "data.table") is not TRUE
Multiple conditions can be provided, all must be TRUE to pass. (Unlike many R conditionals such as if, this statement must resolve to exactly TRUE: stopifnot(3) does not pass.) In R < 4.0, the error messages were uncontrolled, but starting in R-4.0 one can now name them:
stopifnot(
"mtcars not data.frame" = inherits(mtcars, "data.frame"),
"mtcars data.table error" = inherits(mtcars, "data.table")
)
# Error: mtcars data.table error
In some programming languages, these assertions are more declarative/deliberate so that compilation can optimize them out of a production executable. In this sense, they are useful during development, but for production it is assumed that some steps that worked before no longer need validation. I believe there is not an automatic way to do this in R (especially since it is generally not "compiled into an executable"), but one could fashion a function in a way to mimic this behavior:
myfunc <- function(x, ..., asserts = getOption("run_my_assertions", FALSE)) {
# this one only runs when the user explicitly says "asserts=TRUE"
if (asserts) stopifnot("'x' not a data.frame" = inherits(x, "data.frame"))
# this assertion runs all the time
stopifnot("'x' not a data.table" = inherits(x, "data.table"))
}
I have not seen that logic or flow often in R packages.
Regardless, my assumption of assertions is that those not optimized out (due to compilation or user arguments) execute every time the function runs. This tends to ensure a "safer" flow, and is a good idea especially for less-experienced developers who do not have the experience ("have not been burned enough") to know how many ways certain calls can go wrong.
Unit Tests
These are a bit different, both in their purpose and runtime effect.
First and foremost, unit-tests are not run every time a function is used. They are typically defined in a completely different file, not within the function at all[^1]. They are deliberate sets of calls to your functions, testing/confirming specific behaviors given certain inputs.
With the testthat package, R scripts (that match certain filename patterns) in the package's ./tests/testthat/ sub-directory will be run on command as unit-tests. (Other unit-test packages exist.) (Unit-tests do not require that they operate on a package; they can be located anywhere, and run on any set of files or directories of files. I'm using a "package" as an example.)
Side note: it is certainly feasible to include some of the testthat tools within your function for runtime validation as well. For instance, one might replace stopifnot(inherits(x, "data.frame")) with expect_is(x, "data.frame"), and it will fail with non-frames, and pass with all three types of frames tested above. I don't know that this is always the best way to go, and I haven't seen its use in packages I use. (Doesn't mean it isn't there. If you see testthat in a package's "Imports:", then it's possible.)
The premise here is not validation of runtime objects. The premise is validation of your function's performance given very specific inputs[^2]. For instance, one might define a unit-test to confirm that your function operates equally well on frames of class "data.frame", "tbl_df", and "data.table". (This is not a throw-away unit-test, btw.)
Consider a meek function that one would presume can work equally well on any data.frame-like object:
func <- function(x, nm) head(x[nm], n = 2)
To test that this accepts various types, one might simply call it on the console with:
func(mtcars, "cyl")
# cyl
# Mazda RX4 6
# Mazda RX4 Wag 6
When a colleague complains that this function isn't working, you might wonder that they're using either the tidyverse (and tibble) or data.table, so you can quickly test on the console:
func(tibble::as_tibble(mtcars), "cyl")
# # A tibble: 2 x 1
# cyl
# <dbl>
# 1 6
# 2 6
func(data.table::as.data.table(mtcars), "cyl")
# Error in `[.data.table`(x, nm) :
# When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
So now you know where the problem lies (if not yet how to fix it). If you test this "as is" with data.table, one might think to try something like this (obviously wrong) fix:
func <- function(x, nm) head(x[,..nm], n = 2)
func(data.table::as.data.table(mtcars), "cyl")
# cyl
# 1: 6
# 2: 6
While this works, unfortunately it now fails for the other two frame-like objects.
The answer to this dilemma is to make tests so that when you make a change to your function, if previously-successful property assumptions now change, you will know immediately. Had all three of those tests been incorporated into a unit-test, one might have done something such as
library(testthat)
test_that("func works with all frame-like objects", {
expect_silent(func(mtcars, "cyl"))
expect_silent(func(tibble::as_tibble(mtcars), "cyl"))
expect_silent(func(data.table::as.data.table(mtcars), "cyl"))
})
# Error: Test failed: 'func works with all frame-like objects'
Given some research, you find one method that you think will satisfy all three frame-like objects:
func <- function(x, nm) head(subset(x, select = nm), n = 2)
And then run your unit-tests again:
test_that("func works with all frame-like objects", {
expect_silent(func(mtcars, "cyl"))
expect_silent(func(tibble::as_tibble(mtcars), "cyl"))
expect_silent(func(data.table::as.data.table(mtcars), "cyl"))
})
(No output ... silence is golden.)
Similar to many things in programming, there are many opinions on how to organize, fashion, or even when to create these unit-tests. Many of these opinions are right for somebody. One strategy that I tend to start with is this:
since I know that my functions can be used on all three frame-like objects, I often preemptively set up a test given one object of each type (you'd be surprised at some of the lurking differences between them);
when I find or receive a bug report, one of the first things I do after confirming the bug is write a test that triggers that bug, given the minimum inputs required to do so; then I fix the bug, and run my unit-tests to ensure that this new test now passes (and no other test now fails)
Experience will dictate types of tests to write preemptively before the bugs even come.
Tests don't always have to be about "no errors", by the way. They can test for a lot of things:
silence (no errors)
expected messages, warnings, or stop errors (whether internally generated or passed from another function)
output class (matrix or numeric), dimensions, attributes
expected values (returning 3 vice 3.14 might be a problem)
Some will say that unit-tests are no fun to write, and abhor efforts on them. While I don't disagree that unit-tests are not fun, I have burned myself countless times when making a simple fix to a function inadvertently broke several other things ... and since I deployed the "simple fix" without applicable unit-tests, I just shifted the bug reports from "this title has "NA" in it" to "the app crashes and everybody is angry" (true story).
For some packages, unit-testing can be done in moments; for others, it may take minutes or hours. Due to complexity in functions, some of my unit-tests deal with "large" data structures, so a single test takes several minutes to reveal its success. Most of my unit-tests are relatively instantaneous with inputs of vectors of length 1 to 3, or frames/matrices with 2-4 rows and/or columns.
This is by far not a complete document on testing. There are books, tutorials, and countless blogs about different techniques. One good reference is Hadley's book on R Packages, Testing chapter: http://r-pkgs.had.co.nz/tests.html. I like that, but it is far from the only one.
[^1] Tangentially, I believe that one power the roxygen2 package affords is the convenience of storing a function's documentation in the same file as the function itself. Its proximity "reminds" me to update the docs when I'm working on code. It would be nice if we could determine a sane way to similarly add formal testthat (or similar) unit-tests to the function file itself. I've seen (and at times used) informal unit-tests by including specific code in the roxygen2 #examples section: when the file is rendered to an .Rd file, any errors in the example code will alert me on the console. I know that this technique is sloppy and hasty, and in general I only suggest it when more formal unit-testing will not be done. It does tend to make help documentation a lot more verbose than it needs to be.
[^2] I said above "given very specific inputs": an alternative is something called "fuzzing", a technique where functions are called with random or invalid input. I believe this is very useful for searching for stack overflow, memory-access, or similar problems that cause a program to crash and/or execute the wrong code. I've not seen this used in R (ymmv).

Related

How to check that a user-defined function works in r?

THis is probably a very silly question, but how can I check if a function written by myself will work or not?
I'm writing a not very simple function involving many other functions and loops and was wondering if there are any ways to check for errors/bugs, or simply just check if the function will work. Do I just create a simple fake data frame and test on it?
As suggested by other users in the comment, I have added the part of the function that I have written. So basically I have a data frame with good and bad data, and bad data are marked with flags. I want to write a function that allows me to produce plots as usual (with the flag points) when user sets flag.option to 1, and remove the flag points from the plot when user sets flag.option to 0.
AIR.plot <- function(mydata, flag.option) {
if (flag.option == 1) {
par(mfrow(2,1))
conc <- tapply(mydata$CO2, format(mydata$date, "%Y-%m-%d %T"), mean)
dates <- seq(mydata$date[1], mydata$date[nrow(mydata(mydata))], length = nrow(conc))
plot(dates, conc,
type = "p",
col = "blue",
xlab = "day",
ylab = "CO2"), error = function(e) plot.new(type = "n")
barplot(mydata$lines, horiz = TRUE, col = c("red", "blue")) # this is just a small bar plot on the bottom that specifies which sample-taking line (red or blue) is providing the samples
} else if (flag.option == 0) {
# I haven't figured out how to write this part yet but essentially I want to remove all
# of the rows with flags on
}
}
Thanks in advance, I'm not an experienced R user yet so please help me.
Before we (meaning, at my workplace) release any code to our production environment we run through a series of testing procedures to make sure our code behaves the way we want it to. It usually involves several people with different perspectives on the code.
Ideally, such verification should start before you write any code. Some questions you should be able to answer are:
What should the code do?
What inputs should it accept? (including type, ranges, etc)
What should the output look like?
How will it handle missing values?
How will it handle NULL values?
How will it handle zero-length values?
If you prepare a list of requirements and write your documentation before you begin writing any code, the probability of success goes up pretty quickly. Naturally, as you begin writing your code, you may find that your requirements need to be adjusted, or the function arguments need to be modified. That's okay, but document those changes when they happen.
While you are writing your function, use a package like assertthat or checkmate to write as many argument checks as you need in your code. Some of the best, most reliable code where I work consists of about 100 lines of argument checks and 3-4 lines of what the code actually is intended to do. It may seem like overkill, but you prevent a lot of problems from bad inputs that you never intended for users to provide.
When you've finished writing your function, you should at this point have a list of requirements and clearly documented expectations of your arguments. This is where you make use of the testthat package.
Write tests that verify all of the requirements you wrote are met.
Write tests that verify you can no put in unintended inputs and get the results you want.
Write tests that verify you get the output you intended on your test data.
Write tests that test any edge cases you can think of.
It can take a long time to write all of these tests, but once it is done, any further development is easier to check since anything that violates your existing requirements should fail the test.
That being said, I'm really bad at following this process in my own work. I have the tendency to write code, then document what I did. But the best code I've written has been where I've planned it out conceptually, wrote my documentation, coded, and then tested against my documentation.
As #antoine-sac pointed out in the links, some things cannot be checked programmatically; for example, if your function terminates.
Looking at it pragmatically, have a look at the packages assertthat and testthat. assertthat will help you insert checks of results "in between", testthat is for writing proper tests. Yes, the usual way of writing tests is creating a small test example including test data.

Why is using assign bad?

This post (Lazy evaluation in R – is assign affected?) covers some common ground but I am not sure it answers my question.
I stopped using assign when I discovered the apply family quite a while back, albeit, purely for reasons of elegance in situations such as this:
names.foo <- letters
values.foo <- LETTERS
for (i in 1:length(names.foo))
assign(names.foo[i], paste("This is: ", values.foo[i]))
which can be replaced by:
foo <- lapply(X=values.foo, FUN=function (k) paste("This is :", k))
names(foo) <- names.foo
This is also the reason this (http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f) R-faq says this should be avoided.
Now, I know that assign is generally frowned upon. But are there other reasons I don't know? I suspect it may mess with the scoping or lazy evaluation but I am not sure? Example code that demonstrates such problems will be great.
Actually those two operations are quite different. The first gives you 26 different objects while the second gives you only one. The second object will be a lot easier to use in analyses. So I guess I would say you have already demonstrated the major downside of assign, namely the necessity of then needing always to use get for corralling or gathering up all the similarly named individual objects that are now "loose" in the global environment. Try imagining how you would serially do anything with those 26 separate objects. A simple lapply(foo, func) will suffice for the second strategy.
That FAQ citation really only says that using assignment and then assigning names is easier, but did not imply it was "bad". I happen to read it as "less functional" since you are not actually returning a value that gets assigned. The effect looks to be a side-effect (and in this case the assign strategy results in 26 separate side-effects). The use of assign seems to be adopted by people that are coming from languages that have global variables as a way of avoiding picking up the "True R Way", i.e. functional programming with data-objects. They really should be learning to use lists rather than littering their workspace with individually-named items.
There is another assignment paradigm that can be used:
foo <- setNames( paste0(letters,1:26), LETTERS)
That creates a named atomic vector rather than a named list, but the access to values in the vector is still done with names given to [.
As the source of fortune(236) I thought I would add a couple examples (also see fortune(174)).
First, a quiz. Consider the following code:
x <- 1
y <- some.function.that.uses.assign(rnorm(100))
After running the above 2 lines of code, what is the value of x?
The assign function is used to commit "Action at a distance" (see http://en.wikipedia.org/wiki/Action_at_a_distance_(computer_programming) or google for it). This is often the source of hard to find bugs.
I think the biggest problem with assign is that it tends to lead people down paths of thinking that take them away from better options. A simple example is the 2 sets of code in the question. The lapply solution is more elegant and should be promoted, but the mere fact that people learn about the assign function leads people to the loop option. Then they decide that they need to do the same operation on each object created in the loop (which would be just another simple lapply or sapply if the elegant solution were used) and resort to an even more complicated loop involving both get and apply along with ugly calls to paste. Then those enamored with assign try to do something like:
curname <- paste('myvector[', i, ']')
assign(curname, i)
And that does not do quite what they expected which leads to either complaining about R (which is as fair as complaining that my next door neighbor's house is too far away because I chose to walk the long way around the block) or even worse, delve into using eval and parse to get their constructed string to "work" (which then leads to fortune(106) and fortune(181)).
I'd like to point out that assign is meant to be used with environments.
From that point of view, the "bad" thing in the example above is using a not quite appropriate data structure (the base environment instead of a list or data.frame, vector, ...).
Side note: also for environments, the $ and $<- operators work, so in many cases the explicit assign and get isn't necessary there, neither.

Strategies for repeating large chunk of analysis

I find myself in the position of having completed a large chunk of analysis and now need to repeat the analysis with slightly different input assumptions.
The analysis, in this case, involves cluster analysis, plotting several graphs, and exporting cluster ids and other variables of interest. The key point is that it is an extensive analysis, and needs to be repeated and compared only twice.
I considered:
Creating a function. This isn't ideal, because then I have to modify my code to know whether I am evaluating in the function or parent environments. This additional effort seems excessive, makes it harder to debug and may introduce side-effects.
Wrap it in a for-loop. Again, not ideal, because then I have to create indexing variables, which can also introduce side-effects.
Creating some pre-amble code, wrapping the analysis in a separate file and source it. This works, but seems very ugly and sub-optimal.
The objective of the analysis is to finish with a set of objects (in a list, or in separate output files) that I can analyse further for differences.
What is a good strategy for dealing with this type of problem?
Making code reusable takes some time, effort and holds a few extra challenges like you mention yourself.
The question whether to invest is probably the key issue in informatics (if not in a lot of other fields): do I write a script to rename 50 files in a similar fashion, or do I go ahead and rename them manually.
The answer, I believe, is highly personal and even then, different case by case. If you are easy on the programming, you may sooner decide to go the reuse route, as the effort for you will be relatively low (and even then, programmers typically like to learn new tricks, so that's a hidden, often counterproductive motivation).
That said, in your particular case: I'd go with the sourcing option: since you plan to reuse the code only 2 times more, a greater effort would probably go wasted (you indicate the analysis to be rather extensive). So what if it's not an elegant solution? Nobody is ever going to see you do it, and everybody will be happy with the swift results.
If it turns out in a year or so that the reuse is higher than expected, you can then still invest. And by that time, you will also have (at least) three cases for which you can compare the results from the rewritten and funky reusable version of your code with your current results.
If/when I do know up front that I'm going to reuse code, I try to keep that in mind while developing it. Either way I hardly ever write code that is not in a function (well, barring the two-liners for SO and other out-of-the-box analyses): I find this makes it easier for me to structure my thoughts.
If at all possible, set parameters that differ between sets/runs/experiments in an external parameter file. Then, you can source the code, call a function, even utilize a package, but the operations are determined by a small set of externally defined parameters.
For instance, JSON works very well for this and the RJSONIO and rjson packages allow you to load the file into a list. Suppose you load it into a list called parametersNN.json. An example is as follows:
{
"Version": "20110701a",
"Initialization":
{
"indices": [1,2,3,4,5,6,7,8,9,10],
"step_size": 0.05
},
"Stopping":
{
"tolerance": 0.01,
"iterations": 100
}
}
Save that as "parameters01.json" and load as:
library(RJSONIO)
Params <- fromJSON("parameters.json")
and you're off and running. (NB: I like to use unique version #s within my parameters files, just so that I can identify the set later, if I'm looking at the "parameters" list within R.) Just call your script and point to the parameters file, e.g.:
Rscript --vanilla MyScript.R parameters01.json
then, within the program, identify the parameters file from the commandArgs() function.
Later, you can break out code into functions and packages, but this is probably the easiest way to make a vanilla script generalizeable in the short term, and it's a good practice for the long-term, as code should be separated from the specification of run/dataset/experiment-dependent parameters.
Edit: to be more precise, I would even specify input and output directories or files (or naming patterns/prefixes) in the JSON. This makes it very clear how one set of parameters led to one particular output set. Everything in between is just code that runs with a given parametrization, but the code shouldn't really change much, should it?
Update:
Three months, and many thousands of runs, wiser than my previous answer, I'd say that the external storage of parameters in JSON is useful for 1-1000 different runs. When the parameters or configurations number in the thousands and up, it's better to switch to using a database for configuration management. Each configuration may originate in a JSON (or XML), but being able to grapple with different parameter layouts requires a larger scale solution, for which a database like SQLite (via RSQLite) is a fine solution.
I realize this answer is overkill for the original question - how to repeat work only a couple of times, with a few parameter changes, but when scaling up to hundreds or thousands of parameter changes in ongoing research, more extensive tools are necessary. :)
I like to work with combination of a little shell script, a pdf cropping program and Sweave in those cases. That gives you back nice reports and encourages you to source. Typically I work with several files, almost like creating a package (at least I think it feels like that :) . I have a separate file for the data juggling and separate files for different types of analysis, such as descriptiveStats.R, regressions.R for example.
btw here's my little shell script,
#!/bin/sh
R CMD Sweave docSweave.Rnw
for file in `ls pdfs`;
do pdfcrop pdfs/"$file" pdfs/"$file"
done
pdflatex docSweave.tex
open docSweave.pdf
The Sweave file typically sources the R files mentioned above when needed. I am not sure whether that's what you looking for, but that's my strategy so far. I at least I believe creating transparent, reproducible reports is what helps to follow at least A strategy.
Your third option is not so bad. I do this in many cases. You can build a bit more structure by putting the results of your pre-ample code in environments and attach the one you want to use for further analysis.
An example:
setup1 <- local({
x <- rnorm(50, mean=2.0)
y <- rnorm(50, mean=1.0)
environment()
# ...
})
setup2 <- local({
x <- rnorm(50, mean=1.8)
y <- rnorm(50, mean=1.5)
environment()
# ...
})
attach(setup1) and run/source your analysis code
plot(x, y)
t.test(x, y, paired = T, var.equal = T)
...
When finished, detach(setup1) and attach the second one.
Now, at least you can easily switch between setups. Helped me a few times.
I tend to push such results into a global list.
I use Common Lisp but then R isn't so different.
Too late for you here, but I use Sweave a lot, and most probably I'd have used a Sweave file from the beginning (e.g. if I know that the final product needs to be some kind of report).
For repeating parts of the analysis a second and third time, there are then two options:
if the results are rather "independent" (i.e. should produce 3 reports, comparison means the reports are inspected side by side), and the changed input comes in the form of new data files, that goes into its own directory together with a copy of the Sweave file, and I create separate reports (similar to source, but feels more natural for Sweave than for plain source).
if I rather need to do the exactly same thing once or twice again inside one Sweave file I'd consider reusing code chunks. This is similar to the ugly for-loop.
The reason is that then of course the results are together for the comparison, which would then be the last part of the report.
If it is clear from the beginning that there will be some parameter sets and a comparison, I write the code in a way that as soon as I'm fine with each part of the analysis it is wrapped into a function (i.e. I'm acutally writing the function in the editor window, but evaluate the lines directly in the workspace while writing the function).
Given that you are in the described situation, I agree with Nick - nothing wrong with source and everything else means much more effort now that you have it already as script.
I can't make a comment on Iterator's answer so I have to post it here. I really like his answer so I made a short script for creating the parameters and exporting them to external JSON files. And I hope someone finds this useful: https://github.com/kiribatu/Kiribatu-R-Toolkit/blob/master/docs/parameter_configuration.md

Scoping and functions in R 2.11.1 : What's going wrong?

This question comes from a range of other questions that all deal with essentially the same problem. For some strange reason, using a function within another function sometimes fails in the sense that variables defined within the local environment of the first function are not found back in the second function.
The classical pattern in pseudo-code :
ff <- function(x){
y <- some_value
some_function(y)
}
ff(x)
Error in eval(expr, envir, enclos) :
object 'y' not found
First I thought it had something to do with S4 methods and the scoping in there, but it also happens with other functions. I've had some interaction with the R development team, but all they did was direct me to the bug report site (which is not the most inviting one, I have to say). I never got any feedback.
As the problem keeps arising, I wonder if there is a logic explanation for it. Is it a common mistake made in all these cases, and if so, which one? Or is it really a bug?
Some of those questions :
Using functions and environments
R (statistical) scoping error using transformBy(), part of the doBy package.
How to use acast (reshape2) within a function in R?
Why can't I pass a dataset to a function?
Values not being copied to the next local environment
PS : I know the R-devel list, in case you wondered...
R has both lexical and dynamic scope. Lexical scope works automatically, but dynamic scope must be implemented manually, and requires careful book-keeping. Only functions used interactively for data analysis need dynamic scope, so most authors (like me!) don't learn how to do it correctly.
See also: the standard non-standard evaluation rules.
There are undoubtedly bugs in R, but a lot of the issues that people have been having are quite often errors in the implementation of some_function, not R itself. R has scoping rules ( see http://cran.r-project.org/doc/manuals/R-intro.html#Scope) which when combined with lazy evaluation of function arguments and the ability to eval arguments in other scopes are extremely powerful but which also often lead to subtle errors.
As Dirk mentioned in his answer, there isn't actually a problem with the code that you posted. In the links you posted in the question, there seems to be a common theme: some_function contains code that messes about with environments in some way. This messing is either explicit, using new.env and with or implicitly, using a data argument, that probably has a line like
y <- eval(substitute(y), data)
The moral of the story is twofold. Firstly, try to avoid explicitly manipulating environments, unless you are really sure that you know what you are doing. And secondly, if a function has a data argument then put all the variables that you need the function to use inside that data frame.
Well there is no problem in what you posted:
/tmp$ cat joris.r
#!/usr/bin/r -t
some_function <- function(y) y^2
ff <- function(x){
y <- 4
some_function(y) # so we expect 16
}
print(ff(3)) # 3 is ignored
$ ./joris.r
[1] 16
/tmp$
Could you restate and postan actual bug or misfeature?

Efficiency of operations on R data structures

I'm wondering if there's any documentation about the efficiency of operations in R, specifically those related to data manipulation.
For example:
I imagine it's efficient to add columns to a data frame, because I'm guessing you're just adding an element to a linked list.
I imagine adding rows is slower because vectors are held in arrays at the C level and you have to allocate a new array of length n+1 and copy all the elements over.
The developers probably don't want to tie themselves to a particular implementation, but it would be nice to have something more solid than guesses to go on.
Also, I know the main R performance hint is to use vectored operations whenever possible as opposed to loops.
what about the various flavors of apply?
are those just hidden loops?
what about matrices vs. data frames?
Data IO was one of the features i looked into before i committed to learning R. For better or worse, here are my observations and solutions/palliatives on these issues:
1. That R doesn't handle big data (>2 GB?) To me this is a misnomer. By default, the common data input functions load your data into RAM. Not to be glib, but to me, this is a feature not a bug--anytime my data will fit in my available RAM, that's where i want it. Likewise, one of SQLite's most popular features is the in-memory option--the user has the easy option of loading the entire dB into RAM. If your data won't fit in memory, then R makes it astonishingly easy to persist it, via connections to the common RDBMS systems (RODBC, RSQLite, RMySQL, etc.), via no-frills options like the filehash package, and via systems that current technology/practices (for instance, i can recommend ff). In other words, the R developers have chosen a sensible (and probably optimal) default, from which it is very easy to opt out.
2. The performance of read.table (read.csv, read.delim, et al.), the most common means for getting data into R, can be improved 5x (and often much more in my experience) just by opting out of a few of read.table's default arguments--the ones having the greatest effect on performance are mentioned in the R's Help (?read.table). Briefly, the R Developers tell us that if you provide values for the parameters 'colClasses', 'nrows', 'sep', and 'comment.char' (in particular, pass in '' if you know your file begins with headers or data on line 1), you'll see a significant performance gain. I've found that to be true.
Here are the snippets i use for those parameters:
To get the number of rows in your data file (supply this snippet as an argument to the parameter, 'nrows', in your call to read.table):
as.numeric((gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep=""), intern=T))))
To get the classes for each column:
function(fname){sapply(read.table(fname, header=T, nrows=5), class)}
Note: You can't pass this snippet in as an argument, you have to call it first, then pass in the value returned--in other words, call the function, bind the returned value to a variable, and then pass in the variable as the value to to the parameter 'colClasses' in your call to read.table:
3. Using Scan. With only a little more hassle, you can do better than that (optimizing 'read.table') by using 'scan' instead of 'read.table' ('read.table' is actually just a wrapper around 'scan'). Once again, this is very easy to do. I use 'scan' to input each column individually then build my data.frame inside R, i.e., df = data.frame(cbind(col1, col2,....)).
4. Use R's Containers for persistence in place of ordinary file formats (e.g., 'txt', 'csv'). R's native data file '.RData' is a binary format that a little smaller than a compressed ('.gz') txt data file. You create them using save(, ). You load it back into the R namespace with load(). The difference in load times compared with 'read.table' is dramatic. For instance, w/ a 25 MB file (uncompressed size)
system.time(read.table("tdata01.txt.gz", sep=","))
=> user system elapsed
6.173 0.245 **6.450**
system.time(load("tdata01.RData"))
=> user system elapsed
0.912 0.006 **0.912**
5. Paying attention to data types can often give you a performance boost and reduce your memory footprint. This point is probably more useful in getting data out of R. The key point to keep in mind here is that by default, numbers in R expressions are interpreted as double-precision floating point, e.g., > typeof(5) returns "double." Compare the object size of a reasonable-sized array of each and you can see the significance (use object.size()). So coerce to integer when you can.
Finally, the 'apply' family of functions (among others) are not "hidden loops" or loop wrappers. They are loops implemented in C--big difference performance-wise. [edit: AWB has correctly pointed out that while 'sapply', 'tapply', and 'mapply' are implemented in C, 'apply' is simply a wrapper function.
These things do pop up on the lists, in particular on r-devel. One fairly well-established nugget is that e.g. matrix operations tend to be faster than data.frame operations. Then there are add-on packages that do well -- Matt's data.table package is pretty fast, and Jeff has gotten xts indexing to be quick.
But it "all depends" -- so you are usually best adviced to profile on your particular code. R has plenty of profiling support, so you should use it. My Intro to HPC with R tutorials have a number of profiling examples.
I will try to come back and provide more detail. If you have any question about the efficiency of one operation over another, you would do best to profile your own code (as Dirk suggests). The system.time() function is the easiest way to do this although there are many more advanced utilities (e.g. Rprof, as documented here).
A quick response for the second part of your question:
What about the various flavors of apply? Are those just hidden loops?
For the most part yes, the apply functions are just loops and can be slower than for statements. Their chief benefit is clearer code. The main exception that I have found is lapply which can be faster because it is coded in C directly.
And what about matrices vs. data frames?
Matrices are more efficient than data frames because they require less memory for storage. This is because data frames require additional attribute data. From R Introduction:
A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes

Resources