How to check that a user-defined function works in R?

This is probably a very silly question, but how can I check whether a function I have written will work or not?
I'm writing a not very simple function involving many other functions and loops and was wondering if there are any ways to check for errors/bugs, or simply just check if the function will work. Do I just create a simple fake data frame and test on it?
As suggested by other users in the comments, I have added the part of the function that I have written. Basically, I have a data frame with good and bad data, and the bad data are marked with flags. I want to write a function that produces the plots as usual (including the flagged points) when the user sets flag.option to 1, and removes the flagged points from the plot when the user sets flag.option to 0.
AIR.plot <- function(mydata, flag.option) {
  if (flag.option == 1) {
    par(mfrow = c(2, 1))
    conc <- tapply(mydata$CO2, format(mydata$date, "%Y-%m-%d %T"), mean)
    dates <- seq(mydata$date[1], mydata$date[nrow(mydata)], length.out = length(conc))
    tryCatch(
      plot(dates, conc,
           type = "p",
           col = "blue",
           xlab = "day",
           ylab = "CO2"),
      error = function(e) plot.new()  # fall back to an empty plot if plotting fails
    )
    barplot(mydata$lines, horiz = TRUE, col = c("red", "blue")) # a small bar plot on the bottom that shows which sample-taking line (red or blue) is providing the samples
  } else if (flag.option == 0) {
    # I haven't figured out how to write this part yet, but essentially I want
    # to remove all of the rows with flags on before plotting
  }
}
Thanks in advance, I'm not an experienced R user yet so please help me.

Before we (meaning, at my workplace) release any code to our production environment we run through a series of testing procedures to make sure our code behaves the way we want it to. It usually involves several people with different perspectives on the code.
Ideally, such verification should start before you write any code. Some questions you should be able to answer are:
What should the code do?
What inputs should it accept? (including type, ranges, etc)
What should the output look like?
How will it handle missing values?
How will it handle NULL values?
How will it handle zero-length values?
If you prepare a list of requirements and write your documentation before you begin writing any code, the probability of success goes up pretty quickly. Naturally, as you begin writing your code, you may find that your requirements need to be adjusted, or the function arguments need to be modified. That's okay, but document those changes when they happen.
While you are writing your function, use a package like assertthat or checkmate to write as many argument checks as you need in your code. Some of the best, most reliable code where I work consists of about 100 lines of argument checks and 3-4 lines of what the code actually is intended to do. It may seem like overkill, but you prevent a lot of problems from bad inputs that you never intended for users to provide.
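For illustration, here is a minimal sketch of such argument checks using assertthat, applied to a function shaped like the AIR.plot() from the question (the exact checks, and the assumption that mydata carries date and CO2 columns, are mine):
library(assertthat)

AIR.plot <- function(mydata, flag.option) {
  # Fail fast, with informative messages, before any real work happens
  assert_that(is.data.frame(mydata))
  assert_that(nrow(mydata) > 0)
  assert_that(has_name(mydata, "date"), has_name(mydata, "CO2"))
  assert_that(flag.option %in% c(0, 1))
  # ... actual plotting code ...
}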
When you've finished writing your function, you should at this point have a list of requirements and clearly documented expectations for your arguments. This is where you make use of the testthat package (a minimal sketch follows the list below).
Write tests that verify all of the requirements you wrote are met.
Write tests that verify you cannot put in unintended inputs and still get the results you want.
Write tests that verify you get the output you intended on your test data.
Write tests that test any edge cases you can think of.
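A minimal sketch of what such requirement tests can look like; the function my_mean() and its documented behaviour are hypothetical, purely for illustration:
library(testthat)

# Hypothetical function under test: returns the mean of a numeric vector,
# drops NAs, and is required to reject non-numeric input with a clear error
# (the named stopifnot form needs R >= 4.0).
my_mean <- function(x) {
  stopifnot("'x' must be numeric" = is.numeric(x))
  mean(x, na.rm = TRUE)
}

test_that("my_mean meets its requirements", {
  expect_equal(my_mean(c(1, 2, 3)), 2)            # documented output on test data
  expect_equal(my_mean(c(1, NA, 3)), 2)           # requirement: NAs are dropped
  expect_error(my_mean("a"), "must be numeric")   # unintended input fails loudly
  expect_equal(my_mean(numeric(0)), NaN)          # edge case: zero-length input
})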
It can take a long time to write all of these tests, but once it is done, any further development is easier to check since anything that violates your existing requirements should fail the test.
That being said, I'm really bad at following this process in my own work. I have the tendency to write code, then document what I did. But the best code I've written has been where I've planned it out conceptually, wrote my documentation, coded, and then tested against my documentation.

As @antoine-sac pointed out in the links, some things cannot be checked programmatically; for example, whether your function terminates.
Looking at it pragmatically, have a look at the packages assertthat and testthat. assertthat will help you insert checks of results "in between"; testthat is for writing proper tests. And yes, the usual way of writing tests is to create a small test example, including test data.

Related

unit tests and checks in package function: do we do checks in both?

I'm new to R and package development, so bear with me. I am writing test cases to keep the package in line with standard practices. But I'm confused: if I do the checks in testthat, should I not perform the if/else checks in the package function?
my_function <- function(dt_genetic, dt_gene, dt_snpBP) {
  if (!(is.data.table(dt_genetic) && is.data.table(dt_gene) && is.data.table(dt_snpBP))) {
    stop("data format unacceptable")
  }
  ## similarly, more checks on column names and such
} ## function ends
In my test-data_integrity.R
## create sample data.tables
library(data.table)
test_gene_coord <- data.table(GENE = c("ABC", "XYG", "alpha"), START = c(10, 200, 320), END = c(101, 250, 350))
test_snp_pos <- data.table(SNP = c("SNP1", "SNP2", "SNP3"), BP = c(101, 250, 350))
test_snp_gene <- data.table(SNP = c("SNP1", "SNP2", "SNP3"), GENE = c("ABC", "BRCA1", "gamma"))
## check data type
test_that("data types correct works", {
expect_is(test_data_table,'data.table')
expect_is(test_gene_coord,'data.table')
expect_is(test_snp_pos,'data.table')
expect_is(test_snp_gene,'data.table')
expect_is(test_gene_coord$START, 'numeric')
expect_is(test_gene_coord$END, 'numeric')
expect_is(test_snp_pos$BP, 'numeric')
})
## check column names
test_that("column names works", {
expect_named(test_gene_coord, c("GENE","START","END"))
expect_named(test_snp_pos, c("SNP","BP"))
expect_named(test_snp_gene, c("SNP","GENE"))
})
When I run devtools::test() all tests pass, but does that mean I should not also test within my function?
Pardon me if this seems naive but this is confusing as this is completely alien to me.
Edited: added the data.table if check.
(This is an expansion on my comments on the question. My comments are from a quasi-professional programmer; some of what I say here may be good "in general" but not perfectly complete from a theoretical standpoint.)
There are many "types" of tests, but I'll focus on distinguishing between "unit-tests" and "assertions". For me, the main difference is that unit-tests are typically run by the developer(s) only, and assertions are run at run-time.
Assertions
When you mention adding tests to your function, that sounds to me like assertions: programmatic statements that an object meets specific property assumptions. This is often necessary when the data is provided by the user or comes from an external source (a database), where the size or quality of the data is previously unknown.
There are "formal" packages for assertions, including assertthat, assertr, and assertive; while I have little experience with any of them, there is also sufficient support in base R that these aren't strictly required. The most basic method is
if (!inherits(mtcars, "data.table")) {
stop("'obj' is not 'data.table'")
}
# Error: 'obj' is not 'data.table'
which gives you absolute control at the expense of several lines of code. There's another function which shortens this a little:
stopifnot(inherits(mtcars, "data.table"))
# Error: inherits(mtcars, "data.table") is not TRUE
Multiple conditions can be provided; all must be TRUE to pass. (Unlike many R conditionals such as if, this statement must resolve to exactly TRUE: stopifnot(3) does not pass.) In R < 4.0 the error messages were uncontrolled, but starting in R 4.0 one can now name them:
stopifnot(
"mtcars not data.frame" = inherits(mtcars, "data.frame"),
"mtcars data.table error" = inherits(mtcars, "data.table")
)
# Error: mtcars data.table error
In some programming languages, these assertions are more declarative/deliberate so that compilation can optimize them out of a production executable. In this sense, they are useful during development, but for production it is assumed that some steps that worked before no longer need validation. I believe there is not an automatic way to do this in R (especially since it is generally not "compiled into an executable"), but one could fashion a function in a way to mimic this behavior:
myfunc <- function(x, ..., asserts = getOption("run_my_assertions", FALSE)) {
# this one only runs when the user explicitly says "asserts=TRUE"
if (asserts) stopifnot("'x' not a data.frame" = inherits(x, "data.frame"))
# this assertion runs all the time
stopifnot("'x' not a data.table" = inherits(x, "data.table"))
}
I have not seen that logic or flow often in R packages.
Regardless, my assumption of assertions is that those not optimized out (due to compilation or user arguments) execute every time the function runs. This tends to ensure a "safer" flow, and is a good idea especially for less-experienced developers who do not have the experience ("have not been burned enough") to know how many ways certain calls can go wrong.
Unit Tests
These are a bit different, both in their purpose and runtime effect.
First and foremost, unit-tests are not run every time a function is used. They are typically defined in a completely different file, not within the function at all[^1]. They are deliberate sets of calls to your functions, testing/confirming specific behaviors given certain inputs.
With the testthat package, R scripts (that match certain filename patterns) in the package's ./tests/testthat/ sub-directory will be run on command as unit-tests. (Other unit-test packages exist.) (Unit-tests do not require that they operate on a package; they can be located anywhere, and run on any set of files or directories of files. I'm using a "package" as an example.)
Side note: it is certainly feasible to include some of the testthat tools within your function for runtime validation as well. For instance, one might replace stopifnot(inherits(x, "data.frame")) with expect_is(x, "data.frame"), and it will fail with non-frames, and pass with all three types of frames tested above. I don't know that this is always the best way to go, and I haven't seen its use in packages I use. (Doesn't mean it isn't there. If you see testthat in a package's "Imports:", then it's possible.)
The premise here is not validation of runtime objects. The premise is validation of your function's performance given very specific inputs[^2]. For instance, one might define a unit-test to confirm that your function operates equally well on frames of class "data.frame", "tbl_df", and "data.table". (This is not a throw-away unit-test, btw.)
Consider a meek function that one would presume can work equally well on any data.frame-like object:
func <- function(x, nm) head(x[nm], n = 2)
To test that this accepts various types, one might simply call it on the console with:
func(mtcars, "cyl")
# cyl
# Mazda RX4 6
# Mazda RX4 Wag 6
When a colleague complains that this function isn't working, you might wonder that they're using either the tidyverse (and tibble) or data.table, so you can quickly test on the console:
func(tibble::as_tibble(mtcars), "cyl")
# # A tibble: 2 x 1
# cyl
# <dbl>
# 1 6
# 2 6
func(data.table::as.data.table(mtcars), "cyl")
# Error in `[.data.table`(x, nm) :
# When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
So now you know where the problem lies (if not yet how to fix it). If you test this "as is" with data.table, one might think to try something like this (obviously wrong) fix:
func <- function(x, nm) head(x[,..nm], n = 2)
func(data.table::as.data.table(mtcars), "cyl")
# cyl
# 1: 6
# 2: 6
While this works, unfortunately it now fails for the other two frame-like objects.
The answer to this dilemma is to make tests so that when you make a change to your function, if previously-successful property assumptions now change, you will know immediately. Had all three of those tests been incorporated into a unit-test, one might have done something such as
library(testthat)
test_that("func works with all frame-like objects", {
expect_silent(func(mtcars, "cyl"))
expect_silent(func(tibble::as_tibble(mtcars), "cyl"))
expect_silent(func(data.table::as.data.table(mtcars), "cyl"))
})
# Error: Test failed: 'func works with all frame-like objects'
Given some research, you find one method that you think will satisfy all three frame-like objects:
func <- function(x, nm) head(subset(x, select = nm), n = 2)
And then run your unit-tests again:
test_that("func works with all frame-like objects", {
expect_silent(func(mtcars, "cyl"))
expect_silent(func(tibble::as_tibble(mtcars), "cyl"))
expect_silent(func(data.table::as.data.table(mtcars), "cyl"))
})
(No output ... silence is golden.)
Similar to many things in programming, there are many opinions on how to organize, fashion, or even when to create these unit-tests. Many of these opinions are right for somebody. One strategy that I tend to start with is this:
since I know that my functions can be used on all three frame-like objects, I often preemptively set up a test given one object of each type (you'd be surprised at some of the lurking differences between them);
when I find or receive a bug report, one of the first things I do after confirming the bug is write a test that triggers that bug, given the minimum inputs required to do so; then I fix the bug, and run my unit-tests to ensure that this new test now passes (and no other test now fails)
Experience will dictate types of tests to write preemptively before the bugs even come.
Tests don't always have to be about "no errors", by the way. They can check for a lot of things (a short sketch follows this list):
silence (no errors)
expected messages, warnings, or stop errors (whether internally generated or passed from another function)
output class (matrix or numeric), dimensions, attributes
expected values (returning 3 vice 3.14 might be a problem)
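A short, hedged sketch of those kinds of checks, built around a hypothetical summarise_scores() function invented for the example:
library(testthat)

# Hypothetical function: returns a one-row data.frame of summary statistics
# and warns when NAs are dropped.
summarise_scores <- function(x) {
  if (anyNA(x)) warning("NAs removed before summarising")
  data.frame(mean = mean(x, na.rm = TRUE), n = sum(!is.na(x)))
}

test_that("summarise_scores returns the expected shape and values", {
  out <- summarise_scores(c(1, 2, 3))
  expect_silent(summarise_scores(c(1, 2, 3)))                  # silence (no errors)
  expect_s3_class(out, "data.frame")                           # output class
  expect_equal(dim(out), c(1L, 2L))                            # dimensions
  expect_equal(out$mean, 2)                                    # expected value
  expect_warning(summarise_scores(c(1, NA)), "NAs removed")    # expected warning
})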
Some will say that unit-tests are no fun to write, and abhor efforts on them. While I don't disagree that unit-tests are not fun, I have burned myself countless times when making a simple fix to a function inadvertently broke several other things ... and since I deployed the "simple fix" without applicable unit-tests, I just shifted the bug reports from "this title has "NA" in it" to "the app crashes and everybody is angry" (true story).
For some packages, unit-testing can be done in moments; for others, it may take minutes or hours. Due to complexity in functions, some of my unit-tests deal with "large" data structures, so a single test takes several minutes to reveal its success. Most of my unit-tests are relatively instantaneous with inputs of vectors of length 1 to 3, or frames/matrices with 2-4 rows and/or columns.
This is by far not a complete document on testing. There are books, tutorials, and countless blogs about different techniques. One good reference is Hadley's book on R Packages, Testing chapter: http://r-pkgs.had.co.nz/tests.html. I like that, but it is far from the only one.
[^1] Tangentially, I believe that one power the roxygen2 package affords is the convenience of storing a function's documentation in the same file as the function itself. Its proximity "reminds" me to update the docs when I'm working on code. It would be nice if we could determine a sane way to similarly add formal testthat (or similar) unit-tests to the function file itself. I've seen (and at times used) informal unit-tests by including specific code in the roxygen2 #examples section: when the file is rendered to an .Rd file, any errors in the example code will alert me on the console. I know that this technique is sloppy and hasty, and in general I only suggest it when more formal unit-testing will not be done. It does tend to make help documentation a lot more verbose than it needs to be.
[^2] I said above "given very specific inputs": an alternative is something called "fuzzing", a technique where functions are called with random or invalid input. I believe this is very useful for searching for stack overflow, memory-access, or similar problems that cause a program to crash and/or execute the wrong code. I've not seen this used in R (ymmv).

Model namespaces in code completion with R - or how to organize R code

This is more of a general code structuring question.
At the moment I try to write my code into "namespaces". So for example, I would have:
Mine.FancyPlot.Plot(...)
Mine.FancyPlot.Impl.PlotCanvas(...)
Mine.FancyPlot.Impl.PlotLegend(...)
Mine.BasicPlot.Plot(...)
Mine.BasicPlot.Impl.PlotCanvas(...)
Mine.BasicPlot.Impl.PlotLegend(...)
Mine.BasicPlot.Impl.PlotLines(...)
The idea is that I am trying to hide away "private" functions in an "Impl" (for implementation) namespace. So outside of Mine_FancyPlot.R I wouldn't call Mine.FancyPlot.Impl functions.
This approach works reasonably well, except code completion isn't as nice as it could be.
To begin with, when I type Mine.BasicPlot. and hit TAB, I get all functions, including the Impl functions, and because "I" comes before "P" they even appear ahead of the "public" user functions.
So I started changing the structure to
MyPub.FancyPlot.Plot(...)
MyPriv.FancyPlot.PlotCanvas(...)
MyPriv.FancyPlot.PlotLegend(...)
MyPub.BasicPlot.Plot(...)
MyPriv.Mine.BasicPlot.PlotCanvas(...)
MyPriv.Mine.BasicPlot.PlotLegend(...)
MyPriv.Mine.BasicPlot.PlotLines(...)
This works better in that "private" functions are no longer suggested. However, I still have the issue that if I type MyPub. and hit TAB, I can't actually see all the different "namespaces" (as I would in Java, C++, ...), but rather a long list of functions, all starting with the first "namespace".
Ideally, I'd like code completion in R to cut off all predictions at the next dot and de-duplicate them, so that when I type MyPub. and hit TAB, I only get a list of "sub-namespaces" and functions directly in MyPub.
Is this possible? Can the code prediction be altered to reflect this behaviour? Or is there a better way to achieve what I am aiming for?
You should consider putting your functions in a package to organise them. Functions that are not exported will only be accessible by doing 'package:::functionNotExported' and will not be listed when just doing 'functionNotExpo[tab]'
See for instance debugging a function in R that was not exported by a package
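As a rough sketch of how that looks with roxygen2 (the package and function names here are placeholders): only the user-facing function gets an @export tag, so the helpers never show up in completion after mypackage::, yet remain reachable as mypackage:::.plot_canvas.
#' Draw a fancy plot (user-facing, exported)
#' @export
fancy_plot <- function(...) {
  .plot_canvas(...)
  .plot_legend(...)
}

# Internal helpers: no @export tag, so they are not attached and are not
# suggested by completion outside the package
.plot_canvas <- function(...) invisible(NULL)
.plot_legend <- function(...) invisible(NULL)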

What software package can you suggest for a programmer who rarely works with statistics?

Being a programmer I occasionally find the need to analyze large amounts of data such as performance logs or memory usage data, and I am always frustrated by how much time it takes me to do something that I expect to be easier.
As an example to put the question in context, let me quickly show you an example from a CSV file I received today (heavily filtered for brevity):
date,time,PS Eden Space used,PS Old Gen Used, PS Perm Gen Used
2011-06-28,00:00:03,45004472,184177208,94048296
2011-06-28,00:00:18,45292232,184177208,94048296
I have about 100,000 data points like this with different variables that I want to plot in a scatter plot in order to look for correlations. Usually the data needs to be processed in some way for presentation purposes (such as converting nanoseconds to milliseconds and rounding fractional values), some columns may need to be added or inverted, or combined (like the date/time columns).
The usual recommendation for this kind of work is R and I have recently made a serious effort to use it, but after a few days of work my experience has been that most tasks that I expect to be simple seem to require many steps and have special cases; solutions are often non-generic (for example, adding a data set to an existing plot). It just seems to be one of those languages that people love because of all the powerful libraries that have accumulated over the years rather than the quality and usefulness of the core language.
Don't get me wrong, I understand the value of R to people who are using it, it's just that given how rarely I spend time on this kind of thing I think that I will never become an expert on it, and to a non-expert every single task just becomes too cumbersome.
Microsoft Excel is great in terms of usability but it just isn't powerful enough to handle large data sets. Also, both R and Excel tend to freeze completely (!) with no way out other than waiting or killing the process if you accidentally make the wrong kind of plot over too much data.
So, stack overflow, can you recommend something that is better suited for me? I'd hate to have to give up and develop my own tool, I have enough projects already. I'd love something interactive that could use hardware acceleration for the plot and/or culling to avoid spending too much time on rendering.
@flodin It would have been useful for you to provide an example of the code you use to read such a file into R. I regularly work with data sets of the size you mention and do not have the problems you describe. One thing that might be biting you if you don't use R often is that if you don't tell R what the column types are, it has to do some snooping on the file first, and that all takes time. Look at the argument colClasses in ?read.table.
For your example file, I would do:
dat <- read.csv("foo.csv", colClasses = c(rep("character",2), rep("integer", 3)))
then post process the date and time variables into an R date-time object class such as POSIXct, with something like:
dat <- transform(dat, dateTime = as.POSIXct(paste(date, time)))
As an example, let's read in your example data set, replicate it 50,000 times and write it out, then time different ways of reading it in, with foo containing your data:
> foo <- read.csv("log.csv")
> foo
date time PS.Eden.Space.used PS.Old.Gen.Used
1 2011-06-28 00:00:03 45004472 184177208
2 2011-06-28 00:00:18 45292232 184177208
PS.Perm.Gen.Used
1 94048296
2 94048296
Replicate that, 50000 times:
out <- data.frame(matrix(nrow = nrow(foo) * 50000, ncol = ncol(foo)))
out[, 1] <- rep(foo[,1], times = 50000)
out[, 2] <- rep(foo[,2], times = 50000)
out[, 3] <- rep(foo[,3], times = 50000)
out[, 4] <- rep(foo[,4], times = 50000)
out[, 5] <- rep(foo[,5], times = 50000)
names(out) <- names(foo)
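As an aside, a more compact way to build the same replicated frame (just a sketch that should be equivalent to the column-by-column approach above) is to index the rows directly:
out <- foo[rep(seq_len(nrow(foo)), times = 50000), ]
rownames(out) <- NULL  # drop the ".1", ".2", ... suffixes added by row indexing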
Write it out
write.csv(out, file = "bigLog.csv", row.names = FALSE)
Time loading the naive way and the proper way:
system.time(in1 <- read.csv("bigLog.csv"))
system.time(in2 <- read.csv("bigLog.csv",
colClasses = c(rep("character",2),
rep("integer", 3))))
Which is very quick on my modest laptop:
> system.time(in1 <- read.csv("bigLog.csv"))
user system elapsed
0.355 0.008 0.366
> system.time(in2 <- read.csv("bigLog.csv",
colClasses = c(rep("character",2),
rep("integer", 3))))
user system elapsed
0.282 0.003 0.287
For both ways of reading in.
As for plotting, the graphics can be a bit slow, but depending on your OS this can be sped up a bit by altering the device you plot on - on Linux, for example, don't use the default X11() device, which uses Cairo; instead try the old X window without anti-aliasing. Also, what are you hoping to see with a data set as large as 100,000 observations on a graphics device with not many pixels? Perhaps try to rethink your strategy for data analysis --- no stats software will be able to save you from doing something ill-advised.
It sounds as if you are developing code/analysis as you go along, on the full data set. It would be far more sensible to just work with a small subset of the data when developing new code or new ways of looking at your data, say with a random sample of 1000 rows, and work with that object instead of the whole data object. That way you guard against accidentally doing something that is slow:
working <- out[sample(nrow(out), 1000), ]
for example. Then use working instead of out. Alternatively, whilst testing and writing a script, set argument nrows to say 1000 in the call to load the data into R (see ?read.csv). That way whilst testing you only read in a subset of the data, but one simple change will allow you to run your script against the full data set.
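For example, a minimal sketch using the file from above:
# While developing, read only the first 1000 rows; drop nrows (or set it to -1)
# when you are ready to run against the full file.
dat <- read.csv("bigLog.csv", nrows = 1000,
                colClasses = c(rep("character", 2), rep("integer", 3)))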
For data sets of the size you are talking about, I see no problem whatsoever in using R. Your point about not becoming expert enough to use R will more than likely apply to other scripting languages that might be suggested, such as Python. There is a barrier to entry, but that is to be expected if you want the power of a language such as Python or R. If you write scripts that are well commented (instead of just plugging away at the command line), and focus on a few key data import/manipulation idioms, a bit of plotting and some simple analysis, it shouldn't take long to master that small subset of the language.
R is a great tool, but I never had to resort to it. Instead I find Python to be more than adequate for my needs when I need to pull data out of huge logs. Python really comes with "batteries included", with built-in support for working with CSV files.
The simplest example of reading a CSV file:
import csv
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
To use another separator, e.g. tab and extract n-th column, use
with open('spam.csv', newline='') as f:
    spamReader = csv.reader(f, delimiter='\t')
    for row in spamReader:
        print(row[n])
To operate on columns, use the built-in list data type; it's extremely versatile!
To create beautiful plots I use matplotlib
The python tutorial is a great way to get started! If you get stuck, there is always stackoverflow ;-)
There seem to be several questions mixed together:
Can you draw plots quicker and more easily?
Can you do things in R with less learning effort?
Are there other tools which require less learning effort than R?
I'll answer these in turn.
There are three plotting systems in R, namely base, lattice and ggplot2 graphics. Base graphics will render quickest, but making them look pretty can involve pathological coding. ggplot2 is the opposite, and lattice is somewhere in between.
Reading in CSV data, cleaning it and drawing a scatterplot sounds like a pretty straightforward task, and the tools are definitely there in R for solving such problems. Try asking a question here about specific bits of code that feel clunky, and we'll see if we can fix it for you. If your datasets all look similar, then you can probably reuse most of your code over and over. You could also give the ggplot2 web app a try.
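To give a feel for how short it can be, here is a minimal sketch of that whole task on the sample file from the question (column names as produced by read.csv above; the bytes-to-MB conversion is just an assumption about the units):
gclog <- read.csv("log.csv", colClasses = c(rep("character", 2), rep("integer", 3)))
gclog$dateTime <- as.POSIXct(paste(gclog$date, gclog$time))  # combine the two columns
gclog$oldGenMB <- gclog$PS.Old.Gen.Used / 2^20               # bytes -> MB
plot(gclog$dateTime, gclog$oldGenMB, xlab = "time", ylab = "PS Old Gen used (MB)")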
The two obvious alternative languages for data processing are MATLAB (and its derivatives: Octave, Scilab, AcslX) and Python. Either of these will be suitable for your needs, and MATLAB in particular has a pretty shallow learning curve. Finally, you could pick a graph-specific tool like gnuplot or Prism.
SAS can handle larger data sets than R or Excel; however, many (if not most) people, myself included, find it a lot harder to learn. Depending on exactly what you need to do, it might be worthwhile to load the CSV into an RDBMS and do some of the computations (e.g. correlations, rounding) there, then export only what you need to R to generate graphics.
ETA: There's also SPSS, and Revolution; the former might not be able to handle the size of data that you've got, and the latter is, from what I've heard, a distributed version of R (that, unlike R, is not free).

Strategies for repeating large chunk of analysis

I find myself in the position of having completed a large chunk of analysis and now need to repeat the analysis with slightly different input assumptions.
The analysis, in this case, involves cluster analysis, plotting several graphs, and exporting cluster ids and other variables of interest. The key point is that it is an extensive analysis, and needs to be repeated and compared only twice.
I considered:
Creating a function. This isn't ideal, because then I have to modify my code to know whether I am evaluating in the function or parent environments. This additional effort seems excessive, makes it harder to debug and may introduce side-effects.
Wrap it in a for-loop. Again, not ideal, because then I have to create indexing variables, which can also introduce side-effects.
Creating some preamble code, wrapping the analysis in a separate file and sourcing it. This works, but seems very ugly and sub-optimal.
The objective of the analysis is to finish with a set of objects (in a list, or in separate output files) that I can analyse further for differences.
What is a good strategy for dealing with this type of problem?
Making code reusable takes some time and effort, and brings a few extra challenges, as you mention yourself.
The question of whether to invest is probably the key issue in informatics (if not in a lot of other fields): do I write a script to rename 50 files in a similar fashion, or do I go ahead and rename them manually?
The answer, I believe, is highly personal, and even then it differs case by case. If programming comes easily to you, you may be quicker to choose the reuse route, as the effort for you will be relatively low (and even then, programmers typically like to learn new tricks, so that's a hidden, often counterproductive motivation).
That said, in your particular case I'd go with the sourcing option: since you plan to reuse the code only two more times, a greater effort would probably go to waste (you indicate the analysis is rather extensive). So what if it's not an elegant solution? Nobody is ever going to see you do it, and everybody will be happy with the swift results.
If it turns out in a year or so that the reuse is higher than expected, you can then still invest. And by that time, you will also have (at least) three cases for which you can compare the results from the rewritten and funky reusable version of your code with your current results.
If/when I do know up front that I'm going to reuse code, I try to keep that in mind while developing it. Either way I hardly ever write code that is not in a function (well, barring the two-liners for SO and other out-of-the-box analyses): I find this makes it easier for me to structure my thoughts.
If at all possible, set parameters that differ between sets/runs/experiments in an external parameter file. Then, you can source the code, call a function, even utilize a package, but the operations are determined by a small set of externally defined parameters.
For instance, JSON works very well for this, and the RJSONIO and rjson packages allow you to load the file into a list. Suppose your parameter file is called parametersNN.json. An example is as follows:
{
    "Version": "20110701a",
    "Initialization":
    {
        "indices": [1,2,3,4,5,6,7,8,9,10],
        "step_size": 0.05
    },
    "Stopping":
    {
        "tolerance": 0.01,
        "iterations": 100
    }
}
Save that as "parameters01.json" and load as:
library(RJSONIO)
Params <- fromJSON("parameters01.json")
and you're off and running. (NB: I like to use unique version #s within my parameters files, just so that I can identify the set later, if I'm looking at the "parameters" list within R.) Just call your script and point to the parameters file, e.g.:
Rscript --vanilla MyScript.R parameters01.json
then, within the program, identify the parameters file from the commandArgs() function.
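A small sketch of that pattern at the top of MyScript.R, assuming the parameters file is passed as the first trailing argument:
library(RJSONIO)

args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) stop("Usage: Rscript MyScript.R <parameters.json>")
Params <- fromJSON(args[1])

# The rest of the script refers only to Params, e.g.:
step_size <- Params$Initialization$step_size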
Later, you can break out code into functions and packages, but this is probably the easiest way to make a vanilla script generalizable in the short term, and it's a good practice for the long term, as code should be separated from the specification of run/dataset/experiment-dependent parameters.
Edit: to be more precise, I would even specify input and output directories or files (or naming patterns/prefixes) in the JSON. This makes it very clear how one set of parameters led to one particular output set. Everything in between is just code that runs with a given parametrization, but the code shouldn't really change much, should it?
Update:
Three months, and many thousands of runs, wiser than my previous answer, I'd say that the external storage of parameters in JSON is useful for 1-1000 different runs. When the parameters or configurations number in the thousands and up, it's better to switch to using a database for configuration management. Each configuration may originate in a JSON (or XML), but being able to grapple with different parameter layouts requires a larger scale solution, for which a database like SQLite (via RSQLite) is a fine solution.
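A small sketch of that idea with RSQLite (the table layout and column names are illustrative assumptions, not a recommendation):
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "configurations.sqlite")

# One row per configuration; in ongoing work you would append a new row per run
configs <- data.frame(version   = c("20110701a", "20110702a"),
                      step_size = c(0.05, 0.10),
                      tolerance = c(0.01, 0.01))
dbWriteTable(con, "configs", configs, overwrite = TRUE)

# Retrieve the parameter set for a particular run
cfg <- dbGetQuery(con, "SELECT * FROM configs WHERE version = '20110701a'")
dbDisconnect(con)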
I realize this answer is overkill for the original question - how to repeat work only a couple of times, with a few parameter changes, but when scaling up to hundreds or thousands of parameter changes in ongoing research, more extensive tools are necessary. :)
I like to work with a combination of a little shell script, a PDF-cropping program and Sweave in those cases. That gives you back nice reports and encourages you to source. Typically I work with several files, almost like creating a package (at least I think it feels like that :) ). I have a separate file for the data juggling and separate files for different types of analysis, such as descriptiveStats.R or regressions.R.
btw here's my little shell script,
#!/bin/sh
R CMD Sweave docSweave.Rnw
for file in `ls pdfs`;
do pdfcrop pdfs/"$file" pdfs/"$file"
done
pdflatex docSweave.tex
open docSweave.pdf
The Sweave file typically sources the R files mentioned above when needed. I am not sure whether that's what you're looking for, but that's my strategy so far. At least I believe creating transparent, reproducible reports helps you follow at least a strategy.
Your third option is not so bad. I do this in many cases. You can build a bit more structure by putting the results of your preamble code in environments and attaching the one you want to use for further analysis.
An example:
setup1 <- local({
  x <- rnorm(50, mean = 2.0)
  y <- rnorm(50, mean = 1.0)
  # ... more setup code here ...
  environment()  # must be the last expression: it returns this environment
})
setup2 <- local({
  x <- rnorm(50, mean = 1.8)
  y <- rnorm(50, mean = 1.5)
  # ...
  environment()
})
attach(setup1) and run/source your analysis code
plot(x, y)
t.test(x, y, paired = T, var.equal = T)
...
When finished, detach(setup1) and attach the second one.
Now, at least you can easily switch between setups. Helped me a few times.
I tend to push such results into a global list.
I use Common Lisp but then R isn't so different.
Too late for you here, but I use Sweave a lot, and most probably I'd have used a Sweave file from the beginning (e.g. if I know that the final product needs to be some kind of report).
For repeating parts of the analysis a second and third time, there are then two options:
if the results are rather "independent" (i.e. the analysis should produce three reports, and comparison means the reports are inspected side by side), and the changed input comes in the form of new data files, then each input goes into its own directory together with a copy of the Sweave file, and I create separate reports (similar to source, but it feels more natural for Sweave than for plain source).
if I rather need to do exactly the same thing once or twice more inside one Sweave file, I'd consider reusing code chunks. This is similar to the ugly for-loop.
The reason is that then of course the results are together for the comparison, which would then be the last part of the report.
If it is clear from the beginning that there will be some parameter sets and a comparison, I write the code in a way that, as soon as I'm happy with each part of the analysis, it is wrapped into a function (i.e. I'm actually writing the function in the editor window, but evaluating the lines directly in the workspace while writing the function).
Given that you are in the described situation, I agree with Nick - nothing wrong with source and everything else means much more effort now that you have it already as script.
I can't make a comment on Iterator's answer so I have to post it here. I really like his answer so I made a short script for creating the parameters and exporting them to external JSON files. And I hope someone finds this useful: https://github.com/kiribatu/Kiribatu-R-Toolkit/blob/master/docs/parameter_configuration.md

Undo command in R

I can't find anything to the effect of an undo command in R (neither in An Introduction to R nor in R in a Nutshell). I am particularly interested in undoing/deleting when dealing with interactive graphs.
What approaches do you suggest?
You should consider a different approach which leads to reproducible work:
Pick an editor you like and which has R support
Write your code in 'snippets', i.e. short files for functions, and then use the facilities of the editor / R integration to send the code to the R interpreter
If you make a mistake, re-edit your snippet and run it again
You will always have a log of what you did
All this works tremendously well in ESS which is why many experienced R users like this environment. But editors are a subjective and personal choice; other people like Eclipse with StatET better. There are other solutions for Mac OS X and Windows too, and all this has been discussed countless times before here on SO and on other places like the R lists.
In general I do adopt Dirk's strategy. You should aim for your code to be a completely reproducible record of how you have transformed your raw data into output.
However, if you have complex code it can take a long time to re-run it all. I've had code that takes over 30 minutes to process the data (i.e., import, transform, merge, etc.).
In these cases, a single data-destroying line of code would require me to wait 30 minutes to restore my workspace.
By data destroying code I mean things like:
x <- merge(x, y)
df$x <- df$x^2
e.g., merges, replacing an existing variable with a transformation, removing rows or columns, and so on. In these cases it's easy, especially when first learning R, to make a mistake.
To avoid having to wait this 30 minutes, I adopt several strategies:
If I'm about to do something where there's a risk of destroying my active objects, I'll first copy the result into a temporary object. I'll then check that it worked with the temporary object and then rerun replacing it with the proper object.
E.g., first run
temp <- merge(x, y)
then check that it worked:
str(temp); head(temp); tail(temp)
and if everything looks good, run it for real:
x <- merge(x, y)
As is common in psychological research, I often have large data frames with hundreds of variables and different subsets of cases. For a given analysis (e.g., a table, a figure, some results text), I'll often extract just the subset of cases and variables that I need into a separate object for the analysis and work with that object when preparing and finalising my analysis code. That way, I'm less likely to accidentally damage my main data frame. This assumes that the results of the analysis do not need to be fed back into the main data frame.
If I have finished performing a large number of complex data transformations, I may save a copy of the core workspace objects, e.g. save(x, y, z, file = 'backup.Rdata'). That way, if I make a mistake, I only have to reload these objects.
df$x <- NULL is a handy way of removing a variable in a data frame that you did not want to create
However, in the end I still run all the code from scratch to check that the result is reproducible.
