I could equally have titled this question, "Is it good enough for CRAN?"
I have a collection of functions that I've built up for specific tasks. Some of these are convenience functions:
# Returns odds/evens from a vector
odds=function(vec) {
stopifnot(class(vec)=="integer")
ret = vec[fpart(vec/2)!=0]
ret
}
evens=function(vec) {
stopifnot(class(vec)=="integer")
ret = vec[fpart(vec/2)==0]
ret
}
Some are minor additions that have proven useful in answering common SO question:
# Shift a vector over by n spots
# wrap adds the entry at the beginning to the end
# pad does nothing unless wrap is false, in which case it specifies whether to pad with NAs
shift <- function(vec,n=1,wrap=TRUE,pad=FALSE) {
if(length(vec)<abs(n)) {
#stop("Length of vector must be greater than the magnitude of n \n")
}
if(n==0) {
return(vec)
} else if(length(vec)==n) {
# return empty
length(vec) <- 0
return(vec)
} else if(n>0) {
returnvec <- vec[seq(n+1,length(vec) )]
if(wrap) {
returnvec <- c(returnvec,vec[seq(n)])
} else if(pad) {
returnvec <- c(returnvec,rep(NA,n))
}
} else if(n<0) {
returnvec <- vec[seq(1,length(vec)-abs(n))]
if(wrap) {
returnvec <- c( vec[seq(length(vec)-abs(n)+1,length(vec))], returnvec )
} else if(pad) {
returnvec <- c( rep(NA,abs(n)), returnvec )
}
}
return(returnvec)
}
The most important are extensions to existing classes that can't be found anywhere else (e.g. a CDF panel function for lattice plots, various xtable and LaTeX output functions, classes for handling and converting between geospatial object types and performing various GIS-like operations such as overlays).
I would like to make these available somewhere on the internet in R-ized form (e.g. posting them on a blog as plain text functions is not what I'm looking for), so that maintenance is easier and so that I and others can access them from any computer that I go to. The logical thing to do is to make a package out of them and post them to CRAN--and indeed I already have them packaged up. But is this collection of functions suitable for a CRAN package?
I have two main concerns:
The functions don't seem to have any coherent overlay. It's just a
collection of functions that do lots of different things.
My code isn't always the prettiest. I've tried to clean it up as I
learned better coding practices, but producing R Core-worthy beautiful
code is not in the cards.
The CRAN webpage is surprisingly bereft of guidelines on posting. Should I post to CRAN, given that some people will find it useful but that it will in some sense forever lock R into having some pretty basic function names taken up? Or is there another place I can use an install.packages-like command to install from? Note I'd rather avoid posting the package to a webpage and having people have to memorize the URL to install the package (not least for version control issues).
I would use http://r-forge.r-project.org/. From the top of the page:
R-Forge offers a central platform for the development of R packages,
R-related software and further projects. It is based on FusionForge
offering easy access to the best in SVN, daily built and checked
packages, mailing lists, bug tracking, message boards/forums, site
hosting, permanent file archival, full backups, and total web-based
administration.
Most packages should be collections of related functions with an obvious purpose, so a useful thing to do would be to try and group what you have together, and see if you can classify them. Several smaller packages are better than one huge incoherent package.
That said, there are some packages that are collections of miscellaneous utility functions, most notably Hmisc and gregmisc, so it is okay to do that sort of thing. If you just have a few functions like that, it might be worth contacting the author of some of the misc packages and seeing if they'll let you include your code in their package.
As for writing pretty code, the most important thing you can do is to use a style guide.
In my opinion it is not a good idea to make this type material into packages.
Misc-packages do exist, but mostly for historical reason and/or due to their authoritative contributors, see Frank Harrell Hmisc .
I see three main reason why this choice does non fit for disparate collection of functions.
There are by and large 7000 packages on CRAN only. It is unlikely that your package will be chosen if it does not target a specific field and, even when this happens, it is very possible that other established packages do the same. Therefore your package should also sport an original/better solution to the problem it deals with.
Repositories, and CRAN in particular, are task-oriented, which suggests packages' functions should address a coherent task. And for a good reason: there is no point in downloading a whole package with say, 50 autonomous functions, when I need just a couple of them. Instead, if a package solves a specific data problem of mine, than I will most likely need most (if not all) of them.
R repositories tend to mask the content. Contrary to tech blogs, you do not immediately see the functions' source. You need to download a separate source package and there is a lot of overhead due to the package structure, which buries the actual functions you are willing to show and the others need to read.
In my opinion the best place for general convenience functions, are sites like GitHub. In fact:
One immediately reads them with the comfort of syntax highlight. If they are interesting, they can be pasted in R to give a try and possibly keep them, otherwise one simply steps over to read next function.
There is the possibility of organising code, but without all the constraints of an actual package. Similar functions might go in the same file and coherent files in the same subfolder.
You can show your ideas to the others in a simple way. The readme file can immediately become a sort of mini webpage (via markdown). In comparison CRAN is quite rigid.
There are a lot of other benefits (revision history, accepting contributions, GitHub pages), which may or may not interest you.
Of course, after several functions grow in a stable coherent direction, you will turn them into an actual CRAN package. Also because the copy and paste method to try them becomes then inconvenient.
EDIT: Nowadays there are alternatives to GitHub, which can be taken into consideration too and GitHub has become a common way to distribute packages not yet ready for CRAN or to integrate the official CRAN distribution page.
Related
I am creating a package and including data of experiments. The experiments are ongoing and therefore the data is going to change in the future.
I am using roxygen2 for documentation and would like to include a description like the following:
#' Data_foo contains the experimental data of [x] participants
In future, 'x' is going to change. Of course, I can always manually update this - not a very big problem - or I could simply decide not to document like this.
But I simply wondered if this is generally possible, and if - how (e.g., might there be a markdown-like inline-code way?)
Yes, the Rd format (which Roxygen produces) supports that. Use syntax like \Sexpr{nrow(foo_package::Data_foo)} where you want [x] to appear.
I want to use the R package future (supports asynchronous calculations) to make a cluster-jobserver that can dynamically add/remove jobs to/from a queue.
One specific functionality that I would like to add to my jobserver is to distribute memory-demanding jobs to the more powerful machines in my cluster. However, since I have no experience with the package, I am not quite sure whether my approach (given below) has any pitfalls. Specifically, do the subsequent calls of plan have any side effects that might mess things up? Please see the comments in the code for more details.
Thanks in advance!
library(parallel)
library(future)
slaveIPs=c("172.16.2.10","172.16.2.21")
masterIP="172.16.2.33"
workers=makePSOCKcluster(slaveIPs,master=masterIP)
#check whether PSOCK cluster was correctly set up
unlist(clusterCall(workers,function(x) unname(Sys.info()["nodename"]))
#[1] "ip-172-16-2-10" "ip-172-16-2-21"
#now the first important part that I am not sure about
#as you can see, I only use workers[1] for the first task
#is it OK to use workers[1] like that?
plan(cluster,workers=workers[1])
f=future({
#do memory-hungry work
unname(Sys.info()["nodename"])
})
message(value(f))
#ip-172-16-2-10
#now I am only using workers[2] for the second task
#Is this ok? Does the previous call to 'plan' need some cleaning before?
plan(cluster,workers=workers[2])
f=future({
#do low-memory work
unname(Sys.info()["nodename"])
})
message(value(f))
#ip-172-16-2-21
stopCluster(workers)
Author of future here:
Yes, it alright to change future strategies like that, i.e. by using plan(). An alternative is to use:
f <- cluster({
#do low-memory work
unname(Sys.info()["nodename"])
}, workers = workers[2])
which is basically what is happening internally.
The downside of explicitly specifying future strategies like this is that your code will be hard coded to use cluster futures.
FYI, I'm planning to add some kind of mechanism for specifying preferred or required "resources" per future. This is just conceptual for now and will not exists anytime soon, but I'm thinking of something in line with:
f <- future({ ... }, needs = "himem")
where one can query workers for the himem tag / property, e.g. attr(workers[2], "provides") <- c("himem", "superfast"). I'm sharing these thoughts just so you know that I'm aware of needs like yours. Again, it will be quite some time before such mechanisms are available, so in the meanwhile, you need to explicitly specify the future strategy as above.
BTW, instead of:
slaveIPs=c("172.16.2.10","172.16.2.21")
masterIP="172.16.2.33"
workers=makePSOCKcluster(slaveIPs,master=masterIP)
you can try:
slaveIPs <- c("172.16.2.10", "172.16.2.21")
workers <- makeClusterPSOCK(slaveIPs)
provided by the future package - this avoids having to know/specify the IP address of master.
The code completion in RStudio is great, and I really like how a popover appears to describe the arguments for the function inputs. For example, if one types matrix( and then presses "tab", a list of arguments for the matrix() function appears, along with a description of the input. Say, nrow= is selected, then the adjoining window describes the nrow input as "the desired number of rows.".
Can I get RStudio to do this for my custom functions? Would I have to create a package to achieve this effect?
Say I have a file full of custom functions, myCustomFunctions.R, and I store all my miscellaneous helper functions in there. I want to be able to add meta data for my functions so that this helper window also describes my function inputs.
To add to Hadley's answer in the comments, Rstudio is mining specific portions of the help files to generate the helper window. Specifically, tabbing before the parentheses brings up the "Usage" and "Description" sections and tabbing inside the parentheses or after a comma brings up the "Arguments" section. Therefore, not only does a package need to be made, but the help files must be generated to take advantage of this feature.
Following up Hadley: even if the functions are only for your own use, it's worth package-izing them. You will then get a lot of useful stuff for free, above and beyond the package documentation system: things like version control, unit testing, portability, shareability ... I could go on. There's a small potential barrier which you have to get over before you can get back to the fun part (i.e. hacking your own neat stuff), but it's worth the investment of time.
Hadley has public-spiritedly put his Packages book online with step by step descriptions of how to access all the goodies I've mentioned. Hopefully you'll decide it's worth paying for (I did).
I'm reading the R FAQ source in texinfo, and thinking that it would be easier to manage and extend if it was parsed as an R structure. There are several existing examples related to this:
the fortunes package
bibtex entries
Rd files
each with some desirable features.
In my opinion, FAQs are underused in the R community because they lack i) easy access from the R command-line (ie through an R package); ii) powerful search functions; iii) cross-references; iv) extensions for contributed packages. Drawing ideas from packages bibtex and fortunes, we could conceive a new system where:
FAQs can be searched from R. Typical calls would resemble the fortune() interface: faq("lattice print"), or faq() #surprise me!, faq(51), faq(package="ggplot2").
Packages can provide their own FAQ.rda, the format of which is not clear yet (see below)
Sweave/knitr drivers are provided to output nicely formatted Markdown/LaTeX, etc.
QUESTION
I'm not sure what is the best input format, however. Either for converting the existing FAQ, or for adding new entries.
It is rather cumbersome to use R syntax with a tree of nested lists (or an ad hoc S3/S4/ref class or structure,
\list(title = "Something to be \\escaped", entry = "long text with quotes, links and broken characters", category = c("windows", "mac", "test"))
Rd documentation, even though not an R structure per se (it is more a subset of LaTeX with its own parser), can perhaps provide a more appealing example of an input format. It also has a set of tools to parse the structure in R. However, its current purpose is rather specific and different, being oriented towards general documentation of R functions, not FAQ entries. Its syntax is not ideal either, I think a more modern markup, something like markdown, would be more readable.
Is there something else out there, maybe examples of parsing markdown files into R structures? An example of deviating Rd files away from their intended purpose?
To summarise
I would like to come up with:
1- a good design for an R structure (class, perhaps) that would extend the fortune package to more general entries such as FAQ items
2- a more convenient format to enter new FAQs (rather than the current texinfo format)
3- a parser, either written in R or some other language (bison?) to convert the existing FAQ into the new structure (1), and/or the new input format (2) into the R structure.
Update 2: in the last two days of the bounty period I got two answers, both interesting but completely different. Because the question is quite vast (arguably ill-posed), none of the answers provide a complete solution, thus I will not (for now anyway) accept an answer. As for the bounty, I'll attribute it to the answer most up-voted before the bounty expires, wishing there was a way to split it more equally.
(This addresses point 3.)
You can convert the texinfo file to XML
wget http://cran.r-project.org/doc/FAQ/R-FAQ.texi
makeinfo --xml R-FAQ.texi
and then read it with the XML package.
library(XML)
doc <- xmlParse("R-FAQ.xml")
r <- xpathSApply( doc, "//node", function(u) {
list(list(
title = xpathSApply(u, "nodename", xmlValue),
contents = as(u, "character")
))
} )
free(doc)
But it is much easier to convert it to text
makeinfo --plaintext R-FAQ.texi > R-FAQ.txt
and parse the result manually.
doc <- readLines("R-FAQ.txt")
# Split the document into questions
# i.e., around lines like ****** or ======.
i <- grep("[*=]{5}", doc) - 1
i <- c(1,i)
j <- rep(seq_along(i)[-length(i)], diff(i))
stopifnot(length(j) == length(doc))
faq <- split(doc, j)
# Clean the result: since the questions
# are in the subsections, we can discard the sections.
faq <- faq[ sapply(faq, function(u) length(grep("[*]", u[2])) == 0) ]
# Use the result
cat(faq[[ sample(seq_along(faq),1) ]], sep="\n")
I'm a little unclear on your goals. You seem to want all the R-related documentation converted into some format which R can manipulate, presumably so the one can write R routines to extract information from the documentation better.
There seem to be three assumptions here.
1) That it will be easy to convert these different document formats (texinfo, RD files, etc.) to some standard form with (I emphasize) some implicit uniform structure and semantics.
Because if you cannot map them all to a single structure, you'll have to write separate R tools for each type and perhaps for each individual document, and then the post-conversion tool work will overwhelm the benefit.
2) That R is the right language in which to write such document processing tools; suspect you're a little biased towards R because you work in R and don't want to contemplate "leaving" the development enviroment to get information about working with R better. I'm not an R expert, but I think R is mainly a numerical language, and does not offer any special help for string handling, pattern recognition, natural language parsing or inference, all of which I'd expect to play an important part in extracting information from the converted documents that largely contain natural language. I'm not suggesting a specific alternative language (Prolog??), but you might be better off, if you succeed with the conversion to normal form (task 1) to carefully choose the target language for processing.
3) That you can actually extract useful information from those structures. Library science was what the 20th century tried to push; now we're all into "Information Retrieval" and "Data Fusion" methods. But in fact reasoning about informal documents has defeated most of the attempts to do it. There are no obvious systems that organize raw text and extract deep value from it (IBM's Jeopardy-winning Watson system being the apparent exception but even there it isn't clear what Watson "knows"; would you want Watson to answer the question, "Should the surgeon open you with a knife?" no matter how much raw text you gave it) The point is that you might succeed in converting the data but it isn't clear what you can successfully do with it.
All that said, most markup systems on text have markup structure and raw text. One can "parse" those into tree-like structures (or graph-like structures if you assume certain things are reliable cross-references; texinfo certainly has these). XML is widely pushed as a carrier for such parsed-structures, and being able to represent arbitrary trees or graphs it is ... OK ... for capturing such trees or graphs. [People then push RDF or OWL or some other knoweldge encoding system that uses XML but this isn't changing the problem; you pick a canonical target independent of R]. So what you really want is something that will read the various marked-up structures (texinfo, RD files) and spit out XML or equivalent trees/graphs. Here I think you are doomed into building separate O(N) parsers to cover all the N markup styles; how otherwise would a tool know what the value markup (therefore parse) was? (You can imagine a system that could read marked-up documents when given a description of the markup, but even this is O(N): somebody still has to describe the markup). One this parsing is to this uniform notation, you can then use an easily built R parser to read the XML (assuming one doesn't already exist), or if R isn't the right answer, parse this with whatever the right answer is.
There are tools that help you build parsers and parse trees for arbitrary lanuages (and even translators from the parse trees to other forms). ANTLR is one; it is used by enough people so you might even accidentally find a texinfo parser somebody already built. Our DMS Software Reengineering Toolkit is another; DMS after parsing will export an XML document with the parse tree directly (but it won't necessarily be in that uniform representation you ideally want). These tools will likely make it relatively easy to read the markup and represent it in XML.
But I think your real problem will be deciding what you want to extract/do, and then finding a way to do that. Unless you have a clear idea of how to do the latter, doing all the up front parsers just seems like a lot of work with unclear payoff. Maybe you have a simpler goal ("manage and extend" but those words can hide a lot) that's more doable.
I find myself in the position of having completed a large chunk of analysis and now need to repeat the analysis with slightly different input assumptions.
The analysis, in this case, involves cluster analysis, plotting several graphs, and exporting cluster ids and other variables of interest. The key point is that it is an extensive analysis, and needs to be repeated and compared only twice.
I considered:
Creating a function. This isn't ideal, because then I have to modify my code to know whether I am evaluating in the function or parent environments. This additional effort seems excessive, makes it harder to debug and may introduce side-effects.
Wrap it in a for-loop. Again, not ideal, because then I have to create indexing variables, which can also introduce side-effects.
Creating some pre-amble code, wrapping the analysis in a separate file and source it. This works, but seems very ugly and sub-optimal.
The objective of the analysis is to finish with a set of objects (in a list, or in separate output files) that I can analyse further for differences.
What is a good strategy for dealing with this type of problem?
Making code reusable takes some time, effort and holds a few extra challenges like you mention yourself.
The question whether to invest is probably the key issue in informatics (if not in a lot of other fields): do I write a script to rename 50 files in a similar fashion, or do I go ahead and rename them manually.
The answer, I believe, is highly personal and even then, different case by case. If you are easy on the programming, you may sooner decide to go the reuse route, as the effort for you will be relatively low (and even then, programmers typically like to learn new tricks, so that's a hidden, often counterproductive motivation).
That said, in your particular case: I'd go with the sourcing option: since you plan to reuse the code only 2 times more, a greater effort would probably go wasted (you indicate the analysis to be rather extensive). So what if it's not an elegant solution? Nobody is ever going to see you do it, and everybody will be happy with the swift results.
If it turns out in a year or so that the reuse is higher than expected, you can then still invest. And by that time, you will also have (at least) three cases for which you can compare the results from the rewritten and funky reusable version of your code with your current results.
If/when I do know up front that I'm going to reuse code, I try to keep that in mind while developing it. Either way I hardly ever write code that is not in a function (well, barring the two-liners for SO and other out-of-the-box analyses): I find this makes it easier for me to structure my thoughts.
If at all possible, set parameters that differ between sets/runs/experiments in an external parameter file. Then, you can source the code, call a function, even utilize a package, but the operations are determined by a small set of externally defined parameters.
For instance, JSON works very well for this and the RJSONIO and rjson packages allow you to load the file into a list. Suppose you load it into a list called parametersNN.json. An example is as follows:
{
"Version": "20110701a",
"Initialization":
{
"indices": [1,2,3,4,5,6,7,8,9,10],
"step_size": 0.05
},
"Stopping":
{
"tolerance": 0.01,
"iterations": 100
}
}
Save that as "parameters01.json" and load as:
library(RJSONIO)
Params <- fromJSON("parameters.json")
and you're off and running. (NB: I like to use unique version #s within my parameters files, just so that I can identify the set later, if I'm looking at the "parameters" list within R.) Just call your script and point to the parameters file, e.g.:
Rscript --vanilla MyScript.R parameters01.json
then, within the program, identify the parameters file from the commandArgs() function.
Later, you can break out code into functions and packages, but this is probably the easiest way to make a vanilla script generalizeable in the short term, and it's a good practice for the long-term, as code should be separated from the specification of run/dataset/experiment-dependent parameters.
Edit: to be more precise, I would even specify input and output directories or files (or naming patterns/prefixes) in the JSON. This makes it very clear how one set of parameters led to one particular output set. Everything in between is just code that runs with a given parametrization, but the code shouldn't really change much, should it?
Update:
Three months, and many thousands of runs, wiser than my previous answer, I'd say that the external storage of parameters in JSON is useful for 1-1000 different runs. When the parameters or configurations number in the thousands and up, it's better to switch to using a database for configuration management. Each configuration may originate in a JSON (or XML), but being able to grapple with different parameter layouts requires a larger scale solution, for which a database like SQLite (via RSQLite) is a fine solution.
I realize this answer is overkill for the original question - how to repeat work only a couple of times, with a few parameter changes, but when scaling up to hundreds or thousands of parameter changes in ongoing research, more extensive tools are necessary. :)
I like to work with combination of a little shell script, a pdf cropping program and Sweave in those cases. That gives you back nice reports and encourages you to source. Typically I work with several files, almost like creating a package (at least I think it feels like that :) . I have a separate file for the data juggling and separate files for different types of analysis, such as descriptiveStats.R, regressions.R for example.
btw here's my little shell script,
#!/bin/sh
R CMD Sweave docSweave.Rnw
for file in `ls pdfs`;
do pdfcrop pdfs/"$file" pdfs/"$file"
done
pdflatex docSweave.tex
open docSweave.pdf
The Sweave file typically sources the R files mentioned above when needed. I am not sure whether that's what you looking for, but that's my strategy so far. I at least I believe creating transparent, reproducible reports is what helps to follow at least A strategy.
Your third option is not so bad. I do this in many cases. You can build a bit more structure by putting the results of your pre-ample code in environments and attach the one you want to use for further analysis.
An example:
setup1 <- local({
x <- rnorm(50, mean=2.0)
y <- rnorm(50, mean=1.0)
environment()
# ...
})
setup2 <- local({
x <- rnorm(50, mean=1.8)
y <- rnorm(50, mean=1.5)
environment()
# ...
})
attach(setup1) and run/source your analysis code
plot(x, y)
t.test(x, y, paired = T, var.equal = T)
...
When finished, detach(setup1) and attach the second one.
Now, at least you can easily switch between setups. Helped me a few times.
I tend to push such results into a global list.
I use Common Lisp but then R isn't so different.
Too late for you here, but I use Sweave a lot, and most probably I'd have used a Sweave file from the beginning (e.g. if I know that the final product needs to be some kind of report).
For repeating parts of the analysis a second and third time, there are then two options:
if the results are rather "independent" (i.e. should produce 3 reports, comparison means the reports are inspected side by side), and the changed input comes in the form of new data files, that goes into its own directory together with a copy of the Sweave file, and I create separate reports (similar to source, but feels more natural for Sweave than for plain source).
if I rather need to do the exactly same thing once or twice again inside one Sweave file I'd consider reusing code chunks. This is similar to the ugly for-loop.
The reason is that then of course the results are together for the comparison, which would then be the last part of the report.
If it is clear from the beginning that there will be some parameter sets and a comparison, I write the code in a way that as soon as I'm fine with each part of the analysis it is wrapped into a function (i.e. I'm acutally writing the function in the editor window, but evaluate the lines directly in the workspace while writing the function).
Given that you are in the described situation, I agree with Nick - nothing wrong with source and everything else means much more effort now that you have it already as script.
I can't make a comment on Iterator's answer so I have to post it here. I really like his answer so I made a short script for creating the parameters and exporting them to external JSON files. And I hope someone finds this useful: https://github.com/kiribatu/Kiribatu-R-Toolkit/blob/master/docs/parameter_configuration.md