FAQ markup to R data structure

I'm reading the R FAQ source in texinfo, and thinking that it would be easier to manage and extend if it was parsed as an R structure. There are several existing examples related to this:
the fortunes package
bibtex entries
Rd files
each with some desirable features.
In my opinion, FAQs are underused in the R community because they lack i) easy access from the R command line (i.e. through an R package); ii) powerful search functions; iii) cross-references; iv) extensions for contributed packages. Drawing ideas from the bibtex and fortunes packages, we could conceive a new system where:
FAQs can be searched from R. Typical calls would resemble the fortune() interface: faq("lattice print"), or faq() #surprise me!, faq(51), faq(package="ggplot2").
Packages can provide their own FAQ.rda, the format of which is not clear yet (see below)
Sweave/knitr drivers are provided to output nicely formatted Markdown/LaTeX, etc.
QUESTION
I'm not sure what the best input format is, however, either for converting the existing FAQ or for adding new entries.
It is rather cumbersome to use R syntax with a tree of nested lists (or an ad hoc S3/S4/ref class or structure), e.g.
\list(title = "Something to be \\escaped", entry = "long text with quotes, links and broken characters", category = c("windows", "mac", "test"))
Rd documentation, even though not an R structure per se (it is more a subset of LaTeX with its own parser), can perhaps provide a more appealing example of an input format. It also has a set of tools to parse the structure in R. However, its current purpose is rather specific and different, being oriented towards general documentation of R functions, not FAQ entries. Its syntax is not ideal either; I think a more modern markup, something like Markdown, would be more readable.
Is there something else out there, maybe examples of parsing Markdown files into R structures? An example of repurposing Rd files away from their intended use?
To summarise
I would like to come up with:
1- a good design for an R structure (class, perhaps) that would extend the fortunes package to more general entries such as FAQ items (a rough sketch follows below)
2- a more convenient format to enter new FAQs (rather than the current texinfo format)
3- a parser, either written in R or some other language (bison?) to convert the existing FAQ into the new structure (1), and/or the new input format (2) into the R structure.
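As a very rough sketch of point 1, purely for illustration: the class faqEntry, the function faq() and the pre-built list faqList below are all hypothetical, not an existing API.
setClass("faqEntry",
         representation(title    = "character",
                        entry    = "character",
                        category = "character",
                        package  = "character"))

# A naive grep-based search over a list of faqEntry objects,
# mimicking the fortune() calling convention.
faq <- function(pattern = NULL, entries = faqList) {
  if (is.null(pattern))
    return(entries[[sample(length(entries), 1)]])  # faq(): surprise me!
  hits <- vapply(entries, function(e)
    any(grepl(pattern, c(e@title, e@entry), ignore.case = TRUE)),
    logical(1))
  entries[hits]
}
A real design would also need cross-references, a way for contributed packages to register their own entries, and print/format methods in the spirit of fortune().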
Update 2: in the last two days of the bounty period I got two answers, both interesting but completely different. Because the question is quite vast (arguably ill-posed), neither answer provides a complete solution, so I will not (for now anyway) accept an answer. As for the bounty, I'll award it to the most up-voted answer before the bounty expires, wishing there was a way to split it more equally.

(This addresses point 3.)
You can convert the texinfo file to XML
wget http://cran.r-project.org/doc/FAQ/R-FAQ.texi
makeinfo --xml R-FAQ.texi
and then read it with the XML package.
library(XML)
doc <- xmlParse("R-FAQ.xml")
r <- xpathSApply(doc, "//node", function(u) {
  list(list(
    title    = xpathSApply(u, "nodename", xmlValue),
    contents = as(u, "character")
  ))
})
free(doc)
But it is much easier to convert it to text
makeinfo --plaintext R-FAQ.texi > R-FAQ.txt
and parse the result manually.
doc <- readLines("R-FAQ.txt")
# Split the document into questions
# i.e., around lines like ****** or ======.
i <- grep("[*=]{5}", doc) - 1
i <- c(1,i)
j <- rep(seq_along(i)[-length(i)], diff(i))
stopifnot(length(j) == length(doc))
faq <- split(doc, j)
# Clean the result: since the questions
# are in the subsections, we can discard the sections.
faq <- faq[ sapply(faq, function(u) length(grep("[*]", u[2])) == 0) ]
# Use the result
cat(faq[[ sample(seq_along(faq),1) ]], sep="\n")

I'm a little unclear on your goals. You seem to want all the R-related documentation converted into some format which R can manipulate, presumably so that one can write R routines to extract information from the documentation more effectively.
There seem to be three assumptions here.
1) That it will be easy to convert these different document formats (texinfo, Rd files, etc.) to some standard form with (I emphasize) some implicit uniform structure and semantics.
Because if you cannot map them all to a single structure, you'll have to write separate R tools for each type and perhaps for each individual document, and then the post-conversion tool work will overwhelm the benefit.
2) That R is the right language in which to write such document-processing tools; I suspect you're a little biased towards R because you work in R and don't want to contemplate "leaving" the development environment to get better information about working with R. I'm not an R expert, but I think R is mainly a numerical language, and does not offer any special help for string handling, pattern recognition, natural language parsing or inference, all of which I'd expect to play an important part in extracting information from the converted documents, which largely contain natural language. I'm not suggesting a specific alternative language (Prolog??), but if you succeed with the conversion to normal form (task 1), you might be better off carefully choosing the target language for processing.
3) That you can actually extract useful information from those structures. Library science was what the 20th century tried to push; now we're all into "Information Retrieval" and "Data Fusion" methods. But in fact reasoning about informal documents has defeated most of the attempts to do it. There are no obvious systems that organize raw text and extract deep value from it (IBM's Jeopardy-winning Watson system being the apparent exception, but even there it isn't clear what Watson "knows"; would you want Watson to answer the question "Should the surgeon open you with a knife?" no matter how much raw text you gave it?). The point is that you might succeed in converting the data, but it isn't clear what you can successfully do with it.
All that said, most markup systems on text have markup structure and raw text. One can "parse" those into tree-like structures (or graph-like structures if you assume certain things are reliable cross-references; texinfo certainly has these). XML is widely pushed as a carrier for such parsed structures, and being able to represent arbitrary trees or graphs it is ... OK ... for capturing such trees or graphs. [People then push RDF or OWL or some other knowledge-encoding system that uses XML, but this doesn't change the problem; you pick a canonical target independent of R.] So what you really want is something that will read the various marked-up structures (texinfo, Rd files) and spit out XML or equivalent trees/graphs. Here I think you are doomed to building separate O(N) parsers to cover all the N markup styles; how else would a tool know what the markup (and therefore the parse) means? (You can imagine a system that could read marked-up documents when given a description of the markup, but even this is O(N): somebody still has to describe the markup.) Once this parsing into a uniform notation is done, you can then use an easily built R parser to read the XML (assuming one doesn't already exist), or, if R isn't the right answer, parse it with whatever the right answer is.
There are tools that help you build parsers and parse trees for arbitrary languages (and even translators from the parse trees to other forms). ANTLR is one; it is used by enough people that you might even accidentally find a texinfo parser somebody already built. Our DMS Software Reengineering Toolkit is another; after parsing, DMS will export an XML document with the parse tree directly (but it won't necessarily be in that uniform representation you ideally want). These tools will likely make it relatively easy to read the markup and represent it in XML.
But I think your real problem will be deciding what you want to extract/do, and then finding a way to do that. Unless you have a clear idea of how to do the latter, doing all the up front parsers just seems like a lot of work with unclear payoff. Maybe you have a simpler goal ("manage and extend" but those words can hide a lot) that's more doable.

Related

Function to check (English language) grammar in R

Are there any R functions that receive a sentence/paragraph and return a boolean indicating whether that document is grammatically correct?
i.e.
grammar_check("I've had a good day!")
# TRUE
grammar_check("ive have a good day")
# FALSE
I have found some very advanced text capabilities in R (mostly used in machine learning / bag-of-words approaches), for example those in the caret and tm packages, but I have not been able to find any function that can check grammar.
Note that this answer points to some Python packages and web services, but those are for extracting and stemming words, and they are not in R.
Also note: an informative comment on that answer points out that writing a rules-based function would take a whole team of people a very long time (since English is so idiosyncratic), so this is not something that can be hacked together by an individual in a short amount of time.

Easy export and table formatting of R dataframe to Word? [closed]

Is there an easy way to automate the conversion of an R dataframe to a pretty Word table in APA format for publishing manuscripts? I'm currently doing this by saving the table in a csv, opening that in Excel, copying the Excel table to Word, and formatting it there, but I'm hoping there is a way to automate the formatting in R, so that when I convert it to Word it is already in APA format, because Word is poor at automation.
Basically, I want to continue writing the manuscript itself in Word while doing my analyses in R, then gather all the results in R into a table (with manually modifiable formatting) by a script and convert it to whatever format I can then simply copy-paste into Word (so that the formatting actually holds). When I need to modify the table, I would make the changes in R and then just run the script again without needing to change anything in Word.
I don't want to learn LaTeX, because everyone in my field uses Word with features like track changes, and I use Zotero add-in for citations, so it's simpler to just keep the writing separate from the analyses. Also, I am a psychologist, not a coder, so learning a lot of new technologies just for this is probably not worth the effort for me. Typically with new technologies come new technical problems, and I am aiming to make my workflow quicker, but not at the cost of unpredictability (which may make it slower exactly at the moment when I cannot afford it).
I found an R+knitr+rmarkdown+pander+pandoc solution "with as little overhead as possible", but it still seems quite heavy because I don't know any of those technologies apart from R. And I'm not eager to start learning all that, as it seems to be aimed at doing the writing and everything else in R to the very end, while I want to separate my writing and my code; I never need code in my writing, only the result tables. In addition, based on the examples, it seems to fetch the values directly from R code (e.g., from summary() to create a descriptive table), while I need to be able to tinker with my table manually before converting it, for instance writing the title and notes (like a specific note to one cell, explained at the bottom). I also found R2wd, but it seems to be an older attempt at the same "whole workflow in R" problem as the solution above. SWord does not seem to be working anymore.
Any suggestions?
(Just to let you know, I am the author of the packages I recommend you...)
You can use the package ReporteRs to output your table to Word. See a tutorial here (not mine):
http://www.sthda.com/english/wiki/create-and-format-word-documents-using-r-software-and-reporters-package
FlexTable objects let you format and arrange tables easily with some standard R code. For example, to set the 2nd column in bold, the code looks like:
myFlexTable[, 2] = textBold()
There are (old) examples here:
http://davidgohel.github.io/ReporteRs/flextable_examples.html
These objects can be added to a Word report using the function addFlexTable. The Word report can be generated with the function writeDoc.
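A minimal sketch of that workflow (untested here; the mtcars data and the output file name are only illustrative):
library(ReporteRs)
ft <- FlexTable(head(mtcars))           # build the table object
ft[, 2] <- textBold()                   # e.g. set the 2nd column in bold
doc <- docx()                           # create an empty Word document
doc <- addFlexTable(doc, ft)            # add the table to the report
writeDoc(doc, file = "my_table.docx")   # write the .docx file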
If you are working in RStudio, you can print the object and it will be rendered in the HTML viewer, so you can export it to Word when you are satisfied with its content.
You can even add real Word footnotes (see the link below)
http://davidgohel.github.io/ReporteRs/pot_objects.html#pot_footnotes
If you need more tabular output, I also recommend the rtable package, which handles xtable objects (and other things I have to develop to satisfy my colleagues or customers); a quick demo can be seen here:
http://davidgohel.github.io/tabular/
Hope it helps...
I have had the same need, and I ended up using the package htmlTable, which is quite 'cost-efficient'. It creates an HTML table (in RStudio it is shown in the "Viewer" window in the bottom right), which I just select with the mouse and copy-paste into Word. (Start selecting from the bottom of the table and drag the mouse upwards; that way you are sure to include the start of the HTML code.) Word handles these tables quite nicely. The syntax is quite simple, involving just the function htmlTable(), but it is still able to make somewhat more complex tables, such as grouped rows and primary and secondary column headers (i.e. column headers spanning more than one column). Check out the examples in the vignette.
One note of caution: htmlTable will not work with factor variables, i.e., they will come out as integer numbers (according to the factor levels). So read the data using stringsAsFactors = FALSE or convert them using as.character().
Including trailing zeroes can be done using the txtRound function. Example:
library(htmlTable)
mini_table <- data.frame(Name = "A", x = runif(20), stringsAsFactors = FALSE)
txt <- txtRound(mini_table, 2)
It is not completely straightforward to assign formatting such as bold or italics, but it can be done by wrapping the table contents in HTML code. If you for instance want to make an entire column bold, it can be done like this (please note the use of single and double quotation marks inside paste0):
library(plyr)
library(htmlTable)
mini_table <- data.frame(Name = "A", x = runif(20), stringsAsFactors = FALSE)
txt <- txtRound(mini_table, 2)
txt$x <- aaply(txt$x, 1, function(x)
  paste0("<span style='font-weight:bold'>", x, "</span>")
)
htmlTable(txt)
Of course, that would be easier to do in Word. However, it is more interesting to add formatting to numbers according to some criteria. For instance, if we want to emphasize all values of x that are less than 0.2 by applying bold font, we can modify the code above as follows:
library(plyr)
library(htmlTable)
mini_table <- data.frame(Name = "A", x = runif(20), stringsAsFactors = FALSE)
txt <- txtRound(mini_table, 2)
txt$x <- aaply(txt$x, 1, function(x)
  if (as.numeric(x) < 0.2) {
    paste0("<span style='font-weight:bold'>", x, "</span>")
  } else {
    paste0("<span>", x, "</span>")
  })
htmlTable(txt)
If you want even fancier emphasis, you can for instance replace the bold font with a red background color by using span style='background-color: red' in the code above. All these changes carry over to Word, at least on my computer (Windows 7).
The short answer is "not really." I've never had much luck getting well formatted tables into MS Word. The best approach I can offer you requires using Rmarkdown to render your tables into an HTML file. You can copy and paste your results from the HTML file to MS Word, but I make no guarantees about how well the formatting will follow.
To format your tables, you can try something like the xtable package, or the pixiedust package. But again, no guarantees that the formatting will transfer.
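As a minimal, untested sketch of that Rmarkdown/HTML route (using knitr's kable() and an illustrative file name; xtable or pixiedust could be swapped in for the formatting step):
# Render a data frame to an HTML table that can be copy-pasted into Word.
library(knitr)
html <- kable(head(mtcars), format = "html", digits = 2)
writeLines(as.character(html), "my_table.html")
# Open my_table.html in a browser, then copy-paste the table into Word.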

Easiest way to save an S4 class

Probably the most basic question on S4 classes imaginable here.
What is the simplest way to save an S4 class you have defined so that you can reuse it elsewhere. I have a project where I'm taking a number of very large datasets and compiling summary information from them into small S4 objects. Since I'll therefore be switching R sessions to create the summary object for each dataset, it'd be good to be able to load in the definition of the class from a saved object (or have it load automatically) rather than having to include the long definition of the object at the top of each script (which I assume is bad practice anyway because the code defining the object might become inconsistent).
So what's the syntax along the lines of saveclass("myClass"), loadclass("myclass") or am I just thinking about this in the wrong way?
setClass("track", representation(x="numeric", y="numeric"))
x <- new("track", x=1:4, y=5:8)
save as binary
fn <- tempfile()
save(x, ascii=FALSE, file=fn)
rm(x)
load(fn)
x
save as ASCII
save(x, ascii=TRUE, file=fn)
ASCII text representation from which to regenerate the data
dput(x, file=fn)
y <- dget(fn)
The original source can be found here.
From the question, I think you really do want to include the class definition at the top of each script (although not literally; see below), rather than saving a binary representation of the class definition and load that. The reason is the general one that binary representations are more fragile (subject to changes in software implementation) compared to simple text representations (for instance, in the not too distant past S4 objects were based on simple lists with a class attribute; more recently they have been built around an S4 'bit' set on the underlying C-level data representation).
Instead of copying and pasting the definition into each script, really the best practice is to include the class definition (and related methods) in an R package, and to load the package at the top of the script. It is not actually hard to write packages; an easy way to get started is to use RStudio to create a 'New Project' as an 'R package'. Use a version number in the package to keep track of the specific version of the class definition / methods you're using, and version control (svn or git, for instance) to make it easy to track the changes / explorations you make as your class matures. Share with your colleagues and eventually the larger R community to let others benefit from your hard work and insight!
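A rough, hypothetical sketch of what that package route might look like (the package name "trackutils" and the file layout are invented for illustration):
# File R/track-class.R inside a (hypothetical) package "trackutils":
setClass("track", representation(x = "numeric", y = "numeric"))

# File NAMESPACE:
#   exportClasses(track)

# Every analysis script then starts with:
#   library(trackutils)
#   x <- new("track", x = 1:4, y = 5:8)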

Making a term document matrix from an excel file using R

For sentiment analysis using the tm plugin webmining, I am trying to create a TermDocumentMatrix, as in the code sample at the link below:
http://www.inside-r.org/packages/cran/tm/docs/tm_tag_score
I have a csv file with headlines of articles on separate rows, in a single column and without a header. My goal is to create a term document matrix (or PlainTextDocument, if possible) using the rows of headlines in my csv file, but so far I have only been able to create a regular matrix:
#READ IN FILE
filevz <- read.csv("headlinesonly.csv")
#make matrix
headvzDTM <- as.matrix(filevz)
#look at dimension of file
dim(filevz)
#[1] 829 1
#contents of DTM
headvzDTM
European.Central.Bank.President.Draghi.News.Conference..Text.
[1,] "Euro Gains Seen as ECB Bank Test Sparks Repatriation: Currencies"
[2,] "Euro-Area Inflation Rate Falls to Four-Year Low"
[3,] "Euro-Area October Economic Confidence Rise Beats Forecast"
[4,] "Europe Breakup Forces Mount as Union Relevance Fades"
[5,] "Cyprus Tops Germany as Bailout Payer, Relatively Speaking"
# ... the entire contents are printed; only the first 5 and the last entry are shown here
[829,] "Copper, Metals Plummet as Europe Credit-Rating Cuts Erode Demand Prospects"
I did not include a header in the csv file. This was the error message when I tried to begin the sentiment analysis:
pos <- tm_tag_score(TermDocumentMatrix(headvzDTM,
control = list(removePunctuation = TRUE)),
tm_get_tags("Positiv"))
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class "c('matrix', 'character')"
Is there a way to create a TermDocumentMatrix using the matrix I have created?
I had alternatively tried to create a reader to extract the contents of the csv file and place it into a corpus, but this gave me an error:
# read in csv
read.table("headlinesonly.csv", header = FALSE, sep = ";")
# call the table by a name
headlinevz <- read.table("headlinesonly.csv", header = FALSE, sep = ";")
m <- list(Content = "contents")
ds <- DataframeSource(headlinevz)
elem <- getElem(stepNext(ds))
# make myReader
myReader <- readTabular(mapping = m)
# error message
> (headvz <- Corpus(DataframeSource(headlinevz, encoding = "UTF-8"),
+   readerControl = myReader(elem, language = "eng", id = "id1"
+ )))
Error in `[.default`(elem$content, , mapping[[n]]) :
  incorrect number of dimensions
When I try other suggestions on this site (such as R text mining documents from CSV file (one row per doc)), I keep running into the problem of not being able to do sentiment analysis on an object of class "data.frame":
hvz <- read.csv("headlinesonly.csv", header=FALSE)
require(tm)
corp <- Corpus(DataframeSource(hvz))
dtm <- DocumentTermMatrix(corp)
pos <- tm_tag_score(TermDocumentMatrix(hvz, control = list(removePunctuation = TRUE)),
tm_get_tags("Positiv"))
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class "data.frame"
require("tm.plugin.tags")
Loading required package: tm.plugin.tags
sapply(hvz, tm_tag_score, tm_get_tags("Positiv"))
Error in UseMethod("tm_tag_score", x) :
no applicable method for 'tm_tag_score' applied to an object of class "factor"
Here's how to get tm_tag_score working, using a reproducible example that appears similar to your use-case.
First some example data...
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem"
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following: Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."
Put example data in a data frame...
df <- data.frame(txt = sapply(1:5, function(i) eval(parse(text = paste0("examp", i)))))
Now load the packages tm and tm.plugin.tags (hat tip: https://stackoverflow.com/a/19331289/1036500)
require(tm)
install.packages("tm.plugin.tags", repos = "http://datacube.wu.ac.at", type = "source")
require(tm.plugin.tags)
Now transform data frame of text to a corpus
corp <- Corpus(DataframeSource(df))
Now compute the tag score. Note that the call to tm_tag_score wraps a TermDocumentMatrix call, which operates on the corpus object, not on the plain matrix you were using.
pos <- tm_tag_score(TermDocumentMatrix(corp, control = list(removePunctuation = TRUE)), tm_get_tags("Positiv"))
sapply(corp, tm_tag_score, tm_get_tags("Positiv"))
And here's the output (the same for pos as for the sapply call, in this case):
1 2 3 4 5
4 6 16 10 29
Does that answer your question?
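Applied back to your headlinesonly.csv, an untested sketch (assuming one headline per row and no header) would be:
require(tm)
require(tm.plugin.tags)
hvz <- read.csv("headlinesonly.csv", header = FALSE, stringsAsFactors = FALSE)
corp <- Corpus(VectorSource(hvz[[1]]))  # one document per headline
pos <- tm_tag_score(TermDocumentMatrix(corp,
         control = list(removePunctuation = TRUE)),
         tm_get_tags("Positiv"))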

Strategies for repeating large chunk of analysis

I find myself in the position of having completed a large chunk of analysis and now need to repeat the analysis with slightly different input assumptions.
The analysis, in this case, involves cluster analysis, plotting several graphs, and exporting cluster ids and other variables of interest. The key point is that it is an extensive analysis, and needs to be repeated and compared only twice.
I considered:
Creating a function. This isn't ideal, because then I have to modify my code to know whether I am evaluating in the function or parent environments. This additional effort seems excessive, makes it harder to debug and may introduce side-effects.
Wrap it in a for-loop. Again, not ideal, because then I have to create indexing variables, which can also introduce side-effects.
Creating some preamble code, wrapping the analysis in a separate file and sourcing it. This works, but seems very ugly and sub-optimal.
The objective of the analysis is to finish with a set of objects (in a list, or in separate output files) that I can analyse further for differences.
What is a good strategy for dealing with this type of problem?
Making code reusable takes some time, effort and holds a few extra challenges like you mention yourself.
The question of whether to invest is probably the key issue in informatics (if not in a lot of other fields): do I write a script to rename 50 files in a similar fashion, or do I go ahead and rename them manually?
The answer, I believe, is highly personal and, even then, differs case by case. If programming comes easily to you, you may decide sooner to go the reuse route, as the effort for you will be relatively low (and even then, programmers typically like to learn new tricks, so that's a hidden, often counterproductive motivation).
That said, in your particular case I'd go with the sourcing option: since you plan to reuse the code only two more times, a greater effort would probably be wasted (you indicate the analysis is rather extensive). So what if it's not an elegant solution? Nobody is ever going to see you do it, and everybody will be happy with the swift results.
If it turns out in a year or so that the reuse is higher than expected, you can then still invest. And by that time, you will also have (at least) three cases for which you can compare the results from the rewritten and funky reusable version of your code with your current results.
If/when I do know up front that I'm going to reuse code, I try to keep that in mind while developing it. Either way I hardly ever write code that is not in a function (well, barring the two-liners for SO and other out-of-the-box analyses): I find this makes it easier for me to structure my thoughts.
If at all possible, set parameters that differ between sets/runs/experiments in an external parameter file. Then, you can source the code, call a function, even utilize a package, but the operations are determined by a small set of externally defined parameters.
For instance, JSON works very well for this, and the RJSONIO and rjson packages allow you to load the file into a list. Suppose you store the parameters in a file called parametersNN.json and load it into a list. An example is as follows:
{
  "Version": "20110701a",
  "Initialization": {
    "indices": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "step_size": 0.05
  },
  "Stopping": {
    "tolerance": 0.01,
    "iterations": 100
  }
}
Save that as "parameters01.json" and load it as:
library(RJSONIO)
Params <- fromJSON("parameters01.json")
and you're off and running. (NB: I like to use unique version #s within my parameters files, just so that I can identify the set later, if I'm looking at the "parameters" list within R.) Just call your script and point to the parameters file, e.g.:
Rscript --vanilla MyScript.R parameters01.json
then, within the program, identify the parameters file from the commandArgs() function.
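A minimal sketch of the receiving end inside MyScript.R (assuming the JSON file path is the only trailing command-line argument):
library(RJSONIO)
args <- commandArgs(trailingOnly = TRUE)  # e.g. "parameters01.json"
Params <- fromJSON(args[1])
Params$Initialization$step_size           # the parameters then drive the analysis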
Later, you can break out code into functions and packages, but this is probably the easiest way to make a vanilla script generalizable in the short term, and it's a good practice for the long term, as code should be separated from the specification of run/dataset/experiment-dependent parameters.
Edit: to be more precise, I would even specify input and output directories or files (or naming patterns/prefixes) in the JSON. This makes it very clear how one set of parameters led to one particular output set. Everything in between is just code that runs with a given parametrization, but the code shouldn't really change much, should it?
Update:
Three months, and many thousands of runs, wiser than my previous answer, I'd say that the external storage of parameters in JSON is useful for 1-1000 different runs. When the parameters or configurations number in the thousands and up, it's better to switch to using a database for configuration management. Each configuration may originate in a JSON (or XML), but being able to grapple with different parameter layouts requires a larger scale solution, for which a database like SQLite (via RSQLite) is a fine solution.
I realize this answer is overkill for the original question - how to repeat work only a couple of times, with a few parameter changes, but when scaling up to hundreds or thousands of parameter changes in ongoing research, more extensive tools are necessary. :)
I like to work with a combination of a little shell script, a PDF cropping program and Sweave in those cases. That gives you back nice reports and encourages you to source. Typically I work with several files, almost like creating a package (at least I think it feels like that :) ). I have a separate file for the data juggling and separate files for different types of analysis, such as descriptiveStats.R and regressions.R.
By the way, here's my little shell script:
#!/bin/sh
R CMD Sweave docSweave.Rnw
for file in `ls pdfs`;
do pdfcrop pdfs/"$file" pdfs/"$file"
done
pdflatex docSweave.tex
open docSweave.pdf
The Sweave file typically sources the R files mentioned above when needed. I am not sure whether that's what you're looking for, but that's my strategy so far. At least I believe that creating transparent, reproducible reports is what helps you stick to at least some strategy.
Your third option is not so bad. I do this in many cases. You can build a bit more structure by putting the results of your preamble code into environments and attaching the one you want to use for further analysis.
An example:
setup1 <- local({
  x <- rnorm(50, mean = 2.0)
  y <- rnorm(50, mean = 1.0)
  environment()
  # ...
})
setup2 <- local({
  x <- rnorm(50, mean = 1.8)
  y <- rnorm(50, mean = 1.5)
  environment()
  # ...
})
Then attach(setup1) and run/source your analysis code:
plot(x, y)
t.test(x, y, paired = TRUE, var.equal = TRUE)
# ...
When finished, detach(setup1) and attach the second one.
Now, at least you can easily switch between setups. Helped me a few times.
I tend to push such results into a global list.
I use Common Lisp but then R isn't so different.
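In R, a minimal sketch of that pattern (run_analysis() is a hypothetical wrapper around the analysis, named only for illustration) could be:
results <- list()  # one global container for all runs
results[["baseline"]]    <- run_analysis(step_size = 0.05)  # run_analysis() is hypothetical
results[["alternative"]] <- run_analysis(step_size = 0.10)
str(results, max.level = 1)  # inspect and compare the stored runs later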
Too late for you here, but I use Sweave a lot, and most probably I'd have used a Sweave file from the beginning (e.g. if I know that the final product needs to be some kind of report).
For repeating parts of the analysis a second and third time, there are then two options:
if the results are rather "independent" (i.e. the analysis should produce three reports, and comparison means the reports are inspected side by side), and the changed input comes in the form of new data files, then each data set goes into its own directory together with a copy of the Sweave file, and I create separate reports (similar to source, but this feels more natural for Sweave than for plain source).
if I rather need to do exactly the same thing once or twice more inside one Sweave file, I'd consider reusing code chunks. This is similar to the ugly for-loop.
The reason is that then of course the results are together for the comparison, which would then be the last part of the report.
If it is clear from the beginning that there will be some parameter sets and a comparison, I write the code in a way that, as soon as I'm fine with each part of the analysis, it is wrapped into a function (i.e. I'm actually writing the function in the editor window, but evaluating the lines directly in the workspace while writing the function).
Given that you are in the described situation, I agree with Nick: there is nothing wrong with source, and everything else means much more effort now that you already have it as a script.
I can't make a comment on Iterator's answer so I have to post it here. I really like his answer so I made a short script for creating the parameters and exporting them to external JSON files. And I hope someone finds this useful: https://github.com/kiribatu/Kiribatu-R-Toolkit/blob/master/docs/parameter_configuration.md
