Document/Scripts management for R code

I am looking for a solution that allows me to keep track of the multitude of R scripts I create for various projects and purposes. Some scripts are easily tied to specific projects, whereas others are "convenience" functions created to serve a set of tasks.
Is there a way I can create a central database and query it to find which scripts best match a given task?
I could build such a system manually with a DBMS, but is anyone aware of an existing software tool (ideally FOSS), general-purpose or specific to R, that does this?
EDIT: Thank you for the responses. My current system is just a set of scripts with comments that allow me to identify their intended task. Though I use StatET with SVN, I would like a search utility along the lines of the "sos" package.

The question

    I am looking for a solution that allows me to keep track of a multitude of R scripts
    that I create for various projects and purposes. Some scripts are easily tied to specific
    projects, whereas others are "convenience" functions created to serve a set of tasks.

fails to address the obvious follow-up of why the following existing mechanisms are not suitable:
1. Create a local package for each project
2. Create one or more local packages for local utility functions
3. Use R's already existing mechanisms for searching, indexing, testing, and cross-referencing
And use any revision control system of your liking, local or on the web, to host the code for 1. to 3. above.
Reinventing an RDBMS schema for 1. to 3. is just wrong in my book. But if you must, go ahead and replicate what you can already (mostly) get for free in tested and widely used code.
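For illustration, a minimal sketch of item 1: turning a project's existing workspace functions into a local package. The package and function names here are placeholders, not from the question:

    # assume clean_data() and fit_model() already exist in the workspace
    utils::package.skeleton(
      name = "myProjectPkg",
      list = c("clean_data", "fit_model"),
      path = "~/R-packages"
    )
    # then flesh out DESCRIPTION and the man/ stubs, and build/check as usual:
    #   R CMD build ~/R-packages/myProjectPkg
    #   R CMD check myProjectPkg_1.0.tar.gz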

R comes with several mechanisms for searching for help, most of which naturally use CRAN. Some examples: the sos package, cranberries, crantastic, and rseek. In many cases, these can be adapted to use a local repository (creating a local repository is easy and is described in the R manuals). Otherwise, if you package your scripts and submit them to CRAN, you will naturally have these available to you. I would also highly recommend this presentation on the subject: Creating R Packages, Using CRAN, R-Forge, And Local R Archive Networks And Subversion (SVN) Repositories from Spencer Graves and Sundar Dorai-Raj.
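A rough sketch of one way to set up a local source repository; the paths are placeholders, and the R manuals describe the full layout:

    repo <- path.expand("~/localrepo")
    dir.create(file.path(repo, "src", "contrib"), recursive = TRUE, showWarnings = FALSE)
    # copy your built .tar.gz packages into src/contrib, then generate the PACKAGES index:
    tools::write_PACKAGES(file.path(repo, "src", "contrib"), type = "source")
    # install from the local repository just like from CRAN:
    install.packages("myProjectPkg", repos = paste0("file://", repo))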
These would require you to put your code in packages and create documentation, all of which is worth doing anyway. The package documentation turns out to be very useful both for documenting what things do and for helping you find them in the future. You can use roxygen to create this documentation in-line with your code. Also read this related question: Organizing R Source Code.
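For instance, roxygen documentation lives next to the code it describes; a small hypothetical utility might look like this:

    #' Trim whitespace from both ends of a character vector.
    #'
    #' @param x A character vector.
    #' @return \code{x} with leading and trailing whitespace removed.
    #' @keywords utilities character
    #' @export
    trim_ws <- function(x) gsub("^\\s+|\\s+$", "", x)

Running roxygen2::roxygenise() (or devtools::document()) turns these comments into the .Rd help files that R's search tools index.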
Alternatively, the help.search() function can be very useful for searching local packages, regardless of whether you have a repository set up.
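For example, assuming the hypothetical local package and function sketched above are installed:

    # search installed packages by keyword, alias, title, or concept
    help.search("whitespace")                            # or the shortcut ??whitespace
    help.search("whitespace", package = "myProjectPkg")  # restrict to one local package

    # the sos package casts a much wider net (CRAN by default)
    library(sos)
    findFn("rolling regression")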

You'd probably be best off working with a version control system. Many can be indexed and made searchable. At my work, a stack of R, Eclipse, StatET, Subversion, and Subclipse works very well for us.

Related

Is there a way to protect my R code that runs on a AWS account owned by a client?

I just joined a company that needs to build an ETL pipeline inside an AWS account owned by a client.
There's one part of the ETL pipeline that runs code written in R. The problem is, this R code is a very important part of our business and our intellectual property. Our clients can't see this code.
Is there any way to run this in their AWS environment without them having access to our code? R is not a compiled language, so we can't just deploy an executable file there. And we HAVE to run this in their environment. I suggested creating an API to run this in our AWS environment, but this is not an option.
In my experience, these are the options I've identified in situations like this, in increasing order of difficulty:
Take the computation off-premises. It sounds like this is not an option for you.
Generate an API (e.g., shiny, opencpu, plumber) that is callable from their premises; a minimal sketch follows this list. This might require some finessing on their end, as I'm inferring (since they want it all done within their environment) that they might prefer a locked-down computation (perhaps disabling network access).
Rewrite the sensitive portions in Rcpp. While this has the possible benefit of speed improvements, it also makes it slightly harder for them to "discover" the underlying intellectual property. Realize that R and Rcpp are both GPL, which means that anything that links against them must also be GPL, meaning source-code available. (It is possible that, since you are not making the code public, you could argue your case here, but I am not a lawyer and would not want to be the first consultant found on the wrong side of GPL law here. Again, IANAL.)
Rewrite the sensitive portions in a non-R executable (note that I don't say "as a non-R library and link to it via R calls", since the linking action taints the library with R's GPL). This executable can be called by your otherwise releasable R package (via system or processx::run).
(For the record, one might infer C or C++ here, but other higher-level languages can also produce compiled executables and are not GPL. Python has some such modes. Be sure to obfuscate your variable names :-)
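To make option 2 concrete, here is a minimal sketch assuming a plumber-based API; proprietary_score() and the file name are hypothetical placeholders, not part of any real codebase:

    # --- score_api.R ---
    #* Score one record posted as JSON
    #* @post /score
    function(req) {
      input <- jsonlite::fromJSON(req$postBody)
      proprietary_score(input)   # the sensitive logic never leaves your server
    }

    # --- launch on infrastructure you control, reachable from the client's environment ---
    library(plumber)
    plumb("score_api.R")$run(host = "0.0.0.0", port = 8000)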
I think your "safest" options are #2 and #4.
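And a sketch of option 4: a thin R wrapper around a compiled, non-R executable. The binary name score_engine and its flags are hypothetical:

    library(processx)

    run_scoring <- function(input_csv, output_csv) {
      res <- processx::run(
        command = "./score_engine",                       # the compiled binary you deploy
        args    = c("--in", input_csv, "--out", output_csv),
        error_on_status = TRUE
      )
      invisible(res$status)
    }

    # base R alternative, without the processx dependency:
    # system2("./score_engine", c("--in", input_csv, "--out", output_csv))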

Can you have multiple plans using R package drake?

I know it is not best practice to use the R package called drake within a notebook tool, but I'm doing it anyway as a workaround for the limitations of the collaboration infrastructure we have on my team at work. Since my code is broken up into chunks that are distributed throughout sections of the notebook, it would be useful to have multiple analysis plans: each plan would be executed in the appropriate section, and further plans would be written and executed in subsequent sections of the notebook. Is it possible to write multiple plans in drake?
Sorry I am late to this thread. I am the maintainer of the drake R package, and I usually expect to receive questions on the issue tracker. A drake-r-package StackOverflow tag would really help me keep up, but I do not have that privilege.
Anyway, interesting use case. I do see some workarounds:
Separate caches for separate plans. drake uses storr to cache its targets, and you could create different caches for different sections of your report. Essentially, your report would manage a bunch of separate drake projects. See this chapter in the manual for more on the caching system. Use the cache argument to make() to supply a custom (non-default) storr cache.
Separate plans and a single cache. Here, you would need to ensure that each plan has a completely unique set of targets. If there is overlap, then some targets will always rebuild whenever you run the report.
A single cumulative plan. Essentially, when it comes time to build additional targets as you move through the report, you can add new rows to an existing plan. In fact, this is the recommended approach for large complex projects (related example here). For even more control, use the targets argument to make() to only build a select few targets and their out-of-date dependencies.
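A rough sketch of the first workaround, with one cache per notebook section; the target and path names are just placeholders:

    library(drake)
    library(storr)

    # section 1 of the notebook
    plan_section1 <- drake_plan(
      raw_data = head(mtcars),
      data_summary = summary(raw_data)
    )
    cache_section1 <- storr_rds(".drake_section1")
    make(plan_section1, cache = cache_section1)

    # section 2 of the notebook, with its own plan and its own cache
    plan_section2 <- drake_plan(
      model = lm(mpg ~ wt, data = mtcars),
      coefs = coef(model)
    )
    cache_section2 <- storr_rds(".drake_section2")
    make(plan_section2, cache = cache_section2)

    # with a single cumulative plan you could instead build a subset:
    # make(full_plan, targets = c("model"))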
I'm not sure I understand the question. We often use Drake in a Jupyter notebook, and we do try to support that use case (via the Python bindings).
By "plans", do you mean mathematical programs? Or inverse kinematics calls? Both should be OK in a notebook framework. Or are you actually calling them in parallel?
I'm not sure how R fits into it.

Centralizing libraries in julia

I've long thought about learning julia - a language I secretly hope will become the new standard for scientific computing - and now that it is packaged and included in the standard Ubuntu repositories, I figured it was time. I quickly found this tutorial and started hacking...
In the linked chapter, one is urged to download a library called ols.jl from a Github repository, place it in the local directory and start using it. I feel there must be a better way of doing this.
For example, it would be logical to have some "default"-directory in which julia can always look for library files. That folder could reside under my home directory, or (perhaps even better) somewhere under e.g. /usr/share/lib on an Ubuntu system.
Also, downloading the libraries directly seems to me like something I should be able to avoid. Isn't it possible to find libraries like these in some sort of packaging system (be it via Ubuntu's apt-get or something else)?
I do realize that many of these questions and concerns may be just because julia is a young language, that most of these features are missing because of this, and that there are plans (or at least wishes) to go in this direction in the future. However, it would be nice to know if I'm just missing something obvious =)
That tutorial on Forio is ancient. There's a newer, much better package system as of version 0.1 of Julia. See the documentation here: http://docs.julialang.org/en/release-0.1/manual/packages/

Are there features of R that are system-dependent?

My co-workers would like to make sure that our work in R is platform-independent, specifically that code will run on Linux, Mac, and Windows, and that files created on one system will work on other systems.
Since the issue has come up before in my group, I would appreciate a general answer that will make it easier for me to confidently assure my collaborators that there will not be an issue. E.g., it would help to have a reference other than "because (subject matter expert) said so on SO".
Generally, is there a way to know if any features of R are platform-specific (can I assume that this would be stated in a function's help)?
Are there packages or functions that I can be confident will be platform-independent?
Are there types of packages or functions that I should be wary of?
I have previously asked two questions about the cross-platform readability of files created by R: What are the disadvantages of using .Rdata files compared to HDF5 or netCDF? and Are R objects dumped using `dump` readable cross-platform?
Besides Carl's answer, the obvious way to ensure that your work is platform-independent is to test on all platforms.
Which is precisely what CRAN does with its 3800+ packages, and you have access to logs here.
In short, R really tries hard to be platform-independent, and mostly succeeds. To do so with your code, it is up to you to avoid APIs or tools which introduce dependencies. Look at abstractions like system.file(package="boot") and the functions they use: you can easily abstract file-system "roots", and separators are already taken care of.
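For example, a few of the path abstractions I have in mind; the package and file names are just for illustration:

    system.file(package = "boot")                  # root of an installed package, on any OS
    system.file("DESCRIPTION", package = "boot")   # a file inside that package
    p <- file.path("results", "2024", "model.rds") # separators handled for you
    dir.create(dirname(p), recursive = TRUE, showWarnings = FALSE)
    saveRDS(mtcars, p)
    normalizePath(p)                               # the OS-native form, if you ever need it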
Check cran.r-project.org for package listings. Every package has a page which will tell you if it's passed testing for different operating systems. Further, as you suggested, the help files are pretty explicit about OS dependencies.
R is "smart" enough to translate "/" to "\" in pathnames for those poor folks working in Windows.
Generally speaking, graphics access is the area most likely to have platform dependencies. Obviously, if your system lacks {X11, ImageMagick, ...} you're stuck anyway.
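If in doubt, you can query the running build before relying on a platform-specific feature; a quick sketch:

    capabilities()[c("X11", "tcltk", "jpeg", "png", "cairo")]  # what this build supports
    .Platform$OS.type                                          # "unix" or "windows"
    Sys.info()[["sysname"]]                                    # "Linux", "Darwin", "Windows", ...
    if (!capabilities("X11")) {
      message("No X11 available; using a file-based graphics device instead")
      png("plot.png"); plot(1:10); dev.off()
    }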
Besides Carl's and Dirk's comments, you should understand that any package that requires compilation from source (as do many (all?) packages on Omegahat, RForge, or R-Forge) will need to be built on a machine that has the proper C and Fortran compilers and libraries. Some interesting packages depend on GTK+ and Tcl/Tk, and there may be a need to make sure you can get the right versions. The http://r.research.att.com/ page that Simon Urbanek maintains is a useful resource for keeping up with supporting resources for Macs.

How to keep abreast of known bugs and bug fixes in R packages?

Is there a standard R community resource for keeping up to date on known bugs or bug fixes for packages? My current approach is rather manual. (NB: I'm restricting this to CRAN - see Note 1.)
My use case is basically bug surveillance and the management of package updates. I've been averaging a couple of bug discoveries each month for a while (which I duly report to the authors ;-)). Since a lot of my work is done with virtual machines, I tend to update the VM images when I have a good handle on the bug status for the necessary packages. When a bunch of bugs are fixed, I can remove my workarounds, which is great, and I update the images. When I discover an outbreak of bugs, I don't create a new image.
Here are the sources I'm currently using:
NEWS files: Many, but not all, packages have NEWS files. These are certainly a helpful place to start.
Package home page: Some packages do not have a NEWS file on CRAN, but separately post a change log on the author's site.
R project-hosted mailing lists
Google Groups for packages
Personal communication with package authors
Bug tracking for packages (e.g. a developer may use Bugzilla)
It's one thing to be the first to discover a bug (I grant that bugs happen to all of us), it's another to belatedly discover a bug that is either already known or, better yet, already fixed. Both slow down my own coding, but better bug surveillance (maybe we need a cdc4R package :)) would significantly reduce the impact. Without a standard update alerting system (e.g. an extension to update.packages() that reports on which packages could be updated and links to info on what's changed), it's the user's job to seek out this information.
As such a user, trying to seek out this information, is there some standard resource I've overlooked in the list above? For instance, is there an R mailing list where it's common for developers to post their changes and bug fixes? Or is there a site that aggregates such posts, posts tests (CRAN posts R CMD CHECK output, it seems), or that gives some other feedback?
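For concreteness, the manual checking I do today looks roughly like this (a sketch: news() only returns results when a package ships a parseable NEWS file, and the package name is just an example):

    old.packages()                            # installed packages with a newer version on CRAN
    news(package = "Matrix")                  # NEWS entries for an installed package
    news(grepl("bug|fix", Text, ignore.case = TRUE), package = "Matrix")
    packageDescription("Matrix")$Version      # exactly what is installed right now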
A few additional notes on other resources, for others' benefit:
I see that CRANberries has a terse diff summary on packages, which is new to me. (I am inspired to consider a grep for bug or fix in the diff output.)
bug.report() in R is a good way to send a message to R Core or the email address of a package maintainer.
Several testing packages worth consideration are: testthat, RUnit, and svUnit.
My personal "quick test" is to simply use digest to verify that results match, without having to test equality of very large objects.
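A sketch of that quick test; the model here is just a stand-in for whatever large result I am checking:

    library(digest)

    result <- lm(mpg ~ wt, data = mtcars)$coefficients
    reference_hash <- digest(result)        # stored once, e.g. alongside the VM image
    # ... after updating packages, rerun the analysis and compare:
    stopifnot(identical(digest(result), reference_hash))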
Note 1: I'm tagging this cran because it's impossible to manage the universe of all R packages. For an individual package author, one can distribute a package wherever they'd like, use whatever mailing list or bug tracking system they like, etc. However, that's outside the "mainstream" for R. Were I to release a package and alert users to changes, bugs, bugfixes, I'd go with CRAN + NEWS + Bugzilla + Google Groups + R-Forge (and/or RForge), etc., but is there another standard reporting mechanism that is missing from this list?
In some sense, this note also serves to ask if there's a mechanism that developers are encouraged to use. I suspect there is no standard, as packages by R Core members seem to do many different things regarding bug and change reporting.
Note 2: I'm also adding administration (though something else may be more apropos), since this also relates to administering R. For reproducibility, administration of packages is quite important; when there are multiple users or more moving pieces, keeping aware of bugs and fixes becomes an administrative task, as well as an important consideration for development that depends on the external packages. If another tag, e.g. system-administration is more appropriate, I'm open to a change.
Not a complete answer but here are some thoughts.
In the case of data.table we track bugs (and feature requests) on R-Forge here. I imagine you could query R-Forge's tracker (programmatically) for all packages hosted there, to add to your list anyway. That web tracker is where bug.report(package="data.table") points to (not just an email address to the maintainer).
Also, anyone can subscribe to any <pkgname>-commits@lists.r-forge.r-project.org mailing list to receive a unified diff and commit message (at the time of commit) for each project on R-Forge. I'm not aware of a general mailing list spanning any commit to any R-Forge project, though.
At the top of ?data.table there is a link to up to the minute NEWS. This is how we communicate to users what is in the latest version (and in development) if they upgrade. That link updates in real-time; i.e., "up to the minute" is meant literally. But, they do have to check there!

Resources