Are there any good resources/best-practices to "industrialize" code in R for a data science project? - r

I need to "industrialize" an R code for a data science project, because the project will be rerun several times in the future with fresh data. The new code should be really easy to follow even for people who have not worked on the project before and they should be able to redo the whole workflow quite quickly. Therefore I am looking for tips, suggestions, resources and best-practices on how to achieve this objective.
Thank you for your help in advance!

You can make an R package out of your project, because it has everything you need for a standalone project that you want to share with others:
Easy to share, download and install
R has a very efficient documentation system for your functions and objects, especially when you work within RStudio. Combined with roxygen2, it lets you document every function precisely, and it makes the code clearer since you can avoid long inline comments (but please still add them where needed); a minimal example follows this list.
You can specify quite easily which dependencies your package needs, so that everyone knows what to install for your project to work. You can also use packrat if you want to mimic Python's virtualenv.
R also provides a long-format documentation system called vignettes, which are similar to a printed notebook: you can display code, text, code results, etc. This is where you will write guidelines and methods on how to use the functions, provide detailed instructions for a certain method, and so on. Once the package is installed, they are automatically included and available to all users.
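For instance, a small roxygen2 block might look like the following (the function and its fields are invented purely for illustration); devtools::document() or roxygen2::roxygenise() then turns such comments into the package's help pages:
#' Clean the raw input data
#'
#' Drops rows with missing identifiers and parses the date column.
#'
#' @param raw_data A data.frame as returned by the import step.
#' @return A cleaned data.frame ready for analysis.
#' @export
clean_input <- function(raw_data) {
  raw_data <- raw_data[!is.na(raw_data$id), ]
  raw_data$date <- as.Date(raw_data$date)
  raw_data
}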
The only downside is the following: since R is a functional programming language, a package mainly consists of functions, plus some other relevant objects (data, for instance), but not really scripts.
More details about that last point: if your project consists of a script that calls a set of functions to do something, it cannot directly appear within the package. Two options here: a) you write a dispatcher function that runs a set of functions to do the job, so that users only have to call one function to run the whole method (not great for maintenance); b) you make the whole script appear in a vignette (see above). With this method, people just have to write a single R file (which can be copy-pasted from the vignette), which may look like this:
library(mydatascienceproject)
library(...)
...
dothis()
dothat()
finishwork()
That enables you to execute the whole workflow from a terminal or a remote machine with Rscript, for example (using argparse to add arguments):
Rscript myautomatedtask.R --arg1 anargument --arg2 anotherargument
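A rough sketch of what myautomatedtask.R could look like with the argparse package (dothis(), dothat() and finishwork() are the placeholder functions from the example above; passing the parsed arguments to them is just for illustration):
#!/usr/bin/env Rscript
library(argparse)
library(mydatascienceproject)

parser <- ArgumentParser(description = "Run the full analysis")
parser$add_argument("--arg1", help = "first input, e.g. a path to fresh data")
parser$add_argument("--arg2", help = "second input, e.g. an output folder")
args <- parser$parse_args()

dothis(args$arg1)
dothat()
finishwork(args$arg2)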
And finally if you write a bash file calling Rscript, you can automate everything !
Feel free to read Hadley Wickham's book about R packages; it is super clear, full of best practices, and a great help for writing your own packages.

One can get lost in the multiple files in the project's folder, so it should be structured properly: link
Naming conventions that I use: first, second.
Set the random seed so that the outputs are reproducible.
Documentation is important: you can use the roxygen skeleton in RStudio (default Ctrl+Alt+Shift+R).
I usually separate the code into smaller, logically cohesive scripts and use a main.R script that sources the others (see the sketch below).
If you use a special set of libraries, you can consider using packrat. Once you set it up, you can manage the installed project-specific libraries.
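A minimal sketch of such a main.R (the script names are invented; adapt them to your project):
# main.R -- runs the whole workflow in order, reproducibly
set.seed(42)                 # fix the random seed so outputs are reproducible

source("01_load_data.R")     # read raw data
source("02_clean_data.R")    # handle missing values, recode variables
source("03_analysis.R")      # models and tables
source("04_report.R")        # figures and report output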

Related

R testthat: use external package only in test file - not in DESCRIPTION

I'm trying to run a testthat script using GitHub Actions.
I would like to test a functionality of my function that allows it to be combined with (many) external packages. Now I want to test these external packages for R CMD check, but I don't want to load the external packages generally (i.e. put them into the DESCRIPTION) - after all, most people will not use these external packages.
Any ideas how to just include an external package in the testing files but not in the DESCRIPTION?
Thanks!
I think you describe a very standard use of Suggests.
I see two related but separable issues:
You want to test something using CI, in this case GHA. That is fine. Because you control the execution of the code, you could move your code from the test runner to, say, inst/examples and call it explicitly. That way the standard check of 'is the package using undeclared code' passes, since inst/examples is not checked.
You want to avoid forcing other people to load these packages. That is fine too, and we have Suggests: for this! Read Section 1.1 of Writing R Extensions for all the detailed semantics. If your package invokes other packages from its tests, then every R CMD check touches this code (and the external packages), so they must be declared. But you already know that only "some" people will want to use this "some of the time": that is precisely what Suggests: does, and you bracket the use with if (requireNamespace(pkgHere, quietly=TRUE)).
You can go either way, or even combine both. But you cannot call packages from tests and not declare them.
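As a concrete sketch of the Suggests: route (extraPkg, some_helper() and my_function() are placeholder names, not real packages or functions): declare the package under Suggests: in DESCRIPTION, for example
Suggests:
    extraPkg,
    testthat
and then guard its use inside the test file, e.g. tests/testthat/test-extra.R:
library(testthat)

test_that("my_function() plays nicely with extraPkg", {
  skip_if_not_installed("extraPkg")   # skip, rather than fail, when the package is absent
  # equivalently: if (!requireNamespace("extraPkg", quietly = TRUE)) skip("extraPkg not available")
  res <- my_function(extraPkg::some_helper(1:10))   # placeholder call
  expect_true(is.numeric(res))
})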

Add an Rcpp file to an existing R package?

I have already made a simple R package (pure R) to solve a problem with brute force; then I tried to speed up the code by writing an Rcpp script. I wrote a script to compare the running times with the "bench" library. Now, how can I add this script to my package? I tried to add
#' @importFrom Rcpp cppFunction
on top of my R script and inserting the Rcpp file in the src folder, but it didn't work. Is there a way to add it to my R package without creating the package from scratch? Sorry if this has already been asked, but I am new to all this and completely lost.
That conversion is actually (still) surprisingly difficult (in the sense of requiring more than just one file). It is easy to overlook details. Let me walk you through why.
Let us assume for a second that you started a working package using the R function package.skeleton(). That is the simplest and most general case. The package will work (yet have warnings; see my pkgKitten package for a wrapper that cleans this up, and a dozen other package-helping functions and packages on CRAN). Note in particular that I have said nothing about roxygen2, which at this point is just an added complication, so let's focus on just .Rd files.
You can now contrast your simplest package with one built by and for Rcpp, namely by using Rcpp.package.skeleton(). You will see at least these differences:
DESCRIPTION for LinkingTo: and Imports
NAMESPACE for importFrom as well as the useDynLib line
a new src directory and a possible need for src/Makevars
All of which make it easier to (basically) start a new package via Rcpp.package.skeleton() and copy your existing package code into that package. We simply do not have a conversion helper. I still do the "manual conversion" you tried every now and then, and even I need a try or two and I have seen all the error messages a few times over...
So even if you don't want to "copy everything over" I think the simplest way is to
create two packages with and without Rcpp
do a recursive diff
ensure the difference is applied in your original package.
PS And remember that when you use roxygen2 and have documentation in the src/ directory to always first run Rcpp::compileAttributes() before running roxygen2::roxygenize(). RStudio and other helpers do that for you but it is still easy to forget...
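For reference, here is roughly what the manual conversion tends to require (PackageName and the file names are placeholders; the exact lines come from comparing the two skeletons as described above).
In DESCRIPTION:
Imports: Rcpp
LinkingTo: Rcpp
In NAMESPACE:
importFrom(Rcpp, evalCpp)
useDynLib(PackageName, .registration = TRUE)
Your C++ file goes into src/ and exports functions with the Rcpp attribute // [[Rcpp::export]]. Then regenerate the glue code and the documentation, in this order:
Rcpp::compileAttributes()   # writes src/RcppExports.cpp and R/RcppExports.R
devtools::document()        # or roxygen2::roxygenize(), if you use roxygen2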

Where can I find information on how to structure long R code? [duplicate]

Does anyone have any wisdom on workflows for data analysis related to custom report writing? The use-case is basically this:
Client commissions a report that uses data analysis, e.g. a population estimate and related maps for a water district.
The analyst downloads some data, munges the data and saves the result (e.g. adding a column for population per unit, or subsetting the data based on district boundaries).
The analyst analyzes the data created in (2), gets close to her goal, but sees that she needs more data and so goes back to (1).
Rinse repeat until the tables and graphics meet QA/QC and satisfy the client.
Write report incorporating tables and graphics.
Next year, the happy client comes back and wants an update. This should be as simple as updating the upstream data by a new download (e.g. get the building permits from the last year), and pressing a "RECALCULATE" button, unless specifications change.
At the moment, I just start a directory and ad-hoc it the best I can. I would like a more systematic approach, so I am hoping someone has figured this out... I use a mix of spreadsheets, SQL, ARCGIS, R, and Unix tools.
Thanks!
PS:
Below is a basic Makefile that checks for dependencies between various intermediate datasets (with a .RData suffix) and scripts (.R suffix). Make uses timestamps to check dependencies, so if you touch ss07por.csv, it will see that this file is newer than all the files / targets that depend on it, and execute the given scripts in order to update them accordingly. This is still a work in progress, including a step for putting data into an SQL database, and a step for a templating language like Sweave. Note that Make relies on tabs in its syntax, so read the manual before cutting and pasting. Enjoy and give feedback!
http://www.gnu.org/software/make/manual/html_node/index.html#Top
R=/home/wsprague/R-2.9.2/bin/R

persondata.RData : ImportData.R ../../DATA/ss07por.csv Functions.R
	$R --slave -f ImportData.R

persondata.Munged.RData : MungeData.R persondata.RData Functions.R
	$R --slave -f MungeData.R

report.txt: TabulateAndGraph.R persondata.Munged.RData Functions.R
	$R --slave -f TabulateAndGraph.R > report.txt
I generally break my projects into 4 pieces:
load.R
clean.R
func.R
do.R
load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project at this point I'll either write out the workspace using save() or just keep things in memory for the next step.
clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.
func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.
do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.
The main motivation for this setup is working with large data, where you don't want to have to reload the data each time you make a change to a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project, quickly read load.R to work out what data I need to update, and then look at do.R to work out what analysis was performed.
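As a rough sketch of how the pieces fit together, do.R under this layout might look something like this (run_analysis(), make_plots() and summarise_results() stand in for whatever func.R actually defines):
# do.R -- assumes load.R and clean.R have already been run and saved their result
load("data/cleaned.RData")   # restores the cleaned objects, e.g. cleaned_df
source("func.R")             # defines functions only, no side effects

results <- run_analysis(cleaned_df)
make_plots(results, out_dir = "figures")
write.csv(summarise_results(results), "output/summary.csv", row.names = FALSE)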
If you'd like to see some examples, I have a few small (and not so small) data cleaning and analysis projects available online. In most, you'll find a script to download the data, one to clean it up, and a few to do exploration and analysis:
Baby names from the social security administration
30+ years of fuel economy data from the EPA
A big collection of data about the housing crisis
Movie ratings from the IMDB
House sale data in the Bay Area
Recently I have started numbering the scripts, so it's completely obvious in which order they should be run. (If I'm feeling really fancy I'll sometimes make it so that the exploration script will call the cleaning script which in turn calls the download script, each doing the minimal work necessary - usually by checking for the presence of output files with file.exists. However, most times this seems like overkill).
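For what it's worth, that minimal-work chaining can be as simple as a guard at the top of each script; a sketch with invented file names:
# top of 2-clean.r: run the download step only if its output is missing
if (!file.exists("data/raw.csv")) {
  source("1-download.r")            # expected to create data/raw.csv
}
raw   <- read.csv("data/raw.csv")
clean <- raw[!is.na(raw[[1]]), ]    # stand-in for the real cleaning steps
write.csv(clean, "data/clean.csv", row.names = FALSE)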
I use git for all my projects (a source code management system), so it's easy to collaborate with others, see what is changing, and roll back to previous versions.
If I do a formal report, I usually keep R and latex separate, but I always make sure that I can source my R code to produce all the code and output that I need for the report. For the sorts of reports that I do, I find this easier and cleaner than working with latex.
I agree with the other responders: Sweave is excellent for report writing with R. And rebuilding the report with updated results is as simple as re-calling the Sweave function. It's completely self-contained, including all the analysis, data, etc. And you can version control the whole file.
I use the StatET plugin for Eclipse for developing the reports, and Sweave is integrated (Eclipse recognizes the LaTeX formatting, etc.). On Windows, it's easy to use MiKTeX.
I would also add that you can create beautiful reports with Beamer. Creating a normal report is just as simple. I included an example below that pulls data from Yahoo! and creates a chart and a table (using quantmod). You can build this report like so:
Sweave(file = "test.Rnw")
Here's the Beamer document itself:
%
\documentclass[compress]{beamer}
\usepackage{Sweave}
\usetheme{PaloAlto}
\begin{document}
\title{test report}
\author{john doe}
\date{September 3, 2009}
\maketitle
\begin{frame}[fragile]\frametitle{Page 1: chart}
<<echo=FALSE,fig=TRUE,height=4, width=7>>=
library(quantmod)
getSymbols("PFE", from="2009-06-01")
chartSeries(PFE)
@
\end{frame}
\begin{frame}[fragile]\frametitle{Page 2: table}
<<echo=FALSE,results=tex>>=
library(xtable)
xtable(PFE[1:10,1:4], caption = "PFE")
@
\end{frame}
\end{document}
I just wanted to add, in case anyone missed it, that there's a great post on the learnr blog about creating repetitive reports with Jeffrey Horner's brew package. Matt and Kevin both mentioned brew above. I haven't actually used it much myself.
The entries follow a nice workflow, so it's well worth a read:
Prepare the data.
Prepare the report template.
Produce the report.
Actually producing the report once the first two steps are complete is very simple:
library(tools)
library(brew)
brew("population.brew", "population.tex")
texi2dvi("population.tex", pdf = TRUE)
For creating custom reports, I've found it useful to incorporate many of the existing tips suggested here.
Generating reports:
A good strategy for generating reports involves the combination of Sweave, make, and R.
Editor:
Good editors for preparing Sweave documents include:
StatET and Eclipse
Emacs and ESS
Vim and Vim-R
R Studio
Code organisation:
In terms of code organisation, I find two strategies useful:
Read up about analysis workflow (e.g., ProjectTemplate, Josh Reich's ideas, my own presentation on R workflow: slides and video)
Study example reports and discern the workflow
Hadley Wickham's examples
My examples on github
Examples of reproducible research listed on Cross Validated
I use Sweave for the report-producing side of this, but I've also been hearing about the brew package - though I haven't yet looked into it.
Essentially, I have a number of surveys for which I produce summary statistics. Same surveys, same reports every time. I built a Sweave template for the reports (which takes a bit of work). But once the work is done, I have a separate R script that lets me point it at the new data. I press "Go", Sweave dumps out a few score .tex files, and I run a little Python script to pdflatex them all. My predecessor spent ~6 weeks each year on these reports; I spend about 3 days (mostly on cleaning data; escape characters are hazardous).
It's very possible that there are better approaches now, but if you do decide to go this route, let me know - I've been meaning to put up some of my Sweave hacks, and that would be a good kick in the pants to do so.
I'm going to suggest something in a different sort of direction from the other submitters, based on the fact that you asked specifically about project workflow, rather than tools. Assuming you're relatively happy with your document-production model, it sounds like your challenges really may be centered more around issues of version tracking, asset management, and review/publishing process.
If that sounds correct, I would suggest looking into an integrated ticketing/source management/documentation tool like Redmine. Keeping related project artifacts such as pending tasks, discussion threads, and versioned data/code files together can be a great help even for projects well outside the traditional "programming" bailiwick.
Agreed that Sweave is the way to go, with xtable for generating LaTeX tables. Although I haven't spent too much time working with them, the recently released tikzDevice package looks really promising, particularly when coupled with pgfSweave (which, as far as I know, is only available on rforge.net at this time -- there is a link to r-forge from there, but it's not responding for me at the moment).
Between the two, you'll get consistent formatting between text and figures (fonts, etc.). With brew, these might constitute the holy grail of report generation.
At a more "meta" level, you might be interested in the CRISP-DM process model.
"make" is great because (1) you can use it for all your work in any language (unlike, say, Sweave and Brew), (2) it is very powerful (enough to build all the software on your machine), and (3) it avoids repeating work. This last point is important to me because a lot of the work is slow; when I latex a file, I like to see the result in a few seconds, not the hour it would take to recreate the figures.
I use a project template along with RStudio; currently mine contains the following folders:
info: PDFs, PowerPoints, docs... which won't be used by any script
data input: data that will be used by my scripts but not generated by them
data output: data generated by my scripts for further use, but not as a proper report
reports: only files that will actually be shown to someone else
R: all R scripts
SAS: because I sometimes have to :'(
I wrote custom functions so I can call smart_save(x,y) or smart_load(x) to save or load RDS files to and from the data output folder (files named with variable names) so I'm not bothered by paths during my analysis.
A custom function new_project creates a numbered project folder, copies all the files from the template, renames the RProj file, edits the setwd calls, and sets the working directory to the new project.
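The helpers themselves are not shown here; a minimal version of what smart_save() / smart_load() could look like (single-object versions, with the folder name assumed):
# save an object to "data output/<variable name>.RDS"
smart_save <- function(x, folder = "data output") {
  saveRDS(x, file.path(folder, paste0(deparse(substitute(x)), ".RDS")))
}

# load "<name>.RDS" from the same folder and recreate the variable in the caller
smart_load <- function(name, folder = "data output") {
  assign(name, readRDS(file.path(folder, paste0(name, ".RDS"))), envir = parent.frame())
}
smart_load() here takes the object name as a string; the original may well accept an unquoted name instead.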
All R scripts are in the R folder, structured as follows:
00_main.R
setwd
calls scripts 1 to 5
00_functions.R
All functions, and only functions, go there; if there are too many I'll split them into several files, all named like 00_functions_something.R. In particular, if I plan to make a package out of some of them, I'll keep those apart.
00_explore.R
a bunch of script chunks where I'm testing things or exploring my data
It's the only file where I'm allowed to be messy.
01_initialize.R
Prefilled with a call to a more general initialize_general.R script from my template folder which loads the packages and data I always use and don't mind having in my workspace
loads 00_functions.R (prefilled)
loads additional libraries
sets global variables
02_load data.R
loads csv/txt/xlsx/RDS files; there's a prefilled commented line for every type of file
displays which files have been created in the workspace
03_pull data from DB.R
Uses dbplyr to fetch filtered and grouped tables from the DB
some prefilled commented lines to set up connections and fetch.
Keep client side operations to bare minimum
No server side operations outside of this script
Displays which files have been created in the workspace
Saves these variables so they can be reloaded faster
Once it's been done once, I switch off a query_db boolean and the data will be reloaded from RDS next time.
It can happen that I have to feed data back to the DBs; if so, I'll create additional steps.
04_Build.R
Data wrangling, all the fun dplyr / tidyr stuff goes there
displays which files have been created in the workspace
save these variables
Once it's been done once, I switch off a build boolean and the data will be reloaded from RDS next time (see the 00_main.R sketch after this list).
05_Analyse.R
Summarize, model...
writes Excel and csv report files
95_build ppt.R
template for powerpoint report using officer
96_prepare markdown.R
setwd
load data
set markdown parameters if needed
render
97_prepare shiny.R
setwd
load data
set shiny parameters if needed
runApp
98_Markdown report.Rmd
A report template
99_Shiny report.Rmd
An app template
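A sketch of how 00_main.R could tie this together, including the query_db / build switches from steps 03 and 04 (the path and flag names are illustrative):
# 00_main.R
setwd("~/projects/042_some_project")   # placeholder path set by new_project

query_db <- FALSE   # set to TRUE only when the DB extract must be refreshed
build    <- FALSE   # set to TRUE only when the wrangled data must be rebuilt

source("R/01_initialize.R")
source("R/02_load data.R")
source("R/03_pull data from DB.R")   # reloads the saved RDS when query_db is FALSE
source("R/04_Build.R")               # reloads the saved RDS when build is FALSE
source("R/05_Analyse.R")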
For writing a quick preliminary report or email to a colleague, I find that it can be very efficient to copy-and-paste plots into MS Word or an email or wiki page -- often best is a bitmapped screenshot (e.g. on mac, Apple-Shift-(Ctrl)-4). I think this is an underrated technique.
For a more final report, writing R functions to easily regenerate all the plots (as files) is very important. It does take more time to code this up.
On the larger workflow issues, I like Hadley's answer on enumerating the code/data files for the cleaning and analysis flow. All of my data analysis projects have a similar structure.
I'll add my voice for Sweave. For complicated, multi-step analyses you can use a makefile to specify the different parts. This can prevent having to repeat the whole analysis if just one part has changed.
I also do what Josh Reich does, only I do it by creating my personal R packages, as this helps me structure my code and data, and it is also quite easy to share those with others.
create my package
load
clean
functions
do
creating my package: devtools::create('package_name')
load and clean: I create scripts in the data-raw/ subfolder of my package for loading, cleaning, and storing the resulting data objects in the package using devtools::use_data(object_name). Then I compile the package.
From now on, calling library(package_name) makes these data available (and they are not loaded until necessary).
functions: I put the functions for my analyses into the R/ subfolder of my package, and export only those that need to be called from outside (and not the helper functions, which can remain invisible).
do: I create a script that uses the data and functions stored in my package.
(If the analyses only need to be done once, I can put this script as well into the data-raw/ subfolder, run it, and store the results in the package to make it easily accessible.)
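A sketch of what such a data-raw/ script might contain (file and object names are placeholders; in current toolchains use_data lives in the usethis package, while older devtools versions exposed it as devtools::use_data):
# data-raw/prepare_survey.R -- load and clean, then store the result in the package
raw    <- read.csv("data-raw/survey_2019.csv")   # placeholder input file
survey <- raw[!is.na(raw$id), ]                  # stand-in for the real cleaning steps

usethis::use_data(survey, overwrite = TRUE)      # writes data/survey.rda
After rebuilding and installing, library(package_name) makes survey available (lazily loaded).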

Sourcing So many R scripts

I have lots of .r scripts that I want to source all at once. I have written a function like the one below to source them.
sourcer <- function() {
  source("wil.r")
  source("k.r")
  source("l.r")
}
Please can anyone tell me how to get this code activated, and how to call each script any time I want to use it?
In addition to the answer by @user2885462, if the amount of R code you need to source grows, you might want to wrap the code into an R package. This provides a convenient way of loading the code, and allows you to add tests, documentation, etc. Reading the official package-writing tutorial is a good place to start.
For an individual project, I like to have all (or most) of my R functions in separate .r files, all in the same folder: e.g., AllFunctions
Then at the beginning of my main code I run the following line, which sources all .r files (and files with other extensions, if they exist - which they usually don't) in the AllFunctions folder:
for (nm in list.files("AllFunctions", pattern = "\\.[RrSsQq]$")) source(file.path("AllFunctions", nm))

How to keep various-topic R scripts at hand?

I have written and collected R code on various topics that solves particular problems at hand. I stored the R scripts/code in .txt files; I now have hundreds of them.
How do you keep your R code at hand efficiently?
@Manetheran has the right idea: write a package. It's easy to do (especially with RStudio). Read "Writing R Extensions" and then, on top of that, learn about roxygen2 (which allows you to document each function in-line and avoid writing .Rd files).
Then you can use devtools to load your package locally; once it's stable, if you think other people could use the functions, you can submit your package to CRAN.
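A rough sketch of that workflow with devtools and usethis (the package name mytools is made up):
# one-off: create a package skeleton for your personal function library
usethis::create_package("~/r/mytools")

# put each function in R/ with roxygen2 comments, then, from the package directory:
devtools::document()   # turn the roxygen2 comments into .Rd help files
devtools::load_all()   # load all functions for interactive use while developing
devtools::install()    # install it so library(mytools) works from any project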
I prefer to keep it simple. I use Total Commander and when I need an example which uses some R function, I just do Alt-F7 and search for *.R files which contain the desired string.
I use RStudio and have created two or three basic scripts. I save my much-used functions in the basic script that is most appropriate. Then, at the start of an RStudio script for a project, I source one or more of the basic scripts as is appropriate.
