What is the best practice for transferring objects across R projects?

I would like to use R objects (e.g., cleaned data) generated in one git-versioned R project in another git-versioned R project.
Specifically, I have multiple git-versioned R projects (that hold drake plans) that do various things for my thesis experiments (e.g., generate materials, import and clean data, generate reports/articles).
The experiment-specific projects should ideally be:
Connectable - so that I can get objects (mainly data and materials) that I generated in these projects into another git-versioned R project that generates my thesis report.
Self-contained - so that I can use them in other non-thesis projects (such as presentations, reports, and journal manuscripts). When sharing such projects, I'd ideally like not to need to share a monolithic thesis project.
Versioned - so that their use in different projects can be independent (e.g., if I make changes to the data cleaning for a manuscript after submitting the thesis, I still want the thesis to be reproducible as it was originally compiled).
At the moment I can see three ways of doing this:
Re-create the data cleaning process
But: this involves copy/paste, which I'd like to avoid, especially if things change upstream.
Access the relevant scripts/functions by changing the working directory
But: even if I used the here package, this approach seems bad for reproducibility.
Make the source projects into packages and make the objects I want to "export" into exported data (as per the data section of Hadley's R packages guide)
But: I'd like to avoid the unnecessary metadata, artefacts, and noise (e.g., see Miles McBain's "Project as an R package: An okay idea") if I can.
Is there any other way of doing this?
Edit: I tried @landau's suggestion of using a single drake plan, which worked well for a while, until (similar to @vrognas' case) I ended up with too many sub-projects (e.g., conference presentations and manuscripts) that relied on the same objects. I have therefore added some clarifications above about my intentions with the question.

My first recommendation is to use a single drake plan to unite the stages of the overall project that need to share data. drake is designed to handle a lot of moving parts this way, and it will be more seamless when it comes to drake's decisions about what to rerun downstream. But if you really do need different plans in different sub-projects that share data, you can track each shared dataset as a file_out() file in one plan and track it with file_in() in another plan.
library(drake)
library(readr)  # assuming readr provides write_csv()/read_csv()

# Upstream project: write the shared dataset and track the file as an output
upstream_plan <- drake_plan(
  export_file = write_csv(dataset, file_out("exported_data/dataset.csv"))
)

# Downstream project: track the same file as an input
downstream_plan <- drake_plan(
  dataset = read_csv(file_in("../upstream_project/exported_data/dataset.csv"))
)
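Each plan would then be built in its own project with make(), upstream first so the exported file exists before the downstream plan reads it:

# in the upstream project
make(upstream_plan)

# in the downstream project
make(downstream_plan)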

You fundamentally misunderstood Miles McBain’s critique. He isn’t saying that you shouldn’t write reusable code or that you shouldn’t use packages; he’s saying that you shouldn’t use packages for everything. Reusable code (i.e., code that you want to reuse) absolutely belongs in packages (or, better, modules), which can then be used in multiple projects.
That being said, first off, pay attention to Will Landau’s advice.
Secondly, you can make your RStudio projects configurable such that they can load data based on paths given in a configuration. Once that’s accomplished, nothing speaks against hard-coding paths to data in different projects inside that config file.
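One minimal sketch of that idea, assuming the config package and a hypothetical data_dir entry in a config.yml at the project root:

# config.yml in the downstream project
default:
  data_dir: "../upstream_project/exported_data"

# R code in the downstream project
data_dir <- config::get("data_dir")
dataset  <- readr::read_csv(file.path(data_dir, "dataset.csv"))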

I am in a similar situation. I have many projects that are spawned from one raw dataset. Previously, when the project was young and small, I had it all in one version-controlled project. This got out of hand as more sub-projects were spawned, and my git history got cluttered from working on projects in parallel. This could be due to my lack of git skills. My folder structure looked something like this:
project/.git
project/main/
project/sub-project_1/
project/sub-project_2/
project/sub-project_n/
I contemplated having each project in its own git branch, but then I could not access them simultaneously. If I had to change something in the main dataset (e.g., parts I had not yet cleaned), then project 1 could become outdated and nonfunctional. Once I had finished project 1, I would have liked it to be isolated and contained for reproducibility. This is easier to achieve if the projects are separated. I don't think a drake/targets plan would solve this?
I also looked briefly into having the projects as git submodules but it seemed to add too much complexity. Again, my git ignorance might shine through here.
My current solution is to have the main data as an R package, and each sub-project as a separate git-versioned folder (they are actually packages as well, but this is not necessary). This way I can load a specific version of the data (using renv for package versions).
My folder structure now looks something like this:
main/.git
sub-project_1/.git
sub-project_2/.git
sub-project_n/.git
And inside each sub-project, I call library(main) to load the cleaned data. Within each sub-project, a drake/targets plan could be used.
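A minimal sketch of that setup (the object and path names are assumptions, not part of the original answer):

# In the main package: store the cleaned data as package data
usethis::use_data(cleaned_data)    # creates data/cleaned_data.rda

# In a sub-project, with renv pinning the version of main
renv::install("path/to/main")      # or a tagged release from your git remote
library(main)
data(cleaned_data)                 # load the versioned, cleaned data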

Related

Shiny apps in R: how to structure them correctly

I created my first Shiny app, which runs perfectly on my laptop.
However, I need to submit it to my professor and I want to make sure he will be able to run it.
I have a UI file, a server file, a global file and a process file.
The process file stores the data preparation.
The global file reads two RDS files which are the datasets that I use in the server.
Where should my libraries be loaded? For example, the app does not run without leaflet; how can I ensure the libraries are loaded automatically?
My RDS files are saved to my local drive, which means my professor would need to change the paths in order to use them. How can I avoid this?
Should I put the UI, server, and global code into one R script, or is it OK to have them in separate scripts?
Thank you!
As you describe your current setup, the most obvious place to load your libraries is at the start of your global file. (Or at the start of app.R if you move to a single-file configuration.) Though it's not exactly what the reprex package is designed for, you could probably use reprex to make sure that your code is reproducible and independent of anything you may have overlooked. (You've already identified the obvious issue with data files.) Look here for more information on reprex.
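A minimal sketch of that layout (file and object names are assumptions):

# global.R -- sourced before the UI and server when the app starts
library(shiny)
library(leaflet)

# use paths relative to the app directory so others can run it unchanged
map_data   <- readRDS("data/map_data.rds")
table_data <- readRDS("data/table_data.rds")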
Indeed. This is a problem. If you must load the data from files, you need to find a way of providing them to your professor. Telling him to edit your code by hand is not a good way to start. Exactly how to do this depends on the infrastructure your institution has set up, so it's difficult to advise you what to do. Do you not have a shared area for your course? Ask your fellow students - or even your professor.
Whether you bundle the app in a single file or in separate files is really a matter of choice. I don't think there's a wrong way and a right way. For me, the deciding factor is usually the size of the app. Large apps get several files. For small apps, separation is just an unnecessary complication. Another factor - which doesn't seem to be an issue here - is whether I see the app as a potential front end for some methodology that might be worthy of its own package. In that case, I develop the app and the package as separate entities.
Questions 1 and 3 depend on criteria for code quality such as performance and maintainability. As the code grows, there will come a point when it is more difficult to handle all the code in one file. Once it grows even further, you will reach the point where you need to split the app into modules to keep your code easy to maintain.
Regarding libraries, I advise declaring them at the entry point of the app (though, basically, that is a matter of taste and style). That way, you provide maximum clarity about the app's dependencies. Again, if the app becomes very large and not all parts rely on the same packages, it could improve performance and maintainability to have each part load the packages it requires, since you do not need to load all packages at once. However, that is probably only true for very large apps.
However, since this all seems to be an exercise at university, I doubt that your app will reach higher levels of complexity.
Question 2: In a Shiny app you can provide a fileInput widget. This SO question shows you how.
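A minimal sketch of that widget (input and object names are assumptions):

# UI
fileInput("rds_file", "Choose an .rds file")

# server
uploaded_data <- reactive({
  req(input$rds_file)                # wait until a file has been uploaded
  readRDS(input$rds_file$datapath)   # read from the temporary upload path
})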

Julia: packaging things into modules vs include()-ing them

I'm building a simulation in Julia and I have my code split across a bunch of files. Are there any benefits to wrapping everything in modules versus simply include()-ing them in the runscript?
I have something like the following at the top of my runscript right now:
for filename in split(readall(`git ls-files`))
    @everywhere include(filename)   # include every git-tracked file on all workers
end
I'm not planning to use the code outside of this immediate project, but I am running the simulation in parallel. Is there any benefit in creating modules?
I would say that the most important benefit is modularity :)
If you have different files that deal with different things, splitting the code into modules lets you keep track of the dependencies between them:
Which functions are purely implementation details of the given module and subject to change?
Which modules depend on which other modules?
It also lets you reuse the same name for different things in different modules if you need to, as long as you're a little careful about what you export. (You can still access those names from the outside as qualified names.)
For an example of such organisation, you can look at my repo https://github.com/toivoh/Debug.jl

Association of any kind between QC projects?

I want to associate one QC project with another (e.g., manual testing and automation testing). I use QC 11.00
I would like to know what kind of association there can be between two QC projects (in the same domain), so that I do not have to maintain two projects and copy/paste what I need (e.g., common repositories).
I'm not sure that you can do this. A project in QC is supposed to be a self-contained entity; that is, there is no way (that I know of) to move data between projects automatically.
Sure, you can copy and paste data, as well as create a project with another one as base, but that is probably not what you want.
I would rather have manual testing and automation in the same project, which I think makes more sense. The point is that the project is supposed to identify the test object rather than the test methodology - the latter is better handled in Test Plan, where you specify a Test Type when you create your test.
This way, you will have all defects and test reports for your test object in the same project, which will make it all the easier to track what is going on.
As a general rule, you want to keep all project data for one project in that project, and you want that project's data to be unique and separate from all other projects.
That being said... if you really wanted to do this (and were able to convince a QC subject matter expert that it was a good idea?), then it should be a relatively simple matter to amend the workflow with additional code to interface with another project.

project organization with R [duplicate]

Possible Duplicate:
Workflow for statistical analysis and report writing
I have been programming with R for not too long but am running into a project organization question that I was hoping somebody could give me some tips on. I find that a lot of the analysis I do is ad hoc: I run something, think about the results, tweak it, and run some more. This is conceptually different from a language like C++, where you think about the entire thing you want to run before coding; it is a huge benefit of interpreted languages. However, the issue that comes up is that I end up with a lot of .RData files that I save so I don't have to source my script every time. Does anyone have any good ideas about how to organize my project so I can return to it a month later and have a good idea of what each file is associated with?
This is sort of a documentation question, I guess. Should I document my entire project at each stage and be rigorous about cleaning up files that are no longer necessary but were a byproduct of the research? This is my current system, but it is a bit cumbersome. Does anyone else have any other suggestions?
Per the comment below: One of the key things that I am trying to avoid is the proliferation of .R analysis files and .RData sets that go along with them.
Some thoughts on research project organisation here:
http://software-carpentry.org/4_0/data/mgmt/
the take-home message being:
Use Version Control for your programs
Use sensible directory names
Use Version Control for your metadata
Really, Version Control is a good thing.
My analysis is a knitr document, with some external .R files which are called from it.
All data is in a database, but during my analysis the processed data are saved as .RData files. Only when I delete the .RData files are they recreated from the database the next time I run the analysis. This works like a cache and saves database access and data-processing time when I rerun (parts of) my analysis.
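A minimal sketch of that cache pattern (the database backend and the process_data() helper are assumptions):

if (file.exists("cache/processed.RData")) {
  load("cache/processed.RData")      # reuse the cached processed data
} else {
  con <- DBI::dbConnect(RSQLite::SQLite(), "data/raw.sqlite")
  raw <- DBI::dbGetQuery(con, "SELECT * FROM measurements")
  DBI::dbDisconnect(con)
  processed <- process_data(raw)     # hypothetical cleaning step
  save(processed, file = "cache/processed.RData")
}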
Using a knitr (Sweave, etc.) document for the analysis enables you to easily write a documented workflow with the results included. And knitr caches the results of the analysis, so small changes usually do not result in a full rerun of all the R code, but only of a small section. That saves quite some running time for a bigger analysis.
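For reference, in a knitr .Rnw document a chunk opts into that cache via the cache option; a sketch with placeholder names:

<<model_fit, cache=TRUE>>=
fit <- lm(y ~ x, data = dat)   # rerun only when the chunk's code or inputs change
@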
(Ah, and as said before: use version control. Another tip: working with knitr and version control is very easy with RStudio.)

Where in R do I permanently store my custom functions?

I have several custom functions that I use frequently in R. Rather than source this file (or parts thereof) in each script, is there some way to add this to my base R setup so that the functions are always available when I use R?
Yes, create a package. There are numerous tutorials as well as the Writing R Extensions manual that came with your copy of R.
It may seem like too much work at first, but you will probably be glad that you did this in the longer run.
PS And you can then load that package from ~/.Rprofile. For really short code, you can also define it there.
A package may be overkill for a few useful functions. I'd argue there's nothing wrong with explicitly source()ing them as you need them - at least it is explicit, so that if you email someone your code, you won't forget to include those other scripts.
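That approach can be as simple as the following, where the file name is hypothetical:

# at the top of each analysis script
source("R/my_helpers.R")   # defines the shared helper functions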
Another option is to use the .Rprofile file. You can read about the details in ?Startup. Basically, the idea is that:
...a file called ‘.Rprofile’ is searched for in the current directory or
in the user's home directory (in that order). The user profile file is
sourced into the workspace.
You can read here about how many people use this functionality.
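A minimal sketch of such an .Rprofile (the package name myutils is hypothetical):

# ~/.Rprofile
if (interactive()) {
  suppressMessages(library(myutils))   # personal helper package
  # very short helpers can also be defined here directly
  `%||%` <- function(x, y) if (is.null(x)) y else x
}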
The accepted answer is best long-term: Make a package.
Luckily, the learning curve for doing this has been dramatically reduced by the devtools package: it automates package creation (a nice assist in getting off on the right foot), encourages good practices (like documenting with roxygen2), and helps with using online version control (Bitbucket, GitHub, or other) and sharing your package with others. It's also very helpful for smoothing your way to CRAN submission.
Good docs at http://adv-r.had.co.nz and http://r-pkgs.had.co.nz .
To create your package, for instance, you can:
install.packages("devtools")
devtools::create("path/to/package/pkgname")
You could also look at the 'mvbutils' package: it lets you set up a hierarchical set of "tasks" (folders with workspace ".RData" files in them) such that you can always see what's in the ancestral tasks (i.e., the ancestors are in the search() path). So you can put your custom functions in the "starting task" where you always start R, and then change to whatever project-specific task you require; you avoid cluttered workspaces, but you can still use (and edit) your custom functions because the starting task is always ancestral. Objects (including functions) are stored in ".RData" files and are thus loaded/saved automatically, but there are separate text-backup facilities for functions.
There are lots of different ways of working in R, and no "one-size-fits-all" best solution. It's also not easy to find an overview! Speaking just for myself:
I'm not a fan of having to 'source' everything in every time; for one thing, it simply doesn't work with big data sets and/or results of model runs.
I think packages are hard to create and maintain; there is a really significant overhead. After the first 5 packages you write, it does get a bit easier provided you do it on at least a weekly basis so you don't forget how, but really...
In fact, 'mvbutils' also has a bunch of tools for facilitating the creation and (especially) maintenance of packages, designed to interface smoothly with the task-hierarchy system. I use & edit my own packages all the time (including editing mvbutils itself); but if it wasn't for the tools in 'mvbutils', I'd be grinding my teeth in frustration most days of the week.
