Build system that supports multiple outputs per target

I work in the field of bioinformatics. My daily work processes several data files (DNA sequences, alignments, etc.) and produces many result files, so I want to use something like Unix make to automate the whole process, especially to resolve dependencies between different data.
However, Unix make only supports one output per target, as it is designed for software builds, which typically generate one object file from several source files, or one executable from several object files. If you use custom virtual targets, you lose the benefit of timestamp checking. Is there any build system that supports multiple output files per target? If there isn't, I'm going to reinvent the wheel.

Have a look at Drake, which is a replacement for make designed for data workflow management ("make for data").
Another option is makepp, which is an improved make. Among other features, it supports multiple targets per rule.


How do you use guards(1) with quilt(1)

One of the ancillary tools bundled with quilt is guards, which processes a list of guards and a configuration file matching guards and files, and outputs a list of files whose guard specifications match the provided guards.
However, I can't figure out how they're supposed to fit together: quilt(1) doesn't show any way to invoke a command to generate series files, I didn't find examples in the man pages or the working copy, and the internet is less than helpful (all the hits are about bedding).
I feel like guards has to be invoked manually whenever its "dependencies" change and the series file overwritten; is that the case? If so, how is data fed back the other way, e.g. when adding a new patch to the series, does it have to be manually synchronised to the guards file?
Background: a few years back I used mq quite a bit, but it integrates guards natively so the synchronisation back and forth is not an issue at all.

What is the best practice for transferring objects across R projects?

I would like to use R objects (e.g., cleaned data) generated in one git-versioned R project in another git-versioned R project.
Specifically, I have multiple git-versioned R projects (that hold drake plans) that do various things for my thesis experiments (e.g., generate materials, import and clean data, generate reports/articles).
The experiment-specific projects should ideally be:
Connectable - so that I can get objects (mainly data and materials) that I generated in these projects into another git-versioned R project that generates my thesis report.
Self-contained - so that I can use them in other non-thesis projects (such as presentations, reports, and journal manuscripts). When sharing such projects, I'd ideally like not to need to share a monolithic thesis project.
Versioned - so that their use in different projects can be independent (e.g., if I make changes to the data cleaning for a manuscript after submitting the thesis, I still want the thesis to be reproducible as it was originally compiled).
At the moment I can see three ways of doing this:
Re-create the data cleaning process
But: this involves copy/paste, which I'd like to avoid, especially if things change upstream.
Access the relevant scripts/functions by changing the working directory
But: even if I used the here package, it seems that this would lead to poor reproducibility.
Make the source projects into packages and make the objects I want to "export" into exported data (as per the data section of Hadley's R packages guide)
But: I'd like to avoid the unnecessary metadata, artefacts, and noise (e.g., see Miles McBain's "Project as an R package: An okay idea") if I can.
Is there any other way of doing this?
Edit: I tried @landau's suggestion of using a single drake plan, which worked well for a while, until (similar to @vrognas' case) I ended up with too many sub-projects (e.g., conference presentations and manuscripts) that relied on the same objects. I have therefore added some clarifications above about my intentions with the question.
My first recommendation is to use a single drake plan to unite the stages of the overall project that need to share data. drake is designed to handle a lot of moving parts this way, and it will be more seamless when it comes to drake's decisions about what to rerun downstream. But if you really do need different plans in different sub-projects that share data, you can track each shared dataset as a file_out() file in one plan and track it with file_in() in another plan.
upstream_plan <- drake_plan(
  export_file = write_csv(dataset, file_out("exported_data/dataset.csv"))
)
downstream_plan <- drake_plan(
  dataset = read_csv(file_in("../upstream_project/exported_data/dataset.csv"))
)
You fundamentally misunderstood Miles McBain's critique. He isn't saying that you shouldn't write reusable code or that you shouldn't use packages; he's saying that you shouldn't use packages for everything. But reusable code (i.e. code that you want to reuse) absolutely belongs in packages (or, better, modules), which can then be used in multiple projects.
That being said, first off, pay attention to Will Landau’s advice.
Secondly, you can make your RStudio projects configurable so that they load data based on paths given in a configuration file. Once that's accomplished, there's nothing wrong with hard-coding the paths to data from different projects inside that config file.
I am in a similar situation. I have many projects that are spawned from one raw dataset. Previously, when the project was young and small, I had it all in one version-controlled project. This got out of hand as more sub-projects were spawned and my git history got cluttered from working on projects in parallel. This could be due to my lack of git skills. My folder structure looked something like this:
project/.git
project/main/
project/sub-project_1/
project/sub-project_2/
project/sub-project_n/
I contemplated having each project in its own git branch, but then I could not access them simultaneously. If I had to change something in the main dataset (e.g. if I had not cleaned some parts), then project 1 could become outdated and nonfunctional. Once I had finished project 1, I would have liked it to be isolated and contained for reproducibility. This is easier to achieve if the projects are separated. I don't think a drake/targets plan would solve this?
I also looked briefly into having the projects as git submodules but it seemed to add too much complexity. Again, my git ignorance might shine through here.
My current solution is to have the main data as an R package, and each sub-project as a separate git-versioned folder (they are actually packages as well, but this is not necessary). This way I can load in a specific version of the data (using renv for package versions).
My folder structure now looks something like this:
main/.git
sub-project_1/.git
sub-project_2/.git
sub-project_n/.git
And inside each sub-project, I call library(main) to load the cleaned data. Within each sub-project, a drake/targets plan could be used.

Julia: packaging things into modules vs include()-ing them

I'm building a simulation in Julia and I have my code split across a bunch of files. Are there any benefits to wrapping everything in modules versus simply include()-ing them in the runscript?
I have something like the following at the top of my runscript right now:
for filename in split(readall(`git ls-files`))
    @everywhere include(filename)
end
I'm not planning to use the code outside of this immediate project, but I am running the simulation in parallel. Is there any benefit in creating modules?
I would say that the most important benefit is modularity :)
If you have different files that deal with different things, splitting the code into modules lets you keep track of the dependencies between them:
Which functions are purely implementation details of the given module and subject to change?
Which modules depend on which other modules?
It also lets you reuse the same name for different things in different modules if you need to, as long as you're a little careful about what you export. (You can still access those names from the outside as qualified names.)
For an example of such organisation, you can look at my repo https://github.com/toivoh/Debug.jl

Store map key/values in a persistent file

I will be creating a structure more or less of the form:
type FileState struct {
    LastModified int64
    Hash         string
    Path         string
}
I want to write these values to a file and read them in on subsequent calls. My initial plan is to read them into a map and lookup values (Hash and LastModified) using the key (Path). Is there a slick way of doing this in Go?
If not, what file format can you recommend? I have read about and experimented with some key/value file stores in previous projects, but not using Go. Right now, my requirements are probably fairly simple, so a big database server system would be overkill. I just want something I can write to and read from quickly, easily, and portably (Windows, Mac, Linux). Because I have to deploy on multiple platforms, I am trying to keep my non-Go dependencies to a minimum.
I've considered XML, CSV, JSON. I've briefly looked at the gob package in Go and noticed a BSON package on the Go package dashboard, but I'm not sure if those apply.
My primary goal here is to get up and running quickly, which means the least amount of code I need to write along with ease of deployment.
As long as your entire data set fits in memory, you shouldn't have a problem. Using an in-memory map and regularly writing snapshots to disk (e.g. with the gob package) is a good idea. The Practical Go Programming talk by Andrew Gerrand uses this technique.
If you need to access those files with different programs, using a popular encoding like JSON or CSV is probably a good idea. If you only have to access those files from within Go, I would use the excellent gob package, which has a lot of nice features.
As soon as your data becomes bigger, it's not a good idea to write the whole database to disk on every change, and your data might not fit into RAM anymore. In that case, you might want to take a look at the leveldb key-value database package by Nigel Tao, another Go developer. It's currently under active development (but not yet usable), and it will also offer some advanced features like transactions and automatic compression. The read/write throughput should be quite good because of the leveldb design.
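To make the in-memory map plus gob snapshot idea concrete, here is a minimal, untested sketch using the FileState struct from the question; the saveSnapshot/loadSnapshot names and the state.gob file name are just illustrative:

package main

import (
    "encoding/gob"
    "fmt"
    "os"
)

// FileState mirrors the struct from the question.
type FileState struct {
    LastModified int64
    Hash         string
    Path         string
}

// saveSnapshot writes the whole map to disk in gob format.
func saveSnapshot(path string, states map[string]FileState) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(states)
}

// loadSnapshot reads a previously written snapshot back into a map.
func loadSnapshot(path string) (map[string]FileState, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    states := make(map[string]FileState)
    err = gob.NewDecoder(f).Decode(&states)
    return states, err
}

func main() {
    states := map[string]FileState{
        "/tmp/a.txt": {LastModified: 1600000000, Hash: "abc123", Path: "/tmp/a.txt"},
    }
    if err := saveSnapshot("state.gob", states); err != nil {
        panic(err)
    }
    loaded, err := loadSnapshot("state.gob")
    if err != nil {
        panic(err)
    }
    fmt.Println(loaded["/tmp/a.txt"].Hash) // lookup by Path key prints "abc123"
}

The same encode/decode structure also works with encoding/json if other programs ever need to read the file.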
There's an ordered key-value persistence library for Go that I wrote, called gkvlite:
https://github.com/steveyen/gkvlite
JSON is very simple but produces bigger files because of the repeated field names. XML has no advantage. You should go with CSV, which is really simple too. Your program will be less than a page of code.
But it depends, in fact, on your modifications. If you make a lot of modifications and must have them stored synchronously on disk, you may need something a little more complex than a single file. If your map is mainly read-only, or if you can afford to dump it to a file rarely (not every second), a single CSV file alongside an in-memory map will keep things simple and efficient.
BTW, use Go's csv package (encoding/csv) to do this.
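For instance, here's a rough, untested sketch of that approach; the state.csv file name and the path,hash,lastmodified column order are arbitrary choices:

package main

import (
    "encoding/csv"
    "fmt"
    "os"
    "strconv"
)

// FileState mirrors the struct from the question.
type FileState struct {
    LastModified int64
    Hash         string
    Path         string
}

func main() {
    states := map[string]FileState{
        "/tmp/a.txt": {LastModified: 1600000000, Hash: "abc123", Path: "/tmp/a.txt"},
    }

    // Write one "path,hash,lastmodified" record per map entry.
    // The csv writer quotes fields as needed, so paths containing commas are safe.
    out, err := os.Create("state.csv")
    if err != nil {
        panic(err)
    }
    w := csv.NewWriter(out)
    for _, s := range states {
        w.Write([]string{s.Path, s.Hash, strconv.FormatInt(s.LastModified, 10)})
    }
    w.Flush()
    if err := w.Error(); err != nil {
        panic(err)
    }
    out.Close()

    // Read the records back into a map keyed by Path.
    in, err := os.Open("state.csv")
    if err != nil {
        panic(err)
    }
    records, err := csv.NewReader(in).ReadAll()
    in.Close()
    if err != nil {
        panic(err)
    }
    loaded := make(map[string]FileState, len(records))
    for _, rec := range records {
        mod, err := strconv.ParseInt(rec[2], 10, 64)
        if err != nil {
            panic(err)
        }
        loaded[rec[0]] = FileState{Path: rec[0], Hash: rec[1], LastModified: mod}
    }
    fmt.Println(loaded["/tmp/a.txt"].LastModified) // 1600000000
}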

Build Numbers synchronicity when delivering on multiple OS platforms

My organization builds a C++ application that runs on multiple operating systems.
Should the build number, visible to customer, be the same for a given state of the source code tree on all the platforms?
Yes, I think so.
One practice is to use the changelist number, or however your source control system identifies the check-in that your build system pulled to build your product. That way you always know what source to pull to rebuild it as well.
There aren't any downsides that I can see. You want to be able to reproduce the build, so each should say what the build is. If it's the same build, it should be the same build number.
If you choose not to (or, from a build-process point of view, cannot) use the exact same version number for builds made for different platforms, you should document exactly what your version numbers actually imply.
If you don't, a user of the software will typically treat the whole thing (e.g. 3.1.0.333, where 333 would be the build number) as identifying a certain version of the software (thus of the code tree). If they use your software on different platforms, they might then think that 3.1.0.333 and 3.1.0.334 actually refer to different versions (as in code changes) of the product, which might not be the case.
The same is true for other "build number" styles, like using some sort of date/time-derived build ID.
If you globally manage your build numbers, you might consider adding a platform ID, so that you can build 3.1.0.333-x86 and 3.1.0.333-amd64 one after the other but still have them share the same "number". It will also be more obvious to the user what the intent/meaning is.
