I first want to say that I'm a beginner in OCaml. I made a simple app that takes data from a JSON file, does some calculations or replaces some of the values with arguments from the command line, then writes another JSON file with the new data. It also substitutes those values into an HTML template and writes that out too. You can see my project here https://github.com/ralcr/invoice-cmd/blob/master/invoice.ml
The question is how to deal with that number of variables. In the languages I know I would probably repeat myself twice, but here it's about six times. Thanks for any advice.
First of all, I would like to note that StackExchange Code Review is probably a better place to post such questions, as the question is more about design than about the language.
I have two suggestions on how to improve your code. The first is to use string maps (or hashtables) to store your variables. The second, much more radical, is to rewrite the code in a more functional way.
Use maps
In your code, you're doing a lot of pouring the same water from one bucket into another without doing actual work. The first thing that comes to mind is whether it is necessary at all. When you parse JSON definitions into a set of variables, you do not actually reduce complexity or enforce any particular invariants. Basically, you're confusing data with code. These variables are actually data that you're processing, not a part of the logic of your application. So the first step would be to store them in a string map. Then you can easily process a big set of variables with fold and map.
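The idea of keeping the variables in one map can be sketched as follows. This is a minimal illustration written in Python for brevity, not a rewrite of invoice.ml; the field names, the override, and the `::name::` placeholder syntax are all assumptions made up for the example. The same shape carries over to a string map and a fold in OCaml.

```python
import json
from functools import reduce

# Hypothetical invoice data; the field names are invented for illustration.
raw = '{"client": "ACME", "hours": "10", "rate": "50"}'

# One dictionary holds every variable, instead of one binding per field.
variables = json.loads(raw)

# Command-line overrides are just another map merged on top (hypothetical values).
overrides = {"hours": "12"}
variables = {**variables, **overrides}

# A computed value is stored like any other entry.
variables["total"] = str(int(variables["hours"]) * int(variables["rate"]))

# A fold over the map substitutes every variable into the template.
# The "::name::" placeholder syntax is an assumption of this sketch.
template = "Client: ::client::, total: ::total::"
html = reduce(
    lambda text, item: text.replace("::" + item[0] + "::", item[1]),
    variables.items(),
    template,
)

print(html)
print(json.dumps(variables))
```

Adding a new field then means adding one entry to the data, with no change to the parsing, writing, or templating code.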
Use functions
Another approach is not to store the variables at all and to express everything as stateless transformations on JSON data. Your application looks like a JSON processor, so I don't really see any reason why you should first read everything, store it in memory, and only later produce the result. It is more natural to process data on the fly and express your logic as a set of small transformations. Try to split everything into small functions, so that each individual transformation can be easily understood. Then compose your transformation from smaller parts. This would be a functional style, where the flow of data is explicit.
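A rough sketch of that style, again in Python for brevity and with hypothetical field names: each step is a small function from a JSON document to a new JSON document, and the program is their composition. The helpers `override`, `add_total`, and `compose` are invented for this example.

```python
import json
from functools import reduce

def override(key, value):
    """Return a transformation that sets key to value in a copy of the document."""
    return lambda doc: {**doc, key: value}

def add_total(doc):
    """Return a copy of the document with a computed 'total' field."""
    return {**doc, "total": doc["hours"] * doc["rate"]}

def compose(*steps):
    """Chain transformations left to right into a single function."""
    return lambda doc: reduce(lambda acc, step: step(acc), steps, doc)

# The whole program is one pipeline built from small parts.
pipeline = compose(override("hours", 12), add_total)

source = json.loads('{"client": "ACME", "hours": 10, "rate": 50}')
result = pipeline(source)

print(json.dumps(result))
```

No step mutates its input, so `source` is still the original document after the pipeline has run.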
I'm designing an API and I want to allow my users to combine a GET parameter with AND operators. What's the best way to do this?
Specifically I have a group_by parameter that gets passed to a Mongo backend. I want to allow users to group by multiple variables.
I can think of two ways:
?group_by=alpha&group_by=beta
or:
?group_by=alpha,beta
Is either one to be preferred? I've consulted a few API design references but no-one seems to have a view on this.
There is no strict preference. The advantage to the first approach is that many frameworks will turn group_by into an array or similar structure for you, whereas in the second approach you need to parse out the values yourself. The second approach is also less verbose, which may be relevant if your query string is particularly large.
With the first approach, you may also want to test that the repeated values always reach your framework in the order the client sent them. Some frameworks have a bug where that doesn't happen.
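To make the difference concrete, here is a small sketch using Python's standard `urllib.parse.parse_qs`. The helper `group_by_fields` is a made-up name for this example; it accepts both forms, so a server could support either one.

```python
from urllib.parse import parse_qs

def group_by_fields(query):
    """Collect group_by fields from repeated and/or comma-separated values."""
    fields = []
    for value in parse_qs(query).get("group_by", []):
        # A repeated parameter yields several values; a comma list yields one
        # value that still has to be split by hand.
        fields.extend(part for part in value.split(",") if part)
    return fields

repeated = group_by_fields("group_by=alpha&group_by=beta")
comma = group_by_fields("group_by=alpha,beta")

print(repeated, comma)
```

With the repeated form the parser already hands back a list; with the comma form the split is the extra step you write yourself.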
I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
This seems like something that ProjectTemplate would have already, but I can't seem to find that functionality there.
Does such a tool exist?
If it does not, what considerations should be taken into account to provide maximum flexibility? Some preliminary thoughts:
1. A markup language should be used (YAML?)
2. All sub-directories should be scanned
3. To facilitate (2), a standard extension for a dataset descriptor should be used
4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.
Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. :) There are a bunch of resources that you'll find, such as surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.
That's a conceptual start, now to address practical issues:
I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Welcome to everyone's nightmare. :)
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
No problem. list.files(...,recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.
It's worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it's rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results is a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you'll eventually outgrow it. :)
Does such a tool exist?
Yes. What are such tools? Mon dieu... it is very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That's rather unfortunate - either I'm missing something, or else the R community is missing something that we should be using.
For the basic process that you've described, I typically use JSON (see this answer and this answer for comments on what I'm up to). For much of my work, I represent this as a "data flow model" (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analyses perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from JSON to address how particular results arose.
For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let's move on...
1. A markup language should be used (YAML?)
Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.
2. All sub-directories should be scanned
Done: listDirectory()
3. To facilitate (2), a standard extension for a dataset descriptor should be used
Trivial: ".json". ;-) Or ".SecretSauce" works, too.
4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
As stated, this doesn't quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you're referring to simply matching "long names" with "simple names" (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it's not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don't think it's hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that's done.
5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
That's where others' suggestions of roxygen and roxygen2 can apply.
6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
Hmm, I'm stumped here. :)
(*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it's worth investigating as an example of a decently mature workflow system.
Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These are stored in a descriptor file for each binary file. That process encourages the creation of descriptors matching variable names (and matrices) to descriptors. If you store your data in a database or other external files supporting random access and multiple R/W access (e.g. memory mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.
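The workflow discussed above (JSON descriptors with a standard extension, a recursive directory scan, a check that every variable has a descriptor, and a templated sentence) can be sketched roughly as follows. This is written in Python purely for illustration; the descriptor layout (`dataset`, `retrieved`, `variables`), the function names, and the sample data are all assumptions of this sketch, not an existing tool.

```python
import json
import tempfile
from pathlib import Path

def collect_descriptors(root, suffix=".json"):
    """Load every descriptor file found in any sub-directory of root."""
    paths = sorted(Path(root).rglob("*" + suffix))
    return [json.loads(p.read_text()) for p in paths]

def methods_sentences(descriptors):
    """Render one templated sentence per documented variable."""
    sentences = []
    for d in descriptors:
        dset = d["dataset"]  # the only required field in this sketch
        date = d.get("retrieved", "an unrecorded date")
        for var in d.get("variables", {}):
            sentences.append(
                f"We pulled the {var} variable from the {dset} dataset on {date}."
            )
    return sentences

def undocumented(used_variables, descriptors):
    """List variables used in the analysis that no descriptor mentions."""
    described = set()
    for d in descriptors:
        described.update(d.get("variables", {}))
    return sorted(set(used_variables) - described)

# Demo with a throwaway directory tree and made-up descriptor content.
with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "original_data" / "census"
    folder.mkdir(parents=True)
    (folder / "source.json").write_text(json.dumps({
        "dataset": "census",
        "retrieved": "2012-01-15",
        "variables": {"population": "Total resident population"},
    }))
    found = collect_descriptors(root)

report = methods_sentences(found)
missing = undocumented(["population", "income"], found)
print(report)
print(missing)
```

Because only `dataset` is required, a minimal descriptor stays cheap to write, and the `undocumented` check gives the "keep checking until every variable has a descriptor" loop a concrete stopping condition.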
I've been looking for a proper implementation of hash map in R, with functionalities similar to the map type in Python.
After some googling and searching the R documentation, I found that environments and named lists are the ONLY options I can use (is that really so?).
But the problem with both is that they can only take character strings as keys for hashing, not even a number, let alone other types of things.
So is there a way to use arbitrary things as keys? Or at least more than just character strings?
Or is there a better implementation of a hash map, with better functionality, that I didn't find?
Thanks in advance.
Edit:
My current problem: I need a map to store the distance relationship between data points. That is, the key of the map is a tuple (p1, p2) and the value is a number.
The reason I asked a generic question instead of a concrete one is that I'm learning R recently and I want to know how to manipulate some of the most fundamental data structures, not only what my problem refers to. So I may need to use other things as key in the future, and I want to avoid asking similar questions with only minor difference every time I run into them.
Edit 2:
I got a lot of very good advice on this topic. It seems I'm still thinking in quite a Pythonic way, rather than the way one should in R. I should really get more R-ly! I think my purpose can easily be satisfied by a matrix in R. Thanks all!
The reason people keep asking you for a specific example is that most problems for which hash tables are the appropriate technique in Python have a good solution in R that does not involve hash tables.
That said, there are certainly times when a real hash table is useful in R, and I recommend you check out the hash package for R. It uses environments as its base but lets you do a lot of R-like vector work with them. It's efficient and I've never run into a problem with it.
Just keep in mind that if you're using hash tables a lot while working with R and your code is running slowly or is buggy, you may be able to get some mileage from figuring out a more R-like way of doing it :)
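For the concrete case from the question (a distance for each pair of points), the two ways of thinking can be put side by side. The sketch below is in Python only to show the contrast, with made-up points; in R the second form corresponds to a plain numeric matrix indexed as d[i, j].

```python
import math

# Made-up sample points for illustration.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Hash-table habit: a map keyed by the pair of point indices.
by_pair = {
    (i, j): math.dist(p, q)
    for i, p in enumerate(points)
    for j, q in enumerate(points)
}

# Matrix habit: row i, column j holds the same number, with no keys to hash.
matrix = [[math.dist(p, q) for q in points] for p in points]

print(by_pair[(0, 1)], matrix[0][1])
```

Both structures answer the same lookup; the matrix form is the one that lines up with vectorised, R-like code.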
I wonder about the idea of representing and executing programs using graphs: some kind of stackless model where each node in the graph represents a function and the edges represent arguments to the functions. In this way a function doesn't return the result to its caller, but passes the result as an argument to another function node. Total nonsense? Or maybe it is just a state machine in disguise? Any actual implementations of this anywhere?
This sounds a lot like a state machine.
I think Dybvig's dissertation Three Implementation Models for Scheme does this with Scheme.
I'm pretty sure the first model is graph-based in the way you mean. I don't remember whether the third model is or not. I don't think I got all the way through the dissertation.
For JavaScript, you might want to check out node-red (visual) or jsonflow (JSON).
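As a toy illustration of the model described in the question (not how any of the tools mentioned above work internally), here is a minimal Python sketch. Each node wraps a function, edges route a node's result into an argument slot of another node, and a node fires once all of its slots are filled. The `Node` class and `run` function are invented for this example.

```python
from collections import deque

class Node:
    """A function node that fires once all of its argument slots are filled."""

    def __init__(self, name, func, arity):
        self.name = name
        self.func = func
        self.arity = arity
        self.inputs = {}    # argument slot -> value received so far
        self.targets = []   # outgoing edges: (target node, argument slot)

    def connect(self, target, slot):
        """Add an edge sending this node's result to target's argument slot."""
        self.targets.append((target, slot))

def run(initial):
    """Deliver (node, slot, value) messages until none remain.

    Results are never returned to a caller; they are forwarded along edges.
    Nodes without outgoing edges have their results collected as outputs.
    """
    queue = deque(initial)
    outputs = {}
    while queue:
        node, slot, value = queue.popleft()
        node.inputs[slot] = value
        if len(node.inputs) < node.arity:
            continue  # still waiting for other arguments
        args = [node.inputs[i] for i in range(node.arity)]
        node.inputs = {}
        result = node.func(*args)
        if not node.targets:
            outputs[node.name] = result
        for target, target_slot in node.targets:
            queue.append((target, target_slot, result))
    return outputs

# Graph for (2 + 3) * 4: the sum flows into the first argument of the product.
add = Node("add", lambda a, b: a + b, 2)
mul = Node("mul", lambda a, b: a * b, 2)
add.connect(mul, 0)

outputs = run([(add, 0, 2), (add, 1, 3), (mul, 1, 4)])
print(outputs)
```

The only control structure is the message queue, so there is no call stack to unwind, which is the stackless property the question asks about.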