How to organize a large number of variables in ocaml - functional-programming

i first want to say that i'm beginner in ocaml. So i made a simple app that takes data from a json, does some calculations or replace some of them with arg from the command line, then writes another json with the new data and also replace those values in a html template and writes that too. You can see my project here https://github.com/ralcr/invoice-cmd/blob/master/invoice.ml
The question is how to deal with that amount of variables? In the languages i know i would probably repeat myself twice, but here are like 6 times. Thanks for any advice.

First of all, I would like to notice, that StackExchange code review is probably a better place to post such questions, as the question is more about a design rather than about the language.
I have two suggestions, on how to improve your code. The first one is to use string maps (or hashtables) to store your variables. Another is much more radical, is to rewrite the code in a more functional way.
Use maps
In your code, you're doing a lot of pouring the same water from one bucket into another, without doing actual work. The first thing that comes to mind, is whether it is necessary at all. When you parse JSON definitions into a set of variables, you do not actually reduce complexity or enforce any particular invariants. Basically, you're confusing data with code. These variables, are actually data that you're processing not a part of the logic of your application. So the first step would be to use string map, and store them in it. Then you can easily process a big set of variables with fold and map.
Use functions
Another approach is not to store the variables at all and express everything as stateless transformations on JSON data. Your application looks like a JSON processor, so I don't really see any reason why you should first read everything and store it in the memory, and then later produce the result. It is more natural to process data on the fly and express your logic as a set of small transformations. Try to split everything into small functions, so that each individual transformation can be easily understood. Then compose your transformation from smaller parts. This would be a functional style, where the flow of data is explicit.

Related

Which nodes should be included in CFG?

I'm writing a source-code-level optimizer for a language (LSL) whose server-side compiler is non-optimizing. The optimizer is working and performing well in some fields like constant folding, but to make it stronger, I wanted to build a CFG, to enable/simplify jump optimization, dead code removal, and later benefit from SSA.
Now, since it's source-level, I can't afford to convert the code to something like 3AC, otherwise I risk creating a version that is actually less optimal than the original.
So my question is: What would I include in the CFG? The logical answer would be to store AST nodes that represent whole statements. The problem I'm facing is that expressions may call user functions, and these functions may alter global variables, which would affect the SSA form, which is why I wanted to include the functions as part of the same graph, rather than different graphs.

In a paragraph or less, what is the purpose and benefits of pointers?

See title. That's all I have to ask. The net doesn't have many succinct answers to this question. Please keep in mind stack vs heap. Explain as you would to a complete beginner. Just looking for the "why" not the "how".
edit
Are pointers a way to get large objects out of the stack?
When passing a huge object from one piece of your program to another to be worked on like an entire class for example or something with a large amount of data like an image or video passing every single bit of data would be very inefficient. Instead you can just pass a tiny little memory address (pointer) that the receiving part of your program can then use to get to the object to be worked on.
Aside from that huge aspect, they offer a lot of flexibility but I need more than a paragraph for that.
When you get into managed code like C# or Java EVERYTHING is done with pointers/references but it's all behind the scenes and you don't have to deal with them like you would in C++ or another similar language. But it's still crucial to understand how they work.
Edit in response to:
"why would I pass a large object around if I don't need to work on
it?"
You wouldn't. However; Correct me if I'm straying from what your asking but what you'll learn if you continue into Computer Science is that a piece of your program should be as simple as possible it should only do 1 thing. Commonly known as the Single Responsibility Principle this dictates that you will have many seemingly tiny parts of your program that will all work together to accomplish the over arching goal. That means that a lot of those tiny pieces are going to need to work on the same objects, the same data and use the same tools to get the job done. Lets look at a hypothetical.
You're coding a simple image editing application.You're going to need a cropping tool, a paint brush tool, a selection tool, and a re-size tool. Each of these tools are going to need their own place in your program (a class or more likely many classes that work together) and that class will have many smaller pieces (methods/functions and other things) that work together to accomplish the goal of that class. Every single one of these classes and methods is most likely going to need to look at or modify the image data. With a pointer you can provide them with a memory address instead of making an entire copy of the image. That way when one of the classes or methods makes a change to it you don't need to worry about managing all these copies and making sure they all get the same change.
It allows you to do pass-by-reference/shared data structures, which has two big features: it saves memory and CPU overhead by not making copies, and it provides for complex communication patterns by making changes to shared data.

Automatic documentation of datasets

I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
This seems like something that ProjectTemplate would have already, but I can't seem to find that functionality there.
Does such a tool exist?
If it does not, what considerations should be taken into account to provide maximum flexibility? Some preliminary thoughts:
A markup language should be used (YAML?)
All sub-directories should be scanned
To facilitate (2), a standard extension for a dataset descriptor should be used
Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.
Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. :) There are a bunch of resources that you'll find, such as the surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.
That's a conceptual start, now to address practical issues:
I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Welcome to everyone's nightmare. :)
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
No problem. list.files(...,recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.
It's worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it's rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results are a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you'll eventually outgrow it. :)
Does such a tool exist?
Yes. What are such tools? Mon dieu... it is very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That's rather unfortunate - either I'm missing something, or else the R community is missing something that we should be using.
For the basic process that you've described, I typically use JSON (see this answer and this answer for comments on what I'm up to). For much of my work, I represent this as a "data flow model" (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analyses perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from JSON to address how particular results arose.
For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let's move on...
1. A markup language should be used (YAML?)
Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.
2. All sub-directories should be scanned
Done: listDirectory()
3. To facilitate (2), a standard extension for a dataset descriptor should be used
Trivial: ".json". ;-) Or ".SecretSauce" works, too.
4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
As stated, this doesn't quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you're referring to simply matching "long names" with "simple names" (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it's not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don't think it's hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that's done.
5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
That's where others' suggestions of roxygen and roxygen2 can apply.
6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
Hmm, I'm stumped here. :)
(*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it's worth investigating as an example of a decently mature workflow system.
Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These are stored in a descriptor file for each binary file. That process encourages the creation of descriptors matching variable names (and matrices) to descriptors. If you store your data in a database or other external files supporting random access and multiple R/W access (e.g. memory mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.

Abstraction or not?

The other day i stumbled onto a rather old usenet post by Linus Torwalds. It is the infamous "You are full of bull****" post when he defends his choice of using plain C for Git over something more modern.
In particular this post made me think about the enormous amount of abstraction layers that accumulate one over the other where I work. Mine is a Windows .Net environment. I must say that I like C# and the .Net environment, it really makes most things easy.
Now, I come from a very different background made of Unix technologies like C and a plethora or scripting languages; to me, also, OOP is just one, and not always the best, programming paradigm.. I often struggle (in a working kind of way, of course!) with my colleagues (one in particular), because they appear to be of the "any problem can be solved with an additional level of abstraction" church, while I'm more of the "keeping it simple" school. I think that there is a very different mental approach to the problems that maybe comes from the exposure to different cultures.
As a very simple example, for the first project I did here I needed some configuration for an application. I made a 10 rows class to load and parse a txt file to be located in the program's root dir containing colon separated key / value pairs, one per row. It worked.
In the end, to standardize the approach to the configuration problem, we now have a library to be located on every machine running each configured program that calls a service that, at startup, loads up an xml that contains the references to other xmls, one per application, that contain the configurations themselves.
Now, it is extensible and made up of fancy reusable abstractions, providers and all, but I still think that, if we one day really happen to reuse part of it, with the time taken to make it up, we can make the needed code from start or copy / past the old code and modify it.
What are your thoughts about it? Can you point out some interesting reference dealing with the problem?
Thanks
Abstraction makes it easier to construct software and understand how it is put together, but it complicates fully understanding certain issues around performance and security, because the abstraction layers introduce certain kinds of complexity.
Torvalds' position is not absurd, but he is an extremist.
Simple answer: programming languages provide data structures and ways to combine them. Use these directly at first, do not abstract. If you find you have representation invariants to maintain that are at a high risk of being broken due to a large number of usage sites possibly outside your control, then consider abstraction.
To implement this, first provide functions and convert the call sites to use them without hiding the representation. Hide the data representation only when you're satisfied your functional representation is sufficient. Make sure at this time to document the invariant being protected.
An "extreme programming" version of this: do not abstract until you have test cases that break your program. If you think the invariant can be breached, write the case that breaks it first.
Here's a similar question: https://stackoverflow.com/questions/1992279/abstraction-in-todays-languages-excited-or-sad.
I agree with #Steve Emmerson - 'Coders at Work' would give you some excellent perspective on this issue.

How do you test programs that require complex input data?

We have a suite of converters that take complex data and transform it. Mostly the input is EDI and the output XML, or vice-versa, although there are other formats.
There are many inter-dependencies in the data. What methods or software are available that can generate complex input data like this?
Right now we use two methods: (1) a suite of sample files that we've built over the years mostly from files bugs and samples in documentation, and (2) generating pseudo-random test data. But the former only covers a fraction of the cases, and the latter has lots of compromises and only tests a subset of the fields.
Before go further down the path of implementing (reinventing?) a complex table-driven data generator, what options have you found successful?
Well, the answer is in your question. Unless you implement a complex table-driven data generator, you're doing the things right with (1) and (2).
(1) covers the rule of "1 bug verified, 1 new test case".
And if the structure of the pseudo-random test data of (2) corresponds whatsoever in real life situations, it is fine.
(2) can always be improved, and it'll improve mainly over time, when thinking about new edge cases. The problem with random data for tests is that it can only be random to a point where it becomes so difficult to compute the expected output from the random data in the test case, that you have to basically rewrite the tested algorithm in the test case.
So (2) will always match a fraction of the cases. If one day it matches all the cases, it will be in fact a new version of your algorithm.
I'd advise against using random data as it can make it difficult if not impossible to reproduce the error that reported (I know you said 'pseudo-random', just not sure what you mean by that exactly).
Operating over entire files of data would likely be considering functional or integration testing. I would suggest taking your set of files with known bugs and translating these into unit tests, or at least do so for any future bugs you come across. Then you can also extend these unit tests to include coverage for the other erroneous conditions that you don't have any 'sample data'. This will likely be easier then coming up with a whole new data file every time you think of a condition/rule violation you want to check for.
Make sure your parsing of the data format is encapsulated from the interpretation of the data in the format. This will make unit testing as described above much easier.
If you definitely need to drive your testing you may want to consider getting a machine readable description of the file format, and writing a test data generator which will analyze the format and generate valid/invalid files based upon it. This will also allow your test data to evolve as the file formats do as well.

Resources