deterministically recursively iterate over directory structure - recursion

I have a static directory structure as input. My program starts from the root and recursively iterates through and lists the file names as output. My question is whether the ordering of the output list of filenames stays constant across multiple runs of my program. I have empirically observed that it does remain constant. But is it something that the programming language (lets say, java) or the operating system (I am using linux) guarantees.

This depends on a couple things:
What method you're using the read the directories (ls sorts alphabetically, readdir() grabs in directory order)
Whether filenames are being created/deleted between runs. This tends to screw you up for any filesystem.
What filesystem you're using (by default, ext2 would give you a stable order. ext4, which comes with directory hashing out of the box, will not).
In general, there is no guarantee that the order of directory entries will be stable between runs. If order is important, it's best to sort the directory after reading it.

Related

Won't an app based on immutable data structures run out of memory?

Never mind redux or something - I am solely asking about Immutable.JS, Ramda, etc.
If new versions of a data structure is created by structural sharing, that means that every new version needs to have a pointer to the previous version in order for it to be able to share anything. That again means that older versions of a structure cannot be garbage collected, meaning again, that in an app where you have state, this state will use a monotonically increasing amount of memory. If this is the case, then that data structure will at some point have used all available memory, if it keeps getting modified.
Am I missing something here? I can see that for many (most) use cases on the web (in a browser), this won't be a problem, as you are probably just changing a tiny part of the structure each time, and you will probably leave the page or reload it way before you use all memory, but for long running processes this should pose a problem. Right? Riiight?
If new versions of a data structure is created by structural sharing, that means that every new version needs to have a pointer to the previous version in order for it to be able to share anything.
This is not correct in general. The new version would have pointers to subparts of the previous version. So it shares a fraction (
often almost all) of the data of the older version.
For example, Ocaml's maps (represented and implemented by some self-balancing binary search tree variant of red-black trees) are immutable: see documentation of Map. But if you add (or remove) some binding to (from) a map, you get a new map sharing most (but not all) of its internal nodes with the old one.
So the garbage collector would eventually "delete" those old internal nodes which are not relevant to the current "state".
BTW web programming (and web navigation) is related to continuations and continuation-passing style. See e.g. Byrd's Web Programming with Continuations and several papers by C.Queinnec.
Read also more about monads in functional programming.

How are we actually supposed to include our OpenCL code?

How are we actually supposed to include our OpenCL code in our C projects?
We can't possibly be supposed to ship our .cl files along with our executable for the executable to find them and load them at runtime, because that's stupid, right?
We can't be supposed to use some stringify macro because a) that's apparently not portable/leads to undefined behaviour and b) it all breaks down if you use commas not enclosed in brackets like when defining many variables of the same type, I've spent an hour here looking for a solution to that and there doesn't seem to be one that actually works and c) that's kind of stupid.
Are we expected to write our code into C string literals like "int x, y;\n" "float4 p;\n"? Because I'm not doing that. Are we supposed to do a C include-style hexdump of our .cl files? That seems inconvenient. What are we actually supposed to do?
It's bad enough that all these approaches basically mean that you have to ship your program with your OpenCL code essentially open sourced when your OpenCL code is probably the last thing you want open sourced, on top of it it seems every OpenCL project I've seen uses one of the approaches listed above, it just doesn't seem right at all, it's like the people who made OpenCL forgot about something.
This thread: OpenCL bytecode running on another card mentions SPIR, a "platform-portable intermediate representation for OpenCL device programs". Other than that, you are basically restrained to the options you already mentioned.
Personally, I began to use C++11 raw string literals to get rid of my nasty stringify-macros. Don't know if C++ is an option for you, however.
Concerning your rejection of the "ship our .cl files along with our executable" approach: I don't see why this is inherently stupid -- the CL "shaders" are an application resource like all other separate files beside the executable, and thus are part of the "application bundle". It's perfectly reasonable to have such kind of files, and each operating system has its way to deal with it (in win32, the program directory is the bundle https://blogs.msdn.microsoft.com/oldnewthing/20110620-00/?p=10393 , OSX has its own bundle concept, etc...).
Now, if you are worried about other people peeking into your OpenCL code, you can still apply some obfuscation methods (e.g. encrypt your .cl-files by a key which is more or less cleverly hidden in your executable).
[edit/sidenote]: We could also investigate how other companies deal with this issue in the context of, for example, OpenGL/Direct3D shaders. In my limited experience, gaming companies tend to dump their shaders in text form somewhere in their application directory, for all to see (and even to tamper with). So in the gaming world at least, there is no great deal of secrecy in that respect... Wonder what adobe or CAD software companies do in their professional software.

Strong pointers to varinfo that does not exist in AST?

My Frama-C plug-in creates some varinfos with makeGlobalVar ~logic:true name type. These varinfos do not exist in the AST (they are placeholders for the results of calls to allocating functions in the target program, created “dynamically” during the analysis). If my plug-in takes care not to keep any strong pointer onto these varinfos, will they have a chance to be garbage-collected? Or are they registered in a data structure with strong pointers? If so, would it be possible to make that data structure weak? OCaml does not have the variety of weak data structure found in the literature for other languages, but there is nothing a periodical explicit pass to clean up empty stubs cannot fix.
Now that I think about it, I may not even have to create a varinfo. But it is a bit late to change my plug-in now. What I use of the varinfo is a name and a representation of a C type. Function makeGlobalVar offers a guarantee of unicity for the name, which is nice, I guess, as long as it does not create a strong pointer to it or to part of it in the process.
Context:
Say that you are writing a C interpreter to execute C programs that call malloc() and free(). If the target program does not have a memory leak (it frees everything it allocates and never holds too much memory), you would like the interpreter to behave the same.
If you don't explicitely register the varinfos into one of the Globals table, Frama-C won't do it for you (and in fact, if you do, you're supposed to add their declaration in the AST and vice-versa), so I guess that you are safe here. The only visible side-effect as far as the kernel is concerned should be the incrementation of the Vid counter. Note however that makeGlobalVar itself does not guarantee the unicity of the vname, but only of the vid field.

What is the purpose of environments in R and when I need to use more than one?

This is a basic R question: R has the concept of environment. So what purpose does it have, when do I need to start more then one and how do I switch between them? What is the advantage of multiple environments (other then looking up content of .Rdata file)?
The idea of environments is important and you use them all the time, mostly without realizing it. If you are just using R and not doing anything fancy then the indirect use of environments is all that you need and you will not need to explicitly create and manipulate environments. Only when you get into more advanced usage will you need to understand more. The main place that you use (indirectly) environments is that every function has its own environment, so every time you run a function you are using new envirnments. Why this is important is because this means that if the function uses a variable named "x" and you have a variable named "x" then the computer can keep them straight and use the correct one when it needs to and your copy of "x" does not get over written by the functions version.
Some other cases where you might use environments: Each package has its own environment so 2 packages can both be loaded with the same name of an internal function and they won't interfere with each other. You can keep your workspace a little cleaner by attaching a new enironment and loading function definitions into that environment rather than the global or working environment. When you write your own functions and you want to share variables between functions then you will need to understand about environments. Environmets can be used to emulate pass-by-reference instead of pass-by-value if you are ever in a situation where that matters (if you don't recognize those phrases then it probably does not matter).
You can think of environments as unordered lists. Both datatypes offer something like the hash table data structure to the user, i.e., a mapping from names to values. The lack of ordering in environments offers better performance when compared with lists on similar tasks.
The access functions [[ and $ work for both.
A nice fact about environments which is not true for lists is that environments pass by reference when supplied as function arguments, offering a way to improve performance when working large objects.
Personally, I never work directly with environments. Instead, I divide my scripts up in functions. This leads to an increased reusability, and to more structure. In addition, each function runs in its own environment, ensuring minimum interference in variables etc.

Automatic documentation of datasets

I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
This seems like something that ProjectTemplate would have already, but I can't seem to find that functionality there.
Does such a tool exist?
If it does not, what considerations should be taken into account to provide maximum flexibility? Some preliminary thoughts:
A markup language should be used (YAML?)
All sub-directories should be scanned
To facilitate (2), a standard extension for a dataset descriptor should be used
Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.
Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. :) There are a bunch of resources that you'll find, such as the surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.
That's a conceptual start, now to address practical issues:
I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Welcome to everyone's nightmare. :)
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
No problem. list.files(...,recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.
It's worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it's rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results are a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you'll eventually outgrow it. :)
Does such a tool exist?
Yes. What are such tools? Mon dieu... it is very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That's rather unfortunate - either I'm missing something, or else the R community is missing something that we should be using.
For the basic process that you've described, I typically use JSON (see this answer and this answer for comments on what I'm up to). For much of my work, I represent this as a "data flow model" (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analyses perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from JSON to address how particular results arose.
For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let's move on...
1. A markup language should be used (YAML?)
Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.
2. All sub-directories should be scanned
Done: listDirectory()
3. To facilitate (2), a standard extension for a dataset descriptor should be used
Trivial: ".json". ;-) Or ".SecretSauce" works, too.
4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
As stated, this doesn't quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you're referring to simply matching "long names" with "simple names" (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it's not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don't think it's hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that's done.
5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
That's where others' suggestions of roxygen and roxygen2 can apply.
6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
Hmm, I'm stumped here. :)
(*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it's worth investigating as an example of a decently mature workflow system.
Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These are stored in a descriptor file for each binary file. That process encourages the creation of descriptors matching variable names (and matrices) to descriptors. If you store your data in a database or other external files supporting random access and multiple R/W access (e.g. memory mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.

Resources