The information about Ada.Containers.Functional_Maps in the GNAT documentation is quite—let's say—abstruse.
First, it says this:
…these containers can still be used safely.
Then, in the second paragraph:
They are also memory consuming, as the allocated memory is not reclaimed when the container is no longer referenced.
It seems to me that you cannot free the memory allocated for those objects once the program exits the context where they are created. My understanding is that you could run into a memory leak. Am I right?
Read the next two sentences in the doc:
Thus, they should in general be used in ghost code and annotations, so that they can be removed from the final executable. The specification of this unit is compatible with SPARK 2014.
Because the specification of Ada.Containers.Functional_Maps is compatible with SPARK, it may help to examine it in the context of related SPARK Libraries with regard to proof, testing and annotation. In particular,
The functional maps, sets and vectors are unbounded collections of indefinite elements that are neither controlled nor limited. While they are inefficient with regard to memory, they are simple, immutable and useful "to model user defined data structures."
The functional containers can be used in Ghost Code, "parts of the code that are only meant for specification and verification", as suggested here. This related example illustrates a ghost function.
It seems to me that you cannot free the memory allocated for those objects once the program exits the context where they are created. My understanding is that you could run into a memory leak. Am I right?
There are some things that you can do in Ada to manage memory; I would be surprised if (for example) the use of an instance inside a declare block were not cleaned up on the block's exit. This is, in fact, how some surprisingly robust applications can get away without "dynamically-allocated" memory/values (it's actually heap-allocated, but that's pedantic).
This sort of granular control is really nice, as you can constrain things/usages to specific points. Combined with Ada's good facilities for presenting interfaces, this means that changing one structure to another can be less painful than it otherwise might be.
As an example of the above, I had a nested key-value map (a JSON object) that was being used to pass parameters around. The method for doing this changed, so I had a string of values (with common-rooted keys) coming in and a procedure that took JSON as input. Obviously what was needed was a keys-and-values-to-JSON function, so inside the function I used the multiway-tree container, where the leaves represented values and the internal nodes the keys; the second step was to traverse the tree and create the JSON object as needed. Simple recursion and data-structure selection were used to address the problem of adapting the textual key-value pairs of these nested parameters to JSON. And because the use of multiway trees was exclusive to this function, I can be confident that the memory used by the intermediate tree object is released on the function's exit.
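The same scope-bound cleanup can be sketched in C++ terms (purely an illustrative analogue, not Ada, with hypothetical names): a container declared inside a block is destroyed, and its storage reclaimed, when control leaves that block.

```
#include <iostream>
#include <map>
#include <string>

int main()
{
    {
        // the map and everything it owns exist only inside this block,
        // mirroring a container declared in an Ada declare block
        std::map<std::string, int> counts;  // hypothetical intermediate structure
        counts["example"] = 1;
        std::cout << counts.size() << '\n';
    }   // block exit: the map's destructor runs and its heap storage is released here
    return 0;
}
```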
There are a lot of questions online about allocating, copying, indexing, etc., 2d and 3d arrays on CUDA. I'm getting a lot of conflicting answers, so I'm attempting to compile past questions to see if I can ask the right ones.
First link: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/
Problem: Allocating a 2d array of pointers
User solution: use mallocPitch
"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)
"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu
Second link: https://devtalk.nvidia.com/default/topic/413905/passing-a-multidimensional-array-to-kernel-how-to-allocate-space-in-host-and-pass-to-device-/
Problem: Allocating space on host and passing it to device
Sub link: https://devtalk.nvidia.com/default/topic/398305/cuda-programming-and-performance/dynamically-allocate-array-of-structs/
Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.
Third link: Allocate 2D Array on Device Memory in CUDA
Problem: Allocating and transferring 2d arrays
User solution: use mallocPitch
Other solution: flatten it
Fourth link: How to use 2D Arrays in CUDA?
Problem: Allocate and traverse 2d arrays
Submitted solution: Does not show allocation
Other solution: squash it
There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.
Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row yet the functions mallocPitch and memcpy2D exist. Are these functions somehow less efficient? Why wouldn't this be the default answer?
The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.
Another solution I was considering was to make a matrix class that uses a 1d pointer array, but I can't find a way to implement the double bracket operator.
Also according to this link: Copy an object to device?
and the sub link answer: cudaMemcpy segmentation fault
This gets a little iffy.
The classes I want to use CUDA with all have 2d/3d arrays, and wouldn't there be a lot of overhead in converting those to 1d arrays for CUDA?
I know I've asked a lot, but in summary: should I get used to squashed arrays as a fact of life, or can I use the 2d allocate and copy functions without getting bad overhead like in the solution where alloc and cpy are called in a for loop?
Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.
cudaMallocPitch/cudaMemcpy2D:
First, the cuda runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters. They could not be doubly-subscripted or doubly dereferenced. For additional example usage, here is one of many questions on this. Here is a fully worked example usage. Another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these functions is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.
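As a minimal sketch of the pitched pattern (hypothetical sizes and kernel, not taken from any of the linked examples): the device buffer is addressed through a single pointer, and the pitch returned by cudaMallocPitch is used to locate each row.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d_data, size_t pitch, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // locate the row start using the pitch (in bytes), then index the column
        float *row = (float *)((char *)d_data + y * pitch);
        row[x] *= factor;
    }
}

int main()
{
    const int width = 64, height = 48;        // hypothetical dimensions
    float h_data[height][width] = {};         // contiguous host storage, single allocation
    h_data[0][0] = 3.0f;

    float *d_data = nullptr;
    size_t pitch = 0;
    cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);

    // host -> device: the source pitch is simply the host row width in bytes
    cudaMemcpy2D(d_data, pitch, h_data, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale<<<grid, block>>>(d_data, pitch, width, height, 2.0f);

    // device -> host
    cudaMemcpy2D(h_data, width * sizeof(float), d_data, pitch,
                 width * sizeof(float), height, cudaMemcpyDeviceToHost);

    printf("%f\n", (double)h_data[0][0]);     // expect 6.0
    cudaFree(d_data);
    return 0;
}
```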
general, dynamically allocated 2D case:
If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:
there is additional, non-trivial complexity
the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1.
(note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration for that)
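For reference, a rough sketch of the general mechanics (hypothetical sizes and kernel, not the linked answer itself): allocate each row on the device, copy the table of row pointers to the device, and then doubly-subscripted access works in the kernel, at the cost of the extra dereference noted above.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void set_val(int **d_arr, int width, int height, int val)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        d_arr[y][x] = val;   // doubly-subscripted access: two dereferences
}

int main()
{
    const int width = 32, height = 16;   // hypothetical dimensions

    // allocate one device buffer per row, collecting the row pointers on the host
    int *h_rows[height];
    for (int y = 0; y < height; y++)
        cudaMalloc((void **)&h_rows[y], width * sizeof(int));

    // copy the table of row pointers to the device
    int **d_arr = nullptr;
    cudaMalloc((void **)&d_arr, height * sizeof(int *));
    cudaMemcpy(d_arr, h_rows, height * sizeof(int *), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    set_val<<<grid, block>>>(d_arr, width, height, 7);

    // copy one row back to verify
    int h_row[width];
    cudaMemcpy(h_row, h_rows[0], width * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_row[0]);            // expect 7

    // cleanup: each row, then the pointer table
    for (int y = 0; y < height; y++)
        cudaFree(h_rows[y]);
    cudaFree(d_arr);
    return 0;
}
```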
Also, here is a thrust method for building a general dynamically allocated 2D array.
flattening:
If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!) However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
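A minimal sketch of the flattened approach (hypothetical names and sizes): keep one 1D allocation and compute the index as y * width + x on both host and device.

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// simulated 2D access over a flat allocation: index = y * width + x
__global__ void add_one(float *d_data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        d_data[y * width + x] += 1.0f;
}

int main()
{
    const int width = 100, height = 80;   // hypothetical dimensions
    size_t bytes = (size_t)width * height * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < width * height; i++) h_data[i] = (float)i;

    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    add_one<<<grid, block>>>(d_data, width, height);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("%f\n", (double)h_data[3 * width + 2]);   // "element [3][2]"

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```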
general, dynamically allocated 3D case:
As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so it is even less efficient. Here is a fully worked example (2nd code example).
special case: array width known at compile time:
Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile-time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and in this case we can use doubly-subscripted access with considerably less complexity than the general case, and there is no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for a n-dimensional array). The first code example in the already-mentioned answer here (first code example) gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
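A minimal sketch of that special case (hypothetical names, with the width fixed at compile time): declaring the kernel parameter as a pointer to rows of W elements lets the compiler generate the 2D indexing over a single flat allocation, so no pointer tree is involved.

```
#include <cstdio>
#include <cuda_runtime.h>

const int W = 128;   // hypothetical width, known at compile time

// the parameter type carries the width, so arr[y][x] needs only one dereference
__global__ void fill(float (*arr)[W], int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < height)
        arr[y][x] = (float)(y * W + x);
}

int main()
{
    const int height = 64;                  // hypothetical height
    float (*d_arr)[W] = nullptr;            // pointer to rows of W floats

    cudaMalloc((void **)&d_arr, height * sizeof(float[W]));   // one flat allocation

    dim3 block(16, 16);
    dim3 grid((W + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fill<<<grid, block>>>(d_arr, height);

    float h_arr[height][W];
    cudaMemcpy(h_arr, d_arr, height * sizeof(float[W]), cudaMemcpyDeviceToHost);
    printf("%f\n", (double)h_arr[2][3]);    // expect 259.0

    cudaFree(d_arr);
    return 0;
}
```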
doubly-subscripted host code, singly-subscripted device code:
Finally, another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based off the flat allocation and a manually-created pointer "tree"; however, this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so it is less efficient, and there is some complexity associated with building the pointer "tree" for use in device code (e.g. it would necessitate an additional cudaMemcpy operation, probably).
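A rough sketch of that mixed approach (hypothetical names and sizes): the data lives in one contiguous allocation, a host-side array of row pointers gives data[y][x] access on the host, and only the flat pointer is passed to the device.

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void doubleit(float *d_flat, int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < width * height)
        d_flat[idx] *= 2.0f;    // device side: simulated 2D over the flat buffer
}

int main()
{
    const int width = 50, height = 40;   // hypothetical dimensions
    size_t count = (size_t)width * height;
    float *flat = (float *)calloc(count, sizeof(float));

    // host-side pointer "tree": row pointers into the single contiguous allocation
    float **data = (float **)malloc(height * sizeof(float *));
    for (int y = 0; y < height; y++)
        data[y] = flat + (size_t)y * width;

    data[3][2] = 21.0f;                   // doubly-subscripted host access

    float *d_flat = nullptr;
    cudaMalloc((void **)&d_flat, count * sizeof(float));
    cudaMemcpy(d_flat, flat, count * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (int)((count + threads - 1) / threads);
    doubleit<<<blocks, threads>>>(d_flat, width, height);

    cudaMemcpy(flat, d_flat, count * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", (double)data[3][2]);   // expect 42.0

    cudaFree(d_flat);
    free(data);
    free(flat);
    return 0;
}
```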
From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.
Suppose I have a subprogram written in the SPARK Ada subset with postconditions verifying some property (for example, that the returned array is sorted), whose body just calls out to a function external to SPARK (for example, a C/C++ function that sorts arrays). Is there any way to force SPARK to assume that, after this call, the array will be sorted?
In short, GNATprove takes a divide-and-conquer approach when analyzing code. The following explanation is incomplete and in practice things are slightly more complicated, but for the sake of understanding, it gives a useful perspective on how things work.
For each assertion, loop invariant and pre-/post-condition GNATprove creates verification conditions (VCs) that must be proven. Verification conditions are to be proven by assumptions and the semantics of the code.
When a code section is being analyzed, and this code section starts just after a call to a subprogram, then any post-condition of that subprogram is assumed to hold.
If that particular subprogram is implemented in SPARK, then GNATprove will try to prove that the post-conditions indeed hold by analyzing the subprogram. However, if the particular subprogram is not in SPARK (e.g., the subprogram is imported), then the post-conditions will remain assumptions, and it is left to the developer to prove them by other means.
A nice example that illustrates the first point can be found in sections 1 and 2 of the recently published article The Work of Proof in SPARK (available here). Note in particular how a repeated call to the Increment function is being analyzed by GNATprove.
So, if you want SPARK to assume particular post-conditions to hold for a subprogram that is not in SPARK (an imported function, for example), just state those post-conditions on its declaration; GNATprove will assume them at call sites without attempting to verify them.
My Frama-C plug-in creates some varinfos with makeGlobalVar ~logic:true name type. These varinfos do not exist in the AST (they are placeholders for the results of calls to allocating functions in the target program, created “dynamically” during the analysis). If my plug-in takes care not to keep any strong pointer onto these varinfos, will they have a chance to be garbage-collected? Or are they registered in a data structure with strong pointers? If so, would it be possible to make that data structure weak? OCaml does not have the variety of weak data structures found in the literature for other languages, but there is nothing a periodic explicit pass to clean up empty stubs cannot fix.
Now that I think about it, I may not even have to create a varinfo. But it is a bit late to change my plug-in now. What I use of the varinfo is a name and a representation of a C type. Function makeGlobalVar offers a guarantee of unicity for the name, which is nice, I guess, as long as it does not create a strong pointer to it or to part of it in the process.
Context:
Say that you are writing a C interpreter to execute C programs that call malloc() and free(). If the target program does not have a memory leak (it frees everything it allocates and never holds too much memory), you would like the interpreter to behave the same.
If you don't explicitly register the varinfos into one of the Globals tables, Frama-C won't do it for you (and in fact, if you do, you're supposed to add their declaration in the AST, and vice-versa), so I guess that you are safe here. The only visible side effect as far as the kernel is concerned should be the incrementing of the Vid counter. Note however that makeGlobalVar itself does not guarantee the unicity of the vname, but only of the vid field.
I basically have two C functions to be used from R, one of which makes some blob and the second of which needs to use it. Since the user is not supposed to look inside it, I thought it would be reasonable not to do any serialization/conversion to R types and just dump it to a RAWSXP.
Are there any non-obvious disadvantages of this (i.e., apart from killing the user's console when printing it)?
EDIT: OK, let's say for instance that I have an array of double/int64/(4 x int16) unions which is the result of some algorithm; I want it to have normal R copy semantics so it behaves naturally from a user's point of view (thus an external pointer is rather not an option), but I'm not too eager to serialize it to R objects, since that would not be straightforward and would probably end in significant memory overhead.
If the blob is meant to persist within a single R session then it would be more natural to create, at the C level, an external pointer, and to return that to the user. This is outlined in Writing R Extensions, section 5.13.
One limitation of this approach is that the external pointer does not serialize, so is not saved to disk or, e.g., returned from a parallel job. This is often appropriate when the blob is a reference to a data structure that only makes sense in the context in which it was created (e.g., a file handle) but less so if it is a static data structure. In that case storing the data as a RAWSXP can be appropriate, typically as a slot or element of an S3 or S4 class with print / show methods to hide the gory details from the user. Perhaps the downside is that the RAWSXP is allocated and managed by R, e.g., subject to garbage collection, whereas the content of an external pointer would likely be allocated more directly via Calloc and Free.
As Martin and Josh pointed out, external pointers may be preferable.
Your approach sounds related to what e.g. the bigmemory package does: it allocates a chunk of memory outside of R and controls it, thereby circumventing R's memory management and constraints. It doesn't matter for your purposes that bigmemory uses this to pass the memory back to R as a custom data type -- the external pointer makes that possible. Other packages using external pointers are
RODBC for a database connection object, and my RcppDE package, which does what DEoptim does but in C++ and thereby allows user-provided compiled functions to be used for the optimization, leveraging the Rcpp wrapper to external pointers: the Rcpp::XPtr class.
And as Martin rightly says, it is all in the good manual.
I recently read a discussion regarding whether managed languages are slower (or faster) than native languages (specifically C# vs. C++). One person who contributed to the discussion said that the JIT compilers of managed languages would be able to make optimizations regarding references that simply aren't possible in languages that use pointers.
What I'd like to know is: what kinds of optimizations are possible on references and not on pointers?
Note that the discussion was about execution speed, not memory usage.
In C++ there are two advantages of references related to optimization aspects:
A reference is constant (refers to the same variable for its whole lifetime)
Because of this it is easier for the compiler to infer which names refer to the same underlying variables - thus creating optimization opportunities. There is no guarantee that the compiler will do better with references, but it might...
A reference is assumed to refer to something (there is no null reference)
A reference that "refers to nothing" (equivalent to the NULL pointer) can be created, but this is not as easy as creating a NULL pointer. Because of this the check of the reference for NULL can be omitted.
However, none of these advantages carry over directly to managed languages, so I don't see the relevance of that in the context of your discussion topic.
There are some benefits of JIT compilation mentioned in Wikipedia:
JIT code generally offers far better performance than interpreters. In addition, it can in some or many cases offer better performance than static compilation, as many optimizations are only feasible at run-time:
The compilation can be optimized to the targeted CPU and the operating system model where the application runs. For example JIT can choose SSE2 CPU instructions when it detects that the CPU supports them. With a static compiler one must write two versions of the code, possibly using inline assembly.
The system is able to collect statistics about how the program is actually running in the environment it is in, and it can rearrange and recompile for optimum performance. However, some static compilers can also take profile information as input.
The system can do global code optimizations (e.g. inlining of library functions) without losing the advantages of dynamic linking and without the overheads inherent to static compilers and linkers. Specifically, when doing global inline substitutions, a static compiler must insert run-time checks and ensure that a virtual call would occur if the actual class of the object overrides the inlined method.
Although this is possible with statically compiled garbage collected languages, a bytecode system can more easily rearrange memory for better cache utilization.
I can't think of something related directly to the use of references instead of pointers.
Generally speaking, references make it possible to refer to the same object from different places.
A 'pointer' is the name of one mechanism for implementing references. C++, Pascal, C, and others have pointers; C++ also offers another mechanism (with slightly different use cases) called a 'reference', but essentially these are all implementations of the general referencing concept.
So there is no reason why references are by definition faster/slower than pointers.
The real difference is in using a JIT or a classic 'up front' compiler: the JIT can take data into account that isn't available to the up-front compiler. It has nothing to do with the implementation of the concept 'reference'.
Other answers are right.
I would only add that any optimization won't make a hoot of difference unless it is in code where the program counter actually spends much time, like in tight loops that don't contain function calls (such as comparing strings).
An object reference in a managed framework is very different from a passed reference in C++. To understand what makes them special, imagine how the following scenario would be handled, at the machine level, without garbage-collected object references: Method "Foo" returns a string, which is stored into various collections and passed to different pieces of code. Once nothing needs the string any more, it should be possible to reclaim all memory used in storing it, but it's unclear what piece of code will be the last one to use the string.
In a non-GC system, every collection either needs to have its own copy of the string, or else needs to hold something containing a pointer to a shared object which holds the characters in the string. In the latter situation, the shared object needs to somehow know when the last pointer to it gets eliminated. There are a variety of ways this can be handled, but an essential common aspect of all of them is that shared objects need to be notified when pointers to them are copied or destroyed. Such notification requires work.
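As an illustrative sketch (not any particular framework's implementation, names are hypothetical): a shared object that tracks its own pointer count must be notified on every copy and every destruction of a handle, and those notifications must be atomic once multiple threads are involved.

```
#include <atomic>
#include <cstdio>

// A minimal intrusive reference count: every copy or destruction of a
// handle must notify the shared object, and the notification must be
// atomic if several threads may touch the same object.
struct SharedString {
    std::atomic<int> refs{1};
    const char *chars;
    explicit SharedString(const char *s) : chars(s) {}
};

struct Handle {
    SharedString *obj;
    explicit Handle(SharedString *o) : obj(o) {}
    Handle(const Handle &other) : obj(other.obj) {
        obj->refs.fetch_add(1, std::memory_order_relaxed);   // notify on copy
    }
    ~Handle() {
        // notify on destruction; the last handle reclaims the object
        if (obj->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete obj;
    }
};

int main()
{
    Handle a(new SharedString("hello"));
    {
        Handle b = a;                    // copy: atomic increment
        std::printf("%s\n", b.obj->chars);
    }                                    // b destroyed: atomic decrement
    return 0;
}                                        // a destroyed: count hits zero, object freed
```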
In a GC system by contrast, programs are decorated with metadata to say which registers or parts of a stack frame will be used at any given time to hold rooted object references. When a garbage collection cycle occurs, the garbage collector will have to parse this data, identify and preserve all live objects, and nuke everything else. At all other times, however, the processor can copy, replace, shuffle, or destroy references in any pattern or sequence it likes, without having to notify any of the objects involved. Note that when using pointer-use notifications in a multi-processor system, if different threads might copy or destroy references to the same object, synchronization code will be required to make the necessary notification thread-safe. By contrast, in a GC system, each processor may change reference variables at any time without having to synchronize its actions with any other processor.