Is RAWSXP a good idea to store a blob? - r

I basically have two C functions to be used from R, one of which is making some blob and the second which needs to use it. While the user is not supposed to look inside it, I thought it would be reasonable not to do any serialization/conversion to R types and just dump it to an RAWSXP.
Are there any non-obvious disadvantages of this (i.e. except of killing user's console when printing it)?
EDIT: Ok, let's say for instance that I have an array of double/int64/(4 x int16) unions which is a result of some algorithm; I want it to be have a normal R copy semantics to behave naturally from an user's point of view (thus external pointer is rather not an option) but I'm not too eager to serialize it to R objects since it would not be straightforward and would probably end in a significant memory overhead.

If the blob is meant to persist within a single R session then it would be more natural to create, at the C level, an external pointer, and to return that to the user. This is outlined in Writing R Extensions, section 5.13.
One limitation of this approach is that the external pointer does not serialize, so is not saved to disk or, e.g., returned from a parallel job. This is often appropriate when the blob is a reference to a data structure that only makes sense in the context in which it was created (e.g., a file handle) but less so if it is a static data structure. In that case storing the data as a RAWSXP can be appropriate, typically as a slot or element of an S3 or S4 class with print / show methods to hide the gory details from the user. Perhaps the downside is that the RAWSXP is allocated and managed by R, e.g., subject to garbage collection, whereas the content of an external pointer would likely be allocated more directly via Calloc and Free.

As Martin and Josh pointed out, external pointers may be preferable.
Your approach sounds related to what e.g. the bigmemory does: it allocates a chunk of memory putside of R and controls it, thereby circumventing R's memory management and constraints. It doesn't matter for your purposes that bigmemory uses this to pass the memory back to R as a custom data type -- the external pointer makes that possible. Other packages using external pointers are
RODBC for a database connection object, and my RcppDE package which does what DEoptim does but in C++ and thereby allows to user-provided compiled functions in for the optimization, leveraging the Rcpp wrapper to external pointers: the Rcpp::XPtr class.
And as Marting rightly says, it is all in the good manual.

Related

Is Ada.Containers.Functional_Maps usable in Ada2012?

The information about Ada.Containers.Functional_Maps in the GNAT documentation is quite—let's say—abstruse.
First, it says this:
…these containers can still be used safely.
In the second paragraph, it seems to me that you cannot free the memory allocated for those objects once the program exits the context where they are created. I am understanding that you could run into a memory leak. Am I right?
They are also memory consuming, as the allocated memory is not reclaimed when the container is no longer referenced.
Read the next two sentences in the doc:
Thus, they should in general be used in ghost code and annotations, so that they can be removed from the final executable. The specification of this unit is compatible with SPARK 2014.
Because the specification of Ada.Containers.Functional_Maps is compatible with SPARK, it may help to examine it in the context of related SPARK Libraries with regard to proof, testing and annotation. In particular,
The functional maps, sets and vectors are unbounded collections of indefinite elements that are neither controlled nor limited. While they are inefficient with regard to memory, they are simple, immutable and useful "to model user defined data structures."
The functional containers can be used in Ghost Code, "parts of the code that are only meant for specification and verification", as suggested here. This related example illustrates a ghost function.
it seems to me that you cannot free the memory allocated for those
objects once the program exits the context where they are created. I
am understanding that you could run into a memory leak. Am I right?
There are some things that you can do in Ada to manage memory, I would be surprised if (for example) the usage of an instance inside a declare-block were not cleaned-up on the block's exit. — This is, in fact, how some surprisingly robust applications can get away without "dynamically-allocated" memory/values (it's actually heap-allocated, but that's pedantic).
This sort of granular control is really nice, as you can constrain things/usages to specific points. Combined with Ada's good facilities for presenting interfaces, this means that changing some structure to another can be less-painful than it otherwise might be.
As an example of the above, I had a nested key-value map (a JSON object) that was being used to pass parameters around; the method for doing this changed and so I had a string of values (with common-rooted keys) coming in and a procedure that took JSON as input. Obviously what was needed was a "keys&values-to-JSON function, so inside the function I used the multiway-tree container where the leafs represented values and the internal-nodes the keys, the second step was to traverse the tree and create the JSON-object as needed - simple recursion and data-structure selection used to address the problem of adapting the textual key-value pairs of these nested parameters to JSON. — And because the usage of multi-way trees was exclusive to this function, I can be confident that the memory used by the intermediate tree-object I used is released on the function's exit.

Is there a "memcpy" for R?

I work with large R objects that are sometimes accessed for read-only purposes by multiple people on our local network. For example, a reference class or R6 object might be used to store validation results related to a particular model, and it may have many read-only validation-related methods. I would like to keep using R to maintain workflow homogeneity and avoid moving to a language (like Java or Python) that would be more appropriate to solving the question I am about to ask.
Rather than instantiating these objects anew or reading them from serialized output (e.g., RDS or redis) every time we need them in a new R session, it would be much more efficient to keep an active R process running on some server that is accessible on the network, and then "memcpy"ing objects from that server onto local machines: some kind of quasi-object pooling. Note these are sometimes legitimately non-tabular R objects that would be difficult to e.g. translate into a database-backed object (which might still be slower).
I understand that R maintains all information about what is in scope on the heap, so this may be difficult to do without control of the gc, but is it possible to "siphon" objects away from other R sessions on a byte-by-byte level using some sort of underlying C magic? I don't understand enough about how R manages objects in memory to know how to do this, but perhaps there is a package or snippets of existing code that can provide inspiration.
I am also willing to put on the straightjacket and make restrictions on the aforementioned objects that would make this task easier (e.g., can only reference certain packages, or the method definitions cannot be weird closures that would make this task impossible, or even can only be S3 objects).
EDIT: I just realized I haven't looked into RProtoBuf. Could that be appropriate?
The standard way to do this would be to serialize your objects into a stream of bytes that can be safely loaded on another machine at a later time. This is exactly what the base::serialize method is for if everything is in R, or what RProtoBuf is for if the data needs to be shared between applications written in other languages. In either case, you can write the serialized bytes to RDS or redis or any other data store.
Direct memcpy between machines would be problematic for many reason. Most fundamentally, architectural differences between machines make this error prone if not all of your computers are the same endianness. Also, you would have to find a way to represent complex data structures as a stream of bytes that could be interpreted on another machine. Maybe things were loaded into memory at different addresses and so you can't just expect to do a raw memcpy without fixing up pointers to different memory locations, but if you are doing that, you are doing serialization, so again why not use base::serialize or RProtoBuf.

Ada: pragma Pure / Remote_Types and system types

I'm writing an Ada application that needs to be distributed, and I'm trying to use the DSA to do it, but I'm finding big limitations in what is "allowed" to be "withed" and what isn't.
I won't post sourcecode, since it's quite complex and this is a generic question anyway, I just wanted some pointers on what I'm not understanding correctly, so please bear with me and correct me if I'm wrong.
So my problem is this: I want to mark a procedure with the pragma Remote_Call_Interface so it can be called remotely. However as soon as I add the pragma compilation breaks due to the fact that the procedure is including other packages in my project that are not categorized as either Pure or Remote_Types.
So I try to mark the packages I need as either Pure or Remote_Types (dpeending wether they have state or not) but this in turn breaks compilation even further, since it turns out that you can't use even basic system types in a Pure/Remote_Types package, for example: you can't use Vectors, you can't use Unbounded_Strings, you can't use Maps, etc... the whole program falls to pieces since I can't use the data structures I used to build it anymore!
Is there a way around this? Or if I want to distribute my application I must strictly limit myself to the most basic types like Integers and booleans and little else?? I don't understand if I'm hitting against a limitation of the language or if I'm just doing it incorrectly (unfortunately the tutorials I found on DSA are all very vague, incidentally if anyone has some good ones feel free to link them!)
EDIT: after ajb's answer let me specify what is annoying me in particular: in the package I want to mark with pragma Remote_Call_Interface I'm trying to "with" some packages that are not pure/remote_types, however it only uses the types in those packages locally, it does not contain any procedures that accept such types as parameters, nor functions that return such types. This is what bothers me: since those types would not have to "travel" over the network, why can't I with them? I'm only using them locally... I don't understand this, and that is why I was trying to make those types Pure/Remote_Types, but now that I've read ajb's explanation (ie: Remote_Types is used so that objects of those types can travel over the network) I'm even more confused about why I can't use them if I only use them locally.
I'm not an expert on Ada distributed programming, but here's what I do know (or think I know):
The Annotated Ada Reference Manual, Section E.2.3 says, "The restrictions governing a remote call interface library unit are intended to ensure that the values of the actual parameters in a remote call can be meaningfully sent between two active partitions." For example, if a record type has a field that's an access type, you can't send it from one partition to another blindly, because the called partition won't be able to access the memory that the pointer points to. (Unbounded_String, Map, and Vector are implemented using access types as part of the internals.) All types used as parameters or return types must support "external streaming", meaning there has to be a way for the type to be converted to and from a stream of bytes so that the parameter value can be transmitted over a socket. If you have a record with an access type, but you provide 'Read and 'Write attributes so that the type can be written to and read from a byte stream without any actual pointers being transmitted, then you can put your record type in a Remote_Types package.
I'm not sure exactly what your problem is: are there certain types you want to pass as a parameter to a remote call but can't; or are there types that you want to use only in the rest of your application, but are getting in the way?
If it's the second one, then I think the solution is to restructure your packages so that all the "remote types" are separate from the non-remote types.
However, if you're really looking to pass an Unbounded_String, Map, or Vector from one partition to another in a remote call, it's trickier. Unbounded_String really should support external streaming, and there was a proposal to make Unbounded_String a Remote_Types package (see AI05-0204), but it wasn't acted on--I don't know why. Map and Vector would be bigger problems, though, since they are generic packages that have to work on any type, including those that don't support external streaming. In any case, those types aren't set up to be automatically converted to or from bytes to be passed over a socket.
But I think you could make it work like this:
private with Ada.Strings.Unbounded;
package Remote_Types_Package is
pragma Remote_Types;
type My_Unbounded_String is private;
private
type My_Unbounded_String is record
S : Ada.Strings.Unbounded.Unbounded_String;
end record;
end Remote_Types_Package;
The Unbounded_String package must be withed with private with; see E.2.2(6). You'll need to provide a function to create the My_Unbounded_String, and you'll need to provide stream read and write routines for My_Unbounded_String, and define 'Read and 'Write for the type. You should be able to write the Read and Write attributes by using the Read and Write attributes for the Unbounded_String. Something similar should be doable if you want to use a Vector as a remote call parameter, although you may have to do more work to marshal/unmarshal the type yourself.
Once again, I have not tried this, and it's possible there are some hitches in this solution.
EDIT: Since it now looks like the question is the simpler one--i.e. you have some types that are not going to be passed between partitions getting in the way--the solution should be simpler. Any types that you define that are going to be communicated between partitions need to be in a Remote_Types package, say P1. Other types should be in a different package, say P2 (or multiple packages). If types in P1 depend on types in P2, you can still get this to work by having P1 say private with P2;, and making sure you have the marshalling and unmarshalling procedures you need. If you run into difficulties, I'd encourage you to ask a new question here.
I don't know why the language required all such types to be quarantined in a Remote_Types package, instead of just saying that any type used in a Remote_Call_Interface package has to have only parts that can be streamed. There may have been some implementation issues. Any code that exists for a Remote_Types package has to be in programs in both partitions, perhaps, and this may have been an attempt to limit the type of code that would have to be linked into multiple partitions. But I'm just guessing.

Strong pointers to varinfo that does not exist in AST?

My Frama-C plug-in creates some varinfos with makeGlobalVar ~logic:true name type. These varinfos do not exist in the AST (they are placeholders for the results of calls to allocating functions in the target program, created “dynamically” during the analysis). If my plug-in takes care not to keep any strong pointer onto these varinfos, will they have a chance to be garbage-collected? Or are they registered in a data structure with strong pointers? If so, would it be possible to make that data structure weak? OCaml does not have the variety of weak data structure found in the literature for other languages, but there is nothing a periodical explicit pass to clean up empty stubs cannot fix.
Now that I think about it, I may not even have to create a varinfo. But it is a bit late to change my plug-in now. What I use of the varinfo is a name and a representation of a C type. Function makeGlobalVar offers a guarantee of unicity for the name, which is nice, I guess, as long as it does not create a strong pointer to it or to part of it in the process.
Context:
Say that you are writing a C interpreter to execute C programs that call malloc() and free(). If the target program does not have a memory leak (it frees everything it allocates and never holds too much memory), you would like the interpreter to behave the same.
If you don't explicitely register the varinfos into one of the Globals table, Frama-C won't do it for you (and in fact, if you do, you're supposed to add their declaration in the AST and vice-versa), so I guess that you are safe here. The only visible side-effect as far as the kernel is concerned should be the incrementation of the Vid counter. Note however that makeGlobalVar itself does not guarantee the unicity of the vname, but only of the vid field.

Execution speed of references vs pointers

I recently read a discussion regarding whether managed languages are slower (or faster) than native languages (specifically C# vs C++). One person that contributed to the discussion said that the JIT compilers of managed languages would be able to make optimizations regarding references that simply isn't possible in languages that use pointers.
What I'd like to know is what kind of optimizations that are possible on references and not on pointers?
Note that the discussion was about execution speed, not memory usage.
In C++ there are two advantages of references related to optimization aspects:
A reference is constant (refers to the same variable for its whole lifetime)
Because of this it is easier for the compiler to infer which names refer to the same underlying variables - thus creating optimization opportunities. There is no guarantee that the compiler will do better with references, but it might...
A reference is assumed to refer to something (there is no null reference)
A reference that "refers to nothing" (equivalent to the NULL pointer) can be created, but this is not as easy as creating a NULL pointer. Because of this the check of the reference for NULL can be omitted.
However, none of these advantages carry over directly to managed languages, so I don't see the relevance of that in the context of your discussion topic.
There are some benefits of JIT compilation mentioned in Wikipedia:
JIT code generally offers far better performance than interpreters. In addition, it can in some or many cases offer better performance than static compilation, as many optimizations are only feasible at run-time:
The compilation can be optimized to the targeted CPU and the operating system model where the application runs. For example JIT can choose SSE2 CPU instructions when it detects that the CPU supports them. With a static compiler one must write two versions of the code, possibly using inline assembly.
The system is able to collect statistics about how the program is actually running in the environment it is in, and it can rearrange and recompile for optimum performance. However, some static compilers can also take profile information as input.
The system can do global code optimizations (e.g. inlining of library functions) without losing the advantages of dynamic linking and without the overheads inherent to static compilers and linkers. Specifically, when doing global inline substitutions, a static compiler must insert run-time checks and ensure that a virtual call would occur if the actual class of the object overrides the inlined method.
Although this is possible with statically compiled garbage collected languages, a bytecode system can more easily rearrange memory for better cache utilization.
I can't think of something related directly to the use of references instead of pointers.
In general speak, references make it possible to refer to the same object from different places.
A 'Pointer' is the name of a mechanism to implement references. C++, Pascal, C... have pointers, C++ offers another mechanism (with slightly other use cases) called 'Reference', but essentially these are all implementations of the general referencing concept.
So there is no reason why references are by definition faster/slower than pointers.
The real difference is in using a JIT or a classic 'up front' compiler: the JIT can data take into account that aren't available for the up front compiler. It has nothing to do with the implementation of the concept 'reference'.
Other answers are right.
I would only add that any optimization won't make a hoot of difference unless it is in code where the program counter actually spends much time, like in tight loops that don't contain function calls (such as comparing strings).
An object reference in a managed framework is very different from a passed reference in C++. To understand what makes them special, imagine how the following scenario would be handled, at the machine level, without garbage-collected object references: Method "Foo" returns a string, which is stored into various collections and passed to different pieces of code. Once nothing needs the string any more, it should be possible to reclaim all memory used in storing it, but it's unclear what piece of code will be the last one to use the string.
In a non-GC system, every collection either needs to have its own copy of the string, or else needs to hold something containing a pointer to a shared object which holds the characters in the string. In the latter situation, the shared object needs to somehow know when the last pointer to it gets eliminated. There are a variety of ways this can be handled, but an essential common aspect of all of them is that shared objects need to be notified when pointers to them are copied or destroyed. Such notification requires work.
In a GC system by contrast, programs are decorated with metadata to say which registers or parts of a stack frame will be used at any given time to hold rooted object references. When a garbage collection cycle occurs, the garbage collector will have to parse this data, identify and preserve all live objects, and nuke everything else. At all other times, however, the processor can copy, replace, shuffle, or destroy references in any pattern or sequence it likes, without having to notify any of the objects involved. Note that when using pointer-use notifications in a multi-processor system, if different threads might copy or destroy references to the same object, synchronization code will be required to make the necessary notification thread-safe. By contrast, in a GC system, each processor may change reference variables at any time without having to synchronize its actions with any other processor.

Resources