Is there a "memcpy" for R? - r

I work with large R objects that are sometimes accessed for read-only purposes by multiple people on our local network. For example, a reference class or R6 object might be used to store validation results related to a particular model, and it may have many read-only validation-related methods. I would like to keep using R to maintain workflow homogeneity and avoid moving to a language (like Java or Python) that might be better suited to the problem I am about to describe.
Rather than instantiating these objects anew or reading them from serialized output (e.g., RDS files or redis) every time we need them in a new R session, it would be much more efficient to keep an active R process running on some server accessible on the network and "memcpy" objects from that server onto local machines: some kind of quasi-object pooling. Note that these are sometimes legitimately non-tabular R objects that would be difficult to translate into, e.g., a database-backed object (which might still be slower).
I understand that R maintains all information about what is in scope on the heap, so this may be difficult to do without control of the gc, but is it possible to "siphon" objects away from other R sessions on a byte-by-byte level using some sort of underlying C magic? I don't understand enough about how R manages objects in memory to know how to do this, but perhaps there is a package or snippets of existing code that can provide inspiration.
I am also willing to put on the straitjacket and impose restrictions on the aforementioned objects that would make this task easier (e.g., they can only reference certain packages, or the method definitions cannot be weird closures that would make this task impossible, or even that they can only be S3 objects).
EDIT: I just realized I haven't looked into RProtoBuf. Could that be appropriate?

The standard way to do this would be to serialize your objects into a stream of bytes that can be safely loaded on another machine at a later time. This is exactly what the base::serialize function is for if everything is in R, or what RProtoBuf is for if the data needs to be shared between applications written in other languages. In either case, you can write the serialized bytes to an RDS file, redis, or any other data store.
Direct memcpy between machines would be problematic for many reasons. Most fundamentally, architectural differences between machines make this error-prone if not all of your computers have the same endianness. You would also have to find a way to represent complex data structures as a stream of bytes that could be interpreted on another machine. Objects may have been loaded into memory at different addresses, so you can't just do a raw memcpy without fixing up pointers to the new memory locations; but if you are doing that, you are doing serialization, so again, why not use base::serialize or RProtoBuf?
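To make the pointer problem concrete, here is a minimal, hypothetical C++ sketch (not R internals, just an illustration): an in-memory object that contains a pointer cannot usefully be copied byte-for-byte to another process, because the copied address means nothing there.
    #include <cstring>
    #include <iostream>

    // Stand-in for an in-memory R object: it holds a pointer into this
    // process's heap, much like a SEXP holds pointers to other SEXPs.
    struct Node {
        double value;
        Node*  next;   // only meaningful inside the process that created it
    };

    int main() {
        Node b{2.0, nullptr};
        Node a{1.0, &b};

        // A raw "memcpy" copies the pointer bits verbatim...
        unsigned char bytes[sizeof(Node)];
        std::memcpy(bytes, &a, sizeof(Node));

        // ...but in another process (or on another machine) that address does
        // not point at b, so the copy is useless. To move the data you must
        // encode what the pointer refers to, which is exactly serialization.
        Node copy;
        std::memcpy(&copy, bytes, sizeof(Node));
        std::cout << "copied address: " << copy.next << "\n";
    }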

Related

Is Ada.Containers.Functional_Maps usable in Ada2012?

The information about Ada.Containers.Functional_Maps in the GNAT documentation is quite—let's say—abstruse.
First, it says this:
…these containers can still be used safely.
In the second paragraph, it says:
They are also memory consuming, as the allocated memory is not reclaimed when the container is no longer referenced.
From this it seems to me that you cannot free the memory allocated for those objects once the program exits the context where they are created; my understanding is that you could run into a memory leak. Am I right?
Read the next two sentences in the doc:
Thus, they should in general be used in ghost code and annotations, so that they can be removed from the final executable. The specification of this unit is compatible with SPARK 2014.
Because the specification of Ada.Containers.Functional_Maps is compatible with SPARK, it may help to examine it in the context of related SPARK Libraries with regard to proof, testing and annotation. In particular,
The functional maps, sets and vectors are unbounded collections of indefinite elements that are neither controlled nor limited. While they are inefficient with regard to memory, they are simple, immutable and useful "to model user defined data structures."
The functional containers can be used in Ghost Code, "parts of the code that are only meant for specification and verification", as suggested here. This related example illustrates a ghost function.
it seems to me that you cannot free the memory allocated for those objects once the program exits the context where they are created. I am understanding that you could run into a memory leak. Am I right?
There are some things that you can do in Ada to manage memory; I would be surprised if (for example) the usage of an instance inside a declare-block were not cleaned up on the block's exit. This is, in fact, how some surprisingly robust applications can get away without "dynamically-allocated" memory/values (it's actually heap-allocated, but that's pedantic).
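For readers more used to C-family languages, a rough C++ analogue of that scope-bounded reclamation (the braces play the role of the Ada declare-block here; this is only an illustration of the idea, not Ada code) might look like this:
    #include <vector>

    void process() {
        {   // plays the role of an Ada declare-block
            std::vector<int> scratch(1'000'000);   // storage acquired on entry
            // ... use scratch ...
        }   // scratch's destructor runs here and its memory is released
        // after the block, none of that memory is still held
    }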
This sort of granular control is really nice, as you can constrain things/usages to specific points. Combined with Ada's good facilities for presenting interfaces, this means that changing some structure to another can be less painful than it otherwise might be.
As an example of the above, I had a nested key-value map (a JSON object) that was being used to pass parameters around; the method for doing this changed, so I had a string of values (with common-rooted keys) coming in and a procedure that took JSON as input. Obviously what was needed was a "keys-and-values-to-JSON" function, so inside the function I used the multiway-tree container, where the leaves represented values and the internal nodes the keys; the second step was to traverse the tree and create the JSON object as needed. Simple recursion and data-structure selection were used to address the problem of adapting the textual key-value pairs of these nested parameters to JSON. And because the usage of multiway trees was exclusive to this function, I can be confident that the memory used by the intermediate tree object is released on the function's exit.

Can ASP.NET performance be improved with modules/static classes?

Can using Modules or Shared/Static references to the BLL/DAL improve the performance of an ASP.NET website?
I am working on a site that consists of two projects, one the website, the other a VB.NET class library which acts as a combination of DAL and BLL.
The library is used to communicate with databases and sometimes transform/validate the data going into/coming from the DBs.
Currently each page on the site that needs db access (the vast majority) will create an instance of the relevant class in the library to access specific tables.
As I understand it, this leads to a class from the library being instantiated and garbage collected for each request, with the possibility of multiple concurrent instances if multiple users view the same page.
If I converted the classes to modules (shared/static classes), would performance increase and memory be saved, since only one instance of each module exists at a time and a new instance does not have to be created for each request?
(if so, does anyone know if having TableAdapters as global variables in the modules would cause problems due to threading?)
Alternatively, would making the references to the Library class in the ASP.NET page shared/static have the same effect? (except I would have to re-write a lot less)
I'm no expert, but I think that the absence of examples of this static class / session object model in books and online is indicative of it being a bad idea.
I inherited a Linq-To-Sql application where the db contexts were static, and after n requests the whole thing just fell apart. The standard model for L2Sql is the Unit-of-Work pattern (define a task or set of tasks - do them and close). Let the framework worry about connection pooling and efficient GC.
Are you just trying to be efficient or do you have performance issues? If the latter it's usually more effective to look at caching or improving query efficiency (use stored procedures, root out queries in loops) than looking at object instantiation.
Statics don't play well with unit tests either (another reason why they have dropped out of fashion).
Instances are only a problem if they are not collected by the GC (a memory leak). Instances are also more flexible than statics, because you can configure the instance for the specific context you are using it in.
When an application has poor performance or memory problems, it's usually a sign that
instances are not properly released (IDisposable)
the amount of data retrieved is too big (not paging large sets of data)
a large number of queries are executed (select n+1, or just a lot of queries)
poorly constructed sql statements (missing indexes, FK, too many joins, etc)
too many remote calls (either to other servers, or disk)
These are the first things I would check; then start looking at the number of instantiated objects. Chances are that correcting the above-mentioned list will solve most performance bottlenecks.
Can using Modules or Shared/Static references to the BLL/DAL improve the performance of an ASP.NET website?
It's possible, but it depends heavily on how you use your data. One tradeoff in using a single shared instance of an object instead of one per request is that you will need to apply locking unless the objects are strictly read-only, and locking can both slow things down and complicate your code (not to mention being a common source of bugs).
However, if each object is going to contain the exact same data, then the tradeoff may be worth it -- even more so if it can save a DB round-trip.
You might consider using either a Singleton or a small number of parameterized objects rather than a static, though -- and use caching to manage them. That would give you the flexibility to let go of objects that you no longer need, which is harder to do when you're dealing with statics.
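As an illustration of that tradeoff (sketched in C++ rather than VB.NET, with entirely hypothetical names), a lazily initialised shared instance needs no per-request locking as long as nothing mutates it after construction, whereas a shared mutable cache needs a lock on every access:
    #include <map>
    #include <mutex>
    #include <string>

    struct LookupData {
        std::map<std::string, int> codes;   // filled once, then read-only
    };

    // Thread-safe lazy initialisation: the static local is constructed exactly
    // once even if many request threads call this concurrently (C++11 guarantee).
    const LookupData& SharedLookup() {
        static const LookupData instance = [] {
            LookupData d;
            d.codes = {{"active", 1}, {"retired", 2}};   // imagine a DB load here
            return d;
        }();
        return instance;
    }

    // By contrast, a shared *mutable* cache needs explicit locking on every access.
    std::mutex cache_mutex;
    std::map<std::string, int> mutable_cache;

    int CachedValue(const std::string& key) {
        std::lock_guard<std::mutex> lock(cache_mutex);
        return mutable_cache[key];
    }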

Is RAWSXP a good idea to store a blob?

I basically have two C functions to be used from R, one of which makes some blob and the second of which needs to use it. Since the user is not supposed to look inside it, I thought it would be reasonable not to do any serialization/conversion to R types and just dump it into a RAWSXP.
Are there any non-obvious disadvantages of this (i.e., apart from killing the user's console when it is printed)?
EDIT: Ok, let's say for instance that I have an array of double/int64/(4 x int16) unions which is the result of some algorithm; I want it to have normal R copy semantics so that it behaves naturally from a user's point of view (thus an external pointer is rather not an option), but I'm not too eager to serialize it to R objects since that would not be straightforward and would probably end up with a significant memory overhead.
If the blob is meant to persist within a single R session then it would be more natural to create, at the C level, an external pointer, and to return that to the user. This is outlined in Writing R Extensions, section 5.13.
One limitation of this approach is that the external pointer does not serialize, so is not saved to disk or, e.g., returned from a parallel job. This is often appropriate when the blob is a reference to a data structure that only makes sense in the context in which it was created (e.g., a file handle) but less so if it is a static data structure. In that case storing the data as a RAWSXP can be appropriate, typically as a slot or element of an S3 or S4 class with print / show methods to hide the gory details from the user. Perhaps the downside is that the RAWSXP is allocated and managed by R, e.g., subject to garbage collection, whereas the content of an external pointer would likely be allocated more directly via Calloc and Free.
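As a rough C-level sketch of the two options (using the documented R C API from Writing R Extensions; the blob_t type, sizes, and function names are invented for illustration), an external pointer carries a finalizer so the Calloc'd memory is freed when R garbage-collects the handle, while the RAWSXP route copies the bytes into an R-managed vector with normal copy semantics:
    #include <R.h>
    #include <Rinternals.h>
    #include <string.h>

    /* hypothetical blob produced by the first C function */
    typedef struct { double *values; int n; } blob_t;

    static void blob_finalizer(SEXP ptr) {
        blob_t *b = (blob_t *) R_ExternalPtrAddr(ptr);
        if (b == NULL) return;
        Free(b->values);
        Free(b);
        R_ClearExternalPtr(ptr);
    }

    /* Option 1: external pointer (reference semantics, does not serialize) */
    SEXP make_blob_xptr(void) {
        blob_t *b = Calloc(1, blob_t);
        b->n = 100;
        b->values = Calloc(b->n, double);
        SEXP ptr = PROTECT(R_MakeExternalPtr(b, R_NilValue, R_NilValue));
        R_RegisterCFinalizerEx(ptr, blob_finalizer, TRUE);
        UNPROTECT(1);
        return ptr;
    }

    /* Option 2: copy the bytes into a RAWSXP (copy semantics, serializable) */
    SEXP make_blob_raw(const blob_t *b) {
        R_xlen_t nbytes = (R_xlen_t) b->n * sizeof(double);
        SEXP raw = PROTECT(allocVector(RAWSXP, nbytes));
        memcpy(RAW(raw), b->values, (size_t) nbytes);
        UNPROTECT(1);
        return raw;
    }
The RAWSXP would then typically be stored in a slot of an S3/S4 class whose print/show method hides the raw bytes from the user, as described above.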
As Martin and Josh pointed out, external pointers may be preferable.
Your approach sounds related to what, e.g., the bigmemory package does: it allocates a chunk of memory outside of R and controls it, thereby circumventing R's memory management and constraints. It doesn't matter for your purposes that bigmemory uses this to pass the memory back to R as a custom data type -- the external pointer makes that possible. Other packages using external pointers are
RODBC for a database connection object, and my RcppDE package, which does what DEoptim does but in C++ and thereby allows user-provided compiled functions to be used for the optimization, leveraging the Rcpp wrapper to external pointers: the Rcpp::XPtr class.
And as Martin rightly says, it is all in the good manual.

Language without explicit memory alloc/dealloc AND without garbage collection

I was wondering if it is possible to create a programming language without explicit memory allocation/deallocation (like C, C++ ...) AND without garbage collection (like Java, C#...) by doing a full analysis at the end of each scope?
The obvious problem is that this would take some time at the end of each scope, but I was wondering if it has become feasible with all the processing power and multiple cores in current CPUs. Do such languages exist already?
I also was wondering if a variant of C++ where smart pointers are the only pointers that can be used, would be exactly such a language (or am I missing some problems with that?).
Edit:
Well after some more research apparently it's this: http://en.wikipedia.org/wiki/Reference_counting
I was wondering why this isn't more popular. The disadvantages listed there don't seem all that serious, and the overhead shouldn't be that large in my opinion. A (non-interpreted, properly written from the ground up) language with C-family syntax and reference counting seems like a good idea to me.
The biggest problem with reference counting is that it is not a complete solution: it is not capable of collecting a cyclic structure. The overhead is incurred every time you set a reference; for many kinds of problems this adds up quickly and can be worse than just waiting for a GC later. (Modern GC is quite advanced and awesome - don't count it out like that!)
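The cycle problem is easy to demonstrate with C++'s reference-counted std::shared_ptr (a sketch, nothing specific to any GC library): two objects that point at each other keep each other's count above zero forever, unless one link is demoted to a non-owning weak reference.
    #include <memory>

    struct Node {
        std::shared_ptr<Node> next;        // owning link: contributes to the count
        // std::weak_ptr<Node> next;       // the fix: a non-owning back link
    };

    int main() {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->next = b;
        b->next = a;   // cycle: a and b each hold a reference to the other

        // When a and b go out of scope, each object's count drops to 1, not 0,
        // so neither destructor runs: pure reference counting leaks this cycle.
        // A tracing garbage collector would reclaim both.
    }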
What you are talking about is nothing special, and it shows up all the time. The C or C++ variant you are looking for is just plain regular C or C++.
For example, write your program normally, but constrain yourself not to use any dynamic memory allocation (no new, delete, malloc, or free, or any of their friends, and make sure your libraries do the same); then you have that kind of system. You figure out in advance how much memory you need for everything you could do, and declare that memory statically (either function-level static variables or global variables). The compiler takes care of all the accounting the normal way, nothing special happens at the end of each scope, and no extra computation is necessary.
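A minimal sketch of that style in C++ (all names invented for illustration): every buffer is declared up front with a fixed capacity, so nothing is allocated or freed at run time and the total memory use is known when the program is built.
    #include <cstddef>

    // Fixed capacity decided up front instead of sized at run time.
    constexpr std::size_t kMaxSamples = 4096;

    // Function-level static storage: reserved for the program's whole lifetime,
    // no new/malloc anywhere, nothing to free at scope exit.
    double* sample_buffer() {
        static double samples[kMaxSamples];
        return samples;
    }

    // A tiny fixed-capacity "container" built on statically allocated storage.
    struct SampleLog {
        double      data[kMaxSamples];
        std::size_t count = 0;

        bool push(double v) {
            if (count == kMaxSamples) return false;   // full: caller must cope
            data[count++] = v;
            return true;
        }
    };

    static SampleLog g_log;   // global, statically allocated

    int main() {
        g_log.push(1.5);
        sample_buffer()[0] = 2.5;
    }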
You can even configure your runtime environment to have a statically allocated stack space (this one isn't really under the compiler's control; it's more a matter of the linker and operating system environment). Just figure out how deep your function call chain goes and how much memory it uses (with a profiler or similar tool), and set it in your link options.
Without dynamic memory allocation (and thus no deallocation through either garbage collection or explicit management), you are limited to the memory you declared when you wrote the program. But that's ok, many programs don't need dynamic memory, and are already written that way. The real need for this shows up in embedded and real-time systems when you absolutely, positively need to know exactly how long an operation will take, how much memory (and other resources) it will use, and that the running time and the use of those resources can't ever change.
The great thing about C and C++ is that the language requires so little from the environment, and gives you the tools to do so much, that smart pointers, statically allocated memory, or even some special scheme that you dream up can all be implemented. Requiring their use, and the constraints you put on yourself, just becomes a policy decision. You can enforce that policy with code auditing (use scripts to scan the source or object files, and don't permit linking to the dynamic memory libraries).

Is re-using a Command and Connection object in ado.net a legitimate way of reducing new object creation?

The current way our application is written involves creating a new connection and command object in every method that accesses our SQLite db. Considering we need it to run on a WM5 device, that is leading to hideous performance.
Our plan is to use just one connection object per thread, but it has also occurred to us to use one global command object per thread. The benefit of this is that it reduces the overhead on the garbage collector created by instantiating objects all over the place.
I can't find any advice against doing this but wondered if anyone can answer definitively if this is a good or bad thing to do, and why?
While I'm not sure about reducing the number of command objects, reducing the number of connections is definitely a good plan. They're designed to be relatively expensive to set up (hey, they involve actually opening a disk file!) so keeping them around for a relatively long time is highly sensible. So do the first stage of your plan and retime to see if that makes things good enough, or if you need to do more work optimizing…
Note that it is quite possible that generating the command objects once per connection will be a saving too, since that will allow them to be compiled once and reused multiple times. Not that that matters until you're persisting the connection in the first place!
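The same idea at the SQLite C API level, which an ADO.NET SQLite provider typically wraps (a hedged sketch; the table and column names are invented): open the connection once, prepare the statement once, then bind/step/reset it for each request instead of re-creating it every time.
    #include <sqlite3.h>
    #include <stdio.h>

    int main(void) {
        sqlite3 *db = NULL;
        sqlite3_stmt *stmt = NULL;

        if (sqlite3_open("app.db", &db) != SQLITE_OK) return 1;

        /* Prepare (compile) the query once: the analogue of reusing a command object. */
        const char *sql = "SELECT name FROM customers WHERE id = ?";
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) return 1;

        /* Reuse the prepared statement for many requests: bind, step, reset. */
        for (int id = 1; id <= 3; ++id) {
            sqlite3_bind_int(stmt, 1, id);
            while (sqlite3_step(stmt) == SQLITE_ROW)
                printf("%d -> %s\n", id, (const char *) sqlite3_column_text(stmt, 0));
            sqlite3_reset(stmt);
            sqlite3_clear_bindings(stmt);
        }

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }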
