Memory virtualization with R on a cluster

I know almost nothing about parallel computing, so this question might be very naive, and maybe it is impossible to do what I would like to.
I am using a Linux cluster with 40 nodes; however, since I don't know how to write parallel code in R, I am limited to using only one. On this node I am trying to analyse data that floods the memory (around 64 GB). So my problem isn't lack of computational power but rather a memory limitation.
My question is whether it is even possible to use some R package (like doSnow) for implicit parallelisation across 2-3 nodes to increase the RAM limit, or whether I would have to rewrite the script from the ground up to parallelise it explicitly.
Sorry if my question is naive; any suggestions are welcome.
Thanks,
Simon

I don't think there is such a package, and the reason is that it would not make much sense to have one. Memory access is very fast, while accessing data from another computer over the network is very slow by comparison. So if such a package existed it would be almost useless: the processor would have to wait for data over the network all the time, which would make the computation very slow.
This is true for common computing clusters, built from off-the-shelf hardware. If you happen to have a special cluster where remote memory access is fast, and is provided as a service of the operating system, then of course it might be not that bad.
Otherwise, what you need to do is to try to divide up the problem into multiple pieces, manually, and then parallelize, either using R, or another tool.
An alternative would be to keep some of the data on disk instead of loading all of it into memory. You still need to (kind of) divide up the problem, to make sure that the part of the data in memory is used for a reasonable amount of computation before loading the next part.
Whether it is worth (or possible at all) doing either of these options, depends completely on your application.
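As a minimal sketch of the disk-based approach in base R (no extra packages needed; the file name, the "value" column, and the chunk size below are placeholders for your own data): read the file in fixed-size chunks, compute a partial result per chunk, and combine the partials, so only one chunk is ever held in memory.

```r
# Sketch: chunked processing of a large CSV with base R only.
# "big_data.csv", the "value" column, and chunk_size are placeholders.
con <- file("big_data.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]
chunk_size <- 10000
total <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header, nrows = chunk_size),
    error = function(e) NULL)          # read.csv errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk$value)    # replace with your per-chunk computation
}
close(con)
total
```

The same pattern generalizes to any computation that can be expressed as "per-chunk result, then combine" (sums, counts, model updates), which is exactly the kind of division of the problem described above.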
By the way, a good list of high-performance computing tools in R is here:
http://cran.r-project.org/web/views/HighPerformanceComputing.html

For future inquiry:
You may want to have a look at two packages "snow" and "parallel".
The "snow" library extends the functionality of apply/lapply/sapply... to work across more than one core and/or more than one node.
Of course, you can perform simple parallel computing using more than one core:
#SBATCH --cpus-per-task= (enter some number here)
You can also perform parallel computing using more than one node (preferably with the previously mentioned libraries) using:
#SBATCH --ntasks-per-node= (enter some number here)
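As a minimal sketch of what multi-core parallelism looks like on the R side (base "parallel" package; the worker count would normally match whatever you request from SLURM):

```r
library(parallel)

cl <- makeCluster(4)                        # one R worker per core requested
res <- parLapply(cl, 1:8, function(i) i^2)  # parallel drop-in for lapply
stopCluster(cl)

unlist(res)  # 1 4 9 16 25 36 49 64
```

The same `cl` object can be handed to other `par*apply` functions, so converting an existing lapply-based script is often just a matter of creating the cluster and swapping the function name.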
However, for some workloads you may want to consider using Python instead of R, where parallelism can be much more efficient using "Dask" workers.

You might want to take a look at TidalScale, which can allow you to aggregate nodes on your cluster to run a single instance of Linux with the collective resources of the underlying nodes. www.tidalscale.com. Though the R application may be inherently single threaded, you'll be able to provide your R application with a single, simple coherent memory space across the nodes that will be transparent to your application.
Good luck with your project!

Overview of the different Parallel programming methods [closed]

I'm learning how to use parallel programming (specifically in R, but I'm trying to make this question as general as possible). There are many different libraries that work with it, and understanding their differences is hard without knowing the computer science terms used in their descriptions.
I identified some attributes that define these categories, such as: fine-grained and coarse-grained, explicit and implicit, levels of parallelism (bit-level etc.), classes of parallel computers (multi-core computing, grid computing, etc.), and what I call "methods" (I will explain what I mean by that later).
First question: is that list complete, or are there other relevant attributes that define categories of parallel programming?
Secondary question: for each attribute, what are the pros and cons of the different options? When to use each option?
About the "methods": I saw materials talking about sockets and forking; others talking about Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) (and another option called "NWS"); and some other specific methods like Hadoop/map-reduce and futures (from R's "future" package).
Second question: I don't understand some of them, and I'm not sure if it makes sense to join them in this list that I called (parallel processing) "methods". Also, are there other "methods" that I left out?
Secondary question: what are the pros and cons of each of these methods, and when is it better to use each?
Third question: in light of this categorization and the discussion on the pros and cons of each category, how can we make an overview of the parallel computing libraries in R?
My post might be too broad and ask too much at once, so I'll answer it with what I've found so far, and maybe you can correct it or add to it in your own answer. The points I feel are most lacking are the pros/cons of each "method" and of each R package.
OP here. I can't answer the "is that list complete?" part of the questions, but here are the explanations of each attribute and the pros and cons of each option. To reiterate: I'm new to this subject and might write something false/misleading.
Categories
Fine-grained, coarse-grained, and embarrassing parallelism (ref):
Attribute: how often their subtasks need to synchronize or communicate with each other.
Fine-grained parallelism: if its subtasks must communicate many times per second;
Coarse-grained parallelism: if they do not communicate many times per second;
Embarrassing parallelism: if they rarely or never have to communicate.
When to use each: self-explanatory
Explicit and implicit parallelism (ref):
Attribute: the need to write code that directly instructs the computer to parallelize
Explicit: needs it
Implicit: automatically detects a task that needs parallelism
When to use each: implicit parallelism is convenient, but automatic parallelization can lead to inefficiencies in some cases; explicit parallelism adds complexity to your code but gives you control over how tasks are split.
Types/levels of parallelism (ref):
Attribute: the code level where parallelism happens.
Bit-level, instruction-level, task and data-level, superword-level
When to use each: from what I understood, the most common statistics/R applications use task- and data-level parallelism, so I didn't research when to use the other ones.
Classes of parallel computers (ref)
Attribute: the level at which the hardware supports parallelism
Multi-core computing, Symmetric multiprocessing, Distributed computing, Cluster computing, Massively parallel computing, Grid computing, Specialized parallel computers.
When to use each: from what I understood, unless you have a really big task that needs several/external computers working, you can use multi-core computing (using only your own machine).
Parallelism "method":
Attribute: different methods (there probably is a better word for this).
This post makes a distinction between the socket approach (launches a new version of the code on each core) and the forking approach (copies the entire current version of your project to each core). Forking isn't supported on Windows.
This post makes a difference between Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) (which I couldn't quite understand). Apparently, there is another option called "NWS", which I couldn't find information about.
This CRAN task view contains a list of packages, grouped by topic, that are useful for high-performance computing in R. It refers to a method called "Hadoop" which was built upon "map-reduce", and to the "future" package, that uses "futures" to introduce parallelism to R.
When to use each: While with sockets each node is unique (avoiding cross-contamination) and runs on any system, forking is faster and lets you access your entire workspace in each process. I couldn't find information on PVM, MPI, and NWS, and I didn't go in depth into Hadoop and futures, so there is a lot of space to contribute to this paragraph.
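The socket/fork distinction above can be sketched with base R's "parallel" package (a toy example; the fork branch is Unix-only):

```r
library(parallel)
x <- 1:4

# Fork approach (Unix only): workers are copies of the current process,
# so they see `x` without any explicit export.
if (.Platform$OS.type == "unix") {
  res_fork <- mclapply(seq_along(x), function(i) x[i] * 2, mc.cores = 2)
}

# Socket approach (any OS): fresh R sessions, so data must be exported.
cl <- makeCluster(2)
clusterExport(cl, "x")
res_sock <- parLapply(cl, seq_along(x), function(i) x[i] * 2)
stopCluster(cl)

unlist(res_sock)  # 2 4 6 8
```

The `clusterExport` call is exactly the overhead that forking avoids: forked workers inherit the workspace for free, which is why forking tends to be faster when it is available.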
R packages
The CRAN task view is a great reference to this. It separates which packages deal with explicit/implicit parallel processing, most of which work with multicore-processing (while also pointing out some that do grid-processing). It points to a specific group of packages that use Hadoop, and other groups of parallel processing tools and specific applications. As they're more common, I'll list the explicit parallelism ones. Package names inside parenthesis are upgrades to the listed package.
rpvm (no longer actively maintained): PVM dedicated;
Rmpi (pdbMPI): MPI dedicated;
snow (snowFT, snowfall): works with PVM, MPI, NWS and socket, but not forking;
parallel (parallelly) was built upon multicore (forking focused and no longer actively maintained) and snow, and it is in base R;
future (future.apply and furrr): "parallel evaluations via abstraction of futures"
foreach: needs a "parallel backend", which can be doMPI (using Rmpi), doMC (using multicore), doSNOW (using snow), doParallel (using parallel), or doFuture (using future)
RHIPE, rmr, segue, and RProtoBuf: use Hadoop and other map-reduce techniques.
I also didn't get in depth into how each package works, and there is room to add pros/cons and when to use each.
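For instance, the foreach/backend pattern from the list above looks like this with the doParallel backend (assuming the foreach and doParallel packages are installed):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)  # tell foreach which parallel backend to use

# %dopar% sends iterations to the registered backend; .combine = c
# concatenates the per-iteration results into one vector.
res <- foreach(i = 1:4, .combine = c) %dopar% i^2
stopCluster(cl)

res  # 1 4 9 16
```

Because only the `registerDo*` call names the backend, the same loop body can later be pointed at doMPI, doSNOW, or doFuture without rewriting the loop itself.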

Limit GPU Memory in Julia using CuArrays

I'm fairly new to Julia and I'm currently trying out some deep convolutional networks with recurrent structures. I'm training the networks on a GPU using CuArrays (CUDA version 9.0).
Having two separate GPU's, I started two instances with different datasets.
Soon after some training, both Julia instances had allocated all available memory (2 x 11 GB) and I couldn't even start another instance myself using CuArrays (memory allocation error). This became quite a problem, since this is running on a server that is shared among many people.
I assume it is normal behavior to use all available memory to train as fast as possible. But under these circumstances I would like to limit the memory that can be allocated, so that two instances can run at the same time without blocking me or other people from using the GPU.
To my surprise I found only very, very little information about this.
I'm aware of the CUDA_VISIBLE_DEVICES Option but this does not help since I want to train simultaneously on both devices.
Another suggestion was to call GC.gc() and CuArrays.clearpool().
The second call throws an unknown-function error and seems no longer to be part of the CuArrays package. The first one I'm currently testing, but it's not exactly what I need. Is there any possibility to limit the memory allocation on a GPU using CuArrays and Julia?
Thanks in advance
My batch size is 100 and one batch should be less than 1 MB...
There is currently no such functionality. I quickly whipped something up, see https://github.com/JuliaGPU/CuArrays.jl/pull/379, you can use it to define CUARRAYS_MEMORY_LIMIT and set it to an amount of bytes that the allocator will not go beyond. Note that this might significantly increase memory pressure, a situation for which the CuArrays.jl memory allocator is currently not optimized (though it is one of my top priorities for the Julia GPU infrastructure).

Is there a method to attach to fork of R in interactive session?

I would like to offend the elder gods and use parallel::mcfork (or something like it) with minimal starting knowledge, understanding that there are hidden dangers that may fall on my head. Maybe the behavior I'm hoping for is foolish or impossible, but I didn't think it could hurt too much to ask.
What I want to do is load some data into the workspace and then work with it from two separate interactive sessions with no intent for those sessions to communicate with each other. I can parallel::mcfork(estranged=TRUE) and see that there is another R session with a distinct pid. What I haven't been able to do is figure out how to connect to it in an interactive session. I tried using reptyr, but only got a message saying that both pids have a sub-process and I can't attach to them.
Is it possible to accomplish this aim? If so, how?
For what purpose? I have a largish dataset that takes a while to load. Now that I'm using Ubuntu, I've noticed that I can do parallel processing on this large dataset at a much lower cost in RAM and time than when I was using a Windows machine (i.e. mclapply vs parLapply). Now I have this large dataset... but I don't know quite what I might want to do with it in terms of analysis. What I do know is that a measurable amount of time will pass between my issuing a command and the result. I'd like, having loaded the data, to analyze it in two separate interactive sessions so that I can pursue the lines of reasoning that seem most fruitful without having laid out a plan in advance or being stuck waiting and manually monitoring the results of mcparallel. Incidentally, mcparallel provides some hopeful-looking options, e.g. mc.interactive, but yields similar errors as before with reptyr.
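For comparison, the closest supported pattern is to fork background jobs with mcparallel and collect their results later, while the parent prompt stays interactive; this is a sketch of that workflow, not the attach-a-second-terminal behaviour asked about:

```r
library(parallel)  # Unix only: mcparallel() forks the current session

# The forked child inherits the loaded dataset via copy-on-write, so
# launching an analysis on it is cheap; the parent stays interactive.
job <- mcparallel(sum(1:10))  # placeholder for a slow analysis step

# ... keep exploring in the parent session in the meantime ...

mccollect(job)  # returns a list (named by child pid) with the result
```

The limitation, as the question notes, is that the child is not itself an interactive prompt: you only get its return value back, not a second REPL.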

MPI vs openMP for a shared memory

Let's say there is a computer with 4 CPUs, each having 2 cores, so 8 cores in total. With my limited understanding, I think all processors share the same memory in this case. Now, is it better to use OpenMP directly, or to use MPI to make the code general so that it works in both distributed and shared settings? Also, if I use MPI in a shared setting, would performance decrease compared with OpenMP?
Whether you need or want MPI or OpenMP (or both) heavily depends on the type of application you are running, and on whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but rather about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. In this case, hybrid parallelization will be necessary to make full use of your computational resources, but it is also the most difficult to code and to maintain.
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP, as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, when you know that you've exhausted all reasonable optimization options for MPI.
With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared-memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP might be faster is because a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.
Just for a bigger picture: hybrid programming has become popular because OpenMP benefits from the cache topology by using the same address space. Since MPI may have the same data replicated across memory (because processes can't share data), it can suffer from cache invalidation.
On the other hand, if you partition your data correctly and each processor has a private cache, it might come to the point where your problem fits completely in cache. In this case you get superlinear speedups.
Speaking of caches, recent processors have very different cache topologies, so as always: IT DEPENDS...

How do you pre allocate memory to a process in solaris?

My problem is:
I have a Perl script which uses a lot of memory (expected behaviour because of caching). But I noticed that the more I cache, the slower it gets, and the process spends most of its time in sleep mode.
I thought pre-allocating memory to the process might speed up the performance.
Does someone have any ideas here?
Update:
I think I am not being very clear here, so let me restate the question:
I am not looking for ways of pre-allocating inside the Perl script. I don't think that would help me much here. What I am interested in is a way to tell the OS to allocate X amount of memory for my Perl script so that it does not have to compete with other processes that come along later.
Assume that I can't get away from the memory usage. Although I am exploring ways of reducing it too, I don't expect much improvement there.
FYI, I am working on a Solaris 10 machine.
What I gathered from your posting and comments is this:
Your program gets slow when memory use rises
Your program increasingly spends its time sleeping, not computing.
Most likely explanation: sleeping means waiting for a resource to become available; in this case, the resource is most likely memory. Use the vmstat 1 command to verify. Have a look at the sr column: if it consistently goes beyond ~150, the system is desperate to free pages to satisfy demand. This is accompanied by high activity in the pi, po and fr columns.
If this is in fact the case, your best choices are:
Upgrade system memory to meet demand
Reduce memory usage to a level appropriate for the system at hand.
Preallocating memory will not help. In either case, memory demand will exceed the available main memory at some point. The kernel will then have to decide which pages need to be in memory now and which pages may be cleared and reused for more urgently needed pages. If all regularly needed pages (the working set) exceed the size of main memory, the system constantly moves pages to and from secondary storage (swap). The system is then said to be thrashing and spends little time doing useful work. There is nothing you can do about this except adding memory or using less of it.
From a comment:
The memory limitations are not very severe but the memory footprint easily grows to GBs and when we have competing processes for memory, it gets very slow. I want to reserve some memory from OS so that thrashing is minimal even when too many other processes come. Jagmal
Let's take a different tack then. The problem isn't really with your Perl script in particular. Instead, all the processes on the machine are consuming too much memory for the machine to handle as configured.
You can "reserve" memory, but that won't prevent thrashing. In fact, it could make the problem worse because the OS won't know if you are using the memory or just saving it for later.
I suspect you are suffering the tragedy of the commons. Am I right that many other users are on the machine in question? If so, this is more of a social problem than a technical problem. What you need is someone (probably the System Administrator) to step in and coordinate all the processes on the machine. They should find the most extravagant memory hogs and work with their programmers to reduce the cost on system resources. Further, they ought to arrange for processes to be scheduled so that resource allocation is efficient. Finally, they may need to get more or improved hardware to handle the expected system load.
Some questions you might ask yourself:
are my data structures really useful for the task at hand?
do I really have to cache that much?
can I throw away cached data after some time?
my @array;
$#array = 1_000_000;  # pre-extend array to one million elements
                      # http://perldoc.perl.org/perldata.html#Scalar-values

my %hash;
keys(%hash) = 8192;   # pre-allocate hash buckets (same documentation section)
Not being familiar with your code, I'll venture some wild speculation here [grin] that these techniques aren't going to offer new great efficiencies to your script, but that the pre-allocation could help a little bit.
Good luck!
-- Douglas Hunter
I recently rediscovered an excellent Randal L. Schwartz article that includes preallocating an array. Assuming this is your problem, you can test preallocating with a variation on that code. But be sure to test the result.
The reason the script gets slower with more caching might be thrashing. Presumably the reason for caching in the first place is to increase performance. So a quick answer is: reduce caching.
Now there may be ways to modify your caching scheme so that it uses less main memory and avoids thrashing. For instance, you might find that caching to a file or database instead of to memory can boost performance. I've found that file system and database caching can be more efficient than application caching and can be shared among multiple instances.
Another idea might be to alter your algorithm to reduce memory usage in other areas. For instance, instead of pulling an entire file into memory, Perl programs tend to work better reading line by line.
Finally, have you explored the Memoize module? It might not be immediately applicable, but it could be a source of ideas.
I could not find a way to do this yet.
But I found out that (see this for details):
Memory allocated to lexicals (i.e. my() variables) cannot be reclaimed or reused even if they go out of scope. It is reserved in case the variables come back into scope. Memory allocated to global variables can be reused (within your program) by using undef() and/or delete().
So I believe a possibility here could be to check whether I can reduce the total memory footprint of lexical variables at a given point in time.
It sounds like you are looking for limit or ulimit. But I suspect that will cause a script that goes over the limit to fail, which probably isn't what you want.
A better idea might be to share cached data between processes. Putting data in a database or in a file works well in my experience.
I hate to say it, but if your memory limitations are this severe, Perl is probably not the right language for this application. C would be a better choice, I'd think.
One thing you could do is to use solaris zones (containers) .
You could put your process in a zone and allocate it resources like RAM and CPU's.
Here are two links to some tutorials :
Solaris Containers How To Guide
Zone Resource Control in the Solaris 10 08/07 OS
While it's not pre-allocating as you asked for, you may also want to look at the large page size options, so that when Perl has to ask the OS for more memory for your program, it gets it in larger chunks.
See Solaris Internals: Multiple Page Size Support for more information on the difference this makes and how to do it.
Look at http://metacpan.org/pod/Devel::Size
You could also inline a C function to do the above.
As far as I know, you cannot allocate memory directly from Perl. You can get around this by writing an XS module, or using an inline C function like I mentioned.
