It's been a while since I last looked into this (back then the nws package, now outdated, was the main option), so I wondered whether anything has "happened" in the meantime.
Is there a way to share memory across parallel processes?
I would like each process to have access to an environment object that plays the role of a meta object.
The rredis package provides functionality that is similar to nws. You could use rredis with the foreach and doRedis packages, or with any other parallel programming package, such as parallel.
What will work quite efficiently is a shared matrix via the bigmemory package. You can serialize/unserialize pretty much every R object into such a matrix.
Unfortunately the only way you can share the matrices between the processes is via their descriptors, which are not deterministic (i.e. unless you communicate with the other process, you cannot get a descriptor). To solve this chicken-and-egg problem you can save the descriptor at a chosen location on the filesystem. (The descriptor is really small; the only non-trivial thing it contains is the memory address of the actual big.matrix.)
If you are still interested, I can post the R code.
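In the meantime, here is a rough sketch of the idea (not my actual code; the file names shared_obj.bin / shared_obj.desc and the use of a file-backed matrix are just for illustration, so the descriptor file doubles as the agreed-upon location):
library(bigmemory)

# -- Sender process --
obj <- list(a = 1:10, b = "hello")
raw_bytes <- serialize(obj, connection = NULL)            # any R object -> raw vector
m <- big.matrix(nrow = length(raw_bytes), ncol = 1, type = "integer",
                backingfile = "shared_obj.bin",
                descriptorfile = "shared_obj.desc")        # descriptor lands at a known path
m[, 1] <- as.integer(raw_bytes)                            # copy the bytes into the shared matrix

# -- Receiver process --
library(bigmemory)
m2 <- attach.big.matrix("shared_obj.desc")                 # solves the chicken-and-egg problem
obj2 <- unserialize(as.raw(m2[, 1]))                       # back to the original object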
You can do this efficiently with the yaplr package.
First install it:
devtools::install_github('adamryczkowski/yaplr')
R session number 1:
library(yaplr)
send_object(obj=1:10, tag='myobject')
# Server process spawned
R session number 2:
library(yaplr)
list_objects()
# size ctime
# myobject 62 Sat Sep 24 13:01:57 2016
retrieve_object(tag='myobject')
# [1] 1 2 3 4 5 6 7 8 9 10
remove_object('myobject')
quit_server()
The package uses bigmemory::big.matrix for efficient data transfer: when copying a big object between R processes, no unnecessary copies are made; there is only one serialization and one unserialization.
No network sockets are used.
Related
I know that some version of this question has been addressed multiple times in the past, but I think this iteration of this widely shared problem is sufficiently distinct to justify its own response. I would like to permanently set the maximum memory available to R to the largest value that my machine can handle, i.e., not just for a single session. I am running 64-bit R on a Windows 7 machine with 6 GB of RAM.
Currently I am trying to convert a 10 GB Stata file into a .rds object. On similar, smaller objects the compression in the .dta-to-.rds conversion has been by a factor of four or better, and I (rather surprisingly) have not had any trouble doing dplyr manipulation on objects of 2 to 3 GB (after compression), even when two of them plus the work product are all in memory at once. This seems to conflict with my previous belief that the amount of physical RAM is the absolute upper limit of what R can handle, as I am fairly certain that between loaded .rds objects and various intermediate work products I have had more than 6 GB of undeleted objects lying about my workspace at one time.
I find conflicting statements about whether the maximum memory size is my actual RAM less OS demands, or my actual RAM, or my actual RAM plus an unknown (to me) amount of virtual RAM (subject to a potentially serious slowdown when you reach into virtual RAM). These file conversions are one-time (per file) jobs and I do not care if they are slow.
Looking at the base R help page on “Memory limits” and the help pages for memory.size(), it seems that there are multiple distinct limits under Windows, relating to the total memory used in a session, the memory available to a single process, the memory allocatable by malloc, and the size of a single vector. The individual vectors in my file are only around eight million rows long.
memory.size and memory.limit both report current settings in the neighborhood of 6 GB. I got multiple warning messages saying that I was pressed up against that limit, but the actual error message was something like “cannot allocate vector of size 120 MB”.
So I think there are three distinct questions:
How do I determine the maximum possible memory for each 64-bit R memory setting;
How many distinct memory settings do I need to make; and
How do I make them permanent, as opposed to setting them only for a single session?
Following the advice of @Konrad below, I had this rather puzzling exchange with R/RStudio:
> memory.size()
[1] 424.85
> memory.size(max=TRUE)
[1] 454.94
> memory.size()
[1] 436.89
> memory.size(5000)
[1] 6046
Warning message:
In memory.size(5000) : cannot decrease memory limit: ignored
> memory.size()
[1] 446.27
The first three interactions seem to suggest that there is a hard memory limit on my machine of about 455 MB. The second-to-last one, on the other hand, appears to say that the memory limit is set at my RAM level, with no allowance for the OS and without using virtual memory. Then the last one goes back to claiming a limit of around 450 MB.
I just tried the recommendation here:
Increasing (or decreasing) the memory available to R processes
but with 6000 MB rather than 500; I'll provide a report.
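For concreteness, the kind of change being suggested there looks roughly like this (a sketch only; memory.limit() is Windows-specific and the 6000 MB figure is just my target, not a recommendation):
memory.limit()            # report the current limit, in MB
memory.limit(size = 6000) # raise the limit for this session (it cannot be lowered)

# To make this survive across sessions, one option is to put the same call
# into a startup file such as ~/.Rprofile (or the file named by R_PROFILE_USER):
# memory.limit(size = 6000)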
I'm running a simulation in an IPython notebook that is composed of seven functions that depend on each other and take 13 different parameters. Some of the functions are called within other functions so that one function can run the entire simulation. The simulation involves manipulating two parameters for a total of >20k iterations. Two simulations can be run asynchronously. Since each iteration takes ~1.5 seconds, I'm investigating parallel processing.
When I first tried ipyparallel, I got a "global name not defined" error. It makes sense that local objects can't be found by a worker. In an effort to avoid spending quite a bit of time going down a rabbit hole, what would be the easiest way to pass a whole bunch of objects to all of the workers? Are there other gotchas to consider when using ipyparallel in this way?
There is a bit more detail in this related question, but the gist is: interactively defined functions resolve in the interactive namespace (__main__), which is different on the engine and the client. You can send functions to the engines with view.push(dict(func=func, func2=func2)), in which case they will be found. The alternative is to define your functions in a module or package that you ensure is installed on all the engines.
For instance, in a script:
def bar(x):
    return x * x

def foo(y):
    return bar(y)
view.apply(foo, 5) # NameError on bar
view.push(dict(bar=bar)) # send bar
view.apply(foo, 5) # 25
Often when using IPython parallel from a notebook or larger script, one of the early steps is seeding the namespace of the engines:
rc[:].push(dict(
    f1=f1,
    f2=f2,
    const=const,
))
If you have more than a few names to push this way, it might be time to consider defining these functions in a module, and distributing that instead.
Suppose I have two instances of R running. Are there existing solutions to easily send variables/data from one instance to the other? Maybe even synchronize the values of a variable between the two instances?
For example, first the two instances (R1 & R2) would be connected somehow, then in R1:
> a <- 12
> push(a)
and at this point in R2:
> a
[1] 12
The keyword here is ease of use: make it as quick as possible (for the user) to interactively synchronize the value of certain variables. I would use this with Mathematica's RLink to work interactively in one R instance and push/pull data to/from Mathematica's instance.
I realize that the question might sound strange. The reason why I'm hopeful that something like this exists is that it would be useful for parallel or distributed computing as well (which is not my use case here).
Have a look at svSocket. From the package description at: svSocket.pdf
The SciViews svSocket package provides a stateful, multi-client and preemptive socket server. [...]
Although initially designed to serve GUI clients, the R socket server can also be used to exchange data between separate R processes.
This demo video is really worth watching.
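A minimal sketch of the push/pull pattern with svSocket, if I read its docs correctly (the port number and variable name are arbitrary):
# R instance acting as the server (e.g. the one driven by RLink)
library(svSocket)
startSocketServer(port = 8888)

# R instance acting as the client
library(svSocket)
con <- socketConnection(host = "localhost", port = 8888, blocking = FALSE)
evalServer(con, a, 12)   # "push": assign 12 to a in the server session
evalServer(con, a)       # "pull": evaluate a in the server and return it
# [1] 12
close(con)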
This is a different approach to the push/pull model, but you can use the bigmemory package to create a matrix that exists in shared memory (or on disk) that can be accessed across multiple R sessions on the same machine:
R session 1
library(bigmemory)
m <- matrix(1:9, 3, 3)
m <- as.big.matrix(m, type="double", backingfile="m.bin", descriptorfile="m.desc")
m
# An object of class "big.matrix"
# Slot "address":
# <pointer: 0x7fba95004ee0>
R session 2
library(bigmemory)
m <- attach.big.matrix("m.desc")
# Now any changes you make to m will be reflected in both sessions!
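For example, a quick sketch of the round trip:
# still in session 2
m[1, 1] <- 99

# back in session 1
m[1, 1]
# [1] 99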
This is also useful for parallel computing on matrices, since you're now only passing the descriptor (a pointer to the matrix) to each of the spawned R sessions, rather than the whole object.
Since we've created a file-backed big matrix, it also allows you to create and operate on matrices larger than memory!
Parallel example
library(bigmemory)
library(doMC) # Windows users will need to choose a different parallel backend
library(foreach)
registerDoMC(4) # number of cores (new R sessions to spawn) to run in parallel.
m <- matrix(rnorm(1000*1000), 1000)
as.big.matrix(m, type="double", backingfile="m.bin", descriptorfile="m.desc")
# Just to make sure we don't have any of these objects in memory when we spawn the
# parallel sessions
rm(m)
gc()
foreach(i = 1:4) %dopar% {
  m <- attach.big.matrix("m.desc")
  # do something!
}
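As a hypothetical stand-in for the "do something!" step, each worker could attach the shared matrix and work on its own block of 250 columns, returning only the small result:
chunk_sums <- foreach(i = 1:4, .combine = c) %dopar% {
  m <- attach.big.matrix("m.desc")
  cols <- ((i - 1) * 250 + 1):(i * 250)   # this worker's block of columns
  colSums(m[, cols])                       # only the 250 sums travel back
}
length(chunk_sums)
# [1] 1000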
I think Redis can help you achieve what you want.
You can use the R packages rredis and/or RcppRedis
On the first instance of R you can do
library(rredis)
redisConnect()
redisSet("a", 12)
[1] "OK"
Then on the second R instance you can do
library(rredis)
redisConnect()
redisGet("a")
[1] 12
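To get something closer to the push(a) / pull() interface sketched in the question, you could wrap these calls in small helpers (a sketch; the push/pull names are made up here):
library(rredis)
redisConnect()

# push(): store a variable under its own name; pull(): fetch it back by name.
# rredis serializes arbitrary R objects for you.
push <- function(x) {
  key <- deparse(substitute(x))
  redisSet(key, x)
  invisible(key)
}
pull <- function(key) redisGet(key)

# In R1:  a <- 12; push(a)
# In R2:  pull("a")
#         [1] 12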
Suppose I have a matrix bigm. I need to use a random subset of this matrix and give it to a machine learning algorithm such as say svm. The random subset of the matrix will only be known at runtime. Additionally there are other parameters that are also chosen from a grid.
So, I have code that looks something like this:
foo = function(bigm, inTrain, moreParamsList) {
  parsList = c(list(data = bigm[inTrain, ]), moreParamsList)
  do.call(svm, parsList)
}
What I am seeking to know is whether R uses new memory to save that bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way of using a sub-matrix in R without using new memory?
Edit:
Also, assume I am calling foo using mclapply (on Linux), where bigm resides in the parent process. Does that mean I am making mc.cores copies of bigm, or do all cores just use the object from the parent?
Are there any functions or heuristics for tracking the memory location and consumption of objects created in the different cores?
Thanks.
I am just going to put in here what I found from my research on this topic:
I don't think using mclapply makes mc.cores copies of bigm, based on this passage from the multicore manual:
In a nutshell fork spawns a copy (child) of the current process, that can work in parallel to the master (parent) process. At the point of forking both processes share exactly the same state including the workspace, global options, loaded packages etc. Forking is relatively cheap in modern operating systems and no real copy of the used memory is created, instead both processes share the same memory and only modified parts are copied. This makes fork an ideal tool for parallel processing since there is no need to setup the parallel working environment, data and code is shared automatically from the start.
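To convince myself, here is a small sketch with parallel::mclapply (the successor to multicore; Unix-only, and the object names are made up) showing a forked worker reading a large parent object without it being exported or copied up front:
library(parallel)

bigm <- matrix(rnorm(1e6), nrow = 1000)   # lives only in the parent process
inTrain <- sample(nrow(bigm), 100)

# Each forked child sees bigm and inTrain through the shared (copy-on-write)
# address space; nothing is exported explicitly, and merely reading the matrix
# does not duplicate it.
res <- mclapply(1:4, function(i) sum(bigm[inTrain, i]), mc.cores = 4)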
For the first part of your question, you can use tracemem:
This function marks an object so that a message is printed whenever the internal code copies the object
Here is an example:
a <- 1:10
tracemem(a)
## [1] "<0x000000001669cf00>"
b <- a ## b and a share memory (no message)
d <- stats::rnorm(10)
invisible(lm(d ~ a+log(b)))
## tracemem[0x000000001669cf00 -> 0x000000001669e298] ## object a is copied twice
## tracemem[0x000000001669cf00 -> 0x0000000016698a38]
untracemem(a)
You already found from the manual that mclapply isn't supposed to make copies of bigm.
But each thread needs to make its own copy of the smaller training matrix as it varies across the threads.
If you'd parallelize with e.g. snow, you'd need to have a copy of the data in each of the cluster nodes. However, in that case you could rewrite your problem in a way that only the smaller training matrices are handed over.
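A sketch of that rewrite (colMeans stands in for the actual svm fit, and the object names are placeholders): build the small training matrices in the master and hand only those to the cluster workers.
library(parallel)

bigm <- matrix(rnorm(1000 * 20), ncol = 20)
inTrainList <- replicate(4, sample(nrow(bigm), 100), simplify = FALSE)

# Extract the small submatrices in the master process ...
subsets <- lapply(inTrainList, function(idx) bigm[idx, ])

# ... and ship only those to the workers; bigm itself never leaves the master.
cl <- makeCluster(2)
res <- parLapply(cl, subsets, function(sub) colMeans(sub))
stopCluster(cl)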
The search term for the general investigation of memory consumption behaviour is memory profiling. Unfortunately, AFAIK the available tools are not (yet) very convenient; see e.g.
Monitor memory usage in R
Memory profiling in R - tools for summarizing
I have some big, big files that I work with and I use several different I/O functions to access them. The most common one is the bigmemory package.
When writing to the files, I've learned the hard way to flush output buffers, otherwise all bets are off on whether the data was saved. However, this can lead to some very long wait times while bigmemory does its thing (many minutes). I don't know why this happens - it doesn't always occur and it's not easily reproduced.
Is there some way to determine whether or not I/O buffers have been flushed in R, especially for bigmemory? If the operating system matters, then feel free to constrain the answer in that way.
If an answer can be generalized beyond bigmemory, that would be great, as I sometimes rely on other memory mapping functions or I/O streams.
If there are no good solutions to checking whether buffers have been flushed, are there cases in which it can be assumed that buffers have been flushed? I.e. besides using flush().
Update: I should clarify that these are all binary connections. @RichieCotton suggested isIncomplete(), though the help documentation only mentions text connections. It's not clear whether it is usable for binary connections.
Is this more convincing that isIncomplete() works with binary files?
# R process 1
zz <- file("~/test", "wb")
writeBin(c(1:100000),con=zz)
close(zz)
# R process 2
zz2 <- file("~/test", "rb")
inpp <- readBin(con=zz2, integer(), 10000)
while(isIncomplete(zz2)) {Sys.sleep(1); inpp <- c(inpp, readBin(zz2, integer(), 10000))}
close(zz2)
(Modified from the help(connections) file.)
I'll put forward my own answer, but I welcome anything that is clearer.
From what I've seen so far, the various connection functions, e.g. file, open, close, flush, isOpen, and isIncomplete (among others), are based on specific connection types, e.g. files, pipes, URLs, and a few other things.
In contrast, bigmemory has its own connection type; the bigmemory object is an S4 object with a slot holding a memory address for the operating system buffers. Once the data is placed there, the OS is in charge of flushing those buffers. Since that is an OS responsibility, getting information on "dirty" buffers requires interacting with the OS, not with R.
Thus, the answer for bigmemory is "no" as the data is stored in the kernel buffer, though it may be "yes" for other connections that are handled through STDIO (i.e. stored in "user space").
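What you can do is at least request the write-back explicitly: for a file-backed big.matrix, bigmemory provides a flush() method. As far as I can tell this only asks for the sync and still does not tell you whether the OS buffers are actually clean (a sketch; file names are arbitrary):
library(bigmemory)

x <- filebacked.big.matrix(1000, 1000, type = "double",
                           backingfile = "x.bin", descriptorfile = "x.desc")
x[1, 1] <- 42

# Push modified data out to the backing file; this requests the write-back
# but gives no confirmation about the state of the kernel buffers.
flush(x)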
For more insight on the OS / kernel side of things, see this question on SO; I am investigating a couple of programs (not just R + bigmemory) that are producing buffer flushing curiosities, and that thread helped to enlighten me about the kernel side of things.