Is it possible to push/pull variables between two instances of R?

Suppose I have two instances of R running. Are there existing solutions to easily send variables/data from one instance to the other? Maybe even synchronize the values of a variable between the two instances?
For example, first the two instances (R1 & R2) would be connected somehow, then in R1:
> a <- 12
> push(a)
and at this point in R2:
> a
[1] 12
The keyword here is ease of use: make it as quick as possible (for the user) to interactively synchronize the value of certain variables. I would use this with Mathematica's RLink to work interactively in one R instance and push/pull data to/from Mathematica's instance.
I realize that the question might sound strange. The reason why I'm hopeful that something like this exists is that it would be useful for parallel or distributed computing as well (which is not my use case here).

Have a look at svSocket. From the package description (svSocket.pdf):
The SciViews svSocket package provides a stateful, multi-client and preemptive socket server. [...]
Although initially designed to serve GUI clients, the R socket server can also be used to exchange data between separate R processes.
This demo video is well worth watching.
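For example, here is a minimal sketch of a push/pull workflow with svSocket (assuming its startSocketServer() and evalServer() functions; the port number is arbitrary):
# R session 1 (the server)
library(svSocket)
startSocketServer(port = 9875)

# R session 2 (the client)
library(svSocket)
con <- socketConnection(host = "localhost", port = 9875)
evalServer(con, a, 12)   # "push": assigns a <- 12 in session 1
evalServer(con, a)       # "pull": evaluates a in session 1 and returns it
# [1] 12
close(con)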

This is a different approach to the push/pull model, but you can use the bigmemory package to create a matrix that exists in shared memory (or on disk) that can be accessed across multiple R sessions on the same machine:
R session 1
library(bigmemory)
m <- matrix(1:9, 3, 3)
m <- as.big.matrix(m, type="double", backingfile="m.bin", descriptorfile="m.desc")
m
# An object of class "big.matrix"
# Slot "address":
# <pointer: 0x7fba95004ee0>
R session 2
library(bigmemory)
m <- attach.big.matrix("m.desc")
# Now any changes you make to m will be reflected in both sessions!
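# For example (an illustrative check): in session 2,
m[1, 1] <- 100
# ...and back in session 1,
m[1, 1]
# [1] 100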
This is also useful for parallel computing on matrices, since you now only pass the descriptor (a pointer to the matrix) to each of the spawned R sessions, rather than the whole object.
Since we've created a file-backed big matrix, it also allows you to create and operate on matrices larger than memory!
Parallel example
library(bigmemory)
library(doMC) # Windows users will need to choose a different parallel backend
library(foreach)
registerDoMC(4) # number of cores (new R sessions to spawn) to run in parallel.
m <- matrix(rnorm(1000*1000), 1000)
as.big.matrix(m, type="double", backingfile="m.bin", descriptorfile="m.desc")
# Just to make sure we don't have any of these objects in memory when we spawn the
# parallel sessions
rm(m)
gc()
foreach(i = 1:4) %dopar% {
  m <- attach.big.matrix("m.desc")
  # do something!
}

I think Redis can help you achieve what you want.
You can use the R packages rredis and/or RcppRedis.
On the first instance of R you can do
library(rredis)
redisConnect()
redisSet("a", 12)
[1] "OK"
Then on the second R instance you can do
library(rredis)
redisConnect()
redisGet("a")
[1] 12
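To get an interface closer to the push()/pull() in the question, a small convenience sketch (the helper names push and pull are my own) on top of rredis could look like:
library(rredis)
redisConnect()   # assumes a Redis server is running locally

push <- function(x) redisSet(deparse(substitute(x)), x)
pull <- function(name) redisGet(name)

a <- 12
push(a)      # stores the value under the key "a"
pull("a")    # any connected R session can now retrieve it
# [1] 12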

Related

how can I parallelize scaling a matrix (using the Seurat package) across multiple computing nodes?

I am working on scaling and clustering a matrix of single-nucleus RNA sequencing data (genes x cells) using the R package Seurat. My data is large, containing 11,500 genes and ~1.5mil cells. Due to the size of the data, the fastest way to scale the matrix would be to parallelize over multiple nodes (each containing 40 cores). I am computing on the Niagara cluster and can request as many cores as needed. My problem is that I can't figure out a way to effectively parallelize my code. I tried using the future package (which is recommended by Seurat) but that confines my data to one node, which is not enough. I also tried Rmpi, however that seemed to assign the same task to all the spawned workers, which was to scale the whole matrix and took too long. I have read about future.batchtools, but haven't been able to figure out the syntax.
I'll include the code I used for Rmpi and future.batchtools. I would appreciate any troubleshooting/alternative strategies to try.
Rmpi:
library(Rmpi)
library(Seurat)
Seuratdata <- readRDS("/path/seuratobject.RDS")
mpi.universe.size()
mpi.spawn.Rslaves(nslaves=60)
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( np <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
myfunc <- function(data){
  all.genes <- rownames(x=data)
  Seuratdata <- ScaleData(data, features=all.genes)
  return(Seuratdata)
}
Seuratdata<-mpi.remote.exec(cmd=myfunc, data=Seuratdata)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
mpi.close.Rslaves()
mpi.exit()
future.batchtools:
library(future)
library(future.batchtools)
library(Seurat)
plan(tweak(batchtools_slurm, workers=80, resources=list(ncpus = 1, memory=10*1024^3,
     walltime=10*60*60, partition='batch'), template = "./slurm.tmpl"))
Seuratdata <- readRDS("/path/seuratobject.RDS")
all.genes <- rownames(x=Seuratdata)
Seuratdata <- ScaleData(Seuratdata, features=all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
If you've got SSH permission between compute nodes, then you can submit a main job to the scheduler:
$ sbatch --partition=batch --ntasks=100 --time=10:00:00 --mem=10G script.sh
which then calls your script.R, e.g. Rscript script.R, that looks like:
library(future)
plan(cluster)
...
This will spin up 100 PSOCK cluster workers on whatever compute nodes Slurm has allocated the job. This works, because plan(cluster) defaults to plan(cluster, workers = availableWorkers()) and availableWorkers() picks up the information in SLURM_JOB_NODELIST set by Slurm. You can add:
print(parallelly::availableWorkers())
at the top to log which compute nodes are being used.
However, there are two limitations:
plan(cluster) requires SSH access to the hosts in order to spin up the parallel workers on those hosts
R has a maximum of 125 workers this way; cf. https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28.
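Putting it together, a fuller sketch of what script.R could look like (this assumes Seurat's ScaleData() respects the registered future plan, as the question suggests; the path, object names, and the future.globals.maxSize value are placeholders to adjust):
library(future)
library(Seurat)

print(parallelly::availableWorkers())          # log the nodes Slurm allocated

plan(cluster)                                  # workers = availableWorkers() by default
options(future.globals.maxSize = 20 * 1024^3)  # allow large globals to be exported

Seuratdata <- readRDS("/path/seuratobject.RDS")
all.genes  <- rownames(x = Seuratdata)
Seuratdata <- ScaleData(Seuratdata, features = all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")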

R: clarification on memory management

Suppose I have a matrix bigm. I need to use a random subset of this matrix and give it to a machine learning algorithm such as say svm. The random subset of the matrix will only be known at runtime. Additionally there are other parameters that are also chosen from a grid.
So, I have code that looks something like this:
foo = function (bigm, inTrain, moreParamsList) {
parsList = c(list(data=bigm[inTrain, ]), moreParamsList)
do.call(svm, parsList)
}
What I am seeking to know is whether R uses new memory to save that bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way of using a sub-matrix in R without using new memory?
Edit:
Also, assume I am calling foo using mclapply (on Linux) where bigm resides in the parent process. Does that mean I am making mc.cores number of copies of bigm or do all cores just use the object from the parent?
Any functions and heuristics of tracking memory location and consumption of objects being made in different cores?
Thanks.
I am just going to put in here what I find from my research on this topic:
I don't think mclapply makes mc.cores copies of bigm, based on this passage from the multicore manual:
In a nutshell fork spawns a copy (child) of the current process, that can work in parallel
to the master (parent) process. At the point of forking both processes share exactly the
same state including the workspace, global options, loaded packages etc. Forking is
relatively cheap in modern operating systems and no real copy of the used memory is
created, instead both processes share the same memory and only modified parts are copied.
This makes fork an ideal tool for parallel processing since there is no need to setup the
parallel working environment, data and code is shared automatically from the start.
For the first part of your question, you can use tracemem:
This function marks an object so that a message is printed whenever the internal code copies the object
Here is an example:
a <- 1:10
tracemem(a)
## [1] "<0x000000001669cf00>"
b <- a ## b and a share memory (no message)
d <- stats::rnorm(10)
invisible(lm(d ~ a+log(b)))
## tracemem[0x000000001669cf00 -> 0x000000001669e298] ## object a is copied twice
## tracemem[0x000000001669cf00 -> 0x0000000016698a38]
untracemem(a)
You already found from the manual that mclapply isn't supposed to make copies of bigm.
But each thread needs to make its own copy of the smaller training matrix as it varies across the threads.
If you'd parallelize with e.g. snow, you'd need to have a copy of the data in each of the cluster nodes. However, in that case you could rewrite your problem in a way that only the smaller training matrices are handed over.
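For example, here is a minimal sketch of that idea (the data, grid, and model call are made up for illustration): subset once in the parent and hand only the sub-matrix to the workers, so bigm itself is never exported.
library(parallel)
library(e1071)                                  # provides svm()

bigm    <- matrix(rnorm(1000 * 11), ncol = 11)  # stand-in for the real matrix
inTrain <- sample(nrow(bigm), 200)
subm    <- as.data.frame(bigm[inTrain, ])       # the only copy of the subset

cl <- makeCluster(2)
clusterEvalQ(cl, library(e1071))

costGrid <- c(1, 10, 100)                       # the "other parameters" grid
fits <- parLapply(cl, costGrid, function(cost, data) {
  svm(V1 ~ ., data = data, cost = cost)         # each worker receives only subm
}, data = subm)

stopCluster(cl)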
The search term for the general investigation of memory consumption behaviour is memory profiling. Unfortunately, AFAIK the available tools are not (yet) very convenient, see e.g.
Monitor memory usage in R
Memory profiling in R - tools for summarizing

Asynchronous command dispatch in interactive R

I'm wondering if this is possible to do (it probably isn't) using one of the parallel processing backends in R. I've tried a few google searches and come up with nothing.
The general problem I have at the moment:
I have some large objects that take about half an hour to load
I want to generate a series of plots on the data (takes a few minutes).
I want to go and do other things with the data while this happens (Not changing the underlying data though!)
Ideally I would be able to dispatch the command from the interactive session, and not have to wait for it to return (so I can go do other things while I wait for the plot to render). Is this possible, or is this a case of wishful thinking?
To expand on Dirk's answer, I suggest that you use the "snow" API in the parallel package. The mcparallel function would seem to be perfect for this (if you're not using Windows), but it doesn't work well for performing graphics operations due to its use of fork. The problem with the "snow" API is that it doesn't officially support asynchronous operations. However, it's rather easy to do if you don't mind cheating by using non-exported functions. If you look at the code for clusterCall, you can figure out how to submit tasks asynchronously:
> library(parallel)
> clusterCall
function (cl = NULL, fun, ...)
{
    cl <- defaultCluster(cl)
    for (i in seq_along(cl)) sendCall(cl[[i]], fun, list(...))
    checkForRemoteErrors(lapply(cl, recvResult))
}
So you just use sendCall to submit a task, and recvResult to wait for the result. Here's an example of that using the bigmemory package, as suggested by Dirk.
You can create a "big matrix" using functions such as big.matrix or as.big.matrix. You'll probably want to do that efficiently, but I'll just convert a matrix z using as.big.matrix:
library(bigmemory)
big <- as.big.matrix(z)
Now I'll create a cluster and connect each of the workers to big using describe and attach.big.matrix:
cl <- makePSOCKcluster(2)
worker.init <- function(descr) {
library(bigmemory)
big <<- attach.big.matrix(descr)
X11() # use "quartz()" on a Mac; "windows()" on Windows
NULL
}
clusterCall(cl, worker.init, describe(big))
This also opens a graphics window on each worker, in addition to attaching to the big matrix.
To call persp on the first cluster worker, we use sendCall:
parallel:::sendCall(cl[[1]], function() {persp(big[]); NULL}, list())
This returns almost immediately, although it may take a while until the plot appears. At this point, you can submit tasks to the other cluster worker, or do something else that is completely unrelated. Just make sure that you read the result before submitting another task to the same worker:
r1 <- parallel:::recvResult(cl[[1]])
Of course, this is all very error prone and not at all pretty, but you could write some functions to make it easier. Just keep in mind that non-exported functions such as these can change with any new release of R.
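For instance, a pair of thin wrappers (the names submitTask and collectResult are my own) could make the pattern a little less error prone:
submitTask <- function(node, fun, ...) {
  parallel:::sendCall(node, fun, list(...))
  invisible(node)
}
collectResult <- function(node) {
  parallel:::recvResult(node)
}

# fire off the plot on worker 1, keep working interactively, collect later
submitTask(cl[[1]], function() { persp(big[]); NULL })
# ... other interactive work ...
r1 <- collectResult(cl[[1]])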
Note that it is perfectly possible and legitimate to execute a task on a specific worker or set of workers by subsetting the cluster object. For example:
clusterEvalQ(cl[1], persp(big[]))
This will send the task to the first worker while the others do nothing. But of course, this is synchronous, so you can't do anything on the other cluster workers until this task finishes. The only way that I know to send the tasks asynchronously is to cheat.
R is, and will remain, single-threaded.
But you can share resources. One approach would be to load your big data in one session, assign it to a bigmemory object -- and then share the 'handle' to that object with other R sessions on the same box. This should be reasonably straightforward on a decent Linux box with sufficient RAM (i.e. a low multiple of all your data needs).

Understanding the differences between mclapply and parLapply in R

I've recently started using parallel techniques in R for a project and have my program working on Linux systems using mclapply from the parallel package. However, I've hit a road block with my understanding of parLapply for Windows.
Using mclapply I can set the number of cores, iterations, and pass that to an existing function in my workspace.
mclapply(1:8, function(z) adder(z, 100), mc.cores=4)
I don't seem to be able to achieve the same in windows using parLapply. As I understand it, I need to pass all the variables through using clusterExport() and pass the actual function I want to apply into the argument.
Is this correct or is there something similar to the mclapply function that's applicable to Windows?
The beauty of mclapply is that the worker processes are all created as clones of the master right at the point that mclapply is called, so you don't have to worry about reproducing your environment on each of the cluster workers. Unfortunately, that isn't possible on Windows.
When using parLapply, you generally have to perform the following additional steps:
Create a PSOCK cluster
Register the cluster if desired
Load necessary packages on the cluster workers
Export necessary data and functions to the global environment of the cluster workers
Also, when you're done, it's good practice to shut down the PSOCK cluster using stopCluster.
Here's a translation of your example to parLapply:
library(parallel)
cl <- makePSOCKcluster(4)
setDefaultCluster(cl)
adder <- function(a, b) a + b
clusterExport(NULL, c('adder'))
parLapply(NULL, 1:8, function(z) adder(z, 100))
If your adder function requires a package, you'll have to load that package on each of the workers before calling it with parLapply. You can do that quite easily with clusterEvalQ:
clusterEvalQ(NULL, library(MASS))
Note that the NULL first argument to clusterExport, clusterEvalQ and parLapply indicates that they should use the cluster object registered via setDefaultCluster. That can be very useful if your program uses mclapply in many different functions, so that you don't have to pass the cluster object to every function that needs it when converting your program to use parLapply.
Of course, adder may call other functions in your global environment which call other functions, etc. In that case, you'll have to export them as well and load any packages that they need. Also note that if any variables that you've exported change during the course of your program, you will have to export them again in order to update them on the cluster workers. Again, that isn't necessary with mclapply because it always creates/clones/forks the workers whenever it is called, making that unnecessary.
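A minimal illustration of that last re-export point:
x <- 10
clusterExport(NULL, "x")   # the workers now see x == 10
x <- 20                    # the workers still see x == 10
clusterExport(NULL, "x")   # export again so they see x == 20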
mclapply is simpler to use, and uses the underlying operating system fork() functionality to achieve parallelization. However, since Windows does not have fork(), it will run standard lapply instead - no parallelization.
parLapply is a different beast. It will create a cluster of processes, which could even reside on different machines on your network, and they communicate via TCP/IP in order to pass the tasks and results between each other.
The problem in your code there is that you didn't realize the first parameter to parLapply should be a "cluster" object. The simplest example of using parLapply to run on a single machine that I can think of is this:
library(parallel)
# Spawn child processes using fork() on the local machine
cl <- makeForkCluster(getOption("cl.cores", 2))
# Use parLapply to calculate lengths of 1000 strings
text = rep("Hello, world!", 1000)
len = parLapply(cl, text, nchar)
# Kill child processes since they are no longer needed
stopCluster(cl)
Using parLapply with a cluster created using makeForkCluster as above is functionally equivalent to calling mclapply. So it will also not work on Windows. :) Take a look at the other ways you can create a cluster with makeCluster and makePSOCKcluster in the documentation and check out what works best for your requirements.
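For comparison, the rough mclapply equivalent of the fork-cluster example above (again, Unix-only) would be:
len <- mclapply(text, nchar, mc.cores = 2)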
This is how I would use parLapply for parallel computations:
library(parallel)
##
test_func0 <- function(i) {
  lfactorial(i) + q + d
}
##
x <- 1:100
q <- 20
d <- 25
##
t1 <- Sys.time()
##
clust <- makeCluster(4)
## clusterExport loads these objects from the specified environment into the clust object
clusterExport(clust, c("test_func0", "q", "d"), envir = environment())
a <- do.call(c, parLapply(clust, x, function(i) {
test_func0(i)
}))
print(Sys.time()-t1)
##
stopCluster(clust)
Please keep in mind that, in my experience, makeCluster is much faster than makePSOCKcluster at creating the cluster environment that feeds parLapply.

How to avoid duplicating objects with foreach

I have a huge string vector and would like to do a parallel computation using the foreach and doSNOW packages. I noticed that foreach would make copies of the vector for each process, thus exhausting system memory quickly. I tried to break the vector into smaller pieces in a list object, but still do not see any memory usage reduction. Does anyone have thoughts on this? Below is some demo code:
library(foreach)
library(doSNOW)
library(snow)
x<-rep('some string', 200000000)
# split x into smaller pieces in a list object
splits<-getsplits(x, mode='bysize', size=1000000)
tt<-vector('list', length(splits$start))
for (i in 1:length(tt)) tt[[i]]<-x[splits$start[i]: splits$end[i]]
ret<-foreach(i = 1:length(splits$start), .export=c('somefun'), .combine=c) %dopar% somefun(tt[[i]])
The style of iterating that you're using generally works well with the doMC backend because the workers can effectively share tt by the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though they only actually need a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
Here's a complete example using doMPI:
suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
ret <- foreach(s=isplitVector(x, chunkSize=chunkSize), .combine='c') %dopar% {
somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()
When I run this on my MacBook Pro with 4 GB of memory
$ time mpirun -n 5 R --slave -f split.R
it takes about 16 seconds.
You have to be careful with the number of workers that you create on the same machine, although decreasing the value of chunkSize may allow you to start more.
You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
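A sketch of that variant (assuming 'strings.txt' exists; ireadLines() comes from the iterators package):
library(iterators)
ret <- foreach(s = ireadLines('strings.txt', n = chunkSize), .combine = 'c') %dopar% {
  somefun(s)
}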
