I've been playing around with the R interface to the Redis database, as well as the doRedis parallel backend for foreach. I have a couple of questions to help me better apply these tools:
1. doMC, doSMP, doSNOW, etc. all seem to work by spawning worker processes on the same computer, passing each one an element from a list and a function to apply, and then gathering the results. In the case of doMC, the workers share memory. However, I'm a little confused as to how a database can provide this same functionality.
2. When I add an additional slave computer to the doRedis job queue (as in this video), is the entire Redis database sent over to the slave computer? Or is each slave sent just the data it needs at a particular moment (i.e. one element of a list and a function to apply)?
3. How do I explicitly pass additional data and functions to the doRedis job queue that each slave will need to perform its computations?
4. When using doRedis and foreach, are there any additional 'gotchas' that might not apply to other parallel backends?
I know this is a lot of questions, but I've been running into situations where my limited understanding of how parallel processing works has been hindering my ability to apply it. For example, I recently tried to parallelize a computation on a large database and caught myself passing the entire database to each node in my cluster, an operation which completely destroyed any advantage I'd gained from parallelizing.
Thank you!
One piece of the puzzle is the rredis package.
1 - doRedis uses rredis. Specifically, doRedis.R issues Redis RPUSH commands (as it iterates over the foreach items), and each redisWorker issues a Redis BRPOP to grab something from the Redis list (which you named in your doRedis "job"). See the sketch after this list.
Redis is not just a database. Here it is being used as a queue!
2 - You have one instance (remotely) accessible to all your R workers. Think of the Redis server as a distributed queue. Your job master pushes items onto a list, and the workers grab an item, process it, and push the result onto a result list. You can have M workers for N items; it depends on what you want to do.
3 - Use the env param. That uses a Redis SET, which all workers have access to (via Redis GET). You pass a delimited expression on the foreach side, and it is stored in a string key in Redis to which the workers have access.
4 - None that I know of (but that is hardly authoritative, so do ask around). I also suggest you read the provided source code; the answers above come straight from reading doRedis.R and redisWorker.R.
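To make points 1 and 3 concrete, here is a minimal sketch of the underlying rredis calls, assuming a Redis server reachable at 10.0.0.7; the key names, big.lookup.table and do_work() are made-up placeholders, and doRedis's real internals differ in detail:

library(rredis)

## master side
redisConnect(host = '10.0.0.7')                # the shared Redis server
for (task in 1:10)
    redisRPush('jobs', task)                   # RPUSH: queue one work item per iteration
redisSet('shared.data', big.lookup.table)      # SET: an object every worker will need (placeholder)

## worker side (possibly on another machine)
redisConnect(host = '10.0.0.7')
shared <- redisGet('shared.data')              # GET: fetch the shared object once
repeat {
    task <- redisBRPop('jobs')                 # BRPOP: block until a task is available
    redisRPush('results', do_work(task, shared))  # do_work() is a placeholder; push the result back
}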
Hope this helps.
[P.S. telnet to your Redis server and issue the MONITOR command to watch the chatter back and forth.]
We are running SQL Server R Services on an Azure VM to provide a web interface to a modeling algorithm. We use T-SQL to call a very complicated and memory-intensive R script. It runs fine when we submit a single job, but we get a memory allocation error when we submit another job before the first has finished. Eventually we will need to queue hundreds of jobs that will run over many hours. We assume that it is initiating two R processes which use up the resources.

What is the best way to force the R jobs to run sequentially rather than simultaneously? We have looked at updating the resource pool to MAX_PROCESSES = 0, with no success (we have already adjusted memory resources in this way). We are considering using the Azure Service Bus Queue, but we are hoping there are simpler options. We would appreciate any advice on how others have dealt with this sort of issue.
Thanks 10^6.
I'd like to replace RExcel with Excelsi-R. Excelsi-R talks to R via Rserve, and Rserve has a feature that makes each client work in an independent workspace.
What I want is to actually share a single workspace between at least two simultaneously connected clients. One client would be run by Excelsi-R, and another by a manually launched interactive R session. That would allow me to interface with the Excelsi-R session in the traditional way (say, in RStudio).
I don't need asynchronous computation; I'm perfectly happy if Excelsi-R has to wait until a command issued by the other connection finishes, and vice versa - just like in the RExcel "foreground mode".
Is it possible?
Not currently, since each process has exactly one connection. There are a few hacks - such as "switching" sessions by starting a listener for another connection in an existing session - but that may be a bit too limited.
That said, it is technically possible (Rserve supports looping over multiple connections - it is used in RCloud to support two separate processes on one connection) - the challenge is how to link two independent connections to a single process. The rsio communication was added in Rserve 1.8 specifically to allow the passing of descriptors between Rserve instances, but it has not been used so far. If there is interest in that kind of functionality, I can see how it could be added.
I need to run thousands* of models on 15 machines (each with 4 cores), all running Windows. I started to learn the parallel, snow and snowfall packages and read a bunch of intros, but they mainly focus on the setup of the master. There is only a little information on how to set up the worker (slave) nodes on Windows, and it is often contradictory: some say that a SOCK cluster is practically the easiest way to go, others claim that SOCK cluster setup is complicated on Windows (sshd setup) and that the best way to go is MPI.
So, what is the easiest way to set up slave nodes on Windows: MPI, PVM, SOCK or NWS? My possibly naive ideas were (listed by priority):
1. To use all 4 cores on the slave nodes (required).
2. Ideally, I need only R with some packages and a slave R script or R function that would listen on some port and wait for tasks from the master.
3. Ideally, nodes can be added/removed dynamically from the cluster.
4. Ideally, the slaves would connect to the master - so I wouldn't have to list all the slaves' IPs in the configuration of the master.
Only 1 is 100% required; 2-4 are "would be good". Is that too naive a request?
I am sorry, but I have not been able to figure this out from the available docs and tutorials. I would be grateful if you could point me to the right source.
* Note that each of those thousands of models will take at least 7 minutes, so there won't be a big communication overhead.
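For reference, the master-driven SOCK route those tutorials describe would look roughly like the sketch below, using base R's parallel package; the IPs, model_specs and fit_model() are placeholders, and every Windows worker still has to be reachable so the master can launch an R process on it - exactly the setup pain mentioned above:

library(parallel)

hosts <- rep(c('10.0.0.11', '10.0.0.12'), each = 4)   # 4 workers per slave machine
cl <- makePSOCKcluster(hosts)                         # master starts an R worker on each host

clusterEvalQ(cl, library(splines))                    # load whatever packages the models need
results <- parLapply(cl, model_specs, fit_model)      # model_specs / fit_model are hypothetical

stopCluster(cl)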
It's a shame how complex all these APIs (like parallel/snow/snowfall) are to work with: lots of docs, but not what you need... I have found an API which is very simple and goes straight to the ideas I sketched! It is Redis and the doRedis R package (as recommended here). Finally, a very simple tutorial exists! I just modified it a bit and got this:
The workers need only R, the doRedis package and this script:
require(doRedis)
redisWorker('jobs', '10.0.0.7') # IP of the server
The master needs a running Redis server (I installed the experimental Windows binaries) and this R code:
require(doRedis)
registerDoRedis('jobs')
foreach(j=1:10, .combine=sum, .multicombine=TRUE) %dopar% {
    ... # whatever you need to run
}
removeQueue('jobs')
Adding/removing workers is fully dynamic, there is no need to specify IPs on the master, you get automatic "load balancing", it is simple, and there is no need for tons of docs! This solution fulfills all the requirements and even more - as stated in ?registerDoRedis:
The doRedis parallel back end tolerates faults among the worker processes and automatically resubmits failed tasks.
I don't know how complex this would be to achieve with parallel/snow/snowfall over SOCK/MPI/PVM/NWS, or whether it would be possible at all, but I guess very complex...
The only disadvantages of using Redis that I found:
It is a database server. I wonder if this API exists somewhere without the need to install a database server, which I don't need at all. I guess it must exist!
There is a bug in the current doRedis package ("object '.doRedisGlobals' not found") with no solution yet, and I am not able to install the old working doRedis 1.0.5 package under R 3.0.1.
I am trying to run parallel R jobs using the multicore package. Every job is the execution of the same script.R with different arguments.
A general idea is to define a function that takes the args and then calls source("script.R"). The problem is that I cannot pass the args to script.R; since I am running in parallel, the args cannot be defined in the global scope.
Any help is welcome.
Since you are running parallel R instances, possibly even on different nodes/computers, using an external database to store the parameters might be a good option.
I would use Redis, as it is extremely fast and fully accessible from R, and for parallel runs its brother: doRedis.
So you could have a Redis server (or even a replicated slave database on every host) from which the parameters could be fetched. You could instantly update the parameters, even from outside of R, and they would be available to all workers; you could also easily add new workers for the task with doRedis.
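A rough sketch of how this could fit together with the multicore-style setup from the question; the Redis host, the 'params' key, and the way script.R consumes p and arg are all assumptions for illustration:

library(parallel)   # mclapply, the successor of the multicore package (Unix only)
library(rredis)

## master: store the parameters once; they can be updated at any time, even from outside R
redisConnect(host = '10.0.0.7')
redisSet('params', list(alpha = 0.1, n = 1000))
redisClose()

run_one <- function(arg) {
    redisConnect(host = '10.0.0.7')    # each forked worker opens its own connection
    p <- redisGet('params')            # fetch the current parameter set
    redisClose()
    source('script.R', local = TRUE)   # script.R can read `p` and `arg` from this frame
}

results <- mclapply(1:8, run_one)      # or a doRedis foreach loop for workers on other machines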
I have two processes, A and B. B is a process that performs some functions, and process A is the one that controls B, i.e. process A instructs process B by providing data (control and functional) to it.
I have a thread in B dedicated to IPC. All that thread does is get instructions from process A, while the other running threads do whatever they have to with the already existing data.
I thought of pipes and shared memory using shmat, but I am not satisfied. I want something like this: only when process A writes a message to B should the IPC thread in B wake up. Any idea how to achieve this?
The specifics depend somewhat on what kind of flexibility you need and who is using what pipes, but this should work: have process B's IPC thread select for readability on the pipe. When process A writes to the pipe, process B's IPC thread will be woken up.
I found a solution. I made one of the threads open one end of the pipe for reading, do the actual read, and close it. This goes on in an infinite while loop!
The process which wants to write to it will open it only when it needs to write, then close it, and will eventually end.
In fact, this setup avoids synchronisation issues as well. But I don't know what the consequences of it are in terms of performance!