multicore programmation : master/slave system using package `parallel` - r

This question does not have an answer in What is the easiest way to parallelize a vectorized function in R? as I am not asking for an easy way, but for an extra control on what the processes do and when then die, through a master/slave system.
Here is my problem: I used successfully mcparallel and mccollect to parallelize some tasks, along the following lines (X is a list) :
p1 <- mcparallel( lapply( X[1:25], function(x) my.function(x, theta) ) )
p2 <- mcparallel( lapply( X[26:50], function(x) my.function(x, theta) ) )
p3 <- mcparallel( lapply( X[51:75], function(x) my.function(x, theta) ) )
x4 <- lapply(X[76:100], function(x) my.function(x, theta) )
c( mccollect(p1), mccollect(p2), mccollect(p3), x4 )
The elements of X are big, the parameter theta is small, and the aim is to perform optimization on theta. Note that mclapply(X, ...) performs very badly on my problem (almost no time gained). I also tried %dopar% from the foreach package: no time gained at all!
To reduce the overhead and avoid new forks at each computation, I’d like to use a master/slave logic as exemplified in this Rmpi tutorial. I could feed the slaves new values of theta, this would avoid new forks at each new computation, with (I guess) copying the whole memory at each new fork, and so. As theta is small, and so are the results of my.function, the computation between the process would be fast and this would allow to gain a substantial amount of time in the subsequent computations.
However, I am told that MPI is a protocol which is more suited for using several computers. I use a multicore computer (16 cores), and I am told lighter protocols would be suited.
Can you give me an advice? In particular, is it possible to implement a master/slave system on a multicore computer, using the parallel package?

I sort of found a solution.
> library('parallel')
> makeCluster(2) -> cl
> # loading data to nodes:
> invisible(clusterApply( cl, 4:5, function(t) a <<- t ))
> # computations on these data with different arguments
> clusterApply( cl, 1:2, function(t) a+t )
[[1]]
[1] 5
[[2]]
[1] 7
> clusterApply( cl, 10:11, function(t) a+t )
[[1]]
[1] 14
[[2]]
[1] 16
> stopCluster(cl)
I think it will do what I want, but I still wait for other suggestions (hence I don’t accept my own answer).

Related

R parallel package - performance very slow in my toy example

I am trying to sample for two vectors 1000 times with replacement and calculate the ratio of means. Repeat this process 10,000 times.
I wrote a sample parallel code but it's taking much longer that using simple for loops on a single machine.
ratio_sim_par <- function(x1, x2, nrep = 1000) {
# Initiate cluster
cl <- makeCluster(detectCores() - 1) #Leave one core for other operations
clusterExport(cl, varlist=c("x1", "x2", "nrep"), envir=environment())
Tboot <- parLapply(cl, 1:nrep, function(x){
n1 <- length(x1)
n2 <- length(x2)
xx1 <- sample(x1, n1, replace = TRUE) # sample of size n1 with replacement from x1
xx2 <- sample(x2, n2, replace = TRUE) # sample of size n2 with replacement from x2
return(mean(xx1) / mean(xx2))
})
stopCluster(cl)
return(unlist(Tboot))
}
ratio_sim_par(x1, x2, 10000)
System time is unbearable. Can anyone help me understand the mistake I'm making? Thanks
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. In your case, you're calling parLapply 10,000 times and probably spending more resources forking each task than actually doing the resampling. Try something like this with a non-parallel version of ratio_sim_par:
mclapply(1:10000, ratio_sim_par, x1, x2, nrep = 1000, mc.cores = n_cores)
mclapply will split the job into as many cores as you have available and fork it once. I'm using mclapply instead of parLapply because I'm used to it and doesn't require as much setup.

Parallel Processing Example in R

Firstly, I would like to say that I am new to this topic.
Secondly, although I read a lot about Parallel processing in R, I'm still not confident about it.
I just invented simulation in R. So can someone help me with this invented code to understand Parallel processing? (I can see how it works)
My code as follows (Large Random numbers)
SimulateFn<-function(B,n){
M1=list()
for (i in 1:B){
M1[i]=(n^2)}
return(M1)}
SimulateFn(100000000,300000)
Could you please help me?
First of all, parallelization is the procedure of dividing a task into sub tasks, which are simultaneously processed by multiple processors or cores and can be independent or share some dependency between them - the latter case needs more planning and attention.
This procedure has some overhead to shedule subtasks - like copying data to each processor. That said, parallelization is worthless for fast computations. In your example, the threee main procedures are indexing ([), assignment (<-), and a (fast) math operation (^). The overhead for paralellization may be greater than the time to execute the subtask, so in that case parallelization can result in poorer performance!
Despite that, simple parallelization in R is fairly easy. An approach to parallelize your task is provided below, using the doParallel package. Other approachs include using packages as parallel.
library(doParallel)
## choose number of processors/cores
cl <- makeCluster(2)
registerDoParallel(cl)
## register elapsed time to evaluate code snippet
## %dopar% execute code in parallale
B <- 100000; n <- 300000
ptime <- system.time({
M1=list()
foreach(i=1:B) %dopar% {
M1[i]=(n^2)
}
})
## %do% execute sequentially
stime <- system.time({
M1=list()
foreach(i=1:B) %do% {
M1[i]=(n^2)
}
})
The elapsed times on my computer (2 core) were 59.472 and 44.932, respectively. Clearly, there were no improvement by parallelization: indeed, performance was worse!
A better example is shown below, where the main task is much more expensive in terms of computation need:
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})
stime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %do% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})
And elapsed times were 24.709 and 34.502: a gain of 28%.

How can I optimize the cpu usage of R using snowfall?

I was trying to implement fibonacci function using snowfall parallel package in R.
Following is the code I used.
vec <- 1:37
fib <- function(x)
{ if (x==0) return(0)
if (x==1) return(1)
if (x==2) return(2)
return(fib(x-1)+fib(x-2))
}
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)
sfExport("vec","fib")
result <- sfLapply(vec,fib)
sfstop()
while running the code, I observed the cpu usage. Though I have asked to use 4 cores, the machine was always using 2 cores.
does that mean, my code isn't using all 4 cores? Can anyone help me with a guidance? Can I optimize this performance?

How can I make a parallel operation faster than the serial version?

I'm attempting to "map" a function onto an array. However when trying both simple and complex functions, the parallel version is always slower than the serial version. How can I improve the performance of a parallel computation in R?
Simple parallel example:
library(parallel)
# Number of elements
arrayLength = 100
# Create data
input = 1:arrayLength
# A simple computation
foo = function(x, y) x^y - x^(y-1)
# Add complexity
iterations = 5 * 1000 * 1000
# Perform complex computation on each element
compute = function (x) {
y = x
for (i in 1:iterations) {
x = foo(x, y)
}
return(x)
}
# Parallelized compute
computeParallel = function(x) {
# Create a cluster with 1 fewer cores than are available.
cl <- makeCluster(detectCores() - 1) # 8-1 cores
# Send static vars & funcs to all cores
clusterExport(cl, c('foo', 'iterations'))
# Map
out = parSapply(cl, x, compute)
# Clean up
stopCluster(cl)
return(out)
}
system.time(out <- compute(input)) # 12 seconds using 25% of cpu
system.time(out <- computeParallel(input)) # 160 seconds using 100% of cpu
The problem is that you traded off all of the vectorization for parallelization, and that's a bad trade. You need to keep as much vectorization as possible to have any hope of getting an improvement with parallelization for this kind of problem.
The pvec function in the parallel package can be a good solution to this kind of problem, but it isn't supported in parallel on Windows. A more general solution which works on Windows is to use foreach with the itertools package which contains functions which are useful for iterating over various objects. Here's an example that uses the "isplitVector" function to create one subvector for each worker:
library(doParallel)
library(itertools)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
computeChunk <- function(x) {
foreach(xc=isplitVector(x, chunks=getDoParWorkers()),
.export=c('foo', 'iterations', 'compute'),
.combine='c') %dopar% {
compute(xc)
}
}
This still may not compare very well to the pure vector version, but it should get better as the value of "iterations" increases. It may actually help to decrease the number of workers unless the value of "iterations" is very large.
parSapply will run the function on each element of input separately, which means you are giving up the speed you gained from writing foo and compute in a vectorized fashion.
pvec will run a vectorized function on multiple cores by chunks. Try this:
system.time(out <- pvec(input, compute, mc.cores=4))

Is there an efficient way to parallelize mapply?

I have many rows and on every row I compute the uniroot of a non-linear function. I have a quad-core Ubuntu machine which hasn't stopped running my code for two days now. Not surprisingly, I'm looking for ways to speed things up ;-)
After some research, I noticed that only one core is currently used and parallelization is the thing to do. Digging deeper, I came to the conclusion (maybe incorrectly?) that the package foreach isn't really meant for my problem because too much overhead is produced (see, for example, SO). A good alternative seems to be multicore for Unix machines. In particular, the pvec function seems to be the most efficient one after I checked the help page.
However, if I understand it correctly, this function only takes one vector and splits it up accordingly. I need a function that can be parallized, but takes multiple vectors (or a data.frame instead), just like the mapply function does. Is there anything out there that I missed?
Here is a small example of what I want to do: (Note that I include a plyr example here because it can be an alternative to the base mapply function and it has a parallelize option. However, it is slower in my implementation and internally, it calls foreach to parallelize, so I think it won't help. Is that correct?)
library(plyr)
library(foreach)
n <- 10000
df <- data.frame(P = rnorm(n, mean=100, sd=10),
B0 = rnorm(n, mean=40, sd=5),
CF1 = rnorm(n, mean=30, sd=10),
CF2 = rnorm(n, mean=30, sd=5),
CF3 = rnorm(n, mean=90, sd=8))
get_uniroot <- function(P, B0, CF1, CF2, CF3) {
uniroot(function(x) {-P + B0 + CF1/x + CF2/x^2 + CF3/x^3},
lower = 1,
upper = 10,
tol = 0.00001)$root
}
system.time(x1 <- mapply(get_uniroot, df$P, df$B0, df$CF1, df$CF2, df$CF3))
#user system elapsed
#0.91 0.00 0.90
system.time(x2 <- mdply(df, get_uniroot))
#user system elapsed
#5.85 0.00 5.85
system.time(x3 <- foreach(P=df$P, B0=df$B0, CF1=df$CF1, CF2=df$CF2, CF3=df$CF3, .combine = "c") %do% {
get_uniroot(P, B0, CF1, CF2, CF3)})
#user system elapsed
# 10.30 0.00 10.36
all.equal(x1, x2$V1) #TRUE
all.equal(x1, x3) #TRUE
Also, I tried to implement Ryan Thompson's function chunkapply from the SO link above (only got rid of doMC part, because I couldn't install it. His example works, though, even after adjusting his function.),
but didn't get it to work. However, since it uses foreach, I thought the same arguments mentioned above apply, so I didn't try it too long.
#chunkapply(get_uniroot, list(P=df$P, B0=df$B0, CF1=df$CF1, CF2=df$CF2, CF3=df$CF3))
#Error in { : task 1 failed - "invalid function value in 'zeroin'"
PS: I know that I could just increase tol to reduce the number of steps that are necessary to find a uniroot. However, I already set tol as big as possible.
I'd use the parallel package that's built into R 2.14 and work with matrices. You could then simply use mclapply like this:
dfm <- as.matrix(df)
result <- mclapply(seq_len(nrow(dfm)),
function(x) do.call(get_uniroot,as.list(dfm[x,])),
mc.cores=4L
)
unlist(result)
This is basically doing the same mapply does, but in a parallel way.
But...
Mind you that parallelization always counts for some overhead as well. As I explained in the question you link to, going parallel only pays off if your inner function calculates significantly longer than the overhead involved. In your case, your uniroot function works pretty fast. You might then consider to cut your data frame in bigger chunks, and combine both mapply and mclapply. A possible way to do this is:
ncores <- 4
id <- floor(
quantile(0:nrow(df),
1-(0:ncores)/ncores
)
)
idm <- embed(id,2)
mapply_uniroot <- function(id){
tmp <- df[(id[1]+1):id[2],]
mapply(get_uniroot, tmp$P, tmp$B0, tmp$CF1, tmp$CF2, tmp$CF3)
}
result <-mclapply(nrow(idm):1,
function(x) mapply_uniroot(idm[x,]),
mc.cores=ncores)
final <- unlist(result)
This might need some tweaking, but it essentially breaks your df in exactly as many bits as there are cores, and run the mapply on every core. To show this works :
> x1 <- mapply(get_uniroot, df$P, df$B0, df$CF1, df$CF2, df$CF3)
> all.equal(final,x1)
[1] TRUE
it's an old topic but fyi you now have parallel::mcmapply doc is here. don't forget to set mc.cores in the options. I usually use mc.cores=parallel::detectCores()-1 to let one cpu free for OS operations.
x4 <- mcmapply(get_uniroot, df$P, df$B0, df$CF1, df$CF2, df$CF3,mc.cores=parallel::detectCores()-1)
This isn't exactly a best practices suggestion, but considerable speed-up can be had by identifying the root for all parameters in a 'vectorized' fashion. For instance,
bisect <-
function(f, interval, ..., lower=min(interval), upper=max(interval),
f.lower=f(lower, ...), f.upper=f(upper, ...), maxiter=20)
{
nrow <- length(f.lower)
bounds <- matrix(c(lower, upper), nrow, 2, byrow=TRUE)
for (i in seq_len(maxiter)) {
## move lower or upper bound to mid-point, preserving opposite signs
mid <- rowSums(bounds) / 2
updt <- ifelse(f(mid, ...) > 0, 0L, nrow) + seq_len(nrow)
bounds[updt] <- mid
}
rowSums(bounds) / 2
}
and then
> system.time(x2 <- with(df, {
+ f <- function(x, PB0, CF1, CF2, CF3)
+ PB0 + CF1/x + CF2/x^2 + CF3/x^3
+ bisect(f, c(1, 10), PB0, CF1, CF2, CF3)
+ }))
user system elapsed
0.180 0.000 0.181
> range(x1 - x2)
[1] -6.282406e-06 6.658593e-06
versus about 1.3s for application of uniroot separately to each. This also combined P and B0 into a single value ahead of time, since that is how they enter the equation.
The bounds on the final value are +/- diff(interval) * (.5 ^ maxiter) or so. A fancier implementation would replace bisection with linear or quadratic interpolation (as in the reference cited in ?uniroot), but then uniform efficient convergence (and in all cases error handling) would be more tricky to arrange.

Resources