How can I optimize the CPU usage of R using snowfall?

I was trying to implement a Fibonacci function using the snowfall parallel package in R.
Following is the code I used:
vec <- 1:37
fib <- function(x) {
  if (x == 0) return(0)
  if (x == 1) return(1)
  if (x == 2) return(2)
  return(fib(x - 1) + fib(x - 2))
}
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)  # start 4 workers
sfExport("vec", "fib")             # ship the data and function to the workers
result <- sfLapply(vec, fib)
sfStop()
While the code was running, I observed the CPU usage. Although I asked for 4 cores, the machine was only ever using 2.
Does that mean my code isn't using all 4 cores? Can anyone offer some guidance? Can I optimize this performance?
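One likely culprit (my assumption; it cannot be confirmed from the post alone) is load imbalance rather than a misconfigured cluster: sfLapply hands each of the 4 workers one static block of vec, and since the cost of the recursive fib grows exponentially with x, the workers holding the small inputs finish almost instantly while the tail of the run is dominated by the one or two workers holding the largest inputs. A minimal sketch using snowfall's load-balanced variant, which hands out one element at a time to whichever worker is free:
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)
sfExport("fib")
# sfClusterApplyLB dispatches elements dynamically, so an idle worker
# immediately picks up the next (expensive) input
result <- sfClusterApplyLB(1:37, fib)
sfStop()
This won't help at the very end of the run, when fewer expensive tasks remain than workers, but it should keep all 4 cores busy for much longer than static chunking does.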

Related

Parallel processing in R using parallel package - not reproducible with different number of cores

I'm using the parallel package and mclapply() to run simulations in parallel in R, using R Programming for Data Science (Chapter 22, Section 22.4.1) as a reference.
I'm setting the seed as instructed; however, when I change the number of cores used in mclapply(), I get different results even with the same seed.
A simple reprex:
# USING 2 CORES
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) rnorm(1), mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(z)
# returns 0.143
# USING 3 CORES
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) rnorm(1), mc.cores = 3)
y <- do.call(rbind, x)
z <- mean(y)
print(z)
# returns 0.035
How can I set the seed such that changing the number of cores used doesn't change the result? I feel like this should be a fairly simple thing to do: maintaining reproducibility irrespective of the number of cores used.
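One commonly suggested approach (the doRNG package is my suggestion here, not something from the post) is to reserve one L'Ecuyer-CMRG stream per iteration instead of one per core, since one stream per core is exactly what makes the result depend on mc.cores. A minimal sketch:
library(doParallel)
library(doRNG)
registerDoParallel(cores = 2)  # changing this core count does not change the draws
set.seed(1)
x <- foreach(i = 1:100, .combine = c) %dorng% rnorm(1)
mean(x)
Because %dorng% derives an independent stream for each of the 100 iterations from the single seed, the mapping of iterations to workers no longer matters.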

Why does the processing time behave differently for these two functions when using parallel?

Imagine I have two functions: one is a simple mean of a sum of squares, and the other, a little more elaborate, computes a regression. I want to apply both to the rows of a "big" matrix or data frame.
In order to take advantage of multiple cores (on Windows) I tried the parallel package and got very different results for the two functions using the same sequence of commands.
For the apparently more complex function (the regression), the time reduction from using more cores is significant (here I show results from a PC with 3 cores and a PC with 12 cores; the behavior is similar with up to 11 cores, and the time reduction diminishes as more cores are added).
But for the "simple" function, the mean of squares, the execution time is very variable, almost erratic (also tested with up to 11 cores).
First, is there a reason why this is happening? Second, I imagine there are other ways to do this task; can you suggest any?
Here is the code to generate the plots:
library(parallel)
nc <- detectCores() - 1                                 # number of cores
myFun  <- function(z) coef(lm(rep(1, length(z)) ~ z))   # regression
myFun2 <- function(z) sum(z^2) / length(z)              # mean of squares
my.mat <- matrix(rnorm(1000000, .01, 0.4), ncol = 100)  # data
# using FUN = myFun, replicated 10 times
for (j in 1:10) {
  ncor <- 2:nc
  timed <- c()
  for (i in seq_along(ncor)) {
    cl <- makeCluster(mc <- getOption("cl.cores", ncor[i]))
    stime <- Sys.time()
    res <- parApply(cl = cl, X = my.mat, MARGIN = 1, FUN = myFun)
    timed[i] <- Sys.time() - stime
    stopCluster(cl)
  }
  # single core, no cluster
  stime <- Sys.time()
  res <- apply(my.mat, MARGIN = 1, FUN = myFun)
  tm <- Sys.time() - stime
  dr <- data.frame(nc = c(1, ncor), ts = as.numeric(c(tm, timed)))
  plot(dr, type = "l", col = 3, main = j)
  if (j == 1) fres1 <- dr else fres1 <- merge(fres1, dr, by = "nc")
}
plot(fres1[, 1:2], type = "l", col = 2, ylim = range(fres1[, -1]))
for (i in 3:11) lines(fres1[, 1], fres1[, i], col = i + 1)
# For the second plot use the same code but change FUN = myFun2
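A plausible explanation (my reading, not from the original post): parApply ships each row to a worker and collects the result back, and for myFun2 that round trip costs far more than the sum(z^2)/length(z) it computes, so the measured time is dominated by communication and scheduling noise. The regression is expensive enough per row to amortize the overhead. When the per-row work is trivial, a vectorized one-liner beats any cluster; a sketch using the my.mat defined above:
# mean of squares of each row, fully vectorized; no cluster, no overhead
res2 <- rowSums(my.mat^2) / ncol(my.mat)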

Parallel Processing Example in R

Firstly, I would like to say that I am new to this topic.
Secondly, although I have read a lot about parallel processing in R, I'm still not confident about it.
I just invented a simulation in R, so could someone use this invented code to help me understand parallel processing (so I can see how it works)?
My code is as follows (large random numbers):
SimulateFn <- function(B, n) {
  M1 <- list()
  for (i in 1:B) {
    M1[[i]] <- n^2
  }
  return(M1)
}
SimulateFn(100000000, 300000)
Could you please help me?
First of all, parallelization is the procedure of dividing a task into subtasks, which are processed simultaneously by multiple processors or cores. The subtasks can be independent or share some dependency; the latter case needs more planning and attention.
This procedure has some overhead to schedule the subtasks, such as copying data to each processor. Because of that, parallelization is worthless for fast computations. In your example, the three main operations are indexing ([), assignment (<-), and a (fast) math operation (^). The overhead of parallelization may be greater than the time needed to execute each subtask, so parallelization can actually result in poorer performance!
Despite that, simple parallelization in R is fairly easy. One approach to parallelizing your task is provided below, using the doParallel package. Other approaches include using packages such as parallel.
library(doParallel)
## choose number of processors/cores
cl <- makeCluster(2)
registerDoParallel(cl)
## use system.time to measure the elapsed time of each snippet
## %dopar% executes the code in parallel
B <- 100000; n <- 300000
ptime <- system.time({
  M1 <- list()
  foreach(i = 1:B) %dopar% {
    M1[i] <- n^2
  }
})
## %do% executes the same code sequentially
stime <- system.time({
  M1 <- list()
  foreach(i = 1:B) %do% {
    M1[i] <- n^2
  }
})
The elapsed times on my computer (2 cores) were 59.472 and 44.932 seconds, respectively. Clearly, there was no improvement from parallelization: indeed, performance was worse!
A better example is shown below, where the main task is much more computationally expensive:
x <- iris[which(iris[, 5] != "setosa"), c(1, 5)]
trials <- 10000
ptime <- system.time({
  r <- foreach(icount(trials), .combine = cbind) %dopar% {
    ind <- sample(100, 100, replace = TRUE)
    result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
    coefficients(result1)
  }
})
stime <- system.time({
  r <- foreach(icount(trials), .combine = cbind) %do% {
    ind <- sample(100, 100, replace = TRUE)
    result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
    coefficients(result1)
  }
})
And the elapsed times were 24.709 and 34.502 seconds: a gain of 28%.
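One housekeeping note, not part of the original timings: the cluster created with makeCluster(2) above is never shut down, so the workers should be released once the measurements are done:
# release the two worker processes started earlier
stopCluster(cl)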

How can I make a parallel operation faster than the serial version?

I'm attempting to "map" a function onto an array. However, when trying both simple and complex functions, the parallel version is always slower than the serial version. How can I improve the performance of a parallel computation in R?
Simple parallel example:
library(parallel)
# Number of elements
arrayLength <- 100
# Create data
input <- 1:arrayLength
# A simple computation
foo <- function(x, y) x^y - x^(y - 1)
# Add complexity
iterations <- 5 * 1000 * 1000
# Perform a complex computation on each element
compute <- function(x) {
  y <- x
  for (i in 1:iterations) {
    x <- foo(x, y)
  }
  return(x)
}
# Parallelized compute
computeParallel <- function(x) {
  # Create a cluster with 1 fewer cores than are available
  cl <- makeCluster(detectCores() - 1) # 8 - 1 cores
  # Send static vars & funcs to all workers
  clusterExport(cl, c('foo', 'iterations'))
  # Map
  out <- parSapply(cl, x, compute)
  # Clean up
  stopCluster(cl)
  return(out)
}
system.time(out <- compute(input))         # 12 seconds using 25% of CPU
system.time(out <- computeParallel(input)) # 160 seconds using 100% of CPU
The problem is that you traded all of the vectorization for parallelization, and that's a bad trade. You need to keep as much vectorization as possible to have any hope of getting an improvement from parallelization for this kind of problem.
The pvec function in the parallel package can be a good solution to this kind of problem, but it relies on forking, so it isn't supported on Windows. A more general solution that does work on Windows is to use foreach with the itertools package, which contains functions useful for iterating over various objects. Here's an example that uses the isplitVector function to create one subvector for each worker:
library(doParallel)
library(itertools)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
computeChunk <- function(x) {
  foreach(xc = isplitVector(x, chunks = getDoParWorkers()),
          .export = c('foo', 'iterations', 'compute'),
          .combine = 'c') %dopar% {
    compute(xc)
  }
}
This still may not compare very well to the pure vector version, but it should get better as the value of iterations increases. It may actually help to decrease the number of workers unless iterations is very large.
parSapply will run the function on each element of input separately, which means you give up the speed gained from writing foo and compute in a vectorized fashion.
pvec, by contrast, runs a vectorized function on multiple cores by splitting the input into chunks. Try this:
system.time(out <- pvec(input, compute, mc.cores=4))
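For completeness, since pvec is fork-based and unavailable on Windows: a similar chunk-per-worker scheme can be sketched with base parallel alone (splitIndices does the chunking; foo, iterations, compute, and input are as defined in the question):
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("foo", "iterations"))
# one chunk per worker, so compute() still runs vectorized within each chunk
chunks <- splitIndices(length(input), length(cl))
out <- do.call(c, parLapply(cl, lapply(chunks, function(ix) input[ix]), compute))
stopCluster(cl)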

Multicore programming: master/slave system using package `parallel`

This question is not answered by What is the easiest way to parallelize a vectorized function in R?, as I am not asking for an easy way, but for extra control over what the processes do and when they die, through a master/slave system.
Here is my problem: I have successfully used mcparallel and mccollect to parallelize some tasks along the following lines (X is a list):
p1 <- mcparallel( lapply( X[1:25], function(x) my.function(x, theta) ) )
p2 <- mcparallel( lapply( X[26:50], function(x) my.function(x, theta) ) )
p3 <- mcparallel( lapply( X[51:75], function(x) my.function(x, theta) ) )
x4 <- lapply(X[76:100], function(x) my.function(x, theta) )
c( mccollect(p1), mccollect(p2), mccollect(p3), x4 )
The elements of X are big, the parameter theta is small, and the aim is to perform optimization over theta. Note that mclapply(X, ...) performs very badly on my problem (almost no time gained). I also tried %dopar% from the foreach package: no time gained at all!
To reduce the overhead and avoid a new fork at each computation, I'd like to use a master/slave logic as exemplified in this Rmpi tutorial. I could feed the slaves new values of theta; this would avoid forking at each new computation, which (I guess) copies the whole memory each time. As theta is small, and so are the results of my.function, the communication between the processes would be fast, and this would save a substantial amount of time in the subsequent computations.
However, I am told that MPI is a protocol better suited to using several computers. I use a single multicore computer (16 cores), and I am told lighter protocols would be suitable.
Can you give me any advice? In particular, is it possible to implement a master/slave system on a multicore computer using the parallel package?
I sort of found a solution.
> library('parallel')
> makeCluster(2) -> cl
> # loading data to nodes:
> invisible(clusterApply( cl, 4:5, function(t) a <<- t ))
> # computations on these data with different arguments
> clusterApply( cl, 1:2, function(t) a+t )
[[1]]
[1] 5
[[2]]
[1] 7
> clusterApply( cl, 10:11, function(t) a+t )
[[1]]
[1] 14
[[2]]
[1] 16
> stopCluster(cl)
I think it will do what I want, but I am still waiting for other suggestions (hence I have not accepted my own answer).
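Extending that idea to the theta problem in the question (a sketch under assumptions: X, theta, and my.function are as described in the post; theta0 is a hypothetical starting value): push each worker's share of X into its global environment once, then ship only the small theta on every subsequent evaluation:
library(parallel)
cl <- makeCluster(4)
# distribute the big list X once; each worker stores its chunk globally
chunks <- clusterSplit(cl, X)
invisible(clusterApply(cl, chunks, function(ch) { Xc <<- ch; NULL }))
clusterExport(cl, "my.function")
# later calls transmit only the small parameter theta, never X
evaluate <- function(theta) {
  parts <- clusterCall(cl, function(th) lapply(Xc, my.function, th), theta)
  unlist(parts, recursive = FALSE)
}
res <- evaluate(theta0)  # theta0: hypothetical starting value for the optimization
stopCluster(cl)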
