Poor speed gain when using `future` for parallelization in R

I find that the speed gain from using the future (and furrr) packages for parallelization in R is not satisfactory. In particular, the speed improvement is not close to linear. My machine has 4 cores, so I expected the speed-up to be roughly linear as long as the number of workers I specify does not exceed the number of cores available. However, that is not the case.
The following example illustrates the problem, where I draw 10^7 random numbers 500 times.
library(future)
library(furrr)

# Parameters
n <- 1e7
m <- 500

# Compute the mean of n random numbers (x is the map index and is not used)
rmean <- function(x, n) {
  rand.vec <- runif(n)
  rand.mean <- mean(rand.vec)
  return(rand.mean)
}

# Record the time used to compute the mean of n numbers, m times
rtime <- function(m, n) {
  t1 <- Sys.time()
  temp <- future_map(.x = 1:m,
                     .f = rmean,
                     n = n,
                     .options = furrr::furrr_options(seed = TRUE))
  t2 <- Sys.time()
  # Print the time used
  print(t2 - t1)
  return(temp)
}
# Print the time used for different number of workers
plan(multisession, workers = 1)
set.seed(1)
x <- rtime(m, n)
# Time difference of 2.503885 mins
plan(multisession, workers = 2)
set.seed(1)
x <- rtime(m, n)
# Time difference of 1.341357 mins
plan(multisession, workers = 3)
set.seed(1)
x <- rtime(m, n)
# Time difference of 57.25641 secs
plan(multisession, workers = 4)
set.seed(1)
x <- rtime(m, n)
# Time difference of 47.31929 secs
In the above example, the speed gains that I get are:
1.87x for 2 workers
2.62x for 3 workers
3.17x for 4 workers
These speed gains are not close to linear, especially with 4 workers. I thought this might be due to the overhead of the plan() call. However, the speed gain is similar if I run the procedure multiple times after setting the number of workers. This is illustrated as follows:
plan(multisession, workers = 3)
set.seed(1)
x <- rtime(m, n)
# Time difference of 58.07243 secs
set.seed(1)
x <- rtime(m, n)
# Time difference of 1.012799 mins
set.seed(1)
x <- rtime(m, n)
# Time difference of 57.96777 secs
I also tried the future_lapply function from the future.apply package instead of the future_map function from the furrr package, but its speed gain is similar as well. Therefore, I would appreciate any advice on what is going on here. Thank you!
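For what it's worth, these speed-ups are roughly what Amdahl's law predicts when a fixed fraction of the run behaves as serial overhead (launching workers, serializing results back to the main session, and so on). A minimal back-of-the-envelope sketch, fitting the parallel fraction p to the observed 2-worker speed-up:

# Amdahl's law: speed-up with w workers if a fraction p of the work parallelizes
amdahl <- function(p, w) 1 / ((1 - p) + p / w)

# Solve for p from the observed 1.87x speed-up with 2 workers
p_hat <- uniroot(function(p) amdahl(p, 2) - 1.87, c(0, 1))$root
p_hat             # about 0.93, i.e. roughly 7% of the runtime acts as serial overhead
amdahl(p_hat, 3)  # about 2.63x (observed: 2.62x)
amdahl(p_hat, 4)  # about 3.31x (observed: 3.17x, so slightly more overhead with all 4 cores busy)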

Related

Parallel processing in R using parallel package - not reproducible with different number of cores

I'm using the parallel package and mclapply() to run simulations in parallel in R, using R Programming for Data Science (Chapter 22, Section 22.4.1) as a reference.
I'm setting the seed as instructed; however, when I change the number of cores used in mclapply(), I get different results even with the same seed.
A simple reprex:
# USING 2 CORES
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
# returns 0.143
# USING 3 CORES
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) {rnorm(1)}, mc.cores = 3)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
# returns 0.035
How can I set the seed such that changing the number of cores used doesn't change the result? I feel like this should be a fairly simple thing to do - maintaining reproducibility irrespective of the number of cores used.
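One possible workaround (a sketch of a common approach, not taken from the referenced book): derive one L'Ecuyer stream per task with parallel::nextRNGStream(), so each iteration's random numbers depend on the task index rather than on whichever worker happens to run it:

library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(1)

# Pre-compute one independent RNG stream per task (not per worker)
streams <- vector("list", 100)
streams[[1]] <- .Random.seed
for (i in 2:100) streams[[i]] <- nextRNGStream(streams[[i - 1]])

x <- mclapply(1:100, function(i) {
  # Use the i-th stream regardless of which core runs this task
  assign(".Random.seed", streams[[i]], envir = .GlobalEnv)
  rnorm(1)
}, mc.cores = 2)

mean(unlist(x))  # the same value for mc.cores = 2, 3, ...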

Why does the processing time behave differently for these two functions when using parallel?

Imagine I have two functions: one is a simple mean of squares, and the other, a little more elaborate, computes a regression. I want to apply both to the rows of a "big" matrix or data frame.
In order to take advantage of multiple cores (on Windows) I tried the parallel package and got very different results for the two functions using the same sequence of commands.
For the apparently more complex function (the regression), the time reduction from using more cores is significant (here I show results from a PC with 3 cores and a PC with 12 cores; the behavior is similar with up to 11 cores, and the time reduction shrinks as more cores are added).
But for the "simple" function (mean of squares), the execution time is highly variable, almost erratic (also tested with up to 11 cores).
First, is there a reason why this happens? Second, I imagine there are other ways to do this task; can you suggest any?
Here is the code to generate the plots:
library(parallel)
nc <- detectCores() - 1                                 # number of cores
myFun  <- function(z) coef(lm(rep(1, length(z)) ~ z))   # regression
myFun2 <- function(z) sum(z^2) / length(z)              # mean of squares
my.mat <- matrix(rnorm(1000000, .01, 0.4), ncol = 100)  # data

# using FUN = myFun
# Replicate 10 times
for (j in 1:10) {
  ncor <- 2:nc
  timed <- c()
  for (i in seq_along(ncor)) {
    cl <- makeCluster(mc <- getOption("cl.cores", ncor[i]))
    stime <- Sys.time()
    res <- parApply(cl = cl, X = my.mat, MARGIN = 1, FUN = myFun)
    tm <- Sys.time() - stime
    timed[i] <- tm
    stopCluster(cl)
  }
  # no cores (sequential baseline)
  stime <- Sys.time()
  res <- apply(my.mat, MARGIN = 1, FUN = myFun)
  tm <- Sys.time() - stime
  (dr <- data.frame(nc = c(1, ncor), ts = as.numeric(c(tm, timed))))
  plot(dr, type = "l", col = 3, main = j)
  #stopCluster(cl)
  if (j == 1) fres1 <- dr else fres1 <- merge(fres1, dr, by = "nc")
}
plot(fres1[, 1:2], type = "l", col = 2, ylim = range(fres1[, -1]))
for (i in 3:11) lines(fres1[, i], col = i + 1)
# For the second plot use the same code but change FUN = myFun2
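Regarding the second question, one low-tech alternative for the "simple" function is to skip parallelism entirely: a vectorised call has no per-task communication at all, which is exactly the overhead that makes the parApply timings erratic for cheap work. A short sketch, reusing my.mat and myFun2 from above:

# Vectorised mean of squares per row
res_simple <- rowSums(my.mat^2) / ncol(my.mat)

# Quick check against the original per-row function
all.equal(res_simple, apply(my.mat, 1, myFun2))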

Pairwise Dijkstra's with early termination according to hop-count in R

I am looking for the most computational and memory friendly approach to computing particular entries of the distance matrix D obtained by pairwise Dijkstra's algorithm in R. More precisely, I only need D[i,j] if the hop-count (unweighted) distance between node i and node j is at most a particular integer k (D[i,j] itself may computed as a weighted shortest path length for which the number of hops may be greater than k). D should be encoded as a sparse matrix for memory efficiency.
I was wondering if there has been some work done on this or if there is an efficient approach towards optimizing the current igraph functions to account for this restriction. E.g., early exit in pairwise Dijkstra's algorithm could really improve the efficiency of solving my problem.
I have tried to make this as efficient as possible myself, but with no luck so far. A first attempt is illustrated below.
library(igraph)
library(Matrix)
library(spam)

# Hoped to be the more efficient one
bounded_hop_pairG_1 <- function(G, k = 2) {
  to <- ego(G, order = k)
  D <- sparseMatrix(i = unlist(lapply(1:length(V(G)), function(v) rep(v, length(to[[v]])))),
                    j = unlist(to),
                    x = unlist(lapply(1:length(V(G)), function(v) distances(G, v = v, to = to[[v]]))))
  return(D)
}

# Expected to be the less efficient one
bounded_hop_pairG_2 <- function(G, k = 2) {
  D <- distances(G)
  D[distances(G, weights = NA) > k] <- 0   # unweighted (hop-count) distances
  return(as.spam(D))
}
# Sample graph
set.seed(42)
G <- sample_bipartite(500, 500, p=0.1)
E(G)$weight <- runif(length(E(G)))
# Check whether 'distances' actually implements early termination
start_time <- Sys.time()
d1 <- distances(G, v=1)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 0.00497961 secs
start_time <- Sys.time()
d2 <- distances(G, v=1, to=521)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 0.002238274 secs (consistently smaller than above)
start_time <- Sys.time()
D1 <- bounded_hop_pairG_1(G)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 2.671333 secs
start_time <- Sys.time()
D2 <- bounded_hop_pairG_2(G)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 1.101419 secs
Although I expected my first function to benefit from early termination, and it never stores the full pairwise distance matrix, it appears to be much slower than my second function (which also performs a full unweighted pairwise distance computation). Hence, I was hoping somebody could point out the most efficient way to implement the first function in R.
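One thing worth checking before optimizing further is how much of the first function's cost is simply the per-call overhead of issuing one distances() call per source vertex, rather than the shortest-path work itself. A small diagnostic sketch, using the same G as above:

# 1000 tiny calls: one distances() call per source vertex, with a trivial target
system.time(for (v in seq_len(vcount(G))) distances(G, v = v, to = v))

# One call computing the full distance matrix at once
system.time(distances(G))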
You could try the cppRouting package, available via GitHub.
It provides functions like get_distance_matrix(), which can use all cores.
library(cppRouting)
library(igraph)
library(spam)
library(Matrix)

# Sample graph
set.seed(42)
G <- sample_bipartite(500, 500, p = 0.1)
E(G)$weight <- runif(length(E(G)))

# Graph to data frame
G2 <- as_long_data_frame(G)

# Weighted graph
graph1 <- makegraph(G2[, 1:3], directed = F)
# Unweighted graph
graph2 <- makegraph(cbind(G2[, 1:2], rep(1, nrow(G2))), directed = F)

nodes <- sort(unique(c(G2$from, G2$to)))

myfunc <- function(Gr1, Gr2, nd, k = 2, cores = FALSE) {
  test  <- get_distance_matrix(Gr1, nd, nd, allcores = cores)  # weighted distances
  test2 <- get_distance_matrix(Gr2, nd, nd, allcores = cores)  # hop counts
  test[test2 > k] <- 0
  return(as.spam(test))
}
# Your first function
system.time(
  D1 <- bounded_hop_pairG_1(G)
)
# 2.18s

# Your second function
system.time(
  D2 <- bounded_hop_pairG_2(G)
)
# 1.01s

# One core
system.time(
  D3 <- myfunc(graph1, graph2, nodes)
)
# 0.69s

# Parallel
system.time(
  D4 <- myfunc(graph1, graph2, nodes, cores = TRUE)
)
# 0.32s
If you really want to stop the algorithm once the hop limit k is reached and you have a little knowledge of C++, it seems rather simple to slightly modify the original Dijkstra algorithm and then use it via Rcpp.
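If memory rather than raw speed is the main constraint, another option to explore is cppRouting's get_distance_pair(), which computes weighted distances only for a given list of origin-destination pairs. The following is a rough, untested sketch combining it with the hop-limited neighborhoods from ego(); the from/to argument names and the assumption that igraph vertex indices match the node IDs passed to makegraph() above should be checked against the package documentation:

# Hop-limited pairs from igraph (k = 2)
k <- 2
nbhd <- ego(G, order = k)
from_id <- rep(seq_along(nbhd), lengths(nbhd))
to_id <- as.integer(unlist(nbhd))

# Weighted distances for only those pairs (one value per from/to pair)
d <- get_distance_pair(graph1, from = from_id, to = to_id)

# Assemble the sparse result, as in bounded_hop_pairG_1()
D5 <- sparseMatrix(i = from_id, j = to_id, x = d)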

R parallel - clusters inside a cluster

Clusterception
I need to run an expensive algorithm 320 times. I can easily parallelize it over 4 clusters, each running 80 iterations. I have access to a machine with 32 cores, so I want to parallelize the problem further. Parallelizing the groups of 80 iterations is trickier, but possible. My idea is to run 8 sub-clusters on each main cluster, each processing 10 iterations.
To test the idea, I have implemented the following dummy code using the parallel package in R. I have tested it on 4 cores and was surprised by the results.
Normal variables generator
Method 1: Parallelize once over 2 clusters
library(parallel)
library(tictoc)
N <- 6*10^8 # total number of observations to generate
n.threads <- 2
sample.sizes <- rep(round(N/n.threads, 0), n.threads) # Each cluster generates half of the sample
tic()
cl <- makeCluster(n.threads)
list <- parLapply(cl, sample.sizes, rnorm)
stopCluster(cl)
v <- unlist(list)
toc()
rm(list, v); gc()
36 sec exec time; 50% CPU usage
Method 2: Parallelize first over 2 clusters, and each cluster is further parallelized over 2 clusters
library(parallel)
library(tictoc)

N <- 6*10^8 # total number of observations to generate

rnorm.parallel <- function(N.inside){
  n.threads.inside <- 2
  # each sub-cluster generates 1.5*10^8 obs
  sample.sizes <- rep(round(N.inside/n.threads.inside, 0), n.threads.inside)
  cl.inside <- makeCluster(n.threads.inside)
  list <- parLapply(cl.inside, sample.sizes, rnorm)
  stopCluster(cl.inside)
  v <- unlist(list)
  return(v)
}

n.threads <- 2
sample.sizes <- rep(round(N/n.threads, 0), n.threads) # each main cluster generates 3*10^8 obs
tic()
cl <- makeCluster(length(sample.sizes))
clusterEvalQ(cl, library(parallel))
list <- parLapply(cl, sample.sizes, rnorm.parallel)
stopCluster(cl)
v <- unlist(list)
toc()
rm(list, v); gc()
42 sec exec time; 100% CPU usage
My conclusion is that although the technique of running clusters inside a cluster works, the additional overhead makes it less efficient. Is there another package/technique that could help me?
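If the 320 iterations are independent of each other, a simpler alternative is to flatten the problem: start a single 32-worker cluster once and hand it all 320 iterations directly, so there is only one layer of start-up and communication overhead. A sketch, where expensive_algorithm is a placeholder for the real task:

library(parallel)

n.workers <- 32
cl <- makeCluster(n.workers)
# Export whatever objects/functions the real task needs, e.g.
# clusterExport(cl, c("expensive_algorithm", "shared_data"))

res <- parLapply(cl, 1:320, function(i) {
  # expensive_algorithm(i)  # placeholder for the real expensive computation
  i
})
stopCluster(cl)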

Parallel Monte Carlo Simulation in R using snowfall

I am trying to compare up to thousands of estimated beta distributions. Each beta distribution is characterized by its two shape parameters alpha and beta.
I draw 100,000 samples from every distribution. As a final result I want an ordering of the distributions according to which has the highest probability in each sample draw.
My first approach was to use lapply to generate a matrix of N * NDRAWS numeric values, which consumed too much memory once N went beyond 10,000 (10,000 * 100,000 * 8 bytes).
So I decided to use a sequential approach: order every single draw, accumulate the orders over all draws, and take the final order, as shown in the example below:
set.seed(12345)
N <- 100
NDRAWS <- 100000
df <- data.frame(alpha = sample(1:20, N, replace = T),
                 beta = sample(1:200, N, replace = T))
vec <- vector(mode = "integer", length = N)

for (i in 1:NDRAWS) {
  # order probabilities after a single draw for every theta
  pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
  # sum up winning positions for every theta
  vec[pos] <- vec[pos] + 1:N
}
# order thetas
ord <- order(-vec)
df[ord, ]
This consumes only N * 4 bytes of memory, as there is no giant matrix, just a single vector of length N. My question now is: how can I speed up this operation using snowfall (or any other multicore package) by taking advantage of my 4 CPU cores, instead of using just one core?
# parallelize using snowfall pckg
library(snowfall)
sfInit( parallel=TRUE, cpus=4, type="SOCK")
sfLapply( 1:NDRAWS, function(x) ?????? )
sfStop()
Any help is appreciated!
This can be parallelized in the same way that one would parallelize random forest or bootstrapping. You just perform the sequential code on each of the workers but with each using a smaller number of iterations. That is much more efficient than splitting each iteration of the for loop into a separate parallel task.
Here's your complete example converted to use the foreach package with the doParallel backend:
set.seed(12345)
N <- 100
NDRAWS <- 100000
df <- data.frame(alpha = sample(1:20, N, replace = T),
                 beta = sample(1:200, N, replace = T))

library(doParallel)
nworkers <- detectCores()
cl <- makePSOCKcluster(nworkers)
clusterSetRNGStream(cl, c(1, 2, 3, 4, 5, 6, 7))
registerDoParallel(cl)

vec <- foreach(ndraws = rep(ceiling(NDRAWS / nworkers), nworkers),
               .combine = '+') %dopar% {
  v <- integer(N)
  for (i in 1:ndraws) {
    pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
    v[pos] <- v[pos] + 1:N
  }
  v
}
ord <- order(-vec)
df[ord, ]
Note that this gives different results than the sequential version because different random numbers are generated by the workers. I used the parallel random number support provided by the parallel package since that is good practice.
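Since the question specifically asks about snowfall, here is a sketch of the same chunking idea written with snowfall itself (assuming the rlecuyer package is installed so that sfClusterSetupRNG() can set up independent streams):

library(snowfall)

sfInit(parallel = TRUE, cpus = 4, type = "SOCK")
sfExport("df", "N")
sfClusterSetupRNG()  # independent random number streams on the workers

# Each worker runs a quarter of the draws and returns its partial score vector
partial <- sfLapply(rep(NDRAWS / 4, 4), function(ndraws) {
  v <- integer(N)
  for (i in 1:ndraws) {
    pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
    v[pos] <- v[pos] + 1:N
  }
  v
})
sfStop()

vec <- Reduce(`+`, partial)
ord <- order(-vec)
df[ord, ]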
Well, the functionality is there. I'm not sure though what you'd be returning with each iteration.
Perhaps try this?
myFunc <- function(xx, N) {
  pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
  # Return only this draw's contribution; the running total is built by .combine = '+'
  v <- integer(N)
  v[pos] <- 1:N
  v
}
Using doParallel will allow you to add results:
require(doParallel)
registerDoParallel(cores = 4)
vec <- foreach(i = 1:NDRAWS, .combine = '+') %dopar% myFunc(i, N)  # then order with order(-vec) as before
