Parallelising a for loop with R correctly

I've been working on a simple collection of functions for my supervisor that do some simple initial genome-scale stats, to give my team a quick indication of which future analyses may be worth more time - for example RDP4 or BioC (which is why I haven't just gone straight to Bioconductor). I'd like to speed some things up to allow larger contig sizes, so I've decided to use doParallel and foreach to rework some for loops. Below is one simple function which identifies sites in some sequences (stored as a matrix) where every base is identical and removes them.
strip.invar <- function(x) {
  cat("\nNow removing invariant sites from DNA data matrix, this may take some time...\n")
  prog <- txtProgressBar(min = 0, max = ncol(x), style = 3)
  removals <- c()
  for (i in 1:ncol(x)) {
    setTxtProgressBar(prog, i)
    if (length(unique(x[, i])) == 1) {
      removals <- append(removals, i)
    }
  }
  newDnaMatrix <- x[, -removals]
  return(newDnaMatrix)
}
After reading the introduction to doParallel and foreach, I tried to make a version that accommodates more cores - on my Mac this is 8: a quad core with two threads per core, i.e. 8 virtual cores:
strip.invar <- function(x, coresnum = detectCores()) {
  cat("\nNow removing invariant sites from DNA data matrix, this may take some time...\n")
  prog <- txtProgressBar(min = 0, max = ncol(x), style = 3)
  removals <- c()
  if (coresnum > 1) {
    cl <- makeCluster(coresnum)
    registerDoParallel(cl)
    foreach(i = 1:ncol(x)) %dopar% {
      setTxtProgressBar(prog, i)
      if (all(x[, i] == x[[1, i]])) {
        removals <- append(removals, i)
      }
    }
  } else {
    for (i in 1:ncol(x)) {
      setTxtProgressBar(prog, i)
      if (length(unique(x[, i])) == 1) {
        removals <- append(removals, i)
      }
    }
  }
  newDnaMatrix <- x[, -removals]
  return(newDnaMatrix)
}
However, if I run this with the number of cores set to 8, I'm not entirely sure it works - I can't see the progress bar doing anything, but then I've heard that printing to screen and anything involving graphics devices is tricky with parallel computing in R. It still seems to take some time and my laptop gets very hot, so I'm not sure whether I've done this correctly. I've based it on a few examples (I successfully ran a nice bootstrap example in the vignette), but I'm bound to hit learning bumps. As an aside, I'd also like people's opinion: what is the best speed-up for R code bottlenecks where loops or apply are involved - parallelising, or Rcpp?
Thanks.

My other answer was not correct, since the column mean being equal to the first value is not a sufficient test for the number of unique values. So here is another answer:
You can optimize the loop by using apply.
set.seed(42)
dat <- matrix(sample(1:5,1e5,replace=TRUE),nrow=2)
res1 <- strip.invar(dat)
strip.invar2 <- function(dat) {
  ix <- apply(dat, 2, function(x) length(unique(x)) > 1)
  dat[, ix]
}
res2 <- strip.invar2(dat)
all.equal(res1,res2)
#TRUE
library(microbenchmark)
microbenchmark(strip.invar(dat),strip.invar2(dat),times=10)
#Unit: milliseconds
# expr min lq median uq max neval
#strip.invar(dat) 2514.7995 2529.2827 2547.6751 2678.464 2792.405 10
#strip.invar2(dat) 933.3701 945.5689 974.7564 1008.589 1018.400 10
This improves performance quite a bit, though not as much as you could achieve if vectorization were possible.
Parallelization won't give better performance here, since each iteration does not take much time on its own, so the parallelization overhead would actually increase the total time. However, you could split the data and process larger chunks in parallel, as sketched below.
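For illustration, here is a minimal sketch of that chunked approach (the choice of 4 chunks and the use of isplitIndices from the itertools package are my own assumptions, not something benchmarked above); note that with a PSOCK cluster the whole dat matrix is still shipped to every worker:
library(doParallel)
library(itertools)

cl <- makeCluster(4)
registerDoParallel(cl)

# each task scans one chunk of columns sequentially and returns only the
# indices of the invariant columns found in that chunk
invariant <- foreach(ix = isplitIndices(ncol(dat), chunks = 4), .combine = c) %dopar% {
  ix[vapply(ix, function(i) length(unique(dat[, i])) == 1, logical(1))]
}
stopCluster(cl)

res3 <- if (length(invariant)) dat[, -invariant] else dat
Whether this beats strip.invar2 depends on how expensive each column check is relative to the cost of shipping dat to the workers.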

Firstly, try running cl <- makeCluster(coresnum - 1). The master R process already uses one of your cores to dispatch the slave jobs and receive their results, so you really have 7 free cores for the slave jobs. Otherwise I think you are effectively queuing one of your foreach tasks to wait until one of the previous tasks finishes, and therefore the job takes longer to complete.
Secondly, what you would normally see on the console when running this function in a non-parallel environment is still printed; it's just that each job's output is printed to its slave process's console, so you won't see it. You can, however, save the output from the different foreach iterations to a text file to examine it; stick the logging code inside your foreach statement.
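For instance, a minimal sketch (the log file names and the returned check are illustrative, not from the original example): each worker appends to its own log file, which you can watch from another terminal:
foreach(i = 1:ncol(x)) %dopar% {
  # one log file per worker process, so concurrent writes don't interleave
  cat(sprintf("column %d done\n", i),
      file = sprintf("worker_%d.log", Sys.getpid()), append = TRUE)
  length(unique(x[, i])) == 1
}
Alternatively, makeCluster() has an outfile argument (e.g. outfile = "workers.log") that redirects the workers' console output to a single file.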
Your laptop will get very hot because all of your cores are working at 100% capacity while you are running this job.
I have found the foreach package to be an excellent set of functions for simple parallel processing. Rcpp may (will?!) give you much greater performance, but how comfortable are you writing C++ code, what is the runtime of this function, and how often will it be used? I always think about these things first.

Related

foreach very slow with large number of values

I'm trying to use foreach to do parallel computations. It works fine if there are a small number of values to iterate over, but at some point it becomes incredibly slow. Here's a simple example:
library(foreach)
library(doParallel)
registerDoParallel(8)
out1 <- foreach(idx=1:1e6) %do% {
  1+1
}
out2 <- foreach(idx=1:1e6) %dopar% {
  1+1
}
out3 <- mclapply(1:1e6,
                 function(x) 1+1,
                 mc.cores=20)
out1 and out2 take an incredibly long time to run. Neither of them even spawns multiple threads for as long as I keep them running. out3 spawns the threads almost immediately and runs very quickly. Is foreach doing some sort of initial processing that doesn't scale well? If so, is there a simple fix? I really prefer the syntax of foreach.
I should also note that the actual code that I'm trying to parallelize is substantially more complicated than 1+1. I only show this as an example because even with this simple code foreach seems to be doing some pre-processing that is incredibly slow.
The foreach/doParallel vignette says (about code much smaller than yours):
Note well that this is not a practical use of doParallel. This is our
“Hello, world” program for parallel computing. It tests that
everything is installed and set up properly, but don’t expect it to
run faster than a sequential for loop, because it won’t! sqrt executes
far too quickly to be worth executing in parallel, even with a large
number of iterations. With small tasks, the overhead of scheduling the
task and returning the result can be greater than the time to execute
the task itself, resulting in poor performance. In addition, this
example doesn’t make use of the vector capabilities of sqrt, which it
must to get decent performance. This is just a test and a pedagogical
example, not a benchmark.
So it might simply be in the nature of your task that it is not faster.
Instead, try it without parallelization, first with sapply:
q <- sapply(1:1e6, function(x) 1 + 1 )
It does exactly the same as your example loops and is done in a second.
And now try this (it still does exactly the same thing, the same number of times), this time truly vectorized:
x <- rep(1, times=1e6)
r <- x + 1
It adds 1 to each of the 1e6 ones instantly. (The power of vectorization ...)
In my personal experience, the combination of foreach with doParallel is much slower than using the BiocParallel package from Bioconductor. (I am a bioinformatician, and in bioinformatics we very often have calculation-heavy tasks, since we have single data files of several gigabytes to process - and many of them.)
I tried your function using BiocParallel and it uses all assigned CPUs at 100% (checked by running htop during job execution); the entire thing took 17 seconds.
For sure - with your lightweight example, this applies:
the overhead of scheduling the task and returning the result
can be greater than the time to execute the task itself
Anyway, it seems to use the CPUs more thoroughly than doParallel. So use this if you have calculation-heavy tasks to get done.
Here is the code showing how I did it:
# For Bioconductor packages, the best is to install this:
install.packages("BiocManager")
# Then activate the installer
require(BiocManager)
# Now, with the `install()` function in this package, you can conveniently
# install Bioconductor packages like `BiocParallel`
install("BiocParallel")
# then, activate it
require(BiocParallel)
# initiate cores:
bpparam <- SnowParam(workers=4, type="SOCK") # 4, or take more CPUs
# prepare the function you want to parallelize
FUN <- function(x) { 1 + 1 }
# and now you can call the function using `bplapply()`,
# the loop-parallelizing function in BiocParallel.
# Each value of 1:1e6 is given to FUN; note that you have to pass the
# SOCK cluster parameters (bpparam) via BPPARAM for the parallelization.
s <- bplapply(1:1e6, FUN, BPPARAM=bpparam)
For more info, go to the vignette of the BiocParallel package.
Look at Bioconductor to see how many packages it provides, all of them well documented.
I hope this helps you with your future parallel computing work.

Parallel computing taking same or more time

I am trying to set up parallel computing in R for a large simulation, but I noticed that there is no improvement in time.
I tried a simple example:
library(foreach)
library(doParallel)
stime <- system.time(for (i in 1:10000) rnorm(10000))[3]
print(stime)
# 10.823

cl <- makeCluster(2)
registerDoParallel(cores=2)
stime <- system.time(ls <- foreach(s = 1:10000) %dopar% rnorm(10000))[3]
stopCluster(cl)
print(stime)
# 29.526
The system time is more than twice as much as it was in the original case without parallel computing.
Obviously I am doing something wrong but I cannot figure out what it is.
Performing many tiny tasks in parallel can be very inefficient. The standard solution is to use chunking:
ls <- foreach(s=1:2) %dopar% {
  for (i in 1:5000) rnorm(10000)
}
Instead of executing 10,000 tiny tasks in parallel, this loop executes two larger tasks, and runs almost twice as fast as the sequential version on my Linux machine.
Also note that your "foreach" example is actually sending a lot of data from the workers to the master. My "foreach" example throws that data away just like your sequential example, so I think it's a better comparison.
If you need to return a large amount of data then a fair comparison would be:
ls <- lapply(rep(10000, 10000), rnorm)
versus:
ls <- foreach(s=1:2, .combine='c') %dopar% {
  lapply(rep(10000, 5000), rnorm)
}
On my Linux machine the times are 8.6 seconds versus 7.0 seconds. That's not impressive due to the large communication to computation ratio, but it would have been much worse if I hadn't used chunking.

Efficient programming to overcome memory limit in R

I have a function that calculates an index in R for a matrix of binary data. The goal of this function is to calculate a person-fit index for binary response data called HT. It divides the covariance between the response vectors of two respondents (e.g. persons i and j) by the maximum possible covariance between the two response patterns, which can be calculated using the means of the response vectors (e.g. Bi). The function is:
fit <- function(Data) {
  N <- dim(Data)[1]
  L <- dim(Data)[2]
  r <- rowSums(Data)
  p.cor.n <- (r/L)                 # proportion correct for each response pattern
  sig.ij <- var(t(Data), t(Data))  # covariance of response patterns
  diag(sig.ij) <- 0
  H.num <- apply(sig.ij, 1, sum)
  H.denom1 <- matrix(p.cor.n, N, 1) %*% matrix(1 - p.cor.n, 1, N)  # Bi(1-Bj)
  H.denom2 <- matrix(1 - p.cor.n, N, 1) %*% matrix(p.cor.n, 1, N)  # (1-Bi)Bj
  H.denomm <- ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)
  diag(H.denomm) <- 0
  H.denom <- apply(H.denomm, 1, sum)
  HT <- H.num / H.denom
  return(HT)
}
This function works fine with small matrices (e.g. 1000 by 20), but when I increased the number of rows (e.g. to 10000) I ran into a memory limitation problem. The source of the problem is this line in the function:
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
which selects the denominator for each response pattern. Is there any other way to rewrite this line that demands less memory?
P.S.: you can try data<-matrix(rbinom(200000,1,.7),10000,20).
Thanks.
Well, here is one way you could shave a little time off. Overall I still think there might be a better theoretical answer in terms of the approach you take... but here goes. I wrote up an Rcpp function that specifically implements ifelse in the sense you use it above; it only works for square matrices like in your example. BTW, I wasn't really trying to optimize R's ifelse, because I'm pretty sure it already calls internal C functions - I was just curious whether a C++ function designed to do exactly what you are trying to do, and nothing more, would be faster. I shaved 11 seconds off. (This selects the smaller of the two values, matching your ifelse call.)
C++ Function:
library(Rcpp)
library(inline)
code <- "
Rcpp::NumericMatrix x(xs);
Rcpp::NumericMatrix y(ys);
Rcpp::NumericMatrix ans(x.nrow(), y.ncol());
int ii, jj;
for (ii = 0; ii < x.nrow(); ii++) {
  for (jj = 0; jj < x.ncol(); jj++) {
    if (x(ii, jj) < y(ii, jj)) {
      ans(ii, jj) = x(ii, jj);   // keep the smaller value, like the original ifelse
    } else {
      ans(ii, jj) = y(ii, jj);
    }
  }
}
return(ans);"
matIfelse <- cxxfunction(signature(xs="numeric", ys="numeric"),
                         plugin="Rcpp",
                         body=code)
Now if you replace ifelse in your function above with matIfelse you can give it a try. For example:
H.denomm <- matIfelse(H.denom1,H.denom2)
# Time for old version to run with the matrix you suggested above matrix(rbinom(200000,1,.7),10000,20)
# user system elapsed
# 37.78 3.36 41.30
# Time to run with dedicated Rcpp function
# user system elapsed
# 28.25 0.96 30.22
Not bad, roughly 36% faster. Again, though, I don't claim that this is generally faster than ifelse, just in this very specific instance. Cheers
P.s. I forgot to mention that to use Rcpp you need to have Rtools installed and during the install make sure environment path variables are added for Rtools and gcc. On my machine those would look like: c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin
Edit:
I just noticed that you were running into memory problems... I'm not sure whether you are running a 32- or 64-bit machine, but you probably just need to allow R to increase the amount of RAM it can use. I'll assume you are running 32-bit to be safe, so you should be able to let R take at least 2 GB of RAM. Give this a try: memory.limit(size=1900). The size is in megabytes, so I went for 1.9 GB just to be safe. I'd imagine this is plenty of memory for what you need.
Do you actually intend to do N x N independent ifelse(H.denom1 > H.denom2, ...) operations?
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
If you really do, look for a library or alternatively, a better decomposition.
If you told us in general terms what this code is trying to do, it would help us answer it.

Foreach & SNOW do not work on Windows

I want to use a foreach loop on a Windows machine to make use of multiple cores for a CPU-heavy computation. However, I cannot get the processes to do any work.
Here is a minimal example of what I think should work, but doesn't:
library(snow)
library(doSNOW)
library(foreach)

cl <- makeSOCKcluster(4)
registerDoSNOW(cl)

pois <- rpois(1e6, 1500) # draw 1e6 times from a Poisson with mean 1500
x <- foreach(i=1:1e6) %dopar% {
  runif(pois[i]) # draw from the uniform distribution pois[i] times
}
stopCluster(cl)
SNOW does create the 4 "slave" processes, but they don't do any work.
I hope this isn't a duplicate, but I cannot find anything with the search terms I can come up with.
It's probably working (at least it does on my Mac). However, one call to runif takes such a small amount of time that almost all of the time is spent on overhead, and the child processes spend negligible CPU time on the actual tasks.
x <- foreach(i=1:20) %dopar% {
  system.time(runif(pois[i]))
}
x[[1]]
#   user  system elapsed
#      0       0       0
Parallelization makes sense if you have some heavy computations that cannot be optimized. That's not the case in your example: you don't need 1e6 calls to runif, one would be sufficient (e.g., runif(sum(pois)) and then split the result), as sketched below.
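A minimal sketch of that suggestion (with a smaller n so the result fits comfortably in memory; the variable names are just illustrative):
pois <- rpois(1e4, 1500)
u <- runif(sum(pois))                      # one single call to runif
x <- split(u, rep(seq_along(pois), pois))  # a list with pois[i] values per element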
PS: Always test with a smaller example.
Although this particular example isn't worth executing in parallel, it's worth noting that since it uses doSNOW, the entire pois vector is auto-exported to all of the workers even though each worker only needs a fraction of it. However, you can avoid auto-exporting any data to the workers by iterating over pois itself:
x <- foreach(p=pois) %dopar% {
runif(p)
}
Now the elements of pois are sent to the workers in the tasks, so each worker only receives the data that's actually needed to perform its tasks. This technique isn't important when using doMC, since the doMC workers get pois for free.
You can also often improve performance enormously by processing pois in larger chunks using an iterator function such as "isplitVector" from the itertools package.
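For example, here is a sketch of that chunked variant (it assumes the doSNOW cluster registered above; the choice of 8 chunks is arbitrary). Each task receives a whole sub-vector of pois and returns a list of draws, so there are only 8 tasks instead of 1e6:
library(itertools)
x <- foreach(p = isplitVector(pois, chunks = 8), .combine = 'c') %dopar% {
  lapply(p, runif)   # one runif() call per element of the sub-vector
}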

How to avoid duplicating objects with foreach

I have a huge string vector and would like to do parallel computing using the foreach and doSNOW packages. I noticed that foreach makes copies of the vector for each process, and thus exhausts system memory quickly. I tried to break the vector into smaller pieces in a list object, but still don't see any reduction in memory usage. Does anyone have thoughts on this? Below is some demo code:
library(foreach)
library(doSNOW)
library(snow)

x <- rep('some string', 200000000)

# split x into smaller pieces in a list object
splits <- getsplits(x, mode='bysize', size=1000000)
tt <- vector('list', length(splits$start))
for (i in 1:length(tt)) tt[[i]] <- x[splits$start[i]:splits$end[i]]

ret <- foreach(i = 1:length(splits$start), .export=c('somefun'), .combine=c) %dopar% somefun(tt[[i]])
The style of iterating that you're using generally works well with the doMC backend because the workers can effectively share tt by the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though they only actually need a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
Here's a complete example using doMPI:
suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
ret <- foreach(s=isplitVector(x, chunkSize=chunkSize), .combine='c') %dopar% {
somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()
When I run this on my MacBook Pro with 4 GB of memory using
$ time mpirun -n 5 R --slave -f split.R
it takes about 16 seconds.
You have to be careful with the number of workers that you create on the same machine, although decreasing the value of chunkSize may allow you to start more.
You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
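A sketch of that file-based variant (the file name is illustrative, and it assumes the registered cluster, chunkSize and somefun from the example above):
ret <- foreach(s = ireadLines('strings.txt', n = chunkSize),
               .combine = 'c') %dopar% {
  somefun(s)
}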
