foreach doparallel on GPU - r

I have this code for writing my results in parallel. I am using foreach and doParallel libraries in R.
output_location='/home/Desktop/pp/'
library(foreach)
library(doParallel)
library(data.table)
no_cores <- detectCores()
registerDoParallel(makeCluster(no_cores))
a=Sys.time()
foreach(i=1:100,.packages = c('foreach','doParallel')
,.options.multicore=mcoptions)%dopar%
{result<- my_functon(arg1,arg2)
write(result,file=paste(output_location,"out",toString(i),".csv"))
gc()
}
Now it uses 4 cores in the CPU and thus the writing takes very less time using this code.But i want foreach-doparallel using GPU. Is there any method for processing the foreach doParallel loop on GPU. gputools,gpuR are some GPU supporting R packages. But they are mainly for mathematical computations like gpuMatMult(),gpuMatrix() etc. I am looking for running the loop on GPU. Any help or guidance will be great.

Parallelization with foreach or similar tools works because you have multiple CPUs (or a CPU with multiple cores), which can process multiple tasks at once. A GPU also has multiple cores, but these are already used to process a single task in parallel. So if you want to parallelize further, you will need multiple GPUs.
However, keep in mind that GPUs are faster than CPUs only for certain types of applications. Matrix operations with large matrices being a prime example! See the performance section here for a recent comparison of one particular example. So it might make sense for you to consider if the GPU is the right tool for you.
In addition: File IO will always go via the CPU.

Related

Parallel computing with R in a SLURM cluster

I need to do a model estimation using the MCMC method on a SLURM cluster (the system is CentOS). The estimation takes a very long time to finish.
Within each MCMC interaction, there is one step taking a particularly long time. Since this step is doing a lapply-loop (around 100000 loops, 30s to finish all loops), so as far as I understand, I should be able to use parallel computing to speed up.
I tried several packages (doMC, doParallel, doSNOW) together with the foreach framework. The setup is
parallel_cores=8
#doParallel
library(doParallel)
cl<-makeCluster(parallel_cores)
registerDoParallel(cl)
#doMC
library(doMC)
registerDoMC(parallel_cores)
#doSNOW, this is also fast
library(doSNOW)
ml<-makeCluster( parallel_cores)
registerDoSNOW(cl)
#foreach framework
#data is a list
data2=foreach(
data_i=
data,
.packages=c("somePackage")
) %dopar% {
data_i=some_operation(data_i)
list(beta=data_i$beta,sigma=data_i$sigma)
}
Using a doMC, the time for this step can be reduced to about 9s. However, as doMC is using shared memory and I have a large array to store estimate results, I quickly ran out of memory (i.e. slurmstepd: error: Exceeded job memory limit).
Using doParallel and doSNOW, the time for this step even increased, to about 120s, which sounds ridiculous. The mysterious thing is that when I tested the code in both my Mac and Windows machines, doParallel and doSNOW actually gave similar speed compared to doMC.
I'm stuck and not sure how to proceed. Any suggestions will be greatly appreciated!

Unlimiting the CPU usage from R

Is there any way to unlimit the CPU usage so my PC puts more effort in finishing a task for rapidly? At the moment the k-means algorithm is estimated to finish in about 10 days, which is something I would like to reduce.
R is single-threaded by default, and runs only on a single thread on the CPU, which is a pity if you have a machine with 16 or 32 cores. By unlimiting the CPU usage, I have to assume you're asking if there's any way to have an R process (let's say part of the k-means algorithm) take advantage of your full CPU power by running the process in-parallel.
Many R packages and processes are not going to be helped by parallel processing though. So the technical solution to your particular problem goes down to the package implementation you're using. Popular packages like caret do support parallelization when that's possible, even though you may need to add an additional allowParallel=T parameter. They work in conjunction with a library such as doMC to allow multi-core processes. In the following sample code, I have my machine use 8 cores through the registerDoMC(8) function, and then set allowParallel=T.
library(doMC)
registerDoMC(8)
system.time({
ctrl_2 <- trainControl(method="cv", number=3, allowParallel=T)
fb_forest_2 <- train(classe ~ ., data=fb_train, method="rf", trControl = ctrl_2)
})
Again, parallel processing doesn't always help - Not all process can be parallelized! The documentation for foreach are a great read so if you can afford the time take a look at it. The specific code solution for your problem also depend on the library implementation you're using.

parallel r with foreach and mclapply at the same time

I am implementing a parallel processing system which will eventually be deployed on a cluster, but I'm having trouble working out how the various methods of parallel processing interact.
I need to use a for loop to run a big block of code, which contains several large list of matrices operations. To speed this up, I want to parallelise the for loop with a foreach(), and parallelise the list operations with mclapply.
example pseudocode:
cl<-makeCluster(2)
registerDoParallel(cl)
outputs <- foreach(k = 1:2, .packages = "various packages") {
l_output1 <- mclapply(l_input1, function, mc.cores = 2)
l_output2 <- mclapply(l_input2, function, mc.cores = 2)
return = mapply(cbind, l_output1, l_output2, SIMPLIFY=FALSE)
}
This seems to work. My questions are:
1) is it a reasonable approach? They seem to work together on my small scale tests, but it feels a bit kludgy.
2) how many cores/processors will it use at any given time? When I upscale it to a cluster, I will need to understand how much I can push this (the foreach only loops 7 times, but the mclapply lists are up to 70 or so big matrices). It appears to create 6 "cores" as written (presumably 2 for the foreach, and 2 for each mclapply.
I think it's a very reasonable approach on a cluster because it allows you to use multiple nodes while still using the more efficient mclapply across the cores of the individual nodes. It also allows you to do some of the post-processing on the workers (calling cbind in this case) which can significantly improve performance.
On a single machine, your example will create a total of 10 additional processes: two by makeCluster which each call mclapply twice (2 + 2(2 + 2)). However, only four of them should use any significant CPU time at a time. You could reduce that to eight processes by restructuring the functions called by mclapply so that you only need to call mclapply once in the foreach loop, which may be more efficient.
On multiple machines, you will create the same number of processes, but only two processes per node will use much CPU time at a time. Since they are spread out across multiple machines it should scale well.
Be aware that mclapply may not play nicely if you use an MPI cluster. MPI doesn't like you to fork processes, as mclapply does. It may just issue some stern warnings, but I've also seen other problems, so I'd suggest using a PSOCK cluster which uses ssh to launch the workers on the remote nodes rather than using MPI.
Update
It looks like there is a problem calling mclapply from cluster workers created by the "parallel" and "snow" packages. For more information, see my answer to a problem report.

Simple OpenCL example in R with R code?

Is it possible to use OpenCL but with R code? I still don't have a good understanding of OpenCL and GPU programming. For example, suppose I have the following R code:
aaa <- function(x) mean(rnorm(1000000))
sapply(1:10, aaa)
I like that I can kind of use mclapply as a dropin replacement for lapply. Is there a way to do that for OpenCL? Or to use OpenCL as a backend for mclapply? I'm guessing this is not possible because I have not been able to find an example, so I have two questions:
Is this possible and if so can you give a complete example using my function aaa above?
If this is not possible, can you please explain why? I do not know much about GPU programming. I view GPU just like CPUs, so why cannot I run R code in parallel?
I would start by looking at the High Performance Computing CRAN task view, in particular the Parallel computing: GPUs section.
There are a number of packages listed there which take advantage of GPGPU for specific tasks that lend themselves to massive parallelisation (e.g. gputools, HiPLARM). Most of these use NVIDIA's own CUDA rather than OpenCL.
There is also a more generic OpenCL package, but it requires you to learn how to write OpenCL code yourself, and merely provides an interface to that code from R.
It isn't possible because GPUs work differently than CPUs which means you can't give them the same instructions that you'd give a CPU.
Nvidia puts on a good show with this video of describing the difference between CPU and GPU processing. Essentially the difference is that GPUs typically have, by orders of magnitude, more cores than CPUs.
Your example is one that can be extended to GPU code because it is highly parallel.
Here's some code to create random numbers (although they aren't normally distributed) http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
Once you create the random numbers you could break them into chunks and then sum each of the chunks in parallel and then add the sums of the chunks to get the overall sum Is it possible to run the sum computation in parallel in OpenCL?
I realize that your code would make the random number vector and its sum in serial and parallel that operation 10 times but with GPU processing, having a mere 10 tasks isn't very efficient since you'd leave so many cores idle.

How to let R use all the cores of the computer?

I have read that R uses only a single CPU. How can I let R use all the available cores to run statistical algorithms?
Yes, for starters, see the High Performance Computing Task View on CRAN. This lists details of packages that can be used in support of parallel computing on a single machine.
From R version 2.14.0, there is inbuilt support for parallel computing via the parallel package, which includes slightly modified versions of the existing snow and multicore packages. The parallel package has a vignette that you should read. You can view it using:
vignette(package="parallel", topic = "parallel")
There are other ways to exploit your multiple cores, for example via use of a multi-threaded BLAS for linear algebra computations.
Whether any of this will speed up the "statistics calculations" you want to do will depend on what those "statistics calculations" are. Spawning off multiple threads or workers entails an overhead cost to set them up, manage them and collect the results. Some operations see a benefit (some large, some small) of using multiple cores/threads, others are slowed down because of this extra overhead.
In short, do not expect to get an n times decrease in your compute time by using n cores instead of just 1.
If you happen to do few* iterations of the same thing (or same code for few* different parameters), the easiest way to go is to run several copies of R -- OS will allocate the work on different cores.
In the opposite case, go and learn how to use real parallel extensions.
For the sake of this answer, few means less or equal the number of cores.

Resources