Simple OpenCL example in R with R code? - r

Is it possible to use OpenCL but with R code? I still don't have a good understanding of OpenCL and GPU programming. For example, suppose I have the following R code:
aaa <- function(x) mean(rnorm(1000000))
sapply(1:10, aaa)
I like that I can kind of use mclapply as a dropin replacement for lapply. Is there a way to do that for OpenCL? Or to use OpenCL as a backend for mclapply? I'm guessing this is not possible because I have not been able to find an example, so I have two questions:
Is this possible and if so can you give a complete example using my function aaa above?
If this is not possible, can you please explain why? I do not know much about GPU programming. I view GPU just like CPUs, so why cannot I run R code in parallel?

I would start by looking at the High Performance Computing CRAN task view, in particular the Parallel computing: GPUs section.
There are a number of packages listed there which take advantage of GPGPU for specific tasks that lend themselves to massive parallelisation (e.g. gputools, HiPLARM). Most of these use NVIDIA's own CUDA rather than OpenCL.
There is also a more generic OpenCL package, but it requires you to learn how to write OpenCL code yourself, and merely provides an interface to that code from R.

It isn't possible because GPUs work differently than CPUs which means you can't give them the same instructions that you'd give a CPU.
Nvidia puts on a good show with this video of describing the difference between CPU and GPU processing. Essentially the difference is that GPUs typically have, by orders of magnitude, more cores than CPUs.
Your example is one that can be extended to GPU code because it is highly parallel.
Here's some code to create random numbers (although they aren't normally distributed) http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
Once you create the random numbers you could break them into chunks and then sum each of the chunks in parallel and then add the sums of the chunks to get the overall sum Is it possible to run the sum computation in parallel in OpenCL?
I realize that your code would make the random number vector and its sum in serial and parallel that operation 10 times but with GPU processing, having a mere 10 tasks isn't very efficient since you'd leave so many cores idle.

Related

foreach doparallel on GPU

I have this code for writing my results in parallel. I am using foreach and doParallel libraries in R.
output_location='/home/Desktop/pp/'
library(foreach)
library(doParallel)
library(data.table)
no_cores <- detectCores()
registerDoParallel(makeCluster(no_cores))
a=Sys.time()
foreach(i=1:100,.packages = c('foreach','doParallel')
,.options.multicore=mcoptions)%dopar%
{result<- my_functon(arg1,arg2)
write(result,file=paste(output_location,"out",toString(i),".csv"))
gc()
}
Now it uses 4 cores in the CPU and thus the writing takes very less time using this code.But i want foreach-doparallel using GPU. Is there any method for processing the foreach doParallel loop on GPU. gputools,gpuR are some GPU supporting R packages. But they are mainly for mathematical computations like gpuMatMult(),gpuMatrix() etc. I am looking for running the loop on GPU. Any help or guidance will be great.
Parallelization with foreach or similar tools works because you have multiple CPUs (or a CPU with multiple cores), which can process multiple tasks at once. A GPU also has multiple cores, but these are already used to process a single task in parallel. So if you want to parallelize further, you will need multiple GPUs.
However, keep in mind that GPUs are faster than CPUs only for certain types of applications. Matrix operations with large matrices being a prime example! See the performance section here for a recent comparison of one particular example. So it might make sense for you to consider if the GPU is the right tool for you.
In addition: File IO will always go via the CPU.

OpenCL BLAS in julia

I am really amazed by the julia language, having implemented lots of machine learning algorithms there for my current project. Even though julia 0.2 manages to get some great results out of my 2011 MBA outperforming all other solutuions on similar linux hardware (due to vecLib blas, i suppose), I would certainly like more. I am in a process of buying radeon 5870 and would like to push my matrix operations there. I use basically only simple BLAS operations such as matmul, additios and transpositions. I use julia's compact syntax A' * B + C and would certainly like to keep it.
Is there any way (or pending milestone) I can get those basic operations execute on GPU? I have like 2500x2500 single precision matrices so I expect significant speedup.
I don't believe that GPU integration into the core of Julia is planned at this time. One of the key issues is that there is substantial overhead moving the data to and from the GPU, making a drop-in replacement for BLAS operations infeasible.
I expect that most of the progress in this area will actually come from the package ecosystem, in particular packages under the JuliaGPU organization. I see there is a CLBLAS package in there.

Advice about inversion of large sparse matrices

Just got a Windows box set up with two 64 bit Intel Xeon X5680 3.33 GHz processors (6 cores each) and 12 GB of RAM. I've been using SAS on some large data sets, but it's just too slow, so I want to set up R to do parallel processing. I want to be able to carry out matrix operations, e.g., multiplication and inversion. Most of my data are not huge, 3-4 GB range, but one file is around 50 GB. It's been a while since I used R, so I looked around on the web, including the CRAN HPC, to see what was available. I think a foreach loop and the bigmemory package will be applicable. I came across this post: Is there a package for parallel matrix inversion in R that had some interesting suggestions. I was wondering if anyone has experience with the HIPLAR packages. Looks like hiparlm adds functionality to the matrix package and hiplarb add new functions altogether. Which of these would be recommended for my application? Furthermore, there is a reference to the PLASMA library. Is this of any help? My matrices have a lot of zeros, so I think they could be considered sparse. I didn't see any examples of how to pass data fro R to PLASMA, and looking at the PLASMA docs, it says it does not support sparse matrices, so I'm thinking that I don't need this library. Am I on the right track here? Any suggestions on other approaches?
EDIT: It looks like HIPLAR and package pbdr will not be helpful. I'm leaning more toward bigmemory, although it looks like I/O may be a problem: http://files.meetup.com/1781511/bigmemoryRandLinearAlgebra_BryanLewis.pdf. This article talks about a package vam for virtual associative matrices, but it must be proprietary. Would package ff be of any help here? My R skills are just not current enough to know what direction to pursue. Pretty sure I can read this using bigmemory, but not sure the processing will be very fast.
If you want to use HiPLAR (MAGMA and PLASMA libraries in R), it is only available for Linux at the moment. For this and many other things, I suggest switching your OS to the penguin.
That being said, Intel MKL optimization can do wonders for these sort of operations. For most practical uses, it is the way to go. Python built with MKL optimization for example can process large matrices about 20x faster than IDL, which was designed specifically for image processing. R has similarly shown vast improvements when built with MKL optimization. You can also install R Open from Revolution Analytics, which includes MKL optimization, but I am not sure that it has quite the same effect as building it yourself using Intel tools: https://software.intel.com/en-us/articles/build-r-301-with-intel-c-compiler-and-intel-mkl-on-linux
I would definitely consider the type of operations one is looking to perform. GPU processes are those that lend well to high parallelism (many of the same little computations running at once, as with matrix algebra), but they are limited by bus speeds. Intel MKL optimization is similar in that it can help use all of your CPU cores, but it is really optimized to Intel CPU architecture. Hence, it should provide basic memory optimization too. I think that is the simplest route. HiPLAR is certainly the future, as it is CPU-GPU by design, especially with highly parallel heterogeneous architectures making their way into consumer systems. Most consumer systems today cannot fully utilize this though I think.
Cheers,
Adam

MPI and OpenMP. Do I even have a choice?

I have a linear algebra code that I am trying get to run faster. Its a iterative algorithm with a loop and matrix vector multiplications within in.
So far, I have used MATMUL (Fortran Lib.), DGEMV, Tried writing my own MV code in OpenMP but the algorithm is doing no better in terms of scalability. Speed ups are barely 3.5 - 4 irrespective of how many processors I am allotting to it (I have tried up 64 processors).
The profiling shows significant time being spent in Matrix-Vector and the rest is fairly nominal.
My question is:
I have a shared memory system with tons of RAM and processors. I have tried tweaking OpenMP implementation of the code (including Matrix Vector) but has not helped. Will it help to code in MPI? I am not a pro at MPI but the ability to fine tune the message communication might help a bit but I can't be sure. Any comments?
More generally, from the literature I have read, MPI = Distributed, OpenMP = Shared but can they perform well in the others' territory? Like MPI in Shared? Will it work? Will it be better than the OpenMP implementation if done well?
You're best off just using a linear algebra package that is already well optimized for a multitcore environment and using that for your matrix-vector multiplication. The Atlas package, gotoblas (if you have a nehalem or older; sadly it's no longer being updated), or vendor BLAS implementations (like MKL for intel CPUs, ACML for AMD, or VecLib for apple, which all cost money) all have good, well-tuned, multithreaded BLAS implementations. Unless you have excellent reason to believe that you can do better than those full time development teams can, you're best off using them.
Note that you'll never get the parallel speedup with DGEMV that you do with DGEMM, just because the vector is smaller than another matrix and so there's less work; but you can still do quite well, and you'll find you get much better perforamance with these libraries than you do with anything hand-rolled unless you were already doing multi-level cache blocking.
You can use MPI in a shared environment (though not OpenMP in a distributed one). However, achieving a good speedup depends a lot more on your algorithms and data dependencies than the technology used. Since you have a lot of shared memory, I'd recommend you stick with OpenMP, and carefully examine whether you're making the best use of your resources.

How to let R use all the cores of the computer?

I have read that R uses only a single CPU. How can I let R use all the available cores to run statistical algorithms?
Yes, for starters, see the High Performance Computing Task View on CRAN. This lists details of packages that can be used in support of parallel computing on a single machine.
From R version 2.14.0, there is inbuilt support for parallel computing via the parallel package, which includes slightly modified versions of the existing snow and multicore packages. The parallel package has a vignette that you should read. You can view it using:
vignette(package="parallel", topic = "parallel")
There are other ways to exploit your multiple cores, for example via use of a multi-threaded BLAS for linear algebra computations.
Whether any of this will speed up the "statistics calculations" you want to do will depend on what those "statistics calculations" are. Spawning off multiple threads or workers entails an overhead cost to set them up, manage them and collect the results. Some operations see a benefit (some large, some small) of using multiple cores/threads, others are slowed down because of this extra overhead.
In short, do not expect to get an n times decrease in your compute time by using n cores instead of just 1.
If you happen to do few* iterations of the same thing (or same code for few* different parameters), the easiest way to go is to run several copies of R -- OS will allocate the work on different cores.
In the opposite case, go and learn how to use real parallel extensions.
For the sake of this answer, few means less or equal the number of cores.

Resources