I have to perform a lot of element-wise arithmetic and special-function operations on a large numeric matrix (e.g. 100,000 x 20,000) in R, including:
exp, log, lbeta, digamma, plogis (logistic function: 1/[1+exp(-a*x)]), and /, *, +, -
Looking into the C source code of R's math library, I see essentially no #pragma omp compiler directives to take advantage of shared-memory multi-threaded parallelism. Specialized packages for fast math operations in R like Rfast do not have #pragma omp directives for multi-threaded special functions either.
Similarly, as far as I understand, matrix algebra libraries like BLAS and LAPACK are for matrix operations (e.g. factorizations), not element-wise operations.
I am hoping to run the R script on AWS to speed up the operations using e.g. 72 or 96 core servers.
Do I really have to make my own copies of the R math library C source code just to add a few #pragma omp directives to take advantage of OpenMP parallelism in R?
I can't believe simple parallelism like this does not already exist in base R or in a simple math library package.
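For scale, here is a minimal sketch of the kind of element-wise work involved and the only stopgap I can see so far: chunking the columns and pushing them through parallel::mclapply (fork-based, so Unix-like OSes only; sizes scaled down for testing):
library(parallel)
# scaled-down stand-in for the real 100,000 x 20,000 matrix
n_rows <- 1e4
n_cols <- 2e3
X <- matrix(rexp(n_rows * n_cols), n_rows, n_cols)
a <- 2
# split the columns into one block per core and apply the element-wise
# special functions to each block in a forked worker
col_blocks <- split(seq_len(n_cols), cut(seq_len(n_cols), 8, labels = FALSE))
res_list <- mclapply(col_blocks, function(j) {
  plogis(a * X[, j, drop = FALSE]) + digamma(X[, j, drop = FALSE])
}, mc.cores = 8)
res <- do.call(cbind, res_list)  # reassemble in the original column order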
When installing R you can choose between plain R and distributions that bundle the MKL libraries, such as Microsoft R.
There are other distributions such as Oracle R, or you can compile R yourself against other BLAS libraries.
MKL is supposed to increase the speed of matrix algebraic operations like Matrix Multiplication, Cholesky Factorization, Singular Value Decomposition or Principal Components Analysis.
Does it offer any other advantages not related to matrix algebra?
Would I benefit from MKL if I'm not explicitly using any of these operations?
For example if I'm selecting rows and computing averages by group?
Does R use them internally even if the user doesn't notice it?
For example if I'm using the package lme4 does it use MKL internally?
Yes. Using Intel® MKL with R does offer many other advantages apart from matrix algebra. I guess you have already seen this link on how to use MKL with R.
You might notice a significant difference while running heavy jobs; a quick benchmark sketch follows the feature list below.
Intel® MKL's features include [Source]:
Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family
Uses industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions—no code changes required
Dispatches optimized code for each processor automatically without the need to branch code
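As a rough sketch of how you might measure this yourself (sizes and the grouped-mean example are illustrative; timings depend entirely on your machine and on which BLAS your R build is linked to):
set.seed(1)
n <- 2000L
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)                            # dense multiply -> DGEMM
system.time(crossprod(A))                       # t(A) %*% A     -> BLAS
system.time(chol(crossprod(A) + n * diag(n)))   # Cholesky       -> LAPACK
# work that never reaches BLAS/LAPACK, e.g. grouped means, is unlikely to change much
g <- sample(letters, n, replace = TRUE)
system.time(tapply(A[, 1], g, mean))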
Given a dataframe like this in R:
+---+---+
| X | Y |
+---+---+
| 1 | 2 |
| 2 | 4 |
| 4 | 5 |
+---+---+
If a vectorized operation is performed on this dataframe, like so:
data$Z <- data$X * data$Y
Will this leverage the processor's single-instruction-multiple-data (SIMD) capabilities to optimize the performance? This seems like the perfect case for it, but I cannot find anything that confirms my hunch.
I just got a "good answer" badge two years after my initial answer. Thanks for acknowledging the quality of this answer. In return, I will substantially enrich the original content. This answer / article is now intended for any R user that wants to try out SIMD. It has the following sections:
Background: what is SIMD?
Summary: how can we leverage SIMD for our assembly or compiled code?
R side: how can R possibly use SIMD?
Performance: is vector code always faster than scalar code?
Writing R extensions: write compiled code with OpenMP SIMD
Background: what is SIMD?
Many R programmers may not know SIMD if they don't write assembly or compiled code.
SIMD (single instruction, multiple data) is a data-level parallel processing technology with a very long history. Before personal computers were around, SIMD unambiguously referred to the vector processing in a vector processor, and was the main route to high-performance computing. When personal computers later came onto the market, they had no features resembling those of a vector processor. However, as the demand for processing multimedia data grew higher and higher, they began to get vector registers and corresponding sets of vector instructions to use those registers for vector data load, vector arithmetic and vector data store. The capacity of vector registers keeps getting bigger, and the functionality of vector instruction sets is also increasingly versatile. Today they can do streaming load / store, strided load / store, scattered load / store, vector element shuffling, vector arithmetic (including fused arithmetic like fused multiply-add), vector logical operations, masking, etc. So they look more and more like the mini vector processors of the old days.
Although SIMD has been with personal computers for nearly two decades, many programmers are unaware of it. Instead, many are familiar with thread-level parallelism like multi-core computing (which can be referred to as MIMD). So if you are new to SIMD, I highly recommend watching the YouTube video Utilizing the other 80% of your system's performance: Starting with Vectorization by Ulrich Drepper from Red Hat.
Since vector instruction sets are extensions to the original architecture instruction sets, you have to invest extra effort to use them. If you write assembly code, you can call these instructions straight away. If you write compiled code in C, C++ or Fortran, you have to write inline assembly or use vector intrinsics. A vector intrinsic looks like a function, but it is in fact an inline assembly mapping to a vector assembly instruction. These intrinsics (or "functions") are not part of the standard libraries of a compiled language; they are provided by the architecture / machine. To use them, we need to include the appropriate header files and pass compiler-specific flags.
Let's first define the following for ease of later discussions:
Writing assembly or compiled code that does not use vector instruction sets is called "writing scalar code";
Writing assembly or compiled code that uses vector instruction sets is called "writing vector code".
So these two paths are "write A, get A" and "write B, get B". However, compilers are getting stronger and there is another "write A, get B" path:
They have the power to translate your scalar compiled code into vector assembly code, a compiler optimization called "auto-vectorization".
Some compilers like GCC consider auto-vectorization part of the highest optimization level, enabled by the flag -O3, while more aggressive compilers like ICC (Intel C++ compiler) and Clang enable it at -O2. Auto-vectorization can also be controlled directly by specific flags; for example, GCC has -ftree-vectorize. When exploiting auto-vectorization, it is advisable to further hint the compiler to tailor the vector assembly code for your machine. For example, for GCC we may pass -march=native and for ICC we use -xHost. This makes sense, because even within the x86-64 architecture family, later microarchitectures come with more vector instruction sets. For example, Sandy Bridge supports vector instruction sets up to AVX, Haswell additionally supports AVX2 and FMA3, and Skylake additionally supports AVX-512. Without -march=native, GCC only generates vector instructions using instruction sets up to SSE2, a much smaller subset common to all x86-64 machines.
Summary: how can we leverage SIMD for our assembly or compiled code?
There are five ways to implement SIMD:
Writing machine-specific vector assembly code directly. For example, on x86-64 we use SSE / AVX instruction sets, and on ARM architectures we use NEON instruction sets.
Pros: Code can be hand-tuned for better performance;
Cons: We have to write different versions of assembly code for different machines.
Writing vector compiled code by using machine-specific vector intrinsics and compiling it with compiler-specific flags. For example, on x86-64 we use SSE / AVX intrinsics, and for GCC we set -msse2, -mavx, etc (or simply -march=native for auto-detection). A variant of this option is to write compiler-specific inline-assembly. For example, introduction to GCC's inline assembly can be found here.
Pros: Writing compiled code is easier than writing assembly code, and the code is more readable hence easier to maintain;
Cons: We have to write different versions of code for different machines, and adapt Makefile for different compilers.
Writing vector compiled code by using compiler-specific vector extensions. Some compilers have defined their own vector data type. For example, GCC's vector extensions can be found here;
Pros: We don't need to worry about the difference across architectures, as compiler can generate machine-specific assembly;
Cons: We have to write different versions of code for different compilers, and adapt Makefile likewise.
Writing scalar compiled code and using compiler-specific flags for auto-vectorization. Optionally we can insert compiler-specific pragmas along our compiled code to give compilers more hints on, say, data alignment, loop unrolling depth, etc.
Pros: Writing scalar compiled code is easier than writing vector compiled code, and it is more readable to a broad audience.
Cons: We have to adapt Makefile for different compilers, and in case we have used pragmas, they also need be versioned.
Writing scalar compiled code and inserting the OpenMP pragma #pragma omp simd (requires OpenMP 4.0+).
Pros: same as option 4, and additionally, we can use a single version of the pragmas, as most mainstream compilers support the OpenMP standard;
Cons: We have to adapt Makefile for different compilers as they may have different flags to enable OpenMP and machine-specific tuning.
From top to bottom, programmers progressively do less and compilers do increasingly more. Implementing SIMD is interesting, but this article unfortunately does not have room for a decent coverage with examples. I will provide the most informative references I have found.
For options 1 and 2 on x86-64, the SSE / AVX intrinsics listing is definitely the best reference card, but not the right place to start learning these instructions. Where to start is individual-specific. I picked up intrinsics / assembly from BLISLab when I tried to write my own high-performance DGEMM (to be introduced later). After digesting the example code there I started practising, and posted a few questions on StackOverflow or CodeReview when I got stuck.
For option 4, a good explanation is given by A guide to auto-vectorization with Intel C++ compilers. Although the manual is for ICC, the principles of how auto-vectorization works apply to GCC as well. The official website for GCC's auto-vectorization is rather outdated, and this presentation slide deck is more useful: GCC auto-vectorization.
For option 5, there is a very good technical report by Oak Ridge National Laboratory: Effective Vectorization with OpenMP 4.5.
In terms of portability,
Options 1 to 3 are not easily portable, because the version of vector code depends on machine and / or compiler;
Option 4 is much better as we get rid of machine-dependency, but we still have problem with compiler-dependency;
Option 5 is very close to portable, as adapting Makefile is way much easier than adapting code.
In terms of performance, conventional wisdom says that option 1 is the best, and that performance degrades as we move downward. However, compilers are getting better, and newer machines have hardware improvements (for example, the performance penalty for unaligned vector load is smaller). So the outlook for auto-vectorization is very positive. As part of my own DGEMM case study, I found that on an Intel Xeon E5-2650 v2 workstation with a peak performance of 18 GFLOPs per CPU, GCC's auto-vectorization attained 14 ~ 15 GFLOPs, which is rather impressive.
R side: how can R possibly use SIMD?
R can only use SIMD by calling compiled code that exploits SIMD. Compiled code in R has three sources:
R packages with "base" priority, like base, stats, utils, etc, that come with R's source code;
Other packages on CRAN that require compilation;
External scientific libraries like BLAS and LAPACK.
Since the R software itself is portable across architectures, platforms and operating systems, and CRAN policy expects an R package to be equally portable, compiled code from sources 1 and 2 cannot be written in assembly or in compiler-dependent compiled code, ruling out options 1 to 3 for implementing SIMD. Auto-vectorization is the only chance left for R to leverage SIMD.
If you have built R with the compiler's auto-vectorization enabled, compiled code from source 1 can exploit SIMD. In fact, although R is written in a portable way, you can tune it for your machine when building it. For example, I would do icc -xHost -O2 with ICC or gcc -march=native -O2 -ftree-vectorize -ffast-math with GCC. Flags are set at R's build time and recorded in RHOME/etc/Makeconf (on Linux). Usually people just do a quick build, so the flag configuration is decided automatically. The result can differ depending on your machine and your default compiler. On a Linux machine with GCC, the optimization flag is often automatically set to -O2, hence auto-vectorization is off; on a Mac OS X machine with Clang, auto-vectorization is on at -O2. So I suggest checking your Makeconf.
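If you are unsure what your build recorded, a quick way to look (on a typical Linux / macOS install) is from within R itself:
makeconf <- file.path(R.home("etc"), "Makeconf")
flags <- grep("^(CC|CFLAGS|CXX|CXXFLAGS|FFLAGS) *=", readLines(makeconf), value = TRUE)
cat(flags, sep = "\n")
# look for -O3 / -ftree-vectorize / -march=native (GCC) or -xHost (ICC);
# if none of these appear, auto-vectorization is most likely off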
Flags in Makeconf are used when you run R CMD SHLIB, which is invoked by R CMD INSTALL or install.packages when installing CRAN packages that need compilation. By default, if Makeconf says that auto-vectorization is off, compiled code from source 2 cannot leverage SIMD. However, it is possible to override the Makeconf flags by providing a user Makevars file (like ~/.R/Makevars on Linux), so that R CMD SHLIB takes these flags and auto-vectorizes compiled code from source 2.
BLAS and LAPACK are not part of the R project or the CRAN mirror. R simply takes them as they are, and does not even check whether they are valid! For example, on Linux if you alias your BLAS library to an arbitrary library foo.so, R will "stupidly" load foo.so instead on its startup and cause you trouble! The loose relationship between R and BLAS makes it easy to link different versions of BLAS to R, so benchmarking different BLAS libraries in R becomes straightforward (of course you have to restart R after you update the linkage). For Linux users with root privilege, switching between different BLAS libraries is recommended via sudo update-alternatives --config. If you don't have root privilege, this thread on StackOverflow will help you: Without root access, run R with tuned BLAS when it is linked with reference BLAS.
In case you don't know what BLAS is, here is a brief introduction. BLAS originally referred to a coding standard for vector-vector, matrix-vector and matrix-matrix computations in scientific computing. For example, it was recommended that a general matrix-matrix multiplication should be C <- beta * C + alpha * op(A) %*% op(B), known as DGEMM. Note that this operation is more than just C <- A %*% B, and the motivation for this design was to maximize code reuse. For example, C <- C + A %*% B, C <- 2 * C + t(A) %*% B, etc., can all be computed using DGEMM. A model implementation in FORTRAN 77 is provided with the standard for reference, and this model library is commonly known as the reference BLAS library. Such a library is static; it is there to motivate people to tune its performance for specific machines. BLAS optimization is actually a very difficult job. At the end of the optimization everything changes except the user interface; that is, everything inside a BLAS function is changed, except that you still call it in the same way. The various optimized BLAS libraries are known as tuned BLAS libraries; examples include ATLAS, OpenBLAS and Intel MKL. All tuned BLAS libraries exploit SIMD as part of their optimization. An optimized BLAS library is remarkably faster than the reference one, and the performance gap gets increasingly wider on newer machines.
R relies on BLAS. For example, the matrix-matrix multiplication operator "%*%" in R calls DGEMM. The functions crossprod and tcrossprod are also mapped to DGEMM. BLAS lies at the centre of scientific computing; without it, R would largely be broken. This is why linking an optimized BLAS library to R is advocated so much. It used to be difficult to check which BLAS library is linked to R (as this can be obscured by an alias), but from R 3.4.0 this is no longer the case: sessionInfo() shows the full paths to the library or executable files providing the BLAS / LAPACK implementations currently in use (not available on Windows).
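A small sketch of this mapping (timings are machine- and BLAS-dependent): both lines below compute t(A) %*% B, but crossprod avoids explicitly forming t(A) and goes to BLAS in a single call, so it is usually faster:
set.seed(42)
A <- matrix(rnorm(2000 * 500), 2000, 500)
B <- matrix(rnorm(2000 * 500), 2000, 500)
system.time(C1 <- t(A) %*% B)
system.time(C2 <- crossprod(A, B))
all.equal(C1, C2)  # TRUE, up to floating-point rounding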
LAPACK is a more advanced scientific library built on top of BLAS. R relies on LAPACK for various matrix factorizations. For example, qr(, pivot = TRUE), chol, svd and eigen in R are mapped to LAPACK for QR factorization, Cholesky factorization, singular value decomposition and eigendecomposition. Note that all tuned BLAS libraries include a clone of LAPACK, so if R is linked to a tuned BLAS library, sessionInfo() will show both libraries coming from the same path; if instead R is linked to the reference BLAS library, sessionInfo() will show two different paths for BLAS and LAPACK. There have been plenty of questions tagged r regarding the drastic performance difference of matrix multiplication across platforms, like Large performance differences between OS for matrix computation. In fact, just from the output of sessionInfo() you get an immediate clue that R is linked to a tuned BLAS on the first platform and a reference BLAS on the second.
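To check what your own R uses (R >= 3.4.0; on a tuned BLAS both paths typically point to the same library) and to see a couple of the LAPACK-backed functions in action:
si <- sessionInfo()
si$BLAS    # e.g. the reference .../libRblas.so or a tuned .../libopenblas.so
si$LAPACK
S <- crossprod(matrix(rnorm(500 * 500), 500, 500))
system.time(chol(S))   # Cholesky factorization via LAPACK
system.time(svd(S))    # singular value decomposition via LAPACK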
Performance: is vector code always faster than scalar code?
Vector code looks fast, but it may not be realistically faster than scalar code. Here is a case study: Why is this SIMD multiplication not faster than non-SIMD multiplication?. And what a coincidence: the vector operation examined there is exactly the one the OP here took as an example, the Hadamard product. People often forget that the processing speed of the CPU is not the deciding factor for practical performance. If data cannot be transported from memory to the CPU as fast as the CPU requests it, the CPU just sits there and waits most of the time. The Hadamard product falls into exactly this situation: for every multiplication, two values must be loaded and one result stored, so the Hadamard product is a memory-bound operation. The processing power of a CPU can only be realized when substantially more arithmetic is done than data movement. The classic matrix-matrix multiplication in BLAS belongs to this case, which explains why the SIMD implementation in a tuned BLAS library is so rewarding.
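You can get a feel for this in R itself with a rough, machine-dependent sketch: compare the effective FLOP rate of the element-wise product with that of DGEMM (increase n if the element-wise timing is too small to measure):
n <- 3000L
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
t1 <- unname(system.time(A * B)["elapsed"])    # Hadamard product: n^2 flops, memory-bound
t2 <- unname(system.time(A %*% B)["elapsed"])  # DGEMM: ~2 * n^3 flops, compute-bound
c(hadamard_gflops = n^2 / t1 / 1e9, dgemm_gflops = 2 * n^3 / t2 / 1e9)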
In light of this, I don't think you need to worry that much if you did not build your R software with compiler auto-vectorization turned on. It is hard to know whether R will really be faster.
Writing R extensions: write compiled code with OpenMP SIMD
If you decide to make contributions to CRAN by writing your own R package, you can consider using SIMD option 5 (OpenMP #pragma omp simd) if some section of your compiled code can benefit from SIMD. The reason for not choosing option 4 is that when you write a distributable package, you have no idea what compiler an end user will use. So there is no way you can write compiler-specific code and get it published on CRAN.
As we pointed out earlier in the list of SIMD options, using OpenMP SIMD requires adapting the Makefile. In fact, R makes this very easy for you: you never need to write a Makefile alongside an R package. All you need is a Makevars file. When your package is compiled, the compiler flags specified in your package's Makevars and in the RHOME/etc/Makeconf on the user's machine are passed to R CMD SHLIB. Although you don't know what compiler the user might be using, RHOME/etc/Makeconf does! All you need to do is specify in your package's Makevars that you want OpenMP support, as sketched below.
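For example, a minimal src/Makevars for C code could contain just the two lines below (these are the macros documented in Writing R Extensions; use SHLIB_OPENMP_CXXFLAGS instead for C++ code):
PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CFLAGS)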
The only thing you can't do in your package's Makevars is give hints for machine-specific tuning. You may instead advise your package users as follows:
If the RHOME/etc/Makeconf on the user's machine already has such a tuning configuration (that is, the user configured these flags when building R), your compiled code should be transformed into tuned vector assembly code and there is nothing further to do;
Otherwise, you need to advise users to edit their personal Makevars file (like ~/.R/Makevars on Linux). You should provide a table (maybe in your package vignette or documentation) of which tuning flags should be set for which compilers, say -xHost for ICC and -march=native for GCC.
Well, there is a little-known R distribution from Microsoft (the artist formerly known as Revolution R), which can be found here.
It comes with the Intel MKL library, which utilizes multiple threads and vector operations as well (you have to run it on an Intel CPU, though), and it really helps with matrices, things like SVD, etc.
Unless you're willing to write C/C++ code with SIMD intrinsics using Rcpp or similar interfaces, Microsoft R is your best bet to utilize SIMD
Download and try it
Is it possible to use OpenCL but with R code? I still don't have a good understanding of OpenCL and GPU programming. For example, suppose I have the following R code:
aaa <- function(x) mean(rnorm(1000000))
sapply(1:10, aaa)
I like that I can kind of use mclapply as a drop-in replacement for lapply. Is there a way to do that for OpenCL? Or to use OpenCL as a backend for mclapply? I'm guessing this is not possible because I have not been able to find an example, so I have two questions:
Is this possible and if so can you give a complete example using my function aaa above?
If this is not possible, can you please explain why? I do not know much about GPU programming. I view GPUs much like CPUs, so why can't I run R code on them in parallel?
I would start by looking at the High Performance Computing CRAN task view, in particular the Parallel computing: GPUs section.
There are a number of packages listed there which take advantage of GPGPU for specific tasks that lend themselves to massive parallelisation (e.g. gputools, HiPLARM). Most of these use NVIDIA's own CUDA rather than OpenCL.
There is also a more generic OpenCL package, but it requires you to learn how to write OpenCL code yourself, and merely provides an interface to that code from R.
It isn't possible because GPUs work differently than CPUs which means you can't give them the same instructions that you'd give a CPU.
Nvidia puts on a good show with this video describing the difference between CPU and GPU processing. Essentially, the difference is that GPUs typically have orders of magnitude more cores than CPUs.
Your example is one that can be extended to GPU code because it is highly parallel.
Here's some code to create random numbers (although they aren't normally distributed) http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
Once you create the random numbers, you could break them into chunks, sum each chunk in parallel, and then add the chunk sums to get the overall sum (see Is it possible to run the sum computation in parallel in OpenCL?).
I realize that your code would build the random-number vector and its sum serially and parallelize that operation 10 times, but with GPU processing, having a mere 10 tasks isn't very efficient, since you'd leave so many cores idle.
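On the CPU side, the chunk-and-reduce pattern (and the drop-in replacement for sapply you mention) already exists via parallel::mclapply; a rough sketch, assuming a Unix-like OS for fork-based workers (there is currently no equivalent that transparently targets OpenCL):
library(parallel)
aaa <- function(x) mean(rnorm(1000000))
unlist(mclapply(1:10, aaa, mc.cores = detectCores()))
# the same chunked reduction applied to one big vector
v <- rnorm(1e7)
chunks <- split(v, cut(seq_along(v), 8, labels = FALSE))
sum(unlist(mclapply(chunks, sum, mc.cores = 8)))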
Does anyone know of a library or example of openCL code which will solve for Ax=B where A is large and sparse? I do not want to calculate the inverse of A as it would be very large and dense. The A matrix is >90% sparse and just calculating x is likely far less memory and computationally intensive.
The following post will help me on the CPU and looks like a good option but I really need the speedup of the GPU for this application.
C++ Memory Efficient Solution for Ax=b Linear Algebra System
What you are looking for is a Sparse Linear System Solver. For OpenCL take a look at ViennaCL: http://viennacl.sourceforge.net/
It has Conjugate Gradient, Stabilized Bi-Conjugate Gradient, Generalized Minimum Residual solvers.
However if you want to solve it efficiently you need a multigrid method. Take a look at: http://www.paralution.com/
PARALUTION is a library which enables you to perform various sparse iterative solvers and preconditioners on multi/many-core CPU and GPU devices.
There is also SpeedIT OpenCL:
This version of SpeedIT utilizes the power of the OpenCL framework, which allows it to use the computational power of suitable GPUs. The SpeedIT OpenCL library provides a set of accelerated solvers and functions for sparse linear systems of equations, which are:
• Preconditioned Conjugate Gradient solver
• Preconditioned Stabilized Bi-Conjugate Gradient solver
• Accelerated Sparse Matrix-Vector Multiplication
• Preconditioners:
◦ Jacobi diagonal
◦ Row scaling with norms l1, l2 and l∞
◦ ILU0 – Incomplete LU with 0 filling
You can solve linear simultaneous equations of the form AX=B using Sequalator.
You can either use the OpenCL functionality or the multi threaded CPU functionality as per your hardware requirements.
You can also analyse the solutions to get an understanding of the errors in equations after substituting the solutions.