Does R leverage SIMD when doing vectorized calculations? - r

Given a dataframe like this in R:
+---+---+
| X | Y |
+---+---+
| 1 | 2 |
| 2 | 4 |
| 4 | 5 |
+---+---+
If a vectorized operation is performed on this dataframe, like so:
data$Z <- data$X * data$Y
Will this leverage the processor's single-instruction-multiple-data (SIMD) capabilities to optimize the performance? This seems like the perfect case for it, but I cannot find anything that confirms my hunch.

I just got a "good answer" badge two years after my initial answer. Thanks for acknowledging the quality of this answer. In return, I will substantially enrich the original content. This answer / article is now intended for any R user who wants to try out SIMD. It has the following sections:
Background: what is SIMD?
Summary: how can we leverage SIMD for our assembly or compiled code?
R side: how can R possibly use SIMD?
Performance: is vector code always faster than scalar code?
Writing R extensions: write compiled code with OpenMP SIMD
Background: what is SIMD?
Many R programmers may not know about SIMD if they don't write assembly or compiled code.
SIMD (single instruction, multiple data) is a data-level parallel processing technology with a very long history. Before personal computers were around, SIMD unambiguously referred to the vector processing done by a vector processor, and was the main route to high-performance computing. When personal computers later came onto the market, they had no features resembling those of a vector processor. However, as the demand for processing multimedia data grew, they began to acquire vector registers and corresponding vector instruction sets that use those registers for vector data load, vector arithmetic and vector data store. The capacity of vector registers keeps getting bigger, and the vector instruction sets are increasingly versatile. Today they can do stream load / store, strided load / store, scattered load / store, vector element shuffling, vector arithmetic (including fused arithmetic like fused multiply-add), vector logical operations, masking, etc. So they look more and more like the mini vector processors of the old days.
Although SIMD has been with personal computers for nearly two decades, many programmers are unaware of it. Instead, many are familiar with thread-level parallelism like multi-core computing (which can be classified as MIMD). So if you are new to SIMD, you are highly recommended to watch the YouTube video Utilizing the other 80% of your system's performance: Starting with Vectorization by Ulrich Drepper of Red Hat.
Since vector instruction sets are extensions to the original architecture instruction sets, you have to invest extra effort to use them. If you write assembly code, you can call these instructions straight away. If you write compiled code like C, C++ or Fortran, you have to write inline assembly or use vector intrinsics. A vector intrinsic looks like a function, but it is in fact inline assembly mapping to a vector assembly instruction. These intrinsics (or "functions") are not part of the standard libraries of a compiled language; they are provided by the architecture / machine. To use them, we need to include the appropriate header files and pass compiler-specific flags.
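For a taste of what vector intrinsics look like, here is a minimal sketch in C using SSE2 intrinsics (SSE2 is part of the x86-64 baseline, so GCC needs no extra -m flag for it); the function name and the assumption that the length is even are mine, for illustration only:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* z[i] = x[i] * y[i], two doubles per instruction.
   Assumes n is even; real code would also handle a leftover element. */
void hadamard_sse2(const double *x, const double *y, double *z, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);          /* unaligned vector load  */
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(z + i, _mm_mul_pd(vx, vy));  /* vector multiply, store */
    }
}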
Let's first define the following for ease of later discussions:
Writing assembly or compiled code that does not use vector instruction sets is called "writing scalar code";
Writing assembly or compiled code that uses vector instruction sets is called "writing vector code".
So these two paths are "write A, get A" and "write B, get B". However, compilers are getting stronger and there is another "write A, get B" path:
They have the power to translate your scalar compiled code into vector assembly code, a compiler optimization called "auto-vectorization".
Some compilers, like GCC, consider auto-vectorization part of the highest optimization level, enabled by the flag -O3; more aggressive compilers like ICC (the Intel C++ compiler) and clang enable it at -O2. Auto-vectorization can also be controlled directly by specific flags; for example, GCC has -ftree-vectorize. When exploiting auto-vectorization, it is advisable to further hint the compiler to tailor the vector assembly code to the machine. For example, for GCC we may pass -march=native and for ICC we use -xHost. This makes sense, because even within the x86-64 architecture family, later microarchitectures come with more vector instruction sets. For example, Sandy Bridge supports vector instruction sets up to AVX, Haswell further supports AVX2 and FMA3, and Skylake further supports AVX-512. Without -march=native, GCC only generates vector instructions from instruction sets up to SSE2, a much smaller subset common to all x86-64 machines.
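As a sketch of this "write A, get B" path, the plain scalar loop below contains no intrinsics at all; compiled with, say, gcc -O2 -ftree-vectorize -march=native (or simply -O3 -march=native), GCC may turn it into vector instructions, and -fopt-info-vec reports what was vectorized. The function name is made up for illustration:

/* Plain scalar C; the compiler is free to auto-vectorize this loop. */
void hadamard_scalar(const double *restrict x, const double *restrict y,
                     double *restrict z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}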
Summary: how can we leverage SIMD for our assembly or compiled code?
There are five ways to implement SIMD:
Writing machine-specific vector assembly code directly. For example, on x86-64 we use SSE / AVX instruction sets, and on ARM architectures we use NEON instruction sets.
Pros: Code can be hand-tuned for better performance;
Cons: We have to write different versions of assembly code for different machines.
Writing vector compiled code by using machine-specific vector intrinsics and compiling it with compiler-specific flags. For example, on x86-64 we use SSE / AVX intrinsics, and for GCC we set -msse2, -mavx, etc. (or simply -march=native for auto-detection). A variant of this option is to write compiler-specific inline assembly. For example, an introduction to GCC's inline assembly can be found here.
Pros: Writing compiled code is easier than writing assembly code, and the code is more readable hence easier to maintain;
Cons: We have to write different versions of code for different machines, and adapt Makefile for different compilers.
Writing vector compiled code by using compiler-specific vector extensions. Some compilers have defined their own vector data types. For example, GCC's vector extensions can be found here;
Pros: We don't need to worry about differences across architectures, as the compiler can generate machine-specific assembly;
Cons: We have to write different versions of code for different compilers, and adapt the Makefile likewise.
Writing scalar compiled code and using compiler-specific flags for auto-vectorization. Optionally we can insert compiler-specific pragmas in our compiled code to give compilers more hints on, say, data alignment, loop unrolling depth, etc.
Pros: Writing scalar compiled code is easier than writing vector compiled code, and it is more readable to a broad audience.
Cons: We have to adapt the Makefile for different compilers, and in case we have used pragmas, they also need to be adapted per compiler.
Writing scalar compiled code and inserting the OpenMP pragma #pragma omp simd (requires OpenMP 4.0+); a short sketch follows this list.
Pros: Same as option 4, and additionally, we can use a single version of the pragmas, as most mainstream compilers support the OpenMP standard;
Cons: We have to adapt the Makefile for different compilers, as they may have different flags to enable OpenMP and machine-specific tuning.
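To illustrate option 5, here is a minimal sketch (my own, for illustration) of a loop annotated with the OpenMP SIMD pragma. Any compiler supporting OpenMP 4.0+ can vectorize it when given its OpenMP (or OpenMP-SIMD-only) flag, e.g. -fopenmp-simd for GCC and clang or -qopenmp-simd for ICC:

#include <stddef.h>

/* The pragma asks an OpenMP-4.0-aware compiler to vectorize this loop,
   independently of the compiler vendor and the target machine. */
void hadamard_omp(const double *x, const double *y, double *z, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}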
Going from option 1 down to option 5, programmers progressively do less and compilers do increasingly more. Implementing SIMD is interesting, but this article unfortunately does not have room for decent coverage with examples. Instead, I will point to the most informative references I have found.
For options 1 and 2 on x86-64, the SSE / AVX intrinsics guide is definitely the best reference card, but not the right place to start learning these instructions. Where to start differs from person to person. I picked up intrinsics / assembly from BLISLab when I tried to write my own high-performance DGEMM (to be introduced later). After digesting the example code over there I started practising, and posted a few questions on StackOverflow or CodeReview when I got stuck.
For option 4, a good explanation is given by A guide to auto-vectorization with Intel C++ compilers. Although the manual is for ICC, the principles of how auto-vectorization works apply to GCC as well. The official website for GCC's auto-vectorization is rather outdated, and this presentation is more useful: GCC auto-vectorization.
For option 5, there is a very good technical report by Oak Ridge National Laboratory: Effective Vectorization with OpenMP 4.5.
In terms of portability,
Options 1 to 3 are not easily portable, because the vector code depends on the machine and / or the compiler;
Option 4 is much better as we get rid of the machine dependency, but we are still left with a compiler dependency;
Option 5 is very close to portable, as adapting a Makefile is much easier than adapting code.
In terms of performance, it is conventionally believed that option 1 is the best, and that performance degrades as we move downward. However, compilers are getting better, and newer machines have hardware improvements (for example, the performance penalty for unaligned vector loads is smaller). So auto-vectorization looks very promising. As part of my own DGEMM case study, I found that on an Intel Xeon E5-2650 v2 workstation with a peak performance of 18 GFLOPs per CPU, GCC's auto-vectorization attained 14 ~ 15 GFLOPs, which is rather impressive.
R side: how can R possibly use SIMD?
R can only use SIMD by calling compiled code that exploits SIMD. Compiled code in R has three sources:
R packages with "base" priority, like base, stats, utils, etc, that come with R's source code;
Other packages on CRAN that require compilation;
External scientific libraries like BLAS and LAPACK.
Since the R software itself is portable across architectures, platforms and operating systems, and CRAN policy expects an R package to be equally portable, compiled code in sources 1 and 2 cannot be written in assembly or compiler-dependent compiled code, ruling out options 1 to 3 for SIMD implementation. Auto-vectorization is the only chance left for R to leverage SIMD.
If you have built R with the compiler's auto-vectorization enabled, compiled code from source 1 can exploit SIMD. In fact, although R is written in a portable way, you can tune it for your machine when building it. For example, I would do icc -xHost -O2 with ICC or gcc -march=native -O2 -ftree-vectorize -ffast-math with GCC. Flags are set at R's build time and recorded in RHOME/etc/Makeconf (on Linux). Usually people just do a quick build, so the flag configuration is decided automatically. The result can differ depending on your machine and your default compiler. On a Linux machine with GCC, the optimization flag is often automatically set to -O2, hence auto-vectorization is off; on a Mac OS X machine with clang, by contrast, auto-vectorization is on at -O2. So I suggest you check your Makeconf.
Flags in Makeconf are used when you run R CMD SHLIB, which is invoked by R CMD INSTALL or install.packages when installing CRAN packages that need compilation. By default, if Makeconf says that auto-vectorization is off, compiled code from source 2 cannot leverage SIMD. However, it is possible to override the Makeconf flags by providing a user Makevars file (like ~/.R/Makevars on Linux), so that R CMD SHLIB picks up these flags and auto-vectorizes compiled code from source 2.
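For example, a hypothetical ~/.R/Makevars for GCC on Linux could contain something like the following, so that packages installed afterwards are compiled with auto-vectorization and machine-specific tuning:

CFLAGS = -O2 -march=native -ftree-vectorize
CXXFLAGS = -O2 -march=native -ftree-vectorize
FFLAGS = -O2 -march=native -ftree-vectorize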
BLAS and LAPACK are not part of the R project or the CRAN mirrors. R simply takes them as they are, and does not even check whether they are valid! For example, on Linux, if you alias your BLAS library to an arbitrary library foo.so, R will "stupidly" load foo.so on startup and cause you trouble! The loose relationship between R and BLAS makes it easy to link different versions of BLAS to R, so that benchmarking different BLAS libraries in R becomes straightforward (of course you have to restart R after you update the linkage). For Linux users with root privilege, switching between BLAS libraries is best done with sudo update-alternatives --config. If you don't have root privilege, this thread on StackOverflow will help you: Without root access, run R with tuned BLAS when it is linked with reference BLAS.
In case you don't know what BLAS is, here is a brief introduction. BLAS originally referred to a coding standard for vector-vector, matrix-vector and matrix-matrix computations in scientific computing. For example, it was recommended that a general matrix-matrix multiplication take the form C <- beta * C + alpha * op(A) %*% op(B), known as DGEMM. Note that this operation is more than just C <- A %*% B; the motivation of this design was to maximize code reuse. For example, C <- C + A %*% B, C <- 2 * C + t(A) %*% B, etc. can all be computed with DGEMM. A model implementation in FORTRAN 77 is provided with the standard as a reference, and this model library is commonly known as the reference BLAS library. The reference library is left as it is; it is there to motivate people to tune its performance for specific machines. BLAS optimization is actually a very difficult job. At the end of the optimization, everything changes except the user interface: everything inside a BLAS function is changed, except that you still call it in the same way. The various optimized BLAS libraries are known as tuned BLAS libraries; examples include ATLAS, OpenBLAS and Intel MKL. All tuned BLAS libraries exploit SIMD as part of their optimization. A tuned BLAS library is remarkably faster than the reference one, and the performance gap grows ever wider on newer machines.
R relies on BLAS. For example, the matrix-matrix multiply operator "%*%" in R calls DGEMM. The functions crossprod and tcrossprod are also mapped to DGEMM. BLAS lies at the centre of scientific computations; without BLAS, R would largely be broken. That is why linking an optimized BLAS library to R is advocated so strongly. It used to be difficult to check which BLAS library is linked to R (as this can be obscured by aliases), but since R 3.4.0 this is no longer the case: sessionInfo() shows the full paths to the library or executable files providing the BLAS / LAPACK implementations currently in use (not available on Windows).
LAPACK is a more advanced scientific library built on top of BLAS. R relies on LAPACK for various matrix factorizations. For example, qr(, pivot = TRUE), chol, svd and eigen in R are mapped to LAPACK for QR factorization, Cholesky factorization, singular value decomposition and eigendecomposition. Note that all tuned BLAS libraries include a clone of LAPACK, so if R is linked to a tuned BLAS library, sessionInfo() will show both libraries coming from the same path; if instead R is linked to the reference BLAS library, sessionInfo() will show two different paths for BLAS and LAPACK. There have been plenty of questions tagged r regarding the drastic performance difference of matrix multiplication across platforms, like Large performance differences between OS for matrix computation. In fact, just by looking at the output of sessionInfo(), you get an immediate clue that R is linked to a tuned BLAS on the first platform and to the reference BLAS on the second.
Performance: is vector code always faster than scalar code?
Vector code looks fast, but it may not be realistically faster than scalar code. Here is a case study: Why is this SIMD multiplication not faster than non-SIMD multiplication?. And what a coincidence: the vector operation examined there is exactly the one the OP here used as an example, the Hadamard (element-wise) product. People often forget that the processing speed of the CPU is not the deciding factor for practical performance. If data cannot be transported from memory to the CPU as fast as the CPU requests it, the CPU just sits there waiting most of the time. The Hadamard product falls exactly into this situation: every multiplication requires three memory accesses (two loads and one store) for a single flop, so it is a memory-bound operation. The processing power of a CPU can only be realized when substantially more arithmetic is done than data movement. The classic matrix-matrix multiplication in BLAS belongs to this case, and this explains why the SIMD implementation in a tuned BLAS library is so rewarding.
In light of this, I don't think you need to worry that much if you did not build your R software with compiler auto-vectorization turned on. It is hard to know whether R will really be faster.
Writing R extensions: write compiled code with OpenMP SIMD
If you decide to contribute to CRAN by writing your own R package, you can consider SIMD option 5, OpenMP SIMD, if some section of your compiled code can benefit from it. The reason for not choosing option 4 is that when you write a distributable package, you have no idea what compiler an end-user will use. So there is no way you can write compiler-specific code and get it published on CRAN.
As we pointed out earlier in the list of SIMD options, using OpenMP SIMD requires us to adapt the Makefile. In fact, R makes this very easy for you: you never need to write a Makefile for an R package. All you need is a Makevars file. When your package is compiled, the compiler flags specified in your package Makevars and in RHOME/etc/Makeconf on the user's machine are passed to R CMD SHLIB. Although you don't know which compiler the user might be using, RHOME/etc/Makeconf does! All you need to do is specify in your package Makevars that you want OpenMP support, for example as sketched below.
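For C code, a minimal src/Makevars could look like the two lines below (C++ and Fortran have analogous SHLIB_OPENMP_* macros); Makeconf then expands the macro to whatever OpenMP flag the user's compiler needs:

PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CFLAGS)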
The only thing you can't do in your package Makevars is give hints for machine-specific tuning. You may instead advise your package users as follows:
If RHOME/etc/Makeconf on the user's machine already has such a tuning configuration (that is, the user configured these flags when building R), your compiled code should be transformed into tuned vector assembly code and there is nothing further to do;
Otherwise, you need to advise users to edit their personal Makevars file (like ~/.R/Makevars on Linux). You could provide a table (maybe in your package vignette or documentation) of which tuning flags should be set for which compilers, say -xHost for ICC and -march=native for GCC.

Well, there is a little-known R distribution from Microsoft (the artist formerly known as Revolution R), which can be found here.
It comes with the Intel MKL library, which uses multiple threads as well as vector operations (you have to run an Intel CPU, though), and it really helps with matrix work, things like SVD, etc.
Unless you're willing to write C/C++ code with SIMD intrinsics using Rcpp or similar interfaces, Microsoft R is your best bet for utilizing SIMD.
Download and try it.

Related

lineprof equivalent for Rcpp

The lineprof package in R is very useful for profiling which parts of function take up time and allocate/free memory.
Is there a lineprof() equivalent for Rcpp?
I currently use std::chrono::steady_clock and such to get chunk timings out of an Rcpp function. Alternatives? Does the RStudio IDE provide some help here?
To supplement @Dirk's answer...
If you are working on OS X, the Time Profiler Instrument, part of Apple's Instruments set of instrumentation tools, is an excellent sampling profiler.
Just to fix ideas:
A sampling profiler lets you answer the question, what code paths does my program spend the most time executing?
A (full) cache profiler lets you answer the question, which are the most frequently executed code paths in my program?
These are different questions -- it's possible that your hottest code paths are already optimized enough that, even though the total number of instructions executed in that path is very high, the amount of time required to execute them might be relatively low.
If you want to use Instruments to profile C++ code / routines used in an R package, the easiest way to go about this is:
Create a target pointed at your R executable, with appropriate command line arguments to run whatever functions you wish to profile.
Set the command line arguments to run the code that will exercise your C++ routines -- for example, running Rcpp:::test() instruments all of the Rcpp test code.
Click the big red Record button, and off you go!
I'll leave the rest of the instructions in understanding instruments + the timing profiler to your google-fu + the documentation, but (if you're on OS X) you should be aware of this tool.
See any decent introduction to high(er) performance computing, e.g. some slides from an (older) presentation on my talks page, which include worked examples for both KCacheGrind (a KDE frontend to Valgrind) and Google Perftools.
In a more abstract sense, you need to come to terms with the fact that C++ != R and not all tools have identical counterparts. In particular Rprof, the R profiler which several CRAN profiling packages build on top of, relies on the fact that R is interpreted. C++ is not, so things will be different. But profiling compiled code is about as old as compiling and debugging themselves, so you will find numerous tutorials.

OpenCL BLAS in julia

I am really amazed by the Julia language, having implemented lots of machine learning algorithms in it for my current project. Even though Julia 0.2 manages to get some great results out of my 2011 MBA, outperforming all other solutions on similar Linux hardware (due to the vecLib BLAS, I suppose), I would certainly like more. I am in the process of buying a Radeon 5870 and would like to push my matrix operations there. I use basically only simple BLAS operations such as matmul, additions and transpositions, and I use Julia's compact syntax A' * B + C, which I would certainly like to keep.
Is there any way (or pending milestone) I can get those basic operations to execute on the GPU? I have 2500x2500 single precision matrices, so I expect a significant speedup.
I don't believe that GPU integration into the core of Julia is planned at this time. One of the key issues is that there is substantial overhead moving the data to and from the GPU, making a drop-in replacement for BLAS operations infeasible.
I expect that most of the progress in this area will actually come from the package ecosystem, in particular packages under the JuliaGPU organization. I see there is a CLBLAS package in there.

Changing the LAPACK implementation used by IDL linear algebra routines?

Over at http://scicomp.stackexchange.com I asked this question about parallel matrix algorithms in IDL. The answers suggest using a multi-threaded LAPACK implementation and describe some hacks to get IDL to use a specific LAPACK library. I haven't been able to get this to work.
I would ideally like the existing LAPACK DLM to simply be able to use a multi-threaded LAPACK library, and it feels like this should be possible, but I have not had any success. Alternatively, I guess the next simplest step would be to create a new DLM to wrap a matrix inversion call in some C code and ensure this DLM points to the desired implementation. The documentation for creating DLMs is making me cross-eyed, though, so any pointers to doing this (if it is required) would also be appreciated.
What platform are you targeting?
Looking at idl_lapack.so with nm on my platform (Mac OS X, IDL 8.2.1) seems to indicate that the LAPACK routines are directly in the .so, so my (albeit limited) understanding is that it would not be simple to swap them out (i.e., by setting LD_LIBRARY_PATH).
$ nm idl_lapack.so
...
000000000023d5bb t _dgemm_
000000000023dfcb t _dgemv_
000000000009d9be t _dgeqp3_
000000000009e204 t _dgeqr2_
000000000009e41d t _dgeqrf_
000000000023e714 t _dger_
000000000009e9ad t _dgerfs_
000000000009f4ba t _dgerq2_
000000000009f6e1 t _dgerqf_
Some other possibilities...
My personal library has a directory src/dist_tools/bindings containing routines for automatically creating bindings for a library given "simple" (i.e., not using typedefs) function prototypes. LAPACK would be fairly easy to create bindings for (the hardest part would probably be building the LAPACK implementation you want to use: ATLAS, PLAPACK, ScaLAPACK, etc.). The library is free to use; a small consulting contract could be arranged if you would like it done for you.
The next version of GPULib will contain a GPU implementation of LAPACK, using the MAGMA library. This is effectively a highly parallel option, but only works on CUDA graphics cards. It would also work best if other operations besides the matrix inversion could be done on the GPU to minimize memory transfer. This option costs money.

MPI and OpenMP. Do I even have a choice?

I have a linear algebra code that I am trying to get to run faster. It is an iterative algorithm with a loop and matrix-vector multiplications within it.
So far, I have used MATMUL (the Fortran intrinsic), DGEMV, and tried writing my own MV code in OpenMP, but the algorithm is doing no better in terms of scalability. Speed-ups are barely 3.5 - 4, irrespective of how many processors I allot to it (I have tried up to 64 processors).
The profiling shows significant time being spent in Matrix-Vector and the rest is fairly nominal.
My question is:
I have a shared-memory system with tons of RAM and processors. I have tried tweaking the OpenMP implementation of the code (including the matrix-vector part) but it has not helped. Will it help to code in MPI? I am not a pro at MPI, but the ability to fine-tune the message communication might help a bit, though I can't be sure. Any comments?
More generally, from the literature I have read, MPI = distributed, OpenMP = shared, but can they perform well in the other's territory? Like MPI on shared memory? Will it work? Will it be better than the OpenMP implementation if done well?
You're best off just using a linear algebra package that is already well optimized for a multicore environment and using that for your matrix-vector multiplication. The ATLAS package, GotoBLAS (if you have a Nehalem or older; sadly it's no longer being updated), or vendor BLAS implementations (like MKL for Intel CPUs, ACML for AMD, or vecLib for Apple, which all cost money) all have good, well-tuned, multithreaded BLAS implementations. Unless you have excellent reason to believe that you can do better than those full-time development teams can, you're best off using them.
Note that you'll never get the parallel speedup with DGEMV that you do with DGEMM, simply because the vector is smaller than a second matrix and so there's less work to do; but you can still do quite well, and you'll find that you get much better performance with these libraries than with anything hand-rolled, unless you were already doing multi-level cache blocking.
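As a small illustration of "just call the library": with any BLAS that ships the CBLAS interface (ATLAS, OpenBLAS, MKL, ...), the whole matrix-vector product is one call, and the tuned library decides internally how to block, thread and vectorize it. The wrapper below is only a sketch under that assumption:

#include <cblas.h>

/* y = A * x, with A stored row-major as an m-by-n array. */
void matvec(int m, int n, const double *A, const double *x, double *y)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
                1.0, A, n, x, 1, 0.0, y, 1);
}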
You can use MPI in a shared-memory environment (though not OpenMP in a distributed one). However, achieving a good speedup depends much more on your algorithms and data dependencies than on the technology used. Since you have a lot of shared memory, I'd recommend you stick with OpenMP and carefully examine whether you're making the best use of your resources.

R 2.14 byte compilation - how much performance increase

I saw that the latest R version supports byte compilation. What performance gain can I expect? And are specific tasks more positively impacted than others?
It is a good question:
The byte compiler appeared with 2.13.0 in April, and some of us ran tests then (with thanks to gsk3 for the links).
I tend to look more at what Rcpp can do for us, but I now always include 'uncompiled R' and 'byte compiled R' as baselines.
I fairly often see gains by a factor of two (e.g. the varSims example) but also essentially no gains at all (the fibonacci example, which is of course somewhat extreme).
We could do with more data. When we had the precursor byte compiler in the Ra engine by Stephen Milborrow (which I had packaged for Debian too), it was often said that both algebraic expressions and loops benefited. I don't know of a clear rule for the byte compiler by Luke Tierney now in R --- but it generally never seems to hurt, so it may be worthwhile to get into the habit of turning it on.
Between a one- and five-fold increase:
http://radfordneal.wordpress.com/2011/05/13/speed-tests-for-r-%E2%80%94-and-a-look-at-the-compiler/
http://dirk.eddelbuettel.com/blog/2011/04/12/

Resources