Changing the LAPACK implementation used by IDL linear algebra routines? - linear-algebra

Over at http://scicomp.stackexchange.com I asked this question about parallel matrix algorithms in IDL. The answers recommend using a multi-threaded LAPACK implementation and offer some hacks to get IDL to use a specific LAPACK library. I haven't been able to get this to work.
Ideally I would like the existing LAPACK DLM simply to use a multi-threaded LAPACK library; it feels like this should be possible, but I have not had any success. Failing that, I guess the next simplest step would be to create a new DLM that wraps a matrix inversion call in some C code and make sure this DLM links against the desired implementation. The documentation for creating DLMs is making me cross-eyed, though, so any pointers on doing this (if it is required) would also be appreciated.
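For concreteness, here is what I imagine the C side of such a wrapper would boil down to — a minimal, hypothetical sketch (not the IDL DLM boilerplate itself, just the matrix-inversion call the DLM would wrap), assuming the standard Fortran-style dgetrf_ / dgetri_ entry points and that the object file is linked against the multi-threaded LAPACK of your choice (OpenBLAS, MKL, ...):

// Invert an n x n double matrix in place via LAPACK's LU factorization.
// An IDL DLM wrapper would marshal the IDL array into `a` and call this.
#include <vector>
#include <stdexcept>

extern "C" {
// Standard Fortran-interface prototypes (column-major storage, 1-based pivots).
void dgetrf_(const int* m, const int* n, double* a, const int* lda,
             int* ipiv, int* info);
void dgetri_(const int* n, double* a, const int* lda, const int* ipiv,
             double* work, const int* lwork, int* info);
}

void invert_inplace(double* a, int n) {
    std::vector<int> ipiv(n);
    int info = 0;
    dgetrf_(&n, &n, a, &n, ipiv.data(), &info);               // LU factorization
    if (info != 0) throw std::runtime_error("dgetrf_ failed");

    int lwork = 64 * n;                                        // generous workspace
    std::vector<double> work(lwork);
    dgetri_(&n, a, &n, ipiv.data(), work.data(), &lwork, &info);  // inverse from LU
    if (info != 0) throw std::runtime_error("dgetri_ failed");
}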

What platform are you targeting?
Looking at idl_lapack.so with nm on my platform (Mac OS X, IDL 8.2.1), the LAPACK routines appear to be compiled directly into the .so, so my (albeit limited) understanding is that it would not be simple to swap them out (e.g., by setting LD_LIBRARY_PATH).
$ nm idl_lapack.so
...
000000000023d5bb t _dgemm_
000000000023dfcb t _dgemv_
000000000009d9be t _dgeqp3_
000000000009e204 t _dgeqr2_
000000000009e41d t _dgeqrf_
000000000023e714 t _dger_
000000000009e9ad t _dgerfs_
000000000009f4ba t _dgerq2_
000000000009f6e1 t _dgerqf_
Some other possibilities...
My personal library has a directory src/dist_tools/bindings containing routines for automatically creating bindings for a library given "simple" (i.e., not using typedefs) function prototypes. LAPACK would be fairly easy to create bindings for (the hardest part would probably be building the package you want to use: ATLAS, PLAPACK, ScaLAPACK, etc.). The library is free to use; a small consulting contract could be arranged if you would like the work done for you.
The next version of GPULib will contain a GPU implementation of LAPACK, using the MAGMA library. This is effectively a highly parallel option, but only works on CUDA graphics cards. It would also work best if other operations besides the matrix inversion could be done on the GPU to minimize memory transfer. This option costs money.

Related

Does R leverage SIMD when doing vectorized calculations?

Given a dataframe like this in R:
+---+---+
| X | Y |
+---+---+
| 1 | 2 |
| 2 | 4 |
| 4 | 5 |
+---+---+
If a vectorized operation is performed on this dataframe, like so:
data$Z <- data$X * data$Y
Will this leverage the processor's single-instruction-multiple-data (SIMD) capabilities to optimize the performance? This seems like the perfect case for it, but I cannot find anything that confirms my hunch.
I just got a "good answer" badge two years after my initial answer. Thanks for acknowledging the quality of this answer. In return, I will substantially enrich the original content. This answer / article is now intended for any R user who wants to try out SIMD. It has the following sections:
Background: what is SIMD?
Summary: how can we leverage SIMD for our assembly or compiled code?
R side: how can R possibly use SIMD?
Performance: is vector code always faster than scalar code?
Writing R extensions: write compiled code with OpenMP SIMD
Background: what is SIMD?
Many R programmers may not know about SIMD if they don't write assembly or compiled code.
SIMD (single instruction, multiple data) is a data-level parallel processing technology with a very long history. Before personal computers were around, SIMD unambiguously referred to the vector processing of a vector processor, and it was the main route to high-performance computing. When personal computers later came onto the market, they had no features resembling those of a vector processor. However, as the demand for processing multimedia data grew, they began to gain vector registers and corresponding vector instruction sets for vector load, vector arithmetic and vector store. The capacity of vector registers keeps growing, and the vector instruction sets are increasingly versatile. Today they can do streaming load / store, strided load / store, scattered load / store, vector element shuffling, vector arithmetic (including fused arithmetic like fused multiply-add), vector logical operations, masking, etc. So they look more and more like the miniature vector processors of the old days.
Although SIMD has been with personal computers for nearly two decades, many programmers are unaware of it. Instead, most are familiar with thread-level parallelism like multi-core computing (which can be referred to as MIMD). So if you are new to SIMD, I highly recommend watching the YouTube video Utilizing the other 80% of your system's performance: Starting with Vectorization by Ulrich Drepper from Red Hat.
Since vector instruction sets are extensions to the base architecture's instruction set, you have to invest extra effort to use them. If you write assembly code, you can call these instructions straight away. If you write compiled code like C, C++ or Fortran, you have to write inline assembly or use vector intrinsics. A vector intrinsic looks like a function, but it is in fact an inline-assembly mapping to a vector instruction. These intrinsics (or "functions") are not part of a compiled language's standard library; they are provided by the architecture / machine. To use them, we need to include the appropriate header files and pass compiler-specific flags.
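To make that concrete, here is a minimal sketch of what hand-written intrinsics look like in C++ on x86-64 (these are the SSE2 double-precision intrinsics from emmintrin.h; the function name is made up, and you would compile with something like g++ -O2 -msse2):

// Element-wise product of two double arrays using 128-bit SSE2 registers.
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>

void hadamard_sse2(const double* x, const double* y, double* z, std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {                   // two doubles per register
        __m128d vx = _mm_loadu_pd(x + i);          // vector load (unaligned)
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(z + i, _mm_mul_pd(vx, vy));  // vector multiply and store
    }
    for (; i < n; ++i) z[i] = x[i] * y[i];         // scalar tail
}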
Let's first define the following for ease of later discussions:
Writing assembly or compiled code that does not use vector instruction sets is called "writing scalar code";
Writing assembly or compiled code that uses vector instruction sets is called "writing vector code".
So these two paths are "write A, get A" and "write B, get B". However, compilers are getting stronger and there is another "write A, get B" path:
They have the power to translate scalar compiled code into vector assembly code, a compiler optimization called "auto-vectorization".
Some compilers like GCC consider auto-vectorization part of the highest optimization level and enable it with the flag -O3, while other, more aggressive compilers like ICC (the Intel C++ compiler) and clang enable it at -O2. Auto-vectorization can also be controlled directly by specific flags; for example, GCC has -ftree-vectorize. When exploiting auto-vectorization, it is advisable to further hint the compiler to tailor the vector assembly code to your machine. For example, with GCC we may pass -march=native and with ICC we use -xHost. This makes sense, because even within the x86-64 architecture family, later microarchitectures come with more vector instruction sets. For example, Sandy Bridge supports vector instruction sets up to AVX, Haswell additionally supports AVX2 and FMA3, and Skylake additionally supports AVX-512. Without -march=native, GCC only generates vector instructions from instruction sets up to SSE2, the much smaller subset common to all x86-64 machines.
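As an illustration (the loop and the exact flag sets are just an example; check your compiler's documentation), a plain scalar loop like the one below needs no source changes at all to be auto-vectorized:

// g++ -O2 -ftree-vectorize -march=native   (or simply -O3 -march=native;
//                                           add -fopt-info-vec to see which loops were vectorized)
// icpc -O2 -xHost
void daxpy(double a, const double* x, double* y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}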
Summary: how can we leverage SIMD for our assembly or compiled code?
There are five ways to implement SIMD:
Writing machine-specific vector assembly code directly. For example, on x86-64 we use SSE / AVX instruction sets, and on ARM architectures we use NEON instruction sets.
Pros: Code can be hand-tuned for better performance;
Cons: We have to write different versions of assembly code for different machines.
Writing vector compiled code by using machine-specific vector intrinsics and compiling it with compiler-specific flags. For example, on x86-64 we use SSE / AVX intrinsics, and for GCC we set -msse2, -mavx, etc. (or simply -march=native for auto-detection). A variant of this option is to write compiler-specific inline assembly; for example, an introduction to GCC's inline assembly can be found here.
Pros: Writing compiled code is easier than writing assembly code, and the code is more readable hence easier to maintain;
Cons: We have to write different versions of code for different machines, and adapt Makefile for different compilers.
Writing vector compiled code by using compiler-specific vector extensions. Some compilers have defined their own vector data type. For example, GCC's vector extensions can be found here;
Pros: We don't need to worry about the difference across architectures, as compiler can generate machine-specific assembly;
Cons: We have to write different versions of code for different compilers, and adapt Makefile likewise.
Writing scalar compiled code and using compiler-specific flags for auto-vectorization. Optionally we can insert compiler-specific pragmas alongside our compiled code to give the compiler more hints on, say, data alignment, loop unrolling depth, etc.
Pros: Writing scalar compiled code is easier than writing vector compiled code, and it is more readable to a broad audience.
Cons: We have to adapt Makefile for different compilers, and in case we have used pragmas, they also need be versioned.
Writing scalar compiled code and inserting the OpenMP pragma #pragma omp simd (requires OpenMP 4.0+); see the minimal sketch after this list.
Pros: same as option 4; additionally, we can use a single version of the pragmas, as most mainstream compilers support the OpenMP standard;
Cons: We have to adapt Makefile for different compilers as they may have different flags to enable OpenMP and machine-specific tuning.
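As promised in option 5, here is a minimal sketch of the OpenMP SIMD route, using the same loop as before (the flags are examples; GCC accepts -fopenmp or -fopenmp-simd, and recent ICC uses -qopenmp):

// The pragma asks any OpenMP-4.0-aware compiler to vectorize the loop.
void daxpy_simd(double a, const double* x, double* y, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}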
From top to bottom, the programmer progressively does less and the compiler does increasingly more. Implementing SIMD is interesting, but this article unfortunately does not have room for a decent coverage with examples. I will point to the most informative references I have found.
For options 1 and 2 on x86-64, the SSE / AVX intrinsics guide is definitely the best reference card, but it is not the right place to start learning these instructions. Where to start is individual-specific. I picked up intrinsics / assembly from BLISLab when I tried to write my own high-performance DGEMM (introduced later). After digesting the example code there I started practising, and posted a few questions on Stack Overflow or Code Review when I got stuck.
For option 4, a good explanation is given by A guide to auto-vectorization with Intel C++ compilers. Although the manual is for ICC, the principles of how auto-vectorization works apply to GCC as well. The official page on GCC's auto-vectorization is rather outdated, and this presentation slide deck is more useful: GCC auto-vectorization.
For option 5, there is a very good technical report by Oak Ridge National Laboratory: Effective Vectorization with OpenMP 4.5.
In terms of portability,
Options 1 to 3 are not easily portable, because the vector code depends on the machine and / or the compiler;
Option 4 is much better, as we get rid of machine dependency, but we still have a problem with compiler dependency;
Option 5 is very close to portable, as adapting a Makefile is much easier than adapting code.
In terms of performance, it is conventionally believed that option 1 is the best, and that performance degrades as we move down the list. However, compilers are getting better and newer machines have hardware improvements (for example, the performance penalty for unaligned vector loads is smaller), so the outlook for auto-vectorization is very positive. In my own DGEMM case study, I found that on an Intel Xeon E5-2650 v2 workstation with a peak performance of 18 GFLOPs per CPU, GCC's auto-vectorization attained 14 ~ 15 GFLOPs, which is rather impressive.
R side: how can R possibly use SIMD?
R can only use SIMD by calling compiled code that exploits SIMD. Compiled code in R has three sources:
R packages with "base" priority, like base, stats, utils, etc, that come with R's source code;
Other packages on CRAN that require compilation;
External scientific libraries like BLAS and LAPACK.
Since R itself is portable across architectures, platforms and operating systems, and CRAN policy expects an R package to be equally portable, compiled code in sources 1 and 2 cannot be written in assembly or compiler-dependent compiled code, ruling out options 1 to 3 for SIMD implementation. Auto-vectorization is the only chance left for R to leverage SIMD.
If you have built R with your compiler's auto-vectorization enabled, compiled code from source 1 can exploit SIMD. In fact, although R is written in a portable way, you can tune it for your machine when building it. For example, I would do icc -xHost -O2 with ICC, or gcc -march=native -O2 -ftree-vectorize -ffast-math with GCC. Flags are set at R's build time and recorded in RHOME/etc/Makeconf (on Linux). Usually people just do a quick build, so the flag configuration is decided automatically. The result can differ depending on your machine and your default compiler. On a Linux machine with GCC, the optimization flag is often set to -O2, hence auto-vectorization is off; on a Mac OS X machine with clang, auto-vectorization is on at -O2. So I suggest you check your Makeconf.
Flags in Makeconf are used when you run R CMD SHLIB, which is invoked by R CMD INSTALL or install.packages when installing CRAN packages that need compilation. By default, if Makeconf says that auto-vectorization is off, compiled code from source 2 cannot leverage SIMD. However, it is possible to override the Makeconf flags by providing a user Makevars file (like ~/.R/Makevars on Linux), so that R CMD SHLIB picks up these flags and auto-vectorizes compiled code from source 2.
BLAS and LAPACK are not part of the R project or the CRAN mirror. R simply takes them as they are, and does not even check whether the library is a valid one! For example, on Linux, if you alias your BLAS library to an arbitrary library foo.so, R will "stupidly" load foo.so on startup and cause you trouble! This loose relationship between R and BLAS makes it easy to link different versions of BLAS to R, so benchmarking different BLAS libraries in R is straightforward (of course you have to restart R after you update the linkage). For Linux users with root privilege, switching between BLAS libraries is best done with sudo update-alternatives --config. If you don't have root privilege, this thread on Stack Overflow will help: Without root access, run R with tuned BLAS when it is linked with reference BLAS.
In case you don't know what BLAS is, here is a brief introduction. BLAS originally referred to a coding standard for vector-vector, matrix-vector and matrix-matrix operations in scientific computing. For example, it was specified that a general matrix-matrix multiplication should compute C <- beta * C + alpha * op(A) %*% op(B), an operation known as DGEMM. Note that this is more than just C <- A %*% B, and the motivation of this design was to maximize code reuse: for example, C <- C + A %*% B, C <- 2 * C + t(A) %*% B, etc. can all be computed with DGEMM. A model implementation in FORTRAN 77 is provided along with the standard as a reference, and this model library is commonly known as the reference BLAS library. The reference library is fixed; it is there to motivate people to tune its performance for specific machines. BLAS optimization is actually a very difficult job, and at the end of the optimization everything has changed except the user interface: everything inside a BLAS function is different, except that you still call it in the same way. The various optimized BLAS libraries are known as tuned BLAS libraries; examples include ATLAS, OpenBLAS and Intel MKL. All tuned BLAS libraries exploit SIMD as part of their optimization. A tuned BLAS library is remarkably faster than the reference one, and the gap keeps widening on newer machines.
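To make the interface concrete, this is roughly what the C <- 2 * C + t(A) %*% B example above looks like when called through the standard Fortran DGEMM entry point from C++ (the wrapper function is made up; the point is that the call is identical whether you link the reference BLAS, ATLAS, OpenBLAS or MKL):

// C (m x n) <- 2*C + t(A) %*% B, with A stored k x m and B stored k x n,
// all column-major as BLAS expects.
extern "C" void dgemm_(const char* transa, const char* transb,
                       const int* m, const int* n, const int* k,
                       const double* alpha, const double* a, const int* lda,
                       const double* b, const int* ldb,
                       const double* beta, double* c, const int* ldc);

void update(const double* A, const double* B, double* C, int m, int n, int k) {
    const double alpha = 1.0, beta = 2.0;
    dgemm_("T", "N", &m, &n, &k, &alpha, A, &k, B, &k, &beta, C, &m);
}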
R relies on BLAS. For example, the matrix-matrix multiply operator "%*%" in R calls DGEMM, and the functions crossprod and tcrossprod also map to DGEMM. BLAS lies at the centre of scientific computing; without BLAS, R would largely be broken, which is why linking an optimized BLAS library to R is so strongly advocated. It used to be difficult to check which BLAS library R is linked to (as this can be obscured by an alias), but since R 3.4.0 this is no longer the case: sessionInfo() shows the full paths to the library or executable files providing the BLAS / LAPACK implementations currently in use (not available on Windows).
LAPACK is a more advanced scientific library built on top of BLAS. R relies on LAPACK for various matrix factorizations. For example, qr(, pivot = TRUE), chol, svd and eigen in R map to LAPACK's QR factorization, Cholesky factorization, singular value decomposition and eigendecomposition. Note that all tuned BLAS libraries include a clone of LAPACK, so if R is linked to a tuned BLAS library, sessionInfo() shows both libraries coming from the same path; if R is linked to the reference BLAS library, sessionInfo() shows two different paths for BLAS and LAPACK. There have been plenty of questions tagged r regarding the drastic performance differences of matrix multiplication across platforms, like Large performance differences between OS for matrix computation. In fact, if you just look at the output of sessionInfo(), you get an immediate clue that R is linked to a tuned BLAS on the first platform and to the reference BLAS on the second.
Performance: is vector code always faster than scalar code?
Vector code looks fast, but it may not be faster than scalar code in practice. Here is a case study: Why is this SIMD multiplication not faster than non-SIMD multiplication? And by coincidence, the vector operation examined there is exactly the one the OP here used as an example: the Hadamard product. People often forget that the processing speed of the CPU is not the deciding factor for practical performance. If data cannot be moved from memory to the CPU as fast as the CPU requests it, the CPU just sits there waiting most of the time. The Hadamard product falls into exactly this situation: every multiplication requires three memory accesses (two loads and a store), so the Hadamard product is a memory-bound operation. The processing power of a CPU can only be realized when substantially more arithmetic is done than data movement. The classic matrix-matrix multiplication in BLAS belongs to this case, and this explains why the SIMD implementation in a tuned BLAS library is so rewarding.
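In code, the contrast is easy to see (an illustrative scalar sketch, not a benchmark): the Hadamard loop below performs one multiplication per two loads and one store, roughly 24 bytes of memory traffic per flop, whereas an n x n DGEMM performs on the order of 2n flops for every element of C it writes.

void hadamard(const double* x, const double* y, double* z, int n) {
    for (int i = 0; i < n; ++i)
        z[i] = x[i] * y[i];   // 2 loads + 1 store per single multiply
}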
In light of this, I don't think you need to worry that much if you did not build your R software with compiler auto-vectorization turned on. It is hard to know whether R will really be faster.
Writing R extensions: write compiled code with OpenMP SIMD
If you decide to contribute to CRAN by writing your own R package, you can consider SIMD option 5, OpenMP SIMD, if some section of your compiled code can benefit from it. The reason for not choosing option 4 is that when you write a distributable package, you have no idea what compiler an end user will have, so there is no way to write compiler-specific code and get it published on CRAN.
As we pointed out earlier in the list of SIMD options, using OpenMP SIMD requires adapting the Makefile. In fact, R makes this very easy for you: you never need to write a Makefile alongside an R package. All you need is a Makevars file. When your package is compiled, the compiler flags specified in your package Makevars and in the RHOME/etc/Makeconf on the user's machine are passed to R CMD SHLIB. Although you don't know what compiler the user might be using, RHOME/etc/Makeconf does! All you need to do is specify in your package Makevars that you want OpenMP support.
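For example, the standard way (documented in Writing R Extensions) is a Makevars containing nothing more than the OpenMP macros, which R CMD SHLIB expands using the compiler recorded in the user's Makeconf (use SHLIB_OPENMP_CFLAGS instead if your sources are C rather than C++):

PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)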
The only thing you can't do in your package Makevars is give hints for machine-specific tuning. Instead, you may advise your package users as follows:
If the RHOME/etc/Makeconf on the user's machine already contains such a tuning configuration (that is, the user configured these flags when building R), your compiled code will be transformed into tuned vector assembly code and there is nothing further to do;
Otherwise, you need to advise users to edit their personal Makevars file (like ~/.R/Makevars on Linux). You should provide a table (perhaps in your package vignette or documentation) of the tuning flags to set for each compiler, say -xHost for ICC and -march=native for GCC.
Well, there is a little-known R distribution from Microsoft (the artist formerly known as Revolution R), which can be found here.
It comes with the Intel MKL library, which uses multiple threads as well as vector operations (you have to be running an Intel CPU, though), and it really helps with matrix work, things like SVD, etc.
Unless you're willing to write C/C++ code with SIMD intrinsics using Rcpp or similar interfaces, Microsoft R is your best bet for utilizing SIMD.
Download and try it

When to use Eigen and when to use BLAS

I did some basic reading on Eigen and BLAS. Both libraries support matrix-matrix and matrix-vector multiplication. I don't understand which one I should use in which case. To me it seems both have almost the same performance. It would be nice if someone could give me a resource, or just tell me what advantages one library has over the other, or how the two differ for matrix and vector manipulation. Thanks in advance.
Use Eigen: it's more complete and much easier to use. Then, if you wonder whether another fully optimized BLAS implementation could give you higher performance, just recompile your code with -DEIGEN_USE_BLAS, link to your favorite BLAS, and see for yourself.
Also, when using Eigen, don't forget to enable compiler optimizations, e.g. -O3, and the instruction sets your hardware supports, e.g. -mavx -mfma when using a recent Eigen.
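For example, a minimal build might look like this (the file name and the OpenBLAS link line are assumptions; Eigen itself is header-only, so you only need its headers on the include path):

// g++ -O3 -march=native prod.cpp                               (Eigen's own kernels)
// g++ -O3 -march=native -DEIGEN_USE_BLAS prod.cpp -lopenblas   (BLAS backend)
#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(512, 512);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(512, 512);
    Eigen::MatrixXd C = A * B;        // dispatched to a GEMM kernel either way
    std::cout << C(0, 0) << "\n";
    return 0;
}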
So the answer to this question is here.
http://eigen.tuxfamily.org/index.php?title=FAQ#How_does_Eigen_compare_to_BLAS.2FLAPACK.3F
More or less, I use Eigen mostly, because it has a comfortable interface. If you need speed and multicore parallelism, or you have only a little but time-consuming linear algebra in your code, go for GotoBLAS2. It is usually the fastest on Intel machines.

How to calculate in R with variables

I'm an R newbie.
Is there a way I can calculate
(x+x^2+x^3)^2
in R?
so I will get the result:
x^6+2 x^5+3 x^4+2 x^3+x^2
I get an Error: object 'x' not found.
Thanks!
R isn't well suited for this. Some interface packages to languages and libraries that are better at this do exist, such as rSymPy, which allows you to access the SymPy Python library for symbolic mathematics (you'll need to install both). In a similar vein, Ryacas links to the yacas algebra system.
Those interfaces are useful if you need symbolic manipulation as part of an R workflow. Otherwise, consider using the original tools. The ones above are open source and freely available; other alternatives that are free to use (though proprietary) also exist, such as the web-based Wolfram Alpha (for limited use).

Converting models in Matlab/R to C++/Java

I would like to convert an ARIMA model developed in R using the forecast library to Java code. Note that I need to implement only the forecasting part. The fitting can be done in R itself. I am going to look at the predict function and translate it to Java code. I was just wondering if anyone else had been in a similar situation before and managed to successfully use a Java library for the same.
Along similar lines, and perhaps this is a more general question without a concrete answer: what is the best way to deal with situations where model building can be done in Matlab/R but the prediction/forecasting needs to be done in Java/C++? I have been encountering this situation more and more often. I guess you have to bite the bullet and write the code yourself, and this is generally not as hard as writing the fitting/estimation yourself. Any advice on the topic would be helpful.
You write about 'R or Matlab' to 'C++ or Java'. This gives 2 x 2 choices which is too many degrees of freedom for my taste. So allow me to concentrate on C++ as the target.
Let's consider a simpler case: Prototyping in R, and deploying in C++. If and when the R package you use is actually implemented in C or C++, this becomes pretty easy. You "merely" need to disentangle the routine you are after from its other dependencies (header files, defines, data structures, ...) and provide it with the data and parameters needed. I have done that in the past for production systems.
Here, you talk about the forecast package. This happens to depend on the RcppArmadillo package which itself brings the nice Armadillo C++ library to R. So chances are you can in fact re-write this as a self-contained unit.
Armadillo is also interesting when you want to port Matlab to C++, as it was written with exactly that task in mind. I have ported some relatively extensive Matlab code to C++ and reaped a substantial speed gain.
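To give a flavour of why such a port is largely mechanical, here is a tiny illustrative sketch (not from the forecast package; compile with something like g++ -O2 example.cpp -larmadillo):

// Armadillo deliberately mirrors Matlab:  X = A \ b;   C = A * A'
#include <armadillo>

int main() {
    arma::mat A = arma::randu<arma::mat>(100, 100);
    arma::vec b = arma::randu<arma::vec>(100);

    arma::vec x = arma::solve(A, b);   // Matlab's A \ b
    arma::mat C = A * A.t();           // Matlab's A * A'
    return 0;
}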
I'm not sure whether this is possible in R, but in Matlab you can interact with your Matlab code from Java - see http://www.cs.virginia.edu/~whitehouse/matlab/JavaMatlab.html. This would enable you to leave all the forecasting code in Matlab and have e.g. an interface written in Java.
Alternatively, you might want to have the predictive code written in Java so that you can produce a model and then distribute a program that uses the model without a dependency on Matlab. The Matlab compiler may be useful here, but I've never used it.
A final, simple way of interacting messily between Matlab and Java would be (on Linux) to use pseudoterminals, where you would have a pty/tty pair to interface Java and Matlab. In this case you would send data from Java to Matlab, and have Matlab return the forecasting results. I expect this would also work in R, but I don't know the syntax.
In general though, reimplementing the code is a decent solution and probably quicker than learning how to interface Java and Matlab or create Matlab libraries.
Some further information on the answer given by Richante: Matlab has some really nice capabilities for interop with compiled languages such as C/C++, C#, and Java. In your particular case you might find the MATLAB Builder JA toolbox particularly relevant. It allows you to export your Matlab code directly to Java, meaning you can call code constructed during your Matlab model-building phase directly from Java.
More information from the Mathworks here.
I am also concerned with converting "R to Java" so will speak to that part.
As Vincent Zoonekynd said in his comment, the PMML library in R makes sense for model export in general, but "forecast" is not a supported library as of yet.
An alternative is to use something like https://www.opencpu.org/ to call R from your Java program. It exposes the R code via an HTTP server. You can then just call it with parameters, as with a normal HTTP call, and return what is needed using java.net.HttpURLConnection or one of the many HTTP libraries available in Java.
Pros: Separation of concerns, no need to re-write the R code
Cons: You are invoking an R server in your live process, so you need to make sure that is handled robustly

General sparse iterative solver libraries

What are some of the better libraries for large sparse iterative (conjugate gradient, MINRES, GMRES, etc.) linear algebra system solving? I've often coded my own routines, but I'm interested to know which "off-the-shelf" packages people prefer. I've heard of PETSc, TAUCS, IML++, and a few others. I'm wondering how these stack up, and what else is out there. My preference is for ease of use, and freely available software.
Victor Eijkhout's Overview of Iterative Linear System Solver Packages would probably be a good place to start.
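To give a feel for what an "off-the-shelf" iterative solve looks like in code, here is a small illustrative sketch using Eigen's built-in conjugate gradient on a toy 1-D Laplacian (Eigen came up earlier in this collection and is header-only; PETSc and Trilinos expose the same idea behind heavier-weight APIs, so treat this purely as an illustration):

#include <Eigen/Sparse>
#include <vector>
#include <iostream>

int main() {
    const int n = 1000;
    std::vector<Eigen::Triplet<double>> t;   // tridiagonal SPD test matrix
    for (int i = 0; i < n; ++i) {
        t.emplace_back(i, i, 2.0);
        if (i > 0)     t.emplace_back(i, i - 1, -1.0);
        if (i + 1 < n) t.emplace_back(i, i + 1, -1.0);
    }
    Eigen::SparseMatrix<double> A(n, n);
    A.setFromTriplets(t.begin(), t.end());
    Eigen::VectorXd b = Eigen::VectorXd::Ones(n);

    Eigen::ConjugateGradient<Eigen::SparseMatrix<double>> cg;
    cg.compute(A);
    Eigen::VectorXd x = cg.solve(b);
    std::cout << "iterations: " << cg.iterations()
              << ", estimated error: " << cg.error() << "\n";
    return 0;
}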
You may also wish to look at Trilinos: http://trilinos.sandia.gov/
It is designed by some great software craftsmen, using modern design techniques. Moreover, from within Trilinos, you can call PETSc if you desire.
NIST has some sparse linear algebra software you can download here: http://math.nist.gov/sparselib++/ and here: http://math.nist.gov/spblas/
I haven't used those packages myself, but I've heard good things about them.
http://www.cise.ufl.edu/research/sparse/umfpack/
UMFPACK is a set of routines for solving unsymmetric sparse linear systems, Ax=b, using the Unsymmetric MultiFrontal method. Written in ANSI/ISO C, with a MATLAB (Version 6.0 and later) interface. Appears as a built-in routine (for lu, backslash, and forward slash) in MATLAB. Includes a MATLAB interface, a C-callable interface, and a Fortran-callable interface. Note that "UMFPACK" is pronounced in two syllables, "Umph Pack". It is not "You Em Ef Pack".
I'm using it for FEM code.
I would check out Microsoft's Solver Foundation. It's free to cheap even for pretty big problems. The unlimited version is industrial strength; it is based on Gurobi and, of course, isn't cheap.
http://code.msdn.microsoft.com/solverfoundation
