Linking a program with both OpenBLAS and Intel MKL - dynamic linking

Since Intel MKL's BLAS library and OpenBLAS expose the same interface, I find that the order in which I link my program against these libraries determines whether the MKL or the OpenBLAS implementation of a function gets called.
What I actually want is to selectively call either MKL's matrix multiplication or OpenBLAS's matrix multiplication.
How can I do that?
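One common trick, sketched below in R (the language used throughout this thread; the library paths are placeholders, and in C the same idea is dlopen with RTLD_LOCAL plus dlsym), is to load each BLAS privately so that neither library's symbols shadow the other's, and then resolve the routine you want explicitly:
# Sketch: load each BLAS with a local symbol scope (RTLD_LOCAL) and resolve
# dgemm from each library explicitly; the paths below are placeholders.
mkl      <- dyn.load("/opt/intel/mkl/lib/intel64/libmkl_rt.so", local = TRUE)
openblas <- dyn.load("/usr/lib/x86_64-linux-gnu/libopenblas.so", local = TRUE)
dgemm_mkl      <- getNativeSymbolInfo("dgemm_", mkl)
dgemm_openblas <- getNativeSymbolInfo("dgemm_", openblas)
# Each symbol can then be invoked through a thin C wrapper (dgemm takes
# Fortran character arguments, so calling it via .Fortran is not portable).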

Related

Using nvBLAS in R on Windows?

I am having trouble getting nvBLAS to work in R. I'm using RStudio on a Windows 10 machine, and I have no idea how to link nvBLAS and the original Rblas together for R to start up with both. From the nvBLAS documentation:
To use the NVBLAS Library, the user application must be relinked against NVBLAS in addition to the original CPU BLAS (technically only NVBLAS is needed unless some BLAS routines not supported by NVBLAS are used by the application). To be sure that the linker links against the exposed symbols of NVBLAS and not the ones from the CPU BLAS, the NVBLAS Library needs to be put before the CPU BLAS on the linkage command line.
How exactly do I do this on Windows? Caveat: I am a pretty solid R user, but I am by no means an R expert or a computer scientist. I would also ideally like to avoid using an Ubuntu build for this.

GNU+Intel openmp dynamic loading

I have been trying to create an R library that dynamically loads a dependency built with Intel OpenMP. When the library is loaded, there is a clash between the OpenMP runtimes.
Using KMP_DUPLICATE_LIB_OK=TRUE gets me past the loading error, but the program crashes as soon as it enters a parallel section.
Unfortunately, neither compiling R with Intel OpenMP nor compiling the dependency with GNU OpenMP is an option (I want it to work with the standard R distribution, and some external dependencies that are linked statically have to use Intel OpenMP).
However, I can recompile the dependency with compatibility flags or change how linking is done (although in the end it has to be loaded dynamically from the R library). Setting environment variables is also an option (I am thinking of https://software.intel.com/en-us/node/522775, but none of those options has helped so far).
The R library is written in C, and I doubt that the fact that it is R that ultimately loads it really matters.
Any idea how to handle this?

Configure R --enable-BLAS-shlib with MKL [duplicate]

I found that using one of BLAS/ATLAS/MKL/OpenBLAS gives a speed improvement in R. However, will it also improve R packages that are written in C or C++?
For example, the R package glmnet is implemented in FORTRAN and the R package rpart in C++. Will simply installing BLAS/ATLAS/MKL/OpenBLAS improve their execution time, or do we have to rebuild (write new C code for) each package against the chosen library?
It is frequently stated, including in a comment here, that "you have to recompile R" to use a different BLAS or LAPACK library. That is wrong.
You do not have to recompile R, provided it is built against the shared-library versions of BLAS and LAPACK.
I have a package and vignette on CRAN which use this fact to provide a benchmarking framework in which different BLAS and LAPACK versions are timed against each other simply by installing different ones (one command on Debian/Ubuntu) and running the benchmarks -- this is so straightforward that it can be automated in a package such as this.
The results in that package will give you an idea of the possible speed differences. Exactly how they pan out depends on your computer, your data (size), your problem, etc. But if, say, your problem uses LAPACK functions that benefit from running multithreaded, then installing OpenBLAS may help. That is true for any R package using LAPACK, as they all use the same LAPACK installation accessed through R, and that installation can be swapped.
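As a quick check, and assuming R >= 3.4 (where sessionInfo() reports the linked libraries), you can confirm which BLAS/LAPACK your R is actually using without recompiling anything:
# Report the BLAS and LAPACK shared libraries this R session is linked against
si <- sessionInfo()
si$BLAS
si$LAPACK
After installing a different BLAS (e.g. OpenBLAS via the package manager), restart R and the reported paths should change.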

R Parallel Processing with Xeon Phi, minimal code changes?

Looking at buying a couple Xeon Phi 5110P, but trying to estimate how much code I have to change or other software needed.
Currently I make good use of R on a multi-core Windows machine (24 cores) by using the foreach package, passing it other packages forecast, glmnet, etc. to do my parallel processing.
Having a Xeon Phi, I understand I would want to compile R per
https://software.intel.com/en-us/articles/running-r-with-support-for-intel-xeon-phi-coprocessors and I understand this could be done with a trial version of Parallel Studio XE.
Do I then need to edit R's Makeconf file, adding the C/C++ flags for the Phi, and compile all the needed packages before the Parallel Studio trial expires? Or do I not need to edit the Makeconf to get the benefits of foreach on the Phi?
It seems some of this will be handled automatically once R is compiled, with offloading done by the Math Kernel Library (MKL), but I'm not totally sure of this.
Somewhat related question: Is the Intel Xeon Phi usable without a costly Intel Compiler?
Also, revolutionanalytics.com has a few related blog posts, but they are not entirely conclusive for me: http://blog.revolutionanalytics.com/2015/05/behold-the-power-of-parallel.html
If all you need is matrix operations, you can compile R with the MKL libraries per the Intel article linked above (Running R with Support for Intel® Xeon Phi™ Coprocessors), which requires the Intel compiler. Microsoft R comes precompiled with MKL, but I was not able to use the automatic offload; I had to compile R with the Intel compiler for it to work properly.
You could use the trial version compiler and compile it during the trial period to see if it fits your purpose.
If you want to use things like the foreach package by setting up a cluster, then since each node is a Linux computer, I'm afraid you're out of luck. On page 3 of the R-Admin manual it says:
Cross-building is not possible: installing R builds a minimal version of R and then runs many R scripts to complete the build.
You would have to cross-compile from the Xeon host for the Xeon Phi nodes with the Intel compiler, and that is just not feasible.
The last way to utilize the Phi is to rewrite your code to call it directly. Rcpp provides an easy interface to C and C++ routines. If you find a C routine that runs well on the Phi, you can call it from within your code. I have done this with CUDA; Rcpp is a thin layer, there are good examples of how to use it, and if you combine those with examples of calling the Phi cards you can probably achieve your goal with less overhead.
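For illustration, here is a minimal Rcpp sketch; the function name and body are placeholders, and a real offload kernel would replace the body:
library(Rcpp)
# Compile and call a C++ routine from R; scaleVec is a hypothetical example
cppFunction('
NumericVector scaleVec(NumericVector x, double a) {
  return a * x;
}')
scaleVec(c(1, 2, 3), 2.0)  # returns 2 4 6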
But if all you need is matrix ops, there is no quicker route to supercomputing than a good double-precision NVIDIA card and preloading nvBLAS during R startup.

Running a loop in parallel using multiple cores in R [duplicate]

I have a quad-core laptop running Windows XP, but looking at Task Manager R only ever seems to use one processor at a time. How can I make R use all four processors and speed up my R programs?
I have a basic system I use where I parallelize my programs on the "for" loops. This method is simple once you understand what needs to be done. It only works for local computing, but that seems to be what you're after.
You'll need these packages installed and loaded:
library("parallel")
library("foreach")
library("doParallel")
First you need to create your computing cluster. I usually do other things while my parallel programs run, so I like to leave one core free. The "detectCores" function returns the number of cores in your computer.
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl, cores = detectCores() - 1)
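You can check that the registration took effect with foreach's "getDoParWorkers" function:
getDoParWorkers()  # should report detectCores() - 1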
Next, call your for loop with the "foreach" command, along with the %dopar% operator. I always use a "try" wrapper to make sure that any iterations where the operations fail are discarded and don't disrupt the otherwise good data. You will need to specify the ".combine" parameter and pass any necessary packages into the loop. Note that "i" is defined with an equals sign, not with an "in" operator!
data <- foreach(i = 1:length(filenames),
                .packages = c("ncdf", "chron", "stats"),
                .combine = rbind) %dopar% {
  try({
    # your operations; line 1...
    # your operations; line 2...
    # your output
  })
}
Once you're done, clean up with:
stopCluster(cl)
The CRAN Task View on High-Performance Computing with R lists several options. XP is a restriction, but you can still get something like snow working over sockets within minutes.
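For instance, a minimal socket-based snow setup might look like this (a sketch; four local workers assumed):
library(snow)
# Start four worker processes on the local machine, connected via sockets
cl <- makeSOCKcluster(rep("localhost", 4))
parLapply(cl, 1:4, function(i) i^2)  # run the function on the workers
stopCluster(cl)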
As of version 2.14, R comes with native support for multi-core computations. Just load the parallel package
library("parallel")
and check out the associated vignette
vignette("parallel")
I hear tell that REvolution R supports better multi-threading than the typical CRAN version of R, and REvolution also supports 64-bit R on Windows. I have been considering buying a copy, but I found their pricing opaque: there's no price list on their web site. Very odd.
The multicore package (which relies on fork(), so it is of limited use on XP) gives some basic multi-process capability, especially through offering a drop-in replacement for lapply() and a simple way to evaluate an expression in a new process (mcparallel()).
On Windows, I believe the best way to do this would be with foreach and snow, as David Smith said.
However, Unix/Linux-based systems can compute using multiple processes with the 'multicore' package. It provides a high-level function, 'mclapply', that applies a function over a list across multiple cores. An advantage of the 'multicore' package is that each processor gets a private copy of the Global Environment that it may modify. Initially, this copy is just a pointer to the Global Environment, making the sharing of variables extremely quick as long as the Global Environment is treated as read-only.
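A minimal sketch (using mclapply as shipped in the parallel package, which absorbed multicore; the core count is an assumption):
library(parallel)
# Fork four workers and square each element of the list in parallel
res <- mclapply(1:8, function(i) i^2, mc.cores = 4)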
Rmpi, by contrast, requires that the data be explicitly transferred between R processes instead of relying on the 'multicore' closure approach.
-- Dan
If you do a lot of matrix operations and you are using Windows, you can install revolutionanalytics.com/revolution-r-open for free; it comes with the Intel MKL libraries, which allow multithreaded matrix operations. On Windows, if you take the libiomp5md.dll, Rblas.dll and Rlapack.dll files from that install and overwrite the ones in whatever R version you like to use, you'll have multithreaded matrix operations (typically a 10-20x speedup for matrix operations). Or you can use the ATLAS Rblas.dll from prs.ism.ac.jp/~nakama/SurviveGotoBLAS2/binary/windows/x64, which also works on 64-bit R and is almost as fast as the MKL one. I found this the single easiest thing to do to drastically increase R's performance on Windows systems. In fact, I'm not sure why these don't come as standard on R Windows installs.
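A rough way to verify the speedup after swapping in the new DLLs (a sketch; the matrix size is arbitrary and timings will vary by machine):
# Time a large matrix multiply before and after the swap
n <- 2000
A <- matrix(rnorm(n * n), nrow = n)
system.time(A %*% A)  # should drop sharply with a multithreaded BLAS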
On Windows, multithreading unfortunately is not well supported in R (unless you use OpenMP via Rcpp), and the available SOCKET-based parallelization on Windows systems, e.g. via package parallel, is very inefficient. On POSIX systems things are better, as you can use forking there (package multicore is, I believe, the most efficient option). You could also try package Rdsm for multithreading within a shared-memory model; I've got a version on my github with the unix-only flag removed, which should also work on Windows (earlier, Windows wasn't supported because the dependency bigmemory supposedly didn't work on Windows, but now it seems it does):
library(devtools)
devtools::install_github('tomwenseleers/Rdsm')
library(Rdsm)
