MPI and OpenMP. Do I even have a choice? - mpi

I have a linear algebra code that I am trying get to run faster. Its a iterative algorithm with a loop and matrix vector multiplications within in.
So far, I have used MATMUL (Fortran Lib.), DGEMV, Tried writing my own MV code in OpenMP but the algorithm is doing no better in terms of scalability. Speed ups are barely 3.5 - 4 irrespective of how many processors I am allotting to it (I have tried up 64 processors).
The profiling shows significant time being spent in Matrix-Vector and the rest is fairly nominal.
My question is:
I have a shared memory system with tons of RAM and processors. I have tried tweaking OpenMP implementation of the code (including Matrix Vector) but has not helped. Will it help to code in MPI? I am not a pro at MPI but the ability to fine tune the message communication might help a bit but I can't be sure. Any comments?
More generally, from the literature I have read, MPI = Distributed, OpenMP = Shared but can they perform well in the others' territory? Like MPI in Shared? Will it work? Will it be better than the OpenMP implementation if done well?

You're best off just using a linear algebra package that is already well optimized for a multitcore environment and using that for your matrix-vector multiplication. The Atlas package, gotoblas (if you have a nehalem or older; sadly it's no longer being updated), or vendor BLAS implementations (like MKL for intel CPUs, ACML for AMD, or VecLib for apple, which all cost money) all have good, well-tuned, multithreaded BLAS implementations. Unless you have excellent reason to believe that you can do better than those full time development teams can, you're best off using them.
Note that you'll never get the parallel speedup with DGEMV that you do with DGEMM, just because the vector is smaller than another matrix and so there's less work; but you can still do quite well, and you'll find you get much better perforamance with these libraries than you do with anything hand-rolled unless you were already doing multi-level cache blocking.

You can use MPI in a shared environment (though not OpenMP in a distributed one). However, achieving a good speedup depends a lot more on your algorithms and data dependencies than the technology used. Since you have a lot of shared memory, I'd recommend you stick with OpenMP, and carefully examine whether you're making the best use of your resources.

Related

R performance on Ryzen+Ubuntu: openBLAS/MKL, Rcpp and other improvements?

I'm trying to push the maximum from a Ryzen 3 3950x 16-core machine on Ubuntu 20.04, Microsoft R 3.5.2, with Intel MKL, and the RCpp code is compiled with Sys.setenv(MKL_DEBUG_CPU_TYPE=5) header.
The following are the main operations, that I'd like to optimize for:
Fast multivariate random normal (for which I use the Armadillo version):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
arma::mat mvrnormArma(int n, arma::vec mu, arma::mat sigma) {
int ncols = sigma.n_cols;
arma::mat Y = arma::randn(n, ncols);
return arma::repmat(mu, 1, n).t() + Y * arma::chol(sigma);
}
Fast SVD (I found that the base::svd performs better than any Rcpp realization I've found so far, including arma::svd("dc"), probably due to different U,S,V dimensions).
Fast matrix multiplications for various results (found code written in C, rewrote all of it in base R and am finding vast improvements due to multicore vs previous 1-core performance. Can base R matrix operations be further improved?)
I've tried various setups with R4.0.2 and openBLAS (through the Ropenblas package), played with various Intel MKL releases, researched about AMD's BLIS and libflame (which I don't know how to even test with R).
Overall, this setup is able to outperform a laptop with i7-8750h and Microsoft R 3.5.1 (With working MKL) by around 2x, while based on 6 vs 16 cores (and faster RAM), I was expecting at least 3-3.5x improvement (based, e.g., by cinebench and similar performance benchmarks).
How can this setup be further improved?
My main issues/questions:
First, I've noticed that the current setup, when ran with 1 worker, is using around 1000-1200% cpu when looking at top call. Through experimentation, I've found that spawning two parallel workers uses most of the cpu, around 85-95%, and delivers the best performance. For example, 3 workers uses whole 100%, but bottlenecks somewhere, drastically reducing the performance for some reason.
I'm guessing that this is a limitation either coming from R/MKL, or something when compiling Rcpp code, since 10-12 cores seems oddly specific. Can this be improved by some hints when compiling Rcpp code?
Secondly, I'm sure I'm not using the optimal BLAS/LAPACK/etc drivers for the job. My guess is that properly compiled R4.0.2 should be significantly faster than Microsoft R3.5.2, but I have absolutely no idea what am I missing, whether the AVX/AVX2 are properly called/used, and what else should I try with the machine?
Lastly, I have seen zero guides for calling/working with AMD BLIS/libflame for R. If this is trivial, would appreciate any hints/help of what to look into.
Until any other (hopefully much better) answers pops up, will post here my latest findings by guesswork. Hopefully, someone with a similar machine will find this useful. Will try expanding the answer if any additional improvements comes up.
Guide for clean R compiling. Seems outdated, but hopefully nothing much missing:
Speed up RcppArmadillo: How to link to OpenBlas in an R package
OpenBLAS and IntelMKL examples + Rstudio
OpenBLAS works terrible on my Ryzen + Ubuntu configuration; with 3.10 BLAS, compiled with zen2 hints, uses all the CPU cores, but terribly. top reports 3200% usage for R instance, but total CPU utilisation doesn't rise more than 20-30%. Hence, the performance is at least 3x slower than with Intel MKL.
IntelMKL. Versions till 2019 work with the MKL_DEBUG_CPU_TYPE workaround. Can confirm that intel-mkl-64bit-2019.5-075 works.
For later versions since 2020.0-088 a different
workaround is
needed. With my benchmarks the performance did not see any
improvement, however, this may change with future MKL releases.
The 10-12 hardcoded cap per instance appears to be controlled by several environmental variables. I found the following list as per this old guide. Likely that these may change with later versions, but seems to work with 2019.5-075:
export MKL_NUM_THREADS=2
export OMP_NESTED="TRUE"
export MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
export OMP_NUM_THREADS=1
export MKL_DYNAMIC="TRUE"
export OMP_DYNAMIC="FALSE"
Playing around with various configurations I was able to find that reducing the number of threads and spawning more workers, for my specific benchmark that I've tested on, increased the performance drastically (around 3-4 fold). Even though the claimed CPU usage were similar to multicore variants of the configuration, 2 workers using 16 threads each (totaling in ~70% cpu utilisation) are much slower than 16 workers using 2 threads each (also, similar cpu utilisation). The results may vary with different tasks, so these seem to be the go-to parameters to play with for every longer task.
AMD BLIS. Testing this as an alternative to MKL BLAS, currently only experimenting, but the performance seems to be on par with the Intel MKL with all the fixes. Checked whether BLIS was actually used through perf, for my benchmarks the calls were made bli_dgemmsup_rd_haswell_asm_6x8m, bli_daxpyv_zen_int10 and others. Not sure yet whether the settings for compiling BLIS were optimal. The takeaway here could be that both the MKL and BLIS are actually pushing max from the CPU, given my specific benchmarks... or at least that both libraries are similarly optimized.
Important downside of sticking to AMD BLIS: noticed after months of usage, but there seems to be some unresolved issues with BLIS or LAPACK packed into AMD libraries that I have no idea about. I've noticed random matrix multiplication issues that are not replicable (essentially, hitting into this problem) and are solved by switching back to MKL libs. I can't say whether there's a problem with my way of building R or the actual libs, so just a warning.

Parallel *apply in Azure Machine Learning Studio

I have just started to get myself acquainted with parallelism in R.
As I am planning to use Microsoft Azure Machine Learning Studio for my project, I have started investigating what Microsoft R Open offers for parallelism, and thus, I found this, in which it says that parallelism is done under the hood that leverages the benefit of all available cores, without changing the R code. The article also shows some performance benchmarks, however, most of them demonstrate the performance benefit in doing mathematical operations.
This was good so far. In addition, I am also interested to know whether it also parallelize the *apply functions under the hood or not. I also found these 2 articles that describes how to parallelize *apply functions in general:
Quick guide to parallel R with snow: describes facilitating parallelism using snow package, par*apply function family, and clusterExport.
A gentle introduction to parallel computing in R: using parallel package, par*apply function family, and binding values to environment.
So my question is when I will be using *apply functions in Microsoft Azure Machine Learning Studio, will that be parallelized under the hood by default, or I need to make use of packages like parallel, snow etc.?
Personally, I think we could have marketed MRO a bit differently, without making such a big deal about parallelism/multithreading. Ah well.
R comes with an Rblas.dll/.so which implements the routines used for linear algebra computations. These routines are used in various places, but one common use case is for fitting regression models. With MRO, we replace the standard Rblas with one that uses the Intel Math Kernel Library. When you call a function like lm or glm, MRO will use multiple threads and optimized CPU instructions to fit the model, which can get you dramatic speedups over the standard implementation.
MRO isn't the only way you can get this sort of speedup; you can also compile/download other BLAS implementations that are similarly optimized. We just make it an easy one-step download.
Note that the MKL only affects code that involves linear algebra. It isn't a general-purpose speedup tool; any R code that doesn't do matrix computations won't see a performance improvement. In particular, it won't speed up any code that involves explicit parallelism, such as code using the parallel package, SNOW, or other cluster computing tools.
On the other hand, it won't degrade them either. You can still use packages like parallel, SNOW, etc to create compute clusters and distribute your code across multiple processes. MRO works just like regular CRAN R in this respect. (One thing you might want to do, though, if you're creating a cluster of nodes on the one machine, is reduce the number of MKL threads. Otherwise you risk contention between the nodes for CPU cores, which will degrade performance.)
Disclosure: I work for Microsoft.

OpenCL BLAS in julia

I am really amazed by the julia language, having implemented lots of machine learning algorithms there for my current project. Even though julia 0.2 manages to get some great results out of my 2011 MBA outperforming all other solutuions on similar linux hardware (due to vecLib blas, i suppose), I would certainly like more. I am in a process of buying radeon 5870 and would like to push my matrix operations there. I use basically only simple BLAS operations such as matmul, additios and transpositions. I use julia's compact syntax A' * B + C and would certainly like to keep it.
Is there any way (or pending milestone) I can get those basic operations execute on GPU? I have like 2500x2500 single precision matrices so I expect significant speedup.
I don't believe that GPU integration into the core of Julia is planned at this time. One of the key issues is that there is substantial overhead moving the data to and from the GPU, making a drop-in replacement for BLAS operations infeasible.
I expect that most of the progress in this area will actually come from the package ecosystem, in particular packages under the JuliaGPU organization. I see there is a CLBLAS package in there.

Advice about inversion of large sparse matrices

Just got a Windows box set up with two 64 bit Intel Xeon X5680 3.33 GHz processors (6 cores each) and 12 GB of RAM. I've been using SAS on some large data sets, but it's just too slow, so I want to set up R to do parallel processing. I want to be able to carry out matrix operations, e.g., multiplication and inversion. Most of my data are not huge, 3-4 GB range, but one file is around 50 GB. It's been a while since I used R, so I looked around on the web, including the CRAN HPC, to see what was available. I think a foreach loop and the bigmemory package will be applicable. I came across this post: Is there a package for parallel matrix inversion in R that had some interesting suggestions. I was wondering if anyone has experience with the HIPLAR packages. Looks like hiparlm adds functionality to the matrix package and hiplarb add new functions altogether. Which of these would be recommended for my application? Furthermore, there is a reference to the PLASMA library. Is this of any help? My matrices have a lot of zeros, so I think they could be considered sparse. I didn't see any examples of how to pass data fro R to PLASMA, and looking at the PLASMA docs, it says it does not support sparse matrices, so I'm thinking that I don't need this library. Am I on the right track here? Any suggestions on other approaches?
EDIT: It looks like HIPLAR and package pbdr will not be helpful. I'm leaning more toward bigmemory, although it looks like I/O may be a problem: http://files.meetup.com/1781511/bigmemoryRandLinearAlgebra_BryanLewis.pdf. This article talks about a package vam for virtual associative matrices, but it must be proprietary. Would package ff be of any help here? My R skills are just not current enough to know what direction to pursue. Pretty sure I can read this using bigmemory, but not sure the processing will be very fast.
If you want to use HiPLAR (MAGMA and PLASMA libraries in R), it is only available for Linux at the moment. For this and many other things, I suggest switching your OS to the penguin.
That being said, Intel MKL optimization can do wonders for these sort of operations. For most practical uses, it is the way to go. Python built with MKL optimization for example can process large matrices about 20x faster than IDL, which was designed specifically for image processing. R has similarly shown vast improvements when built with MKL optimization. You can also install R Open from Revolution Analytics, which includes MKL optimization, but I am not sure that it has quite the same effect as building it yourself using Intel tools: https://software.intel.com/en-us/articles/build-r-301-with-intel-c-compiler-and-intel-mkl-on-linux
I would definitely consider the type of operations one is looking to perform. GPU processes are those that lend well to high parallelism (many of the same little computations running at once, as with matrix algebra), but they are limited by bus speeds. Intel MKL optimization is similar in that it can help use all of your CPU cores, but it is really optimized to Intel CPU architecture. Hence, it should provide basic memory optimization too. I think that is the simplest route. HiPLAR is certainly the future, as it is CPU-GPU by design, especially with highly parallel heterogeneous architectures making their way into consumer systems. Most consumer systems today cannot fully utilize this though I think.
Cheers,
Adam

Parallelize Solve() for Ax=b?

Crossposted with STATS.se since this problem could straddle both STATs.se/SO
https://stats.stackexchange.com/questions/17712/parallelize-solve-for-ax-b
I have some extremely large sparse matrices created using spMatrix function from the matrix package.
Using the solve() function works for my Ax=b issue, but it takes a very long time. Several days.
I noticed that http://cran.r-project.org/web/packages/RScaLAPACK/RScaLAPACK.pdf
appears to have a function that can parallelize the solve function, however, it can take several weeks to get new packages installed on this particular server.
The server already has the snow package installed it.
So
Is there a way of using snow to parallelize this operation?
If not, are there other ways to speed up this type of operation?
Are there other packages like RScaLAPACK? My search on RScaLAPACK seemed to suggest people had a lot of issues with it.
Thanks.
[EDIT] -- Additional details
The matrices are about 370,000 x 370,000.
I'm using it to solve for alpha centrality, http://en.wikipedia.org/wiki/Alpha_centrality. I was originally using the alpha centrality function in the igraph package, but it would crash R.
More details
This is on a single machine with 12 cores and 96 gigs of memory (I believe)
It's a directed graph along the lines of paper citation relationships.
Calculating condition number and density will take awhile. Will post as it comes available.
Will crosspost on stat.SE and will add a link back to here
[Update 1: For those just tuning in: The original question involved parallelizing computations to solving a regression problem; given that the underlying problem is related to alpha centrality, some of the issues, such as bagging and regularized regression may not be as immediately applicable, though that leads down the path of further statistical discussions.]
There are a bundle of issues to address here, from the infrastructural to the statistical.
Infrastructure
[Updated - also see Update #2 below.]
Regarding parallelized linear solvers, you can replace R's BLAS / LAPACK library with one that supports multithreaded computations, such as ATLAS, Goto BLAS, Intel's MKL, or AMD's ACML. Personally, I use AMD's version. ATLAS is irritating, because one fixes the number of cores at compilation, not at run-time. MKL is commercial. Goto is not well supported anymore, but is often the fastest, but only by a slight margin. It's under the BSD license. You can also look at Revolution Analytics's R, which includes, I think, the Intel libraries.
So, you can start using all of the cores right away, with a simple back-end change. This could give you a 12X speedup (b/c of the number of cores) or potentially much more (b/c of better implementation). If that brings down the time to an acceptable range, then you're done. :) But, changing the statistical methods could be even better.
You've not mentioned the amount of RAM available (or the distribution of it per core or machine), but A sparse solver should be pretty smart about managing RAM accesses and not try to chew on too much data at once. Nonetheless, if it is on one machine and if things are being done naively, then you may encounter a lot of swapping. In that case, take a look at packages like biglm, bigmemory, ff, and others. The former addresses solving linear equations (or GLMs, rather) in limited memory, the latter two address shared memory (i.e. memory mapping and file-based storage), which is handy for very large objects. More packages (e.g. speedglm and others) can be found at the CRAN Task View for HPC.
A semi-statistical, semi-computational issue is to address visualization of your matrix. Try sorting by the support per row & column (identical if graph is undirected, else do one then the other, or try a reordering method like reverse Cuthill-McKee), and then use image() to plot the matrix. It would be interesting to see how this is shaped, and that affects which computational and statistical methods one could try.
Another suggestion: Can you migrate to Amazon's EC2? It is inexpensive, and you can manage your own installation. If nothing else, you can prototype what you need and migrate it in-house once you have tested the speedups. JD Long has a package called segue that apparently makes life easier for distributing jobs on Amazon's Elastic MapReduce infrastructure. No need to migrate to EC2 if you have 96GB of RAM and 12 cores - distributing it could speed things up, but that's not the issue here. Just getting 100% utilization on this machine would be a good improvement.
Statistical
Next up are multiple simple statistical issues:
BAGGING You could consider sampling subsets of your data in order to fit the models and then bag your models. This can give you a speedup. This can allow you to distribute your computations on as many machines & cores as you have available. You can use SNOW, along with foreach.
REGULARIZATION The glmnet supports sparse matrices and is very fast. You would be wise to test it out. Be careful about ill-conditioned matrices and very small values of lambda.
RANK Your matrices are sparse: are they full rank? If they are not, that could be part of the issue you're facing. When matrices are either singular or very nearly so (check your estimated condition number, or at least look at how your 1st and Nth eigenvalues compare - if there's a steep drop off, you're in trouble - you might check eval1 versus ev2,...,ev10,...). Again, if you have nearly singular matrices, then you need to go back to something like glmnet to shrink out the variables are either collinear or have very low support.
BOUNDING Can you reduce the bandwidth of your matrix? If you can block diagonalize it, that's great, but you'll likely have cliques and members of multiple cliques. If you can trim the most poorly connected members, then you may be able to estimate their alpha centrality as being upper bounded by the lowest value in the same clique. There are some packages in R that are good for this sort of thing (check out Reverse Cuthill-McKee; or simply look to see how you'd convert it into rectangles, often relating to cliques or much smaller groups). If you have multiple disconnected components, then, by all means, separate the data into separate matrices.
ALTERNATIVES Are you wedded to the Alpha Centrality? There may be other measures that are monotonically correlated (i.e. have high rank correlation) with the same value that could be calculated more cheaply or at least implemented quite efficiently. If those will work, then your analyses could proceed with a lot less effort. I have a few ideas, but SO isn't really the place to go about that discussion.
For more statistical perspectives, appropriate Q&A should occur on the stats.stackexchange.com, Cross-Validated.
Update 2: I was a bit too quick in answering and didn't address this from the long-term perspective. If you are planning to do research on such systems for the long-term, you should look at other solvers that may be more applicable to your type of data and computing infrastructure. Here is a very nice directory of the options for both solvers and pre-conditioners. It seems this doesn't include IBM's "Watson" solver suite. Although it may take weeks to get software installed, it's quite possible that one of the packages is already installed if you have a good HPC administrator.
Also, keep in mind that R packages can be installed to the user directory - you need not have a package installed in the general directory. If you need to execute something as a user other than yourself, you could also download a package to the scratch or temporary space (if you're running within just 1 R instance, but using multiple cores, check out tempdir).

Resources