OpenCL BLAS in Julia

I am really amazed by the Julia language, having implemented lots of machine learning algorithms in it for my current project. Even though Julia 0.2 gets some great results out of my 2011 MBA, outperforming all other solutions on similar Linux hardware (due to the vecLib BLAS, I suppose), I would certainly like more. I am in the process of buying a Radeon 5870 and would like to push my matrix operations there. I use basically only simple BLAS operations such as matrix multiplication, addition and transposition. I use Julia's compact syntax A' * B + C and would certainly like to keep it.
Is there any way (or pending milestone) to get those basic operations to execute on the GPU? My matrices are about 2500x2500 in single precision, so I expect a significant speedup.

I don't believe that GPU integration into the core of Julia is planned at this time. One of the key issues is that there is substantial overhead moving the data to and from the GPU, making a drop-in replacement for BLAS operations infeasible.
I expect that most of the progress in this area will actually come from the package ecosystem, in particular packages under the JuliaGPU organization. I see there is a CLBLAS package in there.


Parallel *apply in Azure Machine Learning Studio

I have just started to get myself acquainted with parallelism in R.
As I am planning to use Microsoft Azure Machine Learning Studio for my project, I have started investigating what Microsoft R Open offers for parallelism. I found this, which says that parallelism is done under the hood and uses all available cores without any change to the R code. The article also shows some performance benchmarks; however, most of them demonstrate the benefit for mathematical operations.
This was good so far. In addition, I would like to know whether it also parallelizes the *apply functions under the hood. I also found these 2 articles that describe how to parallelize *apply functions in general:
Quick guide to parallel R with snow: describes facilitating parallelism using snow package, par*apply function family, and clusterExport.
A gentle introduction to parallel computing in R: using parallel package, par*apply function family, and binding values to environment.
So my question is: when I use *apply functions in Microsoft Azure Machine Learning Studio, will they be parallelized under the hood by default, or do I need to use packages like parallel, snow, etc.?
Personally, I think we could have marketed MRO a bit differently, without making such a big deal about parallelism/multithreading. Ah well.
R comes with an Rblas.dll/.so which implements the routines used for linear algebra computations. These routines are used in various places, but one common use case is for fitting regression models. With MRO, we replace the standard Rblas with one that uses the Intel Math Kernel Library. When you call a function like lm or glm, MRO will use multiple threads and optimized CPU instructions to fit the model, which can get you dramatic speedups over the standard implementation.
MRO isn't the only way you can get this sort of speedup; you can also compile/download other BLAS implementations that are similarly optimized. We just make it an easy one-step download.
Note that the MKL only affects code that involves linear algebra. It isn't a general-purpose speedup tool; any R code that doesn't do matrix computations won't see a performance improvement. In particular, it won't speed up any code that involves explicit parallelism, such as code using the parallel package, SNOW, or other cluster computing tools.
On the other hand, it won't degrade them either. You can still use packages like parallel, SNOW, etc to create compute clusters and distribute your code across multiple processes. MRO works just like regular CRAN R in this respect. (One thing you might want to do, though, if you're creating a cluster of nodes on the one machine, is reduce the number of MKL threads. Otherwise you risk contention between the nodes for CPU cores, which will degrade performance.)
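As a minimal sketch of the explicit route described above, the following uses the parallel package to run an *apply-style computation on a local cluster. The worker count and the function slow_square are hypothetical, chosen only for illustration. The call to RevoUtilsMath::setMKLthreads() is attempted only if that MRO package is installed; on plain CRAN R it is skipped.

```r
library(parallel)

n_workers <- 2  # assumed worker count; adjust to the cores you have
cl <- makeCluster(n_workers)

# On MRO, limit each worker to one MKL thread so the workers do not
# compete for cores. Skipped when RevoUtilsMath is not installed.
invisible(clusterEvalQ(cl, {
  if (requireNamespace("RevoUtilsMath", quietly = TRUE)) {
    RevoUtilsMath::setMKLthreads(1)
  }
  NULL
}))

# Hypothetical per-element task standing in for real work
slow_square <- function(i) i^2

# parLapply() is the cluster counterpart of lapply()
res <- parLapply(cl, 1:8, slow_square)
stopCluster(cl)
```

The point of the sketch is that the parallelism is requested explicitly through parLapply(); nothing here depends on the MKL.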
Disclosure: I work for Microsoft.

Advice about inversion of large sparse matrices

Just got a Windows box set up with two 64-bit Intel Xeon X5680 3.33 GHz processors (6 cores each) and 12 GB of RAM. I've been using SAS on some large data sets, but it's just too slow, so I want to set up R to do parallel processing. I want to be able to carry out matrix operations, e.g., multiplication and inversion. Most of my data are not huge, in the 3-4 GB range, but one file is around 50 GB. It's been a while since I used R, so I looked around on the web, including the CRAN HPC task view, to see what was available. I think a foreach loop and the bigmemory package will be applicable. I came across this post, "Is there a package for parallel matrix inversion in R", which had some interesting suggestions. I was wondering if anyone has experience with the HiPLAR packages. It looks like HiPLARM adds functionality to the Matrix package and HiPLARb adds new functions altogether. Which of these would be recommended for my application? Furthermore, there is a reference to the PLASMA library. Is this of any help? My matrices have a lot of zeros, so I think they could be considered sparse. I didn't see any examples of how to pass data from R to PLASMA, and the PLASMA docs say it does not support sparse matrices, so I'm thinking that I don't need this library. Am I on the right track here? Any suggestions on other approaches?
EDIT: It looks like HiPLAR and the pbdR packages will not be helpful. I'm leaning more toward bigmemory, although it looks like I/O may be a problem. This article talks about a package vam for virtual associative matrices, but it must be proprietary. Would package ff be of any help here? My R skills are just not current enough to know what direction to pursue. I'm pretty sure I can read this using bigmemory, but not sure the processing will be very fast.
If you want to use HiPLAR (MAGMA and PLASMA libraries in R), it is only available for Linux at the moment. For this and many other things, I suggest switching your OS to the penguin.
That being said, Intel MKL optimization can do wonders for this sort of operation. For most practical uses, it is the way to go. Python built with MKL optimization, for example, can process large matrices about 20x faster than IDL, which was designed specifically for image processing. R has similarly shown vast improvements when built with MKL optimization. You can also install R Open from Revolution Analytics, which includes MKL optimization, but I am not sure that it has quite the same effect as building it yourself using Intel tools.
I would definitely consider the type of operations one is looking to perform. GPU processes are those that lend well to high parallelism (many of the same little computations running at once, as with matrix algebra), but they are limited by bus speeds. Intel MKL optimization is similar in that it can help use all of your CPU cores, but it is really optimized to Intel CPU architecture. Hence, it should provide basic memory optimization too. I think that is the simplest route. HiPLAR is certainly the future, as it is CPU-GPU by design, especially with highly parallel heterogeneous architectures making their way into consumer systems. Most consumer systems today cannot fully utilize this though I think.

MPI and OpenMP. Do I even have a choice?

I have a linear algebra code that I am trying to get to run faster. It's an iterative algorithm with a loop and matrix-vector multiplications within it.
So far, I have used MATMUL (Fortran library) and DGEMV, and tried writing my own matrix-vector code in OpenMP, but the algorithm does no better in terms of scalability. Speedups are barely 3.5-4 irrespective of how many processors I allot to it (I have tried up to 64 processors).
The profiling shows significant time being spent in Matrix-Vector and the rest is fairly nominal.
My question is:
I have a shared memory system with tons of RAM and processors. I have tried tweaking the OpenMP implementation of the code (including the matrix-vector product), but it has not helped. Will it help to code in MPI? I am not a pro at MPI, but the ability to fine-tune the message communication might help a bit; I can't be sure. Any comments?
More generally, from the literature I have read, MPI = Distributed, OpenMP = Shared but can they perform well in the others' territory? Like MPI in Shared? Will it work? Will it be better than the OpenMP implementation if done well?
You're best off just using a linear algebra package that is already well optimized for a multicore environment for your matrix-vector multiplication. ATLAS, GotoBLAS (if you have a Nehalem or older; sadly it's no longer being updated), or vendor BLAS implementations (like MKL for Intel CPUs, ACML for AMD, or vecLib for Apple, some of which are commercial) all have good, well-tuned, multithreaded BLAS implementations. Unless you have excellent reason to believe that you can do better than those full-time development teams can, you're best off using them.
Note that you'll never get the parallel speedup with DGEMV that you do with DGEMM, because the vector is much smaller than a second matrix and so there is far less work to spread across cores; but you can still do quite well, and you'll find you get much better performance with these libraries than you do with anything hand-rolled unless you were already doing multi-level cache blocking.
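One way to see the DGEMV/DGEMM gap is to count floating-point operations per value touched. The following is a rough back-of-the-envelope count for square n x n operands, offered as an illustration rather than a performance model:

```latex
\begin{aligned}
\text{DGEMV } (y \leftarrow Ax):\quad & 2n^2 \text{ flops on } n^2 + 2n \text{ values}
  \;\Rightarrow\; \approx 2 \text{ flops per value},\\
\text{DGEMM } (C \leftarrow AB):\quad & 2n^3 \text{ flops on } 3n^2 \text{ values}
  \;\Rightarrow\; \approx \tfrac{2n}{3} \text{ flops per value}.
\end{aligned}
```

With so little arithmetic per value loaded, a matrix-vector product tends to be limited by memory bandwidth rather than by the number of cores, which would be consistent with a speedup that plateaus early.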
You can use MPI in a shared environment (though not OpenMP in a distributed one). However, achieving a good speedup depends a lot more on your algorithms and data dependencies than the technology used. Since you have a lot of shared memory, I'd recommend you stick with OpenMP, and carefully examine whether you're making the best use of your resources.

Parallelize Solve() for Ax=b?

I have some extremely large sparse matrices created using the spMatrix function from the Matrix package.
Using the solve() function works for my Ax=b issue, but it takes a very long time. Several days.
I noticed that
appears to have a function that can parallelize the solve function, however, it can take several weeks to get new packages installed on this particular server.
The server already has the snow package installed on it.
Is there a way of using snow to parallelize this operation?
If not, are there other ways to speed up this type of operation?
Are there other packages like RScaLAPACK? My search on RScaLAPACK seemed to suggest people had a lot of issues with it.
[EDIT] -- Additional details
The matrices are about 370,000 x 370,000.
I'm using it to solve for alpha centrality. I was originally using the alpha centrality function in the igraph package, but it would crash R.
More details
This is on a single machine with 12 cores and 96 gigs of memory (I believe)
It's a directed graph along the lines of paper citation relationships.
Calculating the condition number and density will take a while. I will post them as they become available.
[Update 1: For those just tuning in: The original question involved parallelizing computations to solving a regression problem; given that the underlying problem is related to alpha centrality, some of the issues, such as bagging and regularized regression may not be as immediately applicable, though that leads down the path of further statistical discussions.]
There are a bundle of issues to address here, from the infrastructural to the statistical.
[Updated - also see Update #2 below.]
Regarding parallelized linear solvers, you can replace R's BLAS / LAPACK library with one that supports multithreaded computations, such as ATLAS, Goto BLAS, Intel's MKL, or AMD's ACML. Personally, I use AMD's version. ATLAS is irritating, because one fixes the number of cores at compilation, not at run-time. MKL is commercial. Goto is not well supported anymore, but is often the fastest, but only by a slight margin. It's under the BSD license. You can also look at Revolution Analytics's R, which includes, I think, the Intel libraries.
So, you can start using all of the cores right away, with a simple back-end change. This could give you a 12X speedup (b/c of the number of cores) or potentially much more (b/c of better implementation). If that brings down the time to an acceptable range, then you're done. :) But, changing the statistical methods could be even better.
You've not mentioned the amount of RAM available (or the distribution of it per core or machine), but a sparse solver should be pretty smart about managing RAM accesses and not try to chew on too much data at once. Nonetheless, if it is on one machine and if things are being done naively, then you may encounter a lot of swapping. In that case, take a look at packages like biglm, bigmemory, ff, and others. The first addresses solving linear equations (or GLMs, rather) in limited memory; the latter two address shared memory (i.e. memory mapping and file-based storage), which is handy for very large objects. More packages (e.g. speedglm and others) can be found at the CRAN Task View for HPC.
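For reference, here is a minimal sketch of how an alpha-centrality system like the one in the question can be posed as a sparse solve with the Matrix package. The graph is a small random one invented for illustration, and alpha = 0.1 and the all-ones exogenous vector e are assumptions, not the asker's values.

```r
library(Matrix)  # recommended package shipped with R

set.seed(1)
n <- 200
# Hypothetical directed graph: 3 outgoing edges per node
from <- rep(seq_len(n), each = 3)
to   <- sample.int(n, length(from), replace = TRUE)
A <- sparseMatrix(i = from, j = to, x = 1, dims = c(n, n))

alpha <- 0.1     # assumed attenuation; small enough that the system is nonsingular here
e <- rep(1, n)   # assumed exogenous vector

# Alpha centrality solves (I - alpha * t(A)) x = e
M <- Diagonal(n) - alpha * t(A)
x <- drop(as.matrix(solve(M, e)))

# Check the solution by its residual
residual <- max(abs(drop(as.matrix(M %*% x)) - e))
```

Keeping M in sparse form means solve() dispatches to a sparse factorization instead of building a dense n x n matrix, which matters a great deal at 370,000 rows.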
A semi-statistical, semi-computational issue is to address visualization of your matrix. Try sorting by the support per row & column (identical if graph is undirected, else do one then the other, or try a reordering method like reverse Cuthill-McKee), and then use image() to plot the matrix. It would be interesting to see how this is shaped, and that affects which computational and statistical methods one could try.
Another suggestion: Can you migrate to Amazon's EC2? It is inexpensive, and you can manage your own installation. If nothing else, you can prototype what you need and migrate it in-house once you have tested the speedups. JD Long has a package called segue that apparently makes life easier for distributing jobs on Amazon's Elastic MapReduce infrastructure. Update: with 96 GB of RAM and 12 cores there is no need to migrate to EC2; distributing the work could speed things up, but that's not the issue here. Just getting 100% utilization on this machine would be a good improvement.
Next up are multiple simple statistical issues:
BAGGING You could consider sampling subsets of your data in order to fit the models and then bag your models. This can give you a speedup. This can allow you to distribute your computations on as many machines & cores as you have available. You can use SNOW, along with foreach.
REGULARIZATION The glmnet package supports sparse matrices and is very fast. You would be wise to test it out. Be careful about ill-conditioned matrices and very small values of lambda.
RANK Your matrices are sparse: are they full rank? If they are not, that could be part of the issue you're facing. Matrices that are singular or very nearly so are trouble for a solver: check your estimated condition number, or at least look at how your 1st and Nth eigenvalues compare; if there's a steep drop-off, you're in trouble (you might check ev1 versus ev2, ..., ev10, ...). Again, if you have nearly singular matrices, then you need to go back to something like glmnet to shrink out the variables that are either collinear or have very low support.
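To illustrate the condition-number check on a toy scale, the sketch below compares a well-conditioned design matrix with one that has two nearly collinear columns, using base R's kappa(). The matrices are invented for illustration; for a large sparse matrix you would want an estimator that avoids a dense SVD.

```r
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)

# Well-conditioned: intercept plus two independent columns
X_good <- cbind(1, x1, x2)

# Nearly singular: third column is almost a copy of the second
X_bad <- cbind(1, x1, x1 + 1e-8 * x2)

# exact = TRUE computes the 2-norm condition number from the singular values
kappa_good <- kappa(X_good, exact = TRUE)
kappa_bad  <- kappa(X_bad, exact = TRUE)

# Ratio of largest to smallest singular value, the "steep drop-off" check
sv_bad <- svd(X_bad)$d
drop_off <- sv_bad[1] / sv_bad[length(sv_bad)]
```

A condition number many orders of magnitude larger than that of a comparable well-behaved matrix is the warning sign described above.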
BOUNDING Can you reduce the bandwidth of your matrix? If you can block diagonalize it, that's great, but you'll likely have cliques and members of multiple cliques. If you can trim the most poorly connected members, then you may be able to estimate their alpha centrality as being upper bounded by the lowest value in the same clique. There are some packages in R that are good for this sort of thing (check out Reverse Cuthill-McKee; or simply look to see how you'd convert it into rectangles, often relating to cliques or much smaller groups). If you have multiple disconnected components, then, by all means, separate the data into separate matrices.
ALTERNATIVES Are you wedded to the Alpha Centrality? There may be other measures that are monotonically correlated (i.e. have high rank correlation) with the same value that could be calculated more cheaply or at least implemented quite efficiently. If those will work, then your analyses could proceed with a lot less effort. I have a few ideas, but SO isn't really the place to go about that discussion.
For more statistical perspectives, appropriate Q&A should occur on Cross Validated.
Update 2: I was a bit too quick in answering and didn't address this from the long-term perspective. If you are planning to do research on such systems for the long-term, you should look at other solvers that may be more applicable to your type of data and computing infrastructure. Here is a very nice directory of the options for both solvers and pre-conditioners. It seems this doesn't include IBM's "Watson" solver suite. Although it may take weeks to get software installed, it's quite possible that one of the packages is already installed if you have a good HPC administrator.
Also, keep in mind that R packages can be installed to the user directory - you need not have a package installed in the general directory. If you need to execute something as a user other than yourself, you could also download a package to the scratch or temporary space (if you're running within just 1 R instance, but using multiple cores, check out tempdir).
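A minimal sketch of the user-directory approach follows. The directory under tempdir() is only a stand-in for whatever user-writable location you choose, and "somePackage" is a placeholder name; the install call is commented out because it needs network access.

```r
# Hypothetical user-writable library location (a stand-in; pick your own path)
user_lib <- file.path(tempdir(), "Rlib")
dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)

# Put the user library first on the search path for this session
.libPaths(c(user_lib, .libPaths()))

# Install into that library without touching the system-wide one
# install.packages("somePackage", lib = user_lib)  # placeholder package name

# Packages installed there are then found by library() as usual
on_path <- normalizePath(user_lib) %in% normalizePath(.libPaths())
```

For a setup that persists across sessions, the same path can be set through the R_LIBS_USER environment variable instead of calling .libPaths() each time.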

R package that automatically uses several cores?

I have noticed that R only uses one core while executing one of my programs which requires lots of calculations. I would like to take advantage of my multi-core processor to make my program run faster.
I have not yet investigated the question in depth, but I would appreciate your comments because I do not have a strong background in computer science and find it hard to get easily understandable information on the subject.
Is there a package that allows R to automatically use several cores when needed?
I guess it is not that simple.
R can only make use of multiple cores with the help of add-on packages, and only for some types of operation. The options are discussed in detail on the High Performance Computing Task View on CRAN.
Update: From R version 2.14.0, add-on packages are not necessarily required, because the parallel package is now shipped with R itself. parallel includes functionality from the multicore and snow packages, largely unchanged.
The easiest way to take advantage of multiprocessors is the multicore package which includes the function mclapply(). mclapply() is a multicore version of lapply(). So any process that can use lapply() can be easily converted to an mclapply() process. However, multicore does not work on Windows. I wrote a blog post about this last year which might be helpful. The package Revolution Analytics created, doSMP, is NOT a multi-threaded version of R. It's effectively a Windows version of multicore.
If your work is embarrassingly parallel, it's a good idea to get comfortable with the lapply() type of structuring. That will give you easy segue into mclapply() and even distributed computing using the same abstraction.
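As a minimal sketch of that conversion, here is the same computation written with lapply() and with mclapply() from the parallel package (which now carries the multicore functionality). The core count of 2 is an assumption; because forking is unavailable on Windows, the sketch falls back to one core there.

```r
library(parallel)

# mclapply() relies on forking, so use a single core on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

x <- 1:10

# Serial version
y_serial <- lapply(x, function(i) sqrt(i))

# Parallel version: same call shape, plus mc.cores
y_parallel <- mclapply(x, function(i) sqrt(i), mc.cores = n_cores)
```

Because the two calls differ only in the function name and the mc.cores argument, code structured around lapply() can be switched over with very little effort.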
Things get much more difficult for operations that are not "embarrassingly parallel".
As a side note, RStudio is getting increasingly popular as a front end for R. I love RStudio and use it daily. However, it needs to be noted that RStudio does not play nice with multicore (at least as of Oct 2011; I understand that the RStudio team is going to fix this). This is because RStudio does some forking behind the scenes and these forks conflict with multicore's attempts to fork. So if you need multicore, you can write your code in RStudio, but run it in a plain-Jane R session.
On this question you always get very short answers. The easiest solution, in my opinion, is the package snowfall, based on snow. That is, on a single Windows computer with multiple cores. See also here the article of Knaus et al for a simple example. Snowfall is a wrapper around the snow package, and allows you to set up a multicore cluster with a few commands. It's definitely less hassle than most of the other packages (I didn't try all of them).
On a side note, there are indeed only a few tasks that can be parallelized, for the very simple reason that you have to be able to split up the tasks before multicore calculation makes sense. The apply family is obviously a logical choice for this: multiple and independent computations, which is crucial for multicore use. Anything else is not always that easily multicored.
Read also this discussion on sfApply and custom functions.
Microsoft R Open includes multi-threaded math libraries to improve the performance of R. It works on Windows, Unix and Mac. It's open source and can be installed in a separate directory if you have an existing R installation from CRAN. You can also use the popular RStudio IDE with it. From its inception, R was designed to use only a single thread (processor) at a time. Even today, R works that way unless linked with multi-threaded BLAS/LAPACK libraries.
The multi-core machines of today offer parallel processing power. To take advantage of this, Microsoft R Open includes multi-threaded math libraries.
These libraries make it possible for many common R operations, such as matrix multiplication and inversion, matrix decomposition, and some higher-level matrix operations, to compute in parallel and use all of the available processing power to reduce computation times.
As David Heffernan said, take a look at the blog of Revolution Analytics. But you should know that most packages are for Linux, so if you use Windows it will be much harder.
Anyway, take a look at these sites:
Revolution. Here you will find a lecture about parallelization in R. The lecture is actually very good, but, as I said, most tips are for Linux.
And this thread here at Stack Overflow discusses some implementations on Windows.
The package future makes it extremely simple to work in R using parallel and distributed processing. More info here. If you want to apply a function to elements in parallel, the future.apply package provides a quick way to use the "apply" family functions (e.g. apply(), lapply(), and vapply()) in parallel.
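library(future.apply)  # also attaches the future package
plan(multisession)     # without a parallel plan, future_lapply() runs sequentially
x <- 1:10
# Single core
y <- lapply(x, FUN = quantile, probs = 1:3/4)
# Multicore in parallel
y <- future_lapply(x, FUN = quantile, probs = 1:3/4)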
