Kalman Filter with OpenCL - opencl

I'm a new OpenCL programmer, just got into GPGPU computing, i'm using an Nvidia Quadro 600.
I'm making a research work based on GPU programming, my aim is to write a simple Kalman Filter kernel for OpenCL using a SIMT approach. I found this document which describes the principle by how it could be done with CUDA, i think it's a similar approach for OpenCL.
The basic operations done by Kalman Filter on a linear system are three equations, each one involving matrix manipulation, which computes the Kalman Gain matrix (K), the state estimation (x~), and updates the Error Covariance matrix (P) for the next state estimation. These three steps are iterated for each measurement taken from a real system.
Considering the SIMT approach, I thought to execute one iteration of the kalman filter on each gpu thread within a thread block, i send to each thread values needed to compute the iteration (input and output from the the real system measurement, linear system matrices).
There is some better design i could consider for this algorithm to be done with OpenCL?
It is possible (and useful) to make matrix operations on a separate kernel with a parallel manner?
Another question: assume we have k iteration, for each iteration k we calculate P for the step k+1, taking as input the P for step k from iteration k-1... Since each tread compute one iteration, it is possible to synchronize thread k to wait the P matrix from thread k-1?
UPDATE: After many search and tries, I thought it is impossible to adapt my implementation (as described above) for this problem to the OpenCL principles of operation. The only way I found to do that is to parallelize each single matrix operation, maybe using more GPUs to compute each matrix operation simultaneously. Real efficiency for this implementation could be achieved with big linear system to filter (that means: it works good with big matrices).

Related

Is calculating the index in an array more efficient than letting the compiler do it?

I'm trying to generalize a neural network function to arbitrarily many layers, and so I need multiple matrices to hold the weights for each neuron in each layer. I was originally explicitly declaring matrix objects in R to hold my weights for each layer. Instead of having one matrix per layer, I thought of a way (not saying it's original), to store all of my weights in a single array and defined an "indexing function" to map a weight to its appropriate index in the array.
I defined the function as follows:
where is the k-th weight of the j-th neuron in the i-th layer and L(r) is the number of neurons in layer r. After writing these definitions, I realize that stackoverflow doesn't allow latex like mathoverflow which is unfortunate.
Now the question is: Is it more efficient to compute the index of my weights in this way, or is actually less efficient?
After looking up how indices are computed for arrays in general, this is essentially what is done on compilation anyway if I just kept a matrix in each layer holding the weights, so it seems like I may just be making my code overly complicated and harder to understand if there's no difference in time efficiency.
TL;DR use the matrices its easier to understand and takes advantage of optimized CPU instructions.
In computer science parlance, the efficiency (scalability) of algorithms is reasoned about using Big O cost. A score can be given to both the time and space complexity.
Using Big O notation lets compare the two approaches:
Array Approach
time complexity:
Array index access is O(1) time, no matter how large an array becomes, it is just as computationally easy to access an element given its index.
As you've created a function to compute the index of the k-th weight, this adds some small complexity but would probably run in constant O(1) time as it is a mathematical expression, so negligible.
space complexity:
O(N) where N is the number of weights across all layers.
Matrices Approach
time complexity:
A matrix is essentially a 2d array with O(1) access
space complexity
O(N + M), where N is number of neurons and M is number of weights.
Conceptually, we can see that the two approaches have an equivalent time and space complexity score.
However there are the other trade-offs involved (and as a good SO-er must inform you of those)
When it comes to working with the data in the array vs matrices approach, the array approach is less efficient as it circumvents the opportunity for MISD operations. As #liborm alluded to there are vectorised (MISD) operations handled by lower level system libraries like LAPACK/BLAS, which "batch" CPU instructions for some matrix operations (less overhead cost to transfer and compute data at CPU compared to sending a new instruction every time)
Instead of having one matrix per layer, I thought of a way ... to store all of my weights in a single array
It's hard to see why you would opt-ed for the latter as it requires you to create a bespoke indexing function. Maybe its nicer to think about all your weights being in one long array place? However I would argue the mental load required to maintain the array mapping is higher than having multiple matrices dedicated to a layer.
A hash-table like structure of matrices would be much easier to reason about
layers <- list(layer1 = [[...]], layer2 = [[...]], layerN = [[...]])
Further reading
http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
There are many factors to take into consideration in each of the approaches. I'm not familiar with R but I'm assuming matrices' buffers are represented as one-dimensional arrays in memory. (Even if they are written as two dimensional arrays in the underlying C implementation the compiler stores it as one-dimensional array in memory)
The overall outline of memory operations are:
Case: Several matrices per layers
Allocation of matrices:
Accessing of indices:
Case: One matrix for all layers + index calculation
Allocation of matrix cost:
Accesing each of the indices cost:
Function cost:
We can clearly see that the second case, scales better, even though there's the additional cost of the function call.
Having said that, in general having a statically allocated array with all the weights for all the layers, should be faster.
In most cases, computers's bottleneck is memory bandwidth, and the best way to counteract this is to minimize the number of memory accesses.
With this in mind there's another more primitive reason why the 2nd approach will probably be faster: Caches.
Here's a good explanation of the performance difference in accesing a two-dimensional array in a loop by Good Ol' Bob Martin
TL; DR: Caches take advantage of the principle of locality, and therefore, having memory accesses spatially close to each other (as you would in one single array and accessing them in a cache-friendly way as explained in Bob Martin's answer) renders better performance than having them spatially separated (having them in several distinct arrays).
PS: I also recommend to benchmark both approaches and compare, since these nuances regarding the cache are machine-dependent. It might be the case that the Dataset/NN is small enough to fit completely in RAM or even in cache? in a very powerful server.
I'm sure you want to use some kind of native array objects, so you get the speedups provided by BLAS/LAPACK implementations (see eg Intel MKL discussion here if you're on Windows). Most of the time in NN evaluation will be spent in matrix multiplications (like SGEMM), and this is where BLAS implementations like Intel MKL can be an order of magnitude faster.
That is - even if the hand-coded indices for your single-array multi-layer network were super fast, you won't be able to use it with the optimised multiplication routines, which would make your whole network significantly slower. Use the native array objects and create a multi layer abstraction on top of them.
But actually if you want speed and usability (and to really build some NN models), you should consider using something like R interface to TensorFlow. As a bonus you'll get things like running on the GPU for free.
Nice puzzle.. If you are asking calculating index in which would happen in runtime for which it needs to be compiled. Just want to understand how would you let the compiler compute it? IF you have a need to playing with the info anytime later then I would suggest to use Hasmap kind of mechanism. I had done it for a similar need.

Slepc shell matrices with multiple processes

I currently use explicit matrix storage for my generalized Eigenvalue equation of the form $AX = \lambda BX$ with eigenvalue lambda and eigenvector $X$. $A$ and $B$ are pentadiagonal by blocks, Hermitian and every block is Hermitian as well.
The problem is that for large simulations memory usage gets out of hand. I would therefore like to switch to Shell matrices. An added advantage would be that then I can avoid the duplication of a lot of information, as A and B are both filled through finite differences. I.e., the first derivative of a function X can be approximated by $X_i' = \frac{X_{i+1}-X_{i-1}}{\Delta}$, so that the same piece of information appears in two places. It gets (much) worse for higher orders.
When I try to implement this in Fortran, using multiple MPI processes that each contain a part of the rows of $A$ and $B$, I stumble upon the following issue: To perform matrix multiplication, one needs the vector information of $X$ from other ranks at the end of each rank's interval, due to the off-diagonal elements of $A$ and $B$.
I found a conceptual solution by using MPI all to all commands that pass the information from these "ghosted" regions to the ranks next-door. However, I fear that this might not be most portable, and also not too elegant.
Is there any way to automate this process of taking the information from ghost zones in Petsc / Slepc?

Parallelized multidimensional numerical integration in R?

I am working on a problem that needs a numerical integration of a bivariate function, where each evaluation of the function takes about 1 minute. Since numerical integration on a single core would evaluate the function thousands to tens of thousand times, I would like to parallelize the calculation. Right now I am using the bruteforce approach that calculates a naive grid of points and add them up with appropriate area multipliers. This is definitely not efficient and I suspect any modern multidimensional numerical integration algorithm would be able to achieve the same precision with a lot fewer function evaluations. There are many packages in R that would calculate 2-d integration much more efficiently and accurately (e.g. R2Cuba), but I haven't found anything that can be easily parallelized on a cluster with SGE managed job queues. Since this is only a small part of a bigger research problem, I would like to see if this can be done with reasonable effort , before I try to parallelize one of the cubature-rule based methods in R by myself.
I have found that using sparse grid achieves the best compromise between speed and accuracy in multi-dimensional integration, and it's easily parallized on the cluster because it doesn't involve any sequential steps. It won't be as accurate as other sequentially adpative integration algorithms but it is much better than the naive method because it provides a much sparser grid of points to calculate on each core.
The following R code deals with 2-dimensional integration, but can be easily modified for higher dimensions. The apply function towards the end can be easily parallelized on a cluster.
sg.int<-function(g,...,lower,upper)
{ require("SparseGrid")
lower<-floor(lower)
upper<-ceiling(upper)
if (any(lower>upper)) stop("lower must be smaller than upper")
gridss<-as.matrix(expand.grid(seq(lower[1],upper[1]-1,by=1),seq(lower[2],upper[2]-1,by=1)))
sp.grid <- createIntegrationGrid( 'KPU', dimension=2, k=5 )
nodes<-gridss[1,]+sp.grid$nodes
weights<-sp.grid$weights
for (i in 2:nrow(gridss))
{
nodes<-rbind(nodes,gridss[i,]+sp.grid$nodes)
weights<-c(weights,sp.grid$weights)
}
gx.sp <- apply(nodes, 1, g,...)
val.sp <- gx.sp %*%weights
val.sp
}

A framework for comparing the time performance of Expectation Maximization

I have my own implementation of the Expectation Maximization (EM) algorithm based on this paper, and I would like to compare this with the performance of another implementation. For the tests, I am using k centroids with 1 Gb of txt data and I am just measuring the time it takes to compute the new centroids in 1 iteration. I tried it with an EM implementation in R, but I couldn't, since the result is plotted in a graph and gets stuck when there's a large number of txt data. I was following the examples in here.
Does anybody know of an implementation of EM that can measure its performance or know how to do it with R?
Fair benchmarking of EM is hard. Very hard.
the initialization will usually involve random, and can be very different. For all I know, the R implementation by default uses hierarchical clustering to find the initial clusters. Which comes at O(n^2) memory and most likely at O(n^3) runtime cost. In my benchmarks, R would run out of memory due to this. I assume there is a way to specify initial cluster centers/models. A random-objects initialization will of course be much faster. Probably k-means++ is a good way to choose initial centers in practise.
EM theoretically never terminates. It just at some point does not change much anymore, and thus you can set a threshold to stop. However, the exact definition of the stopping threshold varies.
There exist all kinds of model variations. A method only using fuzzy assignments such as Fuzzy-c-means will of course be much faster than an implementation using multivariate Gaussian Mixture Models with a covaraince matrix. In particular with higher dimensionality.
Covariance matrixes also need O(k * d^2) memory, and the inversion will take O(k * d^3) time, and thus is clearly not appropriate for text data.
Data may or may not be appropriate. If you run EM on a data set that actually has Gaussian clusters, it will usually work much better than on a data set that doesn't provide a good fit at all. When there is no good fit, you will see a high variance in runtime even with the same implementation.
For a starter, try running your own algorithm several times with different initialization, and check your runtime for variance. How large is the variance compared to the total runtime?
You can try benchmarking against the EM implementation in ELKI. But I doubt the implementation will work with sparse data such as text - that data just is not Gaussian, it is not proper to benchmark. Most likely it will not be able to process the data at all because of this. This is expected, and can be explained from theory. Try to find data sets that are dense and that can be expected to have multiple gaussian clusters (sorry, I can't give you many recommendations here. The classic Iris and Old Faithful data sets are too small to be useful for benchmarking.

Parallel Forward-Backward Algorithm for Hidden Markov Model

As a side project, I want to implement a Hidden Markov Model for my NVidia graphics card so that I can have it execute quickly and using many cores.
I'm looking at the Forward-Backward algorithm and was wondering what is there that I can make parallel here? If you look at the forward part of the algorithm for instance, the matrix multiplications can be divided up to be done in parallel, but can the iterative parts of the algorithm that depend on the previous step be parallelized in any way? Is there some kind of a mathematical trick that can be applied here?
Thanks,
mj
http://en.wikipedia.org/wiki/Forward%E2%80%93backward_algorithm#Example
You are correct in your assessment - you can parallelize the matrix multiplications (i.e. across states), but you can't parallelize the recursive steps. I just made a blog post on my work with HMMs and GPUs. Check it out here:
http://sgmustadio.wordpress.com/2012/02/27/hidden-markov-models-in-cuda-gpu/
If you are still working on this project, you may want to check out HMMlib and parredHMMlib.
sgmustadio is right to point out that you cannot parallelize recursive steps, but it seems that these authors have come up with a clever way to reduce the Forward and Viterbi algorithms to a series of matrix multiplications and reductions.

Resources