Numerical optimization with MPI

I am trying to parallelize an optimization routine using MPI. The structure of the program is roughly like the block diagram at the end of the text: data is fed to the optimization routine, which calls an objective-function subroutine and another subroutine that calculates a matrix called the "Jacobian". The optimization routine iterates as many times as needed to reach a minimum of the objective function and exits with a result. The Jacobian is used to decide in which direction the minimum might lie and to take a step in that direction.

I don't have control over the optimization routine; I only supply the objective function and the function calculating the Jacobian. Most of the time is spent calculating the Jacobian. Since each matrix element of the Jacobian is independent of the rest, it seems like a good candidate for parallelization. However, I haven't been able to accomplish this. Initially I thought I could distribute the calculation of the Jacobian over a large number of nodes, each of which would calculate only some of the matrix elements. I did that, but after just one iteration all the threads on the nodes exit and the program stalls.

I am starting to think that without the source code of the optimization routine this might not be possible. The reason is that distributing the code over multiple nodes and instructing them to calculate only a fraction of the Jacobian messes up the optimization on all of them except the master. Is there a way around this, using MPI and without touching the code in the optimization routine? Can only the function calculating the Jacobian be executed on all nodes except the master? How would you do this?

It turned out to be easier than I thought. As explained in the question, the worker threads were exiting after just one iteration. The solution is to enclose the Jacobian calculation executed by the workers in an infinite while loop, and to break out of it with a message sent from the main thread (master) once the optimizer exits with the answer.
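A minimal sketch of this master/worker pattern, written with mpi4py for brevity; the routine names, problem sizes, and row decomposition below are illustrative assumptions, not the original code:

```python
# Workers loop forever evaluating their share of the Jacobian; the master
# broadcasts a "stop" command once the (black-box) optimizer has finished.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N_PARAMS, N_RESID = 4, 8                     # illustrative problem size

def partial_jacobian(x, rows):
    """Compute only the Jacobian rows owned by this rank (placeholder)."""
    J = np.zeros((len(rows), N_PARAMS))
    # ... finite-difference each row in `rows` here ...
    return J

def jacobian(x):
    """Supplied to the optimizer; called on the master only."""
    comm.bcast("eval", root=0)               # wake the workers up
    comm.Bcast(x, root=0)                    # send them the current point
    my_rows = list(range(rank, N_RESID, size))
    pieces = comm.gather(partial_jacobian(x, my_rows), root=0)
    J = np.zeros((N_RESID, N_PARAMS))
    for r, block in enumerate(pieces):       # reassemble the row blocks
        J[list(range(r, N_RESID, size)), :] = block
    return J

if rank == 0:
    x0 = np.zeros(N_PARAMS)
    # result = optimizer(objective, jacobian, x0)   # the black-box routine
    comm.bcast("stop", root=0)               # release the workers afterwards
else:
    while True:                              # workers never return on their own
        cmd = comm.bcast(None, root=0)
        if cmd == "stop":
            break
        x = np.empty(N_PARAMS)
        comm.Bcast(x, root=0)
        my_rows = list(range(rank, N_RESID, size))
        comm.gather(partial_jacobian(x, my_rows), root=0)
```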

Related

OpenMDAO: conditional statement depending on the number of iterations

During an optimisation using OpenMDAO, is there any way to access the current iteration count, or the values that the design variables took in previous iterations?
I would like to create a conditional statement that depends on the iteration number.
I have created a continuous function representing discrete points linked by exponential functions. I would like to increase the exponent of the intermediate function with the number of iterations, so that intermediate values are penalised and the optimisation converges close to one of the discrete values.
Thank you in advance.
What you are describing sounds like a form of continuation/smoothing. I can suggest two different approaches:
Set a reasonable max-iteration limit on the optimizer and add an outer for-loop around the call to run_driver (a minimal sketch is given after this list). You could even adapt the iteration limit after each stopping point is reached: start with a very small iteration limit, and let it grow as you converge.
Pros:
fairly simple to implement
uses existing OpenMDAO Driver APIs
Cons:
Limited ability to set your own stopping conditions (you really only have the iteration limit)
Restarting the optimization does not preserve the prior Hessian approximation, which may lead to poor convergence for quasi-Newton methods
Skip the OpenMDAO driver interface, and roll your own. This approach was suggested in the 2020 OpenMDAO Reverse Hackathon, for users who find that the OpenMDAO Driver interface doesn't meet their needs.
Pros:
A lot more flexibility
total control
Cons:
A lot more work
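A minimal sketch of the first approach, assuming OpenMDAO's ScipyOptimizeDriver and a toy paraboloid model; the growth schedule and the commented-out penalty exponent are illustrative assumptions:

```python
import openmdao.api as om

# Toy model: minimize f(x, y) = (x - 3)^2 + x*y + (y + 4)^2
prob = om.Problem()
prob.model.add_subsystem('comp',
                         om.ExecComp('f = (x - 3.0)**2 + x*y + (y + 4.0)**2'),
                         promotes=['*'])
prob.model.add_design_var('x', lower=-50, upper=50)
prob.model.add_design_var('y', lower=-50, upper=50)
prob.model.add_objective('f')

prob.driver = om.ScipyOptimizeDriver()
prob.driver.options['optimizer'] = 'SLSQP'
prob.setup()

max_iter = 2                                  # start with a very small budget
for outer in range(5):                        # outer continuation loop
    prob.driver.options['maxiter'] = max_iter
    # A hypothetical place to stiffen a penalty exponent with `outer`, e.g.
    # prob.set_val('penalty_exponent', 2.0 + outer)
    prob.run_driver()
    max_iter *= 2                             # let the budget grow each restart

print(prob.get_val('f'))
```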

Why can't a recursive algorithm be parallelized efficiently?

It is very tough to convert sequential code that uses recursion into equivalent parallel code written in OpenMP, CUDA, or MPI.
Why is that?
If a piece of code has been written as a recursive algorithm, there is a good chance that the calculations performed at each level of recursion depend on the results of the next level. This implies that it is hard to carry out the calculations from different recursive steps in parallel.
Another way of thinking about this is to imagine flattening the recursion into iteration (see, for example, Can every recursion be converted into iteration?). A recursive algorithm is likely to produce a flattened version in which each iteration depends on the results of other iterations, making it hard to run the iterations in parallel.
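A toy illustration of that flattening (the function is made up for the example): each iteration of the loop needs the value written by the previous one, so the iterations cannot run concurrently.

```python
# Recursive form: level n needs the result of level n-1.
def f_recursive(n):
    if n == 0:
        return 1.0
    return 0.5 * f_recursive(n - 1) + 1.0

# Flattened form: a loop with a carried dependence on `acc`.
def f_iterative(n):
    acc = 1.0
    for _ in range(n):            # each pass reads what the last pass wrote
        acc = 0.5 * acc + 1.0
    return acc

assert abs(f_recursive(20) - f_iterative(20)) < 1e-12
```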

Slepc shell matrices with multiple processes

I currently use explicit matrix storage for my generalized eigenvalue equation of the form $AX = \lambda BX$, with eigenvalue $\lambda$ and eigenvector $X$. $A$ and $B$ are pentadiagonal by blocks and Hermitian, and every block is Hermitian as well.
The problem is that for large simulations the memory usage gets out of hand. I would therefore like to switch to shell matrices. An added advantage is that I can then avoid duplicating a lot of information, as $A$ and $B$ are both filled through finite differences. I.e., the first derivative of a function $X$ can be approximated by $X_i' = \frac{X_{i+1}-X_{i-1}}{\Delta}$, so the same piece of information appears in two places. It gets (much) worse for higher orders.
When I try to implement this in Fortran, using multiple MPI processes that each hold a part of the rows of $A$ and $B$, I stumble upon the following issue: to perform the matrix multiplication, each rank needs the entries of $X$ owned by other ranks at the ends of its interval, because of the off-diagonal elements of $A$ and $B$.
I found a conceptual solution using MPI all-to-all commands that pass the information from these "ghosted" regions to the neighbouring ranks. However, I fear that this might not be the most portable approach, nor a very elegant one.
Is there any way to automate this process of taking the information from the ghost zones in PETSc/SLEPc?
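For reference, a sketch of the hand-rolled ghost exchange described above, written with mpi4py for brevity (the original code is Fortran); the array sizes and stencil width are made up. PETSc's ghosted vectors (e.g. VecCreateGhost, or the local vectors of a DMDA) are designed to automate exactly this kind of halo update, so they may be worth checking.

```python
# Each rank swaps the edge entries of its slice of X with its neighbours
# before applying its local block rows of A and B.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local, width = 100, 2                    # rows per rank, stencil half-width
x_local = np.random.rand(n_local)          # this rank's slice of X
ghost_lo = np.zeros(width)                 # halo from the rank below
ghost_hi = np.zeros(width)                 # halo from the rank above

lo = rank - 1 if rank > 0 else MPI.PROC_NULL
hi = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send my top edge up while receiving my lower halo, then the reverse.
comm.Sendrecv(sendbuf=x_local[-width:], dest=hi, recvbuf=ghost_lo, source=lo)
comm.Sendrecv(sendbuf=x_local[:width],  dest=lo, recvbuf=ghost_hi, source=hi)

# The local matrix-vector product would then act on
# np.concatenate([ghost_lo, x_local, ghost_hi]).
```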

Kalman Filter with OpenCL

I'm a new OpenCL programmer who has just gotten into GPGPU computing, and I'm using an NVIDIA Quadro 600.
I'm doing research work based on GPU programming; my aim is to write a simple Kalman filter kernel for OpenCL using a SIMT approach. I found this document which describes how it could be done with CUDA, and I think a similar approach applies to OpenCL.
The basic operations performed by a Kalman filter on a linear system are three equations, each involving matrix manipulation, which compute the Kalman gain matrix ($K$), the state estimate ($\tilde{x}$), and the updated error covariance matrix ($P$) for the next estimate. These three steps are iterated for each measurement taken from the real system.
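For reference, the three steps above are the standard discrete Kalman filter update, written here in the usual textbook notation (an assumption, since the linked document is not reproduced): with measurement matrix $H$, measurement noise covariance $R$, prior estimate $\hat{x}_k^-$ and prior covariance $P_k^-$,

$$K_k = P_k^- H^T \left(H P_k^- H^T + R\right)^{-1}$$
$$\hat{x}_k = \hat{x}_k^- + K_k \left(z_k - H \hat{x}_k^-\right)$$
$$P_k = \left(I - K_k H\right) P_k^-$$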
Following the SIMT approach, I thought of executing one iteration of the Kalman filter on each GPU thread within a thread block, sending each thread the values needed to compute the iteration (the input and output from the real-system measurement, and the linear-system matrices).
Is there a better design I could consider for implementing this algorithm with OpenCL?
Is it possible (and useful) to perform the matrix operations in a separate kernel, in a parallel manner?
Another question: assume we have $k$ iterations; at each iteration $k$ we calculate $P$ for step $k+1$, taking as input the $P$ for step $k$ from iteration $k-1$... Since each thread computes one iteration, is it possible to synchronize thread $k$ so that it waits for the $P$ matrix from thread $k-1$?
UPDATE: After much searching and experimenting, I have concluded that it is impossible to adapt my implementation (as described above) to the way OpenCL operates. The only way I found is to parallelize each individual matrix operation, perhaps using multiple GPUs to compute the matrix operations simultaneously. Real efficiency with this implementation would only be achieved for a large linear system to filter (that is, it works well with big matrices).

Parallel Forward-Backward Algorithm for Hidden Markov Model

As a side project, I want to implement a Hidden Markov Model for my NVIDIA graphics card so that it executes quickly, using many cores.
I'm looking at the Forward-Backward algorithm and was wondering what I can make parallel here. Looking at the forward part of the algorithm, for instance, the matrix multiplications can be divided up and done in parallel, but can the iterative parts of the algorithm that depend on the previous step be parallelized in any way? Is there some kind of mathematical trick that can be applied here?
Thanks,
mj
http://en.wikipedia.org/wiki/Forward%E2%80%93backward_algorithm#Example
You are correct in your assessment - you can parallelize the matrix multiplications (i.e. across states), but you can't parallelize the recursive steps. I just made a blog post on my work with HMMs and GPUs. Check it out here:
http://sgmustadio.wordpress.com/2012/02/27/hidden-markov-models-in-cuda-gpu/
If you are still working on this project, you may want to check out HMMlib and parredHMMlib.
sgmustadio is right to point out that you cannot parallelize recursive steps, but it seems that these authors have come up with a clever way to reduce the Forward and Viterbi algorithms to a series of matrix multiplications and reductions.
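To make the structure concrete, a small NumPy sketch of the forward pass written as matrix operations (standard HMM notation, assumed rather than taken from the linked posts): the work inside each time step parallelizes across states, but the loop over time stays sequential.

```python
import numpy as np

def forward(A, B, pi, obs):
    """P(observations | model): A = transitions, B = emissions, pi = initial."""
    alpha = pi * B[:, obs[0]]                 # alpha_0
    for o in obs[1:]:                         # inherently sequential over time
        alpha = (A.T @ alpha) * B[:, o]       # parallelizable across states
    return alpha.sum()

# Tiny example: 2 hidden states, 3 observation symbols.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward(A, B, pi, [0, 1, 2, 1]))
```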
