Error in MPI broadcast - mpi

Sorry for the long post. I did read some other MPI broadcast-related questions, but I couldn't figure out why my program is failing.
I am new to MPI and I am facing this problem. First I will explain what I am trying to do:
My declarations:
ROWTAG 400
COLUMNTAG 800
Create a 2 X 2 Cartesian topology.
Rank 0 has the whole matrix. It wants to distribute parts of the matrix to all the processes in the 2 x 2 Cartesian topology. For now, instead of a matrix I am just dealing with integers. So for process P(i,j) in the 2 x 2 Cartesian topology (i - row, j - column), I want it to receive
(ROWTAG + i) in one message and (COLUMNTAG + j) in another message.
My strategy to do so is:
Processes: P(0,0) , P(0,1), P(1,0), P(1,1)
P(0,0) has all the initial data.
P(0,0) sends (ROWTAG+1) (in this case 401) to P(1,0) - in essence, P(1,0) is responsible for distributing information related to row 1 to all the processes in row 1 - I just used a blocking send
P(0,0) sends (COLUMNTAG+1) (in this case 801) to P(0,1) - in essence, P(0,1) is responsible for distributing information related to column 1 to all the processes in column 1 - used a blocking send
For each process, I made a row_group containing all the processes in that row and out of this created a row_comm (communicator object)
For each process, I made a col_group containing all the processes in that column and out of this created a col_comm (communicator object)
At this point, P(0,0) has given information related to row 'i' to process P(i,0), and P(0,0) has given information related to column 'j' to P(0,j). I call P(i,0) and P(0,j) the row_head and col_head respectively.
For process P(i,j), P(i,0) gives information related to row i, and P(0,j) gives information related to column j.
I used a broadcast call:
MPI_Bcast(&row_data,1,MPI_INT,row_head,row_comm)
MPI_Bcast(&col_data,1,MPI_INT,col_head,col_comm)
Please find my code here: http://pastebin.com/NpqRWaWN
Here is the error I see:
* An error occurred in MPI_Bcast
on communicator MPI COMMUNICATOR 5 CREATE FROM 3
MPI_ERR_ROOT: invalid root
* MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Also please let me know if there is any better way to distribute the matrix data.

There are several errors in your program. First, row_Ranks is declared with one element too few, and when writing to it you possibly overwrite other stack variables:
int col_Ranks[SIZE], row_Ranks[SIZE-1];
// ^^^^^^
On my test system the program just hangs because of that.
Second, you create new subcommunicators out of matrixComm, but when performing the broadcast you use rank numbers from matrixComm to address processes in those subcommunicators. That doesn't work. For example, in a 2x2 Cartesian communicator the ranks range from 0 to 3, but any column- or row-wise subgroup contains only two processes, with ranks 0 and 1 - there is neither rank 2 nor rank 3. If you take a look at the value of row_head across the ranks, it is 2 in two of them, hence the error.
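One way to fix it is to translate the root's rank from matrixComm into its rank inside the sub-communicator before broadcasting. A minimal sketch, reusing the variable names from your code and omitting error checking:
MPI_Group matrix_group, row_group;
MPI_Comm_group(matrixComm, &matrix_group);
MPI_Comm_group(row_comm, &row_group);

int root_in_matrix = row_head;   // rank of the row head in matrixComm
int root_in_row;                 // the same process, but ranked inside row_comm
MPI_Group_translate_ranks(matrix_group, 1, &root_in_matrix,
                          row_group, &root_in_row);

MPI_Bcast(&row_data, 1, MPI_INT, root_in_row, row_comm);
Do the same translation for col_head before the column broadcast. Alternatively, depending on how you built the subgroups, the row head may simply end up as rank 0 of row_comm, in which case passing 0 as the root works as well.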
For a much better way to distribute the data, you should refer to this extremely informative answer.

Related

LDPC: Hard_decision decoding algorithm

I am working with Bit-flipping decoding [Hard_decision] for one bit flip.
I have followed this description for the H matrix below: "For the bit-flipping algorithm the messages passed along the Tanner graph edges are also binary: a bit node sends a message declaring if it is a one or a zero, and each check node sends a message to each connected bit node, declaring what value the bit is based on the information available to the check node.
The check node determines that its parity-check equation is satisfied if the modulo-2 sum of the incoming bit values is zero. If the majority of the messages received by a bit node are different from its received value the bit node changes (flips) its current value. This process is repeated until all of the parity-check equations are satisfied.
Correct codeword = [10010101]
H matrix[4][8] = {{0,1,0,1,1,0,0,1},{1,1,1,0,0,1,0,0},{0,0,1,0,0,1,1,1}, {1,0,0,1,1,0,1,0}};
ReceivedCodeWord[8]={1,0,1,1,0,1,0,1}; //Error codeword
I need to get [10010101], but instead I am getting [10010001] for ReceivedCodeWord[8] = {1,0,1,1,0,1,0,1}.
But for other possible ReceivedCodeWords I am getting the correct result,
e.g. for ReceivedCodeWord [00010101] I am getting the correct codeword [10010101],
and for ReceivedCodeWord [11010101] I am getting the correct codeword [10010101].
Doubt: why am I getting [10010001] for ReceivedCodeWord {1,0,1,1,0,1,0,1}? It's totally wrong. Please explain.
Here's a link
Thanks
This is caused by the linear block code you used. Firstly, this code isn't an LDPC code, since the H matrix doesn't satisfy the row-column constraint (e.g., the 3rd column and the 6th column have 1s in the same two positions).
(1) When {1,0,1,1,0,1,0,1} is received, the check sums are {0,1,1,0}. If you use a multiple-bit-flipping decoding algorithm, which means more than one bit can be flipped during one iteration, the 3rd and 6th variable nodes will be flipped together since both of them connect to the two unsatisfied check nodes, so the result after the first iteration will be {1,0,0,1,0,0,0,1}. During the second iteration the check sums are still {0,1,1,0}, so the 3rd and 6th variable nodes will be flipped again, and this process repeats in later iterations.
(2) The minimum Hamming distance of this code is 2, so it can detect 1 error but correct none ((d_min - 1)/2 = 0.5, i.e. fewer than one). Even if you use a single-bit-flipping decoding algorithm, there is a 50% probability that the received codeword {1,0,1,1,0,1,0,1} will be decoded as another valid codeword, {1,0,1,1,0,0,0,1}.
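If it helps, here is a small stand-alone C sketch (illustration only, not a full decoder) that reproduces the check sums from point (1) and shows that the 3rd and 6th bits (indices 2 and 5 counting from zero) each sit in both unsatisfied checks:
#include <stdio.h>

int main(void) {
    int H[4][8] = {{0,1,0,1,1,0,0,1},
                   {1,1,1,0,0,1,0,0},
                   {0,0,1,0,0,1,1,1},
                   {1,0,0,1,1,0,1,0}};
    int r[8] = {1,0,1,1,0,1,0,1};            /* received word from the question */
    int syndrome[4], unsatisfied[8] = {0};

    /* modulo-2 check sums: prints 0 1 1 0 */
    for (int i = 0; i < 4; i++) {
        int s = 0;
        for (int j = 0; j < 8; j++) s ^= (H[i][j] & r[j]);
        syndrome[i] = s;
        printf("check %d = %d\n", i, s);
    }
    /* count unsatisfied checks per bit: bits 2 and 5 (0-based) both get 2,
       so a multiple-bit-flipping decoder flips them together and oscillates */
    for (int j = 0; j < 8; j++)
        for (int i = 0; i < 4; i++)
            if (H[i][j] && syndrome[i]) unsatisfied[j]++;
    for (int j = 0; j < 8; j++)
        printf("bit %d appears in %d unsatisfied checks\n", j, unsatisfied[j]);
    return 0;
}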

C++ functions and MPI programing

From what I have learned in my supercomputing class, I know that MPI is a communication (and data-passing) interface.
I'm confused about what happens when you run a function in a C++ program and want each processor to perform a specific task.
For example, take a prime number search (very popular for supercomputers). Say I have a range of values (531-564, some arbitrary range) and 50 processes on which I could run a series of evaluations for each number. If root (process 0) wants to examine 531, I can use 8 processes (1-8) to evaluate its prime status. If the number is divisible by any number from 2 to 9 with a remainder of 0, then it is not prime.
Is it possible, with MPI passing data to each process, to have these processes perform these actions?
The hardest part for me is understanding that if I perform an action in the original C++ program, the work could be spread across several different processes; how do I structure this in MPI? Or is my understanding completely wrong? If so, how am I supposed to think about this correctly?
The big idea is passing data to a process versus sending a function to a process. I'm fairly certain I'm wrong, but I'm trying to backtrack to fix my thinking.
Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather, where the "master" process (usually the one with rank 0) will split an array of work equally amongst the peers (including the master process itself) by having them all call scatter, the peers will do the work, then all peers will call gather to send the results back to master.
In your prime algorithm example, build an array of integers, "scatter" it to all the peers, have each peer run through its piece saving 1 if the value is prime and 0 if it is not, then "gather" the results back to the master. [In this particular example, since the input data is completely predictable based on process rank, the scatter step is unnecessary, but we will do it anyway.]
As pseudo-code:
main():
    int n = 100, x[n]
    MPI_init()
    // prepare data on master
    if rank == 0:
        for i in 1 ... n, x[i] = i
    // scatter n/k elements of x on root to local on each of the k processes in world
    MPI_scatter(x, n/k, int, local, n/k, int, root, world)
    for i in 1 ... n/k:
        result[i] = 1                         // assume prime
        if 2 divides local[i], result[i] = 0
        if 3 divides local[i], result[i] = 0
        if 5 divides local[i], result[i] = 0
        if 7 divides local[i], result[i] = 0
    // gather the n/k results from local on each process in world back to x on root
    MPI_gather(result, n/k, int, x, n/k, int, root, world)
    // print results
    if rank == 0:
        for i in 1 ... n, print i if x[i] == 1
    MPI_finalize()
There are lots of details to fill in, such as proper declarations, dealing with the fact that some ranks will have fewer elements than others, using proper C syntax, etc., but getting them right doesn't help explain the overall picture.
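For what it's worth, here is one way those details could be filled in - a minimal sketch in plain C, assuming n divides evenly by the number of processes; the restriction to the divisors 2, 3, 5 and 7 is carried over from the pseudo-code and is only a valid primality test for values below 121:
// sketch.c - compile with mpicc, run e.g. with: mpirun -np 4 ./sketch
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    const int n = 100;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = n / size;                 /* assumes n % size == 0 */
    int *x = NULL;
    if (rank == 0) {                      /* prepare data on master */
        x = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) x[i] = i + 1;
    }

    int *local  = malloc(chunk * sizeof(int));
    int *result = malloc(chunk * sizeof(int));
    MPI_Scatter(x, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    int divs[4] = {2, 3, 5, 7};           /* enough for values below 121 */
    for (int i = 0; i < chunk; i++) {
        int v = local[i], prime = (v > 1);
        for (int d = 0; d < 4; d++)
            if (v % divs[d] == 0 && v != divs[d]) prime = 0;
        result[i] = prime;
    }

    MPI_Gather(result, chunk, MPI_INT, x, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < n; i++)
            if (x[i]) printf("%d is prime\n", i + 1);
        free(x);
    }
    free(local);
    free(result);
    MPI_Finalize();
    return 0;
}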
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write since the different processes may be in different states. In particular, it is important that if process a is calling MPI_send to process b, then process b had better be calling MPI_recv from a.
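As an illustration (a minimal sketch assuming at least two processes), a matched pair of calls looks like this; if either side omits its call, the program simply hangs:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* from rank 0, tag 0 */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}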

In MPI how to communicate a part of a "shared" array to all other ranks?

I have an MPI program with some array of data. Every rank needs all the array to do its work, but will only work on a patch of the array. After a calculation step I need every rank to communicate its computed piece of the array to all other ranks.
How do I achieve this efficiently?
In pseudo code I would do something like this as a first approach:
if rank == 0:                              // only master rank
    initialise_data()
end if
MPI_Bcast(all_data, 0)                     // from master to every rank
compute which part of the data to work on
for ( several steps ):                     // each rank
    execute_computation(part_of_data)
    for ( each rank ):
        MPI_Bcast(part_of_data, rank_number)   // from every rank to every rank
    end for
end for
The disadvantage is that there are as many broadcasts, i.e. barriers, as there are ranks. So how would I replace the MPI_Bcasts?
edit: I just might have found a hint... Is it MPI_Allgather I am looking for?
Yes, you are looking for MPI_Allgather. Note that recvcount is not the length of the whole receive buffer, but the amount of data to be received from each single process. Similarly, in MPI_Allgatherv recvcount[i] is the amount of data you want to receive from the i-th process. Moreover, recvcount should be equal to (not less than) the respective sendcount. I tested it on my implementation (Open MPI), and if I tried to receive fewer elements than were sent, I got an MPI_ERR_TRUNCATE error.
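A minimal sketch of what that looks like (the chunk size of 4 and the data are made up for illustration):
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define CHUNK 4   /* number of elements each rank contributes */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int part[CHUNK];
    for (int i = 0; i < CHUNK; i++)
        part[i] = rank * CHUNK + i;              /* this rank's piece of the array */

    int *all = malloc(CHUNK * size * sizeof(int));
    MPI_Allgather(part, CHUNK, MPI_INT,          /* send my CHUNK elements ...     */
                  all, CHUNK, MPI_INT,           /* ... and receive CHUNK per rank */
                  MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < CHUNK * size; i++) printf("%d ", all[i]);

    free(all);
    MPI_Finalize();
    return 0;
}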
Also, in some rare cases I have used MPI_Allreduce for that purpose. For example, if we have the following arrays:
process0: AA0000
process1: 0000BB
process2: 00CC00
then we can do an Allreduce with the MPI_SUM operation and get AACCBB in all processes. Obviously, the same trick can be done with ones instead of zeros and MPI_PROD instead of MPI_SUM.
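A sketch of that trick (the array length and values are made up; it assumes the length divides evenly by the number of ranks):
#include <stdio.h>
#include <mpi.h>

#define N 6   /* total array length, assumed divisible by the number of ranks */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local[N] = {0}, full[N];
    int chunk = N / size;
    for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
        local[i] = rank + 1;                  /* fill only my own slice, zeros elsewhere */

    /* element-wise sum across ranks: the zeros leave the other slices untouched,
       so every rank ends up holding the complete array */
    MPI_Allreduce(local, full, N, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < N; i++) printf("%d ", full[i]);

    MPI_Finalize();
    return 0;
}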

Generalization of MPI rank number to MPI groups?

Is there any generalization of rank number to group numbers? For my code I would like to create a hierarchical decomposition of MPI::COMM_WORLD. Assume we make use of 16 threads. I use MPI::COMM_WORLD.Split to create 4 communicators each having 4 ranks. Is there now an MPI function that provides some unique ids to the corresponding four groups?
Well, you can still refer to each process by its original rank in MPI_COMM_WORLD. You also have complete control over what rank each process receives in its new communicator via the color and key arguments of MPI_Comm_split(). That is plenty of information to create a mapping between old ranks and new groups/ranks.
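For instance, a minimal sketch for the 16-process case described above, where the color itself serves as the unique group id:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int world_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int group_id = world_rank / 4;            /* 0..3: unique id of my group */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, group_id, world_rank, &sub);

    int sub_rank;
    MPI_Comm_rank(sub, &sub_rank);            /* 0..3 inside the new communicator */
    printf("world rank %d -> group %d, local rank %d\n",
           world_rank, group_id, sub_rank);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}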
If you don't like #suszterpatt's answer (I do) you could always abuse a Cartesian communicator and pretend that the process at index (2,3) in the communicator is process 3 in group 2 of your hierarchical decomposition.
But don't read this and take away the impression that I recommend such abuse; it's just a thought.

How to calculate the number of messages sent in a distributed system?

Suppose we have n processes forming a general network. We don't know which are connected together, but we know the number of processes (n). If at each round a process sends a message to all processes it is connected to and receives 1 message from each of them, and the program executes for r rounds, is there a way to find how many messages have been sent during the program execution?
As you have pointed out, without the exact network structure it is impossible to put a specific value on the number of messages sent. Instead, we can look at its Big-O value.
Now just to be clear what we mean by Big-O:
Big-O refers to the upper bound (i.e. the worst possible case) complexity
It is possible (and quite likely in real systems) that the actual value will be less
Without some function that describes the average case (e.g. each process is connected to, on average, N / 2 other processes) we must assume the worst case
By "worst case" for this problem we mean that the maximum number of messages is sent
So let us assume the worst case, in which each process is connected to N - 1 other processes.
Let us also define some variables:
S := the set of processes
N := the number of processes in S
We can represent the set S as a complete, undirected graph (every node connects to every other node) in which each node corresponds to a process and each edge corresponds to 2 messages sent (one outgoing transmission and one reply).
From here, we see that the number of edges in a complete graph is (N(N-1))/2
So in the worst case, the number of messages sent is N(N-1), or N^2 - N.
Now because we are dealing with Big-O notation, we are interested in how this value grows as a function of N.
Since the linear term is dominated by the quadratic one, N^2 - N is an element of O(N^2).
So the number of messages sent grows as N^2 in the worst case.
It is also possible to arrive at this result using an adjacency matrix, that is, an N x N matrix where a 1 in the (i, j)-th element denotes an edge from node i to node j.
Because in the original problem each process sends a single message to all connected processes, which respond with a single message, we can see that for every pair (i, j) and (j, i) there will be an edge (one representing an outgoing message, one a reply). The exception will be pairs where i = j, i.e. a process doesn't send itself a message.
So the matrix will be completely filled with 1s with the exception of the diagonal.
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0
Above: the adjacency matrix for N = 4.
So we first want to determine a formula for the total number of messages sent as a function of the number of nodes.
By the area of a rectangle, a completely filled matrix would contain N x N = N^2 ones.
Now we must consider the diagonal. The diagonal entries are exactly the pairs (i, i) for i = 1, ..., N, so there are exactly N of them.
So the overall result is that we have N^2 - N messages once the diagonal is excluded.
Now we use the same Big-O reasoning as above to arrive at the same conclusion: the number of messages grows in the worst case as O(N^2).
So now you need only take into account the number of rounds r that have occurred, leaving you with O(rN^2).
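As a quick sanity check of the arithmetic, here is a tiny sketch that counts the 1s in the N = 4 adjacency matrix above (the round count r = 3 is an arbitrary assumption):
#include <stdio.h>

int main(void) {
    const int N = 4, r = 3;               /* r rounds, chosen arbitrarily */
    int adj[4][4] = {{0,1,1,1},
                     {1,0,1,1},
                     {1,1,0,1},
                     {1,1,1,0}};
    int per_round = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            per_round += adj[i][j];       /* each 1 is one message per round */
    printf("%d messages per round (N^2 - N = %d), %d total over %d rounds\n",
           per_round, N * N - N, per_round * r, r);
    return 0;
}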
Of course now you must consider whether you really do have the worst case ...
