Difference between MPI_Allgather and MPI_Alltoall functions? - mpi

What is the main difference between the MPI_Allgather and MPI_Alltoall functions in MPI?
I mean, can someone give me examples where MPI_Allgather is helpful and MPI_Alltoall is not, and vice versa?
I am not able to understand the main difference: it looks like in both cases every process sends send_cnt elements to every other process in the communicator and receives them.
Thank You

A picture says more than thousand words, so here are several ASCII art pictures:
rank    send buf                        recv buf
----    --------                        --------
 0      a,b,c          MPI_Allgather    a,b,c,A,B,C,#,#,%
 1      A,B,C        ---------------->  a,b,c,A,B,C,#,#,%
 2      #,#,%                           a,b,c,A,B,C,#,#,%
This is just the regular MPI_Gather, only in this case all processes receive the data chunks, i.e. the operation is root-less.
rank    send buf                        recv buf
----    --------                        --------
 0      a,b,c          MPI_Alltoall     a,A,#
 1      A,B,C        ---------------->  b,B,#
 2      #,#,%                           c,C,%
(a more elaborate case with two elements per process)
rank    send buf                            recv buf
----    --------                            --------
 0      a,b,c,d,e,f        MPI_Alltoall     a,b,A,B,#,#
 1      A,B,C,D,E,F      ---------------->  c,d,C,D,%,$
 2      #,#,%,$,&,*                         e,f,E,F,&,*
(looks better if each element is coloured by the rank that sends it but...)
MPI_Alltoall works as combined MPI_Scatter and MPI_Gather - the send buffer in each process is split like in MPI_Scatter and then each column of chunks is gathered by the respective process, whose rank matches the number of the chunk column. MPI_Alltoall can also be seen as a global transposition operation, acting on chunks of data.
Is there a case when the two operations are interchangeable? To properly answer this question, one has to simply analyse the sizes of the data in the send buffer and of the data in the receive buffer:
operation      send buf size        recv buf size
---------      -------------        -------------
MPI_Allgather  sendcnt              n_procs * sendcnt
MPI_Alltoall   n_procs * sendcnt    n_procs * sendcnt
The receive buffer size is actually n_procs * recvcnt, but MPI mandates that the number of basic elements sent should be equal to the number of basic elements received, hence if the same MPI datatype is used in both send and receive parts of MPI_All..., then recvcnt must be equal to sendcnt.
It is immediately obvious that for the same size of the received data, the amount of data sent by each process is different. For the two operations to be equal, one necessary condition is that the sizes of the sent buffers in both cases are equal, i.e. n_procs * sendcnt == sendcnt, which is only possible if n_procs == 1, i.e. if there is only one process, or if sendcnt == 0, i.e. no data is being sent at all. Hence there is no practically viable case where both operations are really interchangeable. But one can simulate MPI_Allgather with MPI_Alltoall by repeating n_procs times the same data in the send buffer (as already noted by Tyler Gill). Here is the action of MPI_Allgather with one-element send buffers:
rank    send buf                        recv buf
----    --------                        --------
 0      a              MPI_Allgather    a,A,#
 1      A            ---------------->  a,A,#
 2      #                               a,A,#
And here the same implemented with MPI_Alltoall:
rank    send buf                        recv buf
----    --------                        --------
 0      a,a,a          MPI_Alltoall     a,A,#
 1      A,A,A        ---------------->  a,A,#
 2      #,#,#                           a,A,#
The reverse is not possible - one cannot simulate the action of MPI_Alltoall with MPI_Allgather in the general case.
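To make the last point concrete, here is a minimal C sketch (my own illustration, not part of the original answer) in which every rank contributes a single int: MPI_Allgather of that int and MPI_Alltoall of the same int repeated n_procs times fill the receive buffers identically.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int myval = 100 + rank;                      /* this rank's single element */
    int *dup   = malloc(nprocs * sizeof(int));   /* myval repeated n_procs times */
    int *recv1 = malloc(nprocs * sizeof(int));
    int *recv2 = malloc(nprocs * sizeof(int));
    for (int i = 0; i < nprocs; i++)
        dup[i] = myval;

    MPI_Allgather(&myval, 1, MPI_INT, recv1, 1, MPI_INT, MPI_COMM_WORLD);
    MPI_Alltoall(dup, 1, MPI_INT, recv2, 1, MPI_INT, MPI_COMM_WORLD);

    /* recv1 and recv2 are now identical on every rank */
    for (int i = 0; i < nprocs; i++)
        if (recv1[i] != recv2[i])
            printf("rank %d: mismatch at index %d\n", rank, i);

    free(dup);
    free(recv1);
    free(recv2);
    MPI_Finalize();
    return 0;
}
The results match, but the MPI_Alltoall version pushes n_procs times as much data into the network, which is why simulating MPI_Allgather this way is wasteful in practice.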

These two screenshots give a quick explanation:
MPI_Allgatherv
MPI_Alltoallv
Although this is a comparison between MPI_Allgatherv and MPI_Alltoallv, it also explains how MPI_Allgather differs from MPI_Alltoall.

While these two methods are indeed very similar, there seems to be one crucial difference between the two of them.
MPI_Allgather ends with each process having the exact same data in its receive buffer, and each process contributes a single value to the overall array. For example, if each of a set of processes needed to share some single value about their state with everyone else, each would provide their single value. These values would then be sent to everyone, so everyone would have a copy of the same structure.
MPI_Alltoall does not send the same values to each other process. Instead of providing a single value that should be shared with each other process, each process specifies one value to give to each other process. In other words, with n processes, each must specify n values to share. Then, for each processor j, its k'th value will be sent to process k's j'th index in the receive buffer. This is useful if each process has a single, unique message for each other process.
As a final note, the results of running allgather and alltoall would be the same in the case where each process filled its send buffer with the same value. The only difference would be that allgather would likely be much more efficient.

Related

C++ functions and MPI programming

From what I have learned in my supercomputing class I know that MPI is a communicating (and data passing) interface.
I'm confused about how, when you run a function in a C++ program and want each process to perform a specific task, you structure that.
For example, take a prime number search (very popular for supercomputers). Say I have a range of values (531-564, some arbitrary range) and 50 processes on which I could run a series of evaluations for each number. If root (process 0) wants to examine 531, I could use 8 processes (1-8) to evaluate its primality: if the number is divisible by any number 2-9, then it is not prime.
Is it possible, given that MPI passes data to each process, to have these processes perform these actions?
The hardest part for me is understanding that if I perform an action in the original C++ program, the work could be allocated across several different processes; how can I structure this in MPI? Or is my understanding completely wrong? If so, how am I supposed to go about this path of thinking in a correct manner?
The big idea is passing data to a process versus sending a function to a process. I'm fairly certain I'm wrong, but I'm trying to backtrack to fix my thinking.
Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather, where the "master" process (usually the one with rank 0) will split an array of work equally amongst the peers (including the master process itself) by having them all call scatter, the peers will do the work, then all peers will call gather to send the results back to master.
In your prime algorithm example, build an array of integers, "scatter" it to all the peers, have each peer run through its piece of the array saving 1 if the number is prime and 0 if it is not, then "gather" the results back to master. [In this particular example, since the input data is completely predictable based on process rank, the scatter step is unnecessary, but we will do it anyway.]
As pseudo-code:
main():
    n = 100; int x[n]
    MPI_init()
    // prepare data on master
    if rank == 0:
        for i in 1 ... n, x[i] = i
    // scatter n/k numbers from x on root to local on each process in world
    MPI_scatter(x, n/k, int, local, n/k, int, root, world)
    for i in 1 ... n/k
        result[i] = 1                           // assume prime
        if 2 divides local[i], result[i] = 0
        if 3 divides local[i], result[i] = 0
        if 5 divides local[i], result[i] = 0
        if 7 divides local[i], result[i] = 0
    // gather results from local on each process in world back to x on root
    MPI_gather(result, n/k, int, x, n/k, int, root, world)
    // print results
    if rank == 0:
        for i in 1 ... n, print i if x[i] == 1
    MPI_finalize()
There are lots of details to fill in, such as proper declarations, dealing with the fact that some ranks will have fewer elements than others, using proper C syntax, etc., but getting them right doesn't help explain the overall picture.
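For readers who want to see it end to end anyway, here is one possible concrete C version of the same scatter/gather pattern. It is an illustrative sketch only: it assumes n is an exact multiple of the number of processes, and it keeps the trial division by 2, 3, 5 and 7 from the pseudo-code (with a small tweak so that 2, 3, 5 and 7 themselves count as prime).
#include <mpi.h>
#include <stdio.h>

#define N 100          /* how many numbers to test: 1 .. N */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, k;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &k);

    int x[N], local[N], result[N];
    if (rank == 0)
        for (int i = 0; i < N; i++)
            x[i] = i + 1;                       /* the numbers 1 .. N */

    int chunk = N / k;                          /* assumes N is a multiple of k */

    /* root hands `chunk` numbers to every rank, including itself */
    MPI_Scatter(x, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++) {
        int v = local[i];
        result[i] = (v >= 2);                   /* 0 and 1 are not prime */
        for (int d = 2; d <= 7; d++)            /* trial division, as in the pseudo-code */
            if (v != d && v % d == 0)
                result[i] = 0;
    }

    /* collect each rank's chunk of prime/not-prime flags back on the root */
    MPI_Gather(result, chunk, MPI_INT, x, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < N; i++)
            if (x[i] == 1)
                printf("%d\n", i + 1);

    MPI_Finalize();
    return 0;
}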
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write since the different processes may be in different states. In particular, it is important that if process a is calling MPI_send to process b, then process b had better be calling MPI_recv from a.
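For completeness, here is a tiny sketch (mine, not part of the answer) of that pairing: rank 0 sends one int to rank 1, which posts the matching receive. Run it with at least two processes.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42, tag = 0;
    if (rank == 0) {
        /* rank 0 sends; if rank 1 never posts the matching receive, this may block forever */
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}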

AX.25 protocol interfering with sending data packet

I am sorry that I cannot provide code for this question, but it is more of a logical situation. My termination sequence for the AX.25 protocol is "111111", i.e. six 1s. So if this sequence of 1s appears inside my data packet, it will be taken as the end of the packet and the rest of the packet will not be sent correctly. I will do my best to explain my conclusions and test results so that you can understand my dilemma.
(Programming in Arduino)
Byte 1 contains 8 bits. Look below and picture a byte in a rectangular box; right next to it is byte 2, which also contains 8 bits.
Situation 1:
||_1_0_1_1_1_0_1_0_ ||_1_1_1_1_1_1_0_0_||
Attempted solution 1: simply change a 1 into a 0 and keep track of it.
Situation 2:
||_1_0_1_1_1_0_1_1_ ||_1_1_1_1_0_0_1_0_||
Attempted solution 2: attempted solution 1 breaks apart here, and I am stuck.
Individually the bytes are safe from triggering the AX.25 termination sequence, but the combination of the two bytes causes a problem.
Here is a list of possible cases:
1) six 1s = termination sequence, marking the end of the packet
2) six 1s inside the actual data of the packet = premature termination
3) if 1s are changed to 0s, then a run of six 0s becomes a problem when reverting the changes
4) only 1 byte can be read at a time (EEPROM) due to memory limitations
5) six 1s spanning two adjacent bytes will also prematurely trigger the termination sequence
Thank you in advance for any kind of help.
The solution mandated by the AX.25 protocol is bit stuffing.
Conceptually, any time the receiver sees five sequential one bits and a zero bit, it assumes that the zero bit has been stuffed by the sender (to break up erroneous frame sequences in the data), and removes it before emitting the received data. The only sequence of six 1-bits that can have been sent un-stuffed is the framing sequence; all data will have been sent stuffed. The receiver must always de-stuff.
To stuff or unstuff will generally require a couple of bytes of working ram (or a couple of bytes of registers), although there might be creative ways around that.
To quote the official TAPR protocol standard:
"In order to ensure that the flag bit sequence mentioned above does not appear accidentally anywhere else in a frame, the sending station monitors the bit sequence for a group of five or more contiguous “1” bits. Any time five contiguous “1” bits are sent, the sending station inserts a “0” bit after the fifth “1” bit. During frame reception, any time five contiguous “1” bits are received, a “0” bit immediately following five “1” bits is discarded."
A google search for AX.25 bit stuffing should return as much detail as you might need.
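For illustration only, here is a rough C sketch of the sender side of such bit stuffing (the receiver would mirror it, deleting a 0 that follows five 1s). The emit_bit helper and the example data bytes are made up for the sketch; a real transmitter would shift the bits out to the radio instead of printing them, and the frame flags (01111110) would be sent outside this routine, unstuffed.
#include <stdint.h>
#include <stdio.h>

/* Placeholder for the real bit output: here we just print each bit. */
static void emit_bit(int bit)
{
    putchar(bit ? '1' : '0');
}

/* Send one data byte, least-significant bit first (as AX.25 does),
 * inserting a 0 after every run of five 1s. The run counter lives in the
 * caller so that a run of 1s spanning two bytes is still stuffed. */
static void send_stuffed_byte(uint8_t byte, int *ones_run)
{
    for (int i = 0; i < 8; i++) {
        int bit = (byte >> i) & 1;
        emit_bit(bit);
        if (bit) {
            if (++*ones_run == 5) {
                emit_bit(0);               /* stuffed 0: data never shows six 1s */
                *ones_run = 0;
            }
        } else {
            *ones_run = 0;
        }
    }
}

int main(void)
{
    uint8_t packet[] = { 0xFB, 0x3F };     /* arbitrary example data */
    int ones_run = 0;
    for (unsigned i = 0; i < sizeof packet; i++)
        send_stuffed_byte(packet[i], &ones_run);
    putchar('\n');
    return 0;
}
Keeping the run-of-ones counter in the caller is what handles case 5 above, where six 1s straddle a byte boundary.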

Is there a size limit of variable in MPI_bcast?

I have the latest MPICH2 (3.0.4) compiled with the Intel Fortran compiler on a quad-core, dual-CPU (Intel Xeon) machine.
I am encountering an MPI_Bcast problem where I am unable to broadcast the array
gpsi(1:201,1:381,1:38,1:20,1:7)
which is an array of 407410920 elements. When I try to broadcast this array I get the following error:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7f506d811010, count=407410920,
MPI_DOUBLE_PRECISION, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1369).:
MPIR_Bcast_intra(1160):
MPIR_SMP_Bcast(1077)..: Failure during collective
rank 1 in job 31 Grace_52261 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
MPI launch string is: mpiexec -n 2 %B/tvdbootstrap
Testing MPI configuration with 'mpich2version'
Exit value was 127 (expected 0), status: execute_command_t::exited
Launching MPI job with command: mpiexec -n 2 %B/tvdbootstrap
Server args: -callback 127.0.0.1:4142 -set_pw 65f76672:41f20a5c
So the question: is there a limit on the size of a variable in MPI_Bcast, or is my array larger than it can handle?
As John said, you are running into a size limit: although the element count (407410920) fits in a signed int, the total size in bytes (407410920 * 8, about 3.26 GB) overflows one, and implementations internally track byte counts in int variables. When this is the case, you have a few options.
Use multiple MPI calls to send your data. For this option, you would just divide your data up into chunks smaller than 2^31 and send them individually until everything has been transferred (a sketch of this appears below).
Use MPI datatypes. With this option, you need to create a datatype to describe some portion of your data, then send multiples of that datatype. For example, if you are just sending an array of 100 integers, you can create a datatype of 10 integers using MPI_TYPE_VECTOR, then send 10 of that new datatype. Datatypes can be a bit confusing when you're first taking a look at them, but they are very powerful for sending either large data or non-contiguous data.
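As an illustration of the first option, here is a minimal sketch (in C for brevity, and not taken from either answer) that broadcasts a large buffer in pieces so that each individual MPI_Bcast count, in elements and in bytes, stays well below 2^31. The 64M-element chunk size is an arbitrary choice.
#include <mpi.h>
#include <stdlib.h>

/* Broadcast `total` doubles in pieces small enough for a plain MPI_Bcast. */
static void bcast_large(double *buf, long long total, int root, MPI_Comm comm)
{
    const long long CHUNK = 64LL * 1024 * 1024;      /* elements per call (arbitrary) */
    for (long long off = 0; off < total; off += CHUNK) {
        long long left = total - off;
        int count = (int)(left < CHUNK ? left : CHUNK);
        MPI_Bcast(buf + off, count, MPI_DOUBLE, root, comm);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long long total = 10LL * 1000 * 1000;            /* small test size here; the question's
                                                        array has 407410920 elements */
    double *buf = malloc(total * sizeof(double));
    if (rank == 0)
        for (long long i = 0; i < total; i++)
            buf[i] = (double)i;

    bcast_large(buf, total, 0, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}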
Yes, there is a limit. It's usually 2^31, so about two billion elements. You say your array has 407 million elements, so the element count itself seems like it should work. However, if the limit applies to the byte count, your 407 million double-precision values come to about 3.26 GB, roughly 50% over a 2 GB limit. Try cutting your array size in half and see if that works.
See: Maximum amount of data that can be sent using MPI::Send

Error in MPI broadcast

Sorry for the long post. I did read some other MPI broadcast related errors, but I couldn't find out why my program is failing.
I am new to MPI and I am facing this problem. First I will explain what I am trying to do:
My declarations:
ROWTAG 400
COLUMNTAG 800
Create a 2 X 2 Cartesian topology.
Rank 0 has the whole matrix. It wants to distribute parts of the matrix to all the processes in the 2 x 2 Cartesian topology. For now, instead of a matrix I am just dealing with integers. So for process P(i,j) in the 2 x 2 Cartesian topology (i - row, j - column), I want it to receive (ROWTAG + i) in one message and (COLUMNTAG + j) in another message.
My strategy to do so is:
Processes: P(0,0) , P(0,1), P(1,0), P(1,1)
P(0,0) has all the initial data.
P(0,0) sends (ROWTAG+1) (in this case 401) to P(1,0) - in essence, P(1,0) is responsible for distributing information related to row 1 to all the processes in row 1 - I just used a blocking send
P(0,0) sends (COLUMNTAG+1) (in this case 801) to P(0,1) - in essence, P(0,1) is responsible for distributing information related to column 1 to all the processes in column 1 - I used a blocking send
For each process, I made a row_group containing all the processes in that row and out of this created a row_comm (communicator object)
For each process, I made a col_group containing all the processes in that column and out of this created a col_comm (communicator object)
At this point, P(0,0) has given information related to row i to process P(i,0), and information related to column j to P(0,j). I call P(i,0) and P(0,j) the row_head and col_head respectively.
For Process P(i,j) , P(i,0) gives information related to row i, and P(0,j) gives information related to column j.
I used a broadcast call:
MPI_Bcast(&row_data, 1, MPI_INT, row_head, row_comm);
MPI_Bcast(&col_data, 1, MPI_INT, col_head, col_comm);
Please find my code here: http://pastebin.com/NpqRWaWN
Here is the error I see:
* An error occurred in MPI_Bcast
on communicator MPI COMMUNICATOR 5 CREATE FROM 3
MPI_ERR_ROOT: invalid root
* MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Also please let me know if there is any better way to distribute the matrix data.
There are several errors in your program. First, row_Ranks is declared with one element too few, and when writing to it you possibly overwrite other stack variables:
int col_Ranks[SIZE], row_Ranks[SIZE-1];
//                             ^^^^^^
On my test system the program just hangs because of that.
Second, you create new subcommunicators out of matrixComm, but when performing the broadcast you use rank numbers from matrixComm to address processes in those subcommunicators. That doesn't work. For example, in a 2x2 Cartesian communicator the ranks range from 0 to 3, but in any column- or row-wise subgroup there are only two processes, with ranks 0 and 1 - there is neither a rank 2 nor a rank 3. If you take a look at the value of row_head across the ranks, it is 2 in two of them, hence the error.
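For illustration, here is a hedged sketch (not the poster's code) of one way to set this up with MPI_Cart_sub, where the broadcast roots are ranks local to the row/column communicators. It assumes exactly 4 processes (mpiexec -n 4) and fills in the ROWTAG/COLUMNTAG values directly at the heads to keep it short.
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                 /* run with exactly 4 processes: mpiexec -n 4 */

    int dims[2] = {2, 2}, periods[2] = {0, 0}, coords[2];
    MPI_Comm cart, row_comm, col_comm;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    int cart_rank;
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);    /* coords[0] = i (row), coords[1] = j (column) */

    /* row_comm: keep the column dimension varying -> all processes in my row */
    int row_remain[2] = {0, 1};
    MPI_Cart_sub(cart, row_remain, &row_comm);
    /* col_comm: keep the row dimension varying -> all processes in my column */
    int col_remain[2] = {1, 0};
    MPI_Cart_sub(cart, col_remain, &col_comm);

    /* In the real program P(i,0) and P(0,j) would have received these values
       from P(0,0); here the heads just fill them in to keep the sketch short. */
    int row_data = -1, col_data = -1;
    if (coords[1] == 0) row_data = 400 + coords[0];   /* ROWTAG + i    at the row head    */
    if (coords[0] == 0) col_data = 800 + coords[1];   /* COLUMNTAG + j at the column head */

    /* P(i,0) has local rank 0 inside row_comm and P(0,j) has local rank 0
       inside col_comm, so 0 is the correct root in the subcommunicators. */
    MPI_Bcast(&row_data, 1, MPI_INT, 0, row_comm);
    MPI_Bcast(&col_data, 1, MPI_INT, 0, col_comm);

    MPI_Finalize();
    return 0;
}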
For a much better way to distribute the data, you should refer to this extremely informative answer.

How to calculate the number of messages sent in a distributed system?

Suppose we have n processes forming a general network. We don't know which of them are connected together, but we know their number (n). If at each round a process sends a message to every process it is connected to and receives one message from each of them, and the program executes for r rounds, is there a way to find how many messages have been sent during the execution?
As you have pointed out, without the exact network structure it is impossible to put a specific value on the number of messages sent. Instead, we can look at its Big-O value.
Now just to be clear what we mean by Big-O:
Big-O refers to the upper-bound (i.e. worst possible case) complexity.
It is possible (and quite likely in real systems) that the actual value will be less.
Without some function that describes the average case (e.g. each process is connected to, on average, N / 2 other processes), we must assume the worst case.
By "worst case" for this problem we mean that the maximum number of messages is sent.
So let us assume the worst case, in which each process is connected to N - 1 other processes.
Let us also define some variables:
S := the set of processes
N := the number of processes in S
We can represent the set S as a complete (every node connects to every other node), undirected graph in which each node in the graph corresponds to a process and each edge in the graph corresponds to 2 messages sent (1 outgoing transmission and one reply).
From here, we see that the number of edges in a complete graph is (N(N-1))/2
So in the worst case, the number of messages sent is N(N-1), or N^2 - N.
Now because we are dealing with Big-O notation, we are interested in how this value grows as a function of N.
Dropping the lower-order term, we see that N^2 - N is an element of O(N^2).
So the number of messages sent grows as N^2 in the worst case.
It is also possible to arrive at this result using an adjacency matrix, that is an N x N matrix where a 1 in the (i, j)th element refers to an edge from node i to node j.
Because in the original problem each process sends a single message to all connected processes, which respond with a single message each, we can see that for every pair (i, j) and (j, i) there will be a 1 in the matrix (one representing an outgoing message, one a reply). The exception will be the pairs where i = j, i.e. a process doesn't send itself a message.
So the matrix will be completely filled with 1s with the exception of the diagonal.
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0
Above: the adjacency matrix for N = 4.
So we first want to determine a formula for the total number of messages sent as a function of the number of nodes.
By the area of a rectangle we can see that a completely filled matrix would contain N x N = N^2 ones.
Now we must consider the diagonal. The diagonal entries are exactly the pairs (i, i), one for each of the N nodes, so there are exactly N of them.
So the overall result is that we have N^2 - N messages per round once the diagonal is excluded.
Now, using the same Big-O reasoning as above, we arrive at the same conclusion: the number of messages grows in the worst case as O(N^2).
So now you need only take into consideration the number of rounds that have occurred, leaving you with O(RN^2).
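If you do know the actual network, the count is easy to compute directly from the adjacency matrix: each 1 at position (i, j) is one message from i to j per round. A tiny C sketch (example values only):
#include <stdio.h>

#define N 4                         /* example size only */

int main(void)
{
    /* complete graph on N = 4 nodes: every process talks to every other */
    int adj[N][N] = {
        {0, 1, 1, 1},
        {1, 0, 1, 1},
        {1, 1, 0, 1},
        {1, 1, 1, 0},
    };
    int rounds = 3;                 /* example value for r */

    long per_round = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            per_round += adj[i][j]; /* a 1 at (i, j) is one message from i to j */

    /* worst case: per_round == N*(N-1), so the total grows as O(R * N^2) */
    printf("messages per round: %ld, total: %ld\n",
           per_round, per_round * rounds);
    return 0;
}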
Of course now you must consider whether you really do have the worst case ...
