How do I check that all MPI procs were used to call a procedure? - mpi

I have designed a procedure that must be called by all processors in the communicator in order to function properly. If the user calls it from only the root rank, I want the procedure to detect this and produce a meaningful error message. At first I thought of having the procedure call a checking routine, shown below:
subroutine AllProcsPresent
  ! Checks that all procs have been used to call this procedure
  use MPI_stub, only: nproc, Allreduce
  integer :: counter
  counter = 1
  call Allreduce(counter) ! Stub procedure that sums "counter" across all procs
  if (counter == nproc) then
    return
  else
    print *, "meaningful error"
  end if
end subroutine AllProcsPresent
But this won't work, because the Allreduce will wait for all procs to check in, and if only root made the call, the other procs will never arrive. Is there a way to do what I'm trying to do?

There's not much you can do here. You might want to look at 'collchk' for ideas, but it's hard to find a good resource for that package. Here's its git home:
http://git.mpich.org/mpe.git/tree/HEAD:/src/collchk
If you look at 'NOTES', there's an item about "call consistency", described as "Ensures that all processes in the communicator have made the same call in a given event". Hope that can give you some ideas.
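For a rough idea of what such a call-consistency check amounts to, here is a minimal sketch of my own (check_same_call and call_id are made-up names, not collchk's API): every rank contributes an integer identifying the collective it is about to enter, and the minimum and maximum are compared. Note that this still assumes every rank enters the check; it detects mismatched calls, not ranks that never call anything.
#include <mpi.h>
#include <stdio.h>

/* Sketch: returns 1 if all ranks in comm passed the same call_id, 0 otherwise. */
int check_same_call(MPI_Comm comm, int call_id)
{
    int minmax[2] = { call_id, -call_id };   /* pack min and max into one Allreduce */
    MPI_Allreduce(MPI_IN_PLACE, minmax, 2, MPI_INT, MPI_MIN, comm);
    if (minmax[0] != -minmax[1]) {           /* min(call_id) != max(call_id) */
        fprintf(stderr, "ranks disagree on which collective they are calling\n");
        return 0;
    }
    return 1;
}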

Ensuring that a collective operation is entered by all ranks within a communicator is the responsibility of the programmer.
That said, you might consider using the MPI 3.0 non-blocking collective MPI_Ibarrier together with an MPI_Test loop and a timeout. Note that non-blocking collectives can't be cancelled, so if the other ranks do not join the operation within your timeout, you will have to abort the entire job. Something like:
#include <unistd.h>   /* for usleep */

void AllPresent(MPI_Comm comm, double timeout) {
    int all_here = 0;
    MPI_Request req;
    MPI_Ibarrier(comm, &req);
    double start_time = MPI_Wtime();
    do {
        MPI_Test(&req, &all_here, MPI_STATUS_IGNORE);
        if (all_here)
            break;
        usleep(10000);   /* sleep 10 ms; sleep() only takes whole seconds */
        if (MPI_Wtime() - start_time > timeout) {
            /* Print an error message */
            MPI_Abort(comm, 1);
        }
    } while (!all_here);
    /* Run your procedure now */
}
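A hypothetical use would be to guard the top of the collective procedure; the routine name and the 5-second timeout below are made up for illustration:
/* Hypothetical usage sketch: abort with a message unless every rank arrives within 5 s. */
void MyCollectiveProcedure(MPI_Comm comm) {
    AllPresent(comm, 5.0);
    /* ... the actual collective work, e.g. MPI_Allreduce, goes here ... */
}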

Related

How to check completion of non-blocking reduce?

It is not clear to me how to properly use non-blocking collectives in MPI, particularly MPI_Ireduce() in this case:
Say I want to collect a sum on the root rank:
int local_cnt;
int total_cnt;
MPI_Request request;
MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
/* now I want to check if the reduce is finished */
if (rank == 0) {
    int flag = 0;
    MPI_Status status;
    MPI_Test(&request, &flag, &status);
    if (flag) {
        /* reduce is finished? */
    }
}
Is this the correct way to check whether a non-blocking reduce is done? My confusion comes from two aspects: first, can or should just the root process check for completion using MPI_Test(), since the result is only meaningful to the root? Second, since MPI_Test() is a local operation, how can it know that the operation is complete? It does require all processes to finish, right?
You must check for completion on all participating ranks, not just the root.
From a user perspective, you need to know about the completion of the communication because you must not do anything with the memory provided to a non-blocking operation until it completes. That is, if you send a local-scope variable like local_cnt, you cannot write to it or leave its scope before you have confirmed that the operation is complete.
One option to ensure completion is calling MPI_Test until it eventually returns flag==true. Use this only if you can do something useful between the calls to MPI_Test:
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    int flag;
    do {
        // perform some useful computation
        MPI_Status status;
        MPI_Test(&request, &flag, &status);
    } while (!flag);
}
Do not call MPI_Test in a loop if you have nothing useful to do in between calls. Instead use MPI_Wait, which blocks until completion.
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    // perform some useful computation
    MPI_Status status;
    MPI_Wait(&request, &status);
}
Remember, if you have no useful computation at all, and don't need to be non-blocking for deadlock reasons, use a blocking communication in the first place. If you have multiple ongoing non-blocking communications, there are MPI_Waitany, MPI_Waitsome, MPI_Waitall and their Test variants.
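As a rough illustration of the multi-request case, here is a minimal sketch of my own (the function name and the two counters are made up): two non-blocking reductions are posted and completed together with MPI_Waitall. All ranks must post them in the same order.
#include <mpi.h>

void reduce_two_counters(int cnt_a, int cnt_b, int *sum_a, int *sum_b)
{
    MPI_Request reqs[2];
    MPI_Ireduce(&cnt_a, sum_a, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Ireduce(&cnt_b, sum_b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &reqs[1]);
    /* ... useful computation that does not touch cnt_a, cnt_b, or the sums ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}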
Zulan brilliantly answered the first part of your question.
MPI_Reduce() returns when:
- the send buffer can be overwritten on a non-root rank
- the result is available on the root rank (which implies all the ranks have completed)
So there is no way for a non-root rank to know whether the root rank has completed. If you do need this information, then you need to manually add an MPI_Barrier(). That being said, you generally do not require this information, and if you believe you do need it, there might be something wrong with your app.
This remains true if you use non-blocking collectives (e.g. when the MPI_Wait() corresponding to an MPI_Ireduce() completes on a non-root rank, that simply means the send buffer can be overwritten).
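If every rank really does need to know that the reduction has completed everywhere, a barrier after the wait is the straightforward (if usually unnecessary) way. A minimal sketch of my own, assuming the same local_cnt/total_cnt pattern as above:
#include <mpi.h>

void reduce_then_sync(int local_cnt)
{
    int total_cnt = 0;
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    MPI_Wait(&request, MPI_STATUS_IGNORE);  /* local completion: buffers reusable */
    MPI_Barrier(MPI_COMM_WORLD);            /* global completion: every rank now knows
                                               the root has the result */
}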

Using MPI_Scatter with 3 processes

I am new to MPI, and my question is: how does the root (for example rank 0) initialize all of its values (in the array) before the other processes receive their i-th value from the root?
for example:
on the root I initialize: arr[0]=20, arr[1]=90, arr[2]=80.
My question is: if, for example, process number 2 starts a little bit before the root process, can MPI_Scatter send an incorrect value instead of 80?
How can I ensure the root initializes all of its memory before the others use Scatter?
Thank you!
The MPI standard specifies that
If comm is an intracommunicator, the outcome is as if the root executed n send operations, MPI_Send(sendbuf + i*sendcount*extent(sendtype), sendcount, sendtype, i, ...), and each process executed a receive, MPI_Recv(recvbuf, recvcount, recvtype, i, ...).
This means that each non-root process will wait until its respective recvcount elements have been received. In other words, the call is blocking: the process waits until its part of the communication is complete.
You as the programmer are responsible for ensuring that the data being sent is correct by the time you call any communication routine, and that it remains so until the send buffer is available again (in this case, until MPI_Scatter returns). In an MPI-only program this is as simple as placing the initialization code before the call to MPI_Scatter, as each process executes the program sequentially.
The following is an example based on the MPI document's Example 5.11:
MPI_Comm comm = MPI_COMM_WORLD;
int grank, gsize, *sendbuf;
int root, *rbuf;
MPI_Comm_rank(comm, &grank);
MPI_Comm_size(comm, &gsize);
root = 0;
if (grank == root) {
    sendbuf = (int *)malloc(gsize * 100 * sizeof(int));
    // Initialize sendbuf. None of its values are valid at this point.
    for (int i = 0; i < gsize * 100; i++)
        sendbuf[i] = i;
}
rbuf = (int *)malloc(100 * sizeof(int));
// Distribute sendbuf data
// At the root process, all sendbuf values are valid
// In non-root processes, the sendbuf argument is ignored.
MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);
MPI_Scatter() is a collective operation, so the MPI library takes care of everything, and the outcome of a collective operation does not depend on which rank called it earlier than another.
In this specific case, a non-root rank will block (at least) until the root rank calls MPI_Scatter().
This is no different from an MPI_Send() / MPI_Recv() pair:
MPI_Recv() blocks if it is called before the remote peer has MPI_Send()'d a matching message.
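To convince yourself, here is a small self-contained sketch of my own (not from the answers above) in which the root deliberately arrives late at MPI_Scatter and every rank still receives the correct element:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, value = -1;
    int *arr = NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        sleep(2);                      /* pretend the root is slow to start */
        arr = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++)
            arr[i] = (i + 2) * 10;     /* arbitrary values, e.g. 20, 30, 40, ... */
    }
    /* Non-root ranks block here until the root calls MPI_Scatter. */
    MPI_Scatter(arr, 1, MPI_INT, &value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d\n", rank, value);
    if (rank == 0)
        free(arr);
    MPI_Finalize();
    return 0;
}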

MPI irecv,isend communication between tasks

I am new to MPI, so sorry if this sounds stupid. I want a process to have an MPI_Irecv. If it has been called for a task, it computes a result and sends it back to the process that called it. How can I check whether it has actually been assigned a task, so that I can have an if {} in which that task takes place, while the rest of the process continues with other stuff?
Code example:
for (i = 0; i < size_of_Q; i++) {
    MPI_Irecv(&shmeio, 1, mpi_point_C, root, 77, MPI_COMM_WORLD, &req);
    // I want to put an if right here.
    // If it's true, the process does the task.
    // Finds a number. Then
    MPI_Isend(&Bestcandidate, 1, mpi_point_C, root, 66, MPI_COMM_WORLD, &req);
    // so that it can return the result.
    // If it wasn't assigned a task it carries on with its other tasks.
} // (here is where the for loop ends)
You might be confusing what MPI is supposed to do. MPI isn't really a tasking-based model compared to some others (map-reduce, some parts of OpenMP, etc.). MPI has historically focused on SPMD (single program multiple data) types of applications. That's not to say that MPI can't handle MPMD (there's an entire chapter in the standard about dynamic processes, and most launchers can run different executables on different ranks).
With that in mind, when you start your job, you'll usually have all of the processes that you'll ever have (unless you're using dynamic processes via MPI_Comm_spawn). You probably used something like:
mpiexec -n 8 ./my_program arg1 arg2 arg3
Many times, if people are trying to emulate a tasking (or master/worker) model, they'll treat rank 0 as the special "master":
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (0 == rank) {
    while (/* work not done */) {
        /* Check if parts of work are done */
        /* Send work to ranks without work */
    }
} else {
    while (/* work not done */) {
        /* Get work from master */
        /* Compute */
        /* Send results to master */
    }
}
Often, when waiting for the work, you'll do something like:
for (i = 1; i < num_procs; i++) {
    MPI_Irecv(&result[i], ..., &requests[i - 1]);   /* request i-1 corresponds to rank i */
}
This will set up the receives for each rank that will send you work. Then later, you can do something like:
MPI_Testany(num_procs - 1, requests, &index, &flag, MPI_STATUS_IGNORE);
if (flag && index != MPI_UNDEFINED) {
    /* Process work */
    MPI_Send(work_data, ..., index + 1, ...);   /* request index corresponds to rank index + 1 */
}
This will check to see if any of the requests (the handles used to track the status of a nonblocking operation) are completed and will then send new work to the worker that finished.
Obviously, all of this code is not copy/paste ready. You'll have to figure out how/if it applies to your work and adapt it accordingly.
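Putting the pieces together, here is a compact, self-contained sketch of my own of the master/worker pattern described above (the tags, the task count, and the square-the-number "work" are all made up for illustration):
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WORK_TAG   1
#define RESULT_TAG 2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* master */
        int nworkers = size - 1;
        int total_tasks = 3 * nworkers;     /* made-up amount of work */
        int next_task = 0, remaining = total_tasks;
        int *result = malloc(nworkers * sizeof(int));
        MPI_Request *requests = malloc(nworkers * sizeof(MPI_Request));

        /* Post one result receive per worker; request i-1 belongs to rank i. */
        for (int i = 1; i < size; i++)
            MPI_Irecv(&result[i - 1], 1, MPI_INT, i, RESULT_TAG,
                      MPI_COMM_WORLD, &requests[i - 1]);
        /* Hand each worker its first task. */
        for (int i = 1; i < size; i++) {
            MPI_Send(&next_task, 1, MPI_INT, i, WORK_TAG, MPI_COMM_WORLD);
            next_task++;
        }
        while (remaining > 0) {
            int index, flag;
            MPI_Testany(nworkers, requests, &index, &flag, MPI_STATUS_IGNORE);
            if (flag && index != MPI_UNDEFINED) {
                int worker = index + 1;
                printf("got result %d from rank %d\n", result[index], worker);
                remaining--;
                if (next_task < total_tasks) {
                    /* Re-arm the receive, then give this worker more work. */
                    MPI_Irecv(&result[index], 1, MPI_INT, worker, RESULT_TAG,
                              MPI_COMM_WORLD, &requests[index]);
                    MPI_Send(&next_task, 1, MPI_INT, worker, WORK_TAG, MPI_COMM_WORLD);
                    next_task++;
                } else {
                    int stop = -1;          /* no more work for this worker */
                    MPI_Send(&stop, 1, MPI_INT, worker, WORK_TAG, MPI_COMM_WORLD);
                }
            }
            /* The master could do its own useful work here instead of spinning. */
        }
        free(result);
        free(requests);
    } else {                                /* worker */
        while (1) {
            int task, answer;
            MPI_Recv(&task, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task < 0)
                break;
            answer = task * task;           /* stand-in for the real computation */
            MPI_Send(&answer, 1, MPI_INT, 0, RESULT_TAG, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}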

Concurrent Processing - Petersons Algorithm

For those unfamiliar, the following is Peterson's algorithm used for process coordination:
int No_Of_Processes; // Number of processes
int turn; // Whose turn is it?
int interested[No_Of_Processes]; // All values initially FALSE
void enter_region(int process) {
int other; // number of the other process
other = 1 - process; // the opposite process
interested[process] = TRUE; // this process is interested
turn = process; // set flag
while(turn == process && interested[other] == TRUE); // wait
}
void leave_region(int process) {
interested[process] = FALSE; // process leaves critical region
}
My question is, can this algorithm give rise to deadlock?
No, there is no deadlock possible.
The only place where you wait is the while loop. The process variable is not shared between threads (and it differs per thread), but the turn variable is shared, so it is impossible for the condition turn == process to be true for more than one thread at any given moment.
But anyway, your solution is not correct at all: Peterson's algorithm is only for two concurrent threads, not for an arbitrary No_Of_Processes as in your code.
In the original algorithm extended to N processes, deadlocks are possible.
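For reference, here is a minimal two-thread version of my own in C11 (using <stdatomic.h> and <threads.h>, so it needs a toolchain that provides C11 threads; sequentially consistent atomics are used because Peterson's algorithm is not correct under relaxed memory ordering):
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

/* Peterson's algorithm for exactly two threads (ids 0 and 1). */
static atomic_int interested[2];
static atomic_int turn;
static int counter;              /* the shared resource being protected */

static void enter_region(int self)
{
    int other = 1 - self;
    atomic_store(&interested[self], 1);
    atomic_store(&turn, self);
    while (atomic_load(&turn) == self && atomic_load(&interested[other]))
        ;                        /* busy-wait */
}

static void leave_region(int self)
{
    atomic_store(&interested[self], 0);
}

static int worker(void *arg)
{
    int self = *(int *)arg;
    for (int i = 0; i < 100000; i++) {
        enter_region(self);
        counter++;               /* critical section */
        leave_region(self);
    }
    return 0;
}

int main(void)
{
    int id0 = 0, id1 = 1;
    thrd_t t0, t1;
    thrd_create(&t0, worker, &id0);
    thrd_create(&t1, worker, &id1);
    thrd_join(t0, NULL);
    thrd_join(t1, NULL);
    printf("counter = %d\n", counter);   /* expected: 200000 */
    return 0;
}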

Reentrancy and recursion

Would it be a true statement to say that every recursive function needs to be reentrant?
If by reentrant you mean that a further call to the function may begin before a previous one has ended, then yes, all recursive functions happen to be reentrant, because recursion implies reentrance in that sense.
However, "reentrant" is sometimes used as a synonym for "thread safe", which is introduces a lot of other requirements, and in that sense, the answer is no. In single-threaded recursion, we have the special case that only one "instance" of the function will be executing at a time, because the "idle" instances on the stack are each waiting for their "child" instance to return.
No: I recall a factorial function that works with a static (global) variable. Having static (global) variables goes against being reentrant, and the function is still recursive.
int i;   /* global state: this is what makes the function non-reentrant */
int factorial(void)
{
    if (i == 0)
        return 1;
    i = i - 1;
    return (i + 1) * factorial();
}
This function is recursive and it's non-reentrant.
'Reentrant' normally means that the function can be entered more than once, simultaneously, by two different threads.
To be reentrant, it has to do things like protect/lock access to static state.
A recursive function (on the other hand) doesn't need to protect/lock access to static state, because in a single thread only one invocation is executing at any given time.
So: no.
Not at all.
Why shouldn't a recursive function be able to have static data, for example? Should it not be able to lock on critical sections?
Consider:
#include <semaphore.h>

sem_t mutex;        // initialize elsewhere with sem_init(&mutex, 0, 1)
int calls = 0;

int fib(int n)
{
    sem_wait(&mutex);   // lock for critical section - not reentrant per def.
    calls++;            // global variable - not reentrant per def.
    sem_post(&mutex);
    if (n == 1 || n == 0)
        return 1;
    else
        return fib(n - 1) + fib(n - 2);
}
This is not to say that writing a recursive and reentrant function is easy, nor that it is a common pattern, nor that it is recommended in any way. But it is possible.
