How to determine if an MPI communicator is valid? - mpi

In my program, I have wrapped some MPI communicators into a data structure. Unfortunately, the destructor of an object of this type might sometimes get called before the object has been initialized. In my destructor, I of course call MPI_Comm_free. But if this is called on an invalid communicator, the code crashes.
I've been looking through the MPI standard, but I can't find a function to test whether a communicator is valid. I also assume I can't use MPI_Comm_set_errhandler to try to catch the error from the free call, because there isn't a valid communicator to set the handler on. I could maintain a flag of my own saying whether the communicator is valid, but I prefer to avoid duplicating state information like that. Is there any built-in way to safely check whether a communicator is valid?
Here is a basic program demonstrating my problem:
#include <mpi.h>
typedef struct {
MPI_Comm comm;
} mystruct;
void cleanup(mystruct* a) {
MPI_Comm_free(&(a->comm));
}
int main(int argc, char* argv[]) {
MPI_Init(&argc, &argv);
mystruct a;
/* Some early exit condition triggers cleanup without
initialization */
cleanup(&a);
MPI_Finalize();
return 0;
}

MPI_COMM_NULL is the constant used for invalid communicator handles. However, you cannot determine whether an MPI communicator variable has been initialized. In C, it is impossible to determine whether a variable has been initialized: non-static local variables start with an indeterminate value, and reading one causes undefined behavior.
You must initialize the communicator with MPI_COMM_NULL yourself, and check for that value before calling MPI_Comm_free. This only makes sense if you cannot create the actual communicator during initialization.
Note: MPI_Comm_free also sets comm to MPI_COMM_NULL.

Related

Qt5 how to catch exception [duplicate]

Here is a simple piece of code where a division by zero occurs. I'm trying to catch it :
#include <iostream>
int main(int argc, char *argv[]) {
int Dividend = 10;
int Divisor = 0;
try {
std::cout << Dividend / Divisor;
} catch(...) {
std::cout << "Error.";
}
return 0;
}
But the application crashes anyway (even though I passed the -fexceptions option to MinGW).
Is it possible to catch such an exception (which I understand is not a C++ exception, but a FPU exception) ?
I'm aware that I could check the divisor before dividing, but I assumed that, because a division by zero is rare (at least in my app), it would be more efficient to try dividing (and catch the error if it occurs) than to test the divisor before every division.
I'm doing these tests on a Windows XP computer, but would like to make it cross-platform.
It's not an exception. It's an error which is determined at hardware level and is returned back to the operating system, which then notifies your program in some OS-specific way about it (like, by killing the process).
I believe that in such case what happens is not an exception but a signal. If it's the case: The operating system interrupts your program's main control flow and calls a signal handler, which - in turn - terminates the operation of your program.
It's the same type of error which appears when you dereference a null pointer (then your program crashes by SIGSEGV signal, segmentation fault).
You could try to use the functions from <csignal> header to try to provide a custom handler for the SIGFPE signal (it's for floating point exceptions, but it might be the case that it's also raised for integer division by zero - I'm really unsure here). You should however note that the signal handling is OS-dependent and MinGW somehow "emulates" the POSIX signals under Windows environment.
Here's the test on MinGW 4.5, Windows 7:
#include <csignal>
#include <iostream>
using namespace std;
void handler(int a) {
cout << "Signal " << a << " here!" << endl;
}
int main() {
signal(SIGFPE, handler);
int a = 1/0;
}
Output:
Signal 8 here!
And right after executing the signal handler, the system kills the process and displays an error message.
Using this, you can close any resources or log an error after a division by zero or a null pointer dereference... but unlike exceptions that's NOT a way to control your program's flow even in exceptional cases. A valid program shouldn't do that. Catching those signals is only useful for debugging/diagnosing purposes.
(Some signals are useful in general low-level programming and don't cause your program to be killed right after the handler returns, but that's a deep topic.)
Dividing by zero is a logical error, a bug by the programmer. You shouldn't try to cope with it; you should debug and eliminate it. In addition, catching exceptions is extremely expensive, way more than checking the divisor will be.
You can use Structured Exception Handling to catch the divide-by-zero error. How that's achieved depends on your compiler. MSVC offers a compiler option (/EHa) that makes catch(...) catch structured exceptions, and also provides a function (_set_se_translator) to translate structured exceptions into regular C++ exceptions, as well as offering __try/__except/__finally. However, I'm not familiar enough with MinGW to tell you how to do it with that compiler.
There isn't a language-standard way of catching the divide-by-zero from the CPU.
Don't prematurely "optimize" away a branch. Is your application really CPU-bound in this context? I doubt it, and it isn't really an optimization if you break your code.
Otherwise, I could make your code even faster:
int main(int argc, char *argv[]) { /* Fastest program ever! */ }
Divide by zero is not an exception in C++, see https://web.archive.org/web/20121227152410/http://www.jdl.co.uk/briefings/divByZeroInCpp.html
Somehow the real explanation is still missing.
Is it possible to catch such an exception (which I understand is not a C++ exception, but a FPU exception) ?
Yes, your catch block should work on some compilers. But the problem is that your exception is not an FPU exception. You are doing integer division. I don’t know whether that’s also a catchable error but it’s not an FPU exception, which uses a feature of the IEEE representation of floating point numbers.
On Windows (with Visual C++), try this:
BOOL SafeDiv(INT32 dividend, INT32 divisor, INT32 *pResult)
{
__try
{
*pResult = dividend / divisor;
}
__except(GetExceptionCode() == EXCEPTION_INT_DIVIDE_BY_ZERO ?
EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)
{
return FALSE;
}
return TRUE;
}
MSDN: http://msdn.microsoft.com/en-us/library/ms681409(v=vs.85).aspx
Well, if exception handling existed for this, some component would still have to perform the check, so you don't lose anything by checking it yourself. And there's not much that's faster than a simple comparison (a single CPU instruction such as "jump if zero", or something like that; I don't remember the exact name).
To avoid infinite "Signal 8 here!" messages, just add 'exit' to Kos's nice code:
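As a rough illustration of checking the divisor up front, here is a minimal sketch in C. The helper name checked_div is made up for this example and is not part of any library. It also rejects INT_MIN / -1, which overflows and on many platforms traps in the same way as a zero divisor.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper (not a library function): divides only when the
 * result is well defined. Returns false and leaves *result untouched for
 * a zero divisor and for INT_MIN / -1, which overflows. */
bool checked_div(int dividend, int divisor, int *result)
{
    if (divisor == 0) {
        return false;
    }
    if (dividend == INT_MIN && divisor == -1) {
        return false;
    }
    *result = dividend / divisor;
    return true;
}
```

A caller would write something like `if (!checked_div(a, b, &q)) { /* report the error */ }`, which keeps the error path in ordinary control flow instead of relying on signals.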
#include <csignal>
#include <iostream>
#include <cstdlib> // exit
using namespace std;
void handler(int a) {
cout << "Signal " << a << " here!" << endl;
exit(1);
}
int main() {
signal(SIGFPE, handler);
int a = 1/0;
}
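As others said, it's not a C++ exception. With floating-point operands, dividing by zero just generates a NaN or Inf (integer division by zero, as in the question, traps instead).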
Zero-divide is only one way to do that. If you do much math, there are lots of ways, like
log(not_positive_number), exp(big_number), etc.
If you can check for valid arguments before doing the calculation, then do so, but sometimes that's hard to do, so you may need to generate and handle an exception.
In MSVC there is a header file #include <float.h> containing a function _finite(x) that tells if a number is finite.
I'm pretty sure MinGW has something similar.
You can test that after a calculation and throw/catch your own exception or whatever.
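The "compute first, test afterwards" idea could look roughly like the sketch below. It uses isfinite() from the standard C99 header math.h rather than MSVC's _finite(). The helper name checked_fdiv is hypothetical, and the sketch assumes IEEE 754 floating-point arithmetic, where dividing by zero yields Inf or NaN instead of trapping.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper (not a library function): performs a floating-point
 * division and reports whether the result is a usable finite number.
 * Assumes IEEE 754 semantics: x / 0.0 gives +/-Inf and 0.0 / 0.0 gives NaN. */
bool checked_fdiv(double numerator, double denominator, double *result)
{
    double y = numerator / denominator;
    if (!isfinite(y)) {       /* rejects both Inf and NaN */
        return false;
    }
    *result = y;
    return true;
}
```

The same post-check works for other operations that can leave the finite range, such as log or exp, at the cost of one extra test per result.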

Is there a way to cancel a blocking `MPI_Probe` call?

The MPI_Irecv and MPI_Isend operations return an MPI_Request that can later be marked as cancelled with MPI_Cancel. Is there a similar mechanism for the blocking MPI_Probe and MPI_Mprobe?
The context of the question is the latest implementation of Boost.MPI request handlers using Probe.
EDIT - Here is an example of how a hypothetical MPI_Probecancel could be used:
#include <mpi.h>
#include <chrono>
#include <future>
#include <thread>
using namespace std::literals::chrono_literals;
// Executed in a thread
void async_cancel(MPI_Probe *probe)
{
std::this_thread::sleep_for(1s);
int res = MPI_Probecancel(probe);
}
int main(int argc, char* argv[])
{
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
{
// A handle to the probe (similar to a request)
MPI_Probe probe;
// Start a thread
// `probe` will be filled with the next call, pretty ugly
// Ideally, this should be done in two steps like MPI_Irecv, MPI_Wait
auto res = std::async(std::launch::async, &async_cancel, &probe);
MPI_Message message;
MPI_Status status;
MPI_Mprobe(1, 123, MPI_COMM_WORLD, &message, &status, &probe); // hypothetical extra `probe` argument
if (!probe.cancelled)
{
int buffer;
MPI_Mrecv(&buffer, 1, MPI_INT, &message, &status);
}
}
else
std::this_thread::sleep_for(2s);
MPI_Finalize();
return 0;
}
First, the premise / nomenclature of your question is wrong. It is the nonblocking calls, MPI_Irecv and MPI_Isend, that return a request object which you may cancel. For these calls, you cancel the local operation.
MPI_Probe and MPI_Mprobe are in fact blocking. You cannot cancel these operations, in the sense that control flow will only leave them when a message is available.
On the other hand, MPI_Iprobe and MPI_Improbe are nonblocking, meaning they always complete immediately, telling you whether a message is available.
Neither kind of probe call leaves any local state behind after it completes, so there is nothing that could be cancelled locally after the functions return.
That said, if a probe tells you that a message is available, you should definitely receive it. Otherwise a send operation may block and you would leak resources on all sides. But that's just a receive operation.
Edit: Regarding your idea to cancel an ongoing local MPI_Probe from a concurrent thread: this is not directly supported.
Theoretically, you could emulate this on a conforming implementation with MPI_THREAD_MULTIPLE by running the probe on MPI_ANY_SOURCE and sending a message to the same rank from the other thread. That, of course, has the consequence that you must change the probe to accept a message from any incoming rank.
Realistically, if you have to do this, you would probably just use a loop like while(!cancelled) MPI_Iprobe();.
That said, I would again question the scenario: How would another thread on your rank suddenly know to cancel a local MPI_Probe operation? It would probably have to be based on information received from a remote rank - in which case that would be covered by actually being able to receive information from it, i.e. the actual Probe would complete.
Maybe for some high-level abstraction it makes sense to offer a local cancel, but in an actual practical situation I believe you could design an idiomatic pattern without needing this.

Segmentation fault on value of function pointer

Is it possible to get a segmentation fault if I incorrectly set the value of a function pointer?
Or will the interpreter/compiler detect that beforehand?
The details depend on the language you're using, but in general it's not just possible but likely.
C provides no guarantees whatsoever. You can just say e.g.
#include <stddef.h>
typedef void (*foo)( void );
int main( void ) {
((foo)NULL)( );
return 0;
}
which takes NULL, casts it to a function pointer and calls it (or at least attempts to, and crashes). As of writing, neither gcc -Wall nor clang -Wall detects or warns about even this pathological case.
With other languages, there may be more safeguards in place. But generally, don't expect your program to survive a bad function pointer.
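The only bad value that C code can portably test for is NULL, so a defensive wrapper is about as far as a runtime check can go. The sketch below is an illustration only: the names callback_fn, call_if_set and twice are invented for this example. A pointer holding some other bogus address, such as the value 1, would pass the check and still crash.

```c
#include <stdbool.h>
#include <stddef.h>

typedef int (*callback_fn)(int);

/* Hypothetical helper: calls cb only if it is not NULL.
 * This catches the NULL case only; any other invalid address cannot be
 * detected this way and calling it remains undefined behavior. */
bool call_if_set(callback_fn cb, int arg, int *result)
{
    if (cb == NULL) {
        return false;
    }
    *result = cb(arg);
    return true;
}

/* Example callback used for illustration. */
int twice(int x)
{
    return 2 * x;
}
```

Initializing every function pointer to NULL (or to a valid default) at its declaration is what makes a check like this meaningful.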
void (*ptr)() = (void (*) ())0x0;
ptr();
Nothing prevents you from compiling/executing this, but it will fail for sure.
The following example produces the segmentation fault you mention:
int main(int argc, char *argv[]) {
void (*fun_ptr)() = (void (*)()) 1;
(*fun_ptr)();
return 0;
}
None of cc, clang, splint issue a warning. C assumes that the programmer knows what he is doing.
UPDATE
The following reference illustrates why C allows absolute memory addresses to be accessed through pointers.
Koenig, Andrew R., C Traps and Pitfalls, Bell Telephone Laboratories, Murray Hill, New Jersey, Technical Memorandum, 2.1. Understanding Declarations:
I once talked to someone who was writing a C program that was going to
run stand-alone in a small microprocessor. When this machine was
switched on, the hardware would call the subroutine whose address was
stored in location 0.
In order to simulate turning power on, we had to devise a C statement
that would call this subroutine explicitly. After some thought, we
came up with the following:
(*(void(*)())0)();

Controlled premature termination of mpi program running under slurm?

I am running a script that does multiple subsequent mpirun calls through Slurm's squeue command. Each call to mpirun writes its output to its own directory, but there is a dependency between them: a given run uses data from the previous run's output directory.
The MPI program internally performs an iterative optimization algorithm, which terminates when some convergence criteria are met. Every once in a while, the algorithm reaches a state in which those criteria are not quite met yet, but by plotting the output (which is continuously written to disk) one can quite easily tell that the important things have converged and that further iterations would not change the nature of the final result anymore.
What I am therefore looking for is a way to manually terminate the run in a controlled way and have the outer script proceed to the next mpirun call. What is the best way to achieve this? I do not have direct access to the node on which the calculation is actually performed, but I of course have access to all of Slurm's commands and to the working directories of the individual runs. I have access to the MPI program's full source code.
One solution that would work is the following: to manually terminate a run, one places a file with a special name like killme in the working directory, which could easily be done with touch killme. The MPI program would regularly check for the existence of this file and terminate in a controlled manner if it exists. Neither the outer script nor Slurm would be involved at all, and the script would just continue with the next mpirun call. What do you think of this solution? Can you think of anything better?
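The sentinel-file check described above could be sketched as follows. The helper name stop_requested is hypothetical. The sketch uses only standard C; on POSIX systems, access() or stat() would avoid opening the file.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: reports whether the sentinel file (for example
 * "killme" in the working directory) currently exists and is readable. */
bool stop_requested(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return false;       /* no sentinel: keep iterating */
    }
    fclose(f);
    return true;            /* sentinel present: finish up cleanly */
}
```

In an MPI program it would probably be safest to let a single rank call this every few iterations and share the answer with the others (for instance with MPI_Bcast), so that all ranks leave the loop in the same iteration.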
Here is a short code snippet for getting SIGUSR1 as a signal.
More detailed explanation can be found here.
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
void sighandler(int signum, siginfo_t *info, void *ptr) {
fprintf(stderr, "Received signal %d\n", signum);
fprintf(stderr, "Signal originates from process %lu\n",
(unsigned long) info->si_pid);
fprintf(stderr, "Shutting down properly.\n");
exit(0);
}
int main(int argc, char** argv) {
struct sigaction act;
printf("pid %lu\n", (unsigned long) getpid());
memset(&act, 0, sizeof(act));
act.sa_sigaction = sighandler;
act.sa_flags = SA_SIGINFO;
sigaction(SIGUSR1, &act, NULL);
while (1) {
};
return 0;
}
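For a controlled shutdown, a common variation is to let the handler only record the request and have the iteration loop poll for it, since calling fprintf or exit inside a signal handler is not async-signal-safe. The sketch below assumes a POSIX system. The names stop_flag, on_sigusr1, install_stop_handler and stop_was_requested are made up for this example. How the signal reaches each rank (for instance through scancel with its --signal option, and whether mpirun forwards it) depends on the site setup and is not covered here.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Set by the handler, polled by the main loop. volatile sig_atomic_t is
 * the type that can be written safely from a signal handler. */
static volatile sig_atomic_t stop_flag = 0;

static void on_sigusr1(int signum)
{
    (void)signum;
    stop_flag = 1;          /* no stdio, no exit(): just record the request */
}

/* Hypothetical helper: installs the SIGUSR1 handler; true on success. */
bool install_stop_handler(void)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    sigemptyset(&act.sa_mask);
    act.sa_handler = on_sigusr1;
    return sigaction(SIGUSR1, &act, NULL) == 0;
}

/* Hypothetical helper: true once SIGUSR1 has been received. */
bool stop_was_requested(void)
{
    return stop_flag != 0;
}
```

The optimization loop would then call stop_was_requested() once per iteration, write its final output, and return normally so that the outer script moves on to the next mpirun call.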

mpi multiple init finalize

Assuming I have good reason to do the following (I think I have), how can I make it work?
#include "mpi.h"
int main( int argc, char *argv[] )
{
int myid, numprocs;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
// ...
MPI_Finalize();
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
// ...
MPI_Finalize();
return 0;
}
I got the error:
--------------------------------------------------------------------------
Calling any MPI-function after calling MPI_Finalize is erroneous.
The only exceptions are MPI_Initialized, MPI_Finalized and MPI_Get_version.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** after MPI was finalized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[ange:13049] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!
The reason to do that:
I have Python wrapping around C++ code. Some wrapped classes have a constructor that calls MPI_Init and a destructor that calls MPI_Finalize. I would like to be able, in Python, to freely create, delete, and re-create the Python objects that wrap this C++ class. The ultimate goal is to create a web service entirely in Python that imports the Python C++ extension once and executes some Python code for each user request.
EDIT: I think I'll refactor the C++ code to make it possible not to call MPI_Init and MPI_Finalize in the constructor and destructor, so they can be called exactly once in the Python script (using mpi4py).
You've basically got the right solution, so I'll just confirm. It is in fact erroneous to call MPI_Init and MPI_Finalize multiple times, and if you have an entity that calls these internally on creation/destruction, then you can only instantiate that entity once. If you want to create multiple instances, you'll need to change the entity to do one of the following:
Offer an option to not call Init and Finalize that the user can set externally
Use MPI_Initialized and MPI_Finalized to decide whether it needs to call either of the above

Resources