Controlled premature termination of mpi program running under slurm? - mpi

I am running a script that makes several successive mpirun calls, submitted through Slurm's sbatch command. Each call to mpirun writes its output to its own directory, but the runs depend on each other: a given run uses data from the previous run's output directory.
The MPI program internally performs an iterative optimization algorithm, which terminates once some convergence criteria are met. Every once in a while, the algorithm reaches a state in which those criteria are not quite met yet, but by plotting the output (which is continuously written to disk) one can easily tell that the important quantities have converged and that further iterations would not change the nature of the final result.
What I am therefore looking for is a way to manually terminate the run in a controlled way and have the outer script proceed to the next mpirun call. What is the best way to achieve this? I do not have direct access to the node on which the calculation is actually performed, but I do of course have access to all of Slurm's commands and to the working directories of the individual runs. I also have access to the MPI program's full source code.
One solution that would work is the following: to terminate a run manually, one places a file with a special name like killme in the working directory, which could easily be done with touch killme. The MPI program would regularly check for the existence of this file and terminate in a controlled manner if it exists. Neither the outer script nor Slurm would be involved at all, and the script would just continue with the next mpirun call. What do you think of this solution? Can you think of anything better?
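The sentinel-file idea from the question could look roughly like the sketch below. Only the file name killme comes from the question; the function names stop_file_present and run_until_stopped, and the check_every parameter, are hypothetical. The sketch leaves MPI out entirely. In a real MPI program one would presumably let a single rank do the check and share the result (for example with MPI_Bcast) so that all ranks leave the loop in the same iteration.

```cpp
#include <sys/stat.h>
#include <string>

// Returns true if a file exists at `path` (e.g. "killme" in the run directory).
bool stop_file_present(const std::string& path) {
    struct stat sb;
    return ::stat(path.c_str(), &sb) == 0;
}

// Hypothetical iteration driver: calls step(it) for up to max_iters iterations
// and looks for the sentinel file every check_every iterations.
// Returns the number of iterations that were actually executed.
template <class Step>
int run_until_stopped(Step step, int max_iters, int check_every,
                      const std::string& sentinel) {
    int done = 0;
    for (int it = 0; it < max_iters; ++it) {
        // Assumption: in an MPI code, one rank would perform this check
        // and broadcast the decision to the other ranks.
        if (it % check_every == 0 && stop_file_present(sentinel)) {
            break;  // leave the loop cleanly; the caller writes final output
        }
        step(it);
        ++done;
    }
    return done;
}
```

Checking only every check_every iterations keeps the file system traffic low, which may matter on a shared parallel file system.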

Here is a short code snippet for getting SIGUSR1 as a signal.
More detailed explanation can be found here.
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Handler for SIGUSR1. Note: fprintf() and exit() are not async-signal-safe;
   they are used here only to keep the demonstration short. */
void sighandler(int signum, siginfo_t *info, void *ptr) {
    (void) ptr;
    fprintf(stderr, "Received signal %d\n", signum);
    fprintf(stderr, "Signal originates from process %lu\n",
            (unsigned long) info->si_pid);
    fprintf(stderr, "Shutting down properly.\n");
    exit(0);
}

int main(int argc, char** argv) {
    struct sigaction act;
    printf("pid %lu\n", (unsigned long) getpid());
    memset(&act, 0, sizeof(act));
    act.sa_sigaction = sighandler;  /* three-argument handler form */
    act.sa_flags = SA_SIGINFO;      /* required to receive siginfo_t */
    sigaction(SIGUSR1, &act, NULL);
    /* Sleep until a signal arrives instead of spinning. */
    while (1) {
        pause();
    }
    return 0;
}
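For an iterative solver, a variant of the snippet above may be more convenient: the handler only sets a flag, and the iteration loop polls that flag and shuts down at a safe point. This is a minimal sketch; the names g_stop, on_sigusr1, install_stop_handler and signal_stop_requested are made up for illustration. Whether a SIGUSR1 sent from outside (for example via scancel's --signal option) actually reaches every MPI rank depends on the launcher and the Slurm setup, so that part is an assumption to verify on the target system.

```cpp
#include <signal.h>

// Set by the handler, polled by the iteration loop.
static volatile sig_atomic_t g_stop = 0;

// The handler does nothing but set the flag, which is async-signal-safe.
extern "C" void on_sigusr1(int) { g_stop = 1; }

// Installs the SIGUSR1 handler; returns true on success.
bool install_stop_handler() {
    struct sigaction act {};
    act.sa_handler = on_sigusr1;
    sigemptyset(&act.sa_mask);
    return sigaction(SIGUSR1, &act, nullptr) == 0;
}

// To be called once per iteration by the solver loop.
bool signal_stop_requested() { return g_stop != 0; }
```

The solver would call install_stop_handler() once at startup and test signal_stop_requested() at the end of each iteration, writing its final output before returning normally.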

Related

Qt5 how to catch exception [duplicate]

Here is a simple piece of code where a division by zero occurs. I'm trying to catch it :
#include <iostream>

int main(int argc, char *argv[]) {
    int Dividend = 10;
    int Divisor = 0;
    try {
        std::cout << Dividend / Divisor;
    } catch(...) {
        std::cout << "Error.";
    }
    return 0;
}
But the application crashes anyway (even though I passed the -fexceptions option to MinGW).
Is it possible to catch such an exception (which I understand is not a C++ exception, but an FPU exception)?
I'm aware that I could check the divisor before dividing, but I assumed that, because a division by zero is rare (at least in my app), it would be more efficient to try dividing (and catch the error if it occurs) than to test the divisor before every division.
I'm doing these tests on a Windows XP computer, but would like to make it cross-platform.
It's not an exception. It's an error which is determined at hardware level and is returned back to the operating system, which then notifies your program in some OS-specific way about it (like, by killing the process).
I believe that in such case what happens is not an exception but a signal. If it's the case: The operating system interrupts your program's main control flow and calls a signal handler, which - in turn - terminates the operation of your program.
It's the same type of error which appears when you dereference a null pointer (then your program crashes by SIGSEGV signal, segmentation fault).
You could try to use the functions from <csignal> header to try to provide a custom handler for the SIGFPE signal (it's for floating point exceptions, but it might be the case that it's also raised for integer division by zero - I'm really unsure here). You should however note that the signal handling is OS-dependent and MinGW somehow "emulates" the POSIX signals under Windows environment.
Here's the test on MinGW 4.5, Windows 7:
#include <csignal>
#include <iostream>
using namespace std;

void handler(int a) {
    cout << "Signal " << a << " here!" << endl;
}

int main() {
    signal(SIGFPE, handler);
    int a = 1/0;
}
Output:
Signal 8 here!
And right after executing the signal handler, the system kills the process and displays an error message.
Using this, you can close any resources or log an error after a division by zero or a null pointer dereference... but unlike exceptions that's NOT a way to control your program's flow even in exceptional cases. A valid program shouldn't do that. Catching those signals is only useful for debugging/diagnosing purposes.
(Some signals are very useful in low-level programming in general and don't cause your program to be killed right after the handler returns, but that's a deep topic.)
Dividing by zero is a logical error, a bug by the programmer. You shouldn't try to cope with it; you should debug and eliminate it. In addition, catching exceptions is extremely expensive, far more so than checking the divisor.
You can use Structured Exception Handling to catch the divide-by-zero error. How that's achieved depends on your compiler. MSVC offers a compiler option (/EHa) that lets catch(...) catch Structured Exceptions, provides a function (_set_se_translator) to translate Structured Exceptions into regular C++ exceptions, and also offers __try/__except/__finally. However, I'm not familiar enough with MinGW to tell you how to do it with that compiler.
There isn't a language-standard way of catching the divide-by-zero from the CPU.
Don't prematurely "optimize" away a branch. Is your application really CPU-bound in this context? I doubt it, and it isn't really an optimization if you break your code. Otherwise, I could make your code even faster:
int main(int argc, char *argv[]) { /* Fastest program ever! */ }
Divide by zero is not an exception in C++, see https://web.archive.org/web/20121227152410/http://www.jdl.co.uk/briefings/divByZeroInCpp.html
Somehow the real explanation is still missing.
Is it possible to catch such an exception (which I understand is not a C++ exception, but a FPU exception) ?
Yes, your catch block should work on some compilers. But the problem is that your exception is not an FPU exception. You are doing integer division. I don’t know whether that’s also a catchable error but it’s not an FPU exception, which uses a feature of the IEEE representation of floating point numbers.
On Windows (with Visual C++), try this:
BOOL SafeDiv(INT32 dividend, INT32 divisor, INT32 *pResult)
{
    __try
    {
        *pResult = dividend / divisor;
    }
    __except(GetExceptionCode() == EXCEPTION_INT_DIVIDE_BY_ZERO ?
             EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)
    {
        return FALSE;
    }
    return TRUE;
}
MSDN: http://msdn.microsoft.com/en-us/library/ms681409(v=vs.85).aspx
Well, if there were exception handling for this, some component would still have to perform the check. So you lose nothing by checking it yourself. And not much is faster than a simple comparison (a single CPU instruction, "jump if equal zero" or something like that; I don't remember the name).
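The "check it yourself" approach can be written portably in a few lines. The following is a minimal sketch; the function name safe_div and its out-parameter interface are only illustrative, loosely modeled on the SafeDiv example above. Besides a zero divisor it also rejects INT_MIN / -1, which overflows and raises the same hardware trap on x86.

```cpp
#include <climits>

// Divides dividend by divisor and stores the quotient in *result.
// Returns false, leaving *result untouched, if the division is not
// representable: divisor == 0, or INT_MIN / -1 (overflow).
bool safe_div(int dividend, int divisor, int* result) {
    if (divisor == 0) {
        return false;
    }
    if (dividend == INT_MIN && divisor == -1) {
        return false;
    }
    *result = dividend / divisor;
    return true;
}
```

A caller would write something like `int q; if (!safe_div(a, b, &q)) { /* handle the error */ }`, which keeps the rare error path explicit without relying on signals or compiler extensions.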
To avoid infinite "Signal 8 here!" messages, just add 'exit' to Kos's nice code:
#include <csignal>
#include <iostream>
#include <cstdlib> // exit
using namespace std;

void handler(int a) {
    cout << "Signal " << a << " here!" << endl;
    exit(1);
}

int main() {
    signal(SIGFPE, handler);
    int a = 1/0;
}
As others said, it's not an exception; in floating-point arithmetic, it just generates a NaN or Inf.
Zero-divide is only one way to do that. If you do much math, there are lots of ways, like
log(not_positive_number), exp(big_number), etc.
If you can check for valid arguments before doing the calculation, then do so, but sometimes that's hard to do, so you may need to generate and handle an exception.
In MSVC there is a header file #include <float.h> containing a function _finite(x) that tells if a number is finite.
I'm pretty sure MinGW has something similar.
You can test that after a calculation and throw/catch your own exception or whatever.
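In standard C++ (C++11 and later), std::isfinite from <cmath> plays the role of _finite. A minimal sketch of the "test after the calculation and throw your own exception" idea might look like this; the helper name require_finite and the choice of std::domain_error are assumptions made for illustration.

```cpp
#include <cmath>
#include <stdexcept>

// Returns value unchanged if it is finite; throws std::domain_error
// (with the given message) if it is NaN or +/-Inf.
double require_finite(double value, const char* what) {
    if (!std::isfinite(value)) {
        throw std::domain_error(what);
    }
    return value;
}
```

Usage would be along the lines of `double y = require_finite(std::log(x), "log of non-positive number");`, so that the error surfaces as a regular C++ exception that a catch block can handle.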

googlebenchmark and MPI: Is there hope?

I want to run a particular MPI function under google benchmark. Something like:
#include <mpi.h>
#include <benchmark/benchmark.h>

template<class Real>
void MPIInitFinalize(benchmark::State& state)
{
    auto mpi = []() {
        MPI_Init(nullptr, nullptr);
        foo();
        MPI_Finalize();
    };
    for (auto _ : state) {
        mpi();
    }
}
BENCHMARK_TEMPLATE(MPIInitFinalize, double);
BENCHMARK_MAIN();
Of course, we know what will happen:
*** The MPI_Init() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
I understand that MPI isn't cool with what I want to do. But google benchmark is simply too useful to not at least try to find a hack to make this work.
Is there anything that can be done? Can I fork a process and pass the lambda to it? Is there a threading pattern that will work? Even expensive things will be helpful, as I can just subtract the cost of doing whatever hack works without a call to foo() from the one that calls foo().
If you don't need to include MPI_Init and MPI_Finalize in your timing (which you probably don't want anyway), you can take a look at this gist: https://gist.github.com/mdavezac/eb16de7e8fc08e522ff0d420516094f5
It contains an example of how to benchmark MPI-enabled code with google benchmark. The basic idea is to call google benchmark from your own main function (using ::benchmark::Initialize(&argc, argv) and ::benchmark::RunSpecifiedBenchmarks()), synchronize using MPI_Barrier, time your code using std::chrono::high_resolution_clock, and use MPI_Allreduce to find the slowest process. You can then publish that time using state.SetIterationTime (but only on the main process).

Would changing process priority frequently have side effects?

I am doing embedded system programming.
Our process runs at a higher priority by default. However, for some actions, such as invoking shell commands or writing files, I was thinking of lowering its priority and then raising it again afterwards. So there would be a pair of function calls: "setdefaultpriority" and "improve priority".
There are lots of shell command invocations in our process. In one file, I may need tens of pairs of "setdefault..." and "improve.." calls.
My question: would so many priority changes in one process have any bad effect?
setpriority in a non-root process can only raise the nice value (that is, decrease the priority), never lower it.
What you can do is decrease process priority in the child process, before it execs the shell command.
// error checks omitted
#include <sys/resource.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include <sys/wait.h>

int main()
{
    pid_t pid;

    /* First child: runs "nice" at the inherited priority. */
    pid = fork();
    assert(pid >= 0);
    if (!pid) {
        execlp("nice", "nice", (char*)0);
        _exit(1);
    }
    wait(0);

    /* Second child: lowers its own priority before exec. */
    pid = fork();
    assert(pid >= 0);
    if (!pid) {
        setpriority(PRIO_PROCESS, 0, 10);
        execlp("nice", "nice", (char*)0);
        _exit(1);
    }
    wait(0);  /* reap the second child as well */
    return 0;
}
/* should print:
0
10
*/
The performance overhead of a system call as simple as setpriority should be negligible compared to the cost of fork and exec*.
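Instead of bracketing every shell invocation with a pair of priority calls in the parent, the pattern above can be wrapped in one helper that lowers the priority only in the child. The sketch below assumes a POSIX system with /bin/sh; the function name run_niced is hypothetical, and error handling is kept to a minimum.

```cpp
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `command` through /bin/sh with the given nice value.
// The parent's own priority is never touched.
// Returns the command's exit status, or -1 on failure.
int run_niced(const char* command, int niceness) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        // Child: lower our own priority, then become the shell.
        setpriority(PRIO_PROCESS, 0, niceness);
        execl("/bin/sh", "sh", "-c", command, (char*)nullptr);
        _exit(127);  // only reached if exec failed
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With such a helper, the high-priority process calls run_niced("some command", 10) wherever it would otherwise have called the "setdefault"/"improve" pair around a shell command.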

heap memory release policy in Arduino

#include <Arduino.h>
#include "include/MainComponent.h"
/*
 Turns on an LED on for one second, then off for one second, repeatedly.
*/
MainComponent* mainComponent;
void setup()
{
   mainComponent = new MainComponent();
   mainComponent->beginComponent();
}
void loop()
{
   mainComponent->runComponent();
}
Is there any callback to release memory in Arduino (e.g. to call delete mainComponent)?
Or will this happen automatically when the loop ends?
What is the strategy to ensure that the memory allocated in this code snippet is freed?
SCENARIO: "I wanted to access the object in both functions, so the object is declared at global scope and then instantiated in setup()."
What happens when loop() terminates? Will mainComponent still remain in memory?
On an OS, no: the process would terminate and its resources would be deallocated.
So on Arduino, how can I achieve the above SCENARIO while ensuring the memory is deallocated when the controller is switched off?
What is confusing you is that the main() function is hidden by the basic Arduino IDE. Your programs have a main() function just like on any other platform, and have the same lifecycle as when run on a computer with an OS. If you look under arduino___\hardware\cores\aduino, you will find a file main.cpp, which is included in your binaries:
int main(void)
{
    init();
    //...
    setup();
    for (;;) {
        loop();
        if (serialEventRun) serialEventRun();
    }
    return 0;
}
Looking at this file, you can see that although loop() returns, it is immediately called again. Your program never exits. In general, your best pattern is to new objects once and never delete them, as you have done here. If you are new'ing and delete'ing objects repeatedly on a microcontroller, you are not thinking about lifecycles and resources wisely.
So:
"Is the new'd object deleted at return from loop()?" No, the program is still running and the object stays on the heap.
"What happens at power off? Is there a way to clean up?" The moment the supply voltage drops too low, the microcontroller will stop executing instructions. Power supervisor circuitry prevents (or should prevent) the controller from doing anything erratic as the voltage drops. When the voltage is completely drained, all the RAM contents are lost. Without adding circuitry, you have no way to execute any cleanup at power off.
"Do I need to clean up?" No, at power up, everything is reset to a known state. Operation cannot be affected by anything left behind in RAM (presuming you initialize all your variables).

mpi multiple init finalize

Assuming I have a good reason to do the following (I think I have), how can I make it work?
#include "mpi.h"

int main( int argc, char *argv[] )
{
    int myid, numprocs;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    // ...
    MPI_Finalize();

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    // ...
    MPI_Finalize();
    return 0;
}
I got the error:
--------------------------------------------------------------------------
Calling any MPI-function after calling MPI_Finalize is erroneous.
The only exceptions are MPI_Initialized, MPI_Finalized and MPI_Get_version.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** after MPI was finalized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[ange:13049] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!
The reason to do that:
I have Python wrapping around C++ code. Some wrapped classes have a constructor that calls MPI_Init and a destructor that calls MPI_Finalize. I would like to be able, in Python, to freely create, delete, and re-create the Python objects that wrap this C++ class. The ultimate goal is to create a web service entirely in Python that imports the Python C++ extension once and executes some Python code for each user request.
EDIT: I think I'll refactor the C++ code to make it possible to skip MPI_Init and MPI_Finalize in the constructor and destructor, so that they can be called exactly once in the Python script (using mpi4py).
You've basically got the right solution, so I'll just confirm. It is in fact erroneous to call MPI_Init and MPI_Finalize multiple times, and if you have an entity that calls these internally on creation/destruction, then you can only instantiate that entity once. If you want to create multiple instances, you'll need to change the entity to do one of the following:
Offer an option to not call Init and Finalize that the user can set externally
Use MPI_Initialized and MPI_Finalized to decide whether it needs to call either of the above
