Apparently, mpirun uses a SIGINT handler which "forwards" the SIGINT signal to each of the processes it spawned.
This means you can write an interrupt handler for your MPI-enabled code, execute mpirun -np 3 my-mpi-enabled-executable, and SIGINT will then be raised in each of the three processes. Shortly after that, mpirun exits. This works fine when you have a small custom handler which only prints an error message and then exits. However, when the custom interrupt handler does a non-trivial job (e.g. serious computation or persisting data), the handler does not run to completion. I'm assuming this is because mpirun decides to exit too soon.
Here's the stderr upon pressing ctrl-c (i.e. causing SIGINT) after executing my-mpi-enabled-executable. This is the desirable expected behavior:
interrupted by signal 2.
running viterbi... done.
persisting parameters... done.
the master process will now exit.
Here's the stderr upon pressing ctrl-c after executing mpirun -np 1 my-mpi-enabled-executable. This is the problematic behavior:
interrupted by signal 2.
running viterbi... mpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 8970 on node pharaoh exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
mpirun: clean termination accomplished
Answering any of the following questions will solve my problem:
How to override the mpirun SIGINT handler (if at all possible)?
How to avoid the termination of the processes mpirun spawned right after mpirun terminates?
Is there another signal which mpirun may be sending to the child processes before mpirun terminates?
Is there a way to "capture" the so-called "signal 0 (Unknown signal 0)" (see the second stderr above)?
I'm running openmpi-1.6.3 on linux.
As per the Open MPI man page, you can send SIGUSR1 or SIGUSR2 to mpirun, which will forward it to the processes without shutting itself down.
While having the same issue, I came across this question and the answer by @Zulan.
In particular, I wanted to catch a SIGINT (Ctrl+C) from the user, do some work and then exit in an orderly fashion, so using SIGUSR1 was not an option. Reading the man page that @Zulan linked, however, shows that mpirun (at least the Open MPI version) catches SIGINT and then sends SIGTERM to the child processes. Catching SIGTERM in my code therefore allowed me to call the proper exit routines.
Note that signal handling is not safe with MPI, as noted here.
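For illustration, here is a minimal Python sketch of that approach. The original program is presumably C or C++, so treat this as an analogy rather than the poster's implementation. The handler only records which signal arrived; the expensive cleanup runs from the main loop, which keeps the handler itself trivial. The names ShutdownFlag, run, work_step and cleanup are made up for this example.

```python
import signal


class ShutdownFlag:
    """Records the first termination signal instead of acting on it."""

    def __init__(self):
        self.signum = None

    def install(self, signals=(signal.SIGINT, signal.SIGTERM)):
        # Catch SIGTERM as well as SIGINT: per the answer above, mpirun
        # turns the user's Ctrl+C into SIGTERM for the child processes.
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        if self.signum is None:
            self.signum = signum


def run(flag, work_step, cleanup, max_steps=1000):
    """Call work_step() until a signal is recorded, then run cleanup() once."""
    steps = 0
    while flag.signum is None and steps < max_steps:
        work_step()
        steps += 1
    cleanup()  # runs in normal program context, not inside the handler
    return flag.signum
```

Whether the cleanup gets enough time still depends on how long the launcher waits before it kills the job, so long-running work such as the Viterbi pass in the question may still be cut short.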
I have the following situation:
Thread 1:
Forks a child, say A, which in turn forks again and executes a process, B.
Thread 2:
Listens for commands over a Unix domain socket and kills the process B that was forked by child A in Thread 1
Respond to caller that it has killed the child
I want to ignore SIGPIPE in Thread 2, as I do not want the program to crash when the client has closed the socket. So I tried doing this using:
sigset_t set;
sigemptyset(&set);
sigaddset(&set, SIGPIPE);
pthread_sigmask(SIG_BLOCK, &set, NULL);
Doing this blocks SIGPIPE, but it also takes away Thread 1's ability to send SIGKILL to the child.
I also tried the following in the main function before creating the threads:
signal(SIGPIPE, SIG_IGN);
and
calling send() with the MSG_NOSIGNAL flag on the socket.
This doesn't help with the SIGKILL problem either. Any idea how to safely ignore SIGPIPE in a multi-threaded program like the one above, with forks, execs and SIGKILLs being sent?
After some investigation, I found the solution. I had to call pthread_sigmask(SIG_UNBLOCK, &set, NULL); after each fork() call and before exec() in the grandchild. With that, SIGKILL was no longer blocked.
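The mechanism behind this fix is that a thread's signal mask is inherited by the child across fork() and is preserved across exec(), so anything blocked in the parent thread stays blocked in the child unless the child unblocks it. A small Python sketch of that inheritance follows (Python exposes the same call as signal.pthread_sigmask; the helper names sigpipe_blocked_in_child and demo are made up for this example, and it is POSIX only). It only inspects the mask in the forked child and does not reproduce the SIGKILL symptom described above.

```python
import os
import signal


def sigpipe_blocked_in_child(unblock_in_child):
    """Fork a child and report whether SIGPIPE is blocked inside it."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.close(read_fd)
            if unblock_in_child:
                # The fix from the answer: undo the inherited mask.
                signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGPIPE})
            # Blocking an empty set changes nothing and returns the current mask.
            blocked = signal.pthread_sigmask(signal.SIG_BLOCK, set())
            os.write(write_fd, b"1" if signal.SIGPIPE in blocked else b"0")
        finally:
            os._exit(0)  # never fall back into the parent's code path
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"


def demo():
    # Same setup as in the question: block SIGPIPE in the calling thread.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGPIPE})
    return sigpipe_blocked_in_child(False), sigpipe_blocked_in_child(True)
```

In a real program the unblock step would sit between fork() and exec(), as described above.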
I have a simple question (in my mind) and I cannot find an answer. How do I suppress output messages from mpirun?
For example, I have an MPI based program that takes input file names. If a file name is bad, the program generates a log file such as:
Beginning initialization...
*****************************
Reading topology file...
Error: Topology file mysample.top was not found.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 21581 on
node newton-compute-2-25.local exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
The behavior is correct; the program terminates execution (by calling MPI_Abort) with a message that the input file is bad. The messages from MPI are not necessary, and these are what I would like to suppress.
I did try adding the -q and --quiet options to the mpirun call, but they appear to do nothing for this particular problem. I am also using OpenMPI, if the implementation matters.
Edit: I should mention the MPI messages go to stderr, which is not necessarily stdout. That is fine, but I still do not want to see them with error messages from the program.
As MPI must be capable of handling errors from all the nodes it runs on, I'm pretty confident you can't split the MPI error stream from the processes' error streams. You can discard all of stderr with 2>/dev/null or send it to an error log with 2> err.log, but again, I don't believe you can split the errors.
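If the job is launched from a script rather than typed into a shell, the same redirection can be done there. A minimal Python sketch, assuming the launcher is started through subprocess (the helper name run_without_stderr and the program name in the comment are hypothetical):

```python
import subprocess
import sys


def run_without_stderr(cmd, err_log=None):
    """Run cmd, keeping stdout; discard stderr or append it to err_log."""
    if err_log is None:
        # Equivalent of 2>/dev/null
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL, text=True)
    else:
        # Equivalent of 2>> err.log
        with open(err_log, "a") as log:
            result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                    stderr=log, text=True)
    return result.returncode, result.stdout


# Hypothetical usage:
# rc, out = run_without_stderr(["mpirun", "-np", "4", "./my_program"], "err.log")
```

As with the shell redirection, this hides the program's own stderr together with the messages from mpirun; it does not separate the two.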
What's the difference between the SIGINT signal and the SIGTERM signal? I know that SIGINT is equivalent to pressing ctrl+c on the keyboard, but what is SIGTERM for? If I wanted to stop some background process gracefully, which of these should I use?
The only difference in the response is up to the developer. If the developer wants the application to respond to SIGTERM differently than to SIGINT, then different handlers will be registered. If you want to stop a background process gracefully, you would typically send SIGTERM. If you are developing an application, you should respond to SIGTERM by exiting gracefully. SIGINT is often handled the same way, but not always. For example, it is often convenient to respond to SIGINT by reporting status or partial computation. This makes it easy for the user running the application on a terminal to get partial results, but slightly more difficult to terminate the program since it generally requires the user to open another shell and send a SIGTERM via kill. In other words, it depends on the application but the convention is to respond to SIGTERM by shutting down gracefully, the default action for both signals is termination, and most applications respond to SIGINT by stopping gracefully.
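To make the "different handlers" point concrete, here is a small Python sketch of an application that treats the two signals differently: SIGINT reports progress and keeps going, SIGTERM asks for a graceful stop. The class and attribute names are invented for this example.

```python
import signal


class Worker:
    """Toy long-running job: SIGINT reports progress, SIGTERM requests a stop."""

    def __init__(self):
        self.steps_done = 0
        self.stop_requested = False
        self.reports = []

    def install_handlers(self):
        signal.signal(signal.SIGINT, self._report)
        signal.signal(signal.SIGTERM, self._stop)

    def _report(self, signum, frame):
        # Ctrl+C: record a status line, do not terminate.
        self.reports.append(f"{self.steps_done} steps done")

    def _stop(self, signum, frame):
        # SIGTERM: only set a flag; the main loop shuts down when it sees it.
        self.stop_requested = True
```

With handlers like these, Ctrl+C no longer stops the program, which is the trade-off mentioned above: the user has to send SIGTERM with kill from another shell.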
If I wanted to stop some background process gracefully, which of these should I use?
The Unix list of signals dates back to the time when computers had serial terminals and modems, which is where the concept of a controlling terminal originates. When a modem drops the carrier, the line is hung up.
SIGHUP(1) therefore would indicate a loss of connection, forcing programs to exit or restart. For daemons like syslogd and sshd, processes without a terminal connection that are supposed to keep running, SIGHUP is typically the signal used to restart or reset.
SIGINT(2) and SIGQUIT(3) are literally "interrupt" or "quit" - "from keyboard" - giving the user immediate control if a program goes haywire. With a physical character-based terminal this would be the only way to stop a program!
SIGTERM(15) is not related to any terminal handling, and can only be sent from another process. This would be the conventional signal to send to a background process.
SIGINT is a program interrupt signal, which is sent when a user presses Ctrl+C.
SIGTERM is a termination signal. It is sent to a process to request its termination, but it can be caught or ignored by that process.
I have a bash script in which I kill a running process by sending the SIGTERM signal to its process ID. However, I want to know the return code of the process I just sent the signal to.
Is that possible?
I cannot use 'wait' because the process to kill was not started from my script, and I'm receiving
"pid ##### is not a child of this shell"
I did some tests on the command line: in the console where the process was running, after I sent the SIGTERM signal (from another console), I checked the exit code and it was 143.
I want to kill the process from a different script and catch that number.
As shellter said, you cannot get the exit code of a process except using wait (or waitpid(), etc...) and you can only do that if you are its parent.
But even if you could, think about this:
When you send a process a SIGTERM, only one of three things can happen:
The process has not installed any signal handler for SIGTERM. In this case it dies immediately as a result of the signal. But in this case the exit code is uninteresting – you already know what it is. On most platforms it is 143 (128 + integer value of SIGTERM), indicating, unsurprisingly, that the process has died as a result of SIGTERM.
The process has configured SIGTERM to be ignored. In this case, nothing happens, the process does not die, and so there is no exit code to obtain anyway.
The process has installed a signal handler for SIGTERM. In this case, the handler is invoked. The handler might do anything at all: possibly nothing, possibly exit immediately, possibly carry out some cleanup operation and exit later, possibly something completely different. Even if the process does exit, that's only an indirect result of the signal, and it happens at a later time, so there is no exit code to obtain that comes directly from the delivery of the signal.
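The 143 in the first case is easy to reproduce when you are the parent. The Python sketch below (function names are made up; POSIX only) starts a child that installs no SIGTERM handler, terminates it, and converts the wait status into the number a shell would show. Python's subprocess reports death by signal N as a return code of -N, whereas shells report 128 + N.

```python
import signal
import subprocess
import sys


def spawn_sleeper():
    """Start a child process that just sleeps and installs no handlers."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def terminate_and_report(proc):
    """Send SIGTERM and return (shell-style exit code, signal name or None)."""
    proc.send_signal(signal.SIGTERM)
    rc = proc.wait(timeout=10)
    if rc < 0:
        # Killed by a signal: translate to the shell convention 128 + N.
        return 128 - rc, signal.Signals(-rc).name
    # The process exited on its own, e.g. from a handler that called exit().
    return rc, None
```

This only works because the caller started the process itself, which is exactly the restriction described above: for a process that is not your child there is nothing to wait on.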
Could you please explain the logic of the UNIX signal system to me: why does it first send SIGHUP to the process group and then send SIGCONT, given that the main idea of SIGHUP is "kill yourself, there is no terminal anymore"?
It covers the case where the process was stopped (for example with SIGTSTP, which is what pressing Ctrl+Z sends, or with SIGSTOP) and therefore cannot respond to SIGHUP until it is continued.
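This can be observed directly. The Python sketch below (POSIX only; the helper name hup_while_stopped is made up) stops a child with SIGSTOP, sends it SIGHUP, and checks that the child only dies once SIGCONT arrives.

```python
import os
import signal
import subprocess
import sys
import time


def hup_while_stopped():
    """Return (child still alive while stopped, final return code)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        # Make sure SIGHUP has its default action in the child even if
        # the parent ignores it (for example when run under nohup).
        preexec_fn=lambda: signal.signal(signal.SIGHUP, signal.SIG_DFL),
    )
    os.kill(proc.pid, signal.SIGSTOP)
    os.waitpid(proc.pid, os.WUNTRACED)  # returns once the child is stopped
    os.kill(proc.pid, signal.SIGHUP)    # stays pending: the child is stopped
    time.sleep(0.2)
    alive_while_stopped = proc.poll() is None
    os.kill(proc.pid, signal.SIGCONT)   # the child resumes and SIGHUP is delivered
    rc = proc.wait(timeout=10)
    return alive_while_stopped, rc
```

The return code is the negated signal number, so a child killed by the pending SIGHUP is reported as -signal.SIGHUP.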