SLURM: how to disable automatic job cleanup when one PE crashes

I distribute an OpenMPI-based application using the SLURM launcher srun. When one of the processes crashes, I would like to detect that in the other PEs and take some action. I am aware that OpenMPI does not provide fault tolerance, but I still need to perform a graceful exit in the other PEs.
To do this, every PE has to be able to:
Continue running despite the crash of another PE.
Detect that one of the PEs has crashed.
Currently I'm focusing on the first task. According to the manual, srun has a --no-kill flag. However, it does not seem to work for me. I see the following log messages:
srun: error: node0: task 0: Aborted // this is where I crash the PE deliberately
slurmstepd: error: node0: [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: ***STEP 12123.0 ON node0 CANCELLED AT 2020-12-02 ***
srun: error: node0: task 1: Killed // WHY?!
Why does it happen? Is there any other relevant flag or environment variable, or any configuration option that might help?
To reproduce the problem, one can use the following program (it uses Boost.MPI for brevity, but the behaviour is the same without Boost):
#include <boost/mpi.hpp>

int main() {
    using namespace boost::mpi;
    environment env;           // initializes MPI
    communicator comm;         // wraps MPI_COMM_WORLD
    comm.barrier();            // make sure every PE gets this far
    if (comm.rank() == 0) {
        throw 0;               // deliberately crash rank 0
    }
    while (true) {}            // the remaining PEs keep running
}
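A possible way to build and launch this reproducer (the compiler wrapper, Boost library names, and task count below are just an example and may differ on your system):
mpicxx crash_test.cpp -lboost_mpi -lboost_serialization -o crash_test
srun -n 2 ./crash_test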

According to the documentation you linked, the --no-kill flag only affects the behaviour in case of node failure.
In your case you should use the --kill-on-bad-exit=0 option, which prevents the rest of the tasks from being killed when one of them exits with a non-zero exit code.
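For example (the task count and executable name are placeholders):
srun -n 2 --kill-on-bad-exit=0 ./crash_test
With this option the surviving tasks keep running after a task aborts, so they get a chance to detect the failure and exit gracefully.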

Related

PIN Assert- AddVmThread: 155: assertion failed: m_sysIdMap.find(sysThreadId) == m_sysIdMap.end()

I'm using the latest version of Pin (3.24) on a proprietary version of CentOS (exact version unknown, I'm checking) inside a container, running on x86-64.
I am getting the following assert when running my pintool, which adds a sleep at a particular line# in the code.
The application is our proprietary OS.
Assert -
A: /tmp_proj/pinjen/workspace/pypl-pin-nightly/GitPin/Source/pin/vm/vm_threaddb.cpp: AddVmThread: 155: assertion failed: m_sysIdMap.find(sysThreadId) == m_sysIdMap.end()
Can you please explain what this assert means and how to avoid it? I'm running inside a container. The stack is not dumped. The instrumentation (a sleep at the correct line#) is added, but after about 10 minutes of running it crashes.
Sometimes it crashes after executing the sleep (analysis) function 50k times, sometimes after 150k times.
I'm not sure what is happening. Please help.
Later I tried to simply run WITHOUT my pintool, and it hit the same assert:
$ <pin> -- <my custom OS executable>
[ Note: there is NO pintool in the command above, but it still produced the assert below. ]
A: /tmp_proj/pinjen/workspace/pypl-pin-nightly/GitPin/Source/pin/vm/vm_threaddb.cpp: AddVmThread: 155: assertion failed: m_sysIdMap.find(sysThreadId) == m_sysIdMap.end()
#############################################################
## STACK TRACE
#############################################################
Pin must be run with tool in order to generate Pin stack trace
Detach Service Count: 20269696
Pin: pin-3.24-98612-6bd5931f2
Could you please explain how I can avoid the above assert?
Thanks

Optimization Hangs When Using Nonlinear Solver While Running Under MPI

I am trying to solve optimization problems using gradient-free algorithms (such as the simple genetic algorithm) in OpenMDAO, utilizing parallel function evaluation with MPI. When my problem does not have cycles I do not run into any problems. However, as soon as I have to use a nonlinear solver to converge a cycle, the process hangs indefinitely after one of the ranks' nl_solver finishes.
Here is a code example (solve_sellar.py):
import openmdao.api as om
from openmdao.test_suite.components.sellar_feature import SellarMDA
from openmdao.utils.mpi import MPI

if not MPI:
    rank = 0
else:
    rank = MPI.COMM_WORLD.rank

if __name__ == "__main__":
    prob = om.Problem()
    prob.model = SellarMDA()

    prob.model.add_design_var('x', lower=0, upper=10)
    prob.model.add_design_var('z', lower=0, upper=10)
    prob.model.add_objective('obj')
    prob.model.add_constraint('con1', upper=0)
    prob.model.add_constraint('con2', upper=0)

    prob.driver = om.SimpleGADriver(run_parallel=(MPI is not None), bits={"x": 32, "z": 32})

    prob.setup()
    prob.set_solver_print(level=0)
    prob.run_driver()

    if rank == 0:
        print('minimum found at')
        print(prob['x'][0])
        print(prob['z'])
        print('minumum objective')
        print(prob['obj'][0])
As you can see, this code is meant to solve the Sellar problem using the SimpleGADriver that is included in OpenMDAO. When I simply run this code in serial (python3 solve_sellar.py) I get a result after a while and the following output:
Unable to import mpi4py. Parallel processing unavailable.
NL: NLBGSSolver 'NL: NLBGS' on system 'cycle' failed to converge in 10 iterations.
<string>:1: RuntimeWarning: overflow encountered in exp
NL: NLBGSSolver 'NL: NLBGS' on system 'cycle' failed to converge in 10 iterations.
minimum found at
0.0
[0. 0.]
minumum objective
0.7779677271254263
If I instead run this with MPI (mpirun -np 16 python3 solve_sellar.py) I get the following output:
NL: NLBJSolver 'NL: NLBJ' on system 'cycle' failed to converge in 10 iterations.
And then a whole lot of nothing. The command hangs and blocks the assigned processors, but there is no further output. Eventually I kill the command with CTRL-C. The process then continues to hang after the following output:
[mpiexec#eb26233a2dd8] Sending Ctrl-C to processes as requested
[mpiexec#eb26233a2dd8] Press Ctrl-C again to force abort
Hence, I have to force abort the process:
Ctrl-C caught... cleaning up processes
[proxy:0:0#eb26233a2dd8] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
[proxy:0:0#eb26233a2dd8] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0#eb26233a2dd8] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec#eb26233a2dd8] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec#eb26233a2dd8] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec#eb26233a2dd8] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec#eb26233a2dd8] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
You should be able to reproduce this on any working MPI-enabled OpenMDAO environment, but I have made a Dockerfile as well to ensure the environment is consistent:
FROM danieldv/hode:latest
RUN pip3 install --upgrade openmdao==2.9.0
ADD . /usr/src/app
WORKDIR /usr/src/app
CMD mpirun -np 16 python3 solve_sellar.py
Does anyone have a suggestion of how to solve this?
Thank you for reporting this. Yes, this looks like a bug that we introduced when we fixed the MPI norm calculation on some of the solvers.
This bug has now been fixed as of commit c4369225f43e56133d5dd4238d1cdea07d76ecc3. You can access the fix by pulling down the latest from the OpenMDAO github repo, or wait until the next release (which will be 2.9.2).
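Until 2.9.2 is released, one way to pick up the fix is to install OpenMDAO directly from the GitHub repository (shown as a sketch; adjust to your environment):
pip3 install git+https://github.com/OpenMDAO/OpenMDAO.git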

Why can an Rsession process continue after SIGSEGV and what does it mean

I'm developing an R package for myself that interacts with both Java code (using rJava) and C++ code (using Rcpp). While trying to debug Rsession crashes when working under Rstudio using lldb, I noticed that lldb outputs the following message when I try to load the package I'm developing:
(lldb) Process 19030 stopped
* thread #1, name = 'rsession', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
frame #0: 0x00007fe6c7b872b4
-> 0x7fe6c7b872b4: movl (%rsi), %eax
0x7fe6c7b872b6: leaq 0xf8(%rbp), %rsi
0x7fe6c7b872bd: vmovdqu %ymm0, (%rsi)
0x7fe6c7b872c1: vmovdqu %ymm7, 0x20(%rsi)
(where 19030 is the pid of rsession). At this point, Rstudio stops and waits for lldb to resume execution, but instead of getting the dreaded "R session aborted" popup, entering the 'c' command in lldb resumes the rsession process and Rstudio continues chugging along just fine, and I can use the loaded package with no problems, i.e.:
c
Process 19030 resuming
What is going on here? Why is Rstudio's rsession not crashing if lldb says it has "stopped"? Is this due to R's (or Rstudio's?) SIGSEGV handling mechanism? Does that mean that the original SIGSEGV is spurious and should not be a cause of concern? And of course (but probably off topic in this question): how do I make sense of lldb's output in order to ascertain if this SIGSEGV on loading my package should be debugged further?
The SIGSEGV does not occur in the Rsession process itself, but in the JVM process launched by rJava on package load. This behaviour is known and is due to the JVM's memory management, as stated here:
Java uses speculative loads. If a pointer points to addressable memory, the load succeeds. Rarely the pointer does not point to addressable memory, and the attempted load generates SIGSEGV ... which java runtime intercepts, makes the memory addressable again, and restarts the load instruction.
The proposed workaround for gdb works fine:
(gdb) handle SIGSEGV nostop noprint pass
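If you are driving the debugger through lldb (as in the question) rather than gdb, the equivalent setting should be along these lines (syntax per lldb's process handle command; worth double-checking against your lldb version):
(lldb) process handle SIGSEGV --stop false --notify false --pass true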

Error in running fortran mpi program

After running an MPI Fortran program, I am getting the following error:
"Abort signaled by rank 2: No ACTIVE ports found
MPI process terminated unexpectedly
Abort signaled by rank 1: No ACTIVE ports found"
How can I solve it?
It looks like you are using an MPI implementation compiled for InfiniBand. See here: https://bugzilla.redhat.com/show_bug.cgi?id=467532 Probably you need to find (or build) an MPI library that uses TCP.
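For instance, if the implementation turns out to be Open MPI, you may be able to force it onto TCP instead of InfiniBand via MCA parameters (a sketch only; the process count and executable name are placeholders, and other MPI implementations use different options):
mpirun --mca btl tcp,self -np 4 ./my_fortran_app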

Make R exit with non-zero status code

I am looking for the R equivalent of the Linux/POSIX exit(n), which halts the process with exit code n, signalling to a parent process that an error has occurred. Does R have such a facility?
It's an argument to quit(). See ?quit.
Arguments:
status: the (numerical) error status to be returned to the operating
system, where relevant. Conventionally ‘0’ indicates
successful completion.
Details:
Some error statuses are used by R itself. The default error
handler for non-interactive use effectively calls ‘q("no", 1,
FALSE)’ and returns error code 1. Error status 2 is used for R
‘suicide’, that is a catastrophic failure, and other small numbers
are used by specific ports for initialization failures. It is
recommended that users choose statuses of 10 or more.
quit(status=1)
Replace 1 with whatever exit code you need.
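A quick way to check the behaviour from a shell (the status value 7 here is arbitrary):
Rscript -e 'quit(status = 7)'
echo $?   # prints 7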

Resources