I am trying to solve optimization problems with gradient-free algorithms (such as the simple genetic algorithm) in OpenMDAO, using parallel function evaluation with MPI. When my problem has no cycles I do not run into any issues. However, as soon as a nonlinear solver is needed to converge a cycle, the run hangs indefinitely after the nonlinear solver on one of the ranks finishes.
Here is a code example (solve_sellar.py):
import openmdao.api as om
from openmdao.test_suite.components.sellar_feature import SellarMDA
from openmdao.utils.mpi import MPI
if not MPI:
    rank = 0
else:
    rank = MPI.COMM_WORLD.rank

if __name__ == "__main__":
    prob = om.Problem()
    prob.model = SellarMDA()

    prob.model.add_design_var('x', lower=0, upper=10)
    prob.model.add_design_var('z', lower=0, upper=10)
    prob.model.add_objective('obj')
    prob.model.add_constraint('con1', upper=0)
    prob.model.add_constraint('con2', upper=0)

    prob.driver = om.SimpleGADriver(run_parallel=(MPI is not None), bits={"x": 32, "z": 32})

    prob.setup()
    prob.set_solver_print(level=0)
    prob.run_driver()

    if rank == 0:
        print('minimum found at')
        print(prob['x'][0])
        print(prob['z'])
        print('minimum objective')
        print(prob['obj'][0])
As you can see, this code is meant to solve the Sellar problem using the SimpleGADriver included in OpenMDAO. When I simply run this code in serial (python3 solve_sellar.py), I get a result after a while, along with the following output:
Unable to import mpi4py. Parallel processing unavailable.
NL: NLBGSSolver 'NL: NLBGS' on system 'cycle' failed to converge in 10 iterations.
<string>:1: RuntimeWarning: overflow encountered in exp
NL: NLBGSSolver 'NL: NLBGS' on system 'cycle' failed to converge in 10 iterations.
minimum found at
0.0
[0. 0.]
minimum objective
0.7779677271254263
If I instead run this with MPI (mpirun -np 16 python3 solve_sellar.py) I get the following output:
NL: NLBJSolver 'NL: NLBJ' on system 'cycle' failed to converge in 10 iterations.
And then a whole lot of nothing: the command hangs and keeps the assigned processors occupied, but produces no further output. Eventually I kill the command with Ctrl-C. The process then continues to hang after the following output:
[mpiexec#eb26233a2dd8] Sending Ctrl-C to processes as requested
[mpiexec#eb26233a2dd8] Press Ctrl-C again to force abort
Hence, I have to force abort the process:
Ctrl-C caught... cleaning up processes
[proxy:0:0#eb26233a2dd8] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
[proxy:0:0#eb26233a2dd8] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0#eb26233a2dd8] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec#eb26233a2dd8] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec#eb26233a2dd8] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec#eb26233a2dd8] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec#eb26233a2dd8] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
You should be able to reproduce this on any working MPI-enabled OpenMDAO environment, but I have made a Dockerfile as well to ensure the environment is consistent:
FROM danieldv/hode:latest
RUN pip3 install --upgrade openmdao==2.9.0
ADD . /usr/src/app
WORKDIR /usr/src/app
CMD mpirun -np 16 python3 solve_sellar.py
Does anyone have a suggestion for how to solve this?
Thank you for reporting this. Yes, this looks like a bug that we introduced when we fixed the MPI norm calculation on some of the solvers.
This bug has now been fixed as of commit c4369225f43e56133d5dd4238d1cdea07d76ecc3. You can access the fix by pulling down the latest from the OpenMDAO GitHub repo, or by waiting until the next release (which will be 2.9.2).
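In the meantime, a quick way to confirm which OpenMDAO build is active in a given environment is the check below (a minimal sketch; the version string reported for a source checkout installed from the repo may look slightly different):

import openmdao
print(openmdao.__version__)  # releases up to 2.9.1 predate the fix; it is expected in 2.9.2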
Related
I distribute an Open MPI-based application using the SLURM launcher srun. When one of the processes crashes, I would like to detect that in the other PEs and take some action. I am aware of the fact that Open MPI does not have fault tolerance, but I still need to perform a graceful exit in the other PEs.
To do this, every PE has to be able to do two things:
To continue running despite the crash of another PE.
To detect that one of the PEs crashed.
Currently I'm focusing on the first task. According to the manual, srun has a --no-kill flag. However, it does not seem to work for me. I see the following log messages:
srun: error: node0: task 0: Aborted // this is where I crash the PE deliberately
slurmstepd: error: node0: [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: ***STEP 12123.0 ON node0 CANCELLED AT 2020-12-02 ***
srun: error: node0: task 1: Killed // WHY?!
Why does it happen? Is there any other relevant flag or environment variable, or any configuration option that might help?
To reproduce the problem, one can use the following program (it uses Boost.MPI for brevity, but has the same effect without Boost as well):
#include <boost/mpi.hpp>

int main() {
    using namespace boost::mpi;
    environment env;
    communicator comm;
    comm.barrier();
    if (comm.rank() == 0) {
        throw 0;
    }
    while (true) {}
}
According to the documentation that you linked, the --no-kill flag only affects the behaviour in case of node failure.
In your case you should be using the --kill-on-bad-exit=0 option, which will prevent the rest of the tasks from being killed if one of them exits with a non-zero exit code.
I am running an external executable and capturing the result as an object. The .exe is a tool for selecting a population based on genetic parameters and genetic value predictions. The program executes and writes its output as requested, but fails to exit. There is no error, and when I stop it manually it exits with status code 0. How can I get this call to exit and continue, as it would with other system calls?
The call is formatted as seen below:
t <- tryCatch(system2("OPSEL.exe", args = "CMD.txt", timeout = 10))
I've tried running this in command shell with the two files referenced above and it exits appropriately.
I have tried running the OpenMDAO paraboloid tutorial as well as the benchmarks, and I consistently receive the same error, which reads as follows:
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash
---------------------------------------------------------------------
MPI_abort was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 59.
NOTE: invoking MPI_ABORT causes MPI to kill all MPI processes.
you may or may not see output from other processes, depending on exactly when Open MPI kills them.
I don't understand why this error occurs or what I can do to run OpenMDAO without hitting it. Can you please help me with this?
Something has not gone well with your PETSc install. It's hard to debug that from afar, though. The problem could be in your MPI install, your PETSc install, or your petsc4py install. I suggest not installing PETSc or petsc4py through pip, though; I've had mixed success with that. Both can be installed from source without tremendous difficulty.
However, you don't need PETSc installed to run the tutorials. You could remove those packages and the tutorials will run correctly in serial for you.
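For instance, a small serial model along the following lines should run cleanly once petsc4py and mpi4py are out of the picture. This is only a minimal sketch in the spirit of the paraboloid tutorial (it uses an ExecComp instead of the tutorial's hand-written component), not the tutorial script itself:

import openmdao.api as om

prob = om.Problem()
# paraboloid expressed as an ExecComp, so no custom component class is needed
prob.model.add_subsystem('parab',
                         om.ExecComp('f = (x - 3.0)**2 + x*y + (y + 4.0)**2 - 3.0'),
                         promotes=['*'])
prob.setup()
prob['x'] = 3.0
prob['y'] = -4.0
prob.run_model()
print(prob['f'])  # expect -15.0 at x=3, y=-4

If that runs without the PETSc traceback, the OpenMDAO side of the installation is fine and the problem is isolated to the MPI/PETSc/petsc4py stack.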
I'm developing an R package for myself that interacts with both Java code (using rJava) and C++ code (using Rcpp). While trying to debug rsession crashes under RStudio using lldb, I noticed that lldb outputs the following message when I try to load the package I'm developing:
(lldb) Process 19030 stopped
* thread #1, name = 'rsession', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
frame #0: 0x00007fe6c7b872b4
-> 0x7fe6c7b872b4: movl (%rsi), %eax
0x7fe6c7b872b6: leaq 0xf8(%rbp), %rsi
0x7fe6c7b872bd: vmovdqu %ymm0, (%rsi)
0x7fe6c7b872c1: vmovdqu %ymm7, 0x20(%rsi)
(where 19030 is the pid of rsession). At this point RStudio stops, waiting for lldb to resume execution, but instead of getting the dreaded "R session aborted" popup, entering the 'c' command in lldb resumes the rsession process, RStudio continues chugging along just fine, and I can use the loaded package with no problems, i.e.:
c
Process 19030 resuming
What is going on here? Why is RStudio's rsession not crashing if lldb says it has "stopped"? Is this due to R's (or RStudio's?) SIGSEGV handling mechanism? Does that mean that the original SIGSEGV is spurious and should not be a cause for concern? And of course (but probably off topic for this question): how do I make sense of lldb's output in order to ascertain whether this SIGSEGV on loading my package should be debugged further?
The SIGSEGV does not occur in the rsession process itself, but in the JVM launched by rJava on package load. This behaviour is known and is due to the JVM's memory management, as stated here:
Java uses speculative loads. If a pointer points to addressable
memory, the load succeeds. Rarely the pointer does not point to
addressable memory, and the attempted load generates SIGSEGV ... which
java runtime intercepts, makes the memory addressable again, and
restarts the load instruction.
The proposed workaround for gdb works fine:
(gdb) handle SIGSEGV nostop noprint pass
For lldb, the equivalent should be something along the lines of: (lldb) process handle -p true -s false -n false SIGSEGV
I am writing a very small program to understand MPI (the MPICH implementation) and Fortran 90. Unfortunately, the code does not run properly when executed with "-np 2".
This is the code:
PROGRAM main
  USE MPI
  IMPLICIT none

  INTEGER :: ierr, npe, mynpe
  INTEGER :: istatus(MPI_STATUS_SIZE)
  REAL :: aa

  CALL MPI_INIT(ierr)
  CALL MPI_Comm_size(MPI_COMM_WORLD, npe, ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, mynpe, ierr)

  IF (mynpe == 0) THEN
    READ(*,*) aa
    CALL MPI_Send(aa, 1, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
  ELSE IF (mynpe == 1) THEN
    CALL MPI_Recv(aa, 1, MPI_REAL, 0, 99, MPI_COMM_WORLD, istatus, ierr)
    WRITE(*,*) "Ho ricevuto il numero ", aa
  END IF

  CALL MPI_FINALIZE(ierr)
END PROGRAM
I am compiling it with mpif90 mpi_2.f90 -o output and when I execute it with mpirun -np 2 output I get the following error:
At line 14 of file mpi_2.f90 (unit = 5, file = 'stdin')
Fortran runtime error: End of file
The shell still waits for an input and if I insert a number (e.g. 11) I get the following output:
11
Fatal error in MPI_Send: Invalid rank, error stack:
MPI_Send(173): MPI_Send(buf=0xbff4783c, count=1, MPI_REAL, dest=1, tag=99, MPI_COMM_WORLD) failed
MPI_Send(98).: Invalid rank has value 1 but must be nonnegative and less than 1
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
Thank you for all the help!
Two different MPI implementations are getting mixed in your case. The run-time MPI environment comes from a different implementation than the one used to compile the program, and therefore both processes behave as MPI singletons, i.e. each of them forms a separate MPI_COMM_WORLD communicator and becomes rank 0 in it. As a result, the first branch of the conditional executes in both processes. At the same time, mpirun redirects standard input to the first process only, while all the others get their standard input closed or connected to /dev/null, which is why the second process hits the end-of-file error. MPI_Send fails for the same underlying reason: in the singleton universe of each MPI process there is no rank 1.
The most frequent cause for such behaviour is that mpirun and mpif90 come from different MPI libraries. In your case you have MPICH mixed with Open MPI. Indeed, the following error message:
MPI_Send(173): MPI_Send(buf=0xbff4783c, count=1, MPI_REAL, dest=1, tag=99, MPI_COMM_WORLD) failed
MPI_Send(98).: Invalid rank has value 1 but must be nonnegative and less than 1
is in the error format of MPICH. Therefore mpif90 comes from MPICH.
But the next error message:
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
is in the error format used by the OpenRTE framework of Open MPI. Therefore mpirun comes from Open MPI and not from MPICH.
This could happen if you have installed a development package for MPICH, so that it provides mpicc, mpif90, and so on, but have also installed a run-time package for Open MPI. Make sure that you have packages from only one MPI implementation installed (checking the output of which mpif90 and which mpirun is a quick way to verify this). If you have compiled MPICH from source, make sure the path to its binaries comes first in $PATH.