Error forrtl (78) when running MPI over qsub without IB - mpi

First, I'm sorry that I don't know English well.
I have 1 PBS server and 6 PBS nodes, and the nodes have an InfiniBand (IB) interface.
When the IB interface is up, everything works normally, but I have a problem when IB is not used.
With the IB interface down, running the mpirun command directly on the command line works fine, but submitting the same command through qsub fails with:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
...
...
forrtl: error (78): process killed (SIGTERM)
I have to run this code without IB.
How can I fix it?
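The forrtl messages look like an Intel Fortran / Intel MPI stack, so one thing worth trying (an assumption, since the question does not name the MPI library) is to force the fabric to shared memory and TCP inside the qsub job script, rather than letting the library fall back to the downed IB device:
#!/bin/bash
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
# Keep Intel MPI on shm/TCP so it never touches the InfiniBand HCA
export I_MPI_FABRICS=shm:tcp
mpirun ./my_program
With Open MPI the rough equivalent is mpirun --mca btl self,tcp ./my_program. The node counts and program name above are placeholders.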

Related

MPI error from very simple example Fortran code

I have compiled this code:
program mpisimple
    implicit none
    integer ierr
    include 'mpif.h'
    call mpi_init(ierr)
    write(6,*) 'Hello World!'
    call mpi_finalize(ierr)
end
using the command: mpif90 -o helloworld simplempi.f90
When I run with this command:
$ mpiexec -np 1 ./helloworld
Hello World!
it works fine, as you can see. But when I run with any other number of processes (here 4), I get the errors below and basically have to Ctrl+C to kill it.
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1859)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(858)...: fail failed
MPIDI_CH3I_Seg_commit(427): PMI_KVS_Get returned 4
In: PMI_Abort(69777679, Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1859)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(858)...: fail failed
MPIDI_CH3I_Seg_commit(427): PMI_KVS_Get returned 4)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
What could be the problem? I am doing this on a Linux HPC system.
I figured out why this happened. The system I am using does not require users to submit single-core jobs through the scheduler, but does require it for multi-core jobs. Once the mpiexec command was submitted through a PBS bash script, the errors went away and output was as expected.
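For reference, a minimal PBS submission script for this case could look like the one below (the resource request and walltime are assumptions and will differ per site):
#!/bin/bash
#PBS -N helloworld
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:05:00
cd $PBS_O_WORKDIR
# Launch the 4 MPI processes through the scheduler instead of interactively
mpiexec -np 4 ./helloworld
Submit it with qsub and look for the four Hello World! lines in the job's output file.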

MPICH mpiexec (MPI) process terminating upon error, unable to debug in lldb

EDIT: I had a typo in my command to launch lldb (see comment below), and I'm updating the post to get to a different, larger issue.
I'm trying to debug my MPI application in lldb when it hits an error (e.g., a segfault or abort). Here's how I'm invoking my MPI run:
/usr/local/bin/mpiexec -np 3 -disable-auto-cleanup xterm -e "lldb -s lldb.commands -- app_binary <args> ; sleep 100"
As soon as I start running, I get the error trace below. I think the most relevant line is PMI_Get_appnum returned -1.
[cli_0]: write_line error; fd=8 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=8 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(565):
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=8 buf=:cmd=abort exitcode=1094415
:
system msg for write_line failure : Bad file descriptor
Process 19063 exited with status = 15 (0x0000000f)
Unfortunately, some mailing lists show that this is a general bug with MPICH on OSX (see https://github.com/pmodels/mpich/issues/2063 -- currently still unresolved). Does anyone have a workaround?
Since you're using lldb and you're probably also using clang, you could use something called the address sanitizer to compile your code with runtime checks for memory errors.
Just add the following to your compile command: -g -fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=address. It would look like:
mpicc object.o -o exec -g -fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=address
When using the address sanitizer, your code will print a small stack trace whenever it indexes out of bounds or touches memory it doesn't own.
If you combine the address sanitizer with lldb, it should stop execution at the line where the memory problem occurred, although I haven't had much success running lldb and MPI at the same time. Either way, the address sanitizer should help you.
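A minimal sketch of the whole workflow, assuming the application source lives in a single app.c and using the launch line from the question without lldb:
# Build every object and the final link with ASan enabled
mpicc -g -fsanitize=address -fno-omit-frame-pointer -c app.c -o app.o
mpicc app.o -o app_binary -g -fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=address
# Run as usual; ASan prints its report in the rank that performs the bad access
/usr/local/bin/mpiexec -np 3 ./app_binary <args>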

MPI: Pin each instance to certain cores on each node

I want to execute several instances of my program with OpenMPI 2.11. Each instance runs on its own node (-N 1) on my cluster. This works fine. I now want to pin each program-instance to the first 2 cores of its node. To do that, it looks like I need to use rankfiles. Here is my rankfile:
rank 0=+n0 slot=0-1
rank 1=+n1 slot=0-1
This, in my opinion, should limit each program-instance to cores 0 and 1 of the local machine it runs on.
I execute mpirun like so:
mpirun -np 2 -N 1 -rf /my/rank/file my_program
But mpirun fails with this error without even executing my program:
Conflicting directives for mapping policy are causing the policy
to be redefined:
New policy: RANK_FILE
Prior policy: UNKNOWN
Please check that only one policy is defined.
What's this? Did I make a mistake in the rankfile?
Instead of using a rankfile, simply use a hostfile:
n0 slots=n max_slots=n
n1 slots=n max_slots=n
Then tell Open MPI to map one process per node with two cores per process using:
mpiexec --hostfile hostfile --map-by ppr:1:node:PE=2 --bind-to core ...
ppr:1:node:PE=2 reads as: 1 process per resource; resource type is node; 2 processing elements per process. You can check the actual binding by adding the --report-bindings option.
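For the two-node, two-cores-per-instance setup from the question, the hostfile and command would look roughly like this (node names n0 and n1 are taken from the rankfile above; my_program stands in for the real binary):
n0 slots=2 max_slots=2
n1 slots=2 max_slots=2
mpiexec --hostfile hostfile -np 2 --map-by ppr:1:node:PE=2 --bind-to core --report-bindings my_program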

How to get R to save its state when killed?

I would like to register some code to be run when R is killed, for instance save(list=ls(), file="dump.RData"). I thought this would be done by trapping signals, e.g. SIGTERM, as referred to in this post, but there's nothing about signals from the shell in ?conditions.
?conditions does mention user interrupts; you can e.g. catch a Ctrl-C with withCallingHandlers( Sys.sleep(10), interrupt=function (e){cat("I saw that.\n")} ), but this doesn't catch SIGTERM.
How can I do this?
Indeed if you send SIGUSR1 to an R process, it will dump the workspace and stop. On Linux you can do that with
kill -USR1 Rpid
where Rpid is the process id of the R instance you want to stop. You can find it with pgrep for instance.
If R is running in a terminal, you can suspend it with Ctrl-Z and then type
kill -USR1 %
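Putting the two steps together in one line (the -n and -x flags, newest match and exact process name, are just one reasonable way to pick the right process):
kill -USR1 "$(pgrep -nx R)"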

Monit fails to execute the complete command in start program

I'm using Monit 5.4 on a Mac OS X 10.7.4 machine. When I try to run an example configuration
check process syslogd with pidfile /var/run/syslogd.pid
start program = "/etc/init.d/sysklogd start"
stop program = "/etc/init.d/sysklogd stop"
if 5 restarts within 5 cycles then timeout
from the Monit wiki page, I get the following error.
'syslogd' process is not running
'syslogd' trying to restart
'syslogd' start: /etc/init.d/sysklogd
'syslogd' failed to start
Monit does not take the complete command given in the "start program" line of the monitrc file. It just takes the first word of the command, tries to execute it, and fails. Is this a known issue? If yes, is there a workaround? If not, what am I missing here, and how do I get it working?
Thanks in advance.
Try this (from http://mmonit.com/wiki/Monit/FAQ#execution):
start program = "/bin/bash -c '/etc/init.d/blah start'"
Does /etc/init.d/sysklogd actually exist?
On 10.8 I have /etc/init.d/syslog and manually running /etc/init.d/syslog restart works fine.
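Combining both hints, a check block that should work on a machine where /etc/init.d/syslog actually exists (an assumption, as on 10.8) would look something like:
check process syslogd with pidfile /var/run/syslogd.pid
start program = "/bin/bash -c '/etc/init.d/syslog start'"
stop program = "/bin/bash -c '/etc/init.d/syslog stop'"
if 5 restarts within 5 cycles then timeout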

Resources