MPI error from very simple example Fortran code

I have compiled this code:
program mpisimple
  implicit none
  integer ierr
  include 'mpif.h'

  call mpi_init(ierr)
  write(6,*) 'Hello World!'
  call mpi_finalize(ierr)
end
using the command: mpif90 -o helloworld simplempi.f90
When I run with this command:
$ mpiexec -np 1 ./helloworld
Hello World!
it works fine, as you can see. But when I run with any other number of processes (here 4), I get the errors below and basically have to Ctrl+C to kill it.
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1859)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(858)...: fail failed
MPIDI_CH3I_Seg_commit(427): PMI_KVS_Get returned 4
In: PMI_Abort(69777679, Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1859)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(858)...: fail failed
MPIDI_CH3I_Seg_commit(427): PMI_KVS_Get returned 4)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
What could be the problem? I am doing this on a Linux HPC system.

I figured out why this happened. The system I am using does not require users to submit single-core jobs through the scheduler, but does require it for multi-core jobs. Once the mpiexec command was submitted through a PBS bash script, the errors went away and output was as expected.
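For reference, here is a minimal sketch of the kind of PBS script that resolved it; the job name, resource request, and walltime are illustrative assumptions and will differ by site:

#!/bin/bash
#PBS -N helloworld
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:05:00
# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
mpiexec -np 4 ./helloworld

Submitted with qsub, mpiexec then runs inside the scheduler's environment, which appears to be what the PMI_KVS_Get step in the error stack was missing.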

Related

forrtl error (78) when running MPI over qsub without IB

First, I'm sorry that my English is not good.
I have 1 PBS server, 6 PBS nodes, and an InfiniBand (IB) interface.
When the IB interface is up, everything works normally, but I have a problem when IB is not used.
When I bring the IB interface down and run mpirun from the command line, it works fine.
But when I submit the same command through qsub, it fails with:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
...
...
forrtl: error (78): process killed (SIGTERM)
I need to run this code without IB.
How can I fix it?
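The question does not say which MPI implementation is behind mpirun, so the following is only an illustration of the usual way to keep traffic off IB, not a confirmed fix; ./a.out stands in for the actual executable:

# Open MPI: restrict the byte transfer layers to TCP and self
mpirun --mca btl tcp,self -np 4 ./a.out
# Intel MPI (older releases): force shared-memory + TCP fabrics
export I_MPI_FABRICS=shm:tcp
mpirun -np 4 ./a.out

Since the command works interactively but not under qsub, it is also worth checking that the PBS job environment loads the same MPI module and network settings as the interactive shell.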

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". It is usual for me to get errors like "ORTE has lost communication with a remote daemon", but I usually get those at the beginning of the job. It is annoying, but still does not cause as much time loss as getting an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
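The warning at the top of the output points at the btl_openib_allow_ib MCA parameter. As an illustration only (not a confirmed fix for this particular job), that parameter, or the switch to UCX that the message recommends, can be set on the mpirun line or exported in the batch script; ./my_app stands in for the actual executable:

# allow the openib BTL to use InfiniBand ports again
mpirun --mca btl_openib_allow_ib true ./my_app
# or follow the message's intent and use the UCX PML, disabling the openib BTL
mpirun --mca pml ucx --mca btl ^openib ./my_app
# the same parameter can be set as an environment variable in the sbatch script
export OMPI_MCA_btl_openib_allow_ib=true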

MPICH mpiexec (MPI) process terminating upon error, unable to debug in lldb

EDIT: I had a typo in my command to launch lldb (see comment below), and I'm updating the post to get to a different, larger issue.
I'm trying to debug my MPI application in lldb when an error occurs (e.g., a segfault or abort). Here's how I'm invoking my MPI run:
/usr/local/bin/mpiexec -np 3 -disable-auto-cleanup xterm -e "lldb -s lldb.commands -- app_binary <args> ; sleep 100"
Immediately when I start running, I get this error trace. I think the most relevant line is "PMI_Get_appnum returned -1":
[cli_0]: write_line error; fd=8 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=8 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(565):
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=8 buf=:cmd=abort exitcode=1094415
:
system msg for write_line failure : Bad file descriptor
Process 19063 exited with status = 15 (0x0000000f)
Unfortunately, some mailing lists show that this is a general bug with MPICH on OSX (see https://github.com/pmodels/mpich/issues/2063 -- currently still unresolved). Does anyone have a workaround?
Since you're using lldb, you're probably also using clang, so you could use AddressSanitizer to compile your code with runtime checks for memory errors.
Just add the following to your compile command: -g -fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=address. It would look like:
mpicc object.o -o exec -g -fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=address
When using the address sanitizer, your code will print a small stack trace whenever it indexes out of bounds or addresses memory it doesn't own.
If you combine the address sanitizer with lldb, it should stop execution at the line where the memory problem occurred. Although I haven't had much success running lldb and MPI at the same time, the address sanitizer should help you either way.
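A usage note: when the sanitized binary runs under mpiexec, reports from several ranks interleave on stderr. One way to separate them (using the exec binary from the compile line above) is to give the sanitizer a log path, to which it appends the writing process's PID:

ASAN_OPTIONS=log_path=asan.log mpiexec -np 3 ./exec
# produces asan.log.<pid> files, one per rank that reported a problem
# with -fsanitize-recover=address, adding halt_on_error=0 lets a rank continue past the first report:
ASAN_OPTIONS=log_path=asan.log:halt_on_error=0 mpiexec -np 3 ./exec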

MPI (OpenMPI) - MPI_Publish_name cannot contact global ompi-server and throws error

I am attempting to write an MPI application consisting of programs in the server-client mould. I am stuck trying to get the server to publish its name to the ompi-server in the global scope.
Here is the server code:
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs, errmpi;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    char port_name[MPI_MAX_PORT_NAME];
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "yes");
    MPI_Open_port(info, port_name);
    // Fails here
    MPI_Publish_name("ServerName", info, port_name);
    // Rest of code...
I get the following error on running it:
$ ./mpi/bin/mpirun -np 1 --mca btl self ServerName
--------------------------------------------------------------------------
Process rank 0 attempted to publish to a global ompi_server that
could not be contacted. This is typically caused by either not
specifying the contact info for the server, or by the server not
currently executing. If you did specify the contact info for a
server, please check to see that the server is running and start
it again (or have your sys admin start it) if it isn't.
--------------------------------------------------------------------------
[xxx:18205] *** An error occurred in MPI_Publish_name
[xxx:18205] *** reported by process [1424949249,139676631433216]
[xxx:18205] *** on communicator MPI_COMM_WORLD
[xxx:18205] *** MPI_ERR_INTERN: internal error
[xxx:18205] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[xxx:18205] *** and potentially your MPI job)
I do have the ompi-server process running in debug mode on the console:
$ ./ompi-server --no-daemonize -d -r +
[xxx:14140] [[9416,0],0] orte-server: up and running!
Ultimately I will distribute the processes across various nodes, but for now I would really like to get the framework working on a single node. Could someone please help? Thanks very much indeed!
EDIT 1: Thank you very much for your quick reply. I made the following changes:
$mpi/bin/ompi-server --no-daemonize -d -r mpiuri
If I now run the program like this, it hangs at the point where it previously failed:
$./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v Server
Whereas if I run the program with the following,
$ ./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v --wait-for-server --server-wait-time 10 Server
I get the following error:
--------------------------------------------------------------------------
mpirun was instructed to wait for the requested ompi-server, but was unable to
establish contact with the server during the specified wait time:
Server uri: 799801344.0;tcp://192.168.1.113:44487
Timeout time: 10
Error received: Not supported
Please check to ensure that the requested server matches the actual server
information, and that the server is in operation.
--------------------------------------------------------------------------
I must be close... but I can't quite figure it out.
I am fairly sure it is not the firewall, since I added the rule ALLOW 192.168.1.0/24 to ufw.
Here is how to connect to the ompi-server:
1) Ensure that the ompi-server is up and running and is writing its URI to a file, with the following command:
$mpi/bin/ompi-server --no-daemonize -d -r mpiuri
2) Start all the MPI processes with this URI file, ensuring that you prefix the URI filename with "file:" when you pass the --ompi-server parameter, and pass the hostname of the node where you run mpirun, like so:
$./mpi/bin/mpirun --ompi-server file:mpiuri -host myHostName -np 1 -v Server

Riak 1.3.1 will not start on Lucid EC2 instance

I have installed Riak (via apt-get) on an EC2 instance running Lucid, amd64, with libssl.
When running riak start I get:
Attempting to restart script through sudo -H -u riak
Riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.
Running riak console:
Exec: /usr/lib/riak/erts-5.9.1/bin/erlexec -boot /usr/lib/riak/releases/1.3.1/riak
-embedded -config /etc/riak/app.config
-pa /usr/lib/riak/lib/basho-patches
-args_file /etc/riak/vm.args -- console
Root: /usr/lib/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true]
/usr/lib/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
Erlang has closed
{"Kernel pid terminated",application_controller,"{application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}}"}
Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}})
The error logs:
2013-04-24 11:36:20.897 [error] <0.146.0> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: bad return value: {error,eaddrinuse} in gen_server:init_it/6 line 332
2013-04-24 11:36:20.899 [error] <0.145.0> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error
2013-04-24 11:36:20.902 [error] <0.142.0> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error
2013-04-24 11:36:20.903 [error] <0.130.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at undefined exit with reason shutdown in context start_error
I'm new to Riak and basically tried to run through the "Fast Track" docs.
None of the default core IP settings in the configs have been changed. They are still set to {http, [ {"127.0.0.1", 8098 } ]} and {handoff_port, 8099 }.
Any help would be greatly appreciated.
I know this is old, but there is some solid documentation about the errors in the crash dump file on the Riak site.
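For the {error,eaddrinuse} in the crash report specifically, it usually means something is already listening on the handoff port (8099 in the config shown above), often a leftover beam.smp from a previous riak start. A quick check on a standard Linux install, as a sketch:

# see what already holds the handoff port
sudo netstat -tlnp | grep 8099
# or
sudo lsof -i :8099

If a stale Riak/Erlang process shows up, stop it (riak stop, or kill the leftover process) and start Riak again.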
