MPI + InfiniBand: too many connections

I am running an MPI application on a cluster, using 4 nodes each with 64 cores.
The application performs an all-to-all communication pattern.
Launching the application as follows runs fine:
$: mpirun -npernode 36 ./Application
Adding one more process per node makes the application crash:
$: mpirun -npernode 37 ./Application
--------------------------------------------------------------------------
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.
For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: laser045
Local device: qib0
Queue pair type: Reliable connected (RC)
--------------------------------------------------------------------------
[laser045:15359] *** An error occurred in MPI_Issend
[laser045:15359] *** on communicator MPI_COMM_WORLD
[laser045:15359] *** MPI_ERR_OTHER: known error not in list
[laser045:15359] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
[laser040:49950] [[53382,0],0]->[[53382,1],30] mca_oob_tcp_msg_send_handler: writev failed: Connection reset by peer (104) [sd = 163]
[laser040:49950] [[53382,0],0]->[[53382,1],21] mca_oob_tcp_msg_send_handler: writev failed: Connection reset by peer (104) [sd = 154]
--------------------------------------------------------------------------
mpirun has exited due to process rank 128 with PID 15358 on
node laser045 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[laser040:49950] 4 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
[laser040:49950] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[laser040:49950] 4 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
EDIT: added some source code of the all-to-all communication pattern:
// Send data to all other ranks
for(unsigned i = 0; i < (unsigned)size; ++i){
    if((unsigned)rank == i){
        continue;
    }
    MPI_Request request;
    MPI_Issend(&data, dataSize, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &request);
    requests.push_back(request);
}
// Recv data from all other ranks
for(unsigned i = 0; i < (unsigned)size; ++i){
    if((unsigned)rank == i){
        continue;
    }
    MPI_Status status;
    MPI_Recv(&recvData, recvDataSize, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
}
// Finish communication operations
for(MPI_Request &r: requests){
    MPI_Status status;
    MPI_Wait(&r, &status);
}
Is there something I can do as a cluster user, or some advice I could give the cluster admin?

The mca_oob_tcp_msg_send_handler error line may indicate that the node corresponding to a receiving rank died (ran out of memory or received a SIGSEGV):
http://www.open-mpi.org/faq/?category=tcp#tcp-connection-errors
The OOB (out-of-band) framework in Open MPI is used for control messages, not for your application's messages. Indeed, application messages typically go through byte transfer layers (BTLs) such as self, sm, vader, openib (InfiniBand), and so on.
The output of 'ompi_info -a' is useful in that regard.
Finally, the question does not specify whether the InfiniBand hardware vendor is Mellanox, so the XRC option may not work (for instance, Intel/QLogic InfiniBand does not support this option).
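The ompi_info output mentioned above is also the easiest way to check which openib BTL parameters your installation actually exposes. For example, something like the following should show the receive-queue settings discussed in the other answer (exact parameter names and output levels may vary between Open MPI versions):
$ ompi_info --all | grep btl_openib_receive_queues
$ ompi_info --param btl openib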

The error is related to the buffer sizes of the MPI receive queues, discussed here:
http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc
The following environment setting solved my problem:
$ export OMPI_MCA_btl_openib_receive_queues="P,128,256,192,128:S,65536,256,192,128"
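The same receive-queue specification can also be passed on the mpirun command line instead of through the environment; for example (only a sketch, keep the queue values tuned to your own hardware):
$ mpirun --mca btl_openib_receive_queues "P,128,256,192,128:S,65536,256,192,128" -npernode 37 ./Application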

Related

Openmpi 4.0.5 fails to distribute tasks to more than 1 node

We are having trouble with openmpi 4.0.5 on our cluster: It works as long as only 1 node is requested, but as soon as more than 1 is requested (e.g. mpirun -np 24 ./hello_world with --ntasks-per-node=12) it crashes and we get the following error message:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:
./hello_world
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
I have tried using --oversubscribe, but this will still only use 1 node, even though smaller jobs would run that way. I have also tried specifically requesting nodes (e.g. -host node36,node37), but this results in the following error message:
[node37:16739] *** Process received signal ***
[node37:16739] Signal: Segmentation fault (11)
[node37:16739] Signal code: Address not mapped (1)
[node37:16739] Failing at address: (nil)
[node37:16739] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x2ac57d70e5f0]
[node37:16739] [ 1] /lib64/libc.so.6(+0x13ed5a)[0x2ac57da59d5a]
[node37:16739] [ 2] /usr/lib64/openmpi/lib/libopen-rte.so.12(orte_daemon+0x10d7)[0x2ac57c6c4827]
[node37:16739] [ 3] orted[0x4007a7]
[node37:16739] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac57d93d505]
[node37:16739] [ 5] orted[0x400810]
[node37:16739] *** End of error message ***
The cluster has 59 nodes. Slurm 19.05.0 is used as a scheduler and gcc 9.1.0 to compile.
I don't have much experience with MPI - any help would be much appreciated! Maybe someone is familiar with this error and could point me towards what the problem might be.
Thanks for your help,
Johanna
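For reference, the "slots=N" clauses mentioned in the error message refer to an Open MPI hostfile of roughly this form (a sketch with placeholder file and host names; under Slurm the allocation itself normally provides the slots, so this is mostly relevant when running outside the scheduler):
# hypothetical hostfile "myhosts"
node36 slots=12
node37 slots=12
$ mpirun -np 24 --hostfile myhosts ./hello_world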

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". It is usual for me to get errors like "ORTE has lost communication with a remote daemon", but I usually get those at the beginning of the job. They are annoying but still do not cause as much time loss as getting an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
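The first warning above points at the btl_openib_allow_ib MCA parameter and at UCX. For illustration, these are the two usual ways of following that advice; which one is appropriate (if either) depends on how Open MPI was built on the cluster:
$ mpirun --mca btl_openib_allow_ib true ...
$ mpirun --mca pml ucx ...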

Error in MPI program execution - no active ports found

I am trying to run a simple MPI job across multiple hosts of a cluster.
[capc#gpu6 mpi_tests]$ /opt/openmpi4.0.3/build/bin/mpirun --host gpu7,gpu6 ./a.out
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: gpu7
We have 2 processes.
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: gpu6
PID: 29209
[gpu6:29203] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[gpu6:29203] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I have compiled the MPI program with mpicc, and when I run it with mpirun it hangs.
Can anyone guide me regarding this?
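One way to narrow this down (a sketch, not a definitive fix) is to take the OpenFabrics stack out of the picture and force the run over TCP; if that works, the problem is on the InfiniBand/port side rather than in the MPI program itself:
$ /opt/openmpi4.0.3/build/bin/mpirun --mca pml ob1 --mca btl tcp,self --host gpu7,gpu6 ./a.out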

POSIX message queues permission denied issue

I have a server.c program that is initialising a message queue with the following permissions:
#define SERVER "/serverqueue"
...
struct mq_attr attr;
attr.mq_flags = 0;
attr.mq_maxmsg = MAX_MSGS;
attr.mq_msgsize = MAX_MSG_SIZE;
attr.mq_curmsgs = 0;
server = mq_open(SERVER, O_RDWR | O_CREAT, 666, &attr)
...
In the first run, the mq_open() is successful and the program exits with no error. On subsequent executions, I get Permission denied errors at mq_open(). Why is this happening?
In case it's relevant, I am not explicitly closing/unlinking the message queue descriptors, as the OS does that automatically when the program exits, if I am not wrong.
Message queues persist after process exit. The reason the second creation attempt fails is that you specify the mode as 666, which is interpreted as a decimal number rather than the octal 0666 you probably intended, and that results in rather strange permissions:
$ ls -l /dev/mqueue/serverqueue
--w--wx--T. 1 fw fw 80 Feb 17 13:13 serverqueue
There are no read permissions, so opening with O_RDWR fails.
Furthermore, since the queue names are a shared resource, it usually results in a security vulnerability if you create queues with O_CREAT instead of O_CREAT | O_EXCL. Another user could have created the same queue, with different permissions, and thus gain access to what you are trying to do with the queue.
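A minimal corrected sketch of the mq_open() call from the question, assuming the surrounding server.c code (the helper function name here is just for illustration); the key changes are an octal mode and O_EXCL, plus unlinking the queue once it is no longer needed:
#include <fcntl.h>      /* O_RDWR, O_CREAT, O_EXCL */
#include <mqueue.h>     /* mq_open, mq_close, mq_unlink; link with -lrt on Linux */
#include <stdio.h>

#define SERVER "/serverqueue"

/* attr is assumed to be filled in exactly as in the question */
mqd_t open_server_queue(struct mq_attr *attr)
{
    /* 0600 is octal (owner read/write); the decimal 666 in the question
       produced the odd --w--wx--T permissions shown above */
    mqd_t q = mq_open(SERVER, O_RDWR | O_CREAT | O_EXCL, 0600, attr);
    if (q == (mqd_t)-1)
        perror("mq_open");
    return q;
}

/* when the queue is no longer needed:
 *     mq_close(q);
 *     mq_unlink(SERVER);
 */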

MPI (OpenMPI) - MPI_Publish_name cannot contact global ompi-server and throws error

I am attempting to write an MPI application that would consist of programs in the server-client mould. I am stuck trying to get the server to publish its name to the ompi-server in the global scope.
Here is the server code:
int main(int argc, char** argv) {
    int myrank, nprocs, errmpi;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    char port_name[MPI_MAX_PORT_NAME];
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "yes");
    MPI_Open_port(info, port_name);
    // Fails here
    MPI_Publish_name("ServerName", info, port_name);
    // Rest of code...
I get the following error on running it:
$ ./mpi/bin/mpirun -np 1 --mca btl self ServerName
--------------------------------------------------------------------------
Process rank 0 attempted to publish to a global ompi_server that
could not be contacted. This is typically caused by either not
specifying the contact info for the server, or by the server not
currently executing. If you did specify the contact info for a
server, please check to see that the server is running and start
it again (or have your sys admin start it) if it isn't.
--------------------------------------------------------------------------
[xxx:18205] *** An error occurred in MPI_Publish_name
[xxx:18205] *** reported by process [1424949249,139676631433216]
[xxx:18205] *** on communicator MPI_COMM_WORLD
[xxx:18205] *** MPI_ERR_INTERN: internal error
[xxx:18205] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[xxx:18205] *** and potentially your MPI job)
I do have the ompi-server process running in debug mode on console
$ ./ompi-server --no-daemonize -d -r +
[xxx:14140] [[9416,0],0] orte-server: up and running!
Ultimately I will distribute the processes across various nodes, but for now I would really like to get the framework working on a single node. Could someone please help? Thanks very much indeed!
EDIT 1: Thank you very much for your quick reply. I made the following changes
$mpi/bin/ompi-server --no-daemonize -d -r mpiuri
If I now run the program like this, it hangs at the point where it previously failed:
$./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v Server
Whereas if I run the program with the following command,
$ ./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v --wait-for-server --server-wait-time 10 Server
it fails with the following error:
--------------------------------------------------------------------------
mpirun was instructed to wait for the requested ompi-server, but was unable to
establish contact with the server during the specified wait time:
Server uri: 799801344.0;tcp://192.168.1.113:44487
Timeout time: 10
Error received: Not supported
Please check to ensure that the requested server matches the actual server
information, and that the server is in operation.
--------------------------------------------------------------------------
I must be close... but I can't quite figure it out.
I am fairly sure it is not the firewall, since I added the rule ALLOW 192.168.1.0/24 to ufw.
Here is how to connect with the ompi-server:
1) Ensure that ompi-server is up and running and is writing its URI to a file, with the following command:
$mpi/bin/ompi-server --no-daemonize -d -r mpiuri
2) Start all the MPI processes with this URI file, ensuring that you prefix the URI filename with "file:" when you pass the --ompi-server parameter, and pass the hostname of the node where you run mpirun ... like so:
$./mpi/bin/mpirun --ompi-server file:mpiuri -host myHostName -np 1 -v Server
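For completeness, a rough sketch of what the matching client side usually looks like once the server has published its port under "ServerName" (this assumes the server eventually calls MPI_Comm_accept on the opened port, and that the client is launched with the same --ompi-server file:mpiuri option; depending on the setup, an MPI_Info with scope keys may be needed for the lookup instead of MPI_INFO_NULL):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm server_comm;

    MPI_Init(&argc, &argv);

    /* Look up the port that the server published under "ServerName" */
    MPI_Lookup_name("ServerName", MPI_INFO_NULL, port_name);

    /* Connect to the server; every rank in MPI_COMM_WORLD participates */
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server_comm);

    printf("connected to server via port %s\n", port_name);

    MPI_Comm_disconnect(&server_comm);
    MPI_Finalize();
    return 0;
}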

Resources