MPI_Barrier doesn't work properly in Ubuntu - mpi

I'm a beginner in using MPI. Here I wrote a very simple program to test if MPI can run. Here is my hello.c:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
int numprocs, rank, namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(processor_name, &namelen);
MPI_Barrier(MPI_COMM_WORLD);
printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
MPI_Finalize();
}
I use to node to test, the hostfile is: node1 node2
So I have two machines with name node1 and node2. I can ssh to each other without password.
I launch the program by typing: mpirun -np 2 -f hostfile ./hello.
The executable hello is in the same directory in both machine.
Then after I run, I get an error:
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(425).........: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(331)....: Failure during collective
MPIR_Barrier_impl(313)....: MPIR_Barrier_intra(83)....:
dequeue_and_set_error(596): Communication error with rank 0 Fatal
error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(425).........: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(331)....: Failure during collective
MPIR_Barrier_impl(313)....: MPIR_Barrier_intra(83)....:
dequeue_and_set_error(596): Communication error with rank 1
If I comment out the MPI_Barrier(), it can work properly. It seems the communication between machines has problem? Or I didn't install openmpi correctly? Any ideas?
I'm using Ubuntu 12.10
I got some hints: This doesn't work well in MPICH2, if I use openmpi, then it works. I installed MPICH just by sudo apt-get install mpich2. Do I miss something? The size of mpich2 is much smaller than openmpi

In /etc/hosts, newer versions of some Linux distros add the following types of lines at the top of the file:
127.0.0.1 localhost
127.0.0.1 [hostname]
This should be changed so that the hostname line contains your actual IP address. The MPI hydra process will abort if you do not make this change with errors like:
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(425)...........: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(292)......:
MPIR_Barrier_or_coll_fn(121):
MPIR_Barrier_intra(83)......:
dequeue_and_set_error(596)..: Communication error with rank 0

Related

MPI error from very simple example fortran code

I have compiled this code:
program mpisimple
implicit none
integer ierr
include 'mpif.h'
call mpi_init(ierr)
write(6,*) 'Hello World!'
call mpi_finalize(ierr)
end
using the command: mpif90 -o helloworld simplempi.f90
When I run with this command:
$ mpiexec -np 1 ./helloworld
Hello World!
it works fine as you can see. But when I run with any other number of processors (here 4) I get the errors and I basically have to ctrl+C to kill it.
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1859)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(858)...: fail failed
MPIDI_CH3I_Seg_commit(427): PMI_KVS_Get returned 4
In: PMI_Abort(69777679, Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1859)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(858)...: fail failed
MPIDI_CH3I_Seg_commit(427): PMI_KVS_Get returned 4)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
What could be the problem? I am doing this on a Linux hpc system.
I figured out why this happened. The system I am using does not require users to submit single-core jobs through the scheduler, but does require it for multi-core jobs. Once the mpiexec command was submitted through a PBS bash script, the errors went away and output was as expected.

I get "There are not enough slots available in the system" when I run mpi

I am a high school student. An error occurred while studying and coding the basic theory of mpi. I searched on the internet and tried everything, but I couldn't understand it well.
The code is really simple. There is no problem with the code and I understood it well.
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
int num_procs, my_rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
printf("Hello world! I'm rank %d among %d processes.\n", my_rank, num_procs);
MPI_Finalize();
return 0;
}
But there was a problem with running mpi. It works well when i type it like this.
mpirun -np 2 ./hello
Hello world! I'm rank 1 among 2 processes.
Hello world! I'm rank 0 among 2 processes.
This error occurs at -np 3.
mpirun -np 3 ./hello
`There are not enough slots available in the system to satisfy the 3
slots that were requested by the application:
./hello
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
My laptop is Intel i5 and cpu core is 2 and 4 threads. Did such a problem happen because there were only 2 cores? I don't exactly understand this part.
There is not much data about mpi in Korea, so I always googling and studying. If that's the cause, is there any way to increase the number of processes? Other people wrote that there was an error in -np 17, how did they increase the process to double digits? Is the computer capable? Please explain it easily so that I can understand it well.
My laptop is Intel i5 and cpu core is 2 and 4 threads. Did such a problem happen because there were only 2 cores?
Yes. By default Open MPI uses the number of cores as slots. So since you only have 2 cores, you can only launch maximum of 2 processes.
If that's the cause, is there any way to increase the number of processes?
Yes, If you use --use-hwthread-cpus with your mpirun command you can use upto 4 mpi processes in your laptop since you have 4 threads in your laptop. Try running the command, mpirun -np 4 --use-hwthread-cpus a.out
Also, you can use --oversubscribe option to increase the number of processes greater than the available cores/threads. For example try this mpirun -np 10 --oversubscribe a.out

Run MPI program without a X11 server?

I'm testing simple MPI programs locally on my Ubuntu Focal server (Open MPI 4.0.3). However whatever I run with mpirun it produces an annoying message No protocol specified. The problem appears to be related to the fact that mpirun is trying to connect to the X server. How can I disable this behavior so I can use mpirun without a X server ready? I primarily work over SSH (text-only, with tmux).
An example of what I'm doing:
ubuntu#iBug-Server:~$ cat test.c
#include <mpi.h>
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
// This is a stub program
MPI_Finalize();
return 0;
}
ubuntu#iBug-Server:~$ mpicc test.c
ubuntu#iBug-Server:~$ mpirun -np 2 a.out
No protocol specified
ubuntu#iBug-Server:~$
Update 1: It appears to be related to LightDM and Xorg. The unwanted message goes away after systemctl stop lightdm. Alternatively, running Open MPI in a graphical terminal (connected via VNC or RDP (xrdp), both work) also eliminates the message, as strace shows that the connection to the X server is successful.

MPICH mpiexec (MPI) process terminating upon error, unable to debug in lldb

EDIT I had a typo in my command to launch lldb (see comment below) and I'm updating the post to get to a different larger issue
I'm trying to debug my MPI application in lldb and upon an error (e.g., segv or abort). Here's how I'm invoking my mpi run:
/usr/local/bin/mpiexec -np 3 -disable-auto-cleanup xterm -e "lldb -s lldb.commands -- app_binary <args> ; sleep 100
Immediately when I start running, I get this error trace. I think the most relevant line is PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=8 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=8 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(565):
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=8 buf=:cmd=abort exitcode=1094415
:
system msg for write_line failure : Bad file descriptor
Process 19063 exited with status = 15 (0x0000000f)
Unfortunately, some mailing lists show that this is a general bug with MPICH on OSX (see https://github.com/pmodels/mpich/issues/2063 -- currently still unresolved). Does anyone have a workaround?
Since you're using lldb and you're probably also using clang, you could use something called the address sanitizer to compile your code with runtime checks for memory errors.
Just add the following to your compile command: -g -fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=address. It would look like
mpicc object.o -o exec -g -fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=address
When using the address sanitizer your code will print a small stack trace to when you made a move to index out of bounds or address memory you don't own.
If you combine the address sanitizer with lldb then it should stop the execution at the line where a memory problem occurred. Although, I haven't had much success with running lldb and MPI at the same time. Either way the address sanitizer should help you.

MPI (OpenMPI) - MPI_Publish_name cannot contact global ompi-server and throws error

I am attempting to write an MPI application that would consist of programs in the server client mould. I am stuck trying to get the server to publish its name to the ompi-server in the global scope
Here is the server code:
int main(int argc, char** argv) {
int myrank, nprocs, errmpi;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
char port_name[MPI_MAX_PORT_NAME];
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "ompi_global_scope", "yes");
MPI_Open_port(info, port_name);
//Fails here
MPI_Publish_name("ServerName", info, port_name);
// Rest of code...
I get the following error on running it:
$ ./mpi/bin/mpirun -np 1 --mca btl self ServerName
--------------------------------------------------------------------------
Process rank 0 attempted to publish to a global ompi_server that
could not be contacted. This is typically caused by either not
specifying the contact info for the server, or by the server not
currently executing. If you did specify the contact info for a
server, please check to see that the server is running and start
it again (or have your sys admin start it) if it isn't.
--------------------------------------------------------------------------
[xxx:18205] *** An error occurred in MPI_Publish_name
[xxx:18205] *** reported by process [1424949249,139676631433216]
[xxx:18205] *** on communicator MPI_COMM_WORLD
[xxx:18205] *** MPI_ERR_INTERN: internal error
[xxx:18205] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[xxx:18205] *** and potentially your MPI job)
I do have the ompi-server process running in debug mode on console
$ ./ompi-server --no-daemonize -d -r +
[xxx:14140] [[9416,0],0] orte-server: up and running!
Ultimately I will distribute the processes across various nodes, but for now I would really like to get the framework working on a single node. Could someone please help? Thanks very much indeed!
EDIT 1: Thank you very much for your quick reply. I made the following changes
$mpi/bin/ompi-server --no-daemonize -d -r mpiuri
If I now run the program so, I find the program hangs at the point where it previously fails
$./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v Server
While if I run the program with the following,
$ ./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v --wait-for-server --server-wait-time 10 Server
With the following error
--------------------------------------------------------------------------
mpirun was instructed to wait for the requested ompi-server, but was unable to
establish contact with the server during the specified wait time:
Server uri: 799801344.0;tcp://192.168.1.113:44487
Timeout time: 10
Error received: Not supported
Please check to ensure that the requested server matches the actual server
information, and that the server is in operation.
--------------------------------------------------------------------------
I must be close... but I cant quite figure it
I am fairly sure it is not the firewall, since I added the rule ALLOW 192.168.1.0/24 to ufw
Here is how to connect with the ompi-server
1) Ensure that ompi server is up and running, and is writing its uri to a file with the following command
$mpi/bin/ompi-server --no-daemonize -d -r mpiuri
2) Start all the mpi processes with this uri file, ensuring that you
prefix the uri filename with "file:" when you enter the
--ompi-server parameter
enter the hostname of the the node where you are run mpirun ... like so
$./mpi/bin/mpirun --ompi-server file:mpiuri -host myHostName -np 1 -v Server

Resources