MPI_Barrier() does not work on a small cluster

MPI_Barrier() does not work on a small cluster - mpi

I want to use MPI_Barrier() in my programme, but there are some fatal errors.
This is my code:
1 #include <stdio.h>
2 #include "mpi.h"
3
4 int main(int argc, char* argv[]){
5 int rank, size;
6
7 MPI_Init(&argc, &argv);
8 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
9 MPI_Comm_size(MPI_COMM_WORLD, &size);
10 printf("Hello, world, I am %d of %d. \n", rank, size);
11 MPI_Barrier(MPI_COMM_WORLD);
12 MPI_Finalize();
13
14 return 0;
15 }
And this is the output:
Hello, world, I am 0 of 2.
Hello, world, I am 1 of 2.
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(425).........: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(331)....: Failure during collective
MPIR_Barrier_impl(313)....:
MPIR_Barrier_intra(83)....:
dequeue_and_set_error(596): Communication error with rank 0
Any suggestions?
Thanks and Regards!

This generally reflects some sort of configuration error -- either host or username configurations are not consistent across nodes, or there is some sort of firewall blocking some ports. The MPICH2 FAQ discusses some places to look.

Related

Process placement with aprun -- need one process per node

I need to run an MPI code on a Cray system under aprun. For reasons I won't go into, I am being asked to run it such that no node has more than one process. I have been puzzling over the aprun man page and I'm not sure if I've figured this out or not. If I have only two processes, will this command ensure that they run on different nodes? (Let's say there are 32 cores on a node.)
> aprun -n 2 -d 32 --cc depth ./myexec

If anyone is interested, my above command line does work. I tested it with the code:
#include <mpi.h>
#include <stdio.h>
 
int main(int argc, char **argv) {
char hname[80];
int length;
 
MPI_Init(&argc, &argv);
 
MPI_Get_processor_name(hname, &length);
printf("Hello world from %s\n", hname);
MPI_Finalize();
}

Can I link two separate executables with MPI_open_port and share port information in a text file?

I'm trying to create a shared MPI COMM between two executables which are both started independently, e.g.
mpiexec -n 1 ./exe1
mpiexec -n 1 ./exe2
I use MPI_Open_port to generate port details and write these to a file in exe1 and then read with exe2. This is followed by MPI_Comm_connect/MPI_Comm_accept and then send/recv communication (minimal example below).
My question is: can we write port information to file in this way, or is the MPI_Publish_name/MPI_Lookup_name required for MPI to work as in this, this and this? As supercomputers usually share a file system, this file based approach seems simpler and maybe avoids establishing a server.
It seems this should work according to the MPI_Open_Port documentation in the MPI 3.1 standard,
port_name is essentially a network address. It is unique within the communication universe to which it belongs (determined by the implementation), and may be used by any client within that communication universe. For instance, if it is an internet (host:port) address, it will be unique on the internet. If it is a low level switch address on an IBM SP, it will be unique to that SP
In addition, according to documentation on the MPI forum:
The following should be compatible with MPI: The server prints out an address to the terminal, the user gives this address to the client program.
MPI does not require a nameserver
A port_name is a system-supplied string that encodes a low-level network address at which a server can be contacted.
By itself, the port_name mechanism is completely portable ...
Writing the port information to file does work as expected, i.e creates a shared communicator and exchanges information using MPICH (3.2) but hangs at the MPI_Comm_connect/MPI_Comm_accept line when using OpenMPI versions 2.0.1 and 4.0.1 (on my local workstation running Ubuntu 12.04 but eventually needs to work on a tier 1 supercomputer). I have raised as an issue here but welcome a solution or workaround in the meantime.
Further Information
If I use the MPMD mode with OpenMPI,
mpiexec -n 1 ./exe1 : -n 1 ./exe2
this works correctly, so must be an issue with allowing the jobs to share ompi_global_scope as in this question. I've also tried adding,
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "ompi_global_scope", "true");
with info passed to all commands, with no success. I'm not running a server/client model as both codes run simultaneously so sharing a URL/PID from one is not ideal, although I cannot get this to work even using the suggested approach, which for OpenMPI 2.0.1,
mpirun -n 1 --report-pid + ./OpenMPI_2.0.1 0
1234
mpirun -n 1 --ompi-server pid:1234 ./OpenMPI_2.0.1 1
gives,
ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
pmix server init failed
--> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
and with OpenMPI 4.0.1,
mpirun -n 1 --report-pid + ./OpenMPI_4.0.1 0
1234
mpirun -n 1 --ompi-server pid:1234 ./OpenMPI_4.0.1 1
gives,
ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 50
...
A publish/lookup server was provided, but we were unable to connect
to it - please check the connection info and ensure the server
is alive:
Using 4.0.1 means the error should not be related to this bug in OpenMPI.
Minimal code
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <iostream>
#include <fstream>
using namespace std;
int main( int argc, char *argv[] )
{
int num_errors = 0;
int rank, size;
char port1[MPI_MAX_PORT_NAME];
char port2[MPI_MAX_PORT_NAME];
MPI_Status status;
MPI_Comm comm1, comm2;
int data = 0;
char *ptr;
int runno = strtol(argv[1], &ptr, 10);
for (int i = 0; i < argc; ++i)
printf("inputs %d %d %s \n", i,runno, argv[i]);
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (runno == 0)
{
printf("0: opening ports.\n");fflush(stdout);
MPI_Open_port(MPI_INFO_NULL, port1);
printf("opened port1: <%s>\n", port1);
//Write port file
ofstream myfile;
myfile.open("port");
if( !myfile )
cout << "Opening file failed" << endl;
myfile << port1 << endl;
if( !myfile )
cout << "Write failed" << endl;
myfile.close();
printf("Port %s written to file \n", port1); fflush(stdout);
printf("Attempt to accept port1.\n");fflush(stdout);
//Establish connection and send data
MPI_Comm_accept(port1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &comm1);
printf("sending 5 \n");fflush(stdout);
data = 5;
MPI_Send(&data, 1, MPI_INT, 0, 0, comm1);
MPI_Close_port(port1);
}
else if (runno == 1)
{
//Read port file
size_t chars_read = 0;
ifstream myfile;
//Wait until file exists and is avaialble
myfile.open("port");
while(!myfile){
myfile.open("port");
cout << "Opening file failed" << myfile << endl;
usleep(30000);
}
while( myfile && chars_read < 255 ) {
myfile >> port1[ chars_read ];
if( myfile )
++chars_read;
if( port1[ chars_read - 1 ] == '\n' )
break;
}
printf("Reading port %s from file \n", port1); fflush(stdout);
remove( "port" );
//Establish connection and recieve data
MPI_Comm_connect(port1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &comm1);
MPI_Recv(&data, 1, MPI_INT, 0, 0, comm1, &status);
printf("Received %d 1\n", data); fflush(stdout);
}
//Barrier on intercomm before disconnecting
MPI_Barrier(comm1);
MPI_Comm_disconnect(&comm1);
MPI_Finalize();
return 0;
}
The 0 and 1 simply specify if this code writes a port file or reads it in the example above. This is then run with,
mpiexec -n 1 ./a.out 0
mpiexec -n 1 ./a.out 1

Why does reading from STDOUT work?

I encountered a curious case, where reading from STDOUT works, when running a program in terminal. The question is, why and how? Let's start with code:
#include <QCoreApplication>
#include <QSocketNotifier>
#include <QDebug>
#include <QByteArray>
#include <unistd.h>
#include <errno.h>
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
const int fd_arg = (a.arguments().length()>=2) ? a.arguments().at(1).toInt() : 0;
qDebug() << "reading from fd" << fd_arg;
QSocketNotifier n(fd_arg, QSocketNotifier::Read);
QObject::connect(&n, &QSocketNotifier::activated, [](int fd) {
char buf[1024];
auto len = ::read(fd, buf, sizeof buf);
if (len < 0) { qDebug() << "fd" << fd << "read error:" << errno; qApp->quit(); }
else if (len == 0) { qDebug() << "fd" << fd << "done"; qApp->quit(); }
else {
QByteArray data(buf, len);
qDebug() << "fd" << fd << "data" << data.length() << data.trimmed();
}
});
return a.exec();
}
Here's qmake .pro file for convenience, if someone wants to test above code:
QT += core
QT -= gui
CONFIG += c++11
TARGET = stdoutreadtest
CONFIG += console
CONFIG -= app_bundle
TEMPLATE = app
SOURCES += main.cpp
And here's output of 4 executions:
$ ./stdoutreadtest 0 # input from keyboard, ^D ends, works as expected
reading from fd 0
typtyptyp
fd 0 data 10 "typtyptyp"
fd 0 done
$ echo pipe | ./stdoutreadtest 0 # input from pipe, works as expected
reading from fd 0
fd 0 data 5 "pipe"
fd 0 done
$ ./stdoutreadtest 1 # input from keyboard, ^D ends, works!?
reading from fd 1
typtyp
fd 1 data 7 "typtyp"
fd 1 done
$ echo pipe | ./stdoutreadtest 1 # input from pipe, still reads keyboard!?
reading from fd 1
typtyp
fd 1 data 7 "typtyp"
fd 1 done
So, the question is, what is going on, why do the last two runs above actually read what is typed on terminal?
I also tried looking at QSocketNotifier sources here leading to here, but didn't really gain any insight.

There is no different between fd 0,1,2, all of the three fds is pointing to terminal if not redirected, they are strictly IDENTICAL!
Programs usually use 0 for input, 1 for output, 2 for error, but all of them can be different ways.
Eg, for less, ordinary usage:
prog | less
Now input of less is redirected to the prog, and less can not read any user input from stdin, so how does less get user input like scroll up/down or page up/down ?
Sure less can read user input from stdout, which is exactly what less does.
So you can use these fds wisely when you know how bash handle these fds.

a simple MPI can't run after compile

i write a simple MPI program:
#include <stdio.h>
2 #include "mpi.h"
3
4 int main(int argc,char* argv[])
5 {
6 int rank;
7 int size;
8
9 MPI_Init(0,0);
10 MPI_Comm_rank(MPI_COMM_WORLD,&rank);
11 MPI_Comm_size(MPI_COMM_WORLD,&size);
12 printf("Hello World from process %d of %d\n",rank,size);
13 MPI_Finalize();
14 return 0;
15 }
the program compile successflly,but can't run
i use "mpirun -np 4 ./hello" or "mpirun -np 4 hello"
it shows like this:
_create_ep, create command failed: Operation not permitted
GLEX_ERR(ln0): _init_glex(608), _create_ep: system error
_create_ep, create command failed: Operation not permitted
GLEX_ERR(ln0): _init_glex(608), _create_ep: system error
_create_ep, create command failed: Operation not permitted
GLEX_ERR(ln0): _init_glex(608), _create_ep: system error
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(498)........:
MPID_Init(187)...............: channel initialization failed
MPIDI_CH3_Init(89)...........:
MPID_nem_init(320)...........:
MPID_nem_glex_init(74).......:
MPIDI_nem_glex_init_glex(610): Cannot create GLEX endpoint.
besides,i wirite this program on HPC.And I guess the problem "Cannot create GLEX endpoint" maybe related to the HPC(HPC has already deployed MPI).

I'm not too sure of the level of support for MPI_Init() when null pointers are passed as arguments (I think there is something like calling it without arguments supported since MPI 3.0, but I wouldn't commit on that).
However, I would definitely replace MPI_Init(0,0) in your code by MPI_Init( &argc, &argv ) for a starter.
EDIT: my bad, MPI_Init() is supposed to support null pointer as argument as stated here.
However, trying with MPI_Init( &argc, &argv ) would still be my first try for fixing the issue.

MPI Number of processors?

Following is my code in MPI, which I run it over a core i7 CPU (quad core), but the problem is it shows me that it's running under 1 processor CPU, which has to be 4.
int main(int argc, char *argv[])
{
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("Hello world! I am %d of %d\n", rank, size);
MPI_Finalize();
return 0;
}
I was wondering if the problem is with MPI library or sth else?
Here is the result that it shows me:
Hello world! I am 0 of 1
Additional info:
Windows 7 - Professional x64

Prima facie it looks like you are running the program directly. Did you try using mpiexec -n 2 or -n 4?

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

MPI_Barrier() does not work on a small cluster - mpi

This generally reflects some sort of configuration error -- either host or username configurations are not consistent across nodes, or there is some sort of firewall blocking some ports. The MPICH2 FAQ discusses some places to look.

Related

Process placement with aprun -- need one process per node

Can I link two separate executables with MPI_open_port and share port information in a text file?

Why does reading from STDOUT work?

a simple MPI can't run after compile

MPI Number of processors?

Categories

Resources