A simple MPI program can't run after compiling (runtime error)

I wrote a simple MPI program:
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank;
    int size;

    MPI_Init(0, 0);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello World from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
The program compiles successfully, but it can't run.
I use "mpirun -np 4 ./hello" or "mpirun -np 4 hello".
It shows this:
_create_ep, create command failed: Operation not permitted
GLEX_ERR(ln0): _init_glex(608), _create_ep: system error
_create_ep, create command failed: Operation not permitted
GLEX_ERR(ln0): _init_glex(608), _create_ep: system error
_create_ep, create command failed: Operation not permitted
GLEX_ERR(ln0): _init_glex(608), _create_ep: system error
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(498)........:
MPID_Init(187)...............: channel initialization failed
MPIDI_CH3_Init(89)...........:
MPID_nem_init(320)...........:
MPID_nem_glex_init(74).......:
MPIDI_nem_glex_init_glex(610): Cannot create GLEX endpoint.
Besides, I wrote this program on an HPC cluster, and I guess the problem "Cannot create GLEX endpoint" may be related to the HPC environment (the cluster already has MPI deployed).

I'm not too sure of the level of support for MPI_Init() when null pointers are passed as arguments (I think calling it without arguments has been supported since something like MPI 3.0, but I wouldn't commit to that).
However, I would definitely replace MPI_Init(0, 0) in your code with MPI_Init(&argc, &argv) as a starting point.
EDIT: my bad, MPI_Init() is supposed to support null pointers as arguments, as stated here.
However, trying MPI_Init(&argc, &argv) would still be my first attempt at fixing the issue.
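For reference, a minimal sketch of that suggested change applied to the program above (only the MPI_Init call differs):
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank;
    int size;

    /* Forward the real command-line arguments instead of passing null pointers */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello World from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}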

Related

OpenMPI MPI_Comm_spawn() giving unreachable errors

I've got a master and worker system, where the master uses OpenMPI to spawn and communicate with its workers. I've used versions 4.0.4 and 3.1.6; both give similar errors.
master.cxx
#include <mpi.h>

void doStuff() {
    // does stuff
}

int main(int argc, char *argv[]) {
    int NUM_JOBS = 10;
    MPI_Init(&argc, &argv);
    char *args[] = {"Arg1", "Arg2", NULL};
    MPI_Info mpi_info;
    MPI_Info_create(&mpi_info);
    MPI_Info_set(mpi_info, "add-hostfile", "nodefile");
    MPI_Comm child_comm;
    MPI_Comm_spawn("worker", args, NUM_JOBS, mpi_info, 0, MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
    doStuff();
    MPI_Finalize();
    return 0;
}
worker.cxx
#include <mpi.h>

void doStuff() {
    // does stuff
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    doStuff();
    MPI_Finalize();
    return 0;
}
Both are compiled via CMake. I have 101 cores available to me, for 1 master and up to 100 workers, with 24 cores per node. For the first run, I'm just running workers to prove that it can run. I had to add in some InfiniBand parameters.
run1.sh
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"
mpirun -n 101 --mca btl openib,self --hostfile nodefile worker
Now, I try to run the workers using the master. Note that the code as written above spawns 10 workers, so the 11 total procs can fit on one 24-core node.
run2.sh
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"
mpirun -n 1 --mca btl openib,self --hostfile nodefile master
That works fine. So I raise NUM_JOBS to 100 in master.cxx and repeat the run using run2.sh.
This time, I get an MPI error and the job fails:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[53646,2],88]) is on host: mymachine04
Process 2 ([[53636,1],0]) is on host: unnknown1
BTLs attempted: self openib
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[mymachine04:22480] [[53646,2],88] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
That last line is repeated 77 times in total. Presumably not coincidentally, 101 procs - 24 procs per node = 77. The hostfile is just 101 lines of hostnames, and I get the same error whether I supply it or not. Interestingly, if I remove the first line of the hostfile before spawning my workers, on the grounds that that slot is already used by the master, I get an error about running out of slots. Any idea what I'm doing wrong, and how I can get all of my processes to reach each other?

OpenMP code in CUDA source file not compiling on Google Colab

I am trying to run a simple Hello World program with OpenMP directives on Google Colab, using the OpenMP library and CUDA. I have followed this tutorial, but I am getting an error even though I include %%cu in my code. This is my code:
%%cu
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Main Program */
int main(int argc, char **argv)
{
    int Threadid, Noofthreads;

    printf("\n\t\t---------------------------------------------------------------------------");
    printf("\n\t\t Objective : OpenMP program to print \"Hello World\" using OpenMP PARALLEL directives\n ");
    printf("\n\t\t..........................................................................\n");

    /* Set the number of threads */
    /* omp_set_num_threads(4); */

    /* OpenMP Parallel Construct : Fork a team of threads */
    #pragma omp parallel private(Threadid)
    {
        /* Obtain the thread id */
        Threadid = omp_get_thread_num();
        printf("\n\t\t Hello World is being printed by the thread : %d\n", Threadid);

        /* Master Thread Has Its Threadid 0 */
        if (Threadid == 0) {
            Noofthreads = omp_get_num_threads();
            printf("\n\t\t Master thread printing total number of threads for this execution are : %d\n", Noofthreads);
        }
    } /* All threads join the master thread */

    return 0;
}
And this is the error I am getting:
/tmp/tmpxft_00003eb7_00000000-10_15fcc2da-f354-487a-8206-ea228a09c770.o: In function `main':
tmpxft_00003eb7_00000000-5_15fcc2da-f354-487a-8206-ea228a09c770.cudafe1.cpp:(.text+0x54): undefined reference to `omp_get_thread_num'
tmpxft_00003eb7_00000000-5_15fcc2da-f354-487a-8206-ea228a09c770.cudafe1.cpp:(.text+0x78): undefined reference to `omp_get_num_threads'
collect2: error: ld returned 1 exit status
Without OpenMP directives, a simple Hello World program runs perfectly, as can be seen below:
%%cu
#include <iostream>
int main()
{
std::cout << "Welcome To GeeksforGeeks\n";
return 0;
}
Output:
Welcome To GeeksforGeeks
There are two problems here:
1. nvcc doesn't enable or natively support OpenMP compilation. This has to be enabled by additional command-line arguments passed through to the host compiler (gcc by default).
2. The standard Google Colab/Jupyter notebook plugin for nvcc doesn't allow passing extra compilation arguments, meaning that even if you solve the first issue, it doesn't help in Colab or Jupyter.
You can solve the first problem as described here, and you can solve the second as described here and here.
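For example, outside the notebook plugin, the first problem is typically addressed by forwarding the OpenMP flag to the host compiler and linking the OpenMP runtime. A minimal sketch, assuming gcc as the host compiler and the code above saved (without the %%cu line) as hello_omp.cu; the file name is my assumption, not part of the question:
nvcc -Xcompiler -fopenmp hello_omp.cu -o hello_omp -lgomp
./hello_omp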
Combining these in Colab worked for me (the original answer included screenshots of the notebook setup and the resulting output, which are omitted here).

Breakpad does not generate a minidump when erasing an iterator twice

I find that Breakpad sometimes does not handle SIGSEGV.
I wrote a simple example to reproduce it:
#include <vector>
#include <breakpad/client/linux/handler/exception_handler.h>

int InitBreakpad()
{
    char core_file_folder[] = "/tmp/cores/";
    google_breakpad::MinidumpDescriptor descriptor(core_file_folder);
    auto exception_handler_ =
        new google_breakpad::ExceptionHandler(descriptor,
                                              nullptr,
                                              nullptr,
                                              nullptr,
                                              true,
                                              -1);
    return 0;
}

int main()
{
    InitBreakpad();
    // int* ptr = nullptr;
    // *ptr = 1;
    std::vector<int> sum;
    sum.push_back(1);
    auto it = sum.begin();
    sum.erase(it);
    sum.erase(it);  // erasing an already-invalidated iterator: the intentional bug
    return 0;
}
GCC is 4.8.5 and my compile command is:
g++ test_breakpad.cpp -I./include -I./include/breakpad -L./lib -lbreakpad -lbreakpad_client -std=c++11 -lpthread
Running a.out gives "Segmentation fault", but no minidump is generated.
If I uncomment the nullptr write, Breakpad works!
What should I do to correct it?
GDB debug output:
(gdb) b google_breakpad::ExceptionHandler::~ExceptionHandler()
Breakpoint 2 at 0x402ed0: file src/client/linux/handler/exception_handler.cc, line 264.
(gdb) c
The program is not being run.
(gdb) r
Starting program: /home/zen/tmp/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Breakpoint 1, google_breakpad::ExceptionHandler::ExceptionHandler (this=0x619040, descriptor=..., filter=0x0, callback=0x0, callback_context=0x0, install_handler=true, server_fd=-1) at src/client/linux/handler/exception_handler.cc:224
224 ExceptionHandler::ExceptionHandler(const MinidumpDescriptor& descriptor,
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff712f19d in __memmove_ssse3_back () from /lib64/libc.so.6
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff712f19d in __memmove_ssse3_back () from /lib64/libc.so.6
(gdb) c
Continuing.
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
I also tried Breakpad's out-of-process dump, but still got nothing (the nullptr write works).
After some debugging, I think the reason the second sum.erase(it) does not create a minidump in your example is stack corruption.
While debugging you can see that the variable g_handler_stack_ in src/client/linux/handler/exception_handler.cc is correctly initialized and the google_breakpad::ExceptionHandler instance is correctly added to the vector. However, when google_breakpad::ExceptionHandler::SignalHandler is called, the vector is reported as empty, despite there being no calls to google_breakpad::ExceptionHandler::~ExceptionHandler or any of the std::vector methods that would change the vector.
A further data point that suggests stack corruption is that the code works with clang++. Additionally, as soon as we change std::vector<int> sum; to a std::vector<int>* sum, which ensures that we don't corrupt the stack, the minidump is written to disk.
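For illustration, a minimal sketch of that heap-allocated variant, reusing InitBreakpad() and the includes from the question's example (the deliberate double erase is kept):
int main()
{
    InitBreakpad();

    // Heap-allocated vector, as described above; the buggy double erase stays.
    auto* sum = new std::vector<int>();
    sum->push_back(1);
    auto it = sum->begin();
    sum->erase(it);
    sum->erase(it);  // still crashes, but now the corruption does not hit the stack and the minidump is written

    return 0;
}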

mpiexec checkpointing error (RPi)

When I try to run an application (even a simple hello_world.c doesn't work), I receive this error every time:
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 10 -machinefile /tmp/machinefile -n 1 ./app_name
[proxy:0:0#masterpi] requesting checkpoint
[proxy:0:0#masterpi] checkpoint completed
[proxy:0:0#masterpi] requesting checkpoint
[proxy:0:0#masterpi] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0#masterpi] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:0#masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0#masterpi] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec#masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec#masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec#masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec#masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
I just want to make a checkpoint and nothing else (and restart later).
Thanks in advance.
UPDATE:
I have tried with MPICH2, with no luck. Or maybe I'm wrong somewhere...
pi#raspberrypi ~ $ mpiexec -n 1 -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 2 ./test3
Count to: 0
[proxy:0:0#raspberrypi] requesting checkpoint
[proxy:0:0#raspberrypi] checkpoint completed
Count to: 1
[proxy:0:0#raspberrypi] requesting checkpoint
[proxy:0:0#raspberrypi] HYDT_ckpoint_checkpoint (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0#raspberrypi] HYD_pmcd_pmip_control_cmd_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip_cb.c:902): checkpoint suspend failed
[proxy:0:0#raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0#raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
[mpiexec#raspberrypi] control_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
[mpiexec#raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[mpiexec#raspberrypi] HYD_pmci_wait_for_completion (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec#raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
Test3-Code:
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int rank;
    int size;
    int i = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Status status;

    if (rank == 0) {
        for (; i <= 100; i++) {
            int j = 0;
            while (j < 100000000) {
                j++;
            }
            printf("Count to: %i\n", i);
        }
    } else {
    }

    MPI_Finalize();
    return 0;
}
I just need one successful checkpoint and to demonstrate the restart.
If someone has a working example (it doesn't matter what it does; a simple working "Hello World" would make me happy!), I would be very glad.
Happy new year!
Unfortunately, the checkpoint/restart code in MPICH 3.0.4 is known to be buggy at the moment. That will hopefully get fixed in a future release. It looks like you're probably using it correctly. It's possible that if you go back to a previous version, you might have better luck.
Here the problem was that the checkpoint interval was too small.
Setting it to 20 s or more solved this problem (but not the other one :( ).
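Concretely, that just means raising -ckpoint-interval in the original command, for example (same command as above, only the interval changed):
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 20 -machinefile /tmp/machinefile -n 1 ./app_name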

MPI: Why does my MPICH program fail for a large number of processes?

/* C Example */
#include <mpi.h>
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    int rank, size;
    int buffer_length = MPI_MAX_PROCESSOR_NAME;
    char hostname[buffer_length];

    MPI_Init(&argc, &argv);                            /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);              /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);              /* get number of processes */
    MPI_Get_processor_name(hostname, &buffer_length);  /* get hostname */

    printf("Hello world from process %d running on %s of %d\n", rank, hostname, size);

    MPI_Finalize();
    return 0;
}
The above program compiles and runs successfully on Ubuntu 12.04 for a smaller number of processes, but it fails when I try to execute it with thousands of processes. Why is that?
I am expecting that the scheduler would keep the threads in a queue and dispatch them one by one (I am running this code on a single-core machine).
Why does the following error occur for a large number of processes, and how can I resolve this issue?
root#ubuntu:/home# mpiexec -n 1000 ./hello
[proxy:0:0#ubuntu] HYDU_create_process (./utils/launch/launch.c:26): pipe error (Too many open files)
[proxy:0:0#ubuntu] launch_procs (./pm/pmiserv/pmip_cb.c:751): create process returned error
[proxy:0:0#ubuntu] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:935): launch_procs returned error
[proxy:0:0#ubuntu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0#ubuntu] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
Killed
You are running into the open-file limit on your system. The default on Ubuntu is 1024. You can try raising the limit in your session with the ulimit command:
ulimit -n 2048
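As a sketch of how I'd apply it (standard Linux shell usage; the ./hello rerun assumes the same binary as in the question):
ulimit -n        # show the current open-file limit for this shell (1024 by default on Ubuntu)
ulimit -n 2048   # raise it for the current shell session only
mpiexec -n 1000 ./hello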
