MPI multiple spawn: number of processes in the child

I need to spawn multiple processes from only one original process.
The parent looks like this, it spawns two processes:
int ncommands = 2;
// definition of the argument arrays; here is the one I am interested in:
for (int ic = 0; ic < ncommands; ic++) nprocs[ic] = 1;
MPI_Comm_spawn_multiple(2, commands, MPI_ARGVS_NULL, nprocs,
                        infos, 0, MPI_COMM_SELF, &child, errcodes);
This calls the executables correctly, but I don't understand the number of processes in the child code.
I expected the nprocs array to set a maximum of one process per command. This is very important to my code, because the child executable does not work with more than one process (basically there is a system solver that does not support parallelization). However, when I print in the child the size obtained from MPI_Comm_size(MPI_COMM_WORLD, &mpisize), it returns 2. And indeed, I can't do what I need in the child, because it tries to parallelize the problem between two processes.
How can I make the child use only one process? Why is the size I get 2?

Your MPI_Comm_spawn_multiple call does exactly what the documentation says: "Spawns multiple binaries [...], establishing communication with them and placing them in the same MPI_COMM_WORLD."
If you want to spawn multiple programs but give each of them its own MPI_COMM_WORLD, use a sequence of calls to MPI_Comm_spawn instead.
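For example, a minimal sketch of that approach (the child executable names and the way the intercommunicators are used afterwards are placeholders, not from the original question):

// Sketch: one MPI_Comm_spawn per command, so every child program gets its
// own MPI_COMM_WORLD of size 1.
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char command_a[] = "./child_a";   // placeholder executable names
    char command_b[] = "./child_b";
    char *commands[2] = { command_a, command_b };

    MPI_Comm children[2];
    int errcodes[1];

    for (int ic = 0; ic < 2; ic++) {
        // maxprocs = 1: each child is started with exactly one process
        MPI_Comm_spawn(commands[ic], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children[ic], errcodes);
    }

    // ... communicate with each child through its own intercommunicator children[ic] ...

    MPI_Finalize();
    return 0;
}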

Related

Is hierarchical parallelism possible with MPI libraries?

I'm writing a computational code with MPI. The software has a few parts, each computing a different part of the problem. Each part is written with MPI and could therefore be run as an independent module. Now I want to combine these parts so that they run together within one program, with all parts of the code running in parallel while each part itself also runs in parallel.
E.g. total number of nodes = 10, part 1 runs with 6 nodes and part 2 runs with 4 nodes, both running together.
Is there a way to mpirun with 10 nodes and have each part call MPI_Init with the desired number of nodes, without rewriting the overall program to allocate processes to each part of the code?
This is not straightforward.
One option is to use an external program that spawns your two sub-programs with MPI_Comm_spawn() (called once per sub-program). The drawback is that this requires one extra slot.
Another option needs some rewriting: since all the tasks end up in the same MPI_COMM_WORLD, it is up to them to call MPI_Comm_split() based on who they are, and then use the resulting communicator instead of MPI_COMM_WORLD, as in the sketch below.
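A minimal sketch of that second option, assuming the 6/4 split from the question and using the world rank as the split criterion (the run_part* entry points are hypothetical):

// Sketch: all 10 processes start in MPI_COMM_WORLD and split themselves
// into a 6-process group (part 1) and a 4-process group (part 2).
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // color 0 -> part 1 (ranks 0..5), color 1 -> part 2 (ranks 6..9)
    int color = (world_rank < 6) ? 0 : 1;

    MPI_Comm part_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &part_comm);

    // Each part now runs on its own communicator instead of MPI_COMM_WORLD:
    // if (color == 0) run_part1(part_comm); else run_part2(part_comm);   // hypothetical

    MPI_Comm_free(&part_comm);
    MPI_Finalize();
    return 0;
}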

Why does my attempt at global synchronization not work?

I am trying to use this code, but the kernel exits after executing the cycle only once.
If I remove the "while(...)" line, the cycle works, but the results of course are a mess.
If I declare "volatile __global uint *g_barrier", it freezes the PC with a black screen for a while and then the program deadlocks.
__kernel void Some_Kernel(__global uint *g_barrier)
{
    uint i, t;
    for (i = 1; i < MAX; i++) {
        // some useful code here
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_local_id(0) == 0) atomic_add(g_barrier, 1);
        t = i * get_num_groups(0);
        while (*g_barrier < t); // try to sync it all
    }
}
You seem to be expecting all work groups to be scheduled to run in parallel. OpenCL does not guarantee this to happen. Some work groups may not start until some other work groups have entirely completed running the kernel.
Moreover, barriers only synchronise within a work group. Atomic operations on global memory are atomic with regard to other work groups too, but there is no guarantee about order.
If you need other work groups to complete some code before running some other code, you will need to enqueue each of those chunks of work separately on a serial command queue (or appropriately connect them using events on an out-of-order queue). So for your example code, you need to remove your for and while loops, and enqueue your kernel MAX-1 times and pass i as a kernel argument.
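A hedged host-side sketch of that suggestion (error checking omitted; the queue, kernel and NDRange sizes are assumed to be set up elsewhere in the application):

#include <CL/cl.h>

// Enqueues the kernel MAX-1 times on an in-order queue, passing the
// iteration counter as kernel argument 0. Because the queue is in-order,
// each launch starts only after the previous one has finished across all
// work groups, which is the global synchronisation the in-kernel loop
// cannot provide.
static void run_iterations(cl_command_queue queue, cl_kernel kernel,
                           size_t global_size, size_t local_size,
                           cl_uint max_iterations /* corresponds to MAX */)
{
    for (cl_uint i = 1; i < max_iterations; i++) {
        clSetKernelArg(kernel, 0, sizeof(cl_uint), &i);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);
    }
    clFinish(queue);   // wait for the final iteration before reading results
}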
Depending on the capabilities of your device and the size of your data set, your other option is to submit only one large work group, though this is unlikely to give you good performance unless you have a lot of such smaller tasks which are independent from one another.
(I will point out that there is a good chance your question suffers from the XY problem - you have not stated the overall problem your code is trying to solve. So there may be better solutions than the ones I have suggested.)

How to avoid reading back in OpenCL

I am implementing an algorithm with OpenCL. I loop in C++ many times and call the same OpenCL kernel each time. The kernel generates the input data for the next iteration together with the number of those data. Currently, I read this number back in each loop for two usages:
I use this number to decide how many work items I need for the next loop; and
I use this number to decide when to exit the loop (when the number is 0).
I found that the readback takes most of the time of each loop. Is there any way to avoid it?
Generally speaking, if you need to call a kernel repeatedly and the exit condition depends on the result generated by the kernel (rather than a fixed number of loops), how can you do this efficiently? Is there anything like the occlusion query in OpenGL, where you can just issue a query instead of reading back from the GPU?
Reading a number back from a GPU kernel will always take tens to thousands of microseconds or more.
If the controlling number is always decreasing, you can keep it in global memory, test it against the global id, and decide on each iteration whether the kernel does any work. Use a global memory barrier to sync all the threads ...
kernel void x(global int *the_number, const int max_iterations /* ... other arguments ... */)
{
    int index = get_global_id(0);
    int count = 0; // stops an infinite loop
    while (index < the_number[0] && count < max_iterations)
    {
        count++;
        // loop code follows
        // ....
        // Use one thread to decide what to do next
        if (index == 0)
        {
            the_number[0] = ...; // next value
        }
        barrier(CLK_GLOBAL_MEM_FENCE); // barrier to sync threads (within a work group)
    }
}
You have a couple of options here:
1. If possible, you can simply move the loop and the conditional into the kernel. Use a scheme where additional work items do nothing, depending on the input for the current iteration.
2. If 1. isn't possible, I would recommend that you store the data generated by the "decision" kernel in a buffer and use that buffer to "direct" your other kernels, as in the sketch below.
Both of these options will allow you to skip the readback.
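A hedged kernel-side sketch of option 2 (the buffer names and the per-item work are illustrative only): the "decision" kernel writes the element count for the next pass into a one-element buffer, and the worker kernel masks itself off against it, so the host never reads the count back.

__kernel void worker(__global const float *in,
                     __global float *out,
                     __global const uint *count)   // written by the previous pass
{
    uint gid = get_global_id(0);
    if (gid >= count[0])
        return;                   // surplus work items do nothing this iteration

    out[gid] = in[gid] * 2.0f;    // placeholder for the real per-item work
}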
I'm just finishing up some research where we had to tackle this exact problem!
We discovered a couple of things:
Use two (or more) buffers! Have the first iteration of the kernel operate on data in b1, then the next on b2, then on b1 again. In between each kernel call, read back the result of the other buffer and check to see if it's time to stop iterating. This works best when the kernel takes longer than a read; use a profiling tool to make sure you aren't waiting on reads (and if you are, increase the number of buffers).
Overshoot! Add a finishing check to each kernel, and call it several (100s of) times before copying data back. If your kernel is low-cost, this can work very well.
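A host-side sketch of the overshoot idea (hedged: queue, kernel and the one-element "done" flag buffer are assumed to be created elsewhere, and the kernel is assumed to set the flag when it detects completion):

#include <CL/cl.h>

#define BATCH 200   // number of launches between readbacks; tune to your kernel cost

// Enqueues the kernel in batches and reads back only a 4-byte "done" flag
// once per batch, instead of reading the full result after every launch.
static void run_until_done(cl_command_queue queue, cl_kernel kernel,
                           cl_mem done_buf, size_t global_size)
{
    cl_uint done = 0;
    while (!done) {
        for (int i = 0; i < BATCH; i++)
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                   &global_size, NULL, 0, NULL, NULL);
        // one small blocking read per BATCH launches
        clEnqueueReadBuffer(queue, done_buf, CL_TRUE, 0,
                            sizeof(cl_uint), &done, 0, NULL, NULL);
    }
}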

Checking get_global_id in OpenCL Kernel Necessary?

I have noticed a number of kernel sources that look like this (found randomly by Googling):
__kernel void fill(__global float* array, unsigned int arrayLength, float val)
{
    if (get_global_id(0) < arrayLength)
    {
        array[get_global_id(0)] = val;
    }
}
My question is whether that if-statement is actually necessary (assuming that "arrayLength" in this example is the same as the global work size).
In some of the more "professional" kernels I have seen, it is not present. It also seems to me that the hardware would do well not to assign kernels to nonsense coordinates.
However, I also know that processors work in groups. Hence, I can imagine that some processors in a group must do nothing (for example, if you have one group of size 16 and a work size of 41, then the group would process the first 16 work items, then the next 16, then the next 9, with 7 processors doing nothing--do they get dummy kernels?).
I checked the spec., and the only relevant mention of "get_global_id" is the same as the online documentation, which reads:
The global work-item ID specifies the work-item ID based on the number of global work-items specified to execute the kernel.
. . . based how?
So what is it? Is it safe to omit iff the array's size is a multiple of the work group size? What?
You have the right answer already, I think. If the global size of your kernel execution is the same as the array length, then this if statement is useless.
In general, that type of check is only needed for cases where you've partitioned your data in such a way that you know you might execute extra work items relative to your array size. In my experience, you can almost always avoid such cases.
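For completeness, the case where the check is needed looks like this on the host (a hedged sketch; the helper and variable names are illustrative): you want a particular work-group size, so you round the global size up to a multiple of it and keep the guard in the kernel to mask off the padding items.

#include <CL/cl.h>

// Rounds value up to the next multiple of `multiple`, so the NDRange is
// legal for the chosen work-group size; the kernel's bounds check then
// masks off the padding work items.
static size_t round_up(size_t value, size_t multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

// Usage sketch (queue, kernel and arrayLength assumed to exist):
//   size_t local  = 16;
//   size_t global = round_up(arrayLength, local);               // 41 -> 48, 7 padded items
//   clSetKernelArg(kernel, 1, sizeof(cl_uint), &arrayLength);   // guard needs the real length
//   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);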

How to determine MPI rank/process number local to a socket/node

Say I run a parallel program using MPI. The execution command
mpirun -n 8 -npernode 2 <prg>
launches 8 processes in total, that is, 2 processes per node on 4 nodes (Open MPI 1.5). Each node comprises one dual-core CPU, and the interconnect between nodes is InfiniBand.
Now, the rank number (or process number) can be determined with
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
This returns a number between 0 and 7.
But how can I determine the node number (in this case a number between 0 and 3) and the process number within a node (a number between 0 and 1)?
I believe you can achieve that with MPI-3 in this manner:
MPI_Comm shmcomm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &shmcomm);
int shmrank;
MPI_Comm_rank(shmcomm, &shmrank);
It depends on the MPI implementation; there is no standard for this particular problem.
Open MPI has some environment variables that can help. OMPI_COMM_WORLD_LOCAL_RANK will give you the local rank within a node, i.e. the process number you are looking for. A call to getenv will therefore answer your problem, but this is not portable to other MPI implementations.
See this for the (short) list of variables in OpenMPI.
I don't know of a corresponding "node number".
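A minimal sketch of the getenv approach (Open MPI specific; it falls back to -1 when the variable is not set, e.g. under a different MPI implementation):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI's launcher for each process
    const char *s = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = s ? atoi(s) : -1;

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    printf("world rank %d has node-local rank %d\n", world_rank, local_rank);

    MPI_Finalize();
    return 0;
}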
This exact problem is discussed on Markus Wittmann's Blog, MPI Node-Local Rank determination.
There, three strategies are suggested:
A naive, portable solution employs MPI_Get_processor_name or gethostname to create an unique identifier for the node and performs an MPI_Alltoall on it. [...]
[Method 2] relies on MPI_Comm_split, which provides an easy way to split a communicator into subgroups (sub-communicators). [...]
Shared memory can be utilized, if available. [...]
For some working code (presumably LGPL licensed?), Wittmann links to MpiNodeRank.cpp from the APSM library.
Alternatively you can use
int MPI_Get_processor_name(char *name, int *resultlen)
to retrieve the node name, then use it as the color in
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
This is not as simple as MPI_Comm_split_type; however, it offers a bit more freedom to split your communicator the way you want.
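A hedged sketch of that approach (the hash used to turn the name into a color is an illustrative choice; name collisions between different hosts would wrongly merge nodes, so production code should verify the grouping, e.g. by gathering and comparing the actual names):

#include <mpi.h>

// Splits `comm` into one sub-communicator per node and returns the rank
// of the calling process within its node.
int node_local_rank(MPI_Comm comm, MPI_Comm *node_comm)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int len, rank, local_rank;

    MPI_Get_processor_name(name, &len);
    MPI_Comm_rank(comm, &rank);

    // fold the host name into a non-negative int to use as the color
    unsigned int color = 0;
    for (int i = 0; i < len; i++)
        color = color * 31u + (unsigned char)name[i];
    color &= 0x7fffffff;

    MPI_Comm_split(comm, (int)color, rank, node_comm);
    MPI_Comm_rank(*node_comm, &local_rank);
    return local_rank;   // the process number within this node
}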
