Say I run a parallel program using MPI. The execution command
mpirun -n 8 -npernode 2 <prg>
launches 8 processes in total: 2 processes per node on 4 nodes (Open MPI 1.5). Each node has one dual-core CPU, and the nodes are connected by InfiniBand.
Now, the rank number (or process number) can be determined with
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
This returns a number between 0 and 7.
But how can I determine the node number (in this case a number between 0 and 3) and the process number within a node (a number between 0 and 1)?
I believe you can achieve that with MPI-3 in this manner:
MPI_Comm shmcomm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
MPI_INFO_NULL, &shmcomm);
int shmrank;
MPI_Comm_rank(shmcomm, &shmrank);
It depends on the MPI implementation - and there is no standard for this particular problem.
Open MPI has some environment variables that can help. OMPI_COMM_WORLD_LOCAL_RANK gives you the local rank within a node, i.e. the process number you are looking for. A call to getenv will therefore solve your problem, but this is not portable to other MPI implementations.
See this for the (short) list of variables in OpenMPI.
I don't know of a corresponding "node number".
This exact problem is discussed on Markus Wittmann's Blog, MPI Node-Local Rank determination.
There, three strategies are suggested:
A naive, portable solution employs MPI_Get_processor_name or gethostname to create an unique identifier for the node and performs an MPI_Alltoall on it. [...]
[Method 2] relies on MPI_Comm_split, which provides an easy way to split a communicator into subgroups (sub-communicators). [...]
Shared memory can be utilized, if available. [...]
For some working code (presumably LGPL licensed?), Wittmann links to MpiNodeRank.cpp from the APSM library.
Alternatively you can use
int MPI_Get_processor_name( char *name, int *resultlen )
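to retrieve the node name, then derive a non-negative integer from it (color must be an int, and distinct nodes need distinct values) and use that as color in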
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
This is not as simple as MPI_Comm_split_type, but it offers a bit more freedom to split your communicator the way you want.
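Once every rank knows the host names of all ranks (for example after gathering the MPI_Get_processor_name results with a collective), the node number and the node-local rank are plain bookkeeping. The sketch below shows only that bookkeeping and contains no MPI calls. The function name node_and_local_rank, the MAX_NAME length and the layout of the names array are assumptions made for the illustration; nodes are numbered in order of first appearance by rank.

```c
#include <assert.h>
#include <string.h>

#define MAX_NAME 64 /* assumed fixed name length for this sketch */

/* names[r] is the host name reported by rank r (already gathered).
   On return, *node is the index of myrank's host in order of first
   appearance, and *local counts the lower ranks on the same host. */
void node_and_local_rank(const char names[][MAX_NAME], int nranks,
                         int myrank, int *node, int *local)
{
    assert(myrank >= 0 && myrank < nranks);
    int nodes_seen = 0;
    *node = -1;
    *local = 0;
    for (int r = 0; r <= myrank; r++) {
        /* Is r the first rank that reported this host name? */
        int first = 1;
        for (int q = 0; q < r; q++) {
            if (strcmp(names[q], names[r]) == 0) { first = 0; break; }
        }
        if (first) {
            if (strcmp(names[r], names[myrank]) == 0) *node = nodes_seen;
            nodes_seen++;
        }
        /* Lower ranks on my host determine my local rank. */
        if (r < myrank && strcmp(names[r], names[myrank]) == 0) (*local)++;
    }
}
```

The node value computed this way could also serve as the integer color for MPI_Comm_split, since it is non-negative and unique per host name.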
I need to spawn multiple processes from only one original process.
The parent looks like this, it spawns two processes:
int ncommands = 2;
//definition of arrays in arguments, and here the one that I am interested in:
for(int ic = 0; ic < ncommands; ic++) nprocs[ic] = 1;
MPI_Comm_spawn_multiple(2, commands,MPI_ARGVS_NULL,nprocs,
infos,0,MPI_COMM_SELF,&child,errcodes);
This calls the executable correctly, but I don't understand the number of processes in the child code.
With the array nprocs I expect to set a maximum of one process per command. This is very important to my code because the child executable does not work with more than one process (basically there is a system solver that does not support parallelization). However, when I print in the child the size obtained from MPI_Comm_size(MPI_COMM_WORLD, &mpisize), it returns 2. And indeed, I can't do what I need in the child because it is trying to parallelize the problem between two processes.
How can I make the child use only one process? Why is the size I get 2?
Your spawn_multiple call does what the documentation says: "Spawns multiple binaries [...], establishing communication with them and placing them in the same MPI_COMM_WORLD."
If you want to spawn multiple programs, each with its own separate MPI_COMM_WORLD, use a sequence of calls to MPI_Comm_spawn.
I'm writing a computational code with MPI. The software has a few parts, each of which computes a different part of the problem. Each part is written with MPI and could therefore run as an independent module. Now I want to combine these parts so they run together within one program, with all parts running in parallel while each part is itself also running in parallel.
E.g. total number of nodes = 10, with part 1 running on 6 nodes and part 2 on 4 nodes, both at the same time.
Is there a way to mpirun with 10 nodes and have each part mpi_init with its desired number of nodes, without rewriting the overall program to allocate processes for each part of the code?
This is not straightforward.
One option is to use an external program that calls MPI_Comm_spawn() twice to launch your sub-programs. The drawback is that this launcher itself requires one slot.
Another option needs some rewriting: since all the tasks will end up in the same MPI_COMM_WORLD, it is up to them to MPI_Comm_split() based on who they are, and to use the resulting communicator instead of MPI_COMM_WORLD.
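For the second option, each task has to decide which part it belongs to before splitting. A minimal sketch of that decision follows, assuming the first n_part1 world ranks run part 1 and the remaining ranks run part 2. The function names part_color and rank_in_part are hypothetical; the color is what a task would pass to MPI_Comm_split().

```c
#include <assert.h>

/* Color for MPI_Comm_split: 0 for part 1, 1 for part 2.
   Assumption: the first n_part1 world ranks belong to part 1. */
int part_color(int world_rank, int n_part1)
{
    assert(world_rank >= 0 && n_part1 >= 0);
    return world_rank < n_part1 ? 0 : 1;
}

/* Rank the task would have inside its own part when the world rank
   is used as the key, so the original ordering is preserved. */
int rank_in_part(int world_rank, int n_part1)
{
    assert(world_rank >= 0 && n_part1 >= 0);
    return world_rank < n_part1 ? world_rank : world_rank - n_part1;
}
```

With 10 tasks and n_part1 = 6, world ranks 0 to 5 get color 0 and ranks 6 to 9 get color 1, which matches the 6 + 4 example in the question.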
Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate most of the functionality I need into OpenCL kernels. My question is how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method, for point i, accesses cells at [i-2:i+2], so if i=1, I'll run off the end of the array. So I add some extra points that prevent this, and then tell my algorithm to only run on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm less clear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(...) {
int gid = get_global_id(0);
// Ghost cells: the first and last nGhostCells points
if (gid < nGhostCells || gid >= nPts-nGhostCells) {
retVal[gid] = 0;
return;
}
// Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?
It looks sufficient.
The branch takes the same path for all nPts - 2*nGhostCells interior points, and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, with nPts much larger than nGhostCells (say 1024 vs 3) it should not be noticeably slower than a branch-free version, apart from the latency of the "or" operation. Even that latency should be hidden behind array access latency, thanks to thread-level parallelism.
At those "break" points, typically 16 or 32 threads would lose some performance, and only for several clock cycles, because SIMD-like architectures run threads in lock-step.
If you happen to write chaotic branching, such as data-driven code paths, then you should split the work into different kernels (for different regions) or sort the data before the kernel so that divergence between neighboring threads is minimized.
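To make the guard concrete, here is a host-side C sketch that mimics one work-item per call. It assumes the interior is the Python-style range [nGhostCells : nPts-nGhostCells], so the last interior index is nPts-nGhostCells-1 and the upper test uses >=. The names is_interior and stencil_item are invented for this sketch, and the 5-point average is only a placeholder for the real calculation.

```c
#include <assert.h>

/* 1 if gid is an interior point, 0 if it is a ghost cell. */
int is_interior(int gid, int nPts, int nGhostCells)
{
    return gid >= nGhostCells && gid < nPts - nGhostCells;
}

/* Stand-in for the kernel body, called once per "work-item". */
void stencil_item(int gid, const float *in, float *out,
                  int nPts, int nGhostCells)
{
    assert(nGhostCells >= 2);      /* the stencil reaches gid-2 .. gid+2 */
    if (gid >= nPts) return;       /* global size may be padded */
    if (!is_interior(gid, nPts, nGhostCells)) {
        out[gid] = 0.0f;           /* ghost cell: write zero and stop */
        return;
    }
    float sum = 0.0f;              /* placeholder 5-point average */
    for (int k = -2; k <= 2; k++) sum += in[gid + k];
    out[gid] = sum / 5.0f;
}
```

The early return matters: without it, a ghost-cell work-item would fall through and still read in[gid-2] or in[gid+2] outside the array.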
From what I have learned in my supercomputing class, I know that MPI is a communication (and data passing) interface.
I'm confused about what happens when you run a function in a C++ program and want each processor to perform a specific task.
For example, take a prime number search (very popular on supercomputers). Say I have a range of values (531-564, some arbitrary range) and 50 processes on which I could run a series of evaluations for each number. If root (process 0) wants to examine 531, then, knowing how primes work, I can use 8 processes (1-8) to evaluate whether it is prime: if the number is divisible by any of 2-9 with a remainder of 0, it is not prime.
Since MPI passes data to each process, is it possible to have these processes perform these actions?
The hardest part for me is understanding that if I perform an action in the original C++ program, the work taking place could be spread over several different processes; how can I structure this in MPI? Or is my understanding completely wrong? If so, how am I supposed to think about this correctly?
The big idea is passing data to a process versus having a function sent to a process. I'm fairly certain I'm wrong, but I'm trying to backtrack to fix my thinking.
Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather. The "master" process (usually the one with rank 0) splits an array of work equally amongst the peers (including the master process itself) by having them all call scatter. The peers then do the work, and finally all peers call gather to send the results back to the master.
In your prime algorithm example, build an array of integers, "scatter" it to all the peers, have each peer run through its array saving 1 if the number is prime and 0 if it is not, then "gather" the results to the master. [In this particular example, since the input data is completely predictable based on process rank, the scatter step is unnecessary, but we will do it anyway.]
As pseudo-code:
main():
    n = 100
    int x[n]
    MPI_init()
    // k = number of processes in world; assume k divides n
    // prepare data on master
    if rank == 0:
        for i in 1 ... n, x[i] = i
    // send n/k elements of x on root to local on each process in world
    MPI_scatter(x, n/k, int, local, n/k, int, root, world)
    for i in 1 ... n/k
        result[i] = 1 // assume prime
        if local[i] < 2, result[i] = 0 // 1 is not prime
        if local[i] > 2 and 2 divides local[i], result[i] = 0
        if local[i] > 3 and 3 divides local[i], result[i] = 0
        if local[i] > 5 and 5 divides local[i], result[i] = 0
        if local[i] > 7 and 7 divides local[i], result[i] = 0
    // gather n/k results from result on each process in world to x on root
    MPI_gather(result, n/k, int, x, n/k, int, root, world)
    // print results
    if rank == 0:
        for i in 1 ... n, print i if x[i] == 1
    MPI_finalize()
There are lots of details to fill in, such as proper declarations, dealing with the fact that some ranks will have fewer elements than others, using proper C syntax, etc., but getting them right doesn't help explain the overall picture.
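One of those details, ranks that receive fewer elements than others, can be sketched without any MPI calls. The helper below fills a counts array and a displacements array in the layout that MPI_Scatterv and MPI_Gatherv expect. The name block_counts and the choice to give the extra elements to the lowest ranks are assumptions made for this sketch.

```c
#include <assert.h>

/* Split n elements over k ranks as evenly as possible.
   counts[r] is the number of elements for rank r, displs[r] the
   offset of its first element. The first n % k ranks get one extra. */
void block_counts(int n, int k, int *counts, int *displs)
{
    assert(n >= 0 && k > 0);
    int base = n / k;
    int extra = n % k;
    int offset = 0;
    for (int r = 0; r < k; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```

For n = 100 and k = 8, ranks 0 to 3 receive 13 elements each and ranks 4 to 7 receive 12, so no element is dropped when k does not divide n.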
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write since the different processes may be in different states. In particular, if process a calls MPI_Send to process b, then process b had better call MPI_Recv from a.
I was trying to find the best work-group size for a problem and I figured out something that I couldn't justify for myself.
These are my results :
GlobalWorkSize {6400 6400 1}, WorkGroupSize {64 4 1}, Time(Milliseconds) = 44.18
GlobalWorkSize {6400 6400 1}, WorkGroupSize {4 64 1}, Time(Milliseconds) = 24.39
Swapping the axes made execution almost twice as fast. Why?
By the way, I was using an AMD GPU.
Thanks :-)
EDIT :
This is the kernel (a Simple Matrix Transposition):
__kernel void transpose(__global float *input, __global float *output, const int size){
int i = get_global_id(0);
int j = get_global_id(1);
output[i*size + j] = input[j*size + i];
}
I agree with @Thomas, it most probably depends on your kernel. Most probably, in the second case you access memory in a coalesced way and/or make full use of each memory transaction.
Coalescence: when threads need to access elements in memory, the hardware tries to fetch them in as few transactions as possible, i.e. if thread 0 and thread 1 access contiguous elements there will be only one transaction.
Full use of a memory transaction: let's say you have a GPU that fetches 32 bytes in one transaction. If you have 4 threads that each need to fetch one int, you are using only half of the data fetched by the transaction and you waste the rest (assuming an int is 4 bytes).
To illustrate this, let's say that you have an n by n matrix to access. Your matrix is stored in row-major order, and you use n threads organized in one dimension. You have two possibilities:
Each workitem takes care of one column, looping through each column element one at a time.
Each workitem takes care of one line, looping through each line element one at a time.
It might be counter-intuitive, but the first solution will be able to make coalesced accesses while the second won't. The reason is that when the first workitem accesses the first element in the first column, the second workitem accesses the first element in the second column, and so on. These elements are contiguous in memory. This is not the case for the second solution.
Now take the same example and apply solution 1, but this time with 4 workitems instead of n and the same GPU as before: you will most probably increase the time by a factor of 2, since you waste half of each memory transaction.
EDIT: Now that you posted your kernel I see that I forgot to mention something else.
With your kernel, it seems that choosing a local size of (1, 256) or (256, 1) is always a bad choice. In the first case 256 transactions will be necessary to read a column (each fetching 32 bytes out of which only 4 will be used - keeping in mind the same GPU of my previous examples) in input while 32 transactions will be necessary to write in output: You can write 8 floats in one transaction hence 32 transactions to write the 256 elements.
The same problem arises with a workgroup size of (256, 1), but this time using 32 transactions to read and 256 to write.
So why does the first size, (1, 256), work better? It's because there is a cache system that can mitigate the bad access pattern for the read part. The size (1, 256) is good for the write part, and the cache handles the poor read part, decreasing the number of necessary read transactions.
Note that the number of transactions decreases overall (taking into consideration all the workgroups within the NDRange). For example, the first workgroup issues the 256 transactions to read the first 256 elements of the first column. The second workgroup might just go to the cache to retrieve the elements of the second column, because they were already fetched by the (32-byte) transactions issued by the first workgroup.
Now, I'm almost sure that you can do better than (1, 256): try (8, 32).
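The transaction counts quoted above can be reproduced with a few lines of host code. This is only a sketch of the simplified model used in this answer: 32-byte transactions, 4-byte floats, buffers aligned to 32 bytes, no cache, and a single workgroup placed at the origin. Real hardware groups accesses per wavefront, so treat the numbers as a rough guide. The function name count_transactions and the two constants are made up for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define TRANSACTION_BYTES 32 /* assumed transaction size, as in the text */
#define ELEM_BYTES 4         /* assumed size of one float */

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Number of distinct 32-byte segments touched by one workgroup of
   shape (lx, ly) in the transpose kernel. is_write != 0 counts
   output[i*size + j], otherwise input[j*size + i]. */
long count_transactions(int lx, int ly, int size, int is_write)
{
    assert(lx > 0 && ly > 0 && size > 0);
    long n = (long)lx * ly;
    long *seg = malloc((size_t)n * sizeof *seg);
    if (!seg) return -1;
    long k = 0;
    for (int j = 0; j < ly; j++) {
        for (int i = 0; i < lx; i++) {
            long idx = is_write ? (long)i * size + j : (long)j * size + i;
            seg[k++] = idx * ELEM_BYTES / TRANSACTION_BYTES;
        }
    }
    /* Sort the segment ids and count the unique ones. */
    qsort(seg, (size_t)n, sizeof *seg, cmp_long);
    long distinct = 0;
    for (long m = 0; m < n; m++) {
        if (m == 0 || seg[m] != seg[m - 1]) distinct++;
    }
    free(seg);
    return distinct;
}
```

Under this model, with size = 6400, a (1, 256) workgroup needs 256 read and 32 write transactions, (256, 1) needs 32 and 256, and (8, 32) needs 32 of each, which is why a more square shape such as (8, 32) looks promising.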