The setup: A single-processor executable and two parallel mpi-based codes that can run on 100s of processors each. All on an HPC cluster that uses a PBS-based job scheduler.
The problem: Using a shared memory communication between single-processor executable and the parallel codes requires that rank 0 of the parallel codes all be located physically on the same node in an HPC cluster that uses a PBS job scheduler.
The question: Can a PBS script be created that can specify that rank 0 of the two parallel codes must start on a specific node( the same node that the other single-processor executable is running on)?
Example:
ExecA --Single processor
ExecB -- 100 processors
ExecC -- 100 processors
I want a situation where ExecA, ExecB(Rank0), and ExecC(Rank0) all start up on the same node. I need this so that the Rank 0 processors can communicate with the single-processor using a shared memory paradigm and then broadcast that information out to the rest of their respective MPI processes.
From this post, it does appear that specification of the number of cores to use on a code can be controlled using the PBS script. From my reading of the MPI manual, it also appears that if given a hostfile, MPI will sequentially go down the hostfile until it has allocated all the processors that were requested. So theoretically if I had a hostfile/machinefile that contained the host name of a particular node, and had a specification of 1 processor on that node being used, then I believe rank 0 would likely reside on that node.
I know that most cluster-based job schedulers do provide node names for users that they can use to specify a particular node to execute on, but I can't quite determine if ability to generally tell a scheduler "Hey, for this parallel job put the first process on this node, and put the rest elsewhere" is possible.
Related
I am trying use MPI's Spawn functionality to run subprocesses that also use MPI. I am using MPI 2x and dynamic process management.
I have a master process (maybe I should say "master program") that runs in python (via mpi4py) that uses MPI to communicate between cores. This master process/program runs on 16 cores, and it will also make MPI_Comm_spawn_multiple calls to C and Fortran programs (which also use MPI). While the C and Fortran processes run, the master python program waits until they are finished.
A little more explicitly, the master python program does two primary things:
Uses MPI to do preprocessing for the spawning in step (2). MPI_Barrier is called after this preprocessing to ensure that all ranks have finished their preprocessing before step (2) begins. Note that the preprocessing is distributed across all 16 cores, and at the end of the preprocessing the resulting information is passed back to the root rank (e.g. rank == 0).
After the preprocessing, the root rank spawns 4 workers, each of which use 4 cores (i.e. all 16 cores are needed to run all 4 processes at the same time). This is done via MPI_Comm_spawn_multiple and these workers use MPI to communicate within their 4 cores. In the master python program, only rank == 0 spawns the C and Fortran subprocesses, and an MPI_Barrier is called after the spawn on all ranks so that all the rank != 0 cores wait until the spawned processes finish before they continue execution.
Repeat (1) and (2) many many times in a for loop.
The issue I am having is that if I use mpiexec -np 16 to start the master python program, all the cores are being taken up by the master program and I get the error:
All nodes which are allocated for this job are already filled.
when the program hits the MPI_Comm_spawn_multiple line.
If I use any other value less than 16 for -np, then only some of the cores are allocated and some are available (but I still need all 16), so I get a similar error:
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
/home/username/anaconda/envs/myenvironment/bin/python
Either request fewer slots for your application, or make more slots available
for use.
So it seems like even though I am going to run MPI_Barrier in step (2) to block until the spawned processes finish, MPI still thinks those cores are being used and won't allocate another process on top of them. Is there a way to fix this?
(If the answer is hostfiles, could you please explain them for me? I am not understanding the full idea and how they might be useful here.)
This is the poster of this question. I found out that I can use -oversubscribe as an argument to mpiexec to avoid these errors, but as Zulan mentioned in his comments, this could be a poor decision.
In addition, I don't know if the cores are being subscribed like I want them to be. For example, maybe all 4 C/Fortran processes are being run on the same 4 cores. I don't know how to tell.
Most MPIs have a parameter -usize 123 for the mpiexec program that indicates the size of the "universe", which can be larger than the world communicator. In that case you can spawn extra processes up to the size of the universe. You can query the size of the universe:
int universe_size, *universe_size_attr,uflag;
MPI_Comm_get_attr(comm_world,MPI_UNIVERSE_SIZE,
&universe_size_attr,&uflag);
universe_size = *universe_size_attr;
I need to debug an MPI code for which I only have access to a single node/machine. The problem is the bug I am looking for only arises when running on more than node but it doesn't when running, for example, two MPI tasks in the same node, everything goes fine. I assume that my MPI implementation (mviapich2) cleverly treats tasks running on the same node by, for example, replacing network communications by IPC strategies or even memcpy.
So my question is: how could I run two MPI tasks on a single node but making MPI treat them as tasks on different nodes? Is that possible?
You can disable the MVAPICH2 shared memory device by setting the MV2_USE_SHARED_MEM environment variable to 0:
mpiexec ... -env MV2_USE_SHARED_MEM 0 ... ./executable
Make sure that your MVAPICH2 was built with the TCP/IP device, otherwise your ranks won't be able to communicate with shared memory support turned off.
I have two programs master and slave. My master does data decomposition and slaves do computation on the part of decomposed data. MPI scaterv is implemented for distribution of work.I execute my master program first then it dynamically spawns child or slave processes and slave executes different code ie.computation. Now again master has to collect results from slaves and executes next level of decomposition. how do I do that using MPI? I actually wanted to execute my master and slave code alternately.. How can I implement this?
Thank you in advance..
MPI-2 (if I remember correctly) introduced mechanisms for dynamic process management, you might care to search for mpi_comm_spawn to start learning about those mechanisms. So it is certainly possible to write an MPI program which alternates between one process running the master task and multiple processes running the worker tasks (the term slave is deprecated). It's even possible to design your computation so that one program runs the master task and another program runs the (multiple) worker tasks and to use MPI for passing messages between the two.
BUT (that's a big but) I don't think that many resource managers (either the humans who manage parallel computer systems or the operating system and systems software such as job managers) support such dynamic process management. Imagine the complexities of scheduling, and managing, two or more programs with the basic design that you propose. Just as program A tries to fire up 2^10 worker processes so too does program B, and program C, while program D tries to drop 2^8 worker processes; all this on a cluster with only 2^10 processors (or cores). It's probably not too difficult to construct scenarios where the throughput of jobs on the cluster falls towards zero as multiple jobs contend for scarce resources.
If your platform supports dynamic process management, go right ahead. In the far more likely case that your platform does not you have at least two choices, which one you choose depends on the ratio of master:worker time and probably other factors too. You could:
Do what most of us have always done and continue to do and request a total number of processors for the entire job, leaving all but one of them idle during the master-only phases. Wasteful perhaps but easy for the resource managers to cope with. Relatively easy to program too.
If the master does a lot of work between worker phases you could modify your program so that the master and worker are separate programs. First have the master execute on one process and, as it finishes, submit a request to the job management system to initiate the first phase of the worker computation. Have that, in turn, initiate the execution of the next master phase, and so on and so on.
I want to easily perform collective communications independently on each machine of my cluster. Let's say I have 4 machines with 8 cores on each, my MPI program would run 32 MPI tasks. What I would like is, for a given function:
on each host, only one task performs a computation, the other tasks do nothing during this computation. In my example, 4 MPI tasks will do the computation, 28 others are waiting.
once the computation is done, each MPI task on each will perform a collective communication ONLY to local tasks (tasks running on the same host).
Conceptually, I understand I must create one communicator for each host. I searched around, and found nothing explicitly doing that. I am not really comfortable with MPI groups and communicators. Here my two questions:
is MPI_Get_processor_name is enough unique for such a behaviour?
more generally, do you have a piece of code doing that?
The specification says that MPI_Get_processor_name returns "A unique specifier for the actual (as opposed to virtual) node", so I think you'd be ok with that. I guess you'd do a gather to assemble all the host names and then assign groups of processors to go off and make their communicators; or dup MPI_COMM_WORLD, turn the names into integer hashes, and use mpi_comm_split to partition the set.
You could also take the approach janneb suggests and use implementation-specific options to mpirun to ensure that the MPI implementation assigns tasks that way; OpenMPI uses --byslot to generate this ordering; with mpich2 you can use -print-rank-map to see the mapping.
But is this really what you want to do? If the other processes are sitting idle while one processor is working, how is this better than everyone redundantly doing the calculation? (Or is this very memory or I/O intensive, and you're worried about contention?) If you're going to be doing a lot of this -- treating on-node parallelization very different from off-node parallelization -- then you may want to think about hybrid programming models - running one MPI task per node and MPI_spawning subtasks or using OpenMP for on-node communications, both as suggested by HPM.
I don't think (educated thought, not definitive) that you'll be able to do what you want entirely from within your MPI program.
The response of the system to a call to MPI_Get_processor_name is system-dependent; on your system it might return node00, node01, node02, node03 as appropriate, or it might return my_big_computer for whatever processor you are actually running on. The former is more likely, but it is not guaranteed.
One strategy would be to start 32 processes and, if you can determine what node each is running on, partition your communicator into 4 groups, one on each node. This way you can manage inter- and intra-communications yourself as you wish.
Another strategy would be to start 4 processes and pin them to different nodes. How you pin processes to nodes (or processors) will depend on your MPI runtime and any job management system you might have, such as Grid Engine. This will probably involve setting environment variables -- but you don't tell us anything about your run-time system so we can't guess what they might be. You could then have each of the 4 processes dynamically spawn a further 7 (or 8) processes and pin those to the same node as the initial process. To do this, read up on the topic of intercommunicators and your run-time system's documentation.
A third strategy, now it's getting a little crazy, would be to start 4 separate MPI programs (8 processes each), one on each node of your cluster, and to join them as they execute. Read about MPI_Comm_connect and MPI_Open_port for details.
Finally, for extra fun, you might consider hybridising your program, running one MPI process on each node, and have each of those processes execute an OpenMP shared-memory (sub-)program.
Typically your MPI runtime environment can be controlled e.g. by environment variables how tasks are distributed over nodes. The default tends to be sequential allocation, that is, for your example with 32 tasks distributed over 4 8-core machines you'd have
machine 1: MPI ranks 0-7
machine 2: MPI ranks 8-15
machine 3: MPI ranks 16-23
machine 4: MPI ranks 24-31
And yes, MPI_Get_processor_name should get you the hostname so you can figure out where the boundaries between hosts are.
The modern MPI 3 answer to this is to call MPI_Comm_split_type
I am starting programming on an OpenMPI managed cluster.
I use the following command to run my executable:
mpirun -np 32 file
Now what I understand is that 32 specifies the number of processes that should be created. They may be created on the same processor. Am I right?
I am noticing increasing time for execution with increase in the number of processes. Could the above be a reason for this?
How do I find out the execution and scheduling policy of the cluster?
Is it correct to assume that typically the cluster I am working on will have many processes running on each node just as they run on my PC.
I would expect your job management system (which is ?) to allocate 1 MPI process per core. But that is a configuration matter and your cluster may not be configured as I expect. Can you see what processes are running on the various cores of your cluster at run time ?
There are many explanations for increasing execution time with increasing numbers of processes, several good ones which include the possibility of one-process-per-core. But multiple processes per core is a potential explanation.
You find out about the policies of your cluster by asking the cluster administrator.
No, I think it is atypical for cluster processors (or cores) to execute multiple MPI processes simultaneously.