How do I select the number of processors/cores to run my MPI program on?

I am using mpich2 1.2.1p1 version which has MPD as its default process manager.
When we run mpiexec, we can specify the number of processes we want to spawn, but I also want to select the number of processors/cores I want to use. How do I do it?
Also, when we simply spawn n processes, how do we know how many processors/cores are being used?
Please help.

Any sensible operating system will use as many cores as possible on each machine. You should not have to worry about that. When spawning 4 mpi processes on a quad core machine, it is safe to assume that all 4 cores will be used. If not, there is something seriously wrong with the configuration. Anyway, if you really want to be sure, check the CPU usage with for example 'top'.

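With single-threaded ranks, the number of processes is an upper bound on the number of cores used: each process runs on one core at a time, and the OS scheduler normally spreads the processes across the available cores. MPI does not guarantee one process per core unless you ask for binding, and if you start more processes than cores they will share cores. If you want to make sure you always use the maximum number of cores on your machine, use the OS facilities to get the number of cores and pass that to the mpiexec call.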
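As a rough sketch of "ask the OS for the core count and pass it to mpiexec", the snippet below builds the command line in Python. The helper name `mpiexec_command` and the program name `./my_program` are made up for illustration, and `os.cpu_count()` reports logical CPUs, so on a hyper-threaded machine it may be twice the number of physical cores.

```python
import os
import shlex


def mpiexec_command(program, nprocs=None):
    """Build an mpiexec argument list; defaults to one process per logical CPU."""
    if nprocs is None:
        # os.cpu_count() can return None on unusual systems, so fall back to 1.
        nprocs = os.cpu_count() or 1
    return ["mpiexec", "-n", str(nprocs), program]


if __name__ == "__main__":
    cmd = mpiexec_command("./my_program")  # hypothetical program name
    # Only prints the command; pass cmd to subprocess.run(cmd) to launch it.
    print(shlex.join(cmd))
```

Once the job is running, 'top' (or a similar tool) will still tell you whether the cores are actually busy.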


How to increase processing power in SLURM? (nodes/cores/tasks?)

I would like to increase the processing power of my jobs but am not sure how to go about this. At the moment I am requesting 1 node on SLURM (#SBATCH --nodes 1) but am not sure whether I should request more cores or more nodes. I know that my workplace HPC has 44 cores on each node, so am I currently using all 44 cores and need to request an additional 44? Or does this command just request one core from this node by default, and I need to find a way to request more cores from that node?
I also know that commands like --ntasks=1, --ntasks-per-node 10 and --cpus-per-task=4 modify number of tasks, but I think all my code is run sequentially (I'm not using threading modules or anything like that) so is there any use in doing this?
EDIT: I've changed my code from
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 10
(originally copied from someone else, no idea what it's doing) to
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 10
Any advice appreciated
Increasing the number of nodes is only worth it if your application benefits from distributed computing (using MPI, for example). This is the case for most HPC applications. Whether increasing the number of nodes or cores is better depends heavily on the target application (and on low-level details of the target platform). Hybrid applications that communicate a lot tend to perform better with more cores, while memory-bound ones need more nodes to run faster. Note that using more cores generally makes better use of the available HPC resources, because the remaining cores of a node are otherwise usually left unused and wasted (HPC clusters/supercomputers that let multiple users share the cores of one node simultaneously are very rare). Many HPC applications do not scale well in shared memory, though (often because of I/O or memory saturation, or poor support for NUMA platforms). This is a complex topic that researchers have been working on for several decades.
I think all my code is run sequentially (I'm not using threading modules or anything like that) so is there any use in doing this?
You cannot speed up your application by using more cores or more nodes if it uses only one process with one thread (and does not use accelerators like GPUs). You need to parallelize your application first. There are many tools for that, starting with OpenMP and MPI (for the basics). There is no free lunch.
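To make "parallelize first" concrete, here is a minimal sketch in Python that spreads independent work items over the CPUs SLURM granted to the task. It assumes the job was submitted with --cpus-per-task (SLURM then sets the SLURM_CPUS_PER_TASK environment variable) and that the work splits into independent pieces; `work`, `allocated_cpus` and `run_parallel` are placeholder names, not part of any SLURM API.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def allocated_cpus():
    """CPUs granted to this task; falls back to 1 outside a --cpus-per-task job."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))


def work(x):
    # Placeholder for one independent, CPU-bound unit of work.
    return x * x


def run_parallel(items, workers):
    # One worker process per allocated CPU; results come back in input order.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, items))


if __name__ == "__main__":
    print(run_parallel(range(20), allocated_cpus()))
```

With this kind of single-node, multi-process code, a request of --nodes 1 --ntasks 1 --cpus-per-task 10 is the shape that matches; extra nodes would only help a program that communicates across machines, for example via MPI.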

Optimal parallelism for a given project with GNU make

I'd like to know the optimal number of cores needed to build a project with GNU make.
I can use --max-load to tune for an existing system, but I want to know if doubling or tripling the core count and memory would improve build wall clock times.
If I could collect statistics on how many recipes make holds waiting for a free core to execute and how long they occupy the core, this could be turned into a standard job scheduling problem.
I don't think there's any way to answer your question, really. Maybe you can be more specific about what you'd like to know.
Obviously the more cores you have, assuming sufficient memory to support them, then the more recipes make can invoke in parallel without crushing your system.
If you have 2 cores and you run make -j300 then make will dutifully invoke 300 jobs at once and your operating system will dutifully attempt to run all of them at the same time. Most likely, your system will be swapping and context switching so much that it will make very little progress and it would take less wall clock time to run make -j2 instead.
On the other hand, if you have 256 cores then make -j300 is probably quite reasonable... assuming you have enough memory that those jobs don't end up waiting on swap.
And of course, at some point (but probably far away from any reasonable number of cores you have unless you have a lot of money to spend) you will run into disk IO issues with so many compiler processes running at the same time trying to read source from the disk to compile.
My go-to number is num cpus + 1. This is based on a lot of informal benchmarks, and is usually very close to the optimal number: -j9 on a hyper-threaded four-core laptop, and -j49 on my usual production build server.
The + 1 means that make keeps all the CPUs occupied, even as jobs are being retired, and is usually a teensy-weensy bit faster than without the increment.
It also means that other users can use the same multiplier without melting the machine.
Be aware, though, that although -j49 ensures make runs at most 49 recipes at once, the parent make will potentially have many more child processes than that. For instance, a single compile may mean the shell is called, which calls a shell script, which calls the compiler driver, which calls the correct compiler stage. On some toolchains my -j49 builds have a peak of 245 child processes. A bit annoying when my ulimit max user processes is only 512.
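The "num cpus + 1" rule of thumb above can be sketched as a small Python helper. The function names are invented for illustration, and `os.cpu_count()` counts logical CPUs, which matches the -j9 figure quoted for a hyper-threaded four-core laptop.

```python
import os


def make_jobs(ncpus=None):
    """Return the -j value suggested by the 'num cpus + 1' rule of thumb."""
    if ncpus is None:
        # Logical CPU count; fall back to 1 if the OS cannot report it.
        ncpus = os.cpu_count() or 1
    return ncpus + 1


def make_command(ncpus=None):
    """Build a make argument list using that job count."""
    return ["make", f"-j{make_jobs(ncpus)}"]


if __name__ == "__main__":
    print(" ".join(make_command()))
```

This is only a starting point: whether more cores would shorten the build still depends on how much of the dependency graph can actually run in parallel.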

MPICH-p4 alternative to -nolocal flag

Is there an alternative for the -nolocal option when I run an MPI program using mpiexec of MPICH-p4?
If all you want to do is run all of your processes locally, don't provide a hostfile (or provide one that only includes localhost). Keep in mind that you're severely limiting how much parallelization you can do to essentially the number of cores you have. After that, you start to oversubscribe them and you can run out of resources quickly.

MPI on a multicore machine

My situation is quite simple: I want to run MPI-enabled software on a single multiprocessor/multicore machine with, let's say, 8 cores.
My implementation of MPI is MPICH2.
As I understand I have a few options:
$ mpiexec -n 8 my_software
$ mpiexec -n 8 -hosts {localhost:8} my_software
or I could also tell Hydra to "fork" and not "ssh":
$ mpiexec -n 8 -launcher fork my_software
Could you tell me if there will be any differences or if the behavior will be the same?
Of course as all my nodes will be on the same machine I don't want "message passing" to be done through the network (even the local loop) but through shared memory. As I understood MPI will figure that out itself and that will be the case for all the three options.
Simple answer:
All methods should lead to the same performance. You'll have 8 processes running on the cores and using shared memory.
Technical answer:
"fork" has the advantage of compatibility on systems where rsh/ssh process spawning would be a problem, but it can, I guess, only start processes locally.
In the end (unless MPI is weirdly configured) all processes on the same machine will end up using "shared memory", and the launcher or the host specification method should not matter for this. The communication method is handled by another parameter (-channel ?).
The specific syntax of the host specification can make it possible to bind processes to specific CPU cores, in which case you might get slightly better or worse performance depending on your application.
If you've got everything set up correctly then I don't see that your program's behaviour will depend on how you launch it, unless that is it fails to launch under one or other of the options. (Which would mean that you didn't have everything set up correctly in the first place.)
If memory serves me well the way in which message passing is implemented depends on the MPI device(s) you use. It used to be that you would use the mpi ch_shmem device. This managed the passing of messages between processes but it did use buffer space and messages were sent to and from this space. So message passing was done, but at memory bus speed.
I write in the past tense because it's a while since I was that close to the hardware that I knew (or, frankly, cared) about low-level implementation details and more modern MPI installations might be a bit, or a lot, more sophisticated. I'll be surprised, and pleased, to learn that any modern MPI installation does, in fact, replace message-passing with shared memory read/write on a multicore/multiprocessor machine. I'll be surprised because it would require translating message-passing into shared memory access and I'm not sure that that is easy (or easy enough to be feasible) for the whole of MPI. I think it's far more likely that current implementations still rely on message-passing across the memory bus through some buffer area. But, as I state, that's only my best guess and I'm often wrong on these matters.

Group MPI tasks by host

I want to easily perform collective communications independently on each machine of my cluster. Let's say I have 4 machines with 8 cores on each, my MPI program would run 32 MPI tasks. What I would like is, for a given function:
on each host, only one task performs a computation, the other tasks do nothing during this computation. In my example, 4 MPI tasks will do the computation, 28 others are waiting.
once the computation is done, each MPI task on each host will perform a collective communication ONLY with local tasks (tasks running on the same host).
Conceptually, I understand I must create one communicator for each host. I searched around and found nothing that explicitly does this. I am not really comfortable with MPI groups and communicators. Here are my two questions:
is MPI_Get_processor_name unique enough for such behaviour?
more generally, do you have a piece of code doing that?
The specification says that MPI_Get_processor_name returns "A unique specifier for the actual (as opposed to virtual) node", so I think you'd be ok with that. I guess you'd do a gather to assemble all the host names and then assign groups of processes to go off and make their communicators; or dup MPI_COMM_WORLD, turn the names into integer hashes, and use MPI_Comm_split to partition the set.
You could also take the approach janneb suggests and use implementation-specific options to mpirun to ensure that the MPI implementation assigns tasks that way; OpenMPI uses --byslot to generate this ordering; with mpich2 you can use -print-rank-map to see the mapping.
But is this really what you want to do? If the other processes are sitting idle while one processor is working, how is this better than everyone redundantly doing the calculation? (Or is this very memory or I/O intensive, and you're worried about contention?) If you're going to be doing a lot of this -- treating on-node parallelization very different from off-node parallelization -- then you may want to think about hybrid programming models - running one MPI task per node and MPI_spawning subtasks or using OpenMP for on-node communications, both as suggested by HPM.
I don't think (educated thought, not definitive) that you'll be able to do what you want entirely from within your MPI program.
The response of the system to a call to MPI_Get_processor_name is system-dependent; on your system it might return node00, node01, node02, node03 as appropriate, or it might return my_big_computer for whatever processor you are actually running on. The former is more likely, but it is not guaranteed.
One strategy would be to start 32 processes and, if you can determine what node each is running on, partition your communicator into 4 groups, one on each node. This way you can manage inter- and intra-communications yourself as you wish.
Another strategy would be to start 4 processes and pin them to different nodes. How you pin processes to nodes (or processors) will depend on your MPI runtime and any job management system you might have, such as Grid Engine. This will probably involve setting environment variables -- but you don't tell us anything about your run-time system so we can't guess what they might be. You could then have each of the 4 processes dynamically spawn a further 7 (or 8) processes and pin those to the same node as the initial process. To do this, read up on the topic of intercommunicators and your run-time system's documentation.
A third strategy, now it's getting a little crazy, would be to start 4 separate MPI programs (8 processes each), one on each node of your cluster, and to join them as they execute. Read about MPI_Comm_connect and MPI_Open_port for details.
Finally, for extra fun, you might consider hybridising your program, running one MPI process on each node, and have each of those processes execute an OpenMP shared-memory (sub-)program.
Typically you can control how your MPI runtime distributes tasks over nodes, e.g. through environment variables. The default tends to be sequential allocation, that is, for your example with 32 tasks distributed over 4 8-core machines you'd have
machine 1: MPI ranks 0-7
machine 2: MPI ranks 8-15
machine 3: MPI ranks 16-23
machine 4: MPI ranks 24-31
And yes, MPI_Get_processor_name should get you the hostname so you can figure out where the boundaries between hosts are.
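The hostname-based grouping discussed in this thread can be illustrated without any MPI calls. The Python sketch below assumes every rank already holds the full list of processor names indexed by rank (what a gather of the MPI_Get_processor_name results would give) and derives the integer "color" each rank would pass to MPI_Comm_split, plus its rank within the host. The function names and the node00..node03 hostnames are invented for the example.

```python
def host_colors(hostnames):
    """Map each rank (list index) to an integer color, one color per distinct host."""
    color_of = {}
    colors = []
    for name in hostnames:
        # The first time a host appears it gets the next unused color.
        colors.append(color_of.setdefault(name, len(color_of)))
    return colors


def local_ranks(colors):
    """Rank of each process within its own host group, in global rank order."""
    seen = {}
    result = []
    for color in colors:
        result.append(seen.get(color, 0))
        seen[color] = seen.get(color, 0) + 1
    return result


if __name__ == "__main__":
    # Simulated sequential allocation: 32 ranks over 4 hosts with 8 cores each.
    names = [f"node{rank // 8:02d}" for rank in range(32)]
    colors = host_colors(names)
    locals_ = local_ranks(colors)
    for rank in (0, 7, 8, 31):
        print(rank, names[rank], colors[rank], locals_[rank])
```

In a real program the rank with local rank 0 on each host would be the natural choice for the one task that does the computation, with the result then broadcast over the per-host communicator.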
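The modern MPI 3 answer to this is to call MPI_Comm_split_type with the split type MPI_COMM_TYPE_SHARED, which splits a communicator into sub-communicators of processes that can share memory (in practice, one communicator per node).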
