MPICH-p4 alternative to -nolocal flag

Is there an alternative for the -nolocal option when I run an MPI program using mpiexec of MPICH-p4?

If all you want to do is run all of your processes locally, don't provide a hostfile (or provide one that only includes localhost). Keep in mind that you're severely limiting how much parallelization you can do to essentially the number of cores you have. After that, you start to oversubscribe them and you can run out of resources quickly.

Related

Optimal parallelism for a given project with GNU make

I'd like to know the optimal number of cores needed to build a project with GNU make.
I can use --max-load to tune for an existing system, but I want to know if doubling or tripling the core count and memory would improve build wall clock times.
If I could collect statistics on how many recipes make holds waiting for a free core to execute and how long they occupy the core, this could be turned into a standard job scheduling problem.
I don't think there's any way to answer your question, really. Maybe you can be more specific about what you'd like to know.
Obviously the more cores you have, assuming sufficient memory to support them, then the more recipes make can invoke in parallel without crushing your system.
If you have 2 cores and you run make -j300 then make will dutifully invoke 300 jobs at once and your operating system will dutifully attempt to run all of them at the same time. Most likely, your system will be swapping and context switching so much that it will make very little progress and it would take less wall clock time to run make -j2 instead.
On the other hand, if you have 256 cores then make -j300 is probably quite reasonable... assuming you have enough memory that those jobs don't stall while memory is swapped out.
And of course, at some point (but probably far away from any reasonable number of cores you have unless you have a lot of money to spend) you will run into disk IO issues with so many compiler processes running at the same time trying to read source from the disk to compile.
My go-to number is num cpus + 1. This is based on a lot of informal benchmarks, and is usually very close to the optimal number. That means -j9 on a hyper-threaded four-core laptop (8 logical CPUs), and -j49 on my usual production build server.
The + 1 means that make keeps all the CPUs occupied, even as jobs are being retired, and is usually a teensy-weensy bit faster than without the increment.
It also means that other users can use the same multiplier without melting the machine.
Be aware though, that although -j49 ensures there are only 49 processes actually running, the parent make will potentially have many more child processes than that. For instance, a single compile may mean the shell is called, which calls a shell script, which calls the compiler driver, which calls the correct compiler stage. On some toolchains my -j49 builds have a peak of 245 child processes. A bit annoying when my ulimit max user processes is only 512.
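The "num cpus + 1" rule of thumb above can be sketched as a small shell helper. This is only an illustration: the function names jobs_for_cpus and detect_cpus are made up for this example, and it assumes either nproc (GNU coreutils) or getconf is available to count logical CPUs.

```shell
#!/bin/sh
# Sketch of the "num cpus + 1" rule of thumb for make -j.
# jobs_for_cpus and detect_cpus are hypothetical helper names.

# Print the job count for a given number of logical CPUs.
jobs_for_cpus() {
    echo $(( $1 + 1 ))
}

# Print the number of logical CPUs: prefer nproc, fall back to getconf.
detect_cpus() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN
    fi
}

# Usage (commented out so that sourcing this file does not start a build):
# make -j"$(jobs_for_cpus "$(detect_cpus)")"
```

With 8 logical CPUs this gives -j9, matching the laptop example above. On a shared machine, GNU make's -l (--max-load) option can be added to the same command line to hold back new jobs while the load average is high.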

How can I run an mpi executable so that top shows one process

Is it possible to run an mpi executable using multiple threads so that on doing "top" one sees only one process with the full cpu usage?
For example, if I run "mpiexec -np 4 ./executable" and do "top", I see 4 processes with different PIDs, each using 100% cpu. I would like to see a single process with a unique PID using 400% cpu.
No, that is not possible. MPI is explicitly designed for separate processes. They will inevitably show up as separate processes on top.
Now there may be some esoteric MPI implementations based on threads instead of processes, but I highly doubt these would be conforming and practically usable MPI implementations.
Edit:
1.-2. The program atop in "Accumulated per program" mode (press 'p') might do what you want.
3. Usually, if you kill mpiexec / mpirun, it will terminate all ranks. Otherwise, consider killall.
I can see how that may be convenient for a superficial glance about performance, but you should consider investing in learning more sophisticated performance analysis tools for parallel applications.

MPI on a multicore machine

My situation is quite simple: I want to run MPI-enabled software on a single multiprocessor/multicore machine, say with 8 cores.
My implementation of MPI is MPICH2.
As I understand I have a few options:
$ mpiexec -n 8 my_software
$ mpiexec -n 8 -hosts {localhost:8} my_software
or I could also specify Hydra to "fork" and not "ssh";
$ mpiexec -n 8 -launcher fork my_software
Could you tell me whether there will be any differences, or whether the behavior will be the same?
Of course as all my nodes will be on the same machine I don't want "message passing" to be done through the network (even the local loop) but through shared memory. As I understood MPI will figure that out itself and that will be the case for all the three options.
Simple answer:
All methods should lead to the same performance. You'll have 8 processes running on the cores and using shared memory.
Technical answer:
"fork" has the advantage of compatibility on systems where rsh/ssh process spawning would be a problem, but I guess it can only start processes locally.
In the end (unless MPI is weirdly configured) all processes on the same machine will end up using "shared memory", and the launcher or the host specification method should not matter for this. The communication method is handled by another parameter (-channel ?).
Some host specification syntax lets you bind processes to a specific CPU core, in which case you might get slightly better or worse performance depending on your application.
If you've got everything set up correctly then I don't see that your program's behaviour will depend on how you launch it, unless that is it fails to launch under one or other of the options. (Which would mean that you didn't have everything set up correctly in the first place.)
If memory serves me well the way in which message passing is implemented depends on the MPI device(s) you use. It used to be that you would use the mpi ch_shmem device. This managed the passing of messages between processes but it did use buffer space and messages were sent to and from this space. So message passing was done, but at memory bus speed.
I write in the past tense because it's a while since I was that close to the hardware that I knew (or, frankly, cared) about low-level implementation details and more modern MPI installations might be a bit, or a lot, more sophisticated. I'll be surprised, and pleased, to learn that any modern MPI installation does, in fact, replace message-passing with shared memory read/write on a multicore/multiprocessor machine. I'll be surprised because it would require translating message-passing into shared memory access and I'm not sure that that is easy (or easy enough to be feasible) for the whole of MPI. I think it's far more likely that current implementations still rely on message-passing across the memory bus through some buffer area. But, as I state, that's only my best guess and I'm often wrong on these matters.

Multitasking on Linux with multiple CPUs

I feel my question is quite basic, but I couldn't find any related SO question.
I need to run a program a few thousand times (with different input each time), and currently this is done by a shell script. The machine runs Ubuntu and has 8 CPUs (as revealed by cat /proc/cpuinfo). Using top I see that only 1 CPU is utilized. In order to speed things up, I want to utilize all 8 CPUs. I know I can start the program in the background and then call it again (and indeed top reveals that 2 CPUs are utilized in that case), so I can change my shell script to call the program in groups of 8. My question is: is that a recommended way to utilize all CPUs, or is there another, somewhat 'cleaner' way?
You can use cpu affinity to be explicit about the processor for the processes.
http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
However, if each process runs on a cpu (as it should, the kernel will make sure that things are running as efficiently as possible), then just fire n processes off (8 in your case, or make your shell script figure out what n is so your script is a bit more robust, or make it a command line option) and let the kernel do it for you. Each time a process ends, fire off another process until you are done.
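The "fire off n processes, and start another each time one ends" approach above can be sketched with xargs -P, which keeps a fixed number of child processes alive at once. This is a minimal sketch, not the answerer's own script: run_all is a hypothetical helper name, and it assumes an xargs that supports -P (GNU and BSD versions do) and input names without whitespace.

```shell
#!/bin/sh
# run_all N CMD...: read one input per line on stdin and run CMD on each,
# with at most N processes running at a time. xargs starts the next
# process as soon as one of the running ones exits.
# run_all is a hypothetical helper name.
run_all() {
    workers=$1
    shift
    xargs -P "$workers" -n 1 "$@"
}

# Usage (./my_program and inputs/ are placeholders):
# ls inputs/* | run_all "$(getconf _NPROCESSORS_ONLN)" ./my_program
```

Because a new process starts only when a slot frees up, the CPUs stay busy even when individual runs take different amounts of time, which running the inputs in fixed groups of 8 does not guarantee.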
Question is overly vague.
That you want to use all the CPUs implies you want the end result as quickly as possible - but a major concern for the performance of multiple instances would be contention for resources (reducing performance) and caching (improving performance).
Splitting the job amongst multiple processes will usually yield results faster. And there are many, many ways of sharding the workload. But without knowing a lot more about what it is doing it is difficult to recommend a particular approach.
Given that you have 8 CPUs, and assuming that the only constrained resource is the CPU, then you don't want to have more than 8 threads running concurrently on the job. So the problem then becomes how you schedule work to ensure that you are using the 8 cores optimally. If you split the work into 8 scripts and run them concurrently, you will initially see all 8 scripts running, but it's very likely, depending on the nature of the work, that the scripts will finish at different times, leaving some cores idle.
So if you really want to use the hardware optimally, that means running 8 processes as daemons, preferably with each process having a CPU affinity set, fed by a message queue. But is it really worthwhile coding all this if you're not going to be running this regularly? Also it may be faster to run just 7 and keep a CPU for handling the queue and other demands placed on the box.

How do I select the no. of processors/cores to run my MPI program on?

I am using mpich2 1.2.1p1 version which has MPD as its default process manager.
When we run mpiexec, we can specify the no. of processes we want to spawn, but I also want to select the no. of processors/cores I want to use. How do I do it?
Also, when we simply spawn n processes, how do we know how many processors/cores are being used?
Please help.
Any sensible operating system will use as many cores as possible on each machine. You should not have to worry about that. When spawning 4 mpi processes on a quad core machine, it is safe to assume that all 4 cores will be used. If not, there is something seriously wrong with the configuration. Anyway, if you really want to be sure, check the CPU usage with for example 'top'.
The number of processes is the number of cores used, as long as you start no more processes than there are cores. The operating system's scheduler will normally spread the processes across the available cores. If you want to make sure you are always using the maximum number of cores on your machine, then use the OS resources on your system to get the number of cores and pass that to the mpiexec call.
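As a rough sketch of getting the core count from the OS and passing it to mpiexec: the snippet below assumes getconf _NPROCESSORS_ONLN is available (it is on Linux and most Unix systems), and ./my_program is a placeholder for your MPI executable. It prints the command instead of running it, so you can check the number first.

```shell
#!/bin/sh
# Count the logical CPUs currently online.
cores=$(getconf _NPROCESSORS_ONLN)

# Print the launch command rather than running it.
# ./my_program is a placeholder for the real MPI executable.
echo "mpiexec -n $cores ./my_program"
```

Note that this counts logical CPUs, so on a hyper-threaded machine it may be twice the number of physical cores.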
