MPI on a multicore machine - mpi

My situation is quite simple: I want to run a MPI-enabled software on a single multiprocessor/core machine, let's say 8.
My implementation of MPI is MPICH2.
As I understand I have a few options:
$ mpiexec -n 8 my_software
$ mpiexec -n 8 -hosts {localhost:8} my_software
or I could also specify Hydra to "fork" and not "ssh";
$ mpiexec -n 8 -launcher fork my_software
Could you tell me if there will be any differences or if the behavior will be the same ?
Of course as all my nodes will be on the same machine I don't want "message passing" to be done through the network (even the local loop) but through shared memory. As I understood MPI will figure that out itself and that will be the case for all the three options.

Simple answer:
All methods should lead to the same performance. You'll have 8 processes running on the cores and using shared memory.
Technical answer:
"fork" has the advantage of compatibility, on systems where rsh/ssh process spawning would be a problem. But can, I guess, only start processes locally.
At the end (unless MPI is weirdly configured) all processes on the same CPU will end up using "shared memory", and the launcher or the host specification method should not matter for this. The communication method is handled by another parameter (-channel ?).
Specific syntax of host specification method can permit to bind processes to a specific CPU core, then you might have slightly better/worse performance depending of your application.

If you've got everything set up correctly then I don't see that your program's behaviour will depend on how you launch it, unless that is it fails to launch under one or other of the options. (Which would mean that you didn't have everything set up correctly in the first place.)
If memory serves me well the way in which message passing is implemented depends on the MPI device(s) you use. It used to be that you would use the mpi ch_shmem device. This managed the passing of messages between processes but it did use buffer space and messages were sent to and from this space. So message passing was done, but at memory bus speed.
I write in the past tense because it's a while since I was that close to the hardware that I knew (or, frankly, cared) about low-level implementation details and more modern MPI installations might be a bit, or a lot, more sophisticated. I'll be surprised, and pleased, to learn that any modern MPI installation does, in fact, replace message-passing with shared memory read/write on a multicore/multiprocessor machine. I'll be surprised because it would require translating message-passing into shared memory access and I'm not sure that that is easy (or easy enough to be feasible) for the whole of MPI. I think it's far more likely that current implementations still rely on message-passing across the memory bus through some buffer area. But, as I state, that's only my best guess and I'm often wrong on these matters.

Related

Why do OpenMPI programs have to be executed using `mpirun`?

Why can MPI(by which I will mean OpenMPI throughout this post) programs not be executed like any other, and instead have to be executed using mpirun?
In other words, why does MPI not simply provide headers/packages/... that you can import and then let you be master in your own house, by letting you use MPI when and where you want, in your sourcecode, and allowing you to compile your own parallel-processing-included-executables?
I'm really a novice, but for example, I feel like the -np argument passed to mpirun could easily be fixed in the sourcecode, or could be prompted for by the program itself, or could be read in from a configuration file, or could be simply configured to use all available cores, whose number will be determined by a surrounding scheduler script anyway, or ....
(Of course, you can argue that there is a certain convenience in having mpirun do this automatically in some sense, but that hardly justifies, in my view, taking away the coder's possibility to write his own executable.)
For example, I really have little experience, but in Python you can do multiprocessing by simply calling functions of the multiprocessing module and then running your script like any other. Of course, MPI provides more than Python's multiprocessing, but if MPI for example has to start a background service, then I still don't understand why it can't do so automatically upon calls of MPI functions in the source.
For another possibly stupid example, CUDA programs do not require cudarun. And for a good reason, since if they did, and if you used both CUDA and MPI in parts of your program, you would now have to execute cudarun mpirun ./foo (or possibly mpirun cudarun ./foo) and if every package worked like this, you would soon have to have a degree in computer science to simply execute a program.
All of this is maybe super important, as you can simply ship each of your MPI executables with a corresponding wrapper script, but this is kind of annoying and I would still be interested in why this design choice was made.
You can spin up processes however you like, you'll need to have some channel to send port information between processes, a command line arg works. I've had to spin up processes manually, but it's far easier and less painful to use a preconstructed communicator. If you have a good reason, you can do it though.
I have a question where I edited a minimal complete example into the question. The key calls are MPI_Open_port, MPI_Comm_accept, MPI_Comm_connect, and MPI_Intercomm_merge. You have to merge the connecting nodes one at a time. If you want to go after this, be sure you have a good idea about the difference between an inter and intracommunicator. Here's the example for you:
Trying to start another process and join it via MPI but getting access violation

Why does process creation using `clone` result in an out-of-memory failure?

I have a process that allocates about 20GB of RAM on a 32GB machine. After some events, I'm streaming the data from the parent process to stdin of the child process. It's mandatory to keep the 20GB of data in the parent process at the point when the child is spawned.
The app is written in Rust and I'm calling Command::new('path/to/command') to create the child process.
When I spawn the child process the operating system is trapping an out-of-memory error.
strace output:
[pid 747] 16:04:41.128377 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff4c7f87b10) = -1 ENOMEM (Cannot allocate memory)
Why does the trap occur? The child process should not consume more than 1GB and exec() is called immediately after clone().
The Problem
When a child process is created by the Rust call, several things happen at a C/C++ level. This is a simplification, but it will help explain the dilemma.
The streams are duplicated (with dup2 or a similar call)
The parent process is forked (with the fork or clone system call)
The forked process executes the child (with call from the execvp family)
The parent and child are now concurrent processes. The Rust call you are currently using appears to be a clone call that is behaving much like a pure fork, so you're 20G x 2 - 32G = 8G short, without considering the space needed by the operating system and anything else that might be running. The clone call is returning with a negative return value and errno is set by the call to ENOMEM errno.
If the architectural solutions of adding physical memory, compressing the data, or streaming it through a process that does not require the entirety of it to be in memory at any one time are not options, then the classic solution is reasonably simple.
Recommendation
Design the parent process to be lean. Then spawn two worker children, one that handles your 20GB need and the other that handles your 1 GB need1. These children can be connected to one another via pipe, file, shared memory, socket, semaphore, signalling, and/or other communication mechanism(s), just as a parent and child can be.
Many mature software packages from Apache httpd to embedded cell tower routing daemons use this design pattern. It is reliable, maintainable, extensible, and portable.
The 32G would then likely suffice for the 20G and 1G processing needs, along with OS and lean parent process.
Although this solution will surely solve your problem, if the code is to be reused or extended later, there may be value in looking into potential process design changes involving data frames or multidimensional slices to support streaming of data and memory requirement reductions.
Memory Overcommit Always
Setting overcommit_memory to 1 eliminates the clone error condition referenced in the question because the Rust call calls the LINUX clone call that reads that setting. But there are several caveats with this solution that point back to the above recommendation as superior, primarily that the value of 1 is dangerous, especially for production environments.
Background
Kernel discussions about OpenBSD rfork and the clone call ensued in the late 1990s and early 2000s. The features stemming from those discussions permit less extreme forking than processes, which is symmetrically like the provision of more extensive independence between pthreads. Some of these discussions have produced extensions to the traditional process spawning that have entered POSIX standardization.
In the early 2000s, Linux Torvalds suggested a flag structure to determine what components of the execution model are shared and what are copied when execution forks, blurring the distinction between processes and threads. From this, the clone call emerged.
Over-committing memory is not discussed much if any in those threads. The design goal was MORE control of the results of a fork rather than the delegation of memory usage optimization to an operating system heuristic, which is what the default setting of overcommit_memory = 0 does.
Caveats
Memory overcommit goes beyond these extensions, adding the complexity of trade-offs of its modes2, design trend caveats3, practical run time limitations4, and performance impacts5.
Portability and Longevity
Additionally, without standardization, the code using memory overcommit may not be portable, and the question of longevity is pertinent, especially when a setting controls the behavior of a function. There is no guarantee of backward compatibility or even some warning of deprication if the setting system changes.
Danger
The linuxdevcenter documentation2 says, "1 always overcommits. Perhaps you now realize the danger of this mode.", and there are other indications of danger with ALWAYS overcommitting 6, 7.
The implementers of overcommit on LINUX, Windows, and VMWare may guarantee reliability, but it is a statistical game that, combined with the many other complexities of process control, may lead to certain unstable characteristics under certain conditions. Even the name overcommit tells us something about its true character as a practice.
A non-default overcommit_memory mode, for which several warnings are issues, but works for the immediate trial of the immediate case may later lead to intermittent reliability.
Predictability and Its Impact on System Reliability and Response Time Consistency
The idea of a process in a UNIX like operating system, from its Bell Labs beginnings, is that a process makes a concrete requests to its container, the operating system. The result both predictable and binary. Either the request is denied or granted. Once granted, the process is given complete control and direct access over the resources until the use of it is relinquished by the process.
The swap space aspect of virtual memory is a breach of this principle that appears as gross deceleration of activity on workstations, when RAM is heavily consumed. For instance, there are times during development when one presses a key and has to wait ten seconds to see the character on the display.
Conclusion
There are many ways to get the most out of physical memory, but doing so by hoping that use of memory allocated will be sparse will likely introduce negative impacts. Performance hits from swapping when overcommit is overused is the well documented example. If you are keeping 20G of data in RAM, this may particularly be the case.
Only allocating what is needed, forking in intelligent ways, using threads, and freeing memory that is surely no longer needed lead to memory thrift without impacting reliability, creating spikes in swap disk usage, and can operate without caveat up to the limits of system resources.
The position of the designer of the Command::new call may be based on this perspective. In this case, how soon after the fork the exec is called is not a determining factor in how much memory is requested during the spawn.
Notes and References
[1] Spawning worker children may require some code refactoring and appear to be too much trouble on a superficial level, but the refactoring may be surprisingly straightforward and significantly beneficial.
[2] http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2
[3] https://www.etalabs.net/overcommit.html
[4] http://www.gabesvirtualworld.com/memory-overcommit-in-production-yes-yes-yes/
[5] https://labs.vmware.com/vmtj/memory-overcommitment-in-the-esx-server
[6] https://github.com/kubernetes/kubernetes/issues/14452
[7] http://linuxtoolkit.blogspot.com/2011_08_01_archive.html

FreeSWITCH minimal installation and module selection

As someone who is very new to the opensource PBX projects like Asterisk and FreeSWITCH, I am grappling with some information overload. Have read the basic FreeSWITCH docs on Wiki, but still have few questions. Since I am not very familiar with the terminology, I will try to use close approximations.
Trying to create a small/minimalistic build of FreeSWITCH, that needs to run on an rather old laptop (Celeron 1GHz, 512MB RAM, 20GB HDD, already running Debian "Wheezy"), and set it up as a 6-port GSM-SIP/Jabber gateway. So, by "small" and "minimalistic", I mean one which doesn't have modules/optional-software that is not absolutely necessary (e.g. no need for IVR announcements, or Skype integration) -- to keep memory footprint smallest, and occupy less hard-disk real-estate.
The rough idea is to have 6 GSM ports (via 'GSM-open module', similar to chan_dongle) towards public telephony network, and about 60 SIP extension, and support upto 6 calls involving GSM ports, and about 6 SIP-SIP calls (intra PBX), on this setup. I have read that the CPU overhead of GSMopen module is pretty low, so I am guessing this is possible.
Can someone confirm this to be a realistic goal?
What might be the minimum set of modules to select for minimalistic build?
For modules not chosen during initial build, can those be added later? If so, would it require me to rebuild FreeSWITCH completely, only the modules, or that everything would be built, but only configuration changes would be required to ensure that modules are loaded, and configure?
Is there any rough estimate of what might be the maximum call-rate that could be supported in such a configuration? For SIP-SIP calls? Given the underpowered processor, and little RAM (as per modern standards), I am guessing that both shall be bottlenecks, but adding RAM might still be possible (even if costly and difficult).
I have read that "hooks" can be created using Lua/Python/Java etc.. However if someone share share few examples of what-all is possible using such hooks, it would make the concept clearer. Can one hope to write an application like "missed call log" or "redirect on no answer" using these hooks?
Can someone confirm this to be a realistic goal?
Yes, this is quite realistic. You need to target as little as possible transcoding, because that's where CPU resources are needed. But even with a 1Ghz Celeron, 6 transcoded sessions seem quite realistic. But it needs testing :)
What might be the minimum set of modules to select for minimalistic build?
Just start with the default list of modules, and add gsmopen (I have no experience with gsm gateways, can't help with that part). The memory footprint is pretty low, and you may need some of those modules later.
For modules not chosen during initial build, can those be added later?
as far as I remember, Wiki describes this process. You edit modules.conf and make the specific module.
Is there any rough estimate of what might be the maximum call-rate that could be supported in such a configuration? For SIP-SIP calls? Given the underpowered processor, and little RAM (as per modern standards), I am guessing that both shall be bottlenecks, but adding RAM might still be possible (even if costly and difficult).
It really depends on complexity of your dialplan. Each context consists of a number of conditions, which are doing regexp match on channel variables. So, the more complex your dialplan is, the less CPS you get. But for a 6-channel gateway, I don't see this a problem. GSM network will be much slower than your box :)
I have read that "hooks" can be created using Lua/Python/Java etc.. However if someone share share few examples of what-all is possible using such hooks, it would make the concept clearer. Can one hope to write an application like "missed call log" or "redirect on no answer" using these hooks?
You can control every aspect of FreeSWITCH behavior with FreeSWITCH. There are even examples when the complete dialplan is re-implemented by an external program (Kazoo does that).
The simplest mode of operation is when your Lua/JS/Perl/Python script is launched from within the dialplan: then it receives a "session" object, and you can do whatever you want with the call: play sounds, bridge, forward, make a new call and bridge them together, and so on. Here in my blog there's a little practical example.
Then, you can build an external application which connects to the FS socket and monitors the events and performs actions on active calls.
Also, it can be done in the opposite direction: you run a server, and FS connects to it with its socket library.
Also, you can have an HTTP service which delivers pieces of XML configuration to FreeSWITCH, and it requests those on every call (this would be the most CPU-intensive application). This way, you can feed FS from some internal database, and build fault-tolerant systems.
I hope this helps :)
You can also find me in skype if needed.
FreeSWITCH is not really memory-hungry, and you can simply start with the default set of modules (the best is to use the prebuilt Debian packages). For example, on my 64bit machine, the FreeSWIITH process occupies only 35MB of memory.
freeswitch#vx03:~$ uname -a
Linux vx03 2.6.32-5-xen-amd64 #1 SMP Thu Nov 3 05:42:31 UTC 2011 x86_64 GNU/Linux
freeswitch#vx03:~$ ps -p 11873 v
PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND
11873 ? S<l 10:29 0 0 258136 36852 2.3 /opt/freeswitch/bin/freeswitch -nc -rp -nonat -u freeswitch -g freeswitch
I will go through the rest of your questions later today

MPI_Isend /Irecv: Is it possible to access the sendbuffer on unused memory-locations in the meanwhile

I would like to speedup my MPI- Program with the use of asynchronous communication. But the used time remains the same. The workflow is as followed.
before:
1. MPI_send/ MPI_recv Halo (ca. 10 Seconds)
2. process the whole Array (ca. 12 Seconds)
after:
1. MPI_Isend/ MPI_Irecv Halo (ca. 0,1 Seconds)
2. process the Array (without Halo) (ca. 10 Seconds)
3. MPI_Wait (ca. 10 Seconds) (should be ca. 0 Seconds)
4. process the Halo only (ca. 2 Seconds)
Measurements showed that the communication and processing the Array-core nearly take the same time for common workloads. So asynchronism should nearly hide the communication time.
But it dosn't.
One fact - and I thinks this could be the problem - is that the sendbuffer is also the array the calculations are made on. Is it possible that MPI serializes the memory-access although communication ONLY accesses the Halo (with derived datatype) and the computation ONLY accesses the core (only reading) of the array???
Does anybody know if this is for sure the reason?
Is it maybe implementation-dependend (I'm using OpenMPI)?
Thanks in advance.
It isn't the case that MPI serializes the memory accesses in the user code (that's beyond the library's power to do, in general), and it is true that what exactly does happen is implementation specific.
But as a practical matter, MPI libraries don't do as much communication "in the background" as you might hope, and this is particularly true when using transports and networks like tcp + ethernet, where there's no meaningful way to hand off communication to another set of hardware.
You can only be sure that the MPI library is actually doing something when you're running MPI library code, eg in an MPI function call. Often, a call to any of a number of MPI calls will nudge an implementations "progress engine" that keeps track of in-flight messages and ushers them along. So for instance one thing you can quickly do is to make calls to MPI_Test() on the requests within the compute loop to make sure things start happening well before the MPI_Wait(). There is of course overhead to this, but this is something that's easy to try to measure.
Of course you could imagine the MPI library would use some other mechanism to run things behind the scenes. Both MPICH2 and OpenMPI have played with separate "progress threads" which execute separately from the user code and do this ushering along in the background; but getting that to work well, and without tying up a processor while you're trying to run your computation, is a genuinely difficult problem. OpenMPI's progress threads implementation has long been experimental, and in fact is temporarily out of the current (1.6.x) release, although work continues. I'm not sure about MPICH2's support.
If you are using infiniband, where the network hardware has a lot of intelligence to it, then prospects brighten a bit. If you are willing to leave memory pinned (for the openfabrics), and/or you can use a vendor-specific module (mxm for Mellanox, psm for Qlogic), then things can progress somewhat more rapidly. If you're using shared memory, than the knem kernel module can also help with intranode transport.
One other implementation-specific approach you can take, if memory isn't a big issue, is to try to use eager protocols for sending the data directly, or send more data per chunk so fewer nudges of the progress engine are needed. What eager protocols means here is that data is automatically sent at send time, rather than just initiating a set of handshakes which will eventually lead to the message being sent. The bad news is that this generally requires extra buffer memory for the library, but if that's not a problem and you know the number of incoming messages is bounded (eg, by the number of halo neighbours you have), this can help a great deal. How to do this for (eg) shared memory transport for openmpi is described on the OpenMPI page for tuning for shared memory, but similar parameters exist for other transports and often for other implementations. One nice tool that IntelMPI has is an "mpitune" tool that automatically runs through a number of such parameters for best performance.
The MPI specification states:
A nonblocking send call indicates that the system may start copying
data out of the send buffer. The sender should not modify any part of the
send buffer after a nonblocking send operation is called, until the
send completes.
So yes, you should copy your data to a dedicated send buffer first.

Group MPI tasks by host

I want to easily perform collective communications independently on each machine of my cluster. Let's say I have 4 machines with 8 cores on each, my MPI program would run 32 MPI tasks. What I would like is, for a given function:
on each host, only one task performs a computation, the other tasks do nothing during this computation. In my example, 4 MPI tasks will do the computation, 28 others are waiting.
once the computation is done, each MPI task on each will perform a collective communication ONLY to local tasks (tasks running on the same host).
Conceptually, I understand I must create one communicator for each host. I searched around, and found nothing explicitly doing that. I am not really comfortable with MPI groups and communicators. Here my two questions:
is MPI_Get_processor_name is enough unique for such a behaviour?
more generally, do you have a piece of code doing that?
The specification says that MPI_Get_processor_name returns "A unique specifier for the actual (as opposed to virtual) node", so I think you'd be ok with that. I guess you'd do a gather to assemble all the host names and then assign groups of processors to go off and make their communicators; or dup MPI_COMM_WORLD, turn the names into integer hashes, and use mpi_comm_split to partition the set.
You could also take the approach janneb suggests and use implementation-specific options to mpirun to ensure that the MPI implementation assigns tasks that way; OpenMPI uses --byslot to generate this ordering; with mpich2 you can use -print-rank-map to see the mapping.
But is this really what you want to do? If the other processes are sitting idle while one processor is working, how is this better than everyone redundantly doing the calculation? (Or is this very memory or I/O intensive, and you're worried about contention?) If you're going to be doing a lot of this -- treating on-node parallelization very different from off-node parallelization -- then you may want to think about hybrid programming models - running one MPI task per node and MPI_spawning subtasks or using OpenMP for on-node communications, both as suggested by HPM.
I don't think (educated thought, not definitive) that you'll be able to do what you want entirely from within your MPI program.
The response of the system to a call to MPI_Get_processor_name is system-dependent; on your system it might return node00, node01, node02, node03 as appropriate, or it might return my_big_computer for whatever processor you are actually running on. The former is more likely, but it is not guaranteed.
One strategy would be to start 32 processes and, if you can determine what node each is running on, partition your communicator into 4 groups, one on each node. This way you can manage inter- and intra-communications yourself as you wish.
Another strategy would be to start 4 processes and pin them to different nodes. How you pin processes to nodes (or processors) will depend on your MPI runtime and any job management system you might have, such as Grid Engine. This will probably involve setting environment variables -- but you don't tell us anything about your run-time system so we can't guess what they might be. You could then have each of the 4 processes dynamically spawn a further 7 (or 8) processes and pin those to the same node as the initial process. To do this, read up on the topic of intercommunicators and your run-time system's documentation.
A third strategy, now it's getting a little crazy, would be to start 4 separate MPI programs (8 processes each), one on each node of your cluster, and to join them as they execute. Read about MPI_Comm_connect and MPI_Open_port for details.
Finally, for extra fun, you might consider hybridising your program, running one MPI process on each node, and have each of those processes execute an OpenMP shared-memory (sub-)program.
Typically your MPI runtime environment can be controlled e.g. by environment variables how tasks are distributed over nodes. The default tends to be sequential allocation, that is, for your example with 32 tasks distributed over 4 8-core machines you'd have
machine 1: MPI ranks 0-7
machine 2: MPI ranks 8-15
machine 3: MPI ranks 16-23
machine 4: MPI ranks 24-31
And yes, MPI_Get_processor_name should get you the hostname so you can figure out where the boundaries between hosts are.
The modern MPI 3 answer to this is to call MPI_Comm_split_type

Resources