Why do OpenMPI programs have to be executed using `mpirun`?

Why can MPI (by which I will mean OpenMPI throughout this post) programs not be executed like any other program, and instead have to be executed using mpirun?
In other words, why does MPI not simply provide headers/packages/... that you can import, and then let you be master in your own house by letting you use MPI when and where you want in your source code and compile your own executables with parallel processing built in?
I'm really a novice, but for example, I feel like the -np argument passed to mpirun could easily be fixed in the source code, or could be prompted for by the program itself, or could be read in from a configuration file, or could simply be configured to use all available cores, whose number will be determined by a surrounding scheduler script anyway, or ....
(Of course, you can argue that there is a certain convenience in having mpirun do this automatically in some sense, but that hardly justifies, in my view, taking away the coder's possibility to write his own executable.)
For example, I really have little experience, but in Python you can do multiprocessing by simply calling functions of the multiprocessing module and then running your script like any other. Of course, MPI provides more than Python's multiprocessing, but if MPI for example has to start a background service, then I still don't understand why it can't do so automatically upon calls of MPI functions in the source.
For another possibly stupid example, CUDA programs do not require cudarun. And for a good reason, since if they did, and if you used both CUDA and MPI in parts of your program, you would now have to execute cudarun mpirun ./foo (or possibly mpirun cudarun ./foo) and if every package worked like this, you would soon have to have a degree in computer science to simply execute a program.
All of this is maybe not super important, as you can simply ship each of your MPI executables with a corresponding wrapper script, but this is kind of annoying and I would still be interested in why this design choice was made.

You can spin up processes however you like, but you'll need some channel to pass port information between them; a command-line argument works. I've had to spin up processes manually, but it's far easier and less painful to use a preconstructed communicator. If you have a good reason, you can do it though.
I have a question of my own into which I edited a minimal complete example. The key calls are MPI_Open_port, MPI_Comm_accept, MPI_Comm_connect, and MPI_Intercomm_merge. You have to merge the connecting nodes one at a time. If you want to go down this route, be sure you have a good idea of the difference between an intercommunicator and an intracommunicator. Here's the example for you:

Why does OS need Fork()?

I am teaching myself an introduction to operating systems, and I have the following two questions:
(1) Since the fork() system call is used to duplicate the current process for the sake of multitasking, I'd like to see an example showing that without forking we would not have such multitasking. In other words, I would like to see an example (or an external link) that shows how important fork() is.
(2) Does a zombie process exist because the child process crashed?
Thank you very much
There is no need for fork as such; fork is just the Unix way of creating a process. Older and different systems used something else, such as spawn on VAX/VMS.
Zombies are just traces of dead processes, which let a parent that has been busy find out afterwards that its children have died. Remember that in Unix, a process that terminates lets its parent know the cause of its termination. So the dead process has to store that information somewhere, and the Unix way is to keep a process entry called a zombie: the process is reduced to this small entry and holds no other resources. Every terminated child, whether it exited normally or crashed, stays in this state until its parent collects the status, so a zombie is not a sign of a crash.
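To make the zombie part concrete, here is a minimal sketch in Python (chosen only for brevity; os.fork and os.waitpid are thin wrappers around the Unix calls). The names fork_and_reap and child_state are my own, and the /proc lookup assumes Linux; on other systems it just returns None. The child exits normally, and the parent only removes the leftover entry when it calls waitpid.

```python
import os
import time

def child_state(pid):
    """Return the one-letter process state from /proc (Linux only), else None."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            # Layout is "pid (comm) state ..."; split after the last ')'.
            return f.read().rsplit(")", 1)[1].split()[0]
    except (OSError, IndexError):
        return None

def fork_and_reap(exit_code):
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)          # child: terminate normally, right away

    # Parent: stay "busy" until the dead child shows up as a zombie ('Z').
    state = child_state(pid)
    deadline = time.monotonic() + 2.0
    while state not in (None, "Z") and time.monotonic() < deadline:
        time.sleep(0.01)
        state = child_state(pid)

    # waitpid() collects the stored exit status and removes the zombie entry.
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)

if __name__ == "__main__":
    print(fork_and_reap(7))
```

The second element of the result is the exit code the child stored for its parent; the first is whatever state was observed just before reaping, which on Linux I would expect to be the zombie state.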

How to Call a Go Program from Common Lisp

I have a Go program which cannot be rewritten in Common Lisp for efficiency reasons. How can I run it via Common Lisp?
Options so far:
1. Foreign Function Interface
Using the foreign function interface seems to me like the "correct" way to do this. However, the research I did led directly to a dead end. If this is the winner, what resources are there to learn about how to interface with Go?
2. Sockets
Leaving the Go program running all the time while listening on a port would work. If this is the best way, I'll continue trying to make it work.
3. Execute System Command
This seems all kinds of wrong.
4. Unknown
Or is there an awesome way I haven't thought of yet?
It depends on what you want to do, but 1-3 are all viable options.
1. FFI
To get this to work you will need to use FFI on both the Go and the Lisp side.
You will need to export the appropriate Go functions as C functions, and then call them using CFFI from Lisp. See https://golang.org/cmd/cgo/#hdr-C_references_to_Go on how to export functions in Go. In this case you would build a dynamically linkable library (a .dll or .so file) rather than an executable file.
2. Sockets (IPC)
The second option is to run your Go program as a daemon and use some form of IPC (such as sockets) to communicate between Lisp and Go. This works well if your program is long running, or if it makes sense to have a server and one or more clients (the server could just as easily be the Lisp code as the Go code). Sockets in particular are also more flexible: you could write components in other languages, or change languages for one component without having to change the others, as long as you maintain the same protocol. You could also potentially run the components on separate hardware. However, using sockets may hurt performance.
There are other IPC methods available, such as FIFO files (named pipes), SHM, and message queues, but they are more system dependent than sockets.
3. System command (subprocess)
The third way is to start a subprocess. This is a viable option, but it has some caveats. First of all, the behavior of starting a subprocess depends on both the Lisp implementation and the operating system. UIOP smooths out a lot of the implementation differences, but some are too great to overcome. In particular, depending on the implementation you may or may not be able to run a subprocess in parallel. If not, you will have to run a separate command every time you want to communicate with Go, which means waiting for the process to start up every time you need it. You also may or may not be able to send input to the subprocess after starting it.
Another option is to run a command to start a Go process, communicate with it using sockets or some other IPC, and then run a command to stop the process before closing the Lisp program.
Personally, I think that using sockets is the most attractive option, but depending on your needs, one of the other options might be better suited.
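To show the shape of the socket option, here is a rough sketch of a line-based request/reply protocol. Python stands in for both sides purely for brevity; in practice the server would be the Go program and the client the Lisp image. The uppercase "service", the function names, and the loopback port chosen by the OS are all assumptions made for illustration.

```python
import socket
import threading

def serve_one_client(listener):
    """Stand-in for the Go daemon: answer each request line with its uppercase form."""
    conn, _ = listener.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        for line in reader:                      # ends when the client disconnects
            writer.write(line.strip().upper() + "\n")
            writer.flush()

def request(port, text):
    """Stand-in for the Lisp side: send one line, read one line back."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        with conn.makefile("r", encoding="utf-8") as reader, \
                conn.makefile("w", encoding="utf-8") as writer:
            writer.write(text + "\n")
            writer.flush()
            return reader.readline().strip()

def demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))              # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_one_client, args=(listener,), daemon=True)
    server.start()
    reply = request(port, "hello")
    server.join(timeout=5)
    listener.close()
    return reply

if __name__ == "__main__":
    print(demo())
```

Because the two sides only agree on "one line in, one line out", either of them can be rewritten in another language without touching the other, which is the flexibility mentioned above.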
CFFI is for calling C from Common Lisp. It's an easy way to get new features without too much hassle, as the libraries out there are usually written in C or have a C interface. If you can make a C library from your Go source, then you can do this and use it from CL through the foreign function interface.
Sockets (or another two-way communication bus) are good if the Go program is a service that is supposed to provide something, e.g. an application server that serves HTTP requests. If you only need to use the Go program once per run of the CL program, this usually isn't the way to go.
A subprocess is best if you can run your application with arguments and get a result that is used in Common Lisp. It's not good if you are going to use the Go program many times, as each call has startup overhead (in which case sockets would be better).
The awesome way to do this is to make the whole thing in Common Lisp. If you choose an implementation that has a good compiler and write it well, you might get away with having the application as a CL image. If you need to speed things up, you can focus on the slow parts and optimize them, or you can use CFFI and write the optimizations in C. There is even inline C for SBCL, where you can write C right where you want to optimize in CL, without putting the optimizations in their own file and compiling and linking them separately.

Abstract implementation of non-blocking MPI calls

Non-blocking sends/recvs return immediately in MPI and the operation is completed in the background. The only way I see that happening is that the current process/thread creates another process/thread, loads an image of the send/recv code into it, and then returns. This new process/thread then completes the operation and sets a flag somewhere, which the Wait/Test call reports. Am I correct?
There are two main ways that progress can happen:
1. In a separate thread. This is usually an option in most MPI implementations (usually at configure/compile time). In this version, as you speculated, the MPI implementation has another thread that runs a separate progress engine. That thread manages all of the MPI messages and the sending/receiving of data. This way works well if you're not using all of the cores on your machine, as it makes progress in the background without adding overhead to your other MPI calls.
2. Inside other MPI calls. This is the more common way of doing things and is the default for most implementations, I believe. In this version, non-blocking calls are started when you initiate the call (MPI_I<something>) and are essentially added to an internal queue. Nothing (probably) happens on that call until you later make another MPI call that actually does some blocking communication (or waits for the completion of previous non-blocking calls). When you enter that future MPI call, in addition to doing whatever you asked it to do, it will run the progress engine (the same thing that's running in a thread in version #1). Depending on what that MPI call is supposed to do, the progress engine may run for a while or may just run through once. For instance, if you called MPI_WAIT on an MPI_IRECV, you'll stay inside the progress engine until you receive the message that you're waiting for. If you are just doing an MPI_TEST, it might just cycle through the progress engine once and then jump back out.
More exotic methods. As Jeff mentions in his post, there are more exotic methods that depend on the hardware on which you're running. You may have a NIC that will do some magic for you in terms of moving your messages in the background, or some other way to speed up your MPI calls. In general, these are very specific to the implementation and hardware on which you're running, so if you want to know more about them, you'll need to be more specific in your question.
All of this is specific to your implementation, but most of them work in some way similar to this.
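As a toy illustration of the "progress inside other MPI calls" strategy, here is a small Python model. It is not how any real MPI library is written and it sends nothing over a network; ToyProgressEngine, its methods, and the idea that a message needs a fixed number of "chunks" are all made up for the sketch. It only shows the control flow: the non-blocking call just queues work, a test call runs the engine once, and a wait call stays in the engine until the request is done.

```python
from collections import deque

class ToyProgressEngine:
    """Toy model: non-blocking calls only enqueue; progress happens in later calls."""

    def __init__(self):
        self.pending = deque()

    def isend(self, payload, chunks=3):
        # Like MPI_Isend in this model: record the request and return immediately.
        req = {"payload": payload, "remaining": chunks, "done": False}
        self.pending.append(req)
        return req

    def _progress_once(self):
        # One pass of the engine: move every pending request forward by one chunk.
        for _ in range(len(self.pending)):
            req = self.pending.popleft()
            req["remaining"] -= 1
            if req["remaining"] == 0:
                req["done"] = True
            else:
                self.pending.append(req)

    def test(self, req):
        # Like MPI_Test in this model: cycle the engine once, then report.
        self._progress_once()
        return req["done"]

    def wait(self, req):
        # Like MPI_Wait in this model: stay in the engine until the request completes.
        while not req["done"]:
            self._progress_once()

if __name__ == "__main__":
    engine = ToyProgressEngine()
    req = engine.isend("hello", chunks=3)
    print(req["done"], engine.test(req))
    engine.wait(req)
    print(req["done"])
```

In the threaded variant, the same _progress_once loop would simply run in a background thread instead of being driven by test and wait.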
Are you asking whether a separate thread for message processing is the only solution for non-blocking operations?
If so, the answer is no. I even think that many setups use a different strategy. Usually, progress on message processing is made during all MPI calls. I'd recommend having a look at this blog entry by Jeff Squyres.
See the answer by Wesley Bland for a more complete answer.

MPI on a multicore machine

My situation is quite simple: I want to run MPI-enabled software on a single multiprocessor/multicore machine with, let's say, 8 cores.
My implementation of MPI is MPICH2.
As I understand I have a few options:
$ mpiexec -n 8 my_software
$ mpiexec -n 8 -hosts {localhost:8} my_software
or I could also tell Hydra to use "fork" and not "ssh":
$ mpiexec -n 8 -launcher fork my_software
Could you tell me whether there will be any differences, or whether the behavior will be the same?
Of course as all my nodes will be on the same machine I don't want "message passing" to be done through the network (even the local loop) but through shared memory. As I understood MPI will figure that out itself and that will be the case for all the three options.
Simple answer:
All methods should lead to the same performance. You'll have 8 processes running on the cores and using shared memory.
Technical answer:
The "fork" launcher has the advantage of compatibility on systems where rsh/ssh process spawning would be a problem, but I guess it can only start processes locally.
In the end (unless MPI is weirdly configured), all processes on the same CPU will end up using shared memory, and neither the launcher nor the host specification method should matter for this. The communication method is handled by another parameter (-channel ?).
Some host specification syntax lets you bind processes to specific CPU cores, in which case you might get slightly better or worse performance depending on your application.
If you've got everything set up correctly then I don't see that your program's behaviour will depend on how you launch it, unless, that is, it fails to launch under one or other of the options. (Which would mean that you didn't have everything set up correctly in the first place.)
If memory serves me well, the way in which message passing is implemented depends on the MPI device(s) you use. It used to be that you would use the MPICH ch_shmem device. This managed the passing of messages between processes, but it did use buffer space, and messages were sent to and from this space. So message passing was done, but at memory bus speed.
I write in the past tense because it's a while since I was that close to the hardware that I knew (or, frankly, cared) about low-level implementation details and more modern MPI installations might be a bit, or a lot, more sophisticated. I'll be surprised, and pleased, to learn that any modern MPI installation does, in fact, replace message-passing with shared memory read/write on a multicore/multiprocessor machine. I'll be surprised because it would require translating message-passing into shared memory access and I'm not sure that that is easy (or easy enough to be feasible) for the whole of MPI. I think it's far more likely that current implementations still rely on message-passing across the memory bus through some buffer area. But, as I state, that's only my best guess and I'm often wrong on these matters.

A program to kill long-running runaway programs

I manage Unix systems where, sometimes, programs like CGI scripts run forever, sometimes eating a lot of CPU time and wasting resources.
I want a program (typically invoked from cron) which can kill these runaways, based on the following criteria (combined with AND and OR):
Name (given by a regexp)
CPU time used
Elapsed time (for programs which are blocked on I/O)
I do not really know what to type into a search engine for this sort of program. I certainly could write it myself in Python, but I'm lazy, and maybe a good program already exists?
(I did not tag my question with a language name since a program in Perl or Ruby or whatever would work as well)
Try using system-level quota enforcement instead. Most systems allow you to set a per-process CPU time limit for different users.
Linux: /etc/security/limits.conf
FreeBSD: /etc/login.conf
CGI scripts can usually be run under their own user ID, for example using mod_suid for Apache.
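The limits in those files are enforced through the kernel's per-process resource limits. As a rough sketch of the same mechanism applied to a single launch instead of a whole user, the snippet below sets RLIMIT_CPU on a child process using Python's standard resource module. The helper name run_with_cpu_limit and the busy-loop "runaway" are only for illustration; the sketch assumes a Unix system and that the requested limit does not exceed the existing hard limit.

```python
import resource
import signal
import subprocess
import sys

def run_with_cpu_limit(argv, cpu_seconds):
    """Run argv with a soft CPU-time limit; return the child's return code."""
    def apply_limit():
        # Runs in the child between fork() and exec(); keeps the existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, hard))
    return subprocess.run(argv, preexec_fn=apply_limit).returncode

if __name__ == "__main__":
    # Hypothetical runaway: a busy loop that never finishes on its own.
    code = run_with_cpu_limit([sys.executable, "-c", "while True: pass"], 1)
    print("killed by SIGXCPU" if code == -signal.SIGXCPU else f"exit code {code}")
```

Note that this only covers the CPU-time criterion; a process blocked on I/O uses almost no CPU time and would not be caught by it.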
This might be something more like what you were looking for:
Most of the watchdog-like programs or libraries just try to see whether a given process is running, so I'd say you'd be better off writing your own, using the existing libraries that give out process information.
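If you do write your own, a minimal sketch could look like the following. It reads process information from ps rather than from a library, and it assumes that ps prints etime and time in the POSIX [[dd-]hh:]mm:ss format (some systems print fractional seconds, which this parser does not handle). The function names and the way the criteria are combined (name must match AND either limit is exceeded) are my own choices; adjust them to your policy.

```python
import os
import re
import signal
import subprocess

def parse_ps_time(text):
    """Convert ps's [[dd-]hh:]mm:ss format into seconds."""
    days, _, clock = text.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)            # pad missing hours
    hours, minutes, seconds = parts
    return ((int(days or 0) * 24 + hours) * 60 + minutes) * 60 + seconds

def find_runaways(ps_output, name_pattern, max_cpu, max_elapsed):
    """Return PIDs whose name matches and whose CPU or elapsed time is over the limit."""
    name_re = re.compile(name_pattern)
    hits = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)  # pid, elapsed, cpu, command name
        if len(fields) < 4:
            continue
        pid, elapsed, cpu, name = fields
        if not name_re.search(name):
            continue
        if parse_ps_time(cpu) > max_cpu or parse_ps_time(elapsed) > max_elapsed:
            hits.append(int(pid))
    return hits

def kill_runaways(name_pattern, max_cpu, max_elapsed, sig=signal.SIGTERM):
    """Intended to be called from cron: list processes, then signal the runaways."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "time=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_runaways(out, name_pattern, max_cpu, max_elapsed):
        os.kill(pid, sig)
```

Keeping find_runaways separate from the ps call and from os.kill makes it possible to try the selection logic on a saved ps listing before letting it kill anything.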
