propagation of contents of argc and argv by MPI runtime - mpi

Is it valid for a conformant MPI program to rely on the MPI runtime to start the process for each rank with the same contents of argc and argv? Or is it necessary to e.g. broadcast things from a designated master rank?

Just to be clear, it is only guaranteed that argc/argv are defined after the call to MPI_Init(), even though the processes all exist before the call. This is why MPI_Init() takes pointers to argc and argv: specifically to enable them to be initialised on all processes by the MPI_Init() call.
It is therefore essential that you use:
MPI_Init(&argc, &argv);
and not
MPI_Init(NULL, NULL);
In practice, many MPI implementations make the command-line arguments available before the Init call, but you should not rely on this.
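For illustration, a minimal sketch of the recommended form (any C compiler and MPI implementation should do; the argument handling here is just an example):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Pass the addresses of argc and argv so the implementation can
       fix them up on every rank if it needs to. */
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only use argv after MPI_Init() has had a chance to fill it in. */
    int n = (argc > 1) ? atoi(argv[1]) : 0;
    printf("rank %d sees argument %d\n", rank, n);

    MPI_Finalize();
    return 0;
}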

The standard doesn't make it clear whether that is the case or not as it tries too hard to abstract the actual process by which the MPI ranks come into existence.
On one side, Section 8.8 Portable MPI Process Startup recommends that a portable process launcher by the name of mpiexec exist (if required at all by the execution environment) and advises that the launcher be viewable as a command-line version of MPI_COMM_SPAWN.
On the other side, MPI_COMM_SPAWN takes among its arguments an array of command-line arguments that are supposed to be passed on to the spawned processes (Section 10.3.2 Starting Processes and Establishing Communication):
Arguments are supplied to the program if this is allowed by the operating system. [...]
But the paragraph following the cited one is:
If a Fortran implementation supplies routines that allow a program to obtain its arguments, the arguments may be available through that mechanism. In C, if the operating system does not support arguments appearing in argv of main(), the MPI implementation may add the arguments to the argv that is passed to MPI_INIT. (emphasis mine)
I would therefore read this as: MPI implementations are advised to do their best to provide all ranks with the command-line arguments of the mpiexec command, but no absolute guarantee is given.
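If you want to stay strictly portable despite that, one common pattern (my own sketch, not something the standard mandates) is to parse the arguments on rank 0 and broadcast the result:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Parse only on rank 0, then broadcast so every rank agrees even if
       an implementation did not replicate the command line. */
    int value = 0;
    if (rank == 0 && argc > 1)
        value = atoi(argv[1]);
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d has value %d\n", rank, value);
    MPI_Finalize();
    return 0;
}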

Related

Which command_queue to pass to clEnqueueCopyBuffer when launching kernels simultaneously?

So I am implementing a Kmeans clustering algorithm with OpenCL that uses channels: a feature from Intel's FPGA SDK for OpenCL.
To keep it succinct, this means I have two kernels that have to be enqueued on different command queues so they run simultaneously. I want to copy a cl_mem buffer from one kernel to the other every iteration (it holds the 4 clusters, so it is on the small side), part of which requires me to call clEnqueueCopyBuffer. This requires passing the function a command queue, but I don't know if it wants the queue of the buffer being copied from or the queue of the buffer being copied to.
This is all the OpenCL Specification says for the command_queue parameter:
The command-queue in which the copy command will be queued. The OpenCL context associated with command_queue, src_buffer, and dst_buffer must be the same.
I can confirm these kernels are in fact in the same context.
You could use either command queue, but you need to get an event from the copy operation and pass it to the kernel enqueue on the other command queue. Otherwise the kernel might start before the copy finishes.
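A sketch of that pattern (the queue, buffer, and kernel names are hypothetical, error checking is omitted, and context/queue setup is assumed to have been done already):

/* queue_a runs the producer kernel, queue_b the consumer kernel. */
cl_event copy_done;
size_t global_size = 4;   /* e.g. one work-item per cluster centroid */

/* Enqueue the copy on either queue; queue_a is chosen arbitrarily here. */
clEnqueueCopyBuffer(queue_a, src_buffer, dst_buffer,
                    0, 0, 4 * sizeof(cl_float),   /* offsets and size in bytes */
                    0, NULL, &copy_done);

/* The consumer kernel on the other queue waits on the copy's event. */
clEnqueueNDRangeKernel(queue_b, consumer_kernel, 1, NULL,
                       &global_size, NULL,
                       1, &copy_done, NULL);

clReleaseEvent(copy_done);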

Why should there be minimal work before MPI_Init?

The documentation for both MPICH and OpenMPI mentions that there should be a minimal amount of work done before MPI_Init or after MPI_Finalize:
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible.
What is the reason behind this?
To me it seems perfectly reasonable for processes to do a significant amount of calculations before starting the communication with each other.
I believe it was worded like that in order to allow MPI implementations that spawn their ranks within MPI_Init. That means not all ranks are technically guaranteed to exist before MPI_Init. If you had opened file descriptors or performed other operations with side effects on the process state, it would become a huge mess.
AFAIK no major current MPI implementation does that; nevertheless, an implementation might still use this requirement for other tricks.
EDIT: I found no evidence of this and only remember it from way back, so I'm not sure about it. I can't seem to find the formulation you quoted from MPICH anywhere in the MPI standard. However, the MPI standard does regulate which MPI functions you may call before MPI_Init:
The only MPI functions that may be invoked before the MPI initialization routines are called are MPI_GET_VERSION, MPI_GET_LIBRARY_VERSION, MPI_INITIALIZED, MPI_FINALIZED, and any function with the prefix MPI_T_.
The MPI_Init documentation of MPICH gives some hints:
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.
BTW, I would not expect MPI_Init to do communications. These would happen later.
And the mpich/init.c implementation is free software; you can study its source code and see that it initializes some timers, some threads, etc. (and that should indeed happen really early).
To me it seems perfectly reasonable for processes to do a significant amount of calculations before starting the communication with each other.
Of course, but these should happen after MPI_Init (but before some MPI_Send etc).
On some supercomputers, MPI might use dedicated hardware (like InfiniBand, Fibre Channel, etc.), and there might be hardware or operating-system reasons to initialize it very early. So it makes sense to call MPI_Init very early. BTW, it is also given pointers to main's arguments, and I guess it may modify them before your main processes them further. So the call to MPI_Init should probably be the first statement of your main.
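A sketch of that ordering (nothing implementation specific; heavy_local_work() is just a placeholder for your own computation):

#include <mpi.h>

static void heavy_local_work(void)
{
    /* placeholder for significant local computation */
}

int main(int argc, char **argv)
{
    /* First statement: let MPI fix up the environment and the arguments. */
    MPI_Init(&argc, &argv);

    /* Plenty of local computation is fine here: all ranks exist,
       but no communication has taken place yet. */
    heavy_local_work();

    /* Communication only starts with the first MPI_Send/MPI_Bcast/... */

    MPI_Finalize();
    return 0;
}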

What is the difference between kernel and program object?

I've been through several resources: the OpenCL Khronos book, GATech tutorial, NYU tutorial, and I could go through more. But I still don't understand fully. What is the difference between a kernel and a program object?
So far the best explanation is this for me, but this is not enough for me to fully understand:
PROGRAM OBJECT: A program object encapsulates some source code (with potentially several kernel functions) and its last successful build.
KERNEL: A kernel object encapsulates the values of the kernel's arguments used when the kernel is executed.
Maybe a program object is the code? And the kernel is the compiled executable? Is that it? Because I could understand something like that.
Thanks in advance!
A program is a collection of one or more kernels plus, optionally, supporting functions. A program can be created from source or from several types of binaries (e.g. SPIR, SPIR-V, native). Some program objects (those created from source or from intermediate binaries) need to be built for one or more devices (with clBuildProgram, or with clCompileProgram and clLinkProgram) before kernels can be selected from them. The easiest way to think about programs is that they are like DLLs that export kernels for use by the programmer.
A kernel is an executable entity. It is not necessarily compiled, since you can have built-in kernels that represent a piece of hardware (e.g. Video Motion Estimation kernels on Intel hardware). You can bind its arguments and submit it to various queues for execution.
For an OpenCL context, we can create multiple Program objects. First, I will describe the uses of program objects in the OpenCL application.
To facilitate the compilation of the kernels for the devices to which the program is attached
To provide facilities for determining build errors and querying the program for information
An OpenCL application uses kernel objects to execute a function parallelly on the device. Kernel objects are created from program objects. A program object can have multiple kernel objects.
As we know, to execute a kernel we need to pass arguments to it. This is the primary purpose of kernel objects.
To make this clearer, here is an analogy given in the book "OpenCL Programming Guide" by Aaftab Munshi et al.:
An analogy that may be helpful in understanding the distinction between kernel objects and program objects is that the program object is like a dynamic library in that it holds a collection of kernel functions. The kernel object is like a handle to a function within the dynamic library. The program object is created from either source code (OpenCL C) or a compiled program binary (more on this later). The program gets built for any of the devices to which the program object is attached. The kernel object is then used to access properties of the compiled kernel function, enqueue calls to it, and set its arguments.
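To make the relationship concrete, here is a sketch of the usual flow (assuming a context ctx, a device, and a cl_mem buffer already exist; the kernel name "add" is hypothetical and error handling is omitted):

/* The program object wraps the OpenCL C source and, after building,
   the device binaries. */
const char *source = "...";   /* OpenCL C source containing kernel void add(...) */
cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);

/* The kernel object is a handle to one function inside the built program,
   together with its argument values. */
cl_kernel kernel = clCreateKernel(program, "add", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);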

How are MPI processes started?

When starting an MPI job with mpirun or mpiexec, I can understand how one might go about starting each individual process. However, without any compiler magic, how do these wrapper executables communicate the arrangement (MPI communicator) to the MPI processes?
I am interested in the details, or a pointer on where to look.
Details on how individual processes establish the MPI universe are implementation specific. You should look into the source code of the specific library in order to understand how it works. There are two almost universal approaches though:
command line arguments: the MPI launcher can pass arguments to the spawned processes indicating how and where to connect in order to establish the universe. That's why MPI has to be initialised by calling MPI_Init() with argc and argv in C - thus the library can get access to the command line and extract all arguments that are meant for it;
environment variables: the MPI launcher can set specific environment variables whose content can indicate where and how to connect.
Open MPI, for example, sets environment variables and also writes some universe state to a disk location known to all processes that run on the same node. You can easily see the special variables that its run-time component ORTE (the Open Run-Time Environment) uses by executing a command like mpiexec -np 1 printenv:
$ mpiexec -np 1 printenv | grep OMPI
... <many more> ...
OMPI_MCA_orte_hnp_uri=1660944384.0;tcp://x.y.z.t:43276;tcp://p.q.r.f:43276
OMPI_MCA_orte_local_daemon_uri=1660944384.1;tcp://x.y.z.t:36541
... <many more> ...
(IPs changed for security reasons)
Once a child process is launched remotely and MPI_Init() or MPI_Init_thread() is called, ORTE kicks in and reads those environment variables. Then it connects back to the specified network address with the "home" mpirun/mpiexec process which then coordinates all spawned processes into establishing the MPI universe.
Other MPI implementations work in a similar fashion.
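As an illustration of the environment-variable mechanism, a process can look at its environment even before MPI_Init() and see what the launcher left behind. The variable names below are examples only; they differ between implementations and versions:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Open MPI typically exports OMPI_* variables; MPICH's Hydra launcher
       typically exports PMI_* variables. Neither name is guaranteed. */
    const char *ompi_rank = getenv("OMPI_COMM_WORLD_RANK");
    const char *pmi_rank  = getenv("PMI_RANK");

    if (ompi_rank)
        printf("launched by Open MPI, rank %s\n", ompi_rank);
    else if (pmi_rank)
        printf("launched by an MPICH-style launcher, rank %s\n", pmi_rank);
    else
        printf("no recognised launcher environment found\n");
    return 0;
}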

POSIX Threads: are pthreads_cond_wait() and others systemcalls?

The POSIX standard defines several routines for thread synchronization, based on concepts like mutexes and condition variables.
My question is now: are these (e.g. pthread_cond_init(), pthread_mutex_init(), pthread_mutex_lock(), and so on) system calls or just library calls? I know they are included via "pthread.h", but do they ultimately result in a system call and are they therefore implemented in the kernel of the operating system?
On Linux a pthread mutex makes a "futex" system call, but only if the lock is contended. That means that taking a lock no other thread wants is almost free.
In a similar way, sending a condition signal is only expensive when there is someone waiting for it.
So I believe that your answer is that pthread functions are library calls that sometimes result in a system call.
Whenever possible, the library avoids trapping into the kernel for performance reasons. If you already have some code that uses these calls you may want to take a look at the output from running your program with strace to better understand how often it is actually making system calls.
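A small sketch that makes the point observable (compile with gcc -pthread; the counter and loop count are only illustrative):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        /* On Linux, an uncontended lock/unlock pair stays in user space;
           futex() is only called when another thread already holds the
           lock or is sleeping on it. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);
    return 0;
}

Running this under strace -f -e futex shows futex() calls appearing essentially only when the two threads actually collide on the lock.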
I never looked into all of those library calls, but as far as I understand they all involve kernel operations, as they are supposed to provide synchronisation between processes and/or threads at a global level, I mean at the OS level.
For a mutex, for instance, the kernel needs to maintain a list of the threads that are currently sleeping, waiting for the locked mutex to be released. When the thread that currently owns the mutex calls pthread_mutex_unlock(), the kernel walks that list to find the highest-priority waiting thread, flags the new owner in the mutex's kernel structure, and then gives the CPU away (a context switch) to the new owner, which thereby returns from its POSIX library call pthread_mutex_lock().
At the very least I see cooperation with the kernel as unavoidable when IPC between processes is involved (I am not talking about threads within a single process), so there I expect those library calls to invoke the kernel.
When you compile a program on Linux that uses pthreads, you have to add -lpthread to the compiler options. By doing this, you tell the linker to link libpthread. So, on Linux, they are calls to a library.
