POSIX Threads: are pthread_cond_wait() and others system calls? - unix

The POSIX standard defines several routines for thread synchronization, based on concepts like mutexes and condition variables.
My question is: are these (e.g. pthread_cond_init(), pthread_mutex_init(), pthread_mutex_lock() and so on) system calls or just library calls? I know they are declared in "pthread.h", but do they ultimately result in a system call, and are they therefore implemented in the kernel of the operating system?

On Linux a pthread mutex makes a "futex" system call, but only if the lock is contended. That means that taking a lock no other thread wants is almost free.
In a similar way, sending a condition signal is only expensive when there is someone waiting for it.
So I believe that your answer is that pthread functions are library calls that sometimes result in a system call.
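To illustrate the "only if contended" idea, here is a minimal sketch of a lock with a user-space fast path. It is not glibc's actual implementation: sketch_mutex, sketch_lock and sketch_unlock are hypothetical names, and sched_yield() is only a stand-in for the futex wait that a real implementation would use on the slow path.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock, not glibc's real layout: state is 0 (free) or 1 (held). */
typedef struct {
    atomic_int state;
    atomic_int slow_path_entries; /* how often the lock was found contended */
} sketch_mutex;

static void sketch_lock(sketch_mutex *m) {
    assert(m != NULL);
    int expected = 0;
    /* Fast path: a single atomic compare-and-swap, entirely in user space. */
    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return;
    /* Slow path: the lock is held by someone else. A real pthread mutex on
       Linux would sleep in the kernel via futex here; this sketch assumes
       sched_yield() as a simple substitute. */
    atomic_fetch_add(&m->slow_path_entries, 1);
    do {
        sched_yield();
        expected = 0;
    } while (!atomic_compare_exchange_strong(&m->state, &expected, 1));
}

static void sketch_unlock(sketch_mutex *m) {
    assert(m != NULL);
    /* A real implementation would also wake a sleeping waiter, if any. */
    atomic_store(&m->state, 0);
}
```

Taking and releasing the lock from a single thread never touches the slow path, which is the cheap case the answer describes.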

Whenever possible, the library avoids trapping into the kernel for performance reasons. If you already have some code that uses these calls you may want to take a look at the output from running your program with strace to better understand how often it is actually making system calls.

I never looked into all those library calls, but as far as I understand they all involve kernel operations, since they are supposed to provide synchronisation between processes and/or threads at a global level, I mean at the OS level.
The kernel needs to maintain, for a mutex for instance, a thread list: the threads that are currently sleeping, waiting for a locked mutex to be released. When the thread that currently owns that mutex invokes the kernel with pthread_mutex_unlock(), the kernel system call will browse that list to find the highest-priority thread waiting for the mutex, record the new owner in the mutex's kernel structure, and then give the CPU (a "context switch") to the new owner thread, which will then return from the POSIX library call pthread_mutex_lock().
I can only see this working with the cooperation of the kernel when it involves IPC between processes (I am not talking about threads within a single process). I therefore expect those library calls to invoke the kernel.

When you compile a program on Linux that uses pthreads, you have to add -lpthread to the compiler options. By doing this, you tell the linker to link against libpthread. So, on Linux, they are calls to a library.

Related

Is it possible to attach an asynchronous callback/continuation to a SYCL kernel?

I have a collection of thousands of SYCL kernels to execute. Once each of these kernels has finished, I need to execute a function on a cl::sycl::buffer written to by said kernel.
The methods I'm aware of for achieving this are:
by using RAII; the requisite global memory is copied back to the host upon destruction of the cl::sycl::buffer
by constructing a host cl::sycl::accessor (with cl::sycl::access::target::host_buffer)
Both of these methods are synchronous and blocking. Is it possible to instead attach an asynchronous callback/continuation when submitting kernels to a cl::sycl::queue that executes as soon as the kernel has finished? Or even better, can the same functionality be achieved with C++2a coroutines? If not, is such a feature planned for SYCL?
The feature to attach callbacks or execute on the host from a SYCL queue did not make the cut for SYCL 1.2.1.
There are some proposals being discussed at the moment to bring that feature into the next version of the standard, but everything is still internal to the SYCL group.
In the meantime, if you use ComputeCpp, you can use the host_handler extension, which allows you to execute a lambda on the host based on dependencies from the device.
As far as I've seen, the open source compiler doesn't have that feature yet.

Why should there be minimal work before MPI_Init?

The documentation for both MPICH and OpenMPI mentions that a minimal amount of work should be done before MPI_Init or after MPI_Finalize:
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible.
What is the reason behind this?
To me it seems perfectly reasonable for processes to do a significant amount of calculations before starting the communication with each other.
I believe it was worded like that in order to allow MPI implementations that spawn their ranks within MPI_Init. That means not all ranks are technically guaranteed to exist before MPI_Init. If you had opened file descriptors or performed other things with side effects on the process state, it would become a huge mess.
Afaik no major current MPI implementation does that, nevertheless an MPI implementation might use this requirement for other tricks.
EDIT: I found no evidence of this and only remember this from way back, so I'm not sure about it. I can't seem to find the formulation in MPI standard that you quoted from MPICH. However, the MPI standard regulates which MPI functions you may call before MPI_Init:
The only MPI functions that may be invoked before the MPI initialization routines are called are MPI_GET_VERSION, MPI_GET_LIBRARY_VERSION, MPI_INITIALIZED, MPI_FINALIZED, and any function with the prefix MPI_T_.
The MPI_Init documentation of MPICH is giving some hints:
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.
BTW, I would not expect MPI_Init to do communications. These would happen later.
And the mpich/init.c implementation is free software; you can study its source code and understand that it is initializing some timers, some threads, etc... (and that should indeed happen really early).
To me it seems perfectly reasonable for processes to do a significant amount of calculations before starting the communication with each other.
Of course, but these should happen after MPI_Init (but before some MPI_Send etc).
On some supercomputers, MPI might use dedicated hardware (like InfiniBand, Fibre Channel, etc...) and there might be some hardware or operating system reasons to initialize it very early. So it makes sense to call MPI_Init very early. BTW, it is also given the pointers to main arguments and I guess that it would modify them before further processing by your main. Then the call to MPI_Init is probably the first statement of your main.

Abstract implementation of non-blocking MPI calls

Non-blocking sends/recvs return immediately in MPI and the operation is completed in the background. The only way I see that happening is that the current process/thread invokes/creates another process/thread and loads an image of the send/recv code into that and itself returns. Then this new process/thread completes this operation and sets a flag somewhere which the Wait/Test returns. Am I correct ?
There are two ways that progress can happen:
In a separate thread. This is usually an option in most MPI implementations (usually at configure/compile time). In this version, as you speculated, the MPI implementation has another thread that runs a separate progress engine. That thread manages all of the MPI messages and sending/receiving data. This way works well if you're not using all of the cores on your machine as it makes progress in the background without adding overhead to your other MPI calls.
Inside other MPI calls. This is the more common way of doing things and is the default for most implementations I believe. In this version, non-blocking calls are started when you initiate the call (MPI_I<something>) and are essentially added to an internal queue. Nothing (probably) happens on that call until you make another call to MPI later that actually does some blocking communication (or waits for the completion of previous non-blocking calls). When you enter that future MPI call, in addition to doing whatever you asked it to do, it will run the progress engine (the same thing that's running in a thread in version #1). Depending on what the MPI call that's supposed to be happening is doing, the progress engine may run for a while or may just run through once. For instance, if you called MPI_WAIT on an MPI_IRECV, you'll stay inside the progress engine until you receive the message that you're waiting for. If you are just doing an MPI_TEST, it might just cycle through the progress engine once and then jump back out.
More exotic methods. As Jeff mentions in his post, there are more exotic methods that depend on the hardware on which you're running. You may have a NIC that will do some magic for you in terms of moving your messages in the background or some other way to speed up your MPI calls. In general, these are very specific to the implementation and hardware on which you're running, so if you want to know more about them, you'll need to be more specific in your question.
All of this is specific to your implementation, but most of them work in some way similar to this.
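As a rough illustration of the second approach (progress made inside other MPI calls), here is a minimal sketch that does not use MPI at all. Everything in it is hypothetical: sketch_request, sketch_start, sketch_test and sketch_wait only mimic the roles of a request handle, MPI_I<something>, MPI_TEST and MPI_WAIT, and "moving one byte per pass" is an assumption standing in for real network work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical request: bytes_left stands in for work still to be done. */
typedef struct {
    int bytes_left;
    bool done;
} sketch_request;

static sketch_request *pending[MAX_PENDING]; /* internal queue of requests */
static size_t n_pending = 0;

/* Plays the role of a non-blocking start call: only enqueue, do no work. */
static void sketch_start(sketch_request *r, int bytes) {
    assert(n_pending < MAX_PENDING);
    r->bytes_left = bytes;
    r->done = (bytes == 0);
    pending[n_pending++] = r;
}

/* One pass of the progress engine: advance every queued request a little. */
static void sketch_progress_once(void) {
    size_t kept = 0;
    for (size_t i = 0; i < n_pending; i++) {
        sketch_request *r = pending[i];
        if (r->bytes_left > 0)
            r->bytes_left--;          /* pretend one byte was transferred */
        r->done = (r->bytes_left == 0);
        if (!r->done)
            pending[kept++] = r;      /* still outstanding: keep it queued */
    }
    n_pending = kept;
}

/* Plays the role of a test call: cycle the engine once, then report. */
static bool sketch_test(sketch_request *r) {
    sketch_progress_once();
    return r->done;
}

/* Plays the role of a wait call: stay in the engine until r completes. */
static void sketch_wait(sketch_request *r) {
    while (!r->done)
        sketch_progress_once();
}
```

Note that waiting on one request also advances every other queued request, which is why no extra thread is needed in this scheme.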
Are you asking, if a separate thread for message processing is the only solution for non-blocking operations?
If so, the answer is no. I even think many setups use a different strategy. Usually, message processing makes progress during all MPI calls. I'd recommend having a look at this blog entry by Jeff Squyres.
See the answer by Wesley Bland for a more complete answer.

Cooperative Multitasking system

I'm trying to get my head around the concept of a cooperative multitasking system and exactly how it works in a single-threaded application.
My understanding is that this is a "form of multitasking in which multiple tasks execute by voluntarily ceding control to other tasks at programmer-defined points within each task."
So if you have a list of tasks and one task is executing, how do you decide when to pass execution to another task? And when you give execution back to a previous task, how do you resume from where you were previously?
I find this a bit confusing because I don't understand how this can be achieved without a multithreaded application.
Any advice would be very helpful :)
Thanks
In your specific scenario where a single process (or thread of execution) uses cooperative multitasking, you can use something like Windows' fibers or POSIX setcontext family of functions. I will use the term fiber here.
Basically when one fiber is finished executing a chunk of work and wants to voluntarily allow other fibers to run (hence the "cooperative" term), it either manually switches to the other fiber's context or more typically it performs some kind of yield() or scheduler() call that jumps into the scheduler's context, then the scheduler finds a new fiber to run and switches to that fiber's context.
What do we mean by context here? Basically the stack and registers. There is nothing magic about the stack, it's just a block of memory the stack pointer happens to point to. There is also nothing magic about the program counter, it just points to the next instruction to execute. Switching contexts simply saves the current registers somewhere, changes the stack pointer to a different chunk of memory, updates the program counter to a different stream of instructions, copies that context's saved registers into the CPU, then does a jump. Bam, you're now executing different instructions with a different stack. Often the context switch code is written in assembly that is invoked in a way that doesn't modify the current stack or it backs out the changes, in either case it leaves no traces on the stack or in registers so when code resumes execution it has no idea anything happened. (Again, the theme: we assume that method calls fiddle with registers, push arguments to the stack, move the stack pointer, etc but that is just the C calling convention. Nothing requires you to maintain a stack at all or to have any particular method call leave any traces of itself on the stack).
Since each stack is separate, you don't have some continuous chain of seemingly random method calls eventually overflowing the stack (which might be the result if you naively tried to implement this scheme using standard C methods that continuously called each other). You could implement this manually with a state machine where each fiber kept a state machine of where it was in its work, periodically returning to the calling dispatcher's method, but why bother when actual fiber/co-routine support is widely available?
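For the POSIX side, a minimal sketch of the scheme described above using the setcontext family (getcontext, makecontext, swapcontext) might look like this. The names fiber_yield, fiber_body and run_fibers, the 64 KiB stack size and the round-robin policy are all assumptions made for the example; it also assumes a platform such as Linux with glibc where <ucontext.h> is still provided.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t scheduler_ctx, fiber_ctx[2];
static char stacks[2][STACK_SIZE];  /* one private stack per fiber */
static char trace[32];              /* records the order in which fibers ran */
static int finished[2];
static int current;                 /* index of the fiber that is running */

/* Hypothetical yield(): save this fiber's context, resume the scheduler. */
static void fiber_yield(void) {
    swapcontext(&fiber_ctx[current], &scheduler_ctx);
}

/* Each fiber does two chunks of "work", yielding after each one. */
static void fiber_body(int id) {
    for (int step = 0; step < 2; step++) {
        size_t n = strlen(trace);
        assert(n + 1 < sizeof trace);
        trace[n] = (char)('A' + id);  /* fiber 0 logs 'A', fiber 1 logs 'B' */
        trace[n + 1] = '\0';
        fiber_yield();
    }
    finished[id] = 1;  /* returning continues at uc_link, i.e. the scheduler */
}

/* Scheduler: switch to each unfinished fiber in turn until all are done. */
static void run_fibers(void) {
    trace[0] = '\0';
    for (int i = 0; i < 2; i++) {
        finished[i] = 0;
        getcontext(&fiber_ctx[i]);
        fiber_ctx[i].uc_stack.ss_sp = stacks[i];
        fiber_ctx[i].uc_stack.ss_size = STACK_SIZE;
        fiber_ctx[i].uc_link = &scheduler_ctx;
        makecontext(&fiber_ctx[i], (void (*)(void))fiber_body, 1, i);
    }
    int live = 2;
    while (live > 0) {
        for (current = 0; current < 2; current++) {
            if (finished[current])
                continue;
            swapcontext(&scheduler_ctx, &fiber_ctx[current]);
            if (finished[current])
                live--;
        }
    }
}
```

After run_fibers() returns, trace holds "ABAB": the two fibers alternate, each resuming exactly where it last yielded, all on a single thread.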
Also remember that cooperative multitasking is orthogonal to processes, protected memory, address spaces, etc. Witness Mac OS 9 or Windows 3.x. They supported the idea of separate processes. But when you yielded, the context was changed to the OS context, allowing the OS scheduler to run, which then potentially selected another process to switch to. In theory you could have a full protected virtual memory OS that still used cooperative multitasking. In those systems, if an errant process never yielded, the OS scheduler never ran, so all other processes in the system were frozen. **
The next natural question is what makes something pre-emptive... The answer is that the OS schedules an interrupt timer with the CPU to stop the currently executing task and switch back to the OS scheduler's context regardless of whether the current task cares to release the CPU or not, thus "pre-empting" it.
If the OS uses CPU privilege levels, the (kernel configured) timer is not cancelable by lower level (user mode) code, though in theory if the OS didn't use such protections an errant task could mask off or cancel the interrupt timer and hijack the CPU. There are some other scenarios like IO calls where the scheduler can be invoked outside the timer, and the scheduler may decide no other process has higher priority and return control to the same process without a switch... And in reality most OSes don't do a real context switch here because that's expensive, the scheduler code runs inside the context of whatever process was executing, so it has to be very careful not to step on the stack, to save register states, etc.
** You might ask why not just fire a timer if yield isn't called within a certain period of time. The answer lies in multi-threaded synchronization. In a cooperative system, you don't have to bother taking locks, worry about re-entrance, etc because you only yield when things are in a known good state. If this mythical timer fires, you have now potentially corrupted the state of the program that was interrupted. If programs have to be written to handle this, congrats... You now have a half-assed pre-emptive multitasking system. Might as well just do it right! And if you are changing things anyway, may as well add threads, protected memory, etc. That's pretty much the history of the major OSes right there.
The basic idea behind cooperative multitasking is trust - that each subtask will relinquish control, of its own accord, in a timely fashion, to avoid starving other tasks of processor time. This is why tasks in a cooperative multitasking system need to be tested extremely thoroughly, and in some cases certified for use.
I don't claim to be an expert, but I imagine cooperative tasks could be implemented as state machines, where passing control to the task would cause it to run for the absolute minimal amount of time it needs to make any kind of progress. For example, a file reader might read the next few bytes of a file, a parser might parse the next line of a document, or a sensor controller might take a single reading, before returning control back to a cooperative scheduler, which would check for task completion.
Each task would have to keep its internal state on the heap (at object level), rather than on the stack frame (at function level) like a conventional blocking function or thread.
And unlike conventional multitasking, which relies on a hardware timer to trigger a context switch, cooperative multitasking relies on the code to be written in such a way that each step of each long-running task is guaranteed to finish in an acceptably small amount of time.
The tasks will execute an explicit wait or pause or yield operation which makes the call to the dispatcher. There may be different operations for waiting on IO to complete or explicitly yielding in a heavy computation. In an application task's main loop, it could have a *wait_for_event* call instead of busy polling. This would suspend the task until it has input to process.
There may also be a time-out mechanism for catching runaway tasks, but it is not the primary means of switching (or else it wouldn't be cooperative).
One way to think of cooperative multitasking is to split a task into steps (or states). Each task keeps track of the next step it needs to execute. When it's the task's turn, it executes only that one step and returns. That way, in the main loop of your program you are simply calling each task in order, and because each task only takes up a small amount of time to complete a single step, we end up with a system which allows all of the tasks to share cpu time (ie. cooperate).
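A minimal sketch of that "one step per turn" idea, assuming a made-up task that just sums the numbers 1..limit (counter_task, counter_step and run_round_robin are hypothetical names, not from any library):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task: adds 1..limit to sum, one number per step.
   Its state lives in this struct rather than on a function's stack. */
typedef struct {
    int next;   /* saved position: the next number to add */
    int limit;
    int sum;
} counter_task;

/* Execute exactly one step, then return. True means more work remains. */
static bool counter_step(counter_task *t) {
    if (t->next > t->limit)
        return false;
    t->sum += t->next++;
    return t->next <= t->limit;
}

/* Main loop: call each unfinished task in order until none has work left.
   Returns the number of passes over the task list. */
static int run_round_robin(counter_task *tasks, int n) {
    assert(n >= 0);
    int rounds = 0;
    bool any_left = true;
    while (any_left) {
        any_left = false;
        for (int i = 0; i < n; i++) {
            if (tasks[i].next <= tasks[i].limit && counter_step(&tasks[i]))
                any_left = true;
        }
        rounds++;
    }
    return rounds;
}
```

With two tasks whose limits are 3 and 5, the loop makes 5 passes and the tasks end with sums 6 and 15. Each task only ever holds the CPU for one short step at a time.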

What exactly are "spin-locks"?

I always wondered what they are: every time I hear about them, images of futuristic flywheel-like devices go dancing (rolling?) through my mind...
What are they?
When you use regular locks (mutexes, critical sections etc), the operating system puts your thread in the WAIT state and preempts it by scheduling other threads on the same core. This has a performance penalty if the wait time is really short, because your thread now has to wait for a preemption to receive CPU time again.
Besides, kernel objects are not available in every state of the kernel, such as in an interrupt handler or when paging is not available etc.
Spinlocks don't cause preemption but wait in a loop ("spin") till the other core releases the lock. This prevents the thread from losing its quantum and lets it continue as soon as the lock gets released. The simple mechanism of spinlocks allows a kernel to utilize them in almost any state.
That's why on a single core machine a spinlock is simply a "disable interrupts" or "raise IRQL" which prevents thread scheduling completely.
Spinlocks ultimately allow kernels to avoid "Big Kernel Lock"s (a lock acquired when core enters kernel and released at the exit) and have granular locking over kernel primitives, causing better multi-processing on multi-core machines thus better performance.
EDIT: A question came up: "Does that mean I should use spinlocks wherever possible?" and I'll try to answer it:
As I mentioned, Spinlocks are only useful in places where anticipated waiting time is shorter than a quantum (read: milliseconds) and preemption doesn't make much sense (e.g. kernel objects aren't available).
If waiting time is unknown, or if you're in user mode Spinlocks aren't efficient. You consume 100% CPU time on the waiting core while checking if a spinlock is available. You prevent other threads from running on that core till your quantum expires. This scenario is only feasible for short bursts at kernel level and unlikely an option for a user-mode application.
Here is a question on SO addressing that: Spinlocks, How Useful Are They?
Say a resource is protected by a lock; a thread that wants access to the resource needs to acquire the lock first. If the lock is not available, the thread might repeatedly check whether the lock has been freed. During this time the thread busy-waits, checking for the lock, using CPU, but not doing any useful work. Such a lock is termed a spin lock.
It is pretty much a loop that keeps going till a certain condition is met:
while(cantGoOn) {};
while(something != TRUE ){};
// it happened
move_on();
It's a type of lock that does busy waiting
It's considered an anti-pattern, except for very low-level driver programming (where it can happen that calling a "proper" waiting function has more overhead than simply busy locking for a few cycles).
See for example Spinlocks in Linux kernel.
SpinLocks are the ones in which thread waits till the lock is available. This will normally be used to avoid overhead of obtaining the kernel objects when there is a scope of acquiring the kernel object within some small time period.
Ex:
while (SpinCount-- && Kernel Object is not free)
{}
try acquiring Kernel object
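The pseudocode above can be sketched with a pthread mutex standing in for the "kernel object". This is only an illustration under that assumption: spin_then_block is a hypothetical helper, and pthread_mutex_trylock is used to combine the "is it free?" check with the acquisition attempt.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Try to take the mutex up to spin_count times without blocking; if it is
   still busy after that, fall back to a normal blocking lock.
   Returns true if the mutex was acquired during the spinning phase. */
static bool spin_then_block(pthread_mutex_t *m, int spin_count) {
    assert(spin_count >= 0);
    while (spin_count-- > 0) {
        if (pthread_mutex_trylock(m) == 0)
            return true;      /* acquired without ever blocking */
    }
    pthread_mutex_lock(m);    /* give up spinning; may sleep in the kernel */
    return false;
}
```

Either way the caller owns the mutex on return and must release it with pthread_mutex_unlock(). The spin count is a tuning knob: it only pays off when the lock is usually released within a few iterations.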
You would want to use a spinlock when you think it is cheaper to enter a busy-waiting loop and poll a resource instead of blocking when the resource is locked.
Spinning can be beneficial when locks are fine grained and large in number (for example, a lock per node in a linked list) as well as when lock hold times are always extremely short. In general, while holding a spin lock, one should avoid blocking, calling anything that itself may block, holding more than one spin lock at once, making dynamically dispatched calls (interface and virtuals), making statically dispatched calls into any code one doesn't own, or allocating memory.
It's also important to note that SpinLock is a value type, for performance reasons. As such, one must be very careful not to accidentally copy a SpinLock instance, as the two instances (the original and the copy) would then be completely independent of one another, which would likely lead to erroneous behavior of the application. If a SpinLock instance must be passed around, it should be passed by reference rather than by value.
It's a loop that spins around until a condition is met.
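In a nutshell, a spinlock uses atomic compare-and-swap (CAS) or test-and-set-like instructions to implement a thread-safe lock whose waiters busy-wait instead of sleeping. Note that the waiters still spin, so the lock itself is neither lock-free nor wait-free. With short hold times, such structures can scale well on multi-core machines.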
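A minimal test-and-set spinlock can be sketched with C11's atomic_flag. The function names spin_lock, spin_trylock and spin_unlock are made up for this example, and a production lock would typically add back-off or a pause instruction inside the loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* Single attempt: test-and-set returns the previous value of the flag,
   so "false" means it was free and we now hold the lock. */
static bool spin_trylock(void) {
    return !atomic_flag_test_and_set_explicit(&lock_flag,
                                              memory_order_acquire);
}

/* Keep retrying the atomic test-and-set until it succeeds ("spin"). */
static void spin_lock(void) {
    while (!spin_trylock())
        ;  /* busy-wait: burns CPU, never sleeps */
}

/* Clear the flag so that the next test-and-set by any thread succeeds. */
static void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}
```

A thread wraps its critical section in spin_lock() and spin_unlock(); any other thread calling spin_lock() in the meantime loops on the flag until it is cleared.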
Well, yes - the point of spin locks (vs a traditional critical sections, etc) is that they offer better performance under some circumstances (multicore systems..), because they don't immediately yield the rest of the thread's quantum.
A spinlock is a type of lock that is non-blocking and non-sleeping. Any thread that wants to acquire a spinlock for a shared or critical resource will spin continuously, wasting CPU cycles, until it acquires the lock for that resource. Once the spinlock is acquired, the thread tries to complete its work within its quantum and then releases the resource. A spinlock is the highest-priority type of lock; simply put, it is a non-preemptive kind of lock.
