Spinlock vs Busy wait

Please explain why Busy Waiting is generally frowned upon whereas Spinning is often seen as okay. As far as I can tell, they both loop infinitely until some condition is met.

A spin-lock is usually used when there is low contention for the resource, so the CPU will only make a few iterations before it can move on to productive work. However, library implementations of locking functionality often use a spin-lock followed by a regular lock: the regular lock is used if the resource cannot be acquired in a reasonable time-frame. This is done to reduce the overhead of context switches in settings where locks are usually obtained quickly.
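As a rough illustration of that hybrid strategy, here is a minimal sketch using POSIX threads (spin_then_block and SPIN_LIMIT are made-up names, and the spin count is arbitrary rather than tuned):

#include <pthread.h>

#define SPIN_LIMIT 1000  /* arbitrary illustration; real libraries tune this per platform */

/* Hypothetical helper: spin briefly in user space, then fall back to a blocking lock. */
static void spin_then_block(pthread_mutex_t *m)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;            /* acquired while spinning: no context switch paid */
    }
    pthread_mutex_lock(m);     /* still contended: block and release the CPU */
}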
The term busy-waiting tends to mean that you are willing to spin and wait for a change in a hardware register or a memory location. The term does not necessarily mean locking, but it does imply waiting in a tight loop, repeatedly probing for a change.
You may want to use busy-waiting in order to detect some kind of change in the environment that you want to respond to immediately. So a spin-lock is implemented using busy-waiting. Busy-waiting is useful in any situation where a very low latency response is more important than wasting CPU cycles (like in some types of embedded programming).
Related to this are the terms «lock-free» and «wait-free»:
So-called lock-free algorithms tend to use tight busy-waiting with a CAS instruction, but contention is, in ordinary situations, so low that the CPU usually has to iterate only a few times.
So-called wait-free algorithms don't do any busy-waiting at all.
(Please note that «lock-free» and «wait-free» are used slightly differently in academic contexts; see Wikipedia's article on Non-blocking algorithms.)
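To make the CAS-retry idiom concrete, here is a minimal sketch with C11 atomics (lockfree_increment is a made-up name, purely illustrative):

#include <stdatomic.h>

/* The CAS fails only if another thread changed the counter between our
   load and our attempted swap; in that case we simply retry. */
void lockfree_increment(atomic_int *counter)
{
    int old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
        /* on failure, 'old' has been refreshed with the current value */
    }
}

Under low contention this loop runs at most a handful of times, which is exactly the point made above.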

Standard versus spin mutexes: When a thread calls the function that locks and acquires a mutex, the function does not return until the mutex is locked. This is typical for event synchronization: a thread waiting for an event, namely, the fact that it has acquired mutex ownership. There are two ways in which this can happen:
• An idle wait: the thread waiting to lock the mutex is blocked in a wait state as explained in Chapter 2. It releases the CPU, which can then be used to run another thread. When the mutex becomes available, the runtime system wakes up and reschedules the waiting thread, which can then lock the now available mutex.
• A busy wait, also called a spin wait, in which a thread waiting to lock the mutex does not release the CPU. It remains scheduled, executing some trivial do-nothing instruction until the mutex is released.
Standard mutexes normally subscribe to the first strategy, and perform an idle wait. But some libraries also provide mutexes that subscribe to the spin wait strategy. The best one depends on the application context. For very short waits, spinning in user space is more efficient because putting a thread to sleep in a blocked state takes cycles. But for long waits, a sleeping thread releases its CPU, making cycles available to other threads.
I found this quite related passage at this source: https://www.sciencedirect.com/topics/computer-science/waiting-thread
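For contrast, here is what the two strategies look like side by side in C. This is a bare-bones sketch of my own: the spin version deliberately omits the pause/backoff refinements a production lock would have.

#include <stdatomic.h>
#include <pthread.h>

atomic_flag spin = ATOMIC_FLAG_INIT;              /* busy wait: stays on the CPU */
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;  /* idle wait: blocks in the kernel */

void spin_lock(void)   { while (atomic_flag_test_and_set(&spin)) { /* spin */ } }
void spin_unlock(void) { atomic_flag_clear(&spin); }

/* By contrast, pthread_mutex_lock(&mtx) puts the thread to sleep until woken. */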

When you understand the precise reason for a rule and have detailed platform and application knowledge, you know when it's appropriate to violate that rule. Spinlocks are implemented by experts who have a complete understanding of the platform they're developing on and the intended applications for spinlocks.
The problems with busy waiting are numerous, but on most platforms, there are solutions for them. Issues include:
• For CPUs with hyper-threading, a busy-waiting thread can starve another thread in the same physical core, even the very thread it's waiting for.
• When you busy wait, when you finally do get the resource you're waiting for, you take the mother of all mispredicted branches.
• Busy waiting interferes with CPU power management.
• Busy waiting can saturate inter-core buses, as the thing you keep checking causes cache synchronization traffic.
But the people who design spinlocks understand all these issues and know precisely how to mitigate them on the platform. They don't write naive spinning code, they write intelligent spinning code.
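For a flavor of what "intelligent spinning" can mean, here is a test-and-test-and-set sketch (my own illustration, not any particular library's code; smart_spin_lock is a made-up name and _mm_pause is x86-specific):

#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* hint: eases hyper-thread and power pressure */
#else
#define cpu_relax() ((void)0)
#endif

void smart_spin_lock(atomic_bool *lock)
{
    for (;;) {
        if (!atomic_exchange_explicit(lock, true, memory_order_acquire))
            return;                              /* acquired */
        /* Wait on a plain load ("test-and-test-and-set") so the cache line
           stays shared and we don't hammer the inter-core bus. */
        while (atomic_load_explicit(lock, memory_order_relaxed))
            cpu_relax();
    }
}

void smart_spin_unlock(atomic_bool *lock)
{
    atomic_store_explicit(lock, false, memory_order_release);
}

Spinning on the read-only load addresses the cache-traffic issue, while the pause hint addresses the hyper-threading and power-management issues listed above.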
So yes, they both loop infinitely until some condition is met, but they loop differently.

Related

Why does process creation using `clone` result in an out-of-memory failure?

I have a process that allocates about 20GB of RAM on a 32GB machine. After some events, I'm streaming the data from the parent process to stdin of the child process. It's mandatory to keep the 20GB of data in the parent process at the point when the child is spawned.
The app is written in Rust and I'm calling Command::new("path/to/command") to create the child process.
When I spawn the child process the operating system is trapping an out-of-memory error.
strace output:
[pid 747] 16:04:41.128377 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff4c7f87b10) = -1 ENOMEM (Cannot allocate memory)
Why does the trap occur? The child process should not consume more than 1GB and exec() is called immediately after clone().
The Problem
When a child process is created by the Rust call, several things happen at a C/C++ level. This is a simplification, but it will help explain the dilemma.
• The streams are duplicated (with dup2 or a similar call)
• The parent process is forked (with the fork or clone system call)
• The forked process executes the child (with a call from the execvp family)
The parent and child are now concurrent processes. The Rust call you are currently using appears to be a clone call that is behaving much like a pure fork, so you're 20G x 2 - 32G = 8G short, without considering the space needed by the operating system and anything else that might be running. The clone call returns a negative value and errno is set to ENOMEM.
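If you stay with a spawn-a-child design, one workaround sometimes suggested at the C level is posix_spawn; glibc implements it with CLONE_VM|CLONE_VFORK, so the parent's 20G image should not be accounted for twice (an assumption worth verifying against your libc version; spawn_child is a made-up wrapper):

#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

/* Sketch: spawn a child without duplicating the parent's address space. */
int spawn_child(const char *path)
{
    pid_t pid;
    char *const argv[] = { (char *)path, NULL };
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawn failed: %d\n", err);
        return -1;
    }
    return waitpid(pid, NULL, 0) == pid ? 0 : -1;
}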
If the architectural solutions of adding physical memory, compressing the data, or streaming it through a process that does not require the entirety of it to be in memory at any one time are not options, then the classic solution is reasonably simple.
Recommendation
Design the parent process to be lean. Then spawn two worker children, one that handles your 20GB need and the other that handles your 1 GB need [1]. These children can be connected to one another via pipe, file, shared memory, socket, semaphore, signalling, and/or other communication mechanism(s), just as a parent and child can be.
Many mature software packages from Apache httpd to embedded cell tower routing daemons use this design pattern. It is reliable, maintainable, extensible, and portable.
The 32G would then likely suffice for the 20G and 1G processing needs, along with OS and lean parent process.
Although this solution will surely solve your problem, if the code is to be reused or extended later, there may be value in looking into potential process design changes involving data frames or multidimensional slices to support streaming of data and memory requirement reductions.
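A bare-bones sketch of that lean-parent shape in C (the worker paths ./big_worker and ./small_worker are hypothetical placeholders):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* The parent stays small and only wires a pipe between the two workers,
   so no large address space is ever duplicated at fork time. */
int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    if (fork() == 0) {                      /* worker 1: the 20GB producer */
        dup2(fd[1], STDOUT_FILENO);
        close(fd[0]); close(fd[1]);
        execl("./big_worker", "big_worker", (char *)NULL);
        _exit(127);
    }
    if (fork() == 0) {                      /* worker 2: the 1GB consumer */
        dup2(fd[0], STDIN_FILENO);
        close(fd[0]); close(fd[1]);
        execl("./small_worker", "small_worker", (char *)NULL);
        _exit(127);
    }
    close(fd[0]); close(fd[1]);             /* parent keeps nothing heavy */
    while (wait(NULL) > 0) {}
    return 0;
}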
Memory Overcommit Always
Setting overcommit_memory to 1 eliminates the clone error condition referenced in the question, because the Rust call invokes the Linux clone call, which reads that setting. But there are several caveats with this solution that point back to the above recommendation as superior, primarily that the value of 1 is dangerous, especially for production environments.
Background
Kernel discussions about OpenBSD rfork and the clone call ensued in the late 1990s and early 2000s. The features stemming from those discussions permit forking that is less all-or-nothing than traditional processes, which is symmetric to providing more extensive independence between pthreads. Some of these discussions have produced extensions to traditional process spawning that have entered POSIX standardization.
In the early 2000s, Linus Torvalds suggested a flag structure to determine which components of the execution model are shared and which are copied when execution forks, blurring the distinction between processes and threads. From this, the clone call emerged.
Over-committing memory is not discussed much, if at all, in those threads. The design goal was MORE control over the results of a fork rather than the delegation of memory usage optimization to an operating system heuristic, which is what the default setting of overcommit_memory = 0 does.
Caveats
Memory overcommit goes beyond these extensions, adding the complexity of trade-offs of its modes [2], design trend caveats [3], practical run time limitations [4], and performance impacts [5].
Portability and Longevity
Additionally, without standardization, code using memory overcommit may not be portable, and the question of longevity is pertinent, especially when a setting controls the behavior of a function. There is no guarantee of backward compatibility or even a warning of deprecation if the setting system changes.
Danger
The linuxdevcenter documentation [2] says, "1 always overcommits. Perhaps you now realize the danger of this mode.", and there are other indications of danger with ALWAYS overcommitting [6], [7].
The implementers of overcommit on Linux, Windows, and VMware may guarantee reliability, but it is a statistical game that, combined with the many other complexities of process control, may lead to unstable characteristics under certain conditions. Even the name overcommit tells us something about its true character as a practice.
A non-default overcommit_memory mode, for which several warnings are issued, and which works for the immediate trial of the immediate case, may later lead to intermittent reliability problems.
Predictability and Its Impact on System Reliability and Response Time Consistency
The idea of a process in a UNIX-like operating system, from its Bell Labs beginnings, is that a process makes a concrete request to its container, the operating system. The result is both predictable and binary: either the request is denied or it is granted. Once granted, the process is given complete control of and direct access to the resources until it relinquishes them.
The swap space aspect of virtual memory is a breach of this principle that appears as a gross deceleration of activity on workstations when RAM is heavily consumed. For instance, there are times during development when one presses a key and has to wait ten seconds to see the character on the display.
Conclusion
There are many ways to get the most out of physical memory, but doing so by hoping that use of allocated memory will be sparse will likely introduce negative impacts. The performance hit from swapping when overcommit is overused is the well-documented example. If you are keeping 20G of data in RAM, this may particularly be the case.
Allocating only what is needed, forking in intelligent ways, using threads, and freeing memory that is surely no longer needed lead to memory thrift without impacting reliability or creating spikes in swap disk usage, and can operate without caveat up to the limits of system resources.
The position of the designer of the Command::new call may be based on this perspective. In this case, how soon after the fork the exec is called is not a determining factor in how much memory is requested during the spawn.
Notes and References
[1] Spawning worker children may require some code refactoring and appear to be too much trouble on a superficial level, but the refactoring may be surprisingly straightforward and significantly beneficial.
[2] http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2
[3] https://www.etalabs.net/overcommit.html
[4] http://www.gabesvirtualworld.com/memory-overcommit-in-production-yes-yes-yes/
[5] https://labs.vmware.com/vmtj/memory-overcommitment-in-the-esx-server
[6] https://github.com/kubernetes/kubernetes/issues/14452
[7] http://linuxtoolkit.blogspot.com/2011_08_01_archive.html

Cooperative Multitasking system

I'm trying to get my head around the concept of a cooperative multitasking system and exactly how it works in a single threaded application.
My understanding is that this is a "form of multitasking in which multiple tasks execute by voluntarily ceding control to other tasks at programmer-defined points within each task."
So if you have a list of tasks and one task is executing, how do you determine when to pass execution to another task? And when you give execution back to a previous task, how do you resume from where you were previously?
I find this a bit confusing because I don't understand how this can be achieved without a multithreaded application.
Any advice would be very helpful :)
Thanks
In your specific scenario, where a single process (or thread of execution) uses cooperative multitasking, you can use something like Windows' fibers or the POSIX setcontext family of functions. I will use the term fiber here.
Basically when one fiber is finished executing a chunk of work and wants to voluntarily allow other fibers to run (hence the "cooperative" term), it either manually switches to the other fiber's context or more typically it performs some kind of yield() or scheduler() call that jumps into the scheduler's context, then the scheduler finds a new fiber to run and switches to that fiber's context.
What do we mean by context here? Basically the stack and registers. There is nothing magic about the stack, it's just a block of memory the stack pointer happens to point to. There is also nothing magic about the program counter, it just points to the next instruction to execute. Switching contexts simply saves the current registers somewhere, changes the stack pointer to a different chunk of memory, updates the program counter to a different stream of instructions, copies that context's saved registers into the CPU, then does a jump. Bam, you're now executing different instructions with a different stack.

Often the context switch code is written in assembly that is invoked in a way that doesn't modify the current stack, or it backs out the changes; in either case it leaves no traces on the stack or in registers, so when code resumes execution it has no idea anything happened. (Again, the theme: we assume that method calls fiddle with registers, push arguments to the stack, move the stack pointer, etc, but that is just the C calling convention. Nothing requires you to maintain a stack at all or to have any particular method call leave any traces of itself on the stack.)
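Here is a minimal sketch of exactly that context dance using the POSIX ucontext functions (marked obsolescent in recent POSIX but still available on glibc; fiber_fn and the 64 KB stack size are arbitrary choices):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];       /* the fiber's own, separate stack */

static void fiber_fn(void)
{
    puts("fiber: step 1");
    swapcontext(&fiber_ctx, &main_ctx);   /* voluntary yield back to main */
    puts("fiber: step 2");
}                                         /* returning resumes uc_link */

int main(void)
{
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &main_ctx;        /* where to go when fiber_fn returns */
    makecontext(&fiber_ctx, fiber_fn, 0);

    swapcontext(&main_ctx, &fiber_ctx);   /* run the fiber until it yields */
    puts("main: fiber yielded");
    swapcontext(&main_ctx, &fiber_ctx);   /* resume it right where it left off */
    puts("main: fiber finished");
    return 0;
}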
Since each stack is separate, you don't have some continuous chain of seemingly random method calls eventually overflowing the stack (which might be the result if you naively tried to implement this scheme using standard C methods that continuously called each other). You could implement this manually with a state machine, where each fiber kept track of where it was in its work, periodically returning to the calling dispatcher's method, but why bother when actual fiber/co-routine support is widely available?
Also remember that cooperative multitasking is orthogonal to processes, protected memory, address spaces, etc. Witness Mac OS 9 or Windows 3.x. They supported the idea of separate processes. But when you yielded, the context was changed to the OS context, allowing the OS scheduler to run, which then potentially selected another process to switch to. In theory you could have a full protected virtual memory OS that still used cooperative multitasking. In those systems, if an errant process never yielded, the OS scheduler never ran, so all other processes in the system were frozen. **
The next natural question is what makes something pre-emptive... The answer is that the OS schedules an interrupt timer with the CPU to stop the currently executing task and switch back to the OS scheduler's context regardless of whether the current task cares to release the CPU or not, thus "pre-empting" it.
If the OS uses CPU privilege levels, the (kernel configured) timer is not cancelable by lower level (user mode) code, though in theory if the OS didn't use such protections an errant task could mask off or cancel the interrupt timer and hijack the CPU. There are some other scenarios like IO calls where the scheduler can be invoked outside the timer, and the scheduler may decide no other process has higher priority and return control to the same process without a switch... And in reality most OSes don't do a real context switch here because that's expensive, the scheduler code runs inside the context of whatever process was executing, so it has to be very careful not to step on the stack, to save register states, etc.
** You might ask why not just fire a timer if yield isn't called within a certain period of time. The answer lies in multi-threaded synchronization. In a cooperative system, you don't have to bother taking locks, worry about re-entrance, etc because you only yield when things are in a known good state. If this mythical timer fires, you have now potentially corrupted the state of the program that was interrupted. If programs have to be written to handle this, congrats... You now have a half-assed pre-emptive multitasking system. Might as well just do it right! And if you are changing things anyway, may as well add threads, protected memory, etc. That's pretty much the history of the major OSes right there.
The basic idea behind cooperative multitasking is trust - that each subtask will relinquish control, of its own accord, in a timely fashion, to avoid starving other tasks of processor time. This is why tasks in a cooperative multitasking system need to be tested extremely thoroughly, and in some cases certified for use.
I don't claim to be an expert, but I imagine cooperative tasks could be implemented as state machines, where passing control to the task would cause it to run for the absolute minimal amount of time it needs to make any kind of progress. For example, a file reader might read the next few bytes of a file, a parser might parse the next line of a document, or a sensor controller might take a single reading, before returning control back to a cooperative scheduler, which would check for task completion.
Each task would have to keep its internal state on the heap (at object level), rather than on the stack frame (at function level) like a conventional blocking function or thread.
And unlike conventional multitasking, which relies on a hardware timer to trigger a context switch, cooperative multitasking relies on the code to be written in such a way that each step of each long-running task is guaranteed to finish in an acceptably small amount of time.
The tasks will execute an explicit wait or pause or yield operation which makes the call to the dispatcher. There may be different operations for waiting on IO to complete or explicitly yielding in a heavy computation. In an application task's main loop, it could have a wait_for_event call instead of busy polling. This would suspend the task until it has input to process.
There may also be a time-out mechanism for catching runaway tasks, but it is not the primary means of switching (or else it wouldn't be cooperative).
One way to think of cooperative multitasking is to split a task into steps (or states). Each task keeps track of the next step it needs to execute. When it's the task's turn, it executes only that one step and returns. That way, in the main loop of your program you are simply calling each task in order, and because each task only takes up a small amount of time to complete a single step, we end up with a system which allows all of the tasks to share cpu time (ie. cooperate).
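A tiny sketch of that step-per-turn structure in C (counting_task and the three-step count are arbitrary illustrations):

#include <stdbool.h>
#include <stdio.h>

/* Each task keeps its progress in a struct (effectively the heap/object
   level), not on the stack, exactly as described above. */
typedef struct {
    int next_step;
    bool done;
} task_t;

static void counting_task(task_t *t, const char *name)
{
    if (t->next_step < 3)
        printf("%s: step %d\n", name, t->next_step++);
    else
        t->done = true;
}

int main(void)
{
    task_t a = {0}, b = {0};
    while (!a.done || !b.done) {           /* the round-robin dispatcher */
        if (!a.done) counting_task(&a, "A");
        if (!b.done) counting_task(&b, "B");
    }
    return 0;
}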

What exactly are "spin-locks"?

I always wondered what they are: every time I hear about them, images of futuristic flywheel-like devices go dancing (rolling?) through my mind...
What are they?
When you use regular locks (mutexes, critical sections, etc.), the operating system puts your thread in the WAIT state and preempts it by scheduling other threads on the same core. This carries a performance penalty if the wait time is really short, because your thread now has to wait to be rescheduled before it receives CPU time again.
Besides, kernel objects are not available in every state of the kernel, such as in an interrupt handler or when paging is not available etc.
Spinlocks don't cause preemption but wait in a loop ("spin") till the other core releases the lock. This prevents the thread from losing its quantum and lets it continue as soon as the lock gets released. The simple mechanism of spinlocks allows a kernel to use them in almost any state.
That's why on a single core machine a spinlock is simply a "disable interrupts" or "raise IRQL" which prevents thread scheduling completely.
Spinlocks ultimately allow kernels to avoid a "Big Kernel Lock" (a lock acquired when a core enters the kernel and released at the exit) and to have granular locking over kernel primitives, yielding better multi-processing on multi-core machines and thus better performance.
EDIT: A question came up: "Does that mean I should use spinlocks wherever possible?" and I'll try to answer it:
As I mentioned, spinlocks are only useful in places where the anticipated waiting time is shorter than a quantum (read: milliseconds) and preemption doesn't make much sense (e.g. kernel objects aren't available).
If the waiting time is unknown, or if you're in user mode, spinlocks aren't efficient. You consume 100% CPU time on the waiting core while checking if a spinlock is available, and you prevent other threads from running on that core till your quantum expires. This scenario is only feasible for short bursts at kernel level and is unlikely to be an option for a user-mode application.
Here is a question on SO addressing that: Spinlocks, How Useful Are They?
Say a resource is protected by a lock; a thread that wants access to the resource needs to acquire the lock first. If the lock is not available, the thread might repeatedly check whether the lock has been freed. During this time the thread busy waits, checking for the lock, using CPU, but not doing any useful work. Such a lock is termed a spin lock.
It is pretty much a loop that keeps going till a certain condition is met:
while (cantGoOn) {}
// or:
while (something != TRUE) {}
// it happened
move_on();
It's a type of lock that does busy waiting
It's considered an anti-pattern, except for very low-level driver programming (where it can happen that calling a "proper" waiting function has more overhead than simply busy-waiting for a few cycles).
See for example Spinlocks in Linux kernel.
Spinlocks are ones in which the thread waits in a loop till the lock is available. This is normally used to avoid the overhead of a kernel-object wait when there is a good chance of acquiring the lock within some small time period.
Ex:
while (SpinCount-- && kernel_object_is_not_free) {}
// then try acquiring the kernel object
You would want to use a spinlock when you think it is cheaper to enter a busy-waiting loop and poll a resource instead of blocking when the resource is locked.
Spinning can be beneficial when locks are fine grained and large in number (for example, a lock per node in a linked list) as well as when lock hold times are always extremely short. In general, while holding a spin lock, one should avoid blocking, calling anything that itself may block, holding more than one spin lock at once, making dynamically dispatched calls (interface and virtuals), making statically dispatched calls into any code one doesn't own, or allocating memory.
It's also important to note that SpinLock is a value type, for performance reasons. As such, one must be very careful not to accidentally copy a SpinLock instance, as the two instances (the original and the copy) would then be completely independent of one another, which would likely lead to erroneous behavior of the application. If a SpinLock instance must be passed around, it should be passed by reference rather than by value.
It's a loop that spins around until a condition is met.
In a nutshell, a spinlock employs atomic compare-and-swap (CAS) or test-and-set-like instructions to implement a thread-safe locking idiom without kernel involvement. Such structures can scale well on multi-core machines.
Well, yes - the point of spin locks (vs traditional critical sections, etc.) is that they offer better performance under some circumstances (multicore systems..), because they don't immediately yield the rest of the thread's quantum.
A spinlock is a type of lock that neither blocks nor sleeps. Any thread that wants to acquire a spinlock for a shared or critical resource will continuously spin, wasting CPU processing cycles, till it acquires the lock for the specified resource. Once the spinlock is acquired, it tries to complete the work in its quantum and then releases the resource. A spinlock is the highest-priority type of lock; simply said, it is a non-preemptive kind of lock.

process scheduling question

For example, a process waiting for disk I/O to complete will sleep on the address of the buffer header corresponding to the data being transferred. When the interrupt routine for the disk driver notes that the transfer is complete, it calls wakeup on the buffer header. The interrupt uses the kernel stack for whatever process happened to be running at the time, and the wakeup is done from that system process.
Can you please explain the last line in the paragraph, which I have emphasised? It is about waking up the process which has been waiting for some event to occur and thus has slept. This para is from Galvin. By the way, can you suggest some good book or link for studying UNIX operating systems?
Thanks.
There is some process running at the time the interrupt is received. The kernel doesn't change over to some other process context to handle it -- that would take time -- it just does what's necessary in the current context, and lets the scheduler know that the next time it schedules, the waiting process is ready to proceed.
There are a number of good internals books around. I'm fond of the various McKusick et al books, like The Design and Implementation of the FreeBSD Operating System.
Maurice Bach's Design of the Unix Operating System is the most well-known and comprehensive book on the subject.
The I/O completion interrupt will be executed as soon as the disk signals the end of the transfer. This is done regardless of what the kernel is currently doing. Interrupt handlers are usually very small and self-contained. Therefore it is faster to re-use the current runtime environment (stack, CPU state, etc.) instead of doing a full context switch to a separate thread. On the downside, this means that interrupt handlers are only allowed to do very limited things, like setting a flag somewhere, or enqueuing a work item. Also, they have to clean up very carefully after themselves, so that the running process is not disturbed.
Eric Raymond's The Art of Unix Programming should be read to understand the Unix philosophy and culture, and to actually know and appreciate the reasons behind its design.

Detecting Blocked Threads

I have a theory regarding troubleshooting an asynchronous application (I'm using the CCR) and I wonder if someone can confirm my logic.
If a CCR-based multi-threaded application using the default number of threads (i.e. one per core) is slower than the same application with double the threads specified - does this mean that threads are being blocked somewhere in the code?
What do you think? Is this a quick and valid way to detect if threads are being inadvertently blocked?
What do you mean by "slower"?
If you want to automatically detect blocked threads, perhaps those threads should send heartbeats, which are then observed by a monitor of some sort, but your options are limited.
A cheap way to tell if threads are being blocked is to get the current system time before doing any potentially blocking operation, then after the operation, and see how much time has elapsed. For example, while waiting for a message to arrive, measure to see how much time the thread was blocked waiting for a message to arrive.
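The CCR question is .NET, but this measuring trick is language-agnostic; here is its shape as a C sketch (timed_wait and the 10 ms threshold are made-up illustrations):

#include <stdio.h>
#include <time.h>

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Wrap a potentially blocking operation with a monotonic clock and flag
   suspiciously long waits. */
void timed_wait(void (*blocking_op)(void))
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    blocking_op();                          /* e.g. wait for a message */
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ms = elapsed_ms(start, end);
    if (ms > 10.0)
        fprintf(stderr, "thread blocked for %.1f ms\n", ms);
}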
Unless there are always more than enough messages to be processed, threads will block waiting for a message. If you have more threads, then you have more potential message generators (depending on your design) and thus threads waiting to receive messages will be more likely to have one ready.
Exactly one thread per CPU is too few unless you can guarantee that there will always be enough messages so no thread will have to wait.
If this is the case, that means that your threadpool is being exhausted (i.e. you have 2 threads but you've async pended 4 IOs or something) - if your work is heavily IO bound, the rule of "one thread per core" doesn't really apply.
I've found that to keep the system fluid with minimal threads, I keep the tasks dealing with I/O as concise as possible. They simply post the data from the I/O into another Port and do no further processing. The data is therefore queued elsewhere for processing in a controlled manner without interfering with the task of grabbing data as fast as possible. This processing might happen in the ExclusiveGroup of an Interleave if there's shared state to think about... and a handy side-effect is that exclusive tasks will never tie up all the threads in a Dispatcher (however, I suspect that there's probably nattier ways of managing this in the CCR API)
