An imprecise external abort, received while the processor enters WFI, may cause a processor deadlock

This is an ARM erratum for Cortex-A9 processors.
Description:
An imprecise external abort received while the processor is ready to enter the WFI state might cause a processor deadlock.
Explicit memory transactions can be completed by inserting a DSB before the WFI instruction.
However, this does not prevent memory accesses generated by previously issued PLD instructions, page table walks associated with previously issued PLD instructions, or accesses generated by the PLE engine.
If an external abort is returned as a result of one of these memory accesses after executing a WFI
instruction, the processor can deadlock.
So, how can the deadlock be prevented by protecting the MMU?
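For illustration, here is a minimal sketch of the barrier-before-WFI sequence the description refers to, using GCC inline assembly for ARMv7. This is my own sketch, not the official errata workaround:

/* Sketch: drain outstanding explicit memory transactions before WFI.
 * Per the errata text, the DSB covers explicit accesses only -- it does
 * NOT cover PLD/PLE-generated traffic, which is the gap in question. */
static inline void cpu_enter_wfi(void)
{
    __asm__ volatile("dsb" ::: "memory");  /* complete explicit accesses */
    __asm__ volatile("wfi" ::: "memory");  /* enter wait-for-interrupt   */
}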

How are branch mispredictions handled before a hardware interrupt

A hardware interrupt arrives at a particular vector (not masked). The CPU checks the IF flag and pushes RFLAGS, CS and RIP to the stack. Meanwhile there are still instructions completing in the back end, and one of those instructions' branch predictions turns out to be wrong. Usually the pipeline would be flushed and the front end would start fetching from the correct address, but in this scenario an interrupt is in progress.
When an interrupt occurs, what happens to instructions in the pipeline?
I have read this, and clearly one solution is to immediately flush everything from the pipeline so that this doesn't occur, and then generate the instructions to push RFLAGS, CS and RIP to the kernel stack located via the TSS. But the question arises: how does the CPU know the (CS:)RIP associated with the most recent architectural state in order to push it on the stack, given that the front-end RIP would by now be ahead? This is similar to the question of how the taken-branch execution unit on port 0 knows the (CS:)RIP of what should have been fetched when the taken prediction turns out to be wrong -- is the address encoded into the instruction along with the prediction? The same issue arises with a trap / exception: the CPU needs to push the address of the current instruction (fault) or the next instruction (trap) to the kernel stack, but how does it work out the address of that instruction when it is halfway down the pipeline? This leads me to believe that the address must be encoded into the instruction, worked out using length information, possibly all done at the predecode stage.
The CPU will presumably discard the contents of the ROB, rolling back to the latest retirement state before servicing the interrupt.
An in-flight branch miss doesn't change this. Depending on the CPU (older / simpler), it might have already been in the process of rolling back to retirement state and flushing because of a branch miss, when the interrupt arrived.
As @Hadi says, the CPU could choose at that point to retire the branch (with the interrupt pushing a CS:RIP pointing to the correct branch target), instead of leaving it to be re-executed after returning from the interrupt.
But that only works if the branch instruction was already ready to retire: there were no instructions older than the branch still not executed. Since it's important to discover branch misses as early as possible, I assume branch recovery starts when it discovers a mispredict during execution, not waiting until it reaches retirement. (This is unlike other kinds of faults: e.g. Meltdown and L1TF are based on a faulting load not triggering #PF fault handling until it reaches retirement so the CPU is sure there really is a fault on the true path of execution. You don't want to start an expensive pipeline flush until you're sure it wasn't in the shadow of a mispredict or earlier fault.)
But since branch misses don't take an exception, redirecting the front-end can start early before we're sure that the branch instruction is part of the right path in the first place.
e.g. cmp byte [cache_miss_load], 123 / je mispredicts but won't be discovered for a long time. Then in the shadow of that mispredict, a cmp eax, 1 / je on the "wrong" path runs and a mispredict is discovered for it. With fast recovery, uops past that are flushed and fetch/decode/exec from the "right" path can start before the earlier mispredict is even discovered.
To keep IRQ latency low, CPUs don't tend to give in-flight instructions extra time to retire. Also, any retired stores that still have their data in the store buffer (not yet committed to L1d) have to commit before any stores by the interrupt handler can commit. But interrupts are serializing (I think), and any MMIO or port-IO in a handler will probably involve a memory barrier or strongly-ordered store, so letting more instructions retire can hurt IRQ latency if they involve stores. (Once a store retires, it definitely needs to happen even while its data is still in the store buffer).
The out-of-order back-end always knows how to roll back to a known-good retirement state; the entire contents of the ROB are always considered speculative because any load or store could fault, and so can many other instructions [1]. Speculation past branches isn't super-special.
Branches are only special in having extra tracking for fast recovery (the Branch Order Buffer in Nehalem and newer) because they're expected to mispredict with non-negligible frequency during normal operation. See What exactly happens when a skylake CPU mispredicts a branch? for some details. Especially David Kanter's quote:
Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed.
(This answer is intentionally very Intel-centric because you tagged it intel, not x86. I assume AMD does something similar, and probably most out-of-order uarches for other ISAs are broadly similar. Except that memory-order mis-speculation isn't a thing on CPUs with a weaker memory model where CPUs are allowed to visibly reorder loads.)
Footnote 1: So can div, or any FPU instruction if FP exceptions are unmasked. And a denormal FP result could require a microcode assist to handle, even with FP exceptions masked like they are by default.
On Intel CPUs, a memory-order mis-speculation can also result in a pipeline nuke (load speculatively done early, before earlier loads complete, but the cache lost its copy of the line before the x86 memory model said the load could take its value).
In general, each entry in the ReOrder Buffer (ROB) has a field that is used to store enough information about the instruction address to reconstruct the whole instruction address unambiguously. It may be too costly to store the whole address for each instruction in the ROB. Instructions that have not yet been allocated (i.e., have not yet passed the allocation stage of the pipeline) need to carry this information with them at least until they reach the allocation stage.
If an interrupt and a branch misprediction occur at the same time, the processor may, for example, choose to service the interrupt. In this case, all the instructions that are on the mispredicted path need to be flushed. The processor may also choose to flush other instructions that are on the correct path but have not yet retired. All of these instructions are in the ROB and their instruction addresses are known. For each speculated branch, there is a tag that identifies all instructions on that speculated path, and all instructions on the path are tagged with it. If there is another, later speculated branch, another tag is used, but it is also ordered with respect to the previous tag. Using these tags, the processor can determine exactly which instructions to flush when any of the speculated branches turns out to be incorrect. This is determined after the corresponding branch instruction completes execution in the branch execution unit. Branches may complete execution out of order. When the correct address of a mispredicted branch is calculated, it's forwarded to the fetch unit and the branch prediction unit (BPU). The fetch unit uses it to fetch instructions from the correct path and the BPU uses it to update its prediction state.
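To make the tagging idea concrete, here is a toy model in C (my own illustration; no real core stores its ROB this way). Each entry carries the tag of the youngest speculated branch it sits behind; tags are allocated in program order, so a mispredict flushes every entry whose tag is at or after the bad one:

#include <stdio.h>

struct rob_entry {
    int seq;         /* program order                               */
    int branch_tag;  /* tag of enclosing speculated path, -1 = none */
    int flushed;
};

static void flush_mispredicted(struct rob_entry *rob, int n, int bad_tag)
{
    for (int i = 0; i < n; i++)
        if (rob[i].branch_tag >= bad_tag)  /* on or after the bad path */
            rob[i].flushed = 1;
}

int main(void)
{
    struct rob_entry rob[] = {
        {0, -1, 0},              /* before any speculated branch */
        {1,  0, 0}, {2, 0, 0},   /* behind branch tag 0          */
        {3,  1, 0}, {4, 1, 0},   /* behind branch tag 1          */
    };
    flush_mispredicted(rob, 5, 1);  /* branch with tag 1 mispredicted */
    for (int i = 0; i < 5; i++)
        printf("seq=%d tag=%d flushed=%d\n",
               rob[i].seq, rob[i].branch_tag, rob[i].flushed);
    return 0;
}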
The processor can choose to retire the mispredicted branch instruction itself and flush all other later instructions. All rename registers are reclaimed and those physical registers that are mapped to architectural registers at the point the branch is retired are retained. At this point, the processor executes instructions to save the current state and then begins fetching instructions of the interrupt handler.

how does the OS determine null pointer access without checking all pointer addresses?

It is known that the 0 address (referred to by the macro NULL) is not legal to access.
I was wondering: how can the operating system (say Linux) determine when there is an access to the null address somewhere in the code, without having to check each and every pointer access in the code?
I assume it has something to do with signals, specifically the SIGSEGV signal.
But I'm not sure how it's done.
First of all, a null pointer access is not necessarily invalid. Typically, either the operating system's program loader or the linker (depending upon the system) sets up processes so that the lowest page in the virtual address space is not mapped.
Many systems that do this also allow the application to map the first page, making a null reference valid.
The NULL pointer is checked the same way all other memory addresses are checked: through the logical address translation of the CPU.
Each time the processor accesses memory (ignoring caching) it looks up the address in the process's page table. If there is no corresponding entry, the processor triggers an access fault (that in Unix variants gets translated into a signal).
If there is an entry in the page table for the address, the processor checks the access allowed for the page. If you are in user mode and try to access a kernel protected page, that triggers a fault. If you are trying to write to a read only page, that triggers a fault. If you try to execute a non-executable page, that triggers a fault.
This is a rather lengthy topic. You need to understand logical memory translation (sometimes misnamed virtual memory) if you want to learn more.
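A toy model of those checks in C (my own illustration; a real MMU walks multi-level tables in hardware, and all names here are made up):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* 4 KiB pages */
#define NPAGES     16   /* toy single-level table */

struct pte { bool present, user, writable; };

static struct pte page_table[NPAGES];   /* page 0 deliberately unmapped */

static bool access_ok(uintptr_t va, bool user_mode, bool is_write)
{
    struct pte *p = &page_table[(va >> PAGE_SHIFT) % NPAGES];
    if (!p->present)              return false;  /* page fault            */
    if (user_mode && !p->user)    return false;  /* kernel-protected page */
    if (is_write && !p->writable) return false;  /* write to read-only    */
    return true;
}

int main(void)
{
    page_table[1] = (struct pte){ true, true, false };  /* mapped, read-only */

    printf("NULL read ok?    %d\n", access_ok(0x0,    true, false)); /* 0 */
    printf("page 1 read ok?  %d\n", access_ok(0x1000, true, false)); /* 1 */
    printf("page 1 write ok? %d\n", access_ok(0x1000, true, true));  /* 0 */
    return 0;
}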
Pointers refer to virtual address space. In the virtual address space, each page of memory can be mapped to real physical memory. The operating system takes care of this mapping separately for each process.
When you access memory through a pointer, the CPU looks at the mapping for the virtual address your pointer specifies and checks if there is real, physical memory behind. Additional checks are done to verify that you have read or write access to that piece of memory, depending on the operation you are attempting.
If there is no memory mapped at that address, the CPU raises a fault. The OS catches that fault and - usually - delivers SIGSEGV to the calling process.
The zero page containing the NULL address is usually intentionally left unmapped, so that NULL pointer accesses, which usually result from programming errors, are easily trapped.
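You can watch this happen from user space. A minimal Linux/POSIX sketch that installs a SIGSEGV handler and then dereferences NULL (for demonstration only; printf is not async-signal-safe):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    printf("SIGSEGV at address %p\n", si->si_addr);  /* faulting address */
    _exit(1);
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    volatile int *p = NULL;
    return *p;   /* MMU raises a page fault; the kernel delivers SIGSEGV */
}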
Linux obtains this support from the hardware. The processor is informed about the purpose of individual memory regions and their availability. If an "unavailable" memory region is accessed, the processor informs the operating system about the problem, and the operating system informs the application.
It means two things:
There is no software overhead related to checking all pointers against the NULL value.
There is no precise check for allowed pointer values.
In other words, if your pointer points anywhere into "available" memory, then the hardware is unable to recognize the problem.
The Memory Management Unit (MMU) plays the key role in triggering the exception when a NULL pointer is dereferenced or an invalid address is accessed.
During the normal virtual-to-physical mapping that the MMU performs on each memory access, the undefined address is simply not found among the virtual addresses defined in the MMU descriptors. This can have catastrophic consequences if it occurs in OS kernel space, or it just results in the process being killed and cleaned up in user space.
...how can the operating system (say Linux) determine when there is an access to the null address, somewhere in the code, without having to check each and every pointer access in the code?
Well, the OS cannot determine a NULL dereference without the pointer being accessed. From the wiki for segmentation fault:
In computing, a segmentation fault (often shortened to segfault) or access violation is a fault raised by hardware with memory protection, notifying an operating system (OS) about a memory access violation; on x86 computers this is a form of general protection fault. The OS kernel will in response usually perform some corrective action, generally passing the fault on to the offending process by sending the process a signal....
The memory access violation is a run-time incident, and unless there is an invalid access, there is no way OS will raise the signal to the process.
FWIW, a process is allowed to access the memory allocated to it (in its virtual address space). Any address outside the allocated virtual address space, if accessed, will generate a fault (through the MMU), which in turn generates the segmentation fault.
TL;DR - SIGSEGV is generated on encountering the NULL-pointer dereference, not before that. Also, the OS does not detect the erroneous access itself; rather, the Memory Management Unit informs the OS by raising a fault.

if an interrupt happens, how does the Unix kernel determine which process it's for

Let's say Unix is executing process A and an interrupt at a higher level occurs. The OS then gets an interrupt number and looks up the routine to call in the IVT.
Now how does the OS know that this interrupt was for process A and not for process B? It might be that process B issued a disk read and it completed while the OS was executing process A.
Thanks
Start with this: http://en.wikipedia.org/wiki/MINIX
Go buy the book and read it; it will really help a lot.
Interrupts aren't "for" processes. They're for devices and handled by device drivers.
The device driver handles the interrupt and updates the state of the device.
If the device driver concludes that an I/O operation is complete, it can then check its queue of I/O requests to determine which operation completed. The operation is removed from the queue of pending operations.
The process which is waiting for that operation is now ready-to-run and can resume execution.
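A toy sketch of that flow in C (my own illustration; the request queue, pid field and wake_up stub are made up, not a real driver API). The interrupt itself carries no process identity; the driver's pending-request queue supplies it:

#include <stdio.h>
#include <stdlib.h>

struct io_request {
    int sector;              /* which disk block was asked for  */
    int pid;                 /* process sleeping on the request */
    struct io_request *next;
};

static struct io_request *pending;   /* driver's in-flight queue */

static void wake_up(int pid)         /* stand-in for the scheduler */
{
    printf("pid %d is ready to run again\n", pid);
}

static void disk_irq_handler(int completed_sector)
{
    for (struct io_request **pp = &pending; *pp; pp = &(*pp)->next) {
        if ((*pp)->sector == completed_sector) {
            struct io_request *done = *pp;
            *pp = done->next;        /* dequeue the finished request */
            wake_up(done->pid);
            free(done);
            return;
        }
    }
}

int main(void)
{
    /* earlier: process B (pid 42) issued a read of sector 7 and blocked */
    struct io_request *r = malloc(sizeof *r);
    r->sector = 7; r->pid = 42; r->next = pending; pending = r;

    disk_irq_handler(7);   /* interrupt arrives while process A runs */
    return 0;
}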
You are talking about hardware interrupts, and these are not targeted at processes.
If a process A requests a file, the filesystem layer, which already resides in the kernel, will fetch the file from the block device. The block device itself is handled by a driver.
When the interrupt occurs, triggered by the block device, the OS has that interrupt associated with the driver, so the driver is told to handle it. The driver then queries which blocks were read and checks what they were requested for.
After the filesystem is told that the requested data is ready, it may process it further. Then the process leaves the blocked state.
In the next round of the scheduler, the scheduler may select to wake up this process. It may also select to wake up another process first.
As you can see, the occurrence of the interrupt is fully disconnected from the operation of the process.

sqlite database connection/locking question

Folks
I am implementing a file based queue (see my earlier question) using sqlite. I have the following threads running in background:
thread-1 to empty out a memory structure into the "queue" table (an insert into "queue" table).
thread-2 to read and "process" the "queue" table - runs every 5 to 10 seconds
thread-3 - runs very infrequently and purges old data that is no longer needed from the "queue" table and also runs vacuum so the size of the database file remains small.
Now the behavior that I would like is for each thread to get whatever lock it needs (waiting with a timeout if possible) and then complete the transaction. It is OK if the threads do not run concurrently - what is important is that a transaction, once begun, does not fail due to "locking" errors such as "database is locked".
I looked at the transaction documentation but there does not seem to be a "timeout" facility (I am using JDBC). Can the timeout be set to a large quantity in the connection?
One solution (untried) I can think of is to have a connection pool of max 1 connection. Thus only one thread can connect at a time and so we should not see any locking errors. Are there better ways?
Thanx!
If it were me, I'd use a single database connection handle. If a thread needs it, it can allocate it within a critical section (or mutex, or similar) - this is basically a poor man's connection pool with only one connection in the pool :) It can do its business with the database. When done, it exits the critical section (or releases the mutex). You won't get locking errors if you carefully use the single db connection.
-Don
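On the timeout question: SQLite itself does expose a busy timeout. A minimal sketch with the C API (the database file and queue table here are assumptions for illustration; JDBC drivers typically surface the same knob as a connection property or busy_timeout pragma - check your driver):

#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("queue.db", &db) != SQLITE_OK) return 1;

    sqlite3_busy_timeout(db, 10000);   /* retry up to 10 s before SQLITE_BUSY */

    char *err = NULL;
    int rc = sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS queue(payload TEXT);"
        "BEGIN IMMEDIATE;"             /* take the write lock up front */
        "INSERT INTO queue(payload) VALUES('x');"
        "COMMIT;", NULL, NULL, &err);
    if (rc != SQLITE_OK) {
        fprintf(stderr, "sqlite error: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}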

What exactly are "spin-locks"?

I always wondered what they are: every time I hear about them, images of futuristic flywheel-like devices go dancing (rolling?) through my mind...
What are they?
When you use regular locks (mutexes, critical sections etc.), the operating system puts your thread in the WAIT state and preempts it by scheduling other threads on the same core. This has a performance penalty if the wait time is really short, because your thread now has to wait to be rescheduled before it receives CPU time again.
Besides, kernel objects are not available in every state of the kernel, such as in an interrupt handler or when paging is not available etc.
Spinlocks don't cause preemption; they wait in a loop ("spin") till the other core releases the lock. This prevents the thread from losing its quantum and lets it continue as soon as the lock gets released. The simple mechanism of spinlocks allows a kernel to use them in almost any state.
That's why, on a single-core machine, a spinlock is simply a "disable interrupts" or "raise IRQL", which prevents thread scheduling completely.
Spinlocks ultimately allow kernels to avoid a "Big Kernel Lock" (a lock acquired when a core enters the kernel and released on exit) and to have granular locking over kernel primitives, giving better multi-processing on multi-core machines and thus better performance.
EDIT: A question came up: "Does that mean I should use spinlocks wherever possible?" and I'll try to answer it:
As I mentioned, spinlocks are only useful in places where the anticipated waiting time is shorter than a quantum (read: milliseconds) and preemption doesn't make much sense (e.g. kernel objects aren't available).
If the waiting time is unknown, or if you're in user mode, spinlocks aren't efficient. You consume 100% CPU time on the waiting core while checking whether the spinlock is available. You prevent other threads from running on that core till your quantum expires. This scenario is only feasible for short bursts at kernel level, and is unlikely to be an option for a user-mode application.
Here is a question on SO addressing that: Spinlocks, How Useful Are They?
Say a resource is protected by a lock. A thread that wants access to the resource needs to acquire the lock first. If the lock is not available, the thread might repeatedly check whether the lock has been freed. During this time the thread busy-waits, checking for the lock, using CPU but not doing any useful work. Such a lock is termed a spin lock.
It is pretty much a loop that keeps going till a certain condition is met:
while (cantGoOn) {};
while (something != TRUE) {};
// it happened
move_on();
It's a type of lock that does busy waiting
It's considered an anti-pattern, except for very low-level driver programming (where it can happen that calling a "proper" waiting function has more overhead than simply busy locking for a few cycles).
See for example Spinlocks in Linux kernel.
Spinlocks are locks on which the thread spins till the lock is available. They are normally used to avoid the overhead of blocking on a kernel object when there is a prospect of acquiring the kernel object within some small time period.
Ex:
while (SpinCount-- && !kernel_object_is_free())
{}
/* spin budget exhausted: fall back to acquiring the kernel object */
acquire_kernel_object();
You would want to use a spinlock when you think it is cheaper to enter a busy-waiting loop and poll a resource instead of blocking when the resource is locked.
Spinning can be beneficial when locks are fine grained and large in number (for example, a lock per node in a linked list) as well as when lock hold times are always extremely short. In general, while holding a spin lock, one should avoid blocking, calling anything that itself may block, holding more than one spin lock at once, making dynamically dispatched calls (interface and virtuals), making statically dispatched calls into any code one doesn't own, or allocating memory.
It's also important to note that SpinLock is a value type, for performance reasons. As such, one must be very careful not to accidentally copy a SpinLock instance, as the two instances (the original and the copy) would then be completely independent of one another, which would likely lead to erroneous behavior of the application. If a SpinLock instance must be passed around, it should be passed by reference rather than by value.
It's a loop that spins around until a condition is met.
In a nutshell, a spinlock employs atomic compare-and-swap (CAS) or test-and-set style instructions to implement a thread-safe locking idiom without involving the kernel. Such structures scale well on multi-core machines.
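A minimal C11 sketch of that test-and-set idea (an illustration, not a production lock - real implementations add backoff and a pause/yield hint in the loop):

#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static inline void spin_lock(spinlock_t *l)
{
    /* atomically set the flag; keep spinning while it was already set */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;  /* busy-wait */
}

static inline void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}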
Well, yes - the point of spin locks (vs. traditional critical sections, etc.) is that they offer better performance under some circumstances (multicore systems...), because they don't immediately yield the rest of the thread's quantum.
A spinlock is a type of lock that neither blocks nor sleeps. Any thread that wants to acquire a spinlock for a shared or critical resource will continuously spin, wasting CPU cycles, till it acquires the lock for that resource. Once the spinlock is acquired, the thread tries to complete its work within its quantum and then releases the resource. A spinlock is the highest-priority type of lock; simply put, it is a non-preemptive kind of lock.
