Non-blocking system call and mode switch - Unix

Suppose we invoke a system call for asynchronous IO. At the time of invoking the system call, the mode changes from user mode to kernel mode. After the invocation, the mode should immediately change back to user mode so that the user application can proceed (since the call is non-blocking).
Now if the mode is changed back to user mode, how will the kernel proceed with the IO? Will the kernel perform asynchronous IO in user mode?

IO means two different things (at two different levels of abstraction):
from an application point of view, i.e. for a process running in user mode, it means calling any system call (listed in syscalls(2) for Linux) related to input or output, e.g. read(2), .... Notice that aio_read(3) is not listed as a system call (it is a library function using other system calls, see aio(7));
on the raw hardware, it means any physical input or output operation sending data (or commands) to actual IO devices (e.g. SATA disks, USB ports, etc.).
Asynchronous or synchronous IO for a process means just calling some suitable subset of system calls, since system calls are the only way a process can interact with the kernel, and since in user-mode no physical IO is directly possible.
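To make the application-level point concrete, here is a minimal sketch of POSIX AIO from user space (see aio(7)). The file name is just an example, error handling is trimmed, and on older glibc you link with -lrt:

```c
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* example file, pick any readable file */
    if (fd < 0)
        return 1;

    char buf[256];
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf - 1;
    cb.aio_offset = 0;

    aio_read(&cb);          /* returns immediately; the library/kernel carry on with the IO */

    /* The process keeps running in user mode here and can do other work. */

    while (aio_error(&cb) == EINPROGRESS)
        ;                   /* busy-wait only for illustration; real code polls, waits or uses a signal */

    ssize_t n = aio_return(&cb);
    if (n >= 0) {
        buf[n] = '\0';
        printf("read %zd bytes: %s", n, buf);
    }
    close(fd);
    return 0;
}
```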
Read Operating Systems: Three Easy Pieces (freely downloadable) to get a better view of OSes.
Will kernel perform asynchronous IO in user mode ?
This shows some confusion. In practice, inside the kernel, physical IO is generally (and probably always) initiated by interrupt handlers (which might configure some DMA etc...). A hardware interrupt switches the processor to "kernel-mode" (actually supervisor mode of the ISA).
A blocking system call (e.g. read(2) when physical IO is needed because the data is not in the page cache) doesn't block the entire computer: it is just the calling process which becomes "blocked" and is descheduled. The kernel will schedule some other runnable process. Much later, after the kernel has handled many interrupts, the blocked process becomes runnable again and can be rescheduled to run.
Processes are themselves (with files) one of the major abstractions (provided by the kernel) to application code.
In other words, at the conceptual level, the kernel scheduler is coded in some continuation-passing style.
See also kernelnewbies and OSDEV.

The asynchronous IO will be performed on behalf of the process; the kernel will handle it almost as usual while the process continues to run. In blocking mode, the process is simply suspended.
The kernel has access to every process's address space, so it can fill or read data in a process's user space regardless of what that process is currently doing.

Related

User mode and kernel mode: different programs at the same time

Is it possible that one process is running in kernel mode and another in user mode at the same time?
I know it's not a coding question, but please guide me if someone knows the answer.
For two processes to actually be running at the same time, you must have multiple CPUs. And indeed, when you have multiple CPUs, what runs on the different CPUs is very loosely coupled, and you can definitely have one process running user code on one CPU while another process runs kernel code (e.g., doing some work inside a system call) on another CPU.
If you are asking about just one CPU, then you can't have two running processes at the same time. But what you can have is two runnable processes, which means two processes that are both ready to run; since there is just one CPU, only one of them can actually run. One of the runnable processes might be in user mode - e.g., consider a long-running tight loop that was preempted after its time quantum expired. Another runnable process might be in kernel mode - e.g., consider a process that did a read() system call from disk; the kernel sent the read request to the disk, and the request has now completed, so the process is ready to run again in kernel mode and finish the read() call.
Yes, it is possible. Even multiple processes can be in the kernel mode at the same time.
It's just that a single process cannot be in both modes at the same time.
Correct me if I'm wrong, but I suppose there are no processes in kernel mode, only threads.

How can code be asynchronous on a single-core CPU, which is synchronous?

In a uniprocessor (UP) system, there's only one CPU core, so only one thread of execution can be happening at once. This thread of execution is synchronous (it gets a list of instructions in a queue and runs them one by one). When we write code, it compiles to a set of CPU instructions.
How can we have asynchronous behavior in software on a UP machine? Isn't everything just run in some fixed order chosen by the OS?
Even an out-of-order execution CPU gives the illusion of running instructions in program order. (This is separate from memory reordering observed by other cores or devices in the system. In a UP system, runtime memory reordering is only relevant for device drivers.)
An interrupt handler is a piece of code that runs asynchronously to the rest of the code, and can happen in response to an interrupt from a device outside the CPU. In user-space, a signal handler has equivalent semantics.
(Or a hardware interrupt can cause a context switch to another software thread. This is asynchronous as far as the software thread is concerned.)
Events like interrupts from network packets arriving or disk I/O completing happen asynchronously with respect to whatever the CPU was doing before the interrupt.
Asynchronous doesn't mean simultaneous, just that it can run between any two machine instructions of the rest of the code. A signal handler in a user-space program can run between any two machine instructions, so the code in the main program must work in a way that doesn't break if this happens.
e.g. A program with a signal-handler can't make any assumptions about data on the stack below the current stack pointer (i.e. in the un-reserved part of the stack). The red-zone in the x86-64 SysV ABI is a modification to this rule for user-space only, since the kernel can respect it when transferring control to a signal handler. The kernel itself can't use a red-zone, because hardware interrupts write to the stack outside of software control, before running the interrupt handler.
In an OS where I/O completion can result in the delivery of a POSIX signal (i.e. with POSIX async I/O), the timing of a signal can easily be determined by the timing of hardware interrupts, so user-space code runs asynchronously with timing determined by things external to the computer. It's not just an issue for the kernel.
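For the user-space side, a minimal sketch of a signal handler that runs asynchronously to the main loop might look like this (SIGINT and the flag name are just illustration; the handler only sets a flag, which is the async-signal-safe way to communicate with the main program):

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;   /* only async-signal-safe shared state */

static void handler(int signo)
{
    (void)signo;
    got_signal = 1;        /* just record the event; do the real work in the main loop */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    /* The handler can run between ANY two instructions of this loop, so the
       loop only reads a flag that the handler is allowed to write. */
    while (!got_signal)
        pause();           /* sleep until some signal arrives */

    printf("signal received, main program resumes here\n");
    return 0;
}
```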
On a multicore system, there are obviously far more ways for things to happen in different orders more of the time.
Many processors are capable of multithreading, and many operating systems can simulate multithreading on single-threaded processors by swapping tasks in and out of the processor.

Concurrency in the Linux network drivers: probe() VS ndo_open(), ndo_start_xmit() VS NAPI poll()

Could anyone explain if additional synchronization, e.g., locking, is needed in the following two situations in a Linux network driver? I am interested in the kernel 2.6.32 and newer.
1. .probe VS .ndo_open
In a driver for a PCI network card, the net_device instance is usually registered in .probe() callback. Suppose a driver specifies .ndo_open callback in the net_device_ops, performs other necessary operations and then calls register_netdev().
Is it possible for that .ndo_open callback to be called by the kernel after register_netdev() but before the end of .probe callback? I suppose it is, but may be, there is a stronger guarantee, something that ensures that the device can be opened no earlier than .probe ends?
In other words, if .probe callback accesses, say, the private part of the net_device struct after register_netdev() and ndo_open callback accesses that part too, do I need to use locks or other means to synchronize these accesses?
2. .ndo_start_xmit VS NAPI poll
Is there any guarantee that, for a given network device, .ndo_start_xmit callback and NAPI poll callback provided by a driver never execute concurrently?
I know that .ndo_start_xmit is executed with BH disabled at least and poll runs in the softirq, and hence, BH context. But this serializes execution of these callbacks on the local CPU only. Is it possible for .ndo_start_xmit and poll for the same network device to execute simultaneously on different CPUs?
As above, if these callbacks access the same data, is it needed to protect the data with a lock or something?
References to the kernel code and/or the docs are appreciated.
EDIT:
To check the first situation, I conducted an experiment and added a 1-minute delay right before the end of the call to register_netdev() in e1000 driver (kernel: 3.11-rc1). I also added debug prints there in .probe and .ndo_open callbacks. Then I loaded e1000.ko, and tried to access the network device it services before the delay ended (in fact, NetworkManager did that before me), then checked the system log.
Result: yes, it is possible for .ndo_open to be called even before the end of .probe although the "race window" is usually rather small.
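Given that result, the usual way to avoid the window is to finish initialising everything .ndo_open might touch before calling register_netdev(). A rough sketch (not a complete driver; my_priv, my_netdev_ops and the PCI setup are placeholders):

```c
static int my_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
    struct net_device *dev;
    struct my_priv *priv;              /* made-up private struct */
    int err;

    /* pci_enable_device(), BAR mapping, etc. omitted for brevity */

    dev = alloc_etherdev(sizeof(*priv));
    if (!dev)
        return -ENOMEM;

    priv = netdev_priv(dev);
    spin_lock_init(&priv->lock);       /* initialise everything ndo_open may use */
    dev->netdev_ops = &my_netdev_ops;
    SET_NETDEV_DEV(dev, &pdev->dev);

    /* Register only once the device is fully ready to be opened:
       ndo_open may run as soon as register_netdev() returns. */
    err = register_netdev(dev);
    if (err) {
        free_netdev(dev);
        return err;
    }

    /* Anything done here races with ndo_open, so either avoid touching
       state that ndo_open also uses, or protect it with a lock. */
    return 0;
}
```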
The second situation (.ndo_start_xmit VS NAPI poll) is still unclear to me and any help is appreciated.
Regarding the ".ndo_start_xmit vs NAPI poll" question, here's how I'm thinking about it:
The start-xmit method of a network driver is invoked in NET_TX_SOFTIRQ context - it runs in a softirq context itself. So does the NAPI receive poll method, but of course in the NET_RX_SOFTIRQ context.
Now the two softirqs will lock each other out - not race - on any local core. But by design intent, softirqs can certainly run in parallel on SMP; thus, who is to say that these two methods, .ndo_start_xmit and NAPI poll, running in two separate softirq contexts, will never race?
IOW, I guess it could happen. Be safe, use spinlocks to protect global data (see the sketch below).
Also, with modern TCP offload techniques becoming more prevalent, GSO is/could also be invoked at any point.
HTH!
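A rough sketch (not taken from any real driver) of what that locking could look like; my_priv and its counters are made up, and the hardware handling is elided:

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/spinlock.h>

struct my_priv {
    struct napi_struct napi;
    spinlock_t lock;              /* protects the counters below */
    unsigned long tx_count;
    unsigned long rx_count;
};

static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct my_priv *priv = netdev_priv(dev);

    /* start_xmit runs with BH disabled on this CPU, but poll may run
       concurrently on another CPU, so a spinlock is still needed on SMP. */
    spin_lock(&priv->lock);
    priv->tx_count++;
    spin_unlock(&priv->lock);

    /* ... hand the skb to the hardware ... */
    dev_kfree_skb_any(skb);
    return NETDEV_TX_OK;
}

static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_priv *priv = container_of(napi, struct my_priv, napi);
    int done = 0;

    /* ... process up to 'budget' received packets, counting them in 'done' ... */

    spin_lock(&priv->lock);
    priv->rx_count += done;
    spin_unlock(&priv->lock);

    if (done < budget)
        napi_complete(napi);
    return done;
}
```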

Cooperative Multitasking system

I'm trying to get my head around the concept of a cooperative multitasking system and exactly how it works in a single-threaded application.
My understanding is that this is a "form of multitasking in which multiple tasks execute by voluntarily ceding control to other tasks at programmer-defined points within each task."
So if you have a list of tasks and one task is executing, how do you decide to pass execution to another task? And when you give execution back to a previous task, how do you resume from where you left off?
I find this a bit confusing because I don't understand how this can be achieved without a multithreaded application.
Any advice would be very helpful :)
Thanks
In your specific scenario where a single process (or thread of execution) uses cooperative multitasking, you can use something like Windows' fibers or the POSIX setcontext family of functions. I will use the term fiber here.
Basically when one fiber is finished executing a chunk of work and wants to voluntarily allow other fibers to run (hence the "cooperative" term), it either manually switches to the other fiber's context or more typically it performs some kind of yield() or scheduler() call that jumps into the scheduler's context, then the scheduler finds a new fiber to run and switches to that fiber's context.
What do we mean by context here? Basically the stack and registers. There is nothing magic about the stack, it's just a block of memory the stack pointer happens to point to. There is also nothing magic about the program counter, it just points to the next instruction to execute. Switching contexts simply saves the current registers somewhere, changes the stack pointer to a different chunk of memory, updates the program counter to a different stream of instructions, copies that context's saved registers into the CPU, then does a jump. Bam, you're now executing different instructions with a different stack. Often the context switch code is written in assembly that is invoked in a way that doesn't modify the current stack or it backs out the changes, in either case it leaves no traces on the stack or in registers so when code resumes execution it has no idea anything happened. (Again, the theme: we assume that method calls fiddle with registers, push arguments to the stack, move the stack pointer, etc but that is just the C calling convention. Nothing requires you to maintain a stack at all or to have any particular method call leave any traces of itself on the stack).
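Here is a minimal sketch of that idea with the POSIX setcontext family (ucontext.h); the fiber names and the fixed-size stacks are just for illustration:

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, a_ctx, b_ctx;

static void fiber_a(void)
{
    printf("A: step 1\n");
    swapcontext(&a_ctx, &b_ctx);   /* voluntarily yield to fiber B */
    printf("A: step 2\n");
    /* returning jumps to a_ctx.uc_link (fiber B's saved context) */
}

static void fiber_b(void)
{
    printf("B: step 1\n");
    swapcontext(&b_ctx, &a_ctx);   /* yield back to fiber A */
    printf("B: step 2\n");
    /* returning jumps to b_ctx.uc_link (back to main) */
}

int main(void)
{
    static char stack_a[64 * 1024], stack_b[64 * 1024];  /* each fiber gets its own stack */

    getcontext(&a_ctx);
    a_ctx.uc_stack.ss_sp   = stack_a;
    a_ctx.uc_stack.ss_size = sizeof stack_a;
    a_ctx.uc_link = &b_ctx;                 /* when A finishes, resume B */
    makecontext(&a_ctx, fiber_a, 0);

    getcontext(&b_ctx);
    b_ctx.uc_stack.ss_sp   = stack_b;
    b_ctx.uc_stack.ss_size = sizeof stack_b;
    b_ctx.uc_link = &main_ctx;              /* when B finishes, resume main */
    makecontext(&b_ctx, fiber_b, 0);

    swapcontext(&main_ctx, &a_ctx);         /* start fiber A; main resumes here at the end */
    printf("main: done\n");
    return 0;
}
```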
Since each stack is separate, you don't have some continuous chain of seemingly random method calls eventually overflowing the stack (which might be the result if you naively tried to implement this scheme using standard C methods that continuously called each other). You could implement this manually with a state machine where each fiber keeps track of where it is in its work, periodically returning to the calling dispatcher's method, but why bother when actual fiber/co-routine support is widely available?
Also remember that cooperative multitasking is orthogonal to processes, protected memory, address spaces, etc. Witness Mac OS 9 or Windows 3.x. They supported the idea of separate processes. But when you yielded, the context was changed to the OS context, allowing the OS scheduler to run, which then potentially selected another process to switch to. In theory you could have a full protected virtual memory OS that still used cooperative multitasking. In those systems, if an errant process never yielded, the OS scheduler never ran, so all other processes in the system were frozen. **
The next natural question is what makes something pre-emptive... The answer is that the OS schedules an interrupt timer with the CPU to stop the currently executing task and switch back to the OS scheduler's context regardless of whether the current task cares to release the CPU or not, thus "pre-empting" it.
If the OS uses CPU privilege levels, the (kernel-configured) timer is not cancelable by lower-level (user mode) code, though in theory if the OS didn't use such protections an errant task could mask off or cancel the interrupt timer and hijack the CPU. There are some other scenarios, like IO calls, where the scheduler can be invoked outside the timer, and the scheduler may decide no other process has higher priority and return control to the same process without a switch... And in reality most OSes don't do a real context switch here because that's expensive; the scheduler code runs inside the context of whatever process was executing, so it has to be very careful not to step on the stack, to save register states, etc.
** You might ask why not just fire a timer if yield isn't called within a certain period of time. The answer lies in multi-threaded synchronization. In a cooperative system, you don't have to bother taking locks, worry about re-entrance, etc because you only yield when things are in a known good state. If this mythical timer fires, you have now potentially corrupted the state of the program that was interrupted. If programs have to be written to handle this, congrats... You now have a half-assed pre-emptive multitasking system. Might as well just do it right! And if you are changing things anyway, may as well add threads, protected memory, etc. That's pretty much the history of the major OSes right there.
The basic idea behind cooperative multitasking is trust - that each subtask will relinquish control, of its own accord, in a timely fashion, to avoid starving other tasks of processor time. This is why tasks in a cooperative multitasking system need to be tested extremely thoroughly, and in some cases certified for use.
I don't claim to be an expert, but I imagine cooperative tasks could be implemented as state machines, where passing control to the task would cause it to run for the absolute minimal amount of time it needs to make any kind of progress. For example, a file reader might read the next few bytes of a file, a parser might parse the next line of a document, or a sensor controller might take a single reading, before returning control back to a cooperative scheduler, which would check for task completion.
Each task would have to keep its internal state on the heap (at object level), rather than on the stack frame (at function level) like a conventional blocking function or thread.
And unlike conventional multitasking, which relies on a hardware timer to trigger a context switch, cooperative multitasking relies on the code to be written in such a way that each step of each long-running task is guaranteed to finish in an acceptably small amount of time.
The tasks will execute an explicit wait or pause or yield operation which makes the call to the dispatcher. There may be different operations for waiting on IO to complete or explicitly yielding in a heavy computation. In an application task's main loop, it could have a *wait_for_event* call instead of busy polling. This would suspend the task until it has input to process.
There may also be a time-out mechanism for catching runaway tasks, but it is not the primary means of switching (or else it wouldn't be cooperative).
One way to think of cooperative multitasking is to split a task into steps (or states). Each task keeps track of the next step it needs to execute. When it's the task's turn, it executes only that one step and returns. That way, in the main loop of your program you are simply calling each task in order, and because each task only takes up a small amount of time to complete a single step, we end up with a system which allows all of the tasks to share cpu time (ie. cooperate).
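Here is a minimal sketch of that step-per-turn idea in C; the task, its states, and the "work" it does are made up for illustration:

```c
#include <stdbool.h>
#include <stdio.h>

enum reader_state { READER_OPEN, READER_READ, READER_DONE };

struct reader_task {
    enum reader_state state;   /* where to resume the next time this task gets the CPU */
    int bytes_read;
};

/* Each call performs ONE small step and returns, so other tasks can run. */
static bool reader_step(struct reader_task *t)
{
    switch (t->state) {
    case READER_OPEN:
        printf("reader: opening file\n");
        t->state = READER_READ;
        return true;
    case READER_READ:
        t->bytes_read += 64;                /* pretend to read a small chunk */
        printf("reader: %d bytes so far\n", t->bytes_read);
        if (t->bytes_read >= 256)
            t->state = READER_DONE;
        return true;
    case READER_DONE:
    default:
        return false;                       /* nothing left to do */
    }
}

int main(void)
{
    struct reader_task r = { READER_OPEN, 0 };
    bool busy = true;

    /* The cooperative "scheduler": call each task's step in turn until all are done. */
    while (busy) {
        busy = false;
        if (reader_step(&r))
            busy = true;
        /* other tasks' step functions would be called here too */
    }
    return 0;
}
```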

POSIX Threads: are pthread_cond_wait() and others system calls?

The POSIX standard defines several routines for thread synchronization, based on concepts like mutexes and condition variables.
My question is: are these (e.g. pthread_cond_init(), pthread_mutex_init(), pthread_mutex_lock()... and so on) system calls or just library calls? I know they are included via "pthread.h", but do they ultimately result in a system call and are therefore implemented in the kernel of the operating system?
On Linux a pthread mutex makes a "futex" system call, but only if the lock is contended. That means that taking a lock no other thread wants is almost free.
In a similar way, sending a condition signal is only expensive when there is someone waiting for it.
So I believe that your answer is that pthread functions are library calls that sometimes result in a system call.
Whenever possible, the library avoids trapping into the kernel for performance reasons. If you already have some code that uses these calls you may want to take a look at the output from running your program with strace to better understand how often it is actually making system calls.
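If you want to see this for yourself, a minimal sketch like the following (the names are arbitrary) can be run under strace -f -e trace=futex; the uncontended loop in main() typically produces no futex calls, while the two contending threads do:

```c
/* build: gcc futex_demo.c -pthread
   run:   strace -f -e trace=futex ./a.out */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);    /* contended between two threads: futex calls show up */
        counter++;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void)
{
    /* Uncontended: no other thread exists yet, so these lock/unlock pairs
       normally stay entirely in user space (no futex system call). */
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&m);
        counter++;
        pthread_mutex_unlock(&m);
    }

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("counter = %ld\n", counter);
    return 0;
}
```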
I never looked into all those library calls, but as far as I understand they all involve kernel operations, as they are supposed to provide synchronisation between processes and/or threads at a global level, I mean at the OS level.
The kernel needs to maintain, for a mutex for instance, a thread list: the threads that are currently sleeping, waiting for a locked mutex to be released. When the thread that currently owns that mutex invokes the kernel via pthread_mutex_unlock(), the kernel will walk that list to find the highest-priority thread waiting for the mutex, mark it as the new owner in the mutex's kernel structure, and then give away the CPU (a context switch) to the new owner thread, so that process will return from the POSIX library call pthread_mutex_lock().
I only see the need for cooperation with the kernel when it involves IPC between processes (I am not talking about threads within a single process). Therefore I expect those library calls to invoke the kernel.
When you compile a program on Linux that uses pthreads, you have to add -lpthread to the compiler options. By doing this, you tell the linker to link libpthread. So, on Linux, they are calls to a library.

Resources