In pipelined processors, why are the contents of the pipeline instructions flushed during after an interrupt and not saved? - cpu-registers

I understand that in non pipelined processors if an interrupt occurs the contents of the registers are saved and the interrupt handled. Why can this not happen with pipelined processors?

Related

OpenCL: can I do simultaneous "read" operations?

I have an OpenCL buffer created with the read and write flag. Can I access the same memory address simultaneously? say, calling enqueueReadBuffer and a kernel that doesn't modify the contents out-of-order without waitlists, or two calls to kernels that only read from the buffer.
Yes, you can do so. Create 2 queues, then call clEnqueieReadBuffer and clEnqueueNDRangeKernel on the different queue.
It ultimately depends on weather the device and driver supports executing different queues at the same time. Most GPUs can while embedded devices may or may not.

Effect of not using clWaitForEvents

I'm new to OpenCL programming. In one of my OpenCL applications, I use clWaitForEvents after launching every kernel.
Case 1:
time_start();
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
clWaitForEvents(1, &event);
time_end();
Time taken : 250 ms (with clWaitForEvents)
If I remove clWaitForEvents(), my kernel runs faster with the same output.
Case 2:
time_start();
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
time_end();
Time taken: 220 ms (without clWaitForEvents)
I've to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel. Using clWaitForEvent after every kernel increases my execution time by few 100 ms.
Can the outputs go wrong if I do not use clWaitForEvents? I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish.
Any pointers are appreciated.
Hopefully a slightly less complicated answer:
I've to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel.
If you don't explicitly set CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property in clCreateCommandQueue() call (= the usual case), it will be an in-order queue. You don't need to synchronize commands in them (actually you shouldn't, as you see it can considerably slow down execution). See the docs:
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A.
I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish.
If you're doing simple stuff on a single in-order queue, you don't need clWaitForEvents() at all. It's mostly useful if you want to wait for multiple events from multiple queues, or you're using out-of-order queues, or you want to enqueue 20 commands but wait for the 4th, or something similar.
For a single in-order queue, after clFinish() returns all commands will be completed and any&all events will have their status updated to complete or failed. So in the simplest case you don't need to deal with events at all, just enqueue everything you need (check the enqueues for errors though) and call clFinish().
Note that if you don't use any form of wait/flush (WaitForEvents / Finish / a blocking command), the implementation may take as much time as it wants to actually push those commands to a device. IOW you must either 1) use WaitForEvents or Finish, or 2) enqueue a blocking command (read/write/map/unmap) as the last command.
In-order-queue implicitly waits for each command completion in the order they are enqueued but only on device-side. This means host can't know what happened.
Out-of-order-queue does not guarantee any command order in anywhere and can have issues.
'Wait-for-event' waits on host side for an event of a command.
'Finish' waits on host side until all commands are complete.
'Non blocking buffer read/write' does not wait on host side.
'Blocking buffer read/write' waits on host side but does not wait for other commands.
Recommended solutions:
Inter-command sync (for using output of a command as input of next command)
in-order-queue.
or passing event of a command to another (if its an out-of-order queue)
Inter-queue(or out-of-order queue) sync (for overlapping buffer copies and kernel executions)
pass events from command(s) to another command
Device - host sync (for getting latest data to RAM(or getting first data from RAM) or pausing host)
enable blocking option on buffer commands
or add a clFinish
or use clWaitForEvent
Be informed when a command is complete(for reasons like benchmarking)
use event callback
or constantly query event state(CPU/pci-e usage increases)
Enqueueing 1 non-blocking buffer write + 1000 x kernels + 1 blocking buffer read on an in-order-queue can successfully execute a chain of 1000 kernels on initial data and get latest results on host side.

How can a code be asyncronus on a single-core CPU which is synchronous?

In a uniprocessor (UP) system, there's only one CPU core, so only one thread of execution can be happening at once. This thread of execution is synchronous (it gets a list of instructions in a queue and run them one by one). When we write code, it compiles to set of CPU instructions.
How can we have asynchronous behavior in software on a UP machine? Isn't everything just run in some fixed order chosen by the OS?
Even an out-of-order execution CPU gives the illusion of running instructions in program order. (This is separate from memory reordering observed by other cores or devices in the system. In a UP system, runtime memory reordering is only relevant for device drivers.)
An interrupt handler is a piece of code that runs asynchronously to the rest of the code, and can happen in response to an interrupt from a device outside the CPU. In user-space, a signal handler has equivalent semantics.
(Or a hardware interrupt can cause a context switch to another software thread. This is asynchronous as far as the software thread is concerned.)
Events like interrupts from network packets arriving or disk I/O completing happen asynchronously with respect to whatever the CPU was doing before the interrupt.
Asynchronous doesn't mean simultaneous, just that it can run between any two machine instructions of the rest of the code. A signal handler in a user-space program can run between any two machine instructions, so the code in the main program must work in a way that doesn't break if this happens.
e.g. A program with a signal-handler can't make any assumptions about data on the stack below the current stack pointer (i.e. in the un-reserved part of the stack). The red-zone in the x86-64 SysV ABI is a modification to this rule for user-space only, since the kernel can respect it when transferring control to a signal handler. The kernel itself can't use a red-zone, because hardware interrupts write to the stack outside of software control, before running the interrupt handler.
In an OS where I/O completion can result in the delivery of a POSIX signal (i.e. with POSIX async I/O), the timing of a signal can easily be determined by the timing of a hardware interrupts, so user-space code runs asynchronously with timing determined by things external to the computer. It's not just an issue for the kernel.
On a multicore system, there are obviously far more ways for things to happen in different orders more of the time.
Many processors are capable of multithreading, and many operating systems can simulate multithreading on single-threaded processors by swapping tasks in and out of the processor.

MPI one-sided communication with user callbacks

To overlap MPI communications and computations, I am working on issuing asynchronous I/O (MPI calls) with user-defined computation function on the data from I/O.
MS-Window's 'Overlap' is not the friend of MPI (It supports for overlapped I/O only for File I/O and Socket communication, but not for MPI operations...)
I cannot find an appropriate MPI API for it, is there anyone with a glimpse on it?
There are no completion callbacks in MPI. Non-blocking operations always return a request handle that must be either be synchronously waited on using MPI_Wait and family or periodically tested using the non-blocking MPI_Test and family.
With the help of either MPI_Waitsome or MPI_Testsome, it is possible to implement a dispatch mechanism that monitors multiple requests and calls specific functions upon their completion. None of the MPI calls has any timeout characteristics though - it is either "wait forever" (MPI_Wait...) or "check without waiting" (MPI_Test...).

if interrupt happens how does unix kernel determine which process its for

Lets say Unix is executing process A and an interrupt at higher level occurs. Then OS get a interrupt number and from IVT it looks up the routine to call.
Now how does the OS know that this interrupt was for process A and not for process B. It might have been that process B might have issued a disk read and it came back while OS was executing process A.
Thanks
Start with this: http://en.wikipedia.org/wiki/MINIX
Go buy the book and read it; it will really help a lot.
Interrupts aren't "for" processes. They're for devices and handled by device drivers.
The device driver handles the interrupt and updates the state of the device.
If the device driver concludes that an I/O operation is complete, it can then update the its queue of I/O requests to determine which operation completed. The operation is removed from the queue of pending operations.
The process which is waiting for that operation is now ready-to-run and can resume execution.
You are talking about a hardware interrupt and these are not targeted at processes.
If a process A requests a file, the filesystem layer, which already resides in the kernel, will fetch the file from the block device. The block device itself is handled by a driver.
When the interrupt occurs, triggered by the block device, the OS has this interrupt associated with the driver. So the driver is told to handle the interrupt. It will then query which blocks were read and see for what it requested them.
After the filesystem is told that the requested data is ready, it may further process it. Then, the process leaves blocked state.
In the next round of the scheduler, the scheduler may select to wake up this process. It may also select to wake up another process first.
As you can see, the interrupt occurance is fully disconnected from the process operation.

Resources