How do I release a command queue that contains waiting kernels? - opencl

In my command queue, I have a user event E and kernels K1, K2 and K3.
I enqueue K1, with wait event to to E.
I enqueue K2, with wait event set to the completion event of K1.
I enqueue K3, with wait event set to the completion event of K2.
Suppose I need to release this command queue before E fires. How do I do this? Currently, release just hangs. Event if I set the event status on E to -1, still does not release.

There is no way to do it at least in < CL 2.0.
The only solution is not to queue them in the first place.
I know, it sucks.
Releasing the queue will first call a clFinish() in the queue, leading to a lock situation.
I suffered myself this situation, when I had enough GPU resources, and the CPU was the bottleneck. So I was queuing the next iterations in the GPU, and depending on the CPU result, whether it was enough or not, I run more iterations or I discarded the GPU result.
It would have been great to be able to stop the queued execution as soon as the CPU knows that the data it no longer needed. But you either queue it later (leading to some GPU idle time), or you run it all the time, but dont wait for the result if it is not needed (leading to extra GPU usage).

Related

When a process makes a system call to transmit a TCP packet over the network, which of the following steps do NOT occur always?

I am teaching myself OS by going through the lecture notes of the course at IIT Bombay (https://www.cse.iitb.ac.in/~mythili/os/). One of the questions in the Process worksheet asks which of the following doesn't always happen in the situation described at the title. The answer is C.
A. The process moves to kernel mode.
B. The program counter of the CPU shifts to the kernel part of the address space.
C. The process is context-switched out and a separate kernel process starts execution.
D. The OS code that deals with handling TCP/IP packets is invoked
I'm a bit confused though. I thought when an interrupt routine occurs the process is context-switched out so other processes can run and the CPU is not idle during that time. The kernel, then, will take care of the packet sending. Why would C not be correct then?
You are right in saying that "when an interrupt routine occurs the process is context-switched out so other processes can run and the CPU is not idle during that time", but the words "generally or mostly" need to be added to it.
In most cases, there is another process waiting for CPU time and that can be scheduled. However it is not the case 100% of the time. The question is about the word "always" and while other options always occur in the given situation, option C is a choice that OS makes at run time. If OS determines that switching out this process can be sub optimal than performing the system call and resuming the same process, then it may not perform the context switching.
There is a cost associated with context switching and if other processes are also blocked on some I/O then it may be optimal for OS to NOT switch the context or there might be other reasons to not switch the context such as what if only 1 process is running, there is no other process to switch the context to!

Effect of not using clWaitForEvents

I'm new to OpenCL programming. In one of my OpenCL applications, I use clWaitForEvents after launching every kernel.
Case 1:
time_start();
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
clWaitForEvents(1, &event);
time_end();
Time taken : 250 ms (with clWaitForEvents)
If I remove clWaitForEvents(), my kernel runs faster with the same output.
Case 2:
time_start();
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
time_end();
Time taken: 220 ms (without clWaitForEvents)
I've to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel. Using clWaitForEvent after every kernel increases my execution time by few 100 ms.
Can the outputs go wrong if I do not use clWaitForEvents? I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish.
Any pointers are appreciated.
Hopefully a slightly less complicated answer:
I've to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel.
If you don't explicitly set CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property in clCreateCommandQueue() call (= the usual case), it will be an in-order queue. You don't need to synchronize commands in them (actually you shouldn't, as you see it can considerably slow down execution). See the docs:
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A.
I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish.
If you're doing simple stuff on a single in-order queue, you don't need clWaitForEvents() at all. It's mostly useful if you want to wait for multiple events from multiple queues, or you're using out-of-order queues, or you want to enqueue 20 commands but wait for the 4th, or something similar.
For a single in-order queue, after clFinish() returns all commands will be completed and any&all events will have their status updated to complete or failed. So in the simplest case you don't need to deal with events at all, just enqueue everything you need (check the enqueues for errors though) and call clFinish().
Note that if you don't use any form of wait/flush (WaitForEvents / Finish / a blocking command), the implementation may take as much time as it wants to actually push those commands to a device. IOW you must either 1) use WaitForEvents or Finish, or 2) enqueue a blocking command (read/write/map/unmap) as the last command.
In-order-queue implicitly waits for each command completion in the order they are enqueued but only on device-side. This means host can't know what happened.
Out-of-order-queue does not guarantee any command order in anywhere and can have issues.
'Wait-for-event' waits on host side for an event of a command.
'Finish' waits on host side until all commands are complete.
'Non blocking buffer read/write' does not wait on host side.
'Blocking buffer read/write' waits on host side but does not wait for other commands.
Recommended solutions:
Inter-command sync (for using output of a command as input of next command)
in-order-queue.
or passing event of a command to another (if its an out-of-order queue)
Inter-queue(or out-of-order queue) sync (for overlapping buffer copies and kernel executions)
pass events from command(s) to another command
Device - host sync (for getting latest data to RAM(or getting first data from RAM) or pausing host)
enable blocking option on buffer commands
or add a clFinish
or use clWaitForEvent
Be informed when a command is complete(for reasons like benchmarking)
use event callback
or constantly query event state(CPU/pci-e usage increases)
Enqueueing 1 non-blocking buffer write + 1000 x kernels + 1 blocking buffer read on an in-order-queue can successfully execute a chain of 1000 kernels on initial data and get latest results on host side.

clReleaseCommandQueue hangs when queue contains unexecuted kernels that are waiting on completion events

Releasing a command queue should, in my opinion, remove all unexecuted kernels, even if they are waiting on completion events.
I am using an AMD card, and it seems that I have to manually set completion events to COMPLETE in order to successfully release the command queue.
Is this a bug in the AMD implementation?
If in doubt, always refer to the specification:
clReleaseCommandQueue performs an implicit flush to issue any previously queued OpenCL
commands in command_queue.
So, this is in fact expected behaviour.

In order queue: two kernels - one waiting for event and one not

Suppose I enqueue two kernels in an in-order queue.
The first kernel is set to only run when it receives a completion event,
while the second kernel is not waiting for an event.
Will the runtime execute the second kernel first in this case?
An in-order queue will execute the items in the order you queue them, essentially giving each operation its predecessor as a wait event. Your second kernel should not be executed until after the first one in your example.
Out of order queues require you to manage the wait lists yourself, but have the advantage that tasks can be executed as soon as their prerequisites have been met. Just make sure your platform supports out-of-order queues before you end up troubleshooting a dead-end. See answers to this SO question.

Probe seems to consume the CPU

I've got an MPI program consisting of one master process that hands off commands to a bunch of slave processes. Upon receiving a command, a slave just calls system() to do it. While the slaves are waiting for a command, they are consuming 100% of their respective CPUs. It appears that Probe() is sitting in a tight loop, but that's only a guess. What do you think might be causing this, and what could I do to fix it?
Here's the code in the slave process that waits for a command. Watching the log and the top command at the same time suggests that when the slaves are consuming their CPUs, they are inside this function.
MpiMessage
Mpi::BlockingRecv() {
LOG(8, "BlockingRecv");
MpiMessage result;
MPI::Status status;
MPI::COMM_WORLD.Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, status);
result.source = status.Get_source();
result.tag = status.Get_tag();
int num_elems = status.Get_count(MPI_CHAR);
char buf[num_elems + 1];
MPI::COMM_WORLD.Recv(
buf, num_elems, MPI_CHAR, result.source, result.tag
);
result.data = buf;
LOG(7, "BlockingRecv about to return (%d, %d)", result.source, result.tag);
return result;
}
Yes; most MPI implementations, for the sake of performance, busy-wait on blocking operations. The assumption is that the MPI job is the only thing going on that we care about on the processor, and if the task is blocked waiting for communications, the best thing to do is to continually poll for that communication to reduce latency; so that there's virtually no delay between when the message arrives and when it's handed off to the MPI task. This typically means that CPU is pegged at 100% even when nothing "real" is being done.
That's probably the best default behaviour for most MPI users, but it isn't always what you want. Typically MPI implementations allow turning this off; with OpenMPI, you can turn this behaviour off with an MCA parameter,
mpirun -np N --mca mpi_yield_when_idle 1 ./a.out
It sounds like there are three ways to wait for an MPI message:
Aggressive busy wait. This will get the message into your receiving code as fast as possible. Some processor is doing nothing but checking for the incoming message. If you put all of your processors in this state, the rest of your system is going to be very slow. MPI uses aggressive mode by default.
Degraded busy wait. This will yield to other processes while doing its busy wait. If the number of processes you ask for is more than the number of processors you have, MPI switches to degraded mode. You can also force aggressive or degraded mode with an MCA parameter.
Polling. Even the degraded busy wait is still a busy wait, and it will keep one processor pegged at 100% for each process that is waiting. If you have other tasks on your system that you don't want to compete with, you can call MPI_Iprobe() in a loop with a sleep call before calling a blocking receive. I find a 100ms sleep is responsive enough for my tasks, and still keeps the CPU usage minimal when a worker is idle.
I did some searching and found that a busy wait is what you want if you are not sharing your processors with other tasks.

Resources