Using the same kernel instance for more works - opencl

After I call clEnqueueNDRangeKernel with one cl_kernel instance, can I use the same instance to enqueue another task before the first one has finished executing?
Or do I need to create another cl_kernel?

Yes, this should be possible. When you enqueue a kernel, a copy of the kernel state (including the current argument values) is enqueued with the command; otherwise you wouldn't be able to modify the kernel args until the kernel command finished executing. I agree, though, that this is not specified very clearly.
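To make the "copy at enqueue time" idea concrete, here is a minimal toy model in plain C. It is not the OpenCL API: ToyKernel, ToyQueue and the toy_* functions are hypothetical names made up for illustration, standing in roughly for clSetKernelArg, clEnqueueNDRangeKernel and clFinish. The only point it tries to show is that each queued command carries its own snapshot of the argument, so changing the kernel object afterwards does not affect commands already queued.

```c
#include <stddef.h>

#define TOY_QUEUE_CAP 16

/* Toy stand-in for a cl_kernel with a single int argument (hypothetical). */
typedef struct { int arg; } ToyKernel;

/* A queued command owns a copy of the argument value. */
typedef struct { int arg_snapshot; } ToyCommand;

typedef struct {
    ToyCommand cmds[TOY_QUEUE_CAP];
    size_t count;
} ToyQueue;

/* Rough analogue of clSetKernelArg: modifies the kernel object only. */
void toy_set_arg(ToyKernel *k, int value) { k->arg = value; }

/* Rough analogue of clEnqueueNDRangeKernel: copies the current argument
 * into the command. Nothing runs yet. Returns 0, or -1 if the queue is full. */
int toy_enqueue(ToyQueue *q, const ToyKernel *k) {
    if (q->count == TOY_QUEUE_CAP) return -1;
    q->cmds[q->count++].arg_snapshot = k->arg;
    return 0;
}

/* Rough analogue of clFinish: "runs" the commands in order. Each command
 * just records the argument it was enqueued with. out must have room for
 * TOY_QUEUE_CAP ints. Returns the number of commands executed. */
size_t toy_finish(ToyQueue *q, int *out) {
    size_t n = q->count;
    for (size_t i = 0; i < n; ++i) out[i] = q->cmds[i].arg_snapshot;
    q->count = 0;
    return n;
}

/* Enqueue the same kernel twice, changing the argument in between. */
size_t toy_demo(int *out) {
    ToyKernel k;
    ToyQueue q;
    q.count = 0;
    toy_set_arg(&k, 10);
    toy_enqueue(&q, &k);
    toy_set_arg(&k, 20);   /* does not touch the first queued command */
    toy_enqueue(&q, &k);
    return toy_finish(&q, out);
}
```

With an array of TOY_QUEUE_CAP ints passed to toy_demo, the first command sees 10 and the second sees 20, even though both were enqueued from the same kernel object before either ran.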

Related

How does the FreeRTOS kernel suspend a task on Arduino UNO?

There is a FreeRTOS library for Arduino, purported to even run on the UNO.
I'm trying to understand the inner workings of how a multi-tasking operating system can run on such limited hardware. I understand the principles of task scheduling/switching, but how does the kernel actually suspend a task in order to execute another one? How does it interrupt (and then later resume) the currently-executing code?
My guess is that a scheduled ISR (timer) directly modifies the stack to change the instruction pointer, but if it does this, it needs to make a copy of the stack and registers before switching tasks, then restore the current task's stack/registers before resuming execution. I'm not clear on how it would do this.
Can the FreeRTOS kernel switch tasks in the middle of, for example, a Serial.println() function call, (or any call that doesn't include cli()) and if so, how does it do this?
Thanks for any clarification.
"My guess is that a scheduled ISR (timer) directly modifies the stack to change the instruction pointer, but if it does this, it needs to make a copy of the stack and registers before switching tasks, then restore the current task's stack/registers before resuming execution. I'm not clear on how it would do this."
Your guess is correct. If you look at port.c you will see that the FreeRTOS macros portSAVE_CONTEXT and portRESTORE_CONTEXT push and pop, respectively, all registers of the currently running task to perform the task switch. Furthermore, the watchdog timer interrupt is used to run the scheduler.
As long as this watchdog timer is enabled and triggered, task switches can happen at any time. So a switch can also happen during any function call, such as Serial.println. This implies that if you call this function from several tasks, you will sooner or later corrupt the output on the serial stream.

Which command_queue to pass to clEnqueueCopyBuffer when launching kernels simultaneously?

So I am implementing a Kmeans clustering algorithm with OpenCL that uses channels: a feature from Intel's FPGA SDK for OpenCL.
To keep it succinct, this means I have two kernels that have to be enqueued on different command queues so they run simultaneously. I want to copy the cl_mem buffer from one kernel to the other every iteration (it's for the 4 clusters, so on the small side), part of which requires me to call clEnqueueCopyBuffer. This requires passing the function a command queue, but I don't know if it wants the queue of the buffer being copied or the queue of the buffer being copied to.
This is all the OpenCL Specification says for the command_queue parameter:
The command-queue in which the copy command will be queued. The OpenCL context associated with command_queue, src_buffer, and dst_buffer must be the same.
I can confirm these kernels are in fact in the same context.
You can use either command queue, but you need to get an event from the copy operation and pass it in the wait list of the kernel enqueue on the other command queue. Otherwise that kernel might start before the copy finishes.

Effect of not using clWaitForEvents

I'm new to OpenCL programming. In one of my OpenCL applications, I use clWaitForEvents after launching every kernel.
Case 1:
time_start();
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
clWaitForEvents(1, &event);
time_end();
Time taken : 250 ms (with clWaitForEvents)
If I remove clWaitForEvents(), my kernel runs faster with the same output.
Case 2:
time_start();
cl_event event;
cl_int status = clEnqueueNDRangeKernel(queue, ..., &event);
time_end();
Time taken: 220 ms (without clWaitForEvents)
I have to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel. Using clWaitForEvents after every kernel increases my execution time by a few hundred ms.
Can the outputs go wrong if I do not use clWaitForEvents? I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish.
Any pointers are appreciated.
Hopefully a slightly less complicated answer:
"I have to launch 10 different kernels sequentially. Every kernel is dependent on the output of the previous kernel."
If you don't explicitly set the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property in the clCreateCommandQueue() call (the usual case), the queue will be an in-order queue. You don't need to synchronize commands in such a queue (in fact you shouldn't: as you saw, it can slow down execution considerably). See the docs:
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A.
"I would like to understand what might possibly go wrong if I do not use clWaitForEvents or clFinish."
If you're doing simple stuff on a single in-order queue, you don't need clWaitForEvents() at all. It's mostly useful if you want to wait for multiple events from multiple queues, or you're using out-of-order queues, or you want to enqueue 20 commands but wait for the 4th, or something similar.
For a single in-order queue, after clFinish() returns, all commands will have completed and all events will have their status updated to complete or failed. So in the simplest case you don't need to deal with events at all: just enqueue everything you need (check the enqueues for errors, though) and call clFinish().
Note that if you don't use any form of wait/flush (clWaitForEvents / clFinish / a blocking command), the implementation may take as much time as it wants to actually push those commands to a device. In other words, you must either 1) use clWaitForEvents or clFinish, or 2) enqueue a blocking command (read/write/map/unmap) as the last command.
An in-order queue implicitly waits for each command to complete, in the order the commands were enqueued, but only on the device side. The host is not told when a command finishes.
An out-of-order queue does not guarantee any command order at all, so dependent commands need explicit synchronization.
'Wait-for-event' waits on the host side for the event of one command.
'Finish' waits on the host side until all commands are complete.
'Non-blocking buffer read/write' does not wait on the host side.
'Blocking buffer read/write' waits on the host side for that command, but does not by itself wait for other commands.
Recommended solutions:
Inter-command sync (using the output of one command as the input of the next command):
  - use an in-order queue
  - or pass the event of one command to the next (if it's an out-of-order queue)
Inter-queue (or out-of-order queue) sync (for overlapping buffer copies and kernel executions):
  - pass events from one or more commands to another command
Device-host sync (for getting the latest data into RAM, getting the first data from RAM, or pausing the host):
  - enable the blocking option on buffer commands
  - or add a clFinish
  - or use clWaitForEvents
Being informed when a command is complete (for reasons like benchmarking):
  - use an event callback
  - or constantly query the event state (CPU/PCIe usage increases)
Enqueueing 1 non-blocking buffer write + 1000 x kernels + 1 blocking buffer read on an in-order-queue can successfully execute a chain of 1000 kernels on initial data and get latest results on host side.

Queries about multiple kernel in opencl

When I use multiple kernels in OpenCL such that the result of the first kernel (K1) is the input to the second kernel (K2), I have two questions:
1. Should the event be different for each kernel, or the same for each kernel?
2. Should the command queue be different for each kernel, or the same for each kernel?
Thanks.
You need a single command queue (assuming the kernels are executed on the same device).
Unless your command queue is created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, you don't need to create events in the scenario you describe: kernels are executed in the order they are enqueued.
For an out-of-order command queue, you should get an event from the first clEnqueueNDRangeKernel, and pass it as dependency to the second one. Remember to release the events with clReleaseEvent.
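The effect of passing the first kernel's event as a dependency can be illustrated with a small plain-C toy model. This is not the OpenCL API: ToyOooQueue, toy_ooo_enqueue and toy_ooo_finish are hypothetical names, and the "scheduler" is a deliberately adversarial one that always tries the newest command first, which a real out-of-order queue is allowed (but not required) to do. A command's index plays the role of its event.

```c
#include <stddef.h>

#define TOY_MAX_CMDS 8
#define TOY_NO_EVENT (-1)

typedef struct {
    int id;        /* label, e.g. 1 for K1 and 2 for K2 */
    int wait_for;  /* index of the command to wait on, or TOY_NO_EVENT */
    int complete;  /* plays the role of the command's event status */
} ToyCmd;

typedef struct {
    ToyCmd cmds[TOY_MAX_CMDS];
    size_t count;
} ToyOooQueue;

/* Enqueue a command. Returns its index (its "event"), or -1 if the queue
 * is full or wait_for does not refer to an earlier command. */
int toy_ooo_enqueue(ToyOooQueue *q, int id, int wait_for) {
    if (q->count == TOY_MAX_CMDS) return -1;
    if (wait_for != TOY_NO_EVENT &&
        (wait_for < 0 || (size_t)wait_for >= q->count)) return -1;
    ToyCmd *c = &q->cmds[q->count];
    c->id = id;
    c->wait_for = wait_for;
    c->complete = 0;
    return (int)q->count++;
}

/* Adversarial out-of-order scheduler: scans newest-first and runs every
 * command whose dependency (if any) has completed. Writes the ids in
 * execution order to order[] and returns the number of commands run. */
size_t toy_ooo_finish(ToyOooQueue *q, int *order, size_t cap) {
    size_t done = 0;
    int progressed = 1;
    while (done < q->count && progressed) {
        progressed = 0;
        for (size_t i = q->count; i-- > 0; ) {
            ToyCmd *c = &q->cmds[i];
            if (c->complete) continue;
            if (c->wait_for != TOY_NO_EVENT &&
                !q->cmds[c->wait_for].complete) continue;
            c->complete = 1;
            if (done < cap) order[done] = c->id;
            done++;
            progressed = 1;
        }
    }
    return done;
}
```

In this model, enqueueing K1 and then K2 with wait_for set to K1's index always yields the order 1, 2, while enqueueing both with TOY_NO_EVENT yields 2, 1. In real host code the corresponding step is to take the event returned through the last argument of the first clEnqueueNDRangeKernel call, pass it as the event_wait_list (with num_events_in_wait_list set to 1) of the second call, and release it with clReleaseEvent afterwards.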

signalling error from kernel

Is there a way for a kernel to interrupt the task queue and yield control to the host prematurely, flushing the remaining, not yet processed tasks?
I am passing output arrays to kernels, and their required size is not known in advance. I try to estimate the size, but if the estimate is too small, the kernel should return control to the host, which could re-allocate the array or otherwise react. Currently I have the kernel set a flag in a struct that is passed to all kernels; every kernel checks this error flag when it starts and exits immediately if it is set, so the rest of the queue is effectively skipped. Is there a better way to do this? Can I generate an event from the kernel, for instance?
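For reference, the flag-based workaround described above can be sketched as a plain-C model. This is not OpenCL code: the "kernels" are ordinary C functions run one after another, and ToyStatus, toy_kernel_append and toy_run_chain are hypothetical names chosen for illustration. In a real kernel many work-items would check and set the flag concurrently, which this sketch does not model.

```c
#include <stddef.h>

/* Shared status block passed to every kernel, as in the question. */
typedef struct {
    int error;      /* 0 = ok, nonzero = some kernel ran out of space */
    size_t needed;  /* capacity the failing kernel would have needed */
} ToyStatus;

/* Stand-in for a kernel that appends `count` values to out[].
 * It exits immediately if an earlier kernel already failed, and sets
 * the flag itself if the output array is too small. */
void toy_kernel_append(ToyStatus *st, int *out, size_t cap,
                       size_t *used, size_t count) {
    if (st->error) return;            /* earlier failure: skip the work */
    if (*used + count > cap) {        /* output would overflow */
        st->error = 1;
        st->needed = *used + count;
        return;
    }
    for (size_t i = 0; i < count; ++i) out[(*used)++] = (int)i;
}

/* Host side: run the whole chain, then inspect the flag once.
 * Returns 0 on success, -1 if some kernel reported an overflow. */
int toy_run_chain(int *out, size_t cap, const size_t *counts, size_t n,
                  ToyStatus *st, size_t *used) {
    st->error = 0;
    st->needed = 0;
    *used = 0;
    for (size_t k = 0; k < n; ++k)
        toy_kernel_append(st, out, cap, used, counts[k]);
    return st->error ? -1 : 0;
}
```

On failure the host can re-allocate and rerun the chain. Note that `needed` in this sketch is only a lower bound: it reflects what the first failing kernel needed, not what the remaining kernels would have required.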
