How do I know if the kernels are executing concurrently?

I have a GPU with compute capability 3.0, so it should support 16 concurrent kernels. I am starting 10 kernels by calling clEnqueueNDRangeKernel in a loop 10 times. How can I find out whether the kernels are executing concurrently?
One way I have thought of is to take the time before and after the clEnqueueNDRangeKernel call. I might have to use events to ensure that kernel execution has completed. But I still suspect that the loop will start the kernels sequentially. Can someone help me out?

To determine if your kernel executions overlap, you have to profile them. This requires several steps:
1. Creating the command-queues
Profiling data is only collected if the command-queue is created with the property CL_QUEUE_PROFILING_ENABLE:
cl_command_queue queues[10];
for (int i = 0; i < 10; ++i) {
    queues[i] = clCreateCommandQueue(context, device,
                                     CL_QUEUE_PROFILING_ENABLE, &errcode);
}
2. Making sure all kernels start at the same time
You are right in your assumption that the CPU queues the kernels sequentially. However, you can create a single user event and add it to the wait list for all kernels. This causes the kernels not to start running before the user event is completed:
// Create the user event
cl_event user_event = clCreateUserEvent(context, &errcode);

// Reserve space for kernel events
cl_event kernel_events[10];

// Enqueue kernels (note the NULL local work size and the wait list
// containing the user event)
for (int i = 0; i < 10; ++i) {
    clEnqueueNDRangeKernel(queues[i], kernel, work_dim, global_work_offset,
                           global_work_size, NULL, 1, &user_event,
                           &kernel_events[i]);
}

// Start all kernels by completing the user event
clSetUserEventStatus(user_event, CL_COMPLETE);
3. Obtaining the profiling times
Finally, we can collect the timing information for the kernel events:
// Block until all kernels have run to completion
clWaitForEvents(10, kernel_events);

for (int i = 0; i < 10; ++i) {
    cl_ulong start;
    clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    cl_ulong end;
    clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("Event %d: start=%llu, end=%llu\n", i,
           (unsigned long long)start, (unsigned long long)end);
}
4. Analyzing the output
Now that you have the start and end times of all kernel runs, you can check for overlaps (either by hand or programmatically). The output units are nanoseconds. Note however that the device timer is only accurate to a certain resolution. You can query the resolution using:
size_t resolution;
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                sizeof(resolution), &resolution, NULL);
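If you want to check programmatically, here is a minimal sketch, assuming you stored each kernel's times into arrays starts[] and ends[] (names of my choosing); two intervals overlap iff each starts before the other ends:
// Pairwise overlap check on the profiling data collected above.
// All values are nanoseconds on the same device clock.
for (int i = 0; i < 10; ++i) {
    for (int j = i + 1; j < 10; ++j) {
        if (starts[i] < ends[j] && starts[j] < ends[i])
            printf("Kernels %d and %d overlapped\n", i, j);
    }
}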
FWIW, I tried this on an NVIDIA device with CC 2.0 (which should support concurrent kernels) and observed that the kernels were run sequentially.

You can avoid all the boilerplate code suggested in the other answers (which are correct, by the way) by using the C Framework for OpenCL, which simplifies this task a lot and gives you detailed information about OpenCL events (kernel execution, data transfers, etc.), including a table and a plot dedicated to overlapped execution of said events.
I developed this library in order to, among other things, simplify the process described in the other answers. You can see a basic usage example here.

Yes, as you suggest, use the events, and analyze all the QUEUED, SUBMIT, START, and END values. These should be absolute values in "device time", and you may be able to see whether processing (START to END) overlaps for the different kernels.
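For reference, a minimal sketch of querying all four timestamps for one event (the CL_PROFILING_COMMAND_* names are standard OpenCL; ev stands for one of your kernel events):
cl_ulong queued, submit, start, end;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_SUBMIT, sizeof(submit), &submit, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);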

Related

locks in OpenMP

Good day, everyone!
Not long ago I managed to parallelize a recursive algorithm that searches for possible ways of combining some events. At the moment the code looks like this:
//#include's
// function declarations
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")
int main() {
    // read data from file
    // convert and analyze the data
    // fill the variant variable with the current best result (here - by pre-analysis)
    #pragma omp parallel shared(variant)
    #pragma omp master
    // call the recursive algorithm that searches all variants:
    PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
    return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
    // determine the bounds of the first loop
    for (int i = quantity; i < another_quantity; i++) {
        // Tabl_1 is processed and modified to determine the number of steps of the inner loop
        for (int k = 0; k < the_quantity_just_found; k++) {
            if the recursion depth is not 1, we go down further: {
                // spawn the descent to the next recursion level as a task:
                #pragma omp task
                PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth-1);
            }
            else (if we went down to the lowest level): {
                if (condition fulfilled) // condition check - READ variant variable
                    variant = it_is_equal_to_that_,_to_that...;
                else
                    continue;
            }
        }
    }
}
At the moment this really works well: on six cores it gives a speedup of more than 5.7x over the single-core version.
As you can see, with a sufficiently large number of threads a data race on the variant variable is possible through simultaneous reads and writes. I understand it needs to be protected. At the moment the only way out I see is to use the lock functions, since a critical section does not seem suitable: variant is written in only one place in the code (at the lowest recursion level), but it is read in many places.
So here is the question: if I apply these constructs:
omp_lock_t lock;
int main() {
    ...
    omp_init_lock(&lock);
    #pragma omp parallel shared(variant, lock)
    ...
}
...
else (if we went down to the lowest level): {
    if (condition fulfilled) { // condition check - READ variant variable
        omp_set_lock(&lock);
        variant = it_is_equal_to_that_,_to_that...;
        omp_unset_lock(&lock);
    }
    else
        continue;
...
will this lock protect the reading of the variable in all other places? Or will I need to manually check the lock status and pause the thread before reading elsewhere?
I will be incredibly grateful to the distinguished community for help!
In the OpenMP specification (1.4.1 Structure of the OpenMP Memory Model) you can read:
The OpenMP API provides a relaxed-consistency, shared-memory model.
All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to
have its own temporary view of the memory. The temporary view of
memory for each thread is not a required part of the OpenMP memory
model, but can represent any kind of intervening structure, such as
machine registers, cache, or other local storage, between the thread
and the memory. The temporary view of memory allows the thread to
cache variables and thereby to avoid going to memory for every
reference to a variable.
This practically means that (as with any relaxed memory model) only at well-defined points are threads guaranteed to have the same, consistent view of the values of shared variables. Between such points the temporary view may differ across threads.
In your code you handled the problem of simultaneous writes to the same variable, but there is no guarantee that another thread reads the correct value of the variable without additional measures.
You have three options (note that each of these solutions not only handles simultaneous reads/writes, but also provides a consistent view of the values of shared variables):
If your variable is of scalar type, the best solution is to use atomic operations. This is the fastest option, as atomic operations are typically supported by the hardware.
#pragma omp parallel
{
    ...
    #pragma omp atomic read
    tmp = variant;
    ...
    #pragma omp atomic write
    variant = new_value;
}
Use the critical construct. This solution can be used if your variable is of a complex type (such as a class) whose reads/writes cannot be performed atomically. Note that it is much less efficient (slower) than an atomic operation.
#pragma omp parallel
{
    ...
    #pragma omp critical
    tmp = variant;
    ...
    #pragma omp critical
    variant = new_value;
}
Use locks for each read/write of your variable. Your code is OK for the write, but you have to lock the reads as well, as shown in the sketch below. This requires the most coding, but in practice the result is the same as using the critical construct. Note that OpenMP implementations typically use locks to implement critical constructs.
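For illustration, a minimal sketch of option 3 applied to the code from the question (local_copy is a name of my choosing; the idea is to take a consistent snapshot under the lock and then work on the copy):
// Locked read: snapshot the shared variable before using it,
// mirroring the locked write the question already has.
omp_set_lock(&lock);
QVector<QVector<QVector<float>>> local_copy = variant;
omp_unset_lock(&lock);
// ... decide based on local_copy ...
omp_set_lock(&lock);
variant = new_value; // locked write, as in the question
omp_unset_lock(&lock);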

Device-side enqueue causes CL_OUT_OF_RESOURCES

I have a program utilizing OpenCL 2.0 because I want to take advantage of device-side enqueue. I have a test program that performs the following tasks on the host side:
Allocates 16 kilobytes of floating-point memory on the device and zeros it out.
Builds the OpenCL program below and creates a kernel from masterKernel().
Sets the first argument of masterKernel() (heap) to the memory allocated in step 1.
Enqueues masterKernel() via clEnqueueNDRangeKernel() with a work_dim of 1 and a global work size of 1 (so it only runs once, with get_global_id(0) always being zero).
Reads the memory back into the host and displays it.
Here is the OpenCL code:
//This function was stripped down to nothing for testing purposes.
kernel void childKernel(global float* heap)
{
}

//Enqueues the child kernel.
kernel void masterKernel(global float* heap)
{
    ndrange_t ndRange = ndrange_1D(16); //Arbitrary, could be any number.
    if (get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), 0, ndRange,
                       ^{ childKernel(heap); });
    }
}
The program builds successfully. However, when I try to run masterKernel(), the call to enqueue_kernel() here causes the host-side call to clEnqueueNDRangeKernel() to fail with an error code of CL_OUT_OF_RESOURCES. OpenCL's documentation says enqueue_kernel() should return CL_SUCCESS or CL_ENQUEUE_FAILURE depending on whether the block enqueues successfully or not. It does not say that clEnqueueNDRangeKernel() itself should fail. Here are some other things I've tried:
Commenting out the call to enqueue_kernel() causes the program to succeed.
Adding a line that sets heap[0] to any number causes the host-side program to reflect that change. So I know that it's not a problem with how I'm feeding the arguments in.
Modifying the if statement so that it reads something impossible, like if(get_global_id(0) == 6000), still causes the error. This tells me that the error is not caused by enqueue_kernel() executing (I verified get_global_size(0) == 1), but merely by its presence in the program.
Modifying the if statement to if(0) does make the error go away.
Making it so childKernel() actually does something does not make the error go away.
I am not really sure what to try next. I know my device supports OpenCL 2.0. My device is an AMD Radeon R9 380 graphics card. I do not have access to any other OpenCL 2.0 capable hardware to test it on.
I ended up figuring this one out. The issue happened because I did not create a device-side queue (one with the flags CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT).
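For reference, a minimal sketch of creating such a device-side default queue with clCreateCommandQueueWithProperties (the 16 KB queue size is an arbitrary example value):
cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES,
    (cl_queue_properties)(CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                          CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT),
    CL_QUEUE_SIZE, 16 * 1024, // arbitrary example size in bytes
    0
};
// The handle itself is rarely used; get_default_queue() inside the kernel
// returns this queue because of CL_QUEUE_ON_DEVICE_DEFAULT.
cl_command_queue device_queue =
    clCreateCommandQueueWithProperties(context, device, props, &errcode);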

How to invoke single workgroup with multiple workitems which run parallel?

I am just getting started with OpenCL and still learning.
Kernel Code:
__kernel void gpu_kernel(__global float* data)
{
    printf("workitem %d invoked\n", (int)get_global_id(0));
    int x = 0;
    if (get_global_id(0) == 1) {
        while (x < 1) {
            x = 0;
        }
    }
    printf("workitem %d completed\n", (int)get_global_id(0));
}
C code for invoking kernel
size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 1; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Output:
workitem 3 invoked
workitem 3 completed
workitem 0 invoked
workitem 0 completed
workitem 1 invoked
workitem 2 invoked
workitem 2 completed
## Here the program hangs on the terminal waiting for work-item 1 to finish, which will never happen
This clearly shows that all work-items run in parallel (but each in a different work-group).
Another C snippet for invoking the kernel (for 1 work-group with 4 work-items):
size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 4; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Output:
workitem 0 invoked
workitem 0 completed
workitem 1 invoked
## Here the program hangs on the terminal waiting for work-item 1 to finish, which will never happen
This clearly shows that these work-items run in sequence (that's why the 1st work-item completed, the 2nd got stuck, and the rest were never executed).
My Question:
I need to invoke 1 work-group with 4 work-items which run in parallel, so that I can use a barrier in my code (which I guess is only possible within a single work-group).
Any help/suggestion/pointer will be appreciated.
Your second host code snippet correctly launches a single work-group that contains 4 work-items. You have no guarantees that these work-items will run in parallel, since the hardware might not have the resources to do so. However, they will run concurrently, which is exactly what you need in order to be able to use work-group synchronisation constructs such as barriers. See this Stack Overflow question for a concise description of the difference between parallelism and concurrency. Essentially, the work-items in a work-group will make forward progress independently of each other, even if they aren't actually executing in parallel.
OpenCL 1.2 Specification (Section 3.2: Execution Model)
The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.
Based on your previous question on a similar topic, I assume you are using AMD's OpenCL implementation targeting the CPU. The way most OpenCL CPU implementations work is by serialising all work-items from a work-group into a single thread. This thread then executes each work-item in turn (ignoring vectorisation for the sake of argument), switching between them when they either finish or hit a barrier. This is how they achieve concurrent execution, and gives you all the guarantees you need in order to safely use barriers within your kernel. Parallel execution is achieved by having multiple work-groups (as in your first example), which will result in multiple threads executing on multiple cores (if available).
If you replaced your infinite loop with a barrier, you would clearly see that this does actually work.
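As a quick experiment, here is a sketch of the kernel above with the infinite loop swapped for a barrier (CLK_GLOBAL_MEM_FENCE is the standard flag for a barrier with a global-memory fence):
__kernel void gpu_kernel(__global float* data)
{
    printf("workitem %d invoked\n", (int)get_global_id(0));
    // Every work-item in the work-group must reach this point before any
    // may continue; a CPU runtime switches between the serialized
    // work-items here instead of hanging.
    barrier(CLK_GLOBAL_MEM_FENCE);
    printf("workitem %d completed\n", (int)get_global_id(0));
}
With local_item_size = 4, all four "invoked" lines should appear before any "completed" line, demonstrating that the work-items make progress concurrently.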

Replacement for Arduinos millis() that is reliable also with disabled interrupts

As stated in stackoverflow-17135805, the millis() function does not return the correct time if interrupts were disabled while the Arduino had to handle an overflow of timer0.
I have a time-critical program that uses a lot of functions which have to disable interrupts. So my program runs for 1:30 while it thinks it has been running for only 1:00.
Is there another timer that I can use to avoid this problem?
It happens to me when I use the GSM Module:
// startpoint
unsigned long t = 0;
unsigned long start = millis();
while ( (millis()-start) < 30000 ){
    //read a chunk from the gprs module
    for (int i = 0; i < 8; i++)
        client.read();
    //do this loop every 10ms
    while( (millis()-start) < t*10 ){};
    t++;
}
//endpoint
From the startpoint to the endpoint it should take 30 seconds. Instead it takes 65 seconds.
If you have to disable interrupts so often and for so long, your best bet would be to use an external timer. I highly recommend the DS3231. Since it has a built-in crystal, it is easier to set up than a DS1307, and it is also significantly more accurate.
You could use one of the other hardware timers to keep track of the time. For example, on the Leonardo, Timer 1 is a 16-bit timer.
To set it up directly (this obliterates code portability) there are a couple of steps.
TCCR1A = 0;
this puts the timer in "normal" mode, meaning it just runs to 0xFFFF and wraps back to 0x0000.
TCCR1B = 0;
TCCR1B = _BV(CS11) | _BV(CS10);
this starts the timer and sets it to use a clock/64 prescale, which equates to 1 tick every 4 us (with a 16 MHz clock).
To check the time:
long time; // declared somewhere in scope.
time = TCNT1; // this reads the timer count register
time *= 4; // this multiplies time by 4 to give you us.
As mentioned earlier, TCNT1 wraps around after 0xFFFF = 65535, i.e. every 65536 ticks. So, with the prescaler set as above, that gives you about 65536 * 4E-6 = 0.262 seconds of counting before your program needs to accumulate the count into a bigger variable (assuming you care). Hopefully it isn't a problem to poll things more often than four times a second, which keeps you away from interrupts.
Several Arduino core functions utilize these timers, so you'll need to verify that the core functions you need don't depend on the timer you choose. For example, doing the above will break analogWrite() on certain pins.
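If you need the count to run longer than 0.262 seconds, here is a minimal sketch of an accumulator (updateClock() and total_us are names of my choosing; it assumes you call updateClock() at least every 0.26 s so no wrap-around is missed):
unsigned long total_us = 0;   // accumulated microseconds, like micros()
unsigned int last_count = 0;  // previous TCNT1 reading (16-bit on AVR)

void updateClock() {
    unsigned int now = TCNT1;
    // Unsigned 16-bit subtraction handles a single wrap-around correctly.
    total_us += (unsigned long)(unsigned int)(now - last_count) * 4;
    last_count = now;
}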

Executing different kernels on different GPUs simultaneously

Basically I have two GPUs and I want to execute some kernels on each of them. I don't want the GPUs to work on the same kernel with each doing some part of it (I don't know if this is possible); just in case, I don't even want to see that behavior.
I just want to make sure that both devices are being exercised. I have created a context and command queues for both of them, but I see that only one kernel gets executed, which means only one device is being used. This is how I have done it:
cl_device_id *device;
cl_kernel *kernels;
...
// creating context
context = clCreateContext(0, num_devices, device, NULL, NULL, &error);
...
// creating command queues for all kernels
for (int i = 0; i < num_kernels; i++)
    cmdQ[i] = clCreateCommandQueue(context, *device, 0, &error);
...
// enqueue kernels
error = clEnqueueNDRangeKernel(*cmdQ, *kernels, 2, 0, glbsize, 0, 0, NULL, NULL);
Am I going the correct way?
It depends on how you actually filled your device array. If you initialized it correctly, creating a context spanning both devices is correct.
Unfortunately, you have a wrong idea about kernels and command queues. A kernel is created from a program for a particular context. A queue, on the other hand, is used to communicate with a certain device. What you want to do is create one queue per device, not per kernel:
for (int i = 0; i < num_devices; i++)
    cmdQ[i] = clCreateCommandQueue(context, device[i], 0, &error);
Now you can enqueue the different (or same) kernels on different devices via the corresponding command queues:
clEnqueueNDRangeKernel(cmdQ[0], kernels[0], /* ... */);
clEnqueueNDRangeKernel(cmdQ[1], kernels[1], /* ... */);
To sum up the terms:
A cl_context is created for a particular cl_platform_id and is like a container for a subset of devices,
a cl_program is created and built for a cl_context and its associated devices,
a cl_kernel is extracted from a cl_program but can only be used on devices associated with the program's context,
a cl_command_queue is created for a specific device belonging to a certain context,
memory operations and kernel calls are enqueued in a command queue and executed on the corresponding device.
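Putting it together, a minimal sketch of the pattern for two devices (error checking omitted; "my_kernel" and the work sizes are placeholder values):
cl_command_queue cmdQ[2];
cl_kernel kernels[2];
for (int i = 0; i < 2; i++) {
    cmdQ[i] = clCreateCommandQueue(context, device[i], 0, &error);
    kernels[i] = clCreateKernel(program, "my_kernel", &error);
}
// Each queue feeds its own device, so the two kernels can run concurrently.
size_t glbsize[2] = {1024, 1024};
clEnqueueNDRangeKernel(cmdQ[0], kernels[0], 2, NULL, glbsize, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(cmdQ[1], kernels[1], 2, NULL, glbsize, NULL, 0, NULL, NULL);
clFinish(cmdQ[0]);
clFinish(cmdQ[1]);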
