Barriers in OpenCL - opencl

In OpenCL, my understanding is that you can use the barrier() function to synchronize threads in a work group. I do (generally) understand what they are for and when to use them. I'm also aware that all threads in a work group must hit the barrier, otherwise there are problems. However, every time I've tried to use barriers so far, it seems to result in either my video driver crashing, or an error message about accessing invalid memory of some sort. I've seen this on 2 different video cards so far (1 ATI, 1 NVIDIA).
So, my questions are:
Any idea why this would happen?
What is the difference between barrier(CLK_LOCAL_MEM_FENCE) and barrier(CLK_GLOBAL_MEM_FENCE)? I read the documentation, but it wasn't clear to me.
Is there a general rule about when to use barrier(CLK_LOCAL_MEM_FENCE) vs. barrier(CLK_GLOBAL_MEM_FENCE)?
Is there ever a time that calling barrier() with the wrong parameter type could cause an error?

As you have stated, barriers may only synchronize threads in the same workgroup. There is no way to synchronize different workgroups in a kernel.
Now to answer your question, the specification was not clear to me either, but it seems to me that section 6.11.9 contains the answer:
CLK_LOCAL_MEM_FENCE – The barrier function will either flush any
variables stored in local memory or queue a memory fence to ensure
correct ordering of memory operations to local memory.
CLK_GLOBAL_MEM_FENCE – The barrier function will queue a memory fence
to ensure correct ordering of memory operations to global memory.
This can be useful when work-items, for example, write to buffer or
image memory objects and then want to read the updated data.
So, to my understanding, you should use CLK_LOCAL_MEM_FENCE when writing to and reading from the __local memory space, and CLK_GLOBAL_MEM_FENCE when writing to and reading from the __global memory space.
I have not tested whether this is any slower, but most of the time, when I need a barrier and I am unsure which memory space is affected, I simply use a combination of the two, i.e.:
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
This way you should not have any read/write ordering problems within the work-group (as long as you are sure that every thread in the group goes through the barrier, but you are aware of that).
Hope it helps.
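To make the __local case concrete, here is a minimal sketch in plain C (not OpenCL C) that emulates a single work-group sequentially. It is only a model under stated assumptions: GROUP_SIZE, reverse_group and reverse_all_groups are hypothetical names, and the scratch array merely stands in for __local memory. The gap between the two loops marks where barrier(CLK_LOCAL_MEM_FENCE) would sit in a real kernel, because phase 2 reads elements that other work-items wrote in phase 1.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 8  /* hypothetical work-group size */

/* Emulates one work-group that reverses its slice of `data` through a
 * scratch array standing in for __local memory. */
void reverse_group(int *data, size_t group_id)
{
    int scratch[GROUP_SIZE];                 /* stands in for __local memory */
    int *slice = data + group_id * GROUP_SIZE;

    /* Phase 1: every work-item writes its own element to scratch. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        scratch[lid] = slice[lid];

    /* The end of the loop above plays the role of
     * barrier(CLK_LOCAL_MEM_FENCE): no work-item starts phase 2 before
     * all phase-1 writes to scratch are complete. */

    /* Phase 2: every work-item reads an element written by another one. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        slice[lid] = scratch[GROUP_SIZE - 1 - lid];
}

/* Runs the emulated work-group once per group; n must be a multiple
 * of GROUP_SIZE. */
void reverse_all_groups(int *data, size_t n)
{
    assert(n % GROUP_SIZE == 0);
    for (size_t g = 0; g < n / GROUP_SIZE; ++g)
        reverse_group(data, g);
}
```

Without the phase boundary, a work-item could read a scratch slot before its neighbor has written it, which is exactly the hazard the local fence is meant to prevent.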

Reviving an old-ish thread here. I have had a little bit of trouble with barrier() myself.
Regarding your crash problem, one potential cause could be if your barrier is inside a condition. I read that when you use barrier, ALL work items in the group must be able to reach that instruction, or it will hang your kernel - usually resulting in a crash.
if(someCondition){
    //do stuff
    barrier(CLK_LOCAL_MEM_FENCE);
    //more stuff
}else{
    //other stuff
}
My understanding is that if one or more work items satisfies someCondition, ALL work items must satisfy that condition, or there will be some that will skip the barrier. Barriers wait until ALL work items reach that point. To fix the above code, I need to restructure it a bit:
if(someCondition){
    //do stuff
}
barrier(CLK_LOCAL_MEM_FENCE);
if(someCondition){
    //more stuff
}else{
    //other stuff
}
Now all work items will reach the barrier.
I don't know to what extent this applies to loops; if a work item breaks from a for loop, does it hit barriers? I am unsure.
UPDATE: I have successfully crashed a few ocl programs with a barrier in a for-loop. Make sure all work items exit the for loop at the same time - or better yet, put the barrier outside the loop.
(source: Heterogeneous Computing with OpenCL Chapter 5, p90-91)
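The loop case can be modeled the same way. The following is a rough sketch in plain C, not a real kernel: it only counts how many times each emulated work-item would reach a barrier, which simplifies the actual rule (all work-items must reach the same barrier). GROUP_SIZE, MAX_ITERS and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GROUP_SIZE 8   /* hypothetical work-group size */
#define MAX_ITERS  4   /* hypothetical uniform trip count */

/* Divergent loop: work-item `lid` iterates lid % MAX_ITERS times and would
 * call barrier() once per iteration, so counts differ across the group. */
size_t barriers_divergent(size_t lid)
{
    size_t hits = 0;
    for (size_t i = 0; i < lid % MAX_ITERS; ++i)
        ++hits;                 /* stands in for barrier(...) */
    return hits;
}

/* Uniform loop: every work-item runs MAX_ITERS iterations; only the work
 * inside is guarded, so each one reaches the barrier equally often. */
size_t barriers_uniform(size_t lid)
{
    size_t hits = 0;
    for (size_t i = 0; i < MAX_ITERS; ++i) {
        if (i < lid % MAX_ITERS) {
            /* per-item work would go here */
        }
        ++hits;                 /* stands in for barrier(...) */
    }
    return hits;
}

/* True if all work-items in the group would hit the same number of barriers. */
bool group_is_barrier_safe(size_t (*count)(size_t))
{
    assert(count != NULL);
    size_t expected = count(0);
    for (size_t lid = 1; lid < GROUP_SIZE; ++lid)
        if (count(lid) != expected)
            return false;
    return true;
}
```

In this model, group_is_barrier_safe(barriers_divergent) is false and group_is_barrier_safe(barriers_uniform) is true, which mirrors the advice above: keep the trip count identical for the whole group and guard only the work, not the barrier.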

Related

Are global memory barriers required if only one work item reads and writes to memory

In my kernel, each work item has a reserved memory region in a buffer
that only it writes to and reads from.
Is it necessary to use memory barriers in this case?
EDIT:
I call mem_fence(CLK_GLOBAL_MEM_FENCE) before each write and before each read. Is this enough to guarantee load/store consistency?
Also, is this even necessary if only one work item is loading from and storing to this memory region?
See this other stack overflow question:
In OpenCL, what does mem_fence() do, as opposed to barrier()?
Barriers work at the work-group level: they stop the threads belonging to the same work-group until all of them reach the barrier. If the memory regions used by different work items do not overlap, no extra synchronization point is needed.
Also, is this even necessary if only one work item is loading storing to this memory region ?
Theoretically, mem_fence only guarantees that earlier memory accesses are committed before later ones. In my experience, I never saw any difference in the results of applications with or without this mem_fence call.
Best regards

Memory transfer between work items and global memory in OpenCL?

I have some queries regarding how data transfer happens between work items and global memory. Let us consider the following highly inefficient memory bound kernel.
__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
    size_t gid = get_global_id(0);//line no 1
    myreal pCoef = coef[gid];//line no 2
    myreal pRow = row[gid];//line no 3
    pCoef = pCoef - (pRow * ratio);//line no 4
    coef[gid] = pCoef;//line no 5
}
1. Do all work items in a work group begin executing line no 1 at the same time?
2. Do all work items in a work group begin executing line no 2 at the same time?
3. Suppose different work items in a work group finish executing line no 4 at different times. Do the early finished ones wait so that all work items transfer the data to global memory at the same time in line no 5?
4. Do all work items exit the compute unit simultaneously such that early finished work items have to wait until all work items have finished executing?
5. Suppose each kernel has to perform 2 reads from global memory. Is it better to execute these statements one after the other or is it better to execute some computation statements between the 2 read executions?
6. The above shown kernel is memory bound for GPU. Is there any way by which performance can be improved?
7. Are there any general guidelines to avoid memory bounds?
Find my answers below: (thanks sharpneli for the good comment on AMD GPUs and warps)
1. Normally YES, but it depends on the hardware. You can't rely on that behavior or design your algorithm around this "ordered execution". That's why barriers and mem_fences exist. For example, some GPUs execute only a subset of the WG's WIs in lock-step. On a CPU it is even possible that they run in no particular order at all.
2. Same as answer 1.
3. As in answer 1, they are very unlikely to finish at different times, so YES. Bear in mind that this is a good feature, since 1 big write to memory is more efficient than a lot of small writes.
4. Typically YES (see answer 1 as well).
5. It is better to interleave the reads with operations, but the compiler will already account for this and reorder the operations to hide the latency of reads and writes. Of course the compiler will never move code around in a way that changes the result. Unless you manually disable compiler optimizations, this is typical behavior for OpenCL compilers.
6. NO, it can't be improved in any way from the kernel point of view.
7. The general rule is to ask: is each memory cell of the input used by more than one WI?
NO (1 global -> 1 private) (this is the case of your kernel in the question)
Then that memory goes global -> private and there is no way to improve it. Don't use local memory, since it would be a waste of time.
YES (1 global -> X private)
Try to move the global memory to local memory first, then read directly from local to private for each WI. Depending on the amount of reuse (maybe only 2 WIs use the same global data), it may not even be worth it if the amount of computation is already high. You have to weigh the extra memory usage against the savings in global accesses. For image processing it is typically a good idea, for other types of processing not so much.
NOTE: The same applies when writing to global memory. It is always better for many WIs to operate in local memory before writing to global. But if each WI writes to a unique address in global memory, then write directly.
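To see why the kernel in the question falls in the "NO" case, it can help to write it as a plain host-side loop. The following C sketch is a reference version for checking results, not the author's code: the typedef of myreal to float is an assumption (the question does not show it), and reduce_u_ratios_ref is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

typedef float myreal;  /* assumption: the question does not show this typedef */

/* Host-side reference for reduceURatios: one loop iteration per work-item. */
void reduce_u_ratios_ref(myreal *coef, const myreal *row, myreal ratio, size_t n)
{
    assert(coef != NULL && row != NULL);
    for (size_t gid = 0; gid < n; ++gid) {
        myreal p_coef = coef[gid];          /* 1 global read  */
        myreal p_row  = row[gid];           /* 1 global read  */
        coef[gid] = p_coef - p_row * ratio; /* 1 global write */
    }
}
```

Each element is read once and written once, with a single multiply-subtract in between, so there is no reuse that local memory could exploit.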

How do I stop all 262,144 kernels if I find my answer

I am using pyopencl to find a certain pixel in a 512 x 512 (262,144 pixels) image. I launch the kernel with a global size of (512, 512), and each work item compares its pixel's neighbors to a known group of neighbors. I am doing image synthesis. I don't want to wait around for the remaining work items to run once one of them has found my group of pixels. Is there a way to terminate the rest of the running work items from within the kernel program?
Thanks
Tim
When you queue a kernel with many work items, it gets divided up into work groups and threads which keep the GPU busy. Really large global sizes start as many threads as they can and issue new ones when the old ones finish. So you could find the smallest global size that still performs well, and queue many of those (instead of one large one), but also be checking on the results of the previous ones you queued (use events to know when they are done, and read back memory to get their results). When you get the correct answer, stop queueing kernels.
So instead of this:
queue entire job (say, 4096 x 4906)
do:
do
{
    queue some work (say, 32 x 32)
    check if any of the prior work queued is done and check if it got the answer
}
while (work remains AND answer not found)
You'll need to figure out the right tradeoff between the size of the smaller jobs and the overhead of checking their results versus extra work done.
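The chunked loop above can be sketched in plain C as follows. This is only a host-side model under assumptions: search_chunk is a hypothetical stand-in for "enqueue one small kernel, wait for its event, and read back its result", and here it simply scans a slice of an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for "enqueue one small kernel and read back its result":
 * scans pixels [begin, end) and reports the first index equal to target. */
static bool search_chunk(const int *pixels, size_t begin, size_t end,
                         int target, size_t *found_at)
{
    for (size_t i = begin; i < end; ++i) {
        if (pixels[i] == target) {
            *found_at = i;
            return true;
        }
    }
    return false;
}

/* Processes the image chunk by chunk and stops queueing once a match is
 * found. Returns the number of chunks that were actually processed. */
size_t search_in_chunks(const int *pixels, size_t n, size_t chunk,
                        int target, bool *found, size_t *found_at)
{
    assert(chunk > 0);
    size_t chunks_run = 0;
    *found = false;
    /* Loop while work remains AND no answer has been found. */
    for (size_t begin = 0; begin < n && !*found; begin += chunk) {
        size_t end = begin + chunk < n ? begin + chunk : n;
        *found = search_chunk(pixels, begin, end, target, found_at);
        ++chunks_run;
    }
    return chunks_run;
}
```

The return value makes the tradeoff visible: smaller chunks waste less work after a hit, but each chunk adds a round trip of queueing and checking.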
Your question is a big issue and problem of parallelism.
What to do when one of your parallel threads has already the answer to the problem?
OpenCL does not let you control a kernel's execution once it has started, not even from the host. And this is a big problem. However, it is how it has to be: if the work items could not run freely, detached from one another, the execution would not be fully parallel.
The only solution is to split the computation into small parts and check the completion of each of them. But sometimes the parts are already very small (in your case, 512x512 is quite small).
In your specific case I would process everything (512x512), and after that I would use another kernel to get the final results out of the 512x512 set.
My first thought is to have some sort of global memory flag that each work item can read and set. This approach requires atomicity, so make sure to use the atomic_ functions.
__kernel void t(__global int *Data,
                __global int *Flag){
    //atomic_max takes the pointer and returns the old value, so this is an atomic read
    if(atomic_max(Flag, 0) == 0){
        //perform calc on Data
        if(PixelsFound){
            //Set the flag to +1
            atomic_inc(Flag);
        }
    }
}
Community, feel free to comment if this is known not to work!
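The same flag idea can be tried out on the host with C11 atomics. This is a minimal sketch rather than OpenCL code: should_run and signal_found are hypothetical names, and a compare-and-swap is used so the flag ends up at exactly 1 no matter how many workers report a hit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if this worker should still do its work (flag not raised). */
bool should_run(atomic_int *flag)
{
    assert(flag != NULL);
    return atomic_load(flag) == 0;
}

/* Raises the flag from 0 to 1; returns true only for the first caller. */
bool signal_found(atomic_int *flag)
{
    assert(flag != NULL);
    int expected = 0;
    return atomic_compare_exchange_strong(flag, &expected, 1);
}
```

Note that a flag like this only lets work that has not started yet skip its computation; work that is already past the check still runs to completion.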

Work-item execution order

I am working with OpenCL. And I am interested how work-item will be executed in the following example.
I have a one-dimensional range of 10000 with a work-group size of 512. The kernel is the following:
__kernel void
doStreaming() {
unsigned int id = get_global_id(0);
if (!isExecutable(id))
return;
/* do some work */
}
Here it checks whether it needs to process the element with the given id.
Let's assume that execution started with the first work-group of size 512 and that 20 of its work-items were rejected by isExecutable. Does the GPU continue to execute another 20 elements without waiting for the first 492 elements?
There are no barriers or other synchronization techniques involved.
When some work-items branch away from the usual /* do some work */ path, the hardware can keep its pipelines occupied by issuing instructions from the next wavefront (AMD) or the next warp (NVIDIA) while the current warp/wavefront is busy doing other things. But this can serialize memory accesses and disturb the access order of the work-group, decreasing performance.
Avoid diverged warps/wavefronts: if-statements inside a loop are especially bad, so you had better find another way.
If every work-item in a work-group takes the same branch, then it is OK.
If every work-item branches only rarely, once per hundreds of computations, it is OK.
Try to create equal conditions for all work-items (embarrassingly parallel data/algorithm) to harness the power of the GPU.
The best way I know to get rid of the simplest branch-vs-compute case is to use a global yes-no array (1=yes, 0=no): always compute, then multiply your result by the work-item's yes-no element. Generally, adding a 1-byte memory access per core is much better than doing one branch per core. After adding this 1 byte, making the object length a power of 2 could be better still.
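The yes-no array trick can be sketched in plain C as below. This is an illustration under assumptions, not a tuned kernel: masked_square is a hypothetical name, squaring stands in for the real work, and the mask uses 1 to keep a result and 0 to drop it so that the multiplication does the selection.

```c
#include <assert.h>
#include <stddef.h>

/* Branch-free variant: compute for every element, then multiply by a
 * per-element mask (1 = keep the result, 0 = discard it). */
void masked_square(const float *in, const unsigned char *mask,
                   float *out, size_t n)
{
    assert(in != NULL && mask != NULL && out != NULL);
    for (size_t id = 0; id < n; ++id) {
        float result = in[id] * in[id];     /* hypothetical "work" */
        out[id] = result * (float)mask[id]; /* no if-statement needed */
    }
}
```

Every element does the same work, so nothing diverges; the cost is one extra byte read per element plus computation whose result is thrown away for masked-out entries.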
Yes and no. The following is based on documentation from NVIDIA, but I doubt it is any different on ATI hardware (though the actual numbers might differ). In general, the threads of a work group are executed in so-called warps, which are sub-blocks of the work group. On NVIDIA hardware each work group is divided into warps of 32 threads each. Each of those warps is executed in lock-step and thus perfectly in parallel (it may not be real-time parallel, meaning there could be 16 threads in parallel and then 16 again directly afterwards, but conceptually they're running perfectly parallel). So if only one of those 32 threads executes that additional code, the others will wait for it. But the threads in all the other warps won't be affected by this.
So yes, there may be threads that will unnecessarily wait for the others, but that happens on a smaller scale than the whole work group size (32 on any NVIDIA hardware). This is why intra-warp branch divergence should be avoided if possible, and this is also why code that is guaranteed to work inside a single warp only doesn't need any synchronization for e.g. shared memory access (a common optimization for algorithms).

Should I care about thread safe of static int (4 bytes) variable in ASP .NET

I have the feeling that I should not care about thread safe accessing / writing to an
public static int MyVar = 12;
in ASP .NET.
I read/write to this variable from various user threads. Let's suppose this variable will store the numbers of clicks on a certain button/link.
My theory is that no thread can read/write to this variable at the same time. It's just a simple variable of 4 bytes.
I do care about thread safety, but only for reference objects and List instances or other types that take more cycles to read/update.
Am I wrong in my assumption?
EDIT
I understand this depends on my scenario, but that wasn't the point of the question. The question is: is it true that thread-safe code can be written with a (static int) variable without using the lock keyword?
Writing correct code is my problem. The answer seems to be: yes, if you write correct, simple code that is not too complicated, you can create thread-safe functions without needing the lock keyword.
If one thread simply sets the value and another thread reads the value, then a lock is not necessary; the read and write are atomic. But if multiple threads might be updating it and are also reading it to do the update (e.g., increment), then you definitely do need some kind of synchronization. If only one thread is ever going to update it even for an increment, then I would argue that no synchronization is necessary.
Edit (three years later) It might also be desirable to add the volatile keyword to the declaration to ensure that reads of the value always get the latest value (assuming that matters in the application).
The concept of thread 'safety' is too vague to be meaningful unfortunately. If you're asking whether you can read and write to it from multiple threads without the program crashing during the operation, the answer is almost certainly yes. If you're also asking if the variable is guaranteed to either be the old value or the new value without ever storing any broken intermediate values, the answer for this data type is again almost certainly yes.
But if your question is "will my program work correctly if I access this from multiple threads", then the answer depends entirely on what your program is doing. For example, if you run the following pseudo code in 2 threads repeatedly in most programming languages, eventually you'll hit the assertion.
if MyVar >= 1:
MyVar = MyVar - 1
assert MyVar >= 0
Primitives like int are thread-safe in the sense that reads/writes are atomic. But as with most any type, it's left to you to do proper checking with more complex operations. For example, if (x > 0) x--; would be problematic in a multi-threaded scenario because x might change in between the if condition check and decrement.
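The check-then-decrement problem can be replayed deterministically. The sketch below is in C11 rather than C#, so it is only an analogy: replay_unsafe_interleaving and decrement_if_positive are hypothetical names, the first function hard-codes one unlucky interleaving of two threads, and the second shows the usual compare-and-swap loop (in C#, Interlocked.CompareExchange plays the same role).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Replays one unlucky interleaving of two threads running
 * "if (x >= 1) x = x - 1;" on a shared plain int. Returns the final value. */
int replay_unsafe_interleaving(int x)
{
    bool a_passed = (x >= 1);   /* thread A checks */
    bool b_passed = (x >= 1);   /* thread B checks before A writes */
    if (a_passed) x = x - 1;    /* thread A decrements */
    if (b_passed) x = x - 1;    /* thread B decrements based on a stale check */
    return x;
}

/* Check and decrement as one atomic step using a compare-and-swap loop. */
bool decrement_if_positive(atomic_int *x)
{
    assert(x != NULL);
    int seen = atomic_load(x);
    while (seen >= 1) {
        if (atomic_compare_exchange_weak(x, &seen, seen - 1))
            return true;
        /* On failure, seen now holds the current value; retry. */
    }
    return false;
}
```

Starting from 1, the replayed interleaving ends at -1 even though every individual read and write was atomic, while the compare-and-swap version can never take the value below 0.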
A simple read or write on a field of 32 bits or less is always atomic. But you should provide your read/write code to make sure that it is thread safe.
Check out this post: http://msdn.microsoft.com/en-us/magazine/cc163929.aspx
It explains why you need to synchronize access to the integers in this scenario
Try Interlocked.Increment() or Interlocked.Add() and you'll be right. Your code complexity will be the same but you truly won't have to worry. If you're not worried about losing a few clicks in your counter, you can continue as you are.
Reading or writing integers is atomic. However, reading and then writing is not atomic. So, if you have one thread that writes and many that read, you may be able to get away without locks.
However, even though the operations are atomic, there are still potential multi-threading issues. In order for one thread to be guaranteed that another thread can see values it writes, you need a memory barrier. Otherwise, the compiler can optimize the code so that the variable stays in a register (or even optimize the operation away completely), so changes would be invisible from one thread to another.
You can establish a memory barrier explicitly (volatile or Thread.MemoryBarrier), or with the Interlocked class -- or with the lock statement (Monitor).

Resources