CUDA __syncthreads() and recursion

I want to use __syncthreads() in a recursive function like
__device__ void foo(int k) {
    if (some_condition) {
        for (int i = 0; i < 8; i++) {
            foo(i + k); // foo might take longer with some inputs
            __syncthreads();
        }
    }
}
How does this __syncthreads() apply now? I know it is only applied within a block. As far as I understand it, this holds for all local threads independently of the recursion depth. But what if I wanted to make sure that this __syncthreads() applies only at a certain recursion depth? Is that even possible? I could check for the recursion depth, but I believe that won't work either.
Are there possible alternatives?
I've seen that there are 3 __syncthreads() extensions for CUDA devices of compute capability >= 2.0:
int __syncthreads_count(int predicate);
int __syncthreads_and(int predicate);
int __syncthreads_or(int predicate);
But I don't think they will help, since they seem to act like an atomic counter.

As you know, __syncthreads() is only safe where all threads within a block reach the barrier. This means that if you call __syncthreads() from within a condition, the condition must evaluate the same on all threads within the block.
For __syncthreads() within recursion, this means that all threads within a block must execute the recursion to the same depth; otherwise not all threads will reach the same barrier.

Are there possible alternatives?
Yes: don't use the recursion paradigm to express your function logic.
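For instance, a recursion with a bounded, uniform depth can often be rewritten as a plain loop, so that every thread in the block executes the same number of barriers. A minimal sketch, assuming a hypothetical foo_iterative and an illustrative max_depth (not the poster's actual logic):
__device__ void foo_iterative(int k, int max_depth) {
    // Every thread runs exactly max_depth iterations, so each
    // __syncthreads() below is reached uniformly by the whole block.
    for (int depth = 0; depth < max_depth; depth++) {
        // per-depth work that may diverge between threads goes here ...
        __syncthreads();
    }
}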

Of course, what you said about __syncthreads() is true: it only works for threads within a block, so you have no control over what is happening in other blocks. The best approach for a reduction is to first reduce the whole array, which produces an array with one element per block. Then, instead of copying that array back to the host, call another reduction with a single block whose thread count equals the number of blocks in the previous call, and afterwards copy the single-element result from device to host. Make sure to use cudaThreadSynchronize() between the two calls, because the second reduction cannot run until the first one has produced its output. This is a two-step reduction, but it works for me.
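A minimal sketch of that two-step reduction (the kernel and sizes are illustrative, not the poster's code; note that cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize()):
// First pass: one partial sum per block. Assumes blockDim.x is a power of two.
__global__ void reduce(const float *in, float *out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads(); // uniform: every thread executes every barrier
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Host side (blocks must also be a power of two for the second pass):
// reduce<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);
// cudaDeviceSynchronize();
// reduce<<<1, blocks, blocks * sizeof(float)>>>(d_partial, d_out, blocks);
// cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);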
cheers!!!
saif

How to synchronize (specific) work-items based on data, in OpenCL?

Context:
The need is to simulate a net of related discrete elements (a complex electronic circuit). Each component receives input from several other components and outputs to several others.
The intended design is to have a kernel with a configuration argument defining which component it shall represent. Each component of the circuit is represented by a work-item, and the whole circuit will fit in a single work-group (or the circuit will be split adequately so that each work-group can manage all its components as work-items).
The problem:
Is it possible, and if so how, to have some work-items wait for other work-items' data?
A work-item generates an output to an array (at a data-driven position). Another work-item needs to wait for this to happen before starting its own processing.
The net has no loops, so it is not possible that a single work-item needs to run twice.
Attempts:
In the following example, each component can have at most one input (to simplify), making the circuit a tree where the input of the circuit is the root and the 3 outputs are leaves.
inputIndex models this tree by indicating, for each component, which other component provides its input. The first component takes itself as input, but the kernel handles this case (for simplification).
result stores the result of each component (voltage, intensity, etc.).
inputModified indicates whether the given component has already calculated its output.
// where the data comes from (index in result)
constant int inputIndex[5] = {0, 0, 0, 2, 2};

kernel void update_component(
    local int *result,       // each work-item's result.
    local int *inputModified // whether all inputs are ready (only one input in this example)
) {
    int id = get_local_id(0);
    int size = get_local_size(0);
    int barrierCount = 0;

    // inputModified is a boolean indicating if the input is ready;
    // make sure all inputs are false by default (except the first one).
    inputModified[id] = (id != 0 ? 0 : 1);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Wait until all inputs are ready (only one in this example)
    while (!inputModified[inputIndex[id]] && size > barrierCount++) {
        // If the input is not ready, wait for it
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // all inputs are ready, compute the output
    if (id != 0) result[id] = result[inputIndex[id]] + 1;
    else result[0] = 42;

    // make sure any other work-item depending on this one is unblocked
    inputModified[id] = 1;

    // Even if finished, we need to "barrier" for the other work-items.
    while (size > barrierCount++) {
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This example has N barriers for N components, making it worse than a sequential solution.
Note: this is only the kernel, as the minimal C++ host code is quite long. If needed, I could find a way to add it.
Question:
Is it possible, efficiently and from the kernel itself, to have the different work-items wait for their data to be provided by other work-items? Or what solution would be efficient?
This problem is (for me) not trivial to explain, and I am far from an expert in OpenCL. Please be patient and feel free to ask if anything is unclear.
From documentation of barrier
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/barrier.html
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue
execution beyond the barrier.
But a while loop (containing a barrier) in the kernel has this condition:
inputModified[inputIndex[id]]
which can evaluate differently depending on the work-item id, leading to undefined behavior. Besides, the barrier before it,
barrier(CLK_LOCAL_MEM_FENCE);
already synchronizes all work-items in the work-group, so the while loop is redundant even if it works.
The last barrier loop is also redundant:
while (size > barrierCount++)
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
When the kernel ends, all work-items are synchronized anyway.
If you mean to send a message to work-items outside the work-group, then you can only use atomic variables. Even when using atomics, you should not assume any working/issuing order between any two work-items.
Your question
how to have some work-items waiting for other work-items' data? A
work-item generates an output to an array (at a data-driven position).
Another work-item needs to wait for this to happen before starting
its own processing. The net has no loops, thus it is not possible
that a single work-item needs to run twice.
can be answered with an OpenCL 2.x feature, "dynamic parallelism" (device-side kernel enqueue), which lets a work-item spawn new workgroups/kernels inside a kernel. It is much more efficient than waiting in a spin-wait loop, and far more hardware-independent than relying on the number of in-flight threads a GPU supports (when the GPU can't handle that many in-flight threads, any spin-wait will deadlock, regardless of thread order).
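As a rough illustration (OpenCL C 2.0 block syntax; the parent is assumed to be launched with a single work-item, and the child computation is invented for the example):
kernel void parent(global int *result) {
    result[0] = 42; // root node output is ready
    queue_t q = get_default_queue();
    ndrange_t nd = ndrange_1D(2); // two child nodes in this toy tree
    // The child work starts only after the parent kernel has finished.
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, nd,
                   ^{ result[get_global_id(0) + 1] = result[0] + 1; });
}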
When you use barrier, you don't need to inform other work-items through inputModified; after the barrier, the data in result is already visible within the work-group.
If you can't use OpenCL v2.x, then you should process a tree using BFS:
start 1 work-item for the top node
process it, prepare its K outputs, and push them into a queue
end kernel
start K work-items (each pops an element from the queue)
process them, prepare their N outputs, and push them into the queue
end kernel
repeat until the queue has no more elements
The number of kernel calls is equal to the maximum depth of the tree, not the number of nodes.
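A host-side sketch of that loop (C-style; the buffer and kernel names are illustrative, and the kernel is assumed to fill next_count_buf with the size of the next level):
int level_size = 1; // start with the root node
while (level_size > 0) {
    size_t gws = level_size;
    clEnqueueNDRangeKernel(queue, process_level, 1, NULL, &gws, NULL, 0, NULL, NULL);
    // read back how many nodes the kernel pushed for the next level
    clEnqueueReadBuffer(queue, next_count_buf, CL_TRUE, 0, sizeof(int),
                        &level_size, 0, NULL, NULL);
}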
If you need quicker synchronization than kernel launches, use a single work-group for the whole tree and use barrier instead of kernel re-launches. Or process the first few steps on the CPU, producing multiple sub-trees, and send them to different OpenCL work-groups. Computing on the CPU until there are N sub-trees, where N = the number of GPU compute units, could be better for workgroup-barrier based asynchronous computing of the sub-trees.
There is also a barrierless, atomicless, single-kernel-call way to do this: start the tree from the bottom and go up.
Map all deepest-level child nodes to work-items. Move each of them towards the top while recording their path (node id, etc.) in their private memory / some other fast memory. Then have them traverse back top-down through that recorded path, computing on the move, without any synchronization or even atomics. This is less work-efficient than the barrier/kernel-call versions, but the lack of barriers and the totally asynchronous paths should make it fast enough.
If the tree has depth 10, this means 10 node pointers to save, which is not much for private registers. If the tree depth is about 30-40, use local memory with fewer threads in each work-group; if it is even more, allocate global memory.
But you may need to sort the work-items by their spatial locality / the tree's topology to make them work together faster with less branching.
This way looks simplest to me, so I suggest you try this barrierless version first.
If you only want data visibility per work-item instead of per work-group or kernel, use a fence: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mem_fence.html

How to use async_work_group_copy in OpenCL?

I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's have a look at a simplified example:
__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];
    int globalid = get_global_id(0);
    int localid = get_local_id(0);

    event_t e = async_work_group_copy(xcopy, x + globalid - localid, GROUP_SIZE, 0);
    wait_group_events(1, &e);
}
The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."
But that doesn't clarify my questions...
I would like to know, if the following assumptions are correct:
The call to async_work_group_copy() must be executed by all work-items in the group.
The call should be in a way, that the source address is identical for all work-items and points to the first element of the memory area to be copied.
As my source address is relative, based on the global id of the first work-item in the work-group, I have to subtract the local id to make the address identical for all work-items...
Is the third parameter really the number of elements (not the size in bytes)?
Bonus questions:
a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the return value? If so, would that probably be faster?
b. Does a local copy also make sense for processing on CPUs or is that overhead as they share a cache anyway?
Regards,
Stefan
One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.
You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.
For your specific questions:
I have never seen async_work_group_copy used by only some of the work-items in a group. I always assumed this is because it is required. I think the blocking nature of wait_group_events forces all work-items to be part of the copy.
Yes. Source (and destination) addresses need to be the same for all work items.
You could subtract your local id to get the correct address, but I find that basing the address on the group id (get_group_id) solves this problem as well.
Yes. The last param is the number of elements, not the size in bytes.
a. No. The copy is event-based; you will find that your barrier is hit almost immediately by the work-items, while the data won't necessarily have been copied yet. This makes sense because some OpenCL hardware might not even use the compute units at all to do the actual copy operation.
b. I think that CPU OpenCL implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure whether this performs better is to benchmark your application with various settings.
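For illustration, a sketch of the group-id based addressing from points 2-3 (assuming the global size is a multiple of GROUP_SIZE):
__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];
    // x + get_group_id(0) * GROUP_SIZE is the same source address
    // for every work-item in the group
    event_t e = async_work_group_copy(xcopy,
                                      x + get_group_id(0) * GROUP_SIZE,
                                      GROUP_SIZE, 0);
    wait_group_events(1, &e);
}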

Proper way to inform OpenCL kernels of many memory objects?

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to access. What's the recommended way of letting each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is also a little nasty. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer, as sketched below.
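A hypothetical kernel doing exactly that (the names are invented for the example):
kernel void use_arrays(global float *big_array, global const int *offsets) {
    // recover the per-array pointers once, then index them normally
    global float *A = big_array + offsets[0];
    global float *B = big_array + offsets[1];
    global float *C = big_array + offsets[2];
    // e.g. A[50], B[20], C[0] now behave like the separate arrays above
}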
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing 60 kernel arguments, but depending on the overhead of your OpenCL implementation's clSetKernelArg, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels will keep using the same buffer with each kernel execution, you only need to set up the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
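A host-side sketch of that pattern (the buffer, kernel, and size names are illustrative):
// set each buffer argument once ...
for (cl_uint i = 0; i < 60; i++)
    clSetKernelArg(kernel, i, sizeof(cl_mem), &buffers[i]);

// ... then launch as many times as needed without touching the arguments again
for (int iter = 0; iter < num_iterations; iter++)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);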

Barriers in OpenCL

In OpenCL, my understanding is that you can use the barrier() function to synchronize threads in a work group. I do (generally) understand what they are for and when to use them. I'm also aware that all threads in a work group must hit the barrier, otherwise there are problems. However, every time I've tried to use barriers so far, it seems to result in either my video driver crashing, or an error message about accessing invalid memory of some sort. I've seen this on 2 different video cards so far (1 ATI, 1 NVIDIA).
So, my questions are:
Any idea why this would happen?
What is the difference between barrier(CLK_LOCAL_MEM_FENCE) and barrier(CLK_GLOBAL_MEM_FENCE)? I read the documentation, but it wasn't clear to me.
Is there a general rule about when to use barrier(CLK_LOCAL_MEM_FENCE) vs. barrier(CLK_GLOBAL_MEM_FENCE)?
Is there ever a time that calling barrier() with the wrong parameter type could cause an error?
As you have stated, barriers may only synchronize threads in the same workgroup. There is no way to synchronize different workgroups in a kernel.
Now to answer your question, the specification was not clear to me either, but it seems to me that section 6.11.9 contains the answer:
CLK_LOCAL_MEM_FENCE – The barrier function will either flush any
variables stored in local memory or queue a memory fence to ensure
correct ordering of memory operations to local memory.
CLK_GLOBAL_MEM_FENCE – The barrier function will queue a memory fence
to ensure correct ordering of memory operations to global memory.
This can be useful when work-items, for example, write to buffer or
image memory objects and then want to read the updated data.
So, to my understanding, you should use CLK_LOCAL_MEM_FENCE when writing and reading to the __local memory space, and CLK_GLOBAL_MEM_FENCE when writing and reading to the __global memory space.
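For example, a sketch of a local-memory staging pattern where CLK_LOCAL_MEM_FENCE is the right choice (the tile reversal is just an arbitrary computation):
__kernel void reverse_tile(__global float *data, __local float *tile) {
    int lid = get_local_id(0);
    int lsz = get_local_size(0);
    tile[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE); // make every work-item's __local write visible
    data[get_global_id(0)] = tile[lsz - 1 - lid];
}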
I have not tested whether this is any slower, but most of the time, when I need a barrier and I have a doubt about which memory space is impacted, I simply use a combination of the two, i.e.:
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
This way you should not have any memory read/write ordering problem (as long as you are sure that every thread in the group goes through the barrier, but you are aware of that).
Hope it helps.
Reviving an old-ish thread here. I have had a little bit of trouble with barrier() myself.
Regarding your crash problem, one potential cause could be if your barrier is inside a condition. I read that when you use barrier, ALL work items in the group must be able to reach that instruction, or it will hang your kernel - usually resulting in a crash.
if (someCondition) {
    // do stuff
    barrier(CLK_LOCAL_MEM_FENCE);
    // more stuff
} else {
    // other stuff
}
My understanding is that if one or more work items satisfies someCondition, ALL work items must satisfy that condition, or there will be some that will skip the barrier. Barriers wait until ALL work items reach that point. To fix the above code, I need to restructure it a bit:
if (someCondition) {
    // do stuff
}
barrier(CLK_LOCAL_MEM_FENCE);
if (someCondition) {
    // more stuff
} else {
    // other stuff
}
Now all work items will reach the barrier.
I don't know to what extent this applies to loops; if a work item breaks from a for loop, does it hit barriers? I am unsure.
UPDATE: I have successfully crashed a few ocl programs with a barrier in a for-loop. Make sure all work items exit the for loop at the same time - or better yet, put the barrier outside the loop.
(source: Heterogeneous Computing with OpenCL Chapter 5, p90-91)

Should I care about thread safety of a static int (4 bytes) variable in ASP.NET?

I have the feeling that I should not care about thread-safe access to / writes to a
public static int MyVar = 12;
in ASP.NET.
I read/write to this variable from various user threads. Let's suppose this variable will store the numbers of clicks on a certain button/link.
My theory is that no thread can read/write to this variable at the same time. It's just a simple variable of 4 bytes.
I do care about thread safety, but only for reference objects and List instances or other types that take more cycles to read/update.
Am I wrong in my presumption?
EDIT
I understand this depends on my scenario, but that wasn't the point of the question. The question is: is it right that thread-safe code can be written with a static int variable without using the lock keyword?
It is my problem to write correct code. The answer seems to be: yes, if you write correct and simple code, and do not make it too complicated, you can create thread-safe functions without needing the lock keyword.
If one thread simply sets the value and another thread reads the value, then a lock is not necessary; the read and write are atomic. But if multiple threads might be updating it and are also reading it to do the update (e.g., increment), then you definitely do need some kind of synchronization. If only one thread is ever going to update it even for an increment, then I would argue that no synchronization is necessary.
Edit (three years later) It might also be desirable to add the volatile keyword to the declaration to ensure that reads of the value always get the latest value (assuming that matters in the application).
The concept of thread 'safety' is too vague to be meaningful unfortunately. If you're asking whether you can read and write to it from multiple threads without the program crashing during the operation, the answer is almost certainly yes. If you're also asking if the variable is guaranteed to either be the old value or the new value without ever storing any broken intermediate values, the answer for this data type is again almost certainly yes.
But if your question is "will my program work correctly if I access this from multiple threads", then the answer depends entirely on what your program is doing. For example, if you run the following pseudo code in 2 threads repeatedly in most programming languages, eventually you'll hit the assertion.
if MyVar >= 1:
    MyVar = MyVar - 1
assert MyVar >= 0
Primitives like int are thread-safe in the sense that reads/writes are atomic. But as with most any type, it's left to you to do proper checking with more complex operations. For example, if (x > 0) x--; would be problematic in a multi-threaded scenario because x might change in between the if condition check and decrement.
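A hedged C# sketch of one way to make that check-then-decrement atomic, using an Interlocked.CompareExchange retry loop (the names are invented for the example):
using System.Threading;

static class Counter
{
    static int x;

    // Returns true if it decremented, false if x was already 0.
    public static bool TryDecrement()
    {
        while (true)
        {
            int current = x;
            if (current <= 0) return false; // nothing left to take
            // Write current - 1 only if x is still equal to current.
            if (Interlocked.CompareExchange(ref x, current - 1, current) == current)
                return true;
        }
    }
}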
A simple read or write on a field of 32 bits or less is always atomic. But you should still review your read/write code to make sure that it is thread safe.
Check out this post: http://msdn.microsoft.com/en-us/magazine/cc163929.aspx
It explains why you need to synchronize access to the integers in this scenario.
Try Interlocked.Increment() or Interlocked.Add() and you'll be right. Your code complexity will be the same but you truly won't have to worry. If you're not worried about losing a few clicks in your counter, you can continue as you are.
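For instance, a minimal sketch of the click counter from the question using Interlocked (the RecordClick helper is invented for the example):
using System.Threading;

static class ClickStats
{
    public static int MyVar = 12;

    public static void RecordClick()
    {
        // atomic read-modify-write; safe from any number of threads
        Interlocked.Increment(ref MyVar);
    }
}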
Reading or writing integers is atomic. However, reading and then writing is not atomic. So, if you have one thread that writes and many that read, you may be able to get away without locks.
However, even though the operations are atomic, there are still potential multi-threading issues. In order for one thread to be guaranteed that another thread can see values it writes, you need a memory barrier. Otherwise, the compiler can optimize the code so that the variable stays in a register (or even optimize the operation away completely), so changes would be invisible from one thread to another.
You can establish a memory barrier explicitly (volatile or Thread.MemoryBarrier), or with the Interlocked class -- or with the lock statement (Monitor).
