I need (host-readable) thread-local memory for my OpenCL kernel. Let's look at an example:
//Assume #threads is known to be 8
double threadLocalScalar[8] = {}; //1 scalar per thread
for(some range in parallel)
    threadLocalScalar[getThreadId()] += 1;
This is a basic "thread-local memory" solution: a vector of length #threads, indexed with getThreadId().
Now I need to do the same (or anything that works the same way) in OpenCL. My research so far suggests using get_group_id(0) to get the work-group id (and maybe, for simplicity, using a work-group size of 1). This way I know which "thread" is executing and can modify the correct part of a global memory vector.
However, I don't know how many "threads" are going to be created. So I can't determine how much global memory I need for the threadLocalScalar vector. How can I know this? Or, do you know any better solution? Is my research even correct?
Note:
The problem with using local memory is that I cannot read it from the host. Otherwise I could easily use local memory with work_group_size = 1 (only one "thread" per work-group, meaning the local memory is "thread local").
You only need shared local memory if adjacent work items need access to the same global memory locations; using shared local memory means you only have to read the global value once per work group instead of multiple times.
Your code example (threadLocalScalar[getThreadId()] += 1) doesn't do that; it just increments a memory location per work item (independent work), so you don't need shared local memory. Just increment your global memory based on get_global_id(0).
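For example, here is a minimal sketch of that approach (the kernel and buffer names are hypothetical): the host decides the global work size when it enqueues the kernel, allocates one double per work item, and each work item only ever touches its own slot:

// Hypothetical sketch: one slot per work item, indexed by the global id.
// The host allocates global_work_size * sizeof(double) bytes for the buffer,
// so the "how many threads" question is answered by the host at enqueue time.
__kernel void per_item_counter(__global double *threadLocalScalar)
{
    size_t gid = get_global_id(0);
    threadLocalScalar[gid] += 1.0; // effectively thread-local: no other work item touches this slot
}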
Related
I'm trying to get something to work but I run out of ideas so I figured I would ask here.
I have a kernel that has a large global size (usually 5 million).
Each of the threads can require up to 1 MB of global memory (exact size not known in advance).
So I figured... OK, on my typical target GPU I have 6 GB and can run 2880 threads in parallel, more than enough, right?
My idea is to create a big buffer (well actually 2 because of the max buffer size limitation...)
Each thread pointing to a specific global memory area (with the coalescence and stuff, but you get the idea...)
My problem is: how do I know which thread is currently being run (in the kernel code) so it points to the right memory area?
I did find the cl_arm_get_core_id extension, but this only gives me the workgroup, not the actual thread being used; plus, it does not seem to be available on all GPUs, since it's an extension.
I have the option of setting work_group_size = nb_compute_units / nb_cores and computing the offset as arm_get_core_id() * work_group_size + get_global_id(0) % work_group_size.
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernel calls with global size 2880, and there I obviously know where to point to with the global id.
But won't this lead to a lot of overhead because of the 5 million / 2880 kernel calls? Plus, any work group that finishes before the others will be idle until all work groups for this call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1 MB per WI for temporary computations (because you are not saving them; otherwise you wouldn't have enough memory).
Then, why not simply let it spill to global memory? Does the compiler complain? If it does complain, then you need other approaches:
One possibility is to create a queue (just a boolean array) of the memory zones that are free for use by the work-groups. Every time a new work-group starts, it takes an empty slot and sets its boolean to the "used" state. You can do this with the atomic_cmpxchg() atomic operation.
It may introduce a small overhead when launching each WG, but it would probably be negligible if each WI needs 1 MB of global memory.
Here you have a small example of how to do atomic_cmpxchg() LINK
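As a rough sketch of that slot-claiming idea (NUM_SLOTS and SLOT_BYTES are hypothetical build-time defines, and handling of the "no free slot" case is omitted):

// Hypothetical sketch: each work-group claims one free scratch slot using atomic_cmpxchg().
// slot_used has NUM_SLOTS entries, all initialized to 0 by the host.
__kernel void claim_slot_example(volatile __global int *slot_used,
                                 __global char *scratch)   // NUM_SLOTS * SLOT_BYTES bytes
{
    __local int claimed_slot;

    if (get_local_id(0) == 0) {
        int slot = -1;
        for (int i = 0; i < NUM_SLOTS && slot < 0; ++i) {
            // Atomically flip 0 -> 1; the old value is returned, so 0 means we own the slot.
            if (atomic_cmpxchg(&slot_used[i], 0, 1) == 0)
                slot = i;
        }
        claimed_slot = slot;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    __global char *my_scratch = scratch + (size_t)claimed_slot * SLOT_BYTES;
    // ... use my_scratch as this work-group's temporary 1 MB area ...

    barrier(CLK_GLOBAL_MEM_FENCE);
    if (get_local_id(0) == 0)
        atomic_xchg(&slot_used[claimed_slot], 0); // release the slot for the next work-group
}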
I have some queries regarding how data transfer happens between work items and global memory. Let us consider the following highly inefficient memory bound kernel.
__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
    size_t gid = get_global_id(0);   // line no 1
    myreal pCoef = coef[gid];        // line no 2
    myreal pRow = row[gid];          // line no 3
    pCoef = pCoef - (pRow * ratio);  // line no 4
    coef[gid] = pCoef;               // line no 5
}
1. Do all work items in a work group begin executing line no 1 at the same time?
2. Do all work items in a work group begin executing line no 2 at the same time?
3. Suppose different work items in a work group finish executing line no 4 at different times. Do the early finished ones wait, so that all work items transfer the data to global memory at the same time in line no 5?
4. Do all work items exit the compute unit simultaneously, such that early finished work items have to wait until all work items have finished executing?
5. Suppose each kernel has to perform 2 reads from global memory. Is it better to execute these statements one after the other, or is it better to execute some computation statements between the 2 read executions?
6. The above kernel is memory bound for the GPU. Is there any way by which performance can be improved?
7. Are there any general guidelines to avoid memory bounds?
Find my answers below (thanks sharpneli for the good comment on AMD GPUs and warps):
1. Normally YES, but it depends on the hardware. You can't directly expect that behavior and design your algorithm around this "ordered execution"; that's why barriers and mem_fences exist. For example, some GPUs execute only a sub-set of a WG's WIs in lockstep (a warp/wavefront), and on a CPU they may even run completely out of order.
2. Same as answer 1.
3. As in answer 1, they are very unlikely to finish at different times, so YES. However, bear in mind that this is a good feature, since one big write to memory is more efficient than a lot of small writes.
4. Typically YES (see answer 1 as well).
5. It is better to interleave the reads with operations, but the compiler will already account for this and reorder operations to hide the read/write latency. Of course the compiler will never move code around in a way that could change the result value. Unless you manually disable compiler optimizations, this is typical behavior of OpenCL compilers.
6. NO, it can't be improved in any way from the kernel point of view.
7. The general rule is: is each memory cell of the input used by more than one WI?
NO (1 global -> 1 private): this is the case of your kernel in the question. That access is purely global->private, and there is no way to improve it; don't use local memory, since it would be a waste of time.
YES (1 global -> X private): try to move the global memory to local memory first, then read directly from local to private for each WI (see the sketch below). Depending on the reuse amount (maybe only 2 WIs use the same global data) it may not even be worth it if the computation amount is already high. You have to consider the tradeoff between the extra memory usage and the global-access gain. For image processing it is typically a good idea; for other types of processing, not so much.
NOTE: The same applies if you write to global memory. When many WIs contribute to the same output, it is better to combine their results in local memory before writing to global; but if each WI writes to a unique global address, write directly.
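For the YES case, a minimal sketch of the global->local->private pattern (TILE and the kernel name are assumptions; it presumes the work-group size equals TILE and the useful work is just a placeholder sum):

// Hypothetical sketch: stage a tile of global data in local memory once per
// work-group, then let every WI read it from local instead of global.
#define TILE 64  // assumed work-group size

__kernel void stage_to_local_example(__global const float *in, __global float *out)
{
    __local float tile[TILE];

    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];              // one global read per element
    barrier(CLK_LOCAL_MEM_FENCE);     // make the tile visible to the whole WG

    // Every WI can now reuse the whole tile from fast local memory instead of
    // re-reading global memory; here we just sum the tile as a placeholder.
    float acc = 0.0f;
    for (int i = 0; i < TILE; ++i)
        acc += tile[i];

    out[gid] = acc;
}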
I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's have a look on a simplified example:
__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];
    int globalid = get_global_id(0);
    int localid = get_local_id(0);
    event_t e = async_work_group_copy(xcopy, x + globalid - localid, GROUP_SIZE, 0);
    wait_group_events(1, &e);
}
The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."
But that doesn't clarify my questions...
I would like to know if the following assumptions are correct:
1. The call to async_work_group_copy() must be executed by all work-items in the group.
2. The call should be made so that the source address is identical for all work-items and points to the first element of the memory area to be copied. Since my source address is relative to the global work-item id, I have to subtract the local id so that the address is identical for all work-items in the work-group.
3. Is the third parameter really the number of elements (not the size in bytes)?
Bonus questions:
a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the return value? If so, would that probably be faster?
b. Does a local copy also make sense for processing on CPUs or is that overhead as they share a cache anyway?
Regards,
Stefan
One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.
You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.
For your specific questions:
1. I have never seen async_work_group_copy used by only some of the work items in a group. I always assumed this is because it is required. I think the blocking nature of wait_group_events forces all work items to be part of the copy.
2. Yes. The source (and destination) addresses need to be the same for all work items. You could subtract your local id to get the correct address, but I find that basing the address on the group id (get_group_id) solves this problem as well.
3. Yes. The last parameter is the number of elements, not the size in bytes.
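To illustrate point 2, here is a sketch of the group-id-based addressing (assuming GROUP_SIZE is the work-group size the kernel is launched with, as in your example):

// Hypothetical sketch: every WI in the group passes the same source pointer,
// computed from the group id rather than from global_id - local_id.
__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];

    event_t e = async_work_group_copy(xcopy,
                                      x + get_group_id(0) * GROUP_SIZE,
                                      GROUP_SIZE, 0);
    wait_group_events(1, &e);
    // ... work on xcopy ...
}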
a. No. If you rely on a barrier instead of the event, you will find that the barrier is hit almost immediately by the work items, and the data won't necessarily have been copied yet. This makes sense because some OpenCL hardware might not even use the compute units at all to do the actual copy operation.
b. I think that CPU OpenCL implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure whether this performs better is to benchmark your application with various settings.
I am new to NUMA-aware multithreaded programming. I am writing my code such that all the threads and their memory allocation are restricted to one node. At the beginning of the program, I make the following calls:
struct bitmask *bm = numa_parse_nodestring("0");
if (bm == 0) {
    exit(1);
}
numa_bind(bm);
My understanding is that a call to numa_bind in this way would bind all threads and all memory allocation to node 0.
Furthermore, when I start pthreads from this code, I bind them to specific CPUs using:
pthread_setaffinity_np
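For example (a minimal sketch; pinning to CPU 0 assumes that core belongs to node 0):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to CPU 0 (assumed to be on NUMA node 0). */
static void pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}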
However, when I look at /proc/<pid>/numa_maps, I can still see that certain libraries (e.g., libc) are bound to memory on node 1. How can I make sure that all the memory required by the process is bound to node 0?
Shared libraries like libc can't be bound to a memory bank specified by your process/application. Please see shared-library-numa
Code would tend to get cached in the local processor's L3 cache. Since it's read-only it's unlikely to generate any traffic once it's been loaded into cache. I wouldn't bother with it too much, unless you have profiling information showing it does pose a problem.
In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to let each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different than CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
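For example, a host-side sketch (the buffer names and sizes come from the A/B/C example above and are illustrative only; error handling is omitted):

/* Hypothetical sketch: expose array B (200 floats starting at element 100) as its
 * own cl_mem object, without a separate allocation or an offsets array.
 * Note: the origin must satisfy the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN requirement. */
cl_buffer_region region;
region.origin = 100 * sizeof(float);   /* byte offset of B inside big_array */
region.size   = 200 * sizeof(float);   /* byte size of B                    */

cl_int err;
cl_mem b_buffer = clCreateSubBuffer(big_array_buffer, CL_MEM_READ_WRITE,
                                    CL_BUFFER_CREATE_TYPE_REGION, &region, &err);
/* b_buffer can now be passed to clSetKernelArg like any other buffer. */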
I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels keep using the same buffers from one execution to the next, you only need to set up the kernel arguments once. A common mistake I see, which can hurt host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
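For example, a host-side sketch (the buffer array, kernel handle, queue, and global size are hypothetical): set all 60 arguments once before the first enqueue, then just keep enqueueing.

/* Hypothetical sketch: bind 60 pre-created buffers to the kernel's arguments once. */
cl_mem buffers[60];   /* assume these were created with clCreateBuffer at startup */

for (cl_uint i = 0; i < 60; ++i)
    clSetKernelArg(kernel, i, sizeof(cl_mem), &buffers[i]);

/* Enqueue as many times as needed; no further clSetKernelArg calls are required
 * unless a different buffer should be bound to an argument. */
size_t global_size = 1024; /* placeholder problem size */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);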