How can one find out the size of the biggest 2D array that can be declared in an OpenCL kernel?
For example,
int anArray[1000][1000]; inside a kernel works fine,
but when I rewrite it for a bigger scenario like
int anArray[5000][5000]; the run fails.
I would like to know exactly which factor or factors determine the maximum array size that can run successfully.
You can retrieve this kind of information using clGetDeviceInfo.
The following arguments should help you (depending on how you write your kernel):
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
Reference: http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html
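As a minimal host-side sketch (plain C; grabbing the first GPU device is an assumption, and error checking is omitted), the two values can be queried like this:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong cache_size = 0, local_size = 0;

    /* Grab the first platform and the first GPU device on it. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                    sizeof(cache_size), &cache_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_size), &local_size, NULL);

    printf("Global mem cache size: %llu bytes\n", (unsigned long long)cache_size);
    printf("Local mem size:        %llu bytes\n", (unsigned long long)local_size);
    return 0;
}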
Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate most of the functionality I need into OpenCL kernels. My question is how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method, for example, for point i accesses cells at [i-2:i+2], so if i=1, I'll run off the end of the array. So I add some extra points that prevent this, and then tell my algorithm to run only on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm less clear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(__global float *retVal, int nPts, int nGhostCells) {
    int gid = get_global_id(0);
    if (gid < nGhostCells || gid >= nPts - nGhostCells) {
        retVal[gid] = 0;  // Zero out the ghost cells.
        return;
    }
    // Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?
It looks sufficient.
The branch resolves the same way for nPts - 2*nGhostCells of the points, and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, with nPts sufficiently large relative to nGhostCells (say, 1024 vs 3) it should not be noticeably slower than a zero-branching version, apart from the latency of the "or" operation. Even that latency should be hidden behind the array-access latency, thanks to thread-level parallelism.
At those boundary points, at most one SIMD group of 16 or 32 threads loses some performance, and only for several clock cycles, because of the lock-step execution of SIMD-like architectures.
If you end up with chaotic branching, such as a data-driven code path, then you should split the work into different kernels (one per region), or sort the data before the kernel so that branch divergence between neighboring threads is minimized.
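One alternative that avoids the branch altogether: since OpenCL 1.1, clEnqueueNDRangeKernel accepts a global work offset, so the host can launch only the interior points. A minimal sketch, assuming queue, myMethod, nPts, and nGhostCells already exist on the host side:

/* Launch only the interior points [nGhostCells, nPts - nGhostCells):
   no work-item ever lands on a ghost cell, so the kernel needs no guard. */
size_t offset = nGhostCells;
size_t count  = nPts - 2 * nGhostCells;
clEnqueueNDRangeKernel(queue, myMethod, 1,
                       &offset,  /* global work offset */
                       &count,   /* global work size   */
                       NULL, 0, NULL, NULL);

The ghost cells can then be zeroed once, separately, instead of being re-checked on every launch.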
I have to solve a coding problem on the GPU using CUDA, but I always get a warning of Stack size for "name of the function" cannot be statically determined.
This is for a student project I'm working on; the project is written in C using the CUDA 9.0 libraries and runs on an NVIDIA Quadro K5000 GPU.
Every single thread must execute one function, and in this function there are two recursive calls of the same function. I want to use those two recursive calls because they keep the code clean and simple for me, but with only one recursive call the stack size problem disappears.
I get the warning every time I compile the code. CUDA supports recursive function calls, so I don't understand why two recursive calls cause a problem.
__device__ void bitonicMergeGPU(float *arr, int l, int indexT, int order)
{
    int k, p;
    if (l > 1)
    {
        p = l / 2;
        for (k = indexT; k < indexT + p; k++)
        {
            //Compare the values.
            compareAndExchange(arr, k, k + p, order);
        }
        //THIS IS WHERE I GET THE ERROR
        bitonicMergeGPU(arr, p, indexT, order);
        bitonicMergeGPU(arr, p, indexT + p, order);
    }
}
I simply want to know if it is possible to solve the problem of the recursive calls.
CUDA supports recursion. When you use recursion in CUDA, this warning is expected, and there is no NVIDIA-documented way you can make the warning go away (except by not using recursion).
If you use a function recursively, in most languages it will use more stack space as the recursion depth increases. This is true in CUDA as well. You need to account for this and provide enough stack space for the maximum recursion depth you anticipate. It is common practice to limit recursion depth, so as to prevent stack problems.
The compiler is unable to discover the maximum runtime recursion depth at compile time, and the warning is there to remind you of that.
Regardless of how much you increase the stack size, the warning will not go away. The warning is there to let you know that it is your responsibility to make sure your recursion design along with the stack space allocated will work correctly. The compiler does not verify in any way that the amount of increase in stack size is sufficient.
Recursion in CUDA must be used very carefully. Recursion consumes stack memory, which lives in per-thread local memory and is limited to 512 KB per thread. The default stack size is usually only 1 KB, which is easy to overflow, crashing the program. You can query the stack size per thread using cudaDeviceGetLimit() (the older name, cudaThreadGetLimit(), is deprecated).
Suggestions:
Redesign the algorithm/function using a non-recursive approach; the efficiency is usually very similar.
Increase the stack size per thread using cudaDeviceSetLimit(), without exceeding the limit, e.g. 512 KB, as sketched below.
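A minimal sketch of the second suggestion (CUDA runtime API; the 16 KB figure is an arbitrary example, pick a value that covers your worst-case recursion depth):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    size_t stackSize = 0;

    // Query the current per-thread stack size (typically 1 KB by default).
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("Default stack size per thread: %zu bytes\n", stackSize);

    // Reserve more stack per thread before launching the recursive kernel.
    cudaDeviceSetLimit(cudaLimitStackSize, 16 * 1024);

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("New stack size per thread: %zu bytes\n", stackSize);
    return 0;
}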
It seems that I can duplicate a kernel by getting the program object and kernel name from the kernel, and then creating a new one.
Is this the right way? It doesn't look so good, though.
EDIT: To answer the question properly: yes, it is the correct way; there is no other way in CL 2.0 or earlier versions.
The compilation (and therefore the slow step) of CL code creation happens at "program" creation (clBuildProgram + clLinkProgram).
When you create a kernel, you are just creating an object that packs:
An entry point to a function in the program code
Parameters for input + output to that function
Some memory to remember all of the above between calls
It is a simple task that should be almost free.
That is why it is preferable to have multiple kernels with different input parameters, rather than a single kernel whose parameters you change every loop iteration. A sketch of the duplication approach follows.
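A minimal sketch of that duplication (plain C; clone_kernel is a hypothetical helper name, and error checking is omitted):

#include <CL/cl.h>

/* Create an independent copy of an existing kernel. */
cl_kernel clone_kernel(cl_kernel kernel, cl_int *err)
{
    cl_program program;
    char name[256];

    /* Recover the program the kernel was created from... */
    clGetKernelInfo(kernel, CL_KERNEL_PROGRAM, sizeof(program), &program, NULL);
    /* ...and the kernel's function name. */
    clGetKernelInfo(kernel, CL_KERNEL_FUNCTION_NAME, sizeof(name), name, NULL);

    /* Cheap: the expensive compile already happened at program build time. */
    return clCreateKernel(program, name, err);
}

Note that OpenCL 2.1 later added clCloneKernel for exactly this purpose.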
I am trying to run a kernel on the GPU, and I am looking for the best way to adjust the global and local dimensions of the grid of threads. In my experiments, I found that 32 work-groups of 1 thread each run 32 times faster than 1 work-group of 32 threads (on my NVIDIA GTX 980). Before, I was using the following way to determine the kernel grid dimensions:
size_t local_ws = 32;
size_t nKernels = num_seeding_points;
local_ws = local_ws > nKernels ? nKernels : local_ws;
size_t global_ws = (nKernels + local_ws - 1) / local_ws * local_ws;
but I realized that if the number of kernels is not big, this approach will not use my GPU completely, so I changed this part to:
size_t local_ws = 1;
size_t nKernels = num_seeding_points;
local_ws = local_ws > nKernels ? nKernels : local_ws;
size_t global_ws = (nKernels + local_ws - 1) / local_ws * local_ws;
My code now runs 20 times faster than before. I want to know how I can compute the best possible values for running my kernel. Your experience will definitely help a lot.
In order to auto-tune global and local work sizes you should first query your kernel object and/or your device for the following info:
Useful kernel info (using the clGetKernelWorkGroupInfo() function):
CL_KERNEL_WORK_GROUP_SIZE: Maximum block size that can be used to execute a kernel on a specific device.
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE: Get the preferred multiple for the block size. This is a performance hint, and is probably the most important piece of information for optimizing your global and local work sizes.
If you didn't yet create a kernel object when you determine the global and local work sizes, you can instead query your device for similar info (using the clGetDeviceInfo() function):
CL_DEVICE_MAX_WORK_ITEM_SIZES: Maximum number of threads that can be specified in each dimension of the block.
CL_DEVICE_MAX_WORK_GROUP_SIZE: Maximum number of threads in a block.
Starting from the actual size of the work you want to process (i.e. num_seeding_points), and using the information provided by the aforementioned functions, you can optimize the global and local work sizes for whatever OpenCL device you're using. Most importantly, always try to make your local work size a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
Note that for small global sizes (lower than 128 or 256) you won't see much benefit with these optimizations.
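A minimal sketch of that strategy (plain C; suggest_worksizes is a hypothetical helper, and error checking is omitted): start from the preferred multiple, grow it while it fits, then round the global size up.

#include <CL/cl.h>

/* Suggest local/global sizes for a 1-D launch over `work_size` items. */
void suggest_worksizes(cl_kernel kernel, cl_device_id device, size_t work_size,
                       size_t *global_ws, size_t *local_ws)
{
    size_t preferred = 1, max_wg = 1;

    clGetKernelWorkGroupInfo(kernel, device,
        CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(preferred), &preferred, NULL);
    clGetKernelWorkGroupInfo(kernel, device,
        CL_KERNEL_WORK_GROUP_SIZE,
        sizeof(max_wg), &max_wg, NULL);

    /* Start at the preferred multiple and grow while it still fits. */
    size_t l = preferred;
    while (l * 2 <= max_wg && l * 2 <= work_size)
        l *= 2;

    *local_ws = l;
    /* Round the global size up to a multiple of the local size. */
    *global_ws = (work_size + l - 1) / l * l;
}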
I wrote a function for the cf4ocl library called ccl_kernel_suggest_worksizes() that suggests optimum global and local work sizes given the size of the work you want to process, a device, and optionally, a kernel object. Check its source code here, maybe it gives some additional hints.
I have noticed a number of kernel sources that look like this (found randomly by Googling):
__kernel void fill(__global float* array, unsigned int arrayLength, float val)
{
    if (get_global_id(0) < arrayLength)
    {
        array[get_global_id(0)] = val;
    }
}
My question is whether that if-statement is actually necessary (assuming that "arrayLength" in this example equals the global work size).
In some of the more "professional" kernels I have seen, it is not present. It also seems to me that the hardware would do well not to assign work-items to nonsense coordinates.
However, I also know that processors work in groups. Hence, I can imagine that some processors of a group must do nothing (for example, if you have 1 group of size 16 and a work size of 41, the group would process the first 16 work-items, then the next 16, then the next 9, with 7 processors doing nothing; do they get dummy kernels?).
I checked the spec., and the only relevant mention of "get_global_id" is the same as the online documentation, which reads:
The global work-item ID specifies the work-item ID based on the number of global work-items specified to execute the kernel.
. . . based how?
So which is it? Is it safe to omit the check iff the array's size is a multiple of the work-group size?
You have the right answer already, I think. If the global size of your kernel execution is the same as the array length, then this if statement is useless.
In general, that type of check is only needed for cases where you've partitioned your data in such a way that you know you might execute extra work items relative to your array size. In my experience, you can almost always avoid such cases.
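For completeness, a small host-side illustration of the case where the check is needed: if the global size is rounded up to a multiple of the work-group size, the last group contains out-of-range work-items (the numbers below are arbitrary).

#include <stdio.h>

int main(void)
{
    size_t arrayLength = 1000, local_ws = 64;

    /* Round the global size up to a multiple of the work-group size. */
    size_t global_ws = (arrayLength + local_ws - 1) / local_ws * local_ws;
    printf("global_ws = %zu\n", global_ws); /* 1024: 24 extra work-items */

    /* Those 24 extra work-items are why the kernel needs the if-guard.
       If global_ws equals arrayLength exactly, the guard can be omitted. */
    return 0;
}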