I am trying to translate some existing CUDA kernels to OpenCL. The problem is that I am bound to OpenCL 1.2, so I cannot use non-uniform work-group sizes. That means I should let enqueueNDRangeKernel decide the local work-group size (to avoid a work-group size that does not divide the global work size).
As mentioned in this presentation, I use a __local int * kernel argument as the shared-memory pointer, with its size defined in the host code via <Kernel>.setArg.
In some of these CUDA kernels, I allocate dynamic shared memory whose size depends on the thread-block (local work-group) size. When I try to translate these kernels to OpenCL, I don't know how to get the local work-group size chosen by enqueueNDRangeKernel when I pass NULL for the local argument to let it decide automatically.
To make it more clear, all I want is to translate this piece of CUDA code to OpenCL:
dim3 block_size, grid_size;
unsigned int smem_size;
block_size.x = 10;
block_size.y = 10;
block_size.z = 2;
// smem_size is dependent on the thread-block size.
smem_size = (block_size.x * block_size.y * block_size.z) * 5;
myCudaKernel<<< grid_size, block_size, smem_size >>>(...);
*By the way, I need a general solution that is usable for 3-D workgroups as well.
*I have an Nvidia graphics card.
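One possible direction, sketched here under the assumption that you pick the local size yourself instead of passing NULL: round the global size up to a multiple of the local size in each dimension (the kernel then has to skip the padded work-items), and derive the __local allocation from that same local size. The helper names below are hypothetical and only mirror the arithmetic of the CUDA snippet.

```c
#include <assert.h>
#include <stddef.h>

/* Round each global dimension up to a multiple of the chosen local size,
   since OpenCL 1.2 requires the global size to be divisible by the local
   size in every dimension. */
static void pad_global_size(const size_t wanted[3], const size_t local[3],
                            size_t padded[3])
{
    for (int d = 0; d < 3; ++d) {
        assert(local[d] > 0);
        padded[d] = (wanted[d] + local[d] - 1) / local[d] * local[d];
    }
}

/* Bytes of __local memory for one work-group, mirroring
   smem_size = (x * y * z) * 5 from the CUDA code. */
static size_t local_mem_bytes(const size_t local[3], size_t bytes_per_item)
{
    return local[0] * local[1] * local[2] * bytes_per_item;
}
```

The byte count would then be passed as the size of the __local argument (clSetKernelArg with a NULL value pointer), and the same local array handed to clEnqueueNDRangeKernel. Inside the kernel, get_local_size(d) reports the size in each dimension.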
My OpenCL program involves having about 7 billion work-items. In my C++ program, I would set this to my global_item_size:
size_t global_item_size = 7200000000;
If my program is compiled for 64-bit systems (x64), this global size is OK, since SIZE_MAX (the maximum value of size_t) is much larger than 7 billion. However, to ensure backwards compatibility I want to make sure that my program can also be compiled for 32-bit systems (x86). On 32-bit systems, SIZE_MAX is about 4 billion, less than my global size of 7 billion. If I tried to set the global size to 7 billion, it would overflow. What can I do in this case?
One of the solutions I was thinking about was to make a multi-dimensional global size and local size. However, this solution requires the kernel to calculate the original global size (because my kernel heavily depends on the global and local size), which would result in a performance loss.
The other solution I considered was to launch multiple kernels. I think this solution would be a little "sloppy" and synchronizing kernels also wouldn't be the best solution.
So my question basically is: How can I (if possible) make the global size larger than the maximum size of size_t? If this is not possible, what are some workarounds?
If you want to avoid batches, you can give each work-item more work by effectively wrapping the kernel code in a for loop. E.g.
for (int i = 0; i < WORK_ITEMS_PER_THREAD; ++i)
{
size_t id = WORK_ITEMS_PER_THREAD * get_global_id(0) + i;
...
}
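Try to use uint64_t global_item_size = 7200000000ull; to avoid 32-bit integer overflow. Note that clEnqueueNDRangeKernel takes its sizes as size_t, so on a 32-bit build this only keeps the host-side value from overflowing.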
If you are strictly limited to the maximum 32-bit number of work items, you could do the computation in several batches (exchange GPU buffers in between compute steps via PCIe transfer) or you could pack several data items into one GPU thread.
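A minimal sketch of the batching arithmetic, assuming a hypothetical per-launch limit batch_max that fits in a 32-bit size_t. The totals are kept in uint64_t on the host, and only the per-batch count is narrowed to size_t when enqueueing.

```c
#include <assert.h>
#include <stdint.h>

/* Number of launches needed to cover `total` work-items when each launch
   handles at most `batch_max` of them. */
static uint64_t num_batches(uint64_t total, uint64_t batch_max)
{
    assert(batch_max > 0);
    return (total + batch_max - 1) / batch_max;
}

/* Work-items in batch `index`; returns 0 past the last batch. */
static uint64_t batch_count(uint64_t total, uint64_t batch_max, uint64_t index)
{
    assert(batch_max > 0);
    uint64_t start = index * batch_max;
    if (start >= total)
        return 0;
    uint64_t remaining = total - start;
    return remaining < batch_max ? remaining : batch_max;
}
```

Since the global work offset is also a size_t, one option is to pass the 64-bit start of each batch (index * batch_max) to the kernel as a ulong argument and add get_global_id(0) to it there.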
I have created a buffer with the flags CL_MEM_READ_WRITE and CL_MEM_ALLOC_HOST_PTR and passed it to GPU kernels. The kernels process the given input and fill the buffer, and the CPU has to wait during this process. I modified this design by partitioning the buffer into three equal sections using sub-buffers. Now, once the GPU has filled one sub-buffer, the CPU can start processing it. This reduces the CPU wait to one sub-buffer instead of one full frame.
The problem I am facing is that the mapped (CPU-side) pointers of the sub-buffers and the buffer are strange. The map pointers of the first sub-buffer and the buffer are the same, which is fine. But the map pointer of the second sub-buffer is not equal to the map pointer of the buffer plus the offset of the second sub-buffer. I tried this on an integrated GPU (Intel HD Graphics 4000) and it worked fine, but when I run it on a dedicated graphics card (an NVIDIA card from Zotac) I hit this problem. Have you encountered such a scenario before? Can you provide some pointers on where to look to fix this problem?
typedef struct opencl_buffer {
cl_mem opencl_mem;
void *mapped_pointer;
int size;
}opencl_buffer;
// alloc gpu output buffers
opencl->opencl_mem = clCreateBuffer(
opencl->context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
3 * alloc_size, NULL, &status);
if (status != CL_SUCCESS)
goto fail;
// create output sub buffers
for (sub_idx = 0; sub_idx < 3; ++sub_idx) {
cl_buffer_region sf_region;
SubFrameInfo subframe;
sf_region.origin = alloc_size * sub_idx;
sf_region.size = alloc_size;
opencl->gpu_output_sub_buf[sub_idx].size = sf_region.size;
opencl->gpu_output_sub_buf[sub_idx].opencl_mem =
clCreateSubBuffer(opencl->opencl_mem,
CL_MEM_READ_WRITE,
CL_BUFFER_CREATE_TYPE_REGION,
&sf_region, &status);
if (status != CL_SUCCESS)
goto fail;
}
Now, when I map gpu_output_sub_buf[0].opencl_mem and gpu_output_sub_buf[1].opencl_mem, I expect the difference between the CPU-side pointers to be alloc_size (assuming char pointers). This is the case on Intel HD Graphics, but the NVIDIA platform gives a different result.
There is no specification-based reason a mapped sub-buffer should be at an address that is a known offset from the mapped main buffer (or mapped sub-buffer that aligns with same). Mapping only creates a range of host memory that you can use, and then you unmap to get it back on the device. It doesn't have to even be at the same address each time.
Of course OpenCL 2.0 SVM changes all this, but you didn't say you're using SVM, and NVIDIA doesn't support OpenCL 2.0 today anyway.
I am trying to run a kernel on the GPU, and I am looking for the best way to choose the global and local dimensions of the grid of threads. In my experiments, I found that 32 blocks of 1 thread each run 32 times faster than 1 block of 32 threads (on my NVIDIA GTX 980). Previously, I determined the kernel grid dimensions as follows:
size_t local_ws = 32;
size_t nKernels = num_seeding_points;
local_ws = local_ws > nKernels ? nKernels : local_ws;
size_t global_ws = (nKernels + local_ws - 1) / local_ws * local_ws;
but I realized that if the number of kernels is not large, this approach does not use my GPU fully, so I changed this part to:
size_t local_ws = 1;
size_t nKernels = num_seeding_points;
local_ws = local_ws > nKernels ? nKernels : local_ws;
size_t global_ws = (nKernels + local_ws - 1) / local_ws * local_ws;
My code now runs 20 times faster than before. I would like to know how I can compute the best possible values for running my kernel. Your experience will definitely help a lot.
In order to auto-tune global and local work sizes you should first query your kernel object and/or your device for the following info:
Useful kernel info (using the clGetKernelWorkGroupInfo() function):
CL_KERNEL_WORK_GROUP_SIZE: Maximum block size that can be used to execute a kernel on a specific device.
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE: The preferred multiple for the block size. This is a performance hint, and is probably the most important piece of information for optimizing your global and local work sizes.
If you didn't yet create a kernel object when you determine the global and local work sizes, you can instead query your device for similar info (using the clGetDeviceInfo() function):
CL_DEVICE_MAX_WORK_ITEM_SIZES: Maximum number of threads that can be specified in each dimension of the block.
CL_DEVICE_MAX_WORK_GROUP_SIZE: Maximum number of threads in a block.
Starting from the actual size of the work you want to process (i.e. num_seeding_points), and using the information provided by the aforementioned functions, you can optimize the global and local work sizes for whatever OpenCL device you're using. Most importantly, always try to make your local work size a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
Note that for small global sizes (lower than 128 or 256) you won't see much benefit with these optimizations.
I wrote a function for the cf4ocl library called ccl_kernel_suggest_worksizes() that suggests optimum global and local work sizes given the size of the work you want to process, a device, and optionally, a kernel object. Check its source code here, maybe it gives some additional hints.
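As a rough illustration of the idea (this is not the cf4ocl algorithm, just a simplified 1-D sketch with hypothetical names), the values queried above can be combined like this: take the largest multiple of the preferred multiple that fits in the maximum work-group size, shrink it for small workloads, then pad the global size up to a multiple of the result.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t global;
    size_t local;
} work_sizes;

/* preferred_multiple: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
   max_wg_size:        CL_KERNEL_WORK_GROUP_SIZE (or the device maximum) */
static work_sizes suggest_work_sizes(size_t work, size_t preferred_multiple,
                                     size_t max_wg_size)
{
    assert(work > 0 && preferred_multiple > 0 && max_wg_size > 0);
    work_sizes ws;

    /* Largest multiple of the preferred multiple within the limit. */
    size_t local = max_wg_size / preferred_multiple * preferred_multiple;
    if (local == 0)
        local = max_wg_size; /* preferred multiple exceeds the limit */

    /* Small workloads do not need a full-size work-group. */
    size_t needed = (work + preferred_multiple - 1)
                    / preferred_multiple * preferred_multiple;
    if (needed < local)
        local = needed;

    ws.local = local;
    ws.global = (work + local - 1) / local * local; /* pad up */
    return ws;
}
```

Because the global size is padded, the kernel needs a bounds check against the real work size. The result is only a starting point; benchmarking a few candidates is still worthwhile.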
I'm using the Tesla m1060 for GPGPU computation. It has the following specs:
# of Tesla GPUs 1
# of Streaming Processor Cores (XXX per processor) 240
Memory Interface (512-bit per GPU) 512-bit
When I use OpenCL, I can display the following board information:
available platform OpenCL 1.1 CUDA 6.5.14
device Tesla M1060 type:CL_DEVICE_TYPE_GPU
max compute units:30
max work item dimensions:3
max work item sizes (dim:0):512
max work item sizes (dim:1):512
max work item sizes (dim:2):64
global mem size(bytes):4294770688
local mem size:16383
How can I relate the GPU card information to the OpenCL memory information?
For example:
What does "Memory Interface" mean? Is it linked to a Work Item?
How can I relate the "240 cores" of the GPU to Work Groups/Items?
How can I map the work-groups to it (how many Work Groups should I use)?
Thanks
EDIT:
After the following answers, there is a thing that is still unclear to me:
The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value is 32 for the kernel I use.
However, my device has a CL_DEVICE_MAX_COMPUTE_UNITS value of 30.
In the OpenCL 1.1 API specification, it is written (p. 15):
Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit
It seems that either something is incoherent here, or that I didn't fully understand the difference between Work-Groups and Compute Units.
As previously stated, when I set the number of Work Groups to 32, the program fails with the following error:
Entry function uses too much shared data (0x4020 bytes, 0x4000 max).
The value 16 works.
Addendum
Here is my Kernel signature:
// enable double precision (not enabled by default)
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#error "IEEE-754 double precision not supported by OpenCL implementation."
#endif
#define BLOCK_SIZE 16 // --> this is what defines the WG size to me
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void mmult(__global double * A, __global double * B, __global double * C, const unsigned int q)
{
__local double A_sub[BLOCK_SIZE][BLOCK_SIZE];
__local double B_sub[BLOCK_SIZE][BLOCK_SIZE];
// stuff that does matrix multiplication with __local
}
In the host code part:
#define BLOCK_SIZE 16
...
const size_t local_work_size[2] = {BLOCK_SIZE, BLOCK_SIZE};
...
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
The memory interface doesn't mean anything to an OpenCL application. It is the number of bits the memory controller has for reading from and writing to device memory (the GDDR5 part in modern GPUs). The formula for maximum global memory speed is approximately: pipelineWidth * memoryClockSpeed, but since OpenCL is meant to be cross-platform, you won't really need to know this value unless you are trying to figure out an upper bound for memory performance. Knowing about the 512-bit interface is somewhat useful when you're dealing with memory coalescing. wiki: Coalescing (computer science)
The max work item sizes have to do with 1) how the hardware schedules computations, and 2) the amount of low-level memory on the device -- eg. private memory and local memory.
The 240 figure doesn't matter to OpenCL very much either. You can determine that each of the 30 compute units is made up of 8 streaming processor cores for this GPU architecture (because 240/30 = 8). If you query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, it will very likely be a multiple of 8 for this device. see: clGetKernelWorkGroupInfo
I have answered similar questions about work group sizing. see here, and here
Ultimately, you need to tune your application and kernels based on your own benchmarking results. I find it worth the time to write many tests with various work group sizes and eventually hard-code the optimal size.
Adding another answer to address your local memory issue.
Entry function uses too much shared data (0x4020 bytes, 0x4000 max)
Since you are allocating A_sub and B_sub, each taking 32*32*sizeof(double) = 8192 bytes, the two arrays alone use exactly 16 KB, so you run out of local memory. The device should be allowing you to allocate 16 KB, or 0x4000 bytes, of local memory without an issue.
0x4020 is 32 bytes, or 4 doubles, more than what your device allows. There are only two things I can think of that may cause the error: 1) there could be a bug in your device or drivers preventing you from allocating the full 16 KB, or 2) you are allocating local memory somewhere else in your kernel.
You will have to use a BLOCK_SIZE value less than 32 to work around this for now.
There's good news though. If you only want to hit a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE as a work group size, BLOCK_SIZE=16 already does this for you. (16*16 = 256 = 32*8). To better take advantage of local memory, try BLOCK_SIZE=24 (576 = 32*18), provided 576 work-items does not exceed the CL_KERNEL_WORK_GROUP_SIZE reported for your kernel.
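The arithmetic above can be checked with a small sketch. It assumes the kernel's only __local usage is the two BLOCK_SIZE x BLOCK_SIZE double tiles and that sizeof(double) is 8 on the host; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local memory used by A_sub and B_sub for a given BLOCK_SIZE. */
static size_t tile_local_bytes(size_t block_size)
{
    assert(block_size > 0);
    return 2 * block_size * block_size * sizeof(double);
}

/* Largest BLOCK_SIZE whose two tiles fit into `limit` bytes
   (returns 0 if even a 1x1 tile pair does not fit). */
static size_t max_block_size(size_t limit)
{
    size_t b = 0;
    while (tile_local_bytes(b + 1) <= limit)
        ++b;
    return b;
}
```

With the full 0x4000 bytes available, BLOCK_SIZE=32 fits exactly. If 0x20 bytes are taken by something else, as the error message suggests, the largest size that fits drops to 31.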
I'm trying to process an image using OpenCL 1.1 C++ on my AMD CPU.
The characteristics are:
using CPU: AMD Turion(tm) 64 X2 Mobile Technology TL-60
initCL:CL_DEVICE_IMAGE2D_MAX_WIDTH :8192
initCL:CL_DEVICE_IMAGE2D_MAX_HEIGHT :8192
initCL:timer resolution in ns:1
initCL:CL_DEVICE_GLOBAL_MEM_SIZE in bytes:1975189504
initCL:CL_DEVICE_GLOBAL_MEM_CACHE_SIZE in bytes:65536
initCL:CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE in bytes:65536
initCL:CL_DEVICE_LOCAL_MEM_SIZE in bytes:32768
initCL:CL_DEVICE_MAX_COMPUTE_UNITS:2
initCL:CL_DEVICE_MAX_WORK_GROUP_SIZE:1024
initCL:CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:3
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=0, size 1024
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=1, size 1024
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=2, size 1024
createCLKernel:mean_value
createCLKernel:CL_KERNEL_WORK_GROUP_SIZE:1024
createCLKernel:CL_KERNEL_LOCAL_MEM_SIZE used by the kernel in bytes:0
createCLKernel:CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE:1
The kernel is for the moment empty:
__kernel void mean_value(image2d_t p_image,
__global ulong4* p_meanValue)
{
}
The execution call is:
cl::NDRange l_globalOffset;
// The global worksize is the entire image
cl::NDRange l_globalWorkSize(l_width, l_height);
// Needs to be determined
cl::NDRange l_localWorkSize;//(2, 2);
// Computes the mean value
cl::Event l_profileEvent;
gQueue.enqueueNDRangeKernel(gKernelMeanValue, l_globalOffset, l_globalWorkSize,
l_localWorkSize, NULL, &l_profileEvent);
If l_width=558 and l_height=328, l_localWorkSize cannot be greater than (2, 2); otherwise, I get this error: "Invalid work group size".
Is it because I only have 2 cores?
Is there a rule to determine l_localWorkSize?
You can check 2 things using the clGetDeviceInfo function:
CL_DEVICE_MAX_WORK_GROUP_SIZE to check that 4 is not too big for your work-group, and
CL_DEVICE_MAX_WORK_ITEM_SIZES to check that the number of work-items per dimension is not too big.
And the fact that the group size may be limited to the number of cores makes sense: if you have communication or synchronization between work-items, you'll want them to be executed at the same time; otherwise the OpenCL driver would have to emulate this, which would be hard at best and probably impossible in the general case.
I read in the OpenCL specs that enqueueNDRangeKernel() succeeds if l_globalWorkSize is evenly divisible by l_localWorkSize. In my case, I can set it up to (2, 41).
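Following that divisibility rule, a valid local size per dimension can be found by searching for a divisor of the global size. This is a minimal sketch with a hypothetical helper name; the per-dimension cap is something you choose so that the product of all dimensions stays within CL_KERNEL_WORK_GROUP_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Largest divisor of n that does not exceed cap. Always terminates,
   because 1 divides every n. */
static size_t largest_divisor_up_to(size_t n, size_t cap)
{
    assert(n > 0 && cap > 0);
    size_t d = cap < n ? cap : n;
    while (n % d != 0)
        --d;
    return d;
}
```

For the 558 x 328 image with a cap of 32 per dimension this gives (31, 8), i.e. 248 work-items per group, which is below the 1024 limit reported above. An alternative is to pad the global size up to a convenient multiple and have the kernel ignore the out-of-range work-items.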