My OpenCL code is slower on GPU than on my CPU

I am starting with OpenCL for some computer vision tasks, using the Python pyopencl module. My code runs faster on an Intel CPU than on my Nvidia GTX 750 Ti.
My example code multiplies a 2000x4000 array element-wise. It runs in 2 ms on my CPU and in 8 ms on my GPU. As you can see in the code, the timed section is just the kernel call.
Why is it so much slower on my GPU?
import time
import numpy as np
import pyopencl as cl
devices = cl.get_platforms()[1].get_devices()
ctx = cl.Context(devices)
queue = cl.CommandQueue(ctx)
kernel = cl.Program(
    ctx, """
kernel void mult(
    global float *a,
    global float *b,
    global float *out
)
{
    int row = get_global_id(0);
    int col = get_global_id(1);
    int cols = get_global_size(1);
    int index = col + row * cols;
    out[index] = a[index] * b[index];
}
""").build()
a = np.random.rand(2000, 4000).astype(np.float32)
a_b = cl.Buffer(ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=a.flatten())
rows, cols = a.shape
out_b = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, size=rows*cols*np.dtype(np.float32).itemsize)
start = time.time() * 1000
kernel.mult(queue, a.shape, None, a_b, a_b, out_b)
end = time.time() * 1000
print(f"{end-start}ms")
out = np.empty(a.shape, dtype=np.float32)
cl.enqueue_copy(queue, out, out_b)
# make sure result is correct
np.testing.assert_array_equal(a*a, out)
Here is the output of clinfo:
> clinfo
Number of platforms 2
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 9.1.84
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
Platform Extensions function suffix NV
Platform Name Intel(R) CPU Runtime for OpenCL(TM) Applications
Platform Vendor Intel(R) Corporation
Platform Version OpenCL 2.1 LINUX
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint
Platform Host timer resolution 1ns
Platform Extensions function suffix INTEL
Platform Name NVIDIA CUDA
Number of devices 1
Device Name GeForce GTX 750 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 390.116
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 01:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 5
Max clock frequency 1084MHz
Compute Capability (NV) 5.0
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 2096300032 (1.952GiB)
Error Correction support No
Max memory allocation 524075008 (499.8MiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 81920 (80KiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 4096x4096x4096 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 1
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
Platform Name Intel(R) CPU Runtime for OpenCL(TM) Applications
Number of devices 1
Device Name Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 2.1 (Build 0)
Driver Version 18.1.0.0920
Device OpenCL C Version OpenCL C 2.0
Device Type CPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 4
Max clock frequency 3100MHz
Device Partition (core)
Max number of sub-devices 4
Supported partition types by counts, equally, by names (Intel)
Max work item dimensions 3
Max work item sizes 8192x8192x8192
Max work group size 8192
Preferred work group size multiple 128
Max sub-groups per work group 1
Preferred / native vector sizes
char 1 / 16
short 1 / 8
int 1 / 4
long 1 / 2
half 0 / 0 (n/a)
float 1 / 8
double 1 / 4 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 8308092928 (7.738GiB)
Error Correction support No
Max memory allocation 2077023232 (1.934GiB)
Unified memory for Host and Device Yes
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing Yes
Fine-grained system sharing Yes
Atomics Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Preferred alignment for atomics
SVM 64 bytes
Global 64 bytes
Local 0 bytes
Max size for global variable 65536 (64KiB)
Preferred total size of global vars 65536 (64KiB)
Global Memory cache type Read/Write
Global Memory cache size 262144 (256KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 480
Max size for 1D images from buffer 129813952 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 64 bytes
Pitch alignment for 2D image buffers 64 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 480
Max number of write image args 480
Max number of read/write image args 480
Max number of pipe args 16
Max active pipe reservations 65535
Max pipe packet size 1024
Local memory type Global
Local memory size 32768 (32KiB)
Max number of constant args 480
Max constant buffer size 131072 (128KiB)
Max size of kernel argument 3840 (3.75KiB)
Queue properties (on host)
Out-of-order execution Yes
Profiling Yes
Local thread execution (Intel) Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 4294967295 (4GiB)
Max size 4294967295 (4GiB)
Max queues on device 4294967295
Max events on device 4294967295
Prefer user sync for interop No
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
Sub-group independent forward progress No
IL version SPIR-V_1.0
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Extensions cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform

I don't know much about pyOpenCL, but I do know OpenCL a bit...
The GTX 750 Ti has 5 compute units and 640 CUDA cores, which suggests a local work size of 640/5 = 128 as a starting point. Sizes that are not a multiple of the warp size (32 on this device) waste resources. When you pass 'None' as the local size, the OpenCL implementation picks a size for you, and this is one key aspect of performance. I strongly suggest you check which values are actually used.
Generally speaking, reading from and writing back to global memory directly is 'slow'. Each compute unit has a certain amount of local memory that can (and should) be leveraged. I'm not sure this is suitable for a kernel as simple as yours, but I'd try storing the results in local memory before transferring them back to main memory. You can also use larger data types (for example float4) to improve throughput between local and global memory.
Finally, it wouldn't be surprising if transferring data to and from the GPU took more time than the actual computation.
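A rough way to judge how much room there is for kernel tuning is to estimate the memory-bound lower limit for this kernel. The sketch below is a back-of-envelope calculation only; the bandwidth constant is an assumed spec-sheet value for this card, not something measured or stated in the question.

```python
# Back-of-envelope lower bound for the element-wise kernel in the question.
# GPU_MEM_BANDWIDTH is an assumed peak figure, not a measurement.
ROWS, COLS = 2000, 4000          # array shape from the question
BYTES_PER_FLOAT = 4              # np.float32
GPU_MEM_BANDWIDTH = 86.4e9       # bytes/s, assumed peak for a GTX 750 Ti


def min_kernel_time_ms(rows, cols, arrays_touched, bandwidth):
    """Time to stream `arrays_touched` full arrays through global memory once."""
    total_bytes = rows * cols * BYTES_PER_FLOAT * arrays_touched
    return total_bytes / bandwidth * 1e3


# out[i] = a[i] * b[i] reads two arrays and writes one.
estimate = min_kernel_time_ms(ROWS, COLS, 3, GPU_MEM_BANDWIDTH)
print(f"array size: {ROWS * COLS * BYTES_PER_FLOAT / 1e6:.0f} MB")
print(f"memory-bound lower bound: {estimate:.2f} ms")
```

Under these assumptions the kernel cannot finish much faster than about a millisecond, because every element is read once and never reused. That is why local memory may not help a kernel this simple.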

Memory transfers between CPU and GPU over PCIe usually have a latency on the order of 10 µs, regardless of how much data you transfer. Large transfers are therefore more efficient, and for small data sets the transfer latency alone can exceed the execution time on a CPU.
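The latency argument can be sketched with a simple model: a fixed latency plus the payload size divided by bandwidth. Both constants below are assumptions for illustration; the bandwidth in particular is a hypothetical effective PCIe rate, not a value from the question.

```python
# Simple transfer-time model: fixed latency plus size divided by bandwidth.
# Both constants are assumptions for illustration, not measured values.
PCIE_LATENCY_S = 10e-6       # ~10 microseconds per transfer, as stated above
PCIE_BANDWIDTH = 6e9         # bytes/s, assumed effective PCIe rate


def transfer_time_s(num_bytes):
    """Estimated time for one host<->device copy of num_bytes."""
    return PCIE_LATENCY_S + num_bytes / PCIE_BANDWIDTH


def latency_share(num_bytes):
    """Fraction of the transfer time that is pure latency."""
    return PCIE_LATENCY_S / transfer_time_s(num_bytes)


for size in (4 * 1024, 1024 * 1024, 32_000_000):
    t = transfer_time_s(size)
    print(f"{size:>10} bytes: {t * 1e6:9.1f} us, "
          f"latency share {latency_share(size):.1%}")
```

For a few kilobytes the fixed latency dominates, while for the 32 MB array from the question it is negligible and the copy time is set almost entirely by bandwidth.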
A matrix multiplication kernel can be optimized to run approximately 10x faster. The keyword here is cache tiling with local memory: load chunks of data from global memory into local memory in a single coalesced transfer, then access one element at a time from local memory. This reduces global memory access latency a lot and substantially speeds up the kernel. Note that the kernel in the question is an element-wise product that reads each input value only once, so it has little to gain from tiling.
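The tiling idea can be sketched in plain Python for a true matrix product. This is only an illustration of the access pattern, not an OpenCL kernel: TILE stands in for the work-group edge length, and the small sub-lists stand in for local memory. All names are hypothetical.

```python
# Blocked (tiled) matrix multiplication in plain Python, for illustration only.
# TILE plays the role of the work-group edge length; a_sub and b_sub stand in
# for local memory. A real OpenCL kernel would load the tiles in parallel.
TILE = 2


def matmul_naive(a, b, n):
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_tiled(a, b, n, tile=TILE):
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # "Load" one tile of each input once, then reuse its elements.
                a_sub = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_sub = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                for i, a_row in enumerate(a_sub):
                    for j in range(len(b_sub[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[k] * b_sub[k][j] for k in range(len(b_sub)))
    return c


n = 4
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float((i + j) % 3) for j in range(n)] for i in range(n)]
assert matmul_tiled(a, b, n) == matmul_naive(a, b, n)
```

Each tile element is fetched once and then used several times, which is where the saving comes from in a real kernel. An element-wise product has no such reuse.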

Related

Is there a way to load a vector equal by size to global memory size of GPU in OpenCl?

My GPU has 12 GB global memory (CL_DEVICE_GLOBAL_MEM_SIZE), but only 3 GB of memory which it can allocate (CL_DEVICE_MAX_MEM_ALLOC_SIZE). When I try to load a vector of size exceeding 3 GB, the program crashes. The question is, if it is possible to load a bigger vector into GPU memory to utilize it completely, how to do it?

By default, CL_DEVICE_MAX_MEM_ALLOC_SIZE reports 1/4 of CL_DEVICE_GLOBAL_MEM_SIZE, meaning you would only be allowed to allocate four 3 GB buffers on a 12 GB GPU.
However, Nvidia GPUs let you allocate their full memory capacity in a single buffer, even though they also report the 1/4 limit.
Some AMD GPUs have the limit set higher; for example, the Radeon VII lets you use 14 of its 16 GB for a single buffer.
The only devices I have ever seen that really enforce the 1/4 limit are the Intel HD 4600 and 5500, so older Intel integrated GPUs. If you go above 1/4 in buffer size there, the cl::Buffer constructor throws error -61 (CL_INVALID_BUFFER_SIZE).
In case you are stuck with the 1/4 memory limit on your device, split your large 12 GB buffer into 4 smaller 3 GB buffers (for example, one buffer each for the x, y, z and w components of the vector). If you use Windows, note that you might only be able to use ~11.5 GB in total, as some VRAM is reserved for the operating system.
I think your issue might not be CL_DEVICE_MAX_MEM_ALLOC_SIZE though, but a 32-bit integer overflow of the array size above 4 GB. Use the uint64_t data type for the array size instead.
You might also be interested in this lightweight OpenCL-Wrapper for C++. There, the length of vectors is always a 64-bit integer, and it automatically keeps track of how much memory you use in total on each device, telling you if you allocate too much. It also catches that -61 error on Intel iGPUs and tells you the maximum allowed buffer size.
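The splitting and the overflow check can be sketched as plain arithmetic. The snippet below only plans the chunk sizes; it does not call OpenCL, and the helper names are hypothetical.

```python
# Plan how to split one large logical buffer into chunks that each fit under
# the device's maximum single allocation. Sizes are the ones from the question.
GIB = 1024 ** 3
MAX_ALLOC = 3 * GIB        # CL_DEVICE_MAX_MEM_ALLOC_SIZE from the question
TOTAL = 12 * GIB           # CL_DEVICE_GLOBAL_MEM_SIZE from the question


def plan_chunks(total_bytes, max_alloc_bytes):
    """Return (offset, size) pairs covering total_bytes, each <= max_alloc_bytes."""
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(max_alloc_bytes, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


def fits_in_uint32(value):
    """True if value can be stored in a 32-bit unsigned integer."""
    return 0 <= value <= 0xFFFFFFFF


chunks = plan_chunks(TOTAL, MAX_ALLOC)
print(len(chunks), "buffers of", chunks[0][1] // GIB, "GiB each")
# A byte count above 4 GiB no longer fits a 32-bit unsigned integer.
print(fits_in_uint32(3 * GIB), fits_in_uint32(5 * GIB))
```

A kernel working on such a split vector would then take one buffer argument per chunk, with the host deciding which chunk an element index falls into.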

OpenCL Compute units and GPU Processing units mismatch

I'm a bit confused about compute units. I have an nvidia gtx 1650Ti graphics card. When I asked for max_compute_units, it returns 16 units, and max_work_group_size is 1024.
But when I executed the kernel:
int i = get_global_id (0);
result [i] = get_local_id (0);
I get a repeating local ID range from 0 to 255. How does this relate to the max_compute_units returned by the graphics card? Is this an error in the max_compute_units value, and the GPU actually has more compute units than it indicates? Or does OpenCL's get_local_id have its own distribution logic that is not tied to the hardware? Thx!

OpenCL compute units refer to streaming multiprocessors (SMs) on Nvidia GPUs or compute units (CUs) on AMD GPUs. Each SM contains 128 CUDA cores (Maxwell and consumer Pascal) or 64 CUDA cores (Volta/Turing); older architectures use other counts. For AMD, each CU contains 64 stream processors. This refers to the hardware. The more SMs/CUs, the faster the GPU (within the same microarchitecture).
The work group size / local ID refer to how you group threads in software into so-called thread blocks. Thread blocks are useful for matrix multiplications, for example, because within a thread block, threads can communicate via shared memory. Thread blocks can have different sizes (an optimization parameter of sorts): 32, 64, 128, 256, 512 or 1024 (max_work_group_size). Depending on your GPU, some intermediate values might also work. On the hardware (at least for Nvidia), the thread blocks are executed as so-called warps (groups of 32 threads) on the SMs. For Turing, one SM can compute 2 warps simultaneously. If you choose a thread block size of 16, each warp only computes 16 threads and the other 16 are idle, so you only get half the performance.
In your example with the local ID (the index within the thread block) between 0 and 255, your thread block size is 256. You define the thread block size in the kernel call as the "local range". max_work_group_size does not correlate with max_compute_units in any way; both are hardware / driver limitations.
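The relation between global ID, local ID and warps can be sketched as plain integer arithmetic. This is a minimal model assuming a 1D range with zero global offset; the function names are made up for the example.

```python
import math

WARP_SIZE = 32  # Nvidia warp width, as described above


def local_id(global_id, local_size):
    """1D local ID when the global offset is zero."""
    return global_id % local_size


def group_id(global_id, local_size):
    """Index of the work-group (thread block) a work-item belongs to."""
    return global_id // local_size


def warp_utilization(local_size, warp_size=WARP_SIZE):
    """Fraction of warp lanes doing useful work for a given work-group size."""
    warps = math.ceil(local_size / warp_size)
    return local_size / (warps * warp_size)


# With a local size of 256, local IDs repeat 0..255 regardless of how many
# compute units the device reports.
print([local_id(g, 256) for g in (0, 255, 256, 511, 512)])
print(warp_utilization(16), warp_utilization(256))
```

The repeating 0 to 255 pattern therefore comes from the chosen local size alone, while the number of compute units only determines how many of these work-groups can run at the same time.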

What is the meaning of having a certain number of OpenCL work-items into a CPU?

I'm trying to understand why I can have more work-items on a CPU than on a GPU in one dimension.
PLATFORM 0 DEVICE 0
== CPU ==
DEVICE_VENDOR: Intel
DEVICE NAME: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
MAXIMUM NUMBER OF PARALLAEL COMPUTE UNITS: 4
MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (1024 1 1 )
MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 1024
PLATFORM 0 DEVICE 1
== GPU ==
DEVICE_VENDOR: Intel Inc.
DEVICE NAME: Intel(R) Iris(TM) Graphics 6100
MAXIMUM NUMBER OF PARALLAEL COMPUTE UNITS: 48
MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (256 256 256 )
MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 256
The above is the result of my test code to print the information of the actual hardware that the OpenCL framework can use.
I really do not understand the value of 1024 for the maximum number of work-items in the CPU section. What is the real meaning of having that many work-items?

CPUs are more general purpose than GPUs. Their OpenCL implementation effectively runs the work-items of a work-group serially (interleaved at the instruction level), since each compute unit is a physical core that is issued a work-group as a whole. Because of this, CPUs rely on instructions in flight. A CPU core has 100-200 instructions in flight, and if those instructions are AVX/SSE, you can expect 800-1600 scalar data operations in flight. This is well within range of 1024 work-items per work-group, provided the OpenCL implementation vectorizes under the hood.
GPUs use massive thread-level parallelism to fill their pipelines and keep more instructions in flight, so they don't need as much ILP as CPUs. They work fine with just 256 threads per work-group, and these threads run in parallel. Thread-level parallelism fills pipelines more easily than instruction-level parallelism. Intel has 7-way, Nvidia 16-way and AMD 40-way thread-level parallelism per pipeline. Each subslice of the Iris 6100 has 8 EUs (64 pipelines). 64 pipelines x 7 means it can have multiple work-groups in flight too, just like Nvidia and AMD GPUs. Having more threads/work-items per work-group probably doesn't yield more performance for that iGPU, and having more than 1024 threads per work-group probably doesn't yield more performance for that CPU.
The CPU also has 256 kB of L2 cache per compute unit, which may be another factor limiting a work-group to 1024 work-items, since the state of each work-item has to be saved efficiently.
As an image processing example:
On the CPU, you can divide and conquer an image in 32x32 patches (1024 work-items). This requires recomputing the 2D indices in the kernel, since the CPU only supports a 1D work-group shape.
On the iGPU, you can divide and conquer an image in 16x16 patches (256 work-items).
With the limits reported in the question, these work-group shapes are possible:
256x1 on iGPU
1024x1 on CPU
8x8x4 on iGPU
1x256x1 on iGPU
1x1x256 on iGPU
but not 1x1024x1 on CPU
These are numbers of work-items per work-group, and generally they are a fraction of the maximum number of in-flight work-items allowed per compute unit.
For this image processing example, up to several thousand pixels can be in flight per compute unit, or up to 50k-100k pixels for a high-end GPU.
Having only 1 in the other dimensions on the CPU probably (imo) comes from the CPU's OpenCL implementation being an emulation. It has no hardware to accelerate the computation of thread-ID values for the other dimensions. GPUs probably have this kind of hardware support, so they can offer more dimensions without losing performance, whereas a 1D kernel on a CPU has to compute some modulos and divisions to emulate the 2nd and 3rd dimensions, which is a bottleneck for simple kernels.
If CPUs emulated the 2nd and 3rd dimensions too, there would be modulos and divisions going on in the background, with further slowdowns inside the kernel if developers unknowingly flatten a 3D kernel into 1D indices. GPUs may not even be computing modulos under the hood; the values could come from lookup tables as fast as registers, or from other fast-access constants.
This is just a per-work-group limitation. You can launch many work-groups per kernel launch, so it shouldn't affect the maximum image size you can process on different devices such as CPU, GPU or iGPU. Each image is processed by multiple work-groups, with tiles anywhere from 1x1x1 to 32x32x1 or some other size.
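The index recomputation mentioned for the CPU case can be sketched as follows. It shows, in plain Python rather than OpenCL C, the division and modulo a 1D work-group needs to address a 32x32 patch; the helper names are hypothetical.

```python
# Emulating 2D work-item IDs inside a 1D work-group, as a kernel would have to
# do on a device that only allows shapes like (1024, 1, 1).
PATCH_W, PATCH_H = 32, 32      # 32x32 patch = 1024 work-items


def to_2d(flat_id, width=PATCH_W):
    """Recover (x, y) from a 1D local ID using one modulo and one division."""
    return flat_id % width, flat_id // width


def to_1d(x, y, width=PATCH_W):
    """Inverse mapping: row-major flattening of (x, y)."""
    return y * width + x


# Every 1D ID in the patch maps to a unique pixel and back again.
coords = [to_2d(i) for i in range(PATCH_W * PATCH_H)]
assert len(set(coords)) == PATCH_W * PATCH_H
assert all(to_1d(x, y) == i for i, (x, y) in enumerate(coords))
print(to_2d(0), to_2d(33), to_2d(1023))
```

On a device that accepts a 16x16 work-group directly, the kernel can read both coordinates from get_local_id(0) and get_local_id(1) and skip this arithmetic.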

OpenCL and Tesla M1060

I'm using the Tesla m1060 for GPGPU computation. It has the following specs:
# of Tesla GPUs 1
# of Streaming Processor Cores (XXX per processor) 240
Memory Interface (512-bit per GPU) 512-bit
When I use OpenCL, I can display the following board information:
available platform OpenCL 1.1 CUDA 6.5.14
device Tesla M1060 type:CL_DEVICE_TYPE_GPU
max compute units:30
max work item dimensions:3
max work item sizes (dim:0):512
max work item sizes (dim:1):512
max work item sizes (dim:2):64
global mem size(bytes):4294770688 local mem size:16383
How can I relate the GPU card information to the OpenCL memory information?
For example:
What does "Memory Interface" mean? Is it linked to a work-item?
How can I relate the "240 cores" of the GPU to work-groups/work-items?
How can I map the work-groups to it (how many work-groups should I use)?
Thanks
EDIT:
After the following answers, there is a thing that is still unclear to me:
The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value is 32 for the kernel I use.
However, my device has a CL_DEVICE_MAX_COMPUTE_UNITS value of 30.
In the OpenCL 1.1 Api, it is written (p. 15):
Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit
It seems that either something is incoherent here, or that I didn't fully understand the difference between Work-Groups and Compute Units.
As previously stated, when I set the number of Work Groups to 32, the program fails with the following error:
Entry function uses too much shared data (0x4020 bytes, 0x4000 max).
The value 16 works.
Addendum
Here is my Kernel signature:
// enable double precision (not enabled by default)
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#error "IEEE-754 double precision not supported by OpenCL implementation."
#endif
#define BLOCK_SIZE 16 // --> this is what defines the WG size to me
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void mmult(__global double * A, __global double * B, __global double * C, const unsigned int q)
{
__local double A_sub[BLOCK_SIZE][BLOCK_SIZE];
__local double B_sub[BLOCK_SIZE][BLOCK_SIZE];
// stuff that does matrix multiplication with __local
}
In the host code part:
#define BLOCK_SIZE 16
...
const size_t local_work_size[2] = {BLOCK_SIZE, BLOCK_SIZE};
...
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);

The memory interface doesn't mean anything to an OpenCL application. It is the number of bits the memory controller has for reading/writing to the memory (the GDDR5 part in modern GPUs). The formula for maximum global memory speed is approximately pipelineWidth * memoryClockSpeed, but since OpenCL is meant to be cross-platform, you won't really need to know this value unless you are trying to figure out an upper bound for memory performance. Knowing about the 512-bit interface is somewhat useful when you're dealing with memory coalescing. wiki: Coalescing (computer science)
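As a worked instance of that formula, the sketch below multiplies the interface width by an effective memory clock. Only the 512-bit width comes from the question; the 800 MHz clock and the two transfers per clock are assumed figures for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits, memory_clock_hz, transfers_per_clock):
    """Upper bound on global memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * memory_clock_hz * transfers_per_clock / 1e9


# 512-bit interface from the question; the 800 MHz double-data-rate memory
# clock is an assumption, not a value reported by OpenCL.
bandwidth = peak_bandwidth_gb_s(512, 800e6, 2)
print(f"{bandwidth:.1f} GB/s")
```

This is only a ceiling. Real kernels reach a fraction of it, and how large that fraction is depends mostly on how well the accesses coalesce.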
The max work item sizes have to do with 1) how the hardware schedules computations, and 2) the amount of low-level memory on the device -- eg. private memory and local memory.
The 240 figure doesn't matter to OpenCL very much either. You can determine that each of the 30 compute units is made up of 8 streaming processor cores for this GPU architecture (because 240/30 = 8). If you query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, it will very likely be a multiple of 8 for this device. see: clGetKernelWorkGroupInfo
I have answered similar questions about work group sizing. see here, and here
Ultimately, you need to tune your application and kernels based on your own bench-marking results. I find it worth the time to write many tests with various work group sizes and eventually hard-code the optimal size.

Adding another answer to address your local memory issue.
Entry function uses too much shared data (0x4020 bytes, 0x4000 max)
Since you are allocating A_sub and B_sub, each taking 32*32*sizeof(double) = 8 KB, you run out of local memory. The device should allow you to allocate 16 KB, or 0x4000 bytes, of local memory without an issue.
0x4020 is 32 bytes, or 4 doubles, more than what your device allows. There are only two things I can think of that may cause the error: 1) there could be a bug in your device or drivers preventing you from allocating the full 16 KB, or 2) you are allocating local memory somewhere else in your kernel.
You will have to use a BLOCK_SIZE value less than 32 to work around this for now.
There's good news though. If you only want the work group size to be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, BLOCK_SIZE=16 already does this for you (16*16 = 256 = 32*8). To take better advantage of local memory, you could try BLOCK_SIZE=24 (576 = 32*18), provided the device accepts a work group of 576 items; the 512 limit it reports per dimension suggests it may not.
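The local memory budget can be checked with a few lines of arithmetic. This sketch assumes that the only __local data are the two BLOCK_SIZE x BLOCK_SIZE tiles of doubles from the kernel signature in the question.

```python
SIZEOF_DOUBLE = 8
LOCAL_MEM_LIMIT = 0x4000     # 16 KiB limit from the error message


def tile_bytes(block_size, arrays=2):
    """Local memory used by A_sub and B_sub for a given BLOCK_SIZE."""
    return arrays * block_size * block_size * SIZEOF_DOUBLE


for block in (16, 24, 32):
    used = tile_bytes(block)
    print(f"BLOCK_SIZE={block}: {used} bytes ({hex(used)}), "
          f"work-group of {block * block} items")

# The error reports 0x4020 bytes: the two 32x32 tiles fill the limit exactly,
# and another 0x20 bytes of shared data push the total over it.
extra = 0x4020 - tile_bytes(32)
print(hex(extra))
```

With BLOCK_SIZE=32 the tiles alone use the entire 16 KiB, so any additional shared data, however small, is enough to trigger the error.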

Work-items, Work-groups and Command Queues organization and memory limit in OpenCL

Okay, I have already been through most of the ATI and Nvidia guides to OpenCL. There are some things that I just want to be sure of, and some that need clarification. Nothing in the documentation gives a clear-cut answer.
I have a Radeon 4650, and on querying my device I got:
CL_DEVICE_MAX_COMPUTE_UNITS: 8
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 128 / 128 / 128
CL_DEVICE_MAX_WORK_GROUP_SIZE: 128
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 256 MByte
OK, first: my card has 1 GB of memory, so why am I allowed only 256 MB?
Second, I don't understand the work-item dimension part. Does that mean I can have up to 128*3 or 128^3 work-items?
When I calculated this before I ran the query, I got 8 cores * 16 stream processors * 4 work-items = 512. Why is this wrong?
I also got the same 3-dimension work-item limits for my Intel Core 2 Duo CPU. Do the same calculations apply?
As for the command queues: when I tried accessing my Core 2 Duo CPU as a device using OpenCL, everything got processed on one core only. I tried creating multiple queues and queueing several entries, but it still got processed on one core only. I used a global_work_size of 128*128*128*8 for a simple write program where each work-item writes its own global ID to the buffer, and I got only zeros.
And what about Nvidia cards? On an Nvidia 9500 GT with 32 CUDA cores, are the work-items calculated similarly?
Thanks a lot, I've really been all over the place trying to find answers.

"ok first, my card has 1GB memory, why am i allowed to 256MB only?"
This is an ATI driver bug/limitation AFAIK. I'll check on my 5850 if I can repro.
http://devforums.amd.com/devforum/messageview.cfm?catid=390&threadid=124142&messid=1069111&parentid=0&FTVAR_FORUMVIEWTMP=Branch
"2nd i don't understand the Work-item dimension part, does that mean i can have up to 128*3 or 128^3 work-items?"
No. That means you can have at most 128 in any one dimension, since CL_DEVICE_MAX_WORK_ITEM_SIZES is 128 / 128 / 128. And since CL_DEVICE_MAX_WORK_GROUP_SIZE is 128, you can have, e.g., work_group_size(128, 1, 1), work_group_size(1, 128, 1), work_group_size(64, 1, 2) or work_group_size(8, 4, 4); as long as the product of all dimensions is <= 128 it will be fine.
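The two limits can be written as a small check. This is a sketch using the values reported for the device in the question; is_valid_work_group is a made-up helper, not an OpenCL call.

```python
from math import prod

MAX_WORK_ITEM_SIZES = (128, 128, 128)   # CL_DEVICE_MAX_WORK_ITEM_SIZES
MAX_WORK_GROUP_SIZE = 128               # CL_DEVICE_MAX_WORK_GROUP_SIZE


def is_valid_work_group(shape, max_sizes=MAX_WORK_ITEM_SIZES,
                        max_total=MAX_WORK_GROUP_SIZE):
    """Check the per-dimension limits and the limit on the total item count."""
    per_dim_ok = all(1 <= s <= m for s, m in zip(shape, max_sizes))
    return per_dim_ok and prod(shape) <= max_total


for shape in [(128, 1, 1), (64, 1, 2), (8, 4, 4), (128, 3, 1), (128, 128, 128)]:
    print(shape, is_valid_work_group(shape))
```

So neither 128*3 nor 128^3 is reachable within one work-group; the total is capped at 128 however the three dimensions are combined.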
"when i calculated this before i run the query, i got 8 cores * 16 stream processors * 4 work-items = 512 why is this wrong?"
"also i got the same 3 dimension work-item stuff for my inte core 2 duo CPU, does the same calculations apply?"
Don't understand what you are trying to compute here.
