Is there a way to load a vector equal in size to the GPU's global memory in OpenCL? - opencl

My GPU has 12 GB of global memory (CL_DEVICE_GLOBAL_MEM_SIZE), but can allocate only 3 GB in a single memory object (CL_DEVICE_MAX_MEM_ALLOC_SIZE). When I try to load a vector larger than 3 GB, the program crashes. Is it possible to load a bigger vector into GPU memory so that the memory is fully utilized, and if so, how?

By default, CL_DEVICE_MAX_MEM_ALLOC_SIZE reports 1/4 of CL_DEVICE_GLOBAL_MEM_SIZE, which would mean that a 12 GB GPU can only hold four 3 GB buffers.
However, Nvidia GPUs let you allocate their full memory capacity in a single buffer, even though they also report the 1/4 limit.
Some AMD GPUs set the limit higher; for example, the Radeon VII lets you use 14 of its 16 GB for a single buffer.
The only devices I have ever seen that really enforce the 1/4 limit are the Intel HD 4600 and 5500, so older Intel integrated GPUs. If you go above 1/4 in buffer size there, the cl::Buffer constructor throws error -61 (CL_INVALID_BUFFER_SIZE).
If you are stuck with the 1/4 memory limit on your device, split your large 12 GB buffer into four smaller 3 GB buffers (for example, one buffer each for the x, y, z and w components of the vector). If you use Windows, note that you might only be able to use ~11.5 GB in total, as some VRAM is reserved for the operating system.
I think your issue might not be CL_DEVICE_MAX_MEM_ALLOC_SIZE, though, but a 32-bit integer overflow of the array size above 4 GB. Use the uint64_t data type for the array size instead.
You might also be interested in this lightweight OpenCL-Wrapper for C++. There, the length of vectors is always a 64-bit integer, and it automatically keeps track of how much memory you use in total on each device, telling you if you allocate too much. It also catches the -61 error on Intel iGPUs and then tells you the maximum allowed buffer size.
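To make the splitting suggestion concrete, here is a minimal sketch in plain C++ (no OpenCL calls) of how the chunk sizes could be planned with 64-bit arithmetic. The helper name plan_chunks and its parameters are hypothetical; in a real program, max_alloc_bytes would be whatever your device reports for CL_DEVICE_MAX_MEM_ALLOC_SIZE, and each planned chunk would become its own buffer.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of OpenCL): split total_bytes into chunk
// sizes that each stay within max_alloc_bytes. All sizes are 64-bit so
// that totals above 4 GB do not wrap around.
std::vector<std::uint64_t> plan_chunks(std::uint64_t total_bytes,
                                       std::uint64_t max_alloc_bytes) {
    std::vector<std::uint64_t> chunks;
    if (max_alloc_bytes == 0) return chunks;  // nothing can be allocated
    std::uint64_t remaining = total_bytes;
    while (remaining > 0) {
        // Take as much as the per-buffer limit allows, or whatever is left.
        const std::uint64_t size =
            remaining < max_alloc_bytes ? remaining : max_alloc_bytes;
        chunks.push_back(size);
        remaining -= size;
    }
    return chunks;
}
```

With a 12 GiB total and a 3 GiB limit, this yields four chunks of 3 GiB each; with a 10 GiB total it yields three 3 GiB chunks and one 1 GiB chunk.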

Related

Check the number of Nvidia cores utilized

Is there a way to check the number of stream processors and cores utilized by an OpenCL kernel?
No. However, you can make an educated guess based on your application: if the number of work-items is much larger than the number of CUDA cores and the work-group size is 32 or larger, all stream processors are used at the same time.
If the number of work-items is about the same as or lower than the number of CUDA cores, you won't have full utilization.
If you set the work-group size to 16, only half of the CUDA cores will be used at any time, and the unused half is blocked and cannot do other work. (So always set the work-group size to 32 or larger.)
Tools like nvidia-smi can tell you the time-averaged GPU usage, that is, the fraction of time during which a kernel was executing. So if you run your kernel over and over without any delay in between, the usage gives a rough indication of how busy the GPU is, but not an exact count of the CUDA cores in use.
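The "half of the cores" claim for a work-group size of 16 can be sketched as simple arithmetic. This is only a rough model, assuming 32 lanes per warp as described above; lane_utilization is a hypothetical helper for illustration, not an OpenCL or CUDA API, and it ignores everything else that affects real utilization (occupancy, memory stalls, and so on).

```cpp
#include <cstdint>

// Assumption: 32 work-items are scheduled together as one warp (Nvidia).
constexpr std::uint64_t kWarpSize = 32;

// Hypothetical helper: fraction of warp lanes that hold a real work-item
// when a work-group of the given size is padded up to whole warps.
double lane_utilization(std::uint64_t work_group_size) {
    if (work_group_size == 0) return 0.0;
    // Number of warps needed to hold the work-group (rounded up).
    const std::uint64_t warps = (work_group_size + kWarpSize - 1) / kWarpSize;
    return static_cast<double>(work_group_size) /
           static_cast<double>(warps * kWarpSize);
}
```

Under this model a work-group size of 16 fills half of one warp, 32 and 64 fill their warps completely, and 48 fills three quarters of two warps.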

What causes and how can I check the number of work-groups limit in OpenCL?

I've recently started using OpenCL to write programs for GPUs. I'm familiar with the basic concepts that are required to write efficient programs in OpenCL, like work-items, work-groups, global-item-size, barriers, etc.
One of my programs involved creating about 20 million work-groups with 360 work-items in each work-group. However, for some reason OpenCL couldn't handle that many work-groups. All elements of my output array simply remained 0. In addition, OpenCL didn't even start the calculations when I called clEnqueueNDRangeKernel(): when I viewed the GPU usage stats, I didn't see the "spike" that usually appears when I run an OpenCL kernel. I then reduced the number of work-groups to find the maximum that works. It was 5965232, and it is always 5965232. Not more, not less.
I know that the problem is NOT with the number of work-items. It is with the number of work-groups. To prove this, here is my original code, where LIST_SIZE is 360.
global_item_size = 5965232*LIST_SIZE;
local_size = LIST_SIZE;
and a modified version of my code:
global_item_size = 5965232*LIST_SIZE*1.3;
local_size = LIST_SIZE*1.3;
In all the scenarios, the number of work-groups limit was 5965232.
I'm trying to find out what causes this limit and how to check it. I understand that there may be a limitation, but what causes it, and how can I query this limit in OpenCL? I've done a lot of research, but all sites talk about work-group size limits and not about limits on the number of work-groups.
I'm using the Intel Graphics HD 4000 GPU with an i5-3320M. It has 32 MB of integrated RAM.
5965232*360 = 2147483520 < 2147483647 = 2^31-1 = maximum 32-bit signed integer value, whereas 5965233*360 = 2147483880 already exceeds it.
You are dealing with a classic 32-bit integer overflow in the multiplication in the line
global_item_size = 5965232*LIST_SIZE;
Try global_item_size = 5965232ull*(uint64_t)LIST_SIZE; instead. Make sure global_item_size has the data type uint64_t.
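The arithmetic behind this can be checked with a small sketch in plain C++. The helper names global_size_64 and fits_in_int32 are hypothetical and only illustrate the point: do the multiplication in 64 bits, then compare against the largest 32-bit signed value. It deliberately avoids performing the overflowing int multiplication itself, since signed overflow is undefined behavior in C++.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: total number of work-items, computed in 64 bits.
std::uint64_t global_size_64(std::uint64_t num_groups,
                             std::uint64_t local_size) {
    return num_groups * local_size;
}

// Hypothetical helper: would the same product still fit in a 32-bit
// signed int, i.e. would "int * int" have been safe?
bool fits_in_int32(std::uint64_t num_groups, std::uint64_t local_size) {
    const std::uint64_t int32_max = static_cast<std::uint64_t>(
        std::numeric_limits<std::int32_t>::max());
    return global_size_64(num_groups, local_size) <= int32_max;
}
```

With the numbers from the question, 5965232 groups of 360 work-items still fit in a 32-bit signed int, 5965233 groups do not, and the intended 20 million groups need 7200000000 work-items, which only a 64-bit type can hold.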

Memory usage in Dual GPU(Multi GPU)

I am using two GPUs with the same configuration for my HPC GPGPU calculations with OpenCL. One of the cards drives the display, and about 200-300 MB of its memory is always used by two programs, compiz and the X server. My question: when using these GPUs for computation, I can use only part of the total memory on the display GPU, whereas on the second GPU I can use the entire global memory. In my case I am using two Nvidia Quadro 410 cards, each with 192 CUDA cores and 512 MB of memory, of which 503 MB is usable. On the display GPU I can use only 128 MB for computation, while on the other I can use the full 503 MB.
According to the OpenCL Specification, page 32:
Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
Also, shouldn't this hold for all the GPUs present in the system?
Just continue reading from that point and you will see
Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
so whichever is greater, 128 MB or 1/4 of the total, will be the lower bound on the limit.
OpenCL will automatically swap data in and out of the GPU, so you are not actually limited to the GPU's global memory; you can use more memory in total, as long as you don't use it all at once. You obviously cannot create objects so big that they don't fit in the GPU memory. That is where this limit kicks in.
The current max limit per object is, as pointed out by #huseyin,
CL_DEVICE_MAX_MEM_ALLOC_SIZE (cl_ulong)
Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
minimum_max_global_size = max(1/4*global, 128MB)
If you read it carefully, this is the minimum value for the maximum allocation size (tricky wording!).
Probably nVIDIA is setting it to 1/4 on a display GPU, and to the whole memory size on the non-display GPU. But nVIDIA is following the spec in both cases.
It is something you should query, and operate within the limits the API reports. You cannot change it, and you should not guess it.
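As a worked example of the formula quoted above, here is a minimal sketch in plain C++. The function name min_guaranteed_max_alloc is hypothetical, and the formula is taken only from the spec text quoted in this thread (other spec versions may word the bound differently). It computes the smallest value a device is allowed to report, not what a particular driver actually reports; for that you still have to query CL_DEVICE_MAX_MEM_ALLOC_SIZE.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: the lower bound on CL_DEVICE_MAX_MEM_ALLOC_SIZE
// according to the quoted wording, max(1/4 of global memory, 128 MiB).
std::uint64_t min_guaranteed_max_alloc(std::uint64_t global_mem_bytes) {
    const std::uint64_t floor_bytes = 128ull * 1024ull * 1024ull;  // 128 MiB
    return std::max(global_mem_bytes / 4, floor_bytes);
}
```

For a 512 MB card this gives 128 MB (one quarter and the floor coincide), which matches the 128 MB seen on the display GPU in the question; for a 12 GB card it gives 3 GB.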

How to determine the maximum size of bus-addressable OpenCL memory buffer?

I am using the AMD bus-addressable memory extension to write from an FPGA to a GPU and vice versa. In the first case, an OpenCL buffer is created with the CL_MEM_BUS_ADDRESSABLE_AMD flag set. However, the largest size that I can allocate is much less than what is reported for CL_DEVICE_MAX_MEM_ALLOC_SIZE.
How can I find the maximum allocation size of such a buffer?

Optimal Local/Global worksizes in OpenCL

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL) simply test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) work-group size is a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled atomically together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I am unsure about Intel, but you can find this type of information in their documentation.
So when you are doing some computation and, let's say, you have 2300 work-items (the global size), 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL will choose one for you, and it may be a bad one. When the local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to bad device utilization. Thus, it can be beneficial to add some "dummy" threads so that the global size becomes a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by both 32 and 64. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if(globalID >= realNumberOfThreads)
    globalID = 0;
This will make the four extra threads do the same as thread 0 (it is often faster to do some extra work than to have many idle threads).
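On the host side, the padding described above is just a round-up to the next multiple of the local size. A minimal sketch in plain C++, where round_up_global_size is a hypothetical helper name and the local size of 32 or 64 is whatever you picked for your device:

```cpp
#include <cstddef>

// Hypothetical helper: smallest multiple of local_size that is greater
// than or equal to work_items. The difference is the number of "dummy"
// work-items the kernel has to guard against.
std::size_t round_up_global_size(std::size_t work_items,
                                 std::size_t local_size) {
    if (local_size == 0) return work_items;  // no padding possible
    return ((work_items + local_size - 1) / local_size) * local_size;
}
```

For 2300 work-items this returns 2304 for a local size of either 32 or 64, so four dummy work-items are added; a count that is already a multiple is returned unchanged.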
Hope that answered your question. GL HF!
If your processing essentially uses little memory (e.g. only to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further: first check the maximum local size (which depends on your kernel's local/private memory usage) using clGetKernelWorkGroupInfo, and let that determine your global size.
