OpenCL and Tesla M1060 - opencl

I'm using the Tesla m1060 for GPGPU computation. It has the following specs:
# of Tesla GPUs 1
# of Streaming Processor Cores (XXX per processor) 240
Memory Interface (512-bit per GPU) 512-bit
When I use OpenCL, I can display the following board information:
available platform OpenCL 1.1 CUDA 6.5.14
device Tesla M1060 type:CL_DEVICE_TYPE_GPU
max compute units:30
max work item dimensions:3
max work item sizes (dim:0):512
max work item sizes (dim:1):512
max work item sizes (dim:2):64
global mem size(bytes):4294770688 local mem size:16383
How can I relate the GPU card informations to the OpenCL memory informations ?
For example:
What does "Memory Interace" means ? Is it linked the a Work Item ?
How can I relate the "240 cores" of the GPU to Work Groups/Items ?
How can I map the work-groups to it (what would be the number of Work groups to use) ?
Thanks
EDIT:
After the following answers, there is a thing that is still unclear to me:
The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value is 32 for the kernel I use.
However, my device has a CL_DEVICE_MAX_COMPUTE_UNITS value of 30.
In the OpenCL 1.1 Api, it is written (p. 15):
Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit
It seems that either something is incoherent here, or that I didn't fully understand the difference between Work-Groups and Compute Units.
As previously stated, when I set the number of Work Groups to 32, the programs fails with the following error:
Entry function uses too much shared data (0x4020 bytes, 0x4000 max).
The value 16 works.
Addendum
Here is my Kernel signature:
// enable double precision (not enabled by default)
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#error "IEEE-754 double precision not supported by OpenCL implementation."
#endif
#define BLOCK_SIZE 16 // --> this is what defines the WG size to me
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void mmult(__global double * A, __global double * B, __global double * C, const unsigned int q)
{
__local double A_sub[BLOCK_SIZE][BLOCK_SIZE];
__local double B_sub[BLOCK_SIZE][BLOCK_SIZE];
// stuff that does matrix multiplication with __local
}
In the host code part:
#define BLOCK_SIZE 16
...
const size_t local_work_size[2] = {BLOCK_SIZE, BLOCK_SIZE};
...
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);

The memory interface doesn't mean anything to an opencl application. It is the number of bits the memory controller has for reading/writing to the memory (the ddr5 part in modern gpus). The formula for maximum global memory speed is approximately: pipelineWidth * memoryClockSpeed, but since opencl is meant to be cross-platform, you won't really need to know this value unless you are trying to figure out an upper bound for memory performance. Knowing about the 512-bit interface is somewhat useful when you're dealing with memory coalescing. wiki: Coalescing (computer science)
The max work item sizes have to do with 1) how the hardware schedules computations, and 2) the amount of low-level memory on the device -- eg. private memory and local memory.
The 240 figure doesn't matter to opencl very much either. You can determine that each of the 30 compute units is made up of 8 streaming processor cores for this gpu architecture (because 240/30 = 8). If you query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, it will very likey be a multiple of 8 for this device. see: clGetKernelWorkGroupInfo
I have answered a similar questions about work group sizing. see here, and here
Ultimately, you need to tune your application and kernels based on your own bench-marking results. I find it worth the time to write many tests with various work group sizes and eventually hard-code the optimal size.

Adding another answer to address your local memory issue.
Entry function uses too much shared data (0x4020 bytes, 0x4000 max)
Since you are allocating A_sub and B_sub, each having 32*32*sizeof(double), you run out of local memory. The device should be allowing you to allocate 16kb, or 0x4000 bytes of local memory without an issue.
0x4020 is 32 bytes or 4 doubles more than what your device allows. There are only two things I can think of that may cause the error: 1) there could be a bug with your device or drivers preventing you from allocating the full 16kb, or 2) you are allocating the memory somewhere else in your kernel.
You will have to use a BLOCK_SIZE value less than 32 to work around this for now.
There's good news though. If you only want to hit a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE as a work group size, BLOCK_SIZE=16 already does this for you. (16*16 = 256 = 32*8). To better take advantage of local memory, try BLOCK_SIZE=24. (576=32*18)

Related

What to do if I have more work-items than SIZE_MAX in OpenCL

My OpenCL program involves having about 7 billion work-items. In my C++ program, I would set this to my global_item_size:
size_t global_item_size = 7200000000;
If my program is compiled to 64-bit systems (x64), this global size is OK, since SIZE_MAX (the maximum value of size_t) is much larger than 7 billion. However, to ensure backwards compatibility I want to make sure that my program is able to compile to 32-bit systems (x86). On 32-bit systems, SIZE_MAX is about 4 billion, less than my global size, 7 billion. If I would try to set the global size to 7 billion, it would result in an overflow. What can I do in this case?
One of the solutions I was thinking about was to make a multi-dimensional global size and local size. However, this solution requires the kernel to calculate the original global size (because my kernel heavily depends on the global and local size), which would result in a performance loss.
The other solution I considered was to launch multiple kernels. I think this solution would be a little "sloppy" and synchronizing kernels also wouldn't be the best solution.
So my question basically is: How can I (if possible) make the global size larger than the maximum size of size_t? If this is not possible, what are some workarounds?
If you want to avoid batches you can give each kernel more work but effectively wrapping the code in a for loop. E.g.
for (int i = 0; i < WORK_ITEMS_PER_THREAD; ++i)
{
size_t id = WORK_ITEMS_PER_THREAD * get_global_id(0) + i;
...
}
Try to use uint64_t global_item_size = 7200000000ull; to avoid 32-bit integer overflow.
If you are strictly limited to the maximum 32-bit number of work items, you could do the computation in several batches (exchange GPU buffers in between compute steps via PCIe transfer) or you could pack several data items into one GPU thread.

What causes and how can I check the number of work-groups limit in OpenCL?

I've shortly started using OpenCL to write programs for GPUs. I'm familiar with basic concepts that are required to write efficient programs in OpenCL, like work-items, work-groups, global-item-size, barriers, etc.
One of my programs involved making about 20 million work-groups with 360 work-items in each work-group. However, for some reason OpenCL couldn't handle that many number of work-groups. All elements of my output array simply remained 0. In addition, OpenCL didn't even start the calculations when I called clEnqueueNDRangeKernel(), since when I viewed the GPU usage stats I didn't see a "spike" that usually happens when I run an OpenCL kernel. I attempted to reduce the number work-groups, to see what is the maximum number of work-groups. It was 5965232 and it is always 5965232. Not more, not less.
I know that the problem is NOT with the number of work-items. It is with the number of work-groups. To prove this, here is my original code, where LIST_SIZE is 360.
global_item_size = 5965232*LIST_SIZE;
local_size = LIST_SIZE;
and a modified version of my code:
global_item_size = 5965232*LIST_SIZE*1.3;
local_size = LIST_SIZE*1.3;
In all the scenarios, the number of work-groups limit was 5965232.
I'm trying to find out what causes this limit and how to check this limit. I understand that there may be a limitation, but what causes this limitation and how can I check check this limit number in OpenCL? I've did a lot of research, but all sites are talking about work-group size limits and not about number of work-group limits.
I'm using the Intel Graphics HD 4000 GPU with an i5-3320M. It has 32 MB of integrated RAM.
5965232*320 = 2147483520 < 2147483647 = 2^31-1 = maximum 32-bit signed integer value
You are dealing with a classical 32-bit integer overflow in the multiplication in line
global_item_size = 5965232*LIST_SIZE;
Try global_item_size = 5965232ull*(uint64_t)LIST_SIZE; instead. Make sure global_item_size is data type uint64_t.

Using vector types to improve OpenCL kernel performance

I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):-
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i* record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vectors recently, I was wondering if I could use these to improve on the performance? I understand the concept of vectors but I'm not sure how to use them in practice, partly due to lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?
Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of spec), in your case, you would add
attribute((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i* record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
You are correct in that your 2nd global dimension should be divided by 16, and your record_len should also be divided by 16. Also, if you were to specify the local size instead of giving it NULL, you would also want to divide that by 16.
There are some other things to consider though.
You might think choosing the largest vector size should provide the best performance, especially with such a simple kernel. But in my experience, that rarely is the most optimal size. You may try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but for me this rarely is accurate (also, it may give you 1, meaning the compiler will try auto-vectorization or the device doesn't have vector hardware). It is best to try different vector sizes and see which is fastest.
If your device supports auto-vectorization, and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1) so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len is equal to the global size you gave that dimension.
Also, you gave NULL to the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily pick the most optimal size.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVidia hardware, or the AMD APP SDK OpenCL User Guide for AMD GPU hardware. The NVidia one is from 2009, and I'm not sure how much their hardware has changed since. Notice though that it actually says:
The CUDA architecture is a scalar architecture. Therefore, there is no performance
benefit from using vector types and instructions. These should only be used for
convenience.
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also if you are targeting a CPU, it will use AVX hardware if available.

Why Nvidia and AMD OpenCL reduction example did not reduce an array to an element in one go?

I am working on some OpenCL reduction and I found AMD and Nvidia both has some example like the following kernel (this one is taken from Nvidia's website, but AMD has a similar one):
__kernel void reduce2(__global T *g_idata, __global T *g_odata, unsigned int n, __local T* sdata){
// load shared mem
unsigned int tid = get_local_id(0);
unsigned int i = get_global_id(0);
sdata[tid] = (i < n) ? g_idata[i] : 0;
barrier(CLK_LOCAL_MEM_FENCE);
// do reduction in shared mem
for(unsigned int s=get_local_size(0)/2; s>0; s>>=1)
{
if (tid < s)
{
sdata[tid] += sdata[tid + s];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
// write result for this block to global mem
if (tid == 0) g_odata[get_group_id(0)] = sdata[0];}
I have two questions:
the code above reduce an array to another smaller array, I am just wondering why all the example I saw did the same instead of reducing an array directly into a single element, which is the usual semantic of "reduction" (IMHO). This should be easily achievable with an outer loop inside the kernel. Is there special reason for this?
I have implemented this reduction and found it quite slow, is there any optimisation I can do to improve it? I saw another example used some unrolling to avoid synchronisation in the loop, but I did not quite get the idea, can you explain a bit?
The reduction problem in a multithread environment is a very special parallel problem. There is one path that needs to be done sequentially, which is the element 0 to the power of 2.
Even if you had infinite threads for processing, you will need log2(N) passes trough the array to reduce it to a single element.
In a real system your number of threads (work-items) are reduced but high (~128-2048). So, in order to use them efficiently all of them have to have something to do. But since the problem is more and more serial and less parallel as you reduce the size of the reduction. These algorithms only bother about the high part, and let the CPU do the rest of the reduction.
To make the story short. You can reduce an array from 1024 to 512 in one pass, but you need the same power to reduce it from 2 to 1. In the latter case all the threads minus 1 are idle, an incredible waste of GPU resources (99.7% idle).
As you can see, there is no point in reducing this last part on a GPU. It is easier to simply copy it to CPU and do it sequentially.
Answering your question: Yes, it is slow, and will always be. If there was a magic trick to solve it, then AMD and nVIDIA would be using it don't you think? :)
For question 1: This kernel reduces a big array into a smaller one and not a single element because there is no synchronization possible between work-groups. so each work-group can reduces its portion of the array to one elements but after that all these single elements given by each work-group need to be written in global memory before a new pass is performed. This could go on until the moment the array is small enough to have only one work-group running.
For question 2: There is several approaches to perform a reduction with different performance. How to improve performance for such problem is discussed in this article from the AMD resources. Hope you'll find it useful.

INVALID_WORK_GROUP_SIZE when processing an Image2D

I'm trying to process an image using OpenCL 1.1 C++ on my AMD CPU.
The characteristics are:
using CPU: AMD Turion(tm) 64 X2 Mobile Technology TL-60
initCL:CL_DEVICE_IMAGE2D_MAX_WIDTH :8192
initCL:CL_DEVICE_IMAGE2D_MAX_HEIGHT :8192
initCL:timer resolution in ns:1
initCL:CL_DEVICE_GLOBAL_MEM_SIZE in bytes:1975189504
initCL:CL_DEVICE_GLOBAL_MEM_CACHE_SIZE in bytes:65536
initCL:CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE in bytes:65536
initCL:CL_DEVICE_LOCAL_MEM_SIZE in bytes:32768
initCL:CL_DEVICE_MAX_COMPUTE_UNITS:2
initCL:CL_DEVICE_MAX_WORK_GROUP_SIZE:1024
initCL:CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:3
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=0, size 1024
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=1, size 1024
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=2, size 1024
createCLKernel:mean_value
createCLKernel:CL_KERNEL_WORK_GROUP_SIZE:1024
createCLKernel:CL_KERNEL_LOCAL_MEM_SIZE used by the kernel in bytes:0
createCLKernel:CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE:1
The kernel is for the moment empty:
__kernel void mean_value(image2d_t p_image,
__global ulong4* p_meanValue)
{
}
The execution call is:
cl::NDRange l_globalOffset;
// The global worksize is the entire image
cl::NDRange l_globalWorkSize(l_width, l_height);
// Needs to be determined
cl::NDRange l_localWorkSize;//(2, 2);
// Computes the mean value
cl::Event l_profileEvent;
gQueue.enqueueNDRangeKernel(gKernelMeanValue, l_globalOffset, l_globalWorkSize,
l_localWorkSize, NULL, &l_profileEvent);
If l_width=558 and l_height=328, l_localWorkSize can not be greater than (2, 2) otherwise, I get this error:"Invalid work group size"
Is it because I only have 2 cores ?
Is there a rule to determine l_localWorkSize ?
You can check 2 things using the clGetDeviceInfo function :
CL_DEVICE_MAX_WORK_GROUP_SIZE to check that 4 is not too big for your workgroup and
CL_DEVICE_MAX_WORK_ITEM_SIZES to check that the number of work-items by dimension is not too big.
And the fact the group-size may be limited to the number of cores makes sense : if you have inter work-items communication/synchronization you'll want them to be executed at the same time, otherwise the OpenCL driver would have to emulate this which might be at least hard and probably impossible in the general case.
I read in the OpenCL specs that enqueueNDRangeKernel() succeeds if l_globalWorkSize is evenly divisible byl_localWorkSize. In my case, I can set it up to (2,41).

Resources