Estimating the optimal tiling size for GPU matrix computations

Estimating the optimal tiling size for GPU matrix computations - opencl

I've written a Matrix Multiplication Kernel in SYCL, based on Tiling sub-matrices to local cache. The performance uplift I get with tiling (tile size 16x16) and without tiling (naive) approach is up to 2x.
For lower tile sizes, I get near to naive speeds, which is expected. For any tile size higher than 16 (and I would choose a power of 2 because so is my matrix size) like 32, the kernel throws a sycl exception.
I suspect this is because GPU cannot accommodate the higher tile-size on its local cache.
Questions:
How do I determine dynamically (and set) the maximum tile size supported on deployment on different GPUs?
For Intel GPUs, how can I find out the maximum GPU local cache size?
I tried checking ark.intel.com, but that doesn't list the GPU local cache size.
Current setup: i7-8665U with Intel UHD 620
P.S: If you would like to see my kernel code, please add a comment, I will add. I currently don't feel the need to show the kernel code and bloat the post.

In general in matrix multiplication tiling there are several things you need to take care of:
Size of tile per thread - since you need to keep the data in registers that are scares, for example for NVidia it is around 256 - so automatically you can't make a tile larger than 16x16 - in reality 6x6/8x8 is sweet spot for nvidia/amd/intel gpus per thread
It is better to load to the large tile (like 128x128 or 72x72 (for AMD)) to local memory and split the work load on smaller tiles for each thread in work group - but you should be very careful in avoiding bank conflicts
Optimal parameters selection is dependent on gpu vendor (amd/nvidia/intel/arm-mali etc), gpu version/generation and of course matrix size. CLBlast has for example complex tuning routines for matrix multiplication parameters selection.
So in order to select optimal parameters you need to look at wavefront/wrap/simd size for amd/nvidia/intel gpu (64 or 32/32/8-32) number of local memory banks, registers count per thread etc. In general it can be done using automatic tuning and caching these values.
I found this tutorial very helpful in understanding various issues to make fast matrix multiplications:
https://cnugteren.github.io/tutorial/pages/page1.html
And even there he got about 50-60% of efficiency. Implementing good matrix multiplication algorithm is hard.
And this is Intel specific tutorial: https://software.intel.com/content/www/us/en/develop/articles/sgemm-for-intel-processor-graphics.html

#Artyom has given an explanation on things to take care of, while implementing Matrix Multiplication on GPU.
On the questions, here are the snippets in SYCL that show what I was looking for:
// Create a queue with device
default_selector d_selector;
queue q(d_selector, dpc_common::exception_handler);
std::cout << "Enumerated Device: "
<< q.get_device().get_info<info::device::name>() << "\n";
auto wgroup_size = q.get_device().get_info<info::device::max_work_group_size>();
auto local_mem_size = q.get_device().get_info<info::device::local_mem_size>();
auto global_mem_size = q.get_device().get_info<info::device::global_mem_size>();
std::cout << "Maximum workgroup size\t:" << wgroup_size << "\n"
<< "Global Memory Size\t:" << global_mem_size / 1024 / 1024 << " MB\n"
<< "Local Memory Size\t:" << local_mem_size / 1024 << " KB\n";
This shows:
Enumerated Device: Intel(R) Gen9
Maximum workgroup size :256
Global Memory Size :3199 MB
Local Memory Size :64 KB
Maximum workgroup size is 256, i.e. across each dimension, 16 is maximum supported.
Local Cache Size is 65536 bytes (64KB). This is also confirmed here if anyone wants to look further.

Related

Is there a way to load a vector equal by size to global memory size of GPU in OpenCl?

My GPU has 12 GB global memory (CL_DEVICE_GLOBAL_MEM_SIZE), but only 3 GB of memory which it can allocate (CL_DEVICE_MAX_MEM_ALLOC_SIZE). When I try to load a vector of size exceeding 3 GB, the program crashes. The question is, if it is possible to load a bigger vector into GPU memory to utilize it completely, how to do it?

By default, CL_DEVICE_MAX_MEM_ALLOC_SIZE reports 1/4 of CL_DEVICE_GLOBAL_MEM_SIZE, meaning it would only be allowed to allocate four 3GB buffers on a 12GB GPU.
However, Nvidia GPUs allow to allocate their full memory capacity in a single buffer, even though they also report to have the 1/4 limit.
Some AMD GPUs have the limit set higher, for example the Radeon VII lets you use 14/16GB for a single buffer.
The only devices I have ever seen that really inforce the 1/4 limit are Intel HD 4600 and 5500, so older Intel integrated GPUs. If you go above 1/4 in buffer size there, the cl::Buffer constructor throws error -61.
In case you are stuck with the 1/4 memory limit on your device, split your large 12GB buffer in 4 smaller 3GB buffers (for example one vector for x, y, z, w components of the vector each). If you use Windows, note that you might only be able to use ~11.5GB in total as some VRAM is reserved for the operating system.
I think your issue might not be CL_DEVICE_MAX_MEM_ALLOC_SIZE though, but 32-bit integer overflow for the array size above 4GB. Use the uint64_t data type to set the array size instead.
You might also be interested in this lightweight OpenCL-Wrapper for C++. There, the length of vectors always is in 64-bit integer, and it automatically keeps track on howm much memory you use in total on each device, telling you if you allocate too much. It also catches that -61 error on Intel iGPUs and tells you the maximum allowed buffer size then.

How do I plan this least-squares computation on a GPU?

I'm writing a program called PerfectTIN (https://github.com/phma/perfecttin) which does lots of least-squares adjustments of a TIN to fit a point cloud. Each adjustment takes some contiguous group of triangles and adjusts the elevations of up to 8 points, which are corners of the triangles, to fit the dots in the triangles. I have it working on SMP. At the start of processing, it does only one adjustment at a time, so it splits the adjustment into tasks, each of which takes some dots, all of which are in the same triangle. Each thread takes a task from a queue and computes a small square matrix and a small column vector. When they're all ready, the adjustment routine adds up the matrices and the vectors and finishes the least squares computation.
I'd like to process tasks on the GPU as well as the CPU. The data needed for a task are
The three corners of the triangle (x,y,z)
The coordinates of the dots (x,y,z).
The output data are
A symmetric matrix with up to nine nonzero entries (since it's symmetric, I need only compute six numbers)
A column vector with the same number of rows.
The number of dots is a multiple of 1024, except for a few tasks which I can handle in the CPU (the total number of dots in a triangle can be any nonnegative integer). For a fairly large point cloud of 56 million dots, some tasks are larger than 131072 dots.
Here is part of the output of clinfo (if you need other parts, let me know):
Platform Name Clover
Number of devices 1
Device Name Radeon RX 590 Series (POLARIS10, DRM 3.33.0, 5.3.0-7625-generic, LLVM 9.0.0)
Device Vendor AMD
Device Vendor ID 0x1002
Device Version OpenCL 1.1 Mesa 19.2.8
Driver Version 19.2.8
Device OpenCL C Version OpenCL C 1.1
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Max compute units 36
Max clock frequency 1545MHz
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple 64
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 8 / 8 (cl_khr_fp16)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
If I understand right, if I put one dot in each core of the GPU, the total number of dots I can process at once is 36×256=9×1024=9216. Could I put four dots in each core, since a work group would then have 1024 dots? In this case I could process 36864 dots at once. How many dots should each core process? What if a task is bigger than the GPU can hold? What if several tasks (possibly from different triangles) fit in the GPU?
Of course I want this code to run on other GPUs than mine. I'm going to use OpenCL for portability. What different GPUs (description rather than name) am I going to encounter?

If I understand right, if I put one dot in each core of the GPU, the total number of dots I can process at once is 36×256=9×1024=9216.
Not quite. The total number of dots is not limited by the maximum work group size. The work group size is the number of GPU threads working synchronously in a group. Within a work group, you can share data via local memory, which can be useful to speed up certain calculations like matrix multiplications for example (cache tiling).
The idea of GPU parallelization is to split the prioblem up into as many independent parts as there are. In C++ something like
void example(float* data, const int N) {
for(int n=0; n<N; n++) {
data[n] += 1.0f;
}
}
in OpenCL becomes this
kernel void example(global float* data ) {
const int n = get_global_id(0) ;
data[n] += 1.0f;
}
where the global range of the kernel is set to N.
Each GPU thread should process only a single dot. The number of threads (global range) can and should be much larger than the number of GPU cores available. If you don't explicitly need local memory (all threads can work independent), you can set the local work group size to either 32, 64, 128 or 256 - it doesn't matter - but there might be some performance difference between these values. However the global range (total number of threads / points) must be a multiple of the work group size.
What if a task is bigger than the GPU can hold?
I assume you mean when your data set does not fit into video memory all at once. In this case you can do the computation in several batches, so exchange the GPU buffers using PCIe transfers. But that comes at a large performance penalty.
Of course I want this code to run on other GPUs than mine. I'm going to use OpenCL for portability. What different GPUs (description rather than name) am I going to encounter?
OpenCL is excellent for portability across devices and operating systems. Other than AMD GPUs and CPUs, you will encounter Nvidia GPUs, which only support OpenCL 1.2, and Intel GPUs and CPUs. If the graphics drivers are installed, your code will run on all of them without issues. Just be aware that the amount of video memory can be vastly different.

OpenCL and Tesla M1060

I'm using the Tesla m1060 for GPGPU computation. It has the following specs:
# of Tesla GPUs 1
# of Streaming Processor Cores (XXX per processor) 240
Memory Interface (512-bit per GPU) 512-bit
When I use OpenCL, I can display the following board information:
available platform OpenCL 1.1 CUDA 6.5.14
device Tesla M1060 type:CL_DEVICE_TYPE_GPU
max compute units:30
max work item dimensions:3
max work item sizes (dim:0):512
max work item sizes (dim:1):512
max work item sizes (dim:2):64
global mem size(bytes):4294770688 local mem size:16383
How can I relate the GPU card informations to the OpenCL memory informations ?
For example:
What does "Memory Interace" means ? Is it linked the a Work Item ?
How can I relate the "240 cores" of the GPU to Work Groups/Items ?
How can I map the work-groups to it (what would be the number of Work groups to use) ?
Thanks
EDIT:
After the following answers, there is a thing that is still unclear to me:
The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value is 32 for the kernel I use.
However, my device has a CL_DEVICE_MAX_COMPUTE_UNITS value of 30.
In the OpenCL 1.1 Api, it is written (p. 15):
Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit
It seems that either something is incoherent here, or that I didn't fully understand the difference between Work-Groups and Compute Units.
As previously stated, when I set the number of Work Groups to 32, the programs fails with the following error:
Entry function uses too much shared data (0x4020 bytes, 0x4000 max).
The value 16 works.
Addendum
Here is my Kernel signature:
// enable double precision (not enabled by default)
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#error "IEEE-754 double precision not supported by OpenCL implementation."
#endif
#define BLOCK_SIZE 16 // --> this is what defines the WG size to me
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void mmult(__global double * A, __global double * B, __global double * C, const unsigned int q)
{
__local double A_sub[BLOCK_SIZE][BLOCK_SIZE];
__local double B_sub[BLOCK_SIZE][BLOCK_SIZE];
// stuff that does matrix multiplication with __local
}
In the host code part:
#define BLOCK_SIZE 16
...
const size_t local_work_size[2] = {BLOCK_SIZE, BLOCK_SIZE};
...
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);

The memory interface doesn't mean anything to an opencl application. It is the number of bits the memory controller has for reading/writing to the memory (the ddr5 part in modern gpus). The formula for maximum global memory speed is approximately: pipelineWidth * memoryClockSpeed, but since opencl is meant to be cross-platform, you won't really need to know this value unless you are trying to figure out an upper bound for memory performance. Knowing about the 512-bit interface is somewhat useful when you're dealing with memory coalescing. wiki: Coalescing (computer science)
The max work item sizes have to do with 1) how the hardware schedules computations, and 2) the amount of low-level memory on the device -- eg. private memory and local memory.
The 240 figure doesn't matter to opencl very much either. You can determine that each of the 30 compute units is made up of 8 streaming processor cores for this gpu architecture (because 240/30 = 8). If you query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, it will very likey be a multiple of 8 for this device. see: clGetKernelWorkGroupInfo
I have answered a similar questions about work group sizing. see here, and here
Ultimately, you need to tune your application and kernels based on your own bench-marking results. I find it worth the time to write many tests with various work group sizes and eventually hard-code the optimal size.

Adding another answer to address your local memory issue.
Entry function uses too much shared data (0x4020 bytes, 0x4000 max)
Since you are allocating A_sub and B_sub, each having 32*32*sizeof(double), you run out of local memory. The device should be allowing you to allocate 16kb, or 0x4000 bytes of local memory without an issue.
0x4020 is 32 bytes or 4 doubles more than what your device allows. There are only two things I can think of that may cause the error: 1) there could be a bug with your device or drivers preventing you from allocating the full 16kb, or 2) you are allocating the memory somewhere else in your kernel.
You will have to use a BLOCK_SIZE value less than 32 to work around this for now.
There's good news though. If you only want to hit a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE as a work group size, BLOCK_SIZE=16 already does this for you. (16*16 = 256 = 32*8). To better take advantage of local memory, try BLOCK_SIZE=24. (576=32*18)

Optimal Local/Global worksizes in OpenCL

I am wondering how to chose optimal local and global work sizes for different devices in OpenCL?
Is it any universal rule for AMD, NVIDIA, INTEL GPUs?
Should I analyze physical build of the devices (number of multiprocessors, number of streaming processors in multiprocessor, etc)?
Does it depends on the algorithm/implementation? Because I saw that some libraries (like ViennaCL) to assess correct values just tests many combination of local/global work sizes and chose best combination.

NVIDIA recommends that your (local)workgroup-size is a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled atomically together). AMD on the other hand recommends a multiple of 64(equal to one wavefront). Unsure about Intel, but you can find this type of information in their documentation.
So when you are doing some computation and let say you have 2300 work-items (the global size), 2300 is not dividable by 64 nor 32. If you don't specify the local size, OpenCL will choose a bad local size for you. What happens when you don't have a local size which is a multiple of the atomic unit of execution is that you will get idle threads which leads to bad device utilization. Thus, it can be benificial to add some "dummy" threads so that you get a global size which is a multiple of 32/64 and then use a local size of 32/64 (the global size has to be dividable by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is dividable by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if(globalID >= realNumberOfThreads)
globalID = 0;
This will make the four extra threads do the same as thread 0. (it is often faster to do some extra work then to have many idle threads).
Hope that answered your question. GL HF!

If you're essentially making processing using little memory (e.g. to store kernel private state) you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, which itself will decide of your global size.

INVALID_WORK_GROUP_SIZE when processing an Image2D

I'm trying to process an image using OpenCL 1.1 C++ on my AMD CPU.
The characteristics are:
using CPU: AMD Turion(tm) 64 X2 Mobile Technology TL-60
initCL:CL_DEVICE_IMAGE2D_MAX_WIDTH :8192
initCL:CL_DEVICE_IMAGE2D_MAX_HEIGHT :8192
initCL:timer resolution in ns:1
initCL:CL_DEVICE_GLOBAL_MEM_SIZE in bytes:1975189504
initCL:CL_DEVICE_GLOBAL_MEM_CACHE_SIZE in bytes:65536
initCL:CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE in bytes:65536
initCL:CL_DEVICE_LOCAL_MEM_SIZE in bytes:32768
initCL:CL_DEVICE_MAX_COMPUTE_UNITS:2
initCL:CL_DEVICE_MAX_WORK_GROUP_SIZE:1024
initCL:CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:3
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=0, size 1024
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=1, size 1024
initCL:CL_DEVICE_MAX_WORK_ITEM_SIZES:dim=2, size 1024
createCLKernel:mean_value
createCLKernel:CL_KERNEL_WORK_GROUP_SIZE:1024
createCLKernel:CL_KERNEL_LOCAL_MEM_SIZE used by the kernel in bytes:0
createCLKernel:CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE:1
The kernel is for the moment empty:
__kernel void mean_value(image2d_t p_image,
__global ulong4* p_meanValue)
{
}
The execution call is:
cl::NDRange l_globalOffset;
// The global worksize is the entire image
cl::NDRange l_globalWorkSize(l_width, l_height);
// Needs to be determined
cl::NDRange l_localWorkSize;//(2, 2);
// Computes the mean value
cl::Event l_profileEvent;
gQueue.enqueueNDRangeKernel(gKernelMeanValue, l_globalOffset, l_globalWorkSize,
l_localWorkSize, NULL, &l_profileEvent);
If l_width=558 and l_height=328, l_localWorkSize can not be greater than (2, 2) otherwise, I get this error:"Invalid work group size"
Is it because I only have 2 cores ?
Is there a rule to determine l_localWorkSize ?

You can check 2 things using the clGetDeviceInfo function :
CL_DEVICE_MAX_WORK_GROUP_SIZE to check that 4 is not too big for your workgroup and
CL_DEVICE_MAX_WORK_ITEM_SIZES to check that the number of work-items by dimension is not too big.
And the fact the group-size may be limited to the number of cores makes sense : if you have inter work-items communication/synchronization you'll want them to be executed at the same time, otherwise the OpenCL driver would have to emulate this which might be at least hard and probably impossible in the general case.

I read in the OpenCL specs that enqueueNDRangeKernel() succeeds if l_globalWorkSize is evenly divisible byl_localWorkSize. In my case, I can set it up to (2,41).

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex