I implemented a simple kernel, which is some sort of convolution, and measured it on an NVIDIA GT 240. It took 70 ms when written in CUDA and 100 ms when written in OpenCL. OK, I thought, the NVIDIA compiler is better optimized for CUDA (or I'm doing something wrong).
Since I need to run it on AMD GPUs, I migrated to the AMD APP SDK, keeping exactly the same kernel code.
I ran two tests and the results were discouraging: 200 ms on the HD 6670 and 70 ms on the HD 5850 (the same time as for the GT 240 with CUDA). I am very interested in the reasons for such strange behaviour.
All projects were built in VS2010 using the settings from the sample projects of NVIDIA and AMD respectively.
Please do not consider my post an NVIDIA advertisement. I understand quite well that the HD 5850 is more powerful than the GT 240. The only thing I wish to know is where this difference comes from and how to fix the problem.
Update. Below is the kernel code, which looks for 6 equally sized template images in the base one. Every pixel of the base image is considered a possible origin of one of the templates and is processed by a separate thread. The kernel compares the R, G, B values of each pixel of the base image and of the template, and if at least one difference exceeds the diff parameter, the corresponding pixel is counted as nonmatched. If the number of nonmatched pixels is less than maxNonmatchQt, the corresponding template is a hit.
__constant int tOffset = 8196; // one template size in memory (in bytes)

__kernel void matchImage6( __global unsigned char* image,      // pointer to the base image
                           int imgWidth,                        // base image width
                           int imgHeight,                       // base image height
                           int imgPitch,                        // base image pitch (in bytes)
                           int imgBpp,                          // base image bytes (!) per pixel
                           __constant unsigned char* templates, // pointer to the array of templates
                           int tWidth,                          // templates width (the same for all)
                           int tHeight,                         // templates height (the same for all)
                           int tPitch,                          // templates pitch (in bytes, the same for all)
                           int tBpp,                            // templates bytes (!) per pixel (the same for all)
                           int diff,                            // max allowed difference of intensity
                           int maxNonmatchQt,                   // max number of nonmatched pixels
                           __global int* result )               // results
{
    int x0 = (int)get_global_id(0);
    int y0 = (int)get_global_id(1);
    if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
        return;
    int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
    for( int y = 0; y < tHeight; y++) {
        int ind = y * tPitch;
        int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
        for( int x = 0; x < tWidth; x++) {
            unsigned char c0 = image[baseImgInd];
            unsigned char c1 = image[baseImgInd + 1];
            unsigned char c2 = image[baseImgInd + 2];
            for( int i = 0; i < 6; i++)
                if( abs( c0 - templates[i * tOffset + ind]) > diff ||
                    abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
                    abs( c2 - templates[i * tOffset + ind + 2]) > diff)
                    nonmatchQt[i]++;
            ind += tBpp;
            baseImgInd += imgBpp;
        }
        if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt && nonmatchQt[2] > maxNonmatchQt &&
            nonmatchQt[3] > maxNonmatchQt && nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
            return;
    }
    for( int i = 0; i < 6; i++)
        if( nonmatchQt[i] < maxNonmatchQt) {
            unsigned int pos = atom_inc( &result[0]) * 3;
            result[pos + 1] = i;
            result[pos + 2] = x0;
            result[pos + 3] = y0;
        }
}
Kernel run configuration:
Global work size = (1900, 1200)
Local work size = (32, 8) for AMD and (32, 16) for NVIDIA.
Execution time:
HD 5850 - 69 ms,
HD 6670 - 200 ms,
GT 240 - 100 ms.
Any remarks about my code are also highly appreciated.
The difference in execution times is caused by the compilers.
Your code can easily be vectorized. Treat image and templates as arrays of the vector type uchar4 (the fourth coordinate of each uchar4 vector is always 0). Instead of 3 memory reads:
unsigned char c0 = image[baseImgInd];
unsigned char c1 = image[baseImgInd + 1];
unsigned char c2 = image[baseImgInd + 2];
use only one:
uchar4 c = image[baseImgInd];
Instead of the bulky if:
if( abs( c0 - templates[i * tOffset + ind]) > diff ||
abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
abs( c2 - templates[i * tOffset + ind + 2]) > diff)
nonmatchQt[i]++;
use this faster form:
uchar4 t = templates[i * tOffset + ind];
nonmatchQt[i] += any(abs_diff(c,t)>diff);
Thus you increase the performance of your code by up to 3 times (if the compiler doesn't vectorize the code by itself). I suppose the AMD OpenCL compiler does not perform such vectorization and other optimizations. From my experience, OpenCL on an NVIDIA GPU can usually be made faster than CUDA, because it is more low-level.
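For concreteness, here is a rough sketch of how the inner loop might look after that change. It assumes image and templates are rebound as pointers to uchar4 and that tOffset, ind, baseImgInd and the pitch/bpp arithmetic are recomputed in uchar4 elements instead of bytes (names are kept from the original kernel):
for( int x = 0; x < tWidth; x++) {
    uchar4 c = image[baseImgInd];                  // one vector load instead of three byte loads
    for( int i = 0; i < 6; i++) {
        uchar4 t = templates[i * tOffset + ind];
        // abs_diff() works per channel; the comparison yields a char4 mask and
        // any() returns 1 if at least one channel differs by more than diff
        nonmatchQt[i] += any(abs_diff(c, t) > (uchar)diff);
    }
    ind += 1;                                      // with pitches in uchar4 elements, each pixel is one element
    baseImgInd += 1;
}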
There can be no exact, perfect answer for this. OpenCL performance depends on many parameters: the number of accesses to global memory, the efficiency of the code, and so on. Moreover, it is very difficult to compare two devices, since they may have different local, global and constant memories, different numbers of cores, frequencies and memory bandwidths and, more importantly, different hardware architectures.
Each piece of hardware provides its own performance boosts, for example the native_ functions from NVIDIA. So you need to explore the hardware you are working on; that might actually help. But what I would personally recommend is not to use such hardware-specific optimizations, since they can affect the flexibility of your code.
You can also find published papers showing that CUDA performance is much better than OpenCL performance on the same NVIDIA hardware.
So it is always better to write code that provides good flexibility rather than device-specific optimizations.
Related
I compute the trajectories of N particles which move in their mutual gravitational field. I wrote the following OpenCL kernel:
#define G 100.0f
#define EPS 1.0f
float2 f (float2 r_me, __constant float *m, __global float2 *r, size_t s, size_t n)
{
    size_t i;
    float2 res = (float2)(0.0f, 0.0f);

    for (i=1; i<n; i++) {
        size_t idx = i;
        // size_t idx = (i + s) % n;
        float2 dir = r[idx] - r_me;
        float dist = length (dir);
        res += G*m[idx]/pown(dist + EPS, 3) * dir;
    }
    return res;
}

__kernel void take_step_rk2 (__constant float *m,
                             __global float2 *r,
                             __global float2 *v,
                             float delta)
{
    size_t n = get_global_size(0);
    size_t s = get_global_id(0);

    float2 mv = f(r[s], m, r, s, n);
    float2 mr = v[s];
    float2 vpred1 = v[s] + mv * delta;
    float2 rpred1 = r[s] + mr * delta;

    float2 nv = f(rpred1, m, r, s, n);
    float2 nr = vpred1;

    barrier (CLK_GLOBAL_MEM_FENCE);
    r[s] += (mr + nr) * delta / 2;
    v[s] += (mv + nv) * delta / 2;
}
Then I run this kernel multiple times as a one-dimensional problem with global work size = [number of bodies]:
void take_step (struct cl_state *state)
{
    size_t n = state->nbodies;
    clEnqueueNDRangeKernel (state->queue, state->step, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish (state->queue);
}
This is a quote from AMD OpenCL Optimization Guide (year 2015):
Under certain conditions, one unexpected case of a channel conflict is that reading from the same address is a conflict, even on the FastPath.
This does not happen on the read-only memories, such as constant buffers,
textures, or shader resource view (SRV); but it is possible on the read/write UAV
memory or OpenCL global memory.
Work items in my queue all try to get access to the same memory in this loop, so there must be a channel conflict:
for (i=1; i<n; i++) {
    size_t idx = i;
    // size_t idx = (i + s) % n;
    float2 dir = r[idx] - r_me;
    float dist = length (dir);
    res += G*m[idx]/pown(dist + EPS, 3) * dir;
}
I replaced
size_t idx = i;
// size_t idx = (i + s) % n;
with
// size_t idx = i;
size_t idx = (i + s) % n;
so that the first work item (with global id 0) first accesses the first element of the array r, the second work item accesses the second element, and so on.
I expected this change to result in a performance improvement, but on the contrary it caused significant performance degradation (roughly by a factor of 2). What am I missing? Why is all-to-the-same-address memory access better in this situation?
If you have other tips on how to improve the performance, please share them with me. The OpenCL optimization guide is very confusing.
The loop in the f function has no barrier to re-converge work items for coalesced access. Once some items get their r data they start computing, while those that couldn't have to wait for their data, and so the group loses its coalescence. To re-group them, add a barrier, at least once per 10 iterations, or per 2 iterations, or maybe even every iteration. But accessing global memory has high latency, and barrier + latency is bad for performance. You really need local memory here, since it has low latency and a broadcast ability, which lets it lose coalescedness only on grains bigger than the local work-group size (64?), which is not bad for global memory access either (you need to fill local memory from global memory every Kth iteration, where N is divided into K-sized groups).
A source from 2013 (http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf):
Thus, the key to effectively using the LDS is to control the access
pattern, so that accesses generated on the same cycle map to different
banks in the LDS. One notable exception is that accesses to the same
address (even though they have the same bits 6:2) can be broadcast to
all requestors and do not generate a bank conflict.
Using LDS (__local) for this will give good performance. Since LDS is small, you should do it in small patches, like 256 particles at a time.
Also, using i as idx is very cache-friendly, while the modulus version is very cache-hostile. Once the data is in the cache, it doesn't matter that N requests are made; they are served from the cache. But with the modulus, you evict cache contents before they are re-used, depending on N. For small N it should be faster, as you foresaw. For big N, and with a small GPU cache, it is much worse: roughly 1 global request per cycle versus N - cache_size global requests per cycle.
I guess that with such a strong GPU you used a high N, such as 64k bodies, which at 2 variables per body and 4 bytes per variable totals 512 kB and cannot fit in L1; maybe only in L2, which is slower than idx = i served through L1.
Answer:
- all-to-the-same L1 cache address is faster than all-to-global or L2 cache addresses;
- use local memory in a "blocking/patching" algorithm to achieve high speed (see the sketch below).
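A minimal sketch of that blocking/patching idea, under the assumption that the work-group size equals the patch size TILE (e.g. 64 or 256) and that n is a multiple of TILE; the name f_tiled and the __local scratch buffers are illustrative, not part of the original code:
#define TILE 256

float2 f_tiled (float2 r_me, __constant float *m, __global float2 *r,
                __local float2 *r_tile, __local float *m_tile, size_t n)
{
    float2 res = (float2)(0.0f, 0.0f);
    for (size_t base = 0; base < n; base += TILE) {
        size_t lid = get_local_id(0);
        // every work item loads one particle of the current patch into LDS
        r_tile[lid] = r[base + lid];
        m_tile[lid] = m[base + lid];
        barrier (CLK_LOCAL_MEM_FENCE);
        // all work items now read the same low-latency local memory (broadcast-friendly)
        for (size_t j = 0; j < TILE; j++) {
            float2 dir = r_tile[j] - r_me;
            float dist = length (dir);
            res += G*m_tile[j]/pown(dist + EPS, 3) * dir;
        }
        barrier (CLK_LOCAL_MEM_FENCE);   // keep the group converged before loading the next patch
    }
    return res;
}
Unlike the original f, this version also includes particle 0 and relies on EPS to damp the self-interaction term; the __local buffers would have to be declared in the kernel (or passed via clSetKernelArg) and the work-group size fixed to TILE.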
I am new to OpenCL and I want to parallelise this Sieve of Atkin; the C++ code is here: https://www.geeksforgeeks.org/sieve-of-atkin/
Somehow I don't get good results out of it; in fact, the CPU version is much faster in my comparison. I tried to use an NDRange kernel to avoid writing the nested loops and probably increase the performance, but when I pass a higher limit number to the function, the GPU driver stops responding and the program crashes. Maybe my NDRange configuration is not OK; could anyone help with it? I probably don't understand NDRange properly. Here is the information about my GPU:
CL_DEVICE_NAME: GeForce GT 740M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 397.31
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1032 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 2048 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES:
-CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
Here is my NDRange code:
queue.enqueueNDRangeKernel(add, cl::NDRange(1,1), cl::NDRange((limit * limit) -1, (limit * limit) -1 ), cl::NullRange,NULL, &event);
and my kernel code:
__kernel void sieveofAktin(const int limit, __global bool* sieve)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    //printf("%d \n", x);

    int n = (4 * x * x) + (y * y);
    if (n <= limit && (n % 12 == 1 || n % 12 == 5))
        sieve[n] ^= true;

    n = (3 * x * x) + (y * y);
    if (n <= limit && n % 12 == 7)
        sieve[n] ^= true;

    n = (3 * x * x) - (y * y);
    if (x > y && n <= limit && n % 12 == 11)
        sieve[n] ^= true;

    for (int r = 5; r * r < limit; r++) {
        if (sieve[r]) {
            for (int i = r * r; i < limit; i += r * r)
                sieve[i] = false;
        }
    }
}
You have lots of branching in that code, and I suspect that's what may be killing your performance on GPUs. Look at chapter 6 of the NVIDIA OpenCL Best Practices Guide for details on why this hurts performance.
I'm not sure how possible it is without looking closely at the algorithm, but ideally you want to rewrite the code to use as little branching as possible. Alternatively, you could look at other algorithms entirely.
As for the lock-up, I'd need to see more of your host code to know what is happening, but it's possible you're exceeding various limits of your platform/device. Are you checking for errors on every OpenCL function you call? (See the sketch below.)
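As an illustration only (using the plain C API; the C++ wrapper can instead be built with exceptions enabled), a tiny helper that refuses to ignore error codes might look like this:
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Abort loudly on any OpenCL error instead of silently continuing. */
static void check(cl_int err, const char *what)
{
    if (err != CL_SUCCESS) {
        fprintf(stderr, "%s failed with error %d\n", what, err);
        exit(1);
    }
}

/* usage, assuming queue, kernel, global and local are already set up:
   check(clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL),
         "clEnqueueNDRangeKernel");
   check(clFinish(queue), "clFinish");
*/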
Regardless of how good or bad your algorithm or implementation is - the driver should always respond. Non-response is quite possibly a bug. File a bug report at http://developer.nvidia.com/ .
I am trying to compute the local sum of each group, identified by the 3D volume position and the group ID.
My idea is to divide the space into groups and use atomic_add to compute the local sum.
But because I am new to parallel computing, it is kind of hard for me to find the correlation between the code and the instructions.
My current kernel looks like this:
__kernel void TestAtomicAddLocal(__global int *src, int3 vol_dim, __global int *res)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int z = get_global_id(2);
    if( x >= vol_dim.x || y >= vol_dim.y || z >= vol_dim.z ){ return; }

    int id = x + y * vol_dim.x + z * vol_dim.x * vol_dim.y;

    // local mem shared by all work items in a work group,
    // so this can be accessed by all items in the current work group
    __local int local_sum;
    local_sum = 0;

    // use the global id to access the value of the input array
    int local_offset = atomic_add(&local_sum, src[id]);
    barrier(CLK_LOCAL_MEM_FENCE);

    int global_offset = atomic_add(&res[0], local_sum);
    barrier(CLK_GLOBAL_MEM_FENCE);
}
For the host part, my setup is:
enqueueNDrangeKernel( cq, kn_testAtomicAddLocal, 3, 0, cl::size3(256,256,256), cl::size3(64, 64, 64), 0, 0, 0);
For the kernel arguments, src is a cl_mem of size 256*256*256*sizeof(cl_int), vol_dim is set with size 4 * sizeof(cl_int), and res is a cl_mem of size 4*sizeof(cl_int).
Then I get the errors CL_OUT_OF_RESOURCES and CL_INVALID_WORK_GROUP_SIZE. From my understanding, my device's max group size is 1024, but here the total number of groups = (256/64)^3 = 64 < 1024.
My GPU's max work item sizes are 1024x1024x64, which also seems OK. So there must be something I am misunderstanding; I hope someone can help me out.
The max work-group size is what limits your 64 * 64 * 64 part: that is 64 * 64 * 64 = 262144 work items in a single work group, far above the 1024 limit (the limit applies to one group, not to the number of groups).
And I guess you are using a CUDA card. You'd better use CUDA on CUDA cards; OpenCL is more or less emulated on them. If you are not on NVIDIA, I think all AMD cards have a work-group size limit of 256. Edit: um... I forgot about the Intel ones; if that's what you have, ignore this part.
One more important thing: you'd better first look at some reduction logic implementation examples on the internet. Atomics are very expensive, and using them the way you did would almost certainly make your GPU code slower than the CPU version.
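For reference, a minimal sketch of the usual pattern alluded to above: reduce within the work group in __local memory, then issue a single global atomic per group. It assumes a 1D launch whose global size equals the element count and a power-of-two work-group size GROUP_SIZE; all names are illustrative:
#define GROUP_SIZE 256

__kernel void group_sum(__global const int *src, __global int *res)
{
    __local int scratch[GROUP_SIZE];
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    scratch[lid] = src[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // tree reduction: halve the number of active work items each step
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride >>= 1) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // one global atomic per work group instead of one per work item
    if (lid == 0)
        atomic_add(&res[0], scratch[0]);
}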
I am new to OpenCL and I am trying to compute the histogram of a grayscale image. I am performing this computation on an NVIDIA GT 330M GPU.
The code is:
__kernel void histogram(__global struct gray * input, __global int * global_hist, __local volatile int * histogram){
    int local_offset = get_local_id(0) * 256;
    int histogram_global_offset = get_global_id(0) * 256;
    int offset = get_global_id(0) * 1920;
    int value;

    for(unsigned int i = 0; i < 256; i++){
        histogram[local_offset + i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for(unsigned int i = 0; i < 1920; i++){
        value = input[offset + i].i;
        histogram[local_offset + value]++;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for(unsigned int i = 0; i < 256; i++){
        global_hist[histogram_global_offset + i] = histogram[local_offset + i];
    }
}
This computation is performed on a 1920*1080 image.
I am launching the kernel with:
queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange, cl::NDRange(1080), cl::NDRange(1));
When the local size of the histogram is set to 256 * sizeof(cl_int), the duration of this computation (measured through NVIDIA Nsight performance analysis) is 11 675 microseconds.
The local work-group size there is one, so I tried increasing the local work-group size to 8. But when I increase the local size of the histogram to 256 * 8 * sizeof(cl_int) and still compute with a local work-group size of 1, I get 85 177 microseconds.
So when I run it with 8 work items per work group, I don't get a speedup relative to 11 ms but relative to 85 ms; the final time with 8 work items per work group is 13 714 microseconds.
But when I introduce a computation bug - setting local_offset to zero, keeping the local histogram size at 256 * sizeof(cl_int) and using 8 work items per work group - I get a much better time: 3 854 microseconds.
Does anybody have some ideas to speed up this computation?
Thanks!
This answer assumes you want to eventually reduce your histogram all the way down to 256 int values. You call the kernel with as many work groups as you have compute units on your device, and group size should be (as always) a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on the device.
__kernel void histogram(__global struct gray * input, __global int * global_hist){
    int group_id = get_group_id(0);
    int num_groups = get_num_groups(0);
    int local_id = get_local_id(0);
    int local_size = get_local_size(0);

    volatile __local int histogram[256];

    int i;
    for(i=local_id; i<256; i+=local_size){
        histogram[i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);  // make sure the whole local histogram is zeroed before accumulating

    int rowNum, colNum, value, global_hist_offset;

    for(rowNum = group_id; rowNum < 1080; rowNum+=num_groups){
        for(colNum = local_id; colNum < 1920; colNum += local_size){
            value = input[rowNum*1920 + colNum].i;
            atomic_inc(&histogram[value]);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    global_hist_offset = group_id * 256;
    for(i=local_id; i<256; i+=local_size){
        global_hist[global_hist_offset + i] = histogram[i];
    }
}
Each work group works cooperatively on one row of the image at a time. Then the group moves on to another row, calculated using the num_groups value. This will work well no matter how many groups you have. For example, if you have 7 compute units, group 3 (the fourth group) will start on row 3 of the image and then take every 7th row thereafter. Group 3 would compute 154 rows in total, and its final row would be row 1074. Some work groups may compute 1 more row - groups 0 and 1 in this example.
The same interlacing is done within the work group when looking at the columns of the image. In the colNum loop, the Nth work item starts at column N and skips ahead by local_size columns. The remainder for this loop shouldn't come into play as often, because CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will likely be a factor of 1920. Try all work-group sizes from (1..X) * CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE up to the maximum work-group size for your device.
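If it helps, the preferred multiple can be queried per kernel at run time; a small C-API sketch (kernel and device are assumed to be already created):
size_t preferred = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred), &preferred, NULL);
// then try work-group sizes preferred, 2*preferred, ... up to the device's maximum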
One final point about this kernel: the results are not identical to those of your original kernel. Your global_hist array is 1080 * 256 integers; the one I have needs to be num_groups * 256 integers. This helps if you want a full reduction, because there is much less to add after the kernel executes.
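If you do want that final reduction on the device as well, a possible second pass (purely illustrative; reduce_hist and final_hist are made-up names), launched with a global size of 256, could look like:
__kernel void reduce_hist(__global const int *global_hist,
                          __global int *final_hist,
                          int num_groups)
{
    int bin = get_global_id(0);          // one work item per bin, 0..255
    int sum = 0;
    for (int g = 0; g < num_groups; g++)
        sum += global_hist[g * 256 + bin];
    final_hist[bin] = sum;
}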
I need to apply transparency, given 2 pixels:
pixel1: {A, R, G, B} - foreground pixel
pixel2: {A, R, G, B} - background pixel
A, R, G, B are byte values (each channel is represented by one byte).
now I'm calculating transparency as:
newR = pixel2_R * alpha / 255 + pixel1_R * (255 - alpha) / 255
newG = pixel2_G * alpha / 255 + pixel1_G * (255 - alpha) / 255
newB = pixel2_B * alpha / 255 + pixel1_B * (255 - alpha) / 255
but it is too slow.
I need to do it with bitwise operators (AND, OR, XOR, negation, bit shifts).
I want to do this on Windows Phone 7 with XNA.
Attached C# code:
public static uint GetPixelForOpacity(uint reduceOpacityLevel, uint pixelBackground, uint pixelForeground, uint pixelCanvasAlpha)
{
    byte surfaceR = (byte)((pixelForeground & 0x00FF0000) >> 16);
    byte surfaceG = (byte)((pixelForeground & 0x0000FF00) >> 8);
    byte surfaceB = (byte)((pixelForeground & 0x000000FF));

    byte sourceR = (byte)((pixelBackground & 0x00FF0000) >> 16);
    byte sourceG = (byte)((pixelBackground & 0x0000FF00) >> 8);
    byte sourceB = (byte)((pixelBackground & 0x000000FF));

    uint newR = sourceR * pixelCanvasAlpha / 256 + surfaceR * (255 - pixelCanvasAlpha) / 256;
    uint newG = sourceG * pixelCanvasAlpha / 256 + surfaceG * (255 - pixelCanvasAlpha) / 256;
    uint newB = sourceB * pixelCanvasAlpha / 256 + surfaceB * (255 - pixelCanvasAlpha) / 256;

    return (uint)255 << 24 | newR << 16 | newG << 8 | newB;
}
You can't do an 8 bit alpha blend using only bitwise operations, unless you basically re-invent multiplication with basic ops (8 shift-adds).
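Just to illustrate what that re-invented multiplication would amount to, here is a plain C sketch of the classic shift-and-add product (not a recommendation; the function name is made up):
/* 8-step shift-add product of two 8-bit values, using only AND, negation,
   shifts and adds; no faster than a hardware multiply on most CPUs. */
static unsigned int Mul8(unsigned int a, unsigned int b)
{
    unsigned int product = 0;
    for (int i = 0; i < 8; i++) {
        unsigned int mask = 0u - (a & 1u);  /* 0x00000000 or 0xFFFFFFFF */
        product += mask & (b << i);         /* add b << i only if bit i of the original a is set */
        a >>= 1;
    }
    return product;
}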
You can use the two methods mentioned in other answers: use 256 instead of 255, or use a lookup table. Both have issues, but you can mitigate them. It really depends on what architecture you're doing this on: the relative speed of multiply, divide, shift, add and memory loads. In any case:
Lookup table: a trivial 256x256 lookup table is 64 KB. This will thrash your data cache and end up being very slow. I wouldn't recommend it unless your CPU has an abysmally slow multiplier but does have low-latency RAM. You can improve performance by throwing away some alpha bits, e.g. A>>3, resulting in a 32x256 = 8 KB lookup, which has a better chance of fitting in the cache.
Use 256 instead of 255: the idea being that dividing by 256 is just a right shift by 8. This will be slightly off and tend to round down, darkening the image slightly: e.g. if R=255 and A=255, then (R*A)/256 = 254. You can cheat a little and do (R*A+R+A)/256, or just (R*A+R)/256 or (R*A+A)/256 = 255. Or scale A to 0..256 first, e.g. A = (256*A)/255. That's just one expensive divide-by-255 instead of 6. Then (R*A)/256 = 255 (see the sketch below).
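A minimal sketch of that last variant in plain C (the same shifts translate directly to C#; the function name is illustrative):
static unsigned int BlendShift(unsigned int fg, unsigned int bg, unsigned int alpha)
{
    /* one divide per pixel: rescale alpha from 0..255 to 0..256 */
    unsigned int a  = (alpha * 256u) / 255u;
    unsigned int ia = 256u - a;

    /* every per-channel divide is now a right shift by 8 */
    unsigned int r = ((((bg >> 16) & 0xFFu) * a) + (((fg >> 16) & 0xFFu) * ia)) >> 8;
    unsigned int g = ((((bg >>  8) & 0xFFu) * a) + (((fg >>  8) & 0xFFu) * ia)) >> 8;
    unsigned int b = (((bg & 0xFFu) * a) + ((fg & 0xFFu) * ia)) >> 8;

    return 0xFF000000u | (r << 16) | (g << 8) | b;
}
With a = 256 the background channel passes through unchanged, and with a = 0 the foreground does, so the endpoints stay exact.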
I don't think it can be done with the same precision using only those operators. Your best bet is, I reckon, a LUT (as long as the LUT fits in the CPU cache; otherwise it might even be slower).
// allocate the LUT (64KB)
unsigned char lut[256*256] __cacheline_aligned;  // __cacheline_aligned is a GCC-ism

// macro to access the LUT
#define LUT(pixel, alpha) (lut[(alpha)*256+(pixel)])

// precompute the LUT
for (int alpha_value=0; alpha_value<256; alpha_value++) {
    for (int pixel_value=0; pixel_value<256; pixel_value++) {
        LUT(pixel_value, alpha_value) = (unsigned char)((double)(pixel_value) * (double)(alpha_value) / 255.0);
    }
}

// in the loop
unsigned char ialpha = 255-alpha;
newR = LUT(pixel2_R, alpha) + LUT(pixel1_R, ialpha);
newG = LUT(pixel2_G, alpha) + LUT(pixel1_G, ialpha);
newB = LUT(pixel2_B, alpha) + LUT(pixel1_B, ialpha);
Otherwise, you should try vectorizing your code. But to do that, you should at least provide us with more info on your CPU architecture and compiler. Keep in mind that your compiler might be able to vectorize automatically if given the right options.