I am trying to compute local sum of each group by identify with 3d volume position and group ID.
My idea is divide space into groups and use atomic_add to compute local_sum.
But because I am new to parallel computing so it is kind of hard to find the correlation between codes and instructions.
My current kernel is like:
__kernel void TestAtomicAddLocal(__global *int src, int3 size, __global int *res)
{
int x = get_global_id(0);
int y = get_global_id(1);
int z = get_global_id(2);
if( x >= vol_dim.x || y >= vol_dim.y || z >= vol_dim.z ){ return; }
int id = x + y * vol_dim.x + z * vol_dim.x * vol_dim.y;
// local mem shared by all work items in work group,
//so this can be accessed by all items in current workgroup
__local int local_sum;
local_sum= 0;
// use global_id to access the value of input array
int local_offset = atomic_add(&local_sum, src[id]);
barrier(CLK_LOCAL_MEM_FENCE);
int global_offset = atomic_add(&num_verts[0], local_sum);
barrier(CLK_GLOBAL_MEM_FENCE);
}
For the host part, my setting is
enqueueNDrangeKernel( cq, kn_testAtomicAddLocal, 3, 0, cl::size3(256,256,256), cl::size3(64, 64, 64), 0, 0, 0);
For kenrnel arguments, the *src is cl_mem with size 256*256*256*sizeof(cl_int), size is 4 * sizeof(cl_int), and *res is cl_mem with size 4*sizeof(int).
Then I get error that CL_OUT_OF_RESOURCE and CL_INVALID_GROUP_SIZE, from my understanding, my device max group size is 1024, but here total group = (256/64)^3 = 64 < 1024.
My gpu max work item size is 1024x1024x64 which is also ok. So I think there must something I understand is wrong. I hope someone could help me out.
The max group size limits your 64 * 64 * 64 part.
And I guess you are using a CUDA card. You'd better use CUDA on CUDA cards. OpenCL is more or less emulated on CUDA cards. If you are not, I thinks all AMD cards have a group size limit of 256. edit: Um... I forgot about Intel ones. If so, ignore this part.
One more important thing, you'd better check some reduction logic implementation example on the internet first. Atomics are very expesive and using them like what you did would almost certain make your GPU code slower than CPU ones.
Related
I compute trajectories of N particles which move in their gravitation force field. I wrote the following OpenCL kernel:
#define G 100.0f
#define EPS 1.0f
float2 f (float2 r_me, __constant float *m, __global float2 *r, size_t s, size_t n)
{
size_t i;
float2 res = (0.0f, 0.0f);
for (i=1; i<n; i++) {
size_t idx = i;
// size_t idx = (i + s) % n;
float2 dir = r[idx] - r_me;
float dist = length (dir);
res += G*m[idx]/pown(dist + EPS, 3) * dir;
}
return res;
}
__kernel void take_step_rk2 (__constant float *m,
__global float2 *r,
__global float2 *v,
float delta)
{
size_t n = get_global_size(0);
size_t s = get_global_id(0);
float2 mv = f(r[s], m, r, s, n);
float2 mr = v[s];
float2 vpred1 = v[s] + mv * delta;
float2 rpred1 = r[s] + mr * delta;
float2 nv = f(rpred1, m, r, s, n);
float2 nr = vpred1;
barrier (CLK_GLOBAL_MEM_FENCE);
r[s] += (mr + nr) * delta / 2;
v[s] += (mv + nv) * delta / 2;
}
Then I run this kernel multiple times as one-dimensional problem with global work size = [number of bodies]:
void take_step (struct cl_state *state)
{
size_t n = state->nbodies;
clEnqueueNDRangeKernel (state->queue, state->step, 1, NULL, &n, NULL, 0, NULL, NULL);
clFinish (state->queue);
}
This is a quote from AMD OpenCL Optimization Guide (year 2015):
Under certain conditions, one unexpected case of a channel conflict is that reading from the same address is a conflict, even on the FastPath.
This does not happen on the read-only memories, such as constant buffers,
textures, or shader resource view (SRV); but it is possible on the read/write UAV
memory or OpenCL global memory.
Work items in my queue all try to get access to the same memory in this loop, so there must be a channel conflict:
for (i=1; i<n; i++) {
size_t idx = i;
// size_t idx = (i + s) % n;
float2 dir = r[idx] - r_me;
float dist = length (dir);
res += G*m[idx]/pown(dist + EPS, 3) * dir;
}
I replaced
size_t idx = i;
// size_t idx = (i + s) % n;
with
// size_t idx = i;
size_t idx = (i + s) % n;
so the first work item (with global id 0) firstly access the first element in array r, the second work item access the second element and so on.
I expected that this change must result in performance improvement, but to the contrary, it resulted in significant performance degradation (roughly by the factor of 2). What am I missing? Why all-to-the-same memory access it better in this situation?
If you have other tips how to improve the performance, please share with me. OpenCL optimization guide is very confusing.
The f function's loop does not have a barrier for reconvergence for coalesced access. Once some items get their r data, they start computing but those couldn't will wait their data hence, lose the coalescence integrity. To re-group them, add 1 barrier at least per 10 iterations or 2 iterations or maybe even every iteration. But accessing to global has high latency. Barrier + latency is bad for performance. You need local memory here since it has low latency and broadcasting ability which lets it lose coalescedness only on grains bigger than local thread number (64?) which is not bad for global memory access either(you need to fill local memory from global in every Kth iteration where N is divided into K sized groups).
A source from 2013 (
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf):
Thus, the key to effectively using the LDS is to control the access
pattern, so that accesses generated on the same cycle map to different
banks in the LDS. One notable exception is that accesses to the same
address (even though they have the same bits 6:2) can be broadcast to
all requestors and do not generate a bank conflict.
Using LDS(__local) for this will give good performance. Since LDS is small, you should do it in small patches like 256 particles at a time.
Also, using i as idx is very cache friendly but modulus version is very cache enemy. Once data can exist in cache, it doesn't matter if N requests are done. They come from cache now. But with modulus, you destroy cache ingredients before they are re-used, depending on N. For small N it should be faster as you foresee. For big N, and with small GPU cache, it would be much worse. Like only 1 global request per cycle versus N-cache_size global requests per cycle.
I guess with such strong GPU, you had a high N value such as 64k bodies which needed 2 variables per body and 4 bytes per variable totaling 512kB which can not fit L1. Maybe only L2 which is slower than idx=i through L1.
Answer:
all to same L1 cache adr is faster than all to global and L2 cache adr
use local memory in "blocking/patching" algorithm to achieve high speed
i am new in OpenCL and i am trying to compute histogram of grayscaled image. I am performing this computation on GPU nvidia GT 330M.
code is
__kernel void histogram(__global struct gray * input, __global int * global_hist, __local volatile int * histogram){
int local_offset = get_local_id(0) * 256;
int histogram_global_offset = get_global_id(0) * 256;
int offset = get_global_id(0) * 1920;
int value;
for(unsigned int i = 0; i < 256; i++){
histogram[local_offset + i] = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 0; i < 1920; i++){
value = input[offset + i].i;
histogram[local_offset + value]++;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 0; i < 256; i++){
global_hist[histogram_global_offset + i] = histogram[local_offset + i];
}
}
This computation is performed on image 1920*1080.
I am firing kernels with
queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange, cl::NDRange(1080), cl::NDRange(1));
When local size of histogram is set to 256 * sizeof(cl_int) speed of this computation is (through nvidia nsight performance analysis) 11 675 microseconds.
Because local workgroup size is set to one. I tried increase local workgroup size to 8. But when i increase local size of histogram to 256 * 8 * sizeof(cl_int) and compute with local wg size 1. I get 85 177 microseconds.
So when i fire it with 8 kernels per workgroup i dont get speedup from 11ms but from 85ms. So final speed with 8 kernels per worgroup is 13 714 microseconds.
But when i create computation bug, set local_offset to zero and size of local histogram is 256 * sizeof(cl_int) and use 8 kernels per workgroup i get much better time - 3 854 microsec.
Does anybody have some ideas to speed up this computation ?
Thanks!
This answer assumes you want to eventually reduce your histogram all the way down to 256 int values. You call the kernel with as many work groups as you have compute units on your device, and group size should be (as always) a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on the device.
__kernel void histogram(__global struct gray * input, __global int * global_hist){
int group_id = get_group_id(0);
int num_groups = get_num_groups(0);
int local_id = get_local_id(0);
int local_size = get_local_size(0);
volatile __local int histogram[256];
int i;
for(i=local_id; i<256; i+=local_size){
histogram[i] = 0;
}
int rowNum, colNum, value, global_hist_offset
for(rowNum = group_id; rowNum < 1080; rowNum+=num_groups){
for(colNum = local_id; colNum < 1920; colNum += local_size){
value = input[rowNum*1920 + colNum].i;
atomic_inc(histogram[input]);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
global_hist_offset = group_id * 256;
for(i=local_id; i<256; i+=local_size){
global_hist[global_hist_offset + i] = histogram[i];
}
}
Each work group works cooperatively on one row of the image at a time. Then the group moves on to another row, calculated using the num_groups value. This will work well no matter how many groups you have. For example, if you have 7 compute units, group 3 (the forth group) will start on row 3 in the image, and then every 7th row thereafter. Group 3 would compute 153 rows in total, and its final row would be row 1074. Some work groups may compute 1 more row -- groups 0 and 1 in this example.
The same interlacing is done within the work group when looking at the columns of the image. in the colNum loop, the Nth work item starts at column N, and skips ahead by local_size columns. The remainder for this loop shouldn't come in to play as often, because CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will likely be a factor of 1920. Try all work group sizes from (1..X) * CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE up to the maximum work group size for your device.
One final point about this kernel: the results are not identical to your original kernel. Your global_hist array is 1080 * 256 integers. The one I have needs to be num_groups * 256 integers. This helps if you want a full reduction, because there is much less to add after the kernel executes.
I am following along with Heterogeneous Computing with OpenCL and it is leaving me hanging.
They pass an image, as an array of floats, to enqueueWriteBuffer. I think the image, in this case, has no values for color. It is simply {col,row,col,row,col,row} e.g. {0,0,0,1,0,2,1,0,1,1,1,2...}.
but when they do enqueueReadBuffer the size they expect is HW and if you are going to do an array like I just did the array size would be HW*2.
// SETUP BUFFERS
Buffer d_ip = Buffer(context, CL_MEM_READ_ONLY, W*H*sizeof(float));
Buffer d_op = Buffer(context, CL_MEM_WRITE_ONLY, W*H*sizeof(float));
queue.enqueueWriteBuffer(d_ip, CL_TRUE, 0, W*H*sizeof(float), img); //img, what is img? the book just says it is my image.
// SETUP RANGES
NDRange globalws(W, H);
NDRange localws(16, 16);
// QUEUE AND READ
queue.enqueueNDRangeKernel(rotn_kernel, NullRange, globalws, localws);
queue.enqueueReadBuffer(d_op, CL_TRUE, 0, W*H*sizeof(float), img);
// X AND Y INSIDE THE KERNEL
const int x = get_global_id(0);
const int y = get_global_id(1);
If all of the new pixel coordinates are calculated in the kernel couldn't you just pass an empty float array of the appropriate size (WH apparently although I don't see how it isn't WH*2). But then I tried hard coding this (on a 500x300 image) and it blew up my stack.
It isn't size W*H*2 because they're probably not storing the data quite like you think. Usually, data of this nature is stored such that the first row of the data is stored in the first W entries, the second in the second W, etc.; this results in an array of size W*H. Thus, to get information about something in row X, column Y, you have to get the element at index (W * X) + Y
When writing my OpenCL code, I always treat each kernel as reading a 3D set of data, regardless if the data is 1D, 2D, or 3D:
__kernel void TestKernel(__global float *Data){
k = get_global_id(0); //also z
j = get_global_id(1); //also y
i = get_global_id(2); //also x
//Convert 3D to 1D
int linear_coord = i + get_global_size(0)*j + get_global_size(0)*get_global_size(1)*k;
//do stuff
}
When doing the clEnqueueNDKernelRange(...), just set the dimension to be:
int X = 500;
int Y = 300;
int Z = 1;
size_t GlobalDim = {Z, Y, X};
This let's all of my kernels work easily in all dimensions.
Your code doesn't call any clSetKernelArg, have you added these? Are the OpenCL functions kicking back any errors? You might want to take a step back and use the OpenCL C code instead of the C++ class.
I implemented a simple kernel which is some sort of a convolution. I measured it on NVIDIA GT 240. It took 70 ms when written on CUDA and 100 ms when written on OpenCL. Ok, I thought, NVIDIA compiler is better optimized for CUDA (or I'm doing something wrong).
I need to run it on AMD GPUs, so I migrated to AMD APP SDK. Exactly the same kernel code.
I made two tests and their results were discouraging for me: 200 ms at HD 6670 and 70 ms at HD 5850 (the same time as for GT 240 + CUDA). And I am very interested of the reasons of such strange behaviour.
All projects were built on VS2010 using settings from the sample projects of NVIDIA and AMD respectively.
Please, do not consider my post as NVIDIA advertisement. I fairly understand that HD 5850 is more powerful than GT 240. The only thing I wish to know is why such difference is and how to fix the problem.
Update. Below is the kernel code which looks for 6 equally sized template images in the base one. Every pixel of the base image is considered as a possible origin of one of the templates and is processed by a separate thread. The kernel compares R, G, B values of each pixel of the base image and of the template one, and if at least one difference exceeds diff parameter, the corresponding pixel is counted nonmatched. If the number of nonmatched pixels is less than maxNonmatchQt the corresponding template is hit.
__constant int tOffset = 8196; // one template size in memory (in bytes)
__kernel void matchImage6( __global unsigned char* image, // pointer to the base image
int imgWidth, // base image width
int imgHeight, // base image height
int imgPitch, // base image pitch (in bytes)
int imgBpp, // base image bytes (!) per pixel
__constant unsigned char* templates, // pointer to the array of templates
int tWidth, // templates width (the same for all)
int tHeight, // templates height (the same for all)
int tPitch, // templates pitch (in bytes, the same for all)
int tBpp, // templates bytes (!) per pixel (the same for all)
int diff, // max allowed difference of intensity
int maxNonmatchQt, // max number of nonmatched pixels
__global int* result, // results
) {
int x0 = (int)get_global_id(0);
int y0 = (int)get_global_id(1);
if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
return;
int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
for( int y = 0; y < tHeight; y++) {
int ind = y * tPitch;
int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
for( int x = 0; x < tWidth; x++) {
unsigned char c0 = image[baseImgInd];
unsigned char c1 = image[baseImgInd + 1];
unsigned char c2 = image[baseImgInd + 2];
for( int i = 0; i < 6; i++)
if( abs( c0 - templates[i * tOffset + ind]) > diff ||
abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
abs( c2 - templates[i * tOffset + ind + 2]) > diff)
nonmatchQt[i]++;
ind += tBpp;
baseImgInd += imgBpp;
}
if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt && nonmatchQt[2] > maxNonmatchQt && nonmatchQt[3] > maxNonmatchQt && nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
return;
}
for( int i = 0; i < 6; i++)
if( nonmatchQt[i] < maxNonmatchQt) {
unsigned int pos = atom_inc( &result[0]) * 3;
result[pos + 1] = i;
result[pos + 2] = x0;
result[pos + 3] = y0;
}
}
Kernel run configuration:
Global work size = (1900, 1200)
Local work size = (32, 8) for AMD and (32, 16) for NVIDIA.
Execution time:
HD 5850 - 69 ms,
HD 6670 - 200 ms,
GT 240 - 100 ms.
Any remarks about my code are also highly appreciated.
The difference in execution times is caused by compilers.
Your code can be easily vectorized. Consider image and templates as arrays of vector type char4 (forth coordinate of each char4 vector is always 0). Instead of 3 memory reads:
unsigned char c0 = image[baseImgInd];
unsigned char c1 = image[baseImgInd + 1];
unsigned char c2 = image[baseImgInd + 2];
use only one:
unsigned char4 c = image[baseImgInd];
Instead of bulky if:
if( abs( c0 - templates[i * tOffset + ind]) > diff ||
abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
abs( c2 - templates[i * tOffset + ind + 2]) > diff)
nonmatchQt[i]++;
use fast:
unsigned char4 t = templates[i * tOffset + ind];
nonmatchQt[i] += any(abs_diff(c,t)>diff);
Thus you increase performance of your code up to 3 times (if compiler doesn't vectorize the code by itself). I suppose that AMD OpenCL compiler does not such vectorization and other optimizations. From my experience OpenCL on NVIDIA GPU usually can be made faster than CUDA, because it is more low-level.
There can be no exact perfect answer for this. OpenCL performance depends on many parameters. The number of access to global memory, efficiency of the code etc. Moreover its very difficult compare between two device since they might be having different local, global, constant memories. Number of cores, frequency, memory bandwidth, more importantly the hardware architecture etc.
Each hardware provides their own performance boost, for example native_ from NVIDIA. So you need to explore more regarding the hardware on which you are working, that might actually work. But what I would recommend personally is not to use such hardware specific optimizations it might effect the flexibility of your code.
You can also find some papers published that shows, CUDA performance is much better than the OpenCL performance on same NVIDIA hardware.
So its always better to write code which provides good flexibility, rather than device specific optimizations.
I have taken the Kernel from the great OpenCL SpMV article for AMD by Bryan Catanzaro.
I have given it a toy problem where the input is
A= [0 0 0 6 1 3 5 7 2 4 0 0]
offsets= [-3 0 2]
x= [1 2 3 4]
and the output y should be [7 22 15 34]
Here is the kernel:
__kernel
void dia_spmv(__global float *A, __const int rows,
__const int diags, __global int *offsets,
__global float *x, __global float *y) {
int row = get_global_id(0);
float accumulator = 0;
for(int diag = 0; diag < diags; diag++) {
int col = row + offsets[diag];
if ((col >= 0) && (col < rows)) {
float m = A[diag*rows + row];
float v = x[col];
accumulator += m * v;
}
}
y[row] = accumulator;
}
After loading and writing the input arguments I execute the kernel like this:
size_t global_work_size;
global_work_size = 4;
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global_work_size,NULL, 0, NULL, NULL);
err = clFinish(cmd_queue);
And I get the correct result when I read y back from gpu memory.
I.e. I get y = [7 22 15 34]
I am new to OpenCL (and GPGPU in general) so I want to try and understand how to extend the problem correctly for much larger matrices of arbitrary dimension.
So lets say I have 1000 000 rows. What should I set global_work_size to be?
And should I set local_work_size or should I leave it as NULL?
To use the kernel for arbitrary matrix sizes you should think about the problem and rewrite the kernel. The issue is the limited memory size of the GPU and limited size for a single buffer. You can get the maximum size for a buffer with clGetDeviceInfo and CL_DEVICE_MAX_MEM_ALLOC_SIZE.
You need to split your problem into smaller pieces. Calculate them separately and merge the results afterwards.
I do not know the problem above and can not give you any hint which helps you to implement this. I can only give you the general direction.