This kernel works fine:
__kernel void test(__global float* a_Direction, __global float* a_Output, const unsigned int a_Count)
{
    int index = get_global_id(0);
    if (index < a_Count)
    {
        a_Output[index * 3 + 0] = a_Direction[index * 3 + 0] * 0.5f + 0.5f;
        a_Output[index * 3 + 1] = a_Direction[index * 3 + 1] * 0.5f + 0.5f;
        a_Output[index * 3 + 2] = a_Direction[index * 3 + 2] * 0.5f + 0.5f;
    }
}
This kernel produces out of bounds errors:
__kernel void test(__global float3* a_Direction, __global float3* a_Output, const unsigned int a_Count)
{
    int index = get_global_id(0);
    if (index < a_Count)
    {
        a_Output[index].x = a_Direction[index].x * 0.5f + 0.5f;
        a_Output[index].y = a_Direction[index].y * 0.5f + 0.5f;
        a_Output[index].z = a_Direction[index].z * 0.5f + 0.5f;
    }
}
To me it seems like they should both do the exact same thing.
But for some reason only one of the two works.
Am I missing something obvious?
The exact error is: "CL_OUT_OF_RESOURCES error executing CL_COMMAND_READ_BUFFER on GeForce GTX580M (Device 0)."
@arsenm in his/her answer as well as @Darkzeros gave the proper explanation, but I feel it is interesting to develop it a bit. The problem is that in the second kernel there is a "hidden" alignment that happens. As the standard states in section 6.1.5:
For 3-component vector data types, the size of the data type is 4 *
sizeof(component). This means that a 3-component vector data type will
be aligned to a 4 * sizeof(component) boundary.
Let's illustrate that with an example:
Assume that a_Direction is made of 9 floats and that you use 3 threads/work-items to process these elements. In the first kernel there is no problem: thread 0 handles the elements with indexes 0, 1, 2, thread 1 the elements 3, 4, 5 and finally thread 2 the elements 6, 7, 8. Everything is fine.
However, for the second kernel, assuming the data structure stays the same from the host's point of view (i.e. an array indexed from 0 to 8), thread 0 will handle elements 0, 1, 2 (and will also touch element 3, because a float3 vector is laid out like a float4 vector, without doing anything with it). The second thread, i.e. thread 1, won't access elements 3, 4, 5 but elements 4, 5, 6 (and 7, again without doing anything with it).
Therefore, and this is where the problem arises, thread 2 will try to access elements 8, 9, 10 (and 11), hence the out-of-bounds access.
To summarize, a vector of 3 elements is stored like a vector of 4 elements.
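The index arithmetic from the example above can be sketched on the host side in plain C. This is only an illustration: the function names are made up for this sketch and are not part of OpenCL; they just encode the "4 floats per float3" stride quoted from the standard versus the tightly packed stride of 3.

```c
#include <stddef.h>

/* Hypothetical helper: first float touched by work-item `index` when the
   buffer is addressed as float3*. Each float3 occupies 4 floats. */
size_t float3_first_float(size_t index) {
    return index * 4;
}

/* Hypothetical helper: first float touched when the buffer is addressed
   as tightly packed floats (index * 3 + 0, as in the first kernel). */
size_t packed_first_float(size_t index) {
    return index * 3;
}

/* Number of floats the device side assumes for `count` float3 elements. */
size_t float3_buffer_floats(size_t count) {
    return count * 4;
}
```

With 3 work-items, the last one starts at float 8 and its padded slot ends at float 11, while a packed host array of 9 floats ends at index 8.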
Now, if you want to use vectors without changing your data structure on the host side, you can use the vload3 and vstore3 functions as described in the section 3.12.7. of the standard. These take pointers to the scalar type, so a_Direction and a_Output are declared as __global float* again:
vstore3(vload3(index, a_Direction) * 0.5f + 0.5f, index, a_Output);
BTW, you don't have to bother with statements like (assuming a proper alignment):
a_Output[index].x = a_Direction[index].x * 0.5f + 0.5f;
a_Output[index].y = a_Direction[index].y * 0.5f + 0.5f;
a_Output[index].z = a_Direction[index].z * 0.5f + 0.5f;
This statement is enough (no need to write a line for every element):
a_Output[index] = a_Direction[index] * 0.5f + 0.5f;
The problem you're probably having is that you've allocated a buffer of n * 3 * sizeof(float) bytes for your float3s, but the size and alignment of a float3 is 16 bytes, not 12.
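If you prefer to keep float3 in the kernel, one option is to repack the host data to a 4-float stride before uploading it. Below is a minimal sketch, assuming the host data is a tightly packed x, y, z array; pad_xyz_to_float3 is a hypothetical name, not an OpenCL API.

```c
#include <stddef.h>

/* Copy `count` tightly packed xyz triples into a buffer with a stride of
   4 floats, which is the layout a float3* kernel argument expects.
   `padded` must hold count * 4 floats. */
void pad_xyz_to_float3(const float *packed, float *padded, size_t count) {
    for (size_t i = 0; i < count; i++) {
        padded[4 * i + 0] = packed[3 * i + 0];
        padded[4 * i + 1] = packed[3 * i + 1];
        padded[4 * i + 2] = packed[3 * i + 2];
        padded[4 * i + 3] = 0.0f; /* padding slot, ignored by the kernel */
    }
}
```

The device buffer would then be created with count * 4 * sizeof(float) bytes, and the output has to be unpacked the same way after reading it back.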
I'm currently writing a simple SVG interpreter (not exhaustive of course, this would be pretty complicated), and I find myself struggling with the <feTurbulence> filter.
Basically, there are two types of noise that may be generated:
fractalNoise is actually pretty simple to generate, since the original Perlin noise generator can be used (sample code may be found here for example).
turbulence is the default filter, and at first I thought that the only important factor would be that the noise is divergence-free, so I tried applying ∇ x f to a vector field with only the z component non-null, to get a divergence-free 2D random vector field (since ∇.(∇ x f) = 0). The issue is that the resulting noise looks nothing like it's supposed to.
Here is what Chrome generates using fractalNoise and turbulence (from here):
[Image: the images generated by Chrome]
And here are the noises I'm able to generate with the math I described above:
[Image: my fractal noise]
[Image: my first try at turbulent noise]
There are a lot of issues here. First off: in the noise generated by the SVG interpreter, I'm not sure I understand what the colors represent. The fractal noise seems "brighter" than mine, although they both share the same "quality" (mine goes from 0 to 1 on all color channels).
As for the turbulent noise, generating a 2D field means that I only have 2 components to work with, not 3, but most importantly I'm certain that this is not how the actual turbulent noise is generated. Does anyone know what the math behind this noise generator is?
Thanks !
For completeness, here is the GLSL code I'm using to generate octaved Perlin noise (modified from rosettacode; no normalization/parametrization has been done yet):
uniform float time;
// Rotation matrix for `angle` radians around the unit vector `axis`.
mat3 rotation(float angle, vec3 axis){
    float c = cos(angle);
    float s = sin(angle);
    float t = 1.0 - c;
    float x = axis.x;
    float y = axis.y;
    float z = axis.z;
    return mat3(
        vec3(t * x * x + c, t * x * y - s * z, t * x * z + s * y),
        vec3(t * x * y + s * z, t * y * y + c, t * y * z - s * x),
        vec3(t * x * z - s * y, t * y * z + s * x, t * z * z + c)
    );
}
float rand(vec3 c){
    return fract(sin(dot(c, vec3(12.9898, 78.233, 32.43983))) * 43758.5453);
}
float noise(vec3 p, float freq){
    float unit = 1000.0/freq;
    // Rotate the lattice by a fixed pseudo-random rotation.
    p = rotation(rand(vec3(23.473, -128.437, 23.439)) * 6.28318531,
        normalize(vec3(rand(vec3(-2, 3, 1)), rand(vec3(23.2, 47.3, 82.8)), rand(vec3(-239.3, -4.3, 2.59))))) * p;
    vec3 ij = floor(p/unit);
    vec3 xy = mod(p,unit)/unit;
    // Cosine interpolation weights.
    xy = .5*(1.-cos(3.1415926535*xy));
    // Random values at the 8 corners of the lattice cell.
    float a = rand((ij+vec3(0.,0.,0.)));
    float b = rand((ij+vec3(1.,0.,0.)));
    float c = rand((ij+vec3(0.,1.,0.)));
    float d = rand((ij+vec3(1.,1.,0.)));
    float e = rand((ij+vec3(0.,0.,1.)));
    float f = rand((ij+vec3(1.,0.,1.)));
    float g = rand((ij+vec3(0.,1.,1.)));
    float h = rand((ij+vec3(1.,1.,1.)));
    // Trilinear blend.
    float x1 = mix(a, b, xy.x);
    float x2 = mix(c, d, xy.x);
    float x3 = mix(e, f, xy.x);
    float x4 = mix(g, h, xy.x);
    float y1 = mix(x1, x2, xy.y);
    float y2 = mix(x3, x4, xy.y);
    return mix(y1, y2, xy.z);
}
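float pNoise(vec3 p, int res){
    float persistance = .5;
    float n = 0.;
    float normK = 0.;
    float f = 4.;
    float amp = 1.;
    int iCount = 0;
    for (int i = 0; i<50; i++){
        n+=amp*noise(p, f);
        // Assumed: the per-octave updates were missing from the listing;
        // without them normK stays 0 and iCount never reaches res.
        f*=2.;
        normK+=amp;
        amp*=persistance;
        if (iCount == res) break;
        iCount++;
    }
    float nf = n/normK;
    return nf;
}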
void main(){
    gl_FragColor = vec4(
        pNoise(vec3(gl_FragCoord.xy + vec2(57349.4387, -1271.45738), time), 4),
        pNoise(vec3(gl_FragCoord.xy + vec2(9453.32748, 23875.43473), time), 4),
        pNoise(vec3(gl_FragCoord.xy + vec2(-28574.323, 125943457.3), time), 4), 1.0);
}
After further testing and checking out the code, it really seems that the actual turbulence is just Perlin noise in the [-1, 1] range whose absolute value is then taken.
For reference, here is the image I'm getting with this method:
[Image: the result of my turbulence filter using an absolute value]
And a turbulent displacement example from my code:
[Image: a turbulence displacement example]
Thanks !
The code below (rewritten in C#) is used to compress a unit normal vector and comes from Wild Magic 5.17. Could someone explain the math behind it or share some related references? I can figure out the octant bit setting, but the mantissa packing and unpacking seem complex ...
Code gist
Part of the code is here:
// ...
public static ushort CompressNormal(Vector3 normal)
{
    var x = normal.x;
    var y = normal.y;
    var z = normal.z;
    Debug.Assert(MathUtil.IsSame(x * x + y * y + z * z, 1));
    // Determine octant.
    ushort index = 0;
    if (x < 0.0)
    {
        index |= 0x8000;
        x = -x;
    }
    if (y < 0.0)
    {
        index |= 0x4000;
        y = -y;
    }
    if (z < 0.0)
    {
        index |= 0x2000;
        z = -z;
    }
    // Determine mantissa.
    ushort usX = (ushort)Mathf.Floor(gsFactor * x);
    ushort usY = (ushort)Mathf.Floor(gsFactor * y);
    ushort mantissa = (ushort)(usX + ((usY * (255 - usY)) >> 1));
    index |= mantissa;
    return index;
}
// ...
The author wanted to use 13 bits.
The trivial way, 6 bits for the x component plus 6 bits for y, occupies only 12 bits, so he devised an approach that assigns ~90 units (LSB side) to x and ~90 units (MSB side) to y (90*90 ~ 2^13).
I have no idea why he uses a quadratic formula for the y component. It gives a slightly different distribution of approximated values between smaller and larger values, but why specifically for y?
I asked Mr. Eberly (the author of Wild Magic) and he gave me the reference. In short, the code above maps (x, y) to an index into a triangular array (the index runs from 0 to N * (N + 1) / 2 - 1).
More details are in the related doc here.
By the way, there is another solution here with a different compression method.
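The triangular-array mapping can be sketched in C as follows. This is only an illustration under the assumption that N = 127, so that 2N + 1 matches the constant 255 in the mantissa line; the tri_* names are made up for this sketch, and the unpacking uses a simple row search rather than whatever the library does. Row y holds N - y entries, so row y starts at offset y * (2N + 1 - y) / 2, and x selects the entry within the row.

```c
enum { TRI_N = 127 }; /* assumed: 2 * TRI_N + 1 == 255 as in the code above */

/* Index of the first entry of row y; row k holds TRI_N - k entries. */
int tri_row_offset(int y) {
    return (y * (2 * TRI_N + 1 - y)) >> 1;
}

/* Pack (x, y) with x + y <= TRI_N - 1 into a single triangular index. */
int tri_pack(int x, int y) {
    return x + tri_row_offset(y);
}

/* Recover (x, y) from an index by finding the row that contains it. */
void tri_unpack(int index, int *x, int *y) {
    int row = 0;
    while (row + 1 < TRI_N && tri_row_offset(row + 1) <= index) {
        row++;
    }
    *y = row;
    *x = index - tri_row_offset(row);
}
```

The largest index is N * (N + 1) / 2 - 1 = 8127, which fits in the 13 bits left after the three octant bits.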
Given positive-integer inputs x and y, is there a mathematical formula that will return 1 if x==y and 0 otherwise? I am in the unfortunate position of having to use a tool that only allows me to use the following symbols: numerals 0-9; decimal point .; parentheses ( and ); and the four basic arithmetic operations +, -, /, and *.
Currently I am relying on the fact that the tool evaluates division by zero to be zero. (I can't tell if this is a bug or a feature.) Because of this, I have been able to use ((x-y)/(y-x))+1. Obviously, this is ugly and not ideal, especially if it is a bug and they fix it in a future version.
Taking advantage of the fact that integer division in C truncates toward 0, the following works well. There is no multiplication overflow, and it is well defined for all "positive-integer inputs x and y".
(x/y) * (y/x)
#include <stdio.h>
#include <limits.h>
/* Print a line only when (x/y) * (y/x) disagrees with x == y. */
void etest(unsigned x, unsigned y) {
    unsigned ref = x == y;
    unsigned z = (x/y) * (y/x);
    if (ref != z) {
        printf("%u %u %u %u\n", x,y,z,ref);
    }
}
void etests(void) {
    unsigned list[] = { 1,2,3,4,5,6,7,8,9,10,100,1000, UINT_MAX/2 , UINT_MAX - 1, UINT_MAX };
    for (unsigned x = 0; x < sizeof list/sizeof list[0]; x++) {
        for (unsigned y = 0; y < sizeof list/sizeof list[0]; y++) {
            etest(list[x], list[y]);
        }
    }
}
int main(void) {
    etests();
    return 0;
}
Output (No difference from x == y)
If division is truncating and the numbers are not too big, then:
((x - y) ^ 2 + 2) / ((x - y) ^ 2 + 1) - 1
The division has the value 2 if x = y and otherwise truncates to 1.
(Here x^2 is an abbreviation for x*x.)
This will fail if (x-y)^2 overflows. In that case, you need to independently check x/k = y/k and x%k = y%k where (k-1)*(k-1) doesn't overflow (which will work if k is ceil(sqrt(INT_MAX))). x%k can be computed as x-k*(x/k) and A&&B is simply A*B.
That will work for any x and y in the range [-k*k, k*k].
A slightly incorrect computation, using lots of intermediate values, which assumes that x - y won't overflow (or at least that the overflow won't produce a false 0).
int delta = x - y;
int delta_hi = delta / K;
int delta_lo = delta - K * delta_hi;
int equal_hi = (delta_hi * delta_hi + 2) / (delta_hi * delta_hi + 1) - 1;
int equal_lo = (delta_lo * delta_lo + 2) / (delta_lo * delta_lo + 1) - 1;
int equals = equal_hi * equal_lo;
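or written out in full:
((((x-y)/K)*((x-y)/K)+2)/(((x-y)/K)*((x-y)/K)+1)-1) * ((((x-y)-K*((x-y)/K))*((x-y)-K*((x-y)/K))+2)/(((x-y)-K*((x-y)/K))*((x-y)-K*((x-y)/K))+1)-1)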
(For signed 31-bit integers, use K=46341; for unsigned 32-bit integers, 65536.)
Checked with @chux's test harness, adding the 0 case: live on coliru and with negative values also on coliru.
On a platform where integer subtraction might produce something other than the 2s-complement wraparound, a similar technique could be used, but dividing the numbers into three parts instead of two.
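For reference, the split computation can be wrapped in a small C function. This is a sketch under the same assumptions as above: 32-bit int, x - y does not overflow, and K = 46341; equals_split is just a name chosen here.

```c
/* Returns 1 if x == y and 0 otherwise, using only +, -, * and truncating /.
   Assumes 32-bit int and that x - y does not overflow. */
int equals_split(int x, int y) {
    const int K = 46341;                 /* ceil(sqrt(INT_MAX)) for 32-bit int */
    int delta = x - y;
    int delta_hi = delta / K;            /* |delta_hi| <= 46340 */
    int delta_lo = delta - K * delta_hi; /* remainder, |delta_lo| < K */
    /* (d*d + 2) / (d*d + 1) is 2 when d == 0 and truncates to 1 otherwise. */
    int equal_hi = (delta_hi * delta_hi + 2) / (delta_hi * delta_hi + 1) - 1;
    int equal_lo = (delta_lo * delta_lo + 2) / (delta_lo * delta_lo + 1) - 1;
    return equal_hi * equal_lo;          /* both parts must be zero */
}
```

Splitting delta keeps each squared term below INT_MAX, which is why the product of the two partial tests is used instead of squaring delta directly.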
So the problem is that if they fix division by zero, it means that you cannot use any divisor that contains input variables anymore (you'd have to check that the divisor != 0, and implementing that check would solve the original x-y == 0 problem!); hence, division cannot be used at all.
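Ergo, only +, -, * and the grouping operator () can be used. With only these operators, the desired behaviour cannot be implemented: any such expression is a polynomial p(x, y), and for a fixed y it would have to be 0 at infinitely many x != y, which forces it to be identically 0 in x, so it cannot return 1 at x = y.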
I am new to OpenCL and I am trying to compute the histogram of a grayscale image. I am performing this computation on an NVIDIA GT 330M GPU.
The code is:
__kernel void histogram(__global struct gray * input, __global int * global_hist, __local volatile int * histogram){
    int local_offset = get_local_id(0) * 256;
    int histogram_global_offset = get_global_id(0) * 256;
    int offset = get_global_id(0) * 1920;
    int value;
    for(unsigned int i = 0; i < 256; i++){
        histogram[local_offset + i] = 0;
    }
    for(unsigned int i = 0; i < 1920; i++){
        value = input[offset + i].i;
        histogram[local_offset + value]++;
    }
    for(unsigned int i = 0; i < 256; i++){
        global_hist[histogram_global_offset + i] = histogram[local_offset + i];
    }
}
This computation is performed on a 1920*1080 image.
I am launching the kernel with
queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange, cl::NDRange(1080), cl::NDRange(1));
When the local histogram size is set to 256 * sizeof(cl_int), the computation takes 11 675 microseconds (measured with NVIDIA Nsight performance analysis).
Since the local work-group size is set to one, I tried increasing it to 8. But when I increase the local histogram size to 256 * 8 * sizeof(cl_int) and still compute with a local work-group size of 1, I get 85 177 microseconds.
So when I launch it with 8 work-items per work-group, the speedup is relative to 85 ms, not 11 ms. The final time with 8 work-items per work-group is 13 714 microseconds.
But when I deliberately introduce a bug in the computation by setting local_offset to zero, with a local histogram of size 256 * sizeof(cl_int) and 8 work-items per work-group, I get a much better time: 3 854 microseconds.
Does anybody have some ideas to speed up this computation?
This answer assumes you want to eventually reduce your histogram all the way down to 256 int values. You call the kernel with as many work groups as you have compute units on your device, and group size should be (as always) a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on the device.
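__kernel void histogram(__global struct gray * input, __global int * global_hist){
    int group_id = get_group_id(0);
    int num_groups = get_num_groups(0);
    int local_id = get_local_id(0);
    int local_size = get_local_size(0);
    volatile __local int histogram[256];
    int i;
    // Zero the shared local histogram cooperatively.
    for(i=local_id; i<256; i+=local_size){
        histogram[i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE); // all bins cleared before counting starts
    int rowNum, colNum, value, global_hist_offset;
    for(rowNum = group_id; rowNum < 1080; rowNum+=num_groups){
        for(colNum = local_id; colNum < 1920; colNum += local_size){
            value = input[rowNum*1920 + colNum].i;
            // Assumed: the increment was missing from the listing. Work-items
            // share the bins, so it has to be atomic (atom_inc on OpenCL 1.0).
            atomic_inc(&histogram[value]);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE); // all counts are in before copying out
    global_hist_offset = group_id * 256;
    for(i=local_id; i<256; i+=local_size){
        global_hist[global_hist_offset + i] = histogram[i];
    }
}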
Each work group works cooperatively on one row of the image at a time. Then the group moves on to another row, calculated using the num_groups value. This will work well no matter how many groups you have. For example, if you have 7 compute units, group 3 (the fourth group) will start on row 3 in the image, and then every 7th row thereafter. Group 3 would compute 154 rows in total, and its final row would be row 1074. Some work groups may compute 1 more row -- groups 0 and 1 in this example.
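The row distribution can be checked with a few lines of host-side C. This is just a sketch of the arithmetic; the function names are made up here.

```c
/* Number of rows visited by `group_id` when rows are interleaved across
   `num_groups` work-groups: group_id, group_id + num_groups, ... */
int rows_for_group(int group_id, int num_groups, int total_rows) {
    if (group_id >= total_rows) {
        return 0;
    }
    return (total_rows - 1 - group_id) / num_groups + 1;
}

/* Last row visited by `group_id`, or -1 if it gets no rows at all. */
int last_row_for_group(int group_id, int num_groups, int total_rows) {
    int n = rows_for_group(group_id, num_groups, total_rows);
    return n == 0 ? -1 : group_id + (n - 1) * num_groups;
}
```

For 1080 rows and 7 groups, groups 0 and 1 get 155 rows and the other five get 154, which adds up to 1080.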
The same interlacing is done within the work group when looking at the columns of the image. In the colNum loop, the Nth work item starts at column N and skips ahead by local_size columns. The remainder for this loop shouldn't come into play as often, because CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will likely be a factor of 1920. Try all work group sizes from (1..X) * CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE up to the maximum work group size for your device.
One final point about this kernel: the results are not identical to your original kernel. Your global_hist array is 1080 * 256 integers. The one I have needs to be num_groups * 256 integers. This helps if you want a full reduction, because there is much less to add after the kernel executes.
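The remaining reduction on the host is then a short loop. A minimal sketch, assuming the num_groups * 256 layout described above (one block of 256 bins per work group); reduce_histograms is a hypothetical helper name.

```c
#define HIST_BINS 256

/* Sum `num_groups` partial histograms, stored back to back in `partial`,
   into the HIST_BINS entries of `final_hist`. */
void reduce_histograms(const int *partial, int num_groups, int *final_hist) {
    for (int b = 0; b < HIST_BINS; b++) {
        final_hist[b] = 0;
    }
    for (int g = 0; g < num_groups; g++) {
        for (int b = 0; b < HIST_BINS; b++) {
            final_hist[b] += partial[g * HIST_BINS + b];
        }
    }
}
```

With a handful of groups this is only a few thousand additions, compared with 1080 * 256 values to add for the original layout.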
I implemented a simple kernel which is some sort of a convolution. I measured it on NVIDIA GT 240. It took 70 ms when written on CUDA and 100 ms when written on OpenCL. Ok, I thought, NVIDIA compiler is better optimized for CUDA (or I'm doing something wrong).
I need to run it on AMD GPUs, so I migrated to AMD APP SDK. Exactly the same kernel code.
I ran two tests and the results were discouraging: 200 ms on the HD 6670 and 70 ms on the HD 5850 (the same time as for GT 240 + CUDA). I am very interested in the reasons for such strange behaviour.
All projects were built on VS2010 using settings from the sample projects of NVIDIA and AMD respectively.
Please do not consider my post an NVIDIA advertisement. I fully understand that the HD 5850 is more powerful than the GT 240. The only thing I wish to know is why there is such a difference and how to fix the problem.
Update. Below is the kernel code which looks for 6 equally sized template images in the base one. Every pixel of the base image is considered as a possible origin of one of the templates and is processed by a separate thread. The kernel compares R, G, B values of each pixel of the base image and of the template one, and if at least one difference exceeds diff parameter, the corresponding pixel is counted nonmatched. If the number of nonmatched pixels is less than maxNonmatchQt the corresponding template is hit.
__constant int tOffset = 8196; // one template size in memory (in bytes)
__kernel void matchImage6( __global unsigned char* image, // pointer to the base image
        int imgWidth, // base image width
        int imgHeight, // base image height
        int imgPitch, // base image pitch (in bytes)
        int imgBpp, // base image bytes (!) per pixel
        __constant unsigned char* templates, // pointer to the array of templates
        int tWidth, // templates width (the same for all)
        int tHeight, // templates height (the same for all)
        int tPitch, // templates pitch (in bytes, the same for all)
        int tBpp, // templates bytes (!) per pixel (the same for all)
        int diff, // max allowed difference of intensity
        int maxNonmatchQt, // max number of nonmatched pixels
        __global int* result // results
        ) {
    int x0 = (int)get_global_id(0);
    int y0 = (int)get_global_id(1);
    // The template would not fit at this origin (assumed: the return was missing from the listing).
    if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
        return;
    int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
    for( int y = 0; y < tHeight; y++) {
        int ind = y * tPitch;
        int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
        for( int x = 0; x < tWidth; x++) {
            unsigned char c0 = image[baseImgInd];
            unsigned char c1 = image[baseImgInd + 1];
            unsigned char c2 = image[baseImgInd + 2];
            for( int i = 0; i < 6; i++)
                if( abs( c0 - templates[i * tOffset + ind]) > diff ||
                    abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
                    abs( c2 - templates[i * tOffset + ind + 2]) > diff)
                    nonmatchQt[i]++; // assumed: the counter update was missing from the listing
            ind += tBpp;
            baseImgInd += imgBpp;
        }
    }
    // No template matched at this origin (assumed: the return was missing from the listing).
    if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt && nonmatchQt[2] > maxNonmatchQt && nonmatchQt[3] > maxNonmatchQt && nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
        return;
    for( int i = 0; i < 6; i++)
        if( nonmatchQt[i] < maxNonmatchQt) {
            // result[0] counts the hits; each hit stores (template, x, y).
            unsigned int pos = atom_inc( &result[0]) * 3;
            result[pos + 1] = i;
            result[pos + 2] = x0;
            result[pos + 3] = y0;
        }
}
Kernel run configuration:
Global work size = (1900, 1200)
Local work size = (32, 8) for AMD and (32, 16) for NVIDIA.
Execution time:
HD 5850 - 69 ms,
HD 6670 - 200 ms,
GT 240 - 100 ms.
Any remarks about my code are also highly appreciated.
The difference in execution times is caused by compilers.
Your code can be easily vectorized. Consider image and templates as arrays of the vector type uchar4 (the fourth coordinate of each uchar4 vector is always 0), i.e. pointers declared as uchar4* and indexed per pixel. Instead of 3 memory reads:
unsigned char c0 = image[baseImgInd];
unsigned char c1 = image[baseImgInd + 1];
unsigned char c2 = image[baseImgInd + 2];
use only one:
uchar4 c = image[baseImgInd];
Instead of the bulky if:
if( abs( c0 - templates[i * tOffset + ind]) > diff ||
abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
abs( c2 - templates[i * tOffset + ind + 2]) > diff)
use the fast:
uchar4 t = templates[i * tOffset + ind];
nonmatchQt[i] += any(abs_diff(c,t) > (uchar)diff);
This can increase the performance of your code by up to 3 times (if the compiler doesn't vectorize the code by itself). I suppose that the AMD OpenCL compiler does not perform such vectorization and other optimizations. In my experience, OpenCL on an NVIDIA GPU can usually be made faster than CUDA, because it is more low-level.
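To check the logic on the host, the vectorized test can be emulated in plain C. This is a scalar sketch of what any(abs_diff(c, t) > diff) computes for one 4-byte pixel, not OpenCL code; pixel_mismatch is a name invented here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if any of the 4 channels of pixels `a` and `b` differ by more
   than `diff`, 0 otherwise. The fourth channel is assumed to be 0 in both,
   so it never triggers a mismatch. */
int pixel_mismatch(const uint8_t a[4], const uint8_t b[4], int diff) {
    int mismatch = 0;
    for (int k = 0; k < 4; k++) {
        mismatch |= abs((int)a[k] - (int)b[k]) > diff;
    }
    return mismatch;
}
```

The per-template counter is then incremented by this 0/1 value for each pixel, exactly as in the scalar if version.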
There is no single exact answer to this. OpenCL performance depends on many parameters: the number of accesses to global memory, the efficiency of the code, and so on. Moreover, it is very difficult to compare two devices, since they may have different local, global, and constant memories, as well as different numbers of cores, clock frequencies, memory bandwidth and, more importantly, hardware architecture.
Each kind of hardware offers its own performance boosts, for example the native_ functions on NVIDIA. So you need to explore the hardware you are working on; that might actually help. But personally I would recommend not using such hardware-specific optimizations, since they might affect the flexibility of your code.
You can also find published papers showing that CUDA performance is much better than OpenCL performance on the same NVIDIA hardware.
So it's always better to write code that provides good flexibility rather than device-specific optimizations.