declaring and defining pointer vetors of vectors in OpenCL Kernel - opencl

I have a variable which is vector of vector, And in c++, I am easily able to define and declare it but in OpenCL Kernel, I am facing the issues. Here is an example of what I am trying to do.
std::vector<vector <double>> filter;
for (int m= 0;m<3;m++)
{
const auto& w = filters[m];
-------sum operation using w
}
Now Here, I can easily referencing the values of filters[m] in w, but I am not able to do this OpenCl kernel file. Here is what I have tried,but it is giving me wrong output.
In host code:-
filter_dev = cl::Buffer(context,CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR,filter_size,(void*)&filters,&err);
filter_dev_buff = cl::Buffer(context,CL_MEM_READ_WRITE,filter_size,NULL,&err);
kernel.setArg(0, filter_dev);
kernel.setArg(1, filter_dev_buff);
In kernel code:
__kernel void forward_shrink(__global double* filters,__global double* weight)
{
int i = get_global_id[0]; // I have tried to use indiviadual values of i in filters j, just to check the output, but is not giving the same values as in serial c++ implementation
weight = &filters[i];
------ sum operations using weight
}
Can anyone help me? Where I am wrong or what can be the solution?

You are doing multiple things wrong with your vectors.
First of all (void*)&filters doesn't do what you want it to do. &filters doesn't return a pointer to the beginning of the actual data. For that you'll have to use filters.data().
Second you can't use an array of arrays in OpenCL (or vector of vectors even less). You'll have to flatten the array yourself to a 1D array before you pass it to a OpenCL kernel.

Related

Passing values between variables of host and kernel Code in a loop in OpenCL

I am in trouble passing values between host code and kernel code due to some vector data types. The following code/explanation is just for referencing my problem, my code is much bigger and complicated. With this small example, hopefully, I will be able to explain where I am having a problem. I f anything more needed please let me know.
std::vector<vector<double>> output;
for (int i = 0;i<2; i++)
{
auto& out = output[i];
sum =0;
for (int l =0;l<3;l++)
{
for (int j=0;j<4; j++)
{
if (some condition is true)
{ out[j+l] = 0.;}
sum+= .....some addition...
}
out[j+l] = sum
}
}
Now I want to parallelize this code, from the second loop. This is what I have done in host code:
cl::buffer out = (context,CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, output.size(), &output, NULL)
Then, I have set the arguments
cl::SetKernelArg(0, out);
Then the loop,
for (int i = 0,i<2, i++)
{
auto& out = output[i];
// sending some more arguments(which are changing accrding to loop) for sum operations
queue.enqueueNDRangeKernel(.......)
queue.enqueuereadbuffer(.....,&out,...)
}
In Kernel Code:
__kernel void sumout(__global double* out, ....)
{
int l = get_global_id(0);
int j = get_global_id(1);
if (some condition is true)
{ out[j+l] = 0.; // Here it goes out of the loop then
return}
sum+= .....some addition...
}
out[j+l] = sum
}
So now, in if condition out[j+l] is getting 0 in the loop. So out value is regularly changing. In normal code, it is a reference pointer to a vector. I am not able to read the values in output from out during my kernel and host code. I want to read the values in output[i] for every out[j+l]. But I am confused due this buffer and vector.
just for more clarification,output is a vector of vector and out is reference vector to output vector. I need to update values in output for every change in out. Since these are vectors, I passed out as cl buffer. I hope it is clear.
Please let me know, if the code is required, I will try to provide as much as I can.
You are sending pointers of vectors to opencl(ofcourse they are contiguous on pointer level) but whole data is not contiguous in memory since each inner vector points to different memory area. Opencl cannot map host pointers to device memory and there is no such command in this api.
You could use vector of arrays(latest version) or pure arrays.

OpenCL usage of register value crashes program

I finished writing an OpenCL kernel for thermodynamics calculations and observed a really weird bug.
My kernel looks like this:
__kernel void energy(... float3 dest, int nlocal, ...){
int i = get_global_id(0);
float3 ev = {0.0f, 0.0f, 0.0f};
for(...){
//some thermo calculations, adding values to evx and evy
ev.x +=...;
ev.y +=...;
}
//Then I want to save the result in dest[i].
//Program exits at next two line
dest[i].x = ev.x;
dest[i].y = ev.y;
I get an "unmapped Memory" and segfault error. I get the same error when trying to print out the value using printf. Seems like the program can't read the value. Writing to it works though!(Maybe because of some compiler optimizations)
Now if I use another float register value, I get the same error. But if I change the last lines to something like this (no use of ev.x or ev.y)
dest[i].x = i/nlocal*3.1f
dest[i].y = ...;
everything is going as expected and I get no error.
This works too:
int i = ...
float3 = {0.0f, ...}
dest[i].x = ev.x;
But somehow after the actual calculation it is not possible anymore.
The program is running on a Nvidia K40m, Kepler architecture.
This looks suspicious in your code:
kernel(... __global int* neigh
__global int* neighs = neigh+i;
...
int j = neighs[k*n];
...
Seems like you are passing a array of pointers in neigh, then getting the pointer and using it.
Pointers are not allowed in CL, if you pass pointers then you are addressing out of the GPU memory, and therefore crashing.
It is also possible that your vectors are simply not properly calculated, the sizes should be:
res, nneigh = GLOBAL_SIZE
neighs = max(nneigh[])*n
x = max(neighs[])
And also possible you did create the buffers smaller than they should be (remember they are floats, and float3, which use 32bits and 128bits per element). CL API calls are defined in bytes (you should use sizeof()), not in elements.
Okay I found the answer and the code above is working. I changed the kernel parameters for better understanding and corrected the mistake unconsiously when I posted the code here.
int numneigh = nneigh[i] (stands for number of neighbors) is correct
in the original code I did this:
int numneigh = neigh[i] (the neighbors)
Thanks for helping, and your guess that something is wrong with neigh/nneigh was correct, even though the mistake was not in code posted above :P

OpenCL void pointer arithmetic - strange behavior

I have wrote an OpenCL kernel that is using the opencl-opengl interoperability to read vertices and indices, but probably this is not even important because I am just doing simple pointer addition in order to get a specific vertex by index.
uint pos = (index + base)*stride;
Here i am calculating the absolute position in bytes, in my example pos is 28,643,328 with a stride of 28, index = 0 and base = 1,022,976. Well, that seems correct.
Unfortunately, I cant use vload3 directly because the offset parameter isn't calculated as an absolute address in bytes. So I just add pos to the pointer void* vertices_gl
void* new_addr = vertices_gl+pos;
new_addr is in my example = 0x2f90000 and this is where the strange part begins,
vertices_gl = 0x303f000
The result (new_addr) should be 0x4B90000 (0x303f000 + 28,643,328)
I dont understand why the address vertices_gl is getting decreased by 716,800 (0xAF000)
I'm targeting the GPU: AMD Radeon HD5830
Ps: for those wondering, I am using a printf to get these values :) ( couldn't get CodeXL working)
There is no pointer arithmetic for void* pointers. Use char* pointers to perform byte-wise pointer computations.
Or a lot better than that: Use the real type the pointer is pointing to, and don't multiply offsets. Simply write vertex[index+base] assuming vertex points to your type containing 28 bytes of data.
Performance consideration: Align your vertex attributes to a power of two for coalesced memory access. This means, add 4 bytes of padding after each vertex entry. To automatically do this, use float8 as the vertex type if your attributes are all floating point values. I assume you work with position and normal data or something similar, so it might be a good idea to write a custom struct which encapsulates both vectors in a convenient and self-explaining way:
// Defining a type for the vertex data. This is 32 bytes large.
// You can share this code in a header for inclusion in both OpenCL and C / C++!
typedef struct {
float4 pos;
float4 normal;
} VertexData;
// Example kernel
__kernel void computeNormalKernel(__global VertexData *vertex, uint base) {
uint index = get_global_id(0);
VertexData thisVertex = vertex[index+base]; // It can't be simpler!
thisVertex.normal = computeNormal(...); // Like you'd do it in C / C++!
vertex[index+base] = thisVertex; // Of couse also when writing
}
Note: This code doesn't work with your stride of 28 if you just change one of the float4s to a float3, since float3 also consumes 4 floats of memory. But you can write it like this, which will not add padding (but note that this will penalize memory access bandwidth):
typedef struct {
float pos[4];
float normal[3]; // Assuming you want 3 floats here
} VertexData;

OpenCL - is it possible to invoke another function from within a kernel?

I am following along with a tutorial located here: http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
The kernel they have listed is this, which computes the sum of two numbers and stores it in the output variable:
__kernel void vector_add_gpu (__global const float* src_a,
__global const float* src_b,
__global float* res,
const int num)
{
/* get_global_id(0) returns the ID of the thread in execution.
As many threads are launched at the same time, executing the same kernel,
each one will receive a different ID, and consequently perform a different computation.*/
const int idx = get_global_id(0);
/* Now each work-item asks itself: "is my ID inside the vector's range?"
If the answer is YES, the work-item performs the corresponding computation*/
if (idx < num)
res[idx] = src_a[idx] + src_b[idx];
}
1) Say for example that the operation performed was much more complex than a summation - something that warrants its own function. Let's call it ComplexOp(in1, in2, out). How would I go about implementing this function such that vector_add_gpu() can call and use it? Can you give example code?
2) Now let's take the example to the extreme, and I now want to call a generic function that operates on the two numbers. How would I set it up so that the kernel can be passed a pointer to this function and call it as necessary?
Yes it is possible. You just have to remember that OpenCL is based on C99 with some caveats. You can create other functions either inside of the same kernel file or in a seperate file and just include it in the beginning. Auxiliary functions do not need to be declared as inline however, keep in mind that OpenCL will inline the functions when called. Pointers are also not available to use when calling auxiliary functions.
Example
float4 hit(float4 ray_p0, float4 ray_p1, float4 tri_v1, float4 tri_v2, float4 tri_v3)
{
//logic to detect if the ray intersects a triangle
}
__kernel void detection(__global float4* trilist, float4 ray_p0, float4 ray_p1)
{
int gid = get_global_id(0);
float4 hitlocation = hit(ray_p0, ray_p1, trilist[3*gid], trilist[3*gid+1], trilist[3*gid+2]);
}
You can have auxiliary functions for use in the kernel, see OpenCL user defined inline functions . You can not pass function pointers into the kernel.

CUDA device pointer manipulation

I've used:
float *devptr;
//...
cudaMalloc(&devptr, sizeofarray);
cudaMemcpy(devptr, hostptr, sizeofarray, cudaMemcpyHostToDevice);
in CUDA C to allocate and populate an array.
Now I'm trying to run a cuda kernel, e.g.:
__global__ void kernelname(float *ptr)
{
//...
}
in that array but with an offset value.
In C/C++ it would be someting like this:
kernelname<<<dimGrid, dimBlock>>>(devptr+offset);
However, this doesn't seem to work.
Is there a way to do this without sending the offset value to the kernel in a separate argument and use that offset in the kernel code?
Any ideas on how to do this?
Pointer arithmetic does work just fine in CUDA. You can add an offset to a CUDA pointer in host code and it will work correctly (remembering the offset isn't a byte offset, it is a plain word or element offset).
EDIT: A simple working example:
#include <cstdio>
int main(void)
{
const int na = 5, nb = 4;
float a[na] = { 1.2, 3.4, 5.6, 7.8, 9.0 };
float *_a, b[nb];
size_t sza = size_t(na) * sizeof(float);
size_t szb = size_t(nb) * sizeof(float);
cudaFree(0);
cudaMalloc((void **)&_a, sza );
cudaMemcpy( _a, a, sza, cudaMemcpyHostToDevice);
cudaMemcpy( b, _a+1, szb, cudaMemcpyDeviceToHost);
for(int i=0; i<nb; i++)
printf("%d %f\n", i, b[i]);
cudaThreadExit();
}
Here, you can see a word/element offset has been applied to the device pointer in the second cudaMemcpy call to start the copy from the second word, not the first.
Pointer arithmetic does work on host side code, it's used fairly often in the example code provided by nvidia.
"Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree."
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz4KialMz00
And from the performance primitives (npp) documentation, a perfect example of pointer arithmetic.
"4.5.1 Select-Channel Source-Image Pointer
This is a pointer to the channel-of-interest within the first pixel of the source image. E.g. if pSrc is the
pointer to the first pixel inside the ROI of a three channel image. Using the appropriate select-channel copy
primitive one could copy the second channel of this source image into the first channel of a destination
image given by pDst by offsetting the pointer by one:
nppiCopy_8u_C3CR(pSrc + 1, nSrcStep, pDst, nDstStep, oSizeROI);"
*Note: this works without multiplying by the number of bytes per data element because the compiler is aware of the data type of the pointer, and calculates the address accordingly.
In C and C++, pointer arithmetic can be accomplished as above or by the notation &ptr[offset] (to return device memory address of data instead of value, value will not work on device memory from host side code). When using either notation the size of the data type is automatically handled, and the offset is specified as a number of data elements rather than bytes.

Resources