OpenCL void pointer arithmetic - strange behavior

I have written an OpenCL kernel that uses the OpenCL-OpenGL interoperability to read vertices and indices, but that is probably not even important, because I am just doing simple pointer addition to get a specific vertex by index.
uint pos = (index + base)*stride;
Here I am calculating the absolute position in bytes; in my example pos is 28,643,328, with a stride of 28, index = 0 and base = 1,022,976. That seems correct.
Unfortunately, I can't use vload3 directly, because its offset parameter isn't interpreted as an absolute address in bytes. So I just add pos to the pointer void* vertices_gl:
void* new_addr = vertices_gl+pos;
In my example new_addr = 0x2f90000, and this is where the strange part begins:
vertices_gl = 0x303f000
The result (new_addr) should be 0x4B90000 (0x303f000 + 28,643,328).
I don't understand why the result ends up 716,800 bytes (0xAF000) below vertices_gl.
I'm targeting the GPU: AMD Radeon HD5830
PS: for those wondering, I am using printf to get these values :) (I couldn't get CodeXL working).

There is no pointer arithmetic for void* pointers. Use char* pointers to perform byte-wise pointer computations.
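For example, a minimal sketch of the byte-wise route, assuming the vertex data is float and reusing the names from the question:
// Do the byte arithmetic on a uchar*, then cast back to float*
// so vload3 can read the three floats at that position.
__global const uchar *bytes = (__global const uchar *)vertices_gl;
__global const float *vertex = (__global const float *)(bytes + pos);
float3 position = vload3(0, vertex);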
Or, a lot better than that: use the real type the pointer is pointing to, and don't multiply offsets yourself. Simply write vertex[index+base], assuming vertex points to your type containing 28 bytes of data.
Performance consideration: align your vertex attributes to a power of two to get coalesced memory access. This means adding 4 bytes of padding after each 28-byte vertex entry. To get this automatically, use float8 as the vertex type if your attributes are all floating-point values. I assume you are working with position and normal data or something similar, so it might be a good idea to write a custom struct which encapsulates both vectors in a convenient and self-explanatory way:
// Defining a type for the vertex data. This is 32 bytes large.
// You can share this code in a header for inclusion in both OpenCL and C / C++!
typedef struct {
    float4 pos;
    float4 normal;
} VertexData;

// Example kernel
__kernel void computeNormalKernel(__global VertexData *vertex, uint base) {
    uint index = get_global_id(0);
    VertexData thisVertex = vertex[index+base]; // It can't be simpler!
    thisVertex.normal = computeNormal(...);     // Like you'd do it in C / C++!
    vertex[index+base] = thisVertex;            // Of course also when writing
}
Note: this doesn't match your stride of 28 if you just change one of the float4s to a float3, since a float3 also consumes 4 floats of memory. But you can write it like this, which adds no padding (note that this will penalize memory access bandwidth):
typedef struct {
    float pos[4];
    float normal[3]; // Assuming you want 3 floats here
} VertexData;

How do I convert a signed 8-byte integer to normalised float?

I am trying to optimize a working compute shader. Its purpose is to create an image: find the right color (using a small palette) and call imageStore(image, ivec2, vec4).
The colors are indexed in an array of uint, in a uniform buffer.
One color in this UBO is packed inside one uint, as {0-255, 0-255, 0-255, 0-255}.
Here is the code:
struct Entry
{
    // ... some other data ...
    uint rgb;
};

layout(binding = 0) uniform SConfiguration
{
    Entry materials[MATERIAL_COUNT];
} configuration;

void main()
{
    Entry material = configuration.materials[currentMaterialId];
    float r = (material.rgb >> 16) / 255.;
    float g = ((material.rgb & G_MASK) >> 8) / 255.;
    float b = (material.rgb & B_MASK) / 255.;
    imageStore(outImage, ivec2(gl_GlobalInvocationID.xy), vec4(r, g, b, 0.0));
}
I would like to clean this up and optimize it a bit, because the color conversion looks bad and wasteful in the shader (it should be precomputed). My question is:
Is it possible to directly pack a vec4(r, g, b, 0.0) inside the UBO, using 4 bytes (like R8G8B8A8)?
Is it possible to do it directly? No.
But GLSL does have a number of functions for packing/unpacking normalized values. In your case, you can pass the value as a single uint uniform, then use unpackUnorm4x8 to convert it to a vec4. So your code becomes:
vec4 color = unpackUnorm4x8(material.rgb);
This is, of course, a memory-vs-performance tradeoff. So if memory isn't an issue, you should probably just pass a vec4 directly (and never use vec3).
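One caveat: unpackUnorm4x8 unpacks the least significant byte into the .x component, so with your layout (red in bits 16-23) the channels come out in (b, g, r) order and need a swizzle. A minimal sketch, reusing material.rgb from the question:
// unpackUnorm4x8 maps the low byte to .x, so 0x??RRGGBB unpacks as (b, g, r, ?).
vec4 bgra = unpackUnorm4x8(material.rgb);
vec4 color = vec4(bgra.zyx, 0.0); // reorder to (r, g, b, 0)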
Is it possible to directly pack a vec4(r, g, b, 0.0) inside the UBO, using 4 bytes (like R8G8B8A8)?
There is no way to express this directly as 4 single-byte values; there is no appropriate data type in the shader that would let you declare a byte type.
However, why do you think you need to? Just upload it as 4 floats. It's a uniform, so it's not as if you are replicating it thousands of times; the additional size is unlikely to be a problem in practice.

c++ Occasional Dynamic Pointer Crashing

I have made a program that takes float inputs from a user to create a dynamic array. (I then use those inputs with functions to find basic things like max, min, sum and average, but that part works fine, so I won't include it here to avoid a wall of code.)
It works about half the time, and while I have some theories about the cause, I can't put my finger on a solution.
int main() {
    int Counter = 0;
    float *UsrIn = nullptr;
    float Array[Counter];
My first thought was that the part below was the issue. My class hasn't really gone over what notation to use with new (I assume it refers to bytes, so maybe scientific notation would work), as far as I can recall. I just tried 20 for the sake of testing and it seemed to work (probably a silly assumption in hindsight).
UsrIn = new float[(int)20];
cout << "Enter float numbers:" << endl;
cout << "Enter '9999999' to quit:" << endl;
cin >> *UsrIn; // User input for dynamic array
Array[Counter] = *UsrIn;
while (*UsrIn != 9999999) // User input for dynamic array
{
    Counter++;
    UsrIn++;
    cin >> *UsrIn;
    Array[Counter] = *UsrIn;
}
delete UsrIn;
delete[] UsrIn;
My other thought was that maybe a pointer address was already in use by something else, or was somehow invalid. I don't know of a way to test for that, because the crash I occasionally get only happens when exiting the while loop after entering "9999999".
As a side note, I'm not getting any warnings or error messages, just a crashed program from Eclipse.
Variable-length arrays are not standard C++ and are not universally supported, although your compiler clearly supports them. The problem, from what you've described, is with this code:
int main() {
    int Counter = 0;
    float *UsrIn = nullptr;
    float Array[Counter];
You're defining a variable-length array of size 0. So, although you're allocating 20 entries for UsrIn, you're not allocating any memory for Array. The intention of variable-length arrays is to allocate an array of a given size where the size is not actually known until run time. Based on your other code, that's not really the situation here. The easiest thing to do is just change the Array size to match your UsrIn size, e.g.:
float Array[20];
If you really want more dynamic behavior, you could use std::vector<float>:
std::vector<float> Array;
...
Array.push_back(*UsrIn);
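A minimal self-contained sketch of that approach, keeping the names and the sentinel from the question:
#include <iostream>
#include <vector>

int main() {
    std::vector<float> Array;
    float input = 0.0f;
    std::cout << "Enter float numbers:" << std::endl;
    std::cout << "Enter '9999999' to quit:" << std::endl;
    while (std::cin >> input && input != 9999999) {
        Array.push_back(input); // the vector grows as needed; no new/delete
    }
    // Array.size() now plays the role of Counter.
}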

declaring and defining pointer vectors of vectors in OpenCL Kernel

I have a variable which is a vector of vectors. In C++ I can easily define and declare it, but in an OpenCL kernel I am running into issues. Here is an example of what I am trying to do.
std::vector<std::vector<double>> filters;
for (int m = 0; m < 3; m++)
{
    const auto& w = filters[m];
    // ... sum operation using w
}
Here I can easily reference the values of filters[m] through w, but I am not able to do this in the OpenCL kernel file. Here is what I have tried, but it gives me the wrong output.
In host code:
filter_dev = cl::Buffer(context,CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR,filter_size,(void*)&filters,&err);
filter_dev_buff = cl::Buffer(context,CL_MEM_READ_WRITE,filter_size,NULL,&err);
kernel.setArg(0, filter_dev);
kernel.setArg(1, filter_dev_buff);
In kernel code:
__kernel void forward_shrink(__global double* filters, __global double* weight)
{
    int i = get_global_id(0); // I have tried individual values of i to index filters,
                              // just to check the output, but it doesn't give the same
                              // values as the serial C++ implementation
    weight = &filters[i];
    // ... sum operations using weight
}
Can anyone help me? Where am I going wrong, or what could be the solution?
You are doing multiple things wrong with your vectors.
First of all, (void*)&filters doesn't do what you want it to do: &filters doesn't return a pointer to the beginning of the actual data. For that you'll have to use filters.data().
Second, you can't use an array of arrays in OpenCL (let alone a vector of vectors). You'll have to flatten the data yourself into a 1D array before you pass it to an OpenCL kernel.
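A minimal sketch of that flattening, assuming all inner vectors have the same length and reusing context and err from the question's host code (CL_MEM_COPY_HOST_PTR is used here so the temporary can be freed afterwards):
// Host side: copy the rows into one contiguous 1D array.
size_t row_len = filters[0].size();
std::vector<double> flat;
flat.reserve(filters.size() * row_len);
for (const auto& row : filters)
    flat.insert(flat.end(), row.begin(), row.end());
cl::Buffer filter_dev(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                      flat.size() * sizeof(double), flat.data(), &err);
// Kernel side: element k of row m is filters[m * row_len + k],
// with row_len passed in as an extra kernel argument.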

go tour when to not use pointer to struct literal in a variable

Per the Go tour, page 28 and page 53:
They show a variable that is a pointer to a struct literal. Why is this not the default behavior? I'm unfamiliar with C, so it's hard to wrap my head around. The only time I can see a pointer being less beneficial is when the struct literal is unique and won't be used for the rest of the program, so you would want it garbage-collected as soon as possible. I'm not even sure a modern language like Go works that way.
My question is this. When should I assign a pointer to a struct literal to a variable, and when should I assign the struct literal itself?
Thanks.
Using a pointer instead of just a struct literal is helpful when
the struct is big and you pass it around
you want to share it, that is, you want all modifications to affect your struct instead of a copy
In other cases, it's fine to simply use the struct literal. For a small struct, you can think about the question just as you would for an int versus an *int: most of the time the int is fine, but sometimes you pass a pointer so that the receiver can modify your int variable.
In the Go tour exercises you link to, the Vertex struct is small and has about the same semantics as any number. In my opinion it would have been fine to use it as a struct directly and to define the Scaled function in #53 like this:
func (v Vertex) Scaled(f float64) Vertex {
    v.X = v.X * f
    v.Y = v.Y * f
    return v
}
because having
v2 := v1.Scaled(5)
would create a new vertex just like
var f2 float32 = f1 * 5
creates a new float.
This is similar to how the standard Time struct (defined here) is handled: it is usually kept in variables of type Time, not *Time.
But there is no definite rule and, depending on the use, I could very well have kept both Scale and Scaled.
You're probably right that most of the time you want pointers, but personally I find the need for an explicit pointer refreshing. It means there is no difference between int and MyStruct: they behave the same way.
If you compare this to C#, a language which implements what you are suggesting, I find it confusing that the semantics of this:
static void SomeFunction(Point p)
{
    p.x = 1;
}

static void Main()
{
    Point p = new Point();
    SomeFunction(p);
    // what is p.x?
}
depend on whether Point is defined as a class or a struct.
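For contrast, a small hypothetical Go sketch of the same situation; here the distinction is always explicit at the call site:
package main

import "fmt"

type Point struct{ X int }

func setByValue(p Point)    { p.X = 1 } // modifies a copy
func setByPointer(p *Point) { p.X = 1 } // modifies the caller's value

func main() {
    p := Point{}
    setByValue(p)
    fmt.Println(p.X) // 0: only the copy changed
    setByPointer(&p)
    fmt.Println(p.X) // 1: the pointer made the change visible
}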

CUDA device pointer manipulation

I've used:
float *devptr;
//...
cudaMalloc(&devptr, sizeofarray);
cudaMemcpy(devptr, hostptr, sizeofarray, cudaMemcpyHostToDevice);
in CUDA C to allocate and populate an array.
Now I'm trying to run a CUDA kernel, e.g.:
__global__ void kernelname(float *ptr)
{
    //...
}
on that array, but starting at an offset.
In C/C++ it would be something like this:
kernelname<<<dimGrid, dimBlock>>>(devptr+offset);
However, this doesn't seem to work.
Is there a way to do this without passing the offset value to the kernel as a separate argument and applying the offset in the kernel code?
Any ideas on how to do this?
Pointer arithmetic does work just fine in CUDA. You can add an offset to a CUDA pointer in host code and it will work correctly (remember that the offset isn't a byte offset; it is a plain word or element offset).
EDIT: A simple working example:
#include <cstdio>

int main(void)
{
    const int na = 5, nb = 4;
    float a[na] = { 1.2, 3.4, 5.6, 7.8, 9.0 };
    float *_a, b[nb];
    size_t sza = size_t(na) * sizeof(float);
    size_t szb = size_t(nb) * sizeof(float);

    cudaFree(0);
    cudaMalloc((void **)&_a, sza);
    cudaMemcpy(_a, a, sza, cudaMemcpyHostToDevice);
    cudaMemcpy(b, _a + 1, szb, cudaMemcpyDeviceToHost);

    for (int i = 0; i < nb; i++)
        printf("%d %f\n", i, b[i]);

    cudaDeviceReset(); // cudaThreadExit() is deprecated in current CUDA

    return 0;
}
Here you can see that a word/element offset has been applied to the device pointer in the second cudaMemcpy call, to start the copy from the second word rather than the first.
Pointer arithmetic does work in host-side code; it's used fairly often in the example code provided by NVIDIA.
"Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree."
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
And the NVIDIA Performance Primitives (NPP) documentation gives a perfect example of pointer arithmetic:
"4.5.1 Select-Channel Source-Image Pointer
This is a pointer to the channel-of-interest within the first pixel of the source image. E.g. if pSrc is the
pointer to the first pixel inside the ROI of a three channel image. Using the appropriate select-channel copy
primitive one could copy the second channel of this source image into the first channel of a destination
image given by pDst by offsetting the pointer by one:
nppiCopy_8u_C3CR(pSrc + 1, nSrcStep, pDst, nDstStep, oSizeROI);"
*Note: this works without multiplying by the number of bytes per data element because the compiler is aware of the pointer's data type and calculates the address accordingly.
In C and C++, pointer arithmetic can be written as above or with the notation &ptr[offset] (this yields the device memory address of the data rather than its value; reading the value won't work on device memory from host-side code). With either notation the size of the data type is handled automatically, and the offset is specified as a number of data elements rather than bytes.
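Applied to the original question, a minimal sketch (kernelname, devptr, dimGrid and dimBlock from the question; offset is a hypothetical element count, not bytes):
// Both launches pass the device address of element `offset`;
// no multiplication by sizeof(float) is required.
kernelname<<<dimGrid, dimBlock>>>(devptr + offset);
kernelname<<<dimGrid, dimBlock>>>(&devptr[offset]);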
