My input char array is 1024 and I have specified any local item size=32. My each workitem is working on 16 char, I have specified global item size=64. In my kernel I have calculated data_tid as follows.
int offset=16;
int gr_id=get_group_id(0);
int id = get_global_id(0);
int data_tid=(gr_id*32+id)*offset;
I am using NVIDIA Geforce 9800GT.
My code is throwing floating point exception. I already seen that this is caused by global_size not divisible by local size but in my case it is divisible.
Can anyone please tell me where is the problem?
You want to use get_local_id instead of get_global_id. The local id is the index of the work item in each group. The global id is the index of the work item in the entire range.
You could also directly use int data_tid = get_global_id(0) * offset;
Related
In which section memory is allocated if I write something like
1. int *ptr;
*ptr = 22;
2. int *ptr = new int(22);
What I Understand is when we use keyword new then memory is get reserved into Heap and that reserved memory address is get returned .
But what happened in case we didn't use keyword new ?? Where memory is get allocated ??
is Both Syntax is Same ?? If No, what is Exact difference between these two statement ??
You code examples can be rephrased as follows:
1st:
int * ptr;
*ptr = 22;
2nd:
int * ptr;
ptr = new int; //the only difference
*ptr = 22;
What happens in the second one:
int * ptr; means create variable capable of storing address of int variable. For now variable isn't initialized, so it stores garbage. If you interpret garbage as pointer, it can points anywhere (it can be 0, or 0xabcdef11, or 0x31323334, or literally ANYTHING which is left on non-cleared memory form previous usage)
ptr = new int; means "allocate memory area capable of holding int and store its address in ptr variable". Since this line, ptr points to specific memory
*ptr = 22; means put value 22 to memory pointed by ptr.
In the first example you create variable, but don't initialize it. ptr contains garbage, but you ask to interpret it as address and store 22 to this address. What can happen:
address is invalid (e.g. 0, or out of address range, or points to protected memory) => program crashes
address is valid and writable, but memory area is used by another part of the program: you'll write 22, but it will corrupt someone's data, result totally unpredictable.
address is valid and writable, memory area isn't in use. You'll write 22, but you aren't guaranteed to read it back. Memory can become used for different purpose and 22 will be overwritten.
anything else. All this is actually an undefined behavior, everything is possible.
That's why it's always recommended to initialize pointer immediately:
int * ptr = NULL; //or better "nullptr" starting from C++11
Attempt to store value *ptr = 22; will at least explicitly crash the program.
I have wrote an OpenCL kernel that is using the opencl-opengl interoperability to read vertices and indices, but probably this is not even important because I am just doing simple pointer addition in order to get a specific vertex by index.
uint pos = (index + base)*stride;
Here i am calculating the absolute position in bytes, in my example pos is 28,643,328 with a stride of 28, index = 0 and base = 1,022,976. Well, that seems correct.
Unfortunately, I cant use vload3 directly because the offset parameter isn't calculated as an absolute address in bytes. So I just add pos to the pointer void* vertices_gl
void* new_addr = vertices_gl+pos;
new_addr is in my example = 0x2f90000 and this is where the strange part begins,
vertices_gl = 0x303f000
The result (new_addr) should be 0x4B90000 (0x303f000 + 28,643,328)
I dont understand why the address vertices_gl is getting decreased by 716,800 (0xAF000)
I'm targeting the GPU: AMD Radeon HD5830
Ps: for those wondering, I am using a printf to get these values :) ( couldn't get CodeXL working)
There is no pointer arithmetic for void* pointers. Use char* pointers to perform byte-wise pointer computations.
Or a lot better than that: Use the real type the pointer is pointing to, and don't multiply offsets. Simply write vertex[index+base] assuming vertex points to your type containing 28 bytes of data.
Performance consideration: Align your vertex attributes to a power of two for coalesced memory access. This means, add 4 bytes of padding after each vertex entry. To automatically do this, use float8 as the vertex type if your attributes are all floating point values. I assume you work with position and normal data or something similar, so it might be a good idea to write a custom struct which encapsulates both vectors in a convenient and self-explaining way:
// Defining a type for the vertex data. This is 32 bytes large.
// You can share this code in a header for inclusion in both OpenCL and C / C++!
typedef struct {
float4 pos;
float4 normal;
} VertexData;
// Example kernel
__kernel void computeNormalKernel(__global VertexData *vertex, uint base) {
uint index = get_global_id(0);
VertexData thisVertex = vertex[index+base]; // It can't be simpler!
thisVertex.normal = computeNormal(...); // Like you'd do it in C / C++!
vertex[index+base] = thisVertex; // Of couse also when writing
}
Note: This code doesn't work with your stride of 28 if you just change one of the float4s to a float3, since float3 also consumes 4 floats of memory. But you can write it like this, which will not add padding (but note that this will penalize memory access bandwidth):
typedef struct {
float pos[4];
float normal[3]; // Assuming you want 3 floats here
} VertexData;
I have seen solutions like this:
kernel dp_square (const float *a,
float *result)
{
int id = get_global_id(0);
result[id] = a[id] * a[id];
}
and
kernel dp_square (const float *a,
float *result, const unsigned int count)
{
int id = get_global_id(0);
if(id < count)
result[id] = a[id] * a[id];
}
Is the check for id< count important, what happens if a kernel work item tries to process an item not avalible?
Can the reason for it not being there in the first example be that programmer just ensures that the global size is equal the number of elements to be processed ( is this normal) ?
This is often done for two reasons --
To ensure that a developer-error doesn't kill the code or read bad memory
Because sometimes it is optimal to run more work-items than there are data points. For example, if the optimal work-group size for my device is 32 (not uncommon), and I have an array of 61 pieces of data, I'll run 64-work items, and the last three will simply "play dead."
In order to not include this check, you'd have to use a work-group size that divides the total number of work-items. In this case, that would leave you with a work-group size of 1 (as 61 is prime), which would be very slow!
I've used:
float *devptr;
//...
cudaMalloc(&devptr, sizeofarray);
cudaMemcpy(devptr, hostptr, sizeofarray, cudaMemcpyHostToDevice);
in CUDA C to allocate and populate an array.
Now I'm trying to run a cuda kernel, e.g.:
__global__ void kernelname(float *ptr)
{
//...
}
in that array but with an offset value.
In C/C++ it would be someting like this:
kernelname<<<dimGrid, dimBlock>>>(devptr+offset);
However, this doesn't seem to work.
Is there a way to do this without sending the offset value to the kernel in a separate argument and use that offset in the kernel code?
Any ideas on how to do this?
Pointer arithmetic does work just fine in CUDA. You can add an offset to a CUDA pointer in host code and it will work correctly (remembering the offset isn't a byte offset, it is a plain word or element offset).
EDIT: A simple working example:
#include <cstdio>
int main(void)
{
const int na = 5, nb = 4;
float a[na] = { 1.2, 3.4, 5.6, 7.8, 9.0 };
float *_a, b[nb];
size_t sza = size_t(na) * sizeof(float);
size_t szb = size_t(nb) * sizeof(float);
cudaFree(0);
cudaMalloc((void **)&_a, sza );
cudaMemcpy( _a, a, sza, cudaMemcpyHostToDevice);
cudaMemcpy( b, _a+1, szb, cudaMemcpyDeviceToHost);
for(int i=0; i<nb; i++)
printf("%d %f\n", i, b[i]);
cudaThreadExit();
}
Here, you can see a word/element offset has been applied to the device pointer in the second cudaMemcpy call to start the copy from the second word, not the first.
Pointer arithmetic does work on host side code, it's used fairly often in the example code provided by nvidia.
"Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree."
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz4KialMz00
And from the performance primitives (npp) documentation, a perfect example of pointer arithmetic.
"4.5.1 Select-Channel Source-Image Pointer
This is a pointer to the channel-of-interest within the first pixel of the source image. E.g. if pSrc is the
pointer to the first pixel inside the ROI of a three channel image. Using the appropriate select-channel copy
primitive one could copy the second channel of this source image into the first channel of a destination
image given by pDst by offsetting the pointer by one:
nppiCopy_8u_C3CR(pSrc + 1, nSrcStep, pDst, nDstStep, oSizeROI);"
*Note: this works without multiplying by the number of bytes per data element because the compiler is aware of the data type of the pointer, and calculates the address accordingly.
In C and C++, pointer arithmetic can be accomplished as above or by the notation &ptr[offset] (to return device memory address of data instead of value, value will not work on device memory from host side code). When using either notation the size of the data type is automatically handled, and the offset is specified as a number of data elements rather than bytes.
I'm currently working on a project suing OpenCL on a NVIDIA Tesla C1060 (driver version 195.17). However I'm getting some strange behaviour I can't really explain. Here is the code which puzzles me (reduced for clarity and testing purpose):
kernel void TestKernel(global const int* groupOffsets, global float* result,
local int* tmpData, const int itemcount)
{
unsigned int groupid = get_group_id(0);
unsigned int globalsize = get_global_size(0);
unsigned int groupcount = get_num_groups(0);
for(unsigned int id = get_global_id(0); id < itemcount; id += globalsize, groupid += groupcount)
{
barrier(CLK_LOCAL_MEM_FENCE);
if(get_local_id(0) == 0)
tmpData[0] = groupOffsets[groupid];
barrier(CLK_LOCAL_MEM_FENCE);
int offset = tmpData[0];
result[id] = (float) offset;
}
}
This code should load the offset for each workgroup into local memory and then read it back and write it into the corresponding outputvector entry. For most workitems this is working, but for each workgroup the workitems with local ids 1 to 31 read an incorrect value.
My output vector (for workgroupsize=128) is as following:
index 0: 0
index 1- 31: 470400
index 32-127: 0
index 128: 640
index 129-159: 471040
index 160-255: 640
index 256: 1280
index 257-287: 471680
index 288-511: 1280
...
the output i expected would be
index 0-127: 0
index 128-255: 640
index 256-511: 1280
...
Strange thing is: the problem only occurs when I use less then itemcount workitems (so it works as expected when globalsize>=itemcount, meaning that every workitem processes only one entry). So I'm guessing it has something to do with the loop.
Does anyone know what I'm doing wrong and how to fix it?
Update:
I found out that it seems to work if I change
if(get_local_id(0) == 0)
tmpData[0] = groupOffsets[groupid];
to
if(get_local_id(0) < 32)
tmpData[0] = groupOffsets[groupid];
Which astonishes me even more, so while it might fix the problem, I'm don't feel comfortable fixing it this way (as in it might break some other time).
Besides I would rather avoid losing performance when running on Geforce 8xxx class hardware due to additional (uncoalesced for that hardware as far as I understand) memory accesses.
So the question still remains.
Firstly, and importantly, you need to be careful that itemcount is a multiple of the local work size to avoid divergence when executing the barrier.
All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.
You could implement this as follows:
unsigned int itemcountrounded = get_local_size(0) * ((itemcount + get_local_size(0) - 1) / get_local_size(0));
for(unsigned int id = get_global_id(0); id < itemcountrounded; id += globalsize, groupid += groupcount)
{
// ...
if (id < itemcount)
result[id] = (float) offset;
}
You said the code was reduced for simplicity, what happens if you run what you posted? Just wondering whether you need to put the barrier on global memory as well.