clGetProgramInfo CL_PROGRAM_BINARY_SIZES Incorrect Results? - opencl

I am trying to cache a program in a file so that it does not need to be compiled to assembly each time. Consequently, I am trying to dump the binaries. The problem is that, on alternate runs, the returned binary has garbage data at the end.
Error checking omitted for clarity (no errors occur, though):
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARY_SIZES, 0,NULL, &n);
n /= sizeof(size_t);
size_t* sizes = new size_t[n];
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARY_SIZES, n*sizeof(size_t),sizes, NULL);
I have confirmed that kernel->program is identical between runs. In the above code, "n" is invariably 1, but sizes[0] alternates between 2296 and 2312 from one run to the next.
The problem is that the 2296 number appears to be more accurate--after the final closing brace in the output, there are three newlines and then three spaces.
For the 2312 number, after the final closing brace in the output, there are the three newlines, a line of garbage data, and then the three spaces.
Naturally, the line of garbage data is problematic. I'm not sure how to get rid of it, and I'm pretty sure it's not an error on my part.
NVIDIA GeForce GTX 580M, with driver 305.60 on Windows 7.
Update: I have changed the code to the following:
//Get how many devices there are
size_t n;
clGetProgramInfo(kernel->program, CL_PROGRAM_NUM_DEVICES, 0,NULL, &n);
//Get the list of binary sizes
size_t* sizes = new size_t[n];
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARY_SIZES, n*sizeof(size_t),sizes, NULL);
//Get the binaries
unsigned char** binaries = new unsigned char*[n];
for (int i=0;i<(int)n;++i) {
binaries[i] = new unsigned char[sizes[i]];
}
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARIES, n*sizeof(unsigned char*),binaries, NULL);
Now the code has n = 4, but only sizes[0] contains meaningful information (so the allocation of sizes[1] bytes fails in the loop). Thoughts?

I get the number of devices with the following line:
clGetProgramInfo(kernel->program, CL_PROGRAM_NUM_DEVICES, sizeof(cl_uint), &n, NULL);

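clGetProgramInfo(kernel->program, CL_PROGRAM_NUM_DEVICES, 0,NULL, &n);
only returns the size of the result in bytes (sizeof(cl_uint), i.e. 4), which is why n ends up as 4. The device count is a cl_uint, so to read the count itself the call needs to be:
cl_uint n;
clGetProgramInfo(kernel->program, CL_PROGRAM_NUM_DEVICES, sizeof(cl_uint), &n, NULL);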

clGetProgramInfo with CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES needs a pointer to an array and not just to a single variable because it creates binaries for each device that you supplied when building the program. That is why the first line returns nothing. n for the second example should be the number of devices.
Not sure why the second example is different for each run... are you sure you are building for the same device each time?

Related

Can I have boolean buffer in OpenCL and change its value during kernel execution, example to break while loop

I want to do some experiments in OpenCL, and I want to know whether it is possible to change state during kernel execution from host code using a buffer.
I attempted to alter the state of a while loop in the kernel code by modifying the buffer value from the host code, but the execution hangs.
__kernel void my_kernel(
__global bool *in,
__global int *out)
{
int i = get_global_id(0);
while(1) {
if(1 == *in) {
printf("while loop is finished");
break;
}
}
printf("out[0] = %d\n", out[0]);
}
I call clEnqueueWriteBuffer() a second time to change the state of the input value.
input[0] = 1;
err = clEnqueueWriteBuffer(commands, input_buffer,
CL_TRUE, 0, sizeof(int), (void*)input,
0, NULL,NULL);
At least for OpenCL 1.x, this is not permitted, and any behaviour you may observe in one implementation cannot be relied upon.
See the NOTE in the OpenCL 1.2 specification, section 5.2.2, Reading, Writing and Copying Buffer Objects:
Calling clEnqueueWriteBuffer to update the latest bits in a region of the buffer object with the ptr argument value set to host_ptr + offset, where host_ptr is a pointer to the memory region specified when the buffer object being written is created with CL_MEM_USE_HOST_PTR, must meet the following requirements in order to avoid undefined behavior:
The host memory region given by (host_ptr + offset, cb) contains the latest bits when the enqueued write command begins execution.
The buffer object or memory objects created from this buffer object are not mapped.
The buffer object or memory objects created from this buffer object are not used by any command-queue until the write command has finished execution.
The final condition is not met by your code, therefore its behaviour is undefined.
I am not certain whether the situation is different with OpenCL 2.x's Shared Virtual Memory (SVM) feature, as I have no practical experience using it. Perhaps someone else can contribute an answer for that.

How to allocate Local Work Item sizes in OpenCL

I've set up a convolution kernel in OpenCL to convolve a 228x228x3 image with 11x11x3x96 weights to produce 55x55x96 filters.
My code works perfectly when I do not set localWorkSize, but when I do set it, I start getting errors.
My questions are therefore,
1) How many threads are being launched when I set localWorkSize to NULL? I'm guessing it's implicit but is there any way to get those numbers?
2) How should I allot localWorkSize to avoid errors?
//When localWorkSize is NULL
size_t globalWorkSize[3] = {55,55,96};
//Passing NULL for localWorkSize argument
errNum = clEnqueueNDRangeKernel(command_queue, kernel,3,NULL,globalWorkSize, NULL,0, NULL,&event);
//WORKS PERFECTLY
// When I set localWorkSize
size_t globalWorkSize[3] = {55,55,96};
size_t localWorkSize[3] = {1,1,1};
errNum = clEnqueueNDRangeKernel(command_queue, kernel,3,NULL,globalWorkSize, localWorkSize,0, NULL,&event);
//ERROR CONTEXT CODE 999
I'm just trying to understand how many threads are created when localWorkSize is NULL and only globalWorkSize is specified.

OpenCL usage of register value crashes program

I finished writing an OpenCL kernel for thermodynamics calculations and observed a really weird bug.
My kernel looks like this:
__kernel void energy(... float3 dest, int nlocal, ...){
int i = get_global_id(0);
float3 ev = {0.0f, 0.0f, 0.0f};
for(...){
//some thermo calculations, adding values to ev.x and ev.y
ev.x +=...;
ev.y +=...;
}
//Then I want to save the result in dest[i].
//Program exits at the next two lines
dest[i].x = ev.x;
dest[i].y = ev.y;
}
I get an "unmapped Memory" error and a segfault. I get the same error when trying to print out the value using printf. It seems like the program can't read the value. Writing to it works, though! (Maybe because of some compiler optimizations.)
Now if I use another float register value, I get the same error. But if I change the last lines to something like this (no use of ev.x or ev.y)
dest[i].x = i/nlocal*3.1f;
dest[i].y = ...;
everything is going as expected and I get no error.
This works too:
int i = ...
float3 ev = {0.0f, ...}
dest[i].x = ev.x;
But somehow after the actual calculation it is not possible anymore.
The program is running on a Nvidia K40m, Kepler architecture.
This looks suspicious in your code:
kernel(... __global int* neigh
__global int* neighs = neigh+i;
...
int j = neighs[k*n];
...
It looks like you are passing an array of pointers in neigh, then fetching a pointer and using it.
Host pointers stored in a buffer are not valid on the device in OpenCL; if you pass them, the kernel addresses memory outside the GPU buffers and crashes.
It is also possible that your buffer sizes are simply miscalculated; they should be:
res, nneigh = GLOBAL_SIZE
neighs = max(nneigh[])*n
x = max(neighs[])
It is also possible that you created the buffers smaller than they should be (remember that the elements are float and float3, which take 32 bits and 128 bits per element respectively). CL API calls take sizes in bytes (use sizeof()), not in elements.
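A minimal sketch of the byte-size arithmetic described above, written in plain C so that it compiles without an OpenCL installation. The struct float3_standin and the helper buffer_bytes are hypothetical names introduced here for illustration; the stand-in only mirrors the fact that OpenCL stores a float3 in 16 bytes (four floats). The returned value is the kind of number that would be passed as the size argument when creating or writing a buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cl_float3: OpenCL pads a float3 to four floats. */
typedef struct { float s[4]; } float3_standin;

/* Hypothetical helper: bytes to request for `count` elements of `elem_size` bytes. */
static size_t buffer_bytes(size_t count, size_t elem_size)
{
    assert(elem_size > 0);      /* a zero element size is always a caller bug */
    return count * elem_size;   /* OpenCL size arguments are given in bytes */
}
```

For example, buffer_bytes(n, sizeof(float3_standin)) requests four times as many bytes as buffer_bytes(n, sizeof(float)), so a float3 buffer sized as if it held plain floats would be too small.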
Okay, I found the answer, and the code above does work. I changed the kernel parameters for better readability and unconsciously corrected the mistake when I posted the code here.
int numneigh = nneigh[i] (the number of neighbors) is correct.
In the original code I did this:
int numneigh = neigh[i] (the neighbors themselves)
Thanks for helping. Your guess that something was wrong with neigh/nneigh was correct, even though the mistake was not in the code posted above :P

Segmentation fault when using shared memory created by shm_open on Xeon Phi

I have written my code for a single Xeon Phi node (with 61 cores on it). I have two files. I call MPI_Init(2) before any other MPI calls, and I obtain ntasks and the rank using MPI calls as well. I have also included all the required libraries. Still, I get an error. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff,**recvbuff,buffsum;
int *shareRegion;
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
sendbuff=(int *)malloc(sizeof(int)*buffsize);
if( taskid == 0 ){
recvbuff=(int **)malloc(sizeof(int *)*ntasks);
recvbuff[0]=(int *)malloc(sizeof(int)*ntasks*buffsize);
for(i=1;i<ntasks;i++)recvbuff[i]=recvbuff[i-1]+buffsize;
}
else{
recvbuff=(int **)malloc(sizeof(int *)*1);
recvbuff[0]=(int *)malloc(sizeof(int)*1);
}
for(i=0;i<buffsize;i++){
sendbuff[i]=1;
}
MPI_Barrier(MPI_COMM_WORLD);
call(sendbuff, buffsize, shareRegion, recvbuff[0],buffsize,taskid,ntasks);
In file 2:
void* gInit( MPI_Comm comm, int size, int num_proc)
{
int share_mem = shm_open("share_region", O_CREAT|O_RDWR,0666 );
if( share_mem == -1)
return NULL;
int rank;
MPI_Comm_rank(comm,&rank);
if( ftruncate( share_mem, sizeof(int)*size*num_proc) == -1 )
return NULL;
int* shared = mmap(NULL, sizeof(int)*size*num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
if(shared == (void*)-1)
printf("error in mem allocation (mmap)\n");
*(shared+(rank)) = 0;
MPI_Barrier(MPI_COMM_WORLD);
return shared;
}
void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
int i=0;
int k,j;
j=rank*sendcount;
for(i=0;i<sendcount;i++)
{
sharedRegion[j] = sendbuff[i];
j++;
}
if( rank == 0)
for(k=0;k<size;k++)
for(i=0;i<sendcount;i++)
{
j=0;
recvbuff[k][i] = sharedRegion[j];
j++;
}
}
I then do some computation in file 1 on this recvbuff.
I get the segmentation fault when using the sharedRegion variable.
MPI represents the message-passing paradigm. That means processes (ranks) are isolated and generally run on a distributed machine. They communicate via explicit messages; recent versions also allow one-sided, but still explicit, data transfer. You cannot assume that shared memory is available to the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify on what kind of machine you are running, any further suggestion is purely speculative. If you actually are on a shared memory machine, you may want to use a real shared memory paradigm instead, e.g. OpenMP.
While it's possible to restrict MPI to only use one machine and have shared memory (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes and have multiple ranks on one node (multi-core processors, for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers out on the subject, but because the features are so new, there are not many good tutorials yet. You'll have to dig around a bit to find something useful to you.
The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
suggests that buffsize has different values before and after the call to gInit: it is still uninitialized when gInit uses it to size the shared region. If the value passed as the first argument to the program is larger than whatever buffsize held when gInit was called, then out-of-bounds memory access occurs later and leads to a segmentation fault.
Hint: run your code as an MPI singleton (e.g. without mpirun) from inside a debugger (e.g. gdb) or change the limits so that cores would get dumped on error (e.g. with ulimit -c unlimited) and then examine the core file(s) with the debugger. Compiling with debug information (e.g. adding -g to the compiler options) helps a lot in such cases.
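The ordering problem can be avoided by parsing the size before anything is sized from it. The following is a minimal sketch in plain C; parse_buffsize and region_bytes are hypothetical helpers that are not part of the original program, and the MPI, shm_open and mmap calls are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parse a positive element count, returning 0 on bad input. */
static int parse_buffsize(const char *arg)
{
    char *end = NULL;
    long value;

    if (arg == NULL)
        return 0;
    errno = 0;
    value = strtol(arg, &end, 10);
    if (errno == ERANGE || end == arg || *end != '\0')
        return 0;                       /* overflow, empty, or trailing junk */
    if (value <= 0 || value > INT_MAX)
        return 0;                       /* must fit in a positive int */
    return (int)value;
}

/* Hypothetical helper: bytes of shared memory so each rank owns buffsize ints. */
static size_t region_bytes(int buffsize, int ntasks)
{
    assert(buffsize > 0 && ntasks > 0); /* the size must be known before mapping */
    return sizeof(int) * (size_t)buffsize * (size_t)ntasks;
}
```

In the question's main, this would mean assigning buffsize from parse_buffsize(argv[1]) first and only then passing it to gInit, so that ftruncate and mmap see the same size that call later uses for indexing.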

CUDA device pointer manipulation

I've used:
float *devptr;
//...
cudaMalloc(&devptr, sizeofarray);
cudaMemcpy(devptr, hostptr, sizeofarray, cudaMemcpyHostToDevice);
in CUDA C to allocate and populate an array.
Now I'm trying to run a cuda kernel, e.g.:
__global__ void kernelname(float *ptr)
{
//...
}
in that array but with an offset value.
In C/C++ it would be something like this:
kernelname<<<dimGrid, dimBlock>>>(devptr+offset);
However, this doesn't seem to work.
Is there a way to do this without sending the offset value to the kernel in a separate argument and using that offset in the kernel code?
Any ideas on how to do this?
Pointer arithmetic does work just fine in CUDA. You can add an offset to a CUDA pointer in host code and it will work correctly (remembering the offset isn't a byte offset, it is a plain word or element offset).
EDIT: A simple working example:
#include <cstdio>
int main(void)
{
const int na = 5, nb = 4;
float a[na] = { 1.2, 3.4, 5.6, 7.8, 9.0 };
float *_a, b[nb];
size_t sza = size_t(na) * sizeof(float);
size_t szb = size_t(nb) * sizeof(float);
cudaFree(0);
cudaMalloc((void **)&_a, sza );
cudaMemcpy( _a, a, sza, cudaMemcpyHostToDevice);
cudaMemcpy( b, _a+1, szb, cudaMemcpyDeviceToHost);
for(int i=0; i<nb; i++)
printf("%d %f\n", i, b[i]);
cudaFree(_a);
cudaDeviceReset(); // replaces the deprecated cudaThreadExit()
}
Here, you can see a word/element offset has been applied to the device pointer in the second cudaMemcpy call to start the copy from the second word, not the first.
Pointer arithmetic does work on host side code, it's used fairly often in the example code provided by nvidia.
"Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree."
Source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
And from the performance primitives (npp) documentation, a perfect example of pointer arithmetic.
"4.5.1 Select-Channel Source-Image Pointer
This is a pointer to the channel-of-interest within the first pixel of the source image. E.g. if pSrc is the pointer to the first pixel inside the ROI of a three channel image. Using the appropriate select-channel copy primitive one could copy the second channel of this source image into the first channel of a destination image given by pDst by offsetting the pointer by one:
nppiCopy_8u_C3CR(pSrc + 1, nSrcStep, pDst, nDstStep, oSizeROI);"
*Note: this works without multiplying by the number of bytes per data element because the compiler is aware of the data type of the pointer, and calculates the address accordingly.
In C and C++, pointer arithmetic can be written as above or with the notation &ptr[offset]. The address-of form yields the device address of the element rather than its value; reading the value itself (ptr[offset]) does not work on device memory from host-side code. With either notation the size of the data type is handled automatically, and the offset is specified as a number of data elements rather than bytes.
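The element-versus-byte point can be checked in plain C on host memory, where the same pointer rules apply. This is a minimal sketch; byte_distance is a hypothetical helper introduced for illustration, and no CUDA call is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many bytes lie between base and base + offset. */
static size_t byte_distance(const float *base, size_t offset)
{
    const float *shifted = base + offset;   /* offset counts elements, not bytes */

    assert(shifted == &base[offset]);       /* both notations give the same address */
    return (size_t)((const char *)shifted - (const char *)base);
}
```

With a float array data, byte_distance(data, 1) equals sizeof(float), which is why devptr + offset in a kernel launch skips offset floats rather than offset bytes.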

Resources