clEnqueueNDRangeKernel' failed with error 'out of resources' - opencl

From my kernel, I call a function say f which has an infinite loop which breaks on ++depth > 5. This works without the following snippet.
for(int j = 0;j < 9;j++){
f1 = inside(prev.o, s[j]);
f2 = inside(x.o, s[j]);
if((f1 ^ f2)){
stage = 0;
break;
}
else if(fabs(offset(x.o, s[j])) < EPSILON)
{
id = j;
stage = 1;
break;
}
}
Looping over the 9 elements in s is the only thing I do here. This is inside the infinite loop. I checked and this does not have a problem running 2 times but the third time it runs out of memory. What is going on? It's not like I am creating any new variables anywhere. There is a lot of code in the while loop which does more complicated computation than the above snippet and that does not run into a problem. My guess is that I'm doing something wrong with storing s.

If you read the OpenCL documentation, the error is not produced because the kernel code is wrong. The code is not even run at all, it all happens at the queueing step:
OpenCL: clEnqueuNDRangeKernel
CL_OUT_OF_RESOURCES:
If there is a failure to queue the execution instance of kernel on the command-queue because of insufficient resources needed to execute the kernel. For example, the explicitly specified local_work_size causes a failure to execute the kernel because of insufficient resources such as registers or local memory.
Another example would be the number of read-only image args used in kernel exceed the CL_DEVICE_MAX_READ_IMAGE_ARGS value for device or the number of write-only image args used in kernel exceed the CL_DEVICE_MAX_WRITE_IMAGE_ARGS value for device or the number of samplers used in kernel exceed CL_DEVICE_MAX_SAMPLERS for device.
Check the local memory size, local group size, constant memory and kernel arguments size.

Related

Recursive deallocation of larger tries

I have written a basic function for recursively deallocating a trie data structure in C:
// Root pointer is passed as arg in initial call
void destroy(node *trav)
{
for (int i = 0; i < N; i++)
{
if (trav->children[i])
{
destroy(trav->children[i]);
}
}
free(trav);
}
This functions seems to work perfectly fine with any smaller dictionary file. The largest file that the program successfully loaded and unloaded contained 134,480 words.
However, it produces a segmentation fault when deallocating a larger trie. The larger file that causes a segmentation fault contains 506,915 words.
The error message produced by Valgrind states: "Invalid read of size 8" followed by several backtraces and finally; "Address is not stack'd, malloc'd or (recently) free'd".
What might be causing this?
What might be causing this?
Stack overflow might be causing this, although that seems somewhat unlikely: there are almost no locals, so each frame probably only consumes 32 bytes of stack, and that would allow recursion of 8M/32 == 262144 levels deep with Linux default 8MiB stack.
However, if your trie is extremely unbalanced, stack overflow is possible.
You can try ulimit -s unlimited and see if that makes the problem go away.
Or you could run your program under GDB, and examine the instruction at which the SIGSEGV is reported. If it's a CALL, PUSH, or another form of "move to stack", stack overflow is also very likely.

What is the rule behind instruction count in Intel PIN?

I wanted to count instructions in simple recursive fibo function O(2^n). I succeded to do so with bubble sort and matrix multiplication, but in this case it seemed like instruction count ignored my fibo function. Here is the code used for instrumentation:
// Insert a call at the entry point of a routine to increment the call count
RTN_InsertCall(rtn, IPOINT_BEFORE, (AFUNPTR)docount, IARG_PTR, &(rc->_rtnCount), IARG_END);
// For each instruction of the routine
for (INS ins = RTN_InsHead(rtn); INS_Valid(ins); ins = INS_Next(ins))
{
// Insert a call to docount to increment the instruction counter for this rtn
INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_PTR, &(rc->_icount), IARG_END);
}
I started to wonder what's the difference between this program and the previous ones and my first thought was: here I'm not using an array.
This is what I realised after some manual tests:
a = 5; // instruction ignored by PIN and
// pretty much everything not using array
fibo[1] = 1 // instruction counted properly
a = fibo[1] // instruction ignored by PIN
So it seems like only instructions counted are writes to the memory (that's what I assume). After I changed my fibo function to this it works:
long fibonacciNumber(int n, long *fiboNumbers)
{
if (n < 2) {
fiboNumbers[n] = n;
return n;
}
fiboNumbers[n] = fiboNumbers[n-1] + fiboNumbers[n-2];
return fibonacciNumber(n - 1, fiboNumbers) + fibonacciNumber(n - 2, fiboNumbers);
}
But I would like to count instructions also for programs that aren't written by me. Is there a way to count all type of instrunctions? Is there any particular reason why only this instructions are counted? Any help appreciated.
//Edit
I used disassembly option in Visual Studio to check how it looks and it still makes no sense for me. I can't find the reason why only assingment to array is interpreted by PIN as instruction.
instruction_comparison
This exceeded all my expectations, counted as 2 instructions:
even 2 instructions, not one
PIN, like other low-level profiling and analysis tools, measures individual instructions, low-level orders like "add these two registers" or "load a value from that memory address". The sequence of instructions which a program comprises are generally produced from a high-level language like C++ through a compiler. An individual line of C++ code might be transformed into exactly one instruction, but it's also common for a line to translate to several instructions or even to zero instructions; and the instructions for a line of code may be interleaved with those of other instructions.
Your compiler can output an assembly-language file for your source code, showing what instructions were produced for which lines of code. (For GCC and Clang, this is done with the -S flag.) Note that reading the assembly code output from a compiler is not the best way to learn assembly. Also, I would point you to godbolt.org, a very convenient tool for analyzing assembly output.

Using vloadn (opencl) to load unallocated memory

I am using vloadn to load data and as a parameter I pass the range I want to read and it works, but I am wondering what's the behavior of vload4. If this might cause some unexpected issue or I am perfectly safe to do this. An example might be something like this:
__kernel void myKernel(__global float* data_ptr, int size)
{
float4 vec = vload4(0, data_ptr);
float sum = 0.f;
// data_ptr points to an array of 2 floats in global mem
if (size == 2) {
sum += vec.s1;
sum += vec.s0;
}
else if (size == 1) {
sum += vec.s0;
}
}
data_ptr is an array of 2 floats in global memory, but even though I am accessing only those 2 floats, I am loading 4 floats using vload4. The reason I am asking is that I want to use a single vloadn and decide afterwards how much of it I actually want to use and not to use vloadn based on size (e.g. for size==4 use vload4, for size==8 vload8 etc.
If it's still within data_ptr it will be fine; you don't have to use all the data you read. However, if you read past either end of the buffer that data_ptr points to you can have problems (memory read exception, for example, or some other device-dependent error). Note: Check the address alignment requirements for vload to see if you're allowed to read at any address or only multiple of with size.

Exit early on found in OpenCL

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is how does a work item indicate, both to the host and other threads, when it has found the character?
Thanks,
One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value and other threads will check on that value when they are doing work.
For example:
__kernel test(__global int* buffer, __global volatile int* flag)
{
int tid = get_global_id(0);
int sx = get_global_size(0);
int i = tid;
while(buffer[i] != 8) //Whatever value we're trying to find.
{
int stop = atomic_add(&flag, 0); //Read the atomic value
if(stop)
break;
i = i + sx;
}
atomic_xchg(&flag, 1); //Set the atomic value
}
This might add more overhead than by just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread is just checking a single value in the array. Each thread must have multiple iterations of work.
Finally, I've seen instances where writing to an atomic variable doesn't immediately commit, so you need to check to see if this code will deadlock on your system because the write isn't committing.

Pointer Trouble

I was trying some basic pointer manipulation and have a issue i would like clarified. Here is the code snippet I am referring to
int arr[3] = {0};
*(arr+0) = 12;
*(arr+1) = 24;
*(arr+2) = 74;
*(arr+3) = 55;
cout<<*(arr+3)<<"\t"<<(long)(arr+3)<<endl;
//cout<<"Address of array arr : "<<arr<<endl;
cout<<(long)(arr+0)<<"\t"<<(long)(arr+1)<<"\t"<<(long)(arr+2)<<endl;;
for(int i=0;i<4;i++)
cout<<*(arr+i)<<"\t"<<i<<"\t"<<(long)(arr+i)<<endl;
//*(arr+3) = 55;
cout<<*(arr+3)<<endl<<endl;
My problem is:
When I try to acces arr+3 outside the for-loop , I get the desired value 55 printed. But when I try to access it through the for loop, I get some different value(3 in this case). After the for loop, it is printing the value as 4. Could someone explain to me what is happening? Thanks in advance..
You have created an array of size 3 and you are trying to access the 4th element. The outcome is therefore undefined.
Since you allocate the array in the stack, the first time you try to write the 4th element, you are actually writing beyond the space that was allocated for the stack. In Debug mode this will work, but in Release your program will probably crash.
The second time you are reading the value at the 4th place you are reading the value 4. This makes sense, as the compiler has allocated the stack space after the array for variable i, which after the loop has finished executing will have the value 4.
As array has been defined with 3 elements, data will be stored sequentially like 12,24,74. When you assign 55 for 4th element, it is stored somewhere else in memory, not sequentially. First time, Compiler prints it correctly, but then it is not able to handle memory so it prints garbage value.

Resources