Why do I get a warning when I use recursion in CUDA?

I have to solve a problem on the GPU using CUDA, but I always get the warning: Stack size for "name of the function" cannot be statically determined.
This is for a student project. The project is written in C using the CUDA 9.0 libraries, and it runs on an NVIDIA Quadro K5000 GPU.
Every thread must execute one function, and that function makes two recursive calls to itself. I want to use two recursive calls because they keep the code clean and simple for me, but the stack size warning disappears if there is only one recursive call.
I get the warning every time I compile the code.
CUDA supports recursive function calls, but I don't understand why two recursive calls cause a problem.
__device__ void bitonicMergeGPU(float *arr, int l, int indexT, int order)
{
    int k, p;
    if (l > 1)
    {
        p = l / 2;
        for (k = indexT; k < indexT + p; k++)
        {
            // Compare the values.
            compareAndExchange(arr, k, k + p, order);
        }
        // THIS IS WHERE I GET THE ERROR
        bitonicMergeGPU(arr, p, indexT, order);
        bitonicMergeGPU(arr, p, indexT + p, order);
    }
}
I simply want to know if it is possible to solve the problem of the recursive calls.

CUDA supports recursion. When you use recursion in CUDA, this warning is expected, and there is no NVIDIA-documented way you can make the warning go away (except by not using recursion).
If you use a function recursively, in most languages it will use more stack space as the recursion depth increases. This is true in CUDA as well. You need to account for this and provide enough stack space for the maximum recursion depth you anticipate. It is common practice to limit recursion depth, so as to prevent stack problems.
The compiler is unable to discover the maximum runtime recursion depth at compile time, and the warning is there to remind you of that.
Regardless of how much you increase the stack size, the warning will not go away. The warning is there to let you know that it is your responsibility to make sure your recursion design along with the stack space allocated will work correctly. The compiler does not verify in any way that the amount of increase in stack size is sufficient.
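To make "provide enough stack space for the maximum recursion depth" concrete, here is a rough host-side sketch in plain C. It only counts how many frames of the question's bitonicMergeGPU can be alive at once for a given l (the function recurses with l/2 until l is 1 or less). The names maxRecursionDepth, stackBytesNeeded and the bytesPerFrame parameter are made up for illustration; the real per-call stack cost depends on the compiler and is not something this sketch can determine.

```c
#include <assert.h>
#include <stddef.h>

/* Number of nested bitonicMergeGPU calls alive at once for length l.
   The function calls itself with l/2 as long as l > 1. */
int maxRecursionDepth(int l)
{
    int depth = 1; /* the initial call */
    assert(l >= 0);
    while (l > 1) {
        l /= 2;
        depth++;
    }
    return depth;
}

/* bytesPerFrame is a hypothetical per-call stack cost that you would have
   to measure or estimate yourself; it is an assumption, not a CUDA value. */
size_t stackBytesNeeded(int l, size_t bytesPerFrame)
{
    return (size_t)maxRecursionDepth(l) * bytesPerFrame;
}
```

An estimate like this is what you would pass as the per-thread stack size, for example through cudaDeviceSetLimit() with cudaLimitStackSize. The warning itself still stays, for the reasons given above.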

You must be very careful when using recursion in CUDA. Recursion uses per-thread stack memory, which has a limit of 512 KB. The default is usually 1 KB, which is easy to overflow, and an overflow crashes the program. You can get the stack size per thread using cudaThreadGetLimit() (deprecated in newer toolkits in favor of cudaDeviceGetLimit() with cudaLimitStackSize).
Suggestions:
Redesign the algorithm/function using a non-recursive approach. The efficiency is usually very similar.
Increase the stack size per thread using cudaThreadSetLimit() (or its replacement, cudaDeviceSetLimit()), without exceeding the limit, e.g. 512 KB.
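As a sketch of the first suggestion, the merge from the question can be written with loops instead of recursion. This is plain C so it can be tried on the host; it assumes l is a power of two (as bitonic merging normally requires), and the compareAndExchange body is a guess, since the question does not show it (order 1 is taken to mean ascending). The name bitonicMergeIterative is made up.

```c
#include <assert.h>

/* Hypothetical stand-in for the question's compareAndExchange, whose body
   is not shown. Assumption: order == 1 means ascending, 0 descending. */
void compareAndExchange(float *arr, int i, int j, int order)
{
    if ((arr[i] > arr[j]) == (order == 1)) {
        float tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }
}

/* Loop-based bitonic merge over arr[indexT .. indexT + l - 1].
   Each pass over p corresponds to one level of the recursion; the blocks
   handled within a pass are the sub-ranges the recursive calls would visit. */
void bitonicMergeIterative(float *arr, int l, int indexT, int order)
{
    assert(l > 0 && (l & (l - 1)) == 0); /* power of two only */
    for (int p = l / 2; p >= 1; p /= 2) {
        for (int start = indexT; start < indexT + l; start += 2 * p) {
            for (int k = start; k < start + p; k++) {
                compareAndExchange(arr, k, k + p, order);
            }
        }
    }
}
```

Because the sub-ranges at each level do not overlap, visiting them level by level gives the same result as the depth-first recursive order. In device code both functions would be marked __device__, and with no recursion left the compiler can determine the stack size statically.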

Related

Rf_allocVector only allocates and does not zero out memory

The original motivation behind this is that I have a dynamically sized array of floats that I want to pass to R through Rcpp without incurring either the cost of zeroing it out or the cost of a deep copy.
Originally I thought there might be some way to take a heap-allocated array, make R's GC system aware of it, and then wrap it with other data to create an "Rcpp::NumericVector", but it seems that's not possible, or at least not doable with my current knowledge.
However, and correct me if I'm wrong, it looks like simply constructing a NumericVector of size N and then using it as an N-sized allocation will call R.h's Rf_allocVector, which does not zero out the allocated array. I tested it with a small C program that gets dyn.loaded into R, and the contents look like garbage values. I also took a peek at the assembly and there doesn't seem to be any zeroing out.
Can anyone confirm this or offer any alternate solution?

Welcome to StackOverflow.
You tagged this rcpp, but that is a function from the C API of R -- whereas the Rcpp API offers you its constructors, which do in fact set the memory to zero:
> Rcpp::cppFunction("NumericVector goodVec(int n) { return NumericVector(n); }")
> sum(goodVec(1e7))
[1] 0
>
This creates a dynamically allocated vector using R's memory functions. The vector is indistinguishable from R's own. And it has the memory set to zero, as we use R_Calloc, which is documented in Writing R Extensions as setting the memory to zero. (We may also use memcpy() explicitly; you can check the sources.)
So in short, you just have yourself confused over what the C API of R, as well as Rcpp offer, and what is easiest to use when. Keep reading documentation, running and writing examples, and studying existing code. It's all out there!
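The allocate-only versus allocate-and-zero distinction is easy to see in plain C, which is a loose analogy here and not the R or Rcpp API: malloc leaves the contents indeterminate, while calloc (or malloc followed by memset) gives zeroed memory. The function names below are made up, and the sketch assumes the usual IEEE 754 representation, where all-zero bytes read back as 0.0.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* calloc returns memory with every byte set to zero. */
double *allocZeroed(size_t n)
{
    return calloc(n, sizeof(double));
}

/* malloc alone leaves the contents indeterminate, so zero them explicitly. */
double *allocThenZero(size_t n)
{
    double *p = malloc(n * sizeof *p);
    if (p != NULL) {
        memset(p, 0, n * sizeof *p);
    }
    return p;
}
```

Either way the zeroing is a full pass over the buffer, which is the cost the question wants to avoid.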

Operate only on a subset of buffer in OpenCL kernel

Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate most of the functionality I need into OpenCL kernels. My question is how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method (for example) for point i accesses cells at [i-2:i+2], so if i=1, I'll run off the end of the array. So I add some extra points that serve to prevent this, and then just tell my algorithm to only run on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm a little less clear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(...) {
    int gid = get_global_id(0);
    // Ghost cells lie outside the valid range [nGhostCells, nPts-nGhostCells)
    if (gid < nGhostCells || gid >= nPts-nGhostCells) {
        retVal[gid] = 0;
        return;
    }
    // Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?

It looks sufficient.
The branch takes the same path for nPts - 2*nGhostCells of the points, and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, a sufficiently large nPts relative to nGhostCells (1024 vs 3) should not be noticeably slower than a branch-free version, apart from the latency of the "or" operation. Even that "or" latency should be hidden behind array access latency, thanks to thread-level parallelism.
At those "break" points, mostly 16 or 32 threads would lose some performance, and only for several clock cycles, because SIMD-like architectures run threads in lock-step.
If you end up with chaotic branching, such as a data-driven code path, you should split the work into different kernels (for different regions) or sort the data before the kernel so that branch divergence between neighboring threads is minimized.
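For checking the guard logic away from the GPU, a host-side emulation in plain C can help. In this sketch the loop index stands in for get_global_id(0), and the stencil (a plain sum over [i-2, i+2]) is made up, as is the name applyStencil. The upper bound uses >= so that the last valid point is nPts-nGhostCells-1, matching the [2:nPts-2] slice described in the question.

```c
#include <assert.h>

/* Emulates one work-item per loop iteration. Hypothetical stencil:
   out[i] = in[i-2] + in[i-1] + in[i] + in[i+1] + in[i+2]. */
void applyStencil(const float *in, float *out, int nPts, int nGhostCells)
{
    assert(nGhostCells >= 2 && nPts >= 2 * nGhostCells);
    for (int gid = 0; gid < nPts; gid++) {
        if (gid < nGhostCells || gid >= nPts - nGhostCells) {
            out[gid] = 0.0f; /* ghost cell: write a fixed value */
            continue;        /* in the kernel this would be a return */
        }
        out[gid] = in[gid - 2] + in[gid - 1] + in[gid]
                 + in[gid + 1] + in[gid + 2];
    }
}
```

Only the first and last nGhostCells iterations take the guard branch, which is the situation the answer describes: a handful of work-items at each end diverge and the rest follow the same path.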

Erlang: stackoverflow with recursive function that is not tail call optimized?

Is it possible to get a stackoverflow with a function that is not tail call optimized in Erlang? For example, suppose I have a function like this
sum_list([],Acc) ->
    Acc;
sum_list([Head|Tail],Acc) ->
    Head + sum_list(Tail, Acc).
It would seem like if a large enough list was passed in it would eventually run out of stack space and crash. I tried testing this like so:
> L = lists:seq(1, 10000000).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22, 23,24,25,26,27,28,29|...]
> sum_test:sum_list(L, 0).
50000005000000
But it never crashes! I tried it with a list of 100,000,000 integers and it took a while to finish but it still never crashed! Questions:
Am I testing this correctly?
If so, why am I unable to generate a stackoverflow?
Is Erlang doing something that prevents stackoverflows from occurring?

You are testing this correctly: your function is indeed not tail-recursive. To find out, you can compile your code using erlc -S <erlang source file>.
{function, sum_list, 2, 2}.
{label,1}.
{func_info,{atom,so},{atom,sum_list},2}.
{label,2}.
{test,is_nonempty_list,{f,3},[{x,0}]}.
{allocate,1,2}.
{get_list,{x,0},{y,0},{x,0}}.
{call,2,{f,2}}.
{gc_bif,'+',{f,0},1,[{y,0},{x,0}],{x,0}}.
{deallocate,1}.
return.
{label,3}.
{test,is_nil,{f,1},[{x,0}]}.
{move,{x,1},{x,0}}.
return.
As a comparison the following tail-recursive version of the function:
tail_sum_list([],Acc) ->
    Acc;
tail_sum_list([Head|Tail],Acc) ->
    tail_sum_list(Tail, Head + Acc).
compiles as:
{function, tail_sum_list, 2, 5}.
{label,4}.
{func_info,{atom,so},{atom,tail_sum_list},2}.
{label,5}.
{test,is_nonempty_list,{f,6},[{x,0}]}.
{get_list,{x,0},{x,2},{x,3}}.
{gc_bif,'+',{f,0},4,[{x,2},{x,1}],{x,1}}.
{move,{x,3},{x,0}}.
{call_only,2,{f,5}}.
{label,6}.
{test,is_nil,{f,4},[{x,0}]}.
{move,{x,1},{x,0}}.
return.
Notice the lack of allocate and the presence of the call_only opcode in the tail-recursive version, as opposed to the allocate/call/deallocate/return sequence in the non-tail-recursive function.
You are not getting a stack overflow because the Erlang "stack" is very large. Indeed, stack overflow usually means the processor stack overflowed, as the processor's stack pointer went too far away. Processes traditionally have a limited stack size which can be tuned by interacting with the operating system. See for example POSIX's setrlimit.
However, Erlang execution stack is not the processor stack, as the code is interpreted. Each process has its own stack which can grow as needed by invoking operating system memory allocation functions (typically malloc on Unix).
As a result, your function will not crash as long as malloc calls succeed.
For the record, the actual list L uses the same amount of memory as the stack needed to process it. Indeed, each element in the list takes two words: the integer value itself (which fits in a single word because the integers are small) and the pointer to the next element of the list. Likewise, the stack grows by two words at each iteration through the allocate opcode: one word for CP, which is saved by allocate itself, and one word as requested (the first parameter of allocate) for the current value.
For a list of 100,000,000 elements on a 64-bit VM, the list takes a minimum of 1.5 GB (and more in practice, since the actual stack is fortunately not grown two words at a time). Monitoring and garbage collecting this is difficult in the shell, as many values remain live. If you spawn a function, you can see the memory usage:
spawn(fun() ->
    io:format("~p\n", [erlang:memory()]),
    L = lists:seq(1, 100000000),
    io:format("~p\n", [erlang:memory()]),
    sum_test:sum_list(L, 0),
    io:format("~p\n", [erlang:memory()])
end).
As you can see, the memory for the recursive call is not released immediately.
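The two-words-per-element estimate above is simple to check as arithmetic. The helper below is a throwaway sketch in C (the name listBytes is made up, and it is not part of any Erlang tooling); it multiplies the element count by two words of the given size.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes used by a list whose elements each take two machine words:
   one for the value and one for the pointer to the next element. */
uint64_t listBytes(uint64_t elements, uint64_t wordBytes)
{
    assert(wordBytes > 0);
    return elements * 2u * wordBytes;
}
```

For 100,000,000 elements and 8-byte words this gives 1,600,000,000 bytes, about 1.49 GiB, which matches the 1.5 GB figure. By the same count the stack needs at least as much again while sum_list runs.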

Isn't using a stack same as recursion?

When asked for a non-recursive algorithm to solve a problem, people often use stacks, but in essence aren't stacks and recursion the same? Moreover, the space complexity remains the same (asymptotically) when stacks are used to replace recursion. Is there any fundamental difference that I have failed to observe?

Your application's call stack is more limited in size than a stack data structure. As long as you can allocate memory dynamically (this time it depends on the application's heap), you will have no problem.
As mentioned, the application's call stack is more limited, and it also holds a copy of each temporary local variable, function parameter, return value, stack pointer, etc. That leaves less usable space than it seems.

Yes, it's the exact same thing.
Using a stack data structure instead of the call stack is either a workaround (when a language has less stack space than your program needs at its maximum) or an optimization that might save some space: a stack frame usually takes several machine words in languages that don't do TCO, or you may not need to store the data the same way, so the total consumption can be far lower even when the explicit stack is less space-efficient per item.
Not all languages need it, though. In Racket an explicit stack serves no purpose, since the only limit on the stack is the total memory available to the program; even non-TCO calls (like tree traversal) keep working until all memory is consumed.
In Java the stack space is configurable when you start your program. I prefer to read recursive code, so check whether your language can do that before resorting to an explicit stack.
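To show the two forms side by side, here is a small sketch in C that sums the values in a binary tree, once with recursion and once with an explicit heap-allocated stack. The Node type and both function names are made up for the example. The pending work is the same in both versions; what changes is where it is stored.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node {
    int value;
    struct Node *left;
    struct Node *right;
} Node;

/* Recursive version: pending subtrees are remembered on the call stack. */
long sumRecursive(const Node *n)
{
    if (n == NULL) {
        return 0;
    }
    return n->value + sumRecursive(n->left) + sumRecursive(n->right);
}

/* Explicit-stack version: pending subtrees are kept in a heap array that
   grows on demand, so depth is bounded by available memory instead. */
long sumWithExplicitStack(const Node *root)
{
    size_t capacity = 16, top = 0;
    const Node **stack = malloc(capacity * sizeof *stack);
    long total = 0;
    assert(stack != NULL);
    if (root != NULL) {
        stack[top++] = root;
    }
    while (top > 0) {
        const Node *n = stack[--top];
        total += n->value;
        if (top + 2 > capacity) { /* make room for up to two pushes */
            const Node **grown;
            capacity *= 2;
            grown = realloc(stack, capacity * sizeof *stack);
            assert(grown != NULL);
            stack = grown;
        }
        if (n->left != NULL) {
            stack[top++] = n->left;
        }
        if (n->right != NULL) {
            stack[top++] = n->right;
        }
    }
    free(stack);
    return total;
}
```

Each explicit stack entry here is a single pointer, whereas a call frame also carries a return address and saved state, which is the per-item saving mentioned above.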

Big-O complexity may be the same in some cases, but constant factors using an explicit stack are often better. Moreover, the size of the machine execution stack is often relatively small, while an explicit (heap-allocated) stack can grow much larger.
Sometimes you'll want to look for a completely different algorithm that happens to be non-recursive that will perform much better. Consider the naive fibonacci sequence algorithm:
int f(int n)
{
    if (n < 2)
    {
        return 1;
    }
    return f(n-2) + f(n-1);
}
This algorithm takes exponential time :(
The non-recursive version (actually a form of "dynamic programming" in this case) is only linear in n, and does not use a stack:
int f(int n)
{
    int fMinusOne = 1;
    int fMinusTwo = 1;
    for (int idx = 1; idx < n; ++idx)
    {
        int next = fMinusOne + fMinusTwo;
        fMinusTwo = fMinusOne;
        fMinusOne = next;
    }
    return fMinusOne;
}

foldl generating Bind-stack overflow in Maxima

I wrote a function in Maxima that performs a foldl, similar to Haskell's:
foldl(f,ac,li):=block([con:[],acc:ac],/*print("List=",li,ac),*/
    if (is(li#[])) then
        (acc:apply(f,cons(acc,[first(li)])),
         acc:foldl(f,acc,rest(li))),acc)$
It folds the list from the left and evaluates as it goes, which prevents a long unevaluated expression from accumulating.
The problem appears when I run it with:
foldl(lambda([x,y],x+y),0,makelist(i,i,1,97));
Error in PROGN [or a callee]: Bind stack overflow.
But if I run it only up to 96, it produces the correct result.
I don't understand why this simple addition causes a problem, as I don't have any infinite loop or memory-hungry task going on.

Well, foldl is defined as a recursive function, and it will call itself as many times as there are elements in the list. So whether it works depends on the Lisp implementation-specific limit for the function call stack. For GCL it seems the limit is relatively small. For other Lisp implementations, the limit is greater. But the only way to make it work for all sizes of the list is to write it iteratively.
There are built-in functions similar to foldl -- see lreduce, rreduce, xreduce, and tree_reduce.
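The iterative shape is the same in any language. As an illustration only, in C rather than Maxima, a left fold written as a loop looks like the sketch below; the names BinaryOp, add and foldlIterative are made up. The accumulator is updated in place, so no call frames pile up however long the list is.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*BinaryOp)(long acc, long x);

/* Example operator, playing the role of lambda([x,y],x+y). */
long add(long acc, long x)
{
    return acc + x;
}

/* Left fold as a loop: acc = f(acc, items[0]), then f(acc, items[1]), ... */
long foldlIterative(BinaryOp f, long acc, const long *items, size_t n)
{
    assert(f != NULL);
    for (size_t i = 0; i < n; i++) {
        acc = f(acc, items[i]);
    }
    return acc;
}
```

In Maxima the same idea would be a for loop over the list inside the block, or simply one of the built-in reduce functions mentioned above.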
