When asked for a non-recursive algorithm to solve a problem, people often use stacks, but in essence aren't stacks and recursion the same? Moreover, the space complexity stays the same (asymptotically) when a stack is used to replace recursion. Is there any fundamental difference that I have failed to observe?
Your application's call stack is more limited in size than a stack data structure. As long as you can allocate memory dynamically (which depends on the application's heap instead), you will have no problem.
Your application's stack, as mentioned, is more limited, and it also holds a copy of every temporary local variable, function parameter, return value, stack pointer, etc. That makes its usable size smaller than it seems.
Yes, it's the exact same thing.
Using an explicit stack instead of the call stack is either a workaround (when a language has less stack space than the maximum your program needs to handle) or an optimization that might save some space: a stack frame usually takes several machine words in languages that don't do TCO, and you may not need to store the data the same way, so the total consumption can be far less even when the explicit stack is less space-efficient per item.
Not all languages need it, though. In Racket an explicit stack serves no purpose, since there is no limit on the stack other than the total memory available to the program, so even non-TCO calls (like tree traversal) work until all memory is consumed.
In Java the stack size is configurable when you start the program (the -Xss flag). Recursive code is usually easier to read, so check whether your language can do that before reverting to an explicit stack.
Big-O complexity may be the same in some cases, but constant factors using an explicit stack are often better. Moreover, the size of the machine execution stack is often relatively small, while an explicit (heap-allocated) stack can grow much larger.
Sometimes you'll want to look for a completely different algorithm that happens to be non-recursive and performs much better. Consider the naive Fibonacci sequence algorithm:
int f(int n)
{
    if (n < 2)
    {
        return 1;
    }
    return f(n-2) + f(n-1);
}
This algorithm takes exponential time :(
The non-recursive version (actually a form of "dynamic programming" in this case) is only linear in n, and does not use a stack:
int f(int n)
{
    int fMinusOne = 1;
    int fMinusTwo = 1;
    for (int idx = 1; idx < n; ++idx)
    {
        int next = fMinusOne + fMinusTwo;
        fMinusTwo = fMinusOne;
        fMinusOne = next;
    }
    return fMinusOne;
}
I understand that, for C at least, the stack frame and return address are written to the stack every time the recursive function is called, but is there an obscure way of making it not run out of memory? Obviously this is a purely hypothetical question, as I can't imagine a use case for it.
You can emulate recursion using a stack
The part of memory used for function calls and local variables (declared with int x; in C) is separate from the part used for dynamic allocation (using malloc() in C). Only the former, called "the stack", is limited and will lead to a "stack overflow" error. Well, of course that's not entirely true. The latter is called "the heap", and your computer is not magic either: it will run out of memory at some point if you really try to push its limits.
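To make the first point concrete, here is a minimal sketch in C of a pre-order tree traversal with an explicit heap-allocated stack; the node type, the fixed capacity, and the visit callback are illustrative, not taken from any particular library, and error checking is omitted:

#include <stdlib.h>

/* Hypothetical node type for illustration. */
struct node {
    int value;
    struct node *left, *right;
};

/* Pre-order traversal without recursion: the explicit stack lives on the
   heap, so its depth is bounded by available memory rather than by the
   call-stack limit. */
void traverse(struct node *root, void (*visit)(struct node *))
{
    size_t capacity = 1024;   /* illustrative; a real version would grow it */
    struct node **stack = malloc(capacity * sizeof *stack);
    size_t top = 0;

    if (root != NULL)
        stack[top++] = root;

    while (top > 0)
    {
        struct node *n = stack[--top];
        visit(n);
        /* Push right first so the left subtree is visited first,
           matching the order of the recursive version. */
        if (n->right != NULL) stack[top++] = n->right;
        if (n->left  != NULL) stack[top++] = n->left;
    }
    free(stack);
}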
Related:
Recursive function to loop and stack
How can you emulate recursion with loop and stack?
How to rewrite a recursive method by using a stack?
Way to go from recursion to iteration
You can use tail-recursion to avoid adding layers to the call stack
Stack overflow is due to the size of the call stack. Imagine a function like this:
int f(int n)
{
    int x;
    if (n < 2)
    {
        return 1;
    }
    else
    {
        x = f(n-1);
        return n * x;
    }
}
When making the recursive call to f, the computer needs to keep a note of the fact that one more multiplication remains to be done once the recursive call completes. Taking this note is achieved by adding a layer to the "call stack" with information on the values of the variables and on where we are in the code. This requires memory and will lead to a stack overflow if the stack becomes too big.
Now compare with the following code:
int f(int n, int acc)
{
    if (n < 2)
    {
        return acc;
    }
    else
    {
        return f(n-1, n * acc);
    }
}
This time the recursive call is directly encapsulated in the return, meaning there is no more work to do after the recursive call. Imagine you asked me to do a job and report the result back to you; by making the recursive call I'm delegating some work to a friend. Instead of waiting around for my friend to report back to me so that I can report back to you, I leave immediately and tell my friend to report directly to you. This saves memory by "cutting out the middle man".
Read more:
Wikipedia: Tail call
Wikipedia: Tail-recursive functions
In languages that feature lazy evaluation, you can write a seemingly infinitely-recursive function, then only evaluate it as far as required:
Haskell infinite recursion
Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate over most of the functionality I need in to OpenCL kernels. My question is on how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method (for example) for point i accesses cells at [i-2:i+2], so if i=1, I'll run off the end of the array. So - I add some extra points that serve to prevent this, and then just tell my algorithm to only run on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm a little more unclear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(...) {
    gid = get_global_id(0);
    if (gid < nGhostCells || gid > nPts-nGhostCells) {
        retVal[gid] = 0;
        return; // nothing else to do for ghost cells
    }
    // Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?
It looks sufficient.
Branching is the same for the nPts - nGhostCells*2 interior points, and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, a sufficiently large nPts versus nGhostCells (1024 vs 3, say) should not be noticeably slower than a zero-branching version, except for the latency of the "or" operation. Even that "or" latency should be hidden behind the array-access latency, thanks to thread-level parallelism.
At those "break" points, at most 16 or 32 threads lose some performance, and only for several clock cycles, because of the lock-step execution of SIMD-like architectures.
If you happen to have some chaotic branching, like a data-driven code path, then you should split the work into different kernels (for different regions) or sort the data before the kernel so that the average branch divergence between neighboring threads is minimized.
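For reference, here is the guard from the question written out as a complete, if simplified, kernel. The names myMethod, retVal, nPts, and nGhostCells come from the question; the input buffer and the stencil itself are placeholders, and the second comparison uses >= so the upper ghost cells are excluded as the [2:nPts-2] slice intends:

__kernel void myMethod(__global const float *in,
                       __global float *retVal,
                       const unsigned int nPts,
                       const unsigned int nGhostCells)
{
    unsigned int gid = get_global_id(0);

    /* Ghost/boundary cells: write a dummy value and exit early,
       so the stencil below never runs off the array. */
    if (gid < nGhostCells || gid >= nPts - nGhostCells)
    {
        retVal[gid] = 0.0f;
        return;
    }

    /* Interior point: accessing [gid-2 : gid+2] is now safe.
       (Illustrative stencil, not the asker's actual method.) */
    retVal[gid] = 0.25f * (in[gid - 2] + in[gid - 1]
                         + in[gid + 1] + in[gid + 2]);
}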
I have to solve a coding problem on the GPU using CUDA, but I always get the warning Stack size for "name of the function" cannot be statically determined.
This is for a student project that I'm working on; the project is written in C using CUDA 9.0 libraries and runs on an NVIDIA Quadro K5000 GPU.
Every single thread must execute one function, and in this function there are two recursive calls of the same function. The reason I want those two recursive calls is that they keep the code clean and simple for me, but with only one recursive call the stack-size problem goes away.
The warning quoted above is what I get every time I compile the code. CUDA supports recursive function calls, but I don't understand why it becomes a problem when there are two recursive calls.
__device__ void bitonicMergeGPU(float *arr, int l, int indexT, int order)
{
    int k, p;
    if (l > 1)
    {
        p = l/2;
        for (k = indexT; k < indexT+p; k++)
        {
            //Compare the values.
            compareAndExchange(arr, k, k+p, order);
        }
        //THIS IS WHERE I GET THE ERROR
        bitonicMergeGPU(arr, p, indexT, order);
        bitonicMergeGPU(arr, p, indexT+p, order);
    }
}
I simply want to know if it is possible to solve the problem of the recursive calls.
CUDA supports recursion. When you use recursion in CUDA, this warning is expected, and there is no NVIDIA-documented way you can make the warning go away (except by not using recursion).
If you use a function recursively, in most languages it will use more stack space as the recursion depth increases. This is true in CUDA as well. You need to account for this and provide enough stack space for the maximum recursion depth you anticipate. It is common practice to limit recursion depth, so as to prevent stack problems.
The compiler is unable to discover the maximum runtime recursion depth at compile time, and the warning is there to remind you of that.
Regardless of how much you increase the stack size, the warning will not go away. The warning is there to let you know that it is your responsibility to make sure your recursion design along with the stack space allocated will work correctly. The compiler does not verify in any way that the amount of increase in stack size is sufficient.
You must be very careful when using recursion in CUDA. Recursion uses stack memory, which has a limit of 512 KB. The default is usually 1 KB, which is easy to overflow, crashing the program. You can get the stack size per thread using cudaThreadGetLimit().
Suggestions:
Redesign the algorithm/function using a non-recursive approach. The efficiency is usually very similar.
Increase the stack size per thread using cudaThreadSetLimit(), without exceeding the limit, e.g. 512 KB.
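As a concrete illustration of the second suggestion, here is a minimal host-side sketch. cudaThreadGetLimit()/cudaThreadSetLimit() are deprecated in newer toolkits, so this uses the equivalent cudaDeviceGetLimit()/cudaDeviceSetLimit(); the 16 KB figure is just an illustrative choice, not a recommendation:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t stackSize = 0;

    // Query the current per-thread stack size (the default is small).
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default stack size per thread: %zu bytes\n", stackSize);

    // Raise it to cover the maximum recursion depth you anticipate.
    cudaDeviceSetLimit(cudaLimitStackSize, 16 * 1024);

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("new stack size per thread: %zu bytes\n", stackSize);
    return 0;
}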
I am working on some OpenCL reduction and I found AMD and Nvidia both has some example like the following kernel (this one is taken from Nvidia's website, but AMD has a similar one):
__kernel void reduce2(__global T *g_idata, __global T *g_odata, unsigned int n, __local T *sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int i = get_global_id(0);
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for (unsigned int s = get_local_size(0)/2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[get_group_id(0)] = sdata[0];
}
I have two questions:
1. The code above reduces an array to another, smaller array. I am just wondering why all the examples I have seen do the same, instead of reducing an array directly to a single element, which is the usual meaning of "reduction" (IMHO). This should be easily achievable with an outer loop inside the kernel. Is there a special reason for this?
2. I have implemented this reduction and found it quite slow. Is there any optimisation I can do to improve it? I saw another example that used some unrolling to avoid synchronisation in the loop, but I did not quite get the idea; can you explain a bit?
The reduction problem in a multithreaded environment is a very special parallel problem. There is one path that must be done sequentially: the chain of combinations down to element 0, halving the size at each step.
Even if you had infinitely many threads available, you would still need log2(N) passes through the array to reduce it to a single element.
In a real system the number of threads (work-items) is limited but still large (~128-2048). To use them efficiently, all of them need something to do, but the problem becomes less and less parallel and more and more serial as the reduction shrinks. That is why these algorithms handle only the highly parallel part and let the CPU do the rest of the reduction.
To make a long story short: you can reduce an array from 1024 to 512 elements in one pass, but reducing it from 2 elements to 1 costs the same kind of pass, during which all threads but one are idle, an incredible waste of GPU resources (99.7% idle).
As you can see, there is no point in reducing this last part on a GPU. It is easier to simply copy it to the CPU and do it sequentially.
Answering your question: yes, it is slow, and it always will be. If there were a magic trick to solve it, AMD and nVIDIA would be using it, don't you think? :)
For question 1: this kernel reduces a big array into a smaller one, not a single element, because no synchronization is possible between work-groups. Each work-group can reduce its portion of the array to one element, but then all of these single elements, one per work-group, need to be written to global memory before a new pass can be performed. This continues until the array is small enough to be handled by a single work-group.
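A sketch of what that host-side driver loop might look like (the queue, kernel, and buffer handles are hypothetical and assumed to be created elsewhere; T is taken as float, and error checking is omitted):

#include <CL/cl.h>

/* Repeatedly launch the reduction kernel: each pass shrinks the data by a
   factor of the work-group size, until a single element remains. */
void reduceToOne(cl_command_queue queue, cl_kernel reduce2,
                 cl_mem g_idata, cl_mem g_odata, size_t nElements)
{
    size_t groupSize = 256;              /* illustrative work-group size */
    size_t n = nElements;
    cl_mem in = g_idata, out = g_odata;

    while (n > 1)
    {
        size_t nGroups = (n + groupSize - 1) / groupSize;
        size_t globalSize = nGroups * groupSize;
        unsigned int nArg = (unsigned int)n;

        clSetKernelArg(reduce2, 0, sizeof(cl_mem), &in);
        clSetKernelArg(reduce2, 1, sizeof(cl_mem), &out);
        clSetKernelArg(reduce2, 2, sizeof(unsigned int), &nArg);
        clSetKernelArg(reduce2, 3, groupSize * sizeof(float), NULL); /* local scratch */

        clEnqueueNDRangeKernel(queue, reduce2, 1, NULL, &globalSize, &groupSize,
                               0, NULL, NULL);

        /* The output of this pass becomes the input of the next. */
        n = nGroups;
        cl_mem tmp = in; in = out; out = tmp;
    }
}

In practice, as the answer above notes, you would break out of the loop once n is small and finish the last few additions on the CPU.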
For question 2: there are several approaches to performing a reduction, with different performance characteristics. How to improve performance for this kind of problem is discussed in this article from the AMD resources. Hope you'll find it useful.
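For a flavour of what such an optimisation looks like, here is one well-known step, sketched from memory after the NVIDIA reduction sample this kernel comes from, with T taken as float: each work-item adds two global elements while loading, which halves the number of work-groups per pass at no extra synchronisation cost:

__kernel void reduce3(__global const float *g_idata, __global float *g_odata,
                      unsigned int n, __local float *sdata)
{
    unsigned int tid = get_local_id(0);
    /* Each work-group now covers 2 * local_size input elements. */
    unsigned int i = get_group_id(0) * (get_local_size(0) * 2) + tid;

    /* Perform the first addition while loading from global memory. */
    float sum = (i < n) ? g_idata[i] : 0.0f;
    if (i + get_local_size(0) < n)
        sum += g_idata[i + get_local_size(0)];
    sdata[tid] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Same tree reduction in local memory as before. */
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (tid == 0)
        g_odata[get_group_id(0)] = sdata[0];
}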
In Introduction to Algorithms p169 it talks about using tail recursion for Quicksort.
The original Quicksort algorithm earlier in the chapter is (in pseudo-code)
Quicksort(A, p, r)
{
    if (p < r)
    {
        q <- Partition(A, p, r)
        Quicksort(A, p, q)
        Quicksort(A, q+1, r)
    }
}
The optimized version using tail recursion is as follows
Quicksort(A, p, r)
{
    while (p < r)
    {
        q <- Partition(A, p, r)
        Quicksort(A, p, q)
        p <- q+1
    }
}
where Partition rearranges the array around a pivot.
The difference is that the second algorithm calls Quicksort only once, to sort the LHS, and handles the RHS in the loop.
Can someone explain to me why the first algorithm could cause a stack overflow, whereas the second wouldn't? Or am I misunderstanding the book?
First, let's start with a brief (probably not perfectly accurate, but still useful) definition of what a stack overflow is.
As you probably know, there are two different kinds of memory, implemented as two different data structures: the heap and the stack.
In terms of size, the heap is bigger than the stack. To keep it simple, let's say that every time a function call is made, a new environment (local variables, parameters, etc.) is created on the stack. Given that, and the fact that the stack's size is limited, if you make too many function calls you will run out of space, hence you get a stack overflow.
The problem with recursion is that, since you are creating at least one new environment on the stack per call, you occupy a lot of the limited stack space very quickly, so stack overflows are commonly associated with recursive calls.
There is a thing called tail call optimization that reuses the same environment every time a recursive call is made, so the space occupied on the stack stays constant, preventing the stack overflow issue.
Now, there are some rules required to perform tail call optimization. First, each call must be complete, by which I mean that the function should be able to produce a result at any moment if you interrupt the execution; in SICP this is called an iterative process, even when the function is recursive.
If you analyze your first example, you will see that each iteration is defined by two recursive calls, which means that if you stop the execution at any time you can't give a partial result, because the result depends on those calls finishing. In this scenario you can't reuse the stack environment, because the total information is split between all those recursive calls.
However, the second example doesn't have that problem: A is constant, and the state of p and r can be determined locally. Since all the information needed to keep going is there, TCO can be applied.
The essence of tail recursion optimization is that there is no recursion when the program actually executes. When the compiler or interpreter is able to kick TRO in, it essentially rewrites your recursively-defined algorithm into a simple iterative process, with the stack no longer used to store nested function invocations.
The first code snippet can't be fully TR-optimized, because it contains two recursive calls and only the last of them is in tail position.
Tail recursion by itself is not enough. The algorithm with the while loop can still use O(N) stack space; reducing it to O(log N) is left as an exercise in that section of CLRS.
Assume we are working in a language with array slices and tail call optimization. Consider the difference between these two algorithms:
Bad:
Quicksort(arraySlice) {
    if (arraySlice.length > 1) {
        slices = Partition(arraySlice)
        (smallerSlice, largerSlice) = sortBySize(slices)
        Quicksort(largerSlice)  // Not a tail call, requires a stack frame until it returns.
        Quicksort(smallerSlice) // Tail call, can replace the old stack frame.
    }
}
Good:
Quicksort(arraySlice) {
    if (arraySlice.length > 1) {
        slices = Partition(arraySlice)
        (smallerSlice, largerSlice) = sortBySize(slices)
        Quicksort(smallerSlice) // Not a tail call, requires a stack frame until it returns.
        Quicksort(largerSlice)  // Tail call, can replace the old stack frame.
    }
}
The second one is guaranteed to never need more than log2(length) stack frames, because smallerSlice is less than half as long as arraySlice. For the first one, the inequality is reversed: it will always need at least log2(length) stack frames, and can require O(N) stack frames in the worst case, where smallerSlice always has length 1.
If you don't keep track of which slice is smaller, you will have worst cases similar to the first, overflowing version, even though it requires only O(log N) stack frames on average. If you always recurse on the smaller slice first, you will never need more than log2(length) stack frames.
If you are using a language that doesn't have tail call optimization, you can write the second (not stack-blowing) version as:
Quicksort(arraySlice) {
    while (arraySlice.length > 1) {
        slices = Partition(arraySlice)
        (smallerSlice, arraySlice) = sortBySize(slices)
        Quicksort(smallerSlice) // Still not a tail call, requires a stack frame until it returns.
    }
}
Another thing worth noting: if you are implementing something like Introsort, which switches to Heapsort when the recursion depth exceeds some number proportional to log(N), you will never hit the O(N) worst-case stack memory usage of quicksort, so you technically don't need to do this. Doing this optimization (recursing on the smaller slice first) still improves the constant factor of the O(log N), though, so it is strongly recommended.
Well, the most obvious observation would be:
Most common stack overflow problem - definition
The most common cause of stack overflow is excessively deep or infinite recursion.
The second recurses less than the first (one recursive call per partition instead of two), hence it is less likely to cause a stack overflow.
(So fewer simultaneous stack frames means less chance of causing a stack overflow.)
But somebody would have to add why the second can never cause a stack overflow while the first can.
source
Well, if you consider the complexity of the two methods, the first method obviously recurses more than the second, since it recurses on both the LHS and the RHS; as a result there are more chances of getting a stack overflow.
Note: that doesn't mean there is absolutely no chance of getting an SO with the second method.
In function 2 that you shared, tail call elimination is implemented. Before proceeding further, let us understand what a tail recursive function is: if the last statement in the code is the recursive call, and nothing is done after it, then the function is tail recursive. By that definition, the last call in the first function is a tail call. For such a function, with some changes to the code, one can remove that last recursive call, as you showed in function 2, which performs the same work as function 1. This process is called tail recursion optimization or tail call elimination, and it has the following results:
Optimizing in terms of auxiliary space
Optimizing in terms of recursion call overhead
The last recursive call is eliminated by using the while loop. The good thing is that in function 2, no auxiliary space is used for the right-hand call, since its recursion is eliminated via p <- q+1, and that call no longer incurs recursion overhead. So, provided the recursion is applied to the smaller side of the partition, the maximum space needed is Θ(log n).