Is it possible to get a stack overflow with a function that is not tail-call optimized in Erlang? For example, suppose I have a function like this:
sum_list([], Acc) ->
    Acc;
sum_list([Head|Tail], Acc) ->
    Head + sum_list(Tail, Acc).
It would seem that if a large enough list were passed in, it would eventually run out of stack space and crash. I tried testing this like so:
> L = lists:seq(1, 10000000).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22, 23,24,25,26,27,28,29|...]
> sum_test:sum_list(L, 0).
50000005000000
But it never crashes! I tried it with a list of 100,000,000 integers and it took a while to finish but it still never crashed! Questions:
Am I testing this correctly?
If so, why am I unable to generate a stack overflow?
Is Erlang doing something that prevents stack overflows from occurring?
You are testing this correctly: your function is indeed not tail-recursive. To verify this, you can compile your code using erlc -S <erlang source file>.
{function, sum_list, 2, 2}.
{label,1}.
{func_info,{atom,so},{atom,sum_list},2}.
{label,2}.
{test,is_nonempty_list,{f,3},[{x,0}]}.
{allocate,1,2}.
{get_list,{x,0},{y,0},{x,0}}.
{call,2,{f,2}}.
{gc_bif,'+',{f,0},1,[{y,0},{x,0}],{x,0}}.
{deallocate,1}.
return.
{label,3}.
{test,is_nil,{f,1},[{x,0}]}.
{move,{x,1},{x,0}}.
return.
As a comparison the following tail-recursive version of the function:
tail_sum_list([], Acc) ->
    Acc;
tail_sum_list([Head|Tail], Acc) ->
    tail_sum_list(Tail, Head + Acc).
compiles as:
{function, tail_sum_list, 2, 5}.
{label,4}.
{func_info,{atom,so},{atom,tail_sum_list},2}.
{label,5}.
{test,is_nonempty_list,{f,6},[{x,0}]}.
{get_list,{x,0},{x,2},{x,3}}.
{gc_bif,'+',{f,0},4,[{x,2},{x,1}],{x,1}}.
{move,{x,3},{x,0}}.
{call_only,2,{f,5}}.
{label,6}.
{test,is_nil,{f,4},[{x,0}]}.
{move,{x,1},{x,0}}.
return.
Notice the lack of allocate and the use of the call_only opcode in the tail-recursive version, as opposed to the allocate/call/deallocate/return sequence in the non-tail-recursive function.
You are not getting a stack overflow because the Erlang "stack" is very large. Indeed, a stack overflow usually means the processor stack overflowed, i.e. the processor's stack pointer moved past the stack's limit. Native processes traditionally have a limited stack size, which can be tuned by interacting with the operating system. See for example POSIX's setrlimit.
However, the Erlang execution stack is not the processor stack, as the code is interpreted. Each Erlang process has its own stack, which can grow as needed by invoking operating-system memory allocation functions (typically malloc on Unix).
As a result, your function will not crash as long as the malloc calls succeed.
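If you do want to see a failure, one option (a rough sketch; it assumes an OTP release that supports the max_heap_size process flag, and the limit below is arbitrary) is to cap the process's total heap, which in Erlang also covers its stack, so the VM kills the process instead of letting it grow indefinitely:
%% Sketch only: run the non-tail-recursive sum in a process capped at
%% 10,000,000 words. Building the list alone may already exceed this limit,
%% after which the process is killed rather than the whole VM crashing.
Limit = #{size => 10000000, kill => true, error_logger => true},
spawn_opt(fun() ->
              L = lists:seq(1, 100000000),
              io:format("~p~n", [sum_test:sum_list(L, 0)])
          end,
          [{max_heap_size, Limit}]).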
For the record, the list L uses about the same amount of memory as the stack needed to process it. Indeed, each element in the list takes two words: the integer value itself (which fits directly in a word, as these integers are small) and the pointer to the next cell of the list. Conversely, the stack grows by two words at each iteration through the allocate opcode: one word for the CP, which is saved by allocate itself, and one word, as requested (the first parameter of allocate), for the current value.
For 100,000,000 elements on a 64-bit VM, the list takes a minimum of 1.5 GB, and so does the stack (more in practice, since the stack is fortunately not grown two words at a time). Monitoring and garbage-collecting this is difficult in the shell, as many values remain live. If you run the code in a spawned process, you can see the memory usage:
spawn(fun() ->
    io:format("~p\n", [erlang:memory()]),  % before building the list
    L = lists:seq(1, 100000000),
    io:format("~p\n", [erlang:memory()]),  % after building the list
    sum_test:sum_list(L, 0),
    io:format("~p\n", [erlang:memory()])   % after the recursive sum
end).
As you can see, the memory for the recursive call is not released immediately.
Related
We were recently reading the BEAM Book as part of a reading group.
In appendix B.3.3 it states that the call_last instruction has the following behavior:
Deallocate Deallocate words of stack, then do a tail recursive call to the function of arity Arity in the same module at label Label
Based on our current understanding, being tail-recursive would imply that the memory allocated on the stack by the current call can be reused.
As such, we were wondering what is being deallocated from the stack.
We were also wondering why there is a need to deallocate from the stack before doing the tail-recursive call, instead of directly doing the tail-recursive call.
In asm for CPUs, an optimized tailcall is just a jump to the function entry point. I.e. running the whole function as a loop body in the case of tail-recursion. (Without pushing a return address, so when you reach the base-case it's just one return back to the ultimate parent.)
I'm going to take a wild guess that Erlang / BEAM bytecode is remotely similar, even though I know nothing about it specifically.
When execution reaches the top of a function, it doesn't know whether it got there by recursion or a call from another function and would thus have to allocate more space if it needed any.
If you want to reuse already-allocated stack space, you'd have to further optimize the tail-recursion into an actual loop inside the function body, not recursion at all anymore.
Or to put it another way, to tailcall anything, you need the callstack in the same state it was in on function entry. Jumping instead of calling loses the opportunity to do any cleanup after the called function returns, because it returns to your caller, not to you.
But can't we just put the stack-cleanup in the recursion base-case that does actually return instead of tailcalling? Yes, but that only works if the "tailcall" is to a point in this function after allocating new space is already done, not the entry point that external callers will call. Those 2 changes are exactly the same as turning tail-recursion into a loop.
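To make the asm analogy concrete, here is a rough C sketch (my own illustration, not BEAM code) of a tail-recursive sum next to the loop that a tail-call-optimizing compiler effectively turns it into:
#include <stddef.h>

/* Tail-recursive: the recursive call is the last thing that happens, so an
 * optimizing compiler can emit a jump back to the entry point instead of a call. */
long sum_tail(const long *xs, size_t n, long acc) {
    if (n == 0)
        return acc;
    return sum_tail(xs + 1, n - 1, acc + xs[0]);
}

/* What that optimization amounts to: the same function written as a loop,
 * reusing a single stack frame. */
long sum_loop(const long *xs, size_t n, long acc) {
    while (n != 0) {
        acc += xs[0];
        xs += 1;
        n -= 1;
    }
    return acc;
}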
(Disclaimer: This is a guess)
A tail-recursive call does not mean that the function cannot perform other calls beforehand or use the stack in the meantime. In that case, the stack allocated for those calls must be deallocated before performing the tail call. call_last deallocates the surplus stack and then behaves like call_only.
You can see an example if you erlc -S the following code:
-module(test).
-compile(export_all).

fun1([]) ->
    ok;
fun1([1|R]) ->
    fun1(R).

funN() ->
    A = list(),
    B = list(),
    fun1([A, B]).

list() ->
    [1,2,3,4].
I've annotated the relevant parts:
{function, fun1, 1, 2}.
{label,1}.
{line,[{location,"test.erl",4}]}.
{func_info,{atom,test},{atom,fun1},1}.
{label,2}.
{test,is_nonempty_list,{f,3},[{x,0}]}.
{get_list,{x,0},{x,1},{x,2}}.
{test,is_eq_exact,{f,1},[{x,1},{integer,1}]}.
{move,{x,2},{x,0}}.
{call_only,1,{f,2}}. % No stack allocated, no need to deallocate it
{label,3}.
{test,is_nil,{f,1},[{x,0}]}.
{move,{atom,ok},{x,0}}.
return.
{function, funN, 0, 5}.
{label,4}.
{line,[{location,"test.erl",10}]}.
{func_info,{atom,test},{atom,funN},0}.
{label,5}.
{allocate_zero,1,0}. % Allocate 1 slot in the stack
{call,0,{f,7}}. % Leaves the result in {x,0} (the 0 register)
{move,{x,0},{y,0}}.% Moves the previous result from {x,0} to the stack because next function needs {x,0} free
{call,0,{f,7}}. % Leaves the result in {x,0} (the 0 register)
{test_heap,4,1}.
{put_list,{x,0},nil,{x,0}}. % Create a list with only the last value, [B]
{put_list,{y,0},{x,0},{x,0}}. % Prepend A (from the stack) to the previous list, creating [A, B] ([A | [B]]) in {x,0}
{call_last,1,{f,2},1}. % Tail recursion call deallocating the stack
{function, list, 0, 7}.
{label,6}.
{line,[{location,"test.erl",15}]}.
{func_info,{atom,test},{atom,list},0}.
{label,7}.
{move,{literal,[1,2,3,4]},{x,0}}.
return.
EDIT:
To actually answer your questions:
The process's memory is used for both the stack and the heap, which share the same memory block and grow from opposite ends towards each other (the process's GC triggers when they meet).
"Allocating" in this case means increasing the space used for the stack, and if that space is not going to be used anymore, it must be deallocated (returned to the memory block) in order to be able to use it again later (either as heap or as stack).
I'm reading Learn Functional Programming with Elixir now. In chapter 4 the author talks about tail-call optimization, saying that a tail-recursive function will use less memory than a body-recursive function. But when I tried the examples in the book, the result was the opposite.
# tail-recursive
defmodule TRFactorial do
  def of(n), do: factorial_of(n, 1)
  defp factorial_of(0, acc), do: acc
  defp factorial_of(n, acc) when n > 0, do: factorial_of(n - 1, n * acc)
end

TRFactorial.of(200_000)

# body-recursive
defmodule Factorial do
  def of(0), do: 1
  def of(n) when n > 0, do: n * of(n - 1)
end

Factorial.of(200_000)
On my computer, the beam.smp of the tail-recursive version uses 2.5 to 3 GB of memory, while the body-recursive version only uses around 1 GB. Am I misunderstanding something?
TL;DR: the Erlang virtual machine seems to optimize both so that they behave as if tail-call optimized.
Nowadays compilers and virtual machines are too smart for us to reliably predict their behavior. The advantage of tail recursion is not lower memory consumption, but:
This is to ensure that no system resources, for example, call stack, are consumed.
When the call is not tail-recursive, the stack must be preserved across calls. Consider the following example.
▶ defmodule NTC do
    def inf do
      inf()
      IO.puts(".")
      DateTime.utc_now()
    end
  end
Here we need to preserve the stack so that execution of the caller can continue when the recursion returns. It won't, because this recursion is infinite. The compiler is unable to optimize it, and here is what we get:
▶ NTC.inf
[1] 351729 killed iex
Please note that no output happened, which means the function kept calling itself until the stack blew up. With TCO, infinite recursion is possible (and it is used widely in message handling).
Turning back to your example: as we saw, TCO was applied in both cases (otherwise we'd end up with a stack overflow). The former keeps an accumulator in a dedicated variable, while the latter uses only the return value on the stack. Hence the difference you see: Elixir data is immutable, and the accumulator's content (which is huge for the factorial of 200K) gets copied and kept in memory for each call.
Sidenote:
You might disassemble both modules with :erts_debug.df(Factorial) (which would produce .dis files, e.g. Elixir.Factorial.dis, in the same directory) and see that the calls were implicitly TCO'ed.
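If you want to compare the two versions yourself, here is a rough sketch (mine; it samples the process size after the computation rather than at its true peak, and assumes the TRFactorial and Factorial modules above are compiled):
measure = fn fun ->
  {pid, ref} =
    spawn_monitor(fn ->
      fun.()
      # :memory is the total size in bytes of this process, stack and heap included.
      {:memory, bytes} = Process.info(self(), :memory)
      IO.puts("process used #{bytes} bytes")
    end)

  receive do
    {:DOWN, ^ref, :process, ^pid, _} -> :ok
  end
end

measure.(fn -> TRFactorial.of(200_000) end)
measure.(fn -> Factorial.of(200_000) end)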
I have to solve a code problem on the GPU using CUDA, but I always get the warning Stack size for "name of the function" cannot be statically determined.
This is for a student project that I'm working on, the project is written in C using CUDA 9.0 libraries and it's running on an NVIDIA Quadro K5000 gpu.
Every single thread must execute one function, and in this function there are two recursive calls to the same function. I want to use those two recursive calls because it keeps the code clean and simple for me, but with only one recursive call the stack size problem goes away.
I get the warning quoted above every time I compile the code. CUDA supports recursive function calls, but I don't understand why it becomes a problem when there are two recursive calls.
__device__ void bitonicMergeGPU(float *arr, int l, int indexT, int order)
{
    int k, p;
    if (l > 1)
    {
        p = l / 2;
        for (k = indexT; k < indexT + p; k++)
        {
            // Compare the values.
            compareAndExchange(arr, k, k + p, order);
        }
        // THIS IS WHERE I GET THE ERROR
        bitonicMergeGPU(arr, p, indexT, order);
        bitonicMergeGPU(arr, p, indexT + p, order);
    }
}
I simply want to know if it is possible to solve the problem of the recursive calls.
CUDA supports recursion. When you use recursion in CUDA, this warning is expected, and there is no NVIDIA-documented way you can make the warning go away (except by not using recursion).
If you use a function recursively, in most languages it will use more stack space as the recursion depth increases. This is true in CUDA as well. You need to account for this and provide enough stack space for the maximum recursion depth you anticipate. It is common practice to limit recursion depth, so as to prevent stack problems.
The compiler is unable to discover the maximum runtime recursion depth at compile time, and the warning is there to remind you of that.
Regardless of how much you increase the stack size, the warning will not go away. The warning is there to let you know that it is your responsibility to make sure your recursion design along with the stack space allocated will work correctly. The compiler does not verify in any way that the amount of increase in stack size is sufficient.
Recursion in CUDA must be used very carefully. Recursion consumes stack memory, which has a limit of 512 KB. The default is usually 1 KB, which is easy to overflow, crashing the program. You can get the stack size per thread using cudaThreadGetLimit().
Suggestions:
Redesign the algorithm/function using a non-recursive approach; the efficiency is usually very similar.
Increase the stack size per thread using cudaThreadSetLimit(), without exceeding the limit, e.g. 512 KB (see the sketch below).
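A minimal host-side sketch of the second suggestion, under the assumption that the newer cudaDeviceGetLimit/cudaDeviceSetLimit names (which replace the deprecated cudaThread* ones) are available; the 64 KB value is only an example, not a recommendation:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stackSize = 0;

    // Query the current per-thread device stack size (typically ~1 KB by default).
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default stack size per thread: %zu bytes\n", stackSize);

    // Reserve more stack per thread before launching a deeply recursive kernel.
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("new stack size per thread: %zu bytes\n", stackSize);
    return 0;
}
Raising the limit only buys headroom; the warning itself remains, as explained above.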
I wrote a function in Maxima that performs foldl, similar to Haskell's:
foldl(f, ac, li) := block([con: [], acc: ac], /* print("List=", li, ac), */
    if is(li # []) then (
        acc: apply(f, cons(acc, [first(li)])),
        acc: foldl(f, acc, rest(li))
    ),
    acc)$
It works fine, folding the list from the left side and evaluating along the way, hence preventing any accumulation of a long unevaluated expression in the buffer.
The problem I am facing is that running it with
foldl(lambda([x,y],x+y),0,makelist(i,i,1,97));
Error in PROGN [or a callee]: Bind stack overflow.
But if I run it with a list of up to 96 elements, it generates the result appropriately.
I don't understand why this simple addition is causing a problem, as I don't have any infinite loop or memory-hungry task going on.
Well, foldl is defined as a recursive function, and it will call itself as many times as there are elements in the list. So whether it works depends on the Lisp implementation-specific limit for the function call stack. For GCL it seems the limit is relatively small. For other Lisp implementations, the limit is greater. But the only way to make it work for all sizes of the list is to write it iteratively.
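For example, a minimal iterative sketch (my own; foldl_iter is just a placeholder name) that walks the list with a for loop, so the call-stack depth stays constant regardless of the list's length:
foldl_iter(f, ac, li) := block([acc: ac],
    /* iterate instead of recursing: constant call-stack depth */
    for x in li do acc: apply(f, [acc, x]),
    acc)$

/* foldl_iter(lambda([x,y], x+y), 0, makelist(i, i, 1, 100000)); => 5000050000 */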
There are built-in functions similar to foldl -- see lreduce, rreduce, xreduce, and tree_reduce.
In Introduction to Algorithms, p. 169, it talks about using tail recursion for Quicksort.
The original Quicksort algorithm earlier in the chapter is (in pseudo-code)
Quicksort(A, p, r)
{
    if (p < r)
    {
        q <- Partition(A, p, r)
        Quicksort(A, p, q)
        Quicksort(A, q+1, r)
    }
}
The optimized version using tail recursion is as follows
Quicksort(A, p, r)
{
    while (p < r)
    {
        q <- Partition(A, p, r)
        Quicksort(A, p, q)
        p <- q+1
    }
}
Where Partition sorts the array according to a pivot.
The difference is that the second algorithm only calls Quicksort once to sort the LHS.
Can someone explain to me why the first algorithm could cause a stack overflow, whereas the second wouldn't? Or am I misunderstanding the book?
First let's start with a brief, probably not accurate but still valid, definition of what stack overflow is.
As you probably know, there are two different kinds of memory, implemented as two different data structures: the heap and the stack.
In terms of size, the heap is bigger than the stack, and to keep it simple let's say that every time a function call is made a new environment (local variables, parameters, etc.) is created on the stack. So, given that and the fact that the stack's size is limited, if you make too many function calls you will run out of space and hence you will have a stack overflow.
The problem with recursion is that, since you are creating at least one environment on the stack per call, you would occupy a lot of space in the limited stack very quickly, so stack overflows are commonly associated with recursive calls.
So there is this thing called tail call optimization that reuses the same environment every time a recursive call is made, so the space occupied on the stack stays constant, preventing the stack overflow issue.
Now, there are some rules for performing tail call optimization. First, each call must be complete, by which I mean that the function should be able to give a result at any moment if you interrupt the execution; in SICP this is called an iterative process, even when the function is recursive.
If you analyze your first example, you will see that each iteration is defined by two recursive calls, which means that if you stop the execution at any time you won't be able to give a partial result, because the result depends on those calls finishing. In this scenario you can't reuse the stack environment, because the total information is split between all those recursive calls.
However, the second example doesn't have that problem: A is constant and the state of p and r can be determined locally, so all the information needed to keep going is there and TCO can be applied.
The essence of tail recursion optimization is that there is no recursion when the program is actually executed. When the compiler or interpreter is able to apply TRO, it essentially figures out how to rewrite your recursively-defined algorithm into a simple iterative process, with the stack not used to store nested function invocations.
The first code snippet can't be TR-optimized because there are 2 recursive calls in it.
Tail recursion by itself is not enough. The algorithm with the while loop can still use O(N) stack space; reducing it to O(log(N)) is left as an exercise in that section of CLRS.
Assume we are working in a language with array slices and tail call optimization. Consider the difference between these two algorithms:
Bad:
Quicksort(arraySlice) {
    if (arraySlice.length > 1) {
        slices = Partition(arraySlice)
        (smallerSlice, largerSlice) = sortBySize(slices)
        Quicksort(largerSlice)  // Not a tail call, requires a stack frame until it returns.
        Quicksort(smallerSlice) // Tail call, can replace the old stack frame.
    }
}
Good:
Quicksort(arraySlice) {
    if (arraySlice.length > 1) {
        slices = Partition(arraySlice)
        (smallerSlice, largerSlice) = sortBySize(slices)
        Quicksort(smallerSlice) // Not a tail call, requires a stack frame until it returns.
        Quicksort(largerSlice)  // Tail call, can replace the old stack frame.
    }
}
The second one is guaranteed to never need more than log2(length) stack frames, because smallerSlice is less than half as long as arraySlice. But for the first one, the inequality is reversed: it will always need at least log2(length) stack frames, and can require O(N) stack frames in the worst case, where smallerSlice always has length 1.
If you don't keep track of which slice is smaller or larger, you will have similar worst cases to the first overflowing case, even though it will require O(log(n)) stack frames on average. If you always sort the smaller slice first, you will never need more than log_2(length) stack frames.
If you are using a language that doesn't have tail call optimization, you can write the second (not stack-blowing) version as:
Quicksort(arraySlice) {
    while (arraySlice.length > 1) {
        slices = Partition(arraySlice)
        (smallerSlice, arraySlice) = sortBySize(slices)
        Quicksort(smallerSlice) // Still not a tail call, requires a stack frame until it returns.
    }
}
Another thing worth noting is that if you are implementing something like Introsort, which switches to Heapsort when the recursion depth exceeds some number proportional to log(N), you will never hit the O(N) worst-case stack memory usage of quicksort, so you technically don't need to do this. Doing this optimization (recursing on the smaller slice first) still improves the constant factor of the O(log(N)), though, so it is strongly recommended.
Well, the most obvious observation would be:
Most common stack overflow problem - definition
The most common cause of stack overflow is excessively deep or infinite recursion.
The second uses less deep recursion than the first (one recursive call per invocation instead of two), hence it is less likely to cause a stack overflow.
(So lower complexity means less chance of causing a stack overflow.)
But somebody would have to add why the second can never cause a stack overflow while the first can.
source
Well, if you consider the complexity of the two methods, the first obviously has more complexity than the second, since it recurses on both the LHS and the RHS, and as a result there are more chances of getting a stack overflow.
Note: that doesn't mean there is absolutely no chance of getting a stack overflow with the second method.
In function 2 that you shared, tail call elimination is implemented. Before proceeding further, let us understand what a tail-recursive function is: if the last statement in the code is the recursive call and nothing is done after it, then it is called a tail-recursive function. So the first function is tail recursive. For such a function, with some changes in the code, one can remove the last recursive call, like you showed in function 2, which performs the same work as function 1. This process is called tail recursion optimization or tail call elimination, and the following are its results:
Optimizing in terms of auxiliary space
Optimizing in terms of recursion call overhead
The last recursive call is eliminated by using the while loop. The good thing is that for function 2, no auxiliary space is used for the right-hand call, as its recursion is eliminated using p <- q+1, and the overall function does not have the recursive-call overhead. So, whichever way the partition happens, the maximum space needed is theta(log n).