Tail-call recursive behavior of the BEAM bytecode instruction call_last - recursion

We were recently reading the BEAM Book as part of a reading group.
In appendix B.3.3 it states that the call_last instruction has the following behavior
Deallocate Deallocate words of stack, then do a tail recursive
call to the function of arity Arity in the same module at label
Label
Based on our current understanding, tail-recursive would imply that the memory allocated on the stack can be reused from the current call.
As such, we were wondering what is being deallocated from the stack.
Additionally, we were also wondering why there is a need to deallocate from the stack before doing the tail-recursive call, instead of directly doing the tail recursive call.

In asm for CPUs, an optimized tailcall is just a jump to the function entry point. I.e. running the whole function as a loop body in the case of tail-recursion. (Without pushing a return address, so when you reach the base-case it's just one return back to the ultimate parent.)
I'm going to take a wild guess that Erlang / BEAM bytecode is remotely similar, even though I know nothing about it specifically.
When execution reaches the top of a function, it doesn't know whether it got there by recursion or a call from another function and would thus have to allocate more space if it needed any.
If you want to reuse already-allocated stack space, you'd have to further optimize the tail-recursion into an actual loop inside the function body, not recursion at all anymore.
Or to put it another way, to tailcall anything, you need the callstack in the same state it was in on function entry. Jumping instead of calling loses the opportunity to do any cleanup after the called function returns, because it returns to your caller, not to you.
But can't we just put the stack-cleanup in the recursion base-case that does actually return instead of tailcalling? Yes, but that only works if the "tailcall" is to a point in this function after allocating new space is already done, not the entry point that external callers will call. Those 2 changes are exactly the same as turning tail-recursion into a loop.

(Disclaimer: This is a guess)
Tail-recursion calls does not mean that it cannot perform any other call previously or use the stack in the meantime. In that case, the allocated stack for those calls must be deallocated before performing the tail-recursion. The call_last deallocates surplus stack before behaving like call_only.
You can see an example if you erlc -S the following code:
-module(test).
-compile(export_all).
fun1([]) ->
ok;
fun1([1|R]) ->
fun1(R).
funN() ->
A = list(),
B = list(),
fun1([A, B]).
list() ->
[1,2,3,4].
I've annotated the relevant parts:
{function, fun1, 1, 2}.
{label,1}.
{line,[{location,"test.erl",4}]}.
{func_info,{atom,test},{atom,fun1},1}.
{label,2}.
{test,is_nonempty_list,{f,3},[{x,0}]}.
{get_list,{x,0},{x,1},{x,2}}.
{test,is_eq_exact,{f,1},[{x,1},{integer,1}]}.
{move,{x,2},{x,0}}.
{call_only,1,{f,2}}. % No stack allocated, no need to deallocate it
{label,3}.
{test,is_nil,{f,1},[{x,0}]}.
{move,{atom,ok},{x,0}}.
return.
{function, funN, 0, 5}.
{label,4}.
{line,[{location,"test.erl",10}]}.
{func_info,{atom,test},{atom,funN},0}.
{label,5}.
{allocate_zero,1,0}. % Allocate 1 slot in the stack
{call,0,{f,7}}. % Leaves the result in {x,0} (the 0 register)
{move,{x,0},{y,0}}.% Moves the previous result from {x,0} to the stack because next function needs {x,0} free
{call,0,{f,7}}. % Leaves the result in {x,0} (the 0 register)
{test_heap,4,1}.
{put_list,{x,0},nil,{x,0}}. % Create a list with only the last value, [B]
{put_list,{y,0},{x,0},{x,0}}. % Prepend A (from the stack) to the previous list, creating [A, B] ([A | [B]]) in {x,0}
{call_last,1,{f,2},1}. % Tail recursion call deallocating the stack
{function, list, 0, 7}.
{label,6}.
{line,[{location,"test.erl",15}]}.
{func_info,{atom,test},{atom,list},0}.
{label,7}.
{move,{literal,[1,2,3,4]},{x,0}}.
return.
EDIT:
To actually answer your questions:
The thread's memory is used for both the stack and the heap, which use the same memory block in opposite sides, growning towards each other (the thread's GC triggers when they meet).
"Allocating" in this case means increasing the space used for the stack, and if that space is not going to be used anymore, it must be deallocated (returned to the memory block) in order to be able to use it again later (either as heap or as stack).

Related

elixir: why tail-recursive use more memory than a body-recursive function?

I'm reading Learn Functional Programming with Elixir now, on chapter 4 the author talks about Tail-Call Optimization that a tail-recursive function will use less memory than a body-recursive function. But when I tried the examples in the book, the result is opposite.
# tail-recursive
defmodule TRFactorial do
def of(n), do: factorial_of(n, 1)
defp factorial_of(0, acc), do: acc
defp factorial_of(n, acc) when n > 0, do: factorial_of(n - 1, n * acc)
end
TRFactorial.of(200_000)
# body-recursive
defmodule Factorial do
def of(0), do: 1
def of(n) when n > 0, do: n * of(n - 1)
end
Factorial.of(200_000)
In my computer, the beam.smp of tail-recursive version will use 2.5G ~ 3G memory while the body-recursive only use around 1G. Am I misunderstanding something?
TL;DR: erlang virtual machine seems to optimize both to be TCO.
Nowadays compilers and virtual machines are too smart to predict their behavior. The advantage of tail recursion is not less memory consumption, but:
This is to ensure that no system resources, for example, call stack, are consumed.
When the call is not tail-recursive, the stack must be preserved across calls. Consider the following example.
▶ defmodule NTC do
def inf do
inf()
IO.puts(".")
DateTime.utc_now()
end
end
Here we need to preserve stack to make it possible to continue execution of the caller when the recursion would return. It won’t because this recursion is infinite. The compiler is unable to optimize it and here is what we get:
▶ NTC.inf
[1] 351729 killed iex
Please note, that no output has happened, which means we were recursively calling itself until the stack blew up. With TCO, infinite recursion is possible (and it is used widely in message handling.)
Turning back to your example. As we saw, TCO was made in both cases (otherwise we’d end up with stack overflow,) and the former keeps an accumulator in the dedicated variable, while the latter uses the return value on stack only. This gain you see: elixir is immutable and the variable content (which is huge for factorial of 200K) gets copied and kept in the memory for each call.
Sidenote:
You might disassembly both modules with :erts_debug.df(Factorial) (which would produce Elixir.Factorial.dis files in the same directory) and see that the calls were implicitly TCO’ed.

Recursion in java help

I am new to the site and am not familiar with how and where to post so please excuse me. I am currently studying recursion and am having trouble understanding the output of this program. Below is the method body.
public static int Asterisk(int n)
{
if (n<1)
return;
Asterisk(n-1);
for (int i = 0; i<n; i++)
{
System.out.print("*");
}
System.out.println();
}
This is the output
*
**
***
****
*****
it is due to the fact that the "Asterisk(n-1)" lies before the for loop.
I would think that the output should be
****
***
**
*
This is the way head recursion works. The call to the function is made before execution of other statements. So, Asterisk(5) calls Asterisk(4) before doing anything else. This further cascades into serial function calls from Asterisk(3) → Asterisk(2) → Asterisk(1) → Asterisk(0).
Now, Asterisk(0) simply returns as it passes the condition n<1. The control goes back to Asterisk(1) which now executes the rest of its code by printing n=1 stars. Then it relinquishes control to Asterisk(2) which again prints n=2 stars, and so on. Finally, Asterisk(5) prints its n=5 stars and the function calls end. This is why you see the pattern of ascending number of stars.
There are two ways to create programming loops. One is using imperative loops normally native to the language (for, while, etc) and the other is using functions (functional loops). In your example the two kinds of loops are presented.
One loop is the unrolling of the function
Asterisk(int n)
This unrolling uses recursion, where the function calls itself. Every functional loop must know when to stop, otherwise it goes on forever and blows up the stack. This is called the "stopping condition". In your case it is :
if (n<1)
return;
There is bidirectional equivalence between functional loops and imperative loops (for, while, etc). You can turn any functional loop into a regular loop and vice versa.
IMO this particular exercise was meant to show you the two different ways to build loops. The outer loop is functional (you could substitute it for a for loop) and the inner loop is imperative.
Think of recursive calls in terms of a stack. A stack is a data structure which adds to the top of a pile. A real world analogy is a pile of dishes where the newest dish goes on the top. Therefore recursive calls add another layer to the top of the stack, then once some criteria is met which prevents further recursive calls, the stack starts to unwind and we work our way back down to the original item (the first plate in pile of dishes).
The input of a recursive method tends towards a base case which is the termination factor and prevents the method from calling itself indefinitely (infinite loop). Once this base condition is met the method returns a value rather than calling itself again. This is how the stack in unwound.
In your method, the base case is when $n<1$ and the recursive calls use the input $n-1$. This means the method will call itself, each time decreasing $n$ by 1, until $n<1$ i.e. $n=0$. Once the base condition is met, the value 0 is returned and we start to execute the $for$ loop. This is why the first line contains a single asterix.
So if you run the method with an input of 5, the recursive calls build a stack of values of $n$ as so
0
1
2
3
4
5
Then this stack is unwound starting with the top, 0, all the way down to 5.

Erlang: stackoverflow with recursive function that is not tail call optimized?

Is it possible to get a stackoverflow with a function that is not tail call optimized in Erlang? For example, suppose I have a function like this
sum_list([],Acc) ->
Acc;
sum_list([Head|Tail],Acc) ->
Head + sum_list(Tail, Acc).
It would seem like if a large enough list was passed in it would eventually run out of stack space and crash. I tried testing this like so:
> L = lists:seq(1, 10000000).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22, 23,24,25,26,27,28,29|...]
> sum_test:sum_list(L, 0).
50000005000000
But it never crashes! I tried it with a list of 100,000,000 integers and it took a while to finish but it still never crashed! Questions:
Am I testing this correctly?
If so, why am I unable to generate a stackoverflow?
Is Erlang doing something that prevents stackoverflows from occurring?
You are testing this correctly: your function is indeed not tail-recursive. To find out, you can compile your code using erlc -S <erlang source file>.
{function, sum_list, 2, 2}.
{label,1}.
{func_info,{atom,so},{atom,sum_list},2}.
{label,2}.
{test,is_nonempty_list,{f,3},[{x,0}]}.
{allocate,1,2}.
{get_list,{x,0},{y,0},{x,0}}.
{call,2,{f,2}}.
{gc_bif,'+',{f,0},1,[{y,0},{x,0}],{x,0}}.
{deallocate,1}.
return.
{label,3}.
{test,is_nil,{f,1},[{x,0}]}.
{move,{x,1},{x,0}}.
return.
As a comparison the following tail-recursive version of the function:
tail_sum_list([],Acc) ->
Acc;
tail_sum_list([Head|Tail],Acc) ->
tail_sum_list(Tail, Head + Acc).
compiles as:
{function, tail_sum_list, 2, 5}.
{label,4}.
{func_info,{atom,so},{atom,tail_sum_list},2}.
{label,5}.
{test,is_nonempty_list,{f,6},[{x,0}]}.
{get_list,{x,0},{x,2},{x,3}}.
{gc_bif,'+',{f,0},4,[{x,2},{x,1}],{x,1}}.
{move,{x,3},{x,0}}.
{call_only,2,{f,5}}.
{label,6}.
{test,is_nil,{f,4},[{x,0}]}.
{move,{x,1},{x,0}}.
return.
Notice the lack of allocate and the call_only opcode in the tail-recursive version, as opposed to the allocate/call/deallocate/return sequence in the non-recursive function.
You are not getting a stack overflow because the Erlang "stack" is very large. Indeed, stack overflow usually means the processor stack overflowed, as the processor's stack pointer went too far away. Processes traditionally have a limited stack size which can be tuned by interacting with the operating system. See for example POSIX's setrlimit.
However, Erlang execution stack is not the processor stack, as the code is interpreted. Each process has its own stack which can grow as needed by invoking operating system memory allocation functions (typically malloc on Unix).
As a result, your function will not crash as long as malloc calls succeed.
For the record, the actual list L is using the same amount of memory as the stack to process it. Indeed, each element in the list takes two words (the integer value itself, which is boxed as a word as they are small) and the pointer to the next element to the list. Conversely, the stack is grown by two words at each iteration by allocate opcode: one word for CP which is saved by allocate itself and one word as requested (the first parameter of allocate) for the current value.
For 100,000,000 words on a 64-bit VM, the list takes a minimum of 1.5 GB (more as the actual stack is not grown every two words, fortunately). Monitoring and garbaging this is difficult in the shell, as many values remain live. If you spawn a function, you can see the memory usage:
spawn(fun() ->
io:format("~p\n", [erlang:memory()]),
L = lists:seq(1, 100000000),
io:format("~p\n", [erlang:memory()]),
sum_test:sum_list(L, 0),
io:format("~p\n", [erlang:memory()])
end).
As you can see, the memory for the recursive call is not released immediately.

Quicksort and tail recursive optimization

In Introduction to Algorithms p169 it talks about using tail recursion for Quicksort.
The original Quicksort algorithm earlier in the chapter is (in pseudo-code)
Quicksort(A, p, r)
{
if (p < r)
{
q: <- Partition(A, p, r)
Quicksort(A, p, q)
Quicksort(A, q+1, r)
}
}
The optimized version using tail recursion is as follows
Quicksort(A, p, r)
{
while (p < r)
{
q: <- Partition(A, p, r)
Quicksort(A, p, q)
p: <- q+1
}
}
Where Partition sorts the array according to a pivot.
The difference is that the second algorithm only calls Quicksort once to sort the LHS.
Can someone explain to me why the 1st algorithm could cause a stack overflow, whereas the second wouldn't? Or am I misunderstanding the book.
First let's start with a brief, probably not accurate but still valid, definition of what stack overflow is.
As you probably know right now there are two different kind of memory which are implemented in too different data structures: Heap and Stack.
In terms of size, the Heap is bigger than the stack, and to keep it simple let's say that every time a function call is made a new environment(local variables, parameters, etc.) is created on the stack. So given that and the fact that stack's size is limited, if you make too many function calls you will run out of space hence you will have a stack overflow.
The problem with recursion is that, since you are creating at least one environment on the stack per iteration, then you would be occupying a lot of space in the limited stack very quickly, so stack overflow are commonly associated with recursion calls.
So there is this thing called Tail recursion call optimization that will reuse the same environment every time a recursion call is made and so the space occupied in the stack is constant, preventing the stack overflow issue.
Now, there are some rules in order to perform a tail call optimization. First, each call most be complete and by that I mean that the function should be able to give a result at any moment if you interrupts the execution, in SICP
this is called an iterative process even when the function is recursive.
If you analyze your first example, you will see that each iteration is defined by two recursive calls, which means that if you stop the execution at any time you won't be able to give a partial result because you the result depends of those calls to be finished, in this scenario you can't reuse the stack environment because the total information is split between all those recursive calls.
However, the second example doesn't have that problem, A is constant and the state of p and r can be locally determined, so since all the information to keep going is there then TCO can be applied.
The essence of the tail recursion optimization is that there is no recursion when the program is actually executed. When the compiler or interpreter is able to kick TRO in, it means that it will essentially figure out how to rewrite your recursively-defined algorithm into a simple iterative process with the stack not used to store nested function invocations.
The first code snippet can't be TR-optimized because there are 2 recursive calls in it.
Tail recursion by itself is not enough. The algorithm with the while loop can still use O(N) stack space, reducing it to O(log(N)) is left as exercise in that section of CLRS.
Assume we are working in a language with array slices and tail call optimization. Consider the difference between these two algorithms:
Bad:
Quicksort(arraySlice) {
if (arraySlice.length > 1) {
slices = Partition(arraySlice)
(smallerSlice, largerSlice) = sortBySize(slices)
Quicksort(largerSlice) // Not a tail call, requires a stack frame until it returns.
Quicksort(smallerSlice) // Tail call, can replace the old stack frame.
}
}
Good:
Quicksort(arraySlice) {
if (arraySlice.length > 1){
slices = Partition(arraySlice)
(smallerSlice, largerSlice) = sortBySize(slices)
Quicksort(smallerSlice) // Not a tail call, requires a stack frame until it returns.
Quicksort(largerSlice) // Tail call, can replace the old stack frame.
}
}
The second one is guarenteed to never need more than log2(length) stack frames because smallerSlice is less than half as long as arraySlice. But for the first one, the inequality is reversed and it will always need more than or equal to log2(length) stack frames, and can require O(N) stack frames in the worst case where smallerslice always has length 1.
If you don't keep track of which slice is smaller or larger, you will have similar worst cases to the first overflowing case, even though it will require O(log(n)) stack frames on average. If you always sort the smaller slice first, you will never need more than log_2(length) stack frames.
If you are using a language that doesn't have tail call optimization, you can write the second (not stack-blowing) version as:
Quicksort(arraySlice) {
while (arraySlice.length > 1) {
slices = Partition(arraySlice)
(smallerSlice, arraySlice) = sortBySize(slices)
Quicksort(smallerSlice) // Still not a tail call, requires a stack frame until it returns.
}
}
Another thing worth noting is that if you are implementing something like Introsort which changes to Heapsort if the recursion depth exceeds some number proportional to log(N), you will never hit the O(N) worst case stack memory usage of quicksort, so you technically don't need to do this. Doing this optimization (popping smaller slices first) still improves the constant factor of the O(log(N)) though, so it is strongly recommended.
Well, the most obvious observation would be:
Most common stack overflow problem - definition
The most common cause of stack overflow is excessively deep or infinite recursion.
The second uses less deep recursion than the first (n branches per call instead of n^2) hence it is less likely to cause a stack overflow..
(so lower complexity means less chance to cause a stack overflow)
But somebody would have to add why the second can never cause a stack overflow while the first can.
source
Well If you consider the complexity of the two methods the first method obviously has more complexity than the second since it calls Recursion on both LHS and RHS as a result there are more chances of getting stack overflow
Note: That doesnt mean that there are absolutely no chances of getting SO in second method
In the function 2 that you shared, Tail Call elimination is implemented. Before proceeding further let us understand what is tail recursion function?. If the last statement in the code is the recursive call and does do anything after that, then it is called tail recursive function. So the first function is a tail recursion function. For such a function with some changes in the code one can remove the last recursion call like you showed in function 2 which performs the same work as function 1. This process is called tail recursion optimization or tail call elimination and following are the result of it
Optimizing in terms of auxiliary space
Optimizing in terms of recursion call overhead
Last recursive call is eliminated by using the while loop. The good thing is that for function 2, no auxiliary space is used for the right call as its recursion is eliminated using p: <- q+1 and the overall function does not have recursion call overhead. So whatever way partition happens maximum space needed is theta(log n)

How does one implement a "stackless" interpreted language?

I am making my own Lisp-like interpreted language, and I want to do tail call optimization. I want to free my interpreter from the C stack so I can manage my own jumps from function to function and my own stack magic to achieve TCO. (I really don't mean stackless per se, just the fact that calls don't add frames to the C stack. I would like to use a stack of my own that does not grow with tail calls). Like Stackless Python, and unlike Ruby or... standard Python I guess.
But, as my language is a Lisp derivative, all evaluation of s-expressions is currently done recursively (because it's the most obvious way I thought of to do this nonlinear, highly hierarchical process). I have an eval function, which calls a Lambda::apply function every time it encounters a function call. The apply function then calls eval to execute the body of the function, and so on. Mutual stack-hungry non-tail C recursion. The only iterative part I currently use is to eval a body of sequential s-expressions.
(defun f (x y)
(a x y)) ; tail call! goto instead of call.
; (do not grow the stack, keep return addr)
(defun a (x y)
(+ x y))
; ...
(print (f 1 2)) ; how does the return work here? how does it know it's supposed to
; return the value here to be used by print, and how does it know
; how to continue execution here??
So, how do I avoid using C recursion? Or can I use some kind of goto that jumps across c functions? longjmp, perhaps? I really don't know. Please bear with me, I am mostly self- (Internet- ) taught in programming.
One solution is what is sometimes called "trampolined style". The trampoline is a top-level loop that dispatches to small functions that do some small step of computation before returning.
I've sat here for nearly half an hour trying to contrive a good, short example. Unfortunately, I have to do the unhelpful thing and send you to a link:
http://en.wikisource.org/wiki/Scheme:_An_Interpreter_for_Extended_Lambda_Calculus/Section_5
The paper is called "Scheme: An Interpreter for Extended Lambda Calculus", and section 5 implements a working scheme interpreter in an outdated dialect of Lisp. The secret is in how they use the **CLINK** instead of a stack. The other globals are used to pass data around between the implementation functions like the registers of a CPU. I would ignore **QUEUE**, **TICK**, and **PROCESS**, since those deal with threading and fake interrupts. **EVLIS** and **UNEVLIS** are, specifically, used to evaluate function arguments. Unevaluated args are stored in **UNEVLIS**, until they are evaluated and out into **EVLIS**.
Functions to pay attention to, with some small notes:
MLOOP: MLOOP is the main loop of the interpreter, or "trampoline". Ignoring **TICK**, its only job is to call whatever function is in **PC**. Over and over and over.
SAVEUP: SAVEUP conses all the registers onto the **CLINK**, which is basically the same as when C saves the registers to the stack before a function call. The **CLINK** is actually a "continuation" for the interpreter. (A continuation is just the state of a computation. A saved stack frame is technically continuation, too. Hence, some Lisps save the stack to the heap to implement call/cc.)
RESTORE: RESTORE restores the "registers" as they were saved in the **CLINK**. It's similar to restoring a stack frame in a stack-based language. So, it's basically "return", except some function has explicitly stuck the return value into **VALUE**. (**VALUE** is obviously not clobbered by RESTORE.) Also note that RESTORE doesn't always have to return to a calling function. Some functions will actually SAVEUP a whole new computation, which RESTORE will happily "restore".
AEVAL: AEVAL is the EVAL function.
EVLIS: EVLIS exists to evaluate a function's arguments, and apply a function to those args. To avoid recursion, it SAVEUPs EVLIS-1. EVLIS-1 would just be regular old code after the function application if the code was written recursively. However, to avoid recursion, and the stack, it is a separate "continuation".
I hope I've been of some help. I just wish my answer (and link) was shorter.
What you're looking for is called continuation-passing style. This style adds an additional item to each function call (you could think of it as a parameter, if you like), that designates the next bit of code to run (the continuation k can be thought of as a function that takes a single parameter). For example you can rewrite your example in CPS like this:
(defun f (x y k)
(a x y k))
(defun a (x y k)
(+ x y k))
(f 1 2 print)
The implementation of + will compute the sum of x and y, then pass the result to k sort of like (k sum).
Your main interpreter loop then doesn't need to be recursive at all. It will, in a loop, apply each function application one after another, passing the continuation around.
It takes a little bit of work to wrap your head around this. I recommend some reading materials such as the excellent SICP.
Tail recursion can be thought of as reusing for the callee the same stack frame that you are currently using for the caller. So you could just re-set the arguments and goto to the beginning of the function.

Resources