Why "there is no such thing as stack overflow" in Racket? - recursion

The following paragraph is from The Racket Guide (2.3.4):
At the same time, recursion does not lead to particularly bad
performance in Racket, and there is no such thing as stack overflow;
you can run out of memory if a computation involves too much context,
but exhausting memory typically requires orders of magnitude deeper
recursion than would trigger a stack overflow in other languages.
I'm curious how Racket was designed to avoid stack overflow. And why can't other languages like C avoid this problem?

First, some terminology: making a non-tail call requires a context frame to store local variables, parent return address, etc. So the question is how to represent an arbitrarily large context. "The stack" (or call stack) is just one (admittedly common) implementation strategy for the context.
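To make "context" concrete, here's a small Racket sketch (mine, not the answerer's; len and len/acc are made-up names) contrasting a non-tail call, which needs a context frame per call, with a tail call, which needs none:

#lang racket
;; len must remember "add 1 to whatever comes back" at every call,
;; so each call needs a context frame; len/acc only makes tail calls,
;; so it needs no context at all.
(define (len lst)
  (if (null? lst)
      0
      (add1 (len (cdr lst)))))          ; non-tail call: context grows

(define (len/acc lst [n 0])
  (if (null? lst)
      n
      (len/acc (cdr lst) (add1 n))))    ; tail call: constant context

(len (make-list 1000000 'x))       ; a million-deep context, fine in Racket
(len/acc (make-list 1000000 'x))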
Here are a few implementation strategies for deep recursion (ie, large contexts):
Allocate context frames on the heap and let the GC be responsible for cleaning them up. (This is nice and simple but probably relatively slow, although people would argue that point.)
Allocate context frames on the stack. When the stack is full, copy the frames currently on the stack into the heap, clear the stack, and reset the stack pointer to the bottom. When returning, if the stack is empty, copy frames from the heap back to the stack. (This means you can't have pointers to stack-allocated objects, though, since the objects get moved around.)
Allocate context frames on the stack. When the stack is full, allocate a new big chunk of memory, call that the new stack (ie set the stack pointer), and keep going. (This might require mprotect or other operations to convince the OS that the new block of memory is okay to treat as a call stack.)
Allocate context frames on the stack. When the stack is full, make a new thread to continue the computation; the old thread waits for it to finish and grabs its return value to return on the old thread's stack. (This strategy can be useful on platforms like the JVM that don't let you directly control the stack, stack pointer, etc. On the other hand, it complicates features like thread-local storage.)
... and more variations on the strategies above.
Support for deep recursion often coincides with support for first-class continuations. In general, implementing first-class continuations means you almost automatically get support for deep recursion. There's a nice paper called Implementation Strategies for First-class Continuations by Will Clinger et al. with more detail and comparisons between different strategies.

There are two pieces to this answer.
First, in Racket and other functional languages, tail calls do not create additional stack frames. That is, a loop such as
(define (f x) (f x))
... can run forever without using any stack space at all. Many non-functional languages don't prioritize function calling in the same way as functional languages, and therefore aren't properly tail-calling.
HOWEVER, the comment that you're referring to isn't just limited to tail-calling; Racket allows very deeply nested stack frames.
Your question is a good one: why don't other languages allow deeply nested stack frames? I wrote a short test, and it looks like C unceremoniously dumps core at a depth of between 262,000 and 263,000. I wrote a simple Racket test that does the same thing (being careful to ensure the recursive call was not in tail position), and I interrupted it at a depth of 48,000,000 without any apparent ill effects (except, presumably, a fairly large runtime stack).
To answer your question directly, there's no reason that I'm aware of that C couldn't allow much much more deeply nested stacks, but I think that for most C programmers, a recursion depth of 262K is plenty.
Not for us, though!
Here's my C code:
#include <stdio.h>

int f(int depth) {
  if ((depth % 1000) == 0) {
    printf("%d\n", depth);
  }
  return f(depth + 1);
}

int main() {
  return f(0);
}
... and my Racket code:
#lang racket
(define (f depth)
  (when (= (modulo depth 1000) 0)
    (printf "~v\n" depth))
  (f (add1 depth))
  (printf "this will never print..."))
(f 0)
EDIT: here's the version that uses randomness on the way out to stymie possible optimizations:
#lang racket
(define (f depth)
  (when (= (modulo depth 1000000) 0)
    (printf "~v\n" depth))
  (when (< depth 50000000)
    (f (add1 depth)))
  (when (< (random) (/ 1.0 100000))
    (printf "X")))
(f 0)
Also, my observations of the process size are consistent with a stack frame of about 16 bytes, plus or minus; 50M * 16 bytes = 800 Megabytes, and the observed size of the stack is about 1.2 Gigabytes.

Related

Why does apply throw a CONTROL-STACK-EXHAUSTED-ERROR on a large list?

(apply #'+ (loop for i from 1 to x collect 1))
works if x has value 253391, but fails with a (SB-KERNEL::CONTROL-STACK-EXHAUSTED-ERROR) on 253392*. This is orders of magnitude smaller than call-arguments-limit**.
Is recursion exhausting the stack? If so, is it in apply? Why hasn't it been optimized out?
*Also interesting, (apply #'max (loop for i from 1 to 253391 collect 1)) throws the error, but 253390 is fine.
**call-arguments-limit evaluates to 4611686018427387903 (with the help of format's ~R, it turns out this is four quintillion six hundred eleven quadrillion six hundred eighty-six trillion eighteen billion four hundred twenty-seven million three hundred eighty-seven thousand nine hundred three)
parameters that can be passed to a function in SBCL
You don't pass parameters. You pass arguments.
(defun foo (x y) (list x y))
x and y are parameters of the function foo.
(foo 20 22)
20 and 22 are arguments in a call of the function foo.
See the variables call-arguments-limit and lambda-parameters-limit.
SBCL and call-arguments-limit
If a function can't handle anywhere near the claimed number of arguments, then this looks like a bug in SBCL. You might want to report it. Maybe they need to change the value of call-arguments-limit.
Testing
APPLY is one way to test it.
Another:
(eval (append '(drop-params)
              (loop for i from 1 to 2533911 collect 1)))
One can also use FUNCALL with a number of arguments spread out.
Why does a limit exist?
The Common Lisp standard was written to allow efficient implementations on a variety of computers. It was thought that some machine-level function-calling conventions only support a limited number of arguments. The standard says the number of supported arguments can be as low as 50, and some implementations really do have a relatively low limit.
Thus apply in Common Lisp is not a tool for list processing, but for calling functions with computed argument lists.
For list and vector processing, use REDUCE instead of APPLY
If we want to sum all numbers in a list, replace
(apply #'+ list) ; don't use this
with
(reduce #'+ list) ; can handle arbitrary long lists
Recursion
apply is a non-optimized recursive function
I cannot see why the function APPLY should use recursion.
For example if you think of
(apply #'+ '(1 2 3 4 5))
The repeated summing of the arguments is done by the function + and not by apply.
This is different from
(reduce #'+ '(1 2 3 4 5))
where the repeated call of the function + with two arguments is done by reduce.
What's causing the stack exhaustion?
Though recursion is often a likely culprit for stack exhaustion, that is not true in this case. According to the SBCL Internals Manual:
In full call, the arguments are passed creating a partial frame on the stack top and storing stack arguments into that frame.
Each generated list element is stored in the new stack frame, quickly exhausting the stack. Presumably, passing SBCL a larger value through --control-stack-size would increase this practical limit on the number of arguments that can be passed in a function call.
Why is call-arguments-limit so much larger than the practical limit?
An SBCL mailing list response to someone with a similar problem explains why the practical limit of the stack size isn't reflected in call-arguments-limit:
The condition you're seeing here is not due to a fundamental implementation limitation, but rather because of the particular choice of stack size that someone chose -- if the stack size were bigger, this call of yours would not error. [...] Consider that, given our strategy of passing excess arguments on the stack, that the actual maximum number of arguments passable at any given time depends on the program state, and isn't in fact a constant.
The spec says that call-arguments-limit must be a constant, so SBCL seems to have defined it as most-positive-fixnum.
There are a couple of bug reports discussing the issue, and a TODO in the source suggesting that at least one contributor feels it should be reduced to a less absurd value:
;; TODO: Reducing CALL-ARGUMENTS-LIMIT to something reasonable to
;; allow DWORD ops without it looking like a bug would make sense.
;; With a stack size of about 2MB, the limit is absurd anyway.
SBCL's particular way of implementing call-arguments-limit might have room for improvement and could lead to unexpected behaviour, but it does follow ANSI spec.
The practical limit varies depending on the space remaining on the stack, so defining call-arguments-limit according to this value would not obey the spec requirement for a constant value.

Erlang: stackoverflow with recursive function that is not tail call optimized?

Is it possible to get a stack overflow in Erlang with a function that is not tail call optimized? For example, suppose I have a function like this:
sum_list([], Acc) ->
    Acc;
sum_list([Head|Tail], Acc) ->
    Head + sum_list(Tail, Acc).
It would seem that if a large enough list were passed in, it would eventually run out of stack space and crash. I tried testing this like so:
> L = lists:seq(1, 10000000).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22, 23,24,25,26,27,28,29|...]
> sum_test:sum_list(L, 0).
50000005000000
But it never crashes! I tried it with a list of 100,000,000 integers and it took a while to finish but it still never crashed! Questions:
Am I testing this correctly?
If so, why am I unable to generate a stackoverflow?
Is Erlang doing something that prevents stackoverflows from occurring?
You are testing this correctly: your function is indeed not tail-recursive. To find out, you can compile your code using erlc -S <erlang source file>.
{function, sum_list, 2, 2}.
  {label,1}.
    {func_info,{atom,so},{atom,sum_list},2}.
  {label,2}.
    {test,is_nonempty_list,{f,3},[{x,0}]}.
    {allocate,1,2}.
    {get_list,{x,0},{y,0},{x,0}}.
    {call,2,{f,2}}.
    {gc_bif,'+',{f,0},1,[{y,0},{x,0}],{x,0}}.
    {deallocate,1}.
    return.
  {label,3}.
    {test,is_nil,{f,1},[{x,0}]}.
    {move,{x,1},{x,0}}.
    return.
As a comparison the following tail-recursive version of the function:
tail_sum_list([], Acc) ->
    Acc;
tail_sum_list([Head|Tail], Acc) ->
    tail_sum_list(Tail, Head + Acc).
compiles as:
{function, tail_sum_list, 2, 5}.
  {label,4}.
    {func_info,{atom,so},{atom,tail_sum_list},2}.
  {label,5}.
    {test,is_nonempty_list,{f,6},[{x,0}]}.
    {get_list,{x,0},{x,2},{x,3}}.
    {gc_bif,'+',{f,0},4,[{x,2},{x,1}],{x,1}}.
    {move,{x,3},{x,0}}.
    {call_only,2,{f,5}}.
  {label,6}.
    {test,is_nil,{f,4},[{x,0}]}.
    {move,{x,1},{x,0}}.
    return.
Notice the lack of allocate and the call_only opcode in the tail-recursive version, as opposed to the allocate/call/deallocate/return sequence in the non-tail-recursive function.
You are not getting a stack overflow because the Erlang "stack" is very large. Indeed, a stack overflow usually means the processor stack overflowed: the processor's stack pointer moved past the memory reserved for the stack. Processes traditionally have a limited stack size which can be tuned by interacting with the operating system. See for example POSIX's setrlimit.
However, Erlang's execution stack is not the processor stack, as the code is interpreted. Each process has its own stack which can grow as needed by invoking operating system memory allocation functions (typically malloc on Unix).
As a result, your function will not crash as long as malloc calls succeed.
For the record, the actual list L uses about the same amount of memory as the stack needed to process it. Indeed, each element in the list takes two words: one for the integer value itself (which fits in a word, since the values are small) and one for the pointer to the next element of the list. Conversely, the stack grows by two words at each iteration through the allocate opcode: one word for the CP, which is saved by allocate itself, and one word, as requested (the first parameter of allocate), for the current value.
For 100,000,000 elements on a 64-bit VM, the list takes a minimum of 1.5 GB (more, since the actual stack is, fortunately, not grown two words at a time). Monitoring and garbage collecting this is difficult in the shell, as many values remain live. If you spawn a function, you can see the memory usage:
spawn(fun() ->
    io:format("~p\n", [erlang:memory()]),
    L = lists:seq(1, 100000000),
    io:format("~p\n", [erlang:memory()]),
    sum_test:sum_list(L, 0),
    io:format("~p\n", [erlang:memory()])
end).
As you can see, the memory for the recursive call is not released immediately.

Quicksort and tail recursive optimization

Introduction to Algorithms (p. 169) talks about using tail recursion for Quicksort.
The original Quicksort algorithm earlier in the chapter is (in pseudo-code)
Quicksort(A, p, r)
{
    if (p < r)
    {
        q: <- Partition(A, p, r)
        Quicksort(A, p, q)
        Quicksort(A, q+1, r)
    }
}
The optimized version using tail recursion is as follows
Quicksort(A, p, r)
{
    while (p < r)
    {
        q: <- Partition(A, p, r)
        Quicksort(A, p, q)
        p: <- q+1
    }
}
Where Partition sorts the array according to a pivot.
The difference is that the second algorithm only calls Quicksort once to sort the LHS.
Can someone explain to me why the first algorithm could cause a stack overflow, whereas the second wouldn't? Or am I misunderstanding the book?
First, let's start with a brief (probably not precise, but still valid) definition of what a stack overflow is.
As you probably know, there are two different kinds of memory, implemented as two different data structures: the heap and the stack.
In terms of size, the heap is bigger than the stack, and to keep it simple let's say that every time a function call is made, a new environment (local variables, parameters, etc.) is created on the stack. Given that, and the fact that the stack's size is limited, if you make too many function calls you will run out of space and get a stack overflow.
The problem with recursion is that, since you create at least one new environment on the stack per call, you occupy a lot of space in the limited stack very quickly, so stack overflows are commonly associated with recursive calls.
So there is this thing called tail call optimization, which reuses the same environment every time a recursive call is made, so the space occupied on the stack stays constant, preventing the stack overflow issue.
Now, there are some rules for performing tail call optimization. First, each call must be complete; by that I mean that the function should be able to give a result at any moment if you interrupt the execution. In SICP this is called an iterative process, even when the function is recursive.
If you analyze your first example, you will see that each iteration is defined by two recursive calls, which means that if you stop the execution at any time you won't be able to give a partial result, because the result depends on those calls finishing; in this scenario you can't reuse the stack environment, because the total information is split between all those recursive calls.
However, the second example doesn't have that problem: A is constant and the state of p and r can be determined locally, so since all the information needed to keep going is there, TCO can be applied.
The essence of tail recursion optimization is that there is no recursion when the program is actually executed. When the compiler or interpreter is able to apply TRO, it essentially figures out how to rewrite your recursively-defined algorithm into a simple iterative process, with the stack not used to store nested function invocations.
The first code snippet can't be TR-optimized because there are 2 recursive calls in it.
Tail recursion by itself is not enough. The algorithm with the while loop can still use O(N) stack space; reducing it to O(log(N)) is left as an exercise in that section of CLRS.
Assume we are working in a language with array slices and tail call optimization. Consider the difference between these two algorithms:
Bad:
Quicksort(arraySlice) {
    if (arraySlice.length > 1) {
        slices = Partition(arraySlice)
        (smallerSlice, largerSlice) = sortBySize(slices)
        Quicksort(largerSlice)  // Not a tail call, requires a stack frame until it returns.
        Quicksort(smallerSlice) // Tail call, can replace the old stack frame.
    }
}
Good:
Quicksort(arraySlice) {
    if (arraySlice.length > 1) {
        slices = Partition(arraySlice)
        (smallerSlice, largerSlice) = sortBySize(slices)
        Quicksort(smallerSlice) // Not a tail call, requires a stack frame until it returns.
        Quicksort(largerSlice)  // Tail call, can replace the old stack frame.
    }
}
The second one is guaranteed to never need more than log2(length) stack frames, because smallerSlice is less than half as long as arraySlice: each retained frame's slice is less than half the size of its parent's, so after k nested frames the slice has fewer than length/2^k elements, and k can never exceed log2(length). For the first one, the inequality is reversed: it will always need at least log2(length) stack frames, and can require O(N) stack frames in the worst case, where smallerSlice always has length 1.
If you don't keep track of which slice is smaller or larger, you will have worst cases similar to the first, overflowing, version, even though it requires O(log(n)) stack frames on average. If you always sort the smaller slice first, you will never need more than log2(length) stack frames.
If you are using a language that doesn't have tail call optimization, you can write the second (not stack-blowing) version as:
Quicksort(arraySlice) {
    while (arraySlice.length > 1) {
        slices = Partition(arraySlice)
        (smallerSlice, arraySlice) = sortBySize(slices)
        Quicksort(smallerSlice) // Still not a tail call, requires a stack frame until it returns.
    }
}
Another thing worth noting: if you are implementing something like Introsort, which switches to Heapsort once the recursion depth exceeds some number proportional to log(N), you will never hit the O(N) worst-case stack memory usage of quicksort, so you technically don't need to do this. Doing this optimization (recursing into the smaller slice first) still improves the constant factor of the O(log(N)), though, so it is strongly recommended.
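For concreteness, here is a minimal Racket sketch (mine, not CLRS's; partition! and quicksort! are made-up helper names) of the "recurse into the smaller side, loop on the larger side" idea, using a Lomuto-style in-place partition on a vector:

#lang racket
;; Sketch only: the real recursion always goes into the smaller half,
;; so at most about log2(n) frames are live at once; the larger half is
;; handled by the tail loop, like the while loop in the book's version.
(define (partition! v lo hi)             ; Lomuto partition, pivot = v[hi]
  (define (swap! i j)
    (let ([tmp (vector-ref v i)])
      (vector-set! v i (vector-ref v j))
      (vector-set! v j tmp)))
  (define pivot (vector-ref v hi))
  (define i lo)
  (for ([j (in-range lo hi)])
    (when (<= (vector-ref v j) pivot)
      (swap! i j)
      (set! i (add1 i))))
  (swap! i hi)
  i)

(define (quicksort! v [lo 0] [hi (sub1 (vector-length v))])
  (let loop ([lo lo] [hi hi])
    (when (< lo hi)
      (let ([q (partition! v lo hi)])
        (cond
          [(< (- q lo) (- hi q))          ; left part is the smaller one
           (quicksort! v lo (sub1 q))     ; real recursion on the smaller side
           (loop (add1 q) hi)]            ; tail loop on the larger side
          [else
           (quicksort! v (add1 q) hi)
           (loop lo (sub1 q))])))))

(define v (list->vector (shuffle (range 100))))
(quicksort! v)   ; v is now sorted in place

This is the same trick the other answers describe: only the smaller side ever consumes a stack frame, and the larger side is handled by the loop in place of a tail call.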
Well, the most obvious observation would be the common definition of the stack overflow problem:
The most common cause of stack overflow is excessively deep or infinite recursion. (source)
The second uses less deep recursion than the first (n branches per call instead of n^2), hence it is less likely to cause a stack overflow (so lower complexity means less chance of causing one).
But somebody would have to add why the second can never cause a stack overflow while the first can.
Well, if you consider the call structure of the two methods, the first obviously makes more recursive calls than the second, since it recurses on both the LHS and the RHS; as a result there are more chances of getting a stack overflow.
Note: that doesn't mean there is absolutely no chance of getting a stack overflow with the second method.
In function 2 that you shared, tail call elimination is implemented. Before proceeding further, let us understand what a tail recursive call is: if the recursive call is the last statement in the code and nothing happens after it, it is a tail call. The last recursive call in function 1 is such a call. For such a function, with some changes in the code, one can remove that last recursive call, as you showed in function 2, which performs the same work as function 1. This process is called tail recursion optimization or tail call elimination, and the following are its results:
Optimizing in terms of auxiliary space
Optimizing in terms of recursion call overhead
The last recursive call is eliminated by using the while loop. The good thing is that for function 2, no auxiliary space is used for the right-hand call, since its recursion is eliminated using p: <- q+1, and the overall function does not have that recursion call overhead. So whichever way the partition happens, the maximum space needed is theta(log n).

sbcl runs forever on second call of function

The function:
Given a list lst, return all permutations of the list's contents of exactly length k, which defaults to the length of the list if not provided.
(defun permute (lst &optional (k (length lst)))
  (if (= k 1)
      (mapcar #'list lst)
      (loop for item in lst nconcing
            (mapcar (lambda (x) (cons item x))
                    (permute (remove-if (lambda (x) (eq x item)) lst)
                             (1- k))))))
The problem:
I'm using SLIME in Emacs connected to SBCL; I haven't done too much customization yet. The function works fine on smaller inputs like lst = '(1 2 3 4 5 6 7 8), k = 3, which is what it will mostly be used for in practice. However, when I call it with a large input twice in a row, the second call never returns and sbcl does not even show up in top. These are the results at the REPL:
CL-USER> (time (nth (1- 1000000) (permute '(0 1 2 3 4 5 6 7 8 9))))
Evaluation took:
  12.263 seconds of real time
  12.166150 seconds of total run time (10.705372 user, 1.460778 system)
  [ Run times consist of 9.331 seconds GC time, and 2.836 seconds non-GC time. ]
  99.21% CPU
  27,105,349,193 processor cycles
  930,080,016 bytes consed
(2 7 8 3 9 1 5 4 6 0)
CL-USER> (time (nth (1- 1000000) (permute '(0 1 2 3 4 5 6 7 8 9))))
And it never comes back from the second call. I can only guess that for some reason I'm doing something horrible to the garbage collector but I can't see what. Does anyone have any ideas?
One thing that's wrong in your code is your use of EQ. EQ compares for identity.
EQ is not for comparing numbers. EQ of two numbers can be true or false.
Use EQL if you want to compare by identity, numbers by value, or characters. Not EQ.
Actually
(remove-if (lambda (x) (eql x item)) list)
is just
(remove item list)
For your code, the EQ bug COULD mean that permute gets called recursively without a number actually having been removed from the list.
Other than that, I think SBCL is just busy with memory management. SBCL on my Mac acquired lots of memory (more than a GB) and was busy doing something. After some time the result was computed.
Your recursive function generates a huge amount of 'garbage'. LispWorks says: 1360950192 bytes.
Maybe you can come up with a more efficient implementation?
Update: garbage
Lisp provides some automatic memory management, but that does not free the programmer from thinking about space effects.
Lisp uses both a stack and the heap to allocate memory. The heap may be structured in certain ways for the GC - for example in generations, half spaces, and/or areas. There are precise garbage collectors and 'conservative' garbage collectors (used by SBCL on Intel machines).
When a program runs we can see various effects:
normal recursive procedures allocate space on the stack. Problem: the stack size is usually fixed (even though some Lisps can increase it in an error handler).
a program may allocate a huge amount of memory and return a large result. PERMUTE is such a function. It can return very large lists.
a program may allocate memory and use it temporarily and then the garbage collector can recycle it. The rate of creation and destruction can be very high, even though the program does not use a large amount of fixed memory.
There are more problems, though. But for each of the above the Lisp programmer (and every other programmer using a language implementation with garbage collection) has to learn how to deal with that.
Replace recursion with iteration. Replace recursion with tail recursion.
Return only the part of the result that is needed and don't generate the full solution. If you need the n-th permutation, then compute that, and not all permutations (see the sketch at the end of this answer). Use lazy data structures that are computed on demand. Use something like SERIES, which allows stream-like, but efficient, computation. See SICP, PAIP, and other advanced Lisp books.
Reuse memory with a resource manager. Reuse buffers instead of allocating objects all the time. Use an efficient garbage collector with special support for collecting ephemeral (short-lived) objects. Sometimes it also may help to destructively modify objects, instead of allocating new objects.
The above deals with the space problems of real programs. Ideally our compilers or runtime infrastructure would provide some automatic support to deal with these problems, but in reality this does not really work. Most Lisp systems provide low-level functionality to deal with this, and Lisp provides mutable objects - because the experience of real-world Lisp programs has shown that programmers do want to use them to optimize their programs. If you have a large CAD application that computes the shape of turbine blades, then theoretical/puristic views about non-mutable memory simply do not apply - the developer wants the faster/smaller code and the smaller runtime footprint.
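As an illustration of the "compute only the n-th permutation" advice above, here is a small sketch using the factorial number system (mine, written in Racket rather than Common Lisp; nth-permutation is a made-up name). It picks each element directly instead of consing up a million permutations:

#lang racket
;; n (0-based) selects a permutation in the same order the question's
;; permute produces them: at each position, n divided by (remaining-1)!
;; picks the element, and the remainder indexes into the rest.
(define (nth-permutation lst n)
  (let loop ([items lst] [n n] [acc '()])
    (if (null? items)
        (reverse acc)
        (let* ([f (for/product ([i (in-range 1 (length items))]) i)]  ; (len-1)!
               [item (list-ref items (quotient n f))])
          (loop (remove item items) (remainder n f) (cons item acc))))))

;; Should match the value the question computed the expensive way:
(nth-permutation '(0 1 2 3 4 5 6 7 8 9) (- 1000000 1))  ; => '(2 7 8 3 9 1 5 4 6 0)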
SBCL on most platforms uses a generational garbage collector, which means that allocated memory which survives more than some number of collections is considered for collection more rarely. Your algorithm, for the given test case, generates so much garbage that it triggers GC so many times that the actual results, which obviously have to survive the entire function runtime, are tenured, that is, moved to a final generation which is collected either very rarely or not at all. Therefore, the second run will, on standard settings for 32-bit systems, run out of heap (512 MB, which can be increased with runtime options).
Tenured data can be garbage collected by manually triggering the collector with (sb-ext:gc :full t). This is obviously implementation dependent.
From the looks of the output, you're looking at the slime-repl, right?
Try changing to the "*inferior-lisp*" buffer, you'll probably see that SBCL has dropped down to the ldb (built-in low-level debugger). Most probably, you've managed to blow the call-stack.

Are there problems that cannot be written using tail recursion?

Tail recursion is an important performance optimisation strategy in functional languages because it allows recursive calls to consume constant stack (rather than O(n)).
Are there any problems that simply cannot be written in a tail-recursive style, or is it always possible to convert a naively-recursive function into a tail-recursive one?
If so, one day might functional compilers and interpreters be intelligent enough to perform the conversion automatically?
Yes, actually you can take some code and convert every function call—and every return—into a tail call. What you end up with is called continuation-passing style, or CPS.
For example, here's a function containing two recursive calls:
(define (count-tree t)
  (if (pair? t)
      (+ (count-tree (car t)) (count-tree (cdr t)))
      1))
And here's how it would look if you converted this function to continuation-passing style:
(define (count-tree-cps t ctn)
  (if (pair? t)
      (count-tree-cps (car t)
                      (lambda (L) (count-tree-cps (cdr t)
                                                  (lambda (R) (ctn (+ L R))))))
      (ctn 1)))
The extra argument, ctn, is a procedure which count-tree-cps tail-calls instead of returning. (sdcvvc's answer says that you can't do everything in O(1) space, and that is correct; here each continuation is a closure which takes up some memory.)
I didn't transform the calls to car or cdr or + into tail-calls. That could be done as well, but I assume those leaf calls would actually be inlined.
Now for the fun part. Chicken Scheme actually does this conversion on all code it compiles. Procedures compiled by Chicken never return. There's a classic paper explaining why Chicken Scheme does this, written in 1994 before Chicken was implemented: CONS should not cons its arguments, Part II: Cheney on the M.T.A.
Surprisingly enough, continuation-passing style is fairly common in JavaScript. You can use it to do long-running computation, avoiding the browser's "slow script" popup. And it's attractive for asynchronous APIs. jQuery.get (a simple wrapper around XMLHttpRequest) is clearly in continuation-passing style; the last argument is a function.
It's true but not useful to observe that any collection of mutually recursive functions can be turned into a tail-recursive function. This observation is on a par with the old chestnut from the 1960s that control-flow constructs could be eliminated because every program could be written as a loop with a case statement nested inside.
What's useful to know is that many functions which are not obviously tail-recursive can be converted to tail-recursive form by the addition of accumulating parameters. (An extreme version of this transformation is the transformation to continuation-passing style (CPS), but most programmers find the output of the CPS transform difficult to read.)
Here's an example of a function that is "recursive" (actually it's just iterating) but not tail-recursive:
factorial n = if n == 0 then 1 else n * factorial (n-1)
In this case the multiply happens after the recursive call.
We can create a version that is tail-recursive by putting the product in an accumulating parameter:
factorial n = f n 1
  where f n product = if n == 0 then product else f (n-1) (n * product)
The inner function f is tail-recursive and compiles into a tight loop.
I find the following distinctions useful:
In an iterative or recursive program, you solve a problem of size n by first solving one subproblem of size n-1. Computing the factorial function falls into this category, and it can be done either iteratively or recursively. (This idea generalizes, e.g., to the Fibonacci function, where you need both n-1 and n-2 to solve n.)
In a recursive program, you solve a problem of size n by first solving two subproblems of size n/2. Or, more generally, you solve a problem of size n by first solving a subproblem of size k and one of size n-k, where 1 < k < n. Quicksort and mergesort are two examples of this kind of problem, which can easily be programmed recursively, but is not so easy to program iteratively or using only tail recursion. (You essentially have to simulate recursion using an explicit stack.)
In dynamic programming, you solve a problem of size n by first solving all subproblems of all sizes k, where k < n. Finding the shortest route from one point to another on the London Underground is an example of this kind of problem. (The London Underground is a multiply-connected graph, and you solve the problem by first finding all points for which the shortest path is 1 stop, then all points for which it is 2 stops, and so on.)
Only the first kind of program has a simple transformation into tail-recursive form.
Any recursive algorithm can be rewritten as an iterative algorithm (perhaps requiring a stack or list) and iterative algorithms can always be rewritten as tail-recursive algorithms, so I think it's true that any recursive solution can somehow be converted to a tail-recursive solution.
(In comments, Pascal Cuoq points out that any algorithm can be converted to continuation-passing style.)
Note that just because something is tail-recursive doesn't mean that its memory usage is constant. It just means that the call-return stack doesn't grow.
You can't do everything in O(1) space (space hierarchy theorem). If you insist on using tail recursion, then you can store the call stack as one of the arguments. Obviously this doesn't change anything; somewhere internally, there is a call stack, you're simply making it explicitly visible.
If so, one day might functional compilers and interpreters be intelligent enough to perform the conversion automatically?
Such a conversion will not decrease space complexity.
As Pascal Cuoq commented, another way is to use CPS; all calls are tail recursive then.
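To make "store the call stack as one of the arguments" concrete, here is a minimal Racket sketch (mine) that rewrites the count-tree example from the CPS answer above so that all calls are tail calls; the pending work lives in an explicit to-do list instead of on the control stack:

#lang racket
(define (count-tree t)                      ; not tail recursive
  (if (pair? t)
      (+ (count-tree (car t)) (count-tree (cdr t)))
      1))

(define (count-tree/iter t)                 ; tail recursive, explicit "stack"
  (let loop ([todo (list t)] [n 0])         ; todo plays the role of the call stack
    (cond
      [(null? todo) n]
      [(pair? (car todo))
       (loop (list* (caar todo) (cdar todo) (cdr todo)) n)]
      [else
       (loop (cdr todo) (add1 n))])))

(count-tree '((a . b) c . d))       ; => 4
(count-tree/iter '((a . b) c . d))  ; => 4

As the answer says, this doesn't reduce space usage: the to-do list grows exactly where the control stack would have, it is just heap-allocated and explicit.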
I don't think something like tak could be implemented using only tail calls. (not allowing continuations)
