sbcl runs forever on second call of function - recursion

The function:
Given a list lst return all permutations of the list's contents of exactly length k, which defaults to length of list if not provided.
(defun permute (lst &optional (k (length lst)))
  (if (= k 1)
      (mapcar #'list lst)
      (loop for item in lst nconcing
        (mapcar (lambda (x) (cons item x))
                (permute (remove-if (lambda (x) (eq x item)) lst)
                         (1- k))))))
The problem:
I'm using SLIME in Emacs connected to SBCL, and I haven't done much customization yet. The function works fine on smaller inputs like lst = '(1 2 3 4 5 6 7 8) k = 3, which is what it will mostly be used for in practice. However, when I call it with a large input twice in a row, the second call never returns and sbcl does not even show up in top. These are the results at the REPL:
CL-USER> (time (nth (1- 1000000) (permute '(0 1 2 3 4 5 6 7 8 9))))
Evaluation took:
12.263 seconds of real time
12.166150 seconds of total run time (10.705372 user, 1.460778 system)
[ Run times consist of 9.331 seconds GC time, and 2.836 seconds non-GC time. ]
99.21% CPU
27,105,349,193 processor cycles
930,080,016 bytes consed
(2 7 8 3 9 1 5 4 6 0)
CL-USER> (time (nth (1- 1000000) (permute '(0 1 2 3 4 5 6 7 8 9))))
And it never comes back from the second call. I can only guess that for some reason I'm doing something horrible to the garbage collector but I can't see what. Does anyone have any ideas?

One thing that's wrong in your code is your use of EQ. EQ compares for identity.
EQ is not for comparing numbers. EQ of two numbers can be true or false.
Use EQL if you want to compare objects by identity, and numbers and characters by value. Not EQ.
Actually
(remove-if (lambda (x) (eql x item)) list)
is just
(remove item list)
For your code, the EQ bug COULD mean that permute gets called in the recursive call without a number actually having been removed from the list.
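As a minimal sketch, here is the question's function with only that change applied, relying on the fact that REMOVE tests with EQL by default. It assumes the list elements are distinct and it keeps the original's full enumeration, so it does nothing about the memory use.

```lisp
(defun permute (lst &optional (k (length lst)))
  "Return all length-K permutations of the (distinct) elements of LST."
  (if (= k 1)
      (mapcar #'list lst)
      (loop for item in lst
            ;; REMOVE compares with EQL by default, which is safe for numbers.
            nconc (mapcar (lambda (x) (cons item x))
                          (permute (remove item lst) (1- k))))))
```

For example, (permute '(1 2 3) 2) returns ((1 2) (1 3) (2 1) (2 3) (3 1) (3 2)).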
Other than that, I think SBCL is just busy with memory management. SBCL on my Mac acquired lots of memory (more than a GB) and was busy doing something. After some time the result was computed.
Your recursive function generates a huge amount of 'garbage'. LispWorks says: 1360950192 bytes
Maybe you can come up with a more efficient implementation?
Update: garbage
Lisp provides some automatic memory management, but that does not free the programmer from thinking about space effects.
Lisp uses both a stack and the heap to allocate memory. The heap may be structured in certain ways for the GC - for example in generations, half spaces and/or areas. There are precise garbage collectors and 'conservative' garbage collectors (used by SBCL on Intel machines).
When a program runs we can see various effects:
normal recursive procedures allocate space on the stack. Problem: the stack size is usually fixed (even though some Lisps can increase it in an error handler).
a program may allocate huge amount of memory and return a large result. PERMUTE is such a function. It can return very large lists.
a program may allocate memory and use it temporarily and then the garbage collector can recycle it. The rate of creation and destruction can be very high, even though the program does not use a large amount of fixed memory.
There are more problems, though. But for each of the above the Lisp programmer (and every other programmer using a language implementation with garbage collection) has to learn how to deal with that.
Replace recursion with iteration. Replace recursion with tail recursion.
Return only the part of the result that is needed and don't generate the full solution. If you need the n-th permutation, then compute that and not all permutations. Use lazy data structures that are computed on demand. Use something like SERIES, which allows stream-like, but efficient, computation. See SICP, PAIP, and other advanced Lisp books.
Reuse memory with a resource manager. Reuse buffers instead of allocating objects all the time. Use an efficient garbage collector with special support for collecting ephemeral (short-lived) objects. Sometimes it also may help to destructively modify objects, instead of allocating new objects.
The above deals with the space problems of real programs. Ideally our compilers or runtime infrastructure may provide some automatic support to deal with these problems. But in reality this does not really work. Most Lisp systems provide low-level functionality to deal with this and Lisp provides mutable objects - because the experience of real-world Lisp programs has shown that programmers do want to use them to optimize their programs. If you have a large CAD application that computes the form of turbine blades, then theoretical/puristic views about non-mutable memory simply do not apply - the developer wants the faster/smaller code and the smaller runtime footprint.
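To illustrate the advice "compute the n-th permutation and not all permutations", here is a rough sketch that picks each element directly using the factorial number system. The names factorial and nth-permutation are made up for this example, and it assumes a 0-based index and the same ordering that the question's permute produces.

```lisp
(defun factorial (n)
  "Return N! for a non-negative integer N."
  (if (< n 2)
      1
      (* n (factorial (1- n)))))

(defun nth-permutation (n items)
  "Return the N-th (0-based) permutation of ITEMS, counted in the order
in which PERMUTE would list them. Assumes 0 <= N < (factorial (length ITEMS))."
  (let ((pool (copy-list items))
        (result '()))
    (loop for remaining from (length items) downto 1
          do (multiple-value-bind (index remainder)
                 (floor n (factorial (1- remaining)))
               ;; INDEX selects which of the remaining items comes next.
               (push (nth index pool) result)
               ;; Drop the chosen item from the pool by position.
               (setf pool (append (subseq pool 0 index)
                                  (nthcdr (1+ index) pool))
                     n remainder)))
    (nreverse result)))
```

With this, (nth-permutation 999999 '(0 1 2 3 4 5 6 7 8 9)) returns (2 7 8 3 9 1 5 4 6 0), the same list as in the question's REPL output, while consing only a handful of short lists.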

SBCL on most platforms uses a generational garbage collector, which means that allocated memory which survives more than some number of collections will be more rarely considered for collection. Your algorithm for the given test case generates so much garbage that it triggers GC so many times that the actual results, which obviously have to survive the entire function runtime, are tenured, that is, moved to a final generation which is collected either very rarely or not at all. Therefore, the second run will, on standard settings for 32-bit systems, run out of heap (512 MB, can be increased with runtime options).
Tenured data can be garbage collected by manually triggering the collector with (sb-ext:gc :full t). This is obviously implementation dependent.
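One way to apply this between runs is sketched below. The wrapper name call-with-full-gc is hypothetical; only (sb-ext:gc :full t) comes from the text above, and the reader conditionals make the collection a no-op on other implementations. Anything still referenced after the call (for example through the REPL's * variable) stays live regardless.

```lisp
(defun call-with-full-gc (thunk)
  "Call THUNK, then request a full collection (SBCL only) before returning
THUNK's values."
  (unwind-protect (funcall thunk)
    ;; Implementation-specific: collect all generations, including tenured data.
    #+sbcl (sb-ext:gc :full t)
    #-sbcl nil))
```

A call such as (call-with-full-gc (lambda () (nth 5 (list 0 1 2 3 4 5 6)))) returns 5 and leaves the intermediate list eligible for collection right away.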

From the looks of the output, you're looking at the slime-repl, right?
Try changing to the "*inferior-lisp*" buffer, you'll probably see that SBCL has dropped down to the ldb (built-in low-level debugger). Most probably, you've managed to blow the call-stack.

Related

Performance of function call in Common Lisp SBCL

I'm new to Common Lisp and ran into a performance thing that just struck me as weird. I'm checking if a number is divisible by 10 using rem in a loop. If I move the check into a function, it runs 5x slower. What would cause that?
I'm running sbcl 1.4.5 on 64 bit Ubuntu 18.04.
(defun fn (x)
  (= 0 (rem x 10)))
(defun walk-loop-local (n)
  (loop for i from 1 to n do
    (= 0 (rem i 10))))
(defun walk-loop-func (n)
  (loop for i from 1 to n do
    (fn i)))
(time (walk-loop-local 232792560))
(time (walk-loop-func 232792560))
I'd expect the time to be the same (and a lot faster, but that's a separate question). Instead, here's the output,
CL-USER> (load "loops.lisp")
Evaluation took:
0.931 seconds of real time
0.931389 seconds of total run time (0.931389 user, 0.000000 system)
100.00% CPU
2,414,050,454 processor cycles
0 bytes consed
Evaluation took:
4.949 seconds of real time
4.948967 seconds of total run time (4.948967 user, 0.000000 system)
100.00% CPU
12,826,853,706 processor cycles
0 bytes consed
Common Lisp allows dynamic redefinition of functions: if you redefined fn during the approx. 5 seconds of your second test, the running loop would switch to calling the new definition of fn while running. This feature comes with some constraints on how to compile function calls and how to optimize them when needed.
As pointed out by RainerJoswing in comments, the above is an over-simplification; there are cases where the compiler may assume functions are not redefined (recursive functions, functions in the same file). See 3.2.2.3 Semantic Constraints, for example:
A call within a file to a named function that is defined in the same
file refers to that function, unless that function has been declared
notinline. The consequences are unspecified if functions are redefined
individually at run time or multiply defined in the same file.
A function mixes error checking and the computations you want it to perform. At function call boundaries you typically have a prologue where your inputs are checked, and an epilogue where results might be "boxed": if the compiler knows that locally a variable is always a single-float, it can use a raw representation of floats during the extent of the function, but when returning the result, it should be a valid Lisp type, which means coercing it back to a tagged value, for example.
The SBCL compiler tries to ensure the code is safe, where safe means never invoking code that has undefined behaviour in the Lisp specification. Note however that if you call fn with a string input, the code is expected to detect the type error. Unlike C, a type-error at runtime in Lisp is well-defined (as long as the declared type, which defaults to T, encompasses all possible values at runtime). And so, compiling Lisp code for safety tends to add a lot of error checking at multiple points of the program.
Optimizing code consists in removing checks that are guaranteed to be always true, eliminating dead branches in the generated code.
For example, if you consider fn alone, you can see that it has to check its input every time it is called, because it might very well be called with a string input. But when you directly inline the operation, then the index i can be statically determined to be an integer, which allows calls to = and rem to be applied without (much) error checking.
Optimization in SBCL happens because there is a static analysis which maps variables to elements of the type lattice of Lisp (and and or are basically the greatest lower bound and least upper bound for types, with types T and NIL at both ends). SBCL reports only errors that are sure to happen: if a function accepts integers from 0 to 5, you get an error when you call it with an input that is known to always be above 5 or below zero (the two sets have no intersection), but you get no warning if you call it with an integer between 2 and 10. This is safe because the compiler can defer error checking to runtime, contrary to other languages where the runtime has no sense of types (trying to warn every time the code might have errors would result in a lot of warnings given the open-worldness of Lisp).
You can (declaim (inline fn)) in your file and then the performance will be identical to the first version. A rule of thumb is that inside a function, things are a bit more static than in the global environment: local functions cannot be redefined, local variables can have their types precisely defined, etc. You have more control about what is always true.
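A minimal sketch of that declaration, assuming the same fn as in the question. The proclamation has to appear before the defun so the compiler saves the inline expansion. The loop here counts the true results (a change from the question's code) so that the body has an observable effect and cannot simply be dropped.

```lisp
;; Must come before the DEFUN so the definition is recorded for inlining.
(declaim (inline fn))
(defun fn (x)
  (= 0 (rem x 10)))

(defun walk-loop-func (n)
  "Count the integers in 1..N that are divisible by 10."
  (loop for i from 1 to n
        count (fn i)))
```

The trade-off is that callers compiled with the inlined body will not see a later redefinition of fn until they are recompiled.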
Note that the overhead of error checking is a problem if it is executed many times (relative to the rest of the code). If you fill a big array with single-floats and apply numerical code on it, it makes sense to use a specialized array type, like (simple-array single-float), or to declare local variables to be floats with (declare (type single-float x)), so that you don't check that each value is effectively a float. In other cases, the overhead is not high enough to spend too much time reducing it.
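As a small illustration of those declarations, the sketch below sums a specialized float vector. The function name sum-floats is invented for this example; how much checking a compiler actually removes depends on the implementation and the optimization settings.

```lisp
(defun sum-floats (vec)
  "Sum the elements of VEC, a one-dimensional (simple-array single-float)."
  (declare (type (simple-array single-float (*)) vec))
  (let ((total 0.0f0))
    ;; With both types declared, the additions can work on raw floats.
    (declare (type single-float total))
    (loop for x across vec
          do (incf total x))
    total))
```

It would be called with an array created along the lines of (make-array 3 :element-type 'single-float :initial-contents '(1.0f0 2.0f0 3.0f0)).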
You are using the SBCL compiler:
(defun walk-loop-local (n)
  (loop for i from 1 to n do
    (= 0 (rem i 10))))
I think your code does nothing in the loop iteration. It gets optimized away, since the value of = form is not used anywhere and there are no side-effects.
Thus there is no overhead, since there is no code.
Use (disassemble #'walk-loop-local) to check the compiled code.
If I move the check into a function, it runs 5x slower. What would cause that?
Instead of doing nothing, in each iteration the function gets called and executes your code.
Actually measuring calling overhead
(defparameter *i* nil)
(defun walk-loop-local (n)
  (loop for i from 1 to n do
    (setf *i* (= 0 (rem i 10)))))
(defun fn (x)
  (setf *i* (= 0 (rem x 10))))
(defun walk-loop-func (n)
  (loop for i from 1 to n do
    (fn i)))
In the above case the code doesn't get removed.
CL-USER> (time (walk-loop-local 232792560))
Evaluation took:
5.420 seconds of real time
5.412637 seconds of total run time (5.399134 user, 0.013503 system)
99.87% CPU
6,505,078,020 processor cycles
0 bytes consed
NIL
CL-USER> (time (walk-loop-func 232792560))
Evaluation took:
6.235 seconds of real time
6.228447 seconds of total run time (6.215409 user, 0.013038 system)
99.89% CPU
7,481,974,847 processor cycles
0 bytes consed
You can see that the function call overhead isn't that large.
Every function call adds an overhead. This is what you are measuring.
You could declaim the function fn to be inline and also try modifying the compiler flags to optimize for runtime execution (as opposed to debug information or safety). I'm on the phone now, but could add the HyperSpec link if needed.
BR, Eric

Why "there is no such thing as stack overflow" in Racket?

The following paragraph is from The Racket Guide (2.3.4):
At the same time, recursion does not lead to particularly bad
performance in Racket, and there is no such thing as stack overflow;
you can run out of memory if a computation involves too much context,
but exhausting memory typically requires orders of magnitude deeper
recursion than would trigger a stack overflow in other languages.
I'm curious about how Racket was designed to avoid stack overflow? What's more, why other languages like C cannot avoid such a problem?
First, some terminology: making a non-tail call requires a context frame to store local variables, parent return address, etc. So the question is how to represent an arbitrarily large context. "The stack" (or call stack) is just one (admittedly common) implementation strategy for the context.
Here are a few implementation strategies for deep recursion (ie, large contexts):
Allocate context frames on the heap and let the GC be responsible for cleaning them up. (This is nice and simple but probably relatively slow, although people would argue that point.)
Allocate context frames on the stack. When the stack is full, copy the frames currently on the stack into the heap, clear the stack, and reset the stack pointer to the bottom. When returning, if the stack is empty, copy frames from the heap back to the stack. (This means you can't have pointers to stack-allocated objects, though, since the objects get moved around.)
Allocate context frames on the stack. When the stack is full, allocate a new big chunk of memory, call that the new stack (ie set the stack pointer), and keep going. (This might require mprotect or other operations to convince the OS that the new block of memory is okay to treat as a call stack.)
Allocate context frames on the stack. When the stack is full, make a new thread to continue the computation, and wait for the thread to finish and arrange to grab a return value from it to return to the old thread's stack. (This strategy can be useful on platforms like the JVM that don't let you directly control the stack, stack pointer, etc. On the other hand, it complicates features like thread-local storage, etc.)
... and more variations on the strategies above.
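The first strategy can be imitated by hand in user code, which may make the idea more concrete. The sketch below is in Common Lisp and is not how Racket implements it: instead of recursing, it keeps the pending work in an explicit list on the heap, so nesting depth is limited by memory and not by the call stack. The function name sum-leaves is made up for this example.

```lisp
(defun sum-leaves (tree)
  "Sum all numbers in the nested list TREE without using recursion."
  (let ((pending (list tree))   ; heap-allocated stand-in for the call stack
        (total 0))
    (loop while pending
          do (let ((node (pop pending)))
               (cond ((numberp node) (incf total node))
                     ((consp node)
                      ;; Defer the rest of the list, then visit its head.
                      (push (cdr node) pending)
                      (push (car node) pending)))))
    total))
```

For example, (sum-leaves '(1 (2 3) ((4)) 5)) returns 15, and the same function can handle a list nested a million levels deep, where a naive recursive version would likely exhaust the control stack.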
Support for deep recursion often coincides with support for first-class continuations. In general, implementing first-class continuations means you almost automatically get support for deep recursion. There's a nice paper called Implementation Strategies for First-class Continuations by Will Clinger et al. with more detail and comparisons between different strategies.
There are two pieces to this answer.
First, in Racket and other functional languages, tail calls do not create additional stack frames. That is, a loop such as
(define (f x) (f x))
... can run forever without using any stack space at all. Many non-functional languages don't prioritize function calling in the same way as functional languages, and therefore aren't properly tail-calling.
HOWEVER, the comment that you're referring to isn't just limited to tail-calling; Racket allows very deeply nested stack frames.
Your question is a good one: why don't other languages allow deeply nested stack frames? I wrote a short test, and it looks like C unceremoniously dumps core at a depth of between 262,000 and 263,000. I wrote a simple Racket test that does the same thing (being careful to ensure the recursive call was not in tail position), and I interrupted it at a depth of 48,000,000 without any apparent ill effects (except, presumably, a fairly large runtime stack).
To answer your question directly, there's no reason that I'm aware of that C couldn't allow much much more deeply nested stacks, but I think that for most C programmers, a recursion depth of 262K is plenty.
Not for us, though!
Here's my C code:
#include <stdio.h>
int f(int depth){
  if ((depth % 1000) == 0) {
    printf("%d\n",depth);
  }
  return f(depth+1);
}
int main() {
  return f(0);
}
... and my Racket code:
#lang racket
(define (f depth)
(when (= (modulo depth 1000) 0)
(printf "~v\n" depth))
(f (add1 depth))
(printf "this will never print..."))
(f 0)
EDIT: here's the version that uses randomness on the way out to stymie possible optimizations:
#lang racket
(define (f depth)
(when (= (modulo depth 1000000) 0)
(printf "~v\n" depth))
(when (< depth 50000000)
(f (add1 depth)))
(when (< (random) (/ 1.0 100000))
(printf "X")))
(f 0)
Also, my observations of the process size are consistent with a stack frame of about 16 bytes, plus or minus; 50M * 16 bytes = 800 Megabytes, and the observed size of the stack is about 1.2 Gigabytes.

Why does apply throw a CONTROL-STACK-EXHAUSTED-ERROR on a large list?

(apply #'+ (loop for i from 1 to x collect 1))
works if x has value 253391, but fails with a (SB-KERNEL::CONTROL-STACK-EXHAUSTED-ERROR) on 253392*. This is orders of magnitude smaller than call-arguments-limit**.
Is recursion exhausting the stack? If so, is it in apply? Why hasn't it been optimized out?
*Also interesting, (apply #'max (loop for i from 1 to 253391 collect 1)) throws the error, but 253390 is fine.
**call-arguments-limit evaluates to 4611686018427387903 (with the help of format's ~R, it turns out this is four quintillion six hundred eleven quadrillion six hundred eighty-six trillion eighteen billion four hundred twenty-seven million three hundred eighty-seven thousand nine hundred three)
parameters that can be passed to a function in SBCL
You don't pass parameters. You pass arguments.
(defun foo (x y) (list x y))
x and y are parameters of the function foo.
(foo 20 22)
20 and 22 are arguments in a call of the function foo.
See the variables call-arguments-limit and lambda-parameters-limit.
SBCL and call-arguments-limit
If a function can't nearly handle the claimed number of arguments, then this looks like a bug in SBCL. You might want to report this bug. Maybe they need to change the value of call-arguments-limit.
Testing
APPLY is one way to test it.
Another:
(eval (append '(drop-params)
(loop for i from 1 to 2533911 collect 1)))
One can also use FUNCALL with a number of arguments spread out.
Why does a limit exist?
The Common Lisp standard was written to allow efficient implementations on various different computers. It was thought that some machine-level function calling implementations only support a limited number of arguments. The standard says the number of supported arguments can be as low as 50. Actually some implementations have a relatively low number of supported arguments.
Thus apply in Common Lisp is not a tool for list processing, but to call functions with computed arglists.
For list and vector processing use REDUCE, instead of APPLY
If we want to sum all numbers in a list, replace
(apply #'+ list) ; don't use this
with
(reduce #'+ list) ; can handle arbitrarily long lists
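As a quick sketch of the difference in practice, the function below (sum-list is a made-up name) sums a list with REDUCE, so it only ever calls + with two arguments and never spreads the whole list onto the stack as an argument list.

```lisp
(defun sum-list (numbers)
  "Sum NUMBERS with REDUCE; the list length is not limited by argument passing."
  (reduce #'+ numbers :initial-value 0))
```

A call like (sum-list (make-list 1000000 :initial-element 1)) returns 1000000, well past the length at which the APPLY version in the question failed.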
Recursion
apply is a non-optimized recursive function
I cannot see why the function APPLY should use recursion.
For example if you think of
(apply #'+ '(1 2 3 4 5))
The repeated summing of the arguments is done by the function + and not by apply.
This is different from
(reduce #'+ '(1 2 3 4 5))
where the repeated call of the function + with two arguments is done by reduce.
What's causing the stack exhaustion?
Though recursion is often a likely culprit for stack exhaustion, that is not true in this case. According to the SBCL Internals Manual:
In full call, the arguments are passed creating a partial frame on the stack top and storing stack arguments into that frame.
Each generated list element is stored in the new stack frame, quickly exhausting the stack. Presumably, passing SBCL a larger value through --control-stack-size would increase this practical limit to the number of arguments that can be passed in a function call.
Why is call-arguments-limit so much larger than the practical limit?
An SBCL mailing list response to someone with a similar problem explains why the practical limit of the stack size isn't reflected in call-arguments-limit:
The condition you're seeing here is not due to a fundamental implementation limitation, but rather because of the particular choice of stack size that someone chose -- if the stack size were bigger, this call of yours would not error. [...] Consider that, given our strategy of passing excess arguments on the stack, that the actual maximum number of arguments passable at any given time depends on the program state, and isn't in fact a constant.
Spec says that call-arguments-limit must be a constant, so SBCL seems to have defined it as most-positive-fixnum.
There are a couple of bug reports discussing the issue, and a TODO in the source suggesting that at least one contributor feels it should be reduced to a less absurd value:
;; TODO: Reducing CALL-ARGUMENTS-LIMIT to something reasonable to
;; allow DWORD ops without it looking like a bug would make sense.
;; With a stack size of about 2MB, the limit is absurd anyway.
SBCL's particular way of implementing call-arguments-limit might have room for improvement and could lead to unexpected behaviour, but it does follow ANSI spec.
The practical limit varies depending on the space remaining on the stack, so defining call-arguments-limit according to this value would not obey the spec requirement for a constant value.

Explanation of "Lose your head" in lazy sequences

In the Clojure programming language, why does this code pass with flying colors?
(let [r (range 1e9)] [(first r) (last r)])
While this one fails:
(let [r (range 1e9)] [(last r) (first r)])
I know it is about "Losing your head" advice but would you please explain it to me? I'm not able to digest it yet.
UPDATE:
It is really hard to pick the correct answer, two answers are amazingly informative.
Note: Code snippets are from "The Joy of Clojure".
To elaborate on dfan and Rafał's answers, I've taken the time to run both expressions with the YourKit profiler.
It's fascinating to see the JVM at work. The first program is so GC-friendly that the JVM really shines at managing its memory.
I drew some charts.
GC friendly: (let [r (range 1e9)] [(first r) (last r)])
This program runs very low on memory; overall, less than 6 megabytes. As stated earlier, it is very GC friendly, it makes a lot of collections, but uses very little CPU for that.
Head holder: (let [r (range 1e9)] [(last r) (first r)])
This one is very memory hungry. It goes up to 300 MB of RAM, but that's not enough and the program does not finish (the JVM dies less than one minute later). The GC takes up to 90% of CPU time, which indicates it desperately tries to free any memory it can, but cannot find any (the objects collected are very little to none).
Edit: The second program ran out of memory, which triggered a heap dump. An analysis of this dump shows that 70% of the memory is java.lang.Integer objects, which could not be collected.
range generates elements as needed.
In the case of (let [r (range 1e9)] [(first r) (last r)]), it grabs the first element (0), then generates a billion - 2 elements, throwing them out as it goes, and then grabs the last element (999,999,999). It never has any need to keep any part of the sequence around.
In the case of (let [r (range 1e9)] [(last r) (first r)]), it generates a billion elements in order to be able to evaluate (last r), but it also has to hold on to the beginning of the list it's generating in order to later evaluate (first r). So it's not able to throw anything away as it goes, and (I presume) runs out of memory.
What really holds the head here is the binding of the sequence to r (not the already-evaluated (first r), since you cannot evaluate the whole sequence from its value.)
In the first case the binding no longer exists when (last r) is evaluated, since there are no more expressions with r to evaluate. In the second case, the existence of the not-yet-evaluated (first r) means that the evaluator needs to keep the binding to r.
To show the difference, this evaluates OK:
user> (let [r (range 1e8) a 7] [(last r) ((constantly 5) a)])
[99999999 5]
While this fails:
(let [r (range 1e8) a 7] [(last r) ((constantly 5) r)])
Even though the expression following (last r) ignores r, the evaluator is not that smart and keeps the binding to r, thus keeping the whole sequence.
Edit: I have found a post where Rich Hickey explains the details of the mechanism responsible for clearing the reference to the head in the above cases. Here it is: Rich Hickey on locals clearing
For a technical description, go to http://clojure.org/lazy. The advice is mentioned in the section Don't hang (onto) your head.

Practical use of fold/reduce in functional languages

Fold (aka reduce) is considered a very important higher order function. Map can be expressed in terms of fold. But it sounds more academic than practical to me. A typical use could be to get the sum, or product, or maximum of numbers, but these functions usually accept any number of arguments. So why write (fold + 0 '(2 3 5)) when (+ 2 3 5) works fine? My question is, in what situation is it easiest or most natural to use fold?
The point of fold is that it's more abstract. It's not that you can do things that you couldn't before, it's that you can do them more easily.
Using a fold, you can generalize any function that is defined on two elements to apply to an arbitrary number of elements. This is a win because it's usually much easier to write, test, maintain and modify a single function that applies to two arguments than one that applies to a list. And it's always easier to write, test, maintain, etc. one simple function instead of two with similar-but-not-quite functionality.
Since fold (and for that matter, map, filter, and friends) have well-defined behaviour, it's often much easier to understand code using these functions than explicit recursion.
Basically, once you have the one version, you get the other "for free". Ultimately, you end up doing less work to get the same result.
Here are a few simple examples where reduce works really well.
Find the sum of the maximum values of each sub-list
Clojure:
user=> (def x '((1 2 3) (4 5) (0 9 1)))
#'user/x
user=> (reduce #(+ %1 (apply max %2)) 0 x)
17
Racket:
> (define x '((1 2 3) (4 5) (0 9 1)))
> (foldl (lambda (a b) (+ b (apply max a))) 0 x)
17
Construct a map from a list
Clojure:
user=> (def y '(("dog" "bark") ("cat" "meow") ("pig" "oink")))
#'user/y
user=> (def z (reduce #(assoc %1 (first %2) (second %2)) {} y))
#'user/z
user=> (z "pig")
"oink"
For a more complicated Clojure example featuring reduce, check out my solution to Project Euler problems 18 & 67.
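For comparison, the same two folds could be written in Common Lisp roughly as follows. The function names are invented for this sketch, and a hash table stands in for the Clojure map.

```lisp
(defun sum-of-maxima (lists)
  "Sum the maximum value of each (non-empty) sub-list in LISTS."
  (reduce (lambda (acc sub) (+ acc (reduce #'max sub)))
          lists
          :initial-value 0))

(defun pairs->table (pairs)
  "Build an EQUAL hash table from a list of (key value) pairs."
  (reduce (lambda (table pair)
            (setf (gethash (first pair) table) (second pair))
            table)                      ; hand the table on to the next step
          pairs
          :initial-value (make-hash-table :test #'equal)))
```

Here (sum-of-maxima '((1 2 3) (4 5) (0 9 1))) returns 17, and looking up "pig" in (pairs->table '(("dog" "bark") ("cat" "meow") ("pig" "oink"))) gives "oink".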
See also: reduce vs. apply
In Common Lisp functions don't accept any number of arguments.
There is a constant defined in every Common Lisp implementation CALL-ARGUMENTS-LIMIT, which must be 50 or larger.
This means that any such portably written function should accept at least 50 arguments. But it could be just 50.
This limit exists to allow compilers to possibly use optimized calling schemes and to not provide the general case, where an unlimited number of arguments could be passed.
Thus to really process large (larger than 50 elements) lists or vectors in portable Common Lisp code, it is necessary to use iteration constructs, reduce, map, and similar. Thus it is also necessary to not use (apply '+ large-list) but use (reduce '+ large-list).
Code using fold is usually awkward to read. That's why people prefer map, filter, exists, sum, and so on—when available. These days I'm primarily writing compilers and interpreters; here are some ways I use fold:
Compute the set of free variables for a function, expression, or type
Add a function's parameters to the symbol table, e.g., for type checking
Accumulate the collection of all sensible error messages generated from a sequence of definitions
Add all the predefined classes to a Smalltalk interpreter at boot time
What all these uses have in common is that they're accumulating information about a sequence into some kind of set or dictionary. Eminently practical.
Your example (+ 2 3 4) only works because you know the number of arguments beforehand. Folds work on lists the size of which can vary.
fold/reduce is the general version of the "cdr-ing down a list" pattern. Each algorithm that's about processing every element of a sequence in order and computing some return value from that can be expressed with it. It's basically the functional version of the foreach loop.
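To make the "cdr-ing down a list" reading concrete, here is a hand-written left fold in Common Lisp. It is only an illustrative sketch (fold-left is a made-up name, not a standard function); in real code the built-in REDUCE does the same job.

```lisp
(defun fold-left (fn initial items)
  "Combine ITEMS from left to right: FN is called with the accumulator and
each element in turn, starting from INITIAL."
  (let ((acc initial))
    (dolist (item items acc)
      (setf acc (funcall fn acc item)))))
```

So (fold-left #'+ 0 '(2 3 5)) returns 10, and (fold-left (lambda (acc x) (cons x acc)) '() '(1 2 3)) returns (3 2 1), which shows the foreach-style accumulation order.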
Here's an example that nobody else mentioned yet.
By using a function with a small, well-defined interface like "fold", you can replace that implementation without breaking the programs that use it. You could, for example, make a distributed version that runs on thousands of PCs, so a sorting algorithm that used it would become a distributed sort, and so on. Your programs become more robust, simpler, and faster.
Your example is a trivial one: + already takes any number of arguments, runs quickly in little memory, and has already been written and debugged by whoever wrote your compiler. Those properties are not often true of algorithms I need to run.

Resources