I'm new to Common Lisp and ran into a performance thing that just struck me as weird. I'm checking if a number is divisible by 10 using rem in a loop. If I move the check into a function, it runs 5x slower. What would cause that?
I'm running sbcl 1.4.5 on 64 bit Ubuntu 18.04.
(defun fn (x)
(= 0 (rem x 10))
)
(defun walk-loop-local (n)
(loop for i from 1 to n do
(= 0 (rem i 10))
))
(defun walk-loop-func (n)
(loop for i from 1 to n do
(fn i)
))
(time (walk-loop-local 232792560))
(time (walk-loop-func 232792560))
I'd expect the time to be the same (and a lot faster, but that's a separate question). Instead, here's the output,
CL-USER> (load "loops.lisp")
Evaluation took:
0.931 seconds of real time
0.931389 seconds of total run time (0.931389 user, 0.000000 system)
100.00% CPU
2,414,050,454 processor cycles
0 bytes consed
Evaluation took:
4.949 seconds of real time
4.948967 seconds of total run time (4.948967 user, 0.000000 system)
100.00% CPU
12,826,853,706 processor cycles
0 bytes consed
Common Lisp allows dynamic redefinition of functions: if you redefined fn during the approx. 5 seconds of your second test, the running loop would switch to calling the new definition of fn while running. This features comes with some constraints on how to compile function calls and how to optimize them when needed.
As pointed out by RainerJoswing in comments, the above is an over-simplification, there are cases where the compiler may assume functions are not redefined (recursive functions, functions in the same file), see 3.2.2.3 Semantic Constraints, for example:
A call within a file to a named function that is defined in the same
file refers to that function, unless that function has been declared
notinline. The consequences are unspecified if functions are redefined
individually at run time or multiply defined in the same file.
A function mixes error checking and the computations you want it to perform. At function call boundaries you typically have a prologue where your inputs are checked, and an epilogue where results might be "boxed": if the compiler knows that locally a variable is always a single-float, it can use a raw representation of floats during the extent of the function, but when returning the result, it should be a valid Lisp type, which means coercing it back to a tagged value, for example.
The SBCL compiler tries to ensure the code is safe, where safe means never invoking code that has undefined behaviour in the Lisp specification. Note however that if you call fn with a string input, the code is expected to detect the type error. Unlike C, a type-error at runtime in Lisp is well-defined (as long as the declared type, which defaults to T, encompasses all possible values at runtime). And so, compiling Lisp code for safety tends to add a lot of error checking at multiple points of the program.
Optimizing code consists in removing checks that are guaranteed to be always true, eliminating dead branches in the generated code.
For example, if you consider fn alone, you can see that it has to check its input every time it is called, because it might very well be called with a string input. But when you directly inline the operation, then the index i can be statically determined to be an integer, which allows calls to = and rem to be applied without (much) error checking.
Optimization in SBCL happens because there is a static analysis which maps variables to elements of the type lattice of Lisp (and and or are basically the greatest lower bound and lowest upper bound for types, with types T and type nil at both ends). SBCL reports only errors that are sure to happen: you have an error if you call a function that accepts integers from 0 to 5 if you call it with an input that is known to always be above 5 or below zero (both sets have no intersection), but you have no warning if you call it with an integer between 2 and 10. This is safe because the compiler can defer error checking at runtime, contrary to other languages where the runtime has no sense of types (trying to warn everytime the code might have errors would result in a lot of warnings given the open-worldness of Lisp).
You can (declaim (inline fn)) in your file and then the performance will be identical to the first version. A rule of thumb is that inside a function, things are a bit more static than in the global environment: local functions cannot be redefined, local variables can have their types precisely defined, etc. You have more control about what is always true.
Note that the overhead of error checking is a problem if it is executed a lot of time (relatively to the rest of the code). If you fill a big array with single-floats and apply numerical code on it, it makes sense to use a specialized array type, like (simple-array single-float), or to declare local variables to be floats with (declare (type single-float x)), so that you don't check that each value is effectively a float. In other cases, the overhead is not high enough to spend too much time reducing it.
You are using the SBCL compiler:
(defun walk-loop-local (n)
(loop for i from 1 to n do
(= 0 (rem i 10))))
I think your code does nothing in the loop iteration. It gets optimized away, since the value of = form is not used anywhere and there are no side-effects.
Thus there is no overhead, since there is no code.
Use (disassemble #'walk-local-form) to check the compile code.
If I move the check into a function, it runs 5x slower. What would cause that?
Instead of doing nothing, in each iteration the function gets called and executes your code.
Actually measuring calling overhead
(defparameter *i* nil)
(defun walk-loop-local (n)
(loop for i from 1 to n do
(setf *i* (= 0 (rem i 10)))))
(defun fn (x)
(setf *i* (= 0 (rem x 10))))
(defun walk-loop-func (n)
(loop for i from 1 to n do
(fn i)))
In above case the code doesn't get removed.
CL-USER> (time (walk-loop-local 232792560))
Evaluation took:
5.420 seconds of real time
5.412637 seconds of total run time (5.399134 user, 0.013503 system)
99.87% CPU
6,505,078,020 processor cycles
0 bytes consed
NIL
CL-USER> (time (walk-loop-func 232792560))
Evaluation took:
6.235 seconds of real time
6.228447 seconds of total run time (6.215409 user, 0.013038 system)
99.89% CPU
7,481,974,847 processor cycles
0 bytes consed
You can see that the function call overhead isn't that large.
Every function call adds an overhead. This is what you are measuring.
You could declaim the function fn to be inline and also try modifying the compiler flags to optimize for runtime execution (opposed to debug information or safety). I'm on the phone now, but could add the hyperspecs link if needed.
BR, Eric
Related
(apply #'+ (loop for i from 1 to x collect 1))
works if x has value 253391, but fails with a (SB-KERNEL::CONTROL-STACK-EXHAUSTED-ERROR) on 253392*. This is orders of magnitude smaller than call-arguments-limit**.
Is recursion exhausting the stack? If so, is it in apply? Why hasn't it been optimized out?
*Also interesting, (apply #'max (loop for i from 1 to 253391 collect 1)) throws the error, but 253390 is fine.
**call-arguments-limit evaluates to 4611686018427387903 (with the help of format's ~R, it turns out this is four quintillion six hundred eleven quadrillion six hundred eighty-six trillion eighteen billion four hundred twenty-seven million three hundred eighty-seven thousand nine hundred three)
parameters that can be passed to a function in SBCL
You don't pass parameters. You pass arguments.
(defun foo (x y) (list x y))
x and y are parameters of the function foo.
(foo 20 22)
20 and 22 are arguments in a call of the function foo.
See the variables call-arguments-limit and lambda-parameters-limit.
SBCL and call-arguments-limit
If a function can't nearly handle the claimed number of arguments, then this looks like a bug in SBCL. You might want to report this bug. Maybe they need to change the value of call-arguments-limit.
Testing
APPLY is one way to test it.
Another:
(eval (append '(drop-params)
(loop for i from 1 to 2533911 collect 1)))
One can also use FUNCALL with a number of arguments spread out.
Why does a limit exist?
The Common Lisp standard was written to allow efficient implementations on various different computers. It was thought that some machine-level function calling implementations only support limited number of arguments. The standard says the number of supported arguments can be as low as 50. Actually some implementations have a relatively low number of supported arguments.
Thus apply in Common Lisp is not a tool for list processing, but to call functions with computed arglists.
For list and vector processing use REDUCE, instead of APPLY
If we want to sum all numbers in a list, replace
(apply #'+ list) ; don't use this
with
(reduce #'+ list) ; can handle arbitrary long lists
Recursion
apply is a non-optimized recursive function
I cannot see why the function APPLY should use recursion.
For example if you think of
(apply #'+ '(1 2 3 4 5))
The repeated summing of the arguments is done by the function + and not by apply.
This is different from
(reduce #'+ '(1 2 3 4 5))
where the repeated call of the function + with two arguments is done by reduce.
What's causing the stack exhaustion?
Though recursion is often a likely culprit for stack exhaustion, that is not true in this case. According to the SBCL Internals Manual:
In full call, the arguments are passed creating a partial frame on the stack top and storing stack arguments into that frame.
Each generated list element is stored in the new stack frame, quickly exhausting the stack. Presumably, passing SBCL a larger value through –control-stack-size would increase this practical limit to the number of arguments that can be passed in a function call.
Why is call-arguments-limit so much larger than the practical limit?
An SBCL mailing list response to someone with a similar problem explains why the practical limit of the stack size isn't reflected in call-arguments-limit:
The condition you're seeing here is not due to a fundamental implementation limitation, but rather because of the particular choice of stack size that someone chose -- if the stack size were bigger, this call of yours would not error. [...] Consider that, given our strategy of passing excess arguments on the stack, that the actual maximum number of arguments passable at any given time depends on the program state, and isn't in fact a constant.
Spec says that call-arguments-limit must be a constant, so SBCL seems to have defined it as most-positive-fixnum.
There are and a couple of bug reports discussing the issue, and a TODO in the source suggesting that at least one contributor feels it should be reduced to a less absurd value:
;; TODO: Reducing CALL-ARGUMENTS-LIMIT to something reasonable to
;; allow DWORD ops without it looking like a bug would make sense.
;; With a stack size of about 2MB, the limit is absurd anyway.
SBCL's particular way of implementing call-arguments-limit might have room for improvement and could lead to unexpected behaviour, but it does follow ANSI spec.
The practical limit varies depending on the space remaining on the stack, so defining call-arguments-limit according to this value would not obey the spec requirement for a constant value.
Is it possible to use (declare (type ...)) declarations in functions but also perform type-checking on the function arguments, in order to produce faster but still safe code?
For instance,
(defun add (x y)
(declare (type fixnum x y))
(the fixnum x y))
when called as (add 1 "a") would result in undefined behaviour, so preferably I'd like to modify it as
(defun add (x y)
(declare (type fixnum x y))
(check-type x fixnum)
(check-type y fixnum)
(the fixnum x y))
but I worry that the compiler is allowed to assume that the check-type always passes and thus omit the check.
So my question is, is the above example wrong as I expect it, and secondly, is there any common idiom* in use to achieve type-safety with optimised code?
*) I can imagine, for instance, using an optimised lambda and calling that after doing the type-checking, but I wonder if that's the most elegant way.
You can always check the types first and then enter optimized code:
(defun foo (x)
(check-type x fixnum)
(locally
(declare (fixnum x)
(optimize (safety 0)))
x))
The LOCALLY is used for local declarations.
Since you are asking:
Is it possible to use (declare (type ...)) declarations in functions but also perform type-checking on the function arguments, in order to produce faster but still safe code?
it seems to me that you are missing an important point about the type system of Common Lisp, that is that the language is not (and cannot-be) statically typed. Let's clarify this aspect.
Programming languages can be roughly classified in three broad categories:
with static type-checking: every expression or statement is
checked for type-correctness at compile time, so that type errors
can be detected during program development, and the code is more
efficient since no check for types must be done at run-time;
with dynamic type checking: every operation is checked at run time
for type correctness, so that no type-error can occur at run-time;
without type checking: type-errors can occur at run-time so that the
program can stop for error or have an undefined behaviour.
Edited
With the respect to the previous classification, the Common Lisp specification left to the implementations the burden of deciding if they want to follow the second or the third approach! Not only, but through the optimize declaration the specification lets the implementations free to change this dinamically, in the same program.
So, most implementations, at the initial optimization and safety levels, implements the second approach, enriched with the following two possibilities:
one can request to the compiler to omit run time type-checking when compiling some piece of code, typically for efficiency reasons, so that, inside that particular piece of code, and depending on the optimization and safety settings, the language can behave like the languages of the third category: this could be supported by hints through type declarations, like (declare (type fixnum x)) for variables and (the fixnum (f x)) for values;
one can insert into the code explicit type-checking tests to be performed at run-time, through check-type, so that an eventual difference in the type of the value checked will cause a “correctable error”.
Note, moreover, that different compilers can behave differently in checking types at compile times, but they can never reach the level of compilers for languages with static type checking because Common Lisp is a highly dynamical language. Consider for instance this simple case:
(defun plus1(x) (1+ x))
(defun read-and-apply-plus1()
(plus1 (read)))
in which no static type-checking can be done for the call (plus1 (read)), since the type of (read) is not known at compile time.
I am making my own Lisp-like interpreted language, and I want to do tail call optimization. I want to free my interpreter from the C stack so I can manage my own jumps from function to function and my own stack magic to achieve TCO. (I really don't mean stackless per se, just the fact that calls don't add frames to the C stack. I would like to use a stack of my own that does not grow with tail calls). Like Stackless Python, and unlike Ruby or... standard Python I guess.
But, as my language is a Lisp derivative, all evaluation of s-expressions is currently done recursively (because it's the most obvious way I thought of to do this nonlinear, highly hierarchical process). I have an eval function, which calls a Lambda::apply function every time it encounters a function call. The apply function then calls eval to execute the body of the function, and so on. Mutual stack-hungry non-tail C recursion. The only iterative part I currently use is to eval a body of sequential s-expressions.
(defun f (x y)
(a x y)) ; tail call! goto instead of call.
; (do not grow the stack, keep return addr)
(defun a (x y)
(+ x y))
; ...
(print (f 1 2)) ; how does the return work here? how does it know it's supposed to
; return the value here to be used by print, and how does it know
; how to continue execution here??
So, how do I avoid using C recursion? Or can I use some kind of goto that jumps across c functions? longjmp, perhaps? I really don't know. Please bear with me, I am mostly self- (Internet- ) taught in programming.
One solution is what is sometimes called "trampolined style". The trampoline is a top-level loop that dispatches to small functions that do some small step of computation before returning.
I've sat here for nearly half an hour trying to contrive a good, short example. Unfortunately, I have to do the unhelpful thing and send you to a link:
http://en.wikisource.org/wiki/Scheme:_An_Interpreter_for_Extended_Lambda_Calculus/Section_5
The paper is called "Scheme: An Interpreter for Extended Lambda Calculus", and section 5 implements a working scheme interpreter in an outdated dialect of Lisp. The secret is in how they use the **CLINK** instead of a stack. The other globals are used to pass data around between the implementation functions like the registers of a CPU. I would ignore **QUEUE**, **TICK**, and **PROCESS**, since those deal with threading and fake interrupts. **EVLIS** and **UNEVLIS** are, specifically, used to evaluate function arguments. Unevaluated args are stored in **UNEVLIS**, until they are evaluated and out into **EVLIS**.
Functions to pay attention to, with some small notes:
MLOOP: MLOOP is the main loop of the interpreter, or "trampoline". Ignoring **TICK**, its only job is to call whatever function is in **PC**. Over and over and over.
SAVEUP: SAVEUP conses all the registers onto the **CLINK**, which is basically the same as when C saves the registers to the stack before a function call. The **CLINK** is actually a "continuation" for the interpreter. (A continuation is just the state of a computation. A saved stack frame is technically continuation, too. Hence, some Lisps save the stack to the heap to implement call/cc.)
RESTORE: RESTORE restores the "registers" as they were saved in the **CLINK**. It's similar to restoring a stack frame in a stack-based language. So, it's basically "return", except some function has explicitly stuck the return value into **VALUE**. (**VALUE** is obviously not clobbered by RESTORE.) Also note that RESTORE doesn't always have to return to a calling function. Some functions will actually SAVEUP a whole new computation, which RESTORE will happily "restore".
AEVAL: AEVAL is the EVAL function.
EVLIS: EVLIS exists to evaluate a function's arguments, and apply a function to those args. To avoid recursion, it SAVEUPs EVLIS-1. EVLIS-1 would just be regular old code after the function application if the code was written recursively. However, to avoid recursion, and the stack, it is a separate "continuation".
I hope I've been of some help. I just wish my answer (and link) was shorter.
What you're looking for is called continuation-passing style. This style adds an additional item to each function call (you could think of it as a parameter, if you like), that designates the next bit of code to run (the continuation k can be thought of as a function that takes a single parameter). For example you can rewrite your example in CPS like this:
(defun f (x y k)
(a x y k))
(defun a (x y k)
(+ x y k))
(f 1 2 print)
The implementation of + will compute the sum of x and y, then pass the result to k sort of like (k sum).
Your main interpreter loop then doesn't need to be recursive at all. It will, in a loop, apply each function application one after another, passing the continuation around.
It takes a little bit of work to wrap your head around this. I recommend some reading materials such as the excellent SICP.
Tail recursion can be thought of as reusing for the callee the same stack frame that you are currently using for the caller. So you could just re-set the arguments and goto to the beginning of the function.
Fold (aka reduce) is considered a very important higher order function. Map can be expressed in terms of fold (see here). But it sounds more academical than practical to me. A typical use could be to get the sum, or product, or maximum of numbers, but these functions usually accept any number of arguments. So why write (fold + 0 '(2 3 5)) when (+ 2 3 5) works fine. My question is, in what situation is it easiest or most natural to use fold?
The point of fold is that it's more abstract. It's not that you can do things that you couldn't before, it's that you can do them more easily.
Using a fold, you can generalize any function that is defined on two elements to apply to an arbitrary number of elements. This is a win because it's usually much easier to write, test, maintain and modify a single function that applies two arguments than to a list. And it's always easier to write, test, maintain, etc. one simple function instead of two with similar-but-not-quite functionality.
Since fold (and for that matter, map, filter, and friends) have well-defined behaviour, it's often much easier to understand code using these functions than explicit recursion.
Basically, once you have the one version, you get the other "for free". Ultimately, you end up doing less work to get the same result.
Here are a few simple examples where reduce works really well.
Find the sum of the maximum values of each sub-list
Clojure:
user=> (def x '((1 2 3) (4 5) (0 9 1)))
#'user/x
user=> (reduce #(+ %1 (apply max %2)) 0 x)
17
Racket:
> (define x '((1 2 3) (4 5) (0 9 1)))
> (foldl (lambda (a b) (+ b (apply max a))) 0 x)
17
Construct a map from a list
Clojure:
user=> (def y '(("dog" "bark") ("cat" "meow") ("pig" "oink")))
#'user/y
user=> (def z (reduce #(assoc %1 (first %2) (second %2)) {} y))
#'user/z
user=> (z "pig")
"oink"
For a more complicated clojure example featuring reduce, check out my solution to Project Euler problems 18 & 67.
See also: reduce vs. apply
In Common Lisp functions don't accept any number of arguments.
There is a constant defined in every Common Lisp implementation CALL-ARGUMENTS-LIMIT, which must be 50 or larger.
This means that any such portably written function should accept at least 50 arguments. But it could be just 50.
This limit exists to allow compilers to possibly use optimized calling schemes and to not provide the general case, where an unlimited number of arguments could be passed.
Thus to really process large (larger than 50 elements) lists or vectors in portable Common Lisp code, it is necessary to use iteration constructs, reduce, map, and similar. Thus it is also necessary to not use (apply '+ large-list) but use (reduce '+ large-list).
Code using fold is usually awkward to read. That's why people prefer map, filter, exists, sum, and so on—when available. These days I'm primarily writing compilers and interpreters; here's some ways I use fold:
Compute the set of free variables for a function, expression, or type
Add a function's parameters to the symbol table, e.g., for type checking
Accumulate the collection of all sensible error messages generated from a sequence of definitions
Add all the predefined classes to a Smalltalk interpreter at boot time
What all these uses have in common is that they're accumulating information about a sequence into some kind of set or dictionary. Eminently practical.
Your example (+ 2 3 4) only works because you know the number of arguments beforehand. Folds work on lists the size of which can vary.
fold/reduce is the general version of the "cdr-ing down a list" pattern. Each algorithm that's about processing every element of a sequence in order and computing some return value from that can be expressed with it. It's basically the functional version of the foreach loop.
Here's an example that nobody else mentioned yet.
By using a function with a small, well-defined interface like "fold", you can replace that implementation without breaking the programs that use it. You could, for example, make a distributed version that runs on thousands of PCs, so a sorting algorithm that used it would become a distributed sort, and so on. Your programs become more robust, simpler, and faster.
Your example is a trivial one: + already takes any number of arguments, runs quickly in little memory, and has already been written and debugged by whoever wrote your compiler. Those properties are not often true of algorithms I need to run.
The function:
Given a list lst return all permutations of the list's contents of exactly length k, which defaults to length of list if not provided.
(defun permute (lst &optional (k (length lst)))
(if (= k 1)
(mapcar #'list lst)
(loop for item in lst nconcing
(mapcar (lambda (x) (cons item x))
(permute (remove-if (lambda (x) (eq x item)) lst)
(1- k))))))
The problem:
I'm using SLIME in emacs connected to sbcl, I haven't done too much customization yet. The function works fine on smaller inputs like lst = '(1 2 3 4 5 6 7 8) k = 3 which is what it will mostly be used for in practice. However when I Call it with a large input twice in a row the second call never returns and sbcl does not even show up on top. These are the results at the REPL:
CL-USER> (time (nth (1- 1000000) (permute '(0 1 2 3 4 5 6 7 8 9))))
Evaluation took:
12.263 seconds of real time
12.166150 seconds of total run time (10.705372 user, 1.460778 system)
[ Run times consist of 9.331 seconds GC time, and 2.836 seconds non-GC time. ]
99.21% CPU
27,105,349,193 processor cycles
930,080,016 bytes consed
(2 7 8 3 9 1 5 4 6 0)
CL-USER> (time (nth (1- 1000000) (permute '(0 1 2 3 4 5 6 7 8 9))))
And it never comes back from the second call. I can only guess that for some reason I'm doing something horrible to the garbage collector but I can't see what. Does anyone have any ideas?
One thing that's wrong in your code is your use of EQ. EQ compares for identity.
EQ is not for comparing numbers. EQ of two numbers can be true or false.
Use EQL if you want to compare by identity, numbers by value or characters. Not EQ.
Actually
(remove-if (lambda (x) (eql x item)) list)
is just
(remove item list)
For your code the EQ bug COULD mean that permute gets called in the recursive call without actually a number removed from the list.
Other than that, I think SBCL is just busy with memory management. SBCL on my Mac acquired lots of memory (more than a GB) and was busy doing something. After some time the result was computed.
Your recursive function generates huge amount of 'garbage'. LispWorks says: 1360950192 bytes
Maybe you can come up with a more efficient implementation?
Update: garbage
Lisp provides some automatic memory management, but that does not free the programmer from thinking about space effects.
Lisp uses both a stack and the heap to allocate memory. The heap maybe structured in certain ways for the GC - for example in generations, half spaces and/or areas. There are precise garbage collectors and 'conservative' garbage collectors (used by SBCL on Intel machines).
When a program runs we can see various effects:
normal recursive procedures allocate space on the stack. Problem: the stack size is usually fixed (even though some Lisps can increase it in an error handler).
a program may allocate huge amount of memory and return a large result. PERMUTE is such a function. It can return very large lists.
a program may allocate memory and use it temporarily and then the garbage collector can recycle it. The rate of creation and destruction can be very high, even though the program does not use a large amount of fixed memory.
There are more problems, though. But for each of the above the Lisp programmer (and every other programmer using a language implementation with garbage collection) has to learn how to deal with that.
Replace recursion with iteration. Replace recursion with tail recursion.
Return only the part of the result that is needed and don't generate the full solution. If you need the n-th permutation, then compute that and not all permutations. Use lazy datastructures that are computed on demand. Use something like SERIES, which allows to use stream-like, but efficient, computation. See SICP, PAIP, and other advanced Lisp books.
Reuse memory with a resource manager. Reuse buffers instead of allocating objects all the time. Use an efficient garbage collector with special support for collecting ephemeral (short-lived) objects. Sometimes it also may help to destructively modify objects, instead of allocating new objects.
Above deals with the space problems of real programs. Ideally our compilers or runtime infrastructure may provide some automatic support to deal with these problems. But in reality this does not really work. Most Lisp systems provide low-level functionality to deal with this and Lisp provides mutable objects - because the experience of real-world Lisp programs has shown that programmers do want to use them to optimize their programs. If you have a large CAD application that computes the form of turbine blades, then theoretical/puristic views about non-mutable memory simply does not apply - the developer wants the faster/smaller code and the smaller runtime footprint.
SBCL on most platforms uses generational garbage collector, which means that allocated memory which survives more than some number of collections will be more rarely considered for collection. Your algorithm for the given test case generates so much garbage that it triggers GC so many times that the actual results, which obviously have to survive the entire function runtime, are tenured, that is, moved to a final generation which is collected either very rarely or not at all. Therefore, the second run will, on standard settings for 32-bit systems, run out of heap (512 MB, can be increased with runtime options).
Tenured data can be garbage collected by manually triggering the collector with (sb-ext:gc :full t). This is obviously implementation dependent.
From the looks of the output, you're looking at the slime-repl, right?
Try changing to the "*inferior-lisp*" buffer, you'll probably see that SBCL has dropped down to the ldb (built-in low-level debugger). Most probably, you've managed to blow the call-stack.