My application, which turns given data into a tree representation, is using way too much memory. As it manages to turn around 200-300MB of memory into roughly 3GB before crashing.
I now want to figure out where the leak is, which part of the program causes that.
Therefore I do now wonder what the most common and efficient technique for memory-profiling in common-lisp, using sbcl is?
I already looked at (room) and (time) but its output is way to verbose, all I need is a wrapper which will say "After the execution the overall memory usage was +1000Byte", this would do the deal as I just want to know where the memory is used. Another criteria is that it has to work "on the fly" as the application will most likely crash, due to no remaining RAM.
Something looking like this:
(dotimes (i 4)
(profiler-wrapper :messg "After execution memory ~a~%" (execute-me i) ))
After execution memory +100Mb
After execution memory +100Mb
After execution memory +100Mb
After execution memory +100Mb
NIL
Try the deterministic profiler, which is available in Slime through the slime-profile* commands or in the REPL. It does the wrapping for you and can report CPU and memory usage per function, at the expense of some overhead for the wrapping.
SBCL also comes with a statistical profiler that has less overhead and is more helpful if long-running processes need to be analysed and little or no prior information about the problem areas exist.
I wrote my own version of such an macro, it is most likely not elegant and it does slow the code significantly due to (sb-ext:gc :full t) but it does give some perspective of the memory usage after the execution of a given body.
(defparameter *last-profile-step* 0)
(defmacro profile-it ((name gc-on) &body body)
`(let ((*last-profile-step* ,(if gc-on
`(progn
(sb-ext:gc :full t)
(sb-kernel::dynamic-usage))
`(sb-kernel::dynamic-usage))))
(unwind-protect
(progn
,#body)
(progn
(FORMAT t "After execution of ~a : ~a byte~%" ,name
(- ,(if gc-on
`(progn
(sb-ext:gc :full t)
(sb-kernel::dynamic-usage))
`(sb-kernel::dynamic-usage))
*last-profile-step* ))))))
Related
I'm new to Common Lisp and ran into a performance thing that just struck me as weird. I'm checking if a number is divisible by 10 using rem in a loop. If I move the check into a function, it runs 5x slower. What would cause that?
I'm running sbcl 1.4.5 on 64 bit Ubuntu 18.04.
(defun fn (x)
(= 0 (rem x 10))
)
(defun walk-loop-local (n)
(loop for i from 1 to n do
(= 0 (rem i 10))
))
(defun walk-loop-func (n)
(loop for i from 1 to n do
(fn i)
))
(time (walk-loop-local 232792560))
(time (walk-loop-func 232792560))
I'd expect the time to be the same (and a lot faster, but that's a separate question). Instead, here's the output,
CL-USER> (load "loops.lisp")
Evaluation took:
0.931 seconds of real time
0.931389 seconds of total run time (0.931389 user, 0.000000 system)
100.00% CPU
2,414,050,454 processor cycles
0 bytes consed
Evaluation took:
4.949 seconds of real time
4.948967 seconds of total run time (4.948967 user, 0.000000 system)
100.00% CPU
12,826,853,706 processor cycles
0 bytes consed
Common Lisp allows dynamic redefinition of functions: if you redefined fn during the approx. 5 seconds of your second test, the running loop would switch to calling the new definition of fn while running. This features comes with some constraints on how to compile function calls and how to optimize them when needed.
As pointed out by RainerJoswing in comments, the above is an over-simplification, there are cases where the compiler may assume functions are not redefined (recursive functions, functions in the same file), see 3.2.2.3 Semantic Constraints, for example:
A call within a file to a named function that is defined in the same
file refers to that function, unless that function has been declared
notinline. The consequences are unspecified if functions are redefined
individually at run time or multiply defined in the same file.
A function mixes error checking and the computations you want it to perform. At function call boundaries you typically have a prologue where your inputs are checked, and an epilogue where results might be "boxed": if the compiler knows that locally a variable is always a single-float, it can use a raw representation of floats during the extent of the function, but when returning the result, it should be a valid Lisp type, which means coercing it back to a tagged value, for example.
The SBCL compiler tries to ensure the code is safe, where safe means never invoking code that has undefined behaviour in the Lisp specification. Note however that if you call fn with a string input, the code is expected to detect the type error. Unlike C, a type-error at runtime in Lisp is well-defined (as long as the declared type, which defaults to T, encompasses all possible values at runtime). And so, compiling Lisp code for safety tends to add a lot of error checking at multiple points of the program.
Optimizing code consists in removing checks that are guaranteed to be always true, eliminating dead branches in the generated code.
For example, if you consider fn alone, you can see that it has to check its input every time it is called, because it might very well be called with a string input. But when you directly inline the operation, then the index i can be statically determined to be an integer, which allows calls to = and rem to be applied without (much) error checking.
Optimization in SBCL happens because there is a static analysis which maps variables to elements of the type lattice of Lisp (and and or are basically the greatest lower bound and lowest upper bound for types, with types T and type nil at both ends). SBCL reports only errors that are sure to happen: you have an error if you call a function that accepts integers from 0 to 5 if you call it with an input that is known to always be above 5 or below zero (both sets have no intersection), but you have no warning if you call it with an integer between 2 and 10. This is safe because the compiler can defer error checking at runtime, contrary to other languages where the runtime has no sense of types (trying to warn everytime the code might have errors would result in a lot of warnings given the open-worldness of Lisp).
You can (declaim (inline fn)) in your file and then the performance will be identical to the first version. A rule of thumb is that inside a function, things are a bit more static than in the global environment: local functions cannot be redefined, local variables can have their types precisely defined, etc. You have more control about what is always true.
Note that the overhead of error checking is a problem if it is executed a lot of time (relatively to the rest of the code). If you fill a big array with single-floats and apply numerical code on it, it makes sense to use a specialized array type, like (simple-array single-float), or to declare local variables to be floats with (declare (type single-float x)), so that you don't check that each value is effectively a float. In other cases, the overhead is not high enough to spend too much time reducing it.
You are using the SBCL compiler:
(defun walk-loop-local (n)
(loop for i from 1 to n do
(= 0 (rem i 10))))
I think your code does nothing in the loop iteration. It gets optimized away, since the value of = form is not used anywhere and there are no side-effects.
Thus there is no overhead, since there is no code.
Use (disassemble #'walk-local-form) to check the compile code.
If I move the check into a function, it runs 5x slower. What would cause that?
Instead of doing nothing, in each iteration the function gets called and executes your code.
Actually measuring calling overhead
(defparameter *i* nil)
(defun walk-loop-local (n)
(loop for i from 1 to n do
(setf *i* (= 0 (rem i 10)))))
(defun fn (x)
(setf *i* (= 0 (rem x 10))))
(defun walk-loop-func (n)
(loop for i from 1 to n do
(fn i)))
In above case the code doesn't get removed.
CL-USER> (time (walk-loop-local 232792560))
Evaluation took:
5.420 seconds of real time
5.412637 seconds of total run time (5.399134 user, 0.013503 system)
99.87% CPU
6,505,078,020 processor cycles
0 bytes consed
NIL
CL-USER> (time (walk-loop-func 232792560))
Evaluation took:
6.235 seconds of real time
6.228447 seconds of total run time (6.215409 user, 0.013038 system)
99.89% CPU
7,481,974,847 processor cycles
0 bytes consed
You can see that the function call overhead isn't that large.
Every function call adds an overhead. This is what you are measuring.
You could declaim the function fn to be inline and also try modifying the compiler flags to optimize for runtime execution (opposed to debug information or safety). I'm on the phone now, but could add the hyperspecs link if needed.
BR, Eric
The following paragraph is from The Racket Guide (2.3.4):
At the same time, recursion does not lead to particularly bad
performance in Racket, and there is no such thing as stack overflow;
you can run out of memory if a computation involves too much context,
but exhausting memory typically requires orders of magnitude deeper
recursion than would trigger a stack overflow in other languages.
I'm curious about how Racket was designed to avoid stack overflow? What's more, why other languages like C cannot avoid such a problem?
First, some terminology: making a non-tail call requires a context frame to store local variables, parent return address, etc. So the question is how to represent an arbitrarily large context. "The stack" (or call stack) is just one (admittedly common) implementation strategy for the context.
Here are a few implementation strategies for deep recursion (ie, large contexts):
Allocate context frames on the heap and let the GC be responsible for cleaning them up. (This is nice and simple but probably relatively slow, although people would argue that point.)
Allocate context frames on the stack. When the stack is full, copy the frames currently on the stack into the heap, clear the stack, and reset the stack pointer to the bottom. When returning, if the stack is empty, copy frames from the heap back to the stack. (This means you can't have pointers to stack-allocated objects, though, since the objects get moved around.)
Allocate context frames on the stack. When the stack is full, allocate a new big chunk of memory, call that the new stack (ie set the stack pointer), and keep going. (This might require mprotect or other operations to convince the OS that the new block of memory is okay to treat as a call stack.)
Allocate context frames on the stack. When the stack is full, make a new thread to continue the computation, and wait for the thread to finish and arrange to grab a return value from it to return to the old thread's stack. (This strategy can be useful on platforms like the JVM that don't let you directly control the stack, stack pointer, etc. On the other hand, it complicates features like thread-local storage, etc.)
... and more variations on the strategies above.
Support for deep recursion often coincides with support for first-class continuations. In general, implementing first-class continuations means you almost automatically get support for deep recursion. There's a nice paper called Implementation Strategies for First-class Continuations by Will Clinger et al. with more detail and comparisons between different strategies.
There are two pieces to this answer.
First, in Racket and other functional languages, tail calls do not create additional stack frames. That is, a loop such as
(define (f x) (f x))
... can run forever without using any stack space at all. Many non-functional languages don't prioritize function calling in the same way as functional languages, and therefore aren't properly tail-calling.
HOWEVER, the comment that you're referring to isn't just limited to tail-calling; Racket allows very deeply nested stack frames.
Your question is a good one: why don't other languages allow deeply nested stack frames? I wrote a short test, and it looks like C unceremoniously dumps core at a depth of between 262,000 and 263,000. I wrote a simple racket test that does the same thing (being careful to ensure the recursive call was not in tail position), and I interrupted it at a depth of 48,000,000 without any apparent ill effects (except, presumably, a fairly large runtime stack).
To answer your question directly, there's no reason that I'm aware of that C couldn't allow much much more deeply nested stacks, but I think that for most C programmers, a recursion depth of 262K is plenty.
Not for us, though!
Here's my C code:
#include <stdio.h>
int f(int depth){
if ((depth % 1000) == 0) {
printf("%d\n",depth);
}
return f(depth+1);
}
int main() {
return f(0);
}
... and my racket code:
#lang racket
(define (f depth)
(when (= (modulo depth 1000) 0)
(printf "~v\n" depth))
(f (add1 depth))
(printf "this will never print..."))
(f 0)
EDIT: here's the version that uses randomness on the way out to stymie possible optimizations:
#lang racket
(define (f depth)
(when (= (modulo depth 1000000) 0)
(printf "~v\n" depth))
(when (< depth 50000000)
(f (add1 depth)))
(when (< (random) (/ 1.0 100000))
(printf "X")))
(f 0)
Also, my observations of the process size are consistent with a stack frame of about 16 bytes, plus or minus; 50M * 16 bytes = 800 Megabytes, and the observed size of the stack is about 1.2 Gigabytes.
A few questions here, regarding letcc that is used in The Seasoned Schemer.
(define (intersect-all sets)
(letcc hop
(letrec
((A (lambda (sets)
(cond
((null? (car sets)) (hop '())
((null? (cdr sets)) (car sets))
(else
(intersect (car sets)
(A (cdr sets)))))))
; definition of intersect removed for brevity
(cond
((null? sets) '())
(else (A sets))))))
I think I understand what letcc achieves, and that is basically something like catch and throw in ruby (and seemingly CL), which basically means a whole block of code can be cut short by calling whatever the named letcc is. This feels like the least "functional" thing I've come across in this short series of books and it makes me feel a bit hesitant to use it, as I want to learn a good functional style. Am I just misunderstanding letcc, or is it not really a functional programming concept and only exists to improve performance? The whole idea that I can be in the middle of some routine and then suddenly get to another point in the code feels a bit wrong... like abusing try/catch in Java for program flow.
letcc doesn't seem to exist in the version of guile (1.8.7) I have installed in OS X. Is there another name for it that I should be looking for in guile?
If I'm misunderstanding letcc by comparing it with try/catch in Java, or catch/throw in ruby (which is not exception handling, just to be clear, for the non-rubyists), how exactly does it work, at the functional level? Can it be expressed in a longer, more complex way, that convinces me it is functional after all?
"Functional" has several meanings, but no popular meaning contradicts continuations in any way. But they can be abused into creating code that is hard to read. They're not tools that can be "abused for program flow" -- they are program flow tools.
Can't help you there. I know that there was semi-recent on continuations in Guile, but I don't know where things stand. It should definitely have call-with-current-continuation, usually also under the more friendly name of call/cc, and let/cc is a simple macro that can be built with call/cc.
I can tell you that in Racket there is a let/cc builtin together with a bunch of others builtins in the same family, and additionally there's a whole library of various control operators (with an extensive list of references.).
Simple uses of let/cc are indeed similar to a catch/throw kind of thing -- more specifically, such continuations are commonly known as "escape continuations" (or sometimes "upward"). This is the kind of use that you have in that code, and that is often used to implement an abort or a return.
But continuations in Scheme are things that can be used in any place. For a very simple example that shows this difference, try this:
(define (foo f) (f 100))
(let/cc k (+ (foo k) "junk") (more junk))
Finally, if you want to read more about continuations, you can see the relevant parts of PLAI, there's also a more brief by-example overview that Matthew Might wrote, and you can see some class notes that I wrote that are based on PLAI, with some examples that were inspired by the latter post.
In Clojure programming language, why this code passes with flying colors?
(let [r (range 1e9)] [(first r) (last r)])
While this one fails:
(let [r (range 1e9)] [(last r) (first r)])
I know it is about "Losing your head" advice but would you please explain it to me? I'm not able to digest it yet.
UPDATE:
It is really hard to pick the correct answer, two answers are amazingly informative.
Note: Code snippets are from "The Joy of Clojure".
To elaborate on dfan and RafaĆ's answers, I've taken the time to run both expressions with the YourKit profiler.
It's fascinating to see the JVM at work. The first program is so GC-friendly that the JVM really shines at managing its memory.
I drew some charts.
GC friendly: (let [r (range 1e9)] [(first r) (last r)])
This program runs very low on memory; overall, less than 6 megabytes. As stated earlier, it is very GC friendly, it makes a lot of collections, but uses very little CPU for that.
Head holder: (let [r (range 1e9)] [(last r) (first r)])
This one is very memory hungry. It goes up to 300 MB of RAM, but that's not enough and the program does not finish (the JVM dies less than one minute later). The GC takes up to 90% of CPU time, which indicates it desperately tries to free any memory it can, but cannot find any (the objects collected are very little to none).
Edit The second program ran out of memory, which triggered a heap dump. An analysis of this dump shows that 70% of the memory is java.lang.Integer objects, which could not be collected. Here's another screenshot:
range generates elements as needed.
In the case of (let [r (range 1e9)] [(first r) (last r)]), it grabs the first element (0), then generates a billion - 2 elements, throwing them out as it goes, and then grabs the last element (999,999,999). It never has any need to keep any part of the sequence around.
In the case of (let [r (range 1e9)] [(last r) (first r)]), it generates a billion elements in order to be able to evaluate (last r), but it also has to hold on to the beginning of the list it's generating in order to later evaluate (first r). So it's not able to throw anything away as it goes, and (I presume) runs out of memory.
What really holds the head here is the binding of the sequence to r (not the already-evaluated (first r), since you cannot evaluate the whole sequence from its value.)
In the first case the binding no longer exists when (last r) is evaluated, since there are no more expressions with r to evaluate. In the second case, the existence of the not-yet-evaluated (first r) means that the evaluator needs to keep the binding to r.
To show the difference, this evaluates OK:
user> (let [r (range 1e8) a 7] [(last r) ((constantly 5) a)])
[99999999 5]
While this fails:
(let [r (range 1e8) a 7] [(last r) ((constantly 5) r)])
Even though the expression following (last r) ignores r, the evaluator is not that smart and keeps the binding to r, thus keeping the whole sequence.
Edit: I have found a post where Rich Hickey explains the details of the mechanism responsible for clearing the reference to the head in the above cases. Here it is: Rich Hickey on locals clearing
for a technical description, go to http://clojure.org/lazy. the advice is mentioned in the section Don't hang (onto) your head
The function:
Given a list lst return all permutations of the list's contents of exactly length k, which defaults to length of list if not provided.
(defun permute (lst &optional (k (length lst)))
(if (= k 1)
(mapcar #'list lst)
(loop for item in lst nconcing
(mapcar (lambda (x) (cons item x))
(permute (remove-if (lambda (x) (eq x item)) lst)
(1- k))))))
The problem:
I'm using SLIME in emacs connected to sbcl, I haven't done too much customization yet. The function works fine on smaller inputs like lst = '(1 2 3 4 5 6 7 8) k = 3 which is what it will mostly be used for in practice. However when I Call it with a large input twice in a row the second call never returns and sbcl does not even show up on top. These are the results at the REPL:
CL-USER> (time (nth (1- 1000000) (permute '(0 1 2 3 4 5 6 7 8 9))))
Evaluation took:
12.263 seconds of real time
12.166150 seconds of total run time (10.705372 user, 1.460778 system)
[ Run times consist of 9.331 seconds GC time, and 2.836 seconds non-GC time. ]
99.21% CPU
27,105,349,193 processor cycles
930,080,016 bytes consed
(2 7 8 3 9 1 5 4 6 0)
CL-USER> (time (nth (1- 1000000) (permute '(0 1 2 3 4 5 6 7 8 9))))
And it never comes back from the second call. I can only guess that for some reason I'm doing something horrible to the garbage collector but I can't see what. Does anyone have any ideas?
One thing that's wrong in your code is your use of EQ. EQ compares for identity.
EQ is not for comparing numbers. EQ of two numbers can be true or false.
Use EQL if you want to compare by identity, numbers by value or characters. Not EQ.
Actually
(remove-if (lambda (x) (eql x item)) list)
is just
(remove item list)
For your code the EQ bug COULD mean that permute gets called in the recursive call without actually a number removed from the list.
Other than that, I think SBCL is just busy with memory management. SBCL on my Mac acquired lots of memory (more than a GB) and was busy doing something. After some time the result was computed.
Your recursive function generates huge amount of 'garbage'. LispWorks says: 1360950192 bytes
Maybe you can come up with a more efficient implementation?
Update: garbage
Lisp provides some automatic memory management, but that does not free the programmer from thinking about space effects.
Lisp uses both a stack and the heap to allocate memory. The heap maybe structured in certain ways for the GC - for example in generations, half spaces and/or areas. There are precise garbage collectors and 'conservative' garbage collectors (used by SBCL on Intel machines).
When a program runs we can see various effects:
normal recursive procedures allocate space on the stack. Problem: the stack size is usually fixed (even though some Lisps can increase it in an error handler).
a program may allocate huge amount of memory and return a large result. PERMUTE is such a function. It can return very large lists.
a program may allocate memory and use it temporarily and then the garbage collector can recycle it. The rate of creation and destruction can be very high, even though the program does not use a large amount of fixed memory.
There are more problems, though. But for each of the above the Lisp programmer (and every other programmer using a language implementation with garbage collection) has to learn how to deal with that.
Replace recursion with iteration. Replace recursion with tail recursion.
Return only the part of the result that is needed and don't generate the full solution. If you need the n-th permutation, then compute that and not all permutations. Use lazy datastructures that are computed on demand. Use something like SERIES, which allows to use stream-like, but efficient, computation. See SICP, PAIP, and other advanced Lisp books.
Reuse memory with a resource manager. Reuse buffers instead of allocating objects all the time. Use an efficient garbage collector with special support for collecting ephemeral (short-lived) objects. Sometimes it also may help to destructively modify objects, instead of allocating new objects.
Above deals with the space problems of real programs. Ideally our compilers or runtime infrastructure may provide some automatic support to deal with these problems. But in reality this does not really work. Most Lisp systems provide low-level functionality to deal with this and Lisp provides mutable objects - because the experience of real-world Lisp programs has shown that programmers do want to use them to optimize their programs. If you have a large CAD application that computes the form of turbine blades, then theoretical/puristic views about non-mutable memory simply does not apply - the developer wants the faster/smaller code and the smaller runtime footprint.
SBCL on most platforms uses generational garbage collector, which means that allocated memory which survives more than some number of collections will be more rarely considered for collection. Your algorithm for the given test case generates so much garbage that it triggers GC so many times that the actual results, which obviously have to survive the entire function runtime, are tenured, that is, moved to a final generation which is collected either very rarely or not at all. Therefore, the second run will, on standard settings for 32-bit systems, run out of heap (512 MB, can be increased with runtime options).
Tenured data can be garbage collected by manually triggering the collector with (sb-ext:gc :full t). This is obviously implementation dependent.
From the looks of the output, you're looking at the slime-repl, right?
Try changing to the "*inferior-lisp*" buffer, you'll probably see that SBCL has dropped down to the ldb (built-in low-level debugger). Most probably, you've managed to blow the call-stack.