Why is this lisp benchmark (in sbcl) so slow?

Since I'm interested in C++ as well as in Lisp, I tried to reproduce the benchmark presented here, written by Guillaume Michel. The benchmark is basically a DAXPY BLAS Level 1 operation performed multiple times on big arrays. The full code was thankfully posted on GitHub and is roughly one page in each language.
Sadly, I discovered that I could not come close to the speed of his Lisp results.
For Linux, he got similar results for C++ as well as Lisp:
Size | C++ | Common Lisp
100,000,000 | 181.73 | 183.9
On my PC, both numbers naturally differ, but the Lisp one dramatically so:
Size | C++ | Common Lisp
100,000,000 | 195.41 | 3544.3
Since I wanted an additional measurement, I started both programs with the time command and got (output shortened):
$ time ./saxpy
real 0m7.381s
$ time ./saxpy_lisp
real 0m40.661s
I considered several reasons for this blatant difference. I scanned through both code samples but found no big algorithmic or numeric difference between the C++ and Lisp versions. Then I thought the use of buildapp was creating the delay, so I ran the benchmark directly in the REPL. As a last resort I tried another version of SBCL, downloading the newest sbcl-1.3.11 and evaluating the code there; still, the (optimized) Lisp version takes much longer than its C++ counterpart.
What am I missing?

I cannot replicate your findings:
sylwester#pus:~/a/bench-saxpy:master$ sbcl --dynamic-space-size 14000
This is SBCL 1.3.1.debian, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (compile-file "saxpy.lisp")
; compiling file "/pussycat/natty-home/westerp/apps/bench-saxpy/saxpy.lisp" (written 06 NOV 2016 12:04:30 AM):
; compiling (DEFUN SAXPY ...)
; compiling (DEFUN DAXPY ...)
; compiling (DEFMACRO TIMING ...)
; compiling (DEFUN BENCH ...)
; compiling (DEFUN MAIN ...)
; /pussycat/natty-home/westerp/apps/bench-saxpy/saxpy.fasl written
; compilation finished in 0:00:00.038
#P"/pussycat/natty-home/westerp/apps/bench-saxpy/saxpy.fasl"
NIL
NIL
* (load "saxpy.fasl")
T
* (main t)
Size, Execution Time (ms)
10, 0.0
100, 0.0
1000, 0.0
10000, 0.1
100000, 0.49999997
1000000, 3.6
10000000, 39.2
100000000, 346.30002
(NIL NIL NIL NIL NIL NIL NIL NIL)
So on my (relatively old) machine it took 346 ms. The whole test took about 5 seconds, but that is a series of many runs. Interestingly, running from SBCL was faster than building an image and running it:
sylwester#pus:~/a/bench-saxpy:master$ make lisp
buildapp --output saxpy_lisp --entry main --load saxpy.lisp --dynamic-space-size 14000
;; loading file #P"/pussycat/natty-home/westerp/apps/bench-saxpy/saxpy.lisp"
[undoing binding stack and other enclosing state... done]
[saving current Lisp image into saxpy_lisp:
writing 4944 bytes from the read-only space at 0x20000000
writing 3168 bytes from the static space at 0x20100000
writing 60522496 bytes from the dynamic space at 0x1000000000
done]
sylwester#pus:~/a/bench-saxpy:master$ ./saxpy_lisp
Size, Execution Time (ms)
10, 0.0
100, 0.0
1000, 0.0
10000, 0.0
100000, 0.4
1000000, 3.5
10000000, 40.2
100000000, 369.49997
In the C version I get:
100000000, 296.693634, 295.762695, 340.574860
It seems like the C version actually performs the same calculation in three different ways and reports the time each took, so timing the whole process with time wouldn't do it justice.
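For reference, a declared Common Lisp DAXPY usually looks something like the sketch below. This is my own minimal version, not necessarily the code in the linked repository; the point is that without the type declarations and a simple-array of double-float, SBCL falls back to generic arithmetic on boxed floats, which can easily account for a slowdown of the magnitude described.
;; Minimal sketch of a declared DAXPY (an assumption, not the benchmark's exact code).
;; The SIMPLE-ARRAY DOUBLE-FLOAT declarations let SBCL open-code the float
;; arithmetic instead of boxing every element.
(defun daxpy (alpha x y)
  (declare (type double-float alpha)
           (type (simple-array double-float (*)) x y)
           (optimize (speed 3) (safety 0)))
  (dotimes (i (length x) y)
    (setf (aref y i)
          (+ (* alpha (aref x i)) (aref y i)))))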

Related

Why is dribble producing an empty file?

I am trying to learn Common Lisp with the book Common Lisp: A Gentle Introduction to Symbolic Computation. In addition, I am using SBCL, Emacs, and SLIME.
Towards the end of chapter 9, the author shows the dribble tool with a short example session.
I tried to reproduce the commands presented by the author. The only difference in my input was that I chose a different location to save the file. In my environment, I did:
CL-USER> (dribble "/home/pedro/miscellaneous/misc/symbolic-computation/teste-tool.log")
; No value
CL-USER> (cons 2 nil)
(2)
CL-USER> '(is driblle really working?)
(IS DRIBLLE REALLY WORKING?)
CL-USER> "is dribble useful at all?"
"is dribble useful at all?"
CL-USER> (dribble)
; No value
The file was indeed created:
$ readlink -f teste-tool.log
/home/pedro/miscellaneous/misc/symbolic-computation/teste-tool.log
Note that I did not get messages such as "Now recording in file --location---" in the REPL while I was typing. But this may vary according to the Lisp implementation.
The big surprise was that, unfortunately, the file was empty. Thus, dribble did not work as expected.
Did I do something wrong?
By default, I don't think this works within SLIME.
It does work within the plain SBCL REPL:
➜ sbcl
This is SBCL 2.0.1.debian, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (dribble "dribble-test.lisp")
* (* 8 5)
40
* "Will this work?"
"Will this work?"
* (dribble)
* %
Which can be confirmed:
➜ cat dribble-test.lisp
* (* 8 5)
40
* "Will this work?"
"Will this work?"
* (dribble)
Saving "REPL history" seems less useful with Slime IMO because of all the "non-REPL evaluation" that you do by e.g. selecting a function or an expression or region and evaluating that in the REPL.
To actually save and view history within Slime, see slime-repl-save-history and associated functions; you can even merge histories from independent Repls if you so choose :-)
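If all you want is a crude log of output under SLIME, one workaround (my own sketch, not part of SLIME or SBCL) is to tee *standard-output* into a file with a broadcast stream. Unlike dribble this only captures output, not the forms you type, and the path below is just an example:
;; Sketch: tee *standard-output* into a log file (path is an example).
(defvar *log-stream*
  (open "/tmp/repl-output.log"
        :direction :output
        :if-exists :append
        :if-does-not-exist :create))

(setf *standard-output*
      (make-broadcast-stream *standard-output* *log-stream*))

;; Later, stop logging and flush the file with (close *log-stream*).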

How do I use the live-code feature in sbcl?

I am trying to get live-coding to work in Lisp. I have the file t.cl which contains only this line: (loop(write(- 2 1))). Now, when I run the file in bash with sbcl --load t.cl --eval '(quit)', it runs the line, but when I edit the file in another terminal and save it while it runs, nothing changes.
Why your example fails
When you run sbcl --load t.cl --eval '(quit)' in a shell, SBCL spins up a Lisp image in a process, compiles the file and runs it. You then modify the file and save it to disk. This last action is of no concern to the already running SBCL process, which has already compiled the previous version of the file. SBCL read the file once, when you asked it to; once it has the compiled instructions to run, it has no reason to look at the file again unless you explicitly ask it to.
A 'live' example with Emacs+SLIME
In order to perform 'live' changes to your running program, you must interact with the already running Lisp image. This is easily doable with Emacs+Slime. You can, for example, have a loop like so:
(defun foo (x) (+ x 3))

(dotimes (it 20)
  (format t "~A~%" (foo it))
  (sleep 1))
and then recompile foo during execution within the REPL with a new definition:
(defun foo (x) (+ x 100))
Another thread will be used to recompile the function. The new function will be used for future calls as soon as its compilation is finished.
The output in the REPL will look like:
3
4
5
CL-USER> (defun foo (x) (+ x 100))
WARNING: redefining COMMON-LISP-USER::FOO in DEFUN
FOO
103
104
105
...
This would also work with the new definition of foo being compiled from another file as opposed to being entered directly in the REPL.
Working from the system shell
While you can already use the example above for development purposes, you might want to interact with a running SBCL Lisp image from the shell. I am not aware of a general way to do that. For your exact example, you would want SBCL to reload any files that you have modified, and a brief look at the SBCL manual doesn't reveal a way to pipe Lisp code into an already running SBCL process.
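If you really want a shell-driven workflow, one crude workaround (my own sketch, not something SBCL provides) is to have the running image poll the file's write date and re-LOAD it when it changes. Note that this only helps if the file defines functions; a file whose top level is an endless loop, like your t.cl, would block inside LOAD itself.
;; Sketch: poll a file and reload it whenever its write date changes.
;; The path and interval below are examples only.
(defun watch-and-reload (path &key (interval 1))
  (let ((last-write (file-write-date path)))
    (loop
      (sleep interval)
      (let ((now (file-write-date path)))
        (when (and now (> now last-write))
          (setf last-write now)
          (format t "~&Reloading ~A~%" path)
          (load path))))))

;; Example: (watch-and-reload "t.cl")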

how to make clisp or sbcl use all cpu cores available?

Over a remote ssh connection, I'm trying to cross-compile sbcl using clisp. The steps I've followed thus far are as follows:
I downloaded the most recent sbcl source (at this time sbcl-1.3.7), uncompressed it, and entered its source directory.
Then to build it:
root#remotehost:/sbcl-1.3.7# screen
root#remotehost:/sbcl-1.3.7# sh make.sh --prefix=/usr --dynamic-space-size=2Gb --xc-host='clisp -q'
root#remotehost:/sbcl-1.3.7# Ctrl-A Ctrl-D
[detached from 4486.pts-1.remotehost]r/fun-info-funs.fas
root#remotehost:/sbcl-1.3.7#
Over a second remote ssh connection to the same box, top reports CPU usage at 6%.
nproc says I have 16 cores (this is google compute engine--no way could I afford something with 16 cores :)
MAKEFLAGS are set in my environment to -j16, but I guess clisp isn't aware of that. How can I get this build to make use of all 16 cores?
I recommend using a parallelism library; I really like the lparallel library.
It has nice utilities for parallelizing your code among all the processors in your machine. The examples below are from my MacBook Pro (4 cores) using SBCL. There is also a great series on Common Lisp concurrency and parallelism here.
Let's create a small example using the lparallel cognates. Note that this example is not a great exercise in parallelism; it is only meant to show the power of lparallel and how easy it is to use.
Let's consider a fibonacci tail recursive function from cliki:
(defun fib (n)
  "Tail-recursive computation of the nth element of the Fibonacci sequence"
  (check-type n (integer 0 *))
  (labels ((fib-aux (n f1 f2)
             (if (zerop n)
                 f1
                 (fib-aux (1- n) f2 (+ f1 f2)))))
    (fib-aux n 0 1)))
This will serve as our computationally expensive sample function. Let's use it:
CL-USER> (time (progn (fib 1000000) nil))
Evaluation took:
17.833 seconds of real time
18.261164 seconds of total run time (16.154088 user, 2.107076 system)
[ Run times consist of 3.827 seconds GC time, and 14.435 seconds non-GC time. ]
102.40% CPU
53,379,077,025 processor cycles
43,367,543,984 bytes consed
NIL
This is the calculation of the 1,000,000th term of the Fibonacci series on my computer.
Let's, for example, calculate a list of Fibonacci numbers using mapcar:
CL-USER> (time (progn (mapcar #'fib '(1000000 1000001 1000002 1000003)) nil))
Evaluation took:
71.455 seconds of real time
73.196391 seconds of total run time (64.662685 user, 8.533706 system)
[ Run times consist of 15.573 seconds GC time, and 57.624 seconds non-GC time. ]
102.44% CPU
213,883,959,679 processor cycles
173,470,577,888 bytes consed
NIL
lparallel has cognates:
They return the same results as their CL counterparts except in cases where parallelism must play a role. For instance premove behaves essentially like its CL version, but por is slightly different. or returns the result of the first form that evaluates to something non-nil, while por may return the result of any such non-nil-evaluating form.
First, load lparallel:
CL-USER> (ql:quickload :lparallel)
To load "lparallel":
Load 1 ASDF system:
lparallel
; Loading "lparallel"
(:LPARALLEL)
So in our case, the only thing you have to do is initialize a kernel with the number of cores you have available:
CL-USER> (setf lparallel:*kernel* (lparallel:make-kernel 4 :name "fibonacci-kernel"))
#<LPARALLEL.KERNEL:KERNEL :NAME "fibonacci-kernel" :WORKER-COUNT 4 :USE-CALLER NIL :ALIVE T :SPIN-COUNT 2000 {1004E1E693}>
and then launch the cognates from the pmap family:
CL-USER> (time (progn (lparallel:pmapcar #'fib '(1000000 1000001 1000002 1000003)) nil))
Evaluation took:
58.016 seconds of real time
141.968723 seconds of total run time (107.336060 user, 34.632663 system)
[ Run times consist of 14.880 seconds GC time, and 127.089 seconds non-GC time. ]
244.71% CPU
173,655,268,162 processor cycles
172,916,698,640 bytes consed
NIL
You can see how easy it is to parallelize this task. lparallel has a lot of other resources that you can explore.
I also captured the CPU usage during the mapcar and pmapcar runs on my Mac (screenshot not included here).
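One small addition of mine: when you are finished, it is good practice to shut the kernel down so its worker threads exit cleanly.
;; Shut down the kernel's worker threads when done.
(lparallel:end-kernel :wait t)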
SBCL cross-compiling (aka bootstrapping) is done in 100% vanilla CL, and there is nothing in the standard about threading or multiple processes.
Even a supplied SBCL binary for your machine might not use threading for the build, although SBCL does have thread support in the language.
You only need to do this if you have altered the source of SBCL, as the latest supported version is usually available precompiled. I haven't compiled SBCL myself, but I doubt it takes longer than my Linux compiles in the 90s, which took less time than my nightly sleep.

reading deeply nested tree causes stack overflow

I'm trying to read a massive sexp from a file into memory, and it seems to work fine for smaller inputs, but on more deeply nested ones SBCL conks out with stack exhaustion. There seems to be a hard recursion limit (at 1000 functions deep) that SBCL simply cannot surpass (strangely, even when its stack size is increased). Example (code is here): make check-c works, but make check-cpp exhausts the stack as below:
INFO: Control stack guard page unprotected
Control stack guard page temporarily disabled: proceed with caution
Unhandled SB-KERNEL::CONTROL-STACK-EXHAUSTED in thread #<SB-THREAD:THREAD
"main thread" RUNNING
{10034E6DE3}>:
Control stack exhausted (no more space for function call frames).
This is probably due to heavily nested or infinitely recursive function
calls, or a tail call that SBCL cannot or has not optimized away.
PROCEED WITH CAUTION.
Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {10034E6DE3}>
0: ((LAMBDA NIL :IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX))
1: (SB-IMPL::CALL-WITH-SANE-IO-SYNTAX #<CLOSURE (LAMBDA NIL :IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX) {100FC9006B}>)
2: (SB-IMPL::%WITH-STANDARD-IO-SYNTAX #<CLOSURE (LAMBDA NIL :IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX) {100FC9003B}>)
...
Why am I using recursion, then? Actually, I'm not, but unfortunately the builtin (read) uses recursion, and that's where the stack overflow is occurring. The other option (which I've started working on) is to write an iterative version of read which relies upon the more limited syntax that I'm feeding into it from a separate program to avoid the complexity of re-implementing read (my (currently broken) attempts at that are in the lisp branch of the above repository).
However, I'd prefer a more canonical solution. Are there alternatives to the builtin read that can parse deeply nested structures by avoiding recursion?
EDIT: This appears to be an insurmountable issue with sbcl itself, not the input data. For a quick example, try running:
(for i in $(seq 1 2000); do
echo -n "("
done; echo -n "2"; for i in $(seq 1 2000); do
echo -n ")"
done; echo) > file
And then in sbcl:
(with-open-file (file "file" :direction :input) (read file))
The same failure occurs.
EDIT: I asked around on #sbcl, and apparently the control stack size setting really applies only to new threads; the stack size for the main thread is affected by a lot of other factors as well. So I tried putting the read in a separate thread. It still didn't work. Check out this repo and run make check if you're interested.
I don't know what you did (because you didn't show it exactly), but when I start sbcl as follows your example works fine for me:
sbcl --control-stack-size 100
Of course, I also recommend GNU CLISP and Embeddable Common Lisp (ECL), as they handle your example fine as well.
I'll add a reference to this answer for future readers: https://stackoverflow.com/a/9002973/816536
I'll also mention that compiling the code with appropriate optimization options may be necessary in many CL implementations to benefit from tail-call optimization.
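To come back to the original question about avoiding recursion: below is a rough sketch (mine, not the poster's code) of reading nested lists with an explicit stack, so the nesting depth is limited by the heap rather than the control stack. It assumes a very restricted syntax of parentheses plus atoms that READ-FROM-STRING can handle, which matches the generated test file above.
;; Sketch: iterative reader for a restricted syntax (parens + simple atoms),
;; using an explicit stack instead of recursive descent.
(defun read-nested (stream)
  (let ((stack '())     ; enclosing, partially built lists (reversed)
        (current '())   ; elements of the innermost open list, reversed
        (token (make-string-output-stream)))
    (flet ((finish-token ()
             (let ((text (get-output-stream-string token)))
               (when (plusp (length text))
                 (push (read-from-string text) current)))))
      (loop for ch = (read-char stream nil nil)
            while ch
            do (case ch
                 (#\( (finish-token)
                      (push current stack)
                      (setf current '()))
                 (#\) (finish-token)
                      (let ((finished (nreverse current)))
                        (setf current (pop stack))
                        (push finished current)))
                 ((#\Space #\Tab #\Newline #\Return) (finish-token))
                 (t (write-char ch token))))
      (finish-token)
      (first (nreverse current)))))

;; Example: (with-open-file (in "file") (read-nested in))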

Unexpected behavior with `eval-when`

The following fragment of CL code does not work as I expected with CCL running SLIME. If I first compile and load the file using C-c C-k, and then run
(rdirichlet #(1.0 2.0 3.0) 1.0)
in the SLIME/CCL REPL, I get the error
value 1.0 is not of the expected type DOUBLE-FLOAT.
[Condition of type TYPE-ERROR]
It works with SBCL. I expected the (setf *read-default-float-format* 'double-float) to allow me to use values like 1.0. If I load this file into CCL using LOAD at the REPL, it works. What am I missing?
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :asdf)
  (require :cl-rmath)
  (setf *read-default-float-format* 'double-float))
(defun rdirichlet (alpha rownum)
  ;;(declare (fixnum rownum))
  (let* ((alphalen (length alpha))
         (dirichlet (make-array alphalen :element-type '(double-float 0.0 *)
                                :adjustable nil :fill-pointer nil :displaced-to nil)))
    (dotimes (i alphalen)
      (setf (elt dirichlet i) (cl-rmath::rgamma (elt alpha i) rownum)))
    ;; Divide dirichlet vector by its sum
    (map 'vector #'(lambda (x) (/ x (reduce #'+ dirichlet))) dirichlet)))
UPDATE: I forgot to mention my platform and versions. I'm using Debian squeeze x86. The version of SLIME is from Debian unstable, 1:20120525-2.
CCL is the 1.8 release. I tried it both with the upstream binaries from http://svn.clozure.com/publicsvn/openmcl/release/1.8/linuxx86/ccl, and a binary package created by me - see Package ccl at mentors.debian.net. The result was the same in each case.
It seems probable that this issue is SLIME specific. It would be helpful if people could comment whether they see this behavior or not. Also,
what is the equivalent of C-c C-k in SLIME, if one is running CCL in basic command line mode? (LOAD filename), or something else? Or, to ask a slightly different question, what CCL function is C-c C-k calling?
I notice that calling C-c C-k on the following code
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :asdf)
  (require :cl-rmath)
  (setf *read-default-float-format* 'double-float))

(print *read-default-float-format*)
produces DOUBLE-FLOAT, even though checking *read-default-float-format* at the REPL immediately afterwards gives SINGLE-FLOAT.
UPDATE 2: It looks like, as Rainer said, the compilation occurs in a separate thread.
Per the function all-processes in the Threads Dictionary, printing (all-processes) from the buffer using C-c C-k gives
(#<PROCESS worker(188) [Active] #x18BF99CE> #<PROCESS repl-thread(12) [Semaphore timed wait] #x187A186E> #<PROCESS auto-flush-thread(11) [Sleep] #x187A1C9E> #<PROCESS swank-indentation-cache-thread(6) [Semaphore timed wait] #x186C128E> #<PROCESS reader-thread(5) [Active] #x186C164E> #<PROCESS control-thread(4) [Semaphore timed wait] #x186BE3BE> #<PROCESS Swank Sentinel(2) [Semaphore timed wait] #x186BD0D6> #<TTY-LISTENER listener(1) [Active] #x183577B6> #<PROCESS Initial(0) [Sleep] #x1805FCCE>)
and
CL-USER> (all-processes)
in the REPL gives
(#<PROCESS repl-thread(12) [Active] #x187A186E> #<PROCESS auto-flush-thread(11) [Sleep] #x187A1C9E> #<PROCESS swank-indentation-cache-thread(6) [Semaphore timed wait] #x186C128E> #<PROCESS reader-thread(5) [Active] #x186C164E> #<PROCESS control-thread(4) [Semaphore timed wait] #x186BE3BE> #<PROCESS Swank Sentinel(2) [Semaphore timed wait] #x186BD0D6> #<TTY-LISTENER listener(1) [Active] #x183577B6> #<PROCESS Initial(0) [Sleep] #x1805FCCE>)
so it seems #<PROCESS worker(188) [Active] #x18BF99CE> is the thread that is doing the compilation. Of course, there still remains the question of why these variables are local to a thread, and also why SBCL does not behave the same way.
I can see that with CCL and a somewhat older SLIME (the one I use), too. I haven't tried it with a newer SLIME.
It does not happen with SBCL or LispWorks.
*read-default-float-format* is one of the I/O variables of Common Lisp. Something like WITH-STANDARD-IO-SYNTAX binds them to the standard values and, on exit, restores the previous values. So I suspect that CCL's SLIME code has such a binding in effect. This is confirmed by setting other I/O variables like *read-base* - they are also bound.
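For example, this is standard behavior you can check at any REPL: WITH-STANDARD-IO-SYNTAX rebinds *read-default-float-format* to SINGLE-FLOAT regardless of the global value.
(setf *read-default-float-format* 'double-float)
(with-standard-io-syntax
  *read-default-float-format*)
;; => SINGLE-FLOAT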
CCL, Emacs, and SLIME have several layers of code, which makes this slightly complex to debug.
Emacs runs the ELISP code of SLIME and talks to SWANK.
SWANK is the backend of SLIME and is Common Lisp code running in CCL.
SWANK has portable code and some CCL-specific code.
On the Emacs side the SLIME/ELISP function SLIME-COMPILE-AND-LOAD-FILE is used.
On the SWANK side the Common Lisp function swank:compile-file-for-emacs gets called.
Later SWANK:LOAD-FILE gets called.
I don't see where the I/O variables are bound - maybe it is in the CCL networking code?
This seems to be the answer:
If CCL has thread-local I/O variables, then the compilation happens in another thread and won't change the I/O bindings in the REPL thread, nor in any other thread.
This is a difference between various CL implementations. Since the CL standard does not specify threads or processes, it is also not specified if special bindings are global or have a thread-local binding by default...
If you think about it, protecting a thread's I/O variables against changes from other threads makes sense...
So, the correct way to deal with it should be to make sure in each thread independently that the right I/O variable values are in effect.
Let me expand a bit why things are like they are.
Typically a Common Lisp implementation can run more than one thread. Some implementations also allow concurrent threads running at the same time on different cores. These threads can do very different things: one could run a REPL, another one could answer an HTTP request, one could load data from the disk and another one could read the contents of an Email. Lisp in this case runs several different tasks inside one Lisp system.
Lisp has several variables which determine the behavior of I/O operations: for example, which format floats are read in, or which radix integers are read in. The latter is controlled by *read-base*.
Now imagine that the disk-reading thread above has set its *read-base* to 16 for some purpose. If you then change the global value in another thread to 8, suddenly all other threads see base 8. The consequence: the disk-reading thread will see *read-base* 8 instead of 16 and work differently.
So it makes sense to prevent this in some way. The simplest way is for the running code in each thread to have its own bindings for the I/O variables, so that changing *read-base* has no effect on other threads. These bindings are usually introduced by LET or a function call; typically the code itself is responsible for binding the variables.
Another way to prevent it is to give each thread a set of initial bindings, which would, for example, include the I/O variables. CCL does that. LispWorks, for example, does that too, but not for the I/O variables.
Each Lisp might also give you a non-portable way to change a thread's top-level binding (CCL has that too, for example (setf ccl:symbol-value-in-process)). Still, that does not mean it changes the binding in effect in the REPL, since the REPL itself is a piece of Lisp code running in a thread, and it could have set up its own bindings.
In CCL you can also set the global static binding: (CCL::%SET-SYM-GLOBAL-VALUE sym value). But if you use such functionality you are probably doing something wrong or you have a good reason.
Some background for CCL: http://clozure.com/pipermail/openmcl-devel/2011-June/012882.html
Lore
Don't try to change a global binding in one thread and expect it to be visible in other threads.
Shield your code from changes in other threads by binding the crucial variables to the correct values.
Write a function which sets up the variables to the needed values in a controllable fashion.
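As a concrete illustration (my own suggestion, not from the answers above), you can sidestep the reader variable entirely by writing explicit double-float literals, or bind the variable locally in whichever thread actually does the reading:
;; Explicit exponent markers do not depend on *read-default-float-format*:
(rdirichlet #(1.0d0 2.0d0 3.0d0) 1.0d0)

;; Or bind the I/O variable around the READ that needs it:
(let ((*read-default-float-format* 'double-float))
  (read-from-string "#(1.0 2.0 3.0)"))   ; => #(1.0d0 2.0d0 3.0d0), 14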
