ASCII stream decoding error in cl-html-parse - common-lisp

I get an ASCII stream decoding error when I run this:
(with-open-file (stream file)
  (net.html.parser:parse-html stream))
I'm using SBCL 1.0.58 and cl-html-parse 20101006 (using quicklisp).
I get the error fairly frequently trying to parse pages I download with curl (in a regular shell). Most are UTF-8.
Should I somehow be specifying an encoding for the file, and how would I do that?

After reading a bit more on with-open-file, I found I can make this work by specifying :external-format.
(with-open-file (stream file
                 :if-does-not-exist nil
                 :external-format :UTF-8)
  (net.html.parser:parse-html stream))
Still trying to figure out writing (I get a similar error), but I think I'm looking in the right place now.
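For the writing side, here is a minimal sketch, assuming the output stream simply needs the same :external-format argument as the input stream. The helper names write-utf-8-file and read-utf-8-file are made up for illustration, and :utf-8 as an external-format name is implementation-dependent (SBCL accepts it):

```lisp
(defun write-utf-8-file (path text)
  "Write TEXT to PATH encoded as UTF-8, replacing any existing file."
  (with-open-file (out path
                       :direction :output
                       :if-exists :supersede
                       :if-does-not-exist :create
                       :external-format :utf-8)
    (write-string text out))
  path)

(defun read-utf-8-file (path)
  "Read the whole of PATH as UTF-8 text and return it as a string."
  (with-open-file (in path :external-format :utf-8)
    ;; FILE-LENGTH is an upper bound on the character count, so the
    ;; buffer is trimmed to what READ-SEQUENCE actually filled.
    (let ((text (make-string (file-length in))))
      (subseq text 0 (read-sequence text in)))))
```

Without :external-format on the output stream, the implementation falls back to its default encoding, which is likely where the similar error on writing comes from.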

Related

How to pipe data to other process via temporary file

I want to send some data from my program to a process executed via uiop:run-program.
The following works:
(require :asdf)
(asdf:load-system :uiop)
(uiop:with-temporary-file (:stream dot-program-stream
                           :pathname dot-program-file)
  (format dot-program-stream "digraph g { n1 -> n2; }")
  (finish-output dot-program-stream)
  :close-stream
  (uiop:with-temporary-file (:pathname png-data)
    (uiop:run-program '("/usr/bin/dot" "-Tpng") :input dot-program-file
                                                :output png-data)
    (uiop:launch-program '("/usr/bin/display") :input png-data)))
It seems rather convoluted.
A simpler version, in which I used only the stream, did not call finish-output, and did not use the :close-stream label, resulted in dot producing an empty 0-byte file.
How do I execute a process and pass it data generated by my Lisp program as standard input?
Take a closer look at the documentation of uiop:launch-program and uiop:run-program, especially the options for the :input and :output keys.
You can call launch-program with :input :stream. Launch-program returns a process info object that contains the stream connected to that program's standard input behind the accessor process-info-input, so you can print to that.
If you have a different program that produces output that should go into that input stream, you have several options:
create a temporary file, then read it and print it to the second program's input stream (that seems to be your current approach)
use run-program with :output :string for the first call, then use launch-program with :input :stream for the second and write the output of the first to that stream
use launch-program also for the first call, in this case with :output :stream, then read from that output and print it to the second program's input
You can either read everything first, then write everything, or you can do buffered reading and writing, which might be interesting for long running processes.
Instead of this in-process buffering, you could also use a fifo (named pipe) from your OS.
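The :input :stream approach can be sketched as follows. This is only an illustration under some assumptions: it needs a UIOP recent enough to provide launch-program, the helper name pipe-through is made up, and cat stands in for dot so the example runs on any Unix system. It writes everything first and reads afterwards, so it is only suitable for small inputs and textual output:

```lisp
(require :asdf)

(defun pipe-through (command text)
  "Start COMMAND, write TEXT to its standard input, and return
everything it prints on standard output as a string."
  (let ((process (uiop:launch-program command
                                      :input :stream
                                      :output :stream)))
    (unwind-protect
         (progn
           (write-string text (uiop:process-info-input process))
           ;; Closing the input stream flushes it and signals
           ;; end-of-file to the child process.
           (close (uiop:process-info-input process))
           (uiop:slurp-stream-string (uiop:process-info-output process)))
      (uiop:wait-process process)
      (uiop:close-streams process))))
```

For example, (pipe-through '("cat") "digraph g { n1 -> n2; }") hands the graph description to cat and returns it unchanged. For long-running processes or large data, the buffered read/write loop mentioned above would be needed instead, because a child that fills its output pipe before all input is written can deadlock this simple version.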

Difficulty reading input pipe in SBCL

I am slowly getting closer to being able to read and write to/from named pipes of a background process through SBCL. What I do is kick off the program I am trying to read/write from/to:
todd@ubuntu:~/CoreNLP$ cat ./spin | /usr/bin/java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -outputFormat text > ./spout &
[1] 24616
So that all works out fine, so I kick off SBCL and do this:
(defparameter from-corenlp (open "./spout"))
Which also works out fine, but declaring the stream causes SBCL to spill the stream onto the screen (which is all the startup information from the background process). It does not wait until I read from the stream. Is that how things are supposed to work?
The solution, as I posted it to the stanford parser mailing list (stack overflow reformatted a lot of it to something weird, but you get the idea):
It took quite a while, but I finally figured out embedding (for the most part) the CoreNLP program (while in interactive mode) in SBCL Lisp.
First of all, forget using (sb-ext:run-program ...). This combination of spawning Java with a quoted argument (like the asterisk) no matter how well escaped, simply makes the spawned program crash.
Inferior shell seems to kick off the parser but it is only good for a one-off parse, even in the interactive mode. Perhaps I could have done better, but inferior shell needs to be installed and it is poorly documented.
The initial attempted solution of using Unix named pipes ends up being the final one, but it took a bit of work, first with buffering, then with the order of operations, and finally understanding some nuances about the parser program.
First, turning off buffering completely when running the program is important, so running it looks like this:
stdbuf --i=0 --o=0 --e=0 cat ./spin | /usr/bin/java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -outputFormat text > ./spout &
That is supposed to be running the parser in the background accepting input from spin and sending its output to spout. But if you look at the process table in Linux, you will not see it running. It is still waiting for something to pull from the output pipe before it can even run.
So, we run SBCL and start a stream pulling from the parser's pipe:
(defparameter *from-corenlp* (open "./spout"))
NOW the parser starts running. Here, oddly, it also starts dumping output to the screen, not to the pipe! That is because all of this banner stuff when the parser starts and stops (and apparently even the NLP> prompt) is sent to stderr, not stdout. This is actually a good thing.
So then we declare the stream from Lisp to the parser:
(defparameter *to-corenlp* (open "./spin" :direction :output :if-exists :append))
Then we send some text for the parser to parse:
(write-line "This is the first test." *to-corenlp*)
I ran into a problem here a few times, even. Remember that Lisp has its own buffer so you have to clear out the stream every time:
(finish-output *to-corenlp*)
You then can run this line below a whole bunch of times to verify you obtain the exact same behavior you would have gotten from an interactive session of the parser:
(format t "~a~%" (read-line *from-corenlp*))
If you are a good boy scout, that should not only hold true, but you can also carry on with your interactive slave parser session for as long as you like:
(write-line "This is the second test." *to-corenlp*)
(finish-output *to-corenlp*)
Isn't that great? And notice I pulled all of that off being terrible at Unix, terrible at Lisp and being a terrible boy scout!
Now so can you!
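The write/flush/read cycle above can be wrapped in two small helpers. This is a minimal sketch: the names send-line and read-lines are made up, and the streams are assumed to be ones like *to-corenlp* and *from-corenlp* opened as shown earlier. How many lines the parser prints per sentence is not stated here, so the caller has to supply the count:

```lisp
(defun send-line (text to-stream)
  "Write TEXT as one line to TO-STREAM and flush Lisp's own buffer."
  (write-line text to-stream)
  ;; Without FINISH-OUTPUT the text may sit in the Lisp-side buffer
  ;; and never reach the pipe.
  (finish-output to-stream))

(defun read-lines (from-stream count)
  "Read up to COUNT lines from FROM-STREAM, stopping early at end-of-file."
  (loop for i below count
        for line = (read-line from-stream nil nil)
        while line
        collect line))
```

Usage would look like (send-line "This is the second test." *to-corenlp*) followed by (read-lines *from-corenlp* 10). Note that read-line blocks on a pipe until a full line arrives, so asking for more lines than the parser produces will hang rather than return.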

Inferior-lisp not responding on sldb-quit

I just started learning Common Lisp, so excuse me if my Lisp terminology is a bit off. I installed SLIME and am using Clozure CL. ccl is working just fine. When I enter a wrong expression, the debugger opens (sldb ccl/1 buffer). When I enter q, the debugger buffer closes, and then the inferior-lisp buffer does not respond. Why is that?
If I want to continue working, I seem to have to restart inferior-lisp. What am I doing wrong?
I just wanted to post the solution I found.
I had followed the instructions in the SLIME user manual and used the MELPA repository to install SLIME.
As PuercoPop says in the comments, I should land in a slime-repl buffer, which I didn't have by default. I did some further digging and learned that I have to add a few more lines to my .emacs file for the slime-repl buffer to load. The line needed was
(slime-setup '(slime-fancy))
My final .emacs file looks like this:
(require 'package)
(add-to-list 'package-archives
             '("melpa" . "https://melpa.org/packages/"))
(when (< emacs-major-version 24)
  (add-to-list 'package-archives '("gnu" . "http://elpa.gnu.org/packages/")))
(package-initialize)
(setq package-enable-at-startup nil)
(setq inferior-lisp-program "F:/Binaries/ccl/wx86cl64.exe")
(setq slime-auto-connect 'ask)
(setq slime-net-coding-system 'utf-8-unix)
(require 'slime)
(slime-setup
 '(slime-fancy slime-asdf slime-references slime-indentation slime-xref-browser))

translate-pathname behaves strange

Following this question, "Strange symbols in filespec when calling load", I tried my luck with pathnames but, as you see, failed. Below is an example of the error, which I cannot explain:
This code does not work:
(defun test-process-imgae-raw ()
  (cl-gd:with-image-from-file
      (test #P"digit-recognition:digit-7.png")
    (process-image-raw test)))
Neither does this:
(defun test-process-imgae-raw ()
  (cl-gd:with-image-from-file
      (test "digit-recognition:digit-7.png")
    (process-image-raw test)))
But this code does:
(defun test-process-imgae-raw ()
  (cl-gd:with-image-from-file
      (test (translate-logical-pathname "digit-recognition:digit-7.png"))
    (process-image-raw test)))
And so does this:
(defun test-process-imgae-raw ()
  (cl-gd:with-image-from-file
      (test (translate-logical-pathname #P"digit-recognition:digit-7.png"))
    (process-image-raw test)))
Here's the "translator":
(setf (logical-pathname-translations "DIGIT-RECOGNITION")
      `(("**;*.*" "/home/wvxvw/Projects/digit-recognition/**/*.*")))
And here's the error I'm getting:
Pathname components from SOURCE and FROM args to TRANSLATE-PATHNAME
did not match:
:NEWEST NIL
[Condition of type SIMPLE-ERROR]
Restarts:
0: [RETRY] Retry SLIME REPL evaluation request.
1: [*ABORT] Return to SLIME's top level.
2: [ABORT] Abort thread (#<THREAD "repl-thread" RUNNING {1003800113}>)
Backtrace:
0: (SB-IMPL::DIDNT-MATCH-ERROR :NEWEST NIL)
1: (SB-IMPL::TRANSLATE-COMPONENT :NEWEST NIL :NEWEST T)
2: (TRANSLATE-PATHNAME #P"DIGIT-RECOGNITION:DIGIT-7.PNG.NEWEST" #P"DIGIT-RECOGNITION:**;*.*" #P"/home/wvxvw/Projects/digit-recognition/**/*.*")
3: (TRANSLATE-LOGICAL-PATHNAME #P"DIGIT-RECOGNITION:DIGIT-7.PNG.NEWEST")
4: (SB-IMPL::QUERY-FILE-SYSTEM #P"DIGIT-RECOGNITION:DIGIT-7.PNG" :TRUENAME NIL)
5: (PROBE-FILE #P"DIGIT-RECOGNITION:DIGIT-7.PNG")
6: (CREATE-IMAGE-FROM-FILE #<unavailable argument> NIL)
7: (TEST-PROCESS-IMGAE-RAW)
I'm trying to read the HyperSpec section on translate-pathname, but I can make no sense of what it says, nor of the examples it shows. Beyond that, I can't understand how there can be an error at all when you transform a string by whatever rules you put in place; so far it's only a one-way transformation...
I'm trying to read SBCL sources for this function, but they are really lengthy, and trying to figure out the problem this way is taking huge amounts of time.
tl;dr How is it even possible that translate-logical-pathname called from user code produces something different from what it produces when called from system code? This is not only non-portable, it is outright broken.
EDIT:
Adding one more asterisk to the pattern on the left side, but not on the right, solved this. But the purpose or logic of why this is necessary is beyond me.
I.e.
(setf (logical-pathname-translations "DIGIT-RECOGNITION")
      `(("**;*.*.*" "/home/wvxvw/Projects/digit-recognition/**/*.*")))
This allows pathnames like digit-recognition:foo.bar.newest to succeed, just like digit-recognition:foo.bar, but why that asterisk is required is beyond me. Also, why does the system function feel entitled to change the pathname to something other than what it was given? Just so you don't get confused: with-image-from-file will only work with a path already expanded by translate-logical-pathname; it won't work otherwise.
EDIT2:
OK, it seems this is a problem with cl-gd: instead of trying to expand the file name, it takes it literally. This code, taken from create-image-from-file, probably best answers my question:
(when (pathnamep file-name)
(setq file-name
#+:cmu (ext:unix-namestring file-name)
#-:cmu (namestring file-name)))
(with-foreign-object (err :int)
(with-cstring (c-file-name file-name)
(let ((image (ecase %type
((:jpg :jpeg)
(gd-image-create-from-jpeg-file c-file-name err))
I.e. instead of doing (namestring file-name) it has to do (namestring (translate-logical-pathname file-name)). Duh...
Another way is to use TRUENAME, which returns the real file name. Normally this would not make a difference.
Imagine a file system with file versions (like the file systems of VMS, ...). If you have a logical pathname foo:bar;baz.png.newest, then it might translate to, say, /myfiles/images/baz.png~newest (again, just assume that it has version numbers). This still is not a real physical file. If such a Lisp system tries to open the file, it has to look into the file system to actually determine the newest file. That might be /myfiles/images/baz.png~42.
So, if you want to pass real physical filenames to external tools (like a C library), it might not be sufficient to expand the logical pathname, but it might be necessary to compute the truename - the real physical file.
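A minimal sketch of that idea, assuming the goal is a plain string to hand to a foreign library. The helper name native-file-name is made up, and falling back to translate-logical-pathname for files that do not exist yet is a design choice of this sketch, not something cl-gd does:

```lisp
(defun native-file-name (path)
  "Return a physical namestring for PATH, suitable for foreign code.
Uses the truename when the file exists, otherwise only translates
a logical pathname into a physical one."
  (namestring (if (probe-file path)
                  (truename path)
                  (translate-logical-pathname path))))
```

The fallback matters for output files, where truename would signal an error because nothing exists on disk yet. For logical pathnames this still relies on the host's translations accepting the pathname that probe-file builds, which is exactly what failed with the "**;*.*" pattern in the question.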
The ability to deal with file versions comes from a time when file versions were quite common (see Versioning file system) with operating systems like ITS, VMS or the various Lisp Machine operating systems.
The main practical problem for this is that there is no common test suite for pathname operations for the various CL implementations and thus implementations differ in a lot of subtle details (especially when you need to deal with different file systems from different operating systems). Plus real file systems have complications - for example file names in Mac OS X use a special unicode encoding when dealing with Umlauts.

How to implement `tail` command using CL?

"with-open-file" will read from the beginning of a file. If the file is VERY big how to read the last 20 lines efficiently ?
Sincerely!
This opens a file, reads the final byte, and closes the file.
(defun read-final-byte (filename)
  (with-open-file (s filename
                     :direction :input
                     :if-does-not-exist :error)
    (let ((len (file-length s)))
      (file-position s (1- len)) ; 0-based position.
      (read-char s nil)))) ; don't error if reading the end of the file.
If you want to specifically read the last n lines, you will have to read back an indeterminate number of bytes until you get n+1 newlines. In order to do this, you will either have to do block reads backwards (faster, but will wind up reading unneeded bytes) or byte reads (slower, but allows precision and a slightly more obvious algorithm).
I suspect tail has a reasonable algorithm applied for this, so it would likely be worth reading tail's source for a guideline.
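The backwards block-read variant might look like the sketch below. It is not taken from tail's source. The function names tail-start-offset and tail-lines are made up, lines are assumed to end in a single newline byte (10), n is assumed to be at least 1, and the final conversion with code-char assumes a single-byte encoding such as ASCII or Latin-1:

```lisp
(defun tail-start-offset (filename n &key (block-size 4096))
  "Return the byte offset at which the last N (>= 1) lines of FILENAME begin."
  (with-open-file (s filename :element-type '(unsigned-byte 8))
    (let* ((len (file-length s))
           (buf (make-array block-size :element-type '(unsigned-byte 8)))
           (end len)
           (newlines 0))
      ;; A trailing newline ends the last line rather than starting
      ;; a new one, so it is excluded from the scan.
      (when (plusp len)
        (file-position s (1- len))
        (when (= (read-byte s) 10)
          (decf end)))
      ;; Walk backwards one block at a time, counting newline bytes.
      (loop while (plusp end)
            do (let* ((start (max 0 (- end block-size)))
                      (count (- end start)))
                 (file-position s start)
                 (read-sequence buf s :end count)
                 (loop for i from (1- count) downto 0
                       when (= (aref buf i) 10)
                         do (incf newlines)
                            (when (= newlines n)
                              (return-from tail-start-offset
                                (+ start i 1))))
                 (setf end start)))
      ;; Fewer than N lines in the file: start at the beginning.
      0)))

(defun tail-lines (filename n)
  "Return the last N lines of FILENAME as one string (single-byte encoding assumed)."
  (let ((offset (tail-start-offset filename n)))
    (with-open-file (s filename :element-type '(unsigned-byte 8))
      (file-position s offset)
      (let ((bytes (make-array (- (file-length s) offset)
                               :element-type '(unsigned-byte 8))))
        (read-sequence bytes s)
        (map 'string #'code-char bytes)))))
```

Opening the file with a byte element type keeps file-position in well-defined byte units. For UTF-8 files the byte scan still finds the right offset, but the bytes would need a real decoder (for example an implementation-specific octets-to-string function) instead of code-char.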
