Escaping quotes in cl-ppcre regex - common-lisp

Background
I need to parse CSV files, and cl-csv et al. are too slow on large files and depend on cl-unicode, which my preferred Lisp implementation does not support. So I am improving cl-simple-table, which Sabra-on-the-hill benchmarked as the fastest CSV reader in a review.
At the moment, simple-table's line parser is rather fragile, and it breaks if the separator character appears within a quoted string. I'm trying to replace the line parser with cl-ppcre.
Attempts
Using the Regex Coach, I've found a regex that works in almost all cases:
("[^"]+"|[^,]+)(?:,\s*)?
The challenge is getting this Perl regex string into something I can use in cl-ppcre to split the line. I have tried passing the regex string, with various escapes for the ":
(defparameter bads "\"AER\",\"BenderlyZwick\",\"Benderly and Zwick Data: Inflation, Growth and Stock returns\",31,5,0,0,0,0,5,\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv\",\"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html\"
"Bad string, note a separator character in the quoted field, near Inflation")
(ppcre:split "(\"[^\"]+\"|[^,]+)(?:,\s*)?" bads)
NIL
Neither single, double, triple, nor quadruple backslashes work.
I've parsed the string to see what the parse tree looks like:
(ppcre:parse-string "(\"[^\"]+\"|[^,]+)(?:,s*)?")
(:SEQUENCE (:REGISTER (:ALTERNATION (:SEQUENCE #\" (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\")) #\") (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,)))) (:GREEDY-REPETITION 0 1 (:GROUP (:SEQUENCE #\, (:GREEDY-REPETITION 0 NIL #\s)))))
and passed the resulting tree to split:
(ppcre:split '(:SEQUENCE (:REGISTER (:ALTERNATION (:SEQUENCE #\" (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\")) #\") (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,)))) (:GREEDY-REPETITION 0 1 (:GROUP (:SEQUENCE #\, (:GREEDY-REPETITION 0 NIL #\s))))) bads)
NIL
I also tried various forms of *allow-quoting*:
(let ((ppcre:*allow-quoting* t))
(ppcre:split "(\\Q\"\\E[^\\Q\"\\E]+\\Q\"\\E|[^,]+)(?:,\s*)?" bads))
I've read through the cl-ppcre docs, but there are very few examples of using parse trees, and no examples of escaping quotes.
Nothing seems to work.
I was hoping that the Regex Coach would provide a way to see the S-expression parse tree form of the Perl syntax string. That would be a very useful feature, allowing you to experiment with the regex string and then copy & paste the parse tree in Lisp code.
Does anyone know how to escape quotes in this example?

In this answer I focus on the errors in your code and try to explain how you could make it work. As explained by @Svante, this might not be the best course of action for your use case. In particular, your regex might be too tailored to your known test inputs and might miss cases that arise later.
For example, your regex considers a field to be either a string delimited by double quotes with no inner double quotes (not even escaped ones), or a sequence of characters other than the comma. If, however, a field starts with a normal letter and then contains a double quote, that quote will be treated as part of the field.
Fixing the test string
Maybe there was a problem when formatting your question, but the form introducing bads is malformed.
Here is a fixed definition for *bads*. Notice the asterisks around the name of the special variable: this is a useful convention (the asterisks are also known as "earmuffs") that helps distinguish special variables from lexical ones:
(defparameter *bads*
"\"AER\",\"BenderlyZwick\",\"Benderly and Zwick Data: Inflation, Growth and Stock returns\",31,5,0,0,0,0,5,\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv\",\"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html\"")
Escape characters in regex
The parse tree you obtain contains this:
(... (:GREEDY-REPETITION 0 NIL #\s) ...)
There is a literal character #\s in your parse-tree. To understand why, let's define two auxiliary functions:
(defun chars (string)
  "Convert a string to a list of char names"
  (map 'list #'char-name string))

(defun test (s)
  (list :parse (chars s)
        :as (ppcre:parse-string s)))
For example, here is how the different strings below are parsed:
(test "s")
=> (:PARSE ("LATIN_SMALL_LETTER_S") :AS #\s)
(test "\s")
=> (:PARSE ("LATIN_SMALL_LETTER_S") :AS #\s)
(test "\\s")
=> (:PARSE ("REVERSE_SOLIDUS" "LATIN_SMALL_LETTER_S")
:AS :WHITESPACE-CHAR-CLASS)
Only in the last case, where the backslash (reverse solidus) is itself escaped, does the PPCRE parser see both the backslash and the following character #\s and interpret the pair as :WHITESPACE-CHAR-CLASS. In the second case, the Lisp reader has already turned \s into a plain s: inside a string literal, a backslash simply quotes the next character and is then discarded, so it never reaches the regex parser.
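The two stages (string literal first, regex parser second) can be mimicked outside Lisp. The sketch below is a Python illustration, not CL-PPCRE itself: lisp_read_string is a hypothetical helper that applies the Common Lisp rule for string literals (the backslash is dropped, the next character is kept as-is), so you can see which text the regex engine actually receives.

```python
def lisp_read_string(literal_body: str) -> str:
    """Mimic how the Common Lisp reader processes the inside of "..."."""
    out = []
    chars = iter(literal_body)
    for ch in chars:
        if ch == "\\":
            # The backslash is consumed; the following character is kept verbatim.
            out.append(next(chars, ""))
        else:
            out.append(ch)
    return "".join(out)


# What the regex engine sees for each way of typing the pattern in source code.
seen_single = lisp_read_string(r"(?:,\s*)?")    # typed with one backslash
seen_double = lisp_read_string(r"(?:,\\s*)?")   # typed with two backslashes
seen_quote = lisp_read_string(r"\"[^\"]+\"")    # escaped double quotes
```

With one backslash the engine receives (?:,s*)?, a literal s; with two it receives (?:,\s*)?. The escaped double quotes arrive as plain double quotes, which is why they were never the problem.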
I tend to work with parse trees directly because a lot of headaches with escaping go away (and in my opinion \Q and \E only make those headaches worse). A fixed parse tree is, for example, the following one, where I replaced the #\s with the desired keyword and removed the :register nodes that were not useful:
(:sequence
 (:alternation
  (:sequence #\"
             (:greedy-repetition 1 nil
              (:inverted-char-class #\"))
             #\")
  (:greedy-repetition 1 nil (:inverted-char-class #\,)))
 (:greedy-repetition 0 1
  (:group
   (:sequence #\,
              (:greedy-repetition 0 nil :whitespace-char-class)))))
Why the result is NIL
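Remember that you are trying to split the string with this regex, but the regex actually describes a field and the following comma. You get NIL because, as far as split is concerned, your string is nothing but a sequence of separators: every piece between separators is an empty string, and split drops trailing empty strings by default. It behaves like this example: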
(split #\, ",,,,,,")
NIL
With a simpler example, you can see that using words as separators gives:
(split "[a-z]+" "abc0def1z3")
=> ("" "0" "1" "3")
But if the separators also include digits, then the result is NIL:
(split "[a-z0-9]+" "abc0def1z3")
=> NIL
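The same effect can be reproduced in Python as a rough analogue. This is only a sketch: split_dropping_trailing_empties is a hypothetical helper, and it assumes that CL-PPCRE's split removes trailing empty strings when no limit is given, which is what the results above suggest.

```python
import re


def split_dropping_trailing_empties(pattern: str, text: str) -> list:
    """Split like re.split, then drop empty strings at the end of the result."""
    parts = re.split(pattern, text)
    while parts and parts[-1] == "":
        parts.pop()
    return parts


# Letters are separators: the digits between them survive.
words_as_separators = split_dropping_trailing_empties(r"[a-z]+", "abc0def1z3")

# Letters and digits are separators: the whole string is one big separator.
everything_is_separator = split_dropping_trailing_empties(r"[a-z0-9]+", "abc0def1z3")
```

When the pattern matches the entire input, only empty pieces remain on either side, and all of them count as trailing, so nothing is left.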
Looping over fields
With the regex you defined, it is easier to use do-register-groups. It is a loop construct that iterates over the string by trying to match the regex successively on the string, binding each (:register ...) in the regex to a variable.
If you put (:register ...) around the first (:alternation ...), you will sometimes capture the double quotes (first branch of the alternation):
(do-register-groups (field)
('(:SEQUENCE
(:register
(:ALTERNATION
(:SEQUENCE #\"
(:GREEDY-REPETITION 1 NIL
(:INVERTED-CHAR-CLASS #\"))
#\")
(:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,))))
(:GREEDY-REPETITION 0 1
(:GROUP
(:SEQUENCE #\,
(:GREEDY-REPETITION 0 NIL :whitespace-char-class)))))
*bads*)
(print field))
"\"AER\""
"\"BenderlyZwick\""
"\"Benderly and Zwick Data: Inflation, Growth and Stock returns\""
"31"
"5"
"0"
"0"
"0"
"0"
"5"
"\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv\""
"\"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html\""
Another option is to add two :register nodes, one for each branch of the alternation; that means binding two variables, one of them being NIL for each successful match:
(do-register-groups (quoted simple)
('(:SEQUENCE
(:ALTERNATION
(:SEQUENCE #\"
(:register ;; <- quoted (first register)
(:GREEDY-REPETITION 1 NIL
(:INVERTED-CHAR-CLASS #\")))
#\")
(:register ;; <- simple (second register)
(:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,))))
(:GREEDY-REPETITION 0 1
(:GROUP
(:SEQUENCE #\,
(:GREEDY-REPETITION 0 NIL :whitespace-char-class)))))
*bads*)
(print (or quoted simple)))
"AER"
"BenderlyZwick"
"Benderly and Zwick Data: Inflation, Growth and Stock returns"
"31"
"5"
"0"
"0"
"0"
"0"
"5"
"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv"
"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html"
Inside the loop you could push each field into a list or a vector to be processed later.
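For comparison, here is the two-register idea as a Python sketch, collecting the fields into a list. It is an analogue rather than CL-PPCRE code: FIELD and fields are names made up for this example, and the sample row is shortened from *bads*.

```python
import re

# Group 1 captures the inside of a quoted field, group 2 an unquoted field.
# The trailing non-capturing group swallows the separator and any spaces.
FIELD = re.compile(r'(?:"([^"]+)"|([^,]+))(?:,\s*)?')


def fields(line: str) -> list:
    """Collect the fields of one line, stripping the surrounding quotes."""
    result = []
    for match in FIELD.finditer(line):
        quoted, simple = match.groups()
        # Exactly one of the two groups participates in each match.
        result.append(quoted if quoted is not None else simple)
    return result


row = fields('"AER","Inflation, Growth",31,5')
```

In this sketch an empty field such as the middle one in a,,b is silently skipped, because neither branch of the alternation can match an empty string. That is one of the cases a regex tailored to known inputs may miss.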

Related

trying to use cl-lexer on a file containing "{" and "}"

Using the file "test-lexer.lisp", I have very slightly modified *lex* to be
(defparameter *lex* (test-lexer "{ 1.0 12 fred 10.23e12"))
and increased the number of times test repeats to 6
(defun test ()
(loop repeat 6
collect (multiple-value-list (funcall *lex*))))
and tried modifying test-lexer in a number of ways to try to get it to recognize "{" as a token.
For example, adding [:punct:] in (deflexer test-lexer ...)
by changing
("[:alpha:][:alnum:]*"
(return (values 'name %0)))
to
("[:alpha:][:alnum:][:punct:]*"
(return (values 'name %0)))
and consistently get errors like
"""Lexer unable to recognize a token in "{ 1.0 12 fred 10.23e12", position 0 ("{ 1.0 12 fred 10.23e")
[Condition of type SIMPLE-ERROR]"""
How can I specify "{" as a character to be recognized? Or is my problem elsewhere?
The cl-lexer system is based on regular expressions, so you can put any literal character to stand for itself, like {. But it happens that the brace character has a special meaning in the regular expression language, so you need to quote it with a backslash. In order to write a backslash in Lisp strings, backslashes need to be escaped. Hence:
(deflexer test-lexer
("\\{" (return (values :grouping :open-brace))) ;; <-- Here
("[0-9]+([.][0-9]+([Ee][0-9]+)?)"
(return (values 'flt (num %0))))
("[0-9]+"
(return (values 'int (int %0))))
("[:alpha:][:alnum:]*"
(return (values 'name %0)))
("[:space:]+"))
I return the :open-brace value and the :grouping category, but you can choose to return something else if you want.
The test function then returns:
((:GROUPING :OPEN-BRACE) (FLT 1.0) (INT 12)
(NAME "fred") (FLT 1.023e13) (NIL NIL))
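To show the rule-ordering idea in isolation, here is a small Python sketch of a lexer with the same rules. It is not cl-lexer: RULES and tokenize are invented for this illustration, and the character classes are written out explicitly instead of using [:alpha:] and friends.

```python
import re

# (pattern, handler) pairs, tried in order at the current position.
RULES = [
    (re.compile(r"\{"), lambda text: ("grouping", "open-brace")),  # escaped brace
    (re.compile(r"[0-9]+[.][0-9]+([Ee][0-9]+)?"), lambda text: ("flt", float(text))),
    (re.compile(r"[0-9]+"), lambda text: ("int", int(text))),
    (re.compile(r"[A-Za-z][A-Za-z0-9]*"), lambda text: ("name", text)),
    (re.compile(r"\s+"), None),  # whitespace produces no token
]


def tokenize(source: str) -> list:
    """Return the list of tokens, or raise if no rule matches somewhere."""
    tokens, pos = [], 0
    while pos < len(source):
        for pattern, handler in RULES:
            match = pattern.match(source, pos)
            if match:
                if handler is not None:
                    tokens.append(handler(match.group()))
                pos = match.end()
                break
        else:
            # No rule matched: the same situation as the error in the question.
            raise ValueError(f"unable to recognize a token at position {pos}")
    return tokens


tokens = tokenize("{ 1.0 12 fred 10.23e12")
```

Without the first rule, the loop fails at position 0, just as the original lexer did. A closing brace would need a rule of its own.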

Updating one map in a vector of maps in clojure

I have done a lot of searching on this and not found an answer to my specific problem. As background I am taking a coding bootcamp on Java and we are learning JavaScript and Clojure alongside Java. The course content is roughly Java: 50% / JavaScript 30% / Clojure 20%. Having advanced beyond my classmates I am tackling a challenge from my instructor.
I have built an app in Clojure to manage an animal shelter. I am using a vector of hashmaps as my central data store. Things that I have succeeded in doing:
Loading sample data from a hard coded array of hashmaps.
Creating a new animal and adding its hashmap to the array.
Listing the entire "table" of animals (data from one hashmap per line)
Displaying a multiline detailed view of one animal's record.
What I am struggling with at the moment is my edit function. It is supposed to display the chosen animal's existing data and take any edits the user wants to make then update a working copy of the hashmap and finally update the working copy of the array.
(defn edit-animals [animals]
;; ask which animal to edit index is 1 based from the UI
(let [index (wait-for-int "Please enter the idex of the animal to edit. Enter 0 to display the list of animals" 0 (count animals))]
;; if the user typed a valid index
(if (> index 0)
;; then edit the animal
(let [index (- index 1) ;; index is 0 based for indexing the array
animals animals
animal (nth animals index)
;; prompt with existing data and ask for new data
name (wait-for-string (str "Name: " (:name animal) "\n") false)
species (wait-for-string (str "Species: " (:species animal "\n")) false)
breed (wait-for-string (str "Breed: " (:breed animal "\n")) false)
description (wait-for-string (str "Description: " (:description animal "\n")) false)
;; check for null returns from user
name2 (if (= name "") (:name animal) name)
species2 (if (= species "") (:species animal) species)
breed2 (if (= breed "") (:breed animal) breed)
description2 (if (= description "") (:description animal) description)
;; update local copy of animal
animal (assoc animal :name name2 :species species2 :breed breed2 :description description2)]
;; assoc fails to update animals
;; assoc-in crashes at runtime
animals (assoc animals index animal))
;;(wait-for-enter "\nPress enter to return to the menu")
;; else dispolay the list of animals
(display-animals animals)))
animals)
I have run this code in my debugger and verified that everything is working as expected up to the line:
animal (assoc animal :name name2 :species species2 :breed breed2 :description description2)
The next line fails in one of two ways as I have documented in the comments.
I am aware that an atom may be a better way to do this, but so far the vector of maps that is constantly passed around is working, so I would like to find a solution to my current problem that does not involve using an atom. Once I get this problem solved, I plan to switch the project to an atomic data structure. But that is a project for another day.
If I have missed a relevant discussion here, please point me in the right direction!
The line:
animals (assoc animals index animal))
does not do what you think it does -- it is not inside the let binding vector.
First of all, good job: you asked the question the right way (with examples, etc.). Congratulations on your coding course. My recommendation would be to keep doing what you're doing, and to learn to think in Clojure in a very different way than you think in Java. They are both worthwhile, just different in approach. In your code, you are doing numerous "temporary assignments" (name2, for example). The fact that your let binding vector has 12 pairs is a red flag that you're doing too much. The second pair in the let, animals animals, is particularly strange.
Instead, try to think about the evaluation of expressions rather than the assignment of values. name2 = name1 + ... is a statement, not an expression: it does not evaluate to anything. In a declarative language, by contrast, almost everything is an expression. In the code below (which is just an extension of what you've done and not necessarily how I'd do it from scratch), note that no local bindings are re-bound (nothing is "assigned to" more than once). let allows us to lexically bind name to the result of an expression, and then we use name to achieve something else. not-empty returns nil for an empty string, so combining it with or lets us avoid the name and name2 pair that you used.
(def animals [{:name "John" :species "Dog" :breed "Pointer" :description "Best dog"}
{:name "Bob" :species "Cat" :breed "Siamese" :description "Exotic"}])
(defn edit-animals
  [animals]
  ;; A plain let plus a bounds check: (dec n) is always truthy, so if-let
  ;; could not tell a valid index from the -1 produced by entering 0.
  (let [index (dec (wait-for-int "Please enter the index of the animal to edit. Enter 0 to display the list of animals" 0 (count animals)))]
    (if (contains? animals index)
      (let [animal (nth animals index)
            name (or (not-empty (wait-for-string (str "Name: " (:name animal) "\n") false))
                     (:name animal))
            species (or (not-empty (wait-for-string (str "Species: " (:species animal) "\n") false))
                        (:species animal))
            breed (or (not-empty (wait-for-string (str "Breed: " (:breed animal) "\n") false))
                      (:breed animal))
            description (or (not-empty (wait-for-string (str "Description: " (:description animal) "\n") false))
                            (:description animal))
            edited {:name name :species species :breed breed :description description}]
        (println "showing new animal: " edited)
        (assoc animals index edited))
      animals)))
(def animals (edit-animals animals))
Note that this does not really achieve much other than restructuring your code. It still does too much, and is not a good example of how a function should do one thing well. But I think your goal for now should be to write more idiomatic Clojure and get away from the imperative mentality. After you do that, you can focus on the design part of it.
Keep up the good work and ask any more questions you have!
In short, you have to return the updated animals from your innermost let (if index > 0); otherwise, display the current animals and return them. So it would look like this:
(defn edit-animals [animals]
  (let [index ...]
    (if (> index 0)
      ;; for an acceptable index, query for input, modify animals, and return the modified collection
      (let [...]
        (assoc animals index animal))
      ;; otherwise display the initial data and return it
      (do (display-animals animals)
          animals))))
But I would restructure the code further to make it more idiomatic Clojure. First of all, I would extract the step that updates an animal from user input into a standalone function (upd-with-input), which removes the name, name2, breed, breed2... bindings that clutter the let, and I would replace assoc with update. Something like this:
(defn edit-animals [animals]
(letfn [(upd-with-input [animal field prompt]
(let [val (wait-for-string (str prompt (field animal) "\n") false)]
(if (clojure.string/blank? val)
animal
(assoc animal field val))))]
(let [index (dec (wait-for-int "enter index" 0 (count animals)))]
(if (contains? animals index)
(update animals index
(fn [animal]
(-> animal
(upd-with-input :name "Name: ")
(upd-with-input :species "Species: ")
(upd-with-input :breed "Breed: ")
(upd-with-input :description "Description: "))))
(do (when (== -1 index) (display-animals animals))
animals)))))
Then I would think about separating the part that collects user input from the part that actually updates the animals collection.
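That separation can be sketched as two pure functions that know nothing about prompting. The sketch is in Python and is only an analogue of the Clojure code: apply_edits and update_at are hypothetical names, and a blank answer is assumed to mean "keep the old value", as in the question.

```python
def apply_edits(animal: dict, edits: dict) -> dict:
    """Return a copy of animal where non-blank edits replace existing values."""
    kept = {key: value for key, value in edits.items() if value.strip()}
    return {**animal, **kept}


def update_at(animals: list, index: int, edits: dict) -> list:
    """Return a new list with the animal at index edited; the input is untouched."""
    if not 0 <= index < len(animals):
        return animals
    return animals[:index] + [apply_edits(animals[index], edits)] + animals[index + 1:]


animals = [
    {"name": "John", "species": "Dog", "breed": "Pointer", "description": "Best dog"},
    {"name": "Bob", "species": "Cat", "breed": "Siamese", "description": "Exotic"},
]

# Edits as they might come back from the prompts: blank means "keep the old value".
edited = update_at(animals, 1, {"name": "Robert", "species": "", "breed": " ", "description": ""})
```

The caller decides what to do with the returned collection, so the prompting code can change without touching the update logic, and the update logic can be tested without any user input.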
I think you want to delete animals from
animals (assoc animals index animal))
at the end of the if statement and instead return the result of the function
(assoc animals index animal)
or display them
(display-animals (assoc animals index animal))

How to prevent close!-ing before put-ing in onto-chan

I'd like to run a code like
(->> input
(partition-all 5)
(map a-side-effect)
dorun)
asynchronously, separating the input from the output (a-side-effect).
So I wrote the experimental code below.
;; using boot-clj
(set-env! :dependencies '[[org.clojure/core.async "0.2.374"]])
(require '[clojure.core.async :as async :refer [<! <!! >! >!!]])
(let [input (range 18)
c (async/chan 1 (comp (partition-all 5)
(map prn)))]
(async/onto-chan c input false)
(async/close! c))
Explanation of this code:
The elements of input and their number are not actually known before running; elements become available a few at a time (between 0 and 10 at once).
async/onto-chan is used to put a seq of elements (a fragment of input) into the channel c. It will be called many times, which is why its third argument is false.
prn is a substitute for a-side-effect.
I expected the code above prints
[0 1 2 3 4]
[5 6 7 8 9]
[10 11 12 13 14]
[15 16 17]
in the REPL; however, it prints nothing.
Then I added a wait, like this:
(let [c (async/chan 1 (comp (partition-all 5)
(map prn)))]
(async/onto-chan c (range 18) false)
(Thread/sleep 1000) ;wait
(async/close! c))
This code gave the expected output above.
Then I inspected core.async/onto-chan.
I think this is what happened:
The channel c was closed by core.async/close! in my code.
Each item passed to core.async/onto-chan was put (core.async/>!) in vain by the go-loop inside onto-chan, because the channel c was already closed.
Is there a reliable way to put the items before close!-ing?
Should I write a synchronous version of onto-chan that does not use go-loop?
Or is my idea wrong?
Your second example with Thread.sleep only ‘works’ by mistake.
The reason it works is that every transformed result value that comes out of c’s transducer is nil, and since nils are not allowed in channels, an exception is thrown, and no value is put into c: this is what allows the producer onto-chan to continue putting into the channel and not block waiting. If you paste your second example into the REPL you’ll see four stack traces – one for each partition.
The nils are of course due to mapping over prn, which is a side-effecting function that returns nil for all inputs.
If I understand your design correctly, your goal is to do something like this:
(defn go-run! [ch proc]
(async/go-loop []
(when-let [value (<! ch)]
(proc value)
(recur))))
(let [input (range 18)
c (async/chan 1 (partition-all 5))]
(async/onto-chan c input)
(<!! (go-run! c prn)))
You really do need a producer and a consumer, else your program will block. I’ve introduced a go-loop consumer.
Very generally speaking, map and side-effects don’t go together well, so I’ve extracted the side-effecting prn into the consumer.
onto-chan cannot be called ‘many times’ (at least in the code shown) so it doesn’t need the false argument.
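The producer and consumer split can also be shown with plain threads. The following Python sketch is an analogue, not core.async: a bounded queue stands in for the channel, the CLOSED sentinel stands in for close!, and producer and consumer are names invented for the example.

```python
import queue
import threading

CLOSED = object()  # sentinel standing in for "channel closed"


def producer(items, size, out):
    """Put chunks of at most `size` items, then signal that no more will come."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            out.put(chunk)
            chunk = []
    if chunk:
        out.put(chunk)  # flush the final, possibly shorter, chunk
    out.put(CLOSED)     # "close" only after every put has completed


def consumer(source, side_effect):
    """Apply side_effect to each chunk until the sentinel arrives."""
    while True:
        chunk = source.get()
        if chunk is CLOSED:
            return
        side_effect(chunk)


channel = queue.Queue(maxsize=1)
seen = []
worker = threading.Thread(target=consumer, args=(channel, seen.append))
worker.start()
producer(range(18), 5, channel)
worker.join()
```

The ordering is the point: the close signal travels through the same queue as the data, so it cannot overtake a pending put, and the side effect lives in the consumer rather than in a mapping step.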
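Taking megakorre's idea, wait on the channel returned by onto-chan (it closes once all items have been copied) before closing c: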
(let [c (async/chan 1 (comp (partition-all 5)
(map prn)))
put-ch (async/onto-chan c (range 18) false)]
(async/alts!! [put-ch])
(async/close! c))

char representation clojure

How can I represent a char (character) in clojure?
Also I would like an example to test it using the char? function
(println (char? 1))
(println (char? (char 'a')))
Use a backslash for representing an individual character. For instance:
(char? \a)
returns true

Create a variable name from a string in Lisp

I'm trying to take a string and convert it into a variable name. I thought (make-symbol) or (intern) would do this, but apparently it's not quite what I want, or I'm using it incorrectly.
For example:
> (setf (intern (string "foo")) 5)
> foo
5
Here I would be trying to create a variable named 'foo' with a value of 5. Except, the above code gives me an error. What is the command I'm looking for?
There are a number of things to consider here:
SETF does not evaluate its first argument. It expects a symbol or a form that specifies a location to update. Use SET instead.
Depending upon the vintage and settings of your Common Lisp implementation, symbol names may default to upper case. Thus, the second reference to foo may actually refer to a symbol whose name is "FOO". In this case, you would need to use (intern "FOO").
The call to STRING is harmless but unnecessary if the value is already a string.
Putting it all together, try this:
> (set (intern "FOO") 5)
> foo
5
Use:
CL-USER 7 > (setf (SYMBOL-VALUE (INTERN "FOO")) 5)
5
CL-USER 8 > foo
5
This also works with a variable:
CL-USER 11 > (let ((sym-name "FOO"))
(setf (SYMBOL-VALUE (INTERN sym-name)) 3))
3
CL-USER 12 > foo
3
Remember also that, by default, the reader converts symbol names to uppercase. So if you want to access such a symbol via a string, you have to use an uppercase string.
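The case issue is easier to see with a toy model. The Python sketch below is only an analogy for how a symbol table and an upcasing reader interact: Symbol, Package, and their methods are invented here and are not how any Lisp implementation is actually written.

```python
class Symbol:
    def __init__(self, name):
        self.name = name
        self.value = None


class Package:
    """A toy symbol table: one Symbol object per distinct name string."""

    def __init__(self):
        self._symbols = {}

    def intern(self, name):
        # Names are compared exactly; no case conversion happens here.
        return self._symbols.setdefault(name, Symbol(name))

    def read(self, token):
        # The default reader upcases a token before interning it.
        return self.intern(token.upper())


package = Package()
package.intern("FOO").value = 5      # like (setf (symbol-value (intern "FOO")) 5)
typed_foo = package.read("foo")      # like typing foo at the REPL
lower_foo = package.intern("foo")    # like (intern "foo"), a different symbol
```

Typing foo finds the symbol named "FOO" and its value, while interning the lowercase string creates a separate symbol with no value.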
