pyparsing for querying a database of chemical elements - pyparsing

I would like to parse a query for a database of chemical elements.
The database is stored in an XML file. Parsing that file produces a nested dictionary that is stored in a singleton object that inherits from collections.OrderedDict.
Asking for an element gives me an ordered dictionary of its properties
(e.g. ELEMENTS['C'] --> {'name':'carbon','neutron' : 0,'proton':6, ...}).
Conversely, asking for a property gives me an ordered dictionary of its values for all the elements (e.g. ELEMENTS['proton'] --> {'H' : 1, 'He' : 2, ...}).
A typical query could be:
mass > 10 or (nucleon < 20 and atomic_radius < 5)
where each 'subquery' (e.g. mass > 10) returns the set of elements that match it.
The query is then converted internally into a string that is evaluated later to produce the set of indexes of the elements that matched it. In that context the and/or operators are not boolean operators but set operators that act upon Python sets.
I recently posted a question about building such a query. Thanks to the useful answers I got, I think I have more or less done the job (in a nice way, I hope!), but I still have some questions related to pyparsing.
Here is my code:
import numpy
from pyparsing import *
# This imports a singleton object storing the database dictionary
# described earlier
from ElementsDatabase import ELEMENTS
and_operator = oneOf(['and','&'], caseless=True)
or_operator = oneOf(['or' ,'|'], caseless=True)
# ELEMENTS.properties is a property getter that returns the list of
# registered properties in the database
props = oneOf(ELEMENTS.properties, caseless=True)
# A property keyword can be quoted or not.
props = Suppress('"') + props + Suppress('"') | props
# When parsed, it must be replaced by the following expression that
# will be eval later.
props.setParseAction(lambda t : "numpy.array(ELEMENTS['%s'].values())" % t[0].lower())
quote = QuotedString('"')
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
float_ = Regex(r'[+-]?(\d+(\.\d*)?)?([eE][+-]?\d+)?').setParseAction(lambda t:float(t[0]))
comparison_operator = oneOf(['==','!=','>','>=','<', '<='])
comparison_expr = props + comparison_operator + (quote | float_ | integer)
comparison_expr.setParseAction(lambda t : "set(numpy.where(%s)%s%s)" % tuple(t))
grammar = Combine(operatorPrecedence(comparison_expr, [(and_operator, 2, opAssoc.LEFT), (or_operator, 2, opAssoc.LEFT)]))
# A test query
res = grammar.parseString('"mass " > 30 or (nucleon == 1)',parseAll=True)
print eval(' '.join(res._asStringList()))
My questions are the following:
1 Using 'transformString' instead of 'parseString' never triggers any
exception, even when the string to be parsed does not match the grammar.
However, it is exactly the functionality I need. Is there a way to do so?
2 I would like to reintroduce white space between my tokens so
that my eval does not fail. The only way I found to do so is the one
implemented above. Do you see a better way using pyparsing?
Sorry for the long post, but I wanted to describe its context in detail. BTW, if you find this approach bad, do not hesitate to tell me!
Thank you very much for your help.
Eric

Do not worry about my concern, I found a workaround. I used the SimpleBool.py example shipped with pyparsing (thanks for the hint, Paul).
Basically, I used the following approach:
1 for each subquery (e.g. mass > 10), using the setParseAction method,
I attached a function that returns the set of elements that matched
the subquery
2 then, I attached the following functions to the logical operators (and,
or and not):
def not_operator(token):
    _, s = token[0]
    # ELEMENTS is the singleton described in my original post
    return set(ELEMENTS.keys()).difference(s)
def and_operator(token):
    s1, _, s2 = token[0]
    # Set intersection; the `and` keyword would only return one of the operands
    return s1 & s2
def or_operator(token):
    s1, _, s2 = token[0]
    # Set union; the `or` keyword would only return one of the operands
    return s1 | s2
# Thanks to Paul for the hint.
grammar = operatorPrecedence(comparison_expr,
                             [(not_token, 1, opAssoc.RIGHT, not_operator),
                              (and_token, 2, opAssoc.LEFT, and_operator),
                              (or_token, 2, opAssoc.LEFT, or_operator)])
Please note that these operators act upon Python sets rather than
on booleans.
And that does the job.
I hope this approach will help some of you.
Eric
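The set semantics that the operator callbacks need can be sketched without pyparsing. This is only an illustration under stated assumptions: the small `ELEMENTS` dictionary, its values, and the `subquery` helper are hypothetical stand-ins, not the poster's actual database or code. It shows each subquery producing a set, and how the `&`, `|` and `-` set operators differ from the `and` / `or` keywords when applied to sets.

```python
import operator

# Hypothetical stand-in for the ELEMENTS singleton: property -> {symbol: value}.
ELEMENTS = {
    "mass": {"H": 1.008, "He": 4.003, "C": 12.011, "Na": 22.990},
    "proton": {"H": 1, "He": 2, "C": 6, "Na": 11},
}
ALL_SYMBOLS = set(ELEMENTS["proton"])

COMPARATORS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}


def subquery(prop, op, value):
    """Return the set of element symbols whose property satisfies the comparison."""
    compare = COMPARATORS[op]
    return {symbol for symbol, v in ELEMENTS[prop].items() if compare(v, value)}


heavy = subquery("mass", ">", 10)
few_protons = subquery("proton", "<", 7)

both = heavy & few_protons        # set intersection
either = heavy | few_protons      # set union
neither = ALL_SYMBOLS - either    # set difference, the "not" case

# The keywords only test truthiness and hand back one operand unchanged.
keyword_and = heavy and few_protons
keyword_or = heavy or few_protons
```

With non-empty operands, `heavy and few_protons` evaluates to `few_protons` and `heavy or few_protons` evaluates to `heavy`, so neither keyword computes an intersection or a union.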

Related

Runtime error: dictionary changed size during iteration

I iterate through the items of a dictionary, var_dict.
As I iterate in a for loop, I need to update the dictionary.
I understand that this is not possible and that it triggers the runtime error I experienced.
My question is: do I need to create a different dictionary to store the data? As it is now, I am trying to use the same dictionary with different keys.
I know the problem comes from iterating through the keys and values of a dictionary while attempting to change it. I want to know whether the best option in this case is to create a separate dictionary.
for k, v in var_dict.items():
    match = str(match)
    match = match.strip("[]")
    match = match.strip("''")
    result = [index for index, value in enumerate(v) if match in value]
    result = str(result)
    result = result.strip("[]")
    result = result.strip("'")
    #====> IF I print(var_dict), at this point I have no error *********
    if result == "0":
        #It means a match between interface on RP PSE2 model was found; Interface position is on PSE2 architecture
        print (f'PSE-2 Line cards:{v} Interfaces on PSE2:{entry} Interface PortID:{port_id}')
        port_id = int(port_id)
        print(port_id)
        if port_id >= 19:
            #print(f'interface:{entry} portID={port_id} CPU_POS={port_cpu_pos} REPLICATION=YES')
            if_info = [entry,'PSE2=YES',port_id,port_cpu_pos,'REPLICATION=YES']
            var_dict['IF_PSE2'].append(if_info)
            #===> *** This is the point that if i attempt to print var_dict, I get the Error during olist(): dictionary changed size during iteration
        else:
            #print(f'interface:{entry},portID={port_id} CPU_POS={port_cpu_pos} REPLICATION=NO')
            if_info = [entry,'PSE2=YES',port_id,port_cpu_pos,'REPLICATION=NO']
            var_dict['IF_PSE2'].append(if_info)
    else:
        #it means the interface is on single PSE. No replication is applicable. Just check threshold between incoming and outgoing rate.
        if_info = [entry,'PSE2=NO',int(port_id),port_cpu_pos,'REPLICATION=NO']
        var_dict['IF_PSE1'].append(if_info)
I did a shallow copy and that allowed me to iterate a dictionary copy and make modifications to the original dictionary. Problem solved. Thanks.
(...)
temp_var_dict = var_dict.copy()
for k, v in temp_var_dict.items():
(...)
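A minimal sketch of the shallow-copy approach, using a made-up dictionary (the keys and the `_x2` suffix are illustrative assumptions, not the poster's data). The first function inserts keys while iterating over the live dictionary and raises the RuntimeError; the second iterates over a copy, so insertions into the original are safe.

```python
def add_doubled_keys_unsafe(data):
    # Inserting new keys while iterating over the live items() view
    # changes the dictionary size and raises RuntimeError.
    for key, value in data.items():
        data[key + "_x2"] = value * 2
    return data


def add_doubled_keys_safe(data):
    # Iterate over a shallow copy; the insertions go to the original dict.
    for key, value in data.copy().items():
        data[key + "_x2"] = value * 2
    return data


try:
    add_doubled_keys_unsafe({"a": 1, "b": 2})
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

result = add_doubled_keys_safe({"a": 1, "b": 2})
```

Iterating over `list(data.items())` would work equally well here. A shallow copy duplicates only the top-level mapping, so list values are still shared between the copy and the original.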

OCaml - global vs local variable

I wanted to create a global variable called result that uses 5 string concatenations to create a string containing 9 times the string start, separated by commas.
I have two pieces of code, only the second one declares a global variable.
For some reason it's not registering easily in my brain... Is it just that I used a let in, so result in the first piece of code is a local variable? Is there a more detailed explanation for this?
let start = "ab";;
let result = start ^ "," in
let result = result ^ result in
let result = result ^ result in
let result = result ^ result in
let result = result ^ start in
result;;
- : string = "ab,ab,ab,ab,ab,ab,ab,ab,ab"
let result =
let result = start ^ "," in
let result = result ^ result in
let result = result ^ result in
let result = result ^ result in
let result = result ^ start in
result;;
val result : string = "ab,ab,ab,ab,ab,ab,ab,ab,ab"
Let me be a little bit boring. There are no local and global variables in OCaml. This concept came from languages with different scoping rules. Also, the word "variable" itself should be taken with care. Its meaning was perverted by C-like languages. The original, mathematical meaning of this word is the name of some mathematical object that is used inside a formula and represents a range of such values. In C-like languages, a variable is confused with a memory cell that can change over time. So, to avoid the confusion, let's use more accurate terminology and say name instead of variable.
Since variables... sorry, names are not memory cells, there is nothing to create. When you use one of the let syntaxes, you are actually creating a binding, i.e., an association between a name and a value. The let <name> = <expr-1> in <expr-2> form binds the value of <expr-1> to <name> in the scope of the <expr-2> expression. The let <name> = <expr-1> in <expr-2> form is itself also an expression, so, for example, <expr-2> can contain let ... in ... constructs inside, e.g.,
let a = 1 in
  let b = a + 1 in
    let c = b + 1 in
      a + b + c
I deliberately indented the code in a non-idiomatic way to highlight the syntactic structure of the expression. OCaml also allows you to reuse a name that is already bound in the scope. The new binding will hide the existing one (that is not allowed in C, for example), e.g.,
let a = a + 1 in
let a = a + 1 in
let a = a + 1 in
a + a + a
Finally, the top-level (aka module-level) let-binding (called a definition in OCaml parlance) has the syntax let <name> = <expr>; note that there is no in here. The definition binds <name> to the result of evaluating <expr> in the lexical scope that extends from the point of definition to the end of the enclosing module. When you're implementing a module, you must use let <name> = <expr> to bind your code to names (you may omit the name by using _). This is a little different from the interactive toplevel (the interactive ocaml program), which also accepts a bare expression and evaluates it. For example,
let result = start ^ "," in
let result = result ^ result in
let result = result ^ result in
let result = result ^ result in
let result = result ^ start in
result
is not a valid OCaml program (something that can be put into an ml file and compiled), because it is an expression, not a module definition.
Is it just that i used a let in so result in the first piece of code is a local variable?
Pretty much. The syntax to define a global variable is let variable = expression without an in. The syntax to define a local variable is let variable = expression in expression which will define variable local to the expression after the in.
When you have let ... in, you're declaring a local variable. When you have just let by itself (at the top level of a module), you're declaring a global name of the module. (That is, a name that can be exported from the module.)
Your first example consists entirely of let ... in. So there is no top-level name declared.
Your second example has one let by itself, followed by several occurrences of let ... in. So it declares a top-level name result.

idl: pass keyword dynamically to isa function to test structure read by read_csv

I am using IDL 8.4. I want to use the isa() function to check the types of the fields read by read_csv(). I want to use /number, /integer, /float and /string, because some fields must be float, others must be integer, and for others I don't care. I can do it like this, but it is not very readable:
str = read_csv(filename, header=inheader)
; TODO check header
if not isa(str.(0), /integer) then stop
if not isa(str.(1), /number) then stop
if not isa(str.(2), /float) then stop
I am hoping I can do something like
expected_header = ['id', 'x', 'val']
expected_type = ['/integer', '/number', '/float']
str = read_csv(filename, header=inheader)
if not array_equal(strlowcase(inheader), expected_header) then stop
for i=0l,n_elements(expected_type) do
if not isa(str.(i), expected_type[i]) then stop
endfor
The above doesn't work, as '/integer' is taken literally and I guess isa() is looking for a named structure. How can I do something similar?
Ideally I want to pick the expected type based on the header read from the file, so that the script still works as long as the header specifies the expected fields.
EDIT:
My tentative solution is to write a wrapper for ISA(). Not very pretty, but it does what I wanted... If there is a cleaner solution, please let me know.
Also, read_csv is defined to return only one of long, long64, double and string, so I could write a function that tests with this limitation. But I wanted to make it work in general so that I can reuse it for other similar cases.
function isa_generic,var,typ
; calls isa() http://www.exelisvis.com/docs/ISA.html with keyword
; if 'n', test /number
; if 'i', test /integer
; if 'f', test /float
; if 's', test /string
if typ eq 'n' then return, isa(var, /number)
if typ eq 'i' then return, isa(var, /integer)
if typ eq 'f' then return, isa(var, /float)
if typ eq 's' then return, isa(var, /string)
print, 'unexpected typename: ', typ
stop
end
IDL has some limited reflection abilities, which will do exactly what you want:
expected_types = ['integer', 'number', 'float']
expected_header = ['id', 'x', 'val']
str = read_csv(filename, header=inheader)
if ~array_equal(strlowcase(inheader), expected_header) then stop
foreach type, expected_types, index do begin
if ~isa(str.(index), _extra=create_struct(type, 1)) then stop
endforeach
It's debatable if this is really "easier to read" in your case, since there are only three cases to test. If there were 500 cases, it would be a lot cleaner than writing 500 slightly different lines.
This snippet uses some rather esoteric IDL features, so let me explain what's happening a bit:
expected_types is just a list of (string) keyword names in the order they should be used.
The foreach part iterates over expected_types, putting the keyword string into the type variable and the iteration count into index.
This is equivalent to using for index = 0, n_elements(expected_types) - 1 do and then using expected_types[index] instead of type, but the foreach loop is easier to read IMHO.
_extra is a special keyword that can pass a structure as if it were a set of keywords. Each of the structure's tags is interpreted as a keyword.
The create_struct function takes one or more pairs of (string) tag names and (any type) values, then returns a structure with those tag names and values.
Finally, I replaced not (bitwise not) with ~ (logical not). This step, like foreach vs for, is not necessary in this instance, but can avoid headaches when debugging some types of code, where the distinction matters.
--
Reflective abilities like these can do an awful lot, and come in super handy. They're workhorses in other languages, but IDL programmers don't seem to use them as much. Here's a quick list of common reflective features I use in IDL:
create_struct - Create a structure from (string) tag names and values.
n_tags - Get the number of tags in a structure.
_extra, _strict_extra, and _ref_extra - Pass keywords by structure or reference.
call_function - Call a function by its (string) name.
call_procedure - Call a procedure by its (string) name.
call_method - Call a method (of an object) by its (string) name.
execute - Run complete IDL commands stored in a string.
Note: Be very careful using the execute function. It will blindly execute any IDL statement you (or a user, file, web form, etc.) feed it. Never ever feed untrusted or web user input to the IDL execute function.
You can't access the keywords quite like that, but there is a typename parameter to ISA that might be useful. This is untested, but should work:
expected_header = ['id', 'x', 'val']
expected_type = ['int', 'long', 'float']
str = read_csv(filename, header=inheader)
if not array_equal(strlowcase(inheader), expected_header) then stop
for i = 0L, n_elements(expected_type) - 1L do begin
if not isa(str.(i), expected_type[i]) then stop
endfor

Extract nth element of a tuple

For a list, you can do pattern matching and iterate until the nth element, but for a tuple, how would you grab the nth element?
TL;DR; Stop trying to access directly the n-th element of a t-uple and use a record or an array as they allow random access.
You can grab the n-th element by unpacking the t-uple with value deconstruction, either by a let construct, a match construct or a function definition:
let ivuple = (5, 2, 1, 1)
let squared_sum_let =
let (a,b,c,d) = ivuple in
a*a + b*b + c*c + d*d
let squared_sum_match =
match ivuple with (a,b,c,d) -> a*a + b*b + c*c + d*d
let squared_sum_fun (a,b,c,d) =
a*a + b*b + c*c + d*d
The match-construct has here no virtue over the let-construct, it is just included for the sake of completeness.
Do not use t-uples, Don¹
There are only a few cases where using t-uples to represent a type is the right thing to do. Most of the time, we pick a t-uple because we are too lazy to define a type, and we should interpret the problem of accessing the n-th field of a t-uple or iterating over the fields of a t-uple as a serious signal that it is time to switch to a proper type.
There are two natural replacements to t-uples: records and arrays.
When to use records
We can see a record as a t-uple whose entries are labelled; as such, they are definitely the most natural replacement to t-uples if we want to access them directly.
type ivuple = {
a: int;
b: int;
c: int;
d: int;
}
We then access the field a of a value x of type ivuple directly by writing x.a. Note that records are easily copied with modifications, as in let y = { x with d = 0 }. There is no natural way to iterate over the fields of a record, mostly because a record does not need to be homogeneous.
When to use arrays
A large² homogeneous collection of values is adequately represented by an array, which allows direct access, iterating and folding. A possible inconvenience is that the size of an array is not part of its type, but for arrays of fixed size, this is easily circumvented by introducing a private type — or even an abstract type. I described an example of this technique in my answer to the question “OCaml compiler check for vector lengths”.
Note on float boxing
When using floats in t-uples, in records containing only floats and in arrays, these are unboxed. We should therefore not notice any performance modification when changing from one type to the other in our numeric computations.
¹ See the TeXbook.
² Large starts near 4.
Since the length of OCaml tuples is part of the type and hence known (and fixed) at compile time, you get the n-th item by straightforward pattern matching on the tuple. For the same reason, the problem of extracting the n-th element of an "arbitrary-length tuple" cannot occur in practice - such a "tuple" cannot be expressed in OCaml's type system.
You might still not want to write out a pattern every time you need to project a tuple, and nothing prevents you from generating the functions get_1_1...get_i_j... that extract the i-th element from a j-tuple for any possible combination of i and j occurring in your code, e.g.
let get_1_1 (a) = a
let get_1_2 (a,_) = a
let get_2_2 (_,a) = a
let get_1_3 (a,_,_) = a
let get_2_3 (_,a,_) = a
...
Not necessarily pretty, but possible.
Note: Previously I had claimed that OCaml tuples can have at most length 255 and you can simply generate all possible tuple projections once and for all. As @Virgile pointed out in the comments, this is incorrect - tuples can be huge. This means that it is impractical to generate all possible tuple projection functions upfront, hence the restriction "occurring in your code" above.
It's not possible to write such a function in full generality in OCaml. One way to see this is to think about what type the function would have. There are two problems. First, each size of tuple is a different type. So you can't write a function that accesses elements of tuples of different sizes. The second problem is that different elements of a tuple can have different types. Lists don't have either of these problems, which is why you can have List.nth.
If you're willing to work with a fixed size tuple whose elements are all the same type, you can write a function as shown by @user2361830.
Update
If you really have collections of values of the same type that you want to access by index, you should probably be using an array.
Here is a function which prints the source of the OCaml function you need to do that ;) Very helpful, I use it frequently.
let tup len n =
if n>=0 && n<len then
let rec rep str nn = match nn<1 with
|true ->""
|_->str ^ (rep str (nn-1))in
let txt1 ="let t"^(string_of_int len)^"_"^(string_of_int n)^" tup = match tup with |" ^ (rep "_," n) ^ "a" and
txt2 =","^(rep "_," (len-n-2)) and
txt3 ="->a" in
if n = len-1 then
print_string (txt1^txt3)
else
print_string (txt1^txt2^"_"^txt3)
else raise (Failure "Error") ;;
For example:
tup 8 6;;
prints:
let t8_6 tup = match tup with |_,_,_,_,_,_,a,_->a
and of course:
val t8_6 : 'a * 'b * 'c * 'd * 'e * 'f * 'g * 'h -> 'g = <fun>

OCaml: Does storing some values to be used later introduce "side effects"?

For a homework assignment, we've been instructed to complete a task without introducing any "side-effects". I've looked up "side-effects" on Wikipedia, and though I get that in theory it means "modifies a state or has an observable interaction with calling functions", I'm having trouble figuring out specifics.
For example, would creating a value that holds a non-compile time result be introducing side effects?
Say I had (might not be syntactically perfect):
val myList = (someFunction x y);;
if List.exists ((=) 7) myList then true else false;;
Would this introduce side-effects? I guess maybe I'm confused on what "modifies a state" means in the definition of side-effects.
No; a side-effect refers to e.g. mutating a ref cell with the assignment operator :=, or other things where the value referred to by a name changes over time. In this case, myList is an immutable value that never changes during the program, thus it is effect-free.
See also
http://en.wikipedia.org/wiki/Referential_transparency_(computer_science)
A good way to think about it is "have I changed anything which any later code (including running this same function again later) could ever possibly see other than the value I'm returning?" If so, that's a side effect. If not, then you can know that there isn't one.
So, something like:
let inc_nosf v = v+1
has no side effects because it just returns a new value which is one more than an integer v. So if you run the following code in the ocaml toplevel, you get the corresponding results:
# let x = 5;;
val x : int = 5
# inc_nosf x;;
- : int = 6
# x;;
- : int = 5
As you can see, the value of x didn't change. So, since we didn't save the return value, then nothing really got incremented. Our function itself only modifies the return value, not x itself. So to save it into x, we'd have to do:
# let x = inc_nosf x;;
val x : int = 6
# x;;
- : int = 6
That is because the inc_nosf function has no side effects: it only communicates with the outside world using its return value, not by making any other changes.
But something like:
let inc_sf r = r := !r+1
has side effects because it changes the value stored in the reference represented by r. So if you run similar code in the top level, you get this, instead:
# let y = ref 5;;
val y : int ref = {contents = 5}
# inc_sf y;;
- : unit = ()
# y;;
- : int ref = {contents = 6}
So, in this case, even though we still don't save the return value, it got incremented anyway. That means there must have been changes to something other than the return value. In this case, that change was the assignment using := which changed the stored value of the ref.
As a good rule of thumb, in OCaml, if you avoid using refs, records, classes, strings, arrays, and hash tables, then you will avoid any risk of side effects. You can, however, safely use string literals as long as you avoid modifying the string in place using functions like String.set or String.fill. Basically, any function which can modify a data type in place will cause a side effect.
