Parsing nested SQL queries with parenthesized predicates using pyparsing - pyparsing

I'm trying to parse nested queries whose predicates themselves contain parentheses. Example:
query = '(A LIKE "%.something.com" AND B = 4) OR (C In ("a", "b") AND D Contains "asdf")'
I've tried many of the answers/examples I've seen but without getting them to work, and this is what I have come up with so far that almost(?) works:
from pyparsing import *
r = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'*+,-./:;<=>?@[\]^_`{|}~'
any_keyword = CaselessKeyword("AND") | CaselessKeyword("OR")
non_keyword = ~any_keyword + Word(r)
expr = infixNotation(originalTextFor(non_keyword[1, ...]),
[
(oneOf("AND OR", caseless=True, asKeyword=True), 2, opAssoc.LEFT)
])
Then, running expr.parseString(query).asList() returns only:
[['A LIKE "%.something.com"', 'AND', 'B = 4']]
without the rest of the query.
As far as I understand, this is due to the C In ("a", "b") part, since there are parentheses there.
Is there a way to "disregard" parentheses inside the predicates so that parsing returns the expected answer:
[[['A LIKE "%.something.com"', 'AND', 'B = 4'], 'OR', ['C In ("a", "b")', 'AND', 'D Contains "asdf"']]]

Welcome to pyparsing! You've made some good headway from following other examples, but let's back up just a bit.
infixNotation is definitely the right way to go here, but there are many more operators in this expression than just AND and OR. These are all sub-expressions in your input string:
A LIKE "%.something.com"
B = 4
C in ("a", "b")
D contains "asdf"
Each of these is its own binary expression, with operators "LIKE", "=", "in" and "contains". Also, your operands can be not only identifiers, but quoted strings, a collection of quoted strings, and numbers.
I like your intuition that this is a logical expression of terms, so let's define 2 levels of infixNotations:
1. an expression of column or numeric "arithmetic" (using '=', "LIKE", etc.)
2. an expression of terms defined by #1, combined with logical NOT, AND, and OR operators
If we call #1 a column_expr, #2 would look similar to what you have already written:
expr = infixNotation(column_expr,
[
(NOT, 1, opAssoc.RIGHT),
(AND, 2, opAssoc.LEFT),
(OR, 2, opAssoc.LEFT),
])
I've added NOT as a reasonable extension to what you have now - most logical expressions include these 3 operators. Also, it is conventional to define them in this order of precedence, so that "X OR Y AND Z" eventually evaluates in the order "X OR (Y AND Z)", since AND has higher precedence than OR (and NOT is higher still).
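Python's own boolean operators happen to follow the same NOT > AND > OR convention, so the grouping described above can be checked without any parser at all. This is only a standard-library illustration of the precedence rule, not pyparsing code; the variable names are made up for the example.

```python
from itertools import product

# "x or y and z" should always agree with "x or (y and z)".
and_binds_tighter_than_or = all(
    (x or y and z) == (x or (y and z))
    for x, y, z in product([False, True], repeat=3)
)

# "not x and y" should always agree with "(not x) and y".
not_binds_tighter_than_and = all(
    (not x and y) == ((not x) and y)
    for x, y in product([False, True], repeat=2)
)

# One input where grouping the OR first would change the answer.
x, y, z = True, False, False
default_grouping = x or y and z      # same as x or (y and z)
or_first_grouping = (x or y) and z

print(and_binds_tighter_than_or, not_binds_tighter_than_and)  # True True
print(default_grouping, or_first_grouping)                    # True False
```

Listing the operators in infixNotation from highest to lowest precedence is what gives the parsed query this same grouping.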
#1 takes some more work, so I've written a little BNF for what we should expect for an individual operand for column_expr (I cannot recommend taking this step highly enough!):
identifier ::= one or more Words composed of printable letters (we may come back to this)
number ::= an integer or real number (we can use the one defined in pyparsing_common.number)
quotedString ::= (a quoted string like the one defined by pyparsing)
quotedString_list ::= '(' quotedString [',' quotedString]... ')'
# put identifier last, since it will match just about anything, and we want to try the other
# expressions first
column_operand ::= quotedString | quotedString_list | number | identifier
Then a column_expr will be an infixNotation using these column_operands:
ppc = pyparsing_common
LPAR, RPAR = map(Suppress, "()")
# use Group to keep these all together
quotedString_list = Group(LPAR + delimitedList(quotedString) + RPAR)
column_operand = quotedString | quotedString_list | ppc.number | identifier
column_expr = infixNotation(column_operand,
[
(IN | CONTAINS, 2, opAssoc.LEFT),
(LIKE, 2, opAssoc.LEFT),
('=', 2, opAssoc.LEFT),
])
If you find you have to add other operators like "!=", most likely you will add them in to column_expr.
Some other notes:
You probably want to remove ' and " from r, since they should really be handled as part of the quoted strings
As your list of keywords grows, you will find it easier to define them using something like this:
AND, OR, NOT, LIKE, IN, CONTAINS = keyword_exprs = list(map(CaselessKeyword, """
AND OR NOT LIKE IN CONTAINS
""".split()))
any_keyword = MatchFirst(keyword_exprs)
Then you can reference them more easily as I have done in the code above.
Write small tests first before trying to test the complex query you posted. Nice work in including many of the variations of operands. Then use runTests to run them all, like this:
expr.runTests("""
A LIKE "%.something.com"
B = 4
C in ("A", "B")
D CONTAINS "ASDF"
(A LIKE "%.something.com" AND B = 4) OR (C In ("a", "b") AND D Contains "asdf")
""")
With these changes, I get this for your original query string:
[[['A', 'LIKE', '"%.something.com"'], 'AND', 'B = 4'], 'OR', [['C', 'IN', ['"a"', '"b"']], 'AND', ['D', 'CONTAINS', '"asdf"']]]
Hmmm, I'm not keen on a term that looks like 'B = 4', now that we are actually parsing the sub expressions. I suspect it is because your definition of identifier is a little too aggressive. If we cut it back to just ~any_keyword + Word(alphas, r), forcing a leading alpha character and without the [1, ...] for repetition, then we get the better-looking:
[[['A', 'LIKE', '"%.something.com"'], 'AND', ['B', '=', 4]], 'OR', [['C', 'IN', ['"a"', '"b"']], 'AND', ['D', 'CONTAINS', '"asdf"']]]
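For reference, here is one way the pieces discussed so far might be assembled into a single script. It is a sketch rather than the definitive grammar: the identifier definition (a leading letter followed by letters, digits, or underscores) is an assumption standing in for Word(alphas, r), and the legacy camelCase pyparsing names are used to match the rest of this post.

```python
from pyparsing import (
    CaselessKeyword, Group, MatchFirst, Suppress, Word, alphanums, alphas,
    delimitedList, infixNotation, opAssoc, pyparsing_common as ppc,
    quotedString,
)

# Keywords, defined in one place so they are easy to extend.
AND, OR, NOT, LIKE, IN, CONTAINS = keyword_exprs = list(
    map(CaselessKeyword, "AND OR NOT LIKE IN CONTAINS".split())
)
any_keyword = MatchFirst(keyword_exprs)

# Assumption: identifiers start with a letter and are never keywords.
identifier = ~any_keyword + Word(alphas, alphanums + "_")

LPAR, RPAR = map(Suppress, "()")
quotedString_list = Group(LPAR + delimitedList(quotedString) + RPAR)
column_operand = quotedString | quotedString_list | ppc.number | identifier

# Level 1: comparisons between columns and values.
column_expr = infixNotation(column_operand, [
    (IN | CONTAINS, 2, opAssoc.LEFT),
    (LIKE, 2, opAssoc.LEFT),
    ("=", 2, opAssoc.LEFT),
])

# Level 2: logical combinations of level-1 expressions.
expr = infixNotation(column_expr, [
    (NOT, 1, opAssoc.RIGHT),
    (AND, 2, opAssoc.LEFT),
    (OR, 2, opAssoc.LEFT),
])

query = '(A LIKE "%.something.com" AND B = 4) OR (C In ("a", "b") AND D Contains "asdf")'
result = expr.parseString(query, parseAll=True).asList()
print(result[0])
```

Passing parseAll=True makes the parser raise an exception instead of silently stopping partway through the input, which is how the original attempt lost the second half of the query.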
If in fact you do want these sub-expressions to be retained as they were found in the original, and just break up on the logical operators, then you can just wrap column_expr in originalTextFor as you used before, giving:
[['A LIKE "%.something.com"', 'AND', 'B = 4'], 'OR', ['C In ("a", "b")', 'AND', 'D Contains "asdf"']]
Good luck with your SQL parsing project!

Related

Why (; [(:x, 1), (:y, 2)]...) creates a NamedTuple?

I'm still learning Julia, and I recently came across the following code excerpt that flummoxed me:
res = (; [(:x, 10), (:y, 20)]...) # why the semicolon in front?
println(res) # (x = 10, y = 20)
println(typeof(res)) # NamedTuple{(:x, :y), Tuple{Int64, Int64}}
I understand the "splat" operator ..., but what happens when the semicolon appears first in a tuple? In other words, how does putting a semicolon in (; [(:x, 10), (:y, 20)]...) create a NamedTuple? Is this some undocumented feature/trick?
Thanks for any pointers.
Yes, this is actually a documented feature, but perhaps not a very well known one. As the documentation for NamedTuple notes:
help?> NamedTuple
search: NamedTuple @NamedTuple
NamedTuple
NamedTuples are, as their name suggests, named Tuples. That is, they're a tuple-like
collection of values, where each entry has a unique name, represented as a Symbol.
Like Tuples, NamedTuples are immutable; neither the names nor the values can be
modified in place after construction.
Accessing the value associated with a name in a named tuple can be done using field
access syntax, e.g. x.a, or using getindex, e.g. x[:a]. A tuple of the names can be
obtained using keys, and a tuple of the values can be obtained using values.
[... some other non-relevant parts of the documentation omitted ...]
In a similar fashion as to how one can define keyword arguments programmatically, a
named tuple can be created by giving a pair name::Symbol => value or splatting an
iterator yielding such pairs after a semicolon inside a tuple literal:
julia> (; :a => 1)
(a = 1,)
julia> keys = (:a, :b, :c); values = (1, 2, 3);
julia> (; zip(keys, values)...)
(a = 1, b = 2, c = 3)
As in keyword arguments, identifiers and dot expressions imply names:
julia> x = 0
0
julia> t = (; x)
(x = 0,)
julia> (; t.x)
(x = 0,)

Tokenizing a letter as an operator

I need to make a language that has variables in it, but it also needs the letter 'd' to be an operator that has a number on the right and maybe a number on the left. I thought that making the lexer check for the letter first would give it precedence, but that doesn't happen and I don't know why.
from ply import lex, yacc
tokens=['INT', 'D', 'PLUS', 'MINUS', 'LPAR', 'RPAR', 'BIGGEST', 'SMALLEST', 'EQ', 'NAME']
t_PLUS = r'\+'
t_MINUS = r'\-'
t_LPAR = r'\('
t_RPAR = r'\)'
t_BIGGEST = r'\!'
t_SMALLEST = r'\#'
t_D = r'[dD]'
t_EQ = r'\='
t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
def t_INT(t):
r'[0-9]\d*'
t.value = int(t.value)
return t
def t_newline(t):
r'\n+'
t.lexer.lineno += 1
t_ignore = ' \t'
def t_error(t):
print("Not recognized by the lexer:", t.value)
t.lexer.skip(1)
lexer = lex.lex()
while True:
try: s = input(">> ")
except EOFError: break
lexer.input(s)
while True:
t = lexer.token()
if not t: break
print(t)
If I write:
3d4
it outputs:
LexToken(INT,3,1,0)
LexToken(NAME,'d4',1,1)
and I don't know how to work around it.
Ply does not prioritize token variables by order of appearance; rather, it sorts the patterns defined as strings by decreasing regular expression length (longest first). So your t_NAME pattern will be tried before t_D. This is explained in the Ply manual, along with a concrete example of how to handle reserved words (which may not apply in your case).
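The effect of that ordering can be reproduced with the standard re module alone. The sketch below is a simplified model of the behaviour, not Ply's actual code: it joins the two string patterns into one alternation, once sorted by pattern length and once in declaration order. The helper name first_token is made up for the illustration.

```python
import re

# The two competing string-defined patterns from the question.
token_patterns = {
    "D": r"[dD]",
    "NAME": r"[a-zA-Z_][a-zA-Z0-9_]*",
}

def first_token(text, ordered_names):
    """Return (token type, matched text) for the first token in text."""
    master = re.compile(
        "|".join(f"(?P<{name}>{token_patterns[name]})" for name in ordered_names)
    )
    match = master.match(text)
    return match.lastgroup, match.group()

# Longest regular expression first, which puts NAME ahead of D.
by_length = sorted(token_patterns, key=lambda n: len(token_patterns[n]), reverse=True)

print(by_length)                          # ['NAME', 'D']
print(first_token("d4", by_length))       # ('NAME', 'd4')
print(first_token("d4", ["D", "NAME"]))   # ('D', 'd')
```

Because alternatives in a regular expression are tried left to right, whichever pattern comes first wins whenever both can match at the current position.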
If I understand correctly, the letter d cannot be an identifier, and neither can d followed by a number. It is not entirely clear to me whether you expect d2e to be a plausible identifier, but for simplicity I'm assuming that the answer is "No", in which case you can easily restrict the t_NAME regular expression by requiring an initial d to be followed by another letter:
t_NAME = '([a-ce-zA-CE-Z_]|[dD][a-zA-Z_])[a-zA-Z0-9_]*'
If you wanted to allow d2e to be a name, then you could go with:
t_NAME = '([a-ce-zA-CE-Z_]|[dD][0-9]*[a-zA-Z_])[a-zA-Z0-9_]*'
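Both patterns can be tried out with the standard re module before wiring them into the lexer. The helper is_name and the sample inputs below are only for illustration.

```python
import re

# Stricter variant: an initial d must be followed by another letter.
NAME_STRICT = re.compile(r"([a-ce-zA-CE-Z_]|[dD][a-zA-Z_])[a-zA-Z0-9_]*")
# Looser variant: digits may sit between the initial d and the next letter.
NAME_LOOSE = re.compile(r"([a-ce-zA-CE-Z_]|[dD][0-9]*[a-zA-Z_])[a-zA-Z0-9_]*")

def is_name(pattern, text):
    """True if the whole text is accepted as a NAME by the given pattern."""
    return pattern.fullmatch(text) is not None

for text in ["d", "d4", "dice", "d2e", "total"]:
    print(f"{text:6} strict={is_name(NAME_STRICT, text)} loose={is_name(NAME_LOOSE, text)}")
```

With either pattern, a bare d or d4 is no longer a NAME, so the lexer is free to fall through to t_D; only the looser pattern accepts d2e.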

How should I map over Maybe List?

I came away from Professor Frisby's Mostly Adequate Guide to Functional Programming with what seems to be a misconception about Maybe.
I believe:
map(add1, Just [1, 2, 3])
// => Just [2, 3, 4]
My feeling coming away from the aforementioned guide is that Maybe.map should try to call Array.map on the array, essentially returning Just(map(add1, [1, 2, 3])).
When I tried this using Sanctuary's Maybe type, and more recently Elm's Maybe type, I was disappointed to discover that neither of them supports this (or, perhaps, I don't understand how they support this).
In Sanctuary,
> S.map(S.add(1), S.Just([1, 2, 3]))
! Invalid value
add :: FiniteNumber -> FiniteNumber -> FiniteNumber
^^^^^^^^^^^^
1
1) [1, 2, 3] :: Array Number, Array FiniteNumber, Array NonZeroFiniteNumber, Array Integer, Array ValidNumber
The value at position 1 is not a member of ‘FiniteNumber’.
In Elm,
> Maybe.map sqrt (Just [1, 2, 3])
-- TYPE MISMATCH --------------------------------------------- repl-temp-000.elm
The 2nd argument to function `map` is causing a mismatch.
4| Maybe.map sqrt (Just [1, 2, 3])
^^^^^^^^^^^^^^
Function `map` is expecting the 2nd argument to be:
Maybe Float
But it is:
Maybe (List number)
Similarly, I feel like I should be able to treat a Just(Just(1)) as a Just(1). On the other hand, my intuition about [[1]] is completely the opposite. Clearly, map(add1, [[1]]) should return [NaN] and not [[2]] or any other thing.
In Elm I was able to do the following:
> Maybe.map (List.map (add 1)) (Just [1, 2, 3])
Just [2,3,4] : Maybe.Maybe (List number)
Which is what I want to do, but not how I want to do it.
How should one map over Maybe List?
You have two functors to deal with: Maybe and List. What you're looking for is some way to combine them. You can simplify the Elm example you've posted by function composition:
> (Maybe.map << List.map) add1 (Just [1, 2, 3])
Just [2,3,4] : Maybe.Maybe (List number)
This is really just a short-hand of the example you posted which you said was not how you wanted to do it.
Sanctuary has a compose function, so the above would be represented as:
> S.compose(S.map, S.map)(S.add(1))(S.Just([1, 2, 3]))
Just([2, 3, 4])
"Similarly, I feel like I should be able to treat a Just(Just(1)) as a Just(1)"
This can be done using the join from the elm-community/maybe-extra package.
join (Just (Just 1)) == Just 1
join (Just Nothing) == Nothing
join Nothing == Nothing
Sanctuary has a join function as well, so you can do the following:
S.join(S.Just(S.Just(1))) == Just(1)
S.join(S.Just(S.Nothing)) == Nothing
S.join(S.Nothing) == Nothing
As Chad mentioned, you want to transform values nested within two functors.
Let's start by mapping over each individually to get comfortable:
> S.map(S.toUpper, ['foo', 'bar', 'baz'])
['FOO', 'BAR', 'BAZ']
> S.map(Math.sqrt, S.Just(64))
Just(8)
Let's consider the general type of map:
map :: Functor f => (a -> b) -> f a -> f b
Now, let's specialize this type for the two uses above:
map :: (String -> String) -> Array String -> Array String
map :: (Number -> Number) -> Maybe Number -> Maybe Number
So far so good. But in your case we want to map over a value of type Maybe (Array Number). We need a function with this type:
:: Maybe (Array Number) -> Maybe (Array Number)
If we map over S.Just([1, 2, 3]) we'll need to provide a function which takes [1, 2, 3] (the inner value) as an argument. So the function we provide to S.map must be a function of type Array Number -> Array Number. S.map(S.add(1)) is such a function. Bringing this all together we arrive at:
> S.map(S.map(S.add(1)), S.Just([1, 2, 3]))
Just([2, 3, 4])
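The same "map inside a map" shape can be sketched in Python for readers who want to step through it. The Just and Nothing classes and the map_maybe and map_list helpers below are hypothetical stand-ins written for this illustration; they are not Sanctuary's or Elm's implementation.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Just:
    value: Any

@dataclass(frozen=True)
class Nothing:
    pass

def map_maybe(fn, maybe):
    """Apply fn to the wrapped value, or pass Nothing through untouched."""
    return Just(fn(maybe.value)) if isinstance(maybe, Just) else maybe

def map_list(fn, items):
    """Apply fn to every element of a list."""
    return [fn(item) for item in items]

def add1(n):
    return n + 1

# The outer map sees the whole list; the inner map sees each number.
mapped = map_maybe(lambda xs: map_list(add1, xs), Just([1, 2, 3]))
untouched = map_maybe(lambda xs: map_list(add1, xs), Nothing())

print(mapped)     # Just(value=[2, 3, 4])
print(untouched)  # Nothing()
```

Each map only knows about its own container, which is why the function handed to the outer map has to be a list-to-list function rather than add1 itself.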

Convert Dict to DataFrame in Julia

Suppose I have a Dict defined as follows:
x = Dict{AbstractString,Array{Integer,1}}("A" => [1,2,3], "B" => [4,5,6])
I want to convert this to a DataFrame object (from the DataFrames module). Constructing a DataFrame has a similar syntax to constructing a dictionary. For example, the above dictionary could be manually constructed as a data frame as follows:
DataFrame(A = [1,2,3], B = [4,5,6])
I haven't found a direct way to get from a dictionary to a data frame but I figured one could exploit the syntactic similarity and write a macro to do this. The following doesn't work at all but it illustrates the approach I had in mind:
macro dict_to_df(x)
typeof(eval(x)) <: Dict || throw(ArgumentError("Expected Dict"))
return quote
DataFrame(
for k in keys(eval(x))
@eval ($k) = $(eval(x)[$k])
end
)
end
end
I also tried writing this as a function, which does work when all dictionary values have the same length:
function dict_to_df(x::Dict)
s = "DataFrame("
for k in keys(x)
v = x[k]
if typeof(v) <: AbstractString
v = string('"', v, '"')
end
s *= "$(k) = $(v),"
end
s = chop(s) * ")"
return eval(parse(s))
end
Is there a better, faster, or more idiomatic approach to this?
Another method could be
DataFrame(Any[values(x)...],Symbol[map(symbol,keys(x))...])
It was a bit tricky to get the types in order to access the right constructor. To get a list of the constructors for DataFrames I used methods(DataFrame).
The DataFrame(a=[1,2,3]) way of creating a DataFrame uses keyword arguments. To use splatting (...) for keyword arguments the keys need to be symbols. In the example x has strings, but these can be converted to symbols. In code, this is:
DataFrame(;[Symbol(k)=>v for (k,v) in x]...)
Finally, things would be cleaner if x had originally been keyed by symbols. Then the code would go:
x = Dict{Symbol,Array{Integer,1}}(:A => [1,2,3], :B => [4,5,6])
df = DataFrame(;x...)

What does the lambda calculus have to say about return values?

It is by now a well known theorem of the lambda calculus that any function taking two or more arguments can be written through currying as a chain of functions taking one argument:
# Pseudo-code for currying
f(x,y) -> f_curried(x)(y)
This has proven to be extremely powerful not just in studying the behavior of functions but in practical use (Haskell, etc.).
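As a concrete version of the pseudo-code, here is a minimal Python sketch. The helper name curry2 is made up for the illustration and only handles the two-argument case.

```python
def curry2(f):
    """Turn a two-argument function into a chain of one-argument functions."""
    return lambda x: lambda y: f(x, y)

def add(x, y):
    return x + y

add_curried = curry2(add)
add_five = add_curried(5)  # supplying one argument yields a new function

print(add(5, 3), add_curried(5)(3), add_five(3))  # 8 8 8
```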
Functions returning values, however, seem to not be discussed. Programmers typically deal with their inability to return more than one value from a function by returning some meta-object (lists in R, structures in C++, etc.). It has always struck me as a bit of a kludge, but a useful one.
For instance:
# R code for "faking" multiple return values
uselessFunc <- function(dat) {
model1 <- lm( y ~ x , data=dat )
return( list( coef=coef(model1), form=formula(model1) ) )
}
Questions
Does the lambda calculus have anything to say about a multiplicity of return values? If so, do any surprising conclusions result?
Similarly, do any languages allow true multiple return values?
According to the Wikipedia page on lambda calculus:
Lambda calculus, also written as λ-calculus, is a formal system for function
definition, function application and recursion
And a function, in the mathematical sense:
Associates one quantity, the argument of the function, also known as the input,
with another quantity, the value of the function, also known as the output
So, to answer your first question: no, lambda calculus (or any other formalism based on mathematical functions) cannot have multiple return values.
For your second question: as far as I know, programming languages that implement multiple return values do so by packing the results into some kind of data structure (a tuple, an array, or even the stack) and unpacking it later. That is where the differences lie. Some languages make the packing and unpacking transparent to the programmer (Python, for instance, uses tuples under the hood), while others make the programmer do the job explicitly. Java programmers, for example, can simulate multiple return values to some extent by packing the results into a returned Object array and then extracting and casting each element by hand.
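The Python case is easy to see directly. In this small sketch (the function name min_and_max is just an example), the "two return values" turn out to be one tuple that the caller unpacks.

```python
def min_and_max(values):
    # The comma builds a single tuple; that tuple is the one return value.
    return min(values), max(values)

packed = min_and_max([3, 1, 4, 1, 5])
print(type(packed).__name__, packed)  # tuple (1, 5)

# Unpacking on the caller's side makes it look like two values came back.
lowest, highest = packed
print(lowest, highest)  # 1 5
```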
A function returns a single value. This is how functions are defined in mathematics. You can return multiple values by packing them into one compound value, but then it is still a single value. I'd call it a vector, because it has components. Mathematics has vector-valued functions, and so do programming languages. The only difference is how much support the language itself provides for them.
Nothing prevents you from having multiple functions, each one returning one of the multiple results that you would like to return.
For example, say you had the following function in Python returning a list.
def f(x):
L = []
for i in range(x):
L.append(x * i)
return L
It returns [0, 3, 6] for x=3 and [0, 5, 10, 15, 20] for x=5. Instead, you can totally have
def f_nth_value(x, n):
L = []
for i in range(x):
L.append(x * i)
if n < len(L):
return L[n]
return None
Then you can request any of the outputs for a given input, and get it, or get None, if there aren't enough outputs:
In [11]: f_nth_value(3, 0)
Out[11]: 0
In [12]: f_nth_value(3, 1)
Out[12]: 3
In [13]: f_nth_value(3, 2)
Out[13]: 6
In [14]: f_nth_value(3, 3)
In [15]: f_nth_value(5, 2)
Out[15]: 10
In [16]: f_nth_value(5, 5)
Computational resources may be wasted if you have to do some of the same work, as in this case. Theoretically, it can be avoided by returning another function that holds all the results inside itself.
def f_return_function(x):
L = []
for i in range(x):
L.append(x * i)
holder = lambda n: L[n] if n < len(L) else None
return holder
So now we have
In [26]: result = f_return_function(5)
In [27]: result(3)
Out[27]: 15
In [28]: result(4)
Out[28]: 20
In [29]: result(5)
Traditional untyped lambda calculus is perfectly capable of expressing this idea. (After all, it is Turing complete.) Whenever you want to return a bunch of values, just return a function that can give the n-th value for any n.
In regard to the second question, Python allows such a syntax if you know exactly how many values the function is going to return.
def f(x):
L = []
for i in range(x):
L.append(x * i)
return L
In [39]: a, b, c = f(3)
In [40]: a
Out[40]: 0
In [41]: b
Out[41]: 3
In [42]: c
Out[42]: 6
In [43]: a, b, c = f(2)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-43-5480fa44be36> in <module>()
----> 1 a, b, c = f(2)
ValueError: need more than 2 values to unpack
In [44]: a, b, c = f(4)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-44-d2c7a6593838> in <module>()
----> 1 a, b, c = f(4)
ValueError: too many values to unpack
Lastly, here is an example from this Lisp tutorial:
;; in this function, the return result of (+ x x) is not assigned so it is essentially
;; lost; the function body moves on to the next form, (* x x), which is the last form
;; of this function body. So the function call only returns (* 10 10) => 100
* ((lambda (x) (+ x x) (* x x)) 10)
=> 100
;; in this function, we capture the return values of both (+ x x) and (* x x), as the
;; lexical variables SUM and PRODUCT; using VALUES, we can return multiple values from
;; a form instead of just one
* ((lambda (x) (let ((sum (+ x x)) (product (* x x))) (values sum product))) 10)
=> 20 100
I am writing this as a late response because the accepted answer is wrong!
Lambda calculus does have multiple return values, but it takes a bit to understand what returning multiple values means.
Lambda calculus has no inherent definition of a collection of stuff, but it does allow you to invent one using products and Church numerals.
Pure functional JavaScript will be used for this example.
Let's define a product as follows:
const product = a => b => callback => callback(a)(b);
Then we can define church_0 and church_1 (aka true and false, aka left and right, aka car and cdr, aka first and rest) as follows:
const church_0 = a => b => a;
const church_1 = a => b => b;
Let's start by making a function that returns two values, 20 and "Hello".
const product = a => b => callback => callback(a)(b);
const church_0 = a => b => a;
const church_1 = a => b => b;
const returns_many = () => product(20)("Hello");
const at_index_zero = returns_many()(church_0);
const at_index_one = returns_many()(church_1);
console.log(at_index_zero);
console.log(at_index_one);
As expected, we got 20 and "Hello".
To return more than 2 values, it gets a bit tricky:
const product = a => b => callback => callback(a)(b);
const church_0 = a => b => a;
const church_1 = a => b => b;
const returns_many = () => product(20)(
product("Hello")(
product("Yes")("No")
)
);
const at_index_zero = returns_many()(church_0);
const at_index_one = returns_many()(church_1)(church_0);
const at_index_two = returns_many()(church_1)(church_1)(church_0);
console.log(at_index_zero);
console.log(at_index_one);
console.log(at_index_two);
As you can see, a function can return an arbitrary number of values, but to access them you cannot simply use result()[0], result()[1], or result()[2]; you must use functions that filter out the position you want.
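The encoding is not tied to JavaScript; any language with closures will do. Below is a Python transliteration with one addition that is not in the code above: a hypothetical nth helper that applies the "rest" selector the required number of times, plus an END marker (an assumption of this sketch) so that every value sits in the first slot of some pair.

```python
# A pair is a function that hands both components to a callback.
pair = lambda a: lambda b: lambda callback: callback(a)(b)
first = lambda a: lambda b: a    # plays the role of church_0
second = lambda a: lambda b: b   # plays the role of church_1

def nth(nested, index):
    """Walk down the nested pairs: take 'second' index times, then 'first'."""
    for _ in range(index):
        nested = nested(second)
    return nested(first)

END = None  # placeholder tail, never selected by nth

def returns_many():
    return pair(20)(pair("Hello")(pair("Yes")(pair("No")(END))))

values = [nth(returns_many(), i) for i in range(4)]
print(values)  # [20, 'Hello', 'Yes', 'No']
```

The loop in nth uses an ordinary Python integer for brevity; a purely lambda-calculus version would encode the index as a function too.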
This is mind-blowingly similar to electrical circuits: circuits have no "0", "1", "2", or "3", but they do have the means to make decisions. We abstract the circuitry away as a byte (a reversed list of 8 inputs) or a word (a reversed list of 16 inputs). In this language, 0 as a byte would be [0, 0, 0, 0, 0, 0, 0, 0], which is equivalent to:
const Byte = a => b => c => d => e => f => g => h => callback =>
callback(a)(b)(c)(d)(e)(f)(g)(h);
const Byte_one = Byte(0)(0)(0)(0)(0)(0)(0)(1); // preserves
const Bit_zero = Byte_one(b7 => b6 => b5 => b4 => b3 => b2 => b1 => b0 => b0);
Having invented a number, we can write an algorithm that, given a byte-indexed array and a byte representing the index we want, takes care of the boilerplate.
Anyway, what we call arrays is nothing more than the following, expressed in higher level to show the point:
// represent nested list of bits(addresses)
// to nested list of bits(bytes) interpreted as strings.
const MyArray = function(index) {
return (index == 0)
? "0th"
: (index == 1)
? "first"
: "second"
;
};
except that it doesn't do 2^32 - 1 if statements; it only does 8, recursively narrowing down to the specific element you want. Essentially it acts exactly like a multiplexer (except that the "single" signal is actually a fixed number of bits (coproducts, choices) needed to uniquely address the elements).
My point is that arrays, maps, associative arrays, lists, bits, bytes, and words are all fundamentally functions, both at the circuit level (where we can represent complex universes with nothing but wires and switches) and at the mathematical level (where everything is ultimately products (sequences, which are difficult to manage without nesting, e.g. lists), coproducts (types, sets), and exponentials (free functors (lambdas), forgetful functors)).
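The "narrowing down" can be made concrete with a small Python sketch. It assumes a 256-element collection addressed by one byte and uses ordinary tuples for the nested pairs rather than lambda-encoded ones; the helper names build_tree, to_bits, and select are invented for the illustration.

```python
def build_tree(items):
    """Nest items into pairs of halves; len(items) must be a power of two."""
    if len(items) == 1:
        return items[0]
    half = len(items) // 2
    return (build_tree(items[:half]), build_tree(items[half:]))

def to_bits(index, width):
    """Most-significant-bit-first list of address bits."""
    return [(index >> shift) & 1 for shift in reversed(range(width))]

def select(tree, address_bits):
    """One binary decision per bit: 0 picks the left half, 1 the right."""
    for bit in address_bits:
        tree = tree[bit]
    return tree

items = [f"item{i}" for i in range(256)]
tree = build_tree(items)
address = to_bits(5, 8)

print(address)                # [0, 0, 0, 0, 0, 1, 0, 1]
print(select(tree, address))  # item5
```

Eight decisions are enough to reach any of the 256 elements, which is the multiplexer behaviour described above.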

Resources