How can I write an unambiguous nearley grammar for boolean search operators - bnf

The Context
I am climbing the Nearley learning curve and trying to write a grammar for a search query parser.
The Goal
I would like to write grammar that is able to parse a querystring that contains boolean operators (e.g. AND, OR, NOT). Lets use AND for this question as a trivial case.
For instance, the grammar should recognize these example strings as valid:
pants
pants AND socks
jumping jacks
The Attempt
My naive attempt looks something like this:
query ->
statement
| statement "AND" statement
statement -> .:+
The Problem
The above grammar attempt is ambiguous because .:+ will match literally any string.
What I really want is for the first condition to match any string that does not contain AND in it. Once "AND" appears I want to enter the second condition only.
The Question
How can I detect these two distinct cases without having ambiguous grammar?
I am worried I'm missing something fundamental; I can imagine a ton of use cases where we want arbitrary text split up by known operators.

Yeah, if you've got an escape hatch that could be literally anything, you're going to have a problem.
Somewhere you're going to want to define what your base set of tokens are, at least something like \S+ and then how those tokens can be composed.
The place I'd typically start for a parser is trying to figure out where recursion is accounted for in the parser, and what approach to parsing the lib you're relying on takes.
Looks like Nearley is an Earley parser, and as the wikipedia entry for them notes, they're efficient for left-recursion.
This is just hazarding a guess, but something like this might get you to conjunction at least.
CONJUNCTION -> AND | OR
STATEMENT -> TOKENS | (TOKENS CONJUNCTION STATEMENT)
TOKENS -> [^()]+
A structure like this should be unambiguous and bans parentheses in tokens, unless they're surrounded by double quotes.

Related

How to match a minimum number of regex groups or assertions?

OK regex nerds!
I am using regex lookahead assertions for password validation that is similar to the pattern described here:
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)
However, we want to only require that any 3 of the 4 assertions be valid - not necessarily all of them. Any thoughts on how this could be done?
To shorten any kind of pattern, factorize:
\A(?:
(?=\w{6,10}\z) (?=.*[a-z]) (?: (?:.*[A-Z]){3} | .*\d )
|
(?=.*\d) (?=(?:.*[A-Z]){3}) (?: .*[a-z] | \w{6,10}\z )
)
Note that you don't need a lookahead to test the last condition.
demo
Other way, where each condition is optional and that uses a named group to count (.net only):
\A
(?<c>(?=\w{6,10}\z))?
(?<c>(?=[^a-z]*[a-z]))?
(?<c>(?=(?:[^A-Z]*[A-Z]){3}))?
(?<c>(?=\D*\d))?
(?<-c>){3} # decrement c 3 times
(?(c)|(?!$)) # conditional: force the pattern to fail if too few conditions succeed.
demo
There's no "easy" way to do this in a single regular expression. The only way would be to define all possible permutations of the "three out of four" assertions - e.g.
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})| # Maybe no digit
\A(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)| # Maybe wrong length
\A(?=\w{6,10}\z)(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)| # Maybe no lower
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=\D*\d) # Maybe not enough uppers
However, this mind-melting regex is clearly not a good solution.
A better approach would be to perform the four checks separately (with regex or otherwise), and count that there is at least three passed conditions.
...However, let's take a step back here and ask: Why are you doing this?? You're implementing a password entropy check. Based on your fuzzy rules, the following passwords are valid:
AAAa1
password1
LETmein
And the following passwords are invalid:
reallylongsecurepassword8374235359232
HorseBatteryStapleCorrect
I would strongly advise against such a bizarrely restrictive policy.
Brief
The easiest method would be to have separate regular expressions and check whether 3/4 of them are successful in your code's language. The only way to do this in regex is to present all cases. That being said, this is probably the easiest method (in regex) to present all options as it allows you to edit the patterns in one location (where they are defined) rather than multiple times (more prone to bugs). The DEFINE constructs in regex are seldom supported, but PCRE regex does.
You can also have your code generate each regex permutation. See this question about generating all permutations of a list in python
I don't know why you want to do this for passwords, it's considered malpractice, but, since you're asking for it, I figured I'd give you the easiest solution possible in regex... You really should only check minimum length (and complexity if you want [based on algorithms] to show the user how secure your system finds their password to be).
Code
(?(DEFINE)
(?<w>(?=\w{6,10}\z))
(?<l>(?=[^a-z]*[a-z]))
(?<u>(?=(?:[^A-Z]*[A-Z]){3}))
(?<d>(?=\D*\d))
)
\A(?:
(?&w)(?&l)(?&u)|
(?&w)(?&l)(?&d)|
(?&w)(?&u)(?&d)|
(?&l)(?&u)(?&d)
)
Note: The regex above uses the x modifier (ignore whitespace) so that we can nicely organize the content.

Can I manipulate symbols like I manipulate strings?

Question: Can I divide a symbol into two symbols based on a letter or symbol?
Example: For example, let's say I have :symbol1_symbol2, and I want to split it on the _ into :symbol1 and :symbol2. Is this possible?
Motivation: A fairly common recommendation in Julia is to use Symbol in place of String or ASCIIString as it is more efficient for many operations. So I'm interested in situations where this might break down because there is no analogue for Symbol for an operation that we might typically perform on ASCIIString, e.g. anything to do with regular expressions.
No you can't manipulate symbols.
They are not a composite type (in logic, though they maybe in implement).
They are one thing.
Much like an integer is one thing,
or a boolean is one thing.
You can't manipulate the parts of it.
As I understand, it the reason they are fast is because thay are "one thing".
Symbols are not strings.
Symbols are the representation of a parsed token.
They exist for working with macros etc.
They are useful for other things.
Though one fo there most common alterate uses in 0.3 was as a standin for enumerations. Now that Enum is in 0.4, that use will decline.
They are still logically good for dictionary keys etc.
--
If for some reason you must.
Eg for interop with a 3rd party library, or for some kind of dynamic dispatch:
You can convert it to a String,
with string(:abc), (There is not currently a convert),
and back with Symbol("abc").
so
function symsplit(s_s::Symbol)
combined_string_from=string(s_s)
strings= split(combined_string_from, '_')
map(Symbol,strings)
end
#show symsplit(:a)
#show symsplit(:a_b)
#show symsplit(:a_b_c);
but please don't.
You can find all the methods that operate on symbols by calling methodswith(Symbol) (though most just use the symbol as a marker/enum)
See also:
What is a "symbol" in Julia?

calculate binary expression - convert string to binary data

I get a string like this: "000AND111"
I need to calculate this and return the result.
How I do it in Flex?
just see this post thanks to the pingback by #powerlljf3
I would suggest a 3 phases approach.
1- write a small parser that split up the string in meaningful tokens (numbers and operands). Since Operands are all litterals and numbers are 0/1 combination, the parser is pretty easy (the grammer is LL1), so regular expressions can be really do the work here.
2- after building up the sequency of tokens and what is tecnically call the parsed expression tree (the sequency of tokens and operands), just implements any operand with the specific function (the link to my blog, works for few of the common boolean algebra operands)
3- finally just start reading tokens from left to right, and apply function where operands are found.
I would look through this http://www.nicolabortignon.com/as3-bitwise-operations/. It has many examples of binary math that can be used in AS3.

Creating own substring functions recursively in Ocaml

How can i write a substring function in ocaml without using any assignments lists and iterations, only recursions? i can only use string.length.
i tried so far is
let substring s s2 start stop=
if(start < stop) then
substring s s2 (start+1) stop
else s2;;
but obviously it is wrong, problem is that how can i pass the string that is being built gradually with recursive calls?
This feels like a homework problem that is intended to teach you think think about recursion. For me it would be easier to think about the recursion part if you decide on the basic operations you're going to use. You can't use assignments, lists, or iterations, okay. You need to extract parts of your input string somehow, but you obviously can't use the built-in substring function to do this, that would defeat the purpose of the exercise. The only other operation I can think of is the one that extracts a single character from a string:
# "abcd".[2];;
- : char = 'c'
You also need a way to add a character to a string, giving a longer string. But you're not allowed to use assignment to do this. It seems to me you're going to have to use String.make to translate your character to a string:
# String.make 1 'a';;
- : string = "a"
Then you can concatenate two strings using the ^ operator:
# "abc" ^ "def"
- : string = "abcdef"
Are you allowed to use these three operations? If so, you can start thinking about the recursion part of the substring problem. If not, then I probably don't understand the problem well enough yet to give advice. (Or maybe whoever set up the restrictions didn't expect you to have to calculate substrings? Usually the restrictions are also a kind of hint as to how you should proceed.)
Moving on to your specific question. In beginning FP programming, you don't generally want to pass the answer down to recursive calls. You want to pass a smaller problem down to the recursive call, and get the answer back from it. For the substring problem, an example of a smaller problem is to ask for the substring that starts one character further along in the containing string, and that is one character shorter.
(Later on, you might want to pass partial answers down to your recursive calls in order to get tail-recursive behavior. I say don't worry about it for now.)
Now I can't give you the answer to this, Partly because it's your homework, and partly because it's been 3 years since I've touched OCaml syntax, but I could try to help you along.
Now the Basic principle behind recursion is to break a problem down into smaller versions of itself.
You don't pass the string that is slowly being built up, instead use your recursive function to generate a string that is almost built up except for a single character, and then you add that character to the end of the string.

Excel-like toy-formula parsing

I would like to create a grammar for parsing a toy like formula language that resembles S-expression syntax.
I read through the "Getting Started with PyParsing" book and it included a very nice section that sort of covers a similar grammar.
Two examples of data to parse are:
sum(5,10,avg(15,20))+10
stdev(5,10)*2
Now, I have come up with a grammar that sort-of parses the formula but disregards
expanding the functions and operator precedence.
What would be the best practice to continue on with it: Should I add parseActions
for words that match oneOf the function names ( sum, avg ... ). If I build a nested
list, I could do a depth-first walking of parse results and evaluate the functions ?
It's a little difficult to advise without seeing more of your code. Still, from what you describe, it sounds like you are mostly tokenizing, to recognize the various bits of punctuation and distinguishing variable names from numeric constants from algebraic operators. nestedExpr will impart some structure, but only basic parenthetical nesting - this still leaves operator precedence handling for your post-parsing work.
If you are learning about parsing infix notation, there is a succession of pyparsing examples to look through and study (at the pyparsing wiki Examples page). Start with fourFn.py, which is actually a five function infix notation parser. Look through its BNF() method, and get an understanding of how the recursive definitions work (don't worry about the pushFirst parse actions just yet). By structuring the parser this way, operator precedence gets built right into the parsed results. If you parse 4 + 2 * 3, a mere tokenizer just gives you ['4','+','2','*','3'], and then you have to figure out how to do the 2*3 before adding the 4 to get 10, and not just brute force add 4 and 2, then multiply by 3 (which gives the wrong answer of 18). The parser in fourFn.py will give you ['4','+',['2','*','3']], which is enough structure for you to know to evaluate the 2*3 part before adding it to 4.
This whole concept of parsing infix notation with precedence of operations is so common, I wrote a helper function that does most of the hard work, called operatorPrecedence. You can see how this works in the example simpleArith.py and then move on to eval_arith.py to see the extensions need to create an evaluator of the parsed structure. simpleBool.py is another good example showing precedence for logical terms AND'ed and OR'ed together.
Finally, since you are doing something Excel-like, take a look at excelExpr.py. It tries to handle some of the crazy corner cases you get when trying to evaluate Excel cell references, including references to other sheets and other workbooks.
Good luck!

Resources