When/Why Does Mutual Left Recursion Happen in Antlr? - recursion

I have an expression that is a collection of my other top-level things. In expression I have math that is expression (op) expression. With this I get
The following sets of rules are mutually left-recursive [expression, math]
compileUnit : expression EOF;
expression
: parens
| operation
| math
| variable
| number
| comparisonGroup
;
math : expression op=( ADD | SUBSTRACT | MULTIPLY | DIVIDE ) expression #mathExpression;
HOWEVER!
This is not a problem-
expression
: parens
| operation
| expression op=( ADD | SUBSTRACT | MULTIPLY | DIVIDE ) expression
| variable
| number
| comparisonGroup
;
And neither is this!-
math : op=( ADD | SUBSTRACT | MULTIPLY | DIVIDE ) expression expression #mathExpression;
So why is it that my first code block behaves differently than the other two examples?

Antlr4 can handle direct left recursion, but not indirect left recursion, where a left recursive rule is defined as a rule that "either directly or indirectly invokes itself on the left edge of an alternative" (TDAR; pg 71).
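When, as in the first example, the #mathExpression alternative is factored out of the expression rule and into a separate math rule, the direct left recursion becomes indirect, i.e., the rules are 'mutually left-recursive'.
As in the second example, a typical solution is to combine the mutually left-recursive rules into a single rule, which turns the recursion back into direct left recursion that Antlr4 can handle. The third example works for a different reason: math begins with the operator token rather than with expression, so expression never appears on the left edge and there is no left recursion at all.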
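To make the direct/indirect distinction concrete, here is a minimal Python sketch that classifies the three grammars from the question. It is not how Antlr4 checks this internally. The dictionary encoding, the OP placeholder token and the function names are assumptions made for illustration, rules that are not listed (parens, variable, ...) are treated as opaque symbols, and empty (nullable) prefixes are ignored.

```python
def left_edges(grammar):
    """For each rule, collect the rules that can appear first in an alternative."""
    return {
        rule: {alt[0] for alt in alts if alt and alt[0] in grammar}
        for rule, alts in grammar.items()
    }


def classify_left_recursion(grammar):
    """Return {rule: "direct" or "indirect"} for every left-recursive rule."""
    edges = left_edges(grammar)
    result = {}
    for start in grammar:
        # Walk along left edges through *other* rules, looking for a way back.
        stack = [rule for rule in edges[start] if rule != start]
        seen = set()
        indirect = False
        while stack and not indirect:
            rule = stack.pop()
            if rule in seen:
                continue
            seen.add(rule)
            if start in edges[rule]:
                indirect = True
            else:
                stack.extend(edges[rule])
        if indirect:
            result[start] = "indirect"
        elif start in edges[start]:
            result[start] = "direct"
    return result


# Alternatives shared by all three versions of the expression rule.
OTHER = [["parens"], ["operation"], ["variable"], ["number"], ["comparisonGroup"]]

# First example: math is a separate rule that starts with expression.
first_example = {
    "expression": OTHER + [["math"]],
    "math": [["expression", "OP", "expression"]],
}
# Second example: the binary alternative lives inside expression itself.
second_example = {
    "expression": OTHER + [["expression", "OP", "expression"]],
}
# Third example: math starts with the operator token.
third_example = {
    "expression": OTHER + [["math"]],
    "math": [["OP", "expression", "expression"]],
}

print(classify_left_recursion(first_example))
print(classify_left_recursion(second_example))
print(classify_left_recursion(third_example))
```

Only the first grammar is reported as indirect, which matches the error message in the question.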

Related

Backus–Naur form with boolean algebra. Problem with brackets and parse tree

boolean algebra
I want to write these boolean expressions in Backus-Naur form.
What I have got is:
< variable > ::= < signal > | < operator> | < bracket >< variable>
< signal> ::= <p> | <q> | <r>| <s>
< operator> ::= <AND> | <OR> | <implication>| <equivalence>| <NOT>
< bracket> ::= < ( > | < ) >
I made a recursion with < bracket >< variable>, so that whenever there is a bracket it starts a new instance, but I still do not know when to close the brackets. With this you are able to set a closing bracket and make a new instance, but I only want that for opening brackets.
Can I separate < bracket> into < open bracket> and < closing bracket>?
Is my Backus-Naur form even correct? There isn't much information about Backus-Naur form with boolean algebra on the internet. What does the parse tree of this look like?
I assume that you want to define a grammar for boolean expressions using the Backus-Naur form, and that your examples are concrete instances of such expressions. There are multiple problems with your grammar:
First of all, you want your grammar to only generate correct boolean expressions. With your grammar, you could generate a simple operator ∨ as valid expression using the path <variable> -> <operator> -> <OR>, which is clearly wrong since the operator is missing its operands. In other words, ∨ on its own cannot be a correct boolean expression. Various other incorrect expressions can be derived with your grammar. For the same reason, the opening and closing brackets should appear together somewhere within a production rule, since you want to ensure that every opening bracket has a closing bracket. Putting them in separate production rules might destroy that guarantee, depending on the overall structure of your grammar.
Secondly, you want to differentiate between non-terminal symbols (the ones that are refined by production rules, i.e. the ones written between < and >) and terminal symbols (atomic symbols like your variables p, q, r and s). Hence, your non-terminal symbols <p>, <q>, <r> and <s> should be terminal symbols p, q, r and s. Same goes for other symbols like brackets and operators.
Thirdly, in order to get an unambiguous parse tree, you want to get your precedence and associativity of your operators correct, i.e., you want to make sure that, for example, negation is evaluated before implication, since it has a higher precedence (similar to arithmetic expressions where multiplication must be evaluated before addition). In other words, we want operators with higher precedence to appear closer to the leaf nodes of the parse tree, and operators with lower precedence to appear closer to the root node of the tree, since the leaves of the tree are evaluated first. We can achieve that by defining our grammar in a way that reflects the precedences of the operators in a decreasing manner:
<expression> ::= <expression> ↔ <implication> | <implication>
<implication> ::= <implication> → <disjunction> | <disjunction>
<disjunction> ::= <disjunction> ∨ <conjunction> | <conjunction>
<conjunction> ::= <conjunction> ∧ <negation> | <negation>
<negation> ::= ¬ <negation> | <variable> | ( <expression> )
<variable> ::= p | q | r | s
Starting with <expression>, we can see that a valid boolean expression starts with chaining all the ↔ operators together, then all the → operators, then all the ∨ operators, and so on, according to their precedence. Hence, operators with lower precedence (e.g., ↔) are located near the root of the tree, whereas operators with higher precedence (e.g., ¬) are located near the leaves of the tree.
Note that the grammar above is left-recursive, which might cause some problems with software tools that cannot handle them (e.g., parser generators).
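As a rough illustration of how the layered grammar above maps to a parse tree, here is a small hand-written Python parser. It is only a sketch: each left-recursive rule is implemented as a loop (so it also sidesteps the left-recursion issue just mentioned), trees are plain nested tuples, and the function names are made up for this example.

```python
# Binary operators, lowest precedence first, in the same order as the grammar.
BINARY_LEVELS = ["↔", "→", "∨", "∧"]
VARIABLES = ("p", "q", "r", "s")


def parse(text):
    """Parse a boolean expression into nested tuples such as ("∧", "p", "q")."""
    tokens = [ch for ch in text if not ch.isspace()]
    tree, pos = parse_level(tokens, 0, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at position {pos}")
    return tree


def parse_level(tokens, pos, level):
    """One grammar layer: <level> ::= <level> op <next> | <next>, as a loop."""
    if level == len(BINARY_LEVELS):
        return parse_negation(tokens, pos)
    op = BINARY_LEVELS[level]
    left, pos = parse_level(tokens, pos, level + 1)
    while pos < len(tokens) and tokens[pos] == op:
        right, pos = parse_level(tokens, pos + 1, level + 1)
        left = (op, left, right)  # left-associative, like the grammar
    return left, pos


def parse_negation(tokens, pos):
    """<negation> ::= ¬ <negation> | <variable> | ( <expression> )"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "¬":
        operand, pos = parse_negation(tokens, pos + 1)
        return ("¬", operand), pos
    if tok == "(":
        inner, pos = parse_level(tokens, pos + 1, 0)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return inner, pos + 1
    if tok in VARIABLES:
        return tok, pos + 1
    raise SyntaxError(f"unexpected token {tok!r} at position {pos}")


print(parse("¬p → q ∨ r"))
print(parse("(p ∨ q) ∧ r"))
```

In the first tree, → ends up at the root and ¬ next to the leaf p, which is the precedence behaviour described above. A lone ∨ is rejected with a SyntaxError, since it has no operands.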

Parser combinator for propositional logic

I'd like to design a combinator to parse Propositional Logic. Here's a simple BNF:
<sentence> ::= <atomic-sentence> | <complex-sentence>
<atomic-sentence> ::= True | False | P | Q | R
<complex-sentence> ::= (<sentence>)
| <sentence> <connective> <sentence>
| ¬<sentence>
<connective> ::= ∧ | ∨ | ⇒ | ⇔
The problem is that the grammar is left-recursive, which leads to an infinite loop: a sentence can be a complex sentence, which can start with a sentence, which can be a complex sentence, ... forever. Here's an example sentence that causes this problem:
P∧Q
Is there a simple way to fix the grammar so that it is suitable for a parser combinator? Thanks.
FWIW, I'm using FParsec in F#, but I think any parser combinator library would have the same issue.
FParsec can handle infix operators using the OperatorPrecedenceParser class where you just need to specify which operators with which associativity and precedence you have, without having to actually write a grammar for your infix expressions. The rest of this answer will explain how to solve the problem without this class for cases where the class doesn't apply, for parser combinators that don't have an equivalent class or in case you just plain don't want to use it or are at least interested in how you'd solve the problem without it.
Parser combinators tend to not support left-recursion, but they do tend to support repetition. Luckily, a left-recursive rule of the form <a> ::= <a> <b> | <c> can be rewritten using the * repetition operator to <a> ::= <c> <b>*. If you then left-fold over the resulting list, you can construct a tree that looks just like the parse tree you'd have gotten from the original grammar.
So if we first inline <complex-sentence> into <sentence> and then apply the above pattern, we get <a> = <sentence>, <b> = <connective> <sentence> and <c> = <atomic-sentence> | '(' <sentence> ')' | ¬<sentence>, resulting in the following rule after the transformation:
<sentence> ::= ( <atomic-sentence>
| '(' <sentence> ')'
| ¬<sentence>
) (<connective> <sentence>)*
To improve readability, we'll put the parenthesized part into its own rule:
<operand> ::= <atomic-sentence>
| '(' <sentence> ')'
| ¬<sentence>
<sentence> ::= <operand> (<connective> <sentence>)*
Now if you try this grammar, you'll notice something strange: the list created by the * will only ever contain a single element (or none). This is because if there are more than two operands, the right-recursive call to <sentence> will eat up all the operands, creating a right-associative parse tree.
So really the above grammar is equivalent to this (or rather the grammar is ambiguous, but a parser combinator will treat it as if it were equivalent to this):
<sentence> ::= <operand> <connective> <sentence> | <operand>
This happened because the original grammar was ambiguous. The ambiguous definition <s> ::= <s> <c> <s> | <o> can either be interpreted as the left-recursive <s> ::= <s> <c> <o> | <o> (which will create a left-associative parse tree) or the right-recursive <s> ::= <o> <c> <s> | <o> (right-associative parse tree). So we should first remove the ambiguity by choosing one of those forms and then apply the transformation if applicable.
So if we choose the left-recursive form, we end up with:
<sentence> ::= <operand> (<connective> <operand>)*
Which will indeed create lists with more than one element. Alternatively if we choose the right-recursive rule, we can just leave it as-is (no repetition operator necessary) as there is no left-recursion to eliminate.
As I said, we can now get a left-associative tree by taking the list from the left-recursive version and left-folding it or a right-associative one by taking the right-recursive version. However both of these options will leave us with a tree that treats all of the operators as having the same precedence.
To fix the precedence you can either apply something like the shunting yard algorithm to the list or you can first re-write the grammar to take precedence into account and then apply the transformation.
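For readers not using FParsec, the left-recursive form <sentence> ::= <operand> (<connective> <operand>)* plus a left fold can be sketched in plain Python as follows. This is a hand-rolled illustration, not FParsec code. The tokenizer, the tuple-based tree and all function names are assumptions, and (as discussed above) every connective gets the same precedence.

```python
from functools import reduce

ATOMS = {"True", "False", "P", "Q", "R"}
CONNECTIVES = {"∧", "∨", "⇒", "⇔"}


def tokenize(text):
    """Split into words (True, P, ...) and single-character symbols."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
        elif text[i].isalpha():
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def parse_operand(tokens, pos):
    """<operand> ::= <atomic-sentence> | '(' <sentence> ')' | ¬<sentence>"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok in ATOMS:
        return tok, pos + 1
    if tok == "(":
        inner, pos = parse_sentence(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return inner, pos + 1
    if tok == "¬":
        inner, pos = parse_sentence(tokens, pos + 1)
        return ("¬", inner), pos
    raise SyntaxError(f"unexpected token {tok!r}")


def parse_sentence(tokens, pos=0):
    """<sentence> ::= <operand> (<connective> <operand>)*"""
    first, pos = parse_operand(tokens, pos)
    rest = []  # the list produced by the * repetition
    while pos < len(tokens) and tokens[pos] in CONNECTIVES:
        op = tokens[pos]
        operand, pos = parse_operand(tokens, pos + 1)
        rest.append((op, operand))
    # Left fold: ((first op1 x1) op2 x2) ...
    tree = reduce(lambda left, pair: (pair[0], left, pair[1]), rest, first)
    return tree, pos


def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_sentence(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("P∧Q"))
print(parse("P∧Q∨R"))
```

The P∧Q input from the question now terminates, and P∧Q∨R comes out as a left-associative tree with ∨ at the root.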

Avoiding left recursion in parsing LiveScript object definitions

I'm working on a parser for LiveScript language, and am having trouble with parsing both object property definition forms — key: value and (+|-)key — together. For example:
prop: "val"
+boolProp
-boolProp
prop2: val2
I have the key: value form working with this:
Expression ::= TestExpression
| ParenExpression
| OpExpression
| ObjDefExpression
| PropDefExpression
| LiteralExpression
| ReferenceExpression
PropDefExpression ::= Expression COLON Expression
ObjDefExpression ::= PropDefExpression (NEWLINE PropDefExpression)*
// ... other expressions
But however I try to add ("+"|"-") IDENTIFIER to PropDefExpression or ObjDefExpression, I get errors about using left recursion. What's the (right) way to do this?
The grammar fragment you posted is already left-recursive, i.e. without even adding (+|-)boolprop, the non-terminal 'Expression' derives a form in which 'Expression' reappears as the leftmost symbol:
Expression -> PropDefExpression -> Expression COLON Expression
And it's not just left-recursive, it's ambiguous. E.g.
Expression COLON Expression COLON Expression
can be derived in two different ways (roughly, left-associative vs right-associative).
You can eliminate both these problems by using something more restricted on the left of the colon, e.g.:
PropDefExpression ::= Identifier COLON Expression
Also, another ambiguity: Expression derives PropDefExpression in two different ways, directly and via ObjDefExpression. My guess is, you can drop the direct derivation.
Once you've taken care of those things, it seems to me you should be able to add (+|-)boolprop without errors (unless it conflicts with one of the other kinds of expression that you didn't show).
Mind you, looking at the examples at http://livescript.net, I'm doubtful how much of that you'll be able to capture in a conventional grammar. But if you're just going for a subset, you might be okay.
I don't know how much help this will be, because I know nothing about GrammarKit and not much more about the language you're trying to parse.
However, it seems to me that
PropDefExpression ::= Expression COLON Expression
is not quite accurate, and it is creating an ambiguity when you add the boolean property production because an Expression might start with a unary - operator. In the actual grammar, though, a property cannot start with an arbitrary Expression. There are two types of key-property definitions:
name : expression
parenthesized_expression : expression
(Which is to say, an expression used as a key needs to start with a "(".)
That means that a boolean property definition, starting with + or - is recognizable from the first token, which is precisely the condition needed for successful recursive descent parsing. There are several other property definition syntaxes, including names and parenthesized_expressions not followed by a :
That's easy to parse with an LR(1) parser, like the one Jison produces, but to parse it with a recursive-descent parser you need to left-factor. (It's possible that GrammarKit can do this for you, by the way.) Basically, you'd need something like (this is not complete):
PropertyDefinition ::= PropertyPrefix PropertySuffix? | BooleanProperty
PropertyPrefix ::= NAME | ParenthesizedExpression
PropertySuffix ::= COLON Expression | DOT NAME
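To show why the left-factored rules are easy for a recursive-descent parser, here is a small Python sketch of PropertyDefinition. It is not GrammarKit code and not a real LiveScript parser. The (kind, text) token format and the token kinds are hypothetical, ParenthesizedExpression is left out, and Expression is simplified to a single NAME or STRING token.

```python
def expect(tokens, pos, *kinds):
    """Return the text of tokens[pos] if its kind is allowed, else fail."""
    if pos >= len(tokens) or tokens[pos][0] not in kinds:
        raise SyntaxError(f"expected {' or '.join(kinds)} at token {pos}")
    return tokens[pos][1]


def parse_property(tokens, pos=0):
    """PropertyDefinition ::= PropertyPrefix PropertySuffix? | BooleanProperty"""
    # BooleanProperty is recognizable from its first token alone.
    if pos < len(tokens) and tokens[pos][0] in ("PLUS", "MINUS"):
        sign = tokens[pos][1]
        name = expect(tokens, pos + 1, "NAME")
        return ("bool", sign, name), pos + 2

    # PropertyPrefix (only the NAME case in this sketch).
    name = expect(tokens, pos, "NAME")
    pos += 1

    # Optional PropertySuffix, again chosen by looking at one token.
    if pos < len(tokens) and tokens[pos][0] == "COLON":
        value = expect(tokens, pos + 1, "NAME", "STRING")  # simplified Expression
        return ("pair", name, value), pos + 2
    if pos < len(tokens) and tokens[pos][0] == "DOT":
        member = expect(tokens, pos + 1, "NAME")
        return ("member", name, member), pos + 2
    return ("shorthand", name), pos


print(parse_property([("NAME", "prop"), ("COLON", ":"), ("STRING", '"val"')]))
print(parse_property([("PLUS", "+"), ("NAME", "boolProp")]))
```

Every decision is made by looking at exactly one token, so no alternative ever has to start by calling parse_property itself.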

How can I use a relative Path or a Wildcard in JQ

Is it possible to use a relative path or name in JQ like the XPath // ?
Or is it possible to use a wildcard in JQ like .level1.*.level3.element ?
That's what the .. filter was meant to represent. The use would look like this:
.level1 | .. | .level3? .element
Note: you must use the ? otherwise you'll get errors as it recurses down objects that do not have the corresponding property.
Two additional points relative to Jeff's answer:
(1) An alternative to using ? is to use objects, e.g.
.level1 | .. | objects | .level3.element
(2) Typically one will want to eliminate the nulls corresponding to paths that do NOT match the specified trailing keys. To eliminate ALL nulls, one option is to tack on the filter: select(. != null).
On the other hand, if one wants to retain nulls that do appear as values, then one possibility is to use paths as follows:
.level1
| (paths | select( .[-2:] == ["level3", "element"])) as $path
| getpath($path)
(Since paths produces a stream of arrays of strings, the above expression produces a stream of the values corresponding to paths ending in .level3.element)
Equivalently but as a one-liner:
.level1 | getpath(paths | select(.[-2:] == ["level3","element"]))
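For readers who want to see what the paths-based filter computes, here is a rough Python analogue. It is only an illustration of the idea, not a reimplementation of jq. The function names are made up, and the sample document is invented for the example.

```python
def paths(value, prefix=()):
    """Yield (path, value) for every node below value, excluding the root."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return
    for key, child in items:
        path = prefix + (key,)
        yield path, child
        yield from paths(child, path)


def values_at_suffix(doc, suffix):
    """Collect the values whose path ends with the given (non-empty) keys."""
    suffix = tuple(suffix)
    return [v for p, v in paths(doc) if p[-len(suffix):] == suffix]


# Invented sample data: one real value, one explicit null, one missing key.
doc = {
    "level1": {
        "a": {"level3": {"element": 1}},
        "b": {"x": {"level3": {"element": None}}},
        "c": {"level3": {}},
    }
}

print(values_at_suffix(doc["level1"], ["level3", "element"]))
```

The explicit null under b is kept because its path exists, while c contributes nothing because it has no element key. That is the difference from filtering with select(. != null).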

What are the different kinds of cases?

I'm interested in the different kinds of identifier cases, and what people call them. Do you know of any additions to this list, or other alternative names?
myIdentifier : Camel case (e.g. in java variable names)
MyIdentifier : Capital camel case (e.g. in java class names)
my_identifier : Snake case (e.g. in python variable names)
my-identifier : Kebab case (e.g. in racket names)
myidentifier : Flat case (e.g. in java package names)
MY_IDENTIFIER : Upper case (e.g. in C constant names)
flatcase or mumblecase
kebab-case. Also called caterpillar-case, dash-case, hyphen-case, lisp-case, spinal-case and css-case
camelCase
PascalCase or CapitalCamelCase
snake_case or c_case
MACRO_CASE, UPPER_CASE or SCREAM_CASE
COBOL-CASE or TRAIN-CASE
Names are either generic, after a language, or colorful; most don’t have a standard name outside of a specific community.
There are many names for these naming conventions (names for names!); see Naming convention: Multiple-word identifiers, particularly for CamelCase (UpperCamelCase, lowerCamelCase). However, many don’t have a standard name. Consider the Python style guide PEP 0008 – it calls them by generic names like “lower_case_with_underscores”.
One convention is to name after a well-known use. This results in:
PascalCase
MACRO_CASE (C preprocessor macros)
…and suggests these names, which are not widely used:
c_case (used in K&R and in the standard library, like size_t)
lisp-case, css-case
COBOL-CASE
Alternatively, there are illustrative names, of which the best established is CamelCase. snake_case is more recent (2004), but is now well-established. kebab-case is yet more recent and still not established, and may have originated on Stack Overflow! (What's the name for dash-separated case?) There are many more colorful suggestions, like caterpillar-case, Train-case (initial capital), caravan-case, etc.
+--------------------------+-------------------------------------------------------------+
| Formatting               | Name(s)                                                     |
+--------------------------+-------------------------------------------------------------+
| namingidentifier         | flat case/Lazy Case                                         |
| NAMINGIDENTIFIER         | upper flat case                                             |
| namingIdentifier         | (lower) camelCase, dromedaryCase                            |
| NamingIdentifier         | (upper) CamelCase, PascalCase, StudlyCase, CapitalCamelCase |
| naming_identifier        | snake_case, pothole_case, C Case                            |
| Naming_Identifier        | Camel_Snake_Case                                            |
| NAMING_IDENTIFIER        | SCREAMING_SNAKE_CASE, MACRO_CASE, UPPER_CASE, CONSTANT_CASE |
| naming-identifier        | kebab-case, caterpillar-case, dash-case, hyphen-case,       |
|                          | lisp-case, spinal-case, css-case                            |
| NAMING-IDENTIFIER        | TRAIN-CASE, COBOL-CASE, SCREAMING-KEBAB-CASE                |
| Naming-Identifier        | Train-Case, HTTP-Header-Case                                |
| _namingIdentifier        | Underscore Notation (prefixed by "_" followed by camelCase) |
| datatypeNamingIdentifier | Hungarian Notation (names prefixed with data-type           |
|                          | metadata; now considered outdated)                          |
+--------------------------+-------------------------------------------------------------+
MyVariable : Pascal Case => Used for classes
myVariable : Camel Case => Used for variables in Java, C#, etc.
myvariable : Flat Case => Used for packages in Java, etc.
my_variable : Snake Case => Used for variables in Python, PHP, etc.
my-variable : Kebab Case => Used in CSS
The most common case types:
Camel case
Snake case
Kebab case
Pascal case
Upper case (with snake case)
camelCase
camelCase (1) starts with a lowercase letter and (2) capitalizes the first letter of every subsequent word, joining the words without any separator.
An example: the variable camel case var in camel case is camelCaseVar.
snake_case
snake_case is as simple as replacing all spaces with a "_" and lowercasing all the words. It's possible to mix snake_case with camelCase or PascalCase, but imo that ultimately defeats the purpose.
An example of snake case of the variable snake case var is snake_case_var.
kebab-case
kebab-case is as simple as replacing all spaces with a "-" and lowercasing all the words. It's possible to mix kebab-case with camelCase or PascalCase, but that ultimately defeats the purpose.
An example of kebab case of the variable kebab case var is kebab-case-var.
PascalCase
In PascalCase, every word starts with an uppercase letter (unlike camelCase, where the first word starts with a lowercase letter).
An example of pascal case of the variable pascal case var is PascalCaseVar.
Note: It's common to see this confused for camel case, but it's a separate case type altogether.
UPPER_CASE_SNAKE_CASE
UPPER_CASE_SNAKE_CASE replaces all the spaces with a "_" and converts all the letters to capitals.
An example of upper case snake case of the variable upper case snake case var is UPPER_CASE_SNAKE_CASE_VAR.
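The five rules above are mechanical enough to write down as code. Here is a small Python sketch. The function names are my own, and it assumes the input is a phrase whose words are separated by spaces, underscores or hyphens (it does not try to split existing camelCase).

```python
import re


def split_words(phrase):
    """Lowercased words of a phrase separated by spaces, "_" or "-"."""
    return [w.lower() for w in re.split(r"[\s_\-]+", phrase.strip()) if w]


def camel_case(phrase):
    words = split_words(phrase)
    if not words:
        return ""
    # First word stays lowercase, every later word is capitalized.
    return words[0] + "".join(w.capitalize() for w in words[1:])


def pascal_case(phrase):
    # Every word is capitalized, including the first.
    return "".join(w.capitalize() for w in split_words(phrase))


def snake_case(phrase):
    return "_".join(split_words(phrase))


def kebab_case(phrase):
    return "-".join(split_words(phrase))


def upper_snake_case(phrase):
    return snake_case(phrase).upper()


print(camel_case("camel case var"))
print(snake_case("snake case var"))
print(kebab_case("kebab case var"))
print(pascal_case("pascal case var"))
print(upper_snake_case("upper case snake case var"))
```

This prints the same five identifiers used as examples above: camelCaseVar, snake_case_var, kebab-case-var, PascalCaseVar and UPPER_CASE_SNAKE_CASE_VAR.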
For Python specifically, it is best to use snake_case for variable and function names, UPPER_CASE for constants (even though we don't have any keywords that specifically say that our variable is a constant) and PascalCase for class names.
camelCase is not recommended for Python (although languages such as JavaScript have it as their main casing), and kebab-case would be invalid as Python names cannot contain a hyphen (-).
variable_name = 'Hello World!'

def function_name():
    pass

CONSTANT_NAME = 'Constant Hello World!!'

class ClassName:
    pass

Resources