Recursive descent parsing and abstract syntax trees - CSS

I'm hand-coding a recursive descent parser, mostly for learning purposes, and I've run into some trouble.
I'll use this short excerpt from the CSS3 grammar as an example:
simple_selector = type_selector | universal;
type_selector = [ namespace_prefix ]? element_name;
namespace_prefix = [ IDENT | '*' ]? '|';
element_name = IDENT;
universal = [ namespace_prefix ]? '*';
First, I didn't realize that namespace_prefix was an optional part of both type_selector and universal. That led to type_selector always failing on input like *|*, because the parser blindly committed to it for any input that matched the namespace_prefix production.
Recursive descent is straightforward enough, but my understanding of it is that I need to do a lot of (for lack of a better word) exploratory recursion before settling on a production. So I changed the signature of my productions to return Boolean values; this way I can easily tell whether a given production succeeded or not.
I use a linked list data structure to support arbitrary look-ahead, and I can easily slice this list to attempt a production and then return to my starting point if the production doesn't succeed. However, while trying out a production, I'm also passing along mutable state, trying to construct a document object model as I go. This isn't really working out, because I have no way of knowing in advance whether the production will succeed, and if it doesn't, I need to somehow undo any changes already made.
My question is this: should I use an abstract syntax tree as an intermediate representation and then go from there? Is this something you would commonly do to work around this problem? The issue seems to be primarily that the document object model is not a suitable tree data structure to build during recursion.

I'm not intimately familiar with CSS, but in general what you would do is refactor the grammar to eliminate ambiguities as much as you can. In your case here, the namespace_prefix production that can be at the beginning of both type_selector and universal can be pulled out in front as a separate optional production:
simple_selector = [ namespace_prefix ]? (type_selector | universal);
type_selector = element_name;
namespace_prefix = [ IDENT | '*' ]? '|';
element_name = IDENT;
universal = '*';
Not all grammars can be simplified for simple look-ahead like this, though, and for those you can use more complicated shift-reduce parsers or, as you suggest, backtracking. For backtracking, you usually just attempt to parse productions and record the path through the grammar. Once you have a production that matches the input, you use the recorded path to actually perform the semantic actions for that production.
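As a sketch of both ideas together (the refactored grammar above, plus position save/restore instead of mutating a shared DOM), here is a minimal parser in Python. The token handling and the tuple-based AST shape are invented for illustration; each production returns a small immutable node or None, so there is nothing to undo on failure:

```python
# Minimal backtracking recursive-descent parser for:
#   simple_selector = [ namespace_prefix ]? (type_selector | universal)
#   namespace_prefix = [ IDENT | '*' ]? '|'
# IDENT is approximated as "any token that isn't '*' or '|'".

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens      # e.g. ['*', '|', '*']
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        if self.peek() == tok:
            self.pos += 1
            return True
        return False

    def ident(self):
        t = self.peek()
        if t is not None and t not in ('*', '|'):
            self.pos += 1
            return t
        return None

    def namespace_prefix(self):
        """namespace_prefix = [ IDENT | '*' ]? '|'  (backtracks on failure)."""
        start = self.pos
        ns = self.ident()
        if ns is None and self.eat('*'):
            ns = '*'
        if self.eat('|'):
            return ('ns', ns)
        self.pos = start          # undo: the '|' never appeared
        return None

    def simple_selector(self):
        """simple_selector = [ namespace_prefix ]? (type_selector | universal)"""
        ns = self.namespace_prefix()   # optional; restores pos if it fails
        if self.eat('*'):
            return ('universal', ns)
        name = self.ident()
        if name is not None:
            return ('type', ns, name)
        return None

print(Parser(['*', '|', '*']).simple_selector())  # ('universal', ('ns', '*'))
print(Parser(['div']).simple_selector())          # ('type', None, 'div')
```

Because every production returns a fresh value instead of writing into a shared DOM, a failed attempt simply resets self.pos and returns None; the caller can then build the real DOM from the returned tree in a second pass.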


At what point do Elixir's features become a misdirection?

Generally speaking, functional programming prides itself on being clear and concise. The absence of side effects and mutable state makes it easier for developers to reason about their code and verify its behaviour. How far does this truth reach?
I'm still learning Elixir but given the code from Coding Gnome:
def make_move(game = %{ game_state: state }, _guess)
    when state in [:won, :lost] do
  ...
end

def make_move(game = %{ game_state: state }, _guess)
    when state in [:pending] do
  ...
end

def make_move(game, guess) do
  ...
end
One could write it without any fanciness in Javascript as:
const makeMove = (game, guess) => {
  switch (game.state) {
    case 'won':
      return makeMoveFinalState();
    case 'lost':
      return makeMoveFinalState();
    case 'pending':
      return makeMovePending();
  }
}
Disregarding all the type/struct safety provided by the Elixir compiler, an Elixir programmer would have to read the whole file to be sure there isn't a function with a different signature hijacking another, right? I feel that this adds overhead while programming, because it's yet another thing you have to think about even before looking at the implementation of the function.
Besides that, it feels like misdirection to me, because you can't be 100% sure a case ends up in that general make_move function unless you already know all the other clauses and their signature types, while with a conditional you have a clearer flow.
Could this be rewritten in a better way? At what point do these abstractions start to weigh on the programmer?
I think this boils down mostly to preference, and simple exercises with simple conditions usually don't show the range of "clarity" pattern matching can provide. I'm biased, though, because I prefer pattern matching. Anyway, I'll bite.
In this case the switch could be said to be more readable and straightforward, but note that there's nothing preventing you from writing a very similar thing in Elixir (or Erlang):
def make_move(game = %{ game_state: state }, _guess) do
  case state do
    state when state in [:won, :lost] -> # do things
    :pending -> # do things
    _else -> # do other things
  end
end
Regarding the placement of different clauses of the same function, Elixir will emit a warning if they're not grouped together, so it ends up being your responsibility to write them together and in the correct order (it will also warn you if any branch is by definition unreachable, such as a catch-all placed before a more specific clause that has matchings).
But I think that if, for instance, you make a slight change to the matching requirements for the pending state, then in my view it starts becoming clearer to write it the Erlang/Elixir way. Say that when the state is pending there are two different execution paths, depending on whether it's your turn or something else.
Now you could write 2 specific branches for that with just function signatures:
def make_move(game = %{ game_state: :pending, your_turn: true }, _guess) do
  # do stuff
end

def make_move(game = %{ game_state: :pending }, _guess) do
  # do stuff
end
To do that in JS you would need either another switch or another if. If you have more complex matching patterns, it easily becomes harder to follow, while in Elixir I think the paths stay quite clear.
If the conditions get thornier, say when the state is :pending and there's nothing in a stack key that holds a list, then the match becomes:
def make_move(game = %{ game_state: :pending, your_turn: true, stack: [] }, _guess) do
Or if there's another branch where it depends if the first item in the stack was something specific:
def make_move(game = %{ game_state: :pending, your_turn: true, player_id: your_id, stack: [%AnAlmostTypedStruct{player: your_id} | _] }, _guess) do
Here Erlang/Elixir would only match this clause if your_id was the same in both places where it's used in the pattern.
And also, you say "without fanciness" in JS, but multiple function heads, arities, and pattern matching are nothing fancy in Elixir/Erlang; it's as if the language supports switch/case-style dispatch at a much lower level (at the module compilation level?).
I for one would love to have proper pattern matching and multiple function clauses (not just destructuring) in JS.

What does => mean in Ada?

I understand when and how to use => in Ada, specifically with the keyword others, but I am not sure of its proper name, nor how and why it was created. The history and development of Ada is very interesting to me, and I would appreciate anyone's insight on this.
=> is called the arrow. It is used with any form of parameter, not only with others.
Section 6.4 of the Ada Reference Manual states:
parameter_association ::=
    [formal_parameter_selector_name =>] explicit_actual_parameter
explicit_actual_parameter ::= expression | variable_name
A parameter_association is named or positional according to whether or
not the formal_parameter_selector_name is specified. Any positional
associations shall precede any named associations. Named associations
are not allowed if the prefix in a subprogram call is an
attribute_reference.
Similarly, array aggregates are described in section 4.3.3
array_aggregate ::=
    positional_array_aggregate | named_array_aggregate

positional_array_aggregate ::=
    (expression, expression {, expression})
  | (expression {, expression}, others => expression)
  | (expression {, expression}, others => <>)
named_array_aggregate ::=
    (array_component_association {, array_component_association})

array_component_association ::=
    discrete_choice_list => expression
  | discrete_choice_list => <>
The arrow is used to associate an array index with a specific value or to associate a formal parameter name of a subprogram with the actual parameter.
Stack Overflow isn’t really the place for this kind of question, which is why it's received at least one close vote.
That said, "arrow" has been part of the language since its first version; see ARM83 2.2. See also the Ada 83 Rationale; section 3.5 seems to be the first place where it's actually used, though not by name.
As a complement to Jim's answer, on the usage/intuitiveness side: the arrow X => A means, in various places in Ada syntax, that value A goes to place X. It is very practical, for instance, for filling an array in an arbitrary cell order. See slide 8 of this presentation for an application with large arrays. Needless to say, the absence of the arrow notation would lead to a heap of bugs in such a case. Sometimes it is just useful for making the associations more readable. You can see it here in action for designing a game level.

PEGKit Keep trying rules

Suppose I have a rule:
myCoolRule:
    Word
    | 'myCoolToken' Word otherRule
I supply as input myCoolToken something else. Now it attempts to parse: it greedily matches myCoolToken as a Word, then hits the something and says "uhh, I expected EOF". If I arrange the rules so that it attempts to match 'myCoolToken' first, all is good and it parses perfectly, for that input.
I am wondering if it is possible for it to keep trying all the alternatives in that rule to see if any works. So it matches Word, fails, comes back, and then tries the next alternative.
Here is the actual grammar rules causing problems:
columnName = Word;
typeName = Word;
//accepts CAST and cast
cast = { MATCHES_IGNORE_CASE(LS(1), #"CAST") }? Word ;
checkConstraint = 'CHECK' '('! expr ')'!;
expr = requiredExp optionalExp*;
requiredExp = (columnName
| cast '(' expr as typeName ')'
... more but not important
optionalExp ...not important
The input CHECK( CAST( abcd as defy) ) causes it to fail, even though it is valid. Is there a construct or some other way to make it try all the alternatives before giving up?
Creator of PEGKit here.
If I understand your question: no, this is not possible. But this is a feature of PEGKit, not a bug.
Your question is related to "determinacy" vs "nondeterminacy". PEGKit is a "deterministic" toolkit (which is widely considered a desirable feature for parsing programming languages).
It seems you are looking for a more "nondeterministic" behavior in this case, but I don't think you should be :).
PEGKit allows you to specify the priority of alternate options via the order in which the alternate options are listed. So:
foo = highPriority
| lowerPriority
| lowestPriority
;
If the highPriority option matches the current input, the lowerPriority and lowestPriority options will not get a chance to try to match, even if they are somehow a "better" match (i.e. they match more tokens than highPriority).
Again, this is related to "determinacy" (highPriority is guaranteed to be given primacy) and is widely considered a desirable feature when parsing programming languages.
So if you want your cast() expressions to have a higher priority than columnName, simply list the cast() expression as an option before the columnName option.
requiredExp = (cast '(' expr as typeName ')'
| columnName
... more but not important
OK, so that takes care of the syntactic details. However, if you have higher-level semantic constraints that can affect parse-time decisions about which alternative should have the highest priority, you should use a semantic predicate, like:
foo = { shouldChooseOpt1() }? opt1
| { shouldChooseOpt2() }? opt2
| defaultOpt
;
More details on Semantic Predicates here.
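To illustrate the ordered-choice behavior outside PEGKit, here is a toy sketch in Python. The token layout and helper names are invented, but the first-match-wins semantics are exactly what PEG alternation does: alternatives are tried in listed order, and once one matches, the rest are never consulted.

```python
# PEG ordered choice in miniature. A "matcher" takes (tokens, pos) and
# returns (ast_node, new_pos) on success or None on failure.

def match_word(tokens, pos):
    # Word: any identifier-like token (including, greedily, 'CAST' itself)
    if pos < len(tokens) and tokens[pos].isalpha():
        return ('word', tokens[pos]), pos + 1
    return None

def match_cast(tokens, pos):
    # cast '(' Word 'as' Word ')'
    if (len(tokens) - pos >= 6
            and tokens[pos].upper() == 'CAST'
            and tokens[pos + 1] == '('
            and tokens[pos + 3] == 'as'
            and tokens[pos + 5] == ')'):
        return ('cast', tokens[pos + 2], tokens[pos + 4]), pos + 6
    return None

def ordered_choice(alternatives, tokens, pos):
    # PEG semantics: the first alternative that matches wins outright;
    # later alternatives are never tried, even if they'd consume more input.
    for alt in alternatives:
        result = alt(tokens, pos)
        if result is not None:
            return result
    return None

toks = ['CAST', '(', 'abcd', 'as', 'defy', ')']
# Word listed first: 'CAST' is swallowed as a plain word (one token),
# and the cast alternative never runs -- the asker's failure mode.
print(ordered_choice([match_word, match_cast], toks, 0))
# cast listed first: the whole expression parses.
print(ordered_choice([match_cast, match_word], toks, 0))
```

This is why reordering the alternatives fixes the grammar: priority is the order of listing, not the length of the match.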

How to denote that a command line argument is optional when printing usage

Assume that I have a script that can be run in either of the following ways.
./foo arg1 arg2
./foo
Is there a generally accepted way to denote that arg1 and arg2 aren't mandatory arguments when printing the correct usage of the command?
I've sometimes noticed usage printed with the arguments wrapped in brackets like in the following usage printout.
Usage: ./foo [arg1] [arg2]
Do these brackets mean that the argument is optional or is there another generally accepted way to denote that an argument is optional?
I suppose this is as much a standard as anything.
The Open Group Base Specifications Issue 7
IEEE Std 1003.1, 2013 Edition
Copyright © 2001-2013 The IEEE and The Open Group
Ch. 12 - Utility Conventions
It doesn't, however, seem to mention many of the notations I have commonly seen over the years used to denote various meanings:
square brackets [optional option]
angle brackets <required argument>
curly braces {default values}
parentheses (miscellaneous info)
Edit: I should add, that these are just conventions. The important thing is to pick a convention which is sensible, clearly state your convention, and stick to it consistently. Be flexible and create conventions which seem to be most frequently encountered on your target platform(s). They will be the easiest for users to adapt to.
I personally have not seen a 'standard' that denotes that a switch is optional (the way a standard defines how certain languages are written, for example); it really is personal choice. But according to IBM's docs and the Wiki, along with numerous shell scripts I've personally seen (and command-line options from various programs), and the IEEE, the de facto convention is to treat square-bracketed ([]) parameters as optional. Example from Linux:
ping (output trimmed...)
usage: ping [-c count] [-t ttl] host
where [-c count] and [-t ttl] are optional parameters but host is not (as defined in the help).
I personally follow the de facto convention as well, using [] to mean optional parameters, and I make sure to note that in the usage text of the script/program.
I should note that a computer standard should define how something happens and its failure paths (either true failure or undefined behavior), along the lines of: the command line interpreter _shall_ treat arguments as optional when enclosed in square brackets, and _shall_ treat X as Y when Z, etc., much as the ISO C standard says how a function shall be formed for it to be valid (otherwise it fails). Given that no command line interpreter, from ash to zsh and everything in between, fails a script for treating [] as anything but optional, one could say there is no true standard.
Yes, the square brackets indicate optional arguments in Unix man pages.
From "man man":
[-abc] any or all arguments within [ ] are optional.
I've never wondered if they're formally specified somewhere, I've always just assumed they come from conventions used in abstract algebra, in particular, in BNF grammars.
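As a concrete example of the convention in practice, Python's argparse generates usage lines this way automatically: anything optional is wrapped in square brackets. The script name foo and the argument names here just mirror the question.

```python
import argparse

# A parser mirroring ./foo, where both positional arguments are optional.
# nargs='?' marks a positional argument as optional (zero or one occurrence).
parser = argparse.ArgumentParser(prog='foo')
parser.add_argument('arg1', nargs='?', help='first optional argument')
parser.add_argument('arg2', nargs='?', help='second optional argument')

print(parser.format_usage().strip())
# usage: foo [-h] [arg1] [arg2]
```

Note that argparse brackets the auto-generated -h flag too, since options are optional by definition.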

Using ANTLR with Left-Recursive Rules

Basically, I've written a parser for a language with just the basic arithmetic operators (+, -, *, /) etc., but for the minus and plus cases the generated abstract syntax tree treats them as right-associative when they need to be left-associative. Having googled for a solution, I found a tutorial that suggests rewriting the rule from:
Expression ::= Expression <operator> Term | Term
to
Expression ::= Term <operator> Expression*
However, in my head this seems to generate the tree the wrong way round. Any pointers on a way to resolve this issue?
First, I think you meant
Expression ::= Term (<operator> Expression)*
Back to your question: you do not need to "resolve the issue", because ANTLR has no problem dealing with tail recursion. I'm nearly certain it replaces tail recursion with a loop in the code that it generates. This tutorial (search for the chapter called "Expressions" on the page) explains how to arrive at the e1 = e2 (op e2)* structure. In general, though, you define expressions in terms of higher-priority expressions, so the actual recursive call happens only when you process parentheses and function parameters:
expression : relationalExpression (('and'|'or') relationalExpression)*;
relationalExpression : addingExpression ((EQUALS|NOT_EQUALS|GT|GTE|LT|LTE) addingExpression)*;
addingExpression : multiplyingExpression ((PLUS|MINUS) multiplyingExpression)*;
multiplyingExpression : signExpression ((TIMES|DIV|'mod') signExpression)*;
signExpression : (PLUS|MINUS)* primeExpression;
primeExpression : literal | variable | LPAREN expression /* recursion!!! */ RPAREN;
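To see why the (op e2)* loop yields a left-associative tree, here is a sketch in Python (not the code ANTLR generates, just the idea): each loop iteration folds the tree built so far into the left child of a new node, so earlier operators end up deeper in the tree.

```python
def parse_adding(tokens):
    """Parse e.g. ['1', '-', '2', '+', '3'] into a left-associative tree.

    Mirrors addingExpression : Term ((PLUS|MINUS) Term)* with Terms
    simplified to bare number tokens for illustration.
    """
    pos = 0
    node = tokens[pos]            # first Term
    pos += 1
    while pos < len(tokens) and tokens[pos] in ('+', '-'):
        op = tokens[pos]
        right = tokens[pos + 1]
        node = (op, node, right)  # fold leftward: old tree becomes left child
        pos += 2
    return node

print(parse_adding(['1', '-', '2', '+', '3']))
# ('+', ('-', '1', '2'), '3')  i.e. (1 - 2) + 3, as desired
```

So although the rule reads like right recursion, the implementation is iterative, and the associativity is decided by how you assemble the nodes, not by the shape of the rule.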