Writing a syntax analyser (parser) from BNF without using a generator - abstract-syntax-tree

I'm interested in writing a parser (syntax analyser) from a BNF grammar without using a generator tool like yacc or bison.
I take as example the BNF for simple arithmetic expression : (extracted from the Dragon Book 2.2.6)
expr -> expr + term | expire - term | term
term -> term * factor | term / factor | factor
factor -> number | (expr)
Suppose I would like to create the equivalent code.
I guess I should create three functions called parseExpr, parseTerm and parseFactor.
How do I construct these functions regarding to the BNF defined upwards ?
For the parseFactor it seems to be obvious :
Get the token type
If type number return a node for number type
If the token represents an opening parenthesis, get the node returned by parseExpr
Check that next token is a closing parenthesis. If yes, return the node obtained at 3. If no, throw an error
For the parseExpr, I'm a little bit confused about how to start and interpret the production : expr -> expr + term | expire - term | term
How to translate this production ? How to detect which case applies ? Same question for the last production ?

Related

Determining if grammar meets LL(1) requirements

I have a BNF for a recursive descent parser. One of the steps to solving this is to verify that the grammar is LL(1), but I keep coming up with verification that it is not.
The BNF in question, or more specifically, the exact area I'm having an issue:
<S> -> start <vars> <block>
<block> -> begin <vars> <stats> end
<vars> -> e | id = number <vars>
<stats> -> <if> | <block> | <loop> | <assign>
There is more to this, but these are the only productions that are relevant to this question, I believe.
My approach to solving this is to compute FIRST of the right hand sides of those productions that have a choice. If there is no choice, I skip, as I know they are already k=0.
FIRST(e | id = number <vars>) = {e, id} // Since it produces the empty set, I must also compute follow.
FOLLOW( e | id = number <vars> ) = FOLLOW(<vars>)
Non-terminal 'vars' appears in 2 productions: and , and is followed by two nonterminals: 'block' and 'stats'
FIRST(<block>) = {begin}
FIRST(<stats>) = { ... begin ... } // contains all terminals
Now, my problem. In computing the FOLLOW(), I have found two begin tokens, which leads me to say that this grammar is not LL(1). However, I don't believe that the answer to this exercise is that it is not possible to create a recursive descent parser, so I believe that I've made an error somewhere or that I have executed the algorithm incorrectly.
Can anyone point me in the right direction?
So you've correctly found that FOLLOW(var) = FIRST(block) ∪ FIRST(stats). These are all sets, so when you compute the union of the two first sets (each of which contains begin), you end up with just a single begin. As long as neither of these sets ends up containing id, everything is fine and your grammar is still LL(1).

PEGKit Keep trying rules

Suppose I have a rule:
myCoolRule:
Word
| 'myCoolToken' Word otherRule
I supply as input myCoolToken something else now it attempts to parse it greedily matches myCoolToken as a word and then hits the something and says uhhh I expected EOF, if I arrange the rules so it attempts to match myCoolToken first all is good and parses perfectly, for that input.
I am wondering if it is possible for it to keep trying all the rules in that statement to see if any works. So it matches Word fails, comes back and then tries the next rule.
Here is the actual grammar rules causing problems:
columnName = Word;
typeName = Word;
//accepts CAST and cast
cast = { MATCHES_IGNORE_CASE(LS(1), #"CAST") }? Word ;
checkConstraint = 'CHECK' '('! expr ')'!;
expr = requiredExp optionalExp*;
requiredExp = (columnName
| cast '(' expr as typeName ')'
... more but not important
optionalExp ...not important
The input CHECK( CAST( abcd as defy) ) causes it to fail, even though it is valid
Is there a construct or otherwise to make it verify all rules before giving up.
Creator of PEGKit here.
If I understand your question, No, this is not possible. But this is a feature of PEGKit, not a bug.
Your question is related to "determinacy" vs "nondeterminacy". PEGKit is a "deterministic" toolkit (which is widely considered a desirable feature for parsing programming languages).
It seems you are looking for a more "nondeterministic" behavior in this case, but I don't think you should be :).
PEGKit allows you to specify the priority of alternate options via the order in which the alternate options are listed. So:
foo = highPriority
| lowerPriority
| lowestPriority
;
If the highPriority option matches the current input, the lowerPriority and lowestPriority options will not get a chance to try to match, even if they are somehow a "better" match (i.e. they match more tokens than highPriority).
Again, this is related to "determinacy" (highPriority is guaranteed to be given primacy) and is widely considered a desirable feature when parsing programming languages.
So if you want your cast() expressions to have a higher priority than columnName, simply list the cast() expression as an option before the columnName option.
requiredExp = (cast '(' expr as typeName ')'
| columnName
... more but not important
OK, so that takes care of the syntactic details. However, if you have higher-level semantic constraints which can affect parsetime decisions about which alternative should have the highest priority, you should use a Semantic Predicate, like:
foo = { shouldChooseOpt1() }? opt1
| { shouldChooseOpt2() }? opt2
| defaultOpt
;
More details on Semantic Predicates here.

SML '97: what is exactly the standard syntax?

I came to this question, when I wanted to check something about the syntax of functor declarations. I came to two contradictory syntax definitions, while the syntax of Standard ML '97, as its name suggest, is supposed to be part of a standard, defined in “The definition of Standard ML — Revised”.
From the book
“The definition of Standard ML — Revised”, by R. Milner, page 14, on Google Books says:
fundec ::= functor funbinf
funbind ::= funid (strid : sigexp) = strexp <and funbind>
I read it as “A functor gets exactly one argument and cannot be said to match a signature”.
From another reliable source
“Standard ML syntax summary”, by L. Paulson, page 2, on PDF says (schema approximately re‑expressed using the same notation as in the definition of SML '97):
FunctorDeclaration ::= functor FunctorBinding <and FunctorBinding>
FunctorBinding ::= Ident ( FunctorArguments ) : Signature = Structure
FunctorArguments ::= Ident : Signature | Specification
I read it as “A functor may get multiple arguments and may be said to match a signature”.
Question
The two documents says different things, so I'm confused. What is the real definition of Standard ML '97? Or am I just miss‑reading the standard definition?
Chapters 2 and 3 of the Definition only give the bare syntax of the language. That's extended by the "derived forms" (i.e., syntactic sugar) defined in Appendix A, which include the funid (spec) form (which is short for funid (X : sig spec end) with X being opened on the RHS).
See here for a complete SML grammar including all derived forms.

Using ANTLR with Left-Recursive Rules

Basically, I've written a parser for a language with just basic arithmetic operators ( +, -, * / ) etc, but for the minus and plus cases, the Abstract Syntax Tree which is generated has parsed them as right associative when they need to be left associative. Having googled for a solution, I found a tutorial that suggests rewriting the rule from:
Expression ::= Expression <operator> Term | Term`
to
Expression ::= Term <operator> Expression*
However, in my head this seems to generate the tree the wrong way round. Any pointers on a way to resolve this issue?
First, I think you meant
Expression ::= Term (<operator> Expression)*
Back to your question: You do not need to "resolve the issue", because ANTLR has no problem dealing with tail recursion. I'm nearly certain that it replaces tail recursion with a loop in the code that it generates. This tutorial (search for the chapter called "Expressions" on the page) explains how to arrive at the e1 = e2 (op e2)* structure. In general, though, you define expressions in terms of higher-priority expressions, so the actual recursive call happens only when you process parentheses and function parameters:
expression : relationalExpression (('and'|'or') relationalExpression)*;
relationalExpression : addingExpression ((EQUALS|NOT_EQUALS|GT|GTE|LT|LTE) addingExpression)*;
addingExpression : multiplyingExpression ((PLUS|MINUS) multiplyingExpression)*;
multiplyingExpression : signExpression ((TIMES|DIV|'mod') signExpression)*;
signExpression : (PLUS|MINUS)* primeExpression;
primeExpression : literal | variable | LPAREN expression /* recursion!!! */ RPAREN;

Erlang Hash Tree

I'm working on a p2p app that uses hash trees.
I am writing the hash tree construction functions (publ/4 and publ_top/4) but I can't see how to fix publ_top/4.
I try to build a tree with publ/1:
nivd:publ("file.txt").
prints hashes...
** exception error: no match of right hand side value [67324168]
in function nivd:publ_top/4
in call from nivd:publ/1
The code in question is here:
http://github.com/AndreasBWagner/nivoa/blob/886c624c116c33cc821b15d371d1090d3658f961/nivd.erl
Where do you think the problem is?
Thank You,
Andreas
Looking at your code I can see one issue that would generate that particular exception error
publ_top(_,[],Accumulated,Level) ->
%% Go through the accumulated list of hashes from the prior level
publ_top(string:len(Accumulated),Accumulated,[],Level+1);
publ_top(FullLevelLen,RestofLevel,Accumulated,Level) ->
case FullLevelLen =:= 1 of
false -> [F,S|T]=RestofLevel,
io:format("~w---~w~n",[F,S]),
publ_top(FullLevelLen,T,lists:append(Accumulated,[erlang:phash2(string:concat([F],[S]))]),Level);
true -> done
end.
In the first function declaration you match against the empty list. In the second declaration you match against a list of length (at least) 2 ([F,S|T]). What happens when FullLevelLen is different from 1 and RestOfLevel is a list of length 1? (Hint: You'll get the above error).
The error would be easier to spot if you would pattern match on the function arguments, perhaps something like:
publ_top(_,[],Accumulated,Level) ->
%% Go through the accumulated list of hashes from the prior level
publ_top(string:len(Accumulated),Accumulated,[],Level+1);
publ_top(1, _, _, _) ->
done;
publ_top(_, [F,S|T], Accumulated, Level) ->
io:format("~w---~w~n",[F,S]),
publ_top(FullLevelLen,T,lists:append(Accumulated,[erlang:phash2(string:concat([F],[S]))]),Level);
%% Missing case:
% publ_top(_, [H], Accumulated, Level) ->
% ...

Resources