Avoid common prefixes without change lookahead - javacc

I'm using JavaCC to make a specification to recognize a language. The problem I have is that JavaCC gives me a warning because public is a common prefix of Member() declaration. Member() can has Attributes() and/or Method() but must have at least one Method, the order does not matter.
The warning JavaCC gives me is:
Choice conflict in (...)+ construct at line 66, column 23.
Expansion nested within construct and expansion following construct have common prefixes, one of which is: "public". Consider using a lookahead of 2 or more for nested expansion.
The line 66 is the only line of Member(). Also I need to do this without change lookahead value.
Here is the code:
void Member() : {}
{
(Attribute())* (Method())+ (Attribute() | Method())*
}
void Attribute() : {}
{
"private" Type() <Id> [":=" Expr()]";"
}
void Method() : {}
{
MethodHead() MethodBody()
}
void MethodHead() : {}
{
("public")? (<Id> | Type() | "void") <Id> "(" Parameter() ")"
}
Thanks.

The problem is that this regular expression
(Method())+ (Attribute() | Method())*
is ambiguous. Let's abbreviate methods by M and attributes by A. If the input is MAM, there no problem. The (Method())+ matches the first M and the (Attribute() | Method())* matches the remaining AM. But if the input is MMA, where should the divide be? Either (Method())+ matches M and (Attribute() | Method())* matches MA or (Method())+ matches MM and (Attribute() | Method())* matches A. Both parses are possible. JavaCC doesn't know which parse you want, so it complains.
What you can do:
Nothing. Ignore the warning. The default behaviour is that as many Methods as possible will be recognized by (Method())+ and only methods after the first attribute will be recognized by (Attribute() | Method())*.
Suppress the warning using lookahead. You said you don't want to add lookahead, but for completeness, I'll mention that you could change (Method())+ to (LOOKAHEAD(1) Method())+. That won't change the behaviour of the parser, but it will suppress the warning.
Rewrite the grammar.
The offending line could be rewritten as either
(Attribute())* Method() (Attribute() | Method())*
or as
(Attribute())* (Method())+ [Attribute() (Attribute() | Method())*]

Related

Javacc conflicting productions

I have two inputs
a - b
a += b
And I have a production with a choice
void AssignmentExpression() : {}
{
LOOKAHEAD(3) ConditionalExpression()
| LOOKAHEAD(3) UnaryExpression() AssignmentOperator() AssignmentExpression()
}
With this production input (1) works, but input (2) does not work.
If I swap the choice in the production so that it becomes
void AssignmentExpression() : {}
{
LOOKAHEAD(3) UnaryExpression() AssignmentOperator() AssignmentExpression()
| LOOKAHEAD(3) ConditionalExpression()
}
Then input (2) works, but input (1) does not work.
How do I fix this? Increasing the LOOKAHEAD parameter does not help.
See Expression Parsing by Recursive Descent. Follow the "classic solution".
Since you are using JJTree, the answer to the question Make a calculator's grammar that make a binary tree with javacc will be helpful.
You might try
void AssignmentExpression() : {}
{
LOOKAHEAD(UnaryExpression() AssignmentOperator() )
UnaryExpression() AssignmentOperator() AssignmentExpression()
| ConditionalExpression()
}
Without seeing more of the grammar it is hard to know whether this will work. Since the use of the lookahead specification will suppress any warnings from JavaCC --JavaCC "assumes" you know what you are doing-- you have to do the analysis yourself.
My other answer is better.

Pass a string value in a recursive bison rule

i'm having some issues on bison (again).
I'm trying to pass a string value between a "recursive rule" in my grammar file using the $$,
but when I print the value I have passed, the output looks like a wrong reference ( AU�� ) instead the value I wrote in my input file.
line: tok1 tok2
| tok1 tok2 tok3
{
int len=0;
len = strlen($1) + strlen($3) + 3;
char out[len];
strcpy(out,$1);
strcat(out," = ");
strcat(out,$3);
printf("out -> %s;\n",out);
$$ = out;
}
| line tok4
{
printf("line -> %s\n",$1);
}
Here I've reported a simplified part of the code.
Giving in input the token tok1 tok2 tok3 it should assign to $$ the out variable (with the printf I can see that in the first part of the rule the out variable has the correct value).
Matching the tok4 sequentially I'm in the recursive part of the rule. But when I print the $1 value (who should be equal to out since I have passed it trough $$), I don't have the right output.
You cannot set:
$$ = out;
because the string that out refers to is just about to vanish into thin air, as soon as the block in which it was declared ends.
In order to get away with this, you need to malloc the storage for the new string.
Also, you need strlen($1) + strlen($3) + 4; because you need to leave room for the NUL terminator.
It's important to understand that C does not really have strings. It has pointers to char (char*), but those are really pointers. It has arrays (char []), but you cannot use an array as an aggregate. For example, in your code, out = $1 would be illegal, because you cannot assign to an array. (Also because $1 is a pointer, not an array, but that doesn't matter because any reference to an array, except in sizeof, is effectively reduced to a pointer.)
So when you say $$ = out, you are making $$ point to the storage represented by out, and that storage is just about to vanish. So that doesn't work. You can say $$ = $1, because $1 is also a pointer to char; that makes $$ and $1 point to the same character. (That's legal but it makes memory management more complicated. Also, you need to be careful with modifications.) Finally, you can say strcpy($$, out), but that relies on $$ already pointing to a string which is long enough to hold out, something which is highly unlikely, because what it means is to copy the storage pointed to by out into the location pointed to by $$.
Also, as I noted above, when you are using "string" functions in C, they all insist that the sequence of characters pointed to by their "string" arguments (i.e. the pointer-to-character arguments) must be terminated with a 0 character (that is, the character whose code is 0, not the character 0).
If you're used to programming in languages which actually have a string datatype, all this might seem a bit weird. Practice makes perfect.
The bottom line is that what you need to do is to create a new region of storage large enough to contain your string, like this (I removed out because it's not necessary):
$$ = malloc(len + 1); // room for NUL
strcpy($$, $1);
strcat($$, " = ");
strcat($$, $3);
// You could replace the strcpy/strcat/strcat with:
// sprintf($$, "%s = %s", $1, $3)
Note that storing mallocd data (including the result of strdup and asprintf) on the parser stack (that is, as $$) also implies the necessity to free it when you're done with it; otherwise, you have a memory leak.
I've solved it changin the $$ = out; line into strcpy($$,out); and now it works properly.

How to convert BNF to EBNF

How can I convert this BNF to EBNF?
<vardec> ::= var <vardeclist>;
<vardeclist> ::= <varandtype> {;<varandtype>}
<varandtype> ::= <ident> {,<ident>} : <typespec>
<ident> ::= <letter> {<idchar>}
<idchar> ::= <letter> | <digit> | _
EBNF or Extended Backus-Naur Form is ISO 14977:1996, and is available in PDF from ISO for free*. It is not widely used by the computer language standards. There's also a paper that describes it, and that paper contains this table summarizing EBNF notation.
Table 1: Extended BNF
Extended BNF Operator Meaning
-------------------------------------------------------------
unquoted words Non-terminal symbol
" ... " Terminal symbol
' ... ' Terminal symbol
( ... ) Brackets
[ ... ] Optional symbols
{ ... } Symbols repeated zero or more times
{ ... }- Symbols repeated one or more times†
= in Defining symbol
; post Rule terminator
| in Alternative
, in Concatenation
- in Except
* in Occurrences of
(* ... *) Comment
? ... ? Special sequence
The * operator is used with a preceding (unsigned) integer number; it does not seem to allow for variable numbers of repetitions — such as 1-15 characters after an initial character to make identifiers up to 16 characters long. This lis
In the standard, open parenthesis ( is called start group symbol and close parenthesis ) is called end group symbol; open square bracket [ is start option symbol and close square bracket is end option symbol; open brace { is start repeat symbol and close brace } is end repeat symbol. Single quotes ' are called first quote symbol and double quotes " are second quote symbol.
* Yes, free — even though you can also pay 74 CHF for it if you wish. Look at the Note under the box containing the chargeable items.
The question seeks to convert this 'BNF' into EBNF:
<vardec> ::= var <vardeclist>;
<vardeclist> ::= <varandtype> {;<varandtype>}
<varandtype> ::= <ident> {,<ident>} : <typespec>
<ident> ::= <letter> {<idchar>}
<idchar> ::= <letter> | <digit> | _
The BNF is not formally defined, so we have to make some (easy) guesses as to what it means. The translation is routine (it could be mechanical if the BNF is formally defined):
vardec = 'var', vardeclist, ';';
vardeclist = varandtype, { ';', varandtype };
varandtype = ident, { ',', ident }, ':', typespec;
ident = letter, { idchar };
idchar = letter | digit | '_';
The angle brackets have to be removed around non-terminals; the definition symbol ::= is replaced by =; the terminals such as ; and _ are enclosed in quotes; concatenation is explicitly marked with ,; and each rule is ended with ;. The grouping and alternative operations in the original happen to coincide with the standard notation. Note that explicit concatenation with the comma means that multi-word non-terminals are unambiguous.
† Casual study of the standard itself suggests that the {...}- notation is not part of the standard, just of the paper. However, as jmmut notes in a comment, the standard does define the meaning of {…}-:
§5.8 Syntactic term
…
When a syntactic-term is a syntactic-factor followed by
an except-symbol followed by a syntactic-exception it
represents any sequence of symbols that satisfies both of
the conditions:
a) it is a sequence of symbols represented by the syntactic-factor,
b) it is not a sequence of symbols represented by the
syntactic-exception.
…
NOTE - { "A" } - represents a sequence of one or more A's because it is a syntactic-term with an empty syntactic-exception.
Remove the angle brackets and put all terminals into quotes:
vardec ::= "var" vardeclist;
vardeclist ::= varandtype { ";" varandtype }
varandtype ::= ident { "," ident } ":" typespec
ident ::= letter { idchar }
idchar ::= letter | digit | "_"

Recovering multiple errors in Javacc

Is there a way in javacc to parse an input file further even after detecting an error. I got to know that there are several ways such as panic mode recovery, phrase level recovery and so on. But I can't figure how to implement it in javacc jjt file.
For an example assume my input file is
Line 1: int i
Line 2: int x;
Line 3: int k
So what I want is after detecting the error of missing semicolon at line 1, proceed parsing and find the error at line 3 too.
I found the answer in the way of panic mode error recovery,but it too have some bugs. What I did was I edit my grammar so that once I encounter a missing character in a line of the input file(in the above case a semicolon) parser proceed until it finds a similar character. Those similar characters are called synchronizing tokens.
See the example below.
First I replaced all the SEMICOLON tokens in my grammar with this.
Semicolon()
Then add this new production rule.
void Semicolon() :
{}
{
try
{
<SEMICOLON>
} catch (ParseException e) {
Token t;
System.out.println(e.toString());
do {
t = getNextToken();
} while (t.kind != SEMICOLON && t!=null && t.kind != EOF );
}
}
Once I encounter a missing character parser search for a similar character.When it finds such character it returns to the rule which called it.
Example:-
Assume a semicolon missing in a variable declaration.
int a=10 <--- no semicolon
So parser search for a semicolon.At some point it finds a semicolon.
___(some code)__; method(param1);
So after finding the first semicolon in the above example it returns to the variable declaration rule(because it is the one which called the semicolon() method.) But what we find after the newly find semicolon is a function call,not a variable declaration.
Can anyone please suggest a way to solve this problem.

Ambiguous grammar with Lemon Parser Generator

So basically I want to parsed structure CSS code in PHP, using a lexer/parser generated by the PEAR packages PHP_LexerGenerator and PHP_ParserGenerator. My goal is to parse files like this:
selector, selector2 {
prop: value;
prop2 /*comment */ :
value;
subselector {
prop: value;
subsub { prop: value; }
}
}
This is all fine as long as I don't have pseudo classes. Pseudoclasses allow it, to add : and a CSS name ([a-z][a-z0-9]*) to an element, like in a.menu:visited. Being somewhat lazy, the parser has no list of valid pseudo classes and accepts everything for the class name.
My grammar (ignoring all the special cases and whitespace) looks like this:
document ::= (<rule>)*
rule ::= <selector> '{' (<content>)* '}'
content ::= <rule>
content ::= <definition>
definition ::= <name> ':' <name> ';'
// h1 .class.class2#id :visited
<selector> ::= <name> (('.'|'#') <name>)* (':' <name>)?
Now, when I try to parse the following
h1 {
test:visited {
simple: case;
}
}
The parser complains, that it expected a <name> to follow the double colon. So it tries to read the simple: as a <selector> (just look at the syntax highlighting of SO).
Is it my error that the parser can not backtrace enough to try the <definition> rule? Or is Lemon just not powerful enough to express this? If so, what can I do to get a parser working with this grammar?
Your question asks about PHP_ParserGenerator and PHP_LexerGenerator. The parser generator code is marked as 'not maintained', which bodes ill.
The syntax you are using for the grammar is not acceptable for Lemon, so you need to clarify why you think the parser generator should accept it. You mention a problem with 'expected a <name> to follow the double colon, but neither your grammar nor your sample input has a double colon, which makes it hard to help you.
I think this Lemon grammar is equivalent to the one you showed:
document ::= rule_list.
rule_list ::= .
rule_list ::= rule_list rule.
rule ::= selector LBRACE content_list RBRACE.
content_list ::= .
content_list ::= content_list content.
content ::= rule.
content ::= definition.
definition ::= NAME COLON NAME SEMICOLON.
selector ::= NAME opt_dothashlist opt_colonname.
opt_dothashlist ::= .
opt_dothashlist ::= dot_or_hash NAME.
dot_or_hash ::= DOT.
dot_or_hash ::= HASH.
opt_colonname ::= COLON NAME.
However, when it is compiled, Lemon complains 1 parsing conflicts and the output file shows:
State 2:
definition ::= NAME * COLON NAME SEMICOLON
selector ::= NAME * opt_dothashlist opt_colonname
(10) opt_dothashlist ::= *
opt_dothashlist ::= * dot_or_hash NAME
dot_or_hash ::= * DOT
dot_or_hash ::= * HASH
COLON shift 10
COLON reduce 10 ** Parsing conflict **
DOT shift 13
HASH shift 12
opt_dothashlist shift 5
dot_or_hash shift 7
This means it is not sure what to do with a colon; it might be the 'opt_colonname' part of a 'selector' or it might be part of a 'definition':
name1:name4 : name2:name3 ;
Did you mean to allow syntax such as that? Nominally, according to the grammar, that should be valid, but
name1:name4;
should also be valid. I think it requires 2 or 3 lookahead tokens to disambiguate these (so your grammar is not LALR(1) but LALR(3)).
Review your definition of 'selector' in particular.

Resources