I have a JavaCC grammar with a troublesome section that can be reduced to this:
void Start(): {}
{
A()
<EOF>
}
void A(): {}
{
( "(" A() ")" | "biz" )
( B() | C() )*
}
void B(): {}
{ "foo" A() }
void C(): {}
{ "bar" A() }
When I compile the above grammar, JavaCC warns of a choice conflict on the line ( B() | C() )*. There are 2 things I'm trying to understand. The first is why it thinks there is a conflict in this case. AFAICT at each point it should be able to determine which path to take based on just the current token. The second one is how to get rid of the warning. I can't seem to find the right spot to put the LOOKAHEAD statement. No matter where I put it, I either get a warning that it isn't at a choice point or I continue to get the same warning. Here's what I thought it might like:
void Start(): {}
{
A()
<EOF>
}
void A(): {}
{
( "(" A() ")" | "biz" )
( LOOKAHEAD(1) B() | LOOKAHEAD(1) C() )*
}
void B(): {}
{ "foo" A() }
void C(): {}
{ "bar" A() }
But this still produces the warning. I've also tried various semantic lookahead statements with no luck.
I'm clearly missing something, but I'm at a loss as to what. FWIW, putting any token after ( B() | C() )* also "fixes" the issue, so I'm guessing it has something to do with it not knowing how to exit that loop, but it seems like that should just be when it doesn't see "foo" or "bar". The generated code appears to be correct, but if there is an ambiguity here that I'm not seeing, then obviously that won't matter.
EDIT 1..
After some poking about and looking at a Java grammar I found that this makes things happy:
void Start(): {}
{
A()
<EOF>
}
void A(): {}
{
( "(" A() ")" | "biz" )
(
LOOKAHEAD(2)
(B() | C())
)*
}
void B(): {}
{ "foo" A() }
void C(): {}
{ "bar" A() }
I'm still not entirely clear why it would need the extra token to decide which option to take in the loop (and maybe it really doesn't).
EDIT 2...
OK, I see the issue now, the ambiguity is not between B and C, but between whether to do a depth first or breadth first construction of the tree. So the following is just as ambiguous:
void Start(): {}
{
A()
<EOF>
}
void A(): {}
{
( "(" A() ")" | "biz" )
(B())*
}
void B(): {}
{ "foo" A() }
Switching from * to ? resolves the ambiguity as suggested.
If we change B to { "foo" A() "end" }, that also resolves the issue, since there is a clear end to B. Now suppose we do this:
void Start(): {}
{
A()
<EOF>
}
void A(): {}
{
( "(" A() ")" | "biz" )
( B() | C() )*
}
void B(): {}
{ "foo" A() "end" }
void C(): {}
{ "bar" A() }
Here I would expect that the same issue would exist for C, but JavaCC still reports that the ambiguous prefix is "foo". Is that just a reporting error? Of course using ? in this case is not possible since then you can't match successive B's (which we want). FWIW, the code generated here produces the depth first tree and since that's what I want it may be sufficient to suppress the warning.
Your grammar is ambiguous. Consider the input
biz foo biz foo biz EOF
We have the following two leftmost derivations.
The first is
Start
=> ^^^^^ Expand
A EOF
=> ^ Expand
( "(" A ")" | "biz") ( B | C)* EOF
=> ^^^^^ choose
"biz" ( B | C )* EOF
=> ^^^^^^^^^^ unroll and choose the B
"biz" B ( B | C )* EOF
=> ^ expand
"biz" "foo" A ( B | C )* EOF
=> ^ expand
"biz" "foo" ( "(" A ")" | "biz") ( B | C)* ( B | C )* EOF
=> ^^^^^ choose
"biz" "foo" "biz" ( B | C)* ( B | C )* EOF
=> ^^^^^^^^^^ unroll and choose the B
"biz" "foo" "biz" B ( B | C)* ( B | C )* EOF
=> ^ expand
"biz" "foo" "biz" "foo" A ( B | C)* ( B | C )* EOF
=> ^ expand
"biz" "foo" "biz" "foo" ( "(" A ")" | "biz") ( B | C)* ( B | C)* ( B | C )* EOF
=> ^^^^^ choose
"biz" "foo" "biz" "foo" "biz" ( B | C)* ( B | C)* ( B | C )* EOF
=> ^^^^^^^^^ Terminate
"biz" "foo" "biz" "foo" "biz" ( B | C)* ( B | C )* EOF
=> ^^^^^^^^^ Terminate
"biz" "foo" "biz" "foo" "biz"( B | C )* EOF
=> ^^^^^^^^^ Terminate
"biz" "foo" "biz" "foo" "biz" EOF
For the second derivation everything is the same as the first until the point where the parser has to decide whether to enter the second loop.
Start
=>* As above
"biz" "foo" "biz" ( B | C)* ( B | C )* EOF
=> ^^^^^^^^^ Terminate!! (Previously it was expand)
"biz" "foo" "biz" ( B | C )* EOF
=> ^^^^^^^^^^ Unroll and choose B
"biz" "foo" "biz" B ( B | C )* EOF
=> ^ Expand
"biz" "foo" "biz" "foo" A ( B | C )* EOF
=> ^ Expand
"biz" "foo" "biz" "foo" ( "(" A ")" | "biz") ( B | C)* ( B | C )* EOF
=> ^^^^^ Choose
"biz" "foo" "biz" "foo" "biz" ( B | C)* ( B | C )* EOF
=> ^^^^^^^^^ Terminate
"biz" "foo" "biz" "foo" "biz"( B | C )* EOF
=> ^^^^^^^^^ Terminate
"biz" "foo" "biz" "foo" "biz" EOF
In terms of parse trees, there are the following two possibilities. The parse tree on the left is from the first derivation. The parse tree on the right is from the second.
Start --> A -+---------------------------> biz <--------------+- A <-- Start
             |                                                |
             +-> B -+--------------------> foo <------+-- B <-+
                    |                                 |       |
                    +-> A -+-------------> biz <- A <-+       |
                           |                                  |
                           +-> B -+------> foo <------+-- B <-+
                                  |                   |
                                  +-> A -> biz <- A <-+
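One way to check that this input really has exactly two parse trees is to count them by brute force. The Java sketch below is not JavaCC output; the class name `ParseTreeCounter` and its `flat` flag are made up for this illustration. It counts the trees for A over a whitespace-separated token string, with B and C taking either a full A (the grammar in the question) or only the `"(" A ")" | "biz"` part (the SimpleA rewrite proposed at the end of this answer).

```java
import java.util.Arrays;
import java.util.List;

/** Brute-force parse-tree counter (hypothetical helper, not generated by JavaCC). */
final class ParseTreeCounter {
    private final List<String> tokens;
    // false: B/C are followed by a full A (question's grammar)
    // true:  B/C are followed by SimpleA only (rewritten grammar)
    private final boolean flat;

    private ParseTreeCounter(List<String> tokens, boolean flat) {
        this.tokens = tokens;
        this.flat = flat;
    }

    /** Number of distinct parse trees for A over the whole input. */
    static long count(String input, boolean flat) {
        List<String> toks = Arrays.asList(input.trim().split("\\s+"));
        return new ParseTreeCounter(toks, flat).a(0, toks.size());
    }

    // A ::= Simple ( B | C )*   over tokens[i, j)
    private long a(int i, int j) {
        long total = 0;
        for (int k = i + 1; k <= j; k++) {
            long s = simple(i, k);
            if (s != 0) total += s * loop(k, j);
        }
        return total;
    }

    // Simple ::= "(" A ")" | "biz"   over tokens[i, k)
    private long simple(int i, int k) {
        if (k - i == 1) return tokens.get(i).equals("biz") ? 1 : 0;
        if (k - i >= 3 && tokens.get(i).equals("(") && tokens.get(k - 1).equals(")")) {
            return a(i + 1, k - 1);
        }
        return 0;
    }

    // ( B | C )*  with B ::= "foo" X and C ::= "bar" X, where X is A or Simple
    private long loop(int i, int j) {
        if (i == j) return 1; // zero iterations
        String t = tokens.get(i);
        if (!t.equals("foo") && !t.equals("bar")) return 0;
        long total = 0;
        for (int k = i + 2; k <= j; k++) {
            long operand = flat ? simple(i + 1, k) : a(i + 1, k);
            if (operand != 0) total += operand * loop(k, j);
        }
        return total;
    }
}
```

With these definitions, `ParseTreeCounter.count("biz foo biz foo biz", false)` comes out as 2, matching the two trees above, while the same call with `true` gives 1.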
When you have a choice
LOOKAHEAD(3) X() | Y()
That means look ahead at most 3 tokens to try to determine if X() is wrong. If the next 3 tokens show that X() is wrong, the parser uses Y(); otherwise it goes with X(). When your grammar is ambiguous, no amount of looking ahead will resolve the conflict. For ambiguous input, looking ahead won't help the parser figure out which choice is "right" and which is "wrong", because both are "right".
So why does JavaCC stop warning when you put in the LOOKAHEAD directive? It's not because the look-ahead issue has been solved. It's because when you put in a look-ahead directive, JavaCC always stops giving warnings. The assumption is that you know what you are doing, even if you don't.
Often the best way to deal with look-ahead problems is to rewrite the grammar to be unambiguous and LL(1).
So what should you do? I'm not sure, because I don't know what kind of parse tree you prefer. If it's the one on the left, I think changing the * to a ? will fix the issue.
If you like the parse tree on the right, I think the following grammar will do it
void Start(): {}
{
A()
<EOF>
}
void A(): {}
{
SimpleA()
( B() | C() )*
}
void SimpleA() : {}
{
"(" A() ")" | "biz"
}
void B(): {}
{ "foo" SimpleA() }
void C(): {}
{ "bar" SimpleA() }
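To see the shape of the tree this grammar produces, here is a hand-written recursive-descent sketch of it in Java. It is an illustration only, not the code JavaCC would generate; the class name `FlatTreeParser` and the textual tree format are assumptions made for this example. Every decision is made from the next token alone.

```java
import java.util.Arrays;
import java.util.List;

/** Hand-written recognizer for the rewritten grammar; returns the tree as text. */
final class FlatTreeParser {
    private final List<String> tokens;
    private int pos = 0;

    private FlatTreeParser(List<String> tokens) {
        this.tokens = tokens;
    }

    /** Start ::= A <EOF> */
    static String parse(String input) {
        FlatTreeParser p = new FlatTreeParser(Arrays.asList(input.trim().split("\\s+")));
        String tree = p.a();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + p.peek());
        }
        return tree;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "<EOF>";
    }

    private void expect(String t) {
        if (!peek().equals(t)) {
            throw new IllegalArgumentException("expected " + t + " but found " + peek());
        }
        pos++;
    }

    // A ::= SimpleA ( B | C )*
    private String a() {
        StringBuilder sb = new StringBuilder("A(").append(simpleA());
        while (peek().equals("foo") || peek().equals("bar")) {
            String op = peek();
            String kind = op.equals("foo") ? "B" : "C";
            pos++; // consume "foo" or "bar"
            String operand = simpleA(); // B and C take SimpleA, not A
            sb.append(", ").append(kind).append('(').append(op).append(' ').append(operand).append(')');
        }
        return sb.append(')').toString();
    }

    // SimpleA ::= "(" A ")" | "biz"
    private String simpleA() {
        if (peek().equals("(")) {
            pos++;
            String inner = a();
            expect(")");
            return "(" + inner + ")";
        }
        expect("biz");
        return "biz";
    }
}
```

For `biz foo biz foo biz` this yields `A(biz, B(foo biz), B(foo biz))`: both B's hang off the outer A, as in the right-hand tree.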
I have this jq filter and input:
( ._birthDate.extension[0] | .url, ( .extension[0] | .url, .valueString ), ( .extension[1] | .url, .valueString ) )
{
"_birthDate":{
"extension":[
{
"url":"http://catsalut.gencat.cat/fhir/StructureDefinition/patient-dataBirhtDeath",
"extension":[
{
"url":"country",
"valueString":"724"
},
{
"url":"state",
"valueString":"08"
}
]
}
]
}
}
…which yields the following output:
"http://catsalut.gencat.cat/fhir/StructureDefinition/patient-dataBirhtDeath"
"country"
"724"
"state"
"08"
I wanted to refactor the filter:
( ._birthDate.extension[0] | .url, ( .extension[:2] | .url, .valueString ) )
…but I am getting the following error:
jq: error (at :18): Cannot index array with string "url"
See this demo.
Array/String Slice: .[10:15] [docs]
... Either index may be negative (in which case it counts backwards from the end of the array), or omitted (in which case it refers to the start or end of the array).
So you were first using .extension[0], which means: take index 0 from .extension, whereas
.extension[:2] means: take the elements from index 0 up to, but not including, index 2 from .extension.
As @pmf already mentioned, the difference is the returned value: an object in the first case, but an array in the second.
So you can loop over the array using [] to return the .url and .valueString for each object inside the .extension array:
._birthDate.extension[0] | .url, ( .extension[:2][] | .url, .valueString )
Online Demo
However, since .extension is an array with only 2 elements, the :2 part doesn't do anything useful in your example, so why not simplify it to:
._birthDate.extension[0] | .url, ( .extension[] | .url, .valueString )
Online Demo
If you only need to keep the strings then you can use .. to recursively traverse your document and strings to filter out non-strings yielded along the way:
.. | strings
Demo
I have to define string literal in ANTLR4 and catch UNCLOSE_STRING exceptions.
Strings are surrounded by a pair of "" and may contain the supported escapes:
\b \f \r \n \t \' \\
The only way for " to appear inside a string is for it to be preceded by a ' (that is, '").
I have tried various ways to define a string literal, but they were all caught by UNCLOSE_STRING:
program: global_variable_part function_declaration_part EOF;
<!-- Shenanigans of statements ...-->
fragment Character: ~( [\b\f\r\n\t"\\] | '\'') | Escape | '\'"';
fragment Escape: '\\' ( 'b' | 'f' | 'r' | 'n' | 't' | '\'' | '\\');
fragment IllegalEscape: '\\' ~( 'b' | 'f' | 'r' | 'n' | 't' | '\'' | '\\') ;
STR_LIT: '"' Character* '"' {
content = str(self.text)
self.text = content[1:-1]
};
UNCLOSE_STRING: '"' Character* ([\b\f\r\n\t\\] | EOF) {
esc = ['\b', '\t', '\n', '\f', '\r', '\\']
content = str(self.text)
raise UncloseString(content)
};
For example,
"ab'"c\\n def" should match, but only Unclosed String: ab'"c\n def" was produced.
This is quite close to the specification for Strings in Java. Don't be afraid to "borrow" from other grammars. A slight modification to the Java lexer rules that (I think) matches your needs would be:
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\\r\n]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr'\\]
| '\'"' // <-- the '" escape match
;
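As a rough cross-check of these rules outside ANTLR, the same character classes can be written as a Java regular expression. This is only a sketch: `StringLiteralCheck` is a made-up helper, and a regex matched against a whole string is not the same thing as a lexer rule. ANTLR picks the longest match; in the regex the '" pair is listed first so that it is preferred over a lone '.

```java
import java.util.regex.Pattern;

/** Regex approximation of the StringLiteral rule (hypothetical helper). */
final class StringLiteralCheck {
    // Between the quotes, in order of preference:
    //   '"                   the apostrophe-quote pair
    //   \ then one of btnfr'\   a supported backslash escape
    //   any character except quote, backslash, CR, LF
    private static final Pattern STRING_LITERAL =
        Pattern.compile("\"(?:'\"|\\\\[btnfr'\\\\]|[^\"\\\\\\r\\n])*\"");

    /** True if the whole text is one well-formed string literal. */
    static boolean isStringLiteral(String text) {
        return STRING_LITERAL.matcher(text).matches();
    }
}
```

Under these assumptions the example from the question, `"ab'"c\\n def"`, is accepted, while an unterminated string or one with an unsupported escape such as `\q` is rejected.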
If you know of another language that's a closer match, you can see how it was handled by looking up its grammar here (ANTLR4 Grammars).
I have this grammar with common prefixes (<id>) and I want to transform it to avoid them.
void Components() : {}
{
(Read() | Write())* (<id>Assignment())* <id>Declaration() (Read() | Write() | <id>(Assignment() | Declaration()))*
}
The problem is (<id>Assignment())* <id>Declaration(). The grammar can have 0 or more Assignment/Read/Write statements, but at least 1 Declaration, followed by any statements/declarations in any order.
Refactoring this is easy, but I probably wouldn't do it. I'd probably look ahead a little further. Here are two solutions
Factor out the <id>
void Components() : {}
{
(Read() | Write())*
<id>
(Assignment() <id>)*
Declaration()
( Read()
| Write()
| <id> (Assignment() | Declaration())
)*
}
Use longer lookahead
void Components() : {}
{
(Read() | Write())*
(LOOKAHEAD( 2 ) <id> Assignment())*
<id> Declaration()
( Read()
| Write()
| LOOKAHEAD( 2 ) <id> Assignment()
| <id> Declaration()
)*
}
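To illustrate why the factored version needs only one token of lookahead, here is a hand-written Java sketch of it. It is not JavaCC output, and it makes simplifying assumptions: token kinds are plain strings, and the single tokens "assign" and "decl" stand in for whatever Assignment() and Declaration() match after the <id>. The class name `ComponentsRecognizer` is made up for this example.

```java
import java.util.List;

/** One-token-lookahead recognizer for the factored Components rule (sketch). */
final class ComponentsRecognizer {

    static boolean matches(List<String> t) {
        int i = 0;
        // (Read() | Write())*
        while (i < t.size() && isReadOrWrite(t.get(i))) i++;
        // <id>
        if (i >= t.size() || !t.get(i).equals("id")) return false;
        i++;
        // (Assignment() <id>)*
        while (i < t.size() && t.get(i).equals("assign")) {
            i++;
            if (i >= t.size() || !t.get(i).equals("id")) return false;
            i++;
        }
        // Declaration()
        if (i >= t.size() || !t.get(i).equals("decl")) return false;
        i++;
        // ( Read() | Write() | <id> (Assignment() | Declaration()) )*
        while (i < t.size()) {
            String kind = t.get(i);
            if (isReadOrWrite(kind)) {
                i++;
            } else if (kind.equals("id")) {
                i++;
                if (i >= t.size()) return false;
                String next = t.get(i);
                if (!next.equals("assign") && !next.equals("decl")) return false;
                i++;
            } else {
                return false;
            }
        }
        return true;
    }

    private static boolean isReadOrWrite(String kind) {
        return kind.equals("read") || kind.equals("write");
    }
}
```

At every step the recognizer looks only at the current token, which is the point of factoring out the <id>.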
I found this grammar for a calculator:
<Expression> ::= <ExpressionGroup> | <BinaryExpression> | <UnaryExpression> | <LiteralExpression>
<ExpressionGroup> ::= '(' <Expression> ')'
<BinaryExpression> ::= <Expression> <BinaryOperator> <Expression>
<UnaryExpression> ::= <UnaryOperator> <Expression>
<LiteralExpression> ::= <RealLiteral> | <IntegerLiteral>
<BinaryOperator> ::= '+' | '-' | '/' | '*'
<UnaryOperator> ::= '+' | '-'
<RealLiteral> ::= <IntegerLiteral> '.' | <IntegerLiteral> '.' <IntegerLiteral>
<IntegerLiteral> ::= <Digit> <IntegerLiteral> | <Digit>
<Digit> ::= '0' | '1' |'2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Source: here
It looks great. So I wrote the lexer and started the parser. Now there is an infinite recursion that I can't solve between Expression and BinaryExpression.
My code for expression:
boolean isExpression() {
if (isExpressionGroup() || isBinaryExpression() || isUnaryExpression() || isLiteralExpression()) {
println("Expression!");
return true;
}
println("Not expression.");
return false;
}
And for binary expression:
boolean isBinaryExpression() {
if (isExpression()) {
peek(1);
if (currentLex.token == Token.BINARY_OPERATOR) {
peek(2);
if (isExpression()) {
peek(3);
println("Binary expression!");
return true;
} else peek(0);
} else peek(0);
} else peek(0);
return false;
}
So peek(int) is just a function for looking forward without consuming any lexemes. So my problem: My input is '2*3' . isExpression() gets called. isExpressionGroup() fails, because there is no '('. Then the isBinaryExpression() gets called, which calls isExpression(). isExpressionGroup() fails again, and isBinaryExpression() gets called again. And so on, until a stack overflow.
I know, there is ANTLR and JavaCC (and other tools), but I would like to do it without them.
Could anyone give a hand?
Dealing with left recursion in a hand-crafted top-down (recursive-descent) parser is not easy. Parser generators that solve the problem have years of work in them. There are theoretical reasons for that.
The best solution if you don't want to use a tool is to eliminate the left recursion. The problem if you do it "by the book" is that you'll get an ugly grammar and an ugly parser that will be difficult to use.
But there's another solution. You can add enough rules to represent the precedence hierarchy of the operators, which is something you'd have to do anyway, unless you want to risk a+b*c being parsed as (a+b)*c.
There are plenty of examples of non left-recursive grammars for expressions on the Web, and here in SO in particular. I suggest you take one of them, and start from there.
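As a sketch of what such a grammar looks like in a hand-written parser, the Java class below uses one method per precedence level: Expression ::= Term (('+'|'-') Term)*, Term ::= Factor (('*'|'/') Factor)*, Factor ::= ('+'|'-') Factor | '(' Expression ')' | Number. The class name `ExprParser`, the choice to evaluate directly instead of building a tree, and the character-level scanning are assumptions of this example, not part of the question's code. Each method consumes at least one character before it recurses, so the infinite recursion cannot occur.

```java
/** Recursive-descent evaluator for a non-left-recursive expression grammar (sketch). */
final class ExprParser {
    private final String src;
    private int pos = 0;

    private ExprParser(String src) {
        // Simplification: whitespace is stripped instead of being tokenized.
        this.src = src.replaceAll("\\s+", "");
    }

    static double eval(String input) {
        ExprParser p = new ExprParser(input);
        double v = p.expression();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return v;
    }

    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    // Expression ::= Term ( ('+' | '-') Term )*
    private double expression() {
        double v = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            double rhs = term();
            v = (op == '+') ? v + rhs : v - rhs;
        }
        return v;
    }

    // Term ::= Factor ( ('*' | '/') Factor )*
    private double term() {
        double v = factor();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            double rhs = factor();
            v = (op == '*') ? v * rhs : v / rhs;
        }
        return v;
    }

    // Factor ::= ('+' | '-') Factor | '(' Expression ')' | Number
    private double factor() {
        char c = peek();
        if (c == '+') { pos++; return factor(); }
        if (c == '-') { pos++; return -factor(); }
        if (c == '(') {
            pos++;
            double v = expression();
            if (peek() != ')') throw new IllegalArgumentException("expected ')' at position " + pos);
            pos++;
            return v;
        }
        // Number ::= Digit+ ( '.' Digit* )?
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (pos == start) throw new IllegalArgumentException("expected a number at position " + pos);
        if (peek() == '.') {
            pos++;
            while (Character.isDigit(peek())) pos++;
        }
        return Double.parseDouble(src.substring(start, pos));
    }
}
```

The loops in `expression()` and `term()` replace the left-recursive BinaryExpression rule and make the operators left-associative, so `8-3-2` is evaluated as `(8-3)-2`, and `1+2*3` as `1+(2*3)`.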
I have a problem with rule priority in BNF Converter. Here are some of the rules:
CParams. CallParams ::= [CallParam] ;
separator CallParam "," ;
VarCParam. CallParam ::= Ident ;
ExpCParam. CallParam ::= Exp ;
BExpCParam. CallParam ::= BExp ;
[...]
EVar. Exp3 ::= Ident ;
[...]
BVar. BExp2 ::= Ident ;
I wrote an example program:
void p(int a) {
a = a+7;
print a;
}
main() {
int i;
p(i);
}
As a result I expect that p(i) will be translated to CParams [VarCParam (Ident "i")], but it is converted to CParams [BExpCParam (BVar (Ident "i"))].
Could you tell me how to change the rules in order to fix this bug?
There is a conflict in your grammar: both trees are possible. Happy just chooses one of them, but it probably printed something like this during compilation:
reduce/reduce conflicts: 2
To fix it you have to remove one of those rules:
VarCParam. CallParam ::= Ident ;
BExpCParam. CallParam ::= BExp ;
BVar. BExp2 ::= Ident ;