Define token to match any string - javacc

I am new to javacc. I am trying to define a token which can match any string. I am following the regex syntax <ANY: (~[])+> which is not working. I want to achieve something very simple, define an expression having the following BNF:
<exp> ::= "path(" <string> "," <number> ")"
My current .jj file is below. Any help on how I can parse the string would be appreciated:
options
{
}
PARSER_BEGIN(SimpleAdd)
package SimpleAddTest;
public class SimpleAdd
{
}
PARSER_END(SimpleAdd)
SKIP :
{
" "
| "\r"
| "\t"
| "\n"
}
TOKEN:
{
< NUMBER: (["0"-"9"])+ > |
<PATH: "path"> |
<RPAR: "("> |
<LPAR: ")"> |
<QUOTE: "'"> |
<COMMA: ","> |
<ANY: (~[])+>
}
int expr():
{
String leftValue ;
int rightValue ;
}
{
<PATH> <RPAR> <QUOTE> leftValue = str() <QUOTE> <COMMA> rightValue = num() <LPAR>
{ return 0; }
}
String str():
{
Token t;
}
{
t = <ANY> { return t.toString(); }
}
int num():
{
Token t;
}
{
t = <NUMBER> { return Integer.parseInt(t.toString()); }
}
The error I am getting with the above javacc file is:
Exception in thread "main" SimpleAddTest.ParseException: Encountered " <ANY> "path(\'5\',1) "" at line 1, column 1.
Was expecting:
"path" ...

The pattern <ANY: (~[])+> will indeed match any nonempty string. The issue is that this is not what you really want. If you have a rule <ANY: (~[])+>, it will match the whole file, unless the file is empty. In most cases, because of the longest match rule, the whole file will be parsed as [ANY, EOF]. Is that really what you want? Probably not.
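To make the longest match rule concrete, here is a minimal Java sketch. It is not JavaCC's generated token manager; the class LongestMatchDemo and its tokenize method are hypothetical stand-ins that try every token pattern from the question at the current position and keep the longest match, with earlier rules winning ties.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a token manager that applies the longest match rule.
class LongestMatchDemo {
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
        RULES.put("PATH", Pattern.compile("path"));
        RULES.put("RPAR", Pattern.compile("\\("));
        RULES.put("LPAR", Pattern.compile("\\)"));
        RULES.put("QUOTE", Pattern.compile("'"));
        RULES.put("COMMA", Pattern.compile(","));
        RULES.put("ANY", Pattern.compile(".+", Pattern.DOTALL)); // plays the role of (~[])+
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestKind = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so an earlier rule wins when two matches have equal length.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestKind = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestKind == null) {
                throw new IllegalStateException("No token matches at position " + pos);
            }
            tokens.add(bestKind + "(" + input.substring(pos, bestEnd) + ")");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("path('5',1)")); // prints [ANY(path('5',1))]
    }
}
```

The whole input comes back as a single ANY token, which is consistent with the "Encountered <ANY>" message in the question: PATH matches only four characters, so it loses to ANY.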
So I'm going to guess at what you really want. I'll guess you want any string that doesn't include a double quote character. Maybe there are other restrictions, such as no nonprinting characters. Maybe you want to allow double quotes if they are preceded by a backslash. Who knows? Adjust as needed.
Here is what you can do. First, replace the token definitions with
TOKEN:
{
< NUMBER: (["0"-"9"])+ > |
<PATH: "path"> |
<RPAR: "("> |
<LPAR: ")"> |
<COMMA: ","> |
<STRING: "\"" (~["\""])* "\"" >
}
Then change your grammar to
int expr():
{
String leftValue ;
int rightValue ;
}
{
<PATH> <RPAR> leftValue=str() <COMMA> rightValue = num() <LPAR>
{ return 0; }
}
String str():
{
Token t;
int len ;
}
{
t = <STRING>
{ len = t.image.length() ; }
{ return t.image.substring(1,len-1); }
}
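As a rough cross-check of the quote-stripping step, here is a plain Java sketch that accepts the same shape of input with a regular expression. It is not generated by JavaCC; the class PathExprDemo, its Result type and the parse method are hypothetical names. The STRING group mirrors "\"" (~["\""])* "\"", and the whitespace class mirrors the four SKIP characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of: "path" "(" STRING "," NUMBER ")"
class PathExprDemo {
    static final class Result {
        final String path;
        final int number;
        Result(String path, int number) { this.path = path; this.number = number; }
    }

    // The same characters the grammar lists under SKIP.
    static final String WS = "[ \\t\\r\\n]*";

    static final Pattern EXPR = Pattern.compile(
        WS + "path" + WS + "\\(" + WS + "(\"[^\"]*\")" + WS + "," + WS
           + "([0-9]+)" + WS + "\\)" + WS);

    static Result parse(String input) {
        Matcher m = EXPR.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a path expression: " + input);
        }
        String image = m.group(1); // still includes both quote characters
        // Same trimming as t.image.substring(1, len-1) in str().
        String unquoted = image.substring(1, image.length() - 1);
        return new Result(unquoted, Integer.parseInt(m.group(2)));
    }

    public static void main(String[] args) {
        Result r = parse("path(\"a/b c\", 12)");
        System.out.println(r.path + " " + r.number); // prints a/b c 12
    }
}
```

Because the string is delimited by quotes, it can contain commas, parentheses and digits without being confused with the other tokens.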

Related

Javacc error reporting results in “Expansion can be matched by empty string.”

I am trying to add some custom error messages to my javacc parser to hopefully make the error messages more specific and the language problems easier to find and correct.
The first error that I am trying to focus on is detecting whether the correct number of arguments has been provided to a 'function' call. Rather than the default message, I would like to print out something like "missing argument to function".
My simplified language and my attempt to catch a missing argument error looks something like:
double arg(boolean allowMissing):
{ double v; Token t; }
{
t = <INT> { return Double.parseDouble(t.image); }
| t = <DOUBLE> { return Double.parseDouble(t.image); }
| v = functions() { return v; }
| { if (!allowMissing) throw new ParseException("Missing argument");} // #1 Throw error if missing argument
}
double functions() :
{ double v1, v2, result;
double[] array;
}
{
(<MIN> "(" v1=arg(false) "," v2=arg(false) ")") { return (v1<v2)?v1:v2; }
| (<MAX> "(" v1=arg(false) "," v2=arg(false) ")") { return (v1>v2)?v1:v2; }
| (<POW> "(" v1=arg(false) "," v2=arg(false) ")") { return Math.pow(v1, v2); }
| (<SUM> "(" array=argList() ")") { result=0; for (double v:array) result+=v; return result;}
}
double[] argList() :
{
ArrayList<Double> list = new ArrayList<>();
double v;
}
{
( (v=arg(true) { list.add(v);} ( "," v=arg(false) {list.add(v);} )*)?) { // #2 Expansion can be matched by empty string here
double[] arr = new double[list.size()];
for (int i=0; i<list.size(); i++)
arr[i] = list.get(i);
return arr;
}
}
As you can see, functions recursively resolve their arguments, and this allows function calls to be nested.
Here are a few valid expressions that can be parsed in this language:
"min(1,2)",
"max(1,2)",
"max(pow(2,2),2)",
"sum(1,2,3,4,5)",
"sum()"
Here is an invalid expression:
"min()"
This all worked well until I tried to check for missing arguments (code location #1). The check works fine for the functions that have a fixed number of arguments. The problem is that the sum function (code location #2) is allowed to have zero arguments. I even passed in a flag so that no error is thrown where missing arguments are allowed. However, javacc gives me an error at location #2: "Expansion within "(...)?" can be matched by empty string". I understand why I get this error. I have also read the answer for JavaCC custom errors cause "Expansion can be matched by empty string." but it did not help me.
My problem is that I just cannot see how I can have this both ways. I want to throw an error for missing arguments in the functions that take a fixed number of arguments, but I don't want an error in the function that allows no arguments. Is there a way to refactor my parser so that I still use the recursive style, catch missing arguments in the functions that take a fixed number of arguments, yet allow some functions to have zero arguments?
Or is there a better way to add in custom error messages? I am not really seeing much in the documentation.
Also, any pointers to examples that use more sophisticated error reporting would be greatly appreciated. I am actually using jjtree, but I simplified it down for this example.
Here's what I would do.
Instead of using a boolean argument in function arg, I would use the ? operator:
double arg():
{ double v; Token t; }
{
t = <INT> { return Double.parseDouble(t.image); }
| t = <DOUBLE> { return Double.parseDouble(t.image); }
| v = functions() { return v; }
}
double functions() :
{ double v1=0, v2=0, result;
double[] array;
}
{
(<MIN> "(" (v1=arg())? "," (v2=arg())? ")") { return (v1<v2)?v1:v2; }
| (<MAX> "(" (v1=arg())? "," (v2=arg())? ")") { return (v1>v2)?v1:v2; }
| (<POW> "(" (v1=arg())? "," (v2=arg())? ")") { return Math.pow(v1, v2); }
| (<SUM> "(" array=argList() ")") { result=0; for (double v:array) result+=v; return result;}
}
double[] argList() :
{
List<Double> list = new ArrayList<Double>();
double v;
}
{
( (v=arg() { list.add(v); } | { list.add(0.); } )
( "," (v=arg() { list.add(v); } | { list.add(0.); } ) )*) {
double[] arr = new double[list.size()];
for (int i=0; i<list.size(); i++)
arr[i] = list.get(i);
return arr;
}
}
You could do this:
double arg():
{ double v; Token t; }
{
t = <INT> { return Double.parseDouble(t.image); }
| t = <DOUBLE> { return Double.parseDouble(t.image); }
| v = functions() { return v; }
}
double argRequired():
{ double v; }
{
v = arg() { return v ; }
| { throw new ParseException("Missing argument"); } // #1 Throw error if missing argument (argRequired has no allowMissing flag)
}
double argOptional( double defaultValue ): // Not needed for this example, but might be useful.
{ double v; }
{
v = arg() { return v ; }
| { return defaultValue ; }
}
double functions() :
{ double v1, v2, result;
double[] array;
}
{
(<MIN> "(" v1=argRequired() "," v2=argRequired() ")") { return (v1<v2)?v1:v2; }
| (<MAX> "(" v1=argRequired() "," v2=argRequired() ")") { return (v1>v2)?v1:v2; }
| (<POW> "(" v1=argRequired() "," v2=argRequired() ")") { return Math.pow(v1, v2); }
| (<SUM> "(" array=argList() ")") { result=0; for (double v:array) result+=v; return result;}
}
double[] argList( ) :
{
ArrayList<Double> list = new ArrayList<>();
double v;
}
{
( v=arg() { list.add(v);}
( "," v=argRequired() {list.add(v);}
)*
)?
{
double[] arr = new double[list.size()];
for (int i=0; i<list.size(); i++)
arr[i] = list.get(i);
return arr;
}
}
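The split between arg (never empty), argRequired (reports a missing argument) and an optional argument list can also be seen in a small hand-written recursive descent evaluator. The following Java sketch is only an illustration of that structure, not output from JavaCC; the class ArgDemo, its eval method and the use of IllegalArgumentException in place of ParseException are assumptions made to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written counterpart of arg / argRequired / argList.
class ArgDemo {
    private final String src;
    private int pos = 0;

    private ArgDemo(String src) { this.src = src; }

    static double eval(String text) {
        ArgDemo p = new ArgDemo(text.replace(" ", ""));
        double v = p.arg();
        if (p.pos != p.src.length()) throw new IllegalArgumentException("Trailing input at " + p.pos);
        return v;
    }

    // True when the next character can begin an arg (a number or a function name).
    private boolean atArgStart() {
        return pos < src.length() && Character.isLetterOrDigit(src.charAt(pos));
    }

    private void expect(char c) {
        if (pos >= src.length() || src.charAt(pos) != c)
            throw new IllegalArgumentException("Expected '" + c + "' at " + pos);
        pos++;
    }

    // arg: a number or a function call; it never matches the empty string.
    private double arg() {
        int start = pos;
        if (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            while (pos < src.length() && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) pos++;
            return Double.parseDouble(src.substring(start, pos));
        }
        while (pos < src.length() && Character.isLetter(src.charAt(pos))) pos++;
        String name = src.substring(start, pos);
        expect('(');
        double result;
        switch (name) {
            case "min": { double a = argRequired(); expect(','); double b = argRequired(); result = Math.min(a, b); break; }
            case "max": { double a = argRequired(); expect(','); double b = argRequired(); result = Math.max(a, b); break; }
            case "pow": { double a = argRequired(); expect(','); double b = argRequired(); result = Math.pow(a, b); break; }
            case "sum": { result = 0; for (double v : argList()) result += v; break; }
            default: throw new IllegalArgumentException("Unknown function '" + name + "'");
        }
        expect(')');
        return result;
    }

    // argRequired: the only place that reports a missing argument.
    private double argRequired() {
        if (!atArgStart()) throw new IllegalArgumentException("Missing argument at position " + pos);
        return arg();
    }

    // argList: the whole list is optional, but every argument after a comma is required.
    private List<Double> argList() {
        List<Double> list = new ArrayList<>();
        if (atArgStart()) {
            list.add(arg());
            while (pos < src.length() && src.charAt(pos) == ',') {
                pos++;
                list.add(argRequired());
            }
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(eval("max(pow(2,2),2)")); // prints 4.0
        System.out.println(eval("sum()"));           // prints 0.0
    }
}
```

With this structure, eval("min()") fails with a "Missing argument" message while eval("sum()") succeeds, because only the optional wrapper in argList is allowed to match nothing.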

JavaCC simple example not working

I am trying javacc for the first time with a simple naive example which is not working. My BNF is as follows:
<exp>:= <num>"+"<num>
<num>:= <digit> | <digit><num>
<digit>:= [0-9]
Based on this BNF, I am writing the SimpleAdd.jj as follows:
options
{
}
PARSER_BEGIN(SimpleAdd)
public class SimpleAdd
{
}
PARSER_END(SimpleAdd)
SKIP :
{
" "
| "\r"
| "\t"
| "\n"
}
TOKEN:
{
< NUMBER: (["0"-"9"])+ >
}
int expr():
{
int leftValue ;
int rightValue ;
}
{
leftValue = num()
"+"
rightValue = num()
{ return leftValue+rightValue; }
}
int num():
{
Token t;
}
{
t = <NUMBER> { return Integer.parseInt(t.toString()); }
}
Using the above file, I am generating the Java source classes. My main class is as follows:
public class Main {
public static void main(String [] args) throws ParseException {
SimpleAdd parser = new SimpleAdd(System.in);
int x = parser.expr();
System.out.println(x);
}
}
When I am entering the expression via System.in, I am getting the following error:
11+11^D
Exception in thread "main" SimpleAddTest.ParseException: Encountered "<EOF>" at line 0, column 0.
Was expecting:
<NUMBER> ...
at SimpleAddTest.SimpleAdd.generateParseException(SimpleAdd.java:200)
at SimpleAddTest.SimpleAdd.jj_consume_token(SimpleAdd.java:138)
at SimpleAddTest.SimpleAdd.num(SimpleAdd.java:16)
at SimpleAddTest.SimpleAdd.expr(SimpleAdd.java:7)
at SimpleAddTest.Main.main(Main.java:9)
Any hint to solve the problem ?
Edit: Note that this answer addresses an earlier version of the question.
When a BNF production uses a nonterminal that returns a result, you can record that result in a variable.
First, declare the variables in the declaration part of the BNF production:
int expr():
{
int leftValue ;
int rightValue ;
}
{
Second, in the main body of the production, record the results in the variables.
leftValue = num()
"+"
rightValue = num()
Finally, use the values of those variables to compute the result of this production.
{ return leftValue+rightValue; }
}

Why am I getting a lexical error in this code using the javacc tool?

I have made an AssignStatement class and I am trying to parse a string using javacc.
The assignment statement is of the form: a=b+c*d.
Here is the source code:
options
{
static=false;
DEBUG_TOKEN_MANAGER=true;
}
PARSER_BEGIN(AssignStatement)
public class AssignStatement
{
public static void main(String s[])
{
try
{
AssignStatement as=new AssignStatement(System.in);
as.StartSymbol();
System.out.println("Syntax checking successfully");
}
catch(Throwable e)
{
System.out.println("Syntex checking failed"+e.getMessage());
}
}
}
PARSER_END(AssignStatement)
SKIP: {"" | "\t" | "\n" | "\r" }
TOKEN:{ "(" | ")" | "+" | "*" | ":=" | <NUM:(["0"-"9"])+>| <ID:(["a"-"z"])+> }
void StartSymbol(): {}
{
(AStmt())*<EOF>
}
void AStmt(): {}
{
LOOKAHEAD(2) <ID> "=" AStmt()
| Term() ("+" Term())*
}
void Term(): {}
{
Factor() ("*" Factor())*
}
void Factor(): {}
{
<NUM>
| <ID>
| "(" AStmt() ")"
}
The output I got after running java AssignStatement:
"a=10+20*30"
Current character : \" (34) at line 1 column 1
No string literal matches possible.
Starting NFA to match one of : { , }
Current character : \" (34) at line 1 column 1
Syntex checking failedLexical error at line 1, column 1. Encountered: "\"" (34)
, after : ""
The output I expected:
Syntax checking successfully
The first character of the input is ", but no token definition allows a token to start with ". So the lexer throws a TokenMgrError after reading the first character. Typing the input without the surrounding quotes avoids this particular error.
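The following Java sketch is a rough, character-level approximation of that check, not the generated token manager. The class LexicalErrorDemo and the method firstLexicalError are hypothetical, and the sketch assumes the space character is meant to be skipped along with the tab, newline and carriage return listed under SKIP. It only asks whether each character can appear in some token or SKIP entry of the grammar in the question.

```java
// Hypothetical approximation: which characters can the token definitions accept at all?
class LexicalErrorDemo {
    // Punctuation used by the tokens "(", ")", "+", "*", ":=" and the "=" literal,
    // plus whitespace (assumption: a space is skipped like \t, \n and \r).
    static final String ALLOWED_PUNCTUATION = "()+*:= \t\n\r";

    // Returns the index of the first character no token can contain, or -1 if there is none.
    static int firstLexicalError(String input) {
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean ok = (c >= 'a' && c <= 'z')      // ID
                    || (c >= '0' && c <= '9')        // NUM
                    || ALLOWED_PUNCTUATION.indexOf(c) >= 0;
            if (!ok) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstLexicalError("\"a=10+20*30\"")); // prints 0
        System.out.println(firstLexicalError("a=10+20*30"));     // prints -1
    }
}
```

Index 0 corresponds to line 1, column 1 in the reported error.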

JavaCC grammar - proper lexing

I have a JavaCC grammar with following definitions:
<REGULAR_IDENTIFIER : (["A"-"Z"])+ > // simple identifier like say "DODGE"
<_LABEL : (["A"-"Z"])+ (":") > // label, eg "DODGE:"
<DOUBLECOLON : "::">
<COLON : ":">
Right now "DODGE::" is lexed as <_LABEL> <COLON> ("DODGE:" ":"),
but I need it to be lexed as <REGULAR_IDENTIFIER> <DOUBLECOLON> ("DODGE" "::").
I think the following will work:
MORE: { < (["A"-"Z"])+ :S0 > } // Could be identifier or label.
<S0> TOKEN: { <LABEL : ":" : DEFAULT> } // label, eg "DODGE:"
<S0> TOKEN: { <IDENTIFIER : "" : DEFAULT > } // simple identifier like say "DODGE"
<S0> TOKEN: { <IDENTIFIER : "::" { matchedToken.image = image.substring(0,image.length()-2) ; } : S1 > }
<S1> TOKEN: { <DOUBLECOLON : "" { matchedToken.image = "::" ; } : DEFAULT> }
<DOUBLECOLON : "::">
<COLON : ":">
Note that "DODGE:::" is three tokens, not two.
JavaCC uses the maximal match rule (the longest prefix match rule); see:
http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm#more-than-one
This means that the _LABEL token will be matched before the REGULAR_IDENTIFIER token, as the _LABEL token will contain more characters. This means that what you are trying to do should not be done in the tokenizer.
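The effect of the maximal match rule on "DODGE::" can be reproduced with a short Java sketch. This is not JavaCC's token manager; MaximalMunchDemo, rules and tokenize are hypothetical names, and the regular expressions are direct transcriptions of the four token definitions in the question. The flag controls whether _LABEL is defined as a token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical longest-match tokenizer for the identifier/label/colon tokens.
class MaximalMunchDemo {
    static Map<String, Pattern> rules(boolean withLabelToken) {
        Map<String, Pattern> r = new LinkedHashMap<>();
        r.put("REGULAR_IDENTIFIER", Pattern.compile("[A-Z]+"));
        if (withLabelToken) r.put("_LABEL", Pattern.compile("[A-Z]+:"));
        r.put("DOUBLECOLON", Pattern.compile("::"));
        r.put("COLON", Pattern.compile(":"));
        return r;
    }

    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestKind = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) { // the longest match wins
                    bestKind = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestKind == null) {
                throw new IllegalStateException("No token matches at position " + pos);
            }
            tokens.add(bestKind + "(" + input.substring(pos, bestEnd) + ")");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("DODGE::", rules(true)));  // prints [_LABEL(DODGE:), COLON(:)]
        System.out.println(tokenize("DODGE::", rules(false))); // prints [REGULAR_IDENTIFIER(DODGE), DOUBLECOLON(::)]
    }
}
```

Dropping _LABEL from the token definitions gives the desired split for "DODGE::", at the cost of having to recognize labels at the parser level.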
I have written a parser that recognizes the grammar correctly. It uses the parser, instead of the tokenizer, to recognize the _LABELs:
options {
STATIC = false;
}
PARSER_BEGIN(Parser)
import java.io.StringReader;
public class Parser {
//Main method, parses the first argument to the program
public static void main(String[] args) throws ParseException {
System.out.println("Parsing: " + args[0]);
Parser parser = new Parser(new StringReader(args[0]));
parser.Start();
}
}
PARSER_END(Parser)
//The _LABEL will be recognized by the parser, not the tokenizer
TOKEN :
{
<DOUBLECOLON : "::"> //The double colon token is preferred over the single colon because of the maximal munch rule
|
<COLON : ":">
|
<REGULAR_IDENTIFIER : (["A"-"Z"])+ > // simple identifier like say "DODGE"
}
/** Root production. */
void Start() :
{}
{
(
LOOKAHEAD(2) //We need a lookahead of two, to see if this is a label or not
<REGULAR_IDENTIFIER> <COLON> { System.err.println("label"); } //Labels, should probably be put in their own production
| <REGULAR_IDENTIFIER> { System.err.println("reg_id"); } //Regular identifiers
| <DOUBLECOLON> { System.err.println("DC"); }
| <COLON> { System.err.println("C"); }
)+
}
In a real grammar you should of course move the <REGULAR_IDENTIFIER> <COLON> sequence into a _label production.
Hope it helps.

Javacc for parsing '<UPPER_CASE> <ARROW>

I am writing a parser for a set of CFG.
(Note: The LHS can ONLY be an uppercase letter)
/*ignore declaration and stuff, here's the main part of the code */
void
start():
{
}
{
(
<UPPER_CHAR>
<ARROW>
<STRING>
( <PIPE> <STRING> )*
)*
}
TOKEN:
{
<ARROW: "=>" >
|
<PIPE: "|">
|
<UPPER_CHAR: (["A"-"Z"])>
}
TOKEN: {<STRING: (<LETTER> | <DIGIT> | <SYMBOL>)+ > }
This obviously misses some edge cases, some of which include:
A => A | a | D E => e
So what did I do wrong?
I guess SYMBOL includes "=" and ">" but not "|". In that case, STRING will match the whole of " D E => e".
Why do you want STRING at all? Why not do something like this?
void start() : {} {
(
<UPPER_CHAR> <ARROW>
choices()
)*
}
void choices() : {} {
choice() ( <PIPE> choice())*
}
void choice() : {} {
LOOKAHEAD(<UPPER_CHAR> <ARROW> )
{}
|
(<UPPER_CHAR> | <LOWER_CHAR>) choice()
|
{}
}
The reason I used recursion for choice is that there is no way to use syntactic lookahead to exit a loop. I.e. what you want is (<UPPER_CHAR> | <LOWER_CHAR>)*, but you want to get out of this loop as soon as the next two tokens are <UPPER_CHAR> <ARROW>.
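In hand-written code the same exit condition is a two-token peek inside an ordinary loop. The Java sketch below illustrates that idea only; ChoiceLoopDemo and choiceEnd are hypothetical names, token kinds are represented as plain strings, and LOWER stands for the LOWER_CHAR token that the grammar above assumes but the question does not define.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: consume UPPER/LOWER tokens until the next two tokens are UPPER ARROW.
class ChoiceLoopDemo {
    // Returns the index just past the choice that starts at pos.
    static int choiceEnd(List<String> kinds, int pos) {
        while (pos < kinds.size()) {
            String kind = kinds.get(pos);
            boolean nextIsArrow = pos + 1 < kinds.size() && kinds.get(pos + 1).equals("ARROW");
            if (kind.equals("UPPER") && nextIsArrow) break;            // start of the next rule
            if (!kind.equals("UPPER") && !kind.equals("LOWER")) break; // PIPE or anything else
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Token kinds for: A => A | a | D E => e
        List<String> kinds = Arrays.asList(
            "UPPER", "ARROW", "UPPER", "PIPE", "LOWER", "PIPE",
            "UPPER", "UPPER", "ARROW", "LOWER");
        System.out.println(choiceEnd(kinds, 6)); // prints 7: the choice is D, and E => starts a new rule
    }
}
```

The third choice of the first rule stops after D, so E is left to begin the next rule.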
