Properly catching an unclosed string in ANTLR4 - string-literals

I have to define string literal in ANTLR4 and catch UNCLOSE_STRING exceptions.
Strings are surrounded by a pair of "" and have may have supported escapes:
\b \f \r \n \t \’ \\
The only way for " to appear inside a string is to be appended by a
' ('").
I have tried various ways to define a string literal but they were all catched by UNCLOSE_STRING:
program: global_variable_part function_declaration_part EOF;
<!-- Shenanigans of statements ...-->
fragment Character: ~( [\b\f\r\n\t"\\] | '\'') | Escape | '\'"';
fragment Escape: '\\' ( 'b' | 'f' | 'r' | 'n' | 't' | '\'' | '\\');
fragment IllegalEscape: '\\' ~( 'b' | 'f' | 'r' | 'n' | 't' | '\'' | '\\') ;
STR_LIT: '"' Character* '"' {
content = str(self.text)
self.text = content[1:-1]
};
UNCLOSE_STRING: '"' Character* ([\b\f\r\n\t\\] | EOF) {
esc = ['\b', '\t', '\n', '\f', '\r', '\\']
content = str(self.text)
raise UncloseString(content)
};
For example
"ab'"c\\n def" would match but only Unclosed String: ab'"c\n def" was produced.

This is quite close to the specification for Strings in Java. Don't be afraid to "borrow" from other grammars. I slight modification to the Java Lexer rules that (I think) matches your needs would be:
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\\r\n]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr'\\]
: "\'"" // <-- the '" escape match
;
If you know of another language that's a closer match, you can look at how it was handled for looking for it's grammar here (ANTLR4 Grammars)

Related

Make tr replace only exact characters given in arguments

I'm trying to convert filenames to remove unacceptable characters, but tr doesn't always treat its input arguments exactly as they're given.
For example:
$ echo "(hello) - {world}" | tr '()-{}' '_'
_______ _ _______
...whereas I only intended to replace (, ), -, { and }, all the characters between ) and { in ASCII collation order were replaced as well -- so every letter in the input also became a _!
Is there a way to make tr replace only the exact characters given in its argument?
tr's syntax is surprisingly complicated. It supports ranges, character classes, collation-based equivalence matching, etc.
To avoid surprises (when a string matches any of that syntax unexpectedly), we can convert our literal characters to a string of \### octal specifiers of those characters' ordinals:
trExpressionFor() {
printf %s "$1" | od -v -A n -b | tr ' ' '\\'
}
trL() { # name short for "tr-literal"
tr "$(trExpressionFor "$1")" "$(trExpressionFor "$2")"
}
...used as:
$ trExpressionFor '()-{}'
\050\051\055\173\175
$ echo "(hello) - {world}" | trL '()-{}' '_'
_hello_ _ _world_

How can one filter a JSON object to select only specific key/values using jq?

I'm trying to validate all versions in a versions.json file, and get as the output a json with only the invalid versions.
Here's a sample file:
{
"slamx": "16.4.0 ",
"sdbe": null,
"mimir": null,
"thoth": null,
"quasar": null,
"connectors": {
"s3": "16.0.17",
"azure": "6.0.17",
"url": "8.0.2",
"mongo": "7.0.15"
}
}
I can use the following jq script line to do what I want:
delpaths([paths(type == "string" and contains(" ") or type == "object" | not)])
| delpaths([paths(type == "object" and (to_entries | length == 0))])
And use it on a shell like this:
BAD_VERSIONS=$(jq 'delpaths([paths(type == "string" and contains(" ") or type == "object" | not)]) | delpaths([paths(type == "object" and (to_entries | length == 0))])' versions.json)
if [[ $BAD_VERSIONS != "{}" ]]; then
echo >&2 $'Bad versions detected in versions.json:\n'"$BAD_VERSIONS"
exit 1
fi
and get this as the output:
Bad versions detected in versions.json:
{
"slamx": "16.4.0 "
}
However, that's a very convoluted way of doing the filtering. Instead of just walking the paths tree and just saying "keep this, keep that", I need to create a list of things I do not want and remove them, twice.
Given all the path-handling builtins and recursive processing, I can't help but feel that there has to be a better way of doing this, something akin to select, but working recursively across the object, but the best I could do was this:
. as $input |
[path(recurse(.[]?)|select(strings|contains("16")))] as $paths |
reduce $paths[] as $x ({}; . | setpath($x; ($input | getpath($x))))
I don't like that for two reasons. First, I'm creating a new object instead of "editing" the old one. Second and foremost, it's full of variables, which points to a severe flow inversion issue, and adds to the complexity.
Any ideas?
Thanks to #jhnc's comment, I found a solution. The trick was using streams, which makes nesting irrelevant -- I can apply filters based solely on the value, and the objects will be recomposed given the key paths.
The first thing I tried did not work, however. This:
jq -c 'tostream|select(.[-1] | type=="string" and contains(" "))' versions.json
returns [["slamx"],"16.4.0 "], which is what I'm searching for. However, I could not fold it back into an object. For that to happen, the stream has to have the "close object" markers -- arrays with just one element, corresponding to the last key of the object being closed. So I changed it to this:
jq -c 'tostream|select((.[-1] | type=="string" and contains(" ")) or length==1)' versions.json
Breaking it down, .[-1] selects the last element of the array, which will be the value. Next, type=="string" and contains(" ") will select all values which are strings and contain spaces. The last part of the select, length==1, keeps all the "end" markers. Interestingly, it works even if the end marker does not correspond to the last key, so this might be brittle.
With that done, I can de-stream it:
jq -c 'fromstream(tostream|select((.[-1] | type=="string" and contains(" ")) or length==1))' versions.json
The jq expression is as follow:
fromstream(
tostream |
select(
(
.[-1] |
type=="string" and contains(" ")
) or
length==1
)
)
For objects, the test to_entries|length == 0 can be abbreviated to length==0.
If I understand the goal correctly, you could just use .., perhaps along the following lines:
..
| objects
| with_entries(
select(( .value|type == "string" and contains(" ")) or (.value|type == "object" and length==0)) )
| select(length>0)
paths
If you want the paths, then consider:
([], paths) as $p
| getpath($p)
| objects
| with_entries(
select(( .value|type == "string" and contains(" ")) or (.value|type == "object" and length==0)) )
| select(length>0) as $x
| {} | setpath($p; $x)
With your input modified so that s3 has a trailing blank, the above produces:
{"slamx":"16.4.0 "}
{"connectors":{"s3":"16.0.17 "}}

How to solve this grammar recursion?

I found this grammar for a calculator:
<Expression> ::= <ExpressionGroup> | <BinaryExpression> | <UnaryExpression> | <LiteralExpression>
<ExpressionGroup> ::= '(' <Expression> ')'
<BinaryExpression> ::= <Expression> <BinaryOperator> <Expression>
<UnaryExpression> ::= <UnaryOperator> <Expression>
<LiteralExpression> ::= <RealLiteral> | <IntegerLiteral>
<BinaryOperator> ::= '+' | '-' | '/' | '*'
<UnaryOperator> ::= '+' | '-'
<RealLiteral> ::= <IntegerLiteral> '.' | <IntegerLiteral> '.' <IntegerLiteral>
<IntegerLiteral> ::= <Digit> <IntegerLiteral> | <Digit>
<Digit> ::= '0' | '1' |'2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Source: here
It looks great. So I wrote the lexer and started the parser. Now there is an infinite recursion that I can't solve between Expression and BinaryExpression.
My code for expression:
boolean isExpression() {
if (isExpressionGroup() || isBinaryExpression() || isUnaryExpression() || isLiteralExpression()) {
println("Expression!");
return true;
}
println("Not expression.");
return false;
}
And for binary expression:
boolean isBinaryExpression() {
if (isExpression()) {
peek(1);
if (currentLex.token == Token.BINARY_OPERATOR) {
peek(2);
if (isExpression()) {
peek(3);
println("Binary expression!");
return true;
} else peek(0);
} else peek(0);
} else peek(0);
return false;
}
So peek(int) is just a function for looking forward without consuming any lexemes. So my problem: My input is '2*3' . isExpression() gets called. isExpressionGroup() fails, because there is no '('. Then the isBinaryExpression() gets called, which calls isExpression(). isExpressionGroup() fails again, and isBinaryExpression() gets called again. And so on, until a stack overflow.
I know, there is ANTLR and JavaCC (and other tools), but I would like to do it without them.
Could anyone give a hand?
Dealing with left recursion in a hand-crafted top-descent parser is not easy. Parser generators that solve the problem have years of work in them. There are theoretical reasons for that.
The best solution if you don't want to use a tool is to eliminate the left recursion. The problem if you do it "by the book" is that you'll get an ugly grammar and an ugly parser that will be difficult to use.
But there's another solution. You can add enough rules to represent the precedence hierarchy of the operators, which is something you'd have to do anyway, unless you want to risk a a+b*c be parsed as (a+b)*c.
There are plenty of examples of non left-recursive grammars for expressions on the Web, and here in SO in particular. I suggest you take one of them, and start from there.

Fitting order in bnf converter

I have problem with rules priority in bnf converter. Here I copy some rules
CParams. CallParams ::= [CallParam] ;
separator CallParam "," ;
VarCParam. CallParam ::= Ident ;
ExpCParam. CallParam ::= Exp ;
BExpCParam. CallParam ::= BExp ;
[...]
EVar. Exp3 ::= Ident ;
[...]
BVar. BExp2 ::= Ident ;
I write an example program:
void p(int a) {
a = a+7;
print a;
}
main() {
int i;
p(i);
}
As a result I expect that p(i) will be translated to CParams [VarCParam (Ident "i")], but it is converted to CParams [BExpCParam (BVar (Ident "i"))].
Could you tell how to change the rules in order to fix this bug
There is a conflict in your grammar: both trees are possible. happy just choose one way but probably printed something like this during compilation:
reduce/reduce conflicts: 2
To fix it you have to remove one of those rules:
VarCParam. CallParam ::= Ident ;
BExpCParam. CallParam ::= BExp ;
BVar. BExp2 ::= Ident ;

How to convert BNF to EBNF

How can I convert this BNF to EBNF?
<vardec> ::= var <vardeclist>;
<vardeclist> ::= <varandtype> {;<varandtype>}
<varandtype> ::= <ident> {,<ident>} : <typespec>
<ident> ::= <letter> {<idchar>}
<idchar> ::= <letter> | <digit> | _
EBNF or Extended Backus-Naur Form is ISO 14977:1996, and is available in PDF from ISO for free*. It is not widely used by the computer language standards. There's also a paper that describes it, and that paper contains this table summarizing EBNF notation.
Table 1: Extended BNF
Extended BNF Operator Meaning
-------------------------------------------------------------
unquoted words Non-terminal symbol
" ... " Terminal symbol
' ... ' Terminal symbol
( ... ) Brackets
[ ... ] Optional symbols
{ ... } Symbols repeated zero or more times
{ ... }- Symbols repeated one or more times†
= in Defining symbol
; post Rule terminator
| in Alternative
, in Concatenation
- in Except
* in Occurrences of
(* ... *) Comment
? ... ? Special sequence
The * operator is used with a preceding (unsigned) integer number; it does not seem to allow for variable numbers of repetitions — such as 1-15 characters after an initial character to make identifiers up to 16 characters long. This lis
In the standard, open parenthesis ( is called start group symbol and close parenthesis ) is called end group symbol; open square bracket [ is start option symbol and close square bracket is end option symbol; open brace { is start repeat symbol and close brace } is end repeat symbol. Single quotes ' are called first quote symbol and double quotes " are second quote symbol.
* Yes, free — even though you can also pay 74 CHF for it if you wish. Look at the Note under the box containing the chargeable items.
The question seeks to convert this 'BNF' into EBNF:
<vardec> ::= var <vardeclist>;
<vardeclist> ::= <varandtype> {;<varandtype>}
<varandtype> ::= <ident> {,<ident>} : <typespec>
<ident> ::= <letter> {<idchar>}
<idchar> ::= <letter> | <digit> | _
The BNF is not formally defined, so we have to make some (easy) guesses as to what it means. The translation is routine (it could be mechanical if the BNF is formally defined):
vardec = 'var', vardeclist, ';';
vardeclist = varandtype, { ';', varandtype };
varandtype = ident, { ',', ident }, ':', typespec;
ident = letter, { idchar };
idchar = letter | digit | '_';
The angle brackets have to be removed around non-terminals; the definition symbol ::= is replaced by =; the terminals such as ; and _ are enclosed in quotes; concatenation is explicitly marked with ,; and each rule is ended with ;. The grouping and alternative operations in the original happen to coincide with the standard notation. Note that explicit concatenation with the comma means that multi-word non-terminals are unambiguous.
† Casual study of the standard itself suggests that the {...}- notation is not part of the standard, just of the paper. However, as jmmut notes in a comment, the standard does define the meaning of {…}-:
§5.8 Syntactic term
…
When a syntactic-term is a syntactic-factor followed by
an except-symbol followed by a syntactic-exception it
represents any sequence of symbols that satisfies both of
the conditions:
a) it is a sequence of symbols represented by the syntactic-factor,
b) it is not a sequence of symbols represented by the
syntactic-exception.
…
NOTE - { "A" } - represents a sequence of one or more A's because it is a syntactic-term with an empty syntactic-exception.
Remove the angle brackets and put all terminals into quotes:
vardec ::= "var" vardeclist;
vardeclist ::= varandtype { ";" varandtype }
varandtype ::= ident { "," ident } ":" typespec
ident ::= letter { idchar }
idchar ::= letter | digit | "_"

Resources