Avoiding left recursion in parsing LiveScript object definitions

I'm working on a parser for the LiveScript language, and am having trouble parsing both object property definition forms, key: value and (+|-)key, together. For example:
prop: "val"
+boolProp
-boolProp
prop2: val2
I have the key: value form working with this:
Expression ::= TestExpression
| ParenExpression
| OpExpression
| ObjDefExpression
| PropDefExpression
| LiteralExpression
| ReferenceExpression
PropDefExpression ::= Expression COLON Expression
ObjDefExpression ::= PropDefExpression (NEWLINE PropDefExpression)*
// ... other expressions
But however I try to add ("+"|"-") IDENTIFIER to PropDefExpression or ObjDefExpression, I get errors about using left recursion. What's the (right) way to do this?

The grammar fragment you posted is already left-recursive, i.e. without even adding (+|-)boolprop, the non-terminal 'Expression' derives a form in which 'Expression' reappears as the leftmost symbol:
Expression -> PropDefExpression -> Expression COLON Expression
And it's not just left-recursive, it's ambiguous. E.g.
Expression COLON Expression COLON Expression
can be derived in two different ways (roughly, left-associative vs right-associative).
You can eliminate both these problems by using something more restricted on the left of the colon, e.g.:
PropDefExpression ::= Identifier COLON Expression
Also, another ambiguity: Expression derives PropDefExpression in two different ways, directly and via ObjDefExpression. My guess is, you can drop the direct derivation.
Once you've taken care of those things, it seems to me you should be able to add (+|-)boolprop without errors (unless it conflicts with one of the other kinds of expression that you didn't show).
Mind you, looking at the examples at http://livescript.net, I'm doubtful how much of that you'll be able to capture in a conventional grammar. But if you're just going for a subset, you might be okay.

I don't know how much help this will be, because I know nothing about GrammarKit and not much more about the language you're trying to parse.
However, it seems to me that
PropDefExpression ::= Expression COLON Expression
is not quite accurate, and it is creating an ambiguity when you add the boolean property production because an Expression might start with a unary - operator. In the actual grammar, though, a property cannot start with an arbitrary Expression. There are two types of key-property definitions:
name : expression
parenthesized_expression : expression
(Which is to say, an expression used as a key needs to start with a "(".)
That means that a boolean property definition, starting with + or - is recognizable from the first token, which is precisely the condition needed for successful recursive descent parsing. There are several other property definition syntaxes, including names and parenthesized_expressions not followed by a :
That's easy to parse with an LR(1) parser, like the one Jison produces, but to parse it with a recursive-descent parser you need to left-factor. (It's possible that GrammarKit can do this for you, by the way.) Basically, you'd need something like (this is not complete):
PropertyDefinition ::= PropertyPrefix PropertySuffix? | BooleanProperty
PropertyPrefix ::= NAME | ParenthesizedExpression
PropertySuffix ::= COLON Expression | DOT NAME
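
The left-factored shape above can be sketched as a hand-written recursive-descent routine. This is only a minimal Python sketch, not GrammarKit output: the token kinds (NAME, STRING, COLON, DOT, PLUS, MINUS, LPAREN, RPAREN), the helper names, and the tree shapes are all assumptions made for illustration, and the expression parser is reduced to a single token or a parenthesized group.

```python
# Hypothetical token stream: a list of (kind, text) pairs.
# The token kinds and tree shapes are illustrative assumptions.

def peek(tokens, pos):
    """Return the token at pos, or an EOF marker past the end."""
    return tokens[pos] if pos < len(tokens) else ("EOF", "")

def parse_expression(tokens, pos):
    """Stand-in for a real expression parser: one token or a paren group."""
    kind, text = peek(tokens, pos)
    if kind in ("NAME", "STRING"):
        return (kind.lower(), text), pos + 1
    if kind == "LPAREN":
        return parse_parenthesized(tokens, pos)
    raise SyntaxError(f"unexpected {kind} at token {pos}")

def parse_parenthesized(tokens, pos):
    """ParenthesizedExpression ::= LPAREN Expression RPAREN"""
    inner, pos = parse_expression(tokens, pos + 1)
    if peek(tokens, pos)[0] != "RPAREN":
        raise SyntaxError("expected ')'")
    return ("paren", inner), pos + 1

def parse_property(tokens, pos=0):
    """PropertyDefinition ::= PropertyPrefix PropertySuffix? | BooleanProperty"""
    kind, text = peek(tokens, pos)
    # BooleanProperty is chosen from the first token alone.
    if kind in ("PLUS", "MINUS"):
        name_kind, name = peek(tokens, pos + 1)
        if name_kind != "NAME":
            raise SyntaxError(f"expected NAME after {text!r}")
        return ("bool", name, kind == "PLUS"), pos + 2
    # PropertyPrefix ::= NAME | ParenthesizedExpression
    if kind == "NAME":
        prefix, pos = ("name", text), pos + 1
    elif kind == "LPAREN":
        prefix, pos = parse_parenthesized(tokens, pos)
    else:
        raise SyntaxError(f"unexpected {kind} at token {pos}")
    # PropertySuffix ::= COLON Expression | DOT NAME (optional)
    suffix_kind = peek(tokens, pos)[0]
    if suffix_kind == "COLON":
        value, pos = parse_expression(tokens, pos + 1)
        return ("keyed", prefix, value), pos
    if suffix_kind == "DOT":
        name_kind, name = peek(tokens, pos + 1)
        if name_kind != "NAME":
            raise SyntaxError("expected NAME after '.'")
        return ("member", prefix, name), pos + 2
    return ("shorthand", prefix), pos
```

Because every alternative is selected by looking at one token, no rule ever calls itself before consuming input, which is what removes the left recursion.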

Related

Why this jq pipeline doesn't need a dot?

jq -r '."#graph"[]["rdfs:label"]' 9.0/schemaorg-all-http.jsonld works but jq -r '."#graph"[].["rdfs:label"]' 9.0/schemaorg-all-http.jsonld does not and I don't understand why .["rdfs:label"] does not need the dot. https://stackoverflow.com/a/39798796/308851 suggests it needs .name after [] and https://stedolan.github.io/jq/manual/#Basicfilters says
For example .["foo::bar"] and .["foo.bar"] work while .foo::bar does not,
Where did the dot go?
Using the terminology of the jq manual, jq expressions are
fundamentally composed of pipes and what it calls "basic filters". The
first filter under the heading "Basic Filters" is the identity filter,
.; and .[] is the "Array/Object Value Iterator".
From this perspective, that is, from the perspective of
pipes-and-basic-filters, the expression under consideration
."#graph"[]["rdfs:label"] can be viewed as an abbreviated form of
the pipeline:
.["#graph"] | .[] | .["rdfs:label"]
So from this perspective, the question is what abbreviations are allowed.
One of the most important abbreviation rules is:
E | .[] #=> E[]
Another is:
.["<string>"] #=> ."<string>"
Application of these rules yields the simplified expression.
So perhaps the basic answer to the "why" in this question is: for convenience. :-)
The dot serves two different purposes in jq:
A dot on its own means "the current object". Let's call this the identity dot. It can only appear at the start of an expression or subexpression, for example at the very start, or after a binary operator like the | or + or and, or inside an opening parenthesis (.
A dot followed by a string or an identifier means "retrieve the named field of the current object". Let's call this an indexing dot. Whatever is to the left of it needs to be a complete subexpression, for example a literal value, a parenthesised expression, a function call, etc. It can't appear in any of the places the identity dot can appear.
The thing to understand is that in the square bracket operators, the dot shown in the documentation is an identity dot - it's not actually part of the operator itself. The operator is just the square brackets and their contents, and it needs to be attached to another complete expression.
In general, both square bracket operators (e.g. ["foo"] or [] or [0] or [2:5]) and object identifier indexing operators (e.g. .foo or ."foo") can be appended to another expression. Only the object identifier indexing operators can appear "bare" with no expression on the left. Since the square bracket operators can't appear bare, you will typically see them in the documentation composed after an identity dot.
These are all equivalent:
.foo # indexing dot
."foo" # indexing dot
. .foo # identity dot and indexing dot
. | .foo # identity dot and indexing dot
.["foo"] # identity dot
. | .["foo"] # two identity dots
So the answer to your question is that the last dot in ."#graph"[].["rdfs:label"] isn't allowed because:
It can't be an identity dot because it has an expression on the left.
It can't be an indexing dot because it doesn't have an identifier or a string on the right, it has a square bracket.
All that said, it looks like newer versions of jq are going to extend the syntax to allow square bracket operators immediately after an indexing dot, and having the intuitive meaning of just applying that indexing operation the same as if there had been no dot, so hopefully you won't need to worry about the difference in the future.
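
To make the pipes-and-basic-filters view concrete, here is a small Python model of the pipeline .["#graph"] | .[] | .["rdfs:label"]. It is only a sketch of the idea, not how jq is implemented: the helper names index, iterate and pipe are hypothetical, and the sample document is made up.

```python
from functools import reduce

def index(key):
    """Model of .[key]: emit the named field of each input."""
    def run(inputs):
        for value in inputs:
            yield value[key]
    return run

def iterate():
    """Model of .[]: emit every element (or object value) of each input."""
    def run(inputs):
        for value in inputs:
            yield from (value.values() if isinstance(value, dict) else value)
    return run

def pipe(*filters):
    """Model of f | g | h: feed each filter's outputs into the next one."""
    def run(inputs):
        return reduce(lambda stream, f: f(stream), filters, inputs)
    return run

# Made-up stand-in for the JSON-LD file from the question.
doc = {"#graph": [{"rdfs:label": "Thing"}, {"rdfs:label": "Person"}]}

program = pipe(index("#graph"), iterate(), index("rdfs:label"))
labels = list(program([doc]))
print(labels)
```

Each stage takes the previous stage's output as its "current object", which is the role the identity dot plays in the unabbreviated jq pipeline.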

SQLite3 regexp performance

How performant is the SQLite3 REGEXP operator?
For simplicity, assume a simple table with a single column pattern and an index
CREATE TABLE `foobar` (`pattern` TEXT);
CREATE UNIQUE INDEX `foobar_index` ON `foobar`(`pattern`);
and a query like
SELECT * FROM `foobar` WHERE `pattern` REGEXP 'foo.*'
I have been trying to compare and understand the output from EXPLAIN and it seems to be similar to using LIKE except it will be using regexp for matching. However, I am not fully sure how to read the output from EXPLAIN and I'm not getting a grasp of how performant it will be.
I understand it will be slow compared to an indexed WHERE `pattern` = 'foo' query but is it slower/similar to LIKE?
SQLite does not optimize WHERE ... REGEXP ... to use indexes. x REGEXP y is simply a function call; it's equivalent to regexp(y,x). Also note that not all installations of SQLite have a regexp function defined, so using it (or the REGEXP operator) is not very portable. LIKE/GLOB, on the other hand, can take advantage of indexes for prefix queries, provided that some additional conditions are met:
The right-hand side of the LIKE or GLOB must be either a string literal or a parameter bound to a string literal that does not begin with a wildcard character.
It must not be possible to make the LIKE or GLOB operator true by having a numeric value (instead of a string or blob) on the left-hand side. This means that either:
the left-hand side of the LIKE or GLOB operator is the name of an indexed column with TEXT affinity, or
the right-hand side pattern argument does not begin with a minus sign ("-") or a digit.
This constraint arises from the fact that numbers do not sort in lexicographical order. For example: 9<10 but '9'>'10'.
The built-in functions used to implement LIKE and GLOB must not have been overloaded using the sqlite3_create_function() API.
For the GLOB operator, the column must be indexed using the built-in BINARY collating sequence.
For the LIKE operator, if case_sensitive_like mode is enabled then the column must be indexed using the BINARY collating sequence, or if case_sensitive_like mode is disabled then the column must be indexed using the built-in NOCASE collating sequence.
If the ESCAPE option is used, the ESCAPE character must be ASCII, or a single-byte character in UTF-8.
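
One way to see the difference is to compare query plans. The following is a minimal sketch using Python's standard sqlite3 module, assuming a build with the LIKE/GLOB optimization enabled. The regexp function is supplied by the script itself (it is not built in), and GLOB is used for the prefix query because the index here uses the default BINARY collation, which LIKE would only match with case_sensitive_like enabled. On typical builds the GLOB plan should report a SEARCH on the index and the REGEXP plan a SCAN, but the exact wording varies between SQLite versions.

```python
import re
import sqlite3

def regexp(pattern, value):
    """User-defined regexp(): SQLite passes the pattern first, then the value."""
    if value is None:
        return 0
    return int(re.search(pattern, value) is not None)

conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
conn.execute("CREATE TABLE foobar (pattern TEXT)")
conn.execute("CREATE UNIQUE INDEX foobar_index ON foobar(pattern)")
conn.executemany("INSERT INTO foobar VALUES (?)",
                 [("foo",), ("foobar",), ("bar",)])

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

glob_sql = "SELECT * FROM foobar WHERE pattern GLOB 'foo*'"
regexp_sql = "SELECT * FROM foobar WHERE pattern REGEXP 'foo.*'"

print("GLOB:  ", plan(glob_sql))
print("REGEXP:", plan(regexp_sql))
```

Both queries return the same rows for this data; only the access path differs, and the REGEXP version has to call the function once per row.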

SQLite source code parse.y - nm

I am reading the grammar of SQLite and having a few questions about the following paragraph.
// The name of a column or table can be any of the following:
//
%type nm {Token}
nm(A) ::= id(A).
nm(A) ::= STRING(A).
nm(A) ::= JOIN_KW(A).
The nm has been used quite widely in the program. The lemon parser documentation said
Typically the data type of a non-terminal is a pointer to the root of
a parse-tree structure that contains all information about that
non-terminal
%type expr {Expr*}
Should I understand {Token} actually stands for a syntactic grouping which is a non-terminal token that "is a parse-tree structure that contains all.."?
What is nm short for in this case, is it simply "name"?
What is the period sign (dot .) that each nm(A) declaration ends up with?
No, you should understand that Token is a C object type used for the semantic value of nms.
(It is defined in sqliteInt.h and consists of a pointer to a non-null terminated character array and the length of that array.)
The comment immediately above the definition of nm starts with the words "the name", which definitely suggests to me that nm is an abbreviation for "name", yes. That is also consistent with its semantic type, as above, which is basically a name (or at least a string of characters).
All lemon productions end with a dot. It tells lemon where the end of the production is, like semicolons indicate to a C compiler where the end of a statement is. This makes it easier to parse consecutive productions, since otherwise the parser would have to look several symbols ahead to see the ::=

Case insensitive token matching

Is it possible to set the grammar to match case-insensitively?
So, for example, a rule:
checkName = 'CHECK' Word;
would match check name as well as CHECK name
Creator of PEGKit here.
The only way to do this currently is to use a Semantic Predicate in a round-about sort of way:
checkName = { MATCHES_IGNORE_CASE(LS(1), @"check") }? Word Word;
Some explanations:
Semantic Predicates are a feature lifted directly from ANTLR. The Semantic Predicate part is the { ... }?. These can be placed anywhere in your grammar rules. They should contain either a single expression or a series of statements ending in a return statement which evaluates to a boolean value. This one contains a single expression. If the expression evaluates to false, matching of the current rule (checkName in this case) will fail. A true value will allow matching to proceed.
MATCHES_IGNORE_CASE(str, regexPattern) is a convenience macro I've defined for your use in Predicates and Actions to do regex matches. It has a case-sensitive friend: MATCHES(str, regexPattern). The second argument is an NSString* regex pattern. Meaning should be obvious.
LS(num) is another convenience macro for your use in Predicates/Actions. It means fetch a Lookahead String and the argument specifies how far to lookahead. So LS(1) means lookahead by 1. In other words, "fetch the string value of the first upcoming token the parser is about to try to match".
Notice that I'm still matching Word twice at the end there. The first Word is necessary for matching 'check' (even though it was already tested in the predicate, it was not matched and consumed). The second Word is for your name or whatever.
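
The test-then-consume behaviour described above can be imitated in a few lines of Python. This is not PEGKit's API, just a sketch of the idea: Parser, ls, match_word and check_name are hypothetical names, and splitting on whitespace stands in for a real tokenizer.

```python
import re

class Parser:
    def __init__(self, text):
        self.tokens = text.split()  # simplification: whitespace tokenizer
        self.pos = 0

    def ls(self, n):
        """Lookahead string: the n-th upcoming token, without consuming it."""
        i = self.pos + n - 1
        return self.tokens[i] if i < len(self.tokens) else ""

    def match_word(self):
        """Consume and return one word token."""
        if self.pos >= len(self.tokens):
            raise SyntaxError("expected a word")
        word = self.tokens[self.pos]
        self.pos += 1
        return word

    def check_name(self):
        """Model of: checkName = { predicate }? Word Word;"""
        # Predicate: inspect the upcoming token but consume nothing.
        if not re.fullmatch("check", self.ls(1), re.IGNORECASE):
            raise SyntaxError("predicate failed: expected 'check'")
        keyword = self.match_word()  # first Word consumes the keyword
        name = self.match_word()     # second Word consumes the name
        return keyword, name

print(Parser("check name").check_name())
print(Parser("CHECK name").check_name())
```

The predicate only gates the rule; the keyword still has to be consumed by the first Word, which is why Word appears twice.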
Hope that helps.

Are shorthand character classes (such as \d) not supported in JavaCC

I am trying to learn to use JavaCC and realized that it has support for regular expressions. Call me lazy but I thought the default/common way to define digits is a bit too long:
TOKEN : { < #DIGITS : (["0" - "9"])+ >}
I tried using the shorthand character classes such as:
TOKEN : { < #DIGITS : (\d)+ >}
but the "compiler compiler" doesn't seem to like it. I get lexical errors for the shorthand character. I could not find any documentation on the matter, so I am not sure whether I am doing something wrong or it's simply not supported. If anyone can confirm or deny my assumption that JavaCC does not play well with the shorthand character classes, I would be very appreciative.
Your finding that it's not supported is correct. Regular expressions in JavaCC are made up only of string literals, character lists such as ["0" - "9"], references to other regular expressions, and references to the predefined regular expression < EOF >.
However, what you are doing with the code you have there is creating your own shortcut. The number sign means that the symbol is private, i.e., can be used only inside regular expressions. So, defining it as TOKEN : { < #D : (["0" - "9"])+ > } means you could then use < D > within other token definitions.
The example grammar javacc.jj, included with the binary distribution, is the official grammar, so looking in this file you can see just what is parsable by this grammar. The output seems to be essentially a grammar validator.

Resources