Converting EBNF to BNF basics - bnf

I'm not quite sure how to answer a question for my computer languages class. I am to convert the following statement from EBNF form to BNF form:
EBNF: expr --> [-] term {+ term}
I understand that expressions included within curly braces are to be repeated zero or more times, and that things included within right angle braces represents zero or one options. If my understanding is correct, would this be a correct conversion?
My BNF:
expr --> expr - term
| expr + term
| term
Bonus Reading
Converting EBNF to BNF (general rules)

I don't think that's correct. In fact, I don't think the EBNF is actually valid EBNF. The answer to the question How to convert BNF to EBNF shows how valid EBNF is constructed, quoting from ISO/IEC 14977:1996, the Extended Backus-Naur Form standard.
I think the expression:
expr --> [-] term {+ term}
should be written:
expr = [ '-' ] term { '+', term };
This means that an expression consists of an optional minus sign, followed by a term, followed by a sequence of zero of more occurrences of a plus sign and a term.
Next question: which dialect of BNF are you targeting? Things get tricky here; there are many dialects. However, here's one possible translation:
<expr> ::= [ MINUS ] <term> <opt_add_term_list>
<opt_add_term_list> ::= /* Nothing */
| <opt_add_term_list> <opt_add_term>
<add_term> ::= PLUS term
Where MINUS and PLUS are terminals (for '-' and '+'). This is a very austere but minimal BNF. Another possible translation would be:
<expr> ::= [ MINUS ] <term> { PLUS <term> }*
Where the { ... }* part means zero or more of the contained pattern ... (so PLUS <term> in this example). Or you could use quoted characters:
<expr> ::= [ '-' ] <term> { '+' <term> }*
And so the list of possible alternatives goes on. You'll have to look at the definition of BNF you were given to work to, and you should complain about the very sloppy EBNF you were given, if it was meant to be ISO standard EBNF. If it was just some random BNF-style language called EBNF, I guess it is just the name that is confusing. Private dialects are fine as long as they're defined, but it isn't possible for people not privy to the dialect to know what the correct answer is.

Related

Why this jq pipeline doesn't need a dot?

jq -r '."#graph"[]["rdfs:label"]' 9.0/schemaorg-all-http.jsonld works but jq -r '."#graph"[].["rdfs:label"]' 9.0/schemaorg-all-http.jsonld does not and I don't understand why .["rdfs:label"] does not need the dot. https://stackoverflow.com/a/39798796/308851 suggests it needs .name after [] and https://stedolan.github.io/jq/manual/#Basicfilters says
For example .["foo::bar"] and .["foo.bar"] work while .foo::bar does not,
Where did the dot go?
Using the terminology of the jq manual, jq expressions are
fundamentally composed of pipes and what it calls "basic filters". The
first filter under the heading "Basic Filters" is the identify filter,
.; and .[] is the "Array/Object Value Iterator".
From this perspective, that is, from the perspective of
pipes-and-basic-filters, the expression under consideration
."#graph"[]["rdfs:label"] can be viewed as an abbreviated form of
the pipeline:
.["#graph"] | .[] | .["rdfs:label"]
So from this perspective, the question is what abbreviations are allowed.
One of the most important abbreviation rules is:
E | .[] #=> E[]
Another is:
.["<string>"] #=> ."<string>"
Application of these rules yields the simplified expression.
So perhaps the basic answer to the "why" in this question is: for convenience. :-)
The dot serves two different purposes in jq:
A dot on its own means "the current object". Let's call this the identity dot. It can only appear at the start of an expression or subexpression, for example at the very start, or after a binary operator like the | or + or and, or inside an opening parenthesis (.
A dot followed by a string or an identifier means "retrieve the named field of the current object". Let's call this an indexing dot. Whatever is to the left of it needs to be a complete subexpression, for example a literal value, a parenthesised expression, a function call, etc. It can't appear in any of the places the identity dot can appear.
The thing to understand is that in the square bracket operators, the dot shown in the documentation is an identity dot - it's not actually part of the operator itself. The operator is just the square brackets and their contents, and it needs to be attached to another complete expression.
In general, both square bracket operators (e.g. ["foo"] or [] or [0] or [2:5]) and object identifier indexing operators (e.g. .foo or ."foo") can be appended to another expression. Only the object identifier indexing operators can appear "bare" with no expression on the left. Since the square bracket operators can't appear bare, you will typically see them in the documentation composed after an identity dot.
These are all equivalent:
.foo # indexing dot
."foo" # indexing dot
. .foo # identity dot and indexing dot
. | .foo # identity dot and indexing dot
.["foo"] # identity dot
. | .["foo"] # two identity dots
So the answer to your question is that the last dot in ."#graph"[].["rdfs:label"] isn't allowed because:
It can't be an identity dot because it has an expression on the left.
It can't be an indexing dot because it doesn't have an identifier or a string on the right, it has a square bracket.
All that said, it looks like newer versions of jq are going to extend the syntax to allow square bracket operators immediately after an indexing dot, and having the intuitive meaning of just applying that indexing operation the same as if there had been no dot, so hopefully you won't need to worry about the difference in the future.

Backus–Naur form with boolean algebra. Problem with brackets and parse tree

boolean algebra
I want to write those boolean expression in Backus-Naur-Form.
What I have got is:
< variable > ::= < signal > | < operator> | < bracket >< variable>
< signal> ::= <p> | <q> | <r>| <s>
< operator> ::= <AND> | <OR> | <implication>| <equivalence>| <NOT>
< bracket> ::= < ( > | < ) >
I made a rekursion with < bracket >< variable>, so that whenever there is a bracket it starts a new instance, but I still do not know when to close the brackets. With this you are able to set a closing bracket and make a new instance, but I only want that for opening brackets.
Can I seperate < bracket> in < open bracket> and < closing bracket>?
Is my Backus-Naur form even correct? There isn't much information about Backus-Naur form with boolean algebra on the internet. How does the parse tree of this look like?
I assume that you want to define a grammar for boolean expressions using the Backus-Naur form, and that your examples are concrete instances of such expressions. There are multiple problems with your grammar:
First of all, you want your grammar to only generate correct boolean expressions. With your grammar, you could generate a simple operator ∨ as valid expression using the path <variable> -> <operator> -> <OR>, which is clearly wrong since the operator is missing its operands. In other words, ∨ on its own cannot be a correct boolean expression. Various other incorrect expressions can be derived with your grammar. For the same reason, the opening and closing brackets should appear together somewhere within a production rule, since you want to ensure that every opening bracket has a closing bracket. Putting them in separate production rules might destroy that guarantee, depending on the overall structure of your grammar.
Secondly, you want to differentiate between non-terminal symbols (the ones that are refined by production rules, i.e. the ones written between < and >) and terminal symbols (atomic symbols like your variables p, q, r and s). Hence, your non-terminal symbols <p>, <q>, <r> and <s> should be terminal symbols p, q, r and s. Same goes for other symbols like brackets and operators.
Thirdly, in order to get an unambiguous parse tree, you want to get your precedence and associativity of your operators correct, i.e., you want to make sure that, for example, negation is evaluated before implication, since it has a higher precedence (similar to arithmetic expressions where multiplication must be evaluated before addition). In other words, we want operators with higher precedence to appear closer to the leaf nodes of the parse tree, and operators with lower precedence to appear closer to the root node of the tree, since the leaves of the tree are evaluated first. We can achieve that by defining our grammar in a way that reflects the precedences of the operators in a decreasing manner:
<expression> ::= <expression> ↔ <implication> | <implication>
<implication> ::= <implication> → <disjunction> | <disjunction>
<disjunction> ::= <disjunction> ∨ <conjunction> | <conjunction>
<conjunction> ::= <conjunction> ∧ <negation> | <negation>
<negation> ::= ¬ <negation> | <variable> | ( <expression> )
<variable> ::= p | q | r | s
Starting with <expression>, we can see that a valid boolean expression starts with chaining all the ↔ operators together, then all the → operators , then all the ∨ operators, and so on, according to their precedence. Hence, operators with lower precedence (e.g., ↔) are located near the root of the tree, where operators with higher precedence (e.g., ¬) are located near the leaves of the tree.
Note that the grammar above is left-recursive, which might cause some problems with software tools that cannot handle them (e.g., parser generators).

Common Lisp: A good way to represent grammar rules?

This is a Common Lisp data representation question.
What is a good way to represent grammars? By "good" I mean a representation that is simple, easy to understand, and I can operate on the representation without a lot of fuss. The representation doesn't have to be particularly efficient; the other properties (simple, understandable, process-able) are more important to me.
Here is a sample grammar:
Session → Facts Question
Session → ( Session ) Session
Facts → Fact Facts
Facts → ε
Fact → ! STRING
Question → ? STRING
The representation should allow the code that operates on the representation to readily distinguish between terminal symbols and non-terminal symbols.
Non-terminal symbols: Session, Facts, Fact, Question
Terminal symbols: (, ), ε, !, ?
This particular grammar uses parentheses symbols, which conflicts with Common Lisp's use of parentheses symbols. What's a good way to handle that?
I want my code to be able to be able to recognize the symbol for the empty string, ε. What's a good way to represent the symbol for the empty string, ε?
I want my code to be able to distinguish between the left-hand side and the right-hand side of a grammar rule.
Below are some common operations that I want to perform on the representation.
Consider this rule:
A → u1u2...un
Operations: I want to get the first symbol of a grammar rule's right-hand side. Then I want to know: is it a terminal symbol? Is it the ε-symbol? If it's a non-terminal symbol, then I want to get its grammar rule.
GRAIL (GRAmmar In Lisp)
Description of GRAIL
Slightly modified version of GRAIL with a function generator included
I'm including the BNF of GRAIL from the second link in case it expires:
<grail-list> ::= "'(" {<grail-rule>} ")"
<grail-rule> ::= <assignment> | <alternation>
<assignment> ::= "(" <type> " ::= " <s-exp> ")"
<alternation> ::= "(" <type> " ::= " <type> {<type>} ")"
<s-exp> ::= <symbol> | <nonterminal> | "(" {<s-exp>} ")"
<type> ::= "#(" <type-name> ")"
<nonterminal> ::= "#(" {<arg-name> " "} <type-name> ")"
<type-name> ::= <symbol>
<arg-name> ::= <symbol>
DCG Format (Definite Clause Grammar)
There is an implementation of a definite clause grammar in Paradigms of Artificial Intelligence Programming. Technically it's Prolog, but it's all implemented as Lisp in the book.
Grammar of English in DCG Format as used in PAIP
DCG Parser
Hope this helps!

Avoiding left recursion in parsing LiveScript object definitions

I'm working on a parser for LiveScript language, and am having trouble with parsing both object property definition forms — key: value and (+|-)key — together. For example:
prop: "val"
+boolProp
-boolProp
prop2: val2
I have the key: value form working with this:
Expression ::= TestExpression
| ParenExpression
| OpExpression
| ObjDefExpression
| PropDefExpression
| LiteralExpression
| ReferenceExpression
PropDefExpression ::= Expression COLON Expression
ObjDefExpression ::= PropDefExpression (NEWLINE PropDefExpression)*
// ... other expressions
But however I try to add ("+"|"-") IDENTIFIER to PropDefExpression or ObjDefExpression, I get errors about using left recursion. What's the (right) way to do this?
The grammar fragment you posted is already left-recursive, i.e. without even adding (+|-)boolprop, the non-terminal 'Expression' derives a form in which 'Expression' reappears as the leftmost symbol:
Expression -> PropDefExpression -> Expression COLON Expression
And it's not just left-recursive, it's ambiguous. E.g.
Expression COLON Expression COLON Expression
can be derived in two different ways (roughly, left-associative vs right-associative).
You can eliminate both these problems by using something more restricted on the left of the colon, e.g.:
PropDefExpression ::= Identifier COLON Expression
Also, another ambiguity: Expression derives PropDefExpression in two different ways, directly and via ObjDefExpression. My guess is, you can drop the direct derivation.
Once you've taken care of those things, it seems to me you should be able to add (+|-)boolprop without errors (unless it conflicts with one of the other kinds of expression that you didn't show).
Mind you, looking at the examples at http://livescript.net, I'm doubtful how much of that you'll be able to capture in a conventional grammar. But if you're just going for a subset, you might be okay.
I don't know how much help this will be, because I know nothing about GrammarKit and not much more about the language you're trying to parse.
However, it seems to me that
PropDefExpression ::= Expression COLON Expression
is not quite accurate, and it is creating an ambiguity when you add the boolean property production because an Expression might start with a unary - operator. In the actual grammar, though, a property cannot start with an arbitrary Expression. There are two types of key-property definitions:
name : expression
parenthesized_expression : expression
(Which is to say, expressions need to start with a ().
That means that a boolean property definition, starting with + or - is recognizable from the first token, which is precisely the condition needed for successful recursive descent parsing. There are several other property definition syntaxes, including names and parenthesized_expressions not followed by a :
That's easy to parse with an LR(1) parser, like the one Jison produces, but to parse it with a recursive-descent parser you need to left-factor. (It's possible that GrammarKit can do this for you, by the way.) Basically, you'd need something like (this is not complete):
PropertyDefinition ::= PropertyPrefix PropertySuffix? | BooleanProperty
PropertyPrefix ::= NAME | ParenthesizedExpression
PropertySuffix ::= COLON Expression | DOT NAME

What are the rules around whitespace in attribute selectors?

I'm reading the spec on attribute selectors, but I can't find anything that says if whitespace is allowed. I'm guessing it's allowed at the beginning, before and after the operator, and at the end. Is this correct?
The rules on whitespace in attribute selectors are stated in the grammar. Here's the Selectors 3 production for attribute selectors (some tokens substituted with their string equivalents for illustration; S* represents 0 or more whitespace characters):
attrib
: '[' S* [ namespace_prefix ]? IDENT S*
[ [ '^=' |
'$=' |
'*=' |
'=' |
'~=' |
'|=' ] S* [ IDENT | STRING ] S*
]? ']'
;
Of course, the grammar isn't terribly useful to someone looking to understand how to write attribute selectors, as it's intended for someone who's implementing a selector engine.
Here's a plain-English explanation:
Whitespace before the attribute selector
This isn't covered in the above production, but the first obvious rule is that if you're attaching an attribute selector to another simple selector or a pseudo-element, don't use a space:
a[href]::after
If you do, the space is treated as a descendant combinator instead, with the universal selector implied on the attribute selector and anything that may follow it. In other words, these selectors are equivalent to each other, but different from the above:
a [href] ::after
a *[href] *::after
Whitespace inside the attribute selector
Whether you have any whitespace within the brackets and around the comparison operator doesn't matter; I find that browsers seem to treat them as if they weren't there (but I haven't tested extensively). These are all valid according to the grammar and, as far as I've seen, work in all modern browsers:
a[href]
a[ href ]
a[ href="http://stackoverflow.com" ]
a[href ^= "http://"]
a[ href ^= "http://" ]
Whitespace is not allowed between the ^ (or other symbol) and = as these are treated as a single token, and tokens cannot be broken apart.
If IE7 and IE8 implement the grammar correctly, they should be able to handle them all as well.
If a namespace prefix is used, whitespace is not allowed between the prefix and the attribute name.
These are incorrect:
unit[sh| quantity]
unit[ sh| quantity="200" ]
unit[sh| quantity = "200"]
These are correct:
unit[sh|quantity]
unit[ sh|quantity="200" ]
unit[sh|quantity = "200"]
Whitespace within the attribute value
But notice the quotes around the attribute values above; if you leave them out, and you try to select something whose attribute has spaces in its value you have a syntax error.
This is incorrect:
div[class=one two]
This is correct:
div[class="one two"]
This is because an unquoted attribute value is treated as an identifier, which doesn't include whitespace (for obvious reasons), whereas a quoted value is treated as a string. See this spec for more details.
To prevent such errors, I strongly recommend always quoting attribute values, whether in HTML, XHTML (required), XML (required), CSS or jQuery (once required).
Whitespace after the attribute value
As of Selectors 4 (following the original publication of this answer), attribute selectors can accept flags in the form of an identifier appearing after the attribute value. Two flags have been defined pertaining to character case, one for case-insensitive matching:
div[data-foo="bar" i]
And one for case-sensitive matching (whose addition I had a part in, albeit by proxy of the WHATWG):
ol[type="A" s]
ol[type="a" s]
The grammar has been updated thus:
attrib
: '[' S* attrib_name ']'
| '[' S* attrib_name attrib_match [ IDENT | STRING ] S* attrib_flags? ']'
;
attrib_name
: wqname_prefix? IDENT S*
attrib_match
: [ '=' |
PREFIX-MATCH |
SUFFIX-MATCH |
SUBSTRING-MATCH |
INCLUDE-MATCH |
DASH-MATCH
] S*
attrib_flags
: IDENT S*
In plain English: if the attribute value is not quoted (i.e. it is an identifier), whitespace between it and attrib_flags is required; otherwise, if the attribute value is quoted then whitespace is optional, but strongly recommended for the sake of readability. Whitespace between attrib_flags and the closing bracket is optional as always.

Resources