Need confirmation about PEG's semantic predicates in pyparsing

The PEG paper describes two semantic predicate parsing expressions:
And predicate &e
Not predicate !e
Does pyparsing support the And predicate? Or is that just a synonym for the sequencing parsing expression? In that case it should be equivalent to the And class. Correct?
Does NotAny represent the Not predicate?
Specifically do they conform to the spec's behaviour:
The parsing expression foo &(bar) matches and consumes the text "foo" but only if it is followed by the text "bar". The parsing expression foo !(bar) matches the text "foo" but only if it is not followed by the text "bar". The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b.

The PEG & and ! predicates are non-consuming lookaheads, corresponding to pyparsing's FollowedBy and NotAny. PEG's & is different from sequencing in that pyparsing's "a + b" matches and consumes the text for both a and b from the input string, but PEG's "a &b" means "match a only if followed by b, BUT DON'T CONSUME b".
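As a quick check (a minimal sketch; FollowedBy and NotAny are the actual pyparsing classes, the "foo"/"bar" literals are just placeholders):

import pyparsing as pp

foo = pp.Literal("foo")
bar = pp.Literal("bar")

# PEG foo &(bar): match "foo" only if "bar" follows, without consuming it
and_pred = foo + pp.FollowedBy(bar)
print(and_pred.parseString("foobar"))  # -> ['foo']

# PEG foo !(bar): match "foo" only if "bar" does NOT follow
not_pred = foo + pp.NotAny(bar)
print(not_pred.parseString("foobaz"))  # -> ['foo']

NotAny can also be written with the ~ operator, i.e. foo + ~bar.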

Distinguish between transpose and command string in Julia lexer

For my thesis I am implementing a parser/lexer for Julia; however, some areas are proving to be a bit of a problem.
For background: Julia has a special token for transpose (`), and there is also a 'command string' that uses this same token to wrap the string (`command`). The problem I am having is that I can't seem to get a regex that will match properly.
i.e.
this should match for a transpose:
a`
as well as
a` b`
and
a`
b`
and this should match the command string
`a`
and also:
` a
b `
The issue I'm having is that either, when there are two transposes in a file, the lexer matches a command string, or, when there is a newline inside a command string, the parser fails because both backticks are seen as transposes. To me it seems like these two cases are mutually exclusive.
The regexes in the order in which they are in the lexer are:
option 1:
COMMAND
: '`' (ESC|.)*? '`'
;
TRANSPOSE
: '\'' | '`'
;
option 2:
COMMAND
: '`' ( '\\' | ~[\\\r\n\f] )* '`'
;
TRANSPOSE
: '\'' | '`'
;
As has been noted in the comments, the transpose operator in Julia is actually ', rather than `. What has not been noted yet is that there is at least one critical difference between how ' is used and how ` is used that makes your job a lot easier. Specifically:
Unlike `, which may be used to quote command strings of any length, ' is only ever used to quote characters. Consequently, the only valid uses of ' as a quotation are of single characters (e.g. 'a') or one of the special ANSI escape sequences beginning with \ such as '\n' for newline (the full list being, to my knowledge, \a, \b, \f, \n, \r, \t, \v, \', \", and \\).
Consequently, the 's in a sequence like [a' b'] can only possibly be interpreted as transposes, since ' b' is not a valid Char.
While juxtaposition can be taken to mean multiplication in Julia, and multiplication can in turn be used to concatenate strings in Julia (long story -- string concatenation is the associative, noncommutative binary operation of a free monoid and thus analogous to multiplication), juxtaposition is not currently allowed as a way to multiply strings or characters.
Consequently, a sequence like a'b' can only be interpreted as a' * b' and not a * 'b'.
Combining these two more broadly, unless I am missing some edge case, it appears that a new ' following any character other than whitespace, an opening parenthesis, or a valid infix operator is always parsed as a transpose, rather than as the opening quote of a character literal.
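To make that rule concrete, here is a toy scanner (Python, not an ANTLR grammar; the CHAR_LIT pattern and the token names are inventions for this sketch) that applies it:

import re

# 'a' or an escape like '\n' -- a deliberately simplified character literal
CHAR_LIT = re.compile(r"'(\\[abfnrtv'\"\\]|[^'\\])'")

def tokenize_quotes(src):
    tokens, i = [], 0
    while i < len(src):
        if src[i] == "'":
            prev = src[i - 1] if i > 0 else ""
            # after an identifier character, ')', ']' or another ',
            # a ' is a postfix transpose
            if re.match(r"[\w)\]']", prev):
                tokens.append(("TRANSPOSE", "'"))
                i += 1
            else:
                m = CHAR_LIT.match(src, i)
                if m is None:
                    raise SyntaxError("stray ' at index %d" % i)
                tokens.append(("CHAR", m.group()))
                i = m.end()
        else:
            i += 1
    return tokens

print(tokenize_quotes("a' * b'"))       # two TRANSPOSE tokens
print(tokenize_quotes("c = 'x' + d'"))  # one CHAR, then one TRANSPOSE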

Why doesn't this jq pipeline need a dot?

jq -r '."#graph"[]["rdfs:label"]' 9.0/schemaorg-all-http.jsonld works, but jq -r '."#graph"[].["rdfs:label"]' 9.0/schemaorg-all-http.jsonld does not, and I don't understand why .["rdfs:label"] does not need the dot. https://stackoverflow.com/a/39798796/308851 suggests it needs .name after [], and https://stedolan.github.io/jq/manual/#Basicfilters says
For example .["foo::bar"] and .["foo.bar"] work while .foo::bar does not,
Where did the dot go?
Using the terminology of the jq manual, jq expressions are fundamentally composed of pipes and what it calls "basic filters". The first filter under the heading "Basic Filters" is the identity filter, .; and .[] is the "Array/Object Value Iterator".
From this perspective, that is, from the perspective of
pipes-and-basic-filters, the expression under consideration
."#graph"[]["rdfs:label"] can be viewed as an abbreviated form of
the pipeline:
.["#graph"] | .[] | .["rdfs:label"]
So from this perspective, the question is what abbreviations are allowed.
One of the most important abbreviation rules is:
E | .[] #=> E[]
Another is:
.["<string>"] #=> ."<string>"
Application of these rules yields the simplified expression.
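Spelled out, one abbreviation at a time (the last step uses the analogous rule E | .["<string>"] #=> E["<string>"]):
.["#graph"] | .[] | .["rdfs:label"]   # the explicit pipeline
."#graph" | .[] | .["rdfs:label"]     # .["<string>"] #=> ."<string>"
."#graph"[] | .["rdfs:label"]         # E | .[] #=> E[]
."#graph"[]["rdfs:label"]             # E | .["<string>"] #=> E["<string>"]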
So perhaps the basic answer to the "why" in this question is: for convenience. :-)
The dot serves two different purposes in jq:
A dot on its own means "the current object". Let's call this the identity dot. It can only appear at the start of an expression or subexpression, for example at the very start, after a binary operator like | or + or and, or after an opening parenthesis (.
A dot followed by a string or an identifier means "retrieve the named field of the current object". Let's call this an indexing dot. Whatever is to the left of it needs to be a complete subexpression, for example a literal value, a parenthesised expression, a function call, etc. It can't appear in any of the places the identity dot can appear.
The thing to understand is that in the square bracket operators, the dot shown in the documentation is an identity dot - it's not actually part of the operator itself. The operator is just the square brackets and their contents, and it needs to be attached to another complete expression.
In general, both square bracket operators (e.g. ["foo"] or [] or [0] or [2:5]) and object identifier indexing operators (e.g. .foo or ."foo") can be appended to another expression. Only the object identifier indexing operators can appear "bare" with no expression on the left. Since the square bracket operators can't appear bare, you will typically see them in the documentation composed after an identity dot.
These are all equivalent:
.foo # indexing dot
."foo" # indexing dot
. .foo # identity dot and indexing dot
. | .foo # identity dot and indexing dot
.["foo"] # identity dot
. | .["foo"] # two identity dots
So the answer to your question is that the last dot in ."#graph"[].["rdfs:label"] isn't allowed because:
It can't be an identity dot because it has an expression on the left.
It can't be an indexing dot because it doesn't have an identifier or a string on the right, it has a square bracket.
All that said, it looks like newer versions of jq will extend the syntax to allow square bracket operators immediately after an indexing dot, with the intuitive meaning of applying that indexing operation just as if there had been no dot. So hopefully you won't need to worry about the difference in the future.

Ada 2012 RM - Comments and String Literals

I am journeying through the Ada 2012 RM and would like to see if there is a hole in my understanding or a hole in the RM. Assuming that
put_line ("-- this is not a comment");
is legal code, how can I deduce its legality from the RM, since section 2.7 states that "a comment starts with two adjacent hyphens and extends up to the end of the line", while section 2.6 states that "a string_literal is formed by a sequence of graphic characters (possibly none) enclosed between two quotation marks used as string brackets"? It seems like there is tension between the two sections, and that 2.7 would win, but that is apparently not the case.
To get a clearer understanding here, you need to have a look at section 2.2 in the RM.
2.2 (1), which states:
The text of each compilation is a sequence of separate lexical elements. Each lexical element is formed from a sequence of characters, and is either a delimiter, an identifier, a reserved word, a numeric_literal, a character_literal, a string_literal, or a comment. The meaning of a program depends only on the particular sequences of lexical elements that form its compilations, excluding comments.
And 2.2 (3/2), which states:
"[In some cases an explicit separator is required to separate adjacent lexical elements.] A separator is any of a separator_space, a format_effector, or the end of a line, as follows:
A separator_space is a separator except within a comment, a string_literal, or a character_literal.
The character whose code point is 16#09# (CHARACTER TABULATION) is a separator except within a comment.
The end of a line is always a separator.
One or more separators are allowed between any two adjacent lexical elements, before the first of each compilation, or after the last."
and
A delimiter is either one of the following special characters:
& ' ( ) * + , - . / : ; < = > |
or one of the following compound delimiters each composed of two adjacent special characters
=> .. ** := /= >= <= << >> <>
Each of the special characters listed for single character delimiters is a single delimiter except if this character is used as a character of a compound delimiter, or as a character of a comment, string_literal, character_literal, or numeric_literal.
So, once you filter out the white-space of a program text and break it down into a sequence of lexical elements, a lexical element corresponding to a string literal begins with a double quote character, and a lexical element corresponding to a comment begins with --.
These are clearly different syntax items, and do not conflict with each other.
This also explains why
X := A - -1
+ B;
gives a different result than
X := A --1
+ B;
The space separator between the dashes makes the first minus a lexical element distinct from the -1, so -1 is parsed as a (negated) numeric literal in the first case, while --1 starts a comment in the second.
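If it helps to see the lexing rule operationally, here is a toy scanner (Python rather than Ada, ignoring Ada's doubled-quote escape inside strings and lumping everything else into single-character tokens) that forms lexical elements left to right:

def lex(line):
    tokens, i = [], 0
    while i < len(line):
        c = line[i]
        if c == '"':
            # a string_literal element runs to the closing quote, so --
            # inside it is just two graphic characters of the literal
            j = line.index('"', i + 1)
            tokens.append(("STRING", line[i:j + 1]))
            i = j + 1
        elif line.startswith("--", i):
            # a comment only begins here because we are NOT already
            # inside another lexical element
            tokens.append(("COMMENT", line[i:]))
            break
        elif c.isspace():
            i += 1  # separators delimit elements but are not elements
        else:
            tokens.append(("OTHER", c))
            i += 1
    return tokens

print(lex('put_line ("-- this is not a comment");'))  # STRING token, no COMMENT
print(lex('X := A --1 + B;'))                         # COMMENT swallows --1 + B;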

SQLite source code parse.y - nm

I am reading the grammar of SQLite and have a few questions about the following fragment.
// The name of a column or table can be any of the following:
//
%type nm {Token}
nm(A) ::= id(A).
nm(A) ::= STRING(A).
nm(A) ::= JOIN_KW(A).
The nm non-terminal is used quite widely in the grammar. The lemon parser documentation says
Typically the data type of a non-terminal is a pointer to the root of
a parse-tree structure that contains all information about that
non-terminal
%type expr {Expr*}
Should I understand {Token} to actually stand for a syntactic grouping, i.e. a non-terminal whose value "is a parse-tree structure that contains all.."?
What is nm short for in this context? Is it simply "name"?
What is the period sign (dot, .) that each nm(A) rule ends with?
No, you should understand that Token is a C object type used for the semantic value of nms.
(It is defined in sqliteInt.h and consists of a pointer to a non-null terminated character array and the length of that array.)
The comment immediately above the definition of nm starts with the words "the name", which definitely suggests to me that nm is an abbreviation for "name", yes. That is also consistent with its semantic type, as above, which is basically a name (or at least a string of characters).
All lemon productions end with a dot. It tells lemon where the end of the production is, like semicolons indicate to a C compiler where the end of a statement is. This makes it easier to parse consecutive productions, since otherwise the parser would have to look several symbols ahead to see the ::=.
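For example (a generic illustration of lemon's notation, not a rule from SQLite's parse.y), the dot closes each production, and an optional C action in braces follows it:
expr(A) ::= expr(B) PLUS expr(C). { A = B + C; }
expr(A) ::= NUMBER(B).            { A = B; }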

Shell command to parse/print stream

Specific question
What is a shell command to turn strings like this
class A(B, C):
into sets of strings like this
B -> A;
C -> A;
Where A, B, and C are all of the form \w+ and, where I've written "B, C", I really mean any number of terms separated by commas and whitespace. I.e. "B, C" could equally be "B" or "B, C, D, E".
Big picture
I'm visualizing the class hierarchy of a Python project. I'm looking into a directory for all .py files, grepping for class declarations and then converting them to DOT format. So far I've used find and grep to get a list of lines. I've done what is above in a small python script. If possible I'd like to use just the standard unix toolchain instead. Ideally I'd like to find another composable tool to pipe into and out of and complete the chain.
You want primitive? This sed script should work on every UNIX since V7 (but I haven't tested it on anything really old so be careful). Run it as sed -n -f scriptfile infile > outfile
: loop
/^class [A-Za-z0-9_][A-Za-z0-9_]*(\([A-Za-z0-9_][A-Za-z0-9_]*, *\)*[A-Za-z0-9_][A-Za-z0-9_]*):$/{
h
s/^class \([A-Za-z0-9_][A-Za-z0-9_]*\)(\([A-Za-z0-9_][A-Za-z0-9_]*\)[,)].*/\2 -> \1;/
p
g
s/\(class [A-Za-z0-9_][A-Za-z0-9_]*(\)[A-Za-z0-9_][A-Za-z0-9_]*,* */\1/
b loop
}
Those are BREs (Basic Regular Expressions). They don't have a + operator (that's only found in Extended Regular Expressions) and they definitely don't have \w (which was invented by perl). So your simple \w+ becomes [A-Za-z0-9_][A-Za-z0-9_]* and I had to use it several times, resulting in major ugliness.
In pseudocode form, what the thing does is:
while the line matches /^class \w+(comma-separated-list-of \w+):$/ {
save the line in the hold space
capture the outer \w and the first \w in the parentheses
replace the entire line with the new string "\2 -> \1;" using the captures
print the line
retrieve the line from the hold space
delete the first member of the comma-separated list
}
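For example, assuming the script is saved as classes.sed (the filename is mine), running printf 'class A(B, C):\n' | sed -n -f classes.sed prints:
B -> A;
C -> A;
with each pass of the loop printing one edge and then deleting the first base from the held line.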
Using Python's ast module to parse Python is easy as, well, Python.
import ast

class ClassDumper(ast.NodeVisitor):
    def visit_ClassDef(self, clazz):
        def expand_name(expr):
            # handle plain base names (B) and dotted ones (module.B)
            if isinstance(expr, ast.Name):
                return expr.id
            if isinstance(expr, ast.Attribute):
                return '%s.%s' % (expand_name(expr.value), expr.attr)
            return ast.dump(expr)
        for base in clazz.bases:
            print('%s -> %s;' % (expand_name(base), clazz.name))
        ClassDumper.generic_visit(self, clazz)

ClassDumper().visit(ast.parse(open(__file__).read()))
(This isn't quite right w.r.t. nesting, as it'll output Base -> Inner; instead of Base -> Outer.Inner;, but you could fix that by keeping track of context in a manual walk.)
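Note that as written the last line parses the script's own source (__file__), so running it on itself prints ast.NodeVisitor -> ClassDumper;. To aim it at your project, replace __file__ with the path of the .py file you want to analyze, or loop over the files your find command produces.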
