Optional sequence rule clarification - abnf

3.8. Optional Sequence: [RULE]
Square brackets enclose an optional element sequence:
[foo bar]
is equivalent to
*1(foo bar).
The above section from RFC 5234 seems incorrect to me.
I think this is because the optional sequence rule [foo bar] is not only equivalent to 1*1(foo bar), but also equivalent to 1*1(bar foo). And the example above matches with the default minimum of 0, that is, 0*1(foo bar).
However, [] usually means something else. So, on the other hand, I think [foo bar] should mean either (foo) or (bar).
Can anyone clear this confusion for me?

The RFC defines the syntax and semantics of ABNF grammars, and the quoted text defines the semantics of the optional sequence syntax; it is correct by definition. Parentheses in ABNF form sequence groups: (foo bar) means foo immediately followed by bar. The number syntax in front indicates repetition, where the asterisk separates the minimum number of occurrences from the maximum number of occurrences. The minimum defaults to zero. So
*1(foo bar)
is the same as
0*1(foo bar)
meaning a sequence of foo immediately followed by bar that appears at least zero and at most one time, i.e., the sequence is optional. Since optional parts are quite frequent in formal grammars, there is a special shorthand syntax for them, namely
[foo bar]
which also means a sequence of foo immediately followed by bar that appears at least zero and at most one time. What syntactic constructs usually mean elsewhere does not matter here; the specification is not reflecting on the world, it defines its own conventions.
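Purely for intuition, here is that rule translated into a Python regular expression, under the assumption (not stated in the ABNF fragment itself) that foo and bar stand for the literal words "foo" and "bar" separated by one space:
import re

# [foo bar] and *1(foo bar) both mean: the whole sequence "foo bar"
# appears zero or one time, so in regex terms it is an optional group.
optional_sequence = re.compile(r"(?:foo bar)?")

for candidate in ["", "foo bar", "foo", "bar foo", "foo bar foo bar"]:
    print(repr(candidate), bool(optional_sequence.fullmatch(candidate)))
# Only "" and "foo bar" match; "foo" alone, the reversed pair and the repeated
# pair are all rejected, which is what "at most one occurrence of the
# sequence foo bar" means.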

Left/Right recursion and Bison parsing stack behavior

So in my < 24 hours of bison/flex investigation I've seen lots of documentation that indicates that left recursion is better than right recursion. Some places even mention that with left recursion you need constant space on the Bison parser stack, whereas right recursion requires order-N space. However, I can't quite find any sources that explain explicitly what is going on.
As an example (parser that only adds and subtracts):
Scanner:
%%
[0-9]+ {return NUMBER;}
%%
Parser:
%%
/* Left */
expression:
NUMBER
| expression '+' NUMBER { $$ = $1 + $3; }
| expression '-' NUMBER { $$ = $1 - $3; }
;
/* Right */
expression:
NUMBER
| NUMBER '+' expression { $$ = $1 + $3; }
| NUMBER '-' expression { $$ = $1 - $3; }
;
%%
For the example of 1+5-2, it seems with left recursion, the parser receives '1' from the lexer and sees that '1' matches expression: NUMBER and pushes an expression of value 1 onto the parser stack. It sees + and pushes it. Then it sees 5, and the expression(1), + and 5 match expression: expression '+' NUMBER, so it pops twice, does the math and pushes a new expression with value 6 on the stack, and then repeats for the subtraction. At any one point, there are at most 3 symbols on the stack. So it's like an in-place calculation that operates left to right.
With right recursion, I'm not sure why it has to load all the symbols onto the stack, but I'm going to attempt to describe why that might be the case. It sees a 1 and matches expression: NUMBER, so it pushes an expression with value 1 onto the stack. It pushes '+' onto the stack. When it sees the 5, my first thought is that 5 on its own could match expression: NUMBER and hence be an expression of value 5, and then it plus the last two symbols on the stack could match expression: NUMBER '+' expression. But my assumption is that because expression is on the right of the rule, it can't jump the gun and reduce 5 to an expression via expression: NUMBER, since with LALR(1) it already knows more symbols are coming, so it has to wait until it hits the end of the input?
TL;DR;
Can someone maybe explain with some detail how Bison manages its parse stack relative to how it does its shift/reduction with the parser grammar rules? Silly/contrived examples welcome!
With LR (bottom-up) parsing, each non-terminal is reduced precisely when its last token is encountered. (LALR parsing is a simplified LR parse which handles lookahead slightly less precisely.) Until a non-terminal is reduced, all its components live on the stack. So if you use right recursion and you are parsing
NUMBER + NUMBER + NUMBER + NUMBER
the reductions won't start until you get to the end, because each NUMBER starts an expression and all the expressions end at the last NUMBER.
If you use left recursion, each NUMBER terminates an expression, so a reduction happens each time a NUMBER is encountered.
That's not the reason to use left-recursion though. You use left-recursion because it accurately describes the language. If you have 7 - 2 - 1, you want the result to be 4, because that's what algebraic rules require: the expression is parsed as though it were (7 - 2) - 1, so 7 - 2 must be reduced first. With right-recursion, you would incorrectly evaluate that as 6, because the 2 - 1 would reduce first.
Most operators associate to the left, so you use left-recursion. For the occasional operator which associates to the right, you need right recursion and you have to live with the stack growing. It's not a big deal. Your machine has tons of memory.
As an example, consider assignment. a = b = 42 means a = (b = 42). If you did it left associatively, you'd first set a to b, and then attempt to set something to 42; (a = b) = 42 wouldn't make sense in most languages and it is certainly not the expected action.
LL (top-down) parsing uses lookahead to predict which production will be used. It can't handle left-recursion at all because the prediction ends up in a recursive loop: expression starts with an expression which starts with an expression … and the parser never manages to predict a NUMBER. So with LL parsers, you have to use right-recursion and then your grammar does not correctly describe the language (assuming that the language has left-associative operators, as is usually the case). Some people don't mind this, but I think that grammars should actually indicate the correct parse, and I find the modification necessary to make a grammar parseable with a top-down parser to be messy and hard to read. Your mileage may vary.
By the way, "force down your throat" is a very ungenerous description of documentation which is trying to give you good advice. It is good to be skeptical -- you understand things better if you work at figuring out why they work the way they do -- but many people just want the good advice.
So after reading this rather important page in the bison documentation:
https://www.gnu.org/software/bison/manual/html_node/Lookahead.html#Lookahead
combined with running with
%debug
and
yydebug = 1;
in my main()
I think I see exactly what is happening.
With left recursion, it sees a 1 and shifts it. The lookahead is now +. Then it determines that it can reduce the 1 to an expression via expression: NUMBER. So it pops and puts an expression on the stack. It shifts + and NUMBER(5) and then sees it can reduce via expression: expression '+' NUMBER and pops 3x and pushes a new expression(6). Basically, by using the lookahead and the rules, bison can determine if it needs to shift or reduce at any time as it reads the tokens. As this repeats, at most there are 3 symbols/groupings on the parse stack (for this simplified expression evaluation section).
With right recursion, it sees a 1 and shifts it. The lookahead is now +. The parser sees no reason to reduce 1 to an expression(1) because there is no rule that goes expression '+', so it just continues shifting each token until it gets to the end of the input. At this point, there is [NUMBER, +, NUMBER, -, NUMBER] on the stack, so it sees that the most recent NUMBER can be reduced to expression(2) and pushed back on the stack. Then the rules start getting applied (expression: NUMBER '-' expression) and so on.
So the key to my understanding is that Bison uses the lookahead token to make intelligent decisions about reducing now or just shifting based on the rules it has at its disposal.
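If it helps to see the stack behaviour concretely, here is a toy Python model of the two disciplines, written only to mirror the shift/reduce narrative above. It is not what Bison actually does internally (it skips the NUMBER-to-expression reductions, parser states and gotos), and the token format is my own, but it shows both the stack-depth difference and the associativity difference for 7 - 2 - 1:
def left_recursive(tokens):
    # expr : NUMBER | expr '+' NUMBER | expr '-' NUMBER
    # Reduce as soon as "expr op NUMBER" sits on top of the stack.
    stack, max_depth = [], 0
    for tok in tokens:
        stack.append(tok)                                  # shift
        max_depth = max(max_depth, len(stack))
        if isinstance(tok, int) and len(stack) >= 3 and stack[-2] in ("+", "-"):
            num, op, left = stack.pop(), stack.pop(), stack.pop()
            stack.append(left + num if op == "+" else left - num)  # reduce
    return stack[0], max_depth

def right_recursive(tokens):
    # expr : NUMBER | NUMBER '+' expr | NUMBER '-' expr
    # Nothing can be reduced until the last NUMBER has been shifted.
    stack, max_depth = list(tokens), len(tokens)
    while len(stack) >= 3:
        num, op, left = stack.pop(), stack.pop(), stack.pop()
        stack.append(left + num if op == "+" else left - num)      # reduce from the right end
    return stack[0], max_depth

tokens = [7, "-", 2, "-", 1]
print(left_recursive(tokens))   # (4, 3): left-associative answer, stack never grows past 3
print(right_recursive(tokens))  # (6, 5): right-associative answer, the whole input sits on the stack first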

How to distinguish definition list from ordered list in rst when the term starts with number in ordinal form?

Example:
1. inflection
   foo
2. inflection
   qux
In RST this renders as an ordered list, but in my case it would be more fitting to use a definition list.
If I remove one space from the definition indent to make it look like a definition, like
1. inflection
  foo
then rst2html emits a warning about an improperly ended ordered list.
If, on the other hand, I add indentation like
1. inflection
    foo
I do get a definition list, but always as a separate dl inside each ordered list item.
Context: some languages inflect nouns and I want to give a list of inflections of an unusual noun. The inflections are commonly referred to as "1. inflection, 2. inflection" etc., hence my wish to express this in RST.
My workaround so far is to avoid the numbers by using the Latin names of the inflections, but I'd rather not.
D'oh, the escaping mechanism works. Example::
\1. inflection
   foo
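As a sanity check (not part of the original answer), if you have docutils installed you can confirm from Python that the escaped source really yields a definition list and no ordered list; publish_string and writer_name are standard docutils API, and the substring checks are just a rough way to inspect the generated HTML:
from docutils.core import publish_string

source = "\\1. inflection\n   foo\n"   # the backslash keeps "1." from being read as an enumerator
html = publish_string(source, writer_name="html").decode("utf-8")
print("<dl" in html)   # True: the term/definition pair becomes a definition list
print("<ol" in html)   # False: no ordered list is generated any more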

What characters are allowed in common lisp symbols?

What characters are allowed in Common Lisp symbols? Can you give a regular expression to match them (or are they beyond the capability of regular grammars to describe)?
I have tried looking for information on this, but all I can find are some examples in CLHS, but no concrete definition of what exactly a legal symbol is.
Edit:
So, common lisp symbols can legally contain any character.
However, the parser doesn't just accept any character as it reads lisp code. What are the rules for parsable symbols? E.g. symbols that can be supplied as 'quoted symbols or inside of '(quoted lists).
I am interested in generating and reading non-bar-delimited symbols, from a non-lisp language. It should suffice, for my application, to use [a-zA-Z0-9:&-]+, but I tend to prefer to be as accurate as possible, which is why I am trying to determine if there is a regex that can match symbols. Matching the |delimited syntax| would be a bonus, but non-delimited symbols would suffice.
This needs to be symbols that would be loaded legally when using (read). The answer is not that symbols can contain any character:
[1]> (read t)
#
*** - READ from #<IO TERMINAL-STREAM>: objects printed as # in view of *PRINT-LEVEL* cannot be read back in
I want to know the rules, or a regex, for what is a valid symbol here, without delimiting it with |.
As sds mentioned, symbol names can contain any characters. Given any string, you can create a symbol with that name. However, based on your comments, it sounds like you're wondering what, under fairly default settings, will be read as a symbol. The answer is still "pretty much anything", with a few exceptions.
The relevant sections in the HyperSpec begin with 2.2 Reader Algorithm, which describes the tokenization process. It describes the process in detail, but perhaps the most important part is:
When dealing with tokens, the reader's basic function is to
distinguish representations of symbols from those of numbers. When a
token is accumulated, it is assumed to represent a number if it
satisfies the syntax for numbers listed in Figure 2-9. If it does not
represent a number, it is then assumed to be a potential number if it
satisfies the rules governing the syntax for a potential number. If a
valid token is neither a representation of a number nor a potential
number, it represents a symbol.
The Figure 2-9 mentioned in that excerpt is in section 2.3.1 Numbers as Tokens, which says:
When a token is read, it is interpreted as a number or symbol. The token is interpreted as a number if it satisfies the syntax for numbers specified in the next figure.
So, the process is really "tokenize the stream, and for each token, check whether it's a number; if it's not a number, then it's a symbol." I realize this doesn't provide a nice clean grammar for symbols, but that's just the way the language is defined. If you sit down to the task of writing a tokenizer and reader for a Lisp, you may find that this is a pretty convenient way of going about it. You pretty much just need to recognize which characters terminate a symbol, which characters start and end lists, what gets eliminated as whitespace, and what your escape characters are. Then you read nested lists of tokens, turning each token into a number or a symbol (or a string, etc.).
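As a rough illustration of that order of operations, here is a toy Python sketch of "accumulate a token, try to read it as a number in the current base, otherwise make a symbol". It is not the real reader algorithm (it ignores packages, escape characters, ratios and most of Figure 2-9), and the helper name is made up:
def classify_token(token, read_base=10):
    # Toy version of the reader's rule: a token that parses as a number in the
    # current base is a number; anything else becomes a symbol (upcased, as the
    # standard readtable would do).
    try:
        return ("number", int(token, read_base))
    except ValueError:
        pass
    try:
        return ("number", float(token))      # crude stand-in for floats, which CL reads in base 10
    except ValueError:
        return ("symbol", token.upper())

for tok in ["beef", "hello", "10", "1+", "most-positive-fixnum"]:
    print(tok, "->", classify_token(tok, read_base=16))
# "beef" and "10" come out as numbers in base 16; "hello", "1+" and
# "most-positive-fixnum" come out as symbols, much as CL would read them
# with *read-base* set to 16.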
Perhaps one of the easiest ways to see why you have to do this in terms of tokenization followed by a number check is the fact that Common Lisp has a *read-base* variable that controls the radix. Depending on the value of *read-base*, some things are numbers or symbols, and you can't know which until you know both what the complete token is and what the current state of the runtime is.
CL-USER> 'beef
BEEF
CL-USER> (setf *read-base* 16)
16
CL-USER> 'beef
48879
CL-USER> (setf *read-base* a) ; set it back to 10, which is now a
10
CL-USER> (setf *read-base* 36)
36
CL-USER> 'hello ; a number
29234652
CL-USER> 'hello\ world ; a symbol
|HELLO WORLD|
Any character can be in a symbol. E.g.:
(length (loop for i to char-code-limit
collect (intern (string (code-char i)))))
==> 1114113

Regexpression asp.net validator for a few words

I'm trying to create a validator for a string that may contain 1-N words separated by a single whitespace (spaces only between words). I'm a newbie with regex, so I feel a bit confused, because my expression seems correct to me:
^[[a-zA-Z]+\s{1}]{0,}[a-zA-Z]+$
What am I doing wrong here? (It accepts only 2 words, but I want it to accept 1+ words.)
Any help is greatly appreciated :)
As often happens with someone beginning a new programming language or syntax, you're close, but not quite! The ^ and $ anchors are being used correctly, and the character class [a-zA-Z] will match only letters (sounds right to me), but your repetition is a little off, and your grouping is not what you think it is - which is your primary problem.
^[[a-zA-Z]+\s{1}]{0,}[a-zA-Z]+$
 ^           ^^^^^^^^
 a           bbbacccc
It only matches two words because you effectively don't have any group repetition; this is because you don't really have any groups - only character classes. The simplest fix is to change the first [ and its matching closing bracket (marked by a's in the listing above) to parentheses:
^([a-zA-Z]+\s{1}){0,}[a-zA-Z]+$
This single change will make it work the way you expect! However, there a few recommendations and considerations I'd like to make.
First, for readability and code maintenance, use the single-character repetition operators instead of repetition braces wherever possible: * repeats zero or more times, + repeats one or more times, and ? repeats zero or one time (AKA optional). Your repetition curly braces are syntactically correct and do what you intend them to, but one (marked by b's above) should be removed because it is redundant, and the other (marked by c's above) should be shortened to an asterisk *, as they have exactly the same meaning:
^([a-zA-Z]+\s)*[a-zA-Z]+$
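You can verify the fix quickly outside the ASP.NET validator; this particular expression uses no .NET-specific syntax, so, for example, in Python:
import re

pattern = re.compile(r"^([a-zA-Z]+\s)*[a-zA-Z]+$")

for text in ["one", "one two", "one two three", "one  two", " one", "one "]:
    print(repr(text), bool(pattern.match(text)))
# 'one', 'one two' and 'one two three' match; doubled, leading and trailing
# spaces are rejected. Note that \s also matches tabs and newlines, so use a
# literal space instead of \s if you truly want spaces only between words.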
Second, I would recommend considering (depending upon your application requirements) the \w shorthand character class instead of the [a-zA-Z] character class, with the following considerations:
it matches both upper and lowercase letters
it does match more than letters (it matches digits 0-9 and the underscore as well)
it can often be configured to match non-English (unicode) letters for multi-lingual input
If any of these are unnecessary or undesirable, then you're on the right track!
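For reference, here is the practical difference (shown in Python; .NET's \w is likewise broader than [a-zA-Z], though the exact Unicode behaviour depends on the engine and its options):
import re

print(bool(re.fullmatch(r"[a-zA-Z]+", "foo_123")))   # False: digits and underscore are excluded
print(bool(re.fullmatch(r"\w+", "foo_123")))         # True: \w also covers [0-9_]
print(bool(re.fullmatch(r"\w+", "naïve")))           # True in Python 3: \w is Unicode-aware by default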
On a side note, the character combination \b is a word-boundary assertion and is not needed for your case, as you will already begin and end where there are letters and letters only!
As for learning more about regular expressions, I would recommend Regular-Expressions.info, which has a wealth of info about regexes and the inner workings and quirks of the various implementations. I also use a tool called RegexBuddy to test and debug expressions.

prolog recursion

I am making a predicate that will give me a list of all possible elements. In each iteration it gives me the answer, but after the recursion I am only getting the last answer back. How can I make it give back every single answer?
Thank you.
The problem is that I am trying to find all possible distributions of a list into other lists. The code:
addIn(_,[],Result,Result).
addIn(C,[Element|Rest],[F|R],Result):-
    member(Members, [F|R]),
    sumlist(Members, Sum),
    sumlist([Element], ElementLength),
    Cap is Sum + ElementLength,
    ( Cap =< Ca,
      append([Element], Members, New)....
By calling test I am getting back all of the possible answers. Now if I try something that will fail, like
bp(3,11,[8,2,4,6,1,8,4],Answer).
it just enters an infinite loop. Moreover, if I change the
bp(NB,C,OL,A):-
    addIn(C,OL,[[],[],[]],A);
    bp(NB,C,_,A).
to an AND instead of an OR, I get the error:
ERROR: is/2: Arguments are not sufficiently instantiated
I appreciate the help.
Thanks a lot, @hardmath
It sounds like you are trying to write your own version of findall/3, perhaps limited to a special case of an underlying goal. Doing it generally (constructing a list of all solutions to a given goal) in a user-defined Prolog predicate is not possible without resorting to side-effects with assert/retract.
However a number of useful special cases can be implemented without such "tricks". So it would be helpful to know what predicate defines your "all possible elements". [It may also be helpful to state which Prolog implementation you are using, if only so that responses may include links to documentation for that version.]
One important special case is where the "universe" of potential candidates already exists as a list. In that case we are really asking to find the sublist of "all possible elements" that satisfy a particular goal.
findSublist([ ], _, [ ]).
findSublist([H|T], Goal, [H|S]) :-
    call(Goal, H),
    !,
    findSublist(T, Goal, S).
findSublist([_|T], Goal, S) :-
    findSublist(T, Goal, S).
Many Prologs will allow you to pass the name of a predicate Goal around as an "atom", but if you have a specific goal in mind, you can leave out the middle argument and just hardcode your particular condition into the middle clause of a similar implementation.
Added in response to code posted:
I think I have a glimmer of what you are trying to do. It's hard to grasp because you are not going about it in the right way. Your predicate bp/4 has a single recursive clause, variously attempted using either AND or OR syntax to relate a call to addIn/4 to a call to bp/4 itself.
Apparently you expect wrapping bp/4 around addIn/4 in this way will somehow cause addIn/4 to accumulate or iterate over its solutions. It won't. It might help you to see this if we analyze what happens to the arguments of bp/4.
You are calling the formal arguments bp(NB,C,OL,A) with simple integers bound to NB and C, with a list of integers bound to OL, and with A as an unbound "output" Answer. Note that nothing is ever done with the value NB, as it is not passed to addIn/4 and is passed unchanged to the recursive call to bp/4.
Based on the variable names used by addIn/4 and supporting predicate insert/4, my guess is that NB was intended to mean "number of bins". For one thing you set NB = 3 in your test/0 clause, and later you "hardcode" three empty lists in the third argument in calling addIn/4. Whatever Answer you get from bp/4 comes from what addIn/4 is able to do with its first two arguments passed in, C and OL, from bp/4. As we noted, C is an integer and OL a list of integers (at least in the way test/0 calls bp/4).
So let's try to state just what addIn/4 is supposed to do with those arguments. Superficially addIn/4 seems to be structured for self-recursion in a sensible way. Its first clause is a simple termination condition: when the second argument becomes an empty list, unify the third and fourth arguments, and that gives the "answer" A to its caller.
The second clause for addIn/4 seems to coordinate with that approach. As written it takes the "head" Element off the list in the second argument and tries to find a "bin" in the third argument that Element can be inserted into while keeping the sum of that bin under the "cap" given by C. If everything goes well, eventually all the numbers from OL get assigned to a bin, all the bins have totals under the cap C, and the answer A gets passed back to the caller. The way addIn/4 is written leaves a lot of room for improvement just in basic clarity, but it may be doing what you need it to do.
Which brings us back to the question of how you should collect the answers produced by addIn/4. Perhaps you are happy to print them out one at a time. Perhaps you meant to collect all the solutions produced by addIn/4 into a single list. To finish up the exercise I'll need you to clarify what you really want to do with the Answers from addIn/4.
Let's say you want to print them all out and then stop, with a special case being to print nothing if the arguments being passed in don't allow a solution. Then you'd probably want something of this nature:
newtest :-
    addIn(12, [7, 3, 5, 4, 6, 4, 5, 2], [[],[],[]], Answer),
    format("Answer = ~w\n", [Answer]),
    fail.
newtest.
This is a standard way of getting predicate addIn/4 to try all possible solutions, and then stop with the "fall-through" success of the second clause of newtest/0.
(Added) Suggestions about coding addIn/4:
It will make the code more readable and maintainable if the variable names are clear. I'd suggest using Cap instead of C as the first argument to addIn/4 and BinSum when you take the sum of items assigned to a "bin". Likewise Bin would be better where you used Members. In the third argument to addIn/4 (in the head of the second clause) you don't need an explicit list structure [F|R] since you never refer to either part F or R by itself. So there I'd use Bins.
Some of your predicate calls don't accomplish much that you cannot do more easily. For example, your second call to sumlist/2 involves a list with one item. Thus the sum is just the same as that item, i.e. ElementLength is the same as Element. Here you could just replace both calls to sumlist/2 with one such call:
sumlist([Element|Bin],BinSum)
and then do your test comparing BinSum with Cap. Similarly your call to append/3 just adjoins the single item Element to the front of the list (I'm calling) Bin, so you could just replace what you have called New with [Element|Bin].
You have used an extra pair of parentheses around the last four subgoals (in the second clause for addIn/4). Since AND is implied for all the subgoals of this clause, using the extra pair of parentheses is unnecessary.
The code for insert/4 isn't shown here, but it could be a source of some unintended backtracking in special cases. The better approach would be to have the first call (currently to member/2) be your only point of indeterminacy, i.e., when you choose one of the bins, do it by replacing it with a free variable that gets unified with [Element|Bin] at the next-to-last step.
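Not Prolog, but if it helps to see the input/output relation that addIn/4 is apparently meant to compute, here is a small Python sketch that enumerates every way of distributing the numbers over three bins while keeping each bin's sum at or below the cap. The function name and bin representation are my own; the cap, the item list and the three empty bins mirror the newtest/0 call above:
def distributions(cap, items, bins):
    # Yield every assignment of items to bins in which no bin's sum exceeds cap,
    # mirroring what addIn(Cap, Items, [[],[],[]], Answer) is meant to enumerate.
    if not items:
        yield [list(b) for b in bins]
        return
    head, rest = items[0], items[1:]
    for i, b in enumerate(bins):
        if sum(b) + head <= cap:
            yield from distributions(cap, rest, bins[:i] + [[head] + b] + bins[i + 1:])

# Analogue of the failure-driven loop in newtest/0: print every solution.
for answer in distributions(12, [7, 3, 5, 4, 6, 4, 5, 2], [[], [], []]):
    print("Answer =", answer)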
