I will make this easy to understand by using JavaScript as the language.
Using ASCII as the alphabet, and given all possible "statements" of length N formed from ASCII characters, how many will be valid JS programs? Example for length 8:
var i=0; //Is a string 8 characters long, and valid JS
jar i=0; //Is a string 8 characters long, but invalid JS
gfjsjhh3 //Is a string 8 characters long, but invalid JS
Now imagine we have all possible strings 8 characters long.
How many will be valid JS?
Further rules:
1) variables are as short as possible
2) no blank spaces except where necessary
More formal definition of problem:
If we are given a grammar G over an alphabet K, and also the collection of all possible strings (statements) of length N that can be formed from K (or all of K* up to length N), how many of those statements satisfy grammar G?
I know you are thinking this is some academic, away-from-reality stuff. However, if the number of statements that are programs is much smaller than the total number of combinations, you could use small "numbers" to refer to the locations where these sparse programs appear in the sea of combinations, and send this "address" instead of the whole program, greatly reducing the payload.
The grammar is usually only one aspect of validity checks; check out ES6 §5.3. And grammars are poorly suited to capturing your "as short as possible" requirement. But grammars are a good tool to reason about things, so let's concentrate on that. It will still allow invalid programs, and programs with long variable names, but as a starting point for a proof of concept (whatever your concept may be) it should suffice.
You might start by thinking about derivation trees based on some BNF grammar for JS. Something like sections §11 to §15 of the ES6 standard, but make sure that the grammar you use goes down to the character level and does not treat a whole identifier as a single terminal. For your hypothetical compression scheme, if you encode the choice at each node of the tree, you have something like the compression effect you describe.
In order to count the number of programs of a given length, you could do dynamic programming on the BNF rules. So if you have a rule which says
IndexedMemberExpression ::= MemberExpression '[' Expression ']'
(loosely modeled after ES6 §12.3) then you know that an IndexedMemberExpression of length n is made up of a MemberExpression of length i and an Expression of length n − i − 2, for some 0 ≤ i ≤ n − 2, so if you know how many ways there are for those, you know for the whole expression. Have one array of length 1000 for each BNF non-terminal, fill them in order of increasing length, and you get the number of derivation trees (i.e. grammatically correct programs) for the root rule.
Actually an IndexedMemberExpression is just one kind of MemberExpression, and you'd want to sum all of those up. So it's more like this:
allocate an array of size 1000 for each non-terminal,
and initialize all the elements to zero
for n from 0 to 999:
    …
    # compute PrimaryExpression[n] first
    # rule MemberExpression ::= PrimaryExpression
    MemberExpression[n] = PrimaryExpression[n]
    # rule MemberExpression ::= MemberExpression '[' Expression ']'
    if n >= 2:
        for i from 0 to n - 2:
            MemberExpression[n] += MemberExpression[i] * Expression[n-i-2]
    # rule MemberExpression ::= MemberExpression '.' IdentifierName
    if n >= 1:
        for i from 0 to n - 1:
            MemberExpression[n] += MemberExpression[i] * IdentifierName[n-i-1]
    …
Do this for all the rules, in an order which makes sure that you fully update each non-terminal on the left hand side of the increments before using it for the first time on the right hand side.
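To make the update order concrete, here is a minimal runnable sketch of the same dynamic programming in Python. The grammar is a toy one made up for illustration, not the ES6 grammar: identifiers are non-empty words over a two-letter alphabet, a PrimaryExpression is just an IdentifierName, and an Expression is just a MemberExpression. The function name and the `letters` parameter are my own.

```python
def count_member_expressions(max_len, letters=2):
    """Count strings of each length derivable from Expression in a toy grammar.

    Toy grammar (an assumption for this sketch, not the ES6 grammar):
      IdentifierName    ::= letter | IdentifierName letter
      PrimaryExpression ::= IdentifierName
      MemberExpression  ::= PrimaryExpression
                          | MemberExpression '[' Expression ']'
                          | MemberExpression '.' IdentifierName
      Expression        ::= MemberExpression
    """
    size = max_len + 1
    ident = [0] * size
    primary = [0] * size
    member = [0] * size
    expr = [0] * size
    for n in range(size):
        # Identifiers of length n: letters**n, and none of length 0.
        ident[n] = letters ** n if n >= 1 else 0
        primary[n] = ident[n]
        # rule MemberExpression ::= PrimaryExpression
        member[n] = primary[n]
        # rule MemberExpression ::= MemberExpression '[' Expression ']'
        for i in range(n - 1):          # i from 0 to n - 2
            member[n] += member[i] * expr[n - i - 2]
        # rule MemberExpression ::= MemberExpression '.' IdentifierName
        for i in range(n):              # i from 0 to n - 1
            member[n] += member[i] * ident[n - i - 1]
        # Expression is updated last, once member[n] is complete.
        expr[n] = member[n]
    return expr


counts = count_member_expressions(4)
print(counts[1:])
```

Every term on the right-hand side refers to a strictly shorter length, or to a non-terminal already finished for the current length, which is the ordering requirement described above. Because this toy grammar is unambiguous, the number of derivation trees equals the number of distinct strings.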
Note that for almost any character sequence of length n, the corresponding string literal will be a valid JS expression of length n + 2. So don't expect the number of valid programs to have a radically different asymptotic behavior compared to the number of arbitrary character sequences.
I wrote this code to generate all unique substrings of a given string, and I'm struggling to work out its complexity because it uses recursion. I think a good time complexity for this problem is O(N²), but what is the complexity of my code, and how can I improve it?
'''
a b c d
ab ac ad bc bd cd
abc abd acd bcd
abcd
'''
dict_poss = {}

# get all letters
def func(string):
    for i, value in enumerate(string):
        dict_poss[value] = True
        recursive(value, string[i+1:])

# get all combinations
def recursive(letter, string):
    for i, value in enumerate(string):
        if letter + value not in dict_poss:
            dict_poss[letter + value] = True
            recursive(letter + value, string[i+1:])
    return

func("abcd")
print(dict_poss)
From what you have written at the top there, you are trying to find all possible subsequences of the string, in which case this will be O(2^n). Think of the number of possible binary strings of length N, where you can construct a subsequence by the mask of each possible binary string (take a letter if 1, ignore it if 0).
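The mask idea can be sketched directly. This is an illustration of the counting argument, not a replacement for your code; the function name is my own.

```python
def subsequences(s):
    """Return the set of non-empty subsequences of s, one per bit mask."""
    n = len(s)
    found = set()
    # Each mask from 1 to 2**n - 1 selects the letters whose bit is set.
    for mask in range(1, 1 << n):
        found.add("".join(s[i] for i in range(n) if (mask >> i) & 1))
    return found


print(sorted(subsequences("abcd"), key=lambda w: (len(w), w)))
```

For "abcd" this yields 2^4 - 1 = 15 strings, which is the 4 + 6 + 4 + 1 listed in your comment block, so the loop count alone already grows as 2^n.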
If you want to find all possible substrings, the cost comes down to how strings are implemented in the language you're using. In C++ it's fairly trivial to do this in O(n^2), but in Java it would be O(n^3), since substring/concatenation is O(n) (although you could do it in O(n^2) in Java, you just have to be tricky about what you do :)). I'm not sure what it is here; I'm guessing this is Python (you should tag your question with the language you're using if you include code), but you could look it up. You could also time it with differently sized inputs; it wouldn't be hard to get a measurable runtime with a complexity of n^2.
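For comparison, here is a small sketch of the contiguous-substring version in Python (the function name is my own). There are n(n+1)/2 index pairs, so the loop itself is quadratic, but each slice copies its characters, so the total work in this form is closer to O(n^3).

```python
def substrings(s):
    """Return the set of non-empty contiguous substrings of s."""
    n = len(s)
    # One slice per (start, end) pair; each slice copies end - start characters.
    return {s[start:end] for start in range(n) for end in range(start + 1, n + 1)}


print(sorted(substrings("abcd"), key=lambda w: (len(w), w)))
```

For "abcd" this gives 10 substrings, and "ac" is not among them, which is the difference from the subsequence output your code produces.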
So in my < 24 hours of bison/flex investigation I've seen lots of documentation that indicates that left recursion is better than right recursion. Some places even mention that with left recursion, you need constant space on the Bison parser stack whereas right recursion requires order N space. However, I can't quite find any sources that explains what is going on explicitly.
As an example (parser that only adds and subtracts):
Scanner:
%%
[0-9]+ {return NUMBER;}
%%
Parser:
%%
/* Left */
expression:
NUMBER
| expression '+' NUMBER { $$ = $1 + $3; }
| expression '-' NUMBER { $$ = $1 - $3; }
;
/* Right */
expression:
NUMBER
| NUMBER '+' expression { $$ = $1 + $3; }
| NUMBER '-' expression { $$ = $1 - $3; }
;
%%
For the example of 1+5-2, it seems with left recursion, the parser receives '1' from the lexer and sees that '1' matches expression: NUMBER and pushes an expression of value 1 to the parser stack. It sees + and pushes. Then it sees 5 and the expression(1), + and 5 matches expression: expression '+' NUMBER so it pops twice, does the math and pushes a new expression with value 6 on the stack and then repeats for the subtraction. At any one point, there is at max, 3 symbols on the stack. So it's like an in-place calculation and operates left to right.
With right recursion, I'm not sure why it has to load all symbols on the stack, but I'm going to attempt to describe why that might be the case. It sees a 1 and matches expression: NUMBER so it pushes an expression with value 1 on the stack. It pushes '+' on the stack. When it sees the 5, my first thought is that 5 on its own could match expression: NUMBER and hence be an expression of value 5, and then it plus the last two symbols on the stack could match expression: NUMBER '+' expression. But my assumption is that because expression is on the right of the rule, it can't jump the gun and evaluate 5 as an expression, since with LALR(1) it already knows more symbols are coming, so it has to wait until it hits the end of the list?
TL;DR:
Can someone maybe explain with some detail how Bison manages its parse stack relative to how it does its shift/reduction with the parser grammar rules? Silly/contrived examples welcome!
With LR (bottom-up) parsing, each non-terminal is reduced precisely when its last token is encountered. (LALR parsing is a simplified LR parse which handles lookahead slightly less precisely.) Until a non-terminal is reduced, all its components live on the stack. So if you use right recursion and you are parsing
NUMBER + NUMBER + NUMBER + NUMBER
the reductions won't start until you get to the end, because each NUMBER starts an expression and all the expressions end at the last NUMBER.
If you use left recursion, each NUMBER terminates an expression, so a reduction happens each time a NUMBER is encountered.
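A rough way to see the difference is to simulate the shifts and reductions by hand. The sketch below is not how Bison works internally (there are no parser tables or states here); it hard-codes the reduction points described above for the two toy grammars and records the deepest stack. All names are my own.

```python
OPERATORS = ("+", "-")


def max_stack_depth(tokens, left_recursive):
    """Return (deepest stack size, final stack) for a hand-simulated parse."""
    stack, deepest = [], 0
    for pos, tok in enumerate(tokens):
        stack.append(tok if tok in OPERATORS else "NUMBER")   # shift
        deepest = max(deepest, len(stack))
        if left_recursive:
            # Every NUMBER ends an expression, so reduce immediately.
            if stack[-1] == "NUMBER":
                if len(stack) >= 3 and stack[-3] == "expression":
                    del stack[-3:]      # expression: expression op NUMBER
                else:
                    stack.pop()         # expression: NUMBER
                stack.append("expression")
        elif pos == len(tokens) - 1:
            # Right recursion: nothing ends before the last NUMBER.
            stack[-1] = "expression"    # expression: NUMBER
            while len(stack) >= 3 and stack[-3] == "NUMBER":
                del stack[-3:]          # expression: NUMBER op expression
                stack.append("expression")
    return deepest, stack


tokens = ["1", "+", "5", "-", "2"]
print(max_stack_depth(tokens, left_recursive=True)[0])
print(max_stack_depth(tokens, left_recursive=False)[0])
```

With left recursion the depth stays at 3 however long the input is; with right recursion it equals the number of tokens.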
That's not the reason to use left-recursion though. You use left-recursion because it accurately describes the language. If you have 7 - 2 - 1, you want the result to be 4, because that's what algebraic rules require: the expression is parsed as though it were (7 - 2) - 1, so 7 - 2 must be reduced first. With right-recursion, you would incorrectly evaluate that as 6, because the 2 - 1 would reduce first.
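The two groupings are easy to check outside a parser. This sketch just folds a list of numbers with subtraction in each direction; the function names are my own.

```python
from functools import reduce


def subtract_left_assoc(numbers):
    """Evaluate a - b - c as (a - b) - c, as left recursion groups it."""
    return reduce(lambda acc, n: acc - n, numbers)


def subtract_right_assoc(numbers):
    """Evaluate a - b - c as a - (b - c), as right recursion groups it."""
    acc = numbers[-1]
    for n in reversed(numbers[:-1]):
        acc = n - acc
    return acc


print(subtract_left_assoc([7, 2, 1]))   # (7 - 2) - 1
print(subtract_right_assoc([7, 2, 1]))  # 7 - (2 - 1)
```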
Most operators associate to the left, so you use left-recursion. For the occasional operator which associates to the right, you need right recursion and you have to live with the stack growing. It's not a big deal. Your machine has tons of memory.
As an example, consider assignment. a = b = 42 means a = (b = 42). If you did it left associatively, you'd first set a to b, and then attempt to set something to 42; (a = b) = 42 wouldn't make sense in most languages and it is certainly not the expected action.
LL (topdown) parsing uses lookahead to predict which production will be reduced. It can't handle left-recursion at all because the prediction ends up in a recursive loop: expression starts with an expression which starts with an expression … and the parser never manages to predict a NUMBER. So with LL parsers, you have to use right-recursion and then your grammar does not correctly describe the language (assuming that the language has left-associative operators, as is usually the case). Some people don't mind this, but I think that grammars should actually indicate the correct parse, and I find the modification necessary to make a grammar parseable with a top-down parser to be messy and hard to read. Your mileage may vary.
By the way, "force down your throat" is a very ungenerous description of documentation which is trying to give you good advice. It is good to be skeptical -- you understand things better if you work at figuring out why they work the way they do -- but many people just want the good advice.
So after reading this rather important page in the bison documentation:
https://www.gnu.org/software/bison/manual/html_node/Lookahead.html#Lookahead
combined with running with
%debug
and
yydebug = 1;
in my main()
I think I see exactly what is happening.
With left recursion, it sees a 1 and shifts it. The lookahead is now +. Then it determines that it can reduce the 1 to an expression via expression: NUMBER. So it pops and puts an expression on the stack. It shifts + and NUMBER(5) and then sees it can reduce via expression: expression '+' NUMBER and pops 3x and pushes a new expression(6). Basically, by using the lookahead and the rules, bison can determine if it needs to shift or reduce at any time as it reads the tokens. As this repeats, at most there are 3 symbols/groupings on the parse stack (for this simplified expression evaluation section).
With right recursion, it sees a 1 and shifts it. The lookahead is now +. The parser sees no reason to reduce 1 to an expression(1) because there is no rule that begins with expression '+', so it just continues shifting each token until it gets to the end of the input. At this point, there is [NUMBER, +, NUMBER, -, NUMBER] on the stack, so it sees that the most recent number can be reduced to expression(2) and pushes that in its place. Then the rules start getting applied (expression: NUMBER '-' expression) and so on.
So the key to my understanding is that Bison uses the lookahead token to make intelligent decisions about reducing now or just shifting based on the rules it has at its disposal.
I am having difficulty understanding the following:
1) AST matching: how are two ASTs similar? Are types included in the comparison/matching, or only operations like +, -, ++, etc.?
2) Two statements being "syntactically similar" (a term I read in a paper somewhere): can we say that the two statements in the example below are syntactically similar?
int x = 1 + 2
String y = "1" + "2"
I am using Java with Eclipse right now, and that is the AST I am trying to understand.
What ASTs are:
An AST is a data structure representing program source text that consists of nodes that contain a node type, and possibly a literal value, and a list of children nodes. The node type corresponds to what OP calls "operations" (+, -, ...) but also includes language commands (do, if, assignment, call, ...) , declarations (int, struct, ...) and literals (number, string, boolean). [It is unclear what OP means by "type"]. (ASTs often have additional information in each node referring back to the point of origin in the source text file).
What ASTs are for:
OP seems puzzled by the presence of ASTs in Eclipse.
ASTs are used to represent program text in a form which is easier to interpret than the raw text. They provide a means to reason about the program structure or content; sometimes they are used to enable the modification of program ("refactoring") by modifying the AST for a program and then regenerating the text from the AST.
Comparing ASTs for similarity is not a really common use in my experience, except in clone detection and/or pattern matching.
Comparing ASTs:
Comparing ASTs for equality is easy: compare the root node type/literal value for equality; if they are not equal, the comparison is complete, else (recursively) compare the children nodes.
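As a minimal sketch of that recursion, assume a node is a Python tuple whose first element is the node type or literal value and whose remaining elements are the child nodes. That representation and the function name are assumptions for illustration; they are not Eclipse's AST classes.

```python
def ast_equal(a, b):
    """Structural equality of two trees given as (label, child, child, ...)."""
    # Different label or different number of children: not equal, stop here.
    if a[0] != b[0] or len(a) != len(b):
        return False
    # Same root: recursively compare the children pairwise.
    return all(ast_equal(x, y) for x, y in zip(a[1:], b[1:]))


int_decl = ("declaration_with_assignment", ("int", ("x",)), ("+", ("1",), ("2",)))
same_decl = ("declaration_with_assignment", ("int", ("x",)), ("+", ("1",), ("2",)))
string_decl = ("declaration_with_assignment", ("String", ("y",)), ("+", ('"1"',), ('"2"',)))

print(ast_equal(int_decl, same_decl))
print(ast_equal(int_decl, string_decl))
```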
Comparing ASTs for similarity is harder because you must decide how to relax the equality comparison. In particular, you must decide on a precise definition of similarity. There are many ways to define this, some rather shallow syntactically, some more semantically sophisticated.
My paper Clone Detection Using Abstract Syntax Trees describes one way to do this, using similarity defined as the ratio of the number of nodes shared divided by the number of nodes total in both trees. Shared nodes are computed by comparing the trees top down to the point where some child is different. (The actual comparison is to compute an anti-unifier.) This similarity measure is rather shallow, but it works remarkably well in finding code clones in big software systems.
From that perspective, OP's examples:
int x = 1 + 2
String y = "1" + "2"
have trees written as S-expressions:
(declaration_with_assignment (int x) (+ 1 2))
(declaration_with_assignment (String y) (+ "1" "2"))
These are not very similar; they only share a root node whose type is "declaration_with_assignment" and the top of the + subtree. (Here the node count is 12 with only 2 matching nodes, for a similarity of 2/12.)
These would be more similar:
int x = 1 + 2
float x = 1.0 + 2
(S-expressions)
(declaration_with_assignment (int x) (+ 1 2))
(declaration_with_assignment (float x) (+ 1.0 2))
which share the declaration with assignment, the add node, the literal leaf node 2, and arguably the literal nodes for integer 1 and float 1.0, depending on whether you wish to define them as "equal" or not, for a similarity of 4/12.
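The counts in the two examples above can be reproduced with a small sketch. It is a simplification of the measure described here, not the anti-unifier computation from the paper: it walks both trees top down, counts a node as shared when the labels agree, and stops descending where they differ. The tuple representation (label first, then children), the function names, and the `same` parameter are all assumptions for illustration.

```python
def size(tree):
    """Number of nodes in a tree given as (label, child, child, ...)."""
    return 1 + sum(size(child) for child in tree[1:])


def shared(a, b, same=lambda x, y: x == y):
    """Count nodes shared top down, stopping below any differing node."""
    if not same(a[0], b[0]):
        return 0
    return 1 + sum(shared(x, y, same) for x, y in zip(a[1:], b[1:]))


def similarity(a, b, same=lambda x, y: x == y):
    """Return (shared nodes, total nodes in both trees)."""
    return shared(a, b, same), size(a) + size(b)


def numeric_same(x, y):
    """Treat numeric literals such as 1 and 1.0 as equal labels."""
    try:
        return float(x) == float(y)
    except ValueError:
        return x == y


int_x = ("declaration_with_assignment", ("int", ("x",)), ("+", ("1",), ("2",)))
string_y = ("declaration_with_assignment", ("String", ("y",)), ("+", ('"1"',), ('"2"',)))
float_x = ("declaration_with_assignment", ("float", ("x",)), ("+", ("1.0",), ("2",)))

print(similarity(int_x, string_y))               # the 2/12 case
print(similarity(int_x, float_x))                # 1 and 1.0 counted as different
print(similarity(int_x, float_x, numeric_same))  # 1 and 1.0 counted as equal
```

Whether 1 and 1.0 count as shared is exactly the "arguably" above; the optional comparison function makes that choice explicit.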
If you change one of the trees to be a pattern tree, in which some "leaves" are pattern variables, you can then use such pattern trees to find code that has certain structure.
The surface syntax pattern:
?type ?variable = 1 + ?expression
with S-expression
(declaration_with_assignment (?type ?variable) (+ 1 ?expression))
matches the first of OP's examples but not the second.
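Here is a toy matcher that shows the idea. It is only a sketch and has nothing to do with how DMS or Eclipse represent trees: nodes are tuples of a label followed by children, a label starting with "?" is a pattern variable, a variable with no children binds a whole subtree, and a variable with children binds just the node label. All names are hypothetical.

```python
def bind(bindings, name, value):
    """Record a binding, or check it against an earlier one for the same name."""
    if name in bindings:
        return bindings[name] == value
    bindings[name] = value
    return True


def match(pattern, tree, bindings):
    plabel, pkids = pattern[0], pattern[1:]
    tlabel, tkids = tree[0], tree[1:]
    if plabel.startswith("?"):
        if not pkids:
            return bind(bindings, plabel, tree)     # variable leaf: any subtree
        if not bind(bindings, plabel, tlabel):      # variable label: any node type
            return False
    elif plabel != tlabel:
        return False
    if len(pkids) != len(tkids):
        return False
    return all(match(p, t, bindings) for p, t in zip(pkids, tkids))


def find_bindings(pattern, tree):
    """Return the variable bindings if the pattern matches, else None."""
    bindings = {}
    return bindings if match(pattern, tree, bindings) else None


pattern = ("declaration_with_assignment",
           ("?type", ("?variable",)),
           ("+", ("1",), ("?expression",)))
int_x = ("declaration_with_assignment", ("int", ("x",)), ("+", ("1",), ("2",)))
string_y = ("declaration_with_assignment", ("String", ("y",)), ("+", ('"1"',), ('"2"',)))

print(find_bindings(pattern, int_x))
print(find_bindings(pattern, string_y))
```

The second tree fails because the pattern requires the integer literal 1 as the left operand, and the string literal "1" is a different leaf.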
As far as I know, Eclipse doesn't offer any pattern-based matching abilities.
But these are very useful in program analysis and/or program transformation tools. For some specific examples, too long to include here, see DMS Rewrite Rules
(Full disclosure: DMS is a product of my company. I'm the architect).
I'm working on a problem from the Languages and Machines: An Introduction to the Theory of Computer Science (3rd Edition) in Chapter 2 Example 6.
I need help finding the answer of:
A recursive definition of the set of strings over {a,b} that contain one b and an even number of a's before the first b?
When looking for a recursive definition, start by identifying the base cases and then look for the recursive steps, as you would for induction. What are the smallest strings in this language? Well, any string must have a b. Is b alone a string in the language? Why yes it is, since there are zero a's before it and zero is an even number.
Rule 1: b is in L.
Now, given any string in the language, how can we get more strings? Well, we can apparently add any number of a's to the end of the string and get another string in the language. In fact, we can get all such strings from b if we simply allow you to add one more a to the end of a string in the language. From x in L, we therefore recover xa, xaa, ..., xa^n, ... = xa*.
Rule 2: if x is in L then xa is in L.
Finally, what can we do to the beginning of strings in our language? The number of a's must be even. So far, rules 1 and 2 only allow us to construct strings that have zero a's before the b. We should be able to get two, four, six, etc., all the even numbers, of a's. A rule that lets us add two a's to the front of any string in our language will let us add ever more a's to the beginning while maintaining the evenness property we require. Starting with x in L, we recover aax, aaaax, ..., (aa)^n x, ... = (aa)*x.
Rule 3: if x is in L, then aax is in L.
Optionally, you may add the sometimes implicitly understood rule that only those things allowed by the aforementioned rules are allowed. Otherwise, technically anything is allowed since we haven't explicitly disallowed anything yet.
Rule 4: Nothing is in L unless by virtue of some combination of the rules 1, 2 and/or 3 above.
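If you want to convince yourself that rules 1 to 3 generate exactly the intended strings, a quick check is to build the set up to some length and compare it with the closed form (aa)*ba*. This is only a sanity check in Python, not part of the definition; the function name is my own.

```python
import re


def generate(max_len):
    """Build all strings of L up to max_len by applying rules 1 to 3."""
    language = {"b"}                          # Rule 1: b is in L
    frontier = ["b"]
    while frontier:
        x = frontier.pop()
        for y in (x + "a", "aa" + x):         # Rule 2 and Rule 3
            if len(y) <= max_len and y not in language:
                language.add(y)
                frontier.append(y)
    return language


strings = generate(5)
print(sorted(strings, key=lambda w: (len(w), w)))
# Every generated string has the shape (aa)*ba*.
print(all(re.fullmatch(r"(aa)*ba*", w) for w in strings))
```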
I am using this piece of code and it triggers a stack overflow; if I use Extlib's Hashtbl, the error does not occur. Any hints on how to use a specialized Hashtbl without the stack overflow?
module ColorIdxHash = Hashtbl.Make(
  struct
    type t = Img_types.rgb_t
    let equal = (==)
    let hash = Hashtbl.hash
  end
)
(* .. *)
let (ctable: int ColorIdxHash.t) = ColorIdxHash.create 256 in
for x = 0 to width - 1 do
  for y = 0 to height - 1 do
    let c = Img.get img x y in
    let rgb = Color.rgb_of_color c in
    if not (ColorIdxHash.mem ctable rgb) then
      ColorIdxHash.add ctable rgb (ColorIdxHash.length ctable)
  done
done;
(* .. *)
The backtrace points to hashtbl.ml:
Fatal error: exception Stack_overflow
Raised at file "hashtbl.ml", line 54, characters 16-40
Called from file "img/write_bmp.ml", line 150, characters 52-108
...
Any hints?
Well, you're using physical equality (==) to compare the colors in your hash table. If the colors are structured values (I can't tell from this code), none of them will be physically equal to each other. If all the colors are distinct objects, they will all go into the table, which could really be quite a large number of objects. On the other hand, the hash function is going to be based on the actual color R,G,B values, so there may well be a large number of duplicates. This will mean that your hash buckets will have very long chains. Perhaps some internal function isn't tail recursive, and so is overflowing the stack.
Normally the length of the longest chain will be 2 or 3, so it wouldn't be surprising that this error doesn't come up often.
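The effect is easy to reproduce in miniature. The sketch below is a Python analogy, not OCaml and not your Img_types.rgb_t: a key class hashes by its R,G,B values but compares by identity, the way (==) does, so every freshly built color lands in the same bucket as a new entry. The class and function names are made up.

```python
class PhysicalColor:
    """Hashes structurally but compares by identity, like equal = (==)."""

    def __init__(self, r, g, b):
        self.rgb = (r, g, b)

    def __hash__(self):
        return hash(self.rgb)

    def __eq__(self, other):
        return self is other


class StructuralColor(PhysicalColor):
    """Same hash, but compares by value, like equal = (=)."""

    __hash__ = PhysicalColor.__hash__

    def __eq__(self, other):
        return self.rgb == other.rgb


def count_entries(color_class, pixels=1000):
    """Index the colors of an all-red image and return the table size."""
    table = {}
    for _ in range(pixels):
        color = color_class(255, 0, 0)        # a fresh object per pixel
        if color not in table:
            table[color] = len(table)
    return len(table)


print(count_entries(PhysicalColor))
print(count_entries(StructuralColor))
```

With identity comparison the table ends up with one entry per pixel, all sharing one hash value; with structural comparison it has a single entry.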
Looking at my copy of hashtbl.ml (OCaml 3.12.1), I don't see anything non-tail-recursive on line 54. So my guess might be wrong. On line 54 a new internal array is allocated for the hash table. So another idea is just that your hashtable is just getting too big (perhaps due to the unwanted duplicates).
One thing to try is to use structural equality (=) and see if the problem goes away.
One reason you may have non-termination or stack overflows is if your type contains cyclic values. (==) will terminate on cyclic values (while (=) may not), but Hashtbl.hash is probably not cycle-safe. So if you manipulate cyclic values of type Img_types.rgb_t, you have to devise your own cycle-safe hash function -- typically, calling Hashtbl.hash on only one of the non-cyclic subfields/subcomponents of your values.
I've already been bitten by precisely this issue in the past. Not a fun bug to track down.