Z80 ASM BNF structure... am I on the right track? - bnf

I'm trying to learn BNF and attempting to assemble some Z80 ASM code. Since I'm new to both fields, my question is, am I even on the right track? I am trying to write the format of Z80 ASM as EBNF so that I can then figure out where to go from there to create machine code from the source. At the moment I have the following:
Assignment = Identifier, ":" ;
Instruction = Opcode, [ Operand ], [ Operand ] ;
Operand = Identifier | Something* ;
Something* = "(" , Identifier, ")" ;
Identifier = Alpha, { Numeric | Alpha } ;
Opcode = Alpha, Alpha ;
Int = [ "-" ], Numeric, { Numeric } ;
Alpha = "A" | "B" | "C" | "D" | "E" | "F" |
"G" | "H" | "I" | "J" | "K" | "L" |
"M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" |
"Y" | "Z" ;
Numeric = "0" | "1" | "2" | "3"| "4" |
"5" | "6" | "7" | "8" | "9" ;
Any directional feedback if I am going wrong would be excellent.

Old-school assemblers were typically hand-coded in assembler and used ad hoc parsing techniques to process assembly source lines to produce actual assembler code.
When assembler syntax is simple (e.g. always OPCODE REG, OPERAND) this worked well enough.
Modern machines have messy, nasty instruction sets with lots of instruction variations and operands, which may be expressed with complex syntax allowing multiple index registers to participate in the operand expression. Allowing sophisticated assembly-time expressions with fixed and relocatable constants with various types of addition operators complicates this. Sophisticated assemblers allowing conditional compilation, macros, structured data declarations, etc. all add new demands on syntax. Processing all this syntax by ad hoc methods is very hard and is the reason that parser generators were invented.
Using a BNF and a parser generator is a very reasonable way to build a modern assembler, even for a legacy processor such as the Z80. I have built such assemblers for Motorola 8 bit machines such as the 6800/6809, and am getting ready to do the same for a modern x86. I think you're headed down exactly the right path.
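To make the grammar-first idea concrete, here is a minimal hand-written sketch in Python of a parser for one source line, shaped after the EBNF in the question. It is not the DMS approach described in this answer and not a complete Z80 syntax. The names `Line`, `tokenize`, `Parser` and `parse_line` are hypothetical, and the rule `line = [ ident ":" ] [ ident [ operand { "," operand } ] ]` is an assumption about what the question's grammar is aiming at.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical token set for a tiny Z80-like subset; not a full assembler syntax.
TOKEN_RE = re.compile(r"\s*(?:(?P<ident>[A-Za-z_][A-Za-z0-9_]*)"
                      r"|(?P<number>[0-9]+)"
                      r"|(?P<punct>[:,()]))")

class Line(NamedTuple):
    label: Optional[str]
    mnemonic: Optional[str]
    operands: tuple

def tokenize(text):
    """Split one source line into (kind, value) pairs; ';' starts a comment."""
    text = text.split(";", 1)[0].rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character near column {pos + 1}")
        kind = m.lastgroup
        tokens.append((kind, m.group(kind).upper()))  # fold case like the lexer above
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self, offset=0):
        i = self.pos + offset
        return self.tokens[i] if i < len(self.tokens) else (None, None)

    def take(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, found {tok[1]!r}")
        self.pos += 1
        return tok[1]

    # line = [ ident ":" ] [ ident [ operand { "," operand } ] ]
    def line(self):
        label = mnemonic = None
        operands = []
        if self.peek()[0] == "ident" and self.peek(1) == ("punct", ":"):
            label = self.take("ident")
            self.take("punct", ":")
        if self.peek()[0] == "ident":
            mnemonic = self.take("ident")
            if self.peek()[0] is not None:
                operands.append(self.operand())
                while self.peek() == ("punct", ","):
                    self.take("punct", ",")
                    operands.append(self.operand())
        if self.peek()[0] is not None:
            raise SyntaxError(f"unexpected {self.peek()[1]!r}")
        return Line(label, mnemonic, tuple(operands))

    # operand = atom | "(" atom ")"
    def operand(self):
        if self.peek() == ("punct", "("):
            self.take("punct", "(")
            inner = self.atom()
            self.take("punct", ")")
            return ("indirect", inner)
        return ("direct", self.atom())

    # atom = ident | number
    def atom(self):
        kind, value = self.peek()
        if kind not in ("ident", "number"):
            raise SyntaxError(f"expected operand, found {value!r}")
        self.pos += 1
        return value

def parse_line(text):
    return Parser(tokenize(text)).line()
```

Each method corresponds to one grammar rule, which is the property a parser generator gives you automatically. For example, `parse_line("loop: ld a,(hl)")` yields a `Line` whose operands distinguish the direct `A` from the indirect `(HL)`.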
********** EDIT ****************
The OP asked for example lexer and parser definitions.
I've provided both here.
These are excerpts from real specifications for a 6809 assembler.
The complete definitions are 2-3x the size of the samples here.
To keep space down, I have edited out much of the dark-corner complexity
which is the point of these definitions.
One might be dismayed by the apparent complexity; the
point is that with such definitions, you are trying to describe the
shape of the language, not code it procedurally.
You will pay a significantly higher complexity if you
code all this in an ad hoc manner, and it will be far
less maintainable.
It will also be of some help to know that these definitions
are used with a high-end program analysis system that
has lexing/parsing tools as subsystems, called
the DMS Software Reengineering Toolkit. DMS will automatically build ASTs from the
grammar rules in the parser specification, which makes it a
lot easier to build parsing tools. Lastly,
the parser specification contains so-called "prettyprinter"
declarations, which allow DMS to regenerate source text from the ASTs.
(The real purpose of the grammar was to allow us to build ASTs representing assembler
instructions, and then spit them out to be fed to a real assembler!)
One thing of note: how lexemes and grammar rules are stated (the metasyntax!)
varies somewhat between different lexer/parser generator systems. The
syntax of DMS-based specifications is no exception. DMS has relatively sophisticated
grammar rules of its own, that really aren't practical to explain in the space available here. You'll have to live with the idea that other systems use similar notations:
EBNF for rules and regular expression variants for lexemes.
Given the OP's interests, he can implement similar lexer/parsers
with any lexer/parser generator tool, e.g., FLEX/YACC,
JAVACC, ANTLR, ...
********** LEXER **************
-- M6809.lex: Lexical Description for M6809
-- Copyright (C) 1989,1999-2002 Ira D. Baxter
%%
#mainmode Label
#macro digit "[0-9]"
#macro hexadecimaldigit "<digit>|[a-fA-F]"
#macro comment_body_character "[\u0009 \u0020-\u007E]" -- does not include NEWLINE
#macro blank "[\u0000 \ \u0009]"
#macro hblanks "<blank>+"
#macro newline "\u000d \u000a? \u000c? | \u000a \u000c?" -- form feed allowed only after newline
#macro bare_semicolon_comment "\; <comment_body_character>* "
#macro bare_asterisk_comment "\* <comment_body_character>* "
...[snip]
#macro hexadecimal_digit "<digit> | [a-fA-F]"
#macro binary_digit "[01]"
#macro squoted_character "\' [\u0021-\u007E]"
#macro string_character "[\u0009 \u0020-\u007E]"
%%Label -- (First mode) processes left hand side of line: labels, opcodes, etc.
#skip "(<blank>*<newline>)+"
#skip "(<blank>*<newline>)*<blank>+"
<< (GotoOpcodeField ?) >>
#precomment "<comment_line><newline>"
#preskip "(<blank>*<newline>)+"
#preskip "(<blank>*<newline>)*<blank>+"
<< (GotoOpcodeField ?) >>
-- Note that an apparent register name is accepted as a label in this mode
#token LABEL [STRING] "<identifier>"
<< (local (;; (= [TokenScan natural] 1) ; process all string characters
(= [TokenLength natural] ?:TokenCharacterCount)=
(= [TokenString (reference TokenBodyT)] (. ?:TokenCharacters))
(= [Result (reference string)] (. ?:Lexeme:Literal:String:Value))
[ThisCharacterCode natural]
(define Ordinala #61)
(define Ordinalf #66)
(define OrdinalA #41)
(define OrdinalF #46)
);;
(;; (= (# Result) `') ; start with empty string
(while (<= TokenScan TokenLength)
(;; (= ThisCharacterCode (coerce natural TokenString:TokenScan))
(+= TokenScan) ; bump past character
(ifthen (>= ThisCharacterCode Ordinala)
(-= ThisCharacterCode #20) ; fold to upper case
)ifthen
(= (# Result) (append (# Result) (coerce character ThisCharacterCode)))=
);;
)while
);;
)local
(= ?:Lexeme:Literal:String:Format (LiteralFormat:MakeCompactStringLiteralFormat 0)) ; nothing interesting in string
(GotoLabelList ?)
>>
%%OpcodeField
#skip "<hblanks>"
<< (GotoEOLComment ?) >>
#ifnotoken
<< (GotoEOLComment ?) >>
-- Opcode field tokens
#token 'ABA' "[aA][bB][aA]"
<< (GotoEOLComment ?) >>
#token 'ABX' "[aA][bB][xX]"
<< (GotoEOLComment ?) >>
#token 'ADC' "[aA][dD][cC]"
<< (GotoABregister ?) >>
#token 'ADCA' "[aA][dD][cC][aA]"
<< (GotoOperand ?) >>
#token 'ADCB' "[aA][dD][cC][bB]"
<< (GotoOperand ?) >>
#token 'ADCD' "[aA][dD][cC][dD]"
<< (GotoOperand ?) >>
#token 'ADD' "[aA][dD][dD]"
<< (GotoABregister ?) >>
#token 'ADDA' "[aA][dD][dD][aA]"
<< (GotoOperand ?) >>
#token 'ADDB' "[aA][dD][dD][bB]"
<< (GotoOperand ?) >>
#token 'ADDD' "[aA][dD][dD][dD]"
<< (GotoOperand ?) >>
#token 'AND' "[aA][nN][dD]"
<< (GotoABregister ?) >>
#token 'ANDA' "[aA][nN][dD][aA]"
<< (GotoOperand ?) >>
#token 'ANDB' "[aA][nN][dD][bB]"
<< (GotoOperand ?) >>
#token 'ANDCC' "[aA][nN][dD][cC][cC]"
<< (GotoRegister ?) >>
...[long list of opcodes snipped]
#token IDENTIFIER [STRING] "<identifier>"
<< (local (;; (= [TokenScan natural] 1) ; process all string characters
(= [TokenLength natural] ?:TokenCharacterCount)=
(= [TokenString (reference TokenBodyT)] (. ?:TokenCharacters))
(= [Result (reference string)] (. ?:Lexeme:Literal:String:Value))
[ThisCharacterCode natural]
(define Ordinala #61)
(define Ordinalf #66)
(define OrdinalA #41)
(define OrdinalF #46)
);;
(;; (= (# Result) `') ; start with empty string
(while (<= TokenScan TokenLength)
(;; (= ThisCharacterCode (coerce natural TokenString:TokenScan))
(+= TokenScan) ; bump past character
(ifthen (>= ThisCharacterCode Ordinala)
(-= ThisCharacterCode #20) ; fold to upper case
)ifthen
(= (# Result) (append (# Result) (coerce character ThisCharacterCode)))=
);;
)while
);;
)local
(= ?:Lexeme:Literal:String:Format (LiteralFormat:MakeCompactStringLiteralFormat 0)) ; nothing interesting in string
(GotoOperandField ?)
>>
#token '#' "\#" -- special constant introduction (FDB)
<< (GotoDataField ?) >>
#token NUMBER [NATURAL] "<decimal_number>"
<< (local [format LiteralFormat:NaturalLiteralFormat]
(;; (= ?:Lexeme:Literal:Natural:Value (ConvertDecimalTokenStringToNatural (. format) ? 0 0))
(= ?:Lexeme:Literal:Natural:Format (LiteralFormat:MakeCompactNaturalLiteralFormat format))
);;
)local
(GotoOperandField ?)
>>
#token NUMBER [NATURAL] "\$ <hexadecimal_digit>+"
<< (local [format LiteralFormat:NaturalLiteralFormat]
(;; (= ?:Lexeme:Literal:Natural:Value (ConvertHexadecimalTokenStringToNatural (. format) ? 1 0))
(= ?:Lexeme:Literal:Natural:Format (LiteralFormat:MakeCompactNaturalLiteralFormat format))
);;
)local
(GotoOperandField ?)
>>
#token NUMBER [NATURAL] "\% <binary_digit>+"
<< (local [format LiteralFormat:NaturalLiteralFormat]
(;; (= ?:Lexeme:Literal:Natural:Value (ConvertBinaryTokenStringToNatural (. format) ? 1 0))
(= ?:Lexeme:Literal:Natural:Format (LiteralFormat:MakeCompactNaturalLiteralFormat format))
);;
)local
(GotoOperandField ?)
>>
#token CHARACTER [CHARACTER] "<squoted_character>"
<< (= ?:Lexeme:Literal:Character:Value (TokenStringCharacter ? 2))
(= ?:Lexeme:Literal:Character:Format (LiteralFormat:MakeCompactCharacterLiteralFormat 0 0)) ; nothing special about character
(GotoOperandField ?)
>>
%%OperandField
#skip "<hblanks>"
<< (GotoEOLComment ?) >>
#ifnotoken
<< (GotoEOLComment ?) >>
-- Tokens signalling switch to index register modes
#token ',' "\,"
<<(GotoRegisterField ?)>>
#token '[' "\["
<<(GotoRegisterField ?)>>
-- Operators for arithmetic syntax
#token '!!' "\!\!"
#token '!' "\!"
#token '##' "\#\#"
#token '#' "\#"
#token '&' "\&"
#token '(' "\("
#token ')' "\)"
#token '*' "\*"
#token '+' "\+"
#token '-' "\-"
#token '/' "\/"
#token '//' "\/\/"
#token '<' "\<"
#token '<' "\<"
#token '<<' "\<\<"
#token '<=' "\<\="
#token '</' "\<\/"
#token '=' "\="
#token '>' "\>"
#token '>' "\>"
#token '>=' "\>\="
#token '>>' "\>\>"
#token '>/' "\>\/"
#token '\\' "\\"
#token '|' "\|"
#token '||' "\|\|"
#token NUMBER [NATURAL] "<decimal_number>"
<< (local [format LiteralFormat:NaturalLiteralFormat]
(;; (= ?:Lexeme:Literal:Natural:Value (ConvertDecimalTokenStringToNatural (. format) ? 0 0))
(= ?:Lexeme:Literal:Natural:Format (LiteralFormat:MakeCompactNaturalLiteralFormat format))
);;
)local
>>
#token NUMBER [NATURAL] "\$ <hexadecimal_digit>+"
<< (local [format LiteralFormat:NaturalLiteralFormat]
(;; (= ?:Lexeme:Literal:Natural:Value (ConvertHexadecimalTokenStringToNatural (. format) ? 1 0))
(= ?:Lexeme:Literal:Natural:Format (LiteralFormat:MakeCompactNaturalLiteralFormat format))
);;
)local
>>
#token NUMBER [NATURAL] "\% <binary_digit>+"
<< (local [format LiteralFormat:NaturalLiteralFormat]
(;; (= ?:Lexeme:Literal:Natural:Value (ConvertBinaryTokenStringToNatural (. format) ? 1 0))
(= ?:Lexeme:Literal:Natural:Format (LiteralFormat:MakeCompactNaturalLiteralFormat format))
);;
)local
>>
-- Notice that an apparent register is accepted as a label in this mode
#token IDENTIFIER [STRING] "<identifier>"
<< (local (;; (= [TokenScan natural] 1) ; process all string characters
(= [TokenLength natural] ?:TokenCharacterCount)=
(= [TokenString (reference TokenBodyT)] (. ?:TokenCharacters))
(= [Result (reference string)] (. ?:Lexeme:Literal:String:Value))
[ThisCharacterCode natural]
(define Ordinala #61)
(define Ordinalf #66)
(define OrdinalA #41)
(define OrdinalF #46)
);;
(;; (= (# Result) `') ; start with empty string
(while (<= TokenScan TokenLength)
(;; (= ThisCharacterCode (coerce natural TokenString:TokenScan))
(+= TokenScan) ; bump past character
(ifthen (>= ThisCharacterCode Ordinala)
(-= ThisCharacterCode #20) ; fold to upper case
)ifthen
(= (# Result) (append (# Result) (coerce character ThisCharacterCode)))=
);;
)while
);;
)local
(= ?:Lexeme:Literal:String:Format (LiteralFormat:MakeCompactStringLiteralFormat 0)) ; nothing interesting in string
>>
%%Register -- operand field for TFR, ANDCC, ORCC, EXG opcodes
#skip "<hblanks>"
#ifnotoken << (GotoRegisterField ?) >>
%%RegisterField -- handles registers and indexing mode syntax
-- In this mode, names that look like registers are recognized as registers
#skip "<hblanks>"
<< (GotoEOLComment ?) >>
#ifnotoken
<< (GotoEOLComment ?) >>
#token '[' "\["
#token ']' "\]"
#token '--' "\-\-"
#token '++' "\+\+"
#token 'A' "[aA]"
#token 'B' "[bB]"
#token 'CC' "[cC][cC]"
#token 'DP' "[dD][pP] | [dD][pP][rR]" -- DPR shouldnt be needed, but found one instance
#token 'D' "[dD]"
#token 'Z' "[zZ]"
-- Index register designations
#token 'X' "[xX]"
#token 'Y' "[yY]"
#token 'U' "[uU]"
#token 'S' "[sS]"
#token 'PCR' "[pP][cC][rR]"
#token 'PC' "[pP][cC]"
#token ',' "\,"
-- Operators for arithmetic syntax
#token '!!' "\!\!"
#token '!' "\!"
#token '##' "\#\#"
#token '#' "\#"
#token '&' "\&"
#token '(' "\("
#token ')' "\)"
#token '*' "\*"
#token '+' "\+"
#token '-' "\-"
#token '/' "\/"
#token '<' "\<"
#token '<' "\<"
#token '<<' "\<\<"
#token '<=' "\<\="
#token '<|' "\<\|"
#token '=' "\="
#token '>' "\>"
#token '>' "\>"
#token '>=' "\>\="
#token '>>' "\>\>"
#token '>|' "\>\|"
#token '\\' "\\"
#token '|' "\|"
#token '||' "\|\|"
#token NUMBER [NATURAL] "<decimal_number>"
<< (local [format LiteralFormat:NaturalLiteralFormat]
(;; (= ?:Lexeme:Literal:Natural:Value (ConvertDecimalTokenStringToNatural (. format) ? 0 0))
(= ?:Lexeme:Literal:Natural:Format (LiteralFormat:MakeCompactNaturalLiteralFormat format))
);;
)local
>>
... [snip]
%% -- end M6809.lex
**************** PARSER **************
-- M6809.ATG: Motorola 6809 assembly code parser
-- (C) Copyright 1989;1999-2002 Ira D. Baxter; All Rights Reserved
m6809 = sourcelines ;
sourcelines = ;
sourcelines = sourcelines sourceline EOL ;
<<PrettyPrinter>>: { V(CV(sourcelines[1]),H(sourceline,A<eol>(EOL))); }
-- leading opcode field symbol should be treated as keyword.
sourceline = ;
sourceline = labels ;
sourceline = optional_labels 'EQU' expression ;
<<PrettyPrinter>>: { H(optional_labels,A<opcode>('EQU'),A<operand>(expression)); }
sourceline = LABEL 'SET' expression ;
<<PrettyPrinter>>: { H(A<firstlabel>(LABEL),A<opcode>('SET'),A<operand>(expression)); }
sourceline = optional_label instruction ;
<<PrettyPrinter>>: { H(optional_label,instruction); }
sourceline = optional_label optlabelleddirective ;
<<PrettyPrinter>>: { H(optional_label,optlabelleddirective); }
sourceline = optional_label implicitdatadirective ;
<<PrettyPrinter>>: { H(optional_label,implicitdatadirective); }
sourceline = unlabelleddirective ;
sourceline = '?ERROR' ;
<<PrettyPrinter>>: { A<opcode>('?ERROR'); }
optional_label = labels ;
optional_label = LABEL ':' ;
<<PrettyPrinter>>: { H(A<firstlabel>(LABEL),':'); }
optional_label = ;
optional_labels = ;
optional_labels = labels ;
labels = LABEL ;
<<PrettyPrinter>>: { A<firstlabel>(LABEL); }
labels = labels ',' LABEL ;
<<PrettyPrinter>>: { H(labels[1],',',A<otherlabels>(LABEL)); }
unlabelleddirective = 'END' ;
<<PrettyPrinter>>: { A<opcode>('END'); }
unlabelleddirective = 'END' expression ;
<<PrettyPrinter>>: { H(A<opcode>('END'),A<operand>(expression)); }
unlabelleddirective = 'IF' expression EOL conditional ;
<<PrettyPrinter>>: { V(H(A<opcode>('IF'),H(A<operand>(expression),A<eol>(EOL))),CV(conditional)); }
unlabelleddirective = 'IFDEF' IDENTIFIER EOL conditional ;
<<PrettyPrinter>>: { V(H(A<opcode>('IFDEF'),H(A<operand>(IDENTIFIER),A<eol>(EOL))),CV(conditional)); }
unlabelleddirective = 'IFUND' IDENTIFIER EOL conditional ;
<<PrettyPrinter>>: { V(H(A<opcode>('IFUND'),H(A<operand>(IDENTIFIER),A<eol>(EOL))),CV(conditional)); }
unlabelleddirective = 'INCLUDE' FILENAME ;
<<PrettyPrinter>>: { H(A<opcode>('INCLUDE'),A<operand>(FILENAME)); }
unlabelleddirective = 'LIST' expression ;
<<PrettyPrinter>>: { H(A<opcode>('LIST'),A<operand>(expression)); }
unlabelleddirective = 'NAME' IDENTIFIER ;
<<PrettyPrinter>>: { H(A<opcode>('NAME'),A<operand>(IDENTIFIER)); }
unlabelleddirective = 'ORG' expression ;
<<PrettyPrinter>>: { H(A<opcode>('ORG'),A<operand>(expression)); }
unlabelleddirective = 'PAGE' ;
<<PrettyPrinter>>: { A<opcode>('PAGE'); }
unlabelleddirective = 'PAGE' HEADING ;
<<PrettyPrinter>>: { H(A<opcode>('PAGE'),A<operand>(HEADING)); }
unlabelleddirective = 'PCA' expression ;
<<PrettyPrinter>>: { H(A<opcode>('PCA'),A<operand>(expression)); }
unlabelleddirective = 'PCC' expression ;
<<PrettyPrinter>>: { H(A<opcode>('PCC'),A<operand>(expression)); }
unlabelleddirective = 'PSR' expression ;
<<PrettyPrinter>>: { H(A<opcode>('PSR'),A<operand>(expression)); }
unlabelleddirective = 'TABS' numberlist ;
<<PrettyPrinter>>: { H(A<opcode>('TABS'),A<operand>(numberlist)); }
unlabelleddirective = 'TITLE' HEADING ;
<<PrettyPrinter>>: { H(A<opcode>('TITLE'),A<operand>(HEADING)); }
unlabelleddirective = 'WITH' settings ;
<<PrettyPrinter>>: { H(A<opcode>('WITH'),A<operand>(settings)); }
settings = setting ;
settings = settings ',' setting ;
<<PrettyPrinter>>: { H*; }
setting = 'WI' '=' NUMBER ;
<<PrettyPrinter>>: { H*; }
setting = 'DE' '=' NUMBER ;
<<PrettyPrinter>>: { H*; }
setting = 'M6800' ;
setting = 'M6801' ;
setting = 'M6809' ;
setting = 'M6811' ;
-- collects lines of conditional code into blocks
conditional = 'ELSEIF' expression EOL conditional ;
<<PrettyPrinter>>: { V(H(A<opcode>('ELSEIF'),H(A<operand>(expression),A<eol>(EOL))),CV(conditional[1])); }
conditional = 'ELSE' EOL else ;
<<PrettyPrinter>>: { V(H(A<opcode>('ELSE'),A<eol>(EOL)),CV(else)); }
conditional = 'FIN' ;
<<PrettyPrinter>>: { A<opcode>('FIN'); }
conditional = sourceline EOL conditional ;
<<PrettyPrinter>>: { V(H(sourceline,A<eol>(EOL)),CV(conditional[1])); }
else = 'FIN' ;
<<PrettyPrinter>>: { A<opcode>('FIN'); }
else = sourceline EOL else ;
<<PrettyPrinter>>: { V(H(sourceline,A<eol>(EOL)),CV(else[1])); }
-- keyword-less directive, generates data tables
implicitdatadirective = implicitdatadirective ',' implicitdataitem ;
<<PrettyPrinter>>: { H*; }
implicitdatadirective = implicitdataitem ;
implicitdataitem = '#' expression ;
<<PrettyPrinter>>: { A<operand>(H('#',expression)); }
implicitdataitem = '+' expression ;
<<PrettyPrinter>>: { A<operand>(H('+',expression)); }
implicitdataitem = '-' expression ;
<<PrettyPrinter>>: { A<operand>(H('-',expression)); }
implicitdataitem = expression ;
<<PrettyPrinter>>: { A<operand>(expression); }
implicitdataitem = STRING ;
<<PrettyPrinter>>: { A<operand>(STRING); }
-- instructions valid for m680C (see Software Dynamics ASM manual)
instruction = 'ABA' ;
<<PrettyPrinter>>: { A<opcode>('ABA'); }
instruction = 'ABX' ;
<<PrettyPrinter>>: { A<opcode>('ABX'); }
instruction = 'ADC' 'A' operandfetch ;
<<PrettyPrinter>>: { H(A<opcode>(H('ADC','A')),A<operand>(operandfetch)); }
instruction = 'ADC' 'B' operandfetch ;
<<PrettyPrinter>>: { H(A<opcode>(H('ADC','B')),A<operand>(operandfetch)); }
instruction = 'ADCA' operandfetch ;
<<PrettyPrinter>>: { H(A<opcode>('ADCA'),A<operand>(operandfetch)); }
instruction = 'ADCB' operandfetch ;
<<PrettyPrinter>>: { H(A<opcode>('ADCB'),A<operand>(operandfetch)); }
instruction = 'ADCD' operandfetch ;
<<PrettyPrinter>>: { H(A<opcode>('ADCD'),A<operand>(operandfetch)); }
instruction = 'ADD' 'A' operandfetch ;
<<PrettyPrinter>>: { H(A<opcode>(H('ADD','A')),A<operand>(operandfetch)); }
instruction = 'ADD' 'B' operandfetch ;
<<PrettyPrinter>>: { H(A<opcode>(H('ADD','B')),A<operand>(operandfetch)); }
instruction = 'ADDA' operandfetch ;
<<PrettyPrinter>>: { H(A<opcode>('ADDA'),A<operand>(operandfetch)); }
[..snip...]
-- condition code mask for ANDCC and ORCC
conditionmask = '#' expression ;
<<PrettyPrinter>>: { H*; }
conditionmask = expression ;
target = expression ;
operandfetch = '#' expression ; --immediate
<<PrettyPrinter>>: { H*; }
operandfetch = memoryreference ;
operandstore = memoryreference ;
memoryreference = '[' indexedreference ']' ;
<<PrettyPrinter>>: { H*; }
memoryreference = indexedreference ;
indexedreference = offset ;
indexedreference = offset ',' indexregister ;
<<PrettyPrinter>>: { H*; }
indexedreference = ',' indexregister ;
<<PrettyPrinter>>: { H*; }
indexedreference = ',' '--' indexregister ;
<<PrettyPrinter>>: { H*; }
indexedreference = ',' '-' indexregister ;
<<PrettyPrinter>>: { H*; }
indexedreference = ',' indexregister '++' ;
<<PrettyPrinter>>: { H*; }
indexedreference = ',' indexregister '+' ;
<<PrettyPrinter>>: { H*; }
offset = '>' expression ; -- page zero ref
<<PrettyPrinter>>: { H*; }
offset = '<' expression ; -- long reference
<<PrettyPrinter>>: { H*; }
offset = expression ;
offset = 'A' ;
offset = 'B' ;
offset = 'D' ;
registerlist = registername ;
registerlist = registerlist ',' registername ;
<<PrettyPrinter>>: { H*; }
registername = 'A' ;
registername = 'B' ;
registername = 'CC' ;
registername = 'DP' ;
registername = 'D' ;
registername = 'Z' ;
registername = indexregister ;
indexregister = 'X' ;
indexregister = 'Y' ;
indexregister = 'U' ; -- not legal on M6811
indexregister = 'S' ;
indexregister = 'PCR' ;
indexregister = 'PC' ;
expression = sum '=' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum '<<' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum '</' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum '<=' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum '<' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum '>>' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum '>/' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum '>=' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum '>' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum '#' sum ;
<<PrettyPrinter>>: { H*; }
expression = sum ;
sum = product ;
sum = sum '+' product ;
<<PrettyPrinter>>: { H*; }
sum = sum '-' product ;
<<PrettyPrinter>>: { H*; }
sum = sum '!' product ;
<<PrettyPrinter>>: { H*; }
sum = sum '!!' product ;
<<PrettyPrinter>>: { H*; }
product = term '*' product ;
<<PrettyPrinter>>: { H*; }
product = term '||' product ; -- wrong?
<<PrettyPrinter>>: { H*; }
product = term '/' product ;
<<PrettyPrinter>>: { H*; }
product = term '//' product ;
<<PrettyPrinter>>: { H*; }
product = term '&' product ;
<<PrettyPrinter>>: { H*; }
product = term '##' product ;
<<PrettyPrinter>>: { H*; }
product = term ;
term = '+' term ;
<<PrettyPrinter>>: { H*; }
term = '-' term ;
<<PrettyPrinter>>: { H*; }
term = '\\' term ; -- complement
<<PrettyPrinter>>: { H*; }
term = '&' term ; -- not
term = IDENTIFIER ;
term = NUMBER ;
term = CHARACTER ;
term = '*' ;
term = '(' expression ')' ;
<<PrettyPrinter>>: { H*; }
numberlist = NUMBER ;
numberlist = numberlist ',' NUMBER ;
<<PrettyPrinter>>: { H*; }

BNF is more generally used for structured, nested languages like Pascal, C++, or really anything derived from the Algol family (which includes modern languages like C#). If I were implementing an assembler, I might use some simple regular expressions to pattern-match the opcode and operands. It's been a while since I've used Z80 assembly language, but you might use something like:
/\s*(\w{2,3})\s+((\w+)(,\w+)?)?/
This would match any line which consists of a two- or three-letter opcode followed by one or two operands separated by a comma. After extracting an assembler line like this, you would look at the opcode and generate the correct bytes for the instruction, including the values of the operands if applicable.
The type of parser I've outlined above using regular expressions would be called an "ad hoc" parser, which essentially means you split and examine the input on some kind of block basis (in the case of assembly language, by text line).
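As a rough illustration of this ad hoc, line-at-a-time style, here is a Python sketch built around one regular expression. The pattern is a hypothetical variant of the one above, not the same expression: it is anchored to the whole line, allows an optional `label:` prefix and a trailing `;` comment, and accepts mnemonics of two to four letters because Z80 mnemonics such as CALL, PUSH and DJNZ have four. The name `split_line` is made up for this example.

```python
import re

# Hypothetical pattern: optional "label:", a 2-4 letter mnemonic, then up to two
# comma-separated operands; ';' starts a comment. Operands are kept as raw text.
LINE_RE = re.compile(
    r"""^\s*
        (?:(?P<label>[A-Za-z_]\w*):\s*)?
        (?P<opcode>[A-Za-z]{2,4})
        (?:\s+(?P<op1>[^,;\s]+)
           (?:\s*,\s*(?P<op2>[^,;\s]+))?
        )?
        \s*(?:;.*)?$""",
    re.VERBOSE,
)

def split_line(text):
    """Return (label, opcode, op1, op2) in upper case, or None if the line does not match."""
    m = LINE_RE.match(text)
    if m is None:
        return None
    return tuple(part.upper() if part else None
                 for part in m.group("label", "opcode", "op1", "op2"))
```

The operands come back as uninterpreted strings such as `(HL)`, so anything beyond simple register names (expressions, indexed addressing) still needs its own parsing step.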

I don't think you need to overthink it. There's no point making a parser that takes apart “LD A,A” into a load operation, destination and source register, when you can just string match the whole thing (modulo case and whitespace) into one opcode directly.
There aren't that many opcodes, and they aren't arranged in such a way that you really get much benefit from parsing and understanding the assembler IMO. Obviously you'd need a parser for the byte/address/indexing arguments, but other than that I'd just have a one-to-one lookup.
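A minimal Python sketch of that one-to-one lookup might look like the following. The table holds only a handful of Z80 encodings for illustration, and the helper names `normalize` and `assemble_line` are hypothetical. Instructions with byte, address or index arguments would need the extra operand parser mentioned above and are not handled here.

```python
# A handful of Z80 encodings; a full table would list every fixed instruction form.
OPCODES = {
    "NOP": bytes([0x00]),
    "HALT": bytes([0x76]),
    "RET": bytes([0xC9]),
    "LD A,A": bytes([0x7F]),
    "LD A,B": bytes([0x78]),
    "LD B,A": bytes([0x47]),
    "INC A": bytes([0x3C]),
}

def normalize(text):
    """Fold case, drop a ';' comment, and squeeze whitespace out of the operand field."""
    code = text.split(";", 1)[0].upper().split(None, 1)
    if not code:
        return ""
    mnemonic = code[0]
    operands = "".join(code[1].split()) if len(code) > 1 else ""
    return f"{mnemonic} {operands}".strip()

def assemble_line(text):
    """Map one source line to its machine code by direct table lookup."""
    key = normalize(text)
    if key == "":
        return b""  # blank or comment-only line emits nothing
    try:
        return OPCODES[key]
    except KeyError:
        raise ValueError(f"unknown instruction: {key!r}") from None
```

The normalisation step is what "modulo case and whitespace" amounts to in practice: `"  ld  a , a"` and `"LD A,A"` both reduce to the same table key.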

Related

Lex & Yacc AST homework

I need to write this function in AST, preorder, but when I run my yacc file, it prints "Segmentation fault (core dumped)". If you can, please help me resolve my problem, because it has been a few days and I still do not understand what to do. I checked my syntax and it is working, but for some reason when I add mknode and printtree to it, it prints this message. Please help me.
void foo(int x, y, z; real f){
if (x>y) {
x = x + f;
}
else {
y = x + y + z;
x = f*2;
z = f;
}
}
This is my yacc file, including my function printtree and mknode.
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct node{
char *token;
struct node *left;
struct node *right;
}node;
node *mknode(char *token, node *left, node *right);
void printtree(node *tree);
%}
%union
{
char *s;
struct node *node;
}
%token IF ELSE INT CHAR VOID REAL RETURN GUI
%left '*'
%left '+'
%token <s> NUM ID FUNC
%type <node> S start function func args args1 body if_st ret_st expr block ass calc
%type <s> type
%%
S: start {printtree($1);};
start: function {$$ = mknode("CODE",$1,NULL);};
function: func { $$ = mknode("FUNC",$1, NULL); };
func: type ID '(' args ')' '{' body '}' {$$ = mknode($2,NULL, mknode("ARGS", $4,mknode($1, NULL,$7)));};
type: INT {$$ = "INT";}
| CHAR {$$ = "CHAR";}
| VOID {$$ = "VOID";}
| REAL {$$ = "REAL";};
args: type args1 args {$$ = mknode($1,$2,$3);} | type args1 {$$ = mknode($1,$2,NULL);} ;
args1: ID {$$ = mknode($1,NULL,NULL);}
| ID ';' {$$ = mknode($1,NULL,NULL);}
| ID ',' args1 {$$ = mknode($1,NULL,$3);}
| { $$ = NULL; };
body: if_st {$$ = mknode("BODY", $1, NULL);}
| ret_st {$$ = mknode("BODY", $1, NULL);};
if_st: IF'(' expr ')' '{'block'}' ELSE '{'block'}' {$$ = mknode("IF-ELSE",mknode(NULL,$3,mknode(NULL,$6,$10)),NULL);}
| IF '(' expr ')' '{'block'}'{$$ = mknode("IF",$3,$6);} ;
expr: ID '<' ID {$$ = mknode("<",mknode($1,NULL,NULL),mknode($3,NULL,NULL));}
| ID '>' ID {$$ = mknode(">",mknode($1,NULL,NULL),mknode($3,NULL,NULL));}
| ID '=' ID {$$ = mknode("==",mknode($1,NULL,NULL),mknode($3,NULL,NULL));}
| ID '<' NUM {$$ = mknode("<",mknode($1,NULL,NULL),mknode($3,NULL,NULL));}
| ID '>' NUM {$$ = mknode(">",mknode($1,NULL,NULL),mknode($3,NULL,NULL));}
| ID '=' NUM {$$ = mknode("==",mknode($1,NULL,NULL),mknode($3,NULL,NULL));};
block: block ass {$$ = mknode(NULL,$1,$2);}
| ass {$$ = mknode(NULL,$1,NULL);};
ass: ID '=' calc ';'{$$ = mknode("=",mknode($1,NULL,NULL),mknode(NULL,$3,NULL));};
calc: ID '+' calc {$$ = mknode("+",mknode($1,NULL,NULL),mknode(NULL,$3,NULL));}
| ID '*' calc {$$ = mknode("*",mknode($1,NULL,NULL),mknode(NULL,$3,NULL));}
| NUM '+' calc {$$ = mknode("+",mknode($1,NULL,NULL),mknode(NULL,$3,NULL));}
| NUM '*' calc {$$ = mknode("*",mknode($1,NULL,NULL),mknode(NULL,$3,NULL));}
| NUM {$$ = mknode($1,NULL,NULL);}
| ID {$$ = mknode($1,NULL,NULL);};
ret_st: RETURN GUI calc GUI ';' { $$ = mknode("RET", $3, NULL); };
%%
#include "lex.yy.c"
int main()
{
return yyparse();
}
node *mknode(char *token,node *left,node *right)
{
node *newnode = (node*)malloc(sizeof(node));
char *newstr = (char*)malloc(sizeof*(token)+1);
strcpy(newstr,token);
newnode->left = left;
newnode->right = right;
newnode->token = newstr;
return newnode;
}
void printtree(node *tree)
{
printf("%s\n",tree->token);
if(tree->left)
printtree(tree->left);
if(tree->right)
printtree(tree->right);
}
int yyerror()
{
printf("ERROR\n");
return 0;
}
Most likely cause of the crash:
You call mknode in a couple of places (e.g., the block rule) with NULL as the first argument, but mknode calls strcpy with this argument as the source string, so it will crash.
Other problems:
You use sizeof*(token) where token is a char *. That yields the size of a single char (1), not the length of the string, so the buffer is too small for most tokens. You need strlen(token)+1. Better yet, use strdup(token) to do the malloc+strcpy all in one.
Your grammar is inflexible, with almost-duplicated rules and limited nesting. You're better off using fewer rules -- get rid of all the calc/expr stuff and just have
expr: expr '+' expr
| expr '*' expr
| expr '<' expr
| expr '>' expr
| expr '=' expr
| ID
| NUM
| '(' expr ')'
and set precedence of your operators appropriately. Similarly block and body should be combined into one non-terminal and a couple of rules.
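To show what a single expr rule plus operator precedence buys you, here is a small Python sketch using precedence climbing, a hand-written technique that plays the role yacc's %left declarations would play. It is not yacc output, and the precedence levels in `PRECEDENCE` are assumptions ('*' above '+', comparisons below both, '=' lowest). Nodes are plain `(label, left, right)` tuples, and `preorder` skips missing children rather than dereferencing them, which is the same guard the C code needs around NULL.

```python
import re

# Assumed precedence levels (higher binds tighter); all operators left-associative.
PRECEDENCE = {"=": 1, "<": 2, ">": 2, "+": 3, "*": 4}
TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+|[+*<>=()])")

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"bad character near offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def primary():
        # primary: ID | NUM | '(' expr ')'
        tok = advance()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "(":
            node = expr(1)
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok in PRECEDENCE or tok == ")":
            raise SyntaxError(f"unexpected {tok!r}")
        return (tok, None, None)  # leaf node

    def expr(min_prec):
        # expr: expr OP expr, disambiguated by the precedence table
        left = primary()
        while peek() in PRECEDENCE and PRECEDENCE[peek()] >= min_prec:
            op = advance()
            right = expr(PRECEDENCE[op] + 1)  # +1 makes the operator left-associative
            left = (op, left, right)
        return left

    tree = expr(1)
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree

def preorder(node):
    """Return the labels in preorder; a missing child contributes nothing."""
    if node is None:
        return []
    label, left, right = node
    return [label] + preorder(left) + preorder(right)
```

One rule now covers every nesting that the separate calc and expr rules handled, including parenthesised sub-expressions, and the tree shape follows from the table alone.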

scientific notation and non scientific notation values in one line

This works:
Ada.Text_IO.Put_Line("Oil Rate : " & Float'Image(oil_float));
Ada.Text_IO.Put_Line(oil_float, Exp => 0);
But this doesn't:
Ada.Text_IO.Put_Line("Oil Rate : " & Float'Image(oil_float, Exp => 0) & " is " & (oil_float, Exp => 0));
I wanted to put it in one line. Is it possible?
You could write your own function (it needs with clauses for Ada.Float_Text_IO and Ada.Strings.Fixed):
function Float_Image (Value : Float; Exponent : Natural) return String is
Result : String (1 .. 64);
begin
Ada.Float_Text_IO.Put (Result, Value, Exp => Exponent);
return Ada.Strings.Fixed.Trim (Result, Ada.Strings.Left);
end Float_Image;
(note the unfortunate fixed-length intermediate result; more than enough to hold Float'Last, though)

javacc' LOOKAHEAD( AllSymbols() ) AllSymbols() not chosen, sole to be parsed correctly

The grammar, in a pinch, is as follows:
Phi ::= Phi_sub ( ("&&" | "||") Phi_sub )*
Phi_sub ::= "(" Phi ")" | ...
Psi ::= Psi_sub ( ("&&" | "||") Psi_sub )*
Psi_sub ::= "(" Psi ")" | ...
Xi ::= LOOKAHEAD( Phi ) Phi | LOOKAHEAD( Psi ) Psi
As you can see, an infinite lookahead would in general be required in the Xi production, because the parser needs to distinguish cases like:
((Phi_sub && Phi_sub) || Phi_sub) vs ((Psi_sub && Psi_sub) || Psi_sub)
i.e. an arbitrary amount of prefixing (.
I thought that making the lookahead like above would work, but it doesn't. For example, Phi is chosen, even if Xi does not expand to Phi, but does to Psi. This can be easily checked on a certain stream S by calling Phi with the debugger just after the parser decided, within Xi, to choose Phi, and is about to call Phi. The debugger in such a case shows a proper expansion to Psi, while allowing the parser just to call Phi as it wants would cause a parse exception.
The other way of testing it is swapping Phi and Psi:
Xi ::= LOOKAHEAD( Psi ) Psi | LOOKAHEAD( Phi ) Phi
This will make the parser parse the particular S correctly, and so it seems that simply the first branch within Xi is chosen, be it the valid one or not.
I guess I got some basic assumption wrong, but have no idea what it can be. Should the above work in general, if there are no additional factors, like an ignored inner lookahead?
Your assumptions are not wrong. What you are trying to do should work. And it should work for the reasons you think it should work.
Here is a complete example written in JavaCC.
void Start() : {} { Xi() <EOF> }
void Xi() : {} {
LOOKAHEAD( Phi() ) Phi() { System.out.println( "Phi" ) ; }
| LOOKAHEAD( Psi() ) Psi() { System.out.println( "Psi" ) ; }
}
void Phi() : {} { Phi_sub() ( ("&&" | "||") Phi_sub() )*}
void Phi_sub() : {} { "(" Phi() ")" | "Phi_sub" }
void Psi() : {} { Psi_sub() ( ("&&" | "||") Psi_sub() )* }
void Psi_sub() : {} { "(" Psi() ")" | "Psi_sub" }
And here is some sample output:
Input is : <<Phi_sub>>
Phi
Input is : <<Psi_sub>>
Psi
Input is : <<((Phi_sub && Phi_sub) || Phi_sub)>>
Phi
Input is : <<((Psi_sub && Psi_sub) || Psi_sub)>>
Psi
The problem you are having lies in something not shown in the question.
By the way, it's a bad idea to put a lookahead specification in front of every alternative.
void X() : {} { LOOKAHEAD(Y()) Y() | LOOKAHEAD(Z()) Z() }
is roughly equivalent to
void X() : {} { LOOKAHEAD(Y()) Y() | LOOKAHEAD(Z()) Z() | fail with a stupid error message }
For example, here is another run of the above grammar
Input is : <<((Psi_sub && Psi_sub) || Phi_sub)>>
NOK.
Encountered "" at line 1, column 1.
Was expecting one of:
After all lookahead has failed, the parser is left with an empty set of expectations!
If you change Xi to
void Xi() : {} {
LOOKAHEAD( Phi() ) Phi() { System.out.println( "Phi" ) ; }
| Psi() { System.out.println( "Psi" ) ; }
}
you get a slightly better error message
Input is : <<((Psi_sub && Psi_sub) || Phi_sub)>>
NOK.
Encountered " "Phi_sub" "Phi_sub "" at line 1, column 26.
Was expecting one of:
"(" ...
"Psi_sub" ...
You can also make a custom error message
void Xi() : {} {
LOOKAHEAD( Phi() ) Phi() { System.out.println( "Phi" ) ; }
| LOOKAHEAD( Psi() ) Psi() { System.out.println( "Psi" ) ; }
| { throw new ParseException( "Expected either a Phi or a Psi at line "
+ getToken(1).beginLine
+ ", column " + getToken(1).beginColumn + "." ) ;
}
}

Date Time Parser using YACC shift reduce conflicts

I have the following YACC parser
%start Start
%token _DTP_LONG // Any number; Max upto 4 Digits.
%token _DTP_SDF // 17 Digit number indicating SDF format of Date Time
%token _DTP_EOS // end of input
%token _DTP_MONTH //Month names e.g Jan,Feb
%token _DTP_AM //Is A.M
%token _DTP_PM //Is P.M
%%
Start : DateTimeShortExpr
| DateTimeLongExpr
| SDFDateTimeExpr EOS
| DateShortExpr EOS
| DateLongExpr EOS
| MonthExpr EOS
;
DateTimeShortExpr : DateShortExpr TimeExpr EOS {;}
| DateShortExpr AMPMTimeExpr EOS {;}
;
DateTimeLongExpr : DateLongExpr TimeExpr EOS {;}
| DateLongExpr AMPMTimeExpr EOS {;}
;
DateShortExpr : Number { rc = vDateTime.SetDate ((Word) $1, 0, 0);
}
| Number Number { rc = vDateTime.SetDate ((Word) $1, (Word) $2, 0); }
| Number Number Number { rc = vDateTime.SetDate ((Word) $1, (Word) $2, (Word) $3); }
;
DateLongExpr : Number AbsMonth { // case : number greater than 31, consider as year
if ($1 > 31) {
rc = vDateTime.SetDateFunc (1, (Word) $2, (Word) $1);
}
// Number is considered as days
else {
rc = vDateTime.SetDateFunc ((Word) $1, (Word) $2, 0);
}
}
| Number AbsMonth Number {rc = vDateTime.SetDateFunc((Word) $1, (Word) $2, (Word) $3);}
;
TimeExpr : Number { rc = vDateTime.SetTime ((Word) $1, 0, 0);}
| Number Number { rc = vDateTime.SetTime ((Word) $1, (Word) $2, 0); }
| Number Number Number { rc = vDateTime.SetTime ((Word) $1, (Word) $2, (Word) $3); }
;
AMPMTimeExpr : TimeExpr _DTP_AM { rc = vDateTime.SetTo24hr(TP_AM) ; }
| TimeExpr _DTP_PM { rc = vDateTime.SetTo24hr(TP_PM) ; }
| _DTP_AM TimeExpr { rc = vDateTime.SetTo24hr(TP_AM) ; }
| _DTP_PM TimeExpr { rc = vDateTime.SetTo24hr(TP_PM) ; }
;
SDFDateTimeExpr : SDFNumber { rc = vDateTime.SetSDF ($1);}
;
MonthExpr : AbsMonth { rc = vDateTime.SetNrmMth ($1);}
| AbsMonth Number { rc = vDateTime.Set ($1,$2);}
;
Number : _DTP_LONG { $$ = $1; }
;
SDFNumber : _DTP_SDF { $$ = $1; }
;
EOS : _DTP_EOS { $$ = $1; }
;
AbsMonth : _DTP_MONTH { $$ = $1; }
;
%%
It gives three shift/reduce conflicts. How can I remove them?
The shift-reduce conflicts are inherent in the "little language" that your grammar describes. Consider the stream of input tokens
_DTP_LONG _DTP_LONG _DTP_LONG EOS
Each _DTP_LONG can be reduced as a Number. But should
Number Number Number
be reduced as a 1-number DateShortExpr followed by a 2-number TimeExpr or as a 2-number DateShortExpr followed by a 1-number TimeExpr? The ambiguity is built in.
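A small Python sketch can count the competing readings. It only models inputs made of plain Number tokens followed by EOS (no month names, no AM/PM), and the function name `readings` is made up for the illustration. Each result is a pair (numbers used by DateShortExpr, numbers used by TimeExpr).

```python
def readings(count):
    """List every way `count` consecutive Number tokens fit the Start rule."""
    found = []
    if 1 <= count <= 3:
        # Start : DateShortExpr EOS
        found.append((count, 0))
    for date_len in range(1, 4):            # DateShortExpr takes 1..3 Numbers
        time_len = count - date_len
        if 1 <= time_len <= 3:              # TimeExpr takes 1..3 Numbers
            # Start : DateTimeShortExpr, i.e. DateShortExpr TimeExpr EOS
            found.append((date_len, time_len))
    return found

for n in range(1, 7):
    print(n, readings(n))
```

Under these assumptions, only a single number or six numbers have exactly one reading; every length in between has two or three.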
If possible, redesign your language by adding additional symbols to distinguish dates from times--colons to set off the parts of a time and slashes to set off the parts of a date, for instance.
Update
I don't think that you can use yacc/bison's precedence features here, because the tokens are indistinguishable.
You will have to rely on yacc/bison's default behavior when it encounters a shift/reduce conflict, that is, to shift rather than reduce. Consider this example in your output:
+------------------------- STATE 9 -------------------------+
+ CONFLICTS:
? sft/red (shift & new state 12, rule 11) on _DTP_LONG
+ RULES:
DateShortExpr : Number^ (rule 11)
DateShortExpr : Number^Number
DateShortExpr : Number^Number Number
DateLongExpr : Number^AbsMonth
DateLongExpr : Number^AbsMonth Number
+ ACTIONS AND GOTOS:
_DTP_LONG : shift & new state 12
_DTP_MONTH : shift & new state 13
: reduce by rule 11
Number : goto state 26
AbsMonth : goto state 27
What the parser will do is shift and go to state 12, rather than reduce by rule 11 (DateShortExpr : Number). This means that whenever another _DTP_LONG follows, the parser will never interpret a single Number as a DateShortExpr; it will always shift.
And a difficulty with relying on the default behavior is that it might change as you make modifications to your grammar.
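As a rough model of the consequence, assuming the conflicts on _DTP_LONG inside DateShortExpr are all resolved by shifting, the date part swallows as many numbers as it can and the time part only gets what is left. The function name `shift_preferred_split` is hypothetical, and the sketch again covers only plain Number tokens followed by EOS.

```python
def shift_preferred_split(count):
    """Return (date numbers, time numbers) when shift always wins, or None on a syntax error."""
    if count < 1:
        return None
    date_len = min(count, 3)        # keep shifting Numbers into DateShortExpr
    time_len = count - date_len     # whatever remains goes to TimeExpr
    if time_len > 3:
        return None                 # TimeExpr holds at most three Numbers
    return (date_len, time_len)

for n in range(1, 8):
    print(n, shift_preferred_split(n))
```

So in this model an input of three numbers is always a date with no time, even though the grammar also allows the other readings.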

ANTLRWorks - Code Generation getting stuck and not generating

I'm defining a grammar for arithmetic expressions using the following syntax. It's a subset of a more complicated whole, but the problems only occurred when I extended the grammar to include logical operations.
When I try to generate code using ANTLRWorks, it takes a very long time to even start generating. I think the problem is in the rule for paren, as it includes a loop back to the start of expr. Any help in fixing this would be great.
Thanks in advance
the options used:
options {
tokenVocab = MAliceLexer;
backtrack = true;
}
code for the Grammar is below:
type returns [ASTTypeNode n]
: NUMBER {$n = new IntegerTypeNode();}
| LETTER {$n = new CharTypeNode();}
| SENTENCE { $n = new StringTypeNode();}
;
term returns [ASTNode n]
: IDENTIFIER {$n = new IdentifierNode($IDENTIFIER.text);}
| CHAR {$n = new LetterNode($CHAR.text.charAt(1));}
| INTEGER {$n = new NumberNode(Integer.parseInt( $INTEGER.text ));}
| STRING { $n = new StringNode( $STRING.text ); }
;
paren returns [ASTNode n]
:term { $n = $term.n; }
| LPAR expr RPAR { $n = $expr.n; }
;
negation returns [ASTNode n]
:BITNEG (e = negation) {$n = new BitNotNode($e.n);}
| paren {$n = $paren.n;}
;
unary returns [ASTNode n]
:MINUS (u =unary) {$n = new NegativeNode($u.n);}
| negation {$n = $negation.n;}
;
mult returns [ASTNode n]
: unary DIV (m = mult) {$n = new DivideNode($unary.n, $m.n);}
| unary MULT (m = mult) {$n = new MultiplyNode($unary.n, $m.n);}
| unary MOD (m=mult) {$n = new ModNode($unary.n, $m.n);}
| unary {$n = $unary.n;}
;
binAS returns [ASTNode n]
: mult PLUS (b=binAS) {$n = new AdditionNode($mult.n, $b.n);}
| mult MINUS (b=binAS) {$n = new SubtractionNode($mult.n, $b.n);}
| mult {$n = $mult.n;}
;
comp returns [ASTNode n]
: binAS GREATEREQ ( e =comp) {$n = new GreaterEqlNode($binAS.n, $e.n);}
|binAS GREATER ( e = comp ) {$n = new GreaterNode($binAS.n, $e.n);}
|binAS LESS ( e = comp ) {$n = new LessNode($binAS.n, $e.n);}
|binAS LESSEQ ( e = comp ) {$n = new LessEqNode($binAS.n, $e.n);}
|binAS {$n = $binAS.n;}
;
equality returns [ASTNode n]
: comp EQUAL ( e = equality) {$n = new EqualNode($comp.n, $e.n);}
|comp NOTEQUAL ( e = equality ) {$n = new NotEqualNode($comp.n, $e.n);}
|comp { $n = $comp.n; }
;
bitAnd returns [ASTNode n]
: equality BITAND (b=bitAnd) {$n = new BitAndNode($equality.n, $b.n);}
| equality {$n = $equality.n;}
;
bitXOr returns [ASTNode n]
: bitAnd BITXOR (b = bitXOr) {$n = new BitXOrNode($bitAnd.n, $b.n);}
| bitAnd {$n = $bitAnd.n;}
;
bitOr returns [ASTNode n]
: bitXOr BITOR (e =bitOr) {$n = new BitOrNode($bitXOr.n, $e.n);}
| bitXOr {$n = $bitXOr.n;}
;
logicalAnd returns [ASTNode n]
: bitOr LOGICALAND (e = logicalAnd){ $n = new LogicalAndNode( $bitOr.n, $e.n ); }
| bitOr { $n = $bitOr.n; }
;
expr returns [ASTNode n]
: logicalAnd LOGICALOR ( e = expr ) { $n = new LogicalOrNode( $logicalAnd.n, $e.n); }
| IDENTIFIER INC {$n = new IncrementNode(new IdentifierNode($IDENTIFIER.text));}
| IDENTIFIER DEC {$n = new DecrementNode(new IdentifierNode($IDENTIFIER.text));}
| logicalAnd {$n = $logicalAnd.n;}
;
This seems to be a bug introduced in version 3.3 (and upwards). ANTLR 3.2 produces the following error when generating a parser from your grammar:
warning(205): Test.g:31:2: ANTLR could not analyze this decision in rule equality; often this is because of recursive rule references visible from the left edge of alternatives. ANTLR will re-analyze the decision with a fixed lookahead of k=1. Consider using "options {k=1;}" for that decision and possibly adding a syntactic predicate.
error(10): internal error: org.antlr.tool.Grammar.createLookaheadDFA(Grammar.java:1279): could not even do k=1 for decision 6; reason: timed out (>1000ms)
It looks to me like you've used an LR grammar as the basis for your ANTLR grammar. Consider starting over, this time with LL parsing in mind. Have a look at the following Q&A to see how to parse expressions using ANTLR: ANTLR: Is there a simple example?
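To show what "with LL parsing in mind" can look like, here is a minimal hand-written sketch in Python rather than ANTLR. The class and method names are invented, and it covers only a few of the operators from the question. The key idea is that each precedence level is written as `operand (op operand)*`, so a single token of lookahead decides whether to continue, and no alternative shares a long common prefix with another.

```python
import re

def tokenize(text):
    # Integers, the operators + - * / %, and parentheses.
    return re.findall(r"\d+|[-+*/%()]", text)

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def binary(self, operators, operand):
        # operand (op operand)* : loop while the next token is one of the operators.
        node = operand()
        while self.peek() in operators:
            op = self.take()
            node = (op, node, operand())
        return node

    def add(self):
        return self.binary(("+", "-"), self.mult)

    def mult(self):
        return self.binary(("*", "/", "%"), self.unary)

    def unary(self):
        if self.peek() == "-":
            self.take()
            return ("neg", self.unary())
        return self.paren()

    def paren(self):
        if self.peek() == "(":
            self.take()
            node = self.add()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        token = self.take()
        if token is None or not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

def parse(text):
    parser = Parser(text)
    node = parser.add()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return node

print(parse("8 - 3 - 2"))
print(parse("1 + 2 * 3"))
```

A side effect of the loop form is that binary operators group to the left, which is usually what you want for subtraction and division.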
Also, I see you're using some tokens that look an awful lot like each other: LETTER, CHAR, SENTENCE and IDENTIFIER. You must realize that if all of them may start with, for example, a lower case letter, only one of the rules is matched (the one that matches most, or in case of a tie, the one defined first in the lexer grammar). The lexer does not produce tokens based on what the parser "asks" for, it creates tokens independently from the parser.
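The following Python sketch imitates that tokenizing rule (longest match wins; on a tie, the rule defined first wins). The token definitions in `RULES` are hypothetical, since the actual MAliceLexer rules are not shown in the question.

```python
import re

# Hypothetical lexer rules, in definition order.
RULES = [
    ("LETTER", r"[a-z]"),
    ("IDENTIFIER", r"[a-z][a-z0-9]*"),
    ("INTEGER", r"[0-9]+"),
]

def lex(text):
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in RULES:
            match = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

print(lex("x"))
print(lex("xy x 42"))
```

With these rules a lone `x` is always a LETTER and never an IDENTIFIER, no matter which of the two the parser would accept at that point.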
Finally, for a simple expression parser, you really don't need predicates (and backtrack=true causes ANTLR to automatically insert predicates in front of all parser rules!).

Resources