How to go from AST to backend code? - abstract-syntax-tree

I'm working on a compiler for a programming language and I have a good representation of the source language in the form of an Abstract Syntax Tree (AST). I tried generating backend code directly by traversing the AST, but that's not working well. In general, think of mapping an AST for a C-like language down to an assembly-like language. I now think that there is some phase that I'm missing to go from the AST to backend code. The problem is that going from the AST to the next representation (say, 3-address code) is rought. It feels like there is a step in between that I'm missing.
[source lang] -> [lex/parse]-> [AST] -> [semantic analysis] -> [?] -> [backend code]
This is what I've come up with while thinking about it:
1) Transform the AST of the source language to an AST that represents the backend. This means that I would need to have 2 different ASTs. Then, output the backend from the transformed AST.
2) Transform the AST to a different type of data structure, and generated backend code based on this other structure. I'm not sure what this other struct would be.
3) Traverse the AST in a different way (from the way used to pretty print it) to generate the backend code. This is how I tried doing it first but it doesn't seem right; I feels like a hackish way to go about it.
What are my options go to from AST to backend code?
Note that I'm not asking what kinds of representations the AST could be turned into, such as 3-address code. For example, I understand that a C addition like so:
x = a + b + c
could be turned into 3-address like so:
t1 = add a, b
t2 = add t1, c
set x, t2
I am asking for techniques/guidance/experience on how to do it.
To give an example of the type of answer that I'm looking for would be:
Question: What steps can I take to perform semantic analysis on the source lang?
Answer: Parse the language into an AST and traverse the AST to perform the semantic checks.

One can represent the program any way one likes; you can build compilers completely using trees. The purpose of a representation is to make it easy to collect certain kinds of facts; the representation serves to make collection/processing of those specific facts easier. Thus one expects the representation of the program to change at different stages of compilation, depending on what the compiler is trying to achieve in that stage.
One typical scheme is to translate programs through these representations to produce final code:
* ASTs
* Triples
* Abstract machine instructions
* Concrete machine instructions
The fact that you don't seem to know this, means you haven't done your homework. This is pretty well described in compiler books. Go read one.

Related

What's a cterm?

The Isabelle implementation manual says:
Types ctyp and cterm represent certified types and terms, respectively. These are abstract datatypes that guarantee that its values have passed the full well-formedness (and well-typedness) checks, relative to the declarations of type constructors, constants etc. in the background theory.
My understanding is: when I write cterm t, Isabelle checks that the term is well-built according to the theory where it lives in.
The abstract types ctyp and cterm are part of the same inference kernel that is mainly responsible for thm. Thus syntactic operations on ctyp and cterm are located in the Thm module, even though theorems
are not yet involved at that stage.
My understanding is: if I want to modify a cterm at the ML level I will use operations of the Thm module (where can I find that module?)
Furthermore, it looks like cterm t is an entity that converts a term of the theory level to a term of the ML-level. So I inspect the code of cterm in the declaration:
ML_val ‹
some_simproc #{context} #{cterm "some_term"}
›
and get to ml_antiquotations.ML:
ML_Antiquotation.value \<^binding>‹cterm› (Args.term >> (fn t =>
"Thm.cterm_of ML_context " ^ ML_Syntax.atomic (ML_Syntax.print_term t))) #>
This line of code is unreadable to me with my current knowledge.
I wonder if someone could give a better low-level explanation of cterm. What is the meaning of the code below? Where are located the checks that cterm performs on theory terms? Where are located the manipulations that we can do on cterms (the module Thm above)?
The ‘c’ stands for ‘certified’ (Or ‘checked’? Not sure). A cterm is basically a term that has been undergone checking. The #{cterm …} antiquotation allows you to simply write down terms and directly get a cterm in various contexts (in this case probably the context of ML, i.e. you directly get a cterm value with the intended content). The same works for regular terms, i.e. #{term …}.
You can manipulate cterms directly using the functions from the Thm structure (which, incidentally, can be found in ~~/src/Pure/thm.ML; most of these basic ML files are in the Pure directory). However, in my experience, it is usually easier to just convert the cterm to a regular term (using Thm.term_of – unlike Thm.cterm_of, this is a very cheap operation) and then work with the term instead. Directly manipulating cterms only really makes sense if you need another cterm in the end, because re-certifying terms is fairly expensive (still, unless your code is called very often, it probably isn't really a performance problem).
In most cases, I would say the workflow is like this: If you get a cterm as an input, you turn it into a regular term. This you can easily inspect/take apart/whatever. At some point, you might have to turn it into a cterm again (e.g. because you want to instantiate some theorem with it or use it in some other way that involves the kernel) and then you just use Thm.cterm_of to do that.
I don't know exactly what the #{cterm …} antiquotation does internally, but I would imagine that at the end of the day, it just parses its parameter as an Isabelle term and then certifies it with the current context (i.e. #{context}) using something like Thm.cterm_of.
To gather my findings with cterms, I post an answer.
This is how a cterm looks like in Pure:
abstype cterm =
Cterm of {cert: Context.certificate,
t: term, T: typ,
maxidx: int,
sorts: sort Ord_List.T}
(To be continued)

Why use macros in Julia?

I was reading up on the documentation of macros and ran into the following under the `Hold up: why macros' section. The reasoning given to use macros is as follows:
Macros are necessary because they execute when code is parsed,
therefore, macros allow the programmer to generate and include
fragments of customized code before the full program is run
This leads me to wonder why someone would want to use "generate and include fragments of customized code before the full program is run". Can someone provide context as to why this would be beneficial and/or other good use cases for macros?
Let me give you my view on macros.
A macro basically is a code -> code function. It takes code (a Julia expression) as input and spits out code (a different Julia expression).
Why is this useful? It has multiple purposes:
compile time copy-and-paste: You don't have to write the same piece of code multiple times but instead can define a short macro that writes it for you wherever you put it. (example)
domain specific language (DSL): You can create special syntax that after the macros code -> code transform is replaced by pure Julia constructs. This is used in many packages to define special syntax, for example here and here.
code generation: Imagine you want to write a really long piece of code which, although being long, is very simple because it has some kind of pattern that repeats itself rather trivially. Writing that code by hand can be a pain (or even practically impossible). A macro can programmatically generate the code for you. One example is for-loop unrolling (see here and here). But even the #time macro isn't doing much more than just putting a bunch of Base.time_ns() function calls around the provided Julia expression.
special string parsing: If you type the literal 3.2 in Julia it will be parsed and interpreted as a Float64. Now, imagine you want to supply a number literally that goes beyond Float64 precision but would fit into a BigFloat. Typing big(3.123124812498124812498) won't work, because the literal number is first interpreted as a Float64 and then handed to the big function. Instead you need a way to tell Julia at parse time that this should become a BigFloat. This is handled by a #big_str 3.2 macro which (for convenience) can also be written as big"3.2". The latter is just syntactic sugar.
There might be many more applications of macros, but those are the most important to me.
Let me end by referencing Steven G. Johnson's great talk at JuliaCon 2019:
Most of the time, don't do metaprogramming :)

Reflecting on a Type parameter

I am trying to create a function
import Language.Reflection
foo : Type -> TT
I tried it by using the reflect tactic:
foo = proof
{
intro t
reflect t
}
but this reflects on the variable t itself:
*SOQuestion> foo
\t => P Bound (UN "t") (TType (UVar 41)) : Type -> TT
Reflection in Idris is a purely syntactic, compile-time only feature. To predict how it will work, you need to know about how Idris converts your program to its core language. Importantly, you won't be able to get ahold of reflected terms at runtime and reconstruct them like you would with Lisp. Here's how your program is compiled:
Internally, Idris creates a hole that will expect something of type Type -> TT.
It runs the proof script for foo in this state. We start with no assumptions and a goal of type Type -> TT. That is, there's a term being constructed which looks like ?rhs : Type => TT . rhs. The ?foo : ty => body syntax shows that there's a hole called foo whose eventual value will be available inside of body.
The step intro t creates a function whose argument is t : Type - this means that we now have a term like ?foo_body : TT . \t : Type => foo_body.
The reflect t step then fills the current hole by taking the term on its right-hand side and converting it to a TT. That term is in fact just a reference to the argument of the function, so you get the variable t. reflect, like all other proof script steps, only has access to the information that is available directly at compile time. Thus, the result of filling in foo_body with the reflection of the term t is P Bound (UN "t") (TType (UVar (-1))).
If you could do what you are wanting here, it would have major consequences both for understanding Idris code and for running it efficiently.
The loss in understanding would come from the inability to use parametricity to reason about the behavior of functions based on their types. All functions would effectively become potentially ad-hoc polymorphic, because they could (say) run differently on lists of strings than on lists of ints.
The loss in performance would come from representing enough type information to do the reflection. After Idris code is compiled, there is no type information left in it (unlike in a system such as the JVM or .NET or a dynamically typed system such as Python, where types have a runtime representation that code can access). In Idris, types can be very large, because they can contain arbitrary programs - this means that far more information would have to be maintained, and computation occurring at the type level would also have to be preserved and repeated at runtime.
If you're wanting to reflect on the structure of a type for further proof automation at compile time, take a look at the applyTactic tactic. Its argument should be a function that takes a reflected context and goal and gives back a new reflected tactic script. An example can be seen in the Data.Vect source.
So I suppose the summary is that Idris can't do what you want, and it probably never will be able to, but you might be able to make progress another way.

Pure functional bottom up tree algorithm

Say I wanted to write an algorithm working on an immutable tree data structure that has a list of leaves as its input. It needs to return a new tree with changes made to the old tree going upwards from those leaves.
My problem is that there seems to be no way to do this purely functional without reconstructing the entire tree checking at leaves if they are in the list, because you always need to return a complete new tree as the result of an operation and you can't mutate the existing tree.
Is this a basic problem in functional programming that only can be avoided by using a better suited algorithm or am I missing something?
Edit: I not only want to avoid to recreate the entire tree but also the functional algorithm should have the same time complexity than the mutating variant.
The most promising I have seen so far (which admittedly is not very long...) is the Zipper data structure: It basically keeps a separate structure, a reverse path from the node to root, and does local edits on this separate structure.
It can do multiple local edits, most of which are constant time, and write them back to the tree (reconstructing the path to root, which are the only nodes that need to change) all in one go.
The Zipper is a standard library in Clojure (see the heading Zippers - Functional Tree Editing).
And there's the original paper by Huet with an implementation in OCaml.
Disclaimer: I have been programming for a long time, but only started functional programming a couple of weeks ago, and had never even heard of the problem of functional editing of trees until last week, so there may very well be other solutions I'm unaware of.
Still, it looks like the Zipper does most of what one could wish for. If there are other alternatives at O(log n) or below, I'd like to hear them.
You may enjoy reading
http://lorgonblog.spaces.live.com/blog/cns!701679AD17B6D310!248.entry
This depends on your functional programming language. For instance in Haskell, which is a Lazy functional programming language, results are calculated at the last moment; when they are acutally needed.
In your example the assumption is that because your function creates a new tree, the whole tree must be processed, whereas in reality the function is just passed on to the next function and only executed when necessary.
A good example of lazy evaluation is the sieve of erastothenes in Haskell, which creates the prime numbers by eliminating the multiples of the current number in the list of numbers. Note that the list of numbers is infinite. Taken from here
primes :: [Integer]
primes = sieve [2..]
where
sieve (p:xs) = p : sieve [x|x <- xs, x `mod` p > 0]
I recently wrote an algorithm that does exactly what you described - https://medium.com/hibob-engineering/from-list-to-immutable-hierarchy-tree-with-scala-c9e16a63cb89
It works in 2 phases:
Sort the list of nodes by their depth in the hierarchy
constructs the tree from bottom up
Some caveats:
No Node mutation, The result is an Immutable-tree
The complexity is O(n)
Ignores cyclic referencing in the incoming list

How do I detect circular logic or recursion in a custom expression evaluator?

I've written an experimental function evaluator that allows me to bind simple functions together such that when the variables change, all functions that rely on those variables (and the functions that rely on those functions, etc.) are updated simultaneously. The way I do this is instead of evaluating the function immediately as it's entered in, I store the function. Only when an output value is requested to I evaluate the function, and I evaluate it each and every time an output value is requested.
For example:
pi = 3.14159
rad = 5
area = pi * rad * rad
perim = 2 * pi * rad
I define 'pi' and 'rad' as variables (well, functions that return a constant), and 'area' and 'perim' as functions. Any time either 'pi' or 'rad' change, I expect the results of 'area' and 'perim' to change in kind. Likewise, if there were any functions depending on 'area' or 'perim', the results of those would change as well.
This is all working as expected. The problem here is when the user introduces recursion - either accidental or intentional. There is no logic in my grammar - it's simply an evaluator - so I can't provide the user with a way to 'break out' of recursion. I'd like to prevent it from happening at all, which means I need a way to detect it and declare the offending input as invalid.
For example:
a = b
b = c
c = a
Right now evaluating the last line results in a StackOverflowException (while the first two lines evaluate to '0' - an undeclared variable/function is equal to 0). What I would like to do is detect the circular logic situation and forbid the user from inputing such a statement. I want to do this regardless of how deep the circular logic is hidden, but I have no idea how to go about doing so.
Behind the scenes, by the way, input strings are converted to tokens via a simple scanner, then to an abstract syntax tree via a hand-written recursive descent parser, then the AST is evaluated. The language is C#, but I'm not looking for a code solution - logic alone will be fine.
Note: this is a personal project I'm using to learn about how parsers and compilers work, so it's not mission critical - however the knowledge I take away from this I do plan to put to work in real life at some point. Any help you guys can provide would be appreciated greatly. =)
Edit: In case anyone's curious, this post on my blog describes why I'm trying to learn this, and what I'm getting out of it.
I've had a similar problem to this in the past.
My solution was to push variable names onto a stack as I recursed through the expressions to check syntax, and pop them as I exited a recursion level.
Before I pushed each variable name onto the stack, I would check if it was already there.
If it was, then this was a circular reference.
I was even able to display the names of the variables in the circular reference chain (as they would be on the stack and could be popped off in sequence until I reached the offending name).
EDIT: Of course, this was for single formulae... For your problem, a cyclic graph of variable assignments would be the better way to go.
A solution (probably not the best) is to create a dependency graph.
Each time a function is added or changed, the dependency graph is checked for cylces.
This can be cut short. Each time a function is added, or changed, flag it. If the evaluation results in a call to the function that is flagged, you have a cycle.
Example:
a = b
flag a
eval b (not found)
unflag a
b = c
flag b
eval c (not found)
unflag b
c = a
flag c
eval a
eval b
eval c (flagged) -> Cycle, discard change to c!
unflag c
In reply to the comment on answer two:
(Sorry, just messed up my openid creation so I'll have to get the old stuff linked later...)
If you switch "flag" for "push" and "unflag" for "pop", it's pretty much the same thing :)
The only advantage of using the stack is the ease of which you can provide detailed information on the cycle, no matter what the depth. (Useful for error messages :) )
Andrew

Resources