In the implementation of the L compiler, the focus was put on the readability and simplicity of the source code. As often in computer science, it is difficult to make things simple: I had a complete, working prototype, with parser, typing, cps transformation and compilation to LLVM when I started this blog; I spent all this time cleaning, documenting, and especially simplifying the compiler, publishing the progress along the way.
I believe that the result is worth the work: a simpler compiler is easier to extend, both for myself and future contributors. The challenge will be to keep things easy as more features are added; this is possible only if we wait for things to be well structured before adding the non-essential features (for now, I just keep a huge todo-list with those); and if we do not hesitate to perform a local redesign occasionally.
The toplevel of the compiler, that link all the passes together, is a good example of the results of such a redesign. A design goal of this toplevel is to make each pass communicate with a minimal interface. I had written a first attempt, using a record updated functionally, with each field containing the internal state of one pass. The toplevel consisted in a giant function with nested loops – one loop per pass. This structure was inelegant for a number of reasons:
After a while, I came up with a much simpler design using streams of definitions. The code for a toplevel is then dead-simple:
module Stream = Extensions.Stream
let process_file file =
let parsetree_stream = Parser.make_stream file in
let ast_stream = Astfromsexp.from_stream parsetree_stream in
let cps_stream = Cpstransform.from_stream ast_stream in
let converted_stream =
Stream.iter_and_copy Cpsconvertclosures.in_definition cps_stream in
let llvm_stream = Cpsllvm.from_stream converted_stream in
Stream.iter (fun elt → Llvmexec.exec_nodef () elt) llvm_stream
The actual implementation also interposes optional printer to see the contents of the stream, which is very useful when debugging the compiler.
The stream interface make it easy to add new passes, completely remove the state from the interface, factorize the pattern when a pass can produce several elements at once, retain the minimal memory consumption of the nested loop design, and can be easily updated to a parallel pipelining implementation.
The interface of each pass is made minimal. Each pass only declares:
And that is all. The interface of each pass clear is minimal, which allows to study or change it independently.
Note: this way of structuring things imply that transformation from one pass to the next is done in the second pass. (This implies that the second pass depends on the first one).
An alternative is to expose functions to build elements in the second pass, and move the transformation code from the beginning of the second pass to the end of the first pass. In this case, the first pass depends on the second pass. One advantage is that the first pass does not have to expose the type of its internal format. I am considering using this structure for the parser, where a "parsetree" is produced, but whose type is not very interesting outside of the parsing phase; while the building functions for the AST are more interesting.
Inside each pass, the code is structured around a hierarchy of modules and interfaces. The idea is that each interface progressively hides more details about the implementation of the pass, until the top where we get the minimal interface presented earlier. Intermediate levels show up when we manage to group a complex piece of code into a simple interface. For instance in the CPS pass, there are four intermediary modules:
Structuring the code around hierarchical, modular interfaces and implementations thus allow intermediate levels of abstractions; when completing a task one is provided with just the needed information, the right degree of abstraction; this helps to focus. The hierarchy also decrease, and thus simplifies, the size of the interfaces. OCaml was perfect for this job, but C, to a lesser extent, also allows this kind of modular programming.
Note that this design is not at all object-oriented. I think that when providing a new abstract type, packing it with all the methods allowing to access or update objects of this type is an excellent idea. However by using only object-oriented interfaces, the tendency is to to present all methods of an object in a flat manner, and implement all of them in the same class and file. This is problematic when there are many methods (which should be grouped by themes), or that some methods are lengthy algorithms (the algorithm should be set in a separate file). Modules allow such grouping and separation of methods, while creating a new usual object-oriented classes would not work (because we operate on an object defined in another class). Of course there are object-oriented work-arounds; but I think that modules and interfaces with abstract types solve the problem more elegantly (while still allowing "type + methods"-style interfaces). I hope that the code for the L compiler, written in OCaml, shows my point.