The L parser has been cleaned up and is now commited on github. As always, it is written in a literate programming style, and you can have a look at the generated documents corresponding to its interface and implementation.
For the implementation I wrote a big diagram explaining how TDOP parsing works in practice. I also wrote a more interactive version of this diagram below for this blog post.
To understand this diagram, you should have a look at the tdop.mli file, that defines the interface of the TDOP parser. For people in a hurry, here is a summary of the main points:
parse
function that will, in our case, parse an
expression.
aEb
, where a
and b
are tokens and E
is an expression, then the right binding power of
a
is compared to the left binding power of b
. E
is associated
with the the token with the highest binding power.
This scheme is more powerful than classical operator precedence, and in particular allows to deal with associativity.
parse
function.
The following diagram will explain both how the parser works, and how it is used to parse a simple mathematical expression.
parse
function has a base right binding power of 0.
parse
is, by definition,
in prefix position. The associated parsing function just asks parse
for another
expression, with a high priority (3).
parse
is called with a right binding power of 3.
parse
finds "a" in prefix position.
parse
finds "+" in infix position. However, the left binding power of "+" is 1,
which is smaller than 3, the right binding power of that invocation of parse. So "+" is not
part of the expression, and parse
returns.
parse
finds "+" in infix position. As the right binding power of that invocation
of parse is 0, and the left binding power of "+" is 1, "+" is part of the expression.
parse
for another expression, with a priority of 1.
parse
is called with a right binding power of 1.
parse
finds "b" in prefix position. "b" is a terminal symbol,
with no further parsing action other than returning "b" as a parse tree.
parse
finds "*" in infix position. As the right binding power of that invocation
of parse is 1, and the left binding power of "*" is 2, "*" is part of the expression. The
use of left and right priority thus allowed to express that "*" binds stronger than "+".
parse
for another expression, with a priority of 2.
parse
is called with a right binding power of 2.
parse
finds "c" in prefix position. "c" is a terminal symbol,
with no further parsing action other than returning "c" as a parse tree.
parse
finds "+" in infix position. However, the left binding power of "+" is 1,
which is smaller than 2, the right binding power of that invocation of parse. So "+" is not
part of the expression, and parse
returns.
parse
finds "+" in infix position. However, the left binding power of "+" is 1,
which is not greater than 1, the right binding power of that invocation of parse. So "+" is not
part of the expression, and parse
returns. Tokens that have their left binding power
and right binding power equals are thus left-associative; to obtain right-associativity, the
right binding power should have been set to the "left binding power minus 1" (1-1 = 0 for the "+" token).
parse
finds "+" in infix position. As the right binding power of that invocation
of parse is 0, and the left binding power of "+" is 1, "+" is part of the expression
(but in a left-associative way).
parse
...
parse
finds the symbol "end-of-file" in infix position,
which has a very low left binding power of 0, and thus cannot be part of any expression.
parse
finds "end-of-file" again, and returns the expression completely parsed.
This explanation is probably a bit short if you never encountered a TDOP parser before, so I really encourage you to read tdop.mli, see how it's used in the L parser, or have a look at other nice presentations of TDOP on the internet (or of their implementation in other languages than OCaml).
I found that TDOP parsers were easy to implement, easy to use, and in particular match well the thought process. I believe that human beings distinguish "-" by this notion of "prefix" versus "infix" position, for instance, and mentally parse strings using relative priority between infix tokens; they are not expanding parsing rules to see if one matches.
In the future, I would like to experiment with TDOP-based parser generation. The extension to EBNF notation that you can see in the documentation of the parser allows to see what it could look like.
A future blog article will present the syntax of the L language in more details.