Blog posts

TDOP / Pratt parser in pictures

by: Matthieu Lemerre tags: l, parser, and ocaml published: 03 January 2014

The L parser has been cleaned up and is now commited on github. As always, it is written in a literate programming style, and you can have a look at the generated documents corresponding to its interface and implementation.

For the implementation I wrote a big diagram explaining how TDOP parsing works in practice. I also wrote a more interactive version of this diagram below for this blog post.

To understand this diagram, you should have a look at the tdop.mli file, that defines the interface of the TDOP parser. For people in a hurry, here is a summary of the main points:

TDOP defines a parse function that will, in our case, parse an expression.
TDOP works by associating parsing functions to tokens (and not to rules, as in the case in classic parsing frameworks such as bison/yacc, ANTR, or PEGs).
The parsing function called depend on the position of the token: tokens at the beginning of an expression are in "prefix" position, others are in "infix" position. This allows to handle things like unary minus without resorting to lexing hacks.
Tokens are given two priorities: the left binding power and the right binding power. If a string has a form aEb, where a and b are tokens and E is an expression, then the right binding power of a is compared to the left binding power of b. E is associated with the the token with the highest binding power.
This scheme is more powerful than classical operator precedence, and in particular allows to deal with associativity.
The left binding power of a token is retrieved through a mapping table. The right binding power is given by the parsing function associated to a token as an argument to the parse function.
The tdop.mli file of the L compiler allows to deal with separation (i.e. spacing) between tokens, but I will not speak about them in this post.

The following diagram will explain both how the parser works, and how it is used to parse a simple mathematical expression.

The first call to the parse function has a base right binding power of 0.

The first token encountered by parse is, by definition, in prefix position. The associated parsing function just asks parse for another expression, with a high priority (3).

So parse is called with a right binding power of 3.

parse finds "a" in prefix position.

"a" is a terminal symbol, with no further parsing action other than returning "a" as a parse tree.

parse finds "+" in infix position. However, the left binding power of "+" is 1, which is smaller than 3, the right binding power of that invocation of parse. So "+" is not part of the expression, and parse returns.

The parse function associated with "-" changes the "a" parsetree, and returns "-a".

parse finds "+" in infix position. As the right binding power of that invocation of parse is 0, and the left binding power of "+" is 1, "+" is part of the expression.

The parsing function associated with infix "+" is called with the result of the parse so far ("-a"). This function just asks parsefor another expression, with a priority of 1.

So parse is called with a right binding power of 1.

parse finds "b" in prefix position. "b" is a terminal symbol, with no further parsing action other than returning "b" as a parse tree.

parse finds "*" in infix position. As the right binding power of that invocation of parse is 1, and the left binding power of "*" is 2, "*" is part of the expression. The use of left and right priority thus allowed to express that "*" binds stronger than "+".

The parsing function associated with infix "*" is called with the result of the parse so far ("b"). This function just asks parsefor another expression, with a priority of 2.

So parse is called with a right binding power of 2.

parse finds "c" in prefix position. "c" is a terminal symbol, with no further parsing action other than returning "c" as a parse tree.

parse finds "+" in infix position. However, the left binding power of "+" is 1, which is smaller than 2, the right binding power of that invocation of parse. So "+" is not part of the expression, and parse returns.

The parsing function associated to "*" combines the parse tree, received as an argument ("b") with the parsetree just returned "c" and returns the parsetree "b*c". Thus the parse functions associated with tokens are responsible for implementing semantic actions.

parse finds "+" in infix position. However, the left binding power of "+" is 1, which is not greater than 1, the right binding power of that invocation of parse. So "+" is not part of the expression, and parse returns. Tokens that have their left binding power and right binding power equals are thus left-associative; to obtain right-associativity, the right binding power should have been set to the "left binding power minus 1" (1-1 = 0 for the "+" token).

The parsing function associated to "+" combines "-a" with "b*c" to return the parse tree "(-a)+(b*c)"

parse finds "+" in infix position. As the right binding power of that invocation of parse is 0, and the left binding power of "+" is 1, "+" is part of the expression (but in a left-associative way).

The end is similar to what was already seen: the parsing function associated to "+" calls parse...

which finds the terminal symbol "d"...

which is returned as a parse tree ...

Here, parse finds the symbol "end-of-file" in infix position, which has a very low left binding power of 0, and thus cannot be part of any expression.

The "d" parse tree is comined with "(-a) + (b*c)"...

parse finds "end-of-file" again, and returns the expression completely parsed.

This explanation is probably a bit short if you never encountered a TDOP parser before, so I really encourage you to read tdop.mli, see how it's used in the L parser, or have a look at other nice presentations of TDOP on the internet (or of their implementation in other languages than OCaml).

I found that TDOP parsers were easy to implement, easy to use, and in particular match well the thought process. I believe that human beings distinguish "-" by this notion of "prefix" versus "infix" position, for instance, and mentally parse strings using relative priority between infix tokens; they are not expanding parsing rules to see if one matches.

In the future, I would like to experiment with TDOP-based parser generation. The extension to EBNF notation that you can see in the documentation of the parser allows to see what it could look like.

A future blog article will present the syntax of the L language in more details.

Blog posts

TDOP / Pratt parser in pictures