TDOP / Pratt parser in pictures

by: Matthieu Lemerre  tags: l, parser, and ocaml  published: 03 January 2014

The L parser has been cleaned up and is now commited on github. As always, it is written in a literate programming style, and you can have a look at the generated documents corresponding to its interface and implementation.

For the implementation I wrote a big diagram explaining how TDOP parsing works in practice. I also wrote a more interactive version of this diagram below for this blog post.

To understand this diagram, you should have a look at the tdop.mli file, that defines the interface of the TDOP parser. For people in a hurry, here is a summary of the main points:

  • TDOP defines a parse function that will, in our case, parse an expression.
  • TDOP works by associating parsing functions to tokens (and not to rules, as in the case in classic parsing frameworks such as bison/yacc, ANTR, or PEGs).
  • The parsing function called depend on the position of the token: tokens at the beginning of an expression are in "prefix" position, others are in "infix" position. This allows to handle things like unary minus without resorting to lexing hacks.
  • Tokens are given two priorities: the left binding power and the right binding power. If a string has a form aEb, where a and b are tokens and E is an expression, then the right binding power of a is compared to the left binding power of b. E is associated with the the token with the highest binding power.

    This scheme is more powerful than classical operator precedence, and in particular allows to deal with associativity.

  • The left binding power of a token is retrieved through a mapping table. The right binding power is given by the parsing function associated to a token as an argument to the parse function.
  • The tdop.mli file of the L compiler allows to deal with separation (i.e. spacing) between tokens, but I will not speak about them in this post.

The following diagram will explain both how the parser works, and how it is used to parse a simple mathematical expression.

This explanation is probably a bit short if you never encountered a TDOP parser before, so I really encourage you to read tdop.mli, see how it's used in the L parser, or have a look at other nice presentations of TDOP on the internet (or of their implementation in other languages than OCaml).

I found that TDOP parsers were easy to implement, easy to use, and in particular match well the thought process. I believe that human beings distinguish "-" by this notion of "prefix" versus "infix" position, for instance, and mentally parse strings using relative priority between infix tokens; they are not expanding parsing rules to see if one matches.

In the future, I would like to experiment with TDOP-based parser generation. The extension to EBNF notation that you can see in the documentation of the parser allows to see what it could look like.

A future blog article will present the syntax of the L language in more details.