This module helps implementing TDOP (top-down operator precedence) parsers. TDOP parsers (also called Pratt parsers) were invented by Vaughan Pratt in 1973. They are practical, efficient, simple to understand and simple to implement, and can be dynamically extended at runtime; their only drawback is that they are not very well-known.
The core idea of TDOP is to combine "recursive descent" parser with "operator precedence" in a single framework; and to work by associating parsing functions to tokens (and not to rules as done in BNF).
The module works by defining a "main" parsing function parse
,
that will call auxiliary parsing functions associated to the
tokens, depending on their position:
parse
is called is in
prefix position.parse
could have returned something, but
continues, then the first token encountered after parse
could
have returned is in infix position.For instance, suppose we are parsing an expression (i.e. parse
parses an "expression"). In the string -3 + 4 × −5 − 22
, the
left-most -
appears in prefix position. It is associated to a
parsing function requiring another expression, so 3
is also in
prefix position. After 3
parse
could return, but continues with
+
, so +
is in infix position. +
is associated to a parsing
function requiring another expression, so 4
appears in prefix
position, etc. At the end in this expression, all the numbers and
the first two -
are in prefix position, while +
, ×
, and the
last -
are in infix position.
Note that the name "infix" does not imply a symmetry: in the regexp
[a−z]+
, +
is also in infix position; in the statement
(3+4):Int
, :
is in infix position, even if its left side is an
expression and its right side a type. So "infix" also encompass
"postfix" and other situations.
The main parsing function is called parse
; it takes a stream, and
a right binding power (defined below), as arguments.
The define_prefix
and define_infix
functions allow to associate
to each keyword a parsing function. Parsing functions are given a
stream; the prefix or infix token is left in the stream when the
custom parsing function is called. This allows, for instance, the
same parsing function to be used for different tokens. The
define_infix_{left∣right}_associative
functions are special cases
of define_infix
, explained below. This dynamic interface allows
to dynamically extend the parser by registering new token/parsing
function association; an application is to allow a source file to
extend its syntax. It also breaks recursivity, which allow the
parsing functions to call parse
.
The dynamic interface cannot work for tokens that carry an information, such as ints and string: one cannot associate a parsing function to "all ints" using a hash table. Thus for non-keyword arguments, the parsing functions are given statically as arguments to a functor. The interface assumes that only keyword tokens can be used as infix.
The argument to the functor also define the type of the "semantic values", i.e. the objects returned by the parsing functions.
In TDOP, tokens are associated with two binding powers: the left
binding power, and the right binding power. If a substring has the
form AEB
, where A
and B
are tokens and E
an expression that
can be parsed by parse
, then:
B
is strictly higher than the
right binding power of A
, then E
is associated to A
; E
is associated to A
.
As a convention, we chose that in case of equality, E
is
associated to A
.
Thus, "left binding power" of B
really means "the binding power
of B
that applies to the expression E
which is to the left of
B
" (and respectively for the "right binding power").
Here are some example uses of these binding powers.
In mathematics, ×
takes precedence over +
: 2+3×4
is parsed as
2+(3×4)
; while 2×3+4
is parsed as (2×3)+4
. If we give to ×
left and right binding powers of 100, and to +
left and right
binding powers of 90, then in both cases when parsing 3
, the
token ×
will take precedence over +
. Unary prefix operators
like - should be given a big right binding power, to take
precedence over + and *.
The -
operator is left-associative: 1-2-3
must be read as
1-(2−3)
, not as (1−2)−3
. If we give to -
equal binding powers
of 90, then when parsing 2
the leftmost -
will take precedence
over the rightmost one; this is because of our convention that when
binding powers are equal, the expression is associated to the
leftmost token.
The →
operator for types is right-associative: Int → Float →
Bool
is actually Int → (Float → Bool)
. TDOP handles
right-associative operators with equal ease: we associate to →
a
left binding power of 80, and a right binding power of 79. Now when
parsing Float
the rightmost →
token takes precedence, which
makes →
right-associative.
In classical mathematics, function call is denoted with (
, i.e.
the use of the (
operator that follows a function name, as in
2×f(3+4)
. Here, (
must be given a high left-binding power (to
bind more tightly than ×
), and a low right-binding power (to bind
less tightly than +
); for instance 120
and 0
.
The left binding power of a token is provided together with the
custom parsing function, using one of the define_infix_*
functions. The right binding power is passed as an argument to the
parse
function: the idea is to associate to each token an infix
and/or prefix handler, which is a custom parsing function;
this parsing function will call parse
with the correct right
binding power.
Note that there are predefined functions to help defining left-associative and right-associative operators, so there is no need to write custom parsing functions for these common cases.
In L, left-associative and right-associative operators are defined
using define_infix_{left∣right}_associative
, and we defined 0 to
be the initial right binding power for the first call of the
parse
function; when calling parse
from a custom parsing
functions, the right binding power should generally be 0.
In addition to "normal" TDOP parsing, this parser also takes advantage of the separation levels of token provided by the lexer: the idea is that the same token used with different separations can be considered as if there were two different tokens, and can have different left and right binding powers, and parsing functions.
So one can view the use of separations as replacing the
Token.token → handler
and Token.token → left_binding_power
maps of the normal TDOP interface, by Token.token × separation →
handler
and Token.token × separation → left_binding_power
maps.
However, most of the time tokens with different separations will
still be bound to the same binding powers and parsing function.
Thus, to make the token/parsing function more efficient and
partical, using separations is in fact done using a "secondary
dispatch" – there is one dispatch for the token, and a second
dispatch for the separation of this token. This amounts to currying
the maps, i.e. we really have a Token.token → separation →
left_binding_power
map. Here are some concrete recipes:
define_infix
functions
takes a Token.with_info → left_binding_power_provider
instead of
just a left_binding_power
. The Token.with_info
contains the
separation information. Thus, if you want to completely ignore the
separation information and perform "normal" TDOP parsing, you just
have to pass (fun _ → binding_power)
as a
binding_power_provider
.infix_handler
and
prefix_handler
receive streams that still contain the token used
to do the dispatch. The handler can dispatch and perform different
actions according to the separation information in this token: make
different semantic actions, or give different right binding powers.
Combined with the dispatch offered by using a custom
left_binding_power_provider
, one can for instance consider x.y
and x . y
entirely differently, as in Haskell.left_binding_power_provider
or the parsing
functions for some separations.For instance, in L
we call the parse
function with a
right_binding_power
of 0. Then we just need to give a negative
binding_power
to infix tokens that follow a newline to make them
appear in different statements.
The parse
function provided by the TDOP parser is a regular
parsing function, and can be integrated as a part of any
hand-written parser; including another TDOP parser. Several
instances of the module can indeed be used simultaneously, for
instance, one for the grammar describing the expressions, and one
for the grammar describing the types.
The interface of this module, with dynamic registration of parsing
functions, guarantees that there will be no problem of self- or
mutual- recursion using a TDOP parser: you can always begin by
defining the TDOP modules, and then use the parse
functions.
Top-down parsing using recursive descent parsers are often slow because the parser recurses on the rules, and expressing operator precedence using only rules make the parser call a lot of function before finding the correct rule that allows to parse a token. Moreover, talking about precedence levels makes the grammar easier to explain to understand. Because of these reasons, recursive descent parsers often come in combination with a more efficient "operator precedence" parser.
In TDOP, the actions are associated to the tokens, not to the rule, which makes the time complexity of the parsing linear in size of the stream of tokens; more over it makes operator precedence and recursive descent parsing are nicely integrated in a single simple framework.
The seminal paper that invented Pratt parsers was seen in Pratt, Vaughan. "Top down operator precedence." Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (1973). The subject was studied more in depth (in particular, how a TDOP-based formal language could be used to describe a language and implement a parser, as an alternative to yacc and BNF) by Van De Vanter, Michael L. "A Formalization and Correctness Proof of the CGOL Language System." (Master’s Thesis). MIT Laboratory for Computer Science Technical Report MIT-LCS-TR-147 (Cambridge, MA). 1975.
module type BASE_PARSER =
sig
(∗ The type of the semantic values, returned by the semantic actions. ∗)
type t
val int_handler : int → Token.Stream.t → t
val ident_handler : Token.ident → Token.Stream.t → t
end
type prefix_handler = Token.Stream.t → Base_parser.t
type infix_handler = Token.Stream.t → Base_parser.t → Base_parser.t
Said differently, if the parse
function is called with a right
binding power of x
, then parsing will stop before tokens whose
left binding power is ≤ x
.
type left_binding_power = int
type right_binding_power = int
left_binding_power_provider
changes the binding power according
to separation level. The left_binding_power_provider
can also
return a parse error, if a token is used with incorrect
separation.
type left_binding_power_provider =
Token.With_info.t → left_binding_power
left_binding_power
is ≤ x
. Uses the prefix and infix handlers that have been
defined.
val parse : Token.Stream.t → right_binding_power → Base_parser.t
val define_prefix : Token.token → prefix_handler → unit
val define_infix : Token.token → left_binding_power_provider → infix_handler → unit
The first argument is the parsed token to which the function is
attached (in infix or prefix position). The second is the binding
power of the operator. The last is a function that, given the
object resulting from parsing the left part, and the one obtained
from parsing the right part, gives the object obtained by
combining them.
val define_infix_left_associative :
Token.token → left_binding_power_provider →
(Token.With_info.t → left:Base_parser.t → right:Base_parser.t → Base_parser.t) →
unit
val define_infix_right_associative :
Token.token → left_binding_power_provider →
(Token.With_info.t → left:Base_parser.t → right:Base_parser.t → Base_parser.t) →
unit