I have a working closure conversion done using the purely functional CPS data structure presented in my earlier post, but it is somewhat hackish. Thus I am trying to improve it, following Andrew Kennedy's excellent paper "Compiling with continuations, continued".
Kennedy's CPS structure requires a union-find data structure, used to merge the occurences of a variable, and find the binding site of an occurence. I already had a union-find data structure, used to implement first-order unification in type inference, but as is usual a module becomes good on the second time you write it (when you have more experience about its implementation and usage).
There are many possible variations in the interface of a union-find module. The particularities of this one is explicit support for attaching description to sets, and a "partition" type separate from the "element" type. Also, the interface is functorial, and I provided two versions: a Safe one that checks that usage of the structure is correct, and a Fast one with no check. It is easy to shoot yourself in the foot by using this module incorrectly, so using the Safe one is probably a better bet.
So here they are.
1. A union-find data structure maintains a partition of elements into disjoint sets.
It allows to add new elements in new partitions, perform the union of two partitions, and retrieve the partition in which is an element. Moreover it allows to attach a description to a partition, which is generally the point of using such a structure.
This module has side effects: adding an element to a union-find data structure changes that element, and the union operation merges the partitions destructively. This make it easy to use this module incorrectly. To that end, a number of protections (using types and dynamic checks) are set that detect such incorrect uses of the module.
Note on the name: there are other data structures that maintain
disjoint sets with other operations, such as partition refinement,
so "union-find" is a more accurate name for this data structure
than "disjoint set".
module type S =
sig
All the functions (except create
) take a t
argument; in their
safe version this argument is used to checks that other
element
and partition
arguments indeed belong to the t
argument.
type t
type partition
type element
type description
create()
returns a new empty union find data structure.
val create : unit → t
singleton t e d
adds a new element e
to t
, and create and
returns a new partition p
in t
, such that e
is the only
element of p
. It also attach the description d
to p
.The safe version checks that e
was not previously added to
another union-find data structure (with the same link).
val singleton : t → element → description → partition
find t e
returns the partition p
of t
that contains e
.
val find : t → element → partition
union t p1 p2 d
creates a new partition p3
, with description
d
, that contains the union of all the elements in p1
and
p2
. The p1
and p2
arguments are consumed, i.e. must not be
used after they were passed to union
. p1
and p2
must be
different partitions.
val union : t → partition → partition → description → partition
description t p
returns the description associated to p
.
val description : t → partition → description
description t p
changes the description associated to p
.
val set_description : t → partition → description → unit
end
link
type, and each element of a union-find structure
must be "associated" to one different link (generally the link is a
mutable field in the element type). Initially, the link value is
empty_link
.The Make
functor, once told how access the link of an element,
returns a module complying to S
. Below we given an exemple of
usage.
Note: It is possible for an element to be present in two different union-find data structures; it must just have different links.
If the link in an element must be re-used for another union-find
data structure, then it must be set to empty_link
, and one must
stop using the union-find data structure that contained the element
(even with other elements).
module type UNION_FIND = sig
type (α, β) link
val empty_link:(α,β) link
module type LINK =
sig
type element
type description
val get : element → (description, element) link
val set : element → (description, element) link → unit
end
module Make(Link : LINK):S with type description = Link.description and type element = Link.element
end
type test = { x:int; mutable z:(string, test) Union_find.Safe.link };;
module Test = struct
type description = string
type element = test
let get_link t = t.z
let set_link t z = t.z ← z
end
module A = Union_find.Safe.Make(Test)
let uf = A.create() in
let elt1 = {x=1; z=Safe.empty_link} in
let part1 = A.singleton t elt1 "1" in
assert(A.description t (A.find t elt1) = "1")
1. We represent each disjoint set by a tree : elements are in the same set than the element that they point to.
The root of the tree is the representative of the set, and
corresponds to elements of type partition
. It points to a
"partition descriptor".
type (α,β) baselink =
∣ Partition_descriptor of α partition_descriptor
∣ Parent of β
Note that the partition descriptor is not accessible by the users
of the module, and the interface make it so that there can be only
one link to the partition descriptor (from the representative).
This allows to update the partition descriptor destructively.
and α partition_descriptor = { mutable rank:rank; mutable desc:α }
and rank = int
The safe module identifies all union-find data structures by a
unique id, embed that in the links, and checks for all operation
that they are equal. It also checks initialization of the link.
module type SAFETY = sig
type t
val create: unit → t
type (α,β) link
(∗ Create a safe link from a baselink. ∗)
val securize: t → (α,β) baselink → (α,β) link
(∗ Returns the base_link from the safe link. ∗)
val get_base: (α,β) link → (α,β) baselink
(∗ Check the safe link withat the element (and the safe link) belong to t. ∗)
val check_membership: t → (α,β) link → unit
(∗ Check that the element is not yet part of any union find. ∗)
val check_unused: (α,β) link → unit
(∗ Initial link. ∗)
val empty_link: (α,β) link
end
module No_safety:SAFETY = struct
type t = unit
let create() = ()
type (α,β) link = (α,β) baselink
let check_membership () l = ()
(∗ Note: This cast can make the execution fail without notice. ∗)
let empty_link = Obj.magic 0
end
module Unique = Unique.Make(struct end)
module Safety:SAFETY = struct
type t = Unique.t
let create() = Unique.fresh()
type (α,β) link = t option × (α,β) baselink
let check_membership t (u,_) =
(match u with
∣ Some(a) → assert (t ≡ a) (∗ The element is in another union-find structure. ∗)
∣ None → assert false); () (∗ The element is in no union-find structure. ∗)
let check_unused (u,_) =
(match u with
∣ Some(_) → assert false (∗ The element is already in a union-find structure. ∗)
∣ None → ())
(∗ Note: The cast is not dangerous, because the left-hand part is
checked first. ∗)
let empty_link = (None, Obj.magic 0)
end
partition
and element
are
actually the same underlying type; the difference is that elements
returned with type partition
are the root of the tree). Hiding
this in the interface provides some guarantee that arguments of type
partition are the representative of their partition.Unfortunately, after calling union
on two partitions p1
and
p2
, one of them will stop being the root; that is why the
partition arguments of union
must not be re-used. Thus, defining
the partition
type only guarantees that the argument has been a
root in the past, and we ensure that by a dynamic test.
module type S = sig
type t
type partition
type element
type description
val create: unit → t
val singleton : t → element → description → partition
val find : t → element → partition
val union: t → partition → partition → description → partition
val description: t → partition → description
val set_description : t → partition → description → unit
end
Saf
allows to
differenciate the "Fast" and "Safe" modules, while Link
is used
to find and change the link.
module Make(Saf:SAFETY):UNION_FIND =
struct
type (α,β) link = (α,β) Saf.link
let empty_link = Saf.empty_link
module type LINK = sig
type element
type description
val get: element → (description, element) link
val set: element → (description, element) link → unit
end
module Make(Link: LINK) =
struct
type t = Saf.t
type element = Link.element
type description = Link.description
type partition = Link.element
singleton
is the only way to add new elements to the
union-find structure, and is the place where we check that the
element is not part of another structure.
let singleton t elt desc =
let l = (Link.get elt) in
Saf.check_unused l;
Link.set elt (Saf.securize t (Partition_descriptor {rank=0;desc=desc}));
elt
find
just walks the tree until it finds the root.But performance is increased if the length of the path is diminished: traversed nodes are linked to nodes that are closer to the roof. The possibility we have implemented is path compression: when the root is found, the elements are changed to link to the it, so that subsequent calls are faster. We implemented a tail-recursive version of this algorithm (which still requires two pass).
Note: there are alternatives to path compression, such that halving; but in Tarjan’s structure the root is linked to itself, which is not the case here, so halving would require more checks than in Tarjan’s version. Thus we stick with path compression.
Note: we could perform a lighter check in the safe version by
checking only the argument, and not all recursive calls; this is
probably not worth implementing it, and the heavy check has its
uses.
let find t x =
(∗ Tail-recursive function to find the root of the algorithm. ∗)
let rec find x =
let l = (Link.get x) in
Saf.check_membership t l;
match Saf.get_base l with
∣ Partition_descriptor(s) → x
∣ Parent(y) → find y in
(∗ This is also tail-recursive, but we do not perform the checks
the second time. ∗)
let rec compress x r =
let l = (Link.get x) in
match Saf.get_base l with
∣ Partition_descriptor(s) → ()
∣ Parent(y) → Link.set x (Saf.securize t (Parent r)) in
let root = find x in
compress x root;
root
let get_partition_descriptor t p =
let l = (Link.get p) in
Saf.check_membership t l;
match Saf.get_base l with
∣ Partition_descriptor(s) → s
∣ _ → assert(false) (∗ The element is not a partition. ∗)
let description t x = (get_partition_descriptor t x).desc
let set_description t x desc =
let pd = get_partition_descriptor t x in
pd.desc ← desc
The last argument allows to update the set descriptor along with this operation.
Note that this function takes partitions as argument; one could
have instead taken any element, and performed the find inside
the function; in particular some efficient algorithms
interleave the find and union operations. The reason why we
take partition arguments is that it avoids a find when we know
that the argument is a partition (for instance when merging
with a just-created singleton), and the user needs to perform a
find to retrieve and merge the description in the algorithms we
use (such as unification).
let union t p1 p2 newdesc =
let d1 = get_partition_descriptor t p1 in
let d2 = get_partition_descriptor t p2 in
assert (p1 ≠ p2);
if( d1.rank < d2.rank) then
begin
(∗ Keep d2_repr as root. Height of the merge is max(d1_height +1, d2_height) so does not change. ∗)
Link.set p1 (Saf.securize t (Parent p2));
d2.desc ← newdesc;
p2
end
else if (d1.rank > d2.rank) then
begin
(∗ Keep d1_repr as root. Height of the merge is max(d2_height +1, d1_height) so does not change. ∗)
Link.set p2 (Saf.securize t (Parent p1));
d1.desc ← newdesc;
p1
end
else
begin
(∗ We choose arbitrarily p1 to be the root.
The height may have changed, as all elements in the subset
with root p2 are 1 step further to the root. ∗)
Link.set p2 (Saf.securize t (Parent p1));
d1.rank ← d1.rank + 1; d1.desc ← newdesc;
p1
end
end
Recent performance comparison of these algorithms (and modern
enhancements) can be found in "Experiments on Union-Find Algorithms
for the Disjoint-Set Data Structure", by Md. Mostofa Ali Patwary,
Jean Blair, Fredrik Manne.