Why OCaml GADTs matter for performance

kragen · on March 31, 2015

I'm still not quite sure what GADTs are (after skimming the WP article, reading this article, and skimming the code in it) but very surprisingly to me, it looks like OCaml's GADT implementation tackles a lot of the memory fragmentation problems that come with object-graph-heap memory-model languages like Lisp, Python, Java, previously OCaml, JavaScript, and so on, in exchange for a little more type declaration overhead. (Although, in my small experience, debugging OCaml type errors usually involves adding some redundant type declarations anyway.)

This is really interesting and I hope I understand it someday soon!

tel · on March 31, 2015

Generalized Algebraic Data Types, GADTs, are algebraic data types which include both (a) existential types and (b) type equalities. The combination allows for a certain kind of interaction between the specific kind of value within a type and the type itself.

For instance, we might have a little expression language using normal ADTs [0]

    type exp = 
      | Int  of int               -- integer constants
      | Bool of bool              -- bool constants
      | Add  of exp * exp         -- addition
      | Neg  of exp               -- negation
      | Ifte of exp * exp * exp   -- if then else
      | Lte  of exp * exp         -- less than or equal to

but there's no particular guarantee that a value of `exp` is well-typed. For instance, the following is a valid value of `exp`

    let x : exp = Add (Int 3, Bool true)

GADTs allow us to fix this situation by reflecting the type of the expression value into the type of the expression. Here's some syntax

    type _ exp =
      | Int  : int  -> int exp
      | Bool : bool -> bool exp
      | Add  : int  exp * int exp -> int exp
      | Neg  : int  exp * int exp
      | Ifte : bool exp * 'a  exp * 'a exp -> 'a exp
      | Lte  : int  exp * int exp -> bool exp

The way to read this syntax is to see it as giving the "type" of each constructor. Not only do we now parameterize the `exp` type with an index type corresponding to the type of the value this expression denotes, we also create very non-trivial type equations like

    Ifte : bool exp -> 'a exp -> 'a exp -> 'a exp

and

    Lte : int exp -> int exp -> bool exp

The upshot is that while the type of an interpreter for the first exp would have to look like

    type value = VInt of int | VBool of bool

    val interpret : exp -> value option

the type of the second `exp` is so constrained as to eliminate (a) the need for the separate `value` type and (b) the possibility of failure:

    val interpret : 'a exp -> 'a

    let interpret = function
      | Int i          -> i
      | Bool b         -> b
      | Add (i1, i2)   -> interpret i1 + interpret i2
      | Neg i          -> - (interpret i)
      | Ifte (c, t, e) -> if interpret c then interpret t else interpret e
      | Lte (i1, i2)   -> interpret i1 <= interpret i2

Such a construction is known to be type-error free because GADTs allow us to ensure that type constructors follow certain type equations and it's possible to write such a clean evaluator because GADTs keep around enough type information so that the compiler can "trust" them.

Finally, the short answer of how they work is something like they interpret a constructor like

    Ifte : bool exp * 'a exp * 'a exp -> 'a exp

as one with some existential types

    Ifte :    ...    | 'x exp * 'a exp * 'a exp 'a ex

and some type equalities

    Ifte : 'x == int |              ...

then when typechecking it just has to collect all the unique new type variables and solve the system of equalities which arises.

[0] Yeah, the author pooh poohs at examples like this, but I think they're really particularly clear.

gsg · on March 31, 2015

I've come to think that these toy interpreter examples are a bit questionable unless they contain a variadic structure like an arbitrary tuple - they give the impression that programming with GADTs is simpler than it really is.

joeyh · on March 31, 2015

Problem with toy AST examples is they're great if you're doing something like that, but don't give much motivation if you're not. (Unless you starting seeing all programming as AST manipulation, which can be an interesting viewpoint.)

Here's an example of another way to use GADTs. http://joeyh.name/blog/entry/making_propellor_safer_with_GAD...

More important that using GADTs for one thing or another is to start looking at the type checker as something you use to assist you in writing the kind of code you want to write, and avoiding the kind of code you want to avoid. GADTs are one amoung many useful tools in the modern toolbox for that.

kragen · on March 31, 2015

That’s a really fascinating post! Thank you! I’m still not sure I understand GADTs (in part, here, because I know very little Haskell indeed) but this helps.

Do GADTs allow Turing-complete computation at compile time, like C++ templates? Are there limits to what kind of correctness properties you can use them to get compile-time machine-checked proofs of?

vilhelm_s · on April 1, 2015

GADTs by themselves do not allow arbitrary computation[1]. But Haskell also allows you to write functions at the type-level ("type families" in the Haskell jargon), so you can directly write functions to compute types without any trickery, and the Propeller example uses this feature.

GADTs let you write down basically the same things that you could write as an inductive datatype in a dependent language like Coq and Agda. So (with some patience) you can make a function return a "proof" of anything that can be witnessed by a first-order inductive type, e.g. equations between expressions, list membership, proofs that a term in an object-language is welltyped, etc. The main limitation is that Haskell doesn't have any termination checking, so if you consider it as a dependently typed language there is no way to write a trustworthy proof that uses induction. Also its not very user-friendly, see [2] for some examples.

[1] Basically, GADTs are equivalent to having equations between types. If all the equations are written out in full, it is reasonably straightforward to check whether two types are provably equal. If you want to do type _inference_, i.e. figure out what equality assumptions a given function should make in order to become welltyped, then you run into various undecidable problems (e.g. "rigid E-unification"). My guess would be that if you wrote down a completely general typesystem featuring GADTs, then you would be able to encode arbitrary computations as type-inference problems, but the encoding would be very clumsy. The Haskell and Ocaml implementations deliberately don't try to be very general in the type inference, and instead try to pick a decidable subset, so the programmer has to add enough type annotations to guide it.

[2] https://personal.cis.strath.ac.uk/conor.mcbride/pub/hasochis...

tel · on March 31, 2015

How about this one for variadic structures? [0]

    data Rec (f :: k -> *) (ts :: [k]) where
      RNil :: Rec f '[]
      (:&) :: f t -> Rec f ts -> Rec f (t ': ts)

    infixr 9 :&

    data ElField f where
      Field :: KnownSymbol s => t -> ElField '(s, t)

    x :: Rec ElField ['("foo", Int), '("bar", String)]
    x = Field 3 :& Field "bar" :& RNil

It's an embedding of (variadic) typed records.

Anyway, I don't actually use GADT-driven embedded ASTs myself. They're just a pretty clear example.

[0] This is Jon Sterling's `vinyl` package in Haskell

gsg · on March 31, 2015

Yeah, that gives something of the flavour. (I was thinking of the record encoding at http://www.lexifi.com/blog/dynamic-types, myself.)

taeric · on March 31, 2015

Worse, to me, is that they don't seem to provide any evidence that the pattern can actually scale.

edwintorok · on March 31, 2015

I've been experimenting with a way to combine GADTs with polymorphic variants, for example to give invariants that all fields of a record use the same polymorphic variant tag (even if the type parameters are different).

For example to have the validation status as a parameter of a record, and have functions that only accept validated records (you could achieve something similar and probably simpler with functors):

  type validation = [`Valid | `Invalid]
  type valid = [`Valid]
  module V : sig
    type ('a, 'b) t
    val of_string : validate:(string -> 'a) -> string -> ('a, [> validation]) t
    val valid_or_exn : ('a, validation) t -> ('a, [> valid]) t
    val valid : ('a, valid) t -> 'a
  end = struct
  type (_,_) typ =
    | Invalid: exn -> ('a, [> `Invalid] as 'b) typ
    | Valid: 'a -> ('a, [> `Valid] as 'b) typ

  type ('a, 'b) t = { raw: string; value: ('a, 'b) typ }

  let of_string ~validate raw = { raw; value = try Valid (validate raw) with e -> Invalid e }

  let valid_or_exn s = match s.value with
    | Valid _ as v -> { s with value = v }
    | Invalid e -> raise e

  let valid ({ value = Valid v; raw = _ } : ('a, valid) t) = v
  end

  type 'b t = {
    x: (int, 'b) V.t;
    y: (string, 'b) V.t;
  }
  let (v:valid t) =
    let raw = {
      x = V.of_string ~validate:int_of_string "4";
      y = V.of_string ~validate:String.escaped "test\n"
    } in
    { x = V.valid_or_exn raw.x; y = V.valid_or_exn raw.y }

Looks like the compiler knows that the 'valid' function can't take 'Invalid' types given the proper type annotations on the polymorphic variant (without type annotations it gives a warning that the match is not exhaustive).

kragen · on March 31, 2015

Thank you for explaining! While I think I understand how your example works and how it's useful, I think I need to read more to understand what “existential types” are, since I’m pretty sure you’re not talking about dudes like Camus. And I’m not totally sure I understand them well enough to write a type checker for them. Which I think means I still don’t really understand GADTs.

tel · on April 1, 2015

Existential types are dual to universal types.

Universal types occur in polymorphic types. Like length

    val length : 'a list -> int

It works for all values 'a. Really, this is better denoted by an explicit binder

    val length : forall a . a list -> int

and this is what the preceeding quote implies---this variable is "bound" by a universal binder. What it means is that the user of `length` gets to make a choice about what the actual value of `a` is.

"Existential types" is short for "existentially bound type variables". For this we might write something like

    val x : exists a . a list

which indicates that we have a list containing some values of type `a`... for some, unknown type `a`! In other words, the following are both, simultaneously valid values at type `x`

    let x : exists a . a list = [1; 2; 3]

    let x : exists a . a list = ["foo"; "bar"; "baz]

The type system has "forgotten" what the type `a` is.

In GADT syntax we write these like

    type ex = Ex : 'a -> ex

Because the type `'a` does not get exposed in the final `ex` type it must be "forgotten" and therefore existentially bound. To make it more clear we can determine the type of the following function:

    let build a = Ex a

which is

    val build : 'a -> ex

and should really drive home the idea of just "forgetting" a type.

emillon · on March 31, 2015

At runtime, GADTs are actually represented the same way as plain ADTs (that is, a tagged union). The tighter types make it possible to remove some unused branches in pattern matching, so you may remove a few tests, but memory-wise it should be the same.

jordwalke · on March 31, 2015

In my experience, a memory win takes place because you can use something lighter weight (ADTs) to model your data as opposed to something that consumes additional memory (objects or a record of bound functions) simply to get the type system to accept your code.

Furthermore, when using ADTs to store your data (and constructing modules that operate on them) I believe the resolution to the functions that operate on them are entirely done at compile time, not run time - whether or not that is being capitalized on is a different story, but I believe it has benefits now and will have even more later as the compiler authors continue to add more optimizations.

emillon · on March 31, 2015

Yes, exactly. ADTs are entirely known statically so there's no vtable etc.

AceJohnny2 · on March 31, 2015

GADTs are http://en.wikipedia.org/wiki/Generalized_algebraic_data_type

marktangotango · on March 31, 2015

Upvoting, so obnoxious of the author of the post to not define them.

adultSwim · on March 31, 2015

I can't tell if you are serious or not. OP goes through them by example.

Original type:

  type 'a t = | Array of 'a array
              | Bytes of bytes

GADT'd (same functionality as before):

  type 'a t = | Array : 'a array -> 'a t
              | Bytes : bytes -> 'a t

Now using power of GADT to add constraint:

  type 'a t = | Array : 'a array -> 'a t
              | Bytes : bytes -> char t

Just jumping into it seems like a fine choice to make. Allows for a (narrow) introduction to GADTs without getting bogged down in their full complexity/limitations. Shows that it's useful before/without a theoretical discussion.

Still, I wouldn't have minded a little discussion introducing GADTs.

AceJohnny2 · on March 31, 2015

In the original version of the article (when I wrote my comment), the author hadn't defined them.

mercurial · on March 31, 2015

It would have been interesting to get actual performance numbers with the article.

throwaway000002 · on March 31, 2015

The specific performance aspect is, essentially, immaterial. The idea, in over-simple terms, is say you have Foo<T> but for say T=X there is a faster specialized way to implement Foo<X>, well you can do this in the type system of Ocaml via GADTs and still write generic type-safe code for Foo<T> and the compiler does the right thing when when it happens that T=X.(+)

So you get to have your cake and eat it too.

(+) without brittle, verbose, horrific re-factoring required to do similar things in simpler type systems.

kragen · on March 31, 2015

Yes, having a faster specialized way to implement Foo<X> is great. But the reason it's great is "the specific performance aspect". If the cost of using the faster specialized way is, say, massive interpretive overhead, then it may not actually be faster, and then there's no reason to use it.

GADTs actually seem to go in the other direction, though — they seem to enable the compiler to prove more at compile-time and thus generate code with less run-time tag checking.

But some hard performance numbers would go a long way toward proving that.

mercurial · on April 1, 2015

Especially since GADTs don't do too well with type inference. My take on "premature optimization is the root of all evil" is that I will take any "cheap" premature optimization which doesn't impact readability, but I'd like to be sure there is an actual performance improvement first.