so, um, unions? i have to say that in many years of programming, i have almost n...

mgsloan2 · on March 14, 2023

In my experience with languages that lack concise sum types and pattern matching, you end up with data types that have lots of implicit invariants. If you find yourself writing docs or thinking in your head things like "Field X is only set if field Y is true" or "If field X is non-null then field Y must be null and vice-versa", then these are indications that sum types would model the data better. Then these invariants can be followed by construction, and it becomes clear what's available at data access - great for interface clarity, robustness, dev tools, morale, etc.

Relatedly, storing booleans is a smell, imho typically an enum or sum type is always better in languages that have concise syntax for these. True and False are meaningless without context, and so can easily lead to errors where they are provided to a different context or generally misused due to a misunderstanding about the meaning.

zabzonk · on March 14, 2023

i agree with your first point, but if your type-system is sufficently strong to express this, i'd like to see an example.

i agree completely with your second point - passing around or storing booleans is usually horrible.

mgsloan2 · on March 14, 2023

> if your type-system is sufficently strong to express this

No fanciness needed, just plain old sum types. It is certainly possible to express those invariants directly in languages with a dependent type systems or refinement types like in liquid haskell - see https://ucsd-progsys.github.io/liquidhaskell-tutorial/Tutori.... It's typically much easier to reason about and use sum types, though.

Of course these examples are trivial and silly, but I see instances of these patterns all the time in big co software, and of course usually the invariants are far more complex but many could be expressed via sum types. I've seen loads of bugs from constructing data that invalidates assumptions made elsewhere that could have been prevented by sum types, as well as lots of confusion among engineers about which states some data can have.

> Field X is only set if field Y is true

Original gnarly C style pattern:

    struct TurboEncabulatorConfig {
        // When true, the turbo-encabulator must reticulate splines, and 'splines'
        // must be non-null. When false, 'splines' must be null.
        bool reticulate_splines;
        struct Splines *splines;
    };

Rust (let's ignore pointer vs value distinction):

    enum TurboEncabulatorConfig {
        NonReticulatingConfig,
        ReticulatingConfig { splines: Splines },
    }

> If field X is non-null then field Y must be null and vice-versa.

Original gnarly C style pattern:

    struct TurboEncabulatorConfig {
        // When non-null, lunar_waneshaft must be null. 
        struct Fan *pentametric_fan;
        // When non-null, pentametric_fan must be null.
        struct Shaft *lunar_waneshaft;
    };

Rust:

    enum TurboEncabulatorConfig {
        PentametricTurboEncabulator { pentametric_fan: Fan },
        LunarTurboEncabulator { lunar_waneshaft: Shaft },
    }

fulafel · on March 14, 2023

Type systems are of course not the only possible mechanism that can enforce these kinds of invariants. In dynamic languages, schema style solutions (eg malli or spec in Clojure) have advantages: they have more expressivity than typical type systems, you can manipulate them as data, you can use in contexts that are not statically verifiable, like at the data interfaces of yuor app.

mejutoco · on March 14, 2023

I agree it has more flexibility, but more expressiveness is a double edged sword. Type systems are often not Turing-complete, or at least they are limited in some ways. A language being very expressive means it cannot be run at compile time, sine we cannot know if it terminates.

If we cannot run it at runtime now we need tests to hit that path or to test it manually to actually know if it works.

burntsushi · on March 13, 2023

Nope, not unions. Sum types. Sum types would be more analogous to tagged unions or discriminated unions.

> i have almost never needed to use such types

Well sure. When is an abstraction "needed"? Before Fortran, nobody "needed" a programming language either. So is a programming language necessary?

That's the funny thing about the word "need." It has very narrow application, and it's precisely why I didn't mention the word a single time in my top level comment.

noobermin · on March 14, 2023

Can you explain the utility of them over unions then? Asking sincerely.

burntsushi · on March 14, 2023

Unfortunately it's actually kind of difficult to parse your question because of the ambiguity in the surrounding comments. You could mean "tagged union" instead of "union," or maybe not and "them" means "tagged union"...

If you're asking to compare tagged unions and unions, then...

A union is not a tagged union. A union is part of a tagged union. A union on its own is just a region of memory. What's in that memory? I dunno. Who does? Maybe something about your program knows what it is. Maybe not. But what if you need to know what it is and some other aspect of your program doesn't tell you what it is? Well, you instead put a little bit of memory next to your union. Perhaps it an integer. 0 means the memory is a 32-bit signed integer. 1 means it's a NUL terminated string. 2 means it's a 'struct dirent'. The point is, that integer is a tag. The combination of a union and a tag is a tagged union.

An abstract syntax tree is a classic example of a tagged union.

If instead you're asking to compare tagged unions and sum types, then...

Tagged unions are one particularly popular implementation choice of a sum type. Usually sum types have additional stuff layered on top of them that a simple tagged union does not have. For example, pattern matching and exhaustiveness checking.

noobermin · on March 14, 2023

Sorry, yes I meant tagged union, a C style union (as in the language's concept) is so useless by itself that I thought being tagged was an obvious given.

In that case, the pattern matching and exhaustiveness means that the value the sumtypes in rust offer is in combination with those other language features, not the mere usage of the sumtypes themselves then, I think that makes more sense.

burntsushi · on March 14, 2023

Yes. When you say "sum type" and you're talking about, say, Ocaml or Standard ML or (mostly) Haskell or Rust, then what comes with that is also the infrastructure around sum types in those languages. They aren't really separable. That includes pattern matching and exhaustiveness checking (the latter is an opt-in warning in Haskell, and can be opted out of in Rust on a per-type basis).

The discussion around "sum type" the concept is to disentangle it from the implementation strategy. Tagged unions are the implementation strategy, and they don't come with pattern matching or exhaustiveness checks.

Of course, many of these terms are used interchangeably in common vernacular. But if you look at the comment that kicked off this annoying sub-thread:

> so, um, unions? i have to say that in many years of programming, i have almost never needed to use such types.

Then this is clearly trying to draw an equivalence between two different terms, but where no such equivalence exists. There's no acknowledgment here of the difference between what a "union" is (nevermind a tagged union) and the "sum types" being talked about in my original top level comment. And the last part, "i have almost never needed to use such types" is puzzling on multiple levels.

zabzonk · on March 14, 2023

[flagged]

burntsushi · on March 14, 2023

Based on your other replies in this thread, as far as I can tell, you're not here to try and understand something. You're here to fuck around and play games with people over definitions. So, I'm not responding to you. I'm responding to the people following along who might get confused by the mess you're making here. More to the point, the Wikipedia article for "sum type" redirects to "tagged union," which just honestly makes this more of a mess.

A tagged union is a specific kind of data structure. It essentially refers to a pair of information: the first is a tag and the second is the union. The union might represent many different kinds of values. The important bit is that the size of that memory is the size of the largest value. The tag then tells you what kind of value is in that memory, and crucially, tells you how to interpret that memory. Getting this wrong, for example, in a language like C can lead to undefined behavior.

Tagged unions show up in all sorts of places. The classical example is an AST. You usually have one "node" type that can be one of many possible things: a literal, a binary operation, a function call, a type definition, whatever. And for any non-primitive node, it usually has pointers to children. And what is the type of a child? Well, a "node"! It's a recursive data type. It lets you represent children as just more nodes, and those nodes might be any other AST node. (And as you might imagine, an AST can be quite a bit more complicated than this. You often want restrictions on which types of children are allowed in certain contexts, and so you wind up with many different tagged unions.)

The use of a tagged union like this can be conceptually thought of as a sum type. A sum type is not a data structure. A sum type is a concept. It comes from type theory and it is the "dual" of product types. Product types represent the conjunction of values while sum types represent the disjunction of values. In type theory notation, you usually see 'A x B x C' to refer to the product type of A, B and C. Similarly, you usually see 'A + B + C' to refer to the sum type of A, B or C.

A sum type can be represented by just about anything, because it's just a concept. You can take something that is typically used for product types and use it to present an API that gives you a sum type:

    type OptionInt struct {
        exists bool // private
        number int // private
    }
    
    func OptionIntNone() OptionInt {
        return OptionInt{false, 0}
    }
    
    func OptionIntSome(number int) OptionInt {
        return OptionInt{true, number}
    }
    
    func (o OptionInt) IsSome() bool {
        return o.exists
    }
    
    func (o OptionInt) Number() int {
        if !o.exists {
            panic("option is none, no number available")
        }
        return o.number
    }

This is a very contrived and hamstrung example, because it really doesn't give you much. But what it does give you is the API of a sum type. It is a value that can be either missing or present with an integer. Notably, it does not rely on any particular integer sentinel to indicate whether the value is missing or not. It encodes that state separately. So it can fully represent the presence of any integer value.

As you can see, despite it offering an API that represents the sum type concept, it is actually implemented in terms of a product type, or a struct in this case.

In a language like Go, which does not really support either unions or sum types, this sort of API is really hamfisted. In particular, one could object that it is a sum type because it doesn't really represent the "sum typeness" in the type system. Namely, if you get its use wrong, then you don't get a compilation error. That is, you can call the 'Number()' method without checking whether 'IsSome()' returns true or not. In other words, correct use of OptionInt looks like this:

    var opt OptionInt := doSomethingThatMightReturnNone()
    if opt.IsSome() {
        // OK, safe to access Number. It won't panic
        // because we just checked that IsSome() is true.
        fmt.Println(opt.Number())
    }

But nothing in Go stops you from writing this instead:

    var opt OptionInt := doSomethingThatMightReturnNone()
    fmt.Println(opt.Number())

This is why "support for sum types" is critical for making them as useful as they can be. In Rust, that same option type is defined like this:

    enum OptionInt {
        None,
        Some(i32),
    }

That's it. The language gives you the rest. And crucially, instead of splitting out the 'IsSome()' check and the access via 'Number()', they are coupled together to make it impossible to get wrong. So the above code for getting a number out looks like this instead:

    let opt: OptionInt = do_something_that_might_return_none()
    if let OptionInt::Some(number) = opt {
        println!("{}", number);
    }

You can't get at the 'number' without doing the pattern matching. The language won't let you. Now, you can define methods on your sum types that behave like Go's 'Number()' method above and panic when the value isn't what you expect. And indeed, Rust's Option and Result types provide this routine under the names 'unwrap()' and 'expect()'. But the point here isn't just about Option and Result, it's about sum types in general. And you don't have to define those methods that might panic. You can just let the language help guide you along instead.

Sum types are an idea. A concept. A technique. Tagged unions are an implementation strategy.

Notice also that I haven't mentioned exhaustiveness at all here. It isn't required for sum types. Haskell, for example, doesn't mind at all if your pattern matching isn't exhaustive. You actually have to tell it to warn you about it, and even then, it's not a compilation error unless you treat it as one. In Rust, exhaustiveness checking is enabled by default for all sum types. But you can disable it for a particular sum type with the '#[non_exhaustive]' attribute. The benefit of disabling it is that adding a new variant is possibly semver compatible change, where as adding a new variant in an exhaustive sum type is definitely a semver incompatible change.

kd5bjo · on March 14, 2023

> But you can disable it for a particular sum type with the '#[non_exhaustive]' attribute.

I’m not sure this is the best description of what `#[non_exhaustive]` does in Rust. It doesn’t so much disable exhaustiveness checking as it marks the given list of variants as incomplete.

In Rust, some pattern matching contexts are required to be exhaustive (e.g. ‘let pattern = …;’, ‘match … {}’), and others are not (e.g. ‘if let … = … {}’). When an enum is tagged as ‘#[non_exhaustive]’, the former are required to include a wildcard matcher that will accept unknown variants, even if all known variants are explicitly matched.

Rust also allows structs to be `#[non_exhaustive]` to mark that the listed fields might not be sufficient. This disables the ability to (outside of a privacy boundary) directly construct an instance of the structure, or to destructure it with an exhaustive pattern.

tialaramex · on March 14, 2023

The privacy boundary applies in both cases by the way. Despite labelling your AmericanCity enum as #[non_exhaustive] since it only has Houston and Chicago in it so far, you are allowed in your own code inside the boundary to write a match which handles only Houston and Chicago. And that's actually probably a good idea, because when you add SanFrancisco to the enum, the compiler will point out anywhere your match clauses don't handle it, whereas if you had a default case (as your 3rd parties will) you'd drop into the default.

burntsushi · on March 14, 2023

My goal wasn't to describe it thoroughly, it was to call it out as a means of disabling exhaustiveness checking. Regardless of whether that's really what the essence of '#[non_exhaustive]' actually is, it does disable exhaustiveness checking. It's right in the name. It's an important part of what it does.

Basically, we're talking past each other. I called it out for what it does, but you're pointing out why someone would want to use it. That is, "#[non_exhaustive] on enums indicates that the list of variants is incomplete, and it does this by disabling exhaustiveness checking outside of the crate in which it is defined." Although, of course, not even that is precisely correct. Because your match arms are still forced to be technically exhaustive via the compiler requiring a wildcard arm. But the end effect to the user of the enum is that exhaustiveness checking on the variants of the enum is disabled.

kaba0 · on March 14, 2023

Thanks for the writeup!

Slightly unrelated, but re

> More to the point, the Wikipedia article for "sum type" redirects to "tagged union," which just honestly makes this more of a mess

I often see Wikipedia articles in CS topics severely lacking, or claiming exact definitions, when no such thing exists. Many definitions we regularly use are not exact at all, different papers/books use them differently, e.g. how would you even define OOP, low-level languages, etc? I would prefer them to use less strong language, as too many people argue on the internet based on Wikipedia definitions.

burntsushi · on March 14, 2023

I agree that they are lacking. In theory, I have the disposition to improve articles like that, but in practice, whenever I've tried, I get shut down by editors. It's been a long time since I've tried though.

It happens on other wikis too, like the Arch wiki.

I realize it's a tired complaint that a lot of people have and editors don't have an easy job, but I am so completely done with contributing to wikis. I know a lot of other people with the same complaint and same conclusion, including experts in the domain of type theory. One wonders whether this is one of the reasons why the articles are not as good as they could be.

youainti · on March 14, 2023

Thank you for such a great explanation. I found it to be really useful.

signa11 · on March 14, 2023

this is just ... beautiful ! thank you !

cornstalks · on March 14, 2023

The union part is just an (obvious) optimization. An implementation detail. Sum types are not required to use a union; they can be implemented without using a union.

zabzonk · on March 14, 2023

how, exactly?

the-smug-one · on March 14, 2023

struct List {}; struct Cons : public List { Cons(int v, List* nxt); int v; List* nxt; }; struct Nil : public List {};

This is how you'd do it in Java.

zabzonk · on March 14, 2023

and you would use it ... how?

the-smug-one · on March 14, 2023

I've showed you an alternative encoding, maybe get to work and learn something yourself?

dcow · on March 14, 2023

It's not exactly Java, but:

    Cons(1, Cons(2, Cons(3, Nil)))

kaba0 · on March 14, 2023

Actually, Java does have sum types nowadays, and you can write a recursive list implementation like “new Cons(2, new Cons(3, Nil.nil()))”.

Here is a definition for a Maybe type, converting it to a List is left as an exercise to the reader :) : https://news.ycombinator.com/item?id=35133670

dcow · on March 14, 2023

I just meant because I was eliding the `new`s.

cornstalks · on March 14, 2023

By using a struct with lazily initialized members, for example.

the-smug-one · on March 14, 2023

Why does the members be lazily init:ed?

cornstalks · on March 14, 2023

If the data types have constructors and/or destructors that have side effects then it has to be lazy. Otherwise IMO it's not really behaving like a sum type.

zabzonk · on March 14, 2023

if that is your example, provide some code.

cornstalks · on March 14, 2023

Pretend we have the following sum type:

    enum SumType {
        Foo(foo),
        Bar(bar),
        Baz(baz),
    }

    void DoSomething(SumType sum_type) {
        match sum_type {
            Foo(foo): foo.do_x();
            Bar(bar): bar.do_y();
            Baz(baz): baz.do_z();
        }
    }

This could be lowered to:

    struct LoweredSumType {
        optional<Foo> foo;
        optional<Bar> bar;
        optional<Baz> baz;
    }

    void DoSomething(LoweredSumType sum_type) {
        if (sum_type.foo.has_value()) {
            sum_type.foo->do_x();
        } else if (sum_type.bar.has_value()) {
            sum_type.bar->do_y();
        } else {
            sum_type.baz->do_z();
        }
    }

This isn't real code; it's a pretend language that's kind of a mix of Rust and C++. But it illustrates that the sum type can be lowered into a non-union representation.

zabzonk · on March 14, 2023

> Pretend we have the following sum type:

which looks remarkably like a union to me, if i understand your made-up language.

cdcarter · on March 14, 2023

It's not a union, though, it stores references to three memory locations, not just one (as a union represents a single memory location which can have multiple possible interpretations). And the lowered code has nothing protecting it from an invariant where more than one reference is non-null!

dralley · on March 13, 2023

Tagged unions with exhaustiveness checking, built into the language in a fairly deep way. Very different from what would come to mind for a C or C++ developer who hears the word "union".

nicoburns · on March 14, 2023

Does your code ever contain class hierarchies with a fixed set of classes? Or variables where certain values have special case meaning? Those are the cases where Sum Types make things immeasurably nicer than the alternatives.

zabzonk · on March 14, 2023

> Does your code ever contain class hierarchies with a fixed set of classes?

no - one of the advantages of OO programing is that class hierarchies (should you feel the need to use them, which mostly i do not) can be expanded. or indeed contracted.

> Or variables where certain values have special case meaning?

very, very rarely (i would say never, in my own code) but of course we have the null pointer as a counter-example. to which i can say: don't use raw pointers.

kaba0 · on March 14, 2023

The Visitor pattern, which is a must in certain niches is exactly analogous to pattern matching over a sum type, and I am fairly certain that I can objectively say that the latter is a million times more readable and maintainable.

It doesn’t make OOP obsolete, though, at all. Certain other areas are better expressed as hierarchies, e.g. GUI nodes.

jjnoakes · on March 14, 2023

Do you have any large open source code bases that you could share?

edgyquant · on March 14, 2023

I can’t imagine not having the ability for a variable to be one of multiple potential types. I could probably work around it, sure, but why would you want to?

zabzonk · on March 14, 2023

i can't imagine why you would want that. for example, why would i want something defined as an integer hold something other than an integer? this is how strongly-typed languages work.

Twisol · on March 14, 2023

Scott Wlaschin's "F# For Fun and Profit" website is a great resource for strongly-typed (yes) data modeling with sum types as a fundamental tool.

https://fsharpforfunandprofit.com/series/designing-with-type...

By the third entry in the linked series, we've motivated the use of a sum type with three variants in the example domain model.

lmm · on March 14, 2023

You definitely want an integer to always be an integer, but you often want to represent "A or B". E.g. user has a contact method that's a phone number or an email address. Result of validation is a validated value or a failure message.

cmrdporcupine · on March 14, 2023

The C++ (or Java etc) way to do many of the kind of things people do with Rust pattern matching & sum types would be via OO subtyping polymorphism. e.g. classic visitor pattern, etc. Way more verbose and awkward, and scatters the logic all over the place.

Enums + switch is the other, and far less powerful.

kaba0 · on March 14, 2023

I swear I am not paid to shill for Java or anything…, but Java did get sum types recently, and it’s quite good. They made it in a backwards compatible manner by sealed classes that list all of its subtypes.

Switch expressions were also implemented, and they can exhaustively match on the given sum type. Pattern matching is quite limited as of now (only usable with records), but it is coming.

cmrdporcupine · on March 14, 2023

I should take a look. I kind of have a nostalgic interest in looking back at the JVM again, as 10 years ago it was my main career.

The problem with Java was always that it has moved so slowly in acquiring nice things. The actual core runtime is amazingly well engineered.

thom · on March 14, 2023

Or, for the case of errors, with exceptions, which happen to have their own very limited pattern matching syntax.