Make the Type System Do the Work (nathan.ca)
196 points by NathanWong on Feb 6, 2014 | 132 comments



This article uses inheritance in C++ to ensure the types line up, but I have really taken a shine to algebraic data types for my firm's data feed handlers.

Here is an example in C++11 that implements a market-data spec line-for-line.

  #include <cstdint>  // uint16_t, uint32_t, uint64_t

  #pragma pack(push,1)
  
  struct Quote {
    char     symbol[16];
    uint16_t bid;
    uint32_t bidsize;
    uint16_t ask;
    uint32_t asksize;
  };

  static_assert(sizeof(Quote) == 28, "Quote size wrong");
  
  struct Trade {
    char     symbol[16];
    uint16_t price;
    uint32_t size;
  };

  static_assert(sizeof(Trade) == 22, "Trade size wrong");
  
  struct Data {
    enum class DataType : char {
      kTrade = 't',
      kQuote = 'q'
    };

    uint64_t sequence_number;
    DataType data_type;
    union {
      Quote quote;
      Trade trade;
    };
  };
  
  #pragma pack(pop)
  
Now because I've disabled padding in the structs, the types will map directly to the proper byte locations as specified by the data vendor. There is no need for me to manipulate bytes!

And I can get away with a single cast:

  Data* data = reinterpret_cast<Data*>(raw_buffer);
  switch (data->data_type) {
    case Data::DataType::kTrade: {
      Trade& trade = data->trade;
      std::cout << trade.symbol << ' ' << trade.price << ' '
                << trade.size   << std::endl;
      break;
    }
  
    case Data::DataType::kQuote: {
      Quote& quote = data->quote;
      std::cout << quote.symbol << ' ' << quote.bid << ' '
                << quote.ask    << std::endl;
      break;
    }
  
    default:
      std::cerr << "Unknown message type: "
                << static_cast<char>(data->data_type) << std::endl;
  }
So now the code resembles a match statement in OCaml, but will actually compile directly onto the underlying bytes. No data marshaling or anything like that.


This is incredibly common in embedded programming. Lots of times you'll have some binary packet come over the wire where the first field is a packet type specifier; you can then just cast the rest of your byte buffer to the proper type.

It is very unfortunate that C (and until recently C++) did not have a way to specify the size of enums. Basically this has resulted in crappy #defines being used for far too many years.

I always called what you did above "tagged types". A number of type systems actually work like that under the covers: objects get a type ID, and then a lookup function figures out what that type ID maps to.


The more common term is "tagged union". http://en.wikipedia.org/wiki/Tagged_union


> It is very unfortunate that C (and until recently C++) did not have a way to specify the size of enums.

It had compiler specific switches to set the size of all enums globally, which helped exactly nobody.


I agree, this typed enum in C++11 made a MachO manipulation library much nicer.
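For illustration, here's a minimal sketch of such a sized enum (the ": uint32_t" is the layout trick; the constants mirror Mach-O's filetype field but should be treated as illustrative):

    #include <cstdint>

    // ": uint32_t" fixes the enum's size and representation, so it
    // can overlay a 4-byte binary header field directly.
    enum class FileType : uint32_t {
      kObject     = 0x1,
      kExecutable = 0x2,
      kDylib      = 0x6
    };

    static_assert(sizeof(FileType) == 4, "must match the on-disk field");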


This is a neat trick, but it will segfault in architectures where alignment is enforced (like MIPS, ARM, etc) if the input buffer is not suitably aligned. One relatively cheap way to sidestep this is to memcpy into a stack-allocated struct:

    Data data;
    memcpy(&data, raw_buffer, sizeof data);
    switch(data.data_type)
    // ...


Well, there are two other options available:

One is to use "--pointer_alignment=1" so that all accesses via the pointer are treated as unaligned.

The other is to use the "--no_unaligned_access" flag to tell the compiler not to knowingly generate unaligned accesses.


> static_assert(sizeof(Quote) == 28, "Quote size wrong");

A good reminder that trying to "let the types do the work" in C++ is an oxymoron.

If you're so keen on algebraic data types, there are languages much better suited to that than C++.


> If you're so keen on algebraic data types, there are languages much better suited to that than C++.

Variants of this comment are often seen.

For any given feature, there's going to be a language that shines in that particular area. However, choice of language rarely comes down to a specific feature. There's usually a multitude of aspects that need to be considered - some of them aren't even technical.


That seems like an orthogonal issue to me. The assert is to ensure that the data type matches some external requirement; a quote is a structure that is 28 bytes long, and if something changes to make a Quote not 28 bytes long, you'll be told. If it weren't there, the algebraic data stuff would still work fine.

Edit: I say this as someone whose primary language is Haskell


> If you're so keen on algebraic data types, there are languages much better suited to that than C++.

Yes, you are quite right.

However, it only works when one works alone.

Usually the choice of a language is driven by what the customer requires as technology, what the team members are capable of using or willing to learn, finally what the boss allows.


If I remember correctly there was even a special clause in the C standard definition of union types which allows you to write:

    struct A { Data header; /* ... */ };

    struct B { Data header; /* ... */ };

    union AB { Data header; A a; B b; };

and then access the header field in AB. If this is syntactically the first field in each of the constituent records, the result is well-defined.
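For instance (a hypothetical sketch; Data is the struct from the top comment, and read_packet stands in for whatever fills the buffer), dispatching on the shared header works no matter which member was last written:

    AB ab;
    read_packet(&ab, sizeof ab);      // writes either an A or a B
    switch (ab.header.data_type) {    // each member starts with a Data,
                                      // so the header lines up at offset 0
      case Data::DataType::kTrade: /* use ab.a */ break;
      case Data::DataType::kQuote: /* use ab.b */ break;
    }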

This is a great solution if you need to map to a specific wire format. If you do not have that constraint and have code which uses algebraic data types a lot you should probably switch to OCaml or Haskell. There are specific optimizations to work with complex matches and for representing algebraic types in a compact and efficient way. In C++ the compiler has to work with reinterpret_casts and cannot change the representation of your data types (much). Doing the optimizations by hand is possible, but chances are that a simple-minded OCaml prototype will already perform reasonably well.


Just make sure your system's native endianness is the same as in the raw buffer.


My previous job involved dealing with financial data feeds, can confirm. We used that trick all the time.


That looks like an implementation of algebraic data types, written in a language that doesn't support them. "union", "reinterpret_cast" -- I hope that's well tested, since the type system won't help you there!


To make it even better, you can declare classes that are exactly the size of your primitive type, and do things like endianness conversion for you. I will often even make functions that convert, e.g., a bid into a specific price type. Having all of the numerical quantities in your trading system strongly typed makes working on it much easier. The strong type system that enables this, without sacrificing any runtime speed is my favorite part about C++.
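A minimal sketch of such a wrapper (the name and the little-endian-host assumption are mine, not the parent's):

    #include <cstdint>

    // Overlays a big-endian wire field exactly; converts on access.
    // Assumes a little-endian host; a real version would detect this.
    struct BigEndianU32 {
      uint32_t raw;  // bytes exactly as they appear on the wire

      uint32_t value() const {
        return ((raw & 0x000000FFu) << 24) | ((raw & 0x0000FF00u) << 8) |
               ((raw & 0x00FF0000u) >> 8)  | ((raw & 0xFF000000u) >> 24);
      }
    };

    static_assert(sizeof(BigEndianU32) == 4, "must overlay a 4-byte field");

A price type can then wrap one of these, so reading a quote never hands you a raw, wrong-endian integer by accident.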


That's how I write LuaJIT code for networking and device drivers too. Very neat.


It's clever but it doesn't scale to writing systems. It's just a hack.


Hmm? I disagree. I have built very large systems using exactly this technique.

As I mentioned in a comment above, lots of on-the-wire binary protocols start with a type identifier byte (or word). You then have a wonderful switch statement based off of that.

Heck, I had an entire inter-processor communication system running based on what was essentially this very idea. The switch actually had two layers: the first byte gave you a subsystem to jump to and the second byte was the command within that subsystem.

You sent data down the wire connecting the two processors together (a UART connection in this case), read straight into a buffer, and just pop right into a switch statement. The main disadvantage is everything is set at compile time, but for a lot of uses that is a-ok.


What if your bytes are 16 bit on a given platform? ;)


I prefer to have a single class which stores an SI unit, as in

  final class Temperature {
    private final double kelvin;

    public static Temperature fromKelvin(double kelvin) { ... }
    public static Temperature fromCelsius(double celsius) { ... }
    public static Temperature fromFahrenheit(double fahrenheit) { ... }

    public double getKelvin() { ... }
    public double getCelsius() { ... }
    public double getFahrenheit() { ... }
  }
This avoids the quadratic explosion of conversion methods.

A more complete example - https://github.com/dlubarov/daniel/blob/master/data/src/dani...


But this solution doesn't solve the original problem described in the article. With your API, it is easy for a programmer to use a temperature in Celsius as a temperature in kelvins, and the type system can't catch it.


Well, most code can just pass around Temperature objects without dealing with any particular units. There's no opportunity to mix up units there.

When some code needs to actually get a number out, say for the purpose of logging, it's possible to screw up with either approach.

With the author's classes, you could do

  void printTemperature(DegCelsius temperature) {
    log("Temperature in Kelvin: %f", temperature.getDegrees());
  }
With this Temperature class, you could do

  void printTemperature(Temperature temperature) {
    log("Temperature in Kelvin: %f", temperature.getCelcius());
  }
As the author says, "wrong code should look wrong." I think both of these look pretty wrong.


With the author's code you can define a print() function inside each temperature class, so that there is no mixup of units during printing. Or let the class supply its suffix instead of hardcoding it in an external print() function. It might not be always possible to follow that technique, but in general, it is better (according to the article, and I agree) to retain as much information at compile time as possible.


One way to do this is to have a superclass for each measurement type (Temperature, Mass, Time, Distance, etc.) with a canonical unit. It's slower, but this way you can get any measurement out without defining pairwise conversions like kg>lb, grams>lb, ounces>lb, because everything routes through the canonical unit: kg>kg>lb, grams>kg>lb, etc.
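A minimal sketch of the canonical-unit idea (in C++ for consistency with the rest of the thread; the names are mine, and 1 lb = 0.45359237 kg exactly):

    // Every Mass is stored in kilograms internally, so adding a new
    // unit costs two conversions rather than one per unit pair.
    class Mass {
      double kg_;
      explicit Mass(double kg) : kg_(kg) {}
    public:
      static Mass fromKg(double kg)   { return Mass(kg); }
      static Mass fromLb(double lb)   { return Mass(lb * 0.45359237); }
      static Mass fromGrams(double g) { return Mass(g / 1000.0); }

      double toKg() const    { return kg_; }
      double toLb() const    { return kg_ / 0.45359237; }
      double toGrams() const { return kg_ * 1000.0; }
    };

Grams-to-pounds then routes through the canonical unit: Mass::fromGrams(500).toLb().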


The Boost.Units[1] library takes this even further, allowing you to do things like divide a length by a time quantity and get a velocity, again with compile-time type checking.

[1]: http://www.boost.org/doc/libs/1_55_0/doc/html/boost_units.ht...
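A rough sketch of what that looks like with the SI system (the numbers are made up):

    #include <boost/units/quantity.hpp>
    #include <boost/units/systems/si.hpp>

    using namespace boost::units;

    // Dividing a length by a time yields a velocity, checked at compile time.
    quantity<si::length>   d = 100.0 * si::meters;
    quantity<si::time>     t = 9.58 * si::seconds;
    quantity<si::velocity> v = d / t;     // fine
    // quantity<si::velocity> w = d + t;  // would not compile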



Did anyone else recoil in horror at how a simple function is turned into a type hierarchy? This is everything that's wrong with Java to me. Classes are all fine and well, but don't take it to extremes. Maybe the examples were just too trivial, but it really is a far greater evil than just using well named variables and a simple conversion function.


What is missing is the rest of the program, which could be using temperatures hundreds of times, scattered throughout the calculations and UI. A small type hierarchy provides type safety for all that other code and is certainly less prone to errors than well-named variables and simple conversion functions.

The example is a bit too trivial, as the hierarchy doesn't provide much value. I would assume the base class would actually provide some useful functionality that isn't in this example, or it could be eliminated entirely.


Yes, if it is a major part of what the program does it could be justified. But often I see this kind of craziness for things used only a couple times.


Your problem isn't with expressive type systems, it's with languages whose type systems are not sufficiently expressive, though.

Well-named variables are interpreted and checked by the programmer; well-expressed type constraints are interpreted and checked by the compiler. Guess which one makes fewer mistakes?


Yes, I'm happy to offload as much as possible to the compiler as long as it doesn't involve hundreds of lines of unnecessary classes and interfaces. That's just wrong.


I think the examples are just too trivial (or in this case, aren't extended to a greater library). Where they really come in handy is if you're using the types all over the place. If I were to create a Degrees class (with Celsius and Fahrenheit) just for representing it in a UI, I'd call it crazy. But if I'm doing conversions all over the place or creating a library for others to use, I would think the plumbing is worth it. Of course, that's true for most "plumbing" in code.


If you are "doing conversions all over the place", you are looking at a fundamentally flawed code design that costs you performance. Ensuring consistency of coversions here is like painting a house whose roof's on fire.

The C/F example is synthetic, but that doesn't make it any good - if you are operating with data that can be expressed in different units, then pick a single unit, stick with it throughout the code, and convert only when displaying (or passing to external party code).


Or perhaps you're integrating a large number of systems, each with its own conventions.


It's nice how in Go you can define a new type based on int, and it is its own distinct type with compile-time checks and explicit conversions. Definitely takes away all the boilerplate.


Haskell has both. You can have type synonyms that are not distinct from the original type according to the compiler, and newtypes that are distinct from the original and don't even 'inherit' the type classes of the original type. You have to specifically instantiate all the type classes you need.


> Definitely takes away all the boilerplate.

Until you need to write generic code. Then it is casts everywhere and boilerplate to satisfy interfaces.


>... and boilerplate to satisfy the interfaces.

Go actually provides facilities for embedding types which helps remove some of the unnecessary boilerplate from satisfying an interface.[1]

Through embedding you can leverage the implementation of an existing type to satisfy an interface w/o writing methods to dispatch to that underlying type.

[1]: http://golang.org/doc/effective_go.html#embedding


I used Go quite a lot until version 1.0 got released.

You are not explaining anything new to me.


If you need generics why are you writing in Go?


I am not, left the language ecosystem before the 1.0 release.

Might consider it again if a customer requires me to use it, otherwise I am happier with other more modern languages.


No. This was an example to illustrate a point. Use your imagination to find real uses of this approach. And what's wrong with type hierarchies anyway?


yes.


To take this to the current state of the art, look at Haskell, Agda, and Idris, the latter two being dependently typed.

Steps for learning Haskell:

https://gist.github.com/bitemyapp/8739525


What are your impressions of the three languages?

Personally, I'm learning Haskell first, because Haskell is the most popular of the three and there is plenty of good learning material. As far as Agda and Idris go, I think Idris looks more appealing.


Learn Haskell first. When you've gotten your head at least partially around it then the latter two become much more approachable. Programming in a dependently typed language is not simple if you're not used to the algebraic methods they're based on. That said, Idris seems to be aiming to be much more "practical" than Agda.

Also, all that said, once you start getting to the point where you're really getting comfortable with Haskell types you absolutely should learn Agda or Idris since they do advanced types far better than Haskell does. Many confusing Haskellisms become obvious in the light of Agda/Idris.


Care to make any comparison to ATS? I have been playing with it a bit lately and having a good time.


I haven't used ATS. My understanding is that it's a dependently typed systems language, which makes me think it'll provide good region types and things like what Rust is trying to offer. In that way I'd say it still differs a lot from, say, Agda, where you're likely to encounter fairly heady questions about equality and algebra very quickly.

But again, I've not used it so I can't much say for certain.


Well Haskell is the "practical" choice but there's a lot to learn from Agda and Idris if your interest is in a helpful type system - the topic of the original post.


Any recommendations for web programming in Haskell?


I highly recommend taking a look at the Happstack Tutorial. There's both a Lite version [1] and a full Crash-Course [2].

[1] http://happstack.com/page/view-page-slug/9/happstack-lite [2] http://happstack.com/docs/crashcourse/index.html


Just getting started? Scotty.

That having been said, learning how and why Snap and Yesod do their thing is advisable.

I don't think there's any real reason not to take some time with Happstack, Scotty, Yesod, and Snap in turn and just see which appeal to you the most.

Scotty is analogous to something like Sinatra and reuses the existing WAI/Warp (think Rack/Ring/WSGI) work that was originally done for Yesod.


Excellent, thanks! I may actually do so now.


yo switch the gist to markdown to get word wrap


Done.


This is a good trick but it needs to be used judiciously or you'll end up with lots of boilerplate.

I particularly like Go's type system, though, because it makes it very simple:

  type celsius float64
There's no overhead and it doesn't attempt to prevent you from doing any calculations as a float64; the type checking just prevents direct assignments from a celsius type to another type.


On the other hand, Go doesn't have a way to write code that abstracts over types, and my (albeit limited) experience with Go is that most of the boilerplate is due to the language's poor type system. If you use a language with a better type system (Haskell is the first that comes to mind, though there are others), you can actually do this sort of thing without the boilerplate code.


Haskell still has some boilerplate - actually quite a lot if you don't use -XGeneralizedNewtypeDeriving (which has known problems when used with other extensions). With newtype deriving, this is about as simple as it gets:

    {-# LANGUAGE GeneralizedNewtypeDeriving #-}

    class Degrees deg where
      toK :: deg -> K
      fromK :: K -> deg

    newtype K = K { unK :: Double } deriving (Show)
    instance Degrees K where
      toK = id
      fromK = id

    newtype C = C { unC :: Double } deriving (Show)
    instance Degrees C where
      toK = K . (+ 273.15) . unC
      fromK = C . (flip (-) 273.15) . unK

    newtype F = F { unF :: Double } deriving (Show)
    instance Degrees F where
      toK = K . (+ 273.15) . (* (5/9)) . (flip (-) 32) . unF
      fromK = F . (+ 32) . (* (9/5)) . (flip (-) 273.15) . unK
Without newtype deriving (or other generic deriving extensions), one would need to implement instances of {Num, Real, Fractional, RealFrac, Floating, RealFloat} for all three types, which do nothing but map to the Double equivalents.


GeneralizedNewtypeDeriving's problems are fixed in the newest GHC, which just had its first RC a few days ago. It introduces a "roles" system which tracks which types are allowed to use GND without breaking abstraction boundaries. Roles are mostly invisible to end users as well.


I got annoyed with Go when I learnt how hard it is to sort arbitrary lists. Doing the following in Go is quite a pain:

    scores = [{'name': 'Bob', 'score': 20}, {'name': 'Jane', 'score': 15}]
    scores.sort(key=lambda x: x['score'])


Yes, there is in practice a fair bit of boilerplate in Go. (The way you'd do it in this case is by defining two new types for a struct and a slice of structs. Not that hard once you know the trick.)

Still, they got tiny types right, and without getting bogged down in dimensional analysis.


"'Dog is-a Mammal' ... actually fairly sound"

I disagree in the context of OOP inheritance.

(This example uses shapes, because the idea of mutating mammals gets a little strange.)

Let's say you have a Circle class and an Ellipse class. A Circle inherits from an Ellipse, of course, because a Circle is-a Ellipse. By specializing as a Circle, we get extra reader methods such as getRadius(). Great.

But what about mutators? For an Ellipse, it may have a mutator called squash() that holds the major axis constant and halves the minor axis. But that leaves us with a problem, because circles can't be squashed while remaining circles.

Let's say we have a program like:

    void f1() {
      Circle myCircle(10.0);
      f2(myCircle);
      cout << myCircle.getRadius() << "\n";
    }
    void f2(Ellipse &e) {
      if (...)
        e.squash();
    }
Inheritance says that f2 must also accept a Circle. But what happens when we try to squash it? We can either throw a runtime error there, or we could let the squash succeed and the getRadius() in f1() will fail.

Although a Circle value is an Ellipse value, a Circle variable is not an Ellipse variable. In fact, inheritance for variables should go in the opposite direction from inheritance for values, because an Ellipse variable can certainly hold a Circle value.


"Dog is-a Mammal" (or "Circle is-a Ellipse") is sound, what is unsound is supporting mutations that alter identity. If you change the minor axis of an ellipse, it isn't the same ellipse. The idea of a mutating squash method of the type derived is therefore unsound in the context of the domain being modeled, as it would alter identity.

> (This example uses shapes, because the idea of mutating mammals gets a little strange.)

Mutating shapes is strange, because geometric shapes don't tend to have attributes that can be mutated without effect on identity (if you change their features, they aren't the same shape.)

Mutating animals is more normal; because a dog is a concrete thing that exists in time and space and not an abstract ideal the way a shape is, there are attributes of a dog that can change without its identity changing, so mutating methods on a Dog can make sense.


"what is unsound is supporting mutations that alter identity"

EDIT: included code for clarity

I can make essentially the same example with concrete entities that exist in time and space, as well:

    void f1() {
      GasolineVehicle vehicle(myEngine);
      f2(vehicle);
      cout << vehicle.getMPG() << "\n";
    }
    void f2(Vehicle &v) {
      if (...)
        v.setPropulsionDevice(myMotor);
    }
After replacing the engine with a motor, MPG no longer makes sense.

At the time you're writing the Vehicle class, it seems perfectly reasonable to define a setPropulsionDevice() method, and there's no hint that calling it would change the vehicle's identity. Only when you define the subclass does it turn into an identity problem, but then it's too late.

Also, if you think the Circle/Ellipse example is not compelling enough, how would you structure those two classes? Would one inherit from the other, and if so, in which direction?


I think it's a fundamentally bad idea to think of class hierarchies as ontological schemata of real world things.

It's like ancient philosophers trying to define the human being. Are we FeatherlessBipeds or RationalAnimals? Who really cares?

The Circle/Ellipse squabble is academic because it never mentions the actual domain – for what purpose are we trying to model these entities in the first place?

Even in pure geometry, there are many different ways to classify and treat these structures. If you have a generic shape representation, you could just have Circle and Ellipse as different builder functions. If you're mostly just concerned with drawing, then a simple "isCircular" method on the Ellipse object might be what you need. And so on...


> At the time you're writing the Vehicle class, it seems perfectly reasonable to define a setPropulsionDevice() method

Are you claiming that it seems reasonable to be able to replace a car engine with a jet engine without changing anything else? (The car's hull, construction, etc., and after all these modifications it's not the "same" car anymore.)


No.

Any example I come up with is going to sound contrived, because it is. Real examples get messy, and just add to the confusion.

But real issues show up due to fundamental problems with "is-a": Even if an X value "is-a" Y value, an X variable is not a Y variable (because it can't hold all objects of type Y). So passing a mutable object reference to a function is fundamentally different than passing an immutable object -- but inheritance hierarchies treat them the same.

You can get around this problem by not using inheritance hierarchies that work that way, or by always using immutable objects (value semantics). But I pretty much consider the Java/C++-style inheritance to be a mess.


> But real issues show up due to fundamental problems with "is-a": Even if an X value "is-a" Y value, an X variable is not a Y variable (because it can't hold all objects of type Y). So passing a mutable object reference to a function is fundamentally different than passing an immutable object -- but inheritance hierarchies treat them the same.

That's not a problem with "is-a", it's a problem with unsound use of mutation in inheritance hierarchies. A mutating method can either: 1) anticipate that it can fail, or at least that the mutation component can (however that is signaled, whether by return value or exception), on some combinations of argument values and object state, in which case it works fine as a method anywhere in an inheritance hierarchy; or 2) be guaranteed to succeed completely with any arguments of the appropriate types, in which case it's fundamentally unsound anywhere except in a final class (since declaring such a mutating method is essentially logically equivalent to declaring a method with a -- potentially additional -- return value whose type is simultaneously both the type of the class it is declared in and the type of the class it is used in).


This is elegantly solved with immutable data types. The squash method of an Ellipse can be defined to return an Ellipse (a new instance, the original is never modified in place). This will work fine even when Circle inherits from Ellipse.
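A minimal sketch of that approach in C++ (class and method names are assumed; the parent comment gives no code):

    // Immutable: squash() returns a fresh Ellipse instead of mutating,
    // so handing a Circle to Ellipse-handling code is harmless.
    class Ellipse {
    public:
      Ellipse(double major, double minor) : major_(major), minor_(minor) {}
      Ellipse squash() const { return Ellipse(major_, minor_ / 2.0); }
      double minorAxis() const { return minor_; }
    private:
      double major_, minor_;
    };

    class Circle : public Ellipse {
    public:
      explicit Circle(double r) : Ellipse(r, r) {}
      double radius() const { return minorAxis(); }
      // squash() is inherited and still returns an Ellipse, so no
      // Circle invariant can be silently violated.
    };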


The circle vs ellipse problem is rather artificial and, if anything, it shows that modeling with types and everyday intuition are two different things.

If the rest of your program can handle general ellipses, it should also be able to handle ellipses having the same minor and major radius (i.e., circles). The obvious solution is to not have the Circle class and _maybe_ equip Ellipse with an IsCircle() method. (Though, why would you care?!)

See the C++ faq lite items 21.6 -- 21.8. (A quote from 21.8: "Here's how to make good inheritance decisions in OO design/programming: recognize that the derived class objects must be substitutable for the base class objects. ")


The circle/ellipse problem is real and shows how mutability ruins people's intuitions about models and relationships. Note that mutability ruins programs in the same way, introducing subtle inconsistencies and invariant violations.


Weirdly, the C++ Faq taught me more about object-oriented and related issues than any other source. It's very good.


The short version here is "Inheritance must respect the Liskov Substitution Principle, and many seemingly obvious inheritance relationships don't".


> The only problem is that you end up with a variable that fails to describe itself better than “I’m a number”.

Our application servers need to know about updated entities, so we broadcast notifications when an entity has been updated - but we prefer to leave it up to the individual applications to retrieve the updated entity as they see fit.

So we're often dealing with numerical ids of entities, and we've had a few bugs arise from code like so

   public void something(long accountId, long websiteId)
When people have accidentally used the wrong long in the wrong place. So we've now moved to using simple wrapper types around the aforementioned longs -

    public void something(AccountId accountId, WebsiteId websiteId)
It does come at a little bit of cost as we're passing these things around a messaging system, so now we have to think about how they are serialized (Java serialization is problematic, and there are occasional religious wars about GPB vs. JSON), but it leverages the type system to make it very explicit what this number represents.


This method also creates a lot of object churn. For reasonably small systems that can be okay, but you have to be mindful of it.


Would it be possible to do this as a type-alias type thing that disappears at compile time? As far as I am aware, Java has no support for such type aliasing, but it should be possible to add a pre-compiler phase to your build process that replaces these types with what they are aliases of. The only part of this that seems complicated is the type checker, which would need to be aware of the difference between AccountId and long.


I could see it being done with annotations + processing them perhaps, as some frameworks do for database properties such as index/uniqueness. An object for every scalar datatype is just overkill, even if it solves a legitimate problem.


You could make it a pre-compile step in Maven, maybe. But it'd be gross.


The big problem (other than verbose code) with rolling up stuff into classes in C++ is that performance suffers. Many platforms will not pass small structs efficiently.

Inspired by Haskell's newtype, I drafted a proposal for C++14 which would have introduced native newtype to C++, but it got rolled into an omnibus paper (n3635) and appears to have been forgotten.

The proposal is modelled on strong enums, but generalizes them to all types. They are essentially strong typedefs with none of the weaknesses of wrapper types, restrict pointers, typedefs, and pragma disjoint.

    newtype NotInt : int; 
    NotInt k(5); // call ‘default’
    NotInt bar(NotInt);
    bar(k); // passes exactly as an ABI int would
    int baz(int&);
    baz(k); // Error! no conversion
	
While drafting this proposal, I realized it was a much more C++-like way of bifurcating aliases than the 'restrict' keyword:

    newtype D1 : double; 
    newtype D2 : double; 
	
    void biz(D1 *x, D2 *y) {
        // Being separate, D1/D2 can be
        // assumed not to alias by optimizers
	...
    }
	
While playing with it, we came up with this proposed syntax (although I don't think this would get through the committee):

    [template <typename T>]
    newtype [NotT] : [(public|private)] T [= (default|explicit|deleted)] [{
      // only non-virtual member functions, no data or reference members (something like POD)
      // this->value has all of the operations of T, but only aliases with the new type, and implicitly converts to/from T prvalues.
    }] [optional_variable_name];


I don't think there's going to be any performance difference at all compared to the basic case, which actually is another reason why this is so good. Ultimately, all these structs just hold a double, which means that in memory they're exactly a double and nothing more; there's no type tag or anything like that. And all the code is static so there is no vtable, just the double value. The compiler does the rest, not the runtime.

(Still, I'd love to see newtypes in C++)


It can make a difference due to weird ABIs. For example, in the ARM procedure call standard, 64-bit values can be returned in r0 and r1, but 64-bit structures can't (no matter whether they contain two 32-bit values or just one 64-bit value). However, I'd call that fairly niche - it's hard to imagine an application that would notice a significant performance difference.


This is very nice to know actually, and quite relevant to code I am working on right now!

Disappointing though.

> it's hard to imagine an application that would notice a significant performance difference.

One hopes any function call made often enough to be a perf hit from this would also be designed to be easily inlined!


Scientific computing will hit this sort of problem just because the kernel is so large it actually needs to be broken into sub-procedures for icache reasons. But also, the penalty is large, because there is a full store/load cycle involved, which will clear caches and increase memory bandwidth requirements.


The best write up I ever heard for this approach (often called "primitive obsession" - http://sourcemaking.com/refactoring/primitive-obsession) was in Object Calisthenics, an essay by Jeff Bay in The Thoughtworks Anthology (http://www.amazon.co.uk/The-ThoughtWorks-Anthology-Technolog...):

"an int on its own is just a scalar, so it has no meaning. When a method takes an int as a parameter, the method name needs to do all of the work of expressing the intent. If the same method takes an Hour as a parameter, it’s much easier to see what’s going on. Small objects like this can make programs more maintainable, since it isn’t possible to pass a Year to a method that takes an Hour parameter. With a primitive variable the compiler can’t help you write semantically correct programs. With an object, even a small one, you are giving both the compiler and the programmer additional info about what the value is and why it is being used.

Small objects like Hour or Money also give us an obvious place to put behavior that would otherwise have been littered around other classes."


It might be because I only use C-family languages for very low level, highly optimized stuff, but I found it very weird that a blog post about types would force all those useless casts from int to double.

9 / 5 in lieu of 9. / 5. or 1.8 will come back and bite you: it's integer division, so it silently evaluates to 1!
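A two-line illustration of the trap (my example, not from the post):

    double c = 20.0;
    double wrong = 9 / 5 * c + 32.0;      // integer division: 9 / 5 == 1, yields 52.0
    double right = 9.0 / 5.0 * c + 32.0;  // 1.8 * 20 + 32 == 68.0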


I think the major problem this runs into is that it doesn't necessarily help with the actually hard parts of programming.

That is, sure, mistakes have been made with units. More mistakes are made in other areas, though. Usually amusingly more basic areas.

The question seems to be whether encoding at the type level will help with these areas. It seems the conflict is the idea that fully proving a solution is superior to partially proving it.

Laudable, to be sure, but anyone that has attempted to prove that quick sort works knows, it isn't easy. Just understanding the shorter versions of some programs takes a fair bit of brain power as it is. Understanding the fully specified version can be orders of magnitude more complicated.

So, I am left wondering if it isn't enough to stop at "passes all the tests." Especially if you have the correct tests.

And eventually one wonders whether the "correct tests" includes more stringent types. Maybe it does. Maybe it doesn't. Likely depends on the situation, is my bet.


Type checking always happens at some point in every language, dynamic or static. At some point, a routine gets executed on some form of data, and the success of that routine is predicated on the data being of the right type.

In dynamic languages, you end up writing a whole raft of tests that are not much more than type assertions that are otherwise done for you by the compiler in a statically typed language.

For statically typed languages, the type check is a type of test that just so happens to be very succinct and easier to write than it is to skip. In dynamic languages, it's harder to write a test for a type assertion than it is to trust the duck is a duck, give it a punch, and hope it quacks.

It makes sense, especially for a reusable library, to try to offload as much to the static type checker as possible. It always runs and can't be skipped. As you say, "especially if you have the right tests," a static type checker forces you to have the "right tests" for at least a small part of your code, leaving you to write fewer tests for the rest of it.


I don't think you really wrote anything I did not acknowledge. Yes, types are a form of complete test over the shape/type of data that they encode. If you can encode your data with the correct type, it will go a long way to making sure you don't have errors in the types.

However, I take issue with this statement: "For statically typed languages, the type check is a type of test that just so happens to be very succinct and easier to write than it is to skip." For the simple cases, I agree with this. But when you see the type gymnastics that some programs go through to guarantee some traits, it is hard to agree with any claim of "easier to write." Consider a "true" quicksort in Haskell[1].

My assertion was that the "hard part" of programming is often in the logic, not necessarily the shape/type of the data. When you see attempts at getting the logic of a program into the type system, things start to take a turn for the worse. Really worse when you find you have many types that all encode the same type of data. (Consider the metric system debate.)

My point about "having the right tests" is that sometimes you only care/have the knowledge to think about/whatever certain scenarios that are either likely or guaranteed to happen. Pushing everything into the type system implies that you had the energy to think of and consider everything. My assertion is that often you don't have that time/energy.

So, sure, you will have the "right tests" in that scenario to an extent, since you will have all tests. However, I wonder if most programs couldn't work out with a much smaller subset of those tests.

A direct analogy for the point I am making is physics. Classical/Newtonian physics breaks down and doesn't work at some level. Worse, it relies on a lot of simplifications and actually describes what is happening more than how/why. Yet, for a large portion of calculations that people will do, it works. I grant that encoding all of the assumptions into the calculations will make them more accurate, but often you don't need that level of accuracy. And they definitely make them harder to understand.

[1] http://www.haskell.org/haskellwiki/Introduction/Direct_Trans...


The true quicksort example you linked has very little to do with Haskell's type system and everything to do with purity. The qsort version at the bottom of that page is also pretty close to the version used in any mutable language, just with a bit more noise in the syntax. A direct translation to Python might make the syntax easier to understand, though it should be stated that this is not supposed to be pretty Python code---it's literally just a syntactic swap.

    def lscan(ary, p, h, i):
      if i < h:
        if p < ary[i]:
          return i
        else:
          return lscan(ary, p, h, i+1)
      else:
        return i

    def rscan(ary, p, l, i):
      if l < i:
        if ary[i] < p:
          return i
        else:
          return rscan(ary, p, l, i-1)
      else:
        return i

    def swap(ary, i, j):
      ary[i], ary[j] = ary[j], ary[i]

    def sloop(ary, p, l, h):
      if l < h:
        l1 = lscan(ary, p, h,  l)
        h1 = rscan(ary, p, l1, h)
        if l1 < h1:
          swap(ary, l1, h1)
          return sloop(ary, p, l1, h1)
        else:
          return l1
      else:
        return l

    def myqsort(ary, lo, hi):
      if lo < hi:
        piv = ary[hi]
        i = sloop(ary, piv, lo, hi)
        swap(ary, i, hi)
        myqsort(ary, lo, i-1)
        myqsort(ary, i+1, hi)


My point is pretty much about the "a bit more noise in the syntax." And qsort will only work on array types, something that is typically shunned by people trying to "do all of the work with the type system." Consider the types for "NonEmptyList" and such.

Is it neat when these work? Certainly. Do I think they pay obvious dividends in the effort increase? Not so much. I am open to the argument, but I have yet to see anything compelling. And with so much traction in other areas, I am doubtful that it is anything close to the silver bullet that advocates make it out to be.


Again, that noise is entirely an artifact of purity and syntactic choices, not the types.

I'm not sure exactly what you're arguing for here, but I can explain a use case of NEL—when you consume it you don't have to check it for emptiness. That means that you can eliminate an entire branch of code, either locally or in exception catching, or an entire failure mode.

Frequently people just throw those checks in and then assume, trusting their own eye or documentation to minimize the emptiness checking, but without expressing that into the type the compiler won't be able to help you not repeat that check. It's optimal to check it exactly once and people are liable to check it either 0 or many times.


My argument is that as you get more types that cover states or behavior of your program, one typically finds more code.

Is there an argument that these are beneficial? Of course there is. I just have not been privy to an example where it carried its weight.

Take the not-empty list. Typically the "cost of checking for empty" is so negligible that to push that thinking into the compiler just doesn't lead to any gains. Not even just marginal gains. It is nigh unmeasurable.

Compound this with similar types over the state of any other object, and you have an explosion of types that becomes unwieldy. Account, NotEmptyAccount, OverdrawnAccount, DrawingInterestAccount... It gets to be just crazy. Especially in areas where you don't care about any of those details, but now have to account for an inheritance tree.

And then you get into the fun of having to logic not just about how the actual program works, but how the interaction of various types works.

So, you keep saying the syntax I was referring to was due to purity. Which is kind of the point. These languages use syntax and types to help ensure purity. That makes some algorithms harder to handle, others easier, and the true computational cost often hard to know.

Which is all to say that it is a design tradeoff. And should be undertaken as an informed decision. Not a "let the type system do the work" kind of one. More of one that "this is so important we need to make sure we can afford the cost of pushing it into the type system." Or, if you would prefer "this particular instance is so cheap to push to the type system, we should do so."


I agree completely that it's a tradeoff. I also only bring up purity to suggest that other strongly typed, impure languages could solve this with less trouble.

NEL is a particularly interesting point of compromise. I don't think it carries its weight very frequently at all, though I have used it very effectively in a few cases. I also think the burden of type hierarchies as you're describing isn't much felt in Haskell. You typically don't build large taxonomies, or taxonomies at all, in a language without subtyping.

The only place where I see taxonomies forming at all is in `lens` and there they are used to great effect.


Apologies for not seeing this. I really need to get better at following HN threads. (Or not, depending on your take.)

I think we are mostly in agreement. I just have not had the fortune to work in an area where these have panned out that well. Hopefully I get a chance to change that someday. (Though, I am currently under the song of the lisp sirens.)


There was some talk on the Racket mailing list about "numbers with dimensions"[1], in which contracts were being bent to do calculations which respected units [2]. The Frink language was also mentioned [3].

[1] https://groups.google.com/forum/#!topic/racket-users/mnId0ux...

[2] https://github.com/Metaxal/measures#4-dimensions-and-contrac...

[3] http://futureboy.us/frinkdocs/#SampleCalculations


I did just this very thing in C# a few months ago[1]. The problem I had was that it gets out of hand very quickly. I wanted to be able to divide distance by time and get speed out of it, but it is extremely difficult to satisfy all of the M-to-N relationships between types and still end up being usable.

I ended up abandoning it for the project I'm building, as I wasn't too clear on what the type of a secant of an angle in degrees should be and nobody on the interwebs seemed to care enough to take notice or answer my questions.

[1] https://github.com/capnmidnight/UnitsOfMeasure


F# has this; I wonder how they did it?


They couldn't do it. The compiler does static verification that, say, a value of type Meter divided by a value of type Second results in a value of type Meter/Second, where '/' is a custom concept built into the F# compiler. The .NET runtime knows nothing about these custom types.


Yeah, that's pretty much exactly the issue I ran into. The problem is that the standard .NET type semantics can only infer "up" the inheritance hierarchy.

I had gotten really close. I had types like

    Base
    Length: Base
    Meters: Length
    Feet: Length
    
    Compound<Base, Base>
    Area<T>: Compound<T, T> where T: Length
    SquareMeters: Area<Meters>
    SquareFeet: Area<Feet>
and had defined all of my math as generic extension methods of the Base class so that the types would compose:

    Compound<T, U> Multiply<T, U>(T a, U b) where T : Base where U : Base
    T Divide<T, U>(Compound<T, U>, U b) where T : Base where U : Base
    
That went a long way towards getting fairly concrete types out of only a few lines of code. But I couldn't get it the rest of the way. C# explicitly forbids user-defined typecasts between types that are in an inheritance relationship with each other, so while multiplying two Meters got me a Compound<Meters, Meters>, it was not then possible to take a Compound<Meters, Meters> and convert it to anything like Area<Meters> or SquareMeters automatically.


Interesting problem! I was wondering why even define things like SquareMeters, but as you say the issue is reducing the terms that have different forms but mean the same thing.


Having a SquareMeters type was meant to reduce complexity. Instead of always dealing with Area<Meters, Meters> all the time, it was meant to be a shortcut.

Hrm, what about namespace aliasing? "using SquareMeters = Compound<Meters, Meters>"? Since the problem is not the composition of the type but the verbosity of the deeply composited types, then perhaps it's just about making a shortcut for the name.


Things can be composed in different ways: m/s^2 should equal (m/s)/s.


Okay, I've definitely convinced myself there is no good way to do this without having to write way too much code over and over again. The problem is basically equivalent to inserting items into a sorted list, but you only have one operation in which to do it. So, you are either limited to simple types, or your code for operations on types explodes trying to define all of the ways to compare types (in a generic sense) and different scenarios for combinations of types.

This requires a Turing-complete type system.


Is that a real issue? They could generate struct types, but why take the performance loss? It only affects people trying to reflect over the unit of measurement types.

Additionally, phantom types in F# can make wrapping other types (like int -> AccountId) rather succinct.


Ah, I see. Looking at how something similar was implemented in Haskell, they used phantom types and functional dependencies (in the type system) to achieve this - I'm not sure F# has those capabilities.


F# has that capability, but .NET does not, so once the F# compiler has done its checking, it drops the info on the floor and you have nothing at runtime.

Now, this might not necessarily be a bad thing for F# (I don't know, I'm not an F# user). The runtime type information in .NET is great, but it's all for reflection to be able to build things that a stronger type system like F# has, or a macro system, would be able to handle.


The return type of a trig function is just a scalar, with no unit.


Might be interesting to compare to F#'s units of measure: http://msdn.microsoft.com/en-us/library/dd233243.aspx


To take this even further, look at OCaml's strong, inferred typing.


Does static typing have inherent costs?

For instance, there are some desirable properties often present in dynamic languages like hot code loading, or the ability to make code updates independently. Not many statically-typed systems are good at these things, or at least the static typing doesn't work well across boundaries (like a network, etc.).

Can these be reconciled? Can something have all of the benefits of, say, haskell and erlang at once?


There's Cloud Haskell which is an attempt to do Erlang style distributed concurrency in Haskell. For non-distributed concurrency, you can already get away with using Haskell's green threads and channels. Furthermore, there's the beginning of a hot code swapping system in GHC now, but it's not terrifically popular at the moment---it was built for a particular use at Facebook and hasn't gotten too far from there.

It's worth talking about why Cloud Haskell is tough, though. The primary issue is that Haskell doesn't have any kind of transparency into functions---if you are given a type of `(a -> b)` there's no way to examine what it is on the inside, no concept of source that could be sent to another node.

This limits the kinds of messages and spawning that can happen in a distributed Haskell system. The Cloud Haskell project is working to remove as much of this limitation as possible, though.


In some environments beware of creating new types and objects: http://developer.android.com/training/articles/perf-tips.htm...


Yes, I noticed that some time ago and ended up rewriting parts of the application (and also avoiding generics, instead opting for HPPC). What annoys me a bit is that Dalvik has dramatically different performance characteristics from normal Java, which necessitates relearning all this stuff. Nice article though and I wish I had found it sooner.


This seems to paint the JIT as something that hurts performance instead of helping it -- is that true on Android?


It relates to the behavior of the garbage collection scheme used by Dalvik, which is a mark-and-sweep approach. When the GC runs, that's all the VM is doing. It's pointed out that a generational garbage collector would be more efficient at allocating memory; presumably there is a trade-off in terms of the overall memory footprint.


Static type checking is double-entry bookkeeping of "what I am/have is what you are/want." Erlang and Haskell have the most interesting approaches to typing: optional and extensible, respectively.


I often wonder how the "tool" named MONADS could help to deal with this quite old problem of what I call

meta-types or semantic types

(I'm sure there is a proper name for this; I'm not a CS/type-theory professional ;)

Time handling, for example, and physical units would quite often benefit from much stronger guarantees at compile time and runtime. And no, OOP and its tools are not sufficient to handle this... in theory maybe, not in day-to-day development.

We need to translate the core ideas of a monad into a more "mainstream" syntax and runtime environment to prevent the next incident of:

I'll just add the integer 365 to my certificate valid-date within our core cloud infrastructure code...

I'm not necessarily an MS fan, but would love to see how Anders Hejlsberg, for example, would tackle this kind of language-design challenge ;)


I don't think this has much at all to do with monads.


Care to explain in a little more detail? Is a calendar or date-time monad really such a far-fetched idea?


That's just a state monad. The thread, and what I assumed (perhaps incorrectly) you were talking about, was around overloading math operations to work on dimensionalized quantities like feet and seconds and the like. You can dimensionalize time as well—take a look at the time or thyme Haskell packages for some fascinating examples.

Doing computation in the context of the current time is absolutely monadic. If you're handling time abstractly, it's the state monad. If you're handling it for real then you need some kind of monad protecting communication with NTP/your local CPU clock. That's IO in Haskell.


Yup, you assumed incorrectly. I can see the qualitative difference between operator overloading and monads. That's why I suggested to transfer the core idea of monads into more mainstream languages to provide more "semantic" safety, so to say.

Thanks for the clarification.


What he's talking about is avoiding primitive obsession. Always good advice, although I find its greatest benefit comes in typing collections, rather than just using primitives like List or ArrayList.


Interesting. What about more complex units, e.g. acceleration - m/s^2?


Maybe it's just something I don't understand, but I would go with Celsius everywhere, converting to/from Fahrenheit or Kelvin only when showing temperatures to the user or reading them back in.


The Celsius/Fahrenheit example isn't the best. Generally all examples based on units of measurement are probably flawed because sane systems can just standardise on SI and be done with it.

However, different coordinate systems that should never be mixed are useful to keep separate. At work we deal with maps and geographic information, so there are always at least three coordinate systems: screen, projection, and geographic coordinates. Having a type system that catches mistakes there would be so much nicer than just always passing P2D around and trying to never mix different ones.
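A tiny sketch of what that could look like (type names are hypothetical; the identical layout is the point):

    // Structurally identical, deliberately distinct types.
    struct ScreenPoint { double x,   y;   };
    struct GeoPoint    { double lon, lat; };

    void drawMarker(ScreenPoint p);
    ScreenPoint project(GeoPoint g);    // the only sanctioned crossing

    // drawMarker(somewhereOnEarth);    // GeoPoint argument: fails to compile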


Sorry, what language are these examples given in?


C++


Looks rather like type-system abuse.

btw, there is Haskell for that.)


He claims it's to make wrong code look wrong. He uses the type system in order to prevent bugs; how is that abuse? Is the type system only there for performance in your opinion, or what's your angle?


This is basically the premise of OOP on display, and fairly basic at that. I'm not sure what's to be learned here, except that in the real world your representations will always need to be extended, and subclassing is usually a bad hammer.

Interfaces are better but extend poorly through subclassing as well.

Composition is infinite but difficult to define over time when things change (usually resorts to more subclassing or more injection strategies).

I wish this article were valuable for solving the actual problem, but it's just a primer on OOP vs. using primitives, which is not particularly instructive or useful for solving real-world problems.


No.



