This is describing what I've known as "Typed Hungarian notation", and have seen a few times before, though I can't seem to find a reference now.
The original intent of Hungarian notation was to encode metadata into the variable name - for example, "validData = validate(rawData); showValidData(validData)". The idea was that the prefixes would make it just look wrong to the programmer if someone accidentally used mismatched names, such as "showValidData(rawData)", indicating a likely bug.
This notation mutated into the far more simplistic and arguably less useful "prefix with the datatype", such as iFoo indicating Foo is an integer. This mutation became known as Systems Hungarian, while the original became Apps Hungarian.
The suggestion in this post is taking Apps Hungarian, and instead of relying on variable names, encoding it into the type system itself.
The first time I recall seeing something like this suggested was actually in Java, with examples in that blog post involving class attributes:
public String Name;
public String Address;
And creating classes to represent these values, preventing them from being used the wrong way:
public NameType Name;
public AddressType Address;
...which is why I remember this as "Typed Hungarian" - making the language's type system handle the work that a person would normally have to think about in Apps Hungarian.
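In Haskell terms, the same idea might look like this (a minimal sketch; NameType, AddressType, and printLabel are made-up names for illustration):

newtype NameType    = NameType String
newtype AddressType = AddressType String

printLabel :: NameType -> AddressType -> String
printLabel (NameType name) (AddressType addr) = name ++ "\n" ++ addr

-- Swapping the arguments is now a compile-time type error,
-- rather than merely "looking wrong" to a human reader.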
The Java example you've provided is known as Domain Modeling; it's well described (and honestly greatly expanded on) in Domain-Driven Design by Eric Evans.
It's applied heavily in the enterprise world simply because it's so powerful. In general, as the domain logic becomes more complex, the benefits of doing this increase.
Actually, not encountering this style in a codebase which is solving a complex problem is a massive warning sign for me. It usually means that concepts are poorly defined and that the logic is scattered randomly all over the codebase.
Apps Hungarian is just a logical style: name functions, variables, and types so that they're meaningful within your domain. The result is code which is very easy to understand for someone who understands the domain. This doesn't mean long names - if you're doing anything math-intensive then using the short names which conform to the norms of the field is perfect. For a business process it probably isn't :)
>Actually, not encountering this style in a codebase which is solving a complex problem is a massive warning sign for me.
There is a fine line between capturing some common meaning in a reusable type (e.g. Optional<T>) and spending most of your time crafting a straitjacket of type restrictions that are supposed to thwart some imaginary low-level bug in the future.
Java is actually a great example of how this mentality backfires in real life. The language, from day 1, had a zillion ways to force someone else to do something they didn't want to. Abstract classes. Private methods. Final classes. Etc. It was supposed to ensure good design, but in practice just led to the endless reenactment of the banana-gorilla-jungle problem. (https://pastebin.com/uvr99kBE)
At the same time, the language designers didn't add such basic features as lambdas and streams until 1.8. This led to the creation of an ungodly number of easily preventable bugs, since everyone was writing those awful imperative loops.
The banana-gorilla-jungle problem results from OOP. This article (and presumably, the GP) is describing encoding domain-driven logic in a statically typed functional language, and thus the OOP issues are not a problem.
Functional languages don't carry around their environment with them, so you just get a banana. No gorilla or jungle.
Oh, it does have some banana-gorilla-jungle problem, it's just not in the same place.
If the types are computationally equivalent, there is a plethora of type classes you must inherit, there is a lot of packing and unpacking to make them work with standard libraries (how do you print an address?), and if the values are not completely separable, there are plenty of conversion issues.
It is an improvement, no doubt about it, but the main benefit of the languages with advanced type systems is that you can refactor code easily to add them only to the places where it's a win, only after your code is complex enough to make it a win.
Not really. There are specific design decisions that Java made which enabled this issue to manifest. I recently took a foray into Smalltalk and it doesn't have this problem, because methods depend on object interfaces rather than class types. AFAIK, OCaml also got this issue right while keeping static type checks - they decoupled classes from types.
>Functional languages don't carry around their environment with them
It is entirely possible to recreate the issue in a functional language, as long as there are sufficiently powerful mechanisms to enforce type safety during compilation. In fact, if you don't understand this and follow the advice in the article too far, that's exactly where you will end up. You will have a jungle monad containing gorilla monad holding hostage the banana data structure the user actually cares about.
On the off chance you're reading this 9 days later, I'd like to push back against this response (working on an OCaml project at work, actually).
> AFAIK, OCaml also got this issue right while keeping static type checks - they decoupled classes from types.
If you mean that the types correspond with OCaml classes themselves and not the objects that classes create, yes, that's correct. But it sounds like you might mean the opposite of that. In any case, people rarely use the O part of OCaml much. Most people find they prefer the first-class modules for almost everything that needs greater modularity than what functions provide. I've found that the only thing that OCaml object system is good for is modeling/wrapping existing OOP or prototypal APIs, like certain XML ones, or some web APIs.
>You will have a jungle monad containing gorilla monad holding hostage the banana data structure the user actually cares about.
While it's true that monads are a way to make normally explicit environments implicit, monads aren't really what the book I'm talking about is about, nor this article.
Yes, monad-heavy code could (with effort) end up with an application monad displaying problems similar to the banana-gorilla-jungle problem. I've frankly never seen monadic code like that, but perhaps you have.
Instead, what this article is advocating for is simply modeling your data correctly. Really, this is a universal programming theme, that helps a lot regardless of programming language. But the structures available in statically-typed functional languages are more powerful and able to model domains with greater granularity. If you're familiar with the saying "make invalid states unrepresentable", then you understand what I mean.
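A tiny Haskell sketch of what I mean by modeling the data correctly (the domain here is made up):

-- Instead of a record of loosely related nullable fields like
--   { isConnected :: Bool, sessionId :: Maybe String, retries :: Maybe Int }
-- model exactly the states that can actually occur:
data ConnectionState
  = Disconnected
  | Connecting Int      -- retry count
  | Connected String    -- session id

-- The invalid state "connected, but with no session id" is now
-- unrepresentable: it simply cannot be constructed.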
Java has just had a terrible steward in Oracle. Plus the community seems to like boilerplate code (see: Angular for more of that if you're into javascript).
I switched to C# as my main language specifically because it was gaining awesome features like generics, lambdas, and LINQ. Since then it has just gotten better and better.
In the C# world you seem to encounter systems where the logic lives in one of:
* UI event handlers
* Methods on helper classes which consume a model generated from a database.
* A domain modeled with classes, with the logic split between the classes themselves and services which consume them.
In the first two cases the code tends to have a massive sprawling dependency tree which makes it harder to maintain. In the latter case you've usually got a better chance of having properly isolated code. Even if the isolation isn't great, you at least have the logic well organised.
It's extremely useful, especially at the interface level but also in day-to-day coding. Pity the people whose method signature is (string, string, string), because no IDE can save them from the churn.
It's particularly powerful if you can create an algebra of your domain. You can define invariants which are then enforced by the type system.
I use this a lot when writing C# code and the whole experience is getting better with every new release. Microsoft keep adding awesome stuff like ADTs which make working with these types a dream.
Came here to post this. I second this book heartily. If you design your types to be domain-driven, with raw, valid, and invalid data kept distinct, then by the time you actually get to the database or API, you are sure that your code has passed through however many compile-time-enforced validity checks (if you use F# or a similar statically and strongly typed language) before it got to this point.
That way, you don't need nearly as much defensive coding re: null checks etc., because the data has gone from 'some client typed some value in an input box, hope it's valid' to 'this is an Address line that has been confirmed to be a valid structure to pass to downstream systems'.
Pit of success is so much better than peak of success, and much easier to reach in a statically typed functional language vs. an OO language, in my experience.
The best thing is that he writes F# in a way that is perfectly understandable by the domain experts so he can use the code as the ubiquitous language. In this way an intermediate representation to communicate with the users is not needed and there is nothing that needs to be kept in sync with the code.
> Actually, not encountering this style in a codebase which is solving a complex problem is a massive warning sign for me.
This is very narrow-minded. DDD as a broad concept is good and something I practice.
DDD in enterprise is a disastrous nightmare. When people on your team are wasting time Googling lingo to try to figure out what kind of object they need, where to put a file, or where a method should go, it's a huge red flag.
I was on a team that seemingly spent all its time trying to figure out what the hell DDD was prescribing, let alone trying to figure out how to do it.
> When people on your team are wasting time Googling lingo to try to figure out what kind of object they need, where to put a file, or where a method should go, it's a huge red flag.
I agree, but for a different reason. If you've got a team of programmers on a project without any understanding of what the client is doing, you're bound to implement stuff ass-backwards from the client's perspective.
DDD is fine in enterprise, but the domain does need to be communicated to all involved parties. You can't just dump a codebase on someone and expect them to get to work, that only barely works for non-DDD codebases let alone DDD codebases.
The kind of DDD this article is describing is totally different from Java-world enterprise DDD. It's DDD done right, with no 50-member OOP classes. Just a single function that can only take a type that describes your domain in a way that prevents whole classes of errors.
A lot of members of my team couldn't figure out where things were "supposed" to go (or what most of the terms mean) after weeks of working on the code, so they just started slapping methods anywhere and worrying about it later.
I would actually agree with you there, but I suspect that wasn't your point.
A hammer is such a simple and obvious concept that it should not need to be googled. If anybody needs to do that anyway, you should not put the blame on the carpenter, but on the guy who "designed" the hammer.
On a related note, I still very much am convinced that Don Norman's Design of Everyday Things is a great read for anyone who designs APIs; types in this argument function both as affordances and forcing functions, which are the exact reason why a hammer is such an obvious tool.
You seem to be implying that a dev who doesn't understand enterprise DDD is unqualified.
That's not like a carpenter who can't use a hammer. It's like a carpenter who asks for a sketch of a project and instead receives a phone-book sized list of vague instructions that no one can agree on.
I don't think people agree with your analogy. To me, your analogy is too simple. I have 6 years of Java development experience and a master's in informatics. I have yet to work with enterprise DDD. There were no classes at university that taught it.
In most of my projects I have had to learn a lot of new stuff, it's a blessing if someone on the team knows how to use it already. However I have found that on most projects we've struggled to hire people with substantial experience with the tech stack and methodology that the customer had chosen.
We still wouldn't find an entire team that had experience with everything we needed even if we paid them ten fold the amount of every other company.
On the other hand I reckon every carpenter learns to use a hammer and nails in school.
> There were no classes at university that taught it.
So? You don't leave university and expect to be "done", right? You learn, learn, and learn - your whole software dev life. Even better, your whole life.
And no - nobody teaches you how to professionally handle a hammer in school. Now you are simplifying here. In school, they show you a hammer and nails. That's it. As much as they show you a compiler, a programming language and a basic architecture in university.
There are a ton of different hammers and nails out there, each with their own special attributes. You have an engineer's hammer, a tiling hammer, a rubber hammer, a roofing hammer, a splitting maul, a sledgehammer and so on... And I won't start talking about nails.
It's up to you to gain knowledge of when to use what, and how to use it. That doesn't mean it's useless per se. Or overly complicated. It might just not "be your thing", which is fine.
This was exactly my point. Not all carpenters would know how to use some specialised hammer and specialised nail, just like most developers wouldn't know how to use a specialised language and framework. So developers that don't know how to develop with DDD would just learn as part of joining a project, and they would still be qualified to be on the project. Just like how a junior carpenter might not know everything needed to erect a house by themselves, but as part of a team of carpenters they would learn the things they didn't know from school, including those specialised hammers and nails for that specific house, because the owner wanted something unique or whatnot.
FYI a carpenter in Norway needs 4 years of education to become a carpenter. 2 years in school and 2 years as an apprentice. If they don’t know how to use a bog standard hammer by that time they are not passing the final examination.
I see - maybe I got your point wrong at first. Sorry for the confusion.
Here in Germany, it's basically the same: a carpenter goes to school and works in a dual system for 3 years, and can then put a master qualification on top of it after several years of working experience. This takes another 2 - 4 years and a ton of money.
This can go two ways. One can end up with lots of different types that do nothing but wrap a string. On the other hand, I have also seen a float being used to represent a price. The correct way is somewhere in between those two extremes. If you have some logic specific to a value, by all means, make it a type, but if there isn't, it just ends up as lots of boilerplate.
Another good example of this is having separate classes for something like unsafe strings vs. safe strings in a web app. The functions which interact with the outside world accept unsafe strings and emit safe strings to the rest of the application. Then the rest of the application only works with safe strings.
Anything that accepts a safe string can make an assumption that it doesn't need to do any validation (or "parsing" in the context of the OP), which lets you centralize validation logic. And since you can't turn an unsafe string into a safe string without sending it through the validator, it prevents unsafe strings from leaking into the rest of the app by accident.
This concept can be used for pretty much anything where you are doing data validation or transformation.
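A rough Haskell sketch of that boundary (the validation rule here is a placeholder, not a real sanitizer, and in a real module you would export SafeString abstractly, hiding its constructor):

import Data.Char (isAlphaNum)

newtype UnsafeString = UnsafeString String  -- anything from the outside world
newtype SafeString   = SafeString String    -- has been through the validator

-- Centralized validation: the only way to obtain a SafeString.
validate :: UnsafeString -> Maybe SafeString
validate (UnsafeString s)
  | all ok s  = Just (SafeString s)
  | otherwise = Nothing
  where ok c = isAlphaNum c || c `elem` " .,-@"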
Also a good way to prevent hashed passwords from being accidentally logged.
from django.db import models

class PasswordType(models.Model):
    hashed_pw = models.CharField(max_length=128)

    def __str__(self):
        # you can even raise an Exception here
        return '<confidential data>'
Not that you should be trying to log this stuff anyway, but unless you're a solo dev you can't prevent other people from creating bugs - you can, however, mitigate common scenarios.
What are safe and unsafe strings supposed to mean? All strings seem like normal strings to me; a "DELETE * FROM db" is no different from any other string until it's given to a SQL query.
Escaping modes. All strings are not equivalent: "Bobby tables" is very different from "'; drop table users; --".
The idea is to encode the contexts where a string is safe to use directly into the type of the variable, and ensure that functions that manipulate them or send them to outside systems can only receive variables of the proper type. When you receive data from the outside world, it's always unsafe: you don't even know if you've gotten a valid utf8 sequence. So all external functions return an UnsafeString, which you can .decode() into a SafeString (or even call it a String for brevity, since most manipulations will be on a safe string).
Then when you send to a different system, all strings need to be explicitly encoded: you'd pass a SqlString to the DB that's been escaped to prevent SQL injection, you'd pass a JSONString to any raw JSON fragments that's had quotes etc. escaped, you'd pass an HtmlString to a web template that properly escapes HTML entities, and so on. It's legal to do a "SELECT $fields FROM tablename where $whereClause" if $fields and $whereClause are SqlStrings, but illegal if they are any other type of strings. And if you do <a href="$url"> where $url is an UnsafeString, the templating engine will barf at you.
There are various ways to cut down the syntactic overhead of this system by using sensible defaults for functions. One common one is to receive all I/O as byte[], assume all strings are safe UTF-8 encoded text, and then perform escaping at the library boundaries, using functionality like prepared statements in SQL or autoescaping in HTML templating languages. Most libraries provide an escape-hatch for special cases like directly building an SQL query out of text, using the typed-string mechanism above.
That’s the one, thanks for the link! I tried to find it while writing my post but couldn’t for the life of me remember a single thing to even try searching for. I probably last read that article when it was written in 2005.
A safe string is something you got from the programmer (or other trusted source), and an unsafe string is something you got from the network/environment/etc.
Are you genuinely curious, or are you being a troll?
Look at the content of your string, make a decision as to whether you would give it to a SQL engine. If you have not looked, it's presumed unsafe. If you have validated it - parsed it, in the context of this article and this discussion - and decided that you consider it safe, then it is a safe string from that point on.
This isn't a philosophical debate about what "safe" means to humans, it's a programming discussion that says if you only want to pass "select * from reports" to your database, check that's what the string contains before you pass it anywhere.
It's impossible to do SQL-safety validation at any other layer, because otherwise you're making the assertion that someone with the last name "O'Neil" or "Null" (yes! a real name!) may as well give up and legally change it for the "safety" of programmers that are too lazy to do things right.
I am really not trying to be a troll. Genuinely don't understand this concept of safe strings.
How could software even look at text content and determine safeness? There are cases where string input might be limited to just letters or numbers, but often it's not. As soon as punctuation or Unicode (non-English users) is on the table, text is basically anything and there is no general defense against that.
Parsing and static types could have restrictions on string length, min or max value for numbers, how many items in an array, but it cannot make text safe generally-speaking by any meaning of safe. It has no awareness of how the content will be used.
We're not talking about some absolute, metaphysical "safe strings" that guard against every possible flaw, but rather about better supporting an already existing safety check.
If you never thought to write an escaping function in the first place, you can't write a SqlString safe type either, obviously. Equally obviously, if you can write an escaping function but you can't write a function that detects a DROP TABLE, then you can write a SqlString type but not a SelectQueryString type.
The idea being discussed here is simply that if you do write an escaping function, its signature should not be (String -> String) or (String -> Boolean) or, God forbid, (String -> void), but something like (String -> SqlString).
This ensures that whatever you feed to your database must have gone through such an escaping function, instead of expecting the programmer to simply remember it. Also prevents you from accidentally escaping a string twice.
(Obligatory pedantic disclaimer: if you're working with modern databases, please don't escape your own strings and just use parameters instead.)
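In Haskell terms, a sketch of what such a signature buys you (the escaping rule here is deliberately simplistic - see the disclaimer above - and the names are made up):

newtype SqlString = SqlString String

-- String -> SqlString: the type records that escaping has happened.
escapeSql :: String -> SqlString
escapeSql = SqlString . concatMap escapeChar
  where
    escapeChar '\'' = "''"  -- double up single quotes
    escapeChar c    = [c]

-- A DB layer that only accepts SqlString can't be handed a raw String,
-- and a SqlString can't accidentally be escaped a second time.
execute :: SqlString -> IO ()
execute (SqlString q) = putStrLn ("would run: " ++ q)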
I agree with you. The concept of a safe string in isolation is too abstract to be meaningful. A correct API, such as interacting with a database only using explicit parameters (instead of string-concatenating to build up a query), is always safe, irrespective of the provenance of the input. The input could be a virus or a DB command and this would still be 100% safe.
What people mean by safe string in more specific contexts however, is meaningful, but the word "safe" is an unfortunate choice. Instead, think "SqlEscapedString" or "HtmlEscapedString" or "UriEscapedString". These are much more meaningful, and their use-case should be obvious. You can convert an arbitrary input "String" type into a "SqlEscapedString" and then safely use simple string concatenation to build up a query. This is useful in situations where non-parameter parts of the query are dependent upon the input in ways that are not safely exposed in the DB query API. For example, building up complicated WHERE clauses or using dynamic table names.
So you can write something like the following (in pseudo code):
String tableName = ParseFromUntrustedPacket( packet );
SqlEscapedString sqlTableName = new SqlEscapedString( tableName );
SqlEscapedString query = SqlEscapedString.Unsafe( "SELECT * FROM " ) &
                         sqlTableName &
                         SqlEscapedString.Unsafe( " WHERE Foo is NOT NULL" );
var result = connection.Execute( query );
The benefit of this kind of approach is that if that last function call has the signature of "Execute( SqlEscapedString q )", then it is basically impossible to accidentally pass an unescaped (unsafe) input string into it by accident. At every step, the developer is forced to make a decision to either pass in a potentially dangerous query snippet using "Unsafe(...)" or to make input strings safe by escaping them.
Similarly, this method converts Strings into a different type when escaping them, making it (almost) impossible to accidentally double-escape inputs, which is an issue commonly seen in some environments such as complex shell scripts.
ASP.NET for example does something similar with IHtmlString.
Oh, then you're reading too much into "safe" and assuming it means "can never do any bad if used in any situation, must need an AI".
It's the same way software can look at a number that's going to control a water heater and determine whether it's a safe temperature for a human body or not. You the programmer chose some limits. When the user enters a number, it's an unsafe value by default, because you haven't validated it.
After you validate it, you have something which is 'safe' to pass around anywhere in your code - like a security checkpoint: random people are unsafe by default, their details are checked when they enter a building, and then they are OK to go anywhere inside.
You, the programmer, choose what things you consider safe and unsafe and those words mean validated or unvalidated, verified or unverified, checked or unchecked, approved or unapproved, known or unknown, outside or inside, or any other pair.
> it cannot make text safe generally-speaking by any meaning of safe
If something can't be done, ever, in any situation, that probably isn't what people are talking about doing.
The point that's being made here is if you make safe and unsafe strings separate types, in a strongly-typed system, it is impossible to use an unsafe string where a safe string is expected or vice versa. When you have a boundary function that turns an unsafe string into a safe string (e.g., escaping), or that rejects strings that are not safe, you can have a system where all the inputs are unsafe and are forced to go through such a mechanism exactly once to guarantee freedom from double-escaping issues.
I think the above definition of "gets turned into safe strings early" isn't necessarily a clear one.
The general idea is to separate strings into different types, with different rules. E.g. a HTML templating engine will always escape strings unless they're of a specific type (e.g. in Python a popular implementation calls the type "MarkupSafe") that says it's ok to include as raw HTML (e.g. because it's the output of a sanitizer), an SQL query builder will only accept specially tagged strings as non-parameters into queries, ..., which reduces the likelihood of the programmer accidentally using a string in a place where it isn't correct to use. Username field doesn't have any special rules attached? All code will reject unsafe use as far as possible.
Depends on context: for content, only < needs to be escaped; within a tag (but not an attribute) > needs to be escaped; within an attribute, quotes of the same kind that started the attribute value (if any) must be escaped. Then there are legitimate cases of richly formatted user input/markup where you want to restrict script or block-level elements, or elements that can reach out to a container element such as a paragraph or section. I could go on here, but the point is to use HTML-aware template engines and markup processors, not rely on magic escaping routines.
A string that is supposed to represent a ”name” in a web app context — safe or not? I am referring to potential SQL injections.
Surely, no names contain semicolons, but is it the business-logic part of your app that should determine that names which only contain A-Za-z (or whatever) are safe?
It is contextually dependent, meaning in practice as close to the actual SQL query as possible. Or call to file system, where dots are unsafe, and so on.
Static typing helps here as described in the article.
Rust is extremely good at this with newtype syntax and the ability to apply impls and derives to masking types.
Your JSON data might be a string, but having a Json type that guarantees it's valid JSON is way better. Same with stringified Base64, a constrained numeric value for a tunable, etc. Using From impls on these types lets the compiler figure out almost all invalid type usages at compile time and give you eloquent feedback on it.
Your comment gets the history hilariously backwards.
Actually using the type system has been standard practice in ML-family languages since at least the '70s. Simonyi described Hungarian notation as a way of, effectively, emulating a type system in less-powerful languages; describing Hungarian notation as "prefixing with the type" was not a mistake (as Spolsky claims) but an accurate description of how the technique was intended to be used.
The "Systems Hungarian" mistake came about because C programmers misunderstood the term "type" to refer to something far more limited than it does.
The technique in the article is not "encoding Apps Hungarian into the type system". It's doing the very thing that inspired ("Apps") Hungarian in the first place!
The original paper by Simonyi actually does describe prefixing with datatype. E.g. b for byte, ch for character, w for word, sz for pointer to zero-terminated string.
He specifically states:
The basic idea is to name all quantities by their types (...) the concept of "type" in this context is determined by the set of operations that can be applied to a quantity. (..)
Note that the above definition of type (which, incidentally, is suggested by languages such as SIMULA and Smalltalk) is a superset of the more common definition, which takes only the quantity's representation into account
So he is comparing to early object oriented languages where presumably Hungarian notation is not needed anymore because the type system of the language itself is rich enough to constrain the set of valid operations on the value/object.
So the "metadata" encoded in the Hungarian prefix is exactly what would be called type in a modern language. The idea of Hungarian was valid in a language with a insufficiently expressive type system.
I think the term "Typed Hungarian Notation" is really weird. Hungarian Notation is a workaround for a limited type system. "Typed Hungarian" is just... typed code.
This helps the compiler, but only works for languages with type declarations, and is less helpful to the programmer, as to find a variable's kind (e.g. valid data vs. raw data) you need to look at its declaration, which may involve scrolling. The original idea, described in Joel Spolsky's original article on Apps vs. Systems Hungarian (https://www.joelonsoftware.com/2005/05/11/making-wrong-code-...) is that the programmer can see, in the expression that the variable is used or the function is called, that the code is wrong.
I apply a variant of Apps Hungarian in my own code, e.g. along the lines of the following (an illustrative fragment, not my actual code):
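-- "raw" marks unchecked input; only a validation step may drop the prefix.
greetUser :: IO ()
greetUser = do
  rawName <- getLine
  case validName rawName of
    Just name -> putStrLn ("hello, " ++ name)  -- passing rawName here would read wrong
    Nothing   -> putStrLn "invalid name"
  where
    -- placeholder check; the point is the naming convention, not the rule
    validName s = if null s then Nothing else Just s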
I generally like this approach, but it's unfortunately really cumbersome in most mainstream languages like Java (and even Scala 2.x) or C#, but it works beautifully in Haskell because of two features:
- non-exported newtypes (also known as opaque types in Scala 3.x); this means your little module is in full control of instantiation of a 'parsed' value from an 'untyped' value.
- DerivingVia (not sure if there's an equivalent in Scala 3.x); this makes declaring type class instances equivalent to the 'wrapped' value completely trivial.
These two features (and, I guess, the pervasiveness of type classes) in Haskell make this sort of thing as close to zero effort as possible.
EDIT: Looking at some sibling posts, I think my C# knowledge may be out of date. Apologies for that.
I think you mean GeneralizedNewtypeDeriving, not DerivingVia. GND lets you pull instances through a newtype, trivially. DerivingVia lets you define your instance in terms of an instance on another newtype.
Exactly. I intentionally avoid GND because it can be quite wonky, and DerivingVia basically subsumes it. You do still have to have slightly more explicit declarations of instances, but IMO it's worth it for the control.
I wish there were a type system for numerical computation that could encode things like the rank and size of tensors and could statically check them before running. Currently, catching bugs requires running the code, which can make for a slow development loop. Plus, it would make for a much more robust system than documenting these things in comments.
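Something in this direction can be sketched today with Haskell's DataKinds (illustrative only; real tensor-shape libraries take it much further):

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

data Nat = Z | S Nat

-- The length lives in the type.
data Vec (n :: Nat) a where
  Nil  :: Vec 'Z a
  Cons :: a -> Vec n a -> Vec ('S n) a

-- Total: the type rules out empty vectors at compile time.
vhead :: Vec ('S n) a -> a
vhead (Cons x _) = x

-- Zipping vectors of mismatched lengths is a compile-time error.
vzip :: Vec n a -> Vec n b -> Vec n (a, b)
vzip Nil         Nil         = Nil
vzip (Cons x xs) (Cons y ys) = Cons (x, y) (vzip xs ys)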
What you’re describing is in a similar vein to what is described in my blog post, and it’s often absolutely a good idea, but it isn’t quite the same as what the post is about. Something like NameType attaches a semantic label to the data, but it doesn’t actually make any illegal states unrepresentable… since there really aren’t any illegal states when it comes to names. (See, for example, https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-... .)
In other situations, the approach of using an abstract datatype like this can actually rule out some invalid states, and I allude to that in the penultimate section of the blog post where I talk about using abstract types to “fake” parsers from validation functions. However, even that is still different from the technique focused on in most of the post. To illustrate why, consider an encoding of the NonEmpty type from the blog post in Java using an abstract type:
import java.util.List;
import java.util.Optional;
import com.google.common.collect.ImmutableList;

public class NonEmpty<T> {
    public final ImmutableList<T> list;

    private NonEmpty(ImmutableList<T> list) {
        this.list = list;
    }

    public static <T> Optional<NonEmpty<T>> fromList(List<T> list) {
        return list.isEmpty()
            ? Optional.empty()
            : Optional.of(new NonEmpty<>(ImmutableList.copyOf(list)));
    }

    public T head() {
        return list.get(0);
    }
}
In this example, since the constructor is private, a NonEmpty<T> can only be created via the static fromList method. This certainly reduces the surface area for failure, but it doesn’t technically make illegal states unrepresentable, since a mistake in the implementation of the NonEmpty class itself could theoretically lead to its list field containing an empty list.
In contrast, the NonEmpty type described in the blog post is “correct by construction”—it genuinely makes the illegal state impossible. A translation of that type into Java syntax would look like this:
import java.util.List;
import java.util.Optional;
import com.google.common.collect.ImmutableList;

public class NonEmpty<T> {
    public final T head;
    public final ImmutableList<T> tail;

    public NonEmpty(T head, ImmutableList<T> tail) {
        this.head = head;
        this.tail = tail;
    }

    public static <T> Optional<NonEmpty<T>> fromList(List<T> list) {
        return list.isEmpty()
            ? Optional.empty()
            : Optional.of(new NonEmpty<>(list.get(0), ImmutableList.copyOf(list.subList(1, list.size()))));
    }
}
This is a little less compelling than the Haskell version simply because of Java’s pervasive nullability and the fact that List is not an inductive type in Java so you don’t get the exhaustiveness checking, but the basic ideas are still the same. Because NonEmpty<T> is correct by construction, it doesn’t need to be an abstract type—its constructor is public—in order to enforce correctness.
Maybe I shouldn't have included the Java example; some others have jumped on it as well. That wasn't meant to be a summarization, but a similar idea in a different context.
You also seem to have missed the point of the Java example anyway, in a misses-the-forest-for-the-trees way. It was meant as a 1:1 example of an important result of your blog post, not an example of constructing the parse function:
> However, this check is fragile: it’s extremely easy to forget. Because its return value is unused, it can always be omitted, and the code that needs it would still typecheck. A better solution is to choose a data structure that disallows duplicate keys by construction, such as a Map. Adjust your function’s type signature to accept a Map instead of a list of tuples, and implement it as you normally would.
Using NameType instead of String is the same as using a Map instead of a list of tuples.
That was why I included AddressType in the Java example. Just like changing the function signature to not accept a list of tuples and require a Map instead forces you to use the parse function to construct the Map, functions that only work on AddressType or only work on NameType can't receive the other one as an argument - whereas with a String, they could. They have to pass through the requisite parse function to convert a String into a NameType or AddressType first, however those are implemented.
And I've seen the falsehoods lists before; "Name" and "Address" were simply the first thing that popped into mind while typing that up. Examples are just examples, and Name and Address are conceptually different enough regardless of the falsehoods that the main idea behind the example ought to get by anyway.
> You also seem to have missed the point of the Java example anyway, in a misses-the-forest-for-the-trees way.
Perhaps I did, yes. I do think the kind of thing you’re describing is valuable, to be clear. A lot of my comment was intended more as clarification for other people reading these comments than as an argument against what you were saying. I imagine that if I misunderstood your point, other people are likely to, too, so if anything, your clarification is generally a good outcome, I think!
I believe Joel Spolsky discussed this in an article: putting a wart on the name to mark strings holding raw input which is at risk of injection attacks.
I currently use this in my language "Grammar" (https://jtree.treenotation.org/designer/#standard%20grammar). In Grammar you can currently define two things: cellTypes and nodeTypes. Grammar looks at the suffix to determine the type. So if you are building a language with the concept of a person node and an age cell, you'd write "personNode" and "ageCell". "ageCell" might extend "intCell". I've found it a pleasure to use.
Read all the way down through the "Parsing, not validating, in practice" header. It eventually gets to examples exactly like mine, where "parse" means "transform into a different datatype so it can't be used wrong", while "validate" means "generate a message, but don't preserve that message in the type system in a way that's usable later".
On one of my first jobs all the class names were prefixed with C, but it wasn't carried through to variables or anything. I said we should stop prefixing class names with C and got shouted down. That's where I learned to mostly just do what I want without asking. Once you have 5k lines of something your way it is really too late for anyone to ask you to change it :)
Your comment is wrong. You are talking about "encapsulation tricks", where you don't make illegal states unrepresentable, but just gate all the construction so you can audit it and then declare that the unrepresentable state won't be constructed.
Sometimes this is good, but the benefits are a lot lower, and many an overzealous programmer has taken the "boolean blindness"/"string blindness" argument too far with copious newtypes.
Actually ruling out invalid states by construction, however, is way more valuable, and far more likely to be worth the work. It requires more thought than just newtyping away, but that's a feature, not a bug.
Yes, yes, yes! Encode invariants in your data, don't enforce invariants on your data. I particularly think this point needs to be stressed more, because practically speaking, not every language has the typechecking faculties of Haskell:
> Sometimes, making an illegal state truly unrepresentable is just plain impractical given the tools Haskell provides, such as ensuring an integer is in a particular range. In that case, use an abstract newtype with a smart constructor to “fake” a parser from a validator.
For instance, few type systems will let you encode prime numbers as datatypes, so you'd have to do something in your language like
newtype Prime = Prime Integer

mkPrime :: Integer -> Maybe Prime
mkPrime p
  | isPrime p = Just (Prime p)   -- isPrime from e.g. the primes package
  | otherwise = Nothing
This is why learning Elm has been valuable for me.
Without the convoluted type system of Haskell, you have to do this more, and the result is a lot simpler to understand and maintain. You are never going to be able to use the type system to guarantee everything about your data, so let your runtime code check it, add lots of tests, and keep it isolated (Elm calls these "Opaque Types") so it is easy to reason about.
The equivalent is possible in OO with classes, but only if you make everything immutable and ideally have a layer of compile-time null checking on top.
In short, with Haskell I am learning and relearning language features constantly and banging my head. With Elm I grok the language and I am focused on solving the problem.
I think it’s fantastic that Elm works well for you; it’s a wonderful language and there’s no doubt that it is much simpler than Haskell. I agree with you that, in many cases, the technique you’re describing is sufficient, and it requires much less sophisticated machinery than some of the fancier Haskell techniques.
That said, I do think what you’re describing is different from what is described in the blog post. Opaque/abstract types reduce the trusted code significantly, which is fantastic, but the example provided in the blog post is actually even better than that: because NonEmpty is correct by construction, the amount of trusted code goes down to zero. A lot of the tools Haskell provides, like GADTs and type families, can be used to capture more invariants in a correct-by-construction way, rather than relying on abstraction boundaries. There are advantages and disadvantages to both.
I certainly don’t have any ill will toward anyone who finds the complexity of Haskell not worth the extra benefits, so to be quite clear, I’m not trying to sell you on Haskell instead of Elm! I do think being aware of the distinction is still useful, though.
Good reply. My view of Haskell is that if I had to do it 40 hours a week in a team for 50 weeks, I'd get used to it, would see the benefits, and would get itchy using any language without them. I'd have wiser colleagues to help me learn and I'd see the patterns in an existing code base.
But as a hobby FP'er looking from the 'outside' Elm is more suited to me. So the point is my Haskell vs Elm preference is somewhat circumstantial.
It's like the home DIY person preferring to order a kitchen from IKEA and install it, instead of making their own cabinetry.
I learned a lot from this article and hope to revisit it.
I had one quarrel:
> Consider: what is a parser? Really, a parser is just a function that consumes less-structured input and produces more-structured output. By its very nature, a parser is a partial function—some values in the domain do not correspond to any value in the range—so all parsers must have some notion of failure.
I think it's reasonable to include some total functions as parsers as well, because some grammars -- structures that are recognized by the parser and that specify languages -- generate all strings as the corresponding language. In that case, there are no failure cases because every string is a valid input.
I came up with some more-trivial examples, but I think a less-trivial example is the split() function in many languages. Its Haskell type signature is String -> [String], and it is a total function. It handles the case where the input is a nonempty string with no delimiters (the output is a list containing a single token equal to the input), as well as the case where input is an empty string (the output is an empty list), and the case where the input consists only of delimiters (in many implementations, the output is a list containing n+1 empty strings).
Another trivial-ish case is where the grammar allows an arbitrary trailing string following the object of interest (similar to C's atoi). In that case, a parser will always return the empty object if there is no interesting or relevant data in the input. (For example, atoi returns 0 if the input string doesn't begin with an integer.) This could be viewed as a kind of error, but there may be applications in which this behavior is useful.
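To make the split() point concrete, here's a sketch using splitOn from Haskell's split package (note this particular implementation returns [""] rather than an empty list for empty input):

import Data.List.Split (splitOn)  -- from the "split" package

-- splitOn is total: every input String maps to some [String].
example :: [[String]]
example =
  [ splitOn "," "a,b"  -- ["a","b"]
  , splitOn "," ""     -- [""]          (empty input still "parses")
  , splitOn "," ",,"   -- ["","",""]    (n delimiters yield n+1 tokens)
  ]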
I think this is a reasonable point, and your split() example is a good one. I wasn’t sure while I was writing the blog post if I considered isomorphisms to be parsers, and I’m still not entirely sure if I do or not, but there’s an argument to be made that they are simply parsers where the set of possible failures is empty. I don’t have strong feelings about that distinction one way or the other, though.
Further nit-picking: even if there are no errors, I don't think it's an isomorphism unless the parser preserves all information? Usually parsers treat some input strings as equivalent (by stripping whitespace and comments), so there are fewer outputs than valid inputs.
I can't think of an easy way to make string splitting an isomorphism in both directions. You can serialize any list of strings to a comma-separated list, but only if you use escape sequences or encode the length of each item, and then there are some input strings that are errors.
I guess you could define a type as the set of all valid strings in the input language, but that's going out of your way to make sure parsing has to be done more than once.
Parsers that report no errors don't seem desirable anyway; error-checking is usually good.
The idea really is that parsing subsumes validation. If you’re parsing, you don’t have to validate, because validation is a natural side-effect of the parsing process. And indeed, I think that’s the very thesis of the blog post: parsing is validation, just without throwing away all the information you learned once you’ve finished!
I understood the article, and I see your point, but my main feedback is that when I read the title I assumed that you were going to describe some way where you would parse input without validating it.
One thing to bear in mind is that headlines are not prose, they are shorthand.
So you have to assume there could be missing words, since that’s how headlines work.
In this case there is a dropped “just”:
Don’t (just) validate: parse!
If the headline was “Parse, never validate” then I would agree with your reading. But “don’t” is pretty lightweight. I’m having a hard time thinking of a lighter discouragement word. Given that, I think it’s unfair to assume the strongest discouragement is implied.
This seems needlessly pedantic - we could also notice that a total function is just a special case of a partial function, and so total functions are already included by the article.
Language-theoretic security was the first thing that came to mind when I read the title. Was pleasantly surprised to see it referenced at the end of the article.
Dear awesome Haskell writers. You are writing great articles, but I can't understand the code examples, as Haskell is too far from the programming languages I know. Can you provide examples in TypeScript / Rust / Swift / Reason or another C-like language? If not, I'm curious why?
Love, Iddan
However, to answer your “why” question: mostly just because I write Haskell for a living, and my primary personal motivation for writing this blog post was to share with my coworkers and to point to during code reviews. Therefore, it makes the most sense for the examples to be in Haskell! Haskell is also especially well-suited for communicating this sort of thing if you already understand the notation, as you can sort of see from the Java translation: it’s significantly more verbose just to get the same idea across. Still, it’d be great if someone were to provide a “translation” of the same ideas to other languages, no doubt!
Haskell is in fact a programming language, despite what the phrasing of your comment might imply. Translation from language to language is often difficult; we don't typically do it from English to, say, Spanish, not because it wouldn't be valued but because we may not have the same expertise in Spanish and it would be expensive to hire a person to do it. Expertise is important, because key ideas may get lost in translation.
Since you mentioned typescript first, here's a library that is basically an implementation of the ideas OP tries to convey: https://github.com/gcanti/io-ts
Can't recommend using this library enough; being able to essentially reject any input (e.g. json over http) at runtime because it does not conform to the type definition is such an amazing thing!
I think that leveraging the type system to enforce invariants about your data model is one of the most effective things you can do as an engineer.
I gave a talk on this topic at my workplace, using Elixir and Haskell as examples. It's a little haphazard, since it wasn't written for public consumption, but someone might find it interesting: https://github.com/QuinnWilton/programming-with-types-talk
Not only was this the first article about Haskell that I could actually understand, it was also the first article where type annotations make sense - where they actually help you think about edge cases, instead of just annoying you.
In my personal experience, type annotations (not to be confused with type systems) make the code harder to read, for very little benefit. I would like a type checker where, instead of you having to babysit the type checker, the type checker babysits you. Instead of the type checker crying over trivial issues and needing your comfort, the type checker should comfort me when I'm sad.
While manually writing human-friendly error messages helps, you still have to discover edge cases yourself. It would be nice to have a tool, not necessarily baked into the types/system, that automatically detects edge cases - for example, detecting when a variable has the possibility of becoming undefined/null. Maybe achieved via fuzzing.
A colleague once asked me, "How do I get rid of the damn nulls?"
Nulls are basically a lazy error/exception. Error handling is so laborious we often "forget" to do it, and this is very confusing for beginners. There are basically two camps: one that thinks errors are exceptions, and one that thinks errors should be first-class citizens. But there are different domains, of course. I don't want my CPU to maybe return an error message when I ask it to multiply two numbers; I want that to always work. I don't care if the result gets shifted (that would be a job for the type checker to detect, or better, automatically use mult64 instead of mult32), because handling errors at that level of detail would be too annoying and bad for performance. Writing network services, however, it is crucial that I get notified if a request failed.
For what it's worth: it's _very_ rare for a type annotation to be required in Haskell. It's just considered best practice (supported by compiler warning options) to have them on all top-level declarations, for two reasons:
- it's a good extra layer of _human-readable_ documentation about basic behavior and requirements (as in this article); and
- it lets the compiler give better error messages.
The compiler is _always_ going to do its own inference, whether you give it a signature or not. If it infers a valid type _which isn't the one you thought you wrote_, and you didn't specify what the type _should_ be, that function/statement will compile -- you won't get an error until later, when you attempt to consume it and the types don't match. This can be harder to trace, especially if there's a bunch of polymorphic code in the middle where any type will pass. Supplying a manual annotation lets the compiler immediately say "hey, these don't match."
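A small illustration of that failure mode (hypothetical code):

-- Without a signature, GHC infers the most general type:
--   mystery :: Fractional a => a -> a
mystery x = x / 2

-- If you *meant* Int -> Int, this definition still compiles; the error
-- only surfaces later, at a call site:
--   mystery (3 :: Int)   -- "No instance for (Fractional Int)"
--
-- With an explicit signature, the mismatch is reported at the definition,
-- where the mistake actually is:
--   mystery :: Int -> Int
--   mystery x = x / 2    -- rejected immediately: no Fractional Int instance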
I would like to see an example of something less trivial than encoding `not empty` in the type. At some point the difference between what the author calls "parsing" and "validating" gets blurry, or runs into things the type system can't encode.
You can always use abstract data types to indicate in the type system that you know something that can't directly be expressed. The point is not that "parsing" and "validating" are distinct concepts, but that they're two different points of view on the same problem.
It's pretty cute. Still no static types for "Strings that end with a dot" or "Strings that match [a-z]+" or "Strings that are proper nouns". ;)
I'm not sure I understand the benefits of this compared to a traditional parser generator.
"Use a data structure that makes illegal states unrepresentable." is not effort spent well.
(And given Gödel's incompleteness theorems, unachievable in the general case.)
A proper parser generator using a grammar will give you the structural guarantees you need, a lexer will reject invalid terms. And a transformation step gives you the target structure.
Eventually you will have to validate though at runtime.
> Still no static types for "Strings that end with a dot" or "Strings that match [a-z]+"
Sure there are. :) Technically speaking, anyway. Here’s a type for “strings that end with a dot”:
-- | A string that ends in the character '.',
-- represented as a string without its trailing dot.
newtype StringEndingInDot = StringEndingInDot String

fromString :: String -> Maybe StringEndingInDot
fromString str = case reverse str of
  '.':cs -> Just $ StringEndingInDot (reverse cs)
  _      -> Nothing

toString :: StringEndingInDot -> String
toString (StringEndingInDot str) = str ++ "."
And here’s a type for “strings that match [a-z]+”:
data Letter = A | B | C | D | ... | X | Y | Z
type StringAZ = NonEmpty Letter
Now, admittedly, I would never use either of these types in real code, which is probably your actual point. :) But that’s a situation where there is a pragmatic “escape hatch” of sorts, since you can create an abstract datatype to represent these sorts of things in the type system without having to genuinely prove them:
module StringEndingInDot (StringEndingInDot, fromString, toString) where

newtype StringEndingInDot = StringEndingInDot { toString :: String }

fromString :: String -> Maybe StringEndingInDot
fromString str = case reverse str of
  '.':_ -> Just $ StringEndingInDot str
  _     -> Nothing
You may rightly complain that’s a lot of boilerplate just to represent this one property, and it doesn’t compose. That’s where the “Ghosts of Departed Proofs” paper (https://kataskeue.com/gdp.pdf) cited in the conclusion can come in. It provides a technique like the above that’s a little more advanced, but brings back composition.
>Technically speaking, anyway. Here’s a type for “strings that end with a dot”:
These are just two functions that wrap and unwrap a string. Your "type" doesn't generate a compile-time error when constructed with the wrong data. This doesn't even handle the simplest use case:
A properly enforced static type would not need to emit Maybe, because it would be impossible to set it to the wrong value in the first place.
Not to mention that to be truly useful a static type would need some sort of literal that would emit a compile-time error if the supplied init data was wrong.
In short, examples above do not demonstrate Haskell's ability to statically enforce rules like "string needs to end with a dot".
Now, you could make a type that always prepends a dot when asked for a string representation, which happens to enforce this specific constraint, but you cannot use a trick like this for most constraints (such as the second example of "only alpha characters").
The problem, as you hint at, is that Haskell lacks a literal syntax for "strings that end with a dot". One could fake this up with Template Haskell if one was so inclined. On the other hand, the representations proposed are representations of "strings that end with a dot" and "alphanumeric strings", their dubious ergonomics (and utility) notwithstanding.
I think (but cannot guarantee) there's nothing much stopping you writing template Haskell to construct valid values at compile time if you want. It's just that most of the time you're (or at least I am) happy to use "unsafe" functions because the verification is simple enough to do in your head, and the other 99% of the time I'm creating values it's from run-time data.
> It's pretty cute. Still no static types for "Strings that end with a dot" or "Strings that match [a-z]+" or "Strings that are proper nouns". ;)
That’s not true, you can wrap these in types that can only be constructed from valid data, and from that point on you can be sure everything is correct.
data EndsWith : Type where
  EndsWithPeriod : (s : String) -> { auto prf : ("." `Strings.isSuffixOf` s = True) } -> EndsWith

s : EndsWith
s = EndsWithPeriod "."

-- Value of type False = True cannot be found
-- s2 : EndsWith
-- s2 = EndsWithPeriod ""
The noun one might be possible -- I'm honestly less familiar with what a proper noun is offhand. (E: prefix=>suffix)
Ignoring all the convoluted type examples here, the simplest case is that you create the type that matches the parsed value. For example, you can create an EmailAddress class that can only contain valid email addresses. Instead of String, you use EmailAddress whenever you need to ensure the valid parsed value.
Nice article! It got me thinking about an issue I've noticed in dynamically typed languages (namely JavaScript) where it's very easy to end up doing lots of (potentially redundant) validation all over the place because it's much more difficult / unwieldy to pass that information along.
In this case not even TypeScript can rescue you. You might naively implement:
const head = <A>(input: A[]) => input[0]
and it will happily infer a return type of A. Then you'll fall flat on your face the first time you encounter an empty array:
head([]) // -> undefined
To make it correct you need to explicitly define a return type of (input: A[]): A | undefined, just like the Maybe. It's obviously impossible for TS to guarantee that an array will not be empty at runtime, but I wish this specific case triggered some help from the type checker.
> It's obviously impossible for TS to guarantee that an array will not be empty at runtime, but I wish this specific case triggered some help from the type checker.
    type NonEmptyArray<T> = { head: T, rest: T[] };

    function makeNonEmptyArray<T>(array: T[]): NonEmptyArray<T> | null {
      if (array.length === 0) return null;
      return { head: array[0], rest: array.slice(1) };
    }

    function head<T>(array: T[]): T | null;
    function head<T>(array: NonEmptyArray<T>): T;
    function head(array: any): any {
      if (Array.isArray(array)) {
        if (array.length === 0) return null;
        return array[0];
      }
      if (typeof array === "object") {
        return array.head;
      }
    }
It's impossible for Haskell to guarantee an array will not be empty at runtime either; that's why we can write a new type + a smart constructor to track runtime properties.
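For reference, base ships exactly this pattern for lists: nonEmpty in Data.List.NonEmpty is the smart constructor. A minimal sketch:

    import Data.List.NonEmpty (nonEmpty)
    import qualified Data.List.NonEmpty as NE

    -- The emptiness check happens once, at the boundary; NE.head is total.
    firstChar :: String -> Maybe Char
    firstChar s = NE.head <$> nonEmpty s

    main :: IO ()
    main = do
      print (firstChar "abc")  -- Just 'a'
      print (firstChar "")     -- Nothing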
And you can't know whether it checks if the file exists first or not. From pure habit you add checks... (what? look at the source code of the library? that's crazy!).
The “exists” state is dynamic (something else in the system could be in the process of deleting the file since it was last checked). It’s more reliable if you don’t check in advance and simply try to open the file every time you need it. You handle errors while opening the file.
This is really interesting. For systems programming, files usually have clear ownership and modification context. If I call exists(), the file should still exist for as long as that thread is paused.
It's an anti-pattern to throw errors for expected behavior.
Which is why you don't throw errors but return a result type such as Rust's Result, Haskell's Either, or C++'s upcoming std::expected :)
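In Haskell that might look like this (a sketch; ParseError and parseAge are illustrative names):

    -- Failure is an ordinary value in the return type, not a thrown exception.
    data ParseError = Empty | NotANumber String
      deriving Show

    parseAge :: String -> Either ParseError Int
    parseAge "" = Left Empty
    parseAge s  = case reads s of
      [(n, "")] -> Right n
      _         -> Left (NotANumber s)

Callers are forced by the type to handle both cases; there is no invisible control flow.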
On the other hand, non-atomically asking for permission first in any concurrent (or re-entrant!) context is not just an antipattern but simply wrong and always a possible race condition. It is impossible to write a routine that tries to open a file and is statically guaranteed to succeed.
The idea appears to me to have been popularized by Python, where checking beforehand is outright called an antipattern [0]. The language also uses exceptions for standard control flow - for example, there's one called StopIteration that the language uses to signal to loops that the current generator or iterator has reached its end.
For me, the funny thing about this article is that I kind of had the core insight just yesterday while hacking together a cli framework, but like the author, I couldn't quite explain what I learned.
Am I right to suspect the only reason "there can’t be any performance overhead" is that this is being done in a lazy language like Haskell? Meaning the statement won't hold in >99% of practical cases? Or did I misunderstand something?
No, this doesn't have anything to do with Haskell's lazy evaluation (at least not in the NonEmpty list example presented). The general idea is that if you are going to perform validation checks at some point in your dynamic language, you won't lose any performance performing those checks up-front in your static language.
But how can this possibly be true in general? As an example, just imagine I want to check that a string represents a list of space-delimited integers, where each one is also less than 10 digits. It's far more trivial to verify that than to actually parse the integers. And by performing the verification pass separately before the parsing pass, I can reject invalid inputs early, leading to much faster rejections than if I parse them as I validate them. The only way I can see there being zero cost difference is if everything is implicitly lazy, such that at runtime the verification won't even happen until the parsing needs to be performed too. Right?
It's true that in the failure case, you can get faster results by short-circuiting. If failure cases are a substantial portion of your runtime, then yes, doing a fast pre-pass can be more efficient.
I'd speculate that in the real world, 99% of cases have time dominated by success-cases. Exceptions would be things like DOS attacks.
No, that's just wrong. It's not just failures. Imagine I verified they were all zero; then I wouldn't have to do a full parse. Or if I verify they have only a few digits, then I avoid bignum parsing. I think this proves what I'm saying: this simply isn't true in general.
It's not necessarily wrong, though what you say makes sense.
Any code that you write, you can move into the parser module. In your example, you have a function that checks the strings are all zero, and you wouldn't call the parser if it returned true. But you can simply declare this code to be part of the parser and just not call the rest of the parser. The difference is in the API: in the one case, you return false and in the other you return a parse error.
Now, you may say that, knowing this information, you're going to parse into ints instead of BigNums. But the parser can do this too.
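A Haskell sketch of "folding the check into the parser" (names are illustrative, and signs and other edge cases are ignored):

    import Text.Read (readMaybe)

    -- The all-zeros fast path becomes part of the parser's result type,
    -- so callers get the cheap answer without a separate pre-pass.
    data Parsed = AllZeros | Numbers [Integer]
      deriving Show

    parseNumbers :: String -> Maybe Parsed
    parseNumbers s
      | all (`elem` "0 ") s = Just AllZeros  -- cheap scan, no full parse
      | otherwise           = Numbers <$> traverse readMaybe (words s)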
You might also say that I happen to know, due to some context, that all the numbers have 10 digits or fewer. And that therefore I can do better than the compiler. But if you make the context and the strings the input to the parser, then you bring it level again.
Or you might say that my application has some context that gives it an edge but the parser is a general purpose library that does not understand the context. In that case, your application does indeed have an edge. But that applies to any general purpose library and is not a point about the merits of parsing versus validation.
What's maybe more interesting is if you might only need some of the parsed results or the parsed results wouldn't fit into memory. But for that you just need an incremental or streaming compiler.
This is actually pretty common in real-world compilers. For example, JavaScript engines typically don't fully parse function bodies in their initial pass through the source.
> In your example, you have a function that checks the strings are all zero, and you wouldn't call the parser if it returned true. But you can simply declare this code to be part of the parser and just not call the rest of the parser.
Again -- like with the previous argument about "error" vs. "success", this is a red herring built around the fact that my example was so trivial and easy to describe this way. What if the logic was supposed to be "if the digits are all zero then don't store the numbers in the database, otherwise subtract them from my account balance"? You avoid a database hit if they're all zero, so are you going to say debiting your account is now "parsing" too?
I understand the concept of wrapping values in specific types, which gives you certain guarantees at compile time. And I really like this concept and will play around with it some more, because the empty-something issue is something I myself struggle with. But what really irks me is the usage of throw in the exceptional case. My go-to type in these situations would be Either rather than a throw, but this would create nearly the same issues at the call site as a Maybe would. Now, one could argue that this is an exceptional case the user or program can't possibly handle. So how do you handle it then? My main usage is in Rust, and here the panic! macro seems to be used as often as unsafe raw pointers :)
Using this programming style requires a rather powerful type system if you want to go past the simple examples the author shows and encode more complex validity conditions. I am still learning these things, but AFAIU the extreme edition of all this is the Calculus of Constructions used in Coq and Lean, where you can encode theorems and algorithm specifications as very complicated types; proving a theorem corresponds to exhibiting an element of the theorem itself (seen as a type), and if you find an element of a type encoding a certain specification, Coq can extract for you a program that implements that specification.
This is what I understood, at least. Things are quite complicated!
* Subclass or compose Number/Matrix types whenever possible (e.g. Users, Packages, Widgets)
* Use immutable collections (e.g. Google Guava library in Java)
I have built very powerful software with small teams using these principles.
At scale, your day to day programming becomes easy as the compiler is doing all the error checking.
It is also very slow to program this way. You write A LOT of boilerplate, and there are only so many code-generation libraries you can throw at it before it becomes a pain to set up your IDE.
But it is worthwhile for complex applications. We did not have bugs.
This post shows simple "positive" assertions, without any combinatorial logic associated with the proofs. But it is similar to programming by contract.
> Using this programming style requires a rather powerful type system, if you want to go past the simple examples the author shows and encode more complex validity conditions.
You need a strong type system if you want to encode complex constraints at compile time, but you can still encode a surprising number of conditions in weaker type systems. And if you can't enforce a constraint you care about at compile time, capturing that constraint in an ADT using sum types and runtime assertions can still provide a lot of value to anyone reading or maintaining your code.
This sort of style requires a lot more diligence and verbosity than the same code in Haskell would, but I think dynamically typed programs can make a lot more use of their type system than they usually do.
At the very least, humans can try to pick up the slack where a compiler is insufficient, and while designing your data model in Ruby as if it were in Haskell may not help the interpreter at all, it can give the programmer a lot more context about how the code is intended to work, and what assumptions it is making about its data.
I think it's a pragmatic style even with languages with much less powerful type systems. You might have to do more of the work yourself -- manually creating types, for example -- but the principle is good.
I think it's advice you don't have to go all in on -- you get benefits even if you use it just a little.
One example is with nullable types.
If you have:

    val foo: Whatever?   // may be null; Whatever and Whatever? are two separate types

    foo.someMethod()     // compilation error, because foo is not of type Whatever

    if (foo != null) {
        foo.someMethod() // works, because foo is smart-cast to Whatever after the null check
    }
In the same way doing a switch on a type smart casts the variable to that type. Smart casting of nullable types also gets rid of a lot of clumsy things like Optional, Maybe, etc. you'd need in other languages where you'd have to call a method to extract the value, assign it to a new variable, etc.
What is the equivalent in TypeScript?
    const head = (arr: Array<number>): number => arr[0];
happily returns undefined if you pass an empty array (with strict null checks)
TypeScript is sadly very unsound by design, so doing this kind of thing in TypeScript is always going to be more “best effort” and less “exhaustive proof.” That’s not necessarily wrong or bad per se, as after all, Haskell has escape hatches, too (and so do dependently-typed languages, for that matter). However, there are philosophical differences between the way the languages’ type systems are designed.
When I’ve talked to people who develop and use TypeScript, the general impression I’ve gotten from them is that TS is actually as much about tooling as it is about correctness. TS enables things like autocomplete, jump to definition, and quick access to documentation, and it does that by leveraging the typing information to “see through” the dynamic dispatch that makes that sort of thing infeasible in dynamically-typed JavaScript. The TS type system does provide some correctness benefits, don’t get me wrong, but where ease of use and correctness are in conflict, usually ease of use is preferred.
This is true even with all the “strictness” options enabled, like strict null checking. For some examples of TypeScript unsoundness, see [1] and [2]. Flow is actually generally a lot better than TS in this regard (though it does still have some major soundness holes), but it looks like TS has basically “won,” for better or for worse. But again: TS won because its tooling was better, not because its ability to actually typecheck JS programs was better, which I think reinforces what I said in the previous paragraph on what TS is really about.
I recently [0] updated my Control Flow Graph representation in a Typescript parser for AVM1 bytecode because I had the exact same issue as in the article. I previously represented my graph as an array of blocks. But a graph always has at least one block (the entry point). The old representation forced me to check for the empty case despite knowing that valid graphs always have at least one block.
Switching to this new representation then allowed me to simplify the code manipulating this graph. This representation is also simple enough that most schema validators can represent it.
Note that with this representation you no longer have a single array. It's not an issue in my case, but your situation may vary. You may try the tuple-based solution mentioned in a sibling comment if you need to keep a single array.
When we started writing the code for Heavenly-x (https://heavenlyx.com), the first thing we needed to write before anything else was the definition of a data structure that represents the requirements as closely as possible (a concrete syntax tree).
We’re building a tool with a custom DSL for building CRUD GraphQL APIs, with auth and data validation and transformation. So our architecture consists of three phases:
- Parser
- Setup
- Runtime
There’s no way the parser would succeed if the input is not valid. We’re writing our software in Scala, using parboiled2 to parse the DSL into a concrete syntax tree: if parsing succeeds, the input is valid, and if it fails, it fails early, so we don’t have to worry about validation in later phases. We wrote some custom validation logic that traverses the concrete syntax tree after the parser to check the few requirements we couldn’t encode in the concrete syntax tree itself, but it’s really a small portion of our codebase and it’s easy to maintain.
At the Setup and the Runtime phase we assume that the concrete syntax tree is valid.
At the Runtime phase we have a running GraphQL server so we have a parsing phase too but it’s handled by Sangria so we don’t have to worry about it.
We are also building a UI for those who don’t like using our language. It’s a React app where the state data structure looks exactly like our concrete syntax tree.
The critical validation step happens when the abstract type is created, not when it is used or passed to other functions...similar to the example in TFA.
The added benefit is that you get a type that behaves pretty closely to a class, but adds no typing or runtime memory overhead. E.g. here's an example for an "unsafe" string : https://gist.github.com/jdonaldson/9a09287e540e7392857f
Another benefit is that abstract types can define relationships between multiple types in this way, making it possible to perform validation for functions that need to handle "Either<T>"-style types.
Woah cool. Back when I was using Haskell (uh, ten years ago, now), people kept telling me that they were _pretty sure_ there was a flag that detected incomplete pattern matches at compile time, but no one could ever tell me what it was. -Wincomplete-patterns . Now I know. Thanks!
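For anyone who wants to see it fire, a minimal example (compile with ghc -Wincomplete-patterns, or just -Wall, which includes it):

    data Color = Red | Green | Blue

    describe :: Color -> String
    describe Red   = "red"
    describe Green = "green"
    -- With -Wincomplete-patterns, GHC warns that the match is
    -- non-exhaustive and that Blue is not matched.

    main :: IO ()
    main = putStrLn (describe Red)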
It doesn't make sense to use JSON as the interchange format for a statically typed language. There is an impedance mismatch between the two. You are forced to infer types which are not actually there.
The reason why JSON is so popular is the same reason why dynamic typed languages became popular. The interfaces between components are simple and human-readable. A library written in a dynamically typed language can be effectively integrated into any system without the possibility of type mismatches.
If you have an API written with Protocol Buffers, every system which interacts with yours needs to agree with your type system; this creates tighter coupling.
Doesn't this hold true for all serialization ever? Replace JSON with "bytes" (aka any serialized data) and it still holds:
It doesn't make sense to use bytes as the interchange format for a statically typed language. There is an impedance mismatch between the two. You are forced to infer types which are not actually there.
> It doesn't make sense to use JSON as the interchange format for a statically typed language.
Two problems here. One, there are multiple statically typed languages with different type systems, and it is nontrivial to translate between them. It's not "having a type system" versus "not having one"; it's type system A vs. type system B vs. type system C vs. none.
Secondly, this is only true if you assume that no dynamically typed language will ever want to interoperate with your code - which is almost always incorrect to presume.
> But the type itself doesn't say that it must return the first element.
Sure, I didn’t say it does. That type isn’t a full specification of head. This blog post isn’t about proving literally everything in your program correct, which is impractical anyway because if you did that it would be as easy to prove the wrong thing as to write a bug in the first place. Some functional tests are still needed.
> Okay, what about more complex cases? E.g. how do you describe type "such value doesn't exist in the db table yet"?
I’ve addressed that kind of question more generally in another comment[1], but for that specific case, I think it probably isn’t worth proving, because a database lives outside of your application, and it’s the database’s responsibility to manage that kind of data integrity, not your application logic. That’s one of the kinds of “execution-phase failure modes” that I describe in this paragraph of the blog post:
> Parsing avoids this problem by stratifying the program into two phases—parsing and execution—where failure due to invalid input can only happen in the first phase. The set of remaining failure modes during execution is minimal by comparison, and they can be handled with the tender care they require.
> Okay, what about more complex cases? E.g. how do you describe type "such value doesn't exist in the db table yet"?
If that's supposed to be an error state then you just throw a different error.
If it's not supposed to be an error state, then you can have something like a function that accepts a type of "Input" and returns a type of "DBTrackedInput" and in that function you do your saving to the DB. Then things which only want to work with values that are in the DB accept types of DBTrackedInput. That prevents you from passing something which hasn't been saved in the DB to those functions.
And you can expand from there, maybe the function that saves to the DB only accepts input of type "ValidatedInput". So you first have a function that accepts an "Input", validates it, and emits a "ValidatedInput". Then you have other functions, like your DB saver, which can say "I only want validated data" and still other functions which can say "I only want data that's been saved to the DB".
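A sketch of that pipeline in Haskell (every name here is illustrative, and the database call is stubbed out with a print):

    newtype Input          = Input String
    newtype ValidatedInput = ValidatedInput String
    newtype DBTrackedInput = DBTrackedInput String

    validate :: Input -> Maybe ValidatedInput
    validate (Input s)
      | null s    = Nothing
      | otherwise = Just (ValidatedInput s)

    -- The only way to obtain a DBTrackedInput is to actually save one.
    saveToDb :: ValidatedInput -> IO DBTrackedInput
    saveToDb (ValidatedInput s) = do
      putStrLn ("INSERT ... " ++ s)  -- stand-in for a real DB call
      pure (DBTrackedInput s)

    -- Downstream code demands "already persisted" right in its signature.
    notify :: DBTrackedInput -> IO ()
    notify (DBTrackedInput s) = putStrLn ("notify: " ++ s)

    main :: IO ()
    main = case validate (Input "hello") of
      Nothing -> putStrLn "invalid"
      Just v  -> saveToDb v >>= notify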
What if there is another application (or e.g. a stored procedure) that uses this DB table so you cannot control all saving operations from your application?
Don't do that. A DB table is too big and complex a surface to use as an interface between multiple applications. Each table should have one clear owner; if multiple components need to access the same table, put it behind a service rather than having them access it directly.
> how do you describe type "such value doesn't exist in the db table yet"?
Depends what you mean by that, but depending on the context, you may not want this in a type.
If it's a "Dirty" value / not inserted yet, you can just wrap it with "newtype Dirty ... = SomeRecordType ...".
If you mean something that you just checked that isn't in the database, you probably just want a properly named true/false, or a Maybe result. Encoding the value itself as not existing in the database will just bite you when something else inserts one in the meantime.
> But the type itself doesn't say that it must return the first element.
The "-> a" means a guarantee you will always get an "a" (by definition). This is why the fix is to change the return type to "Maybe a" which includes "Nothing". Why it isn't like that by default in Haskell is a different issue, but the return type is really guaranteeing that "a".
I have done, and still do, a lot of "shotgun validation" ;-(
But after having read the article, it's still not clear how to log illegal values or entities if they can't even be represented in the data model. How do you talk about things you can't name?
I agree broadly with this; however, in many of the cases I've been dealing with, I don't think a type system could be rich enough to encode the validation laws.
For example how would you encode "This purchase request must be made before the closing date for the purchase"?
This isn't about encoding the validation rules themselves in the type system, only the requirement that validation with respect to some properties has been done. So in your example you would just define a function that checks the closing date and returns a distinct type witnessing that the check passed (see the sketch below).
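A minimal Haskell sketch of that idea (all names are illustrative; the constructor of OnTimePurchase is assumed to be unexported so the check can't be skipped):

    import Data.Time (UTCTime, getCurrentTime)

    data PurchaseRequest = PurchaseRequest { closingDate :: UTCTime }

    -- Witness type: holding one means the deadline check happened.
    newtype OnTimePurchase = OnTimePurchase PurchaseRequest

    checkOnTime :: PurchaseRequest -> IO (Maybe OnTimePurchase)
    checkOnTime req = do
      now <- getCurrentTime
      pure $ if now <= closingDate req
               then Just (OnTimePurchase req)
               else Nothing

The usual caveat applies: the clock keeps moving, so the witness records that the check was made, not that the deadline will still be open later.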
This sounds like the Nirvana fallacy. Even though parsing over validation is not perfect, it can still lead to significantly more maintainable code and a safer, more stable system.
Pure validation has a lot of problems, namely that you have to validate the data on all levels for the invariants required at that level.
I like the concept of encoding properties, like a boolean of non-empty, in an object.
Beyond that, this article further convinced me type systems are for the pedantic. A given function signature is impossible? Seems like just another strength of a dynamic language.
What does your dynamic language do when you take the first element of an empty list? There is no obvious "correct" thing to do. Furthermore, whatever you do return is unlikely to be the same sort of thing that is returned for a non-empty list.
A dynamic language will have a behavior that corresponds to some sort of type signature, and it's not possible to write behavior that corresponds with the type signatures given as examples of "impossible" in the article.
A type signature is merely a statement about behavior, so it's nonsensical to make a false statement about the behavior, and Haskell catches this.
It throws an error. Errors are a type of behavior. Some of those error behaviors you can cope with. You catch those. Others you can't. You raise those and either the system can cope or it can't and you crash. What part is nonsensical?
Throwing an error should be part of the function signature. Otherwise the statement "This function takes a list of objects of type A, and returns an object of type A" is false; sometimes it will return an object of type A, other times it will signal an error.
It's better to assume all code could throw an error. No statement is safe. This becomes particularly important in distributed computing as resources may be offline at any given moment. It's from this perspective that type systems afford a false sense of security and appear to be scratching the itches of overactive Type As.
I'm not particularly an advocate for typed languages (my daily driver for personal projects is Common Lisp, which is mostly untyped), but I'm having trouble even understanding the point of view that "No statement is safe". If I had a system where the actual type of the expression "1+2" were "3 | Error" I would find a new system.
Code that doesn't read from unsafe memory locations, allocate memory, or perform I/O cannot throw an error. Code that does can throw an error. Some systems treat out-of-memory as fatal, removing one of those 3.
In some domains you don't want to know whether the computation is happening locally or remotely, but IMO most of the time you do, because pure computation cannot signal an error locally, but remotely it can. As you mention, distributed computing often runs in this mode, but distributed computing is overkill for most problems; my phone is powerful enough to locally compute most daily tasks (even when it's done server-side for various reasons); my workstation is powerful enough to perform most daily tasks for 100s of people simultaneously.
It's hyperbole. A lot like systems that require hard type declarations for each and every function. The advantage of type systems is performance. Any correctness guarantees that come along with it are nice sugar but hardly the point and often misleading.
Nonsense. A distributed scenario is precisely where an advanced type system becomes really valuable, because you can draw a distinction between local and remote calls while still treating both in a uniform way. Treating every single call as though it were remote is impractical and wasteful.
In the style of TFA you should wrap pickle.load() with a function that will unpickle the specific type you are expecting. So you should write an unpickleArrayOfInts() or whatever.
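In Haskell terms (using aeson as a stand-in for pickle; decodeIntList is an illustrative name), that wrapper is one line:

    import Data.Aeson (decode)
    import qualified Data.ByteString.Lazy as BL

    -- Commit to the expected type at the boundary instead of
    -- deserializing to "whatever was in the file".
    decodeIntList :: BL.ByteString -> Maybe [Int]
    decodeIntList = decode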
The actual type would be a rather large union type, which would be unwieldy to use (but in an untyped system like python you still would need to deal with all of those corner cases to have a program that is correct in the face of arbitrary input).
Really the biggest annoyance of type systems is that they make you deal with corner cases that you don't think are practically possible. If you are right, then they are wasting your time. If you are wrong, then they are saving you from having bugs.
I'm pretty confident that you could write something that was equivalent to all the useful `pickle` calls. By that I mean you'll need to know which operations you'll want to do on your unpickled object:
    readAny :: forall c r. [TypeWithReadAnd c] -> String -> (forall a. c a => a -> r) -> Maybe r
    readAny types string handle
I think it's fair to say "hey, pickle doesn't require me to list all my types explicitly", but on the other hand, it's not like pickle can conjure those types out of thin air--it considers only the types that are defined in your program.
Here's an example that uses Read as the serialization format and only deals with Int, Char and String; but hopefully you can imagine that I could replace the use of read with a per-type function that deserializes from a byte string or whatever.
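Something along those lines, simplifying the constraint c from the signature above down to a fixed Show (a sketch):

    {-# LANGUAGE ExistentialQuantification #-}
    {-# LANGUAGE RankNTypes #-}

    import Text.Read (readMaybe)

    -- A "registered" type: a parser into some a, packaged with evidence
    -- that a satisfies the constraint (here fixed to Show).
    data TypeWithShow = forall a. Show a => TypeWithShow (String -> Maybe a)

    readAny :: [TypeWithShow] -> String -> (forall a. Show a => a -> r) -> Maybe r
    readAny types s handle = foldr try Nothing types
      where
        try (TypeWithShow parse) rest = maybe rest (Just . handle) (parse s)

    registered :: [TypeWithShow]
    registered =
      [ TypeWithShow (readMaybe :: String -> Maybe Int)
      , TypeWithShow (readMaybe :: String -> Maybe Char)
      , TypeWithShow (readMaybe :: String -> Maybe String)
      ]

    main :: IO ()
    main = print (readAny registered "42" show)  -- Just "42"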
You might be right, it's been a long time for me too. But if it can define new classes, then I'd expect that the code for those classes' methods would also be in the pickled format, at which point there's no particular reason you couldn't deserialize it in Haskell too... with some caveats about either needing to know what types those functions should have, or needing to build a run-time representation of those types so that they can be checked when called, or (hopefully!) crashing if you're okay with just unsafeCoercing things around.
Only if there is actually anything more you can say about String that would let you avoid the Maybe. If this is raw data and you don't even know that the JSON is well formatted (else why can a parse straight to a generic JSON object fail) then there's not any better way to type it.
You might want to return an error message on failure, or provide a mechanism for recovery, but those are somewhat specific to the use case and don't really address the key point of the article.
Why would you want a language with extra support for bugs?
That's like saying Python is for the pedantic because it doesn't have a flag --crash_randomly, or won't let me write 0/0 and rely on the runtime to pick an arbitrary value and keep going, even though I might want that.