Postel’s Principle is a Bad Idea (programmingisterrible.com)
147 points by tinsel on Feb 3, 2013 | 60 comments



Postel's Principle was very important for bootstrapping adoption of TCP/IP, but it's mostly a curse in mature systems. It doesn't even help new implementors; instead, it deceives them into thinking that they've achieved interoperability when instead they've accidentally built dependencies on other people's implementation details.

That said, I wouldn't suggest that our Insertion, Evasion paper presented an argument regarding the Principle in either direction. Even if we forbade leniency, there'd still be ambiguous standards.


when instead they've accidentally built dependencies on other people's implementation details

I think it would be very interesting if you could give an example or two of this.


Here's the general outline.

Vendor A writes a parser that is helpful and is liberal and infers missing quotes and stuff. Vendor B writes something that's mostly to spec, but accidentally doesn't properly quote things. It works fine, because A is liberal and infers these quotes.

Vendor C comes along and builds exactly to spec. But despite being perfectly to spec, it doesn't interop because B sends invalid data! But B is a big vendor, and their stuff works with A.

So now C must add a hack to their parser to deal with the fact that, because A was liberal, B got their implementation wrong.

One example is the loose routing parameter, "lr". It has no value, you just add the name of the parameter "uri;lr" in contrast to other parameters like "tag=bla". Some implementations send "lr=on", and that should be mostly harmless. Except other implementations take that to mean "lr" has a value, and no longer accept just "lr" as turning the feature on.

SIP is full of these things, many of them in the parsing layer alone, let alone the actual semantics of what things mean. Browsers are another example: vendor A decides to allow closing tags out of order - how do you handle such unspecified stuff cross-browser?
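
A toy sketch of the "lr" situation (hypothetical parser, not code from any real SIP stack):

    # Hypothetical sketch of the "lr" parameter problem.
    def parse_params(param_string):
        """Parse ';'-separated URI parameters like 'lr' or 'tag=bla' into a dict."""
        params = {}
        for part in param_string.split(';'):
            if not part:
                continue
            name, _, value = part.partition('=')
            params[name] = value if value else None   # valueless parameter -> None
        return params

    # Spec-conforming sender: loose routing is flagged by the bare name.
    assert parse_params("lr;transport=udp")["lr"] is None

    # Liberal vendor A also tolerates the off-spec "lr=on" that vendor B emits...
    assert parse_params("lr=on;transport=udp")["lr"] == "on"

    # ...but an implementation that now *requires* a value treats bare "lr" as off,
    # so the spec-conforming sender no longer interoperates.
    def loose_routing_enabled_buggy(params):
        return params.get("lr") == "on"   # wrong: rejects the valid bare form

    print(loose_routing_enabled_buggy(parse_params("lr;transport=udp")))  # False!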


I should do better than this, but my flight to SF takes off in just over an hour, so for now: the first paper this article linked to (I wrote it with Tim Newsham) is a study in how you can compare implementation details between two TCP/IP stacks and use them to sneak traffic past a middlebox that assumes one interpretation or the other. The example that comes to mind is putting data in a TCP SYN segment.
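
For anyone who wants to poke at it, putting data in a SYN is easy to try with a packet-crafting library. A rough Scapy sketch (the destination address is a placeholder, and whether the payload bytes count is exactly the implementation detail in question):

    # Rough Scapy sketch: a SYN segment that already carries payload bytes.
    # Some stacks and middleboxes honor the data, others ignore it; that
    # disagreement is what lets traffic sneak past. Requires root to send.
    from scapy.all import IP, TCP, Raw, send

    syn_with_data = (
        IP(dst="192.0.2.1")                       # placeholder address
        / TCP(sport=12345, dport=80, flags="S", seq=1000)
        / Raw(load=b"GET / HTTP/1.0\r\n\r\n")
    )
    send(syn_with_data)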


Windows software is a similar case - a program works on one version of Windows due to accidental dependencies on Windows implementation details (e.g. memory management), and then fails on the next version of Windows when the underlying implementation changes. Raymond Chen has written about the huge difficulty for Windows to maintain backwards compatibility with "broken" programs. New versions of Windows provide special-case handling for old applications so they will still run. See http://www.joelonsoftware.com/articles/APIWar.html and scroll way down to "The Two Forces at Microsoft" for a long discussion of this, and how Apple is much stricter.


> http://www.joelonsoftware.com/articles/APIWar.html

In the terminology of this article, Torvalds is firmly in the Raymond Chen camp as far as "The kernel is not allowed to break user software" is concerned. The difference between Windows and Linux (and especially between Windows 95 era Windows and Linux of the same vintage) is, apparently, that Linux didn't come from MS-DOS, and so never had to allow application software to get hooks into low-level parts of the kernel.

There was never an official version of Linux for hardware without memory protection, and there never will be. Scope is important.


No, this is actually very different. Linus doesn't want breaking API changes to documented behavior; in the Raymond Chen case it's about not breaking applications that misbehave or abuse undocumented behavior.


> Linus doesn't want breaking API changes to documented behavior

Even aside from the fact that this is wrong:

https://bugzilla.redhat.com/show_bug.cgi?id=638477#c129

http://kerneltrap.org/node/5725

The point I was making was that Linux didn't expose the same kind of deep, undocumented behavior because, as I said, it always had the ability to hide its inner workings.


Two off the top of my head:

* A classic would be IE's abuse of TCP RST: http://www.stroppykitten.com/cms/index.php?option=com_conten...

* A decent chunk of email server code (SMTP & IMAP implementations in particular) is there to handle erroneous client behaviours. The worst cases are those where the workaround leads to misbehaviours (or less optimal behaviours) for conforming clients. If I remember correctly, the popular Outlook series of clients is a notorious source of such warts. A number of SMTP sender libraries will skip over significant parts of the protocol state machine; configuring a mail server to handle that degenerate case can weaken its anti-spam provisions.


Yes, there were major changes between the way IMAP worked in Outlook 2003 & Outlook 2007 too; I think they re-wrote their IMAP code, as it behaved completely differently. Admittedly it was much better, but it broke our custom IMAP server, which I ended up fixing. I can't remember exactly why now, but Outlook 2003's implementation was bizarre, as if they'd not read the RFC.

That reminds me of the bizarre bug caused by using a short uint to store the message UID. Maybe it wasn't a short, but it definitely wasn't 32-bit as per the spec; there was some magic number that, if you went over it, 'boom'. As an end user it appeared that some messages just disappeared.
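
Something like this, presumably (a guess at the mechanism, assuming a 16-bit field):

    # Guess at the mechanism: an IMAP UID (32-bit per the spec) squeezed into a
    # 16-bit field. Everything works until a mailbox gets busy enough.
    uid = 70000                  # a perfectly valid UID
    stored = uid & 0xFFFF        # silently truncated by the too-small field
    print(uid, stored)           # 70000 4464 -- the client now asks for the wrong message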


POP3 is no picnic, either. Outlook seems to only care about the last-used UIDL.

And, to make things more confusing, the RFC is a little vague about the re-use of UIDLs.


HTML (and the Web in general) has many, many examples of this, for example the way we parse <i><b>x</i></b> is required to be compatible with pages that relied on the way early browsers did it. In fact almost all of the Web's quirks are because of this.


Joel Spolsky did a great job describing how modern browsers and much else are led astray by Postel's law: http://www.joelonsoftware.com/items/2008/03/17.html


It seems to me it is a very valid principle in many areas.

For instance, the STEP file standard very clearly states that all input files must be 7-bit ASCII. Many of the programs that generate these files (including earlier versions of my own) paid no attention to this and wrote out 8-bit values in strings if the user requested it. Clearly this behavior is wrong. (The principle agrees: "Be conservative in what you do.")

However, rejecting an entire CAD file merely because the text strings in it used an illegal encoding is downright silly. It in no way can change the meaning of the geometry of the file. There is no hidden vector in there for malicious attacks. It makes perfect sense to accept illegal files like this and do your best to make them work, even if it might not get quite the same text strings the user intended.

I think jbert's point about being conservative in what you do in all respects is a strong one. Taking that into account suggests that maybe carefully marking the illegal character as such in the string might well be worthwhile, and is definitely more appropriate than trying to guess what 8-bit character standard was intended.


That's exactly the kind of security risk that the article is talking about. Internet Explorer could be tricked into using US-ASCII encoding and interpreting ¼script¾ as a script tag (CVE-2006-3227).

Liberal vs strict is a false dichotomy. The third solution is to accept all possible inputs, but in a specified way.

Instead of taking the draconian XML approach, you can solve the problem by taking the HTML5 approach and making error handling as interoperable as the handling of correct input. In the case of STEP files you could require all implementations to clear the 8th bit (or drop or clamp bytes out of range - whatever, as long as it's specified and mandatory).
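
To see why that matters, here's a small illustration of the IE-style failure (my own sketch, not code from the CVE write-up): a filter and a consumer that handle the 8th bit differently disagree about what the bytes say.

    # Sketch: a filter and a consumer that disagree about the 8th bit.
    payload = "¼script¾alert(1)¼/script¾".encode("latin-1")

    # The filter inspects the bytes as-is and sees no '<script', so it passes them.
    assert b"<script" not in payload

    # A consumer that "helpfully" squeezes everything into 7-bit ASCII by clearing
    # the 8th bit turns 0xBC (¼) into 0x3C (<) and 0xBE (¾) into 0x3E (>).
    stripped = bytes(b & 0x7F for b in payload)
    print(stripped.decode("ascii"))   # <script>alert(1)</script>

The fix isn't "never clear the bit"; it's that whatever transformation is done has to be the same, mandated one in every implementation, so a filter and a consumer can never disagree.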


Maybe I'm missing something here, but a valid STEP string can already encode any arbitrary Unicode code point. It just does it using 7-bit ASCII. If your code is somehow executing these strings without examining their content, then you are already in big, big trouble.

Trying to do something with 8-bit characters -- whether skipping them, indicating an illegal character in the string, or trying to guess what was really meant -- cannot make that situation any worse.


The problem is if you decode a particular byte sequence that causes a bad action (if that's possible with step files) in a different way than some other program that is supposed to keep you safe.

In the case of ie, ie decoded one way and forum software might decode a different way. So the forum software says the string is safe for the browser (according to its decoding rules) but then the browser applies different rules and gets a bad string.

You may not be seeing the danger because you implicitly think a STEP file from unsafe sources is always unsafe. But imagine if you had a safe-file detector program, except it applied different rules than the program you're actually going to open the file with.


As jbert pointed out, if your program's main job is to say whether or not something is safe, and it liberally says "Oh yeah, I think that's safe", that's pretty much the exact opposite of "be conservative in what you do".


Please explain the proper way of escaping/rejecting html in forum posts, when you can't rely on the browsers following the spec.


Possible attack: because the strings are not ASCII, implementations now need to bring in another library to decode those strings. Now let's say someone encodes an end-string char (single quote?) using some alternative encoding that doesn't use the ASCII quote char.

When an implementation saves this file, it normalizes that other encoding to use an ASCII single quote, then proceeds to write out the rest of the string. This isn't caught inside the implementation, because the encoding library only normalizes when writing. When it reads the data in, it still just represented it as bytes, and there was no ASCII single quote byte until the end of the dangerous string.

So, yes, it's possible that even something as simple as "string encoding" could be used to implement an attack.


But this is where "be conservative in what you do" comes into play. The STEP format has formal rules for exporting all ASCII, Unicode, and ISO-8859 characters. A well-written STEP string exporter should handle them all without difficulty, no matter what goofy things are in the string.

And again, if you're worried that there may be an attack vector, change high-bit-set characters to "[Illegal character value N]". Though it might be more merciful to assume they just wanted ISO-8859-1 characters and substitute the appropriate control code.


The tl;dr of the article is to define handling of invalid input, so that all conforming implementations will handle it in the same way, without having to reverse-engineer each other to be interoperable.


So you're saying that every time I find a STEP file written in an invalid fashion, I should convene an ISO 10303 committee and wait for years to find out how everyone should handle it? That's doubly insane, because it would take many bugs that can be fixed in a day and make my customers suffer from them for years, while at the same time requiring me to modify my program to handle every bug found by every STEP software vendor or cease to be conforming.


If the penalty for generating a CAD file with its strings in the wrong encoding is that no importer will read it because they're being strict in what they accept, then no exporter that does so will last very long in the wild.


This post, if you read it to the end, doesn't reject the principle of being liberal in what you accept. Rather, it proposes being liberal in a formally specified and interoperable way - i.e. specs should explicitly define behavior for all inputs, including any error correction.

HTML5 takes this path with its parsing algorithm, and in fact is cited as an example in the post. However, the designers of the parsing algorithm saw it as being an application of Postel's Principle, rather than an example of the opposite.

The post is really more nuanced than it sounds and would better be titled "Specifications Should Define How to be Liberal in What You Accept".


Quite apart from security, Postel's Principle can hurt the capacity to make backwards-compatible feature additions in the future.

For example, if you have a set of unused flag bits documented as "reserved, must be zero", then a receiver that silently ignores non-zero bits allows senders that erroneously set those bits to propagate. This is fine, until one day in a future standard you want to define new behaviour for one of those bits, and find you can't - because there are large numbers of senders out there that erroneously set it but don't have any idea about the new behaviour.
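
A minimal sketch of the stricter receiver (hypothetical flag layout, not from any particular protocol):

    # Hypothetical flag layout: a receiver that refuses frames with reserved bits
    # set, so a future revision can safely give those bits a meaning.
    RESERVED_MASK = 0b1111_0000   # bits 4-7: "reserved, must be zero" today

    def parse_flags(flags_byte):
        if flags_byte & RESERVED_MASK:
            raise ValueError("reserved bits set; rejecting rather than guessing")
        return {"ack": bool(flags_byte & 0b0001),
                "more": bool(flags_byte & 0b0010)}

    print(parse_flags(0b0000_0001))        # fine
    try:
        parse_flags(0b0001_0001)           # a sloppy sender gets caught immediately,
    except ValueError as e:                # instead of its frames quietly propagating
        print("rejected:", e)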


It's a pity so many people grossly misunderstand Postel's Principle.

Postel didn't talk about off-spec behaviour. He talked about the borderline details, which were often quite hazy in early RFCs. When an RFC says the line length is at most 512 bytes and the terminator is CRLF, does that mean 510+CRLF or 512+CRLF? Postel says to accept 512+CRLF and send 510+CRLF.

If I write a receiver and want to accept 1024 bytes instead, maybe that's a good idea and maybe it's a bad idea. But if you do that, don't invoke Postel's Principle in defense.


As someone who used to work on OSI-based systems: well, that's just sloppy standards writing, the bane of internet standards.

It's a pity that RFCs and other internet standards are not written and implemented more rigorously - for example, Google have problems interpreting the XML sitemap standard, and that is only 3 pages, FFS.


I've written ten RFCs of varying quality. It's terribly difficult to write something that a) gives a good overview of the subject, b) explains the choices that had to be made, c) spells out every detail, and d) remains short enough that implementers actually read all of it. All of mine fail in some way. I've heard the OSI documents failed too.

Quoting one implementer, whose code did not accept non-ASCII passwords: "Oh, the password syntax is on page 88? My printout ends after page 68". In that RFC, the details are spelt out in appendices, and Appendix A starts on page 69. (And I'm sure pg assigns bonus karma if you can identify the RFC.)


Is there any way to request an official clarification for borderline cases the RFC-author didn't think about when typing it up?


You can submit an erratum and an author will comment and often clarify, so formally speaking the answer to your question is yes. But other implementers don't generally read the errata, so you have to expect that your interoperation peers haven't read the clarification.

Once you understand the problem, the clarification, and that your interop peers do not, I bet your implementation's handling of the issue will be conservative in what it sends and liberal in what it receives.


As a counterpoint, perhaps it is reasonable, if interpreted more strictly.

Taking the perl-over-c-stdlib example (but I think it applies in other cases), if the "perl layer" was more strict in what it sent to the stdlib layer, there would have been no problem.

i.e. the error is in thinking of only the network as the place to apply the maxim. In fact, you should scrupulously adhere to every interface you pass data to (internal or external) - and interpret as reasonably as possible all interfaces you receive data from.

[I'd agree that the latter point can be weakened. But it does help interop - and if you clean up your act before you hit the next layer then you limit any damage.]


Postel wrote it in 1980. (First found in RFC760[1])

Since then we've had computer viruses, worms, and other malware; we've had hackers, crackers, spies, criminals, and semi-competent people flooding the Internet; we have people not just making accidental requests but fuzzing and fusking to try to break things or bypass controls.

It's a great principle for the human stuff, but it feels really outdated for technical stuff.

[1] (http://www.ietf.org/rfc/rfc760.txt)


I agree, Postel's principle makes a lot of sense in context, if you view it as a bunch of mostly good-faith people attempting to bootstrap communication in a new medium. Then it's clear that to get things working, you want to forgive errors on the receiving side (to the extent you can do so), but send as clean and unproblematic output as you can. Basically what a sensible, not-anally-bureaucratic human who's trying to establish communication would do. It was also, iirc, influenced by some of the difficulties ARPANET had experienced in getting different implementations to interoperate. But it may make less sense today.


Agreed. Today, I think "fail fast" is a much better principle.


There are times when you want to build for robustness and times you want to be more concise. If you have control over the set of inputs (e.g. by formalizing it and using the right tools) that's great, but there's usually some overhead involved in doing that. My argument would be that security is orthogonal to robustness - just because you accept input that is outside the original specification doesn't mean that you should do so insecurely. The robust (liberal) implementation and the limited (conservative) implementation simply support different protocols; either can be done with security holes or without. Does this increase the attack "surface"? It may or may not.

A bigger problem is when the liberal implementations become the de-facto standard.


The authors of the SIP spec published another spec (RFC4475[1]) called "SIP torture tests", where they seem to take a perverse glee in showing how messed up their "human readable" syntax can get.

They even use the phrase "infer" in several places, encouraging systems to take obviously malformed packets and try to figure out what they meant.

Being liberal in accepting input, apart from security issues, seems to create a worse situation. Implementation A messes up something, but B seems to be OK with it. C then accidentally requires it, while D rejects it. Depending on how large and responsive the vendors behind those implementations are, you end up with a nasty state of affairs, with random hacks here and there.

It's hard enough to create unambiguous, comprehensible, specifications. Telling implementations to be liberal only makes it worse.


I can't read this comment without thinking about SOAP.


I could kiss you for that.

If a format intended for interoperability can only reasonably be used by a single vendor, it has no benefit over a binary protocol.

The entire SOAP and XML-RPC space is Postel's law writ large.


Yup.

I only had to face the true horrors on one occasion, for a Responsys integration. They had C# examples and Java examples. The API they offered for the two had differences, because some methods would work with one, some with the other.

I'm a Perl programmer, so tried that. After all you just have to translate the language, right? Wrong. After banging my head against that mess for a week or so, I finally gave up, wrote the communication in Java, and had a Perl launcher for it.


Browser tolerance for HTML errors is one of the main reasons the web took off so fast.


Citation needed? Is there any reason to believe that, if browsers had insisted on well-formed documents and provided errors like "error at line X, table tag not closed", people would not have been able to fix up their documents? I don't believe that would have stopped things.

But that exact behaviour, trying to infer intent, meant that tons of unspecified behaviour had to be added to all browsers to try to mimic what each one did to handle totally invalid cases.

So, even if leniency did make it easier to create a web page, it also contributed greatly to the already difficult task of creating consistent cross-browser rendering.

Look at JavaScript, and the recent semicolon debacle with Bootstrap and some other tool. Having "implementer-defined" leniency just means you'll get multiple interpretations and problems.


We've tried this experiment - it's called XHTML. For a long time, adoption by authors of the strict error handling it offered was stymied by lack of support in MSIE. So it's not a full counter-factual. However, we have learned two things:

(1) Now that MSIE does support true XML parsing of XHTML, almost no one is choosing to use it over HTML.

(2) Of the few experts who conditionally served either text/html or application/xhtml+xml depending on the UA, or serve XML unconditionally now, almost all have bugs in their sites which can get them to produce ill-formed XML which then shows an error page in the browser (for instance, submitting comments with certain sorts of errors). This is evidence that the draconian error handling approach is too challenging even for experts and imposes the costs of small mistakes on users.


I think the bigger lesson to be learned is that after poorly followed ad hoc standards have made a mess of things, it's hard to come in and clean up later.


A nontechnical user, given a choice between two environments - one of which nags them pedantically over technical details, and another which displays the gist of entered content, perhaps with sometimes screwy formatting - which one would win?

Word processors won out over text processors for the nontechnical user partially for this reason.

Postel's law applied to HTML let nontechnical users get things done with less impedance. It's less important now not because it was the wrong choice, but because users have moved higher up the stack to CMSes that handle formatting etc.

Consistency across browsers back then was only ever of serious concern to professionals in design or browser programming.


Writing correct HTML is not significantly harder than writing crappy HTML. Comparing it to LaTeX vs Word isn't a good analogy, IMO. And wouldn't nontechnical users be using higher-level HTML editors anyway? At least some bad HTML comes from lazy developers who should know better, and could have done better with a stricter tool.


It's not entirely trivial to write out correct HTML. In addition to properly closing tags, you need to deal with optional end tags (many get this wrong), optional start tags, the odd comment syntax, weird exceptions with respect to escaping (especially in script tags), and context-dependent validity such as no block-level elements inside inline elements (or nested a tags, or tags whose "type" depends on their attributes).

It's obviously easier to write than to read; but it's definitely easy to make a mistake. There's a lot of illogical cruft that's accumulated in HTML; so even a careful implementer might make a mistake (and might not detect it since most other implementations are so liberal).


I was talking about the 90s.


But (as the OP says) you can have browser tolerance without leaving it undefined.


I'd be more interested in reading a paper called How JSON escaped Postel's Principle which included discussion on the ambiguity being pushed to other areas, such as date parsing.


Once upon a time, somebody wrote a json parser that used an existing date parser because component reuse is the shizzle. Then somebody updated the date parser to handle more formats because they were using it in a different project. Congratulations, now you have a json parser that eats a multitude of date formats. (Some liberties with facts taken, but building a tower out of flexible components results in a flexible tower.)
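
A toy version of that tower, assuming python-dateutil as the reused component:

    # Toy sketch: a JSON decoder that runs every string through a liberal date
    # parser (python-dateutil assumed as the reused component).
    import json
    from dateutil import parser as dateparser   # pip install python-dateutil

    def liberal_loads(text):
        def maybe_date(obj):
            for key, value in obj.items():
                if isinstance(value, str):
                    try:
                        obj[key] = dateparser.parse(value)
                    except (ValueError, OverflowError):
                        pass                     # leave non-dates alone
            return obj
        return json.loads(text, object_hook=maybe_date)

    # All of these now "work", so senders never converge on one format --
    # and ambiguous ones like "01/02/03" are silently guessed (as 2003-01-02 here).
    print(liberal_loads('{"when": "2013-02-03"}'))
    print(liberal_loads('{"when": "Feb 3, 2013"}'))
    print(liberal_loads('{"when": "01/02/03"}'))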


When did that happen? There is a JSON standard, http://www.ietf.org/rfc/rfc4627.txt, but I have yet to find a parser that follows it, including Crockford's original "reference implementation". As a result, there's no point in developing a conformant parser, as no one would use it. They're off chasing date formats, capitalized keywords, functions (!!) blissfully unaware or willfully heedless of the impact on interoperability.


See also DJB's notes on protocol design: http://cr.yp.to/proto/design.html


I am reminded of the discussion we had with colleagues about Markdown versus RestructuredText.

On the Markdown side, you have a sexy but ill-defined grammar; on the RST side, you have a slightly less nice-looking guy with a much better-defined grammar, which allows building saner tooling upon it.


Postel’s Principle is wrong, or perhaps wrongly applied.

The Robustness Principle is a prescription that lays down a strategy for growing robust systems. It works. The problem is that the robustness it provides isn't quite what people want it to be.


Postel's law is a maintenance and composition nightmare. My take: http://www.win-vector.com/blog/2010/02/postels-law-not-sure-...



Also see this list of links from 2004, arguing about whether Postel's Law applies to syndication formats: http://www.imc.org/atom-syntax/mail-archive/msg04697.html . Sadly, Mark Pilgrim's famous rant is no longer online.


See also: HTML parsers


> Treat input handling computational power as a privilege, and reduce it whenever possible.

A great example of this was Google's Code Search product, before it was canceled. Since full backtracking search was blowing out the tiny thread stacks in servers, they had to reduce what they allowed to actually regular expressions - expressions generating a regular language. Queries could then be matched in time linear in the input, making arbitrary public regex searches over code indices feasible.

Russ Cox's regular expression write-ups are quite a fascinating deep dive: http://swtch.com/~rsc/regexp/
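
The difference is easy to reproduce. Python's re module is a backtracking engine, so a pathological pattern shows the blow-up that an RE2-style linear-time engine avoids:

    # Python's re is a backtracking engine, so this pathological pattern takes
    # exponential time; an RE2-style engine answers the same query in time linear
    # in the input. Keep n small -- each +4 multiplies the runtime roughly 16x.
    import re, time

    pattern = re.compile(r'(a+)+$')
    for n in (14, 18, 22):
        text = 'a' * n + 'b'          # never matches, forcing full backtracking
        start = time.perf_counter()
        pattern.match(text)
        print(n, round(time.perf_counter() - start, 4), "seconds")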



