
Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention.

Secondly, assume that parsing your input will crash, so catch the error and have your application fail gracefully.

This is the number one security issue I encounter in "security audited" PHP. (The second being the "==" vs. "===" debacle that is PHP comparison.)

As one example, consider what happens when the code opens a session, sets the session username, then parses some input JSON before the password is evaluated. Crashing the script when the json_decode() fails leaves the session open, so the attacker can log in as anyone.

Third, parsing everything is a minefield, including HTML. We as a community invest a lot of collective effort in improving those parsers, but this article does serve as a useful reminder of a lot of the infrastructure we take for granted.

Takeaways: Don't parse JSON yourself, and don't let calls to the parsing functions fail silently.




A parser should never crash on bad input. If it does, that's a serious bug that needs immediate attention, since that's at least a DoS vulnerability and quite likely something that could result in remote code execution. You definitely need to assume that the parser could fail, but that's different. Unless you're using "crash" in some way I'm not familiar with?


From the context, I'm assuming s_q_b simply means that if given malformed input, the parser should communicate the problem via the appropriate error channels for the given language. In Java, throw an exception; in Go, return an error, etc.

But yeah, the article's example of an Xcode segfault is a good example of both poor parsing logic and poor isolation of fault domains. If your JSON processing library can corrupt the whole process then something's wonky.


Yes, exactly this.

What I see a lot is initializing an object from a JSON parser which, when it receives malformed input, either halts execution outright or returns an empty object. In both cases you just want to make sure to handle the error appropriately per-language.
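
For example (an illustrative sketch in Python rather than PHP, but the shape is the same): the stdlib json module raises on malformed input, and the fix is simply to catch it and fail the request cleanly instead of letting execution halt halfway through.

    import json

    def handle_request(raw_body):
        # json.loads raises json.JSONDecodeError (a ValueError subclass)
        # on malformed input -- never assume the parse succeeded.
        try:
            payload = json.loads(raw_body)
        except json.JSONDecodeError as exc:
            # Fail gracefully: report a client error instead of dying
            # mid-way through whatever state has already been set up.
            return {"status": 400, "error": "invalid JSON: {}".format(exc)}
        return {"status": 200, "data": payload}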

Again, the main offender here is PHP, which makes this issue surprisingly easy to get wrong. (And add "==" to your lint checks! 99% of the time coders mean "===", and this can lead to surprisingly severe security issues.)


If you meant that it would throw or return an error on malformed input, I totally agree. You must always handle errors that your calls might return, and especially when those calls are given anything derived from outside input.


Yes, generally, and specifically that failure to handle this type of error can cause all sorts of plague, pestilence, and application compromise.


Indeed, forgetting to check for errors is a great way to cause all sorts of terrible bugs. That's one reason I'm starting to become a fan of checked exceptions and result types.


It can always crash when there is not enough memory.

Especially on the stack.

One of the test cases has 100000 [ characters, and my own JSON parser crashed on it, because it parsed recursively and the stack overflowed. So now I have fixed it to parse iteratively.

It still crashes.

Big surprise. Turns out the array destructor is also recursive, because it deletes every element in the array. So the array destructor crashes at some random point, after the parser has worked correctly.
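
For anyone curious what that failure mode looks like in a memory-safe language, here's a toy Python sketch (hypothetical parser fragment, not any real library): the recursion depth tracks the nesting depth, so deeply nested input converts directly into a stack problem.

    import json

    def parse_value(s, i=0):
        # Toy recursive-descent fragment: every '[' adds a stack frame.
        if s[i] == "[":
            if s[i + 1] == "]":
                return [], i + 2
            element, i = parse_value(s, i + 1)   # depth == nesting depth
            # ... a real parser would handle ',' and the closing ']' here ...
            return [element], i
        raise ValueError("unexpected character at {}".format(i))

    deep = "[" * 100000 + "]" * 100000
    # parse_value(deep, 0) raises RecursionError long before 100000 frames,
    # since CPython's default recursion limit is about 1000. CPython's own
    # json module guards the same way: json.loads(deep) raises RecursionError
    # instead of overflowing the C stack.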


In "safe" languages (i.e. ones which aren't supposed to segfault/etc unless there's a compiler/interpreter/VM bug -- PHP, Java, JS, Python all fit in here) I've often heard "crash" being used to just mean an uncaught exception, or something else that brings the program crashing down.

GP says "or catch the error", so they're probably using that term in this sense.


Some parsers are dealing with known good input, and should not include validation code for performance reasons. Often you will parse the same JSON many times throughout a pipeline, but you only really need to validate it once. A good example of this is a scatter-gather message bus, where the router parses a message record, scatters it to a large number of peers, and gathers responses. Depending on how latency-critical this system is, it can make sense to validate and normalize the representation at the router, and send normalized (whitespace-free) JSON to the compute nodes.
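
For what it's worth, in Python the validate-and-normalize step at the router can be as small as a round-trip through the stdlib json module (just a sketch, assuming messages are UTF-8 text):

    import json

    def normalize(raw_message):
        # Parse once at the router -- this is where malformed input gets rejected.
        record = json.loads(raw_message)
        # Re-serialize without whitespace so the compute nodes receive a compact,
        # known-good representation they can parse cheaply.
        return json.dumps(record, separators=(",", ":"))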


Even if the json source is $DEITY's own platinum-iridium json generator, the json has to actually get to your parser somehow, and you can't guarantee that it did so in one piece. Memory is finite, no bus is perfect, etc.


Might as well translate validated JSON to a faster binary format for the internal bus, e.g. MessagePack. Validated msgpack is probably faster than unvalidated JSON.


This was something I wondered about in these results: what's the definition of "crash"? Specifically, with the Rust libraries, I'm not sure if this means "the program panicked" or "the program segfaulted", or something else. The former isn't ideal, but isn't the worst. The latter would be much more worrisome.


In the Rust case, the crashes are either the stack being blown on recursion or panics when overflowing numbers.


What about raising an exception?


That qualifies as error reporting in most languages using exceptions.


What about it?


Exceptions should only be used for exceptional cases. For a parser, bad input should be expected.


You say that as fact but you must know it is a matter of opinion: I would say you should match the language idioms. For example, Python iterators and generators work by raising/throwing when there are no more items: it is fully expected and will always happen when you write a for-in loop.
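
For the unfamiliar, this is roughly what a Python for-in loop desugars to:

    it = iter([1, 2, 3])
    while True:
        try:
            item = next(it)       # raises StopIteration when exhausted
        except StopIteration:
            break                 # this is exactly how for-in terminates
        print(item)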


That seems like bad design. Rust iterators return an Option<T>, for example.


In Python a lot of flow control uses exceptions, as it's cheaper to ask for forgiveness than to ask for permission and then still have to deal with errors.


And in many languages, they are the common way to communicate that a function can not return the data that is expected of it. Invalid input data means the parser can't produce the equivalent data structure -> exception.

If the parser has some kind of partial parsing, a way to recover from errors or you are using a language in which returning explicit errors is the more common idiom, then you probably shouldn't throw an exception.


Throwing an exception when parsing fails sounds like a case of exception-handling as flow of control: a bad thing even when commonly done, having a lot in common with GOTO statements. (See http://softwareengineering.stackexchange.com/questions/18922... , and Ward's Wiki when it comes back up.)


Exceptions are for flow control. That's their entire purpose.

Throwing an exception when parsing fails is a near perfect example of where exceptions produce clarity of code.

That is to say, a parse function is usually a pure function that takes a string and returns some sort of object. As a pure function, it "answers a question". A parser answers the question: "What object does this string represent?"

When given bad input, the answer is not "NULL". It is not "-1". It is not a tuple of ("SUCCESS", NULL). The question itself has no answer. Exceptions enable code flows for questions without answers, and for procedures that can not be actualized.

Now, you can engineer it such that you change the question to "Give me a tuple that represents an error code and an object representing... etc." But then if your calling function can not answer the question it's meant to answer, you have to manually check the inner function result and return some code. With exceptions, stack unwinding comes for free, with zero code. Once you get used to expecting the secondary, exceptional code flow, you can read and design far cleaner code than you can otherwise.
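
A small self-contained Python sketch of the contrast (toy integer parser, just to show the two shapes):

    class ParseError(Exception):
        pass

    # Style 1: return an (ok, value) tuple -- every caller has to check it.
    def parse_int_tuple(text):
        try:
            return True, int(text)
        except ValueError:
            return False, None

    # Style 2: raise when the question "what int does this string represent?"
    # has no answer; the stack unwinds for free until someone can handle it.
    def parse_int_exc(text):
        try:
            return int(text)
        except ValueError:
            raise ParseError("not an integer: {!r}".format(text))

    def double_it(text):
        # Nothing to check here: if parsing fails, this frame simply unwinds.
        return 2 * parse_int_exc(text)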


To be more specific (and your parser example actually makes it very clear), exceptions are a form of an error monad. The benefit that you describe - the ability to automatically flow through error values without having to check for them at every boundary - is exactly what a monad provides.

The problem with exceptions is that they're not type-checked in most languages - you can throw whatever you want, and the type system doesn't reflect that, so there's no way to statically determine that all expected exceptions are properly handled.

They're (partially) type-checked in Java, but it's extremely annoying, because there's no ability to express that part of the type in a way that lets you write methods that are generic with respect to exceptions thrown by other code (e.g. there's no way to describe a method `f(T x)` such that it throws `E` plus everything that `x.g()` may throw, for any `T`).


In Rust, it would be Option<T> where T is the type of the object you're supposed to get.

This means that if parsing fails, it would be None; if it succeeds, it would be Some(T).


Instead of `Option<T>` it would be more idiomatic to use `Result<T, ()>` or even better `Result<T, E>`. Where `E` is some way to communicate the error code/type/message back to the caller.


Consider this:

    def load_data():
      with open('some_file.json', 'r') as f_in:
        data = parse_json_stream(f_in)
        data['timestamp'] = some_date_fn() # Do something with the *definitely-valid* data on the next line.
        return data

    def parse_json_stream(io_stream):
      # Some complex parser...
      # at some point...
      if next_char != ']':
        raise JsonException('Expected "]" at line {}, column {}'.format(line, col))
      # More parser code...

A benefit of exceptions here is that you don't have to check the result of "data = parse_json_stream(f_in)" to immediately work with the resulting data. The stack unwinds until it is in a function that can handle the exception.

*edit: Code formatting.


Rust nearly has that same benefit. You can wrap `parse_json_stream(f_in)` in `try!(parse_json_stream(f_in))`, and if an error was returned from `parse_json_stream`, then an early return for `load_data` is inserted automatically, thereby propagating the error, similar to raising an exception.

Of course, these approaches are not isomorphic, but in Rust, the cost of explicitly checking an error value is typically very small thanks to algebraic data types, polymorphism and macros.


But exception handling is flow control, by its very nature. So it's clearly a gray area and the right thing to do depends on the common idioms of the language you're using, the expected frequency of parsing failures, and (possibly) runtime performance concerns. In Java for example, the XML parser built into the standard library does throw exceptions for certain types of invalid input.


But see Dijkstra on "GOTO statement considered harmful" (http://david.tribble.com/text/goto.html). The problem is unstructured control flow, which both GOTOs and exceptions-as-control-flow give you; at least in what I was taught (early-2000s CS degree focused on C++), unstructured flow of control is only acceptable when it's a panic button to quit the program (or a major area of processing).

It sounds like the Web way of doing things doesn't have this tradition -- much like how it doesn't have static strict extensible type systems.


Thank you for that great link. Thank you twice over, because it refutes your claim.

Dijkstra specifically calls out exceptions as structured control flow, and as being probably-acceptable, and not subject to his concerns.

More broadly, any argument that goes "Exceptions are an extension of GOTO, and therefore bad" has some questions to answer, given that nearly all control structures are implemented as an extension of GOTO.

As to your last sentence, I think you have it backwards. I speculate that of the code written in 2016, most of the code that did not use exceptions for control flow was Javascript async code. (There are of course other non-exception-using languages, but other than C they're mostly pretty fringe, and for good or for ill JS is so damn fecund).


> nearly all control structures are implemented as an extension of GOTO

Well put. Under the hood, every IF, ELSE, WHILE, SWITCH, FOR, and other decision point in structured code is implemented with at least one unconditional JMP.


Exception flow control is far more structured than goto. An unexpected goto can drop you anywhere in memory, with no idea how you got there or how to get back. Exceptions cause a clearly-defined chain of events and you always know exactly where you'll end up (passing through all the finally blocks until you hit the closest catch block).


It isn't unstructured. It is an exception from the control flow of calling and returning from functions, but it has a structure of its own which is often very convenient for writing clean code. This is not at all specific to the Web, by the way.


Quoting Dijkstra from 1968 on goto as an argument from authority in 2016 should be considered harmful.

He had reasons for saying what he said in the context of the times. Remove that temporal context and the why becomes more important than the what.


His arguments are still sound, I think, and until that post I'd assumed that avoiding unstructured flow of control was still received wisdom. (I've certainly found it a very sound policy; million-plus-line codebases of other people's code, plus surprises, equal misery.)

Joel Spolsky had much the same to say: http://www.joelonsoftware.com/items/2003/10/13.html . If I'm going to have to argue about this, I'd rather use Spolsky than Dijkstra as a starting point; what, if anything, do you see that's wrong in his article?


The article basically makes 2 points, which are:

1. They are invisible in the source code

2. They create too many possible exit points

As a Java developer I really don't find 1 to be a problem at all. Lots of people complain about checked exceptions, but they solve this problem. It's very easy to see which functions throw and which don't. I'd even argue that the invisibility that is there is a positive -- I usually don't want to think about the error case, but can easily trace the exceptional path if I do. I find that often functions need to go back up the stack to handle issues, in which case the exception control flow is exactly what I want.

Runtime exceptions, on the other hand, are invisible and can be an issue, but things like running out of memory, or overflowing the stack, well, I really don't want to have to write: public int add(int x, int y) throws OutOfMemoryException, StackOverflowException, ...

For 2, I find that they just create 1 extra exit point, which is back up the stack/to the catch handler. You could certainly abuse them to cause problems, but I personally don't find this to be an issue in practice.

I think that everyone agrees that unstructured flow of control is problematic, but checked exceptions do have structure, even if it's a bit of a loose one.


Parsing is often deeply recursive and in some languages, throwing an exception will automatically unwind the stack until it finds a handler. As well as explicitly giving you a good path and a bad path (as someone pointed out upthread) this can (again, in some languages) save a ton of repeated special case code in your parsing routines.

Some languages or libraries are explicitly designed to use exceptions for flow control some strongly discourage this. Apple's ObjC error handling guide for example contains this stricture but also calls out parsing as an example of when you should do it.

C.A.R. 'null pointer' Hoare considered exceptions a blight on the earth that would lead to us accidentally nuking ourselves, so there's a spectrum of opinions available here.


The JavaScript JSON parser throws an exception on invalid input. A benefit of this is that there is only one code path to handle any kind of failure, and another is that unhandled parse failures lead to a hard stop, which is a good default.


> Exceptions should only be used for exceptional cases.

Why?

(My thoughts on the matter: http://www.mcherm.com/reasons-why-my-code-style-is-wrong.htm... )


You must really dislike Python, then, since exceptions are used for all sorts of things that aren't unusual.


Legacy PHP code has an "Error" failure mechanism where the process just ends or execution jumps straight to a global handler function.

There is ongoing work to deprecate and remove it in favor of actual exceptions, which travel up the stack and are much easier to clean up after.


Recursion kills.


Parsing HTML is literally orders of magnitude more complex.

It is pretty rare to need to parse JSON yourself (what environment doesn't have that available?) but it isn't that difficult. It's a simple language.


Parsing JSON is indeed several orders of magnitude less complex than parsing HTML. In both cases there are excellent parsing libraries, but it would be very unwise to create your own HTML parser for production use, or to use any parser that hasn't seen some serious scrutiny.


JSON at least has the concept of "invalid JSON". That's a big step forward. A JSON parser, like an XML parser, can say "Syntax error - rejected." There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.

As someone who has a web crawler, I'm painfully aware of how much syntactically incorrect HTML is out there. HTML5 has a whole section which standardizes how to parse bad HTML. That's just the syntax needed to parse it into a tree, without considering the semantics at all.


> There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.

Actually, as someone who has written an HTML parser by following the HTML5 spec, I see it as the opposite: because every string of bytes essentially corresponds to some HTML tree, there are no special "invalid" edge cases to consider and everything is fully specified. That's the best situation, since bugs tend to arise at the edge cases, the boundaries between valid and invalid. But the HTML5 spec has no edge cases: the effect of any byte in any state of the parser has been specified.

> HTML5 has a whole section which standardizes how to parse bad HTML.

In some ways, I think it's a bit of a moot point what is "bad HTML" if all parsers that conform to the standard parse it in the same way. The spec does mention certain points as being parse errors, but are they really errors after all, if the behaviour is fully specified (and isn't "give up")? In fact, disregarding whether some states are parse errors actually simplifies the spec greatly because many of the states can be coalesced, and for something like a browser or crawler, it's completely irrelevant whether any of these "parse errors" actually occurred during the parsing. One example that comes to mind is spaces after quoted attribute values; a="b" c="d" and a="b"c="d" are parsed identically, except the latter is supposedly a parse error. Yet both are unambiguous.


I've written implementations of the HTML5 color algorithms. There are some sequences of bytes which, when given as a color value in HTML5, don't correspond to an RGB color, which makes things interesting.

(for the record, they are the empty string, and any string that is an ASCII case-insensitive match for the string "transparent")


It's worthwhile pointing out that HTML parsers are allowed to abort parsing the first time they hit each parse error, if they so choose. As such, not all implementations are guaranteed to parse content that contains parse errors, hence why it matters for authoring purposes.


> It is pretty rare to need to parse JSON yourself but it isn't that difficult.

In theory, it's not supposed to be "that difficult". But in practice, according to the linked-to article, due to all the rot and general clusterfuck-ery in the various competing specifications, apparently it is.

Or do you really think you could wrap your head around all those banana peels, and put together a robust, production-ready parser in a weekend?


Depends upon what you're doing.

I wouldn't want my own JSON parser out on the web, but if I needed to get JSON from $known_environment to $my_service, I'd feel safe enough with a parser I wrote.


Well that's the difference: it's the discrepancy between handling the data you're handed, and all the data possible. JSON has a lot of edge cases that are very infrequently exercised. Therefore a "robust, production-ready parser" is not usually what's desired by the pragmatist with the deadline. This can lead to security holes, but it doesn't necessarily. For example, sometimes the inputs are config files curated by your coworkers, or outputs which come from a server under your control and will always be in UTF-8 and will never use floats or integers larger than 2^50.

Taking it the other way, we can also ask "how can you optimize the parser to be as simple as possible, so that everything is well-specified and nobody can eff it up, while still preserving the structure that JSON gives you?" I tried to experiment with that about five years ago and came up with [1], but it shows a nasty cost differential between "human-readable" and "easy to parse." For example, the easiest string type to parse is a netstring, and this means automatic consistent handling of embedded nulls and what-have-you... but when those unreadable characters aren't escaped then you inherently have trouble reading/writing the file with a text editor. Similarly the easiest type of float is just to take the 64 bits and either dump them directly or as hex... but either way you don't have the ability to properly edit them with the text editor. Etc.

But I am finding that the central problem I'm having with JSON and XML is that it's harder to find (and harder to control!) streaming parsers, so one thing I'm thinking about for the future is that formats that I use will probably need to be streaming from the top-level.[2] So if anyone's reading this and designing stuff, probably even more important than making the parser obviously correct is making it obviously streaming.

[1] https://github.com/drostie/bsencode is based on having an easy-to-parse "outer language" of s-expressions, symbols, and netstrings, followed by an interpretation step where e.g. (float 8:01234567) is evaluated to be the corresponding float.

[2] More recently I've had a lot of success in dealing with more-parallel things to have streamability; for example if you remove whitespace from JSON then [date][tab][process-id][tab][json][newline] is a nice sort of TSV that gets really useful for a workflow of "append what you're about to do to the journal, then do it, then append back that it's done" and so forth; when a human technician needs to go back through the logs they have what they need to narrow down (a) when something went wrong, (b) what else was on that process when it was going wrong, (c) what did it do and what was it trying to do? ... you can of course do all this in JSON but then you need a streaming JSON parser, whereas everyone can do the line-buffering of "buffer a line, split the next chunk by newlines, prepend the first line with the buffer, then save the last line to the buffer, then emit the buffered lines and wait on the next chunk."
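
That line-buffering loop is short enough to sketch in Python (illustrative only; the date/pid/json tab-separated journal format is the one described above, not any standard):

    import json

    def records(chunks):
        # Yield (date, pid, parsed_json) tuples from an iterable of text chunks.
        buffer = ""
        for chunk in chunks:
            buffer += chunk
            lines = buffer.split("\n")
            buffer = lines.pop()            # the last piece may be a partial line
            for line in lines:
                if not line:
                    continue
                date, pid, payload = line.split("\t", 2)
                yield date, pid, json.loads(payload)
        if buffer:                          # trailing record without a newline
            date, pid, payload = buffer.split("\t", 2)
            yield date, pid, json.loads(payload)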


> what environment doesn't have that available?

AutoCAD LISP is the one I ran into. At least it didn't have one 4 years ago. I'm sure there are other niche cases.

Although I'll admit I punted and wrote a trivial 'parser' that was only able to read and write the particular JSON I was dealing with in that project.


FWIW, if the need comes up again: https://github.com/mbeloshitsky/autolisp-json

Note it's only two years old, so wouldn't have helped last time ;-)


Wouldn't it be possible to write an external tool that converts JSON to s-expressions and vice versa?


Sure, but then you're just adding additional layers to what was supposed to be a fairly straightforward script. Much easier to just include everything in a single program so you can just say: give the path to the input JSON file here, give the name of the output file here, and run.


> It is pretty rare to need to parse JSON yourself (what environment doesn't have that available?) but it isn't that difficult. It's a simple language.

That, coupled with the fact that it is still so easy to get wrong and to introduce security issues, is exactly what should pique your attention to the seriousness of the subject. Building any parser is fraught with risk; it is super easy to get it subtly and horribly wrong.


Writing any code is fraught with risk, but writing a parser in a modern and reasonably safe language is not something to be greatly feared. It's more likely that you'll introduce a security issue in what you do with the JSON immediately after you parse it.


> writing a parser in a modern and reasonably safe language is not something to be greatly feared

It ought to be feared, if interoperability is involved. The problem isn't that you might introduce security issues. The problem is usually that you introduce very subtle deviations to the spec that everyone else implemented correctly, and as a result, sometimes your input and/or output do not work with other stuff out there.


Writing a parser for a badly-specified format which is widely used is a terrifying prospect in any language.

Okay so it's more terrifying in C than most other things, but still, it's terrifying. Runaway memory consumption, weird Unicode behaviour etc. etc. etc. It's easy to think you don't have to worry about Unicode because your language's string types will handle it for you - but what do they do if the input contains invalid codepoints? You're writing a parser, you need to know - and possibly override it if that behaviour conflicts with the spec.

Horrible business. Definitely not my favourite job.


Silent moment for those of us using niche languages to meet production requirements in environments that do not allow third-party code and do not have JSON parsing in the std lib...


If you don't have a solution, or you're not happy with your current solution, take a look at parsec-style parsing. You can make a lot of progress with just a few combinators, and those style parsers are pretty easy to read.

You can get an implementation working with a fairly high level of confidence that it's right.

If it's not fast enough, make a pretty printer for your AST. Then do a CPS transform (by hand) on your library and parser, so you can make the stack explicit. Make sure the transformed version pretty prints exactly the same way.

Then make a third version that prints out the code that should run when parsing a document, rather than doing the parsing directly. You'll get a big case switch for each grammar you want to parse. Your pretty printer will help you find many bugs.

It's a pretty achievable path to get your grammar correct, and then get a specialized parser for it.
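
A minimal Python sketch of the combinator idea (nowhere near a full JSON grammar, just enough to show the shape):

    # Each parser is a function: (text, pos) -> (value, new_pos), or None on failure.

    def char(c):
        def parse(s, i):
            return (c, i + 1) if i < len(s) and s[i] == c else None
        return parse

    def many(p):
        def parse(s, i):
            values = []
            while True:
                r = p(s, i)
                if r is None:
                    return values, i
                value, i = r
                values.append(value)
        return parse

    def seq(*parsers):
        def parse(s, i):
            values = []
            for p in parsers:
                r = p(s, i)
                if r is None:
                    return None
                value, i = r
                values.append(value)
            return values, i
        return parse

    # Tiny grammar: zero or more 'a' followed by a 'b', e.g. "aaab".
    a_then_b = seq(many(char("a")), char("b"))
    print(a_then_b("aaab", 0))   # ([['a', 'a', 'a'], 'b'], 4)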


If your language doesn't come with JSON in the stdlib, you're really on the cutting edge. Or using a language meant for embedding :)


Doesn't Java lack a JSON parser in the standard library?


As of Java 8, the Nashorn JavaScript engine is included by default, and it does support JSON parsing.

[1] http://winterbe.com/posts/2014/04/05/java8-nashorn-tutorial/


The standard Java library includes a JavaScript interpreter, so there has to be a parser for object literals in there somewhere.


Object literals aren't JSON, though.


JS has stringify and parse, so there ought to be a JSON parser somewhere.


One thing that the article mentions is that there are in fact strings that are valid JSON but not valid JS object literals.


https://developer.mozilla.org/en/docs/Web/JavaScript/Referen...

A modern JS implementation should also have a JSON parser


> you're really on the cutting edge.

Or the other way around. Think about something like MUMPS.


No, no thank you. Did it for a year. Pretty sure I don't even want to think about doing it again.


Pour a 40 for me; I shall go wallow in my assembly shame.


Last time I worked on a FileMaker DB, it didn't have JSON support. I don't know if that has changed in the last few years.


Oh my, this is pure nonsense.

First, parsing JSON is trivial compared to other parsing tasks. There are no cycles as in YAML or other serializers; it's trivial forward scanning, without any need to tokenize or backtrack.

Second, JSON is one of the simplest formats out there, and due to its simplicity also one of the most secure. It has some quirks and some edge cases are not well-defined. But with those problems you can always check against your local javascript implementation and the spec, just as OP did.

I know very few JSON parsers which actually crash on illegal input. There are some broken ones, but there are many more broken and insecure-by-default YAML or BSON parsers, or language serializers like pickle, serialize, Storable, ...

Parsing JSON is not a minefield, parsing JSON is trivial.

Takeaway: Favor JSON over any other serialization format, even if there are some ill-defined edge cases, comments are disallowed, and the specs are not completely sound. The YAML and XML specs are much worse, and their libraries horrible and bloated.

JSON is the only secure by default serializer. It doesn't allow objects nor code, it doesn't allow cyclic data, no external data, it's trivial, it's fast.

Having summarized that, I'm wondering why OP didn't include my JSON parser in his list, Cpanel::JSON::XS, which is the default fast JSON serializer for perl, is the fastest of all those parsers overall, and is the only one which does pass all these tests. Even more than the new one which OP wrote for this overview, STJSON. The only remaining Cpanel::JSON::XS problem is decoding the BOM for UTF-16 and UTF-32. Currently it throws an error. But there are not even tests for that. I added some.

Regarding security: https://metacpan.org/pod/Cpanel::JSON::XS#SECURITY-CONSIDERA...


>> Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention

This attitude really pisses me off. Get off your high horse.


No kidding. If everyone thought that way, the only JSON parser in existence, which everyone would have to use, would be barely working, because everyone would be put off from trying to write something better.

I have actually written an HTML parser, and it was not so difficult because the HTML spec specifies everything down to the level of tokenising individual characters.


The implicit subtext here is "for production". It's an excellent rule of thumb since it's highly likely that one already exists that is already battle tested. It's better to use that one. It's not a high horse, it's just good horse sense.

He's not talking about rolling your own for fun or learning reasons which anyone should feel free to do.


Some of us are not such shitty programmers that we can't build critical systems on our own. Critical business use is absolutely the worst time to farm out to a third party.

And quit using war metaphors to feel macho about your work. What we do in no way resembles battle. Neither are most libraries tested to the degree you are trying to suggest. Neither is common usage in several projects a reliable source of testing.

When people say things like "don't reinvent the wheel" or "only for yourself, not for your employer", what they mean is "I know I can't, and who do you think you are thinking you're better than me. Get back in your place."


You seem to be offended when no offense was intended. Nowhere did the OP indicate that this rule applied to critical business use. JSON parsing is almost never a critical business use case.

And I don't know about you, but these days hosting a public-facing internet service is increasingly like a battle: DDoS, data loss, state-sponsored APTs. I certainly don't feel macho about my work. I don't author many libraries that are used by a public-facing service on the internet, so I have nothing to beat my chest about. I do however use libraries that have been written by someone else and had the bugs shaken out by years of continuous use in a hostile environment. It's not a perfect guarantee but it's certainly better than something I wrote this month.

In most cases a library that has been in continuous use for several years on large public facing services will have been much more tested than anything you might write. Not because you aren't skilled enough but because the edge cases are unbounded and you can't think of everything.


Takeaway: Don't parse JSON. Every JSON parser is wrong. JSON is not a standard.

Parsing everything is NOT a minefield. We have BNFs, parser theory, etc. for that. Lots of languages have clear, unambiguous definitions... JSON clearly does not. It's a disgrace for the software engineering community.


JSON does have a BNF definition. http://www.json.org/


> Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention.

Been there done that. (The medical attention, I mean.) Worked just fine. The article makes it sound extremely difficult, but 100% of the article is about edge cases that rarely happen with normal encoders and can often be ignored (e.g. who cares if you escape your tab character?).

> consider what happens when the code opens a session, sets the session username, then parses some input JSON before the password is evaluated

Edit: I responded more elaborately on the unlikelihood of this, but honestly, I can't come up with a single conceivable scenario. How would you decode part of the JSON and only parse the password bit later?


Yeah, the whole idea is a terrible practice, but it is used in industry surprisingly often. People think that PHP's session lock will save them, seeing the potential race condition but not the decode failure.

The scenario is typically username-password-parameters passed as three variables via POST. The offending developer parses all three variables up front for simplicity:

The session user is created from the POST username, the POST password is input sanitized and put in a variable, then the parameters are parsed as JSON. The attacker POSTs a valid user, a valid (but incorrect) password, and malformed JSON.

The login code reads the variables from POST, opens the session, and dies on the JSON decode.

Hint: this bug exists in the wild on a massively deployed framework. I'm working up a responsible disclosure on that one now.


> Hint: this bug exists in the wild on a massively deployed framework. I'm working up a disclosure on that one now.

Alright you win. I'm curious to know which one it was after it's fixed!

I really didn't expect this to exist...


> and dies on the JSON decode

That's the part I don't get. Once the code dies it's done. What exactly can you do now, if no code is even running?


The steps are the following:

1. Parse user.

2. Parse password.

3. Create session object from user and password.

4. Parse JSON. (Crashes here.)

5. Update session and do some other security stuff.

The problem is that they're creating a persistent session object ahead of when the JSON parameters are being decoded, which leaves a (partially initialized, persisting-outside-of-function) session object without having gone through the full authentication flow.

The problem is that they're creating permanent objects ahead of a full validation on the data necessary to actually properly initialize the object.

(I think the OP may have been a little vague because that's probably specific enough to identify the vuln with a scanner and a hunch.)
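
To make the ordering bug concrete, here's a hypothetical Python sketch (the SESSION_STORE, check_password stub, and request shape are all made up for illustration, not the framework in question):

    import json

    SESSION_STORE = {}   # stands in for server-side session state that
                         # persists across requests

    def check_password(username, password):
        return False     # stub; a real implementation would verify credentials

    def login(request):
        # Steps 1-3: a session is created *before* authentication completes.
        session_id = request["cookie"]
        SESSION_STORE[session_id] = {"user": request["username"]}

        # Step 4: malformed JSON raises here and the request dies...
        params = json.loads(request["params"])

        # Step 5: ...so the password check and cleanup below never run,
        # leaving the half-initialized session behind for the attacker.
        if not check_password(request["username"], request["password"]):
            del SESSION_STORE[session_id]
            raise PermissionError("bad credentials")
        SESSION_STORE[session_id]["authenticated"] = True
        return params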


Wait, so it creates a privileged session before verifying the password? That's your problem right there. A crash in the JSON processor (or anywhere else) is a minor blip compared to this godzilla bug of granting access before it's been earned.


Spot on! But if the JSON parser couldn't crash on malformed input, that kind of whopping mistake would be a lot harder to exploit.


Remember, it's PHP. Code will be running again on the next request. Only the process (or thread or whatever) handling the request with the invalid JSON crashes, and it only crashes after producing a persistent session; the next request, a different process/thread/whatever will run code, but the session from the previous request still exists.


I still don't get it. The browser gets the session cookie even though the request processing crashed? Or does another request manage to pick up the session that was dropped from the crashed request?

I don't do PHP. It just sounds incredible to me that so little isolation seems to exist.


Thanks for the extra explanation - I too doubted this could exist (surely the session existing alone isn't enough to assume a user is authenticated - need a variable set & checked every request?) but I see that it could.

I look forward to the disclosure to understand more about this - and check that the frameworks I use don't have this problem.


> edge cases that rarely happen with normal encoders and can often be ignored

I've found that "edge cases that rarely happen" are typically 90% of my major headaches.


Certainly, edge cases are a source of pain. Luckily I mentioned that even if you ignore most of these edge cases (as a parser) the data will come out just fine.


> Setting the session flag for "this user is logged in" before checking (or even decoding!) the password seems rather backwards to me.

Yeah, that seems like a problem regardless of whether or not you're parsing JSON.


Probably a symptom of the PHP multiverse: anything that can happen, has happened.


And will continue to happen in a Wordpress setup somewhere.


> How would you decode part of the JSON and only parse the password bit later?

Not saying it's common or likely in this scenario, but there are streaming JSON parsers (like SAX for XML)


> Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention.

As someone who has written his own JSON parser, I must concur. Ahh - are there any doctors here...?

In my defense - I was porting a codebase to a new platform, and needed to replace the existing JSON 'parser'. You see, it was:

  - Single-platform
  - Proprietary
  - Little more than a tokenizer with idiosyncrasies and other warts
Why was it chosen in the first place? Well, it was available as part of the system on the original platform. Not that I would've made the same choice myself. We had wrappers around it - but they didn't really abstract it away in any meaningful manner. So all of its idiosyncrasies had leaked into all the code that used the wrappers.

In the interests of breaking as little existing code as possible, I wrote a bunch of unit tests, and rewrote the wrapper in terms of my own hand-rolled tokenizer. Later - either after the port, or as a side project during the port to help out a coworker (I forget) - I added some saner, higher-level, easier-to-use, less idiosyncratic interfaces - basically allowing us to deprecate the old interface and clean it up at our leisure. This basically left us with a full-blown parser - and it was all my fault.

> Takeaways: Don't parse JSON yourself, and don't let calls to the parsing functions fail silently.

I'd add to this: Fuzz your formats. All of them. Even those that don't receive malicious data will receive corrupt data.

Many of the same problems also affect e.g. binary formats. And just because you've parsed valid JSON doesn't mean you're safe. I've spent a decent amount of time using e.g. SDL MiniFuzz - fixing invalid enum values, unchecked array indices, huge array allocations causing OOMs, bad hashmap keys, the works. The OOM case is particularly nasty - you may successfully parse your entire input (because 1.9GB arrays weren't quite enough to OOM your program during parsing), and then later randomly crash anywhere else because you're not handling OOMs throughout the rest of your program. I cap the maximum my parser will allocate to some multiple of the original input, and cap the original input to "something reasonable" (1MB is massive overkill for most of my JSON API needs, for example, so I use it as a default.)
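
As a concrete version of that last mitigation (Python sketch; the 1 MB figure is just the default mentioned above):

    import json

    MAX_BODY_BYTES = 1 << 20   # 1 MB -- massive overkill for most JSON API needs

    def safe_parse(raw_bytes):
        # Reject oversized input before parsing: a multi-gigabyte body should
        # fail fast here, not show up later as a mysterious OOM elsewhere.
        if len(raw_bytes) > MAX_BODY_BYTES:
            raise ValueError("request body too large")
        return json.loads(raw_bytes.decode("utf-8"))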


I had a need to write my own JSON parser for C#, although that was mostly because I hated the data structures the existing C# parsers produced.

I had the advantage that I only needed to use it for rapid prototype projects, and that I could count on all of the data from a single source being the same "shape" (only the scalar values changed, never the keys or objects).

Not following the RFC helped greatly, as I just dgaf about things like trailing commas.

The biggest "gotcha" for my first implementation was a) temporary object allocation and b) recursion depth. The third prototype to use the parser needed to parse, I think it was, a ten thousand line JSON file? The garbage collection on all the temporary objects (mostly strings) meant that version took ~30 seconds. I refactored to be non-recursive and re-use string variable wherever possible, and it dropped down to seconds.

There was a moment in writing it that I thought I would have an actually legitimate use for GOTO (a switch case that was mostly a duplicate of another case), but that turned out not to be the case :/


Wouldn't it have been easier to use the C# JSON parser, and then later walk the tree it creates and convert it into a more sane data structure that way?


Hmm. Yeah, it definitely would have been. All I can say is - I was frustrated with the way something functioned, and rather than patch it, I "fixed" it. Where "fixed" is less complete and stable and...

Well, I also wanted a YAML parser, and had the weird need to handle JSON inside XML (oh, industrial sector, and your SOAP), and to not care about things like trailing commas, and then at some point to deal with weirdly large amounts of data.

Each iteration only took a couple of days, and I learned a bunch about tokenizing things, and then dealing with token streams.


The article is about how the parsers behave differently. No suggestions of writing from scratch.

The === issue is in JS as well.


There's nothing wrong with writing a JSON parser, but you shouldn't expect to do anything else for the next while if you want to do it correctly.


Why would you set the session before the password has been evaluated?


Why do bugs exist?



