Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention.
Secondly, assume that parsing your input will crash, so catch the error and have your application fail gracefully.
This is the number one security issue I encounter in "security audited" PHP. (The second being the "==" vs. "===" debacle that is PHP comparison.)
As one example, consider what happens when the code opens a session, sets the session username, then parses some input JSON before the password is evaluated. If the script crashes at json_decode(), it dies with the session still open, so the attacker can log in as anyone.
Third, parsing everything is a minefield, including HTML. We as a community invest a lot of collective effort in improving those parsers, but this article does serve as a useful reminder of a lot of the infrastructure we take for granted.
Takeaways: Don't parse JSON yourself, and don't let calls to the parsing functions fail silently.
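To make the second takeaway concrete, here is a minimal sketch in Python rather than PHP. The function name `parse_request_body` and the error message format are made up for illustration; `json.loads` and `json.JSONDecodeError` are the standard library's own.

```python
import json

def parse_request_body(raw):
    """Return (data, error). error is None when parsing succeeded."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # Malformed input is an expected condition: report it, don't die.
        message = "invalid JSON at line {}, column {}: {}".format(
            exc.lineno, exc.colno, exc.msg)
        return None, message

print(parse_request_body('{"user": "alice"}'))
print(parse_request_body('{"user": '))
```

The point is only that the failure path is handled explicitly and early, before any other state has been touched.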
A parser should never crash on bad input. If it does, that's a serious bug that needs immediate attention, since that's at least a DoS vulnerability and quite likely something that could result in remote code execution. You definitely need to assume that the parser could fail, but that's different. Unless you're using "crash" in some way I'm not familiar with?
From the context, I'm assuming s_q_b simply means that if given malformed input, the parser should communicate the problem via the appropriate error channels for the given language. In Java, throw an exception; in Go, return an error, etc.
But yeah, the article's example of an XCode segfault is a good example of both poor parsing logic and poor isolation of fault domains. If your json processing library can corrupt the whole process then something's wonky.
What I see a lot is initializing an object from a JSON parser which, when it receives malformed input, either halts execution outright or returns an empty object. In both cases you just want to make sure to handle the error appropriately per-language.
Again, the main offender here is PHP, which makes this issue surprisingly easy to get wrong. (And add "==" to your lint checks! 99% of the time coders mean "===", and this can lead to surprisingly severe security issues.)
If you meant that it would throw or return an error on malformed input, I totally agree. You must always handle errors that your calls might return, and especially when those calls are given anything derived from outside input.
Indeed, forgetting to check for errors is a great way to cause all sorts of terrible bugs. That's one reason I'm starting to become a fan of checked exceptions and result types.
It can always crash when there is not enough memory.
Especially on the stack.
One of the test cases has 100000 [ and my own JSON parser crashed on them, because it parsed recursively and the stack overflowed. So now I have fixed it to parse iteratively.
It still crashes.
Big surprise. Turns out the array destructor is also recursive, because it deletes every element in the array. So the array destructor crashes at some random point, after the parser has worked correctly.
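A rough sketch of the iterative idea, in Python. This is not a parser, only a single-pass depth check you could run before handing input to a recursive one; the name `max_nesting_depth` is hypothetical, and it assumes the input is already a decoded string.

```python
def max_nesting_depth(text):
    """Scan JSON text once, without recursion, and return the deepest
    level of [ ] / { } nesting. Brackets inside strings are ignored."""
    depth = 0
    deepest = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False        # this character was escaped, skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch in "]}":
            depth -= 1
    return deepest

# The 100000-bracket test case is handled with constant stack usage.
print(max_nesting_depth("[" * 100000))
```

A caller could reject anything deeper than some limit it is comfortable with. Note that this does nothing about the second problem described above, a recursive destructor; that has to be solved in the data structure itself.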
In "safe" languages (i.e. ones which aren't supposed to segfault/etc unless there's a compiler/interpreter/VM bug -- PHP, Java, JS, Python all fit in here) I've often heard "crash" being used to just mean an uncaught exception, or something else that brings the program crashing down.
GP says "or catch the error", so they're probably using that term in this sense.
Some parsers are dealing with known good input, and should not include validation code for performance reasons. Often you will parse the same JSON many times throughout a pipeline, but you only really need to validate it once. A good example of this is a scatter-gather message bus, where the router parses a message record, scatters it to a large number of peers, and gathers responses. Depending on how latency-critical this system is, it can make sense to validate and normalize the representation at the router, and send normalized (whitespace-free) JSON to the compute nodes.
Even if the json source is $DEITY's own platinum-iridium json generator, the json has to actually get to your parser somehow, and you can't guarantee that it did so in one piece. Memory is finite, no bus is perfect, etc.
Might as well translate validated JSON to a faster binary format for the internal bus, e.g. MessagePack. Validated msgpack is probably faster than unvalidated JSON.
This was something I wondered about in these results: what's the definition of "crash"? Specifically, with the Rust libraries, I'm not sure if this means "the program panic'd" or "the program segfaulted", or something else. The former isn't ideal, but isn't the worst. The latter would be much more worrisome.
You say that as fact but you must know it is a matter of opinion: I would say you should match the language idioms. For example, Python iterators and generators work by raising/throwing when there are no more items: it is fully expected and will always happen when you write a for-in loop.
And in many languages, they are the common way to communicate that a function can not return the data that is expected of it. Invalid input data means the parser can't produce the equivalent data structure -> exception.
If the parser has some kind of partial parsing, a way to recover from errors or you are using a language in which returning explicit errors is the more common idiom, then you probably shouldn't throw an exception.
Throwing an exception when parsing fails sounds like a case of exception-handling as flow of control: a bad thing even when commonly done, having a lot in common with GOTO statements. (See http://softwareengineering.stackexchange.com/questions/18922... , and Ward's Wiki when it comes back up.)
Exceptions are for flow control. That's their entire purpose.
Throwing an exception when parsing fails is a near perfect example of where exceptions produce clarity of code.
That is to say, a parse function is usually a pure function that takes a string and returns some sort of object. As a pure function, it "answers a question". A parser answers the question: "What object does this string represent?"
When given bad input, the answer is not "NULL". It is not "-1". It is not a tuple of ("SUCCESS", NULL). The question itself has no answer. Exceptions enable code flows for questions without answers, and for procedures that can not be actualized.
Now, you can engineer it such that you change the question to "Give me a tuple that represents an error code and an object representing... etc." But then if your calling function can not answer the question it's meant to answer, you have to manually check the inner function result and return some code. With exceptions, stack unrolling comes for free, with zero code. Once you get used to expecting the secondary, exceptional code flow, you can read and design far cleaner code than you can otherwise.
To be more specific (and your parser example actually makes it very clear), exceptions are a form of an error monad. The benefit that you describe - the ability to automatically flow through error values without having to check for them at every boundary - is exactly what a monad provides.
The problem with exceptions is that they're not type-checked in most languages - you can throw whatever you want, and the type system doesn't reflect that, so there's no way to statically determine that all expected exceptions are properly handled.
They're (partially) type-checked in Java, but it's extremely annoying, because there's no ability to express that part of the type in a way that lets you write methods that are generic with respect to exceptions thrown by other code (e.g. there's no way to describe a method `f(T x)` such that it throws `E` plus everything that `x.g()` may throw, for any `T`).
Instead of `Option<T>` it would be more idiomatic to use `Result<T, ()>` or even better `Result<T, E>`. Where `E` is some way to communicate the error code/type/message back to the caller.
    def load_data():
        with open('some_file.json', 'r') as f_in:
            data = parse_json_stream(f_in)
        # Do something with the *definitely-valid* data on the next line.
        data['timestamp'] = some_date_fn()
        return data

    def parse_json_stream(io_stream):
        # Some complex parser...
        # at some point...
        if next_char != ']':
            raise JsonException('Expected "]" at line {}, column {}'.format(line, col))
        # More parser code...
A benefit of exceptions here is that you don't have to check the result of "data = parse_json_stream(f_in)" to immediately work with the resulting data. The stack unwinds until it is in a function that can handle the exception.
Rust nearly has that same benefit. You can wrap `parse_json_stream(f_in)` in `try!(parse_json_stream(f_in))`, and if an error was returned from `parse_json_stream`, then an early return for `load_data` is inserted automatically, thereby propagating the error, similar to raising an exception.
Of course, these approaches are not isomorphic, but in Rust, the cost of explicitly checking an error value is typically very small thanks to algebraic data types, polymorphism and macros.
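For comparison, here is roughly what the explicit-error-value style looks like when written out by hand in Python, which has no `try!` equivalent. `Ok`, `Err`, `parse_json` and `load_timestamped` are hypothetical names for this sketch, loosely mirroring `load_data` above; it assumes the parsed document is a JSON object.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Ok:
    value: object

@dataclass(frozen=True)
class Err:
    error: str

def parse_json(text):
    """Return Ok(parsed) or Err(message) instead of raising."""
    try:
        return Ok(json.loads(text))
    except json.JSONDecodeError as exc:
        return Err(str(exc))

def load_timestamped(text, timestamp):
    result = parse_json(text)
    if isinstance(result, Err):
        return result            # manual early return: the step try! inserts for you
    data = result.value          # assumed to be a dict in this sketch
    data["timestamp"] = timestamp
    return Ok(data)

print(load_timestamped('{"a": 1}', 1477440000))
print(load_timestamped('[', 1477440000))
```

Every intermediate caller needs that `isinstance` check, which is the boilerplate that either exceptions or a macro like `try!` removes.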
But exception handling is flow control, by its very nature. So it's clearly a gray area and the right thing to do depends on the common idioms of the language you're using, the expected frequency of parsing failures, and (possibly) runtime performance concerns. In Java for example, the XML parser built into the standard library does throw exceptions for certain types of invalid input.
But see Dijkstra on "GOTO statement considered harmful" (http://david.tribble.com/text/goto.html). The problem is unstructured control flow, which both GOTOs and exceptions-as-control-flow give you; at least in what I was taught (early-2000s CS degree focused on C++), unstructured flow of control is only acceptable when it's a panic button to quit the program (or a major area of processing).
It sounds like the Web way of doing things doesn't have this tradition -- much like how it doesn't have static strict extensible type systems.
Thank you for that great link. Thank you twice over, because it refutes your claim.
Dijkstra specifically calls out exceptions as structured control flow, and as being probably-acceptable, and not subject to his concerns.
More broadly, any argument that goes "Exceptions are an extension of GOTO, and therefore bad" has some questions to answer, given that nearly all control structures are implemented as an extension of GOTO.
As to your last sentence, I think you have it backwards. I speculate that of the code written in 2016, most of the code that did not use exceptions for control flow was Javascript async code. (There are of course other non-exception-using languages, but other than C they're mostly pretty fringe, and for good or for ill JS is so damn fecund).
> nearly all control structures are implemented as an extension of GOTO
Well put. Under the hood, every IF, ELSE, WHILE, SWITCH, FOR, and other decision point in structured code is implemented with at least one unconditional JMP.
Exception flow control is far more structured than goto. An unexpected goto can drop you anywhere in memory, with no idea how you got there or how to get back. Exceptions cause a clearly-defined chain of events and you always know exactly where you'll end up (passing through all the finally blocks until you hit the closest catch block).
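The chain of events is easy to observe. A small Python sketch (the function names `parse`, `load` and `handle_request` are invented for the demonstration) that records the order in which the finally blocks and the handler run:

```python
events = []

def parse(text):
    try:
        events.append("parse: start")
        raise ValueError("unexpected end of input")
    finally:
        events.append("parse: finally")

def load(text):
    try:
        return parse(text)
    finally:
        events.append("load: finally")

def handle_request(text):
    try:
        return load(text)
    except ValueError as exc:
        events.append("handle_request: caught " + str(exc))
        return None

handle_request("[")
for event in events:
    print(event)
# parse: start
# parse: finally
# load: finally
# handle_request: caught unexpected end of input
```

The innermost finally runs first, then each enclosing one on the way up, and control lands in the nearest matching handler.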
It isn't unstructured. It is an exception from the control flow of calling and returning from functions, but it has a structure of its own which is often very convenient for writing clean code. This is not at all specific to the Web, by the way.
His arguments are still sound, I think, and until that post I'd assumed that avoiding unstructured flow of control was still received wisdom. (I've certainly found it a very sound policy; million-plus-line codebases of other people's code, plus surprises, equal misery.)
Joel Spolsky had much the same to say: http://www.joelonsoftware.com/items/2003/10/13.html . If I'm going to have to argue about this, I'd rather use Spolsky than Dijkstra as a starting point; what, if anything, do you see that's wrong in his article?
As a Java developer I really don't find 1 to be a problem at all. Lots of people complain about checked exceptions, but they solve this problem. It's very easy to see which functions throw and which don't. I'd even argue that the invisibility that is there is a positive -- I usually don't want to think about the error case, but can easily trace the exceptional path if I do. I find that often functions need to go back up the stack to handle issues, in which case the exception control flow is exactly what I want.
Runtime exceptions, on the other hand, are invisible and can be an issue, but things like running out of memory, or overflowing the stack, well, I really don't want to have to write:
    public int add(int x, int y) throws OutOfMemoryError, StackOverflowError, ...
For 2, I find that they just create 1 extra exit point, which is back up the stack/to the catch handler. You could certainly abuse them to cause problems, but I personally don't find this to be an issue in practice.
I think that everyone agrees that unstructured flow of control is problematic, but checked exceptions do have structure, even if it's a bit of a loose one.
Parsing is often deeply recursive and in some languages, throwing an exception will automatically unwind the stack until it finds a handler. As well as explicitly giving you a good path and a bad path (as someone pointed out upthread) this can (again, in some languages) save a ton of repeated special case code in your parsing routines.
Some languages or libraries are explicitly designed to use exceptions for flow control; some strongly discourage this. Apple's ObjC error handling guide for example contains this stricture but also calls out parsing as an example of when you should do it.
C.A.R. 'null pointer' Hoare considered exceptions a blight on the earth that would lead to us accidentally nuking ourselves, so there's a spectrum of opinions available here.
The JavaScript JSON parser throws an exception on invalid input. A benefit of this is that there is only one code path to handle any kind of failure, and another is that unhandled parse failures lead to a hard stop, which is a good default.
Parsing JSON is indeed several orders of magnitude less complex than parsing HTML. In both cases there are excellent parsing libraries, but it would be very unwise to create your own HTML parser for production use, or to use any parser that hasn't seen some serious scrutiny.
JSON at least has the concept of "invalid JSON". That's a big step forward. A JSON parser, like an XML parser, can say "Syntax error - rejected." There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.
As someone who has a web crawler, I'm painfully aware of how much syntactically incorrect HTML is out there. HTML5 has a whole section which standardizes how to parse bad HTML. That's just the syntax needed to parse it into a tree, without considering the semantics at all.
> There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.
Actually, as someone who has written an HTML parser by following the HTML5 spec, I see it as the opposite: because every string of bytes essentially corresponds to some HTML tree, there are no special "invalid" edge cases to consider and everything is fully specified. That's the best situation, since bugs tend to arise at the edge cases, the boundaries between valid and invalid. But the HTML5 spec has no edge cases: the effect of any byte in any state of the parser has been specified.
> HTML5 has a whole section which standardizes how to parse bad HTML.
In some ways, I think it's a bit of a moot point what is "bad HTML" if all parsers that conform to the standard parse it in the same way. The spec does mention certain points as being parse errors, but are they really errors after all, if the behaviour is fully specified (and isn't "give up")? In fact, disregarding whether some states are parse errors actually simplifies the spec greatly because many of the states can be coalesced, and for something like a browser or crawler, it's completely irrelevant whether any of these "parse errors" actually occurred during the parsing. One example that comes to mind is spaces after quoted attribute values; a="b" c="d" and a="b"c="d" are parsed identically, except the latter is supposedly a parse error. Yet both are unambiguous.
I've written implementations of the HTML5 color algorithms. There are some sequences of bytes which, when given as a color value in HTML5, don't correspond to an RGB color, which makes things interesting.
(for the record, they are the empty string, and any string that is an ASCII case-insensitive match for the string "transparent")
It's worthwhile pointing out that HTML parsers are allowed to abort parsing the first time they hit each parse error, if they so choose. As such, not all implementations are guaranteed to parse content that contains parse errors, hence why it matters for authoring purposes.
It is pretty rare to need to parse JSON yourself but it isn't that difficult.
In theory, it's not supposed to be "that difficult". But in practice, according to the linked-to article, due to all the rot and general clusterfuck-ery in the various competing specifications, apparently it is.
Or do you really think you could wrap your head all around those banana peels, and put together a robust, production-ready parser in a weekend?
I wouldn't want my own JSON parser out on the web, but if I needed to get JSON from $known_environment to $my_service, I'd feel safe enough with a parser I wrote.
Well that's the difference: it's the discrepancy between handling the data you're handed, and all the data possible. JSON has a lot of edge cases that are very infrequently exercised. Therefore a "robust, production-ready parser" is not usually what's desired by the pragmatist with the deadline. This can inevitably lead to security holes, but it doesn't necessarily. For example, sometimes the inputs are config files curated by your coworkers, or outputs which come from a server under your control and will always be in UTF-8 and will never use floats or integers larger than 2^50.
Taking it the other way, we can also ask "how can you optimize the parser to be as simple as possible, so that everything is well-specified and nobody can eff it up, while still preserving the structure that JSON gives you?" I tried to experiment with that about five years ago and came up with [1], but it shows a nasty cost differential between "human-readable" and "easy to parse." For example, the easiest string type to parse is a netstring, and this means automatic consistent handling of embedded nulls and what-have-you... but when those unreadable characters aren't escaped then you inherently have trouble reading/writing the file with a text editor. Similarly the easiest type of float is just to take the 64 bits and either dump them directly or as hex... but either way you don't have the ability to properly edit them with the text editor. Etc.
But I am finding that the central problem I'm having with JSON and XML is that it's harder to find (and harder to control!) streaming parsers, so one thing I'm thinking about for the future is that formats that I use will probably need to be streaming from the top-level.[2] So if anyone's reading this and designing stuff, probably even more important than making the parser obviously correct is making it obviously streaming.
[1] https://github.com/drostie/bsencode is based on having an easy-to-parse "outer language" of s-expressions, symbols, and netstrings, followed by an interpretation step where e.g. (float 8:01234567) is evaluated to be the corresponding float.
[2] More recently I've had a lot of success in dealing with more-parallel things to have streamability; for example if you remove whitespace from JSON then [date][tab][process-id][tab][json][newline] is a nice sort of TSV that gets really useful for a workflow of "append what you're about to do to the journal, then do it, then append back that it's done" and so forth; when a human technician needs to go back through the logs they have what they need to narrow down (a) when something went wrong, (b) what else was on that process when it was going wrong, (c) what did it do and what was it trying to do? ... you can of course do all this in JSON but then you need a streaming JSON parser, whereas everyone can do the line-buffering of "buffer a line, split the next chunk by newlines, prepend the first line with the buffer, then save the last line to the buffer, then emit the buffered lines and wait on the next chunk."
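The line-buffering recipe in [2] is short enough to sketch in Python. The names `iter_lines` and `parse_journal_line` are hypothetical, the journal entries in the usage lines are invented sample data, and the sketch assumes the chunks are already decoded text and that the JSON field is whitespace-free as described, so it contains no raw tabs or newlines.

```python
import json

def iter_lines(chunks):
    """Yield complete lines from an iterable of arbitrarily split text chunks."""
    buffer = ""
    for chunk in chunks:
        lines = (buffer + chunk).split("\n")
        buffer = lines.pop()      # last piece is incomplete (or empty): keep it
        for line in lines:
            yield line
    if buffer:
        yield buffer              # trailing data without a final newline

def parse_journal_line(line):
    """Split [date][tab][process-id][tab][json] and decode the JSON field."""
    date, process_id, payload = line.split("\t", 2)
    return date, process_id, json.loads(payload)

# Invented sample data: one record split across two chunks, then a second record.
chunks = ['2016-10-26\t41\t{"op":"st', 'art"}\n2016-10-26\t41\t{"op":"done"}\n']
for line in iter_lines(chunks):
    print(parse_journal_line(line))
```

Each line is parsed with an ordinary non-streaming JSON parser, which is the whole appeal: the streaming happens at the line level.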
Sure, but then you're just adding additional layers to what was supposed to be a fairly straightforward script. Much easier to just include everything in a single program so you can just say give path to input json file here, give name of output file here, and run.
> It is pretty rare to need to parse JSON yourself (what environment doesn't have that available?) but it isn't that difficult. It's a simple language.
That, coupled with the fact that it is still so easy to get it wrong and to introduce security issues, is exactly what should pique your attention to the seriousness of the subject. Building any parser is fraught with risk; it is super easy to get it subtly and horribly wrong.
Writing any code is fraught with risk, but writing a parser in a modern and reasonably safe language is not something to be greatly feared. It's more likely that you'll introduce a security issue in what you do with the JSON immediately after you parse it.
> writing a parser in a modern and reasonably safe language is not something to be greatly feared
It ought to be feared, if interoperability is involved. The problem isn't that you might introduce security issues. The problem is usually that you introduce very subtle deviations to the spec that everyone else implemented correctly, and as a result, sometimes your input and/or output do not work with other stuff out there.
Writing a parser for a badly-specified format which is widely used is a terrifying prospect in any language.
Okay so it's more terrifying in C than most other things, but still, it's terrifying. Runaway memory consumption, weird Unicode behaviour etc. etc. etc. It's easy to think you don't have to worry about Unicode because your language's string types will handle it for you - but what do they do if the input contains invalid codepoints? You're writing a parser, you need to know - and possibly override it if that behaviour conflicts with the spec.
Horrible business. Definitely not my favourite job.
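To illustrate the "what does your string type do with invalid input" question, here is what CPython's standard library does in two cases. This is an observation about one implementation, not a statement of what the JSON specs require.

```python
import json

raw = b'"caf\xff"'                 # 0xFF can never appear in valid UTF-8

# Strict decoding (the default) rejects the bad byte outright.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "rejected"

# Lenient decoding silently substitutes U+FFFD, which changes the data.
lenient = raw.decode("utf-8", errors="replace")

# An escaped lone surrogate is accepted by json.loads ...
lone = json.loads('"\\ud800"')

# ... but the resulting string cannot be encoded back to UTF-8.
try:
    lone.encode("utf-8")
    reencode = "ok"
except UnicodeEncodeError:
    reencode = "fails"

print(strict_result, repr(lenient), len(lone), reencode)
```

So even with a "safe" string type there are at least three different behaviours in play (reject, replace, accept and fail later), and a parser author has to pick one on purpose.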
Silent moment for those of us using niche languages to meet production requirements in environments that do not allow third-party code and do not have JSON parsing in the std lib...
If you don't have a solution, or you're not happy with your current solution, take a look at parsec-style parsing. You can make a lot of progress with just a few combinators, and those style parsers are pretty easy to read.
You can get an implementation working with a fairly high level of confidence that it's right.
If it's not fast enough, make a pretty printer for your AST. Then do a CPS transform (by hand) on your library and parser, so you can make the stack explicit. Make sure the transformed version pretty prints exactly the same way.
Then make a third version that prints out the code that should run when parsing a document, rather than doing the parsing directly. You'll get a big case switch for each grammar you want to parse. Your pretty printer will help you find many bugs.
It's a pretty achievable path to get your grammar correct, and then get a specialized parser for it.
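As a taste of the first step only (no CPS transform, no code generation), here is a minimal combinator sketch in Python. It parses a toy grammar of unsigned integers and nested arrays, not JSON, and ignores whitespace; all names (`char`, `digits`, `either`, `sep_by`, `between`, `lazy`) are made up for this example rather than taken from any library. A parser here is a function `(text, pos) -> (value, new_pos)` or `None` on failure.

```python
def char(c):
    def parser(text, pos):
        if pos < len(text) and text[pos] == c:
            return c, pos + 1
        return None
    return parser

def digits(text, pos):
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    return (int(text[pos:end]), end) if end > pos else None

def either(*parsers):
    def parser(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parser

def sep_by(item, sep):
    """Zero or more items separated by sep; a trailing separator is an error."""
    def parser(text, pos):
        values = []
        result = item(text, pos)
        while result is not None:
            value, pos = result
            values.append(value)
            after_sep = sep(text, pos)
            if after_sep is None:
                break
            result = item(text, after_sep[1])
            if result is None:
                return None       # separator with nothing after it
        return values, pos
    return parser

def between(open_p, inner, close_p):
    def parser(text, pos):
        opened = open_p(text, pos)
        if opened is None:
            return None
        body = inner(text, opened[1])
        if body is None:
            return None
        closed = close_p(text, body[1])
        if closed is None:
            return None
        return body[0], closed[1]
    return parser

def lazy(get_parser):
    # Defers lookup so that value and array can refer to each other.
    return lambda text, pos: get_parser()(text, pos)

value = either(digits, lazy(lambda: array))
array = between(char("["), sep_by(value, char(",")), char("]"))

def parse(text):
    result = value(text, 0)
    if result is None or result[1] != len(text):
        raise ValueError("parse error")
    return result[0]

print(parse("[1,[2,3],[]]"))
```

The grammar is the two lines defining `value` and `array`, which is why these parsers are easy to read. This version recurses on nesting, so very deep input will hit Python's recursion limit; making the stack explicit is what the CPS step described above is for.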
First, parsing JSON is trivial compared to other parsing tasks. There are no cycles as in YAML or other serializers, it's trivial forward scanning, without any need to tokenize or backtracking.
Second, JSON is one of the simplest formats out there, and due to its simplicity also the most secure. It has some quirks and some edge cases are not well-defined. But with those problems you can always check against your local javascript implementation and the spec, just as OP did.
I know very few JSON parsers which actually crash on illegal input. There are some broken ones, but there are many more broken and insecure-by-default YAML or BSON parsers or language serializers, like pickle, serialize, Storable, ...
Parsing JSON is not a minefield, parsing JSON is trivial.
Takeaway: Favor JSON over any other serialization format, even if there are some ill-defined edgecases, comments are disallowed and the specs are not completely sound. The YAML and XML specs are much worse, their libraries horrible and bloated.
JSON is the only secure by default serializer. It doesn't allow objects nor code, it doesn't allow cyclic data, no external data, it's trivial, it's fast.
Having summarized that, I'm wondering why OP didn't include my JSON parser in his list, Cpanel::JSON::XS, which is the default fast JSON serializer for perl, is the fastest of all those parsers overall, and is the only one which does pass all these tests. Even more than the new one which OP wrote for this overview, STJSON. The only remaining Cpanel::JSON::XS problem is to decode the BOM of UTF16 and UTF32. Currently it throws an error. But there are not even tests for that. I added some.
No kidding. If everyone thought that way, the only JSON parser to exist and which everyone uses would be barely working, because everyone would be put off from trying to write something better.
I have actually written an HTML parser, and it was not so difficult because the HTML spec specifies everything down to the level of tokenising individual characters.
The implicit subtext here is "for production". It's an excellent rule of thumb since it's highly likely that one already exists that is already battle tested. It's better to use that one. It's not a high horse it's just good horse sense.
He's not talking about rolling your own for fun or learning reasons which anyone should feel free to do.
Some of us are not such shitty programmers that we can't build critical systems on our own. Critical business use is absolutely the worst time to farm out to a third party.
And quit using war metaphors to feel macho about your work. What we do in no way resembles battle. Neither are most libraries tested to the degree you are trying to suggest. Neither is common usage in several projects a reliable source of testing.
When people say things like "don't reinvent the wheel" or "only for yourself, not for your employer", what they mean is "I know I can't, and who do you think you are thinking you're better than me. Get back in your place."
You seem to be offended when no offense was intended. Nowhere did the OP indicate that this rule applied to critical business use. JSON parsing is almost never a critical business use case.
And I don't know about you, but these days hosting a public-facing internet service is increasingly like a battle: DDoS, data loss, state-sponsored APTs. I certainly don't feel macho about my work. I don't author many libraries that are used by a public-facing service on the internet, so I have nothing to beat my chest about. I do however use libraries that have been written by someone else and had the bugs shaken out by years of continuous use in a hostile environment. It's not a perfect guarantee but it's certainly better than something I wrote this month.
In most cases a library that has been in continuous use for several years on large public facing services will have been much more tested than anything you might write. Not because you aren't skilled enough but because the edge cases are unbounded and you can't think of everything.
Takeaway: Don't parse JSON. Every JSON parser is wrong. JSON is not a standard.
Parsing everything is NOT a minefield. We have BNFs, parser theory, etc. for that. Lots of languages have clear, unambiguous definitions... JSON clearly does not. It's a disgrace for the software engineering community.
> Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention.
Been there done that. (The medical attention, I mean.) Worked just fine. The article makes it sound extremely difficult, but 100% of the article is about edge cases that rarely happen with normal encoders and can often be ignored (e.g. who cares if you escape your tab character?).
> consider what happens when the code opens a session, sets the session username, then parses some input JSON before the password is evaluated
Edit: I responded more elaborately on the unlikelihood of this, but honestly, I can't come up with a single conceivable scenario. How would you decode part of the JSON and only parse the password bit later?
Yeah, the whole idea is a terrible practice, but it is used in industry surprisingly often. People think that PHP's session lock will save them, seeing the potential race condition but not the decode failure.
The scenario is typically username-password-parameters passed as three variables via POST. The offending developer parses all three variables up front for simplicity:
The session user is created from the POST username, the POST password is input sanitized and put in a variable, then the parameters are parsed as JSON. The attacker POSTs a valid user, a valid (but incorrect) password, and malformed JSON.
The login code reads the variables from POST, opens the session, and dies on the JSON decode.
Hint: this bug exists in the wild on a massively deployed framework. I'm working up a responsible disclosure on that one now.
5. Update session and do some other security stuff.
The problem is that they're creating a persistent session object ahead of when the JSON parameters are being decoded, which leaves a (partially initialized, persisting-outside-of-function) session object without having gone through the full authentication flow.
The problem is that they're creating permanent objects ahead of a full validation on the data necessary to actually properly initialize the object.
(I think the OP may have been a little vague because that's probably specific enough to identify the vuln with a scanner and a hunch.)
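The fix for the ordering problem is independent of language. A minimal Python sketch, where everything (`SESSIONS`, `PASSWORDS`, `login`) is hypothetical and plain-text passwords are used only to keep the example short:

```python
import hmac
import json
import secrets

SESSIONS = {}                             # stand-in for a persistent session store
PASSWORDS = {"alice": "correct horse"}    # plain text only for brevity

def login(username, password, params_json):
    """Return a new session id, or None if anything about the request is bad."""
    # 1. Parse and check everything first. Nothing persistent exists yet,
    #    so a failure here leaves no half-built session behind.
    try:
        params = json.loads(params_json)
    except json.JSONDecodeError:
        return None
    expected = PASSWORDS.get(username)
    if expected is None or not hmac.compare_digest(
            expected.encode(), password.encode()):
        return None
    # 2. Only after full validation is persistent state created.
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"user": username, "params": params}
    return session_id

print(login("alice", "wrong password", "{malformed"))   # None
print(len(SESSIONS))                                    # 0
```

With this ordering, the attack described above (valid user, wrong password, malformed JSON) ends with an empty session store regardless of which check fails first.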
Wait, so it creates a privileged session before verifying the password? That's your problem right there. A crash in the JSON processor (or anywhere else) is a minor blip compared to this godzilla bug of granting access before it's been earned.
Remember, it's PHP. Code will be running again on the next request. Only the process (or thread or whatever) handling the request with the invalid JSON crashes, and it only crashes after producing a persistent session; the next request, a different process/thread/whatever will run code, but the session from the previous request still exists.
I still don't get it. The browser gets the session cookie even though the request processing crashed? Or does another request manage to pick up the session that was dropped from the crashed request?
I don't do PHP. It just sounds incredible to me that so little isolation seems to exist.
Thanks for the extra explanation - I too doubted this could exist (surely the session existing alone isn't enough to assume a user is authenticated - need a variable set & checked every request?) but I see that it could.
I look forward to the disclosure to understand more about this - and check that the frameworks I use don't have this problem.
Certainly, edge cases are a source of pain. Luckily I mentioned that even if you ignore most of these edge cases (as a parser) the data will come out just fine.
> Well, first and most obviously, if you are thinking of rolling your own JSON parser, stop and seek medical attention.
As someone who has written his own JSON parser, I must concur. Ahh - are there any doctors here...?
In my defense - I was porting a codebase to a new platform, and needed to replace the existing JSON 'parser'. You see, it was:
- Single-platform
- Proprietary
- Little more than a tokenizer with idiosyncrasies and other warts
Why was it chosen in the first place? Well, it was available as part of the system on the original platform. Not that I would've made the same choice myself. We had wrappers around it - but they didn't really abstract it away in any meaningful manner. So all of its idiosyncrasies had leaked into all the code that used the wrappers. In the interests of breaking as little existing code as possible, I wrote a bunch of unit tests, and rewrote the wrapper in terms of my own hand-rolled tokenizer. Later - either after the port, or as a side project during the port to help out a coworker (I forget) - I added some saner, higher level, easier to use, less idiosyncratic interfaces - basically allowing us to deprecate the old interface and clean it up at our leisure. This basically left us with a full-blown parser - and it was all my fault.
> Takeaways: Don't parse JSON yourself, and don't let calls to the parsing functions fail silently.
I'd add to this: Fuzz your formats. All of them. Even those that don't receive malicious data will receive corrupt data.
Many of the same problems also affect e.g. binary formats. And just because you've parsed valid JSON doesn't mean you're safe. I've spent a decent amount of time using e.g. SDL MiniFuzz - fixing invalid enum values, unchecked array indices, huge array allocations causing OOMs, bad hashmap keys, the works. The OOM case is particularly nasty - you may successfully parse your entire input (because 1.9GB arrays weren't quite enough to OOM your program during parsing), and then later randomly crash anywhere else because you're not handling OOMs throughout the rest of your program. I cap the maximum my parser will allocate to some multiple of the original input, and cap the original input to "something reasonable" (1MB is massive overkill for most of my JSON API needs, for example, so I use it as a default.)
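A rough sketch of the input cap, in Python with the standard `json` module. `parse_untrusted` and `RejectedInput` are made-up names, and the 1MB default is just the figure from the comment above. This covers only the cap on the original input; capping what the parser itself allocates is not something the standard module exposes, as far as I know.

```python
import json

MAX_INPUT_BYTES = 1024 * 1024  # 1MB default, per the comment above


class RejectedInput(Exception):
    """Raised for input that is too large or is not valid JSON."""


def parse_untrusted(data: bytes, max_bytes: int = MAX_INPUT_BYTES):
    # Refuse oversized input before the parser allocates anything for it.
    if len(data) > max_bytes:
        raise RejectedInput(f"input is {len(data)} bytes, limit is {max_bytes}")
    try:
        return json.loads(data)
    except (ValueError, RecursionError) as exc:
        # ValueError covers both JSONDecodeError and UnicodeDecodeError.
        raise RejectedInput(str(exc)) from exc
```

Callers then have exactly one exception type to handle, which makes it harder for a parse failure to go unnoticed.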
I had a need to write my own JSON parser for C#, although that was mostly because I hated the data structures the existing C# parsers produced.
I had the advantage that I only needed to use it for rapid prototype projects, and that I could count on all of the data from a single source being the same "shape" (only the scalar values changed, never the keys or objects).
Not following the RFC helped greatly, as I just dgaf about things like trailing commas.
The biggest "gotcha" for my first implementation was a) temporary object allocation and b) recursion depth. The third prototype to use the parser needed to parse, I think it was, a ten thousand line JSON file? The garbage collection on all the temporary objects (mostly strings) meant that version took ~30 seconds. I refactored to be non-recursive and re-use string variables wherever possible, and it dropped down to seconds.
There was a moment in writing it that I thought I would have an actually legitimate use for GOTO (a switch case that was mostly a duplicate of another case), but that turned out not to be the case :/
Wouldn't it have been easier to use the C# JSON parser, and then later walk the tree it creates and convert it into a more sane data structure that way?
Hmm. Yeah, it definitely would have been. All I can say is - I was frustrated with the way something functioned, and rather than patch it, I "fixed" it. Where "fixed" is less complete and stable and...
Well, I also wanted a YAML parser, and had the weird need to handle JSON inside XML (oh, industrial sector, and your SOAP), and to not care about things like trailing commas, and then at some point to deal with weirdly large amounts of data.
Each iteration only took a couple of days, and I learned a bunch about tokenizing things, and then dealing with token streams.
> In conclusion, JSON is not a data format you can rely on blindly.
That was definitely not my take-away from the article. More like "JSON is not a data format you can rely on blindly if you are using an esoteric edge-case and/or an alpha-stage parsing library." I haven't ever run into a single JSON issue that wasn't due to my own fat fingers or trying to serialize data that would have been better suited to something like BSON.
The sum totality of all the issues raised in that post is not an esoteric edge case, even if each individual element is an esoteric edge case. If you haven't encountered any of them in your real code yet, there's two basic possibilities. Either you aren't using JSON very hard at all... or you have encountered them and you just didn't realize it. You will, sooner or later.
I'm not saying JSON is bad. Personally I think the sloppiness that this post enumerates is part of its success, and part of why it has so much support in so many languages. The more you nail down the semantics, the harder such widespread support gets. Right now you get a nice subset of the JSON spec that works in a huge variety of languages, where the JSON support for a given language usually converts the JSON into something fairly native for the language. This wouldn't be possible if you nailed down the semantics as hard as something like Protobuf does. For instance, a ton of statically-typed languages will let you encode and decode integers into JSON. But JSON doesn't have an integer type. Strict JSON support ought to forbid integers. But its sloppiness lets us all just sort of ignore that and get on with life.
If you have a need for precision, avoid JSON. And people probably generally need more precision than they realize and probably ought to reach for JSON a bit less often than they do. But on the other hand, the whole thing does mostly work, right? That can't be ignored.
JSON doesn't have an integer type, but it certainly supports integers. Within, obviously, implementation defined limits. I'm with you up to "if you need precision, avoid JSON". Actually JSON is fine for the kinds of precision most use cases require, and when it isn't, you probably know it.
JSON's number support is a source of problems, because the standard is so informal [1]:
> JSON is agnostic about numbers. ... JSON instead offers only the representation of numbers that humans use: a sequence of digits
This poses a problem for some languages, and tends to break things. You would expect encode(decode(string)) == string, but languages deal with numbers differently. For example, in Go, if you decode into a map[string]interface{}, you will get float64 by default. If you decode "42" and then encode it back to JSON, you'll get "42.0". (This is the reason Go has a special type you can use, json.Number, that preserves the value as its original string value.)
This mostly causes issues for code that needs to be data-structure-agnostic (in the Go case: decoding into an interface{}, rather than a struct with "json:" tags), but there are edge cases where you can get a surprise. For example, since all numbers are technically both integers and floats, something like {"visitorCount": 42.0} is perfectly valid, and a client has to know to coerce the number into an integer if it wants to deal with it sanely, even though the meaning of that number might be nonsensical if treated as a float.
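For comparison, here is how the same thing looks with Python's standard `json` module, plus a sketch of the coercion a client ends up writing for the visitorCount case. `visitor_count` is a hypothetical helper, not part of any library.

```python
import json

# Python's json module picks the machine type from the literal's spelling.
print(type(json.loads("42")).__name__)    # int
print(type(json.loads("42.0")).__name__)  # float
print(json.dumps(json.loads("4.20e1")))   # 42.0 (the original spelling is gone)


def visitor_count(doc: str) -> int:
    """Read "visitorCount" as an integer, accepting 42 and 42.0 but not 42.5."""
    value = json.loads(doc)["visitorCount"]
    if isinstance(value, bool):
        raise ValueError("expected a number, got a boolean")
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise ValueError(f"not an integer count: {value!r}")
```

The `is_integer()` check also rejects infinities, which a literal like 1e999 turns into under the default decoder.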
Exactly. JSON Numbers ARE NOT ACTUAL NUMBERS. They're really restricted strings (or, as you quote, a syntax for representing numbers).
IMO this wasn't originally a bad thing at all. There are so many different types of numbers, with so many different behaviours (does `1` == `1.0`? Not in statistics class) that trying to work it all out in JSON would have been a fool's errand. The problem is that so many JSON parsers are overly aggressive about turning JSON Numbers into something specific (usually floats) that when writing a JSON API we have to assume it will be consumed in this lowest-common-denominator way.
By the way, I don't think that all JSON numbers are technically both integers and floats. `42` and `42.0` are clearly different according to the spec, IMO. I'd be curious to get your thoughts on this.
The spec is too vague; it's "agnostic". A number can include a fractional part, but it doesn't say if that means that fractionless numbers are to be treated as integers. It leaves that to the implementation, which is a terrible idea for an interchange format.
Ruby, IMHO, errs on the side of consistency (fractionless numbers become Fixnum, so encode(decode("1")) => "1"), whereas Go goes the opposite way (all numbers become float64, so encode(decode("1")) => "1.0", unless you enable the magic json.Number type).
JavaScript is an interesting scenario, because JS doesn't even have integer numbers, which means it doesn't have any choice in the matter. Interestingly, Node.js/V8 truncates the fraction if it can: JSON.stringify(JSON.parse("1.0")) => "1".
Thanks for your analysis (and the Ruby and JS reports). The more I think about it the more I think parsers should not parse to numbers by default, but instead to (using a Haskell-based pseudo type):
data Sign = Positive | Negative
data Digit = Zero | One ... Eight | Nine
data JSONNumber = JSONNumber
{ _sign :: Maybe Sign
, _integerPart :: [Digit]
, _fractionalPart :: Maybe [Digit]
}
Actually this should include exponents as well -- basically parsers should just report exactly what the spec allows, without coercing to float or whatever.
Of course, users of the parsers could ask to have a number converted to a float, but that wouldn't be the default.
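A rough Python rendering of that idea, exponent included. `JSONNumber` here is an illustrative type, not something an existing parser provides; the regex follows the number grammar in the RFC. It keeps the digits as written (the only thing dropped is whether the exponent marker was `e` or `E`) and converts to a machine-friendly type only on request.

```python
import re
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# number = [ "-" ] int [ "." digits ] [ ("e" / "E") [ "+" / "-" ] digits ]
_NUMBER = re.compile(r"(-)?(0|[1-9][0-9]*)(?:\.([0-9]+))?(?:[eE]([+-]?[0-9]+))?")


@dataclass(frozen=True)
class JSONNumber:
    negative: bool
    integer_part: str
    fractional_part: Optional[str]
    exponent: Optional[str]

    @classmethod
    def parse(cls, text: str) -> "JSONNumber":
        match = _NUMBER.fullmatch(text)
        if match is None:
            raise ValueError(f"not a JSON number: {text!r}")
        sign, integer, fraction, exponent = match.groups()
        return cls(sign is not None, integer, fraction, exponent)

    def to_decimal(self) -> Decimal:
        # Explicit, opt-in conversion; nothing is coerced by default.
        text = ("-" if self.negative else "") + self.integer_part
        if self.fractional_part is not None:
            text += "." + self.fractional_part
        if self.exponent is not None:
            text += "e" + self.exponent
        return Decimal(text)
```

With this, `JSONNumber.parse("1")` and `JSONNumber.parse("1.0")` compare unequal as parsed values, while their `to_decimal()` results compare equal.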
We're a long way from that though. At the moment I don't even feel like I can write a protocol that expected implementations to distinguish `1` and `1.0`. This should not be the case.
> At the moment I don't even feel like I can write a protocol that expected implementations to distinguish `1` and `1.0`.
Indeed you can't, among other reasons because that distinction doesn't exist in JavaScript, but ... why would you want to make such a subtle distinction?
If you really want what you described above, you can get it by using a string.
The distinction may be familiar to us, but if you ask someone on the street, "1" and "1.0" are the same number. The habit we have of giving them different types is somewhat arbitrary.
It sounds like what you're saying is that it should be deserialized as Java's BigDecimal, or whatever the equivalent of that is in a given language/framework; and if there isn't one, then the JSON parser should provide that equivalent.
Almost two weeks late, but I thought about this more and I totally agree with you.
I should have said "JSON numbers aren't machine numbers" (e.g. float, double, integer). But serialized, human-readable numbers are still numbers, so I was wrong.
> You would expect encode(decode(string)) == string
With Unicode string handling I already wouldn't expect that, not to mention whitespace. However, I would expect that dec(enc(dec(string))) === dec(string)
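That weaker property is easy to check mechanically. A small sketch with Python's `json` module (`is_stable` is just an illustrative name):

```python
import json


def is_stable(text: str) -> bool:
    """True if decode(encode(decode(text))) == decode(text)."""
    once = json.loads(text)
    return json.loads(json.dumps(once)) == once


text = '{ "name": "caf\\u00e9",  "n": 1.0 }'
print(json.dumps(json.loads(text)) == text)  # False: whitespace differs
print(is_stable(text))                       # True
```

It makes a handy property for a fuzzer to assert, with the caveat that Python's decoder also accepts a bare NaN, which never compares equal to itself.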
> {"visitorCount": 42.0} is perfectly valid
The problem here is not necessarily that the standard is informal, but that languages differ in the number types they offer. Treating "42.0" as something different from "42" is after all just another convention and not a universal one. In other words, having to coerce something to an integer is more a property of the environment than a problem with JSON.
"Perhaps all the numbers we use are safe. Are we actually mangling our numbers? Probably not...but will we know? Will anything blow up, or will our application be silently, subtly wrong?
I suspect that when this problem does occur, it goes undetected for longer than it should. In the remainder of this article, we examine potential improvements to our handling of long."
I like CBOR (http://cbor.io/). It's a standard (RFC-7049), it can encode 64-bit integers, IEEE floating point, text (UTF-8), binary data (anything non-UTF-8), booleans, null and undefined. You have arrays and maps (and maps can have any value as an index, not just strings) and you can semantically tag data as well. The RFC is one of the better RFCs I've read, covering all the details and plenty of encoding examples to test against.
If there are two servers/services that communicate via JSON and they use different parsers, these types of issues can lead to rather nasty problems. Even if both parties fail gracefully on their own.
This gets even worse if your software is an integration layer between two services you do not control.
I couldn't really care less if someone POSTs some garbage JSON that results in them getting a 500 response. Better than someone POSTing an XML bomb and affecting other peoples' requests. Please enlighten me if you know of a serialization format with libraries for all common languages that lacks any gotchas or edge cases.
All software has edges, so edge cases are unavoidable.
The best you can do is:
- interpret the spec to the letter.
- for every fragment of a statement you write, consider whether it might conceivably go wrong, and handle those cases (in the simplest manner because 'handling' means writing code, and that code, too, needs to go through this process).
For example, a json parser must be prepared to handle missing values, extremely long keys and values (integers may have thousands of digits, think long about the question whether 64 bits always is enough for storing a string length, etc.), etc.
- if you are truly paranoid, have very stringent security requirements, or expect to be heavily attacked, run the parser in a separate process.
So do I, but you should still consciously decide whether to add an overflow check.
Let's do a quick estimate: one can read in the order of 2^24 bytes/second from disk. A day has on the order of 2^16 seconds, so that's 2^40 bytes/day.
=> You will need 2^24 days to read 2^64 bytes. I think that's around 50k years. That an attacker will try to generate a buffer overflow this way is a risk I would take, even if I thought the hardware had room to store that string.
The only way I can foresee a real risk is when an optimizer can optimize away the computation of a string whose length it is asked to compute by an attacker.
That's still very much far-fetched, and if no string gets allocated it's hard to see how it could become a security issue, but it could be a reason to be extra careful, for example when providing online access to a C++ compiler, with its template metaprogramming capabilities.
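The back-of-the-envelope estimate above is easy to redo without the power-of-two rounding. A few lines of Python, using the same assumed read rate of 2^24 bytes/second:

```python
BYTES_PER_SECOND = 2 ** 24      # assumed read rate, about 16 MiB/s
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400, which the estimate rounds down to 2**16


def years_to_read(total_bytes: int, rate: int = BYTES_PER_SECOND) -> float:
    return total_bytes / rate / SECONDS_PER_DAY / 365.25


rounded = 2 ** 64 / 2 ** 40 / 365.25  # with 2**40 bytes/day: roughly 46,000 years
exact = years_to_read(2 ** 64)        # with 86,400 s/day: roughly 35,000 years
```

Either way the answer is tens of thousands of years, so the conclusion stands.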
If all you're doing is parsing their JSON for its own sake, it's just a 500; but that's the boring case. Consider what happens when typical web code is interacting with the JSON parser.
Figures that something like this would be posted on my day off. I put this through a parser that I cover, and found that the only failures were for top-level scalars, which we don't support, and for things we accept that we shouldn't. I'll look through the latter tomorrow, as well as the optional "i_" tests.
Test suites are a huge value add for a standard, so thank you, Nicolas, for researching and creating this one. I was surprised that JSON_checker failed some of the tests. I use its test suite too.
The correct answer to parsing JSON is... don't. We experimented last hackday with building Netflix on TVs without using JSON serialization (Netflix is very heavy on JSON payloads) by packing the bytes by hand to get a sense of how much the "easy to read" abstraction was costing us, and the results were staggering. On low end hardware, performance was visibly better, and data access was lightning fast.
Michael Paulson, a member of the team, just gave a talk about how to use flatbuffers to accomplish the same sort of thing ("JSOFF: A World Without JSON"), linked in this thread: https://news.ycombinator.com/item?id=12799904
Not sure what your point is (or the point of that presentation, for that matter).
Of course there are binary serialization formats that are faster than XML or JSON, and of course they're less error-prone. This has been known for about 40 years now.
JSON/XML are used precisely because people want a human-readable interchange format. For high-performance uses, consider Google's Protocol Buffers or Boost::serialize. You're acting like you just hackathoned the biggest thing since sliced bread, but that's exactly how payloads have been sent (until high-bandwidth made us all lazy) since the inception of the Internet.
From experience, I think the whole "human-readable" idea is a bit overrated. All it means is that the format is entirely/mostly in ASCII. But if you have a hex editor, like all good programmers should, binary formats are not any less human-readable (or writable) nor more difficult to work with; and for some, even a text editor with CP437 or some other distinctive SBCS will suffice after a while. It's somewhat like learning a language; and if you are the one developing the format, it's a language that you create.
Then again, I grew up working with computers at a time when writing entire apps in Asm/machine language was pretty normal as well as other things which would be considered horribly impossible by many developers of the newer generation, and can mentally assemble/disassemble x86 to/from ASCII, so my perspective may be skewed... just a tiny little bit. ;-)
A minor gripe with your comment, but as a programmer conceivably must be human, both conditions are satisfied when a programmer is capable of reading it.
I thought my point was clear - don't get involved parsing JSON; I agree with the OP, parsing JSON is a minefield. I went further by implying that it is also unnecessary when ease of reading isn't needed, and called out some alternatives. I think it's amusing that you mentioned protocol buffers - were you aware that when I mentioned flat buffers that they were built in relation to performance inefficiencies in the very protocol buffers that you mentioned?
We didn't just "hackathoned the biggest thing since sliced bread", btw, we took a real world example of exchanging a human readable format for a human-with-tools-readable one and saw a significant win. High-bandwidth also isn't as prevalent as you think, and yes, you're generally paying both performance wise and occasionally monetarily for the laziness you mentioned. But then, if you've known this for 40 years and don't know how to measure it, there's not much I can do for you in a comment.
I believe that is the point. Choose the right serialization strategy to fit the job. Most projects default to JSON regardless of how suitable. At some scale that should be revisited since the human-readable / performance trade-off equation can change.
I want to use something like flat buffers in NodeJS for optimizing websocket traffic and implementing a FS database. But I can't find much stuff for it in JavaScript. Do you (de)serialize the flat buffers or use them directly by abstracting get/set for example via Object.defineProperty ?
No, I don't try to decode them directly if I can avoid it. We handrolled a C-based byte packer for our honeybadger project only because function access is relatively slow on the interpreter we have on low end TVs, but reading blocks through a Uint8Array is pretty fast. Writeup is here: http://alifetodo.blogspot.com/2016/05/project-honeybadger-pi... . I can push the C packer to github if there's interest, but since you mention NodeJS, you might have better luck with Paulson's benchmark NodeJS example: https://github.com/michaelbpaulson/flatbuffers-benchmarks
On consideration I don't think flat buffers is worth the added complexity. Using basically C structs is sure faster, and suitable for low end devices and clients written in C. But JSON has many advantages.
Yup, I'm a fan of protobufs and the more recent flatbufs. Definitely like the tooling around being able to seamlessly switch for readability. I'll check out ion.
Wow! This was a great practical analysis of existing implementations, besides a great technical overview of the spec(s). Thanks for open sourcing the analysis code[1], and for the extended results[2]
An informative article. The point is not that parsing JSON is "hard" in any sense of the word. It's that it's underspecified, which leads to parsers disagreeing.
Although the syntax of JSON is simple and well-specced:
* The semantics are not fully specified
* There are multiple specs (which is a problem even if they are 99% equivalent)
* Some of the specs are needlessly ambiguous in edge cases
* Some parsers are needlessly lenient or support extensions
I did write my own parser, but for a reason: I need it to be able to recover as much data as possible from a damaged, malformed, or incomplete file.
Turns out that a good chunk of these tests are for somewhat malformed, but not impossible to reason about files. Extra commas, unescaped characters, leading zeroes... I'd rather just accept those kinds of things rather than throw an error in the user's face. It's a big bad world out there, and data is by definition corrupt.
And this is borne out when I plug my parser into this test suite: Many, many yellow results, which is exactly how I want it.
> Scalars..In practice, many popular parsers do still implement RFC 4627 and won't parse lonely values.
Right. RFC 7159 expanded the definition of a JSON text.
> A JSON text is a serialized value. Note that certain previous specifications of JSON constrained a JSON text to be an object or an array.
If RFC 7159 wasn't different from 4627, there'd be no reason for 7159. Same with RFC 1945 and 7230 for HTTP. (Of course, HTTP is versioned...maybe he just means to repeat the earlier versioning criticism.)
---
> it is unclear to me whether parsers are allowed to raise errors when they meet extreme values such 1e9999 or 0.0000000000000000000000000000001
And then quotes the relevant part of the RFC 7159 grammar which answers the question:
> This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.
Parsers may limit this however they like. And so may serializers. This includes yielding errors. (Though approximating the nearest possible 64-bit double is IMO the better choice.)
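To illustrate both choices with one real parser: Python's `json` module approximates by default, and 1e9999 silently becomes infinity. A sketch of opting into errors instead, using the module's `parse_float` and `parse_constant` hooks (`loads_strict` and the two helper names are made up for the example):

```python
import json
import math


def strict_float(text: str) -> float:
    # Called with the literal's text for numbers with a fraction or exponent.
    value = float(text)
    if math.isinf(value):
        raise ValueError(f"number out of binary64 range: {text}")
    return value


def reject_constant(name: str):
    # Called for the non-standard literals NaN, Infinity and -Infinity.
    raise ValueError(f"not valid JSON: {name}")


def loads_strict(text: str):
    return json.loads(text, parse_float=strict_float, parse_constant=reject_constant)


print(json.loads("1e9999"))  # inf: the default approximates
print(loads_strict("0.0000000000000000000000000000001"))  # 1e-31, in range
```

With these hooks `loads_strict("1E400")` raises ValueError rather than returning infinity. Numbers too small to represent still round quietly to zero, which matches the "approximate" reading of the RFC.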
---
So yeah, in the end there is a fair amount of flexibility in standard JSON.
To summarize:
> An implementation may set limits on the size of texts that it accepts.
> An implementation may set limits on the maximum depth of nesting. [this one was never mentioned though]
> An implementation may set limits on the range and precision of numbers.
> An implementation may set limits on the length and character contents of strings.
Most implementations on 32-bit platforms will not parse 5GB JSON texts.
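As a sketch of what enforcing two of those limits can look like in front of an existing parser: a size check plus a non-recursive pre-scan for nesting depth. This is Python, and `max_nesting_depth`, `loads_limited` and the default limits are all illustrative choices, not anything the RFC prescribes.

```python
import json


def max_nesting_depth(text: str) -> int:
    """Deepest [ or { nesting in text, ignoring brackets inside strings."""
    depth = deepest = 0
    in_string = escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch in "]}":
            depth -= 1
    return deepest


def loads_limited(text: str, max_chars: int = 1_000_000, max_depth: int = 64):
    if len(text) > max_chars:
        raise ValueError("JSON text too large")
    if max_nesting_depth(text) > max_depth:
        raise ValueError("JSON text nested too deeply")
    return json.loads(text)
```

The pre-scan does not validate anything; it only guarantees the real parser never sees input nested deeper than the limit.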
It got supplanted by Transit [0] as an interchange format. Both are only really used within the Clojure community. I use Transit for internal-facing APIs.
Even if you're a Clojure shop, you've got the issue that at your system boundary, everything else in the world accepts JSON and/or XML, and nothing supports EDN/Transit. So your data needs to be serialisable to one of those anyway, at least if it crosses the boundary. There is such a thing as a network effect, even in data serialisation formats...
There was a great article at some point that explained why 'be liberal in what you accept' is a very bad engineering practice in certain circumstances, such as setting a standard, because it causes users to be confused and annoyed when a value accepted by system A is subsequently not accepted by supposedly compatible system B. Leading to pointless discussions about what the spec 'intended' and subtle incompatibility. Anyone know what article I mean?
That's pretty much my experience when building software as well. A lot of the time I have been liberal to incorporate legacy data, and every time it has ended up being the cause of the majority of bugs in the systems I have built.
Yeah. And I learned this the hard way with the Perl module JSON::XS. It successfully encodes a Perl NaN, but its decoder will choke on that JSON. (Reported it to the maintainer who insists that is consistent with the documentation and wouldn't fix it)
Similarly, Python's encoder violates the JSON specification by default, as it produces `Infinity`, `-Infinity` and `NaN`, which other JSON parsers choke on.
Nobody should use JSON::XS anymore, everybody switched to Cpanel::JSON::XS which does have those features, and fewer bugs.
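For the Python case, the behaviour and the opt-out look like this. `allow_nan` is a real parameter of `json.dumps`; `dumps_strict` is just a hypothetical wrapper name.

```python
import json

print(json.dumps(float("nan")))   # NaN
print(json.dumps(float("inf")))   # Infinity
print(json.dumps(float("-inf")))  # -Infinity


def dumps_strict(value) -> str:
    # allow_nan=False makes the encoder raise ValueError instead of
    # emitting literals that are not part of JSON.
    return json.dumps(value, allow_nan=False)
```

Wrapping the encoder like this at the edge of a service means the non-standard literals never reach a stricter parser on the other side.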
I just added those great testcases, and found previously unknown bugs. But the python test runner from this repo gives a few false negatives, PR coming soon. It was much easier to write it in perl.
If JSON is comparable to a minefield, then I guess XML and ASN.1 are nothing short of nuclear Armageddon in complexity and one's ability to shoot oneself in the foot ;-)
Have to take HN anti-parser comments with a grain of salt.
I wrote a parser for a format that everyone here would crucify me for. When I went on the job market, I removed it from my github, fearing employers would see it as a black mark. Soon after I got email from the creator of a YUUUGE project (as in, 5 digits of stars on github) asking why I had removed this parser that they were using ...
Also, as someone who has written an XML parser, according to some of the comments in this thread I'm way beyond medical help, and should give up on life :).
Depends on when you did this crazy thing. Was it in the dark days before they were "ubiquitous"? (for lack of a better word. also it's just a fun word)
Then you're a hero to those who made use of it.
If someone were to do that today, though, then yes, seek help.
I got dinged in a thread the other day for writing my own JSON parser, so i'm not sure i should confess to also having written enough of an ASN.1 parser to deal with certain PEM key formats:
I still love JSON regardless :) Client / server side languages have first class support for serialization and in most cases the data structures are rather easy.
I'd be very skeptical if one would suggest an alternative format for a web based project, however I can imagine such situations.
Funnily enough, as I've been experimenting with Chef and trying to stick to JSON config files where allowed, I was again struck that (a) it's not a good choice for config files (b) it's an OK choice though (c) lots of people are using it anyway (d) nearly everyone that does so (including Chef) allows comments, so in reality are not actually using JSON at all.
Point (d) is the important one. I really think we need a standard for json-with-comments. JSONC or whatever, but it should have a different standard filename and it should have an RFC dictating what is and isn't allowed. Personally I would allow only // comments because there are too many subtle issues with C-style comments, but it may be too late to agree on that.
Half the point of JSON is that if application A stores its data as JSON then application B can parse that without any nasty surprises. Except, there are now probably thousands of noncompliant implementations in the wild that only exist because the standard doesn't allow comments. Each one of those implementations adds subtle differences (in addition to the comments themselves) depending largely on how they remove the comments before passing to the standards-compliant JSON parser (assuming they do that, which being DC's recommended approach, is as close to a standard as currently exists).
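For what it's worth, the "strip comments, then hand off to a compliant parser" approach is short when only `//` comments are allowed. A sketch in Python; `strip_line_comments` and `loads_jsonc` are made-up names and this follows no existing standard. The one subtle part is not touching `//` inside strings, which is exactly the kind of detail where ad-hoc implementations diverge.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside of JSON strings."""
    out = []
    in_string = escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def loads_jsonc(text: str):
    return json.loads(strip_line_comments(text))
```

Keeping the newlines means line numbers in the parser's error messages still match the original file.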
I really think it's not an OK choice. A config file format that doesn't allow comments provides some of the worst possible UX.
One of the nice things about config files is that normally they are self-documenting, explaining the meaning of the various directives and providing possible values.
Without comments, you have to constantly switch between the documentation and the config file.
Also, the restriction on trailing commas is another really bad issue for a config file language as it pollutes diffs, makes moving lines around needlessly difficult and is one more landmine waiting to happen for the sysadmin editing a file.
> I really think it's not an OK choice. A config file format that doesn't allow comments provides some of the worst possible UX.
Are we sure about that? E.g. mostly when messing with configuration files, I have to visit the documentation anyway, which could explain key-value pairs in the JSON file. Furthermore there are configuration files which consist mostly of comments to explain all kinds of edge cases with actual configuration data commented, which can be a mess on its own.
If good code should be self-explanatory then why not configurations?
Yep, I went through the process of replacing all our XML objects with JSON ones about 6 years ago now. Smaller files and much easier to read, manipulate, store and transfer.
And while I personally quite liked XSLT, javascript is a much more flexible and reliable option.
A list of items in an array. Some items in that list are of particular note.
(Say you have 15 items in an array. For whatever reason, such as you not being in control of the expected input structure of whatever you're giving this JSON to, you have them grouped with comments at the top of each block.)
Why would you want to do that in what is basically a human readable binary file and not in a readme?
You seemed to stop at the for who and for what purpose.
I can 'just' about see the case in something like nodeJS package.json files (that is the least of nodeJS's problems, but that's a whole other conversation). But a readme is a so much better option than having to trawl through code comments.
The discussion above concerns why JSON isn't a good choice for configuration files. That's exactly because configuration files are not human readable binary files.
Now add a comment for foo and another one for bar. Also try it when your client throws an exception whenever it encounters invalid keys in object. Or if perhaps it blindly persists or tries to perform some logic using that "data". And so on.
YAML is equally horrible and the spec is an order of magnitude more complex. I wasted half an hour trying to spot an error in the ejabberd yaml config, only to find out something trivial was missing. At least JSON has braces even though it's not suitable for configuration files. By all means choose TOML or something else (even ini or java properties files) instead.
YAML has braces. In fact, it's a superset of JSON. Any YAML parser should be able to parse JSON encoded data with one exception. Block comments. Which is a bastardization of JSON (as mentioned in the article) so block comments shouldn't be a problem in most cases.
The specification says this, but I was never convinced it was true. Specifically, the spec around escaped unicode characters lacks any mention of surrogates being encoded in two \u sequences, but rather specifies \u as:
> Escaped 16-bit Unicode character.
Which is an unhelpfully not-even-wrong statement. In practice, trying to treat JSON as a subset of YAML results in things not round-tripping, like the following:
In [9]: yaml.load(json.dumps('\N{PILE OF POO}'))
Out[9]: '\ud83d\udca9'
(in case it isn't clear, that's not a valid repr of pile-of-poo in Python:
In [14]: _9 == '\N{PILE OF POO}'
Out[14]: False
)
I've now got surrogates in a decoded Unicode string (why is this even allowed, I don't know); this results in fun behavior like `.encode('utf-8')` raising.
Edit: weird: HN appears to strip pile of poo from comments…
In [1]: import json
In [2]: json.dumps("\N{PILE OF POO}")
Out[2]: '"\\ud83d\\udca9"'
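For reference, here is a minimal sketch of the same round trip using only Python's standard json module (Python 3 assumed). It suggests that a JSON parser decodes the escaped surrogate pair back into a single code point, while a string that holds the two surrogates as separate code points cannot be encoded as UTF-8.

```python
import json

poo = "\U0001F4A9"  # PILE OF POO, a code point outside the Basic Multilingual Plane

# With the default ensure_ascii=True, json.dumps escapes it as a surrogate pair.
encoded = json.dumps(poo)
print(encoded)  # "\ud83d\udca9"

# json.loads recombines the pair into the original single code point.
decoded = json.loads(encoded)
print(decoded == poo, len(decoded))  # True 1

# A str holding the two surrogates as separate code points is a different value,
# and it cannot be encoded as UTF-8.
lone_surrogates = "\ud83d\udca9"
try:
    lone_surrogates.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False
```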
Quality detective work there. JSON, being JavaScript [Object Notation], is standardized to UTF-16. What you see here is "PILE OF POO" in UTF-16.
YAML specifies that the recommended encoding is UTF-8, but should be able to parse UTF-16 and UTF-32. And, if you want that, then you need to tell YAML to expect UTF-16. You can do that by including a BOM (Byte-Order Mark).
PyYAML is documented as only supporting UTF-8 and UTF-16, but not UTF-32.
I'm scratching my head on how to get that "UTF-16 encoded" UTF-8 string to a proper representation (in Python). If this were Ruby, I would use str.force_encoding() which doesn't change the bytes at all. If the solution comes to be, then I'll update this comment (or reply).
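One possible workaround in Python 3, offered as a sketch rather than a recommendation: the built-in surrogatepass error handler lets the surrogate code points through a UTF-16 encode, and an ordinary UTF-16 decode then pairs them up again. The helper name merge_surrogates is made up for this example.

```python
def merge_surrogates(text: str) -> str:
    """Combine paired surrogate code points in `text` into the characters they encode."""
    # 'surrogatepass' writes each surrogate code point out as a UTF-16 code unit;
    # the strict decode then reads an adjacent high/low pair as one character.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


broken = "\ud83d\udca9"
fixed = merge_surrogates(broken)
print(len(broken), len(fixed))   # 2 1
print(fixed == "\U0001F4A9")     # True
```

An unpaired surrogate still fails at the decode step with UnicodeDecodeError, which is arguably the right outcome, since there is no character to recover.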
Well, yes, but you missed my point: we're not looking at an encoded string object. A `str` (Python's string type) is supposed to represent a Unicode string — "Strings are immutable sequences of Unicode code points"; the underlying encoding is supposed to be transparent, and in fact, it's possible to construct an example (invalid, IMO, like the above example) YAML that PyYAML will decode where the underlying encoding in memory would be UTF-32, but still contain those surrogates.¹
In Unicode's lexicon, this string contains surrogate code points, not surrogate code units. This is wrong, and as I stated, it is surprising that Python permits it. Unicode explicitly warns against this behavior:
> A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.
And that's what's happening here.
> YAML specifies that the recommended encoding is UTF-8, but should be able to parse UTF-16 and UTF-32. And, if you want that, then you need to tell YAML to expect UTF-16. You can do that by including a BOM (Byte-Order Mark).
I'm passing PyYAML a string, but I can pass it encoded text as well; the output is the same. Note that the input contains characters completely in ASCII, so it's really not input encoding at play here. I realize now it wasn't explicit in my original comment, but here's the input being given to PyYAML (JSON that we're claiming is also YAML, because the claim was that JSON is a subset of YAML):
"\ud83d\udca9"
Note that this is the raw JSON/YAML, not a Python repr of it. Those slashes are literal slashes.
> PyYAML is documented as only supporting UTF-8 and UTF-16, but not UTF-32.
PyYAML is documented to contain a bug, then. (Though admittedly better than it being undocumented.) It is not developer friendly to only decode some text, and emit erroneous output on others.
> I'm scratching my head on how to get that "UTF-16 encoded" UTF-8 string to a proper representation (in Python).
It's hard because it shouldn't be possible. Unicode forbids it. The encoded UTF-32 version of that string (if Python permitted it) would be:
0x0000d83d 0x0000dca9
unless you explicitly handled the surrogate code points separately by decoding them first, and then re-encoding them into the target encoding, but that's crazy.
(this is another example of why the above is not valid or expected output from PyYAML.)
¹this:
In [19]: yaml.load('"\\ud83d\\udca9, \\U0001f4a9"')
Out[19]: '\ud83d\udca9, '
is, in CPython, a string that in memory is stored in UTF-32, but contains two surrogate code points.
(Note that I'm using a late version of Python 3. Early Python 3 and Python 2 handle Unicode poorly, but this example should work equally strangely there too. While PyYAML's output here is clearly buggy, the spec is equally vague and underspecified about what should happen.)
> In Unicode's lexicon, this string contains surrogate code points, not surrogate code units. This is wrong, and as I stated, it is surprising that Python permits it. Unicode explicitly warns against this behavior:
> Note that this is the raw JSON/YAML, not a Python repr of it. Those slashes are literal slashes.
I just came to this realization over dinner. Apparently, this odd behavior is defined by JSON. So Python's json module is not at fault, because it is actually implemented correctly.
> Specifically, the spec around escaped unicode characters lacks any mention of surrogates being encoded in two \u sequences,
http://yaml.org/spec/1.2/spec.html#id2771184 makes the statement, "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair.". There are also numerous mentions of JSON compatibility. So, I suppose, this is an issue with PyYAML (and Ruby's YAML, I checked).
I had never seen this issue before. Ruby's JSON module doesn't break up UTF-8 characters into surrogate pairs, nor does any online JSON parser that I could find via Google.
I don't have a specific recommendation, but when I see a project that uses a JSON file for configuration, I wonder: "hasn't the author ever needed to include a comment in the configuration?".
I can't speak for others. When I write something that uses JSON for configuration files those files are pretty much always written from configuration management. The comments are then in the manifest/recipe/playbook that generates the config file, which is where people are actually working with the values.
What purpose? Writing config files? Lua is a programming language (a nice one too). It's code. Code should not be used for config files, nor for data serialization because once you eval it you are executing it.
Yes, Lua was originally written as a language for rich configuration files and has grown out of that into a more fully featured language.
It's also one of the easier languages to sandbox, since you can evaluate user provided code in a custom environment that only contains the functions you deem safe. You can even use the standard debug hooks to set an upper limit on the number of instructions a script can execute to prevent someone from creating an infinite loop in a config file and locking whatever thread is reading the config.
It's not very appropriate as a data serialization format, or as machine written config, but the parent post specifically asked about human written configuration files.
I have actually used Lua before with good success. It was on a smaller scale, so I can't speak to edge cases, but I would certainly recommend considering it at the least.
A general rule of thumb: Never use yet another non-markup language designed by people who claimed to be designing yet another markup language from the very outset, then after somebody awkwardly pointed out that what they'd designed wasn't actually a markup language, they invent a backronym to contradict that embarrassing historical fact.
It just makes me wonder what the hell they thought they were doing all that time... It's like designing something called YACC, and ending up with an interpreter interpreter!
>Originally YAML was said to mean Yet Another Markup Language, referencing its purpose as a markup language with the yet another construct, but it was then repurposed as YAML Ain't Markup Language, a recursive acronym, to distinguish its purpose as data-oriented, rather than document markup.
It is really a pleasure to use compared to JSON and XML. While it may not be as compact as ProtoBuffers, Thrift, or Avro, it is human readable and also valid Clojure code. Libraries are readily available to convert it to JSON.
JSON is great for data exchange but config files should be human readable and amply commented. Even rolling your own simple format is probably better than using JSON.
After trying json, yaml, json5, java properties, ini and toml, I finally chose hjson* as the configuration file format for the software I'm building. It's the easiest format to read and write IMHO, a bit like nginx config files.
Now the mess that is called JavaScript dates has crept into every system imaginable in the world. I can understand we needed to go for the lowest common denominator, but Crockford's card really could have crammed in another line with a date time string format.
Speaking as someone who wrote a JSON parser, this article and the accompanying test suite look to be very valuable, and I will be adding this test suite to my parser's tests shortly.
That said, since my parser is a pure-Swift parser, I'm kind of bummed that the author didn't include it already, but instead chose to include an apparently buggy parser by Big Nerd Ranch instead. My parser is https://github.com/postmates/PMJSON
When I didn't know better, I wrote my own JSON parser for Java (it was years back and I didn't know about java libraries). From experience: DON'T. DO. IT.
That said, if you have decided to do it....
1) know fully well that it'll fail and build it with that assumption.
2) Please, please, please...give useful error messages when it does fail or you'd be spending way too much time over something simple.
By sheer coincidence I was thinking about this today: I wrote some code to highlight where Python's stdlib json module sees the mistakes when decoding JSON.
I used the exception string "blabal at line x, col y, char(c - d)" to actually highlight (with ANSI colors) WHERE the mistakes were.
I played with it a bit, and the highlighted areas for missing separators, unfinished strings, and missing identifiers made no sense. I thought I had a bug. I checked and re-checked. But no.
I made this tool because, whatever linters are out there, I was always wondering why I could not easily edit or validate JSON (especially JSON generated by tools coded by idiots).
I thought I was stupid for thinking JSON was complex.
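A rough sketch of that kind of tool, assuming Python 3.5 or later, where json.JSONDecodeError exposes msg, lineno and colno as attributes so the message string does not need to be parsed. The function name point_at_json_error is invented for this example, and a plain caret stands in for the ANSI colors.

```python
import json


def point_at_json_error(text):
    """Return a short report marking where json.loads gave up, or None if text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # lineno and colno are 1-based positions of the character the parser rejected.
        lines = text.splitlines()
        line = lines[exc.lineno - 1] if exc.lineno <= len(lines) else ""
        marker = " " * (exc.colno - 1) + "^"
        return f"{exc.msg} (line {exc.lineno}, column {exc.colno})\n{line}\n{marker}"
    return None


print(point_at_json_error('{"foo": 1, "bar" 2}'))
```

The caret lands where the parser noticed the problem, which is often some distance after the actual mistake. That may be part of why the highlighted areas look wrong.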
Writing parsers is hard and takes some experience, but it's not as hard or as impossible as most of these comments make out. JSON is ridiculously simple to parse, even in the face of certain edge case ambiguities.
I can say this from experience after having written an HTML/XML parser that provides support for various template schemes: Twig, Elm, Handlebars, ERB, Apache Velocity, JSP, Freemarker, and many more. I have written a JavaScript parser that supports React JSX, JSON, TypeScript, C#, Java, and many more things.
In the years I have been programming I have frequently heard whining like, "it's too hard". Don't care. While you are wasting oxygen crying about how hard life is, somebody else will roll a solution you will ultimately consume.
This is fantastic. However, it looks like the detailed conclusion is "exactly matching the RFC is a minefield".
About a month ago (for the third time, since I don't own the first two implementations) I made a very forgiving (and very error-unprotected) JSON parser: https://github.com/narfanator/maptionary
The core of JSON parsing, from that experience, seems really simple; it's catching all the edge cases that's hard.
In any event, I look forward to taking the time to test against this test suite!
People tend to screw up the unicode aspects more than the general parsing. And, indeed, the example JSON parser provided checks for a UTF-8 byte order mark, but doesn't validate that the data is valid UTF-8, so it will let through strings that might cause an application problems.
Although there is a commented out method to validate a code point, so I guess he understood that it was an issue.
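To illustrate the validation step being described, here is a small Python sketch that decodes the bytes strictly before parsing. It assumes the input is meant to be UTF-8, and the function name loads_strict_utf8 is hypothetical.

```python
import json


def loads_strict_utf8(data: bytes):
    """Decode bytes as UTF-8, rejecting invalid sequences, then parse as JSON."""
    # 'utf-8-sig' drops a leading byte order mark if one is present; the default
    # errors='strict' raises UnicodeDecodeError on malformed UTF-8.
    text = data.decode("utf-8-sig")
    return json.loads(text)


print(loads_strict_utf8(b'\xef\xbb\xbf{"ok": true}'))  # {'ok': True}

try:
    loads_strict_utf8(b'{"name": "\xff\xfe"}')
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```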
Do you have an explanation anywhere of why each (or any) of the edge cases is supposed to succeed or fail, or why it commonly does what it's not supposed to do?
I realize that's almost as much work as writing each test case in the first place, but even a subset of the test cases having that explanation would be valuable.
In practice, people use increasingly smaller subsets of JavaScript to transmit data.
For example, a common pattern is to transmit (numeric) user IDs as strings so that they don't get mangled by floating-point precision issues with large numbers. You see both Twitter and Facebook APIs do this, for example.
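A quick sketch of why the string form helps. Python's json module keeps integers exact, so parse_int=float is used here only to imitate a consumer that stores every JSON number as a double; the field names id and id_str are illustrative.

```python
import json

user_id = 2**53 + 1  # 9007199254740993, not exactly representable as a double

payload = json.dumps({"id": user_id, "id_str": str(user_id)})

# Imitate a client that reads every JSON number as a 64-bit float.
as_doubles = json.loads(payload, parse_int=float)

print(int(as_doubles["id"]) == user_id)      # False: the last digit was rounded away
print(int(as_doubles["id_str"]) == user_id)  # True: the string survived intact
```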
I've read that you should treat IDs as strings anyway because you want to discourage incrementing IDs, adding IDs together etc. Anything else you want to do with integer IDs, e.g. comparing, can be done with string IDs as well.
You definitely can't rely on it. Just the other day I was given a task to take a request payload from our front end and do some stuff with it on our backend. The payload looked like this: {"thing": [{"values": ["foobar"], "type": "blah blah"}, "some identifier"], "other thing": "some string"}. It's mixing types in arrays, which is problematic for most statically typed languages.
Tips for Go: Don't use map[string]interface{} and circumvent the type system (I've seen this a lot in production). The fix involves the UnmarshalJSON and MarshalJSON interfaces. This lets you put the data into a structure that's sane and re-encode it back to something the other system expects.
It's more complicated in practice. Prior to PHP 7.0, the official JSON extension was replaced with a completely different one in many distributions due to licensing issues, and since PHP 7.0, PHP's official JSON extension is yet another completely different implementation.
One of the biggest flaws of JSON is that it doesn't support "undefined". This makes translating Javascript structures to and from JSON actually not preserve the original value. Sigh.
But undefined is specific to Javascript… there are lots of other Javascript things that JSON doesn't handle either, like Set or Map objects. It's not intended to serialize arbitrary JS objects; it's intended to serialize a useful least-common-denominator which has proven useful through experience.
The encoding takeaway seems simple: escape everything with \uxxxx characters that is outside of the ASCII range /[ -~]/ (regex) and you'll be pretty much fine. Set the encoder to utf-8, don't leave [dangling,commas,], and a few other things that are obvious from json.org.
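In Python, for instance, the standard json module already behaves roughly this way. The sketch below relies on json.dumps defaulting to ensure_ascii=True and on json.loads rejecting trailing commas.

```python
import json

# Everything outside ASCII comes out as \uXXXX escapes by default.
encoded = json.dumps({"name": "Zoë"})
print(encoded)  # {"name": "Zo\u00eb"}
print(all(32 <= ord(ch) <= 126 for ch in encoded))  # True

# A dangling comma is refused rather than silently tolerated.
try:
    json.loads('["dangling", "commas",]')
    accepted = True
except json.JSONDecodeError:
    accepted = False
print(accepted)  # False
```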
The JSON standard(s) are very simple. But in practice this simplicity leads to a lot of edge cases. Very deeply nested structures, numbers that run on to infinity. The point of this post is testing various parsers against these malicious structures.
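As a rough illustration of those two cases with Python's standard json module: the wrapper name cautious_loads is invented for this sketch, and whether the deeply nested input is rejected depends on the interpreter version and its recursion limits, so no particular outcome is claimed for it.

```python
import json
import math


def cautious_loads(text):
    """Parse JSON, reporting failures as (None, message) instead of raising."""
    try:
        value = json.loads(text)
    except RecursionError:
        return None, "nesting too deep for this parser"
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        return None, f"invalid JSON: {exc}"
    return value, None


print(cautious_loads("[1, 2]"))  # ([1, 2], None)

# A number too large for a double is quietly turned into infinity.
huge, _ = cautious_loads("1e999")
print(math.isinf(huge))  # True

# 100,000 nested arrays: typically a RecursionError in CPython, caught above.
deep_value, deep_error = cautious_loads("[" * 100_000 + "]" * 100_000)
print(deep_error)
```

Catching the failure is what keeps the caller alive. A parser that overflows the stack instead of raising gives the application no such chance.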
"In conclusion, JSON is not a data format you can rely on blindly." - that suggests the format is bad when it's highlighting problems with the parsers. I see it like saying "Plain text is a bad format because notepad bails on large files"
JSON is the de facto standard when it comes to (un)serialising and exchanging data in web and mobile programming.
I disagree. Take protobuf for example. You get schemas, data structures, and a parser in one package which is actually a lot smaller than JSON and compiles to nearly all the commonly used languages. Ever since I started using it my life has become so much easier! If you don't need your data to be human readable (which is very common) you should not use JSON as a data interchange format.
Please tell that to the multi-billion dollar company I'm currently working to integrate with that decided 2016 was a good year to implement a brand new SOAP API.
Since you haven't demonstrated your premise beyond talking about what you personally prefer, that particular ball is still in your court. The world isn't obliged to accept your unsupported opinions as truth until you're convinced otherwise.
I don't need to demonstrate anything. It is not a de facto standard, since if it were, everybody would be using it, which is not the case (Google is the best example). Look up what the term means and you will understand.
Not trying to be a dork, but thought this would be a good place to bring up...if anyone is interested...in the usage of landmines in current conflicts and the way that they tend to linger.
Call this a comment factoid. Off topic, but interesting.