
It is pretty rare to need to parse JSON yourself, but it isn't that difficult.

In theory, it's not supposed to be "that difficult." But in practice, according to the linked article, due to all the rot and general clusterfuck-ery in the various competing specifications, apparently it is.

Or do you really think you could wrap your head around all those banana peels and put together a robust, production-ready parser in a weekend?

Depends upon what you're doing.

I wouldn't want my own JSON parser out on the web, but if I needed to get JSON from $known_environment to $my_service, I'd feel safe enough with a parser I wrote.


Well, that's the difference: the discrepancy between handling the data you're handed and handling all the data possible. JSON has a lot of edge cases that are very infrequently exercised, so a "robust, production-ready parser" is not usually what's desired by the pragmatist with a deadline. That can lead to security holes, but it doesn't necessarily. For example, sometimes the inputs are config files curated by your coworkers, or outputs that come from a server under your control and will always be in UTF-8 and will never use floats or integers larger than 2^50.
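For context on that 2^50 figure: many parsers store every JSON number as an IEEE-754 double, whose 53-bit significand means integers above 2^53 silently lose precision, so staying near 2^50 leaves a comfortable margin. A quick Python sketch of the failure mode:

    big = 2 ** 53 + 1
    print(big)              # 9007199254740993
    print(int(float(big)))  # 9007199254740992 -- a round trip through
                            # a double silently drops the +1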

Taking it the other way, we can also ask: how can you make the parser as simple as possible, so that everything is well-specified and nobody can eff it up, while still preserving the structure that JSON gives you? I tried to experiment with that about five years ago and came up with [1], but it shows a nasty cost differential between "human-readable" and "easy to parse." For example, the easiest string type to parse is a netstring, and that means automatic, consistent handling of embedded nulls and what-have-you... but when those unprintable characters aren't escaped, you inherently have trouble reading/writing the file with a text editor. Similarly, the easiest float representation to parse is just the raw 64 bits, dumped directly or as hex... but either way you can't properly edit them with a text editor. Etc.
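To make the netstring point concrete, here's a minimal reader in classic DJB netstring form (length, colon, raw bytes, trailing comma). This is my own illustrative Python, and the trailing-comma detail may differ from what [1] actually uses:

    def parse_netstring(data: bytes, pos: int = 0):
        # One netstring, e.g. b'5:hello,' -> (b'hello', next_pos).
        colon = data.index(b':', pos)
        length = int(data[pos:colon])       # explicit byte count up front
        start, end = colon + 1, colon + 1 + length
        if data[end:end + 1] != b',':
            raise ValueError('malformed netstring')
        return data[start:end], end + 1

    payload, _ = parse_netstring(b'11:hello\x00world,')
    # The embedded NUL comes through byte-for-byte: no escaping, no
    # ambiguity -- and no hope of editing it comfortably in a text editor.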

But the central problem I'm having with JSON and XML is that streaming parsers are harder to find (and harder to control!), so one thing I'm thinking about for the future is that the formats I use will probably need to be streaming from the top level.[2] So if anyone's reading this and designing stuff: probably even more important than making the parser obviously correct is making it obviously streaming.

[1] https://github.com/drostie/bsencode is based on having an easy-to-parse "outer language" of s-expressions, symbols, and netstrings, followed by an interpretation step where e.g. (float 8:01234567) is evaluated to be the corresponding float.
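As a hypothetical illustration of that interpretation step (the big-endian '>d' layout here is my assumption, not necessarily bsencode's choice):

    import struct

    def interpret(node):
        # After the outer parse yields ['float', <8 raw bytes>], the
        # interpreter just reinterprets those bytes as an IEEE-754 double.
        if isinstance(node, list) and node[:1] == ['float']:
            return struct.unpack('>d', node[1])[0]
        if isinstance(node, list):
            return [interpret(child) for child in node]
        return node

    print(interpret(['float', b'\x3f\xf0\x00\x00\x00\x00\x00\x00']))  # 1.0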

[2] More recently I've had a lot of success getting streamability in more-parallel settings; for example, if you remove whitespace from JSON, then [date][tab][process-id][tab][json][newline] is a nice sort of TSV that gets really useful for a workflow of "append what you're about to do to the journal, then do it, then append back that it's done" and so forth. When a human technician needs to go back through the logs, they have what they need to narrow down (a) when something went wrong, (b) what else was on that process when it was going wrong, and (c) what it did and what it was trying to do. You can of course do all this in JSON, but then you need a streaming JSON parser, whereas everyone can do the line-buffering of "buffer the partial line, split the next chunk by newlines, prepend the buffer to the first piece, save the last piece back to the buffer, then emit the complete lines and wait on the next chunk."
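For the record, that line-buffering loop is only a few lines of Python. This is my own sketch of the procedure described above; the names `iter_lines` and `journal_entry` are made up for illustration:

    import json, os, time

    def iter_lines(chunks):
        # Hold the trailing partial line in a buffer until the newline
        # that completes it arrives in a later chunk.
        buf = b''
        for chunk in chunks:
            *complete, buf = (buf + chunk).split(b'\n')
            yield from complete
        if buf:
            yield buf       # final line, if the stream lacked a newline

    def journal_entry(payload):
        # [date][tab][process-id][tab][json][newline]; whitespace-free
        # JSON keeps the whole record on one line.
        body = json.dumps(payload, separators=(',', ':'))
        return f'{time.strftime("%Y-%m-%dT%H:%M:%S")}\t{os.getpid()}\t{body}\n'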
