Newline Delimited JSON (ndjson.org)
47 points by xingyzt on May 25, 2021 | hide | past | favorite | 57 comments


I was wondering how this compared to jsonlines/jsonl https://jsonlines.org

> This page describes the JSON Lines text format, also called newline-delimited JSON.

And then I saw this at the bottom of the ndjson.org page:

> Site forked from jsonlines.org

So is this a fork of the site only, or a fork of the jsonlines standard?


Yeah, the only difference I can see from the specs is that ndjson SHOULD be served with the media type `application/x-ndjson`, while jsonl does not seem to specify a media type.

There are similar issues in both projects:

- https://github.com/wardi/jsonlines/issues/22

- https://github.com/ndjson/ndjson-spec/issues/35


I was curious whether .jsonl or .ndjson is used more often. Based on frequency in GitHub it looks like both are popular but .jsonl enjoys slightly wider usage:

  extension:jsonl   62K results [1]
  extension:ndjson  41K results [2]
[1]: https://github.com/search?q=extension:jsonl

[2]: https://github.com/search?q=extension:ndjson


This is an awful lot of foofaraw to basically say "no newlines in your stream of discrete messages except between messages". The proposal appears to have absolutely nothing to do with JSON, and the only connection is that their example messages happen to be JSON but could just as easily be anything else.


The reason why it works with JSON is that JSON strings are not allowed to contain real newline characters. The spec says they have to be escaped. So newline works as a field delimiter, but only because JSON excludes it from field values.
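A quick sketch of why this holds: Python's standard `json` module escapes any newline inside a string value, so an encoded record can never span lines.

```python
import json

# json.dumps never emits a literal newline inside a string value:
# the "\n" in the data becomes the two characters backslash-n.
record = {"msg": "line one\nline two"}
encoded = json.dumps(record)

assert "\n" not in encoded            # safe to delimit records with newlines
assert json.loads(encoded) == record  # and it round-trips losslessly
```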


Every compactible format allows physically marking line breaks. JSON isn't very special in this regard.


I am not familiar with the phrase "compactible format", but it's not about allowing, it's about requiring. I can be sure that my JSON message contains no literal newlines, because if it did then it wouldn't be JSON. Contrast with any format that allows literal newlines in values, for example CSV...


> I am not familiar with the phrase "compactible format"

You know how C++ lines end with semicolons (or how Javascript lines _can_ end that way if you want)? And how that means you can put all of the instructions in a function on one hyperlong line? That's compactible. Any format where you are able to smush all the lines within a functional group together into one long line is compactible.

You can use the ASCII record separator character instead of newlines if you prefer. Separating records without colliding with newlines is literally what it's for.


In the two languages mentioned, semicolons play different roles: statement terminator vs. statement separator.


newlines are explicitly allowed in json. https://www.json.org/json-en.html

See "whitespace". In order to be used in ndjson, valid json must have its newlines removed.


Right, but newlines are not semantically meaningful in JSON. JSON with newlines stripped naively will parse identically to JSON containing those newlines.
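To illustrate: because JSON forbids literal newlines inside string values, naively deleting every newline from a valid document cannot corrupt any value, and the result parses identically.

```python
import json

pretty = """{
  "a": 1,
  "b": [2, 3]
}"""
# Stripping all newlines is safe precisely because no string value
# in valid JSON can contain a literal newline.
compact = pretty.replace("\n", "")

assert json.loads(pretty) == json.loads(compact)
```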


It looks like each line in the stream represents a JSON statement (only hashes?) in an array, even without being enclosed in an array and comma-delimited. Formally specifying the syntax alternative is worthwhile.


> It looks each line in the stream represents a JSON statement

Only incidentally. There's literally nothing about the format, and I hesitate to give it the implied gravitas by calling it a format, that makes the JSON part important in any way. The lines are just individual messages. JSON happens to fit, but so do many other syntaxes.

What's an array? It's an abstract sequence of items. What's a delimited stream? It's a sequence of items. Do the lines need to be JSON? Who the hell cares. If you're using JSON then they're JSON. If you aren't, then they aren't.

Honestly, the most important part of their description is "for passing messages between cooperating processes", because JSON is terrible for blindly consuming messages without cooperative explanation from the sender (http://seriot.ch/parsing_json.php).

Given that you can't actually do anything with JSON data (or most data, JSON is again not extremely special in this regard) without a concrete explanation from the sender of the expected interpretation for every field, can you tell me what you think the important difference is between "json messages delimited by newlines" as proposed here with its own custom mimetype but without any other information and "newline delimited messages that can be parsed with a json parser"?

They'd have been better off using the ASCII record separator character instead of newline. That's both already a standard and also doesn't force escaping to prevent collisions with normal text.
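For comparison, RFC 7464 (json-seq) frames each JSON text with a leading ASCII record separator (0x1E). A minimal parser sketch, assuming well-formed input:

```python
import json

RS = "\x1e"  # ASCII record separator, used as the frame marker in RFC 7464

def parse_json_seq(data):
    """Parse an RFC 7464-style sequence: each record begins with RS."""
    return [json.loads(chunk) for chunk in data.split(RS) if chunk.strip()]

stream = RS + '{"a": 1}\n' + RS + '{"b": 2}\n'
assert parse_json_seq(stream) == [{"a": 1}, {"b": 2}]
```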


The record separator ship sailed long ago. The advantage of newline is that it’s easy to use with a bunch of standard tools like less/head/tail/your text editor.

Besides, you do still need to deal with the situation where the character shows up in your data; using an uncommon codepoint just makes it much easier to ignore until it blows up in your face.


> you do still need to deal with the situation where the character shows up in your data

JSON doesn't support binary data without conversion to string first.


I thought that you said that JSON was incidental and unimportant? My point is that using RS does still force escaping, in the general case; it’s not a panacea. You still need to figure out how to deal with your delimiter character appearing in the data. The way JSON Lines does this is by specifying the use of (a slightly limited subset of) JSON.


> I thought that you said that JSON was incidental and unimportant?

It is for this, but if we're already declaring a restriction to JSON then we're likewise declaring a restriction to unicode. All delimited data requires escaping if your delimiter can be part of a no-delimit sequence. At least a character explicitly intended for delimiting isn't part of any no-delimit sequences representable in unicode whereas \n is.


> we're likewise declaring a restriction to unicode.

U+1E is a Unicode code point, and therefore requires escaping inside of a ‘no-delimit’ sequence of Unicode characters.


The point of specifying JSON is so a client can parse the stream of messages.

This is as useful/useless as JSON itself.


Sure, except then you don't know what the format of each message in the stream is. This way you know that it's JSON. Nobody is stopping anyone from making ND-XML, ND-Sexpr, ND-Messagepack, et al.


Knowing simply that something uses JSON syntax is itself insufficient to actually do anything with the data. Sorry, but you cannot escape communicating semantic understanding by throwing a "parts are JSON" mimetype header on it. JSON is just not a self-describing format, so you really can't escape the fact that it's only ever useful when either talking to yourself or when the sender is already saying "and when I say 'age' that is a number of months and means..." In both cases you already know that the messages are formatted with JSON.


There are 3 layers here:

1. Records separated by ASCII "\n" characters, wherein each record cannot contain an "\n" character

2. Each record is UTF-8 text in JSON format

3. The schema of each record is <something>

ND-JSON takes care of the first two. You would still have to use JSONSchema for the third.

This is not very different from using JSON without the "ND" part. Except:

1. Now there's the "ND" part, and I have something standardized I can use for quick and easy understanding/agreement when collaborating, especially outside my organization.

2. Plain JSON supports exactly 1 record per stream or file. ND-JSON supports any natural number of records, or an infinite number of records. That's useful to me.

I really don't understand the resistance here. Nobody is trotting this out claiming it's some hot new thing. It's an inoffensive standardization of a pretty basic data format that people have been using for 10+ years.

If you have no use for it, then just ignore it. Consider that the people using it might not be morons; they might just have needs that are different from yours.


> 2. Plain JSON supports exactly 1 record per stream or file.

Unless you wrap it in an array, which is all this is doing.


No, that is not "all this is doing".

A JSON array is exactly 1 "record", in that it must be parsed entirely before it can be interpreted as anything other than a string of text.

Moreover, you can write a ND-JSON file where each record is an array. The following is valid ND-JSON:

    ["a", "b", "c"]
    ["u", "v"]
    ["x", "y", "z"]


> must be parsed entirely

Streaming JSON parsers do exist, but they’re rather specialized and harder to use than JSON.parse(). The advantage of ND-JSON is that you can get the same effect in a much simpler way.
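A sketch of that "same effect, much simpler" point: a plain generator gives you incremental parsing with nothing but the stdlib, no streaming parser needed.

```python
import io
import json

def iter_records(fp):
    """Yield one parsed object per line, without loading the whole file."""
    for line in fp:
        if line.strip():  # tolerate blank lines
            yield json.loads(line)

fp = io.StringIO('{"n": 1}\n{"n": 2}\n{"n": 3}\n')
assert [r["n"] for r in iter_records(fp)] == [1, 2, 3]
```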


Makes sense to me. I've been using this and calling it "jsonlog" format, but now I can switch to a hopefully more standard name.


This has been Twitter's stream format since they had a stream format for their firehose in probably 2007.


Right, it's pretty much the standard already.


Err... Am I the only one that sees JSON objects on multiple lines? And that's it...


Exactly. It gives a name and a mime-type to a trivial and commonly used format. That's it. But useful nonetheless.


And here I am wrapping my results in a result object like a newb:

    {
        "results": [
            { ... },
            { ... },
            { ... }
        ]
    }


Well, jsonlines works extremely well for gigabytes of short json objects. You can mix and match usual unix tools, or simple scripts with just "read" in a loop, and you can parse each json object without worrying it'll take all your memory. jq can convert arrays to json lines and back. I love the format.
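The array-to-lines conversion (what `jq -c '.[]'` does, and `jq -s '.'` in reverse) is also a two-liner in Python, sketched here:

```python
import json

# Array -> JSON Lines: one compact document per line, and back again.
objs = [{"id": 1}, {"id": 2}]
as_lines = "\n".join(json.dumps(o) for o in objs)
round_trip = [json.loads(line) for line in as_lines.splitlines()]

assert round_trip == objs
```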


Line separated JSON has its uses, for example when streaming data.


Ok, but what if you want to transmit two streams? You'd get something like Interlaced-newline-delimited-JSON. And so the list of required file formats knows no end.


If you're transmitting two streams you use two streams to transmit. I don't see the problem?


Two streams over the same channel.


I don't see why this is a problem with ndjson anymore than it is a problem with other stream text encodings (like CSV).

ndjson seems to be about how to format the content of the stream, not about how to multiplex multiple streams over a single channel. Use http2, zeromq, raw tcp or whatever else you like for that.


The only reason people care about newlines for a stream is because they've arbitrarily chosen to fetch bytes from the stream until the next newline sequence (readline instead of read). But you could just as easily look for a different sequence, like the ASCII record separator character which was invented exactly for this task, and then you wouldn't have to destructively strip newlines from your input.


I'll switch from \n to \x1E when cat, head, text editors, etc. start supporting the latter as a synonym for the former.


We tell our customers our streaming API uses jsonlines for documentation purposes, but we actually just decode in a loop until EOF, and 400 at the first decode error. No separators necessary at all.
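That decode-in-a-loop approach can be sketched with the stdlib's `json.JSONDecoder.raw_decode`, which parses one value and reports where it stopped:

```python
import json

def decode_concatenated(text):
    """Decode back-to-back JSON values with no separators at all."""
    decoder = json.JSONDecoder()
    idx, out = 0, []
    while idx < len(text):
        obj, end = decoder.raw_decode(text, idx)  # parse one value, get its end
        out.append(obj)
        idx = end
        while idx < len(text) and text[idx].isspace():  # skip any whitespace
            idx += 1
    return out

assert decode_concatenated('{"a":1}{"b":2}[3]') == [{"a": 1}, {"b": 2}, [3]]
```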


or just:

[ { ... }, { ... }, { ... } ]


I know I have used something like this code before; it is basically what they are talking about.

json_decode("[" . implode(",\n", $lines) . "]")


Sure, that's one way to parse this format. The advantage is that individual records can also be manipulated by JSON-agnostic code, so you can easily do things like stream processing.


Why not just use YAML, which accomplishes effectively the same goals, and already has a decent amount of traction?

In general I use YAML for human-edited configuration files, because they generate clean diffs, and JSON for machine-to-machine communication, because of the wealth of obscenely optimized parsers for every language.


I don't think YAML accomplishes the same goals.

I like YAML, and I think people hate on it more than it deserves (at least, in its current form). And yes, you can stream YAML documents by concatenating them with document separators. But ND-JSON/JSONLines/RFC-7464 is mostly a machine-to-machine format anyway.

Maybe we should be able to customize the delimiter? The media type "application/stream+json; separator=\n" would be JSONLines/ND-JSON, "application/stream+json; separator=\x1E" would be RFC 7464, "application/stream+json; separator=\n...\n" would theoretically be valid YAML, etc.


I've done a similar thing with Lisp sometimes: implementing hacks so that a list in a file always looks like this:

  ((lorem ipsum ..)
   (quorat est demonstrandum ...)
   ...
   ...)
Only major line-breaks between top-level elements.


I could be wrong, but isn't the Elasticsearch API implemented like this?


I know the Elasticsearch bulk API uses it - I can see why, because you're sending one request that basically says, "Hey, do all of these separate operations at once", but I've cursed it a few times because it makes it a slight pain to build bulk requests programmatically.
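For reference, building a bulk body programmatically looks roughly like this sketch (index name and documents are made up): each operation is an action line followed by a source line, all newline-delimited, and the body must end with a trailing newline.

```python
import json

docs = [{"id": 1, "title": "foo"}, {"id": 2, "title": "bar"}]  # hypothetical data

lines = []
for doc in docs:
    # Action metadata line, then the document source line.
    lines.append(json.dumps({"index": {"_index": "my-index", "_id": doc["id"]}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"  # bulk bodies must end with a newline

assert body.endswith("\n")
assert len(body.strip().splitlines()) == 4  # 2 action lines + 2 source lines
```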


basically json streams? https://en.wikipedia.org/wiki/JSON_streaming

In my day job we use ndjson pretty extensively and it's been pretty good to us. It has far better characteristics than passing a giant json array of objects (you have to hit the closing bracket on the array to know if it's valid).


Haven't we already been doing this for years? JQ has supported this for a long time as well.


NDJSON is a fork of https://jsonlines.org/, which has been around since 2013. I’m sure the concept is even older, but it can be helpful to have a name and a spec (no matter how small) for everyone to agree on.


Yes, but having a spec (however trivial) means I can tell someone to use "nd-json" and send them a link to the spec, without having to debate about escaping, utf-8, etc.


Why not throw in the inline comments feature to the mix if you're gonna end up using a different spec/different parser anyway? This is an 8+ year old spec and I don't think it has caught much adoption.

If you're really into not having commas, maybe you can do a post-processing step with sed or in-memory string replacement to add/remove "," and \n? Then you're still getting the same JSON experience under the covers where programs process it?


> Why not throw in the inline comments feature to the mix if you're gonna end up using a different spec/different parser anyway?

Doing so would complicate adoption, since parsing is essentially just:

    for line in file:
        obj = json.loads(line)
> This is an 8+ year old spec and I don't think it has caught much adoption

It's seen huge adoption, essentially becoming the standard logging format for many Node.js applications.

Usually I've heard it referred to as JSON Lines: https://jsonlines.org


The name "JSON Lines" is a lot better in my opinion. I hope the two Git repo maintainers can agree on some kind of merge, but they've had almost a decade to do it and neither one seems interested.

There's also an RFC (7464) for more or less the same thing, except using ASCII record separators instead of newline characters: https://datatracker.ietf.org/doc/html/rfc7464

And yes, I have used this at $JOB and I intend to use it in future ${JOB}s. That's adoption.


Big advantage of using new lines over record separators is that you can use standard unix command line tools. Can get a long way with grep | jq | sort | uniq pipelines.


The idea behind this spec is that it’s an ultra-simple framing format for a stream of JSON documents; you would use the same JSON parser as always but e.g. inside a loop that reads each line from the input. It has lots of adoption, but since it’s such a simple format many of those have invented it independently and don’t call it by name.



