
> at least not until the IANA get around to officially assigning them a media type.

This is the wrong characterisation. IANA does not take such initiative; their role is administrative rather than regulatory or active. It’s up to an interested party to register media types.

For Parquet, that’s easy: the developers can fill out https://www.iana.org/form/media-types in probably less than ten minutes, most likely choosing the media type application/vnd.apache.parquet. It’ll be processed quickly.

For JSON Lines/NDJSON, it’s messier, calling for standards tree registration, which generally means taking a proper specification through some relevant IETF working group. (There are a few media types in customary use presently, all bad: application/x-ndjson, application/x-jsonlines, application/jsonlines; all are in the standards tree despite nonregistration, and two include the long-obsolete x- prefix.) Such an adventurer will doubtless encounter at least some resistance due to the existing JSON Text Sequences (application/json-seq, defined in RFC 7464, https://www.rfc-editor.org/rfc/rfc7464). That format is functionally equivalent, mildly harder to work with, and technically superior, due to being unambiguously not-just-JSON, using a ␞ (U+001E RECORD SEPARATOR) prefix on every record. But given the definite popularity of JSON Lines/NDJSON, an Internet Draft will easily be enough for provisional registration.
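To make the difference between the two framings concrete, here is a minimal Python sketch. The function names are made up for illustration; the only thing taken from RFC 7464 is that each JSON text is prefixed with U+001E and followed by a line feed. It is not a full parser (for example, it does nothing about truncated records).

```python
import json

RS = "\x1e"  # U+001E RECORD SEPARATOR, the per-record prefix from RFC 7464


def to_json_lines(records):
    # JSON Lines/NDJSON: one compact JSON text per line.
    return "".join(json.dumps(r) + "\n" for r in records)


def to_json_seq(records):
    # JSON text sequence: RS before each JSON text, LF after it.
    return "".join(RS + json.dumps(r) + "\n" for r in records)


def from_json_lines(text):
    # Split on LF only; str.splitlines() would also split on other
    # characters (including U+001E), which is not what we want here.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def from_json_seq(text):
    # Split on RS; surrounding whitespace (the trailing LF) is ignored by json.loads.
    return [json.loads(chunk) for chunk in text.split(RS) if chunk.strip()]


records = [{"a": 1}, [1, 2], "two\nlines"]
print(repr(to_json_lines(records)))
print(repr(to_json_seq(records)))
```

Both round-trip the same data; the only visible difference is the leading RS byte on every record.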




> and technically superior, due to being unambiguously not-just-JSON, using a ␞ (U+001E RECORD SEPARATOR) prefix on every record

Where would you see its superiority? I've mostly worked with jsonlines so far, but I found it very convenient to use, as it's almost the natural input/output format for jq, grep and all kinds of other line-based tools.

I get that jsonseq would be easier to parse in theory, but this goes away when you ensure that no individual json segment contains a newline. And ensuring this is basically a jq -c call.

Because json is whitespace agnostic, there is also no situation where you need a newline to represent the data.
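For anyone not using jq, the same "one record per line" guarantee can be sketched in Python. This is an illustration, not a claim about what jq does internally; the helper name compact is made up. The point is that a serialiser never needs a raw newline: whitespace between tokens is optional, and a newline inside a string is written as the escape \n.

```python
import json


def compact(json_text):
    # Re-serialise a JSON text onto a single line, similar in spirit to `jq -c`.
    # No indent means no newlines between tokens; newlines inside strings
    # are emitted as the two-character escape \n by json.dumps.
    return json.dumps(json.loads(json_text), separators=(",", ":"))


pretty = '{\n  "msg": "line1\\nline2",\n  "n": [1, 2]\n}'
one_line = compact(pretty)
print(one_line)
```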

The only advantage of jsonseq I see is that in files which contain exactly one item you unambiguously know it's not JSON. The advantage goes away for files with zero items though - and in most situations where you'd have to make that distinction, I'd assume you'd use the content type anyway.


I find a surprising amount of value in in-band media type signalling. Not everything sees the declared media type, and media types are regularly calculated from file contents (magic numbers) rather than anything else, also. So here, the very first byte lets you know you’re dealing with a JSON text sequence rather than JSON or concatenated JSON or newline-delimited JSON or whatever else. It really comes down to just that. That line feeds are permitted (though largely not recommended) elsewhere is nice for human writing, but not particularly relevant, as these formats are not generally intended for human writing (you’d just use normal JSON if you wanted that).
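A rough Python sketch of what that content sniffing could look like. The function name and the returned labels are invented for the example; real magic-number databases work differently. It shows that the RS byte settles the question immediately, while a one-record JSON Lines file cannot be told apart from plain JSON, and an empty file tells you nothing.

```python
import json


def sniff(data: bytes) -> str:
    # Hypothetical labels, for illustration only.
    if data.startswith(b"\x1e"):
        return "json-seq"  # the very first byte decides it
    try:
        json.loads(data)
    except ValueError:
        pass
    else:
        # A single JSON text: could be JSON or one-record JSON Lines.
        return "json-or-single-line-jsonl"
    lines = [ln for ln in data.split(b"\n") if ln.strip()]
    try:
        for ln in lines:
            json.loads(ln)
    except ValueError:
        return "unknown"
    return "jsonl" if lines else "empty"


print(sniff(b'\x1e{"a":1}\n'))
print(sniff(b'{"a":1}\n'))
print(sniff(b'{"a":1}\n{"a":2}\n'))
```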


Thanks for that background. I'll try to update the post when I get home this evening



