
"Self-describing formats like JSON Lines are big... but when you compress them, they go back to being small."

For CSV file replacements, I'd expect something like "one JSON array per line, all values must be JSON scalars". In that case, it's not much larger than a CSV, especially one using quotes already for string values.

But this demonstrates the problem with JSON for CSV, I suppose. Is each line an object? Is it wrapped in a top-level array or not? If it is objects, do the objects have to be one line? If it is an object, where do we put field order? The whole problem we're trying to solve with CSV is that it's not a format, it's a family of formats, but without some authority coming in and declaring a specialized JSON format we end up with a family of JSON formats to replace CSV as well. I'd still say it's a step up; at least the family of JSON formats is unambiguously parseable and the correct string values will pop out. But it's less of a full solution than I'd like.
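To make the ambiguity concrete, here's a quick Python sketch (row data invented for illustration) showing three mutually incompatible ways the same little table comes out as "JSON":

```python
import json

# The same two-row table, serialized three ways -- all valid JSON, all incompatible.
rows = [{"name": "Ada", "year": 1815}, {"name": "Alan", "year": 1912}]

as_objects = "\n".join(json.dumps(r) for r in rows)                 # one object per line
as_arrays = "\n".join(json.dumps(list(r.values())) for r in rows)   # one array per line
as_doc = json.dumps(rows)                                           # single top-level array

# Each variant parses fine on its own, but a reader built for one
# chokes on (or misreads) the others -- and only the object form
# carries field names; only the array forms preserve column order.
```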

(It wouldn't even have to be that much of an authority necessarily, but certainly more than "The HN user named jerf declares it to be thus." Though I suppose if I registered "csvjson.org" or some obvious variant and put up a suitably professional-looking page that might just do the trick. I know of a few other "standards" that don't seem to be much more than that. Technically, even JSON itself wasn't much more than that for a lot of its run, though it is an IETF standard now.)



Indeed, this has already been done: http://ndjson.org/

To be fair, it's not an objectionable format. Using line breaks to separate objects makes it streamable, and you don't need to enclose the whole thing in an array to make it a valid JSON document.


That is not quite a CSV replacement. I use it for things with nested objects and such all the time. To be a CSV replacement you really need to add that each line must be a JSON array, and that it can only contain scalars (no sub-arrays or objects). That would be a decent enough replacement for CSV itself. Not perfect, but the CSV "standard" is already a nightmare at the edges anyhow, and honestly a lot of it can't be fixed anyway, so this is probably as good as it gets.
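For what it's worth, a strict reader for that convention only takes a few lines. This is just a sketch of the rules I described, not any standard (the function name is mine):

```python
import json

SCALARS = (str, int, float, bool, type(None))

def parse_csv_like_jsonl(text):
    """Parse text where each non-blank line must be a JSON array of scalars only."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        value = json.loads(line)
        if not isinstance(value, list):
            raise ValueError(f"line {lineno}: not a JSON array")
        if not all(isinstance(v, SCALARS) for v in value):
            raise ValueError(f"line {lineno}: nested arrays/objects not allowed")
        rows.append(value)
    return rows
```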


> that it can only have scalars in it (no sub-arrays or objects)

I see CSV files that contain JSON arrays/objects in their fields all the time. Mainly from exporting Postgres tables that contain json/jsonb-typed columns. Are you saying that these aren't valid CSVs?


They're saying that a CSV equivalent should be strictly 2-dimensional, with "flat" values.

Such a format could contain arbitrary JSON in a "cell", but simply as text, in the same way as putting the same data in a CSV.


These are strings containing JSON


> But this demonstrates the problem with JSON for CSV, I suppose. Is each line an object?

How is that not a problem with every data serialization format? It does me no real good to have an XML schema and a corresponding file; if I don't know what those elements and attributes represent, I'm not really any better off.

It's not like JSON or XML can meaningfully be marshaled back into objects generically, without knowledge of what is represented. There are generic JSON and XML readers that let you parse the data elements, sure, but so do generic CSV readers like C#'s CsvHelper or Python's csv. In all cases you have to know what the object turns into in the application before the serialized data is useful.
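To illustrate with Python's stdlib: a generic CSV reader hands back untyped strings, a generic JSON parser at least recovers types, and neither tells you what the fields mean:

```python
import csv
import io
import json

raw = "name,year\nAda,1815\n"

# A generic CSV reader gives strings only; "1815" stays text.
csv_rows = list(csv.reader(io.StringIO(raw)))

# A generic JSON parser recovers the types...
json_row = json.loads('{"name": "Ada", "year": 1815}')

# ...but neither format says what "year" *means*; that knowledge
# lives in the application, not in the serialization.
```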

And, yes, CSV has slightly differing formats, but so does JSON. Date formats are conventionally ISO 8601, but that's not in the spec. That's why Microsoft got away with proprietary date formats in System.Text.Json. XML isn't really any better.


> That's why Microsoft got away with proprietary date formats in System.Text.Json.

What's proprietary in it? It follows ISO 8601-1:2019 and RFC 3339 according to the docs.


Sorry, that should be System.Runtime.Serialization.Json. System.Text.Json is the newer class that replaced it.

In .Net Framework 4.6 and earlier, the only built-in JSON serializer was System.Runtime.Serialization.Json.DataContractJsonSerializer.

You can still see it. If you're on Windows 10, open Windows Powershell v5.1 and run:

  Get-Item C:\Windows\System32\notepad.exe | Select-Object -Property Name, LastWriteTime | ConvertTo-Json
You'll see this output:

  {
    "Name":  "notepad.exe",
    "LastWriteTime":  "\/Date(1626957326200)\/"
  }
Microsoft didn't fix their weird JSON serialization until quite late. They may have backported it to the .Net Framework, but they've deleted that documentation. Powershell v6 and v7 include the newer classes that are properly behaved. This is why Json.NET used to be so popular and ubiquitous for C# and ASP applications: it generated JSON the way most web applications do, not the way Microsoft's wonky class did. Indeed, I believe it may be what System.Text.Json is based on.
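If you ever have to consume that legacy output, the token (once JSON-unescaped to /Date(...)/) is just milliseconds since the Unix epoch. A minimal decoder sketch in Python, ignoring the optional +hhmm offset suffix the format allows:

```python
import re
from datetime import datetime, timezone

def parse_ms_date(s):
    """Decode the legacy DataContractJsonSerializer date token,
    e.g. "/Date(1626957326200)/" -> a UTC datetime.
    (Discards the optional [+-]hhmm offset suffix, if present.)"""
    m = re.fullmatch(r"/Date\((-?\d+)(?:[+-]\d{4})?\)/", s)
    if not m:
        raise ValueError(f"not a /Date(...)/ token: {s!r}")
    ms = int(m.group(1))
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```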


Oh that one - yeah I've always steered clear of DataContractJsonSerializer. Never understood why they did it so weird.

To be fair, RFC 3339 wasn't even published back when this class was implemented (in .NET 3.5) so I guess they just went with whatever worked for their needs. ¯\_(ツ)_/¯


I'd be quicker to believe that it's because 2007 was still in the middle of Steve Ballmer's Microsoft, when embrace-extend-extinguish was their de facto practice.


I have wondered about a file format where a parser could be specified at the start of each line. You could even have different JSON parsers with different well-characterized limits and relative speeds. Formats could change over time on a line-by-line basis, without being locked into a full-file IDL or similar.
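Something like this, maybe (everything here is invented for illustration): a tag at the start of each line selects the parser for that line's payload:

```python
import json

# Hypothetical "parser tag per line" format: tag, a tab, then the payload.
# Readers dispatch on the tag; new tags can be added without breaking old lines.
PARSERS = {
    "json": json.loads,
    "raw": lambda s: s,  # leave the payload as an uninterpreted string
}

def read_tagged_lines(text):
    for line in text.splitlines():
        tag, _, payload = line.partition("\t")
        yield PARSERS[tag](payload)
```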


JSON Lines is a specified format that answers those questions. https://jsonlines.org/ Seems like it qualifies to the level of authority you're requiring.



