Show HN: ZSV (Zip Separated Values) columnar data format (github.com/hafthor)
86 points by hafthor 9 months ago | 69 comments
A columnar data format built using simple, mature technologies.



I think “human readability” isn’t a great feature for a columnar data format, because once you get data on a scale where the column-oriented layout makes sense, you’re way past the scale where a human would want to read over the stored data anyway. Like, no human is going to read 50k rows, much less 10m rows. I guess it’s nice you can spot check the rows using only zip & head -n 10 and paste, but I don’t think that nice-ness is a good reason to pick a format that forbids common ASCII characters and doesn’t have widespread support.

I guess there’s a sort of permacomputing angle here: this format is simple enough that you could pack a lot of almanac data into it and, given a working zlib, get it back out with very limited dependencies.

But given the petabytes of parquet files out there, I feel like the format is here to stay, much like sqlite is here to stay.

EDIT: there is a handy CLI tool called duckdb for doing SQL on parquet, csv, sqlite3, and other tabular data formats. Great for wrangling and analyzing tabular data from 100 rows to 10m rows and up.
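For a rough idea of what that looks like, here's a minimal sketch using duckdb's Python API rather than the CLI (the file and column names are made up):

    import duckdb  # pip install duckdb

    # Query a parquet file directly with SQL; no load/import step needed.
    duckdb.sql("""
        SELECT user_id, count(*) AS n
        FROM 'events.parquet'
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """).show()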


You normally develop the tools and scripts on much smaller data sets. So you export 1 minute of data, examine it manually or with simple tools, write your scripts, and once they work, switch to processing months' worth of it.

Human-readable comes in handy here.


A pretty printer is almost as handy and with significantly fewer compromises.


I prefer SQL for tabular data larger than 3 screens high; for me it’s easier for basically any analysis compared to a grep/wc/count/cut/paste bash pipeline. I use sqlite for CSV pretty regularly, and if I needed columnar I’d use duckdb for parquet.


Even more so if you store personal data in there, which ofc would be encrypted per row.


> Like, no human is going to read 50k rows, much less 10m rows.

Well, it's 2AM, some dork has checked in code which breaks production, and it absolutely positively has to be fixed by 6:00am before the customer comes in.

Your bleary eyes are scanning through log files and data files, trying to find the answer...

... believe me, you will appreciate human-readable formats for both of those. You just want to cat out the entries in the db which the new code can't handle... the last thing you want to do is have to invoke some other tool or write some other script to make the data human-readable.

And when you find the problem, you will want to just be able to edit a text file containing test cases to verify the fix.

You don't want to write some script to generate and insert the data....at 2am, you are likely to write a buggy script which may keep you from realizing that you've already fixed the problem....or worse, indicate that you have fixed the problem when you haven't.

Fewer moving parts is always better.


> Well, it's 2AM, some dork has checked in code which breaks production, and it absolutely positively has to be fixed by 6:00am before the customer comes in.

This is a classic XY problem. The issue isn't the data format, it's the fact that your organizational processes allow random code pushes at 2am that can break the whole thing.

Parquet, used by basically everyone, isn't human readable (and for good reason): it's for big data storage, retrieval, and processing. CSV is human readable (and for good reason): people use that data in Excel or other spreadsheeting software.


> This is a classic XY problem. The issue isn't the data format, it's the fact that your organizational processes allow random code pushes at 2am that can break the whole thing.

I feel like your comment is a nitpick. Crap getting broken for whatever reason happens. Having human readable things can be helpful for development or fixing things.

Of course, this isn't always the top priority-- other things, like being able to round-trip non-human readable data, or performance, or data density, may win.


This is a classic I-don't-have-a-catchy-term-for-it problem, where someone focusses on the details of a given contrived example and thinks the whole problem is solved by addressing the details of that specific contrived example.

The 2am story above is not the problem. Thinking up something that would have avoided that specific story, like "have 3 people in a shift rotation instead of just yourself, and then you never have a 2am problem" or "don't push code at 2am", is not a solution to the problem.

The value of simple data formats that can be directly read by a human without any special tools is that they make the data accessible to reading, analyzing, even processing or modifying, in all of the unknown, unknowable, infinite possible situations.

You don't know ahead of time that you will one day solve a problem deep in the trenches by being able to read or maybe even modify a file or the stdin of a cgi before some crazy untouchable special app picks it up. You don't know ahead of time that the platform will not have a db client you can use to access the data, but you do know that everything can process text; even an obscure cpu with no gcc or git or any of your usual nice toys has some sort of shell and some sort of text editor.

You can't touch the main app which is some legacy mainframe banking thing or something, you don't have and can't install or compile anything, but you can still read the data and see that there is some unicode character scattered all through it, and you can set up a dirty hack stream edit to convert it to a single byte ascii replacement, using nothing but plain posix sh or some equivalent no matter what the platform.

And that ugly hack is nine thousand times more useful to the bosses and to yourself than not being able to see what was wrong with the data, and then only being able to say "the other side is sending us bad data, it will be broken until they fix their end, or until we can modify the crazy untouchable thing on our end, because I'm a helpless useless twat"

You can't predict ahead of time exactly when or why or how you will end up wanting to be able to access the data without the normal proper tools or apps from the happy path. But it's a fact that it happens, and having the option is more useful than not having the option. And being the person who can solve a problem is more useful than being the person who can't do anything any other way except the normal expected way.

No one said this trumps all other considerations for all jobs for all data, just that it's very valuable, a consideration among other considerations, and you can't predict all the specific ways in which it is valuable, so giving it up has to be a necessity, not a thoughtless default.


I’ve never been frustrated at 2am that my data in sqlite3 or Postgres isn’t in a human readable disk format.

If I’m working with parquet I’ll have duckdb on hand for fiddling parquet files. I’m much better at SQL at 2 am than I am at piping Unix tools together over N files.

I have no idea how I’d drop bad rows from this thing with a bash pipeline anyways, I need to select from one file to find the bad line numbers (grep I guess, I’ll need to look up how to cut just the line number), and then delete those lines from all the files in a zip (??). Sounds a lot harder than a single SELECT WHERE NOT or DELETE WHERE.
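For the record, the duckdb version of that clean-up is roughly this sketch (the file name and the badness predicate are invented):

    import duckdb

    # Select everything except the bad rows into a new parquet file.
    duckdb.execute("""
        COPY (SELECT * FROM 'events.parquet' WHERE NOT status = 'corrupt')
        TO 'events_clean.parquet' (FORMAT parquet)
    """)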


> I’ve never been frustrated at 2am that my data in sqlite3

What if it's not your data? And you've never used sqlite3 or Postgres? It's 2AM, they couldn't get a hold of the guy who wrote the code because he's on vacation, or he wrote it 20 years ago and retired... so they haul your sorry self out of bed?

You really gonna want to be reading sqlite3 tutorials while your boss and your boss's boss's boss are on the video call?

I don't want to overstate my case, and sure, there are plenty of reasons to use a database to store your data. I was just trying to answer the question of the person who asked why human-readable formats are preferable. They are not in all circumstances, of course, but all other things being equal, you will appreciate it when the fit hits the shan.


It’s unclear to me that this is actually fewer moving parts. There are already parquet CLI tools. If your data is in Parquet you should know how to use them or at least have them documented in your oncall runbook.


I'm sure parquet is the bees knees, and I'm sure if it's your code you'll know how to fix it.

But what if it isn't your code? And you've never heard of parquet before? And it's 2AM and they can't get a hold of the guy who wrote it, so they call you in....


Then there’s a serious organizational failing. Parquet is the de facto standard; chances are a random engineer knows how to interact with it and not with ZSV.


> Then there’s a serious organizational failing.

Man, if your success is predicated on working at a company with no organizational problems....

A scrappy start-up can't afford to hire multiple, redundant engineers--and what with all of the massive layoffs happening, even at the big companies a lot of engineers are going to find themselves debugging other people's code.


Then embrace industry standards and don't add unknowns. “There are always organizational problems” is a poor justification for creating more problems.


This copied the superficial data layout without the key benefit of modern columnar formats: segment elimination.

Most such formats support efficient querying by skipping the disk read step entirely when a chunk of data is not relevant to a query. This is done by splitting the data into segments of about 100K rows, and then calculating the min/max range for each column. That is stored separately in a header or small metadata file. This allows huge chunks of the data to be entirely skipped if it falls out of range of some query predicate.
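A toy sketch of the idea in Python (segment size, data, and predicate are all made up; this is the concept, not any particular format's layout):

    # Keep per-segment min/max stats, then skip whole segments whose
    # range cannot possibly match the query predicate.
    SEGMENT_ROWS = 100_000

    def build_stats(values):
        stats = []
        for start in range(0, len(values), SEGMENT_ROWS):
            seg = values[start:start + SEGMENT_ROWS]
            stats.append((start, min(seg), max(seg)))
        return stats

    def scan_greater_than(values, stats, threshold):
        hits = []
        for start, lo, hi in stats:
            if hi <= threshold:
                continue  # whole segment eliminated without reading it
            seg = values[start:start + SEGMENT_ROWS]
            hits.extend(v for v in seg if v > threshold)
        return hits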

PS: the same compression ratio advantages could be achieved by compressing columns stored as JSON arrays, but, unlike ZSV, such a format could encode all Unicode characters and has a readily available decoder in all mainstream programming languages.
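Something like this, purely as an illustration (not part of ZSV as specified):

    import gzip, json

    # One compressed JSON array per column: a similar compression win,
    # and any Unicode value (tabs, newlines, whatever) round-trips.
    column = ["111.11", "222.22", "has\ttab and\nnewline"]
    blob = gzip.compress(json.dumps(column).encode("utf-8"))
    assert json.loads(gzip.decompress(blob)) == column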


It's there, specified as an optional feature.

> Price⇥⇥0 {rows:2, distinct:2, minvalue:111.11, maxvalue:222.22} 111.11⮐222.22⮐

> Price⇥⇥1 {rows:1, distinct:1, minvalue:333.33, maxvalue:333.33} 333.33⮐


I like your idea of storing columns as JSON arrays. I might play around with that. Thanks for giving it a look.


I have a sinking feeling like I’ve unleashed something here.

Some future programmer will be cursing my name as they try to make columnar JSON decoding performant.


hehe. I added an alternative JSON inner format spec to the readme. I need to add JSON and CSV support to zsvutil itself next. I may actually change the spec to default to JSON. All because of you. haha.


The core idea with these compressed columnar "big data" formats is that they minimize storage accesses. Nothing else really matters as much. If you're gonna have to load a sizable chunk of the file to get to the bits you need, the format you store it in starts mattering less.

What this gets right: Part of the reason you want to store columns together is that similar values compress well, so you could reduce your IO: smaller files are faster to load into memory. However, in many cases (e.g. Arrow, Parquet) lightweight compression formats are preferred here, e.g. run-length encoding (1,1,1,1,1,1,5,5,5,5,3,3,3 -> 6x1, 4x5, 3x3) or dictionary encoding (if your column is enum-like, you can store each enum value as a one-byte code), because they can be scanned without decoding, amplifying your savings.
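Toy sketches of those two encodings (just the idea, not how Arrow or Parquet actually lay out the bytes):

    # Run-length encoding: [count, value] pairs.
    def rle_encode(values):
        out = []
        for v in values:
            if out and out[-1][1] == v:
                out[-1][0] += 1
            else:
                out.append([1, v])
        return out

    # Dictionary encoding: distinct values stored once, plus small integer codes.
    def dict_encode(values):
        mapping = {}
        codes = [mapping.setdefault(v, len(mapping)) for v in values]
        return list(mapping), codes

    print(rle_encode([1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 3, 3, 3]))  # [[6, 1], [4, 5], [3, 3]]
    print(dict_encode(["EU", "EU", "US", "EU"]))                # (['EU', 'US'], [0, 0, 1, 0])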

What it misses on (IMHO):

- There's a metadata field, but it doesn't contain any offsets to access a specific column quickly. So if you have 8 columns of 2GB each, just to get to the 7th column you have to read 12GB first, which is quite wasteful. If you stored just an offset, you could be reading a handful of bytes. Massive savings.

- Within each column, how do you get to the range of values you want? Most columnar formats have stripes (i.e. data stored in chunks of X rows each) which carry statistics ("this stripe or range of values contains min value A, max value B") that allow you to skip chunks really fast. So again, within that 2GB you read not much more than you strictly have to.

If this reminds you of an on-disk tree where you first hop to a column and then hop to some specific stripes, yeah, that's pretty much the idea.

-----

Sidenote: I've generally concluded that "human readable" is only a virtue for encoding formats that aren't doing heavy lifting, like the API call your web app is sending to the backend. Even in that case, your HTTP request is wrapped in gzip, wrapped in TLS, wrapped in TCP and chunked to all hell. No one complains about the burden caused by those. So what's one more layer of decoding? We can just demand to have tools that are not terrible, and the result is pretty transparent to us. The format is mostly for the computer, not you.

When I hear about stuff like terabytes of JSON just being dumped into s3 buckets and then consumed again by some other worker I have a fit because it's so easy and cheap these days not to be that wasteful.


Thanks for taking a look. Regarding seekable columns, that's the reason why I use the ZIP file format. It has a central directory at the end of the ZIP file that has locations to each file inside, making it so you can seek to a specific column file to extract.
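In Python's zipfile, for example, that looks roughly like this (the archive and member names here are hypothetical):

    import zipfile

    # The central directory at the end of the archive lists every member,
    # so one column file can be read without touching the others.
    with zipfile.ZipFile("table.zsv") as z:
        with z.open("Price.tsv") as col:
            prices = col.read().decode("utf-8").splitlines()
    print(prices[:10])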


Oh interesting - I missed that tidbit! So with that and row groups and the metadata (assuming you have one metadata block per column in each row group) I think you get to full seekability, right?


It is simple, but how do you access the price in row #1234567890? If your data doesn't have this many records and can fit into RAM, a basic NLJSON or CSV will work just as well.


Like parquet, this isn't really meant for RDBMS-type workloads; it's more for analytics over large datasets. I work in an environment where we typically have tables with over 300 columns and 10s if not 100s of millions of rows daily. When you want to do a simple sum/group by involving 2 or 3 columns, it is great to have a column-store file format where you only read the columns you need, and those are compressed.

The price you pay is that it is inefficient for single record access, or for "select * " kind of queries.


I was comparing it with Parquet, which is much more complex, but has features that help you access the data in less than O(n), like row groups and pages.


You mentioned NLJSON and CSV, which would require reading all columns from disk.


Yes, but you would usually have to read at least two columns anyway. What are the datasets that are too large to be ingested completely, but too small for a proper columnar format?

If ZSV is meant to occupy the gap between CSV/NLJSON (smaller datasets) and Parquet/DuckDB (larger datasets), this niche is actually really small, if not nonexistent.


Yes, it's unclear to me what the advantage is over parquet with compression. And there are enough file formats flying around already.


Even with an OLAP use case, you're most often not scanning every row in the database if you have even a single where clause / conditional filter, which is almost always the case. You need to have some level of locality, and if your format doesn't support that, that'll be enough to kill performance.

Also parquet has lots of features that'll get you to the general vicinity of a single record tolerably fast without sacrificing much in terms of storage or computational complexity. It's a small price for a big win.


There's two ways to limit the number of column-rows you have to read. One is by file partitioning, that is having many ZSV files rather than one giant one, ideally organized by partitioning key field(s). The other way is mentioned as an extension to the format itself which functions much like rowgroups do in Parquet. https://github.com/Hafthor/zsvutil?tab=readme-ov-file#row-gr...

Thanks for taking a look.


Oh, sorry, I must've missed the part about rowgroups and metadata. Yes, this should work to limit the scans to a reasonable amount.


What is NLJSON?


Also known as JSONL, or JSON Lines. Basically a file of JSON objects separated by newlines. Popular format for logs these days for obvious reasons.
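For instance, reading one back is just a line loop (the file name and fields are made up):

    import json

    # Each line is an independent JSON object, e.g.
    #   {"ts": "2024-01-01T00:00:00Z", "level": "info", "msg": "started"}
    #   {"ts": "2024-01-01T00:00:05Z", "level": "error", "msg": "boom"}
    with open("app.jsonl") as f:
        records = [json.loads(line) for line in f if line.strip()]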


NDJSON is the shorthand I've seen: https://github.com/ndjson/ndjson-spec


https://jsonlines.org/ was the first "this is trivial but let's write it down so maybe the name will stick" spec for it (from 2013ish)


Missed opportunity to just call it JSONS.


Newline-delimited JSON


I love this kind of minimalist and clever solution where one developer delivers value similar to that of projects with tens of developers involved. Unfortunately, it's still not enough to defeat the true army of complexity, which contains thousands of developers :)


Thanks for taking a look.


Colour me out of the loop, but what is the utility of this type of approach? I can't seem to grok this.


Many columns (100s) and you only need one. This approach makes it much faster.


Can't store tabs or newlines, odd choice.


Reading quickly, it's because the tab is used to indicate nested tabular data in a column. I wonder why not just have a zsv in the zsv?


yeah, this is a limitation from the TSV format this is based on - there is an extension to the format that supports storing binary blobs - ref: https://github.com/Hafthor/zsvutil?tab=readme-ov-file#nested...


(Offtopic, but just FYI) it's tenet (principle) not tenant (building resident).


doh. fixed. thanks!


Basically it has the same limitations as CSV.

At least you could use something less likely to appear in data as the record separator (like 0x1E).

Otherwise it's an interesting idea!


0x1E is the record separator; it's in ASCII precisely for this purpose. Too bad it’s not popular; here we’re stuck with inferior TSV/CSV.
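A quick illustration of what using them looks like (made-up values, with 0x1F, the ASCII unit separator, between fields):

    # 0x1E (record separator) and 0x1F (unit separator) are plain ASCII
    # control characters, so splitting is trivial -- as long as the data
    # itself never contains them.
    data = "111.11\x1f2024-01-01\x1e222.22\x1f2024-01-02"
    rows = [record.split("\x1f") for record in data.split("\x1e")]
    print(rows)  # [['111.11', '2024-01-01'], ['222.22', '2024-01-02']]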


I can't easily type that out - and once the format can't be read/edited in a simple text editor, I'm starting to lean towards a nice binary format like protobuf.


Strings can contain 0x1E, so it has exactly the same issues as a tab character but with all the downsides of it not being an easy, “simple” character.


As far as I know, thanks to quoting it is possible to put basically any data you want in a CSV.


The problem is there is no uniform standard for quoting and escaping in CSV, and different software uses different variants.


There is a standard (RFC 4180), and it is very simple and easy to use.
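For instance, Python's csv module implements those rules, and the awkward characters round-trip fine:

    import csv, io

    # Fields containing commas, quotes, and newlines get quoted/escaped
    # on write and come back intact on read.
    row = ["plain", "has,comma", 'has "quotes"', "has\nnewline"]
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    parsed = next(csv.reader(io.StringIO(buf.getvalue())))
    assert parsed == row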

Different software uses different variants because we're not allowed to have nice things and devs are too lazy to use something slightly more complicated than .split(',')

Though if you're going to ban some common characters anyway like TSV, you might as well use CSV and ban commas, newlines, and quotation marks.


can't you just do quoting?


https://github.com/Hafthor/zsvutil?tab=readme-ov-file#what-a...

> Any escaping or encoding of these characters would make the format less human-readable, harder to parse and could introduce ambiguity and consistency problems.

Found the wording of "could introduce ambiguity and consistency problems" a bit odd, but I guess they mean that even if things are specified precisely (so there's no ambiguity), not everyone would follow the rules or something? And they want to play nice with other tools following the TSV "standard".


Please. I wrote a csv parser a couple weeks ago in an hour or two. It's not that hard to handle the quoting and edge cases. Yes, maybe different parsers will handle them differently, but just document your choices and that's that. How is ambiguity better than completely disallowing certain chars? That's a non-starter


Can we just all converge on Parquet + Arrow and call it a day please? Too much effort being put into 1..N ways to solve a problem that would be better put towards a single standard.

We work with Parquet + Arrow every day at $DAYJOB in a ML and Big Data context and it's been great. We don't even think we're using it to its fullest potential, but it's never been the bottleneck for us.


How is the data schema description language btw? I haven't used either yet.


Haven't used it directly myself. We mostly just use it for DataFrame crunching via pandas and/or polars (our usage is mixed) which tends to benefit nicely from columnar access.
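The typical pattern is just something like this (file and column names invented):

    import pandas as pd

    # Read only the columns you need; the columnar layout means the other
    # columns never come off disk.
    df = pd.read_parquet("events.parquet", columns=["user_id", "amount"])
    print(df.groupby("user_id")["amount"].sum())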


Check out DFLib (https://dflib.org)


This is very promising for nested datapoints


The only benefit this format provides is the ability to read some columns without needing to read all columns. Unfortunately it is not a seekable format. That's a pretty big miss.

It also wouldn't be that hard to make it seekable. All you would have to do is make each tsv file two columns: record-id, value.


What do you mean it's not seekable?

> ZIP files are a collection of individually compressed files, with a directory as a footer to the file, which makes it easy to seek to a specific file without reading the whole file... The nature of .zip files makes it possible to seek and read just the columns required without having to read/decode the other columns.


Seeking within a column


There's two ways to limit the number of column-rows you have to read. One is by file partitioning, that is having many ZSV files rather than one giant one, ideally organized by partitioning key field(s). The other way is mentioned as an extension to the format itself which functions much like rowgroups do in Parquet. https://github.com/Hafthor/zsvutil?tab=readme-ov-file#row-gr...

Thanks for taking a look.


Wouldn’t be too hard to add a secondary “file” in the zip with an extra index


We need to have a come to Jesus meeting about these columnar formats.



