
You should benchmark SQLite before recommending it like this (for future reference). I think it's a very elegant library, and I admire the development philosophy that went into it. However, every time I've tried to use it for something that needs to be fast, it has ended up being a painful mistake.



Could you show us your methodology for your benchmark? We don't know under which circumstances you experienced degraded performance.

I've seen a lot of cases where someone was just missing an index or something, and their DB ran fine once it was added, so that's why I ask.


I don't work for that company any more, and I couldn't legally have shown the code even if I did.

It's not an index thing though. Try inserting 10 million rows into a simple table. No foreign key constraints or anything. Some of our "documents" had 300 million rows. Handle the primary key however you want (let the DB do it, or generate your own). Use whatever SQLite API calls you want (precompiled statements, whatever). In a case like this, adding other indexes can only slow it down.
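
For reference, here's a minimal sketch of the kind of bulk-insert test being described, using Python's built-in sqlite3 module (the table and column names are made up for illustration):

    import sqlite3, time

    conn = sqlite3.connect("bench.db")
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, value REAL)")

    rows = ((i, i * 0.5) for i in range(10_000_000))

    start = time.perf_counter()
    with conn:  # one transaction for the whole batch; per-row commits are far slower
        conn.executemany("INSERT INTO docs (id, value) VALUES (?, ?)", rows)
    print(f"inserted 10M rows in {time.perf_counter() - start:.1f}s")
    conn.close()

Even with a single transaction and a prepared statement (executemany compiles the statement once and reuses it), this is the baseline to compare against writing the same rows to a CSV file.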

There are a few options to disable integrity checks and other safety constraints. They help speed things up a little, but it's all still painfully slow in comparison to writing a simple CSV file.
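
The "options to disable integrity and other safety constraints" are presumably PRAGMA settings along these lines (a sketch; how much they help depends on the workload, and a crash mid-write can lose or corrupt data with these):

    import sqlite3

    conn = sqlite3.connect("bench.db")
    conn.execute("PRAGMA synchronous = OFF")      # don't wait for fsync on commit
    conn.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal in RAM
    conn.execute("PRAGMA temp_store = MEMORY")
    conn.execute("PRAGMA cache_size = -200000")   # roughly a 200 MB page cache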

The same is true on reading. You can parse an entire CSV file more quickly than "SELECT *" from the table.
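
As a rough illustration of that read comparison (file and table names are hypothetical, and the CSV and the table are assumed to hold the same data):

    import csv, sqlite3, time

    start = time.perf_counter()
    with open("docs.csv", newline="") as f:
        csv_rows = sum(1 for _ in csv.reader(f))
    csv_time = time.perf_counter() - start

    conn = sqlite3.connect("bench.db")
    start = time.perf_counter()
    db_rows = sum(1 for _ in conn.execute("SELECT * FROM docs"))
    db_time = time.perf_counter() - start

    print(f"CSV: {csv_rows} rows in {csv_time:.2f}s; SQLite: {db_rows} rows in {db_time:.2f}s")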

I've tried several times to use SQLite as a document file format. It performs worse than CSV and JSON. It performs much worse than just an array of binary records. The devil is in the details, and assuming each file is not just a single row/record, I wouldn't be surprised if 10+ million files in a structured directory tree performed better too.


Well, I certainly agree about benchmarking versus other options, especially if I/O proves to be a bottleneck. The other option above was not a single CSV or JSON-structured file, but millions of small files stored on the filesystem (with its requisite indexes, permissions, access logs, etc). And the comparison is not with the file contents being splayed out into a relational structure in SQLite, but just one flat key-blob table for storage and retrieval. It's possible that a multi-gigabyte CSV file would be faster still, depending on actual access patterns and how much effort you want to devote to this bespoke CSV database.
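
Concretely, the flat key-blob layout meant here is something like this (names are illustrative):

    import sqlite3

    conn = sqlite3.connect("store.db")
    conn.execute("CREATE TABLE IF NOT EXISTS blobs (key TEXT PRIMARY KEY, body BLOB)")

    def put(key: str, body: bytes) -> None:
        with conn:
            conn.execute("INSERT OR REPLACE INTO blobs (key, body) VALUES (?, ?)", (key, body))

    def get(key: str) -> bytes | None:
        row = conn.execute("SELECT body FROM blobs WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

Each "file" becomes one row, stored and fetched by key, with no relational decomposition of its contents.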


> The other option above was [...] millions of small files stored on the filesystem

I don't have any direct comparisons of SQLite to this approach, but other projects I've worked on did have a simple directory tree organizing medium size chunks of data by date over 20+ years. We had one "DB" where everything was stored by the degree of latitude and longitude it fell into, and another where things were stored by date. Both were gigabytes in size at a time when that was considered a large data set, and it was very fast to open only what you needed.
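
A sketch of that kind of layout (the exact scheme here is invented, not the original one):

    import math
    from datetime import date
    from pathlib import Path

    # One file per day:
    def day_path(root: Path, day: date) -> Path:
        return root / f"{day.year:04d}" / f"{day.month:02d}" / f"{day.day:02d}.bin"

    # One file per 1-degree latitude/longitude cell:
    def cell_path(root: Path, lat: float, lon: float) -> Path:
        return root / f"lat{math.floor(lat):+03d}" / f"lon{math.floor(lon):+04d}.bin"

Opening "only what you need" is then just computing a path and reading one file.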

Depending on the problem, this can be a very good solution. It was trivially accessible from any programming language (including ones that didn't exist when the DB was started), and it was trivial to migrate to new operating systems (some of this started out on VMS).

I like SQLite quite a bit, but it's not always the best solution to storing tabular/relational data.


How many files are you talking about? That's the relevant variable, not the amount of data or whether it's text, tabular or relational data, images, audio, etc. E.g. you can copy a single 10 GB file to another drive faster than 1 million files totaling 1 GB, all due to filesystem overhead. If there's a filesystem where that's not true, I'm interested. :)


> How many files are you talking about?

The numbers won't seem significant by today's standards. In the one case (stored by date), maybe 150 thousand files, each a few megabytes.

> you can copy a single 10 GB file to another drive faster than 1 million files totaling 1 GB

True, but I'll bet you can create (or read) a thousand files with a thousand records each faster than you can insert (or select) a million records into (or from) a SQLite table.


150,000 is solidly in the realm where SQLite tested faster than the filesystems in the link above, although their files were only a few kilobytes. It's almost certainly different (worse) for multi-megabyte files. But what I've been trying to convey is that the number of records in the SQLite table will be identical to the number of files. If you need to parse the file contents, you'd parse the BLOB just the same. The difference is in how you interact with the disk.

Is reading X-thousand files containing a thousand records each (or one thumbnail, or one HTML dump, or one JavaScript function, or whatever) faster than SELECTing exactly the same number of BLOBs containing exactly the same data? It's worth considering and testing once the number of files starts affecting performance or even just becomes a pain to deal with. If it turns out that storing many files is still a better fit for a particular application, that's cool too. Nothing is a panacea.
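
One way to put exactly that question to the test (paths, file extension, and table name are hypothetical):

    import sqlite3, time
    from pathlib import Path

    # Read every small file under a directory tree...
    start = time.perf_counter()
    file_bytes = sum(len(p.read_bytes()) for p in Path("docs").rglob("*.bin"))
    file_time = time.perf_counter() - start

    # ...versus SELECTing the same payloads stored as BLOBs.
    conn = sqlite3.connect("store.db")
    start = time.perf_counter()
    blob_bytes = sum(len(body) for (body,) in conn.execute("SELECT body FROM blobs"))
    blob_time = time.perf_counter() - start

    print(f"files: {file_bytes} bytes in {file_time:.2f}s; "
          f"blobs: {blob_bytes} bytes in {blob_time:.2f}s")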



