
You should benchmark SQLite before recommending it like this (for future reference). I think it's a very elegant library, and I admire the development philosophy that went into it. However, every time I've tried to use it for something that needs to be fast, it has ended up being a painful mistake.



Could you show us your methodology for your benchmark? We don't know under which circumstances you experienced degraded performance.

I've seen a lot of cases where someone was just missing an index or something, and their DB ran fine once it was added, so that's why I ask.


I don't work for that company any more, and I couldn't legally have shown the code even if I did.

It's not an index thing though. Try inserting 10 million rows into a simple table. No foreign key constraints or anything. Some of our "documents" had 300 million rows. Handle the primary key however you want (let the DB do it, or generate your own). Use whatever SQLite API calls you want (precompiled statements, whatever). In a case like this, adding other indexes can only slow it down.
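
For reference, here's a minimal sketch of the kind of bulk-insert test being described, using Python's built-in sqlite3 module (the table and column names are made up for illustration):

    import sqlite3, time

    conn = sqlite3.connect("bench.db")
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, value REAL)")

    rows = ((i, i * 0.5) for i in range(10_000_000))

    start = time.perf_counter()
    with conn:  # one transaction for the whole batch; per-row commits are far slower
        conn.executemany("INSERT INTO docs (id, value) VALUES (?, ?)", rows)
    print(f"inserted 10M rows in {time.perf_counter() - start:.1f}s")
    conn.close()

Even with a single transaction and a prepared statement (executemany compiles the statement once and reuses it), this is the baseline to compare against writing the same rows to a CSV file.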

There are a few options to disable integrity checks and other safety constraints. They help speed things up a little, but it's all still painfully slow in comparison to writing a simple CSV file.
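
The "options to disable integrity and other safety constraints" are presumably PRAGMA settings along these lines (a sketch; how much they help depends on the workload, and a crash mid-write can lose or corrupt data with these):

    import sqlite3

    conn = sqlite3.connect("bench.db")
    conn.execute("PRAGMA synchronous = OFF")      # don't wait for fsync on commit
    conn.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal in RAM
    conn.execute("PRAGMA temp_store = MEMORY")
    conn.execute("PRAGMA cache_size = -200000")   # roughly a 200 MB page cache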

The same is true on reading. You can parse an entire CSV file more quickly than "SELECT *" from the table.
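
As a rough illustration of that read comparison (file and table names are hypothetical, and the CSV and the table are assumed to hold the same data):

    import csv, sqlite3, time

    start = time.perf_counter()
    with open("docs.csv", newline="") as f:
        csv_rows = sum(1 for _ in csv.reader(f))
    csv_time = time.perf_counter() - start

    conn = sqlite3.connect("bench.db")
    start = time.perf_counter()
    db_rows = sum(1 for _ in conn.execute("SELECT * FROM docs"))
    db_time = time.perf_counter() - start

    print(f"CSV: {csv_rows} rows in {csv_time:.2f}s; SQLite: {db_rows} rows in {db_time:.2f}s")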

I've tried several times to use SQLite as a document file format. It performs worse than CSV and JSON. It performs much worse than just an array of binary records. The devil is in the details, and assuming each file is not just a single row/record, I wouldn't be surprised if 10+ million files in a structured directory tree performed better too.


Well, I certainly agree about benchmarking versus other options, especially if I/O proves to be a bottleneck. The other option above was not a single CSV or JSON-structured file, but millions of small files stored on the filesystem (with its requisite indexes, permissions, access logs, etc). And the comparison is not with the file contents being splayed out into a relational structure in SQLite, but just one flat key-blob table for storage and retrieval. It's possible that a multi-gigabyte CSV file would be faster still, depending on actual access patterns and how much effort you want to devote to this bespoke CSV database.
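
Concretely, the flat key-blob layout meant here is something like this (names are illustrative):

    import sqlite3

    conn = sqlite3.connect("store.db")
    conn.execute("CREATE TABLE IF NOT EXISTS blobs (key TEXT PRIMARY KEY, body BLOB)")

    def put(key: str, body: bytes) -> None:
        with conn:
            conn.execute("INSERT OR REPLACE INTO blobs (key, body) VALUES (?, ?)", (key, body))

    def get(key: str) -> bytes | None:
        row = conn.execute("SELECT body FROM blobs WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

Each "file" becomes one row, stored and fetched by key, with no relational decomposition of its contents.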


> The other option above was [...] millions of small files stored on the filesystem

I don't have any direct comparisons of SQLite to this approach, but other projects I've worked on did have a simple directory tree organizing medium size chunks of data by date over 20+ years. We had one "DB" where everything was stored by the degree of latitude and longitude it fell into, and another where things were stored by date. Both were gigabytes in size at a time when that was considered a large data set, and it was very fast to open only what you needed.
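
A sketch of that kind of layout (the exact scheme here is invented, not the original one):

    import math
    from datetime import date
    from pathlib import Path

    # One file per day:
    def day_path(root: Path, day: date) -> Path:
        return root / f"{day.year:04d}" / f"{day.month:02d}" / f"{day.day:02d}.bin"

    # One file per 1-degree latitude/longitude cell:
    def cell_path(root: Path, lat: float, lon: float) -> Path:
        return root / f"lat{math.floor(lat):+03d}" / f"lon{math.floor(lon):+04d}.bin"

Opening "only what you need" is then just computing a path and reading one file.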

Depending on the problem, this can be a very good solution. It was trivially accessible from any programming language (including ones that didn't exist when the DB was started), and it was trivial to migrate to new operating systems (some of this started out on VMS).

I like SQLite quite a bit, but it's not always the best solution to storing tabular/relational data.


How many files are you talking about? That's the relevant variable, not the amount of data or whether it's text, tabular or relational data, images, audio, etc. E.g. you can copy a single 10 GB file to another drive faster than 1 million files totaling 1 GB, all due to filesystem overhead. If there's a filesystem where that's not true, I'm interested. :)


> How many files are you talking about?

The numbers won't seem significant by today's standards. In the one case (stored by date), maybe 150 thousand files, each a few megabytes.

> you can copy a single 10 GB file to another drive faster than 1 million files totaling 1 GB

True, but I'll bet you can create (or read) a thousand files with a thousand records each faster than you can insert (or select) a million records into (or from) a SQLite table.


150,000 is solidly in the realm where SQLite tested faster than the filesystems in the link above, although their files were only a few kilobytes. It's almost certainly different (worse) for multi-megabyte files. But what I've been trying to convey is that the number of records in the SQLite table will be identical to the number of files. If you need to parse the file contents, you'd parse the BLOB just the same. The difference is in how you interact with the disk.

Is reading X-thousand files containing a thousand records each (or one thumbnail, or one HTML dump, or one JavaScript function, or whatever) faster than SELECTing exactly the same number of BLOBs containing exactly the same data? It's worth considering and testing once the number of files starts affecting performance or even just becomes a pain to deal with. If it turns out that storing many files is still a better fit for a particular application, that's cool too. Nothing is a panacea.
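
One way to put exactly that question to the test (paths, file extension, and table name are hypothetical):

    import sqlite3, time
    from pathlib import Path

    # Read every small file under a directory tree...
    start = time.perf_counter()
    file_bytes = sum(len(p.read_bytes()) for p in Path("docs").rglob("*.bin"))
    file_time = time.perf_counter() - start

    # ...versus SELECTing the same payloads stored as BLOBs.
    conn = sqlite3.connect("store.db")
    start = time.perf_counter()
    blob_bytes = sum(len(body) for (body,) in conn.execute("SELECT body FROM blobs"))
    blob_time = time.perf_counter() - start

    print(f"files: {file_bytes} bytes in {file_time:.2f}s; "
          f"blobs: {blob_bytes} bytes in {blob_time:.2f}s")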



