If you haven't tried SQLite, please do. For years I ignored SQLite and used MySQL (it does the job), but once you see how fast SQLite is, and the advantages of having a DB contained in a single file... just go play around with SQLite instead of ignoring it for years like me. It's neat.
Counterpoint: I over-used SQLite because it was the first database I encountered and spent waaaay longer working around its shortcomings than I eventually spent porting to postgres.
Long version: I couldn't get bulk insert performance above absolutely miserable levels. I tried tricks like deleting and recreating indices but without luck. The perf tooling wasn't there to quickly figure out where the problem was (this was 10 years ago, not sure if things have improved) so I wound up building a version of SQLite with debug symbols and profiling it with a C profiler. The problem turned out to be a default setting that made spill-to-disk very aggressive and basically guaranteed that any workflow like mine would grind along with miserable slowness and no outward indication of what to do about it. I found an email thread where someone in effectively the same situation made some constructive suggestions and got turned away on the principle that even casual users ought to just know performance knobs like this one. Yikes. I am probably munging some of the details, but it made me angry enough to learn postgres and port my code over despite having a fix for my immediate problem.
Like everyone else, I'm going to offer tuning tips even though that very much wasn't your point :-)
What you are describing sounds almost exactly like PRAGMA synchronous = FULL [1] (which is the default). That pragma controls when fsync occurs. Depending on your application, you might have got away with NORMAL or even OFF. Again depending on your application, you could have set journal_mode = OFF or increased the mmap_size (both also discussed on that page). Yes, those are fairly magical hacks, but synchronous and journal_mode at least are always worth considering for a new SQLite database (and before building it with debugging symbols!).
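For reference, these pragmas are set per connection; here's a minimal sketch using Python's sqlite3 (the file name and mmap size are illustrative, not recommendations):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "tuned.db")  # any writable path works
conn = sqlite3.connect(path)

# NORMAL fsyncs far less often than the default FULL; OFF is faster still
# but risks losing the most recent transactions on power failure.
conn.execute("PRAGMA synchronous = NORMAL")
# WAL is usually the fastest journal mode that still protects the file;
# journal_mode = OFF drops crash safety entirely.
conn.execute("PRAGMA journal_mode = WAL")
# Let SQLite memory-map up to 256 MiB of the database file.
conn.execute("PRAGMA mmap_size = 268435456")
```

Which combination is safe depends entirely on how much durability your application can give up.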
Even without those tweaks, the really key thing is to use fairly large transactions. I'm surprised that alone wouldn't have got you decent performance.
One option you didn't have at the time but might help today is write-ahead mode [2] with journal_mode = WAL (but still presumably not as fast as journal_mode = OFF!). I believe the only reason it isn't enabled by default is backwards compatibility. According to that article, it was introduced on 2010-07-21 and improved to better handle large transactions (>100MB) on 2016-02-15.
I've been there. It doesn't work. I applied all those tips and tricks; they didn't improve performance much, and I lost data along the way.
Sqlite is full of locks, and any writes are single-threaded stop-the-world, and there's no way around it. It's Sqlite's philosophy. Write-heavy databases just need something like MVCC, and Sqlite won't have that.
I think you make a point which is very true and valid but not relevant to this particular comment thread.
You are talking about multiple different processes/threads heavily and concurrently writing to a database. In that case, you're absolutely right, the point has come to switch to a client/server database like PostgreSQL or MySQL.
But the parent comment was not about that (or at least they didn't explicitly mention concurrency, and their mention of "a default setting that made spill-to-disk very aggressive" rather than locks suggests that concurrency wasn't the problem). They seemed to be talking about a single writer inserting at a high rate, which is something I'd expect SQLite to cope with very well given the right tricks (mostly batching multiple inserts into transactions; I acknowledge their point that the defaults are unfortunate). Yes, even with a single process there are locks, but an uncontended lock is not normally a problem.
There are disadvantages too. It does not implement SQL; queries need to be built manually on top of various indices in the tables. The DB is much more than a single file. The API is way more complicated than SQLite's. The databases are portable from older to newer versions of Windows with automatic upgrades, but not the other way.
The JET database continually corrupts itself. It’s terrible.
The Windows search functionality uses this database and the only way Microsoft was able to make this feature somewhat reliable is by having it run very thorough checks at startup and tossing the database and reindexing at the first hint of trouble.
I don’t know why it is so terribly unreliable but I know it is.
Interesting, I have opposite experience. Once I had lots of experience with AD and Exchange. Later I shipped desktop software based on ESENT, and supported it for a few years. Was generally happy with the technology.
Are you sure these issues are caused by the DB engine? Other possibilities include your PC (like interference from crappy AV software), or Microsoft's code in these search indexing services.
This happens on large numbers of PCs, and I don't think the way Microsoft uses its own database engine should be an excuse for it corrupting its data.
I think the difference is that Windows Search is always running, including when the computer crashes, is turned off unexpectedly or doesn’t complete waking up from sleep or hibernation. I think JET just isn’t that robust against that. It’s quite conceivable people manually closed your software when they shut down their computer.
I remember Exchange failing the same way if its database disks suddenly disappear due to network issues.
Microsoft is large and software quality varies. For instance, look at Skype: they failed to use their own GUI frameworks and are using Electron, i.e. Google Chrome, to paint a few controls.
> I think JET just isn’t that robust against that.
Some of them are just the unavoidable hardware failures, for others they suggest ‘Deploy the OS on server-class hardware’. Or the always helpful ‘restore from backup’. Not quite reasonable for a database containing a search index on a consumer device.
> Not quite reasonable for a database containing a search index on a consumer device.
Before I switched to MS Outlook, I was using Windows live mail (now discontinued) as an e-mail client for a decade or so, it used ESENT for everything.
When I run Process Explorer and search for "esent.dll", it finds a dozen system services using ESENT databases, many of them critical, like CryptSvc.
I try to buy good hardware, but not server-grade components. I don't use ECC RAM or a UPS, and I suffer brief power outages a couple of times a year. If ESENT corrupted databases when the power went out suddenly, I would have noticed.
> Sqlite is full of locks, and any writes are single-threaded stop-the-world, and there's no way around it. It's Sqlite's philosophy.
SQLite is not the silver bullet of the DB world, but it's extremely useful in certain scenarios. Need a DB for a desktop app? Check. Want a portable file to carry a complex data structure? Check. Want to test your webapp during development, or provide a single-user version for users? Check.
Concurrent-writing, multi-user, multi-producer scenarios need something bigger: MySQL Embedded, MySQL, Postgres, MSSQL, Oracle... the list goes on.
SQLite makes databases accessible and useful in many more scenarios. I hated databases until I found SQLite, because I simply didn't see the point of running a big server designed to handle much bigger data sets just to store 250KB of text tables.
I think they are referring to an OLTP-type scenario where you have lots of inserts and reads, but the reads only look at a couple of rows at a time, often only a single row. MVCC is good for such cases because you can avoid the overhead of taking lots of locks but still get fine-grained concurrency control.
You are right that the Sqlite approach actually works quite well for bulk operations since you only require a single lock. However, it's usually still better for bulk inserts without updates to use fine grained locking or MVCC since you can often avoid acquiring any locks at all beyond the basic ones guarding fundamental DB data structures (these aren't locks as far as SQL is concerned since they cannot cause deadlock, it's a big pet peeve of mine when people think lock-free = no use of mutexes).
As a side note, don't do the following pattern: "BEGIN TRAN; INSERT INTO T ...; SELECT max(id) FROM T; COMMIT;". I used to do this, but this is a very bad habit that may be incorrect (assuming that id is an autoincrementing column). It's only correct on systems with true serializability, when you have opted into full serializability, and where such systems consider autoincrementing IDs to be part of serializability. When I tested this, Postgres and MSSQL handled this as expected while MySQL allowed the select to return a different row. I just tested Sqlite, and it does seem to work there regardless of WAL since it only allows concurrent readers plus a single writer. Use last_insert_rowid() or the equivalent for your database[1].
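To make the safe pattern concrete, here's a quick sketch in Python's sqlite3 (the table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")

cur = conn.execute("INSERT INTO t (name) VALUES ('alice')")
# last_insert_rowid() is scoped to this connection, so unlike
# SELECT max(id) it cannot pick up a concurrent writer's row.
rowid = conn.execute("SELECT last_insert_rowid()").fetchone()[0]
assert rowid == cur.lastrowid  # the driver exposes the same value directly
```

Most drivers expose this as a cursor attribute (`lastrowid` here), so you rarely need the SQL function at all.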
> As a side note, don't do the following pattern: "BEGIN TRAN; INSERT INTO T ...; SELECT max(id) FROM T; COMMIT;". I used to do this, but this is a very bad habit that may be incorrect (assuming that id is an autoincrementing column). It's only correct on systems with true serializability, when you have opted into full serializability, and where such systems consider autoincrementing IDs to be part of serializability.
And even when it works, it'll create a lot of unnecessary conflicts.
> Use last_insert_rowid() or the equivalent for your database[1].
See what you need to do is stand up a Redis cache cluster in front of your app, and then a cassandra cluster to cache the writes and compact them, before synchronizing the data with your single-page SQLite datastore as the master data.
Obviously for your SQL queries we crank up a large SparkSQL cluster.
One trick of `pragma journal_mode = WAL` is that even with synchronous normal, SQLite syncs when you close the last connection [1]. So re-using a connection is key to getting the full benefits if you're executing commands from a single thread/process.
Out of curiosity, just how many rows were you trying to insert, for this to be a problem? My memory is a bit fuzzy, but on SQLite even standard INSERT statements can scale to hundreds of thousands per second, if you do them in one transaction. Just curious about the scenario here.
This is quite a generalised point, so it's difficult to really respond to, but some frameworks do provide a mechanism to check whether a transaction has been closed. The stateless nature of web development also makes working with transactions easier, as you have a clear entry and exit point for each request. However, I appreciate not all uses of RDBMSs are stateless, let alone web applications. But a similar approach can sometimes be applied to stateful applications, where you can carve up the transactional operations into logical code paths that are easier to mentally visualise. As always though, it really depends on how your code is structured and the specific problems you're trying to solve.
Where you know code boundaries are an issue I've found functional designs tend to work a little better than OOP with regards to managing transactions but a lot of that could just be down to how my brain is wired (while I'm not a FP evangelist I do tend to favour breaking code down to stateless functions rather than stateful classes).
It's fair to say spaghetti code will be a problem on most reasonably mature code bases but there are approaches and frameworks that help somewhat with managing transactions across code boundaries -- just as there are tools that make working with transactions harder. But in my experience there are much harder problems to solve than working with transactions.
> but it's not like they are a zero-overhead feature.
Is there such a thing as a zero-overhead feature? (I say this semi-flippantly.)
I mean, I have to deal with this crap at $dayjob, but I genuinely can't believe the terrible code I see that borrows a resource (connection, transaction, file handle) and then gives it back only on the happy path. I desperately wish languages would make it a compile error when not all code paths lead to the resource being freed.
The only thing that should ever stop you from returning a resource is a malicious scheduler.
I can see the obvious parallels between memory-management lifetimes and transaction management. Does Rust have explicit features for extending lifetimes to resources besides memory?
My point is that using transactions should be drilled into people who work on databases and where consistency is a requirement because transactions turn multiple complex SQL requests into one atomic operation:
I've never worked at a place that doesn't use them. What if the 3rd insert in a series fails, and the 4th update consequently writes down the wrong thing?
That's just my experience so I guess it may be meaningless but I'm surprised to hear it may not be often used.
I've seen all kinds of wonders during my life so this one is no surprise. If however someone is doing stupid things it is their problem. They're free to complain to themselves.
Tangentially - it may be their problem, but I hesitate to say it's their fault. I'm continually dismayed at how spotty and superficial education about how to use an RDBMS can be. Even in formal education on the subject.
It does not take a PhD or rocket science to figure out that some operations must be bunched together and executed as a single unit that succeeds or fails as a whole. That would come in as a business requirement. For a curious person, it would not take much searching to discover and read about transactions.
You hardly ever need transactions if you track validity explicitly in your schema.
Suppose T1 has a 1-many relationship with T2. Declare in your assumptions that any rows in T2 with no corresponding valid row in T1 are not valid.
Additionally, have an is_valid field on T1 so selecting all valid data from T2 is done with "select * from T2 inner join T1 on T2.t1id = T1.id where T1.is_valid".
To insert data, insert a row into T1 first but initially have is_valid be false. Then insert all necessary data into T2. Finally, do an update and change the original row in T1 to have is_valid be true.
For deletions to T1, just do an update and set is_valid to false. Thanks to the validity logic, this has the effect of also invalidating all T2 rows.
Updates are trickier, but you can allow them to work without transactions by having two ids for every table. The first id is the one we worked with before, used for joins. The second id is used by applications to look up explicit records. Therefore, never do any updates aside from the one setting is_valid to true (which is really storage logic and not application logic). Instead, just insert a new row into T1 whenever you want to update something in T1. The final update now just needs to flip the is_valid bit for the old row and the new row, and will also need to verify that the old row is valid, as well as any other rows the current update relies on (basically, turn it into a complex CAS).
All of this is pretty messy, but it does let you have CRUD without any transaction support from your DB. Also, even if your DB has transactions, this scheme has the advantage of being lock-free so your application cannot deadlock.
If you have many updates/deletes, you can do garbage collection either by allowing the GC to use a transaction or by adding a check on insertions to T1 that verifies the number of associated rows in T2 before setting is_valid.
Unfortunately, while updates and inserts with GC can still be lock-free, they are not wait-free since an insert or update can fail. If you never do updates or GC though, this is actually wait-free and guarantees that every create, read, and delete operation will succeed in the absence of hardware/network failures.
Still, this overhead probably isn't worth it unless you already need to track the history explicitly for auditing or something. At the company where we used this, we didn't have an is_valid column; we had valid_from and valid_to, which were timestamps.
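A minimal sketch of the insert path described above, with hypothetical tables T1/T2, in Python's sqlite3 (schema and payloads are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T1 (id INTEGER PRIMARY KEY, is_valid INTEGER NOT NULL DEFAULT 0);
CREATE TABLE T2 (id INTEGER PRIMARY KEY, t1id INTEGER NOT NULL, payload TEXT);
""")

# 1. Insert the parent row as not-yet-valid.
t1id = conn.execute("INSERT INTO T1 (is_valid) VALUES (0)").lastrowid
# 2. Insert all child rows; readers ignore them because T1.is_valid = 0.
conn.execute("INSERT INTO T2 (t1id, payload) VALUES (?, 'a')", (t1id,))
conn.execute("INSERT INTO T2 (t1id, payload) VALUES (?, 'b')", (t1id,))
# 3. Flip the validity bit: a single-row UPDATE is atomic on its own,
#    so both children become visible at once, with no transaction needed.
conn.execute("UPDATE T1 SET is_valid = 1 WHERE id = ?", (t1id,))

rows = conn.execute(
    "SELECT T2.payload FROM T2 JOIN T1 ON T2.t1id = T1.id WHERE T1.is_valid = 1"
).fetchall()
```

The whole trick is that step 3 is the only write anyone else can observe, and it is a single atomic statement.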
Same as @mikeyjk! Can you point toward more information on this topic? The applications I'm dealing with in my day job are constantly plagued by deadlocks in the database surrounding bulk transactions. I've theorized that a "T1 1-many T2" refactoring of the tables, like what you've described, might be a solution, but haven't settled on how to implement it. Would absolutely love to learn more on this esoteric, hard-to-google topic.
All of this sounds like a lot of additional work just to provide some of the functionality transactions provide out of the box. And I say "some" because you've not addressed rollbacks.
That said, it's still an interesting thought experiment so glad you shared.
Wow, this has broken my brain but has really piqued my interest - thanks. Do you have any reading on this topic / strategy to help a mouth breather like me out?
Another trick, if you have multiple processes (users) accessing the DB and you don't want to lock the DB for a long time, is to insert into a temp table (possibly within a transaction, although the time savings are not as dramatic with temp tables), then copy the temp table into the main one (INSERT INTO ... SELECT ... FROM ...). The advantage: say you are reading in a bunch of items from somewhere else; that takes a chunk of time beyond just the DB time. By going through a temp table you aren't locking the target table for anyone else while gathering the data. Combine this with flushing the temp table every X rows or X seconds, and you get a stream of efficient updates to the table without long lock times.
Also have WAL mode on to get multi-user access going.
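The staging-then-flush pattern looks roughly like this in Python's sqlite3 (table names, batch size, and row count are all made up):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "staging_demo.db")  # throwaway db file
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("CREATE TABLE items (x INTEGER)")
# Temp tables live in a separate per-connection schema, so filling one
# doesn't contend with other writers on the main database.
conn.execute("CREATE TEMP TABLE staging (x INTEGER)")

def flush():
    # The write lock on the main database is held only for this short copy.
    with conn:  # BEGIN ... COMMIT
        conn.execute("INSERT INTO items SELECT x FROM staging")
        conn.execute("DELETE FROM staging")

for i in range(10_000):
    conn.execute("INSERT INTO staging (x) VALUES (?)", (i,))
    if i % 1_000 == 999:  # flush every 1000 rows (or on a timer)
        flush()
flush()  # flush any remainder
```

In a real multi-process setup the slow producer (parsing, network, compression) runs between flushes, which is exactly when you want the target table unlocked.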
If you have wal-mode enabled then the automatic locks are table level. So one process can be updating one table and another one can work on a different one. Also, you only need to enable wal mode on the DB once (pragma journal_mode=wal), it "sticks" for each connection.
In my application that uses SQLite (Snebu backup), as data comes in (as a TAR format stream) I have one process extracting the data and metadata, then serializing the metadata to another process that owns the DB connection. This process dumps the metadata to a temp table, then every 10 seconds "flushes" the metadata to the various tables that it needs to go to. This way I can easily have multiple backups going simultaneously, as each process spends a small amount of time (relatively) flushing the data to the permanent tables, and a greater part of the time compressing and writing backup data to the disk vault directory.
I've been working with this for the past 8 years or so, and have picked up a few tricks on keeping as much as possible batched up in transactions, but also keeping the transaction times short relative to other operations. So far seems to work out fairly well.
Note that in addition to journal_mode=wal, you need a busy handler defined that retries transactions indefinitely, with a 250 ms delay between retries.
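A fully custom retry-forever handler needs the C API's sqlite3_busy_handler, but from Python the built-in timeout-based busy handler is the usual substitute; a sketch (the path and timeout are arbitrary):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shared.db")
# The `timeout` argument installs SQLite's built-in busy handler, which
# retries for up to this many seconds before raising "database is locked".
conn = sqlite3.connect(path, timeout=30.0)
conn.execute("PRAGMA journal_mode = WAL")
# The same knob, in milliseconds, via a pragma:
conn.execute("PRAGMA busy_timeout = 30000")
```

It backs off rather than polling at a fixed 250 ms, but for most workloads that distinction doesn't matter.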
Edit: On further review of the docs, I'm not sure if wal mode enables table-level locking, it may be that when writing to a temp table, that temp tables are part of a separate schema (or are otherwise separate from the main DB) -- which makes sense, as temp tables are only visible to the process that owns them. So a temp table can be locked in a transaction, while the rest of the DB is writable.
It seems like the locking is page-level in "wal" and "wal2" modes:
> Usually, SQLite allows at most one writer to proceed concurrently. The BEGIN CONCURRENT enhancement allows multiple writers to process write transactions simultaneously if the database is in "wal" or "wal2" mode, although the system still serializes COMMIT commands.
> When a write-transaction is opened with "BEGIN CONCURRENT", actually locking the database is deferred until a COMMIT is executed. This means that any number of transactions started with BEGIN CONCURRENT may proceed concurrently. The system uses optimistic page-level-locking to prevent conflicting concurrent transactions from being committed.
> When a BEGIN CONCURRENT transaction is committed, the system checks whether or not any of the database pages that the transaction has read have been modified since the BEGIN CONCURRENT was opened. In other words - it asks if the transaction being committed operates on a different set of data than all other concurrently executing transactions. If the answer is "yes, this transaction did not read or modify any data modified by any concurrent transaction", then the transaction is committed as normal. Otherwise, if the transaction does conflict, it cannot be committed and an SQLITE_BUSY_SNAPSHOT error is returned. At this point, all the client can do is ROLLBACK the transaction.
The page also mentions:
> The key to maximizing concurrency using BEGIN CONCURRENT is to ensure that there are a large number of non-conflicting transactions. In SQLite, each table and each index is stored as a separate b-tree, each of which is distributed over a discrete set of database pages. This means that:
> Two transactions that write to different sets of tables never conflict
+1. I was not aware of the 'stickiness' of the wal pragma. Thank you for the tip. I will be defaulting to this going forward. For a quick confirmation, I used:
The code is on github.com/derekp7/snebu if you want to take a look. The next task on my list is to write up developer / contributor documentation on how the whole thing is put together along with various tricks (and simplified demonstration code).
But specifically, look in "snebu-main.c": that is where the opendb function is (so you can see the pragma statements), and there is a busy_retry function that gets referenced (all it does is sleep for .1 seconds). I believe you don't need the busy-retry function if you use the built-in busy handler, but I'm not really sure and don't want to take a chance and break working code.
For the temp tables, look in snebu-submitfiles.c -- the function at the top handles the DB operations, one towards the bottom handles the non-DB tar file consumption operations, and there is a circular queue in the middle to handle buffering so the data ingestion can keep going while the data is getting flushed (these three run as separate processes). I should learn threads, as there may be more flexibility in that, but not comfortable enough with thread programming yet.
I think parent means that for some careful selection of N (where N > 1) insertions per bulk transaction you can scale up to hundreds of thousands of insertions per second, rather than putting hundreds of thousands of insertions in a single transaction.
No, the opposite. In SQLite, starting and ending transactions that write to the db is a relatively expensive operation, and running queries outside a transaction is effectively the same as running each of them in an independent transaction.
If you need to do a lot of inserts (or updates, etc), the slowest possible way to do them is to do them outside of a transaction. The fastest way to do them is to wrap them all into a single transaction.
> If you need to do a lot of inserts (or updates, etc), the slowest possible way to do them is to do them outside of a transaction. The fastest way to do them is to wrap them all into a single transaction.
This doesn't seem to contradict the comment you're replying to. They're suggesting wrapping operations into transactions in batches e.g. (just making some numbers up) if you have 100,000 inserts maybe you'd do 100 transactions of 1000 inserts each. I wouldn't call that "the opposite" of your one mega-transaction suggestion. In fact I'd expect it to still have most or all of the speed benefit of using one single transaction, or potentially even be slightly faster.
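A sketch of that batching pattern in Python's sqlite3 (row count and batch size are made up; tune them for the workload):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

rows = [(i,) for i in range(100_000)]
BATCH = 1_000

# One transaction per batch: nearly all of the speed of one huge
# transaction, without holding the write lock for the whole load.
for start in range(0, len(rows), BATCH):
    with conn:  # BEGIN ... COMMIT around each batch
        conn.executemany(
            "INSERT INTO t (x) VALUES (?)", rows[start:start + BATCH]
        )
```

The worst case is the implicit one-transaction-per-statement you get with no explicit transaction at all; anything batched is orders of magnitude faster.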
Oh fascinating, you actually put hundreds of thousands of statements in a single SQLite transaction in an online CRUD app (as opposed to offline processing)? I've never done more than a couple hundred and even then usually they're "logically batched," both because I'm worried about forcing unnecessary read to write transaction promotions for concurrent reads and thereby increasing busy errors, but also because that affects durability to have a transaction open that long (it's not great to let your HTTP response hang for a second before responding as you keep your transaction open).
For serialized writers in any system I'm sure keeping a transaction open as long as possible is the ideal case for throughput, but there's other problems with that in a CRUD app no?
Well, that was my original question to the parent poster, too. You can achieve 100k+/sec, but they went looking for better performance, apparently.
But to answer your question at face value, one common use for SQLite is as a file format for complex desktop apps, or perhaps a game save format for certain types of games (mostly the non-realtime types). One great advantage to this is that if you use a DB migration library, you ensure backwards compatibility with previous versions of your saved files. However, it's easy to imagine a fair bit of data getting inserted into such a new file each time it is created. It might only be 5000 rows, but I'd prefer to not have to wait 5 seconds just to persist it, if a single change can make it 0.05 seconds.
My main app syncs data across ERPs, and its main use case is batch loading of data. This means I need to nearly mirror a SQL Server/Oracle/COBOL/Firebase/etc. database into SQLite, clean it, then upload it to PostgreSQL.
I have more trouble fast-loading into PG than into SQLite (not saying I didn't have trouble with SQLite in the past!), and SQLite is very, very fast for me.
I was building a small Rails project that is mostly reads, but here and there it has inserts that can involve 1000+ entries with related models. Things seemed fine when clicking around, but when I used JMeter to test what the server could handle, I found it terrible with simultaneous requests.
Adding two workers made things even worse, and this is with ~10% writes and the remaining going to reads.
I quickly changed the db to postgres for comparison purposes, and from 3-4 requests per second with sqlite, it jumped to 60-70 as it did scale linearly with the number of workers.
I am by no means an expert in database optimisations, but out of the box this behaviour somewhat limited the usability of SQLite for a webapp.
Postgres is pretty bad at plain bulk INSERTs. You need to use COPY (feeding it an in-memory CSV stream) or mess around with unnest() to get decent performance.
`COPY` yes, that's what bulk insert is. How does having the feature make Postgres bad at it?
"In-memory CSV": no. You don't need to create an in-memory CSV file to load in bulk. Your language binding should provide a way to stream data into a `COPY` operation.
I can't imagine how unnest is related to bulk loading. It's a query feature to unnest arrays.
The relationship between bulk loading and unnest is as follows.
A single insert of an array using an unnest can insert many rows. The performance is worse than copy to, but in the same ballpark. But there are use cases where you'd like to bulk load through a stored procedure for a variety of reasons, and now calling the procedure with an array and using unnest internally is a straight win.
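In case it helps, the unnest() shape looks roughly like this (the items table and array parameters are hypothetical; the client binds one array per column):

```sql
-- A single statement inserts the whole batch: each bound array becomes
-- a column, and unnest() expands the arrays into a row set.
INSERT INTO items (id, name)
SELECT * FROM unnest($1::int[], $2::text[]);
```

Inside a stored procedure the arrays arrive as parameters and the same SELECT FROM unnest(...) does the work.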
Thanks for sharing! Did you have performance benefits using PostgreSQL out of the box, or did you have to tune specific settings? How much effort did you have to spend to do this migration?
If you like SQLite, then DuckDB[1] is probably worth looking at. Very similar in many ways, but DuckDB is column-oriented rather than row-oriented, so it does have some performance advantages for analytics. It is quite new, so I might not go all in for mission-critical production yet, but it is worth exploring for analytics work.
Question for people using DuckDB: are the use cases similar to what you'd use Apache Arrow for, but with the benefit of working in SQL, or are they meaningfully different?
I'm not currently using any of those, mind you, still on a pandas/dask* dataframe basis, but I'm trying to wrap my head around where the ecosystem is moving
*I know Dask is already using Arrow behind the scenes
I don't use Apache Arrow, so I'm in a poor position to compare it with DuckDB.
My use case for DuckDB is effectively querying R dataframes with SQL. DuckDB has the functionality to register virtual tables, with data from existing dataframes.
As I know SQL reasonably well, using DuckDB to query dataframes means I don't need to learn a bunch of new dplyr verbs or data.table constructs. There are some other R packages which also support this use case - sqldf and tidyquery are two I am aware of. Both of these follow a different approach, where they parse the SQL query. Using a DuckDB virtual table lets the database handle all of the SQL.
I've found so far that through using DuckDB, performance is much better than sqldf and tidyquery, nowhere near as quick as data.table and can be quicker than dplyr, depending on query complexity. I haven't really looked at anything approaching big data sizes though.
> Flexible typing is considered a feature of SQLite, not a bug. Nevertheless, we recognize that this feature does sometimes cause confusion and pain for developers who are accustomed to working with other databases that are more judgmental with regard to data types. In retrospect, perhaps it would have been better if SQLite had merely implemented an ANY datatype so that developers could explicitly state when they wanted to use flexible typing, rather than making flexible typing the default. But that is not something that can be changed now without breaking the millions of applications and trillions of database files that already use SQLite's flexible typing feature.
It is possible to create CHECK constraints that do the type checking. Actually, if they shipped some predefined ones and said "if you want strong type checking, do this" (say, syntax to automatically populate the table columns with type-check constraints), it would IMO be perfectly backward compatible.
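To show what that looks like today, here's a sketch of hand-written type-checking constraints via typeof(), driven from Python's sqlite3 (the users table is made up; note that SQLite 3.37+ also offers STRICT tables for this):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Emulate strong typing with CHECK constraints on typeof().
conn.execute("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT    CHECK (typeof(name) = 'text'),
        age  INTEGER CHECK (typeof(age) = 'integer')
    )
""")

conn.execute("INSERT INTO users (name, age) VALUES ('alice', 30)")  # passes

rejected = False
try:
    # 'old' can't be coerced to an integer, so typeof(age) stays 'text'
    # and the CHECK constraint rejects the row.
    conn.execute("INSERT INTO users (name, age) VALUES ('bob', 'old')")
except sqlite3.IntegrityError:
    rejected = True
```

One caveat: column affinity still silently coerces values it can convert (inserting the string '30' into age passes), so this is strictness about storage class, not about what the client sent.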
When I started a project with SQLite, the available data types struck me as odd (coming from MySQL), but as I learned I started asking the question: are there any more fundamental data types other than null, int, real, text and blob? For example dates are just a facade for an integer of some kind, JSON is really just text adhering to certain formatting rules, booleans are usually stored as some kind of byte anyway so why not drop that abstraction?
With this limited set of datatypes it really makes you think harder about the data you are processing, because in the end all your data is one of these types anyway.
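Those five storage classes are easy to see directly with typeof():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Every SQLite value is one of exactly five storage classes.
row = conn.execute(
    "SELECT typeof(NULL), typeof(1), typeof(1.5), typeof('x'), typeof(x'00')"
).fetchone()
print(row)  # ('null', 'integer', 'real', 'text', 'blob')
```

Everything else (dates, booleans, JSON) is a convention layered on top of these.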
I don't really see it as such a problem. When interacting with the database, you either use an ORM with your strongly-typed business objects, or you write the SQL and use parameters (again strongly typed).
> When interacting with the database you either use an ORM with your strongly-typed business objects
If you have strongly-typed business objects, then why not have strongly typed storage? If your code can enforce the constraints that a correct data type ensures, then why have "strongly typed business objects" to begin with? Why not store everything in your code as strings as well?
I see this misconception all the time. The database (and its data) lives way longer than most applications. And it's also wrong to assume that there is always only one application accessing the database. Bulk loads are a typical case of secondary applications.
Not choosing the proper data type in a relational database is a really bad decision, and we see questions on Stack Overflow and similar sites on a weekly (if not daily) basis asking how to fix invalid data in those "un-typed" columns.
I'm not saying it's useless to have type constraints in a database. Just that with a bit of care it's not such a big deal.
In the end there will always be some business rules that are not constrained by the database, so you always have to be a bit careful about what you store in it. Indeed, not being careful and hoping that your types, constraints and triggers are going to save you is riskier.
> Not choosing the proper data type in a relational database is a really bad decision
Well, then rejoice: you can't make this bad decision in SQLite, because everything is more or less a number or a string.
SQLite just does what JavaScript and other weakly typed languages do. The developer is responsible for what they input. If type checks are a must, SQLite is just not for you, easy as that.
I don't see how having "NaN" or "undefined" in a number field makes any difference to just putting the value there. If you are using a strongly typed language, this will not be a problem.
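The behavior being debated is easy to see for yourself (a quick illustration via Python's sqlite3; the table name is invented): a column declared INTEGER has only an integer *affinity*, not an integer constraint, so text that can't be coerced is simply stored as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")  # INTEGER is an affinity, not a constraint
conn.execute("INSERT INTO t VALUES (?)", ("not a number",))  # happily accepted
row = conn.execute("SELECT n, typeof(n) FROM t").fetchone()
print(row)  # ('not a number', 'text') despite the INTEGER declaration
```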
SQLite made a design choice in favor of simplicity. It's also missing basic date functions altogether. The only way to compare dates is by using the Unix epoch.
SQLite's ISO-8601 date strings were carefully designed so that collation order is identical to temporal order. Which is convenient!
If you need interval logic, though, SQLite won't help you, and epoch is the better choice. It's possible to solve some queries with a regex, but you won't love it.
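The collation-equals-temporal-order property is easy to check (a small sketch in Python's sqlite3; table and values are made up): a plain lexicographic ORDER BY on ISO-8601 text is also chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (happened TEXT)")  # ISO-8601 date strings
conn.executemany("INSERT INTO events VALUES (?)",
                 [("2021-12-31",), ("2021-02-01",), ("2021-11-05",)])
# No date type involved: ordinary string collation sorts these chronologically
rows = [r[0] for r in conn.execute("SELECT happened FROM events ORDER BY happened")]
print(rows)  # ['2021-02-01', '2021-11-05', '2021-12-31']
```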
Chill out. This is a design choice to make it more viable to be used in dynamically typed languages. It's maybe not the design choice everyone would make, but it's not, like, insane: this is how all variables in Python/JavaScript/Ruby/whatever work.
I mean, you can represent everything under the sun as a string (or a very big int) but that's a level of operating where I find my brain starts to get fatigued for absolutely no real gain at all.
Yes, of course. That's why I said I'm sure we could get it to work. But then it has to be converted to kg for actual usage, either directly in the queries or in views.
Having a fixed-point decimal type allows us to not think about it once the table is created.
Same issue with money. Most of the time we need to store monetary values with two decimal places (cents).
What kind of application are you developing where there is so little middleware between the database storage layer and the user that converting to/from kilos adds significant additional complexity, but your users aren't technical enough to just deal with having weights in grams?
These are electronic versions of official forms. The user interface and reports should show what is actually being sent or has been sent.
Sure, we could deal with it, but there are quite a number of tables due to the different forms and messages, and then there are all the reports. Many custom ones, thanks to local officials wanting data from a certain customer in a certain way...
Bureaucratic forms have a tendency to change unpredictably. Today they want weight in kilograms, tomorrow they'll want it in grams, and the day after that in standardized intergalactic weight units. I get what you're saying (and I don't like writing unnecessary mapping boilerplate too), but there should be a middle layer to protect the application from craziness like that.
Thankfully, things in our sector are fairly static in that regard, so this is seldom an issue. That sounds quite horrible indeed.
And yes, a fatter middle layer would be nice. Our next generation software will probably have more of that, this code base is over 20 years old at this point...
I realize this response turned out a lot more harsh than I intended. I was in a slight rush so didn't formulate myself well, didn't mean to be abrasive.
Yeah, for being so uptight about just about everything, the officials seem quite relaxed about values slipping between the cracks due to lack of decimal places.
Like, total invoice value can only be specified with two decimal digits, typically in foreign currency. Yet we also have to specify per-line value in local currency, also with only two digits. And then the per-line values are used to calculate taxes and whatnot...
With this sort of perspective, why use a database at all? You can store literally any digital object as some sequence of bytes, and that's what file systems are for.
At some point, you'd actually like the software you use to have meaningful features.
But dates are integers (all databases store dates as an integer that is an offset from some fixed date), and JSONs are strings. SQLite has the functions to work with these datatypes (dates, JSON, etc...) but fundamentally they are not different from integers or strings.
And integers are bytes and strings are bytes and tables are bytes and indexes are bytes. If you're not happy with databases being intelligent, why stop with integers and strings, when you can do everything you want with bytes?! Memcached, or LevelDB, or even a file system is the ultimate database if you follow your arguments to their logical conclusion.
But data representation isn't data. You can't look at a serialized sequence of bytes and know what it represents without context - without a serialization scheme. For relational databases, the table definition is the serialization scheme - it is the context. For example, you can't know what date an integer is representing without contextual information like the offset. By storing dates as dates in your database, that context is baked in.
It is helpful to have that full context inside your database because it allows you to operate that database more efficiently (by carefully indexing on the properties of the data, not the data representation), but it also allows you to use that database in ways that are not tightly coupled to your application, such as analytics, because you are storing data itself and not just the bare minimum required to represent that data.
Having rich data types allows a lot of things without having to write your own layer. Spatial indexes are extremely efficient, and I can't query geometries efficiently when I'm storing geometries as a blob. The same goes for dates, json, XML, ranges, etc.
The problem with reserving specialized logic for the application layer is that it limits you to simplistic indexing schemes and you end up doing excessive IO and filtering in memory to get what you actually want.
The idea that databases shouldn't have specialized datatypes is really only an idea that works in simplistic crud apps. The world is much bigger than that.
I tried to switch from MySQL to SQLite for my Postfix/Dovecot installation but I ran into a very annoying problem: every now and then I lose an email because the "database is locked".
This is a show-stopper for me because the problem happens very rarely, only once every couple of days, but it's catastrophic: when this happens, the incoming message is not bounced, it is actually lost. The only reason I even realized it was happening is because I noticed there were emails in the root account, which is the error-reporting mechanism of last resort.
I've searched the web in vain for a solution. If you have any suggestions, I would love to be able to stick with sqlite, but at the moment I am about to begin migrating back to MySQL. :-(
This sounds like a problem with Postfix or Dovecot. Postfix shouldn't ack or Dovecot shouldn't delete the email until it is safely stored.
That being said there are many use cases where having high availability is critical and in the face of multiple writers SQLite isn't the best option for that.
I don't think this is a multiple-writer problem. Postfix is only reading. I am running a milter that is writing, but I control the code for that so I have it set up to retry if it fails. So the error is being generated by postfix itself, and so it must be happening on a read (because that is all postfix does).
I was hoping to find some kind of global switch that would make sqlite always wait for locks rather than throwing an error. But I've scoured the web for such a solution without success :-(
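For readers hitting the same symptom: SQLite does ship a knob close to what's described here, PRAGMA busy_timeout (sqlite3_busy_timeout() in the C API), which makes a connection wait and retry on a locked database instead of immediately returning "database is locked". Whether it would have rescued this particular Postfix setup depends on which component opens the database, so treat this as a hedged sketch (the database name is hypothetical):

```python
import sqlite3

conn = sqlite3.connect("mailboxes.db")  # hypothetical database name
# Wait up to 5 seconds for a competing writer's lock to clear
# instead of raising "database is locked" immediately.
conn.execute("PRAGMA busy_timeout = 5000")
```

Python's sqlite3 also exposes this directly as `sqlite3.connect(..., timeout=5.0)`.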
I don't understand enough about your specific problem to know if this will actually help you, just sharing a tidbit I encountered working on a comparable issue.
It's terrific right until you need multiple processes writing to the same database. It's no accident that SQLite is fast for many small queries; it's not doing a lot of the work required to, say, be a good database backend for multiple web frontends.
Good observations from a MySQL perspective. Any thoughts from the other end, where the alternatives are JSON or XML or ZIP? SQLite tries hard to convince you to use it as an application file format, but it looks like a giant black box of overkill: why incorporate its 200k SLOC when the alternatives are a fraction of the size?
Is 200kLOC really a lot? Lots of software nowadays has hundreds of megabytes of dependencies, and people seem to be fine with that. Not that I think having tons of dependencies is really a good thing, but for an application that needs a file format, SQLite is a very sensible dependency.
Neither XML, JSON, nor ZIP solves the problems SQLite does, though; if you use plain old files, you need to make sure any changes you make actually end up on the disk, consistently. This is not easy to do. SQLite also solves any consistency issues that might stem from someone reading the data while you're writing it.
On top of being just better at persistence, having a relational model for your data gives you much more freedom to use said data; you'll be able to do things efficiently that might otherwise require restructuring your JSON or XML format. Personally, I love SQLite-based application formats because I can explore them with SQL, which is often much easier than trying to make sense of a custom JSON or XML schema.
Yes 200k SLOC is huge (modern development practices notwithstanding). SQLite creates temporary files at whim - nine different kinds! https://sqlite.org/tempfiles.html
I know how to atomically write a JSON file. But when I read, for example:
"The temporary files associated with transaction control, namely the rollback journal, super-journal, write-ahead log (WAL) files, and shared-memory files, are always written to disk. But the other kinds of temporary files might be stored in memory only and never written to disk. Whether or not temporary files other than the rollback, super, and statement journals are written to disk or stored only in memory depends on the SQLITE_TEMP_STORE compile-time parameter, the temp_store pragma, and on the size of the temporary file..."
My eyes have completely glazed over. If I add this to my app, what will it actually do? How can I even know?
Are you sure? I’ve had a lot of trouble getting that to work reliably myself across multiple OSes. (In hindsight I wish I’d used SQLite!) This article gives a good explanation of the many difficulties:
> My eyes have completely glazed over. If I add this to my app, what will it actually do? How can I even know?
Well, fundamentally it’s very hard to get it exactly right, and I imagine that’s why the implementation is a little involved.
But you could a) read through those docs, lengthy though they are, and/or b) trust the many testimonials saying SQLite is very, very robust and reliable.
Unless you're on nfs. Remote file locking is hard, and I don't think that any nfs implementation has gotten to the point where you can trust SQLite on it.
SQLite does updates in place, which I would trust far less than a rename call.
I don’t think you can atomically append more than one byte to files in unixes (the write call can return after having written some but not all requested bytes)
(Haven’t googled, but if that’s possible, I don’t see why write would have that limitation)
Yeah, and eventually we reach the best-effort bedrock. Maybe the file is on a NFS mount, you call write(), it goes over the wire, who knows what happens!
> If I add this to my app, what will it actually do? How can I even know?
Be really, really, *really*, unambiguously sure about whether your data was written or not, AND have high confidence that I/O errors (e.g. power loss) in the middle of a batch of deletes won't scramble (or truncate) existing data.
What you're looking at is the complexity required to solve the wonderful tornado of "but is my data really written???". But you don't have to deal with SQLite's implementation details for it to do its thing, which is what makes it so awesome (given its public domain status, what's more!).
File formats based on ZIP files only work for small files. For big files you have huge overheads; opening and saving a moderately sized document takes seconds (vs. milliseconds for writing changes to an SQLite database). There's a reason why Excel files are limited to a million rows, while Access databases aren't.
The complexity of including SQLite is trivial for practical purposes; it's already available on many systems, and if not you can include it by adding a single C file to your project.
Setting up a workflow for Google Protocol Buffers (another popular alternative for document file formats) is a lot more complex than building or linking with SQLite, and it doesn't stop people from using them.
One thing that speaks for SQLite is the quality of the project; it's one of the best maintained Open Source projects with fantastic quality assurance and support for almost every OS. This means that you are unlikely to run into issues compiling or working with SQLite, like you might have with alternative libraries like libxml2 or jsonc (which are still great libraries!!).
EDIT:
The big downside of SQLite is that it's unsuitable for documents that are exposed to the user because of the temporary files (like the WAL). If you have a ZIP based file format that you atomically rewrite from scratch on every save, it's almost impossible to corrupt. Your users can just take the file and email it and nothing bad will happen. I'm not sure what happens if you email an SQLite database file that is currently being used. I've done that in the past and have been surprised that some data seemed to be missing, but I don't recall the details.
Hence SQLite is often used for application data files that are not directly exposed to the user.
You could get SQLite to work as document files exposed to the user so long as you use sessions[1]. When a file is opened, copy the DB to a temporary file or to memory and write all changes during operation to this new DB, recording them all in a session. When the user explicitly saves a document, apply the session to the real DB.
Doesn't that just combine the drawbacks of both approaches? You end up with slow open/save operations, and the writes to the DB still aren't necessarily atomic (e.g. if a user copies the file while changes are being applied).
JSON/XML quickly stop being alternatives as soon as you need any sort of index, a memory-mapped/on-disk data structure that doesn't have to be loaded into memory, transactional or even just incremental writes. ZIP is not even directly comparable.
It's not limited to phone users. I've been using computers since the 1980s (C64), and I appreciate not needing to habitually keep pressing "save" every few seconds in Google Docs or macOS Notes.
#1 best JetBrains IDEA feature IMO: save on focus lost. Just alt-tab into your app, or into your terminal to git commit, no worrying about "did I remember to ctrl-s".
> That’s not normally what you need for an application format though is it.
Well, it depends what you mean by “need”. But continuous, incremental updates generally provide a much better user experience, either instead of or in addition to active “save” actions.
So, yeah, I think it's exactly something that is commonly desirable in a file format for maintaining application state, even if there is a different interchange format that the application produces/consumes as a static input or output.
During the alpha, Minecraft divided the world into 16x16x128 grids of blocks called chunks. Each chunk was its own file. Large worlds suffered from very poor performance because there were tens of thousands of files in a single folder.
Some random modder basically just put multiple chunks into one file so that each file is 2MB. If Notch had just put the game world into an SQLite database, he wouldn't have had to reinvent the wheel. There are games that did that, such as the alpha of Cube World, and they work just fine.
Heck, notch went one step further and invented NBT aka named binary tag which is basically a weirdo binary file format that stores JSON like data.
> Large worlds suffered from very poor performance because there were tens of thousands of files in a single folder.
It was using subdirs for the chunks, two levels iirc, one was chunkX % 36, the next level chunkY % 36. So there weren't that many files per directory. The slowness came from the overhead of opening, read/write and closing so many files all the time.
> Some random modder basically just put multiple chunks into one file so that each file is 2MB.
Almost, it wasn't limited by file size, it was putting 32*32 chunks into one file that was similar to a simple file system. The format of the individual chunks within that file stayed almost the same. Yet it performed much better.
NBT is indeed a little weird but fairly straightforward overall, I guess designing and implementing it just scratched an itch. It was a hobby project after all.
I'm actually currently working on a user mode FS using Dokan for Windows that saves everything to a SQLite file for similar reasons. NTFS just doesn't do well at all with lots of small files.
> My understanding is that SQLite doesn't impose any format either?
It's true that you can stuff any kind of string/blob data into any column of any table, so, yes, you still have to determine the data schema with SQLite much as you do with JSON, XML, or even CSV. I mean, I could have a CSV where each element is a base64-encoded ZIP containing SQLite database files that are each a single table with a single column of JSON files, each of which contains a JSON array of strings with XML documents in them.
But that's usually not what people would mean if they said their app was using CSV as its data storage format, nor is the version with the CSV stripped off the top what people would mean if they say they are using SQLite.
With ZIP, you have to decide the format(s) for the file(s) in the ZIP, their hierarchical structure, and, if the files aren't themselves the atomic data elements, the schema applicable to each file.
Furthermore, in discussion of performance characteristics and other aspects of suitability, ZIP adds overhead, but you still also need to consider the access properties of the contained files.
> Of course SQLite imposes a file format... That format is a SQLite database
Well this is in a context that rejects zip as being a format. Do you do that? If the answer is no then skip the rest of my post and just note that they're talking about a different definition of 'format'.
-
But in that context:
The amount of structure imposed on you by the sqlite database format is not much more than the structure imposed on you by a zip. I think it's fair to rate them similarly as formats. A zip file is basically a key-value store.
"Zip full of csvs", while awful to use, would impose about the same amount of structure as sqlite does: not much. And zip+csv is not much more elaborate than zip on its own.
A configuration like that comes from the program using sqlite. Just adding sqlite into a system doesn't set up any data formatting like that. Sqlite itself gives you a blank canvas. And a blank canvas is not much of a data format.
The SQLite data format includes the schema, in plain ASCII. This self-documenting nature makes it an excellent data format, I've taken advantage of it numerous times in making use of SQLite-based application file formats.
SQLite is put forth as a basis for an application file format, and by definition it must be sufficiently flexible to accommodate any application. But by including the schema, it is self-documenting as to what the structure is, which ZIP isn't and can't be. QED.
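That self-documenting property is concrete: every SQLite file carries its own CREATE statements in the sqlite_master table. A small sketch in Python's sqlite3 (the table here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT, created TEXT)")
# Any tool that can read SQLite can recover the schema; no external docs needed
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name)
    print(sql)
```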
I hope to make the modest point that, if you package your data in a SQLite file or a zip file, then you have both packaging and data problems.
MyCoolSQLApp may read and write a SQLite file with its own schema, but it can't handle an arbitrary SQLite file. Likewise MyCoolZipApp can't handle an arbitrary zip file.
I'm not clear on the point here. You can make an application that handles arbitrary SQLite files (for example: the SQLite shell, or a GUI database browser), or you can make an application that only uses SQLite files with certain content/schema. Does this mean that SQLite does or does not define a file format?
If not, what do you call the specification of how data is stored in a SQLite file besides a 'file format'?
It all depends on how much/how complex data you have. SQLite is a database after all, you can query it with SQL and do lots of fancy stuff that might be hard to do with regular file formats like JSON or XML.
If you just need a config file or only have a small amount of data you can use XML/JSON files that you parse yourself. If you are going to have loads of data that needs some structure (for example messages in a messaging app) i would use SQLite.
There's an enormous amount of comments and tests in that codebase. As installed on my Mac, sqlite comprises a 1.3MB command line utility and a 1MB shared library. That's absolutely tiny given the functionality it provides.
Go ahead and use text files and then have fun with data corruption issues. We use CSV for sending commands to IoT devices and it's an issue. If this had been done with SQLite, there would at least be no data corruption issues. One could even use SQLite as a storage container for JSON if one wishes to do so. They even have an extension that aids in this with a useful set of functions:
The JSON features of SQLite are extremely robust and performant. There is no reason to use raw JSON as the storage when you can just shove it into SQLite and lose almost nothing.
Not only that, it enables us to do CRUD operations, list the commands, sort them by time, do limits, pagination, bundle a bunch of commands that enable a certain functionality in a transaction, etc. SQLite has all of those and more and also avoids data corruption issues by design. Anyway, the path we took was to move everything to MySQL just because most of the other data is also in a MySQL database. Otherwise we would have definitely used SQLite.
ZIP files are not a database: they are more like a directory hierarchy. But maybe all I need is named blobs: no query language parser, optimizer, indexing, etc.
SQLite positions itself as an improvement over ZIP for application file formats: https://www.sqlite.org/appfileformat.html . But minzip is so much smaller, easier to understand, debug and ship. So why use SQLite for an app if ZIP suffices?
If you’re talking about like cbr archives, you’re right. It’s comparing against usages like word/excel, which store a bunch of XML in an archive and call it a day.
If you’re not reading and writing out application state, then yes, you don’t need something to manage your non-existent state
Depends on your use case. XML and JSON are great for applications with simple data stores, having done this myself. But if you foresee a need for complex queries or locking and threads then SQLite might be a good choice.
The main challenge there is: how do you ensure your database is resilient to machine or datacentre outages? I.e. what happens if the one server with the database is in a datacentre that loses Internet connectivity?
SQLite "merely" assumes that the problems that come with distributed systems are handled at the application layer. You'll have to solve those problems for yourself, sure, but in practice I have rarely (I think never, actually) had data loss through a fault of SQLite.
Also, did you know you can use in-memory instances (and share them across threads!) with the right incantation? And that you can back up your on-disk instance to an in-memory one, do your expensive transactions without hitting the disk, then back the modified instance up to disk, even in-place if you want!
Sqlite is amazing when you don't expect the DB to do replication or failover on its own.
No I did not know that, I've looked for a long time for a way to convert a sqlite3 database to an in memory database and then back again. Do you mean that there's support in sqlite3 for this? Could you point me in the right direction?
We both learned something new today. Looks like this is what you want in combination with using an in memory database.
I've been doing a handrolled in memory cache layer to speed data access, but with this, I can just call the db directly and then periodically sync to disk, redis rdb style. Sqlite is a staggeringly good piece of technology!
In Python you just have to open the special file name “:memory:” to get a memory-based db. I don’t remember what the raw SQLite incantation is (or if it’s different). Also, pay attention to “ATTACH” - it’s the way to use multiple databases (file and/or memory) while still letting SQLite handle it all (e.g. join a memory db to a file db, insert result into 3rd file db - all without having to look at records in your own code)
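A minimal illustration of both tricks together (Python's sqlite3; file and table names are made up): an in-memory database, an ATTACHed file database, and a cross-database INSERT ... SELECT with no records touching application code.

```python
import sqlite3

# Pure in-memory database
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE scratch (v INTEGER)")
mem.execute("INSERT INTO scratch VALUES (1)")
mem.commit()  # ATTACH is not allowed inside an open transaction

# ATTACH a file database and move rows across in one statement
mem.execute("ATTACH DATABASE 'archive.db' AS disk")
mem.execute("CREATE TABLE IF NOT EXISTS disk.history (v INTEGER)")
mem.execute("INSERT INTO disk.history SELECT v FROM scratch")  # memory -> file
mem.commit()
```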
I feel like the sibling comments here are basically just saying "yep, that's the main challenge!" without providing useful tips. I personally haven't used it, but I'm aware that this library exists to help resolve this challenge. https://litestream.io
Just open your SQLite database in read-only mode :)
SQLite works really well for static or semi-static data. For example, a blog where you have a small number of users writing and many users reading from the DB. If the authors are content to use one server to edit the DB then you can easily push that DB to the servers handling the reads.
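Opening read-only is one URI flag away (a Python sketch; blog.db is a placeholder name): readers open the same file with mode=ro, and any stray write attempt is refused by SQLite itself.

```python
import sqlite3

# Writer / publishing side
rw = sqlite3.connect("blog.db")
rw.execute("CREATE TABLE IF NOT EXISTS posts (title TEXT)")
rw.execute("INSERT INTO posts VALUES ('hello')")
rw.commit()

# Read replicas open the same file read-only; writes are refused
ro = sqlite3.connect("file:blog.db?mode=ro", uri=True)
print(ro.execute("SELECT title FROM posts").fetchone()[0])  # hello
try:
    ro.execute("INSERT INTO posts VALUES ('nope')")
except sqlite3.OperationalError as e:
    print("refused:", e)
```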
Yes this can work, however you are mostly relying on the operating system's file system cache for speed. Other databases will try harder to keep their own cache. But true, there is lots of room where SQLite works nicely.
TBF: you don't. The moment you care about any shortcoming of SQLite, move away.
One of the cool things about it is that SQLite is very lax about what it accepts (mostly in the datatype area). You can write your SQL statements targeting whatever database you think you'll move to later and they'll work while you're still on SQLite. I believe having this migration work seamlessly towards PostgreSQL is one of the advertised features.
I'm using SQLite at the moment; on the one side there's a 'legacy' (read: poorly written 2012) application, on the other there's the new and rebuilt version. The old one was not built very well, it does not use foreign keys or any kind of database constraints (it references other entries by name in a column of comma-separated values) and it runs like trash. But the performance problem is not in the dozen queries it runs to load the data, it's in the fact that it converts the query result to XML (via string concatenation, because of course) and that is converted to JSON; the conversion is at least 60% of each request. The other problem is that it writes and re-queries the data whenever you leave one of the hundreds of form fields in the application.
I'm rebuilding the application in a modern tech stack, still using SQLite but properly this time, along with Go and React. API requests take 20-40ms instead of 300-1500ms, and there are far fewer of them.
The main downside to using SQLite is that it does not support "proper" database migrations; you cannot alter a column. You can add columns to an existing table, but you can't change existing columns. The database abstraction I'm using at the moment, Gorm (a different subject entirely), works around this by moving stuff to a temp table, recreating the table with the updated columns and moving stuff back, I believe.
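The workaround being described is roughly this rebuild pattern (a sketch; the table is invented, and note that newer SQLite versions do support RENAME COLUMN and DROP COLUMN, but changing a column's type still requires the rebuild):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age TEXT)")  # wrong type
conn.execute("INSERT INTO users VALUES (1, '42')")

with conn:  # do the whole rebuild in one transaction
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, age INTEGER)")
    conn.execute("INSERT INTO users_new SELECT id, CAST(age AS INTEGER) FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")

print(conn.execute("SELECT age, typeof(age) FROM users").fetchone())  # (42, 'integer')
```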
I reach for SQLite if I need persisted state for a local application or custom file format but why use it for things that may need more write concurrency like web server?
Postgres is basically just as easy to use and backup.
> Postgres is basically just as easy to use and backup.
SQLite is way ahead on this: no daemon to run, no user / database to create, manage and administrate, no authentication to set, no socket connection to manage… backup is as easy as it gets: (copy one or two files).
Well, backing up by copying doesn't necessarily result in a consistent state if there are writes to the database. For that you have to use the SQLite `.backup` command (or the backup API https://sqlite.org/c3ref/backup_finish.html), after which the backup database has to be copied over to backup storage (or backup storage has to be mounted on the production system, which is dangerous).
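In application code the same consistent snapshot is available without shelling out; Python (3.7+) exposes it as Connection.backup (file names here are made up):

```python
import sqlite3

live = sqlite3.connect("production.db")
live.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")
live.execute("INSERT INTO t VALUES (1)")
live.commit()

# Takes a transactionally consistent snapshot even while `live` stays in use
snapshot = sqlite3.connect("backup.db")
live.backup(snapshot)
snapshot.close()
```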
Use "rsync" instead of "cp". After it finishes copying, it will check to see if the file has changed since it started copying, and will restart the copy if so (with a limited number of retries).
If copying the entire database is faster than your average update rate, this will converge very quickly and deliver a consistent copy.
For many small applications, this is perfectly fine. It's not much harder to just run `sqlite3 "$file" ".backup $backupfile"` (that's literally all it takes, and what you should do) and guarantee consistency. But it's nice to know that a simple rsync is sufficient for slowly-updating uses - e.g.
And as for the other side of backup, you know -- restore -- sqlite shines brighter than everything else. You can just take a good copy and put it back. You can examine the file everywhere, on a read only system, etc - without configuring anything if needed.
SQLite loses most of its edge in concurrent write scenarios, but its read performance is difficult to beat.
A lot of it comes from what TFA says: there's no network roundtrip, but a function call. Even in a local machine, a unix socket query will carry at least a couple of system calls with potential context switches, and that makes regular RDBMS lag behind when you do tons of sequential and small queries.
Of course, when you have large results or complex queries that eat a bigger chunk of the time cake and that technical advantage wanes. After that, which RDBMS has the performance lead is largely workload-dependent.