> In wal2 mode, the system uses two wal files instead of one. The files are name...

athoscouto · 2024-01-15T12:32:10 1705321930

> Checkpointers do not block writers, and writers do not block checkpointers. This means that, even following the checkpoint, the wal file cannot be overwritten or deleted, and so all subsequent transactions must also be appended to the wal file. The work of the checkpointer is not wasted - SQLite remembers which parts of the wal file have already been copied into the db file so that the next checkpoint does not have to do so again

Probably because of this.

> but it does mean that the wal file may grow indefinitely if the checkpointer never gets a chance to finish without a writer appending to the wal file. There are also circumstances in which long-running readers may prevent a checkpointer from checkpointing the entire wal file - also causing the wal file to grow indefinitely in a busy system.

> Wal2 mode does not have this problem. In wal2 mode, wal files do not grow indefinitely even if the checkpointer never has a chance to finish uninterrupted.

I don't get how wal2 fixes the long-running reader problem though. Maybe they were just referring to the former problem?
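For reference, the long-running-reader effect from the quote can be reproduced in plain wal mode. This is a minimal sketch using Python's stdlib `sqlite3`; the file name, table, and function name are made up for illustration, and it assumes a stock SQLite build (not wal2). `PRAGMA wal_checkpoint` reports how many frames are in the wal and how many were checkpointed, so a reader that pins an old snapshot should leave the second number behind the first.

```python
import os
import sqlite3
import tempfile


def checkpoint_progress_with_reader():
    """Return (log, ckpt) frame counts while a reader is open, then after it ends."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")  # hypothetical file name
        # isolation_level=None: no implicit transactions, BEGIN is issued by hand.
        writer = sqlite3.connect(path, isolation_level=None)
        reader = sqlite3.connect(path, isolation_level=None)
        try:
            writer.execute("PRAGMA journal_mode=WAL")
            writer.execute("CREATE TABLE t (v INTEGER)")
            writer.execute("INSERT INTO t VALUES (0)")

            # The reader pins a snapshot by reading inside an open transaction.
            reader.execute("BEGIN")
            reader.execute("SELECT v FROM t").fetchall()

            # These commits land in the wal after the reader's snapshot.
            for i in range(1, 50):
                writer.execute("INSERT INTO t VALUES (?)", (i,))

            # Result row: (busy, frames in wal, frames checkpointed).
            _, log_held, ckpt_held = writer.execute(
                "PRAGMA wal_checkpoint(PASSIVE)").fetchall()[0]

            reader.execute("COMMIT")  # release the snapshot
            _, log_free, ckpt_free = writer.execute(
                "PRAGMA wal_checkpoint(PASSIVE)").fetchall()[0]
            return (log_held, ckpt_held), (log_free, ckpt_free)
        finally:
            reader.close()
            writer.close()
```

While the reader is open the checkpoint can only cover part of the wal; once the reader commits, a second checkpoint should catch up.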

jacoblambda · 2024-01-15T13:55:57 1705326957

> I don't get how wal2 fixes the long-running reader problem though. Maybe they were just referring to the former problem?

Because with a single wal file you can't checkpoint it during a read since said file may change out from under you.

With two wal files, the one you are actively appending to can be treated like in wal1 mode but the one that isn't being appended to is immutable for the time being just like the main database.

This means you can treat the actual db file and the immutable wal file together as one immutable database file with some underlying abstraction. That abstraction then allows you to perform the checkpoint operation since the abstraction can keep all that immutable data accessible in some form or another while reworking the data structure of the db file.

Then once the checkpoint is complete, the abstraction can clear the now redundant immutable wal file, become transparent, and just present the underlying single DB file.

And now once the wal file you are actively appending to reaches a sufficient size, you "lock" that one, rendering it immutable, and switch over to appending to the cleared wal file you were previously checkpointing. With this you can now checkpoint again without blocking reads or writes.

liuliu · 2024-01-15T15:14:50 1705331690

But the WAL also keeps "history", so one reader transaction can see x0 from wal1 while another reader sees x1 from wal1. Doesn't that mean wal1 cannot be merged into the main db, since otherwise x0 is lost?

jacoblambda · 2024-01-15T17:23:00 1705339380

Yes that is correct. You can't checkpoint data after the oldest snapshot. But to my knowledge there's no way to force a read snapshot to continue to exist.

So while you can make multiple back to back reads that use the same snapshot, I believe there's no guarantee that the snapshot will still exist when the next read is opened unless the previous read is also still open (in which case an error is returned).

That seems to set an upper bound on how long a reader can block a checkpoint (unless the reader is intentionally staging reads to block the checkpoint).

Theoretically you could implement checkpoints that flatten everything between snapshots into single commits but the complexity and overhead probably isn't worth it given that the only real blocker for wal2 is an edge case that is nigh impossible to encounter unless you intentionally try to trigger it.

liuliu · 2024-01-15T18:02:41 1705341761

Opening a transaction with BEGIN forces reads to come from a particular snapshot in SQLite. There are some complications around lock upgrades in that case: BEGIN a, read, BEGIN b, read, then a write from transaction a fails because it cannot upgrade its read lock to a write lock. Other than that, in WAL mode, it will work exactly as expected:

1. BEGIN a, read x0
2. BEGIN b, write x1, END b
3. BEGIN c, read will return x1
4. Back to transaction a, read again, return x0 still.
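That sequence can be tried with Python's stdlib `sqlite3`. The sketch below is only an illustration: the table, file name, and function name are invented, and the failed upgrade at the end is the stale-snapshot case (a tries to write after b has already committed), which may not be exactly the lock-upgrade situation described above.

```python
import os
import sqlite3
import tempfile


def snapshot_demo():
    """Return (a's first read, c's read, a's second read, a's write failed?)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")  # hypothetical file name
        # isolation_level=None: transactions are controlled with explicit BEGIN.
        conns = [sqlite3.connect(path, isolation_level=None, timeout=0.1)
                 for _ in range(3)]
        a, b, c = conns
        try:
            a.execute("PRAGMA journal_mode=WAL")
            a.execute("CREATE TABLE t (x TEXT)")
            a.execute("INSERT INTO t VALUES ('x0')")

            def read(conn):
                return conn.execute("SELECT x FROM t").fetchall()[0][0]

            a.execute("BEGIN")
            first = read(a)                 # a's snapshot is fixed by this read

            b.execute("BEGIN")
            b.execute("UPDATE t SET x = 'x1'")
            b.execute("COMMIT")

            seen_by_c = read(c)             # a fresh reader sees b's commit
            again = read(a)                 # a still reads from its old snapshot

            try:
                a.execute("UPDATE t SET x = 'x2'")  # snapshot is stale by now
                upgrade_failed = False
            except sqlite3.OperationalError:
                upgrade_failed = True
            if a.in_transaction:
                a.execute("ROLLBACK")
            return first, seen_by_c, again, upgrade_failed
        finally:
            for conn in conns:
                conn.close()
```

The point of interest is that a's second read matches its first even though c, started later, already sees the new value.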

willvarfar · 2024-01-15T14:56:57 1705330617

> once the wal file you are actively appending to reaches a sufficient size, you "lock" that one

so there might eventually be wal3 and wal4 files and so on?

jacoblambda · 2024-01-15T15:13:28 1705331608

nope. since if the wal you are checkpointing isn't done yet, you just wait to "lock" and switch files until that operation is complete.

Checkpointing can be considered "lock free" since the operation will always eventually complete. How long it takes will depend on the wal file being checkpointed into the db but it'll eventually complete in some finite amount of time.

Because you know that any given checkpointing operation has to eventually complete, you can simply keep appending to the current "append" wal file and then tackle those changes when you finish the current checkpoint op (at which point the wal file you just finished checkpointing is free to take the appends).
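The "only ever two files" rule can be sketched as a toy model. This is not SQLite's implementation: the class name, the `limit` threshold, and the list-based "files" are all assumptions made for illustration. It only encodes the rule described here, that the writer switches logs when the active one is large enough and the other one has been fully checkpointed, and otherwise keeps appending.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWal2:
    """Toy model of two alternating logs; not SQLite's implementation."""
    limit: int = 3                        # hypothetical switch threshold, in entries
    db: list = field(default_factory=list)
    wals: list = field(default_factory=lambda: [[], []])
    active: int = 0                       # index of the log that takes appends

    def write(self, entry):
        other = 1 - self.active
        # Switch only if the active log is big enough AND the other log
        # has been fully checkpointed (is empty). Otherwise keep appending.
        if len(self.wals[self.active]) >= self.limit and not self.wals[other]:
            self.active = other
        self.wals[self.active].append(entry)

    def checkpoint(self):
        """Copy the inactive log into the db and clear it for reuse."""
        other = 1 - self.active
        self.db.extend(self.wals[other])
        self.wals[other].clear()


def run_toy():
    toy = ToyWal2(limit=3)
    for i in range(10):       # no checkpoint yet: the second log just keeps growing
        toy.write(i)
    toy.checkpoint()          # frees the first log
    toy.write(10)             # the writer can now switch back to it
    toy.checkpoint()          # the long second log is drained in turn
    return toy
```

In this model a slow checkpointer makes the active log longer than `limit`, but a third log is never needed.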

athoscouto · 2024-01-15T15:22:33 1705332153

Wouldn't that require the reader to finish? Making the reader block the checkpoint from wrapping up?

jacoblambda · 2024-01-15T16:15:10 1705335310

Not if you abstract those reads. It's not like the application is directly opening a file handle. Instead it goes through the sqlite library which accesses not only the db and the wal file but also a non-persistent shm (shared memory) file used for maintaining a consistent view of the write ahead log.

When a reader is reading, it puts a shared lock on the specific data it is reading in the shm file. The checkpointer respects that lock and may (potentially) continue working elsewhere in the db file, slowly updating the indices for checkpointed data in the shm file.

The checkpointer won't change the underlying data that the reader has a lock on but they may have created a new location for it. When the reader is finally done reading, the checkpointer can quickly grab an exclusive lock and update the header in the shm for that data to point to the new destination (and then release said lock). Since the checkpointer never holds this lock for very long, the reader can either block when trying to get a shared lock or it can retry the lock a few moments later. Now that the header in the shm only points to the new location, the checkpointer can safely do whatever it needs to with the data in the old location.

Slowly rinse repeat this until the checkpointer has gotten through the entire write ahead log. At that point there should be no remaining references in the shm to data within the wal file.

Now the wal file can be "unlocked" and if the other wal file is large enough, it can be locked, writes switch over to the other wal, and the cycle repeats anew.

Edit: Importantly, this requires that all readers be on a snapshot that includes at least one commit from the "new" wal file. So compared with wal1, wal2 allows you to have long running readers as long as they start past the last commit of the "previous" wal file.

riwsky · 2024-01-15T15:18:02 1705331882

Think of them more as “walblue” and “walgreen”

stavros · 2024-01-15T13:39:00 1705325940

I suspect that having two WAL files means you can freely checkpoint the one that isn't currently being written to, but that's just a guess.

WolfOliver · 2024-01-15T11:14:26 1705317266

I'm not much into the details, but I'm wondering why it doesn't always write to "<database>-wal": once that file is full, it is renamed to "<database>-wal-0" and writing starts again in "<database>-wal"; once it is full again, it is renamed to "<database>-wal-1", and so on?

ajnin · 2024-01-15T11:44:40 1705319080

This is to handle the case where transactions keep getting written to the wal file while it is being "checkpointed", preventing it from ever being deleted. You only need to alternate between two files, one checkpointed, one written to, for this.

wongarsu · 2024-01-15T12:09:32 1705320572

when doing backups, managing file permissions, etc, it's really convenient to only have a fixed number of known files to worry about.

Switching to WAL already makes handling Sqlite databases much less convenient, since you now have three files instead of one, and need a filesystem snapshotting mechanism to reliably back them up (so you don't have one state in the database and another in the wal). Making the filenames and number of files less predictable would make that mode not worth it for many use cases
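The "three files" are easy to see from Python's stdlib `sqlite3`. A small sketch, assuming default locking mode and a regular filesystem; the file and function names are made up. It also runs a `TRUNCATE` checkpoint, which should leave the wal file empty once everything has been copied into the main db.

```python
import os
import sqlite3
import tempfile


def wal_files_demo():
    """Return (files present while the connection is open, wal size after checkpoint)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")  # hypothetical file name
        conn = sqlite3.connect(path, isolation_level=None)
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            conn.execute("CREATE TABLE t (v INTEGER)")
            conn.execute("INSERT INTO t VALUES (1)")
            while_open = sorted(os.listdir(tmp))

            # Copy all wal content into the db file and truncate the wal.
            conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
            wal_size = os.path.getsize(path + "-wal")
            return while_open, wal_size
        finally:
            conn.close()
```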

sureglymop · 2024-01-15T16:05:08 1705334708

You can't use a filesystem snapshotting mechanism... Even when not using wal mode, you can't just cp the .db file somewhere (in some circumstances, yes).

Instead, sqlite provides an online backup api specifically for creating backups. This also takes wal mode into account.
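In Python the online backup API is exposed as `sqlite3.Connection.backup` (available since Python 3.7). A minimal sketch, with invented file names, table, and function name:

```python
import os
import sqlite3
import tempfile


def online_backup_demo():
    """Back up a WAL-mode database with the online backup API, then read the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        src = sqlite3.connect(os.path.join(tmp, "live.db"), isolation_level=None)
        dst = sqlite3.connect(os.path.join(tmp, "backup.db"))
        try:
            src.execute("PRAGMA journal_mode=WAL")
            src.execute("CREATE TABLE t (v INTEGER)")
            src.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

            # Copies a consistent view of the source, including content
            # that currently lives only in the wal file.
            src.backup(dst)

            return dst.execute("SELECT count(*), sum(v) FROM t").fetchall()[0]
        finally:
            dst.close()
            src.close()
```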

dezgeg · 2024-01-15T12:18:11 1705321091

Even the mechanism predating WALv1 (rollback journal) uses two files. I don't think there is any way of using SQLite crash-proof with just a single file.

Besides, even if the database is single-file it's still necessary to use filesystem snapshotting for live backup, or it's likely to get an inconsistent copy.

kijin · 2024-01-15T12:43:32 1705322612

With smallish databases, just pipe the .dump command to gzip. No need to snapshot an entire filesystem just to back up a few hundred megabytes (what I would consider "smallish"). Backup and restore speeds aren't a significant concern at those sizes, anyway, and you get human-readable backups that can be inspected with standard Unix tools.
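A rough Python equivalent of piping a dump through gzip, as a sketch only: `Connection.iterdump()` is Python's own SQL text dump rather than the CLI's `.dump` command, and the helper names and file name below are made up.

```python
import gzip
import os
import sqlite3
import tempfile


def dump_to_gzip(conn, out_path):
    """Write a SQL text dump of conn to a gzip-compressed file."""
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for line in conn.iterdump():
            f.write(line + "\n")


def restore_from_gzip(in_path, conn):
    """Replay a gzip-compressed SQL dump into an empty database."""
    with gzip.open(in_path, "rt", encoding="utf-8") as f:
        conn.executescript(f.read())


def dump_roundtrip_demo():
    with tempfile.TemporaryDirectory() as tmp:
        dump_path = os.path.join(tmp, "backup.sql.gz")  # hypothetical file name
        src = sqlite3.connect(":memory:")
        dst = sqlite3.connect(":memory:")
        try:
            src.execute("CREATE TABLE t (v INTEGER, s TEXT)")
            src.executemany("INSERT INTO t VALUES (?, ?)",
                            [(i, f"row {i}") for i in range(5)])
            src.commit()
            dump_to_gzip(src, dump_path)
            restore_from_gzip(dump_path, dst)
            query = "SELECT v, s FROM t ORDER BY v"
            return src.execute(query).fetchall(), dst.execute(query).fetchall()
        finally:
            src.close()
            dst.close()
```

The compressed file is plain SQL, so it can still be inspected with `zcat` or similar tools.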

maskros · 2024-01-15T14:38:55 1705329535

Or use "VACUUM INTO" to get a backup that is even faster (and easier) to restore.
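A sketch of `VACUUM INTO` from Python's stdlib `sqlite3`, assuming the linked SQLite library is 3.27.0 or newer (the release that added `VACUUM INTO`); file names, table, and function name are invented. The target file must not already exist, and the statement cannot run inside an open transaction, hence `isolation_level=None`.

```python
import os
import sqlite3
import tempfile


def vacuum_into_demo():
    """Create a compact copy of a WAL-mode database with VACUUM INTO."""
    with tempfile.TemporaryDirectory() as tmp:
        backup_path = os.path.join(tmp, "backup.db")  # must not exist yet
        src = sqlite3.connect(os.path.join(tmp, "live.db"), isolation_level=None)
        try:
            src.execute("PRAGMA journal_mode=WAL")
            src.execute("CREATE TABLE t (v INTEGER)")
            src.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
            # The target is an ordinary SQL expression, so it can be bound.
            src.execute("VACUUM INTO ?", (backup_path,))
        finally:
            src.close()

        # The result is a standalone database file; restoring is just opening it.
        copy = sqlite3.connect(backup_path)
        try:
            return copy.execute("SELECT count(*), sum(v) FROM t").fetchall()[0]
        finally:
            copy.close()
```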

deberon · 2024-01-15T12:56:48 1705323408

We would back up double-digit GB MySQL databases by piping mysqldump into gzip as well. Like you I’m sure there’s a size limit where that doesn’t work. I never found out if it was CPU or memory constrained though.

sgarland · 2024-01-15T13:19:05 1705324745

Where I’m at, the DBAs shifted from pipes to temporary files. They said at the double-digit TB size, it would occasionally fail otherwise. This was with Postgres / barman / pigz, FWIW.

Never saw it myself, so I have no idea what the cause was.

kijin · 2024-01-16T01:21:58 1705368118

I've used mysqldump|gzip when migrating a ~1TB database to an incompatible version of MySQL/MariaDB. It's slow but very reliable. I think the single-threaded pipe is the bottleneck. The process never took up more than one full CPU core.

lupusreal · 2024-01-15T14:35:38 1705329338

> need a filesystem snapshotting mechanism to reliably back them up (so you don't have one state in the database and another in the wal)

VACUUM INTO?

formerly_proven · 2024-01-15T14:09:45 1705327785

Pretty much every journaling mechanism requires snapshots for reliable backups. This includes either of SQLite's modes (undo journal and WAL).

slau · 2024-01-15T11:23:30 1705317810

Presumably because you don't want to keep a full directory of WALs up to infinity. This uses WAL_size*2 on the disk and keeps it from growing beyond that.

fweimer · 2024-01-15T19:52:22 1705348342

Some operating systems and file systems do not support renaming files that are opened by the same or another process.

dspillett · 2024-01-15T19:00:36 1705345236

> Looks so logical that I don't understand why WAL mode was not implemented like this from the get go. Probably an optimization wrongly dismissed as premature?

While it has advantages, it is also more code, so more possible places for bugs to hide, and it has other disadvantages, hence it doesn't completely deprecate the other WAL mode.

Also the advantages might not have been as commonly cared about in sqlite in earlier times, but it is being used in more & more places and sometimes at larger scales or with more significant concurrency needs, and the core has been pretty darn stable for quite some time, all of which factors change the dynamics of what is worth committing the dev/testing time to in terms of usefulness to the end users.

tomashubelbauer · 2024-01-15T11:07:42 1705316862

Now I can't help but wonder if there should be a `waln` mode where the WAL files would round robin instead of alternate between just two potentially allowing for much more intense write cadence.

avereveard · 2024-01-15T11:18:50 1705317530

I'll give this thread two or three more replies before it reimplements logrotate from first principles

rmbyrro · 2024-01-15T11:17:29 1705317449

That would make reading slower, since readers need to read from WAL as well.

nbevans · 2024-01-15T11:33:51 1705318431

There is a .shm (shared memory) file that all SQLite connections to the database would use to determine which page ranges are in the WAL versus main DB file. So that overhead already exists when WAL is enabled.

rmbyrro · 2024-01-15T13:19:13 1705324753

I'm not familiar with its inner workings, but if there were n WAL files, I'd expect the overhead to be a function of n.

dspillett · 2024-01-15T19:11:01 1705345861

Possibly a bad function of n (linear or worse) on IO systems based on traditional drives or dirt cheap SSDs, once IO becomes your main bottleneck (i.e. when your data no longer fits easily in RAM).

dspillett · 2024-01-15T19:08:13 1705345693

Presumably it was at least considered. I would guess that the negatives would be taking more disk space (and potentially imposing greater IO concurrency though that probably isn't a great issue with modern storage systems), and it being more complex (if not in the core design, then in having extra edge cases to make sure are all covered).

Other databases do something similar to what you suggest, though obviously the trade-offs will differ because of other different internals and product priorities, so it would have been thought about. For instance MS SQL Server has multiple “virtual logs” in its log files, for at least some overlapping reasons.

foofie · 2024-01-15T13:34:22 1705325662

> Looks so logical (...)

https://en.wikipedia.org/wiki/Log-structured_merge-tree

quietbritishjim · 2024-01-15T17:06:54 1705338414

It may simply have been a design oversight at the time. However, the page notes the following difference of WAL2 from WAL, which could be an issue in some cases:

> In wal mode, a checkpoint may be attempted at any time. In wal2 mode, the checkpointer has to wait until writers have switched to the "other" wal file before a checkpoint can take place.

forinti · 2024-01-15T11:03:27 1705316607

I'm just speculating here, but in a normal database you would have different processes writing the wal files to the database or archives.

You don't have that with sqlite, so I don't see an obvious advantage for this, except if they now spawn a process or thread to do this concurrently.

Edit: so I read the doc (shame on me) and it has nothing to do with speed. Its purpose is to prevent a wal file from growing too large.

hahn-kev · 2024-01-15T11:12:06 1705317126

I believe SQLite supports accessing a single database file from multiple processes or threads at once

arcbyte · 2024-01-15T13:40:39 1705326039

There can be many readers simultaneously, but only ever one writer at a time.

WolfOliver · 2024-01-15T11:15:34 1705317334

I guess it is to make it crash safe, e.g. in case the server dies in the middle of a write. Without it the database would be corrupt after restart; with the wal file it can recover.