Compiling and running sqlite3-rsync from a branch (simonwillison.net)
63 points by tosh 3 months ago | 15 comments



> It took me quite a few iterations to get to this recipe for compiling the tool itself

Hehe, I really feel this remark, having recently built a sqlite3 binary with Lua support.[*]

The LLMs were no help here either, which is maybe a good thing for our collective future employment.

That said, I found the quality of the documentation and legibility of the sqlite source code to be absolutely fantastic. When you work with the code, you can almost feel the programming enlightenment rub off on you.

[*] Why, for heaven's sake? So you can decouple your business logic from the database and test it a bit outside the DB, whilst still keeping the data all in one place without sending it over the wire. It lets you write SELECT lua('some lua code that operates on your sqlite data') FROM some-table;, so you can put gnarly nested conditionals or whatever into Lua, build some comprehensive testing around that, and execute it on your data as if everything were a single, plain old SQL statement.
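
For the curious, a session looks something like this (the extension filename and the exact lua() calling convention here are illustrative, not the real API):

    # hypothetical: load the Lua extension, then call Lua from SQL
    sqlite3 -cmd '.load ./lua_ext' mydata.db \
      "SELECT lua('return 42') FROM some_table;"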


I got a little bit of assistance from Claude in figuring this out, but it pointed me in the right direction more than actually giving me the right sequence of commands. Claude got me as far as:

    gcc -o sqldiff sqldiff.c ../sqlite3.c -I.. -ldl -lpthread
Which was enough for me to figure out that I needed to build the sqlite3.c amalgamation first and then run gcc in the tool/ directory. I landed on this after a bit more poking:

    gcc -o sqlite3-rsync sqlite3-rsync.c ../sqlite3.c -DSQLITE_ENABLE_DBPAGE_VTAB
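For anyone following along, the full sequence from a fresh checkout looked roughly like this (the first two steps generate the sqlite3.c amalgamation):

    # in the root of the sqlite checkout
    ./configure
    make sqlite3.c
    # then build the tool against the amalgamation
    cd tool
    gcc -o sqlite3-rsync sqlite3-rsync.c ../sqlite3.c -DSQLITE_ENABLE_DBPAGE_VTAB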
I did have a much more useful interaction with an LLM later on: I was curious about the protocol used over SSH, so I copied the whole of this 1,800-line C file:

https://github.com/sqlite/sqlite/blob/sqlite3-rsync/tool/sql...

And pasted it into OpenAI o1-preview with the prompt "Explain the protocol over SSH part of this" - and got back a genuinely excellent high-level explanation of how that worked: https://chatgpt.com/share/6701450c-bc9c-8006-8c9e-468ab6f67e...


That share link 404s for me, FWIW. I’d be interested in seeing it!


Weird - https://chatgpt.com/share/6701450c-bc9c-8006-8c9e-468ab6f67e... is working for me in a Chrome Incognito window.

Here's a copy of the Markdown answer it gave me: https://gist.github.com/simonw/ffbf90e0602df04c2f6b387de42ac...


Lua as the language for SQLite procedures just sounds right for some reason. Did you try integrating at that level, or is it limited to expression-level evaluation?


sqlite3-rsync is in the Makefile. I got sqlite3-rsync to build by just checking out the main branch and running:

    ./configure
    make sqlite3-rsync
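Once built, usage is just an origin and a replica, and either one can be remote over SSH (paths here are placeholders):

    # push a local database to a remote replica
    ./sqlite3-rsync /data/app.db user@backup-host:/backups/app.db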
I think the rsync branch has been merged?

I'm really excited about this new tool; I'm hoping to use it for nightly backups of a webapp I'm building that uses sqlite exclusively, although I'm hesitant to rely on sqlite3-rsync right away because it's so new.


> I'm hesitant to rely on sqlite3-rsync right away because it's so new.

If you want something less new that only needs an S3-compatible endpoint, there's https://gist.github.com/pkhuong/555be66f5e573a2d56186d658fd8...


Thanks for that! I just updated my TIL to promote this recipe instead - I didn't realize it was A) in the Makefile and B) merged into the main branch.


> I'm hoping to use it for nightly backups of a webapp I'm building that uses sqlite exclusively

The easiest recipe for this is to use

    sqlite3 $SRC "vacuum into '$DEST'"
$DEST must not already exist. Personally, I prefer to put a timestamp into the filename and keep the last n copies.
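
A sketch of that, with placeholder paths and a retention of seven copies:

    # VACUUM INTO refuses to overwrite an existing file, so timestamp each one
    DEST="backups/app-$(date +%Y%m%d-%H%M%S).db"
    sqlite3 app.db "vacuum into '$DEST'"
    # prune everything but the newest 7 backups
    ls -1t backups/app-*.db | tail -n +8 | xargs -r rm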


I'm aware of the existing backup methods, but AFAIK they don't do a delta of only the changed pages (unless I use separate software), and "vacuum into" doesn't work over SSH.

I have no need to store a backup on the same physical server. Also, the backup server will keep snapshots, and with sqlite3-rsync I can just update a previous snapshot with the changed pages rather than redundantly transferring the whole database over every time.


Yeah, the unique benefit of sqlite3-rsync looks to be efficiently backing up medium-to-large databases that don't change enormously (so deltas are worthwhile) over an SSH connection. I particularly like that you don't need double the disk space to create a temporary backup copy before downloading it.


I wonder how this compares to litestream and other existing sqlite replication strategies.


I had a quick scan through the source: it hashes each page on both sides and sends the page if the hashes differ (so conceptually very similar to rsync). I don't think those hashes are cached anywhere normally, which implies it needs to do a full read of the database on both sides, so it'll be slow and will probably trash your disk cache on a large database.

In theory, it should be able to use the WAL file directly if it copies the WAL checksums, but that information would get lost every time the WAL is checkpointed and the data is moved into the main db file.

Litestream knows exactly which pages have been written because it tails the WAL, so it doesn't need to do all that work for unchanged pages.


Ugh, yeah, that's a good point: it looks like each time you run sqlite3-rsync it calculates a hash for every database page on both the source and the replica side - that's a lot of disk access and CPU, way more than you would want on an every-ten-seconds schedule.


I guess this is probably fast enough that you could run it on a schedule every 10 seconds or so (at least for databases measured in GBs, not TBs) to keep a warm copy updated elsewhere. [UPDATE: maybe not, it has to hash everything, see comment here https://news.ycombinator.com/item?id=41749288#41760395]
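
For the record, the naive version of that schedule would just be a loop like this (host and paths are placeholders):

    # keep a warm copy on a standby host, re-syncing every 10 seconds
    while true; do
      sqlite3-rsync /srv/app.db user@standby:/srv/app.db
      sleep 10
    done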

Litestream gives you a backup in S3 within less than a second of data being committed though, plus since it records the full WAL archive you get point-in-time recovery for previous timestamps too.



