Store SQLite in Cloudflare Durable Objects (rkusa.st)
244 points by jgrahamc on Jan 25, 2022 | 74 comments



I read through this entire post going "oh, this is cursed", slowly growing in volume the entire time. As is proper for something this clever.

I've been doing a bit of prodding at Durable Objects lately, in part inspired by the same note that the OP quoted about implementing datastores atop the framework, and this is the sort of environment that I could see being really, really useful at scale in the future. I'm anxious about using it without a public spec and some portability, but the concept is just tremendously cool.



What do you mean by "cursed"?


Substitute "this is a terrible idea and I love it" if you want. ;)


My guess in this context: "A hacky solution that probably is unstable"


This is super exciting!

I see there is concern about Asyncify. I believe in the browser there is a way around this using Atomics and a shared buffer (Absurd SQL[0] uses it), but I suspect this isn't possible in Cloudflare Workers as it works across the worker/main thread boundary? Is it possible to launch another worker from a Cloudflare Worker, in the same process, like a web worker?

I am somewhat hoping that Cloudflare have something up their sleeve to make these sorts of projects easier, maybe even some sort of distributed block store.

0: https://github.com/jlongster/absurd-sql
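(For anyone curious what that Atomics trick looks like: a rough, hypothetical sketch follows -- the buffer layout and function names are made up, but the Atomics calls are the real API. One thread blocks synchronously with Atomics.wait while a second thread does the async work and wakes it up.)

    // Shared between the two threads.
    const shared = new SharedArrayBuffer(8);
    const flag = new Int32Array(shared, 0, 2); // [0] = signal, [1] = result

    // Thread A (inside a synchronous VFS read): block until signaled.
    // Atomics.wait is only allowed off the main thread, which is why this
    // lives in a worker.
    function readBlockSync(): number {
      Atomics.store(flag, 0, 0);
      postMessage({ kind: "read", buffer: shared }); // ask the IO thread
      Atomics.wait(flag, 0, 0);                      // sleep until notified
      return Atomics.load(flag, 1);                  // result from thread B
    }

    // Thread B (the IO thread): do the async work, then wake thread A.
    declare function fetchBlockFromStorage(): Promise<number>; // hypothetical
    async function handleRead(buffer: SharedArrayBuffer) {
      const view = new Int32Array(buffer, 0, 2);
      view[1] = await fetchBlockFromStorage();
      Atomics.store(view, 0, 1);
      Atomics.notify(view, 0);
    }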


Indeed Cloudflare Workers does not and likely will not support threads with shared memory, due to Spectre concerns. Asyncify is the way to go here.


That's a pity. Is that due to the architecture of sharing a V8 runtime with other customers' workers? How is this different from V8 in Chrome and shared buffers?


Chrome is all-in on strict process isolation as their Spectre defense. Given the web platform that they need to support, they don't really have another choice -- precise timers and threads, as well as many other ad hoc timing mechanisms, were already part of the platform before Spectre hit. Using processes for everything is pretty costly, though, to the point that they have to make some compromises which are open to attack: https://www.spookjs.com/

In Cloudflare Workers, process isolation for every worker isn't practical -- if we had to do that then we'd only be able to offer real "edge compute" to a small number of big enterprises with deep pockets, rather than at prices affordable to everyone. Our fundamental difficulty is that we need a single machine to be able to support thousands of running applications where each app may only get a small trickle of traffic, so that we can deploy every application to every one of our edge locations. It's like if you always had 10,000 tabs open in Chrome.

Instead our strategy is to stack a lot of different mitigations that make attacks slower, until the point where they are too slow to be remotely useful[0]. One part of the strategy is a novel defense we designed with researchers at TU Graz called Dynamic Process Isolation[1], in which we use hardware performance counters to detect execution patterns indicative of an attack and move those workers into private processes. For that strategy to work, though, we first need to slow down attacks enough to give ourselves a chance to detect the patterns -- and that requires disallowing precise timers or anything that could be used as a precise timer, such as multiple threads with shared memory. Luckily, we were thinking about timing attacks from the very start of the project (even before Spectre was known), so we were able to avoid ever putting these into the platform in the first place.

In general I think threads are not as important on servers because distributing work across machines is more powerful anyway. We're trying to create a platform where that is really easy.

[0] https://blog.cloudflare.com/mitigating-spectre-and-other-sec...

[1] https://blog.cloudflare.com/spectre-research-with-tu-graz/
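(To make the timer concern above concrete, here is an illustrative sketch of the well-known construction: with threads and shared memory, one thread can act as a high-resolution clock for another by incrementing a counter in a tight loop. Nothing here is Workers API -- it's just why SharedArrayBuffer effectively reintroduces a precise timer.)

    const sab = new SharedArrayBuffer(4);
    const counter = new Int32Array(sab);

    // "Clock" thread: spin forever, bumping the shared counter.
    function clockThread(): never {
      for (;;) Atomics.add(counter, 0, 1);
    }

    // Another thread can now time an operation far more precisely than a
    // coarsened Date.now() allows, by diffing the counter around it.
    function timeOperation(op: () => void): number {
      const start = Atomics.load(counter, 0);
      op();
      return Atomics.load(counter, 0) - start; // elapsed "ticks"
    }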


Thanks for the detailed explanation, makes complete sense.

I suppose that means that projects such as python-wasm [0], which are porting other language runtimes to WASM, will have to go the Asyncify route for CF Workers, with all the overhead that entails?

From the looks of it they don't think asyncifying Python will be possible in the near to medium term [1].

0: https://github.com/ethanhs/python-wasm

1: https://github.com/ethanhs/python-wasm/issues/12


If this becomes a big enough problem, we might be able to build a work-around into the Cloudflare Workers Runtime itself so that it can support synchronous waits. It'll take some hacks but I think it's doable, without actually supporting threads.


Essentially doing what Asyncify is doing as part of the platform?


I'd probably use fibers, like node-fibers used to do. The problem is V8 stopped supporting fibers a few versions ago. We'd have to hack support back in. That's something we probably could get done but I'm not exactly excited about it. :)
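(For readers wondering what Asyncify actually does: it instruments the WASM so the stack can be unwound at an import call, control returns to JS to await the async work, and the stack is then rewound. In Emscripten it surfaces roughly like this -- a hedged sketch, where fetchBlock is a hypothetical async function:)

    // Emscripten JS library function (sketch): the compiled code calls
    // read_block_async() as if it were synchronous; Asyncify pauses the
    // WASM until wakeUp() fires, with no threads involved.
    mergeInto(LibraryManager.library, {
      read_block_async: function (blockId) {
        return Asyncify.handleSleep(function (wakeUp) {
          fetchBlock(blockId).then(wakeUp); // fetchBlock is hypothetical
        });
      },
    });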


> we'd only be able to offer real "edge compute" to a small number of big enterprises with deep pockets, rather than at prices affordable to everyone

What a nice way to formulate the trade-offs between cost and security. When Workers came out with huge headlines about performance and cost, I was very disappointed to find out there was no special technical wizardry behind that, just conscious trade-offs made at the expense of customer data security. Workers simply skips all the sandboxing steps other providers take to implement a multi-tenant application runtime in a secure manner. So far the mitigations in place seem to make the possibility of Cloudflare customers getting bitten by a V8 vulnerability unlikely rather than impossible.

Sure, the platform has interesting ideas and I'm looking forward to trying it out as a full-stack serverless platform. I just cannot foresee running anything serious on it before they come up with a more convincing security story.


Other providers run attacker-provided native code directly on hardware, deeply relying on bug-free silicon for their whole security model to work. I honestly think that's far more precarious than what Workers is doing.


I built something very similar for using DynamoDB as a VFS backing store for sqlite: https://github.com/psanford/donutdb


Whenever SQLite comes up, always have to link to this post about using SQLite in production: https://blog.wesleyac.com/posts/consider-sqlite :)

I'm running https://extensionpay.com off SQLite and a $5/month DigitalOcean box and it's serving around 3 million requests a month with absolutely no issues and seriously low CPU load. I'm kind of astonished, frankly.


> 3 million requests a month with absolutely no issues and seriously low CPU load

That's only around 1 request per second on average, so I can't say I'm surprised. Even if traffic is spiky, that leaves you with a lot of headroom!


Yeah, thanks for putting that in perspective! ExtensionPay is still relatively new, too, so I'm glad there's plenty of headroom :)


A more useful metric for the parent comment to include would be peak throughput (ops/sec) versus CPU or IO load.


When I develop server software I always include support for both SQLite and PostgreSQL so I can run the server entirely stand-alone if I want to. Choosing between the embedded database and connecting to PostgreSQL is a command-line switch (see the sketch below).

I have been using SQLite in an IoT application where the measurement data of a long-running test instance amounts to 65 million rows in the main data table and a 3.6 GB file... and the response time running queries against the database isn't bad. I'm starting to question if I actually need PostgreSQL for a lot of situations.
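(A minimal sketch of that command-line switch, assuming Node with the better-sqlite3 and pg packages; the flag, table, and function names are illustrative, not the parent's actual code:)

    import Database from "better-sqlite3";
    import { Client } from "pg";

    interface Store {
      getUser(id: number): Promise<unknown>;
    }

    async function openStore(target: string): Promise<Store> {
      if (target.startsWith("postgres://")) {
        const client = new Client({ connectionString: target });
        await client.connect();
        return {
          getUser: async (id) =>
            (await client.query("SELECT * FROM users WHERE id = $1", [id])).rows[0],
        };
      }
      // Anything else is treated as a path to an embedded SQLite file.
      const db = new Database(target);
      return {
        getUser: async (id) => db.prepare("SELECT * FROM users WHERE id = ?").get(id),
      };
    }

    // e.g. `node server.js --db app.db` or `node server.js --db postgres://localhost/app`
    // (ESM entry point, so top-level await is available)
    const store = await openStore(process.argv[process.argv.indexOf("--db") + 1]);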


We've been using it for https://internetdb.shodan.io and it's been doing a few thousand lookups/second without problems.


Do you have any tips for load-balancing or horizontal scaling of SQLite if you need a 2nd server for whatever reason?


https://github.com/rqlite/rqlite provides a form of horizontal scaling for reads, but it's not a drop-in replacement for SQLite.

https://github.com/rqlite/rqlite/blob/master/DOC/READ_ONLY_N...

(I'm the author of rqlite)


I don't.

My current understanding is that even though there are sharding / replication projects out there for SQLite, it's usually a better idea to just go with a typical database server when you need many machines. But a lot of sites never get to that point, especially given how fast servers and SSDs are these days, so SQLite seems like it can get you further than you might realize.


> But a lot of sites never get to that point, especially given how fast servers and SSDs are these days, so SQLite seems like it can get you further than you might realize.

SQLite on top of NVMe with WAL mode enabled is about as fast as it gets. We've been doing this in production for years now.

We don't have a scenario where we would find ourselves writing more than 2 gigabytes per second to the DB/disk, nor do we think we will ever encounter one with our product, so we have committed ourselves to a single-instance SQLite architecture. It saves a lot of time and frustration when you can accept your circumstances and go all-in on simpler solutions.
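(For anyone wanting to replicate that setup, the WAL switch is a one-line pragma. A sketch using the better-sqlite3 package -- an assumption; any SQLite driver exposes the same PRAGMAs:)

    import Database from "better-sqlite3";

    const db = new Database("app.db");
    db.pragma("journal_mode = WAL");   // persists in the file; readers stop blocking the writer
    db.pragma("synchronous = NORMAL"); // common pairing with WAL: fewer fsyncs, still durable enough for most apps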


I asked OP this and would appreciate your input as well.

What tool do you use to access production database remotely? Say you need to browse some data, perform queries. Is there something like pgAdmin?

I'm interested in SQLite and I'd like to know more.


We do not permit direct access to any application databases across any network. If something in one of our SQLite databases needs to be adjusted in an operational context, it needs to be done by way of some administrative interface that we have integrated into the product.

For offline troubleshooting/QA/Dev, we would simply grab a copy of the database from wherever and we would analyze it using DB Browser for SQLite. We actually use this tool so often that I felt it necessary to contribute to the author's Patreon account. It really is an amazing tool. Try reading/editing a binary field containing an image or other file in SSMS...

Having extreme discipline with your schema, queries, indexes, etc. is critical for being able to operate like this. If you are uncertain of what should live in what table, you probably just want to work against a hosted database engine until the whole team can agree on one standard path that you can burn into your codebase.

We did NOT start with SQLite for all the things. We had to make a lot of mistakes with hosted SQL providers before getting to this level of confidence.


> We actually use this tool so often that I felt it necessary to contribute to the author's Patreon account.

Thanks heaps for that btw!

Because of our Patrons, we've recently been able to buy an M1 Mac Mini. We should be able to start releasing ARM-based macOS builds in the near-ish future. :)


My application Datasette can serve this need: https://datasette.io/

Lots of live demos - two of my favourites are https://github-to-sqlite.dogsheep.net/ and https://global-power-plants.datasettes.com/


I just scp the whole database to my local machine and use "DB Browser for SQLite" (a Mac app) :)

Would love to hear if there are other server-based db browsers for SQLite, though.


Cool, what's your setup / workload look like?


I just learned about ZeeSQL yesterday, which is SQLite + Redis. It's proprietary, but if I understand it correctly (the marketing is terribly unspecific about their primary goal), it lets you scale SQLite across all your Redis nodes.

https://zeesql.com/


I'll be honest, I would consider running SQLite in production, at least for my personal stuff, but the part that scares me is that there's no "managed SQLite" offering and I would have to manage it myself. Litestream seems like a great option for people like me, but I'd like to hear some experience reports on it, especially on more complex cases, and on how it fails.

Another problem (which Litestream might solve) is that SQLite is not compatible with Heroku, which I use since I don't have much ops/sysadmin experience. I feel like people saying that SQLite makes things easier often already have some experience in sysadmin/ops, which isn't my case.

This is not a criticism of SQLite in any way, just an expression of my own limitations and why it makes it harder for me to consider SQLite. I'll gladly take any learning resources about that.


Litestream author here. I'm working on a serverless option that'll run as a managed service later this year. I think there's a huge opportunity to offer a simple option for folks running on ephemeral platforms like Heroku, Google Cloud Run, & Cloudflare Workers.


Ooh hi Ben, thanks so much for Litestream! Can you say more about the service you're describing and how it will work?


That's nice to know, I'll keep an eye out for when it comes out.


The conservative approach — I like it! Yeah it's not for everyone but it's a real delight if your needs align with what SQLite offers.

> there's no "managed SQLite" offering

It's just a file :) You can copy it to a remote server or use Litestream or whatever you want to do with a file.

> SQLite is not compatible with Heroku

Yeah, that's pretty annoying. Apparently fly.io is similar to Heroku but lets you use persistent disks — I've never used it though.

I'd love to see more "managed app server + SQLite" options, since a lot of the time the "managed database server" option from cloud hosts is kind of expensive. Send emails to Heroku and DigitalOcean and Google Cloud Platform. Tell the world!


I see how SQLite being just a file makes it easy to do lots of things from a developer's point of view. However, I'm looking at this from a "business owner" point of view, at which point it's not just a file, it's my clients' data. That's a bit scary for me, so I have a bias for "reassuring stuff" like managed databases with automated backups. This is why I'm looking for more info on "the hard parts" of managing SQLite. I'm conscious that this bias might not be based on reality, especially since I don't know MySQL or PostgreSQL well, especially how they store data in files. But as a relatively young developer, I've heard a lot that "data should go in the database" and that databases are big scary things that should be managed by people with lots of experience and knowledge in that role.

I'll take a look at fly.io, thanks.


Just to add to this: offering managed SQLite services isn't really much of a thing because essentially, as it is just a file, there isn't much to manage.

With databases like Postgres you have daemons, config files and more that need an experienced DB admin to configure and run. All of that complexity grows considerably when you add scaling across multiple machines.

With SQLite you essentially need backups. That's about it. A managed SQLite service would just be something along the lines of:

"Tell us which file is your database; we back it up, and if you need it restored, press this button."


That's a good point. I think this is my lack of experience showing here, mixed with lots of marketing aimed at people like me. After all, you can sell a managed Postgres for a lot more than you can sell a managed SQLite.


Well, maybe to you a "managed database server" feels more reassuring, which is fine, but I run my business off SQLite, which I've found insanely simple to set up and maintain, so that is more reassuring to me.

And just an aside: I think if you looked into it a bit, you'd probably find that while there is a lot of nuance to database fine-tuning, you can get most of the benefit without needing to be an expert.


If you want an even-earlier-published article to point to as well, this has been mine: https://crawshaw.io/blog/one-process-programming-notes

There's also the venerable "Taco Bell Programming": http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra....

It all boils down to Unix principles, I guess: use the simplest (but not simplistic) tools for the job.


"always have to link" to a post that was written less than a month ago?


Yes indeed! Even though it's only been a month I keep posting about it in HN threads about SQLite and people keep going "wow I never thought about that!"


If you set up SQLite properly (Remote Clients <-> RESTful API <-> DB) and use WAL mode, you can run a DB-backed website fairly cheaply. It will be performant, reliable, and simple to maintain as well.
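(A minimal sketch of that shape -- clients hitting a small REST layer that owns the SQLite file -- assuming the express and better-sqlite3 packages; the route and schema are illustrative:)

    import express from "express";
    import Database from "better-sqlite3";

    const db = new Database("app.db");
    db.pragma("journal_mode = WAL"); // concurrent readers while one writer proceeds

    const app = express();
    app.get("/posts/:id", (req, res) => {
      const row = db.prepare("SELECT * FROM posts WHERE id = ?").get(req.params.id);
      row ? res.json(row) : res.sendStatus(404);
    });
    app.listen(8080); // the API process is the only thing touching the file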


Hi. What tool do you use to access production database remotely? Say you need to browse some data, perform queries. Is there something like pgAdmin?

I'm interested in SQLite and I'd like to know more.


Copying and pasting my answer from another subthread:

I just scp the whole database to my local machine and use "DB Browser for SQLite" (a Mac app) :)

Would love to hear if there are other server-based db browsers for SQLite, though.


scp is the way I implemented querying a remote SQLite database under the hood in my data IDE. Now that I think of it, though, I should probably add some clarification to the docs or the UI that you're querying a local copy of the database and that your edits won't affect the original.

https://github.com/multiprocessio/datastation


CloudBeaver works great for me, though I'm not doing anything intense.

https://github.com/dbeaver/cloudbeaver/


It's SQLite, so you would just copy the file and download it to a local machine for analysis with any number of tools. Presumably your setup would already be doing this as a form of backing up the data periodically.


You can use the sqlite3 command line on the server. There are lots of GUI tools if you copy the file locally. Here's one I use: https://sqlitestudio.pl


> Is there something like pgAdmin?

A fairly well-used GUI is https://sqlitebrowser.org (a project I help out with personally).


You can do that with adminer https://www.adminer.org/


How do you get replication?


I don't. I have the one server with streaming DB backups. Although apparently Litestream might include a way to do live SQLite replication soon.


How do you do encryption, authentication and authorization?


I'm not sure what that has to do with my choice of database. I just do it the way every web app does those things.


How do you manage backups of your database and how often?


I do streaming updates with Litestream: https://litestream.io/

I recently set it up after putting it off for a while, and it took like 5 minutes. It was great. And the backups are currently costing me 0 cents per month on Backblaze.
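(For a sense of what that 5-minute setup involves, a hedged sketch of a Litestream config -- the path and bucket are hypothetical; see litestream.io for the authoritative format:)

    # /etc/litestream.yml
    dbs:
      - path: /var/lib/app/app.db
        replicas:
          - url: s3://my-bucket/app.db

With that in place, running `litestream replicate` tails the WAL and continuously streams changes to the bucket.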


Along these lines: if you're looking for a way to sync your SQLite DB to an S3-like object store, there is https://litestream.io/. It creates a snapshot of your DB on S3 and then handles checkpointing of the WAL.


Yes, love Litestream! It felt like the missing piece for replacing Postgres or MySQL with SQLite.


Does anyone know if it's possible to use Litestream with SQLite running on Cloudflare Durable Objects?

I suppose you'd have to compile Litestream to WASM, or have Litestream running on a different machine and reading the SQLite file from the Durable Object?


Litestream author here. I haven’t looked at Durable Objects yet so I don’t think Litestream would currently work with it. I may add support in the future but I’m adding separate serverless functionality to be released later this year. That’ll allow Litestream to run on Cloudflare Workers or GCR or other ephemeral platforms like those.


Haha check the original post, my friend


That is something I was looking for.


I really think I want a "JSON" format for SQLite: a strict subset of the SQLite file format that can be generated fairly trivially by other applications and read by SQLite. There are apps that produce JSON that would be better served by something I can run search queries on; jq is nice, but it's not SQLite.

Is there a protobuf implementation of the SQLite file format out there somewhere?


I am having trouble with the mapping. I presume this touches on the ORM impedance problem? Unless SQLite dictates the JSON format. But then it probably won't be the JSON you'd like to deal with?


PostgreSQL has excellent JSON support today, including advanced queries, indexing, constraints. There are many cheap/free managed offerings that could rival SQLite in operational simplicity.

If you want a SQL interface to JSON with room to grow, that's your best bet IMO.
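(A sketch of what that looks like in practice, via the pg package; the table and JSON shape are illustrative. jsonb columns support containment queries, and a GIN index makes them fast:)

    import { Client } from "pg";

    async function main() {
      const client = new Client({ connectionString: process.env.DATABASE_URL });
      await client.connect();
      await client.query(
        "CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body jsonb NOT NULL)"
      );
      // GIN index accelerates @> containment and ? key-existence queries.
      await client.query(
        "CREATE INDEX IF NOT EXISTS docs_body_idx ON docs USING gin (body)"
      );
      const { rows } = await client.query(
        "SELECT body FROM docs WHERE body @> $1::jsonb",
        [JSON.stringify({ status: "active" })]
      );
      console.log(rows);
      await client.end();
    }
    main();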


> There are apps that produce JSON that would be better served by something I can run search queries upon...

https://duckdb.org/ ?


I actually think this is onto something that I'm finding in a different way. Instead of one massive database, what if we had a key-value store mapping keys to tiny databases?

This is, to some degree, what I'm building over at http://www.adama-lang.org/ without a full SQL engine. Each document has tables, and the tables can be indexed. I have yet to find a use case (in my domain) which requires joins. HOWEVER, I've had a ton of fun building it and I'm getting ready to start making games.

I do believe it would be amazing to have a key-to-log service where a reducer like SQLite/Adama could come in to collapse the log into a single file.

The closest I see is from the Boki paper ( https://www.cs.utexas.edu/~zjia/boki-sosp21.pdf ) which was presented at SOSP21.


I see the bundle size after `npm install -S @rkusa/wasm-sqlite` is a little over 4 MB. To my current knowledge, the limit is 1 MB for Workers. A few months ago, at one of the companies I work with that uses CF extensively, I asked for a limit increase; it took almost a month and we were given 2 MB. So my question is: what limit increase should one ask for? What was your experience with that, and what do I need to ask for if I want to run this? Other than that, great job, and thanks a lot for sharing with the community. Until now I was trying to get DuckDB to run in a Durable Object (they also have a WASM bundle); now I have an excellent alternative. Thanks again!


SQLite VFS on browser IndexedDB: https://jlongster.com/future-sql-web


WebSQL could have been...



