The first prototype implementation of dqlite was in Go, leveraging the hashicorp/raft implementation of the Raft algorithm. The project was later rewritten entirely in C because of performance problems caused by the way Go interoperates with C: Go treats a function call into C that lasts more than ~20 microseconds as a blocking system call, puts the goroutine running that C call in a waiting queue, and resuming it effectively causes a context switch, degrading performance (since there were a lot of these calls happening). See also this issue in the Go bug tracker.
The added benefit of the rewrite in C is that it's now easy to embed dqlite into projects written in effectively any language, since all major languages have provisions for creating C bindings.
They could also use D or Rust for this. If the borrow checker is too much, Rust can still do automatic memory management (e.g. reference counting) with the other benefits remaining. Both also support letting specific modules be unsafe where performance is critical.
That's not what the parent was asking. They were asking if WASM has an FFI interface for runtimes that don't want to execute the WASM code entirely sandboxed from the host, but rather want to allow you to dlopen(3) code into the OS process hosting the WASM interpreter and call it through WASM ops.
Genuine question here: if you’re already going to be shipping native code, why would you need WebAssembly in the mix, especially if you’re not targeting a browser?
For example, if you want to execute "native" code in a restricted environment.
WebAssembly is quite fast when JIT-compiled. Therefore, it's very tempting to use it to embed extension logic.
I don't want to flame, but I did find it curious they went with C rather than Rust. In my experience the transition is straightforward and the string handling (particularly with unicode encodings) is way better (in addition to the normal ownership benefits), and the result (an easily linkable library exposing a C ABI) is roughly the same.
I think it's pragmatism. They're working with two libraries that are already C. Also, as mentioned in another thread, type data could bloat things a little, though I think that specific statement is hyperbolic.
My guess is their main goal was to use this in Go, where they already have experienced Go and C developers, and adding a third language would muddy things.
Well they are patching SQLite and SQLite is written in C. So they'd have to maintain a C path, Rust code, and another layer of C API interface for client bindings, if I understood their architecture correctly.
A) disk space seems like a reasonable tradeoff for many situations, especially when binaries typically aren't the source of data consumption
B) I don't think I've seen a Rust binary more than 10 megs. Cargo, rustc, and ripgrep are all about 6 MB on my disk; fd is 2.5 MB. These seem like reasonable stand-ins for a binary of significant size and complexity. sqlite3 itself is about 1.3 MB.
C) Dqlite doesn't seem particularly concerned with disk-constrained systems, though I may be interpreting their site incorrectly, and the low footprint should be equally achievable with the rust runtime—surely the database itself would be a much larger concern.
This just seems like an unusually good fit for the benefits of the language: reliable client glue you can import into many runtimes, where being able to prove data flow would be a strong defensive coding pattern. That said, I think C is a good, conservative approach here; I'm certainly not knocking anyone's judgement. Overall the parent poster is absolutely correct: there's a strong correlation between use of Rust's type system and the size of the output code.
2.5 MiB is about what you expect the kernel size to be for an embedded device :)
Everything is an embedded device nowadays, so for reference: if you buy a WiFi AP today and open it up, you're likely to find an 8 or 16 MiB NOR flash inside, maybe a 128 MiB NAND flash (with realistically 64 MiB of usable space, since it will be doing A/B updates).
I don't think the database size is a big concern. For me the focus in dqlite is very much on the 'd' - you store atomic configuration data in there, it's not about throughput.
> Dqlite is a fast, embedded, persistent SQL database with Raft consensus that is perfect for fault-tolerant IoT and Edge devices.
Seems like, at least for embedded devices, you'd want something as small as possible so as to avoid consuming all available disk (not to say any other language will balloon it significantly or not).
There's one more big distinction: rqlite's replication is command-based [0] whereas dqlite is/was WAL frame-based; basically, one ships the command and the other ships WAL frames. This distinction means that non-deterministic commands (e.g. `RANDOM()`) will behave differently.
It looks like dqlite's documentation has changed -- for some reason frames are no longer mentioned anywhere[2]. So maybe this isn't the case any more, but this was once the biggest differentiator for me.
Ahhhh thank you -- that information just got pushed into the FAQ -- I was thinking "surely they didn't just remove this information" but didn't look hard enough at all. Direct link:
If you do command-based replication, an insert or update that uses RANDOM() would have to be handled differently, lest you have differing values due to each member of the cluster evaluating and producing different values. (Anything that is an impure function basically will have that problem)
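To make the divergence concrete, here is a small sketch in plain Python using the stdlib sqlite3 module. Two in-memory databases stand in for two cluster members that each replay the same SQL statement, as command-based replication would have them do; this is an illustration, not actual rqlite or dqlite code.

```python
import sqlite3

# Two in-memory databases standing in for two cluster members that each
# replay the same SQL command (command-based replication).
node_a = sqlite3.connect(":memory:")
node_b = sqlite3.connect(":memory:")

stmt_create = "CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)"
stmt_insert = "INSERT INTO t (v) VALUES (RANDOM())"

for node in (node_a, node_b):
    node.execute(stmt_create)
    node.execute(stmt_insert)  # each node evaluates RANDOM() independently

va = node_a.execute("SELECT v FROM t").fetchone()[0]
vb = node_b.execute("SELECT v FROM t").fetchone()[0]

# With frame-based replication only the leader would evaluate RANDOM();
# the already-materialized WAL frames (concrete bytes) get shipped, so
# replicas cannot diverge this way.
print(va, vb)  # almost certainly two different 64-bit values
```

The two values differ with overwhelming probability, which is exactly the divergence that shipping WAL frames avoids.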
This is developed by the LXD team for its cluster. We use it in production as part of an LXD cluster. Initially there were some issues, but now it can support thousands of nodes in a cluster easily in our regression tests.
It's good they made it a separate project that can be used independently of LXD containers.
It includes an answer about the difference from rqlite.
Reading the docs, it seems like dqlite was developed by the team behind LXD at Canonical: LXD is listed as the biggest user of the project, and the author's GitHub says he works at Canonical on LXD/LXC.
Interesting project, good luck to the author/authors if you read this!
The one annoying thing about SQLite is that there is no easy way to change the table structure. Adding/Removing/Renaming columns is super complicated and afaik there is no good command line tool that does it for you.
That is the primary reason why I do not consider it for new projects. It's just too slow to iterate on.
That's a hell of a reason not to use sqlite. Staging data in a temporary table while a table is dropped, recreated, and then data is reinserted is not much of an inconvenience.
That gets more cumbersome if the table has indexes (you will have to create them on the new table), and even more cumbersome if foreign keys point to it (you will have to drop them before step 3 and recreate them after step 4)
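For reference, the rebuild dance (including recreating an index) looks roughly like this. This is a sketch using Python's stdlib sqlite3 module; the `users` table, its columns, and the index name are made up for the demo.

```python
import sqlite3

# Sketch of the manual "rebuild" dance SQLite historically required to
# drop a column (table/column/index names are hypothetical).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, obsolete TEXT);
    CREATE INDEX users_name_idx ON users (name);
    INSERT INTO users (name, obsolete) VALUES ('alice', 'x'), ('bob', 'y');
""")

db.executescript("""
    PRAGMA foreign_keys = OFF;  -- if FKs point here, disable checks first
    BEGIN;
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users_new (id, name) SELECT id, name FROM users;
    DROP TABLE users;           -- this drops users_name_idx with it
    ALTER TABLE users_new RENAME TO users;
    CREATE INDEX users_name_idx ON users (name);  -- indexes must be recreated
    COMMIT;
    PRAGMA foreign_keys = ON;
""")

rows = db.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'alice'), (2, 'bob')]
```

Every index, trigger, and view on the table has to be recreated by hand after the rename, which is where the boilerplate adds up.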
“Copy all data” also can be difficult if the table has data that the database created that must stay the same because you use it elsewhere. That shouldn’t be a problem with SQLite, as it doesn’t allow rowid as foreign key, but if you use it as a foreign key outside the database, or use the hash of a full row to detect changes, it may still bite you.
It also may mean being offline for a significant amount of time, but that is often (effectively) also the case for databases that support deleting columns.
Which adds up to dozens to hundreds of lines of code that need to be maintained in each project you use SQLite with, vs. the one-liner of SQL that would be required with a database that supported it.
What about database migrations? My app which uses a SQLite database needs to store an additional column for a table. Now I have to write a bunch of custom code to migrate it.
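That custom code can stay fairly small. Here is a minimal sketch of a migration runner keyed off SQLite's user_version pragma; the table and migrations themselves are hypothetical examples, not from any particular app.

```python
import sqlite3

# Hypothetical ordered list of schema migrations; index i corresponds to
# user_version i+1. The second entry adds the "additional column".
MIGRATIONS = [
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)",
    "ALTER TABLE notes ADD COLUMN created_at TEXT",
]

def migrate(db: sqlite3.Connection) -> None:
    # user_version starts at 0 for a fresh database.
    current = db.execute("PRAGMA user_version").fetchone()[0]
    for version, stmt in enumerate(MIGRATIONS[current:], start=current + 1):
        db.execute(stmt)
        db.execute(f"PRAGMA user_version = {version}")  # record progress
        db.commit()

db = sqlite3.connect(":memory:")
migrate(db)  # applies both migrations
migrate(db)  # idempotent: nothing left to apply
cols = [row[1] for row in db.execute("PRAGMA table_info(notes)")]
print(cols)  # ['id', 'body', 'created_at']
```

Adding a column then becomes one new entry in the list, and the runner only applies migrations the database hasn't seen yet.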
I really like the design of this website. It's simple, information-rich, fast, and doesn't contain a ridiculous number of images or dynamic components. It's a shame that I can only say this for a select few websites these days.
Yet it's still loaded with hundreds of KB of custom fonts, because the designers would rather I look at a blank page for a couple seconds than gaze upon their design with a typeface that isn't exactly the same one they have on their computer.
My resolution, window size, color settings, text zoom, font rendering, etc, are almost certainly different, too, but at least they've made the page more than twice as slow by forcing the correct font.
What can this be used for (example use cases)? Is there 24/7 support available? How long has it been around and is there a commitment to long term releases?
Thanks, those are important items for me especially when recommending a new technology to a client and/or boss, hopefully others will find it useful as well.
I used Dqlite for a side project[1], which replicates some of the features of LXD. Was relatively easy to use, but Dqlite moves at some pace and trying to keep up is quite "interesting". Anyway once I do end up getting time, I'm sure it'll be advantageous to what I'm doing.
Oh I had no idea somebody was using it in the wild! It has been unstable until now, we just released v1.0.0 yesterday. So no more public API breakage from now on.
Hey, free-ekanayaka, a few more questions for your FAQ if you're still paying attention:
Does this store the entire log for all time? When you bring up a new node, does it replay the entire history? If not, how do you bring up a new node without data?
How does backup/restore work?
How do upgrades work? Is the shared WAL low-level enough that it's 100% stable/compatible between sqlite/dqlite versions? If not, what happens if half your cluster is on the old version while you're upgrading, and sees things it doesn't understand yet?
Is it possible to encrypt node/node traffic? Or can you easily send the node-node traffic over a proxy, like Envoy? How about over a unix domain socket or "@named" unix domain socket (which we use for Envoy here at Square)
Linux does support async I/O to disks using various other interfaces/approaches - it’s just that the classic approach of select()’ing on non-blocking FDs doesn’t work for disk:
Support for async disk I/O in Linux differs depending on kernel version and file system type, but it is possible to get 100% async I/O with io_submit(), and dqlite will leverage that if detected.
There is now a new async I/O API available in Linux (I'm not remembering the name right now, but it was developed by folks at Facebook). It looks promising so I'll check it at some point. (dqlite author here)
It's not a buzz term. It's really fully async disk I/O. Dqlite does not use SQLite's stock vfs implementation for writing to disk, as it's an entirely different model (based on raft).
> Q: How does dqlite behave during conflict situations? Does Raft select a winning WAL write and any others in flight writes are aborted?
> A: There can’t be a conflict situation. Raft’s model is that only the leader can append new log entries, which, translated to dqlite, means that only the leader can write new WAL frames. So any attempt to perform a write transaction on a non-leader node will fail with an ErrNotLeader error (and in this case clients are supposed to retry against whoever is the new leader).
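The retry pattern the FAQ describes can be sketched like this. These are stand-in Python classes (a toy cluster, a NotLeaderError carrying a leader hint), not the real dqlite client API.

```python
# Toy model of "writes fail on non-leaders; clients retry against the leader".
class NotLeaderError(Exception):
    def __init__(self, leader):
        self.leader = leader  # hint: who the current leader is

class Node:
    def __init__(self, name, cluster):
        self.name, self.cluster = name, cluster
        self.log = []

    def execute(self, stmt):
        if self.cluster.leader is not self:
            raise NotLeaderError(self.cluster.leader)
        self.log.append(stmt)  # only the leader appends new log entries

class Cluster:
    def __init__(self):
        self.nodes = [Node(n, self) for n in ("a", "b", "c")]
        self.leader = self.nodes[1]  # say node "b" won the election

def execute_with_retry(node, stmt, max_retries=3):
    """Retry a write against whichever node the error names as leader."""
    for _ in range(max_retries):
        try:
            node.execute(stmt)
            return node
        except NotLeaderError as err:
            node = err.leader  # redirect to the (possibly new) leader
    raise RuntimeError("no stable leader")

cluster = Cluster()
writer = execute_with_retry(cluster.nodes[0], "INSERT INTO t VALUES (1)")
print(writer.name)  # 'b'
```

Reads can be served by any node (subject to staleness concerns), which is why only the write path needs this redirect-and-retry logic.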
Correct me if I'm wrong, but isn't that essentially the same limitation WAL mode in normal sqlite has? With WAL you can have as many reads going on as you like, in parallel to a single write. That seems directly comparable to what the dqlite FAQ says, unless I'm missing something.
Yes, you have to use a forked SQLite in order to make use of dqlite.
I believe they (the LXD team) are working on upstreaming the WAL changes but due to SQLite's very strong compatibility guarantees they want to be very certain the API and protocol are correct before carving it in stone. Not to mention they are the only major users of the feature, so more widespread use would also be nice before merging it upstream.
They would have to, wouldn't they? The Sqlite VFS subsystem is the obvious place to intercept the usually local DB and WAL reads/writes and make them distributed functions that use Raft.
anywhere you can't be bothered to run up a full database server and don't necessarily mind losing a few features.
Historically SQLite has been THE solution for data storage in embedded systems (think IoT), so in this case imagine an IoT configuration with multiple nodes.
So an easy use case that springs to mind is any sort of distributed IoT device that needs to track state: any industrial or consumer monitoring system with a centralised controller could use this for data storage. Specifically, since this enables the use of multiple nodes for high throughput, imagine many, many, many sensors and a central controller streaming real-time data.
Depending on some details, it would be perfect for storing state for a libvirt cluster management tool I'm working on. Concerns are whether I can read the data on disk using the regular sqlite3 CLI tool, and the lack of Rust bindings.
8 years ago or so I was involved in building a cloud platform. The very first version of the design for keeping the VM and storage allocation metadata synced across the cluster involved syncing sqlite databases (which we moved off once we realised we'd pretty much have to invent something a lot like raft to make it work). If this had existed then, we'd just have picked it off the shelf.
Some thoughts: A consensus protocol is like 1/50th of what you need for a stable, reliable distributed database, and it's developed by a company, so expect it to be abandoned once they stop developing it. I wouldn't use it at work (yet) but could be fun for personal projects.
I wouldn't paint Canonical with that generalization. It's not unheard of that they have dropped projects, but I wouldn't say it's common. But looks like their primary use is LXD, which doesn't seem to be going anywhere...
But the main point made by the parent is entirely correct: the biggest issue isn’t that of implementing a consensus protocol: the biggest issue is the reconfiguration of the cluster, management of the dead/live nodes, addition of extra nodes for replacement, copying of the data before reconfiguration.
Indeed. I was reading the dqlite page thinking "where is the monitoring endpoint to tell if the cluster is healthy or degraded?" Too often that seems like an afterthought if it's thought of at all.
I have a teeny, tiny cluster using MySQL+galera as a multi-master cluster, but it took a while to iterate to monitoring that tells me when one node is unhealthy and getting the correct repair and restart procedures.
Yep! And you need someone to help fix bizarre bugs in core that only crop up in your own weird environment. With support that's quick & easy, but otherwise you have to form your own dev team to specialize in it, making it costlier.
It doesn't really matter who wrote it, it's a trope of corporate software development. A small team makes project X to support project Y. They go through the usual dev + production + maintenance cycle, which takes 2-3 years typically, sometimes 5, after which the team is disbanded/reorganized, and no new dev work happens on the old projects. The project is effectively abandoned at that point, unless it happens to have picked up enough users that "a community" forms and picks up development... but that's rare, because corporations don't want to give development of their project over to randos on the internet, especially if they're still using it. The best case is it would fork, or move to some other org's code repo.
I like to use projects which lots of other projects depend on directly. That way if the main project goes unsupported, all the other projects using it will band together to support a fork. I believe open source that is not created for a company will last much longer. (I like that they rewrote it in C, though; it would probably survive well as a fork if enough people/projects use it)
Since the author is in the comments, what are you planning to do about operations: keeping consistent performance while adding/removing/resyncing nodes, rebalancing, dealing with bitrot, disk errors, disk performance issues, filesystem issues, dealing with unstable network performance, etc.? It doesn't look like there is anything to address operations in the code at the moment.
Depends on how fast your disk is, what file system and kernel you use, and how low your network latency is. Difficult to predict. But it's basically as fast as it can get given 1) hardware constraints and 2) Raft consensus.
If you want light-speed inserts/deletes, you could probably skip the disk entirely: as long as a majority of your nodes don't die, you won't lose any data. You can also go somewhere in between and save to disk only at specific intervals.
I just mean that Raft nodes are already read-only for most of their lives, so it's not clear to me why you'd want nodes dedicated as read-only; it would only reduce consistency guarantees (and I believe you wouldn't be avoiding the heartbeat and data-consistency network chatter). I think the leadership functionality comes cheap.
https://github.com/canonical/dqlite/blob/master/doc/faq.md
Why C?