Dqlite – High-Availability SQLite (dqlite.io)
463 points by stubish on Aug 30, 2019 | 115 comments



Interesting nugget about golang from their FAQ:

https://github.com/canonical/dqlite/blob/master/doc/faq.md

Why C?

The first prototype implementation of dqlite was in Go, leveraging the hashicorp/raft implementation of the Raft algorithm. The project was later rewritten entirely in C because of performance problems due to the way Go interoperates with C: Go considers a function call into C that lasts more than ~20 microseconds to be a blocking system call; in that case it puts the goroutine running the C call in a waiting queue, and resuming it effectively causes a context switch, degrading performance (since a lot of these calls were happening). See also this issue in the Go bug tracker.

The added benefit of the rewrite in C is that it's now easy to embed dqlite into projects written in effectively any language, since all major languages have provisions to create C bindings.
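As a tiny illustration of that point (a hypothetical two-line library, not dqlite code): anything exposing a plain C ABI can be loaded from virtually any language's FFI:

    /* mini.c -- build as a shared library: cc -shared -fPIC -o libmini.so mini.c
       Any language with a C FFI (Python ctypes, Java JNI, Go cgo, Rust extern
       blocks, ...) can then load libmini.so and call mini_add directly. */
    int mini_add(int a, int b) {
        return a + b;
    }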


They could also have used D or Rust for this. If the borrow checker is too much, Rust can still do automatic memory management with the other benefits remaining. Both also support letting specific modules be unsafe where performance is critical.


> since all major languages have provisions to create C bindings.

Does WebAssembly or any of its runtimes provide a way to do this?


You can probably compile C to WebAssembly using clang/LLVM.


That's not what the parent was asking. They were asking whether WASM has an FFI interface for runtimes that don't want to execute the WASM code entirely sandboxed from the host, but instead want to allow you to dlopen(3) code into the OS process hosting the WASM interpreter and call it through WASM ops.

Presumably, https://github.com/CraneStation/wasmtime would benefit from such an FFI API being specified.
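For readers unfamiliar with the mechanism: the dynamic-loading path being described is dlopen(3)/dlsym(3). A minimal host-side sketch, where "plugin.so" and "entry" are hypothetical names:

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void) {
        /* Load native code into the host process -- exactly what a sandboxed
           WASM runtime normally refuses to do. */
        void *handle = dlopen("./plugin.so", RTLD_NOW);
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        int (*entry)(int) = (int (*)(int))dlsym(handle, "entry");
        if (!entry) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        printf("entry(42) = %d\n", entry(42));
        dlclose(handle);
        return 0;
    }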


That sounds like ActiveX 2.0 at best, and an utter security nightmare at worst.


Precisely. Do you think this is a feature the WebAssembly team should standardize or is it a bad fit for the language?


That should be doable! (Disclaimer: I work at Wasmer https://github.com/wasmerio/wasmer )


Genuine question here: if you’re already going to be shipping native code, why would you need WebAssembly in the mix, especially if you’re not targeting a browser?


For example, if you want to execute "native" code in a restricted environment. WebAssembly is quite fast when JIT-compiled, so it's very tempting to use it for embedding extension logic.


Yes, emscripten/LLVM builds C code into WebAssembly. I'd say a significant portion of WebAssembly out there is C/C++-sourced.


You can't use emscripten/llvm with #include <windows.h> to provide a GUI per se, can you?


Nope. IIRC there are specific bindings for that in a browser context. I think there's been a lot of work for other runtimes too.


I don't want to flame, but I did find it curious they went with C rather than Rust. In my experience the transition is straightforward and the string handling (particularly with unicode encodings) is way better (in addition to the normal ownership benefits), and the result (an easily linkable library exposing a C ABI) is roughly the same.


I think it's pragmatism. They're working with two libraries that are already C. Also, as mentioned in another thread, type data could bloat things a little, though I think the specific statement is hyperbolic.

My guess is their main goal was to use this in Go, where they already have experienced Go and C developers, and adding a third language would muddy things.


Yeah, I think that's correct too.

But it made me think that a killer app for Rust would be this entire concept done in one embedded DB with bindings for most of the popular languages.


Well, they are patching SQLite, and SQLite is written in C. So they'd have to maintain a C patch, Rust code, and another layer of C API for client bindings, if I understood their architecture correctly.


Yes, and one library will be 100 MiB, the other 1.


That's a gross exaggeration


A) disk space seems like a reasonable tradeoff for many situations, especially when binaries typically aren't the source of data consumption

B) I don't think I've seen a Rust binary of more than 10 MB. cargo, rustc, and ripgrep are each about 6 MB on my disk; fd is 2.5 MB. These seem like reasonable stand-ins for binaries of significant size and complexity. sqlite3 itself is about 1.3 MB.

C) Dqlite doesn't seem particularly concerned with disk-constrained systems, though I may be interpreting their site incorrectly, and a low footprint should be equally achievable with the Rust runtime; surely the database itself would be a much larger concern.

This just seems like an unusually good fit for the benefits of the language: reliable client glue you can import into many runtimes, where being able to prove data flow would be a strong defensive coding pattern. That said, I think C is a good, conservative approach here; I'm certainly not knocking anyone's judgement. Overall the parent poster is absolutely correct: there's a strong correlation between use of Rust's type system and the size of the output code.

EDIT: phrasing.


2.5 MiB is about what you expect the kernel size to be for an embedded device :)

Everything is an embedded device nowadays, so for reference: if you buy a WiFi AP today and open it up, you're likely to find an 8 or 16 MiB NOR flash inside, maybe a 128 MiB NAND flash (with realistically 64 MiB of space, since it will be doing A/B updates).

I don't think the database size is a big concern. For me the focus in dqlite is very much on the 'd' - you store atomic configuration data in there, it's not about throughput.


For an embedded database, they could certainly emphasize that more.


> Dqlite is a fast, embedded, persistent SQL database with Raft consensus that is perfect for fault-tolerant IoT and Edge devices.

Seems like, at least for embedded devices, you'd want something as small as possible so as to avoid consuming all available disk (not that any other language would necessarily balloon it significantly).


I missed that it was primarily intended for embedded devices.


I was going to ask what is the difference between this and rqlite, which also uses Raft.

Found the answer on Reddit:

> rqlite is a full RDBMS application, but dqlite is a library you must link with other code. It's like the difference between MySQL and libsqlite3.so.

* https://www.reddit.com/r/golang/comments/8a8h8y/dqlite_distr...


There's one more big distinction: rqlite's replication is command-based [0], whereas dqlite is/was WAL frame-based -- so basically one ships the command and the other ships WAL frames. This distinction means that non-deterministic commands (e.g. `RANDOM()`) will work differently.

It looks like dqlite's documentation has changed -- for some reason frames are no longer mentioned anywhere[2]. So maybe this isn't the case any more, but this was once the biggest differentiator for me.

[0]: https://github.com/rqlite/rqlite/#limitations

[1]: https://webcache.googleusercontent.com/search?q=cache:p1XBgh...

[2]: https://github.com/canonical/dqlite/commit/35ea7cd56e93a36c5...


> dqlite is/was WAL frame-based...maybe this isn't the case any more

According to https://github.com/canonical/dqlite/blob/master/doc/faq.md this is still the case.


Ahhhh thank you -- that information just got pushed into the FAQ -- I was thinking "surely they didn't just remove this information" but didn't look hard enough at all. Direct link:

https://github.com/canonical/dqlite/blob/master/doc/faq.md#h...


Can you give an example of how the specific differences may occur?


If you do command-based replication, an insert or update that uses RANDOM() would have to be handled specially, lest each member of the cluster evaluate it and produce a different value. (Basically, anything that is an impure function will have that problem.)
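A quick way to see the divergence, using the stock SQLite C API (two in-memory databases stand in for two cluster members replaying the same command; this is just a sketch, not how either project actually replicates):

    #include <sqlite3.h>
    #include <stdio.h>

    /* Replay the same statement on one "node" and return its result. */
    static long long run_random(sqlite3 *db) {
        sqlite3_stmt *stmt;
        long long value = 0;
        sqlite3_prepare_v2(db, "SELECT RANDOM()", -1, &stmt, NULL);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            value = sqlite3_column_int64(stmt, 0);
        sqlite3_finalize(stmt);
        return value;
    }

    int main(void) {
        sqlite3 *node_a, *node_b;
        sqlite3_open(":memory:", &node_a);
        sqlite3_open(":memory:", &node_b);
        /* Command-based replication ships "SELECT RANDOM()" to both nodes;
           each evaluates it independently and the values almost certainly
           diverge. Frame-based replication ships the leader's resulting WAL
           frames instead, so followers never re-evaluate the expression. */
        printf("node A: %lld\nnode B: %lld\n",
               run_random(node_a), run_random(node_b));
        sqlite3_close(node_a);
        sqlite3_close(node_b);
        return 0;
    }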


rqlite creator here.

Yes, exactly, Dqlite is a library, rqlite is a full application.


This was developed by the LXD team for its clustering. We use it in production as part of an LXD cluster. Initially there were some issues, but now it can easily support thousands of nodes in a cluster in our regression tests.

It's good they made it a separate project that can be used independently of LXD containers.


I was reading through the docs and this FAQ is worth checking out: https://github.com/canonical/dqlite/blob/master/doc/faq.md

It includes an answer about the difference from rqlite.

Reading the docs, it seems like dqlite has been developed by the team that develops LXD at Canonical: LXD is listed as the biggest user of the project, and the author's GitHub says he works at Canonical on LXD/LXC.

Interesting project, good luck to the author/authors if you read this!


rqlite creator here.

Yes, good luck to the creators of this project, it looks very interesting and I've been watching it for a few years now.


The one annoying thing about SQLite is that there is no easy way to change the table structure. Adding/removing/renaming columns is super complicated, and AFAIK there is no good command-line tool that does it for you.

That is the primary reason why I do not consider it for new projects. It's just too slow to iterate on.


That's a hell of a reason not to use sqlite. Staging data in a temporary table while a table is dropped, recreated, and then data is reinserted is not much of an inconvenience.


It is. It takes a bunch of complex commands. Just look at the proposed solutions on Stack Overflow:

https://stackoverflow.com/questions/8442147/how-to-delete-or...


So I'll keep my opinion about this out of this response. For those who are interested, the accepted answer on that SO thread recommends:

    1. create a new table shaped like the one you are trying to change,
    2. copy all the data over,
    3. drop the old table,
    4. rename the new one.
It's up to others to decide whether that is complex enough to warrant avoiding SQLite.
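For the curious, here is the recipe above wrapped in the stock SQLite C API; the database, table, and column names are made up:

    #include <sqlite3.h>
    #include <stdio.h>

    /* Names here ("app.db", "t", "a", "b") are hypothetical. */
    static const char *MIGRATION =
        "BEGIN;"
        "CREATE TABLE t_new (a INTEGER, b TEXT);"       /* 1. new table      */
        "INSERT INTO t_new (a, b) SELECT a, b FROM t;"  /* 2. copy the data  */
        "DROP TABLE t;"                                 /* 3. drop old table */
        "ALTER TABLE t_new RENAME TO t;"                /* 4. rename new one */
        "COMMIT;";

    int main(void) {
        sqlite3 *db;
        char *err = NULL;
        sqlite3_open("app.db", &db);
        if (sqlite3_exec(db, MIGRATION, NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "migration failed: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }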


That gets more cumbersome if the table has indexes (you will have to create them on the new table), and even more cumbersome if foreign keys point to it (you will have to drop them before step 3 and recreate them after step 4)

“Copy all data” can also be difficult if the table has data that the database created which must stay the same because you use it elsewhere. That shouldn't be a problem with SQLite, as it doesn't allow rowid as a foreign key, but if you use it as a foreign key outside the database, or use the hash of a full row to detect changes, it may still bite you.

It may also mean being offline for a significant amount of time, but that is often (effectively) the case even for databases that support deleting columns.


good points


Which adds up to dozens to hundreds of lines of code that need to be maintained in each project you use SQLite with, vs. the one-liner of SQL that would be required if you were using a database that supported it.


It's only required if you need the ability to change the schema of a SQLite table at runtime. I'd wager that's not a very common use-case for SQLite.


What about database migrations? My app which uses a SQLite database needs to store an additional column for a table. Now I have to write a bunch of custom code to migrate it.


> Adding/Removing/Renaming columns is super complicated and afaik there is no good command line tool that does it for you.

SQLite supports ADD COLUMN and RENAME COLUMN DDL statements.

Dropping columns is not supported, nor is adding some of the more complex column types; those do require going through a full table rewrite.
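A sketch of those two supported one-liners via the C API (RENAME COLUMN needs SQLite >= 3.25; the table and column names are made up):

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void) {
        sqlite3 *db;
        char *err = NULL;
        sqlite3_open(":memory:", &db);
        sqlite3_exec(db, "CREATE TABLE t (a INTEGER);", NULL, NULL, NULL);

        /* Both forms are single, in-place DDL statements -- no table rebuild: */
        if (sqlite3_exec(db, "ALTER TABLE t ADD COLUMN b TEXT;",
                         NULL, NULL, &err) != SQLITE_OK ||
            sqlite3_exec(db, "ALTER TABLE t RENAME COLUMN b TO c;",
                         NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "alter failed: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }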


Would making a new db with the new structure and essentially importing the old data be a reasonably speedy activity?


It is if your table is just a datastore with no foreign key links or indexes.

See another reply in this thread: https://news.ycombinator.com/item?id=20841814


I really like the design of this website. It's simple, information-rich, fast, and doesn't contain a ridiculous number of images or dynamic components. It's a shame that I can only say this for a select few websites these days.


Much of Canonical's recent developer-facing work uses a similar style: https://microk8s.io/, https://multipass.run/, https://cloud-init.io/ ...


Yet it's still loaded with hundreds of KB of custom fonts, because the designers would rather I look at a blank page for a couple seconds than gaze upon their design with a typeface that isn't exactly the same one they have on their computer.

My resolution, window size, color settings, text zoom, font rendering, etc, are almost certainly different, too, but at least they've made the page more than twice as slow by forcing the correct font.


It's not so much the size as the way the font was defined and the trash latency/TTFB of whatever they are hosting it on.


What can this be used for (example use cases)? Is there 24/7 support available? How long has it been around and is there a commitment to long term releases?


It has been around for two and a half years, but we released v1.0.0 just yesterday.

I added answers to your other questions to the FAQ:

https://github.com/canonical/dqlite/blob/master/doc/faq.md#w...

https://github.com/canonical/dqlite/blob/master/doc/faq.md#w...

https://github.com/canonical/dqlite/blob/master/doc/faq.md#i...


Thanks, those are important items for me, especially when recommending a new technology to a client and/or boss. Hopefully others will find them useful as well.


Sounds like it's for embedded/iot. Similar use cases to SQLite but for connected devices.


Hm. Just enterprise-grade. I need military-grade and planet-scale.


Give us time, we'll get there :)


Can you query the DB on disk using the regular sqlite3 cli tool?


Yes you can, or at least you could when I was using versions < 1.0.0. It made developing with it really easy.


What is the difference between this and rqlite?

https://github.com/rqlite/rqlite


From the FAQ[1]

The main differences from rqlite are:

- Embeddable in any language that can interoperate with C

- Full support for transactions

- No need for statements to be deterministic (e.g. you can use time())

- Frame-based replication instead of statement-based replication

[1] https://github.com/canonical/dqlite/blob/master/doc/faq.md


rqlite creator here.

More fundamentally, as mentioned above, Dqlite is a library, whereas rqlite is a RDBMS (albeit a pretty simple and lightweight one).


If it's used like a library, where's the data stored? Sorry if this is obvious.


I used Dqlite for a side project[1] which replicates some of the features of LXD. It was relatively easy to use, but Dqlite moves at some pace and trying to keep up is quite "interesting". Anyway, once I do end up getting time, I'm sure it'll be advantageous to what I'm doing.

1. https://github.com/spoke-d/thermionic


Oh, I had no idea somebody was using it in the wild! It had been unstable until now; we just released v1.0.0 yesterday, so no more public API breakage from now on.


Has anybody run the Jepsen distributed database tests against dqlite?


Not that I know, but it's on my todo list (dqlite author here).


Does dqlite have consistent reads (i.e. are reads guaranteed to be served by the current leader)?


Hey, free-ekanayaka, a few more questions for your FAQ if you're still paying attention:

Does this store the entire log for all time? When you bring up a new node, does it replay the entire history? If not, how do you bring up a new node without data?

How does backup/restore work?

How do upgrades work? Is the shared WAL low-level enough that it's 100% stable/compatible between sqlite/dqlite versions? If not, what happens if half your cluster is on the old version while you're upgrading, and sees things it doesn't understand yet?

Is it possible to encrypt node-to-node traffic? Or can you easily send the node-to-node traffic over a proxy, like Envoy? How about over a Unix domain socket or an "@named" Unix domain socket (which we use for Envoy here at Square)?

Looks awesome, by the way!


> fully async disk I/O

I thought Linux didn't support real async disk IO. Is that not the case?

If Linux has no real async disk IO, how does Dqlite achieve fully async disk IO?


Linux does support async I/O to disks using various other interfaces/approaches - it’s just that the classic approach of select()’ing on non-blocking FDs doesn’t work for disk:

https://blog.cloudflare.com/io_submit-the-epoll-alternative-...


Support for async disk I/O in Linux differs depending on kernel version and file system type, but it is possible to get 100% async I/O with io_submit(), and dqlite will leverage that if detected.

There is now a new async I/O API available in Linux (I can't remember the name right now, but it was developed by folks at Facebook). It looks promising, so I'll check it out at some point. (dqlite author here)
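For reference, a minimal self-contained sketch of the io_submit() path (Linux AIO via libaio); this is illustrative only, not dqlite's actual I/O code:

    /* aio_demo.c -- build with: cc aio_demo.c -laio */
    #define _GNU_SOURCE   /* for O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;

        /* O_DIRECT is what keeps io_submit() from blocking on most setups;
           it requires block-aligned buffers and sizes. */
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (posix_memalign(&buf, 4096, 4096)) return 1;
        memset(buf, 'x', 4096);

        if (io_setup(128, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }
        io_prep_pwrite(&cb, fd, buf, 4096, 0);   /* queue one 4 KiB write */
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

        /* A real event loop would poll for completions; here we just wait. */
        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
            printf("wrote %ld bytes asynchronously\n", (long)ev.res);

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }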


Jens Axboe, io_uring :)



I hope the claim of being fully async I/O is just a buzz term, as it's no longer supported in SQLite.

https://www.sqlite.org/asyncvfs.html


It's not a buzz term. It's really fully async disk I/O. Dqlite does not use SQLite's stock vfs implementation for writing to disk, as it's an entirely different model (based on raft).


> Q: How does dqlite behave during conflict situations? Does Raft select a winning WAL write and any others in flight writes are aborted?

> A: There can’t be a conflict situation. Raft’s model is that only the leader can append new log entries, which translated to dqlite means that only the leader can write new WAL frames. So this means that any attempt to perform a write transaction on a non-leader node will fail with a ErrNotLeader error (and in this case clients are supposed to retry against whoever is the new leader).

Correct me if I'm wrong, but isn't that essentially the same limitation WAL mode in normal SQLite has? With WAL you can have as many reads going on as you like, in parallel with a single write. That seems directly comparable to what the dqlite FAQ says, unless I'm missing something.


The very top of that page suggests WAL with PRAGMA synchronous = 0 accomplishes roughly the same objective.


Based on the following header file, it would seem that they are using this unsupported module.

https://github.com/canonical/dqlite/blob/master/src/vfs.h


Yes, you have to use a forked SQLite in order to make use of dqlite.

I believe they (the LXD team) are working on upstreaming the WAL changes, but due to SQLite's very strong compatibility guarantees they want to be very certain the API and protocol are correct before carving them in stone. Not to mention they are currently the only major users of the feature, so more widespread use would also be nice before merging it upstream.


Everything you said is very accurate (dqlite author here).


They would have to, wouldn't they? The Sqlite VFS subsystem is the obvious place to intercept the usually local DB and WAL reads/writes and make them distributed functions that use Raft.


I did not know this is from Canonical. Interesting


What is the typical use case for such a thing?



This is great - good work to you and/or your team


https://github.com/canonical/raft

That's interesting. Didn't know about it.


Using Raft for the WAL sounds a bit like Calvin. How does Dqlite compare to eg FaunaDB in terms of distributed ACID guarantees?


I'm not brimming with ideas of where this would be a good fit tbh. Is anyone using it or considering it for anything?


It's sort of like a SQL version of etcd, so maybe some of the use cases for etcd would be similar: https://github.com/etcd-io/etcd/blob/master/Documentation/pr...


Anywhere you can't be bothered to spin up a full database server and don't necessarily mind losing a few features. Historically SQLite has been THE solution for data storage in embedded systems (think IoT), so in this case imagine an IoT configuration with multiple nodes.

So an easy use case that springs to mind is any sort of distributed IoT deployment that needs to track state: any industrial or consumer monitoring system with a centralised controller that would use this for data storage. Specifically, since this enables the use of multiple nodes for high throughput, imagine many, many sensors and a central controller streaming real-time data.


Depending on some details, it would be perfect for storing state for a libvirt cluster management tool I'm working on. My concerns are whether I can read the data on disk using the regular sqlite3 CLI tool, and the lack of Rust bindings.


8 years ago or so I was involved in building a cloud platform. The very first version of the design for keeping the VM and storage-allocation metadata synced across the cluster involved syncing SQLite databases (which we moved off once we realised we'd pretty much have to invent something a lot like Raft to make it work). If this had existed then, we'd have just picked it off the shelf.


I'm curious about the libvirt management tool you're working on. Any chance it is open source?


Some thoughts: a consensus protocol is like 1/50th of what you need for a stable, reliable distributed database, and this one is developed by a company, so expect it to be abandoned once they stop developing it. I wouldn't use it at work (yet), but it could be fun for personal projects.


I wouldn't paint Canonical with that generalization. It's not unheard of for them to drop projects, but I wouldn't say it's common. And it looks like their primary use is LXD, which doesn't seem to be going anywhere...


But the main point made by the parent is entirely correct: the biggest issue isn't implementing a consensus protocol; it's the reconfiguration of the cluster, management of dead/live nodes, addition of extra nodes for replacement, and copying of the data before reconfiguration.

All of that needs tooling.


Indeed. I was reading the dqlite page thinking "where is the monitoring endpoint to tell if the cluster is healthy or degraded?" Too often that seems like an afterthought if it's thought of at all.

I have a teeny, tiny cluster using MySQL+Galera as a multi-master cluster, but it took a while to iterate towards monitoring that tells me when one node is unhealthy, and to get the correct repair and restart procedures.


Totally.

FWIW, I built all the functionality into rqlite from the very start, for exactly those reasons. In the real world a database must be operated.

https://github.com/rqlite/rqlite/blob/master/DOC/DIAGNOSTICS...

https://github.com/rqlite/rqlite/blob/master/DOC/CLUSTER_MGM...


Yep! And you need someone to help fix bizarre bugs in core that only crop up in your own weird environment. With support that's quick & easy, but otherwise you have to form your own dev team to specialize in it, making it costlier.


Needs a Kubernetes operator for sure.


It doesn't really matter who wrote it, it's a trope of corporate software development. A small team makes project X to support project Y. They go through the usual dev + production + maintenance cycle, which takes 2-3 years typically, sometimes 5, after which the team is disbanded/reorganized, and no new dev work happens on the old projects. The project is effectively abandoned at that point, unless it happens to have picked up enough users that "a community" forms and picks up development... but that's rare, because corporations don't want to give development of their project over to randos on the internet, especially if they're still using it. The best case is it would fork, or move to some other org's code repo.

I like to use projects which lots of other projects depend on directly. That way if the main project goes unsupported, all the other projects using it will band together to support a fork. I believe open source that is not created for a company will last much longer. (I like that they rewrote it in C, though; it would probably survive well as a fork if enough people/projects use it)


Hence it's being released: if/when people fork it, it will/could live forever.


what's the comparison with rqlite?

https://github.com/rqlite/rqlite


Never mind. C. Already in the Ubuntu repos as a one-step install, OK... this is cool. I'm gonna check it out.


> Runs on ARM, X86, POWER and IBM Z architectures

Odd, I'd think x64 would be the most commonly used architecture. Is this a mistake?


That's why it's called x86-64. It is the "same" architecture.


Since the author is in the comments: what are you planning to do about operations, i.e. keeping consistent performance while adding/removing/resyncing nodes, rebalancing, dealing with bitrot, disk errors, disk performance issues, filesystem issues, unstable network performance, etc.? It doesn't look like there is anything addressing operations in the code at the moment.


Does anyone know a good on-disk NoSQL DB like SQLite with Python bindings? I know SQLite has a JSON extension, but I don't think you can query it.


How is the performance running 3 nodes? How many inserts/reads can be done on average?


Depends on how fast your disk is, what file system and kernel you use, and how low your network latency is. Difficult to predict. But it's basically as fast as it can get, given 1) hardware constraints and 2) Raft consensus.

If you want light-speed inserts/deletes, you could probably skip the disk entirely: as long as a majority of your nodes doesn't die, you won't lose any data. You can also go somewhere in between and save to disk only at specific intervals.


Do all nodes participate as full RAFT nodes? Or can you have read-only nodes?


Aren't non-leader RAFT nodes already read-only?


I guess I meant observer nodes, with read-only DB access.


I just mean that Raft nodes are already read-only for most of their lives; it's not clear to me why you'd want nodes dedicated as read-only, since it would only reduce consistency guarantees (and I believe you wouldn't be avoiding the heartbeat and data-consistency network chatter). I think the leadership functionality comes cheap.


That makes sense. I need to read up on Raft again.

I know Zookeeper, for example, supports observer nodes: essentially a cheaper read-only cache. Chubby at Google had the same thing.



