Dqlite – High-Availability SQLite (dqlite.io)
463 points by stubish on Aug 30, 2019 | 115 comments



Interesting nugget about golang from their FAQ:

https://github.com/canonical/dqlite/blob/master/doc/faq.md

Why C?

The first prototype implementation of dqlite was in Go, leveraging the hashicorp/raft implementation of the Raft algorithm. The project was later rewritten entirely in C because of performance problems due to the way Go interoperates with C: Go considers a function call into C that lasts more than ~20 microseconds to be a blocking system call; in that case it puts the goroutine running the C call in a waiting queue, and resuming it effectively causes a context switch, degrading performance (since a lot of these calls were happening). See also this issue in the Go bug tracker.

The added benefit of the rewrite in C is that it's now easy to embed dqlite into projects written in effectively any language, since all major languages have provisions to create C bindings.
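As a tiny illustration of that point (a hypothetical two-line library, not dqlite code): anything exposing a plain C ABI can be loaded from virtually any language's FFI:

    /* mini.c -- build as a shared library: cc -shared -fPIC -o libmini.so mini.c
       Any language with a C FFI (Python ctypes, Java JNI, Go cgo, Rust extern
       blocks, ...) can then load libmini.so and call mini_add directly. */
    int mini_add(int a, int b) {
        return a + b;
    }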


They could also have used D or Rust for this. If the borrow checker is too much, Rust can still do automatic memory management with the other benefits remaining. Both also support letting specific modules be unsafe where performance is critical.


> since all major languages have provisions to create C bindings.

Does WebAssembly or any of its runtimes provide a way to do this?


You can probably compile C to WebAssembly using clang/LLVM.


That's not what the parent was asking. They were asking whether WASM has an FFI interface for runtimes that don't want to execute the WASM code entirely sandboxed from the host, but instead want to allow you to dlopen(3) code into the OS process hosting the WASM interpreter and call it through WASM ops.

Presumably, https://github.com/CraneStation/wasmtime would benefit from such an FFI API being specified.
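For readers unfamiliar with the mechanism: the dynamic-loading path being described is dlopen(3)/dlsym(3). A minimal host-side sketch, where "plugin.so" and "entry" are hypothetical names:

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void) {
        /* Load native code into the host process -- exactly what a sandboxed
           WASM runtime normally refuses to do. */
        void *handle = dlopen("./plugin.so", RTLD_NOW);
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        int (*entry)(int) = (int (*)(int))dlsym(handle, "entry");
        if (!entry) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        printf("entry(42) = %d\n", entry(42));
        dlclose(handle);
        return 0;
    }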


That sounds like ActiveX 2.0 at best, and an utter security nightmare at worst.


Precisely. Do you think this is a feature the WebAssembly team should standardize or is it a bad fit for the language?


That should be doable! (Disclaimer: I work at Wasmer https://github.com/wasmerio/wasmer )


Genuine question here: if you’re already going to be shipping native code, why would you need WebAssembly in the mix, especially if you’re not targeting a browser?


For example, if you want to execute "native" code in a restricted environment. WebAssembly is quite fast when JIT-compiled, so it's very tempting to use it for embedding extension logic.


Yes, emscripten/LLVM builds C code into WebAssembly. I'd say a significant portion of WebAssembly out there is C/C++-sourced.


You can't use emscripten/llvm with #include <windows.h> to provide a GUI per se, can you?


Nope. IIRC there are specific bindings for that in a browser context. I think there's been a lot of work for other runtimes too.


I don't want to flame, but I did find it curious they went with C rather than Rust. In my experience the transition is straightforward and the string handling (particularly with unicode encodings) is way better (in addition to the normal ownership benefits), and the result (an easily linkable library exposing a C ABI) is roughly the same.


I think it's pragmatism. They're working with two libraries that are already C. Also, as mentioned in another thread, type data could bloat things a little, though I think the specific statement is hyperbolic.

My guess is their main goal was to use this in Go, where they already have experienced Go and C developers, and adding a third language would muddy things.


Yeah, I think that's correct too.

But it made me think that a killer app for Rust would be this entire concept done in one embedded DB with bindings for most of the popular languages.


Well, they are patching SQLite, and SQLite is written in C. So they'd have to maintain a C patch, Rust code, and another layer of C API for client bindings, if I understood their architecture correctly.


Yes, and one library will be 100 MiB, the other 1.


That's a gross exaggeration


A) disk space seems like a reasonable tradeoff for many situations, especially when binaries typically aren't the source of data consumption

B) I don't think I've seen a Rust binary of more than 10 MB. cargo, rustc, and ripgrep are each about 6 MB on my disk; fd is 2.5 MB. These seem like reasonable stand-ins for binaries of significant size and complexity. sqlite3 itself is about 1.3 MB.

C) Dqlite doesn't seem particularly concerned with disk-constrained systems, though I may be interpreting their site incorrectly, and a low footprint should be equally achievable with the Rust runtime; surely the database itself would be a much larger concern.

This just seems like an unusually good fit for the benefits of the language: reliable client glue you can import into many runtimes, where being able to prove data flow would be a strong defensive coding pattern. That said, I think C is a good, conservative approach here; I'm certainly not knocking anyone's judgement. Overall the parent poster is absolutely correct: there's a strong correlation between use of Rust's type system and the size of the output code.

EDIT: phrasing.


2.5 MiB is about what you expect the kernel size to be for an embedded device :)

Everything is an embedded device nowadays, so for reference: if you buy a WiFi AP today and open it up, you're likely to find an 8 or 16 MiB NOR flash inside, maybe a 128 MiB NAND flash (with realistically 64 MiB of space, since it will be doing A/B updates).

I don't think the database size is a big concern. For me the focus in dqlite is very much on the 'd' - you store atomic configuration data in there, it's not about throughput.


For an embedded database, they could certainly emphasize that more.


> Dqlite is a fast, embedded, persistent SQL database with Raft consensus that is perfect for fault-tolerant IoT and Edge devices.

Seems like, at least for embedded devices, you'd want something as small as possible so as to avoid consuming all available disk (not that any other language would necessarily balloon it significantly).


I missed that it was primarily intended for embedded devices.


I was going to ask what is the difference between this and rqlite, which also uses Raft.

Found the answer on Reddit:

> rqlite is a full RDBMS application, but dqlite is a library you must link with other code. It's like the difference between MySQL and libsqlite3.so.

* https://www.reddit.com/r/golang/comments/8a8h8y/dqlite_distr...


There's one more big distinction: rqlite's replication is command-based [0], whereas dqlite is/was WAL frame-based -- so basically one ships the command and the other ships WAL frames. This distinction means that non-deterministic commands (e.g. `RANDOM()`) will work differently.

It looks like dqlite's documentation has changed -- for some reason frames are no longer mentioned anywhere[2]. So maybe this isn't the case any more, but this was once the biggest differentiator for me.

[0]: https://github.com/rqlite/rqlite/#limitations

[1]: https://webcache.googleusercontent.com/search?q=cache:p1XBgh...

[2]: https://github.com/canonical/dqlite/commit/35ea7cd56e93a36c5...


> dqlite is/was WAL frame-based...maybe this isn't the case any more

According to https://github.com/canonical/dqlite/blob/master/doc/faq.md this is still the case.


Ahhhh thank you -- that information just got pushed into the FAQ -- I was thinking "surely they didn't just remove this information" but didn't look hard enough at all. Direct link:

https://github.com/canonical/dqlite/blob/master/doc/faq.md#h...


Can you give an example of how the specific differences may occur?


If you do command-based replication, an insert or update that uses RANDOM() would have to be handled specially, lest each member of the cluster evaluate it and produce a different value. (Basically, anything that is an impure function will have that problem.)
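A quick way to see the divergence, using the stock SQLite C API (two in-memory databases stand in for two cluster members replaying the same command; this is just a sketch, not how either project actually replicates):

    #include <sqlite3.h>
    #include <stdio.h>

    /* Replay the same statement on one "node" and return its result. */
    static long long run_random(sqlite3 *db) {
        sqlite3_stmt *stmt;
        long long value = 0;
        sqlite3_prepare_v2(db, "SELECT RANDOM()", -1, &stmt, NULL);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            value = sqlite3_column_int64(stmt, 0);
        sqlite3_finalize(stmt);
        return value;
    }

    int main(void) {
        sqlite3 *node_a, *node_b;
        sqlite3_open(":memory:", &node_a);
        sqlite3_open(":memory:", &node_b);
        /* Command-based replication ships "SELECT RANDOM()" to both nodes;
           each evaluates it independently and the values almost certainly
           diverge. Frame-based replication ships the leader's resulting WAL
           frames instead, so followers never re-evaluate the expression. */
        printf("node A: %lld\nnode B: %lld\n",
               run_random(node_a), run_random(node_b));
        sqlite3_close(node_a);
        sqlite3_close(node_b);
        return 0;
    }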


rqlite creator here.

Yes, exactly, Dqlite is a library, rqlite is a full application.


This was developed by the LXD team for its clustering. We use it in production as part of an LXD cluster. Initially there were some issues, but now it can easily support thousands of nodes in a cluster in our regression tests.

It's good they made it a separate project that can be used independently of LXD containers.


I was reading through the docs and this FAQ is worth checking out: https://github.com/canonical/dqlite/blob/master/doc/faq.md

It includes an answer about the difference from rqlite.

Reading the docs, it seems like dqlite has been developed by the team that develops LXD at Canonical: LXD is listed as the biggest user of the project, and the author's GitHub says he works at Canonical on LXD/LXC.

Interesting project, good luck to the author/authors if you read this!


rqlite creator here.

Yes, good luck to the creators of this project, it looks very interesting and I've been watching it for a few years now.


The one annoying thing about SQLite is that there is no easy way to change the table structure. Adding/removing/renaming columns is super complicated, and AFAIK there is no good command-line tool that does it for you.

That is the primary reason why I do not consider it for new projects. It's just too slow to iterate on.


That's a hell of a reason not to use sqlite. Staging data in a temporary table while a table is dropped, recreated, and then data is reinserted is not much of an inconvenience.


It is. It takes a bunch of complex commands. Just look at the proposed solutions on Stack Overflow:

https://stackoverflow.com/questions/8442147/how-to-delete-or...


So I'll keep my opinion about this out of this response. For those who are interested, the accepted answer on that SO thread recommends:

    1. create a new table shaped like the one you are trying to change,
    2. copy all the data over,
    3. drop the old table,
    4. rename the new one.
It's up to others to decide whether that is complex enough to warrant avoiding SQLite.
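For the curious, here is the recipe above wrapped in the stock SQLite C API; the database, table, and column names are made up:

    #include <sqlite3.h>
    #include <stdio.h>

    /* Names here ("app.db", "t", "a", "b") are hypothetical. */
    static const char *MIGRATION =
        "BEGIN;"
        "CREATE TABLE t_new (a INTEGER, b TEXT);"       /* 1. new table      */
        "INSERT INTO t_new (a, b) SELECT a, b FROM t;"  /* 2. copy the data  */
        "DROP TABLE t;"                                 /* 3. drop old table */
        "ALTER TABLE t_new RENAME TO t;"                /* 4. rename new one */
        "COMMIT;";

    int main(void) {
        sqlite3 *db;
        char *err = NULL;
        sqlite3_open("app.db", &db);
        if (sqlite3_exec(db, MIGRATION, NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "migration failed: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }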


That gets more cumbersome if the table has indexes (you will have to create them on the new table), and even more cumbersome if foreign keys point to it (you will have to drop them before step 3 and recreate them after step 4)

“Copy all data” can also be difficult if the table has data that the database created which must stay the same because you use it elsewhere. That shouldn't be a problem with SQLite, as it doesn't allow rowid as a foreign key, but if you use it as a foreign key outside the database, or use the hash of a full row to detect changes, it may still bite you.

It may also mean being offline for a significant amount of time, but that is often (effectively) the case even for databases that support deleting columns.


good points


Which adds up to dozens to hundreds of lines of code that need to be maintained in each project you use SQLite with, vs. the one-liner of SQL that would be required if you were using a database that supported it.


It's only required if you need the ability to change the schema of a SQLite table at runtime. I'd wager that's not a very common use-case for SQLite.


What about database migrations? My app which uses a SQLite database needs to store an additional column for a table. Now I have to write a bunch of custom code to migrate it.


> Adding/Removing/Renaming columns is super complicated and afaik there is no good command line tool that does it for you.

SQLite supports ADD COLUMN and RENAME COLUMN DDL statements.

Dropping columns is not supported, nor is adding some of the more complex column types; those do require going through a full table rewrite.
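A sketch of those two supported one-liners via the C API (RENAME COLUMN needs SQLite >= 3.25; the table and column names are made up):

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void) {
        sqlite3 *db;
        char *err = NULL;
        sqlite3_open(":memory:", &db);
        sqlite3_exec(db, "CREATE TABLE t (a INTEGER);", NULL, NULL, NULL);

        /* Both forms are single, in-place DDL statements -- no table rebuild: */
        if (sqlite3_exec(db, "ALTER TABLE t ADD COLUMN b TEXT;",
                         NULL, NULL, &err) != SQLITE_OK ||
            sqlite3_exec(db, "ALTER TABLE t RENAME COLUMN b TO c;",
                         NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "alter failed: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }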


Would making a new db with the new structure and essentially importing the old data be a reasonably speedy activity?


It is if your table is just a datastore with no foreign key links or indexes.

See another reply in this thread: https://news.ycombinator.com/item?id=20841814


I really like the design of this website. It's simple, information-rich, fast, and doesn't contain a ridiculous number of images or dynamic components. It's a shame that I can only say this for a select few websites these days.


Much of Canonical's recent developer-facing work uses a similar style: https://microk8s.io/, https://multipass.run/, https://cloud-init.io/ ...


Yet it's still loaded with hundreds of KB of custom fonts, because the designers would rather I look at a blank page for a couple seconds than gaze upon their design with a typeface that isn't exactly the same one they have on their computer.

My resolution, window size, color settings, text zoom, font rendering, etc, are almost certainly different, too, but at least they've made the page more than twice as slow by forcing the correct font.


It's not so much the size as the way the font was defined and the trash latency/TTFB of whatever they are hosting it on.


What can this be used for (example use cases)? Is there 24/7 support available? How long has it been around and is there a commitment to long term releases?


It has been around for two and a half years, but we released v1.0.0 just yesterday.

I added answers to your other questions to the FAQ:

https://github.com/canonical/dqlite/blob/master/doc/faq.md#w...

https://github.com/canonical/dqlite/blob/master/doc/faq.md#w...

https://github.com/canonical/dqlite/blob/master/doc/faq.md#i...


Thanks, those are important items for me, especially when recommending a new technology to a client and/or boss. Hopefully others will find them useful as well.


Sounds like it's for embedded/iot. Similar use cases to SQLite but for connected devices.


Hm. Just enterprise-grade. I need military-grade and planet-scale.


Give us time, we'll get there :)


Can you query the DB on disk using the regular sqlite3 cli tool?


Yes you can, or at least you could when I was using versions < 1.0.0. It made developing with it really easy.


What is the difference between this and rqlite?

https://github.com/rqlite/rqlite


From the FAQ[1]

The main differences from rqlite are:

- Embeddable in any language that can interoperate with C

- Full support for transactions

- No need for statements to be deterministic (e.g. you can use time())

- Frame-based replication instead of statement-based replication

[1] https://github.com/canonical/dqlite/blob/master/doc/faq.md


rqlite creator here.

More fundamentally, as mentioned above, Dqlite is a library, whereas rqlite is a RDBMS (albeit a pretty simple and lightweight one).


If it's used like a library, where's the data stored? Sorry if this is obvious.


I used Dqlite for a side project[1] which replicates some of the features of LXD. It was relatively easy to use, but Dqlite moves at some pace and trying to keep up is quite "interesting". Anyway, once I do end up getting time, I'm sure it'll be advantageous to what I'm doing.

1. https://github.com/spoke-d/thermionic


Oh, I had no idea somebody was using it in the wild! It had been unstable until now; we just released v1.0.0 yesterday, so no more public API breakage from now on.


Has anybody run the Jepsen distributed database tests against dqlite?


Not that I know, but it's on my todo list (dqlite author here).


Does dqlite have consistent reads (i.e. are reads guaranteed to be served by the current leader)?


Hey, free-ekanayaka, a few more questions for your FAQ if you're still paying attention:

Does this store the entire log for all time? When you bring up a new node, does it replay the entire history? If not, how do you bring up a new node without data?

How does backup/restore work?

How do upgrades work? Is the shared WAL low-level enough that it's 100% stable/compatible between sqlite/dqlite versions? If not, what happens if half your cluster is on the old version while you're upgrading, and sees things it doesn't understand yet?

Is it possible to encrypt node-to-node traffic? Or can you easily send the node-to-node traffic over a proxy, like Envoy? How about over a Unix domain socket or an "@named" Unix domain socket (which we use for Envoy here at Square)?

Looks awesome, by the way!


> fully async disk I/O

I thought Linux didn't support real async disk IO. Is that not the case?

If Linux has no real async disk IO, how does Dqlite achieve fully async disk IO?


Linux does support async I/O to disks using various other interfaces/approaches - it’s just that the classic approach of select()’ing on non-blocking FDs doesn’t work for disk:

https://blog.cloudflare.com/io_submit-the-epoll-alternative-...


Support for async disk I/O in Linux differs depending on kernel version and file system type, but it is possible to get 100% async I/O with io_submit(), and dqlite will leverage that if detected.

There is now a new async I/O API available in Linux (I can't remember the name right now, but it was developed by folks at Facebook). It looks promising, so I'll check it out at some point. (dqlite author here)
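For reference, a minimal self-contained sketch of the io_submit() path (Linux AIO via libaio); this is illustrative only, not dqlite's actual I/O code:

    /* aio_demo.c -- build with: cc aio_demo.c -laio */
    #define _GNU_SOURCE   /* for O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;

        /* O_DIRECT is what keeps io_submit() from blocking on most setups;
           it requires block-aligned buffers and sizes. */
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (posix_memalign(&buf, 4096, 4096)) return 1;
        memset(buf, 'x', 4096);

        if (io_setup(128, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }
        io_prep_pwrite(&cb, fd, buf, 4096, 0);   /* queue one 4 KiB write */
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

        /* A real event loop would poll for completions; here we just wait. */
        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
            printf("wrote %ld bytes asynchronously\n", (long)ev.res);

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }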


Jens Axboe, io_uring :)



I hope the claim of being fully async I/O is just a buzz term, as it's no longer supported in SQLite.

https://www.sqlite.org/asyncvfs.html


It's not a buzz term. It's really fully async disk I/O. Dqlite does not use SQLite's stock vfs implementation for writing to disk, as it's an entirely different model (based on raft).


> Q: How does dqlite behave during conflict situations? Does Raft select a winning WAL write and any others in flight writes are aborted?

> A: There can’t be a conflict situation. Raft’s model is that only the leader can append new log entries, which translated to dqlite means that only the leader can write new WAL frames. So this means that any attempt to perform a write transaction on a non-leader node will fail with a ErrNotLeader error (and in this case clients are supposed to retry against whoever is the new leader).

Correct me if I'm wrong, but isn't that essentially the same limitation WAL mode in normal SQLite has? With WAL you can have as many reads going on as you like, in parallel with a single write. That seems directly comparable to what the dqlite FAQ says, unless I'm missing something.


The very top of that page suggests WAL with PRAGMA synchronous = 0 accomplishes roughly the same objective.


Based on the following header file, it would seem that they are using this unsupported module.

https://github.com/canonical/dqlite/blob/master/src/vfs.h


Yes, you have to use a forked SQLite in order to make use of dqlite.

I believe they (the LXD team) are working on upstreaming the WAL changes, but due to SQLite's very strong compatibility guarantees they want to be very certain the API and protocol are correct before carving them in stone. Not to mention they are currently the only major users of the feature, so more widespread use would also be nice before merging it upstream.


Everything you said is very accurate (dqlite author here).


They would have to, wouldn't they? The Sqlite VFS subsystem is the obvious place to intercept the usually local DB and WAL reads/writes and make them distributed functions that use Raft.


I did not know this is from Canonical. Interesting


What is the typical use case for such a thing?



This is great - good work to you and/or your team


https://github.com/canonical/raft

That's interesting. Didn't know about it.


Using Raft for the WAL sounds a bit like Calvin. How does Dqlite compare to eg FaunaDB in terms of distributed ACID guarantees?


I'm not brimming with ideas of where this would be a good fit tbh. Is anyone using it or considering it for anything?


It's sort of like a SQL version of etcd, so maybe some of the use cases for etcd would be similar: https://github.com/etcd-io/etcd/blob/master/Documentation/pr...


Anywhere you can't be bothered to spin up a full database server and don't necessarily mind losing a few features. Historically SQLite has been THE solution for data storage in embedded systems (think IoT), so in this case imagine an IoT configuration with multiple nodes.

So an easy use case that springs to mind is any sort of distributed IoT deployment that needs to track state: any industrial or consumer monitoring system with a centralised controller that would use this for data storage. Specifically, since this enables the use of multiple nodes for high throughput, imagine many, many sensors and a central controller streaming real-time data.


Depending on some details, it would be perfect for storing state for a libvirt cluster management tool I'm working on. My concerns are whether I can read the data on disk using the regular sqlite3 CLI tool, and the lack of Rust bindings.


8 years ago or so I was involved in building a cloud platform. The very first version of the design for keeping the VM and storage-allocation metadata synced across the cluster involved syncing SQLite databases (which we moved off once we realised we'd pretty much have to invent something a lot like Raft to make it work). If this had existed then, we'd have just picked it off the shelf.


I'm curious about the libvirt management tool you're working on. Any chance it is open source?


Some thoughts: a consensus protocol is like 1/50th of what you need for a stable, reliable distributed database, and this one is developed by a company, so expect it to be abandoned once they stop developing it. I wouldn't use it at work (yet), but it could be fun for personal projects.


I wouldn't paint Canonical with that generalization. It's not unheard of for them to drop projects, but I wouldn't say it's common. And it looks like their primary use is LXD, which doesn't seem to be going anywhere...


But the main point made by the parent is entirely correct: the biggest issue isn't implementing a consensus protocol; it's the reconfiguration of the cluster, management of dead/live nodes, addition of extra nodes for replacement, and copying of the data before reconfiguration.

All of that needs tooling.


Indeed. I was reading the dqlite page thinking "where is the monitoring endpoint to tell if the cluster is healthy or degraded?" Too often that seems like an afterthought if it's thought of at all.

I have a teeny, tiny cluster using MySQL+Galera as a multi-master cluster, but it took a while to iterate towards monitoring that tells me when one node is unhealthy, and to get the correct repair and restart procedures.


Totally.

FWIW, I built all the functionality into rqlite from the very start, for exactly those reasons. In the real world a database must be operated.

https://github.com/rqlite/rqlite/blob/master/DOC/DIAGNOSTICS...

https://github.com/rqlite/rqlite/blob/master/DOC/CLUSTER_MGM...


Yep! And you need someone to help fix bizarre bugs in core that only crop up in your own weird environment. With support that's quick & easy, but otherwise you have to form your own dev team to specialize in it, making it costlier.


Needs a Kubernetes operator for sure.


It doesn't really matter who wrote it, it's a trope of corporate software development. A small team makes project X to support project Y. They go through the usual dev + production + maintenance cycle, which takes 2-3 years typically, sometimes 5, after which the team is disbanded/reorganized, and no new dev work happens on the old projects. The project is effectively abandoned at that point, unless it happens to have picked up enough users that "a community" forms and picks up development... but that's rare, because corporations don't want to give development of their project over to randos on the internet, especially if they're still using it. The best case is it would fork, or move to some other org's code repo.

I like to use projects which lots of other projects depend on directly. That way if the main project goes unsupported, all the other projects using it will band together to support a fork. I believe open source that is not created for a company will last much longer. (I like that they rewrote it in C, though; it would probably survive well as a fork if enough people/projects use it)


Hence it's being released: if/when people fork it, it will/could live forever.


what's the comparison with rqlite?

https://github.com/rqlite/rqlite


Never mind. C. Already in the Ubuntu repos as a one-step install, OK... this is cool. I'm gonna check it out.


> Runs on ARM, X86, POWER and IBM Z architectures

Odd, I'd think x64 would be the most commonly used architecture. Is this a mistake?


That's why it's called x86-64. It is the "same" architecture.


Since the author is in the comments: what are you planning to do about operations, i.e. keeping consistent performance while adding/removing/resyncing nodes, rebalancing, dealing with bitrot, disk errors, disk performance issues, filesystem issues, unstable network performance, etc.? It doesn't look like there is anything addressing operations in the code at the moment.


Does anyone know a good on-disk NoSQL DB like SQLite with Python bindings? I know SQLite has a JSON extension, but I don't think you can query it.


How is the performance running 3 nodes? How many inserts/reads can be done on average?


Depends on how fast your disk is, what file system and kernel you use, and how low your network latency is. Difficult to predict. But it's basically as fast as it can get, given 1) hardware constraints and 2) Raft consensus.

If you want light-speed inserts/deletes, you could probably skip the disk entirely: as long as a majority of your nodes doesn't die, you won't lose any data. You can also go somewhere in between and save to disk only at specific intervals.


Do all nodes participate as full RAFT nodes? Or can you have read-only nodes?


Aren't non-leader RAFT nodes already read-only?


I guess I meant observer nodes, with read-only DB access.


I just mean that Raft nodes are already read-only for most of their lives; it's not clear to me why you'd want nodes dedicated as read-only, since it would only reduce consistency guarantees (and I believe you wouldn't be avoiding the heartbeat and data-consistency network chatter). I think the leadership functionality comes cheap.


That makes sense. I need to read up on Raft again.

I know Zookeeper, for example, supports observer nodes: essentially a cheaper read-only cache. Chubby at Google had the same thing.



