ActorDB takes this about 100 steps further. It replicates the actual WAL pages instead of SQL commands. So the leader executes the SQL, and followers just write the pages they receive from the leader.
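To make the contrast concrete, here's a toy sketch (hypothetical Follower classes, not ActorDB's actual code). Statement shipping re-executes SQL on every replica, so anything non-deterministic can make replicas diverge; page shipping applies the leader's already-computed bytes verbatim:

```python
class StatementFollower:
    """Re-executes SQL. Non-deterministic expressions (random(),
    now(), ...) are recomputed per replica and can cause divergence."""
    def __init__(self, conn):
        self.conn = conn

    def apply(self, sql):
        self.conn.execute(sql)  # each follower recomputes the result


class PageFollower:
    """Writes the leader's WAL pages verbatim. The leader executed the
    SQL once, so replicas are byte-identical by construction."""
    def __init__(self):
        self.pages = {}  # page_number -> raw page bytes

    def apply(self, frame):
        page_no, data = frame
        self.pages[page_no] = data
```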
It uses LMDB to store everything instead of storing the Raft log in one place and SQLite in another. It can run thousands of SQLite instances at the same time, and it actually compresses the pages when storing them to LMDB.
You can query it with a MySQL client library (though Thrift is recommended). You can have multiple Raft clusters, there is a complementary actordb_console app to manage it easily, it supports distributed transactions, and more.
ActorDB is very cool, but from what I gather it's — at least nominally — designed to work with many small databases (one per "actor") as opposed to one large one, and the constraints it puts on the query model mean you have to carefully design your data model to be heavily compartmentalized from the beginning. It also poses some problems for database-wide queries, which have to fan out to every separate actor database. Whereas rqlite acts like one single SQLite instance, and can work with existing schemas.
Please do correct me if I'm wrong. Would ActorDB be happy and perform well with, say, just one actor?
To me, that just sounds like it forces you to do the Right Thing from the beginning. Eventually, you will need to shard your data in some way†.
If you're already locked into a "global queries against tables containing Everyone's Stuff" data model, you end up doing this through Stupid Database Tricks like "see-other" redirection or silent persistent-hash load-balancing (with constant ops-heavy rebalancing as you grow).
If you think about sharding from the beginning, though, you end up just dividing your data into little naturally-atomic "worlds." Like sharding email [headers, not bodies] by user, or sharding StackOverflow posts by community.
If you don't think you'll ever scale to the point where you'll be forced to make those calls, just pick a different database. (But then, if you don't think you'll ever reach that scale, then literally any database will do.)
---
† Unless you're EVE Online. They could probably write a really good whitepaper about how they've scaled their single-node MSSQL database so far. I presume they're mostly following the same patterns you would when running e.g. Oracle on big iron; but, according to most sources, they're just using some heavily-loaded commodity hardware with no fancy IO offloading or the like. No idea how they do it.
Yes, that's what I said: for most businesses (that aren't trying to scale to infinity the way that VC startups talk about), literally any database will do. Even single-node SQLite will do, because there won't be enough contention for its locking to ever matter.
Absolutely. Keep in mind that ActorDB actually requires that you supply the sharding key yourself whenever you do a query. It's not like, say, Elasticsearch, which shards both reads and writes transparently.
Many apps do require large queries that span many "partitions": For example, listing all of a StackOverflow member's contributions across all communities, sorted by time. With ActorDB, you would have to plan for this by denormalizing a bit (biokoda might correct me here): Each community would be an actor with a table of QA items, and each user would also have an actor containing a table listing their own QA items for all communities. Since ActorDB apparently has transactions, you can maintain this duplicate data atomically, though you can't maintain foreign-key constraints across actors.
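Sketching that denormalization (table and actor names are purely hypothetical): each community actor holds the canonical QA items, and each user actor holds just enough duplicated columns to answer the "all my contributions, sorted by time" query locally. Since cross-actor foreign keys don't exist, the comments have to carry that knowledge instead:

```python
# Hypothetical per-actor schemas; ActorDB gives each actor type its
# own schema, so these would be two different actor types.

COMMUNITY_SCHEMA = """
CREATE TABLE qa_items (
    item_id  INTEGER PRIMARY KEY,
    author   TEXT,     -- id of the author's user actor (no FK possible)
    body     TEXT,
    created  INTEGER
);
"""

USER_SCHEMA = """
CREATE TABLE my_qa_items (
    community TEXT,    -- which community actor holds the original
    item_id   INTEGER,
    created   INTEGER, -- lets this actor sort contributions by time
    PRIMARY KEY (community, item_id)
);
"""

# On every post: one distributed transaction inserts into the community
# actor's qa_items AND the user actor's my_qa_items, so the two copies
# can't drift apart.
```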
I work with CouchDB, and it sounds similar: many queries end up being map-reduce functions run across many CouchDB "databases" (partitions) on the same node/cluster.
One of the nice things about this approach, from my CouchDB experience, is that each DB/partition has its own permissions (fully extensible through design documents shoved into that DB). So instead of needing to carefully write your business layer to ensure that users can only ever query their own data out of a table, you just do all your work for a particular user in a table that only contains stuff that user has a right to access to begin with.
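For example (database and user names invented), locking a per-user database down to its one user is a single PUT to CouchDB's _security endpoint:

```python
import requests

# Restrict the database "userdata-alice" so only alice (plus server
# admins) can read or write it.
resp = requests.put(
    "http://admin:secret@localhost:5984/userdata-alice/_security",
    json={
        "admins":  {"names": [], "roles": []},
        "members": {"names": ["alice"], "roles": []},
    },
)
resp.raise_for_status()
```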
It's a lot like working with S3 buckets, now that I think about it. Buckets containing tables, rather than buckets containing objects.
Please excuse my ignorance - how do I enforce nontrivial system-wide invariants with lots of little databases rather than a single consolidated one?
The invariants I care about are:
(0) Referential integrity. If the table `Foo` has a `FOREIGN KEY (BarID) REFERENCES Bar (BarID)`, then no row in `Foo` may ever be seen with a `BarID` value that can't be found in the table `Bar`.
(1) Linearizability. There must exist a total order on the entire transaction history of a database, such that, starting from the empty database, and executing the transactions nonconcurrently in the given total order, the result is the current state of the database. (NOTE: The transaction history need not be physically stored anywhere. So this invariant can't be “tested” - it has to be proven to hold.)
These guarantees are so basic, so fundamental in my everyday use of RDBMSs, that I need to be convinced that they hold.
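For concreteness, invariant (0) inside a single SQLite database is trivial to get (you just have to remember to turn it on); the open question is what replaces this across a fleet of little databases:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite!
conn.execute("CREATE TABLE Bar (BarID INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE Foo (
        FooID INTEGER PRIMARY KEY,
        BarID INTEGER NOT NULL,
        FOREIGN KEY (BarID) REFERENCES Bar (BarID)
    )""")

conn.execute("INSERT INTO Bar VALUES (1)")
conn.execute("INSERT INTO Foo VALUES (1, 1)")       # fine
try:
    conn.execute("INSERT INTO Foo VALUES (2, 42)")  # no Bar row 42
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed
```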
I have no experience developing ERP systems. I do have experience being on the receiving end of a very poorly working one.
These systems seem (from my outside view) to have a tendency to become giant monoliths. So when developing, you must fight against increasing monolithic complexity. Using something like ActorDB can act as a beneficial constraint: it forces you to maintain a clean design.
I would force sales, marketing, shipping, product planning (taking from Wikipedia here...) to be their own separate actors with their own schemas. Then, if possible, shard within those types. So if something is customer service, have an actor per customer and keep all their data there. If you're developing multiple products, every product gets an actor.
> These systems seem (from my outside view) to have a tendency to become giant monoliths. So when developing you must fight against increasing monolithic complexity.
You aren't wrong; that's my experience as well. It's just as annoying for programmers (or, at least, for me) as it is for users. The following question has popped up in my head countless times: “Why do I have to rely on an implicit convention that this application module never touches this database table?” There was never a good answer.
The only reason why I put up with such things is that I have no idea how to prevent more modular designs from turning into a data integrity nightmare. (I'll freely admit my lack of education is to blame here.) For instance, let's say we have three modules: inventory, sales and shipping. Furthermore, let's assume each module is its own actor and uses its own backend database. We must implement the use case “enter a sale in the system”. The expected sequence of actions is:
(0) The sales module queries the inventory module whether there is enough of a product in stock to satisfy a customer order.
(1) The inventory module “locks” the requested quantity/amount of the product [so that it can't be used, say, for another sale], and gives the sales module a “token” that can be used to confirm or cancel the withdrawal.
(2) The sales module queries the shipping module if there are enough available trucks/ships/whatever to ship the product to the customer's location by a given delivery date.
(3) The shipping module “locks” however many trucks/ships/whatever it deems necessary to ship the product, and gives the sales module a “token” that can be used to confirm or cancel the shipping.
(4) The sales module queries the user for the customer's credit card number and verification code, interfaces with the bank's system, blablabla...
(5) The sales module confirms to the inventory and shipping modules [in this specific order] that the product will be withdrawn and shipped.
Now some exception handling:
(6) If step 3 fails, the sales module cancels the product withdrawal.
(7) If step 4 fails, the sales module cancels the product withdrawal and shipping.
(8) If any system [inventory, sales, shipping] goes down, neither the product nor the trucks/ships/whatever can be kept locked forever. So each lock must have a timeout, and, if it's neither explicitly confirmed nor explicitly cancelled by the sales module, it will be implicitly cancelled by the inventory and/or shipping module when the timeout elapses.
(9) It may happen [unlikely, but not impossible] that the inventory and shipping modules' clocks get unsynchronized in such a way that, when the product withdrawal has been confirmed, the shipping lock has already elapsed. Oh, the nightmare.
Implementing all of this correctly in all cases is actually tricky! And if anything is implemented slightly wrong, the whole system goes kaboom! With a monolithic database, there is no need to “lock” any resources, nor issue “confirmation tokens” - just use the DBMS's built-in transaction system!
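Just to show how much machinery the steps above imply, here's a minimal single-process sketch of the reserve/confirm/cancel protocol (all names hypothetical). Even in toy form, the step-(9) race is visible: confirm() can blow up if the hold expired a moment earlier.

```python
import time
import uuid

class ReservationService:
    """Toy inventory/shipping module handing out expiring hold tokens."""

    def __init__(self, capacity, ttl_seconds=60):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.holds = {}  # token -> (amount, deadline)

    def _expire(self):
        now = time.monotonic()
        for tok, (_, deadline) in list(self.holds.items()):
            if deadline < now:       # step (8): implicit cancellation
                del self.holds[tok]

    def reserve(self, amount):       # steps (1) and (3)
        self._expire()
        held = sum(a for a, _ in self.holds.values())
        if self.capacity - held < amount:
            raise RuntimeError("not enough capacity")
        tok = uuid.uuid4().hex
        self.holds[tok] = (amount, time.monotonic() + self.ttl)
        return tok

    def confirm(self, tok):          # step (5)
        self._expire()
        amount, _ = self.holds.pop(tok)  # KeyError if already expired:
        self.capacity -= amount          # this is the step (9) race

    def cancel(self, tok):           # steps (6) and (7)
        self.holds.pop(tok, None)


inventory = ReservationService(capacity=100)
shipping = ReservationService(capacity=5)

inv_tok = inventory.reserve(10)      # steps (0)-(1)
try:
    ship_tok = shipping.reserve(1)   # steps (2)-(3)
except RuntimeError:
    inventory.cancel(inv_tok)        # step (6)
    raise
# ... step (4): charge the card; on failure cancel both holds ...
inventory.confirm(inv_tok)           # step (5)
shipping.confirm(ship_tok)           # can still fail: see step (9)
```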
> Furthermore, let's assume each module is its own actor and uses its own backend database.
I would make every module an actor type. There can be multiple types, and each type has its own schema. Within an actor type there are many actors: an actor for every product, for instance. So all X widgets are in one actor.
> With a monolithic database, there is no need to “lock” any resources, nor issue “confirmation tokens” - just use the DBMS's built-in transaction system!
ActorDB has distributed ACID transactions, so I would use that. You can create a transaction over multiple actors. The reason I would split it into many actors is that you're always locking small parts of the system for the duration of the transaction, not the entire DB.
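Since ActorDB speaks the MySQL wire protocol, the sale could look roughly like this from Python. Fair warning: the `actor` statements are my loose recollection of ActorDB's SQL dialect, and the host/port/credentials are made up; check the docs for exact syntax.

```python
import pymysql  # any MySQL client library works, per the docs

conn = pymysql.connect(host="127.0.0.1", port=33306,  # port is a guess
                       user="myuser", password="mypass")
cur = conn.cursor()

# One ACID transaction spanning two actors. Only these two actors are
# locked for its duration, not the whole database. Statement syntax is
# approximate; ActorDB treats the script as a single transactional unit.
cur.execute(
    "actor inventory(widget_x);"
    "UPDATE stock SET qty = qty - 10 WHERE qty >= 10;"
    "actor sales(order_123);"
    "INSERT INTO orders VALUES (123, 'widget_x', 10);"
)
conn.commit()
```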
The ones that I use at home (PostgreSQL) and at work (SQL Server) allow serializable transactions. Of course, in many cases it's overkill, but it's good to know that it's there when needed.
You can actually get very far without sharding. But it eventually forces you to throw out everything that makes relational databases awesome and turns your database into a glorified KV store.
It would perform well with one actor when compared to a single SQLite instance. Not compared to PostgreSQL/MySQL, because the concurrency model is optimized for many concurrent actors, not for concurrent access to a single actor. A single actor can still execute thousands of queries per second, however.
Compared to a single SQLite instance, ActorDB has two major advantages: no write amplification (thanks to LMDB) and compression, which together mean reads/writes are significantly smaller. SQLite will always write everything twice: first to the WAL, then to the SQLite file; and it has no compression capabilities. It should completely trounce rqlite in the performance department.
Yes, ActorDB takes sqlite3.c, rips out the wal.c code, and replaces it with calls to LMDB. There is no redo log; LMDB is designed in such a way that it does not require one. It has actually been verified as the safest storage engine design out there.
This is exactly what I'm looking for. One small database per entity, with the option of distributed transactions across the databases. It sounds wonderful. How mature is it?
Thank you. What do you do about secondary indexes? Say, if you want to search for users by name?
To elaborate: I imagine sharding my data per user, so that I have one actor per user. But I still want a search bar on my page where I can type in a name and find the user in question. A multi-actor query would probably be too inefficient in this case. Did you have a similar problem, and if so, how did you solve it?
ActorDB can have multiple actor types, each with a completely different schema.
I would create a new actor type for this purpose and have a single actor in it. This would just contain the names of users. It would be easy to query and manage.
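So, sketching that suggestion (all names hypothetical): one "directory" actor whose only job is mapping display names to per-user actor ids, written to whenever a user is created or renamed.

```python
# Schema for the single directory actor (its own actor type).
DIRECTORY_SCHEMA = """
CREATE TABLE user_names (
    name    TEXT,
    user_id TEXT,   -- id of the per-user actor
    PRIMARY KEY (name, user_id)
);
"""

# The search bar then becomes one single-actor query:
SEARCH = "SELECT user_id FROM user_names WHERE name LIKE ? LIMIT 20"
```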
Another option is to create a new KV type and keep the names there. This is much more scalable but a bit more difficult to query. It's needed if you really require many thousands of queries per second.
On the other hand, if your scale is such that search-by-name is a big load, it is probably better to complement the DB with Elasticsearch and move text search to it. This is how the big boys do it.
I've written something similar for PostgreSQL using Paxos: https://github.com/citusdata/pg_paxos . It can replicate tables across a set of PostgreSQL servers.
pg_paxos is an extension that uses PostgreSQL's query execution hooks so it doesn't require any external components. An advantage of pg_paxos is that you can do reads and writes on any node. Would love to know what people think and whether it's something you would use (if rigorously tested). Small warning: it replicates SQL queries and doesn't stop you from using volatile functions like now() yet, which would give inconsistent results.
A possible application is a reliable cron service. Each cron node tries to add an entry to a replicated locks table with the same primary key on any of the PostgreSQL nodes. Only one will succeed and then runs the cron job.
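In plain psycopg2 terms, that could look like this (table and job names invented); the primary key is what guarantees at most one winner per scheduled run:

```python
import psycopg2

def run_the_job():
    pass  # placeholder for the actual cron work

conn = psycopg2.connect("dbname=app")
conn.autocommit = True
with conn.cursor() as cur:
    try:
        # cron_locks has PRIMARY KEY (job_name, run_at), so exactly one
        # node's INSERT can succeed cluster-wide for a given run.
        cur.execute(
            "INSERT INTO cron_locks (job_name, run_at) VALUES (%s, %s)",
            ("nightly_report", "2016-01-18"))
        run_the_job()   # we won the race
    except psycopg2.IntegrityError:
        pass            # another node already claimed this run
```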
Bucardo provides asynchronous (eventually consistent) replication. If you need strong consistency with fault-tolerance, you need something like pg_paxos, but it comes at a significant performance penalty so it's not really an alternative for existing replication solutions.
Thanks. I've been looking at PG replication solutions to get rid of the hand-coded synchronization built into a couple of my apps that date back to the dial-up days.
"Due to the nature of Raft, there is a very small window (milliseconds) where a node has been disposed as leader, but has not yet changed its internal state. Therefore, even with the leader check in place, there is a very small window of time where out-of-date results could be returned."
No, that's absolutely not inherent to the nature of Raft. It sounds like they're making the same mistake that etcd did, simply relying on whether a node believes it's still the leader rather than having it confirm that fact by committing a log entry: https://aphyr.com/posts/316-jepsen-etcd-and-consul
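A sketch of the fix (hypothetical raft_node API, not rqlite's actual code): before serving a read, the leader proves it is still the leader by pushing an entry through a quorum, rather than merely trusting its local belief.

```python
class NotLeader(Exception):
    pass

def linearizable_read(raft_node, query):
    """Believing you're the leader isn't enough; commit a no-op (or use
    a read-index) through a quorum first, otherwise a deposed leader
    can happily serve stale data."""
    if not raft_node.is_leader():
        raise NotLeader()
    index = raft_node.append_noop()   # replicate a no-op log entry
    raft_node.wait_committed(index)   # quorum ack => still the leader
    return raft_node.state_machine.query(query)
```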
> rqlite reports a small amount anonymous data to Loggly, each time it is launched. This data is just the host operating system and system architecture and is only used to track the number of rqlite deployments. Reporting can be disabled by passing -noreport=true to rqlite at launch time.
This is very close to the end of the Readme, and it is, in my opinion, the wrong default setting.
I'm having trouble understanding this (but everyone here thinks it's rather cool, so I think I'm missing something).
I've used SQLite mainly for small, app-local data storage (because it's easier than creating a custom file-format). What is the purpose of doing replication on such a database?
If you don't care about losing your data if you lose your host machine, then there is no point.
The point is to have copies of your SQLite data on multiple different machines so if you lose one of the machines, you don't lose your data. The same reason anyone replicates a database.
I'm not sure that this experiment is all that useful in determining the practicality of using Raft instead of Paxos. Anybody implementing one of these algorithms is going to spend more than one hour learning it. In my experience, there are lots of corollaries that must be proven in order to actually implement the spec described by the Raft paper (mainly due to weak assumptions that don't hold in all production systems). Paxos is a bit more established, so this is not as big of an issue; however, I still do not have an intuitive understanding of Paxos. This can (and will) lead to non-deterministic bugs in your implementation. A more useful study would track bugs in Raft & Paxos implementations.
The two phases of the algorithm are described at the bottom of page 5. Section 3 also describes how to use Paxos to implement log replication, Multi-Paxos, which is comparable to Raft. Unfortunately, this section does a lot of hand-waving, so it requires the implementer to have a thorough understanding of the underlying issues and find good solutions to problems such as membership changes.
Conversely, the Raft paper is much more instructive to implementers, giving a clear overview of the necessary functions, messages and state:
https://ramcloud.stanford.edu/raft.pdf
Personally, I find Multi-Paxos a lot more elegant, as it is essentially derived by working backwards from the safety constraints. As a distributed systems researcher, it is very obvious to me why it works, and I can do a basic implementation almost from memory. Unlike Raft, it is also independent of time. Another nice property is that it is symmetric/masterless: all nodes can do reads and writes at any time, though whether that's a good idea is another matter. Of course, these are rather subjective and mostly aesthetic notions.
Raft is getting quite popular, because people find it easy to follow the instructions of the paper, but this can be slightly deceptive. For example, using stable storage for your state is essential for crash-safety (meaning fsync-ing before answering), but I don't see where Hashicorp's raft implementation does that. Raft relies on the combination of several rules to reach consensus reliably, for which Ongaro has given a rather elaborate proof using 20 pages of TLA+ in his thesis. It's quite easy to make a mistake in your implementation and break one of the rules. The same goes for Paxos, but the rules are, at least to some extent, more obvious.
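For the curious, "stable storage" here means roughly the following before replying to any vote/append RPC; a minimal sketch (not Hashicorp's code), since skipping any fsync can break Raft's safety argument after a crash:

```python
import json
import os

def persist_term_and_vote(path, current_term, voted_for):
    """Durably record Raft's persistent state (currentTerm, votedFor)
    before answering an RPC: write a temp file, fsync it, atomically
    rename it into place, then fsync the directory."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"term": current_term, "voted_for": voted_for}, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, path)
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```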
A big benefit of Raft is that it already includes quite a few optimizations that would be necessary to make a Multi-Paxos implementation efficient, and you don't have to come up with them yourself. You can generally expect Raft implementations to outperform Multi-Paxos implementations. On the other hand, Multi-Paxos can be tailored to a specific use-case, such as Google's Megastore, Chubby and Spanner, which would be much more difficult with Raft.
Since SQLite can do in-memory, does this now make a good replacement for distributed in-memory databases like MemSQL and VoltDB? Also, could this act as a replacement for memcached/Redis?
To me, it looks like there's an explosion in database solutions, and most solutions are converging on an overlapping set of features.
The description suggests it is fault-tolerant, but then I read about all writes and queries going through a single leader, which is an obvious SPOF. Even if temporary, a SPOF is still a SPOF.
You can have a leader and still be fault tolerant...
At any rate, if the leader dies another member of the cluster is elected and operations proceed. Yes, it's not "always available", but that's the trade-off you make when you impose strong consistency.
"Not always available" means "not fault tolerant" to me.
You can have strong consistency with no leader and no temporary SPOF with Egalitarian Paxos. And even with standard Paxos/Raft it is possible to architect the system to not have a single leader, but multiple leaders responsible for different partitions of data.
Even if you had multiple leaders responsible for different partitions you'd still be (temporarily) unavailable if the leader died, so...while you have improved the system's availability characteristics you haven't changed the underlying fault model.
Also, keep in mind that not everything has to go through the leader. If your application is OK with stale reads you can read committed data from any follower, allowing you to be available (for some definition of available) while the master is down.
Also, IIRC most practical strongly-consistent systems aren't egalitarian (I'd love to hear otherwise) and instead, rely on a leader.
https://github.com/biokoda/actordb