ActorDB takes this about 100 steps further. It replicates the actual WAL pages instead of SQL commands. So the leader executes the SQL, and followers just write the pages they receive from the leader.
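To make the contrast concrete, here's a toy sketch (hypothetical Follower classes, not ActorDB's actual code). Statement shipping re-executes SQL on every replica, so anything non-deterministic can make replicas diverge; page shipping applies the leader's already-computed bytes verbatim:

```python
class StatementFollower:
    """Re-executes SQL. Non-deterministic expressions (random(),
    now(), ...) are recomputed per replica and can cause divergence."""
    def __init__(self, conn):
        self.conn = conn

    def apply(self, sql):
        self.conn.execute(sql)  # each follower recomputes the result


class PageFollower:
    """Writes the leader's WAL pages verbatim. The leader executed the
    SQL once, so replicas are byte-identical by construction."""
    def __init__(self):
        self.pages = {}  # page_number -> raw page bytes

    def apply(self, frame):
        page_no, data = frame
        self.pages[page_no] = data
```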
It uses LMDB to store everything instead of storing the Raft log in one place and SQLite in another. It can run thousands of SQLite instances at the same time, and it actually compresses the pages when storing them to LMDB.
You can query it with a MySQL client library (though Thrift is recommended). You can have multiple Raft clusters, there is a complementary actordb_console app to manage it easily, it supports distributed transactions, and more.
ActorDB is very cool, but from what I gather it's — at least nominally — designed to work with many small databases (one per "actor") as opposed to one large one, and the constraints it puts on the query model mean you have to carefully design your data model to be heavily compartmentalized from the beginning. It also poses some problems for database-wide queries, which have to fan out to every separate actor database. Whereas rqlite acts like one single SQLite instance, and can work with existing schemas.
Please do correct me if I'm wrong. Would ActorDB be happy and perform well with, say, just one actor?
To me, that just sounds like it forces you to do the Right Thing from the beginning. Eventually, you will need to shard your data in some way†.
If you're already locked into a "global queries against tables containing Everyone's Stuff" data model, you end up doing this through Stupid Database Tricks like "see-other" redirection or silent persistent-hash load-balancing (with constant ops-heavy rebalancing as you grow).
If you think about sharding from the beginning, though, you end up just dividing your data into little naturally-atomic "worlds." Like sharding email [headers, not bodies] by user, or sharding StackOverflow posts by community.
If you don't think you'll ever scale to the point where you'll be forced to make those calls, just pick a different database. (But then, if you don't think you'll ever reach that scale, then literally any database will do.)
---
† Unless you're EVE Online. They could probably write a really good whitepaper about how they've scaled their single-node MSSQL database so far. I presume they're mostly following the same patterns you would when running e.g. Oracle on big iron; but, according to most sources, they're just using some heavily-loaded commodity hardware with no fancy IO offloading or the like. No idea how they do it.
Yes, that's what I said: for most businesses (that aren't trying to scale to infinity the way that VC startups talk about), literally any database will do. Even single-node SQLite will do, because there won't be enough contention for its locking to ever matter.
Absolutely. Keep in mind that ActorDB actually requires that you supply the sharding key yourself whenever you do a query. It's not like, say, Elasticsearch, which shards both reads and writes transparently.
Many apps do require large queries that span many "partitions": For example, listing all of a StackOverflow member's contributions across all communities, sorted by time. With ActorDB, you would have to plan for this by denormalizing a bit (biokoda might correct me here): Each community would be an actor with a table of QA items, and each user would also have an actor containing a table listing their own QA items for all communities. Since ActorDB apparently has transactions, you can maintain this duplicate data atomically, though you can't maintain foreign-key constraints across actors.
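Sketching that denormalization (table and actor names are purely hypothetical): each community actor holds the canonical QA items, and each user actor holds just enough duplicated columns to answer the "all my contributions, sorted by time" query locally. Since cross-actor foreign keys don't exist, the comments have to carry that knowledge instead:

```python
# Hypothetical per-actor schemas; ActorDB gives each actor type its
# own schema, so these would be two different actor types.

COMMUNITY_SCHEMA = """
CREATE TABLE qa_items (
    item_id  INTEGER PRIMARY KEY,
    author   TEXT,     -- id of the author's user actor (no FK possible)
    body     TEXT,
    created  INTEGER
);
"""

USER_SCHEMA = """
CREATE TABLE my_qa_items (
    community TEXT,    -- which community actor holds the original
    item_id   INTEGER,
    created   INTEGER, -- lets this actor sort contributions by time
    PRIMARY KEY (community, item_id)
);
"""

# On every post: one distributed transaction inserts into the community
# actor's qa_items AND the user actor's my_qa_items, so the two copies
# can't drift apart.
```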
I work with CouchDB, and it sounds similar: many queries end up being map-reduce functions run across many CouchDB "databases" (partitions) on the same node/cluster.
One of the nice things about this approach, from my CouchDB experience, is that each DB/partition has its own permissions (fully extensible through design documents shoved into that DB). So instead of needing to carefully write your business layer to ensure that users can only ever query their own data out of a table, you just do all your work for a particular user in a table that only contains stuff that user has a right to access to begin with.
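For example (database and user names invented), locking a per-user database down to its one user is a single PUT to CouchDB's _security endpoint:

```python
import requests

# Restrict the database "userdata-alice" so only alice (plus server
# admins) can read or write it.
resp = requests.put(
    "http://admin:secret@localhost:5984/userdata-alice/_security",
    json={
        "admins":  {"names": [], "roles": []},
        "members": {"names": ["alice"], "roles": []},
    },
)
resp.raise_for_status()
```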
It's a lot like working with S3 buckets, now that I think about it. Buckets containing tables, rather than buckets containing objects.
Please excuse my ignorance - how do I enforce nontrivial system-wide invariants with lots of little databases rather than a single consolidated one?
The invariants I care about are:
(0) Referential integrity. If the table `Foo` has a `FOREIGN KEY (BarID) REFERENCES Bar (BarID)`, then no row in `Foo` may ever be seen with a `BarID` value that can't be found in the table `Bar`.
(1) Linearizability. There must exist a total order on the entire transaction history of a database, such that, starting from the empty database, and executing the transactions nonconcurrently in the given total order, the result is the current state of the database. (NOTE: The transaction history need not be physically stored anywhere. So this invariant can't be “tested” - it has to be proven to hold.)
These guarantees are so basic, so fundamental in my everyday use of RDBMSs, that I need to be convinced that they hold.
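For concreteness, invariant (0) inside a single SQLite database is trivial to get (you just have to remember to turn it on); the open question is what replaces this across a fleet of little databases:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite!
conn.execute("CREATE TABLE Bar (BarID INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE Foo (
        FooID INTEGER PRIMARY KEY,
        BarID INTEGER NOT NULL,
        FOREIGN KEY (BarID) REFERENCES Bar (BarID)
    )""")

conn.execute("INSERT INTO Bar VALUES (1)")
conn.execute("INSERT INTO Foo VALUES (1, 1)")       # fine
try:
    conn.execute("INSERT INTO Foo VALUES (2, 42)")  # no Bar row 42
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed
```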
I have no experience developing ERP systems. I do have experience being on the receiving end of a very poorly working one.
These systems seem (from my outside view) to have a tendency to become giant monoliths. So when developing, you must fight against increasing monolithic complexity. Using something like ActorDB can act as a beneficial constraint: it forces you to maintain a clean design.
I would force sales, marketing, shipping, product planning (taking from Wikipedia here...) to be their own separate actors with their own schemas. Then, if possible, shard within those types. So if something is customer service, have an actor per customer and keep all their data there. If you're developing multiple products, every product gets an actor.
> These systems seem (from my outside view) to have a tendency to become giant monoliths. So when developing you must fight against increasing monolithic complexity.
You aren't wrong; that's my experience as well. It's just as annoying for programmers (or, at least, for me) as it is for users. The following question has popped up in my head countless times: “Why do I have to rely on an implicit convention that this application module never touches this database table?” There was never a good answer.
The only reason why I put up with such things is that I have no idea how to prevent more modular designs from turning into a data integrity nightmare. (I'll freely admit my lack of education is to blame here.) For instance, let's say we have three modules: inventory, sales and shipping. Furthermore, let's assume each module is its own actor and uses its own backend database. We must implement the use case “enter a sale in the system”. The expected sequence of actions is:
(0) The sales module queries the inventory module whether there is enough of a product in stock to satisfy a customer order.
(1) The inventory module “locks” the requested quantity/amount of the product [so that it can't be used, say, for another sale], and gives the sales module a “token” that can be used to confirm or cancel the withdrawal.
(2) The sales module queries the shipping module if there are enough available trucks/ships/whatever to ship the product to the customer's location by a given delivery date.
(3) The shipping module “locks” however many trucks/ships/whatever it deems necessary to ship the product, and gives the sales module a “token” that can be used to confirm or cancel the shipping.
(4) The sales module queries the user for the customer's credit card number and verification code, interfaces with the bank's system, blablabla...
(5) The sales module confirms to the inventory and shipping modules [in this specific order] that the product will be withdrawn and shipped.
Now some exception handling:
(6) If step 3 fails, the sales module cancels the product withdrawal.
(7) If step 4 fails, the sales module cancels the product withdrawal and shipping.
(8) If any system [inventory, sales, shipping] goes down, neither the product nor the trucks/ships/whatever can be kept locked forever. So each lock must have a timeout, and, if it's neither explicitly confirmed nor explicitly cancelled by the sales module, it will be implicitly cancelled by the inventory and/or shipping module when the timeout elapses.
(9) It may happen [unlikely, but not impossible] that the inventory and shipping modules' clocks get unsynchronized in such a way that, when the product withdrawal has been confirmed, the shipping lock has already elapsed. Oh, the nightmare.
Implementing all of this correctly in all cases is actually tricky! And if anything is implemented slightly wrong, the whole system goes kaboom! With a monolithic database, there is no need to “lock” any resources, nor issue “confirmation tokens” - just use the DBMS's built-in transaction system!
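Just to show how much machinery the steps above imply, here's a minimal single-process sketch of the reserve/confirm/cancel protocol (all names hypothetical). Even in toy form, the step-(9) race is visible: confirm() can blow up if the hold expired a moment earlier.

```python
import time
import uuid

class ReservationService:
    """Toy inventory/shipping module handing out expiring hold tokens."""

    def __init__(self, capacity, ttl_seconds=60):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.holds = {}  # token -> (amount, deadline)

    def _expire(self):
        now = time.monotonic()
        for tok, (_, deadline) in list(self.holds.items()):
            if deadline < now:       # step (8): implicit cancellation
                del self.holds[tok]

    def reserve(self, amount):       # steps (1) and (3)
        self._expire()
        held = sum(a for a, _ in self.holds.values())
        if self.capacity - held < amount:
            raise RuntimeError("not enough capacity")
        tok = uuid.uuid4().hex
        self.holds[tok] = (amount, time.monotonic() + self.ttl)
        return tok

    def confirm(self, tok):          # step (5)
        self._expire()
        amount, _ = self.holds.pop(tok)  # KeyError if already expired:
        self.capacity -= amount          # this is the step (9) race

    def cancel(self, tok):           # steps (6) and (7)
        self.holds.pop(tok, None)


inventory = ReservationService(capacity=100)
shipping = ReservationService(capacity=5)

inv_tok = inventory.reserve(10)      # steps (0)-(1)
try:
    ship_tok = shipping.reserve(1)   # steps (2)-(3)
except RuntimeError:
    inventory.cancel(inv_tok)        # step (6)
    raise
# ... step (4): charge the card; on failure cancel both holds ...
inventory.confirm(inv_tok)           # step (5)
shipping.confirm(ship_tok)           # can still fail: see step (9)
```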
> Furthermore, let's assume each module is its own actor and uses its own backend database.
I would make every module an actor type. There can be multiple types, and each type has its own schema. Within an actor type there are many actors: an actor for every product, for instance. So all X widgets are in one actor.
> With a monolithic database, there is no need to “lock” any resources, nor issue “confirmation tokens” - just use the DBMS's built-in transaction system!
ActorDB has distributed ACID transactions, so I would use that. You can create a transaction over multiple actors. The reason I would split it into many actors is that you're always locking small parts of the system for the duration of the transaction, not the entire DB.
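Since ActorDB speaks the MySQL wire protocol, the sale could look roughly like this from Python. Fair warning: the `actor` statements are my loose recollection of ActorDB's SQL dialect, and the host/port/credentials are made up; check the docs for exact syntax.

```python
import pymysql  # any MySQL client library works, per the docs

conn = pymysql.connect(host="127.0.0.1", port=33306,  # port is a guess
                       user="myuser", password="mypass")
cur = conn.cursor()

# One ACID transaction spanning two actors. Only these two actors are
# locked for its duration, not the whole database. Statement syntax is
# approximate; ActorDB treats the script as a single transactional unit.
cur.execute(
    "actor inventory(widget_x);"
    "UPDATE stock SET qty = qty - 10 WHERE qty >= 10;"
    "actor sales(order_123);"
    "INSERT INTO orders VALUES (123, 'widget_x', 10);"
)
conn.commit()
```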
The ones that I use at home (PostgreSQL) and at work (SQL Server) allow serializable transactions. Of course, in many cases it's overkill, but it's good to know that it's there when needed.
You can actually get very far without sharding. But it eventually forces you to throw out everything that makes relational databases awesome and turns your database into a glorified KV store.
It would perform well with one actor when compared to a single SQLite instance. Not compared to PostgreSQL/MySQL, because the concurrency model is optimized for many concurrent actors, not for concurrent access to a single actor. A single actor can still execute thousands of queries per second, however.
Compared to a single SQLite instance, ActorDB has two major advantages: no write amplification (thanks to LMDB) and compression, which together mean reads/writes are significantly smaller. SQLite will always write everything twice: first to the WAL, then to the SQLite file; and it has no compression capabilities. It should completely trounce rqlite in the performance department.
Yes, ActorDB takes sqlite3.c, rips out the wal.c code, and replaces it with calls to LMDB. There is no redo log; LMDB is designed in such a way that it does not require one. It has actually been verified as the safest storage engine design out there.
This is exactly what I'm looking for. One small database per entity, with the option of distributed transactions across the databases. It sounds wonderful. How mature is it?
Thank you. What do you do about secondary indexes? Say, if you want to search for users by name?
To elaborate: I imagine sharding my data per user, so that I have one actor per user. But I still want a search bar on my page where I can type in a name and find the user in question. A multi-actor query would probably be too inefficient in this case. Did you have a similar problem, and if so, how did you solve it?
ActorDB can have multiple actor types, each with a completely different schema.
I would create a new actor type for this purpose and have a single actor in it. This would just contain the names of users. It would be easy to query and manage.
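So, sketching that suggestion (all names hypothetical): one "directory" actor whose only job is mapping display names to per-user actor ids, written to whenever a user is created or renamed.

```python
# Schema for the single directory actor (its own actor type).
DIRECTORY_SCHEMA = """
CREATE TABLE user_names (
    name    TEXT,
    user_id TEXT,   -- id of the per-user actor
    PRIMARY KEY (name, user_id)
);
"""

# The search bar then becomes one single-actor query:
SEARCH = "SELECT user_id FROM user_names WHERE name LIKE ? LIMIT 20"
```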
Another option is to create a new KV type and keep the names there. This is much more scalable but a bit more difficult to query. It's needed if you really require many thousands of queries per second.
On the other hand, if your scale is such that search-by-name is a big load, it is probably better to complement the DB with Elasticsearch and move text search to it. This is how the big boys do it.
I've written something similar for PostgreSQL using Paxos: https://github.com/citusdata/pg_paxos . It can replicate tables across a set of PostgreSQL servers.
pg_paxos is an extension that uses PostgreSQL's query execution hooks so it doesn't require any external components. An advantage of pg_paxos is that you can do reads and writes on any node. Would love to know what people think and whether it's something you would use (if rigorously tested). Small warning: it replicates SQL queries and doesn't stop you from using volatile functions like now() yet, which would give inconsistent results.
A possible application is a reliable cron service. Each cron node tries to add an entry to a replicated locks table with the same primary key on any of the PostgreSQL nodes. Only one will succeed and then runs the cron job.
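In plain psycopg2 terms, that could look like this (table and job names invented); the primary key is what guarantees at most one winner per scheduled run:

```python
import psycopg2

def run_the_job():
    pass  # placeholder for the actual cron work

conn = psycopg2.connect("dbname=app")
conn.autocommit = True
with conn.cursor() as cur:
    try:
        # cron_locks has PRIMARY KEY (job_name, run_at), so exactly one
        # node's INSERT can succeed cluster-wide for a given run.
        cur.execute(
            "INSERT INTO cron_locks (job_name, run_at) VALUES (%s, %s)",
            ("nightly_report", "2016-01-18"))
        run_the_job()   # we won the race
    except psycopg2.IntegrityError:
        pass            # another node already claimed this run
```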
Bucardo provides asynchronous (eventually consistent) replication. If you need strong consistency with fault-tolerance, you need something like pg_paxos, but it comes at a significant performance penalty so it's not really an alternative for existing replication solutions.
Thanks. I've been looking at PG replication solutions to get rid of the hand-coded synchronization built into a couple of my apps that date back to the dial-up days.
"Due to the nature of Raft, there is a very small window (milliseconds) where a node has been disposed as leader, but has not yet changed its internal state. Therefore, even with the leader check in place, there is a very small window of time where out-of-date results could be returned."
No, that's absolutely not inherent to the nature of Raft. It sounds like they're making the same mistake that etcd did, simply relying on whether a node believes it's still the leader rather than having it confirm that fact by committing a log entry: https://aphyr.com/posts/316-jepsen-etcd-and-consul
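A sketch of the fix (hypothetical raft_node API, not rqlite's actual code): before serving a read, the leader proves it is still the leader by pushing an entry through a quorum, rather than merely trusting its local belief.

```python
class NotLeader(Exception):
    pass

def linearizable_read(raft_node, query):
    """Believing you're the leader isn't enough; commit a no-op (or use
    a read-index) through a quorum first, otherwise a deposed leader
    can happily serve stale data."""
    if not raft_node.is_leader():
        raise NotLeader()
    index = raft_node.append_noop()   # replicate a no-op log entry
    raft_node.wait_committed(index)   # quorum ack => still the leader
    return raft_node.state_machine.query(query)
```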
> rqlite reports a small amount anonymous data to Loggly, each time it is launched. This data is just the host operating system and system architecture and is only used to track the number of rqlite deployments. Reporting can be disabled by passing -noreport=true to rqlite at launch time.
This is very close to the end of the Readme, and it is, in my opinion, the wrong default setting.
I'm having trouble understanding this (but everyone here thinks it's rather cool, so I think I'm missing something).
I've used SQLite mainly for small, app-local data storage (because it's easier than creating a custom file-format). What is the purpose of doing replication on such a database?
If you don't care about losing your data if you lose your host machine, then there is no point.
The point is to have copies of your SQLite data on multiple different machines so if you lose one of the machines, you don't lose your data. The same reason anyone replicates a database.
I'm not sure that this experiment is all that useful in determining the practicality of using Raft instead of Paxos. Anybody implementing one of these algorithms is going to spend more than one hour learning it. In my experience, there are lots of corollaries that must be proven in order to actually implement the spec described by the Raft paper (mainly due to weak assumptions that don't hold in all production systems). Paxos is a bit more established, so this is not as big of an issue; however, I still do not have an intuitive understanding of Paxos. This can (and will) lead to non-deterministic bugs in your implementation. A more useful study would track bugs in Raft & Paxos implementations.
The two phases of the algorithm are described at the bottom of page 5. Section 3 also describes how to use Paxos to implement log replication, Multi-Paxos, which is comparable to Raft. Unfortunately, this section does a lot of hand-waving, so it requires the implementer to have a thorough understanding of the underlying issues and find good solutions to problems such as membership changes.
Conversely, the Raft paper is much more instructive to implementers, giving a clear overview of the necessary functions, messages and state:
https://ramcloud.stanford.edu/raft.pdf
Personally, I find Multi-Paxos a lot more elegant, as it is essentially derived by working backwards from the safety constraints. As a distributed systems researcher, it is very obvious to me why it works, and I can do a basic implementation almost from memory. Unlike Raft, it is also independent of time. Another nice property is that it is symmetric/masterless: all nodes can do reads and writes at any time, though whether that's a good idea is another matter. Of course, these are rather subjective and mostly aesthetic notions.
Raft is getting quite popular, because people find it easy to follow the instructions of the paper, but this can be slightly deceptive. For example, using stable storage for your state is essential for crash-safety (meaning fsync-ing before answering), but I don't see where Hashicorp's raft implementation does that. Raft relies on the combination of several rules to reach consensus reliably, for which Ongaro has given a rather elaborate proof using 20 pages of TLA+ in his thesis. It's quite easy to make a mistake in your implementation and break one of the rules. The same goes for Paxos, but the rules are, at least to some extent, more obvious.
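For the curious, "stable storage" here means roughly the following before replying to any vote/append RPC; a minimal sketch (not Hashicorp's code), since skipping any fsync can break Raft's safety argument after a crash:

```python
import json
import os

def persist_term_and_vote(path, current_term, voted_for):
    """Durably record Raft's persistent state (currentTerm, votedFor)
    before answering an RPC: write a temp file, fsync it, atomically
    rename it into place, then fsync the directory."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"term": current_term, "voted_for": voted_for}, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, path)
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```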
A big benefit of Raft is that it already includes quite a few optimizations that would be necessary to make a Multi-Paxos implementation efficient, and you don't have to come up with them yourself. You can generally expect Raft implementations to outperform Multi-Paxos implementations. On the other hand, Multi-Paxos can be tailored to a specific use-case, such as Google's Megastore, Chubby and Spanner, which would be much more difficult with Raft.
Since SQLite can do in-memory, does this now make a good replacement for distributed in-memory databases like MemSQL and VoltDB? Also, could this act as a replacement for memcached/Redis?
To me, it looks like there's an explosion in database solutions, and most solutions are converging on an overlapping set of features.
The description suggests it is fault-tolerant, but then I read about all writes and queries going through a single leader, which is an obvious SPOF. Even if temporary, a SPOF is still a SPOF.
You can have a leader and still be fault tolerant...
At any rate, if the leader dies another member of the cluster is elected and operations proceed. Yes, it's not "always available", but that's the trade-off you make when you impose strong consistency.
"Not always available" means "not fault tolerant" to me.
You can have strong consistency with no leader and no temporary SPOF with Egalitarian Paxos. And even with standard Paxos/Raft it is possible to architect the system to not have a single leader, but multiple leaders responsible for different partitions of data.
Even if you had multiple leaders responsible for different partitions you'd still be (temporarily) unavailable if the leader died, so...while you have improved the system's availability characteristics you haven't changed the underlying fault model.
Also, keep in mind that not everything has to go through the leader. If your application is OK with stale reads you can read committed data from any follower, allowing you to be available (for some definition of available) while the master is down.
Also, IIRC most practical strongly-consistent systems aren't egalitarian (I'd love to hear otherwise) and instead, rely on a leader.
https://github.com/biokoda/actordb