I like Slide 251: 'solve it at the application layer' :)
I'm part of a team responsible for cloud-based video editing software. We use multi-master replication (perhaps also known as optimistic replication) with our own tools throughout, but it does require careful design: keep as much data immutable as possible, give every piece of data that might be updated by different machines at the same time its own row, and put a GUID on every row.
Each machine can generate its own local IDs, which look a lot like a timestamp with some unique stuff on the end. Each row gets a GUID and a 'version' ID column, and we only apply relayed database updates if the incoming version is newer. This is largely last-timestamp-wins for the case of conflicts (rare because of design decisions), but there is some Lamport-timestamp behaviour in there too when updating an existing row.
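Roughly, the relay logic looks something like this (a simplified Python sketch of the idea, not our production code; it assumes a DB-API style cursor and a table with a GUID key and a comparable 'version' column):

    def apply_relayed_update(cursor, table, row):
        # row is a dict with at least 'guid' and 'version' plus the changed columns.
        cursor.execute(f"SELECT version FROM {table} WHERE guid = %s", (row["guid"],))
        existing = cursor.fetchone()

        if existing is None:
            # Row we've never seen: insert it as-is.
            columns = ", ".join(row)
            placeholders = ", ".join(["%s"] * len(row))
            cursor.execute(
                f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
                tuple(row.values()),
            )
        elif row["version"] > existing[0]:
            # Incoming version is newer: last-writer-wins overwrite.
            assignments = ", ".join(f"{col} = %s" for col in row if col != "guid")
            values = [row[col] for col in row if col != "guid"]
            cursor.execute(
                f"UPDATE {table} SET {assignments} WHERE guid = %s",
                (*values, row["guid"]),
            )
        # Otherwise our copy is at least as new, so the relayed update is dropped.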
The main downside is still that all machines need to handle every write, but by batching incoming updates into larger transactions we've had no problems with quite a high volume of database updates on a dozen commodity machines. Obviously, filtering into different shards would be an easy solution.
I'm looking forward to seeing what other people are doing with multi-master replication.
At TransLoc we use a multi-master setup. At its heart it's two nodes replicating from each other. Our data is divided into several databases. Each database "belongs" to only one master at a time. For example if we have nodes A and B, and databases w, x, y and z, we would have A be responsible for writes to w and x, while B would be responsible for y and z.
If B fails, we have a monitoring system in place to tell A that it is responsible for w, x, y and z at once. The monitor sets read-write permissions for databases y and z on A (through user permissions), and then notifies all of our application servers that things have shifted. The applications for their part include a piece of common code that monitors for changes in the cluster and allows the application code to cope with these changes. For web requests, if failover happened in the middle of a transaction, the request fails. For long running processes, the process will have to go to the top of the event loop and request a new database connection, etc.
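In outline, that common routing code looks something like this (a simplified Python sketch with made-up names, not our actual code):

    # Each logical database is owned by exactly one master at a time.
    OWNERSHIP = {
        "w": "node-a", "x": "node-a",   # A takes writes for w and x
        "y": "node-b", "z": "node-b",   # B takes writes for y and z
    }

    def on_failover(failed_node, surviving_node):
        # Called when the monitor announces a dead master: the survivor
        # takes over every database the failed node owned.
        for db, owner in OWNERSHIP.items():
            if owner == failed_node:
                OWNERSHIP[db] = surviving_node

    def master_for(db):
        # Application code asks where to send writes for a given database.
        return OWNERSHIP[db]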
So far it's worked fairly well. We are able to achieve high availability with it, since our master nodes are in two different data centers. There are definitely issues with this approach in general, but it works for our work load.
You're kind of in an active-passive multi-master setup, which is fine for now, but as replication traffic increases between the two nodes you might start to see load issues. At that point you'll have to drop in a third machine, and a decision will have to be made on how to replicate your data.
Yes. The other big problem with this setup is that a single node must be able to handle all queries if the other node fails. Fortunately, we are nowhere near saturating the capacity of our nodes under normal operation (we are only doing about 1,200 queries/second). We are also able to offload read-only queries to slaves that replicate off the masters, which should help with read capacity, our biggest demand.
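On the application side, a read/write split along these lines is enough (a hypothetical sketch, hostnames are illustrative):

    import random

    MASTER = "master.db.internal"
    READ_SLAVES = ["slave1.db.internal", "slave2.db.internal"]

    def endpoint_for(query):
        # Anything that modifies data goes to the master; reads are spread over slaves.
        is_read = query.lstrip().lower().startswith(("select", "show"))
        return random.choice(READ_SLAVES) if is_read else MASTER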
Pretty typical of any multi-master asynchronous replication: you must make damn sure your application isn't going to generate conflicts. If a conflict does happen, you have to apply some external logic to resolve it. Some things like GoldenGate (Oracle's acquisition in the binlog-snarfing multi-master replication space) provide you some tools for that, basically dumping conflicted records into a table for manual or automated clean-up later.
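For example, a rough sketch of that pattern (this is not GoldenGate's actual API; the conflict table and column names are made up):

    import json

    def record_conflict(cursor, table, local_row, incoming_row):
        # Instead of silently overwriting, park both versions of the row
        # in a conflict table for manual or automated cleanup later.
        cursor.execute(
            """
            INSERT INTO replication_conflicts
                (source_table, row_guid, local_row, incoming_row, detected_at)
            VALUES (%s, %s, %s, %s, NOW())
            """,
            (table, incoming_row["guid"],
             json.dumps(local_row), json.dumps(incoming_row)),
        )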
At the end of the day, it's always going to require well thought out schemas and data layout if you want to take writes for both masters.
Multi-master is less difficult if you only write to a single node at a time (hot-standby style).
I don't know much about replication, but I'd like to learn. Can you recommend any books/courses/tutorials that address the sorts of design decisions you're talking about?
I'm always in favour of technology which solves the replication issues in MySQL. It's been one of the most painful parts of managing databases for me.
I don't know much about replication, but what happens when a master dies just after the database is updated? Will the changes not be replicated? If the user sees that the second master doesn't have the latest changes and repeats his latest actions, what will happen when the first master comes back online? Will the data be duplicated or merged?
It depends on your use case: either you accept there is a write hole, where the update won't appear until that host comes back online and other hosts can relay the change, or you add application-level checking so you don't return to the user in the first place until the change has hit at least one other host.
In your scenario, yes. The two changes would be a conflict, but hopefully (if the user makes the same change) your conflict-resolution code would realise it is the same and merge them correctly. I fear I don't know how MySQL::Replication handles this, but it has to be application-level logic, so I assume it will allow you to override a default of, say, last-modification-wins.
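Something along these lines, say (a hypothetical sketch of application-level resolution; the function and field names are made up):

    def resolve_conflict(local, incoming, override=None):
        # local/incoming are dicts with a 'modified_at' timestamp plus data columns.
        if override is not None:
            merged = override(local, incoming)
            if merged is not None:
                return merged                 # the custom merge handled it
        # Default: keep whichever copy was modified most recently.
        return incoming if incoming["modified_at"] >= local["modified_at"] else local

    def same_change(local, incoming):
        # Example override: if both sides carry identical data (the user simply
        # replayed the same action), either copy is fine.
        data = lambda row: {k: v for k, v in row.items() if k != "modified_at"}
        return incoming if data(local) == data(incoming) else None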
I think spydum's comment is what you're looking for.
These are things that can and do happen, and you have to plan for them. Designing your applications and infrastructure with eventual consistency in mind can do wonders for both performance and redundancy.
That's one of the problems of multi-master replication. It's easy to introduce conflicts, and handling them requires a lot of application logic or manual operational work.
A couple of years ago I was looking at building a highly available database (MySQL in particular) and looked into the multi-master setup. While it sounds good on paper, its benefits don't warrant the high development and operational cost.
- The tables need to be changed and the application layer needs to be changed to support it, which is a big hassle and very fragile. It's easy to introduce update conflicts. It's a nightmare when dealing with a group of updates in a transaction. You can't really roll back a transaction at the replicated nodes.
- Whenever a node fails, the replication ring is broken and updates pile up at the previous node, while subsequent nodes' data becomes stale. It requires immediate human attention to fix, which defeats the purpose of an HA cluster.
- Related to above. It's very difficult to add a new master node without stopping the cluster. The "catchup" process is very manual and fragile.
- Data on different nodes becomes stale under high replication load. Clients reading from different masters would get stale data. They are supposed to be masters, and they serve stale data?!
- Multi-master doesn't help write scalability at all; every node still needs to handle every write. MySQL's single-threaded replication apply doesn't help either. For read scalability, master-slave is better.
I abandoned the design after a while and chose a different approach. I ended up using disk-based replication, like DRBD. A two-machine cluster forms the master cluster, one active and one standby. Writes are replicated to both machines synchronously at the disk level. When the active node fails, the standby becomes active within seconds, automatically, with the complete data on disk.
The beauty of this approach is the simple design and setup. The data are always in sync, no stale data. Failover is immediate and automatic. The failed node can automatically catch up when back online. The database application doesn't need any change and all the SQL semantics are preserved. The cluster has one IP so the clients don't need special connection logic. They just need to retry when connection fails.
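The client side really is that simple; something like this retry-on-failure sketch (PyMySQL is used purely as an example driver, and the cluster IP and credentials are illustrative):

    import time
    import pymysql

    CLUSTER_IP = "10.0.0.10"   # single virtual IP that follows the active node

    def run_with_retry(sql, params=(), retries=5, delay=2.0):
        for attempt in range(retries):
            try:
                conn = pymysql.connect(host=CLUSTER_IP, user="app",
                                       password="...", database="app_db")
                try:
                    with conn.cursor() as cur:
                        cur.execute(sql, params)
                        conn.commit()
                        return cur.fetchall()
                finally:
                    conn.close()
            except pymysql.err.OperationalError:
                # The standby is taking over; wait a moment and reconnect.
                time.sleep(delay)
        raise RuntimeError("cluster unavailable after retries")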
For disaster recovery, I built another two-machine cluster in another datacenter acting as the slave, which did async replication from the master cluster. If the master cluster failed completely (as in, the datacenter burnt down), the slave cluster could become master via a manual process within 30 minutes. The 30-minute SLA covers someone getting paged, looking at the situation, and deciding to fail over. There are too many uncertainties across datacenters to fail over automatically.
Added bonus: slaves can still hang off the master cluster for read scalability. And it works with any disk-based database, not just MySQL.
There's nothing stopping you from having standby relays, active relays etc. You are free to create a topology of your choosing with minimal restrictions.