As the linked page doesn't describe what it is: Bi-Directional Replication for PostgreSQL (Postgres-BDR, or BDR) is an asynchronous multi-master replication system for PostgreSQL, specifically designed to allow geographically distributed clusters. Supporting more than 48 nodes, BDR is a low overhead, low maintenance technology for distributed databases. [0]
AFAIK there's no explicit limit per se - the BDR developers tested it with 48 nodes, but it will work with more nodes.
But there are things you'll probably run into - for example, BDR only supports a full-mesh topology, i.e. each node speaks to every other node. So the number of replication/management connections grows quadratically with the number of nodes.
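For a rough idea of what that looks like in practice, here's a minimal sketch based on the BDR docs (hostnames and database names are placeholders; check the current docs for the exact function signatures):

    -- on the first node: create the BDR group
    SELECT bdr.bdr_group_create(
        local_node_name   := 'node1',
        node_external_dsn := 'host=node1.example.com dbname=appdb'
    );

    -- on every additional node: join the existing group;
    -- each new node ends up connected to every other node (full mesh)
    SELECT bdr.bdr_group_join(
        local_node_name   := 'node2',
        node_external_dsn := 'host=node2.example.com dbname=appdb',
        join_using_dsn    := 'host=node1.example.com dbname=appdb'
    );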
It's not immediately clear to me when to use BDR and when I would choose XL. Can somebody do a quick comparison? I like this quote from the announcement email:
---
IS Postgres-BDR RIGHT FOR YOU?
BDR is well suited for databases where:
- Data is distributed globally
- Majority of data is written to from only one node at a time (For example, the US node mostly writes changes to US customers, each branch office writes mostly to branch-office-specific data, and so on.)
- There is a need for SELECTs over complete data set (lookups and consolidation)
- There are OLTP workloads with many smaller transactions
- Transactions mostly touching non overlapping sets of data
- There is partition and latency tolerance
However, this is not a comprehensive list and use cases for BDR can vary based on database type and functionality.
In addition, BDR aids business continuity by providing increased availability during network faults. Applications can be closer to the data and more responsive for users, allowing for a much more satisfying end-user experience.
---
But I can't seem to find something similar for XL and especially a diff between the two.
I'm by no means authoritative, but XL lets you distribute PostgreSQL and keep transactionality. That's great if you're a bank, but it's necessarily slow and not Available (in the CAP sense), so running it over a WAN might be a bad idea.
BDR lets you distribute your database but throws cross-node transactionality out of the window. You have two independent databases which stream changes to each other. A lot faster, and well suited for WANs, but not Consistent (again in the CAP sense).
I have worked in FSIs, and the DBAs I met were absolutely paranoid about the consistency and durability of writes to core banking databases. Curiously, it was atomicity and isolation that took a back seat; errors due to those would be corrected by applications (or by hand in exception processing) with reversing transactions.
Banks basically don't use database transactions to record financial transactions. Instead, failed operations are rolled back with an inverse operation. It's done this way to ensure that there is an audit trail of every operation, rolled back or not.
I would assume that banks could be the ideal enterprise users of event sourcing. In traditional applications we rely on DB rollback for failure mitigation, while banks do that explicitly with reversing entries, so the events are already there and they are essentially using the same concept as event sourcing, if I'm not mistaken.
And I think it's much easier to make it work there. There is a lot of associated domain logic, but the core of banking is a series of transactions (never worked in the industry, so I could be completely wrong): source, destination, amount. It looks like a poster child for event sourcing / log-based architecture.
XL has a global transaction manager, keeping the whole cluster consistent as a whole, and allows you to query the whole database for analytics, etc.
BDR on the other hand doesn't have such a global component, so each node is consistent on its own, but the nodes may be at slightly different states. Still, the replicated changes are applied transactionally (i.e. you won't see just a partial update from another node, etc.)
Well, if you're looking for a solution where the whole cluster is perfectly/strongly consistent, then BDR is not for you, I guess. What BDR gives you is a guarantee that the changes are applied atomically (i.e. matching transactions from the source node), and in commit order. But there's no global consistency guarantee - that's a consequence of the asynchronous behavior.
So this might not be the right place for this, but I'm curious.
How do people deal with "eventual consistency"?
In my head once a transaction is done, it's everywhere and everyone has access to it.
What happens if 2 nodes try to modify the same data at the same time? Or what happens if you insert on one node, then query on another before it propagates? And if the answers to those questions are what I think they are (that bad stuff happens), how do you set up your application to avoid doing it?
This is a really big question that quickly gets into distributed systems theory in general, but in part, the idea is to recognize that a lot of times the "bad stuff" is not so bad. For instance, if someone updates their profile picture on Facebook, and I make a request that should include their profile picture, but their update hasn't propagated to the node I'm reading from, I just get the old profile picture, and that's a-ok. There are definitely nasty things (like simultaneous updates) that you have to be aware of, but for a lot of applications, there are a lot of cases where out of date data makes little difference.
I guess the part I have trouble with is how do you separate what can't be delayed from what can?
Is it just a matter of keeping the information on 2 separate systems, or are there tools that let you for instance mark one query as "take your time and make sure everything is 100% up to date before committing this" and another as "return whatever you got, it's not important"?
Depends on the database, but Cassandra, for example, has quorum mode for writes, which requires that a majority of the cluster members ack the write. This can be enabled on a per-query basis, and also for reads.
Quorum and CRDTs deal with completely different problems.
CRDTs do one thing... they mitigate the issue of "lost updates". All acknowledged writes will be represented in the results of a query and no "winner" strategy is involved that would cause some acknowledged writes to be possibly incorrectly dominated by others and thus lost.
Quorum (strict) just provides a very, very, very weak form of consistency in the case of concurrent readers/writers (RYW) and just very, very weak consistency in the case of serialized readers/writers (RR).
My personal opinion is that any eventually consistent distributed database that doesn't have built-in CRDTs, or the necessary facilities to build application-level CRDTs, is a fundamentally dangerous and broken database because unless all you ever write to it is immutable or idempotent data, you're going to have your database silently dropping data during operation.
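To make the CRDT idea a bit more concrete, here's a minimal sketch of an application-level grow-only counter in plain SQL (table and node names are made up): each node only ever writes its own row, so concurrent increments on different nodes can never conflict, and the real value is the merge (sum) of all rows.

    -- one row per (counter, owning node); seeded up front
    CREATE TABLE page_views (
        counter_id text   NOT NULL,
        node_id    text   NOT NULL,
        hits       bigint NOT NULL DEFAULT 0,
        PRIMARY KEY (counter_id, node_id)
    );
    INSERT INTO page_views VALUES ('homepage', 'node1', 0), ('homepage', 'node2', 0);

    -- node1 only ever touches its own row, node2 only its own
    UPDATE page_views SET hits = hits + 1
     WHERE counter_id = 'homepage' AND node_id = 'node1';

    -- the merged value is the sum over all per-node rows
    SELECT sum(hits) FROM page_views WHERE counter_id = 'homepage';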
"Eventual consistency" usually means a few seconds rather than a few hours. If your users can tolerate hitting refresh when they see the stale data, or if your application is such that users don't have the option to use the newly-inputted or updated information for a short while, it may not be a problem.
It's probably not the type of thing you can just drop in. It will fit some use cases and not others. You'll want to design your application around an eventually-consistent system to prevent and/or handle conflicts in a sane manner.
Some systems like Riak have a built-in conflict resolution scheme based on "vector clocks", which record information about the version of the data that was received and can tell which entry is newer. [0]
There seems to be a general fondness for writing to the database and then immediately reading back from it to make a decision. Some variant of that show up in the requirements quite often, usually in the form of confirmation pages or overview screens following right after an update, so the user knows what they did took effect.
And almost as often you'll find that some member or members of your team have made the assumption that the whole interaction will occur on one machine talking to one database instance. With session affinity you might be able to create a convincing illusion that this is working, but now with microservices you're intentionally hitting many servers. If checkout and order history are on two different servers, how does eventual consistency work?
>There seems to be a general fondness for writing to the database and then immediately reading back from it to make a decision
One way I saw banking software do it was by keeping deltas of the changes through a wizard-like process and then committing them all at once in a single transaction, at which point, if there was a conflict, the wizard restarted with up-to-date data. Conflict detection was done by a sequential id keyed on the accounts touched by the operation, so that any operation modifying the account before your delta was applied would knock your commit off.
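A minimal sketch of that kind of conflict detection in SQL (table and column names invented): each account carries a version number, and the final commit is made conditional on the version not having moved since the wizard started.

    CREATE TABLE accounts (
        id      bigint  PRIMARY KEY,
        balance numeric NOT NULL,
        version bigint  NOT NULL DEFAULT 0
    );

    -- the wizard remembers the version it read at the start (say 41);
    -- the commit only succeeds if nobody touched the account since
    UPDATE accounts
       SET balance = balance - 100.00,
           version = version + 1
     WHERE id = 1 AND version = 41;
    -- "UPDATE 0" means someone got there first: roll back and
    -- restart the wizard with fresh data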
Banking and the credit industry are a good example of eventually consistent systems.
But you can see the inconsistencies and you're expected to just ignore them. I guess what I'm asking is how do you push back on requirements and inattentive coworkers that make these compromises unworkable, because it seems so pervasive to me.
> How do people deal with "eventual consistency"? In my head once a transaction is done, it's everywhere and everyone has access to it.
Different strokes for different folks.
Synchronous replication: the whole cluster is a single unit. COMMIT only returns when it's been truly committed to all nodes. OTOH, it's slow, and may or may not tolerate partitions.
Asynchronous replication: COMMIT returns when it's gone into the current node. It is then the cluster's job to propagate it to the rest. This is a lot faster, and can deal with network partitions. OTOH there is usually a state lag between distant nodes.
Different applications may need either sync or async, depending on requirements.
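For plain PostgreSQL streaming replication (not BDR), you can even mix the two per transaction via synchronous_commit, which is one answer to the earlier question about marking individual queries as important (standby and table names below are placeholders):

    -- postgresql.conf on the primary:
    --   synchronous_standby_names = 'standby1'

    -- an important write: wait until the standby has it
    BEGIN;
    SET LOCAL synchronous_commit = on;
    INSERT INTO payments (account_id, amount) VALUES (1, 100.00);
    COMMIT;

    -- a low-value write: return as soon as it's committed locally
    BEGIN;
    SET LOCAL synchronous_commit = local;
    INSERT INTO page_view_log (url) VALUES ('/home');
    COMMIT;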
Wow, that sounds like a really nice open-source company.
They offer sources, repos for both apt and yum as well as this:
> Need more help?
> Click here for pglogical's documentation.
> Click here to look at answers to some of the most frequently asked questions.
> Want to know even more? Need some help to implement pglogical? As the developers we're probably the best people in the world to help you get up and running with pglogical. We have consultants available to provide assistance, plus our unique 24/7 Production Support subscription service also covers pglogical.
That's a company I'd want to deal with. (Not using postgres at the moment, but hired Tomitribe for some minor work and the whole experience of dealing with the people who actually write and maintain stuff instead of marketing drones etc is totally awesome.)
PostgreSQL streaming replication: sends disk-level diffs over the network as raw binary; absolutely unusable if you don't have the same PostgreSQL version on the other side.
Logical replication: sends data-level diffs over the network. Can be used for replication, but also for auditing or feeding different databases...
pglogical works on top of the logical decoding infrastructure introduced in 9.4. Streaming replication just sends binary changes to the data files (roughly "change these bytes on page X" -- not super exact, but it gives you a picture of what's done), while logical decoding reads the WAL, decodes the changes, and lets a plugin apply them elsewhere (pglogical serves as such a plugin in this case).
I think the best way to show how it would work (and this is not what pglogical does, as it would be inefficient) is by writing a plugin that takes the logically decoded changes from the WAL and builds SQL statements from them (UPDATE, INSERT, DELETE, etc.). You could easily apply those on any other PostgreSQL server, or maybe even some other SQL server.
Very handy for online upgrades.
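You can poke at the underlying mechanism on a stock 9.4+ server with the built-in test_decoding plugin (it emits a textual representation of the changes rather than SQL, but it shows the idea; the table here is hypothetical and wal_level must be set to logical):

    -- create a logical replication slot using the test_decoding plugin
    SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    INSERT INTO some_table (id, val) VALUES (1, 'hello');

    -- read (and consume) the decoded changes from the slot
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    SELECT pg_drop_replication_slot('demo_slot');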
Those are solutions for very different types of deployments.
BDR is asynchronous multi-master, without global transaction management etc.
XL is a distributed database with global transaction management, strong consistency, etc.
Say you have a geographically distributed application (managing customers in the US, EU, ...). You can deploy a node in each area and direct local users to that node, with low latency etc. It also means "no conflicts" (which is a major concern in multi-master), and the changes will asynchronously replicate to the other nodes.
XL would perform poorly in this scenario (at least I believe so), and it's meant more for local deployments, when hitting vertical scalability limits.
I use it to keep my Postgres data in sync in a highly available setup. I host two nodes, each in a different DigitalOcean datacenter (AMS2 and AMS3).
It's fast enough to not notice delays for users. As in, have a user log in on host1, switch the user to host2 and the session is there.
Disclaimer: we're not write-heavy and have around one hundred active users.
If you're interested, I'm trying just this over the weekend. Standard streaming replication over WAN was pretty awful. It would work reliably for a few weeks then cause both the master and slave to block writes when they got out of sync because something was dropped.
I wonder if it's plausible for AWS to release a customized variant of this targeted at their data centers and designed for ease of multi-zone deployment? Something like what Azure has done (optimizing and facilitating MySQL).
I realize that Redshift already loosely meets that definition, but it doesn't quite work as a globally distributed, regionally clustered web service back end. This is ideal.
What I want to know is the timeframe of supporting PostgreSQL 9.5 and/or 9.6. (Though 9.6 is still in beta.)
Also, my understanding is that they're feeding patches back to postgres, and want it to eventually run on stock postgres. But it's not clear to me how progress on that is going.
I was also surprised that UDR was removed, I didn't even realize it was deprecated.
I'm not actually using the product at all right now, but I've been watching on the website, because I want to use it eventually.
I'm kind of hoping it works with stock postgres before I jump in. But if not that, I think I at least want to wait for 9.6 support.
UDR has been deprecated in favor of pglogical, which is based on better code already available in PostgreSQL (and production-ready... actually we are using it with customers on production systems).
There will not be any BDR release for 9.5. That has been commented on many times on the pgsql-general list.
There will most likely be a 9.6 version, but it won't run on stock postgres (you'll still need to have a patched version of postgres).
If the rest of the features BDR relies on get in for the PG 10 release, then there will be only a BDR extension which you'd have to install (no patching Postgres).
That may be the case, but they need to update the documentation to reflect that. This is what it currently says:
"Note: All the new features required have been submitted for inclusion in PostgreSQL 9.5. Many have already been accepted and included. If all the functionality BDR requires is added to PostgreSQL 9.5 then the need for a modified PostgreSQL will go away in the next version."
Well, PostgreSQL 9.5 has been out for several months now, and there's no note at all for 9.6. For a 1.0 announcement, that's a pretty strange omission.
So we get something that only really works with a patched version of 9.4 when 9.6 is going to be released soon, and talks about 9.5 like it isn't out yet? No thanks.
I'm not on pgsql-general. Or any of the postgres lists besides announce.
So, I'm not sure whose fault it is, but I'm listening in one place (your website, mainly), and you're talking in another place (pgsql-general, apparently), so we're not communicating effectively.
Actually, at least as recently as June, when I emailed you guys directly you didn't say that. Instead I got linked to a blog post from Dec 2015 that didn't answer my question, and then a "Not yet" when I replied to ask my question again.
If your use case can't handle eventually consistent data or you have no way to successfully resolve conflicts/never cause conflicts then this product is not for you.
They've chosen availability over consistency, so when the network between your DB servers partitions, your applications will still function. But partition A won't see what's happening on partition B and vice versa, and when the partition heals you might have conflicts to resolve.
One great use case I can think of for this is a case where you have geographically clustered clients that need to read globally but only write locally. As long as you model your data correctly you could set up a master in each region so that your writes were fully ACID compliant then have the masters replicate to each other so that data from other regions was available for reading. You'd get the benefit of distributed writes, your local clusters could continue working even if there was a partition or one of the other clusters went down, and you'd assure yourself that you wouldn't have conflicts.
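A rough sketch of that modeling in SQL (schema invented): key the data by region and have each region's application servers write only rows for their own region, so the asynchronous merge can't conflict, while every node can still read the full data set.

    CREATE TABLE customers (
        region text   NOT NULL,   -- 'us', 'eu', ...
        id     bigint NOT NULL,
        name   text   NOT NULL,
        PRIMARY KEY (region, id)
    );

    -- the US node only ever writes region = 'us' rows ...
    INSERT INTO customers (region, id, name) VALUES ('us', 1, 'Alice');

    -- ... but any node can read the whole, asynchronously replicated set
    SELECT count(*) FROM customers;

Enforcing the "only write your own region" rule is up to the application (or per-node constraints); the database itself won't stop you.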
Just wanted to point out that Citus Data is no longer a fork of Postgres and is now open source[0]. (Though, I'm sure their managed service is 'expensive' compared to unmanaged).
Reading this, I wonder how you'd do this: OS or Postgres upgrades with one normal Postgres instance (or a typical master+slave setup) without downtime and without using Postgres-BDR. Does somebody know?
OS upgrades, sure, just use two systems with one of the existing async rep solutions between them. You'll need to make the master read-only for a brief period before failover as the slave "catches up" with the master.
Online PG upgrades weren't possible (until now) because the binary log format changes between major versions. But pglogical (which underpins BDR) solves this issue: https://2ndquadrant.com/en/resources/pglogical/ The process would be the same as an OS upgrade (brief period of read-only availability required), but you can do it on one machine.
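For the upgrade case, the rough shape of a pglogical setup is something like this (DSNs are placeholders and the calls are paraphrased from the pglogical docs, so double-check the exact signatures): the old server becomes the provider, the new-version server subscribes, and once it has caught up you switch the application over.

    -- on the old (provider) server
    SELECT pglogical.create_node(
        node_name := 'provider1',
        dsn       := 'host=old-server dbname=appdb'
    );
    SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

    -- on the new server, running the newer PostgreSQL major version
    SELECT pglogical.create_node(
        node_name := 'subscriber1',
        dsn       := 'host=new-server dbname=appdb'
    );
    SELECT pglogical.create_subscription(
        subscription_name := 'upgrade_sub',
        provider_dsn      := 'host=old-server dbname=appdb'
    );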
We have one master PG database and handle upgrades during "planned downtime" during off-peak hours (during weekend nights our B2B app is relatively empty).
We prep a new instance with the upgrades in place, and then load over the production data during downtime.
I haven't worked with them directly, but I have followed the pgsql-hackers mailing list; they have a bunch of contributions back to core Postgres, and have some core maintainers on staff.
They seem very open with everything they do, as opposed to EDB, which does contribute back but also has a bunch of closed-source tools and its own closed-source Postgres fork.
Source? I have not seen a strong argument against running Postgres or other DBs in a container. The data directory should be mapped to a persistent volume, of course.
Then why would you run it in a container? You're not encapsulating anything or making anything portable if the interesting part of the container is a directory on the host.
Nobody said it's easier. But it's definitely more decoupled, with a more well-defined and reproducible binary image state which you can spin up and shut down at scale more easily. Nobody is saying that this is something you, your Postgres, or most people need. Over 10 million Docker Hub pulls of the Postgres images say there is a use case for using Postgres Docker images.
Source is a basic understanding of containers and a basic understanding of databases.
a) Docker and other container systems are still very young. They're missing a lot of badly-desired features and the community is still coalescing on best practices and safe approaches. It's certain there are complex and significant bugs. This is triply-true if you're using an orchestration framework over the top like Kubernetes. This isn't a reason why DBs are a bad fit in itself, but it's a reason why running production data in Docker is a bad idea at present.
b) Docker is meant for stuff that's portable and can be isolated from its hardware. It's meant to make it easy to run many applications on one machine without the resource overhead of virtual machines. DBs are distinctly not well-suited to this environment. DBs require a good deal of tuning and hardware awareness to run properly. DBs want to run forever and keep frequently-used data in memory; often, restarting a DB server is a big deal because it makes the caches go cold and the server is slow until they warm back up. DBs don't take kindly to sharing a box with 30 other applications. On a VM at least you know you have a dedicated chunk of memory. Not the case on a Docker host. As I discuss in C, Docker is basically the antithesis of long-running; there are many ways to make your container suddenly disappear or stop.
c) There are many non-obvious gotchas involved in Docker usage; it's a very non-intuitive interface. For example, using `docker attach` to connect to a running container will often place the user in control of the process. A simple ^C, which most sysadmins would interpret as "OK, I'm done with this log", closes the container's process and cleans up your container. That's just one example of a risk among many brought to you by the counter-intuitive and unfriendly Docker UX. Not the kind of fragility we want for something as mission-critical as a database system. Again, Docker is meant for processes that are consequence-free if they're cleaned up. Their UX is obviously designed that way too. You're supposed to run 8 containers from the same image and if one goes away the LB detects it and it's nbd. Databases don't work like that.
d) Docker sometimes becomes a zombie and fails to respond to commands. You can't connect to any of your running containers. You can't issue start or stop commands. You can't get output from `docker ps`. This has happened to me on multiple occasions. Do we really want a production DB running in that context?
e) You mentioned it, but storage. Even persistent volumes can be a PITA to configure properly with Docker, and if you fail to do so, you are looking down a black hole backed by the slow AUFS virtual filesystem. By default, all data written to a Docker container goes into its own differenced image and when you stop the container, it "goes away" (though it can usually be recovered by calling up the specific container ID instead of restarting from the image). These AUFS volumes have a habit of consuming a lot of disk space on the host, which sometimes causes point d to occur if you get to 0. Volumes cannot be mapped at build time and must be defined at runtime. There are many bugs with data volumes, including data loss bugs and filesystem feature bugs.
The long and short of it is that Docker is oriented toward applications that aren't close to the metal and can have their log and write output easily redirected to more durable systems. This fits the bill for many web apps (the application part, not the database parts). It absolutely does not fit the bill for long-running, close-to-the-metal, mandatory-firm-and-reliable-storage systems like databases.
I use Docker to run local development databases (as well as applications). I wouldn't use it for any DB more important than that. Docker is a cool set of abstractions around jails, but it is not a universal solution, just as cloud isn't.
Yeah, I understand that the concept of process isolation has existed for a long time. We used to know them as "chroot jails". cgroups obviously is a feature that has been dormant in the Linux kernel for a long time (until Docker came along and convinced everyone they had to use them, which actually isn't really a point in favor of containerization either: "we depend on this lightly tested feature of the kernel!"). Those things on their own are substantially different from what are today known as containers, a mix of concepts that involves custom daemons, complex network routing layers, bolt-on orchestration components like Swarm/Kube, and all sorts of other voodoo that makes today's understanding of "containers" distinct from historical uses either of the term or the concepts implemented via jails/zones/whatever.
Regarding d, this decision with Docker to use a daemon to control your processes (that is not init) continues to baffle me. It seems on the surface of it to be a really naive way to schedule your cgroups, but you could just as easily do it with a wrapper that terminates. Or just swallow that pill and go with systemd already.
I haven't seen any other container-like system on Linux with that design. And I haven't seen it go zombie on me, but I have had it die on me for reasons I have never fully understood. There are bugs in there, just like with any young project, of course. It's possible to recover from that state with an intimate understanding of how Docker works, but it wreaks havoc on higher-level tools.
a) This is not true. There are mature, stable and, unlike the Linux attempts, actually secure "containers". See Jails and Zones.
b) See a.
c) This is a docker issue, other platforms do not suffer these UX flaws.
d) See a.
e) See ZFS on FreeBSD and Illumos.
You can of course continue to use ill-designed software, such as Docker, and wait for someone to "make it right", or you can use alternatives that are designed well and work now.
> This is not true. There are mature, stable and unlike linux attempts
The comment above was about Docker specifically. I challenge you to run Docker on anything else than Linux (in production, that is).
> You can of course continue to use ill designed software, such as docker
Yeah, you don't use Docker because it's well designed (or even reliable). You use it because it is the strongest industry wide attempt at a standard there has been for the past ten years. There is a semi-standardized image format and api.
When the time comes to move all software to The Cloud, and some believe that to be inevitable (which you may not agree with), odds are that Docker is one of the accepted formats your provider understands.
Jails or zones have nothing to do with this. They are an implementation detail. I find it likely that the cloud will still run on Linux ten years from now.
The comment above, which you're referring to, starts with this:
> Source is a basic understanding of containers and a basic understanding of databases.
>
> a) Docker and other container systems are still very young. ...
So it seems to be about container systems in general, although it then talks about Docker as if it were a representative example (which it isn't).
FWIW I see nothing wrong with using containers (even Docker) for databases. Of course, it depends on what you expect from that - it may for example make automated deployment much easier, etc.
I'm discussing the modern implementations of containers, like Docker, rkt, and LXC. Things like jails have been time-tested, and most of the complaints in the post aren't applicable to them. "Container" has taken on a new meaning, referring to such new-fangled container frameworks. IMO calling something like a jail a container is now obtuse and incorrect.
My post was primarily referring to Docker, as you can tell from the grandparent that says "Docker and database do not mix", and then the rest of the post body that speaks to Docker-specific issues.
[0]: https://2ndquadrant.com/en/resources/bdr/