As the linked page doesn't describe what it is: Bi-Directional Replication for PostgreSQL (Postgres-BDR, or BDR) is an asynchronous multi-master replication system for PostgreSQL, specifically designed to allow geographically distributed clusters. Supporting more than 48 nodes, BDR is a low overhead, low maintenance technology for distributed databases. [0]
AFAIK there's no explicit limit per se - the BDR developers tested it with 48 nodes, but it will work with more nodes.
But there are things you'll probably run into - for example, BDR only supports a full-mesh topology, i.e. each node speaks to every other node. So the number of replication/management connections grows quadratically with the number of nodes.
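For a rough idea of what that looks like in practice, here's a minimal sketch based on the BDR docs (hostnames and database names are placeholders; check the current docs for the exact function signatures):

    -- on the first node: create the BDR group
    SELECT bdr.bdr_group_create(
        local_node_name   := 'node1',
        node_external_dsn := 'host=node1.example.com dbname=appdb'
    );

    -- on every additional node: join the existing group;
    -- each new node ends up connected to every other node (full mesh)
    SELECT bdr.bdr_group_join(
        local_node_name   := 'node2',
        node_external_dsn := 'host=node2.example.com dbname=appdb',
        join_using_dsn    := 'host=node1.example.com dbname=appdb'
    );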
It's not immediately clear to me when to use BDR and when I would choose XL. Can somebody do a quick comparison? I like this quote from the announcement email:
---
IS Postgres-BDR RIGHT FOR YOU?
BDR is well suited for databases where:
- Data is distributed globally
- Majority of data is written to from only one node at a time (For example, the US node mostly writes changes to US customers, each branch office writes mostly to branch-office-specific data, and so on.)
- There is a need for SELECTs over complete data set (lookups and consolidation)
- There are OLTP workloads with many smaller transactions
- Transactions mostly touching non overlapping sets of data
- There is partition and latency tolerance
However, this is not a comprehensive list and use cases for BDR can vary based on database type and functionality.
In addition, BDR aids business continuity by providing increased availability during network faults. Applications can be closer to the data and more responsive for users, allowing for a much more satisfying end-user experience.
---
But I can't seem to find something similar for XL and especially a diff between the two.
I'm by no means authoritative, but XL lets you distribute PostgreSQL and keep transactionality. That's great if you're a bank, but it's necessarily slow and not Available (in the CAP sense), so running it over a WAN might be a bad idea.
BDR lets you distribute your database but throws cross-node transactionality out of the window. You have two independent databases which stream changes to each other. A lot faster, and well suited for WANs, but not Consistent (again in the CAP sense).
I have worked in FSIs, and the DBAs I met were absolutely paranoid about the consistency and durability of writes to core banking databases. Curiously, it was atomicity and isolation that took a back seat; errors due to those would be corrected by applications (or by hand in exception processing) with reversing transactions.
Banks basically don't use database transactions to record financial transactions. Instead, failed operations are rolled back with an inverse operation. It's done this way to ensure that there is an audit trail of every operation, rolled back or not.
I would assume that banks could be the ideal enterprise users of event sourcing. In traditional applications we rely on DB rollback for failure mitigation, while banks do that explicitly with reversing entries, so the events are already there and they are essentially using the same concept as event sourcing, if I'm not mistaken.
And I think it's much easier to make it work there. There is a lot of associated domain logic, but the core of banking is a series of transactions (never worked in the industry, so I could be completely wrong): source, destination, amount. It looks like a poster child for event sourcing / log-based architecture.
XL has a global transaction manager, keeping the whole cluster consistent as a whole, and allows you to query the whole database for analytics, etc.
BDR on the other hand doesn't have such a global component, so each node is consistent on its own, but the nodes may be at slightly different states. Still, the replicated changes are applied transactionally (i.e. you won't see just a partial update from another node, etc.)
Well, if you're looking for a solution where the whole cluster is perfectly/strongly consistent, then BDR is not for you, I guess. What BDR gives you is a guarantee that the changes are applied atomically (i.e. matching transactions from the source node), and in commit order. But there's no global consistency guarantee - that's a consequence of the asynchronous behavior.
So this might not be the right place for this, but I'm curious.
How do people deal with "eventual consistency"?
In my head once a transaction is done, it's everywhere and everyone has access to it.
What happens if 2 nodes try to modify the same data at the same time? Or what happens if you insert on one node, then query on another before it propagates? And if the answers to those questions are what I think they are (that bad stuff happens), how do you set up your application to avoid doing it?
This is a really big question that quickly gets into distributed systems theory in general, but in part, the idea is to recognize that a lot of times the "bad stuff" is not so bad. For instance, if someone updates their profile picture on Facebook, and I make a request that should include their profile picture, but their update hasn't propagated to the node I'm reading from, I just get the old profile picture, and that's a-ok. There are definitely nasty things (like simultaneous updates) that you have to be aware of, but for a lot of applications, there are a lot of cases where out of date data makes little difference.
I guess the part I have trouble with is how do you separate what can't be delayed from what can?
Is it just a matter of keeping the information on 2 separate systems, or are there tools that let you for instance mark one query as "take your time and make sure everything is 100% up to date before committing this" and another as "return whatever you got, it's not important"?
Depends on the database, but Cassandra, for example, has quorum mode for writes, which requires that a majority of the cluster members ack the write. This can be enabled on a per-query basis, and also for reads.
Quorum and CRDTs deal with completely different problems.
CRDTs do one thing... they mitigate the issue of "lost updates". All acknowledged writes will be represented in the results of a query and no "winner" strategy is involved that would cause some acknowledged writes to be possibly incorrectly dominated by others and thus lost.
Quorum (strict) just provides a very, very, very weak form of consistency in the case of concurrent readers/writers (RYW) and just very, very weak consistency in the case of serialized readers/writers (RR).
My personal opinion is that any eventually consistent distributed database that doesn't have built-in CRDTs, or the necessary facilities to build application-level CRDTs, is a fundamentally dangerous and broken database because unless all you ever write to it is immutable or idempotent data, you're going to have your database silently dropping data during operation.
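To make the CRDT idea a bit more concrete, here's a minimal sketch of an application-level grow-only counter in plain SQL (table and node names are made up): each node only ever writes its own row, so concurrent increments on different nodes can never conflict, and the real value is the merge (sum) of all rows.

    -- one row per (counter, owning node); seeded up front
    CREATE TABLE page_views (
        counter_id text   NOT NULL,
        node_id    text   NOT NULL,
        hits       bigint NOT NULL DEFAULT 0,
        PRIMARY KEY (counter_id, node_id)
    );
    INSERT INTO page_views VALUES ('homepage', 'node1', 0), ('homepage', 'node2', 0);

    -- node1 only ever touches its own row, node2 only its own
    UPDATE page_views SET hits = hits + 1
     WHERE counter_id = 'homepage' AND node_id = 'node1';

    -- the merged value is the sum over all per-node rows
    SELECT sum(hits) FROM page_views WHERE counter_id = 'homepage';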
"Eventual consistency" usually means a few seconds rather than a few hours. If your users can tolerate hitting refresh when they see the stale data, or if your application is such that users don't have the option to use the newly-inputted or updated information for a short while, it may not be a problem.
It's probably not the type of thing you can just drop in. It will fit some use cases and not others. You'll want to design your application around an eventually-consistent system to prevent and/or handle conflicts in a sane manner.
Some systems like Riak have a built-in conflict resolution scheme based on "vector clocks", which record information about the version of the data that was received and can tell which entry is newer. [0]
There seems to be a general fondness for writing to the database and then immediately reading back from it to make a decision. Some variant of that show up in the requirements quite often, usually in the form of confirmation pages or overview screens following right after an update, so the user knows what they did took effect.
And almost as often you'll find that some member or members of your team have made the assumption that the whole interaction will occur on one machine talking to one database instance. With session affinity you might be able to create a convincing illusion that this is working, but now with microservices you're intentionally hitting many servers. If checkout and order history are on two different servers, how does eventual consistency work?
>There seems to be a general fondness for writing to the database and then immediately reading back from it to make a decision
One way I saw banking software do it was by keeping deltas of the changes through a wizard-like process and then committing them all at once in a single transaction, at which point, if there was a conflict, the wizard restarted with up-to-date data. Conflict detection was done by a sequential id keyed on the accounts touched by the operation, so that any operation modifying the account before your delta was applied would knock your commit off.
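A minimal sketch of that kind of conflict detection in SQL (table and column names invented): each account carries a version number, and the final commit is made conditional on the version not having moved since the wizard started.

    CREATE TABLE accounts (
        id      bigint  PRIMARY KEY,
        balance numeric NOT NULL,
        version bigint  NOT NULL DEFAULT 0
    );

    -- the wizard remembers the version it read at the start (say 41);
    -- the commit only succeeds if nobody touched the account since
    UPDATE accounts
       SET balance = balance - 100.00,
           version = version + 1
     WHERE id = 1 AND version = 41;
    -- "UPDATE 0" means someone got there first: roll back and
    -- restart the wizard with fresh data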
Banking and the credit industry are a good example of eventually consistent systems.
But you can see the inconsistencies and you're expected to just ignore them. I guess what I'm asking is how do you push back on requirements and inattentive coworkers that make these compromises unworkable, because it seems so pervasive to me.
> How do people deal with "eventual consistency"? In my head once a transaction is done, it's everywhere and everyone has access to it.
Different strokes for different folks.
Synchronous replication: the whole cluster is a single unit. COMMIT only returns when it's been truly committed to all nodes. OTOH, it's slow, and may or may not tolerate partitions.
Asynchronous replication: COMMIT returns when it's gone into the current node. It is then the cluster's job to propagate it to the rest. This is a lot faster, and can deal with network partitions. OTOH there is usually a state lag between distant nodes.
Different applications may need either sync or async, depending on requirements.
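For plain PostgreSQL streaming replication (not BDR), you can even mix the two per transaction via synchronous_commit, which is one answer to the earlier question about marking individual queries as important (standby and table names below are placeholders):

    -- postgresql.conf on the primary:
    --   synchronous_standby_names = 'standby1'

    -- an important write: wait until the standby has it
    BEGIN;
    SET LOCAL synchronous_commit = on;
    INSERT INTO payments (account_id, amount) VALUES (1, 100.00);
    COMMIT;

    -- a low-value write: return as soon as it's committed locally
    BEGIN;
    SET LOCAL synchronous_commit = local;
    INSERT INTO page_view_log (url) VALUES ('/home');
    COMMIT;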
Wow, that sounds like a really nice open-source company.
They offer sources, repos for both apt and yum as well as this:
> Need more help?
> Click here for pglogical's documentation.
> Click here to look at answers to some of the most frequently asked questions.
> Want to know even more? Need some help to implement pglogical? As the developers we're probably the best people in the world to help you get up and running with pglogical. We have consultants available to provide assistance, plus our unique 24/7 Production Support subscription service also covers pglogical.
That's a company I'd want to deal with. (Not using postgres at the moment, but hired Tomitribe for some minor work and the whole experience of dealing with the people who actually write and maintain stuff instead of marketing drones etc is totally awesome.)
PostgreSQL streaming replication: sends disk-level diffs over the network as raw binary; absolutely unusable if you don't have the same PostgreSQL version on the other side.
Logical replication: sends data-level diffs over the network. Can be used for replication, but also for auditing or feeding different databases...
pglogical works on top of the logical decoding infrastructure introduced in 9.4. Streaming replication just sends binary changes to the data files (roughly "change these bytes on page X" -- not super exact, but it gives you a picture of what's done), while logical decoding reads the WAL, decodes the changes, and lets a plugin apply them elsewhere (pglogical serves as such a plugin in this case).
I think the best way to show how it would work (and this is not what pglogical does, as it would be inefficient) is by writing a plugin that takes the logically decoded changes from the WAL and builds SQL statements from them (UPDATE, INSERT, DELETE, etc.). You could easily apply those on any other PostgreSQL server, or maybe even some other SQL server.
Very handy for online upgrades.
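You can poke at the underlying mechanism on a stock 9.4+ server with the built-in test_decoding plugin (it emits a textual representation of the changes rather than SQL, but it shows the idea; the table here is hypothetical and wal_level must be set to logical):

    -- create a logical replication slot using the test_decoding plugin
    SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    INSERT INTO some_table (id, val) VALUES (1, 'hello');

    -- read (and consume) the decoded changes from the slot
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    SELECT pg_drop_replication_slot('demo_slot');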
Those are solutions for very different types of deployments.
BDR is asynchronous multi-master, without global transaction management etc.
XL is a distributed database with global transaction management, strong consistency, etc.
Say you have a geographically distributed application (managing customers in the US, EU, ...). You can deploy a node in each area and direct local users to that node, with low latency etc. It also means "no conflicts" (which is a major concern in multi-master), and the changes will asynchronously replicate to the other nodes.
XL would perform poorly in this scenario (at least I believe so), and it's meant more for local deployments, when hitting vertical scalability limits.
I use it to keep my Postgres data in sync in a highly available setup. I host two nodes, each in a different DigitalOcean datacenter (AMS2 and AMS3).
It's fast enough to not notice delays for users. As in, have a user log in on host1, switch the user to host2 and the session is there.
Disclaimer: we're not write-heavy and have around one hundred active users.
If you're interested, I'm trying just this over the weekend. Standard streaming replication over WAN was pretty awful. It would work reliably for a few weeks then cause both the master and slave to block writes when they got out of sync because something was dropped.
I wonder if it's plausible for AWS to release a customized variant of this targeted at their data centers and designed for ease of multi-zone deployment? Something like what Azure has done (optimizing and facilitating MySQL).
I realize that Redshift already loosely meets that definition, but it doesn't quite work as a globally distributed, regionally clustered web service back end. This is ideal.
What I want to know is the timeframe of supporting PostgreSQL 9.5 and/or 9.6. (Though 9.6 is still in beta.)
Also, my understanding is that they're feeding patches back to postgres, and want it to eventually run on stock postgres. But it's not clear to me how progress on that is going.
I was also surprised that UDR was removed, I didn't even realize it was deprecated.
I'm not actually using the product at all right now, but I've been watching on the website, because I want to use it eventually.
I'm kind of hoping it works with stock postgres before I jump in. But if not that, I think I at least want to wait for 9.6 support.
UDR has been deprecated in favor of pglogical, which is based on better code already available in PostgreSQL (and production-ready... actually we are using it with customers on production systems).
There will not be any BDR release for 9.5. That has been commented on many times on the pgsql-general list.
There will most likely be a 9.6 version, but it won't run on stock postgres (you'll still need to have a patched version of postgres).
If the rest of the features BDR relies on get in for the PG 10 release, then there will be only a BDR extension which you'd have to install (no patching Postgres).
That may be the case, but they need to update the documentation to reflect that. This is what it currently says:
"Note: All the new features required have been submitted for inclusion in PostgreSQL 9.5. Many have already been accepted and included. If all the functionality BDR requires is added to PostgreSQL 9.5 then the need for a modified PostgreSQL will go away in the next version."
Well, PostgreSQL 9.5 has been out for several months now, and there's no note at all for 9.6. For a 1.0 announcement, that's a pretty strange omission.
So we get something that only really works with a patched version of 9.4 when 9.6 is going to be released soon, and talks about 9.5 like it isn't out yet? No thanks.
I'm not on pgsql-general. Or any of the postgres lists besides announce.
So, I'm not sure whose fault it is, but I'm listening in one place (your website, mainly), and you're talking in another place (pgsql-general, apparently), so we're not communicating effectively.
Actually, at least as recently as June, when I emailed you guys directly you didn't say that. Instead I got linked to a blog post from Dec 2015 that didn't answer my question, and then a "Not yet" when I replied to ask my question again.
If your use case can't handle eventually consistent data or you have no way to successfully resolve conflicts/never cause conflicts then this product is not for you.
They've chosen availability over consistency, so when the network between your DB servers partitions, your applications will still function. But partition A won't see what's happening on partition B and vice versa, and when the partition heals you might have conflicts to resolve.
One great use case I can think of for this is a case where you have geographically clustered clients that need to read globally but only write locally. As long as you model your data correctly you could set up a master in each region so that your writes were fully ACID compliant then have the masters replicate to each other so that data from other regions was available for reading. You'd get the benefit of distributed writes, your local clusters could continue working even if there was a partition or one of the other clusters went down, and you'd assure yourself that you wouldn't have conflicts.
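A rough sketch of that modeling in SQL (schema invented): key the data by region and have each region's application servers write only rows for their own region, so the asynchronous merge can't conflict, while every node can still read the full data set.

    CREATE TABLE customers (
        region text   NOT NULL,   -- 'us', 'eu', ...
        id     bigint NOT NULL,
        name   text   NOT NULL,
        PRIMARY KEY (region, id)
    );

    -- the US node only ever writes region = 'us' rows ...
    INSERT INTO customers (region, id, name) VALUES ('us', 1, 'Alice');

    -- ... but any node can read the whole, asynchronously replicated set
    SELECT count(*) FROM customers;

Enforcing the "only write your own region" rule is up to the application (or per-node constraints); the database itself won't stop you.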
Just wanted to point out that Citus Data is no longer a fork of Postgres and is now open source[0]. (Though, I'm sure their managed service is 'expensive' compared to unmanaged).
Reading this, I wonder how you'd do this: OS or Postgres upgrades with one normal Postgres instance (or a typical master+slave setup) without downtime and without using Postgres-BDR. Does somebody know?
OS upgrades, sure, just use two systems with one of the existing async rep solutions between them. You'll need to make the master read-only for a brief period before failover as the slave "catches up" with the master.
Online PG upgrades weren't possible (until now) because the binary log format changes between major versions. But pglogical (which underpins BDR) solves this issue: https://2ndquadrant.com/en/resources/pglogical/ The process would be the same as an OS upgrade (brief period of read-only availability required), but you can do it on one machine.
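For the upgrade case, the rough shape of a pglogical setup is something like this (DSNs are placeholders and the calls are paraphrased from the pglogical docs, so double-check the exact signatures): the old server becomes the provider, the new-version server subscribes, and once it has caught up you switch the application over.

    -- on the old (provider) server
    SELECT pglogical.create_node(
        node_name := 'provider1',
        dsn       := 'host=old-server dbname=appdb'
    );
    SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

    -- on the new server, running the newer PostgreSQL major version
    SELECT pglogical.create_node(
        node_name := 'subscriber1',
        dsn       := 'host=new-server dbname=appdb'
    );
    SELECT pglogical.create_subscription(
        subscription_name := 'upgrade_sub',
        provider_dsn      := 'host=old-server dbname=appdb'
    );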
We have one master PG database and handle upgrades during "planned downtime" during off-peak hours (during weekend nights our B2B app is relatively empty).
We prep a new instance with the upgrades in place, and then load over the production data during downtime.
I haven't worked with them directly, but I have followed the pgsql-hackers mailing list; they have a bunch of contributions back to core Postgres, and have some core maintainers on staff.
They seem very open with everything they do, as opposed to EDB, which does contribute back but also has a bunch of closed-source tools and its own closed-source Postgres fork.
Source? I have not seen a strong argument against running Postgres or other DBs in a container. The data directory should be mapped to a persistent volume, of course.
Then why would you run it in a container? You're not encapsulating anything or making anything portable if the interesting part of the container is a directory on the host.
Nobody said it's easier. But it's definitely more decoupled, with a more well-defined and reproducible binary image state which you can spin up and shut down at scale more easily. Nobody is saying that this is something you, your Postgres, or most people need. Over 10 million Docker Hub pulls of the Postgres images say there is a use case for using Postgres Docker images.
Source is a basic understanding of containers and a basic understanding of databases.
a) Docker and other container systems are still very young. They're missing a lot of badly-desired features and the community is still coalescing on best practices and safe approaches. It's certain there are complex and significant bugs. This is triply-true if you're using an orchestration framework over the top like Kubernetes. This isn't a reason why DBs are a bad fit in itself, but it's a reason why running production data in Docker is a bad idea at present.
b) Docker is meant for stuff that's portable and can be isolated from its hardware. It's meant to make it easy to run many applications on one machine without the resource overhead of virtual machines. DBs are distinctly not well-suited to this environment. DBs require a good deal of tuning and hardware awareness to run properly. DBs want to run forever and keep frequently-used data in memory; often, restarting a DB server is a big deal because it makes the caches go cold and the server is slow until they warm back up. DBs don't take kindly to sharing a box with 30 other applications. On a VM at least you know you have a dedicated chunk of memory. Not the case on a Docker host. As I discuss in C, Docker is basically the antithesis of long-running; there are many ways to make your container suddenly disappear or stop.
c) There are many non-obvious gotchas involved in Docker usage; it's a very non-intuitive interface. For example, using `docker attach` to connect to a running container will often place the user in control of the process. A simple ^C, which most sysadmins would interpret as "OK, I'm done with this log", closes the container's process and cleans up your container. That's just one example of a risk among many brought to you by the counter-intuitive and unfriendly Docker UX. Not the kind of fragility we want for something as mission-critical as a database system. Again, Docker is meant for processes that are consequence-free if they're cleaned up. Their UX is obviously designed that way too. You're supposed to run 8 containers from the same image and if one goes away the LB detects it and it's nbd. Databases don't work like that.
d) Docker sometimes becomes a zombie and fails to respond to commands. You can't connect to any of your running containers. You can't issue start or stop commands. You can't get output from `docker ps`. This has happened to me on multiple occasions. Do we really want a production DB running in that context?
e) You mentioned it, but storage. Even persistent volumes can be a PITA to configure properly with Docker, and if you fail to do so, you are looking down a black hole backed by the slow AUFS virtual filesystem. By default, all data written to a Docker container goes into its own differenced image and when you stop the container, it "goes away" (though it can usually be recovered by calling up the specific container ID instead of restarting from the image). These AUFS volumes have a habit of consuming a lot of disk space on the host, which sometimes causes point d to occur if you get to 0. Volumes cannot be mapped at build time and must be defined at runtime. There are many bugs with data volumes, including data loss bugs and filesystem feature bugs.
The long and short of it is that Docker is oriented toward applications that aren't close to the metal and can have their log and write output easily redirected to more durable systems. This fits the bill for many web apps (the application part, not the database parts). It absolutely does not fit the bill for long-running, close-to-the-metal, mandatory-firm-and-reliable-storage systems like databases.
I use Docker to run local development databases (as well as applications). I wouldn't use it for any DB more important than that. Docker is a cool set of abstractions around jails, but it is not a universal solution, just as cloud isn't.
Yeah, I understand that the concept of process isolation has existed for a long time. We used to know them as "chroot jails". cgroups obviously is a feature that has been dormant in the Linux kernel for a long time (until Docker came along and convinced everyone they had to use them, which actually isn't really a point in favor of containerization either: "we depend on this lightly tested feature of the kernel!"). Those things on their own are substantially different from what are today known as containers, a mix of concepts that involves custom daemons, complex network routing layers, bolt-on orchestration components like Swarm/Kube, and all sorts of other voodoo that makes today's understanding of "containers" distinct from historical uses either of the term or the concepts implemented via jails/zones/whatever.
Regarding d, this decision with Docker to use a daemon to control your processes (that is not init) continues to baffle me. It seems on the surface of it to be a really naive way to schedule your cgroups, but you could just as easily do it with a wrapper that terminates. Or just swallow that pill and go with systemd already.
I haven't seen any other container-like system on Linux with that design. And I haven't seen it go zombie on me, but I have had it die on me for reasons I have never fully understood. There are bugs in there, just like with any young project, of course. It's possible to recover from that state with an intimate understanding of how Docker works, but it wreaks havoc on higher-level tools.
a) This is not true. There are mature, stable and, unlike the Linux attempts, actually secure "containers". See Jails and Zones.
b) See a.
c) This is a docker issue, other platforms do not suffer these UX flaws.
d) See a.
e) See ZFS on FreeBSD and Illumos.
You can of course continue to use ill-designed software, such as Docker, and wait for someone to "make it right", or you can use alternatives that are designed well and work now.
> This is not true. There are mature, stable and unlike linux attempts
The comment above was about Docker specifically. I challenge you to run Docker on anything else than Linux (in production, that is).
> You can of course continue to use ill designed software, such as docker
Yeah, you don't use Docker because it's well designed (or even reliable). You use it because it is the strongest industry wide attempt at a standard there has been for the past ten years. There is a semi-standardized image format and api.
When the time comes to move all software to The Cloud, and some believe that to be inevitable (which you may not agree with), odds are that Docker is one of the accepted formats your provider understands.
Jails or zones have nothing to do with this. They are an implementation detail. I find it likely that the cloud will still run on Linux ten years from now.
The comment above, which you're referring to, starts with this:
> Source is a basic understanding of containers and a basic understanding of databases.
>
> a) Docker and other container systems are still very young. ...
So it seems to be about container systems in general, although it then talks about Docker as if it were a representative example (which it isn't).
FWIW I see nothing wrong with using containers (even Docker) for databases. Of course, it depends on what you expect from that - it may for example make automated deployment much easier, etc.
I'm discussing the modern implementations of containers, like Docker, rkt, and LXC. Things like jails have been time-tested, and most of the complaints in the post aren't applicable to them. "Container" has taken on a new meaning, referring to such new-fangled container frameworks. IMO calling something like a jail a container is now obtuse and incorrect.
My post was primarily referring to Docker, as you can tell from the grandparent that says "Docker and database do not mix", and then the rest of the post body that speaks to Docker-specific issues.
[0]: https://2ndquadrant.com/en/resources/bdr/