Patroni: A Template for PostgreSQL HA with ZooKeeper, Etcd, or Consul

ZalandoTech · on April 14, 2017

Zalando open source evangelist here. Some of my colleagues are on this thread, taking your questions. Meanwhile, wanted to plug a few other Postgres/K8s projects we're working on:

-- Spilo: HA PostgreSQL cluster using Docker (https://github.com/zalando/spilo)

-- kube-ingress-aws-controller: Configures AWS Load Balancers according to Kubernetes Ingress resources (https://github.com/zalando-incubator/kube-ingress-aws-contro...)

-- external-dns: a Kubernetes Incubator collaboration. Configure external DNS servers (AWS Route53, Google CloudDNS and others) for Kubernetes Ingresses and Services. (https://github.com/kubernetes-incubator/external-dns)

We also have several older Postgres-related tools listed here: http://zalando.github.io/.

Users and contributors very welcome; just drop a line in our projects' Issues Trackers.

jchw · on April 14, 2017

Looking forward to external-dns being merged. DNS records are probably the last thing in our Kubernetes cluster that isn't automated.

I'm still not sold on using ALBs for Ingress resources. ALB is clearly superior to ELB when using them for the same things, but it seems like ALB costs will soar when using ALBs as Ingress controllers since they increased the rule limit and subsequently started charging on them. Still, the alternative of having four layers for every route seems nearly as undesirable. I really don't want ALB -> nginx -> kube-proxy -> container.

hjacobs · on April 14, 2017

This is a bit off topic, but the mentioned Ingress controller (https://github.com/zalando-incubator/kube-ingress-aws-contro...) is not using the ALB rules, it just provisions ALBs with the right SSL certificate and points to some HTTP proxy doing the actual host/path routing (e.g. Skipper). See http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guid...

bogomipz · on April 14, 2017

I am curious did you look at Govenor(postres template for HA with etcd) before writing Patroni?

https://github.com/compose/governor

If you did can you comment on how these might compare and why you decided to write your own? Thanks!

hjacobs · on April 14, 2017

I guess you did not look into Patroni's README, as it states:

> Patroni originated as a fork of Governor, the project from Compose. It includes plenty of new features.

Governor's git repo seems to have no recent activity (last commit 11 months ago).

bogomipz · on April 14, 2017

Indeed I missed that, my bad. But anyway the README states:

>"Patroni originated as a fork of Governor, the project from Compose. It includes plenty of new features."

But it doesn't say what those features are. So I guess I'm curious what those features are without having to watch a whole youtube video. Or why they didn't submit PRs to Govenor for the added functionality.

CyberDem0n · on April 14, 2017

Compose guys are kind of happy with existing functionality of Governor and didn't really wanted to add something new. That was the main reason of making a fork.

If you read a further comments to this news you can get familiar with the part of new functionality of Patroni comparing to Governor.

bogomipz · on April 14, 2017

Thanks. Yeah I've been making my way through the video - "Elephants on Automatic: HA Clustered PostgreSQL with Helm"

and its discussed at the 15:57 minute mark here if anyone else is interested:

https://www.youtube.com/watch?v=CftcVhFMGSY

Cheers.

CyberDem0n · on April 14, 2017

You also can have a look on slides of my talk about at PGConf.US 2017: https://www.slideshare.net/AlexanderKukushkin1/patroni-ha-po...

Talk was also recorded but video is not yet published.

And one more thing, you can try to run a Live-Demo the same way as I did it during presentation: https://cyberdemn.blogspot.de/2017/04/patroni-ha-postgresql-...

bogomipz · on April 14, 2017

Oh neat, thanks!

Dowwie · on April 14, 2017

Could any contributers to Patroni opine on what impact Postgresql 10's logical replication will have? Are you excited about it?

hintbits · on April 14, 2017

While logical replication is a long-awaited and exciting feature, I'd use it to replicate only a subset of my database, not as a replacement for HA. At present, it cannot follow the master in a physical replication after promotion (basically, you have to initialize your logical replicas from scratch), and it has a bigger lag then the physical replication, especially for long-running transactions, as one has to wait until the commit before sending the data.

Nevertheless, I think Patroni can help configuring logical replicas that will never be part of the HA (if one can tolerate reinitializing logical replicas after failing over). Pool requests are welcome!

eikenberry · on April 14, 2017

I've been burned to many times by 3rd party HA solutions for Postgres. I'm not touching it again until they either have an official solution or if it is someone else's problem (eg. RDS).

hjacobs · on April 14, 2017

We also use RDS in Zalando (and it works great!). With running Patroni on Kubernetes we get several benefits over RDS which in our opinion warrant the effort. Some examples include:

* Real superuser access to the cluster

* Installing custom PostgreSQL extensions not available on RDS, PgQ (Queues in PostgreSQL), PgDecode (logical decode to Kafka/Json), PgJobs (minimalistic jobs in PostgreSQL)

* Easier migrations via S3/Streaming replication from and to AWS - RDS currently offers no way out of RDS to another PostgreSQL cluster

* Deployment model allows for more availability/flexibility with less costs - EBS + S3 allow us to run single node PostgreSQL with high uptime

* Multi node PostgreSQL can run on different node types

* OAuth login for PostgreSQL (via PAM)

FooBarWidget · on April 14, 2017

I wonder how much things like this are needed going forward. From my (rudimentary) knowledge about Kubernetes and stateful sets, I think that Kubernetes is able to solve a lot of the issues surrounding failover, recovery of the old master, and partitioning; providing that we use Kubernetes in combination with networked storage that guarantees reliability.

It appears that the way to setup PostgreSQL (or any replicated database) with Kubernetes is to treat index 0 in the stateful set as the master, and everything else as slaves. Each pod is to be connected to some networked storage. If the master goes down, then Kubernetes detects that through the health check, and simply reschedules the master on another node. The storage volume that the old master was using, is simply reattached to this other node. The master going down does not imply storage failure. Storage reliability then becomes a separate problem which is solved in another system (e.g. the RAID system or whatever).

In this kind of setup, there is no failover support in the database itself. From the point of view of the database itself, it looks as if the underlying hardware/OS "automatically" recovered from failure. This way we don't have to mess with promoting slaves and stuff.

Kubernetes already assigns static IPs to services, so pgpool and similar tools -- in so far they are only used to provide a stable network address for PostgreSQL -- become redundant. Network partitioning is "solved" by treating the Kubernetes state as the single authoritative description of the network state.

What do people think about this? Obviously this setup won't work if you don't have networked storage that can be reattached to another node, but I'm thinking that maybe reattachable networked storage should be the future.

foxylion · on April 14, 2017

I think the largest disadvantage is, that you won't get real high availability. Promoting a slave to a master is done in a second. Restarting the master can take some time (especially when it crashed previously). There is also the problem of losing the storage volume of your master node (ebs block storage do neither have 100% availability, nor 100% durability). In this case you can't recover your master.

smarterclayton · on April 15, 2017

This is correct. The time to recreate a stateful set pod is:

1. Time to detect node failure

2. Time for cloud provider to indicate node down

3. Time to create new pod and wait for startup

Today, 1/2 can be from 20-30s to infinite. Kubernetes does not drain a dead node unless it receives some external signal that the node is truly dead.

Future work involves adding such failure detection with a fencer (that can ensure the node is partitioned, at which point it's safe to drain it). Just like normal HA, Kubernetes needs to know the node is actually down, vs just partitioned.

I would recommend designing your stateful sets so that you can move master from index 0 to index N in the event of a missed heartbeat, and make sure your clients are accessing the master via some proxy (whether kubeproxy or another). If the node dies under partition, but clients are accessing via the service IP, starting deletion of the pod will update the service for any non-partitioned nodes.

tokenizerrr · on April 14, 2017

> providing that we use Kubernetes in combination with networked storage that guarantees reliability

What kind of networked storage do people like with kubernetes? I've recently set up a small cluster not in any cloud, and persistent cross-node storage is a concern. There's quite a few options such as glusterfs, but I'd be curious to know if anyone here knows about the tradeoffs.

takeda · on April 14, 2017

If you're in AWS then EFS is what is designed to do that, it is not cheap, but setting up a reliable solution is not easy either.

If your goal is file based storage (not block based), application can make api calls to get data and is mostly read, then S3 might also be an option.

Alternatively, roll-your-own on EC2 instances, which might not be as easy.

noway421 · on April 14, 2017

The set up op describing is more suited for clouds I guess, where networked storage is already a ready to use feature like aws ebs block storage.

jacques_chester · on April 14, 2017

Relaunching crashed processes is the visible part of HA and yes, Kubernetes and comparable systems do this well.

I don't have direct experience with building HA data systems, but some other teams at my employers (Pivotal) do. We ship some of these as BOSH releases intended to allow service injection into Cloud Foundry (and pretty soon OpenShift and Kubernetes too[0]). For the purposes of what you describe, BOSH and k8s are comparable.

Given that we have multiple teams that work continuously on building fully-packaged, fully automated HA data systems, I suspect that this stuff is harder than it looks.

[0] https://www.openservicebrokerapi.org/

cies · on April 14, 2017

This is very interesting. I thought it was best to keep yr db out of the k8s cluster. But this seems to make Postgres "cloud native".

The main reason for keeping dbs out of k8s was that the storage (persistence) solutions in k8s were not up to the task yet. Now I wonder how is Patroni doing persistence? I cannot find anything on it in the Patroni docs. Maybe Patroni is so "self healing" that a proper storage solution is not an issue -- dunno, just guessing here.

rsanders · on April 14, 2017

There are many solutions for network block device style persistent storage in K8S now. In the cloud, you have solutions like Amazon's EBS and GCE's persistent disks. On VMware you have local VMDKs. On your own equipment, there's support for iSCSI, Ceph RBD, OpenStack Cinder, Portworx, ScaleIO, Quobyte, and just about anything else if you write a few scripts to format and mount volumes on the host (FlexVolume). Except for FlexVolume, K8S is able to dynamically provision volumes of a requested size for you.

If you're feeling adventurous there's support for network filesystems like NFS, GlusterFS, CephFS, but I wouldn't recommend those for Postgres.

https://kubernetes.io/docs/concepts/storage/persistent-volum...

In one of his demos, Josh Berkus recommends just using the K8S ephemeral "emptyDir" storage type on each node, and counting on Patroni to maintain data persistence through replication. e.g. any one Postgres node may die and lose its local data, but when its replacement is created, it will be populated with a fresh copy of the database wherever it is scheduled on the K8S cluster. That sounds workable for smaller databases, though it gives every single person I've suggested it to nightmares.

abkfenris · on April 14, 2017

It's a bit scary when you're using K8S ephemeral storage and depending on a basebackup from another pod, but you can also set it up to restore from your usual backup system (WAL-E, pghoard, barman, pgBackrest). That way any time you lose a pod, you get a test of your backup system on restore.

eicnix · on April 14, 2017

There is currently some work being done for improving local storage usage in k8s. https://github.com/kubernetes/community/pull/306

Additionally k8s can be configured to restart pods on the same node so you can use the local node storage for persistence and rely on your DB for replication.

icebraining · on April 14, 2017

you can also set it up to restore from your usual backup system (WAL-E, pghoard, barman, pgBackrest)

That sounds interesting, has anyone documented that kind of setup?

abkfenris · on April 14, 2017

I'm working on one. I am waiting for a PR to land and a release to get generated for a bug that I found.

If the master pod in a StatefulSet is deleted, and kubernetes sets up a new one too fast, it currently keeps the master lock and failover never happens (it also tries to bootstrap from itself).

Once those land I'll work on a writeup of that setup

I've got a few Postgres instances running as single node Deployments and StatefulSets right now that backup and restore using pghoard for simple services that I've documented here http://alexkerney.com/2016/10/pghoard-kubernetes.html I do need to add a note about using livenessprobes to make sure the restore completed successfully. Sometimes these pods are cycled every day as half my K8S cluster is preemptable.

vishesh92 · on April 14, 2017

I am not really sure Patroni has anything to do with the persistence of data. It just uses etcd, zookeper or consul to elect a master in case of a failover and uses their key value store to save the information about the current master. Its upto you whether you keep your database on a container or not. And how the data is managed by the container. This talk by Josh Berkus explains how Patroni works pretty well. https://www.youtube.com/watch?v=OH9WSEiMsAw

jan_berlin · on April 14, 2017

The great thing today is there are lots of options with Patroni, it is very much up to you and your requirements what kind of storage you select and how you want to recover from node and storage failure.

Patroni with Spilo e.g. relies on EBS or local discs and can also ship WAL to S3. On Google it uses their equivalent solutions.

On Kubernetes earlier posts are correct that you may rely on Kubernetes bringing your pod back with the same underlying volume thus providing some kind of availability, but for others this is not good enough, e.g. remember EBS is single AZ only. Patroni can help here, running and orchestrating either slaves or giving you automated recovery from S3.

cies · on April 14, 2017

Thanks. All the k8s talk in the README confused me I guess.

And that's a great talk, made Patroni much clearer for me.

takeda · on April 14, 2017

Biggest issue, is not really k8s but docker, its drivers for storage are still quite young.

I personally wouldn't place a database in container for that reason. Essentially when you do so any durability guarantees that a database provides are going out of the window.

Things might seem to work fine, but when container get abruptly terminated in a wrong moment you might learn that all data is gone.

smarterclayton · on April 15, 2017

Kubernetes does not use Docker's storage drivers. There have been bugs with attach/detach of cloud block storage though that has impacted applications a few times over Kubernetes lifecycle, although all have been fixed fairly quickly.

hintbits · on April 14, 2017

Why would the data disappear, given that it's stored on an external volume (i.e. EBS or Google Cloud Storage or just a directory on your laptop), not in the container FS?

takeda · on April 17, 2017

I simplified, because in the end that would be the equivalent.

The danger is that some bugs in a filesystem, manifests themselves only when there is abrupt termination, otherwise things appear normal. Stuff like data is being written in incorrect order, or data is synced to disk at wrong time, or not at all.

On top of that user also expects performance, so some shortcuts can be taken that compromises the above (especially if the criteria is that docker is intended for stateless applications).

When there bugs like that and things are abruptly terminated, the data on the disk can be heavily mangled to a point that can't be recovered so essentially it would be lost.

616c · on April 14, 2017

An interview with the development team. I thought the project namw sounded familiar!

https://talkpython.fm/episodes/show/72/fashion-driven-open-s...

geoka9 · on April 14, 2017

What's zalando/patroni's approach to performance scaling/load balancing? For example, pgpool2 can distribute SELECTs among multiple slaves which makes it useful for heavy analytics loads. Can you guys do that, or you only handle HA?

CyberDem0n · on April 14, 2017

Patroni has REST API providing a health-checks for load-balancers. patroni:8008/replica will respond with http status code 200 only if node is running as replica.

In the HAProxy config file you just need to list all Patroni nodes, specify a health-check and it will do a load-balancing for you.

The same approach works for example with AWS ELB.

geoka9 · on April 14, 2017

But writes need to be directed to the master, right? Or Patroni is smart enough to do it by itself?

CyberDem0n · on April 14, 2017

Correct. Writes need to be directed to master. May be not only writes but some reads as well. Only application can know which statement can be executed on replicas and which must be executed on master.

hjacobs · on April 14, 2017

Here is a more recent talk about Patroni and Kubernetes (KubeCon Berlin 2017): https://www.youtube.com/watch?v=CftcVhFMGSY

eicnix · on April 14, 2017

How does this compare to other HA postgres products like stolon or Postgres-XL?

vishesh92 · on April 14, 2017

As per my understanding, Postgres-XL is not comparable to Patroni. Postgres-XL is comparable to PostgreSQL. Postgres-XL shards the data across multiple data nodes. Where as Patroni uses etcd, consul or zookeeper to provide HA for any Postgres cluster using replication (Data is written only to a single instance and replicated further).

I think you meant Stolon. Stolon is quite comparable to Patroni and has more features than Patroni, but I am not sure how many people are using it. I found more resources for Patroni as compared to stolon.

CyberDem0n · on April 14, 2017

> Stolon is quite comparable to Patroni and has more features than Patron.

Let me as a question, what feature Stolon has and Patroni doesn't? Can you give an example?

From my side I can provide list of features available in Patroni, but not in Stolon. For example:

* there is no way to do a controlled failover (switchover) in Stolon.

* in Patroni it's possible to exclude some nodes from a leader race.

* Patroni supports cascading replication

* Patroni can take basebackup from replicas if they are marked with a special tag

* it's even possible to configure Patroni to use a custom backup/recovery solution instead of pg_basebackup

* Patroni can give you a hint that postgres must be restarted to apply some configuration changes

* with Patroni it's even possible to schedule switchover or restart restart of postges on some specific time (for example 04:00 am, when traffic is minimal)

* with Patroni you can control High-Availability / Durability ratio. I.e.: what to do if master postgres has crashed? Either you will try to start it back or failover to a replica. But start of a crashed postgres may take a long time. With Patroni you can configure that if postgres didn't started in some amount of seconds - please do failover.

Stolon provides cloud-native deployment and Patroni doesn't? This is not really true. The main idea of Patroni is to be not dependent on some specific technologies. It can work as on bare-metal with the same success as on Kubernetes. It's very easy to build a custom solution with Patroni for your use-case. For example there is a Spilo project: https://github.com/zalando/spilo, a docker image build with Patroni and wal-e for a cloud deployments.

vishesh92 · on April 14, 2017

Thanks for this. As I mentioned in previous comment, I couldn't find much about Stolon. One of things I likee in stolon is the proxy, it enforce connections to the right PostgreSQL master and forcibly closes connections to unelected masters. Which is not present in Patroni, but I guess it should be doable using consul's service discovery and dynamic DNS. (I am not sure if this can be done using etcd and zookeper).

CyberDem0n · on April 14, 2017

Doesn't HAProxy provide such functionality? Patroni has REST API which can be used by HAProxy for a health-check.

patroni:8008/master will return http status code 200 only if the node running as elected master

patroni:8008/replica will return http status code 200 if node running as replica.

And final missing bit is a automation of generation of haproxy.cfg - it could be done with confd: https://github.com/kelseyhightower/confd

And here is an example of template file: https://github.com/zalando/patroni/blob/master/extras/confd/...

vishesh92 · on April 14, 2017

I am not sure, will HAProxy forcibly close connections to old master in case of a failover?

CyberDem0n · on April 14, 2017

on-marked-down shutdown-sessions

should do the trick

jsiepkes · on April 14, 2017

Patroni makes PostgreSQL cloud native; easily add and despose nodes in a self healing postgres cluster. Think of it as nosql style cluster management. PostgreSQL XL is more "classic" (non self healing) clustering.