We made this because we discovered that lots of companies using microservices have independently converged on this type of architecture for load balancing (client-side HAProxy + integrated service discovery component + health checks), but there wasn't a simple, easy-to-set-up, end-to-end solution out there. (Our favorite, for the record, is Airbnb's SmartStack.) We'd love some feedback and/or PRs and/or GitHub stars!
Airbnb recently merged some great changes into SmartStack [0, 1] provided by Yelp, including one that reduces the number of connections to ZooKeeper [2].
The changes that Yelp has made are great for SmartStack users, but you still need to set up ZooKeeper in order to get going. Yelp is really pushing these changes for the multi-datacenter use cases. I suspect this is one area where the strong consistency model of ZooKeeper is an even worse fit for service discovery than within a single datacenter.
To be honest, my favorite part of SmartStack is that you are not tied to a single discovery backend or mechanism. Both Synapse and Nerve support custom backends using whatever system you want (ZooKeeper, etcd, DNS, etc.). At the end of the day, both just expose basic configuration files, and we exploit that at Yelp to do pretty cool stuff: multiple systems (e.g. Marathon or Puppet) can inform Nerve/Synapse about services, and we control service latency using a DSL that compiles down to those configuration files.
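As a rough illustration of the "multiple systems feed one configuration file" idea, here is a minimal sketch. The field names, the `merge_service_sources` and `render_config` helpers, and the sample data are all hypothetical; this is not Nerve's or Synapse's actual schema, just the general shape of merging several sources into one generated file.

```python
import json


def merge_service_sources(*sources):
    """Merge service definitions from several sources.

    Later sources override earlier ones key by key (an assumed policy).
    """
    merged = {}
    for source in sources:
        for name, definition in source.items():
            merged.setdefault(name, {}).update(definition)
    return merged


def render_config(services):
    """Render the merged services as a JSON config document (illustrative format)."""
    return json.dumps({"services": services}, indent=2, sort_keys=True)


# Hypothetical inputs: one set of services from Puppet, one from Marathon.
puppet_services = {"api": {"host": "10.0.0.5", "port": 8080}}
marathon_services = {
    "api": {"port": 31000},
    "worker": {"host": "10.0.0.6", "port": 31001},
}

config_text = render_config(
    merge_service_sources(puppet_services, marathon_services)
)
```

The point is only that once the daemons consume plain files, anything that can write a file can act as a source of truth.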
Just to clear something up, we have not found it necessary to run ZooKeeper at a cross-datacenter level to get multi-datacenter support. We're still working on writing up the details, but the general gist is to run ZooKeeper in every datacenter and then cross-register from a single Nerve instance to multiple datacenters. That's why we had to remove fast fail from Nerve: cross-datacenter communication is flaky by nature. This approach has some tradeoffs, however, as all approaches do.
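To make the cross-registration idea concrete, here is a small sketch under loud assumptions: the in-memory `registry` dict stands in for one ZooKeeper ensemble per datacenter, and `cross_register` and `make_registrar` are hypothetical names, not Nerve code. It shows why failing fast is a poor fit: an unreachable remote datacenter is recorded and skipped rather than aborting the whole registration.

```python
# In-memory stand-in for one registry (e.g. a ZooKeeper ensemble) per datacenter.
registry = {"dc1": {}, "dc2": {}}


def make_registrar(dc_name, reachable=True):
    """Build a registration function for one datacenter (hypothetical helper)."""
    def register(service, instance):
        if not reachable:
            # Simulates a flaky cross-datacenter link.
            raise ConnectionError(f"cannot reach registry in {dc_name}")
        registry[dc_name].setdefault(service, set()).add(instance)
    return register


def cross_register(service, instance, registrars):
    """Try to register in every datacenter.

    Returns the names of datacenters where registration failed, instead of
    raising on the first failure, so one bad link does not block the rest.
    """
    failed = []
    for dc_name, register in registrars.items():
        try:
            register(service, instance)
        except ConnectionError:
            failed.append(dc_name)
    return failed
```

A caller would presumably retry the failed datacenters later; the retry policy is left out of this sketch.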
All that being said, this is an interesting system and I look forward to more mindshare in the area of service discovery!
I don't know, I'm a huge fan of consensus for service discovery.
It would be quite the kick in the pants if I thought that I had drained a group of machines and started some destructive maintenance on them, only to find that the eventual consistency fairy had forgotten about a couple of them, causing 500s on the site...
Multi-DC ZooKeeper isn't untenable. I've done it before with a quorum spread across five datacenters.
It's certainly possible to run ZooKeeper across multiple datacenters at scale, as Yelp has demonstrated; however, we've elected to make a different set of tradeoffs.
Our goals include reducing operational complexity and minimizing the impact of node failures, i.e. quickly removing failed nodes from consideration by clients.
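For a sense of what "quickly remove them from consideration by clients" can look like on the client side, here is a minimal sketch loosely modeled on the `fall`/`rise` thresholds of HAProxy health checks. The `HealthTracker` class and `healthy_backends` function are hypothetical and are not taken from Baker Street or HAProxy source.

```python
class HealthTracker:
    """Track one backend's health with fall/rise thresholds.

    A healthy backend is marked down after `fall` consecutive failed checks;
    a down backend is marked up again after `rise` consecutive passes.
    """

    def __init__(self, fall=2, rise=2):
        self.fall = fall
        self.rise = rise
        self.healthy = True
        self._streak = 0  # consecutive results that contradict current state

    def record(self, check_passed):
        """Record one check result and return the resulting health state."""
        if check_passed == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.rise if check_passed else self.fall
            if self._streak >= threshold:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


def healthy_backends(trackers):
    """Return the names of backends currently considered healthy, sorted."""
    return sorted(name for name, t in trackers.items() if t.healthy)
```

Lower `fall` values remove a failing node sooner at the cost of more false positives on a flaky network, which is the tradeoff being discussed here.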
Did you evaluate Vulcand (https://github.com/mailgun/vulcand), and if so, what were your thoughts? It sounds like Baker Street eliminates the SPOF. I've only used it to get a feel for it, never in production.
Thanks for the pointer! We hadn't seen this one. Based on a 5-minute read, my take would be: 1) it uses etcd (strongly consistent service discovery), 2) it looks like a centralized architecture rather than a distributed one, and 3) it looks like a proxy written from the ground up in Go (we decided to just use HAProxy because it's super well debugged).