
We made this because we discovered that lots of companies using microservices have independently converged on this type of architecture for load balancing (client-side HAProxy + an integrated service discovery component + health checks), but there wasn't a simple, easy-to-set-up, end-to-end solution out there. (Our favorite, for the record, is Airbnb's SmartStack.) We'd love some feedback and/or PRs and/or GitHub stars!
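
For anyone unfamiliar with the pattern, here's a rough sketch of what the client-side approach boils down to: a local agent watches service discovery, rewrites the config of an HAProxy bound to localhost, and reloads it. This is illustrative only, not Baker Street's actual code; the function names, paths, and reload command are made up.

    import subprocess
    import time

    def healthy_endpoints(service):
        # Stand-in for querying the discovery component for instances
        # that are currently passing their health checks.
        return [("10.0.0.12", 8080), ("10.0.0.13", 8080)]

    def render_haproxy_cfg(service, endpoints):
        lines = ["listen %s" % service,
                 "  bind 127.0.0.1:8000",   # clients always talk to localhost
                 "  balance roundrobin"]
        for i, (host, port) in enumerate(endpoints):
            lines.append("  server s%d %s:%d check" % (i, host, port))
        return "\n".join(lines) + "\n"

    last = None
    while True:
        cfg = render_haproxy_cfg("my_service", healthy_endpoints("my_service"))
        if cfg != last:
            with open("/etc/haproxy/haproxy.cfg", "w") as f:
                f.write(cfg)
            subprocess.call(["service", "haproxy", "reload"])  # or a hitless HAProxy reload
            last = cfg
        time.sleep(1)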



Airbnb recently merged some great changes into SmartStack [0, 1], contributed by Yelp, among them a reduction in the number of connections to zookeeper [2].

[0] https://github.com/airbnb/nerve/pull/71 [1] https://github.com/airbnb/synapse/pull/130 [2] https://github.com/Yelp/synapse/commit/82775562a35a89d60084f...


The changes that Yelp has made are great for SmartStack users, but you still need to set up zookeeper in order to get going. Yelp is really pushing these changes for the multi-datacenter use case. I suspect that's one area where zookeeper's strong consistency model is an even worse fit for service discovery than it is within a single datacenter.


To be honest, my favorite part of SmartStack is that you're not tied to a single discovery backend or mechanism. Both Synapse and Nerve support custom backends using whatever system you want (zookeeper, etcd, DNS, etc.). At the end of the day both are just driven by basic configuration files, and we exploit that at Yelp to do pretty cool stuff, like letting multiple systems (e.g. Marathon or Puppet) inform nerve/synapse about services, and controlling service latency with a DSL that compiles down to those configuration files.
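
To make the "it's all just config files" point concrete, here's a hypothetical sketch (field names and sources are illustrative, not our actual tooling) of multiple systems contributing service entries that get merged into the file nerve reads:

    import json

    def services_from_marathon():
        # stand-in for asking the Marathon API which tasks run on this host
        return {"web_main": {"host": "10.0.0.12", "port": 31001}}

    def services_from_puppet():
        # stand-in for entries laid down by configuration management
        return {"db_main": {"host": "10.0.0.12", "port": 5432}}

    services = {}
    for source in (services_from_marathon, services_from_puppet):
        services.update(source())

    with open("/etc/nerve/nerve.conf.json", "w") as f:
        json.dump({"instance_id": "host-10-0-0-12", "services": services},
                  f, indent=2)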

Just to clear something up: we have not found it necessary to run zookeeper at a cross-datacenter level to get multi-datacenter support. We're still working on writing up the details, but the general gist is to run zk in each datacenter and then cross-register from a single nerve instance into multiple datacenters. That's why we had to remove fast fail from nerve: cross-datacenter communication is, by its nature, flaky. This approach has its tradeoffs, of course, as all approaches do.
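
To sketch the cross-registration idea (using kazoo for zookeeper; hosts, paths, and payload here are made up, and this is not nerve's actual code): one local process publishes the same ephemeral node into each datacenter's ensemble, and a failed write over the WAN is treated as retryable rather than fatal, which is the "no fast fail" point above.

    import json
    import logging
    from kazoo.client import KazooClient

    ENSEMBLES = {"dc1": "zk1.dc1:2181,zk2.dc1:2181",
                 "dc2": "zk1.dc2:2181,zk2.dc2:2181"}

    def register_everywhere(service, host, port):
        payload = json.dumps({"host": host, "port": port}).encode()
        for dc, hosts in ENSEMBLES.items():
            try:
                zk = KazooClient(hosts=hosts)
                zk.start(timeout=10)  # cross-DC connects can be slow
                zk.create("/nerve/%s/%s_%d" % (service, host, port),
                          payload, ephemeral=True, makepath=True)
            except Exception:
                # Don't fail fast: a flaky WAN link to one DC shouldn't
                # block registration in the others; log and retry later.
                logging.exception("registration in %s failed", dc)

    register_everywhere("my_service", "10.0.0.12", 8080)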

All that being said, this is an interesting system and I look forward to more mindshare in the area of service discovery!


Awesome, great to hear the details (we had only heard about what you guys were doing secondhand from Igor). Looking forward to the write-up whenever you post it!


I don't know, I'm a huge fan of consensus for service discovery.

It would be quite the kick in the pants if I thought that I had drained a group of machines and started some destructive maintenance on them, only to find that the eventual consistency fairy had forgotten about a couple of them, causing 500s on the site...

Multi-DC zookeeper isn't untenable. I've done it before with a quorum spread across five datacenters.


It's certainly possible to run zookeeper across multiple datacenters at scale, as Yelp has demonstrated; however, we've elected to make a different set of tradeoffs.

Our goals include reducing operational complexity and minimizing the impact of node failures, i.e. quickly removing failed nodes from consideration by clients.
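
As a toy illustration of what "quickly remove them from consideration" can look like on the client side (thresholds and the plain TCP check are made up, not Baker Street's actual behavior): a local checker ejects an endpoint after a few consecutive failures, with no coordination round required.

    import socket
    import time

    FAIL_THRESHOLD = 3
    live = {("10.0.0.12", 8080), ("10.0.0.13", 8080)}
    failures = {}

    def check(host, port, timeout=1.0):
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    while True:
        for ep in list(live):
            if check(*ep):
                failures[ep] = 0
            else:
                failures[ep] = failures.get(ep, 0) + 1
                if failures[ep] >= FAIL_THRESHOLD:
                    live.discard(ep)  # stop routing to this node right away
        time.sleep(1)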


Did you evaluate Vulcand (https://github.com/mailgun/vulcand), and if so, what were your thoughts? It sounds like Baker Street eliminates the SPOF. I've only used Vulcand enough to get a feel for it, never in production.


Thanks for the pointer! This one we hadn't seen. Based on a 5-minute read, my take would be: 1) it uses etcd (strongly consistent service discovery), 2) it looks like a centralized architecture rather than a distributed one, and 3) it looks like a ground-up implementation of a proxy in Go (we decided to just use HAProxy because it's extremely well debugged).



