Show HN: Baker Street – A simple client-side load balancer for microservices (bakerstreet.io)
83 points by rdli on July 31, 2015 | 29 comments



We made this because we discovered that lots of companies using microservices have independently converged on this type of architecture for load balancing (client-side HAProxy + integrated service discovery component + health checks), but there wasn't a simple, easy-to-set-up, end-to-end solution out there. (Our favorite, for the record, is Airbnb's SmartStack.) We'd love some feedback and/or PRs and/or GitHub stars!


Airbnb recently merged some great changes into SmartStack [0, 1] contributed by Yelp, among them a reduction in the number of connections to zookeeper [2].

[0] https://github.com/airbnb/nerve/pull/71 [1] https://github.com/airbnb/synapse/pull/130 [2] https://github.com/Yelp/synapse/commit/82775562a35a89d60084f...


The changes that Yelp has made are great for SmartStack users, but you still need to set up zookeeper to get going. Yelp is really pushing these changes for multi-datacenter use cases. I suspect that's one area where zookeeper's strong consistency model is an even worse fit for service discovery than it is within a single datacenter.


To be honest, my favorite part of SmartStack is that you are not tied to a single discovery backend or mechanism. Both Synapse and Nerve support custom backends using whatever system you want (zookeeper, etcd, DNS, etc.). At the end of the day, both just expose basic configuration files, and we exploit that at Yelp to do pretty cool stuff like allowing multiple systems to inform nerve/synapse about services (e.g. Marathon or Puppet) and controlling service latency with a DSL that compiles down to those configuration files.

Just to clear something up: we have not found it necessary to run zookeeper across datacenters to get multi-datacenter support. We're still writing up the details, but the general gist is to run zk in every datacenter and then cross-register from a single nerve instance into multiple datacenters. That's why we had to remove fast-fail from nerve: cross-datacenter communication is flaky by nature. This approach has tradeoffs, of course, as all approaches do.
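Roughly, cross-registration means a single nerve instance announces the same local service into ZooKeeper ensembles in more than one datacenter. A sketch of the idea (key names are illustrative, not nerve's exact schema; the real config is YAML, rendered here as a Python dict):

  # Illustrative only: one nerve instance in us-east-1 registers its local
  # service into ZooKeeper ensembles in two datacenters.
  nerve_config = {
      "instance_id": "cart-01.us-east-1",
      "services": {
          "cart": {
              "host": "127.0.0.1",
              "port": 8080,
              "checks": [{"type": "http", "uri": "/health", "interval": 2}],
              "reporters": [  # hypothetical key; the point is two zk targets
                  {"type": "zookeeper", "hosts": ["zk1.us-east-1:2181"], "path": "/services/cart"},
                  {"type": "zookeeper", "hosts": ["zk1.us-west-2:2181"], "path": "/services/cart"},
              ],
          },
      },
  }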

All that being said, this is an interesting system and I look forward to more mindshare in the area of service discovery!


Awesome, great to know the details (we had heard about what you guys were doing secondhand from Igor). Looking forward to the write-up whenever you post it!


I don't know, I'm a huge fan of consensus for service discovery.

It would be quite the kick in the pants if I thought that I had drained a group of machines and started some destructive maintenance on them, only to find that the eventual consistency fairy had forgotten about a couple of them, causing 500s on the site...

Multi-DC zookeeper isn't untenable. I've done it before with a quorum spread across five datacenters.


It's certainly possible to run zookeeper across multiple datacenters at scale, as Yelp has demonstrated; we've just elected to make a different set of tradeoffs.

Our goals include reducing operational complexity and minimizing the impact of node failures, i.e., quickly removing failed nodes from consideration by clients.


Did you evaluate Vulcand (https://github.com/mailgun/vulcand), and if so, what were your thoughts? It sounds like Baker Street eliminates the SPOF. I've only used Vulcand to get a feel for it, never in production.


Thanks for the pointer! This one we hadn't seen. Based on a 5-minute read, my take would be: 1) it uses etcd (strongly consistent service discovery), 2) it looks like a centralized architecture rather than a distributed one, and 3) it looks like a ground-up implementation of a proxy in Go (we decided to just use HAProxy because it's extremely well debugged).


This looks really handy - thanks! A word of warning though: TfL aggressively protects the roundel trademark.


Yes. Came here to say this. Expect a nastygram from TfL, e.g.:

http://www.theregister.co.uk/2006/03/16/tube_map_madness/


This looks great! Looking forward to playing around with it.

I loved the idea of Airbnb's Synapse, but it's tricky to configure (you basically have to write an haproxy config from scratch, plus learn how synapse config sections map onto the haproxy config). It also seemed like the non-zookeeper backends were pretty unstable; I had to fix a few things to get it working with EC2 tags (and fwiw, at this point it's been over a month and my PR to merge the changes back upstream hasn't even been commented on).
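To give a sense of what I mean, each watched service carries its own chunk of raw haproxy configuration that you write by hand. A rough sketch (illustrative only, not Synapse's exact schema; the real config is YAML, rendered here as a Python dict):

  # Illustrative sketch: the "haproxy" section is essentially hand-written
  # haproxy directives that get spliced into the generated config.
  synapse_service = {
      "discovery": {"method": "zookeeper", "hosts": ["zk1:2181"], "path": "/services/users"},
      "haproxy": {
          "port": 3212,  # local port clients connect to
          "server_options": "check inter 2000 rise 3 fall 2",
          "listen": ["mode http", "option httpchk GET /health"],
      },
  }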

How does Baker Street handle restarting haproxy? Does it do anything like this [0] automatically to get zero-downtime configuration reloads?

[0] http://engineeringblog.yelp.com/2015/04/true-zero-downtime-h...


Currently we use the restart procedure described in the haproxy manual. We would like to get to true zero downtime, though; we've been looking both at the method described in the post you mention and at possibly using nginx instead of haproxy to achieve this.
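For reference, the manual's procedure boils down to starting a new haproxy process and handing it the old PIDs so they drain their connections and exit. A minimal sketch (paths are illustrative), which still leaves a small window where new connections can be refused:

  # Sketch of haproxy's documented soft reload: the new process is started with
  # -sf so the old processes stop accepting connections, finish, and exit.
  import subprocess

  def soft_reload(cfg="/etc/haproxy/haproxy.cfg", pidfile="/var/run/haproxy.pid"):
      with open(pidfile) as f:
          old_pids = f.read().split()
      subprocess.check_call(["haproxy", "-f", cfg, "-p", pidfile, "-sf"] + old_pids)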


If I'm reading it right, the directory service today is a single host. That was very misleading after these statements (which suggested something closer to Netflix Eureka):

"Zookeeper provides a strongly consistent model; the directory service focuses on availability."

"Baker Street doesn't use Zookeeper or the other popular service discovery frameworks because we wanted a simple, highly available service, not a strongly consistent one."

Edit: Which is not to say that the project isn't interesting, just that some of the copy felt like a bait and switch. :)


I had the same reaction, and I have severe reservations about availability-focused service location. The potential of firing traffic at the wrong nodes and having it dropped on the floor is a real red flag for me. When a directory service fails outright because it can't establish consistency, an application can at least, if not trivially then reliably, cache requests to be replayed later, once the health of the overall architecture has been established.


In a distributed architecture it is very difficult to avoid the possibility you mention even with a strongly consistent store at the center of your service discovery mechanism. The consistency the store provides doesn't necessarily extend to the operational state of your system.

For example, your zookeeper nodes may all be consistent with each other, but given that a server can fail at any time, that information, while consistent, may still be stale. Likewise, if a client is caching connections outside of zookeeper's consensus mechanism, those connections will also become stale in the face of changes.

Given these possibilities, there is always the potential for traffic to be dropped on the floor regardless of how consistent your store is, so ultimately what matters is how to minimize the probability of this occurring and whether your system can cope when it does.
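Whatever the store's consistency model, the client-side mitigation ends up looking much the same. A rough sketch (names are illustrative):

  # Illustrative only: tolerate endpoints that died after the last discovery
  # snapshot by failing over to the next one in the list.
  import requests

  def call_with_failover(path, endpoints, timeout=1.0):
      last_error = RuntimeError("no endpoints available")
      for host, port in endpoints:                 # snapshot from discovery
          try:
              return requests.get(f"http://{host}:{port}{path}", timeout=timeout)
          except requests.ConnectionError as err:  # node gone but still listed
              last_error = err
      raise last_error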


We didn't intend to do a bait and switch. We mentioned this in the docs, but perhaps it was a little too buried. Our plan is to support multiple instances of the directory server for high availability. This is similar in principle to how systems like DNS or NSQ function.


Yep, I did eventually find that. Having to search for it was frustrating; so much of the copy is devoted to describing what Baker Street isn't (hey, doesn't use consensus!) and not what it is (uses a single node, TODO: master/slaves or chain replication or blah blah blah). And it's kind of an important point, because it changes this from "might give this a go for a less critical service" to "unusable in the short term."


It's a fair point, so we'll clarify this (and we're working on the replication bit too). Thanks!


Thanks for sharing. Sent this to my team as we were talking about this problem just this afternoon.

Points for sudo nano in the install guide.


"Datawire Directory" => Mycroft


Curious: why choose something like Watson over haproxy's pretty solid built-in health-checking mechanisms?


We're running one HAProxy per application instance. HAProxy's built-in health checking is designed for when it front-ends more than one app instance (i.e., when it serves as a central proxy).

Watson checks the health of your local application, and propagates it to the (global) service discovery framework, so when other microservices want to connect to the service that Watson is monitoring, they know whether or not that service is available.
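As a rough sketch of that split (made-up URLs and field names, not Watson's actual API): the health checker polls the co-located app and only advertises it to the directory while it is actually serving.

  # Hedged sketch, not Watson's actual API: poll the local app's health
  # endpoint and report liveness to the (remote) directory service.
  import time, requests

  DIRECTORY_URL = "http://directory.example.com/register"  # hypothetical
  LOCAL_HEALTH_URL = "http://127.0.0.1:8080/health"        # hypothetical

  def heartbeat_loop(service="users", interval=2):
      while True:
          try:
              healthy = requests.get(LOCAL_HEALTH_URL, timeout=1).ok
          except requests.RequestException:
              healthy = False
          if healthy:
              # Only advertise while the local instance is actually serving.
              requests.post(DIRECTORY_URL, json={"service": service, "port": 8080}, timeout=1)
          time.sleep(interval)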


Seems like you're trading away haproxy as a centralized SPOF for brand-new custom code, which is itself a centralized SPOF, and you also need a new watchdog daemon to do what the haproxy instance would have done. It would be interesting to understand what problem forced that more complex arrangement, because it's not obvious to me at the moment.


Three reasons:

1. In the central LB setup, if the LB server dies, your service dies. Not in this setup, where HAProxy is deployed side-by-side with each instance.

2. Elasticity. Imagine you have a shopping cart microservice, a search microservice, and a users microservice. Each of these requires its own HAProxy instance. Every time you spin up or spin down an instance of one of these microservices, you need to reconfigure the central HAProxy to pay attention to the new set of instances.

3. Health checks don't work well over DNS. In the centralized load balancer setup, you end up relying on DNS so that your users microservice can talk to the shopping cart microservice (for example). DNS requires client polling and has propagation delays, so if one of your shopping cart load balancers dies, it takes time for all the other microservices to figure out where to connect.

A central LB works well if you have just a single microservice, but when you have dozens of them, you're suddenly managing dozens of load balancers dynamically, and that gets pretty unwieldy to manage.
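Mechanically, the sidecar model boils down to each node regenerating its local HAProxy config from the current discovery snapshot and then soft-reloading. A hedged sketch (layout and filenames are illustrative):

  # Illustrative sketch: render haproxy backend sections from a discovery
  # snapshot of {service: [(host, port), ...]} and write the local config.
  def render_backend(service, instances):
      lines = [f"backend {service}", "    mode http"]
      for i, (host, port) in enumerate(instances):
          lines.append(f"    server {service}-{i} {host}:{port} check")
      return "\n".join(lines)

  def write_config(snapshot, path="/etc/haproxy/haproxy.cfg"):
      body = "\n\n".join(render_backend(s, inst) for s, inst in sorted(snapshot.items()))
      with open(path, "w") as f:
          f.write(body)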


Except most people deploy dual HAProxy servers connected via heartbeat, which share a floating IP. There's no SPOF. Deploying new microservices is as simple as managing the HAProxy config file via Chef or Puppet.


You can definitely do that. There is more programming involved, since you need to figure out how to tie new instance deployments into Chef/Puppet/etc. to update HAProxy. You also need to figure out how to get it to update quickly. Finally, you'll need to figure out how to deploy your dual-HAProxy-with-heartbeat setup automatically every time you deploy a new type of microservice, and update DNS appropriately. It just means that instead of deploying Baker Street as part of your microservice push, you're deploying a) your microservice, b) your dual HAProxy setup and/or new HAProxy config, and c) DNS updates.

Lots of ways to solve this problem; our big focus here is on simplicity. If you have the Chef-fu and time to do all the above, you could definitely make it work.


So if an HTTP request comes in, how does it communicate with existing HTTP microservices and know whether they're available or not? Does it do this by polling?

I might actually give this a go, since I need to route HTTP requests to hundreds of Flask servers, but if one is busy, I don't want to keep hitting it.


The way this works is described here:

  http://bakerstreet.io/docs/architecture.html



