This works well if you need to push high bit rates and are lookig for relatively...

phil21 · on May 28, 2014

Great post, you said things better than I ever could.

So I'll nitpick instead :)

> You really dont want to announce your routes from the same host that will serve your traffic. Separate your control plane and data plane. Its terrible when a host has a gray failure, say oom or read only disk, and route announcements stay up while the host fails to serve traffic.

This was probably the largest area I spent architectural time on before deployment. Was it better to run a tiny little healthcheck script on my HAProxy boxes to tear down bgpd, or should I run a route server.

In the end, I went with what I felt was the simpler solution of the two for our scale (~60 total HAProxy boxes spread around the world) and used an extremely simple "is haproxy accepting connections on the VIP or not" script that stayed in memory. It was also tossed in init, just in case it did get oomkilled or crashed - and in the end we never had an outage based on a service check that should have been caught. Knock on wood. Completely agree that a route server is better than a naive script ran from cron as a service check.

The route-server method is more interesting to me at scale, but adds additional complexity to the problem. Now people need to know how this all works, your service checks need to execute remotely (not a bad thing!), and probably a few other things I'm forgetting. However, it makes management and scaling much better and is the way I'd go if I did it all over again.

The failure model for this setup is basically is HAProxy up? Yes? Then announce routes. If not, pull routes. HAProxy was responsible for detecting the health of the application itself and deciding what to do with an app failure. We did add some code later on to down haproxy should it be unable to reach any webservers, but honestly the complexity and additional failure modes this adds usually is not worth it due to it being such a rare event.

donavanm · on May 29, 2014

Ha! I did have an outage because route announcements and data plane were on the same host a few years back. Having a separate health check service & route server is a trade off of complexity vs control. I could see the argument when you only have a couple hosts total. With dozens of endpoints in a fleet its quite nice to tolerate more wonky grey failures.

Unfortunately I dont know of any existing public lib/application/framework that does this type of layer 2/3/4 load balancing for fleets of endpoints. The vrrp/carp/keepalive/heartbeat crew seem focussed on master/slave failover which is totally uninteresting to me.

phil21 · on May 29, 2014

Yeah, it's pretty custom. We started doing this nearly 10 years ago and when explaining it to vendors their eyes would glaze over. These days it seems quite a bit more common - and I hope I had a tiny bit to do with that evangelizing it wherever I could.

Curious how you solved the hash redistribution problem? We never came up with anything good (some clever hacks though!), but luckily for our uses it wasn't a big deal and we could do away with a whole shedload of complexity.

The best we came up with was pre-assign all the IP's (or over-assign if you wanted more fine-grained balancing) a given cluster could ever maximally utilize. Then distribute those IPs evenly across the load balancers, and have the remaining machines take over those IPs should there be a failure. This was complicated as hell, and obviously broke layer3 to the access port so was a non-starter.

I'm sure we had better/more clever ideas, but we never had reason to chase them down so I honestly forget. At this point, if someone needs to refresh a page once out of every 100 million requests I'm pretty happy.