This works well if you need to push high bit rates and are looking for relatively simple load balancing. A Trident box, a la Juniper QFX, can push a few hundred Gbps for ~$25,000. That's an incredibly low price point compared to any other LB solution.
Some caveats and comments about the technique.
BGP & ExaBGP are implementation details. OSPF, Quagga, & BIRD will all accomplish the same thing. Use whatever you're comfortable with.
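For concreteness, the ExaBGP flavor boils down to a health-check process whose stdout ExaBGP reads for announce/withdraw commands. A minimal sketch (the VIP, the local port being probed, and the interval are all placeholders, not anyone's production values):

```python
#!/usr/bin/env python3
"""Minimal ExaBGP health-check process (sketch).

ExaBGP runs this as a configured 'process' and reads route commands
from its stdout. VIP and probe target below are placeholders.
"""
import socket
import sys
import time

VIP = "192.0.2.1/32"        # service VIP we announce (placeholder)
CHECK = ("127.0.0.1", 80)   # local service to probe (placeholder)

def healthy() -> bool:
    try:
        with socket.create_connection(CHECK, timeout=1):
            return True
    except OSError:
        return False

announced = False
while True:
    up = healthy()
    if up and not announced:
        sys.stdout.write(f"announce route {VIP} next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not up and announced:
        sys.stdout.write(f"withdraw route {VIP} next-hop self\n")
        sys.stdout.flush()
        announced = False
    time.sleep(2)
```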
Scale-out can get arbitrarily wide. In a simplistic design you'll ECMP on the device (ToR) where your hosts are connected. Any network device will give you 8-way ECMP. Most Junos gear does up to 32-way today, and 64-way with an update. You can ECMP before that as well, in your aggregation or border layer. That would give you 64 x 64 = 4096 endpoints per external "VIP."
ECMP giveth and taketh away. If you change your next hops, expect all those flows to scramble. The reason is that the ordering of next hops / egress interfaces is generally part of how flows are assigned to next hops. In a traditional routing application this has no effect. When the next hops are terminating TCP sessions, you'll be sending RSTs to half of your flows.
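A toy model of the effect (not any vendor's actual hash, just hash-mod-N over the ordered next-hop list, which is the gist): withdraw one next hop and most of the surviving flows get reassigned to a different server, which promptly RSTs them.

```python
"""Why changing the ECMP next-hop set scrambles flows (toy model)."""
import hashlib

def pick(next_hops, flow):
    # Hash the flow identifier and index into the ordered next-hop list.
    h = int(hashlib.md5(flow.encode()).hexdigest(), 16)
    return next_hops[h % len(next_hops)]

flows = [f"10.0.0.{i}:5{i % 10}000->192.0.2.1:80" for i in range(10000)]
before = ["nh1", "nh2", "nh3", "nh4"]
after = ["nh1", "nh2", "nh4"]          # nh3 withdrawn

survivors = [f for f in flows if pick(before, f) != "nh3"]
moved = sum(1 for f in survivors if pick(before, f) != pick(after, f))
print(f"{moved}/{len(survivors)} surviving flows rehashed to a new next hop")
```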
For this same reason you'll have better luck advertising more-specific routes, like /32s instead of a whole /24. This can help limit the blast radius of flow-rehash events to a single destination "VIP."
There are more tricks you can play to mitigate flow rehashes. It's quite a bit of additional complexity, though.
For the same reason, make double plus sure that you don't count the ingress interface in the ECMP hash key. On Junos this is incoming-interface-index and family inet { layer-4 }, IIRC.
You really don't want to announce your routes from the same host that will serve your traffic. Separate your control plane and data plane. It's terrible when a host has a gray failure, say OOM or a read-only disk, and route announcements stay up while the host fails to serve traffic. You end up null-routing or throwing 500s for 1/Nth of your traffic.
Great post, you said things better than I ever could.
So I'll nitpick instead :)
> You really don't want to announce your routes from the same host that will serve your traffic. Separate your control plane and data plane. It's terrible when a host has a gray failure, say OOM or a read-only disk, and route announcements stay up while the host fails to serve traffic.
This was probably the area I spent the most architectural time on before deployment: was it better to run a tiny little healthcheck script on my HAProxy boxes to tear down bgpd, or should I run a route server?
In the end, I went with what I felt was the simpler of the two solutions for our scale (~60 total HAProxy boxes spread around the world) and used an extremely simple "is HAProxy accepting connections on the VIP or not" script that stayed in memory. It was also tossed in init, just in case it got OOM-killed or crashed - and in the end we never had an outage from a failure that a service check should have caught. Knock on wood. Completely agree that a route server is better than a naive script run from cron as a service check.
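Roughly, the shape of such a check (a sketch, not the production script; the VIP, port, failure threshold, and bgpd service name are all placeholders):

```python
#!/usr/bin/env python3
"""Sketch of an "is HAProxy accepting connections on the VIP" check.

Long-running, kept under init so it gets restarted if it crashes or is
OOM-killed. VIP, port, and the bgpd service name are placeholders.
"""
import socket
import subprocess
import time

VIP = ("192.0.2.1", 80)   # address:port HAProxy binds on the VIP (placeholder)
FAIL_LIMIT = 3            # consecutive failures before pulling routes

failures = 0
announcing = True
while True:
    try:
        with socket.create_connection(VIP, timeout=2):
            up = True
    except OSError:
        up = False

    if up:
        failures = 0
        if not announcing:
            # Healthy again: bring bgpd back so the VIP is re-announced.
            subprocess.run(["service", "bgpd", "start"], check=False)
            announcing = True
    else:
        failures += 1
        if failures >= FAIL_LIMIT and announcing:
            # Tearing down bgpd withdraws the VIP announcement upstream.
            subprocess.run(["service", "bgpd", "stop"], check=False)
            announcing = False
    time.sleep(5)
```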
The route-server method is more interesting to me at scale, but it adds complexity to the problem. Now people need to know how this all works, your service checks need to execute remotely (not a bad thing!), and there are probably a few other things I'm forgetting. However, it makes management and scaling much better and is the way I'd go if I did it all over again.
The failure model for this setup is basically: is HAProxy up? Yes? Then announce routes. If not, pull routes. HAProxy was responsible for detecting the health of the application itself and deciding what to do with an app failure. We did add some code later on to down HAProxy should it be unable to reach any webservers, but honestly the complexity and additional failure modes this adds usually aren't worth it for such a rare event.
Ha! I did have an outage a few years back because route announcements and the data plane were on the same host. Having a separate health-check service & route server is a trade-off of complexity vs. control. I could see the argument when you only have a couple of hosts total. With dozens of endpoints in a fleet it's quite nice to tolerate more wonky grey failures.
Unfortunately I don't know of any existing public lib/application/framework that does this type of layer 2/3/4 load balancing for fleets of endpoints. The VRRP/CARP/keepalived/heartbeat crew seem focused on master/slave failover, which is totally uninteresting to me.
Yeah, it's pretty custom. We started doing this nearly 10 years ago, and when we explained it to vendors their eyes would glaze over. These days it seems quite a bit more common - and I hope I had a tiny bit to do with that, evangelizing it wherever I could.
Curious how you solved the hash redistribution problem? We never came up with anything good (some clever hacks though!), but luckily for our uses it wasn't a big deal and we could do away with a whole shedload of complexity.
The best we came up with was to pre-assign all the IPs (or over-assign, if you wanted more fine-grained balancing) a given cluster could ever maximally utilize, then distribute those IPs evenly across the load balancers and have the remaining machines take over those IPs should there be a failure. This was complicated as hell, and it obviously broke layer 3 to the access port, so it was a non-starter.
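For illustration only, the ownership logic of that scheme might look something like the sketch below (hypothetical names and VIPs; it only covers who owns which VIP, not the failover mechanics that actually broke layer 3 to the access port):

```python
"""Sketch of the pre-assigned / over-assigned VIP scheme (illustration only).

Each VIP has a fixed "home" load balancer; when a box dies, only its
VIPs move, and the survivors' own assignments stay put.
"""

def assign_vips(vips, all_lbs, live_lbs):
    homes = sorted(all_lbs)                  # fixed, cluster-wide ordering
    live = sorted(set(live_lbs))
    plan = {lb: [] for lb in live}
    orphans = []
    for i, vip in enumerate(sorted(vips)):
        home = homes[i % len(homes)]         # stable pre-assignment
        (plan[home] if home in plan else orphans).append(vip)
    for j, vip in enumerate(orphans):        # survivors absorb the dead box's VIPs
        plan[live[j % len(live)]].append(vip)
    return plan

all_lbs = ["lb1", "lb2", "lb3", "lb4"]
vips = [f"192.0.2.{i}" for i in range(1, 9)]              # over-assigned: 8 VIPs, 4 boxes
print(assign_vips(vips, all_lbs, all_lbs))                # steady state
print(assign_vips(vips, all_lbs, ["lb1", "lb2", "lb4"]))  # lb3 failed; only its VIPs move
```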
I'm sure we had better/more clever ideas, but we never had reason to chase them down so I honestly forget. At this point, if someone needs to refresh a page once out of every 100 million requests I'm pretty happy.