I'm really reaching back into the depths of my memory, but I've implemented this in the past. It's not quite as simple as they make it sound - there are a lot of sticky edge cases that crop up here (some of which have no doubt been addressed in subsequent years).
- It heavily limits the number of nodes you can have - that is something the article does say, but I want to highlight here. It strikes me as a really bad strategy for scale-out.
- I've run into weirdness with a variety of different router platforms (Linux, Cisco, Foundry) when you withdraw and publish BGP routes over and over and over again (i.e. you have a flapping/semi-available service).
- It is true that when a node goes down, the BGP dead peer detection will kick in and remove the node. However, the time to remove the node will vary and will require tuning on the router/switch side of things.
This is a fairly crude implement to swing - a machete rather than a scalpel. You lose a lot of the flexibility load balancers give you, and depend a lot more on software stacks (routers/switches) that you have less insight and visibility into and that were not designed to do this.
My suggestion would be that this is a great way to scale across multiple load balancers/haproxy nodes. Use BGP to load balance across individual haproxy nodes - that keeps the neighbor count low, minimizes flapping scenarios, and you get to keep all the flexibility a real load balancer gives you.
One last note - the OP doesn't talk about this, but the trick I used back in the day was that I actually advertised a /24 (or /22, maybe?) from my nodes to my router, which then propagated it to a decent chunk of the Internet. This is good for doing CloudFlare-style datacenter distribution, but has the added benefit that if all of your nodes go down, the BGP route will be withdrawn automatically, and traffic will stop flowing to that datacenter. Also makes maintenance a lot easier.
> My suggestion would be that this is a great way to scale across multiple load balancers/haproxy nodes. Use BGP to load balance across individual haproxy nodes
Exactly. BGP, while it may work like the OP said, was not meant to live this close to the actual server nodes.
You could push BGP even further away. In a more traditional model, it's meant to be used to switch (or load balance) between geographically separated datacenters.
This works well if you need to push high bit rates and are looking for relatively simple load balancing. A Trident box, a la the Juniper QFX, can push a few hundred Gbps for ~$25,000. That's an incredibly low price point compared to any other LB solution.
Some caveats and comments about the technique.
BGP & ExaBGP are implementation details. OSPF, Quagga, & BIRD will all accomplish the same thing. Use whatever you're comfortable with.
Scale-out can get arbitrarily wide. In a simplistic design you'll ECMP on the device (ToR) where your hosts are connected. Any network device will give you 8-way ECMP. Most Junos stuff does up to 32-way today, and 64-way with an update. You can ECMP before that as well, in your agg or border layer. That would give you 64 x 64 = 4096 endpoints per external "VIP."
ECMP giveth and taketh away. If you change your next hops, expect all those flows to scramble. The reason is that the ordering of next hops / egress interfaces is generally included in the assignment of flows to a next hop. In a traditional routing application this has no effect. When the next hops are terminating TCP sessions, you'll be sending RSTs to half of your flows.
For this same reason you'll have better luck advertising more specific routes, like /32s instead of a whole /24. This can help limit the blast radius of flow rehash events to a single destination "VIP."
There are more tricks you can play to mitigate flow rehashes. It's quite a bit of additional complexity, though.
For the same reason, make double plus sure that you don't count the ingress interface in the ECMP hash key. On Junos this is incoming-interface-index and family inet { layer-4 }, IIRC.
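To make the rehash behavior concrete, here's a toy sketch in Python (not any vendor's actual ECMP implementation; the flows and next hops are made up) comparing a naive "hash mod N over an ordered next-hop list" against rendezvous hashing, one of the resilient-hashing style mitigations:

    #!/usr/bin/env python3
    # Toy illustration only: count how many flows change next hop when one next
    # hop is withdrawn, for (a) naive "hash mod N over an ordered list" and
    # (b) rendezvous (highest-random-weight) hashing.
    import hashlib
    import random

    def h(*parts) -> int:
        return int(hashlib.md5("|".join(map(str, parts)).encode()).hexdigest(), 16)

    def modulo_pick(flow, next_hops):
        return next_hops[h(flow) % len(next_hops)]

    def rendezvous_pick(flow, next_hops):
        return max(next_hops, key=lambda nh: h(flow, nh))

    next_hops = [f"10.0.0.{i}" for i in range(1, 9)]   # 8-way ECMP, made-up hops
    flows = [("198.51.100.%d" % random.randint(1, 254), random.randint(1024, 65535),
              "203.0.113.10", 80, "tcp") for _ in range(10_000)]

    survivors = next_hops[:-1]                         # withdraw one next hop
    for name, pick in (("modulo", modulo_pick), ("rendezvous", rendezvous_pick)):
        moved = sum(1 for f in flows if pick(f, next_hops) != pick(f, survivors))
        print(f"{name}: {moved / len(flows):.0%} of flows changed next hop")

With the naive scheme, most flows land on a different next hop after a single withdrawal (hence the RSTs); with a rendezvous-style hash, only the flows that were pinned to the withdrawn next hop move.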
You really don't want to announce your routes from the same host that will serve your traffic. Separate your control plane and data plane. It's terrible when a host has a gray failure, say OOM or a read-only disk, and route announcements stay up while the host fails to serve traffic. You end up null-routing or throwing 500s for 1/Nth of your traffic.
Great post, you said things better than I ever could.
So I'll nitpick instead :)
> You really don't want to announce your routes from the same host that will serve your traffic. Separate your control plane and data plane. It's terrible when a host has a gray failure, say OOM or a read-only disk, and route announcements stay up while the host fails to serve traffic.
This was probably the largest area I spent architectural time on before deployment: was it better to run a tiny little healthcheck script on my HAProxy boxes to tear down bgpd, or to run a route server?
In the end, I went with what I felt was the simpler solution of the two for our scale (~60 total HAProxy boxes spread around the world) and used an extremely simple "is HAProxy accepting connections on the VIP or not" script that stayed in memory. It was also tossed in init, just in case it got OOM-killed or crashed - and in the end we never had an outage from a failure the service check should have caught. Knock on wood. Completely agree that a route server is better than a naive script run from cron as a service check.
The route-server method is more interesting to me at scale, but adds additional complexity to the problem. Now people need to know how this all works, your service checks need to execute remotely (not a bad thing!), and probably a few other things I'm forgetting. However, it makes management and scaling much better and is the way I'd go if I did it all over again.
The failure model for this setup is basically: is HAProxy up? Yes? Then announce routes. If not, pull routes. HAProxy was responsible for detecting the health of the application itself and deciding what to do with an app failure. We did add some code later on to down HAProxy should it be unable to reach any webservers, but honestly the complexity and additional failure modes this adds usually aren't worth it for such a rare event.
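Roughly, the in-memory check boiled down to something like the sketch below (assuming ExaBGP as the announcer, which reads announce/withdraw commands from the spawned process's stdout; the VIP, port, and interval are placeholders, not our real values):

    #!/usr/bin/env python3
    # Sketch of an ExaBGP health-check process: announce the VIP /32 while
    # HAProxy accepts connections on it, withdraw it when it doesn't.
    import socket
    import sys
    import time

    VIP = "203.0.113.10"   # placeholder VIP, bound on the HAProxy box
    ROUTE = f"route {VIP}/32 next-hop self"
    PORT = 80              # is HAProxy accepting connections on the VIP?
    INTERVAL = 2           # seconds between checks

    def haproxy_accepting() -> bool:
        try:
            with socket.create_connection((VIP, PORT), timeout=1):
                return True
        except OSError:
            return False

    announced = False
    while True:
        up = haproxy_accepting()
        if up and not announced:
            sys.stdout.write(f"announce {ROUTE}\n")
            sys.stdout.flush()
            announced = True
        elif not up and announced:
            sys.stdout.write(f"withdraw {ROUTE}\n")
            sys.stdout.flush()
            announced = False
        time.sleep(INTERVAL)

ExaBGP would be configured to run this as one of its API processes; the exact config stanza differs between ExaBGP versions.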
Ha! I did have an outage because route announcements and data plane were on the same host a few years back. Having a separate health check service & route server is a trade-off of complexity vs. control. I could see the argument when you only have a couple of hosts total. With dozens of endpoints in a fleet it's quite nice to tolerate more wonky grey failures.
Unfortunately I don't know of any existing public lib/application/framework that does this type of layer 2/3/4 load balancing for fleets of endpoints. The VRRP/CARP/keepalived/heartbeat crew seem focused on master/slave failover, which is totally uninteresting to me.
Yeah, it's pretty custom. We started doing this nearly 10 years ago and when explaining it to vendors their eyes would glaze over. These days it seems quite a bit more common - and I hope I had a tiny bit to do with that evangelizing it wherever I could.
Curious how you solved the hash redistribution problem? We never came up with anything good (some clever hacks though!), but luckily for our uses it wasn't a big deal and we could do away with a whole shedload of complexity.
The best we came up with was to pre-assign all the IPs (or over-assign, if you wanted more fine-grained balancing) a given cluster could ever maximally utilize, then distribute those IPs evenly across the load balancers, and have the remaining machines take over those IPs should there be a failure. This was complicated as hell, and obviously broke layer 3 to the access port, so it was a non-starter.
I'm sure we had better/more clever ideas, but we never had reason to chase them down so I honestly forget. At this point, if someone needs to refresh a page once out of every 100 million requests I'm pretty happy.
Half a million dollars of load balancers? Either you are buying from the wrong vendor, or you have some wonky ideas of how many load balancers you need per data center, or you are not using them correctly. (Hint: check A10 Networks and Zeus.)
The reality is that if your problem is only L3, then arguably it can be solved many ways. For example, networks have been doing tens of gigabits of L3 load balancing using DSR for ages. Dynamic route propagation doesn't have a hold on this (albeit it's more "elegant").
But most people do more than L3, and really do L4-L7 load balancing, and most modern "application load balancing" platforms are really software packages bundled up in a nice little appliance. This is where packages like Varnish with its VCL/VMODs and caching, aFleX (from A10 Networks), and TrafficScript from Zeus, amongst others, come in. Shuffling bits is the easy part! Understanding the request, and making decisions on that, is the harder part.
If you split the problem, and are using Varnish or nginx as your application load balancer, you can't claim you've gotten rid of load balancers; you were either not buying the right platform initially, or not using it correctly. When you say "stop buying load balancers"… you must first define what you mean by "load balancer" ;)
For the record, I've used commercial load balancing platforms as well as contributed patches to, and used, OSS load balancing platforms.
Half a million dollars of NetScalers is extremely easy, and given that Google originally ran all NetScalers, a lot of ex-Google people who run operations at startups now default to them as a kneejerk. Similarly, F5 is a pretty easy budget torpedo, too.
This is actually a great idea. The last company I worked for ran into several issues with load balancing servers. Two out of the three major releases I was around for were unmitigated disasters.
The first was because of old load balancing servers getting bogged down with traffic. The CIO got pissed and dropped three million on brand-spanking-new SSD drives and "state-of-the-art" servers. Cue the next release.
Pretty much the same issue. It was two lines in a program that was calling a file from SharePoint thousands of times a second, which bogged down all three of the load balancer servers with traffic within minutes of the release. It took the back-end developers a week and some help from Microsoft to fix the bug. I just sat back and giggled, since the CIO had spent two hours in a meeting with the whole IT department lecturing them on the importance of load testing immediately after the first release's failure.
Needless to say, they didn't do any load testing for the applications either time, which contributed to the issue. Of course, it just goes to show that even with the bestest, newest hardware, you can still bring your site/applications to their knees.
> It was two lines in a program that was calling a file from SharePoint thousands of times a second, which bogged down all three of the load balancer servers with traffic within minutes of the release.
That type of issue can be quickly identified if you have, on the team, the sort of mind that is inquisitive about what goes on at low levels, and is not afraid of poking around with tcpdump, network interface traffic counters, and stuff like that.
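For instance, a crude sketch of the counter-watching half of that on Linux, just sampling /proc/net/dev (the interface name is a placeholder):

    #!/usr/bin/env python3
    # Print per-second rx/tx byte rates for one NIC by sampling /proc/net/dev,
    # to spot the kind of sudden flood described above.
    import time

    IFACE = "eth0"  # placeholder interface name

    def read_bytes(iface):
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
        raise ValueError(f"interface {iface} not found")

    prev_rx, prev_tx = read_bytes(IFACE)
    while True:
        time.sleep(1)
        rx, tx = read_bytes(IFACE)
        print(f"{IFACE}: rx {(rx - prev_rx) / 1e6:.1f} MB/s, tx {(tx - prev_tx) / 1e6:.1f} MB/s")
        prev_rx, prev_tx = rx, tx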
This article is a bit low on details, so it's hard to judge the quality of the proposed solution (without having tested a similar setup).
We faced the choice of either upgrading our aging Foundry load balancers or building our own solution a few years ago and came up with a very stable and scalable setup:
2+ load balancers (old web servers sufficed) running Linux and:
* wackamole for IP address failover (detects peer failure with very low latency, informs upstream routers; identical setup for all load balancers works, can be tuned to have particular IP addresses preferably on particular load balancer hosts) http://www.backhand.org/wackamole/
* Varnish for HTTP proxying and load balancing (identical setup on all load balancers) - www.varnish.org
* Pound for HTTPS load balancing (identical configuration on all load balancers, can handle SNI, client certificates etc. ...) http://www.apsis.ch/pound/
This scales pretty much arbitrarily, just add more load balancers for more Varnish cache or SSL/TLS handshakes/second. We also have nameservers on all load balancers (also with replicated configuration and IP address failover). Configuration is really easy, only Varnish required some tuning (larger buffers etc.) and Pound (OpenSSL really) was set up carefully for PFS and good compatibility with clients.
The only drawback is that actual traffic distribution over the load balancers is arbitrary and thus unbalanced (wackamole assigns the IP addresses randomly unless configured to prefer a particular distribution), but the more IP addresses your traffic is spread out over, the less of a problem this becomes.
This solution works, but, as you point out, balancing over your load balancers is a huge hack and basically relies on DNS.
If you replaced wackamole/DNS RR with BGPd on your varnish/pound boxes, you would achieve the same goals but be able to fully direct your traffic flows yourself vs. a random RFC-busting DNS cache somewhere.
The other big downside to this solution is being forced to run an L2 broadcast domain for failover to work. That's fine at your scale here, but when you get into dozens or hundreds of switches and larger scale, I firmly believe dropping L3 down to the access port (if at all possible) is the way to go. Debugging STP issues on such networks is about the last thing in the world I'd like to be doing on a Friday night.
Really he's talking about layer 4 load balancing, not layer 3, and assuming your Juniper router has an Internet Processor II ASIC to juggle TCP flows. You're still buying hardware to do the load balancing; you just use software to do the BGP announce.
Honestly it all seems a bit crude and unreliable. If I'm writing a software load balancer I'm not going to use curl, bash scripts, and pipes to do it. But this is why devops people shouldn't be designing highly available traffic control software.
It's a cool hack because you probably already have bought the hardware to do this. Most decent switches these days can run BGP (or at least OSPF, which is capable of the same thing). And switches are usually way cheaper than hardware load balancers, at least per-port. Sure it's not perfect but for a startup trying not to spend a lot it can get you pretty far.
I don't know. From an infrastructure perspective I don't like the idea of having too many eggs in one basket, like combining the router/switch with the LB VIPs. Ideally you'd get a couple of commodity boxes and configure them with LVS or pfSense or something. That way things like maintenance and access control of different parts of your network are separated based on the resource, and the stability of one component won't necessarily affect another. It would also probably be cheaper to buy a couple of servers than a couple of routers/switches for redundancy.
You're not understanding how this works at all. Your router/switch doesn't "combine" the vips.
Ignore the running of bgpd on the webserver - that's really an extremely specific use-case that is not useful for most folks.
However, imagine your scenario where you have two routers, two switches, and two load balancers configured in failover (LVS per your example) - with webservers behind that stack.
Now you need more than a single load balancer worth of capacity? How do you scale it?
Generally, you're pretty much stuck doing DNS RR to load balance across VIPs, and you add at least one VIP per load balancer you have. Need to do maintenance? Good luck not directing traffic to the load balancer you want to take out of service :) You can wait 3 days for all the DNS caches in the world to purge, or you're going to be killing sessions when you fail that VIP over.
Now consider instead running BGPd vs. DNS RR.
You have a single VIP, and as many load balancers as you like. I enjoy HAProxy, so I'll use it here.
All these machines advertise the VIP to the switch they are connected to. You set up path cost on your network so all these advertisements share an equal cost at your routers, and your router will ECMP to each. It doesn't matter what switch or rack your HAProxy box is connected to in your network (I suggest paying attention for traffic management reasons) - as long as it can speak BGP to the switch/router, the traffic flows.
Need to do maint? Kill bgpd on one of the HAProxy boxes. Current sessions stay up, you go get a coffee, and when you come back you have an empty session table and are free to do whatever you like to the machine. Turn it back on by starting up bgpd, and watch your traffic instantly rebalance.
You are completely correct that if you don't need more than one (in a pair for HA) load balancer worth of capacity, this solution is likely overkill. But once you need to start scaling, you're quickly going to learn that it will either be prohibitively expensive, or come with lots of downsides. There pretty much are no downsides to this architecture, other than needing someone with a small amount of clue to operate it.
What I was saying with that "combining" is that your router is now essentially the vip, in the sense that it is the load balancer and the peers are chosen and routed to from it. As opposed to a normal router which is merely passing routed traffic into the network and letting a different device handle load balancing. The idea being that different devices are used for different purposes, and separation of their functions may improve overall stability and increase flexibility of your network services.
One downside here is that ECMP assumes all paths cost the same, which is ridiculous in real-world load balancing. One of your haproxies is going to get overloaded, and then traffic to your site is going to intermittently suck balls as sessions stream into both under-loaded and over-loaded boxes.
Of course, you have the same problem with round-robin DNS to load balancers, but in the case of a DR LVS load balancer for example, at least it's just starting the connection and handing it off to the appropriate proxy instead of randomly pinning sessions to specific interfaces. With DR it's the backend proxy that determines its return path; the LVS VIP isn't in the path. With LVS it can pick a destination path based on real-world load.
The other downside that you seem to gloss over with regard to scaling is the maximum of 16 ECMP addresses in the forwarding table. I'm sure we'll never need more than 16 of those, though....... (For reference: the company I used to work for had up to 23 proxies just for one application... might cause some hiccups with this set-up)
Doing maintenance on a VIP address and doing maintenance on one of these BGP peers works about the same. You stop accepting new connections, let old connections expire, then take down the VIP. As for changing DNS records, instead of that you can either add a hot-spare VIP with the IP of the one you want to maintain, or add the IP of that VIP to an existing load balancer.
Ok, I understand what you meant now. I do disagree - it's simply doing what routers do, and has no specific knowledge or configuration for the VIP. It's simply forwarding traffic based on a destination table just like any other packet. If this was problematic in any way, your average backbone would implode - ECMP is utilized extensively to balance busy peers. Also routers already do redundancy (at least via L3) extremely robustly - so it's basically a "free" way to load balance your load balancers. You simply are not going to get the same level of performance out of a LVS/DR solution, as it's competing with very mature implementations done in silicon. We'll have to agree to disagree here.
Of course in ECMP all paths are the same - I don't see this as a downside though. Most router vendors do support ECMP weights if really needed, but there are better ways to architect things. I've run this setup with over 1500gbps of Internet-facing traffic, and never ran into a full 10g line because it was engineered properly. An in-house app that lowers my hashing inputs would probably require a different setup though, I agree.
16 ECMP is a decent number, but these days most routers I work with support 32. Some are supporting 64 now. But that's almost irrelevant, unless you're stuffing all your load balancers on a single switch. It's per-device, so you have 8 load balancers connected (and peering via BGP) to one switch, 8 another, and so on. Those then forward those routes up to the router(s) which then ECMP from there (up to 16/32 downstream switches per VIP). I've never needed more than "two levels" of this so I haven't really played with a sane configuration for more than 1024 load balancers for a single VIP (or 512 in your 16-way case). It scales more than perhaps a dozen companies in the world would need it to. Note that this explanation may sound complicated, but in a well engineered (aka not a giant L2 broadcast domain that spans the entire DC) network it just happens without you even specifically configuring for it.
Since my knowledge is dated - how do you "stop accepting new connections" with the LVS/DR model? I'm sure you can, I just can't mentally model it at the moment. You need to have the VIP bound to the host in question for the current connections to complete; how do you re-route new connections to a different physical piece of gear at the same time, utilizing the same VIP?
There are certainly downsides to this model as well, I don't want to pretend it's the ultimate solution. But, it's generally leaps and bounds better than any vendor trying to sell you a few million dollars of gear to do the same job. The biggest downside to ECMP based load balancing is the hash redistribution after a load balancer enters/leaves the pool. I know some router vendors support persistent hashing, but my use case didn't make this a huge problem. There are of course ways to mitigate this as well, but they get complicated.
In the end, for the scale you can achieve with this the simplicity is absolutely wonderful. It's one of those implementations you look at when you're done and say "this is beautiful" since there are no horrible-to-troubleshoot things that do ARP spoofing and other fuckery on the network to make it work. ECMP+BGP is what you get, you can traceroute, look at route tables, etc. and that displays reality with no room for confusion. No STP debugging to be found anywhere :)
You don't know about keepalived or something? Your "wait 3 days" example is actually just bringing the VIP up on another haproxy in one second. You can get essentially identical behavior like this: kill keepalived and the other box takes over the IP instantly. VRRP.
I haven't kept up on it, when did they add TCP state syncing?
Edit: Glanced over the docs, I don't see this listed as an obvious feature. So I'm missing what your point is.
If you mean you can fail over a VIP instantly, sure, I agree. But you're dropping the TCP session for all those clients utilizing that VIP at the moment. I was illustrating how with ECMP you can just wait a few minutes for the sessions to migrate away/time out, then do your maint without impacting a single TCP flow.
Then use the state "EQUAL" and nopreempt options. This will make your VIP fail over only when one of the hosts dies, and keep it there regardless of whether the original host comes back up.
Right, but you already have to buy two switches for redundancy, right? What else are you going to plug those servers into? I get what you're saying about stability etc., but BGP is a core function of a network operating system. I do agree that this is not a typical setup and probably not for everyone, but it is a clever use of existing protocols, which made for an interesting read. I think for some people it could solve a real problem until they have time to do something better.
This is a cool setup, but with the caveat that Allan stated, it forces you to think a little more about a layer that most systems people are less experienced in. The software approach is particularly useful because one could take the "healthcheck" setup and have it keep your alerting/dashboards in sync with reality (e.g. do healthcheck, fork: return exit code; POST {$hostname: 'ok'} to metric collector).
I also see that Shutterstock is actively hiring. For anyone looking, Shutterstock is a great place to work and employs some really brilliant people.
Really? If you can't understand this, or why it's better than almost all other architectures for massive horizontal scaling you're not thinking it through.
I've run similar setups utilizing ECMP -> HAProxy -> content servers that scaled into the multi-terabit range.
My junior level sysadmins understood how it worked, and it's a hell of a lot nicer to be able to run L3 to every access port and not deal with epic hacks like DSR and other extremely hard to troubleshoot stuff on L2.
It can be explained basically as "see this process 'bgpd' running? That is what tells the traffic to come to this load balancer - kill it and the traffic goes away, start it back up and it comes back". From there, the config stuff is trivial and it's just another HAProxy instance.
The hardest part of implementing such a solution is coming up with sane service-checking scripts. You want to down a single HAProxy instance should it be having issues, but you certainly don't want to down every single one should all your webservers alert at the same time (e.g. a failed application update, or whatever). We had ours set up with very basic healthcheck scripts for BGP (is haproxy alive? is it answering requests? stay up!), and then much more complex checks HAProxy did itself on the webservers - with paths of last resort and the like.
This architecture also scales great when you put your big boy pants on and need to start doing anycasting. You pretty much already have the architecture set up for it; you just need to change some IPs and how you do route aggregation in each PoP for your anycast space. It's a great feeling when you can down an entire PoP and traffic instantly moves over to the next closest, then comes right back after maintenance.
I have yet to see a more simple, concise, and reliable architecture for serving up massive amounts of HTTP. Once you get into the 100gbps+ range, the usual vendor offerings are laughable considering the costs. I would say based on most vendor demos we did, the BGPd+HAProxy solution was far easier to understand and administer at a large scale.
DNS RR to horizontally scale needs to finally die off. ECMP is a great way to retain full control over your traffic flow, and is essentially "free" on any modern networking gear that you already have.
As you start engineering bigger and bigger systems you'll start to discover that you sometimes need a complex solution to a big problem.
The sticky point here for some people is that it requires you to hire actually talented people, and not rely on trendy methodologies focused on getting acceptable work from a larger pool of mediocre developers and sysadmins.
Why would this be a nightmare to document? This is an approach that has been successfully implemented and maintained at many companies. I think it's more a matter of whether or not this approach works for your team, not whether or not it can be documented (which it can be).
I'm not a fan of this methodology myself, but BGP is an incredibly standard technology. You're just not used to it in the areas you tend to work in. Which is fine, and it's also fine if you don't want to hire a guy with a networking skillset to work on your network. I just ask that you're a bit more realistic with your reasoning against it.
This can shift complexity elsewhere in your stack. A couple of points to add.
Be mindful of the specific routing hardware you're using:
Announcing and withdrawing prefixes can cause the router to select new next hops (i.e. servers). This is mostly a problem with TCP and other connection oriented protocols (or even connectionless if you're expecting a client to be sticky to a server).
You may also lose the ability to do unequal-cost load balancing.
I think that's an important catch: flow breaking can happen depending on what hardware you're using and how you swap in next hops. It definitely requires a deep understanding of how flows are hashed.
You can also do this (equal-cost multipath to servers) without a dynamic routing protocol, but you are at the mercy of whatever health checks your top-of-rack switch supports.
On Cisco switches you can use an IP SLA check to monitor for DNS replies from a DNS server, and then have a static route that tracks the SLA check. If your DNS server stops responding, the route is withdrawn and traffic is routed away. This can happen within a few seconds.
Slides from a NANOG talk about this (PDF): http://www.nanog.org/meetings/nanog41/presentations/Kapela-l...
Routers already do this, it's generally called fast external fallover. If an interface on the router goes down it immediately takes down whatever BGP peer was coming across that link.
Sorry for the deleted comment - after circulating this article internally, I learned Arista already has better solutions for this particular problem than writing handlers to interface statuses. Decided to retract my comment (while you were replying, incidentally) as there are a number of solutions you can implement on a switch involving BGP knobs, fast-server failover, ECMP and consistent hashing.
>"it is actually more of a load-balance per-flow since each TCP session will stick to one route rather than individual packets going to different backend servers."
This strikes me as expensive. Does this mean packets no longer pass through the ASIC-only side of a router, and thus the software in the router has to do some of the heavy lifting, limiting the capacity/throughput to a mere fraction of what the router is really capable of?
Disclaimer: I have only a high-level overview of router tech.
Been running software load balancers for over a decade. I started with LVS (Linux Virtual Server), now called IPVS. Now we run HAProxy and we're looking at Apache Traffic Server.
Some of the load balancers have even run BGP as called out in the OP. Nothing really fancy but enough to be interesting.
One of the coolest things I built was a Global Server Load Balancer to balance load balancers. We needed it initially to move data centers. It was built on top of PowerDNS and a ketama hash.
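A rough sketch of the ketama idea (placeholder datacenter names and VIPs; this isn't the actual PowerDNS backend): hash each datacenter onto many points of a ring, then map the querying resolver's IP onto the ring, so the same resolver consistently gets the same datacenter and removing a datacenter only moves the keys that pointed at it.

    #!/usr/bin/env python3
    # Toy ketama-style ring for picking a datacenter VIP per resolver IP.
    import bisect
    import hashlib

    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class KetamaRing:
        def __init__(self, nodes, points_per_node=100):
            self._ring = sorted((_hash(f"{n}-{i}"), n)
                                for n in nodes for i in range(points_per_node))
            self._keys = [k for k, _ in self._ring]

        def pick(self, key: str) -> str:
            idx = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
            return self._ring[idx][1]

    datacenters = {"dc-east": "203.0.113.10", "dc-west": "198.51.100.10"}  # placeholders
    ring = KetamaRing(datacenters)

    # The DNS backend would answer a resolver's query with the VIP its IP maps to.
    print(datacenters[ring.pick("192.0.2.55")])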
As someone who works with NetScalers on the regular, I have to say I love this idea. Citrix support is terrible and NetScalers are such a pain to configure. Then I see the bill and it frosts the cake.
We recently upgraded from version 9 to version 10 and it took down our production site because of some asinine undocumented rate limiting they "finally enforced" in version 10.
I'd like to play with software load balancing in the testing facility.
Even though the above says load-balance per-packet, it is actually more of a load-balance per-flow since each TCP session will stick to one route rather than individual packets going to different backend servers. As far as I can tell, the reasoning for this stems from legacy chipsets that did not support a per-flow packet distribution.
Is this not fairly risky? It's essentially relying on a bug?
Yeah, I spent last weekend configuring Cisco gear that I feel should basically have been done in software on Linux. The era of the hardware firewall/load balancer is over. Buy a dedicated box (or two) and configure it - it's faster and more predictable/reliable.
Doesn't work for datacenters. It's also implemented with round-robin DNS to nodes (1-N; check X-Forwarded-For) in each AZ, which then handle the balancing.
Also worth noting that unless you turn on cross-zone balancing, if an AZ doesn't have a node in it and the RR DNS points clients at that AZ, they'll be turned away. Additionally, without it you need to scale by multiples of the AZs you run in; otherwise you'll have unbalanced traffic.
On another note, I've always been curious whether they're just abstractions around HAProxy at the per-node level.
Very true. Cross-AZ load balancing works quite well. I believe Amazon has said it's RR across the servers with the least connections, but degenerates to a simple RR without many nodes per AZ.