ECMP is old, we've been pulling this stunt for a long time, for the right appli...

asuffield · on March 16, 2016

(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google, and I'm oncall for this service.)

Well, it's "not new" in the sense that this system's been running google.com for about 8 years now ;)

This is just the first time that we've published how it works.

kijiki · on March 16, 2016

We (Cumulus Networks) support a feature called "resilient hashing" that ensures that if a software LB fails, only the flows going to that LB are redistributed to the remaining LBs.

You still lose some connections when an LB fails, but only the ones going through the failed LB. Unrelated flows to other LBs are not impacted.

We've got multiple customers doing variants of the LB architecture Google talks about here.

newman314 · on March 17, 2016

I didn't know Cumulus had a LB product. Any more details that you can share?

wmf · on March 17, 2016

You can turn any switches into ghetto load balancers by running BGP on some hosts and advertising a /32 into your switches.

bogomipz · on March 17, 2016

You want BGP on your edge routers, where your transit connections are. These are what do the ECMP towards your LBs, which speak iBGP. I'm not sure where you would use a cheap switch in this setup, certainly not at the edge.

wmf · on March 18, 2016

Many recent data centers run eBGP between all switches. I can't explain the rest without diagrams.

bogomipz · on March 18, 2016

Huh? Then you must not understand what you are talking about very well. Are you talking about running BGP to ToR switches? What does that have to do with ECMP or load balancers? It's just an L3 design and it's not new.

alinspired · on March 17, 2016

I think they mean the hashing algorithm for Cumulus ECMP routing, which serves the downstream LBs.

bogomipz · on March 17, 2016

But this resiliency has nothing to do with the hashing per se, correct? Once the route is withdrawn via BGP it is no longer a viable path, so it wouldn't ever be routed to and, by extension, "hashed" (source/dest etc.). Or am I misunderstanding what you are saying?

kijiki · on March 19, 2016

The problem that resilient hashing solves is that if you have 8 LBs in an ECMP group, and one dies and gets withdrawn, a naive hash function would redistribute all flows randomly, meaning every active connection would break.

Resilient hashing means that only flows going to the dead LB will get rehashed to other LBs. Those flows would break anyway, but the remaining flows are OK.
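To make the difference concrete, here is a minimal Python sketch of the two behaviours described above. It is an illustration only, not how any particular switch implements it: the function names, the 256-entry bucket table, and the use of SHA-256 as a stand-in for a hardware 5-tuple hash are all assumptions made for the example.

```python
import hashlib

NUM_BUCKETS = 256  # assumed table size, chosen only for illustration


def flow_hash(flow: str) -> int:
    """Stable hash of a flow identifier (stand-in for a 5-tuple hash)."""
    return int.from_bytes(hashlib.sha256(flow.encode()).digest()[:8], "big")


def naive_pick(flow: str, lbs: list) -> str:
    # Modulo over the live next-hop list: removing one LB changes the
    # modulus, so most flows land on a different LB afterwards.
    return lbs[flow_hash(flow) % len(lbs)]


def build_buckets(lbs: list) -> list:
    # Fixed-size indirection table; each bucket points at one LB.
    return [lbs[i % len(lbs)] for i in range(NUM_BUCKETS)]


def remove_lb(buckets: list, dead: str, survivors: list) -> list:
    # Rewrite only the buckets that pointed at the dead LB,
    # spreading them round-robin over the survivors.
    out = list(buckets)
    k = 0
    for i, lb in enumerate(out):
        if lb == dead:
            out[i] = survivors[k % len(survivors)]
            k += 1
    return out


def resilient_pick(flow: str, buckets: list) -> str:
    # The modulus is the table size, which never changes.
    return buckets[flow_hash(flow) % len(buckets)]


if __name__ == "__main__":
    lbs = [f"lb{i}" for i in range(8)]
    dead = "lb3"
    survivors = [lb for lb in lbs if lb != dead]
    flows = [f"10.0.{i // 256}.{i % 256}:443" for i in range(10000)]

    before = build_buckets(lbs)
    after = remove_lb(before, dead, survivors)

    naive_moved = sum(
        1 for f in flows
        if naive_pick(f, lbs) != dead
        and naive_pick(f, lbs) != naive_pick(f, survivors)
    )
    resilient_moved = sum(
        1 for f in flows
        if resilient_pick(f, before) != dead
        and resilient_pick(f, before) != resilient_pick(f, after)
    )
    print("healthy flows moved (naive):", naive_moved)
    print("healthy flows moved (resilient):", resilient_moved)
```

In this sketch the flows that were on the dead LB are remapped either way. The difference is whether flows on the seven healthy LBs are disturbed as well.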

meebindok · on March 21, 2016

Such "flow-aware" bucket resiliency might work if a single LB forwards to all backend servers. But if, as suggested in Maglev, ECMP is used towards multiple such Maglev LBs (each of which forwards to a set of backend servers), then we cannot pursue a (distributed) "flow-aware" resiliency.

In such cases, consistent hashing or Maglev hashing might be the only option.
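As a rough illustration of why a consistent hash helps here, the sketch below uses a generic hash ring with virtual nodes. It is not the table-population algorithm from the Maglev paper; the `Ring` class, the `vnodes` parameter, and the backend names are hypothetical. The point is that several LBs can each build the same mapping independently, with no shared per-flow state, and that removing a backend leaves flows on the other backends where they were.

```python
import bisect
import hashlib


def h(key: str) -> int:
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Generic consistent-hash ring (illustrative, not Maglev hashing)."""

    def __init__(self, backends, vnodes: int = 100):
        # Each backend owns `vnodes` points on the ring. Sorting makes the
        # result independent of the order the backends were listed in.
        self._points = sorted(
            (h(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def pick(self, flow: str) -> str:
        # First ring point clockwise from the flow's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, h(flow)) % len(self._keys)
        return self._points[idx][1]


if __name__ == "__main__":
    backends = [f"be{i}" for i in range(5)]
    lb_a = Ring(backends)                   # one LB's view
    lb_b = Ring(list(reversed(backends)))   # another LB, built independently
    flow = "198.51.100.7:51234->203.0.113.10:443"
    print(lb_a.pick(flow) == lb_b.pick(flow))
```

Because every LB computes the same answer for a given flow, it matters less which LB the upstream ECMP hash happens to choose after a failure.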

bogomipz · on March 17, 2016

Usually, for ECMP to your LBs, you run a routing daemon on the LB box, such as Quagga or BIRD, and it speaks BGP to the edge. If an LB fails or goes away, the route is withdrawn from the BGP peer, and this withdrawal is how faults are dealt with. Can you or someone else elaborate on how Maglev adds to or differs from that? Unfortunately I cannot get to Google at the moment.