The consistent hashing stuff in the paper is pretty cool. In order to distribute traffic among backends, they came up with a new "Maglev hashing" algorithm that gives a more even distribution than existing techniques, but is less robust to changes in the set of servers. The trick is that you can locally memoize the results of your mostly-consistent hash function, and rely on an extra layer of connection affinity from the upstream ECMP routers to the load-balancers, so that the same memoized value is used every time. So you never actually drop connections unless both a backend instance and a load balancer fail at the same time, which should be very rare. Clever!
As an aside, I couldn't help noticing these lines on adjacent pages:
> Maglev has been serving
Google’s traffic since 2008. It has sustained the rapid
global growth of Google services, and it also provides
network load balancing for Google Cloud Platform.
> Maglev handles
both IPv4 and IPv6 traffic, and all the discussion
below applies equally to both.
In that case, any chance GCE will be getting IPv6 support soon? ;)
And not just the ones that that backend was handling, but also some % of the overall traffic for re-balancing.
The degree of connection affinity from the ECMP is limited, and there's no "reliance" on it. If a connection flip-flopped between two or more load balancers, there would be no drops, thanks to the consistent fashion.
as a GCloud user I've never seen any mention of IPv6. That's the main reason i've not bothered trying to get on the ipv6 bandwagon. I don't want to strugle through it if my hosting provider doesn't even support it.
CloudSQL instances support ipv6,[1] and appengine apps can receive ipv6 traffic, but only suggestion regarding delay in getting ipv6 to users was hardware limitations.[2] I wanted to know why op thinks this is only for external customers.
cool, thanks for the extra details. maybe they are using ipv6 for a portion of the internal infrastructure and the OP extrapolated that to mean they use it everywhere internally.
I don't need to tell you how much of an ecosystem exploded after the BigTable whitepapers. It might not be realistic to expect that kind of reaction for load balancing tech, but it's great to see them continue to do this sort of thing. Also looking forward to seeing what this release in particular may inspire.
Surprised there are so few comments after this many upvotes. It's always great seeing companies share their work, such as MapReduce, BigTable, TensorFlow, etc from Google (Netflix's ops tools also are very interesting, not to discount the contributions of Facebook, Microsoft et al).
Would love to see the HN community's opinion on this, as someone who is not an expert in ops/infrastructure.
Exactly how the routes get steered to servers can vary (static routes with track objects, OSPF or BGP on the server), but the basic idea of using network gear to ECMP traffic between servers is, if not ancient, pretty common.
I suspect there's something novel in control plane or hashing used to steer ECMP but I didn't see anything about that in the article.
ECMP is old, we've being pulling this stunt for a long time, for the right applications (e.g., UDP, short lived TCP). Adding/removing routes in ECMP can cause flows to hash to a new destination. The connection tracking and consistent hashing in Maglev is important to deal with faults and avoid disrupting unrelated flows. Still nothing new.
We (Cumulus Networks) support a feature called "resilient hashing" that ensures that if a software LB fails, only the flows going to that LB are redistributed to the remaining LBs.
You still lose some connections when an LB fails, but only the ones going through the failed LB. Unrelated flows to other LBs are not impacted.
We've got multiple customers doing variants of the LB architecture Google talks about here.
You want BGP on your edge router, where your transit connections, these are what do the ECMP towards your LBs which speak iBGP. I'm not sure where you would use a cheap switch in this setup, certainly not at the edge.
Huh? Then you must not understand what you are talking about very well. Are you talking about running BGP to ToR switches? What does that have to do with ECMP or load balancers? Its just an L3 design and its not new.
but this resiliency has nothing to do with the hashing per se correct? Once the route is withdrawn via BGP it is no longer a viable path so it wouldn't ever be routed to and by extension "hashed(source/dest etc.) Or am I misunderstanding what you are saying?
The problem that resilient hashing solves is that if you have 8 LBs in an ECMP group, and one dies and gets withdrawn, a naive hash function would redistribute all flows randomly, meaning every active connection would break.
Resilient hashing means that only flows going to the dead LB will get rehashed to other LBs. Those flows would break anyway, but the remaining flows are OK.
such a "flow aware" bucket resiliency might work if a single LB forwards to all backend servers.
but, as suggested in Maglev, if a ECMP is used towards multiple such Maglev LBs (each of them are forwarding to a set of backend servers), then we cannot pursue a (distributed) "flow-aware" resiliency..
in such cases, only consistent hashing or maglev hashing might be the only option.
Usually ECMP on your LBs, you run a routing daemon on the LB box - such Quagga or BIRD and the speak BGP to the edge, if an LB fails or goes away then the route is withdrawn from BGP peer and this withdrawl is how faults are dealt with. Can you or someone else elaborate on how Malev adds or differs from that? Unfortunately I can not get to google at the moment.
Yeah, the interesting details are in the paper. If I understand it correctly, ECMP alone doesn't give you a very even distribution because of the small number of hash buckets. And every time the routing table changes (e.g. when taking backends in or out of rotation) it will drop connections because packets are suddenly being hashed to different destinations. Maglev adds an additional layer behind ECMP that solves those problems.
> This is possible because Google clusters, via Maglev, are already handling traffic at Google scale. [...] Google’s Maglevs, and our team of Site Reliability Engineers who manage them, have that covered for you. You can focus on building an awesome experience for your users, knowing that when your traffic ramps up, we’ve got your back.
This is probably one of he most compelling arguments I've heard: "It will work well because we both rely on it working well. We have skin in the game."
Great blog post! Thoughtful explanations and links to research papers[1] is a much classier way to compete and differentiate compared to just saying "we're better, we're better" like they've done before[2].
I love the jab at AWS in the end:
> 'knowing that when your traffic ramps up, we’ve got your back.'
They've heard that AWS load balancers need "prewarming" to handle huge traffic spikes, so they're throwing one in there...
The paper talks a lot about pushing Linux out of the way in the receive path (including to avoid copies), only to have a demultiplexing thread steer packets once they arrive in userspace. Linux has PACKET_MMAP for a long time and modern linux has RPS, which would seem to exactly duplicate this functionality without the copies or single point of contention (one thread running on one core, vs. N interrupt queues directly waking threads pinned on N cores). Can anyone comment on why they aren't using it?
> Because the NIC is no longer the bottleneck, this figure
shows the upper bound of Maglev throughput with the
current hardware, which is slightly higher than 15Mpps.
In fact, the bottleneck here is the Maglev steering mod-
ule, which will be our focus of optimization when we
switch to 40Gbps NICs in the future
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google, and I'm oncall for this service.)
On the understanding that I can't tell you any more than what it already says in the paper, the key detail that you're looking for is section 4.1.2:
"""
Maglev originally used the Linux kernel network stack
for packet processing. It had to interact with the NIC using
kernel sockets, which brought significant overhead to
packet processing including hardware and software interrupts,
context switches and system calls [26]. Each
packet also had to be copied from kernel to userspace and
back again, which incurred additional overhead. Maglev
does not require a TCP/IP stack, but only needs to find a
proper backend for each packet and encapsulate it using
GRE. Therefore we lost no functionality and greatly improved
performance when we introduced the kernel bypass
mechanism – the throughput of each Maglev machine
is improved by more than a factor of five.
"""
My understanding is that RPS just steers packets into the Linux TCP stack. Nothing on the maglev machine wants to process the TCP layer; that workload can be distributed over the backends.
In our case the kernel buys us very little but costs a lot, as others correctly pointed out in this thread. We spent a lot of time working on our performance via the kernel, with moderate success, but userspace talking directly to the NIC was a huge leap in performance.
This is awesome! At work, we're a heavy user of google's network load balancer to route persistent TCP websocket connections to our haproxy cluster. It's worked flawlessly. Glad to see the design behind this released!
> to build clusters at scale, we want to avoid the need to place Maglev machines in the same layer-2 domain as the router, so hardware encapsulators are deployed behind
the router, which tunnel packets from routers to Maglev
machines
Is "hardware encapsulator" just a fancy way of saying they're using tunnel interfaces on the routers, or is a "hardware encapsulator" an actual thing?
Yes, it's just a 10GbE ethernet switch that can encapsulate the traffic in VXLAN headers, so that it can traverse east/west between any of thousands (millions?) of hypervisors without requiring traffic to hairpin to a gateway router and back. The logical networks all exist in an overlay network, so to the customer VMs, you get L2/L3 isolation. But, to the underlying hypervisors, they actually know which vNICs are running on each hypervisor in the cluster, so they can talk directly on a large connected underlay network at 10GbE (x2) line rate.
This is the standard way of distributing traffic in large datacenters. That way you get extremely fast, non-blocking line rate between any two physical hosts in the datacenter, and since the physical hosts know which VMs/containers are running on them, they can pass the traffic directly to the other host if VMs exist in the same L2 network, and even do virtual routing if the VMs exist across L3 boundaries - still a single east/west hop.
Broadcom makes "switch on a chip" modules that will do VxLAN encapsulation and translation to VLAN or regular Ethernet frames. That chipset is available in lots of common 10/40/100 GbE switches from Arista/Juniper/White Box.
In a regular IP Fabric environment we would all this device a VTEP.
There is white label OEM gear (the Arista and Cisco gear is now just OEM with their firmware running on it), but unless you're Google or Facebook and can write your own firmware, chances are you're better off with an "enterprise" solution like Arista or Cisco who will give you support and fix bugs in the firmware for you.
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google, and I'm oncall for this service.)
No.
Edit: expanding on this a little, it's not something that's been released so we can't talk about it. I don't think I can comment on "illumin8"s proposals other than to say that I'm pretty sure they don't work here.
Google's exact ToR (top of rack) switch code isn't available, but you can buy a switch from any number of network gear vendors (Arista, Cisco, Brocade, Juniper, HP, etc), that can do VXLAN encapsulation and send the traffic over a leaf/spine network that covers thousands of racks.
I can't imagine Google is building clusters at such an alarming rate that it would justify manufacturing its own silicon for edge deployment, which suggests whatever commodity silicon is in the magic box can probably be found in a variety of vendor equipment wrapped in OpenFlow or similar
Using ECMP to balance the traffic across the load balancers is definitely the way to go. You can couple this with a dynamic routing protocol (BGP/OSPF/ISIS) to achieve fast failure detection of load-balancers too.
We used this strategy at a previous company I worked at deploying large Openstack based clouds, worked really well. Though if you want to to be easy on the backends you need to use some sort of stick table or consistent hashing mechanism as described in the paper.
ECMP has been used for well over a decade. The DPI industry has been doing this entire load balancing thing for at least 15 years.
10gbps with 8 cores is pretty crappy performance. People using DPDK and netmap are getting 10gbps with a single core. Netmap can bridge at line rate for 64 byte packets with a single core.
I am actually surprised this is 'news'. Perhaps google should reach out to some of the DPI companies to figure out how to really scale this. Currently they are 'overpaying' on commodity hardware by at least 8x over the current cutting edge in this type of field.
Just have a look at the NFV world for the cutting edge packet forwarding rates
Interesting read. I am most fascinated by their kernel bypass mechanism for packet transmission. They claim that their system requires proper support from the NIC hardware, does not involve the Linux kernel in packet transmission, does not require kernel modification, and allows them to switch between NICs. As well, if I'm reading the claims in the related work section correctly it seems this solution is architecturally different than netmap/pf_ring or dpdk/openonload.
As an aside, I couldn't help noticing these lines on adjacent pages:
> Maglev has been serving Google’s traffic since 2008. It has sustained the rapid global growth of Google services, and it also provides network load balancing for Google Cloud Platform.
> Maglev handles both IPv4 and IPv6 traffic, and all the discussion below applies equally to both.
In that case, any chance GCE will be getting IPv6 support soon? ;)