Espresso – Google’s peering edge architecture (blog.google)
327 points by vgt on April 4, 2017 | 94 comments



"We defined and employed SDN principles to build Jupiter, a datacenter interconnect capable of supporting more than 100,000 servers and 1 Pb/s of total bandwidth to host our services."

This type of scale boggles my mind. Though I've found I can no longer keep up with all the terminology popping up every day; posts like these are my only connection to learning about the massive scaling that makes modern networks work.

"We leverage our large-scale computing infrastructure and signals from the application itself to learn how individual flows are performing, as determined by the end user’s perception of quality." Is this implying they are using Machine Learning to improve their own version of content delivery network?


The Google network is gold plated; it lacks the jitter inherent in the internet at large or inside other competitors' networks. That makes it tempting to ignore some aspects of distributed computing, if only for a moment.


Could you expand a bit more on your comment? I feel I'm missing some context. Specifically, what do you mean by gold plated? Why is it tempting to ignore some aspects of distributed computing? You're implying a lot of context, so could you elaborate?


It's gold plated because they basically built their own ISP by acquiring either:

a: dark fiber IRUs between cities/metro areas

b: N x 10 and 100 Gbps wavelengths as L2 transport services from city to city, from a major carrier such as Level 3 or Zayo

c: some combination of A and B

and they use that to build backbone links between their own network equipment that they have full control over. Google is its own AS and operates its own transport network across the lower 48 US states and around the world.

The exact design of what they're doing within their own AS at layers 1 and 2 is pretty opaque unless you happen to be a carrier partner willing to violate a whole raft of NDAs. But basically they've built their own backbone at a very massive scale, yet without the huge capital expense of actually laying their own fiber between cities.

Their network has incredibly low jitter because they don't run their links to saturation, and they know EXACTLY what the latency is supposed to be from router interface to router interface between the pairs of core routers installed in each major city. Down to five decimal places, most likely. When you have your own dark fiber IRUs and operate your own WDM transport platforms, you are in possession of things like OTDR traces for your dark fiber that tell you the km length of your fiber path down to four decimal places.
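Back-of-the-envelope, that measured fiber length maps directly to an expected propagation delay. A hedged sketch (my own numbers, nothing from Google), assuming a typical group index of about 1.468 for standard single-mode fiber:

    # Back-of-the-envelope sketch: convert an OTDR-measured fiber span length
    # into the expected one-way propagation delay, assuming a group index of
    # ~1.468 for standard single-mode fiber. Numbers are illustrative only.

    C_VACUUM_KM_PER_MS = 299_792.458 / 1000  # km per millisecond
    GROUP_INDEX = 1.468                      # assumed; check your fiber's datasheet

    def one_way_delay_ms(span_km: float) -> float:
        """Expected propagation delay in milliseconds for a fiber span."""
        return span_km / (C_VACUUM_KM_PER_MS / GROUP_INDEX)

    # Example: a ~1,300 km long-haul path
    print(f"{one_way_delay_ms(1300.0):.5f} ms")  # roughly 6.4 ms one way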

It also helps that the sort of people who have 'enable' on the AS15169 routers and core network gear are recruited from the top tier of network engineers and appropriately compensated. If they weren't working for Google they would be working for another major global player like NTT, DT, France Telecom/Orange, SingTel or Softbank.


Where do you get the crazy idea that Google doesn't run its links to saturation? It's a crazy idea because leaving that much capacity idle would cost an enormous amount of money.

The B4 paper states multiple times that Google runs links at almost 100% saturation, versus the standard 30-40%. That's accomplished through the use of SDN technology and, even before that, through strict application of QoS.

https://web.stanford.edu/class/cs244/papers/b4-sigcomm2013.p...

A few more details about strategies here:

https://research.google.com/pubs/archive/45385.pdf

Then there's a whole bunch of other host-side optimizations, including the use of new congestion control algorithms.

http://queue.acm.org/detail.cfm?id=3022184

You might recognize the name of the last author...
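For the curious, Linux exposes that congestion control choice per socket. A minimal sketch, assuming a kernel with tcp_bbr available (4.9+) and Python's socket.TCP_CONGESTION constant (Linux only); it just illustrates the knob the Queue article is about, nothing Google-specific:

    # Minimal sketch (Linux only): opt one TCP socket into the BBR congestion
    # control algorithm discussed in the ACM Queue article linked above.
    # Assumes a kernel with tcp_bbr loaded; falls back gracefully otherwise.
    import socket

    def connect_with_cc(host: str, port: int, algo: str = "bbr") -> socket.socket:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # TCP_CONGESTION selects the per-socket congestion control module.
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, algo.encode())
        except OSError:
            print(f"{algo} not available, using the system default")
        s.connect((host, port))
        return s

    sock = connect_with_cc("example.com", 80)
    # prints the algorithm actually in use, e.g. b'bbr\x00...'
    print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))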


No, it would be crazy for them to run things at saturation under normal circumstances, as that leaves no headroom at all for abnormal circumstances. The opportunity cost of not using something 100% of the time is offset by the value of increased stability/predictability in the face of changing conditions.

Though you do need to define "saturation". Are you referring to bulk bandwidth or some other measure of throughput/goodput? Saturating in terms of raw bandwidth can reduce useful throughput due to latency issues.


What I mean is that they do not run their links to saturation in the same way as an ordinary ISP. And because their traffic patterns are very different from an ordinary ISP's, and much, much more geographically distributed, they can do all sorts of fun software tricks. The end result is the same: low/no jitter and no packet loss.

As contrasted with what would happen if you had a theoretical hosting operation behind 2 x 10 Gbps transit connections to two upstreams, and tried to run both circuits at 8 to 9 Gbps outbound 24x7.


For clarity, do you mean that Google can, for example, run at 99% saturation all the time, whereas a typical ISP might average 30-40%, with peaks to full saturation that cause high latency/packet loss when they occur?


Yes, that's about right. Since they control both sides of the link, they can manage the flow from higher up on the [software] stack. Basically, if the link is getting saturated, the distributed system simply throttles some requests upstream by diverting traffic from places that result in traffic over that link. (And of course this requires a very complex control plane, but doable, and with proper [secondary] controls it probably stays understandable, manageable, and doesn't go haywire when shit hits the fan.)
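A toy sketch of that control loop, with all names and numbers invented, just to make the "divert traffic away from the hot link" idea concrete:

    # Toy sketch of the idea above (all names/numbers invented): a central
    # control loop that, when a link approaches saturation, shifts request
    # weight away from serving sites whose traffic crosses that link.

    THRESHOLD = 0.9   # start shedding at 90% utilization
    STEP = 0.05       # shift 5% of weight per control interval

    # which serving site sends traffic over which backbone link
    site_links = {"site-a": {"link-1"}, "site-b": {"link-2"}}
    weights = {"site-a": 0.5, "site-b": 0.5}   # fraction of user requests per site

    def rebalance(link_utilization: dict[str, float]) -> None:
        for link, util in link_utilization.items():
            if util < THRESHOLD:
                continue
            hot = [s for s, links in site_links.items() if link in links]
            cool = [s for s in weights if s not in hot]
            if not hot or not cool:
                continue
            for s in hot:
                shed = min(STEP, weights[s])
                weights[s] -= shed
                # spread the shed weight across sites that avoid the hot link
                for c in cool:
                    weights[c] += shed / len(cool)

    rebalance({"link-1": 0.95, "link-2": 0.40})
    print(weights)  # site-a loses some weight, site-b picks it up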


So I wonder if that means they can do TCP flow control without dropping packets.


I guess they do drop packets (it's the best - easiest/cheapest/cleanest - way to propagate pressure back, aka backpressure), but they watch for it a lot more vigorously. Also, as I understand it, they try to separate long-lived connections (between DCs) from internal short-lived traffic. Different teams, different patterns, different control structures.


@puzzle: while you're not wrong, do note that B4 is not (and is not designed to be) a low-latency, low-jitter network. It's designed for massive bandwidth for inter-datacenter data transfer.


Running your own internal links to near saturation (such as a theoretical 100 Gbps DWDM or MPLS circuit between two Google datacenters in two different states) is a very different thing from running a BGP edge connection to saturation, such as a theoretical 100 Gbps, short-reach, intra-building crossconnect from a huge CDN such as Limelight to a content-sink ISP such as Charter/TWTC or Comcast.


Very much so. B4 can be run near 100% because of strict admission control and optimized routing to maximize the use of all paths. It's much harder to do that on peering links where the traffic is bursty and you don't have control over the end-to-end latency and jitter. SDN isn't a magic pill for this, but it can most definitely lead to better performance and higher utilization than Ye Olde BGP Traffic Engineering.
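To make "strict admission control plus multipath" concrete, here is a toy allocation sketch. It is not B4's actual algorithm (that's in the SIGCOMM paper); the capacities and flow names are invented:

    # Toy sketch (not B4's actual algorithm): admit flows in priority order
    # and split each flow's demand across the remaining capacity of several
    # paths. Capacities (Gbps) and flow names are invented for illustration.

    def allocate(flows, paths):
        """flows: list of (name, demand_gbps) in priority order.
        paths: dict path_name -> remaining capacity. Returns allocations."""
        allocations = {}
        for name, demand in flows:
            alloc = {}
            for path, free in sorted(paths.items(), key=lambda kv: -kv[1]):
                if demand <= 0:
                    break
                take = min(demand, free)
                if take > 0:
                    alloc[path] = take
                    paths[path] -= take
                    demand -= take
            allocations[name] = (alloc, demand)  # leftover demand is denied
        return allocations

    flows = [("copy-hi-pri", 120.0), ("copy-bulk", 150.0)]
    paths = {"path-a": 100.0, "path-b": 100.0}
    print(allocate(flows, paths))
    # the high-priority copy gets its 120G split across both paths; the bulk
    # copy gets whatever is left, and the remainder is admission-controlled away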


Building distributed systems makes you aware of how unreliable things are at large scale, e.g. the network. The parent comment implies that Google's network is so fast and reliable that it becomes tempting to ignore best practices and work as if it's a non-distributed system.


He's saying that it's so reliable and the throughput is so high that sometimes you have to convince yourself that your computers are halfway across the planet.


I'm not sure they're saying that; they're just claiming Google has really good and well-run networks. But even Google hasn't solved the speed-of-light issue; packets can only travel so fast. If your computer is halfway across the planet, you'll notice no matter how fancy your network is.


I assume they are talking about things like the TrueTime clocks used in Spanner, which are not available on commodity hardware.


Depends on what counts as commodity. You can just buy GPS-slaved rubidium clocks with PTP output.
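And conceptually, what those clocks buy you is a bounded uncertainty interval. A rough sketch of the TrueTime idea as described in the Spanner paper, not Google's actual API:

    # Conceptual sketch of the TrueTime idea from the Spanner paper, not
    # Google's actual API: now() returns an uncertainty interval, and a
    # transaction "commit-waits" until its timestamp is definitely past.
    import time
    from dataclasses import dataclass

    @dataclass
    class TTInterval:
        earliest: float  # seconds since epoch
        latest: float

    def tt_now(epsilon_s: float = 0.007) -> TTInterval:
        """Pretend clock with +/- epsilon uncertainty (7 ms is illustrative)."""
        t = time.time()
        return TTInterval(t - epsilon_s, t + epsilon_s)

    def commit_wait(commit_ts: float) -> None:
        """Block until commit_ts is guaranteed to be in the past on all clocks."""
        while tt_now().earliest <= commit_ts:
            time.sleep(0.001)

    ts = tt_now().latest    # pick a timestamp no clock could consider "future"
    commit_wait(ts)         # typically waits about 2 * epsilon before returning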


"Google has one of the largest peering surfaces in the world, exchanging data with Internet Service Providers (ISPs) at 70 metros and generating more than 25 percent of all Internet traffic. "

Wow.


70 metros?


70 metro areas. In a given metro area there might be one major traffic exchange point that is the de facto most important peering location (example: the SeattleIX, at the Westin Building in Seattle), or, in a larger metro area, multiple exchange points.

Edit: for a list of the geographical (OSI layer 1/2) locations where AS15169/Google peers, see the following: https://www.peeringdb.com/asn/15169
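The same data is also available over PeeringDB's public REST API. A hedged sketch (the endpoint and field names are my best understanding; check the PeeringDB API docs before relying on them):

    # Hedged sketch: list the exchanges where AS15169 has public peering,
    # via PeeringDB's public REST API. Endpoint and field names are my best
    # understanding; verify against the PeeringDB API documentation.
    import json
    import urllib.request

    def google_public_peering_points():
        url = "https://www.peeringdb.com/api/netixlan?asn=15169"
        with urllib.request.urlopen(url) as resp:
            records = json.load(resp)["data"]
        # each record is one port on one exchange; group capacity by exchange
        by_ix = {}
        for r in records:
            by_ix[r["name"]] = by_ix.get(r["name"], 0) + r.get("speed", 0)  # Mbps
        return by_ix

    for ix, mbps in sorted(google_public_peering_points().items()):
        print(f"{ix}: {mbps / 1000:.0f} Gbps of public peering capacity")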


70 metros translate to many more POPs (also for redundancy reasons).

The metros can be seen in e.g. the 1e100.net hosts in a traceroute. They're usually the closest airport code, so e.g. lhr for London.

Somebody in China reverse engineered the metro/POP naming and addressing for latency reasons. You can see that, for example, there are at least three POPs in Sydney:

https://docs.google.com/spreadsheets/d/1a5HI0lkc1TycJdwJnCVD...

https://github.com/lennylxx/ipv6-hosts
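A hedged sketch of pulling that metro code out of a reverse DNS lookup, assuming the commonly observed pattern like lhr25s10-in-f14.1e100.net; the naming is informal and can change at any time:

    # Hedged sketch: extract the apparent metro (IATA-style) code from
    # 1e100.net reverse DNS names seen in a traceroute. The naming pattern
    # (e.g. "lhr25s10-in-f14.1e100.net") is informal and may change.
    import re
    import socket

    METRO_RE = re.compile(r"^([a-z]{3})\d", re.IGNORECASE)

    def metro_of(ip: str) -> str | None:
        try:
            hostname = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return None
        if not hostname.endswith(".1e100.net"):
            return None
        m = METRO_RE.match(hostname)
        return m.group(1).lower() if m else None

    # prints a metro code if the PTR record follows the pattern, else None
    print(metro_of("172.217.16.46"))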


Yes, the practice of using IATA airport codes or similar in reverse DNS long pre-dates Google. They just adopted industry best practice when they started building consistent rDNS.


It's also short-sighted, as the four-character ICAO code system encodes geographical data and encompasses an order of magnitude more airfields.

For example, one can determine that EGLL is in the UK, in Western Europe, without having to consult lookup tables to determine that it is Heathrow.


This is why some ISPs use a format that maps to two-letter country code, then state or subdivision, then POP, where 01 is the first POP in the city, 02 is the second, etc. The ISP's own internal documentation system tells their NOC and neteng staff which POP number is which street address/datacenter (info that does not need to be available in rDNS). In a totally random example, the nyc01 POP for an ISP might be 25 Broadway, nyc02 would be 60 Hudson, nyc03 would be 111 8th, and so on.

example agg router:

agg1.nyc01.ny.us.ASNUMBER.net

and then individual interfaces and subinterfaces would be defined hierarchically under agg1.
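A toy parser for that naming scheme, purely illustrative (the scheme varies by ISP, and AS64496 below is just a documentation ASN standing in for a real one):

    # Toy sketch parsing the naming scheme described above, e.g.
    # "agg1.nyc01.ny.us.AS64496.net" -> role, POP, state, country, network.
    # The scheme itself varies by ISP; this just illustrates the idea.

    def parse_router_name(fqdn: str) -> dict:
        device, pop, state, country, network = fqdn.split(".", 4)
        return {
            "device": device,        # e.g. "agg1": first aggregation router
            "pop": pop,              # e.g. "nyc01": first POP in New York City
            "state": state,          # e.g. "ny"
            "country": country,      # e.g. "us"
            "network": network,      # e.g. "AS64496.net"
        }

    print(parse_router_name("agg1.nyc01.ny.us.AS64496.net"))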


I wonder if they are referring to a peering Internet exchange point (IXP) when they say metro. Basically a building where networks converge and ISPs connect to each other.


yes, though "metro" is a better way to define it since many IXes are geographically distributed throughout their city. For example DE-CIX in frankfurt is in many different datacenters, with their core switches connected by DE-CIX controlled dark fiber. AMS-IX in amsterdam is in many facilities in the same metro area, all the same L2 peering fabric. The SIX in Seattle is in three facilities in the same metro and several local ISPs have built their own extensions of it to Vancouver BC.


A metro in Google-speak generally refers to all of the peering locations in a given city or metro area, not specifically a single IXP.

For example, if you look at the PeeringDB entry for AS15169 (https://www.peeringdb.com/asn/15169), for the London "metro" there's public peering available on LINX at 3 different POPs, and private peering available at Digital Realty, 3 different Equinix POPs, and 2 Telehouse POPs.


The official Android testing framework from Google is also named Espresso. Are we running into a classic hard computer science problem?


Off-by-one errors?


I'm sure he means something to do with caches; I had it on the tip of my tongue a moment ago, but the doorbell rang.


Surely this is what he meant. Since edge peering/CDN is fundamentally caching (last I checked). It can't be off-by-one, and there are only 2 hard problems in CS, so there you have it.


"There are only two hard problems in CS : 1. Naming things. 2. Cache invalidation. "


"There are only two hard problems in CS : 1. Naming things. 2. Cache invalidation. 3. Off-by-one errors"


Only you got the order wrong. One of those distributed systems problems.


I thought he meant IE support.


He means naming things


I'm pretty sure a lack of coffee is the real issue at hand. Google made more architecture to give peers an edge via two different Espresso shots to test.


Not if they developed it using the Espresso CSS editor, which just had v3 released on March 30th.


There are two whole things at Google called "Espresso"?

Oh no. I bet this reuse of a name has gone unnoticed internally until now.


> The official Android testing framework from Google is also named Espresso.

This just shows that Android is treated as the ugly stepchild even within Google (not that this isn't already obvious given the state of the Android API).


The essence of what Espresso is begins towards the end of the post:

Espresso delivers two key pieces of innovation. First, it allows us to dynamically choose from where to serve individual users based on measurements of how end-to-end network connections are performing in real time.

Second, we separate the logic and control of traffic management from the confines of individual router “boxes.”


I found this article confusing. It doesn't really say anything about what "Espresso" actually does, let alone how it does it.


For some reason they didn't actually link to the talk, which I haven't watched, but which presumably starts to answer those questions.

A quick search turns up the 2015 keynote that Amin gave; I haven't found the 2017 one yet...

[1] - 2015 ONS Keynote https://www.youtube.com/watch?v=FaAZAII2x0w


And here's 2014, which was on the Andromeda NFV stack: https://www.youtube.com/watch?v=n4gOZrUwWmc

And B4: https://www.youtube.com/watch?v=tVNlXg0iN-g


I think with platforms like this it is now safe to say that the systems and services Google is deploying are no longer in the same category as classical networked systems. This is as foreign a concept to traditional networking and the seven-layer OSI model as non-von Neumann computing is to von Neumann computing.


> This is as foreign a concept to traditional networking and the seven-layer OSI model as non-von Neumann computing is to von Neumann computing

Not really. The OSI model doesn't say anything about where I run my routing algorithm and BGP application vs. where my actual switches are.

"Classical" networking is an artifact of viewing routers/switches as monolithic blocks that embed all of their functionality in one black box. I said BGP application above because that's what it is, an application for distributing/communicating state. The same can be said for many other parts of networking traditionally embedded in the monolithic blob we often call a router.

Label switched fabrics provide inherent NFV, security functions, and allow you to influence paths (i.e. traffic engineer) from applications that are equipped to make decisions based on your priorities, not some rigid vendor implementation.

You will see more of this.
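A toy sketch of that separation, with everything invented for illustration: the routing logic is an ordinary program that pushes forwarding entries down to otherwise dumb switches.

    # Toy sketch of "the control plane is just an application": a central
    # controller computes next hops and pushes forwarding entries down to
    # simple switches. Names and structures are invented for illustration.

    class Switch:
        def __init__(self, name: str):
            self.name = name
            self.fib: dict[str, str] = {}   # prefix -> out_port

        def install(self, prefix: str, out_port: str) -> None:
            self.fib[prefix] = out_port     # in reality: OpenFlow/gRPC/etc.

    class Controller:
        """Centralized routing logic, physically separate from the switches."""
        def __init__(self, switches: list[Switch]):
            self.switches = switches

        def program(self, routes: dict[str, dict[str, str]]) -> None:
            # routes: switch name -> {prefix: out_port}, computed from global state
            for sw in self.switches:
                for prefix, port in routes.get(sw.name, {}).items():
                    sw.install(prefix, port)

    s1, s2 = Switch("s1"), Switch("s2")
    Controller([s1, s2]).program({"s1": {"10.0.0.0/8": "port2"},
                                  "s2": {"10.0.0.0/8": "port1"}})
    print(s1.fib, s2.fib)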


I can't speak to the edge stuff they are doing, but the routing closer to the data centers was unique to the way they do things, in a way that wouldn't be generally usable.

Not every application can put enough context in the request to make it work the way Google does. It's a sort of app-request-context-based routing.

Also, they have the advantage that a lot of their apps are like search, in that no consistency is needed. Five consecutive searches for "some query" can return different results each time, with no adverse effects.

That creates a lot of flexibility in routing requests to destinations.


As someone who works with SDN daily I couldn't agree more with you.

Espresso is making decisions using application-layer information, but the underlying layers stay the same.


> The OSI model doesn't say anything about where I run my routing algorithm and BGP application vs. where my actual switches are.

If you're $BIGASN and you set up an intra building singlemode crossconnect at $BIGCITY to establish settlement free peering (let's say for example a 4 x 10 Gbps bonded 802.3ad circuit) with $OTHERBIGASN, they most assuredly are going to notice if your BGP session and router is not directly on the other end of that cable.

Because they are going to be expecting sub-1ms latency to your router, and not "we're taking this session and stuffing it in some sort of tunnel or encapsulation and sending it somewhere else, to where the thing that actually speaks BGP is located". It's bad juju to practice deceptive peering.


> Because they are going to be expecting sub-1ms latency to your router, and not "we're taking this session and stuffing it in some sort of tunnel or encapsulation and sending it somewhere else, to where the thing that actually speaks BGP is located".

Why should they care?

> It's bad juju to practice deceptive peering.

I don't understand applying moral judgment to a technical design choice.


I'm guessing you do not handle peering for a medium-to-large-sized AS, so it's really hard to explain. First: they should care because the point of establishing peering in a given city is to give inter-AS traffic the absolute shortest path and smallest number of hops between two points. If I put a router in Portland, OR, buy a 10 Gbps MPLS tunnel to Vancouver, BC, join the VANIX, and ask to set up sessions with peers there, all that traffic will be taking a multi-hundred-km round trip to Portland.

Two: it's not a moral judgment; it's a technical best practice to actually put routers in the city in which you set up new edge BGP sessions. Pretty basic ISP stuff, in fact.


Just because they tunnel the BGP back to some software system to make routing decisions doesn't mean they tunnel all the user data to that same location. To do so would be silly.

In fact, a good system would have a couple of systems handling BGP, with physical location fairly irrelevant, but acting as if they are local to the peer they are talking to.


> This is as foreign a concept from traditional networking and the seven layer OSI model

cough cough what? One of the major challenges for a CDN is predicated upon OSI layers 1 and 2: You need to establish POPs with routers and caching servers geographically distributed near major IX points (L2 peering fabrics, and crucial buildings that host the same IX points, where you can run intra-building fiber crossconnects for network-to-network interfaces to provide settlement free peering to major ISPs). The internet is physically built out of a great deal of equipment at layer 1.

In the case of Google, you need to have a team of people who care about things like cost-effectively building intra-datacenter 100GbE layer 2 connections between Google and large content-sink (eyeball) ISPs such as Charter/TWTC or similar.

Hand waving around and saying "we've built some new software to improve how we efficiently deliver BGP sessions to edge peers" is cool and all, but don't mistake it for some radical change. It is all still built on top of things like 2 megawatt diesel generators, massive battery plants, DWDM line terminals, dark fiber, etc.


In other words, it's built on hundreds of millions to billions of dollars of typical, boring stuff on the bottom that allows all the cool stuff to work as well as it does.


It's easily billions, not hundreds of millions. I work for a much smaller outfit and we (handwave) push the smaller number already.


> that allows all the cool stuff to work as well as it does.

that allows all the advertising to draw in ad spend as well as it does.


Yep, the CEO of Fastly has some cool presentations on YouTube about their SDN configuration, which does away with the traditional router/switch combination in favor of a switch/SDN-server combo that provides a much smarter network at a fraction of the cost. The only caveat is that your engineers need to understand and tune the SDN software stack from the network card up through the kernel.


Even though this reply is 6 days late, I finally had time to watch the mentioned talk, and it's interesting and still very relevant (it's from 2015):

https://www.youtube.com/watch?v=TLbzvbfWmfY (the interesting stuff starts around 7 minutes in)


These presentations Google gives at these conferences are pretty irritating. If you're familiar with the SDN field (as most ONS attendees would be), this presentation is essentially nothing but bragging about the scale at which they operate.

There is no useful information in here to advance the state of the art, no new ideas, no publicly available implementations (closed or open source). It's just a very high-level architectural view of a large network given by people who are incentivized to present it in the most favorable light. And due to the lack of any concrete details, it's free from critical analysis.

>Espresso delivers two key pieces of innovation. First, it allows us to dynamically choose from where to serve individual users based on measurements of how end-to-end network connections are performing in real time. Second, we separate the logic and control of traffic management from the confines of individual router “boxes.”

The first has been done before at many levels of the network:

* BGP anycast

* DNS responses based on querier

* Decisions made in the load balancer

* IGP protocols to handle traffic internally while taking into account link congestion

I assume their framework gives them much nicer primitives to work with than the above, which would be an advancement in the field if we could actually see an API or something.
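For what it's worth, the rough shape such a primitive might take is pretty simple; this is purely illustrative, not Google's API:

    # Illustrative sketch of the kind of primitive being discussed: choose a
    # serving site per client prefix from recent end-to-end measurements
    # rather than a static geo/DNS mapping. Data and names are invented.
    from statistics import median

    # recent RTT samples (ms) per (client_prefix, serving_site), e.g. fed by
    # the application-level signals the blog post describes
    samples = {
        ("203.0.113.0/24", "metro-lhr"): [22, 25, 24, 90],   # congested spike
        ("203.0.113.0/24", "metro-ams"): [31, 30, 32, 29],
    }

    def best_site(client_prefix: str) -> str:
        candidates = {site: median(rtts)
                      for (prefix, site), rtts in samples.items()
                      if prefix == client_prefix}
        return min(candidates, key=candidates.get)

    print(best_site("203.0.113.0/24"))  # "metro-lhr" (median 24.5 ms) for now,
                                        # but flips if its tail keeps degrading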

The second is very far from "innovation". This is the essence of SDN and this has been the hottest thing since sliced bread in the networking world since 2008 at a minimum [1] and even earlier if you look at things like the Aruba wireless controller.

1. http://archive.openflow.org/documents/openflow-wp-latest.pdf


I don't disagree, but they have published a lot of papers about their networking research and implementation of things like B4 and their SDN work [1]. Hopefully there's a paper on Espresso forthcoming, although the absence (as far as I can tell) of a paper on Andromeda alone means it might not be.

[1] https://research.google.com/pubs/Networking.html


I very much agree. I was hoping Espresso would be a framework allowing GCP user applications to leverage Google's SDN, rather than just a way for Google to offer its own services using this technology. I hope that's the next step.

For example, it would be cool if it were possible to move shared client/server secret checking (e.g. for an HTTP API) out to the edge of Google's network, such that a DDoS attack with invalid packets (secrets) never even reaches the application VM/cluster. DDoS attacks that force applications offline (by making the app scale up to an unsustainable cost level) could be prevented this way.
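A sketch of what that edge check might look like, to be clear about what's being imagined rather than describing an existing GCP feature:

    # Sketch of what the parent comment imagines, not an existing GCP feature:
    # verify a shared-secret HMAC over each request at the edge and drop
    # anything that doesn't match before it ever reaches the application.
    import hmac
    import hashlib

    SHARED_SECRET = b"example-secret"   # provisioned to edge nodes and clients

    def sign(body: bytes) -> str:
        return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

    def edge_filter(body: bytes, signature_header: str) -> bool:
        """Return True if the request should be forwarded to the backend."""
        expected = sign(body)
        return hmac.compare_digest(expected, signature_header)

    good = edge_filter(b'{"q": "hello"}', sign(b'{"q": "hello"}'))
    bad = edge_filter(b"junk flood traffic", "0" * 64)
    print(good, bad)   # True, False; the flood never triggers backend autoscaling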


You can do that by using Google Cloud's HTTPS load balancer.


Yes. The term "innovation" applied to the two ideas bugged me too.

Could be a 'smart people problem' -- easier to 'innovate' a new wheel than to visualize how existing tools can be remixed to solve the problem.

For an example of remixing, it was apparent 15 years ago that traditional routing was actively harmful for live and VOD video streaming, so we cobbled a mix of the techniques you listed plus a couple more[1] to connect users to edges in real time based on actual real-time end-to-end conditions.

We did it consciously using these two ideas, plus content aware caching, plus one more "magic" but super trivial idea I still haven't seen elsewhere in the wild.

The low-tech rethink worked well and handily outperformed proprietary solutions per the industry perf and SLO metrics companies' measurements, and it's been an ongoing surprise to me that it's taken so long for others to rethink or remix things the same way.

We filed no patents; as to your point, all of this could be argued to be evident to anyone 'skilled in the art'.

---

1. A couple custom bits: we also had to write an edge server shim as no media server at the time could handle sessions flapping in real time, not to mention the content awareness crucial for giant media files.


I don't know much about SDN, but Google did seem to imply that DNS responses based on the querier are not a very good solution.

> Rather than pick a static point to connect users simply based on their IP address (or worse, the IP address of their DNS resolver), we dynamically choose the best point and rebalance our traffic based on actual performance data. Similarly, we are able to react in real-time to failures and congestion both within our network and in the public Internet.


Yup, this is the kind of thing you get when you put $30B into infrastructure.

https://youtu.be/j_K1YoMHpbk?t=7472


Can someone with more expertise summarize how this differs from commercial SDN solutions like: Cisco ACI, Juniper Contrail etc.?


Unfortunately, no - at least not without quite a few more details. As this stands, it could be the high-level marketing overview for pretty much any SDN solution available, commercial or open.


Thanks!

As someone who recently entered this field professionally, I find it amusing that most SDN solutions out there are just permutations of each other, differentiated mostly by marketing buzzwords.

Not too different from "cloud computing" from a few years ago.


Yeah, once you get past the hype though, SDNs can be great.


This vision seems very similar to the 2011 talk by Scott Shenker: https://www.youtube.com/watch?v=YHeyuD89n1Y


This is a good talk about decomposability of the control plane and creating proper abstractions for it.

PS. "The ability to master complexity is not the same as the ability to extract simplicity" is a good takeaway. PPS. This is a part of EE 122: https://inst.eecs.berkeley.edu/~ee122/fa12/class.html PPPS. PDF for SDN lecture: https://inst.eecs.berkeley.edu/~ee122/fa12/notes/25-SDN.pdf


Distracting aside: It's amusing that languages use so many references from the coffee industry. I wonder how long it will take to fill a Starbucks menu.


One of the biggest takeaways from this is that they can have multiple machines for the same IP address. That is just awesome, and it also explains how they have probably managed to scale up services like 8.8.8.8 without needing load balancers.


Anycast is pretty standard.


Especially so for DNS servers, and since long before the Google nameservers [0].

[0] https://tools.ietf.org/html/rfc3258


What does "peering edge" mean? A google search only brings up this article.


A network is often described with an edge and a core, and there can be several types of "edges".

For a company like Google, you would most likely have an edge towards your servers as well as an edge towards your peering partners. The peering edge is therefore the part of the network that is used to connect to BGP peering partners.


The part of their network architecture that interconnects with external networks.


This is pretty impressive.


[deleted]


Everything you have ever done or ever will do infringes dozens of BS patents. Get back to work.


Even if it is under patent protection, Google is unlikely to sue you, unless you sue them first.


To my knowledge, Google has started exactly 2 lawsuits in its history.


Talk to a lawyer asap.


It's Google. If you're going to call a lawyer he better be north of $750/hr.


Two Google products named Espresso? That won't be confusing at all.


Yeah, it was annoying thinking this article was relevant to my everyday life as an Android developer until I looked further. Is it so tough for Google to Google a name before they pick it? Geez.


I suspect both are written in Java, and that namespace is getting crowded.


Sure. "Two"


Found the employee...


tl;dr?


Wow. Isn't this a trojan horse? People start to use it because of its convenience, and then it will spread and spread and spread. I mean, what happens when Google runs more or less everything?


Damn, for a moment there I was hoping that Google made some sort of really cool espresso machine. Perhaps with Alexa built-in.



