This mentions Jupiter generations; Jupiter itself is, I think, about 10-15 years old at this point. It doesn't really talk about what existed before, so it's not really 25 years of history here. I want to say "Watchtower" came before Jupiter? But honestly it's been about a decade since I read anything about it.
Google's DC networking is interesting because of how deeply integrated it is into the entire software stack. Click on some of the links and you'll see it mentions SDN (software-defined networking). This is so Borg instances can talk to each other within the same service at high throughput and low latency. 8-10 years ago these were (IIRC) 40Gbps connections. It's probably 100Gbps now, but that's just a guess.
But the networking is also integrated into global services like traffic management to handle, say, DDoS attacks.
Anyway, from reading this it doesn't sound like Google is abandoning their custom TPU silicon (i.e. it talks about the upcoming A3 Ultra and Trillium). So where does Nvidia ConnectX fit in? AFAICT that's just the NIC they're plugging into Jupiter. That's probably what enables (or will enable) 100Gbps connections between servers. Yes, 100GbE optical NICs have existed for a long time. I would assume that Nvidia produces better ones in terms of price, performance, size, power usage and/or heat produced.
Disclaimer: Xoogler. I didn't work in networking though.
The past few years there has been a weird situation where Google and AWS have had worse GPU offerings than smaller providers like CoreWeave and Lambda Labs. This is because they didn't want to buy into Nvidia's proprietary InfiniBand stack for GPU-GPU networking, and instead wanted to make it work on top of their Ethernet-based (but still pretty proprietary) stack.
The outcome was really bad GPU-GPU latency and bandwidth between machines. My understanding is that ConnectX is Nvidia's supported (and probably still very profitable) way for these hyperscalers to use their proprietary networks without buying InfiniBand switches and without paying the latency cost of moving bytes from the GPU to the CPU.
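Very roughly, the application-level picture looks like the sketch below, assuming a PyTorch + NCCL setup (the env var values, script name, and launch line are illustrative assumptions, not anything specific to Google's or Nvidia's deployments). NCCL picks an RDMA transport (IB or RoCE) when the NIC supports one, so the collective moves data GPU-to-GPU without bouncing through host memory:

    import os
    import torch
    import torch.distributed as dist

    # Launched with torchrun on each node, e.g.:
    #   torchrun --nproc_per_node=8 allreduce_sketch.py   (plus the usual multi-node rendezvous flags)
    # NCCL_IB_HCA / NCCL_SOCKET_IFNAME are real NCCL knobs; the values here are made-up examples.
    os.environ.setdefault("NCCL_IB_HCA", "mlx5")         # assumption: select the ConnectX/RDMA devices
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # assumption: interface for NCCL's bootstrap traffic

    dist.init_process_group(backend="nccl")              # rank/world size come from torchrun's environment
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    x = torch.ones(1 << 20, device="cuda")               # 1M floats resident on this GPU
    dist.all_reduce(x, op=dist.ReduceOp.SUM)              # GPU-to-GPU over the RDMA fabric when available
    print(f"rank {dist.get_rank()}: element value = {x[0].item()}")
    dist.destroy_process_group()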
Your understanding is correct. Part of the other issue is that at one point there was a huge shortage of IB switches... lead times of 1+ years... another solution had to be found.
RoCE is essentially IB (the InfiniBand verbs/transport) over Ethernet. All the underlying documentation and settings to put this stuff together are the same. It doesn't require ConnectX NICs, though. We do the same with 8x Broadcom Thor 2 NICs (into a Broadcom Tomahawk 5 based Dell Z9864F switch) for our own 400G cluster.
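A small illustration of that point, assuming a Linux host with the kernel RDMA subsystem loaded: every verbs-capable NIC, whether ConnectX or Broadcom Thor, registers under the same sysfs tree, and the port's link_layer tells you whether it's RoCE or native IB:

    import os

    # The kernel RDMA subsystem exposes every verbs-capable device here,
    # regardless of vendor (ConnectX, Broadcom Thor, etc.).
    RDMA_SYSFS = "/sys/class/infiniband"

    if not os.path.isdir(RDMA_SYSFS):
        raise SystemExit("no RDMA-capable NICs registered with the kernel")

    for dev in sorted(os.listdir(RDMA_SYSFS)):
        ports_dir = os.path.join(RDMA_SYSFS, dev, "ports")
        for port in sorted(os.listdir(ports_dir)):
            # link_layer reads "Ethernet" for RoCE ports, "InfiniBand" for native IB ports
            with open(os.path.join(ports_dir, port, "link_layer")) as f:
                link_layer = f.read().strip()
            with open(os.path.join(ports_dir, port, "rate")) as f:
                rate = f.read().strip()
            print(f"{dev} port {port}: {link_layer}, {rate}")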
Nvidia got ConnectX from their Mellanox acquisition -- they were experts in RDMA, particularly with InfiniBand but eventually pushing Ethernet (RoCE). These NICs have hardware acceleration of RDMA. Over the RDMA fabric, GPUs can communicate with each other without much CPU usage (the "GPU-to-GPU" mentioned in the article).
[I know nothing about Jupiter, and little about RDMA in practice, but I used ConnectX for VMA, Mellanox's hardware-accelerated, kernel-bypass sockets tech.]
I would guess the Nvidia ConnectX is part of a secondary networking plane, not plugged into Jupiter. Current-gen Google NICs are custom hardware with a _lot_ of Google-specific functionality, such as running the borglet on the NIC to free up all CPU cores for guests.
It seems all cutting-edge datacenters like xAI's Colossus are using Nvidia networking. Now Google is upgrading to Nvidia networking, too.
Since Nvidia owns most of the GPGPU market and has top-notch networking and interconnects, I wonder whether they have a plan to own all datacenter hardware in the future. Maybe they plan to also release CPUs, motherboards, storage and whatever else is needed.
I read this slightly differently, that specific machine types with Nvidia GPU hardware also have Nvidia networking for tying together those GPUs.
Google has its own TPUs and doesn't really use GPUs except to sell them to end customers on cloud, I think. So using Nvidia networking for Nvidia GPUs across many machines on cloud is really just a reflection of what external customers want to buy.
Disclaimer: I work at Google but have no non-public info about this.
Only within supercomputers (including the smaller GPU ones used to train AI). Normal data centers use Cisco or Juniper or similarly well-known Ethernet equipment, and they still do. The Mellanox/Nvidia InfiniBand networks are specifically used for supercomputer-like clusters.
You seem to have a narrow definition of “normal” for datacenters. Meta were using OCP mellanox NICs for common hardware platforms a decade ago and still are.
Yeah, there's a bit of industry worry about that very eventuality, hence the Ultra Ethernet Consortium trying to work on open alternatives to the Mellanox/Nvidia lock-in.
Cisco have sat on the steering committees for a lot of things where they had a proprietary initial version of something. It's not that unusual, and also, it's often frankly not actually that open; e.g., see the rent seeking racket for access to PCI documentation, or USB-IF actively seeking to prevent open source hardware from existing, etc.
Eh, the UEC effort is a standards org through the Linux Foundation, so it won't be subject to any of the usual chicanery. And actually, it looks like Nvidia is just a general member and not one of the Steering Committee members.
I have to wonder if Nvidia has reached a point where it hesitates to develop new products because it would hurt their margins. Sure, they could probably release a profitable networking product, but if they did, their net margins would decrease even as profit increased. This may actually hurt their market cap, as investors absolutely love high margins.
They can always release capital back to investors, and then those investors can put the money into different companies that eg produce networking equipment.
I was working under HDThoreaun's assumption that the margins would be lower.
If they have other opportunities for investment with higher margins, they should seize those, of course. And perhaps even call up investors for more capital, if required.
When the agents employed by investors would be harmed by returning capital (which is guaranteed, since so many people's compensation is in the form of stock and returning capital leads to a decreasing stock price), why would those agents ever return the capital voluntarily?
They managed to double from 6 Pbps in 2022 to 13 Pbps in 2023. I assume with ConnectX-8 this could be 26 Pbps in 2025/26. The ConnectX-8 is PCIe 6.0, so I assume we could get a 1.6Tbps ConnectX-9 with PCIe 7.0, which is not far away.
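Back-of-the-envelope check on the PCIe side (a sketch; raw per-direction x16 link rates, ignoring encoding and protocol overhead, and assuming ConnectX-8 is an 800G part, which is my understanding rather than something stated above):

    # Raw PCIe transfer rate per lane in GT/s, by generation (1 GT/s ~ 1 Gbit/s of raw signalling).
    PCIE_GT_PER_LANE = {4: 16, 5: 32, 6: 64, 7: 128}

    def x16_gbps(gen: int) -> int:
        # One direction of a x16 slot, ignoring 128b/130b or FLIT encoding overhead.
        return PCIE_GT_PER_LANE[gen] * 16

    for gen, nic_gbps in [(6, 800), (7, 1600)]:   # 800G ConnectX-8 today, 1.6T speculated above
        slot = x16_gbps(gen)
        print(f"PCIe {gen}.0 x16 ~ {slot} Gb/s raw; a {nic_gbps}G NIC needs ~{nic_gbps / slot:.0%} of that")

So an 800G NIC roughly saturates a PCIe 6.0 x16 slot, and a 1.6T part only makes sense once PCIe 7.0 x16 is available, which matches the speculation above.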
Can't wait to see the FreeBSD Netflix version of that post.
This also goes back to how increasing throughput is relatively easy and has a very strong roadmap, while increasing storage is difficult. I notice YouTube has been serving higher-bitrate H.264 video in recent years, instead of storing yet another copy of video files in VP9 or AV1 unless they are 2K+.
For TPU pods they use a 3D torus topology with multi-terabit cross connects. For GPUs, A3 Ultra instances offer "non-blocking 3.2 Tbps per server of GPU-to-GPU traffic over RoCE".
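To put the A3 Ultra figure in per-GPU terms (rough arithmetic; the 8 GPUs per server is an assumption on my part, just the usual layout for this class of machine):

    server_gpu_to_gpu_tbps = 3.2   # advertised non-blocking RoCE bandwidth per A3 Ultra server
    gpus_per_server = 8            # assumption: typical 8-GPU server layout

    per_gpu_gbps = server_gpu_to_gpu_tbps * 1000 / gpus_per_server
    print(f"~{per_gpu_gbps:.0f} Gb/s per GPU")   # ~400 Gb/s, i.e. roughly one 400G NIC port per GPU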
Is that the worst for training? Namely: do superior solutions exist?
Awesome Google... Now learn what an availability zone is and stop creating them with firewalls across the same data center.
Oh, and make your data centers smaller. Not so big they can be seen on Google Maps. Because otherwise you will be unable to move those whale-sized workloads to an alternative.
To address the availability point of your comment, Google's terminology is slightly different from AWS's.
On GCP it sounds like you want to have a multi region architecture, not multi-zone (if you want firewalls outside the same data center).
> Resources that live in a zone, such as virtual machine instances or zonal persistent disks, are referred to as zonal resources. Other resources, like static external IP addresses, are regional. Regional resources can be used by any resource in that region, regardless of zone, while zonal resources can only be used by other resources in the same zone.
Making a datacenter not visible on Google Maps, at least in most big cities where Google zones are deployed, would mean making it smaller than a car. Or even smaller than a dishwasher.
If I check London (where europe-west2 is kinda located) on Google Maps right now, I can easily discern manhole covers or people. If I check Jakarta (asia-southeast2), things smaller than a car get confusing, but you can definitely see them.
Your comment does not address the essence of the point I was trying to make. If you have one monstrous data center instead of many smaller ones, you are, relatively speaking, putting too many eggs in one giant basket.
The scale of cloud data centres reflects the scale of their customer base, not the size of the basket for each individual customer.
Larger data centres actually improve availability through several mechanisms: more power components such as generators mean the failure of any one costs only a few percent of capacity instead of causing a total blackout. You can also partition core infrastructure like routers and power rails into more fault domains and update domains.
Some large clouds have two update domains and five fault domains on top of three zones that are more than 10 km apart. You can't beat ~30 individual partitions with your own data centres at a reasonable cost!
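(That ~30 is just the product of the three layers; a trivial enumeration with made-up labels:)

    from itertools import product

    zones = ["zone-a", "zone-b", "zone-c"]           # 3 zones, more than 10 km apart
    update_domains = ["ud-0", "ud-1"]                # 2 update domains
    fault_domains = [f"fd-{i}" for i in range(5)]    # 5 fault domains

    partitions = list(product(zones, update_domains, fault_domains))
    print(len(partitions))                           # 3 * 2 * 5 = 30 independent partitions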
I provided three different references. Despite the massive downvotes on my comment, I guess by Google engineers treating it as a troll... :-) I take comfort in the fact that nobody was able to offer a reference to prove me wrong.
It is true that the nomenclature "AWS Availability Zone" has a different meaning than "GCP Zone" when discussing the physical separation between zones within the same region.
It's unclear why this is inherently a bad thing, as long as the same overall level of reliability is achieved.
The phrase "as long as the same overall level of reliability is achieved" is logically flawed when discussing physically co-located vs. geographically separated infrastructure.
In my experience, the set of issues that would affect two buildings close to each other, but not two buildings a mile apart, is vanishingly small: usually just last-mile fiber cuts or power issues (which are rare and mitigated by having multiple independent providers), as well as issues like building fires (which are exceedingly rare; we know of perhaps two of notable impact in more than a decade across the big three cloud providers).
Everything else is done at the zone level no matter what (onsite repair work, rollouts, upgrades, control plane changes, etc.) or can impact an entire region (non-last mile fiber or power cuts, inclement weather, regional power starvation, etc.)
There is a potential gain from physical zone isolation, but it protects against a relatively small set of issues. Is it really better to invest in that, or to invest the resources in other safety improvements?
I think you're underestimating the seriousness of a physical event like a fire. Even if the likelihood of these things is "vanishingly small", the impact is so large that it more than offsets that. Taking the OVH data center fire as an example, multiple companies completely lost their data and are effectively dead now. When you're talking about a company-ending event, many people would consider even just two examples per decade a completely unacceptable failure rate. And it's more than just fires: we're also talking about tornadoes, floods, hurricanes, terrorist attacks, etc.
Google even recognizes this, and suggests that for disaster recovery planning, you should use multiple regions. AWS on the other hand does acknowledge some use cases for multiple regions (mostly performance or data sovereignty), but maintains the stance that if your only concern is DR, then a single region should be enough for the vast majority of workloads.
There's more to the story though, of course. GCP makes it easier to use multiple regions, including things like dual-region storage buckets, or just making more regions available for use. For example GCP has ~3 times as many regions in the US as AWS does (although each region is comparatively smaller). I'm not sure if there's consensus on which is the "right" way to do it. They both have pros and cons.
One of the vanishingly small set of issues I mentioned.
It is true, and obvious, that GCP and AWS and Azure use different architectures. It does not obviously follow that any of those architectures are inherently more reliable. And even if it did, it doesn't obviously follow that any of the platforms are inherently more reliable due to a specific architectural decision.
Like, all cloud providers still have regional outages.
That concept is useful when the scale of things you have is the same order of magnitude as the rate of failure. But we clearly don't have that here, because even at scale, these events aren't common. Like I said, there have been, across all cloud providers, fewer than a handful over a decade.
Like, you seem to be proclaiming that these kinds of events are common and, well, no, they aren't. That's why they make the top of HN when they do happen.
This isn't even close to true. You can just go on Google Maps and visually see the literally *hundreds* of wholly-owned and custom-built data centers from AWS, MS, and Google. Edge locations (like Cloud CDN) are often in colos, but the main regions' compute/storage are not. Most of them are even labeled on Google Maps.
Here's a couple search terms you can just type into Google Maps and see a small fraction of what I mean:
- "Google Data Center Berkeley County"
- "Microsoft Data Center Boydton"
- "GXO council bluffs" (two locations will appear, both are GCP data centers)
- "Google Data Center - Henderson"
- "Microsoft - DB5 Datacentre" (this one is in Dublin, and is huuuuuge)
- "Meta Datacenter Clonee"
- "Google Data Center (New Albany)" (just to the east of this one is a massive Meta data center campus, and to the immediate east of it is a Microsoft data center campus under construction)
And that's just a small sample. There are hundreds of these sites across the US. You're somewhat right that a lot of international locations are colocated in places like Equinix data centers, but even then it's not all of them and it varies by country (for example in Dublin they mostly all have their own buildings, not colo). If you know where to look and what the buildings look like, the custom-built and self-owned data centers from the big cloud providers are easy to spot, since they all have their own custom design.
While the OP is more wrong than right, they aren't completely incorrect.
I'm in Australia.
GCP has 2 regions in Australia, Sydney and Melbourne. The Sydney region is in an Equinix DC. Not sure where the Melbourne one is, but it isn't a Google-owned facility.
Yea, you're not the only "insider" here. And you're 100% wrong. Just because you completely misunderstand what those Amazon/MS employees are doing in those buildings doesn't mean that you know what you're talking about.
The big cloud players have the vast majority of their compute and storage hosted out of their own custom built and self-owned data centers. The stuff you see in colos is just the edge locations like Cloudfront and Cloud CDN, or the new-ish offerings like AWS Local Zones (which are a mix between self-owned and colo, depending on how large the local zone is).
Most of this is publicly available by just reading sites like datacenterdynamics.com regularly, btw. No insider knowledge needed.
The Cloud locations aren't just edge locations (scroll down on that page and note most have all APIs supported) and there are a lot more of them than there are Google-owned DCs.
Well those people lied to you then, or more likely there was a misunderstanding, because you can literally just look up the sites I mentioned above and see that you're entirely incorrect.
You don't need to be under NDA to see the hundreds of billions of dollars worth of custom built and self-owned data centers that the big players have.
I am one of those "pay grades many layers higher", and I can personally confirm that each of the locations above is wholly owned and used by Google, and only Google, which already invalidates your claim that "you can count the wholly-owned sites on one hand". Again, this isn't secret info, so I have no issue sharing it.
I'm not trying to make you divulge anything. I don't particularly care who you talk to, or who you are, nor do I care if you take it as a "personal insult" that you might be wrong.
You are right that it would be nuts that multiple senior people would collude to lie to you, which is why it's almost certainly more likely that you are just misunderstanding the information that was provided to you. It's possible to prove that you are incorrect based on publicly available data from multiple different sources. You can keep being stubborn if you want, but that won't make any of your statements correct.
You didn't ask for my advice, but I'll give it anyway: try to be more open to the possibility that you're wrong, especially when evidence that you're wrong is right in front of you. End of story.
You are correct that many facilities are owned by the hyperscalers, and they also extensively use colos for hosting entire regions (not only PoPs), especially outside the US. More recently I'd also include Ireland.
I have worked at two cloud providers very close to the netops teams due to my customers, but I have signed NDAs so I won't go further into it, especially since one of my ex-employers is very touchy about this subject.
It can be true that all the big clouds/cdns/websites are in all the big colos and that big tech also has many owned and operated sites elsewhere.
As one of these big companies, you've got to be in the big colos because that's where you interconnect and peer. You don't want to have a full datacenter installation at one of these places if you can avoid it, because costs are high; but building your own has a long timetable, so it makes sense to put things into colos from time to time, and of course, things get entrenched.
I've seen datacenter lists when I worked at Yahoo and Facebook, and it was a mix of small installations at PoPs, larger installations at commercial colo facilities, and owned-and-operated data centers. Usually new large installations were owned and operated, but it took a long time to move out of commercial colos too. And then there are also whole-building leases, from companies that specialize in that. Outside the US, there was more likelihood of being in commercial colo, I think because of logistics, but at large system counts the dollar efficiency of running it yourself becomes more appealing (assuming land, electricity, and fiber are available).
It is true that every cloud provider uses some edge/colo infra, but it is also not true that most (or even really any relevant) processing happens in those colo/edge locations.
And limiting to just outside the US, both aws and Google have more than ten wholly owned campuses each, and then on top of that, there is edge/colo space.