AMD EPYC Rome 2P Will Have 128-160 PCIe Gen4 Lanes and a Bonus (servethehome.com)
164 points by vanburen on April 5, 2019 | 65 comments



Can someone enlighten me with some information about applications requiring the full 16x PCIe Gen4 bandwidth per slot (or 32x Gen3, for that matter)? I can imagine some HPC GPU solutions, but other than that, what requires the throughput?

There is obviously a market for this as both giants are building platforms.


It is not just about throughput, it is about the flexibility of how you can use the lanes.

Working with desktop computers, I've often been frustrated by the limited PCIe lanes on Intel processors, which are exacerbated by the limitations of the motherboards. Frequently I find that "I can't get from here to there" when I am adding cards because the machine has an i5 processor, or that particular motherboard, or... Often I can turn off some features on the motherboard I don't need to free up some lanes, or shuffle the cards around, but it reminds me of the bad old days when I had to set jumpers for the interrupts on all the cards.

There are numerous holes in the hardware industry that have been caused by Intel's stinginess on I/O. You don't feel the pain on laptops or on the tablets and cell phones that Intel had vainglorious ambitions for, but it's been one less reason for people to build the kind of machines where Intel could wipe the floor with its competitors.

It's also been a big aid to IBM's Power architecture because one of the few things keeping that alive is better I/O.


Virtualization.

Remember that these EPYC systems will have up to 64 cores per chip. 128 cores / 256 threads per 2P platform, with 128x to 160x PCIe 4.0 lanes depending on the motherboard.

Let's say you split up your system into 32 VMs (4 cores / 8 threads per VM). That's only 4 to 5 PCIe 4.0 lanes per VM. Doesn't seem very spacious anymore, does it? That's barely enough room for an NVMe drive per VM plus a small, shared chunk of graphics (modern enterprise GPUs can be virtualized and split between VMs, so one x16 link to a beefy GPU can be shared between 8 VMs).
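A quick back-of-the-envelope sketch of that split, just dividing the article's 128/160-lane figures across 32 equal VMs (nothing vendor-specific assumed):

    # Split 128-160 PCIe 4.0 lanes across 32 equally sized VMs.
    vms = 32                      # 4 cores / 8 threads each on a 128-core 2P box
    for total_lanes in (128, 160):
        per_vm = total_lanes / vms
        print(f"{total_lanes} lanes total -> {per_vm:.0f} PCIe 4.0 lanes per VM")
    # An NVMe drive alone wants x4, so 4-5 lanes per VM is already tight
    # before you carve off a slice of a shared x16 GPU.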

If your current customer is using NVMe drives + thick-client workstations for docker / testing / etc. etc, it may be difficult to convince them to switch to a virtualized infrastructure. But a beefy machine with this much I/O would definitely catch their eyes.


This, exactly. These chips are targeted for the heavy virtualization platforms (see: "The Cloud"). It means you can fit more customers in the same space and power envelope before you start running into noisy neighbor issues.


> These chips are targeted for the heavy virtualization platforms (see: "The Cloud").

They're very nice for big fast storage: https://www.dell.com/en-us/work/shop/povw/poweredge-r7415

> Front Drive Bays: Up to 24 x 2.5” […] NVMe

That's 96 PCIe lanes right there, and the drives themselves are bandwidth-limited by their Gen3 x4 links.

Especially if you want to plug that into a fast network: you need a Gen3 x16 per 100Gb port, or half a Gen4 x16.
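Roughly, in numbers (assuming ~0.985 GB/s per Gen3 lane and double that for Gen4; line-rate math, not measured throughput):

    # 24 NVMe bays at Gen3 x4 each, and the network needed to move that data.
    GEN3_LANE = 0.985                 # ~GB/s per PCIe 3.0 lane (128b/130b)
    GEN4_LANE = 2 * GEN3_LANE

    drive_lanes = 24 * 4              # 96 lanes just for the front bays
    drive_bw = drive_lanes * GEN3_LANE

    port_100gbe = 100 / 8             # 12.5 GB/s per 100 Gb port
    print(f"storage: {drive_lanes} lanes, ~{drive_bw:.0f} GB/s aggregate")
    print(f"one 100GbE port ({port_100gbe} GB/s) fits in a Gen3 x16 "
          f"(~{16 * GEN3_LANE:.1f} GB/s) or half a Gen4 x16 "
          f"(~{8 * GEN4_LANE:.1f} GB/s)")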


Not an expert, but at least on the gaming side of GPUs my understanding is that you usually aren't using the bandwidth much. For rendering your average frame, all the data you need is already in VRAM.

But if there's new model or textures that aren't yet on the GPU, you want to move them over quickly because the next frame needs to go out ASAP. For a game running at 60hz you've got 16ms to avoid a frame delay. At 144hz you have 7ms. So it can be helpful to be able to dump stuff to VRAM very quickly, which the GPU can grab from system memory via DMA at very high speeds.
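To put rough numbers on that budget (approximate x16 link rates, ignoring protocol overhead; a sketch, not a benchmark):

    # How much data fits through an x16 link inside one frame's budget.
    GEN3_X16 = 15.75    # ~GB/s, PCIe 3.0 x16
    GEN4_X16 = 31.5     # ~GB/s, PCIe 4.0 x16

    for hz in (60, 144):
        budget = 1.0 / hz                       # seconds per frame
        print(f"{hz} Hz: {budget * 1000:.1f} ms per frame, "
              f"~{GEN3_X16 * budget * 1000:.0f} MB over Gen3 x16, "
              f"~{GEN4_X16 * budget * 1000:.0f} MB over Gen4 x16")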

For most model/texture loading you know an object is coming into the scene and would have put it on the GPU ahead of time, so this sort of worst-case scenario shouldn't really come up unless you have a huge number of possible things that could spawn in and you don't know which until the last minute. Pretty unlikely.

For the biggest transfers, like loading in an entire level worth of assets, you can hide it with loading screens, or load low resolution textures to start with and stream in the high resolution versions to catch up. Having more PCIe bandwidth will help speed that up if it's in system RAM, but if you're loading it all from disk that's going to be your limiting factor.

I saw some benchmarks years ago where running a high end GPU at x8 or even x4 had pretty minimal impact. With high res textures coming in for 4K now I'd be curious to see someone run that again.

Realistically, I don't think gamers should care about it beyond the spec one-upmanship. HPC would get a lot more use out of it. Maybe also a shift toward fast PCIe based storage solutions like Intel's Optane?


> I saw some benchmarks years ago where running a high end GPU at x8 or even x4 had pretty minimal impact. With high res textures coming in for 4K now I'd be curious to see someone run that again.

The average impact is low, but frametime consistency generally deteriorates much more strongly than the small reduction (0-5 %) in average FPS suggests. That is probably mostly due to blocking uploads, where having far more bandwidth reduces the blocking time. The average reduction is probably caused more by latency in general, since modern renderers tend to do a bunch of queries, and while they obviously try to avoid it, they'll have some amount of blocking commands in each frame.


Here is the easy example:

A dual 100GbE NIC requires a PCIe Gen4 x16 slot to run both ports at full speed. Mellanox ConnectX-6 200GbE cards require either 2x Gen4 or 1x Gen5 slot for full dual port bandwidth. Most of the larger NVMEoF nodes we are seeing are using more than one 100GbE NIC.
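A rough line-rate sanity check of why those cards need that much slot bandwidth (approximate per-lane figures, no protocol overhead):

    # Dual-port NIC line rate vs. what a slot can actually carry.
    def slot_gb_s(gen, lanes):
        per_lane = {3: 0.985, 4: 1.969, 5: 3.938}[gen]   # ~GB/s per lane
        return per_lane * lanes

    dual_100gbe = 2 * 100 / 8     # 25 GB/s
    dual_200gbe = 2 * 200 / 8     # 50 GB/s

    print(f"Gen3 x16 ~{slot_gb_s(3, 16):.1f} GB/s -> too small for 2x100GbE ({dual_100gbe} GB/s)")
    print(f"Gen4 x16 ~{slot_gb_s(4, 16):.1f} GB/s -> fits 2x100GbE, not 2x200GbE ({dual_200gbe} GB/s)")
    print(f"Gen5 x16 ~{slot_gb_s(5, 16):.1f} GB/s -> fits 2x200GbE")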

Enterprise SSDs will rapidly fill Gen4 x4 lanes later this year as we start to see them launch alongside more mainstream platforms. Four of those will fill a x16 slot.

GPUs and PCIe switches are also great examples of uses.

You are right that for the high-volume server segment, 160 PCIe lanes are overkill, as is even 128 today, frankly. For some context, a huge portion of servers are sold with less than one DIMM per memory channel installed.


Hi Patrick,

Before we had M.2/U.2 NVMe drives, there were quite a few manufacturers making drives that plugged straight into the PCIe bus, where they could easily access the full 16 lanes, and they still can...

Intel seems to be the only one still making PCIe add-in-card drives (Optanes are funny; they don't look fast at only ~2000 MB/s, although they do win on latency and random access). Anyway, there doesn't seem to be much demand for that speed; I guess it's a case of x4 U.2/M.2 being good enough.

Speaking of which... I am surprised there isn't a similar network interface utilising PCIe with a low-overhead protocol the way NVMe does. We're getting towards the point where a small number of machines could share memory and CPUs between machines. I know we have Infiniband but that has always come at quite a cost, mainly in supercomputers.


There are still a decent number of x8 AIC SSDs on the market, but enterprises don't want them because they're not hot-swappable.

Networking over PCIe exists but no one seems to care about it. https://semiaccurate.com/2014/09/23/look-avagos-expressfabri... Part of the problem may be that 96x8G PCIe switches look pretty small compared to 256x50G Ethernet switches and thus you'd need a lot more of them.


> I know we have Infiniband but that has always come at quite a cost, mainly in supercomputers.

Lower end Infiniband (eg 40/56Gb) isn't too expensive these days. Well, that's assuming people aren't paying list price. :)


NVMeOF (NVMe over Fabrics) comes to mind: having multiple NVMe drives (mostly PCIe x4) plus networking (100GbE uses x16 PCIe) consumes a lot of PCIe lanes.

Also note that current two-socket AMD Epyc servers use 64 lanes from each CPU for inter-CPU communication, so the entire server has only 128 lanes (out of 256) left for peripherals.
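The back-of-the-envelope version of that lane accounting, based on the publicly documented Naples topology (128 lanes per socket, half of which become Infinity Fabric links in a 2P system):

    # Why a two-socket Naples system has 128 usable lanes, not 256.
    lanes_per_socket = 128
    xgmi_per_socket = 64          # repurposed as inter-socket Infinity Fabric

    total = 2 * lanes_per_socket              # 256 on paper
    usable = total - 2 * xgmi_per_socket      # 128 left for NICs, NVMe, GPUs
    print(f"{total} lanes on paper, {usable} usable for peripherals")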


Current Epyc servers have PCIe 3.0 connections. They could use half as many PCIe 4.0 connections and get the same bandwidth.


Yeah but realistically NVMe will remain on 4x and you'll get faster drives. It allows for double the number of ports in NICs though, which is really useful for 10+Gb (and especially 100)


I can imagine PCIe 4.0 cards with PCIe 3.0 NVMe interfaces on them - 8 NVMe drives, each using 4 PCIe 3.0 lanes, behind a single PCIe 4.0 x16 slot.
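The bandwidth works out, assuming such a card carries a PCIe switch with a Gen4 x16 uplink (rough line-rate math only):

    # A Gen4 x16 uplink has roughly the bandwidth of 32 Gen3 lanes,
    # so one switch card could feed 8 Gen3 x4 drives without oversubscription.
    GEN3_LANE = 0.985     # ~GB/s
    GEN4_LANE = 1.969

    downstream = 8 * 4 * GEN3_LANE    # eight drives at Gen3 x4 each
    uplink = 16 * GEN4_LANE           # Gen4 x16 back to the CPU
    print(f"downstream ~{downstream:.1f} GB/s vs uplink ~{uplink:.1f} GB/s")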


High performance networking on the whole, of which HPC GPU solutions are only one use case. And there's the continuing trend of sticking offload engines in the form of ARM ASICs alongside devices that connect over PCIe.

A bit farther into the future: https://genzconsortium.org The idea of everything being connected by a shared high speed fabric enables ideas like shared memory pools across a fabric for 'composable' configurations. https://drivescale.com/software-composable-infrastructure/


GPU training of ML/DL models is bottlenecked in bizarre locations. Generally, when doing distributed training on multiple GPUs with a large dataset, you move, all at once, enough data to fill the GPU RAM, let it crunch for a little while, then push the next batch in. People have come up with workarounds like doing multiple training passes over each batch of data, but ideally you would not do that and would refresh the whole 12/16 GB of training data on each pass. If you have enough RAM on the board to keep the whole training dataset in memory (likely), then you can easily find that the bottleneck in your system is bandwidth to the GPU. People like to be able to train with a large number of GPUs in parallel, but they really don't like cutting down to 8x PCIe lanes per GPU.
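As a rough illustration of why people dislike dropping to x8: the time to refresh a 16 GB card over various links (simple line-rate math, ignoring overheads):

    # Time to refill one GPU's 16 GB of training data over different links.
    GEN3_LANE = 0.985     # ~GB/s per PCIe 3.0 lane
    GEN4_LANE = 1.969

    batch_gb = 16
    links = {"Gen3 x8": 8 * GEN3_LANE, "Gen3 x16": 16 * GEN3_LANE,
             "Gen4 x16": 16 * GEN4_LANE}
    for name, gb_s in links.items():
        print(f"{name}: ~{batch_gb / gb_s:.1f} s per refill")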


This will give us servers for deep learning that can have 8 GPUs and a couple of NVMe disks on PCIe 4.0 (32 GB/s). With very good inter-GPU I/O and access to NVMe, it will enable commodity servers that are competitive with Nvidia's DGX-1 or DGX-2, which use SXM2 (NVLink, with 80 GB/s between GPUs).


A good target is storage, ie a lot of NVMe drives.


Yeah, that's why most PCIe M.2 drives top out at around 3500 MB/s. I doubt it will take long to saturate 7 GB/s (or double that in RAID 0). Either way, it's a heck of a lot of bandwidth!


While NVMe drives are becoming quite common in regular computers, sadly most software is ill-equipped to handle that sort of I/O bandwidth well. (I would blame OS interfaces for at least three fifths of that, though people who think it is fine to use textual formats, where the fastest, least correct parsers top out at 1.5-2 GB/s for bulk data, also have their share of blame.)


I wouldn't say it's the bandwidth limit per se. A bigger problem is latency; cache misses have a bigger impact the faster clock speeds go.

Grace Hopper and her nanosecond (about 30 cm) of wire doesn't sound so eccentric these days. https://en.wikipedia.org/wiki/Grace_Hopper#Anecdotes

Sadly, the extra bandwidth will come in handy as software developers become ever more slack.

I do agree about parsing text (like JSON!). It seems crazy to stick to text formats/protocols when binary would be much faster (it's easy to convert binary to text for debugging purposes, as that's what happens anyway!), and we're still not fully utilising HTTP/2. (Don't quote me, but I remember hearing somewhere that ~10% of computer resources are spent converting between base 10 and base 2.) Have a read about .NET's Span (reference structs): by not copying values everywhere (and relieving the subsequent GC pressure) it has improved tasks like parsing text by an order of magnitude (often more). That's something the compiler will be able to do without the developer's effort.
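A toy illustration of the text-vs-binary gap (pure Python, so the absolute numbers mean little, but the ratio makes the point; Span-style parsing attacks the same overhead from the other direction):

    import json, time
    from array import array

    values = list(range(1_000_000))
    text_blob = json.dumps(values)            # a few MB of JSON text
    bin_blob = array("q", values).tobytes()   # 8 MB of packed 64-bit ints

    t0 = time.perf_counter()
    json.loads(text_blob)                     # parse the text form
    t1 = time.perf_counter()

    decoded = array("q")
    decoded.frombytes(bin_blob)               # "parse" the binary form
    t2 = time.perf_counter()

    print(f"JSON parse:    {t1 - t0:.3f} s")
    print(f"binary decode: {t2 - t1:.3f} s")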

ASCII is underappreciated.


Software moves "slower" because hardware needs to exist 1st and needs to see a widespread adoption in order to justify the cost of development and continuous improvement, also only hardware companies have the expertise because, well, they are the ones who developed the thing.

Doing software actually costs a lot, and no one wants to spend that kind of money past the "good-enough" threshold if they don't have to.


Software then moves a lot slower, because most user applications are pretty bad at using more than one core efficiently; and these CPUs have been mainstream for some 15 years now.

The hardware is certainly much more capable than what little use the software tends to make of it.


Virtualization does. Most virtualization platforms require dedicated lanes for each device assigned directly to a VM in passthrough setups.


The x16 links will usually get split out to x4 NVMe slots, and that's where you're going to get limited by the PCIe bandwidth.


HPC as you mentioned, but also 200Gbps NICs need it, and those are available today.


Dual port 100 and 200 Gbit/s network adapters.


How about 30-40 NVMe devices at 7 GB/s each?


Deep learning training, especially distributed training solutions, can utilize greater PCIe bandwidth.


What about a bunch of NVMe drives?


NVMe and 8K displays?


Off the top of my head this would be great for massive enterprise VDI installations and game streaming services (which is just VDI by another name), ML, AI applications, and NVMe storage.

Personally, this will be good for me because everything I do is bandwidth-starved.

I'm not a radar scientist but I am a systems engineer supporting radar scientists working on air- and space-based Synthetic Aperture Radar (SAR) systems. We use GPUs, FPGAs, and other accelerators to generate images from SAR data.

Here's an "old and busted" image made from SAR data: https://hackadaycom.files.wordpress.com/2014/02/image-from-s...

In 2002, it took more than 24 hours to generate a single low-resolution picture from low-bandwidth SAR data on a $1.5 million Sunfire 15K cluster with 70-ish SPARC CPUs. Today, on a single 3U server with two Xeons and four Tesla V100s it takes about 15 seconds-- and that's an extremely high resolution image from very high data rate SAR data.

But our goal is real-time VIDEO from SAR data, so everything needs to be faster. Network speeds need to be faster, CPUs need to be faster, GPUs need to be faster and we need more of them, storage needs to be faster, everything needs to be faster.

I could see a 3U/4U box with 16+ x16 PCIe slots, each stuffed with a single-slot GPU, with an NVMe storage array (4x PCIe lanes each!) and a couple of 100GbE dual-port NICs, blasting through SAR data like a hungry hungry hippo.

As far as PCIe lanes go, if I have a 24-drive NVMe array that's 96 PCIe lanes all by itself.


That makes sense. Thanks to the reactions here (from you and others) I'm getting an idea of the bandwidth-starved computational tasks out there in real life.


I've never heard of "SAR", that's very interesting... But I am curious, if the data truly comes from only radar, why are shadows apparent in many of the created images? Or does the same surface reflect radio waves differently based on whether it is in the sun or not?


Radar sends a pulse. Radar moves. Radar looks at reflections from the previous pulse. Because the radar moved before the waves returned, it is now looking at the area from a different angle, making the image appear illuminated from elsewhere. It's not the sun's shadow but a radar shadow. Like if you took a picture with a flash and managed to reposition your camera to a different angle before the light from your flash reached the target.


A bit off topic, but where can I find details of the math behind SAR? In my last job I worked on radar systems and I was always curious about SAR. However all the information I found was high level descriptions. There was even some MIT course available on it but they too skipped/avoided the math involved

(I'm curious to see if I can apply similar techniques to audio)


How does one contact you outside of HN?


Hmmm, how many radar pulses need to hit a target to generate video? Sounds potentially irradiating... ;)

eg: WARNING Don't attempt to image living creatures WARNING


Radar typically uses microwaves, which are non-ionizing. Over this area it's bound to be a terribly, terribly small amount of exposure.


Ahhh, no worries.

Just remembering a friend recently telling me about someone who was killed at a workplace. Apparently that person walked in front of the main radar array (defense-related, I think) while it was in operation and just dropped dead instantly. :(


At the 1933 World’s Fair in Chicago, Westinghouse demonstrated a 10-kilowatt shortwave radio transmitter that cooked steaks and potatoes between two metal plates [1]. In 1946 a Raytheon engineer named Percy Spencer "...was visiting a lab where magnetrons, the power tubes of radar sets, were being tested. Suddenly, he felt a peanut bar start to cook in his pocket. Other scientists had noticed this phenomenon, but Spencer itched to know more about it. He sent a boy out for a package of popcorn. When he held it near a magnetron, popcorn exploded all over the lab. Next morning he brought in a kettle, cut a hole in the side and put an uncooked egg (in its shell) into the pot. Then he moved a magnetron against the hole and turned on the juice. A sceptical engineer peeked over the top of the pot just in time to catch a face-full of cooked egg. The reason? The yolk cooked faster than the outside, causing the egg to burst..." [2]. This discovery led to a patent application for "the use of microwaves to heat food", a concept which was eventually realised in the "Raytheon RadaRange" series of microwave ovens [3].

[1] https://en.wikipedia.org/wiki/Microwave_oven#/media/File:Coo...

[2] https://spectrum.ieee.org/tech-history/space-age/a-brief-his...

[3] https://en.wikipedia.org/wiki/Microwave_oven#/media/File:NS_...


This happened to a former schoolmate of mine with a pacemaker when strong EM equipment was turned on near them.

The people on site had no idea he had one and attempted to administer CPR, but they wouldn't have been able to detect his pulse, regardless.

Imagine how it looked to the people who never found out and the version of the story they might still be telling others.


Given the inverse-square law, your phone and Wi-Fi network will probably give you more radiation.

I had some bone scans (plus a couple of CTs the same day, just a bit more ionising radiation) last year. They inject you with Technetium-99m and YOU become the gamma/X-ray source. It's a little concerning when you see the detail and spread of the radiation. If it wasn't a pure gamma source (with a six-hour half-life) it would be lethal!

I forget if it was 70,000 milli- or microsieverts (I think it was milli? A 0.5% increased risk of cancer in my lifetime). Oh, and my bladder looked like a lightbulb.


>why the AMD EPYC “Rome” generation will likely see 160x PCIe Gen4 lanes plus likely additional lane(s) for a necessary function.

Emphasis mine. This seems to be pure speculation, not 'news'.


Likely is there because it requires that AMD's partners release servers in this configuration, and those servers are unreleased. In theory, something during validation could prevent it or partners could decide not to release this configuration (which are unlikely but still possible.)


From the article, "Over the past few weeks, we have managed to confirm with a number of AMD’s ecosystem partners that our theory is not only valid, but it is indeed what we will see." Essentially, there is significant evidence that supports their hypothesis. This is far more than mere speculation and I see no reason to assume they have fabricated (hehe) this story.


Yeah:

> I did give AMD the heads-up that it would be going live just after the Cascade Lake launch. They did not sanction this article (indeed, they will not be overly excited to see it is live.)

That seems to justify some doubts.

PCIe 4.0 support has got to cost some die area/power, and not many PCIe devices will support it at first. That might point towards starting with a mix of 3.0 and 4.0 lanes.

If they did upgrade all lanes, and some devices show up, could be cool for some bandwidth-hungry GPU and/or HPC-networking use cases. Or for servers that wanted to load up on any kind of compatible device (SSDs, whatever): you can fit more 4.0 x4 than 3.0 x8 anythings in a given number of lanes.

Still think we'll have to wait a gen, though.


They claim to have confirmed this with AMD partners. So it's more than just speculation.


Meanwhile i9 9900k still has only 16 3.0 lanes ...


AMD's been adding more PCIe lanes across their hardware lineup than Intel has, to be sure, but EPYC is server-class stuff with 64 cores to a chip. The second-gen Ryzen 7 2700X only has 20 lanes for 8 cores.


I don't watch this as closely as a lot of people, but my impression is that Intel has been rationing lanes as one way to segment their market, and AMD noticed they can easily mess with that strategy.

I spent significantly more for the CPU on a personal build a few years ago because I needed 40 lanes instead of 24. I'm all for AMD messing with it.


Intel segments the market across every single facet they can find that won't outright keep people from buying their stuff: Hyper-Threading, ECC memory (I have ranted about how insane this is before, on numerous occasions), I/O options, GPU options, cache size, TDP, networking and of course socket compatibility (where Intel figured out that not only can they sell you a new CPU with 0-5 % more performance every two years, but they can also make you buy a new board to go with it just by adding or removing a couple of ground pins).

And Intel got away with it because they had no competition. Monopolies are always very, very bad for consumers.


Agreed. I do live video and a factor in going with Ryzen (a 1500X for now, a 3700X when that drops) was more PCIe lanes for more hardware.

That, and the eye-popping multithreaded performance for encoding.


I mean, more would certainly be great, but with 20 you can at least run an x16 GPU and an x4 NVMe SSD, which is perfectly OK for my personal setup; I cannot do that with 16 lanes.


The chipset just multiplexes and it's fine. You're unlikely to actually use more than 16 lanes simultaneously in such a setup. But yes, Intel is being overly stingy with PCI-E lanes off of the CPU and leaning heavily on the chipset to compensate.

I'm hoping Ryzen 3rd gen bumps the lane count a bit more, otherwise Threadripper's 64 lanes look mighty nice...


The new Ryzens are meant to launch with PCIe 4.0 (this summer, I'm due an upgrade!), so effectively...

I'm surprised they don't come with more, especially as many of the current boards come with 2-3 x16 slots AND 2-3 M.2 slots. Both Intel and AMD have been adding more cores recently. Plus, if you're doing any GPU-based rendering, they're going to simultaneously move data back and forth from NVMe to the GPU's memory. But more cores grabs the headline, and most synthetic benchmarks will only test RAM, graphics and storage in isolation - sneaky!


> Plus, if you're doing any GPU-based rendering, they're going to simultaneously move data back and forth from NVMe to the GPU's memory.

Games definitely don't do this at all, and games are the primary market for a discrete GPU on these consumer platforms.

When they do stream in assets they do so slowly & in a controlled, rationed amount to minimize impact on FPS. They are far from being PCI-E bandwidth limited. That's kind of why you see almost no FPS drop at all when restricting GPUs to x8 bandwidth, even.

If you're doing something more workstation-y or custom, that'd be when AMD & Intel would point you at the HEDT platforms which have more than 16-20 lanes.


Are there any I/O heavy GPU workloads? Mining famously works with just one lane, offline rendering (e.g. Blender Cycles) I think also just uploads the whole scene once and then bounces the rays around…


There are GPU-accelerated database engines which can stream the database from NVMe. I am not sure if they do direct device-to-device transfers, which have only become supported recently, or whether they still need to bounce through main memory.

In raytracing, complex scenes can exceed your VRAM, so if you don't want to fall back to CPU tracing you need a renderer that can swap parts of the BVH in and out on demand.


> offline rendering (e.g. Blender Cycles) I think also just uploads the whole scene once and then bounces the rays around…

I'd imagine a workload like that, or other HPC compute workloads where the data set just doesn't fit in VRAM, would certainly prefer more PCI-E lanes.

I think that'd usually be considered using the wrong hardware for the job but if you're just messing around as part of a hobby you're obviously not buying the $7000 Radeon Pro SSG, either.


Intel just dropped a bunch of details about the actual competitor to EPYC, Cascade Lake AP. These server chips will have 48 PCI-E 3.0 lanes, still not anywhere near the EPYC's claimed amount, but definitely better than what they are selling to the niche high-end gamers.


They're losing market share but it's better to lose some market share than look desperate and nuke most of their business strategy and golden egg basket.

Sure, people hate Intel for their practices, but they can afford to be hated because they can still milk the market for at least a couple of years. "No one ever got fired for buying Intel" is a great asset, and they have the confidence they'll bounce back and offer a competing product "in the future".


A big part of Intel's pricing structure was built around the number of PCIe lanes.

They had a monopoly for a long time, so they turned into a money-making machine first and a chip company a distant third or fourth.


Just added a quick note. Everyone in the industry that contacted me today about this seems to be calling it WAFL or something that sounds like "Waffle" for the extra bonus PCIe lanes.



