Hacker News
A Deep Dive into AMD’s Rome Epyc Architecture (nextplatform.com)
135 points by lamchob on Aug 17, 2019 | 45 comments



Up until around 2012, realworldtech.com and anandtech.com used to publish rather more detailed descriptions of the microarchitecture inside each core.

Is anyone publishing things like that these days? I mean pages like these:

https://www.realworldtech.com/haswell-cpu/4/

https://www.anandtech.com/show/6355/intels-haswell-architect...

(I noticed that Agner Fog's chapter on Ryzen is conspicuously missing a "Literature" section.)


Anandtech still does that, just no longer written by Anand himself (he is working at Apple now), so the writing isn't as good, even though the technical details are still there.

One of the problems is that the market for this kind of review is very much a niche. And just like all forms of free media, if there aren't enough page views they stop doing it.

I have always thought some of these outlets would consolidate. I only ever read Anandtech, Servethehome and some Ars, and that is about it. I have RSS headline feeds from a few other sources such as Tom's Hardware and Engadget, but if Anandtech covers the same topic I always go there first.

Not only has that not happened, most of these websites manage to stay afloat catering to different markets. But I have no idea how the market segmentation works. I can tell that a site like Wccftech is essentially a 100% rumours site with very little if any technical knowledge in its writing, and yet it gathers a huge audience.

While others like Tom's Hardware seem to have retained enough of their news readers to remain sustainable.


Unfortunately while AT still has some great deep-dives for mobile SoCs (top marks to Andrei Frumusanu), the x86 articles have become a bit shallow. And if that wasn't bad enough, they also suggest some bias.

They tend to bang the drum when it comes to Intel but in AMD reviews you'll get things like "Due to bad luck and timing issues we have not been able to test the latest Intel and AMD servers CPU in our most demanding workloads". It's a lot like reviewing a Ferrari but due to bad luck you could only test it in city traffic.

Two years ago they neglected to cover the Threadripper launch for two weeks while the front page was flooded with dozens of uninteresting half-page articles about Intel motherboards being launched around the same time. I love a good tech article regardless of which brand it's about, but bias will always kill the experience for me. YMMV I guess.


From watching these sites for years, I think you can see whatever bias you want to see. Some sites/authors do have clear bias, but a lot of it is just time pressure.

Often, review parts are shipped to sites with a review embargo until a certain date -- if you don't ship your review on that date, you lose out. If the shipment is late because of the vendor or the shipping service, or the reviewer is sick or out of town, or the shipped firmware isn't great and interim firmware makes a big difference, the choices are:

a) take the time to do a full review, but publish late
b) do a cursory review, apologize, and publish on time
c) do (b), but follow up with a full review as time permits

If (c) happens more with AMD than with Intel, it could be bias, it could be bad luck, or it could be that Intel has been delivering more finished products to reviewers.


AT shouldn't get worse treatment than any other review site. But if all the others can post detailed benchmarks or cover an event and only AT has consistent issues and bad luck, at some point a pattern emerges.

I get that I can also be biased. But bias should be like noise: taking all of the articles together should average it out. In AT's case it's more like the signal than the noise. What really capped it off for me was not covering a public event that every other website covered, like the 2017 Threadripper launch. The signal was that they were willing to ignore one of the most interesting launches in years to post articles about trivial motherboard announcements. I would never mind if Intel launched some awesome new CPU.

Then the confirmation came the following year, coincidentally also during a Threadripper event, when they wrote multiple articles touting Intel's new 5 GHz 28-core CPU. They missed the fact that it was a massive overclock chilled by an (admittedly hidden) 1 HP chiller, and their experience raised no red flags where even the comments did. But worse, when the bubble burst, unlike every other publication AT's response was an anemic piece excusing Intel, with the literal conclusion that "the 28-core announcement was not ideally communicated".

I understand Intel's shenanigans to try to steal some of the attention that TR was getting. But as a journalist, being played like that should trigger a more visible reaction. Consistently painting them in a good light just raises suspicions for me. And while I still read their articles, I no longer take them or their conclusions at face value unless another big site confirms them.


Yes, I totally understand that, and you could literally count on one hand how many people work at Anandtech.

But sometimes I just want two sentences on their front page. Like:

1) Today is the launch of AMD Threadripper; here are the specs. It is exciting to test (hype) and we intend to publish a full review within two weeks.

Rather than just stay silent on the issue.

2) Today there is a new Intel vulnerability called X, as published here (Intel official documentation) and here (the bug likely has its own webpage by now). We may cover it in more detail in the future.

I understand they have timing and staffing issues. But two sentences would show they knew of the issue / press release, rather than staying silent on it.

Maybe Anandtech wants to be a pure review site, but then it wouldn't have a section called News Pipeline. Staying silent on anything to Intel's disadvantage makes me question whether they have a slight bias towards Intel.


That was also around the time Anand left the company. They probably had transitional issues.

That being said, last year all I could read in their comments section were accusations of bias towards AMD, to the point that they were being accused of being paid by AMD. They had a ton of AMD coverage, including, I believe, a one-on-one with Lisa Su.

So I’m currently taking accusations of bias with a grain of salt.


I wouldn't see Intel bias if AT gave them the spotlight during a time when they announced/launched massively interesting products (like Zen based CPUs were/are). I actually expect them to treat a new Intel architecture exhaustively even at the price of not having time for trivial AMD related news.

But if you read my concrete examples above and go to AT's site to confirm their legitimacy I think you will agree that this goes far beyond giving too much attention to one of them during a period of major change. They were willing to do exactly the opposite and refuse attention during a major launch to cover trivial topics for another company, they accepted being repeatedly played by the same company and never publicly held them accountable.

And this last part is arguably the most worrisome because it's no longer about one journalist's personal preference towards one company. It's their journalistic integrity. When you realize you were tricked into deceiving your readers you're expected to take a stand publicly. And at the very least learn from the experience and trust but verify. AT still enthusiastically covered paper launches that never materialized, with no "grain of salt" thrown in there. And it doesn't matter which brand they favor, only that they are not willing to take a stand after being repeatedly played for attention.

I still read them (only as a secondary source) and I'm not recommending against it. It's just that the implicit trust I had when Anand was writing is off the table for me.


Wikichip is my go-to for these things.


The servethehome review of Rome is a pretty detailed look at the architecture.

https://www.servethehome.com/amd-epyc-7002-series-rome-deliv...


Wikichip is nice, but tbh a lot of this stuff ends up in analyst reports nowadays. Look at conference proceedings if you're interested.



That, and the servethehome review, seem to be basically putting the presentation slides into words.

A few years ago they seemed to have additional sources of information (they'd talk about things like instruction-to-port assignments and penalties for moving data between integer and FP domains).


Maybe that means that the AMD presentation slides are pretty good?

Beyond that it gets into pretty deep expertise in both Intel and AMD to compare the approaches, and I assume most such high-level experts work for either AMD or Intel, so you would not get an impartial view anyway.


Maybe one of the write-ups of AMD's presentation at Hot Chips next week will have what you want.

To be honest, though, I don't see a substantial difference between the haswell article you linked and the Zen 2 article, provided you are willing to look past the AMD slides. The haswell article is also just "putting the presentation slides into words," just from IDF 2012 instead of AMD Tech Day 2019, and apparently the author felt a need to do the block diagrams themselves.

(Also, FWIW, Agner's manual does have a literature section for Ryzen, it is just not numbered for some reason).


I think the tech press still tells us what they can, and stuff like execution ports, reorder windows, etc. is still publicly disclosed. AT talked about what was publicly said about Zen 2 (https://www.anandtech.com/show/14525/amd-zen-2-microarchitec...) and Sunny Cove (https://www.anandtech.com/show/14514/examining-intels-ice-la...). And their reviews do try to report the top observable results (memory latencies, relative performance on different kinds of task, power/clock info) and all that's arguably of more practical importance to lots of folks anyway.

There's also just the trend of modern designs being tricky enough it's harder to infer as much about them and harder to write accessibly about what you do know; it's not super easy to figure out and describe, say, modern branch predictors simply because they're all layering a lot of strategies on each other.

For example, from Haswell on, Agner Fog essentially said Intel's large-core branch predictors are good at lots of things but there's not much he can say about how they work (p29 at https://www.agner.org/optimize/microarchitecture.pdf). Writing code to beat Cortex-A76 prefetchers, AT's Andrei Frumusanu had difficulty fooling them with anything other than essentially-random access patterns and compared them to "black magic" (https://twitter.com/andreif7/status/1102230575522430977). These aren't just random folks saying "wow, CPUs are complicated"; they successfully figured out a lot about past generations of CPU.
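
For anyone curious what "essentially-random access patterns" means in practice, the standard trick is a pointer chase over a randomly permuted cycle: the next address depends on the value just loaded, so a stride or next-line prefetcher has nothing to latch onto. A rough sketch of the idea in Python (a real latency measurement would use a compiled harness; the array size and step count here are arbitrary choices, not anything AT or Agner specifically used):

    import random, time

    # Build a random cyclic permutation: chasing nxt[i] visits every slot once
    # in an unpredictable order, so the next index is only known after the
    # current "load" completes.
    N = 1 << 20                      # number of slots (arbitrary)
    order = list(range(N))
    random.shuffle(order)
    nxt = [0] * N
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b

    # Chase the pointers; in a compiled version each step is a dependent load,
    # which is what defeats stride/next-line prefetchers.
    i, steps = 0, N
    t0 = time.perf_counter()
    for _ in range(steps):
        i = nxt[i]
    t1 = time.perf_counter()
    print(f"{(t1 - t0) / steps * 1e9:.1f} ns per step (interpreter-dominated in Python)")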

AMD did reference the TAGE family of branch predictors, which there's lots about in public literature. There might be some broadly interesting stuff in the vendors' contributions to gcc/LLVM (machine models and arch-specific optimizations).

Maybe ARM implementors talk a little more about their stuff? That might have something to do with the dynamics of the relatively open/diverse market for ARM SoCs versus the long-running one-on-one-ish x86 rivalry.

Hard to boil all that down to a single point, but if AMD and Intel want to talk more about the guts of their products, I'm sure plenty of grateful wonks would lap it up. :)


I despair that the market is more interested in things like mobile apps and LED equipped RAM than serious in-depth technical reporting on microprocessor internals.


There must be simulators for this kind of architecture to see what the best combination of sizes and components is while keeping it practical! Does anyone know of something like that? A tool to min-max these choices and estimate whether a design can be built with the resources they have.


http://gem5.org/Main_Page is an open source CPU simulator.

I used it in undergrad to run benchmarks with different cache sizes and cache coherence strategies to see which were more effective. I'm sure Intel and AMD have much more advanced simulation tools though. Most likely multiple, or at least multiple levels of granularity (so you could do stuff like, simulate these potential branch predictor designs at a gate level, and then turn around and simulate the entire CPU at a higher level of abstraction.)
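
For a flavour of what that looks like, gem5's example config scripts can be driven from the command line, so a crude design-space sweep is just a loop over parameter values. A minimal sketch (the flag names follow gem5's se.py options as I remember them, so treat the exact spellings as assumptions; the "hello" binary stands in for whatever benchmark you actually cross-compiled):

    import itertools, subprocess

    # Hypothetical sweep over L1D/L2 sizes using gem5's sample syscall-emulation
    # config script; flag names are from memory and may differ across versions.
    l1d_sizes = ["32kB", "64kB"]
    l2_sizes = ["256kB", "512kB", "1MB"]

    for l1d, l2 in itertools.product(l1d_sizes, l2_sizes):
        outdir = f"m5out_l1d{l1d}_l2{l2}"
        subprocess.run([
            "build/X86/gem5.opt", f"--outdir={outdir}",
            "configs/example/se.py",
            "--cpu-type=TimingSimpleCPU", "--caches", "--l2cache",
            f"--l1d_size={l1d}", f"--l2_size={l2}",
            "-c", "tests/test-progs/hello/bin/x86/linux/hello",
        ], check=True)
        # Each run dumps stats.txt in its outdir; compare IPC or miss rates
        # across the output directories afterwards.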


Tools like that are a core part of the design process. You write that software along with the choice of parametrization of the design. It's not an off the shelf thing. But yes, that's how it works.

It's also important to note that decisions like this are hugely workload-specific. There's no single best processor for all applications. In extreme examples: almost every transistor on a vector SIMD unit is wasted when trying to optimize for a client Javascript benchmark; streaming symmetric encryption gets no benefit from L3 cache (which is like half the chip these days!); etc...


Are people like Jim Keller the experts in finding the right balance? Is that why he and his team are so important?


Maybe the time has come for applications specific CPU variations?

One optimized for node.js tasks, one for databases, ...


Mainframes explored this long ago, leveraging different microcode for different workloads while keeping the processing unit hardware the same: https://en.wikipedia.org/wiki/ZIIP

On the other side of the computing spectrum, there were a couple of papers in the 2010s about offloading the most common mobile tasks (like CSS layout) to specialized mobile CPU subsystems. Maybe this would have been implemented if Google had made their own mobile CPU.


ARM got a JavaScript-oriented instruction: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

At least two different tasks, audio and graphics, have special application specific processors. Networking and crypto are also often offloaded.


> one for databases

As a DB guy, there's no 'one task' for DBs. The only thing I can think of that is nearly characteristic of the DBs I've worked on is that they're IO bound.

That's possibly true of most things except floating point and graphics.


While that's true, you can achieve impressive results offloading particular parts of particular DBMS code to specialized processors: https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0...


Oh yes. That's been the dominant approach at least since Computer Architecture: A Quantitative Approach came out in '89. The search space is pretty complicated, though, given the physical interactions as well as the logical ones.


Sounds like a lot of parameters for some ML setup


I’m pretty sure they roll their own. I would be surprised if simulation tools were not the most well guarded secrets of these companies.


They have many levels of simulation for their processor designs, yes, but they're proprietary.


Yeah but they're often trying to predict where the software industry will go years in advance. You can see that in the disagreements between AMD and Intel on AVX-512: there's a chicken and egg situation where it's not always clear what's right to optimise for, as it depends on changing workloads and software platforms. For instance the big chip companies were caught out by the need for low precision maths for AI inferencing. That all came out of Google.


So yes, plenty of simulators exist, many internal ones as well as gem5*, but automated search-space exploration, say with MCMC, isn't in widespread use yet IIRC, although plenty of academic papers have explored the topic.

*which I swear every company has their own version of



> “We like features that improve both power and performance,” Clark elaborated. “Being on the right path more often is important because the worst use of power is executing instructions that you are just going to throw away. We are not throwing work away after we figure out dynamically that we were wrong to do it. This definitely burns more power on the front end, but it pays dividends on the back end.”

All the documentation I've seen is quite light on the branch prediction improvements. Going by the slides, they improved its accuracy by a third; I'd be curious to know how. Side note: if your superscalar core is big enough (yeah, those registers use power), couldn't you just get rid of branch prediction at no performance cost (doing something else while waiting for the data)?

My only grudge against Zen (as a consumer) is that the AM4 socket is intended for both APUs and CPUs. While this is a good thing, I have a couple utterly useless video outputs on my motherboard. I would have liked AMD to include some display driver circuitry on every chip. Maybe in the I/O die, if they use such a thing in all of their designs going forward? I mean, I would be quite content with using software rendering when I need to drive a screen, or even spare a bit of memory bandwidth and CPU cycles to drive an extra display from my desktop's graphics card.


> Every documentation I've seen is quite light on the branch prediction improvements.

In one of the pictures in the article, it says the new architecture uses the TAGE branch predictor. This is likely based on the work of André Seznec. There are many articles on the implementation (but they can be difficult to understand if you are not already familiar with his work).

I've implemented the bare-bones predictor in a computer architecture course; you can see an abridged version of my presentation slides here [1]. Note this only describes the bare-bones predictor; in more recent work Seznec added a loop predictor and a statistical corrector to increase the accuracy.

There is some work using TAGE with perceptrons in the statistical corrector unit.

[1] https://docs.google.com/presentation/d/1aUrwD-ENYPB7pMrCoYmE...
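
For readers who just want the gist without the papers: TAGE keeps several tagged tables, each indexed by a hash of the PC with a geometrically longer slice of global branch history, and the prediction comes from the hit with the longest history, falling back to a simple base predictor. A toy sketch of just the lookup (the update/allocation logic, useful-bit aging, and the loop/statistical-corrector extensions are all omitted):

    # Toy TAGE-style lookup: tables indexed by (PC ^ folded history), tagged so
    # we know whether an entry really belongs to this branch.
    HIST_LENGTHS = [4, 8, 16, 32, 64]      # geometric series of history lengths
    TABLE_BITS = 10                        # 1K entries per table (toy size)

    tables = [dict() for _ in HIST_LENGTHS]   # index -> (tag, 3-bit counter)
    base = {}                                  # PC -> 2-bit counter (bimodal fallback)

    def fold(history_bits, length):
        """Fold the most recent `length` history bits down to an index-sized hash."""
        h = 0
        for i, b in enumerate(history_bits[-length:]):
            h ^= b << (i % TABLE_BITS)
        return h

    def predict(pc, history_bits):
        fallback = base.get(pc, 1) >= 2        # weakly not-taken default
        # Scan from longest history to shortest; the first tag match wins.
        for table, length in zip(reversed(tables), reversed(HIST_LENGTHS)):
            idx = (pc ^ fold(history_bits, length)) & ((1 << TABLE_BITS) - 1)
            tag = (pc >> TABLE_BITS) & 0xFF
            entry = table.get(idx)
            if entry and entry[0] == tag:
                return entry[1] >= 4           # counter MSB means "taken"
        return fallback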


Thank you, I hadn't realized those branch predictors were actually documented, and thought that they were referring to internal names.

It is nice to see research being applied to new mainstream chips relatively quickly. To complement your slides, there is a short overview here [1] (this is actually the first search result).


I forgot to put the link. One of the first search results.

[1]: https://comparch.net/2013/06/30/why-tage-is-the-best/


This didn't really seem like a deep dive compared to the Anandtech article. I was hoping for some memory bandwidth benchmarks, since this should be the first chip that has 8 channels without caveats (looking at you, POWER9). It's also not clear whether it's 16 channels with 2S, but I suspect not.

Edit: the picture from AMD in this review makes me think it can hit 16 memory channels with the two socket version. Does anyone know if this is true?


> the picture from AMD in this review makes me think it can hit 16 memory channels with the two socket version. Does anyone know if this is true?

Yes, if the motherboard provides all the necessary slots. The inter-socket communication is achieved by re-purposing CPU pins used for PCIe, not pins used for DRAM. Each CPU has the full 8 DRAM channels of its own.


From about 50 to 60% of the way down the article:

"There are a total of eight DDR4 memory controllers on this hub chip, the same number in total that were on the Naples complex; both support one DIMM per channel and have two channels per controller, but Rome memory runs slightly faster – 3.2 GHz versus 2.67 GHz – and therefore with all memory slots filled, yields a maximum of 410 GB/sec of peak memory bandwidth per socket. That’s 45 percent higher than the Cascade Lake Xeon SP processor, which has six memory controllers for a total of 282 GB/sec of memory bandwidth running at 2.93 GHz and 21 percent higher than the 340 GB/sec that Naples turns in running that 2.67 GHz DRAM. (Those are ratings for two-socket servers.)"


The poster is holding a line of bash to the standard of code, illustrating that readability should be the goal, and showing a way of bringing bash commands up to a standard of readability for something like a PR. Readability is really there to show _intent_.

I would say, though, that if you are bringing this up to today's code standards then it should really be wrapped in some kind of unit test (https://github.com/sstephenson/bats) for it to pass the PR. That would make the code a bit more maintainable, and it can be integrated as a stage in your CI/CD pipeline.

If we do that, then the intent would be clarified by the input and the expected output of the test. Then the code would at least be maintainable, and the readability problem becomes less of an issue when it comes to technical debt.

I've done this plenty of times with my teams and it's certainly helped.


Are you replying to this thread? https://news.ycombinator.com/item?id=20724679


Yes I was. I've posted the comment to the correct story now. I don't know how that happened.


My gut feeling is that Intel also lays out / develops the IO block and cores separately. It's just that they are all put on a single piece of silicon.


But separate silicon is what gives AMD an almost insurmountable cost advantage. They can bin each chiplet separately, their yields are much higher because each die is smaller, and the cherry on top is the different, cheaper process for the I/O die.
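
To get a rough feel for the yield argument, here is the standard Poisson die-yield model; the defect density and die areas below are illustrative assumptions, not AMD figures:

    import math

    # Poisson yield model: yield = exp(-D0 * A), with D0 in defects/cm^2 and A in cm^2.
    D0 = 0.1                     # assumed defect density (illustrative)
    chiplet_cm2  = 0.74          # ~74 mm^2, roughly a Zen 2 CCD (approximate)
    monolith_cm2 = 7.0           # ~700 mm^2 hypothetical monolithic 64-core die

    y_chiplet  = math.exp(-D0 * chiplet_cm2)
    y_monolith = math.exp(-D0 * monolith_cm2)

    print(f"chiplet yield  ~{y_chiplet:.1%}")    # ~93% of small dies defect-free
    print(f"monolith yield ~{y_monolith:.1%}")   # ~50% of huge dies defect-free
    # A defective chiplet also only wastes ~74 mm^2 of wafer rather than ~700 mm^2,
    # and dies with a dead core can still be binned into lower-core-count SKUs.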



