AMD EPYC 7713 'Milan' Zen 3 CPU Benchmarked (wccftech.com)
83 points by EvgeniyZh on Nov 7, 2020 | 59 comments


Expected, I suppose. But it's still surreal to me that I'd need rows of the full-rack servers I used to install 25 years ago just to equal the core count here. Many, many more rows to get anywhere near the capability.

I suppose it's more surreal than comparisons I made back then because there's very little difference, really, in the operating system and popular languages.


My surreal moment was reading the Genetic Programming series by John Koza. He discussed a 1,000-socket Pentium II cluster that was used for a genetic search.

I did some back of the napkin math and realized that was roughly equivalent to my Zen 2 box.

So of course my failure to replicate the author's success was not the fault of my hardware.


If you like GP, you might want to check out Gene Expression Programming [1, 2]. A lot simpler to implement, with some nicer features. This excellent paper [3] has a lot of great ideas to get inspiration from (it uses GP).

[1] https://www.gene-expression-programming.com [2] https://en.wikipedia.org/wiki/Gene_expression_programming [3] https://ccc.inaoep.mx/archivos/CCC-17-009.pdf


I will check it out, thank you!


Similar situation here. Circa 2000 I had a few separate machines in my lab making lots of noise, taking up lots of space, and drinking up power. Now I run a much larger cluster of nodes in virtual machines on my Zen 2 desktop. If I need more, I rent them from Amazon for a few hours rather than having to go buy something expensive.


Hah, interesting. I remember a John Koza book, but that was before the PII era, I think. Also, he liked Lisp (so maybe that's why he needed all that computing).

In the end it seems GA/GP was "over-optimized" for extremely non-convex/non-linear search spaces (far more so than most real problems), and that we're better off with NN/SGD representations of problems.


There was a whole series of them, GP I-IV. I believe it was GP II that was discussing the Pentium II cluster.

I have had a lot of fun this year building a neural network from scratch as well as genetic programming. For real tasks, it does seem that the former has won out.

I have a layman's suspicion that GP will find itself useful in some way and have a resurgence in the future, but it might just be naïve wishful thinking.


This makes me think of Slashdot jokes from 20 years ago about 'imagine a Beowulf cluster of these'.

I once worked for a server manufacturer that did a setup with hundreds of 1U, dual-socket Pentium III 1.0 GHz machines, in the era when the only way to get more than one core was a dual-socket motherboard.


I'm curious whether they will keep up the increase in core counts. It might get more difficult to do unless the new TSMC processes work very well. They would probably have to go to 16-core (2x 8-core CCX) chiplets, as I don't think they can easily fit more chiplets into EPYC.


Zen 3 is still TSMC 7nm, so expect identical core counts & cache sizes to the existing models, but each core is ~15-20% faster.

Similarly, expect similar prices, since nothing there really changed either. Yields are a well-known quantity at this point.


Well, you can expect 128 cores from 5nm or 3nm. I would guess 3nm makes more sense from a die-size perspective. I imagine Zen 4 on 5nm would be the same as Zen 3 except with 16-core chiplets and the same amount of cache, since the L3 cache is now shared.

But the problem isn't so much the core count as the TDP per core. Imagine it's 3W per core: 128 cores gets you to 384W. And that's excluding the IOD.


I imagine some day soon they will start stacking core dies like they do for memory.


I imagine they will need to get quite creative with cooling to pull that off... I have heard of in-die water cooling, but that sounds like the only option.


Sales of 3M Fluorinert are sure to go up.

(famously used in the Cray 2)

https://en.wikipedia.org/wiki/Fluorinert


I replaced 4 servers at work (doing computation work, not serving things, so no need for availability) with a single desktop computer with a Ryzen 7... Between SATA vs. NVMe, the RAM speed, and the CPU speed, tasks finish faster on the Zen 2... Oh, and it's 1/8 of the power as well. My only problem is that I can't get WoL to work reliably; only the mode that wakes up on any packet works.
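(For reference, a WoL magic packet is just 6 bytes of 0xFF followed by the target MAC repeated 16 times, broadcast over UDP, conventionally to port 9. A minimal C sketch, with a placeholder MAC, in case anyone is debugging the same thing:)

    #include <arpa/inet.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        /* Placeholder MAC of the machine to wake (replace with yours). */
        const unsigned char mac[6] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55};

        /* Magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times. */
        unsigned char pkt[6 + 16 * 6];
        memset(pkt, 0xFF, 6);
        for (int i = 0; i < 16; i++)
            memcpy(pkt + 6 + i * 6, mac, 6);

        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        int on = 1;
        setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &on, sizeof on);

        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(9);                  /* "discard" port, by convention */
        dst.sin_addr.s_addr = htonl(INADDR_BROADCAST);

        sendto(sock, pkt, sizeof pkt, 0, (struct sockaddr *)&dst, sizeof dst);
        close(sock);
        return 0;
    }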


I’ve found from building a few Ryzen machines now that while the CPU itself is amazing, the ecosystem around it remains less polished than Intel's.

Hopefully this changes with Zen 3 looking like an awesome consumer chip.


They “just” miniaturised the rack into a chip :)


Or at any rate into the chassis; all those PCIe lanes have to go to devices that are mounted somewhere.


We have disk shelves now; wouldn't the next step be a "network shelf"? And probably a "GPU compute shelf" later on, with some kind of high-speed, high-bandwidth backbone tying them all together. Turn the whole rack into a glorified backplane. ¯\_(ツ)_/¯ Sounds like something IBM would do.


I recently had one of these realizations when looking at https://research.google.com/archive/mapreduce-osdi04-slides/...


We've had pretty high parallelism in server CPUs for a while, e.g. the UltraSPARC T1 had 32 threads in 2005, and later generations reached 128 threads (T5). But a high number of threads and cores is a worse solution than sustaining high single-threaded speedups, as most programs are not multithreaded because parallel programming is so hard.


The T1 was unusual at the time. It did have 32 threads, but provided by 8 VERY anemic cores. It was only competitive in some very niche scenarios. A "real" 8-core SMP Sun box was still pretty much a full rack and 6 figures.


I remember when the big company I was working for started using T1s for databases (I guess driven by Oracle marketing & a change of licensing model) and asked us "consumers" to start using those systems.

I was skeptical, so I downloaded some SPECint benchmark results ( https://www.spec.org/benchmarks.html ) for the T1, POWER, and Xeon, compared them, and thought "mh, probably a bad idea". I then ended up having to invest quite some time convincing my management to keep using "normal" servers.

On the other hand, later, I had a bit of fun hearing stories from other colleagues about how slow those T1 machines were once they started running their normal DB workloads on them => after months of everybody complaining about bad performance, everybody went back to normal servers => a lot of time & money (& nerves) spent for nothing.

Normal IT politics, I guess.


>Normal IT politics, I guess.

Lunch driven procurement.

In all seriousness, someone is trying to make an app to benchmark software[0]. You might be interested in that thread. Thanks for the link.

- [0]: https://news.ycombinator.com/item?id=25010373


Yep. We've historically had a lot of CPUs or multi-CPU machines that were good for running "embarrassingly parallel" loads but wasted capacity for general-purpose use, a bit like this one! More hardware threads or cores has traditionally not been good news from a software POV.

This kind of machine is nice to look at from afar, but for most apps, getting anywhere near a 64-fold speedup on this would mean scrapping the code base and doing 1 failed rewrite followed by 1 marginally successful one, taking up ruinous amounts of calendar time and engineering resources.

But of course it's different now than in 2005, because today we can't do any better.


It’s not, though, in the server space. Many server workloads are inherently parallel because they support multiple concurrent users. Think database servers or web servers.


The speed is strictly worse: no application works better in this scenario, and most are worse off (latency).


>strictly worse

I'm pretty sure double the core count at 90% of the speed will get you better performance in _a lot_ of scenarios.


By sustaining single-core speedups, I meant big speedups, like we were observing in the days before mainstream CPUs resorted to multicore. It was rather more than 10%.

And we'd still be able to make processors with lots of slow threads, of course, for applications where that's cost- or power-effective; that's comparatively very easy.


Yes, having one core at 2x clock is definitely better than having two cores at 1x clock, I'd agree with that.

These days, however, the difference between the fastest core performance on low-core-count processors and on high-core-count processors is rather a lot less than 50% (comparing like for like, e.g. AMD to AMD and Intel to Intel).
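To put rough numbers on the tradeoff: under Amdahl's law, the throughput of n cores at clock c on a workload with parallel fraction p is c / ((1-p) + p/n). A quick C sketch (the parallel fractions are made up for illustration) comparing 1 core at full clock against 2 cores at 90% clock:

    #include <stdio.h>

    /* Amdahl's law: throughput of n cores at clock c when a fraction
       p of the work is parallelizable. */
    static double throughput(double c, int n, double p) {
        return c / ((1.0 - p) + p / n);
    }

    int main(void) {
        /* Hypothetical parallel fractions, for illustration only. */
        const double p[] = {0.0, 0.5, 0.9, 0.99};
        for (int i = 0; i < 4; i++) {
            printf("p=%.2f  1 core @ 1.0x: %.2f   2 cores @ 0.9x: %.2f\n",
                   p[i], throughput(1.0, 1, p[i]), throughput(0.9, 2, p[i]));
        }
        return 0;
    }

Even at p = 0.5 the slower pair already wins (1.20 vs 1.00), while at p = 0 the single fast core does (1.00 vs 0.90); the crossover sits at p = 0.2.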


I really wish they would make the model numbers more distinct. "EPYC 7xxx" seems to encompass everything from 1st gen to 3rd gen; I have to look it up every single time.


I think the last digit is the generation number. No idea if the middle two digits have any rhyme or reason, other than bigger is more expensive.


Correct, see https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/4

    EPYC = Brand
    7 = 7000 Series
    25-74 = Dual Digit Number indicative of stack positioning / performance (non-linear)
    1/2 = Generation
    P = Single Socket, not present in Dual Socket
When scanning the product stack or test results, it's really annoying to look for the last digit, though. But I guess Intel does that too with the whole vX thing at the end.
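If you're scanning a lot of SKUs, a throwaway C sketch of that decoding (a hypothetical helper, just following the breakdown above):

    #include <stdio.h>
    #include <string.h>

    /* Decode an EPYC model string per the scheme above.
       Hypothetical helper, illustration only. */
    static void decode_epyc(const char *m) {
        size_t len = strlen(m);
        int single = (len > 0 && m[len - 1] == 'P');
        if ((single ? len - 1 : len) != 4) {
            printf("%s: not a 4-digit EPYC model\n", m);
            return;
        }
        printf("%s: %c000 series, stack position %c%c, gen %c, %s socket\n",
               m, m[0], m[1], m[2], m[3], single ? "single" : "dual");
    }

    int main(void) {
        decode_epyc("7742");   /* Rome, gen 2, dual-socket capable */
        decode_epyc("7402P");  /* Rome, gen 2, single-socket only  */
        return 0;
    }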


I agree, but I think that's on purpose, as part of their "finally get some share in server space" push: avoid scaring any non-tech decision maker into thinking this is yet another new product that needs to be thought about before signing off on it.

In their other markets they have very clear generation naming, even going out of their way to give naming-room to their laptop chips.

AMD really wants to be in the server space because that's the safe spot (in revenue) for the next time Intel comes back, and that's the one they missed the last time they were ahead.


Damn, I wish we could simply buy a single-socket AMD EPYC instead of a dual-socket Intel Xeon. Compared on last-gen Rome, scientific workloads compiled with the Intel compiler that relied heavily on AVX-512 performed piss-poorly on AMD, while other regular stuff flew...

It is too hard to buy these awesome chips and have them NOT run the serious stuff our group wants to run. I wish AMD would come up with a solution that makes code compiled for AVX-512 magically work :-(


If you can recompile but the code was written to only target AVX-512 you can use https://github.com/simd-everywhere/simde to near effortlessly map the intrinsics to AVX2 (or lower).
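The usual pattern (a minimal sketch, assuming SIMDe's headers are on the include path; compile with -mavx2, no AVX-512 hardware needed) is to define SIMDE_ENABLE_NATIVE_ALIASES before the include, so existing _mm512_* calls compile unchanged:

    /* Compile with e.g. `cc -O2 -mavx2 demo.c`; no AVX-512 hardware required. */
    #define SIMDE_ENABLE_NATIVE_ALIASES
    #include <simde/x86/avx512.h>

    #include <stdio.h>

    int main(void) {
        float a[16], b[16], out[16];
        for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 2.0f; }

        /* Ordinary AVX-512 intrinsics; SIMDe lowers each 512-bit op to
           AVX2/SSE/scalar equivalents when AVX-512 isn't available. */
        __m512 va = _mm512_loadu_ps(a);
        __m512 vb = _mm512_loadu_ps(b);
        __m512 vc = _mm512_mul_ps(va, vb);
        _mm512_storeu_ps(out, vc);

        printf("out[15] = %f\n", out[15]); /* 30.0 */
        return 0;
    }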


Recompiling is not an option? It also seems weird that the binary only ships with an AVX-512 codepath and not AVX2 as well.

I remember there also being an issue where the MKL libraries linked into various software (e.g. Matlab) wouldn't use the appropriate AVX2 codepaths on 'unsupported' AMD processors and instead fell back on slow, non-vectorized code.


I think the key issue here is the use of the Intel compiler on a non-Intel product. Can you not recompile your code with GCC or LLVM? Are the performance metrics so bad that they are not options? AMD has made the system... engineer your solution if you want to use it :)


What's the workload if you don't mind me asking? Presumably the difference isn't just down to the smaller width of AVX2 but new instructions in the AVX-512 family?


Compile it for AVX2 and you might still win.


I am wondering if we are pushing core counts too far.

Do we need all those cores in a single machine? Do we know how to use them? Besides some scientific workloads, how are they used?

I guess the standard answer is to virtualize them and sell them in chunks as "cloud", but then why are prices still so high?

I am not complaining, I just feel that something is off, and I don't understand what it is.


>and I don't understand what it is.

The prices feel off mostly because the CPU is not the only cost; in many cases it isn't even the biggest cost in a server. Memory is, especially ECC memory. Memory price per GB hasn't fallen at all, and in many cases it's even more expensive. It's only in the last few years that you started seeing "high-CPU instances", because it's cheaper to include a higher core count in the plan than to increase the memory.

And because of the increase in VMs, core counts, and tenants, servers are now equipped with multiple high-speed network connections. If you look at it as cost per network card / port / bandwidth / server / tenant, prices haven't fallen much either. Add in that your cloud / VPS instance is paying for electricity, co-location, and networking; none of these follow Moore's Law, and all are part of the TCO. And if you include the continuous rise of wages, things don't look all that good.

But Zen 2 is barely deployed, so its effect on the industry isn't here yet. Linode already uses Zen 2 and is already offering better price plans than DO or other medium cloud vendors. I expect Zen 3 to be big with many of the cloud vendors, and you should see that reflected in pricing.


Disks have gotten ridiculously fast. One NVMe card can move 5GB/sec or 1,000,000 IOPS over 4 PCIe lanes. With 128 cores, each core needs to parse 39MB/sec, or dispatch 7.8 thousand I/Os per second, for each drive in the system.

This processor has 64 PCIe lanes, and a drive uses 4. If you use half the lanes for disks, you'd need to multiply my numbers by 8. So, in addition to running application logic and talking to the network, each core needs to parse 312MB/sec or dispatch 62K I/Os per second.

Those numbers are certainly achievable, but hitting them requires some care. So, this processor is fairly well-balanced for moderately compute intensive workloads, such as query processing, or maybe frontend logic (which would probably want a machine with more network and less disk).

Note that the 128 cores are just to keep up with storage that easily fits in 1U or a fairly compact desktop tower case.

If you measure speed as I/O-to-compute density, these CPUs are actually slow and bulky by historical standards.
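Spelled out, the per-core budget arithmetic above (all inputs are the assumptions stated in this comment):

    #include <stdio.h>

    int main(void) {
        /* Assumptions from the comment above. */
        const double gb_per_s = 5.0;   /* bandwidth per NVMe drive      */
        const double iops     = 1e6;   /* IOPS per NVMe drive           */
        const int    cores    = 128;   /* 2x 64-core EPYC               */
        const int    drives   = 8;     /* half of 64 lanes, 4 per drive */

        double mb_core   = gb_per_s * 1000.0 / cores;  /* MB/s per core, 1 drive */
        double iops_core = iops / cores;               /* IOPS per core, 1 drive */

        printf("one drive:  %.0f MB/s and %.1fK IOPS per core\n",
               mb_core, iops_core / 1e3);
        printf("%d drives:  %.0f MB/s and %.0fK IOPS per core\n",
               drives, mb_core * drives, iops_core * drives / 1e3);
        return 0;
    }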


Some workloads can utilize almost all the power; some cannot. Server workloads can usually scale almost indefinitely. And we don't really have a choice: single-core performance doesn't scale anymore (most of the recent advancements come from implicit parallelization using speculative execution).


VM data centers. One of these machines can be loaded up with RAM and host tons and tons of instances without over-provisioning. Some of those VMs will be mostly idle, but others will be at full workload, depending on the tenants.

Also, there are sites like Source Hut or Travis that run thousands upon thousands of CI jobs every minute. I worked at a place where we were resizing thousands of photos and hundreds of incoming videos every minute. One of these could easily replace 10-20 nodes in a transcoding farm for companies that do these types of workloads on physical servers.

There are tons of commercial applications that would keep these things busy during business hours, and some that would also keep them busy 24/7.


No we don’t know how to use them. The abilities of single systems have far outstripped our ability to manage all those resources. It takes a lot of work to drive a system like this to the rails, and very few people even bother trying. The closest most people come to exploiting a machine this large is by carving it up into normal-sized virtual machines with independent operating systems. This makes the utilization look acceptable but in reality much of the resources are just being squandered.


It's easier to scale to multiple cores than to multiple machines, and cloud services have been doing multiple machines for decades.

If this weren't needed or useful, we also wouldn't see companies buying more than a single rack (you still want redundancy and hot spares, of course), but they do.

So yes we know how to use all these cores. You take what's currently 4 machines and consolidate it to a single machine. Or similarly 40 machines to 10 machines, etc...


I think you're exactly right. These cores are designed to be sold en masse to the cloud providers, who are looking for density in the rack, lowest total cost of operation (up front cost is nearly irrelevant), and above all power and thermal efficiency.

AMD knows they have some serious advantages there, especially on the efficiency front, with Zen3, and they're going pedal to the metal.


databases and webservers can use more cores no problemo


Prices are so high because cloud vendors have no incentive to lower them. It’s even worse with bandwidth. Cloud bandwidth prices are rape.


Please don't use graphic terms like that to describe mundane activities


Epic: Faster Than Four Intel Xeon Platinum CPUs


2x EPYC 7713 are faster than 4x Xeon Platinums


Intel is absolutely devastated by AMD


How many milliseconds does this processor need to compile the entire Debian distro? :D


The newer benchmark would be "how many hours does it take to compile Chromium?"


I did see this benchmark. For a consumer CPU, it looks pretty good. Looks like 46.3 minutes.

https://youtu.be/72AHENDeTEI?t=892

I guess it's actually pretty good for any CPU class.


Or: “how many milliseconds does it take to echo a keystroke in slack”?


I believe a top-of-the-line Threadripper compiles Chromium in 30+ minutes.



