The value of AVX512 is more in the extra instructions it provides than in its (somewhat illusory) wider data path.
No doubt the 5nm or 3nm chips will actually deliver the 512-bit path. It must be a crushing burden to come up with actually helpful uses for another billion transistors. And it must be frustrating knowing those AVX512 data paths usually sit idle because compilers don't know how to use them.
You have to be careful what you're talking about when you talk about AVX-512 performance; it's more complicated than it first appears (I learned that the hard way!) First, the instruction set and the datapath width are separate things: if your CPU supports AVX512VL, you can use AVX-512 instructions with 128- and 256-bit vector registers. That's nice, because the instruction set has a bunch of good stuff in it, and the narrower registers avoid the harshest downclock penalties on some architectures (a different "power license" in Intel terms, which on newer systems varies per core.)
But it's also very uarch dependent. Ice Lake has only a single 512-bit FMA unit, for instance, while some Skylake-SP cores have two (at the moment, Skylake/Cascade Lake are the only AVX-512 chips with two FMAs.) Power usage is presumably lower than with two FMAs, but FMA throughput is halved. And if you use 512-bit vectors on only one core, Ice Lake downclocks very, very little, while with 256-bit vectors it does not downclock at all, so it's maybe worth doing. On the other hand, with all cores active there is a penalty, but the downclock for 512-bit vectors is no worse than for 256-bit ones, so if you're paying it anyway you're no worse off using 512-bit instead of 256-bit when that's an option. In effect that gives you 2x the data throughput at the same power usage and the same transition cost, which is a net win.[1] Thus, for any kind of batch work where you're willing to pay the transition cost, AVX-512 will probably be a free upgrade on Ice Lake. Ice Lake-SP will presumably change this up (two FMAs at minimum), so who knows what will happen then.
However, AVX transitions in general remain expensive (possibly millions of cycles, depending on many factors), so you may not want to use them for very short batch work, and you should think carefully about wide vectors on older CPUs. It's certainly possible that some code will perform worse in practice on some workloads. But that was always true, e.g. with AVX2 on Haswell[2], where even a single instruction triggers a full-chip downclock; nobody complained as much about that, presumably because nobody wrote lots of catchy blogs about it :)
And of course, the situation is different for AMD CPUs. Someone else can fill in the details.
TL;DR: AVX transitions are expensive and microarchitecture dependent. Ice Lake handles them pretty well, despite being mobile-only for now. Presumably this will get better with uarch refinements and improved process nodes from Intel, and from AMD too. Do your own experiments and validate your own hypotheses.