Taking my 7980XE as an example:
When it runs non-AVX-512 loads, I currently have it set to run at 4.1 GHz (all-core).
When running AVX-512-heavy loads, it instead runs at 3.6 GHz -- and tends to get much hotter (70-80C instead of 50-60C). 3.6 GHz is a mild overclock; Silicon Lottery reports that 100% of chips can achieve that speed under AVX-512 loads.[1]
Comparing programs that do the same thing (e.g., Hamiltonian Monte Carlo where the likelihood function has or has not been vectorized), the AVX-512 version is far faster than scalar code, and routinely 50%+ faster than AVX2.
The AVX-512 instruction set itself also provides conveniences that make it easier to vectorize explicitly, even if most compilers don't take advantage of them on their own. Masked load and store operations in particular (compilers are better about using masking to handle branches).
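To give a rough sense of why masking matters, here's a minimal sketch using LoopVectorization.jl (the package and function name are just for illustration): on AVX-512, the loop remainder can be handled with masked loads rather than falling back to a scalar tail, so lengths that aren't a multiple of the vector width still run fully vectorized.

    using LoopVectorization  # illustrative choice; any masked-vectorization approach works

    function dot_masked(a, b)
        s = zero(eltype(a))
        # The main body uses packed loads; on AVX-512 hardware the leftover
        # iterations are handled with masked loads instead of a scalar remainder loop.
        @turbo for i in eachindex(a, b)
            s += a[i] * b[i]
        end
        return s
    end

    dot_masked(rand(1001), rand(1001))  # length deliberately not a multiple of 8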
On why AVX-512 vs a graphics card:
I need double precision, and my code routinely has maximum widths smaller than the 32 or 64 a graphics card would want to compute in parallel.
Yeah, people tend to completely exaggerate the impact of throttling from AVX-512. It's only an issue when you do short bursts of AVX-512 and the rest is not AVX-512. If you do math and your math can be done in AVX-512, then even with throttling it's going to be substantially faster; the modest clock penalty is small next to the doubling of vector width over AVX2.

That it runs hotter doesn't concern me at all. Intel's claimed safe Tjunction is something like 105C. EEs tend to take published component specifications seriously (e.g., your 1000 V diode is guaranteed to withstand at least 1 kV of reverse voltage), so I trust Intel when they say things are fine up to that temperature. Even beyond that it won't burn out, it'll just thermal throttle.
I've been using Julia. I've been working on a front end meant to help specify vectorized models and their gradients. It is alpha-quality software (far from production ready), but here is the github:
https://github.com/chriselrod/ProbabilityModels.jl
In the example I give there, the logdensity and gradient evaluation was about 25x faster than Stan, and sampling was about 20x faster.
A simulation fitting many data sets for my dissertation took about 9 hours. 20x is the difference between running overnight and taking a week.
This interleaving is going to cause problems for an autovectorizer. To get a SIMD vector of the scalars, you'd probably have to load two vectors and then blend them.
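A toy sketch of the layout issue (the struct and functions here are hypothetical, just to illustrate): with the fields interleaved, building a vector of only the `a` values means loading two vectors' worth of data and blending/shuffling the wanted lanes together (or gathering), whereas a planar layout gives plain contiguous loads.

    # Interleaved ("array of structs") layout: a1, b1, a2, b2, ... in memory.
    struct AB
        a::Float64
        b::Float64
    end

    # To vectorize this, the compiler would have to load two vectors' worth of
    # data and blend out the `a` lanes (or gather), so it often stays scalar.
    sum_a(xs::Vector{AB}) = sum(x -> x.a, xs)

    # Planar layout: the `a` values are contiguous, so packed loads just work.
    sum_a(as::Vector{Float64}) = sum(as)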
Even with arrays of doubles, I found Eigen's fixed size arrays to get about 3-8x worse performance than my Julia library (for Mx32 * 32xN products, with M and N each ranging over 3,...,32):
https://bayeswatch.org/2019/06/06/small-matrix-multiplicatio...
I compiled the Eigen benchmarks with:
g++ -O3 -fno-signed-zeros -fno-trapping-math -fassociative-math -march=native -mprefer-vector-width=512 -shared -fPIC -I/usr/include/eigen3 eigen_mul.cpp -o libeigenmul.so
> here is the github: https://github.com/chriselrod/ProbabilityModels.jl. In the example I give there, the logdensity and gradient evaluation was about 25x faster than Stan, and sampling was about 20x faster.
that looks pretty cool, though I don't yet know enough Julia to understand all of it. The speedups make sense given that Stan's compiler/math lib doesn't do much in the way of smart data layout. I would still keep in mind that the metric worth using for benchmarking is the number of effective samples per second, and this also depends on the HMC variant you use.
> Eigen's fixed size arrays to get about 3-8x worse performance than my Julia library
seems unsurprising that Julia can specialize a lot better than verbose C++ templating, no? (still, good job, very worth checking out)
> I was getting errors when specifying -DEIGEN_ENABLE_AVX512
I used this flag with Eigen 3.3.1, I think, on GCC 6 or 7. This was for Xeon Phi, so I tried to use icc but despite supporting C++11 it doesn't handle Stan or Eigen's template metaprogramming.
This is all the more reason to use Julia, but my graduate student days are long past..
> I would still keep in mind that the metric worth using for benchmarking is the number of effective samples per second, and this also depends on the HMC variant you use.
I was getting similar effective sample sizes per sample in both after switching DynamicHMC.jl (the HMC backend library I'm using) from the dense mass matrix it uses by default to a diagonal mass matrix, like Stan uses.
Given how common it is for folks to run Stan overnight or for a week to study prior sensitivity, internal coverage, type I and II errors, etc., via Monte Carlo, I think a focus on speed is worthwhile.
> seems unsurprising that Julia can specialize a lot better than verbose C++ templating, no? (still, good job, very worth checking out)
The C++ library Blaze did a lot better than Eigen, but still not as well as my Julia library.
But yes, Julia's metaprogramming is much easier to work with. Julia expressions are Julia objects that you can manipulate like anything else, so I can write all the functions I want describing how to generate matmul kernels as a function of matrix size and CPU info, and how to loop over them.
That approach feels much more straightforward. I haven't looked at the codebases of Eigen or Blaze, nor am I that familiar with template metaprogramming. But I'd guess they define matmul recursively for arbitrary fixed sizes, and then have some templates defined for specific sizes (the kernels) -- or ideally have some clever way of generating the kernels from there.
Regardless, I agree that this is much easier in Julia. Aggressive specialization is also better aligned with Julia's compilation model in general, because methods get compiled just before they're used. Defining a million possible specializations doesn't have the cost of compiling a million specializations.
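To make the "expressions are objects" point concrete, here is a toy version of the idea (the function and names are mine, not what ProbabilityModels.jl actually generates): a generated function that builds a fully unrolled multiply for whatever compile-time sizes it is handed.

    # Hypothetical sketch: build an unrolled C = A*B kernel whose code depends
    # on the compile-time sizes M, K, N passed as Val types.
    @generated function unrolled_matmul!(C::AbstractMatrix, A::AbstractMatrix,
                                         B::AbstractMatrix,
                                         ::Val{M}, ::Val{K}, ::Val{N}) where {M,K,N}
        body = Expr(:block)
        for n in 1:N, m in 1:M
            acc = :(zero(eltype(C)))
            for k in 1:K
                acc = :(muladd(A[$m, $k], B[$k, $n], $acc))  # fused multiply-add chain
            end
            push!(body.args, :(@inbounds C[$m, $n] = $acc))
        end
        push!(body.args, :(return C))
        body  # the returned expression becomes the method body for these sizes
    end

    # Each (M, K, N) combination compiles its own kernel, but only when first called.
    unrolled_matmul!(zeros(3, 2), rand(3, 4), rand(4, 2), Val(3), Val(4), Val(2))

Because the expression is just data, the same machinery that writes these kernels can also decide blocking and unrolling based on the matrix sizes and the CPU's vector width.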
> This is all the more reason to use Julia, but my graduate student days are long past..
I'm defending this week, and next Monday will be my first day in an industry job.
They expressed openness to Julia, but my biggest fear is that they'll renege, and I'll only be able to work on or use Julia in my spare time at home.
That's an interesting comment. We've always used Stan's default of a diagonal mass matrix, but I think we'd benefit from mixed metrics, which don't seem possible in Stan but look somewhat doable in some of the HMC libraries in Julia.
> Given how common it is for folks to run Stan overnight or for a week to study prior sensitivity
Yes, we changed the wall-time limit on our Slurm cluster to support Stan jobs running up to a week long, and we have used multiple millions of core-hours on this. Stan still isn't so shabby, but it's a hard problem.
> I'm defending this week, and next Monday will be my first day in an industry job.
Good luck and congrats on the job. You'll probably have to bite your tongue and look for opportunities where Julia's advanced compilation model (as you described well above) is going to more than pay for the cost of deployment/extra language etc.
AVX-512 is expensive. I believe that if you have an AVX-512-heavy workload, it can cause the processor to throttle.