Taking my 7980XE as an example:
When it runs non-AVX-512 loads, I currently have it set to run at 4.1 GHz (all-core).
When running AVX-512-heavy loads, it instead runs at 3.6 GHz -- and tends to get much hotter (70-80C instead of 50-60C). 3.6 GHz is a mild overclock; Silicon Lottery reports that 100% of chips can achieve that speed under AVX-512 loads.[1]
Comparing programs that do the same thing (e.g., Hamiltonian Monte Carlo where the likelihood function has or has not been vectorized), the AVX-512 version is far faster than scalar code, and routinely 50%+ faster than AVX2.
The AVX-512 instruction set itself also provides conveniences that make it easier to vectorize explicitly, even if most compilers don't take advantage of them on their own. Masked load and store operations in particular (compilers are better about using masking to handle branches).
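To give a rough sense of why masking matters, here's a minimal sketch using LoopVectorization.jl (the package and function name are just for illustration): on AVX-512, the loop remainder can be handled with masked loads rather than falling back to a scalar tail, so lengths that aren't a multiple of the vector width still run fully vectorized.

    using LoopVectorization  # illustrative choice; any masked-vectorization approach works

    function dot_masked(a, b)
        s = zero(eltype(a))
        # The main body uses packed loads; on AVX-512 hardware the leftover
        # iterations are handled with masked loads instead of a scalar remainder loop.
        @turbo for i in eachindex(a, b)
            s += a[i] * b[i]
        end
        return s
    end

    dot_masked(rand(1001), rand(1001))  # length deliberately not a multiple of 8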
On why AVX-512 vs a graphics card:
I need double precision, and my code routinely has maximum widths smaller than the 32 or 64 a graphics card would want to compute in parallel.
Yeah, people tend to completely exaggerate the impact of throttling from AVX-512. It's only an issue when you do short bursts of AVX-512 and the rest is not AVX-512. If you do math and your math can be done in AVX-512, then even with throttling it's going to be substantially faster; the modest clock penalty is small next to the doubling of vector width over AVX2.

That it runs hotter doesn't concern me at all. Intel's claimed safe Tjunction is something like 105C. EEs tend to take published component specifications seriously (e.g., your 1000 V diode is guaranteed to withstand at least 1 kV of reverse voltage), so I trust Intel when they say things are fine up to that temperature. Even beyond that it won't burn out, it'll just thermal throttle.
I've been using Julia. I've been working on a front end meant to help specify vectorized models and their gradients. It is alpha-quality software (far from production ready), but here is the github:
https://github.com/chriselrod/ProbabilityModels.jl
In the example I give there, the logdensity and gradient evaluation was about 25x faster than Stan, and sampling was about 20x faster.
A simulation fitting many data sets for my dissertation took about 9 hours. 20x is the difference between running overnight and taking a week.
This interleaving is going to cause problems for an autovectorizer. To get a SIMD vector of the scalars, you'd probably have to load two vectors and then blend them.
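A toy sketch of the layout issue (the struct and functions here are hypothetical, just to illustrate): with the fields interleaved, building a vector of only the `a` values means loading two vectors' worth of data and blending/shuffling the wanted lanes together (or gathering), whereas a planar layout gives plain contiguous loads.

    # Interleaved ("array of structs") layout: a1, b1, a2, b2, ... in memory.
    struct AB
        a::Float64
        b::Float64
    end

    # To vectorize this, the compiler would have to load two vectors' worth of
    # data and blend out the `a` lanes (or gather), so it often stays scalar.
    sum_a(xs::Vector{AB}) = sum(x -> x.a, xs)

    # Planar layout: the `a` values are contiguous, so packed loads just work.
    sum_a(as::Vector{Float64}) = sum(as)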
Even with arrays of doubles, I found Eigen's fixed size arrays to get about 3-8x worse performance than my Julia library (for Mx32 * 32xN products, with M and N each ranging over 3,...,32):
https://bayeswatch.org/2019/06/06/small-matrix-multiplicatio...
I compiled the Eigen benchmarks with:
g++ -O3 -fno-signed-zeros -fno-trapping-math -fassociative-math -march=native -mprefer-vector-width=512 -shared -fPIC -I/usr/include/eigen3 eigen_mul.cpp -o libeigenmul.so
> here is the github: https://github.com/chriselrod/ProbabilityModels.jl. In the example I give there, the logdensity and gradient evaluation was about 25x faster than Stan, and sampling was about 20x faster.
that looks pretty cool, though I don't yet know enough Julia to understand all of it. The speedups make sense given that Stan's compiler/math lib doesn't do much in the way of smart data layout. I would still keep in mind that the metric worth using for benchmarking is the number of effective samples per second, and this also depends on the HMC variant you use.
> Eigen's fixed size arrays to get about 3-8x worse performance than my Julia library
seems unsurprising that Julia can specialize a lot better than verbose C++ templating, no? (still, good job, very worth checking out)
> I was getting errors when specifying -DEIGEN_ENABLE_AVX512
I used this flag with Eigen 3.3.1, I think, on GCC 6 or 7. This was for Xeon Phi, so I tried to use icc but despite supporting C++11 it doesn't handle Stan or Eigen's template metaprogramming.
This is all the more reason to use Julia, but my graduate student days are long past..
> I would still keep in mind that the metric worth using for benchmarking is the number of effective samples per second, and this also depends on the HMC variant you use.
I was getting similar effective sample sizes per sample in both after switching DynamicHMC.jl (the HMC backend library I'm using) from the dense mass matrix it uses by default to a diagonal mass matrix, like Stan uses.
Given how common it is for folks to run Stan overnight or for a week to study prior sensitivity, internal coverage, type I and II errors, etc., via Monte Carlo, I think a focus on speed is worthwhile.
> seems unsurprising that Julia can specialize a lot better than verbose C++ templating, no? (still, good job, very worth checking out)
The C++ library Blaze did a lot better than Eigen, but still not as well as my Julia library.
But yes, Julia's metaprogramming is much easier to work with. Julia expressions are Julia objects that you can manipulate like anything else, so I can write all the functions I want describing how to generate matmul kernels as a function of matrix size and CPU info, and how to loop over them.
That approach feels much more straightforward. I haven't looked at the codebases of Eigen or Blaze, nor am I that familiar with template metaprogramming. But I'd guess they define matmul recursively for arbitrary fixed sizes, and then have some templates defined for specific sizes (the kernels) -- or ideally have some clever way of generating the kernels from there.
Regardless, I agree that this is much easier in Julia. Aggressive specialization is also better aligned with Julia's compilation model in general, because methods get compiled just before they're used. Defining a million possible specializations doesn't have the cost of compiling a million specializations.
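To make the "expressions are objects" point concrete, here is a toy version of the idea (the function and names are mine, not what ProbabilityModels.jl actually generates): a generated function that builds a fully unrolled multiply for whatever compile-time sizes it is handed.

    # Hypothetical sketch: build an unrolled C = A*B kernel whose code depends
    # on the compile-time sizes M, K, N passed as Val types.
    @generated function unrolled_matmul!(C::AbstractMatrix, A::AbstractMatrix,
                                         B::AbstractMatrix,
                                         ::Val{M}, ::Val{K}, ::Val{N}) where {M,K,N}
        body = Expr(:block)
        for n in 1:N, m in 1:M
            acc = :(zero(eltype(C)))
            for k in 1:K
                acc = :(muladd(A[$m, $k], B[$k, $n], $acc))  # fused multiply-add chain
            end
            push!(body.args, :(@inbounds C[$m, $n] = $acc))
        end
        push!(body.args, :(return C))
        body  # the returned expression becomes the method body for these sizes
    end

    # Each (M, K, N) combination compiles its own kernel, but only when first called.
    unrolled_matmul!(zeros(3, 2), rand(3, 4), rand(4, 2), Val(3), Val(4), Val(2))

Because the expression is just data, the same machinery that writes these kernels can also decide blocking and unrolling based on the matrix sizes and the CPU's vector width.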
> This is all the more reason to use Julia, but my graduate student days are long past..
I'm defending this week, and next Monday will be my first day in an industry job.
They expressed openness to Julia, but my biggest fear is that they'll renege, and I'll only be able to work on or use Julia in my spare time at home.
That's an interesting comment. We've always used Stan's default of a diagonal mass matrix, but I think we'd benefit from mixed metrics, which don't seem possible in Stan but look somewhat doable in some of the HMC libraries in Julia.
> Given how common it is for folks to run Stan overnight or for a week to study prior sensitivity
Yes, we changed the wall-time limit on our Slurm cluster to support Stan jobs running up to a week long, and we have used multiple millions of core-hours on this. Stan still isn't so shabby, but it's a hard problem.
> I'm defending this week, and next Monday will be my first day in an industry job.
Good luck and congrats on the job. You'll probably have to bite your tongue and look for opportunities where Julia's advanced compilation model (as you described well above) is going to more than pay for the cost of deployment/extra language etc.
AVX-512 is expensive. I believe that if you have an AVX-512-heavy workload, it can cause the processor to throttle.