Generally you are better off coding with "intrinsics", compiler extensions that represent the instructions more symbolically, if in fact the compiler offers what you need.
I am not sure the really interesting AVX-512 instructions have intrinsics yet. For those it's asm or nothing.
Potentially both. Most compilers have vectorization optimizations if you compile for an architecture that supports it.
However, a lot of software is compiled on one machine to be run on many possible architectures, so it targets a lowest-common-denominator arch like baseline x86-64. That baseline guarantees some SIMD (SSE2) but not AVX-512.
So if a developer wants to ensure those instructions are used where they're supported, they'll write two code paths: one path explicitly calls the AVX-512 instructions via compiler intrinsics, and the other is plain code, leaving the compiler to decide how to lower it to baseline x86-64 instructions.
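That two-path split can be sketched roughly like this; the function names are made up for illustration, and the runtime check uses `__builtin_cpu_supports`, a GCC/Clang extension:

```c
#include <stddef.h>

/* Fallback path: plain C. The compiler lowers this to whatever the
   baseline target (e.g. x86-64 + SSE2) allows. */
static float dot_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

float dot(const float *a, const float *b, size_t n) {
#if defined(__x86_64__) && defined(__GNUC__)
    /* Runtime check: does the CPU we actually landed on have AVX-512F? */
    if (__builtin_cpu_supports("avx512f")) {
        /* A real program would call an intrinsics-based dot_avx512()
           here, compiled with -mavx512f just for that file. This is a
           placeholder so the sketch stays self-contained. */
        return dot_scalar(a, b, n);
    }
#endif
    return dot_scalar(a, b, n);
}
```

Keeping the intrinsics path in its own translation unit compiled with -mavx512f matters: if the whole program were built that way, the compiler could emit AVX-512 anywhere and the binary would crash on older CPUs.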
Thanks for that! So it sounds like, if I purchase a chip that supports AVX-512, and run an operating system and compiler that support it, I can write "plain old C code" with a minimal number of compiler arguments and compile that code on my machine (i.e. not just run someone else's binary), and then the full power of AVX-512 is right there waiting for me? :)
A compiler turning C(++) code into SIMD instructions is called "autovectorization". In my experience this works for simple loops such as dot products (even that requires special compiler flags to enable FMA and reorders), but unfortunately the wheels often fall off for more complex code.
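The dot product mentioned above is the classic case; a sketch of the kind of loop that typically autovectorizes at -O3, but only gets reassociated into FMAs with something like -ffast-math, because floating-point addition isn't associative and the compiler needs permission to reorder the sum:

```c
#include <stddef.h>

/* Simple enough that GCC/Clang usually vectorize it at -O3.
   With -mavx512f -ffast-math the compiler can also split the
   accumulator and fuse the multiply-add; without those flags it
   must preserve the exact left-to-right summation order. */
float dot_product(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Anything with control flow, gathers, or cross-lane dependencies inside the loop body is where, as noted above, the wheels tend to fall off.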
Also, I haven't seen the compiler generate the more exotic instructions.
if you are targeting more than one specific platform, do you like, include the immintrin.h header and use #ifdef to conditionally use avx512 if it's available on someone's platform?
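That is one common compile-time pattern; a sketch (the function name is made up, and the AVX-512 branch only exists when the file is built with -mavx512f, which defines __AVX512F__):

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
#if defined(__AVX512F__)
    /* 16 floats per 512-bit vector; unaligned loads/stores. */
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
#endif
    /* Scalar remainder, and the whole loop on non-AVX-512 builds. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Note this selects the path at compile time, so a binary built this way still needs to be compiled per target; runtime dispatch (cpuid checks) is what you'd use for one binary that adapts to the machine it runs on.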
It would be simpler to use the portable intrinsics from github.com/google/highway (disclosure: I am the main author).
You include a header, and use the same functions on all platforms; the library provides wrapper functions which boil down to the platform's intrinsics.
From what I have seen, this is unfortunately not very useful: it mostly includes operations the compiler can often autovectorize anyway (simple arithmetic). Support for anything more interesting, such as swizzles, seems nonexistent. Also, last I checked it was only available on GCC 11+; has that changed?
I wonder how much compilers could be improved with AI?
I'd imagine outputting optimized AVX code from an existing C for() loop would be much easier than going from a "write me some Python code that..." prompt.
Typically, if it's available, compilers will use the AVX-512 register file. That means you'll see registers like xmm25 and ymm25 (the 128- and 256-bit views of registers 16-31), which can only be encoded with AVX-512. However, compilers actually emitting 512-bit-wide instructions is fairly rare from what I've seen.
In my experience, clang unrolls too much, so you end up spending all your time in the non-vectorized remainder.
Using smaller vectors halves the worst-case size of the non-vectorized remainder, so smaller vectors often give better performance for that reason.
(Unrolling less could have the same effect while decreasing code size, but alas)
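To put numbers on the remainder cost (the trip count here is a made-up example, not from the thread):

```c
#include <stddef.h>

/* Elements left over for the scalar remainder loop when the vectorized
   main loop consumes `stride` elements per iteration. The worst case
   is stride - 1, which is why halving the vector width (or the unroll
   factor) halves the worst-case remainder. */
static size_t remainder_len(size_t n, size_t stride) {
    return n % stride;
}

/* Example with n = 1000 floats:
   - 512-bit vectors (16 lanes) unrolled 4x -> stride 64,
     remainder_len(1000, 64) == 40 scalar iterations;
   - 256-bit vectors (8 lanes) with the same 4x unroll -> stride 32,
     remainder_len(1000, 32) == 8 scalar iterations. */
```

For short loops those scalar iterations can dominate the runtime, which is the effect described above.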