Noob question- must one write avx512 assembly directly by hand, or is this somet...

moloch-hai · on Jan 6, 2023

Generally you are better off coding with "intrinsics", compiler extensions that represent the instructions more symbolically, if in fact the compiler offers what you need.

I am not sure the really interesting AVX-512 instructions have intrinsics yet. For those it's asm or nothing.

habitue · on Jan 6, 2023

Potentially both. Most compilers have vectorization optimizations if you compile for an architecture that supports it.

However, a lot of software is compiled on one machine to be run on potentially many possible architectures, so they target a very lowest common denominator arch like x86-64. This will have some SIMD instructions but (I don't think) AVX-512.

So if a developer wants to ensure those instructions are used if they're supported, they'll write two code paths. one path will explicitly call the avx512 instructions with compiler intrinsics and then the other path will just use the manual code and let the compiler decide how to turn it into x86-64 safe instructions.

readonlybarbie · on Jan 6, 2023

thanks for that! so it sounds like, if i purchase a chip that supports avx512, and run an operating system and compiler that supports avx512, i can write "plain old c code" with a minimal amount of compiler arguments and compile that code on my machine (aka not just running someone else's binary). and then the full power of avx512 is right there waiting for me? :)

janwas · on Jan 6, 2023

A compiler turning C(++) code into SIMD instructions is called "autovectorization". In my experience this works for simple loops such as dot products (even that requires special compiler flags to enable FMA and reorders), but unfortunately the wheels often fall off for more complex code. Also, I haven't seen the compiler generate the more exotic instructions.

Iwan-Zotow · on Jan 6, 2023

You should use Intel intrinsics - generally, they are supported by all compilers.

E.g. https://www.intel.com/content/www/us/en/develop/documentatio...

readonlybarbie · on Jan 6, 2023

if you are targeting more than one specific platform, do you like, include the immintrin.h header and use #ifdef to conditionally use avx512 if it's available on someone's platform?

janwas · on Jan 6, 2023

It would be simpler to use the portable intrinsics from github.com/google/highway (disclosure: I am the main author). You include a header, and use the same functions on all platforms; the library provides wrapper functions which boil down to the platform's intrinsics.

Iwan-Zotow · on Jan 7, 2023

Well, there is SIMD proposal for C++23 with kind-of-reference implementation. But I don't know how well it works for AVX0512

janwas · on Jan 7, 2023

From what I have seen, this is unfortunately not very useful: it mainly only includes operations that the compiler is often able to autovectorize anyway (simple arithmetic). Support for anything more interesting such as swizzles seems nonexistent. Also, last I checked, this was only available on GCC 11+; has that changed?

Iwan-Zotow · on Jan 9, 2023

I think proposed Vc lib is tested under clang as well.

janwas · on Jan 9, 2023

Here is my source: https://github.com/VcDevel/std-simd

Ah, but this repo mentions that the GCC 11 implementation apparently also works with clang: https://github.com/VcDevel/Vc. Thanks!

gulikoza · on Jan 6, 2023

I wonder how much compilers could be improved with AI?

I'd imagine outputting optimized avx code from an existing C for() loop would be much easier than going from a "write me a python code that..." prompt.

brrrrrm · on Jan 6, 2023

typically if it's available, compilers will use the avx512 register file. This means you'll see things like xmm25 and ymm25 (128 and 256 bit registers) and those are avx512 only. However, compilers using 512-wide instructions is kinda rare from what I've seen

celrod · on Jan 6, 2023

You can use `-mprefer-vector-width=512` to use 512 bit vectors, or if you want a particular function to use 512, you could try the min-vector-width attribute: https://clang.llvm.org/docs/AttributeReference.html#min-vect...

In my experience, clang unrolls too much, so you end up spending all your time in the non-vectorized remainder. Using smaller vectors cuts the size of the non-vectorized remainders in half, so smaller vectors often give better performance for that reason. (Unrolling less could have the same effect while decreasing code size, but alas)

readonlybarbie · on Jan 6, 2023

so then, if i want my code to "explicitly" use avx512, i have to do something like this?

``` void myNotOptimizedThing(my_data* d){ _SPECIAL_CPU_MANUFACTURER_0X3D512(d); } ```

edit: and include some header from the manufacturer most likely?

brrrrrm · on Jan 6, 2023

without using intrinsics? `-O3 -march=skylake-avx512 -mprefer-vector-width=512`