Vector length agnostic programming has its own share of problems. I'm not familiar with the RISC-V V extension, but I assume it's similar to ARM's SVE. There's a good critical look at SVE and VLA here: https://gist.github.com/zingaburga/805669eb891c820bd220418ee...
I'm curious why you say they are very different? From where I sit, RVV also supports mask-like predication, and adds two concepts: LMUL (in-HW unrolling of each instruction) plus the ability to limit operations to a given number of elements.
The former is nifty, though intended for single-issue machines, and the latter seems redundant because masks can also do that.
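To make that concrete, here is roughly what a stripmine loop looks like in the ratified RVV C intrinsics (a sketch; the function and buffer names are mine). LMUL=8 ("m8") is the in-HW unrolling, and vsetvl is the per-iteration element-count limiting:

    #include <riscv_vector.h>
    #include <stddef.h>

    // Element-wise float add; compile with e.g. -march=rv64gcv.
    void add_f32(const float *a, const float *b, float *out, size_t n) {
      while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m8(n);  // HW decides how many elements this pass
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b, vl);
        __riscv_vse32_v_f32m8(out, __riscv_vfadd_vv_f32m8(va, vb, vl), vl);
        a += vl; b += vl; out += vl; n -= vl;
      }
    }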
Most of what's interesting about avx512 is the new instructions; the wider vectors are just icing on the cake. You would need to rewrite your code regardless.
I wonder to what extent compilers even emit AVX-512 instructions apart from the common ones (load, store, shuffle, arithmetic) if you don't want to manually optimize for SSE / AVX / AVX2 / AVX-512.
How much code is compiled with `-march=native` or function multiversioning?
I would guess the percentage is relatively small, at least when it comes to distributed binaries.
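For concreteness, function multiversioning is the `target_clones` route: the compiler builds one clone per listed target and picks one at load time via an ifunc resolver. A minimal sketch (the function itself is made up):

    #include <stddef.h>

    __attribute__((target_clones("avx512f", "avx2", "default")))
    void scale(float *x, size_t n, float s) {
      for (size_t i = 0; i < n; ++i)
        x[i] *= s;  // autovectorized per clone: zmm for avx512f, ymm for avx2
    }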
Compiler autovectorizers also aren't very good at producing fast AVX512 code, so most of the benefit would probably come from using optimized libraries like Intel's MKL or simdjson.
Any installation of Gentoo is, presumably. (Otherwise, what's the point of compiling it all yourself?)
More interestingly, possibly all OEM firmware-installed copies of ChromeOS are -march=native builds as well, given that ChromeOS is based off of a Gentoo upstream.
True. I have never gone down the Gentoo rabbit hole. Might be fun to try sometime, but I'd seriously doubt that the time spent compiling would be won back from better performance.
Clear Linux is probably a more practical alternative. I used it a couple years ago, and found that they had a lot of avx2 and avx512 versions of random libraries built, with the appropriate ones presumably being loaded based on the hardware.
Random glibc math function calls, for example, were much faster on Clear Linux than Arch or Fedora.
But development of Clear seems to have stopped; libraries like LLVM aren't being updated anymore, so the toolchains are outdated.
I'd wanted to avoid the blood and sweat of managing my own toolchains, and ironically being on bleeding-edge distros (Arch, Fedora, etc.) was the way to keep that to a minimum.
Next time I reinstall an OS, I'll look at Clear again.
Or maybe Guix or Nix. Or maybe use spack for package management on top of some other distro.
A bug report has been filed with GCC for one of the issues. LLVM is much better here, but not perfect, or at least that has been my experience when trying to have the compiler generate assembly for an explicitly vectorized fletcher4 implementation.
Generally you are better off coding with "intrinsics", compiler extensions that represent the instructions more symbolically, if in fact the compiler offers what you need.
I am not sure the really interesting AVX-512 instructions have intrinsics yet. For those it's asm or nothing.
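To illustrate the intrinsics style, an AVX-512 masked add looks like this (the wrapper function is mine; the intrinsic is from immintrin.h):

    #include <immintrin.h>

    // Lanes whose mask bit is set get a[i] + b[i]; the rest pass 'src' through.
    __m512 masked_add(__m512 src, __mmask16 m, __m512 a, __m512 b) {
      return _mm512_mask_add_ps(src, m, a, b);
    }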
Potentially both. Most compilers have vectorization optimizations if you compile for an architecture that supports it.
However, a lot of software is compiled on one machine to be run on potentially many possible architectures, so it targets a lowest-common-denominator arch like baseline x86-64. That baseline guarantees some SIMD (SSE2), but not AVX-512.
So if a developer wants to ensure those instructions are used when they're supported, they'll write two code paths: one explicitly calls the AVX-512 instructions via compiler intrinsics, and the other uses plain portable code and lets the compiler decide how to turn it into baseline-x86-64-safe instructions.
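A minimal sketch of that two-path pattern, assuming GCC/Clang (all names are mine):

    #include <immintrin.h>
    #include <stddef.h>

    static void add_plain(const float *a, const float *b, float *o, size_t n) {
      for (size_t i = 0; i < n; ++i) o[i] = a[i] + b[i];  // baseline x86-64 path
    }

    __attribute__((target("avx512f")))
    static void add_avx512(const float *a, const float *b, float *o, size_t n) {
      size_t i = 0;
      for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(o + i, _mm512_add_ps(_mm512_loadu_ps(a + i),
                                              _mm512_loadu_ps(b + i)));
      for (; i < n; ++i) o[i] = a[i] + b[i];  // scalar tail
    }

    void add(const float *a, const float *b, float *o, size_t n) {
      if (__builtin_cpu_supports("avx512f")) add_avx512(a, b, o, n);
      else add_plain(a, b, o, n);
    }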
thanks for that! so it sounds like, if i purchase a chip that supports avx512, and run an operating system and compiler that supports avx512, i can write "plain old c code" with a minimal amount of compiler arguments and compile that code on my machine (aka not just running someone else's binary). and then the full power of avx512 is right there waiting for me? :)
A compiler turning C(++) code into SIMD instructions is called "autovectorization". In my experience this works for simple loops such as dot products (even that requires special compiler flags to allow FMA and reordering), but unfortunately the wheels often fall off for more complex code.
Also, I haven't seen the compiler generate the more exotic instructions.
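For instance, a reduction like this is the kind of loop that does vectorize, but only once the compiler is allowed to reassociate floats (e.g. gcc -O3 -march=skylake-avx512 -ffast-math); without such flags it typically stays scalar:

    #include <stddef.h>

    float dot(const float *a, const float *b, size_t n) {
      float sum = 0.0f;
      for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];  // reduction: needs reassociation (+ FMA) to vectorize well
      return sum;
    }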
if you are targeting more than one specific platform, do you like, include the immintrin.h header and use #ifdef to conditionally use avx512 if it's available on someone's platform?
It would be simpler to use the portable intrinsics from github.com/google/highway (disclosure: I am the main author).
You include a header, and use the same functions on all platforms; the library provides wrapper functions which boil down to the platform's intrinsics.
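Usage looks roughly like this (a sketch with static dispatch; for brevity it assumes n is a multiple of the vector length and skips the dynamic-dispatch boilerplate):

    #include "hwy/highway.h"
    namespace hn = hwy::HWY_NAMESPACE;

    // The same source compiles to SSE4/AVX2/AVX-512/NEON/SVE/RVV code
    // depending on the target; Lanes(d) is that target's vector width.
    void MulTo(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
               float* HWY_RESTRICT out, size_t n) {
      const hn::ScalableTag<float> d;
      for (size_t i = 0; i < n; i += hn::Lanes(d)) {
        hn::Store(hn::Mul(hn::Load(d, a + i), hn::Load(d, b + i)), d, out + i);
      }
    }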
From what I have seen, this is unfortunately not very useful: it mainly only includes operations that the compiler is often able to autovectorize anyway (simple arithmetic). Support for anything more interesting such as swizzles seems nonexistent. Also, last I checked, this was only available on GCC 11+; has that changed?
I wonder how much compilers could be improved with AI?
I'd imagine outputting optimized avx code from an existing C for() loop would be much easier than going from a "write me a python code that..." prompt.
Typically, if it's available, compilers will use the AVX-512 register file. This means you'll see things like xmm25 and ymm25 (128- and 256-bit registers; numbers 16-31 are only encodable with AVX-512). However, compilers using 512-bit-wide instructions is kinda rare from what I've seen.
In my experience, clang unrolls too much, so you end up spending all your time in the non-vectorized remainder.
Using smaller vectors cuts the size of the non-vectorized remainders in half, so smaller vectors often give better performance for that reason.
(Unrolling less could have the same effect while decreasing code size, but alas)
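For what it's worth, AVX-512's masks let you fold the whole remainder into one masked step instead of a scalar loop; a sketch (names mine, tail-only for brevity):

    #include <immintrin.h>
    #include <stddef.h>

    // Handle the final n % 16 floats of o[] = a[] + b[] in one masked op.
    void add_tail(const float *a, const float *b, float *o, size_t n) {
      size_t i = n & ~(size_t)15;  // bulk multiples of 16 handled elsewhere
      __mmask16 m = (__mmask16)((1u << (n - i)) - 1);  // low (n - i) bits set
      __m512 va = _mm512_maskz_loadu_ps(m, a + i);
      __m512 vb = _mm512_maskz_loadu_ps(m, b + i);
      _mm512_mask_storeu_ps(o + i, m, _mm512_add_ps(va, vb));
    }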
Using the RISC-V V vector instructions means that the underlying hardware vector width can change and the code will automatically take advantage of the larger width.
That said, many of the AVX-512 instructions are simply width-extended AVX2 instructions. The interesting things about it are really the increased width and the additional registers. Not many of the new instructions that aren't bit-width-extended versions of the old ones are particularly interesting, since Intel had already implemented most of the interesting things at smaller vector widths.
I've only scratched the surface of the AVX-512 instructions, but they are much broader and more useful. Masked gather, scatter, double-precision exponent and mantissa extraction, and floating-point-to-integer conversions are all new and all proving useful to me.
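For example, the exponent and mantissa extraction are a single instruction each (wrapper mine):

    #include <immintrin.h>

    // For 8 doubles at once: getexp returns floor(log2(|x|)),
    // getmant returns the mantissa normalized into [1, 2).
    void split_f64(__m512d x, __m512d *exp, __m512d *mant) {
      *exp  = _mm512_getexp_pd(x);
      *mant = _mm512_getmant_pd(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src);
    }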
Vectorizing text handling is likely to use generalized register permutes (and I spy some _mm512_shuffle_epi8 here), which are the bugaboo in length-agnostic SIMD. Fundamentally, the maximum index you can read from in a register depends on the register size.
So yeah, even in RISC-V V, vrgather has explicitly different per-element operation depending on VLMAX, which obviously depends on the HW's VLEN. So depending on the table size, you have to assume constraints on VLEN or execute different permute sequences.
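_mm512_shuffle_epi8 is a good illustration of the index-range problem: each output byte can only select within its own 128-bit lane, so a 16-entry table has to be replicated into every lane (full-register byte permutes need vpermb from AVX-512VBMI). A toy hex-digit lookup, names mine:

    #include <immintrin.h>

    // Map 64 nibble values (0..15) to ASCII hex digits; needs AVX-512BW.
    __m512i nibble_to_hex(__m512i nibbles) {
      const __m512i lut = _mm512_broadcast_i32x4(_mm_setr_epi8(
          '0', '1', '2', '3', '4', '5', '6', '7',
          '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'));
      return _mm512_shuffle_epi8(lut, nibbles);  // per-lane table lookup
    }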
If you specifically prefer SIMD over Vector, Andes has offerings with draft P extension.
If you otherwise want vector (V extension), "right now" would limit you to pre-1.0 V extension implementations.
If you need to license hardware IP, there are several very high performance implementations as of RISC-V Summit[0]. Actual hardware will pop up throughout 2023.
This is unlike the RISC-V V extension, where the same code will run and utilize the hardware regardless of vector unit width.