Vector length agnostic programming has its own share of problems. I'm not familiar with the RISC-V V extension, but I assume it's similar to ARM's SVE. There's a good critical look at SVE and VLA here: https://gist.github.com/zingaburga/805669eb891c820bd220418ee...
I'm curious why you say they are very different? From where I sit, RVV also supports mask-like predication, and adds two concepts: LMUL (in-HW unrolling of each instruction) plus the ability to limit operations to a given number of elements.
The former is nifty, though intended for single-issue machines, and the latter seems redundant because masks can also do that.
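To make that concrete, here is roughly what a stripmine loop looks like in the ratified RVV C intrinsics (a sketch; the function and buffer names are mine). LMUL=8 ("m8") is the in-HW unrolling, and vsetvl is the per-iteration element-count limiting:

    #include <riscv_vector.h>
    #include <stddef.h>

    // Element-wise float add; compile with e.g. -march=rv64gcv.
    void add_f32(const float *a, const float *b, float *out, size_t n) {
      while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m8(n);  // HW decides how many elements this pass
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b, vl);
        __riscv_vse32_v_f32m8(out, __riscv_vfadd_vv_f32m8(va, vb, vl), vl);
        a += vl; b += vl; out += vl; n -= vl;
      }
    }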
Most of what's interesting about avx512 is the new instructions; the wider vectors are just icing on the cake. You would need to rewrite your code regardless.
I wonder to what extent compilers even emit AVX-512 instructions apart from the common ones (load, store, shuffle, arithmetic) if you don't want to manually optimize for SSE / AVX / AVX2 / AVX-512.
How much code is compiled with `-march=native` or function multiversioning?
I would guess the percentage is relatively small, at least when it comes to distributed binaries.
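For concreteness, function multiversioning is the `target_clones` route: the compiler builds one clone per listed target and picks one at load time via an ifunc resolver. A minimal sketch (the function itself is made up):

    #include <stddef.h>

    __attribute__((target_clones("avx512f", "avx2", "default")))
    void scale(float *x, size_t n, float s) {
      for (size_t i = 0; i < n; ++i)
        x[i] *= s;  // autovectorized per clone: zmm for avx512f, ymm for avx2
    }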
Compiler autovectorizers also aren't very good at producing fast AVX512 code, so most of the benefit would probably come from using optimized libraries like Intel's MKL or simdjson.
Any installation of Gentoo is, presumably. (Otherwise, what's the point of compiling it all yourself?)
More interestingly, possibly all OEM firmware-installed copies of ChromeOS are -march=native builds as well, given that ChromeOS is based off of a Gentoo upstream.
True. I have never gone down the Gentoo rabbit hole. Might be fun to try sometime, but I'd seriously doubt that the time spent compiling would be won back from better performance.
Clear Linux is probably a more practical alternative. I used it a couple years ago, and found that they had a lot of avx2 and avx512 versions of random libraries built, with the appropriate ones presumably being loaded based on the hardware.
Random glibc math function calls, for example, were much faster on Clear Linux than Arch or Fedora.
But development of Clear seems to have stopped; libraries like LLVM aren't being updated anymore, so the toolchains are outdated.
I'd wanted to avoid the blood and sweat of managing my own toolchains, and ironically being on bleeding-edge distros (Arch, Fedora, etc.) was the way to keep that to a minimum.
Next time I reinstall an OS, I'll look at Clear again.
Or maybe Guix or Nix. Or maybe use spack for package management on top of some other distro.
A bug report has been filed with GCC for one of the issues. LLVM is much better here, but not perfect, or at least that has been my experience when trying to have the compiler generate assembly for an explicitly vectorized fletcher4 implementation.
Generally you are better off coding with "intrinsics", compiler extensions that represent the instructions more symbolically, if in fact the compiler offers what you need.
I am not sure the really interesting AVX-512 instructions have intrinsics yet. For those it's asm or nothing.
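To illustrate the intrinsics style, an AVX-512 masked add looks like this (the wrapper function is mine; the intrinsic is from immintrin.h):

    #include <immintrin.h>

    // Lanes whose mask bit is set get a[i] + b[i]; the rest pass 'src' through.
    __m512 masked_add(__m512 src, __mmask16 m, __m512 a, __m512 b) {
      return _mm512_mask_add_ps(src, m, a, b);
    }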
Potentially both. Most compilers have vectorization optimizations if you compile for an architecture that supports it.
However, a lot of software is compiled on one machine to be run on potentially many possible architectures, so it targets a lowest-common-denominator arch like baseline x86-64. That baseline guarantees some SIMD (SSE2), but not AVX-512.
So if a developer wants to ensure those instructions are used when they're supported, they'll write two code paths: one explicitly calls the AVX-512 instructions via compiler intrinsics, and the other uses plain portable code and lets the compiler decide how to turn it into baseline-x86-64-safe instructions.
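A minimal sketch of that two-path pattern, assuming GCC/Clang (all names are mine):

    #include <immintrin.h>
    #include <stddef.h>

    static void add_plain(const float *a, const float *b, float *o, size_t n) {
      for (size_t i = 0; i < n; ++i) o[i] = a[i] + b[i];  // baseline x86-64 path
    }

    __attribute__((target("avx512f")))
    static void add_avx512(const float *a, const float *b, float *o, size_t n) {
      size_t i = 0;
      for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(o + i, _mm512_add_ps(_mm512_loadu_ps(a + i),
                                              _mm512_loadu_ps(b + i)));
      for (; i < n; ++i) o[i] = a[i] + b[i];  // scalar tail
    }

    void add(const float *a, const float *b, float *o, size_t n) {
      if (__builtin_cpu_supports("avx512f")) add_avx512(a, b, o, n);
      else add_plain(a, b, o, n);
    }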
thanks for that! so it sounds like, if i purchase a chip that supports avx512, and run an operating system and compiler that supports avx512, i can write "plain old c code" with a minimal amount of compiler arguments and compile that code on my machine (aka not just running someone else's binary). and then the full power of avx512 is right there waiting for me? :)
A compiler turning C(++) code into SIMD instructions is called "autovectorization". In my experience this works for simple loops such as dot products (even that requires special compiler flags to allow FMA and reordering), but unfortunately the wheels often fall off for more complex code.
Also, I haven't seen the compiler generate the more exotic instructions.
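For instance, a reduction like this is the kind of loop that does vectorize, but only once the compiler is allowed to reassociate floats (e.g. gcc -O3 -march=skylake-avx512 -ffast-math); without such flags it typically stays scalar:

    #include <stddef.h>

    float dot(const float *a, const float *b, size_t n) {
      float sum = 0.0f;
      for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];  // reduction: needs reassociation (+ FMA) to vectorize well
      return sum;
    }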
if you are targeting more than one specific platform, do you like, include the immintrin.h header and use #ifdef to conditionally use avx512 if it's available on someone's platform?
It would be simpler to use the portable intrinsics from github.com/google/highway (disclosure: I am the main author).
You include a header, and use the same functions on all platforms; the library provides wrapper functions which boil down to the platform's intrinsics.
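Usage looks roughly like this (a sketch with static dispatch; for brevity it assumes n is a multiple of the vector length and skips the dynamic-dispatch boilerplate):

    #include "hwy/highway.h"
    namespace hn = hwy::HWY_NAMESPACE;

    // The same source compiles to SSE4/AVX2/AVX-512/NEON/SVE/RVV code
    // depending on the target; Lanes(d) is that target's vector width.
    void MulTo(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
               float* HWY_RESTRICT out, size_t n) {
      const hn::ScalableTag<float> d;
      for (size_t i = 0; i < n; i += hn::Lanes(d)) {
        hn::Store(hn::Mul(hn::Load(d, a + i), hn::Load(d, b + i)), d, out + i);
      }
    }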
From what I have seen, this is unfortunately not very useful: it mainly only includes operations that the compiler is often able to autovectorize anyway (simple arithmetic). Support for anything more interesting such as swizzles seems nonexistent. Also, last I checked, this was only available on GCC 11+; has that changed?
I wonder how much compilers could be improved with AI?
I'd imagine outputting optimized avx code from an existing C for() loop would be much easier than going from a "write me a python code that..." prompt.
Typically, if it's available, compilers will use the AVX-512 register file. This means you'll see things like xmm25 and ymm25 (128- and 256-bit registers; numbers 16-31 are only encodable with AVX-512). However, compilers using 512-bit-wide instructions is kinda rare from what I've seen.
In my experience, clang unrolls too much, so you end up spending all your time in the non-vectorized remainder.
Using smaller vectors cuts the size of the non-vectorized remainders in half, so smaller vectors often give better performance for that reason.
(Unrolling less could have the same effect while decreasing code size, but alas)
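For what it's worth, AVX-512's masks let you fold the whole remainder into one masked step instead of a scalar loop; a sketch (names mine, tail-only for brevity):

    #include <immintrin.h>
    #include <stddef.h>

    // Handle the final n % 16 floats of o[] = a[] + b[] in one masked op.
    void add_tail(const float *a, const float *b, float *o, size_t n) {
      size_t i = n & ~(size_t)15;  // bulk multiples of 16 handled elsewhere
      __mmask16 m = (__mmask16)((1u << (n - i)) - 1);  // low (n - i) bits set
      __m512 va = _mm512_maskz_loadu_ps(m, a + i);
      __m512 vb = _mm512_maskz_loadu_ps(m, b + i);
      _mm512_mask_storeu_ps(o + i, m, _mm512_add_ps(va, vb));
    }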
Using the RISC-V V vector instructions means that the underlying hardware vector width can change and the code will automatically take advantage of the larger width.
That said, many of the AVX-512 instructions are simply width-extended AVX2 instructions. The interesting things about it are really the increased width and the additional registers. Not many of the new instructions that aren't bit-width-extended versions of the old ones are particularly interesting, since Intel had already implemented most of the interesting things at smaller vector widths.
I've only scratched the surface of the AVX-512 instructions, but they are much broader and more useful. Masked gather, scatter, double-precision exponent and mantissa extraction, and floating-point-to-integer conversions are all new and all proving useful to me.
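For example, the exponent and mantissa extraction are a single instruction each (wrapper mine):

    #include <immintrin.h>

    // For 8 doubles at once: getexp returns floor(log2(|x|)),
    // getmant returns the mantissa normalized into [1, 2).
    void split_f64(__m512d x, __m512d *exp, __m512d *mant) {
      *exp  = _mm512_getexp_pd(x);
      *mant = _mm512_getmant_pd(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src);
    }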
Vectorizing text handling is likely to use generalized register permutes (and I spy some _mm512_shuffle_epi8 here), which are the bugaboo in length-agnostic SIMD. Fundamentally, the maximum index you can read from in a register depends on the register size.
So yeah, even in RISC-V V, vrgather has explicitly different per-element operation depending on VLMAX, which obviously depends on the HW's VLEN. So depending on the table size, you have to assume constraints on VLEN or execute different permute sequences.
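_mm512_shuffle_epi8 is a good illustration of the index-range problem: each output byte can only select within its own 128-bit lane, so a 16-entry table has to be replicated into every lane (full-register byte permutes need vpermb from AVX-512VBMI). A toy hex-digit lookup, names mine:

    #include <immintrin.h>

    // Map 64 nibble values (0..15) to ASCII hex digits; needs AVX-512BW.
    __m512i nibble_to_hex(__m512i nibbles) {
      const __m512i lut = _mm512_broadcast_i32x4(_mm_setr_epi8(
          '0', '1', '2', '3', '4', '5', '6', '7',
          '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'));
      return _mm512_shuffle_epi8(lut, nibbles);  // per-lane table lookup
    }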
If you specifically prefer SIMD over Vector, Andes has offerings with draft P extension.
If you otherwise want vector (V extension), "right now" would limit you to pre-1.0 V extension implementations.
If you need to license hardware IP, there are several very high performance implementations as of RISC-V Summit[0]. Actual hardware will pop up throughout 2023.
This is unlike the RISC-V V extension, where the same code will run and utilize the hardware regardless of vector unit width.