Not a C coder but isn't there a way to embed platform specific optimizations int...

dagw · 2024-11-04T14:32:35 1730730755

Yes, but then you have to write (and debug and maintain) each part 3 times.

There are also various libraries that create cross platform abstractions of the underlying SIMD libraries. Highway, from Google, and xSIMD are two popular such libraries for C++. SIMDe is a nice library that also works with C.

pmarreck · 2024-11-04T18:52:18 1730746338

> Yes, but then you have to write (and debug and maintain) each part 3 times.

Could you not use a test suite structure (not saying it would be simple) that runs the suite across 3 different virtualized chip implementations? (The virtualization itself might introduce issues, of course.)
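One way such a suite could be structured is as a differential test: every port is checked against the scalar reference on the same inputs, and the same test binary is then built and run for each target, natively or under an emulator. The sketch below is only an illustration. `add_u8_ref`, `add_u8_port` and `count_mismatches` are hypothetical names, and `add_u8_port` is a plain-C stand-in for where a real platform-specific implementation would go.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LEN 67  /* odd size so that tail handling gets exercised */

typedef void (*add_fn)(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n);

/* Scalar reference: the behaviour every port has to reproduce exactly. */
void add_u8_ref(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = (uint8_t)(a[i] + b[i]);
}

/* Hypothetical stand-in for a port: 4 elements per step plus a tail loop. */
void add_u8_port(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = (uint8_t)(a[i]     + b[i]);
        dst[i + 1] = (uint8_t)(a[i + 1] + b[i + 1]);
        dst[i + 2] = (uint8_t)(a[i + 2] + b[i + 2]);
        dst[i + 3] = (uint8_t)(a[i + 3] + b[i + 3]);
    }
    for (; i < n; i++) dst[i] = (uint8_t)(a[i] + b[i]);
}

/* Runs both functions for every length 0..MAX_LEN and returns how many
   lengths produced different output buffers. */
int count_mismatches(add_fn ref, add_fn candidate) {
    uint8_t a[MAX_LEN], b[MAX_LEN], want[MAX_LEN], got[MAX_LEN];
    uint32_t state = 12345u;  /* fixed seed keeps failures reproducible */
    for (size_t i = 0; i < MAX_LEN; i++) {
        state = state * 1664525u + 1013904223u;
        a[i] = (uint8_t)(state >> 24);
        state = state * 1664525u + 1013904223u;
        b[i] = (uint8_t)(state >> 24);
    }
    int failures = 0;
    for (size_t n = 0; n <= MAX_LEN; n++) {
        memset(want, 0, sizeof want);
        memset(got, 0, sizeof got);
        ref(want, a, b, n);
        candidate(got, a, b, n);
        /* Comparing the whole buffer also catches writes past index n. */
        if (memcmp(want, got, sizeof want) != 0) failures++;
    }
    return failures;
}
```

This only shows that a port matches the reference on whatever machine or emulator runs it. It says nothing about speed, and an emulator may not behave exactly like real hardware.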

datadeft · 2024-11-04T19:56:11 1730750171

> each part 3 times.

True.

So most projects just use SIMDe, xSIMD or something similar for such use cases?

astrange · 2024-11-04T20:38:56 1730752736

FFmpeg is the most successful of such projects and it uses handwritten assembly. Ignore the seductive whispers of people trying to sell you unreliable abstractions. It's like this for good reasons.

dagw · 2024-11-04T20:56:49 1730753809

Or you just say that your code is only fast on hardware that supports native AVX-512 (or whatever). In many cases where speed really matters, that is a reasonable tradeoff to make.

ulrikrasmussen · 2024-11-04T14:26:53 1730730413

You can always do that using build flags, but it doesn't make the code portable. You as a programmer still have to manually port the optimized code to all the other platforms.

mort96 · 2024-11-04T14:20:46 1730730046

Yeah, you can do that, but that still means you write platform-specific code. What you typically do is that you write a cross-platform scalar implementation in standard C, and then for each target you care about, you write a platform-specific vectorized implementation. Then, through some combination of compile-time feature detection, build flags, and runtime feature detection, you select which implementation to use.

(The runtime part comes in because you may want a single amd64 version of your program which uses AVX-512 if that's available but falls back to AVX2 and/or SSE if it's not available.)
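A minimal sketch of that pattern, assuming GCC or Clang on x86-64 for the vector path. `sum_u32`, `sum_u32_scalar`, `sum_u32_avx2` and `select_sum_impl` are made-up names for illustration, and only one vector level (AVX2) is shown. On any other compiler or architecture the preprocessor guard leaves just the scalar version.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1  /* compile-time feature detection */
#endif

/* Cross-platform scalar implementation in standard C. */
static uint32_t sum_u32_scalar(const uint32_t *data, size_t n) {
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) total += data[i];
    return total;
}

#ifdef HAVE_X86_DISPATCH
/* Platform-specific version. The target attribute enables AVX2 for this
   one function even if the file is compiled without -mavx2. */
static __attribute__((target("avx2")))
uint32_t sum_u32_avx2(const uint32_t *data, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(data + i));
        acc = _mm256_add_epi32(acc, v);  /* 8 wrapping 32-bit adds */
    }
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint32_t total = 0;
    for (int k = 0; k < 8; k++) total += lanes[k];
    for (; i < n; i++) total += data[i];  /* leftover elements */
    return total;
}
#endif

typedef uint32_t (*sum_fn)(const uint32_t *, size_t);

/* Runtime feature detection: pick the best version this CPU can run. */
static sum_fn select_sum_impl(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_u32_avx2;
#endif
    return sum_u32_scalar;
}

/* Public entry point. The choice is cached after the first call; this
   simple caching is not thread-safe. */
uint32_t sum_u32(const uint32_t *data, size_t n) {
    static sum_fn impl = NULL;
    if (!impl) impl = select_sum_impl();
    return impl(data, n);
}
```

Callers only ever see `sum_u32(data, n)`. Supporting another target would mean adding one more guarded function and one more branch in `select_sum_impl`, which is the "write each part N times" cost mentioned earlier in the thread.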

magicalhippo · 2024-11-04T14:25:38 1730730338

> runtime feature detection

For any code that's meant to last a bit more than a year, I would say that should also include runtime benchmarking. CPUs change, compilers change. The hand-written assembly might be faster today, but might be sub-optimal in the future.
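A rough sketch of what runtime benchmarking could look like, using only standard C. All names here (`checksum_fn`, `checksum_simple`, `checksum_unrolled4`, `pick_fastest`) are hypothetical, and the two candidates are plain-C stand-ins for "compiler-generated" and "hand-tuned" versions. `clock()` is a coarse timer, so a real version would need more repetitions and some care about noise.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*checksum_fn)(const uint32_t *, size_t);

/* Candidate 1: straightforward loop, left to the compiler to optimize. */
uint32_t checksum_simple(const uint32_t *d, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += d[i];
    return acc;
}

/* Candidate 2: manually unrolled, standing in for a hand-tuned version. */
uint32_t checksum_unrolled4(const uint32_t *d, size_t n) {
    uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += d[i]; a1 += d[i + 1]; a2 += d[i + 2]; a3 += d[i + 3];
    }
    for (; i < n; i++) a0 += d[i];
    return a0 + a1 + a2 + a3;
}

/* Volatile sink so the timed calls cannot be optimized away. */
static volatile uint32_t bench_sink;

/* CPU time in seconds for `reps` calls of `fn` on the sample buffer. */
static double time_candidate(checksum_fn fn, const uint32_t *d, size_t n, int reps) {
    clock_t start = clock();
    for (int r = 0; r < reps; r++) bench_sink = fn(d, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Times every candidate once and returns the fastest; ties keep the
   earlier entry. `count` must be at least 1. */
checksum_fn pick_fastest(const checksum_fn *candidates, size_t count,
                         const uint32_t *sample, size_t n, int reps) {
    checksum_fn best = candidates[0];
    double best_time = time_candidate(best, sample, n, reps);
    for (size_t c = 1; c < count; c++) {
        double t = time_candidate(candidates[c], sample, n, reps);
        if (t < best_time) { best_time = t; best = candidates[c]; }
    }
    return best;
}
```

A program would call `pick_fastest` once at startup with a representative buffer and keep the returned pointer. The result can differ between runs and machines, which is the point: the choice follows the hardware and compiler actually in use rather than an assumption made when the code was written.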

mort96 · 2024-11-04T14:35:31 1730730931

That vectorized code is faster than scalar code is a pretty universally safe assumption (assuming the algorithm lends itself to vectorization, of course). I'm not talking about runtime selection of hand-written asm compared to compiler-generated code, but rather runtime selection of vector vs scalar.