Hacker News

Given the “blazingly fast” branding, I too would have thought this would be in stable Rust by now.

However, like other commenters, I assume it's because it's hard, relatively few Rust users actually need it, and the compiler team is small and consists entirely of volunteers.





Getting maximum performance out of SIMD requires rolling your own code with intrinsics. It is something a compiler can't do for you at a pretty fundamental level.

Most interesting performance optimizations from vector ISAs can't be done by the compiler.


Interesting, how so? I've had really good success with the autovectorization in gcc and the Intel C compiler. Often it's faster than my own intrinsics, though not always. One notable example, though, is that it seems to struggle with reduction - when I'm updating large arrays, i.e. `A[i] += a`, the compiler struggles to use SIMD for this and I need to do it myself.

There's no optimal portable `movemask` operation, because AArch64 NEON doesn't have one.

> Getting maximum performance out of SIMD requires rolling your own code with intrinsics

Not disagreeing with this statement in general, but with std::simd I can get 80% of the performance with 20% of the effort compared to intrinsics.

For the last 20%, there's a zero-cost fallback to intrinsics when you need it.


To clarify, there are many things SIMD is used for that look nothing like the loop parallelism or doing numerics commonly discussed. For example, heterogeneous concurrency is likely going to be beyond compilers for the foreseeable future and it is a great SIMD optimization.

A common example is executing the equivalent of a runtime SQL WHERE clause on arbitrary data structures of mixed types. Clever idioms allow surprisingly complex unrelated constraint operators to be evaluated in parallel with SIMD. It would be cool if a compiler could take a large pile of fussy, branchy scalar code that evaluates ad hoc constraints on data structures and convert it to an equivalent SIMD constraint engine, but that doesn't seem likely anytime soon. So we roll them by hand.


> Given the “blazingly fast” branding, I too would have thought this would be in stable Rust by now.

It's the exact opposite. It's the portable SIMD abstraction that isn't stable yet. But the vendor-specific SIMD intrinsics have been stable for quite some time already (x86-64 for many years, for example). And indeed, those are necessary for some cases.

ripgrep wouldn't be as fast as it is if it weren't possible to use SIMD on stable Rust.


I do scientific computing, and even I rarely have a situation where CPU SIMD is a clear win. Usually it's either not worth the added complexity, or the problem is so embarrassingly parallel that you should use a GPU.

Interesting, in what domain? My work is in scientific computing as well (finite elements) and I usually find myself in the opposite situation: SIMD is very helpful but the added complexity of using a GPU is not worthwhile.

Don’t forget that autovectorization does a lot too. Explicit SIMD is only for when you want to ensure you get exactly the code you want; many applications just get vectorization for free.


