For vectorizing, that quote is only true for loops with dependencies between iterations, e.g. summing a list of numbers (that's basically the only case where this really matters).
For loops without such dependencies Rust should autovectorize just fine as with any other element type.
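To illustrate, here's a minimal sketch of the kind of dependency-free loop being described. The function name `scale_and_add` is just for illustration; the point is that each output element depends only on its own index, so the compiler is free to vectorize it even for floats, with no reordering required:

```rust
// Each a[i] is computed independently of every other iteration,
// so there is no cross-iteration dependency to reorder: the
// compiler can vectorize this loop without changing any result.
fn scale_and_add(a: &mut [f32], b: &[f32], k: f32) {
    let n = a.len().min(b.len());
    for i in 0..n {
        a[i] = a[i] * k + b[i];
    }
}

fn main() {
    let mut a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [0.5f32, 0.5, 0.5, 0.5];
    scale_and_add(&mut a, &b, 2.0);
    // All values here are exactly representable, so this is exact.
    assert_eq!(a, [2.5, 4.5, 6.5, 8.5]);
    println!("{:?}", a);
}
```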
You just create f32x4 types; the wide crate does this. Then it autovectorizes just fine. But it still isn't the best idea if you are comparing values. We had a defect due to this recently.
I suspect I am misunderstanding. If you create an f32x4 type, aren't you manually vectorizing? Auto-vectorizing is magic SIMD use the compiler does in some cases. (But usually doesn't...)
You are manually vectorizing, but it lets the optimizer know you don't care about safe rounding behavior, so it ends up using the SIMD instructions. And this way it is still portable, versus using intrinsics. Floating point addition is the only one the optimizer isn't allowed to reorder, so if you just need multiplication or only use integers it all autovectorizes fine. The f32xN stuff is just a way to tell it you don't care about the rounding. There are better ways to do that that could be added, like a FastF32 type, but I don't know if LLVM could support that.
Edit: go to Godbolt and load the Rust aligned-sum example and play around with types. If you see addps, that is the packed single-precision SIMD add instruction. The more you get packed, the higher your score! You'll need to pass some extra arguments they don't list to get AVX-512 sized registers vs the xmm or ymm ones. And not all the instances it uses support AVX-512, so sometimes you have to try a couple times.
Well, not really "you don't care about safe rounding behavior"; it's more "you have specified a specific operation order that happens to be more amenable to vectorization". Implementing a float sum that way has the completely safe, completely well-defined, portable behavior of summing strides for any given size.
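For the record, here's a sketch of the strided sum being described, using plain `[f32; 4]` arrays rather than the wide crate's f32x4 (the crate's type is the same idea packaged up; `striped_sum` is a made-up name). Instead of one running total, four accumulators each sum every fourth element, and the lanes are combined once at the end. That operation order is fixed and well-defined; it just happens to map directly onto SIMD lanes:

```rust
// Sum with four independent accumulators ("lanes"). Each lane's
// additions are independent of the other lanes', so the compiler
// can keep all four in one SIMD register without reordering
// anything the programmer didn't already specify.
fn striped_sum(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let mut chunks = xs.chunks_exact(4);
    for chunk in &mut chunks {
        for lane in 0..4 {
            acc[lane] += chunk[lane]; // per-lane add, no cross-lane dependency
        }
    }
    // Fold in the leftover tail, then reduce the lanes horizontally.
    let tail: f32 = chunks.remainder().iter().sum();
    acc.iter().sum::<f32>() + tail
}

fn main() {
    let xs: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    // Small integers are exact in f32, so this matches the naive sum.
    assert_eq!(striped_sum(&xs), 55.0);
    println!("{}", striped_sum(&xs));
}
```

Note the result can differ from a naive left-to-right sum on arbitrary inputs; it's a different (but equally well-defined) order.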
Both float multiplication and float addition are equally bad for optimizations, though; both are non-associative: https://play.rust-lang.org/?version=stable&mode=debug&editio... — and indeed, changing the aligned-sum example to f64, neither .sum() nor .product() gets vectorized.
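A quick self-contained check of that non-associativity, for anyone who doesn't want to follow the playground link (under default IEEE 754 round-to-nearest semantics):

```rust
// Neither float addition nor float multiplication is associative,
// which is exactly why the optimizer can't reorder a reduction
// over either of them without permission.
fn main() {
    let (a, b, c) = (0.1f64, 0.2f64, 0.3f64);
    // (0.1 + 0.2) + 0.3 gives 0.6000000000000001,
    // while 0.1 + (0.2 + 0.3) gives 0.6.
    assert_ne!((a + b) + c, a + (b + c));
    // Grouping the multiplications differently also changes the
    // last bit of the result.
    assert_ne!((a * b) * c, a * (b * c));
    println!("both non-associative");
}
```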
And e.g. here's a plain Rust loop autovectorizing both addition and multiplication (though of course not a reduction): https://rust.godbolt.org/z/6hEcj8zfx
I meant that multiplying two vectors pointwise autovectorizes, because there is no ordering involved. I'm usually doing accumulated products or something like them for DSP. As long as you only use the wide types it is fine. I had a bug when comparing values constructed partially from SIMD vs not at all. Very unusual, I'm sure, but there really is a reason Rust won't let you turn on -ffast-math.