> A f32x16 version would also be faster on 256b hardware, but would spill in SSE. For Zen5 you probably want to use f32x32.
Yeah, exceeding the native vector width is kind of just adding another round of loop unrolling. Sometimes it helps, sometimes it doesn't. This is probably mostly about register pressure.

And architecture-specific benchmarking is required if you want to get the most performance out of it.
> I'd prefer if std::simd would encourage scaling relative to the native SIMD width (and support scalable SIMD ISAs).
It is possible to write width-generic SIMD code (i.e. with the vector width as a generic parameter) in Rust std::simd (or with C++ templates and vector extensions) and tie it to the native vector width, although you have to define that width explicitly yourself.
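Roughly what I mean (a sketch on nightly `portable_simd`; the `axpy` kernel and the `NATIVE_F32_LANES` constant are made up for illustration, and remainder handling is omitted):

```rust
#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SupportedLaneCount};

// You have to pick the "native" width yourself, e.g. per target feature.
// Only takes effect if you compile with those features enabled
// (e.g. -C target-cpu=native).
#[cfg(target_feature = "avx512f")]
const NATIVE_F32_LANES: usize = 16;
#[cfg(all(target_feature = "avx2", not(target_feature = "avx512f")))]
const NATIVE_F32_LANES: usize = 8;
#[cfg(not(any(target_feature = "avx2", target_feature = "avx512f")))]
const NATIVE_F32_LANES: usize = 4;

// Width-generic kernel: the same code instantiates as f32x4, f32x8, f32x16, ...
fn axpy<const N: usize>(a: f32, xs: &mut [f32], ys: &[f32])
where
    LaneCount<N>: SupportedLaneCount,
{
    let va = Simd::<f32, N>::splat(a);
    for (x, y) in xs.chunks_exact_mut(N).zip(ys.chunks_exact(N)) {
        let vx = Simd::<f32, N>::from_slice(x);
        let vy = Simd::<f32, N>::from_slice(y);
        (va * vx + vy).copy_to_slice(x);
    }
    // leftover elements (len % N) not handled here
}

fn main() {
    let mut xs = vec![1.0f32; 64];
    let ys = vec![2.0f32; 64];
    axpy::<NATIVE_F32_LANES>(3.0, &mut xs, &ys);
}
```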
In my problem domain (computer graphics etc.) the vector width is often mandated by the task at hand (e.g. 2D vs 3D); it's often not about doing something to an array of size N. This doesn't lead to optimal HW utilization, but it's convenient and still a lot faster than scalar code.
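Typical example of that fixed-width style (the `lerp` helper is made up, again on nightly `portable_simd`):

```rust
#![feature(portable_simd)]
use std::simd::f32x4;

// The width (4 lanes: xyz + one padding lane) is fixed by the data layout,
// not by the hardware's native vector width.
fn lerp(a: f32x4, b: f32x4, t: f32) -> f32x4 {
    a + (b - a) * f32x4::splat(t)
}
```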
Scalable SIMD ISAs are still fairly new, so I'm not sure how well the current std::simd or C vector extensions (or LLVM IR SIMD ops) map to that hardware. Maybe they would be better served by a different kind of API? I don't really know; I haven't had the privilege of writing any scalable vector code yet.

What I'm trying to say is that IMO std::simd works well enough and should probably be stabilized (almost) as is, barring any show-stopper issues. It's already useful and has been for many years.