Yes, at the scale of 128-bit registers, NEON is mostly enough. The exception is a few categories of instructions missing from that ISA subset, such as scatter/gather ops, which can yield a 30% boost over serial memory accesses: https://github.com/ashvardanian/less_slow.cpp/releases/tag/v...
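
For anyone unsure what "serial memory accesses" means here, a minimal sketch in portable C++: without a hardware gather/scatter instruction, an indexed lookup becomes one scalar load or store per element. The function names are hypothetical and not taken from the linked repository; ISA extensions with gather/scatter support (SVE, for example) can express the same loop body as vector instructions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Serial gather: out[i] = data[indices[i]], one scalar load per element.
// Hypothetical helper for illustration only; assumes every index is in range.
std::vector<float> gather_serial(const std::vector<float>& data,
                                 const std::vector<std::uint32_t>& indices) {
    std::vector<float> out(indices.size());
    for (std::size_t i = 0; i < indices.size(); ++i)
        out[i] = data[indices[i]];
    return out;
}

// Serial scatter: out[indices[i]] = values[i], one scalar store per element.
// Hypothetical helper for illustration only; assumes every index is in range
// and that values and indices have the same length.
void scatter_serial(const std::vector<float>& values,
                    const std::vector<std::uint32_t>& indices,
                    std::vector<float>& out) {
    for (std::size_t i = 0; i < indices.size(); ++i)
        out[indices[i]] = values[i];
}
```

The 30% figure above is about replacing loops like these with dedicated instructions; the sketch only shows the baseline, not a benchmark.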