Mostly unrelated: When I write heavily optimized code I prefer to write the stupidest, simplest thing that could possibly work first, even if I know it's too slow for the intended purpose. I'll leave the unoptimized version in the code base.
- It serves as a form of documentation for what the optimized stuff is supposed to do. I find this most beneficial when the primitives being used in the optimized code don't map well to the overarching flow of ideas (like how _mm256_maddubs_epi16 is just a vectorized 8-bit unsigned multiply if some preconditions on the inputs hold). The unoptimized code will follow the broader brush strokes of the fast implementation.
- More importantly, you can drop it into your test suite as an oracle to check that the optimized code actually behaves like it's supposed to on a wide variety of test cases. The ability to test any (small enough) input opens a door in terms of robustness.