Yeah, you can do that, but that still means you write platform-specific code. What you typically do is that you write a cross-platform scalar implementation in standard C, and then for each target you care about, you write a platform-specific vectorized implementation. Then, through some combination of compile-time feature detection, build flags, and runtime feature detection, you select which implementation to use.
(The runtime part comes in because you may want a single amd64 version of your program which uses AVX-512 if that's available but falls back to AVX-256 and/or SSE if it's not available)
For any code that's meant to last a bit more than a year, I would say that should also include runtime benchmarking. CPUs change, compilers change. The hand-written assembly might be faster today, but might be sub-optimal in the future.
The assumption that vectorized code is faster than scalar code is a pretty universally safe assumption (assuming the algorithm lends itself to vectorization of course). I'm not talking about runtime selection of hand-written asm compared to compiler-generated code, but rather runtime selection of vector vs scalar.
(The runtime part comes in because you may want a single amd64 version of your program which uses AVX-512 if that's available but falls back to AVX-256 and/or SSE if it's not available)