This is on modern x86-64, using many cores and processing millions of messages per second. Efficient, well-thought-out architectures work well on small systems as well as large ones.
I bet that at -O3 it definitely does not generate the code you expect, and the same C code when compiled with a C++ compiler will generate even better machine code.
You would be wrong. I've spent plenty of time looking at code produced by the compiler from both C and C++ while working through performance profiling. Yes, -O3 can apply non-trivial transformations, but the output is still easy enough to follow when you have been reading it for decades.
In my experience, the compiler isn't the performance bottleneck. Code can always be tweaked to get the compiler to produce output that does what is intended. My complaints about C++ have more to do with the standard library, which is so unsuitable for embedded-type applications that we spent plenty of time reimplementing algorithms that are already provided as part of the normal C++ infrastructure. Having the right infrastructure to build a high-performance application takes far more than an optimizing compiler. Libraries matter, kernel APIs matter, and taking advantage of the right hardware configuration matters. The compiler is actually a fairly small part of the overall performance picture: things like pinning threads and avoiding cache line bounces between threads/CPUs buy orders of magnitude more performance than changing the compiler.
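To make the pinning and cache line point concrete, here is a minimal sketch, not the code from the system described above. It assumes Linux with glibc (for `pthread_setaffinity_np` and `sched_getaffinity`), C++17, and a 64-byte cache line, which is typical for x86-64 but not guaranteed. The names `PaddedCounter`, `allowed_cpus`, `pin_current_thread`, and `run_workers` are made up for illustration.

```cpp
// Sketch only: assumes Linux + glibc, C++17, and a 64-byte cache line.
#include <pthread.h>
#include <sched.h>

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed line size, not queried

// One counter per thread, each on its own cache line, so a write by one
// thread does not invalidate the line holding another thread's counter.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// CPUs the process is currently allowed to run on (empty on failure).
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) == 0) {
        for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
            if (CPU_ISSET(cpu, &set)) cpus.push_back(cpu);
        }
    }
    return cpus;
}

// Pin the calling thread to a single CPU. Returns true on success.
bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Start n_threads workers, pin each to one allowed CPU (round-robin), and
// have each bump only its own padded counter. Returns the sum of counters.
std::uint64_t run_workers(unsigned n_threads, std::uint64_t iterations) {
    const std::vector<int> cpus = allowed_cpus();
    std::vector<PaddedCounter> counters(n_threads);
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n_threads; ++i) {
        threads.emplace_back([&, i] {
            // Pinning is best effort here; the work still runs if it fails.
            if (!cpus.empty()) pin_current_thread(cpus[i % cpus.size()]);
            for (std::uint64_t k = 0; k < iterations; ++k) {
                counters[i].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) t.join();

    std::uint64_t total = 0;
    for (auto& c : counters) total += c.value.load(std::memory_order_relaxed);
    return total;
}
```

Calling `run_workers(4, 1000000)` gives each of four threads its own core (where the affinity mask allows it) and its own cache line. Removing the `alignas` so the counters share a line is an easy way to see the bouncing in a profiler; how large the difference is depends on the hardware and the workload.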