Having been there when C compilers used to generate lousy code and we had legends like Mike Abrash teaching a whole industry how to write high-performance assembly, I always find it ironic that there is this myth about C and C++ having been blazing fast ever since they were created.
That's not what I believe. C generates the code I expect, rather than the insanely bloated crap that comes out of C++. C++ obfuscates what is actually going on. Oh, and then there are all the problems with the C++ standard library. It's unusable in high-performance, embedded-style code because many aspects of it are unconstrained: things like memory allocations and unspecified complexity. In some cases we were able to use Boost, but in many cases we had to reimplement algorithms so that the implementation performed all of its allocations at init time and none at runtime. But oh, how an event-driven, multithreaded, hot-running messaging system does go fast...
This is on modern x86-64 utilizing many cores and processing millions of messages per second. Efficient, well thought out architectures work well on small systems as well as large.
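For readers who haven't seen the "allocate at init, never at runtime" pattern mentioned above, here is a minimal sketch of the idea. It is not the code from the system being described; `Message` and `MessagePool` are hypothetical names made up for illustration, and the pool is deliberately single-threaded and does no checking for double release.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical message type, used only for illustration.
struct Message {
    int id = 0;
    char payload[56] = {};
};

// Fixed-capacity pool: all heap allocation happens in the constructor.
// acquire() and release() never allocate afterwards.
// Assumptions: single-threaded use, and each slot is released at most once
// per acquire (a double release would grow the free list past its reserve).
class MessagePool {
public:
    explicit MessagePool(std::size_t capacity) : slots_(capacity) {
        free_.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i) {
            free_.push_back(&slots_[i]);
        }
    }

    // Returns nullptr when the pool is exhausted instead of allocating.
    Message* acquire() {
        if (free_.empty()) return nullptr;
        Message* m = free_.back();
        free_.pop_back();
        return m;
    }

    // Hands a slot back; push_back stays within the reserved capacity,
    // so it does not reallocate.
    void release(Message* m) { free_.push_back(m); }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<Message> slots_;   // backing storage, sized once at init
    std::vector<Message*> free_;   // free list, reserved once at init
};
```

The point of returning `nullptr` on exhaustion is that the caller has to decide what happens under overload (drop, back-pressure, etc.) rather than having the allocator quietly decide for it on the hot path.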
I bet that at -O3 it definitely does not generate the code you expect, and the same C code when compiled with a C++ compiler will generate even better machine code.
You would be wrong. I've spent plenty of time looking at the code the compiler produces from both C and C++ while working through performance profiling. Yes, -O3 can apply non-trivial transformations, but the output is still easy enough to understand when one has been reading it for decades.
In my experience, it isn't the compiler that is the bottleneck to performance. Code can always be tweaked to get the compiler to produce output that does what is intended. My complaints about C++ are more about the complete unsuitability of the standard library for embedded-type applications, which caused us to spend plenty of time reimplementing algorithms that were already provided as part of the normal C++ infrastructure. Having the right infrastructure to build a high-performance application is about far more than having an optimizing compiler. Libraries matter, kernel APIs matter, taking advantage of the right hardware configurations matters. The compiler is actually a fairly small part of the overall performance picture: doing things like pinning threads and avoiding cache line bounces between threads/CPUs buys orders of magnitude more performance than changing the compiler.
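To make the last point concrete, here is a rough sketch of the two techniques named above: giving each thread its own cache line, and pinning threads to CPUs. Everything in it is an assumption for illustration rather than the poster's code: the 64-byte line size (typical for x86-64), the names `PaddedCounter`, `pin_current_thread` and `count_messages`, and the use of Linux `sched_setaffinity` for pinning. It also assumes C++17 so that `std::vector` honors the over-aligned type.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>
#ifdef __linux__
#include <sched.h>
#endif

// Assumed cache line size; 64 bytes is typical for x86-64.
constexpr std::size_t kCacheLine = 64;

// One counter per cache line, so threads updating neighbouring counters
// do not keep invalidating each other's lines.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");

// Pins the calling thread to a single CPU. Linux-only sketch; returns
// false on other platforms or if the kernel rejects the request.
inline bool pin_current_thread(int cpu) {
#ifdef __linux__
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = calling thread
#else
    (void)cpu;
    return false;
#endif
}

// Hypothetical driver: each worker bumps only its own padded counter.
inline std::uint64_t count_messages(unsigned workers, std::uint64_t per_worker) {
    std::vector<PaddedCounter> counters(workers);
    std::vector<std::thread> threads;
    threads.reserve(workers);
    unsigned cpus = std::thread::hardware_concurrency();
    if (cpus == 0) cpus = 1;  // hardware_concurrency may report 0
    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&counters, w, cpus, per_worker] {
            pin_current_thread(static_cast<int>(w % cpus));  // best effort
            for (std::uint64_t i = 0; i < per_worker; ++i) {
                counters[w].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) t.join();
    std::uint64_t total = 0;
    for (auto& c : counters) total += c.value.load(std::memory_order_relaxed);
    return total;
}
```

Dropping the `alignas` packs several counters into one line, and the same loop then has every core fighting over it. How large the difference is depends entirely on the hardware and workload, so measure rather than trust a sketch.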