Obviously there's no avoiding this, but the author writes in the context of high-level code where OS-provided locking mechanisms are available. Why are we discussing this in a low-level (or embedded, where every ounce of performance matters) context?
You don't need to be embedded to have performance matter - even if your OS gives you concurrency primitives, there are many situations where jumping into kernel code is still "too expensive."
There are many situations in which OS-level primitives such as blocking semaphores are available but userland primitives are not. When implementing libc, for example. Or when implementing a language runtime, or TBB, or boost::thread_pool, or what have you.
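To make that concrete, here is a rough sketch of the kind of thing a runtime might layer on top of a blocking semaphore: a lock that stays in userland when uncontended and only blocks when a second thread shows up (sometimes called a "benaphore"). All names here (`BlockingSemaphore`, `Benaphore`, `count_with_benaphore`) are made up for illustration, and the semaphore is a portable stand-in built from `std::mutex` and `std::condition_variable`, not a real OS semaphore. Treat it as a sketch of the idea, not production code.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an OS-provided blocking semaphore.
// A real runtime would wrap whatever the platform offers instead.
class BlockingSemaphore {
public:
    void post() {
        std::lock_guard<std::mutex> guard(m_);
        ++count_;
        cv_.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> guard(m_);
        cv_.wait(guard, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
};

// Userland lock: one atomic increment on the uncontended path,
// and the semaphore is touched only when there is contention.
class Benaphore {
public:
    void lock() {
        // fetch_add returns the previous value; > 0 means the lock is held,
        // so this thread has to block until the holder hands it over.
        if (contenders_.fetch_add(1) > 0) {
            sem_.wait();
        }
    }
    void unlock() {
        // Previous value > 1 means at least one other thread is waiting.
        if (contenders_.fetch_sub(1) > 1) {
            sem_.post();
        }
    }
private:
    std::atomic<int> contenders_{0};
    BlockingSemaphore sem_;
};

// Demo helper: several threads bump a plain (non-atomic) counter under the lock.
long count_with_benaphore(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    Benaphore lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // protected by the benaphore
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```

Calling `count_with_benaphore(4, 10000)` should give back exactly `threads * iters` if the lock is doing its job. The point of the design is that a lone thread taking and releasing the lock never reaches the semaphore at all, which is the "avoid the kernel unless you have to" case being argued for above.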