You are right that the statement was overblown, however when I was testing with "trivial" load between yields (synchronized ping-pong between coroutines), I was getting numbers that I had trouble believing, when comparing them to other solutions.
In my test of a similar setup in C++ (IIRC about 10 years ago!), I was able to do a context switch every other cycle. The bottleneck was literally the cycles per taken jump of the microarchitecture I was testing again. As in your case it was a trivial test with two coroutines doing nothing except context switching, so the compiler had no need to save any registers at all and I carefully defined the ABI to be able to keep stack and instruction pointers in registers even across switches.