Your method of estimating instructions includes printing multiple lines, which context-switches into the kernel to perform IO. I'm not sure how an "instruction count" metric is useful anyway.
edit: I'm not actually sure you are counting the context switch, but I still don't think estimating instruction count that way is particularly useful.
250k/s works out to about 4 µs per operation, which is roughly the same speed as a context switch, so while slow for pure computation, it is a reasonable amount of "waste" for switching between concurrent tasks.
If you didn't prevent preemptive context switches during your benchmarking, it's entirely possible the only thing you measured was the context switch time.
This is a fun experiment, but getting a rigorous idea of the overhead involved takes more work than anyone in the post or comments has done.
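If anyone wants to put a number on the context-switch side of that comparison, here is a minimal sketch (assuming Linux and glibc; the file name, pinned CPU, and round count are arbitrary): two processes pinned to the same core bounce a byte over pipes, so each round trip forces roughly two switches plus the pipe syscalls.

    /* Pipe ping-pong microbenchmark: two processes pinned to the same core
       bounce one byte back and forth; each round trip forces ~2 context
       switches plus the read/write syscall overhead. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        enum { ROUNDS = 100000 };
        int p2c[2], c2p[2];            /* parent->child and child->parent pipes */
        if (pipe(p2c) || pipe(c2p)) { perror("pipe"); return 1; }

        /* Pin the process (inherited across fork) to CPU 0 so every hand-off
           is a real switch rather than the two processes running in parallel. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof(set), &set);

        char byte = 'x';
        pid_t pid = fork();
        if (pid == 0) {                /* child: echo whatever arrives */
            for (int i = 0; i < ROUNDS; i++) {
                read(p2c[0], &byte, 1);
                write(c2p[1], &byte, 1);
            }
            _exit(0);
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ROUNDS; i++) {
            write(p2c[1], &byte, 1);
            read(c2p[0], &byte, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        wait(NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.0f ns per round trip (~%.0f ns per switch)\n",
               ns / ROUNDS, ns / ROUNDS / 2);
        return 0;
    }

Note the per-switch figure includes the pipe syscalls, not just the scheduler, so it is an upper-ish bound rather than a pure context-switch cost.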
Matching the cost of a genuine context switch should be a (laughably bad) upper bound for any language's particular concurrency offerings. It is not reasonable.
Reasonableness is relative and use case dependent. The post itself illustrates how the cost is insignificant compared to other "wasteful" operations related to CSS handling.
If this is too much overhead for your use case, there are plenty of other approaches and languages to choose from.
If it costs as much as a context switch, you might as well just do context switches.
These hosted language scaffoldings - whether that's asyncio, goroutines, TPL, Webflux, etc. - exist specifically so you don't have to do a full context switch. If they cost as much as a context switch, they have failed, regardless of what else is taking time in the system.
If you're not doing any better, just replace your whole hosted concurrency system with a statement that triggers sched_yield.
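As a rough comparison point for that baseline, here is a minimal sketch (assuming Linux/glibc and pthreads, compiled with -pthread; the iteration count and pinned CPU are arbitrary) that measures how many sched_yield hand-offs per second two threads pinned to one core manage, which you can put next to the 250k/s figure:

    /* Two threads pinned to one core, each calling sched_yield() in a loop,
       so (almost) every call hands the CPU to the other thread via the
       kernel scheduler. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    #define YIELDS 1000000L

    static void *yield_loop(void *arg) {
        (void)arg;
        for (long i = 0; i < YIELDS; i++)
            sched_yield();
        return NULL;
    }

    int main(void) {
        /* Pin the whole process to CPU 0; threads created afterwards inherit
           the mask, so the two loops must alternate on one core. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof(set), &set);

        struct timespec t0, t1;
        pthread_t a, b;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&a, NULL, yield_loop, NULL);
        pthread_create(&b, NULL, yield_loop, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f yields/s across both threads\n", 2.0 * YIELDS / secs);
        return 0;
    }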