I don't think you're measuring context switching. If I remember correctly, Go's ...

coder543 · on Nov 19, 2018

I wrote my own quick benchmark: https://gist.github.com/coder543/8c1b9cdffdf09c19ef61322bd26...

The results:

    1 switcher:    14_289_797.08 yields/sec
    2 switchers:    5_866_478.94 yields/sec
    3 switchers:    4_832_941.33 yields/sec
    4 switchers:    4_604_051.57 yields/sec
    5 switchers:    4_268_906.99 yields/sec
    6 switchers:    3_982_688.58 yields/sec
    7 switchers:    3_799_103.41 yields/sec
    8 switchers:    3_673_094.58 yields/sec
    9 switchers:    3_513_868.07 yields/sec
    10 switchers:   3_351_813.00 yields/sec
    11 switchers:   3_325_754.64 yields/sec
    12 switchers:   3_150_383.56 yields/sec
    13 switchers:   3_037_539.31 yields/sec
    14 switchers:   2_435_807.77 yields/sec
    15 switchers:   2_326_201.72 yields/sec
    16 switchers:   2_275_610.57 yields/sec
    64 switchers:   2_366_303.83 yields/sec
    256 switchers:  2_400_782.51 yields/sec
    512 switchers:  2_408_757.26 yields/sec
    1024 switchers: 2_418_661.29 yields/sec
    4096 switchers: 2_460_257.29 yields/sec

Underscores and alignment added for legibility.

It looks like the context switching speed with a single goroutine completely outperforms any of the benchmark numbers that have been posted here for Python or Ruby, as would be expected, and it still outperforms them even when running 256 yielding tasks for every logical core.

The cost of switching increased more with the number of goroutines than I would have expected, but it seems to become pretty constant once you pass the number of cores on the machine. Also keep in mind that this benchmark is completely unrealistic. No one is writing busy loops that just yield as quickly as possible outside of microbenchmarks.

This benchmark was run on an AMD 2700X, so, 8 physical cores and 16 logical cores.
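For anyone who wants to try something similar, here is a minimal sketch of this kind of yield microbenchmark. It is not the code from the gist linked above; the function name `yieldsPerSecond`, its parameters, and the 200 ms measurement window are my own assumptions for illustration, and it needs Go 1.19 or later for the typed atomics.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// yieldsPerSecond (hypothetical helper) starts `switchers` goroutines that
// call runtime.Gosched in a busy loop for roughly `millis` milliseconds and
// returns the combined number of yields per second.
func yieldsPerSecond(switchers, millis int) float64 {
	var stop atomic.Bool
	var total atomic.Int64
	var wg sync.WaitGroup

	start := time.Now()
	for i := 0; i < switchers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local int64 // counted locally to keep atomics out of the hot loop
			for !stop.Load() {
				runtime.Gosched() // yield back to the Go scheduler
				local++
			}
			total.Add(local)
		}()
	}

	time.Sleep(time.Duration(millis) * time.Millisecond)
	stop.Store(true)
	wg.Wait()

	return float64(total.Load()) / time.Since(start).Seconds()
}

func main() {
	for _, n := range []int{1, 2, 4, 16, 256} {
		fmt.Printf("%d switchers: %.2f yields/sec\n", n, yieldsPerSecond(n, 200))
	}
}
```

The absolute numbers will depend heavily on the machine, the Go version, and GOMAXPROCS, so treat any output as a rough comparison only.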

ioquatix · on Nov 19, 2018

I wrote an addendum https://www.codeotaku.com/journal/2018-11/fibers-are-the-rig...

With C++/assembly, you can context switch about 100 million times per CPU core in a tight loop.

coder543 · on Nov 19, 2018

The one additional comment I have is that this addendum doesn't involve a reactor/scheduler in the benchmark, so it excludes the process of selecting the coroutine to switch into, which is a significant task. The Go benchmark I posted above is running within a scheduler.

But, I appreciate the addendum.

ioquatix · on Nov 19, 2018

So, that's a good point, and yes, the scheduler will have an impact, probably several orders of magnitude in comparison.

That being said, a good scheduler is basically just a loop, like:

https://github.com/kurocha/async/blob/bee8e8b95d23c6c0cfb319...

So, once it's decided what work to do, it's just a matter of resuming all the fibers in order.

Additionally, since fibers know what work to do next in some cases, the overhead can be very small. You sometimes don't need to yield back to the scheduler, but can resume another task directly.
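To make the "just a loop" idea concrete, here is a rough sketch in Go. It is not the code from the link above: Go has no user-level fibers, so the `task` type with its `step` method is a hypothetical stand-in for a fiber that runs until its next yield point, and `runAll` is an assumed name for the scheduler loop.

```go
package main

import "fmt"

// task is a hypothetical stand-in for a fiber: each call to step runs the
// task until its next yield point and reports whether it has finished.
type task struct {
	name  string
	steps int // remaining resumes before the task completes
}

func (t *task) step() bool {
	t.steps--
	return t.steps <= 0
}

// runAll resumes every ready task in order, round after round, until all of
// them have finished. It returns the order in which tasks were resumed.
func runAll(ready []*task) []string {
	var order []string
	for len(ready) > 0 {
		next := ready[:0] // reuse the backing array for the tasks still alive
		for _, t := range ready {
			order = append(order, t.name)
			if !t.step() {
				next = append(next, t)
			}
		}
		ready = next
	}
	return order
}

func main() {
	// prints [a b c a c c]
	fmt.Println(runAll([]*task{{"a", 2}, {"b", 1}, {"c", 3}}))
}
```

Once the ready list is known, the per-task cost is one slice iteration plus the resume itself, which is the point being made: selecting what to run can be cheap when the scheduler is just walking a list.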