> fibers can context switch billions of times per second on a modern processor.

On a 3GHz (3 billion hertz) processor, you expect to be able to context switch billions of times per second?

I would probably accept millions without question, even though that might be pushing it for a GIL-ed runtime like Ruby has. But, unless your definition of "context switch" counts every blocked fiber that's passed over for a context switch as being implicitly context switched to and away from in the act of ignoring it, I find this hard to believe.

It takes more than one clock cycle to determine the next task that we're going to resume execution for and then actually resume it.


It must be off by 3 or 4 orders of magnitude. Python with asyncio does about 120_000 switches per second with a trivial fibre/coroutine here:

  import time
  import asyncio

  REPS = 1_000_000

  async def coro():
      for i in range(REPS):
          await asyncio.sleep(0)  # yield to the event loop and get rescheduled

  t0 = time.time()
  asyncio.run(coro())
  dt = time.time() - t0
  print(REPS / dt, 'reps/s')


Golang seems to do around 350k. There's a chance I'm missing some tricks, but the code is so short that it's unlikely I'm missing much.

See: https://repl.it/repls/GrimyChiefGame

  package main

  import (
  	"fmt"
  	"time"
  )

  func coro(reps int) {
  	for i := 0; i < reps; i++ {
  		go time.Sleep(0 * time.Nanosecond) // spawns a new goroutine per iteration
  	}
  }

  func main() {
  	REPS := 5000000
  	start := time.Now()
  	coro(REPS)
  	dt := time.Since(start)
  	fmt.Printf("The call took %v to run.\n", dt)
  	fmt.Printf("REPS / duration %v\n", (REPS*1e9)/int(dt))
  }


I don't think you're measuring context switching.

If I remember correctly, Go's scheduler has a global queue and a local queue per worker thread, so when you spawn a goroutine it probably has to acquire a write lock on the global queue.

Allocating a brand new goroutine stack and doing some other setup tasks has a nontrivial overhead that has nothing to do with context switching, regardless of global locks.

To properly benchmark this, I think I would start with just measuring single-task switching: how long it takes main to call https://golang.org/pkg/runtime/#Gosched in a loop a million times. This would measure how quickly Go can yield a thread to the scheduler and have it be resumed, although it includes the overhead of calling a function.

Then I would launch a goroutine per core doing this yield loop and see how many switches per second they did in total, and then launch several per core, just to ensure that the number hasn't changed much from the goroutine per core measurement.

Since Go's scheduler is not bound to a single core, it should scale pretty well with core count.

I might run this benchmark myself in a while, if I find time.


I wrote my own quick benchmark: https://gist.github.com/coder543/8c1b9cdffdf09c19ef61322bd26...

The results:

    1 switcher:    14_289_797.08 yields/sec
    2 switchers:    5_866_478.94 yields/sec
    3 switchers:    4_832_941.33 yields/sec
    4 switchers:    4_604_051.57 yields/sec
    5 switchers:    4_268_906.99 yields/sec
    6 switchers:    3_982_688.58 yields/sec
    7 switchers:    3_799_103.41 yields/sec
    8 switchers:    3_673_094.58 yields/sec
    9 switchers:    3_513_868.07 yields/sec
    10 switchers:   3_351_813.00 yields/sec
    11 switchers:   3_325_754.64 yields/sec
    12 switchers:   3_150_383.56 yields/sec
    13 switchers:   3_037_539.31 yields/sec
    14 switchers:   2_435_807.77 yields/sec
    15 switchers:   2_326_201.72 yields/sec
    16 switchers:   2_275_610.57 yields/sec
    64 switchers:   2_366_303.83 yields/sec
    256 switchers:  2_400_782.51 yields/sec
    512 switchers:  2_408_757.26 yields/sec
    1024 switchers: 2_418_661.29 yields/sec
    4096 switchers: 2_460_257.29 yields/sec

Underscores and alignment added for legibility.

It looks like the context-switching speed with a single goroutine completely outperforms any of the benchmark numbers posted here for Python or Ruby, as would be expected, and it still outperforms them even when running 256 yielding tasks per logical core.

The cost of switching increased more with the number of goroutines than I would have expected, but it seems to become pretty constant once you pass the number of cores on the machine. Also keep in mind that this benchmark is completely unrealistic. No one is writing busy loops that just yield as quickly as possible outside of microbenchmarks.

This benchmark was run on an AMD 2700X, so, 8 physical cores and 16 logical cores.


I wrote an addendum https://www.codeotaku.com/journal/2018-11/fibers-are-the-rig...

With C++/assembly, you can context switch about 100 million times per second per CPU core in a tight loop.


The one additional comment I have is that this addendum doesn't involve a reactor/scheduler in the benchmark, so it excludes the process of selecting the coroutine to switch into, which is a significant task. The Go benchmark I posted above is running within a scheduler.

But, I appreciate the addendum.


So, that's a good point, and yes, the scheduler will have an impact, probably several orders of magnitude in comparison.

That being said, a good scheduler is basically just a loop, like:

https://github.com/kurocha/async/blob/bee8e8b95d23c6c0cfb319...

So, once it's decided what work to do, it's just a matter of resuming all the fibers in order.

Additionally, since fibers sometimes know what work to do next, the overhead can be very small: you don't always need to yield back to the scheduler, but can resume another task directly, as in the sketch below.
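
For example, here's a minimal sketch of that direct hand-off (my own illustration, not the linked kurocha/async code), assuming a recent Ruby (3.1+) where Fiber#transfer is available without require: two fibers pass control straight to each other without going back through a scheduler.

    root = Fiber.current

    a = b = nil

    a = Fiber.new do
      3.times { |i| puts "a#{i}"; b.transfer }  # hand control straight to b
      root.transfer                             # finished: return to the main fiber
    end

    b = Fiber.new do
      3.times { |i| puts "b#{i}"; a.transfer }  # and straight back to a
    end

    a.transfer  # kick off the ping-pong: prints a0 b0 a1 b1 a2 b2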


I don’t think you measured context switches per second, but rather goroutine forks per second. Am I mistaken?


Not really equivalent since it does not involve an event reactor, only context switching:

    X = 1_000_000

    f = Fiber.new do
      loop { Fiber.yield }  # each yield switches back to the caller
    end

    t0 = Time.now
    X.times { f.resume }  # each resume switches into the fiber
    dt = Time.now - t0
    puts "#{X / dt.to_f}/s"
~ 780K on my machine


asyncio eats about one order of magnitude. A more minimalistic async approach gives something on the order of ~900,000 resumptions per second.

(Single-thread hardware performance differences would not account for a factor of 10.)


Slightly modified "benchmark" from above:

asyncio:

    import time
    import asyncio

    REPS = 1_000_000

    async def coro():
        for i in range(REPS):
            await asyncio.sleep(0)

    t0 = time.perf_counter()
    asyncio.run(coro())
    dt = time.perf_counter() - t0
    print(REPS / dt, 'reps/s')
~180k

asynker (https://github.com/enkore/asynker):

    import time
    import asynker

    REPS = 1_000_000

    async def coro():
        for i in range(REPS):
            await asynker.suspend()

    t0 = time.perf_counter()
    sched = asynker.Scheduler()
    sched.run_until_complete(coro())
    dt = time.perf_counter() - t0
    print(REPS / dt, 'reps/s')
~900k

The excellent curio (https://github.com/dabeaz/curio):

    import time
    import curio

    REPS = 1_000_000

    async def coro():
        for i in range(REPS):
            await curio.sleep(0)

    t0 = time.perf_counter()
    curio.run(coro())
    dt = time.perf_counter() - t0
    print(REPS / dt, 'reps/s')
~180k

Not that this number is hugely important.


Would uvloop change these benchmarks?

https://github.com/MagicStack/uvloop


It actually does, but again, this is a "how fast can you do nothing" microbenchmark. It's a bit of trivia for almost all intents and purposes.

(The number's 500k)


Hi, I'm the author of the article, and it has been a while since I measured that number. Let me check it and fix it if it's wrong.

Update: I wrote an addendum https://www.codeotaku.com/journal/2018-11/fibers-are-the-rig...


I can only assume they're talking about the entire CPU (as in, all 4+ cores in a single die) rather than a single logical core in the hardware.

It's sloppy wording either way.


Hi, I'm the author, and I agree, so, since everyone is so interested, I'll check my original benchmark and fix the article.

Update: I wrote an addendum https://www.codeotaku.com/journal/2018-11/fibers-are-the-rig...


Would love to see some numbers supporting this.

He might have misphrased it, though. Maybe he meant to say that it can handle billions of concurrent tasks, not that they can switch that rapidly in just a second xD


> It takes more than one clock cycle to determine the next task that we're going to resume execution for and then actually resume it.

Not really, if you just implement a round-robin scheduler and if none of the fibres are actually blocked (i.e. if they just yield).

Artificial benchmark? Sure. But there's no way you could ever approximate this with OS threads, even if the user-space code/threads themselves do no useful work.
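
A minimal Ruby sketch of that case (my own illustration, nothing from the article): every fibre is always runnable and just yields, so the "scheduler" is nothing more than a walk over an array.

    fibers = Array.new(4) do
      Fiber.new { loop { Fiber.yield } }  # always runnable, never blocks
    end

    reps = 1_000_000
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    reps.times { |i| fibers[i % fibers.size].resume }  # round-robin: next task is just the next slot
    dt = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
    puts "#{(reps / dt).round} switches/s"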


The difference between stackful coroutines (i.e. fibers in this case) and Python-style coroutines is that Python's coroutines need to rebuild the stack on every resume, whereas resuming a fiber is basically a goto. The cost of yielding and resuming a coroutine in Python is O(N) for the depth of the call chain, but O(1) for fibers.

So as you say, a simple scheduler that just walks a linked list could resume a stackful coroutine in a few instructions, plus the cost of restoring hardware registers (which is the same cost as a non-inlined function call), and the latter is easily pipelined, so it would take fewer cycles than instructions.
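
To make the O(1) point concrete, a small Ruby sketch (mine, not from the thread): the fiber yields from 1000 frames deep, and resuming it jumps straight back to that point without re-entering any of those frames.

    def deep(n)
      n.zero? ? Fiber.yield(:bottom) : deep(n - 1)
    end

    f = Fiber.new { deep(1000); :done }

    p f.resume  # runs down 1000 frames, then yields :bottom
    p f.resume  # a single jump back to the bottom frame, whatever the depth

A Python coroutine chain of the same depth would have to wind through every awaiting frame on each resume.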


Indeed, a highly optimized scheduler is basically the same as looping through an array and calling a virtual method on each object. Extremely cheap.

