> fibers can context switch billions of times per second on a modern processor.
On a 3GHz (3 billion hertz) processor, you expect to be able to context switch billions of times per second?
I would probably accept millions without question, even though that might be pushing it for a GIL-ed runtime like Ruby has. But, unless your definition of "context switch" counts every blocked fiber that's passed over for a context switch as being implicitly context switched to and away from in the act of ignoring it, I find this hard to believe.
It takes more than one clock cycle to determine the next task that we're going to resume execution for and then actually resume it.
If I remember correctly, Go's scheduler has a global queue and a local queue per worker thread, so when you spawn a goroutine it probably has to acquire a write lock on the global queue.
Allocating a brand new goroutine stack and doing some other setup tasks has a nontrivial overhead that has nothing to do with context switching, regardless of global locks.
To properly benchmark this, I think I would start with single-task switching: time how long it takes main to call runtime.Gosched (https://golang.org/pkg/runtime/#Gosched) in a loop a million times. This would measure how quickly Go can yield to the scheduler and be resumed, although it includes the overhead of calling a function.
Then I would launch one goroutine per core running this yield loop and see how many switches per second they manage in total, and then launch several per core to check that the number doesn't change much from the one-goroutine-per-core measurement.
Since Go's scheduler is not bound to a single core, it should scale pretty well with core count.
I might run this benchmark myself in a while, if I find time.
It looks like the context switching speed with a single goroutine completely outperforms any of the benchmark numbers that have been posted here for Python or Ruby, as would be expected, and it still outperforms the others even when running 256 yielding tasks for every logical core.
The cost of switching increased more with the number of goroutines than I would have expected, but it seems to become pretty constant once you pass the number of cores on the machine. Also keep in mind that this benchmark is completely unrealistic. No one is writing busy loops that just yield as quickly as possible outside of microbenchmarks.
This benchmark was run on an AMD 2700X, so, 8 physical cores and 16 logical cores.
The one additional comment I have is that this addendum doesn't involve a reactor/scheduler in the benchmark, so it excludes the process of selecting the coroutine to switch into, which is a significant task. The Go benchmark I posted above is running within a scheduler.
So, once it's decided what work to do, it's just a matter of resuming all the fibers in order.
Additionally, since fibers know what work to do next in some cases, the overhead can be very small. You sometimes don't need to yield back to the scheduler, but can resume another task directly.
He might have misphrased it, though. Maybe he meant that it can handle billions of concurrent tasks, not that they can switch billions of times in a single second xD
> It takes more than one clock cycle to determine the next task that we're going to resume execution for and then actually resume it.
Not really, if you just implement a round-robin scheduler and if none of the fibres are actually blocked (i.e. if they just yield).
Artificial benchmark? Sure. But there's no way you could ever approximate this with OS threads, even if the user-space code/threads themselves do no useful work.
The difference between stackful coroutines (i.e. fibers in this case) and Python-style coroutines is that Python's coroutines need to rebuild the stack on every resume, whereas resuming a fiber is basically a goto. The cost of yielding and resuming a coroutine in Python is O(N) for the depth of the call chain, but O(1) for fibers.
So as you say, a simple scheduler that just walks a linked-list could resume a stackful coroutine in a few instructions, plus the cost of restoring hardware registers (which is the same cost as a non-inlined function call), and the latter is easily pipelined so would take fewer cycles than instructions.