A typical server under load has more outstanding requests to answer than it can OS threads. It is not uncommon in a Go server at Google to find a million or so goroutines.
If you want to have an OS thread that can also run C code, it needs a large enough stack to run C code. A million such stacks is not practical.
On an OS that lets you avoid creating stacks along with threads, like Linux, and that has quite lightweight kernel-side accounting for threads, it would be possible to give each goroutine a thread. You would still need an N:M pool for cgo.
As an added problem, 1:1 is slower for certain kinds of common channel operations. Anywhere you have an unbuffered channel and block waiting for a value, a 1:1 model requires a context switch to pass the value, whereas M:N means the userland scheduler can switch goroutines on the OS thread.
It is precisely these benchmarks that led Ian to implement M:N in gccgo. If there are combinations where it is 1:1, that is either a new decision he has made in the last year, or (more likely) OS/ARCH combinations that he hasn't moved to M:N yet.
I have seen a similar attempt at 1:1 lightweight tasks in C++ that ran into this. Without the ability to preemptively move tasks between OS threads, it ran into performance problems. Programs that needed the speed in that model had to switch to a futures-style of programming.
> A typical server under load has more outstanding requests to answer than it can OS threads.
First of all, I object to the use of "can" here. There was a time where this was true, but, as I said, that time has passed. There is no problem running millions of kernel threads today.
That being said, there are advantages of not doing that. Right now Go uses a network polling mechanism like epoll to implement network I/O without consuming threads. This is totally orthogonal to the 1:1 vs. N:M discussion. Go can use 1:1 scheduling while still performing asynchronous I/O behind the scenes, just as it does today.
The nature of the mapping between g's and m's does not influence the other {g, g, ...} -> {g0{m}} mapping used to implement network I/O.
> It is not uncommon in a Go server at Google to find a million or so goroutines.
No problem with that, they can continue to do so.
> If you want to have an OS thread that can also run C code, it needs a large enough stack to run C code. A million such stacks is not practical.
C code needs its own stack, but this is again orthogonal to scheduling. Right now there is an N:1 mapping {g, g, ...} -> {g0} which makes it easy to switch to the g0 stack. This would have to change to an N:M mapping {g, g, ...} -> {g0, g0, ...}, and Go code would have to acquire C stacks just as it acquires any other resource it needs.
This is not expensive. In fact, C calls are expensive now precisely because there is a deep interaction between calling C code and the scheduler (which costs thousands of cycles). All of those cycles would go away. In a 1:1 model all you'd have to do is acquire a C stack, which is very fast and, in the uncontended case (the common case), happens entirely in userspace, probably in less than a hundred cycles.
> You would still need an N:M pool for cgo.
As explained above, you still need an N:M pool, but it no longer needs to be integrated with the scheduler, which makes it much simpler to implement (and faster).
> Anywhere you have an unbuffered channel and block waiting for a value, a 1:1 model requires a context switch to pass the value, whereas M:N means the userland scheduler can switch goroutines on the OS thread.
Only if you naively implement channels as mutex-based queues.
There is still a runtime underneath that can switch stacks on different threads. I want to move general scheduling decisions out of the Go runtime and into the kernel, but pointwise, stateless, swap-of-control type of things can still happen synchronously in the Go runtime.
I do not believe the networking issue is orthogonal. I believe you will find that any communication between two OS threads is significantly slower than switching one thread between an active and a parked goroutine. Futexes are slow and spinlocks burn CPU. (This is exactly the case the C++ model I mentioned ran into, and why gccgo got an N:M model.)
But as I said I'm happy to be convinced. The easiest way to demonstrate it would be to move gccgo linux/amd64 back to 1:1 without hurting the channel benchmarks. You could use that as an argument that fanning out an epoll loop among threads can be made fast.
> First of all, I object to the use of "can" here. There was a time where this was true, but, as I said, that time has passed. There is no problem running millions of kernel threads today.
I also object to the use of "can": just because we can doesn't mean we should.
Last I checked, creating a thread on Windows took 16 ms. Creating a million threads would take more than 4 hours. That's a hell of a startup time! =D
(And don't bother saying Linux is faster. It's not relevant unless it's by 4 orders of magnitude ;) ).
I should add: if you believe these are surmountable problems, please take it out of the forum and write a proposal. Everyone who works on the Go runtime would love to see it radically simplified. We put up with the complexity only because we believe it necessary.
> I should add: if you believe these are surmountable problems, please take it out of the forum and write a proposal.
Yes, I've been wanting to write a proposal for over 5 years now, if only there was more time in the world...
Coincidentally, I also have a method for implementing Go stacks (growing and shrinking) that does not involve copying stacks and avoids the hot-split problem.
Late edit:
In fact, this method would work particularly well for "growing" the stack to call C code in the 1:1 model proposed above.
It would make calling a C function just another function call with some marker on it (like we have NOFRAME and NOSPLIT now). And when we switch to a more ABI-compatible calling convention (which we want to do anyway, irrespective of all this), there would not even be a need for a C translator written in assembly any more.
No, it would not mean such a thing at all. Where did you get this idea?
In fact, on some operating systems/linker combinations, gccgo uses a 1:1 threading model.
> Nice sequential blocking-style code is my favorite thing about Go.
Mine too.