Great article. I am really curious, what's the theoretical position of using coroutines/green threads vs native OS threads especially in Linux as of 2018? Was Golang right in choosing this model or Linux has become very efficient when it comes to creating and scheduling threads?
It's a tradeoff. You can sometimes get better performance if you implement scheduling yourself in userland, primarily because you can save a trip through the OS scheduler when you do context switches. But you lose compatibility with all existing code if you do so.
It doesn't need to be this way. There is a Linux kernel patch that a Google engineer gave a presentation about in 2013 [1] that allows for true kernel threads that can be scheduled in userland, eliminating the tradeoff. Unfortunately, the patch never seemed to go anywhere, and it seems that the author is no longer working at Google (mail bounced when I tried to contact him about it). Note also that Windows already has this functionality, known as UMS.
Are Linux kernel threads expensive even for the kernel itself to use, or only for userland? I.e., if you had an embedded Linux appliance coded as a kernel-space driver, would it be able to take advantage of very cheap threading?
I don't think anyone from Google claims that threads with userspace switch-to scheduling are a novel advancement of computer science. However, it is somewhat novel to have a production-quality implementation of the same in C++.
Depends on if you want to switch between callbacks or if you want to switch between actual stacks.
If you really need task switching you're not really going to beat the kernel by enough to matter. If you only want to switch between callbacks with very little state then yeah green theads work great.
Coroutines look to be a really great way to get the best of both worlds, and seem like a generally better model than what Go did for most applications.
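As a rough sketch of the coroutine model, here's the idea in Python's asyncio (not Go; the function names and the toy workload are made up for illustration). Each "task" is a small heap object rather than a stack, and the event loop switches between them at await points:

```python
import asyncio

async def fetch(i: int) -> int:
    # Simulate waiting on the network; instead of blocking an OS
    # thread, the event loop resumes another coroutine here.
    await asyncio.sleep(0.01)
    return i * 2

async def main() -> list:
    # 100 concurrent "requests" multiplexed onto one OS thread.
    return await asyncio.gather(*(fetch(i) for i in range(100)))

results = asyncio.run(main())
print(results[:3])  # → [0, 2, 4]
```

The tradeoff mentioned upthread still applies: blocking calls anywhere in the coroutine stall the whole loop, which is exactly the compatibility cost of userland scheduling.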
Did you read the PDFs that are cited in this thread? Paul Turner's switch-to scheme switches between real threads orders of magnitude faster than the kernel scheduler. So it's not true that "you're not really going to beat the kernel by enough to matter". It's actually much faster.
No, I haven't, but in my experience at Google the kernel scheduler is a big problem for thread-per-request servers pushed to their limits. Async completion-passing servers are more efficient at their limits, but more difficult to write, maintain, and debug. Userspace thread scheduling expands the performance envelope of the thread-per-request architecture.
This fits my expectation. I still tell most folks that starting at thread per request is a sensible starting place. The difficulty in writing continuation style can cause most efforts to stall out.
Of course, that last assertion needs data. And I could be wrong. :(
Green threads are still far faster, mainly because of lower context-switching overhead. This lets you have many green threads per request without problems. Green threads will probably always be much faster unless we find a way to make OS context switches cheaper.
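The cost gap is easy to see with a crude ping-pong benchmark, here in Python purely for illustration: resuming a generator is just a function call in userland, while handing control between two OS threads goes through the kernel scheduler (plus, in Python, the GIL, so this exaggerates the gap; the function and variable names are mine):

```python
import threading
import time

N = 10_000

# Userland "context switch": resuming a generator is a function call.
def player():
    while True:
        yield

def bench_generators() -> float:
    a, b = player(), player()
    next(a); next(b)  # prime both generators
    t0 = time.perf_counter()
    for _ in range(N):
        next(a)
        next(b)
    return time.perf_counter() - t0

# Kernel context switch: two OS threads hand control back and forth,
# each handoff waking the other thread through the scheduler.
def bench_threads() -> float:
    ping, pong = threading.Event(), threading.Event()

    def worker():
        for _ in range(N):
            ping.wait()
            ping.clear()
            pong.set()

    t = threading.Thread(target=worker)
    t.start()
    t0 = time.perf_counter()
    for _ in range(N):
        ping.set()
        pong.wait()
        pong.clear()
    elapsed = time.perf_counter() - t0
    t.join()
    return elapsed

gen_t, thr_t = bench_generators(), bench_threads()
print(f"generator switches: {gen_t:.4f}s, thread handoffs: {thr_t:.4f}s")
```

The absolute numbers are machine-dependent, but the userland side reliably wins by a large factor.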
That's a good point: if you create only a few green threads, the overhead of managing them should nullify any advantage of lightweight creation and scheduling.
Languages like Go like to advertise that you can easily spawn millions of green threads (which would fail with OS-level threads, or bring your OS down), but I've never seen a use case for that.
For realistic applications a work stealing OS thread pool should be faster than spawning lots of green threads. But in the end, performance has to be measured, of course.
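A sketch of that pooled alternative in Python (ThreadPoolExecutor is a simple shared-queue pool, not a work-stealing one, so this only approximates the design; the workload is a made-up placeholder): a fixed number of OS threads services many tasks instead of one thread (green or native) per task.

```python
from concurrent.futures import ThreadPoolExecutor

def work(i: int) -> int:
    # Placeholder for a real unit of work.
    return i * i

# 8 OS threads drain a queue of 10,000 tasks; no per-task stack cost.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(work, range(10_000)))

print(results[-1])  # → 99980001
```

Whether this beats mass-spawned green threads depends on the workload, which is exactly why it has to be measured.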
Interesting. In my mind the memory usage of M threads is too expensive when you're mostly waiting for network or file-system events. The advantage I remember from green threads is that the underlying thread can execute another CPU-bound task while one green thread waits on network events, which intrinsically makes the model more memory-efficient. I'm interested in how Linux has evolved to reduce memory consumption when you have millions of threads, because you can have millions of green threads.
Most of the memory usage comes from the user stack, and you can customize the size of the stack. So memory consumption isn't really the limiting factor here. More important is the cost of context switching.
2KB is tiny; that must be a default which grows or is extended as required? Obviously it works out okay for them, but I feel like it would add nontrivial overhead to every function call if you always have to check whether you need to grow the stack to accommodate the new frame.
Linux does not immediately map 2MB of stack for every thread. That would be ridiculous. Look in /proc/&lt;pid&gt;/smaps to see how much space your stacks actually occupy.
A deeper problem is with Linux threads you can’t grow the stack beyond whatever limit you choose, so you either have to write all code with very limited stack usage or set a big enough stack for the worst case. Which will be too big for millions of threads.
The kernel has to restore registers, including the stack pointer, on a context switch. But it doesn't care about how big the allocation underpinning that stack is.
I'm not talking about creating them, I'm talking about memory and context-switch cost. There is a reason why none of the web servers on Linux use one thread per connection.
I know what you mean (nginx &amp; co), but I wouldn't bet on whether the majority of websites and APIs use a thread per connection or event-driven IO. E.g. there are a lot of PHP sites out there, which are thread-per-connection. Or sites powered by Java servlets, which had been synchronous until quite recently.
Thread stacks are reserved, but not allocated up-front (unless you're running on OSv apparently[0]).
1000 concurrent threads will not take stack size * 1000, they'll take used pages * 1000 (with used pages being at least 1) + the kernel overhead of the thread structures.
That's trivial to check, even in a high-level language (with its own additional overhead) e.g. spawning 1000 threads in Python on OSX takes ~25MB. OSX uses 512k stacks for non-main threads.
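That check can be reproduced on any POSIX system along these lines (a sketch; `ru_maxrss` is kilobytes on Linux but bytes on macOS, so treat the printed number loosely):

```python
import resource
import threading

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# 1000 parked threads, each with the default (reserved, not
# committed) stack; only the touched pages count against RSS.
stop = threading.Event()
threads = [threading.Thread(target=stop.wait) for _ in range(1000)]
for t in threads:
    t.start()

after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
stop.set()
for t in threads:
    t.join()

print(f"RSS growth: {after - before}")
```

The growth comes out in the tens of megabytes, nowhere near 1000 times the full stack reservation, confirming that stacks are reserved lazily.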
Your general point does not stand. It is totally incorrect. A thread does not commit its stack limit immediately. The limit is the limit. The initial size is a single page.
And even then, the original point wasn't "an OS thread is as cheap as a userland thread", it was "all things considered a native thread is pretty cheap".