Great article. I am really curious, what's the theoretical position of using coroutines/green threads vs native OS threads especially in Linux as of 2018? Was Golang right in choosing this model or Linux has become very efficient when it comes to creating and scheduling threads?
It's a tradeoff. You can sometimes get better performance if you implement scheduling yourself in userland, primarily because you can save a trip through the OS scheduler when you do context switches. But you lose compatibility with all existing code if you do so.
It doesn't need to be this way. There is a Linux kernel patch that a Google engineer gave a presentation about in 2013 [1] that allows for true kernel threads that can be scheduled in userland, eliminating the tradeoff. Unfortunately, the patch never seemed to go anywhere, and it seems that the author is no longer working at Google (mail bounced when I tried to contact him about it). Note also that Windows already has this functionality, known as UMS.
Are Linux kernel threads expensive even for the kernel itself to use, or only for userland? I.e., if you had an embedded Linux appliance coded as a kernel-space driver, would it be able to take advantage of very cheap threading?
I don't think anyone from Google claims that threads with userspace switch-to scheduling are a novel advancement of computer science. However, it is somewhat novel to have a production-quality implementation of the same in C++.
Depends on if you want to switch between callbacks or if you want to switch between actual stacks.
If you really need task switching you're not really going to beat the kernel by enough to matter. If you only want to switch between callbacks with very little state then yeah green theads work great.
Coroutines look to be a really great way to get the best of both worlds, and seem like a generally better model than what Go did for most applications.
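As a rough sketch of the coroutine model, here's the idea in Python's asyncio (not Go; the function names and the toy workload are made up for illustration). Each "task" is a small heap object rather than a stack, and the event loop switches between them at await points:

```python
import asyncio

async def fetch(i: int) -> int:
    # Simulate waiting on the network; instead of blocking an OS
    # thread, the event loop resumes another coroutine here.
    await asyncio.sleep(0.01)
    return i * 2

async def main() -> list:
    # 100 concurrent "requests" multiplexed onto one OS thread.
    return await asyncio.gather(*(fetch(i) for i in range(100)))

results = asyncio.run(main())
print(results[:3])  # → [0, 2, 4]
```

The tradeoff mentioned upthread still applies: blocking calls anywhere in the coroutine stall the whole loop, which is exactly the compatibility cost of userland scheduling.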
Did you read the PDFs that are cited in this thread? Paul Turner's switch-to scheme switches between real threads orders of magnitude faster than the kernel scheduler. So it's not true that "you're not really going to beat the kernel by enough to matter". It's actually much faster.
No, I haven't, but in my experience at Google the kernel scheduler is a big problem for thread-per-request servers pushed to their limits. Async completion-passing servers are more efficient at their limits, but more difficult to write, maintain, and debug. Userspace thread scheduling expands the performance envelope of the thread-per-request architecture.
This fits my expectation. I still tell most folks that starting at thread per request is a sensible starting place. The difficulty in writing continuation style can cause most efforts to stall out.
Of course, that last assertion needs data. And I could be wrong. :(
Green threads are still far faster, mainly because of lower context-switching overhead. This lets you have many green threads per request without problems. Green threads will probably always be much faster unless we find a way to make OS context switches cheaper.
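The cost gap is easy to see with a crude ping-pong benchmark, here in Python purely for illustration: resuming a generator is just a function call in userland, while handing control between two OS threads goes through the kernel scheduler (plus, in Python, the GIL, so this exaggerates the gap; the function and variable names are mine):

```python
import threading
import time

N = 10_000

# Userland "context switch": resuming a generator is a function call.
def player():
    while True:
        yield

def bench_generators() -> float:
    a, b = player(), player()
    next(a); next(b)  # prime both generators
    t0 = time.perf_counter()
    for _ in range(N):
        next(a)
        next(b)
    return time.perf_counter() - t0

# Kernel context switch: two OS threads hand control back and forth,
# each handoff waking the other thread through the scheduler.
def bench_threads() -> float:
    ping, pong = threading.Event(), threading.Event()

    def worker():
        for _ in range(N):
            ping.wait()
            ping.clear()
            pong.set()

    t = threading.Thread(target=worker)
    t.start()
    t0 = time.perf_counter()
    for _ in range(N):
        ping.set()
        pong.wait()
        pong.clear()
    elapsed = time.perf_counter() - t0
    t.join()
    return elapsed

gen_t, thr_t = bench_generators(), bench_threads()
print(f"generator switches: {gen_t:.4f}s, thread handoffs: {thr_t:.4f}s")
```

The absolute numbers are machine-dependent, but the userland side reliably wins by a large factor.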
That's a good point: if you create only a few green threads, the overhead of managing them should nullify any advantage of lightweight creation and scheduling.
Languages like Go like to advertise that you can easily spawn millions of green threads (which would fail with OS-level threads, or bring your OS down), but I've never seen a use case for that.
For realistic applications a work stealing OS thread pool should be faster than spawning lots of green threads. But in the end, performance has to be measured, of course.
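A sketch of that pooled alternative in Python (ThreadPoolExecutor is a simple shared-queue pool, not a work-stealing one, so this only approximates the design; the workload is a made-up placeholder): a fixed number of OS threads services many tasks instead of one thread (green or native) per task.

```python
from concurrent.futures import ThreadPoolExecutor

def work(i: int) -> int:
    # Placeholder for a real unit of work.
    return i * i

# 8 OS threads drain a queue of 10,000 tasks; no per-task stack cost.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(work, range(10_000)))

print(results[-1])  # → 99980001
```

Whether this beats mass-spawned green threads depends on the workload, which is exactly why it has to be measured.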
Interesting. In my mind the memory usage of M threads is too expensive when you're mostly waiting for network or file-system events. The advantage I remember from green threads is that the underlying thread can execute another CPU-bound task while one green thread waits on network events, which intrinsically makes the model more memory-efficient. I'm interested in how Linux has evolved to reduce memory consumption when you have millions of threads, because you can have millions of green threads.
Most of the memory usage comes from the user stack, and you can customize the size of the stack. So memory consumption isn't really the limiting factor here. More important is the cost of context switching.
2KB is tiny; that must be a default which grows or is extended as required? Obviously it works out okay for them, but I feel like it would add nontrivial overhead to every function call if you always have to check whether you need to grow the stack to accommodate the new frame.
Linux does not immediately map 2MB of stack for every thread. That would be ridiculous. Look in /proc/&lt;pid&gt;/smaps to see how much space your stacks actually occupy.
A deeper problem is with Linux threads you can’t grow the stack beyond whatever limit you choose, so you either have to write all code with very limited stack usage or set a big enough stack for the worst case. Which will be too big for millions of threads.
The kernel has to restore registers, including the stack pointer, on a context switch. But it doesn't care about how big the allocation underpinning that stack is.
I'm not talking about creating them, I'm talking about memory and context-switch cost. There is a reason why none of the web servers on Linux use one thread per connection.
I know what you mean (nginx &amp; co), but I wouldn't bet on whether the majority of websites and APIs use a thread per connection or event-driven IO. E.g. there are a lot of PHP sites out there, which are thread-per-connection. Or sites powered by Java servlets, which had been synchronous until quite recently.
Thread stacks are reserved, but not allocated up-front (unless you're running on OSv apparently[0]).
1000 concurrent threads will not take stack size * 1000, they'll take used pages * 1000 (with used pages being at least 1) + the kernel overhead of the thread structures.
That's trivial to check, even in a high-level language (with its own additional overhead) e.g. spawning 1000 threads in Python on OSX takes ~25MB. OSX uses 512k stacks for non-main threads.
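That check can be reproduced on any POSIX system along these lines (a sketch; `ru_maxrss` is kilobytes on Linux but bytes on macOS, so treat the printed number loosely):

```python
import resource
import threading

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# 1000 parked threads, each with the default (reserved, not
# committed) stack; only the touched pages count against RSS.
stop = threading.Event()
threads = [threading.Thread(target=stop.wait) for _ in range(1000)]
for t in threads:
    t.start()

after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
stop.set()
for t in threads:
    t.join()

print(f"RSS growth: {after - before}")
```

The growth comes out in the tens of megabytes, nowhere near 1000 times the full stack reservation, confirming that stacks are reserved lazily.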
Your general point does not stand. It is totally incorrect. A thread does not commit its stack limit immediately. The limit is the limit. The initial size is a single page.
And even then, the original point wasn't "an OS thread is as cheap as a userland thread", it was "all things considered a native thread is pretty cheap".