
> Green threads didn't provide performance benefits over native threads in Rust. That's why they were removed.

This seems wrong. This shouldn't be the point of green threads -- green threads aren't a "faster" alternative to native threads. How would that work? How could it be the case that virtual threads implemented in user space are faster than native threads provided by the OS? I don't think anyone expects that to be true of green threads in general.

We don't have green threads as a useful abstraction because they're fast, we have them because they're flexible, and because they allow us to easily capture a lot of patterns. They let you easily and efficiently distribute work amongst multiple processors without having to think about the subtleties of the underlying threading model.

Of course, given Rust's other goals, shying away from green threads makes sense.



Rust's green threading was... unusual. It attempted to abstract green and native threads into a single API, which led to a lot of unnecessary overhead:

https://github.com/rust-lang/rfcs/blob/0806be4f282144cfcd55b...

Sadly, that proposal also removed concurrent IO from the standard library, and it hasn't been replaced.


It may not be replaced in the standard library, but mio exists and is what most people use for non-blocking stuff.


1. Stackless coroutines will probably require help from the compiler team (hint hint). It'll be worth it! Think web frameworks with Nginx performance and Rust memory safety.

2. Documentation, stability, portability, quality. I'll keep banging that drum until there's a standard. :)

3. The Mio API is subjectively weird, but I might be missing something.


The first point has been on the radar for quite a while (i.e. years), but it's really non-trivial, and not at all necessary for the core language (hence not a priority for stability) since it's basically just some syntactic niceties over existing functionality.


> it's basically just some syntactic niceties over existing functionality.

OK, cool! Can you point us to the existing functionality, or any example of it being used for stackless coroutines?


Enums. This HN submission is an example.


State machines aren't a general solution — you have to write a state machine by hand for every application.

Stackless coroutines are — the compiler (or a clever library) transforms your ordinary function into a state machine.


Yes, I know. That's exactly what I mean by syntactic sugar: it is an automated way to write something that could already be written, but is annoying to do by hand.


Oh, OK. When I hear syntactic sugar, I think of something like a macro. Thanks though, I'm glad this is on the core team's radar.


I'm confused. To me, green threads are just an implementation detail. The fact that green threading is being used shouldn't leak into the interface. To take Go for an example, you could perfectly well write a conforming implementation of Go that used 1:1 native threading: it would just have different performance characteristics and things like LockOSThread() would become a no-op.

Could you elaborate on what you consider the interface difference to be?


This is backwards. Green threading can be an implementation detail, but in languages where it's a key feature, green threads let you write code that would otherwise be incorrect. For example, in a native threading model, it is an awful idea to spawn a thread for every incoming connection on a server. It's the easy way to write it, but it's wrong. With a green threading model, though, that's easy and efficient.


> For example, in a native threading model, it is an awful idea to spawn a thread for every incoming connection on a server. It's the easy way to write it, but it's wrong. With a green threading model, though, that's easy and efficient.

A thread per connection isn't wrong, though. You're begging the question by assuming that green threads are faster than native threads. I'm specifically arguing against that.


I always thought that the main arguments for green threads were their cheaper context switches compared to native threads and their lower memory usage.

The next point might be if it's harder to get the same performance characteristics with native threads on different platforms.


Yeah, outside of Linux threads are dog slow. But Linux's implementation demonstrates that they don't have to be, and if you have serious scalability problems you're going to know your architecture well in advance.


As I already said, green threads are not faster than native threads. That makes no sense. Green threads, due to smaller stacks and cheaper context switches, are more scalable and predictable at the cost of extra user land bookkeeping.


The point is: who cares? Do you really need 3 million threads (particularly, 3 million threads that aren't as efficient as ordinary threads for many tasks)? Ostensibly Google does about 3 billion queries a day; that's less than 35,000 queries per second. Obviously, there are many services underlying any Google search, but most of them will be distributed among multiple cores, machines, and datacenters. So what eventuality are you preparing for where regular threads won't be sufficient?


The point isn't that I need 3 million threads. The point is that I can easily write code that is correct and behaves predictably even if I end up throwing some arbitrary number of threads at the processor.

As I view them, green threads are a useful abstraction just like objects or functions. Do you need something computed, not necessarily now, but at some point? Just make a green thread to compute it and check on the result later. That's the platonic ideal of green threads, anyway: very few languages have gotten to that point (Erlang, maybe Go, on a good day; C# has this with its async functionality but none of it happens in parallel).


> The point is that I can easily write code that is correct and behaves predictably even if I end up throwing some arbitrary number of threads at the processor.

But you can't. Green threads have scalability limitations too, and their performance is not predictable (except possibly by the creators of the runtime) in any language or on any processor that does automatic preemption. That's like saying you use i64 instead of i32 because now you can do calculations without worrying about overflow. It may be true for some classes of problem, but it's hardly a different programming paradigm like you're making it out to be. They're definitely not a different abstraction from threads (which are themselves an abstraction!).


I was curious what the actual state was of the "modern Linux kernel" pcwalton mentioned, so I tried running a test program to create a million threads on a VM - x86-64 with 8GB of RAM, Linux 4.0. For comparison, Go apparently uses about 4KB per goroutine, so it should be possible to create somewhat under 2 million goroutines. To be fair, I allocated the stacks manually in one large allocation; otherwise it dies quite quickly running out of VM mappings. I set the stack size to the pthread minimum of 16KB (actually, I tried cheating and making it smaller, but it crashed, so I gave up - not a good idea anyway). The threads waited for an indication from the main thread, sent after thread creation was done, to exit; in an attempt to avoid the overhead associated with pthread conditions, I just used futex directly:

    while (!ready)
        assert(!syscall(SYS_futex, &ready, FUTEX_WAIT, 0, NULL, NULL, 0));

The program caused the kernel to hit OOM (rather ungracefully!) somewhere around 270,000 threads. To see how long it took while ensuring all the threads actually ran, I reduced the thread count to 200,000, had it join all the threads at the end, and timed this whole process: after the first run it took about 4 seconds. (The first run was considerably slower, but that isn't a big deal for a server, which is the most important use case for having such a large number of goroutines/threads.) Therefore, the C version uses about 20 microseconds and 32 KB of memory per thread.

For completeness, I also tested a similar Go program on Go 1.4 (the version available from Debian on the VM); it actually got up to 3,150,000 before OOM, and took 9 seconds to do 2 million - 4.5 microseconds and 2.7KB per thread.

In other words, Linux is about an order of magnitude slower at managing a large number of threads. That looks pretty bad, but on the other hand, it's not that much in absolute terms! I'm pretty sure most server programs don't need more than 250,000 simultaneous connections (or can afford to spend more than 8GB of RAM on them) and don't mind spending an extra 20 microseconds to initiate a connection, so if operating systems other than Linux aren't a concern, they could be written to create a thread per connection without too much trouble. It's not going to give you the absolute maximum performance (meaning it's not appropriate for a decent class of program - then again, I suspect Go isn't either), but it's not terrible either.

I'd like to see it improve. I wouldn't be surprised if there is (still) some low hanging fruit; do kernel developers actually care about this use case?

(And yes, I know this doesn't really test performance of the scheduler during sustained operation. That's its own can of worms.)


> It's not going to give you the absolute maximum performance (meaning it's not appropriate for a decent class of program - then again, I suspect Go isn't either), but it's not terrible either.

Yeah, this matches our results when we did similar tests. It's definitely faster to use green threads if you're just spawning and shutting down the threads, but if you're actually doing I/O work on those threads, the relative overhead quickly becomes negligible.

It's not the fastest way to do I/O, but Go's approach isn't either. The fastest way to do I/O is to forego the thread management syscalls and the stack, like nginx does.


>It's not the fastest way to do I/O, but Go's approach isn't either

Though Go's is faster.

>forego the thread management syscalls and the stack, like nginx does.

The problem is nginx uses state machines, which aren't a general solution. Stackless coroutines allow you to write functions that the compiler (or a clever library) transforms into state machines.


> To be fair, I allocated the stacks manually in one large allocation; otherwise it dies quite quickly running out of VM mappings.

Okay, so the test you did doesn't actually reflect the use case in practice. Can I expect to reach 200,000 threads if the threads are not all created at exactly the same moment? What if (God forbid) they're doing memory allocation? And if it does work out, will everything be handled efficiently?


Hope comex replies to your question. Typical green thread usage is spawn-em-as-you-need-em, so if in order to spawn lots of 1:1 threads I need to do it all up front, that could be very limiting or complicating.


Yeah, I just made a mistake - you can increase the maximum number of mappings using /proc/sys/vm/max_map_count; I tried doing that and switching back to normal stack allocation (but still specifying the minimum size of 16KB using pthread_attr_setstacksize) and it doesn't change the number of threads I was able to create.

...in fact, neither did removing the setstacksize call and having default 8MB stacks. I guess this makes sense: of course the extra VM space reserved for the stacks doesn't require actual RAM to back it; there is some page table overhead, but I guess it's not enough to make a significant difference at this number of allocations. Of course, on 32-bit architectures this would quickly exhaust the address space.

If increasing max_map_count hadn't worked, it would still be possible to allocate stacks on the fly - but you would get a bunch of them in one mmap() call, and therefore in one VM mapping, and dole them out in userland. However, in this case guard pages wouldn't separate different threads' stacks, so you would have to generate code that manually checks the stack pointer to avoid security issues from stack overflows, rather than relying on crashing by hitting the guard page. Rust actually already does this, mostly unnecessarily; I'm not sure what Go is doing these days, but I think it does too. Anyway, given the above result, I suspect this won't be an actual issue, at least until the number of threads goes up by an order of magnitude or so.


Thanks for your reply. Wonder how other platforms fare.


That doesn't seem that unrealistic — you could allocate your stacks using slab allocation, for example. I wonder why the kernel allocator doesn't do a better job, though.


The key number for server applications is latency — it doesn't matter if you can spawn 200,000 threads if it takes you four seconds to serve a request. And if a thread can't serve a request before its time slice is up, it gets sent to the back of the line and its cache entries get evicted by other threads.

Coroutines have three awesome advantages here:

Really fast switches (it's basically a function call — no need to trap into the kernel)

Scheduling in response to IO events (a coroutine can yield when making a "blocking" IO call and resume when it completes). You can write asynchronous code that looks synchronous!

Better cache behavior (hugely important on modern processors). You can allocate all your coroutine structures together and don't have to suffer all the cache evictions of a context switch every time a coroutine yields.

It would be interesting to see this benchmark repeated with a latency requirement.


> Scheduling in response to IO events (a coroutine can yield when making a "blocking" IO call and resume when it completes). You can write asynchronous code that looks synchronous!

You can also write "actual" synchronous code which does the same thing in the kernel. :P And there are several CPU schedulers to choose from to fine-tune when things get woken up.

I agree it would be interesting to test latency; actually, the numbers would be much more useful than my four seconds. After all, if thread creation throughput or latency becomes an issue, you can keep a thread pool around, while still retaining the advantages of one connection per thread.


>You can also write "actual" synchronous code which does the same thing in the kernel. :P And there are several CPU schedulers to choose from to fine-tune when things get woken up.

You can. But no matter how good your kernel scheduler is, that trip through it is going to cost you. Why? Because it's going to wreck your cache and TLB entries. I think the TLB is a big part of the story, given it's small, specific to the process, and particularly expensive to miss.

http://www.cs.cmu.edu/~chensm/Big_Data_reading_group/papers/...

Anyway, I'm sure you know this already. :) Out of curiosity, did you get a chance to take CS169 before you left Brown?

And I think the kernel already does pool/recycle stacks, but I guess not 200,000 of them given your results. An explicit thread pool would certainly work too.


Kernel stacks are allocated by Linux in alloc_thread_info_node; user stacks by glibc in allocate_stack, which does have some sort of cache. Overall there's a fair bit of work going on that could be skipped...

But yeah, certainly userland context switches are faster; it's just that your comment seemed to postulate "scheduling in response to IO events" as a benefit of coroutines over threads, as if the kernel had no way to distinguish between threads waiting for IO and threads needing to be woken up. I presumably misinterpreted it.

Interesting paper. It seems like it could be a useful step toward my ideal imagined environment where the distinction between coroutines and threads would be meaningless - because scheduling would be a job shared between userland and the kernel, so userland could do fast thread switches, batch system calls, etc., while the kernel would still give them PIDs and let them take signals or be ptraced, and native tools (debuggers) would treat them as threads.

Though the paper is from 2010; do you know if anyone has tried to implement anything along similar lines in production?

And no, I didn't take that course.


Oh OK, I definitely could have written that better.

From what I can tell, M:N threading has been thoroughly abandoned by Linux, but it might be worth revisiting given the horrendous I/O scaling of Linux kernel threads. I would imagine scheduling user threads cooperatively simplifies things a lot.

Google has a very interesting talk about cooperative native thread scheduling, but they haven't upstreamed their code:

https://www.youtube.com/watch?v=KXuZi9aeGTw

And here's what Linux devs are actually doing (it's underwhelming, something like 50% of what the hardware is capable of):

https://lwn.net/Articles/629155/

In the meantime, people who actually need the performance (HPC and even some enterprise servers) have to bypass the kernel entirely. It looks like that's going to be possible even in virtualized environments, thanks to hardware support:

http://people.inf.ethz.ch/troscoe/pubs/peter-arrakis-osdi14....


It would be interesting to run the same test using musl, which allows a smaller minimum stack size.


It was a terrible idea with old OSes, and engineers never re-learned as the schedulers improved. Many of your favorite systems serve high request rates on Varnish using one thread per session.


> To me, green threads are just an implementation detail. The fact that green threading is being used shouldn't leak into the interface.

I think that mentality, which I supported, may be what ruined green threads for Rust. Perhaps the perfect was the enemy of the good. On the other hand, maybe not having to worry about the differences is the greater benefit: I would rather be slower-but-acceptably-fast than have to think about 1:1 vs. M:N threading in the cases where they are not interchangeable.

On the third hand, it could be that Rust is too close to the metal for green threads to be of any significant benefit. If that's the case, Rust ought to trounce all the M:N implementations, subject to reasonable limits on spawn rate, memory pressure and inter-thread communication. Does that sound right?

Incidentally, I've never seen a cooperative threading implementation that didn't have a yield call of some kind available, and you can usually jam any of them up by writing an infinite computation-only loop, so even in the languages where the scheduler is invoked automatically, correct 1:1 code can deadlock in the M:N scenario. State machines can too, but a trivial example, translated, might read something like the following:

    for(;;);
    exit(EXIT_SUCCESS);
I've seen code like this written by complete newbies, but threading constructs can hide that relationship even from experts. In parallel code it isn't a sequential relationship, but it is with cooperative threads. I guess this is all fairly pedantic (such a minor detail, mostly affecting only poorly-written programs), but in the end, one of: yield(), sched(), sync(), etc. does show up in practically all of the cooperative threading interfaces.


The interface is different because if you use a green thread within a GC'd language, you can do anything -- change anything, interact with anything -- and nothing will crash. No locking involved, and no hidden locking under the hood.

It's not meant to be a performance boost, but a way of writing code where "might this crash?" is a question you're never bothered to ask. It's wonderfully freeing.

Not true of native threads. If a goroutine were replaced with a native thread, it would swiftly crash on GC.


We're talking pretty explicitly about Rust, which doesn't have that problem.


> a way of writing code where "might this crash?" is a question you're never bothered to ask.

Is that actually true? Yes, you won't get a partially written int, but you can still run into consistency issues if you don't lock. For example, suppose thread A needs to update both x and y to keep the state consistent, while thread B reads both x and y. A might update x, then the scheduler would switch to B, and B might read the new x and the old y. If A's update needed to change both x and y for the system to be in a consistent state, you have a problem.


It will do what you just described. And what you described won't crash.

If you use the int as an offset into raw memory, then sure, that will. But if you use it as an offset into a list, then it won't. At most it will cause an exception to be generated. That will cause the reader thread to die, because it was the code that caused the exception, but it still won't crash.


Right, but I don't really see how that's freeing. A crash isn't good, but having your program be incorrect might be even worse. If the program crashes, you know it. But it might take some time before you recognize that your logic is wrong, which might cause all kinds of damage.

So I don't think it frees you from thinking about concurrency at all.


This is only true for n:1 green threading.


Sure. But I might write my n:1 code more quickly, more robustly, and with better performance than your m:n code, since m:n forces the programmer to carefully worry about whether their code will break.


Maybe, if my program has a lot of shared mutable state. I'm smart enough to avoid that though.


There are some examples of speed difference at https://en.m.wikipedia.org/wiki/Green_threads

Basically if you have green threads with only one core, you don't need to do as much work with locking.


> Basically if you have green threads with only one core, you don't need to do as much work with locking.

This is half the basis for the original design of the Erlang BEAM VM. SMP was a grudging afterthought; the original concept was that each VM "node" was a single scheduler process bound to a single core, and then you tied those nodes together into a (machine-internal) multi-core distributed system, effectively creating a set of "threads" communicating over socket-based message-passing IPC, rather than shared memory.

M:N scheduling—where, critically, a task can be re-scheduled from one core to another—is a much more complex and less-predictable design, that's not always worth the trouble. For Erlang's particular use-cases, it turned out to be worth the trouble enough to justify programming it, eventually. But the SMP VM is still only built via a configure flag, rather than displacing the M:1 VM as "the" VM.


The argument here is less about whether the performance difference exists, and more that the decision in question sounds like it was made due to the lack of something we care less about.


I was responding to the first paragraph, which is technically all about the performance differences (it does say "How could it be the case that virtual threads implemented in user space are faster than native threads provided by the OS?"). Whether that matters and is a valid decision point is a completely different question, as you say.



