Hacker News new | past | comments | ask | show | jobs | submit login
The future of M:N threading (mail.mozilla.org)
117 points by joshbaptiste on Nov 13, 2013 | hide | past | favorite | 66 comments



Background: The author of this proposal, Daniel Micay (a.k.a. strcat, a.k.a. thestinger), is the most prolific volunteer contributor to the Rust compiler [1], and is fairly obsessed with eliminating design hazards that could lead to Rust not being competitive with C++ in terms of performance. For example, he was the driving force behind Rust's recent widespread switch from internal iterators to external iterators, and wrote most of the external iterator libraries himself.

[1] https://github.com/mozilla/rust/graphs/contributors

EDIT: To whoever removed the [Rust-dev] annotation from the title of this submission, please put it back. It adds valuable context, and it's in the title of the page itself, so you're violating HN's naming policy by removing it.


The annotation is critically important.

Seeing a link to [mozilla.org] with talk about threading, I assumed it was something to do with Firefox (which I care about) versus Rust (which I don't).


But the M:N system is already written, and it seems as though it works extremely well. In fact, it is one of the things I like most about the language, and one of the reasons why I prefer it over C++ (as well as D and Nimrod for that matter).

This decision probably puts it closer to C++, but I can't see how that is a good thing.


Note that this is not a decision; it's not by a core team member.


100k+ concurrent threads is a realistic starting point for M:N threading. As long as native threads require a separate preallocated stack that never shrinks there will always be a mismatch compared to the capabilities you get running thread per core event based code.

Making M:N threads a first class POSIX feature is a good starting point as long as it also allows you to express the intended locality of groups of threads. Locality isn't optional anymore. The slowdowns he reports also map to scale up issues you see when you don't think about keeping frequently accessed mutable data local and shared nothing to an OS thread.

M:N threads are just an abstraction around a stack and registers that frees you from having to bind up a bunch of related state into a series of objects and maps. It's awesome, I want it, but nothing is quite there yet. At least not if you want something like Java or C/C++ performance.

Work stealing across cores is really nice, but I am not sure it is always the behavior you want. A programmer can tell when migrating a task across cores is going to hurt more than waiting a little longer to get to it.

I also think any discussion of M:N threads or multi-threading is also incomplete if it doesn't also include garbage collection as a means of passing data between threads. Right now we have languages that are garbage collected and treat native memory as a second class citizen and non-GC languages that make multi-threading a headache.

GC is great for multi-threading, but not so great for hundreds of gigabytes of variable lifetime objects. Even if the GC can handle it, GC still has unacceptable 2x space overhead.

I don't want much :-)

The youtube video doesn't load for me :-(


> Right now we have languages that are garbage collected and treat native memory as a second class citizen and non-GC languages that make multi-threading a headache.

I think Rust did a good job of making a non-garbage-collected language make memory management less of a headache in a multithreaded scenario (though I'm biased of course). In fact, I think it works even better than in languages with global GC.

By default you transfer memory between threads ("don't communicate by sharing memory, share memory by communicating"), and the compiler enforces that you can't touch memory after you've given it away. You can also share read-only data (Arc), and if you do that then the compiler will make sure you can't mutate it and race. If you want to use locks, you can use those too (MutexArc), and the compiler will ensure you take and release locks properly. No tracing garbage collection in sight, and no fiddling with race detectors at runtime—the compiler does the work for you.


The size of the OS threads with their stacks is what I am worried about as well. It is interesting the author hints that:

> It drops further when asked to allocate 8MiB stacks like C is doing

Memory, just like CPU is constrained and costs money especially if provisioned in the cloud. If I have 2G of memory available I can spawn 250 threads and keep them around without swapping.

EDIT: actually scratch the above, _wmd pointed out that stack memory is just like virtual memory, it is allocated but brought into physical memory as needed.

This changes how code is written. It is not just an internal detail! Now you cannot think "one-connection = one-task" now it is about locks, queues and mapping control blocks to connections. That is the biggest loss.

> GC is great for multi-threading, but not so great for hundreds of gigabytes of variable lifetime objects. Even if the GC can handle it, GC still has unacceptable 2x space overhead

Good point on GC. I see the main issues in a large concurrent system (and presumably Rust wants to be a good tool for concurrent systems) is non-blocking GC. A slow GC might be acceptable, but if one errant task allocated memory in certain pattern and GC has to lock all the tasks out to run, that could be a serious issue.

Responsiveness (liveliness) of a concurrent system is something that is often forgotten and everyone wants to talk about how fast one can transpose a matrix.

Now a concurrent GC is possible, Azul built one, for Java, it is very interesting how it work. I enjoy reading their whitepapers:

http://www.azulsystems.com/resources/whitepapers


Just highlighting a small, but common miscomprehension (that on one project, caused a PM to mail everyone at 2am about a HUGE nonexistent memory leak)..

Thread stacks are virtual memory like everywhere else, so in reality an "8mb" stack means "8mb maximum size". Sure, they won't shrink once pages are faulted in to back them, but in the average application, especially on 64bit, this should never be a problem

If for the lifetime of your thread, its stack only ever grew to 32kb, then the OS will only have allocated 32kb of memory to back it.


You won't hit the worst case, but the common case is that threads will burst stack usage during task execution. If you have 100k threads your effective stack usage will be the high watermark.

It varies by workload, but for mine that is between 128 and 512 kilobytes per thread.

If you churn all your threads no problem. If you are hosting 100k persistent connections I suspect it would detract from available memory.


That's interesting. So if a thread briefly uses 1M stack, and relinquishes most of it, then the kernel still keeps that unused 1M memory in the page table?

I expected the linux kernel to be more intelligent, but maybe there's a reason it behaves like that?


Under pressure, it will eventually page the memory back out, but generally you never want to get your machine into a state where paging happens. Aside from that, there is no mechanism that allows the kernel to know the memory is no longer used unless you explicitly tell it, e.g. via madvise(2).

Though I can't think of a sane way to call madvise(2) involving a thread stack


That makes sense, thanks for clarifying. So if say there are 1000 client connection threads and they monitor some status and say ping their clients every 5 seconds, on wake-up, it could be only a small fraction of the stack would have to be paged-in. And it would depend how deep the stack was when the thread of put to sleep?


Sorry yep, didn't see your reply earlier.

That's not to say threads aren't expensive, they still require several heavyweight struct allocations on the kernel side, e.g. struct task_struct, which I counted to 1k before getting bored (and wasn't even quarter way through the fields)

Edit: Linux git HEAD with Debian unstable .config:

(gdb) print sizeof(struct task_struct)

$1 = 1904


A little while back Rust decided to drop segmented stacks and use normal stacks (and basically give up on 32-bit), so now they're deciding between M:N with full-size stacks or 1:1 with full-size stacks. The dominoes are falling.


In a year Rust is like C with a better type system and ML syntax. Not that interesting as a language anymore, but more useful in practice.


> Not that interesting as a language anymore, but more useful in practice.

I think Rust is still pretty interesting. The borrow checker is extremely intricate and is completely unlike anything seen in any industry language so far.


"a better type system" glosses over a lot, though. if all rust can do is bring linear typing and algebraic datatypes to c that will be a significant achievement.


I thought the same, piling up on the removal of typestate which was the weirdest/most interesting thing in rust.


I wish there was something like C but with a better type system, I don't think rust will fill that gap. There are still too many inconsistencies between abstractions with overhead and without making it a C++ replacement if anything.


> There are still too many inconsistencies between abstractions with overhead and without making it a C++ replacement if anything.

For example?


With the nice tuning and code standards you don't need the gigabytes of variable lifetime objects. Of course the whole environment needs to be available in the right way, but for example it's not uncommon in erlang to see thousands of processes that have the memory size adjusted enough to never GC and finish quickly. Then you collect the whole process, rather than separate objects. Immutability also helps here.

Not many languages can afford doing that and of course there may be lots of features that prevent similar behaviour.


In Erlang GC is boringly easy. It has separate heaps for processes and processes are isolated. The price to pay is copying some data between processes and handling immutability. But boring is good. They manage a pauseless and concurrent GC without a sweat.


Well, yes and no. Erlang's GC is simple, but incomplete. Shared data structures like ETS aren't GC-ed, and there are many data structures that you simply can't implement in Erlang because of the way that GC works. For example, you can't implement Java's concurrent hash-map in Erlang. Erlang relies on C code whenever it needs shared data structures.

So the simple and effective GC comes at the cost of a limited programming model. For Erlang, it's the right tradeoff, but not for every environment.


Interesting analogy: Erlang implements an M:N scheduling model, where it starts an OS thread for each CPU core (by default), and (preemptively) schedules its own lightweight processes on top of these threads. However, unlike OS threads, these are very lightweight: when they are created, they only need a few hundred bytes. Erlangers have been using the one process per request model for a very long time, and Erlang applications achieve massive concurrency via this.

To me it seems it is possible to do M:N right, but you need more abstraction and a different design. M:N seems to work well in Erlang-like cases (very lightweight processes on top of a kernel threads).


This is what Rust was doing, more or less, before they removed segmented stacks.


M:N threading has been a great idea in theory for at least twenty years, but during that whole time actually implementing it has always ended in tears. The unexpected blocking in I/O or page faults etc. always trips it up. That's why it works great for HPC, not so much elsewhere. There's far more "bang for the buck" in making 1:1 threading work better.


I'm under the impression that M:N threading has been implemented in Haskell, Erlang and Go. I think it's been reasonably successful in all three.


From what I've heard Go is considering moving to this 1:1 model as well, at least on Linux. This is the reason for the Linux patches mentioned in the thread.


Can you post a link to some discussion around this? A google didn't turn up anything obvious. Thanks.


It's hinted at in the video linked in the email[1]:

> Over the past year we've been looking at how a uniform, performant, synchronous programming model can be implemented across all of Google's core languages (C++, Java, Go, Python), while maintaining compatibility with existing code.

https://www.youtube.com/watch?v=KXuZi9aeGTw


Do you have a source for that?


If you have direct language support, the runtime can either anticipate blocking ops and make sure there are enough other threads still running, or detect that an execution thread has already blocked (from a supervisor thread) and spawn a new one. How well that works depends on what you're doing. Computation is the trivial case. Network stuff is almost as good, because sockets are pretty good about not blocking when you've asked them not to. Filesystems aren't quite as good, so you have to use AIO and other tricks. Page faults are worst of all. Then you have to deal with NUMA scheduling issues yourself because the OS can no longer do it for you, and so on.

So yeah, if you have a really good runtime you can see some significant gains on some systems and workloads. More often you'll get very modest gains, and you'll have some extremely subtle bugs that cause a precipitous drop in performance e.g. when you start paging. That's "ending in tears" for the poor schlep who has to debug that stuff. It can be done but, as I said, it's not the best bang for the buck.


>M:N threading has been a great idea in theory for at least twenty years, but during that whole time actually implementing it has always ended in tears.

Not always: http://www.haskell.org/ghc/



Erlang is comparatively slow. Think of it as a waterline -- as a language implementation gets faster, other performance problems (like M:N) appear above the surface.


GP didn't say "it always ends slow", but "it always ends in tears".


If you're interested in this area you might like this talk from the Linux plumbers' conference:

"Over the past year we've been looking at how a uniform, performant, synchronous programming model can be implemented across all of Google's core languages (C++, Java, Go, Python), while maintaining compatibility with existing code."

https://www.youtube.com/watch?v=KXuZi9aeGTw


I'm not sure I understand. Are you saying that a 1:1 threading model without preemption will be more performant than an M:N threading model without preemption?

How would this affect the performance of applications that spin up thousands of tasks?


No, they're saying the performance will be slightly degraded. But you can already spin up thousands of threads on Linux without performance problems.

The M:N model is notoriously difficult, making your threading system 10x as complicated usually isn't worth the 10% performance increase. Haskell is, again, the exception that proves the rule. In Haskell there is no stack and no need for TLS, so the complexity of M:N threading is much lower. Additionally, on Haskell there is a provision for pinning green threads to OS threads if you need interoperability with foreign code that can't get moved across threads as easily, and the assumption is that you usually don't need to pin threads.

I'd also like to point out that for some applications, using native threads instead of green threads will reduce the number of context switches, it's when threads are CPU bound that green threads give you the biggest performance boost.


> Additionally, on Haskell there is a provision for pinning green threads to OS threads if you need interoperability with foreign code that can't get moved across threads as easily, and the assumption is that you usually don't need to pin threads.

You can do that in Rust too (and we do it a lot).


If you restrict yourself to 64 bit systems, you can essentially go back to using OS threads with blocking I/O because the address space is so large. You can fit thousands of 2MB call stacks into the processes address space and rely on the OS and MMU to manage the memory.

The problem that created the whole non-blocking, async domain was the memory needed for the thread's call stacks. You don't have enough address space to place a bunch of 2MB call stacks for each thread, if you're handling tens of thousands of connections.


> The problem that created the whole non-blocking, async domain was the memory needed for the thread's call stacks. You don't have enough address space to place a bunch of 2MB call stacks for each thread, if you're handling tens of thousands of connections.

c10k is (almost) 15 years old, updating it for today are the OS and MMU going to manage 10 million threads and stacks for that many connections?


2MB call stacks?

Mine are 10MB:

    $ ulimit -s
    $ 10240

I can only spawn 200 threads in the worst case and keep them active before system starts swapping. Yes memory space is large by anytime those have to run they'd have to be brought up into the working set.

EDIT: nevermind, _wmd pointed out that stack memory is still virtual and it only be brought into the physical memory as needed.


Now I'm confused. This comment says that the stack memory is not allocated by the OS unless actually needed: https://news.ycombinator.com/item?id=6728030

Is this not true?


That is right, the full 8MB or 10MB or whatever of max is not being paged in.

I am stupid. I'll correct the initial comment.

Thanks


The note about Windows User-Mode Scheduling is really neat - it makes mmap much more usable when you can switch out on a page fault. Is there such a thing available for Linux?


There is no User-Mode scheduling in the Linux Kernel to my knowledge, but the video[1] mentioned the post describes Linux patches that add support for it.

[1] https://www.youtube.com/watch?v=KXuZi9aeGTw


Shame. mmap is so handy, but blocking on page faults makes it annoying to work with.


Some jump points for the interested on historical implementations other M:N projects (I remember this specifically for the case of NetBSDs work in this area):

http://en.wikipedia.org/wiki/Scheduler_activations

http://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa....


If I remember correctly, this was the threading model used by Solaris. I'm sure there are many applications that make use of M:N threading, but for the server I worked on (~3000 concurrent connections and 50+ threads), this actually hindered the multi-tasking performance of the server. I remember investigating and figuring out this was the reason, and switching back to 1:1 made the performance increase.


I can't find the reference, but I remember a grand old rant that threads only progressed so far because Solaris at the time had an very costly process-fork cost compared to other Unices and Windows.


Process spawning has been (and still is!) quite expensive on Windows, I've always associated the push for threads with that, ie being kludge for crappy kernels :)


What about shared memory?


True: forks have very different semantics than threads. Just thought the thread architecture would be less complicated if divisions of responsibilities were different.


Lthread (http://github.com/halayli/lthread) is basically an M:N threading model.

You can run an lthread scheduler per core.


How is this relevant? The post is about the problems with M:N threading models, and why they aren't a good fit for Rust.


If you have 1:1 threads in a language, can't you implement M:N threading as a library? Looking at projects like Akka[1] makes me think this is the case. Is there reason this can't be done that I am missing? Why would you want M:N threads baked in to the runtime?

[1] http://akka.io/


Those are usually leaky abstractions that don't allow you to do blocking operations with the idiomatic APIs and libraries of the language. You have to use the wrappers provided by the threading library which are typically incomplete and prevent you from using standard tools.


I have no experience with Akka, but what you write makes sense. Even Clojure concurrency primitives feel weird sometimes when one wants to mix them with low-level Java libraries.

Another option is to do it at the language level - see my comment on Erlang.


Aren't blocking operations also a problem if M:N threads are baked in to the runtime? If all of the idomatic APIs are using non-blocking IO then would there still be a problem?


When M:N threads are truly baked in (either at the runtime level or the OS) blocking an M thread doesn't block the underlying N thread.

An example would be Erlang which is a runtime built from the ground up to support lightweight threads where reading or writing to a socket of using the built in APIs won't block the underlying OS thread. You can still sabotage the runtime by writing your own native library and doing blocking operations from there, but the pitfalls are more obvious in that scenario.

If the underlying concurrency and network/disk IO primitives are truly non-blocking for lightweight threads then everything built on top of them will also be non-blocking. Non-blocking for the underlying OS thread that is.

That is where co-routines and other co-routine like libraries are a leaky abstraction. If you step into one of those blocking APIs in a third party or standard library you will block the underlying OS thread. This means you can't integrate with existing code or tools easily and you are always vulnerable to accidentally doing so.


How do run blocking code in a way that doesn't block an OS thread? Is there some abstraction that allows this that I'm missing?


You let the blocking call do a non-blocking call internally and then instruct an IO-aware user-mode scheduler to reschedule the user-mode thread when the non-blocking call has a result. This is what Haskell/GHC, Erlang and Ruby 1.8 do. There are some great papers on GHC's IO manager. In general, it is always possible to build a blocking interface on top of a non-blocking.


I guess I wasn't clear in my previous question. I know you can build a blocking interface our of a non-blocking one. My question was how you go the other way. How do you create a non-blocking interface of of a blocking one. The parent post seemed to imply that Erlang was able to do this. As far as I can tell this can't actually be done. If you have a blocking call you still have to block an OS thread somewhere.

If this can't be done, then that brings me back to my previous point. It seems like the important thing is that you have non-blocking IO and not that you have an M:N threading model in the runtime. If you have non-blocking IO then it seems like you could easily implement an M:N threading model as a library without making it awkward to interact with.

Is there some assumption I'm making that doesn't make sense?


The Erlang runtime uses non-blocking IO calls to the kernel but exposes blocking functions on top of this. Is such a function really blocking? It depends on what level you're looking at; it's blocking at Erlang level but non-blocking at kernel level. I think this is the source of the confusion. You are of course correct in saying "If you have a blocking call you still have to block an OS thread somewhere." if you're looking at kernel level, but not Erlang level (since it can translate blocking to non-blocking).

> If you have non-blocking IO then it seems like you could easily implement an M:N threading model as a library without making it awkward to interact with.

Yes, but your language/system has to have some mechanism to manage control flow. Examples: F# does what you ask for via asynchronous workflows. Ruby via fibers/EM-Synchrony. Java via a bytecode weaver like Kilim.


> Rust's tasks are often called lightweight but at least on Linux the only optimization is the lack of preemption.

I'm surprised by this, I would have expected that voluntary preemption saves you from having to save the FP registers as well. Is the absence of the optimization a non-x86 thing?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: