Fwiw, the original implementation of threads in Java was 'green' threads (it was project Green, after all). These were implemented on SunOS (vs. System V) using setjmp/longjmp and some auto-yield logic in the JVM (basically, when you did certain things it would start that operation and yield your thread while it completed). When I started I had the most System V experience, so I ported it over to Solaris to get to what was affectionately called the 'hello hello hello world world world' stage. The choice to use Solaris native threads over green threads was motivated by the work that had gone into the Solaris kernel to do CPU scheduling efficiently. That gave Java multiprocessing for 'free'. Except it exposed a bunch of bugs in the mutex/synchronized code when you actually got parallel execution :-).
The magic, as the author points out, is dynamic stack allocation. If you disassociate a thread from a required stack size, then you are back to using just the amount of memory that your object heap and reference stores are using. GC gets to be trickier too, of course.
This should nominally be a non-issue on 64-bit address machines, as you only need to map the pages that actually have data in them (minimum one page), but that still puts a huge load on the page tables. This is especially true with huge pages, which are used to avoid TLB thrashing and to reduce the number of levels you have to walk to get back the physical address of the page you are trying to access.
It's the kind of optimization problem that systems folks really like to dig into and optimize around.
This was 1992, and POSIX, BSD, Microsoft, and a contingent from HP/Apollo were all putting forward ideas about what the API to 'threads' should be, and even whether they were a "kernel" thing or a "user level" thing. Arguments were made that lightweight processes were the abstraction to use, but as I recall that impacted portability.
So yeah, they were pretty new and they were blowing up things left and right. Sun at the time had a leg up on most (if not all) UNIX vendors with a threading model that worked well. But it made porting the JVM harder and green threads lived for a long time as a fallback.
Assuming the question implies "without using threading": spawn multiple processes and migrate fibers across them as needed.
Of course it's not something a developer might want to do, but it might be done as part of a user-land library. No idea why one would want that today outside of highly specialized code, but it might be more portable to other kinds of non-preemptible systems (i.e. an OS with fixed time sharing).
> The magic, as the author points out, is dynamic stack allocation. If you disassociate a thread from a required stack size, then you are back to using just the amount of memory that your object heap and reference stores are using. GC gets to be trickier too, of course.
I would imagine register based VMs have a leg up in this regard?
As I understand it, stack-based VMs use a stack for the operands that the VM instructions operate on, whereas a register-based VM puts the operands in registers rather than a stack. Both types of VMs use a stack for the function call stack (which is not the same stack as the operand stack), though, so there is no difference in that regard.
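To make the distinction concrete, here's a toy sketch in Go (the opcodes and types are invented for illustration, not modelled on any real VM) of how each style evaluates 2 + 3. Note that neither operand store is the function call stack, which both still need:

    package main

    import "fmt"

    // Toy stack machine: operands live on an explicit operand stack.
    // This operand stack is not the function call stack.
    type stackVM struct{ operands []int }

    func (vm *stackVM) push(v int) { vm.operands = append(vm.operands, v) }

    func (vm *stackVM) pop() int {
        v := vm.operands[len(vm.operands)-1]
        vm.operands = vm.operands[:len(vm.operands)-1]
        return v
    }

    // add pops two operands and pushes their sum.
    func (vm *stackVM) add() { vm.push(vm.pop() + vm.pop()) }

    // Toy register machine: operands live in numbered registers, so an
    // instruction names its sources and destination explicitly.
    type registerVM struct{ regs [8]int }

    func (vm *registerVM) add(dst, a, b int) { vm.regs[dst] = vm.regs[a] + vm.regs[b] }

    func main() {
        // Stack style: PUSH 2; PUSH 3; ADD
        s := &stackVM{}
        s.push(2)
        s.push(3)
        s.add()
        fmt.Println("stack VM:", s.pop()) // 5

        // Register style: r1=2; r2=3; ADD r0, r1, r2
        r := &registerVM{}
        r.regs[1], r.regs[2] = 2, 3
        r.add(0, 1, 2)
        fmt.Println("register VM:", r.regs[0]) // 5
    }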
Project Loom [1] is intended to provide green threads, or 'fibers', at the JVM level.
There have been other projects like Quasar from ParallelUniverse [2] that retrofits them by processing the compiled bytecode to enable saving/restoring the IP & stack.
It's all about fixed resource allocations. The default stack size for user-land threads tends to be 1MB or 2MB, and there's also (smaller) kernel stacks.
In implementations of languages like Scheme or Haskell where no stack may even be allocated, or where stacks might be allocated in chunks, the cost of a co-routine can be very small -- as small as the cost of a closure. If you take that to the limit and allocate every call frame on the heap, and if you make heavy use of closures/continuations, then you end up with more GC pressure because instead of freeing every frame on return you have to free many of them via the GC.
In terms of ease of programming to deal with async events, the spectrum runs from threads on the one hand to callback hell on the other. Callback hell can be ameliorated by allowing lambdas and closures, but you still end up with indentation and/or paren/brace hell. Co-routines are somewhere in the middle of the spectrum, but closer to threads than to callback hell. Await is something closer to callback hell with nice syntax to make things easier on the programmer.
Ultimately it's all about how to represent state.
In threaded programming program state is largely implicit in the call stack (including all the local variables). In callback hell program state is largely explicit in the form of the context argument that gets passed to the callback functions (or which they close over), with a little bit of state implicit in the extant event registrations.
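A minimal Go sketch of that contrast (the line-echo handler and the event-loop hook here are invented for illustration): in the first version the per-client state lives in locals on the handling goroutine's stack; in the second it has to be packed into an explicit context struct that outlives each event.

    package sketch

    import (
        "bufio"
        "fmt"
        "net"
    )

    // Threaded style: per-client state ("n", "line") is implicit in the
    // locals of the goroutine serving this client.
    func handleBlocking(conn net.Conn) {
        defer conn.Close()
        r := bufio.NewReader(conn)
        n := 0
        for {
            line, err := r.ReadString('\n')
            if err != nil {
                return
            }
            n++
            fmt.Fprintf(conn, "%d: %s", n, line)
        }
    }

    // Callback style: the same state must be made explicit in a context
    // struct that survives between events.
    type clientCtx struct {
        conn net.Conn
        n    int
    }

    // onLine would be invoked by a (hypothetical) event loop each time a
    // full line has been read from ctx.conn.
    func onLine(ctx *clientCtx, line string) {
        ctx.n++
        fmt.Fprintf(ctx.conn, "%d: %s", ctx.n, line)
    }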
Threaded programming is easier because the programmer doesn't have to think about how to compactly represent program state, but the cost is higher resource consumption, especially memory. More memory consumption == more cache pressure == more cache misses == slower performance.
Callback hell is much harder on the programmer because it forces the programmer to be much more explicit about program state, but this also allows the programmer to better compress program state, thus using fewer resources, thus allowing the program to deal with many more clients at once and also be faster than threaded programming.
Everything in computer science in the async I/O space in the last 30 years has been about striking the right balance between minimizing program state on the one hand and minimizing programmer pain on the other. Continuations, delimited continuations, await, and so on -- all are about finding that sweet spot.
Very good summary. I would only add that it pays to mention that thread/fiber stacks typically only reserve 1-2MB of VM, and only a small amount is committed initially (1 to 2 pages). With 32-bit address spaces, this is a minor distinction because the size of the address space is what matters the most, but with 64-bit address spaces, it can mean that you aren't putting as much pressure on the VMM as one might think. If all of the threads in a process only use the default two pages, and the system page size is 4K, then 10,000 threads are only going to result in a commit size of 81MB of VM for the stacks. Of course, that's a lot of "ifs", and it completely ignores the effects of that many threads on the per-process VM mappings/CPU caches.
I keep getting comments about this, but it's not that simple. First of all, unless you're overcommitting, those stacks consume more resources than one might think at first blush. Second, the apps I am familiar with (on the Java side) get pretty deep stacks (you can tell from the stack traces in exceptions!). Third, if you want millions of stacks on a 64-bit address-space, I'm thinking you'll want larger pages if much more than a handful of 4KB pages per-stack get used (as otherwise the page tables get large), and then you're consuming large amounts of RAM anyways. I may be wrong as to the third item, but I suspect the actual depth of the stacks as-used makes a big difference.
The point is that thread-per-client is just never going to scale better than C10K no matter what, and that the work we see in this space is all about finding the right balance between the ease-of-programming of thread-per-client and the efficiency of callback hell (typical C10K).
For any size computer (larger than tiny), C10K designs will wring out orders of magnitude better scaling than typical thread-per-client apps no matter how much you optimize the hardware for the latter.
Thread-per-client can only possibly compare in the same ballpark as C10K when the actual stack space used is tiny and comparable to the amount of explicit state kept in the C10K alternative, and even then, the additional kernel resources consumed by those per-client threads will dwarf those needed for the C10K case (thread per-CPU), thus adding cache pressure. In practice, thread-per-client is never that efficient because the whole point of it is that it makes it trivial to employ layered libraries of synchronous APIs.
Now, I'm not saying that the alternatives to thread-per-client are easy, but we should do a lot better at building new libraries so that async I/O can get easier.
An important distinction between goroutines and typical coroutine implementations is that the Go compiler puts in the yield points for you, so goroutines are exactly as friendly as threads but with better performance.
That's not new at all. This is known as cooperative multitasking, and you have the system define places where a co-routine switches to another one waiting for CPU. These points are usually those at which I/O is done. Not at all new.
I'm not sure what you mean by "system" here, but if the OS does it then the architecture is probably not cooperative, but rather preemptive. On the other hand, if the runtime (such as the Go runtime) causes a context switch between coroutines because one of the coroutines is waiting (on blocking I/O, etc.), then the architecture could still be considered cooperative because the scheduling is such that none of the coroutines would ever be actually preempted by the runtime while executing instructions. Of course, they could still be pre-empted by the OS because they're ultimately all running as system threads, which is one of the downsides of all this: eventually, it all comes back to threads, how many of them are present, and how the OS handles them.
I didn’t claim it was new; I claimed it was different from typical coroutine implementations, where you need to explicitly yield (all coroutines are cooperative multitasking by definition). I also claimed that this gives the usability properties of threads (you claimed that coroutines are somewhere between threads and callback hell WRT ease of use, which is true for traditional coroutines but not for goroutines). Another important distinction is that traditionally coroutines are not M:N, while goroutines are.
Heap-allocated stacks require teaching the garbage collectors new tricks, since they need to parse the frames for references, whereas those can be handled during the freeze/thaw operation if we copy the stack.
We think we can achieve the level of performance we want with stack copying.
Anyways, allocating frames or stacks on the heap doesn't seem like it should require teaching the GC all that much. They would essentially be objects of such a class that the GC can traverse them... just like a normal stack (from current frame through return closures), which it already has to know how to do. Their location shouldn't be so important.
In typical Java fashion, there is a project, Quasar, that adds true green thread support to the language. It's done by instrumenting bytecode rather than with JVM support, but the benchmarks are still excellent. I'm very curious whether the Java or Go implementation is faster. Java does have fast async support, which is all you need to add green threads at a library level.
Also in Java fashion, the support isn't perfect, and requires some fiddling with JVM parameters (and annotations, obviously). It's still really solid in my experience, however, and it's a shame that it isn't more widely known.
A partner project Comsat provides fairly comprehensive library support for databases and such. I've used it before with Vert.X and the performance was insane. Possibly faster than Go.
Another aside: there's an active research project to add green threads to Java's core as well, Project Loom.
>Also in Java fashion, the support isn't perfect, and requires some fiddling with JVM parameters (and annotations, obviously). It's still really solid in my experience, however, and it's a shame that it isn't more widely known.
I could, but my laziness prevents me from doing so. The documentation is fairly good for Quasar; the hard part is getting the code instrumentation set up. If you use Maven, there's a plug-in to do much of it for you.
Go may have more threads, but it is not better at concurrency. It lacks several key things that make Java much easier to scale up, even if Go is easier to get started with. On my wish list:
* Thread-local storage. A million goroutines means nothing if they are all fighting over shared storage. Consider trying to implement Java's LongAdder class, which does intelligent sharding based on contention (a rough sketch of a workaround follows below).
* ConcurrentHashMap. sync.Map is horrendously contentious for writes. It's not even possible to write one yourself because Go hides the built-in hash function it generates for each type.
* Goroutine joining. This is between a rock and a hard place, because if you don't wait till Goroutines are done, you risk leaks, and if you do wait, you need to put WaitGroups all over the place.
All those millions of goroutines don't help if you don't have the tools to coordinate them.
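For what it's worth, a LongAdder-ish sharded counter can be hand-rolled; this rough sketch (names invented) only works by making each worker pass an explicit ID as a shard hint and by wiring up a WaitGroup by hand, which is exactly the plumbing I'd rather not have to write:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
        "sync/atomic"
    )

    // shard is padded out to a cache line to reduce false sharing.
    type shard struct {
        n int64
        _ [56]byte
    }

    // ShardedCounter spreads increments over several atomic counters, loosely
    // in the spirit of Java's LongAdder. Because Go exposes no goroutine ID
    // and no thread-local storage, the caller must supply its own shard hint.
    type ShardedCounter struct {
        shards []shard
    }

    func NewShardedCounter() *ShardedCounter {
        return &ShardedCounter{shards: make([]shard, 4*runtime.GOMAXPROCS(0))}
    }

    func (c *ShardedCounter) Add(hint int, delta int64) {
        atomic.AddInt64(&c.shards[hint%len(c.shards)].n, delta)
    }

    func (c *ShardedCounter) Sum() int64 {
        var total int64
        for i := range c.shards {
            total += atomic.LoadInt64(&c.shards[i].n)
        }
        return total
    }

    func main() {
        c := NewShardedCounter()
        var wg sync.WaitGroup // joining also has to be wired up by hand
        for w := 0; w < 100; w++ {
            wg.Add(1)
            go func(w int) {
                defer wg.Done()
                for i := 0; i < 1000; i++ {
                    c.Add(w, 1) // each worker passes its own ID as the shard hint
                }
            }(w)
        }
        wg.Wait()
        fmt.Println(c.Sum()) // 100000
    }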
The article is off by an order of magnitude in one of the more headline-grabbing comparisons. It states 2.5 million goroutines in a gigabyte of RAM, when it should be 250k at 4KB per stack. Still an impressive difference, but less so.
I think it is important not to throw words like 'RAM' around loosely in such discussions, and to use more unambiguous terminology such as virtual address space and resident set size (RSS).
On a 64-bit machine, with 1MB stack size (this is a claim on the virtual address space, not the physical memory), you can have millions of threads too (that you may hit a lower limit due to other OS level knobs is another matter).
But wouldn't the consequence of spawning millions of native threads be to spill those virtual addresses into many, many pages and incur many more TLB misses and page faults generally?
Only the portions of the stack that are actually touched trigger a TLB miss and page fault. If you've got a million threads but each only touches the first 4K of stack space, you end up only touching 4G of RAM even though you've mapped 1T of it.
Sure, you can have millions of 1MB stacks, but unless you have terabytes of RAM performance is going to suffer.
If you write your app with more explicit state rather than implicit state (bound up in the stack), such as by writing a C10K style callback-hell, thread-per-CPU application, you're going to get much much better performance. The reason is that you'll be using less memory per-client, which means fewer cache fills to service any one client, which means less cache pressure, all of which means more performance.
The key point is that thread-per-client applications are just very inefficient in terms of memory use. The reason is that application state gets bound up in large stacks. Whereas if the programmer takes the time to make application state explicit (e.g., as in the context arguments to callbacks, or as in context structures closed over by callback closures) then the program's per-client footprint can shrink greatly.
Writing memory-efficient code is difficult. But it is necessary in order to get decent throughput and scale.
Nobody is suggesting that you create a million threads. But to think that it is not possible without TBs of physical memory is a fallacy (for 64-bit machines).
The thread-switching cost itself could be prohibitive for performance. IIRC, goroutine switching costs roughly 1/10th of a Linux thread switch.
It's "possible". If you care about performance, then it's not. Obviously I'm not defining "performance", but you'll know this when you see this. Paging is the kiss of death for performance.
Paging has nothing to do with this. When I say 1 MB thread stacks, it means the maximum size of that thread's stack in the virtual address space [0]. Each of these million threads could be using only a few KBs for its stack (out of that 1 MB of stack space). That would imply a few gigs of physical memory => no paging.
I'm curious if there are any advantages to having all of your libraries use the same concurrency mechanism. I've been dealing with a lot of "sync vs async" (aka "What color is your function?") hell lately; I wonder if it would be similar dealing with some libraries that use native threads while others use one coroutine library and other libraries use other coroutine libraries?
> Supporting truly massive concurrency requires another optimization: Only schedule a thread when you know it can do useful work!
The problem is the scheduler doesn't know when a goroutine can do useful work. It assumes that if a new value arrives on a channel there is a good chance the goroutine will do something useful, but that's not the case in general. My favorite example is Game of Life (a cellular automaton) implemented as a network of goroutines communicating their state changes through channels: you have to collect 8 updates from your predecessors to update your state, meaning you'll be scheduled to run 8 times before you finally make real progress. Not good for scalability.
I'd point out everything in your example is a characteristic of the program, not the runtime. There isn't any other runtime or compiler that is going to do significantly better, or at least I'm not aware of any. Regardless of language, if you want to be high-performance on that task, you're going to want to chunk your game world into tiles and dispatch tiles to workers, for all kinds of reasons, or something similar. (Assuming you even win from concurrency at all, since this is on the borderline of where a single CPU may be able to keep up with RAM streaming speeds, assuming modestly clever encoding. And assuming a straightforward implementation rather than HashLife, etc.)
It's true that it's hard for a scheduler to "know" that useful work is going to be done, but that's just a fact of life, just like the CPU doesn't "know" whether or not you're going to use the next page of RAM or go flying off in an effectively random direction.
In Erlang (and presumably Go) green threads end up in a waiting queue when waiting on further work to do, and as such don't need to be polled or consume any CPU at all while idle. It's only when they receive a message indicating more work can be done that they move into the work queue.
I think the person you're replying to was saying that 'useful work' does not include waking up, reading one value, and then immediately sleeping again for another response. That's a big, expensive context switch for nothing very useful, and it could have waited until all inputs were ready and there was significant computation to run.
Use one channel per neighbor and have a wait-all primitive that wakes the coroutine when all its inputs are ready. I don't know whether Go has such a primitive, and whether its implementation is optimized in that way or simply waits on each channel in sequence.
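Go has no built-in wait-all primitive over channels, but because a receive blocks, a plain sequence of receives behaves like one at the program level. A minimal sketch of a cell written that way (names invented):

    package life

    // cell is a sketch of one Game of Life cell as a goroutine. Draining each
    // neighbor channel in turn acts as a program-level "wait for all": the cell
    // only recomputes its state once every neighbor has reported. (The Go
    // scheduler may still wake the goroutine once per blocked receive, so this
    // does not by itself remove the scheduling cost described above.)
    func cell(alive bool, neighbors []<-chan bool, outputs []chan<- bool) {
        for {
            live := 0
            for _, ch := range neighbors {
                if <-ch {
                    live++
                }
            }
            alive = live == 3 || (alive && live == 2)
            for _, ch := range outputs {
                ch <- alive
            }
        }
    }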
> CPUs can only execute about one logical thread per core truly concurrently. An inherent side effect: If you have more threads than cores, threads must be paused to allow other threads to do work, then later resumed when it’s their turn again.
When would millions of goroutines be useful if you can only run ~4 threads in parallel?
Continuations are different from concurrency. A continuation is useful for modelling a process that operates independently. You may or may not want the continuations to be running concurrently. One of the earliest uses of continuations in programming languages was in simulation and modelling.
Imagine that you want to model a quarry. You have a pile of rocks, a rock crusher, a pile of gravel and a whole bunch of trucks. A truck takes a rock from the rock pile. It gives it to the crusher. The crusher works on the rock for a while and outputs some gravel. A truck takes the gravel to the gravel pile. You want to simulate different ways of scheduling the trucks so as to be optimal.
You could write your simulation so that you update every object in your system at every time step, but that's really complicated, and it also makes it really difficult to collect statistics about processes. Another way is to create a kind of "process" for each action in the system and a "resource" for each thing that could be used in the system. You then write the code for each process as a continuation, and you request resources, blocking when they aren't available. Your overall simulation loop is very simple: it runs a single continuation until it requests a resource, and then it blocks. It then runs the next continuation until it requests a resource. Then it blocks. You just keep doing this, unblocking continuations when the resources become available (and all other continuations' simulation clocks have passed its current clock).
The nice thing about this is that you need no concurrency at all -- or you can have a concurrent thread of execution for each continuation. The latter will obviously run faster, but that's not necessarily the point. The point is that the code was dramatically easier to write.
For some kinds of simulation, a million continuations is not unreasonable at all. In fact, for a lot of applications it would be great if you could have a lot more.
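A toy version of the quarry in Go (sizes and names made up; a real discrete-event simulation would also track a simulated clock and statistics), where each truck and the crusher is just a blocking, sequential goroutine:

    package main

    import (
        "fmt"
        "sync"
    )

    type rock struct{}
    type gravel struct{}

    func main() {
        rockPile := make(chan rock, 100)
        crusherIn := make(chan rock)
        crusherOut := make(chan gravel)
        gravelPile := make(chan gravel, 100)

        // Seed the rock pile.
        for i := 0; i < 20; i++ {
            rockPile <- rock{}
        }
        close(rockPile)

        // The crusher: an independent sequential process written as straight-line code.
        go func() {
            for range crusherIn {
                crusherOut <- gravel{} // "crush" one rock into gravel
            }
            close(crusherOut)
        }()

        // Three trucks hauling rocks to the crusher; each blocks while the crusher is busy.
        var trucks sync.WaitGroup
        for t := 0; t < 3; t++ {
            trucks.Add(1)
            go func() {
                defer trucks.Done()
                for r := range rockPile {
                    crusherIn <- r
                }
            }()
        }
        go func() {
            trucks.Wait()
            close(crusherIn)
        }()

        // A truck hauling gravel from the crusher to the gravel pile.
        done := make(chan struct{})
        go func() {
            for g := range crusherOut {
                gravelPile <- g
            }
            close(done)
        }()

        <-done
        fmt.Println("gravel delivered:", len(gravelPile))
    }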
If you're a server with a million clients holding an open connection, you can have a goroutine for each, holding the connection state while waiting for them to communicate again.
> When would millions of goroutines be useful if you can only run ~4 threads in parallel?
Concurrency.
Have you heard of the 10k problem?
Some people think the best way to handle tens of thousands of concurrent clients is to have each run in a lightweight thread (or goroutine, or whatever term for the same basic abstract concept). This is as opposed to using an event-based system like Node.js does. Maybe they're right or maybe they're wrong, but that's where it's useful.
A bunch of threads on one core isn't that different from node handling the same number of events. Just moving some of the responsibility from the OS to a framework.
The most critical difference is that threads can yield anywhere. Events only yield in well-defined places. Sometimes that's useful, sometimes it isn't. For example a thread can yield in the middle of a native call to something like zlib. If zlib doesn't know about your event framework (and it doesn't) then it can't yield.
IO calls. Concurrency matters only if your threads are doing IO. Otherwise it's useless to use goroutines.
Let's say an IO call over a high-latency connection takes 4 seconds. With no concurrency, during that IO call the program just sits and waits for a response. With concurrency, the program can switch to another thread while it's waiting for the IO call to complete.
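A trivial Go illustration of that arithmetic, with time.Sleep standing in for the 4-second call: three such calls finish in about 4 seconds total instead of 12, because each goroutine parks while it waits.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // slowIO stands in for a high-latency call that takes about 4 seconds.
    func slowIO(id int) {
        time.Sleep(4 * time.Second)
        fmt.Println("call", id, "done")
    }

    func main() {
        start := time.Now()
        var wg sync.WaitGroup
        for i := 0; i < 3; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                slowIO(id)
            }(i)
        }
        wg.Wait()
        // The three waits overlap, so this prints roughly 4s rather than 12s.
        fmt.Println("elapsed:", time.Since(start).Round(time.Second))
    }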
It is not just for IO. If a system is better modelled as a set of independent execution flows exchanging messages (a simulation, for example), then coroutines are perfect. Games often use coroutines to model game entities: it is not a coincidence that Lua has excellent coroutine support.
Right, except that's not what's happening in reality no matter how you model it. Having your code mimic the actual pathway of execution, rather than paying the overhead of simulating parallelism, is more efficient.
Typically multithreading adds way more complexity and a whole new class of bugs to the code.
Joe Damato claims the reason threads were historically slower than userspace threading is an accident involving MAP_32BIT in glibc.
> So I think I've identified the culprit and I believe the culprit is XFree86.
> it took seven years after this change went in for someone to figure out all the different parts that were damaged and fix them later, right? So threads were broken for a really long time.
What's the use-case for many virtual threads? I thought the main case for using threads was performance, where you need them to fully utilize the machine, although even then you'd rather not be using threads at all.
Another use-case for threads that I see is when you have work-queues that have different priorities, so you put each queue into its own thread with a matching priority. For example, 1 thread for responding to UI events, and 1 lower priority thread for all background work.
Anyways, none of these use-cases require many threads. What use-case am I missing that makes go-routines useful?
I guess doing something in a thread is a fail-safe way to make sure that 'some time' is spent processing it, so for example for the UI tasks, it would be easy to just dump all handling of UI events in virtual threads.
Coming from the Erlang world (very similar to Go as described in the blog post) they're not used for raw speed (in the sense of utilizing more cores). The Erlang VM (or the Go runtime) uses them internally in that way, but for the programmer the benefit is somewhat different.
Processes or co-routines are often used more as concurrent objects. That is, they're used to separate data and concerns into separate, isolated memory areas. However, they're also sequential programs in their own right, so the system becomes much easier to reason about. If you have a few OS threads per core, you are responsible for distributing work among those threads, whereas if you have many co-routines, one per "job", you can let the runtime decide and schedule intelligently.
A concrete example is a web server. Implemented in Erlang or Go you would typically spawn a co-routine for each incoming connection. There you would get isolation and a simple implementation (parsing the request, performing the task, returning response and exiting). A crash or bug handling a specific request would not affect the other clients.
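A bare-bones Go sketch of that pattern (an echo server here; a real server would add timeouts, connection limits, and error handling):

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Print(err)
                continue
            }
            // One goroutine per connection: a slow or stuck client only
            // blocks its own goroutine, not the accept loop or other clients.
            go handle(conn)
        }
    }

    func handle(conn net.Conn) {
        defer conn.Close()
        scanner := bufio.NewScanner(conn)
        for scanner.Scan() {
            fmt.Fprintln(conn, scanner.Text()) // parse the request, do the work, respond
        }
    }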
Other great use cases are messaging and communication apps (examples like RabbitMQ or Discord come to mind), concurrent databases, etc.
C10k is the classic problem that made event-driven systems (async programming) the "must have" for modern distributed systems. Especially coupled with websocket proxying. Thousands of long lived connections, but individually they are all very light work, so the cost of switching had started to dominate the work done by the CPU.
Sure, it can be handled as a big array where every element is a struct consisting of the ingress and egress socket/fd references, read and write buffers, and some saved state label, and then crunching all of it in a big while loop, in C, with epoll. Nginx does this, but that big array can become the bottleneck quickly. (Lock contention, cache line conflicts resulting in poor memory bandwidth utilization, etc.)
But if you have low-overhead async-await, then you don't have to solve the big array problem and you don't have to write that huge state machine thing either, things become easy[-er] to reason about again.
To do so you’d step out of the JEE spec. Spring and others are trying to bridge this. Legacy DB drivers are a problem. For example, last I checked the PG driver was still thread-based for concurrency and transactions.
So write a new driver. Yes, I know, that's a lot of work.
The thread-per-client sin derives in large part from having synchronous APIs first.
Every API, every project, needs an async-first approach.
Even then, many many all-synchronous-all-the-time thread-per-client apps will get written, but at least it will be possible to write thread-per-CPU, async-I/O apps where those apps need to scale.
While I would love for every API to have an async counterpart, there is a non-trivial cost in terms of complexity, latency and performance to implement an async API.
Cost in developer time, yes. Cost in latency and performance? No! That's the whole point of async I/O: to improve latency, performance, and scalability.
It's much harder to rewrite software to use async I/O than it is to write it to use async I/O in the first place. It's also much harder to write software using async I/O than sync I/O. Therefore it's all about economics and forecasting. If you can forecast needing to scale soon enough, then start with async I/O, otherwise increase revenue and profits first then rewrite. For many companies developer time is a larger item on their budgets than the cost of extra servers and power and all that, which is why we have so much thread-per-client software -- it's a natural outcome.
This article is wrong and reflects what appears to be a widely-held belief (practically canon here on HN) that a thread has a fixed-size stack. This is not the case on every operating system! Linux, which I've heard is popular, initially commits one page for a thread stack and continues to add pages until that thread's stack limit is hit. So this statement from which this article draws its conclusion is incorrect:
Again, all this confusion is due to not being clear about virtual and physical memory. Every OS thread in Linux does have a fixed virtual memory size (it is claimed when a thread is created, but the claim is only on virtual memory, and the value is fixed at creation time). As you grow your program's conceptual stack within this virtual memory area, you will soon touch new pages of virtual memory, leading to page faults and Linux allocating physical memory for you.
From a virtual memory standpoint, every thread would appear to have a fixed size which can be set when you create it. [0]
From a physical memory standpoint, every thread would appear to have a dynamic size (but bounded by the virtual memory size of course).
The article makes a number of authoritative claims without providing much in the way of references to other more scholarly work. One might characterize it as an opinion piece masquerading as serious analysis. Nevertheless, I found it thought-provoking and I'm learning a lot by reading the comments here.
One error I spotted is that at 4KB each, 1GB only gets you 250,000 goroutines (i.e. the article is wrong by an order of magnitude). Easy mistake to make, but bad in a comparison article.
FTA: "With 4KB per stack, you can put 2.5 million goroutines in a gigabyte of RAM – a huge improvement over Java’s 1MB per thread"
Comparing Java threads and goroutines doesn't really work because Java's thread model maps threads 1:1 to OS threads, whereas goroutines are simply continuations that can consume any OS thread. Java threads couple both the concerns of scheduling and task definition. Java Runnables address only the concern of task definition and are closer in nature to what goroutines are, but they still don't possess the ability to suspend and yield their execution.
You can implement Go's thread model (more generally known as green threads, coroutines, fibers, lightweight threads) on the JVM. This is precisely what Project Loom intends to do and what Scala effect systems already do. You construct a program in a type called IO that only describes a computation that can be suspended and resumed, unlike Java Runnables. These IO values are cooperatively scheduled to run on a fixed thread pool. Because these IO values just represent continuations, you can have millions of them cooperatively executing at the same time. And there is no context switch penalty for switching the continuation you are executing for another.
The 1:1 mapping is just an implementation detail of current JVMs. Originally green threads (with an M:N scheduler) were used instead. IIRC they weren't even guaranteed to be preemptively scheduled; I don't know if this has changed since.
Technically os threads are also simply continuations.
The article claims that aside from memory consumption the main advantage of the Go approach is that the scheduler is integrated with the channels concept and thus knows which goroutines are blocked and don't need to be scheduled until some data shows up in their channel.
But don't OS threads work like this as well, by being integrated with file descriptors among other things? If a thread is blocked on a read, the OS knows this and won't keep scheduling that thread - right?
If so, the article's argument about time wasted on scheduling doesn't make sense.
You're basically asking if generalized preemptive multitasking that has to solve every problem equally well is really that much slower than specialized cooperative multitasking that is tailored to specific language?
Of fucking course user-level scheduling is going to be superior. Just think about it. Languages like Erlang and ponylang dispatch to the scheduler after finishing the execution of every single function. Do you really think that switching to a new thread on every function call is going to be faster than not doing that? Consider that the way you're supposed to do iteration is via recursion, which means you will have one function call on every iteration and therefore invoke the scheduler on almost every iteration. The user-level scheduler can instantly switch to the next pending actor/goroutine/whatever as if it were just a regular function call. Meanwhile your regular threads have to switch to the kernel first, then load the scheduler's data structures from main memory (because the caches are filled with user-level data), then make a complex scheduling decision, and finally switch back and reload whatever data was just flushed out of the cache.
So yes, the article's argument about time wasted on scheduling makes a whole lot of sense, because not only is the scheduling of preemptive threads itself more expensive, it also incurs extra costs through context switches and emptying the caches.
> So yes, the article's argument about time wasted on scheduling makes a whole lot of sense because ... context switches ...
Right. I understand that doing multi-tasking closer to the application code can be more efficient by avoiding involving the kernel and context switching.
What I was wondering about was this specific argument that the article was making:
Suppose for a minute that in your new system, switching between new threads takes only 100 nanoseconds. Even if all you did was context switch, you could only run about a million threads if you wanted to schedule each thread ten times per second. More importantly, you’d be maxing out your CPU to do so. Supporting truly massive concurrency requires another optimization: Only schedule a thread when you know it can do useful work! If you’re running that many threads, only a handful can be doing useful work anyway. Go facilitates this by integrating channels and the scheduler. If a goroutine is waiting on an empty channel, the scheduler can see that and it won’t run the Goroutine.
Doesn't the OS also perform this optimization, i.e., only scheduling threads that can do useful work? If you have thousands of threads that are all sleeping or blocking on read, it was my understanding that the OS will not schedule them – contrary to what the article is saying above. Am I wrong about this?
This is exactly what I came to the comments to point out. You're 100% right. If the operating system's scheduler is any good at all, it won't wake up a thread which is blocked on something like read or select system calls. NIO-based servers on the JVM will (in their idle state) have a bunch of operating system threads all blocked on a select system call, waiting for a connection on the TCP socket. They won't be woken up until there's something to do.
This post is wrong. OS threads do not have a fixed-size stack in the 1:1 model. After all, if they did, then Go couldn't do stack growth, because goroutines are implemented as a layer on top of native threads. There are language runtimes, such as SML/NJ, that heap allocate all stack frames, and the kernel doesn't complain.
Additionally, as others have pointed out, Java (and, for that matter, all native code on Linux) used to use the Go model and switched to 1:1.
By heap allocating stack frames I mean that each time you call a function, SML/NJ allocates space for that individual frame on the heap. It then uses the garbage collector to clean up frames after the function call has returned.
Ok, there's a technical gaffe somewhere, in my understanding. I regularly have hundreds of threads running in JVMs with between 128MB-384MB of heap. I'm guessing the stack isn't kept on the heap?
I'm not really a Java user, but I vaguely recall that Java started with green threads to deal with this issue, long before you could just run an absurd number of threads in the OS without issues.
Would it be possible/efficient to write an OS in Go, and use its scheduler to schedule threads living in the implemented OS? What if you ran Go in that OS?
"The paper contributes Biscuit, a kernel written in Go that implements enough of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll, etc.) to execute significant applications. Biscuit makes liberal use of Go's HLL features (closures, channels, maps, interfaces, garbage collected heap allocation), which subjectively made programming easier. The most challenging puzzle was handling the possibility of running out of kernel heap memory; Biscuit benefited from the analyzability of Go source to address this challenge."
Roughly, that was done around 1990, and called Plan 9. The cost of fork in Unix was one of the problems that Plan 9 tried to fix. No one used it, so the developers backported as much as they could to Unix, and released it as Go.
By my count, Go is a fifth system. (CTSS, Multics, Unix, Plan 9.) It's interesting to observe how the second system effect progresses when the same team has been working on the same problem for 60 years.
Creating millions of coroutines or "goroutines" isn't really interesting by itself. I believe the real questions for me are:
1. is the Go scheduler better than the Linux scheduler if you have a thousand concurrent goroutines or threads?
2. is creating goroutines really that much faster than creating native threads?
3. are goroutines relevant for long-lived and computationally intensive tasks, or just for short-lived I/O tasks?
4. What is the performance of using channels between goroutines compared to native threads?
tbh, I have read several respected articles that criticize Golang goroutines in terms of performance, and I am not really sure that Golang's only virtue imho, which is simple concurrency, is performant at all.
Go's scheduler is not preemptive, which means that sometimes goroutines can hog the processor. If loops don't contain function calls, allocate memory, or hit other preemption points, they can prevent other goroutines from running. A simple example is:
    package main

    func main() {
        go println("I ran")
        for {}
    }
If you run with:
GOMAXPROCS=1 go run main.go
It will never print the statement. This doesn't come up frequently in practice, but Linux does not suffer from this case.
This just happened to me. I wanted to run a goroutine and then not exit the program, so I ran an empty for loop and it didn’t work. Would a `for { continue }` have worked?
`for { continue }` is semantically equivalent to `for {}` since the closing bracket of the loop block implies a `continue`.
As a sibling correctly notes, `select {}` is what you want to do instead. You need to do something that can block the goroutine in such a way that control returns to the scheduler. Selecting is one of those ways.
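Concretely, a version that keeps the program alive the way the parent intended might look like this (a minimal sketch; the printing loop just stands in for real long-running work):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        go func() {
            for {
                fmt.Println("still running")
                time.Sleep(time.Second)
            }
        }()
        // An empty select parks main forever without burning CPU, so even with
        // GOMAXPROCS=1 the scheduler is free to run the goroutine above.
        select {}
    }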
These are good questions and I'd be interested in more authoritative answers, but I'll hazard guesses.
1. Go's scheduler is better at scheduling Go. There are well defined points at which a context switch can occur, and the scheduler is in userland so you never really have expensive context switches.
2. Not sure. I would guess so, if only because it saves a couple context switches.
3. Goroutines are absolutely relevant for both. The stdlib webserver is a high-performance, production-grade implementation and it forks a goroutine for each request, and applications frequently fork more goroutines within a request handler.
4. I assume you're comparing Go channels to pipes? In that case the answer is probably always "faster", but how much faster depends on the size of the thing you're transferring, since IPC requires serialization and deserialization while in Go the overhead is just some locking and a pointer copy. (A rough benchmark sketch follows this comment.)
5. Go has dynamically-sized stacks. While you can change stack size on Linux (to use very small stacks and thus spin up more threads), I don't think you can change the size of an individual stack. Plus with Go, you don't have to configure stack size at all; it works out of the box on every platform.
For me, that Go's goroutines are performant is just gravy. I like that they're a platform agnostic concurrency abstraction. I don't have to deal with sync vs async ("What color is your function"; which incidentally cost me the last 1.5 workdays tracking down a performance bug in Python) or posix threads vs Windows threads.
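For what it's worth, a crude way to measure that handoff difference is a pair of Go benchmarks like the following (dropped into a file ending in _test.go and run with `go test -bench .`). The absolute numbers vary a lot by machine and payload size, and the pipe case only captures the syscall and copy overhead, not cross-process serialization:

    package chanvspipe

    import (
        "os"
        "testing"
    )

    // BenchmarkChannel times handing a 64-byte slice to another goroutine
    // through an unbuffered channel (only the slice header is copied).
    func BenchmarkChannel(b *testing.B) {
        ch := make(chan []byte)
        go func() {
            for range ch {
            }
        }()
        buf := make([]byte, 64)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            ch <- buf
        }
        close(ch)
    }

    // BenchmarkPipe times writing the same 64 bytes through an OS pipe,
    // which involves a system call and a data copy on each side.
    func BenchmarkPipe(b *testing.B) {
        r, w, err := os.Pipe()
        if err != nil {
            b.Fatal(err)
        }
        go func() {
            buf := make([]byte, 64)
            for {
                if _, err := r.Read(buf); err != nil {
                    return
                }
            }
        }()
        buf := make([]byte, 64)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            if _, err := w.Write(buf); err != nil {
                b.Fatal(err)
            }
        }
        w.Close()
    }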
Re 5: pthreads allow setting the stack size of each individual thread, although this is seldom done in practice, as on 64-bit systems address space is not at a premium.
GCC also, on some platforms, allows using segmented stacks in any C and C++ program, which means that a thread will use only as much physical and virtual memory as required. I don't think segmented stacks have seen much use, though, as they have non-zero cost and, for optimal space usage, ideally require recompiling every dependency, including glibc, with them.
Interestingly segmented stacks were implemented to support GCCGO.
How does Go handle non-async syscalls these days? If you spawn a thousand goroutines which all call stat(), does this still spawn a thousand kernel threads?
This took some of the bloom off the rose for me. For each goroutine I had to predict ahead-of-time whether it would require a kernel thread, and if so send it through a rate limiter. Effectively it was sync-vs-async but hidden.
Yes, it spawns a kernel thread if all the threads are currently blocked. I ran into this issue running Go on a FUSE mounted file system. If the network hung, Go would spawn thousands of threads and OOM. Go can't _not_ spawn them, as that would risk deadlock.
An expected way to manage this would be to have a limit on the number of threads that run blocking syscalls. It's up to you how to do it (a thread pool for syscalls, any thread can run a syscall but checks a mutex first, etc.).
I don't think there's a danger of deadlocks here -- your blocking syscalls shouldn't be dependent on other goroutines. Eventually the calls will succeed or time out (one hopes) and the next call will commence.
In my experience, you can usually find a safe number of parallel syscalls to run -- and it's often not very many.
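One common workaround is to cap how many goroutines can be inside a blocking syscall at once; the sketch below uses a plain buffered channel as a counting semaphore (the limit of 64 and the paths are arbitrary, illustrative choices):

    package main

    import (
        "fmt"
        "os"
        "sync"
    )

    // sem is a counting semaphore: at most 64 goroutines may hold a slot,
    // so at most 64 kernel threads get tied up in blocking stat() calls.
    var sem = make(chan struct{}, 64)

    func limitedStat(path string) (os.FileInfo, error) {
        sem <- struct{}{}        // acquire a slot
        defer func() { <-sem }() // release it when the syscall returns
        return os.Stat(path)
    }

    func main() {
        paths := []string{"/etc/hosts", "/etc/passwd", "/tmp"}
        var wg sync.WaitGroup
        for _, p := range paths {
            wg.Add(1)
            go func(p string) {
                defer wg.Done()
                if fi, err := limitedStat(p); err == nil {
                    fmt.Println(p, fi.Size())
                }
            }(p)
        }
        wg.Wait()
    }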
Imagine 1000 pipes, each of which has a goroutine calling write() and another calling read(). If you schedule all of the writers and none of the readers (or vice versa) you'll get a deadlock.
AFAIK when a goroutine "calls write" (on a channel, socket, stream, mutex, etc.) and the write "is blocked", it yields execution and the scheduler can activate something else, which can be a reader goroutine (after all the writers are blocked, for example). So there's no deadlock as long as you have at least one reader for any writer.
That requires the underlying syscall to support an async or non-blocking mode, though. Disk I/O or stat doesn't on Linux, for example. The usual alternative is some sort of background pool for blocking tasks (which adds non-trivial overhead), or, where supported by the OS, scheduler activations.
Sorry, I also expected those things that can be translated into non-blocking calls with select-ish apis for readiness to be translated as well -- because it's the right thing to do in this type of environment.
To be more specific, most socket calls should be async + selectish, but file system calls would likely be done as synchronous, because async doesn't generally work for that -- anyway limiting the number of outstanding file I/O requests tends to help throughput.
stat() cannot (on platforms I am familiar with) be performed in a selectish/nonblocking way. That call can block forever. Local filesystem access in general is still not async-capable in very important ways.
Yes, filesystem calls are going to block -- but putting them into a limited threadpool shouldn't deadlock your system -- you can't do a blocking read on a file, waiting for it to be filled by another thread (unless/until you start playing with FUSE, I guess).
Re 3: Funny you should mention net/http's goroutine-per-request model. Fasthttp massively outperforms the stdlib on every TechEmpower benchmark using a goroutine/worker pool model.
> so you never really have expensive context switches
It can be possible only if there are no more "physical" threads than cores. And anyway, switching between async routines on the same "physical" thread requires switching routine contexts (going to memory and updating caches).
It seems to me that Green threads exist primarily for convenience.
They are definitely not more performant than native threads with a work-stealing approach (producer - consumer pattern) that is tuned to the number of cores available. How could they be more performant? Even the simplest switching adds some cost, from a performance perspective it would never make sense to run thousands of threads on 8 cores, whether they are scheduled by the OS or by the language implementation.
So to answer your 3rd question: short lived I/O tasks.
Node is definitely not performant by any equal benchmark, but the async model does lend itself to simplicity in some cases in non-sequential workloads. But even then, it gets stomped by vert.x running on the jvm. I think the most valuable contribution is democratizing package management for SPAs as we move towards thick clients again.
I find goroutines and channels much easier to work with than promises, callbacks, or async/await, and the former can parallelize computation as well. Not to mention that Go is typically faster than Node in single-threaded execution.
You can easily implement channels in ES6. I've done it and my implementation is probably 40-50 lines of code. You can then use async/await to simulate go routines. It's actually a bit nicer than go because of the way async coordinates with promises. You can more easily synchronise the routines (for example in situations where you might need a wait group in go).
I keep expecting someone to write one and put it up on NPM (I haven't looked recently, though). Perhaps I should clean up mine and stick it up there. It provides a very nice abstraction of continuations. I did it merely as a kata to show that continuations do not require concurrency.
Are you asking to compare mutexes and channels?
The article goes through multiple reasons why the go scheduler has more information about how something should execute than the Linux kernel.
Reread and then maybe rephrase?
I imagine there are actual use cases for it but what are programs doing spinning up thousands of threads? I've never seen a program like that. Whatever happened to worker thread pools?
Say you write a server, one thread, one client at a time. Easy peasy, terrible throughput and performance. Well, making it thread-per-client is trivial! Easy peasy, much better throughput and performance, but terrible at scale.
Now you want to scale, so you need to abandon the thread-per-client, everything-is-synchronous-looking model and start doing async I/O the hard way (perhaps with await, perhaps as callback hell). That's a huge refactor / re-write exercise.
But you'd be surprised how many thread-per-client apps exist. It's so easy to write them...
Well, say that you're building a microservice that needs to handle a high number of concurrent requests. A server running Go will handle ~100x the traffic of a Python or Ruby app. Your mileage will vary depending on the use case, but the difference is huge.
Found out yesterday that even with async in Python someone can sneak in some sync code and totally ruin your day. All you know is your performance is plummeting all across your system and CPU is spiky as hell. Took me most of the day to realize the issue. From now on I’ll start every performance investigation with ‘grep requests’. Problems like this simply don’t exist in Go.
I've easily run thousands with an -Xss of 128kB. It wasn't a great implementation and we reduced this to a nice thread pool of 400, but it worked quite well.
Threads are quite useful in many situations and have great performance characteristics. It comes down to people knowing the language they are using and using it correctly.
It makes no sense to compare goroutines with OS threads; they are totally different things. This is like comparing apples with oranges: of course both of them can be used to solve the same problem, but they are totally different.
Like others have mentioned, it is well known that a thread-per-connection solution does not scale well; a much better approach is having a pool of threads with a fixed size and using an asynchronous, non-blocking, event-driven model.
Actually, in our company we did some benchmarks measuring Go (1.9, with well-known HTTP servers) against Java 8 with Netty, and the latter always won.
What does "won" mean here? Better throughput? Lower latency? Less variance in latency? Better efficiency? re: the last one, I'd be shocked if Netty could match Go stdlib http in memory usage even if you benchmarked higher throughput. I'd expect the Go server to have lower variance too due to its GC design.
Yes, this sentence is quite vague and does not give proper info about the topic; you only have to believe me or not. I'm not writing any blog about this, sorry.
It means higher throughput with lower jitter, which in our case was what we were looking for at the time. In Go's defense, we were using PooledByteBufAllocator to recycle output streams, but memory usage was not a concern in the benchmark so long as GC did not affect throughput. But I also believe that Go's http stdlib memory usage is better than Netty's; sorry for not being more helpful with this topic.
Thanks for going into more detail, I'm interested that you got lower jitter. My experience with high performance JVM networking stacks is that they achieve incredible throughput but are quite hungry (with a hungry baseline too) and that 99P can be not so great due to GC. I'll have to give Netty another look. I'd be interested in the RSS of each process after warmup.
In my experience, tuning a JVM application to gain better throughput is far from a trivial job; the JVM ecosystem requires a much more experienced developer on the platform to reach certain performance levels, compared to Go's simplicity, which makes a really high-performance solution easier to achieve.
For instance, deciding on an appropriate concurrent algorithm and measuring it, e.g. using AtomicInteger vs. a per-thread counter, is not easy (especially in more complicated scenarios).
We have both Java 8 and Go high-throughput, critical microservices with excellent 99th-percentile latency, and in general memory is not a concern (as long as we fine-tune the GC and don't have any memory leaks). Generally, for the really critical and portable solutions we choose Java over Go (unit testing and library versatility are big players in this discussion).
It does make sense. These are all techniques for dealing with async events (I/O). Thread-per-client, cooperative multitasking / co-routines with co-routine-per-client, ... all the way to C10K-style thread-per-CPU callback hell.
As I said, both can be used to solve the same problem, but since they are totally different and work very differently, to me it makes no sense to do a comparison and expect similar results. If the comparison were analyzing the same technique in different languages, then it would be fair.