This project is not intended to be a general purpose replacement for the standard Go net package or goroutines. It's for building specialized services such as key value stores, L7 proxies, static websites, etc.
You would not want to use this framework if you need to handle long-running requests (milliseconds or more). For example, for a web API that needs to connect to a mongo database, authenticate, and respond, just use the Go net/http package instead.
There are many popular event loop based applications in the wild such as Nginx, Haproxy, Redis, and Memcached. All of these are single-threaded and very fast and written in C.
The reason I wrote this framework is so I can build certain network services that perform like the C apps above, but I also want to continue to work in Go.
> It's for building specialized services such as key value stores, L7 proxies, static websites, etc.
First of all, thank you for publishing this project. It's very interesting in my opinion since I never thought about the benefits of an event loop. Would you mind explaining briefly why an event loop is a better fit for these applications? Is it due to performance and efficiency?
I'd suggest that's not the right way to look at it. To a first approximation, "everything" is using an event loop nowadays, in that everything is using the same fundamental primitives to handle and dispatch events. In particular, this includes the Go runtime; run "strace" on a Go network program and you'll see these same calls pop up in the strace.
What this does instead is give a Go program direct access to the event loop. The benefit is that it bypasses everything Go wraps around the internal event loop call: the machinery that presents a thread-like interface, integrates with channels and the other concurrency primitives, maintains your position in the call stack between events, and so on. The penalty is... the exact same thing: you lose all the nice machinery the Go runtime provides to implement that thread-like interface, and you're back to a lower-level interface that offers fewer services.
The performance of the Go runtime is "pretty good", especially by scripting language standards, but if you have sufficiently high performance requirements, you will not want to pay the overhead. The pathological case for all of these nice high-level abstractions is a server that handles a ton of network traffic of some sort and needs to do a little something to every request, maybe just a couple dozen cycles' worth of something, at which point paying what could be a few hundred cycles for all this runtime nice stuff that you're not using becomes a significant performance drain. Most people are not doing things where they can service a network request in a few dozen cycles, and the longer it takes to service a single request the more sense it makes to have a nice runtime layer providing you useful services, since the runtime's share of your program's CPU time shrinks. For the most part, if you are so much as hitting a database over a network connection, even a local one, in your request, you've already greatly exceeded the amount of time you're paying to the runtime, for instance.
It does seem to me that a lot of people are a bit bedazzled by the top-level stuff that various languages offer, and forget that under the hood, everyone's using the event-based interfaces. What differs between Node and Twisted and all of the dozens or hundreds of other viable wrappers over these calls is the services automatically provided, not whether or not they are "event loops". Go is an event loop at the kernel level. Node is an event loop at the kernel level. Erlang is an event loop at the kernel level. They aren't all the same, but "event-based" vs. "not event-based" is not the distinction; it's a question of what they lay on top of the underlying event loop, not whether they use it. Even pure OS threads are, ultimately, event loops under the hood, just in the kernel rather than the user space.
> It does seem to me that a lot of people are a bit bedazzled by the top-level stuff that various languages offer, and forget that under the hood, everyone's using the event-based interfaces.
Yup. It's all very similar under the hood.
The most important difference between I/O models is whether the paradigm involves explicit vs. implicit management of the event loop. Callback models like Node, async/await style models like those of C#, and low-level primitives like IOCP, epoll, and kqueue fall into the former category. Go/Erlang, plain old threads, and even Unix processes fall into the latter category. There are advantages and disadvantages of each model.
Within each of these broad categories, the distinctions are, IMHO, much less interesting, and they're often made out to be more significant than they actually are. In particular, the distinction between runtimes like Go and regular OS pthreads is often made out to be more important than it really is, when the difference ultimately boils down to the CPU privilege level that thread management runs at.
Patrick, on the 2.6+ Linux kernels, is there a significant difference between threads and processes? It seems like both threads and processes are created via clone and the only difference is memory access?
I often hear "context switching between threads is cheaper" but pthreads still have their own PID and everything, so is this really the case?
Is there really much advantage to pthreads over the way PostgreSQL does things with efficient CoW sharing between processes for the binary?
The significance of the distinction depends entirely on the use case.
Yes, they’re both created with clone, but with different levels of sharing. A pthread will share the virtual address space of its parent, which makes shared memory simple to implement; use the same pointer and you’re done. CoW is not “sharing” really, because you can’t communicate over it, it just saves some creation overhead.
With CoW, technically nothing gets copied initially, but as soon as the new process starts executing, it’s going to start copying the stack frame and any other regions it’s using. With a pthread you can be certain it will just copy the stack.
Context switches are usually cheaper when you don’t need to throw out the old virtual address space (and invalidate the Translation Lookaside Buffer). Pthreads share virtual address space, so there is no need to flush the TLB.
In a use case like Postgres, you don’t necessarily need to optimise for context switches. If you have a lot of concurrent connections, each of which has one process, then you’ll only hit limits with context switching overhead if very few of those connections are fighting over any locks or spending much time in IO at all. This is atypical, so usually those other factors hit you first.
> The significance of the distinction depends entirely on the use case.
Indeed.
> Context switches are usually cheaper when you don’t need to throw out the old virtual address space (and invalidate the Translation Lookaside Buffer). Pthreads share virtual address space, so there is no need to flush the TLB.
I believe the cost of that has been reduced somewhat due to tagged TLBs on modern hardware.
> In a use case like Postgres, you don’t necessarily need to optimise for context switches. If you have a lot of concurrent connections, each of which has one process, then you’ll only hit limits with context switching overhead if very few of those connections are fighting over any locks or spending much time in IO at all. This is atypical, so usually those other factors hit you first.
Yea. There's a number of limitations in postgres due to the process model, but they're imo not TLB / context switch related. The biggest issue is that dynamically sharing memory between processes is harder, because there's no guarantee that post-fork memory allocations can portably be placed at the same virtual addresses in every process. That in turn makes it more complicated to have shared data structures, because you need to use relative pointers and such. That's not a problem for the main buffer pool etc., which is allocated when postgres is started, but it is problematic e.g. for memory shared between multiple processes working on the same query (say the memory for a shared hashtable in a hashjoin).
> I don't think this qualifies as a performance overhead, though, beyond the odd isub.
It ends up as one. The reason is less the additional instruction(s) than that you actually need to ferry around additional data. In common scenarios you'll end up with a number of mappings shared between processes, so you can't just assume a single base address per process. Instead you have to associate the specific mapping with each relative pointer, and that does add overhead, both programming-wise and runtime-efficiency-wise.
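To make the relative-pointer idea concrete, here's a toy Go sketch (layout and names invented for illustration; a plain byte slice stands in for a shared mapping): links are stored as byte offsets from the region's base instead of absolute addresses, so any process can follow them no matter where the region happens to be mapped.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // relPtr is a byte offset into the shared region; 0 is treated as nil.
    type relPtr uint32

    // Each node is 8 bytes: a 4-byte value followed by a 4-byte "next" offset.
    func writeNode(region []byte, at relPtr, value uint32, next relPtr) {
        binary.LittleEndian.PutUint32(region[at:], value)
        binary.LittleEndian.PutUint32(region[at+4:], uint32(next))
    }

    func readNode(region []byte, at relPtr) (value uint32, next relPtr) {
        return binary.LittleEndian.Uint32(region[at:]),
            relPtr(binary.LittleEndian.Uint32(region[at+4:]))
    }

    func main() {
        region := make([]byte, 1<<10) // stand-in for a shared mapping

        // Build a two-node list; offset 0 is reserved as "nil".
        writeNode(region, 16, 42, 0)
        writeNode(region, 8, 7, 16)

        // Walking the list needs only the region's base plus offsets.
        for at := relPtr(8); at != 0; {
            v, next := readNode(region, at)
            fmt.Printf("node at offset %d: value=%d\n", at, v)
            at = next
        }
    }

Carrying the region base around and converting offsets to addresses on every access is exactly the extra data-ferrying and overhead described above.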
This is an extremely helpful explanation. Would you consider adding a "Rationale" subheading to the readme and pasting this in wholesale? Great project, thanks for sharing!
One of my favorite things about Go is that it cuts through the "threads vs. events" debate by offering thread-style programming with event-style scaling using what you might call green threads (compiler assisted cooperative multitasking that has the programming semantics of preemptive multitasking).
That is, I can write simple blocking code, and my server still scales.
Using event loop programming in Go would take away one of my favorite things about the language, so I won't be using this. However I do appreciate the work, as it makes an excellent bug report against the Go runtime. It gives us a standard to hold the standard library's net package to.
> That is, I can write simple blocking code, and my server still scales.
>
> Using event loop programming in Go would take away one of my favorite things about the language, so I won't be using this.
If Go has or can emulate 'generators' à la Python/Node.js, then you can write synchronous-looking, blocking-like code with event loops as well.
That is exactly what Go does by default. Any time a blocking operation is performed, Go either leaves the OS-level thread blocked there and switches away, or hands the blocking operation to an internal thread which is running epoll for the whole process.
The end result is much easier than Python/NodeJs because there is no explicit "async/await" or deferred-style programming. You simply write linear code and any blocking operation (at the syscall level) is transparently handled.
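For instance, here's a minimal goroutine-per-connection echo server written entirely in that blocking style (a sketch; the port and error handling are arbitrary). Every apparent block is really an epoll/kqueue registration handled by the runtime's netpoller.

    package main

    import (
        "io"
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", "127.0.0.1:5000")
        if err != nil {
            log.Fatal(err)
        }
        for {
            // Accept looks blocking, but the runtime parks this goroutine on the
            // netpoller and the OS thread is free to run other goroutines.
            conn, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            go func(c net.Conn) {
                defer c.Close()
                // io.Copy "blocks" on every read and write; each wait goes through
                // the same epoll/kqueue machinery an explicit event loop would use.
                io.Copy(c, c)
            }(conn)
        }
    }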
FYI, this is not unique to Go. A number of languages implement lightweight threads in the same way (roughly), like erlang or Haskell. It’s all epoll or other efficient polling primitives under the hood.
I'm of the impression that there's an important difference between Go and Haskell's models--namely that Go is M:N threaded and Haskell is not; however, I don't entirely understand the significance of the difference, so hopefully someone else can comment and enlighten me.
No, I'm not sure. :) I may have incorrectly assumed that the definition of M:N threads includes movable application threads (e.g., Go's scheduler can move goroutines from one kernel thread to another).
It depends if you're describing a semantic model or you're concerned about implementation details.
Semantically, a goroutine is a thread, within a shared memory model. But what makes Go unique (or let's say more unique) is that it offers programmers a thread-like programming approach (linear, blocking code) but internally turns it into an event-driven approach (epoll/kqueue) for networking.
Moreover, the fact that goroutines are much cheaper than OS-level threads enables a more pervasive approach to concurrency.
Go uses an m:n thread model. Goroutines are multiplexed onto a smaller number of OS-level threads. They're sort of like threads, but they have a simplified programming model (there is no thread-local storage, for example).
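A rough way to see how cheap parked goroutines are (a sketch; the numbers vary by Go version and platform) is to spawn a large number of them blocked on a channel and look at how much extra memory the runtime obtains from the OS:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        const n = 100000
        var ms runtime.MemStats

        runtime.ReadMemStats(&ms)
        before := ms.Sys

        stop := make(chan struct{})
        var wg sync.WaitGroup
        wg.Add(n)
        for i := 0; i < n; i++ {
            go func() {
                defer wg.Done()
                <-stop // park the goroutine; it costs little more than its small stack
            }()
        }

        runtime.GC()
        runtime.ReadMemStats(&ms)
        fmt.Printf("%d goroutines parked, ~%.1f MiB more obtained from the OS\n",
            runtime.NumGoroutine(), float64(ms.Sys-before)/(1<<20))

        close(stop)
        wg.Wait()
    }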
For my information, why is M:N so successful for Go and not for pthreads? Is M:N more practical for systems with a particular kind of garbage collector?
In my opinion, it's because the Go team put a ton of effort into getting M:N working and didn't rigorously evaluate any other alternatives in Go, once moving GC was implemented.
I'm not convinced that 1:1 wouldn't have been a perfectly reasonable implementation strategy for Go.
This is surprising. It seems like Go gets pretty high praise from all over for its threading model, and it seems like there are relatively few high-performance servers that are built with 1:1. Am I wrong about that? If not, what explains this?
It doesn't really "cut through" the debate any more than any other implementation of threads does. The only difference between Go and plain old one-thread-per-connection is that regular threads run in the kernel, while Go threads run in userspace. That's not a semantic difference, only an implementation detail (a large detail, to be clear, but still an implementation detail).
There were historical implementations of pthreads, such as NGPT, that used precisely the same model as Go, and they were abandoned because the advantages over 1:1 were not sufficient to justify the complexity.
What you call a "Go thread" has a precise name (goroutine) and running in userspace is hardly the only difference between a goroutine and a kernel thread.
Creating and destroying kernel threads is significantly more expensive.
A kernel thread has a fixed stack and if you go beyond, you crash. Which means that you have to create kernel threads with worst-case-scenario stack sizes (and pray that you got it right).
A goroutine has an expandable stack and starts with a very small one (which is partly why it's faster; setting up kernel page mappings to create a contiguous space for a large stack is not free).
Finally, goroutine scheduling is different than kernel thread scheduling: a blocked goroutine consumes no CPU cycles.
On a 4-core CPU there is no point in running more than 4 busy kernel threads, but the kernel scheduler has to give each thread a chance to run. The more threads you have, the more time the kernel spends on the pointless work of ping-ponging between them. That hurts throughput, especially when we're talking about high-load servers (serving thousands or even millions of concurrent connections).
The Go runtime only creates about as many threads as there are CPUs and avoids this waste.
That's why high-perf servers (like nginx) don't just use a kernel thread per connection and instead go through the considerable complexity of writing event-driven code.
Go gives you the straightforward programming model of thread-per-connection with scalability and performance much closer to the event-driven model.
You work on Rust and are well informed about this topic so I'm sure you know all of that.
Which is why it amazes me the lengths to which you go to denigrate Go in that respect and minimize what is a great and unique programming model among mainstream languages.
> What you call a "Go thread" has a precise name (goroutine)
I call goroutines threads because they are user-level threads.
As an analogy, NVIDIA calls local threadgroups "warps", but that doesn't make them not local threadgroups.
> Creating and destroying kernel threads is significantly more expensive.
Because kernel threads usually have larger stacks. But they don't always have large stacks: that is configurable. Other than the stack size, the primary difference is simply that kernel threads are created in kernel space and user threads are created in userspace.
> A kernel thread has a fixed stack and if you go beyond, you crash. Which means that you have to create kernel threads with worst-case-scenario stack sizes (and pray that you got it right).
You can do stack switching in 1:1 too. After all, if you couldn't, then Go couldn't do stack switching at all, since goroutines are built on top of kernel threads.
Go's small stacks are really a property of the moving GC, not a property of the threading model.
> On a 4-core CPU there is no point in running more than 4 busy kernel threads, but the kernel scheduler has to give each thread a chance to run.
> The Go runtime only creates about as many threads as there are CPUs and avoids this waste.
Not if they're blocked doing I/O!
If they're not blocked doing I/O, then Go tries to do preemption just as the kernel does. (I say "tries to" because Go currently cannot preempt outside function boundaries; this is a significant downside of M:N threading compared to 1:1 kernel threading.)
> That's why high-perf servers (like nginx) don't just use a kernel thread per connection and instead go through the considerable complexity of writing event-driven code.
High-performance servers like nginx use an event loop because it's the only way to get the absolute fastest performance, with no overhead of stacks at all. The fact that the project described in the article gets better performance than Go's threads is proof of that.
It would be possible, and interesting, to do Go-like 1:1 threading with small stacks.
> Go gives you the straightforward programming model of thread-per-connection with scalability and performance much closer to the event-driven model.
Sure. But that's mostly because of the GC, not because of the M:N threading model.
> Which is why it amazes me the lengths to which you go to denigrate Go in that respect and minimize what is a great and unique programming model among mainstream languages.
It's not unique. As I said, NGPT used to do M:N for pthreads. Solaris used to do M:N for pthreads. The JVM used to do M:N.
The goroutine implementation scales, while other thread implementations (by default) do not. That's a semantic difference. A Go server can have millions of active goroutines with moderate resource use.
You can achieve the same on Linux or Solaris using kernel threads, but you have to work at it. With Go you don't have to work at it, and it works on macOS and Windows and a few other OSs too.
This is all comparisons between O(1) things, but the constant factor matters.
> You can achieve the same on Linux or Solaris using kernel threads, but you have to work at it.
By setting the thread stack size to a reasonable value. That's it. And, in fact, on 64-bit you often don't even need to do that.
The difference you're describing is a difference in default thread stack sizes, which is hardly a paradigm shift. We're talking about one call to pthread_attr_setstacksize().
First: if you have an epoll loop there is also the cost of a thread context switch, which has definitely hurt us in RPC systems using kernel threads. By contrast the goroutine gets scheduled onto the kernel thread that answered the poll, saving the switch.
Second: as I alluded to earlier, linux and solaris can scale their kernel thread implementations, not all OSs can. My experiences with large numbers of threads on the BSDs and Windows (in years past admittedly) suggest other kernels don't have thread implementations designed to scale to such high numbers. Solving the problem in userspace means Go programs written in this style are portable across operating systems.
Third: you can only adjust stack sizes down if you know your program always keeps its stacks small. If you depend on libraries you don't own in C/C++, that's a difficult assumption. Go grows the stacks, so if you hit some corner case where a small number of goroutines need some significant amount of stack, your program uses more memory, but typically keeps working. No need for careful (manual!) stack accounting.
If all this were as easy as you say, we would still write nearly all our C/C++ servers using threads. We don't because it's not.
> First: if you have an epoll loop there is also the cost of a thread context switch, which has definitely hurt us in RPC systems using kernel threads. By contrast the goroutine gets scheduled onto the kernel thread that answered the poll, saving the switch.
I'm not comparing M:N to a 1:1 system where all I/O is proxied out to another thread sitting in an epoll loop. I'm comparing M:N to 1:1 with blocking I/O. In this scenario, the kernel switches directly onto the appropriate thread.
> Second: as I alluded to earlier, linux and solaris can scale their kernel thread implementations, not all OSs can.
The vast majority of Go users are running Linux. And on Windows, UMS is 1:1 and is the preferred way to do high-performance servers; it avoids a lot of the problems that Go has (for instance, playing nicely with third-party code).
> Third: you can only adjust stack sizes down if you know your program always keeps its stacks small.
You could do 1:1 with stack growth just as Go does. As I've said before, small stacks are a property of the relocatable GC, not a property of the thread implementation.
> If all this were as easy as you say, we would still write nearly all our C/C++ servers using threads.
We don't write C/C++ servers using threads because (1) stackless use of epoll is faster than both 1:1 threading and M:N threading, as this project shows; (2) C/C++ can't do relocatable stacks, as the language is hostile to precise moving GC.
First a point of curiosity, have you seen a linux 1:1 system with blocking I/O scaled to millions of active threads? I have only ever seen it with epoll. My working assumption has been that the kernel blocking calls won't scale, but I have not tested that.
Second, almost all the event-driven C++ servers I have seen are written that way not for performance, but for scaling and latency. There is usually plenty of extra CPU and RAM, only a tiny fraction really bump up against resource limits. (A typical case of the vast majority of code not being performance sensitive.)
Otherwise, I agree with your points in this comment. Especially the broader point that there's no novel component of Go. Go is about combining well-known things together.
However, it seems to me that Go still cuts through the "threads vs. events" argument in a way nothing else does. I can write code in a blocking style using typical libraries, and have it scale to large numbers of active connections.
On other systems the implementations don't scale or I have to heavily restrict library use based on stack growth, or I am tied to a particular OS. It seems to me the only alternatives to Go's nice blocking code environment require significant compromise or require something to be built.
Choice of 1:1 or M:N is all about trade offs. NPTL chose 1:1 for simplicity (and decided to focus instead on making context switches as cheap as possible in the Linux kernel). But that doesn't mean M:N has no benefits - I think it does, as golang, erlang, and other languages illustrate.
I agree with OP that golang seems to provide the best of both worlds in the “event” vs “thread” debate. We can get the performance benefits of an eventing model with a much simpler programming model of thread per request.
It’s all “semantically” similar but it’s the details that matter. And I think golang chose the correct trade offs here (and with their sub-ms GC as well). The JVM, as an opposing example, made all the wrong choices I think for the general use case. Slow GCs and 1:1 threading.
I always understood the overhead of kernel threads compared to user threads to be significant at large scale. It’s not just stacks either. It can be a lot cheaper to swap between user threads, depending on implementation, compared to the scheduler having to preempt and trap into kernel code and provide a general purpose context switch.
There's nothing really that special about goroutines. Ruby also introduced Fibers in 2007. There's been some discussion of adding a more automatic M:N threading model to Ruby 3.
The Go network stack already makes use of epoll and kqueue: https://golang.org/src/runtime/netpoll_epoll.go
So I'm not quite sure why this would be faster, since almost all I/O in Go is event-driven, including the networking stack.
The benchmarks at the bottom of the readme show quite an improvement (with a single thread it seems).
I would speculate the performance win comes from having no stack switching and fewer channels.
I've done lots of event loops in the past (e.g. hellepoll in C++) and think that the cost of that falls on the programmer - keeping track of callbacks, state machines and so on, and avoiding using the stack for state, is all hard work and easy to mess up.
> I've done lots of event loops in the past (e.g. hellepoll in C++) and think that the cost of that falls on the programmer - keeping track of callbacks, state machines and so on, and avoiding using the stack for state, is all hard work and easy to mess up.
I very much agree. In the past, I have had quite some fun developing a few streaming parsers using Node.js, which also uses an event loop. And while these parsers worked relatively well and efficiently, debugging them was not an easy task. In addition, understanding the code is also a tough challenge, especially for people other than the original authors.
When I started using Go more and more, I really enjoyed the different I/O-model using goroutines and blocking function calls. It also has a few drawbacks but the mental model is a lot easier to reason about.
> I've done lots of event loops in the past (e.g. hellepoll in C++) and think that the cost of that falls on the programmer - keeping track of callbacks, state machines and so on, and avoiding using the stack for state, is all hard work and easy to mess up.
This is improving, even in C++. This is what the core loop of a line-based echo server could look like in C++17 (and something very similar compiles today on my machine)
Unfortunately it's just exposition, but here[0] is a version that works with Clang 5 + Boost
Echo-specific code starts on line 167. Everything above will hopefully be provided by the standard library once both the Networking TS and Coroutine TS merge into C++20.
One nice thing about lines 1 - 165 though, is that it demonstrates how easy it is to extend the native coroutine capabilities in C++ to support arbitrary async libraries, even if the author of those libraries didn't know anything about coroutines. All this happens without breaking the ability to call these coroutines from C. You can even use async C libraries that only provide a void* argument to your callback.
Well, I guess because the runtime has to do a bunch of work to dispatch the events to the appropriate goroutine that is blocked waiting for that event. Switching and synchronization between goroutines is cheap, not free.
I'd be interested in the layer 7 reverse proxy application. As well as Unix domain socket message queues. There are probably many other places in the networking pipeline evio could provide a boost.
It's a testament to what is possible through the "syscall" and "golang.org/x/sys" facilities. As well as your confidence in playing with Linux internals ;)
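For a sense of what that looks like, here's a stripped-down, Linux-only echo loop written directly against epoll via golang.org/x/sys/unix. It's only a sketch (minimal error handling, one shared buffer, no partial-write handling), not how evio itself is structured:

    package main

    import (
        "log"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Non-blocking listening socket on 127.0.0.1:5000.
        lfd, err := unix.Socket(unix.AF_INET, unix.SOCK_STREAM|unix.SOCK_NONBLOCK, 0)
        if err != nil {
            log.Fatal(err)
        }
        unix.SetsockoptInt(lfd, unix.SOL_SOCKET, unix.SO_REUSEADDR, 1)
        if err := unix.Bind(lfd, &unix.SockaddrInet4{Port: 5000, Addr: [4]byte{127, 0, 0, 1}}); err != nil {
            log.Fatal(err)
        }
        unix.Listen(lfd, 128)

        epfd, err := unix.EpollCreate1(0)
        if err != nil {
            log.Fatal(err)
        }
        unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, lfd, &unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(lfd)})

        events := make([]unix.EpollEvent, 64)
        buf := make([]byte, 4096)
        for {
            n, err := unix.EpollWait(epfd, events, -1)
            if err == unix.EINTR {
                continue
            }
            if err != nil {
                log.Fatal(err)
            }
            for i := 0; i < n; i++ {
                fd := int(events[i].Fd)
                if fd == lfd {
                    // New connection: accept it and register it with the loop.
                    cfd, _, err := unix.Accept4(lfd, unix.SOCK_NONBLOCK)
                    if err != nil {
                        continue
                    }
                    unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, cfd, &unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(cfd)})
                    continue
                }
                // Readable connection: echo what arrived, close on EOF or error.
                m, err := unix.Read(fd, buf)
                if err != nil || m == 0 {
                    unix.EpollCtl(epfd, unix.EPOLL_CTL_DEL, fd, nil)
                    unix.Close(fd)
                    continue
                }
                unix.Write(fd, buf[:m])
            }
        }
    }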
Not sure I understand what the use case is. As soon as you start doing something on the event loop, you need some kind of way to perform the operation in another "thread" (or goroutine or whatever).
And then you start to need some kind of concurrency mechanism, and pay the price.
Stripping those mechanisms to pretend the event handling is faster only works if you never intend to have some real computation performed. That's never true in practice... Or am I missing something?
Not the OP, but typically you resort to these tactics when you want to shave the last millisecond off the server's response time, and/or get that last 1000 requests/s/core of performance. You have a "fast path" that is simple and event-driven, and you hand off processing to regular threads for the (less frequent) more complex operations.
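A rough sketch of that split (with an invented line protocol; plain goroutines and a channel-fed worker pool stand in for whatever the event-loop side would actually be): cheap requests are answered inline on the hot path, expensive ones are handed off and the reply is written once the work completes.

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "net"
        "strings"
        "time"
    )

    type job struct {
        req   string
        reply chan string
    }

    // worker simulates the slow path: heavyweight operations run off the hot path.
    func worker(jobs <-chan job) {
        for j := range jobs {
            time.Sleep(10 * time.Millisecond) // stand-in for real work (DB call, hashing, ...)
            j.reply <- "done: " + j.req
        }
    }

    func handle(c net.Conn, jobs chan<- job) {
        defer c.Close()
        sc := bufio.NewScanner(c)
        for sc.Scan() {
            line := sc.Text()
            if strings.HasPrefix(line, "PING") {
                fmt.Fprintln(c, "PONG") // fast path: answered immediately
                continue
            }
            reply := make(chan string, 1)
            jobs <- job{req: line, reply: reply} // slow path: hand off to the pool
            fmt.Fprintln(c, <-reply)
        }
    }

    func main() {
        jobs := make(chan job, 128)
        for i := 0; i < 4; i++ {
            go worker(jobs)
        }
        ln, err := net.Listen("tcp", "127.0.0.1:6000")
        if err != nil {
            log.Fatal(err)
        }
        for {
            c, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            go handle(c, jobs)
        }
    }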
At this point, why not just use C++? I feel like people are trying to stretch Go way past what it's good for. It's not going to replace C++ where C++ is effective, and it shouldn't :)
This is single-threaded? What are you going to do with the other 31 or 63 cores?
The single-threaded nature of applications like Redis and Haproxy is a significant impediment to their vertical scalability. CPUs aren't getting faster, we're just going to get more cores, so anything that assumes there's only a single core seems like a dead end.
Haproxy literally just added multithreading support in 1.8.
The CPU is rarely the bottleneck and for both Redis/HAProxy the vertical scalability solution has been to launch multiple processes or forks with different core affinities. There are downsides of course (no IPC) but I still argue that CPU is not the bottleneck for 99% of usage scenarios.
HAProxy added threading support in 1.8 as you pointed out and Redis has started the same (for a certain subset of processing) in 4.0 as well. They're getting there but concurrency is tough.
To suggest that his product is a "dead end" due to not supporting threading seems a bit premature, as Redis and HAProxy are extremely well-regarded in their niche and they made it there without threading, and we've been at maximal clock speed for nearly a decade.
> There are downsides of course (no IPC) but I still argue that CPU is not the bottleneck for 99% of usage scenarios.
I suppose my experience might be unusual, but I frequently log in to c3.8xlarge redis machines that have a single core pegged at 100% and the rest doing nothing. Yes multiple processes help, but that requires updating clients and makes it harder to share memory.
> To suggest that his product is a "dead end" due to not supporting threading seems a bit premature, as Redis and HAProxy are extremely well-regarded in their niche and they made it there without threading.
Well yeah, CPUs hitting their GHz limit and the dramatic increase in the number of cores per machine are relatively recent phenomena.
I just think it's weird to start a brand new project making those same assumptions, especially when the underlying programming language was explicitly designed with concurrency in mind.
It'd be like building a new networking library in Rust which ditches memory safety.
> This is single-threaded? What are you going to do with the other 31 or 63 cores?
Yes, the event loop is single-threaded. The other cores can be used for other stuff, but not the event loop.
It's completely possible with this library to process operations in a background thread and wake up the loop when it's time to write a response. If that's what the developer desires.
> anything that assumes there's only a single core seems like a dead end.
If my documentation somehow implies that systems running this library do not have multiple cores then I'm sorry for the confusion. This library makes no assumption about the host server, and it does not limit the application to a single core. It just runs the event loop in one thread.