Philosophy of Coroutines (greenend.org.uk)
171 points by Tomte on Sept 1, 2023 | 62 comments



https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

  int function(void) {
      static int i, state = 0;
      switch (state) {
          case 0: /* start of function */
          for (i = 0; i < 10; i++) {
              state = 1; /* so we will come back to "case 1" */
              return i;
              case 1:; /* resume control straight after the return */
          }
      }
  }


Huh, I didn't know you could put case labels just about anywhere inside a switch block in C. Past the initial WTF, it's a nice hack for implementing coroutines.


Oh yeah, you can toss ‘em in just about anywhere. The most well-known application of this would be Duff’s device: https://en.wikipedia.org/wiki/Duff%27s_device
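
For reference, a minimal sketch of the trick, assuming count > 0 and using illustrative names (this is the memory-copying variant; Duff's original wrote to a fixed device register):

    /* The switch jumps into the middle of the do-while: the first pass
       handles count % 8 copies, every later pass does 8. */
    void copy8(short *to, const short *from, int count) {
        int n = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *to++ = *from++;
        case 7:      *to++ = *from++;
        case 6:      *to++ = *from++;
        case 5:      *to++ = *from++;
        case 4:      *to++ = *from++;
        case 3:      *to++ = *from++;
        case 2:      *to++ = *from++;
        case 1:      *to++ = *from++;
                } while (--n > 0);
        }
    }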


As I understand it, they're kinda syntactic sugar for goto statements.


Personal preferences differ, but I would find this code so much cleaner if it were a goto.


Coroutines really are powerful. Here's hoping Rust stabilizes generators (full semicoroutines, not the enhanced iteration construct) sooner rather than later. It would give programmers greater control to express state machine logic without being boxed into async/await. I do wonder if the `Context` parameter used for async tasks is appropriate for arbitrary generators, so if someone could chime in on this I would be grateful.


Or provide continuations and use them to implement coroutines.


A couple of us have been experimenting with delimited continuations and I think it’s gonna take off soon in TS

https://youtu.be/uRbqLGj_6mI?si=kgKKjpCnehJ9bpIG

https://github.com/neurosnap/starfx https://github.com/thefrontside/effection/tree/v3


To my understanding, delimited continuations and coroutines are equivalent in power. Out of curiosity, can delimited continuations be implemented as efficiently (in Rust or otherwise) as stackless coroutines?


These don’t look like continuations to me, just ways to use generators? Continuations would involve reifying the current control flow into a value, which these examples don’t seem to do and still rely on generators to yield control.


These libraries rely on delimited continuations:

https://github.com/thefrontside/continuation

Because of this tiny implementation we are able to express any async flow construct with less code than relying on something like async/await or callbacks.


Single-shot continuations in the form of lexical effect handlers would be a good fit for a systems language nowadays. I feel it's a bit too late to staple that onto Rust though.


I've been using genawaiter[1] for writing generators in Rust. Are there things you can't build on top of async-await? Or is it just that it's simpler without that indirection?

[1]: https://crates.io/crates/genawaiter


I was thinking that the semantics around async/await (function coloring) would make some usages more frustrating. This crate probably sidesteps that, if it is a problem at all. I have a bad habit of not dabbling around when I lack understanding.


This feels circular, since async/await is futures, which are generators.


I don't think `Context` is appropriate for generic generators, since it has no way to get data in or out of the `async` block. The more general `generator` feature allows a custom context type though.


I must've looked at an outdated design or something. Thanks for the info.


I think that the biggest missing "killer feature" for systems programming languages is the ability to treat stackless coroutines as just another "plain old data" struct. Here's a list of things I can do with structs that I can't do with C++ coroutines or Rust generator objects (much to my chagrin):

* Get their size/alignment at compile time

* Allocate them on the stack

* Allocate them in a memory region of my choosing

* Create a contiguous array of them

* Copy them

* Serialize/deserialize them

Imagine being able to save coroutines to your filesystem, reboot your machine, restart your program, and have it pick up where it left off. Or, migrating lots of actively running coroutines to another machine by serializing them and sending them over the network. Or creating a high-performance, high-concurrency event loop using a giant map of coroutine objects.


Scala async coroutines build FSMs that are just standard objects. They aren't very commonly used, but the same approach could probably be adapted to C.

Basically, the compiler compiles an async function into a very simple FSM object where each assignment to a local variable that lives longer than a yield point becomes a member of the FSM object. This is done after the transformation to a simple normal form, at which point the function basically looks like bytecode. The compiler generates a new class for every async coroutine.

The generated FSM classes are statically sized so they could in principle be stack allocated, and their instances can be manipulated like any other JVM object.

At $work we have a system using Scala async that has all of these features (arrays of FSMs, serialization, distribution by sending over the network, etc.). Of course there are complications (file handles, for example) and it's easier with a GC.


Maybe you missed that the article explicitly talks about the cloning/deep copy of coroutines:

> Using my C/C++ preprocessor coroutine system, this is perfectly possible. In that system, all the persistent variables of the coroutine – including the state variable that says where to resume from next – have to live in an explicitly declared structure (in C) or be members of a class (in C++). Either way, there’s no difficulty with making an exact copy [...]

> After you do that, you’ve got two copies of the coroutine, and each of them will resume from the same part of the code when it next runs [...] This isn’t a deliberate feature of my preprocessor system; it’s just a thing that drops out naturally from the implementation strategy

Resuming from an on-disk copy seems tougher - you need to supply all relevant execution context.
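
Concretely, that structure-based style looks something like the toy sketch below (not the article's actual macros; the counter_co name and the -1 "finished" convention are invented for illustration). All persistent state, including the resume point, is plain old data, so copying it, putting it in an array, or writing it to disk is unremarkable.

    /* All persistent state in a plain struct: known size/alignment at
       compile time, stack- or arena-allocatable, arrayable, copyable,
       and serializable as long as it holds no pointers or handles. */
    struct counter_co {
        int state;  /* where to resume from next */
        int i;      /* a "local" that survives across yields */
    };

    /* Returns the next value, or -1 once the coroutine has finished. */
    int counter_next(struct counter_co *co) {
        switch (co->state) {
        case 0:
            for (co->i = 0; co->i < 10; co->i++) {
                co->state = 1;
                return co->i;
        case 1:;  /* resume just after the return */
            }
        }
        co->state = 2;
        return -1;
    }

memcpy-ing the struct gives you two independent coroutines that resume from the same point, exactly as the article describes, and fwrite/fread is the crude version of saving one to disk and picking up after a restart - minus the execution context problem mentioned above.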


Thank you for your ideas and thoughts.

This might be relevant - I've been playing around with some assembly to unwind the stack, but it occurred to me I don't need to pop the stack to scan through it. So, like C++ exception handling (I learned about it from the Itanium C++ ABI) or algebraic effects, you can scan memory if you have access to the stack start in memory (I do that by storing rsp somewhere in .global main); in theory it's just data.

I need to generate sections of lookup data for range information for associating .text code section addresses with function names.

In theory this would also be useful for coroutines since a coroutine position/state is just a program counter position of code that you can JMP to in your yield function (that isn't a call but an offset)

To move a coroutine from one thread to another, or to another machine over the network, or to persist it to disk - let me think. We could do what C++ coroutines do and have a promise struct object that is presumably on the stack when a coroutine resumes by jumping to that coroutine's location.

I think the hard part is being stackless and persisting the current coroutine state. You could mov $CURRENT_POSITION_COMPILER_DETERMINED_OFFSET into -10(%rbp) that promise object and then when the coroutine resumes it does a JMP COROUTINE_BODY(%rip) + -10(%rbp) in a label before the coroutine body.

I am a beginner to assembly programming but here is my program: https://github.com/samsquire/assembly/blob/main/stackunwind....


You can do 1-4 with Zig frames, but 5 and 6 are tougher since there's no guarantee that you won't end up with internal pointers.


The easiest solution to this would be a "relocatable frame" type that forces the compiler to throw an error if it would otherwise have to spill an internal pointer at a suspension point.


Lua makes coroutines first class citizens. I have only built one toy project with it, but it was very enjoyable. Someone even managed to implement Erlang-style concurrency on top of Lua's coroutines. That was a lot of fun to play with. (/stares into the night, reminiscing/)


If you like Lua coroutines and C++20 coroutines, I have written a piece of code that allows combining both: https://github.com/CM4all/libcommon/blob/master/src/lua/CoAw...

It's used for example by our "myproxy" project: https://github.com/CM4all/myproxy/blob/master/src/Connection... (Ctrl-F Lua::CoAwaitable) - a Lua coroutine is launched from within a C++20 coroutine and the C++20 coroutine awaits the Lua coroutine. Everything integrated in a non-blocking I/O event loop.


I am not a C++ person, but that does sound nice. I'm just not ready for the kind of commitment it takes to dive into C++. ;-)


I think coroutines are one of those abstractions that are easy to fall in love with, but in reality they do more harm than good. There are extremely few circumstances where coroutines are useful, and those are not the ones most people use them for.

Coroutines are not cheap threads, and if they are used as such, your software will end up either leaking resources or contending on resources under any considerable load. A very simple demonstration: yes, you can have a million coroutines, but you have 10 database connections. No matter how smart you make the scheduler, you will be bound by your IO resources in almost all applications. (This also hints at the few instances where coroutines can be useful: pure compute.)

The above problem is exacerbated by coroutines literally undoing one of the most powerful abstractions for resource management: RAII. Your resource allocations stop being tied to lexical scope, and a misplaced yield will leak the resource very easily. And even if you know about this, it is quite difficult to do correct resource management or provide proper backpressure, and the resulting code will be unintuitive and difficult to maintain. You'll need to introduce explicit queues/semaphore-like structures in places where a simple threadpool would have sufficed for all your resource management needs.

So what are the examples where coroutines are useful? Pure compute. This is very, very rare, but for example one could use coroutines to fuse stream processors and create bounded-memory compute. There was also one paper that described how software transactional memory was used to essentially race coroutines to find optimal circuit layouts. That kind of compute is about it. And this is most definitely not what people are using coroutines for. Instead, they throw around "threads are expensive" and treat Rust async like Java futures, completely missing the point.

Sidenote: the particular way that Rust async was adopted, and specifically the tokio family of libraries is a whole other hellhole which sidetracked the entire Rust ecosystem, but admittedly this is not the coroutine concept's fault.


Coroutines and concurrency are orthogonal. Coroutines can be used as a primitive in concurrent programs, but the core concept is that of a suspended computation; think replacing callbacks with 'async', not replacing threads with coroutines.

For example, coroutines work great in GUI code because typically you have an event loop, and you want to schedule work to return a value (or send an event) later on during some future execution of the loop. This can all be done on a single thread without bringing concurrency into the mix.

As far as resource contention, I can provide a couple of counter-examples. In Kotlin, if you know you are performing I/O-bound work, you use the "IO" scheduler, which uses an unbounded thread pool to block a thread per-task. In Java, Project Loom makes this altogether unnecessary by yielding IO-blocked virtual threads for you.


You’re confusing concurrency with parallelism. A single-threaded event loop a la node is indeed concurrent. Multiple threads a la the Kotlin IO pool you mention would be both concurrent and parallel. Concurrency isn’t about different parts of the code actually executing at the same time, merely about being able to structure and think about code as if it were executing that way.


I'm not talking about whether the scheduler can block the task on IO. This is trivial.

Let's see a couple of examples.

Database connection contention:

    async function() {
      let connection = getConnection().await; // Allocate resource
      connection.execute("SELECT 'Hello World!'").await; // <- yield to the scheduler, connection is held onto until continuation is scheduled
      connection.close().await; // Release resource
    }
This is an extremely trivial example, and this already leaks resources. If you launch 100 of these then there is a chance that there will be 100 open connections at the same time. Had you used a simple thread pool, the size of that pool limits the number of open connections.

Memory leak:

    async function() {
      let object = stream.readLargeObject().await; // <- Memory allocated
      process1(object).await; // <- Memory leaked to scheduler until continuation is scheduled
      return process2(object).await;
    }
Again, a very trivial example, already leaking. process1 and process2 are not tied together with direct control-flow edges; instead the flow dispatches through the coroutine scheduler, and they hold onto the resources while sitting in the scheduler's bookkeeping. Again, yields disrupt the allocation's scope, and if you launch a lot of these coroutines you'll run out of memory.

This issue is made worse if you need to use combinations of resource allocations. For example DB connection + memory allocation or network + disk or even network+network are common combos. Depending on the complexity of the allocation nesting you can run into very nasty contention issues or even deadlocks, where - again - a threadpool would have worked well. N threads, N number of resources. Done.

I want to stress that there are ways to mitigate the above (queues/resource pools/semaphores), however they are not nearly as intuitive as using a threadpool, and you need to be constantly aware of these leaks when you're writing async code.


> I'm not talking about whether the scheduler can block the task on IO. This is trivial.

The whole point of coroutines is that they are trivial to block and resume, right? So the way to deal with limited resources... is to block on them.

So when you need a connection, you await until one from the pool becomes available. From the point of view of the consumer it's very natural because grabbing a connection is just a standard async call, and returning it to the pool can be done the same way you would usually close it. If anything, it's a lot easier to manage them like this because unlike in a non async program you don't have to worry too much about blocking progress.


That's not my point. The issue is that when the blocking happens, with coroutines control is yielded to the scheduler which will now schedule other tasks. Those tasks may again request and block on resources. This is where the leak is coming from. A resource pool is one way to get around this, however this stops working if you have several kinds of resources.

On the other hand, with threads the IO block is a "proper" block. No new task will be scheduled, the thread will only continue when the IO operation finishes, providing a very natural backpressure mechanism that prevents overallocation/contention.


Is it perhaps a bit more nuanced?

Green threads à la Go with channels would be a way to deal with several communicating "event loops" at the same time via coroutines, hence the name goroutines?


Your understanding of the use of coroutines is much superior to that of the parent post.

Think "cooperating sequential processes". If you don't have that, coroutines are likely not fit for the problem.


I believe coroutines are mostly unrelated to RAII. One can still have coroutines in a single ownership world and guarantee destructors are called.

(If not, I have a lot of redesigning to do for Vale!)

Also, doesn't your mentioned problem also happen for full threads? I'd imagine they'd contend for those 10 connections no matter how you did your concurrency.


They do happen with threads indeed, but it's a lot easier to manage and the code is straightforward. You can essentially tie your resources to your threads, and your threadpool's size becomes your resources' allocation limit. In fact, one common strategy is to use thread-local storage to cache certain resources like RNGs or regex state machines/parsers, and you essentially have a guarantee that you'll be fine (of course, as long as the code doesn't launch threads left and right).

Again, I'm not saying that you can't manage this with coroutines, but it is harder, and I've seen this problem pop up in very different contexts/languages (Kotlin/Java, Rust, Haskell) whenever coroutines are used. Large churn or an unexpected bottleneck (e.g. the network gets saturated or disrupted) causes a blowup of suspended coroutines that in turn leak resources without applying proper backpressure. When this happens, coroutines tend to be harder to debug as well.

Plain threads work for disk IO, network IO, memory usage etc etc, and the resource allocations also compose well. E.g. classic flipped order resource alloc: coroutine 1 allocates bounded resource A, then tries to allocate B and yields. Coroutine 2 allocates B, then tries to allocate A, bam, deadlock once A's resource pool is depleted. Essentially suspended coroutines have hidden dependency edges on one another through resources. With threads, the thread pool size aligns precisely with the resource pool size, under high churn it will be the resource's actual slowness that blocks the pool temporarily, but there's no way to deadlock on resource allocations.


> With threads, the thread pool size aligns precisely with the resource pool size, under high churn it will be the resource's actual slowness that blocks the pool temporarily, but there's no way to deadlock on resource allocations.

I'm not sure I see the argument that that's definitely the case. I mean I agree it could be the case.

At a previous job we did lots of "enterprise Java" type stuff. Spring Boot (tends to hide what's going on a bit), Tomcat (fixed number of threads to process incoming HTTP requests, I think 200 by default), database connection pool (I think 10 by default in the connection pool that Spring chooses by default).

The average programmer on this team did not know about any of these pool sizes. They just wrote code and, for the most part, it worked.

But then there was the situation where, for example, one test hung on my machine, but on nobody else's. Turns out (if I remember correctly) that stream.parallel() was used, which processes things on a default thread pool, which is sized to the number of CPU cores by default. My machine had 20 cores, other people had 8 or 10 cores. So they were processing fewer items at once. On my machine this then requested more connections simultaneously than were available in the database connection pool, and, I think due to having locked some other resource (again not really obvious if you read the code), it then deadlocked on my machine only. As you can imagine, it took me a whole afternoon to diagnose this!

So what I'm saying is, I agree with everything you've said, but I think these problems can happen just as easily with threads, at least the way they're commonly used in e.g. Java Spring Boot.


Hah, sounds like a fun debugging session!

Those are exactly the kinds of problems I've encountered with "too many processors for too few resources". At the workplace where we used Java we used a library called Quasar which implements green threading resembling Rust async (it rewrites the bytecode into a state machine). I remember encountering a very similar deadlock, except the issue was caused by certain green threads "handing over" database connections to other green threads, and in the process yielding to the scheduler. Under high churn there was a chance that all connections ended up in the suspended set, causing a deadlock when other tasks were trying to allocate. It took a couple of days to track down because attaching a debugger and even printing caused the issue to go away.

Your example is also a fun one, but to me it actually shows exactly why an unbounded/dynamic number of processing units is an issue. Coroutines are the extreme example, where you are almost encouraged to launch as many tasks as you can.


The one big issue I have with coroutines is that they are often impossible to serialize. With a state machine you can just take your state and write it to disk, but with coroutines that state is hidden in the stack of the coroutine and the stack is not a first-class data structure you can access by normal programming language means.


Stackless coroutines are arguably pretty serializable, since they are (usually) implemented by bundling all the relevant data into a struct. There's no a priori reason that couldn't be serialized, though I admit I can't think of an example where I've seen it done.


We use it at work to send FSMs over the network for distributing tasks. Admittedly we always do this with continuations that haven't started yet, but there is no technical reason we couldn't serialize the already started ones (although you'd have to make sure you also send the coroutines you are currently awaiting on.)


> I talk about generators as if they were the coroutine system in Python, but in fact, they’re one of two: Python also has a completely different kind of thing that it actually uses the name ‘coroutine’ for, in which function definitions start with async def and suspend with await.

The async and await system of coroutines in Python is in fact layered on top of the yield system, via an intermediate construct called yield from—which isn’t even technically required, it’s just that implementing it as a loop passing values through from and to the subcoroutine is exceptionally tedious when done carefully[1] and can’t be abstracted as a function call.

As a first approximation, await x in Python is equivalent to yield from x.__await__(). It should have been completely equivalent, arguably, but history got in the way.

David Beazley has a talk[2] where he walks you through implementing an async runtime (to put it pompously) and you can see how the whole thing hangs together. It’s really quite simple, and in my experience it’s much better than trying to wrap your head around the asyncio docs, because while the language-level foundations are extremely nice and simple, asyncio the stdlib module just kind of sucks.

[1] https://peps.python.org/pep-0380/#formal-semantics

[2] https://www.youtube.com/watch?v=Y4Gt3Xjd7G8


Thank you for this article. I also really like coroutines but how they manifest in codebases can sometimes be hard to follow (async Rust, async Python).

I have played around with protothreads and use generators in Python but Marce Coll has an amazing blog post about coroutines in assembly, which is down right now.

If you save the RSP register somewhere and restore it later, you can switch stacks.

I'm currently working on an architecture to unify coroutines and threads. I am taking inspiration from "Bulk synchronous parallel programming" in that I am using barriers. I hope to apply structured concurrency to it too. A coroutine can be remote in a thread or in a different current thread and parallel.

I am curious about Stackless Python.


If you literally switch stacks, wouldn't that be a (user space aka green aka lightweight) thread?


Switching stacks isn’t the difficult part, it’s the related stack memory management.


Could you elaborate?

The previous coroutine state needs to be recovered onto the stack and registers for the coroutine to continue correctly from its previous point. Temporal memoizes API calls and replays the same function deterministically, so it doesn't attempt to jump halfway into a function, which a coroutine could do.


Switching stacks is just longjmp().

Memory management for stacks has tricky questions about how much address space to reserve for each stack and how to ensure there are guard pages to catch stack overflows.

Or you can YOLO it https://dotat.at/@/2010-01-22-coroutines-in-less-than-20-lin...
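
For a concrete (if simplified) illustration of that split, here's a minimal sketch using POSIX ucontext rather than raw setjmp/longjmp: the switch itself is one swapcontext() call, while the caller has to pick a stack size and, unlike this sketch, arrange guard pages.

    /* Minimal stackful coroutine with POSIX ucontext. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;

    static void coroutine_body(void) {
        puts("coroutine: step 1");
        swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
        puts("coroutine: step 2");
    }                                      /* returning resumes uc_link */

    int main(void) {
        enum { STACK_SIZE = 64 * 1024 };   /* guessed size, no guard page */
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = malloc(STACK_SIZE);
        co_ctx.uc_stack.ss_size = STACK_SIZE;
        co_ctx.uc_link = &main_ctx;        /* where to go when the body returns */
        makecontext(&co_ctx, coroutine_body, 0);

        swapcontext(&main_ctx, &co_ctx);   /* run until the first yield */
        puts("main: between yields");
        swapcontext(&main_ctx, &co_ctx);   /* resume to completion */
        free(co_ctx.uc_stack.ss_sp);
        return 0;
    }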


I admit that coroutines can make the code simpler to write and also to understand; in some cases they can feel like a natural replacement for complex state machines.

But I am not in favor of "magical" abstractions, especially anything related to hidden control flow. I don't have much experience with coroutines, but I have the intuition that tracking down bugs can be nasty.


I'm totally on board with fewer magical abstractions. But I would consider coroutines to be at the same layer of abstraction as simple procedures.

Now, you can build coroutine systems on top of procedural semantics (and trivially vice-versa) — indeed this is how many coroutine implementations are built under the hood — and since we're so used to procedural programming it's natural to assume that coroutines are a magical abstraction layer on top of procedures.

But from a strictly semantic perspective (disregarding implementation), I think coroutines are best thought of as their own distinct class of non-procedural control flow, as generalisations of procedures.

Being generalisations, coroutines provide fewer restrictions, but also fewer guarantees than procedures. So I don't think they're appropriate for all code. Specifically, coroutines kill temporal proximity guarantees: if everything is a coroutine, everything can suspend at any time (to be fair, any procedure can fail to return, but this is rarer). It's similar to exceptions in that respect.

So your point about tracking bugs is extremely relevant. With great power comes great responsibility. The current approach is to isolate coroutine functionality from the procedural core (either by very explicitly making them objects like Python generators, or by colouring them and requiring them to be executed in a context like in Kotlin). But I don't think they are quite as magical as they seem, rather just different.

(Tangential note: I've always thought it would be interesting to have a language that flipped this, and embedded procedural guarantees within an all-coroutine (or even all-continuations) semantic structure.)


Perhaps with experience with coroutines in the right problem environment you might have a different idea.

There isn't really anything magical about them. They are a good way to build cooperating sequential processes.

My experience is that coroutines not only made it easier to find bugs in interrupt driven code, but it made it easier to avoid them in the first place, as the code in front of you was much easier to understand.


I first used coroutines in the 1970s in Sigma 5 assembler for an application that was in an interrupt-rich environment.

Once we tuned our approach, using a macro or two, the code for a process attached to an interrupt could hardly have been clearer. As a result, we were able to build a program to dial out and send text (this was in the days when the data modem and the dialing modem were separate) with no multi-thread errors and only one single-thread error.

This was significantly easier than doing coroutines in C, which is allegedly close to the machine.

In a later project, we discovered that BLISS-36 had a built-in coroutine feature, but we didn't see any advantage to using it in building a compiler.


This takes me back more than 20 years, to when I first started to use PuTTY and learnt about Simon Tatham. Incidentally, I learnt about coroutines at the same time from another source and later learned that PuTTY used them!


When the author here speaks of coroutines vs explicit state machine, by the latter they are referring to doing things like checking if `currentState == State.LOADING` or setting `currentState = State.COMPLETE`?


Mostly the second one. You’ll have a single bit of boilerplate checking the current state and jumping to the right bit of code; but when you’re actually implementing the core logic of your state machine, you’ll be doing lots of “state = do_the_next_thing; return” instead of just calling “do_the_next_thing()” directly.
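
A toy sketch of what that tends to look like (names and states invented for illustration):

    /* Explicit state machine: the "program counter" is a field you update
       by hand; each step records where to go next and returns to the
       event loop instead of calling the next thing directly. */
    enum task_state { LOADING, PROCESSING, COMPLETE };
    struct task { enum task_state state; int data; };

    static void step(struct task *t) {
        switch (t->state) {
        case LOADING:
            t->data = 42;            /* pretend we loaded something */
            t->state = PROCESSING;   /* "state = do_the_next_thing; return" */
            return;
        case PROCESSING:
            t->data *= 2;            /* pretend we processed it */
            t->state = COMPLETE;
            return;
        case COMPLETE:
            return;                  /* nothing left to do */
        }
    }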


Thanks, when I said "or" there I meant to provide just another example of a state-machine-type implementation, so they weren't to the exclusion of each other; they were both examples of an explicit state machine. Thanks for your point, though, about "state = ...; return" vs. calling the next thing directly.


Poor man's delimited continuations.


But stackful coroutines and (one shot) delimited continuations are equivalent in power and expressivity.

Unless you are referring to stackless coroutines or multi-shot continuations.


Never heard of one-shot delimited continuations; it sounds like a euphemism for "broken". Delimited continuations are functions; you can call them as many times as you like, with different arguments. They can call themselves recursively too.

Of course when I say "delimited continuation", I mean "delimited continuation" not "broken delimited continuation".

(If delimited continuations are used for simulating coroutines, it is my experience that they end up being called only once; each continuation invocation performs a time slice to the next yield. That's a particularly simple use of continuations, not a type.)


Could you expand?

I have a limited understanding of continuations and never used them myself.


Yep. Delimited continuations and proper tail calls give you a superpower. (Even PTC alone...)


https://github.com/socketry/async uses coroutines and I think in general it’s been a great model with very few downsides in practice.


Coroutines are great in a single-threaded environment. But if you mix them with threads - even behind the scenes - there will be dragons. Erlang solves that problem with proper lightweight isolated processes.


Setting up pipelines seems fairly easy to do in JavaScript using async/await. Unfortunately, the Streams API is complicated and so are Node’s streams. I ended up inventing my own.



