I think that the biggest missing "killer feature" for systems programming languages is the ability to treat stackless coroutines as just another "plain old data" struct. Here's a list of things I can do with structs that I can't do with C++ coroutines or Rust generator objects (much to my chagrin):
* Get their size/alignment at compile time
* Allocate them on the stack
* Allocate them in a memory region of my choosing
* Create a contiguous array of them
* Copy them
* Serialize/deserialize them
Imagine being able to save coroutines to your filesystem, reboot your machine, restart your program, and have it pick up where it left off. Or migrating lots of actively running coroutines to another machine by serializing them and sending them over the network. Or creating a high-performance, high-concurrency event loop using a giant map of coroutine objects.
Scala async coroutines build FSMs that are just standard objects. They aren't very commonly used, but the same approach could probably be adapted to C.
Basically, the compiler compiles each async function to a very simple FSM object, where each local variable that is assigned and lives across a yield point becomes a member of the FSM object. This is done after the transformation to a simple normal form, at which point the function basically looks like bytecode. The compiler generates a new class for every async coroutine.
The generated FSM classes are statically sized so they could in principle be stack allocated, and their instances can be manipulated like any other JVM object.
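To make the FSM idea concrete, here is a rough hand-written sketch in C of what such a transformation could produce for a loop that yields squares. The names (`squares_co`, `squares_start`, `squares_resume`, `run_all`) are made up for illustration; this is not what the Scala compiler actually emits, just the same shape: a state field plus the locals that survive a yield.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical frame for:  for (i = lo; i < hi; i++) yield i * i;  */
typedef struct {
    int state;  /* 0 = not started, 1 = suspended at the yield, 2 = done */
    int i;      /* local that lives across the yield */
    int hi;     /* argument that lives across the yield */
} squares_co;

/* Size and alignment are ordinary compile-time constants. */
static_assert(sizeof(squares_co) == 3 * sizeof(int), "frame has no hidden fields");

static squares_co squares_start(int lo, int hi) {
    squares_co c = { .state = 0, .i = lo, .hi = hi };
    return c;
}

/* Runs until the next yield. Returns true and writes *out on a yield,
 * returns false once the coroutine has finished. */
static bool squares_resume(squares_co *c, int *out) {
    switch (c->state) {
    case 1:                 /* resumed after the yield: finish the loop step */
        c->i++;
        /* fall through */
    case 0:                 /* loop head */
        if (c->i < c->hi) {
            *out = c->i * c->i;
            c->state = 1;
            return true;
        }
        c->state = 2;
        /* fall through */
    default:
        return false;
    }
}

/* Round-robin over a contiguous array of frames; returns the sum of
 * everything they yielded. */
static long run_all(squares_co *cos, size_t n) {
    long sum = 0;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (size_t k = 0; k < n; k++) {
            int v;
            if (squares_resume(&cos[k], &v)) {
                sum += v;
                progressed = true;
            }
        }
    }
    return sum;
}
```

Since `squares_co` is a plain struct, a local `squares_co pool[2] = { squares_start(0, 3), squares_start(1, 4) };` lives on the stack, and `run_all(pool, 2)` interleaves both coroutines without any allocation.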
At $work we have a system using scala async that has all of these features (arrays of FSMs, serialization, distribution by sending over the network, etc.). Of course there are complications (file handles, for example), and it's easier with a GC.
Maybe you missed that the article explicitly talks about the cloning/deep copy of coroutines:
> Using my C/C++ preprocessor coroutine system, this is perfectly possible. In that system, all the persistent variables of the coroutine – including the state variable that says where to resume from next – have to live in an explicitly declared structure (in C) or be members of a class (in C++). Either way, there’s no difficulty with making an exact copy [...]
> After you do that, you’ve got two copies of the coroutine, and each of them will resume from the same part of the code when it next runs [...] This isn’t a deliberate feature of my preprocessor system; it’s just a thing that drops out naturally from the implementation strategy
Resuming from an on-disk copy seems tougher - you need to supply all relevant execution context.
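For what it's worth, the copy described in the quote, and the easy half of the save/restore problem, can be sketched in a few lines of C. This is my own toy example, not the article's macro system, and `fib_co`, `fib_next`, `fib_save` and `fib_load` are hypothetical names. The raw byte image is only meaningful to the same build on the same ABI, and it captures nothing outside the struct (no file handles, no pointers to other objects), which is exactly the "execution context" that makes the on-disk case harder.

```c
#include <assert.h>
#include <string.h>

/* All persistent state of the coroutine, including where to resume. */
typedef struct {
    int state;            /* 0 = not started, 1 = suspended at the yield */
    unsigned long a, b;   /* locals that live across the yield */
} fib_co;

/* Yields 0, 1, 1, 2, 3, 5, ... one value per call. */
static unsigned long fib_next(fib_co *c) {
    assert(c->state == 0 || c->state == 1);
    if (c->state == 0) {
        c->a = 0;
        c->b = 1;
        c->state = 1;
        return c->a;
    }
    unsigned long next = c->a + c->b;
    c->a = c->b;
    c->b = next;
    return c->a;
}

/* Byte-for-byte image of the frame. buf must hold sizeof(fib_co) bytes. */
static void fib_save(const fib_co *c, unsigned char *buf) {
    memcpy(buf, c, sizeof *c);
}

static fib_co fib_load(const unsigned char *buf) {
    fib_co c;
    memcpy(&c, buf, sizeof c);
    return c;
}
```

A plain assignment (`fib_co fork = original;`) gives two coroutines that both resume from the same point and then diverge independently; `fib_save` followed by `fib_load` does the same thing through a byte buffer.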
This might be relevant: I've been playing around with some assembly to unwind the stack, but it occurred to me that I don't need to pop the stack to scan through it. As with C++ exception handling (which I learned about from the Itanium C++ ABI) or algebraic effects, you can scan memory if you have access to the start of the stack (I get that by storing rsp somewhere in .global main). In theory, it's just data.
I need to generate sections of lookup data for range information for associating .text code section addresses with function names.
In theory this would also be useful for coroutines, since a coroutine's position/state is just a program counter position in code that you can JMP to from your yield function (which isn't a call but an offset).
To move a coroutine from one thread to another, or to another machine over the network, or to persist it to disk, let me think. We could do what C++ coroutines do and have a promise struct object that is presumably on the stack when a coroutine resumes by jumping to that coroutine's location.
I think the hard part is being stackless and persisting the current coroutine state. You could mov $CURRENT_POSITION_COMPILER_DETERMINED_OFFSET into -10(%rbp) in that promise object, and then when the coroutine resumes, it does a JMP COROUTINE_BODY(%rip) + -10(%rbp) from a label before the coroutine body.
The easiest solution to this would be a "relocatable frame" type that forces the compiler to throw an error if it would otherwise have to spill an internal pointer at a suspension point.
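No compiler I know of has such a type, but the property it would enforce is easy to show by hand. A sketch, with hypothetical names (`reader_co`, `reader_start`, `reader_next`, `reader_move`): the frame keeps an offset into its own buffer rather than a pointer into it, so a byte copy to a new address is still a valid frame. Had `pos` been a `char *` into `buf`, the moved copy would still point into the old storage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char   buf[16];  /* data owned by the frame itself */
    size_t pos;      /* offset into buf, not a pointer: survives memcpy */
} reader_co;

static reader_co reader_start(const char *text) {
    reader_co c = { .pos = 0 };                 /* buf is zero-filled */
    strncpy(c.buf, text, sizeof c.buf - 1);     /* last byte stays '\0' */
    return c;
}

/* Yields the next character, or '\0' once the buffer is exhausted. */
static char reader_next(reader_co *c) {
    assert(c->pos < sizeof c->buf);
    char ch = c->buf[c->pos];
    if (ch != '\0')
        c->pos++;
    return ch;
}

/* Relocates a suspended frame and scribbles over the old storage to
 * simulate it being reused. Only dst may be resumed afterwards. */
static reader_co *reader_move(reader_co *dst, reader_co *src) {
    memcpy(dst, src, sizeof *dst);
    memset(src, 0xAA, sizeof *src);
    return dst;
}
```

After `reader_move`, resuming the destination continues from where the source stopped even though the source bytes are gone, which is the guarantee a "relocatable frame" check would give you at compile time.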