Hacker News new | past | comments | ask | show | jobs | submit login
Go-style concurrency in C (libmill.org)
203 points by Jerry2 on Nov 18, 2015 | hide | past | favorite | 83 comments



This is interesting, but I don't really understand the differences with the existing coroutine libraries like libtask, libconcurrency or libcoro.

Of course, it uses the longjmp/setjmp trick, as suggested by Dmitry Vyukov.

http://www.1024cores.net/home/lock-free-algorithms/tricks/fi...

However, I'm not sure of the benefit of this approach, compared to writing a few lines of stack switching assembly code per architecture. Especially, how portable are the implementations of longjmp and setjmp across the multitude of C libraries and operating systems?

But more importantly, it doesn't seem to solve the two main issues of coroutines libraries written in C that were already solved in Go:

- How does the stack can grow or shrink? (specifying the stack size in the worse case scenario doesn't scale well)

- How can a coroutine be moved to another operating system thread and back? (useful when blocking or when parallelism is needed)


This is really dang cool. I started a similar project[0] but have tabled it for now, since I'm not sold on the usefulness of making C easier to use this way.

The way I see it, there are two ways to use a library like this that are worth considering: adapting existing projects, and starting new ones. In my opinion, the gains to be had from rearchitecting an existing project to use such a library are minimal, and for the vast majority new projects, C is not the best choice these days. If you want the performance of C with the programming style of Go, you can probably get away with just using Go!

That said, it's nice to see a polished, working implementation of this idea. Definitely something worth being aware of.

[0] https://github.com/aji/libco


> "Libmill is intended for writing single-threaded applications."

You can do concurrency on a single thread. This is the message I find it hard to get across when explaining CSP concurrency libraries.

Some people argue about usefulness of such libraries as being not native and leaky abstraction. I developed Go-style concurrency "port" to Tcl:

https://github.com/securitykiss-com/csp

and it is one of few developed-in-house things that became the productivity booster in my toolbox.

The interesting thing about it is that thanks to the expressive power of Tcl I was able to port not only the semantics but also almost completely mimic Go syntax.


I don't use Tcl but that looks nice.

> You can do concurrency on a single thread. This is the message I find it hard to get across when explaining CSP concurrency libraries.

Heck, there's an entire language built around single-threaded concurrency! Compiles to C, coincidentally - it targets embedded platforms. It's not quite CSP style though, it takes after Esterel and uses synchronous reactive concurrency:

http://ceu-lang.org/

However, as soon as you have multiple Céu programs communicating (for example, multiple Arduino's) the program-to-program communication is fairly CSP-like.


Don't forget Javascript.


There's more to report than simply an awesome library. Two days ago, someone allegedly unfamiliar to the project (Paul Banks) appeared from the ether, applied a few "simple" patches, and improved the performance of libmill 7 fold (for OSX at least). That just doesn't happen, ever, in open source.

https://twitter.com/sustrik/status/666360179802402816


The twitter message you linked to says this:

    The speedup was caused by replacing setjmp by _setjmp though.
I know this may sound mysterious, but for BSD/OSX... "yes". It is "well known" (if you're the sort of person who writes coroutine runtimes [1]) that _setjmp is far faster than setjmp. It's morally equivalent to 'sigsetjmp', but more portable, if less standard.

[1] https://github.com/jaroslov/coro/blob/master/coro.hpp


Martin Sustrik, the author of libmill was also interviewed in FLOSS Weekly podcast a month ago - https://twit.tv/shows/floss-weekly/episodes/358


From an earlier discussion https://news.ycombinator.com/item?id=9947749:

> If I understand correctly, this is a framework for non-preemptive cooperative multitasking whereby provided API functions serve as yield points. Pretty much what Windows API was before NT/2000. Not exactly a co-routine library, at least not in a conventional definition of co-routines.

> Yes, that's exactly what it is. Yes, that's exactly what it is. But it also matches wikipedia's difinition of coroutine: "Coroutines are computer program components that generalize subroutines for nonpreemptive multitasking, by allowing multiple entry points for suspending and resuming execution at certain locations."


All coroutine libraries are non-preemptive and cooperative. E.g. in Haskell, threads will switch only on GC allocations. In Go, they only switch on I/O (and maybe function returns). In theory, the compiler could automatically add `yield` statements into non-allocating, non-I/O loops. But nevertheless, all user-space threading is cooperative.


> But nevertheless, all user-space threading is cooperative.

Not necessarily. You can do user-space pre-emptive threading on any system that support pre-emptive signal handlers and that lets you mess around with the stack, particularly if you can generate signals from timers and/or IO operations (both of which you can with POSIX timers)


Then it's not really user-space, is it? You're still paying for a kernel context-switch or two.


It is user-space threading if the scheduler is in user-space.


I disagree. If it's the kernel who's interupting threads, it's not user-space threading. The scheduler just decides which thread gets run next, but the kernel decides when it gets run.


Technically it is not the kernel, it is the hardware clock.

At some point any user space scheduler will switch to the kernel in an event demultiplexing syscall and potentially block there if no user space thread is runnable. By your definition any threading system which doesn't handle all IO purely in user-space can't claim to be user-space threads.


But that is when the userspace must (and wants) to communicate with the kernel. Ultimately, that's the programmer's choice. You could easily write a program in Haskell or Go with two lightweight threads that would keep sending ping/pong messages to each other, and all the waiting (on locks, channels, etc.) and switching would be implemented in userspace.

I don't know enough about userspace/kernelspace communication, maybe I/O (refilling buffers or calling select) and timers are much more lightweight than full thread context switches, but I'm guessing it's still much slower than pure userspace scheduling.


How is a clock signal different? It is just another external event userspace wants to be notified about.

Note that on Unix you can use signals to get IO readiness notification and in fact for sometime on Linux (before epoll) realtime signals were the preferred method to do get notification over a large amount of fds.

Anyway, as a general guideline, user/kernel transition is cheaper than a (kernel) thread/thread transition which is in turn cheaper than a process/process transition.


Agree that coroutines are inherently non-preemptive and cooperative, but not all user-space threading is cooperative.


I haven't looked at the source yet so I don't know what the implementation is like, but this makes me wonder if the author has read this article: http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

Of course, doing it in portable C is going to be much slower than raw stack-switching style.


Sure I did. It's a classic.


Someone needs to make a tiny C lib that contains just the stack switching stuff (basically whatever is used as a substitute for ucontext), that the various coroutine libs can share. I like all these coroutine libs (libtask, lthread, libmill, etc), but it makes me uneasy to have non-portable assembly code lying about that might go unmaintained.


There are oh so many libraries for that:

https://github.com/baruch/libwire/wiki/Other-coroutine-libra...

And if you don't like any of those, my next choice would be to use the implementation from Luajit 1.x and 2.x, called Coco but not packaged into a separate library:

http://coco.luajit.org/portability.html


Cool! I wasn't aware of those low level libs. Now I should submit a patch to lthread to use fcontext. >:)


I didn't see any assembler code in libmill, nor does it use ucontext. Not sure what you are talking about here?


Ah, I forgot libmill uses setjmp/longjmp. I'm not sure how that really works for coroutines though since you still need to deal with stack allocations. So replace "non-portable assembly" with "possibly non-portable C hacks". I haven't gotten around to digging deep in the code yet though. If libmill's approach is future-proof then that'd be pretty awesome.


It's a hack, but one that's relatively benign: allocate stack using malloc, move sp using variable-sized array.


Is there any big difference from this vs https://swtch.com/libtask/ , which is a port of much of the same concept from Plan 9 - which again ties back to Go.


It's basically the same thing, but simpler and more familiar (if you dealt with Go, that is) API.


    It can execute up to 20 million coroutines 
    and 50 million context switches per second.
Is entirely hardware dependent. Also seems to disregard the fact that certain other things are much easier in go than C.


Language features very rarely get locked down to specific languages. Go doesn't own green threads... after all the term "green threads" predates Go. Nor does it own CSP, and it wasn't the first implementation. Also witness the number of languages that got hooked up to the various kernel-poll event loops over the past 10 years. Nowadays labeling a language as "supports event-based polling" is right up there with saying it supports closures... of course it does, it's the bare minimum of what anybody will look at anymore.

The things that distinguish languages are typically in other dimensions.


Not sure what your point is. That there's some argument about performance? There is no way that Golang can keep compete with performance of straight C code.


6 year old compiler vs 40 years of research in optimizing compilers.


So....what is that `choose` thing? Curious how that's implemented...


I wrote a blog post about it: http://250bpm.com/blog:56 Also, a blog post about how coroutine spawning is done: http://250bpm.com/blog:48



author is Martin Sústrik, who is also the creator of ZeroMQ, nanomsg, and ribosome

that's quite a resume


Can anyone familiar with this say if it runs one coroutine per thread? How does a coroutine get it's own stack? If coroutines share threads, is it possible to switch in between a coroutine say when it's blocked on IO like it is in go?

I'm not an experienced go developer but my understanding is that it's possible in go to have millions of goroutines each blocked on IO. I don't see how a C function could mimic that. I thought stack moving was an essential requirement of any runtime that wants to mimic go in this way.

I've looked at some of the special macros like coroutine and go and it looks like libmill runs the entire function fn when you call go(fn) so if fn is blocked on IO then your entire thread is blocked on IO and you can soon grind to a halt if you have even a few blocked coroutines.


It creates a stack per coroutine and if the coroutine would block, it switches to another coroutine instead.

As for threads, it's single threaded (yet concurrent). If you want to take advantage of multicore machines, there's a step in the tutorial about that: http://libmill.org/tutorial.html#step7


One of the reasons goroutines are considered 'lightweight' is that they allocate a small stack to begin with which is reallocated and moved if more space is needed. Of course, stack-moving isn't possible in C for various reasons.

What does libmill do here? Does it allocate a large stack right off the bat? What happens in a stack overflow, does it segfault or will it just trample over whatever is beneath it?

Edit: an ounce of investigation is worth a pound of answered questions, or something like that. It appears stack sizes are configurable, but default to 256kb [1] (for scale, go defaults to 2k), and are retained in a cache of size 64 [#L72] (by default). Also the bottom page is used as a stack guard if posix and mprotect are available [#L90].

From a cursory glance there appears to be some data races in the stack allocation code, though I could be mistaken. E.g. stack.c#L137

[1]: https://github.com/sustrik/libmill/blob/master/stack.c#L47


Note that large stacks only use address space, not memory, so it's not such a problem (especially on 64-bit).

I've never used libmill but from a glance at the docs it appears they suggest using multiple processes for parallelism, not multiple threads, so there is no such thing as a data race.


I don't think there's a data race, given that the whole thing is single-threaded. Multiple processes should be used for parallelism (see step 7 of the tutorial).

As for the default stack size, it's hard (but doable) to work with smaller stack size in C. For example, humble printf() can allocate a buffer several kB long on stack.


select(), poll(), etc are available in C which enables non-blocking I/O. I dont know how libmill is implemented, but I assume it is using one of those functions for non-blocking I/O.


So it looks like there are facilities for doing some IO, listening on a socket, etc. But I don't see any facilities for waiting on locks, semaphores, etc. Maybe they could build that in a future release or maybe they consider it antithetical to the idea of "share via communicating as opposed to communicate via sharing".


> But I don't see any facilities for waiting on locks, semaphores, etc

There's a very good reason for this. First of all, you can use file descriptors as locks, semaphores, condition variables, etc using pipe(2) and other functions. On Windows you can use WaitForMultipleObjects, etc on Mutex objects.

But select/epoll/kqueue/WaitForMultipleObjects is a system call that goes in to the kernel. That is very inefficient if you want e.g. mutual exclusion using a mutex. In a modern OS, a mutex (or CriticalSection in Windows) is implemented using a futex ("fast userspace mutex"), which is just a spinlock in userspace that only calls to kernel if the mutex is contended. An uncontended futex can be locked and unlocked in less than 20 nanoseconds. A system call can be 100x slower than that.

Some languages, e.g. Haskell have a green threading system that implements mutexes and i/o in a consistent way with i/o using a userspace scheduler. But this depends on language and runtime implementation.


It's single-threaded, so coroutine locks are trivial and other locks "can't block" (against coroutines, anyway).


Sort of. They enable non-blocking network IO. Disk IO in unix always blocks (aio excepted). Disk IO in Windows can by async, but not via select or poll.


Can it take advantage of GCC's split stack support? https://gcc.gnu.org/wiki/SplitStacks


Wouldn't libdispatch be a much better option if you don't mind Apple's blocks extension to the C language?


My understanding is that libdispatch is not a replacement for go's concurrency. With libdispatch each block (closure) that you dispatch runs on a given thread, if that block stalls for any reason then your thread is blocked. So even with a few blocked blocks (heh) your system could be stalled which is not the case with go where you can have millions of goroutines that are blocked.


I've built systems with libdispatch and blocks. I found it much nicer than traditional C style non-blocking IO or typical pthreads, both of which I have also done.

• You don't need to use blocks, there are function callback versions of the calls available, but blocks are nicer.

• There are asynchronous IO routines, that covers most of your stalls.

• Go probably stalls threads on page faults, as does libdispatch.

You can read about the Linux port of libdispatch at https://github.com/nickhutchinson/libdispatch


It looks like the Linux port of libdispatch uses OS threads through pthreads, so it's not a candidate for the user space light-weight threads needed for efficient CSP.


I'm fairly certain that is just to get the worker pool. The blocks on the queue don't each get a thread.


libdispatch is more than just programs submitting blocks to queues – dispatch sources and event handlers can handle channels of asynchronous input as well.


I forgot about that part of libdispatch. You still can't do synchronous IO but you can use libdispatch semaphores, wait on a queue, etc. and that should not block the thread, right?


The docs offer different kinds of sources for automatically enqueueing blocks:

     #define DISPATCH_SOURCE_TYPE_DATA_ADD 
     #define DISPATCH_SOURCE_TYPE_DATA_OR 
     #define DISPATCH_SOURCE_TYPE_MACH_RECV 
     #define DISPATCH_SOURCE_TYPE_MACH_SEND 
     #define DISPATCH_SOURCE_TYPE_PROC 
     #define DISPATCH_SOURCE_TYPE_READ 
     #define DISPATCH_SOURCE_TYPE_SIGNAL 
     #define DISPATCH_SOURCE_TYPE_TIMER 
     #define DISPATCH_SOURCE_TYPE_VNODE 
     #define DISPATCH_SOURCE_TYPE_WRITE 
     #define DISPATCH_SOURCE_TYPE_MEMORYPRESSURE


Sadly it doesn't have support for Windows MSVC yet: https://github.com/sustrik/libmill/issues/16


One feature worth pointing out is debugging support: http://libmill.org/documentation.html#debug


Sorry for the ignorance of a newbie, but I couldn't find any info on whether libmill does anything similar to the M:N multiplexing of light-weight threads upon OS threads?


M:1.


What happens when something blocks for synchronous I/O?


A different coroutine gets a go.


But only if it is using the aproved I/O socket operations from the library. What if it uses an operation from a regular socket librayr or if it runs a for loop or some CPU bound computation?


when using native i/o take care to use it in nonblocking manner. if it would block use libmill's fdwait() to wait while the file descriptor is unblocked. in case of long computations call yield() once in a while to give other coroutines a chance to run.


How do they do that? Their own I/O library with asynchronous everything?


Simple stack switching. See here for a toy example: http://250bpm.com/blog:48


Yes.


    "finally! it's so ridiculously easy now that everyone will start using it and 
    the world will be roses and meatballs from now on!"
said no one ever.

even the plan9 guys who were the first to do it with alef and libthread didn't feel like they'd solved concurrency in C.


If you want Go-style concurrency, why not use Go in the first place?


My primary use case was writing network protocols (the stuff I had to deal extensively when implementing ZeroMQ and nanomsg). On one hand you want ultimate performance, i.e. no context switchers, manual memory allocation et c. On the other hand, you want the convenience of writing linear code without caring about what to do when function blocks.


I you've written advanced-enough go, you probably have a bunch of file that start with some C code before the Go code starts.

Go simply can't interface everything the way C does. Go doesn't give you the same level of flexibility either.

tldr: Go's great, but it's just not as powerful as C (unlike Rust or some others, which can do anything C does).


What specifically are you talking about? Go can do syscalls perfectly fine, you can mess in your memory perfectly fine (even if you have to resort to the unsafe package), so what exactly is it where you think Go lacks flexibility compared to C? Because so far, your posting sounds rather hand-wavy.


One problem is that it's very hard to call go libraries from other languages (it requires a runtime). C libraries can be called from almost any language.

There are many go libraries I would love to use, but can't because of this. I expect many others are in a similar boat.


For ex try to read the data from a signal. Another example is to start a namespace before all threads start.

I can go on for a while :)


I can imagine it being nice if you're tied to existing C code. Or if you want C-style memory management.


Garbage collection.


What stuff are you working on where the runtime guarantees of the Go 1.5 GC aren't good enough?


Me personally? Nothing. I'm sticking with Go.


Exactly. I'd argue that the majority of people complaining about Go being unusable for them because of the GC would actually be able to deal with the GC easily. I'm working on a soft realtime application in Go, and the GC has never gotten in the way.


Right back at you though: I'd argue that the majority of people complaining about OCaml (or even the JVM) being unusable for them (and using Go instead) would actually be able to deal with it easily.


Stop derailing this discussion about Go GC vs. C no GC.


It is relevant though. You're talking about the large number of people who are using C and don't need no GC as though they're people who should be using go - but if they don't need no GC they probably don't need a language as low-level as go full stop and would be better served by a more expressive alternative. Ultimately, a large proportion of "people who need go or C" actually need C, making this library a good idea.


Can you elaborate about your experiences writing soft real-time in Go. Is your project open source?


You don't even need to use Go, the gccgo implementation of the runtime (including garbage collection) is available from C: https://github.com/stefantalpalaru/golib




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: