When we implemented the Go and Erlang lightweight-thread approach for the JVM in Quasar[1], we quickly realized that CSP/actors, while going a long way towards sane concurrency, are not sufficient. Every application eventually needs some shared mutable state (even if it's implemented at the database layer). One approach is to have a single fiber (or goroutine) in charge of managing state (Clojure does that with agents), but that doesn't scale for multiple writers. Neither does a global lock over the entire data store.
Good concurrent data structures that allow multiple concurrent writers are necessary. Java has some good basic ones like ConcurrentHashMap, ConcurrentSkipListMap, and ConcurrentLinkedQueue. Clojure takes that a step further with transactional refs. For more interesting kinds of data it's better to have a good in-memory transactional database.
So for truly "sane concurrency" you need both lightweight threads with CSP and/or actors, plus a transactional, concurrent (preferably in-memory) data store[2].
Sounds like you've basically explained several of the common concurrency features of Haskell. Haskell has the advantage that any pure data structure (i.e., not using mutable references, which is almost all the data structures we use) can be used concurrently and atomically using IORefs and atomicModifyIORef. If you need transactional guarantees, then software transactional memory (STM) is just as easy to use. In both these situations, it's laziness and purity that give us these things essentially for free.
Of course you could create more advanced concurrent mutable structures as other languages have, but IORefs and STM get you a very long way when coupled with GHC's extremely lightweight threading and high-performance IO (which is getting even better in GHC 7.8, to be released soon).
To learn more, Simon Marlow's excellent book is available for free online [1] and in dead tree and eBook forms from O'Reilly. It's very clear, with a great balance of how to use things practically, how they work, and how to make them work their best.
No STM of course, but RWArc allows for multiple writers, or you can just use Arc if immutable sharing is acceptable, or just message passing via Chan (in the standard library) if sharing data concurrently is not required. Lots of options.
You are correct, you'd use them to implement your own concurrent data structures per requirements. But I do think they're slightly higher level primitives than most languages provide, and there is some nice flexibility in the options available, in terms of control over sharing/mutability and mechanism.
RWArc allows multiple writers, but each blocks until the "active" writer gives up its lock. But I suppose you mean that you can't have concurrent writes, so I understand this might not be suitable for the use-cases you were talking about.
I think the Rust people have made the right choice at every juncture given the language requirements. It's a very impressive language, but it, too, has its sweet spots. There's a reason why Java has attracted so much concurrency research in the past decade, and why it's usually years ahead of almost any other platform in that respect: no, it's not (just) the platform-agnostic memory model, which later inspired C++11's (heck, it was done by the same people), but its world-class garbage collector(s). While it's possible to implement lock-free data structures without a good GC (with RCU), it's very difficult to do so in a general enough manner while still staying efficient. Java's GC, along with its memory model and direct access to CPU concurrency primitives (CASes since Java 5 and fences since Java 8), combined with Doug Lea's inspired implementations[1], have made the JVM the premier choice for anyone who wants to take advantage of cutting-edge concurrency.
EDIT: The Doug Lea talk linked above is probably the best overview of what doing low-level concurrency on modern hardware entails. In particular, it demonstrates interactions among abstraction layers and dispels the myth that having fewer software layers is always better (hardware already has so many layers as it is that some software layers can actually help, if only to hide the hardware's complexity).
Thanks for the link to the Doug Lea screencast/talk, very interesting so far.
EDIT: Indeed it was a great talk, very informative! I came to the opposite conclusion about layers of the system, though I'd say that it isn't the number of layers that's the problem, but the nature of them.
From the talk: having the GC insert safepoints where threads can be stopped for a collection, and having the JIT insert counters (resulting in memory contention), are two things in Java that cause problems for high-performance concurrency. Running on top of a VM causes similar issues. Though, clearly, there are many other layers and unanticipated layer interactions just lying in wait to mess you up regardless.
Well, he only mentions how these things make his life difficult, not how they make it easier. When you consider something, you must also consider the alternatives: a GC does insert safepoints on the one hand, but it makes lock-free data structures possible (or much, much easier) on the other. The JIT counters introduce contention before optimization kicks in, but then you get virtual method inlining, the mother of all optimizations, which you can't do without a JIT, certainly not when you want to support hot code loading.
Those things are necessary for the class of programs Java SE is meant to support (long running, high-performance server-side applications); they're not ideal for device drivers, command line programs or fast-loading GUI apps, which Rust is designed to handle. Everything is a tradeoff.
No, RWArc definitely allows more than one writer. One writer at a time, sure, but you can have multiple tasks try to write to it at the same time and the first will get access while the others will block.
EDIT: Oh, I see what you mean now; sorry, I misinterpreted. Carry on.
No. Like I said, agents allow a single writer at a time. Refs are more like it, but it's really hard to build a good concurrent data structure (like a B tree) efficiently with refs.
As far as I understand, refs are not intended for implementing data structures. Instead, they should be used to provide concurrent access to immutable data structures. So the "Clojure way" solution would be to efficiently implement an immutable B-tree and then expose a ref to it.
An atom would do that much more efficiently than a ref. Refs are there for more complex types of mutation (like mutating more than one persistent DS transactionally).
Also, B-trees are a mutable data structure. "Immutable" B-trees aren't B-trees. Clojure's own persistent DS tries are quite efficient, but not nearly as scalable as a concurrent B-Tree, which easily supports lots of concurrent writes. AFAIK, there are no persistent data-structures that efficiently scale writes.
Speaking of the JVM, Scala has immutable data-structures. You stick one of those suckers into an AtomicReference and presto you've got a thread-safe data-structure. And for more complicated scenarios, you also have Scala-STM: http://nbronson.github.io/scala-stm/
Even though the concurrent data-structures in Java's library are efficient and well built, they remain error-prone because thread-safe operations are not composable. Working with immutable data-structures in a high concurrency scenario is the best thing since sliced bread.
I do agree with your point - sane concurrency implies using the right abstractions for the job and there are cases in which you need shared memory storage. The JVM is in fact great because it does 1:1 multithreading, because the GC is so good that it allows you to be a little wasteful and because it provides the required concurrency primitives to build whatever abstraction on top that you desire.
For Scala + the JVM, just to enumerate: you've got actors by means of Akka, which is pretty capable as it's flexible and allows you to manipulate/coordinate entire clusters of remote actors (and of course there's Quasar that does the lightweight-threads thing; I'd love a comparison between the two, btw). You've got the Future/Promise concept baked in, with a sane design [2], and C#'s "async" by means of a library that will be part of the standard [3]. You've got great immutable data-structures that also have parallel extensions [4], RxJava/RxScala for processing streams of events reactively [5], Iteratee IO for safe processing of data streams [6], and the shared transactional memory I mentioned above [7]. And of course there are the things inherited from Java, like NIO, which in spite of its flaws is the sanest cross-platform abstraction over native APIs for async I/O, plus, as I've said, the best cross-platform concurrency primitives you can get (e.g. I still get boners every time I use ReentrantReadWriteLock).
> ... immutable data-structures. You stick one of those suckers into an AtomicReference and presto you've got a thread-safe data-structure.
Thread-safe – yes (and Clojure does this beautifully; in fact any Clojure DS is thread-safe); concurrent – not really, because you can only have one writer at a time, which simply doesn't scale for large data structures.
> I'd love a comparison between the two btw
Quasar has true lightweight threads (M:N multithreading as in Go or Erlang), and an (optional) actor system built on top. Quasar actors can and should block as much as you want, have selective receive (just like Erlang), and are especially easy to use. Also, Quasar has Java and Clojure APIs that feel natural and idiomatic in both languages.
Go has some great high-level concurrency primitives with channels and goroutines, but honestly the first function could have been written with sync.Mutex.Lock / defer sync.Mutex.Unlock.
Absolutely true. And if I were writing this for actual real world use then I would have done so. The main technique I was trying to get to was tightly coupling a struct w/ an internal event loop, though, so I allowed myself some awkward code (called out as such in the post) to build up to that point.
Although more verbose, I find this pattern much better than mutexes.
The problem with mutexes is that they are some kind of cheap way for the developer to say "stop the world while I think", but they require you to be quite aware of what happens in an object, too much I think.
In this example you'd have a mutex for the Account, so each time you want to change the amount you need to lock and unlock it. That's trivial for an object like that, but it can quickly become complex: one mutex is not enough because it blocks the whole object, so you start having multiple mutexes. But then you don't remember which mutex blocks what so you must have some good documentation about that, but the documentation is not as well maintained as the rest so it starts rotting. And then your file spreads among so many lines you can't quite grasp all the ways parallelism can kill you.
In the given example, the mindset is different: _all_ methods called from the outside are non-modifying, they only stack the change to be made. The actual data manipulation is done in a very tight loop that does a very simple thing, and nothing outside of this loop ever deals with memory. As a developer, you can fit all the code that can be "dangerous" in a mere 7 lines of code, instead of having it spread among the file.
I concur it's a different mindset, quite different at that, but it's a pretty good one (and Go makes it very pleasant).
Sorry, where's the part where I complained that the language is broken? I said that Go doesn't protect you from mutating shared state across multiple concurrent goroutines. Not a very controversial statement, that.
But how does Go’s approach to concurrency fare when viewed through the lens of encouraging code that supports local reasoning? Not very well, I’m afraid. Goroutines all have access to the same shared memory space
The article is flawed because you:
1. Base it on Go failing to protect you, when in actuality you fail to use any of Go's "very useful concurrency primitives baked right into the language".
2. When you do choose to use these primitives, you choose the wrong primitive, which results in a much more verbose and complex solution than is necessary.
It would have been much better had you:
1. Shown why the solution with no protection is unsafe (and mentioned it's unsafe in just about any mainstream language)
2. Shown a mutex solution
3. Shown a channel-based solution
When it comes to protecting a structs state, in most Go code (including the standard library) RW/Mutexes are the "go to" synchronization primitive. Channels are more commonly used for communication between longer running goroutines.
Snark aside, I added a bit to clarify that the first example is contrived, implemented as such for illustrative purposes, and a mutex would actually be a better choice in the trivial case.
I would have understood snarking at my original (admittedly somewhat snarky) comment. I'm not sure why you felt the need to snark at my constructive criticism, which you actually took.
It's maybe a bit silly having a conversation about tone on Hacker News, but generally people tend to respond better when you don't state opinions, especially critical ones, as though they're fact. See "The article is flawed because you..." and "It would have been much better had you..." I statements FTW.
I don't think the original version was particularly flawed, and your description of why you think it was doesn't make a lot of sense to me. But I added a note anyway to make it clear that I do in fact know about the alternative approaches, and that I made the choices I made deliberately and not out of ignorance.
Regardless, I appreciate you taking the time to read it, and to give your feedback. Cheers.
I tried an experiment in Go a few weeks back: comparing writes to a logfile between synchronous access (wrapped in Mutex.Lock) and passing a reference through a buffered channel (such that ownership of the object was transferred along the channel). There was less than a 1% advantage to the Mutex, and the code was surprisingly clean with channels.
Mutexes are absolutely idiomatic Go code. In fact the Core developers on the ML will often suggest them as a simplification over goroutines for synchronizing access to a shared resource. I'm not sure where you get the idea that they aren't idiomatic.
"Do not communicate by sharing memory; instead, share memory by communicating.
This approach can be taken too far. Reference counts may be best done by putting a mutex around an integer variable, for instance. But as a high-level approach, using channels to control access makes it easier to write clear, correct programs. "
I don't really think what I suggest goes against the article you link to. It basically says: use channels when appropriate, but sometimes the best approach is to use locks.
E.g. a bank account is shared state (similar to a reference count), and the best approach is to put a mutex around it instead of fighting with channel synchronization.
There are no implications here, it's stated pretty explicitly:
"Reference counts MAY be best done by putting a mutex around an integer variable, for instance. BUT as a high-level approach, using channels to control access makes it easier to write clear, correct programs."
There is a difference between what someone with a background from other mainstream programming languages might think would be the best approach to locking (mutexes) and the idiomatic Go approach (unbuffered channels).
This is very, very familiar. I'm working on a library in which I'll be implementing something quite similar very soon. It actually reminds me quite a lot of the actor model - as each event loop is essentially an actor which receives and sends messages through channels. Background follows:
I'm the author of a Haskell library[1] for interfacing with HyperDex[2], a distributed database produced by some folks at Cornell.
The client-side library for HyperDex uses an event loop, and is "thread-safe" to call into as long as you synchronize access to the pointer. This is an awkward area to work with in Haskell-land, as the code I write has to deal with the intersection of the three uglies:
* Foreign function interfaces (and marshalling)
* Mutable resource management (allocating and freeing C-structs)
* Concurrency (synchronizing access to an object)
I've tried a variety of implementations. To speed up testing, I wrote a "fake HyperDex" client and test harness in which I can insert chaotic behaviors to test resilience.[3] While the code isn't as clean as I like right now, it lends itself to the smallest, most compact implementations of what I want. Each HyperDex connection pointer is paired with an event loop which receives requests for calls into HyperDex and processes them. When a response is demanded, the loop begins a busy-wait (will soon be replaced by an epoll/select on an fd) on results from the database. Each asynchronous request is an event loop too - a free floating closure sitting in memory that sends off a message and waits for responses.
GHC's garbage collector will determine when waiting on an MVar or Chan is deadlocked, and keeping everything in a ResourceT monad ensures that the whole system gracefully closes when portions fall out of scope and are unreachable.
The approach has a certain elegance to it. I am curious to hear what other people's thoughts are on such implementations though, because there are naturally performance implications.
[3] http://lpaste.net/101084 - a hodgepodge of code I wrote to test implementations - ill-documented, this is just for personal exploration and testing
> The client-side library for HyperDex uses an event loop, and is "thread-safe"
That is a common misconception. Or rather, it is a tautology: no threads = thread-safe. But it is not concurrently-modifying-data-structures safe -- which is the main pain point.
One can get just as easily tangled over a set of callbacks.
Here is a set of callbacks all started from some select/poll/epoll loop. Some call it a reactor (namely Glyph's own Twisted Python).
cb1 -> cb2 -> cb3|eb3 then cb3->cb4 and eb3->cb5
Processing starts with cb1 and ends with cb4 or cb5. Notice at some point cb2 function could result in generating an errback (eb3) which then ends up calling another callback cb5.
Understanding that the above, in a large system is just a messier, uglier concurrency structure than a thread/goroutine/task/actor is crucial.
It doesn't necessarily save you from simultaneous access to same shared data.
Imagine processing starts at cb1 and by the time it reaches cb3 (say cb2 calls some io or sleep operation), cb1 gets called again. cb1 through cb3 end up modifying some shared data (hey no need for lock, we are using callbacks remember!). Now there are two callback chains modifying shared data.
Yes, you need locks and semaphores with the above, just as you do with threads.
For example, this exists -- Twisted's own Semaphore:
I had to use it, and not just for throttling concurrency, but also to protect critical data from being modified concurrently.
Asynchronous/callback/promise/future-based concurrency looks really good in small examples. In large applications it gets messy quickly.
Threads/actors/goroutines etc. are still nicer from a logical, application point of view. You can even build them on top of the same epoll/select/kqueue system calls if the language can support some kind of coroutine structure (which, for example, Python's gevent/eventlet are doing).
I am unsure how to map your reply to my concept of HyperDex's event loop (and C API). You are of course correct, you have to synchronize access to some object which is a pain point.
What I am doing in my implementation of a thread-safe wrapper around HyperDex is similar to what you describe at the very end of your post.
The guarantee you get when using an event loop (or some other cooperative concurrency model) is that inside a contiguous block of code you don't have to worry about any other code accessing "your" data: until you yield control, the only code that will execute is your own.
Obviously if you call a function you need to be aware of what it might do to any state you grant it access to, but even then it can only mutate state either before it returns (as in most popular programming models even without concurrency) or, if it registers a callback, at some point after you yield control.
It takes a little bit of getting used to, but this effectively makes mutual exclusion blocks the default unit of execution - and in doing so eliminates a lot of the care needed to write correct code in a preemptive model. Take a look at a well written Node.js or Twisted app - neither Javascript nor Python has concurrent datastructures in their standard library, but it just isn't an issue in most cases.
All of that being said, I think that the failure of this model is that the reason programmers tend to yield control is not because they no longer need it, but rather because they must do so in order to avoid blocking the event loop on I/O or a timer. In other words, they need to model arithmetic, for example, fundamentally differently than they model loading a configuration file. Things get even wonkier when you start specifying interfaces without knowledge of the implementation: at some point almost any operation could require IO, even if the default implementation doesn't, so you end up registering callbacks on every line. In addition to wreaking havoc on your mental model you have to begin worrying about stack sizes, etc.
In this sense, I think well written Go has a certain elegance. The programmer knows to treat shared state as a special and dangerous case. Instead, you tend to program around a produce/consume model using channels with near zero shared state. You don't need to worry about whether a call yields to the scheduler or not, or even how many threads your goroutines map to - you just write little blocks of code that operate on easily comprehensible state.
I believe you have a race in that code on the error channel. You can't guarantee that the value you send on it will be received by the correct receiver.
If you have 2 concurrent transfers and one errors and the other succeeds, the second one may receive an error intended for the first one.
Since the channels are unbuffered, there's no way to read from the error channel until the loop has accepted your request for the transaction. Also, since it won't accept another transaction request until it has sent an error (it always sends an error value), you always read the error for your transaction.
Any buffering in the request channels would have the race you've described though.
I've come up with a way, using Clojure, to have a discrete game loop, but still have multicore concurrency and parallelism. This gives one the simplicity of a cooperative multitasking model, without having to do explicit yields. It comes at the price of not being able to achieve full use of all of a machine's cores in a single instance, but I don't think this is necessarily a problem. There's this thing called an OS, which would let me achieve nearly full utilization with only a few instances. And by a few, I really do mean single digit numbers. For me, this is preferable, since it eliminates a single point of failure. (To be fair, Erlang also eliminates this if one uses its distribution capabilities intelligently.)
The world is divided into sub-grids, which are all processed in parallel, minus movements that cross sub-grid boundaries. At the end of the parallel processing, all of the sub-grids are reassembled into a world-grid (which takes O(n·log32 n)). The world-grid acts as a "global" processor and processes all remaining movements.
Basically, everything is parallel, except that which has to be processed globally, which is just a tiny minority of everything. Then there is a non-parallelized step that knits together and coordinates the global grid. If you plotted parallelism over time, it would probably look like a saw tooth or a square wave.
Right now, there are only two tiers, but this could be generalized into a structure like an R-tree, which would probably make the algorithm cache oblivious.
It's the "knitting" step that keeps it from utilizing 100% of the cores in parallel with just one instance. I still hope to be able to support 1000's of entities with just one instance, however.
The race detector is great for such issues. But first and foremost, people ought to read the very enlightening presentations about Go's concurrency primitives and conventions:
Adventurous web developers can try out goroutine-like concurrency patterns in javascript, with ES6 generators. I've enjoyed experimenting with these two libraries:
sync.Once instead of init(). Having the Do close to where the channels are being used makes sure we do not run into deadlock. It's a nice pattern for lazily initializing channels and goroutines too.
For example idiomatic Java was a clusterfuck for nearly a decade, with all the EJB's and FactorySingletonProxies and XML everywhere. It would be better for Java programmers to ignore the idiomatic BS prevalent in the community, and just use the language in a saner way.
Idiomatic is often just another name for "how lots of people prefer to do it" or "cargo cult".
[1]: https://github.com/puniverse/quasar
[2]: http://blog.paralleluniverse.co/2013/10/16/spaceships2/