Rust concurrency: the archetype of a message-passing bug

Animats · on June 22, 2020

It should be noted that combining shared-state with event-loops is usually a recipe for subtle disaster.

Go has that problem, too. The claim for Go was that concurrency would be done via message passing. But, in the examples in the original manual, and often in practice, what's being passed on the channel is some kind of reference to shared data. In a garbage collected language, this is OK as long as the sender only sends new objects and never touches them after sending. Still, it's easy to screw up, and the language does not help.

In both Go and Python, there's not much thread-local data that the language will not let you export to another thread. That's a lack. Shared state that isn't obviously shared state is a recipe for trouble caused by later maintenance programming. This is one of Rust's strengths. Sharing has to be explicit.

Shutdown is always hard in an event queue environment. Go makes it harder by making a write to a closed channel kill the whole program. So if a receiver decides it needs to shut down, it can't just close the receive end of the channel and let the sender hit a send error. Negotiations are required to get the sender shut down.

linuxftw · on June 22, 2020

> Go makes it harder by making a write to a closed channel kill the whole program. So if a receiver decides it needs to shut down, it can't just close the receive end of the channel and let the sender hit a send error. Negotiations are required to get the sender shut down.

And thus Borg was born. Too hard to write processes that handle situations gracefully, so let's just let the process die and spin up a new one immediately.

Literally was looking at a patch set that was proposed recently that involved several channels in Go and one of the codepaths was os.Exit(1) for lack of any meaningful mechanism to sanely clean everything up. Sadly, this was in library code, so would be totally unexpected by the caller and probably not desirable whatsoever.

Animats · on June 22, 2020

And thus Borg was born.

Yes. Basic Dijkstra producer-consumer one-way queues are simple and elegant. But the error cases are hell.

OK, so if the consumer wants to make the producer stop, it sets an abort flag. That's a shared variable. (Does it need a lock? Does it need a lock or fencing on ARM, which has weaker memory concurrency guarantees than x86?)

So, producer sees the abort flag set and closes its end of the channel, right? Maybe not. Maybe the channel is full, and the producer is blocked on a send. OK, so, when the consumer aborts, it has to enter a drain loop, doing receives and discarding the result until it gets an EOF/channel close. That will unblock the producer, and then the consumer can exit.

But what if the producer doesn't have anything to send right now? Now the consumer is stuck waiting for the producer to send more data, and the consumer can't exit.

Should the consumer close the producer's end of the channel after setting the abort flag? Now there's a race condition between checking the abort flag and sending. Add a critical section that covers both actions? Now the producer can be blocked while in a critical section. This creates a potential deadlock if the consumer wants to set the abort flag while the producer is inside the critical section. Now both ends are blocked and there's a deadlock.

But it all looked so simple in theory!

codr7 · on June 23, 2020

If you insist on making it as difficult as possible, I guess.

Otherwise the producer decides when the show stops, by simply closing the channel.

fovc · on June 24, 2020

The ECMAScript TC came to the same conclusion with promises and fetch. Instead of adding an `.abort` method to promises for the consumer to use, you need to wire in an AbortController so that the producer (fetch) can close the channel. The consumer then sees an AbortError. It's a bit more work to wire up than simply calling `p.abort()`, but it saves you from having to reason through some hairy race conditions.

earthboundkid · on June 23, 2020

Yes, exactly. The workers can send results to the manager, but the manager does all the management. Anything else gets too hairy, too fast to reason about. I wrote a blog post about it: https://blog.carlmjohnson.net/post/share-memory-by-communica...

Animats · on June 24, 2020

Does that scale to a pipeline, a chain of producers and consumers?

What happened with Go, I suspect, is that the syntax came first. Go uses a "<-" operator to talk to channels. There's no place for an error status with that operator. Hence the problem.

UNIX-type pipelines have SIGPIPE to unwind things when something downstream aborts. Abort has to propagate back up the chain. I tend to argue that the workarounds for not having exceptions are worse than the problems of exceptions.

oconnor663 · on June 23, 2020

> Go makes it harder by making a write to a closed channel kill the whole program.

I was wondering about this recently. Why did Go choose this design? What's wrong with "sending and receiving can both fail if the channel is closed"?

apta · on June 23, 2020

golang isn't the place to look for a well designed language. It's full of inconsistencies and subpar decisions.

ivanbakel · on June 22, 2020

This is a cool technical exploration of some concurrency ideas: but event loops combined with Rust's borrow checker actually make for a re-invention of the actor model from first principles.

Languages like Pony, and the actor model more generally, were designed to prevent a whole class of concurrency bugs (same as Rust) - what's interesting is that, like this post shows, they also have unexpected problems with message causality.

It makes me wonder if there's an even better abstraction out there which will solve this kind of concurrency bug, and what will arise after these are solved.

oconnor663 · on June 22, 2020

I think the fundamentally tricky part is that sometimes you want your concurrent code to have this sort of nondeterministic behavior. I think the Rayon FAQ does a good job of summarizing the problem: https://github.com/rayon-rs/rayon/blob/master/FAQ.md#but-wai...

> Consider for example when you are conducting a search in parallel, say to find the shortest route. To avoid fruitless search, you might want to keep a cell with the shortest route you've found thus far. This way, when you are searching down some path that's already longer than this shortest route, you can just stop and avoid wasted effort...Now in this case, we really WANT to see results from other threads interjected into our execution!

polyglotfacto · on June 26, 2020

Yes, I would say the tricky part is accurately modelling what you want. Also, instead of actual nondeterministic behavior, I think that you rather want different things to be deterministic.

The article is an example of "task parallelism", where you have different threads doing different things on different (local) data. Having those threads communicate via message-passing is a nice way to model workflows crossing thread boundaries, and in that setup you usually want some sort of deterministic behavior related to the ordering of operations(for a given workflow).

On the other hand, I think the Rayon example is a good case of "data parallelism", where you have workers doing "the same thing" on "the same (shared-)data".

In such a case, perhaps locks and shared-data are a better fit than message-passing, and indeed what you're after is not necessarily some sort of sequential ordering of worker operations. However you still probably want the ability to make other deterministic assertions about the behavior of the workers.

I wrote another article highlighting some of these "different determinism" at https://medium.com/@polyglot_factotum/rust-concurrency-five-...

ivanbakel · on June 22, 2020

Oh, certainly. You don't want programmers to go around insisting there's a global ordering to their message systems.

But the problem I think could be solved is where programmers have a causal dependency, but they don't encode it in the program. The next generation of concurrent languages could have the killer correctness features for solving causality problems.

mcintyre1994 · on June 22, 2020

I haven't gotten to use it yet, but I think Pony would at least claim to prevent this problem. On https://tutorial.ponylang.io/types/at-a-glance.html it says:

> All message passing is causal. (Not casual!)

The word 'causal' is hyperlinked to https://courses.cs.vt.edu/~cs5204/fall00/causal.html which contains:

> The purpose of causal ordering of messages is to insure that the same causal relationship for the "message send" events correspond with "message receive" events. (i.e. All the messages are processed in order that they were created.)

ivanbakel · on June 22, 2020

Not at all - in fact, the problem described in the article is a typical pitfall of the causal ordering system.

In a simplified version:

    A -1-> B
           B -2-> C
    A ---3------> C

The human expectation is that 2 happens "before" 3, because that's how it looks. The reality is that 1 happens before 3, and 1 happens before 2, but 2 and 3 have no causal relation - and the runtime is free to rearrange them in any order.

That's why the solution is essentially

    A -1-> B
           B -2-> C
           B -3-> C

Because then 2 really happens before 3.

mratsim · on June 22, 2020

Everyone dealing with actors, multithreading, distributed systems should read about Vector Clocks:

- https://en.wikipedia.org/wiki/Vector_clock

- Detecting Causal Relationships in Distributed Computations:In Search of the Holy Grail, Schwarz et al - http://www.vs.inf.ethz.ch/publ/papers/holygrail.pdf

hugey010 · on June 23, 2020

Now that's some CS I wasn't even aware of! Although, I'm confused why those process clocking topics are considered specific to distributed systems and not computation in general.

grogers · on June 22, 2020

The article was explicit about this. The A->C message was initiated from thread A before the A->B message. So causally the B->C message is after the A->C message, because it's after the A->B message (which itself is after the A->C message). But in this case causality was broken.

It wouldn't be confusing if the A->B message was sent first, because then the B->C message would be unordered relative to the A->C message. Depending on the processing at B it would be totally normal that one or the other would reach C first.

dnautics · on June 22, 2020

>It makes me wonder if there's an even better abstraction out there which will solve this kind of concurrency bug, and what will arise after these are solved.

You could go the erlang way, which is to give up. Unexpected concurrency bug? Crash, then restart from a sane state. It's highly effective.

Your VM is carved up into failure domains that reduce the blast radius of such a crash to the minimum reasonable collection of processes. This is what frameworks like pony don't get. The reason why erlang actors (aka microservices) are so effective is that they declare, define, and decouple failure domains, and assign responsibility over these failure domains to specific bits of code and ultimately specific teams.

staticassertion · on June 22, 2020

I agree with all of your post except for the premise - concurrency bugs won't necessarily lead to a crash, but often just a bad result.

dnautics · on June 22, 2020

that's not the premise. Maybe I should reframe it this way: you should detect inconsistent results and use "crash" as a strategy to restore consistency. Also of importance is to log these events and work towards fixing the concurrency bug by addressing the root cause in the long run with application monitoring.

yencabulator · on June 23, 2020

But that is then again just sweeping bugs under the rug. The fact that you haven't detected an invalid state doesn't mean you haven't corrupted data. You need tooling that helps to avoid the corruption, not tooling that sweeps it under the rug. Erlang's crash-on-trouble style is ok if 1) your design only loses "one phone call" on a crash (and not anything more persistent or valuable) and 2) you're willing to lose that "phone call" every now and then.

senderista · on June 23, 2020

That’s one thing about Erlang that took me too long to grok: “let it crash” applies to bugs, not expected errors.

dnautics · on June 23, 2020

Haha also i guess you might also decide that correctness is unimportant. How often does this bug happen? Is it a once-a-week thing with no user-facing effect? Just let it ride until you're scaling hard. Your labor as a programmer is expensive, no need to chase down every possible bug.

staticassertion · on June 22, 2020

Ah, ok sure.

SAI_Peregrinus · on June 22, 2020

> and what will arise after these are solved.

Look at unintuitive situations in special relativity. The unituitiveness is essentially a concurrency bug in our mental model of how the universe works, as opposed to how it actually works. Almost all of the issues are due to mistaken ideas about what it means for two events to happen "at the same time" or in a given order.

I suspect programming concurrency will continue to create bugs because humans ideas of time don't actually match up to reality. Obviously not usually due to relativity, but there are all sorts of complexities in the underlying systems we program for that intuition doesn't capture well.

zzzcpan · on June 22, 2020

I'm pretty sure the exact problems the article describes can be solved with proper implementation of actor model, i.e. Erlang-style implementation, where actors can be killed, dying actors can cause messages to be send and messages can be received in specified order. Basically lack of proper richer actor primitives is what pushed them into the mess they are in.

toast0 · on June 22, 2020

The problem described in the article could also exist if you followed the same messaging flow.

Erlang and OTP may make it easier to have a better flow, but it's not a magic bullet. Plenty of ways to write race conditions and get unexpected message orderings (especially on a multinode system)

dnautics · on June 22, 2020

Yeah this happens a lot, and in practice it's not a big deal. I noticed that when I redeploy, there's a nondeterministic flurry of errors in my system on the node that restarts. Finally I came to understand that this was race conditions. But it's honestly not a big deal. The system restarts anything in a funny state and moves on with its life. I feel like if you have something exquisitely race-condition sensitive that you aren't aware that it should be so, then you're not using OTP correctly.

fzzzy · on June 22, 2020

I agree with this. Supervision is an essential part of the erlang model.

jojobas · on June 22, 2020

Rust docs never claimed to prevent race conditions (which he doesn't call by name).

https://doc.rust-lang.org/nomicon/races.html

pjmlp · on June 22, 2020

True, but that is something that many miss a lot when discussing Rust virtues.

Yes the type system is a great help when dealing with thread based concurrency/paralellism, but not so much for the scenarios where we are already using type safe managed languages in distributed computing across multiple processes, most likely not even written in the same language.

Which by the way, given the recent security exploits, is much preferable to go back to, instead of relying on threads for parallelism.

fluffything · on June 22, 2020

> True, but that is something that many miss a lot when discussing Rust virtues.

The large majority of comments I read about Rust safety claims are very explicit that Rust provides memory safety and data-race freedom.

You seem to be arguing that these comments are leaving out important information by not clarifying that "This does not mean that Rust prevents all bugs, e.g., you can still write programs with logic bugs, race conditions, dead locks, DDoS, ... and the list goes on and on... in Rust".

If somebody does not know what that "memory safety" and "data-race freedom" is, they should just ask, but if they don't, and they make an incorrect assumption, then its kind of their fault, not the fault of whoever wrote such a comment for not mentioning all the things that these two properties do not guarantee.

Also, it is actually trivial to build Rust programs that prevent some race-conditions, like deadlocks (just a cargo flag away), so if anything, I'd complain that people being "too correct" by restricting themselves to "data-race freedom" are leaving out some important advantages of Rust.

ashtonkem · on June 22, 2020

My experience is that Rust advocates often go well out of their way to specify the limits of Rust’s protection, and will even rush in to correct someone who over promises what Rust can give you.

htfy96 · on June 22, 2020

r/rust and users.rust-lang.org are informative and sensible when it comes to limitations. Twitter and, well, random programming discords are often full of cults

lostcolony · on June 22, 2020

Yeah, I've never criticized Rust advocates for not talking enough about Rust. :P

But seriously, yeah, the seasoned Rustaceans I've known have been pretty thorough in giving downsides/limitations as well. It's something I love about a language; when the voices in it are aggressive in talking about things it does poorly, or use cases it's not optimal (I still love that Erlang's official website has a section for when _not_ to use it).

yencabulator · on June 23, 2020

That doesn't sound like "fearless concurrency", though.

loopz · on June 22, 2020

It's really hard to envision exactly all the problems parallellization might enable in a given system, as it'll all depend on exact implementation and execution details. The safe assumption is whenever you introduce concurrency or even event-handling, stuff might implicitly FUBAR, unless explicitly designed not to. This is a tough pill to swallow when people can't envision exactly all the problems parallellization might enable in a given system (circular argument).

vasilakisfil · on June 22, 2020

I ll leave that here http://graydon2.dreamwidth.org/247406.html

Edit: in the link, rust creator argues that we should watch out how we market rust.

bluejekyll · on June 22, 2020

> but that is something that many miss a lot when discussing Rust virtues.

When someone says “Rust is thread-safe”, Rustaceans in one common voice reply, “Rust is data-race-safe”. After that, folks get into a good conversation about what the difference is, and learn about deadlocks and other issue in threads.

Generally speaking though, it does mean that multi-threaded programming in Rust is memory-safe, and won’t introduce a significant class of bugs that you see in C and C++, where so many exploits come from.

fluffything · on June 22, 2020

Thread-safety is defined as "the absence of data-races", so technically speaking, saying that Rust is "thread-safe" and saying that it provides "data-race freedom", is just two different forms of saying the exact same thing.

The problem is that some people don't know what these mean. I'm not sure why though. Like, I could understand that somebody coming to Rust from Java wouldn't know what "memory safety" means since it is essentially a solution to a problem that Java does not have. But most high-level languages do not protect you from data-races, so this is a problem that Rust does solve and one can run into in most widely-used programming languages (e.g. in Java they are also called data-races).

mratsim · on June 22, 2020

This is not what thread safety means:

> Thread safety is a computer programming concept applicable to multi-threaded code. Thread-safe code only manipulates shared data structures in a manner that ensures that all threads behave properly and fulfill their design specifications without unintended interaction. There are various strategies for making thread-safe data structures.

(Wikipedia)

Data races is only part of the problem. Thread-safe also means that your design is bug-free and you don't have livelocks for example because of how you state machines interact there is a way for them to loop forever in an unexpected way (example, dining philosophers).

Similarly, data-race-freedom as provided by Rust is helpful for end users but not for library developers. Developers of concurrent data structures or threading primitives are on their own to sail in the complexities of the C++11 Memory Model (i.e. Acquire/Release) and the compiler cannot verify that you didn't introduce a synchronization bug in your new concurrent queue or event bus.

fluffything · on June 22, 2020

> This is not what thread safety means: > > Thread safety is a computer programming concept applicable to multi-threaded code.

Woah woah, hold your horses.

Thread-safety is a definition in the specification of the operational semantics of programming languages. It is a precise mathematical term used in proofs.

The C, C++, and Rust memory models precisely define thread-safety as "the absence of data-races".

Whatever the wikipedia says just does not apply (and if the wikipedia does not mention this, it is just incomplete in its treatment of thread safety).

mratsim · on June 22, 2020

How does Rust prove that you are correctly using relaxed, acquire and release atomics?

fluffything · on June 23, 2020

It doesn't.

What Rust proves for safe Rust is that even if you were to use these incorrectly, your program is still data-race free and memory safe.

That is, if you use relaxed where you actually wanted to use sequential consistency, your program might do many sorts of incorrect things (e.g. a counter is incremented in the wrong order), but it will not have undefined behavior (it won't dereference a null pointer, doing an access out of bound, double free, or having a data-race).

To write atomic code that creates a data-race, you'd need to use unsafe Rust, and then your program would have UB.

Rust proves this for safe Rust using ownership + borrowing, and `Sync`+`Send`. Ownership and borrowing deal with "aliasing xor mutability" which essentially eliminates all data-races, because a data-race requires a conflicting access involving two aliases where one is a write, and these two features prevent that from happening. `Sync` and `Send` provide proofs that some concurrent accesses are ok, e.g., an atomic relaxed write that conflicts with an atomic relaxed read is ok (what's not ok is unsynchronized conflicting accesses).

As mentioned this does not proves that your program has no logic bugs, only that it has no data-races and UB. It is possible to write programs in safe Rust that do not do what you intend.

mratsim · on June 23, 2020

Well if you misuse atomics you have a data race and undefined behaviors.

fluffything · on June 23, 2020

As explained above, if you do that in safe Rust, your program does not compile.

So no, you don't have data races or undefined behaviors.

If you mix safe Rust with unsafe Rust, and your unsafe Rust has a bug that causes a data-race, then you only have to search for the bug in the 0.00001% of your code base that uses unsafe Rust. Which is many orders of magnitude better than even "safe" languages like Java.

mratsim · on June 25, 2020

Then if I'm writing a concurrent queue or concurrent hash-table I'm on my own since 100% of my codebase is unsafe Rust.

Which was exactly my point. Rust provides no support for fearless concurrency because actual concurrent code, the code that deals with synchronization, is considered unsafe.

I.e. fearless concurrency is a misleading statement and should not be trusted if you write concurrent code.

Also, I suppose you didn't try to write any concurrent data structure. Even a hundred of lines of concurrent code is incredibly hard to debug without tools when it involves multiple threads synchronizing with each other.

fluffything · on June 26, 2020

> Then if I'm writing a concurrent queue or concurrent hash-table I'm on my own since 100% of my codebase is unsafe Rust.

Sure. And if you are writing a concurrent application with 500k LOC , 99.99999999% of your code will be safe Rust.

Safe Rust that you can refactor at will - at least, in my anecdotal experience of working 3 years full-time on a concurrent >100kLOC Rust application, I have yet to see a big refactor that introduces a concurrency or memory safety bug. In my previous 15 years of doing the same in C++, I actually have to see a refactor by myself or others that did not introduce multiple bugs.

> Then if I'm writing a concurrent queue or concurrent hash-table I'm on my own since 100% of my codebase is unsafe Rust.

You can write a concurrent queue using safe Rust (e.g. it is trivial to write a spinlock using safe Rust that you can use for that). You are choosing not to do that, and provide your own instead.

Your job is not to write the unsafe code, but to _prove_ that it is safe to use from safe Rust.

Once you do that, the other 99.999999999% of your application can just use it without issues.

If you don't want to do that, Rust belt has soundness proofs for most concurrency primitives in the standard library. Aaron Turon's thesis has similar proofs for the ones in crossbeam. Etc.

> I suppose you didn't try to write any concurrent data structure.

You suppose wrong.

> Even a hundred of lines of concurrent code is incredibly hard to debug without tools when it involves multiple threads synchronizing with each other.

So don't do it? I've done it a couple of times, and the first thing I do is set up the fuzzer and thread-sanitizer for the build. That catches most bugs early.

I've never proven my implementations to be sound, but they were always good enough for my use cases. I've hitted bugs in my application when using them, but finding those bugs was always trivial, since they are only located in the 100 LOC of unsafe code.

---

I think you are fully misunderstanding the claim fearless concurrency. It is a claim about safe Rust, and you can write huge, complex, and extremely efficient concurrent applications using safe Rust without having to write one line of unsafe Rust yourself, and using only unsafe Rust code that has been proved correct.

If the only thing you use Rust for is writing concurrent data-structures, Rust is very specific that these claims do not apply there. You seem to not have understood this yet, and without having understood that, the chances that your implementations are correct are very thin. I hope you are not using them for anything important.

TheNorthman · on June 22, 2020

Continuing from the same Wikipedia page, it mentions three levels of thread safety. Rust support all of them, and can guarantee them at compile time:

> _Thread safe_: Implementation is guaranteed to be free of race conditions when accessed by multiple threads simultaneously.

> _Conditionally safe_: Different threads can access different objects simultaneously, and access to shared data is protected from race conditions.

> _Not thread safe_: Data structures should not be accessed simultaneously by different threads.

mratsim · on June 22, 2020

It does not though? How does Rust proves that your implementation of a Peterson's lock algorithm[1] or Lamport's Bakery algorithm[2] is:

- Deadlock-free

- Live-lock free

- Free of data races

Second, how does Rust proves that you are correctly using the relaxed, acquire and release atomics so that your channel implementation is free of data races?

You can use Loom[3] but this is guarantee is certainly not provided by the compiler.

In short, Rust compiler only addresses memory issues by flat out preventing memory sharing. However, someone has to write those sharing/synchronizing data structures (Channels, Event Bus, Concurrent Hash-Table, ...) and those people have no tools that help them navigate the significant complexity of multithreading that go beyond memory and lifetime bugs.

[1]: https://en.wikipedia.org/wiki/Peterson%27s_algorithm#The_alg...

[2]: https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

[3]: https://github.com/tokio-rs/loom

TheNorthman · on June 22, 2020

That's my mistake. I confused data races and general race conditions, sorry about that :).

mratsim · on June 22, 2020

That's my main grip with Rust "Fearless Concurrency", fearing concurrency and dipping into it with your toes is a good thing.

Claiming "Fearless Concurrency" but ignoring synchronization bugs, at a low-level in concurrent data structures or at a high-level in concurrent systems is misleading. People will be fearlessly introducing livelocks.

ragnese · on June 22, 2020

I agree that "fearless concurrency" isn't the best tagline. While I'm definitely less fearful with concurrency in Rust, I'm still pretty fearful... I'm very good at writing deadlocks.

jstrong · on June 22, 2020

Have you written a lot of concurrent code in rust? Did your experience doing so convince you that it's something you should "fear" doing?

mratsim · on June 22, 2020

I've written a multithreading runtime from the ground up in Nim.[1]

And yes, writing that taught me that current programming languages are not sufficiently equipped to deal with concurrency bugs, including Rust. The bugs that I had where of the dining philosophers kind (deadlocks or livelocks when trying to put a thread to sleep for example or resolving data dependencies between multiple threads), some were due to a bug in glibc where only formal verification allowed me to ensure that I was doing the correct thing and it was the lower-level layers that was incorrect.[2]

Similarly, state machine formal verification to prevent design bugs is something that hardware engineers are deeply aware of but would also be extremely useful to ensure "correct-by-construction" event driven code in software engineering.

In comparison, we have many more tools to address memory bugs (Address Sanitizer, Valgrind and fuzzers in particular) but synchronization and communicating state machine bugs are still impractically addressed at the moment. This is something I'd really want Nim to explore and I'm really excited about the Z3 integration[3] to enable new formally verified use-cases[4].

Unfortunately, when I express concerns about those bugs to people from the Rust community, they tend to dismiss those as if the borrow checker was the panacea to all multithreading problems.

[1]: https://github.com/mratsim/weave

[2]: https://github.com/mratsim/weave/issues/56

[3]: https://nim-lang.org/docs/drnim.html

[4]: https://github.com/nim-lang/RFCs/issues/222

staticassertion · on June 22, 2020

Sorry that I see your post is downvoted, I'm generally in agreement, as an avid rust programmer. Rust can express a lot of safety guarantees, but there's a massive amount of room for improvements, and I'm sorry that the community hasn't been accepting of that in your interactions.

> Similarly, state machine formal verification to prevent design bugs are something that hardware engineers are deeply aware of but would also be extremely useful to ensure "correct-by-construction" event driven code.

This reminds me a lot of P. Rust has a crate that I wouldn't necessarily recommend, but it might be similar to what you want:

https://github.com/fitzgen/state_machine_future

Of course, in this case the code is the model, there is no verification here.

The Z3 integration with Nim is super cool, would love to see something like that for Rust. I know of some existing research that reminds me of this but I don't have time to dig up the paper - it was implemented in macros, and looked a lot like what you've got there, with pre/post conditions, ensures, etc. If someone has that paper, please link it! I would love to reread.

edit: Here's one rust crate leveraging Z3, but not the one I was thinking of https://github.com/Rust-Proof/rustproof

ibotty · on June 23, 2020

Did you write a bug report to glibc? I'd love to read about how the bug is happening.

mratsim · on June 23, 2020

I didn't, however you can read my writeup and investigation attempts here:

- https://github.com/mratsim/weave/issues/56

And the fix was directly using Linux futexes:

- https://github.com/mratsim/weave/pull/58/files

skohan · on June 22, 2020

Yes I can't tell you how many rust talks and articles I have seen which prominently state: "no data races". You never hear the caveat that general race conditions are still possible, and I think your average programming news consumer could be forgiven for not knowing the distinction.

lubesGordi · on June 22, 2020

Obviously no language can prevent race conditions. Race conditions are usually a flaw of protocol or api design. The whole point of 'fearless concurrency' is that you won't crash because of shared state.

johnisgood · on June 23, 2020

I think Ada/SPARK comes pretty close in my opinion. How come it does not get as much attention as Rust when it comes to safety?

To stay on topic: https://docs.adacore.com/spark2014-docs/html/ug/en/source/co... and https://blog.adacore.com/gnat-community-2020-is-here are worth a read.

Additionally:

> The Ada concurrency model is based on the notion of task, a unit of concurrency that represents an independent thread of control. All, the tasks and the mechanisms for inter-task communication and synchronization, are introduced at language level in order to allow building safer programs. As an illustration, Ada 95 introduced protected objects to allow controlling how data is accessed, thus eliminating race conditions. Additionally, in 1997, Burns et al. introduced the Ravenscar profile, a subset of the Ada programming language that allows high integrity applications to be analyzed for their timing properties by pursuing three main goals: (1) ensuring predictable execution, (2) simplifying the runtime support, and (3) eliminating constructs with high overhead. The limitations imposed by the Ravenscar profile have an inevitable impact in the complexity of correctness analyses, e.g. tasks can only communicate through shared objects (tasks entries are not allowed, so the rendezvous mechanism cannot be used), tasks are assumed to be non-terminating, and tasks and protected objects cannot be dynamically allocated.

> Along the same lines, SPARK, a language that subsets Ada to enable the formal verification of programs, eliminates race conditions by forcing any global object referenced from a task to be marked as Part Of that task, or be a synchronized object.

I think I have a comment about Ada's concurrency where I get into it in more detail (IIRC).

mratsim · on June 23, 2020

> How come it does not get as much attention as Rust when it comes to safety?

Community building.

That said, Nim is very much inspired by Ada for the future safety features, in particular Z3 integration to enable Spark-like use-cases

- https://nim-lang.org/docs/drnim.html

AFAIK Ada was also inspired by Rust for memory-related safety. I find the cross-pollination between languages fascinating.

WHat's the multithreading story of Ada like? I didn't find that much code or article or papers when I was searching for multithreading runtime in other languages.

micronian2 · on June 25, 2020

Hi,

Actually, it is the pointer support in the SPARK variant of Ada that was inspired by Rust's ownership/borrower semantic. Originally, SPARK didn't allow any pointer usage, but with the new ownership/borrower semantics, SPARK allows limited usage of Ada pointers.

As for more literature about using Ada's tasking features, I recommend the wonderful books and papers by Alan Burns and/or Andy Wellings, such as "Concurrent and Real-Time Programming In Ada (3rd Edition)." AdaCore has an old presentation on using Ravenscar with multi-core processors (https://people.cs.kuleuven.be/~dirk.craeynest/ada-belgium/ev...). There are more articles on the internet on Ada Ravenscar tasking profile and full Ada tasking.

Recently AdaCore announced they are adding support for CUDA applications in SPARK (https://developer.nvidia.com/gtc/2020/video/s21122-vid)

mratsim · on June 25, 2020

Ah yes, I watched that talk at FOSDEM 2019, it was very nice.

micronian2 · on June 25, 2020

The link you provided to the SPARK user manual is a great reference. It explains how using the Ravenscar tasking profile in SPARK helps to avoid data-races and race-conditions. That, along with the ability to formally prove the absence of runtime errors and that your program satisfies various program properties and requirements are good reasons for people to seriously consider looking into using SPARK for critical parts of their software.

pjmlp · on June 23, 2020

Because GNAT is the only free compiler out of surviving 5 or so, and even it has had some controversy in the past regarding GCC and Pro version, on a day and age, where many don't want to pay for compilers, which unfortunately are on corporate price levels in what concerns Ada offerings.

Then Pascal/Algol syntax is not cool, so you always get a bit of push back on a world that now breaths curly brackets.

However NVidia has recently chosen Ada/SPARK for their security critical firmware, so there's that.

pjmlp · on June 22, 2020

If the shared state is via shared memory IPC, you can only protect the Rust side, and have zero guarantees about its current state.

It requires the use of unsafe on the Rust side, and you better code it very carefully.

ajross · on June 22, 2020

Yes, but there's a cost for that, which is something that's sort of at odds with Rust's design philosophy in other areas. Outlawing concurrent shared state means that every communication between two threads needs to be formatted as a "message" in whatever abstraction is in use. It needs to be copied, it needs to snapshot whatever data is needed to interpret it. You can't just set a flag and release the lock.

And that's... a problem. Because a lot of code that actually wants concurrency in the same memory space is code that also wants high performance. And Rust serves these applications well in other areas, where "Just as fast as C" is true.

But Rust isn't as fast as "C" here. It's demonstrably slower. Is that a good thing? I honestly don't know.

It's just a hard problem. But you can't eliminate the idea of "shared locked memory" by fiat. It's a real requirement, and Rust has little to say about it in its standard library or (IIRC, though I'd have to go check) its memory model and locking primitives. It's sort of a no-mans land of "super unsafe outside the box" code. And that's a pity.

justinpombrio · on June 22, 2020

> Outlawing concurrent shared state means that every communication between two threads needs to be formatted as a "message" in whatever abstraction is in use.

> It's a real requirement, and Rust has little to say about it in its standard library

So, I don't know much about concurrency, but you're looking for like, a Mutex, right? They're in the standard library and used all the time.

https://doc.rust-lang.org/stable/std/sync/struct.Mutex.html

And RwLock, a variant on Mutex that allows either one writer or multiple concurrent readers:

https://doc.rust-lang.org/stable/std/sync/struct.RwLock.html

lubesGordi · on June 22, 2020

Yes! There's nothing weird about rust's synchronization on shared state. If you have a mutex, you can do what you want just like in c++. The mpsc::channel thing that gpp might be talking about is just a nice to have tool for sending data between threads. No one is forcing its use.

pcwalton · on June 23, 2020

> Outlawing concurrent shared state means that every communication between two threads needs to be formatted as a "message" in whatever abstraction is in use. It needs to be copied, it needs to snapshot whatever data is needed to interpret it. You can't just set a flag and release the lock.

That isn't how Rust works. There are mutexes, condvars, atomics, etc. in the standard library. You don't need the unsafe keyword; go nuts with them.

hinkley · on June 22, 2020

In particular, locks or ownership do not protect against re-entrancy errors.

Recursive state change calls can end up applying multiple changes either in the wrong order or not at all.

JTenerife · on June 22, 2020

"Therefore, rather than making you spend lots of time trying to reproduce the exact circumstances under which a runtime concurrency bug occurs, incorrect code will refuse to compile and present an error explaining the problem. As a result, you can fix your code while you’re working on it rather than potentially after it has been shipped to production. We’ve nicknamed this aspect of Rust fearless concurrency. Fearless concurrency allows you to write code that is free of subtle bugs and is easy to refactor without introducing new bugs."

https://doc.rust-lang.org/1.30.0/book/second-edition/ch16-00...

These sound like quite bold claims to me.

remram · on June 22, 2020

Previous sentence in this paragraph:

> By leveraging ownership and type checking, __many__ concurrency errors are compile-time errors in Rust rather than runtime errors.

mratsim · on June 22, 2020

I'm pretty sure the OP meant that concurrency introduces many subtle bugs that are not related to memory safety and so cannot be prevented by the Rust compiler hence fearless is a bold claim.

mratsim · on June 22, 2020

And that's where model checking / formal verification is a tremendous help (or would be if it was easier to use).

There are many properties of multithreaded/distributed systems that are emergent and that current static compilers don't address.

In distributed systems it's becoming more common to use TLA+[1] to model the system and ensure that we don't have bugs at the design level (rather than just the system level).

In particular for message-passing, I'm quite curious about the state of the research on Communicating Finite State Machines[1] and Petri Nets[2] and if there are actual products that could be integrated in regular languages and are not just of academic interest only.

I expect that Ada/Sparks[1] has some support for this pattern though Ada multithreading story is not there yet (?).

I'm quite excited about the future Z3 prover integration in Nim[5] and the potential to write concurrent programs free of design bugs[6] to address both ends of the concurrency bugs:

- bugs in the low-level concurrent data structure via formal verification (which is something I used int he past to prove that my design was bug-free and that the deadlock I experienced was a condition variable bugs in Glibc[7] and I had to directly use futexes instead.

- bugs in the high-level design, implemented as event-loop / finite state-machine reacting to events as received by channels

Are there people working on model checkers or formal verification in Rust? Like the author, i think addressing memory bugs.the borrow checker is only a part of the "fearless concurrency" story and preventing design bugs would be extremely valuable.

[1]: https://lamport.azurewebsites.net/tla/tla.html

[2]: https://en.wikipedia.org/wiki/Communicating_finite-state_mac...

[3]: https://en.wikipedia.org/wiki/Petri_net

[4]: https://www.adacore.com/sparkpro

[5]: https://nim-lang.org/docs/drnim.html

[6]: https://github.com/nim-lang/RFCs/issues/222

[7]: https://github.com/mratsim/weave/issues/56

And shameless plug, my talk at the NimConf on debugging multithreading programs (data structure and design): https://www.youtube.com/watch?v=NOAI2wH9Cf0&list=PLxLdEZg8DR...

Slides: https://docs.google.com/presentation/d/e/2PACX-1vRq2wvd4q_mo...

inaseer · on June 22, 2020

We've used Coyote (https://microsoft.github.io/coyote/) which offers an actor-based programming model, as well as as async/await programming model on top of .NET's Tasks to express the concurrency in our services. Coyote includes a powerful systematic testing and exploration engine which explores the various states your programs can be in and helps you find violations to safety and liveness properties. This gives us the same benefits you get when using a model checker like TLA+ but on our actual production code as opposed to a model of the code. If your services are written in .NET, I seriously recommend taking a deeper look here.

mratsim · on June 22, 2020

Wow, that looks like a killer feature for .Net

I will definitely steal some ideas from Coyote to drive Nim formally verification story of concurrent code forward.

On the Rust side, i feel like Loom has a lot of potential https://github.com/tokio-rs/loom and I'm keeping tabs on them.

inaseer · on June 22, 2020

Looms looks quite interesting and is similar to Coyote in a lot of ways. The exploration engine used by Coyote is based on years of investigation by Microsoft Research so it scales nicely on real world programs. I believe the Coyote team is looking into exposing the systematic exploration capabilities in a language and platform agnostic way so developers can create language-specific bindings in their language of choice and benefit from all the research done over decades. No timelines that I'm aware of but it is on the road map I believe.

mratsim · on June 22, 2020

Yes I'm in the middle of many Microsoft papers[1] to add model checking to Nim, in particular CHESS.[2]

[1]: https://github.com/mratsim/weave/issues/18

[2]: Finding and Reproducing Heisenbugs in Concurrent Programs, http://www.cs.ucr.edu/~neamtiu/pubs/osdi08musuvathi.pdf

gavinray · on June 22, 2020

Holy smokes, that Z3 prover for Nim programs with inline invariant conditions is mindblowing, thank you for sharing this.

Also, your username looked familiar -- you're the creator of Arraymancer and Laser!!

You wrote the most performant, memory-efficient fast-math libs in existence currently (as far as I know) and your work on data-parallel operations is super fascinating.

mratsim · on June 22, 2020

Thanks for the kind words, I think that crown is for MKL-DNN / OneDNN though, for now ;), and in the deep learning domain.

I also expect Halide[1], Taichi[2], DaCe[3] and Tiramisu[4] to be able to more easily reach those levels of performance hence why I want to add a compiler to Laser (implemented at compile-time via Nim macros) but prerequisite is excellent and composable multithreading support which OpenMP didn't provide.

In statistics / data analysis, my PCA though (Principal Component Analysis) beats all implementations I've seen from Scipy, Facebook and R.

[1]: https://halide-lang.org/

[2]: https://github.com/taichi-dev/taichi

[3]: https://github.com/spcl/dace

[4]: http://tiramisu-compiler.org/

gavinray · on June 22, 2020

I had seen Taichi (still not entirely sure what "Differentiable Computing" is!) and Tiramisu before but not Halide or DaCe. Neat =D

For whatever it's worth, Arraymancer was #1 on Matmul benchmark (no claims to it being a realistic measure of realworld implementation) until what must have been only the last few weeks:

https://github.com/kostya/benchmarks#matmul

At some point recently that D lang library snuck in the top spot.

agentultra · on June 22, 2020

I was just coming here to say exactly this except that I would argue that writing it in TLA+ would have been shorter and easier to reason about.

Deep embedding of specifications with model checkers is a really neat idea. LiquidHaskell is a really interesting approach and it would be neat to see if Rust could adopt something similar.

mratsim · on June 22, 2020

I did mention TLA+ (first link) or what were you referring to?

agentultra · on June 22, 2020

> And that's where model checking / formal verification is a tremendous help (or would be if it was easier to use)

I assumed that statement was implying that understanding this problem in Rust is easier than in TLA+ which I disagree with.

I think TLA+ can state the problem and demonstrate it more clearly.

Apologies if I misunderstood! Text. I agree with your comment overall as a formal methods nerd/evangelist. :)

mratsim · on June 22, 2020

I was actually arguing that the Rust community should definitely look into formal methods, in particular TLA+ and Ada/Sparks.

The borrow-checher/free-of-data-races/fearless-concurrency taglines are only focused on memory bugs and unfortunately the community is still focused on memory bugs as well.

Thankfully, it didn't prevent the Tokio team that had to deal with concurrency bugs (beyond memory safety) to write a model checker for Rust code called Loom https://github.com/tokio-rs/loom.

Nethertheless this proves that Rust should grow beyond the fearless concurrency/borrow checker narrative because concurrency brings a lot of problems and memory/lifetimes are only part of the equation.

agentultra · on June 22, 2020

100% agree. I didn't know about Loom but I have been impressed by the adoption of TLA+ in different parts of Rust development.

devit · on June 22, 2020

Seems like the solution is to wait for the requests to complete, at least up to the point where a "slot" is reserved in the sequence order of a single actor.

Vanit · on June 22, 2020

Heh, reading this is reminiscent of the shortcomings of Typescript when external data is mistyped.

Ps I love Typescript.

crazypython · on June 22, 2020

A good theoretical understanding will never replace a compiler that yells at you for making a mistake in that theoretical understanding. Learn teh theory instead.