Properly testing concurrent data structures (matklad.github.io)
200 points by asicsp 10 months ago | 30 comments



I've been building a library in Rust with a very similar approach, called Temper [1], although in order to model the bizarre implications of the full Rust memory model you have to go quite a bit further, with bookkeeping that tracks which writes each thread has perceived. Based on atomic memory orderings, read and write fences, etc., perceiving write X may guarantee that write Y must also be perceived.

It contains what I believe to be the largest set of test cases for the C++/Rust memory model: essentially everything I could find in books, the C++ standard, Stack Overflow, blogs, etc. For example, this is the file for Rust Atomics and Locks by Mara Bos [2].

Loom [3], referenced in the article, is a similar and much more complete library, which enables you to exhaustively test higher-level constructs like mutexes and queues, although it doesn't model the memory model as exhaustively as Temper (I've been meaning to port my test cases over to it).

I was inspired by a FoundationDB testing talk by Will Wilson [4], who is now over at Antithesis [5], building a hypervisor-based solution that performs this style of testing on arbitrary Docker containers.

I strongly believe we're going to see much more in this domain in the coming decade. WebAssembly sits at the perfect intersection: a platform complete enough to compile arbitrary software to, but simple enough that building something like Antithesis isn't a half-decade-long project for an elite team that's already shipped a database.

[1] https://github.com/reitzensteinm/temper/tree/main [2] https://github.com/reitzensteinm/temper/blob/main/memlog/tes... [3] https://github.com/tokio-rs/loom [4] https://www.youtube.com/watch?v=4fFDFbi3toc [5] https://antithesis.com/


Very cool! I implemented some shared-memory atomic snapshots in Rust [0] and also did my best to take automated testing very seriously. I started out using loom [1], the library mentioned in the article, but later switched to shuttle [2].

Shuttle's approach is randomized rather than exhaustive like loom's, but the scheduler still gives probabilistic guarantees about finding bugs. I found that shuttle was faster and scaled to more complicated test scenarios.

Similar to the article, shuttle also lets you save the random seed if a particular schedule causes the test suite to fail. Being able to quickly reproduce failing tests is really important and enables you to write explicit test cases for bugs that were previously caught and fixed [3].
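For flavor, a randomized shuttle test might look roughly like this (a sketch based on shuttle's `check_random`/`replay` entry points; the scenario itself is made up and details may differ from the real API):

```rust
use shuttle::sync::{Arc, Mutex};
use shuttle::thread;

fn main() {
    // Explore 1000 randomly chosen schedules. On failure, shuttle prints an
    // encoded schedule that can be fed back via shuttle::replay to reproduce
    // exactly the interleaving that broke the test.
    shuttle::check_random(
        || {
            let lock = Arc::new(Mutex::new(0u32));
            let l2 = Arc::clone(&lock);
            let t = thread::spawn(move || {
                *l2.lock().unwrap() += 1;
            });
            *lock.lock().unwrap() += 1;
            t.join().unwrap();
            assert_eq!(*lock.lock().unwrap(), 2);
        },
        1000,
    );
    // shuttle::replay(|| { /* same body */ }, "<schedule string>") re-runs
    // a single failing schedule deterministically.
}
```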

[0] https://github.com/kaymanb/todc/tree/main/todc-mem [1] https://github.com/tokio-rs/loom [2] https://github.com/awslabs/shuttle [3] https://github.com/kaymanb/todc/blob/0e2874a70ec8beed8fae773...


JetBrains’ Lincheck[0] is a good library in the Kotlin/Java world for this stuff. I especially like that it’s declarative, and also the way it outputs the linearizability results.

[0]: https://github.com/JetBrains/lincheck


This is great. Is there a loom-like library for C++? I have a set of lock-free data structures that I would like to test.


Yes, (IMHO) the easiest one to use is the Relacy Race Detector (https://github.com/dvyukov/relacy and https://www.1024cores.net/home/relacy-race-detector).

It's been around a while and is easy to work with. Written by Dmitry Vyukov, an expert in the concurrency world.


Relacy is notable for being the only one in C++ that knows what a fence is, at least the last time I looked.

It told me a structure might not make forward progress under some thread interleaving, i.e. if only one thread ever runs, the others don't do anything. That's true but uninformative. I wasn't able to coax it into using a vaguely fair scheduler to get more useful information out of it.


32 bit only, no clang support :(


I think he implemented this inside Go


Folly has DeterministicSchedule, which also wraps atomics and is used to test Folly's core synchronization primitives, but I don't think it's as sophisticated as loom.

https://github.com/facebook/folly/blob/main/folly/test/Deter...



If I understand this correctly, this approach has limitations regarding soft forward-progress guarantees.

Consider a cmpxchg loop where the body computation is not quite trivial, but on real hardware (and with real schedulers) is exceedingly unlikely to be interrupted on a given CPU. Thus, if `n` is the number of CPUs, it has a worst-case `1/n` chance of making progress. But with this testing approach, it instead has a `1/t^p` chance, where `t` is the number of tasks (which may be far larger than the number of CPUs) and `p` is the number of pauses (easily at least 3) within the not-quite-trivial loop body; this is sufficient to turn a working algorithm into a broken one.
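Concretely, the kind of loop in question looks something like this (a generic std-only sketch, not code from the article; a simulator that injects a pause at every atomic access makes losing the race in the loop body far likelier than on real hardware):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// A CAS retry loop: read, compute a new value, attempt to publish it.
// On real hardware a thread is rarely preempted mid-body, so retries are
// rare even when the body is not quite trivial.
fn increment(counter: &AtomicU32) {
    let mut cur = counter.load(Ordering::Relaxed);
    loop {
        let next = cur + 1; // stand-in for a "not quite trivial" computation
        match counter.compare_exchange_weak(cur, next, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return,
            Err(observed) => cur = observed, // lost the race: retry with the fresh value
        }
    }
}

fn main() {
    let counter = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1000 {
                    increment(&c);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Lock-free: whenever a CAS fails, some other thread's CAS succeeded,
    // so every call eventually lands exactly one increment.
    assert_eq!(counter.load(Ordering::Relaxed), 4000);
    println!("{}", counter.load(Ordering::Relaxed)); // prints 4000
}
```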

OTOH, if you do want the tester to detect soft forward-progress as a bug (because you want hard forward-progress), this doesn't seem to provide useful tools either.

(Regardless this is certainly useful for a lot of concurrency problems).


> 1/t^p

I don't think that's right. It's just 1/t: after t steps, one task must have made progress, and since there are t tasks, the probability that I'm the task that made progress is just 1/t.

The primary point of confusion, I think, is that getting interrupted does not mean you are necessarily going to lose the CAS.


> Well, if I am being honest, there is a bit of up-front knowledge here. I don’t think we can avoid spawning real threads here, unless we do something really cursed with inline assembly. When something calls that pause() function, and we want it to stay paused until further notice, that just has to happen in a thread which maintains a stack separate from the stack of our test.

Is there a reason we couldn't use some kind of async runtime? It seems like you're instrumenting atomic operations to achieve cooperative multitasking. Maybe I need to drink more coffee, but it seems simpler without threads.


It would be convenient to use async, but the other requirement is that we don’t want to change the outwardly-observable API of the software under test. Since async is “infectious,” we have to use sync implementations for sync APIs.


What about stackful coroutines? They don't necessarily change the API.


Why does stackfulness not color functions? So long as preemption/parallelism boundaries are manually annotated, I don’t think stackful coroutines save you anything here.


Just don't annotate them? The design of async is a wound that can be bandaged by dumping all your registers on the stack and jumping.



One downside of this approach is that the tested code itself has to be modified to accommodate the testing code.

I think the same could be achieved by launching two threads and single stepping them with ptrace to "randomly" interleave the execution of their instructions. Something like rr's chaos mode.

Some instructions may not be atomic though, so we would need a way to single step on "atomic microcodes" if that's even possible without emulation?


Sounds like Antithesis's hypervisor.


It looks like you need to use conditional compilation to use Loom, which probably works okay for testing one library, but is pretty intrusive:

    #[cfg(loom)]
    pub(crate) use loom::sync::atomic::AtomicUsize;

    #[cfg(not(loom))]
    pub(crate) use std::sync::atomic::AtomicUsize;
I wonder if there are any languages that do a better job of letting you use your own scheduler?
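One common way to confine that noise (a sketch; the `sync` module name is just a convention, not something loom requires) is to route every such import through one internal shim module, so the rest of the crate never mentions loom:

```rust
// Internal shim: the only place in the crate that knows about cfg(loom).
pub(crate) mod sync {
    #[cfg(loom)]
    pub(crate) use loom::sync::atomic::{AtomicUsize, Ordering};

    #[cfg(not(loom))]
    pub(crate) use std::sync::atomic::{AtomicUsize, Ordering};
}

// All other code imports from crate::sync and compiles unchanged
// under both the std and the loom configurations.
fn bump(n: &sync::AtomicUsize) -> usize {
    n.fetch_add(1, sync::Ordering::SeqCst)
}

fn main() {
    let n = sync::AtomicUsize::new(41);
    bump(&n);
    assert_eq!(n.load(sync::Ordering::SeqCst), 42);
}
```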


In C# it's basically automatic: https://github.com/microsoft/coyote/


I suppose if you want to be really thorough, you could run the tests with ptrace and single-step the threads to produce different interleavings at the instruction level. Has anyone seen that sort of thing done? Are there any alternatives for black-box testing, if you are not able to instrument the code as is done here?


I've done that for testing async signal handlers, but the combinatorics are much more favorable there. If the base thread runs n instructions, you just need n runs through that run 0..n instructions before inserting the signal, and then the signal handler runs to completion and then so does the base thread. O(n^2) total time. But with t threads each with n instructions to run, and they can interrupt each other at any boundary...it's not approachable for reasonable values of n. You need to reduce by singling out the operations with interesting behavior and simulating, I think.


This is really neat, I'll have to try it out. It won't catch all classes of errors, though: won't each call to pause() cause a synchronization between the threads and hide some data race issues? Maybe this is a nonissue in Rust.


I was stuck on this exact question a few weeks ago. Does anyone know how the same could be done in Python? Could a thread class be made that allows this kind of testing? Could it be written in C with e.g. pybind?


I thought Rust was thread-safe and I don't see any "unsafe" blocks. What am I missing?


From https://doc.rust-lang.org/nomicon/races.html :

"Safe Rust guarantees an absence of data races, which are defined as: two or more threads concurrently accessing a location of memory, where one or more of them is a write, and one or more of them is unsynchronized. A data race has Undefined Behavior, and is therefore impossible to perform in Safe Rust. [...] However Rust does not prevent general race conditions. This is mathematically impossible in situations where you do not control the scheduler, which is true for the normal OS environment. [...] For this reason, it is considered "safe" for Rust to get deadlocked or do something nonsensical with incorrect synchronization: this is known as a general race condition or resource race. Obviously such a program isn't very good, but Rust of course cannot prevent all logic errors. In any case, a race condition cannot violate memory safety in a Rust program on its own. Only in conjunction with some other unsafe code can a race condition actually violate memory safety."


Losing updates to the atomic counter is thread-safe behavior, it's just a logic bug.


Specifically, one thread reads the counter with proper synchronization, then another thread writes an incremented value to the counter with proper synchronization, then the first thread writes its own incremented value to the counter with proper synchronization. At every step the use of an AtomicU32 guarantees proper synchronization to the underlying memory, which is what Rust is concerned with.

The fix for the logic bug in this case would be to indicate that you want the increment itself to be an atomic operation, using the fetch_add method: https://doc.rust-lang.org/std/sync/atomic/struct.AtomicU32.h...
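To make the difference concrete, here is a std-only sketch (my own example, not the article's code) of the fixed version, with the buggy load-then-store shape left as a comment:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicU32::new(0));

    // Buggy shape: a separate load and store. Each access is atomic, so this
    // is data-race free, but the read-modify-write as a whole is not, and
    // another thread can slip a write in between:
    //     let v = counter.load(Ordering::SeqCst);
    //     counter.store(v + 1, Ordering::SeqCst); // may overwrite an update

    // The fix: make the increment itself a single atomic read-modify-write.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1000 {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(counter.load(Ordering::Relaxed), 4000); // no updates lost
}
```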



