
What is an example of the "per-cpu data" they're talking about here[0]?

[0] https://github.com/torvalds/linux/blob/d82991a8688ad128b46db...




Mathieu's (the patch author's) main motivation was removing atomic operations from the LTTng userspace tracer's hot path.

In that case, the per-cpu data is the 'reserve' and 'commit' counters that must be updated when the tracer saves an event to the per-CPU buffers.

Other uses that I'm aware of include memory allocators that maintain per-CPU arenas.


I suspect that something like a heap implementation could use this. For concurrency, you want different cores to use different pools to avoid atomics. In practice, this means per-thread pools are used today, but this rseq feature seems like it would allow using per-core pools instead. That would save memory and probably be even better for cache locality when a core is shared by multiple threads.


I use higher-level APIs built on top of restartable sequences. Here's my understanding (could be wrong):

> I suspect that something like a heap implementation could use this.

Indeed. Let's say you want to have lots and lots and lots of threads, as described in the video schmichael linked. [0] Per-thread malloc pools become less attractive:

* too empty (lots of contention for the global pool), or
* too full (lots of wasted RAM, probably poor CPU cache utilization as well), or
* lots of sloshing

More generally, people sometimes do per-thread stuff to avoid lock contention. Some types of state might be reasonable to keep per-thread when the program is written in a thread-per-core / async style, but might not be if it's written in a thread-per-request / sync style. It might use too much RAM. And if you ever have to access _all_ the threads' state (say, if you are keeping counters for a monitoring system: increment just the current thread's counter on write; sum them all on read), that path might get ridiculous. So per-CPU might work better.

Per-CPU stuff doesn't require restartable sequences. You can just use the CPU number to decide which shard to access, then lock it or use atomics as you would with global state. You get less lock contention and cache-line bouncing. (Alternatively, you might get some of these benefits by picking a shard randomly, if the RNG is cheap enough. Or by using a counter.)

Restartable sequences let you entirely avoid atomic operations for per-cpu stuff.

[0] https://www.youtube.com/watch?v=KXuZi9aeGTw


If you are looking for examples of per-cpu data structures using rseq, see the selftests I implemented here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

There are examples of per-cpu counters, per-cpu spinlocks, per-cpu linked-lists, and various forms of per-cpu buffers.


AFAIK it means the kernel guarantees that the block of code will not be interrupted by the scheduler and executed on another CPU/core


Sort of. It looks like if some critical code is interrupted by a CPU move, then the process gets that code restarted. The idea is that since, in some cases, CPU moves are rare, you aren't paying the expense of locks to guard against them; you just let the OS tell the process, "hey, I moved you while you were doing something critical, let me restart that for you." This isn't a full solution for all process locks, but for a number of locking scenarios, restarting the code is all that is needed.


Kind of like optimistic concurrency? "Let's try; we'll retry on failure."


It doesn't look like it's a full retry, at least not for critical sections.

From what I can tell, you do three things:

A) you can define a critical section which must run atomically with respect to CPU core moves

B) if the critical section is interrupted, you can define a callback or restart the section (meaning the section must be safe to repeat)

C) you can define a single operation to commit (i.e., updating a pointer)

You can certainly build a "retry on failure" method using this; CPU moves are rare, so it's unlikely to fail the second time.


Efficient access to thread-local data, maybe. https://github.com/golang/go/issues/8884



