Is it only me? A github link instead of a kernel.org one?
Maybe it's just an effort to reduce the HN death hug, and every git repo is an identical clone after all, yet I can't help but think that a link to the kernel tree should point to [0].
Github isn't (and never was) the only source of truth. It (sort of) became one because of us.
Replace "Github" with operating system, social network, default Win 95 browser, or any other commonly accepted default.
I guess this would be in line with Linus's description of GitHub:
> I think github does a stellar job at the actual hosting part. I really do. There is no question in my mind that github is one of the absolute best places to host a project. It's fast, it's efficient, it works, and it's available to anybody.
I'm not entirely sure this is the same thing - but if anybody knows where the patches for this are, please share - I have tried to contact the presenter to get them but have had no success thus far.
Could you clarify whether you think this is good, bad, or just interesting? Your comment might be read as implying that a change like this taking three years is somehow inefficient or tsk-worthy.
I'm not the OP, but I posted the video I remember watching 5 years ago with great interest.
I have no basis on which to inform an opinion on whether 5 years is a good or bad period of time to go from an idea to a syscall in the Linux kernel. I just find it interesting as someone who has switched jobs multiple times in 5 years -- much less worked on a single idea that long!
Well, to be fair to the authors, their idea has been shipping in Google's internal Linux kernel for years. It's only the general public that has been deprived of its benefit.
(disclaimer: I am the patch author, Mathieu Desnoyers) Just as a clarification, the idea originates from Google (full credit to Paul Turner and Andrew Hunter for it). However, the extra 3 years of work required to get it upstream was done by myself at EfficiOS.
I'm currently discussing with glibc maintainers the best approach to integrate this into the Linux userspace ecosystem. So far, discussions point toward a design where glibc would own the __rseq_abi TLS symbol and register it for every thread. I can then maintain an rseq library consisting of helper header files that contain the common rseq operations for all supported architectures.
I am concerned about providing a librseq that handles rseq registration for early adopters though, because I don't want projects to eventually end up conflicting with future glibc versions. Once we've settled how glibc will expose the symbol and register it, I will try to provide a helper library which exposes this symbol and allows performing explicit rseq registration in a way that won't conflict with future glibc versions.
> I am concerned about providing a librseq that handles rseq registration for early adopters though
Sounds very reasonable.
So at this point, as far as I understand it, FB and Google carry in-house rseq kernel and user space patches. Right? Are they on board with the mainline rseq? Will FB support rseq in jemalloc any time soon?
I've been in touch with FB. They are interested in using rseq for jemalloc. They have provided prototypes of jemalloc based on rseq, along with benchmarks helping me make the case for rseq mainlining.
I don't know whether Google will ever want to swap from their in house rseq implementation to the upstream Linux rseq, use both ABIs for a transition period, or simply keep using their own in-house rseq.
To get rseq upstream over those 3 years, I had to:
- Gather a list of desiderata, making sure to account for the use-cases targeted by everyone active in the rseq discussions. This is crucially important to keep discussions from spinning in circles between different requirements,
- Redesign the uapi/linux/rseq.h ABI, making sure a single TLS store is needed to enter a rseq critical section, without requiring any extra registers as ABI. I have introduced the "rseq_cs" structure as critical section descriptor to do this,
- Optimize arm32 and x86 rseq critical sections for speed, by creating my own benchmark programs,
- Rewrite the kernel rseq implementation a few times so it follows the kernel coding style and pleases everyone who cares about it,
- Present 2 talks about rseq at Linux Plumbers Conference,
- Go through various rounds of in person, email, and IRC discussions with Paul Turner, Peter Zijlstra, Andy Lutomirski, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Ben Maurer, Linus Torvalds, and many others. Those were very constructive discussions bringing up everyone's concerns with respect to this new system call,
- Extend the rseq selftests, adding new testing strategies such as delay loops between "steps" of the critical section, thus increasing the likelihood of generating preemption races,
- Figure out nasty races only happening on NUMA systems after about a full day of stress-testing,
- Provide solutions for debugger single-stepping "lack of progress" problem if rseq is used when retrying on abort. It's basically the cpu_opv system call I plan to propose for 4.19. Meanwhile, without cpu_opv, rseq can still be used in ways to guarantee forward progress, but the abort code needs to use a partitioning strategy rather than a simple retry (e.g. going to a different memory pool in case of abort for a memory allocator),
- Harden the rseq mechanism for security, by adding a "signature" word before the abort label,
- Implement prototypes of lttng-ust and liburcu which use rseq, gathering benchmarks to validate the approach,
- Write rseq and cpu_opv man pages.
And these are just the items that constituted "forward progress" in the rseq adventure. I'm leaving out all the attempts at making things more generic that had to be thrown away.
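For readers who want to see what the "rseq_cs" descriptor mentioned above actually amounts to, here is a simplified sketch of the proposed ABI structures; field names follow the uapi headers under discussion, but check include/uapi/linux/rseq.h in the merged kernel for the authoritative definitions:

```c
#include <stdint.h>

/* Critical section descriptor: a single store of this structure's
 * address into the per-thread rseq area is all that is needed to
 * enter a restartable-sequence critical section. */
struct rseq_cs {
        uint32_t version;
        uint32_t flags;
        uint64_t start_ip;           /* first instruction of the critical section */
        uint64_t post_commit_offset; /* length up to (excluding) the commit instruction */
        uint64_t abort_ip;           /* where the kernel redirects execution on abort */
};

/* Per-thread area registered with the rseq() system call. */
struct rseq {
        uint32_t cpu_id_start;
        uint32_t cpu_id;  /* current CPU number, kept up to date by the kernel */
        uint64_t rseq_cs; /* address of the active struct rseq_cs, or 0 */
        uint32_t flags;
} __attribute__((aligned(32)));
```

On preemption or signal delivery, the kernel checks whether the interrupted instruction pointer falls within [start_ip, start_ip + post_commit_offset) of the registered descriptor, and if so moves execution to abort_ip.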
Thanks for this detailed and somewhat overwhelming list. When I saw the first Paul Turner tech talk on rseq (it was called something else back then - at LPC, if I remember correctly), it seemed so simple, so obvious: "just read this memory address, and if there was an interruption, we have to retry".
But then of course real life is a lot more complex than slideware.
cpu_opv is new to me (no time for LWN these days), but it looks simple, elegant, and sort of obvious (again). Which makes me wonder why no one thought of it before. (But of course this is probably my ignorance speaking.)
This is too far above my level to understand directly - has anyone got an example of an application that will be affected by this? I see some comments about tracing below.
Some use-cases likely to be enhanced by rseq: statistics counters, memory allocators (jemalloc, glibc malloc, and others), user-space tracing (LTTng), user-space Read-Copy Update (liburcu), reading performance monitoring unit counters from user-space on ARM64, and possibly user-space task scheduling.
Also, just reading the current CPU number can now be done faster by reading the __rseq_abi->cpu_id field rather than calling sched_getcpu() through the vDSO.
Yes, there are indeed pieces missing for this use-case. I intend to push another system call for the next merge window (4.19): "cpu_opv" [1]. It stands for "CPU operation vector", which is needed to take care of moving user-level tasks around between per-cpu work-queues touched by rseq fast-paths in a way that is safe against CPU hotplug. It's also needed to migrate free memory between per-cpu memory pools modified by rseq fast-paths safely against CPU hotplug. Some of it can be approximated by setting cpu affinity, but it's racy against CPU hotplug.
cpu_opv can be used as a slow-path fallback in pretty much all scenarios where the rseq fast-path aborts.
rseq user-level APIs are pretty much limited to working only on the current CPU, whereas cpu_opv allows creating operations on per-cpu data structures [2] which take the CPU number as an argument. If it happens to be the current CPU, rseq can be used and it is fast; but if it is not the current CPU, or is an offline CPU, then cpu_opv takes care of performing the operation safely with respect to rseq critical sections and other cpu_opv operations.
I suspect that something like a heap implementation could use this. For concurrency, you want different cores to use different pools to avoid atomics. In practice, this means per-thread pools are used today, but this rseq feature seems like it would allow using per-core pools instead. That would save memory and probably be even better for cache locality when a core is shared by multiple threads.
I use higher-level APIs built on top of restartable sequences. Here's my understanding (could be wrong):
> I suspect that something like a heap implementation could use this.
Indeed. Let's say you want to have lots and lots and lots of threads, as described in the video schmichael linked. [0] Per-thread malloc pools become less attractive:
* too empty (lots of contention for the global pool) or
* too full (lots of wasted RAM, probably poor CPU cache utilization as well) or
* lots of sloshing
More generally, people sometimes keep state per-thread to avoid lock contention. Some types of state might be reasonable to keep per-thread when the program is written in a thread-per-core / async style, but might not be if it's written in a thread-per-request / sync style: it might use too much RAM, and if you ever have to access _all_ the threads' state (say, if you are doing some counters for a monitoring system: increment just the current thread's state on write; sum them on read), that path might get ridiculous. So per-CPU might work better.
Per-CPU stuff doesn't require restartable sequences. You can just use the CPU number to decide which shard to access then lock it or use atomics as you would with global state. You get less lock contention and cache-line bouncing. (Alternatively, you might get some of these benefits by picking a shard randomly, if the rng is cheap enough. Or a counter.)
Restartable sequences let you entirely avoid atomic operations for per-cpu stuff.
Sort of. It looks like if some critical code is interrupted by a CPU migration, then the process gets that code restarted. The idea is that since, in some cases, CPU migrations are rare, you aren't paying the expense of locks to guard against them; you just let the OS tell the process, "hey, I moved you while you were doing something critical, let me restart that for you." This isn't a full solution for all process locks, but for a number of locking scenarios, restarting the code is all that is needed.
The nature of the thing is that there's a chunk of per-architecture work required (maybe 20-30 lines of actual kernel code and another 700 lines of userspace/testcase support, judging by the diffstat). What's gone in to start with includes support for x86-64, i386, powerpc and 32-bit arm as well as the common code. Maintainers of other architectures can then add support on top of this at some point. Support for new syscalls not being perfectly in sync across all kernel archs is not uncommon, I think.
The other not-yet-there part is how userspace libcs are going to expose this usefully to applications; there's some interesting discussion on the kernel mailing list about that. (It needs some libc support because the kernel only supports a single registered restartable-sequence area, so if multiple libraries try to do it they'll tread on each others' feet; and libc will want to use this feature itself.)
A restartable sequence is defined by a start and an end offset. If I write a restartable sequence in C, I can't be absolutely sure that it's contiguous, and I can't know where the commit instruction is...
Does it mean that I have to write it in assembly to make sure that I know where the sequence starts and where it ends?
Indeed the restartable sequence critical section needs to be written in assembly. The idea is to keep this complexity within public headers implementing the common operations as inline assembly for all supported architectures. You can see such operations already implemented for x86 as part of the rseq selftests here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
I think C lost it at: abort must be possible to call from any place in the critical block. Since you can't be sure the registers haven't spilled to the stack, I don't think anything other than asm would work.
>The address of jump target abort_ip must be outside the critical region
Seems like an odd requirement, isn't the whole point that we want to restart the critical section, or at least part of it, over in the case that it's interrupted?
Not necessarily. Applications may choose to retry a certain number of times and fallback to another mechanism if the critical section can't complete.
A problematic case that was brought up on LKML was the interaction with debuggers and unexpected page faults, see this comment:
https://lwn.net/Articles/738119/
Genuine question: why not boot the machine with 'isolcpus' and place your tasks on those CPUs? Such tasks would not be subject to the vagaries of the scheduler, etc.
It all depends on how much control you have on the system you target. The strategy you refer to may well work for a dedicated deployment, but if you are developing a general-purpose memory allocator targeting a wide range of applications, you might not want to impose those constraints on your users.
You could still have N threads but only want M (M<N) CPUs. For example there are situations where the single threads are mostly I/O bound, but reducing the little CPU work they do allows you to pack more threads (and e.g. serve more clients) on the same number of CPUs.
Without restartable sequences, replacing per-thread data with per-CPU data had a trade-off: per-thread data can be accessed with less overhead, per-CPU data consumes less memory. Restartable sequences give you the best of both worlds.
The speedup numbers in the commit [0] are impressive on ARM but don't look very good on x86. On the other hand, they add a thread-local write to every context switch along with a bunch of code. Add to that only a 1.2x speedup on the LTTng benchmark.
To me, as a non-kernel guy, this doesn't sound very impressive.
The other responses were useful. Yours is just a put-down. Why bother? To discourage "middlebrow" comments? But the one you're responding to elicited useful information, so it's hardly a problem.
I found the articles on the "middlebrow dismissal" insightful. And I don't think ends usually justify means; I think the parent-most comment was not as nice as it should have been.
The parent comment is not so much a put-down as much as it calls out a particularly unsophisticated dismissal. The parent-most comment would have elicited the same useful information if it were phrased as a question, but would have been less uncouth. I think that's a perfectly fine thing to write a comment about.
[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...