This is very cool stuff. Keno is a great rr contributor, and I'm grateful for his Pernosco shout-out :-).
This sort of capability has always been part of my vision for rr/Pernosco. I want to see more projects like this doing record-and-replay capture of bugs in the field. With more projects depending on this sort of infrastructure, we can get more attention from OS/hardware people to make the record-and-replay systems better, and then I think we can reach a tipping point where it becomes best practice. Eventually we should be looking back and wondering why, as an industry, we didn't go in this direction earlier.
No, thank you for all your work on this :). Building something like rr is only possible with some really solid engineering, and I think you managed to accomplish that. I think we're aligned on the vision here. I bring up rr every time I speak to hardware people, but of course the first question they ask is "how many people need this right now?". With this, I'm hoping to increase that answer by an order of magnitude or so ;).
If you don't mind me asking, you mention that you want more attention from OS/hardware people. What sort of things are you looking for from OS/hardware people that would make replay systems better?
As for a more technical question on implementation: assuming the post accurately represents rr's performance characteristics, what do you think is keeping the overhead so high in the average case described in the post (50%)?
Disclaimer: I work for a company that sells similar technology.
-- Virtualization of rr's required PMU features on Azure and GCP (see the counter sketch after this list)
-- Virtualization of Intel's CPUID faulting on AWS and other cloud providers (Linux KVM has it, but it's not enabled in AWS)
-- Fix the reliability bug(s) in the retired-conditional-branches counter on AMD Ryzen so rr works there
-- Implement CPUID faulting in AMD Ryzen
-- Reliable instructions-retired counter on AMD/Intel (for more reliable/simpler operation)
-- Trap on LL/SC failures on ARM (AArch64 I guess) and other archs (so we can port rr there), with Linux kernel APIs to expose that to userspace
-- Similar traps for Intel HTM (XBEGIN/XEND) (less important now that it's mostly been disabled for security reasons)
Bigger items:
-- Invest in kernel-integrated record-and-replay for Linux and other OSes
-- Implement QuickRec or some other hardware support for multicore record-replay on at least some CPU SKUs
(If anyone thinks they can help with any of that, talk to me!)
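To make the PMU items a bit more concrete: below is a minimal sketch, in C, of opening a per-thread branch counter through Linux's perf_event_open. To be clear, this is not rr's code. rr programs a CPU-specific raw event for retired conditional branches and relies on it being exactly deterministic; the generic hardware branch event here just shows the plumbing that a hypervisor has to expose (and the silicon has to count reliably) for rr to work at all.

    /* Minimal sketch (not rr's code): open a per-thread hardware branch
       counter with perf_event_open and count branches across some work. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static long perf_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                          int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;  /* count user-space branches only */
        attr.exclude_hv = 1;

        int fd = perf_open(&attr, 0 /* this thread */, -1, -1, 0);
        if (fd < 0) {             /* e.g. PMU not exposed by the hypervisor */
            perror("perf_event_open");
            return 1;
        }
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (volatile int i = 0; i < 1000000; i++) {}  /* branchy work */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("branches counted: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

On a guest without PMU virtualization, that perf_event_open call typically just fails, which is why the Azure/GCP item is at the top of the list.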
> what do you think is keeping the overhead so high in the average case described in the post (50%)?
I assume you've read https://arxiv.org/abs/1705.05937 ? For many workloads the major unavoidable cost is context switching. With rr's approach, every regular tracee context switch (intra- or inter-process) turns into at least 2 inter-process context switches (tracee to rr, rr to tracee). This is bad for thread ping-pong-like behaviour. For many other workloads the cost is simply the loss of parallelism as we force everything onto a single core. For other workloads there is a high cost due to system calls that require context switches to rr because we don't currently accelerate them with "syscall buffering". The last one can mostly be engineered around; the first two are inherent to rr's approach.
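To make the round-trip cost concrete, here's a minimal plain-ptrace tracer in C. This is not how rr records syscalls (rr injects a preload library plus seccomp filters so common syscalls are recorded in-process, the "syscall buffering" mentioned above), but it shows where the switches come from: every syscall entry and exit stops the tracee, wakes the tracer, and then has to switch back.

    /* Minimal plain-ptrace syscall-stop loop (sketch, not rr). Each tracee
       syscall produces an entry stop and an exit stop, and each stop is a
       switch to the tracer and back. Error handling omitted for brevity. */
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execlp("ls", "ls", (char *)NULL);
            _exit(127);
        }
        int status;
        waitpid(child, &status, 0);           /* initial stop after exec */
        while (!WIFEXITED(status)) {
            /* Resume until the next syscall entry or exit; the kernel then
               stops the tracee and wakes us, so one tracee syscall costs at
               least two tracer round trips. */
            ptrace(PTRACE_SYSCALL, child, NULL, NULL);
            waitpid(child, &status, 0);
        }
        return 0;
    }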
This is my first time hearing about rr, and I'm very curious. Your comment below about increasing generality/performance with bpf is very interesting.
Can you share a little bit about how some of these features (looking at the first one for example) impact rr's ability to run in cloud environments? Does it work but with performance degradation? Without performance degradation but missing some features? I don't work at quite a low enough level to look at this list and quickly grasp the high level takeaway.
rr works today on cloud instances that have hardware performance counters available, on Intel CPUs. For example, it works on a variety of AWS instances that are large enough that your VM occupies a whole CPU socket; c5(d).9xlarge or bigger works well.
The performance impact of virtualization on rr seems small on AWS, small enough that we haven't tried to measure it.
On AWS rr is mostly feature complete. The only missing feature is CPUID faulting, i.e. the ability to intercept CPUID instructions and fake their results. This means that taking rr recordings from one machine and replaying them on an AWS instance with a different CPU does not work. (The other direction does work.)
(Pernosco uses AWS, but we have a closed-source binary instrumentation extension to rr replay that lifts this restriction.)
As I mentioned above, there's no technical reason AFAIK why AWS could not virtualize CPUID faulting; regular Linux KVM supports this.
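For the curious, the kernel interface involved is tiny. Here's a minimal probe, assuming x86-64 Linux 4.12+; whether it succeeds depends on the CPU supporting CPUID faulting and the hypervisor exposing it, which is exactly the bit AWS doesn't currently do.

    /* Sketch: probe CPUID faulting via arch_prctl (x86-64 Linux >= 4.12).
       When supported, ARCH_SET_CPUID(0) makes any subsequent CPUID
       instruction in this thread fault, which a recorder/replayer can trap
       to substitute recorded results. On hardware or guests without the
       feature the call fails (errno varies, e.g. ENODEV). */
    #define _GNU_SOURCE
    #include <asm/prctl.h>      /* ARCH_SET_CPUID */
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>
    #include <errno.h>

    int main(void) {
        if (syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0) != 0) {
            printf("CPUID faulting unavailable: %s\n", strerror(errno));
            return 1;
        }
        printf("CPUID faulting enabled for this thread\n");
        /* Re-enable normal CPUID before doing anything else. */
        syscall(SYS_arch_prctl, ARCH_SET_CPUID, 1);
        return 0;
    }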
It's already being used :). That said, I've talked to the upstream kernel maintainers about the ability to attach a BPF program to ptrace, which could possibly improve performance for the ping-pong case if made sufficiently general. That's also the crux though, since sufficiently general probably means eBPF, which is still not considered safe for unprivileged use cases.
I have done some experiments, and it's reasonably possible to cut the average overhead to about 20% or so with known techniques. That said, I'm surprised you'd consider this high. I know of no system, including commercial ones, that can even get close to these kinds of low overhead numbers. Heck, I don't even know of one (other than rr) that actually works on applications of this complexity. If you have one, you should advertise it significantly more heavily ;).
Well, if you use hardware-assist you can get 0% overhead with shared memory and without serializing cores. Unfortunately, only some chips expose the correct hardware.
In terms of other publicly available systems, UndoDB claims properties similar to rr's and appears to use the same general approach, based on talks I have seen from its creator, but I have no first-hand information on its actual properties.
If only advertising would help more. Unfortunately for all companies in this business, few companies are willing to invest even a tiny amount in developer productivity, let alone properly value it. For instance, do you know of any company that spends more than $20k/year per developer on tools to enhance software developer productivity? To go even further, how many companies would even consider spending the astronomical sum of $20k/year per developer on tools? Yet a fully burdened developer is ~$200k/year, so, using the most basic business analysis of pure cost savings (which undervalues the benefit to a mind-boggling degree), $20k/year is a mere 10% cost increase, but for some reason it is viewed as an astronomical and ridiculous price. In other sectors, businesses easily spend significant percentages of salary cost on tooling: EDA tools for EEs can run $50k/year, trucks for truck drivers amortize out to $30-50k/year, etc. Obviously you need to justify a 10% cost increase with an appropriate return, but the fact that few companies even bother to do that analysis is, in my opinion, the primary limiting factor on adoption of the techniques and technologies we are discussing, since nobody values them.
What hardware supports shared memory record-and-replay in the presence of data races? I'm not aware of any widely available hardware that does this.
UndoDB's design is similar to rr's, but they use binary instrumentation at record time, so it has higher overhead than rr and a more complex implementation in some ways. On the other hand, they don't depend on performance counters, so they work on more architectures (AMD, ARM) and in any virtual guest.
It's certainly true that it's hard to get people to pay for debugging tools. On the other hand, some big-name companies spend tons of money on internal tooling, so at least some companies are willing to spend on developer productivity in general. I think certain classes of tools have "traditionally" been free, that mindset is hard to change, and the majority of companies are reluctant to spend money here. I don't have great ideas to tackle this other than "keep making tools better and better until the wall cracks".
If you want to talk more off the record, feel free to contact me.
An example is ARM systems that include an ETM component supporting data tracing. You connect an external trace probe, usually via JTAG, to receive the trace events. Reconstruction then typically uses the recorded read/write and control-flow state. This has the advantage of not requiring a recording from the very beginning of execution, since you can correctly work backward from the final state, but it requires higher data bandwidth and storage.
Intel PT has really low overhead, from my experiments. The difficulty then is the size of the trace file and your ability to stream it to disk or the network. There is support for it in core dumps now, and gdb can do some time travel if it's present.
The 50% is average overhead and is basically entirely due to context-switch overhead and execution serialization. If the execution is entirely in userspace, rr's overhead is basically 0. I don't think PT will help here. That said, rr's performance on Intel chips is entirely acceptable for single-threaded code. The big asks would be other architectures or, as roc mentioned, something like QuickRec for efficient multi-threaded recording.
PT doesn't get you close to what rr can do, which is reconstruct any program state at any time in the recorded execution.
As Keno says in his blog post, the promise of rr is that if you have an rr recording, you are almost completely assured of having enough information to figure out the bug. That is what we see in practice, and it has big implications for developer workflow.
Hi all, I'm pretty excited to get this out there for you. I published this this morning so that I could be around through the day to answer any questions, but things were a bit delayed by the HN algorithm gods (thanks dang for rescuing it ;) ). That said, I'll check in periodically for the next hour or two if there are any questions I can answer.
I recently spent two weeks, on and off, hunting down a bug on a platform that didn't support `rr`. I'm fairly confident that if I had had `rr` available, it would have taken me a couple of hours at most.
Being able to run backwards from the point of failure and understand where a value is coming from is very powerful.
Having this available in Julia directly is great, and will make it much easier to get bug-reports from users.
I'm surprised I haven't heard of rr before; it sounds like it could be a game changer for debugging many types of problems. How long has this project existed/been usable?
Am I correct in understanding that rr can be used with any application (i.e. the application doesn't have to be built specifically to support it)? That's the impression the usage introduction on the website gives: https://rr-project.org/.