So something I don't understand that doesn't seem to have an explanation in the linked article:
Why does the speed of backtrace generation matter for handling/reporting crashes? If the process is dead, I don't see how it taking an extra 50ms (if that) to tear down makes a big difference - especially since crashes should be an exceptional case, not a common one.
Is the optimization here actually intended to enable more accurate, less intrusive real-time profiling, or something like that? Otherwise I'm having a hard time understanding how optimizing for wall-clock time here is actually a useful exercise (even if it is very interesting).
1) What's most interesting to us is that all these spare cycles mean we can start doing some computationally expensive analysis as part of crash reporting, and can even do it at scale. However, this type of analysis requires a very efficient tracer. We will provide updates on the latter in an upcoming post.
2) The speed and efficiency of backtrace generation can affect recovery times. For a lot of server-side or embedded applications, the cost is far more than 50ms.
3) Large programs today cannot feasibly be debugged (as in, good luck generating a detailed crash report; your system will likely not have the resources), especially if they're time-sensitive as well (tracing is the typical approach there). There are engineers out there who have to spend hours just to extract a small memory dump from a single thread.
4) Certain classes of bugs are best observed over time and minimizing jitter is important (more on that later as we unveil some features of the advanced tracer and user interface).
Less intrusive real-time profiling is interesting to us, but currently only in the context of state leading to a fatal bug (this also includes bugs that involve hangs, such as an infinite loop). The technology does have applications for performance management, but this isn't something we are focusing on at the moment.