Faster Backtraces for Native Applications

js2 · on Sept 18, 2014

There are a couple excellent reporting frameworks, PLCR and breakpad. What's sorely missing is a really good backend.

There's commercial options in crittercism, crashlytics, bugsense, and hockeyapp, but these are either mobile only or subpar in what they provide.

And there's really not any decent opensource backend. Mozilla Socorro is ugly and tied to their infrastructure in many ways. The closest I can think of is Sentry... but it's primarily designed for handling tracebacks from backend apps. (Oh, I guess there's also squash.io)

Anyway, I'm curious to see what you're building on the backend, and if you intend to opensource it in addition to what you plan to offer commercially.

sbahra · on Sept 18, 2014

Will reveal the backend soon but we do not have plans to open-source it as of yet.

BruceM · on Sept 18, 2014

It would be useful to know the commands that were run for gdb, lldb, and so on to do our own comparisons.

I also have need of being able to run a program under a harness that, should it crash, lets me get the stack and other details. Bonus points if I get to set up my own code to help decode and pretty print my own data types. But for my purposes, that needs to be something under a more open license which is why I've been working with LLDB to date.

sbahra · on Sept 17, 2014

Appreciate any feedback to the early access core software and we're all ears on feature requests. Let us know if you encounter any sub-optimal output and we'll fix it, hope some of you guys find this useful and live a more manageable life :-P

landonjf · on Sept 17, 2014

As the author of a crash reporting framework, this is the first time I've seen execution speed of a frame unwinder so strongly touted :-)

Without code, I can't make much of a comment otherwise, although I'm certainly interested to see your patches to libunwind once it's upstreamed.

One of the reasons we didn't use libunwind in PLCrashReporter (and instead, wrote our own DWARF unwinding code) was the relative difficulty we saw in porting libunwind to Mach-O and the Mach thread/VM APIs, as compared to the cost in porting our relatively platform-neutral code to other platforms.

The analysis side is super interesting too, and it'll be great to see more of what emerges from your work. I can't think of much that's been published on that front other than Microsoft's overview of their Windows Error Reporting heuristics:

  http://www.sigops.org/sosp/sosp09/papers/glerum-sosp09.pdf

Congratulations on your release!

sbahra · on Sept 17, 2014

Come on now, you're the author of the crash reporting framework for mobile :-P Big fan of PLCrashReporter and your work with it, the mobile error reporting market has much to thank for it. I've looked at some of your code and definitely appreciated the cleanliness of it.

On performance: The execution speed being touted here is not that of a frame unwinder and we're building something that goes beyond this in scope. Our core client-side technology is a debugging library optimized for tracing. Unwinding was only a bottleneck for some specific ridiculous workloads and usually the bottleneck is elsewhere if program structure is being parsed (and this is where the general purpose debuggers end up taking up so much time).

There are applications out there that are so complex that traditional debuggers are simply infeasible (imagine 30 minute+ backtraces). However, what we're excited about are all those spare cycles we get to make use of...

On libunwind: Yes. We'll be targeting some exotic platforms, and this is where libunwind definitely helps. It definitely lacks in file format abstractions but we support multiple unwinding backends for this reason, it will be painful but not too painful.

Thanks for the SOSP link, looks interesting. I agree, this is an area that has definitely been neglected by academia.

We're very excited for the first release of our platform and I'll keep you posted on it, your feedback would be great and we think we've come up with very useful technology.

kevingadd · on Sept 18, 2014

So something I don't understand that doesn't seem to have an explanation in the linked article:

Why does the speed of backtrace generation matter for handling/reporting crashes? If the process is dead, I don't see how it taking an extra 50ms (if that) to tear down makes a big difference - especially since crashes should be an exceptional case, not a common one.

Is the optimization here actually intended to enable more accurate, less intrusive realtime profiling, or something like that? Otherwise I'm having a hard time understanding how optimizing for wall-clock time here is actually a useful exercise (even if it is very interesting)

sbahra · on Sept 18, 2014

1) What's most interesting to us is that with all these spare cycles mean we can start doing some computationally expensive analysis as part of crash reporting and can even do it at scale. However, this type of analysis requires a very efficient tracer. We will provide updates on the latter in an upcoming post.

2) The speed and efficiency of backtrace generation can affect recovery times. It's far more than 50ms for a lot of server-side or embedded applications.

3) Large programs today cannot be debugged feasibly (as in, good luck in generating a detailed crash report, your system will likely not have the resources), especially if they're time sensitive as well (tracing is the typical approach there). There are engineers out there who have to spend hours just to extract a small memory dump from a single thread.

4) Certain classes of bugs are best observed over time and minimizing jitter is important (more on that later as we unveil some features of the advanced tracer and user interface).

Less intrusive real-time profiling is interesting to us, but currently only in the context of state leading to a fatal bug (this also includes bugs that involve hanging such as an infinite loop). The technology does have applications for performance management but this isn't something we are focusing on at the moment.

peterfirefly · on Sept 18, 2014

It matters if it takes seconds to run, especially if the UI around the crash reporting/data gathering isn't too great, which it often isn't.

sbahra · on Sept 19, 2014

Thanks for all the feedback people. We've released a new version with robust handling of attachment failures, additional DWARF features and performance improvements (up to 35% on targets with lots of shared objects).