> You mean mode switch? Cheaper, but yes, still costly.
Hah, yes.
> Which means we can calculate the cost to be ~1.1 us per probe (on my system). Anyone know what XRay is clocking in at?
I'm not at Google any more to check (and haven't touched the code since 2012), but IIRC XRay overhead while tracing was something like 100 cycles per function (which includes an entry/exit pair). Hard to compare across different machines made years apart of course, but I think that a little over an order of magnitude difference sounds about right.
Ok, if the requirement is 100 cycles per function, then (for current kernels) you'll need user<->user tracing. So, AFAIK, an LTTng-like thing or uBPF. And LTTng does have such an order of magnitude difference: slide 25+ of http://www.efficios.com/pub/endusersummit2012/presentation-e...
I'd also look at what problems need solving, in case they can be solved without function tracing. If 100 cycles per function was needed, it sounds like a high rate of functions, and a lot could be done with just timer-based stack trace sampling.
Brendan, did you read this awesome blog post (which mentions your work btw) about Google's tracing framework which may explain the kind of problems they want to solve? http://danluu.com/perf-tracing/
"Sampling profilers, the most common performance debugging tool, are notoriously bad at debugging problems caused by tail latency because they aggregate events into averages. But tail latency is, by definition, not average."
I don't quite get it -- to me, that's like saying a hammer is notoriously bad at screwing in screws. And then going ahead and proving it. Um. Good content, but I don't get the premise. I wasn't using sampling profilers to study tail latency in the first place. Who would? (Maybe it's a different usage of "sampling profilers" than I'm familiar with.)
As for tail latency: one way is to dump every function event; a lower-cost way is to summarize latency in-kernel as a histogram. With both, you're still tracing every function, and care about overhead down to 100 cycles etc.
But think about what causes the tails in the first place. Lock contention? Trace that. Resource I/O? Trace that. If it costs 1 us to trace disk I/O, it's usually not a problem. I like to time scheduler switch events with stack traces -- a catch all (but a bit more expensive). Of course, these approaches require kernel instrumentation. :)
(Full disclosure: I'm Mathieu Desnoyers, part of the LTTng maintainer team.)
I would like to introduce a slightly less extreme point of view on "on-the-fly" aggregation of traces vs. tracing to a buffer followed by post-processing. I see from the current discussion thread that it's very much either one or the other, but I think combining the two approaches helps create much more powerful tools. On-the-fly aggregation based on trace instrumentation helps pinpoint latency outliers. Tracing to buffers, on the other hand, provides very detailed information about the system behavior that leads to those outliers. By using on-the-fly aggregation as a "trigger" to collect the tracer's in-memory ring buffers, one can investigate latency outliers with very small I/O overhead.
Currently, LTTng-UST has slightly higher overhead than 100 cycles per event (roughly 250-300 ns/event on a recent 2.4 GHz Intel CPU), which I expect is partly caused by the use of per-CPU buffers rather than per-thread buffers. I have contributed the membarrier system call, and I am currently working on adding restartable sequences and a cpu_id cache to the Linux kernel, so the speed of LTTng-UST can be brought closer to that of a tracer using per-thread buffers. Keeping per-CPU buffers ensures that the tracer uses memory efficiently on workloads that have many more threads than CPU cores.
I also notice that filtering out all function entry/exit pairs that take less than 5 microseconds probably helps reach those performance numbers. This kind of approach, although very specific to function tracing, seems worthwhile, and could eventually be introduced in LTTng-UST.
Another interesting aspect is that X-Ray seems to use the CPU cycle counter directly. That's all fine when the architecture has a reliable TSC source, but LTTng-UST uses the CLOCK_MONOTONIC vDSO to ensure that we properly fall back on other clock sources (e.g. HPET) whenever the system does not have a reliable TSC. The extra function call and seqlock may account for a few extra cycles of difference between X-Ray and LTTng-UST.
> I'd also look at what problems need solving, in case they can be solved without function tracing.
Without speaking to the specifics of xray, which I'm not so familiar with, I'll say that the folks who developed xray are actually part of a debugging team. They are some of the best debuggers at Google, finding incredibly complex (and, usually rarely-occurring) issues that span many layers of Google's technology stack.
So, if they decided they needed function-level tracing, as opposed to any of the other tools they have available, there's probably a good reason.