Hacker News
Google XRay: A Function Call Tracing System [pdf] (googleusercontent.com)
124 points by mnem on May 2, 2016 | 56 comments


NOT ANOTHER TRACER!!

I'm sure it's impressive engineering work, but why oh why...

How does it compare to Linux uprobes, which are built into Linux mainline? Bear in mind there are different front ends for uprobes (ftrace, perf_events, bcc, ...), and these are also still in development, so if one lacked certain features they needed, such features could be added. There's been a LOT of work in this area in the past 6 months, as well (see lkml).

If the goal was the lowest overhead, then why compile with no-op sleds ("negligible overhead") instead of using dynamic tracing (literally "zero overhead")? Or, if the existing kernel-based dynamic tracers benchmarked poorly, then why not something like LTTng?

How does it compare to DTrace, as well? (Doesn't Google have some FreeBSD?).

All the tracers I mentioned can not only do dynamic tracing, but also instrument all user and kernel code, without special recompilation.


Most of XRay was written before uprobes was merged into the Linux kernel (and well before such kernels were widely available).

I don't think any of the alternatives you mentioned are Pareto superior to XRay when considering all of "speed while tracing", "speed while not tracing", and "flexibility".

E.g.:

- In "speed while tracing", anything that takes a context switch per traced function will probably be dramatically slower. Even if there's some fast dispatch mechanism you have in mind that I'm not familiar with when you say dynamic tracing, if it doesn't insert the moral equivalent of a nop-sled, it will have to choose between logging the whole PC (spending data, which means spending RAM and disk time) or figuring out how to map it to a function-specific unique int (spending cycles).

- In "speed while not tracing", anything much more expensive than nop-sleds will be too slow to run in production.

- Anything that doesn't have a compile-time component probably won't be able to completely hook functions that get inlined (or whose source you aren't able to change), won't be able to pick out information the runtime wants to summarize from function arguments, etc.

To me, the neat thing about XRay isn't so much the "function patching" aspect, except insofar as it serves as a mechanism to execute arbitrary code at function entry or exit in a way that's runtime-customizable and very low overhead when you want it to be.


Thanks for the reply; some replies:

> "anything that takes a context switch per traced function will probably be dramatically slower."

Good thing uprobes don't context switch:

  # perf stat -e context-switches -e probe_libc:re_search_internal sed '/./d' /mnt/data.txt 
  
   Performance counter stats for 'sed /./d /mnt/data.txt':
  
                 6      context-switches                                            
        15,122,432      probe_libc:re_search_internal                                   
  
      19.744738204 seconds time elapsed
You mean mode switch? Cheaper, but yes, still costly. Here's runtime without the probe:

  # time sed '/./d' /mnt/data.txt 
  
  real	0m3.349s
  user	0m3.345s
  sys	0m0.004s
Which means we can calculate the cost to be ~1.1 us per probe (on my system). Anyone know what XRay is clocking in at?

AFAIK, LTTng has done work for user<->user instrumentation. I think uBPF will be doing this (https://github.com/iovisor/ubpf) - although that project is very new. Could use some help from some more good engineers (please do!).

> "In "speed while not tracing", anything much more expensive than nop-sleds will be too slow to run in production."

I'm not sure anyone is suggesting anything more than nop-sleds. Dynamic tracing has zero overhead when off, and static tracing costs nop-sleds.

> "probably won't be able to completely hook functions that get inlined"

Sure. Sometimes there are static tracing probes (nop-sled based); sometimes there aren't and it's dynamic probes; sometimes those dynamic probes are inlined and you walk up the stack to find one that isn't. If it is inlined, maybe you need to trace the address rather than the function entry.

In my experience it's pretty rare that something is just untraceable because inlining is so insane. But yes, it does happen sometimes. Usually I figure out a workaround before giving up.

> "a mechanism to execute arbitrary code at function entry or exit in a way that's runtime-customizable and very low overhead when you want it to be"

BPF! In-kernel virtual machine that runs JIT'd code on events, and is part of mainline Linux. Lots of enhancements in the Linux 4.x series.


> You mean mode switch? Cheaper, but yes, still costly.

Hah, yes.

> Which means we can calculate the cost to be ~1.1 us per probe (on my system). Anyone know what XRay is clocking in at?

I'm not at Google any more to check (and haven't touched the code since 2012), but IIRC XRay overhead while tracing was something like 100 cycles per function (which includes an entry/exit pair). Hard to compare across different machines made years apart of course, but I think that a little over an order of magnitude difference sounds about right.


Ok, if the requirements are 100 cycles per function, then you'll (for current kernels) need user<->user. So, AFAIK, an LTTng-like thing or uBPF. And LTTng does have such an order of magnitude difference: slide 25+ of http://www.efficios.com/pub/endusersummit2012/presentation-e...

I'd also look at what problems need solving, in case they can be solved without function tracing. If 100 cycles per function was needed, it sounds like a high rate of functions, and a lot could be done with just timer-based stack trace sampling.


Brendan, did you read this awesome blog post (which mentions your work btw) about Google's tracing framework which may explain the kind of problems they want to solve? http://danluu.com/perf-tracing/

"Sampling profilers, the most common performance debugging tool, are notoriously bad at debugging problems caused by tail latency because they aggregate events into averages. But tail latency is, by definition, not average."


I don't quite get it -- to me, that's like saying a hammer is notoriously bad at screwing in screws. And then going ahead and proving it. Um. Good content, but I don't get the premise. I wasn't using sampling profilers to study tail latency in the first place. Who would? (Maybe it's a different usage of "sampling profilers" than I'm familiar with.)

As for tail latency: one way is to dump every function event; a lower-cost way is to summarize latency in-kernel as a histogram. For both you're still tracing every function, and care about overhead down to 100 cycles etc.

But think about what causes the tails in the first place. Lock contention? Trace that. Resource I/O? Trace that. If it costs 1 us to trace disk I/O, it's usually not a problem. I like to time scheduler switch events with stack traces -- a catch all (but a bit more expensive). Of course, these approaches require kernel instrumentation. :)


Hi Brendan,

(Full disclosure: I'm Mathieu Desnoyers, part of the LTTng maintainer team.)

I would like to introduce a slightly less extreme point of view when considering "on-the-fly" aggregation of traces vs. tracing to a buffer followed by post-processing. I see from the current discussion thread that it's very much either one or the other, but I think that combining the two approaches helps create much more powerful tools. On-the-fly aggregation based on trace instrumentation helps pinpoint latency outliers. Tracing to buffers, on the other hand, provides very detailed information about the system behavior that leads to those outliers. By using on-the-fly aggregation as a "trigger" to collect the tracer's in-memory ring buffers, one can investigate latency outliers with very small I/O overhead.


Currently, LTTng-UST has slightly higher overhead than 100 cycles per event (roughly 250-300 ns/event on a recent 2.4GHz Intel), which I expect is partly caused by the use of per-CPU buffers rather than per-thread buffers. I have contributed the membarrier system call, and I am currently working on adding restartable sequences and a cpu_id cache to the Linux kernel, so the speed of LTTng-UST can be brought closer to the performance of a tracer using per-thread buffers. Keeping per-CPU buffers ensures that the tracer efficiently uses memory resources on workloads that have many more threads than CPU cores.

I also notice that filtering out all function entries/exits that take less than 5 microseconds probably helps reach those performance numbers. This kind of approach, although very specific to function tracing, seems worthwhile, and could eventually be introduced in LTTng-UST.

Another interesting aspect is that X-Ray seems to use the CPU cycle counter directly. That's all fine when the architecture has a reliable TSC source, but LTTng-UST uses the CLOCK_MONOTONIC vDSO to ensure that we properly fall back on other clock sources (e.g. HPET) whenever the system does not have a reliable TSC. The extra function calls and seqlock may account for a few extra cycles of difference between X-Ray and LTTng-UST.


> I'd also look at what problems need solving, in case they can be solved without function tracing.

Without speaking to the specifics of xray, which I'm not so familiar with, I'll say that the folks who developed xray are actually part of a debugging team. They are some of the best debuggers at Google, finding incredibly complex (and, usually rarely-occurring) issues that span many layers of Google's technology stack.

So, if they decided they needed function-level tracing, as opposed to any of the other tools they have available, there's probably a good reason.


Unfortunately, I think at least one of the developers was not aware of uprobes. I saw a post from him saying he was looking into uprobes after this point.

He seems like a nice guy; it just seems like an oversight.


Is there any documentation on how the dynamic tracing functionality actually works? In particular, how does it avoid the traditional problem with hooking via patching (i.e., that you have to get the disassembly exactly correct or you risk placing a hook on top of a jump target)?


There's been a lot of work on these still-in-development systems in the last half year; meanwhile, xray has been in production for half a decade. I think that answers your question.


DTrace has been in production for more than a decade, but yeah, not in Linux.


I would love to see a tracer that works across processes and machines.

For example, a request hits the web server, the web server tickles an application server, the application server makes a number of database queries, and the results propagate back the way they came.

I'd love to see something that could follow this request through all of the layers and across all of the machines, without the luxury of having the source code for most of the components.

I don't see why this wouldn't be possible, but I certainly see why it wouldn't be EASY!


Seeing that people have already mentioned Dapper, I'd add that there are already a few open source implementations of Dapper available. Appdash[1] is a very lightweight tracer that isn't too fancy, but gets the job done. There's also the much bigger, more popular tracer Zipkin[2], which was built by Twitter.

[1] https://github.com/sourcegraph/appdash

[2] https://github.com/openzipkin/zipkin


It looks like both of these require changes to the application. Dapper appears to have the same issue.

My use case is for understanding a closed source application which has a number of separate pieces. Since I don't have the source code, these don't look like viable options.


Google published a paper similar to this 6 years ago: http://research.google.com/pubs/pub36356.html

Dapper will follow the call path of an RPC and report the details of it.



Other people have mentioned end-to-end tracing frameworks, which are heavyweight application-level frameworks that are similar to logging.

However, lightweight cross-component dynamic instrumentation that correctly captures causality is still an open research problem. This is actually something I have done some work on recently in my system Pivot Tracing[1] which won a best paper award at the conference it was presented at (SOSP). Pivot Tracing combines dynamic instrumentation with causal tracing in order to address precisely the problem you describe. The code for the system is open source and available on GitHub[2].

[1] http://cs.brown.edu/~jcmace/papers/mace15pivot.pdf

[2] https://github.com/brownsys/tracing-framework


This looks good, but requires changes to the system being traced. In my case, the system I'm most interested in tracing is a closed-source black box, so changing it isn't an option.


I never tire of this shameless plug: our state-of-the-art open source instrumentation engines for Microsoft Windows, which let you hook applications without even knowing about the complexities of hooking. The most programmer-friendly is https://github.com/nektra/Deviare2 while the Microsoft Detours competitor is: https://github.com/nektra/Deviare-InProc

It has an embedded disassembler to smartly hook functions even if they have a jmp in the prologue.


Yeah but what if they have a branch target in the prologue?


When the stub for calling the original function is created, most hooking engines assume the prologue contains the standard "mov edi,edi / push ebp / mov ebp,esp", which is wrong.

If the prologue contains, e.g., a relative jmp, copying opcodes is not enough. You must convert it to an absolute jump in the generated stub. The same applies to several instructions that do relative/indirect addressing.


Yes.


Url changed from https://research.google.com/pubs/pub45287.html, which points to this.

This project was discussed recently at https://news.ycombinator.com/item?id=11595287, but perhaps the current post adds more information.


I don't see any mention of Intel's errata on cross-modifying code on SMP in the paper. I wonder how the authors handle this? See "Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected Instruction Execution Results", Ref. http://www.intel.com.tr/content/dam/www/public/us/en/documen... AX72. This is one of the main challenges of cross-modifying code, and one key reason why LTTng-UST does not use a nop-sled today. One possible approach is to SIGSTOP the entire process while doing the code modification, which is unwanted in real-time systems. Another approach would be to integrate with uprobes and do a temporary breakpoint bypass, similarly to what is done in the Linux kernel today for jump labels.


More and more I think call tracing is a core process of programming, and am re-architecting the code I write to be easily traceable.

I feel like this approach of using layer-cake architectures, where your function call has to plumb through a dozen layers you didn't write, and then trying to make sense of that with data analytics, is the wrong approach.

Instead, I have been ditching layer-cake libraries for vertically integrated libraries that do one thing and take full responsibility for it. This requires architecting your application in a different way... it generally means more boilerplate. Libraries do the heavy lifting, but no sexy DSLs that turn your boilerplate into terse method chains and such.

But the payoff is the ability to simply put a breakpoint anywhere in the system and have the stack be a good representation of which pieces of code are doing something right now, versus just hanging around because this-kind-of-thing might need that-kind-of-interface later on.


Seems much more restricted than perf or DTrace, or all the other tracing tools. I don't understand the point of this at all. NIH?

I'd love to hear brendangregg's or bcantrill's thoughts on this. (Does HN give users some kind of alert when they are mentioned?)


(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google)

I recommend reading page 2 of the paper, which discusses the specific set of features that XRay offers. How do I get perf or DTrace to give me the six things listed there? I can only think of ways to get a couple of them.


> The cost is acceptable when tracing and barely measurable when not tracing.

"Acceptable" is obviously relative, but with DTrace's pid provider, the cost is zero when not tracing, and about the cost of a fastcall per probe point when enabled.

> Instrumentation is automatic and directed towards functions that are important for understanding the binary’s execution time.

I'm not sure what this means, but with DTrace, you enumerate the functions or binary objects (with wildcards and such) that you want to instrument, and the framework takes care of reliably instrumenting them, no matter the state of the process. Is that "automatic" and "directed"? I need to read the rest of the paper more closely.

> Tracing is efficient in both space and time -- only recording what is required and what matters.

DTrace records exactly what you ask it to. It supports in-situ aggregation for cases where it's not tenable to record a complete log of all interesting events. This is an important part of the design.

> Tracing is configurable with thresholds for storage (how much memory to use) and accuracy (whether to log everything or only function calls taking at least some amount of time).

With DTrace, it's pretty easy to filter on function execution time. The buffer size is configurable. There are also multiple buffer policies for different use-cases (e.g., ringbuffer of the last N events leading up to some other event).

> Tracing does not require changes to the operating system nor super-user privileges.

If they're running Linux, as I imagine they are, DTrace isn't necessarily an option. Several other platforms have just ported it. Using it to record user-level state on your own processes does not require superuser privileges.

> Tracing can be turned on and off dynamically without having to restart the server.

Absolutely -- that's what the "D" is for.

I'd strongly recommend checking out the DTrace paper: https://www.usenix.org/legacy/event/usenix04/tech/general/fu...

There may be good reasons not to use DTrace for this, but I'm not sure which of those six goals would be the sticking point other than OS availability. (edit: I also haven't read beyond that yet!)


It's definitely the sort of paper that would get reviewer feedback if it were submitted to a conference, in that it doesn't compare against any related work. (But it's unfair to grade it by that metric, because it reads like more of a "here's what we did" blog post, not a research paper.)

From reading it, I believe no other tool is both (1) non-sampled and (2) supports instrumenting all functions simultaneously. (At least none of the other comments in this thread point at tools that do this.)


>From reading it, I believe no other tool is both (1) non-sampled and (2) supports instrumenting all functions simultaneously. (At least none of the other comments in this thread point at tools that do this.)

Perf supports dynamic tracing of probepoints, and probing multiple points simultaneously.

Perhaps I'm misunderstanding things, but I don't see how perf doesn't fit those two requirements. It does a lot more than just sampling


1. Errr, perf definitely does not support inserting N probepoints, where N is the number of function entries + exits in a program.

2. Perf is 14x slower with tracepoints, and is definitely not that low overhead with tons and tons of probe points.

See http://events.linuxfoundation.org/sites/events/files/slides/...


Is there something I'm missing in the XRay paper that mentions speed? The only overhead I can find mentioned in the paper is CPU and memory utilization.

If XRay is able to trace millions of probe points with less overhead than uprobe + frontends, that's extremely impressive... But I just don't see any numbers so far that have said this is the case?


Errr, it's tracing every function. That is equivalent to tracing every probe point. It is definitely less overhead than something that has to actually figure out what it is doing and care in some special way.

x-ray doesn't even make any attempt to optimize things well. If it did, it would be even lower overhead.

I'm honestly a bit stymied. I'm not sure why anyone would think a dynamic probing infrastructure that has to keep track of millions or billions of probes would ever go faster than something that has constant time and storage space by knowing what it is tracing and that it is tracing ahead of time.

The tradeoff is pretty simple. If you know what you want to probe ahead of time, and what the probes do, which is the case with x-ray, you can do it in fixed time and storage cost. You can never add or remove probes. Nothing has to worry about what probes exist.

If you want to be able to add or remove probes at any time, you now have a cost of having to keep track of all the probes that exist and possibly doing something different with them at different times and ...


So, to be fair to the xray people, what they have done meets a very specific set of use cases you can't meet with most other tracing tools.

(I haven't looked at dtrace hard enough. at a glance, it looks like it would not meet several things, but dtrace is not a sane option for them for a variety of reasons).

Past that, the mechanism used to do this, and the system itself, is not uncommon to build. I don't see any claims that say they think they've broken new ground. Just that they built it :)


Does perf or Dtrace have the same functionality?

>XRay allows you to get accurate function call traces with negligible overhead when off and moderate overhead when on, suitable for services deployed in production. XRay enables efficient function call entry/exit logging with high accuracy timestamps, and can be dynamically enabled and disabled.


Yes?


First, I always assume DTrace can do something. But DTrace is not that low overhead in the situation that xray is describing.

http://dtrace.org/blogs/brendan/2011/02/18/dtrace-pid-provid...

When they traced everything, it was two orders of magnitude slower.

At 100k event/s (quite possible in the situations x-ray is used), the app would be about 60% slower. For just a simple probe, not full on tracing.

Does this make it better than dtrace? No. It just serves a different set of use cases.


> When they traced everything, it was two orders of magnitude slower.

> At 100k event/s (quite possible in the situations x-ray is used), the app would be about 60% slower. For just a simple probe, not full on tracing.

I think you've misunderstood the post. The actual time reported in that post is 600 ns per probe, with sanity checks reporting as much as 2000ns, which he concluded was in the same ballpark.

He measured that by observing a program that just calls two functions in a tight loop, one of which is strlen() on a dozen-character string. That's basically the worst possible case for any function call tracing system -- I don't think it's fair to say that's "a simple probe, not full on tracing".

The "60%" slower conclusion is only by construction: if you have N events per second, and the framework adds overhead of T nanoseconds per probe, then you can pick N such that the tracing framework adds whatever percentage overhead you like. In this case, Brendan picked N=100K and came up with 60% overhead for that case, but I think there's a math error in there. For that calculation, he assumes a 6us probe time instead of 600ns. I think the overhead would be 6% for 100,000 events, not 60%.


I didn't misunderstand. The problem with this is that you assume that if you probe every function instead of just strlen that you get the same time bounds.

The only way to do this is, afaik, something similar to http://docs.oracle.com/cd/E19253-01/819-5488/gcgmc/index.htm...

This really just is going to expand to putting a probe on every function. I'd really like to see numbers on how dtrace handles many millions of probes at once (which is what xray is handling).

It's not uncommon to have 10 or 100 million functions in some of these programs. I have strong doubts that dtrace has the same overhead given 200 million probes.

(since the probe data structures alone will likely take up gigabytes of memory, accessing them is unlikely to be cache friendly, etc)

AFAICT, there is no more generic function entry probe than what that blog post describes. But i'd love to be wrong, and understand how dtrace is going to determine what instructions are a function entry in several ns :P

TL;DR dynamic probing infrastructures are not a panacea


> I'd really like to see numbers on how dtrace handles many millions of probes at once (which is what xray is handling).

> It's not uncommon to have 10 or 100 million functions in some of these programs. I have strong doubts that dtrace has the same overhead given 200 million probes.

I see now. That's a fair question, and I'm not aware of data either way. I just tried a pretty simple experiment inspired by Brendan's that suggests that on my machine, the overhead is about 1450ns per probe for as many as 140,000 probes:

https://gist.github.com/davepacheco/a12a0d45d55f0d7a28c312c2...

> AFAICT, there is no more generic function entry probe than what that blog post describes. But i'd love to be wrong, and understand how dtrace is going to determine what instructions are a function entry in several ns :P

Well, DTrace as architected is always going to pay the cost of a context switch into the kernel for each probe, and I think it's fair to take Brendan's result of 600ns as a lower bound of the per-probe overhead, at least on his machine. However, once in the kernel, for a typical native program (i.e., not JIT), I expect DTrace would only record the current userland thread instruction pointer. Names are typically resolved asynchronously by the consumer. So I would be surprised if it really was much slower, especially given the result above, but I too would like to see data.

I'm not saying that DTrace solves all problems or even that the OP should have used it instead. It's certainly true that for the special case of userland function boundary tracing, one might expect to do better by skipping the context switch (at the expense of much functionality, including any ability to correlate with broader system activity). But since DTrace was brought up, I wanted to help clarify the uncertainty about what it can do and what its overhead is.


I don't know about DTrace, but perf is a sampling profiler, not an accurate call tracer.


Perf is much more than a sampling profiler: https://perf.wiki.kernel.org/index.php/Main_Page. It can use uprobes and kprobes.

There are many ways to do dynamic tracing in Linux, apart from perf. Ftrace, raw kprobes/uprobes, eBPF, bcc compiler for eBPF, etc.


After perusing the perf man page, the only way I could figure out how to make it accurately count userspace function calls was using hardware breakpoints, e.g.: "perf stat -e mem:0xADDRESS:x"

Obviously that's not a very good approach, because you're limited to the number of breakpoints that your CPU can handle simultaneously (4 on my machine) and there's a lot of overhead. If you know of a better way to accomplish the same thing with perf, I'd be happy to hear it.



Ah, cool. I just tested it out and it seems to work as documented. Unfortunately it requires root access, and incurs about 1 microsecond of overhead per function call on my machine.


Reading the X-Ray abstract, unless I'm missing it, it doesn't go into detail about the time overhead. I do see mention of CPU and RAM usage, though.

I'm not sure that we have any evidence that X-Ray offers less overhead here, though it's certainly cool if it does.


Can you give an example where perf works with 10 million+ probe points?

:)

(and where this does not have an antagonist effect on the other things on the machine?)



And for those poor people who haven't seen bcc, do yourself a favor and look into it: http://iovisor.github.io/bcc/

Besides, with a unicorn for the logo, whats not to like?



The fact that it relies on compiler instrumentation makes it less interesting for people outside Google, given that there are other instrumentation systems that do not require compiler changes.


Seems unnecessary to do anything to registers in the call edge glue. We did this (oh, decades ago...) on the 68000, on the Macintosh, and really all we needed was a distilled link map and some return addresses. Maybe there are some complications we didn't have, though.


I agree. This technique is as old as programming itself.



