A simple way to get more value from tracing (danluu.com)
110 points by zdw on June 2, 2020 | 26 comments



At a recent start-up-like gig, we paid for a license from a monitoring and metrics provider that gave us out-of-the-box tracing infrastructure.

It turned out that the licensing was quite expensive for what we were getting (we couldn't even get outlier traces), we didn't have much control over our infrastructure (we couldn't adjust the sampling rate to keep all the traces we needed), and ultimately the learning curve for making the infrastructure scalable was a challenge as well.

Even after setting up tracing in our entire system successfully, I ultimately found it of minimal value and we ended up relying on ELK and log aggregation for more insight.

Tbh, it was a disappointing experience, and reading in this post how much custom dev was required to make tracing work for you guys only reinforces that. Given the kind of work I do (consulting), I doubt tracing will ever prove more valuable than spending the money on log aggregation and other monitoring efforts and tools.


A quick win I've seen, and have myself implemented as a consultant, is the use of a correlationId in some form. If an incoming request has a correlation-id header, use it; if not, generate a unique one. When calling other services, make sure to include the current correlationId, so it spreads throughout the services for a given request.

Most logging frameworks allow you to add something to every log message, so add the correlationId there. Then it's easy to track a request across multiple services. It also makes it easy to distinguish which log messages belong to a given request within a single service, so even without lots of services it can be a nice little feature.
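A minimal sketch of the pattern in Go (the header name "X-Correlation-Id", the downstream URL, and the hand-formatted log line are placeholders; a real setup would attach the ID through the logging framework's context, e.g. an MDC):

    package main

    import (
        "context"
        "crypto/rand"
        "encoding/hex"
        "log"
        "net/http"
    )

    type ctxKey struct{}

    // newCorrelationID generates a unique ID when the caller didn't send one.
    func newCorrelationID() string {
        b := make([]byte, 16)
        rand.Read(b)
        return hex.EncodeToString(b)
    }

    // correlationMiddleware reuses an incoming correlation ID or creates one,
    // logs it, and stashes it in the request context for handlers to forward.
    func correlationMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            cid := r.Header.Get("X-Correlation-Id")
            if cid == "" {
                cid = newCorrelationID()
            }
            log.Printf("correlation_id=%s %s %s", cid, r.Method, r.URL.Path)
            next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), ctxKey{}, cid)))
        })
    }

    func handler(w http.ResponseWriter, r *http.Request) {
        cid, _ := r.Context().Value(ctxKey{}).(string)
        // Forward the ID on outgoing calls so it spreads through every
        // service touched by this request.
        out, err := http.NewRequest("GET", "http://downstream.internal/work", nil)
        if err == nil {
            out.Header.Set("X-Correlation-Id", cid)
            // http.DefaultClient.Do(out) ... (omitted)
        }
        w.Write([]byte("ok"))
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", correlationMiddleware(mux)))
    }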


It’s easy for this to fall over. We were doing this where the entry point would make hundreds or even thousands of calls. Unfortunately, I couldn’t convince anyone that you needed spans and not just traces.


What kind of "tracing" is this about?


Distributed tracing, a way of tracing requests across services, e.g. Zipkin [0] (which is what I believe Twitter uses), Jaeger [1] or Lightstep [2]

[0] https://zipkin.io/ [1] https://www.jaegertracing.io/docs/1.18/architecture/ [2] https://lightstep.com/distributed-tracing/



Now merged with OpenCensus to form https://opentelemetry.io


Think of it like a stack trace, except across processes.

You have to instrument your code and libraries to pass along a context that gets picked up by each piece along the call tree, and then each piece gets reported to a collector.

It's useful for microservices, and even for non-distributed web services if you have heavyweight dependencies they must interact with (databases, caches, queues, etc.).

It only really works if all of the pieces understand each other, thus the OpenTelemetry [1] initiative.

[1] https://opentelemetry.io/
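To make "pass along a context" concrete, here's a hedged in-process sketch using the OpenTelemetry Go SDK (the stdout exporter and the names "frontend", "handleRequest", "queryDatabase" are just for illustration; a real deployment exports to a collector and also propagates the context over the wire between processes):

    package main

    import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func handleRequest(ctx context.Context) {
        // Each unit of work starts a span from the context it was handed,
        // so spans link into one call tree ("a stack trace across processes").
        ctx, span := otel.Tracer("frontend").Start(ctx, "handleRequest")
        defer span.End()
        queryDatabase(ctx)
    }

    func queryDatabase(ctx context.Context) {
        _, span := otel.Tracer("frontend").Start(ctx, "queryDatabase")
        defer span.End()
        // ... the actual database call would go here
    }

    func main() {
        // Report finished spans somewhere; stdout here, a collector in real life.
        exporter, err := stdouttrace.New()
        if err != nil {
            log.Fatal(err)
        }
        tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
        defer func() { _ = tp.Shutdown(context.Background()) }()
        otel.SetTracerProvider(tp)

        handleRequest(context.Background())
    }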


I think... the kind where you assemble a beginning-to-end view of a request and all the microservices it hits by crawling through log files.


Not really log files as I know them. More structured, at least in the implementation I'm familiar with.


first-world web problems


?


I interpret this as meaning that this is only a challenge for the kind of over-scale systems that only a handful of companies have.

You could be a pretty significant online retailer (Asos, say), run everything on a few instances of a monolith, and have all your information in easily interpretable log files, without needing to reach for distributed tracing tools.


Yeah, this tends to be more of an issue for consumer-grade internet services (where the majority of users are not customers).

Again, these tools are incredibly useful when you have a distributed system, and kinda pointless otherwise.


This seems really specific to distributed systems...


Yeah, I guess. I think it's that it's more useful in distributed systems, and most people don't need it.

But when you do need it, you really need it.


Might be best to dive into the literature starting here:

Magpie: Online Modelling and Performance-aware Systems

"Understanding the performance of distributed systems requires correlation of thousands of interactions between numerous components ..."

https://pdfs.semanticscholar.org/cb76/9b0c983bce6a73cafd1f8a...


> Taken together, the issues were problematic enough that tracing was underowned and arguably unowned for years. Some individuals did work in their spare time to keep the lights on or improve things, but the lack of obvious value from tracing led to a vicious cycle where the high barrier to getting value out of tracing made it hard to fund organizationally, which made it hard to make tracing more usable.

Omg, this is 100% the situation at $DAYJOB. I actually think this is generally true for any new infrastructure that devs are expected to interface with. E.g. for metrics, most engineers I work with also don't really understand Prometheus, despite it being incredibly useful with a robust query language, but looking at Grafana graphs gives them a good-enough picture. The big "wtf" moment with tracing is always the sampling: it's really hard to build a good sampling system that collects meaningful traces ahead of time, much less explain it. The article's "problems" list rings really true.
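To make the sampling point concrete, here's a hedged Go sketch of the usual head-based approach: the keep/drop decision is derived from the trace ID alone, before you know whether the trace will turn out to be interesting, which is exactly why collecting meaningful traces ahead of time is hard:

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // shouldSample keeps roughly rate*100 percent of traces. Hashing the
    // trace ID makes the decision deterministic, so every service that
    // sees the same ID makes the same keep/drop choice.
    func shouldSample(traceID string, rate float64) bool {
        h := fnv.New64a()
        h.Write([]byte(traceID))
        return float64(h.Sum64())/float64(^uint64(0)) < rate
    }

    func main() {
        for _, id := range []string{"trace-a", "trace-b", "trace-c"} {
            // The catch: an outlier (slow or failed) trace is exactly as
            // likely to be dropped as a boring one, which is why tail-based
            // sampling (deciding after the trace completes) is so much more
            // useful and so much harder to build.
            fmt.Printf("%s sampled=%v\n", id, shouldSample(id, 0.01))
        }
    }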

For my story: we tried to roll our own naive system backed by ELK that piggybacked on our logging stack, which largely failed due to time/resource constraints (Kibana and Elasticsearch actually worked fairly well for some types of aggregations, but some really useful visualizations, like a service graph, aren't possible). I agree with the post that building a good distributed tracing system from scratch definitely isn't trivial, but it's not very difficult given the tools that exist today. Also, the article mentions things like clock skew not mattering in practice, which, if you were to approach this from a "blank slate", would definitely be a bit counter-intuitive.

Anyway, due to resourcing constraints (i.e. we had no observability team, and even if we did, they wouldn't have worked on this particular problem 1+ years ago), we use a vendor (Lightstep), which is generally solid and requires very little maintenance, but there are still issues where conceptually simple queries like "show me traces that had this tag" aren't possible. E.g. they don't support querying something like the article's `anno_index` for historical data; you can only query in-memory data (so the last ~10-20 minutes, helpful for immediate oncall scenarios but not much else).

The really interesting thing about distributed tracing is that it's still in its infancy in terms of potential applications. The article focuses on performance/oncall scenarios, but we:

- Built a post-CI flow that captures integration test failures and links developers to a snapshot, so they can see the exact service where their test failed - no digging through logs.

- Can see traces during local development.

- In OpenTracing, there's the pattern of injecting untyped "baggage items" that make it through to downstream systems. We integrated this into our _logging_ clients, so you get free metadata from upstream services in logs without having to redeploy N microservices (this lets us get around some of the query limitations in Lightstep); see the sketch after this list.

- I'm also kicking around ideas to leverage baggage items to inject sandbox data for writing tests against polyglot, many-repo services. This lets test data live next to test driver code (e.g. a Go service calls into a Python service and we want to mock the Python service's test data, but don't want to have to update and merge a change to that repo).
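A hedged sketch of the baggage-into-logs idea from the list above (the `experiment_id` field, the mock tracer, and the `logWithBaggage` helper are made up for illustration; the OpenTracing calls themselves, SetBaggageItem/BaggageItem/ContextWithSpan, are the standard Go API):

    package main

    import (
        "context"
        "log"

        "github.com/opentracing/opentracing-go"
        "github.com/opentracing/opentracing-go/mocktracer"
    )

    // logWithBaggage stands in for a logging client that enriches every
    // message with selected baggage items from the active span.
    func logWithBaggage(ctx context.Context, msg string) {
        if span := opentracing.SpanFromContext(ctx); span != nil {
            if exp := span.BaggageItem("experiment_id"); exp != "" {
                log.Printf("experiment_id=%s %s", exp, msg)
                return
            }
        }
        log.Print(msg)
    }

    func main() {
        // A mock tracer so the sketch runs standalone; in production this
        // would be a Jaeger/Lightstep tracer registered the same way.
        opentracing.SetGlobalTracer(mocktracer.New())

        span := opentracing.GlobalTracer().StartSpan("upstream-request")
        defer span.Finish()

        // An upstream service sets the item once; it rides along in the
        // span context to every downstream service (and its logs) without
        // those services being redeployed.
        span.SetBaggageItem("experiment_id", "exp-42")

        ctx := opentracing.ContextWithSpan(context.Background(), span)
        logWithBaggage(ctx, "handling request")
    }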

Kudos to Twitter for a great tech + organizational initiative.


Tracing is a cross-cutting concern, and therefore does not belong in application binaries but in a sidecar or reverse-proxy intermediary, where it can be written once and applied to everything. If you don't have the need for these complex deployments, then don't bother with tracing either.


I disagree. There are internal parts of the application that benefit from tracing, and if you are using the OpenTracing standard, tagging your traces with metadata can allow correlations in the data to stand out (something LightStep is good at). Tracing only at your edges won't help you discover that requests get slower for customers with some subset of features enabled with given settings, for example.


While I 100% agree that doing most of the tracing at the sidecar level is the right call, you'll still need "some" application-level awareness of tracing (e.g. passing along B3 headers: https://github.com/openzipkin/b3-propagation ), or you won't be able to match the various spans to a common trace and will lose a _lot_ of the value from your distributed tracing.
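Roughly how little the application has to do when the sidecars own the span reporting, as a sketch ("downstream.internal" is a placeholder; the header names come from the B3 spec linked above):

    package main

    import (
        "log"
        "net/http"
    )

    // Header names from https://github.com/openzipkin/b3-propagation
    var b3Headers = []string{
        "x-b3-traceid",
        "x-b3-spanid",
        "x-b3-parentspanid",
        "x-b3-sampled",
        "x-b3-flags",
        "b3", // single-header form
    }

    // propagateB3 copies any B3 headers on the inbound request onto an
    // outbound one, so the sidecars can stitch both hops into one trace.
    func propagateB3(in, out *http.Request) {
        for _, h := range b3Headers {
            if v := in.Header.Get(h); v != "" {
                out.Header.Set(h, v)
            }
        }
    }

    func handler(w http.ResponseWriter, r *http.Request) {
        out, err := http.NewRequest("GET", "http://downstream.internal/work", nil)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        propagateB3(r, out)
        // resp, err := http.DefaultClient.Do(out) ... (omitted)
        w.Write([]byte("ok"))
    }

    func main() {
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }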


Good point, you need to pass the correlation key, but the sidecar is the one talking to the tracing infra.


I don't see how this advice can work in practical applications. It's not unusual for a single user request in a large system to invoke thousands of internal requests and the depth of the call graph can be tens of edges. If each of those requests has to traverse a proxy 4 times, you'll wreck your service latency. And, as another comment noted, you'll lose the ability to represent the internal structure of calls as trace spans, or to log to them, or to tag them.


Based on the headline and current events I assumed this article was going to be about "Contact Tracing" for the pandemic.


Centralized logging is much more fundamental. If you're dealing with HTTP services, centralized logging will show you the status code, path, response time, etc. for every request, with ample support for aggregations.

Teaching developers to effectively use Kibana/Splunk/Graylog would provide much more benefits than investigating distributed tracing solutions.

If you're dealing with non-HTTP services and want to investigate performance, what you most likely need is a profiler. Java has a fantastic one out of the box (jvisualvm) that can even attach remotely to a live production process; C and C++ have some CPU profilers, but they're expensive (e.g. the one in Visual Studio Pro/Ultimate); Python has some tools too, but I don't recall which one was good.

Of course, none of these tools are trivial to understand for the uninitiated. Show a developer stack traces and timing information and they have no idea what they mean, let alone how they're supposed to improve performance from that. What could improve adoption greatly is to give training and record a handful of video tutorials.


Linux has "perf." https://perf.wiki.kernel.org/index.php/Main_Page

It's easy to use and very powerful. I've used it with C and Rust and had great success.



