Likely Prometheus. Jaeger for tracing, Elastic for logs, and Prometheus for metrics is a pretty common and effective OSS observability stack.


Some of their engineers have made PRs to the OpenTelemetry JS repository, but it wouldn't have been mature enough (or even existed) when much of the work described in the blog was underway.


Regarding Google Cloud Profiler (I'm the PM), this is for a few reasons:

- We have good support for Go, Java, JS, and Python, but are still adding a few features for these languages (musl support for Alpine just shipped; we still need to add heap profiling for Python)

- C++ isn't as heavily used by our ops tools customers as Go, Java, JS, and Python

- We have several new analysis features in the pipeline, and the cost of supporting an additional language would slow down their delivery

There's no lack of desire to add new languages, but we chose to prioritize completing existing language support and new analytics functionality this year. I'm guessing that the other teams making products in this space face similar constraints and have made similar tradeoffs.


Thanks for your perspective. When I was at Dropbox I wrote a quick tool that would convert perf data files to Cloud Profiler protobufs and upload them to Google. I really appreciate that the API exists, allowing people to use whatever language they want. I'm not a huge fan of the Datadog model, where you need their agent and they don't have a documented API.


FYI for all, the W3C TraceContext specification will become a Proposed Recommendation later this week. I'm one of the co-chairs of the group and am happy to answer questions about our W3C work or OpenTelemetry.
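For anyone who hasn't seen the header itself, here's a rough sketch (plain Python, no tracing library; function names are just for illustration) of building and parsing a traceparent header per the spec's field layout:

    import re
    import secrets

    # traceparent = version "-" trace-id "-" parent-id "-" trace-flags
    # version: 2 hex chars, trace-id: 32, parent-id (span id): 16, flags: 2
    TRACEPARENT_RE = re.compile(
        r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
    )

    def new_traceparent(sampled=True):
        """Start a new trace: random trace-id and span-id, version 00."""
        trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
        span_id = secrets.token_hex(8)    # 8 bytes  -> 16 hex chars
        flags = "01" if sampled else "00"
        return f"00-{trace_id}-{span_id}-{flags}"

    def parse_traceparent(header):
        """Return (version, trace_id, parent_span_id, flags), or None if invalid."""
        match = TRACEPARENT_RE.match(header.strip().lower())
        return match.groups() if match else None

    print(parse_traceparent(new_traceparent()))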


Just curious, what was the rationale for randomizing the spanId at each hop? (As opposed to a more structured format that could let you track the request tree without relying on another field like timestamp)


Existing tracing systems (Dapper, Zipkin, Dynatrace, Stackdriver, etc.) already randomize with each hop, and there was a desire to be consistent with the models that they already used. It's also more straightforward to implement.

There's a discussion about "correlation context" inside of this W3C group, which maps to what you're describing. It'd be worth reaching out to Sergey (one of the other co-chairs) if you want to find out more.
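To make the per-hop model concrete, here's a rough sketch (plain Python, illustrative only, not any particular library's implementation): the trace-id and flags pass through unchanged, a fresh random span-id is generated locally, and the incoming span-id becomes this span's parent.

    import secrets

    def next_hop(incoming_traceparent):
        """Given the incoming traceparent, return (outgoing_traceparent, parent_span_id).

        Only the span-id is re-randomized; the incoming span-id is recorded
        as this hop's parent so the tree can be rebuilt later.
        """
        version, trace_id, parent_span_id, flags = incoming_traceparent.split("-")
        my_span_id = secrets.token_hex(8)  # fresh random 8-byte span id for this hop
        outgoing = f"{version}-{trace_id}-{my_span_id}-{flags}"
        return outgoing, parent_span_id

    incoming = "00-" + secrets.token_hex(16) + "-" + secrets.token_hex(8) + "-01"
    outgoing, parent = next_hop(incoming)
    print("parent span:", parent)
    print("outgoing   :", outgoing)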


Timestamps across distributed systems don't work well as correlation tools: clocks tend not to be accurate enough to order application retries in particular, but also fan-out requests. You really want parent/child or follows-from relationships to collect and represent the graph correctly.

Source: Working on distributed tracing at Twilio and Stitch Fix
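As a toy illustration of why the parent/child relationship (rather than timestamps) is what recovers the graph, here's a sketch that rebuilds a request tree from spans carrying only ids and parent ids; the span data is made up.

    from collections import defaultdict

    # Made-up spans from one trace: (span_id, parent_span_id, name).
    # Timestamps are deliberately omitted -- they aren't needed, and they
    # wouldn't reliably order the retry or the fan-out across hosts anyway.
    spans = [
        ("a1", None, "api-gateway"),
        ("b2", "a1", "checkout-service"),
        ("c3", "b2", "payment-call (attempt 1)"),
        ("c4", "b2", "payment-call (retry)"),
        ("d5", "a1", "inventory-service"),
    ]

    children = defaultdict(list)
    names = {}
    root = None
    for span_id, parent_id, name in spans:
        names[span_id] = name
        if parent_id is None:
            root = span_id
        else:
            children[parent_id].append(span_id)

    def print_tree(span_id, depth=0):
        print("  " * depth + names[span_id])
        for child in children[span_id]:
            print_tree(child, depth + 1)

    print_tree(root)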


W3C Trace Context defines an HTTP header format for traced requests, and OpenTelemetry implements this format by default. While this project is technically distinct from OpenTelemetry, it's effectively composed of the same people (including me).


Right now we're focused on the first release of OpenTelemetry, which will include distributed traces and metrics. Many users and contributors have asked for logging support, and this was already a big discussion topic in the pre-merger OpenCensus and OpenTracing communities - I'd expect us to start focusing on this after the release later this year. There have also been some early conversations around supporting errors as a first-class signal type (there was a recent GitHub issue, though now I can't find it); however, to my knowledge we haven't yet started any error-related specs or design discussions.

You can certainly add error-related annotations to traces; however, these will typically be sampled.
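For the trace-annotation route, here's a rough sketch using the opentelemetry-python API as it looks today (the library was still pre-release when this thread was written, so treat the exact calls as illustrative); whether the error details survive depends entirely on the sampler keeping the trace.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.trace import Status, StatusCode

    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer(__name__)

    def charge_card(order_id):
        with tracer.start_as_current_span("charge-card") as span:
            span.set_attribute("order_id", order_id)
            try:
                raise RuntimeError("card declined")  # stand-in for the real call
            except RuntimeError as exc:
                # The stack trace and status ride along on the span, but only
                # if the sampler kept this trace in the first place.
                span.record_exception(exc)
                span.set_status(Status(StatusCode.ERROR))

    charge_card("order-42")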


Error information is a bit different from general logging - stacktraces, annotation of exceptions with case-specific information, matching that data with debug symbols, etc. It would be interesting to see the ability to enrich standard logging with custom-shaped data, without sampling, in a standardized way.


Yep! The Java library has some basic z-page support, but adding high quality z-pages will be a big part of the OpenCensus agent and OpenCensus service that we're developing now. These can be used within Kubernetes environments or on plain VMs.


High quality z-pages are great, but if there's no easy and secure way for a developer to get to them, they become useless. Rather than their contents or generation, my question is about automatic discovery and secure, restricted proxying of the pages. I don't want to tell my colleagues to run ad-hoc kubectl commands for each service or pod they want to inspect.


Ah, got it. We've discussed exactly this (within Google) with some of the Kubernetes maintainers who have similar wishes to yourself. The most immediate z-pages related work is focused on making them better in the more general case, as they're extremely barebones at the moment and are specific to each language's library. After that we'll explore the native Kubernetes integrations that you're asking for.

It'll come, but it's going to take some time.


You can send traces to Jaeger, Zipkin, Stackdriver, Azure App Insights, X-Ray, Honeycomb, and a few others. Supported metrics backends include Prometheus, Datadog, SignalFx, Stackdriver, Azure, etc.


Another OpenCensus maintainer here, happy to answer questions along with Nik


How does this work with Istio? I believe Istio also collects traces and sends them to Jaeger/Zipkin.


[SD APM PM here] Yep, we support Java / JVM, Node.js, and Go; Python is coming soon


What’s the long term plan here with StackDriver and OpenCensus?

Are you eventually going to support high-cardinality events with the ability to aggregate and then break down (as opposed to, say, Prometheus, which pre-aggregates, so you can't break down to debug, and doesn't support high cardinality)?
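To spell out the distinction, a toy sketch (plain Python; the event fields are made up): with raw events you can break down along any dimension after the fact, whereas a pre-aggregated counter has already discarded everything except the labels chosen up front.

    from collections import Counter

    # Raw, high-cardinality events: every field is still available at query time.
    events = [
        {"route": "/checkout", "status": 500, "customer_id": "c-10492", "region": "us-east"},
        {"route": "/checkout", "status": 500, "customer_id": "c-10492", "region": "us-east"},
        {"route": "/checkout", "status": 200, "customer_id": "c-99871", "region": "eu-west"},
        {"route": "/search",   "status": 500, "customer_id": "c-00317", "region": "us-east"},
    ]

    # Break down errors by an arbitrary dimension, chosen while debugging.
    errors_by_customer = Counter(
        e["customer_id"] for e in events if e["status"] >= 500
    )
    print(errors_by_customer)  # Counter({'c-10492': 2, 'c-00317': 1})

    # A Prometheus-style pre-aggregated counter keeps only low-cardinality labels;
    # customer_id was never a label, so that breakdown is impossible later.
    pre_aggregated = Counter()
    for e in events:
        pre_aggregated[(e["route"], e["status"])] += 1
    print(pre_aggregated)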


Would you be willing to add a contact method in your profile (twitter/email) for some offline questions?

