OTel is flawed for sure, but I don't understand the stance against metrics and logs. Traces are inherently sampled unless you're lighting all your money on fire, or operating at so small a scale that these decisions have no real impact. There are kinds of metrics and logs which you always want to emit because they're mission-critical in some way. Is this a Sentry-specific thing? Does it just collapse these three kinds of information into a single thing called a "trace"?
I mean, when you're the one selling the gas to light that money on fire, you have a vested interest in keeping it that way, right?
I do agree that logs and spans are very similar, but I disagree that logs are just spans; similar isn't the same as identical.
I also agree that you can derive all your metrics from spans and, in fact, it might be a better way to tackle it. But it's just not financially feasible to do so, so you do need to have some sort of aggregation step closer to the metric producers.
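To make the "aggregation closer to the producer" idea concrete, here is a hypothetical sketch (names and bucket bounds are made up, not any real OTel API): instead of shipping every span, the producer folds span durations into a fixed histogram and exports only the aggregate counts, so the export volume stays constant no matter how many spans were recorded.

```python
from collections import defaultdict

# Upper bounds for latency histogram buckets, in ms; anything above
# the last bound lands in a +Inf bucket. These values are illustrative.
BUCKETS_MS = [10, 50, 100, 500, 1000]

def bucket_for(duration_ms):
    for bound in BUCKETS_MS:
        if duration_ms <= bound:
            return bound
    return float("inf")

class SpanMetricsAggregator:
    """Pre-aggregates span data at the producer instead of exporting raw spans."""
    def __init__(self):
        # (span_name, bucket_upper_bound) -> count
        self.histogram = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, span_name, duration_ms, is_error=False):
        self.histogram[(span_name, bucket_for(duration_ms))] += 1
        if is_error:
            self.errors[span_name] += 1

    def flush(self):
        # What actually gets exported: a handful of counters,
        # regardless of how many spans were recorded since the last flush.
        snapshot = dict(self.histogram), dict(self.errors)
        self.histogram.clear()
        self.errors.clear()
        return snapshot

agg = SpanMetricsAggregator()
agg.record("GET /bar", 300)
agg.record("GET /bar", 45, is_error=True)
hist, errs = agg.flush()
print(hist)   # {('GET /bar', 500): 1, ('GET /bar', 50): 1}
print(errs)   # {'GET /bar': 1}
```

The point is that the cost of the metric pipeline scales with the number of (name, bucket) pairs, not with request volume, which is exactly what raw span export can't give you.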
What I do agree with is that the terminology and the implementation of OTel's SDK are incredibly confusing and hard to implement/keep up to date. I spent way too many hours of my career struggling with conflicting versions of OTel, so I know the pain, and I desperately wish they would at least take to heart the idea of separating the implementation from the API.
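For readers unfamiliar with the "separate implementation from API" idea, here is a minimal sketch of the pattern in plain Python (the names here are hypothetical, not the actual OTel interfaces): library code depends only on a tiny stable interface with a no-op default, and whoever owns the application wires in a real SDK at startup. Instrumented code then keeps working even when no SDK is installed, and version conflicts are confined to the application, not every library.

```python
from typing import Protocol

class Tracer(Protocol):
    """The tiny, stable surface that instrumented code depends on."""
    def start_span(self, name: str): ...

class NoopSpan:
    def __enter__(self): return self
    def __exit__(self, *exc): return False
    def set_attribute(self, key, value): pass

class NoopTracer:
    """Default implementation: instrumentation is free when no SDK is wired in."""
    def start_span(self, name: str):
        return NoopSpan()

_tracer: Tracer = NoopTracer()

def set_tracer(tracer: Tracer) -> None:
    """Called once at application startup by whoever installs a real SDK."""
    global _tracer
    _tracer = tracer

def get_tracer() -> Tracer:
    return _tracer

# Library code depends only on the interface above, never on an SDK:
def handle_request(path):
    with get_tracer().start_span(f"GET {path}") as span:
        span.set_attribute("http.target", path)
        return 200

print(handle_request("/bar"))  # prints 200, even with no real SDK installed
```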
> Traces are inherently sampled unless you're lighting all your money on fire
You can burn a lot of money with logs and metrics too. The question is how much value you get for the money you throw on the burning pile of monitoring. My personal belief is that well-instrumented distributed tracing is more actionable than logs and metrics, even if sampled.
I actually take the opposite approach. In my experience, well-instrumented metrics and finely tuned logs are more actionable than distributed traces! Interesting how that works out.
I believe that might be correct on the infrastructure side. Within applications, that doesn't match my experience: in many cases the concurrent nature of servers makes it impossible to repro issues and narrow down the problem without tracing or trace-aware logs.
With only sampled traces, though, it's very hard to understand the impact of the problem. There are some bad traces, but are they affecting 5%, 10%, or 90% of your customers? Metrics shine there.
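A bit of back-of-the-envelope arithmetic shows why sampled traces are a poor impact gauge. Below is an illustrative sketch (the counts and the naive normal-approximation interval are assumptions for the example, not anything from a real system): at a 1% sampling rate you might see only 5 bad traces out of 100 sampled, and the resulting confidence interval on the true error rate is wide enough that you can't tell a minor blip from a serious incident.

```python
import math

def error_rate_ci(errors_seen, sampled_total, z=1.96):
    """Naive 95% normal-approximation confidence interval for the true
    error rate, given counts observed in a sampled subset of traffic."""
    p = errors_seen / sampled_total
    half = z * math.sqrt(p * (1 - p) / sampled_total)
    return max(0.0, p - half), min(1.0, p + half)

# With 1% trace sampling, 10,000 requests yield only ~100 sampled traces.
# Seeing 5 bad ones gives a point estimate of 5%, but a very wide interval:
lo, hi = error_rate_ci(5, 100)
print(f"estimated error rate: 5% (95% CI {lo:.1%}..{hi:.1%})")
# prints: estimated error rate: 5% (95% CI 0.7%..9.3%)
```

An unsampled counter incremented at the producer would pin that number down exactly, which is the "metrics shine there" point.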
Whether it is affecting 5% or 10% of your customers, if it is erroring at that rate you are going to want to find the root cause ASAP. Traces let you do that, whereas the precise number does nothing. I am a big supporter of metrics but I don't see this as the use case at all.
(not your OP) This is true, but I find that metrics are useful whether something is going wrong or not (metrics that show 100% success are useful in determining baselines and what "normal" is), whereas collecting traces _when nothing is going wrong_ is not useful -- it's just taking up space and ingress, and thus costing me money.
My typical approach in the past has been to use metrics to determine when something is going wrong, then enable either tracing or logs (usually logs) to determine exactly what is breaking. For a dev or team that is highly connected to their software, simply knowing what was recently released is enough to zero in on problems without relying upon tracing.
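The "metrics detect, logs diagnose" workflow described above can be sketched with nothing but the standard library. This is a hypothetical toy, assuming an error-rate metric and a threshold chosen by the operator: steady-state logging stays cheap at WARNING, and detailed DEBUG output is switched on only while the metric indicates trouble.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("myapp")  # hypothetical application logger

ERROR_RATE_ALARM = 0.05  # assumed threshold; pick per service

def adjust_verbosity(recent_error_rate: float) -> None:
    """Pay for detailed logging only while a metric says something is wrong."""
    if recent_error_rate > ERROR_RATE_ALARM:
        logger.setLevel(logging.DEBUG)    # incident: capture everything
    else:
        logger.setLevel(logging.WARNING)  # steady state: cheap and quiet

adjust_verbosity(0.12)
print(logger.level == logging.DEBUG)  # prints True
```

In practice the same toggle could enable trace sampling instead of (or alongside) verbose logs; the design choice is simply that the expensive telemetry is gated behind the cheap telemetry.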
Traces can be useful, but they're expensive relative to metrics, even if sampled at a very low rate.
Strange example; you'd think you'd want to fix this as quickly as humanly possible, no?
Also we don't sample traces, it's a fire hose of data aimed at the OTel collector. We do archive them / move them to colder and cheaper storage after a little time though, and we found that a viable money-saving strategy and a good balance overall.
How? Distributed traces show you what’s going on with a request in detail.
Metrics tell you zero detail, by definition. Logs show you whatever people have decided to log manually, which is usually very incomplete.
Hey, this isn't the sort of telemetry we are talking about with OTel.
About the only "privacy concern" with OTel is that you are probably shipping traces/metrics to a cloud provider for your internal applications. This isn't the sort of telemetry getting baked into Microsoft or Google products that is used to try and identify personal aspects of individuals; this is data that tells you "Foo app is taking 300ms serving /bar, which is unusual".
After I added OTel to an open source project I run, I spent a bit of time arguing with someone about telemetry - they kept saying they didn't opt in and that we need to inform our users about it, etc., and I kept saying no, that's not the same type of telemetry. I wonder how common this misconception is.
Client-side transport is pretty unusual with OTel. I think almost everybody is sending things from the server side, so I don’t think your concern is usually relevant.
I think you are more talking about RUM which isn't yet supported by OpenTelemetry. I think they are working on it.
I am not sure if it will support session replay like vendors such as Sentry or New Relic offer. Technically, I think session replay (rrweb etc.) is pretty cool, but as a web visitor I am not a fan.