OpenTelemetry in 2023 (kevinslin.com)
341 points by kevinslin on Aug 28, 2023 | 244 comments



Two problems with OpenTelemetry:

1. It doesn't know what the hell it is. Is it a semantic standard? Is it a protocol? Is it a facade? Is it a library? What layer of abstraction does it provide? Answer: All of the above! All the things! All the layers!

2. No one from OpenTelemetry has actually tried instrumenting a library. And if they have, they haven't the first suggestion on how instrumenters should actually use metrics, traces, and logs. Do you write to all three? To one? I asked this question two years ago, zero answers :( [1]

[1] https://github.com/open-telemetry/opentelemetry-specificatio...


1. Agreed. It's the sink and the house attached to it, and the docs are thin and confusing as a result.

2. I had a similar experience to you. I wanted to implement a simple heartbeat in our desktop app to get an idea of usage numbers. This is surprisingly not possible, which greatly confuses me given the name of the project. The low engagement on my question put me off and I abandoned my OpenTelemetry planning completely [1][2].

[1] https://github.com/open-telemetry/community/discussions/1598

[2] https://github.com/open-telemetry/semantic-conventions/issue...


Agreed. Some things they suggest aren't actually possible with their SDKs.

For example, you cannot define a histogram's buckets near where you define the histogram. You have to give the global exporter (or w/e the type is) a list of "overrides" that map each histogram name => their buckets. This makes it extremely ugly when you have libraries that emit metrics.

https://github.com/open-telemetry/opentelemetry-go/issues/38...
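
For anyone hitting the same thing, here's roughly what that override looks like with the Go SDK's View mechanism. This is a sketch, not a recommendation: the exact type names have moved around between SDK versions, and the metric name and boundaries below are made up.

    package telemetry

    import (
        sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    )

    // newMeterProvider applies a bucket override for one histogram by name.
    // The library that creates the histogram never sees this; the buckets are
    // keyed by instrument name at the provider, far away from the call site.
    func newMeterProvider() *sdkmetric.MeterProvider {
        bucketsView := sdkmetric.NewView(
            sdkmetric.Instrument{Name: "myapp.request.duration"}, // hypothetical metric name
            sdkmetric.Stream{
                Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
                    Boundaries: []float64{0.005, 0.01, 0.05, 0.1, 0.5, 1, 5},
                },
            },
        )
        return sdkmetric.NewMeterProvider(sdkmetric.WithView(bucketsView))
    }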


It's silent when you fuck up the selector too.

Actually it's silent pretty much all of the time. Very reminiscent of J2EE coding. Stare at the configs and hope for enlightenment.


Good set of questions, but I don't think they matter. I don't think those are answerable questions for observability, be it OpenTelemetry or proprietary systems.

You can go read the leading observability companies' web pages and they'll have a 4-page writeup on custom instrumentation. That's not much; it just covers the elementary basics. It's not like OTel is behind. The answer just tends heavily towards "it depends."

Once you have experience - OTel or other - you can work through these things that might confound a neophyte.


The problem is they keep making the OTel tooling worse for working through these things, because the people writing the OTel tooling broadly aren't the people actually trying to monitor things. Even before OTel, plain Prometheus client libraries suffered from this.


All they had to do was maintain 3 JSON schema files.


They maintain a bunch of protobuf files, which is somewhat better and more efficient. You can still use json if you want.


Changing subject, love your work with ocaml.


Thank you ^_^


3. Does it scale?


Scale to what? We're generating >70k spans/sec and it's working fine. I'd say we're fairly medium size at best, though.


I totally read that with kubernetes in mind.


I love OpenTelemetry and we want to trace almost every span happening. We'd be bankrupt if we went with any vendor. We wired up OpenTelemetry with Java magic, 0 effort, pointed it at a self-hosted ClickHouse, and store 700M+ spans per day on a $100 EC2 instance.

https://clickhouse.com/blog/how-we-used-clickhouse-to-store-...


I've got a small personal project submitting traces/logs/metrics to Clickhouse via SigNoz. Only about 400k-800k spans per day (https://i.imgur.com/s0J6Mzo.png), but running on a single t4g.small with CPU typically at 11% and IOPS at 4%. I also have everything older than a certain number of GB getting pushed to a sc1 cold storage drive.

w/ 1 month retention for traces:

    ┌─parts.table─────────────────┬──────rows─┬─disk_size──┬─engine────┬─compressed_size─┬─uncompressed_size─┬────ratio─┐
    │ signoz_index_v2             │  26902115 │ 17.06 GiB  │ MergeTree │ 6.21 GiB        │ 66.74 GiB         │   0.0930 │
    │ durationSort                │  26901998 │ 5.44 GiB   │ MergeTree │ 5.40 GiB        │ 53.02 GiB         │  0.10190 │
    │ trace_log                   │ 123185362 │ 2.64 GiB   │ MergeTree │ 2.64 GiB        │ 37.96 GiB         │   0.0695 │
    │ trace_log_0                 │ 120052084 │ 2.46 GiB   │ MergeTree │ 2.45 GiB        │ 37.60 GiB         │  0.06528 │
    │ signoz_spans                │  26902115 │ 2.21 GiB   │ MergeTree │ 2.21 GiB        │ 76.73 GiB         │ 0.028784 │
    │ query_log                   │  16384865 │ 1.91 GiB   │ MergeTree │ 1.90 GiB        │ 18.31 GiB         │  0.10398 │
    │ part_log                    │  17906105 │ 846.73 MiB │ MergeTree │ 845.39 MiB      │ 3.84 GiB          │  0.21521 │
    │ metric_log                  │   4713151 │ 820.92 MiB │ MergeTree │ 806.13 MiB      │ 14.56 GiB         │  0.05405 │
    │ part_log_0                  │  15632289 │ 702.82 MiB │ MergeTree │ 701.70 MiB      │ 3.34 GiB          │  0.20490 │
    │ asynchronous_metric_log     │ 795170674 │ 576.24 MiB │ MergeTree │ 562.50 MiB      │ 11.11 GiB         │ 0.049429 │
    │ query_views_log             │   6597156 │ 461.35 MiB │ MergeTree │ 459.75 MiB      │ 6.36 GiB          │  0.07060 │
    │ logs                        │   6448259 │ 408.59 MiB │ MergeTree │ 406.65 MiB      │ 5.99 GiB          │  0.06627 │
    │ samples_v2                  │ 949110122 │ 345.01 MiB │ MergeTree │ 325.31 MiB      │ 22.09 GiB         │ 0.014382 │
If I was less stupid I'd get a machine with the recommended Clickhouse specs and save myself a few hours of tuning, but this works great.

Downsides:

- clickhouse takes about 5 minutes to start up because my tiny sc1 drive has like 4 IOPS allowed

- signoz's UI isn't amazing. It's totally functional, and they've been improving very quickly, but don't expect datadog-level polish


Thanks for mentioning SigNoz, I am one of the maintainers at SigNoz and would love your feedback on how we can improve it further.

If anyone wants to check our project, here’s our GitHub repo - https://github.com/SigNoz/signoz


I hope I'm not coming across as negative! Y'all just have a much younger product, and haven't had time to do all the polish and tiny tweaks. I'm also much more familiar with Datadog, and sometimes a learning curve feels like missing features.

- I really like your new Logs & Traces Explorers. I spend a lot of time coming up with queries, and having a focused place for that is great. Especially since there's now a way to quickly turn my query into an alert or a dashboard item.

- You've also recently (6mo?) improved the autocomplete dramatically! This is awesome, and one of my annoyances with Datadog

Other feedback, and honestly this is all very minor. I'd be perfectly happy if nothing ever changed.

- where do I go see the metrics? There's no "Metrics" tab the way there's a "Logs" and "Traces" tab. A "Metrics Explorer" would be great.

- when I add a new plot, having to start out with a blank slate is not great. Datadog defaults to a generic system.cpu query just to fill something in, I find this helpful.

- when I have a plot in a dashboard and I see it is trending in the wrong direction, it would be nice to be able to create an alert directly from the chart rather than have to copy the query over.

- the exceptions tab is very helpful, but I've only recently discovered the LOW_CARDINAL_EXCEPTION_GROUPING flag. It'd be super nice if the variable part of exceptions was automatically detected and they were grouped

- one nice thing in DD is being able to preview a span from a log, or logs from a span, without opening a new page. Or previewing a span from the global page. Temporarily popping this stuff up in a sidebar would be great.

- I'm not sure if there's a way to view only root spans in the trace viewer.

- This might be a problem with the spring boot instrumentation, but I can't see how to figure out what kind of span it is. Is it a `http.request`, `db.query`, etc?


Thanks for the detailed feedback, this is gold!

> - where do I go see the metrics? There's no "Metrics" tab the way there's a "Logs" and "Traces" tab. A "Metrics Explorer" would be great.

Great idea. This is something which a few users have asked for, and we will be shipping it in a few releases.

> - when I have a plot in a dashboard and I see it is trending in the wrong direction, it would be nice to be able to create an alert directly from the chart rather than have to copy the query over.

Fair point, this is something which is also in the pipeline.

> - I'm not sure if there's a way to view only root spans in the trace viewer.

We launched a tab in the new traces explorer for this, does it not serve your use case?

> - when I add a new plot, having to start out with a blank slate is not great. Datadog defaults to a generic system.cpu query just to fill something in, I find this helpful.

We can do something like this, but we don't necessarily know the names of the metrics users are sending us, unlike Datadog, which has some default metrics that their agents generate.

Will also look into the other feedback you have given.


Oooooh this is great feedback. When I started using SigNoz I too was surprised that there was no metrics tab. I wonder what you’d expect to see after clicking it: a list of service names collecting metrics? Or some high-level system wide infra metrics? Let me know!


Are you making sure that you're applying a sample rate, but sending over all errors?

At a former place, we were doing 5% of non-error traces.


Careful: we've had systems go down under increased load from just emitting errors, if they didn't emit much in the non-error state.


Can you go into more detail about your comment, please?


Not the GP, but:

Imagine you're sampling successful traces at, say, 1%, but sending all error traces. If your error rate is low, maybe also 1%, your trace volume will be about 2% of your overall request volume.

Then you push an update that introduces a bug and now all requests fail with an error, and all those traces get sampled. Your trace volume just increased 50x, and your infrastructure may not be prepared for that.


Sorry, been busy running around all day. Basically, what's happened for us on some very high transaction-per-second services is that we only log errors, or trace errors, and the service basically never has errors. So imagine a service that is getting 800,000 to 3 million requests a second, happily going along basically not logging or tracing anything. Then all of a sudden a circuit opens on Redis, and for every single one of those requests that was meant to use that now-open circuit to Redis, you log or trace an error. You went from a system that is doing basically no logging or tracing to one that is logging or tracing 800,000 to 3 million times a second.

What actually happens is you open the circuit on Redis because Redis is a little bit slow, or you're a little bit slow calling Redis, and now you're logging or tracing 100,000 times a second instead of zero, and that bit of logging makes the rest of the requests slow down, so within a few seconds you're actually logging or tracing 3 million requests a second. You have now toppled your tracing system, your logging system, and the service that's doing the work. A death spiral ensues. Now the systems that call this system start slowing down and start tracing or logging more, because they're also only tracing or logging mainly on error. Or, worse, you have code that assumes the tracing or logging system is always up, and that starts failing and causing errors, and you get into an extra-special death loop that can only be recovered from by not attempting to log or trace at all during an outage like this, and you must push a fix. All of these scenarios have happened to me in production.

In general you don't want your system to do more work in a bad state. In fact, as the AWS Well-Architected guide says, when you're overloaded or in a heavy error state you should be doing as little work as possible, so that you can recover.


We've seen problems with memory usage on failure too. The Python implementation sends data to the collector in a separate thread from the HTTP server operations. But if these exports start failing, it's configured for exponential backoff, so it can hold onto a lot of memory and start causing issues with container memory limits.


I've configured our systems to start dropping data at this point and emit an alarm metric that logging/metrics are overloaded


I think what they mean is that if you provisioned your system to receive spans for 5% of non-error requests and a few error requests, then if for some random act of god all the requests yield an error, your span collector will suddenly receive spans for all requests.


How do you send all errors? The way tracing works, as I understand it, is that each microservice gets a trace header which indicates if it should sample and each microservice itself records traces. If microservice A calls microservice B and B returns successfully but then A ends up erroring, how can you retroactively tell B to record the trace that it already finished making and threw away? Or do you just accept incomplete traces when there are errors?


You can do head-based sampling and tail-based sampling.

With head sampling, the first service in the request chain can make the decision about whether to trace, which can reduce tracing overhead on services further down.

With tail-based sampling, the tracing backend can make a determination about whether to persist the trace after the trace has been collected. This has tracing overheads, but allows you to make decisions like “always keep errors”.


https://opentelemetry.io/docs/concepts/sampling/ describes it as Head/Tail sampling, but in practice with vendors I see it as Ingestion sampling and Index sampling. We send all our spans to be ingested, but have a sample rate on indexing. That allows us to override the sampling at index and force errors and other high value spans to always be indexed.


Maybe the Go client doesn't support that? https://opentelemetry.io/docs/instrumentation/go/sampling/


It does, but the docs aren't clear on that yet. TraceIdRatioBased is the "take X% of traces" sampler that all SDKs support today.
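
A minimal sketch of wiring that up in the Go SDK (the identifier is TraceIDRatioBased there; the 10% ratio is just an example):

    package telemetry

    import (
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // newTracerProvider samples ~10% of root traces. The ParentBased wrapper
    // makes downstream services follow the decision already made upstream.
    func newTracerProvider() *sdktrace.TracerProvider {
        return sdktrace.NewTracerProvider(
            sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))),
        )
    }

Keeping all errors on top of that needs a tail-based decision (e.g. in the collector), since the outcome isn't known yet when the root span starts.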


Normally yes, but we do a lot of data collection and identifying what's an error is usually hard because of partial errors. We also care about performance, per tenant and per resource with lots of dimensionality and sampling reduces that information for us.


The reality is that most people don't want to manage their own Clickhouse store, and not all engineers can operate with SQL as efficiently as with code (me included). Nonetheless, this is pretty cool!


> not all engineers can operate with SQL as efficiently as with code

I don’t mean for this to sound insulting but I honestly do not think this is an acceptable take to have as a developer.

Not knowing SQL is like refusing to learn any language that has classes in it, simply because you don’t like it.

I’ve heard stories of huge corporations failing product launches because some code was written to SELECT * from a database and filtering it in-app instead of doing the queries correctly, and what’s so fun with these types of issues is that they usually don’t appear until weeks later when the table has grown to a size where it becomes a problem.

When you’re saying that you’d rather find the data in-app than in-database, you’re putting the work on an inferior party in the transaction simply because you can’t be bothered.

The code will never* find the correct data faster than the database.

* there may be exceptions, but they’re far enough between to still say “never”.


Dropping down to SQL to write a really complex query is, in my professional experience, always a poor use of time. It's far simpler to just write the dumb for-loops over your data, if you can access it.

Of course not all engineers can operate with SQL as efficiently as code -- that's the whole point. Otherwise why would we be writing code? Learning SQL intimately doesn't change that fact.


> Dropping down to SQL to write a really complex query is, in my professional experience, always a poor use of time.

We’re not talking about Assembly here, “dropping down” to SQL is something that anyone should be expected to do as soon as you’re grabbing or modifying any data from a database in any scenario where performance or integrity matters. The errors you can see in situations like this are extremely complex and databases literally exist to solve them for us.

Also, if we just completely disregard the performance for a second and focus on data security instead, how do you ensure sensitive data isn’t passed to the wrong party if you don’t care about what queries are being sent?

I mean, it doesn’t matter if it’s not “in the end” displayed to an end user in the application you’re writing, or if it’s not stored in the intermediary node where your code is running; that data is now unnecessarily on the wire in a situation where it never should have been in the first place. If you end up mixing one customer’s data with another’s and sending all of it in such a way that it could even theoretically be accessed by a third party, that’s a lawsuit waiting to happen regardless of whether it was “displayed” or “forwarded” or not.

Imagine if you sniffed the packets going to some logistics app you use on your phone and you saw meta-data for all packages in your zip code in the response, or if some widget showing you your carbon footprint actually was based on a response containing the carbon footprint of every customer in the database. Even if it’s just [user_id,co2] it’s still completely unacceptable.

Never mind scenarios where you’re modifying, adding or deleting data, those are even worse and no explanation should be necessary for why.


Obviously it greatly depends on what you're doing. If you're using a relational database as a glorified key-value store for offline or batch processing of a few hundred megabytes of data, sure. Hell, just serialize and unserialize a JSON document on every run if it's small and infrequent enough ¯\_(ツ)_/¯

If you've got a successful data hungry web service with a reasonably normalized schema and moderately complex access patterns though, you're not going to be looping over the whole thing on every page load.


SQL is definitely worth learning. Recently found that processing a 350kb json is equally fast by sending it to Postgres for processing, compared to using some dedicated Java libraries: https://ako.github.io/blog/2023/08/25/json-transformations.h...

This opens some interesting options if you want to join the result with data from your database.


Way way way slower though. I just added something to our app that took 600ms for the naive ‘search and loop’ version (and kept getting slower the more items you needed, completely unscalable) vs 30ms for the ‘real SQL query’ version. Guess which version actually got committed.


Did your for loop solution include concurrent access by multiple clients? I highly doubt "engineers can not operate with SQL as efficiently as code" can implement anything even remotely as robust as what SQL DBMS offer even for basic use cases. Are you mutating data? What will happen if the system crashes in the middle of the mutation? How are you handling concurrent writes and reads?


It's unclear whether you mean that it's simpler to make a query and iterate over the rows to massage the result in your application or to make a query and then iterate over the returned rows to make more single-row queries. (Or perhaps some secret third thing I'm not considering.)

I'll admit I'm a little curious about what exactly you mean here.


SQL is code.


There’s a difference between writing olap and oltp sql queries. Hell, in the industry we even have a dedicated role for people who, among other things, write olap queries: data analysts. I’m assuming here that we are talking about writing complex analytical queries.


SQL is code and absolutely worth learning.


"Don't want to manage their own" has for so long been a valid excuse but cloud costs haven't been going down for so long - in many cases prices have increased - and hardware keeps getting more badass. In so many cases it's fear speaking.

A decent-sized server will host a hugely capable instance that you may not have to think about for years. The scoffing at DIY has made sense to some degree, but "it just works brilliantly" keeps becoming a stronger and stronger case, and most people just assume reality can't actually work that well, that it'll be bad, and those folks won't always be right.


With current SSD prices a box that will have 30 million IOPS can cost you 10K. 30 mil IOPS in a cloud would be crazy $$$$


But in this case we are not even talking about own/rented HW vs cloud. It's self-hosted(even on cloud) vs SaaS softwares!

SaaS, especially in this space, can be *extremely* costly, and its cost will scale up quickly as you send more traffic (either willingly or by mistake). Yes, Datadog, New Relic etc. will give you many pre-built and well-thought-out dashboards and some fancy AI-powered auto-detection thing, but they will charge many $$$ for it. Consider that cost management/analysis tools that were historically focused only on cloud are now adding the same tooling for costly SaaS solutions!

I understand that many HN readers are skewed towards SaaS solutions, usually because they work at a SaaS shop, but depending on the size of the company, the overhead of managing it internally can totally be worth it. There is overhead with SaaS as well...


We just left ours running for months in a Docker container. The volume is external, we just replace the container image with a new one, it takes 5 seconds to update, and spans are treated as ephemeral. We store only 7d of data. We could use S3 but we have no use for that data in the long run.

To be fair, we wanted to get experience with ClickHouse, and it's a special database that needs special attention to detail on both ops and schema design.


I'm beginning to sound like a broken record at this point, but if you don't know SQL very well but know how to use GPT-4, you have access to enough SQL to get a lot more done than you might think.


This is really interesting, thanks for sharing. What's also cool was the low effort needed for this setup (Java autoinstrumentation + Clickhouse exporter + Grafana Clickhouse Plugin).


That's a really informative post, the ClickHouse thing sounds interesting!


I'm hugely disappointed with OpenTelemetry. In my experience, it's an over-engineered mess and the out-of-the-box experience is super user-hostile. What it purports to be is so far away from what it actually is. OTel markets itself as a universal tracing/metrics/logs format and a set of plug-and-play libraries with adapters for everything you need. It's actually a bunch of half/poorly implemented libraries with a ton of leaky internals, bad adapters, and actually not a lot of functionality.


Agreed, I find myself having to think orthogonally to common sense whenever I try to use one of its SDKs. Nothing works the way you expect it to, everything has 3 layers of unnecessary abstraction and needs to be approached via the back door. Many features have caveats about when it works, where it works, how much it works, during what phase of the moon it works and how long your strings can be when Jupiter is visible in the sky.

That said, if we disregard the leaky SDK APIs and half-implemented everything, it does somewhat deliver on the pluggability promise. Before OTel, you had bespoke stacks for everything. Now there is some commonality - you can plug in different logging backends to one standard SDK and expect it to more or less work. Yes, it works less well than a vertically integrated stack but this is still something. It enables competition and evolution piece by piece, without having to replace an observability stack outright (never going to be a convincing proposition).

So while the developer experience is pretty unpleasant and I am also disappointed with the actual daily usage, from an architectural perspective it opens up new opportunities that did not exist before. It is at least a partial win.


Yes, I really agree, and I've gone through the same pain, but try using the alternatives that claim to be better because they have OpenAPI specifications [1]

The example shows you how to use the swagger tool, parse the OpenAPI spec [2], auto-generate GoLang glue code, call __one__ of those auto-generated functions and log a trace.

However, there is zero documentation, zero other examples, and I'm left scratching my head whether there's even one person in the world using this approach. I eventually ended up just directly using the service APIs [3] via REST calls.

OTEL is painful, but the alternatives are no better :( I really wish there were more interest in this space, since SLO and SLI measurements are becoming increasingly important.

[1] https://github.com/openzipkin/zipkin-api-example

[2] https://github.com/openzipkin/zipkin-api/blob/master/zipkin2...

[3] https://zipkin.io/zipkin-api/#/


The Prometheus text exposition format is the de-facto standard used in monitoring. It would be great to build an official observability standard on top of it. This format is much easier to debug and understand than OpenTelemetry for metrics. It is also more efficient, e.g. it requires less network bandwidth and less CPU to transfer.

[1] https://github.com/prometheus/docs/blob/main/content/docs/in...


Ok, then do you have a suggestion for an alternative, or do you just put up with OT?


I agree with OP, but I put up with Otel


Okay but can you point to some specific experience and suggest improvements?


Try implementing an OTEL Tracer. This interface is insane, and should really just be a struct.

https://github.com/open-telemetry/opentelemetry-go/blob/trac...


how is this interface insane?

It's a list of the behaviors you need to implement if you're rolling your own OTEL Tracer Span implementation, and not using one of the multiple available.

In contrast, OpenTracing's interfaces had hardly any required methods, so you had to do a runtime type-cast to whichever implementation you were using in order to access anything useful on the Span, like the OperationName.


Why were you implementing your own tracer? Don't they publish an implementation?


Yes, all SDKs have a tracer you can just use. While you can technically create your own tracer, you're officially in Hard Mode territory - there's no highly extensible system I'm aware of that makes swapping core concepts (that already have a default) easy.
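
For reference, the "just use it" path looks something like this in Go; the scope and span names here are made up, and this assumes the usual global-provider setup:

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel"
    )

    func handleRequest(ctx context.Context) {
        // The tracer comes from whatever TracerProvider was registered globally
        // (the SDK's default, or one you installed with otel.SetTracerProvider).
        tracer := otel.Tracer("github.com/example/myapp") // hypothetical scope name
        ctx, span := tracer.Start(ctx, "handle-request")  // hypothetical span name
        defer span.End()

        doWork(ctx) // pass ctx on so child spans attach to this one
    }

    func doWork(context.Context) {}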


The OTEL official project libraries don't work well on the web frontend yet. No way of correlating errors to source-maps for instance, at least out of the box.

The web browser collector published by the OTEL project uses Zone.js to hijack just about everything in the browser into contexts. If you've used modern Angular before, you may recognize zone.js; it's a real pain sometimes, and it messes with globals, which isn't great, as it can create situations where behavior isn't predictable.

I don't know that OTEL has any standard around things like session replays either. Lots of telemetry platforms support this (Sentry, Rollbar, DataDog etc)

I think a lot of backend teams have really liked it. I do like the cross-boundary nature of spans, where you can follow them by a unique tag across your entire system.

I personally find it extremely verbose at times, in terms of the payload it generates; some logging platforms are more compact in this regard, but in practice I haven't noticed it to be an issue.


I think with native Promises (and async/await?) there is currently no way to implement something like Zone.js properly. I've tried to instrument my code manually, but it's really error-prone and verbose. We really need something like https://nodejs.org/api/async_context.html#class-asynclocalst... to be implemented in the browser.


You can follow [0] which is currently stage 2 to fix this

[0]: https://github.com/tc39/proposal-async-context


In addition to this, there is the new (stage 3, even!) explicit resource management proposal[0], supported by TypeScript version >= 5.2[1]

Though I agree that async context is better fit for this generally, the ERM should be good for telemetry around objects that have defined lifetime semantics, which is a step in the right direction you can use today

[0]: https://github.com/tc39/proposal-explicit-resource-managemen...

[1]: https://www.totaltypescript.com/typescript-5-2-new-keyword-u...


Thanks! I was only aware of a Zones proposal which was withdrawn I think.


We've found that across most platforms, Otel is more of a starting point to building good instrumentation libraries. We've been building on top of Otel/Splunk's browser SDK implementation for our package [1] which does roll in session replay, and also adding on better exception tracking, etc. None of which remotely comes out of the box unfortunately.

Being able to connect frontend sessions -> backend trace/logs has been a huge DX change though imo.

[1] https://www.hyperdx.io/blog/browser-based-distributed-tracin...


Do check this doc from Otel https://opentelemetry.io/docs/instrumentation/js/getting-sta...

It is not completely solving the issue, but a starting point


this is exactly what I was referencing; it's not really a starting point. I read through all the docs quite thoroughly, and these are missing features and design choices


DataDog's front end instrumentation (Real User Monitoring) is also notably unrefined. Playing with Duplo blocks level of finesse.

Does anyone have an even part way start at doing front end tracing?


As with most of the DataDog platform, I have found that once you go underneath the sheen it leaves a lot to be desired.

I've had better runs with Bugsnag for pure error reporting and more recently Sentry, which can do RUM / Session Replay collections.

None are what you expect though. If you want really good user behavior analytics FullStory is still top notch


Have you come across Grafana Faro at all? You can have it send to Grafana Agent which is OSS and can store traces in other locations.


A few of my colleagues and I had the silly (?) idea that you don't really need logs anymore. Instead of log messages you just attach span events [0]. You then just log the span title and a link to that span in Jaeger; something like [1]. I've only really tried that in my private project, but it felt pretty good. The UI of Jaeger could be a bit better to support that usage, though.

Edit: Actually, those colleagues are doing a talk about that topic. So, if you are in Germany and Hannover area, have a look at [2] and search for "Nie wieder Log-Files!".

[0]: https://opentelemetry.io/docs/instrumentation/ruby/manual/#a...

[1]:

    tracing.ts:38 Usecase: Handle Auth 
    tracing.ts:47 http://localhost:16686/trace/ec7ffb1e23ddbb8dd770a3f08028666b
    tracing.ts:38 Adapter: Find Personal Board 
    tracing.ts:47 http://localhost:16686/trace/e22d342316ab0d7d23230864008e27bc
    tracing.ts:38 Adapter: Find Starred Board List 
    tracing.ts:47 http://localhost:16686/trace/129f89cee26d54cfdc38abea368d9b4e
    tracing.ts:38 Adapter: Find Personal Board List 
    tracing.ts:47 http://localhost:16686/trace/97948127d77501ff0c65a5db21b21b5a
[2]: https://javaforumnord.de/2023/programm/
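
In Go, the "span event instead of a log line" pattern described above looks roughly like this; the event name and attribute are invented for illustration:

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/trace"
    )

    // Instead of log.Info("user authenticated"), attach an event (with
    // attributes) to the span that is already active for this request.
    func recordAuth(ctx context.Context) {
        span := trace.SpanFromContext(ctx)
        span.AddEvent("user authenticated", // hypothetical event name
            trace.WithAttributes(attribute.String("auth.method", "oauth")),
        )
    }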


Depending on how greenfield a project is, you don't even need span events unless you absolutely require a timestamp for a specific operation with no duration. Just using spans for every meaningful operation is like having more powerful structured logs.

This approach isn't possible for a lot of systems that have existing logs they need to bring along, but if you're greenfield enough, I'd recommend it.


Using only spans is surprisingly effective!

I did a talk about this at QCon London last year

https://www.infoq.com/presentations/event-tracing-monitoring...


Agreed.

Same is true for metrics derived from spans. (Though for metrics, you don't need to sample, and for spans you might. So keep in mind.)


I've thought about doing this as well, but I like being able to use dumb tools to get an idea of what's going on. There's a lot that has to be working correctly to use traces. Or maybe I'm just scared of the tooling because I don't have enough experience with it yet idk


You can also just log the spans to stderr/stdout as they are being created -- I've done this on a previous project with this approach of "spans first".

It made it debuggable via output if needed, but the primary consumption became span oriented.


Good idea yeah, but do the same notions of log level apply?


This was the idea behind Stripe's Veneur project - spans, logs, and metrics all in the same format, "automatically" rolling up cardinality as needed - which I thought was cool but also that it would be very hard to get non-SRE developers on board with when I saw a talk about it a few years ago.

https://github.com/stripe/veneur


You don't even really need to ship traces anywhere. You can just keep them in-process, and build an API on top of that in-memory trace data.


I’ve been talking about the same thing at my place. I think it makes a ton of sense and then can kill logs almost completely


We tried this with Grafana Cloud/Tempo but hit trace size limits. Logs are still good for "events"


OpenTelemetry is a marketing-driven project, designed by committee, implemented naively and inefficiently, and guided by the primary goal of allowing Fortune X00 CTOs to tick off some boxes on their strategy roadmap documents.

It's not something that anyone with a choice in the matter should be using.


It looks like every other comment in this thread is favorable to very positive, can you go into more detail about what specifically isn't good about it?


Not the previous poster, but I had to implement it a few years ago, and I found it unbelievably complex, with dense and difficult-to-read specifications. I've implemented plenty of protocols and formats from scratch using just the specification, but rarely have I had as much difficulty as with OpenTelemetry.

I guess this is something you don't notice as merely a "user", but IMHO it's horribly overengineered for what it does and I'm absolutely not a fan.

I also disliked the Go tooling for it, which is "badly written Java in Go syntax", or something along these lines.

This was 2 years ago. Maybe it's better now, but I doubt it.

In our case it was 100% a "tick off some boxes on their strategy roadmap documents" project too and we had much much better solutions.

OTel is one of those "yeah, it works ... I guess" but also "ewwww".


I'd recommend trying it out today. In 2021, very few things in OTel were GA and there wasn't nearly as much automatic instrumentation. One of the reasons why you had to dive into the spec was because there was also very little documentation, too, indicative of a heavily in-progress project. All of these things are now different.


I'll be happy to take your word that some implementation issues are now improved, but things like "overengineered" and "way too complex for what it needs to do" really are foundational, and can't just be "fixed" without starting from scratch (and presumably this is all by design in the first place).


That's fair. I find that to be a bit subjective anyways, so I don't have much to comment on there. Most languages are pretty lightweight. For example, initializing instrumentation packages and creating some custom instrumentation in Python is very lightweight. Golang is far more verbose, though. I see that as part and parcel of different cultures for different languages, though (I've always loved the brevity of Python API design and disliked the verbosity of Go API design).


One of the main reasons I became disillusioned with OTel was that the project treated "automatic instrumentation" as a core assumption and design goal for all supported languages, regardless of any language-specific idioms or constraints.

I'm not an expert in every language, but I am an expert in a few, and this just isn't something that you can assume. Languages like Go deliberately do not provide the sorts of features needed to support "automatic instrumentation" in this sense. You have to fold those concerns into the design of the program itself, via modules or packages which authors explicitly opt-in to at the source level.

I completely understand the enormous value of a single, cross-language, cross-backend set of abstractions and patterns for automatic instrumentation. But (IMO and IME) current technology makes that goal mutually exclusive with performance requirements at any non-trivial scale. You have to specialize -- by language, by access pattern (metrics, logs, etc.), by concrete system (backend), and so on -- to get any kind of reasonable user experience.


The Spec itself is 'badly written Java'. I haven't been a Java dev for about ten years. At this point it's a honey pot for architectural astronauts - a great service to humanity.

That is, until some open standard is defined by said Java astronauts.


> OpenTelemetry is a marketing-driven project, designed by committee, implemented naively and inefficiently, and guided by the primary goal of allowing Fortune X00 CTOs to tick off some boxes on their strategy roadmap documents.

I'm the founder of highlight.io. On the consumer side as a company, we've seen a lot of value from OTEL; we've used it to build out language support for quite a few customers at this point, and the community is very receptive.

Here's an example of us putting up a change: https://github.com/open-telemetry/opentelemetry-js/pull/4049

Do you mind sharing why you think no-one should be using it? Some reasoning would be nice.


I don't think that's true. It seems like it's more of an "oh shit, all this open source software emits Prometheus metrics and Jaeger traces, but we want to sell our proprietary alternatives to these and don't want to upstream patches to every project". (Datadog had a literal army of people adding datadog support to OSS projects. Honestly, probably a great early-career job; diving into unfamiliar codebases is a superpower.)

OTel lets the open source projects use an abstraction layer so that you can buy instead of self-host.

None of this has ever made me feel super great, but in the end I would probably consider OTel today for services that people other than my company operate. That way if some user wants to use Datadog, we're not in their way.

I used OTel in the very very early days and was rather disappointed; the Go APIs were extremely inefficient (a context.Context is needed to increment a counter? no IO in my request path please), and abstracted leakily (no way to set histogram buckets when exporting to Prometheus). I assume they probably fixed that stuff at some point, though.
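
For context, this is the kind of call being described; as of recent otel-go versions the metrics API still takes a context on every increment. The scope, metric, and attribute names below are invented:

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
    )

    func countRequest(ctx context.Context) {
        meter := otel.Meter("github.com/example/myapp") // hypothetical scope name
        requests, err := meter.Int64Counter("myapp.requests.total")
        if err != nil {
            return
        }
        // The context is part of the call even for a simple increment.
        requests.Add(ctx, 1, metric.WithAttributes(attribute.String("route", "/users")))
    }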


What helps hosted data collectors helps self-hosting setups just as much.

More and more solutions are getting built in OTEL support, which means you can relatively seamlessly switch between backends without changing anything in your application code.


This only makes sense if you're in a world where you're switching backends more than once, which means you're not seriously programming, you're just burning VC money for lottery tickets.


I agree with this. For internal apps, pick a system and stick with it. If you're excited by Datadog's marketing pitch, just buy it and use it. It will not make or break your startup; like if your Datadog bill is what's standing between you and profitability, then you probably didn't actually find product/market fit. Switching to Prometheus at that point also won't help you find product/market fit.

In the 2 jobs where I've set up the production environment, I just picked Prometheus/Jaeger/cloud provider log storage/Grafana on day 1 and have never been disappointed. You explode the helm chart into your cluster over the course of 30 minutes, and then move on to making something great (or spending a week debugging Webpack; can't help you with that one).


You had better have PMF if you pick Datadog!

https://thenewstack.io/datadogs-65m-bill-and-why-developers-...


Also useful for local dev (you can still use it locally without a SaaS backend) and it helps with interoperability. Infrastructure like Envoy and nginx can emit spans that integrate with 1st party code and other 3rd party code. OSS libraries are more likely to implement an open standard so they just plug and play and emit data for internal things they're doing (especially helpful for things like DB and HTTP)


OTel is the backend, in-program equivalent of "we need all of five analytics systems on our frontend to figure out that users bounce when our page takes 10s to load because it has five analytics systems in it".


What do you use?


That's overly harsh; they are doing good work, and I think their data model is a step in the right direction.

Their processors are quite capable and the entire receiver and exporter contrib collection is pretty good.

I'm not saying it's the best solution out there because that clearly depends on each use case but I don't think such harsh criticism makes sense.

Disclaimer: I'm part of the fluent-bit maintainer team.


Being able to have services speak OTLP and having my application configurations simplified to sending data to the OTEL collector is great.

From an ops point-of-view devs can add whatever observability to their code and I can enforce certain filtering in a central place as well as only needing one central ingress path that applications talk to.

Also because everything emits OTLP if we ever want to move to new backends it's just a matter of changing a yaml file and not rewriting applications to support a new logging backend.

Given the choice of going back to the old way of using vendor-specific logging libraries, I will continue using OTEL 10/10 times because even given its warts, it's still a lot nicer than the alternatives.
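
As a concrete sketch of what "the application only talks to the collector" looks like on the app side in Go (the endpoint address is made up, and in practice it's usually picked up from OTEL_EXPORTER_OTLP_ENDPOINT instead of being hard-coded):

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // The app only knows the collector's OTLP endpoint; filtering, sampling,
    // and the choice of backend live in the collector's own config.
    func newTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
        exp, err := otlptracegrpc.New(ctx,
            otlptracegrpc.WithEndpoint("otel-collector:4317"), // hypothetical address
            otlptracegrpc.WithInsecure(),
        )
        if err != nil {
            return nil, err
        }
        return sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp)), nil
    }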


Switching observability backends is like switching databases -- feasible in theory, impossible in practice for anything but the most trivial use cases.

You can't build a sound product if that's one of the design requirements.


Except it isn't impossible because using OTLP as the data format means you're decoupled from any single backend.

Switching to a new backend is as simple as deploying the new backend, changing 1 line in the OTEL Collector yaml, then having your front-end pull from the new backend. 0 changes to application code necessary.


Metrics, logs, and traces are abstract categories of telemetry data, representing the most common modalities in which that data is produced and consumed. They are explicitly not specific or concrete types that can be precisely defined by e.g. a Protobuf schema.

These domain concepts are descriptive, not prescriptive. They don't, and can't possibly, have specific wire-level definitions. So another way to phrase my point might be to say that OTel is asserting definitions which don't actually exist.

Telemetry necessarily requires specialization between producer (application/service) and consumer (observability backend) in order to deliver acceptable performance. It's core to the program as it is written: more like error handling than e.g. containerization.


What should people use?

With basic parameters in place so it doesn't eat your bill, it's been working great for me for years. Initially with New Relic, then Datadog; now a setup with OpenTelemetry is good enough.


Instrumentation isn't solved by any single specific thing. It's a praxis that you apply to your code as you write it, like, I guess, error handling; it's not a product that you can deploy, like, I guess, Splunk or New Relic or whatever else.

You should "use" metrics, logs, and traces thru dependencies that are specific to your organization. The interface between your business logic and operational telemetry should be abstract, essentially the same as a database or a remote HTTP endpoint or etc. The concrete system(s) collecting and serving telemetry data are the responsibility of your dev/ops or whatever team.

Main point: instrumentation is part of the development process, not something that's automatic or that can be bolted-on.


Have you worked with OTEL before? Basically all of your points about instrumentation are actually quite sympathetic to OTEL's view of the world. The whole point of OTEL is to provide some standards around how these pieces fit together - not to solve for them automatically.


I've been deeply involved with OTel from even before it was a CNCF project. My experiences with the project, over time, have made me basically abandon it as unsound and infeasible as of a year or two ago. Those experiences also inform comments like the ones I've made here.


Can you elaborate on what unsound and infeasible mean? I'm newer to OTel than you (~6 months of working with it in depth), and don't really understand what you're getting at. It's solving real problems in my organization, with only a "regular" amount of pain for a component of its size.


Okay so what’s the interface? Sounds like what OTEL provides to me


There are well-defined interfaces for specific sub-classes of telemetry data. Prometheus provides a set of interfaces for metrics which are pretty battle-tested by now. There are similar interfaces for logs and traces, authored by various different parties, and with various different capabilities, trade-offs, etc.

There is no one true interface! The interface is a function of the sub-class of telemetry data it serves, the specific properties of the service(s) it supports, the teams it's used by, the organization that maintains it, etc. etc.

OTel tries to assert a general-purpose interface. But this is exactly the issue with the project. That interface doesn't exist.


OTEL is a set of interfaces, so I’m not sure your last point applies. I do agree that battle tested things like Prometheus work great, but why not have a set of standardized interfaces? There is clearly a cost to having them; for some projects this may be too much. For the projects I’ve used it in it let me spin up all the traces and telemetry without thinking hard.


> What should people use?

I recall Apache Skywalking being pretty good, especially for smaller/medium scale projects: https://skywalking.apache.org/

The architecture is simple, the performance is adequate, it doesn't make you spend days configuring it and it even supports various different data stores: https://skywalking.apache.org/docs/main/v9.5.0/en/setup/back...

The problems with it are that it isn't super popular (although has agents for most popular stacks), the docs could be slightly better and I recall them also working on a new UI so there is a little bit of churn: https://skywalking.apache.org/downloads/

Still better than some of the other options when you need something that just works, instead of spending a lot of time configuring something (even when that something might be superior in regards to features): https://github.com/apache/skywalking/blob/master/docker/dock...

Sentry comes to mind (OpenTelemetry also isn't simpler due to how much it tries to do, given all the separate parts), compare its complexity to Skywalking: https://github.com/getsentry/self-hosted/blob/master/docker-...

I wish there was more self-hosted software like that out there, enough to address certain concerns in a simple way on day 1 and leave branching out to more complex options like OpenTelemetry once you have a separate team for that and the cash is rolling in.


I'm honestly thinking that one of the statsd variants with label support would have been just fine if I'd had a time machine. The complexity overhead of labels in OpenTelemetry does not make it the slam-dunk it appears to be.

Internally, OTEL has to keep track of every combination of labels it's seen since process start, which can easily come to dominate the processing time in an existing project. It's another in a long line of tools that dovetail with my overall software development philosophy which is that you can make pretty much any process work for 18 months before the wheels fall off.

By the time you notice OpenTelemetry is a problem, you've got 18 months of work to start trying to roll back.


> Internally, OTEL has to keep track of every combination of labels it's seen since process start, which can easily come to dominate the processing time in an existing project.

Well, every unique combination of labels represents a discrete time series of telemetry data, and the total set of all time series in your entire organization always has to be of finite and reasonable cardinality. This means that label values always have to be finite e.g. enumerations, and never e.g. arbitrary values from user input.

> my overall software development philosophy which is that you can make pretty much any process work for 18 months before the wheels fall off.

The set of labels in your process after (say) one day of regular traffic should be basically the same size as after (say) 18 months of regular traffic. If it isn't, that usually signals that you're stuffing invalid data into label values.
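
A tiny illustration of the bounded-vs-unbounded distinction, assuming the otel-go metrics API; the attribute names here are invented:

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
    )

    func record(ctx context.Context, requests metric.Int64Counter, userID string) {
        // Fine: "status" has a small, fixed set of values, so the set of label
        // combinations stays bounded no matter how long the process runs.
        requests.Add(ctx, 1, metric.WithAttributes(attribute.String("status", "timeout")))

        // Dangerous: one time series per user, so the SDK (and the backend)
        // must track an ever-growing set of label combinations.
        requests.Add(ctx, 1, metric.WithAttributes(attribute.String("user.id", userID)))
    }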


I have no idea why you think that all attributes need to be buffered in process forever. Most metrics systems simply keep key sets cached in RAM for as long as they're still being emitted. Many drop unused key sets after something like 10 minutes. But as with all metrics processing, you should ideally keep cardinality to a bounded set in order to avoid these types of issues both client- and server-side.

I'm sure there are valid qualms with OTEL in general, but this ain't one of them. Any and all metrics telemetry systems can fall into the same design constraint you pointed out.


I don't know which implementations support invalidation, but it's not happening for the Node.js implementation.

Push implementations do not have this problem at the client end.


I kind of agree with you. Clueless managers just putting “open telemetry” on the roadmap without contextualising costs/benefits.


Well, roll up your sleeves and fix the performance bugs that affect you (source: I have).


I have no reason to do so, because I don't believe that OpenTelemetry is a project that was created, or is maintained, in good faith to its stated goals.


Care to elaborate a bit more on the goals contrast?


https://opentelemetry.io

> OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

You can absolutely categorize telemetry into these high-level pillars, true. But the specifics on how that data is captured, exported, collected, queried, etc. is necessarily unique to each pillar, programming language, backend system, organization, etc.

That's because telemetry data is always larger than the original data it represents: a production request will be of some well-defined size, but the metadata about that request is potentially infinite. Consequently, the main design constraint for telemetry systems is always efficiency.

Efficiency requires specialization, which is in direct tension with features that generalize over backends and tools, e.g.

> Traces, Metrics, Logs -- Create and collect telemetry data from your services and software, then forward them to a variety of analysis tools.

and features that generalize over languages, e.g.

> Drop-In Instrumentation -- OpenTelemetry integrates with popular libraries and frameworks such as Spring, ASP.NET Core, Express, Quarkus, and more! Installation and integration can be as simple as a few lines of code.

I think OTel treats these goals -- which are very valuable to end users!! -- as inviolable core requirements, and then does whatever is necessary to implement them. But these goals are not actually valid, and so the resulting code is often inefficient, incoherent, or even unsound.


Slight tangent for a moment, I really really hate the "subscribe" popup that comes up on this blog. It is not clear at all that you can just close it and not give your email since there is no "x" button. Instead it has the incredibly unintuitive "continue reading" under the subscribe button that I did not think would work. Clicking out of it also did not work. Seriously we can and should be better than this.

On the topic of open telemetry. I have been long wanting to play with it and see if it offers all of the capabilities when we send the data to datadog. But I have been reluctant to add in another thing to manage and train on if it means that the datadog agent is still necessary for anything outside of the basics.

Has anyone else actually tried hooking this up to datadog?

Edit: just to be clear, my driving goal of this is not necessarily to keep it with datadog. But that is currently where much of our alerting and logs are now. So the idea would be to switch to open telemetry which would then allow us to (theoretically) move to something else down the line.


WRT the popup, that's a medium.com thing. I agree that it's very annoying.


It's not Medium, it's Substack.


Personally I found the poll much worse. Got asked for my opinion, clicked on one of the options, and got rewarded with a full-screen "create an account". Hit the back button and got taken back to HN.

Nice.


Is there yet any way of having a front end for this which doesn't significantly dent your revenue stream either in staffing, infrastructure or license fees? We land over 2000 requests/second and it's expensive just keeping logs.


I recommend sampling traces if you aren't. I've been unimpressed with datadog apm, which has no affordable configuration. We've been running our own Jaeger stack with 0.1% sampling, and it's negligible to run compared to datadog apm.

For metrics and logs, sampling isn't so useful, so I don't have a good answer. Datadog has 80% gross margin, so at most 20% of what you pay them is the infra, so you stand to save a lot of money running your own open source stacks if your labor costs would be less than that 80%. With datadog, we have a project every 3 months to reduce usage, so it's not like we aren't constantly babysitting it anyway.


> I've been unimpressed with datadog apm, which has no affordable configuration

FWIW we sample in datadog APM and it works fine to control costs, I'm not sure what issues you hit.


If you’re looking for open source APM stack which is OpenTelemetry native and you can self host - you can check out SigNoz ( https://github.com/SigNoz/signoz)


> datadog apm, which has no affordable configuration

In my experience, Datadog's ingestion sampling works pretty well.

And there's retention filters you can use to override.


Sampling is the answer. Sample 1% of success and all the errors.

Cost is one thing, but you would be surprised how heavy observability can be on a service; it uses a lot of CPU.


Sampling is fine at the query layer, but if you sample at the ingest layer -- and therefore drop a majority of your telemetry data outright -- that's a total hack, and I can't see how the resulting system is anything but unsound.


You use sampling rates that are statistically significant enough for the system you're observing. You can always make exceptions to the default rate based on other heuristics. What's wrong with that, for the kinds of insights it provides?


General-purpose observability systems serve two use cases: presenting a high-level summary of system behavior, and allowing operators to inspect telemetry data associated with a specific e.g. request.

The former use case is often solved by a specific and narrow kind of observability data, which is metrics. A common tool for that purpose is Prometheus. You certainly can't query Prometheus for individual requests, which is fine, and accepting that invariant allows Prometheus to treat input data as statistical, in the sense that you mean in your comment.

But if we're talking about general-purpose telemetry, we're talking about more than just high-level summaries of system behavior, we also need to be able to inspect individual log events, trace spans, etc. If a user made a request an hour ago with reqid 123, I expect to be able to query my telemetry system for reqid 123 and see all of the metadata related to that request.

A telemetry system that samples prior to ingest certainly delivers value, but it can only ever solve the first use case, and never the second.


Define "significant". At $lastco, we routed traces to Cassandra and stored them in an AWS elasticsearch domain. Jaeger was used to visualize traces. We also wrote some elasticsearch queries to generate some basic reports, eg finding the most sluggish queries. Pretty standard stuff if you follow the OTEL/jaeger tutorials.

Traces came on the order of hundreds/second, but we didn't have downsampling turned on, just collected all. Traces were saved 7 days (configurable). Very little actual optimization at the point where I left.

I think it cost on the order of dozens to hundreds of dollars a month.

There's an environment variable you can set on your containers that defines how the tracing sampler behaves. It's in the docs; see OTEL_TRACES_SAMPLER.

https://opentelemetry.io/docs/specs/otel/configuration/sdk-e...
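
For illustration, here's a rough TypeScript sketch (assuming the standard `@opentelemetry/sdk-trace-node` and `@opentelemetry/sdk-trace-base` packages) of the same head sampling, configurable either through those env vars or in code:

    // Env-var route, picked up by most OTel SDKs at startup:
    //   OTEL_TRACES_SAMPLER=parentbased_traceidratio
    //   OTEL_TRACES_SAMPLER_ARG=0.1   (keep ~10% of root traces)
    //
    // Equivalent in-code configuration with the Node SDK:
    import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
    import {
      ParentBasedSampler,
      TraceIdRatioBasedSampler,
    } from "@opentelemetry/sdk-trace-base";

    const provider = new NodeTracerProvider({
      // Sample 10% of new traces; child spans follow their parent's decision.
      sampler: new ParentBasedSampler({
        root: new TraceIdRatioBasedSampler(0.1),
      }),
    });
    provider.register();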


I think the issue described is when you're storing in a vendor backend, e.g. Datadog or New Relic. If you're not careful with how much tracing data you ingest, "significant" can be 7 figures' worth of bills.


Most observability vendors support OTEL at this point. To plug the OSS project I work on that supports OTEL ingestion:

https://github.com/grafana/tempo/


Thanks I had no idea Tempo even existed despite using Grafana. Will read into it.


IMO, Jaeger is easier to setup/manage and has a better interface than Grafana/Tempo. It's easy to add Jaeger to your local dev stack so you can have tracing while developing.

FWIW I now use Tempo because I have everything else in Grafana (Prometheus, Loki), but I do miss using Jaeger.


> It's easy to add Jaeger to your local dev stack so you can have tracing while developing.

Tempo can be spun up with docker compose using a local disk for ephemeral storage/querying: https://github.com/grafana/tempo/blob/main/example/docker-co...

Maybe this meets your needs?

> Jaeger is easier to setup/manage and has a better interface than Grafana/Tempo

What do you enjoy about the Jaeger interface? Perhaps it's a gap in Tempo we can improve.


Jaeger can use multiple backends for storage, including Tempo, so it's not an either/or situation.

I'm fairly sure there was an official Grafana-provided Jaeger gRPC plugin for Tempo, but can't easily find it, only this one: https://github.com/flitnetics/jaeger-tempo


Why is this downvoted? This is a pain point for us too, except we're at 500k requests/second. We're currently Datadog but everyone knows they're too expensive.


Cost management via sampling is still largely a vendor concern. Each vendor has a different solution, and while some are better than others, all can be effective at bringing costs down.


We trace everything and first collect all spans in their trace. Then we sample the successful traces and retain all traces ending in a failed state. This greatly reduces the need for storage.
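
The usual place to do that is the Collector's tail_sampling processor, but to make the idea concrete, here's a rough in-process TypeScript sketch. The `ErrorBiasedExporter` name and ratio are made up, and it only sees spans within one export batch, so treat it as an illustration rather than a real tail sampler:

    import { SpanExporter, ReadableSpan } from "@opentelemetry/sdk-trace-base";
    import { ExportResult, ExportResultCode } from "@opentelemetry/core";
    import { SpanStatusCode } from "@opentelemetry/api";

    // Wraps a real exporter: keep every trace that contains an error span,
    // keep only a small fraction of the all-success traces.
    class ErrorBiasedExporter implements SpanExporter {
      constructor(private inner: SpanExporter, private successRatio = 0.01) {}

      export(spans: ReadableSpan[], done: (result: ExportResult) => void): void {
        // Group the batch's spans by trace ID.
        const byTrace = new Map<string, ReadableSpan[]>();
        for (const span of spans) {
          const id = span.spanContext().traceId;
          const bucket = byTrace.get(id) ?? [];
          bucket.push(span);
          byTrace.set(id, bucket);
        }
        const kept: ReadableSpan[] = [];
        for (const group of byTrace.values()) {
          const failed = group.some((s) => s.status.code === SpanStatusCode.ERROR);
          if (failed || Math.random() < this.successRatio) kept.push(...group);
        }
        if (kept.length === 0) return done({ code: ExportResultCode.SUCCESS });
        this.inner.export(kept, done);
      }

      shutdown(): Promise<void> {
        return this.inner.shutdown();
      }
    }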


Try coralogix.


I am building https://scratchdb.com/ to address this. It's an HTTP wrapper around Clickhouse, and uses pay-as-you-go pricing. It's also open source and easy to host yourself (single Go binary). Currently have users sending on the order of 1000 requests per second, so definitely have capacity.

Typically seeing a 0.1 compression ratio on data before other optimizations.

I have it connected to Fly.io here: https://scratchdb.com/blog/fly-logs-to-clickhouse/

I'd be really grateful to learn more about what you're looking for (how are you even managing logs today?) Even if you end up not using scratchdb it'll help me figure out the next thing to build!


Honeycomb's pricing is pretty reasonable - but looking at your volume, some sampling might help also


If you’re using Honeycomb, how did you find their solution’s developer/DevOps engineer ergonomics, their support, and their overall experience?


The only thing I can recall being a bit of a stumbling point at the beginning was finding "my" trace from a local environment. I just didn't know where to find it. Once I figured that out I've not had problems.

The high-level view, and being able to draw a box around something that looks weird on my graph and have Honeycomb tell me what's different inside and outside the box, is amazing (it's called BubbleUp, if you're searching).

It's faster than any other tool I've used; usually data is available by the time I've switched to their ui from the curl command to our API. It's mind blowing actually.

Other saas and self hosted options I've tried have all been awful in some way or other; honeycomb is a breath of fresh air, and going back to other tools after using honeycomb is painful.

I'm not sponsored or working for them btw, they just make one of the few products that I genuinely love using.


Grafana seems to be an option? Handles metrics, logs and traces. I don't know what storage costs look like though if you are self hosting.. https://grafana.com/


I'm working on giving you more options for saving money on logging with dynamic log levels: https://prefab.cloud/features/log-levels/

The hypothesis being that you can save money by turning things down, but easily turn them back up when you're actively investigating. Or turn the volume up for a targeted sub-segment of your traffic.

We've done some exploration into providing the same for APM and the rest of OTEL and I think it's pretty doable. hmu if you want to talk.


I would ask this a different way--is there a single OSS project that handles collecting all of the OTEL metrics/logs/traces? Folks keep saying you can't do this in one data store/format, yet Elastic seems to, and having to manage separate storage/infrastructure for all of these tools is intense.

Could Prometheus be augmented to store metrics, logs, and traces somehow? I don't really mind if it doesn't scale well on a single instance or isn't highly available; I'll just add more instances and aggregate them.


I've been watching this with a lot of interest: https://signoz.io/


That looks really promising, thank you for sharing it


Great to hear that. I am one of the maintainers at SigNoz - if you have any queries as you implement , do ask in our slack - https://signoz.io/slack


I guess you could take a look at this: https://openobserve.ai/

It's in Rust to add some HN catnip.


Why .ai?


Have you used it? How does it stack up?


Haven't used it yet, I'm a bit too scared to put something like this into production anywhere. I might do a small installation with a toy project somewhere, though. Seems easy enough to set up.


It's really not that intense. I basically set up my last co's telemetry infrastructure all by myself, using terraform, otel-python, jaeger, and AWS elasticsearch.

This TF project does most of the heavy lift. https://github.com/telia-oss/terraform-aws-jaeger


Jaeger for tracing, Elasticsearch for logs? What are you using for metrics?


Likely Prometheus - Jaeger for tracing, Elastic for logs, Prometheus for metrics is a pretty common and effective OSS observability stack


Super nitpick, but you meant profit, not revenue, right?


What's the value of the logs when they can instead be boiled down to metrics way more concisely (at least with OLAP-like storage)?

For traces, as mentioned elsewhere sampling is key. Some systems use % of requests, and other systems cap traces to traces/sec.

Logs are just expensive but easy to use, so the best strategy is to lean on filtering to avoid a bunch of pointless repetition that doesn't add value.

TLDR, metrics require more planning but can end up being much cheaper than logs.
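
To make the trade-off concrete, here's a tiny TypeScript sketch with the OTel metrics API (the meter, counter, and attribute names are made up): instead of emitting a log line per event, you record a counter with low-cardinality attributes and let the backend do the math.

    import { metrics } from "@opentelemetry/api";

    const meter = metrics.getMeter("checkout-service");
    const declined = meter.createCounter("payments.declined", {
      description: "Payments declined, by reason",
    });

    // Hot path: one cheap increment instead of one log line per occurrence.
    declined.add(1, { reason: "insufficient_funds" });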


Oh, and the operational overhead of performing full trace recording at 2k requests/sec is non-trivial. You'll start to feel the overhead of adding extra telemetry.


Most of the comments on this thread are talking about using OpenTelemetry to send metrics/logs to self-hosted collector jobs. Though using a standard library supported by a bunch of collector tools like ClickHouse is useful in itself, there's another benefit too. The specification allows transferring trace IDs across system boundaries. If you and your dependencies all implement the OpenTelemetry spec, then you get spans that reveal in granular detail what happened over the journey. For example, you could learn that it was your database loading a page from disk that took so long, or that a cloud service's metadata plane was the responsible span for high latency.


I am very happy with the progress of OpenTelemetry. When I pushed for it years ago, my developers were hesitant (it was new and they had never heard of it), but when I revisited the topic a year ago, OpenTelemetry was everywhere in our systems and the log/traceability vendor we have was switching over to it.

Thanks to this amazing group!


OpenTelemetry is being pushed as a replacement for AWS X-Ray SDKs by AWS, but it's in such a broken state for Lambda right now. A 200-500% performance penalty for using it is insane[1][2].

[1]: https://github.com/aws-observability/aws-otel-lambda/issues/...

[2]: https://github.com/open-telemetry/opentelemetry-lambda/issue...


How do you use it with Lambda? Delta mode? Or tons of labels?


I use X-Ray directly via AWS Powertools for Lambda[1] instead of going through OTel. The underlying AWS X-Ray SDK team pushes OTel pretty hard in their README[2] and Github issues, but the current situation is that OpenTelemetry is just not production ready for serverless applications.

[1]: https://docs.powertools.aws.dev/lambda/typescript/latest/

[2]: https://github.com/aws/aws-xray-sdk-node


author of the post here - was inspired to write this post after working with OTEL for a few months - realized that OTEL had a ridiculously large surface area that most people (at least myself) might not be aware of

I see a lot of comments about how overly complex OTEL is. I don't disagree with this. In some sense, OTEL is very much the k8s of observability (good and bad).

The good is that it is a standard that can support every conceivable use case and has wide industry adoption. The bad is that there is inherent complexity in needing to support the wide array of use cases.


OTel is great for avoiding vendor lock-in, and the spec stability is a long time coming — great to see. It's also good that there are architecture options aside from sidecar (there's also proxy mode), which is nice for situations like deploying on a PaaS.

That said, the main trouble I've had in the past is instability in the SDKs. The Java one's decent; Go, not so much, for example. Traditionally OTel has been awesome for tracing but not so much for logs, events, and even metrics (obviously depending on your language).

Other than that, auto-instrumentation has been nice. We had this exact functionality when I worked on observability at Netflix; it made starting up and maintaining microservices easy and really helped with adoption.


Since I added OTEL instrumentation to an app, I've already easily swapped out the backends from Prometheus and Jaeger to Grafana Mimir and Tempo.

The lack of lock-in is fantastic, don't know if I've ever just switched technologies that easily, even SQL databases.


Disclaimer: I'm the founder of an observability company (Highlight.io).

OpenTelemetry has been INCREDIBLY valuable to us. Not only has it made it super fast to build out SDKs for our customers, but the fact that it's maintained actively gives us confidence that we're rolling out stable logic to customers' environments.

I agree with the author that OpenTelemetry has succeeded, and it's pretty obvious from the fact that most major observability vendors support it.

In short, to a developer it may not seem particularly valuable, because a metrics/logs/traces API is quite simple whether or not you use OTEL. But the fact that this is an industry-wide spec is where it becomes powerful.

A few


It's quite understandable that an "o11y protocol" is very valuable to an o11y service provider (especially the smaller ones), since it reduces the cost of a vendor swap.


Disclaimer: I am one of the maintainers

Many comments complain about the complexity of using OpenTelemetry, I recommend checking out Odigos, an open-source project which makes working with OpenTelemetry much easier: https://github.com/keyval-dev/odigos

We combine OpenTelemetry and eBPF to instantly generate distributed traces without any code changes.


How does this compare to the OTel Operator feature-wise? I'm loving auto-instrumentation with Grafana Cloud, but operator config is painful and I've hit (and helped fix) a lot of bugs


Hi! Just wanted to let you know I'm getting a 500 error on your docs page.

https://docs.odigos.io/


Can you try again please? It works well for me


FWIW also working fine here.


OTEL is awesome. I was part of the team integrating it at my job, it went really smoothly, and it saved us so much time debugging our microservice application.

I would say the hardest part was getting other devs to use it; a lot of them are stuck in their own ways and did not want to go through the relatively small learning curve...


We're soon launching a GraphQL Analytics, Tracing and Metrics stack on top of OTEL. We've built a custom OTEL exporter to Clickhouse in Go, so we can export OTEL traces and Prometheus metrics all to Clickhouse. We've built this stack for our federated GraphQL Gateway, but it could essentially work for any OTEL service. If you want to learn more, here's some info: https://wundergraph.com/cosmo We're soon going to open source this, so just follow me/us if you're interested.

What I really like about this stack is that you can use our end to end solution, but you're not locked into it. We provide a full service, but you can also just use your own OTEL backend if you want to eject.


So I have a VictoriaMetrics (i.e. Prometheus) setup that I am happy with, but haven't touched OpenTelemetry; why should I care about it? Is it a serious solution for log aggregation, or do I need a separate database for logs?


You shouldn't, unless you want to use the new open source standard for telemetry. You won't benefit from simplicity or performance improvements; it would be quite the opposite. You can check the actual cost of OpenTelemetry adoption here [0]

But if you ever decide to go this path - VictoriaMetrics supports OpenTelemetry protocol for metrics [1]

[0] https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2570

[1] https://docs.victoriametrics.com/Single-server-VictoriaMetri...


Can someone tell me roughly at what point a tool like OpenTelemetry becomes interesting? It seems complicated; at what point should you bother with it instead of just handrolling simple stuff and eyeballing it?


OpenTelemetry is several different things, so you'd have to be more specific.

But, for example, say you write a library and you want your downstream users to be able to see its telemetry. OpenTelemetry provides a standardized interface, so you don't need to make assumptions about their stack.
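
For instance, a library can depend only on the API package and emit spans that stay no-ops until the application wires up an SDK. A rough TypeScript sketch (the library name and the `lookup` helper are made up):

    import { trace, SpanStatusCode } from "@opentelemetry/api";

    declare function lookup(key: string): Promise<string | undefined>; // the library's real work

    const tracer = trace.getTracer("my-cache-lib", "1.2.3");

    export async function get(key: string): Promise<string | undefined> {
      return tracer.startActiveSpan("cache.get", async (span) => {
        try {
          span.setAttribute("cache.key", key);
          return await lookup(key);
        } catch (err) {
          span.recordException(err as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw err;
        } finally {
          span.end();
        }
      });
    }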


CNCF needs an OTEL log aggregator


The otel collector/agent can do this. Not storage/query, but aggregation and processing.


I've been playing with OTEL for a while, with a few backends like Jaeger and Zipkin, and am trying to figure out a way to perform end to end timing measurements across a graph of services triggered by any of several events.

Consider this scenario: there is a collection of services that talk to one another, and not all use HTTP. Say agent A0 makes a connection to agent A1; this is observed by service S0, which triggers service S1 to make calls to S2 and S3, which propagate elsewhere and return answers.

If we limit the scope of this problem to services explicitly making HTTP calls to other services, we can easily use the Propagators API [1] and use X-B3 headers [2] to propagate the trace context (trace ID, span ID, parent span ID) across this graph, from the origin through to the destination and back. This allows me to query the trace backend (Jaeger or Zipkin) using this trace ID, look at the timestamps originating at the various services, and do a T_end - T_start to determine the overall time taken by one call for a round trip across all the related services.
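
(For concreteness, the carrier-based inject/extract dance looks roughly like the TypeScript below; it assumes the `@opentelemetry/propagator-b3` package, and `incomingHeaders` is a stand-in. The carrier is just a key/value bag, so in principle it can be HTTP headers, message metadata, or anything else you can pass alongside an event.)

    import { context, propagation, trace } from "@opentelemetry/api";
    import { B3Propagator, B3InjectEncoding } from "@opentelemetry/propagator-b3";

    declare const incomingHeaders: Record<string, string>; // stand-in for whatever carried the IDs

    // Use B3 multi-header propagation (X-B3-TraceId / X-B3-SpanId / X-B3-ParentSpanId).
    propagation.setGlobalPropagator(
      new B3Propagator({ injectEncoding: B3InjectEncoding.MULTI_HEADER }),
    );

    // Caller side: copy the active trace context into an outgoing carrier.
    const outgoing: Record<string, string> = {};
    propagation.inject(context.active(), outgoing);

    // Callee side: rebuild the context and parent new spans on it.
    const parentCtx = propagation.extract(context.active(), incomingHeaders);
    const span = trace.getTracer("s1").startSpan("handle-event", undefined, parentCtx);
    span.end();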

However, this breaks when a subset of these functions cannot propagate the B3 trace IDs for various reasons (e.g., a service is watching a specific state and acts when the state changes). I've been looking into OTEL and other related non-OTEL ways to capture metrics, but it appears there's not much research into this area though it does not seem like a unique or new problem.

Has anyone here looked at this scenario, and have you had any luck with OTEL or other mechanisms to get results?

[1] https://opentelemetry.io/docs/specs/otel/context/api-propaga...

[2] https://github.com/openzipkin/b3-propagation

[3] https://www.w3.org/TR/trace-context/


I've had pretty decent experience with .NET + Grafana Cloud (hosted Tempo)

You can generate RED metrics then tail based sample and send a subset of full traces.

As with many new things, there's varying maturity in client libraries and still things missing. The Redis .NET client (maybe others?) wasn't very good (maybe alpha/beta quality) but the other .NET stuff seemed reasonable. At least with .NET, it integrates with the existing System.Diagnostics API.

I think libraries in some other languages (Go?) are a bit clumsier, depending on what type of diagnostic/debug/event/performance/reflection APIs the language runtime exposes for introspection.


I'd be very interested to hear about others' experience (either as comments or good blog posts/write-ups they've read) with using OpenTelemetry. I haven't used OTEL stuff directly, but I've always been very disappointed with the telemetry vendors in the past (although as far as I can tell the problems weren't the "these vendors don't talk to each other" kinds of problems that I think OTEL aims to solve).


I've used OpenTelemetry since its original alpha in 2020. Originally the main issue I had was supporting tracing across common libraries (there weren't a lot of libraries supported back then). Now (and I recently worked with it), I would say it's knowing which protocol is supported by which component: your SDK generates spans/metrics in a specific format, then you send that to a collector that accepts a range of protocol versions, and finally you send that to your vendor ... but you need to know which protocol/version it supports.

That's not actually something you can do much about considering the sheer size of OpenTelemetry (both in terms of implementations and vendors working on it), and I expect that for people implementing nowadays the protocol should be pretty stable, so my experience should theoretically no longer be the case.


Experience has been positive. We had to understand (and in places fix/enhance) the agent used in our system but the benefits of it not being an expensive black box were huge. E.g. you can run the whole stack on a laptop and therefore use the same tooling for perf analysis in dev as production.


I run a full OSS otel stack in my home lab. It was a lot of fun to set up, with redundancy and all of the extras. It hasn't been particularly useful, but centralized logging and metrics are pretty to look at in dashboards, which was my motivation.

Let's hear some great debugging stories that have been powered by OTEL. I'd love to hear from the horse's mouth, without marketing speak, how it was worth $$$$ collecting and storing all of this info.


Speaking about OpenTelemetry, has anyone used https://uptrace.dev ?

From a quick glance it seems to be simple, free and open-source deployed as a single Go binary. They use ClickHouse to store data. Almost too good to be true.

I'm contemplating them for a new project.


Fwiw, we do the same (https://highlight.io). Heard good things about uptrace as well.


Telemetry seems like it would be a great candidate for columnar storage formats like Parquet or arrow. In particular I expect that it would compress very well, which could reduce telemetry bandwidth consumption / allow for a bigger sample rate.

Does anybody have any experience with the intersection of these technologies?


Honeycomb built their own columnar database[0] to support their product.

[0] - https://www.honeycomb.io/resources/why-we-built-our-own-dist...


Grafana Tempo also switched from Protobuf storage format to Apache Parquet last year. It's fully open source, and the proposal (from April 2022) is here: https://github.com/grafana/tempo/blob/main/docs/design-propo...

The relevant code for parquet storage backend can be found here: https://github.com/grafana/tempo/tree/main/tempodb/encoding

disclosure: I work for Grafana!


Cool thanks for sharing. Can you say something about how it's worked out? Has it reduced bandwidth or CPU usage?


The Parquet backend helped unlock traces search for large clusters (>400MB/s data ingestion) and over longer periods of time (>24h). It also helped unlock TraceQL (a query language for traces similar to PromQL/LogQL). There's more details in this blog post: https://grafana.com/blog/2023/02/01/new-in-grafana-tempo-2.0...

I don't have the exact CPU/bandwidth numbers on me right now, but CPU usage has gone up by about ~50% on our "Ingester" and "Compactor" components (you can read up about the architecture here - https://grafana.com/docs/tempo/latest/operations/architectur...). But this is optimising for read performance, which improved significantly.


Someone from F5 worked on this with OpenTelemetry [0] for Arrow, another effort was done for Parquet but was dropped [1]

[0]: https://github.com/open-telemetry/oteps/pull/171

[1]: https://github.com/open-telemetry/opentelemetry-proto/pull/3...


Oh nice, thank you (and also solumos) for the links! It looks like oteps/pull/171 (merged June 2023) expanded and superseded the opentelemetry-proto/pull/346 PR (closed Jul 2022) [0]. The former resulted in merging OpenTelemetry Enhancement Proposal 156 [1], with some interesting results especially for 'Phase 2' where they implemented columnar storage end-to-end (see the Validation section [2]):

* For univariate time series, OTel Arrow is 2 to 2.5 better in terms of bandwidth reduction ... and the end-to-end speed is 3.1 to 11.2 times faster

* For multivariate time series, OTel Arrow is 3 to 7 times better in terms of bandwidth reduction ... Phase 2 has [not yet] been .. estimated but similar results are expected.

* For logs, OTel Arrow is 1.6 to 2 times better in terms of bandwidth reduction ... and the end-to-end speed is 2.3 to 4.86 times faster

* For traces, OTel Arrow is 1.7 to 2.8 times better in terms of bandwidth reduction ... and the end-to-end speed is 3.37 to 6.16 times faster

Pretty exciting results! The OTEL-Arrow adapter has subsequently been donated to the otel community; here's a comment that does a good job of summarizing the results and the recommendations that came out of the test [3].

[0]: https://github.com/open-telemetry/opentelemetry-proto/pull/3...

[1]: https://github.com/open-telemetry/oteps/blob/main/text/0156-...

[2]: https://github.com/open-telemetry/oteps/blob/main/text/0156-...

[3]: https://github.com/open-telemetry/community/issues/1332#issu...


We're big users of clickhouse at https://highlight.io. Some more details here if you're interested: https://www.highlight.io/blog/how-we-built-logging-with-clic...


A lot of companies roll their own (like Honeycomb, and Kentik (network telemetry)). Clickhouse is a very good open option.


Very split opinions in the comments. Seems people either love it or hate it with nothing in between


Someone needs to build the Segment of APM. Datadog lock-in is real and the bill is high.


Has anyone found a solution for automatically adding spans for Node.js applications that doesn't involve littering startSpan or startActiveSpan everywhere?

I'm currently looking into monkey patching, but that seems dirty.


There isn't a way to automatically add spans for arbitrary functions in Node, no. There's currently a proposal to add some stuff that makes creating a span easier by making it auto-close based on scope.

FWIW the philosophy here is that observability is a part of the application rather than something separate. That's distinctly different from the APM philosophy, which is that a separate process "does the observability" and your app is "clean" from that. I think there's quite a benefit to manually instrumenting in your codebase intentionally rather than having an automated process do it for you. But I can understand not wanting to go through and do that.
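
If the goal is just to cut the boilerplate, one middle ground is a small wrapper so call sites stay one line. A rough TypeScript sketch (the `withSpan` helper name is made up):

    import { trace, Span, SpanStatusCode } from "@opentelemetry/api";

    const tracer = trace.getTracer("app");

    // Wrap an async function in a span that ends itself and records failures.
    export function withSpan<T>(name: string, fn: (span: Span) => Promise<T>): Promise<T> {
      return tracer.startActiveSpan(name, async (span) => {
        try {
          return await fn(span);
        } catch (err) {
          span.recordException(err as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw err;
        } finally {
          span.end();
        }
      });
    }

    // Usage: await withSpan("load-user", () => loadUser(id));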


That's an interesting distinction. Given some of the older, larger code bases, manually instrumenting seems like it would require a reasonably large effort. I think that's one of the huge benefits that some of the APM's bring.

I wonder if there is any reason OTEL can't have both.


OTel kind of has both today. Most languages support autoinstrumentation via either an agent, libraries, or both. So for example, if you've got a big Spring Boot app, it'll instrument requests/responses and DB calls for you. Some languages also have lightweight "sprinkle some spans on it" things you can put in code, like Java method annotations. But none have truly automatic "span for every method call" instrumentation.


Yeah, I've used the Ruby one, and the results were... painful.


For stuff on the JVM, you just load the APM plugin and you're pretty much good to go, as it inspects the bytecode, I believe.


Oh, right. I guess my question was more around "why can't OTEL _also_ have something that does that"?


Depends what you're trying to accomplish.

We're using middleware to start spans for HTTP and message handlers, and then adding `startSpan` where we need it.

I don't see a problem with `startSpan` everywhere, as it's not much noisier than the `log.info` that would be there instead if we didn't have OTel.


I think my major pain point with them being everywhere is that, unlike a log, they push a new stack frame and closure. They make tracebacks so much more annoying; yes, I know the error happened in a startSpan, thank you.

I wish observability actually observed more than it contributed. Once https://peps.python.org/pep-0669/ is available I'm gonna try my damndest to get otel working through it. Just give me a config file that says what functions you're interested in and I'll do the rest.


Agreed, the amount of stuff showing up in stack traces can be annoying - but it does depend on the language.

I agree 100% that in JavaScript/TypeScript it's annoying, and I would love to get rid of them. In Go, however, there isn't an extra stack frame. Nor in C#, thinking about it.

The config file of functions to trace is a really interesting idea. How would you handle wanting to add data to the spans from inside those functions though? e.g. I want to add all kinds of props to the spans that can't be known except inside the function executing.


I started my APM journey using Spring with Java and fell in love with how I could trace the entire flow through the whole application and then just configure how many spans I want to send over and what sample rate I want collected.

I'd love to accomplish that.

What you mentioned is all nice and well (the optimal route), but right now I'm working with some applications that need it but have I don't even know how many methods/classes that I'd need to go through and instrument.


Have you seen the OTEL auto instrumentation package[1]? It supports a number of common frameworks (as middleware) and libraries (via monkeypatches) setting up spans for you to capture things like external http requests, database/cache queries, etc. [2]

[1] https://opentelemetry.io/docs/instrumentation/js/automatic/ [2] https://github.com/open-telemetry/opentelemetry-js-contrib/t...
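
For reference, registering it is roughly this much code (a TypeScript sketch assuming `@opentelemetry/sdk-node`, `@opentelemetry/auto-instrumentations-node`, and an OTLP/HTTP exporter):

    import { NodeSDK } from "@opentelemetry/sdk-node";
    import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
    import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

    // Wires up the contrib instrumentations (http, express, pg, redis, winston, ...)
    // as middleware/monkeypatches, without touching application code.
    const sdk = new NodeSDK({
      traceExporter: new OTLPTraceExporter(), // defaults to a local OTLP endpoint on :4318
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();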


Tested those before, but they still don't get what I'm looking for, which is inside the black box without manually adding the spans. The implementations for bunyan or winston come the closest, but still don't achieve what I'm looking for.


I think the Ruby auto-instrumentation library does wrap practically every method but we ended up finding it too verbose. I think this is a bit of a goldilocks problem insofar as getting it just right is not easy and varies per application.


One of the things I think can be done is to also limit the number of spans collected via configuration.


Aye, but then you’re trading off between a centralized config that does sampling vs customized per-service (or whatever boundary) configs. There’s no free lunch, was mostly my point.


I'm new to telemetry. Can someone please explain the relationship between otel, Prometheus, Grafana agent?


OpenTelemetry is primarily definitions of protocols, APIs and semantic conventions for instrumentation data (traces, metrics and logs, in decreasing order of maturity). OpenTelemetry also ships the Collector, which is like a patch-bay accepting many different data formats and sending them on in more formats.

Grafana Agent is a bundling of a bunch of things for collecting instrumentation. The idea is to simplify deployment and allow opinionated setup. The Agent can upload to Grafana Cloud, or to any compatible backend (all the basic components are open source).

In particular it bundles Prometheus Agent, which does metrics collection from anything Prometheus-compatible but not queries, and OTel Collector.

It also bundles Promtail for logs.

(I work for Grafana Labs)


Prometheus: metrics collector/databases with a defined metrics format.

Many metrics/timeseries databases make an effort to be "Prometheus-compatible" as it was kind of the unofficial standard.

OTEL: new open standards which are supposed to provide vendor-independent formats which are compatible and composable between metrics, logs, and traces, and easily enable things like deriving metrics from traces (latency would be an obvious one here).

Grafana Agent: a metrics (and maybe also logs and traces?) collector that supports Prometheus, OTEL and other metrics formats, and which can forward, sample, and transform the data. Made by Grafana, open source, etc.

Grafana's metrics DB Mimir and maybe some others are essentially "more scalable Prometheus" and use the Prometheus metrics format on disk, so one large concern of the Grafana Agent would be converting OTEL metrics to the Prometheus metrics format for ingestion into the Grafana databases - but the agent has a whole bunch of other functions supported as well.

OTEL Collector - a non-vendor-specific collector, like Grafana Agent but largely just concerned with allowing collection/ingestion of OTEL and converting other formats into OTEL. Allows extensions and plugins to be added for other purposes.



Not only Prometheus FWIW. There's lots of providers that support it, e.g.:

- datadog
- honeycomb.io
- highlight.io (I'm a founder)
- sentry.io


I just want Segment but for APM to reduce vendor lock-in.


Non-paywalled URL: https://archive.is/NULpZ


like a virus, the complexity spreads to fill head count.


Like a reflex, comments blaming solutions for complex challenges are being emitted. Alas, no alternatives are ever offered. Shall we limit all technology to what can be contained in a single box? Should we just ban countries, companies and communities from growing past a 500k head count? Please, go on, do say what we shall do with the inherent complexity of the universe. Shall we stop physics at Newton and ban particle accelerators? De-legalize storage systems exceeding a petabyte? Please, let us hear how complexity can be put back in its old box.


Ugh, protobuf and gRPC as blessed (?!) transports? Thanks, I always wanted to bring a 100MB dependency into my code just to send metrics.


It's an 8MB DLL when compiled fully as DLL(s). Check out https://github.com/open-telemetry/opentelemetry-cpp as they've recently added DLL support, and I've been keeping a branch that's geared towards more simplicity on deploy (e.g. a single .dll instead of several). OpenTelemetry's C++ solution, though, is more flexible and used less size last I tried it. Here is mine - https://github.com/malkia/opentelemetry-cpp


Where are you getting 100MB from? You don't need the whole protoc toolchain. Well I've never worked with C or C++ and gRPC, but insofar as rust, go, and python are concerned, the increase in container/binary size is a few MB.


Yeah, but you have the same problem when using Prometheus and the native histograms? As far as I know, they are not supported by the text/OpenMetrics format, only in their protobuf-based format version.


If you're using Rust, Go, C, Java, or some other performant language then it shouldn't be much. If you're using a slower scripting language like Ruby or JavaScript then you might have issues.


Yeah, including the official OpenTelemetry parser for metrics just increases a sophisticated Go binary's size by 10MB - from 19MB to 29MB [1]

[1] https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2570...


Just a quick clarification: OTEL is not just a transport, it's a specification.



