OpenTelemetry in 2023 (kevinslin.com)
341 points by kevinslin on Aug 28, 2023 | 244 comments



Two problems with OpenTelemetry:

1. It doesn't know what the hell it is. Is it a semantic standard? Is it a protocol? Is it a facade? Is it a library? What layer of abstraction does it provide? Answer: All of the above! All the things! All the layers!

2. No one from OpenTelemetry has actually tried instrumenting a library. And if they have, they haven't the first suggestion on how instrumenters should actually use metrics, traces, and logs. Do you write to all three? To one? I asked this question two years ago, zero answers :( [1]

[1] https://github.com/open-telemetry/opentelemetry-specificatio...


1. Agreed. It's the sink and the house attached to it, and the docs are thin and confusing as a result.

2. I had a similar experience to you. I wanted to implement a simple heartbeat in our desktop app to get an idea of usage numbers. This is surprisingly not possible, which greatly confuses me given the name of the project. The low engagement on my question put me off and I abandoned my OpenTelemetry planning completely [1][2].

[1] https://github.com/open-telemetry/community/discussions/1598

[2] https://github.com/open-telemetry/semantic-conventions/issue...


Agreed. Some things they suggest aren't actually possible with their SDKs.

For example, you cannot define a histogram's buckets near where you define the histogram. You have to give the global exporter (or w/e the type is) a list of "overrides" that map each histogram name => their buckets. This makes it extremely ugly when you have libraries that emit metrics.

https://github.com/open-telemetry/opentelemetry-go/issues/38...
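
For anyone hitting the same thing, here's roughly what that override looks like with the Go SDK's View mechanism. This is a sketch, not a recommendation: the exact type names have moved around between SDK versions, and the metric name and boundaries below are made up.

    package telemetry

    import (
        sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    )

    // newMeterProvider applies a bucket override for one histogram by name.
    // The library that creates the histogram never sees this; the buckets are
    // keyed by instrument name at the provider, far away from the call site.
    func newMeterProvider() *sdkmetric.MeterProvider {
        bucketsView := sdkmetric.NewView(
            sdkmetric.Instrument{Name: "myapp.request.duration"}, // hypothetical metric name
            sdkmetric.Stream{
                Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
                    Boundaries: []float64{0.005, 0.01, 0.05, 0.1, 0.5, 1, 5},
                },
            },
        )
        return sdkmetric.NewMeterProvider(sdkmetric.WithView(bucketsView))
    }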


It's silent when you fuck up the selector too.

Actually it's silent pretty much all of the time. Very reminiscent of J2EE coding. Stare at the configs and hope for enlightenment.


Good set of questions, but I don't think they matter. I don't think those are answerable questions for observability, be it OpenTelemetry or proprietary systems.

You can go read the leading observability companies' web pages and they'll have a 4-page writeup on custom instrumentation. That's not much; it just covers the elementary basics. It's not like OTel is behind. The answer just tends heavily towards "it depends."

Once you have experience - OTel or other - you can work through these things that might confound a neophyte.


The problem is they keep making the OTel tooling worse for working through these things, because the people writing the OTel tooling broadly aren't the people actually trying to monitor things. Even before OTel, plain Prometheus client libraries suffered from this.


All they had to do was maintain 3 JSON schema files.


They maintain a bunch of protobuf files, which is somewhat better and more efficient. You can still use json if you want.


Changing subject, love your work with ocaml.


Thank you ^_^


3. Does it scale?


Scale to what? We're generating >70k spans/sec and it's working fine. I'd say we're fairly medium size at best, though.


I totally read that with kubernetes in mind.


I love OpenTelemetry and we want to trace almost every span happening. We'd be bankrupt if we went with any vendor. We wired up OpenTelemetry with Java magic, 0 effort, pointed it at a self-hosted ClickHouse, and store 700M+ spans per day on a $100 EC2 instance.

https://clickhouse.com/blog/how-we-used-clickhouse-to-store-...


I've got a small personal project submitting traces/logs/metrics to Clickhouse via SigNoz. Only about 400k-800k spans per day (https://i.imgur.com/s0J6Mzo.png), but running on a single t4g.small with CPU typically at 11% and IOPS at 4%. I also have everything older than a certain number of GB getting pushed to a sc1 cold storage drive.

w/ 1 month retention for traces:

    ┌─parts.table─────────────────┬──────rows─┬─disk_size──┬─engine────┬─compressed_size─┬─uncompressed_size─┬────ratio─┐
    │ signoz_index_v2             │  26902115 │ 17.06 GiB  │ MergeTree │ 6.21 GiB        │ 66.74 GiB         │   0.0930 │
    │ durationSort                │  26901998 │ 5.44 GiB   │ MergeTree │ 5.40 GiB        │ 53.02 GiB         │  0.10190 │
    │ trace_log                   │ 123185362 │ 2.64 GiB   │ MergeTree │ 2.64 GiB        │ 37.96 GiB         │   0.0695 │
    │ trace_log_0                 │ 120052084 │ 2.46 GiB   │ MergeTree │ 2.45 GiB        │ 37.60 GiB         │  0.06528 │
    │ signoz_spans                │  26902115 │ 2.21 GiB   │ MergeTree │ 2.21 GiB        │ 76.73 GiB         │ 0.028784 │
    │ query_log                   │  16384865 │ 1.91 GiB   │ MergeTree │ 1.90 GiB        │ 18.31 GiB         │  0.10398 │
    │ part_log                    │  17906105 │ 846.73 MiB │ MergeTree │ 845.39 MiB      │ 3.84 GiB          │  0.21521 │
    │ metric_log                  │   4713151 │ 820.92 MiB │ MergeTree │ 806.13 MiB      │ 14.56 GiB         │  0.05405 │
    │ part_log_0                  │  15632289 │ 702.82 MiB │ MergeTree │ 701.70 MiB      │ 3.34 GiB          │  0.20490 │
    │ asynchronous_metric_log     │ 795170674 │ 576.24 MiB │ MergeTree │ 562.50 MiB      │ 11.11 GiB         │ 0.049429 │
    │ query_views_log             │   6597156 │ 461.35 MiB │ MergeTree │ 459.75 MiB      │ 6.36 GiB          │  0.07060 │
    │ logs                        │   6448259 │ 408.59 MiB │ MergeTree │ 406.65 MiB      │ 5.99 GiB          │  0.06627 │
    │ samples_v2                  │ 949110122 │ 345.01 MiB │ MergeTree │ 325.31 MiB      │ 22.09 GiB         │ 0.014382 │
If I was less stupid I'd get a machine with the recommended Clickhouse specs and save myself a few hours of tuning, but this works great.

Downsides:

- clickhouse takes about 5 minutes to start up because my tiny sc1 drive has like 4 IOPS allowed

- signoz's UI isn't amazing. It's totally functional, and they've been improving very quickly, but don't expect datadog-level polish


Thanks for mentioning SigNoz, I am one of the maintainers at SigNoz and would love your feedback on how we can improve it further.

If anyone wants to check our project, here’s our GitHub repo - https://github.com/SigNoz/signoz


I hope I'm not coming across as negative! Y'all just have a much younger product, and haven't had time to do all the polish and tiny tweaks. I'm also much more familiar with Datadog, and sometimes a learning curve feels like missing features.

- I really like your new Logs & Traces Explorers. I spend a lot of time coming up with queries, and having a focused place for that is great. Especially since there's now a way to quickly turn my query into an alert or a dashboard item.

- You've also recently (6mo?) improved the autocomplete dramatically! This is awesome, and one of my annoyances with Datadog

Other feedback, and honestly this is all very minor. I'd be perfectly happy if nothing ever changed.

- where do I go see the metrics? There's no "Metrics" tab the way there's a "Logs" and "Traces" tab. A "Metrics Explorer" would be great.

- when I add a new plot, having to start out with a blank slate is not great. Datadog defaults to a generic system.cpu query just to fill something in, I find this helpful.

- when I have a plot in a dashboard and I see it is trending in the wrong direction, it would be nice to be able to create an alert directly from the chart rather than have to copy the query over.

- the exceptions tab is very helpful, but I've only recently discovered the LOW_CARDINAL_EXCEPTION_GROUPING flag. It'd be super nice if the variable part of exceptions was automatically detected and they were grouped

- one nice thing in DD is being able to preview a span from a log, or logs from a span, without opening a new page. Or previewing a span from the global page. Temporarily popping this stuff up in a sidebar would be great.

- I'm not sure if there's a way to view only root spans in the trace viewer.

- This might be a problem with the spring boot instrumentation, but I can't see how to figure out what kind of span it is. Is it a `http.request`, `db.query`, etc?


Thanks for the detailed feedback, this is gold!

> - where do I go see the metrics? There's no "Metrics" tab the way there's a "Logs" and "Traces" tab. A "Metrics Explorer" would be great.

Great idea. This is something which a few users have asked for, and we will be shipping it in a few releases.

> - when I have a plot in a dashboard and I see it is trending in the wrong direction, it would be nice to be able to create an alert directly from the chart rather than have to copy the query over.

Fair point, this is something which is also in the pipeline.

> - I'm not sure if there's a way to view only root spans in the trace viewer.

We launched a tab in the new traces explorer for this, does it not serve your use case?

> - when I add a new plot, having to start out with a blank slate is not great. Datadog defaults to a generic system.cpu query just to fill something in, I find this helpful.

We can do something like this, but we don't necessarily know the names of the metrics users are sending us, unlike Datadog, which has some default metrics that their agents generate.

Will also look into the other feedback you have given.


Oooooh this is great feedback. When I started using SigNoz I too was surprised that there was no metrics tab. I wonder what you’d expect to see after clicking it: a list of service names collecting metrics? Or some high-level system wide infra metrics? Let me know!


Are you making sure that you're applying a sample rate, but sending over all errors?

At a former place, we were doing 5% of non-error traces.


Careful: we've had systems go down under increased load from just emitting errors, if they didn't emit much in the non-error state.


Can you go into more detail about your comment, please?


Not the GP, but:

Imagine you're sampling successful traces at, say, 1%, but sending all error traces. If your error rate is low, maybe also 1%, your trace volume will be about 2% of your overall request volume.

Then you push an update that introduces a bug and now all requests fail with an error, and all those traces get sampled. Your trace volume just increased 50x, and your infrastructure may not be prepared for that.


Sorry, been busy running around all day. Basically, what's happened for us on some very high transaction-per-second services is that we only log errors, or trace errors, and the service basically never has errors. So imagine a service that is getting 800,000 to 3 million requests a second, happily going along basically not logging or tracing anything. Then all of a sudden a circuit opens on Redis, and for every single one of those requests that was meant to use that now-open circuit to Redis, you log or trace an error. You went from a system that is doing basically no logging or tracing to one that is logging or tracing 800,000 to 3 million times a second.

What actually happens is you open the circuit on Redis because Redis is a little bit slow, or you're a little bit slow calling Redis, and now you're logging or tracing 100,000 times a second instead of zero, and that bit of logging makes the rest of the requests slow down, so within a few seconds you're actually logging or tracing 3 million requests a second. You have now toppled your tracing system, your logging system, and the service that's doing the work. A death spiral ensues. Now the systems that call this system start slowing down and start tracing or logging more, because they're also only tracing or logging mainly on error. Or, worse, you have code that assumes the tracing or logging system is always up, and that starts failing and causing errors, and you get into an extra-special death loop that can only be recovered from by not attempting to log or trace at all during an outage like this, and you must push a fix. All of these scenarios have happened to me in production.

In general you don't want your system to do more work in a bad state. In fact, as the AWS Well-Architected guide says, when you're overloaded or in a heavy error state you should be doing as little work as possible, so that you can recover.


We've seen problems with memory usage on failure too. The Python implementation sends data to the collector in a separate thread from the HTTP server operations. But if these exports start failing, it's configured for exponential backoff, so it can hold onto a lot of memory and start causing issues with container memory limits.


I've configured our systems to start dropping data at this point and emit an alarm metric that logging/metrics are overloaded


I think what they mean is that if you provisioned your system to receive spans for 5% of non-error requests and a few error requests, then if for some random act of god all the requests yield an error, your span collector will suddenly receive spans for all requests.


How do you send all errors? The way tracing works, as I understand it, is that each microservice gets a trace header which indicates if it should sample and each microservice itself records traces. If microservice A calls microservice B and B returns successfully but then A ends up erroring, how can you retroactively tell B to record the trace that it already finished making and threw away? Or do you just accept incomplete traces when there are errors?


You can do head-based sampling and tail-based sampling.

With head sampling, the first service in the request chain can make the decision about whether to trace, which can reduce tracing overhead on services further down.

With tail-based sampling, the tracing backend can make a determination about whether to persist the trace after the trace has been collected. This has tracing overheads, but allows you to make decisions like “always keep errors”.


https://opentelemetry.io/docs/concepts/sampling/ describes it as Head/Tail sampling, but in practice with vendors I see it as Ingestion sampling and Index sampling. We send all our spans to be ingested, but have a sample rate on indexing. That allows us to override the sampling at index and force errors and other high value spans to always be indexed.


Maybe the Go client doesn't support that? https://opentelemetry.io/docs/instrumentation/go/sampling/


It does, but the docs aren't clear on that yet. TraceIdRatioBased is the "take X% of traces" sampler that all SDKs support today.
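
A minimal sketch of wiring that up in the Go SDK (the identifier is TraceIDRatioBased there; the 10% ratio is just an example):

    package telemetry

    import (
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // newTracerProvider samples ~10% of root traces. The ParentBased wrapper
    // makes downstream services follow the decision already made upstream.
    func newTracerProvider() *sdktrace.TracerProvider {
        return sdktrace.NewTracerProvider(
            sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))),
        )
    }

Keeping all errors on top of that needs a tail-based decision (e.g. in the collector), since the outcome isn't known yet when the root span starts.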


Normally yes, but we do a lot of data collection and identifying what's an error is usually hard because of partial errors. We also care about performance, per tenant and per resource with lots of dimensionality and sampling reduces that information for us.


The reality is that most people don't want to manage their own Clickhouse store, and not all engineers can operate with SQL as efficiently as with code (me included). Nonetheless, this is pretty cool!


> not all engineers can operate with SQL as efficiently as with code

I don’t mean for this to sound insulting but I honestly do not think this is an acceptable take to have as a developer.

Not knowing SQL is like refusing to learn any language that has classes in it, simply because you don’t like it.

I’ve heard stories of huge corporations failing product launches because some code was written to SELECT * from a database and filtering it in-app instead of doing the queries correctly, and what’s so fun with these types of issues is that they usually don’t appear until weeks later when the table has grown to a size where it becomes a problem.

When you’re saying that you’d rather find the data in-app than in-database, you’re putting the work on an inferior party in the transaction simply because you can’t be bothered.

The code will never* find the correct data faster than the database.

* there may be exceptions, but they’re far enough between to still say “never”.


Dropping down to SQL to write a really complex query is, in my professional experience, always a poor use of time. It's far simpler to just write the dumb for-loops over your data, if you can access it.

Of course not all engineers can operate with SQL as efficiently as code -- that's the whole point. Otherwise why would we be writing code? Learning SQL intimately doesn't change that fact.


> Dropping down to SQL to write a really complex query is, in my professional experience, always a poor use of time.

We’re not talking about Assembly here, “dropping down” to SQL is something that anyone should be expected to do as soon as you’re grabbing or modifying any data from a database in any scenario where performance or integrity matters. The errors you can see in situations like this are extremely complex and databases literally exist to solve them for us.

Also, if we just completely disregard the performance for a second and focus on data security instead, how do you ensure sensitive data isn’t passed to the wrong party if you don’t care about what queries are being sent?

I mean, it doesn’t matter if it’s not “in the end” displayed to an end user in the application you’re writing, or if it’s not stored in the intermediary node where your code is running; that data is now unnecessarily on the wire in a situation where it never should have been in the first place. If you end up mixing one customer’s data with another’s and sending all of it in such a way that it could even theoretically be accessed by a third party, that’s a lawsuit waiting to happen regardless of whether it was “displayed” or “forwarded” or not.

Imagine if you sniffed the packets going to some logistics app you use on your phone and you saw meta-data for all packages in your zip code in the response, or if some widget showing you your carbon footprint actually was based on a response containing the carbon footprint of every customer in the database. Even if it’s just [user_id,co2] it’s still completely unacceptable.

Never mind scenarios where you’re modifying, adding or deleting data, those are even worse and no explanation should be necessary for why.


Obviously it greatly depends on what you're doing. If you're using a relational database as a glorified key-value store for offline or batch processing of a few hundred megabytes of data, sure. Hell, just serialize and unserialize a JSON document on every run if it's small and infrequent enough ¯\_(ツ)_/¯

If you've got a successful data hungry web service with a reasonably normalized schema and moderately complex access patterns though, you're not going to be looping over the whole thing on every page load.


SQL is definitely worth learning. Recently found that processing a 350kb json is equally fast by sending it to Postgres for processing, compared to using some dedicated Java libraries: https://ako.github.io/blog/2023/08/25/json-transformations.h...

This opens some interesting options if you want to join the result with data from your database.


Way way way slower though. I just added something to our app that took 600ms for the naive ‘search and loop’ version (and kept getting slower the more items you needed, completely unscalable) vs 30ms for the ‘real SQL query’ version. Guess which version actually got committed.


Did your for loop solution include concurrent access by multiple clients? I highly doubt "engineers can not operate with SQL as efficiently as code" can implement anything even remotely as robust as what SQL DBMS offer even for basic use cases. Are you mutating data? What will happen if the system crashes in the middle of the mutation? How are you handling concurrent writes and reads?


It's unclear whether you mean that it's simpler to make a query and iterate over the rows to massage the result in your application or to make a query and then iterate over the returned rows to make more single-row queries. (Or perhaps some secret third thing I'm not considering.)

I'll admit I'm a little curious about what exactly you mean here.


SQL is code.


There’s a difference between writing olap and oltp sql queries. Hell, in the industry we even have a dedicated role for people who, among other things, write olap queries: data analysts. I’m assuming here that we are talking about writing complex analytical queries.


SQL is code and absolutely worth learning.


"Don't want to manage their own" has for so long been a valid excuse but cloud costs haven't been going down for so long - in many cases prices have increased - and hardware keeps getting more badass. In so many cases it's fear speaking.

A decent-sized server will host a hugely capable instance that you may not have to think about for years. The scoffing at DIY has made sense to some degree, but "it just works brilliantly" keeps becoming a stronger and stronger case, and most people just assume reality can't actually work that well, that it'll be bad, and those folks won't always be right.


With current SSD prices a box that will have 30 million IOPS can cost you 10K. 30 mil IOPS in a cloud would be crazy $$$$


But in this case we are not even talking about own/rented HW vs cloud. It's self-hosted(even on cloud) vs SaaS softwares!

SaaS, especially in this space, can be *extremely* costly, and its cost will scale up quickly as you send more traffic (either willingly or by mistake). Yes, Datadog, New Relic etc. will give you many pre-built and well-thought-out dashboards and some fancy AI-powered auto-detection thing, but they will charge many $$$ for it. Consider that cost management/analysis tools that were historically focused only on cloud are now adding the same tooling for costly SaaS solutions!

I understand that many HN readers are skewed towards SaaS solutions, usually because they work at a SaaS shop, but depending on the size of the company, the overhead of managing it internally can totally be worth it. There is overhead with SaaS as well...


We just left ours running for months in a Docker container. The volume is external, we just replace the container image with a new one, it takes 5 seconds to update, and spans are treated as ephemeral. We store only 7d of data. We could use S3 but we have no use for that data in the long run.

To be fair, we wanted to get experience with ClickHouse, and it's a special database that needs special attention to detail on both ops and schema design.


I'm beginning to sound like a broken record at this point, but if you don't know SQL very well but know how to use GPT-4, you have access to enough SQL to get a lot more done than you might think.


This is really interesting, thanks for sharing. What's also cool was the low effort needed for this setup (Java autoinstrumentation + Clickhouse exporter + Grafana Clickhouse Plugin).


That's a really informative post, the ClickHouse thing sounds interesting!


I'm hugely disappointed with OpenTelemetry. In my experience, it's an over-engineered mess and the out-of-the-box experience is super user-hostile. What it purports to be is so far away from what it actually is. OTel markets itself as a universal tracing/metrics/logs format and a set of plug-and-play libraries with adapters for everything you need. It's actually a bunch of half/poorly implemented libraries with a ton of leaky internals, bad adapters, and actually not a lot of functionality.


Agreed, I find myself having to think orthogonally to common sense whenever I try to use one of its SDKs. Nothing works the way you expect it to, everything has 3 layers of unnecessary abstraction and needs to be approached via the back door. Many features have caveats about when it works, where it works, how much it works, during what phase of the moon it works and how long your strings can be when Jupiter is visible in the sky.

That said, if we disregard the leaky SDK APIs and half-implemented everything, it does somewhat deliver on the pluggability promise. Before OTel, you had bespoke stacks for everything. Now there is some commonality - you can plug in different logging backends to one standard SDK and expect it to more or less work. Yes, it works less well than a vertically integrated stack but this is still something. It enables competition and evolution piece by piece, without having to replace an observability stack outright (never going to be a convincing proposition).

So while the developer experience is pretty unpleasant and I am also disappointed with the actual daily usage, from an architectural perspective it opens up new opportunities that did not exist before. It is at least a partial win.


Yes, I really agree, and I've gone through the same pain, but try using the alternatives that claim to be better because they have OpenAPI specifications [1]

The example shows you how to use the swagger tool, parse the OpenAPI spec [2], auto-generate GoLang glue code, call __one__ of those auto-generated functions and log a trace.

However, there is zero documentation, zero other examples, and I'm left scratching my head whether there's even one person in the world using this approach. I eventually ended up just directly using the service APIs [3] via REST calls.

OTEL is painful, but the alternatives are no better :( I really wish there were more interest in this space, since SLO and SLI measurements are becoming increasingly important.

[1] https://github.com/openzipkin/zipkin-api-example

[2] https://github.com/openzipkin/zipkin-api/blob/master/zipkin2...

[3] https://zipkin.io/zipkin-api/#/


The Prometheus text exposition format is the de-facto standard used in monitoring. It would be great to build an official observability standard on top of it. This format is much easier to debug and understand than OpenTelemetry for metrics. It is also more efficient, e.g. it requires less network bandwidth and less CPU to transfer.

[1] https://github.com/prometheus/docs/blob/main/content/docs/in...


Ok, then do you have a suggestion for an alternative, or do you just put up with OT?


I agree with OP, but I put up with Otel


Okay but can you point to some specific experience and suggest improvements?


Try implementing an OTEL Tracer. This interface is insane, and should really just be a struct.

https://github.com/open-telemetry/opentelemetry-go/blob/trac...


how is this interface insane?

It's a list of the behaviors you need to implement if you're rolling your own OTEL Tracer Span implementation, and not using one of the multiple available.

In contrast, OpenTracing's interfaces had hardly any required methods, so you had to do a runtime type-cast to whichever implementation you were using in order to access anything useful on the Span, like the OperationName.


Why were you implementing your own tracer? Don't they publish an implementation?


Yes, all SDKs have a tracer you can just use. While you can technically create your own tracer, you're officially in Hard Mode territory - there's no highly extensible system I'm aware of that makes swapping core concepts (that already have a default) easy.
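
For reference, the "just use it" path looks something like this in Go; the scope and span names here are made up, and this assumes the usual global-provider setup:

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel"
    )

    func handleRequest(ctx context.Context) {
        // The tracer comes from whatever TracerProvider was registered globally
        // (the SDK's default, or one you installed with otel.SetTracerProvider).
        tracer := otel.Tracer("github.com/example/myapp") // hypothetical scope name
        ctx, span := tracer.Start(ctx, "handle-request")  // hypothetical span name
        defer span.End()

        doWork(ctx) // pass ctx on so child spans attach to this one
    }

    func doWork(context.Context) {}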


The OTEL official project libraries don't work well on the web frontend yet. No way of correlating errors to source-maps for instance, at least out of the box.

The web browser collector published by the OTEL project uses Zone.js to hijack just about everything in the browser into contexts. If you've used modern Angular before, you may recognize zone.js; it's a real pain sometimes, and it messes with globals, which isn't great, as it can create situations where behavior isn't predictable.

I don't know that OTEL has any standard around things like session replays either. Lots of telemetry platforms support this (Sentry, Rollbar, DataDog etc)

I think a lot of backend teams have really liked it. I do like the cross-boundary nature of spans, where you can follow them by a unique tag across your entire system.

I personally find it extremely verbose at times, in terms of the payload it generates; some logging platforms are more compact in this regard, but in practice I haven't noticed it to be an issue.


I think with native Promises (and async/await?) there is currently no way to implement something like Zone.js properly. I've tried to instrument my code manually, but it's really error-prone and verbose. We really need something like https://nodejs.org/api/async_context.html#class-asynclocalst... to be implemented in the browser.


You can follow [0] which is currently stage 2 to fix this

[0]: https://github.com/tc39/proposal-async-context


In addition to this, there is the new (stage 3, even!) explicit resource management proposal[0], supported by TypeScript version >= 5.2[1]

Though I agree that async context is better fit for this generally, the ERM should be good for telemetry around objects that have defined lifetime semantics, which is a step in the right direction you can use today

[0]: https://github.com/tc39/proposal-explicit-resource-managemen...

[1]: https://www.totaltypescript.com/typescript-5-2-new-keyword-u...


Thanks! I was only aware of a Zones proposal which was withdrawn I think.


We've found that across most platforms, Otel is more of a starting point to building good instrumentation libraries. We've been building on top of Otel/Splunk's browser SDK implementation for our package [1] which does roll in session replay, and also adding on better exception tracking, etc. None of which remotely comes out of the box unfortunately.

Being able to connect frontend sessions -> backend trace/logs has been a huge DX change though imo.

[1] https://www.hyperdx.io/blog/browser-based-distributed-tracin...


Do check this doc from Otel https://opentelemetry.io/docs/instrumentation/js/getting-sta...

It is not completely solving the issue, but a starting point


this is exactly what I was referencing; it's not really a starting point. I read through all the docs quite thoroughly, and these are missing features and design choices


DataDog's front end instrumentation (Real User Monitoring) is also notably unrefined. Playing with Duplo blocks level of finesse.

Does anyone have an even part way start at doing front end tracing?


As with most of the DataDog platform, I have found that once you go underneath the sheen it leaves a lot to be desired.

I've had better runs with Bugsnag for pure error reporting and more recently Sentry, which can do RUM / Session Replay collections.

None are what you expect though. If you want really good user behavior analytics FullStory is still top notch


Have you come across Grafana Faro at all? You can have it send to Grafana Agent which is OSS and can store traces in other locations.


A few of my colleagues and I had the silly (?) idea that you don't really need logs anymore. Instead of log messages you just attach span events [0]. You then just log the span title and a link to that span in Jaeger; something like [1]. I've only really tried that in my private project, but it felt pretty good. The UI of Jaeger could be a bit better to support that usage, though.

Edit: Actually, those colleagues are doing a talk about that topic. So, if you are in Germany and Hannover area, have a look at [2] and search for "Nie wieder Log-Files!".

[0]: https://opentelemetry.io/docs/instrumentation/ruby/manual/#a...

[1]:

    tracing.ts:38 Usecase: Handle Auth 
    tracing.ts:47 http://localhost:16686/trace/ec7ffb1e23ddbb8dd770a3f08028666b
    tracing.ts:38 Adapter: Find Personal Board 
    tracing.ts:47 http://localhost:16686/trace/e22d342316ab0d7d23230864008e27bc
    tracing.ts:38 Adapter: Find Starred Board List 
    tracing.ts:47 http://localhost:16686/trace/129f89cee26d54cfdc38abea368d9b4e
    tracing.ts:38 Adapter: Find Personal Board List 
    tracing.ts:47 http://localhost:16686/trace/97948127d77501ff0c65a5db21b21b5a
[2]: https://javaforumnord.de/2023/programm/
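
In Go, the "span event instead of a log line" pattern described above looks roughly like this; the event name and attribute are invented for illustration:

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/trace"
    )

    // Instead of log.Info("user authenticated"), attach an event (with
    // attributes) to the span that is already active for this request.
    func recordAuth(ctx context.Context) {
        span := trace.SpanFromContext(ctx)
        span.AddEvent("user authenticated", // hypothetical event name
            trace.WithAttributes(attribute.String("auth.method", "oauth")),
        )
    }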


Depending on how greenfield a project is, you don't even need span events unless you absolutely require a timestamp for a specific operation with no duration. Just using spans for every meaningful operation is like having more powerful structured logs.

This approach isn't possible for a lot of systems that have existing logs they need to bring along, but if you're greenfield enough, I'd recommend it.


Using only spans is surprisingly effective!

I did a talk about this at QCon London last year

https://www.infoq.com/presentations/event-tracing-monitoring...


Agreed.

Same is true for metrics derived from spans. (Though for metrics, you don't need to sample, and for spans you might. So keep in mind.)


I've thought about doing this as well, but I like being able to use dumb tools to get an idea of what's going on. There's a lot that has to be working correctly to use traces. Or maybe I'm just scared of the tooling because I don't have enough experience with it yet idk


You can also just log the spans to stderr/stdout as they are being created -- I've done this on a previous project with this approach of "spans first".

It made it debuggable via output if needed, but the primary consumption became span oriented.


Good idea yeah, but do the same notions of log level apply?


This was the idea behind Stripe's Veneur project - spans, logs, and metrics all in the same format, "automatically" rolling up cardinality as needed - which I thought was cool but also that it would be very hard to get non-SRE developers on board with when I saw a talk about it a few years ago.

https://github.com/stripe/veneur


You don't even really need to ship traces anywhere. You can just keep them in-process, and build an API on top of that in-memory trace data.


I’ve been talking about the same thing at my place. I think it makes a ton of sense and then can kill logs almost completely


We tried this with Grafana Cloud/Tempo but hit trace size limits. Logs are still good for "events"


OpenTelemetry is a marketing-driven project, designed by committee, implemented naively and inefficiently, and guided by the primary goal of allowing Fortune X00 CTOs to tick off some boxes on their strategy roadmap documents.

It's not something that anyone with a choice in the matter should be using.


It looks like every other comment in this thread is favorable to very positive, can you go into more detail about what specifically isn't good about it?


Not the previous poster, but I had to implement it a few years ago, and I found it unbelievably complex, with dense and difficult-to-read specifications. I've implemented plenty of protocols and formats from scratch using just the specification, but rarely have I had as much difficulty as with OpenTelemetry.

I guess this is something you don't notice as merely a "user", but IMHO it's horribly overengineered for what it does and I'm absolutely not a fan.

I also disliked the Go tooling for it, which is "badly written Java in Go syntax", or something along these lines.

This was 2 years ago. Maybe it's better now, but I doubt it.

In our case it was 100% a "tick off some boxes on their strategy roadmap documents" project too and we had much much better solutions.

OTel is one of those "yeah, it works ... I guess" but also "ewwww".


I'd recommend trying it out today. In 2021, very few things in OTel were GA and there wasn't nearly as much automatic instrumentation. One of the reasons why you had to dive into the spec was because there was also very little documentation, too, indicative of a heavily in-progress project. All of these things are now different.


I'll be happy to take your word that some implementation issues are now improved, but things like "overengineered" and "way too complex for what it needs to do" really are foundational, and can't just be "fixed" without starting from scratch (and presumably this is all by design in the first place).


That's fair. I find that to be a bit subjective anyways, so I don't have much to comment on there. Most languages are pretty lightweight. For example, initializing instrumentation packages and creating some custom instrumentation in Python is very lightweight. Golang is far more verbose, though. I see that as part and parcel of different cultures for different languages, though (I've always loved the brevity of Python API design and disliked the verbosity of Go API design).


One of the main reasons I became disillusioned with OTel was that the project treated "automatic instrumentation" as a core assumption and design goal for all supported languages, regardless of any language-specific idioms or constraints.

I'm not an expert in every language, but I am an expert in a few, and this just isn't something that you can assume. Languages like Go deliberately do not provide the sorts of features needed to support "automatic instrumentation" in this sense. You have to fold those concerns into the design of the program itself, via modules or packages which authors explicitly opt-in to at the source level.

I completely understand the enormous value of a single, cross-language, cross-backend set of abstractions and patterns for automatic instrumentation. But (IMO and IME) current technology makes that goal mutually exclusive with performance requirements at any non-trivial scale. You have to specialize -- by language, by access pattern (metrics, logs, etc.), by concrete system (backend), and so on -- to get any kind of reasonable user experience.


The Spec itself is 'badly written Java'. I haven't been a Java dev for about ten years. At this point it's a honey pot for architectural astronauts - a great service to humanity.

That is, until some open standard is defined by said Java astronauts.


> OpenTelemetry is a marketing-driven project, designed by committee, implemented naively and inefficiently, and guided by the primary goal of allowing Fortune X00 CTOs to tick off some boxes on their strategy roadmap documents.

I'm the founder of highlight.io. On the consumer side as a company, we've seen a lot of value from OTEL; we've used it to build out language support for quite a few customers at this point, and the community is very receptive.

Here's an example of us putting up a change: https://github.com/open-telemetry/opentelemetry-js/pull/4049

Do you mind sharing why you think no-one should be using it? Some reasoning would be nice.


I don't think that's true. It seems like it's more of an "oh shit, all this open source software emits Prometheus metrics and Jaeger traces, but we want to sell our proprietary alternatives to these and don't want to upstream patches to every project". (Datadog had a literal army of people adding datadog support to OSS projects. Honestly, probably a great early-career job; diving into unfamiliar codebases is a superpower.)

OTel lets the open source projects use an abstraction layer so that you can buy instead of self-host.

None of this has ever made me feel super great, but in the end I would probably consider OTel today for services that people other than my company operate. That way if some user wants to use Datadog, we're not in their way.

I used OTel in the very very early days and was rather disappointed; the Go APIs were extremely inefficient (a context.Context is needed to increment a counter? no IO in my request path please), and abstracted leakily (no way to set histogram buckets when exporting to Prometheus). I assume they probably fixed that stuff at some point, though.
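
For context, this is the kind of call being described; as of recent otel-go versions the metrics API still takes a context on every increment. The scope, metric, and attribute names below are invented:

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
    )

    func countRequest(ctx context.Context) {
        meter := otel.Meter("github.com/example/myapp") // hypothetical scope name
        requests, err := meter.Int64Counter("myapp.requests.total")
        if err != nil {
            return
        }
        // The context is part of the call even for a simple increment.
        requests.Add(ctx, 1, metric.WithAttributes(attribute.String("route", "/users")))
    }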


What helps hosted data collectors helps self-hosting setups just as much.

More and more solutions are getting built in OTEL support, which means you can relatively seamlessly switch between backends without changing anything in your application code.


This only makes sense if you're in a world where you're switching backends more than once, which means you're not seriously programming, you're just burning VC money for lottery tickets.


I agree with this. For internal apps, pick a system and stick with it. If you're excited by Datadog's marketing pitch, just buy it and use it. It will not make or break your startup; like if your Datadog bill is what's standing between you and profitability, then you probably didn't actually find product/market fit. Switching to Prometheus at that point also won't help you find product/market fit.

In the 2 jobs where I've set up the production environment, I just picked Prometheus/Jaeger/cloud provider log storage/Grafana on day 1 and have never been disappointed. You explode the helm chart into your cluster over the course of 30 minutes, and then move on to making something great (or spending a week debugging Webpack; can't help you with that one).


You had better have PMF if you pick Datadog!

https://thenewstack.io/datadogs-65m-bill-and-why-developers-...


Also useful for local dev (you can still use it locally without a SaaS backend) and it helps with interoperability. Infrastructure like Envoy and nginx can emit spans that integrate with 1st party code and other 3rd party code. OSS libraries are more likely to implement an open standard so they just plug and play and emit data for internal things they're doing (especially helpful for things like DB and HTTP)


OTel is the backend, in-program equivalent of "we need all of five analytics systems on our frontend to figure out that users bounce when our page takes 10s to load because it has five analytics systems in it".


What do you use?


That's overly harsh; they are doing good work, and I think their data model is a step in the right direction.

Their processors are quite capable and the entire receiver and exporter contrib collection is pretty good.

I'm not saying it's the best solution out there because that clearly depends on each use case but I don't think such harsh criticism makes sense.

Disclaimer: I'm part of the fluent-bit maintainer team.


Being able to have services speak OTLP and having my application configurations simplified to sending data to the OTEL collector is great.

From an ops point-of-view devs can add whatever observability to their code and I can enforce certain filtering in a central place as well as only needing one central ingress path that applications talk to.

Also because everything emits OTLP if we ever want to move to new backends it's just a matter of changing a yaml file and not rewriting applications to support a new logging backend.

Given the choice of going back to the old way of using vendor-specific logging libraries, I will continue using OTEL 10/10 times because even given its warts, it's still a lot nicer than the alternatives.
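
As a concrete sketch of what "the application only talks to the collector" looks like on the app side in Go (the endpoint address is made up, and in practice it's usually picked up from OTEL_EXPORTER_OTLP_ENDPOINT instead of being hard-coded):

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // The app only knows the collector's OTLP endpoint; filtering, sampling,
    // and the choice of backend live in the collector's own config.
    func newTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
        exp, err := otlptracegrpc.New(ctx,
            otlptracegrpc.WithEndpoint("otel-collector:4317"), // hypothetical address
            otlptracegrpc.WithInsecure(),
        )
        if err != nil {
            return nil, err
        }
        return sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp)), nil
    }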


Switching observability backends is like switching databases -- feasible in theory, impossible in practice for anything but the most trivial use cases.

You can't build a sound product if that's one of the design requirements.


Except it isn't impossible because using OTLP as the data format means you're decoupled from any single backend.

Switching to a new backend is as simple as deploying the new backend, changing 1 line in the OTEL Collector yaml, then having your front-end pull from the new backend. 0 changes to application code necessary.


Metrics, logs, and traces are abstract categories of telemetry data, representing the most common modalities in which that data is produced and consumed. They are explicitly not specific or concrete types that can be precisely defined by e.g. a Protobuf schema.

These domain concepts are descriptive, not prescriptive. They don't, and can't possibly, have specific wire-level definitions. So another way to phrase my point might be to say that OTel is asserting definitions which don't actually exist.

Telemetry necessarily requires specialization between producer (application/service) and consumer (observability backend) in order to deliver acceptable performance. It's core to the program as it is written: more like error handling than e.g. containerization.


What should people use?

With basic parameters in place so it doesn't eat your bill, it's been working great for me for years. Initially with New Relic, then Datadog; now a setup with OpenTelemetry is good enough.


Instrumentation isn't solved by any single specific thing. It's a praxis that you apply to your code as you write it, like, I guess, error handling; it's not a product that you can deploy, like, I guess, Splunk or New Relic or whatever else.

You should "use" metrics, logs, and traces thru dependencies that are specific to your organization. The interface between your business logic and operational telemetry should be abstract, essentially the same as a database or a remote HTTP endpoint or etc. The concrete system(s) collecting and serving telemetry data are the responsibility of your dev/ops or whatever team.

Main point: instrumentation is part of the development process, not something that's automatic or that can be bolted-on.


Have you worked with OTEL before? Basically all of your points about instrumentation are actually quite sympathetic to OTEL's view of the world. The whole point of OTEL is to provide some standards around how these pieces fit together - not to solve for them automatically.


I've been deeply involved with OTel from even before it was a CNCF project. My experiences with the project, over time, have made me basically abandon it as unsound and infeasible as of a year or two ago. Those experiences also inform comments like the ones I've made here.


Can you elaborate on what unsound and infeasible mean? I'm newer to OTel than you (~6 months of working with it in depth), and don't really understand what you're getting at. It's solving real problems in my organization, with only a "regular" amount of pain for a component of its size.


Okay so what’s the interface? Sounds like what OTEL provides to me


There are well-defined interfaces for specific sub-classes of telemetry data. Prometheus provides a set of interfaces for metrics which are pretty battle-tested by now. There are similar interfaces for logs and traces, authored by various different parties, and with various different capabilities, trade-offs, etc.

There is no one true interface! The interface is a function of the sub-class of telemetry data it serves, the specific properties of the service(s) it supports, the teams it's used by, the organization that maintains it, etc. etc.

OTel tries to assert a general-purpose interface. But this is exactly the issue with the project. That interface doesn't exist.


OTEL is a set of interfaces, so I’m not sure your last point applies. I do agree that battle tested things like Prometheus work great, but why not have a set of standardized interfaces? There is clearly a cost to having them; for some projects this may be too much. For the projects I’ve used it in it let me spin up all the traces and telemetry without thinking hard.


> What should people use?

I recall Apache Skywalking being pretty good, especially for smaller/medium scale projects: https://skywalking.apache.org/

The architecture is simple, the performance is adequate, it doesn't make you spend days configuring it and it even supports various different data stores: https://skywalking.apache.org/docs/main/v9.5.0/en/setup/back...

The problems with it are that it isn't super popular (although has agents for most popular stacks), the docs could be slightly better and I recall them also working on a new UI so there is a little bit of churn: https://skywalking.apache.org/downloads/

Still better than some of the other options when you need something that just works, instead of spending a lot of time configuring something (even when that something might be superior in regards to features): https://github.com/apache/skywalking/blob/master/docker/dock...

Sentry comes to mind (OpenTelemetry also isn't simpler due to how much it tries to do, given all the separate parts), compare its complexity to Skywalking: https://github.com/getsentry/self-hosted/blob/master/docker-...

I wish there was more self-hosted software like that out there, enough to address certain concerns in a simple way on day 1 and leave branching out to more complex options like OpenTelemetry once you have a separate team for that and the cash is rolling in.


I'm honestly thinking that one of the statsd variants with label support would have been just fine if I'd had a time machine. The complexity overhead of labels in OpenTelemetry does not make it the slam-dunk it appears to be.

Internally, OTEL has to keep track of every combination of labels it's seen since process start, which can easily come to dominate the processing time in an existing project. It's another in a long line of tools that dovetail with my overall software development philosophy which is that you can make pretty much any process work for 18 months before the wheels fall off.

By the time you notice OpenTelemetry is a problem, you've got 18 months of work to start trying to roll back.


> Internally, OTEL has to keep track of every combination of labels it's seen since process start, which can easily come to dominate the processing time in an existing project.

Well, every unique combination of labels represents a discrete time series of telemetry data, and the total set of all time series in your entire organization always has to be of finite and reasonable cardinality. This means that label values always have to be finite e.g. enumerations, and never e.g. arbitrary values from user input.

> my overall software development philosophy which is that you can make pretty much any process work for 18 months before the wheels fall off.

The set of labels in your process after (say) one day of regular traffic should be basically the same size as after (say) 18 months of regular traffic. If it isn't, that usually signals that you're stuffing invalid data into label values.
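
A tiny illustration of the bounded-vs-unbounded distinction, assuming the otel-go metrics API; the attribute names here are invented:

    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
    )

    func record(ctx context.Context, requests metric.Int64Counter, userID string) {
        // Fine: "status" has a small, fixed set of values, so the set of label
        // combinations stays bounded no matter how long the process runs.
        requests.Add(ctx, 1, metric.WithAttributes(attribute.String("status", "timeout")))

        // Dangerous: one time series per user, so the SDK (and the backend)
        // must track an ever-growing set of label combinations.
        requests.Add(ctx, 1, metric.WithAttributes(attribute.String("user.id", userID)))
    }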


I have no idea why you think that all attributes need to be buffered in process forever. Most metrics systems simply keep key sets cached in RAM for as long as they're still being emitted. Many drop unused key sets after something like 10 minutes. But as with all metrics processing, you should ideally keep cardinality to a bounded set in order to avoid these types of issues both client- and server-side.

I'm sure there are valid qualms with OTEL in general, but this ain't one of them. Any and all metrics telemetry systems can fall into the same design constraint you pointed out.


I don't know which implementations support invalidation, but it's not happening for the Node.js implementation.

Push implementations do not have this problem at the client end.


I kind of agree with you. Clueless managers just putting “open telemetry” on the roadmap without contextualising costs/benefits.


Well, roll up your sleeves and fix the performance bugs that affect you (source: I have).


I have no reason to do so, because I don't believe that OpenTelemetry is a project that was created, or is maintained, in good faith to its stated goals.


Care to elaborate a bit more on the goals contrast?


https://opentelemetry.io

> OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

You can absolutely categorize telemetry into these high-level pillars, true. But the specifics on how that data is captured, exported, collected, queried, etc. is necessarily unique to each pillar, programming language, backend system, organization, etc.

That's because telemetry data is always larger than the original data it represents: a production request will be of some well-defined size, but the metadata about that request is potentially infinite. Consequently, the main design constraint for telemetry systems is always efficiency.

Efficiency requires specialization, which is in direct tension with features that generalize over backends and tools, e.g.

> Traces, Metrics, Logs -- Create and collect telemetry data from your services and software, then forward them to a variety of analysis tools.

and features that generalize over languages, e.g.

> Drop-In Instrumentation -- OpenTelemetry integrates with popular libraries and frameworks such as Spring, ASP.NET Core, Express, Quarkus, and more! Installation and integration can be as simple as a few lines of code.

I think OTel treats these goals -- which are very valuable to end users!! -- as inviolable core requirements, and then does whatever is necessary to implement them. But these goals are not actually valid, and so the resulting code is often inefficient, incoherent, or even unsound.


Slight tangent for a moment, I really really hate the "subscribe" popup that comes up on this blog. It is not clear at all that you can just close it and not give your email since there is no "x" button. Instead it has the incredibly unintuitive "continue reading" under the subscribe button that I did not think would work. Clicking out of it also did not work. Seriously we can and should be better than this.

On the topic of open telemetry. I have been long wanting to play with it and see if it offers all of the capabilities when we send the data to datadog. But I have been reluctant to add in another thing to manage and train on if it means that the datadog agent is still necessary for anything outside of the basics.

Has anyone else actually tried hooking this up to datadog?

Edit: just to be clear, my driving goal of this is not necessarily to keep it with datadog. But that is currently where much of our alerting and logs are now. So the idea would be to switch to open telemetry which would then allow us to (theoretically) move to something else down the line.


WRT the popup, that's a medium.com thing. I agree that it's very annoying.


It's not Medium, it's Substack.


Personally I found the poll much worse. Got asked for my opinion, clicked on one of the options, and got rewarded with a full-screen "create an account". Hit the back button and got taken back to HN.

Nice.


Is there yet any way of having a front end for this which doesn't significantly dent your revenue stream either in staffing, infrastructure or license fees? We land over 2000 requests/second and it's expensive just keeping logs.


I recommend sampling traces if you aren't. I've been unimpressed with datadog apm, which has no affordable configuration. We've been running our own Jaeger stack with 0.1% sampling, and it's negligible to run compared to datadog apm.

For metrics and logs, sampling isn't so useful, so I don't have a good answer. Datadog has 80% gross margin, so at most 20% of what you pay them is the infra, so you stand to save a lot of money running your own open source stacks if your labor costs would be less than that 80%. With datadog, we have a project every 3 months to reduce usage, so it's not like we aren't constantly babysitting it anyway.


> I've been unimpressed with datadog apm, which has no affordable configuration

FWIW we sample in datadog APM and it works fine to control costs, I'm not sure what issues you hit.


If you’re looking for open source APM stack which is OpenTelemetry native and you can self host - you can check out SigNoz ( https://github.com/SigNoz/signoz)


> datadog apm, which has no affordable configuration

In my experience, Datadog's ingestion sampling works pretty well.

And there's retention filters you can use to override.


Sampling is the answer. Sample 1% of success and all the errors.

Cost is one thing, but you would be surprised how heavy observability can be on a service; it uses a lot of CPU.


Sampling is fine at the query layer, but if you sample at the ingest layer -- and therefore drop a majority of your telemetry data outright -- that's a total hack, and I can't see how the resulting system is anything but unsound.


You use sampling rates that are statistically significant enough for the system you're observing. You can always make exceptions to the default rate based on other heuristics. What's wrong with that, for the kinds of insights it provides?


General-purpose observability systems serve two use cases: presenting a high-level summary of system behavior, and allowing operators to inspect telemetry data associated with a specific e.g. request.

The former use case is often solved by a specific and narrow kind of observability data, which is metrics. A common tool for that purpose is Prometheus. You certainly can't query Prometheus for individual requests, which is fine, and accepting that invariant allows Prometheus to treat input data as statistical, in the sense that you mean in your comment.

But if we're talking about general-purpose telemetry, we're talking about more than just high-level summaries of system behavior, we also need to be able to inspect individual log events, trace spans, etc. If a user made a request an hour ago with reqid 123, I expect to be able to query my telemetry system for reqid 123 and see all of the metadata related to that request.

A telemetry system that samples prior to ingest certainly delivers value, but it can only ever solve the first use case, and never the second.


Define "significant". At $lastco, we routed traces to Cassandra and stored them in an AWS elasticsearch domain. Jaeger was used to visualize traces. We also wrote some elasticsearch queries to generate some basic reports, eg finding the most sluggish queries. Pretty standard stuff if you follow the OTEL/jaeger tutorials.

Traces came on the order of hundreds/second, but we didn't have downsampling turned on, just collected all. Traces were saved 7 days (configurable). Very little actual optimization at the point where I left.

I think it cost on the order of dozens to hundreds of dollars a month.

There's an environment variable you can set on your containers that defines how the tracing sampler behaves. It's in the docs; see OTEL_TRACES_SAMPLER.

https://opentelemetry.io/docs/specs/otel/configuration/sdk-e...
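
For illustration, here's a rough TypeScript sketch (assuming the standard `@opentelemetry/sdk-trace-node` and `@opentelemetry/sdk-trace-base` packages) of the same head sampling, configurable either through those env vars or in code:

    // Env-var route, picked up by most OTel SDKs at startup:
    //   OTEL_TRACES_SAMPLER=parentbased_traceidratio
    //   OTEL_TRACES_SAMPLER_ARG=0.1   (keep ~10% of root traces)
    //
    // Equivalent in-code configuration with the Node SDK:
    import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
    import {
      ParentBasedSampler,
      TraceIdRatioBasedSampler,
    } from "@opentelemetry/sdk-trace-base";

    const provider = new NodeTracerProvider({
      // Sample 10% of new traces; child spans follow their parent's decision.
      sampler: new ParentBasedSampler({
        root: new TraceIdRatioBasedSampler(0.1),
      }),
    });
    provider.register();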


I think the issue described is when you're storing in a vendor backend, e.g. Datadog or New Relic. If you're not careful with how much tracing data you ingest, "significant" can be 7 figures' worth of bills.


Most observability vendors support OTEL at this point. To plug the OSS project I work on that supports OTEL ingestion:

https://github.com/grafana/tempo/


Thanks I had no idea Tempo even existed despite using Grafana. Will read into it.


IMO, Jaeger is easier to setup/manage and has a better interface than Grafana/Tempo. It's easy to add Jaeger to your local dev stack so you can have tracing while developing.

FWIW I now use Tempo because I have everything else in Grafana (Prometheus, Loki), but I do miss using Jaeger.


> It's easy to add Jaeger to your local dev stack so you can have tracing while developing.

Tempo can be spun up with docker compose using a local disk for ephemeral storage/querying: https://github.com/grafana/tempo/blob/main/example/docker-co...

Maybe this meets your needs?

> Jaeger is easier to setup/manage and has a better interface than Grafana/Tempo

What do you enjoy about the Jaeger interface? Perhaps it's a gap in Tempo we can improve.


Jaeger can use multiple backends for storage, including Tempo, so it's not an either/or situation.

I'm fairly sure there was an official Grafana-provided Jaeger gRPC plugin for Tempo, but can't easily find it, only this one: https://github.com/flitnetics/jaeger-tempo


Why is this downvoted? This is a pain point for us too, except we're at 500k requests/second. We're currently Datadog but everyone knows they're too expensive.


Cost management via sampling is still largely a vendor concern. Each vendor has a different solution, and while some are better than others, all can be effective at bringing costs down.


We trace everything and first collect all spans in their trace. Then we sample the successful traces and retain all traces ending in a failed state. This greatly reduces the need for storage.
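
The usual place to do that is the Collector's tail_sampling processor, but to make the idea concrete, here's a rough in-process TypeScript sketch. The `ErrorBiasedExporter` name and ratio are made up, and it only sees spans within one export batch, so treat it as an illustration rather than a real tail sampler:

    import { SpanExporter, ReadableSpan } from "@opentelemetry/sdk-trace-base";
    import { ExportResult, ExportResultCode } from "@opentelemetry/core";
    import { SpanStatusCode } from "@opentelemetry/api";

    // Wraps a real exporter: keep every trace that contains an error span,
    // keep only a small fraction of the all-success traces.
    class ErrorBiasedExporter implements SpanExporter {
      constructor(private inner: SpanExporter, private successRatio = 0.01) {}

      export(spans: ReadableSpan[], done: (result: ExportResult) => void): void {
        // Group the batch's spans by trace ID.
        const byTrace = new Map<string, ReadableSpan[]>();
        for (const span of spans) {
          const id = span.spanContext().traceId;
          const bucket = byTrace.get(id) ?? [];
          bucket.push(span);
          byTrace.set(id, bucket);
        }
        const kept: ReadableSpan[] = [];
        for (const group of byTrace.values()) {
          const failed = group.some((s) => s.status.code === SpanStatusCode.ERROR);
          if (failed || Math.random() < this.successRatio) kept.push(...group);
        }
        if (kept.length === 0) return done({ code: ExportResultCode.SUCCESS });
        this.inner.export(kept, done);
      }

      shutdown(): Promise<void> {
        return this.inner.shutdown();
      }
    }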


Try coralogix.


I am building https://scratchdb.com/ to address this. It's an HTTP wrapper around Clickhouse, and uses pay-as-you-go pricing. It's also open source and easy to host yourself (single Go binary). Currently have users sending on the order of 1000 requests per second, so definitely have capacity.

Typically seeing a 0.1 compression ratio on data before other optimizations.

I have it connected to Fly.io here: https://scratchdb.com/blog/fly-logs-to-clickhouse/

I'd be really grateful to learn more about what you're looking for (how are you even managing logs today?) Even if you end up not using scratchdb it'll help me figure out the next thing to build!


Honeycomb's pricing is pretty reasonable - but looking at your volume, some sampling might help also


If you’re using Honeycomb, how did you find their solution’s developer/DevOps engineer ergonomics, their support, and their overall experience?


The only thing I can recall being a bit of a stumbling point at the beginning was finding "my" trace from a local environment. I just didn't know where to find it. Once I figured that out I've not had problems.

The high-level view, and being able to draw a box around something that looks weird on my graph and have Honeycomb tell me what's different inside and outside the box, is amazing (it's called BubbleUp, if you're searching).

It's faster than any other tool I've used; usually data is available by the time I've switched to their ui from the curl command to our API. It's mind blowing actually.

Other saas and self hosted options I've tried have all been awful in some way or other; honeycomb is a breath of fresh air, and going back to other tools after using honeycomb is painful.

I'm not sponsored or working for them btw, they just make one of the few products that I genuinely love using.


Grafana seems to be an option? Handles metrics, logs and traces. I don't know what storage costs look like though if you are self hosting.. https://grafana.com/


I'm working on giving you more options for saving money on logging with dynamic log levels: https://prefab.cloud/features/log-levels/

The hypothesis being that you can save money by turning things down, but easily turn them back up when you're actively investigating. Or turn the volume up for a targeted sub-segment of your traffic.

We've done some exploration into providing the same for APM and the rest of OTEL and I think it's pretty doable. hmu if you want to talk.


I would ask this a different way--is there a single OSS project that handles collecting all of the OTEL metrics/logs/traces? Folks keep saying you can't do this in one data store/format, yet Elastic seems to, and having to manage separate storage/infrastructure for all of these tools is intense.

Could Prometheus be augmented to store metrics, logs, and traces somehow? I don't really mind if it doesn't scale well on a single instance or isn't highly available; I'll just add more instances and aggregate them.


I've been watching this with a lot of interest: https://signoz.io/


That looks really promising, thank you for sharing it


Great to hear that. I am one of the maintainers at SigNoz - if you have any queries as you implement , do ask in our slack - https://signoz.io/slack


I guess you could take a look at this: https://openobserve.ai/

It's in Rust to add some HN catnip.


Why .ai?


Have you used it? How does it stack up?


Haven't used it yet, I'm a bit too scared to put something like this into production anywhere. I might do a small installation with a toy project somewhere, though. Seems easy enough to set up.


It's really not that intense. I basically set up my last co's telemetry infrastructure all by myself, using terraform, otel-python, jaeger, and AWS elasticsearch.

This TF project does most of the heavy lift. https://github.com/telia-oss/terraform-aws-jaeger


Jaeger for tracing, Elasticsearch for logs? What are you using for metrics?


Likely Prometheus - Jaeger for tracing, Elastic for logs, Prometheus for metrics is a pretty common and effective OSS observability stack


Super nitpick, but you meant profit, not revenue, right?


What's the value of the logs when they can instead be boiled down to metrics way more concisely (at least with OLAP-like storage)?

For traces, as mentioned elsewhere sampling is key. Some systems use % of requests, and other systems cap traces to traces/sec.

Logs are just expensive but easy to use, so the best strategy is to lean on filtering to avoid a bunch of pointless repetition that doesn't add value.

TLDR, metrics require more planning but can end up being much cheaper than logs.
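
To make the trade-off concrete, here's a tiny TypeScript sketch with the OTel metrics API (the meter, counter, and attribute names are made up): instead of emitting a log line per event, you record a counter with low-cardinality attributes and let the backend do the math.

    import { metrics } from "@opentelemetry/api";

    const meter = metrics.getMeter("checkout-service");
    const declined = meter.createCounter("payments.declined", {
      description: "Payments declined, by reason",
    });

    // Hot path: one cheap increment instead of one log line per occurrence.
    declined.add(1, { reason: "insufficient_funds" });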


Oh, and the operational overhead of performing full trace recording at 2k requests/sec is non-trivial. You'll start to feel the overhead of adding extra telemetry.


Most of the comments on this thread are talking about using OpenTelemetry to send metrics/logs to self-hosted collector jobs. Though using a standard library supported by a bunch of collector tools like ClickHouse is useful in itself, there's another benefit too. The specification allows transferring trace IDs across system boundaries. If you and your dependencies all implement the OpenTelemetry spec, then you get spans that reveal in granular detail what happened over the journey. For example, you could learn that it was your database loading a page from disk that took so long, or that a cloud service's metadata plane was the responsible span for high latency.


I am very happy with the progress of OpenTelemetry. When I pushed for it years ago, my developers were hesitant (it was new and they had never heard of it), but when I revisited the topic a year ago, OpenTelemetry was everywhere in our systems and the log/traceability vendor we have was switching over to it.

Thanks to this amazing group!


OpenTelemetry is being pushed as a replacement for AWS X-Ray SDKs by AWS, but it's in such a broken state for Lambda right now. A 200-500% performance penalty for using it is insane[1][2].

[1]: https://github.com/aws-observability/aws-otel-lambda/issues/...

[2]: https://github.com/open-telemetry/opentelemetry-lambda/issue...


How do you use it with Lambda? Delta mode? Or tons of labels?


I use X-Ray directly via AWS Powertools for Lambda[1] instead of going through OTel. The underlying AWS X-Ray SDK team pushes OTel pretty hard in their README[2] and Github issues, but the current situation is that OpenTelemetry is just not production ready for serverless applications.

[1]: https://docs.powertools.aws.dev/lambda/typescript/latest/

[2]: https://github.com/aws/aws-xray-sdk-node


author of the post here - was inspired to write this post after working with OTEL for a few months - realized that OTEL had a ridiculously large surface area that most people (at least myself) might not be aware of

I see a lot of comments about how overly complex OTEL is. I don't disagree with this. In some sense, OTEL is very much the k8s of observability (good and bad).

The good is that it is a standard that can support every conceivable use case and has wide industry adoption. The bad is that there is inherent complexity in needing to support the wide array of use cases.


OTel is great for avoiding vendor lock-in, and the spec stability is a long time coming — great to see. It's also good that there are architecture options aside from sidecar (there's also proxy mode), which is nice for situations like deploying on a PaaS.

That said, the main trouble I've had in the past is instability in the SDKs. The Java one's decent; Go, not so much, for example. Traditionally OTel has been awesome for tracing but not so much for logs, events, and even metrics (obviously depending on your language).

Other than that, auto-instrumentation has been nice. We had this exact functionality when I worked on observability at Netflix; it made starting up and maintaining microservices easy and really helped with adoption.


Since I added OTEL instrumentation to an app, I've already easily swapped out the backends from Prometheus and Jaeger to Grafana Mimir and Tempo.

The lack of lock-in is fantastic, don't know if I've ever just switched technologies that easily, even SQL databases.


Disclaimer: I'm the founder of an observability company (Highlight.io).

OpenTelemetry has been INCREDIBLY valuable to us. Not only has it made it super fast to build out SDKs for our customers, but the fact that it's maintained actively gives us confidence that we're rolling out stable logic to customers' environments.

I agree with the author that OpenTelemetry has succeeded, and it's pretty obvious from the fact that most major observability vendors support it.

In short, to a developer it may not seem particularly valuable, because a metrics/logs/traces API is quite simple whether or not you use OTEL. But the fact that this is an industry-wide spec is where it becomes powerful.

A few


It's quite understandable that an "o11y protocol" is very valuable to an o11y service provider (especially the smaller ones), since it reduces the cost of a vendor swap.


Disclaimer: I am one of the maintainers

Many comments complain about the complexity of using OpenTelemetry, I recommend checking out Odigos, an open-source project which makes working with OpenTelemetry much easier: https://github.com/keyval-dev/odigos

We combine OpenTelemetry and eBPF to instantly generate distributed traces without any code changes.


How does this compare to the OTel Operator feature-wise? I'm loving auto-instrumentation with Grafana Cloud, but operator config is painful and I've hit (and helped fix) a lot of bugs


Hi! Just wanted to let you know I'm getting a 500 error on your docs page.

https://docs.odigos.io/


Can you try again please? It works well for me


FWIW also working fine here.


OTEL is awesome. I was part of the team integrating it at my job, it went really smoothly, and it saved us so much time debugging our microservice application.

I would say the hardest part was getting other devs to use it; a lot of them are stuck in their own ways and did not want to go through the relatively small learning curve...


We're soon launching a GraphQL Analytics, Tracing and Metrics stack on top of OTEL. We've built a custom OTEL exporter to Clickhouse in Go, so we can export OTEL traces and Prometheus metrics all to Clickhouse. We've built this stack for our federated GraphQL Gateway, but it could essentially work for any OTEL service. If you want to learn more, here's some info: https://wundergraph.com/cosmo We're soon going to open source this, so just follow me/us if you're interested.

What I really like about this stack is that you can use our end to end solution, but you're not locked into it. We provide a full service, but you can also just use your own OTEL backend if you want to eject.


So I have a VictoriaMetrics (i.e. Prometheus) setup that I am happy with, but haven't touched OpenTelemetry; why should I care about it? Is it a serious solution for log aggregation, or do I need a separate database for logs?


You shouldn't, unless you want to use the new open source standard for telemetry. You won't benefit from simplicity or performance improvements; it would be quite the opposite. You can check the actual cost of OpenTelemetry adoption here [0]

But if you ever decide to go this path - VictoriaMetrics supports OpenTelemetry protocol for metrics [1]

[0] https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2570

[1] https://docs.victoriametrics.com/Single-server-VictoriaMetri...


Can someone tell me roughly at what point a tool like OpenTelemetry becomes interesting? It seems complicated; at what point should you bother with it instead of just handrolling simple stuff and eyeballing it?


OpenTelemetry is several different things, so you'd have to be more specific.

But, for example, say you write a library and you want your downstream users to be able to see its telemetry. OpenTelemetry provides a standardized interface, so you don't need to make assumptions about their stack.
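
For instance, a library can depend only on the API package and emit spans that stay no-ops until the application wires up an SDK. A rough TypeScript sketch (the library name and the `lookup` helper are made up):

    import { trace, SpanStatusCode } from "@opentelemetry/api";

    declare function lookup(key: string): Promise<string | undefined>; // the library's real work

    const tracer = trace.getTracer("my-cache-lib", "1.2.3");

    export async function get(key: string): Promise<string | undefined> {
      return tracer.startActiveSpan("cache.get", async (span) => {
        try {
          span.setAttribute("cache.key", key);
          return await lookup(key);
        } catch (err) {
          span.recordException(err as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw err;
        } finally {
          span.end();
        }
      });
    }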


CNCF needs an OTEL log aggregator


The otel collector/agent can do this. Not storage/query, but aggregation and processing.


I've been playing with OTEL for a while, with a few backends like Jaeger and Zipkin, and am trying to figure out a way to perform end to end timing measurements across a graph of services triggered by any of several events.

Consider this scenario: there is a collection of services that talk to one another, and not all use HTTP. Say agent A0 makes a connection to agent A1; this is observed by service S0, which triggers service S1 to make calls to S2 and S3, which propagate elsewhere and return answers.

If we limit the scope of this problem to services explicitly making HTTP calls to other services, we can easily use the Propagators API [1] and use X-B3 headers [2] to propagate the trace context (trace ID, span ID, parent span ID) across this graph, from the origin through to the destination and back. This allows me to query the trace backend (Jaeger or Zipkin) using this trace ID, look at the timestamps originating at the various services, and do a T_end - T_start to determine the overall time taken by one call for a round trip across all the related services.
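
(For concreteness, the carrier-based inject/extract dance looks roughly like the TypeScript below; it assumes the `@opentelemetry/propagator-b3` package, and `incomingHeaders` is a stand-in. The carrier is just a key/value bag, so in principle it can be HTTP headers, message metadata, or anything else you can pass alongside an event.)

    import { context, propagation, trace } from "@opentelemetry/api";
    import { B3Propagator, B3InjectEncoding } from "@opentelemetry/propagator-b3";

    declare const incomingHeaders: Record<string, string>; // stand-in for whatever carried the IDs

    // Use B3 multi-header propagation (X-B3-TraceId / X-B3-SpanId / X-B3-ParentSpanId).
    propagation.setGlobalPropagator(
      new B3Propagator({ injectEncoding: B3InjectEncoding.MULTI_HEADER }),
    );

    // Caller side: copy the active trace context into an outgoing carrier.
    const outgoing: Record<string, string> = {};
    propagation.inject(context.active(), outgoing);

    // Callee side: rebuild the context and parent new spans on it.
    const parentCtx = propagation.extract(context.active(), incomingHeaders);
    const span = trace.getTracer("s1").startSpan("handle-event", undefined, parentCtx);
    span.end();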

However, this breaks when a subset of these functions cannot propagate the B3 trace IDs for various reasons (e.g., a service is watching a specific state and acts when the state changes). I've been looking into OTEL and other related non-OTEL ways to capture metrics, but it appears there's not much research into this area though it does not seem like a unique or new problem.

Has anyone here looked at this scenario, and have you had any luck with OTEL or other mechanisms to get results?

[1] https://opentelemetry.io/docs/specs/otel/context/api-propaga...

[2] https://github.com/openzipkin/b3-propagation

[3] https://www.w3.org/TR/trace-context/


I've had pretty decent experience with .NET + Grafana Cloud (hosted Tempo)

You can generate RED metrics then tail based sample and send a subset of full traces.

As with many new things, there's varying maturity in client libraries and still things missing. The Redis .NET client (maybe others?) wasn't very good (maybe alpha/beta quality) but the other .NET stuff seemed reasonable. At least with .NET, it integrates with the existing System.Diagnostics API.

I think libraries in some other languages (Go?) are a bit clumsier, depending on what type of diagnostic/debug/event/performance/reflection APIs the language runtime exposes for introspection.


I'd be very interested to hear about others' experience (either as comments or good blog posts/write-ups they've read) with using OpenTelemetry. I haven't used OTEL stuff directly, but I've always been very disappointed with the telemetry vendors in the past (although as far as I can tell the problems weren't the "these vendors don't talk to each other" kinds of problems that I think OTEL aims to solve).


I've used OpenTelemetry since its original alpha in 2020. Originally the main issue I had was supporting tracing across common libraries (there weren't a lot of libraries supported back then). Now (and I recently worked with it), I would say it's knowing which protocol is supported by which component: your SDK generates spans/metrics in a specific format, then you send that to a collector that accepts a range of protocol versions, and finally you send that to your vendor ... but you need to know which protocol/version it supports.

That's not actually something you can do much about considering the sheer size of OpenTelemetry (both in terms of implementations and vendors working on it), and I expect that for people implementing nowadays the protocol should be pretty stable, so my experience should theoretically no longer be the case.


Experience has been positive. We had to understand (and in places fix/enhance) the agent used in our system but the benefits of it not being an expensive black box were huge. E.g. you can run the whole stack on a laptop and therefore use the same tooling for perf analysis in dev as production.


I run a full OSS otel stack in my home lab. It was a lot of fun to set up, with redundancy and all of the extras. It hasn't been particularly useful, but centralized logging and metrics are pretty to look at in dashboards, which was my motivation.

Let's hear some great debugging stories that have been powered by OTEL. I'd love to hear from the horse's mouth, without marketing speak, how it was worth $$$$ collecting and storing all of this info.


Speaking about OpenTelemetry, has anyone used https://uptrace.dev ?

From a quick glance it seems to be simple, free and open-source deployed as a single Go binary. They use ClickHouse to store data. Almost too good to be true.

I'm contemplating them for a new project.


Fwiw, we do the same (https://highlight.io). Heard good things about uptrace as well.


Telemetry seems like it would be a great candidate for columnar storage formats like Parquet or arrow. In particular I expect that it would compress very well, which could reduce telemetry bandwidth consumption / allow for a bigger sample rate.

Does anybody have any experience with the intersection of these technologies?


Honeycomb built their own columnar database[0] to support their product.

[0] - https://www.honeycomb.io/resources/why-we-built-our-own-dist...


Grafana Tempo also switched from Protobuf storage format to Apache Parquet last year. It's fully open source, and the proposal (from April 2022) is here: https://github.com/grafana/tempo/blob/main/docs/design-propo...

The relevant code for parquet storage backend can be found here: https://github.com/grafana/tempo/tree/main/tempodb/encoding

disclosure: I work for Grafana!


Cool thanks for sharing. Can you say something about how it's worked out? Has it reduced bandwidth or CPU usage?


The Parquet backend helped unlock traces search for large clusters (>400MB/s data ingestion) and over longer periods of time (>24h). It also helped unlock TraceQL (a query language for traces similar to PromQL/LogQL). There's more details in this blog post: https://grafana.com/blog/2023/02/01/new-in-grafana-tempo-2.0...

I don't have the exact CPU/bandwidth numbers on me right now, but CPU usage has gone up by about ~50% on our "Ingester" and "Compactor" components (you can read up about the architecture here - https://grafana.com/docs/tempo/latest/operations/architectur...). But this is optimising for read performance, which improved significantly.


Someone from F5 worked on this with OpenTelemetry [0] for Arrow, another effort was done for Parquet but was dropped [1]

[0]: https://github.com/open-telemetry/oteps/pull/171

[1]: https://github.com/open-telemetry/opentelemetry-proto/pull/3...


Oh nice, thank you (and also solumos) for the links! It looks like oteps/pull/171 (merged June 2023) expanded and superseded the opentelemetry-proto/pull/346 PR (closed Jul 2022) [0]. The former resulted in merging OpenTelemetry Enhancement Proposal 156 [1], with some interesting results especially for 'Phase 2' where they implemented columnar storage end-to-end (see the Validation section [2]):

* For univariate time series, OTel Arrow is 2 to 2.5 better in terms of bandwidth reduction ... and the end-to-end speed is 3.1 to 11.2 times faster

* For multivariate time series, OTel Arrow is 3 to 7 times better in terms of bandwidth reduction ... Phase 2 has [not yet] been .. estimated but similar results are expected.

* For logs, OTel Arrow is 1.6 to 2 times better in terms of bandwidth reduction ... and the end-to-end speed is 2.3 to 4.86 times faster

* For traces, OTel Arrow is 1.7 to 2.8 times better in terms of bandwidth reduction ... and the end-to-end speed is 3.37 to 6.16 times faster

Pretty exciting results! The OTEL-Arrow adapter has subsequently been donated to the otel community; here's a comment that does a good job of summarizing the results and the recommendations that came out of the test [3].

[0]: https://github.com/open-telemetry/opentelemetry-proto/pull/3...

[1]: https://github.com/open-telemetry/oteps/blob/main/text/0156-...

[2]: https://github.com/open-telemetry/oteps/blob/main/text/0156-...

[3]: https://github.com/open-telemetry/community/issues/1332#issu...


We're big users of clickhouse at https://highlight.io. Some more details here if you're interested: https://www.highlight.io/blog/how-we-built-logging-with-clic...


A lot of companies roll their own (like Honeycomb, and Kentik (network telemetry)). Clickhouse is a very good open option.


Very split opinions in the comments. Seems people either love it or hate it with nothing in between


Someone needs to build the Segment of APM. Datadog lock-in is real and the bill is high.


Has anyone found a solution for automatically adding spans for Node.js applications that doesn't involve littering startSpan or startActiveSpan everywhere?

I'm currently looking into monkey patching, but that seems dirty.


There isn't a way to automatically add spans for arbitrary functions in Node, no. There's currently a proposal to add some stuff that makes creating a span easier by making it auto-close based on scope.

FWIW the philosophy here is that observability is a part of the application rather than something separate. That's distinctly different from the APM philosophy, which is that a separate process "does the observability" and your app is "clean" from that. I think there's quite a benefit to manually instrumenting in your codebase intentionally rather than having an automated process do it for you. But I can understand not wanting to go through and do that.
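
If the goal is just to cut the boilerplate, one middle ground is a small wrapper so call sites stay one line. A rough TypeScript sketch (the `withSpan` helper name is made up):

    import { trace, Span, SpanStatusCode } from "@opentelemetry/api";

    const tracer = trace.getTracer("app");

    // Wrap an async function in a span that ends itself and records failures.
    export function withSpan<T>(name: string, fn: (span: Span) => Promise<T>): Promise<T> {
      return tracer.startActiveSpan(name, async (span) => {
        try {
          return await fn(span);
        } catch (err) {
          span.recordException(err as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw err;
        } finally {
          span.end();
        }
      });
    }

    // Usage: await withSpan("load-user", () => loadUser(id));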


That's an interesting distinction. Given some of the older, larger code bases, manually instrumenting seems like it would require a reasonably large effort. I think that's one of the huge benefits that some of the APM's bring.

I wonder if there is any reason OTEL can't have both.


OTel kind of has both today. Most languages support autoinstrumentation via either an agent, libraries, or both. So for example, if you've got a big Spring Boot app, it'll instrument requests/responses and DB calls for you. Some languages also have lightweight "sprinkle some spans on it" things you can put in code, like Java method annotations. But none have truly automatic "span for every method call" instrumentation.


Yeah, I've used the Ruby one, and the results were... painful.


For stuff on the JVM, you just load the APM plugin and you're pretty much good to go, as it inspects the bytecode, I believe.


Oh, right. I guess my question was more around "why can't OTEL _also_ have something that does that"?


Depends what you're trying to accomplish.

We're using middleware to start spans for HTTP and message handlers, and then adding `startSpan` where we need it.

I don't see a problem with `startSpan` everywhere, as it's not much noisier than the `log.info` that would be there instead if we didn't have OTel.


I think my major pain point with them being everywhere is that, unlike a log, they push a new stack frame and closure. They make tracebacks so much more annoying; yes, I know the error happened in a startSpan, thank you.

I wish observability actually observed more than it contributed. Once https://peps.python.org/pep-0669/ is available I'm gonna try my damndest to get otel working through it. Just give me a config file that says what functions you're interested in and I'll do the rest.


Agreed, the amount of stuff showing up in stack traces can be annoying - but it does depend on the language.

I agree 100% that in JavaScript/TypeScript it's annoying, and I would love to get rid of them. In Go, however, there isn't an extra stack frame. Nor in C#, thinking about it.

The config file of functions to trace is a really interesting idea. How would you handle wanting to add data to the spans from inside those functions though? e.g. I want to add all kinds of props to the spans that can't be known except inside the function executing.


I started my APM journey using Spring with Java and fell in love with how I could trace the entire flow through the whole application and then just configure how many spans I want to send over and what sample rate I want collected.

I'd love to accomplish that.

What you mentioned is all nice and well (the optimal route), but right now I'm working with some applications that need it but have I don't even know how many methods/classes that I'd need to go through and instrument.


Have you seen the OTEL auto instrumentation package[1]? It supports a number of common frameworks (as middleware) and libraries (via monkeypatches) setting up spans for you to capture things like external http requests, database/cache queries, etc. [2]

[1] https://opentelemetry.io/docs/instrumentation/js/automatic/ [2] https://github.com/open-telemetry/opentelemetry-js-contrib/t...
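
For reference, registering it is roughly this much code (a TypeScript sketch assuming `@opentelemetry/sdk-node`, `@opentelemetry/auto-instrumentations-node`, and an OTLP/HTTP exporter):

    import { NodeSDK } from "@opentelemetry/sdk-node";
    import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
    import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

    // Wires up the contrib instrumentations (http, express, pg, redis, winston, ...)
    // as middleware/monkeypatches, without touching application code.
    const sdk = new NodeSDK({
      traceExporter: new OTLPTraceExporter(), // defaults to a local OTLP endpoint on :4318
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();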


Tested those before, but they still don't get what I'm looking for, which is inside the black box without manually adding the spans. The implementations for bunyan or winston come the closest, but still don't achieve what I'm looking for.


I think the Ruby auto-instrumentation library does wrap practically every method but we ended up finding it too verbose. I think this is a bit of a goldilocks problem insofar as getting it just right is not easy and varies per application.


One of the things I think can be done is to also limit the number of spans collected via configuration.


Aye, but then you’re trading off between a centralized config that does sampling vs customized per-service (or whatever boundary) configs. There’s no free lunch, was mostly my point.


I'm new to telemetry. Can someone please explain the relationship between otel, Prometheus, Grafana agent?


OpenTelemetry is primarily definitions of protocols, APIs and semantic conventions for instrumentation data (traces, metrics and logs, in decreasing order of maturity). OpenTelemetry also ships the Collector, which is like a patch-bay accepting many different data formats and sending them on in more formats.

Grafana Agent is a bundling of a bunch of things for collecting instrumentation. The idea is to simplify deployment and allow opinionated setup. The Agent can upload to Grafana Cloud, or to any compatible backend (all the basic components are open source).

In particular it bundles Prometheus Agent, which does metrics collection from anything Prometheus-compatible but not queries, and OTel Collector.

It also bundles Promtail for logs.

(I work for Grafana Labs)


Prometheus: metrics collector/databases with a defined metrics format.

Many metrics/timeseries databases make an effort to be "Prometheus-compatible" as it was kind of the unofficial standard.

OTEL: new open standards which are supposed to provide vendor-independent formats which are compatible and composable between metrics, logs, and traces, and easily enable things like deriving metrics from traces (latency would be an obvious one here).

Grafana Agent: a metrics (and maybe also logs and traces?) collector that supports Prometheus, OTEL and other metrics formats, and which can forward, sample, and transform the data. Made by Grafana, open source, etc.

Grafana's metrics DB Mimir and maybe some others are essentially "more scalable Prometheus" and use the Prometheus metrics format on disk, so one large concern of the Grafana Agent would be converting OTEL metrics to the Prometheus metrics format for ingestion into the Grafana databases - but the agent has a whole bunch of other functions supported as well.

OTEL Collector - a non-vendor-specific collector, like Grafana Agent but largely just concerned with allowing collection/ingestion of OTEL and converting other formats into OTEL. Allows extensions and plugins to be added for other purposes.



Not only Prometheus FWIW. There's lots of providers that support it, e.g.:

- datadog
- honeycomb.io
- highlight.io (I'm a founder)
- sentry.io


I just want Segment but for APM to reduce vendor lock-in.


Non-paywalled URL: https://archive.is/NULpZ


like a virus, the complexity spreads to fill head count.


Like a reflex, comments blaming solutions for complex challenges are being emitted. Alas, no alternatives are ever offered. Shall we limit all technology to what can be contained in a single box? Should we just ban countries, companies and communities from growing past a 500k head count? Please, go on, do say what we shall do with the inherent complexity of the universe. Shall we stop physics at Newton and ban particle accelerators? De-legalize storage systems exceeding a petabyte? Please, let us hear how complexity can be put back in its old box.


Ugh, protobuf and gRPC as blessed (?!) transports? Thanks, I always wanted to bring a 100MB dependency into my code just to send metrics.


It's an 8MB DLL when compiled fully as DLL(s). Check out https://github.com/open-telemetry/opentelemetry-cpp as they've recently added DLL support, and I've been keeping a branch that's geared towards more simplicity on deploy (e.g. a single .dll instead of several). OpenTelemetry's C++ solution, though, is more flexible and used less size last I tried it. Here is mine - https://github.com/malkia/opentelemetry-cpp


Where are you getting 100MB from? You don't need the whole protoc toolchain. Well I've never worked with C or C++ and gRPC, but insofar as rust, go, and python are concerned, the increase in container/binary size is a few MB.


Yeah, but you have the same problem when using Prometheus and the native histograms? As far as I know, they are not supported by the text/OpenMetrics format, only in their protobuf-based format version.


If you're using Rust, Go, C, Java, or some other performant language then it shouldn't be much. If you're using a slower scripting language like Ruby or JavaScript then you might have issues.


Yeah, including the official OpenTelemetry parser for metrics just increases a sophisticated Go binary's size by 10MB - from 19MB to 29MB [1]

[1] https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2570...


Just a quick clarification: OTEL is not just a transport, it's a specification.



