Hmm, as someone who uses and de facto manages Datadog for a sizable org in their day job, I'm not sure that OpenTelemetry PR fiasco paints an up-to-date picture.
There was definitely a period where Datadog seemed to be talking up OTel but not actually doing anything of note to support that ecosystem.
I'd say in the last year or two, they've done a bit of a 180 and been embracing it quite a lot.
One major change is that they not only added support for the W3C Trace Context header format but actually set it as the default over their own proprietary format.
The reason that's a pretty big deal is that W3C Trace Context is a MUST for OTel clients to implement, so it goes a long way to making interoperability (and migration) pretty painless.
Prior to that, you could use OTel but the actual distributed aspect (spans between services linking up) probably wouldn't work as OTel services wouldn't recognise the Datadog header format and vice versa.
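To make the interop point concrete: Datadog's proprietary propagation uses headers like `x-datadog-trace-id`, while the W3C format is a single `traceparent` header with a fixed `version-traceid-spanid-flags` shape that any compliant client can read. Here's a minimal sketch of parsing one (the helper name is made up, not from any SDK):

```rust
// Sketch: parsing a W3C `traceparent` header value.
// Format: {version(2 hex)}-{trace_id(32 hex)}-{parent_id(16 hex)}-{flags(2 hex)}
// `parse_traceparent` is a hypothetical helper, not part of any real SDK.
fn parse_traceparent(header: &str) -> Option<(String, String, String, String)> {
    let parts: Vec<&str> = header.split('-').collect();
    // Reject anything that doesn't match the four fixed-width fields.
    if parts.len() != 4 || parts[1].len() != 32 || parts[2].len() != 16 {
        return None;
    }
    Some((
        parts[0].to_string(), // version
        parts[1].to_string(), // trace id
        parts[2].to_string(), // parent span id
        parts[3].to_string(), // trace flags
    ))
}

fn main() {
    let h = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
    let (_, trace_id, parent_id, _) = parse_traceparent(h).unwrap();
    println!("trace_id={} parent_id={}", trace_id, parent_id);
}
```

Because both sides agree on that one header shape, an OTel-instrumented service and a Datadog-instrumented service can link their spans into the same trace.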
There are, of course, still some features you'd miss out on by using OTel over the Datadog SDKs (I don't believe the profiler would necessarily work, for example), but that's a tradeoff to be made.
Over the last decade, at every company I've worked for, the biggest issue by far was vendor lock-in: teams coping with tools that didn't evolve with their problems over time, while the cost of switching stayed high.
I know that OpenTelemetry has its own issues, but between low ergonomics and amenities on one side and lock-in on the other, these days I'll choose the first.
> What's been your biggest issues around ergonomics/amenities for OpenTelemetry?
I can't speak generally, but in the Rust ecosystem the various crates don't play well together. Here's one example: <https://github.com/tokio-rs/tracing/issues/2648> There are four crates involved (tracing-attributes, tracing-opentelemetry, opentelemetry, and either of opentelemetry-{datadog,otlp}) and none of them fit properly into any of the others.
At least I don't have to use all these things. `tracing` itself is essentially mandatory if I want to pick up all the spans and events created by the crates I depend on. But I can (and maybe will...) write my own `tracing::Subscriber`, sidestepping all the various bugs and incompatibilities of `tracing-subscriber` and `opentelemetry{,-otlp,-datadog}`.
I think it's a huge shame, as Rust is one of the few languages in which tracing is effectively native, and it's shocking that the de facto standard tracing transport doesn't interop well.
Admittedly, I think part of it is just where Rust is on the language adoption curve and where Otel is in its project lifecycle as well. Writing your own subscriber might not be a terrible idea, we internally end up needing to do similar things as we find limitations/bugs that we can't wait for upstream to fix (but we're a vendor and that makes more sense for us than end users!)
Yes. I love Rust, but its observability story is terrible right now. Not just tracing; I was complaining in another thread recently that the async ecosystem doesn't have a production-ready equivalent of thread dumps. <https://news.ycombinator.com/item?id=37792011> I really want to see the whole picture improve (and as I'm able, participate in improving it).
I'm considering writing a tracing subscriber that dumps events, span starts/stops, and span field updates to a terse local log file format. This is a superset of what OpenTelemetry offers. (OpenTelemetry only has the concept of a completed span, which I find really unfortunate.) So I'd write a tool that takes that and pushes it to OpenTelemetry (otelcol-contrib plugin maybe) and more local-focused tools like `logcat`.
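Something like the following could work as that terse local format: one line per record, with distinct record kinds for span start, field update, event, and span close. Everything here is a hypothetical sketch of my own design, not an existing format:

```rust
// Sketch of a terse line-based record format for a local trace log.
// Record kinds: S = span start, U = span field update, E = event, C = span close.
// All names and layout are hypothetical, not any shipped format.
enum Record<'a> {
    SpanStart { id: u64, name: &'a str },
    FieldUpdate { id: u64, key: &'a str, value: &'a str },
    Event { span: u64, msg: &'a str },
    SpanClose { id: u64 },
}

fn encode(r: &Record) -> String {
    match r {
        Record::SpanStart { id, name } => format!("S {} {}", id, name),
        Record::FieldUpdate { id, key, value } => format!("U {} {}={}", id, key, value),
        Record::Event { span, msg } => format!("E {} {}", span, msg),
        Record::SpanClose { id } => format!("C {}", id),
    }
}

fn main() {
    // A subscriber would append these lines to a local file as they happen,
    // so the log captures span starts even if the close never arrives.
    let records = [
        Record::SpanStart { id: 1, name: "handle_request" },
        Record::FieldUpdate { id: 1, key: "user", value: "alice" },
        Record::Event { span: 1, msg: "cache_miss" },
        Record::SpanClose { id: 1 },
    ];
    for r in &records {
        println!("{}", encode(r));
    }
}
```

Because each record is written at the moment it happens, the file is a superset of what a completed-spans-only model can express; a converter can still fold start/close pairs into finished spans for OpenTelemetry export.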
Oh huh! What would you do with span start independent of stops?
A tangent on logcat: local observability to me is a really intriguing area. I think there's a story for Otel locally as well, if someone can build a good enough local DX for consuming these signals (we've been told about this a number of times; https://github.com/hyperdxio/hyperdx/issues/7 is one example).
> Oh huh! What would you do with span start independent of stops?
* When browsing locally, I'd be able to see spans that never closed (because they were super long-running and/or because the process crashed mid-span). I suppose for the latter case, the otel collector could upload them (marking them as incomplete somehow) when it knows the process shut down.
* When looking at events(/logs), I'd be able to see the current state of all the enclosing spans, including their fields. Easiest to do locally, but ideally also in otel. Maybe some mechanism for automatically copying select attributes from the span to the event for use in otel (details tbd, whether it's selected in code at span creation time, tweaked in the otel collector config, or what).
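The first bullet is easy to sketch: once start records exist independently of closes, a local viewer just replays the log and reports whatever never closed. A minimal, hypothetical version of that bookkeeping:

```rust
// Sketch: find spans that started but never closed, by replaying a log of
// ("start" | "close", span_id) records. Hypothetical, minimal bookkeeping.
use std::collections::HashSet;

fn open_spans(log: &[(&str, u64)]) -> Vec<u64> {
    let mut open = HashSet::new();
    for (kind, id) in log {
        match *kind {
            "start" => { open.insert(*id); }
            "close" => { open.remove(id); }
            _ => {} // ignore other record kinds
        }
    }
    let mut ids: Vec<u64> = open.into_iter().collect();
    ids.sort();
    ids
}

fn main() {
    // Span 2 never closed -- e.g. it's long-running, or the process crashed mid-span.
    let log: [(&str, u64); 3] = [("start", 1), ("start", 2), ("close", 1)];
    println!("{:?}", open_spans(&log)); // [2]
}
```

The same replay gives the second bullet too: at any event, the set of currently open spans (and their last-known fields) is exactly the enclosing context.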
This was and still is one of the biggest problems orgs have without even knowing they have it... until it hits.
I think architects should give this thought from a very early stage of a company, and often don't.
This post speaks to a larger issue: cloud vendors are driven to extract as much money from you, the customer, as possible. They are not evil or malicious; they are commercial enterprises, and cloud is built on consumption economics. There is no incentive to make it easy for you to move to another cloud. Replace 'observability space' with SIEM/SOAR, first-party databases (Spanner/CosmosDB), or many PaaS offerings, and the themes still apply. Pushing proprietary solutions to you is an effective means of making customers stick. I am not passing judgement here; there is some value to turnkey solutions, but it depends on your business. Datadog in particular is a bit insidious, as they have a multi-cloud proprietary service that can follow your workloads across clouds (even so far as to be essentially first party on Azure via Azure Native ISV Services).
> 1. You start with Datadog [...]
2. [...] your expenses have skyrocketed
3. You want to migrate your way out[...]
Step 3 doesn't seem logical here... Who doesn't try to control spend on something they've already invested in?
Step 3 should be to audit what you are spending the most on, and figure out how to manage that in a way that's still useful.
I've seen so many people not understand things like custom metrics billing, or log/trace retention and get burned for it.
If you're using a tool like Datadog, you should understand its billing structure a bit. I couldn't imagine setting up a Redshift instance without understanding and tracking Redshift spending, and then, once spend is high, just switching to RDS or something without even taking a look.
As a long term DataDog customer… no it’s beyond that.
DataDog has an extractive pricing model. It's undergone a number of changes over the years that split out features previously included in the host price into their own products with separate monthly costs. They put limits on the number of unique containers you can have per host before charging container fees, so you'd better hope you don't have too many crash-restart loops. There's the per-unique-host-per-billing-period pricing model that makes it impossible to scale your own infrastructure up and down to save on hardware costs; in one personal case I can attest to, the changes would have halved the AWS bill but increased the DataDog bill by more than 10 times, a thousand percent... and you would think surely you can talk to the sales team if your use case fits extremely poorly with the pricing model... Hahaha, no... unless you're a whale they give zero fucks about giving you any flexibility on price.
> unless you're a whale they give zero fucks about giving you any flexibility on price.
Even the flexible pricing they offer ends up being a sham. It just seems better on paper, but you end up paying nearly the same, because they have a really complicated billing model where they give you free stuff with Infra hosts. Once you switch away from that model you stop getting those freebies, so your Infra hosts might cost less now, but everything else is more expensive. The house always wins!
The vendor lock-in I've seen isn't on the collection side; it's on the display and interpretation side, specifically the dashboard. Vendors offering tools to really slice and dice the collected telemetry, and display it in visualizations that wouldn't make Edward Tufte pull his hair out, can make choosing vendor lock-in a serious option.
Tufte would also want visualizations to be meaningful in addition to factual and clear. If a vendor sells you a proprietary visualization tool, they are also likely to include proprietary visualizations and analytics. For example, in addition to providing a Tufte-satisfying view of requests/second, the company might include a visualization for the made-up proprietary metric "un-monetized requests/second", with the company-proven analysis showing why, if un-monetized requests are greater than some % of total requests, that's a problem that needs solving. Oh! And wonder of wonders, the company happens to have just the solution for the problem. At a small additional licensing fee, of course.
It's a bullshit meaningless metric, of course, and wouldn't pass the Tufte test, but the company gets a profit center.
We (the bacalhau.org[0] project) are interested in helping with this - one of our philosophies has been that part of the problem is with that first step. By first moving to a lake of some kind, you end up giving up lots of optionality. Even basic things like aggregation, filtering, windowing, etc now need to be in the "locked-in" tool, which is exactly the wrong first step to take.
SHOW HN: We have a solution that uses DuckDB to do some of this initial work[1], which can save you 70% or more on total data throughput. Further, it allows you to do interesting things like eventing, multi-homing observability data, etc.
I've bounced around Splunk, New Relic, Sentry and Datadog over the years. Most recently, I was working with Java and used the open source, vendor-neutral application observability facade Micrometer[1] to test out and confirm which APM we wanted to go with.
You should also check out SigNoz [1]. It's an open source observability platform with metrics, traces and logs in a single application & based natively on opentelemetry.
One of the reasons many people use SigNoz is to avoid the vendor lock-in which comes with adding proprietary SDKs of closed source products like DataDog and New Relic in their code.
If anyone is starting their observability journey today, I think OpenTelemetry is a very good place to start. You can instrument with Otel SDKs and choose a visualization layer/backend which suits your needs best.
Datadog allows you to export all of this, but I don't see how that's all that useful. You can't really port Datadog dashboards to, let's say, Grafana easily. The query languages they use don't have a 1:1 mapping, and the way dashboards are organized and the different visualization tools you get aren't the same either.
They surely don't help with the "migration" (application-layer) part, but they do help you manage things with code, which is better than nothing (you can do some heavy lifting to make it work in another provider afterwards).
That's the reason Terraform is not really a solution for what I talk about in the blog post.
Does Keep have an open core business model, where they host your stuff using a proprietary control plane for a fee and introduce proprietary features around the edges?