Hmm, as someone who uses and de facto manages Datadog for a sizable org in their day job, I'm not sure that OpenTelemetry PR fiasco paints an up-to-date picture.
There was definitely a period where Datadog seemed to be talking up OTel but not actually doing anything of note to support that ecosystem.
I'd say in the last year or two, they've done a bit of a 180 and been embracing it quite a lot.
One major change is that they not only added support for the W3C Trace Context header format but actually set it as the default over their own proprietary format.
The reason that's a pretty big deal is that W3C Trace Context is a MUST for OTel clients to implement, so it goes a long way to making interoperability (and migration) pretty painless.
Prior to that, you could use OTel but the actual distributed aspect (spans between services linking up) probably wouldn't work as OTel services wouldn't recognise the Datadog header format and vice versa.
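To make the interop point concrete: Datadog's proprietary propagation uses headers like `x-datadog-trace-id`, while the W3C format is a single `traceparent` header with a fixed `version-traceid-spanid-flags` shape that any compliant client can read. Here's a minimal sketch of parsing one (the helper name is made up, not from any SDK):

```rust
// Sketch: parsing a W3C `traceparent` header value.
// Format: {version(2 hex)}-{trace_id(32 hex)}-{parent_id(16 hex)}-{flags(2 hex)}
// `parse_traceparent` is a hypothetical helper, not part of any real SDK.
fn parse_traceparent(header: &str) -> Option<(String, String, String, String)> {
    let parts: Vec<&str> = header.split('-').collect();
    // Reject anything that doesn't match the four fixed-width fields.
    if parts.len() != 4 || parts[1].len() != 32 || parts[2].len() != 16 {
        return None;
    }
    Some((
        parts[0].to_string(), // version
        parts[1].to_string(), // trace id
        parts[2].to_string(), // parent span id
        parts[3].to_string(), // trace flags
    ))
}

fn main() {
    let h = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
    let (_, trace_id, parent_id, _) = parse_traceparent(h).unwrap();
    println!("trace_id={} parent_id={}", trace_id, parent_id);
}
```

Because both sides agree on that one header shape, an OTel-instrumented service and a Datadog-instrumented service can link their spans into the same trace.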
There are, of course, still some features you'd miss out on by using OTel over the Datadog SDKs (I don't believe the profiler would necessarily work, for example), but that's a tradeoff to be made.
Over the last decade, at every company I've worked for, the biggest issue by far was vendor lock-in: teams coping with tools that didn't evolve with their problems over time, while the cost of switching stayed high.
I know that OpenTelemetry has its own issues, but between low ergonomics and amenities on one side and lock-in on the other, these days I'll choose the first.
> What's been your biggest issues around ergonomics/amenities for OpenTelemetry?
I can't speak generally, but in the Rust ecosystem the various crates don't play well together. Here's one example: <https://github.com/tokio-rs/tracing/issues/2648> There are four crates involved (tracing-attributes, tracing-opentelemetry, opentelemetry, and either of opentelemetry-{datadog,otlp}) and none of them fit properly into any of the others.
At least I don't have to use all these things. `tracing` itself is essentially mandatory if I want to pick up all the spans and events created by the crates I depend on. But I can (and maybe will...) write my own `tracing::Subscriber`, sidestepping all the various bugs and incompatibilities of `tracing-subscriber` and `opentelemetry{,-otlp,-datadog}`.
I think it's a huge shame, as Rust is one of the few languages in which tracing is effectively native, and it's shocking that the de facto standard tracing transport doesn't interop well.
Admittedly, I think part of it is just where Rust is on the language adoption curve and where Otel is in its project lifecycle as well. Writing your own subscriber might not be a terrible idea, we internally end up needing to do similar things as we find limitations/bugs that we can't wait for upstream to fix (but we're a vendor and that makes more sense for us than end users!)
Yes. I love Rust, but its observability story is terrible right now. Not just tracing; I was complaining in another thread recently that the async ecosystem doesn't have a production-ready equivalent of thread dumps. <https://news.ycombinator.com/item?id=37792011> I really want to see the whole picture improve (and as I'm able, participate in improving it).
I'm considering writing a tracing subscriber that dumps events, span starts/stops, and span field updates to a terse local log file format. This is a superset of what OpenTelemetry offers. (OpenTelemetry only has the concept of a completed span, which I find really unfortunate.) So I'd write a tool that takes that and pushes it to OpenTelemetry (otelcol-contrib plugin maybe) and more local-focused tools like `logcat`.
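Something like the following could work as that terse local format: one line per record, with distinct record kinds for span start, field update, event, and span close. Everything here is a hypothetical sketch of my own design, not an existing format:

```rust
// Sketch of a terse line-based record format for a local trace log.
// Record kinds: S = span start, U = span field update, E = event, C = span close.
// All names and layout are hypothetical, not any shipped format.
enum Record<'a> {
    SpanStart { id: u64, name: &'a str },
    FieldUpdate { id: u64, key: &'a str, value: &'a str },
    Event { span: u64, msg: &'a str },
    SpanClose { id: u64 },
}

fn encode(r: &Record) -> String {
    match r {
        Record::SpanStart { id, name } => format!("S {} {}", id, name),
        Record::FieldUpdate { id, key, value } => format!("U {} {}={}", id, key, value),
        Record::Event { span, msg } => format!("E {} {}", span, msg),
        Record::SpanClose { id } => format!("C {}", id),
    }
}

fn main() {
    // A subscriber would append these lines to a local file as they happen,
    // so the log captures span starts even if the close never arrives.
    let records = [
        Record::SpanStart { id: 1, name: "handle_request" },
        Record::FieldUpdate { id: 1, key: "user", value: "alice" },
        Record::Event { span: 1, msg: "cache_miss" },
        Record::SpanClose { id: 1 },
    ];
    for r in &records {
        println!("{}", encode(r));
    }
}
```

Because each record is written at the moment it happens, the file is a superset of what a completed-spans-only model can express; a converter can still fold start/close pairs into finished spans for OpenTelemetry export.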
Oh huh! What would you do with span start independent of stops?
A tangent on logcat: local observability to me is a really intriguing area. I think there's a story for Otel locally as well, if someone can build a good enough local DX for consuming these signals (we've been told about this a number of times; https://github.com/hyperdxio/hyperdx/issues/7 is one example).
> Oh huh! What would you do with span start independent of stops?
* When browsing locally, I'd be able to see spans that never closed (because they were super long-running and/or because the process crashed mid-span). I suppose for the latter case, the otel collector could upload them (marking them as incomplete somehow) when it knows the process shut down.
* When looking at events(/logs), I'd be able to see the current state of all the enclosing spans, including their fields. Easiest to do locally, but ideally also in otel. Maybe some mechanism for automatically copying select attributes from the span to the event for use in otel (details tbd, whether it's selected in code at span creation time, tweaked in the otel collector config, or what).
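The first bullet is easy to sketch: once start records exist independently of closes, a local viewer just replays the log and reports whatever never closed. A minimal, hypothetical version of that bookkeeping:

```rust
// Sketch: find spans that started but never closed, by replaying a log of
// ("start" | "close", span_id) records. Hypothetical, minimal bookkeeping.
use std::collections::HashSet;

fn open_spans(log: &[(&str, u64)]) -> Vec<u64> {
    let mut open = HashSet::new();
    for (kind, id) in log {
        match *kind {
            "start" => { open.insert(*id); }
            "close" => { open.remove(id); }
            _ => {} // ignore other record kinds
        }
    }
    let mut ids: Vec<u64> = open.into_iter().collect();
    ids.sort();
    ids
}

fn main() {
    // Span 2 never closed -- e.g. it's long-running, or the process crashed mid-span.
    let log: [(&str, u64); 3] = [("start", 1), ("start", 2), ("close", 1)];
    println!("{:?}", open_spans(&log)); // [2]
}
```

The same replay gives the second bullet too: at any event, the set of currently open spans (and their last-known fields) is exactly the enclosing context.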
This was and still is one of the biggest problems orgs have without even knowing they have it... until it hits.
I think architects should give this thought from a very early stage of a company, and often don't.
This post speaks to a larger issue: cloud vendors are driven to extract as much money from you, the customer, as possible. They are not evil or malicious; they are commercial enterprises, and cloud is built on consumption economics. There is no incentive to make it easy for you to move to another cloud. Replace 'observability space' with SIEM/SOAR, first-party databases (Spanner/CosmosDB), or many PaaS offerings, and the themes still apply. Pushing proprietary solutions to you is an effective means of making customers stick. I am not passing judgement here; there is some value to turnkey solutions, but it depends on your business. Datadog in particular is a bit insidious, as they have a multi-cloud proprietary service that can follow your workloads across clouds (even so far as to be essentially first party on Azure via Azure Native ISV Services).
> 1. You start with Datadog [...]
2. [...] your expenses have skyrocketed
3. You want to migrate your way out[...]
Step 3 doesn't seem logical here... Who doesn't try to control spend on something they've already invested in?
Step 3 should be to audit what you are spending the most on, and figure out how to manage that in a way that's still useful.
I've seen so many people not understand things like custom metrics billing, or log/trace retention and get burned for it.
If you're using a tool like Datadog, you should understand its billing structure a bit. I couldn't imagine setting up a Redshift instance without understanding and tracking Redshift spending, and then, once spend is high, just switching to RDS or something without even taking a look.
As a long term DataDog customer… no it’s beyond that.
DataDog has an extractive pricing model. It's undergone a number of changes over the years that split out features previously included in the host price into their own products with separate monthly costs. They put limits on the number of unique containers you can have per host before charging container fees, so you'd better hope you don't have too many crash-restart loops. There's the per-unique-host-per-billing-period pricing model that makes it impossible to scale your own infrastructure up and down to save on hardware costs; in one personal case I can attest to, the changes would have halved the AWS bill but increased the DataDog bill by more than 10 times, a thousand percent... and you would think surely you can talk to the sales team if your use case fits extremely poorly with the pricing model... Hahaha, no... unless you're a whale they give zero fucks about giving you any flexibility on price.
> unless you're a whale they give zero fucks about giving you any flexibility on price.
Even the flexible pricing they offer ends up being a sham. It just seems better on paper, but you end up paying nearly the same, because they have a really complicated billing model where they give you free stuff with Infra hosts. Once you switch away from that model you stop getting those freebies, so your Infra hosts might cost less now, but everything else is more expensive. The house always wins!
The vendor lock-in I've seen isn't on the collection side; it's on the display and interpretation side, specifically the dashboard. Vendors offering tools to really slice and dice the collected telemetry, and display it in visualizations that wouldn't make Edward Tufte pull his hair out, can make choosing vendor lock-in a serious option.
Tufte would also want visualizations to be meaningful in addition to factual and clear. If a vendor sells you a proprietary visualization tool, they are also likely to include proprietary visualizations and analytics. For example, in addition to providing a Tufte-satisfying view of requests/second, the company might include a visualization for the made-up proprietary metric "un-monetized requests/second", with the company-proven analysis showing why, if un-monetized requests are greater than some % of total requests, that's a problem that needs solving. Oh! And wonder of wonders, the company happens to have just the solution for the problem. At a small additional licensing fee, of course.
It's a bullshit meaningless metric, of course, and wouldn't pass the Tufte test, but the company gets a profit center.
We (the bacalhau.org[0] project) are interested in helping with this - one of our philosophies has been that part of the problem is with that first step. By first moving to a lake of some kind, you end up giving up lots of optionality. Even basic things like aggregation, filtering, windowing, etc now need to be in the "locked-in" tool, which is exactly the wrong first step to take.
SHOW HN: We have a solution that uses DuckDB to do some of this initial work[1], which can save you 70% or more on total data throughput. Further, it allows you to do interesting things like eventing, multi-homing observability data, etc.
I've bounced around Splunk, New Relic, Sentry and Datadog over the years. Most recently, I was working with Java and used the open source, vendor-neutral application observability facade Micrometer[1] to test out and confirm which APM we wanted to go with.
You should also check out SigNoz [1]. It's an open source observability platform with metrics, traces and logs in a single application & based natively on opentelemetry.
One of the reasons many people use SigNoz is to avoid the vendor lock-in which comes with adding proprietary SDKs of closed source products like DataDog and New Relic in their code.
If anyone is starting their observability journey today, I think OpenTelemetry is a very good place to start. You can instrument with Otel SDKs and choose a visualization layer/backend which suits your needs best.
Datadog allows you to export all of this, but I don't see how that's all that useful. You can't really port Datadog dashboards to, let's say, Grafana easily. The query languages they use don't have a 1:1 mapping, and the way dashboards are organized and the different visualization tools you get aren't the same either.
They surely don't help with the "migration" (application-layer) part, but they do help you manage things with code, which is better than nothing (you can do some heavy lifting to make it work in another provider afterwards).
That's the reason Terraform is not really a solution for what I talk about in the blog post.
Does Keep have an open core business model, where they host your stuff using a proprietary control plane for a fee and introduce proprietary features around the edges?