
Sampling is fine at the query layer, but if you sample at the ingest layer -- and therefore drop a majority of your telemetry data outright -- that's a total hack, and I can't see how the resulting system is anything but unsound.


You use sampling rates that are high enough to be statistically significant for the system you're observing. You can always make exceptions to the default rate based on other heuristics. What's wrong with that, for the kinds of insights it provides?
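For example, a minimal sketch of what I mean by "exceptions to the default rate" (names and thresholds made up, not any particular vendor's API): keep a small default fraction of ordinary traffic at ingest, but always keep events that trip a heuristic like errors or slow requests.

  package main

  import (
      "fmt"
      "math/rand"
      "time"
  )

  // Event is a stand-in for whatever unit you sample on
  // (a request, a trace root, etc.).
  type Event struct {
      ReqID   string
      Status  int
      Latency time.Duration
  }

  // shouldIngest keeps defaultRate of ordinary traffic, but always keeps
  // events that match a heuristic, regardless of the default rate.
  func shouldIngest(e Event, defaultRate float64) bool {
      if e.Status >= 500 {
          return true // always keep server errors
      }
      if e.Latency > 2*time.Second {
          return true // always keep latency outliers
      }
      return rand.Float64() < defaultRate
  }

  func main() {
      e := Event{ReqID: "123", Status: 200, Latency: 40 * time.Millisecond}
      fmt.Println(shouldIngest(e, 0.01)) // ~1% chance for a healthy, fast request
  }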


General-purpose observability systems serve two use cases: presenting a high-level summary of system behavior, and allowing operators to inspect the telemetry data associated with a specific entity, e.g. a single request.

The former use case is usually served by a specific and narrow kind of observability data: metrics. A common tool for that purpose is Prometheus. You certainly can't query Prometheus for individual requests, which is fine: accepting that invariant allows Prometheus to treat input data as statistical, in the sense you mean in your comment.
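To make that concrete, here's a rough sketch using the Go client_golang library (the metric, route, and port are made up): requests are counted by a small, fixed set of labels, and the identity of any individual request is gone by design.

  package main

  import (
      "net/http"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  // Requests are counted by route and status only. There is deliberately no
  // per-request label (like a request id), because each unique label value
  // creates a new time series -- metrics summarize, they don't enumerate.
  var httpRequests = prometheus.NewCounterVec(
      prometheus.CounterOpts{
          Name: "http_requests_total",
          Help: "HTTP requests by route and status.",
      },
      []string{"route", "status"},
  )

  func main() {
      prometheus.MustRegister(httpRequests)

      http.Handle("/checkout", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
          httpRequests.WithLabelValues("/checkout", "200").Inc()
          w.WriteHeader(http.StatusOK)
      }))
      http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus

      http.ListenAndServe(":8080", nil)
  }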

But if we're talking about general-purpose telemetry, we're talking about more than just high-level summaries of system behavior: we also need to be able to inspect individual log events, trace spans, etc. If a user made a request an hour ago with reqid 123, I expect to be able to query my telemetry system for reqid 123 and see all of the metadata related to that request.

A telemetry system that samples prior to ingest certainly delivers value, but it can only ever solve the first use case, and never the second.



