
lossy and simpler.

IME, I've found sampling simpler to reason about, and with the sampling rate included as part of the message, deriving metrics from the sampled logs works pretty well.
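
A minimal sketch of how that can work, assuming a hypothetical event dict that carries its own sample_rate field (the 10% rate is an assumption): each event you keep counts for 1/sample_rate when you reconstruct a metric from the sampled logs.

    import random

    SAMPLE_RATE = 0.1  # keep roughly 10% of events; the rate itself is an assumption

    def maybe_log(event: dict) -> dict | None:
        """Emit the event with its sampling rate attached, or drop it."""
        if random.random() < SAMPLE_RATE:
            return {**event, "sample_rate": SAMPLE_RATE}
        return None

    def estimate_count(kept_events: list[dict]) -> float:
        """Reconstruct an approximate total count from the sampled events."""
        return sum(1.0 / e["sample_rate"] for e in kept_events)

    # 10,000 healthcheck events, sampled down, then re-estimated.
    kept = [e for e in (maybe_log({"msg": "healthcheck ok"}) for _ in range(10_000)) if e]
    print(f"kept {len(kept)} events, estimated total ~ {estimate_count(kept):.0f}")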

The example in the article is a little contrived. Healthchecks often originate from multiple hosts, and/or the logs contain the remote address+port, which makes each log message effectively unique. So sure, one could parse the remote address into remote_address=192.168.12.23 remote_port=64780 and then decide to drop the port in the aggregation, but is the juice worth the squeeze?
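
(For concreteness, the parse-and-drop step being described would look something like the sketch below; field names are taken from the example above, and whether it's worth doing is a separate question.)

    from collections import Counter

    def aggregation_key(line: str) -> str:
        """Parse key=value pairs and drop the high-cardinality port before grouping."""
        fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
        fields.pop("remote_port", None)
        return " ".join(f"{k}={v}" for k, v in sorted(fields.items()))

    lines = [
        "msg=healthcheck remote_address=192.168.12.23 remote_port=64780",
        "msg=healthcheck remote_address=192.168.12.23 remote_port=64781",
    ]
    # Both lines collapse into one group once the port is dropped.
    print(Counter(aggregation_key(l) for l in lines))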




If a service emits a log event, then that log event should be visible in your logging system. Basic stuff. Sampling fails this table-stakes requirement.


Typically, you store your most recent logs in full, and you can move to sampling for older logs (if you don't want to delete them outright).


It's reasonable to drop logs beyond some window of time -- a year, say -- but I'm not sure why you'd ever sample log events. Metric samples, maybe! Log data, no point.

But, in general, I think we agree -- all good!


> It's reasonable to drop logs beyond some window of time -- a year, say [...]

That's reasonable in a reasonable environment. Alas, I've worked in large legacy enterprises (banks etc.) where storage space is at much more of a premium, for reasons.

You are right that naive sampling works better for metrics.

For logs you can still sample, but in a saner way: instead of dropping each log line with an independent probability, you want correlation. E.g. for each log file, for each hour, flip only one weighted coin to decide whether to keep the whole thing.
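
A sketch of that correlated scheme, using a hash of the (file, hour) bucket so every line in the same bucket gets the same verdict; the keep probability below is an assumed weight:

    import hashlib
    from datetime import datetime

    KEEP_PROBABILITY = 0.25  # assumed weight for the per-bucket coin

    def keep_bucket(log_file: str, ts: datetime) -> bool:
        """One deterministic weighted 'coin flip' per (log file, hour) bucket."""
        bucket = f"{log_file}:{ts:%Y-%m-%dT%H}"
        digest = hashlib.sha256(bucket.encode()).digest()
        roll = int.from_bytes(digest[:8], "big") / 2**64  # map the hash onto [0, 1)
        return roll < KEEP_PROBABILITY

    # Every line from app.log between 12:00 and 13:00 gets the same decision.
    print(keep_bucket("app.log", datetime(2024, 11, 1, 12, 30)))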


Metrics can be sampled, because metrics by definition represent some statistically summarize-able set of observations of related events, whose properties are preserved even when those observations are aggregated over time. That aggregation doesn't result in loss of information. At worst, it results in loss of granularity.

That same key property is not true for logs. Logs cannot be aggregated without loss of information. By definition. This isn't up for debate.

You can collect logs into groups, based on similar properties, and you can decide that some of those groups are more or less important, based on some set of heuristics or decisions you can make in your system or platform or whatever. And you can decide that some of those groups can be dropped (sampled) according to some set of rules defined somewhere. But any and all of those decisions result in loss of information. You can say that that lost information isn't important or relevant, according to your own guidelines, and that's fine, but you're still losing information.


If you have finite storage, you need to do something. A very simple rule is to just drop everything older than a threshold. (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)

A slightly more complicated rule is: drop your logs with increasing probability as they age.

With any rule, you will lose information. Yes. Obviously. Sampling metrics also loses information.

The question is what's the best (or a good enough) trade-off for your use case.
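
One way to implement that age-based rule (the exponential decay, the 30-day half-life, and the keep floor below are my assumptions, not anything specified here):

    import random
    from datetime import datetime, timezone

    HALF_LIFE_DAYS = 30.0   # assumed: keep probability halves every 30 days of age
    MIN_KEEP = 0.01         # assumed floor so very old logs aren't guaranteed to vanish

    def keep_probability(log_ts: datetime, now: datetime) -> float:
        """Keep probability decays exponentially with the age of the log line."""
        age_days = (now - log_ts).total_seconds() / 86_400
        return max(MIN_KEEP, 0.5 ** (age_days / HALF_LIFE_DAYS))

    def should_keep(log_ts: datetime, now: datetime) -> bool:
        return random.random() < keep_probability(log_ts, now)

    now = datetime.now(timezone.utc)
    old = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(f"keep probability for a log from {old:%Y-%m-%d}: {keep_probability(old, now):.3f}")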


> If you have finite storage, you need to do something.

Sure.

> A very simple rule is to just drop everything older than a threshold.

Yep, makes sense.

> (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)

Same thing, it seems like.

> A slightly more complicated rule is: drop your logs with increasing probability as they age.

This doesn't really make sense to me. Logs aren't probabilistic. If something happened on 2024-11-01T12:01:01Z, and I have logs for that timestamp, then I should be able to see that thing which happened at that time.

> Sampling metrics also loses information.

I mean, literally not, right? Metrics are explicitly the pillar of observability which can aggregate over time without loss of information. You never need to sample metric data. You just aggregate at whatever layers exist in your system. You can roll up metric data from whatever input granularity D1 to whatever output granularity, e.g. 10·D1, and that "loses" information in some sense, I guess, but the information isn't really lost, it's just made more coarse, or less specific. It's not in any way the same as sampling of e.g. log data, which literally deletes information. Right?
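
To make the roll-up point concrete for counter-style metrics (the 10:1 ratio is just the example ratio above), merging fine-grained buckets preserves the totals exactly:

    def roll_up(counts: list[int], factor: int = 10) -> list[int]:
        """Merge every `factor` fine-grained counter buckets into one coarse bucket."""
        return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]

    per_second = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3] * 6   # 60 one-second buckets
    per_ten_seconds = roll_up(per_second, factor=10)  # 6 ten-second buckets
    assert sum(per_second) == sum(per_ten_seconds)    # the totals are preserved exactly
    print(per_ten_seconds)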



