By backfilling I assume you mean retroactively applying labels to a time series?...

chaps · on March 28, 2020

What I mean by that is, I have an exporter that (for example) might be reading a bunch of logs and exposing some inferred metrics through the exporter. Two days later, I realize that there's some useful information from those logs or another log file, or hell, another server, that wasn't inferred. I'd want to go back and add it for the past two days, so that in some monitoring analysis I'm running, both data points would be side by side. Prometheus doesn't allow for that, because opinions. Just a toy example, so please don't HN-pedanticize it :)

Not sure why restoring data after a crash isn't a bigger problem in your eyes, though. In the context of fully understanding system stability through Prometheus, that's a pretty giant gap in your post-mortem analysis without any way to correct it. It forces you into either using two tools to be able to do that sort of analysis, or accepting that you're always going to risk losing valuable data.

seattleeng · on March 28, 2020

Gotcha, it seems like you want to run backtests from within Prometheus for richer post-hoc analysis. That sounds like a normal use case. I often do fairly non-trivial post-hoc analysis in Prometheus (e.g. Z-scores and stddevs of metrics) since the query language is quite well suited to it, so I can see why you want to leverage it more.
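
For anyone following along, here is a minimal sketch of the kind of Z-score calculation mentioned above, written in plain Python rather than PromQL. The function name and the sample window are made up for illustration. It uses the population standard deviation, which is roughly what you would get in PromQL by combining avg_over_time and stddev_over_time over a range.

```python
from statistics import mean, pstdev


def z_scores(samples):
    """Return the Z-score of each sample relative to the whole window.

    Hypothetical helper for illustration only; uses the population
    standard deviation of the window.
    """
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        # A flat series has no spread, so every point sits at the mean.
        return [0.0 for _ in samples]
    return [(x - mu) / sigma for x in samples]


# Made-up window of metric samples (e.g. request latency in ms).
window = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scores = z_scores(window)
```

A point whose score is well above 2 or below -2 is a common (but arbitrary) cue for "look at this one" in a post-hoc review.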

Re: restoration, we use facilities like EBS snapshots, so it's separate from Prometheus. It could theoretically be an issue for post-mortem analysis, but data loss is relatively rare, so in practice the ability to run post-mortems has never been affected. Generally we don't consider our metrics as "business-critical" as production data (e.g. we wouldn't consider the above acceptable for MySQL), so that's where the leniency comes in. I can see different orgs weighing this differently -- we're not really an "infrastructure" company so I think the tradeoff is right for us, but if I was at a PaaS or IaaS I would feel differently.