Did you raise the challenges you had with Prometheus to the dev team and community? They are usually quite responsive and open to contributions. I'd be interested in the discussion around the reasons you brought up for forking.
Not OP, but one of the biggest issues that I've had with Prometheus is that it doesn't support backfilling of time series data. Bringing it up to the Prometheus devs led to oddly indignant responses that they had no intention of supporting backfilling, because that's not what Prometheus was designed for. In large production systems, which is exactly what Prometheus is meant for, it was a very disappointing missing feature, and I could absolutely see why someone would want to fork, if nothing else just to get away from that level of indignation.
By backfilling I assume you mean retroactively applying labels to a time series? Can you link to the thread for more context if you have it around? As a Prometheus user myself at a 100+ engineer org, I've never had a use case for a flow that would require editing old time series, apart from having to restore data after a crash.
What I mean by that is, I have an exporter that (for example) might be reading a bunch of logs and exposing some inferred metrics through the exporter. Two days later, I realize that there's some useful information from those logs or another log file, or hell, another server, that wasn't inferred. I'd want to backfill the metrics inferred from that, so that in some monitoring analysis I'm running, both data points would be side by side. Prometheus doesn't allow for that, because opinions. Just a toy example, so please don't HN-pedanticize it :)
Not sure why restoring data after a crash isn't a bigger problem in your eyes, though. In the context of fully understanding system stability through Prometheus, that's a pretty giant gap in your post mortem analysis without any way to correct it. It forces you into either using two tools to be able to do that sort of analysis, or accepting that you're always going to risk losing valuable data.
Gotcha, it seems like you want to run backtests from within Prometheus for richer post-hoc analysis. That sounds like a normal use case. I often do fairly non-trivial post-hoc analysis in Prometheus (e.g. z-scores and stddevs of metrics) since the query language is quite well suited to it, so I can see why you'd want to leverage it more.
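For example, the z-score version of this is a single PromQL expression, and easy to script against the HTTP API. A rough sketch in Go (node_load1 is just a stand-in metric, and it assumes a Prometheus server on localhost:9090; not my actual setup, just the shape of it):

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/url"
    )

    func main() {
        // z-score of the current value vs. its own last hour of history.
        // node_load1 is just a stand-in; use whatever metric you care about.
        query := `(node_load1 - avg_over_time(node_load1[1h]))
                   / stddev_over_time(node_load1[1h])`

        // Assumes a Prometheus server on the default local port.
        resp, err := http.Get(
            "http://localhost:9090/api/v1/query?query=" + url.QueryEscape(query))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body)) // JSON instant vector of z-scores per series
    }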
Re: restoration, we use facilities like EBS snapshots, so it's separated from Prometheus. It could theoretically be an issue for post mortem analysis, but data loss is relatively rare, so in practice the ability to run post mortems has never been affected. Generally we don't consider our metrics as "business-critical" as production data (e.g. we wouldn't consider the above acceptable for MySQL), so that's where the leniency comes in. I can see different orgs weighing this differently -- we're not really an "infrastructure" company, so I think the tradeoff is right for us, but if I was at a PaaS or IaaS I would feel differently.
I don't know if the Prometheus devs care about it being enterprise-level monitoring software, but not having any kind of backfilling on the roadmap pretty much disqualifies it. Accurate reporting is not a nice-to-have for enterprise-grade monitoring.
FWIW, the lead dev's own description of his "up and running" book pretty much implies that they do, in fact, care about its use in large production systems:
"Get up to speed with Prometheus, the metrics-based monitoring system used by tens of thousands of organizations in production. This practical guide provides application developers, sysadmins, and DevOps practitioners with a hands-on introduction to the most important aspects of Prometheus, including dashboarding and alerting, direct code instrumentation, and metric collection from third-party systems with exporters."
Curious to know why you think accurate reporting and backfilling aren't related. In my experience, they absolutely are -- mostly during disaster scenarios, where that information is more critical than at any other time.
> accurate reporting and backfill are not related.
I strongly disagree. If you are unable to backfill data, then report accuracy will be affected in multiple scenarios, including migrations from other products, unplanned downtime, etc. Not having backfill is accepting that reports will suffer as soon as historical data is inaccurate, and basically saying that you don't care.
Take a breather. Pretty sure the confusion here is in the overloading and different contexts of "reporting". One kind of reporting is post-hoc, and the other is "real time" reporting for outages and such. There are probably other ways to define it, but let's try to figure out where the confusion is first. How are you defining "reporting" to where it's orthogonal to the ability to backfill?
We've had a great experience with prometheus - both the community and the project. Building what we did would have been immeasurably harder without them. That said, this has been a rapid process for us. We are starting to engage and see what parts of our fork they would be interested in taking upstream. (In particular, our storage needs are a bit different, so we'll see how that fits.)
Yet another Prometheus/time-series backend project.
And yet again, it would be far better to just export the data from Prometheus into a distributed columnstore data warehouse like Clickhouse, MemSQL, Vertica (or other alternatives). This gives you fast SQL analysis across massive datasets, real-time updates regardless of ordering, and unlimited metadata, cardinality and overall flexibility.
Prometheus is good at scraping and metrics collection, but it's a terrible storage system.
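(If you go that route, the export side is mostly just flattening samples into wide (timestamp, series, value) rows that the warehouse's bulk loader can ingest. A dependency-free sketch in Go, with CSV standing in for whatever bulk format your store prefers; the sample values are made up:)

    package main

    import (
        "encoding/csv"
        "os"
        "strconv"
        "time"
    )

    func main() {
        // Stand-in for one scrape's worth of samples, keyed by series.
        samples := map[string]float64{
            `node_load1{instance="db-1"}`:               0.42,
            `http_requests_total{code="500",job="api"}`: 17,
        }

        // One wide row per sample: (timestamp, series, value). Columnstores
        // like ClickHouse or Vertica bulk-load this shape very efficiently.
        w := csv.NewWriter(os.Stdout)
        defer w.Flush()
        ts := time.Now().UTC().Format(time.RFC3339)
        for series, v := range samples {
            w.Write([]string{ts, series, strconv.FormatFloat(v, 'f', -1, 64)})
        }
    }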
Yup, and that's kinda intentional. The design is to be a monitoring system, not a generic TSDB. What it does needs to be simple, fast, and reliable so that your alerts get sent out.
The original design inspiration, borgmon, also was a terrible storage system and had an external long-term store layered on top of it.
This isn't a design flaw, it's an intentional trade-off to make the core use case as bulletproof as it can be. Seeing "monitoring systems" based on something like Cassandra, aka distributed storage, is cringe-inducing. The first thing to crash at the first sign of network trouble is distributed storage.
My point is that monitoring != storage and there are plenty of great storage systems to use so there's not much reason to create another one. For some reason developers love to create home-made (time-series) databases.
Zebrium founder here. We adapted the scraper to our needs, and figured others have probably wanted to do what you're suggesting, so we decided to share it. In fact, we pick up the scrape and ingest it into a distributed column store ourselves, where we use it with logs and do anomaly detection with it. I think Prometheus is pretty good at what it does. But like you're saying, depending on what you want to do with the data, sometimes a different backend is useful.
The ability to export from prometheus to a full-featured SQL database is definitely one aspect. In addition, for scaling, efficient transport from the scraper to the data store becomes pretty important.
Beyond just compression, having a transport protocol that takes advantage of the fact that much of the data does not change across successive scrapes, and that the data which does change is often incremental (e.g. counters), makes a big difference.
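To make that concrete (purely illustrative, not our actual wire format): if the sender keeps the previous scrape around, it only needs to ship the series whose values changed, and counters can be shipped as small increments rather than full values. Something like:

    package main

    import "fmt"

    // Sample is one series from a scrape, keyed by its canonical label set.
    type Sample struct {
        Series string
        Value  float64
    }

    // Delta is what actually goes over the wire for one series.
    type Delta struct {
        Series string
        Diff   float64 // increment since the previous scrape (full value if new)
    }

    // diffScrape compares the current scrape against the previous one and
    // emits only the series that changed, as increments.
    func diffScrape(prev map[string]float64, cur []Sample) []Delta {
        var out []Delta
        for _, s := range cur {
            old, seen := prev[s.Series]
            if seen && old == s.Value {
                continue // unchanged since the last scrape: send nothing
            }
            out = append(out, Delta{Series: s.Series, Diff: s.Value - old})
            prev[s.Series] = s.Value
        }
        return out
    }

    func main() {
        prev := map[string]float64{
            `up{job="node"}`:                  1,
            `http_requests_total{code="200"}`: 100,
        }
        cur := []Sample{
            {`up{job="node"}`, 1},                    // unchanged: dropped
            {`http_requests_total{code="200"}`, 105}, // counter: sent as +5
        }
        fmt.Println(diffScrape(prev, cur))
    }

The real protocol obviously also has to handle new and expired series, counter resets, and metadata, but that's the basic win.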
Zebrium employee here:
We could not use the Prometheus remote storage adapter because of the following issues:
1. By the time the data gets there, it has lost a lot of information (like metric type, help text, out-of-order drops, etc.).
2. You do not have control over when it gets sent out, as it has to come through the TSDB after chunking, which adds a lot of latency.
3. It is a per-time-series protocol, which is not good for sending all the samples of a scraped target as a single chunk. Sending them as a single chunk helps you group similar metrics on the remote side and simplifies the protocol, reducing network bandwidth.
Hence we did not use the remote storage adapter, and instead introduced a new interface that plugs into the scraper directly.
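Roughly the shape of that hook, sketched in Go (simplified, not our actual code): the sink gets each scrape of a target as one batch, with the type/help metadata still attached, before anything is chunked into the TSDB.

    package scrapesink

    import "time"

    // MetricSample keeps the metadata that the remote-write path drops.
    type MetricSample struct {
        Name   string
        Type   string // counter, gauge, histogram, summary, ...
        Help   string
        Labels map[string]string
        Value  float64
    }

    // TargetScrape is everything collected from one target in one scrape.
    type TargetScrape struct {
        Target    string
        Timestamp time.Time
        Samples   []MetricSample
    }

    // Sink is the kind of interface that plugs into the scraper directly:
    // it receives each scrape as a single chunk, so the sink decides when
    // and how the data is shipped out instead of waiting on TSDB chunking.
    type Sink interface {
        AppendScrape(s TargetScrape) error
    }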
There isn't much in terms of modelling built into the tool itself, because that kind of functionality is secondary (the tool is primarily focused on quick and efficient gathering of metrics). You can do some basic forecasting at query time (holt-winters and linear regression), but beyond that you're using one of the remote storage adapter plugins to send the metrics to a remote time series database and then applying your ML on that (like OP's service does).
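For reference, the query-time bits look like this (ordinary expressions against /api/v1/query; the metric names below are just typical node_exporter ones, so substitute your own):

    package main

    import (
        "fmt"
        "net/url"
    )

    func main() {
        exprs := []string{
            // Holt-Winters smoothing of a gauge over the last hour
            // (the two scalars are the smoothing factors, each in (0, 1)).
            `holt_winters(node_memory_Active_bytes[1h], 0.3, 0.3)`,

            // Linear regression: predicted free disk space 4 hours out,
            // fitted over the last 6 hours of samples.
            `predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600)`,
        }
        // Ready-to-curl query URLs against a local Prometheus.
        for _, e := range exprs {
            fmt.Println("http://localhost:9090/api/v1/query?query=" + url.QueryEscape(e))
        }
    }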
Datadog figured it out. It's pretty great. Obviously it doesn't see everything, but it's a very useful signal. Their anomaly correlation is even cooler.