A Prometheus fork for cloud scale anomaly detection across metrics and logs (zebrium.com)
82 points by talonx on March 28, 2020 | 39 comments



Did you raise the challenges you had with Prometheus with the dev team and community? They are usually quite responsive and open to contributions. I'd be interested in the discussion around the reasons you brought up for forking.


Not OP, but one of the biggest issues that I've had with Prometheus is that it doesn't support backfilling of time series data. Bringing it up to the Prometheus devs led to oddly indignant responses that they had no intention of supporting backfilling, because that's not what Prometheus was designed for. In the large production systems Prometheus is meant for, it was a very disappointing missing feature, and I could absolutely see why someone would want to fork, if nothing else just to get away from that level of indignation.


FWIW, backfilling is on their roadmap[0].

There's also a ticket open about it[1].

[0] https://prometheus.io/docs/introduction/roadmap/#backfill-ti...

[1] https://github.com/prometheus/prometheus/issues/535


Hah, yeah I remember seeing that ticket. It's been open for over five years now. The devs'... attitude... is clear as day.


VictoriaMetrics seems to support that. I feel it's technically superior to the other remote storage options.

https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/FAQ

(Not affiliated, just impressed with the author's blog posts explaining how it's better than competitors.)


By backfilling, I assume you mean retroactively applying labels to a time series? Can you link to the thread with more context if you have it around? As a Prometheus user myself at a company with 100+ engineers, I've never had a use case for a flow that would require editing old time series, apart from having to restore data after a crash.


What I mean by that is: I have an exporter that (for example) might be reading a bunch of logs and exposing some inferred metrics through the exporter. Two days later, I realize that there's some useful information from those logs, or another log file, or hell, another server, that wasn't inferred. I'd want to backfill those metrics so that, in whatever monitoring analysis I'm running, both data points would sit side by side. Prometheus doesn't allow for that, because opinions. Just a toy example, so please don't get HN-pedantic about it :)

Not sure why restoring data after a crash isn't a bigger problem in your eyes, though. In the context of fully understanding system stability through Prometheus, that's a pretty giant gap in your post mortem analysis, with no way to correct it. It forces you into either using two tools to do that sort of analysis, or accepting that you're always going to risk losing valuable data.


Gotcha, it seems like you want to run backtests from within Prometheus for richer post-hoc analysis. That sounds like a normal use case; I often do fairly non-trivial post-hoc analysis in Prometheus (e.g. z-scores and standard deviations of metrics), since the query language is quite well suited to it, so I can see why you'd want to leverage it more.
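
To give a concrete (hypothetical) flavour, a z-score over a day-long baseline looks something like this -- my_metric is a made-up gauge, and in practice we'd precompute the baseline with recording rules:

    # how far the current value sits from its 1-day average,
    # measured in standard deviations
    (my_metric - avg_over_time(my_metric[1d]))
      / stddev_over_time(my_metric[1d])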

Re: restoration, we use facilities like EBS snapshots, so it's separate from Prometheus. It could theoretically be an issue for post mortem analysis, but data loss is relatively rare, so in practice our ability to run post mortems has never been affected. Generally we don't consider our metrics to be as "business-critical" as production data (e.g. we wouldn't consider the above acceptable for MySQL), so that's where the leniency comes in. I can see different orgs weighing this differently -- we're not really an "infrastructure" company, so I think the tradeoff is right for us, but if I were at a PaaS or IaaS I would feel differently.


I don't know if the Prometheus devs care about it being enterprise-level monitoring software, but not having any type of backfilling on the roadmap pretty much disqualifies it. Accurate reporting is not a nice-to-have for enterprise-grade monitoring.


FWIW, the lead dev's own description of his "Up & Running" book pretty much implies that they do, in fact, care about its use in large production systems:

"Get up to speed with Prometheus, the metrics-based monitoring system used by tens of thousands of organizations in production. This practical guide provides application developers, sysadmins, and DevOps practitioners with a hands-on introduction to the most important aspects of Prometheus, including dashboarding and alerting, direct code instrumentation, and metric collection from third-party systems with exporters."


Backfilling data is on the roadmap. It's already possible; the tooling is being worked on.

Also, accurate reporting and backfill are not related.


It's been on the roadmap for at least two and a half years, so that's not very convincing: https://web.archive.org/web/20170708101602/https://prometheu...

Curious to know why you think accurate reporting and backfilling aren't related. In my experience, they absolutely are -- mostly during disaster scenarios, where that information is more critical than at any other time.


> accurate reporting and backfill are not related.

I strongly disagree. If you are unable to backfill data, then report accuracy will be affected in multiple scenarios, including migrations from other products, unplanned downtime, etc. Not having backfill is accepting that reports will suffer as soon as historical data is inaccurate, and basically saying that you don't care.


Backfilling isn't necessary for "accurate reporting"?


Is that a question? I think I pretty plainly stated that the answer to that question is yes.


I'm confused because the statement doesn't make sense. Accurate reporting is orthogonal to the ability to backfill.


Take a breather. Pretty sure the confusion here is in the overloading and different contexts of "reporting". One kind of reporting is post-hoc, and the other is "real time" reporting for outages and such. There are probably other ways to define it, but let's try to figure out where the confusion is first. How are you defining "reporting" to where it's orthogonal to the ability to backfill?


Yeah, if that were the only reason given I would 100% be like "Ah, yep, good call".


Disclosure: Zebrium employee.

We've had a great experience with Prometheus - both the community and the project. Building what we did would have been immeasurably harder without them. That said, this has been a rapid process for us. We are starting to engage and see which parts of our fork they would be interested in taking upstream. (In particular, our storage needs are a bit different, so we'll see how that fits.)


Yet another Prometheus/time-series backend project.

And yet again, it would be far better to just export the data from Prometheus into a distributed columnstore data warehouse like ClickHouse, MemSQL, or Vertica (or other alternatives). This gives you fast SQL analysis across massive datasets, real-time updates regardless of ordering, and unlimited metadata, cardinality, and overall flexibility.

Prometheus is good at scraping and metrics collection, but it's a terrible storage system.


Yup, and that's kinda intentional. The design is to be a monitoring system, not a generic TSDB. What it does needs to be simple, fast, and reliable so that your alerts get sent out.

The original design inspiration, Borgmon, was also a terrible storage system and had an external long-term store layered on top of it.

This isn't a design flaw; it's an intentional trade-off to make the core use case as bulletproof as it can be. Having seen "monitoring systems" built on something like Cassandra, i.e. distributed storage, I find the idea cringe-inducing. The first thing to crash at the first sign of network trouble is distributed storage.


My point is that monitoring != storage, and there are plenty of great storage systems to use, so there's not much reason to create another one. For some reason developers love to create home-made (time-series) databases.


I agree. Prometheus is very good at what it was designed to do.


Zebrium founder here. We adapted the scraper to our needs and figured others have probably wanted to do what you're suggesting, so we decided to share it. In fact, we pick up the scrape and ingest it into a distributed column store ourselves, where we use it alongside logs and run anomaly detection on it. I think Prometheus is pretty good at what it does. But like you're saying, depending on what you want to do with the data, sometimes a different backend is useful.


The big innovation here is the data compression. The point about the standard remote storage interface losing the metric type is a good one.

Wouldn't adding that let you switch back to the main project and shrink the local storage buffer to be as small as possible?


Hi.

I would be interested in a demo of the logs AI stuff using real data. Something like https://www.honeycomb.io/play/ would do.

Do you have anything like that?


Disclosure: Zebrium employee.

The ability to export from prometheus to a full-featured SQL database is definitely one aspect. In addition, for scaling, efficient transport from the scraper to the data store becomes pretty important.

Beyond just compression, having a transport protocol that takes advantage of the fact that much of the data does not change across successive scrapes, and that the data which does change is often incremental (e.g. counters), makes a big difference.


The compression part is definitely great work. It would be a valuable addition to the main prometheus repo if they accept it.


Thanks, yes. It would make our life a lot easier too if they accept it, so that we don't have to constantly maintain this forked version.


Alternative title: "We layered compression on top of the Prometheus remote storage adapter"


Zebrium employee here: We could not use the Prometheus remote storage adapter because of the following issues:

1. By the time the data gets there, it has lost a lot of information (like metric type, help text, out-of-order drops, etc.).

2. You do not have control over when it gets sent out, as it has to come through the TSDB after chunking, which adds a lot of latency.

3. It is a per-time-series protocol, which is not good for sending all the samples of a scraped target as a single chunk. Sending a single chunk lets you group similar metrics on the remote side and simplifies the protocol for reducing network bandwidth.

Hence we did not use the remote storage adapter; we introduced a new interface that plugs into the scraper directly.


Anyone know what statistical models Prometheus uses?

Having scanned this article, read the wikipedia entry and hit the landing page, I am none the wiser.


There isn't much in terms of modelling built into the tool itself, because that kind of functionality is secondary (the tool is primarily focused on quick and efficient gathering of metrics). You can do some basic forecasting at query time (Holt-Winters and linear regression), but beyond that you're using one of the remote storage adapter plugins to send the metrics to a remote time series database and then applying your ML on that (like OP's service does).

https://prometheus.io/docs/prometheus/latest/querying/functi...
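
For a rough flavour of those query-time options (metric names and parameters here are just placeholders, not a recommendation):

    # double-exponential smoothing of a gauge over the last hour
    # (smoothing factor 0.3, trend factor 0.1)
    holt_winters(my_gauge[1h], 0.3, 0.1)

    # linear extrapolation: estimate free disk space 4 hours from now,
    # based on the last 6 hours of samples
    predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600)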


Figures. A fine objective, but not what I'd imagined.


It has holt_winters as a builtin: https://prometheus.io/docs/prometheus/latest/querying/functi...

Nothing stopping folks from adding in more.


Anomaly detection is snake oil.


Datadog figured it out. It's pretty great. Obviously it doesn't see everything, but it's a very useful signal. Their anomaly correlation is even cooler.


It would be good to add information explaining why. Empty dismissals don't contribute much, whereas if you make a substantive case, readers can learn.


Would love to learn why you say so.



