Monitoring Cloudflare's edge network with Prometheus (drive.google.com)
123 points by kiyanwang on Sept 24, 2017 | 77 comments



Prometheus is an escaped implementation of Google’s borgmon, which is seen inside Google as a kind of horror show, and alternatives have been developed. It is kind of frightening that it has gotten out into the wild and that people like it.


"Borgmon is the worst form of monitoring, except for all those other forms that have been tried from time to time."

-- Winston Churchill if he worked at Google.


Having worked both on borgmon and with borgmon, I can think of a few reasons why some (many?) don't like it all that much:

1. As already mentioned, the macro system (and the fact that its use is basically required to set up basic monitoring) has quite a steep learning curve. Prometheus doesn't have that and I personally would prefer it did.

2. It's not a service, so you have to set up your own instance, configure it, and maintain it. In many engineers' minds this is just another hurdle between them and launching their service.

3. As a software engineer (particularly new to Google) you might not expect to have to do ops work and carry a pager yourself.

Out of all three, only (1) is a valid reason to hate on borgmon. That, and the language itself (which is almost a 1:1 match with Prometheus), are very different from your regular programming language. But given the choice between flat, simple metrics (which is what most monitoring systems give you) and the ability to have arbitrary dimensions and work with them to build useful alerts and dashboards and troubleshoot quickly, I (again, personal opinion) will always go with the latter.


Why is borgmon considered a horror show? Is something fundamentally flawed in the model? How do the alternatives differ from the original borgmon?


Google has an alternative that they gave a talk on back in December. Sadly there aren't any papers on it yet. It's called Monarch and it's what backs up Stackdriver.

Its config language is less crazy (Python-based), and it operates globally.

https://www.youtube.com/watch?v=LlvJdK1xsl4

Edit: Monarch config isn't sane, it's just different and at least not in the crazy languages that borgmon uses.


I will point out that borgmon's language (minus macros) is almost a 1:1 match with Prometheus. You can judge for yourself how crazy that is, but I feel that it's close to as simple as you can get for the power it gives you.

As for Monarch, it's a very different beast. For one, it stores all its rules in a protocol buffer format, so it's more structured. But then you have to write Python code that generates the protocol buffers and pushes them to storage. It looks similar to, but is not the same as, the ad-hoc query language. I wouldn't go as far as calling it sane.

It is also a service, and it's optimized for Google's network architecture, with datacenter-local and global nodes. The language itself is aware of this distinction: some computations are done locally, others globally, and so on.

For your local monitoring needs (or even global ones, if you're willing to put in the effort), Prometheus is a solid choice.


I'd agree with you after thinking about it some. I haven't really written either, mostly just copy/pasting or using tools to assist in creation. So I can't really judge either monitoring language on its ease of use.


"sane" is an interesting choice of words to describe Monarch configuration...


Borgmon's language is weird and crusty. People can debate whether this is the main problem with Borgmon or whether more fundamental changes are necessary.

I won't weigh in on that debate. But you can think of Prometheus as an experiment to decide the issue: it is very similar to Borgmon, but has a cleaner language.



And what would be a better alternative available outside of Google?


The decade-old, open-source, self-hosted, debug-it-yourself standard for monitoring is collectd+graphite+grafana.

The equivalent that's easy to set up and use, with ALL the features working out of the box, is the SaaS standard https://www.datadoghq.com/ or potentially Google Stackdriver if you are on Google Cloud.


Plugging https://www.librato.com/ after reading an HN thread yesterday saying that DataDog pricing is insane [1].

[1] https://news.ycombinator.com/item?id=15315028

Disclaimer: No relation to either org.


The thread you are linking to is a complete joke. The guy didn't see that the pricing was per host, even though it's written in big letters. His whole series of rants is ridiculous.



Stackdriver is a Google Cloud service; you need a Google account with billing to use it. It can gather metrics from hundreds of vendors, including AWS and Azure.


It wasn't universally liked at Cloudflare. The federation component in particular is a PITA.


> alternatives have been developed

Such as Prometheus :) Even some teams in Google use it.


So do you have any reason to not like it?


Not really. I just think it's funny to see it described by one group as a reasonable or even state-of-the-art system, while another group describes it as brain damage from ten years ago.


Borgmon being "brain damage" or some kind of "horror show" is more a meme than a serious opinion held by people who have used Borgmon.

Personally I'm very happy that the open source world is adopting something derived from Borgmon rather than something derived from its supposed "replacement".


Yup, I've had more than one current Google SRE state, "You can have my Borgmon when you pry it from my cold dead hands."

Borgmon may be dead in the eyes of some people, but I know for a fact that it's still the only thing monitoring core and critical systems.

Most of the problem with Borgmon, IMO, is the cruft that has built up over the decade+, and neglect due to the Google pattern of "the new thing that doesn't work, and the old thing that is deprecated."

The difficulty at Google is that developers are rewarded for writing new and shiny things from scratch, rather than fixing the old but working systems.

This isn't always a problem, as some good things can come out of starting from scratch. But sometimes they throw out too many of the good ideas, in an attempt to be fancy and new.


I have heard unflattering things about it from a few different ex-Google SREs, specifically about the macro system and it being cumbersome to use.


You will note that Prometheus explicitly does not have a macro system.


Oh sure, I was responding to the OP stating that criticism of Borgmon was more a meme than reality.

It wasn't meant to be commentary on Prometheus (which I quite like) at all :)


It's pretty normal when you consider that Google is 10 years ahead of almost every other company when it comes to infrastructure.


> is 10 years ahead of almost every other company when it comes to infrastructure.

For Google-scale orgs or infrastructure needs. Most everyone else in the world does not need Google scale tools.


Google-like needs are extremely common. Take a look at any Fortune 500 and it could usually benefit greatly from a lot of the infrastructure that powers Google.

Most of them do run their own datacenters, sometimes in numerous locations, and they have massive and extremely complex IT systems in place.


Here is the video if you want to watch the full talk: https://www.youtube.com/watch?v=lHtY7TUsLzk


The talk originally given at PromCon 2017: https://www.youtube.com/watch?v=ypWwvz5t_LE


I don't get the hype around Prometheus.

What makes its pull-based mechanism superior to push-based ones like statsd?

And using exporters sounds clunky - instead of directly querying a metric and sending it to your metrics collector, you have an intermediate component which exposes them for collection.


There are services that might be used rarely, but they're still critical. Therefore you either implement a heartbeat-like approach where the monitored system indicates regularly that it's alive, or use pull. Using exporters might sound clunky, but in many cases you need to implement some kind of pull-based system anyway, because sometimes the monitored system does not support the monitoring infrastructure directly. Example: most database systems don't have Graphite or Prometheus support, they just expose their stats. So you'll end up writing or using a component that pulls these stats regularly, then pushes them. Also, when you use pull then you only need to configure the monitoring system, not the monitored services separately (e.g. if you relocate the monitoring system to a different host, then you don't need to update every single monitored service to talk to the new host).

I've found that people who are new to larger-scale monitoring favor push because it somehow feels more intuitive, but pull really does work very well; it's not clunky at all.


You are conflating metrics collection and metrics storage.

The metrics collection needs to be done by a local agent installed on the system, for the reason you gave, that's the only place the data is available.

The metrics storage is somewhere else.

Prometheus does the storage. For the collection, you still have to install collectd, statsd or similar on your hosts. Sure, Prometheus could do an HTTP check remotely, but that won't cover much of anything.


Prometheus does collection, though.

The Prometheus pattern with central things like a database (e.g. monitoring Postgres) is to let the exporter (the thing that acts as an intermediary to expose metrics for Prometheus to fetch) run anywhere it wants. It absolutely does not need to be a "local agent".

(In fact, if you're using something like Kubernetes, the only thing that needs local access to a node is the exporter that exposes node-specific metrics. Everything else can chat over the network.)

The benefit here is that if you have 10 Postgres databases, you can still run just a single exporter and have it extract data from all of them. Or you can run one exporter per database; there's conceptually little difference.

On Kubernetes, we usually run an exporter as a sidecar container, which means it can talk to Postgres or whatever on localhost and just live alongside the process that it's exporting metrics for, and we rely on Prometheus' automatic discovery to make Prometheus pull from it. Start a new Postgres instance and its metrics are almost immediately fed into Prometheus.

You don't need collectd etc. with Prometheus. There are exporters around for just about anything.


We are both describing the same thing, you end up with "agents" spread on various systems to collect the metrics you are interested in, which is the point I wanted to outline.

They can be called agents, exporters, collectors, whatever, the name is not important, the design pattern is.

A system that would be exclusively pull-based from the Prometheus server does not work in practice.


No, you don't generally end up with agents or exporters everywhere. You might have node_exporter running on each host to collect system metrics (load, iops, etc.) but those are exceptions. The rule is that the software under instrumentation is mounting /metrics endpoints and exposing Prometheus metrics to the Prometheus server directly.

Unless you take the presence of node_exporter to invalidate the premise (which would be stupid) a system that's exclusively pull-based is entirely practical.


I might be wrong, but one reason I saw mentioned was that pulling would give you a better picture of failures. If one of the services you are monitoring goes down, it won't be able to push anything.


But where's the difference between not being able to pull vs. not receiving metrics for a period of time?


Well, there's one thing going for pull over push - and generally I am strongly in favour of push.

With push, you need to know where to send the data. So you depend on a fixed configuration. With pull, you can essentially do a periodic nmap sweep to discover and update the monitoring data sources.

That's one less thing to go wrong.


Another nice thing about pull is you only pull from the authoritative source. If you have a daemon that has gone rogue your load balancer or naming system can throw it out of the named group/job/whatever even if it can’t shut it off. With push you might have zombies pushing duplicate data for what should be a unique target.


> But where's the difference between not being able to pull vs. not receiving metrics for a period of time?

Let's say I have a service that's only used once per day, at a random point in the day. I need to know if the service is available all the time, even if it's not used. With pull, I'll get back an 'OK' e.g. every minute. You could say that we should implement heartbeats, but again, then this service needs to be configurable to pick up changes in the monitoring infrastructure, and then I need a separate check for the presence of heartbeats and for the values of the samples...


Monitoring whether the process is running, whether the input queue is free, whether it will take a new job, etc.: it seems to me the responsibility for all that should lie with whatever supervisor has the ability to reset/restart the service. Then it can report stats on uptime and the number of restarts required.

What if the service spins up on demand when it's needed, does its job, and then exits? You want to track how often it starts, how long it runs, how long the job was in the queue, how much CPU/RAM/IO was consumed, etc.

With push, that's all part of the cleanup code in the process. With pull, it seems like you could completely miss that the process was ever even run?


That's a short-lived batch job, for which a different approach is needed - basically push is the sane option there, as pull won't cut it. Prometheus has the Pushgateway for this (small sketch below).

The push vs pull debate is relevant for longer running daemons, not batch jobs.
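
For completeness, pushing from a batch job to the Pushgateway with the Go client library looks roughly like this; a minimal sketch, assuming a Pushgateway reachable at pushgateway:9091 (the job and metric names are made up):

    package main

    import (
        "log"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/push"
    )

    func main() {
        // Record when this batch job last finished successfully.
        completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "batch_job_last_success_timestamp_seconds",
            Help: "Unix timestamp of the last successful run.",
        })
        completionTime.SetToCurrentTime()

        // Push once, at the end of the job. Prometheus then scrapes the
        // Pushgateway like any other target.
        if err := push.New("http://pushgateway:9091", "my_batch_job").
            Collector(completionTime).
            Push(); err != nil {
            log.Fatalf("could not push to Pushgateway: %v", err)
        }
    }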


Prometheus does support push. It's just that it's considered such an antipattern that it's been moved into a separate module (the Push Gateway) that you need to run separately.

Pulling has a few technical benefits, though. For one, only the puller needs to know what's being monitored; the thing being monitored can therefore be exceedingly simple, dumb and passive. Statsd is similarly simple in that it's just local UDP broadcast, of course, which leads to the next point:

Another benefit is that it allows better fine-grained control over when metrics gathering is done, and of what. Since Prometheus best practices dictate that metrics should be computed at pull time, it means you can fine-tune collection intervals for specific metrics, and this can be done centrally. And since you only pull from what you have, there can't be a rogue agent somewhere spewing out data (i.e. what a sibling comment calls "authoritative sources").

But to understand why pull is a better model, you have to understand Google's/Prometheus's "observer/reactor" mindset towards large-scale computing; it's just easier to scale up with this model. Consider an application that implements some kind of REST API. You want metrics for things like the total number of requests served, which you'll sample now and then. You add an endpoint /metrics running on port 9100. Then you tell Prometheus to scrape (pull from) http://example.com:9100/metrics. So far so good.
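
To make that concrete, here is a minimal sketch with the official Go client library (the metric name is an illustrative choice, not something from the talk):

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // Total number of requests served by this instance.
    var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "myapp_http_requests_total",
        Help: "Total number of HTTP requests served.",
    })

    func handler(w http.ResponseWriter, r *http.Request) {
        requestsTotal.Inc()
        w.Write([]byte("ok"))
    }

    func main() {
        prometheus.MustRegister(requestsTotal)
        http.HandleFunc("/", handler)
        // Everything registered above is exposed in the text format at
        // http://<host>:9100/metrics for Prometheus to scrape.
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9100", nil))
    }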

The beauty of the model arises when you involve a dynamic orchestration system like Kubernetes. Now we're running the app on Kubernetes, which means the app can run on many nodes, across many clusters, at the same time; it will have a lot of different IPs (one IP per instance) that are completely dynamic. Instead of adding a rule to scrape a specific URL, you tell Prometheus to ask Kubernetes for all services and then use that information to figure out the endpoints. This dynamic discovery means that as you take apps up and down, Prometheus will automatically update its list of endpoints and scrape them. Equally important, Prometheus goes to the source of the data at any given time. The services are already scaled up; there's no corresponding metrics collection to scale up, other than in the internal machinery of Prometheus' scraping system.

In other words, Prometheus is observing the cluster and reacting to changes in it to reconfigure itself. This isn't exactly new, but it's core to Google's/Prometheus's way of thinking about applications and services, which has subsequently coloured the whole Kubernetes culture. Instead of configuring the chess pieces, you let the board inspect the chess pieces and configure itself. You want the individual, lower-level apps to be as mundane as possible, let the behavioural signals flow upstream, and let the higher-level pieces make decisions.

This dovetails nicely with the observational data model you need for monitoring, anyway: First you collect the data, then you check the data, then you report anomalies within the data. For example, if you're measuring some number that can go critically high, you don't make the application issue a warning if it goes above a threshold; rather, you collect the data from the application as a raw number, then perform calculations (e.g. max over the last N mins, sum over the last N mins, total count, etc.) that you compare against the threshold.

In practice, implementing a metrics endpoint is exceedingly simple, and you get used to "just writing another exporter". I've written a lot of exporters, and while this initially struck me as heavyweight and clunky, my mindset is now that an HTTP listener is actually more lightweight than an "imperative" pusher script.


But why is it simpler for Prometheus to have to query Kube to discover all the endpoints in order to collect the data, versus the endpoints just pushing out to Prometheus?

Obviously endpoints already need to know how to contact all sorts of services they depend on. So it's not like you're "saving" anything by not telling them "PrometheusIP = X".

Let's say you want to cleanly shut-down some instances of your endpoint. They are holding connection stats & request counts that you don't want to lose. With push the endpoint can close its connection handler, finish any outstanding requests, push final stats, and then exit. With pull are you supposed to just sit and wait until a Pull happens before the process can exit?


Because it shifts all the complexity to the monitoring system, making the "agents" really, really dumb. There would have to be more to push than just a single IP:

* Many installations run multiple Prometheus servers for redundancy, so to start, it'd have to be multiple IPs.

* They would also need auth credentials.

* They'd need retry/failure logic with backoff to prevent dogpiling.

* Clients would have to be careful to resolve the name, not cache the DNS lookup, in order to always resolve Prometheus to the right IP.

* If Prometheus moves, every pusher has to be updated.

* Since Prometheus wouldn't know about pushers, it wouldn't know if a push has failed. As Prometheus is pull-based, you can detect actual failure, not just absence of data.

There's a lot to be said for Prometheus' principle of baking exporters into individual, completely self-encapsulated programs — as opposed to things like collectd, diamond, Munin, Nagios etc. that collect a lot of stuff into a single, possibly plugin-based, system.

Don't forget, a lot of exporters come with third-party software. You want those programs to have as little config as possible. If I release an open-source app (let's say, a search engine), I can include a /metrics handler, and users who deploy my app can just point their Prometheus at it. It's enticingly simple.

As for graceful shutdown: The default pull frequency is 15 seconds, and you can increase it if you want to avoid losing metrics. Prometheus is designed not to deal with extremely fine-grained metrics; losing a few requests due to a shutdown shouldn't matter in the big picture. But for metrics that are sensitive, it's easy enough to bake them into some stateful store anyway (Redis or etcd, for example), or to compute them in real time from stateful data (e.g. SQL). For example, if you have some kind of e-commerce order system, it's better if the exporter produces the numbers by issuing a query against the transaction tables, rather than maintaining RAM counters of dollars and cents.
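
As a hedged sketch of that last pattern: a GaugeFunc is evaluated at scrape time, so the number always comes from the database, never from a counter held in RAM (the table, metric name, connection string and driver choice are all invented for illustration):

    package main

    import (
        "database/sql"
        "log"
        "net/http"

        _ "github.com/lib/pq" // hypothetical choice of Postgres driver
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/shop?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }

        // The callback runs on every scrape, so the exporter stays stateless.
        ordersTotal := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
            Name: "shop_orders_total_dollars",
            Help: "Total value of all orders, computed from the orders table.",
        }, func() float64 {
            var total float64
            if err := db.QueryRow(`SELECT COALESCE(SUM(amount), 0) FROM orders`).Scan(&total); err != nil {
                log.Printf("query failed: %v", err)
                return 0
            }
            return total
        })
        prometheus.MustRegister(ordersTotal)

        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9187", nil))
    }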


How do you handle aggregated metrics, e.g. request count? What does the instance (either server/container) expose at localhost:9001/metrics?


You would expose a counter with the total request count. Summing those up across all nodes known by Prometheus will give you the total amount of requests currently visible to monitoring. With "rate()" you could calculate the requests/second.
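
For instance, using the Go API client against the Prometheus HTTP API, that aggregation is a single query (the server address and metric name are illustrative):

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/prometheus/client_golang/api"
        v1 "github.com/prometheus/client_golang/api/prometheus/v1"
    )

    func main() {
        client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
        if err != nil {
            log.Fatal(err)
        }
        promAPI := v1.NewAPI(client)

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // Sum the per-instance request counters and turn them into a rate:
        // requests/second across every scraped instance.
        result, warnings, err := promAPI.Query(ctx,
            `sum(rate(myapp_http_requests_total[5m]))`, time.Now())
        if err != nil {
            log.Fatal(err)
        }
        if len(warnings) > 0 {
            log.Printf("warnings: %v", warnings)
        }
        fmt.Println(result)
    }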

But yes it is possible to miss some requests if a node goes down without Prometheus collecting the latest stats.

But as the parent said, if you need such totals it might be better to store them persistently. Also, I don't know of a scenario where the total number of requests would trigger an alert.


> But yes it is possible to miss some requests if a node goes down without Prometheus collecting the latest stats.

The rate() function allows for this; you'll get the right answer on average.


The same way push-based systems do it. Prometheus scrapes the individual metrics off each instance and provides aggregation functions in the query language.


Thank you for your patient and articulate responses in this thread, 'lobster_johnson'. You make an excellent case. For me, this nails it:

>"if you're measuring some number that can go critically high, you don't make the application issue a warning if it goes above a threshold; rather, you collect the data from the application as a raw number, then perform calculations (e.g. max over the last N mins, sum over the last N mins, total count, etc.) that you compare against the threshold."


But that's how you do it with push-based statsd setups, too.


Also, metrics have a very intuitive text exposition format, which means you can host them as easily as updating a text file served by an HTTP server.
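
For anyone who hasn't seen it, a scrape response is just plain text along these lines (the metric names are made up):

    # HELP myapp_http_requests_total Total number of HTTP requests served.
    # TYPE myapp_http_requests_total counter
    myapp_http_requests_total 1027
    # HELP myapp_build_info Build information, exposed as a constant gauge.
    # TYPE myapp_build_info gauge
    myapp_build_info{version="1.2.3"} 1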


It's not about push or pull or exporters, it's about where you aggregate your events. As soon as you are working with a non-trivial number of events, you need to think more carefully about how to deal with them.

With standard statsd, you're sending a message per event. This might be fine for one or two servers with a trickle of traffic, but when you've got 100 servers handling 100s to 1000s of requests per second, we're talking about 10-100k messages/sec. That is a large amount of load to be sending directly onto your network.

With statsd, you could buffer, but now you're losing the benefit of the live event stream.

With Prometheus, we just say keep it all in memory, locally, and as thread-local as possible. This way the cost to update an event is very, very tiny. Much smaller than even a log line. This allows every debug-level log line to be recorded as an event counter. For example, the Prometheus Go library only requires ~15 nanoseconds of CPU time to update a single counter.

This cheapness allows for sprinkling metrics everywhere in your code with little worry that it'll be a performance problem.
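
A rough sketch of what that looks like in application code, using the Go client's CounterVec (the metric and label names are invented):

    package main

    import (
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
    )

    // Incrementing a counter is on the order of nanoseconds, so it can sit
    // in a hot code path right next to a debug log line.
    var cacheOps = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "myapp_cache_operations_total",
        Help: "Cache lookups, partitioned by result.",
    }, []string{"result"})

    var cache = map[string]string{}

    func lookup(key string) (string, bool) {
        v, ok := cache[key]
        if ok {
            cacheOps.WithLabelValues("hit").Inc()
        } else {
            cacheOps.WithLabelValues("miss").Inc()
        }
        return v, ok
    }

    func main() {
        // The counters would be exposed via a /metrics handler as usual;
        // omitted here to keep the sketch short.
        lookup("example")
    }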

On the topic of exporters, stand-alone exporters are only necessary where the existing code doesn't directly support Prometheus's metrics format. The good news here is that we're working with several other metrics systems in order to create a common format for polling and pushing metrics. Yes, insert XKCD joke here. :-)

Once we have a better common metrics format (think SNMP, but not crazy), "exporters" will be unnecessary.


I've been working with Prometheus for a while. The only sticking point has been the 15-day storage of metrics. Does anyone know why this is the storage limit, or have any strategies for long-term storage of metrics?


As lobster_johnson stated, Prometheus storage retention is a command line launch option.

Prometheus is not intentionally designed as a long-term cold storage option for metrics. You _can_ store metrics for as long as your storage allows, but Prometheus is not going to replicate or manage that data to prevent long-term degradation. Depending on your long-term needs, the preferred pattern is to roll data off Prometheus into a metrics store that better handles data over the scope of months. Rolling data off of Prometheus is done with an exporter and documented [here](https://prometheus.io/docs/operating/integrations/#remote-en...).

In most operational use cases I've come across, we've only needed about a month of data, so keeping it in Prometheus was kosher. YMMV.


What if you want to keep time-series data for accounting purposes forever?

Right now I use Graphite, but I would love something that also handles replication/redundancy, with a good query language and enough performance to also use it to fetch the underlying data used to render front-end graphs for users.


The answer to these kinds of questions is very, very dependent on requirements and available resources. And there are rarely easy, out-of-the-box answers.

The big questions to ask are:

- What data is of interest?

- What are the requested ingest patterns? (How is data getting into this system? How frequently? What rate of ingest is expected?)

- What are the requested query patterns? (Who is doing queries? What do those queries look like specifically? Are people querying over unbounded time ranges or are people doing more focused queries? Do queries regularly involve aggregations or not?)

- What is the requested SLA? (i.e. how much partial downtime is okay? How much full downtime is okay? What kind of query response time do you need to target?)

- What resources are/will be available? (money, man power, compute, storage, tech on hand)

It's possible that a time series data store might not be the correct system choice once these questions are answered. It's not unusual to see data split or copied into multiple systems to answer all requirements.


The most important consideration is the scale. A low scale system will not require an optimum solution, merely a correct one.

So I would say: the system is a 5-minute ingest of about 1 million metrics. This is spread out over half a dozen locations, each of which currently records into its own silo.

An aggregate metric is calculated with a 5-minute lag; it reads all the just-written data points and aggregates them into sum-totals, which are themselves stored and cached in one place. This is another million metrics, basically stored separately from the rest.

But it doesn't really change the character of the system. In the end I'm trying to:

- Write batches of mostly numeric data

- Run queries over time against those numbers

- Aggregate data over different blocks of time: 5-minute, hourly, daily, monthly

- Store it efficiently

- Ensure redundancy and integrity

Seems like a simple and common enough problem to have been reasonably "solved" for orders of magnitude higher scale than I'm operating at.


From what I've seen, each solution has been pretty specific to the situation. I've rarely seen one tech stack prevail at low to reasonable scale.

At reasonable scale, I've seen people get really far with pure graphite setups by utilizing tools like [carbon-c-relay](https://github.com/grobian/carbon-c-relay), [carbonate](https://github.com/graphite-project/carbonate), and high integrity filesystems like zfs underneath. It's a very hands on operation though. Things like growing the cluster are hard to do without downtime.

If constant growth and uptime are concerns, something like OpenTSDB might be a great choice. The complexity of setting up a Hadoop + HBase cluster is a pretty big upfront cost, but man is this thing the cockroach of time series data stores. Adding storage is just growing the HBase cluster. Querying across years of data is pretty simple and straightforward. For the complexity involved, OpenTSDB is worth it.


Feel free to reach out to me if you want to talk through it. alex@laties.info


> What if you want to keep time-series data for accounting purposes forever?

Prometheus is not recommended for uses where 100% accuracy is required, see https://prometheus.io/docs/introduction/overview/#when-does-...?

I'd also be wary of using Graphite for such a use case. For billing, a more traditional database is probably best.


We store 18-24 months of data in OpenTSDB with good performance and compression, but it can be a pain to manage if you don't have experience working with HBase. InfluxDB has commercial cluster support, and there are quite a few other options. See https://blog.outlyer.com/top10-open-source-time-series-datab... for a good comparison list.


Add "-storage.local.retention" (e.g. "-storage.local.retention=1440h" for 60 days) when running Prometheus.


>"Monitoring Cloudflare's planet-scale edge network with Prometheus"

Is "planet scale" better than "web scale"?


It's just Cloudflare trying to make themselves sound bigger. Still nowhere near Google or Amazon.


>"It's just Cloudflare trying to make themselves sound bigger"

And sounding cringe-worthy in the process :)


They are similar in terms of users, requests and bandwidth.


They don't even stack up well against other CDNs.

https://i2.wp.com/stratusly.com/wp-content/uploads/2017/02/t...


Do you have a citation for that claim?


They serve 10% of internet traffic and have over 1 billion users. They receive more than 5 million requests per second on average.


That's not really a citation. Where are these figures published? What does "10% of internet traffic" even mean? By bandwidth? By request volume?


You could start by reading the article you are commenting on. The traffic is by requests.

You may not be aware, but yes, Cloudflare is a significant internet company.


It's not actually an article, it's a slide deck with a bunch of marketing on it. And I did read it. These are unsubstantiated numbers. 10% of total daily HTTP requests? Where is the total daily number of HTTP requests on the internet published? Cisco's VNI approximates bandwidth, but I have never seen any number for the total HTTP requests on the internet.

I am aware that Cloudflare is a CDN, most CDNs are substantial internet companies. However I've worked at a couple of CDNs and they all throw a lot of marketing numbers around.


If memory serves, the 10% isn't based on some kind of global estimate of HTTP traffic. It's based on what percentage of active IP-space they see on a daily basis.


Now that's a marketing metric :)


To me, that's mostly because the things we need to monitor are scattered around our planet.





