
I don't understand how the properties you're describing imply that Prometheus isn't scalable.

High Availability always requires duplication of effort. Scaling queries always requires sharding and aggregation at some level.

I've deployed stock Prometheus at global scale, O(100k) targets, with great success. You have to understand and buy into Prometheus' architectural model, of course.

And I've seen a system that did that. We had to predict what global aggregates we would need. Looking at a metric required finding the right instance to connect to if it wasn't in the bubbleup. Picking the right expressions to avoid double counts was hard. Want to do something fancy? No luck because of the lack of distributed querying.
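To illustrate the double-count problem mentioned above, here is a minimal sketch. It assumes two HA replicas scrape the same targets and that each tags its samples with a `replica` label; the label names and the sample data are hypothetical, not taken from any real deployment.

```python
from collections import defaultdict

# Hypothetical samples as a global aggregator might see them when two HA
# replicas ("a" and "b") scrape the same targets. Label names are assumptions.
samples = [
    {"labels": {"job": "api", "instance": "host1", "replica": "a"}, "value": 10.0},
    {"labels": {"job": "api", "instance": "host1", "replica": "b"}, "value": 10.0},
    {"labels": {"job": "api", "instance": "host2", "replica": "a"}, "value": 5.0},
    {"labels": {"job": "api", "instance": "host2", "replica": "b"}, "value": 5.0},
]


def naive_sum_by_job(samples):
    """Sum every sample per job; each target is counted once per replica."""
    totals = defaultdict(float)
    for s in samples:
        totals[s["labels"]["job"]] += s["value"]
    return dict(totals)


def deduped_sum_by_job(samples, replica_label="replica"):
    """Keep one sample per series identity (labels minus replica), then sum."""
    seen = {}
    for s in samples:
        identity = tuple(
            sorted((k, v) for k, v in s["labels"].items() if k != replica_label)
        )
        seen.setdefault(identity, s)  # the first replica seen wins
    totals = defaultdict(float)
    for s in seen.values():
        totals[s["labels"]["job"]] += s["value"]
    return dict(totals)
```

The naive sum reports twice the real total because every target appears once per replica. Getting the deduplicated form right for every global expression is the part that has to be planned ahead.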

The ways in which you can scale Prometheus are ways in which you can scale anything.

It does not, itself, have highly scalable properties built in.

It does not do sharding, it does not do proxying, it does not do batching, it does not do anything that would allow it to run multiple servers and query over multiple servers.

Look, I’m not saying that it doesn’t work. But when I read about Borgmon and Prometheus, I understood that the design goal was intentionally not to solve these hard problems, and instead to use them as primitive time series systems that can be deployed with a small footprint basically everywhere (and queried individually).

I submit to you, I could also have an InfluxDB instance on every server and get the same “scalability”.

The difference is that I can actually run a huge InfluxDB cluster with a dataset that exceeds the capacity of a single machine.

It seems like you're asserting a very specific definition of scalability that excludes Prometheus' scalability model. Scalability is an abstract property of a system that can be achieved in many different ways. It doesn't require any specific model of sharding, batching, query replication, etc. Do you not agree?

Prometheus cannot evaluate a query over time series that do not fit in the memory of a single node; therefore, it is not scalable.

The fact that it could theoretically ingest an infinite amount of data that it cannot thereafter query is not very interesting.

It can? It just partitions the query over multiple nodes?

Where is the code to do that?

Oh, I see what you mean. Sure, it's in Thanos, or Grafana, or whatever layer above, not Prometheus itself.
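For anyone wondering what "a layer above" does in principle, here is a rough sketch of a scatter-gather query. It is not the Thanos or Grafana implementation; the `Shard` class and `fanout_avg` function are hypothetical stand-ins. Each server computes a partial result over its own data, and the layer merges the partials.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Stand-in for one independent server holding a subset of the series."""

    def __init__(self, series):
        self.series = series  # {series_name: latest_value}

    def partial(self, prefix):
        """Return (sum, count) over local series whose name starts with prefix."""
        values = [v for name, v in self.series.items() if name.startswith(prefix)]
        return sum(values), len(values)


def fanout_avg(shards, prefix):
    """Scatter the query to every shard, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: shard.partial(prefix), shards))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    # An average cannot be merged from per-shard averages, so each shard
    # returns a sum and a count and the division happens once, here.
    return total / count if count else None
```

The merge step is the part that lives outside the individual servers, which is the point being argued in this thread.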

I’m not defining terms arbitrarily.


Scalability means running a single workload across multiple machines.

Prometheus intentionally does not scale this way.

I’m not being mean; it is a fact.

It has made engineering design trade-offs, and one of those means it is not built to scale. This is fine; I’m not here pooping on your baby.

You can build scalable systems on top of things which do not individually scale.

Scalability isn't a well-defined term, and Prometheus isn't a database. :shrug:

Wrong on both counts.

Sorry for being rude, but this level of ignorance is extremely frustrating.

Ignorance? I'm a core Prometheus contributor and a 20+ year distsys veteran. Prometheus is not a database, and scalability is not well-defined. These are not controversial statements.

It is extremely unbecoming to lie about who you are on this forum.

Scalability is defined differently depending on context; in this context (a monitoring/time series solution) it is defined as being able to hold a dataset larger than a single machine that scales horizontally.

Downsampling the data or transforming it does not meet that criterion, since that’s no longer the original data.

The way Prometheus “scales” today is a bolt-on passthrough with federation. It’s not designed for it at all, and it means that your query will use other nodes as data sources until it runs out of RAM evaluating the query. Or not.

The most common method of “scaling” Prometheus is making a tree. You can do that with anything, so it is not inherent to the technology and thus not a defining characteristic. If everything can be defined the same way, then nothing can be, and the term ceases to have meaning: https://valyala.medium.com/measuring-vertical-scalability-fo...

I’ll tell you how Influx scales: your data is horizontally sharded across nodes, and queries are executed across shards.
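As a toy illustration of storage-level sharding in general (not InfluxDB's actual implementation; `ShardedStore` and its methods are made up for this sketch), the store itself decides which node owns each series by hashing the series key, and callers never pick a node:

```python
import hashlib


class ShardedStore:
    """Toy store that places each series on a shard by hashing its key."""

    def __init__(self, num_shards):
        # Each dict stands in for a separate node: {series_key: [(ts, value)]}
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, series_key):
        """Map a series key to a shard index with a stable hash."""
        digest = hashlib.sha256(series_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def write(self, series_key, timestamp, value):
        """Route the point to the shard that owns the series."""
        shard = self.shards[self._shard_for(series_key)]
        shard.setdefault(series_key, []).append((timestamp, value))

    def read(self, series_key):
        """A single-series read touches exactly one shard."""
        shard = self.shards[self._shard_for(series_key)]
        return list(shard.get(series_key, []))

    def count_points(self):
        """A cross-shard query visits every shard and combines the results."""
        return sum(len(points) for shard in self.shards for points in shard.values())
```

Simple modulo placement like this reshuffles most keys when the shard count changes. Real systems tend to use something more careful, but the routing idea is the same.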

That’s what scalability of the database layer is.

Not fetching data from other nodes and putting it together yourself.

Rehydrating from many datasets is not the storage system scaling: the collector layer doing the hydration is the thing that is scaling.

If you sold me a solution that used Prometheus underneath but was distributed across all nodes, perhaps we could talk.

But scalability is not a nebulous concept.

You should refer to your own docs if you think Prometheus isn’t a database, it certainly contains one: https://prometheus.io/docs/prometheus/latest/storage/

I should add (and extremely frustratedly): if you’re not lying and you’re a core Prometheus maintainer, you should know this. I’m deeply embarrassed to be telling you this.

> in this context (a monitoring/time series solution) it is defined as being able to hold a dataset larger than a single machine that scales horizontally.

This just isn't true :shrug: Horizontal scaling is one of many strategies.

I think the disconnect is that Prometheus helps a user to shard things, but it's not automatic. Other time series databases and monitoring solutions automatically distribute and query across servers. It's like Postgres vs. NewSQL (aka FoundationDB, Spanner, etc.).

While Prometheus supports sharding queries when a user sets it up, my understanding is that this has to be done manually, which is definitely less convenient. This is better than a hypothetical system that doesn't allow this at all, but still not the same as something that handles scaling magically.

Prometheus supports sharding queries the way a screwdriver supports turning multiple screws at once. You can design a system yourself that includes the screwdriver, which will turn all the screws, but there's nothing inherent to the screwdriver that helps you with this. If "scalability" just means "you can use it to design something new from scratch that scales" then the term is pretty meaningless.

Lots of systems provide redundancy with 2X cost. It's not that hard.

