I don't understand how the properties you're describing imply that Prometheus isn't scalable.
High Availability always requires duplication of effort. Scaling queries always requires sharding and aggregation at some level.
I've deployed stock Prometheus at global scale, O(100k) targets, with great success. You have to understand and buy into Prometheus' architectural model, of course.
And I've seen a system that did that. We had to predict what global aggregates we would need. Looking at a metric required finding the right instance to connect to if it wasn't in the bubbleup. Picking the right expressions to avoid double counts was hard. Want to do something fancy? No luck because of the lack of distributed querying.
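To make the double-count problem concrete, here is a minimal sketch, assuming an HA pair of servers that both scrape the same targets and a global aggregate computed over everything they report. The label names (`replica`, `instance`, `job`) and the sample values are made up for illustration; this is plain Python, not PromQL or any Prometheus API.

```python
# Hypothetical samples as (labels, value). Two replicas of an HA pair
# scrape the same two targets, so every series shows up twice.
samples = [
    ({"job": "api", "instance": "a:9100", "replica": "prom-0"}, 10.0),
    ({"job": "api", "instance": "a:9100", "replica": "prom-1"}, 10.0),
    ({"job": "api", "instance": "b:9100", "replica": "prom-0"}, 5.0),
    ({"job": "api", "instance": "b:9100", "replica": "prom-1"}, 5.0),
]


def naive_sum(samples):
    """Sum every sample, counting each target once per replica."""
    return sum(value for _, value in samples)


def deduplicated_sum(samples, replica_label="replica"):
    """Sum one sample per series after ignoring the replica label."""
    seen = {}
    for labels, value in samples:
        # Identity of a series is its label set minus the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        # Keep the first sample seen for each series, drop the duplicates.
        seen.setdefault(key, value)
    return sum(seen.values())


print(naive_sum(samples))         # counts both replicas
print(deduplicated_sum(samples))  # counts each target once
```

The naive sum is twice the deduplicated one. Someone has to know which labels identify a replica and write every global expression accordingly.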
The ways in which you can scale Prometheus are ways in which you can scale anything.
It does not, itself, have highly scalable properties built in.
It does not do sharding, it does not do proxying, it does not do batching, it does not do anything that would allow it to run multiple servers and query over multiple servers.
Look, I’m not saying that it doesn’t work. But when I read about Borgmon and Prometheus, I understood the design goal was to intentionally not solve these hard problems, and instead to use them as primitive time series systems that can be deployed with a small footprint basically everywhere (and individually queried).
I submit to you that I could also have an InfluxDB in every server and get the same “scalability”.
The difference is that I can actually run a huge InfluxDB cluster with a dataset that exceeds the capabilities of a single machine.
It seems like you're asserting a very specific definition of scalability that excludes Prometheus' scalability model. Scalability is an abstract property of a system that can be achieved in many different ways. It doesn't require any specific model of sharding, batching, query replication, etc. Do you not agree?
Ignorance? I'm a core Prometheus contributor and a 20+ year distsys veteran. Prometheus is not a database, and scalability is not well-defined. These are not controversial statements.
It is extremely unbecoming to lie about who you are on this forum.
Scalability is defined differently depending on context; in this context (a monitoring/time series solution) it is defined as being able to hold a dataset larger than a single machine that scales horizontally.
Downsampling the data or transforming it does not meet that criterion, since that’s no longer the original data.
The way Prometheus “scales” today is a bolt-on passthrough with federation. It’s not designed for it at all, and it means that your query will use other nodes as data sources until it runs out of RAM evaluating the query. Or not.
The most common method of “scaling” Prometheus is making a tree. You can do that with anything, so it is not inherent to the technology and thus not a defining characteristic. If everything can be defined the same way, then nothing can be, and the term ceases to have meaning: https://valyala.medium.com/measuring-vertical-scalability-fo...
I’ll tell you how Influx scales: your data is horizontally sharded across nodes, and queries are conducted across shards.
That’s what scalability of the database layer is.
Not fetching data from other nodes and putting it together yourself.
Rehydrating from many datasets is not the storage system scaling: the collector layer doing the hydration is the thing that is scaling.
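As a rough illustration of what "sharded across nodes, queried across shards" means, here is a toy in-memory sketch. It is not how InfluxDB is implemented; the `ShardedStore` class, the series naming scheme, and the hash-based placement are all assumptions made for the example. The point is only that the store itself decides where data lives and fans the query out, so the caller never picks a node.

```python
import zlib
from collections import defaultdict


class ShardedStore:
    """Toy time series store that hash-partitions series across shards."""

    def __init__(self, num_shards):
        # Each shard stands in for one node: series name -> [(timestamp, value)].
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def _shard_for(self, series):
        # Deterministic placement: the same series always lands on the same shard.
        return zlib.crc32(series.encode("utf-8")) % len(self.shards)

    def write(self, series, timestamp, value):
        self.shards[self._shard_for(series)][series].append((timestamp, value))

    def query_sum(self, prefix, start, end):
        """Scatter the query to every shard, then gather the partial sums."""
        total = 0.0
        for shard in self.shards:
            # Each shard computes a partial aggregate over its own data only.
            partial = sum(
                value
                for series, points in shard.items()
                if series.startswith(prefix)
                for ts, value in points
                if start <= ts <= end
            )
            total += partial
        return total


store = ShardedStore(num_shards=4)
store.write("cpu.host1", 1, 1.0)
store.write("cpu.host2", 2, 2.0)
store.write("mem.host1", 1, 100.0)
print(store.query_sum("cpu.", 0, 10))
```

In the tree or federation setup, the equivalent of `_shard_for` and the fan-out loop lives in the operator's head and configuration rather than in the storage system.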
If you sold me a solution that used Prometheus underneath but was distributed across all nodes, perhaps we could talk.
I should add (and extremely frustratedly): if you’re not lying and you’re a core Prometheus maintainer, you should know this. I’m deeply embarrassed to be telling you this.
> in this context (a monitoring/time series solution) it is defined as being able to hold a dataset larger than a single machine that scales horizontally.
This just isn't true :shrug: Horizontal scaling is one of many strategies.
I think the disconnect is that Prometheus helps a user shard things, but it's not automatic. Other time series databases and monitoring solutions automatically distribute and query across servers. It's like Postgres vs. NewSQL (FoundationDB, Spanner, etc.).
While Prometheus supports sharding queries when a user sets it up, my understanding is that this has to be done manually, which is definitely less convenient. This is better than a hypothetical system that doesn't allow this at all, but still not the same as something that handles scaling magically.
Prometheus supports sharding queries the way a screwdriver supports turning multiple screws at once. You can design a system yourself that includes the screwdriver, which will turn all the screws, but there's nothing inherent to the screwdriver that helps you with this. If "scalability" just means "you can use it to design something new from scratch that scales" then the term is pretty meaningless.