Pull collection eventually became a real scaling bottleneck for Monarch.
The way the "pull" collection worked was that there was an external process-discovery mechanism, which the leaf used to connect to the entities it was monitoring, the leaf backend processes would connect to the monitored entities to an endpoint that the collection library would listen on, and those entities collection libraries would stream the metric measurements according to the schedules that the leaves sent.
Several problems.
First, the leaf-side data structures and TCP connections become very expensive. If that leaf process is connecting to many many many thousands of monitored entities, TCP buffers aren't free, keep-alives aren't free, and a host of other data structures. Eventually this became an...interesting...fraction of the CPU and RAM on these leaf processes.
Second, this implies a service discovery mechanism so that the leaves can find the entities to monitor. This was a combination of code in Monarch and an external discovery service. This was a constant source of headaches an outages, as the appearance and disappearance of entities is really spiky and unpredictable. Any burp in operation of the discovery service could cause a monitoring outage as well. Relatedly, the technical "powers that be" decided that the particular discovery service, of which Monarch was the largest user, wasn't really something that was suitable for the infrastructure at scale. This decision was made largely independently of Monarch, but required Monarch to move off.
Third, Monarch does replication, up to three ways. In the pull-based system, it wasn't possible to guarantee that the measurement that each replica sees is the same measurement with the same microsecond timestamp. This was a huge data quality issue that made the distributed queries much harder to make correct and performant. Also, the clients had to pay both in persistent TCP connections on their side and in RAM, state machines, etc., for this replication as a connection would be made from each backend leaf processes holding a replica for a given client.
Fourth, persistent TCP connections and load balancers don't really play well together.
Fifth, not everyone wants to accept incoming connections in their binary.
Sixth, if the leaf process doesn't need to know the collection policies for all the clients, those policies don't have to be distributed and updated to all of them. At scale this matters for both machine resources and reliability. This can be made a separate service, pushed to the "edge", etc.
Switching from a persistent connection to the clients pushing measurements in distinct RPCs as they were recorded eventually solved all of these problems. It was a very intricate transition that took a long time. A lot of people worked very hard on this, and should be very proud of their work. I hope some of them jump in to the discussion! (At very least they'll add things I missed/didn't remember... ;^)
Thanks George, and apologies for missing this comment on my first scan through this page. Your Youtube talk is lined up for viewing later today.
We're using prom + cortex/mimir. With ~30-60k hosts + at least that figure again for other endpoints (k8s, snmp, etc), so we can get away with semi-manual sharding (os, geo, env, etc). We're happy with 1m polling, which is still maybe 50 packets per query, but no persistent conns held open to agents.
I'm guessing your TCP issues were exacerbated by a much high polling frequency requirement? You come back to persistent connections a lot, so this sounds like a bespoke agent, and/or the problem was not (mostly) a connection establish/tear-down performance issue?
The external discovery service - I assume an in-house, and now long disappeared and not well publicly described system? ;) We're looking at NodeRED to fill that gap, so it also becomes a critical component, but the absence only bites at agent restart. We're pondering wrapping some code around the agents to be smarter about dealing with a non-responsive config service. (During a major incident we have to assume a lot of things will be absent and/or restarting.)
The concerns around incoming conns to their apps, it sounds like those same teams you were dealing with ended up having to instrument their code with something from you anyway -- was it the DoS risk they were concerned about?
It was more that they would rather send Monarch an RPC than be connected to. Not everyone wants e.g. an HTTP server in their process. For example maybe they are security sensitive, or have a limited memory envelope, or other reasons.
Again, a highly envious feature of a large organisation with almost exclusively bespoke applications that can port & integrate custom libraries directly into applications. Us little people have to contend with mostly black box applications, or occasionally native instrumented with at best a prom-alike endpoint.
Amusingly in the pre-web 1990's, at Telstra (Australia telco) we also developed & implemented a custom performance monitoring library that was integrated into in-house applications.
The rest of us can only hope Opentelemetry becomes more widely adopted. They have put in a lot of effort in decoupling the application instrumentation from the monitoring solution, to allow more rich instrumentation than just a prom-alike endpoint.
Re-reading my comment, it may have been worded badly.
The phrase 'you aren't Google' is true for 99.9% of us. We all get to fix the problems in front of us, was my point. And at that scale you've got unique problems, but also an architecture, imperative, and most importantly an ethos that lets you solve them in this fashion.
I was more reflecting on the (actually pretty fine) tools available to SREs caring for off-the-shelf OS's and products, and a little on the whole 'we keep coming full circle' thing.
The system design choice was to make data visible to queries as soon as possible after being pushed to Monarch, to satisfy alerting guarantees.
Thus there was no queue like a pubsub or Kafka in front of Monarch.
At scale this required a "smoothness of flow". What I mean by this is that at the scale the system was operating the extent and shape of the latency long tail began to matter. If there are many many many many thousands of RPCs flowing through servers in the intermediate routing layers, any pauses at that layer or at the leaf layer below that extended even a few seconds could cause queueing problems at the routing layer that could impact flows to leaf instances that were not delayed. This would impact quality of collection.
Even something as simple as updating a range map table at the routing layer had to be done carefully to avoid contention during the update so as to not disturb the flow, which in practice could mean updating two copies of the data structure in a manner analogous to a blue green deployment.
At the leaf backends this required decoupling--to make eventual--many ancillary data structure updates for data structures that were consulted in the ingest path, and to eventually get to the point where queries and ingest shared no locks.
https://prometheus.io/docs/introduction/faq/#why-do-you-pull... list a few reasons and also end with a note that it probably doesn't matter in the end. Personally, for smaller deployments, i like it because it gives you an easy overview of what should be running, otherwise you need to maintain this list elsewhere anyway, though today with all the auto-scaling around, the concept of "up" is getting more fuzzy.
On top of that there is also less risk that herd of misbehaving clients DoS the monitoring system, usually moments when you need such system the most. This of course wouldn't be a problem with a more scalable solution that distributes ingestion from querying, like the Monarch.
The way the "pull" collection worked was that there was an external process-discovery mechanism, which the leaf used to connect to the entities it was monitoring, the leaf backend processes would connect to the monitored entities to an endpoint that the collection library would listen on, and those entities collection libraries would stream the metric measurements according to the schedules that the leaves sent.
Several problems.
First, the leaf-side data structures and TCP connections become very expensive. If that leaf process is connecting to many many many thousands of monitored entities, TCP buffers aren't free, keep-alives aren't free, and a host of other data structures. Eventually this became an...interesting...fraction of the CPU and RAM on these leaf processes.
Second, this implies a service discovery mechanism so that the leaves can find the entities to monitor. This was a combination of code in Monarch and an external discovery service. This was a constant source of headaches an outages, as the appearance and disappearance of entities is really spiky and unpredictable. Any burp in operation of the discovery service could cause a monitoring outage as well. Relatedly, the technical "powers that be" decided that the particular discovery service, of which Monarch was the largest user, wasn't really something that was suitable for the infrastructure at scale. This decision was made largely independently of Monarch, but required Monarch to move off.
Third, Monarch does replication, up to three ways. In the pull-based system, it wasn't possible to guarantee that the measurement that each replica sees is the same measurement with the same microsecond timestamp. This was a huge data quality issue that made the distributed queries much harder to make correct and performant. Also, the clients had to pay both in persistent TCP connections on their side and in RAM, state machines, etc., for this replication as a connection would be made from each backend leaf processes holding a replica for a given client.
Fourth, persistent TCP connections and load balancers don't really play well together.
Fifth, not everyone wants to accept incoming connections in their binary.
Sixth, if the leaf process doesn't need to know the collection policies for all the clients, those policies don't have to be distributed and updated to all of them. At scale this matters for both machine resources and reliability. This can be made a separate service, pushed to the "edge", etc.
Switching from a persistent connection to the clients pushing measurements in distinct RPCs as they were recorded eventually solved all of these problems. It was a very intricate transition that took a long time. A lot of people worked very hard on this, and should be very proud of their work. I hope some of them jump in to the discussion! (At very least they'll add things I missed/didn't remember... ;^)