A huge difference between Monarch and other TSDBs that isn't outlined in this overview is that one of its storage primitives for schematized values is the histogram. Most TSDBs (maybe all besides Circonus) instead try to construct histograms at query time from counter primitives.
All of those query-time histogram aggregations make pretty subtle trade-offs that make analysis fraught.
In my experience, Monarch storing histograms and being unable to rebucket on the fly is a big problem. A percentile line on a histogram can be incredibly misleading, because it's trying to estimate, say, the p50 from a handful of coarse buckets. You'll see monitoring artifacts like large jumps and artificial plateaus purely as a result of how requests fall into buckets. The bucketer on the default RPC latency metric might not be well tuned for your service. I've seen countless experienced oncallers tripped up by this, because "my graphs are lying to me" is not their first thought.
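For anyone who hasn't hit this: a minimal sketch (not Monarch's actual code) of how a percentile line typically gets derived from bucket counts, which is where the plateaus and jumps come from. The boundaries and counts below are made up.

    # Sketch of percentile estimation from a bucketed histogram: find the bucket
    # containing the target rank, then linearly interpolate inside it. Everything
    # about where samples fall *within* a bucket is already gone.
    def estimate_quantile(bounds, counts, q):
        """bounds: upper bucket boundaries; counts: per-bucket counts; q in [0, 1]."""
        total = sum(counts)
        if total == 0:
            return None
        target = q * total
        cumulative, lower = 0, 0.0
        for upper, count in zip(bounds, counts):
            if count and cumulative + count >= target:
                # Assumes samples are spread evenly within the bucket,
                # which is rarely true for latency distributions.
                fraction = (target - cumulative) / count
                return lower + fraction * (upper - lower)
            cumulative += count
            lower = upper
        return bounds[-1]

    # A too-coarse bucketer for a service whose requests all take ~40-60 ms:
    # every sample lands in the (10, 100] bucket, so p50 and p99 are both just
    # interpolated points inside that one bucket and the graph plateaus.
    bounds = [1, 10, 100, 1000, 10000]   # ms
    counts = [0, 0, 5000, 0, 0]
    print(estimate_quantile(bounds, counts, 0.50))   # 55.0
    print(estimate_quantile(bounds, counts, 0.99))   # ~99.1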
Wow, this is a fantastic answer to some questions I've had rattling around in my head for years about how to choose bucket boundaries to minimize error given a fixed number of buckets.
Do I read right that circllhist uses a pretty large number of bins and is not configurable (except that they're sparse, so they may be small on disk)?
I've found myself using high-cardinality Prometheus metrics where I can only afford 10-15 distinct histogram buckets. So I end up:
(1) plugging my live system data from normal operations and from outage periods into various numeric algorithms that propose optimal bucket boundaries. These algorithms tell me that I could get great accuracy if I chose thousands of buckets, which, thanks for rubbing it in about my space problems :(. Then I write some more code to collapse those into 15 buckets while minimizing error at the places I care about (like p50, p95, p99, p999 under normal and under irregular operations), roughly the workflow sketched after this list.
(2) making sure I have an explicit bucket boundary at any target that represents a business objective (if my service promises no more than 1% of requests will take >2500ms, setting a bucket boundary at 2500ms gives me perfectly precise info about whether p99 falls above/below 2500ms)
(3) forgetting to tune this and leaving a bunch of bad defaults in place which often lead to people saying "well, our graph shows a big spike up to 10000ms but that's just because we forgot to tune our histogram bucket boundaries before the outage, actually we have to refer to logs to see the timeouts at 50 sec"
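Not my actual tooling, just a rough sketch of the workflow in (1) and (2): start from a fine candidate boundary list, force-keep the business-objective thresholds, bias toward the quantiles you care about, and thin the rest down to the budget. All names and numbers here are made up for illustration.

    def thin_boundaries(candidates, must_keep, budget, reference_samples, quantiles):
        """Pick roughly `budget` boundaries from `candidates`, always keeping
        `must_keep` and preferring boundaries near the given quantiles of
        `reference_samples`."""
        candidates = sorted(set(candidates) | set(must_keep))
        keep = set(must_keep)

        # Prefer candidates closest to the quantiles we care about.
        samples = sorted(reference_samples)
        for q in quantiles:
            value = samples[min(int(q * len(samples)), len(samples) - 1)]
            keep.add(min(candidates, key=lambda b: abs(b - value)))

        # Fill the remaining budget with evenly spaced picks from the candidates.
        step = max(1, len(candidates) // budget)
        for b in candidates[::step]:
            if len(keep) >= budget:
                break
            keep.add(b)
        return sorted(keep)

    # The 2500 ms SLO boundary survives no matter what, so "fraction of requests
    # above 2500 ms" stays exact even after thinning to 15 buckets.
    fine = [float(x) for x in range(10, 10001, 10)]   # 1000 candidate bounds (ms)
    normal_latencies = [40, 55, 60, 80, 95, 120, 150, 400, 900, 2400] * 50
    print(thin_boundaries(fine, must_keep=[2500.0], budget=15,
                          reference_samples=normal_latencies,
                          quantiles=[0.5, 0.95, 0.99, 0.999]))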
Prometheus is in the process of developing a similar automatic log-linear histogram bucket type. The goal is to make it as cheap as a 10-15 bucket histogram, but not require pre-defined buckets.
Correct me if I'm wrong, but I thought the sparse histogram effort in Prometheus will still use individual counter series as its storage abstraction?
I think it's a great addition to the product and will excitedly use it, but it's a pretty big difference from a histogram-centric DB like Circonus's, or a schema'd one like Monarch.
I've used these log-linear histograms in a few pieces of code. There is some configurability in the abstract: you could choose a different logarithm base.
In practice none of the implementations seem to provide that. Within each set of buckets for a given log base you have reasonable precision at that magnitude. If your metric is oscillating around 1e6 you shouldn't care much about the variance at 1e2, and with this scheme you don't have to tune anything to get that.
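To make that concrete, a tiny sketch of generic log-linear bucketing (not any particular library's layout): the base and the number of linear sub-buckets per power of the base are parameters in the abstract, even if shipped implementations hard-code them. With base 10 and 90 sub-buckets this roughly mirrors a two-significant-digit scheme.

    import math

    # Generic log-linear bucketing: a value maps to (exponent, linear sub-bucket)
    # for a given base and sub-bucket count, so relative precision is constant
    # across magnitudes. Edge cases around exact powers of the base are glossed
    # over here.
    def log_linear_bucket(value, base=10, subbuckets=90):
        """Return (exponent, sub-bucket index) for a positive value."""
        exponent = math.floor(math.log(value, base))
        scaled = value / base ** exponent               # in [1, base)
        sub = int((scaled - 1) / (base - 1) * subbuckets)
        return exponent, min(sub, subbuckets - 1)

    # 47 and 470 land in the same sub-bucket of adjacent decades: the same
    # relative resolution at 1e1 as at 1e6, with nothing to tune per metric.
    for v in [3.2, 47.0, 470.0, 1.6e6]:
        print(v, log_linear_bucket(v))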
There are a large number of subtle tradeoffs around the bucketing scheme (log vs. log-linear, the base), the memory layout (sparse, dense, chunked), and the amount of configurability across the histogram space (circllhist, DDSketch, HDRHistogram, ...). A good overview is this discussion here:
As for the circllhist: There are no knobs to turn. It uses base 10 and two decimal digits of precision. In the last 8 years I have not seen a single use-case in the operational domain where this was not appropriate.
Right, that's why I said "in the abstract." You could do it and still have a log-linear format. Base 10 works great for real-world distributions.
Thanks for making and open-sourcing circllhist. I've been close to the whole "what's the OTel histogram going to be" discussion for the last many months and learned a lot from it. That discussion is what introduced me to circllhist and got me using them.
I definitely remember a lot of time spent tweaking histogram buckets for performance vs. accuracy. The default bucketing algorithm at the time was powers of 4 or something very unusual like that.
It's because powers of four worked great for the original application: statistics on high-traffic services where the primary thing the user was interested in was deviations from the norm, and with a high-traffic system the signal for what the norm is would be very strong.
I tried applying it to a service with much lower traffic and found the bucketing to be extremely fussy.
My personal opinion is that they should have used a log-linear histogram, which solves the problems you mention (with other trade-offs), but to me the big news was making the DB flexible enough to have that data type.
Leaving the world of a single numeric type for each datum will influence the next generation of open-source metrics DBs.
Yeah it was a tough tradeoff for the default case, because the team didn't want to use too much memory in everyone's binary since the RPC metrics were on by default. This is easily changeable by the user if necessary, though.
The histograms are useful on their own (visualized as a heatmap). If percentile lines are necessary (and they often aren't), I prefer to overlay them on top of the heatmap so it is clear where the bucket edges are.
I've been pretty happy with Datadog's distribution type [1], which uses their own approximate histogram data structure [2]. I haven't evaluated their error bounds deeply in production yet, but I haven't had to tune any bucketing. The linked paper [3] claims a fixed relative-error bound per percentile.
That is a very different tradeoff, though. A DDSketch is absolutely gigantic compared to a power-of-four binned distribution that could be implemented as a vector of integers. A practical DDSketch will be 5KiB+. And when they say DDSketch merges are "fast" they are comparing to other sketches that take microseconds or more to merge, not to CDF vectors that can be merged literally in nanoseconds.
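For contrast, a sketch of what a merge looks like when every producer shares the same fixed boundaries (counts and boundaries made up): it's an element-wise add over small integer vectors, which is why it can be so much cheaper than merging sketches.

    # Merging fixed-boundary histograms is just an element-wise add. Real
    # implementations do this over raw integer arrays, which is where the
    # nanosecond-scale merges come from.
    def merge_counts(*histograms):
        return [sum(bucket) for bucket in zip(*histograms)]

    # Per-task counts for shared power-of-four boundaries (1, 4, 16, 64, ... ms).
    task_a = [12, 340, 1190, 87, 3, 0, 0]
    task_b = [ 9, 280, 1410, 120, 7, 1, 0]
    print(merge_counts(task_a, task_b))   # [21, 620, 2600, 207, 10, 1, 0]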
Prometheus is adding sparse histograms. There are a couple of online talks about it already, but one of the maintainers, Ganesh, is giving a talk on it at KubeCon next week if anyone is attending and curious about it.
Wavefront also has histogram ingestion (I wrote the original implementation, I'm sure it is much better now). Hugely important if you ask me but honestly I don't think that many customers use it.
Granted, it looks like Monarch supports a more cleanly-defined schema for distributions, whereas Prometheus just relies on you to define the buckets yourself and follow the convention of using a "le" label to expose them. But the underlying representation (an empirical CDF) seems to be the same, and so the accuracy tradeoffs should also be the same.
Much different. When you are reporting histograms you can combine them and see the true p50 or whatever across all the individual systems reporting the metric.
Can you elaborate a bit? You can do the same in Prometheus by summing the bucket counts. Not sure what you mean by “true p50” either. With buckets it’s always an approximation based on the bucket widths.
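To illustrate the point about summing bucket counts (with made-up series): when boundaries are identical, the combined "fraction of requests <= bound" is exact at every boundary across instances; only values between boundaries have to be interpolated, so a fleet-wide p50 is still bucket-width-limited.

    # Prometheus-style cumulative ("le") bucket counts from two instances with
    # identical boundaries: aggregation across instances is just a sum, and the
    # CDF at each boundary stays exact. (+Inf bucket omitted; assume nothing
    # exceeded 2.5 s.)
    bounds     = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5]      # seconds ("le" values)
    instance_a = [ 130, 410,  890, 1180, 1240, 1250]   # cumulative counts
    instance_b = [  60, 150,  300,  700,  940,  950]   # cumulative counts

    combined = [a + b for a, b in zip(instance_a, instance_b)]
    total = combined[-1]
    for le, cum in zip(bounds, combined):
        print(f"P(latency <= {le}s) = {cum / total:.3f}")   # exact at boundaries

    # A fleet-wide p50 still has to be interpolated between the two boundaries
    # that bracket rank total/2, so it remains an approximation.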
Ah, I misunderstood what you meant. If you are reporting static buckets, I get how that is better than what folks typically do, but how do you know the buckets a priori? Others back their histograms with things like https://github.com/tdunning/t-digest. It is pretty powerful, as the buckets are dynamic based on the data and histograms can be added together.
Yes. This. Also, displaying histograms in heatmap format can allow you to intuit the behavior of layered distributed systems, caches, etc. Relatedly, exemplars allowed tying related data to histogram buckets. For example, RPC traces could be tied to the latency bucket & time at which they complete, giving a natural means to tie metrics monitoring and tracing, so you can "go to the trace with the problem". This is described in the paper as well.