Do you think it's good enough to replace Graphite's Carbon at this point? Not as a drop-in replacement within Graphite, but as a backend to a custom metrics system?
I know your project is young and probably has not seen much battle testing, but your blog post indicates to me that you've put a lot of thought into it being robust.
We are using Carbon for our metrics solution at the moment, and I've read its source and it's not something I'd give a big 'ready for production' stamp even though I know many shops are using it in production.
Perhaps, if you feel like it and will entertain my cheap grab for information, could you give a super small explanation of the performance differences between your partition-style format and, for example, Whisper (like RRD; it's what Carbon uses) and InfluxDB? As far as I understand, Whisper is simply a cyclic buffer of fixed-time-distance points in a file per series. And InfluxDB is simply a key-value store, I think.
Your solution lies somewhere in between those right?
InfluxDB uses Bolt in the latest version. Yes, it's a key-value store. See my other comment below about why I'm not using Bolt.
I'm not too familiar with Whisper, but since it uses fixed-size storage, it's not for my use case. I have a large number of time series metrics (>10,000), and many of them end up being sparse, with only a few irregular points. Partitioning by time instead of by metric means I don't have to deal with thousands of files, especially ones with a large, fixed size.
I'm very focused on making this "production ready." It's currently used for another project of mine[0], so I haven't been looking at integrating it into other systems.
Carbon uses a custom format called Whisper, which was written by Chris Davis because you can't backfill past data into RRD.
FYI, Graphite supports pluggable storage backends such as InfluxDB. Personally, I'm pretty invested in Cyanite, which is a small shim on top of Cassandra that speaks the Carbon line protocol.
Once this is semi prod-worthy, I'd be happy to see if we can't write one of the pluggable graphite backends for it, which also works with graphite-api if you want a slightly easier-to-set-up, Flask-based version.
The number of small writes it does is murder regardless of your disk subsystem (even with SSDs: your best case is that you waste lots of time on context switches instead of waiting on disks, but you're still wasting lots of resources).
I'm curious why Bolt isn't a good fit for your time series project? It has excellent iteration support, supports concurrent readers, is already mmapped, and can hold multiple "namespaces" through the use of buckets.
Bolt doesn't support compression unless you implement it yourself (i.e. in your application layer). Bolt also does not support concurrent writers. I don't need MVCC nor transactions, which have overhead. Finally, tree-based systems aren't very good at doing large, sequential reads.
Edit: Again, not exactly sure about the downvoting, so here is a better explanation. If I'm wrong, please let me know!
Let's say you insert many points for a few metrics in time order. Because of the way B-tree page allocation works, you can't guarantee that the points for a metric will be stored in adjacent pages on disk (correct me if I'm wrong). I want the guarantee that if I'm going to read points sequentially off a disk, I will do so without seeking around to different pages. Catena does that, and keeps points very compact.
When I used Bolt with a single key-value pair per point, I saw disk usage that was something like 50 bytes per point (on average). With Catena's file partitions with compression, I see around 3 bytes per point (on average). These points are 8 byte timestamps with 4 byte values. That's a huge benefit.
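The exact encoding isn't spelled out in the thread, but here is a rough sketch of why ~3 bytes per point is plausible for 8-byte timestamps with 4-byte values: delta-encode the timestamps, XOR the float bits of consecutive values, varint-encode both, and run the result through a general-purpose compressor. This is illustrative only, not Catena's actual format.

    // Sketch only: NOT Catena's actual on-disk format. It just illustrates why
    // a purpose-built layout can land far below 12 bytes per raw point.
    package main

    import (
    	"bytes"
    	"compress/gzip"
    	"encoding/binary"
    	"fmt"
    	"math"
    )

    type point struct {
    	Timestamp int64
    	Value     float32
    }

    func encode(points []point) []byte {
    	var raw bytes.Buffer
    	var prevTS int64
    	var prevVal float32
    	tmp := make([]byte, binary.MaxVarintLen64)

    	for i, p := range points {
    		if i == 0 {
    			raw.Write(tmp[:binary.PutVarint(tmp, p.Timestamp)])
    			binary.Write(&raw, binary.LittleEndian, p.Value)
    		} else {
    			// Regular intervals make the timestamp delta tiny and repetitive.
    			raw.Write(tmp[:binary.PutVarint(tmp, p.Timestamp-prevTS)])
    			// XOR of float bits is mostly zero for slowly changing values.
    			x := math.Float32bits(p.Value) ^ math.Float32bits(prevVal)
    			raw.Write(tmp[:binary.PutUvarint(tmp, uint64(x))])
    		}
    		prevTS, prevVal = p.Timestamp, p.Value
    	}

    	// A general-purpose compressor squeezes out the remaining repetition.
    	var out bytes.Buffer
    	gz := gzip.NewWriter(&out)
    	gz.Write(raw.Bytes())
    	gz.Close()
    	return out.Bytes()
    }

    func main() {
    	points := make([]point, 10000)
    	for i := range points {
    		points[i] = point{Timestamp: int64(i) * 15, Value: 42}
    	}
    	enc := encode(points)
    	fmt.Printf("%.2f bytes/point\n", float64(len(enc))/float64(len(points)))
    }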
To be clear, I'm not saying you should have used it, just that some of the functionalities are already there (and I love Bolt so I easily see parallels). It's always interesting to implement stuff from scratch, if only to understand how it works.
Also, I'm only talking about using Bolt as the on-disk partition store, i.e. the read-only partitions. You'd only write them when flushing the in-memory partitions, so there's no need for concurrent writers.
> tree-based systems aren't very good at doing large, sequential reads.
It all depends on the implementation you've chosen. B+trees exist exactly for efficient sequential reads, and Bolt uses one. (Take a look at this [0]: 40 ns/op per k/v, and putting the database in memory doesn't even improve it).
Trees also allow you to search for specific timestamps, so you can seek straight to the start of a time range instead of scanning from the beginning.
I like Bolt too. It was my first choice :). Using it for read-only partitions sounds like a great idea. I think the current implementation looks like a tree if you "squint" a bit. There's a hierarchy of metrics and sources, and the arrays of points are like the leaf nodes.
You're very correct about B+trees being efficient with sequential reads. We use MySQL with InnoDB for all of our time series where I work[0]. I started this way just to get a good understanding of what works and what doesn't. I think I'm on track to implementing something like a B+tree by using smaller extents of points.
Bolt author here. I don't think I'd agree that Bolt isn't good for time series. In fact, it has fast appends and fast sequential reads, so it works really well for time-sorted data. Having sequential page ordering (like Catena does) gives a huge boost when running on spinning disk, but it's much less of an issue on SSDs.
All that being said, I do like the approach of Catena. I think it works well for best-case scenarios, and you can get some large perf gains from tailoring the format to a specific type of data. However, some of the trade-offs, such as dropping writes after the in-memory window, are not going to be acceptable to many people.
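As a rough illustration of that trade-off (hypothetical types and names, not Catena's actual code): once everything older than the in-memory window has been flushed to read-only partitions, writes for that range have nowhere to go and get rejected.

    // Hypothetical sketch of the trade-off described above. Not Catena's code.
    package main

    import (
    	"errors"
    	"fmt"
    )

    type db struct {
    	newestTimestamp int64 // newest timestamp seen so far
    	inMemoryWindow  int64 // how far back the writable partitions reach
    }

    var errTooOld = errors.New("point is older than the in-memory window; dropped")

    func (d *db) insert(ts int64) error {
    	if ts < d.newestTimestamp-d.inMemoryWindow {
    		// The partition covering ts is already read-only on disk.
    		return errTooOld
    	}
    	if ts > d.newestTimestamp {
    		d.newestTimestamp = ts
    	}
    	// ... append to the appropriate in-memory partition here ...
    	return nil
    }

    func main() {
    	d := &db{newestTimestamp: 1000, inMemoryWindow: 300}
    	fmt.Println(d.insert(900)) // accepted: within the writable window
    	fmt.Println(d.insert(600)) // rejected: older than newest - window
    }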
As far as disk usage goes, it sounds like you could improve the utilization by increasing Bucket.FillPercent to 1.0. By default Bolt splits pages in half since it assumes random writes. You can do some optimizations if you're writing in order.
Another option that'd be interesting to look at would be doing Catena on top of Bolt. You can write your large, compressed Catena-ized data as a single value in Bolt and it'll be laid out sequentially on disk. I know you mentioned that you don't need transactions but I find them to be really useful once you start distributing data across a cluster and need some guarantees about writes.
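A minimal sketch of that idea, assuming the standard boltdb/bolt API: write each flushed, compressed partition as a single value keyed by its start timestamp, and raise FillPercent to 1.0 since keys arrive in order. The compression step and key layout are assumptions for illustration, not anything Catena actually does.

    // Sketch: one compressed partition per Bolt value, written in key order.
    package main

    import (
    	"encoding/binary"
    	"log"

    	"github.com/boltdb/bolt"
    )

    func flushPartition(db *bolt.DB, startTS int64, compressed []byte) error {
    	return db.Update(func(tx *bolt.Tx) error {
    		b, err := tx.CreateBucketIfNotExists([]byte("partitions"))
    		if err != nil {
    			return err
    		}
    		// Keys arrive in order, so pack pages fully instead of splitting
    		// them in half as Bolt does by default for random inserts.
    		b.FillPercent = 1.0

    		key := make([]byte, 8)
    		binary.BigEndian.PutUint64(key, uint64(startTS))
    		// One large value per partition gets laid out sequentially on disk.
    		return b.Put(key, compressed)
    	})
    }

    func main() {
    	db, err := bolt.Open("partitions.db", 0600, nil)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()

    	if err := flushPartition(db, 1427846400, []byte("...compressed points...")); err != nil {
    		log.Fatal(err)
    	}
    }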
I read through your post, and I'm curious: what is the reasoning behind splitting metrics across time partitions, rather than partitioning by metric itself?
My use case involves having many, many metrics. I have at least 500 metrics per source, and potentially hundreds of sources.
It's also easier to handle retention the way Catena does it because it only involves deleting the oldest file.
A lot of systems have a file per metric. With the data I work with, this is an operational nightmare. Dropping old data becomes difficult, and getting points from lots of metrics for a certain time range gets difficult too. Imagine having a dashboard with lots of metrics displayed. You're probably going to be looking at points from the same time range across metrics. That scenario is a lot better if you only have to read through a single file instead of, say, a few dozen.
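Here is a small sketch of why retention stays simple with time-based partitions: each partition is a file covering a fixed time range, so enforcing retention is just unlinking the oldest files. The file naming and directory layout are hypothetical, not Catena's actual scheme.

    // Sketch: retention = delete the oldest partition files, one unlink each,
    // regardless of how many metrics those partitions contain.
    package main

    import (
    	"fmt"
    	"os"
    	"path/filepath"
    	"sort"
    )

    // dropOldest keeps the newest `keep` partition files and deletes the rest.
    func dropOldest(dir string, keep int) error {
    	files, err := filepath.Glob(filepath.Join(dir, "*.part"))
    	if err != nil {
    		return err
    	}
    	// Assumes file names sort chronologically (e.g. zero-padded start times).
    	sort.Strings(files)
    	if len(files) <= keep {
    		return nil
    	}
    	for _, f := range files[:len(files)-keep] {
    		if err := os.Remove(f); err != nil {
    			return err
    		}
    	}
    	return nil
    }

    func main() {
    	// Keep the 7 newest daily partitions; everything older goes away.
    	if err := dropOldest("./partitions", 7); err != nil {
    		fmt.Println(err)
    	}
    }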
This is a pleasant, self-honest discussion of purpose-driven software as it is being implemented. The language is irrelevant; it's fun to read this type of piece.
Thank you! You're right; the language is irrelevant. I wrote this in Go because the parent project is written in Go. I'm looking forward to writing more of this blog series.
Hi, Prometheus[0] author here. Thanks for the interesting article!
Since I was curious how this compares to Prometheus's internal storage for writes, I whipped up some (disclaimer: very naive and ad-hoc!) benchmarks[1] to get a rough feeling for Catena's performance. I am not achieving a lot of write performance with it yet, but maybe I'm doing something wrong or using it inefficiently. Some questions to investigate would be: what's the best number of rows to batch in one insert, and are timestamps in seconds, milliseconds, or essentially only user-interpreted (I noticed the partitioning at least depends heavily on the interval between timestamps)? So far I've just done a tiny bit of fiddling and results haven't changed dramatically.
The benchmark parameters:
* writing 10000 samples x 10000 metrics (100 million data points)
* initial state: empty storage
* source names: constant "testsource" for all time series
* metric names: "testmetric_<i>" (0 <= i < 10000)
* values: the metric index <i> (constant integer value within each series)
* timestamps: starting at 0 and increasing by 15 seconds for every iteration
* Other machine load: SoundCloud playing music in the background
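For reference, those parameters boil down to a generation loop roughly like the one below; the insert callback is a hypothetical stand-in for whichever client API is being benchmarked (see [1] for the actual benchmark code).

    // Rough reconstruction of the data implied by the parameters above.
    package main

    import "fmt"

    const (
    	numMetrics       = 10000
    	samplesPerMetric = 10000
    	step             = 15 // seconds between samples
    )

    func runBenchmark(insert func(source, metric string, ts int64, value float64)) {
    	for s := 0; s < samplesPerMetric; s++ {
    		ts := int64(s) * step
    		for i := 0; i < numMetrics; i++ {
    			// Constant source, per-metric name, constant value per series.
    			insert("testsource", fmt.Sprintf("testmetric_%d", i), ts, float64(i))
    		}
    	}
    }

    func main() {
    	var count int
    	runBenchmark(func(source, metric string, ts int64, value float64) {
    		count++ // a real benchmark would batch these into bulk inserts
    	})
    	fmt.Println(count, "data points") // 100,000,000
    }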
The benchmark results:
#### Prometheus ####
(GOMAXPROCS=4 go run prometheus_bench.go -num-metrics=10000 -samples-per-metric=10000)
Time: 1m26s
Space: 138MB
#### Catena ####
(GOMAXPROCS=4 go run catena_bench.go -num-metrics=10000 -samples-per-metric=10000)
Time: 1h25m
Space: 190MB
So in this particular benchmark Catena took 60x longer and used 1.4x more space.
Please don't take this as discouragement or a statement on one being better than the other. Obviously Catena is very new and also probably optimized for slightly different use cases. And possibly I'm just doing something wrong (please tell me!). I also haven't dug into possible performance bottlenecks yet, but I saw it utilize 100% of all 4 CPU cores the entire time. In any case, I'd be interested in a set of benchmarks optimized specifically for Catena's use case.
Unfortunately we also haven't fully documented the internals of Prometheus's storage yet, but a bit of background information can be found here: http://prometheus.io/docs/operating/storage/ Maybe that's worth a blog post sometime.
Thanks for trying it out! I haven't had time to run any benchmarks, so I really appreciate you taking the time to do this (especially since it took so long!).
I'm not sure what the best batch size is at the moment. Timestamps are int64s, and it's up to the user to interpret them as they wish. Partition sizes are in terms of the number of timestamps. If you had timestamps which correspond to seconds, and you wanted each partition to be 1 day, you'd choose 86400. This isn't configurable yet unless you modify the source.
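A tiny worked example of that: with second-resolution timestamps and a partition size of 86400, each partition covers one day.

    package main

    import "fmt"

    const partitionSize = 86400 // timestamps per partition (one day of seconds)

    func main() {
    	for _, ts := range []int64{0, 86399, 86400, 172800} {
    		fmt.Printf("timestamp %6d -> partition %d\n", ts, ts/partitionSize)
    	}
    	// timestamp      0 -> partition 0
    	// timestamp  86399 -> partition 0
    	// timestamp  86400 -> partition 1
    	// timestamp 172800 -> partition 2
    }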
I'm not surprised it's that slow. I'm not a storage engine expert (still a college student), and this contains my first lock-free list implementation :). There is a lot of silliness going on with inserts, like launching a goroutine for each metric on insert[0], using lock-free O(n) lists for sources and metrics[1] when I could have used a map, a leftover lock that should be removed[2], and a bunch of other things.
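One of those easy wins, sketched out: look metrics up in a map keyed by name instead of scanning an O(n) list on every insert. Hypothetical types, not Catena's actual code, and a real version would need a mutex (or a lock-free equivalent) around the map for concurrent inserts.

    package main

    import "fmt"

    type metric struct {
    	name   string
    	points []float64
    }

    type source struct {
    	metrics map[string]*metric // O(1) average lookup vs. walking a list
    }

    func (s *source) metricFor(name string) *metric {
    	m, ok := s.metrics[name]
    	if !ok {
    		m = &metric{name: name}
    		s.metrics[name] = m
    	}
    	return m
    }

    func main() {
    	s := &source{metrics: map[string]*metric{}}
    	m := s.metricFor("cpu.user")
    	m.points = append(m.points, 0.42)
    	fmt.Println(len(s.metricFor("cpu.user").points))
    }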
On an unrelated note, I see that someone from Prometheus will be at Monitorama. I'll be there too, so I'd love to talk to you guys some more.
Ah ok, so some obvious things left to optimize then :) Thanks for the explanations!
As far as I know, none of the Prometheus core developers will be at Monitorama this year. But we'll be at SRECon Dublin, DevOpsCon Berlin, GopherCon Denver, and hopefully also at one (or both) of the VelocityConfs later this year.
Indeed, it would be great to catch up in person if you're at any of these.
Edit: I'm not sure why I'm being downvoted... I'm not home at the moment, so I'm trying my best to answer using my phone.
Edit #2: Back home with a full-size QWERTY keyboard :).