Gorilla: A Fast, Scalable, In-Memory Time Series Database (2015) [pdf] (vldb.org)
125 points by kiril-me on Dec 23, 2016 | 56 comments


I read this paper a few months ago and summarized it for my team. Sharing the notes here in case they're helpful to others:

- when storing time series keys, you can save a lot of space by encoding them as a first timestamp followed by timestamp offsets (deltas)

- when storing time series values, you can save a lot of space by realizing sequential data points tend not to be volatile... e.g. a "writes per minute" series is more likely to be 100, 99, 101, etc. than it is to be 100, 999999, 1, etc. This means you can encode the current value by XORing it with the prior value and storing only the few bits that differ (see the sketch at the end of this comment)

- the authors suggest using in-memory caching for recent data, but doing eventual persistence to HBase or similar distributed storage; thanks to this, you can get real-time operational metrics that are fast and space efficient, yet still have a system that scales out horizontally for historical storage

The Morning Paper also did a good analysis here: https://blog.acolyer.org/2016/05/03/gorilla-a-fast-scalable-...
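A minimal sketch of the XOR idea in Python (my illustration, not Gorilla's actual bit format, which also tracks a leading/trailing-zero window and variable-length control bits; repeated values XOR to zero and end up costing a single bit):

  import struct

  def float_bits(x):
      # Reinterpret a float64 as a 64-bit unsigned integer.
      return struct.unpack('>Q', struct.pack('>d', x))[0]

  def xor_stream(values):
      # XOR each value's bit pattern with the previous value's. Neighbouring
      # values that are close share the sign, exponent and high mantissa bits,
      # so the XOR is mostly zeros and packs into a handful of meaningful bits.
      prev = float_bits(values[0])
      out = []
      for v in values[1:]:
          cur = float_bits(v)
          out.append(prev ^ cur)
          prev = cur
      return out

  # A "writes per minute" style series: neighbours are close in value.
  for x in xor_stream([100.0, 99.0, 101.0, 101.0, 102.0]):
      print(f"{x:064b}".strip('0') or '0')  # only the middle bits carry information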


Those encoding tricks go back to research from the 1960s and even earlier. For example, if you expect the delta encoding to produce mostly small values, it can be followed by Rice-Golomb coding [1], which further reduces the number of bits needed for each value.

[1] https://en.wikipedia.org/wiki/Golomb_coding
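A minimal Rice-coding sketch in Python, assuming non-negative deltas (a real encoder also has to handle signs and choose k from the data):

  def rice_encode(n, k):
      # Rice coding = Golomb coding with divisor m = 2**k: the quotient goes
      # out in unary, then a terminating 0, then k remainder bits. Small
      # values -- which good delta encoding produces -- get short codes.
      q = n >> k
      r = n & ((1 << k) - 1)
      return '1' * q + '0' + (format(r, 'b').zfill(k) if k else '')

  # Deltas from a mostly regular series cluster near zero:
  print([rice_encode(d, k=1) for d in [0, 1, 0, 3, 0, 0, 2]])
  # ['00', '01', '00', '101', '00', '00', '100']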


Good tips. On that note, you really don't have to use HBase or some always-on managed filesystem. You can actually use S3! I know this sounds weird, as you'd expect to rack up API costs way faster than storage costs, but we implemented a simple batching approach for time series data that could handle 100M+ messages for $10/day (100GB+ of data; that includes all costs: processing, storage, backup). Did a 2-min screencast here: https://www.youtube.com/watch?v=x_WqBuEA7s8 . Hopefully this can be a lifesaver on costs for some people, and it keeps your system simpler overall.


This is neat! S3 is always my first choice when it comes to scalable cheap storage. The price and scalability beat almost everything in this category. Reliability is unprecedented as well. I am wondering what your file format was for storing the TS data.


> when storing time series keys, you can save a lot of space by encoding them as a first timestamp followed by timestamp offsets (deltas)

It's actually a delta of deltas, which means that for regular time series a huge portion of stored values will be 0 (more than 96% according to the paper), because all events arrive at a fixed interval; and if there's a slight delay in either direction, the stored value is still small in most cases.
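A small sketch of what that means, assuming integer (second-resolution) timestamps and a 60s interval; this is just the arithmetic, not the paper's exact header and bit layout:

  def delta_of_deltas(timestamps):
      # Store the first timestamp as-is; every later point stores
      # D = (t[n] - t[n-1]) - (t[n-1] - t[n-2]), which is 0 whenever points
      # keep arriving at the same fixed interval.
      out = [timestamps[0]]
      prev, prev_delta = timestamps[0], 0
      for t in timestamps[1:]:
          delta = t - prev
          out.append(delta - prev_delta)
          prev, prev_delta = t, delta
      return out

  # A 60s series where one point arrives a second late:
  print(delta_of_deltas([1450000000, 1450000060, 1450000120, 1450000181, 1450000241]))
  # [1450000000, 60, 0, 1, -1]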


Per-column choice of encoding or compression is now available in several popular datastores. Vertica by HP and Druid.IO are two that come to mind.


Not to bash Gorilla, because it looks like excellent stuff, but I do not understand the general obsession with specialized TS stores.

If you're operating on FB-scale, then sure that's what you have to do, but in most cases your database (especially Postgres) is a far superior option.

Time series is "unusual" in the sense that most people don't know the first thing about it, even folks with degrees in math/statistics. I think this is why there is the prevailing misconception that it requires a specialized database.

Incidentally, I've just written a blog post about storing TS in PostgreSQL: https://grisha.org/blog/2016/12/16/storing-time-series-in-po...


> If you're operating on FB-scale, then sure that's what you have to do, but in most cases your database (especially Postgres) is a far superior option.

> Time series is "unusual" in the sense that most people don't know the first thing about it, even folks with degrees in math/statistics. I think this is why there is the prevailing misconception that it requires a specialized database.

I'm one of the developers of Prometheus, and while I'm going to point appropriate use cases towards Postgres (and regularly do), even relatively small monitoring loads require careful handling such that a traditional database isn't suitable.

An example from a previous company: we had only ~50 machines in one datacenter and yet were up to 30,000 samples per second. That is not considered a large setup; a large one would be in the hundreds of thousands of samples per second.

All the buffering/batching required to make such loads practical requires special design, as naively making each new data point into a disk write (with fdatasync) will not work out - even with SSDs.
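Roughly what that buffering looks like (my sketch, not Prometheus's actual storage code): accumulate incoming samples in memory and sync once per interval rather than once per point.

  import os
  import queue
  import time

  def writer_loop(samples, path, flush_interval=1.0):
      # samples: a queue.Queue of already-serialized bytes.
      # One fdatasync per sample would mean ~30k syncs/s in the example above;
      # draining the queue and syncing once per interval needs ~1 sync/s, at
      # the cost of losing up to flush_interval worth of data on a crash.
      buf = []
      with open(path, 'ab') as f:
          while True:
              deadline = time.monotonic() + flush_interval
              while (left := deadline - time.monotonic()) > 0:
                  try:
                      buf.append(samples.get(timeout=left))
                  except queue.Empty:
                      break
              if buf:
                  f.write(b''.join(buf))
                  f.flush()
                  os.fsync(f.fileno())
                  buf.clear()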

And that's just writes. You also need to be able to efficiently read back and process the data when it is queried.

I'd suggest https://www.youtube.com/watch?v=HbnGSNEjhUc to give a look into how Prometheus does it (which also covers our use of Gorilla). There was also a good post on the InfluxDB blog about the evolution of their design and why the problem is hard that I can't seem to find right now.


This section in our (InfluxDB's) doc on the evolution of our storage engine highlights some of the issues: https://docs.influxdata.com/influxdb/v1.1/concepts/storage_e...


I think that this is an argument over apples and oranges. I'm well familiar with the write problem you're speaking of and even wrote a blog post a couple of years ago detailing Influx's storage code.

The "apples" use case is when you want for lack of a better description a Grafana chart out of it. (E.g. the cluster cpu load, "we're getting X page views per second", etc).

The "oranges" case is when every data point matters, and I think that this particular case is not so much about time series, but logging, where a log is a "time series" technically speaking. Here of course you're talking massive amounts of data which you want to do your best to optimize and it's a hard problem. It's not a _time series_ problem, it's a problem of storage. Splunk (sorry, can't think of a better example ATM) doesn't describe itself as a "times series database", and yet it's addressing that very problem of horizontally scalable write-intensive storage. So is hbase, cassandra, accumulo, bigquery and all their friends.

In my experience the best solution is to use two separate systems, one for each case. Because there is absolutely no value in a "disk used" measurement every millisecond - aggregated to once a minute (or a few) is perfectly fine, retaining the original data is not needed, and you'd be a fool to build a Cassandra cluster to handle it. On the other hand, if I'm recording equity bids and asks, well then I'd better record every point forever.


That's two separate dimensions.

One is metrics vs. event logging.

The other is consistency vs. availability.

There are many types of logs, and you don't need the same consistency for debug logs that you'd require for billing-related logs. Similarly, you may have billing-related metrics (e.g. bandwidth usage) where consistency is important.

> It's not a _time series_ problem, it's a problem of storage.

Technically anything with a time dimension is a time series, so Splunk is a time series database.

> In my experience the best solution is to use two separate systems for each case.

I agree completely. There's at least two general problems here that need different approaches once you get beyond trivial scale.


Re: your horizontally scalable, write-intensive use case - that's basically what Riak TS was designed to handle, but specifically with a temporally sorted data model. Riak TS just got a version bump too [0]. Because Riak TS is a derivative of Riak KV, it inherited the clustering, HA, bla, bla from day 0.

/disclaimer/ I work for Basho.

[0] http://docs.basho.com/riak/ts/1.5.0/releasenotes/


>30,000 samples per second.

Five years ago I was getting that many records written on one Oracle node with 2 CPUs and a bunch of regular HDDs.

>naively making each new data point into a disk write (with fdatasync) will not work out - even with SSDs.

Seems like you're trying to say that this situation requires 30K IOPS. That would be really naive :)

You have 50 writers sending 30K/sec total, i.e. 600/sec per writer. A typical RDBMS naturally batches parallel writes, so 3-4 HDDs providing 600+ IOPS in total would easily serve your situation.


> seems like you're trying to say that this situation requires 30K IOPS. That would be really naive :)

If you want fast reads, it'd take at least that many IOPS for a naive solution where each timeseries had its own block you were appending to.

> A typical RDBMS naturally batches parallel writes, so a bunch of 3-4 HDDs providing totally 600+ IOPS would easily serve your situation.

You could do that if you only cared about writes. There's two problems with such an approach.

First, reads would be unbearably slow: to read a single timeseries for an hour at, say, 10s resolution would take 360 operations (one per batch), or around 3.6 seconds presuming a 10ms seek time. Wanting to read a hundred timeseries at once over a day is not unusual, and that would take 8640 seconds.

Secondly, it's not likely to be good disk-space wise. As each point is individually written, you wouldn't be able to do intra-timeseries compression, so you're probably talking at least 16 bytes per sample - probably nearer 100-200 bytes if you write the metric name each time.

Contrast that with the approach taken by something like Prometheus. We build up 1kB chunks, which hold ~780 samples on average as Gorilla gets us ~1.3 bytes/sample. We batch these up per time series across several hours, so accessing 6 hours is only one disk operation. With the above example of reading 100 timeseries for a day that's only 4 seconds, which is 3 orders of magnitude faster than what simple write batching allows for.

Obviously this is ignoring many details like caching, but the general point holds.
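Spelling the comparison out with the same assumed numbers (10ms per random read, 10s resolution, ~6 hours of one series per chunk):

  SEEK = 0.010                               # assumed 10 ms per random read
  series, hours, resolution = 100, 24, 10    # 100 series, one day, 10 s samples

  # Naive write batching: each sample sits in whatever batch it arrived with,
  # so reading it back costs its own random read.
  samples = series * hours * 3600 // resolution
  print('scattered:', samples * SEEK, 's')   # 8640.0 s

  # Chunked layout: ~6 hours of one series per sequential chunk.
  chunks = series * (hours // 6)
  print('chunked:  ', chunks * SEEK, 's')    # 4.0 s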


That's a great talk, thanks for the link.


You don't have to be FB-scale to need something capable of handling a larger amount of timeseries data than postgresql can handle efficiently. 100 servers each producing 100 stats tagged 15 different ways aggregated into avg/50th/95th/min/max every 5 seconds is 540 million data points per hour (double check my math but it seems about right).


100 servers producing 100 stats is 10,000 series - the number of data points is a function of the resolution of the series; for example, at 5s they'll accumulate 17,280 per day per series, for a total of 172,800,000 data points/day, which is not a big number (relatively speaking). You probably also don't want to keep a 5s resolution forever: after a day you may want to store 1m resolution (reducing points 12x), then perhaps hourly after a week, daily after a month, etc.
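A sketch of that kind of downsampling (illustration only; real TSDBs typically run this as a background rollup job and keep min/max/percentiles alongside the average):

  def downsample(points, step):
      # points: (timestamp, value) pairs at fine resolution. Collapse to one
      # averaged point per step-second bucket, e.g. step=60 turns 5s data
      # into 1m data (the 12x reduction mentioned above).
      buckets = {}
      for t, v in points:
          buckets.setdefault(t - t % step, []).append(v)
      return sorted((t, sum(vs) / len(vs)) for t, vs in buckets.items())

  # One minute of 5s samples collapses to a single point:
  print(downsample([(i, float(i % 7)) for i in range(0, 60, 5)], step=60))
  # [(0, 3.0)]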


And this is exactly the misconception. In my use cases I want to have the original data for at least a couple of weeks, maybe even a year. I also need millisecond precision because some of the sources produce several values per second (others might not change for years). I also need to store whether the value had an alarm or an error, or whether it is a replacement value.

So a simple round robin scheme just doesn't work.


I wonder if you're confusing time series (which is a sequence of numbers which lends itself to stuff like regressions, etc, for predictions and other insights) with quite simply _logging_ so that you can go back a year later and figure out "what the hell happened then".


While metrics are often evenly spaced, that's not always the case.

Good resolution on the timestamps helps avoid graph artifacts and other weirdness, even if you're only collecting them once a minute. Millisecond accuracy seems to be sufficient in my experience; there are other timing races of that magnitude anyway (network jitter, kernel scheduling, etc.).


https://en.wikipedia.org/wiki/Unevenly_spaced_time_series You can call that _logging_ if you want, but people will ask for it anyway.


Interesting, but I'm not convinced. Everything is evenly spaced given the correct resolution. If your timestamp is at nanosecond precision, then you simply have a 1ns-resolution series which is very sparse; I suppose another name for that might be "unevenly spaced".


Need 5 seconds for at least one week.

One day is too short, I regularly have to look at the metrics from yesterday to debug an issue that we noticed today and they're worthless because they've been averaged.


> You probably also don't want to keep a 5s resolution forever.

Yeah, that's a feature provided by specialized TSDBs that you don't need to implement yourself.


It's not a misconception. If you need to store timeseries and actually do something substantive with them (time-oriented joins; calculating bid-ask spreads, order book math, correlation matrices, calculating betas), Postgres won't cut it, even on a relatively small data set, like all the OHLC daily stock market data for the last 30 years. Neither will any of the other open source solutions.


Interesting. It'd be very nice to use my preferred tool set (Postgres) as an alternative to RRDTool when doing datalogging. Do you have any comparisons for speed/accuracy/compactness vs. using RRDTool?


> I do not understand the general obsession with specialized TS stores

Specialized TS stores have existed for decades (process historians, tickerplants, and so on).


Did you compare the disk usage vs. whisper?


Whisper is 12 bytes per data point. Gorilla-style time series storage saves gobs of space over the Whisper format; there is simply little to no comparison when it comes to storing the same data in less space.

Disclaimer: I am a co-maintainer of graphite and try to write code for it when I have free time (so not much right now or for the past year or so)

https://github.com/graphite-project/whisper/commits?author=S...


In this case Tgres beats Whisper by quite a bit - it uses about 8 bytes per data point. In fact, looking at the numbers for a small test db, I have 2,057,275 points taking up 16,687,104 bytes, which comes out to 8.11 bytes per point.

BTW getting the total point count is as easy as:

  select sum(array_length(dp,1)) from ts;
If you adjust the width (i.e. points per table row) to a higher number than the default of 768, then as soon as you get above the PG page size, TOAST (https://www.postgresql.org/docs/current/static/storage-toast...) kicks in and PG compresses the data. Not sure how well this works, but at least in theory it should make things even more compact (though I'm not sure it would be as performant).


My guess is that whisper would be more efficient on disk, though not by much. But whisper won't let you

  SELECT ... 
    FROM tv
    JOIN customers ON ...
    JOIN purchases ON ...


There still isn't a leading time series database. I think the use case definitely deserves one, though. I've seen so many projects spend months building custom data rollup solutions; it's a waste of time. A good time series database would be pretty amazing.


Riak TS 1.5 just dropped this week [0]. Designed to do the horizontal scale thing over temporally sorted data.

/disclaimer/ I work for Basho

[0] http://docs.basho.com/riak/ts/1.5.0/releasenotes/


Have you guys considered using InfluxDB as a backend? InfluxDB has written at length about the difficulties of existing storage engines (including LevelDB).

My guess though is that Basho would opt to write their own backend based on InfluxDB's notes.

On a side note I find influxdb's use of nanosecond time precision spot on and Riak TS's millisecond resolution disappointing.


There are several, depending on the use case. Each makes different tradeoffs, so you have to decide what's important for you.

For example Prometheus (which I work on) is great at reliable monitoring and powerful processing of metrics at high volumes, but it'd be unwise to use it for event logging or customer billing.

If you're doing IoT or event logging then InfluxDB might be a good choice for you, though if you're doing more text-based logging then Elasticsearch is nearer to what you're looking for.

https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCE... is one comparison of the various open source options.


We're using Elasticsearch for event logging -- where "event" means analytics event, e.g. a page view -- and it's fantastic. The aggregation support is superb.

We initially used Influx, but it could not perform well at the time (0.8). Our events are also heavily label-based. Basically, we do ETL at the time of write, collecting multiple documents into one mega-event, which is a complex, nested JSON document. It may have perhaps 150-200 fields. A single event may be something like "clicked button X". By storing the original document, we can aggregate based on any field value, including text and scalar fields, without having to think about a schema or about planning ahead of time what fields should be indexed or not. ES handles the rest pretty well.

To do the same thing with Influx or Prometheus I suspect we'd have to reverse this and store the document as the labels, along with a single count (1) as the "metric". I don't know how well Influx etc. scale with number of unique label values, though I'd love to find out. The last time I read about this, I think they recommended not going overboard with them.

What's different with business analytics is that the end product is typically multidimensional rollup reports over large time windows (number of page views per customer per web property per month, comparing by 2015 vs 2016, for example), and it's almost all "group by count", sometimes "count distinct" or averages. Whereas "rate per second"-type metrics aren't used anywhere in our app, for example.


I also use Elasticsearch for time series site analytics use cases. I gave a talk at Elastic{on} about it last year.

Apologies for the email gateway for the video, but you can also see my slides here: https://www.elastic.co/elasticon/conf/2016/sf/web-content-an...

We found that as we scaled it up, we couldn't really keep the data in raw form, so we had to build rollup documents that cover 5-minute and 1-day buckets. Do you use the same trick, or is the number of pageview events for you manageable enough that you just keep it all raw?


We haven't reached that stage yet, fortunately; though at some point someone will want to do a big multi-year aggregation report across all indexes and still expect it to take not more than a few seconds.

My ideal solution would be one that rotated the dataset into historical rollups on a daily basis, so that we only stored the raw data for today, and gradually merged earlier entries at lower granularities. However, I haven't thought much about how to do that with Elasticsearch. I can see a way of doing it by embedding the value in the field label, and using the field value as a count, but Elasticsearch really doesn't like lots of unique fields; you shouldn't be using more than a few hundred at most in a single installation (across all indexes).


At that point, you can just use Apache Drill over raw JSON files:

https://drill.apache.org/


In addition, there is influx-mysql: https://github.com/philip-wernersbach/influx-mysql , if you're looking for a MySQL compatible solution.


There is, it just costs as much as a fully loaded engineer (about $150k/yr). It's called Kdb.


Or more, and means you have to learn the dreaded Q language.


Personally I find Q fine. It's a useful filter. If the language is too hard for someone to pick up, the person would struggle with the complexities of correctly designing or performing analytics on a high frequency distributed custom time series db anyway.


While it's a relatively young product, I've been running InfluxDB at scale for over a year in production and it's been a joy to use. Hundreds of hosts reporting stats every 10 seconds and the query language is straightforward and powerful once you grok the data structures a bit. It fits the bill, I think.


+1 for InfluxDB and the TICK stack. We evaluated it for our monitoring needs and decided not to go with it, but were damned impressed with how easy it was to set up and use, the features (including the query language), and the overall quality of the stack.

Less impressed with the Influx org and their SaaS, but I definitely want to find a use case for getting stuck into Influx for time series collection.


Can you describe your workload and your experience with InfluxDB? We evaluated InfluxDB in 2015 and found that it fell apart due to the large amount of data we produce. I'd be interested in hearing your experiences to see if we need to reevaluate.


Workload is mostly writing incoming metrics from many dynamic instances (AWS). In 2015 Influx was pre-1.0 and definitely had a tendency to "fall apart" under stress - for one, there were no memory limitations implemented, so an accidental large query would eat up all your memory and kill the database. To be fair, no one recommended putting it in production at that point. It's just that I needed something to run a Graphite-like metrics platform at scale and it had to be future-proof. Bit of a gamble, but it turned out well! :)


We use OpenTSDB on HBase with good success, writing over 2M events/s in some clusters.

We're also working on something with Apache Phoenix over HBase to store TS so that we can query the data in more flexible ways.


I've been working on my own solution for a while (about 3 years) - https://github.com/akumuli/Akumuli. One of the greatest things about it is that there is no need for preconfigured rollups. The storage is based on a B+tree, and some aggregates are stored inside the inner nodes of the tree.


To some extent there is http://www.osisoft.com

(I work for this company so I'm a bit biased)

The thing is, yes we have a time series database, but a lot of value comes from giving people the tools to analyze, distribute, and act on the data stored (why store data if you can't do anything useful with it?)


VLDB '15. Previously discussed at: https://news.ycombinator.com/item?id=10207863


Yes, and here is an open-source implementation of the ideas presented in this paper: https://github.com/facebookincubator/beringei


Submitted this a while back: The Morning Paper also did a good write-up of BTrDB, another very fast TSDB written in Go, storing 2.1 trillion data points and sustaining 119M queried values per second (53M inserted values per second) on a four-node cluster.

https://blog.acolyer.org/2016/05/04/btrdb-optimizing-storage...


This is a great paper. For InfluxDB we used a similar compression technique for float64s in our new storage engine. A key difference is that we separated timestamps from values, so we can use run-length encoding for regular time series.
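A sketch of why the split enables run-length encoding (my illustration, not InfluxDB's actual encoder): once timestamps live in their own column, an evenly spaced series collapses to its first timestamp plus a handful of (delta, run length) pairs.

  def rle_timestamps(timestamps):
      # Delta-encode the timestamp column, then run-length encode the deltas.
      deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
      runs = []
      for d in deltas:
          if runs and runs[-1][0] == d:
              runs[-1][1] += 1
          else:
              runs.append([d, 1])
      return timestamps[0], runs

  # An hour of 10s-interval data becomes a single run:
  print(rle_timestamps(list(range(0, 3600, 10))))
  # (0, [[10, 359]])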


Does the architecture allow persistent storage plugins?


"You will get a better Gorilla effect if you use as big a piece of paper as possible." -Kunihiko Kasahara, Creative Origami.



