
> If you're operating on FB-scale, then sure that's what you have to do, but in most cases your database (especially Postgres) is a far superior option.

> Time series is "unusual" in the sense that most people don't know the first thing about it, even folks with degrees in math/statistics. I think this is why there is the prevailing misconception that it requires a specialized database.

I'm one of the developers of Prometheus, and while I'm going to point appropriate use cases towards Postgres (and regularly do), even relatively small monitoring loads require careful handling such that a traditional database isn't suitable.

An example from a previous company: we had only ~50 machines in one datacenter and yet were up to 30,000 samples per second. This is not considered a large setup; a large one would be in the hundreds of thousands of samples per second.

All the buffering/batching required to make such loads practical needs special design, as naively turning each new data point into a disk write (with fdatasync) will not work out, even with SSDs.
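As a rough illustration of the batching idea, here's a minimal sketch (hypothetical Python, not Prometheus's actual storage code; the batch size and file format are made up) that buffers samples in memory and amortizes one fsync over a whole batch instead of paying one per data point:

```python
import os
import tempfile

class BatchingWriter:
    """Buffer samples in memory; flush many samples per fsync."""

    def __init__(self, path, batch_size=1000):
        self.f = open(path, "ab")
        self.batch_size = batch_size
        self.buffer = []
        self.fsyncs = 0  # count physical flushes, for illustration

    def append(self, timestamp, value):
        self.buffer.append(f"{timestamp} {value}\n")
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        self.f.write("".join(self.buffer).encode())
        self.f.flush()
        os.fsync(self.f.fileno())  # one fsync amortized over the batch
        self.fsyncs += 1
        self.buffer.clear()

path = os.path.join(tempfile.mkdtemp(), "samples.log")
w = BatchingWriter(path, batch_size=1000)
for i in range(30000):  # one second's worth of samples at the load above
    w.append(i, 0.5)
w.flush()
print(w.fsyncs)  # 30 fsyncs instead of 30,000
```

With fdatasync per point this workload would need 30,000 synchronous writes per second; batching collapses it to 30.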

And that's just writes. You also need to be able to efficiently read back and process the data when it is queried.

I'd suggest https://www.youtube.com/watch?v=HbnGSNEjhUc for a look into how Prometheus does it (which also covers our use of Gorilla). There was also a good post on the InfluxDB blog about the evolution of their design and why the problem is hard that I can't seem to find right now.



This section in our (InfluxDB's) doc on the evolution of our storage engine highlights some of the issues: https://docs.influxdata.com/influxdb/v1.1/concepts/storage_e...


I think that this is an argument over apples and oranges. I'm well familiar with the write problem you're speaking of and even wrote a blog post a couple of years ago detailing Influx's storage code.

The "apples" use case is when you want for lack of a better description a Grafana chart out of it. (E.g. the cluster cpu load, "we're getting X page views per second", etc).

The "oranges" case is when every data point matters, and I think that this particular case is not so much about time series, but logging, where a log is a "time series" technically speaking. Here of course you're talking massive amounts of data which you want to do your best to optimize and it's a hard problem. It's not a _time series_ problem, it's a problem of storage. Splunk (sorry, can't think of a better example ATM) doesn't describe itself as a "times series database", and yet it's addressing that very problem of horizontally scalable write-intensive storage. So is hbase, cassandra, accumulo, bigquery and all their friends.

In my experience the best solution is to use two separate systems, one for each case. There is absolutely no value in a "disk used" measurement every millisecond: aggregated to once a minute (or a few) is perfectly fine, retaining the original data is not needed, and you'd be a fool to build a Cassandra cluster to handle it. On the other hand, if I'm recording equity bids and asks, then I'd better record every point forever.
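As a sketch of the aggregation side (hypothetical Python; the per-second sample rate and bucket size are assumptions for illustration), downsampling a "disk used" series to per-minute averages is as simple as:

```python
from collections import defaultdict

def downsample_avg(samples, bucket_seconds=60):
    """Collapse (timestamp, value) samples to one average per time bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

# Two minutes of per-second "disk used" readings collapse to two averages.
raw = [(t, 50.0 + (t % 2)) for t in range(120)]
print(downsample_avg(raw))  # {0: 50.5, 60: 50.5}
```

The equity-tick case is precisely the one where this kind of lossy aggregation is unacceptable, which is why the two cases want different systems.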


That's two separate dimensions.

One is metrics vs. event logging.

The other is consistency vs. availability.

There are many types of logs, and you don't need the same consistency for debug logs as for billing-related logs. Similarly, you may have billing-related metrics (e.g. bandwidth usage) where consistency is important.

> It's not a _time series_ problem, it's a problem of storage.

Technically anything with a time dimension is a time series, so Splunk is a time series database.

> In my experience the best solution is to use two separate systems, one for each case.

I agree completely. There are at least two general problems here that need different approaches once you get beyond trivial scale.


The horizontally scalable, write-intensive use case you describe is basically what Riak TS was designed to handle, specifically with a temporally sorted data model. Riak TS just got a version bump too [0]. Because Riak TS is a derivative of Riak KV, it inherited the clustering, HA, etc. from day 0.

/disclaimer/ I work for Basho.

[0] http://docs.basho.com/riak/ts/1.5.0/releasenotes/


>30,000 samples per second.

Five years ago I was getting that volume of records written on one Oracle node with 2 CPUs and a bunch of regular HDDs.

>naively making each new data point into a disk write (with fdatasync) will not work out - even with SSDs.

Seems like you're trying to say that this situation requires 30K IOPS. That would be really naive :)

You have 50 writers sending 30K/sec in total, i.e. 600/sec from one writer. A typical RDBMS naturally batches parallel writes, so 3-4 HDDs providing 600+ IOPS in total would easily serve this situation.
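Spelling out that group-commit arithmetic (the 10 ms flush window is an assumption for illustration, not a measured figure):

```python
# Back-of-the-envelope check of the group-commit argument.
writers = 50
total_samples_per_sec = 30_000
per_writer = total_samples_per_sec // writers  # 600 samples/sec per writer
print(per_writer)

# With group commit, one log flush covers all rows that arrived in the
# window, so disk writes scale with the flush rate, not the row rate.
flushes_per_sec = 100                          # assumed 10 ms commit window
rows_per_flush = total_samples_per_sec / flushes_per_sec
print(rows_per_flush)                          # 300 rows amortized per IOP
```

Under those assumptions the log only needs ~100 IOPS for the write path, comfortably within a few HDDs.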


> seems like you're trying to say that this situation requires 30K IOPS. That would be really naive :)

If you want fast reads, it'd take at least that many IOPS for a naive solution where each time series had its own block you were appending to.

> A typical RDBMS naturally batches parallel writes, so a bunch of 3-4 HDDs providing totally 600+ IOPS would easily serve your situation.

You could do that if you only cared about writes. There's two problems with such an approach.

First, reads would be unbearably slow: to read a single time series for an hour at, say, 10s resolution would take 360 operations (one per batch), or around 3.6 seconds presuming a 10ms seek time. Wanting to read a hundred time series at once over a day is not unusual, which would take 8640 seconds.
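The read-cost arithmetic, spelled out (10ms seek time and 10s resolution as in the comment):

```python
# Naive batched layout: each point lands in whatever batch happened to be
# open, so reading one series back costs roughly one seek per point.
seek_time = 0.010                          # seconds per random disk op
resolution = 10                            # one sample every 10 s
points_per_hour = 3600 // resolution       # 360 points/hour/series

one_series_one_hour = points_per_hour * seek_time            # 3.6 s
hundred_series_one_day = 100 * 24 * points_per_hour * seek_time  # 8640 s
print(one_series_one_hour, hundred_series_one_day)
```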

Secondly, it's not likely to be efficient disk-space-wise. As each point is individually written, you can't do intra-time-series compression, so you're probably talking at least 16 bytes per sample, and probably nearer 100-200 bytes if you write the metric name each time.

Contrast that with the approach taken by something like Prometheus. We build up 1kB chunks, which hold ~780 samples on average, as Gorilla compression gets us down to ~1.3 bytes/sample. We batch these chunks up per time series across several hours, so accessing 6 hours of a series is only one disk operation. With the above example of reading 100 time series for a day, that's only 4 seconds, roughly three orders of magnitude faster than what simple write batching allows for.
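And the chunked-layout arithmetic for comparison (numbers taken from the comment above):

```python
# Chunked layout: ~1 kB chunks at ~1.3 bytes/sample (Gorilla), batched
# per series so one disk op returns ~6 hours of one series.
chunk_bytes = 1024
bytes_per_sample = 1.3
samples_per_chunk = chunk_bytes / bytes_per_sample   # ~787 samples

seek_time = 0.010                                    # seconds per disk op
ops_per_series_per_day = 24 / 6                      # 4 ops/series/day
hundred_series_one_day = 100 * ops_per_series_per_day * seek_time  # 4.0 s
print(int(samples_per_chunk), hundred_series_one_day)
```

4 seconds versus 8640 seconds for the naive per-point batching is where the three-orders-of-magnitude figure comes from.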

Obviously this is ignoring many details like caching, but the general point holds.


That's a great talk, thanks for the link.



