
> If you're operating on FB-scale, then sure that's what you have to do, but in most cases your database (especially Postgres) is a far superior option.

> Time series is "unusual" in the sense that most people don't know the first thing about it, even folks with degrees in math/statistics. I think this is why there is the prevailing misconception that it requires a specialized database.

I'm one of the developers of Prometheus, and while I'm going to point appropriate use cases towards Postgres (and regularly do), even relatively small monitoring loads require careful handling such that a traditional database isn't suitable.

An example from a previous company: we had only ~50 machines in one datacenter and yet were up to 30,000 samples per second. This is not considered a large setup; a large one would be in the hundreds of thousands of samples per second.

All the buffering/batching required to make such loads practical needs special design, as naively turning each new data point into a disk write (with fdatasync) will not work out, even with SSDs.
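As a rough illustration of the batching idea, here's a minimal sketch (hypothetical Python, not Prometheus's actual storage code; the batch size and file format are made up) that buffers samples in memory and amortizes one fsync over a whole batch instead of paying one per data point:

```python
import os
import tempfile

class BatchingWriter:
    """Buffer samples in memory; flush many samples per fsync."""

    def __init__(self, path, batch_size=1000):
        self.f = open(path, "ab")
        self.batch_size = batch_size
        self.buffer = []
        self.fsyncs = 0  # count physical flushes, for illustration

    def append(self, timestamp, value):
        self.buffer.append(f"{timestamp} {value}\n")
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        self.f.write("".join(self.buffer).encode())
        self.f.flush()
        os.fsync(self.f.fileno())  # one fsync amortized over the batch
        self.fsyncs += 1
        self.buffer.clear()

path = os.path.join(tempfile.mkdtemp(), "samples.log")
w = BatchingWriter(path, batch_size=1000)
for i in range(30000):  # one second's worth of samples at the load above
    w.append(i, 0.5)
w.flush()
print(w.fsyncs)  # 30 fsyncs instead of 30,000
```

With fdatasync per point this workload would need 30,000 synchronous writes per second; batching collapses it to 30.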

And that's just writes. You also need to be able to efficiently read back and process the data when it is queried.

I'd suggest https://www.youtube.com/watch?v=HbnGSNEjhUc for a look into how Prometheus does it (which also covers our use of Gorilla). There was also a good post on the InfluxDB blog about the evolution of their design and why the problem is hard that I can't seem to find right now.



This section in our (InfluxDB's) doc on the evolution of our storage engine highlights some of the issues: https://docs.influxdata.com/influxdb/v1.1/concepts/storage_e...


I think that this is an argument over apples and oranges. I'm well familiar with the write problem you're speaking of and even wrote a blog post a couple of years ago detailing Influx's storage code.

The "apples" use case is when you want for lack of a better description a Grafana chart out of it. (E.g. the cluster cpu load, "we're getting X page views per second", etc).

The "oranges" case is when every data point matters, and I think that this particular case is not so much about time series, but logging, where a log is a "time series" technically speaking. Here of course you're talking massive amounts of data which you want to do your best to optimize and it's a hard problem. It's not a _time series_ problem, it's a problem of storage. Splunk (sorry, can't think of a better example ATM) doesn't describe itself as a "times series database", and yet it's addressing that very problem of horizontally scalable write-intensive storage. So is hbase, cassandra, accumulo, bigquery and all their friends.

In my experience the best solution is to use two separate systems, one for each case. There is absolutely no value in a "disk used" measurement every millisecond: aggregated to once a minute (or a few) is perfectly fine, retaining the original data is not needed, and you'd be a fool to build a Cassandra cluster to handle it. On the other hand, if I'm recording equity bids and asks, then I'd better record every point forever.
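As a sketch of the aggregation side (hypothetical Python; the per-second sample rate and bucket size are assumptions for illustration), downsampling a "disk used" series to per-minute averages is as simple as:

```python
from collections import defaultdict

def downsample_avg(samples, bucket_seconds=60):
    """Collapse (timestamp, value) samples to one average per time bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

# Two minutes of per-second "disk used" readings collapse to two averages.
raw = [(t, 50.0 + (t % 2)) for t in range(120)]
print(downsample_avg(raw))  # {0: 50.5, 60: 50.5}
```

The equity-tick case is precisely the one where this kind of lossy aggregation is unacceptable, which is why the two cases want different systems.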


That's two separate dimensions.

One is metrics vs. event logging.

The other is consistency vs. availability.

There are many types of logs, and you don't need the same consistency for debug logs as for billing-related logs. Similarly, you may have billing-related metrics (e.g. bandwidth usage) where consistency is important.

> It's not a _time series_ problem, it's a problem of storage.

Technically anything with a time dimension is a time series, so Splunk is a time series database.

> In my experience the best solution is to use two separate systems, one for each case.

I agree completely. There are at least two general problems here that need different approaches once you get beyond trivial scale.


The horizontally scalable, write-intensive use case you describe is basically what Riak TS was designed to handle, specifically with a temporally sorted data model. Riak TS just got a version bump too [0]. Because Riak TS is a derivative of Riak KV, it inherited the clustering, HA, etc. from day 0.

/disclaimer/ I work for Basho.

[0] http://docs.basho.com/riak/ts/1.5.0/releasenotes/


>30,000 samples per second.

Five years ago I was getting that volume of records written on one Oracle node with 2 CPUs and a bunch of regular HDDs.

>naively making each new data point into a disk write (with fdatasync) will not work out - even with SSDs.

Seems like you're trying to say that this situation requires 30K IOPS. That would be really naive :)

You have 50 writers sending 30K/sec in total, i.e. 600/sec from one writer. A typical RDBMS naturally batches parallel writes, so 3-4 HDDs providing 600+ IOPS in total would easily serve this situation.
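Spelling out that group-commit arithmetic (the 10 ms flush window is an assumption for illustration, not a measured figure):

```python
# Back-of-the-envelope check of the group-commit argument.
writers = 50
total_samples_per_sec = 30_000
per_writer = total_samples_per_sec // writers  # 600 samples/sec per writer
print(per_writer)

# With group commit, one log flush covers all rows that arrived in the
# window, so disk writes scale with the flush rate, not the row rate.
flushes_per_sec = 100                          # assumed 10 ms commit window
rows_per_flush = total_samples_per_sec / flushes_per_sec
print(rows_per_flush)                          # 300 rows amortized per IOP
```

Under those assumptions the log only needs ~100 IOPS for the write path, comfortably within a few HDDs.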


> seems like you're trying to say that this situation requires 30K IOPS. That would be really naive :)

If you want fast reads, it'd take at least that many IOPS for a naive solution where each time series had its own block you were appending to.

> A typical RDBMS naturally batches parallel writes, so a bunch of 3-4 HDDs providing totally 600+ IOPS would easily serve your situation.

You could do that if you only cared about writes. There's two problems with such an approach.

First, reads would be unbearably slow: to read a single time series for an hour at, say, 10s resolution would take 360 operations (one per batch), or around 3.6 seconds presuming a 10ms seek time. Wanting to read a hundred time series at once over a day is not unusual, which would take 8640 seconds.
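The read-cost arithmetic, spelled out (10ms seek time and 10s resolution as in the comment):

```python
# Naive batched layout: each point lands in whatever batch happened to be
# open, so reading one series back costs roughly one seek per point.
seek_time = 0.010                          # seconds per random disk op
resolution = 10                            # one sample every 10 s
points_per_hour = 3600 // resolution       # 360 points/hour/series

one_series_one_hour = points_per_hour * seek_time            # 3.6 s
hundred_series_one_day = 100 * 24 * points_per_hour * seek_time  # 8640 s
print(one_series_one_hour, hundred_series_one_day)
```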

Secondly, it's not likely to be efficient disk-space-wise. As each point is individually written, you can't do intra-time-series compression, so you're probably talking at least 16 bytes per sample, and probably nearer 100-200 bytes if you write the metric name each time.

Contrast that with the approach taken by something like Prometheus. We build up 1kB chunks, which hold ~780 samples on average, as Gorilla compression gets us down to ~1.3 bytes/sample. We batch these chunks up per time series across several hours, so accessing 6 hours of a series is only one disk operation. With the above example of reading 100 time series for a day, that's only 4 seconds, roughly three orders of magnitude faster than what simple write batching allows for.
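And the chunked-layout arithmetic for comparison (numbers taken from the comment above):

```python
# Chunked layout: ~1 kB chunks at ~1.3 bytes/sample (Gorilla), batched
# per series so one disk op returns ~6 hours of one series.
chunk_bytes = 1024
bytes_per_sample = 1.3
samples_per_chunk = chunk_bytes / bytes_per_sample   # ~787 samples

seek_time = 0.010                                    # seconds per disk op
ops_per_series_per_day = 24 / 6                      # 4 ops/series/day
hundred_series_one_day = 100 * ops_per_series_per_day * seek_time  # 4.0 s
print(int(samples_per_chunk), hundred_series_one_day)
```

4 seconds versus 8640 seconds for the naive per-point batching is where the three-orders-of-magnitude figure comes from.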

Obviously this is ignoring many details like caching, but the general point holds.


That's a great talk, thanks for the link.



