
There still isn't a leading time series database. I think the use case definitely deserves one, though. I've seen so many projects spend months building custom data rollup solutions; it's a waste of time. A good time series database would be pretty amazing.


Riak TS 1.5 just dropped this week [0]. Designed to do the horizontal scale thing over temporally sorted data.

/disclaimer/ I work for Basho

[0] http://docs.basho.com/riak/ts/1.5.0/releasenotes/


Have you guys considered using InfluxDB as a backend? InfluxDB has written at length about the difficulties of existing storage engines (including LevelDB).

My guess, though, is that Basho would opt to write their own backend informed by InfluxDB's notes.

On a side note, I find InfluxDB's use of nanosecond time precision spot on, and Riak TS's millisecond resolution disappointing.


There are several, depending on the use case. Each makes different tradeoffs, so you have to decide what's important for you.

For example Prometheus (which I work on) is great at reliable monitoring and powerful processing of metrics at high volumes, but it'd be unwise to use it for event logging or customer billing.

If you're doing IoT or event logging then InfluxDB might be a good choice for you, though if you're doing more text-based logging then Elasticsearch is nearer to what you're looking for.

https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCE... is one comparison of the various open source options.


We're using Elasticsearch for event logging -- where "event" means analytics event, e.g. a page view -- and it's fantastic. The aggregation support is superb.

We initially used Influx, but it could not perform well at the time (0.8). Our events are also heavily label-based. Basically, we do ETL at the time of write, collecting multiple documents into one mega-event, which is a complex, nested JSON document. It may have perhaps 150-200 fields. A single event may be something like "clicked button X". By storing the original document, we can aggregate based on any field value, including text and scalar fields, without having to think about a schema or about planning ahead of time what fields should be indexed or not. ES handles the rest pretty well.

To do the same thing with Influx or Prometheus I suspect we'd have to reverse this and store the document as the labels, along with a single count (1) as the "metric". I don't know how well Influx etc. scale with number of unique label values, though I'd love to find out. The last time I read about this, I think they recommended not going overboard with them.

What's different with business analytics is that the end product is typically multidimensional rollup reports over large time windows (number of page views per customer per web property per month, comparing by 2015 vs 2016, for example), and it's almost all "group by count", sometimes "count distinct" or averages. Whereas "rate per second"-type metrics aren't used anywhere in our app, for example.
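The multidimensional "group by count" rollups described above map naturally onto nested Elasticsearch aggregations. Here's a minimal sketch of what such a query body might look like, built in Python; the field names (`customer_id`, `property`, `timestamp`) and the exact interval keyword are illustrative assumptions, not the commenter's actual schema.

```python
# Hypothetical sketch: page views per customer per web property per month,
# expressed as an Elasticsearch aggregation request body. Field names are
# made up for illustration; check your own mapping before using.
def monthly_pageview_rollup(year):
    return {
        "size": 0,  # we only want aggregation buckets, not raw hits
        "query": {
            "range": {
                "timestamp": {"gte": f"{year}-01-01", "lt": f"{year + 1}-01-01"}
            }
        },
        "aggs": {
            "per_customer": {
                "terms": {"field": "customer_id"},
                "aggs": {
                    "per_property": {
                        "terms": {"field": "property"},
                        "aggs": {
                            # bucket each (customer, property) pair by month;
                            # doc_count in each bucket is the "group by count"
                            "per_month": {
                                "date_histogram": {
                                    "field": "timestamp",
                                    "interval": "month",
                                }
                            }
                        },
                    }
                },
            }
        },
    }

body = monthly_pageview_rollup(2016)
```

Running the same body against a 2015 index and a 2016 index gives the year-over-year comparison mentioned above.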


I also use Elasticsearch for time series site analytics use cases. I gave a talk at Elastic{on} about it last year.

Apologies for the email gateway for the video, but you can also see my slides here: https://www.elastic.co/elasticon/conf/2016/sf/web-content-an...

We found that as we scaled it up, we couldn't really keep the data in raw form, so we had to build rollup documents that cover 5-minute and 1-day buckets. Do you use the same trick, or is the number of pageview events for you manageable enough that you just keep it all raw?


We haven't reached that stage yet, fortunately; though at some point someone will want to do a big multi-year aggregation report across all indexes and still expect it to take no more than a few seconds.

My ideal solution would be one that rotated the dataset into historical rollups on a daily basis, so that we only stored the raw data for today, and gradually merged earlier entries at lower granularities. However, I haven't thought much about how to do that with Elasticsearch. I can see a way of doing it by embedding the value in the field label, and using the field value as a count, but Elasticsearch really doesn't like lots of unique fields; you shouldn't be using more than a few hundred at most in a single installation (across all indexes).
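The rotation scheme sketched above (raw data for today, older entries merged into coarser rollup documents) can be illustrated with a few lines of Python. This is a hypothetical sketch of the idea only, not anyone's production code; the event shape and bucket keys are made up.

```python
from collections import Counter
from datetime import date

# Hypothetical sketch of the rotation idea: keep today's events raw,
# collapse all earlier days into per-(day, event_type) count documents.
def roll_up(events, today):
    """events: iterable of (day, event_type) tuples.
    Returns (raw_events_for_today, rollup_counts_for_past_days)."""
    raw = [e for e in events if e[0] == today]
    rollups = Counter((d, t) for d, t in events if d != today)
    return raw, dict(rollups)

events = [
    (date(2016, 12, 1), "clicked_button_x"),
    (date(2016, 12, 1), "clicked_button_x"),
    (date(2016, 12, 2), "page_view"),
]
raw, rollups = roll_up(events, today=date(2016, 12, 2))
# raw keeps the single page_view; the two Dec 1 clicks become one count doc
```

In Elasticsearch terms, each rollup entry would become one document whose count field is aggregated instead of counting raw hits, sidestepping the unique-field explosion mentioned above.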


At that point, you can just use Apache Drill over raw JSON files:

https://drill.apache.org/


In addition, there is influx-mysql: https://github.com/philip-wernersbach/influx-mysql , if you're looking for a MySQL-compatible solution.


There is, it just costs as much as a fully loaded engineer (about $150k/yr). It's called Kdb.


Or more, and it means you have to learn the dreaded Q language.


Personally I find Q fine. It's a useful filter. If the language is too hard for someone to pick up, the person would struggle with the complexities of correctly designing or performing analytics on a high frequency distributed custom time series db anyway.


While it's a relatively young product, I've been running InfluxDB at scale for over a year in production and it's been a joy to use. Hundreds of hosts reporting stats every 10 seconds and the query language is straightforward and powerful once you grok the data structures a bit. It fits the bill, I think.
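The "hundreds of hosts reporting stats every 10 seconds" workload above writes points in InfluxDB's line protocol (`measurement,tags fields timestamp`). Here's a small sketch of building one such line in Python; the `cpu` measurement and `host` tag are illustrative, and real deployments would use a client library or batching.

```python
# Sketch of InfluxDB line protocol: one point per line, in the form
#   measurement,tag_key=tag_value field_key=value timestamp_ns
# Measurement and tag names here are illustrative only.
def to_line_protocol(measurement, tags, fields, ts_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol("cpu", {"host": "web01"}, {"usage": 12.5},
                        1480000000000000000)
# -> "cpu,host=web01 usage=12.5 1480000000000000000"
```

Note the nanosecond timestamp, which is the precision praised earlier in the thread.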


+1 for InfluxDB and the TICK stack. We evaluated it for our monitoring needs and decided not to go with it, but were damned impressed with how easy it was to set up and use, the features (including the query language), and the overall quality of the stack.

Less impressed with the Influx org and their SaaS, but I definitely want to find a use case for getting stuck into Influx for time series collection.


Can you describe your workload and your experience with InfluxDB? We evaluated InfluxDB in 2015 and found that it fell apart due to the large amount of data we produce. I'd be interested in hearing your experiences to see if we need to reevaluate.


Workload is mostly writing incoming metrics from many dynamic instances (AWS). In 2015 Influx was at pre-1.0 and definitely had a tendency to "fall apart" under stress - for one, there were no memory limitations implemented, so an accidental large query would eat up all your memory and kill the database. To be fair, no one recommended putting it in production at that point. It's just that I needed something to run a Graphite-like metrics platform at scale and it had to be future-proof. Bit of a gamble but it turned out well! :)


We use OpenTSDB with HBase with good success. Writing over 2m events/s in some clusters.

We're also working on something with Apache Phoenix over HBase to store TS so that we can query the data in more flexible ways.


I've been working on my own solution for a while (about three years): https://github.com/akumuli/Akumuli. One of the greatest things about it is that there's no need for preconfigured rollups. The storage is based on a B+tree, and some aggregates are stored inside the inner nodes of the tree.
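The idea of keeping aggregates in inner nodes can be illustrated with a tiny sketch (this is a toy illustration of the technique, not Akumuli's actual data structure): each node precomputes the count and sum of its subtree, so a range aggregate can be answered from inner nodes without re-reading every leaf.

```python
# Toy illustration: inner nodes cache aggregates over their subtree,
# so queries over whole subtrees never touch the raw leaf values.
class Node:
    def __init__(self, values=(), children=()):
        self.children = list(children)
        self.count = len(values) + sum(c.count for c in self.children)
        self.sum = sum(values) + sum(c.sum for c in self.children)

leaf_a = Node(values=[1.0, 2.0])
leaf_b = Node(values=[3.0, 4.0])
root = Node(children=[leaf_a, leaf_b])
# root.sum is 10.0 and root.count is 4, read straight off the root
```

A real implementation would keep min/max and other aggregates too, and update them incrementally on insert.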


To some extent there is http://www.osisoft.com

(I work for this company so I'm a bit biased)

The thing is, yes, we have a time series database, but a lot of the value comes from giving people the tools to analyze, distribute, and act on the data stored (why store data if you can't do anything useful with it?).



