A Review of Time Series Databases (dataloop.io)
133 points by slyall on Sept 13, 2016 | 56 comments



Dear HN, recently time series DBs appear to have become very popular here, but I'm confused. What are they for?

I understand the idea for, say, physics experiments. Lots of parameters sampled thousands of times per second, gotta store that stuff somewhere. But this is HN, not an experimental physics subreddit, so I must be missing something.

What do people on HN who care about this stuff use time series databases for? Why are there 20+ competitors?


When you run servers, you might want to record plenty of parameters: from received-packet counters, through disk I/O operations, to CPU load.

If you want to identify issues with good precision, you will record _hundreds_ of counters per server.

Sticking this kind of data into a relational database is a poor fit. With a couple of servers you might have millions of _writes_ per second, while the data is only read when someone refreshes a chart. This is what makes time series DBs unique: they need to handle write-heavy loads and still perform well when asked to retrieve a couple of parameters within a defined time window.

So, two things:

- time series DBs are needed by anyone running live systems and measuring them on the fly

- time series DBs have unique access patterns that more conventional databases aren't usually optimized for
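
To make the access pattern concrete, here is a rough sketch against InfluxDB's 1.x HTTP API (the local URL, database name and counter names are made up for illustration):

    # Write-heavy TSDB pattern: many counters appended per server,
    # narrow reads over a time window. Assumes a local InfluxDB 1.x
    # with a database named "metrics" already created.
    import time
    import requests

    def write_counters(host, counters):
        # One line-protocol point per counter:
        #   measurement,tag=value field=value timestamp(ns)
        ts = int(time.time() * 1e9)
        lines = "\n".join(
            "server_stats,host=%s %s=%s %d" % (host, name, value, ts)
            for name, value in counters.items()
        )
        requests.post("http://localhost:8086/write?db=metrics", data=lines)

    # Hundreds of counters per server, written every few seconds:
    write_counters("web01", {"rx_packets": 18234, "disk_iops": 412, "cpu_load": 0.73})

    # Reads are rare and narrow: a couple of fields in a time window.
    q = ("SELECT mean(cpu_load) FROM server_stats "
         "WHERE host = 'web01' AND time > now() - 1h GROUP BY time(1m)")
    r = requests.get("http://localhost:8086/query", params={"db": "metrics", "q": q})
    print(r.json())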


Time series databases are critical in financial institutions, storing end of day (or much more frequent) marks for example. The advantage of such databases is not just their ability to store temporal data natively, but the built-in functions specific to time-based datasets which would not be atomic in a non-time-based db.
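
As one example of such a built-in, an "as-of" join pairs each trade with the most recent quote at or before it, something kdb+ treats as a primitive. A rough sketch of the same idea using pandas.merge_asof, with made-up data:

    # As-of join: each trade matched to the latest quote at or before
    # its timestamp. Both frames must be sorted on the join key.
    import pandas as pd

    trades = pd.DataFrame({
        "time": pd.to_datetime(["2016-09-13 10:00:01", "2016-09-13 10:00:05"]),
        "sym": ["AAPL", "AAPL"],
        "price": [107.10, 107.15],
    })
    quotes = pd.DataFrame({
        "time": pd.to_datetime(["2016-09-13 10:00:00", "2016-09-13 10:00:04"]),
        "sym": ["AAPL", "AAPL"],
        "bid": [107.05, 107.12],
    })

    print(pd.merge_asof(trades, quotes, on="time", by="sym"))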


Google basically does all its monitoring through time series databases.

Basically they record response times, error rates, and internal health metrics from all kinds of services and service instances as time series, and then define rules on top of these that determine when to generate alerts.

Other things you typically want to record are all kinds of resource usage metrics, like disk usage, RAM usage, number of processes, network traffic and saturation etc.
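
As a hedged illustration of the "rules on top of time series" idea, here's a toy rule evaluator (names and thresholds invented; this is not Google's actual system):

    # Toy alerting rule: fire when the error rate over a window of
    # (timestamp, errors, requests) samples exceeds a threshold.
    def error_rate(points):
        errs = sum(e for _, e, _ in points)
        reqs = sum(r for _, _, r in points)
        return errs / float(reqs) if reqs else 0.0

    def check_alert(points, threshold=0.01):
        rate = error_rate(points)
        if rate > threshold:
            print("ALERT: error rate %.2f%% above threshold" % (rate * 100))

    check_alert([(0, 5, 1000), (60, 12, 980), (120, 30, 1010)])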


It's interesting to uncover data that's not directly related to a requirement: get some search results. Of course it's humans sitting at laptops, so time matters. But the other data you mention also has the nature of "the thing we care about" along with "time associated with the thing we care about."

I wonder what other extra-dimensional (totally made up term on the fly) data we might care about. Power per query. Temperature per query. Water consumed per query. Water consumed per current activity level of trending topics.


Money earned per... is usually a pretty big thing :-)


Big data analytics is the biggest new trend in enterprises right now.

And one of the key parts of it is building machine learning models which can do things like look at a customer's previous behaviour in order to predict what their future behaviour might be. So this is one scenario where a time series database is infinitely better than say a RDBMS or Columnar database.

Likewise, performing analytics on IoT devices, e.g. sensors in trucks or oil/gas equipment, requires events to be captured, stored, and later mined rapidly. And many time series databases, being schemaless, allow you to manage data from disparate sources in one table.


> So this is one scenario where a time series database is infinitely better than say a RDBMS or Columnar database

I still don't get it.

Most of my background is in "small data", where database programs tend either to store time series in "objects" (other kinds of objects being models and graphs), like EViews, or as ultimately-isomorphic-to-spreadsheets tables with some added syntactic sugar for time structure that's only understood by some functions.

I'm beginning to do some "not so small data" work (text analysis from news websites at the moment; the essential thing is the time sequencing of information cascades) with sqlite and pandas (pandas is just horrible, but it's already there), and basically the only problem I have is that the raw data is unevenly sampled, so I have to make some choices when downsampling to a fixed grid before statistical analysis proper can be performed.
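
For the record, that downsampling step looks roughly like this in pandas (illustrative data; the bucket size and aggregate are exactly the choices I mean):

    # Bucket unevenly timestamped events onto a fixed 1-minute grid.
    import pandas as pd

    events = pd.DataFrame(
        {"value": [1.0, 3.0, 2.0, 5.0]},
        index=pd.to_datetime([
            "2016-09-13 10:00:03", "2016-09-13 10:00:41",
            "2016-09-13 10:02:17", "2016-09-13 10:02:59",
        ]),
    )

    # mean? last? count? -- each answers a different question
    print(events.resample("1min").mean())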

That said: I understand, as the grandparent poster, that in some cases (high-end physics experiments) time-structured data is incrementally produced in enormous quantities, and just storing a thousand variables at ten thousand samples per second is this whole challenge.

But server logs? Sensor readings? At one point I had an Arduino hobby where I had a robot moving "of his own will" based on thermistors and did the necessary downsampling and filtering in the Arduino sketch itself for later analysis in Matlab. I mean, who's getting raw data from IoT devices into servers? Even my news scraping thing does some pre-selection before committing to the db.

People underestimate how quickly the signal-to-noise ratio decays at high frequencies; and it seems to me that Cargo Cult IT is massively overestimating its need for CERN-level compute.


Because it was nearly impossible to store all that data before, we had to figure out what we wanted to look at before we got the data. Now that we can store that data easily, we have the flexibility to find things we didn't know we wanted. For instance, if you are constantly storing raw sensor and proprioception data for your robot and you run into problems with, say, its movement routines, you can go back to the raw data and look for trends. Imagine how much easier debugging would be if you could just store the state of your software constantly, rather than either looking for patterns ahead of time or toggling the data dump for small periods of time.

Now instead of a robotic hobbyist, imagine if you were a Tier 1 ISP or large banking institution.


To give a concrete example:

My company deploys wireless sensor networks. Each sensor periodically reports its battery status, which we store in a database so we can see which sensors need their battery replaced.

Recently we found out that some batteries run empty much quicker than expected. The hardware vendor asked us for 'all battery updates for the past year or so'. We didn't save those; each update just replaced the current value in the database. So it becomes very hard to diagnose the problem, because the historic data is lost.
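
The schema difference that bit us, sketched with sqlite (table and column names made up):

    # UPDATE-in-place "current value" table vs. an append-only history.
    import sqlite3, time

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE battery_now (sensor_id TEXT PRIMARY KEY, level REAL)")
    db.execute("CREATE TABLE battery_history (sensor_id TEXT, ts INTEGER, level REAL)")

    def record(sensor_id, level):
        # What we had: overwrite, so history is gone
        db.execute("INSERT OR REPLACE INTO battery_now VALUES (?, ?)",
                   (sensor_id, level))
        # What the vendor needed: keep every reading
        db.execute("INSERT INTO battery_history VALUES (?, ?, ?)",
                   (sensor_id, int(time.time()), level))

    record("sensor-42", 87.5)
    print(db.execute("SELECT * FROM battery_history").fetchall())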


I understand that I have never dealt with really large datasets, whether as a hobby or as a professional (doing econometrics and related analytical work).

What's not clear to me is what TSDBs (such as those listed by Wikipedia; I looked into InfluxDB in particular) do better than RDBMSes.

This isn't "negative" skepticism, it's an earnest lack of knowledge.


As an example, I track requests per second per datacenter in a time series database. This allows me to make pretty graphs and do traffic engineering between sites.


We use InfluxDB to store solar energy performance for hundreds of locations at 15-minute intervals. It doesn't take long to get millions of points, and querying them efficiently is important.
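
Roughly the kind of query this involves, against InfluxDB's 1.x HTTP API (database, measurement and field names made up):

    # Daily energy per site over a month, aggregated inside the
    # database instead of being scanned client-side.
    import requests

    q = ("SELECT sum(kwh) FROM solar_output "
         "WHERE time > now() - 30d GROUP BY time(1d), site")
    r = requests.get("http://localhost:8086/query",
                     params={"db": "solar", "q": q})
    print(r.json())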


Time series DBs are for storing monitoring data, IoT data, etc.

See the example inputs supported by Telegraf, from InfluxData: https://github.com/influxdata/telegraf/tree/master/plugins/i...


The world is generating massive amounts of data. Not all of it fits as easily into relational database schemas as a customer with an order that buys some widgets.

Take, for example, fleet management data: thousands of vehicles streaming in time, GPS, and other metrics. There's nothing terribly relational about this data, so purpose-built systems can handle the ingress and analytics better.


We use it to store fine-grained server and application metrics for performance monitoring, bottleneck detection, and general troubleshooting.


Druid provides realtime ad-hoc campaign analytics to our advertisers and publishers for our performance CPA network.


Dataloop develops DalmatinerDB. Guess which database won first place.


> Disclaimer: Dataloop helps build DalmatinerDB so there is a massive conflict of interest.


It's developed by @heinz_gies from Project FiFo; they may pay him to make changes to it, but it wasn't started there, and AFAIK Heinz doesn't work there full time.


Mariano is correct: DalmatinerDB is built and maintained by Project-FiFo. Dataloop (the authors of the blog post) is currently the biggest user (as far as I know, at least), using it in their SaaS.

They have however been excellent open source citizens and contributed back improvements, bug reports and suggestions.


KDB+ is the number 1 time-series database, no question. Per-core, it's 10 times faster than any other time-series database. Seems like an odd omission.

http://kparc.com/q4/readme.txt


As noted in the scope section along with some other omissions:

> Only free and open source time series databases and their features have been compared. Therefore if someone asks “have you tried Kdb+ or Informix?” the answer will be no. They are probably awesome though.

The article could have been titled "A Review of the Top 10 FOSS Time Series Databases", since that's what it is.


The 32-bit version is free.

I agree that any comparison skipping it is meaningless; it is the "standard" in this space, to the extent there is one.


> The 32-bit version is free.

Only for non-commercial use: https://kx.com/2015/09/19/32-bit-kdb-for-non-commercial-use-...


KDB+ does have some disadvantages though.

Requiring a schema makes it ill-suited to many use cases, since in many situations you don't know the schema ahead of time. And the biggest drawback for me is poor Hadoop/Spark integration: JDBC is not a great choice, since you get poor predicate pushdown support and it prevents, say, Spark from going specifically to the node that holds the data.


You always have a schema. The only choice you have is whether your tool helps you validate it or not.


You always have N>0 schemas. Whether you have N+1 is governed by whether your db requires you to have a second one on top of the one in your code.


I don't understand how/why that is, though. Anyone have more information?


Performance is not the #1 criterion.


> It seems there is a never ending supply of time series databases being written on top of Cassandra. Unless any of them are significantly better than KairosDB I'm not going to change the top 10.

I really appreciate the work Brian Hawkins has done with KairosDB. We settled on KairosDB at work after having a bunch of trouble with InfluxDB. At the same time, it's very much stuck in the Cassandra 2.1 world.* There have been some significant advancements in Cassandra for this particular use case (a new storage engine, TWCS) and KairosDB is missing out.

*Yes, it can be run on 2.2.x by enabling the now-disabled-by-default Thrift, but that feels like a last gasp.
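
For anyone curious what's being missed: TWCS is enabled per table at creation time. A sketch with the DataStax Python driver (keyspace, table and column names made up):

    # A time series table using TimeWindowCompactionStrategy, which
    # groups SSTables into daily windows -- a good fit for data that
    # arrives in time order and expires together.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("metrics")
    session.execute("""
        CREATE TABLE IF NOT EXISTS data_points (
            metric text, day text, ts timestamp, value double,
            PRIMARY KEY ((metric, day), ts)
        ) WITH compaction = {
            'class': 'TimeWindowCompactionStrategy',
            'compaction_window_unit': 'DAYS',
            'compaction_window_size': 1
        }
    """)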


I'm evaluating InfluxDB for work and was wondering what issues you ran into?


Blind guess: lack of clustering, and a history of changing file formats, so you lose your data every few months.


> I decided to pen a magnum opus of my own opinions

"magnum opus (noun) a large and important work of art, music, or literature, especially one regarded as the most important work of an artist or writer."

Considering that this review has a word count of 3708, describing it as a magnum opus (something I associate with a work that takes close to a lifetime to complete) rankles.


I only counted 3705 words. I might need to count them again; damn fingers.

Did you include the comments? I didn't myself, but that's sure to bump up the count. We should investigate what the cutoff limit is; maybe 10,000 words would qualify?

I'm struggling whether to classify the work as art or literature, maybe both? It's definitely a seminal piece.


Yes, I feel the same. Akin to becoming a PhD by scanning the spines of books in a library. It's by no means bad work at all, but not a Magnum Opus.


To be more precise, this is a review of more than 10 free and open source time series databases, wherein the author chooses 10 favourites, by some metrics.


What do you think is missing?

Because those are probably the 10 most popular and well known.


Well, for one thing, this isn't even attempting an objective comparison. The article basically grabs whatever third party benchmark numbers are available and pretends all the metrics are comparable. In many cases, the external benchmarks are testing entirely different things, so it's meaningless to give those numbers in a "review" of all the options.

It's really just a "top ten list according to my arbitrary thoughts", which is fine. But it's not really a useful comparison at all.


I have no idea; it's not an area of my expertise. However, the author tests 16 and selects 10 by the methods shown. I'm glad for the reinforcement that these are likely reasonably decent ones by social proof via you!


Question:

Does it make sense to use a time-series database if you're not too sure that you can trust the "time" in the series?

For example, let's say you're logging users' locations in a jogging app and using the system's clock as the time record (I understand this may not be ideal). Someone could log a run dated two years from now, for example.


It is usually a good idea to discard late writes. It should be possible to write data with some timestamp skew, but if the skew is too large the data should be discarded as erroneous. It totally depends on the application, though.
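
A minimal sketch of that skew check (the tolerance is a made-up number; pick one that fits the application):

    # Accept points within an hour of the server clock, discard the rest.
    import time

    MAX_SKEW_SECONDS = 3600

    def accept_point(point_ts):
        return abs(point_ts - time.time()) <= MAX_SKEW_SECONDS

    # A run "logged two years from now" gets rejected:
    print(accept_point(time.time() + 2 * 365 * 24 * 3600))  # False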


What is a good TSDB for collecting data from home automation sensors that can run on a little PC like a Raspberry Pi?


InfluxDB is perfect for these use cases. It's a single Go binary with excellent performance and light resource usage.

http://www.aymerick.com/2015/10/07/influxdb-telegraf-grafana...


For a similar comparison but that encompasses commercial databases as well see here: http://www.timestored.com/time-series-data/column-oriented-d...


What about BTrDB from Berkeley? http://btrdb.io/


It's in the spreadsheet. I've not tried it and there wasn't much info available at the time of writing the blog.


I'm on my phone and my eyes are not so good as it is.

Does he say anything about the architecture of the platform he was testing this on?


Never mind, I found it in the Gist...

https://gist.github.com/sacreman/b77eb561270e19ca973dd505527...

1 x DalmatinerDB server: GCE n1-standard-16 (16 CPUs, 60 GB memory, 1 x 375 GB local SSD disk)

I would assume that is the same setup they used for all the other tests, though that's not clear to me yet.

Anyone know the network bandwidth on these instance types?


Hi, I'm the original author. I think that GCE instance has 2Gbps per core but I'd need to check. It wasn't bottlenecked on network bandwidth, though.

The other benchmarks are linked in the spreadsheet to their respective details. It's not an absolute, direct, fair comparison. However, we wanted to start somewhere with the information available right now and collect better results over time as the respective interested parties benchmark and blog about their databases.

I'd like to spend more time benchmarking every database in the spreadsheet but it feels like something the project owners should do themselves. I'd probably only get the setup wrong.


Thanks for the clarification, good to know.


The image embedded in the Gist says 10Gb network.


What specific features does a time series database offer over Mongo or Redis?


Well, one advantage compared to Mongo is that they actually keep your data safe.


I wouldn't trust all of them to do that reliably. If you look at a large group of database systems, some of them are going to have bugs or bad defaults as well.


Really? And how's that? To my understanding, data integrity is not a feature of any of them.

But I asked because I've been using Redis a lot lately. I built a real-time analytics system using bitmaps, and I could use sorted sets to store time series data.

I really wanted to know what benefits they offer to justify adding something else to the stack...
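
For reference, the sorted-set approach looks roughly like this with redis-py (key layout made up; score = timestamp, member = encoded sample):

    import time
    import redis

    r = redis.Redis()

    def record(metric, value):
        ts = time.time()
        # members must be unique, so embed the timestamp in them
        r.zadd("ts:" + metric, {"%f:%f" % (ts, value): ts})

    def window(metric, start, end):
        # range scan by score, i.e. by time
        return r.zrangebyscore("ts:" + metric, start, end)

    record("req_per_sec", 1530.0)
    print(window("req_per_sec", time.time() - 3600, time.time()))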



