TimescaleDB vs. InfluxDB: built differently for time-series data (timescale.com)
178 points by mfreed on Aug 15, 2018 | 46 comments



Wonderful analysis, I was waiting for something like this to come out!

Recently, I've gone through this very same choice and ended up with vanilla PostgreSQL (Timescale was not mature enough).

[Shameless self plug] You can read some of the details here: https://medium.com/@neslinesli93/how-to-efficiently-store-an...


One point of clarification for readers of @neslinesli93's post is that Timescale does not create "heavy" indexes.

We do create some default indexes that PostgreSQL does not, but these defaults can be turned off. We also allow you to create indexes after bulk loading data, if you want to compare apples-to-apples.

But to be clear, the indexes Timescale creates are the same as, or often cheaper than, plain PostgreSQL's (remember, TimescaleDB is implemented as a PostgreSQL extension). We're always happy to help people work through proper setup and any implementation details in our Slack community (slack.timescale.com).


Hi, thanks for the tips!

As I mentioned in the article, I tested last year's version of TimescaleDB (July/August 2017), and that was my experience with it out of the box.

I am really impressed by all the progress you've made, and hopefully I'll consider TimescaleDB as my first choice on the next iteration of the product I'm working on.

Now, I'm skimming through the docs[1] and, as I understand it, create_hypertable is called before the data is migrated, so all TimescaleDB indexes are already present during the migration. What is the way to create indexes after data migration?

[1] https://docs.timescale.com/v0.11/getting-started/migrating-d...


Hi @neslinesli93, it's quite easy:

(1) Call create_hypertable with default indexes off (include an argument of `create_default_indexes => FALSE`) [1]

(2) Then just use a standard CREATE INDEX command on the hypertable at any time. B-Tree, hash index, GIN/GiST, single key, composite keys, etc. This DDL command will propagate to any existing chunks (creating the indexes on them), and it will be remembered so that any future chunks that are automatically created will also have these indexes [2], as sketched below

[1] https://docs.timescale.com/latest/api#create_hypertable

[2] https://docs.timescale.com/latest/using-timescaledb/schema-m...
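
Putting those two steps together, a rough sketch (the table and column names here are hypothetical; check the docs above for the exact signature in your version):

    -- Step 1: create the hypertable with default indexes turned off
    SELECT create_hypertable('conditions', 'time', create_default_indexes => FALSE);

    -- ... bulk load your data into the hypertable ...

    -- Step 2: create whatever indexes you want with standard DDL;
    -- these propagate to existing chunks and to any future chunks
    CREATE INDEX ON conditions (time DESC);
    CREATE INDEX ON conditions (device_id, time DESC);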


>What is the way to create indexes after data migration?

You can migrate the data and then use the normal PostgreSQL `CREATE INDEX` syntax to create the indexes on the hypertable. There's no special option to create_hypertable for it, but that's how you would achieve it.


How does Timescale solve the problem of retention? In InfluxDB, old data is thrown out at every tick as the retention window continuously rolls. In the world of Postgres, wouldn't this mean an explicit cron-like DELETE of rows all the time?


I believe that since timescale creates time based partitions, you can expire old data by dropping old chunks: https://docs.timescale.com/v0.11/api#drop_chunks
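
Roughly something like this (hypothetical table name; the exact signature may differ between TimescaleDB versions), run on whatever schedule you like instead of a row-by-row DELETE:

    -- Drop all chunks whose data is entirely older than 7 days
    SELECT drop_chunks(interval '7 days', 'conditions');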


InfluxDB isn't as easy to operate as it sounds.

Anything built on top of Postgres benefits from years of accumulated knowledge about tuning the database, but there isn't much of that for InfluxDB. You are on your own.

You cannot even easily upgrade InfluxDB, especially when you want to use some new feature such as enabling the TSI.

When something is wrong, again, you're on your own.

A good answer for InfluxDB HA isn't available either.

Yes, with all of those pain points, I'm still using InfluxDB. I even had to add in https://github.com/buger/goreplay to support replaying traffic to another instance during upgrades.

I had to write a tool to re-read old data and import it into the new instance instead of using their own import/restore.

There are many gotchas with InfluxDB. It's hard to explain to devs that they shouldn't use high-cardinality values for tags, or shouldn't use too many tags. For example, people get used to the `tag` concept from `fluentd` and put stuff like user IDs and device IDs into tags...

I want to log slow query times, but I cannot use the whole query as a tag because of the very high cardinality.

However, I keep using InfluxDB. I want to support it so we have something better than Postgres. I'm sick of SQL queries (personally) and I would like Flux to be successful.

It's similar to MongoDB: it was bad years ago but is very good nowadays. I guess the same thing will happen with InfluxDB. And indeed, they have improved over the years.

It's similar to how we use Ruby vs. C: it's about productivity. And despite the pain points above, I always find a way to solve them eventually. The tooling around InfluxDB is also nice, especially Grafana.


As someone who is struggling with issues with InfluxDB in production environments, this just moved my `Investigate replacing Influx with Timescale` issue higher up my priority list. Many of the problems with InfluxDB pointed out in the article are indeed real-world pain points for us.


What sort of issues are you running into?


After running InfluxDB in production for a bit more than a year, here are some issues we've run into, off the top of my head:

# Common

- All: Various documentation issues

  - Including versioning in the documents

- Influx/Kapacitor: Cannot join values in Influx, but you can in Kapacitor (but not dynamically)

  - Available in InfluxDB 1.6 now!

# InfluxDB

- Influx: GroupTag not grouping https://github.com/influxdata/kapacitor/pull/1773

- Influx: last() is really slow https://github.com/influxdata/influxdb/issues/8997

- Influx: Cannot update / edit tags https://github.com/influxdata/influxdb/issues/3904

- Apparently you can have tags and fields with the same names, which bumps up query times by 1000x without you ever knowing what's wrong (fixable by adding ::tag to the value)

- Cannot incrementally restore incrementally backed up databases (we made a script to do that) https://github.com/motleyagency/influxdb-incremental-restore

# Kapacitor

- Kapacitor does not support subqueries

- Kapacitor does not (properly) support queries from multiple measurements

- Cannot have fields and tags with the same name

  - Cast syntax doesn't work either (https://github.com/influxdata/influxdb/pull/6529)

# Telegraf

- Telegraf: Telegraf HTTPJson plugin does not support custom timestamps

# Chronograf

- The TickScript editor sometimes hangs for good and requires Chronograf restart

- Minor: No up-to-date syntax highlighting for TickScript in any common editors

That said, we most likely would have run into similar issues with other time-series databases, and I applaud their effort to keep InfluxDB open source.


In addition to the problems in the article (no mirroring, no HA, etc) it will simply stop responding seemingly randomly, and produce no errors. Have to restart the container to get it to respond again, which breaks other things.


These kinds of articles by one of the compared parties only become interesting once the other party responds. So I'm waiting for Paul Dix to show up in this thread, as usual.


The TimescaleDB benchmark code is a fork of code I wrote, as an independent consultant, for InfluxData in 2016 and 2017. The purpose of my project was to rigorously compare InfluxDB and InfluxDB Enterprise to Cassandra, Elasticsearch, MongoDB, and OpenTSDB. It's called influxdb-comparisons and is an actively-maintained project on Github at [0]. I am no longer affiliated with InfluxData, and these are my own opinions.

I designed and built the influxdb-comparisons benchmark suite to be easy to understand for customers. From a technical perspective, it is simulation-based, verifiable, fast, fair, and extensible. In particular, I created the "use-case approach" so that, no matter how technical our benchmark reports got, customers could say to themselves: "I understand this!". For example, in the devops use-case, we generate data and queries from a realistic simulation of telemetry collected from a server fleet. Doing it this way creates benchmarking stories that appeal to a wide variety of both technical and nontechnical customers.

This user-first design of a benchmarking suite was a novel innovation, and was a large factor in the success of the project.

Another aspect of the project is that we tried to do right by the competition. That means that we spoke with experts (sometimes, the creators of the databases themselves) on how to best achieve our goals. In particular, I worked hard to make the Cassandra, Elasticsearch, MongoDB, and OpenTSDB benchmarks show their respective databases in the best light possible. Concretely, each database was configured in a way that is 1) featureful, like InfluxDB, 2) fast at writes, 3) fast at reads, and 4) efficient with disk space.

As an example of my diligence in implementing this benchmark suite for InfluxData, I included a mechanism by which the benchmark query results can be verified for correctness across competing databases, to within floating point tolerances. This is important because, when building adapters for drastically different databases, it is easy to introduce bugs that could give a false advantage to one side or the other (e.g. by accidentally throwing data away, or by executing queries that don't range over the whole dataset).

I don't see that TimescaleDB is using the verification functionality I created. I encourage TimescaleDB to run query verification, and write up their benchmarking methods in detail, like I did here: [1].

I think it's great that TimescaleDB is taking these ideas and extending them. At InfluxData, we made the code open-source so that others could build and learn from our work. In that tradition, I hope that the ongoing discussion about how to do excellent benchmarking of time-series databases keeps evolving.

[0] https://github.com/influxdata/influxdb-comparisons (Note that others maintain this project now.)

[1] https://rwinslow.com/rwinslow-benchmark-tech-paper-influxdb-...


Hey rw, one of the core contributors to TSBS here. First of all, thank you for the work you did on influxdb-comparisons, it gave us a lot to work with and helped us understand Timescale’s strengths and weaknesses against other systems early on. We do appreciate the diligence and transparency that went into the project. We outline some of the reasons for our eventual decision to fork the project in our recent release post [1]. Most of the reasons boil down to needing more flexibility in the data models/use cases we benchmark and needing a more maintainable code design since we’re using this widely for a lot of internal testing.

Verification of the correctness of the query results is obviously something we take very seriously, otherwise running these benchmarks would be pretty pointless. We carefully verified the correctness of all of the query benchmarks we published. However, it’s a process we haven’t fully automated yet. From what we can tell, the same is true of influxdb-comparisons — the validation pretty prints full responses but each database has a significantly different format, so one needs to manually parse the results or set up a separate tool to do so. We have our own methods for doing that internally — once we get the process more standardized and automated we will definitely be adding it to TSBS. We encourage anyone with ideas around that (or anything else) to take a look at the open source TSBS code and consider contributing [2].

[1] https://blog.timescale.com/time-series-database-benchmarks-t...

[2] https://github.com/timescale/tsbs


> the focus of time-series databases has been narrowly on metrics and monitoring

I am curious whether people are using TSDBs for business predictions, machine learning, exploratory visualisation, data science, and AI. I got curious after seeing a Udacity course on time-series predictions.

https://www.udacity.com/course/time-series-forecasting--ud98...


Quite a few companies do. At my previous employer, I was involved in creating a predictive maintenance application using neural nets, with a TSDB as the data source.


Does anyone have thoughts on why Postgres shouldn't provide:

- Automatic sharding of tables per-"shard key".

- Automatic sharding of those keyed shards by the range of some primary index.

Doesn't this get you 90%+ of the way there? (There's no "adaptive" time bucketing, I guess.)

For the record, I am a veteran of a naive Postgres time series scheme that was brought to its knees by seek times.


I think what you are asking is why something like TimescaleDB has to exist in the first place; i.e., why doesn't PostgreSQL just naturally do this?

Here's why: There are scenarios with time-series data that rarely occur with standard/vanilla PostgreSQL OLTP workloads. So PostgreSQL simply isn't designed to handle these scenarios well on its own.

Having 100s-1,000s of partitions is one such example. We found that insert rate on standard PostgreSQL drops quickly as the number of partitions increases, because PostgreSQL decides to hold a lock on every partition on insert, even though the insert may only touch one partition. [1]

When we asked the core PostgreSQL devs about this, they explained that they did this because sorting out the appropriate locks was a hard problem, and that they saw this scenario as so unlikely for OLTP that they instead directed their resources to other more pressing problems.

But with time-series data this is a very common scenario, so we (TimescaleDB) had to sort it out ourselves.

And this is just one example. There are quite a few query optimizations that we also had to develop for working with time-based data more efficiently.

At a high-level, every project has to optimize for something. PostgreSQL understandably optimizes for OLTP workloads. But the beauty of PostgreSQL is that it allows extensions to optimize for other workloads, such as time-series.

[1] https://blog.timescale.com/time-series-data-postgresql-10-vs...
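
For a sense of scale, here is a minimal sketch of time-based partitioning with vanilla PostgreSQL 10 declarative partitioning (hypothetical table): each time range is its own partition, so fine-grained partitioning over a long retention window quickly means hundreds or thousands of partitions.

    CREATE TABLE conditions (
        time        timestamptz NOT NULL,
        device_id   integer,
        temperature double precision
    ) PARTITION BY RANGE (time);

    -- One partition per day; a year of data means ~365 of these
    CREATE TABLE conditions_2018_08_15 PARTITION OF conditions
        FOR VALUES FROM ('2018-08-15') TO ('2018-08-16');
    CREATE TABLE conditions_2018_08_16 PARTITION OF conditions
        FOR VALUES FROM ('2018-08-16') TO ('2018-08-17');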


> When we asked the core PostgreSQL devs about this, they explained that they did this because sorting out the appropriate locks was a hard problem, and that they saw this scenario as so unlikely for OLTP that they instead directed their resources to other more pressing problems.

The locking on partitioned tables is a little clunky, but the overhead of these locks is very low. The main performance problem in Postgres 10 was partition pruning, which used an exhaustive linear search. That has been fixed in Postgres 11 (due in September), which uses binary search and introduces various other partitioning improvements [1].

[1] https://www.postgresql.org/docs/11/static/release-11.html#id...


I believe what akulkarni meant to talk about is relation accesses (and not just locks). While the partition pruning certainly improved things, two sources of inefficiency still remain in PG 11:

1) Fetching statistics for each table during queries (which happens by reading from the data file off of disk). This happens /before/ pruning, even on PG 11.

2) The overhead of locking each table is still there. Although it's a smaller issue than (1).

We at TimescaleDB found (1) to be the most significant overhead and in fact we have significantly improved things there [1].

[1] https://github.com/timescale/timescaledb/commit/b7257fc8f483...


You could also just have worked on lowering those overheads in PG, just saying. It's easy to blame "PG devs", but most of us could get changes to our companies' respective customers more quickly by just fixing everything in forks.


Timescaler here. We're not blaming "PG devs". We have great respect for the PostgreSQL developers and what they are doing; so much, in fact, that we chose to base our product on PostgreSQL. And, TimescaleDB is not a fork--it is an extension to PostgreSQL that can be loaded in existing PostgreSQL installations.

We would be happy to contribute to PostgreSQL, but I think the issue here is that, as a business that is focusing on a very particular use case, we are not perfectly aligned with the PostgreSQL roadmap. We want to be able to move quickly and adapt to customers' needs, focusing on the pain points and issues they have. This simply isn't compatible with the more conservative development pace that mainline PostgreSQL understandably has.

From another perspective, I think one strength of PostgreSQL is, in fact, its support of extensions, enabling innovation alongside main PostgreSQL while the core developers can focus on a rock solid and extensible foundation. So, from where I am coming from, this is a feature and not a bug.


Thanks for your comment.

I'm also guessing that the wall between the worlds of "create table" and "insert" is pretty ingrained in the SQL developers, so solutions where inserts actually create database objects aren't something they're interested in. This is why the documentation on "DDL Partitioning" is a long tutorial on what's ultimately a static scheme. Compare this to Cassandra, where creating a new "table" per routing key is the obvious thing.

It does raise the question of whether "HyperTableDB" (for Postgres) would make its own coherent offering? (I don't mean to comment on your business strategy, this is an architectural/technical question about what hurdles there are.)


Postgres should provide it; it just doesn't (yet). Usable native partitions didn't really arrive until v10, and v11 finally makes them production-capable with range+hash, default partitions, moving rows when the partition key changes, pruning partitions in queries, etc.

Until then, extensions like Partman, Timescale, and Citus make up the functionality along with specializations for your use-case.


> PostgreSQL 10 provides native partitioning. Despite the flexibility, this approach forces the planner to perform an exhaustive search and to check constraints on each partition to determine whether it should be present in the plan or not. Large amount of partitions may result in significant planning overhead.

> The pg_pathman module features partition managing functions and optimized planning mechanism which utilizes knowledge of the partitions' structure.

https://postgrespro.com/products/extensions/pg_pathman


Nice article, but is it deliberate that you don't mention getting data into the database (line protocol or similar) or analysing the results and displaying them?

Input: use Postgres. Output: Grafana has a Postgres data source, which I assume works, and there is mention in TimescaleDB's issues of a Grafana query helper.

Also, there's a lack of discussion of analysis, consolidation (continuous queries), and retention policies.

I am, however, intrigued, as it does seem to hit my sweet spot of hundreds of servers, each with 5 to 10 different series of 10 metrics each, every 10 seconds.

Database size might be an issue, as would the complexity of deployment (a big win for Prometheus rather than InfluxDB, though).

Final thought: I can't help feeling this looks like a few input scripts running on Postgres rather than a system solution for metrics and annotations.


Fellow Timescaler here. Thanks for the feedback. While we do not directly compare ingestion protocols and specific features like continuous queries and retention policies (something I guess we could add), we do compare ecosystems and third-party tool support, including ingestion (e.g., Kafka, Hibernate) and visualization (e.g., Tableau, Grafana). In fact, the developer behind the Grafana PostgreSQL data source is also a Timescaler, and an upcoming version of the data source will have an improved query builder and first-class TimescaleDB support.

Finally, I can assure you that this is more than a few input scripts. In fact, the project is thousands of lines of C code (if that matters) that implement automatic partitioning, query optimizations, and much more. Our code is open source here: https://github.com/timescale/timescaledb

We have other blog posts (http://blog.timescale.com) and system documentation (http://docs.timescale.com) that explain what we're doing if you're interested in learning more about the details.


Creator & CTO of InfluxDB here. I won't respond too much in this write-up specifically but felt I had to respond to the requests for me to comment. For the benchmarks, I haven't looked at their fork of our original code and we haven't had engineers attempt to reproduce their results. In truth, we probably won't put much effort into that. We have every intention of putting more effort into benchmarking, but I'll talk a bit more about that at the end of this post.

Much of this comparison is the technology equivalent of an argument through appeal to authority. The old "nobody ever got fired for buying Big Blue" argument. It's true that Postgres has been around for much longer than InfluxDB. Mike is actually overly generous when he pinpoints InfluxDB's first release in September of 2013. The first commit was September 26th, 2013, when I added the MIT license and the empty README where I refer to the project as ChronosDB. The first "release" wouldn't follow for another six weeks, and I would hardly qualify that as an official release (0.0.1). If you followed the commit log, you'd see that InfluxDB is actually younger than that. We rewrote the entire thing from November of 2014 onward. Ben Johnson gets the award for the InfluxDB committer with the largest delete set in a single commit, from when he ripped everything out as we started the 0.9 release line. The storage engine didn't even start getting written until December of 2015 (although I wrote the prototype of the concept over Labor Day weekend at the beginning of September 2015). So in some sense you could say that we've been at this storage game for less than three years.

However, I wouldn't discount a technology simply because it's new. We take data loss very seriously and strive to create a storage engine that is safe for data. The issues linked to in that post are either closed, apply to a previous storage engine, or were recovered through tooling (in the case where a corrupt TSM file was written). Yes, these things take time to get right and there is always room for improvement. Proper infrastructure and planning mitigates these risks. For example, in our cloud environment, we take hourly snapshots of the EBS volumes that store data. We make sure that we're able to recover from a catastrophic failure, even if it is one that was induced by some software bug. Although we haven't seen corrupt TSM files or corrupt WAL files in our cloud environment.

The argument on community size is in a similar vein. Yes, the Postgres community is larger than InfluxDB's. But InfluxDB has a large, vibrant and growing community. PHP has a larger community than Go, but I'm not going to write code in PHP because of that (no offense to the PHP devs). When I was a Ruby programmer I didn't pick it because of maturity or community size. In 2005 barely anyone even knew about it. I picked Ruby (and Rails) at the time because of what I could build with them. More importantly, I picked those tools because of how quickly I could build with them. It also didn't hurt that I connected with the Ruby community and felt like I had found my tribe. So it's possible to have a community that you like and connect with regardless of size.

Ultimately, we've chosen to create from scratch. We've also chosen to create a new language rather than piggybacking on SQL. We've made these choices because we want to optimize for this use case and optimize for developer productivity. It's true that there are benefits to incremental improvement, but there are also benefits to rethinking the status quo. I've heard many times from our users that they liked the project because of how easy it was to get started and have something up and running. We'll continue to optimize for that while also optimizing performance, stability, the overall ecosystem and our community. It means that we invest into tools outside the database to make solving problems with time series data easier. It also means that we've created a storage engine from scratch and we're creating a new language, Flux. We've MIT licensed the language and the engine. This is because we're building Flux to make it work with other databases and systems. Our goal is to build an all new community and ecosystem around Flux, for programmers that are working with data (time series or otherwise).

Finally, some thoughts on benchmarks. I hate benchmarks. There are lies, damn lies, and benchmarks. Particularly in comparisons. You always have to look for what was in, what was out, and if everything was done to favor one solution over another. And yes, we're guilty of putting out the original performance benchmark comparisons. So here's what I want to do for InfluxDB as an ongoing effort. We should be benchmarking, but doing it with workloads that are as close to what we see in real production systems as possible. No bulk loading a bunch of data and then doing a bunch of basic queries while the DB isn't under any other load. Further, I don't want to do comparisons to other solutions. I don't want to do another vendor's work for them. I'd rather focus the benchmarks on continuous improvements against our own builds. Benchmarks are great when they lead to ongoing product improvement. They're also useful if we make them public for the community and our customers so they can see over time how things are shaping up.

We see time series as a massive market with many different offerings, which often have different philosophies. And much of this is about APIs and aesthetic, so for many questions, there isn't really a "correct" answer. Our goal is to focus our product efforts on delivering the best experience for the community and customers who are working with time series data and building applications and solutions on top of our platform. At the same time, we want to contribute as much of our from scratch code back to the open source Go community so that implementors ahead of us can build on our shoulders.


> Our goal is to focus our product efforts on delivering the best experience for the community and customers

After making clustering closed source after saying it would be part of the open source version, I think it would be more accurate and far more honest to just say "our customers", not "community and customers".


The vast majority of InfluxDB users are using open source exclusively. There are millions of servers all over the world running open source software built by InfluxData. When I say we're building for our community, I mean exactly that. We continue to put software into the open source ecosystem because it's a core value for our company and as developers we like to share what we're building with the world.

Yes, we build some features (like clustering) exclusively for paying customers, but that is what subsidizes the open source that we continue to build and make freely available. Last year I gave a talk and a related blog post about the dynamics of building a business on open source software: https://www.influxdata.com/blog/the-open-source-database-bus...


> The old "nobody ever got fired for buying Big Blue" argument. It's true that Postgres has been around for much longer than InfluxDB.

Yes. Postgres appeals to managers the same way IBM / Oracle did / do. Right-y-o.


At the end of the day, Influx is going to be kerchunking along on the square wheel that is garbage collection, with occasional wild memory and latency gyrations. Couple with that a log-structured tree for more "fun".

C doesn't guarantee you won't do things like that, but Timescale is built in such a way as to minimize most of this kind of extreme waste, while offering the full Postgres user experience.


Why not compare it with the market leader kdb+? https://kx.com/media/2018/06/KdbTransitive-Comparisons.pdf


From kdb+ license agreement:

"1.4 Kdb+ Software Evaluations. End User shall not distribute or otherwise make available to any third party any report regarding the performance of the Kdb+ Software, Kdb+ Software benchmarks or any information from such a report unless End User receives the express, prior written consent of Kx to disseminate such report or information."


It is unfortunate that Kdb+ likes to tout benchmarks against Influx etc, but their license prevents anyone else from doing the same.


That's some Oracle level bullshit.


Would love a comparison to Market Store https://github.com/alpacahq/marketstore


Given the Cambrian explosion in tools, I am always looking for safe ways to cross a tool off my "must consider" list. This article was very helpful in that regard!


Sure, PostgreSQL can handle metrics; I've been doing it for 2 years without any TimescaleDB.

I also use InfluxDB. Different tools for different purposes.


timescale.com isn't HTTPS-enabled by default; influxdata.com is. Just my first impression as a 2018 web developer.


Submission by "Professor of Computer Science, Princeton. Co-founder and CTO, Timescale."


From the article.

    > Yes, we are the developers of TimescaleDB, so you might quickly disregard our
    > comparison as biased. But if you let the analysis speak for itself, you’ll find
    > that it tries to stay objective — indeed, we report several scenarios below in
    > which InfluxDB outperforms TimescaleDB.


I've just read it, and 2/3 of this article shows TimescaleDB has more advantages than InfluxDB. I just wonder: why TimescaleDB instead of just using PostgreSQL directly?


(Timescaler here.) That's a common question, and one we address in this post: https://blog.timescale.com/timescaledb-vs-6a696248104e

tl;dr TimescaleDB vs. PostgreSQL: 20x higher inserts, 2000x faster deletes, 1.2x-14,000x faster queries, additional functions for time-series analytics (e.g., first(), last(), time_bucket() [1])

[1] http://docs.timescale.com/v0.11/api#analytics
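
As a rough illustration of those functions (hypothetical table; see the API docs above for exact signatures):

    -- 5-minute rollups per device using TimescaleDB's time_bucket() and last()
    SELECT time_bucket('5 minutes', time) AS five_min,
           device_id,
           avg(temperature) AS avg_temp,
           last(temperature, time) AS last_temp
    FROM conditions
    WHERE time > now() - interval '1 day'
    GROUP BY five_min, device_id
    ORDER BY five_min DESC;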


Mike Freedman is extra legit. There were rumors that he was going to be the 11th member of Wu Tang to replace ODB (RIP).



