Are there any best practices for using PostgreSQL for storing time series data? Would it be comparable in performance to some of the NoSQL solutions (like Cassandra) for reasonable loads?
If you want to store a large number of time series, Influx might not be for you. For smaller data sets (where the full dataset can fit on a single machine, or the data has very low dimensionality), Influx is nice.
However, if the data grows beyond what a single scaled-up machine can handle, take a look at Druid (druid.io). It is a bit more involved to set up than Influx, but was built from the very beginning to scale out horizontally. As a result, I can do realtime analysis (using Grafana) of over 10 billion data points and perform aggregations over said data. It is an incredibly useful tool, and the newly released 0.9 looks even better.
It is really impressive tech. Bonus points that some of the original founders of Druid from Metamarkets just founded a company to do enterprise support around it.
I save 2 billion rows of timeseries data every year. I use a regular btree index for "hot" data that is less than 6 months old and a BRIN index for older data. You can do this with a partial index.
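A minimal sketch of that setup (table and column names are hypothetical, and the cutoff date in a partial index has to be a constant, so the "hot" index needs to be recreated periodically as data ages):

```sql
CREATE TABLE metrics (
    ts    timestamptz NOT NULL,
    value double precision
);

-- B-tree partial index covering only the "hot" rows; the predicate must be
-- a constant (now() isn't allowed here), so rebuild it as the window moves.
CREATE INDEX metrics_hot_idx ON metrics (ts)
    WHERE ts >= '2016-01-01';

-- BRIN index (PostgreSQL 9.5+) for the full history; it stays tiny because
-- it only stores min/max per block range, which suits time-ordered inserts.
CREATE INDEX metrics_brin_idx ON metrics USING brin (ts);
```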
You also have to spend some time tuning the query cost settings to avoid sequential scans if you're only going to work with a subset of the data. Another optimization is table inheritance, so you have a table for every year; if you work with data sets for a specific year, you get a big performance boost from sequential scans.
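A sketch of both ideas, assuming a hypothetical metrics(ts, value) parent table (the setting value and cutoffs are illustrative, not recommendations):

```sql
-- Nudge the planner away from sequential scans on fast storage
-- (tune against your own workload; 1.1 is just an example).
SET random_page_cost = 1.1;

-- One child table per year via inheritance; the CHECK constraints let
-- constraint_exclusion prune children that can't match the WHERE clause.
CREATE TABLE metrics_2015 (
    CHECK (ts >= '2015-01-01' AND ts < '2016-01-01')
) INHERITS (metrics);

CREATE TABLE metrics_2016 (
    CHECK (ts >= '2016-01-01' AND ts < '2017-01-01')
) INHERITS (metrics);

-- A query constrained to one year now only touches that child table.
SELECT avg(value)
FROM   metrics
WHERE  ts >= '2016-01-01' AND ts < '2017-01-01';
```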
PostgreSQL's biggest weakness at the moment is aggregating data using several CPU workers/cores. This is coming in PostgreSQL 9.6.
Oh, and I run PostgreSQL on ZFS with LZ4 compression. It performs as well as any other file system, but ZFS shines when you are IO bound, as compression really helps when you do sequential scans.
I have also tested ZFS with Microsoft SQL Server by exporting a ZVOL over iSCSI (FreeBSD) over 10G Ethernet, but without compression, as it has no benefit on 4k blocks. Performance was similar to what you would get with the same drives striped on Windows Server 2012. The big win here is of course ZFS's data checksumming. I'm not sure about snapshots as backups though; I need to figure out how to talk to the SQL Writer service so it can tell SQL Server to flush and lock, letting me take a consistent snapshot. Microsoft really needs to improve their documentation, because this would be really helpful for enterprises when it comes to backup speed.
Not the TP, but I personally am interested in the latter one (metrics).
There doesn't seem to be any silver bullet yet. And it is also hard to even see how a relational database compares to the existing solutions, since most people dismiss it immediately.
In my experience time-series is one of the use cases that warrants evaluating a specialized database off the bat. Plenty of databases (mongo ahem) advertise supporting time-series but under fairly pedestrian loads hit a wall and force you to resort to dirty hacks.
I'm also in a similar position, where I'd like to store approximately 560k records / user / year. My understanding is that Cassandra doesn't support some queries that would be useful when the business logic is less clear (like GROUP BY)[1]. I'm leaning towards using PostgreSQL with a dedicated write DB until performance becomes an issue.
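For what it's worth, the kind of ad-hoc aggregation I mean is a one-liner in PostgreSQL (hypothetical schema):

```sql
-- Daily average per user; in Cassandra you would typically have to
-- precompute this into its own table or roll it up client-side.
SELECT user_id,
       date_trunc('day', ts) AS day,
       avg(value)            AS avg_value
FROM   readings
GROUP  BY user_id, day
ORDER  BY user_id, day;
```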
I decided to tee my timeseries data into InfluxDB. It's purpose-built for the task and has built-in support for rollups, aggregation, retention policies, and gap filling. Admittedly I have not put Influx under much stress or scalability testing, since my use case is based more on utility than performance.
Unless PG has some timeseries-specific extensions, I have assumed a TS-specific database would be more appropriate. Also curious to try Riak TS.
Postgres has fairly robust support for temporal-style data. In particular, with 9.5 BRIN indices are often a great fit for heavy write workloads where most queries are range scans. It's not the best database for it, though.
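As a sketch (hypothetical table): a BRIN index keeps only min/max summaries per block range, so it stays tiny while still serving range scans well, provided rows land on disk in roughly time order:

```sql
CREATE TABLE events (ts timestamptz NOT NULL, payload jsonb);
CREATE INDEX events_ts_brin ON events USING brin (ts);

-- The typical range-scan shape a BRIN index accelerates: the planner
-- skips every block range whose min/max falls outside the interval.
SELECT count(*)
FROM   events
WHERE  ts >= '2016-03-01' AND ts < '2016-04-01';
```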
Time series oriented databases are usually purpose built around storing data in sorted order on disk (easy on a single machine, harder in a clustered environment) such that range scans are efficient. /plug/ check out riak ts.