PipelineDB = Insert data with a time component to be aggregated on the fly into always up-to-date summary tables using a variety of aggregation functions. Raw data is not persisted.
TimescaleDB = Store data with a time component in a "hypertable" that is automatically partitioned by time, for faster queries when limited by a time range. Single node, with helper methods that make time-based bucketing and aggregation easier.
Citus = Store data in distributed tables that are automatically partitioned and spread across multiple nodes by any single column. Join across nodes with non-distributed tables.
Can definitely use PipelineDB for real-time summaries and TimescaleDB or Citus for raw long-term storage in the same database.
Side note: It would be nice if Postgres had a package manager for extensions.
We're actively working on the scale-out version of TimescaleDB that will allow you to transparently shard hypertables across many servers. Hope to announce more specifics in the next several months.
I loved timescaledb but single node restriction made us go with Kafka streams. It was just too much hassle to maintain mapping between data and which node it's located at.
Really looking forward to multi node version. Good luck.
There's a question of what you actually want to do with the time-series data. If you don't expect to have much data or just want to store it and maybe occasionally query it, then a single server with partitioning (e.g. through pg_partman, Timescale) might be enough. If you want to build an analytical dashboard that needs to remain fast even if you're dealing with many users and terabytes of data per day, then you probably need Citus.
AFAIK Heap uses Citus but also has an internal partitioning scheduler for their customer event data, so I don't think they're a good example. Timescale doesn't support scaling out yet, but it's on their roadmap, so let's wait for them to implement it before drawing a fair conclusion.
If you're going to create roll-up tables and power your dashboard using those tables, you're fine with both options IMO. Cloudflare was also using Citus exactly for this use-case before they switched to Clickhouse.
If you have ad-hoc use-cases for time-series data, Timescale might be a better option because it's built exactly for this use-case and it knows the semantics of the data so it can partition the data in an optimized way and perform some optimizations such as parallelized operations and re-sizing chunks. In that sense, it's comparable to Influxdb, not Citus.
Minor clarification: do you actually need to choose TimescaleDB xor Citus? I thought the fact that both are Postgres extensions (as opposed to forks) meant that they could be used jointly.
You can use both extensions on separate tables but they are not interoperable on the same table.
Timescale is limited to time-based partitioning on one node while Citus uses a master node with partitioning across multiple worker nodes. I don't see many situations where you would run both.
If you're just doing single-node time-series then use Timescale. If you have Citus then you can use the native partitions in 11 or pg_partman to add a secondary partitioning dimension.
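A rough sketch of that second pattern, assuming a hypothetical `events` table distributed by `device_id` and range-partitioned by time with PG11 native partitioning (table, column, and partition names are all made up):

```sql
-- Hypothetical table: distributed by device_id, partitioned by time
CREATE TABLE events (
    device_id bigint NOT NULL,
    ts        timestamptz NOT NULL,
    payload   jsonb
) PARTITION BY RANGE (ts);

-- One time partition; pg_partman could create these on a schedule
CREATE TABLE events_2018_12 PARTITION OF events
    FOR VALUES FROM ('2018-12-01') TO ('2019-01-01');

-- Citus shards the partitioned table across workers by device_id
SELECT create_distributed_table('events', 'device_id');
```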
I've been following Pipeline since the beginning and it's so fricking cool. Please, if you can't think of a good use of Pipeline, use it instead of a count(*)! :D
I'm Derek, one of the co-founders--that's an interesting way to frame it, I think that makes a lot of sense at a high level.
We're in contact with the TSDB founders (awesome and super smart guys!) and are in the early stages of figuring out an integration that makes sense. That's most likely going to happen.
To anyone interested: we'd love to hear and consider your ideas re: TSDB integration. Feel free to open an issue in either repo (or add to an existing one) and tell us more!
The most impactful thing you can do here is ask the RDS team for this. If enough users ask them for it they'll eventually begin seriously considering it :)
RE integration: a Docker image with both the TSDB and PipelineDB extensions plus PostGIS, supporting PG11 ;) That's something I would look into doing myself, but I lack the time to do so..
The time-series database of a project I'm on uses Timescale and it's been great for the quick inserts, and the `time_bucket` function has been very useful for aggregate queries. But moving from aggregations generated on-the-fly to ones updated continuously on data change sounds like it could be awesome for us, so I am v happy to see this article today :-)
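For reference, a typical `time_bucket` aggregate looks roughly like this (the `metrics` table and its columns are hypothetical):

```sql
-- Average value per 5-minute bucket over the last day
SELECT time_bucket('5 minutes', ts) AS bucket,
       avg(value) AS avg_value
FROM metrics
WHERE ts > now() - interval '1 day'
GROUP BY bucket
ORDER BY bucket;
```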
If I understand correctly, they're not really solving the same problem on the read/write sides of the coin.
In fact, they seem to be on different tracks.
PipelineDB seems to do continuous aggregation, so the type of data it deals with is essentially summary data. If you know your summary function a priori, this can lead to very compact and efficient storage. The use case for this is reporting, dashboarding, etc.
TimescaleDB on the other hand deals with raw data. This is useful if you have multiple parties needing different types of aggregation from the same raw data. Also, if you want to do any kind of machine learning, raw unaggregated data would typically be more useful.
PipelineDB co-founder here--I think this is a pretty fair take! I would also like to point out that the aggregate data stored in PipelineDB can still be further aggregated, processed, JOINed on etc. on demand as well.
Since a continuous view's output is simply stored as a regular table, you are free to run arbitrary SELECT queries on it to further distill and filter your results. PipelineDB's special combine [0] aggregate allows you to combine aggregate values with no loss of information for this very purpose.
The most common pattern among our user base is to aggregate time-series data into continuous views at some base level of granularity (e.g. by minute) and then aggregate over that for final results (e.g. aggregate down to hour-level rows for the date range my frontend has selected).
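A sketch of that pattern, using the `CREATE CONTINUOUS VIEW` syntax from the standalone PipelineDB releases (stream and column names are hypothetical; `arrival_timestamp` is the implicit timestamp PipelineDB attaches to stream events):

```sql
-- Incrementally maintained: one row per minute, raw events discarded
CREATE CONTINUOUS VIEW views_minute AS
    SELECT date_trunc('minute', arrival_timestamp) AS minute,
           count(*) AS views
    FROM page_view_stream
    GROUP BY minute;

-- On read: roll the minutes up to hours for the selected date range
SELECT date_trunc('hour', minute) AS hour, sum(views) AS views
FROM views_minute
WHERE minute >= '2018-12-01' AND minute < '2018-12-02'
GROUP BY hour
ORDER BY hour;
```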
(Timescale founder) As someone points out elsewhere, the difference between TimescaleDB and PipelineDB is more akin to raw data and materialized aggregates (Timescale) vs. streaming summary data (Pipeline).
So we are big fans of what the PipelineDB team are building and see value in using both.
(And if you are interested in how TimescaleDB's hypertable/chunk architecture plus other optimizations (e.g. at the query planner level) lead to both higher inserts and faster queries compared to Postgres: https://blog.timescale.com/timescaledb-vs-6a696248104e)
One of the challenges of managing time series data is being able to handle both reads and writes at scale concurrently. I wrote about some of the ways different TSDBs approach this issue. (Disclosure: I work for a commercial time series database provider.)
INSERT INTO events_stream (ts, value) VALUES (now(), '0ef346ac');
> As soon as the continuous view reads new incoming events and the distinct count is updated to reflect new information, the raw events will be discarded.
So you create a table, insert into it, and it's always empty. Is that right?
Does this work for any table in pg? How does pg know that the insert should NOT actually insert a row?
This only applies to continuous views, not all PG tables. Think of continuous views in PipelineDB as very high throughput, incrementally updated materialized views. Raw data hits continuous queries in PipelineDB (continuous views) and only the output of the continuous queries is stored. So 1 billion events ingested could be distilled down into a single row that incrementally counts up from 1 => 1 billion as each data point arrives, instead of storing all of the 1 billion raw data points and counting them up later.
This is precisely why PipelineDB has rich support for data structures such as HyperLogLog [0]. HLL's allow you to track distincts information using fixed-size HLLs that only grow to about 14KB while encoding uniques counts for billions of distinct values. The tradeoff is about a ~0.8% margin of error, which users generally find acceptable.
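For instance, a continuous distinct count like the following (stream and column names are hypothetical) is backed by a fixed-size HLL rather than by storing every distinct value:

```sql
-- count(DISTINCT ...) in a continuous view is maintained as an HLL
CREATE CONTINUOUS VIEW daily_uniques AS
    SELECT date_trunc('day', arrival_timestamp) AS day,
           count(DISTINCT user_id) AS unique_users
    FROM event_stream
    GROUP BY day;
```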
Furthermore, PipelineDB has a special combine [1] aggregate that allows you to combine data structures such as HLL across multiple rows with no loss of information. A simpler example would be average: to get the actual average of multiple averages you obviously can't simply take the average of all the averages. Their weights must be taken into account, and combine handles that.
The capability to combine aggregate values in this way generalizes to all aggregates in PipelineDB.
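As a sketch with averages (all names hypothetical): taking `avg(avg_latency)` over the view's rows would ignore each row's weight, while `combine` merges the internal aggregate state (effectively the per-row sums and counts) losslessly:

```sql
-- Per-minute average, incrementally maintained
CREATE CONTINUOUS VIEW latency_minute AS
    SELECT date_trunc('minute', arrival_timestamp) AS minute,
           avg(latency_ms) AS avg_latency
    FROM request_stream
    GROUP BY minute;

-- Correct overall average across all minutes
SELECT combine(avg_latency) FROM latency_minute;
```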
I'm Derek, one of the co-founders--great questions!
> So you create a table, insert into it, and it's always empty. Is that right?
That is correct. Streams can only be read by continuous queries (e.g. you can't even run a SELECT on them).
> Does this work for any table in pg? How does pg know that the insert should NOT actually insert a row?
PipelineDB streams are represented as a specific kind of PostgreSQL foreign table [0], so only foreign tables created in a specific way will be considered streams. You can use triggers to write table rows and updates out to streams if you want to though.
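A minimal sketch tying this together, using the standalone-release syntax (the stream matches the `events_stream` INSERT quoted earlier in the thread):

```sql
-- The stream itself never stores rows; SELECTs on it are not allowed
CREATE STREAM events_stream (ts timestamptz, value text);

-- Only this single, incrementally updated row is persisted
CREATE CONTINUOUS VIEW event_count AS
    SELECT count(*) AS total FROM events_stream;
```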
Fantastic guys, thank you! I’ve been looking forward to it becoming an extension for half a year. This is great news.
This basically means Postgres now has continuous views and a toolbox of functions for running calculations. Combined with PG11's partitioning features and better parallel query execution, PG is an even more formidable choice for medium-sized data.
I work closely in the space of providing time series databases as managed solutions. I'm very happy to see this recent wave of new TSDBs, and this, together with Timescale, is a huge boost to the industry/segment. Everybody currently measures some analytics, mostly user data, and there is so much abuse of it; yet there is so much more you can measure and do, and it is still very early stage. Farms, industrial applications, IoT, and so much more. I'd love to just measure temperature and wind speed at unprecedented resolution.
Very interesting. Does it aggregate per day? If so, I wonder how it handles time zones, i.e. when to start a new day when you have agents in different time zones.
How aggregations are performed is determined entirely by your own continuous view definitions [0]. In this case I'm guessing you'd want to include a time-based column in the aggregation's GROUP BY clause.
And since PipelineDB is a PostgreSQL extension, you can use the timestamptz type (which includes timezone support), and in general you could pretty easily simply normalize your event timezones in your continuous view definitions. When you're reading aggregate data back out, you could cast the time-based column using whatever timezone the client prefers.
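One hedged sketch of that approach (stream and column names hypothetical): normalize to UTC day buckets inside the view, then convert for display on read:

```sql
-- Group by UTC day boundaries regardless of agent time zone
CREATE CONTINUOUS VIEW daily_avg AS
    SELECT date_trunc('day', ts AT TIME ZONE 'UTC') AS day_utc,
           avg(temp) AS avg_temp
    FROM sensor_stream
    GROUP BY day_utc;

-- On read: reinterpret the stored day as UTC; the client session's
-- TimeZone setting then controls how it is displayed
SELECT day_utc AT TIME ZONE 'UTC' AS day, avg_temp FROM daily_avg;
```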
I'm Derek, one of the co-founders--excellent question!
The former. PipelineDB performs aggregations in memory on microbatches of events, and only merges the aggregate output of each microbatch with what's on disk. This is really the core idea behind why PipelineDB is so performant for continuous time-series aggregation. Microbatch size is configurable: http://docs.pipelinedb.com/conf.html.
Can you say a bit more about "performant" or point me to some information? I haven't found any yet. I'm processing millions of protobufs per second and would love to get away from batch jobs to do some incredibly basic counting -- this seems like a fit conceptually... If it's a fit, any recommendations on the best way to get those protobufs off a Kafka stream and into PipelineDB would be great, too!
Performance depends heavily on the complexity of your continuous queries, which is why we don't really publish benchmarks. PipelineDB is different from more traditional systems in that not all writes are created equal, given that continuous queries are applied to them as they're received. This makes generic benchmarking less useful, so we always encourage users to roughly benchmark their own workloads to really understand performance.
That being said, millions of events per second should absolutely be doable, especially if your continuous queries are relatively straightforward as you've suggested. If the output of your continuous queries fits in memory, then it's extremely likely you'd be able to achieve the throughput you need relatively easily.
Many of our users use our Kafka connector [0] to consume messages into PipelineDB, although given that you're using protobufs I'm guessing your messages require a bit more processing/unpacking to get them into a format that can be written to PipelineDB (basically something you can INSERT or COPY into a stream). In that case what most users do is write a consumer that simply transforms messages into INSERT or COPY statements. These writes can be parallelized heavily and are primarily limited by CPU capacity.
Please feel free to reach out to me (I'm Derek) if you'd like to discuss your workload and use case further, or set up a proof-of-concept--we're always happy to help!
That's awesome! If you don't mind - one more q.. I see that stream-stream joins are not yet supported (http://docs.pipelinedb.com/joins.html#stream-stream-joins). Can you comment on when you think this feature could land, or is it still a ways off?
Sure! So stream-stream JOINs actually haven't been requested by users as much as you'd think. Users have generally been able to get what they need by using topologies of transforms [0], output streams, and stream-table JOINs. Continuous queries can be chained together into arbitrary DAGs of computation, which turns out to be a very powerful concept when mapping out a path from raw input events to the desired output for your use case.
The primary issue in implementing stream-stream JOINs is that we'd essentially need to preemptively store every single raw event that could be matched on at some point in the future. Conceptually this is straightforward, but on a technical level we just haven't seen the demand to optimize for it.
That being said, you could just use a regular table as one of the "streams" you wanted to JOIN on and then use a stream-table JOIN. As long as the table side of the JOIN is indexed on the JOIN condition, an STJ would probably be performant enough for a lot of use cases. With PostgreSQL's increasingly excellent partitioning support this is becoming especially practical.
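A sketch of that workaround (all names hypothetical): keep the "slow" side in a regular, indexed table and JOIN the fast stream against it inside a continuous view:

```sql
-- The "slow" side lands in a regular table, indexed on the JOIN key
CREATE TABLE slow_side (key bigint PRIMARY KEY, val numeric);

-- Stream-table JOIN inside a continuous view
CREATE CONTINUOUS VIEW joined AS
    SELECT f.key, sum(f.val + t.val) AS total
    FROM fast_stream f
    JOIN slow_side t ON f.key = t.key
    GROUP BY f.key;
```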
I also suspect that this is an area where integration with TimescaleDB could be really interesting!
Just out of curiosity, do you have a specific use case that necessitates stream-stream JOINs, or were you just exploring the docs and wondering about this?
My use case is pretty much parallel time series alignment with several layers of aggregation. I guess I perceive stream-stream joins as an easy way for me to wrap my head around how to structure my compute graph, but it seems doable with the method mentioned by @grammr. I'd hope for an interface roughly like "CREATE join_stream from (SELECT slow_str.key AS key, sum(slow_str.val, fast_str.val) AS val FROM slow_str, fast_str INNER JOIN ON slow_str.key = fast_str.key)". I do realize there are some tough design decisions for a system like this, but I'd also like to drop my wacky zmq infrastructure ;)
Fundamentally different approaches. Citus focuses on horizontal scaling for PostgreSQL and PipelineDB focuses on continuous aggregation for large volumes of streaming time-series data.
We're super happy with where Stride is at! We've continued to onboard customers in a few AWS regions and the infrastructure is rock solid at this point. Most users are ingesting 10k+ events/s and their analytics frontends are retrieving results in well under 100ms. We've gotten it to the point where it "just works" which has made Stride users' lives a lot easier at that scale.
And since the hard parts of Stride are powered by PipelineDB, an added benefit for us is that we now get a ton of super detailed instrumentation data about PipelineDB performance and behavior, which has helped make the open-source product quite a bit better.
We'll be moving Stride into self-service/GA next year--stay tuned!
No, not currently; however, we offer hosted deployments of PipelineDB as well as a SaaS product called Stride (stride.io), which is based on PipelineDB.