PipelineDB = Insert data with a time component to be aggregated on the fly into always up-to-date summary tables using a variety of aggregation functions. Raw data is not persisted.
TimescaleDB = Store data with a time component in a "hypertable" that is automatically partitioned by time, for faster queries when limited by a time range. Single node, with helper methods that make time-based bucketing and aggregation easier.
Citus = Store data in distributed tables that are automatically partitioned and spread across multiple nodes by any single column. Join across nodes with non-distributed tables.
Can definitely use PipelineDB for real-time summaries and TimescaleDB or Citus for raw long-term storage in the same database.
Side note: It would be nice if Postgres had a package manager for extensions.
We're actively working on the scale-out version of TimescaleDB that will allow you to transparently shard hypertables across many servers. Hope to announce more specifics in the next several months.
I loved timescaledb but single node restriction made us go with Kafka streams. It was just too much hassle to maintain mapping between data and which node it's located at.
Really looking forward to multi node version. Good luck.
There's a question of what you actually want to do with the time-series data. If you don't expect to have much data or just want to store it and maybe occasionally query it, then a single server with partitioning (e.g. through pg_partman, Timescale) might be enough. If you want to build an analytical dashboard that needs to remain fast even if you're dealing with many users and terabytes of data per day, then you probably need Citus.
AFAIK Heap uses Citus but also has an internal partitioning scheduler for their customer event data, so I don't think they're a good example. Timescale doesn't support scaling out yet, but it's on their roadmap, so let's wait for them to implement it before drawing a fair conclusion.
If you're going to create roll-up tables and power your dashboard using those tables, you're fine with both options IMO. Cloudflare was also using Citus exactly for this use-case before they switched to Clickhouse.
If you have ad-hoc use-cases for time-series data, Timescale might be a better option because it's built exactly for this use-case and it knows the semantics of the data so it can partition the data in an optimized way and perform some optimizations such as parallelized operations and re-sizing chunks. In that sense, it's comparable to Influxdb, not Citus.
Minor clarification: do you actually need to choose TimescaleDB xor Citus? I thought the fact that both are Postgres extensions (as opposed to forks) meant that they could be used jointly.
You can use both extensions on separate tables but they are not interoperable on the same table.
Timescale is limited to time-based partitioning on one node while Citus uses a master node with partitioning across multiple worker nodes. I don't see many situations where you would run both.
If you're just doing single-node time-series then use Timescale. If you have Citus then you can use the native partitions in 11 or pg_partman to add a secondary partitioning dimension.
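A rough sketch of that second pattern, assuming a hypothetical `events` table distributed by `device_id` and range-partitioned by time with PG11 native partitioning (table, column, and partition names are all made up):

```sql
-- Hypothetical table: distributed by device_id, partitioned by time
CREATE TABLE events (
    device_id bigint NOT NULL,
    ts        timestamptz NOT NULL,
    payload   jsonb
) PARTITION BY RANGE (ts);

-- One time partition; pg_partman could create these on a schedule
CREATE TABLE events_2018_12 PARTITION OF events
    FOR VALUES FROM ('2018-12-01') TO ('2019-01-01');

-- Citus shards the partitioned table across workers by device_id
SELECT create_distributed_table('events', 'device_id');
```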
I've been following Pipeline since the beginning and it's so fricking cool. Please, if you can't think of a good use of Pipeline, use it instead of a count(*)! :D
I'm Derek, one of the co-founders--that's an interesting way to frame it, I think that makes a lot of sense at a high level.
We're in contact with the TSDB founders (awesome and super smart guys!) and are in the early stages of figuring out an integration that makes sense. That's most likely going to happen.
To anyone interested: we'd love to hear and consider your ideas re: TSDB integration. Feel free to open an issue in either repo (or add to an existing one) and tell us more!
The most impactful thing you can do here is ask the RDS team for this. If enough users ask them for it they'll eventually begin seriously considering it :)
RE integration: a Docker image with both the TSDB and PipelineDB extensions plus PostGIS, supporting PG11 ;) That's something I would look into doing myself, but I lack the time to do so..
The time-series database of a project I'm on uses Timescale and it's been great for the quick inserts, and the `time_bucket` function has been very useful for aggregate queries. But moving from aggregations generated on-the-fly to ones updated continuously on data change sounds like it could be awesome for us, so I am v happy to see this article today :-)
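For reference, a typical `time_bucket` aggregate looks roughly like this (the `metrics` table and its columns are hypothetical):

```sql
-- Average value per 5-minute bucket over the last day
SELECT time_bucket('5 minutes', ts) AS bucket,
       avg(value) AS avg_value
FROM metrics
WHERE ts > now() - interval '1 day'
GROUP BY bucket
ORDER BY bucket;
```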
If I understand correctly, they're not really solving the same problem on the read/write sides of the coin.
In fact, they seem to be on different tracks.
PipelineDB seems to do continuous aggregation, so the type of data it deals with is essentially summary data. If you know your summary function a priori, this can lead to very compact and efficient storage. The use case for this is reporting, dashboarding, etc.
TimescaleDB on the other hand deals with raw data. This is useful if you have multiple parties needing different types of aggregation from the same raw data. Also, if you want to do any kind of machine learning, raw unaggregated data would typically be more useful.
PipelineDB co-founder here--I think this is a pretty fair take! I would also like to point out that the aggregate data stored in PipelineDB can still be further aggregated, processed, JOINed on etc. on demand as well.
Since a continuous view's output is simply stored as a regular table, you are free to run arbitrary SELECT queries on it to further distill and filter your results. PipelineDB's special combine [0] aggregate allows you to combine aggregate values with no loss of information for this very purpose.
The most common pattern among our user base is to aggregate time-series data into continuous views at some base level of granularity (e.g. by minute) and then aggregate over that for final results (e.g. aggregate down to hour-level rows for the date range my frontend has selected).
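A sketch of that pattern, using the `CREATE CONTINUOUS VIEW` syntax from the standalone PipelineDB releases (stream and column names are hypothetical; `arrival_timestamp` is the implicit timestamp PipelineDB attaches to stream events):

```sql
-- Incrementally maintained: one row per minute, raw events discarded
CREATE CONTINUOUS VIEW views_minute AS
    SELECT date_trunc('minute', arrival_timestamp) AS minute,
           count(*) AS views
    FROM page_view_stream
    GROUP BY minute;

-- On read: roll the minutes up to hours for the selected date range
SELECT date_trunc('hour', minute) AS hour, sum(views) AS views
FROM views_minute
WHERE minute >= '2018-12-01' AND minute < '2018-12-02'
GROUP BY hour
ORDER BY hour;
```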
(Timescale founder) As someone points out elsewhere, the difference between TimescaleDB and PipelineDB is more akin to raw data and materialized aggregates (Timescale) vs. streaming summary data (Pipeline).
So we are big fans of what the PipelineDB team are building and see value in using both.
(And if you are interested in how TimescaleDB's hypertable/chunk architecture plus other optimizations (e.g. at the query planner level) lead to both higher inserts and faster queries compared to Postgres: https://blog.timescale.com/timescaledb-vs-6a696248104e)
One of the challenges of managing time series data is being able to handle both reads and writes at scale concurrently. I wrote about some of the ways different TSDBs approach this issue. (Disclosure: I work for a commercial time series database provider.)
INSERT INTO events_stream (ts, value) VALUES (now(), '0ef346ac');
> As soon as the continuous view reads new incoming events and the distinct count is updated to reflect new information, the raw events will be discarded.
So you create a table, insert into it, and it's always empty. Is that right?
Does this work for any table in pg? How does pg know that the insert should NOT actually insert a row?
This only applies to continuous views, not all PG tables. Think of continuous views in PipelineDB as very high throughput, incrementally updated materialized views. Raw data hits continuous queries in PipelineDB (continuous views) and only the output of the continuous queries is stored. So 1 billion events ingested could be distilled down into a single row that incrementally counts up from 1 => 1 billion as each data point arrives, instead of storing all of the 1 billion raw data points and counting them up later.
This is precisely why PipelineDB has rich support for data structures such as HyperLogLog [0]. HLL's allow you to track distincts information using fixed-size HLLs that only grow to about 14KB while encoding uniques counts for billions of distinct values. The tradeoff is about a ~0.8% margin of error, which users generally find acceptable.
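For instance, a continuous distinct count like the following (stream and column names are hypothetical) is backed by a fixed-size HLL rather than by storing every distinct value:

```sql
-- count(DISTINCT ...) in a continuous view is maintained as an HLL
CREATE CONTINUOUS VIEW daily_uniques AS
    SELECT date_trunc('day', arrival_timestamp) AS day,
           count(DISTINCT user_id) AS unique_users
    FROM event_stream
    GROUP BY day;
```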
Furthermore, PipelineDB has a special combine [1] aggregate that allows you to combine data structures such as HLL across multiple rows with no loss of information. A simpler example would be average: to get the actual average of multiple averages you obviously can't simply take the average of all the averages. Their weights must be taken into account, and combine handles that.
The capability to combine aggregate values in this way generalizes to all aggregates in PipelineDB.
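As a sketch with averages (all names hypothetical): taking `avg(avg_latency)` over the view's rows would ignore each row's weight, while `combine` merges the internal aggregate state (effectively the per-row sums and counts) losslessly:

```sql
-- Per-minute average, incrementally maintained
CREATE CONTINUOUS VIEW latency_minute AS
    SELECT date_trunc('minute', arrival_timestamp) AS minute,
           avg(latency_ms) AS avg_latency
    FROM request_stream
    GROUP BY minute;

-- Correct overall average across all minutes
SELECT combine(avg_latency) FROM latency_minute;
```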
I'm Derek, one of the co-founders--great questions!
> So you create a table, insert into it, and it's always empty. Is that right?
That is correct. Streams can only be read by continuous queries (e.g. you can't even run a SELECT on them).
> Does this work for any table in pg? How does pg know that the insert should NOT actually insert a row?
PipelineDB streams are represented as a specific kind of PostgreSQL foreign table [0], so only foreign tables created in a specific way will be considered streams. You can use triggers to write table rows and updates out to streams if you want to though.
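A minimal sketch tying this together, using the standalone-release syntax (the stream matches the `events_stream` INSERT quoted earlier in the thread):

```sql
-- The stream itself never stores rows; SELECTs on it are not allowed
CREATE STREAM events_stream (ts timestamptz, value text);

-- Only this single, incrementally updated row is persisted
CREATE CONTINUOUS VIEW event_count AS
    SELECT count(*) AS total FROM events_stream;
```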
Fantastic guys, thank you! I’ve been looking forward to it becoming an extension for half a year. This is great news.
This basically means Postgres now has continuous views and a toolbox of functions for running calculations. Combined with PG11's partitioning features and better parallel query execution, PG is an even more formidable choice for medium-sized data.
I work closely in the space of providing time series databases as managed solutions. I'm very happy to see this recent wave of new TSDBs, and this, together with Timescale, is a huge boost to the industry/segment. Everybody currently measures some analytics, mostly user data, and there is so much abuse of it; yet there is so much more you can measure and do, and it is still very early stage. Farms, industrial applications, IoT, and so much more. I'd love to just measure temperature and wind speed at unprecedented resolution.
Very interesting. Does it aggregate per day? If so, I wonder how it handles time zones, i.e. when to start a new day when you have agents in different time zones.
How aggregations are performed is determined entirely by your own continuous view definitions [0]. In this case I'm guessing you'd want to include a time-based column in the aggregation's GROUP BY clause.
And since PipelineDB is a PostgreSQL extension, you can use the timestamptz type (which includes timezone support), and in general you could pretty easily simply normalize your event timezones in your continuous view definitions. When you're reading aggregate data back out, you could cast the time-based column using whatever timezone the client prefers.
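One hedged sketch of that approach (stream and column names hypothetical): normalize to UTC day buckets inside the view, then convert for display on read:

```sql
-- Group by UTC day boundaries regardless of agent time zone
CREATE CONTINUOUS VIEW daily_avg AS
    SELECT date_trunc('day', ts AT TIME ZONE 'UTC') AS day_utc,
           avg(temp) AS avg_temp
    FROM sensor_stream
    GROUP BY day_utc;

-- On read: reinterpret the stored day as UTC; the client session's
-- TimeZone setting then controls how it is displayed
SELECT day_utc AT TIME ZONE 'UTC' AS day, avg_temp FROM daily_avg;
```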
I'm Derek, one of the co-founders--excellent question!
The former. PipelineDB performs aggregations in memory on microbatches of events, and only merges the aggregate output of each microbatch with what's on disk. This is really the core idea behind why PipelineDB is so performant for continuous time-series aggregation. Microbatch size is configurable: http://docs.pipelinedb.com/conf.html.
Can you say a bit more about "performant" or point me to some information? I haven't found any yet. I'm processing millions of protobufs per second and would love to get away from batch jobs to do some incredibly basic counting -- this seems like a fit conceptually... If it's a fit, any recommendations on the best way to get those protobufs off a Kafka stream and into PipelineDB would be great, too!
Performance depends heavily on the complexity of your continuous queries, which is why we don't really publish benchmarks. PipelineDB is different from more traditional systems in that not all writes are created equal, given that continuous queries are applied to them as they're received. This makes generic benchmarking less useful, so we always encourage users to roughly benchmark their own workloads to really understand performance.
That being said, millions of events per second should absolutely be doable, especially if your continuous queries are relatively straightforward as you've suggested. If the output of your continuous queries fits in memory, then it's extremely likely you'd be able to achieve the throughput you need relatively easily.
Many of our users use our Kafka connector [0] to consume messages into PipelineDB, although given that you're using protobufs I'm guessing your messages require a bit more processing/unpacking to get them into a format that can be written to PipelineDB (basically something you can INSERT or COPY into a stream). In that case what most users do is write a consumer that simply transforms messages into INSERT or COPY statements. These writes can be parallelized heavily and are primarily limited by CPU capacity.
Please feel free to reach out to me (I'm Derek) if you'd like to discuss your workload and use case further, or set up a proof-of-concept--we're always happy to help!
That's awesome! If you don't mind - one more q.. I see that stream-stream joins are not yet supported (http://docs.pipelinedb.com/joins.html#stream-stream-joins). Can you comment on when you think this feature could land, or is it still a ways off?
Sure! So stream-stream JOINs actually haven't been requested by users as much as you'd think. Users have generally been able to get what they need by using topologies of transforms [0], output streams, and stream-table JOINs. Continuous queries can be chained together into arbitrary DAGs of computation, which turns out to be a very powerful concept when mapping out a path from raw input events to the desired output for your use case.
The primary issue in implementing stream-stream JOINs is that we'd essentially need to preemptively store every single raw event that could be matched on at some point in the future. Conceptually this is straightforward, but on a technical level we just haven't seen the demand to optimize for it.
That being said, you could just use a regular table as one of the "streams" you wanted to JOIN on and then use a stream-table JOIN. As long as the table side of the JOIN is indexed on the JOIN condition, an STJ would probably be performant enough for a lot of use cases. With PostgreSQL's increasingly excellent partitioning support this is becoming especially practical.
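A sketch of that workaround (all names hypothetical): keep the "slow" side in a regular, indexed table and JOIN the fast stream against it inside a continuous view:

```sql
-- The "slow" side lands in a regular table, indexed on the JOIN key
CREATE TABLE slow_side (key bigint PRIMARY KEY, val numeric);

-- Stream-table JOIN inside a continuous view
CREATE CONTINUOUS VIEW joined AS
    SELECT f.key, sum(f.val + t.val) AS total
    FROM fast_stream f
    JOIN slow_side t ON f.key = t.key
    GROUP BY f.key;
```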
I also suspect that this is an area where integration with TimescaleDB could be really interesting!
Just out of curiosity, do you have a specific use case that necessitates stream-stream JOINs, or were you just exploring the docs and wondering about this?
My use case is pretty much parallel time series alignment with several layers of aggregation. I guess I perceive stream-stream joins as an easy way for me to wrap my head around how to structure my compute graph, but it seems doable with the method mentioned by @grammr. I'd hope for an interface roughly like "CREATE join_stream from (SELECT slow_str.key AS key, sum(slow_str.val, fast_str.val) AS val FROM slow_str, fast_str INNER JOIN ON slow_str.key = fast_str.key)". I do realize there are some tough design decisions for a system like this, but I'd also like to drop my wacky zmq infrastructure ;)
Fundamentally different approaches. Citus focuses on horizontal scaling for PostgreSQL and PipelineDB focuses on continuous aggregation for large volumes of streaming time-series data.
We're super happy with where Stride is at! We've continued to onboard customers in a few AWS regions and the infrastructure is rock solid at this point. Most users are ingesting 10k+ events/s and their analytics frontends are retrieving results in well under 100ms. We've gotten it to the point where it "just works" which has made Stride users' lives a lot easier at that scale.
And since the hard parts of Stride are powered by PipelineDB, an added benefit for us is that we now get a ton of super detailed instrumentation data about PipelineDB performance and behavior, which has helped make the open-source product quite a bit better.
We'll be moving Stride into self-service/GA next year--stay tuned!
No, not currently; however, we offer hosted deployments of PipelineDB as well as a SaaS product called Stride (stride.io), which is based on PipelineDB.