I built out a Druid backend for interactive graphing & aggregation of web traffic and application security metrics a few years back. Users could choose arbitrary filters, aggregations, and time slicing. This was a second system, replacing a Spark cluster running over timeseries events in Cassandra that didn't scale well in practice. Tuning and debugging the Spark queries and Cassandra performance was an endless time sink.
Druid worked really well for almost all use cases, reliably getting sub-second results against many billions of records even on a pretty modest deployment. Being able to use arbitrary JavaScript functions in queries was fantastic, so we could do things like filtering on IP subnets or case-insensitive prefix/suffix matching as needed.
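For anyone curious, those filters look roughly like this in a native query (a sketch from memory; client_ip and host are placeholder dimension names, and JavaScript has to be enabled cluster-wide with druid.javascript.enabled=true):

```python
# Sketch of Druid native-query "javascript" filters (placeholder dimensions:
# client_ip, host). Requires druid.javascript.enabled=true on the cluster.
subnet_filter = {
    "type": "javascript",
    "dimension": "client_ip",
    "function": "function(v) { return v != null && v.indexOf('10.1.2.') === 0; }",
}

# Case-insensitive suffix match on a hostname dimension works the same way.
suffix_filter = {
    "type": "javascript",
    "dimension": "host",
    "function": "function(v) { return v != null && /\\.example\\.com$/i.test(v); }",
}
```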
The Docker setup that Druid ships with is deceptively simple - making a production installation was an effort.
My thoughts are:
- Build templating and code generation for the many per-component config files early on, so you can edit constants in a single place and have all of the sundry config files update to reflect them, and also manage per-host overrides in a sane, version-controlled way (see the sketch after this list).
- Druid will use as much RAM as you can throw at it, but in a pinch, reading directly from fast NVMe storage is pretty good.
- If you have realtime data ingestion, you will also have to build tooling to re-ingest older data that has changed or needs to be amended. This will end up looking like a 'lambda architecture'
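To make the templating point concrete, here is a minimal sketch of what I mean, assuming Jinja2; the property values, hosts, and paths are illustrative, and real Druid configs have many more settings:

```python
# Minimal sketch of config templating for Druid components, assuming Jinja2.
# One constants dict plus per-host overrides feeds every runtime.properties.
from pathlib import Path
from jinja2 import Template

CONSTANTS = {
    "zk_hosts": "zk1:2181,zk2:2181,zk3:2181",
    "deep_storage_bucket": "my-druid-segments",   # placeholder bucket
    "processing_threads": 7,
}

HOST_OVERRIDES = {
    "historical-01": {"processing_threads": 31},
    "broker-01": {},
}

TEMPLATE = Template(
    "druid.zk.service.host={{ zk_hosts }}\n"
    "druid.storage.bucket={{ deep_storage_bucket }}\n"
    "druid.processing.numThreads={{ processing_threads }}\n"
)

for host, overrides in HOST_OVERRIDES.items():
    values = {**CONSTANTS, **overrides}
    out = Path("rendered") / host / "runtime.properties"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(TEMPLATE.render(**values))
```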
If you had to do it again today, what would you do?
Especially interested in this part:
> If you have realtime data ingestion, you will also have to build tooling to re-ingest older data that has changed or needs to be amended. This will end up looking like a 'lambda architecture'
I'm not a data guy, can you explain this part a bit more if you have time?
> If you have realtime data ingestion, you will also have to build tooling to re-ingest older data that has changed or needs to be amended. This will end up looking like a 'lambda architecture'
Lambda architecture for data processing, as popularized by Nathan Marz et al [0], has two components, the Batch layer and the Stream layer. At a high level, Batch trades quality for staleness whilst Stream optimises for freshness at the expense of quality [1].
I believe what GP means by Lambda is that you'd need a system that batch-processes the data to be amended or changed (reprocessing older data) but stream-processes whatever is required for real-time [2].
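As a purely illustrative sketch (the function and store names here are made up), the two layers end up looking something like this:

```python
# Illustrative sketch of the two lambda-architecture layers (names are made up).
# Stream layer: apply each event as it arrives (fresh, possibly slightly wrong).
# Batch layer: periodically recompute older partitions from the master dataset,
# picking up late corrections and amendments, and overwrite the serving store.
from collections import defaultdict

realtime_view = defaultdict(int)   # fast, approximate
batch_view = {}                    # slower, authoritative for older data

def on_event(event):
    """Stream layer: update counts immediately."""
    realtime_view[(event["day"], event["dimension"])] += 1

def nightly_backfill(master_dataset, day):
    """Batch layer: rebuild one day's partition from the full, corrected data."""
    counts = defaultdict(int)
    for event in master_dataset:
        if event["day"] == day:
            counts[(day, event["dimension"])] += 1
    batch_view.update(counts)   # replaces whatever the stream layer produced

def query(day, dimension):
    """Serving layer: prefer the batch view, fall back to the realtime view."""
    return batch_view.get((day, dimension), realtime_view[(day, dimension)])
```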
An alternative is the Kappa architecture proposed initially by Jay Kreps [3][4], co-creator of Apache Kafka.
In theory this sounds great, but you have to account for processing capacity.
While compute is getting cheaper, one of the key reasons streaming in lambda sacrifices quality for throughput is compute capacity (as well as timing). If you have to feed already-stored data through the same streaming pipe, you either have to have a lot of excess capacity, be willing to pay for that additional burst, or accept latency in your results (assuming you can keep up with your incoming workload and not lose data). There is no free lunch.
I looked into Apache Druid for my company's event aggregation and serving. We decided to go with Apache Pinot instead. A major reason was serving latency. Most of the docs I saw that talked about Druid latency (and their goal for latency) made it seem like 500ms was okay for query latency. Pinot seemed more focused on query latency; 50ms was closer to my team's latency goal. I ran some basic latency tests and Pinot was faster (2M and 22M events, no real optimizations). I think Pinot's star-tree index would perform great for my team's use case.
Druid's intro flow was awesome. The tutorials worked very well. I hit hiccups with Pinot's intro flow. The Pinot team has been very responsive on Slack to issues.
Similar to @quietgolfer, we also have a strict requirement on serving latency, and we benefited from Pinot's star-tree index. We had a table with more than 1 billion events, and we had to do DISTINCTCOUNT and SUM with GROUP BY. Initially, without optimization, queries timed out (after 15s). After the star-tree index was added, the query returned in 500ms.
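For anyone curious, the star-tree definition lives in the Pinot table config and looks roughly like this (a sketch; the column names are placeholders and the exact shape can differ between Pinot versions):

```python
# Rough shape of a Pinot star-tree index definition inside the table config.
# Column names are placeholders; check the Pinot docs for your exact version.
star_tree_table_index_config = {
    "starTreeIndexConfigs": [
        {
            "dimensionsSplitOrder": ["country", "device", "campaign"],
            "skipStarNodeCreationForDimensions": [],
            # Pre-aggregated function/column pairs. HLL-style distinct counts
            # can also be added here; exact distinct counts generally cannot
            # be pre-aggregated.
            "functionColumnPairs": ["COUNT__*", "SUM__impressions"],
            "maxLeafRecords": 10000,
        }
    ]
}

# The shape of query this serves:
query = """
    SELECT country, device, COUNT(*) AS events, SUM(impressions) AS impressions
    FROM events
    GROUP BY country, device
"""
```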
Based on our earlier benchmark, Pinot query latency without star-tree is already slightly better than Druid's: https://image.slidesharecdn.com/lnkdmeetupslides0426-1705021.... Though that benchmark was done years ago, the result seems consistent with the recent test run by @quietgolfer.
Some common ClickHouse use cases and verticals are shown below. Basically any case where you have relatively structured data, very large fact tables, and a need for low-latency responses. Names of companies that have given public talks are shown.
Most people shouldn't. Clickhouse or other column-store data warehouses (redshift, bigquery, etc) are very fast and have all the features to handle time-series and other data.
Druid is good if you (1) make use of native integrations like Kafka, (2) need every field indexed for fast seeking to a few rows, (3) can use the JSON API to make up for the still-in-beta SQL interface, (4) don't need every single event/row as they are pre-aggregated, (5) always have a time column to partition by, (6) want to use the native S3 tiering for older data, and (7) don't need joins and complex analysis.
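For reference on (3), a native JSON query posted to the broker looks roughly like this (a sketch; the datasource, column names, and broker host are placeholders):

```python
# Sketch of a native Druid JSON query posted to the broker (names are placeholders).
import json
import urllib.request

query = {
    "queryType": "timeseries",
    "dataSource": "web_events",
    "granularity": "hour",
    "intervals": ["2020-04-01/2020-04-02"],
    "filter": {"type": "selector", "dimension": "country", "value": "US"},
    "aggregations": [{"type": "longSum", "name": "requests", "fieldName": "count"}],
}

req = urllib.request.Request(
    "http://broker:8082/druid/v2",   # 8082 is the default broker port
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```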
Imply's distribution is better than core Druid but it's still more operationally complex than Clickhouse and alternatives.
You are right about Clickhouse, but other data warehouses are not optimized for the same use case as Druid and Clickhouse: OLAP.
For example, RedShift and BigQuery cannot be put behind a user-facing backend. BigQuery has a default limit of 50 concurrent queries; if that's your user limit, perfect. RedShift takes seconds for queries that Druid and Clickhouse can answer in milliseconds.
All data warehouses are designed for OLAP; that's their purpose. OLAP doesn't require low latency, though.
Redshift is an always-running cluster of scale-out, distributed Postgres forked by AWS, so it can and does return results in milliseconds, very similar to Clickhouse although still not as advanced in performance techniques.
BigQuery is a completely managed service that uses a far greater scale-out architecture designed for throughput (petabytes in seconds) rather than latency, although it has real-time streaming, BI Engine (an in-memory cache), and materialized views, so you can get pretty close today.
Snowflake is another option that runs on top of major clouds using instances to access object storage and also has low latency when your cluster is running.
Not all data warehouses are designed for OLAP. As the performance of these systems increases, it is possible to satisfy OLAP use cases with some warehouses such as BQ, but they are still not ideal as your number of concurrent queries scales and your response-time requirements remain constant. The BQ team certainly doesn't consider their product an OLAP solution, and neither does the RedShift team. AWS is actually working on their own OLAP solution, Timestream; that is how they pitch it.
BigQuery's petabyte aggregations in seconds is a false claim. I just launched a query to count the number of records in a 1.4 PB dataset and it has taken over a minute just to schedule the query, and I have 1000+ compute units available too. I have been able to perform a similar query in seconds on top of Druid; that cluster had a run rate of $25k/month at the time, while I believe we pay $40k+/month just on our BQ deal. Maybe under ideal conditions BigQuery can run such a query, but I have never seen it do such things.
I have experience with Snowflake, it is great if you are in AWS, but I wouldn't choose it over BigQuery. I feel that Snowflake's strength is in their data sharing protocol, which allows us to tap into the datasets of partners without major data transfer costs. We use it for that reason only.
You're using a strange definition of OLAP here when it's really not tied to performance or concurrency. It only means you can send an ad-hoc query and get back results directly in response, rather than waiting on pre-computed or batched processing happening somewhere else.
Redshift and BigQuery are both OLAP products and their documentation and whitepapers state the same. The performance profiles of both are very different, but that's due to the architecture. Redshift is an always-running cluster of EC2 instances running their forked scale-out Postgres. BigQuery is an entirely different engine based on Dremel, which focuses on parallelization as much as possible, along with Google's proprietary datacenters that have petabits of internal bandwidth to support it. AWS Timestream is an entirely different product for time-series-specific data.
The only reason Druid is faster is that it pre-aggregates and indexes data. You can aggregate tables yourself in Redshift and use the right distribution and sort keys to get most of the same performance. BigQuery just launched materialized views and is working on scheduling improvements and in-memory caching as well. Snowflake now runs in Azure and GCP and can do sub-second responses when the warehouse is running and data is cached or the tables are smaller.
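The do-it-yourself rollup in Redshift is something like the following (a sketch; the schema, cluster endpoint, and credentials are placeholders):

```python
# Sketch of a hand-rolled aggregate table in Redshift with DISTKEY/SORTKEY,
# run via psycopg2. Table, columns, endpoint, and credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    "host=my-cluster.example.redshift.amazonaws.com dbname=analytics "
    "user=admin password=placeholder"
)
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE events_hourly
        DISTKEY (country)
        SORTKEY (event_hour)
        AS SELECT date_trunc('hour', ts) AS event_hour,
                  country, path,
                  count(*) AS requests, sum(bytes) AS bytes
        FROM events
        GROUP BY 1, 2, 3
    """)
```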
I wasn't able to achieve the same performance with RedShift aggregates; I tried that first, before I decided to migrate from RedShift to Druid back in 2014. We deal with dozens of dimensions per event, and no combination of distribution keys in Redshift was able to give us the same performance over arbitrary scans + aggregations.
Druid is not only fast because it pre-aggregates, but the memory structure is designed for scans over hundreds of dimensions.
Materializing views in BigQuery is just one DAG task; unless you don't have something like Airflow in your stack, I don't see how it is worth mentioning. We are talking about denormalized, time-series data.
I am speaking from experience with each one of these products. Perhaps I did it all wrong, but we certainly achieved the objectives we were after.
We have negotiated a flat rate deal with Google and have access to a higher concurrent query limit than 100, and even then, it is not the right tool for my use case. BigQuery is also far more expensive.
Our users expect responses in the milliseconds. The user experience is seriously degraded when you approach tens of seconds or batch queries.
1) Most people handling large data streams are already using Kafka or similar. If you aren't, Druid has pretty wide support for event ingress, including HTTP (see the ingestion-spec sketch after this list).
2) If you're looking to provide this type of analytics in the first place, you probably do want this. Being able to execute extremely fast, granular, ad-hoc queries over highly dimensional data is very powerful. I designed a reporting frontend that really took advantage of this, and I always felt guilty when people complimented me on how fast it drilled down into 10+ dimensions.
3) There are plenty of mature libraries for the Druid API.
4) This is almost inherent with any OLAP system. Although, even with high-cardinality data, Druid performs extremely well. Either way, you should have a backing data warehouse and offline, batch jobs if you need to perform BI analytics on row-by-row / non-aggregated data. Remember, OLAP sits on top of a data warehouse.
5) Yes, but technically you just need a key that increments.
6) Not sure if this is an argument against Druid. I've found the S3 deep storage / tiering to be very efficient and powerful, especially because you can create and push segments directly to storage and not run it through an indexer. S3 is also just an object store protocol specification now. There are lots of people who run S3-compatible object stores in-house. HDFS is also natively supported and another widely used storage backend in this space. Also, there are plenty of community extensions for other object stores.
7) Again, OLAP. Data flowing into your OLAP layer should already be denormalized and ready for indexing. Also, you can join data into your existing indices with Hadoop, etc. Druid supports joins and lookups, although I've never used them. ClickHouse and other similar systems also don't do very well with joins. Maybe we have different definitions of "complex analysis," but in my experience, you can do some pretty crazy stuff with queries, including writing your own JS functions, and if you're really dedicated, you can write your own extensions.
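On (1), Kafka ingestion is driven by a supervisor spec you POST to the overlord; it looks roughly like this (a sketch; all names are placeholders and the exact layout varies a bit between Druid versions):

```python
# Rough shape of a Druid Kafka supervisor spec (placeholders throughout; the
# exact layout varies by Druid version). POSTed to /druid/indexer/v1/supervisor.
kafka_supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "web_events",
        "timestampSpec": {"column": "ts", "format": "iso"},
        "dimensionsSpec": {"dimensions": ["country", "path", "status"]},
        "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "minute"},
    },
    "ioConfig": {
        "topic": "web-events",
        "consumerProperties": {"bootstrap.servers": "kafka1:9092,kafka2:9092"},
        "taskCount": 2,
    },
    "tuningConfig": {"type": "kafka"},
}
```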
One thing that I feel like a lot of people miss is that Druid is specifically an OLAP layer designed for large scale, and I mean beyond Netflix-size scale (they use Druid). Every individual component of Druid is designed to scale out independently and play nicely with your existing infrastructure. Similar to ES, you have nodes that have specific roles such as event ingress / real-time compute, query handlers, query brokers, historical data / cache / compute, etc. Then you also have a bunch of supporting architecture provided for you for (re)indexing data, coordinating the entire cluster, etc. Druid is huge; it is not an all-in-one (OLAP, OLTP, DWH, etc.) analytics solution, and it takes more than one person to run a larger cluster, though I did it myself for a few years.
Relational databases with joins and full SQL support are still unmatched in flexibility, and functionality like materialized views, aggregation pipelines (and table engines for Clickhouse) allows you to do everything that Druid does with aggregated summaries while still having all the other querying abilities.
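In ClickHouse terms, the Druid-style rollup is just a materialized view feeding a summing table, roughly like this (a sketch; the table and column names are placeholders):

```python
# Sketch of a Druid-style rollup in ClickHouse: raw events plus a materialized
# view that maintains an hourly summary. Table/column names are placeholders.
from clickhouse_driver import Client

client = Client("localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        ts DateTime, country String, path String, bytes UInt64
    ) ENGINE = MergeTree ORDER BY ts
""")

client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_hourly
    ENGINE = SummingMergeTree ORDER BY (hour, country, path)
    AS SELECT toStartOfHour(ts) AS hour, country, path,
              count() AS requests, sum(bytes) AS bytes
    FROM events GROUP BY hour, country, path
""")
```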
Druid has a slight edge in data scale-out and indexed seeks, but modern data warehouses are adding similar tiering features, along with field indexing, full-text search, nested data records, and even in-memory rowstores for OLTP support.
They're all converging on the same feature set and eventually Druid will just become another data warehouse option, although I'd still recommend Clickhouse or MemSQL at that point.
Question for everyone in this thread: what resources do you use to follow this stuff? Are there blogs or conferences or ... that I should follow if I'm interested in this topic?
Start there and just read more about architecture of the various products. You'll learn the basics of distributed nodes that store data in column-oriented tables partitioned by a key and how to run queries over them using various performance techniques. Then you can dive into the unique differences of the various databases.
No sorry, I didn't mean where to read about how these things are designed (though these are useful and interesting links for that too, thanks!). What I'm asking about is what to follow (blogs, conferences, etc.) to keep up with new developments in the space: new products, experiences with them, reviews of their trade offs in practical use, that kind of thing. But I do appreciate the response!
Other than that, reading the blogs of the various vendors is how I keep up (using Feedly with RSS). The modern projects are Redshift, BigQuery, Snowflake, Azure DW, MemSQL, Clickhouse, YellowBrick with older projects being Vertica, Teradata, Greenplum. It's also useful to follow the "new" distributed SQL projects like CockroachDB, Citus, TiDB, Vitess, Yugabyte.
Can you speak briefly to the change? What motivated it? How's it been working out?
For context, I'm an engineer that uses Postgres heavily, with some basic BigQuery + Redshift. I don't understand what the benefits of something like Clickhouse are over a standard warehouse.
Druid and Clickhouse are OLAP systems. The key here is "online", as in they are optimized to answer queries in a matter of milliseconds. Maybe things have improved, but RedShift is not really an OLAP system. The last time I used RedShift was back in 2015, and it took tens of seconds or even minutes to answer the queries that Druid could answer in just a few seconds; the Druid cluster was a third of the cost too, and AWS was charging us too much.
I love BigQuery, but I look at it more as a background processor for queries over huge datasets. With BigQuery you have some real limitations on the number of concurrent queries. Even if BQ was able to answer in seconds, your users would be reaching your concurrent query limit all the time.
In RedShift you can perform joins, but in Druid and Clickhouse you cannot; your data has to be denormalized. There are quasi-joins, but they are not the same. If you are going to swap Redshift for Druid or Clickhouse, then you need to denormalize the data or design around the specific quasi-join concepts of these databases, which abstract dictionary lookups.
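In Clickhouse, for example, the quasi-join is an external dictionary; roughly like this (a sketch; the dictionary, table, and column names are placeholders):

```python
# Sketch of a ClickHouse dictionary standing in for a join (names are placeholders;
# assumes existing 'customers' and 'events' tables on the same server).
from clickhouse_driver import Client

client = Client("localhost")

client.execute("""
    CREATE DICTIONARY IF NOT EXISTS customer_dict (
        customer_id UInt64, account_name String DEFAULT ''
    )
    PRIMARY KEY customer_id
    SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' DB 'default' TABLE 'customers'))
    LAYOUT(HASHED())
    LIFETIME(MIN 300 MAX 600)
""")

# dictGet replaces what would otherwise be a JOIN against the customers table.
rows = client.execute("""
    SELECT dictGet('customer_dict', 'account_name', customer_id) AS account,
           count() AS requests
    FROM events
    GROUP BY account
""")
```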
Druid is operations heavy compared to Clickhouse, while RedShift is just a few button presses. I started using Druid in 2014 and back then to set up a cluster you needed a decent understanding of orchestration. There are plenty of “minimal ops” solutions for Druid nowadays, but IMHO all they are doing is abstracting the problem and not really solving it. To run a Druid cluster, you need 5 or 6 node types and to arrive at the proper configuration of resources for your data needs will take you a lot of time. Clickhouse requires a single node type.
Ingesting data is easy in RedShift, but you will have to work a bit more with Druid and Clickhouse. Two of the Druid node types are dedicated to ingestion, so your data pipelines should allocate these dynamically. Loading data into Clickhouse is as simple as asking it to ingest a file, but since it is big data, you need a dynamic place to stage such a file, right? All I am saying is that I wish there was an easier way to pick up the data from private S3 buckets.
Performance is a wash between Druid and Clickhouse for me. I never did a proper head-to-head between the two. However, the product I built on top of Clickhouse is a next-gen version of the product I built on top of Druid, and in my experience Clickhouse achieves the same performance for a third of the cost. However, you must account for the fact that cloud computing has gotten cheaper over the years, and my numbers for Druid come from 2017, which is the last time I approved a bill.
I forgot to mention PG. I have put small analytic datasets, 80 million rows or so, in Postgres with good results, but even then the response time was never less than a few seconds.
> I forgot to mention PG. I have put small analytic datasets, 80 million rows or so, in Postgres with good results, but even then the response time was never less than a few seconds
There is already a bewildering number of options in this space, but did you try TimescaleDB (a time-series extension for Postgres)?
I'm using it with 100 million plus rows, and aggregate queries return in less than 100ms, even on my laptop - it really is very impressive.
I haven't used them yet, but it also supports continuous aggregates (automatic roll ups), which would make common queries near instantaneous.
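For reference, a continuous aggregate is just this (a sketch, assuming a hypertable named metrics with columns ts, device_id, and value; the exact syntax depends on your TimescaleDB version):

```python
# Sketch of a TimescaleDB continuous aggregate (assumes a hypertable named
# "metrics" with columns ts, device_id, value), run via psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=tsdb user=postgres")
conn.autocommit = True   # run outside an explicit transaction block
with conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW metrics_hourly
        WITH (timescaledb.continuous) AS
        SELECT time_bucket('1 hour', ts) AS bucket,
               device_id,
               avg(value) AS avg_value,
               max(value) AS max_value
        FROM metrics
        GROUP BY bucket, device_id
    """)
```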
Online refers to interactive queries that return results directly after you submit them, in contrast to offline or batch queries. There's no hard latency requirement, and standard TPC-H data warehouse benchmark queries can take hours to run.
The main advantage of Druid is that it pre-aggregates and indexes everything resulting in much smaller summarized data with fast seeks to specific rows. Materialized views, stored procedures, and ETL processes can provide the same result in relational data warehouses and come with much better SQL support as you mentioned.
You should try Redshift again as it's come a long way since the early days. Also, I highly recommend MemSQL, which is faster than Clickhouse with a much easier deployment and more usability, and can serve OLTP workloads in the same database.
Filtering by partial string match on dimensions. It is equally slow on both of them.
For straightforward filters and aggregations, yeah, Clickhouse is faster. However, I read a PR by the Druid team where they added a vectorized engine. If they release that, then the performance gap should be smaller, or maybe in Druid's favor depending on the dataset.
You are right about insertions up to a point.
If you consider data loads as insertions, with Druid I could scale my cluster elastically to speed them up. With Clickhouse I am bound by the query nodes.
Also, Druid can ingest data in real time using a special type of node that wasn’t part of the original distribution. I haven’t done real time data ingestion on Clickhouse. Hourly updates are good enough for my use case.
The in-depth pros and cons are one of the most valuable bits of knowledge I find on HN. Everyone, including Chief Databaseologist Andy Pavlo, recommends starting with Clickhouse for your OLAP needs
(in https://www.youtube.com/watch?v=dPMc7FZ3Gqo&list=PLSE8ODhjZX...).
That being the case, Druid needs strong arguments to secure mindshare in the future. It is crazy to write a comparison of Druid and something else these days and not mention Clickhouse at all.
Did you try VictoriaMetrics for time series data? It easily handles 2 trillion plus rows (20000 times more than TimescaleDB in your setup) on a single node [1].
No, never even heard of it. There really are a bewildering number of options in this space!
I haven't actually tried stressing TimescaleDB, but I imagine it would hold up well if given 50GB of RAM to play with.
My point really though, is that TimescaleDB is a great fit for a lot of use cases, with basically no learning curve if you're already au fait with Postgres. And of course if you're already using Postgres elsewhere in your system, it's one less component to manage.
I have some industrial IoT use case in mind though, and I'll definitely look more into VictoriaMetrics, so thanks for the tip!
I used it a few years ago with my former company. It effortlessly ingested 10K rows of data per second, and we configured Druid with over 50 dimensions and about 20 different metrics. In that configuration, Druid was able to respond to most queries within a second. I am not sure how it works, but it is an amazing product. We are evaluating it now for use at my current company.
Saw this last week. The examples are limited but clearly easy to reproduce. I saw in a discussion on Reddit that they seem to plan to get even faster using more techniques like prefetch. https://news.ycombinator.com/item?id=22803504