Why do people on HN love Clickhouse so much? As far as I can tell, it’s an ordinary column store, with a bunch of limitations around distributed joins and a heuristic-based query planner. There are several good analytical databases that will give you the same scan performance and a much better query planner and executor.
This is not a rhetorical question, I would really like to know why it gets so much attention here.
As a column store engine that supports (hybrid) SQL, manages its own storage and clustering, and has only one external dependency (ZooKeeper), I believe its main competitor is Vertica, which can be very expensive. I assume Oracle, IBM, and MS have column stores as well, also at a cost.
Greenplum is a Postgres fork, and the same scale requires much more hardware. Citus is row-based, and will therefore lag in scan time for many OLAP query patterns. Presto, Hive, Spark, and the rest of the "post-Hadoop" options may scale larger, but they also lag in scan time and have significant external dependencies, mainly storage.
ClickHouse is easy to install, configure as a cluster, load, and query. It does have limitations, but currently all horizontally scaled database platforms do.
> There are several good analytical databases that will give you the same scan performance
The notion that you will get approximately the same query performance from all column stores is false. There can easily be an order-of-magnitude difference depending on the implementation. Take GROUP BY as a paradigmatic example of what OLAP stores do. Of course the way to implement GROUP BY is with a hash table, but little tricks make all the difference, and a lot of love went into the ClickHouse implementation. Just to give you a taste: there is a custom hash table with specializations for different key types (e.g. it will store a precomputed hash for strings but not for integers, and use it to speed up equality tests). Variable-length data is stored in arenas to reduce allocator pressure. Data will of course be aggregated into several hash tables in different threads and then merged together, but if there are a lot of keys, each table will additionally be sharded so that the merge step can be performed in parallel too.
Of course you shouldn't trust random claims on the internet that ClickHouse is fast; you should do a small case study yourself. Then you'll appreciate how easy it is to set up a ClickHouse instance or a small cluster. It can easily slurp up most common formats. It is just a single binary with minimal dependencies that will run as-is on any modern Linux. There is just a single node type (compare this to the Druid madness).
You are right that there are a lot of limitations and, how should I put it, quirks. This is resoundingly not a general-purpose database, and someone used to the comforts of e.g. Postgres will encounter some nasty surprises. Bugs are unfortunately common, especially in the newer functionality. But performance is its main feature, and it makes many users of ClickHouse put up with its limitations.
In addition to excellent raw scan rates for the reasons given above, ClickHouse has materialized views, which can re-order and/or preaggregate data. This can speed up query response by three orders of magnitude over queries on the source data. See https://www.altinity.com/blog/clickhouse-materialized-views-... plus the ClickHouse docs for an intro.
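As a hedged sketch of the idea (table and column names are invented), a materialized view that pre-aggregates per day might look like this:

```sql
-- Source table: one row per page hit (hypothetical schema).
CREATE TABLE hits (ts DateTime, path String, dur_ms UInt32)
ENGINE = MergeTree ORDER BY ts;

-- Materialized view: rows are pre-aggregated per (day, path) as they
-- arrive; SummingMergeTree keeps collapsing the partial sums on merges.
CREATE MATERIALIZED VIEW hits_daily
ENGINE = SummingMergeTree ORDER BY (day, path)
AS SELECT
    toDate(ts) AS day,
    path,
    count() AS hits,
    sum(dur_ms) AS total_dur_ms
FROM hits
GROUP BY day, path;
```

Queries against hits_daily then touch far fewer rows than the same aggregate over hits.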
As the parent says, try it yourself on your own data. That's all that counts.
Disclaimer: I wrote the blog article and we sell support for ClickHouse.
> Of course the way to implement GROUP BY is with a hash table but little tricks make all the difference and a lot of love went into the clickhouse implementation.
Actually, for low-cardinality columns, _not_ using a hash table will speed things up.
For example, a dictionary-encoded column for states might have a few values in its dictionary (1 => CA, 2 => FL, 3 => NY), and the data looks like an array of numbers (e.g. [1, 1, 1, 2, 1, 3, 3, 3, 1, 3, 2]). The fastest way to aggregate is to actually use an array, as the dictionary index conveniently maps to an array index.
Then, when it comes to merging several of those arrays, they're turned into hash tables.
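For what it's worth, ClickHouse exposes this dictionary-encoding idea to users as the LowCardinality type; a minimal sketch (names invented):

```sql
-- Values of `state` are stored as indexes into a small dictionary, so
-- grouping can operate on the integer codes rather than the strings.
CREATE TABLE orders (
    id    UInt64,
    state LowCardinality(String)
) ENGINE = MergeTree ORDER BY id;

SELECT state, count() FROM orders GROUP BY state;
```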
Combine enough of those optimizations and proper data layouts, and you end up with several orders of magnitude of performance differences between engines.
> Apache Druid is supposed to be very mature, but also very difficult to set up and manage
I wouldn't be surprised if this inverted at large scale (say 30+ machines). Druid data servers are rebalanced automatically; if you're on AWS and decide to scale up by adding a new data server, it will automatically load its assigned subset of data from S3. If AWS kills one of your data servers, then other data servers will automatically load from S3 some of the data that server used to carry, in order to reach the desired replication factor again.
Last time I checked, ClickHouse had no automatic rebalancing at all, which sounds horrendous unless you're running at very small scale or willing to have people babysit it. I haven't operated ClickHouse at large scale though, so if I'm wrong I'd be happy to hear how people manage ordinary tasks like scaling up and down, replacing dead instances, changing instance types to adjust CPU/mem/disk, etc. with, let's say, a 100 TB compressed dataset.
Another difference is that Druid can index all dimensions. So if you plan to run queries with filters that only match a small fraction of rows, then Druid can be faster than ClickHouse. Conversely, if your queries have filters that match many rows, then ClickHouse will be faster because it has higher raw scan speed. (At least that was the case about a year ago. Since then, Druid has added vectorized aggregation, which I haven't benchmarked, but I'd bet that ClickHouse is still faster at doing full table scans.)
IMO these 2 things are the main elements to think about when choosing between ClickHouse and Druid.
I have worked with Druid and ClickHouse extensively in the past year.
I do think rebalancing is a weak point for ClickHouse, although for our use case it would not be so much of an issue. It feels like it's on ClickHouse's roadmap for this year, but we will see. And if you are on Kubernetes, some of that headache may be handled for you by the ClickHouse Kubernetes operator.
I will say that Druid's indexing comes at a heavy cost in ingestion hardware.
We find ClickHouse can easily ingest at least 3x the rate of Druid on the same hardware, and since Druid is asymmetric in design, you then have to get even more hardware to handle the queries.
Even with the vectorized aggregation, ClickHouse is beating Druid on full table scans at least, especially on high-cardinality data. But the vectorized aggregation has some restrictions on getting onto the fast paths, so that may improve as those are removed.
Overall, I find ClickHouse much easier to work with and manage compared to Druid. ymmv
Vespa is inverted-index based (with many extras inside). Actually, its aggregations aren't as optimized as Elasticsearch's (it depends on the scenario). To do aggregations, the fields must reside in memory.
CitusDB is not relevant here, I believe, as it still uses Postgres' table storage, so it's not columnar. It might be good for analytical workloads, but I very much doubt it will perform anywhere close to ClickHouse.
Greenplum: I've not used it, but it does support columnar tables, so maybe it's comparable.
Greenplum seems like a way to throw more hardware at a problem. Queries are too slow with Postgres? Shard your data across machines and distribute the queries, speeding things up by running in parallel. It uses scale as a means to compensate for low efficiency.
On the opposite side of the spectrum you have other open source projects, like QuestDB, that focus fully on core performance: constantly optimizing to get as much as possible from a single processor core. You can't scale out (at least not yet), but given how fast it is on a single core, it will be pretty powerful if they choose to go this route.
My company migrated our time-series data from InfluxDB to ClickHouse last year (I personally led this, in fact), and the performance difference is night and day.
While I liked a lot of what Influx could do, it was also nonstandard in bizarre ways (Clickhouse behaves more like a subset of SQL), sometimes shockingly immature, and despite appearing fast when we first started using it, so slow that it was a considerable bottleneck.
ClickHouse is not really a time-series database; it's a more general analytical DB (e.g. it can also handle strings, logs, user IPs).
But if you have a lot of time-series metrics (only numbers), you might be better off with a specialized time-series stack like Prometheus + VictoriaMetrics, with Grafana for visualization.
Prometheus is not really a database, because the only operation allowed is querying; you can't insert or delete data manually. For many use cases, that's not acceptable.
I think of influx performance as similar to pandas. It's fast enough for many things it was designed for, but it's pretty easy to hit a performance cliff. In particular, "select * from table" type queries are very slow, so don't expect to be pulling out large blocks of data to do analysis somewhere else.
However, influx, being schemaless, can be outstanding for rapidly prototyping ephemeral metrics, as adding a new measurement is zero-cost (just start writing it). It also plays great with grafana for building dashboards. Finally, I much prefer the ergonomics enhancements of the influx query language (v1, the v2 "flux" language looks terrible to me personally), particularly how duration strings are a first-class datatype ("group by time(1h)").
Interested to hear more about clickhouse performance, haven't had a chance to use it for anything where performance would matter significantly, although am aware it can be very fast.
To summarise a lot of the responses here: ClickHouse is extremely fast on very modest hardware, very easy to set up, very easy to get started with (mostly normal SQL), and free. For our workloads and scale of data, nothing comes close in terms of performance (Redshift, BQ, Spark) and especially TCO.
It is super simple to try out, so why not give it a go? Or is the question born of a previous bad experience?
Perspective from sentry.io (we use it in production as wikibob pointed out below):
Besides all the performance considerations discussed in this thread, which are 100% correct, a few other features we really like are:
- the multiple table engines that are heavily optimized for specific data access patterns: ReplacingMergeTree, which essentially makes records mutable; SummingMergeTree, which lets us progressively build pre-aggregated data; AggregatingMergeTree, which lets us store the intermediate aggregation state of most aggregation functions and compose those states at query time over multiple groups (for example, store a p95 aggregation state hourly and query the daily p95 by composing them; see the sketch after this list); and more.
- the set of column data types is extensive and includes nested columns
- the architecture is relatively simple, making it easy for developers and on-prem users to deploy a single-node local ClickHouse
- it is very efficient at inserting big batches of data, which works really well for our use case, where we ingest massive amounts of errors
- data skipping indexes, bloom filter indexes
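A hedged sketch of the hourly-p95 pattern mentioned above (names are invented): aggregation states live in AggregateFunction columns and are composed at query time with the -Merge combinator.

```sql
-- Store the intermediate quantile state per hour rather than a final number.
-- (States are produced on insert with quantileState(0.95)(latency_ms).)
CREATE TABLE latency_hourly (
    hour      DateTime,
    p95_state AggregateFunction(quantile(0.95), Float64)
) ENGINE = AggregatingMergeTree ORDER BY hour;

-- Compose the hourly states into a daily p95 at query time.
SELECT
    toDate(hour) AS day,
    quantileMerge(0.95)(p95_state) AS p95
FROM latency_hourly
GROUP BY day;
```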
(yes, as vlad@sentry.io mentioned below we are hiring for the team that manages storage and thus clickhouse)
ClickHouse is extremely optimized for performing analytics on time-series-style data, things like clickstream data (hence the name). Performing queries to do funnel-level analysis, for example, is extremely fast compared to other analytics databases I have tried. As others have mentioned, it is often compared to Apache Druid in its intent and feature set.
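For instance, funnel analysis can be expressed with ClickHouse's windowFunnel aggregate function; here's a hedged sketch against an invented events table:

```sql
-- For each user, find the deepest step of a view -> cart -> purchase
-- funnel reached within a 1-hour (3600 s) window, then count users per level.
SELECT level, count() AS users
FROM (
    SELECT
        user_id,
        windowFunnel(3600)(ts, event = 'view', event = 'cart', event = 'purchase') AS level
    FROM events
    GROUP BY user_id
)
GROUP BY level
ORDER BY level;
```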
Its main use is in analytics and real-time dashboard/report generation in areas like adtech.
If a conversion happens, you need to regenerate the views/cubes in the reporting dashboard; ClickHouse makes it cheap and easy to run such operations on commodity hardware.
If you don't have this, you'll be using BigQuery, and it might not be as fast.
> You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).
Is it any more compressed than Apache Hive's ORC format (https://orc.apache.org)? Because that's increasingly accepted as a storage format in a lot of these analytical systems.
Yes, it looks like it. According to these posts, ORC only uses Snappy or zlib compression, while ClickHouse has double-delta, Gorilla, and T64 codecs.
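Those codecs are declared per column; a minimal sketch (schema invented):

```sql
-- DoubleDelta suits monotonically increasing timestamps, Gorilla suits
-- slowly changing floats, T64 suits integers with few distinct high bits.
CREATE TABLE metrics (
    ts      DateTime CODEC(DoubleDelta, LZ4),
    value   Float64  CODEC(Gorilla),
    counter UInt64   CODEC(T64, LZ4)
) ENGINE = MergeTree ORDER BY ts;
```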
ORC and Parquet are file storage formats, so without context their performance can be almost anything. Where is the data stored? S3? HDFS? A local RAM disk?
ClickHouse manages the whole distributed storage, RAM caching, etc. thing for you.
In my experience, a unified single purpose vertically integrated solution will be faster than a bunch of kitchen sink solutions bolted together.
Of those, it looks like only Presto is open source and/or free. So maybe it's a presto versus clickhouse comparison, which explains why so many choose clickhouse (it's one of only 2 options in its class).
Presto is mostly a query layer that runs on top of other data stores, although it does have its own query execution engine.
The basic idea behind Presto is that it federates other databases, and supports doing joins across them. From what I understand, the problem that it solved at Facebook is bridging the gap between different teams; if a team has MySQL and another has files stored on HDFS, it doesn't really matter because all you do is query Presto and it'll query both under the covers. The alternative is setting up data pipelines, and dealing with the ongoing issues of maintaining those data pipelines.
How well do those work on a single 8GB node? Because ClickHouse works very well at that scale, with a single C++ executable.
There are large complexity and cost overheads to Hadoop solutions, and not everyone has an actual big-data problem. ClickHouse hugely outperforms on query patterns that would devolve into table scans in a row store, while working at row-store volumes of data without a bunch of big nodes.
Snowflake doesn’t really keep up with Clickhouse (in my experience) and it costs money.
DataBricks is essentially Spark, and I shouldn’t need a whole spark cluster just to get database functionality. It also costs money.
Unless I'm mistaken, Presto is just a distributed query tool over the top of a separate storage layer, so that's two things you have to set up.
I have no experience with BigQuery, but I've heard good things about it and Redshift; however, if the rest of your infra isn't on GCP/AWS, then that will probably be a blocker.
ClickHouse is open source and comes with convenient clients in a bunch of languages as well as an HTTP API. It's outrageously fast, has some cool features, makes the right trade-offs for its use case, supports a large range of input/output formats, has built-in Kafka support, and its replication and sharding are reasonably straightforward to set up.
I don't think it's fair to say "A is faster than B", like in the above comments, based on the order they appear in a list that mixes GPU-cluster and laptop results. The author of the benchmark does nothing wrong deontologically, but the results table seems to be ordered by time, and some people jump to quick conclusions or use it as a way to rank performance when that's not appropriate.
Low-latency query execution, and the fact that it scales down well to small datasets, are my reasons for really liking it. Top throughput when scanning TB+ of data on multiple nodes isn't my use case; mine is answering a query in web-request response times over a few hundred million rows of data.
The low latency of query execution is really nice.
It's the first time I've seen it on HN, and a cursory search shows only a few submissions with significant discussion. Nonetheless, as other commenters have said the SQL-based query language and low external dependencies made it easy for me to integrate it into my workflow.
For example, aside from the lack of transactions, ClickHouse is designed for insertion. There's an INSERT statement, but no UPDATE or DELETE statements. You can rewrite tables (there's ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE), but they're intended for large batch operations, and the operations are potentially asynchronous, meaning that the statement completes right away but you only see the results later.
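To illustrate (table name invented), such mutations go through ALTER TABLE and return before the rewrite finishes:

```sql
-- Both statements return immediately; the affected data parts are
-- rewritten asynchronously in the background.
ALTER TABLE events DELETE WHERE ts < '2019-01-01';
ALTER TABLE events UPDATE status = 'archived' WHERE ts < '2019-06-01';
```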
ClickHouse has many other limitations. For example, there's no enforcement of uniqueness: You can insert the same primary key multiple times. You can dedupe the data, but only specific table engines support this.
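ReplacingMergeTree is one such engine; a minimal dedupe sketch (schema invented):

```sql
-- Duplicate rows with the same sorting key are collapsed during
-- background merges (keeping the row with the highest `updated`);
-- FINAL forces the collapse at read time, at a cost.
CREATE TABLE users (
    id      UInt64,
    name    String,
    updated DateTime
) ENGINE = ReplacingMergeTree(updated) ORDER BY id;

SELECT * FROM users FINAL WHERE id = 42;
```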
There's absolutely no way anyone will want to use ClickHouse as a general-purpose database.
I should have phrased that differently: if something is good enough in some key metric, it extends to other uses - even if it makes a poor fit.
So I insist: everyone will WANT to use ClickHouse as a general-purpose database, and will create ways to make it so (e.g. copy the table with the columns you don't want filtered out, drop the original, rename).
It is just too fast and too good for many other things, so it will expand from these strongholds to the rest.
A personal example: I am migrating my cold storage to clickhouse, because I can just copy the files in place and be up and running.
I know about insert and the like, and I have a great existing system, but this lets me simplify the design and deprecate many things. Fewer moving parts is in general better.
After that is done, there is a database where I would benefit from things like alter tables or advanced joins, but keeping PostgreSQL and ClickHouse side by side, just for this? No. PostgreSQL will go. Dirty tricks will be deployed. Data will be duplicated if necessary.
There's been a lot of community interest in both topics. Merge join work is largely driven by the ClickHouse team at Yandex. Object storage contributions are from a wider range of teams.
That said I don't see ClickHouse replacing OLTP databases any time soon. It's an analytic store and many of the design choices favor fast, resource efficient scanning and aggregation over large datasets. ClickHouse is not the right choice for high levels of concurrent users working on mutable point data. For this Redis, PostgreSQL, or MySQL are your friends.
Sure - but the comment you're replying to made no mention of NoSQL. It just said Clickhouse lacks OLTP by design, that doesn't mean it won't be widely used, just that it will perhaps be limited to analytics workloads.
If you need deletes and transactions, look elsewhere, but Clickhouse seems to be great for what it's been designed for.
ClickHouse Local is also great for importing into your ClickHouse server: you can validate and preprocess CSVs into ClickHouse's native table format.
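A hedged sketch of that workflow (file name and schema invented), run via `clickhouse-local --query "..."`:

```sql
-- Validate and pre-aggregate a CSV before it ever touches the server.
SELECT
    toDate(ts) AS day,
    count() AS cnt
FROM file('events.csv', 'CSV', 'ts DateTime, user_id UInt64')
GROUP BY day
ORDER BY day;
```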
Funny thing, I just learnt about ClickHouse today. For experienced people that use columnar stores and pandas for analytics: which tool do you usually prefer for BI stuff? Do y'all load data into ClickHouse and then analyse it using pandas, or is all analysis done via the ClickHouse SQL dialect? I'm sure things like pivot tables and rolling windows are a PITA in SQL.
Why would you bother using a database like ClickHouse to store data if you're just going to analyze it in pandas? Just store it in a CSV, Parquet, or ORC file.
> As i'm sure things like pivot tables and rolling windows are a PITA in SQL
I can't speak for ClickHouse, but group-by and window functions are a very standard part of any SQL analyst's toolbelt.
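For example, a rolling average in standard SQL is just a window function (column names invented; engine support varies):

```sql
-- 7-row rolling average over a time-ordered series.
SELECT
    ts,
    value,
    avg(value) OVER (ORDER BY ts ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM metrics
ORDER BY ts;
```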
Another option to consider here is storing your data in TileDB[1] which allows you to access it via Python/Pandas, MariaDB for SQL (embeddable or standalone), Spark, R and more. With embedded MariaDB[2] you can query directly into pandas with minimum overhead. TileDB is similar to parquet in that it can capture sparse dataframe usage, however it is more general in that TileDB supports multi-dimensional datasets and dense arrays. Other major features built directly into TileDB include handling updates, time traveling and partitioning at the library level, removing the need for using extra services like Delta Lake to deal with the numerous Parquet files you may create. TileDB also has native support for S3 and in the next release we'll have native Azure Blob Storage support.
> Why would you bother using a database like clickhouse to store data if you're just going to analyze it in pandas?
Because I have more data than what fits locally, there’s a data pipeline that pushes more in, and I only need to work on a subset.
Storing everything in flat CSV/Parquet etc. is useless when there's more data than fits in your local/single machine's memory, or when you want to search or subset some of the data, or do anything larger than memory, without having to write spill-to-disk stuff in Python/pandas.
Where did I suggest storing the data on a local machine? Common usage when using pandas/spark/etc is storing source data in parquet/orc/etc and they fully support data partitioning. That doesn't need to come from a local machine by any means.
My bigger point to the OP was why bother using a DBMS that is specifically tuned for running fast analytical queries if they only intended to use it as a storage layer and pushing all of the analysis into Python. Use a solution focused on storage, not analysis, if that's the use case.
Because the more you can offload onto the database, the fewer people end up re-inventing things from first principles every time they need to do something (and potentially making correctness mistakes).
Additionally, if I have my data in an actual database, I can attach tools like Tableau and Superset directly, as opposed to having to take it from storage and then put it in a database anyway before being able to visualise/use it.
You misinterpret my comment as being against databases, when in fact I'm simply against the notion of utilizing a purpose-built OLAP database (Clickhouse) merely as the storage layer for yet another analytical platform (Pandas). In this hypothetical scenario, before the data ever makes it into Clickhouse, there's a very high likelihood it would already live in an OLTP database such as Postgres or perhaps parquet files living on S3 designed to be read by Athena or Redshift Spectrum etc.
Why would you bother with load balancers, clusters, and distributed architectures when you could just deploy apache and PHP on a cheap commodity server?
It's much faster in cases that involve significant processing. Where possible query execution is vectorized and runs in parallel on as many cores as you make available.