Clickhouse Local (clickhouse.tech)
124 points by mooreds on March 3, 2020 | 75 comments


Why do people on HN love Clickhouse so much? As far as I can tell, it’s an ordinary column store, with a bunch of limitations around distributed joins and a heuristic-based query planner. There are several good analytical databases that will give you the same scan performance and a much better query planner and executor.

This is not a rhetorical question, I would really like to know why it gets so much attention here.


It's somewhat unique in the open source category.

As a column store engine that supports (hybrid) SQL, manages its own storage and clustering, and has only one external dependency (Zookeeper), I believe its main competitor is Vertica, which can be very expensive. I assume Oracle, IBM, and MS have column stores as well, also at a cost.

Greenplum is a Postgres fork, and the same scale requires much more hardware. Citus is row-based, and therefore will lag in scan time for many OLAP query patterns. Presto, Hive, Spark, and all of the "post-Hadoop" options may scale larger, but they will also lag in scan time and have significant external dependencies, mainly storage.

Clickhouse is easy to install, configure as a cluster, load, and query. It does have limitations, but currently all horizontally scaled database platforms do.


> There are several good analytical databases that will give you the same scan performance

The notion that you will get approximately the same query performance with all column stores is false. There can easily be an order of magnitude difference depending on the implementation. Take GROUP BY as a paradigmatic example of what OLAP stores do. Of course the way to implement GROUP BY is with a hash table, but little tricks make all the difference, and a lot of love went into the clickhouse implementation. Just to give you a taste: there is a custom hash table with specializations for different key types (e.g. it will store a precomputed hash for strings but not for integers, and use it to speed up equality tests). Variable-length data is stored in arenas to reduce allocator pressure. Data will of course be aggregated in several hash tables in different threads and then merged together, but if there are a lot of keys, each table will additionally be sharded so that the merge step can be performed in parallel too.
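To make that last step concrete, here is a purely illustrative Python sketch (not ClickHouse code; the shard count, hash scheme, and use of threads are my own assumptions, and Python's GIL means the threads only illustrate the structure): each worker aggregates into its own hash table, the tables are split into shards by key hash, and the shards are then merged independently so the merge can also run in parallel.

    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    NUM_SHARDS = 4  # arbitrary; a real engine would size this from the key cardinality

    def partial_aggregate(rows):
        # Step 1: each worker aggregates its slice of rows into its own hash table.
        table = defaultdict(int)
        for key, value in rows:
            table[key] += value
        return table

    def shard(table):
        # Step 2: split one partial table into NUM_SHARDS smaller tables by key hash.
        shards = [defaultdict(int) for _ in range(NUM_SHARDS)]
        for key, value in table.items():
            shards[hash(key) % NUM_SHARDS][key] += value
        return shards

    def merge_shard(shard_tables):
        # Step 3: merge the same shard from every worker; shards are disjoint,
        # so each merge is independent and can run in parallel.
        merged = defaultdict(int)
        for t in shard_tables:
            for key, value in t.items():
                merged[key] += value
        return merged

    def group_by_sum(chunks):
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(partial_aggregate, chunks))
            sharded = [shard(t) for t in partials]
            merged = list(pool.map(merge_shard, zip(*sharded)))
        return {k: v for m in merged for k, v in m.items()}

    if __name__ == "__main__":
        chunks = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
        print(group_by_sum(chunks))  # {'a': 4, 'b': 2, 'c': 4}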

Of course you shouldn't trust random claims on the internet that clickhouse is fast, and should do a small case study yourself. Then you'll appreciate how easy it is to set up a clickhouse instance or a small cluster. It can easily slurp up most common formats. It is just a single binary with minimal dependencies that will run as-is on any modern Linux. There is just a single node type (compare this to the Druid madness).

You are right that there are a lot of limitations and, how should I put it, quirks. This is resoundingly not a general-purpose database, and someone used to the comforts of e.g. postgres will encounter some nasty surprises. Bugs are unfortunately common, especially in the newer functionality. But performance is its main feature, and it makes many users of clickhouse put up with its limitations.


In addition to excellent raw scan rates for the reasons given above, ClickHouse has materialized views, which can re-order and/or pre-aggregate data. This can speed up query response by 3 orders of magnitude over queries on the source data. See https://www.altinity.com/blog/clickhouse-materialized-views-... + ClickHouse docs for an intro.
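For a rough idea of what such a view looks like, here is a sketch using the clickhouse-driver Python client against a hypothetical local server; the table, columns, and schema are made up for illustration, so check the blog and docs above for the real details:

    from clickhouse_driver import Client

    client = Client('localhost')

    # Raw events land here.
    client.execute("""
        CREATE TABLE IF NOT EXISTS page_views (
            ts DateTime,
            page String,
            duration_ms UInt32
        ) ENGINE = MergeTree ORDER BY (page, ts)
    """)

    # The materialized view re-orders and pre-aggregates as data is inserted,
    # so dashboard queries hit a much smaller, already-summed table.
    client.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS page_views_daily
        ENGINE = SummingMergeTree ORDER BY (page, day)
        AS SELECT
            page,
            toDate(ts) AS day,
            count() AS views,
            sum(duration_ms) AS total_duration_ms
        FROM page_views
        GROUP BY page, day
    """)

    # Query the small aggregate table instead of scanning the raw events.
    # SummingMergeTree sums rows lazily at merge time, so re-aggregate here.
    rows = client.execute(
        "SELECT page, day, sum(views) FROM page_views_daily GROUP BY page, day"
    )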

As the parent says, try it yourself on your own data. That's all that counts.

Disclaimer: I wrote the blog article and we sell support for ClickHouse.


> Of course the way to implement GROUP BY is with a hash table but little tricks make all the difference and a lot of love went into the clickhouse implementation.

Actually, for low cardinality columns, _not_ using a hash table will speed things up.

For example, a dictionary-encoded column for states might have a few values in its dictionary (1 => CA, 2 => FL, 3 => NY) and the data looks like an array of numbers (e.g. [1, 1, 1, 2, 1, 3, 3, 3, 1, 3, 2]). The fastest way to aggregate is to actually use an array, as the dictionary index conveniently maps to an array index.

Then, when it comes to merging several of those arrays, they're turned into hash tables.
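A purely illustrative Python sketch of that (not any engine's actual code; the chunking and names are made up): aggregate each chunk into a plain array indexed by the dictionary code, and fall back to a hash table keyed by the decoded value only when merging the chunks.

    from collections import defaultdict

    def aggregate_chunk(codes, dictionary):
        # Count occurrences using a plain array indexed by the dictionary code.
        counts = [0] * (len(dictionary) + 1)   # codes are assumed to start at 1
        for code in codes:
            counts[code] += 1
        return counts

    def merge_chunks(chunk_counts, dictionary):
        # Merging per-chunk arrays: fall back to a hash table keyed by the decoded value.
        totals = defaultdict(int)
        for counts in chunk_counts:
            for code, n in enumerate(counts):
                if n:
                    totals[dictionary[code]] += n
        return dict(totals)

    dictionary = {1: "CA", 2: "FL", 3: "NY"}
    chunk1 = aggregate_chunk([1, 1, 1, 2, 1, 3], dictionary)
    chunk2 = aggregate_chunk([3, 3, 1, 3, 2], dictionary)
    print(merge_chunks([chunk1, chunk2], dictionary))  # {'CA': 5, 'FL': 2, 'NY': 4}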

Combine enough of those optimizations and proper data layouts, and you end up with several orders of magnitude of performance differences between engines.


All of the optimizations you describe are in every columnar analytical database built in the last 10 years, from Vertica to Snowflake.


You are probably right about Vertica and Snowflake (where can I browse the source code to be sure?), but with those ClickHouse competes on price.


Because it's one of the few mature, fast, scalable analytics-oriented databases that is also completely open source.

The closest open source thing that matches its feature set is Presto, but that one is quite different.

Apache Druid is supposed to be very mature, but also very difficult to set up and manage. I've not used it myself.

There's also Vespa, but I don't know how well it performs with large numbers of columns.

A lot of people use Elasticsearch for analytics. Being based on Lucene, it's kind of columnar, and it can perform very well indeed on aggregations.

InfluxDB may be good, but it's not fully open source.


> Apache Druid is supposed to be very mature, but also very difficult to set up and manage

I wouldn't be surprised if this inverted at large scale (say 30+ machines). Druid data servers are rebalanced automatically; if you're on AWS and decide to scale up by adding a new data server, it will automatically load its assigned subset of data from S3. If AWS kills one of your data servers, then other data servers will automatically load from S3 some of the data that server used to carry, in order to reach the desired replication factor again.

Last time I checked ClickHouse had no automatic rebalancing at all, which sounds horrendous unless you're running at very small scale or willing to have people babysit it. I haven't operated ClickHouse at large scale though, so if I'm wrong I'd be happy to hear how people manage ordinary tasks like scaling up and down, replacing dead instances, changing instance types to adjust CPU/mem/disk, etc... with let's say a 100 TB compressed dataset.

Another difference is that Druid can index all dimensions. So if you plan to run queries with filters that only match a small fraction of rows, then Druid can be faster than ClickHouse. Conversely, if your queries have filters that match many rows, then ClickHouse will be faster because it has higher raw scan speed. (At least that was the case about a year ago. Since then, Druid has added vectorized aggregation, which I haven't benchmarked, but I'd bet that ClickHouse is still faster at doing full table scans.)

IMO these 2 things are the main elements to think about when choosing between ClickHouse and Druid.


I have worked with Druid and ClickHouse extensively in the past year.

I do think rebalancing is a weak point for ClickHouse, although for our use case that would not be much of an issue. It feels like it is on ClickHouse's roadmap for this year, but we will see. And if you are on Kubernetes, some of that headache may be handled for you by the ClickHouse Kubernetes operator.

I will say that Druid's indexing comes at a heavy cost in hardware for ingestion.

We find ClickHouse can easily ingest at least 3x the rate of Druid on the same hardware, and since Druid is asymmetric in design, you then have to get even more hardware to handle the queries.

Even with the vectorized aggregation, ClickHouse is beating Druid for full table scans at least, especially on high cardinality data. But the vectorized aggregation has some restrictions on when it can take the fast path, so that may improve as those are removed.

Overall, I find ClickHouse much easier to work with and manage compared to Druid. ymmv


I know a few companies that use Imply Cloud to set up and manage Druid. Here is the link if you want to check it out: https://imply.io/product/imply-cloud


No pricing information. I doubt it's very affordable.


Vespa is inverted-index based (with many extras inside). Actually, its aggregations aren't as optimized as Elasticsearch's (it depends on the scenario). To do aggregations, the fields must reside in memory.


What about Greenplum or CitusDB?


CitusDB is not relevant here, I believe, as it still uses Postgres' table storage, so it's not columnar. It might be good for analytical workloads, but I very much doubt it will perform anywhere close to ClickHouse.

Greenplum: I've not used it, but it does support columnar tables, so maybe it's comparable.



Greenplum seems like a way to throw more hardware at a problem. Queries are too slow with postgres? Shard your data across machines and distribute queries to speed things up by running in parallel. It's using scale as a means to compensate for low efficiency.

On the opposite side of the spectrum you have other open source projects like QuestDB that focus fully on core performance: constantly optimising to get as much as possible from a single processor core. You can't scale out (at least not yet), but given how fast it is on a single core, it will be pretty powerful if they choose to go that route.


Note that Greenplum supports column-oriented tables. I haven't used it myself, so I can't comment on whether it's slower than ClickHouse.


It's legitimately fast.

My company migrated our time-series data from InfluxDB to ClickHouse last year (I personally led this, in fact), and the performance difference is night and day.

While I liked a lot of what Influx could do, it was also nonstandard in bizarre ways (Clickhouse behaves more like a subset of SQL), sometimes shockingly immature, and despite appearing fast when we first started using it, so slow that it was a considerable bottleneck.


Clickhouse is not really a time-series database, it's a more general analytical DB (e.g. it can also handle strings, logs, and user IPs).

But if you have a lot of time-series metrics (only numbers), you might be better off with specialized time-series databases like Prometheus + VictoriaMetrics, with Grafana for visualization.


Prometheus is not really a database, because the only operation allowed is querying; you can't insert/delete data manually. For many use cases, that's not acceptable.


That's why I mentioned VictoriaMetrics :) It supports push (using any protocol).


I think of influx performance as similar to pandas. It's fast enough for many things it was designed for, but it's pretty easy to hit a performance cliff. In particular, "select * from table" type queries are very slow, so don't expect to be pulling out large blocks of data to do analysis somewhere else.

However, influx, being schemaless, can be outstanding for rapidly prototyping ephemeral metrics, as adding a new measurement is zero-cost (just start writing it). It also plays great with grafana for building dashboards. Finally, I much prefer the ergonomics enhancements of the influx query language (v1, the v2 "flux" language looks terrible to me personally), particularly how duration strings are a first-class datatype ("group by time(1h)").

Interested to hear more about clickhouse performance, haven't had a chance to use it for anything where performance would matter significantly, although am aware it can be very fast.


To summarise a lot of the responses here, Clickhouse is extremely fast on very modest hardware, very easy to set up, very easy to get started with (mostly normal SQL), and free. For our workloads and scale of data, nothing comes close in terms of performance (Redshift, BQ, Spark) and especially TCO. It is super simple to try out, why not give it a go? Or is the question born of a previous bad experience?


Perspective from sentry.io (we use it in production as wikibob pointed out below): Besides all the performance considerations discussed in this thread, which are 100% correct, a few other features we really like are:

- the multiple table engines that are heavily optimized for specific data access patterns: ReplacingMergeTree, which essentially makes records mutable; SummingMergeTree, which lets us progressively build pre-aggregated data; AggregatingMergeTree, which allows storing the intermediate aggregation state of most aggregation functions and composing them at query time over multiple groups (for example, store a p95 aggregation state hourly and query the daily p95 by composing them; see the sketch after this list); and more.

- the set of column data types is extensive and includes nested columns

- the architecture is relatively simple, making it easy for developers and on-prem users to deploy a single-node local clickhouse

- it is very efficient at inserting big batches of data, which works really well for our use case, where we ingest a massive amount of errors.

- data skipping indexes, bloom filter indexes
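
A rough sketch of the aggregation-state example from the first bullet (hourly p95 state composed into a daily p95), using the clickhouse-driver Python client; AggregatingMergeTree, quantileState, and quantileMerge are standard ClickHouse, but the server, tables, and schema here are made up for illustration:

    from clickhouse_driver import Client

    client = Client('localhost')

    # Hypothetical source table of raw request latencies.
    client.execute("""
        CREATE TABLE IF NOT EXISTS raw_requests (
            ts DateTime,
            latency_ms Float64
        ) ENGINE = MergeTree ORDER BY ts
    """)

    # Hourly rollup that keeps the *intermediate state* of the p95 computation,
    # not the final number, so states can be combined later over any time range.
    client.execute("""
        CREATE TABLE IF NOT EXISTS latency_hourly (
            hour DateTime,
            p95_state AggregateFunction(quantile(0.95), Float64)
        ) ENGINE = AggregatingMergeTree ORDER BY hour
    """)

    client.execute("""
        INSERT INTO latency_hourly
        SELECT toStartOfHour(ts) AS hour, quantileState(0.95)(latency_ms)
        FROM raw_requests
        GROUP BY hour
    """)

    # Daily p95 computed by merging the hourly states (not by averaging p95s,
    # which would be statistically wrong).
    daily = client.execute("""
        SELECT toDate(hour) AS day, quantileMerge(0.95)(p95_state) AS p95
        FROM latency_hourly
        GROUP BY day
    """)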

(yes, as vlad@sentry.io mentioned below we are hiring for the team that manages storage and thus clickhouse)


Clickhouse is extremely optimized for performing analytics on time series style data. Things like clickstream data (hence the name). Performing queries to do funnel level analysis for example, are extremely fast compared to other analytics databases I have tried. As others have mentioned it is often compared to Apache Druid in its intent and feature set.


Its main use is analytics and real-time dashboard/report generation in areas like adtech.

If a conversion happens, you need to regenerate the views/cubes in the reporting dashboard; clickhouse makes it cheap and easy to run such operations on commodity hardware.

If you don't have this, you'll be using BigQuery, and it might not be as fast.


Could you name those good analytical databases? I'd love to learn more.


DuckDB perhaps[1]: https://www.duckdb.org

[1] I say "perhaps" because I've only just started using it having migrated from MonetDB, but have no experience of alternatives like Presto.


Curious, as a sometime MonetDB user, I'd be interested to know why you chose DuckDB over MonetDB?


MonetDBLite disappeared from CRAN, and after some investigation it appears that the development team is now focused on a new product, DuckDB[1]

[1] https://github.com/MonetDB/MonetDBLite-R/issues/38#issuecomm...



A use case blog post about Apache Druid by Netflix, if y'all want to take a look: https://netflixtechblog.com/how-netflix-uses-druid-for-real-...


Snowflake, Redshift, BigQuery, Databricks, Presto.


I can speak for BigQuery and Databricks from personal experience.

BigQuery is much slower and is much more expensive for both storage and query.

Databricks (Spark) is even slower than that (both I/O and compute), although you can write custom code/use libs.

You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).


> You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).

Is it any more compressed than Apache Hive's ORC format (https://orc.apache.org)? Because that's increasingly accepted as a storage format in a lot of these analytical systems.


Yes, it looks like it. According to these posts, ORC only uses Snappy or zlib compression, while Clickhouse also has specialized codecs such as double-delta, Gorilla, and T64 (a small sketch of what double-delta means follows the links).

https://engineering.fb.com/core-data/even-faster-data-at-the...

https://www.altinity.com/blog/2019/7/new-encodings-to-improv...
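
To illustrate what double-delta means, here is a tiny conceptual Python sketch of delta-of-delta encoding; this is just the arithmetic, not ClickHouse's actual codec implementation:

    # Double-delta (delta-of-delta) encoding: store the first value, the first
    # delta, and then only the changes between consecutive deltas. For regular
    # sequences like timestamps, those second-order deltas are mostly zero and
    # compress extremely well.
    def double_delta_encode(values):
        deltas = [values[i] - values[i - 1] for i in range(1, len(values))]
        dods = [deltas[i] - deltas[i - 1] for i in range(1, len(deltas))]
        return values[0], deltas[0], dods

    def double_delta_decode(first, first_delta, dods):
        values, delta = [first, first + first_delta], first_delta
        for dod in dods:
            delta += dod
            values.append(values[-1] + delta)
        return values

    ts = [1000, 1060, 1120, 1181, 1240]      # near-regular timestamps
    enc = double_delta_encode(ts)
    print(enc)                                # (1000, 60, [0, 1, -2])
    assert double_delta_decode(*enc) == ts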


ORC and Parquet are file storage formats, so without context their performance can be almost anything. Where is the data stored? S3? HDFS? A local RAM disk?

Clickhouse manages the whole distributed storage, ram caching, etc. thing for you.

In my experience, a unified single purpose vertically integrated solution will be faster than a bunch of kitchen sink solutions bolted together.


Of those, it looks like only Presto is open source and/or free. So maybe it's a Presto versus ClickHouse comparison, which explains why so many choose ClickHouse (it's one of only two options in its class).


Presto is mostly a query layer that runs on top of other data stores, although it does have its own query execution engine.

The basic idea behind Presto is that it federates other databases, and supports doing joins across them. From what I understand, the problem that it solved at Facebook is bridging the gap between different teams; if a team has MySQL and another has files stored on HDFS, it doesn't really matter because all you do is query Presto and it'll query both under the covers. The alternative is setting up data pipelines, and dealing with the ongoing issues of maintaining those data pipelines.


Presto is not really a database, it's the SQL layer on top of many other data stores, like Hive / any other SQL DB / Redis / Cassandra / etc.


How well do those work on a single 8GB node? Because ClickHouse works very well at that scale, with a single C++ executable.

There's large complexity and cost overheads to Hadoop solutions, and not everyone has actual big data problems. ClickHouse hugely outperforms on query patterns that would devolve into table scans in a row store, while working at row store volumes of data without a bunch of big nodes.


Snowflake doesn’t really keep up with Clickhouse (in my experience) and it costs money.

Databricks is essentially Spark, and I shouldn't need a whole Spark cluster just to get database functionality. It also costs money.

Unless I’m mistaken, Presto is just a distributed query tool over the top of a separate storage layer, so that’s 2 things you have to setup.

I have no experience with BigQuery, but I've heard good things about it and Redshift; however, if the rest of your infra isn't on GCP/AWS, that will probably be a blocker.

Clickhouse is open source and comes with convenient clients in a bunch of languages as well as an HTTP API. It's outrageously fast, has some cool features, and makes the right trade-offs for its use case: a large range of supported input/output formats, built-in Kafka support, and replication and sharding that are reasonably straightforward to set up.


Also, Presto and Databricks (Spark) are just a layer on top of other storage; they cannot optimize the storage for querying, as you handle storage yourself.


According to https://tech.marksblogg.com/benchmarks.html Clickhouse has better performance than 3 of those (the other 2 haven't been tested in that benchmark)


I would be cautious using this as a proxy for performance ranking as some items (dataset, queries) are normalized, but the hardware setup is not.


The hardware profile is listed in each row; also, the guy is totally meticulous!


I don't think it's fair to say "A is faster than B" like in the above comments based on the order they appear in a list that mixes GPU cluster and laptop results. The author of the benchmark does nothing wrong deontologically, but the results table seems ordered by time, and some people jump to quick conclusions or use it as a way to rank performance when that's not appropriate.


Github link pls


Low-latency query execution, and the fact that it scales down well to small datasets, are my reasons for really liking it. Top throughput when scanning TB+ of data on multiple nodes isn't my use case; mine is answering a query within web request response times over a few hundred million rows of data.

The low latency of query execution is really nice.


It's the first time I've seen it on HN, and a cursory search shows only a few submissions with significant discussion. Nonetheless, as other commenters have said the SQL-based query language and low external dependencies made it easy for me to integrate it into my workflow.


Performance at scale for certain workloads.



We do. We are also hiring. Reach out to vlad@sentry.io for more info.


Clickhouse is one of the most underrated databases.

This basically replaces most of my usages of SQLite.

When its SQL "dialect" matures, Clickhouse will eat MySQL's lunch, then PostgreSQL's.


For analytics - sure. Clickhouse was not designed to handle OLTP workloads; there's no transaction support.


I cringe a bit inside at people using, say, NoSQL approaches when it makes literally no sense to do so.

Therefore I think the lack of OLTP will not matter much and that clickhouse will be widely used, but also misused when it becomes too fashionable.


This makes no sense.

For example, aside from the lack of transactions, Clickhouse is designed for insertion. There's an INSERT statement, but no UPDATE or DELETE statements. You can rewrite tables (there's ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE), but they're intended for large batch operations, and the operations are potentially asynchronous, meaning that the statement completes right away but you only see the results later.

ClickHouse has many other limitations. For example, there's no enforcement of uniqueness: You can insert the same primary key multiple times. You can dedupe the data, but only specific table engines support this.
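
For instance, ReplacingMergeTree (mentioned elsewhere in this thread) is one of those engines. A rough sketch with the clickhouse-driver Python client and a made-up table; note that deduplication only happens at background merge time, so queries need FINAL to force it:

    from clickhouse_driver import Client

    client = Client('localhost')

    # ReplacingMergeTree collapses rows with the same sorting key during
    # background merges, keeping the row with the highest `version`. It does
    # NOT enforce uniqueness on insert, which is the point being made above.
    client.execute("""
        CREATE TABLE IF NOT EXISTS users (
            user_id UInt64,
            email String,
            version UInt32
        ) ENGINE = ReplacingMergeTree(version) ORDER BY user_id
    """)

    client.execute("INSERT INTO users VALUES", [(1, 'old@example.com', 1)])
    client.execute("INSERT INTO users VALUES", [(1, 'new@example.com', 2)])

    # Until a merge happens, both rows exist; FINAL forces deduplication at
    # query time (at a cost), which is one of the quirks compared to a real UPDATE.
    print(client.execute("SELECT * FROM users FINAL WHERE user_id = 1"))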

There's absolutely no way anyone will want to use ClickHouse as a general-purpose database.


I should have phrased that differently: if something is good enough in some key metric, it extends to other uses - even if it makes a poor fit.

So I insist: everyone will WANT to use clickhouse as a general purpose database, and will create ways to make it so (e.g. copy the table with the columns you don't want filtered out, drop the original, rename).

It is just too fast and too good for many other things, so it will expand from these strongholds to the rest.

A personal example: I am migrating my cold storage to clickhouse, because I can just copy the files in place and be up and running.

I know about insert and the like; I have a great existing system - but this lets me simplify the design and deprecate many things. Fewer moving parts is in general better.

After that is done, there is a database where I would benefit from things like alter tables or advanced joins, but keeping PostgreSQL and ClickHouse side by side, just for this? No. PostgreSQL will go. Dirty tricks will be deployed. Data will be duplicated if necessary.


Advanced joins (specifically merge joins) and object storage are on the way. See the following PRs:

* https://github.com/ClickHouse/ClickHouse/pulls?q=is%3Apr+mer... -- Recent work to enable merge joins

* https://github.com/ClickHouse/ClickHouse/pulls?q=is%3Apr+s3 -- Same thing for managing data on S3 compatible object storage

There's been a lot of community interest in both topics. Merge join work is largely driven by the ClickHouse team at Yandex. Object storage contributions are from a wider range of teams.

That said I don't see ClickHouse replacing OLTP databases any time soon. It's an analytic store and many of the design choices favor fast, resource efficient scanning and aggregation over large datasets. ClickHouse is not the right choice for high levels of concurrent users working on mutable point data. For this Redis, PostgreSQL, or MySQL are your friends.


Sure - but the comment you're replying to made no mention of NoSQL. It just said Clickhouse lacks OLTP by design, that doesn't mean it won't be widely used, just that it will perhaps be limited to analytics workloads.

If you need deletes and transactions, look elsewhere, but Clickhouse seems to be great for what it's been designed for.


No, Clickhouse doesn't support UPDATEs, and schema changes are even harder than in classic SQL DBs. But we love it anyway :)


A related article from last year: https://news.ycombinator.com/item?id=20163017. No comments though.


Clickhouse Local is also great for preparing imports into your Clickhouse server: you can validate and preprocess CSVs into Clickhouse's native format before loading.
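
A rough sketch of that workflow driven from Python; the flags (--structure, --input-format, --output-format, --query) are the ones documented for clickhouse-local, but the file names and schema here are made up, so treat this as a starting point rather than a recipe:

    import subprocess

    # clickhouse-local exposes the piped-in input as a table named `table`.
    query = """
        SELECT user_id, toDate(ts) AS day, count() AS events
        FROM table
        GROUP BY user_id, day
    """

    with open("events.csv", "rb") as f:
        result = subprocess.run(
            [
                "clickhouse-local",
                "--structure", "user_id UInt64, ts DateTime, url String",
                "--input-format", "CSV",
                "--output-format", "Native",   # ready to feed to the server
                "--query", query,
            ],
            stdin=f,
            capture_output=True,
            check=True,   # a malformed CSV makes clickhouse-local exit non-zero
        )

    # Write the validated, pre-aggregated result in ClickHouse's native format.
    with open("events.native", "wb") as out:
        out.write(result.stdout)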


Funny thing, I just learnt about clickhouse today. For experienced people that use columnar stores and pandas for analytics, which tool do you usually prefer for BI stuff? Do y'all load data into clickhouse and then analyse it using pandas, or is all analysis done via the clickhouse SQL dialect? I ask because I'm sure things like pivot tables and rolling windows are a PITA in SQL.


Why would you bother using a database like clickhouse to store data if you're just going to analyze it in pandas? Just store it in a csv, parquet, or orc.

> As i'm sure things like pivot tables and rolling windows are a PITA in SQL

I can't speak for clickhouse, but group-by and window functions are a very standard part of any SQL analyst's toolbelt.


Another option to consider here is storing your data in TileDB[1], which allows you to access it via Python/Pandas, MariaDB for SQL (embeddable or standalone), Spark, R, and more. With embedded MariaDB[2] you can query directly into pandas with minimal overhead. TileDB is similar to Parquet in that it can capture sparse dataframe usage; however, it is more general in that TileDB supports multi-dimensional datasets and dense arrays. Other major features built directly into TileDB include handling updates, time traveling, and partitioning at the library level, removing the need for extra services like Delta Lake to deal with the numerous Parquet files you may create. TileDB also has native support for S3, and in the next release we'll have native Azure Blob Storage support.

[1] https://github.com/TileDB-Inc/TileDB

[2] https://docs.tiledb.com/developer/api-usage/embedded-sql

Disclosure: I am a member of the TileDB, Inc. team


thank you!


> Why would you bother using a database like clickhouse to store data if you're just going to analyze it in pandas?

Because I have more data than what fits locally, there’s a data pipeline that pushes more in, and I only need to work on a subset.

Storing everything in flat CSV/Parquet etc. is useless when there's more data than fits in your local/single-machine memory, or if you want to search/subset some of the data, or do anything that's larger than memory without having to write spill-to-disk stuff in Python/pandas.


Where did I suggest storing the data on a local machine? Common usage when using pandas/spark/etc is storing source data in parquet/orc/etc and they fully support data partitioning. That doesn't need to come from a local machine by any means.

My bigger point to the OP was: why bother using a DBMS that is specifically tuned for running fast analytical queries if they only intend to use it as a storage layer and push all of the analysis into Python? Use a solution focused on storage, not analysis, if that's the use case.


Because the more you can offload onto the database, the fewer people are re-inventing things from first principles every time they need to do something (and potentially making correctness mistakes).

Additionally, if I have my data in an actual database, I can attach tools like Tableau and Superset directly, as opposed to having to take it from storage and then put it in a database anyway before being able to visualise/use it.


You misinterpret my comment as being against databases, when in fact I'm simply against the notion of utilizing a purpose-built OLAP database (Clickhouse) merely as the storage layer for yet another analytical platform (Pandas). In this hypothetical scenario, before the data ever makes it into Clickhouse, there's a very high likelihood it would already live in an OLTP database such as Postgres or perhaps parquet files living on S3 designed to be read by Athena or Redshift Spectrum etc.


Why would you bother with load balancers, clusters, and distributed architectures when you could just deploy apache and PHP on a cheap commodity server?


I wonder how performant this is in comparison to other quick and dirty methods of working on local files, like bash, python, etc.


It's much faster in cases that involve significant processing. Where possible, query execution is vectorized and runs in parallel on as many cores as you make available.

There are some examples in the article cited by dang: https://news.ycombinator.com/item?id=20163017



