Bulk inserts on ClickHouse: How to avoid overstuffing your instance

chis · 2025-02-14T17:10:14 1739553014

Clickhouse has a built in feature for this called Asynchronous Inserts https://clickhouse.com/docs/en/optimize/asynchronous-inserts

It just buffers inserts in a queue and then does them all at once after a second has passed. This document does briefly mention it but it's such a nice feature that saves you from setting up a whole Kafka thing.

mikeshi42 · 2025-02-14T17:43:54 1739555034

You want to be careful of async inserts as it's in-memory buffered as well and is not replicated afaik. So you can lose a bit of durability when using it (or you make sure your upstream pipeline can retry).

It is really nice in a pinch though.

chis · 2025-02-14T18:01:12 1739556072

My read on it was that if you keep wait_for_async_insert=True then the client will still wait until there's confirmation of a successful transaction, giving the possibility to retry. It does require one to actually have a retry handler on the client side, but that's a good idea anyways.

dilyevsky · 2025-02-14T18:52:07 1739559127

In practice, using async insert with wait means you need batching/retries implemented on the client side or else you may find you have thousands of threads stuck in wait. If you need to replay bc you messed something up you're also SOL. I generally appreciate not having to run kafka but that comes at a cost.

chis · 2025-02-14T18:59:18 1739559558

Very good point.

jsumrall · 2025-02-14T19:06:20 1739559980

Note that you will need a thread pool that’s large enough to handle 1 second (or whatever flush timeout you set) of events in your app concurrently. If you have thousands of events per second, plan accordingly.

iancarroll · 2025-02-14T18:04:46 1739556286

Didn’t spend enough time to find the root cause, but at enough volume this option caused some of our inserts to take several seconds, even with awaiting for success disabled. Just moved to using an in-memory queue in our app and it fixed it.

pachico · 2025-02-14T15:14:29 1739546069

In the past I used clickhouse-bulk and the buffer engine, mostly, which did a good job indeed.

However, during the last years I always find myself using https://vector.dev/ for all sort of tasks, including bulk inserts in ClickHouse.

twalla · 2025-02-14T15:36:35 1739547395

Oh man this is exactly what I've been looking for, thanks!

tacone · 2025-02-14T18:55:13 1739559313

Did you use it to get CDC from a relational db to clickhouse?

yakkomajuri · 2025-02-14T14:39:18 1739543958

The Kafka engine gets a little bit of hate (well, I haven't worked with ClickHouse for like 2 years now) but honestly it will go a long way.

At PostHog we were inserting something around 50M rows a day into CH with it and it was quite nice to be able to pause ingestion to a table by just detaching the table via SQL :D

I think they're still using the Kafka engine today but not sure. In our case (at least back then) we had to live with suboptimal batch sizes because we were providing near realtime analytics so Kafka was a solid fit.

dm03514 · 2025-02-15T20:53:15 1739652795

We just added Clickhouse sink support to SQLFlow:

- https://github.com/turbolytics/sql-flow/issues/100

- https://github.com/turbolytics/sql-flow/pull/116

The Clickhouse python API made this trivial thanks to supporting Arrow!

We have a tutorial explaining how to stream data from Kafka to Clickhouse using SQLFlow:

https://sql-flow.com/docs/tutorials/clickhouse-sink

Local benchmarks show 70,000 rows / second insert using a batch size of 50k!