Bulk inserts on ClickHouse: How to avoid overstuffing your instance (runportcullis.co)
42 points by mooreds 6 months ago | 12 comments


ClickHouse has a built-in feature for this called Asynchronous Inserts: https://clickhouse.com/docs/en/optimize/asynchronous-inserts

It just buffers inserts in a queue and then does them all at once after a second has passed. The article does briefly mention it, but it's such a nice feature that it saves you from setting up a whole Kafka pipeline.
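For anyone curious, enabling it from a client is just a couple of settings on the insert. A rough sketch with clickhouse-connect (table name and columns are made up):

    import clickhouse_connect

    client = clickhouse_connect.get_client(host='localhost')  # adjust host/credentials

    # async_insert=1 tells the server to buffer the rows; with
    # wait_for_async_insert=1 the call blocks until the buffer is flushed.
    client.insert(
        'events',                              # hypothetical table
        [[1, 'click'], [2, 'pageview']],       # hypothetical rows
        column_names=['id', 'event_type'],
        settings={'async_insert': 1, 'wait_for_async_insert': 1},
    )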


You want to be careful with async inserts, as the buffer is in-memory and, AFAIK, not replicated. So you can lose a bit of durability when using it (or you need to make sure your upstream pipeline can retry).

It is really nice in a pinch though.


My read on it was that if you keep wait_for_async_insert=1, the client still waits until the server confirms the buffered data has been flushed to the table, giving you the possibility to retry. It does require you to actually have a retry handler on the client side, but that's a good idea anyway.
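A minimal sketch of what that retry handler might look like (table/columns and backoff numbers are arbitrary):

    import time
    import clickhouse_connect

    client = clickhouse_connect.get_client(host='localhost')

    def insert_with_retry(rows, attempts=5):
        # Retries the whole batch; assumes the target table can tolerate
        # occasional duplicates, or that you dedupe downstream.
        for attempt in range(attempts):
            try:
                client.insert(
                    'events', rows,
                    column_names=['id', 'event_type'],
                    settings={'async_insert': 1, 'wait_for_async_insert': 1},
                )
                return
            except Exception:
                time.sleep(min(2 ** attempt, 30))  # exponential backoff, capped
        raise RuntimeError('insert failed after retries')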


In practice, using async insert with wait means you need batching/retries implemented on the client side, or you may find yourself with thousands of threads stuck waiting. If you need to replay because you messed something up, you're also out of luck. I generally appreciate not having to run Kafka, but that convenience comes at a cost.


Very good point.


Note that you will need a thread pool that’s large enough to handle 1 second (or whatever flush timeout you set) of events in your app concurrently. If you have thousands of events per second, plan accordingly.
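To make that concrete, the back-of-envelope arithmetic looks roughly like this (the numbers are purely illustrative):

    from concurrent.futures import ThreadPoolExecutor

    # Assumed load, hypothetical numbers:
    events_per_second = 2000
    flush_timeout_seconds = 1.0   # whatever async_insert_busy_timeout_ms you set

    # With wait_for_async_insert=1 each insert call blocks until the server
    # flushes the buffer, so in steady state you hold roughly
    # events_per_second * flush_timeout_seconds waiters at once.
    concurrent_waiters = int(events_per_second * flush_timeout_seconds)

    pool = ThreadPoolExecutor(max_workers=concurrent_waiters * 2)  # 2x headroom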


I didn’t spend enough time to find the root cause, but at sufficient volume this option caused some of our inserts to take several seconds, even with waiting for success disabled. We just moved to an in-memory queue in our app and that fixed it.
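The app-side queue is simple enough to roll yourself. Something along these lines (flush thresholds and table/columns are arbitrary, not what we actually run):

    import queue
    import threading
    import time
    import clickhouse_connect

    client = clickhouse_connect.get_client(host='localhost')
    buf = queue.Queue()

    def flusher(max_rows=10000, max_wait=1.0):
        # Drain the queue and insert one batch per max_wait seconds
        # (or sooner, once max_rows have accumulated).
        while True:
            rows, deadline = [], time.monotonic() + max_wait
            while len(rows) < max_rows and time.monotonic() < deadline:
                try:
                    rows.append(buf.get(timeout=0.05))
                except queue.Empty:
                    pass
            if rows:
                client.insert('events', rows, column_names=['id', 'event_type'])

    threading.Thread(target=flusher, daemon=True).start()

    # The hot path just enqueues:
    buf.put([1, 'click'])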


In the past I mostly used clickhouse-bulk and the Buffer engine, which did a good job indeed.

However, in recent years I keep finding myself using https://vector.dev/ for all sorts of tasks, including bulk inserts into ClickHouse.


Oh man this is exactly what I've been looking for, thanks!


Did you use it to get CDC from a relational DB into ClickHouse?


The Kafka engine gets a little bit of hate (well, I haven't worked with ClickHouse for like 2 years now) but honestly it will go a long way.

At PostHog we were inserting around 50M rows a day into CH with it, and it was quite nice to be able to pause ingestion to a table by just detaching the table via SQL :D

I think they're still using the Kafka engine today but not sure. In our case (at least back then) we had to live with suboptimal batch sizes because we were providing near realtime analytics so Kafka was a solid fit.
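The detach trick is just DDL, so it's easy to script. A hedged sketch (the table name is hypothetical):

    import clickhouse_connect

    client = clickhouse_connect.get_client(host='localhost')

    # Pausing ingestion: detaching the Kafka engine table stops consumption;
    # the materialized view reading from it simply gets no new blocks.
    client.command('DETACH TABLE kafka_events')

    # ... do maintenance, fix the schema, etc. ...

    # Resuming: re-attach and consumption picks up from the committed offsets.
    client.command('ATTACH TABLE kafka_events')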


We just added Clickhouse sink support to SQLFlow:

- https://github.com/turbolytics/sql-flow/issues/100

- https://github.com/turbolytics/sql-flow/pull/116

The ClickHouse Python API made this trivial thanks to its Arrow support!

We have a tutorial explaining how to stream data from Kafka to Clickhouse using SQLFlow:

https://sql-flow.com/docs/tutorials/clickhouse-sink

Local benchmarks show 70,000 rows/second inserted using a batch size of 50k!
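For anyone who hasn't tried the Arrow path: clickhouse-connect can take a pyarrow Table directly, which is presumably what keeps the batching cheap. A rough sketch (table and columns invented for illustration):

    import pyarrow as pa
    import clickhouse_connect

    client = clickhouse_connect.get_client(host='localhost')

    # Build (or receive) a batch as an Arrow table...
    batch = pa.table({
        'id': pa.array(range(50_000), type=pa.int64()),
        'event_type': pa.array(['click'] * 50_000),
    })

    # ...and hand it to ClickHouse without converting to Python objects row by row.
    client.insert_arrow('events', batch)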



