
An important question not mentioned in this article - and one the author may not have considered - is how much (Dev)Ops burden each of these adds.

In the places I've worked that use Kafka, it's 100% always a source of issues and operational headaches.

That's in fairly high-throughput environments though; no idea if it "just works" flawlessly in easygoing ones.



I wonder... how many issues was Kafka "soaking up" by dealing with concerns that applications and services didn't have to even consider?

As in, I wonder how much application developer burden would be present if using MQTT instead.


The vast majority of "Kafka problems" (or rather "Kafka Streams problems") we have with a managed solution at work are due to not fully understanding how it works and how to set it up. There's so much stuff you can configure and so much potential to misuse it. Typical problems are wrong configuration for acks, not understanding durability guarantees, not understanding exactly-once semantics, not naming hidden topics in Kafka Streams, not using schema registry serializers for hidden topics, choosing the wrong partitioning scheme (and wanting to change it later), using Kafka clients with different hash functions, using wrong schema compatibility mode, etc. etc.
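
To make the acks/durability point concrete, here's a minimal sketch with the plain Java client (broker address and topic name are placeholders, and the values shown are illustrative, not recommendations for any particular workload):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DurableProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker address.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // acks=all waits for the in-sync replicas; acks=1 or 0 trade durability for latency.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            // Idempotence deduplicates retries on the broker side and is a prerequisite
            // for exactly-once processing when combined with transactions downstream.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "key-1", "value-1"));
                producer.flush();
            }
        }
    }

Most of the surprises we hit were people running with the defaults without realising what trade-off the defaults make.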


It's an interesting question. No idea how to go about quantifying it though.


Fair.


What issues did you run into?

From a technology perspective it's been rock solid for years in my experience.

Where issues crept in, it was always due to people not understanding the architecture and the patterns you need to use: anti-patterns like splitting batches into multiple messages, "everything must be stored in Kafka" thinking, not understanding how offset commits work, not understanding when to use keys or the effects of partitioning, resetting offsets on a live topic, aggressive retention policies, etc.
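
The offset-commit one in particular is easy to illustrate. A sketch with the plain Java consumer (group and topic names are made up): turning off auto-commit and committing only after processing is what gives you at-least-once behaviour instead of silently dropping messages on a crash.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // Auto-commit acknowledges offsets on a timer, whether or not the records were processed.
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // do the real work first
                    }
                    // Commit only after processing, so a crash replays the batch instead of losing it.
                    consumer.commitSync();
                }
            }
        }

        static void process(ConsumerRecord<String, String> record) {
            System.out.println(record.key() + " -> " + record.value());
        }
    }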


One issue I’ve encountered is over-partitioning to handle a spike in traffic.

I.e. an event occurs which causes an order of magnitude more messages than usual to be produced for a couple of hours, and because ingest and processing flows are out of whack, a backlog forms. Management wants things back in sync ASAP, and so green-lights increasing the partition count on the topic, usually doubling it.

In an event-driven architecture that is fairly well tuned for normal traffic, the backlog cascades downstream, and those topics up their partition counts as well in response.

Once the anomalous traffic subsides, teams go to turn down the now over-partitioned topics, only to learn that it was a one-way operation and they're stuck with that many partitions and the associated cost overhead.
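
The one-way nature is visible right in the admin API. A sketch (hypothetical topic name and count, assuming the Java AdminClient): partitions can only be increased, so shrinking a topic means creating a new one and re-ingesting into it.

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class ExpandPartitions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // increaseTo is the only direction offered; there is no "decreaseTo".
                // Note this also changes which keys hash to which partition going forward.
                admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(24))).all().get();
            }
        }
    }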

Also if I see another team try to implement “retries” or delayed processing on messages by doing some weird multi-topic trickery I’m going to lose my mind. Kafka is a message queue, not a job queue, and not nearly enough engineers seem to grok that.


Do you know of any places to learn those things? Kafka seems pretty interesting to me


- Confluent blog, particularly anything by Martin Kleppmann: https://www.confluent.io/blog/author/martin-kleppmann/

- "I Heart Logs" by Jay Krepps (free)

- "Kafka, The Definitive Guide" (also free)


For shops light on DevOps-fu, Confluent hosted Kafka is popular for just this reason.


If you’re on AWS I’ve had zero issues with their managed Kafka offering (MSK). I’m sure they did lots behind the scenes, but it was really one of our most rock-solid pieces of infrastructure.

If I had a need for Kafka in my current role, I'd probably give the Confluent and Redpanda offerings a shot.


> In the places I've worked that use Kafka, it's 100% always a source of issues and operational headaches.

Compared to what?

I have the opposite experience. For example, ingesting large amounts of log data. Kafka could handle an order of magnitude more events compared to Elasticsearch. Even if the data ultimately ended up in ES, being able to ingest with Kafka improved things considerably. We ended up getting an out of the box solution that does just that (Humio, now known as LogScale).

Similar experience when replacing RabbitMQ with Kafka. Neither "just works" and there are always growing pains in high-throughput applications, but that comes with the territory.

Is Kafka the source of headaches, or is it Zookeeper? Usually it's Zookeeper for me (although, again, Zookeeper has difficult problems to solve, which is why software packages use ZK in the first place).


To be fair, it's not like it's solving a trivial problem. High throughput, reliable and highly available message queuing is just hard.


Where I work we have an on-premises Hadoop cluster and Kafka is its only stable component that works without constant headaches.



