The vast majority of "Kafka problems" (or rather "Kafka Streams problems") we have with a managed solution at work are due to not fully understanding how it works and how to set it up. There's so much stuff you can configure and so much potential to misuse it. Typical problems are misconfigured acks, not understanding durability guarantees, not understanding exactly-once semantics, not naming Kafka Streams internal topics, not using schema registry serializers for those internal topics, choosing the wrong partitioning scheme (and wanting to change it later), using Kafka clients with different hash functions, using the wrong schema compatibility mode, etc. etc.
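To make a couple of those concrete, here's a rough sketch of the knobs involved, with invented topic/application names and a recent (3.x) Java client assumed; the exact values obviously depend on your setup:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class SetupSketch {
        public static void main(String[] args) {
            // Plain producer: the settings behind most "where did my message go" arguments.
            Properties producer = new Properties();
            producer.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            producer.put(ProducerConfig.ACKS_CONFIG, "all");                      // wait for all in-sync replicas
            producer.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);         // retries don't duplicate

            // Kafka Streams: opt into exactly-once explicitly instead of assuming it...
            Properties streams = new Properties();
            streams.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-enricher");  // invented app id
            streams.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            streams.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

            // ...and name the internal topics, so repartition/changelog topics don't end up
            // with auto-generated names that shift when the topology changes.
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
                   .groupByKey(Grouped.as("orders-by-key"))   // names the repartition topic, if one is needed
                   .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("orders-count")); // names the changelog topic/store
        }
    }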
From a technology perspective it's been rock solid for years in my experience.
Where issues crept in, it was always due to people not understanding the architecture and the patterns you need to use, e.g. anti-patterns like splitting batches into multiple messages, "everything must be stored in Kafka" thinking, not understanding how offset commits work, not understanding when to use keys or the effects of partitioning, resetting offsets on a live topic, aggressive retention policies, etc.
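To illustrate the keys/offsets part specifically (all topic, group, and field names invented): events for one entity share a key so they land on one partition in order, and offsets are committed only after the work is actually done, not on a timer:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class OffsetAndKeySketch {
        public static void main(String[] args) {
            // Keys: on the producing side, all events for one entity share a key, e.g.
            //   producer.send(new ProducerRecord<>("payments", customerId, payload));
            // so they hash to one partition and are consumed in order.

            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processor");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);          // commit only after work is done

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("payments"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // your handler; if this throws, nothing below is committed
                    }
                    consumer.commitSync(); // an offset commit just means "processed up to here", per partition
                }
            }
        }

        static void process(ConsumerRecord<String, String> record) { /* business logic */ }
    }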
One issue I’ve encountered is over-partitioning to handle a spike in traffic.
I.e. an event occurs that causes an order of magnitude more messages than usual to be produced for a couple of hours, and because ingest now outpaces processing, a backlog forms. Management wants things back in sync ASAP, and so green-lights increasing the partition count on the topic, usually doubling it.
In an event-driven architecture that is fairly well tuned for normal traffic, the same thing then happens downstream, and those topics get their partition counts bumped in response as well.
Once the anomalous traffic subsides, teams go to turn down the now over-partitioned topics, only to learn that it was a one-way operation, and now they're stuck with that many partitions and the associated cost overhead.
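That asymmetry is visible right in the admin API: partitions can only be added. A minimal sketch, with an invented topic name and target count:

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class PartitionBump {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // Going up is a one-liner...
                admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(48))).all().get();

                // ...but there is no "decreaseTo". Asking for fewer partitions than the topic
                // already has simply fails; the only way back down is a new topic, a consumer/
                // producer migration, and re-keying the data into it.
            }
        }
    }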
Also, if I see another team try to implement "retries" or delayed processing of messages by doing some weird multi-topic trickery, I'm going to lose my mind. Kafka is a message queue, not a job queue, and not nearly enough engineers seem to grok that.
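For anyone who hasn't run into it, the trickery tends to look roughly like this (topic and group names invented), with the comments noting why it hurts:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class RetryTopicTrickery {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            c.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-worker");
            c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        try {
                            handle(record);
                        } catch (Exception e) {
                            // "Retry later" = republish to a parallel topic and have another
                            // consumer sleep until each message is due. Now you have head-of-line
                            // blocking, lost ordering, and one extra topic per delay tier: a job
                            // queue reimplemented on top of a log.
                            producer.send(new ProducerRecord<>("orders.retry.5m", record.key(), record.value()));
                        }
                    }
                }
            }
        }

        static void handle(ConsumerRecord<String, String> record) { /* business logic */ }
    }

If you genuinely need delays, scheduled retries, and per-job state, something built as a job queue or workflow engine alongside Kafka is usually the saner answer.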
If you're on AWS, I've had zero issues with their managed Kafka offering (MSK). I'm sure they do a lot behind the scenes, but it was genuinely one of our most rock-solid pieces of infrastructure.
If I had a need for Kafka in my current role, I'd probably give the Confluent and Redpanda offerings a shot.
> In the places I've worked that use Kafka, it's 100% always a source of issues and operational headaches.
Compared to what?
I have the opposite experience. For example, ingesting large amounts of log data: Kafka could handle an order of magnitude more events than Elasticsearch. Even if the data ultimately ended up in ES, being able to ingest through Kafka first improved things considerably. We ended up buying an off-the-shelf solution that does exactly that (Humio, now known as LogScale).
Similar experience when replacing RabbitMQ with Kafka. Neither "just works", and there are always growing pains in high-throughput applications, but that comes with the territory.
Is Kafka the source of headaches, or is it Zookeeper? For me it's usually Zookeeper (although, again, Zookeeper is solving genuinely difficult problems, which is why software packages use ZK in the first place).
In the places I've worked that use Kafka, it's 100% always a source of issues and operational headaches.
That's in fairly high-throughput environments though; no idea if it "just works" flawlessly in easier-going ones.