The vast majority of "Kafka problems" (or rather "Kafka Streams problems") we have with a managed solution at work are due to not fully understanding how it works and how to set it up. There's so much stuff you can configure and so much potential to misuse it. Typical problems are misconfigured acks, not understanding durability guarantees, not understanding exactly-once semantics, not naming Kafka Streams internal topics, not using schema registry serializers for those internal topics, choosing the wrong partitioning scheme (and wanting to change it later), using Kafka clients with different hash functions, using the wrong schema compatibility mode, etc. etc.
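To make a couple of those concrete, here's a rough sketch of the knobs involved, with invented topic/application names and a recent (3.x) Java client assumed; the exact values obviously depend on your setup:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class SetupSketch {
        public static void main(String[] args) {
            // Plain producer: the settings behind most "where did my message go" arguments.
            Properties producer = new Properties();
            producer.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            producer.put(ProducerConfig.ACKS_CONFIG, "all");                      // wait for all in-sync replicas
            producer.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);         // retries don't duplicate

            // Kafka Streams: opt into exactly-once explicitly instead of assuming it...
            Properties streams = new Properties();
            streams.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-enricher");  // invented app id
            streams.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            streams.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

            // ...and name the internal topics, so repartition/changelog topics don't end up
            // with auto-generated names that shift when the topology changes.
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
                   .groupByKey(Grouped.as("orders-by-key"))   // names the repartition topic, if one is needed
                   .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("orders-count")); // names the changelog topic/store
        }
    }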
From a technology perspective it's been rock solid for years in my experience.
Where issues crept in, it was always due to people not understanding the architecture and the patterns you need to use, e.g. anti-patterns like splitting batches into multiple messages, "everything must be stored in Kafka" thinking, not understanding how offset commits work, not understanding when to use keys or the effects of partitioning, resetting offsets on a live topic, aggressive retention policies, etc.
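To illustrate the keys/offsets part specifically (all topic, group, and field names invented): events for one entity share a key so they land on one partition in order, and offsets are committed only after the work is actually done, not on a timer:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class OffsetAndKeySketch {
        public static void main(String[] args) {
            // Keys: on the producing side, all events for one entity share a key, e.g.
            //   producer.send(new ProducerRecord<>("payments", customerId, payload));
            // so they hash to one partition and are consumed in order.

            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processor");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);          // commit only after work is done

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("payments"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // your handler; if this throws, nothing below is committed
                    }
                    consumer.commitSync(); // an offset commit just means "processed up to here", per partition
                }
            }
        }

        static void process(ConsumerRecord<String, String> record) { /* business logic */ }
    }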
One issue I’ve encountered is over-partitioning to handle a spike in traffic.
I.e. an event occurs that causes an order of magnitude more messages than usual to be produced for a couple of hours, and because ingest now outpaces processing, a backlog forms. Management wants things back in sync ASAP, and so green-lights increasing the partition count on the topic, usually doubling it.
In an event-driven architecture that is fairly well tuned for normal traffic, the same thing then happens downstream, and those topics get their partition counts bumped in response as well.
Once the anomalous traffic subsides, teams go to turn down the now over-partitioned topics, only to learn that it was a one-way operation, and now they're stuck with that many partitions and the associated cost overhead.
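That asymmetry is visible right in the admin API: partitions can only be added. A minimal sketch, with an invented topic name and target count:

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class PartitionBump {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // Going up is a one-liner...
                admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(48))).all().get();

                // ...but there is no "decreaseTo". Asking for fewer partitions than the topic
                // already has simply fails; the only way back down is a new topic, a consumer/
                // producer migration, and re-keying the data into it.
            }
        }
    }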
Also, if I see another team try to implement "retries" or delayed processing of messages by doing some weird multi-topic trickery, I'm going to lose my mind. Kafka is a message queue, not a job queue, and not nearly enough engineers seem to grok that.
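For anyone who hasn't run into it, the trickery tends to look roughly like this (topic and group names invented), with the comments noting why it hurts:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class RetryTopicTrickery {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            c.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-worker");
            c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        try {
                            handle(record);
                        } catch (Exception e) {
                            // "Retry later" = republish to a parallel topic and have another
                            // consumer sleep until each message is due. Now you have head-of-line
                            // blocking, lost ordering, and one extra topic per delay tier: a job
                            // queue reimplemented on top of a log.
                            producer.send(new ProducerRecord<>("orders.retry.5m", record.key(), record.value()));
                        }
                    }
                }
            }
        }

        static void handle(ConsumerRecord<String, String> record) { /* business logic */ }
    }

If you genuinely need delays, scheduled retries, and per-job state, something built as a job queue or workflow engine alongside Kafka is usually the saner answer.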
If you're on AWS, I've had zero issues with their managed Kafka offering (MSK). I'm sure they do a lot behind the scenes, but it was genuinely one of our most rock-solid pieces of infrastructure.
If I had a need for Kafka in my current role, I'd probably give the Confluent and Redpanda offerings a shot.
> In the places I've worked that use Kafka, it's 100% always a source of issues and operational headaches.
Compared to what?
I have the opposite experience. For example, ingesting large amounts of log data: Kafka could handle an order of magnitude more events than Elasticsearch. Even if the data ultimately ended up in ES, being able to ingest through Kafka first improved things considerably. We ended up buying an off-the-shelf solution that does exactly that (Humio, now known as LogScale).
Similar experience when replacing RabbitMQ with Kafka. Neither "just works", and there are always growing pains in high-throughput applications, but that comes with the territory.
Is Kafka the source of headaches, or is it Zookeeper? For me it's usually Zookeeper (although, again, Zookeeper is solving genuinely difficult problems, which is why software packages use ZK in the first place).
In the places I've worked that use Kafka, it's 100% always a source of issues and operational headaches.
That's in fairly high-throughput environments though; no idea if it "just works" flawlessly in easier-going ones.