What every software engineer should know about Apache Kafka (michael-noll.com)
597 points by aloknnikhil on May 16, 2020 | 276 comments



This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today. Yes, you can turn a stream of events into a table of the present state. However, during that process you will eventually confront every single hard problem that relational database management systems have faced for decades. You will more or less have to write a full-fledged DBMS in your application code. And you will probably not do a great job, and will end up with dirty reads, phantoms, and all the other symptoms of a buggy database.

Kafka is a message broker. It’s not a database and it’s not close to being a database. This idea of stream-table duality is not nearly as profound or important as it seems at first.


Recently I watched a 50-engineer startup allocate more than 50% of their engineering time, for about two years, to coping with the consequences of using Kafka as their database and eventually trying to migrate off of it. The whole time I was wondering "but how could anyone have started down this path?!?"

Apparently the primary reason they went out of business was sales-related, not purely technical, but if they hadn't used Kafka, they could have had 2x the feature velocity, or better yet 2x the runway, which might have let them survive and eventually thrive.

Imagine, thinking you want a message bus as your primary database.


I worked with a short-lived startup which made this exact mistake. I suggested persisting important events to a Postgres DB, but I was shot down over and over. They were positive it would be fine to hold everything in Kafka. There was this notion that Kafka functioned fine as a long-term data store with the correct configuration (I'm not disputing that), but there was no solution to the lack of... well, everything Postgres could offer. As you mentioned, the lack of velocity really dragged them down. Customers were constantly upset that x broke or y wasn't delivered yet, and each time it was clear that Kafka wasn't helping them meet those needs.

Kafka is awesome for what it does, but there are a ton of people out there using it for weird stuff.


Do you happen to know what sort of thinking leads a team down this path? It seems a fundamental mistake. Resume driven development?


First, you decide to build your data platform with Apache Kafka. Makes sense, as it's a very widely used, durable message bus that does the job well.

Then... you decide to attend a Kafka developer conference like Kafka Summit to sharpen your skills and learn best practices.

At that conference, you see the CEO of Confluent give a keynote speech about how Kafka is a transaction store; you 'turn your database inside out', run SQL queries on it, even run your BI dashboards on top of it! They make it seem like this is a best practice that all the smartest engineers are doing, so you better do it as well.

At that point you're investing in a path that will make Confluent millions in revenue consulting you on implementing their experimental, up-sell architecture on top of Kafka. Maybe their sales rep will suggest you buy a $100k/month subscription to Confluent Cloud to really patch up any holes and make it easier to maintain.

Moral of the story: do your best to separate the business side and the tech side of open source. Behind many open source projects, there's a company trying to make money on an 'enterprise' version.

Ethically, I prefer data companies like Snowflake that are at least transparent about trying to make money off their product, rather than companies that use open source as a ladder to drive adoption and then try to bait-and-switch you into the same 7-figure deals with software that is more brittle and still requires you to develop your own solutions with it.


In this case the application was commissioned by the founder, and the people developing it were given errant criteria (and not given other valid criteria) which admittedly made Kafka sound like it would make sense.

The developers who adopted the commissioned application were way too inexperienced to know one way or another, and by the time I was asked to help out, no one had really learned anything about Kafka.

My suggestions to replace it with a basic queue were rejected, and my suggestions to store events in a database for easier retrieval were rejected. They were sure that learning Kafka better would fix everything instead.

Ironically there's not even that much to learn about it, in the scheme of things. It's like learning more about your car so you can make it fly one day. It's just not the right tool for the job.


Insufficient asking "why?"

1. Why do we want to use Kafka? [insert thinking] So that we can do x yet avoid y.

2. Why is the customer asking for these features? [insert listening] So that they can do z and think of themselves as w.

And then failure to remember and revisit that "why?".


In addition to what others have answered, I think many devs consistently underestimate the amount of heavy lifting that relational databases do.

And even worse a lot of this heavy lifting isn't obvious until you are fairly far into a project and suddenly realize that to solve a bug you need to implement transactions or referential integrity.


Not just Kafka, either. Mongo, Cassandra, Couchbase, Redis, Solr, and Elasticsearch are all mistaken for replacements for an RDBMS.


Cassandra can replace an rdbms in a lot of cases if eventual consistency works for your data.

It's not a drop in replacement though, that's for sure.


Better than the other suggestions, but my experience is that you run into Cassandra's limitations really quickly: you can't query on non-primary columns, and you can't join tables without pulling all the data down to the client and merging manually.


> can't query on non-primary columns,

That's incorrect unless I'm misunderstanding what you mean by primary columns. It's just not as efficient.

And missing joins are by design, and one of the reasons Cassandra is as fast as it is. As I said before: it's not a drop-in replacement. You need to architect your application around its strengths to leverage its performance.

It is, however, usable as an RDBMS replacement if you know what you're doing and your data is fine with eventual consistency.

And knowing what Cassandra does with your data is important as well. It's actually a key-value store on steroids. Once you get that, its limitations and strengths become obvious.


> That's incorrect unless I'm misunderstanding what you mean by primary columns

I'm referring to "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

> It's just not as efficient.

That's what I thought, too, so I "allowed filtering". And crashed the database. (Apparently the correct solution here is to use Spark).


Haha, yes - that's a real possibility.

Cassandra is, at its heart, a key-value store. For every queryable field, you need to create a new in-place copy of the data, so you're basically saving the data n times, once for each way you wish to query it.

If you however try to query on a field which hasn't been marked as queryable, the cluster basically has to select everything in the table and do a for loop to filter for the 'where' clause :-)
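
For illustration, a rough sketch of the "one denormalized table per access pattern" idea (DataStax Python driver; the keyspace, tables, and columns here are made up):

    # Hypothetical sketch: the same user data is written twice, once per
    # access pattern, because you can only query efficiently by partition key.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("example_ks")  # hypothetical keyspace

    session.execute("""
        CREATE TABLE IF NOT EXISTS users_by_id (
            user_id uuid PRIMARY KEY, email text, country text)
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS users_by_country (
            country text, user_id uuid, email text,
            PRIMARY KEY (country, user_id))
    """)
    # Writes go to both tables. Querying users_by_id by country instead would
    # need ALLOW FILTERING, i.e. a full scan - the thing that crashed the
    # cluster mentioned upthread.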

But I haven't used it in production yet, so you've got more experience than I do.


My hypothesis: it’s hard to have a decent grasp of a technology without having actually used it. Tie that with “let’s not include too many moving parts” and it’s easy to end up in a situation where the edict is “Kafka”. Let’s say you have never used an RDBMS, only used RethinkDB, and that turned out to be problematic for whatever reason; next project, the founder hires you on the premise that the system you build needs to scale to billions of requests per minute ASAP (even though currently there is zero traffic).


"Even though currently there is zero traffic" is exactly right. Haha. When this company finally did get customers, the thing they thought would help them manage thousands of high-volume customers ended up making it so they could barely retain a few very low-volume customers.


Yea, I agree. Having the endgame solution in place at the beginning is often a mistake and can actually foil your ability to get to the end game.


I'll admit, I only recognized that mistake because I've made it myself, over and over. It's hard to push code knowing it'll need improvements later, or knowing how scopes will change. I find myself repeating "perfect is the enemy of good" because I struggle to just let a solution be good enough.

It's tough to be consulted and watch people go against your advice like that, though.


What's sometimes annoying is that it seems some cloud services (Firebase, I imagine) only provide an endgame solution for the datastore.


I think this is worth mentioning again: https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb


> this exact mistake

https://kafka-summit.org/sessions/source-truth-new-york-time...

Both your comment and the GP are using examples of failed startups as the basis of your criticism. One mentions "50 engineers" and I'd bet not a single one had an "engineer" bone or training in their body. So I read that as 50 blog-driven programmers failing to execute a sophisticated system design they read about on some blog. Surprise?

Using Kafka as the distributed log of a deconstructed database is technically an order of magnitude more demanding than wiring MVC widgets, or using the latest paint by numbers framework, etc., because that is in fact developing a custom distributed database.

Technical leadership and experience are forgotten things but they do matter. Let's not blame the stack.


I said Kafka is great at what it's for - I'm not blaming it at all. I actually enjoyed working with it a lot, despite that it couldn't help my client very much. Fine tuning it to work better for them was probably one of the more fun projects I've worked on in the last year. It still wasn't the right tool for the job they were trying to do, though.

They could have replaced their brokers with simple worker queues and fed the raw data and results into a db. That was all they needed. The fear was that when they got MASSIVE, that architecture wouldn't scale for them. In reality it absolutely would, but like you said, the blogs told them otherwise.

Kafka is fine, using it for the wrong thing isn't.


If I had a dollar for every project that couldn't make 1,000 users happy because they were too worried about making a million happy.


No, you said it was a mistake to use Kafka as a database. It's been pointed out that a key member of the Kafka trio, Jay Kreps, is on record that you can definitely use it as a source of truth. The NYTimes example is also one of a competent team that uses Kafka as a source of truth.


Fair enough, I should have been more explicit in saying that the team was using it in a weird way for their own purposes. Not only was the use case wrong for them, but the technology wasn't well understood by anyone either. The implementation was partially nonsense.

I'm not criticizing Kafka so much as the way I've seen it used in software. It's largely down to user error, and like I said, Kafka is good software. I had to learn a lot about it for that project and it was a lot of fun. It was my first deep dive into message brokers and I really loved it, so I mean no disrespect to the software or its maintainers.


For the record, I don't think any piece of software is beyond criticism, including Kafka. That said, I frankly have no idea why this sub-thread has meandered into the topic of Kafka the software! :)

You could replace Kafka with Pulsar (or DistributedLog for that matter) in my comments above and it would stand.

Is using a distributed log as the WAL component of an event-sourcing or distributed database entirely a bad idea? I am of the opinion that it is a viable but sophisticated approach.

Peace out.


They're not criticising Kafka, they're criticising misuse of it in use cases it's not designed for.


> Imagine, thinking you want a message bus as your primary database.

I've built (and sold) an entire company around this architecture, and am working on the next. It can be incredibly powerful, and is what you _have_ to do if you want to mitigate the significant downsides of a Lambda architecture. We successfully used it to power graph building, reporting [1], and data exports, all in real-time and at scale.

But, we didn't use Kafka for two key reasons (as evaluated back in 2014):

* We wanted vanilla files in cloud storage to be the source-of-truth for historical log segments.

This was really important, because it makes the data _accessible_. Applications that don't require streaming (Spark, Snowflake, etc) can trivially consume the log as a hierarchy of JSON/protobuf/etc files in cloud storage. Plus, who the heck wants to manage disk allocations these days?

* We wanted read/write IO isolation of logs.

This architecture is only useful if new applications can fearlessly back-fill at scale without putting your incoming writes at risk. Cloud storage is a great fit here: brokers keep an index and re-direct streaming readers to the next chunks of data, without moving any bytes themselves. Cloud read IOPs also scale elastically with stored data.

Operational concerns were certainly a consideration: we were building in Go, targeting CoreOS and then Kubernetes, had no interest in Zookeeper, wanted broker instances to be ephemeral & disposable (cattle, not pets), etc, but those two requirements are pretty tough to bolt on after the fact.

The result, Gazette, is now OSS (https://gazette.dev) and has blossomed to provide a bunch of other benefits, but serves those core needs really well.

[1] https://github.com/liveramp/factable


I worked on a project recently which had Kafka in it. Our first order of business was to completely remove it. I would bet good money that many projects on Kafka would benefit tremendously from simply removing it. It's wildly unnecessary in 7/10 situations I see it employed.


What was it replaced with?


Not the guy who posted above, but I think I have relevant experience. After going through different options for message queuing, we decided that using S3 and object storage would be our best option for many reasons. Some of those reasons:

- durable

- cheap

- scalable

- we do not need to read messages in a strict order

- using S3's API is super easy

- no operations overhead

- no compute resources (for the queue part)

After doing it for some time, I think I was drinking too much Kool-Aid with streaming services like Kinesis and Kafka and failed to see how easy it is to implement a simple analytics event service using purely S3.
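
For what it's worth, a rough sketch of what that looks like with boto3 (the bucket name and key layout are made up, not production code):

    import json, time, uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-analytics-events"  # hypothetical bucket

    def publish(event: dict) -> None:
        # Producers just PUT one object per event under a time-based prefix.
        key = f"events/{time.strftime('%Y/%m/%d/%H')}/{uuid.uuid4()}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode())

    def consume(prefix: str):
        # Consumers periodically LIST a prefix and process whatever is there;
        # good enough when strict ordering isn't required, as noted above.
        resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
        for obj in resp.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            yield json.loads(body)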


I’m curious what makes Kafka get picked in the first place. Perhaps a vast majority of developers are growing up using Kafka for everything in their learning projects, as it makes rapid development convenient? Maybe they learn about it through blogs, video tutorials and so on, at the cost of other technologies?

For instance, my nephew was heaping praises on Firebase and would use it for all his projects. It took me a while to understand that the convenience of using it masked some of the issues that would only get exposed at scale.

For whatever reason a vast number of developers are entering the workforce believing in Kafka’s capability to solve all problems. And going against that majority will be very tiring and even costly.


Similar story here. One startup's engineer explained to me how they used Kafka (0.8-ish) as their transactional database for financial transactions. When I explained to him that Kafka is (or at least was) not transactional, he did not believe me.


I worked for a company that made the same mistake but with AWS Kinesis instead of Kafka.

So, much more costly mistake.


A quibble I have with the term "stream-table duality" is that it's not true duality.

You can construct a state (table) from a stream, but you cannot do the reverse. You cannot deconstruct a table into its original stream because the WAL / CDC information is lost -- you can only create a new stream from the table. This means you lose all ability to retrieve previous states by replaying the stream. Information is lost.

Duality in math is an overloaded term but it generally means you can go in either direction. This is not true here.
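
A tiny illustration of the asymmetry (plain Python; the keyed events are made up): folding the stream into a table is trivial, but the table alone can't give the stream back.

    events = [("alice", 1), ("bob", 5), ("bob", 7)]  # a keyed stream

    table = {}                     # the "table": latest value per key
    for key, value in events:
        table[key] = value         # each event becomes an upsert

    print(table)                   # {'alice': 1, 'bob': 7}
    # The intermediate ("bob", 5) is gone: you can derive a new changelog
    # from the table, but not recover the original stream.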


It depends on your data schema. If you store the data in a temporal form (every row has a transaction period), you can reconstruct a stream with ORDER BY LOWER(transaction_period).

One benefit of using the table as a source of truth this way is that you can run migrations on the entire “stream” if you will, and republish the events in order, in the new schema definition.

Another benefit is that you can run SQL queries against your entire transaction history. Some (imo sketchy) tools exist that try to give you a SQL interface to Kafka, but that’s just bad. It’s much easier to have your cake and eat it in postgres than it is in Kafka, imo. Unless your data volume is absurd, in which case Kafka is your only realistic choice.

Of course, in a “literal” sense you’re correct, it’s not possible. But from a practical point of view, if your bitemporal data implementation is stable, you can totally depend on your ability to produce streams from a database. We do this at my current place of work all the time.
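
Roughly, the reconstruction looks like this (PostgreSQL-flavoured SQL via psycopg2; the table, columns, and publish_event helper are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # Each row carries its transaction period, so ordering by the period's
        # lower bound replays the table's history as an ordered event stream.
        cur.execute("""
            SELECT account_id, balance, lower(transaction_period) AS happened_at
            FROM account_history
            ORDER BY lower(transaction_period)
        """)
        for account_id, balance, happened_at in cur:
            publish_event(account_id, balance, happened_at)  # hypothetical producer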


Just came here to post this comment. I feel like they must've known this when creating the terminology, so I can only imagine they called it "duality" because they wanted it to sound cool. But it's most definitely not duality, and I would wager calling it that is actively harmful. The last thing you want is someone implementing a table under the mistaken impression that it can do everything a stream can!


This Confluent blog is a rehash of an earlier conceptual position statement by Jay Kreps, who is now the boss of the Confluent blogger.

https://engineering.linkedin.com/distributed-systems/log-wha...

Pointing out (GP and yourself) that this is an 'asymmetric duality' is informative and valid.


Thank you -- it's food for thought.

A symmetric dual is one where there is equivalence: the dual of the dual is the primal.

But it is possible to have a "weak" duality that is asymmetric, where the dual encompasses the primal -- where there are regions of equivalence instead of a 1-to-1 equivalence.

I'm not sure if stream-table duality fits this conceptual mold exactly, but thank you for providing a counterpoint to my assertion. I'm going to think on it.


If one computes in a bounded time domain, then it is symmetric: a table is a ‘time slice’ of a stream.


> so I can only imagine they called it "duality" because they wanted it to sound cool

Also, your believing it is the most profitable mindset - for them. Company blogs are marketing, first and foremost.


I read it as “there exists an event schema and some database table for which they can be co-constructed”, not “every database table can be turned into a stream of events.”

I don’t have a proof, but for things like state transitions, I’ve often solved this with a <new_state>_at column. Seeing that field populated would indicate that a second event needs to be emitted. (Assumes no cycles in the state machine.)

Often that table is not the one you have, even if (perhaps especially if) you emitted events for every update — seems like some entropy-like force pushes the universe toward emitting details in events that aren’t reflected in the table records, over a long enough timeline.


Couldn’t they just make the stored data bitemporal?


SQL temporal tables?

Difference algebra over the deltas could find some interesting relevance.


I've heard Haskell programmers talk about Data and Codata, and came away with the loose sense that Data might be like state and Codata might be like a stream.

That one would so confidently assert that "Codata is the categorical dual of Data" has me thinking there might be something here?

Anyone want to tell me how far off I am?


Data and codata are dual, but that doesn't mean they are the inverse of each other. It means we can formalize them both in category theory and they have the same formalization with reversed arrows.

More usefully, data is a finite data type and codata is an infinite data type.

Some examples:

    sum = foldr (+) 0
    sum [0,1,2,3]
    > 6
    sum [0..]
    ...loops forever...

    sum' = scanl (+) 0
    sum' [0,1,2,3]
    > [0, 0, 1, 3, 6]
    sum' [0..]
    > [0, 0, 1, 3, 6, 10...
Haskell doesn't actually distinguish between finite and infinite data. Some functions work fine on infinite data, some functions loop forever.

Induction describes how to process data without diverging - process each input element in a finite time.

Co-induction describes how to process codata without diverging - produce each output element in a finite time.

This is useful if you want to prove things about infinite processes like operating systems.


> This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today.

No. The notion of a "stream-table duality" is a powerful concept that I've found can change for the better the way any engineer thinks about how they are storing / retrieving data (it's an old idea, rooted in Domain-Driven Design, but for some reason a lot of engineers, myself included, still need to learn, or relearn, it).

The misleading part, at least for now, is the notion that you should rely on a stream as the primary data persistence abstraction or mechanism in a production system. I'd argue Kafka pushes us in a direction that makes progress along that dimension, and you can apply it successfully with a lot of effort. But to match what you can get from a more traditional DBMS? The tech just isn't there (yet).


We've been relying on streams since the advent of write-ahead logs, but the complexity of streaming state transformations was re-formed into something that is easier to reason about. The most common form is a relational DB, which is generally users asking questions against snapshots of the WAL. An RDBMS is a highly opinionated layer of abstraction on top of data streams (in the form of inserts and updates).


Thanks, this is the take I was looking for.

I've definitely seen table-focused projects that really struggle because they need to be thinking in stream-focused terms. That's especially true in the age of mobile hardware. E.g., if you want to take your web-focused app and have it run on mobile devices with imperfect connectivity so people in the field can use it, table-only thinking is a pain. It's much easier to reconcile event streams than tables.

Sure you can't expect a stream-oriented tool to do everything an SQL database does. But the reverse is true as well. If we threw out every tool because people unthinkingly used it for the wrong job, we wouldn't have many tools left.


I always found it ironic that you get most of this for free if you design your SQL updates and save/query the transaction log and/or history. A lot of relational DBs have functionality for that.

And if you don't want to use that, there are also products for this specifically, such as Event Store.


That’s true, but you have to interpret the state changes from your table/row schema. Why not just update the DB and then publish an event?


Because things can and will fail in between updating the DB and publishing the event, leading to inconsistency between your database and Kafka. A better approach would be to update the DB and then use something like Debezium or Maxwell to pull the updates from the DB's binlog and publish them to Kafka.


> update the DB and then publish an event

It can make sense to do both, so as to de-couple internal table structures from externally exposed events. Debezium supports you safely doing this via the transactional outbox pattern [1]. Events are inserts into an outbox table in the same database as the actual business data, both updated in one transaction. Debezium can then be used to capture the outbox events from the logs and send the events to Kafka. To me this hits the sweet spot for many use cases, as you avoid the potential inconsistencies of the dual writes which you mention, while still not exposing internal structures to external consumers (for other use cases, like updating a cache or search index, CDC-ing the actual tables may on the contrary be preferable).
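
For readers unfamiliar with the pattern, a minimal sketch of the write side (psycopg2; the tables, columns, and values are illustrative, not a required schema):

    import json, uuid
    import psycopg2

    conn = psycopg2.connect("dbname=shop")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        order_id = str(uuid.uuid4())
        # 1. The actual business data.
        cur.execute(
            "INSERT INTO purchase_orders (id, customer_id, total) VALUES (%s, %s, %s)",
            (order_id, 42, 99.90))
        # 2. The event for external consumers, written in the SAME transaction.
        cur.execute(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload) "
            "VALUES (%s, %s, %s, %s, %s)",
            (str(uuid.uuid4()), "order", order_id, "OrderCreated",
             json.dumps({"id": order_id, "customerId": 42, "total": 99.90})))
    # No direct Kafka producer call here: the CDC connector tails the log and
    # publishes the outbox rows, so DB state and published events can't diverge.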

Disclaimer: I work on Debezium

[1] https://debezium.io/documentation/reference/configuration/ou...


Thanks for your work on Debezium! CDC to Kafka has been a game changer for our data pipelines; it's so much easier to make generalized tooling at this level. Our old pipeline requires either immutable rows with autoincrement primary keys or last-modified dates with a secondary index on every table. It doesn't capture all the changes - for example, if a row is modified twice before the pipeline sees it, the first update won't be recorded in the history - and it doesn't propagate deletes. Debezium solves both of those issues, plus it doesn't require the application developers to do anything special for their data to get into the analytics systems. Now, instead of sounding like a broken record telling people to add last-modified dates to all their tables, I can sound like a broken record telling people to stop violating single source of truth by dual writing (which is a much better problem to have, as before I had given up on single source of truth even being possible in a large organization).


That's awesome, very happy to hear! Drop by any time on the mailing list or chat if you got any issues or questions to discuss around Debezium.


This is a neat pattern, but it honestly just seems like you’re trading where you want the complexity. The outbox pattern would force devs to not use SQL directly for mutations - in so much as they can’t directly modify records like they normally would. I can see this being ok if you had someone with strong DBA skills implementing the abstraction.


Devs modify records just as they normally would. Only, in addition, they produce a record in the outbox table, representing the change to the outside world. I.e. instead of writing to the DB and Kafka, both changes are done to the database, and then events from the outbox table are CDC-ed to Kafka.


We do this now and make publishing to Kafka a requirement of transaction success, rolling back the DB transaction if publishing fails. The dual write is much more straightforward in my opinion and the pattern works as long as the primary database supports transactions or some other form of consistency guarantee.

I still can’t get behind using CDC via a DB watching tool unless I have little control of the system writing to the DB. There is so much context lost interpreting the change at the DB (who was the user that made this change, etc.) — unless you are sending this data to the DB (then yikes your DB schema is much more complex).


> rolling back the DB transaction if publishing fails.

If your DB transaction fails after the Kafka message has been sent, you'll end up with inconsistencies between your local database state and what has been published externally via Kafka. That's exactly the kind of dual write issues the outbox pattern suggested in my other comment avoids.

> unless you are sending this data to the DB

You actually can do this, and still keep this metadata separately from your domain tables. The idea is to add this metadata into a separate table, keyed by transaction id, and then use downstream stream processing to enrich the actual change events with that metadata. I've blogged about an approach for this a while ago here: https://debezium.io/blog/2019/10/01/audit-logs-with-change-d....


We don’t use auto commits with Kafka so it’s not a problem for us. Of course, this does reduce our publishing throughput - though that’s not a problem for us at our current scale.

Either way, I’m intrigued by the outbox pattern. As long as it is transparent to developers, it does seem like an ideal CDC mechanism.


> I always found it ironic that you get most of this for free if you design your SQL updates and save/query the transaction log and/or history. A lot of relational DBs have functionality for that

Being able to elegantly and efficiently do "AS OF" queries is still a hard problem; I don't see that KTables really solve it. Hate to say it, but Oracle's solution to this is still unmatched. https://oracle-base.com/articles/10g/flashback-query-10g#fla...


I have recently analyzed Kafka as a message bus solution for our team. My take is that Kafka is a very mature and robust message system with clearly defined complexity, and thus, when used correctly, very dependable. The recent versions even have online rebalancing (but don't tell that to the Pulsar people). However, "Kafka Streams" just makes me shiver - I don't understand why someone would masquerade a project directly competing with other processing frameworks (e.g. Apache Flink or Spark) as being part of Kafka itself, when apart from utilizing Kafka's Producer and Consumer interfaces it really has nothing else in common with it. Streams is a giant, immature, complexity-hiding mess which, as a responsible systems designer, you have to run away from really, really fast.


The point of that is for people trying to figure out why streams are a useful abstraction at all: what's needed to make them useful is some sort of aggregation, and of course tabular state is a common end point.

The article does not recommend writing this code yourself, it shows how to aggregate data into usable forms.

So I think your concerns may be a bit overblown. If you think that ksqlDB or Kafka Streams, the tools shown in this blog post, are at risk for what you warn, this comment would be a valid criticism. But it's clear that the article isn't advocating for people to write their own versions of that...


Yes, they're definitely at risk. ksqlDB does not appear to have transactions at all.


The transaction boundary is the message/event itself (in the case of an event sourced system, which is what this would primarily be used with). Everything required to understand some event is contained within the payload. It is the atomic unit in this type of system.


I worked on a project that used CQRS and Event Sourcing years ago. It was an unmitigated disaster, never made it out of prototype phase.

By using Kafka (or anything else) as a "commit log" you've just resigned yourself to building the rest of the database around it. In a real RDBMS the commit log is just a small piece of maintaining consistency.

Every time I've worked on a project where we handle our own "projections" (database tables) we ended up mired deep in the consistency concerns normally handled by our database.

What's so hard? Compaction is an evil problem. We figured we could "handle it later". Well, it turns out maintaining an un-compactable commit log of every event, even for testing data, consumes many terabytes. This and "replays" sank the project indirectly.

Replays. The ES crowd pretends this is easy. It's not. If you have 3 months of events to replay every time you make certain schema changes, you are screwed. Even if your application can handle 100x normal load, it's going to take you over a day to restore using a replay. This also happened to us. Every time an instance of anything goes down, it has to replay all the events it cares about. If you have an outage of more than a few instances, the thundering herd will take down your entire system no matter how much capacity you have. With a normal system, the DB always has the latest version of all the "projections" unless you're restoring from a backup, and then it only has to replay a partial commit log, and only once.

Data duplication. Turns out "projecting" everything everywhere is way worse on memory and performance than having the DB that's handling your commit log also store it all in one place. Who knew.

Data corruption. If you get one bad event somewhere in your commit log you are cooked. This happens all the time from various bugs. We resorted to manually deleting these events, negating all the supposed advantages of an immutable commit log. We would have been fine if we let the db handle the commit log events.

Questionable value of replays. You go through a ton of BS to get a system that can rewind to any point in time. When did we use this functionality? Never. We never found a use case. When we abandoned ES we added Hibernate auditing to some tables. It's relatively painless and transparent, and it handled all of our use cases for replays.


> This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today.

I disagree. Kafka is a log. That's half of a database. The other half is something that can checkpoint state. In analytics that's often a data warehouse. They are connected by the notion that the data warehouse is a view of the state of the log. This is one of a small number of fundamental concepts that come up again and again in data management.

Since the beginning of the year I've talked to something like a dozen customers of my company who use exactly this pattern to implement large and very successful analytic systems. Practical example: you screw up aggregation in the data warehouse due to a system failure. What to do? Wipe aggregates from the last day, reset topic offsets, and replay from Kafka. This works because the app was designed to use Kafka as a replayable, external log. In analytic applications, the "database" is bigger than any single component.
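
A rough sketch of that replay step (kafka-python; the topic, partition, and rebuild_aggregate function are made up):

    import time
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    tp = TopicPartition("aggregate-input", 0)      # hypothetical topic
    consumer.assign([tp])

    # Find the offset closest to ~24h ago and rewind to it.
    day_ago_ms = int((time.time() - 86400) * 1000)
    offsets = consumer.offsets_for_times({tp: day_ago_ms})
    consumer.seek(tp, offsets[tp].offset)

    for msg in consumer:                           # re-run the aggregation
        rebuild_aggregate(msg.value)               # hypothetical function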

I agree there's a problem with the stream-table duality idea, but it's more that people don't understand the concept that's behind it. The same applies to transactions, another deceptively simple idea that underpins scalable systems. "What is a transaction and why are they useful?" has been my go-to interview question to test developers' technical knowledge for over 25 years. I'm still struck by how many people cannot answer it.


> However, during that process you will eventually confront every single hard problem that relational database management systems have faced for decades. You will more or less have to write a full-fledged DBMS in your application code. And you will probably not do a great job, and will end up with dirty reads, phantoms, and all the other symptoms of a buggy database.

It's the same as moving from a framework to a library. At first, the framework seems great, it solves all your problems for you. But then as soon as you want to do something slightly different from the way your framework wanted you to do it, you're stuffed. Much better to use libraries that offer the same functionality that the framework had, but under your own control.

If you use the sledgehammer of a traditional RDBMS you are practically guaranteed to get either lost updates, deadlocks, or both. If you take the time to understand your dataflow and write the stream -> table transformation explicitly, you front-load more of the work, but you get to a database that works properly more quickly.

99% of webapps don't use database transactions properly. Indeed during the LAMP golden age the most popular database didn't even have them. The most well-designed systems I've seen built on traditional databases used them like Kafka - they didn't read-modify-write, they had append-only table structures and transformed one table to compute the next one. At which point what is the database infrastructure buying you?

If you're using a traditional database it's overwhelmingly likely that you're paying a significant performance and complexity cost for a bunch of functionality that you're not using. Given the limited state of true master-master HA in traditional databases, you're probably giving up a bunch of resilience for it as well. If Kafka had come first, everyone would see relational SQL databases as overengineered, overly-prescriptive, overhyped. There are niches where they're appropriate, but they're not a good general-purpose solution.


> If Kafka had come first, everyone would see relational SQL databases as overengineered, overly-prescriptive, overhyped. There are niches where they're appropriate, but they're not a good general-purpose solution.

Oh ho, there's some deep irony in calling a database overengineered when we're looking at Kafka.

I'm running a copy of sqlite right here, and it's pretty nice; it does queries and it's portable, it even comes as a single .c file that I can drop directly into a project. I use the same API for my mobile app.

Where's the equivalent for kafka?

No? I have to install Java, and run several independent processes (forget embedded), and configure a cluster to... um... do what? Send some messages? If I want to do more than that, I have to add my own self-implemented logic to handle everything.

Different things, you argue; but alas, that's the point. The point you made was that Kafka would 'be it' if it had come first, but... since it cannot serve the purpose of a DB in most contexts, you're just wrong.

It's a different thing, and it might be suitable for some things, but I have deployed and managed (and watched others manage) kafka clusters, and I'll tell you a secret:

Confluent sells managed kafka. They have a compelling product.

You know why? ...because maintaining a kafka cluster is hard, hard work, and since that's the product confluent sells, they're not about to fix it. Ever.

Now, if we're talking about event streaming vs. tables, there are some interesting thoughts about what a simple single-node embedded event-streaming system might look like (hint: absolutely nothing like Kafka), and what an alternate history where that became popular might have been...

...but Kafka, this Kafka, would never have filled that role, and even now it's highly dubious that it fills that role in most cases - even if we acknowledge that event streaming could fill that role if it were implemented in a different system.


> Different things, you argue; but alas, that's the point. The point you made was that Kafka would 'be it' if it had come first, but... since it cannot serve the purpose of a DB in most contexts, you're just wrong.

SQLite can't serve the purpose of a db in most contexts either; not only can it not be accessed concurrently from multiple hosts, it can't safely be accessed from different hosts at all. It can't even concurrently be accessed from multiple processes on the same host. If we both took a list of general-purpose database use cases, and raced to implement them via Kafka or via SQLite, I guarantee the Kafka person would finish a lot quicker.

> You know why? ...because maintaining a kafka cluster is hard, hard work, and since that's the product confluent sells, they're not about to fix it. Ever.

Maintaining a Kafka cluster sucks. So does maintaining a Galera cluster (I've done both), and I'll bet the same is true for any other proper HA (master-master) datastore.

Comparing Kafka against "SQL databases" is a category error, sure. But in the big use case that everyone talks about, that all the big-name SQL databases focus on - as a backend for a network service that runs continuously where you don't want a single point of failure - those big-name SQL databases are overengineered and underwhelming compared to Kafka, and in most other use cases an SQL database is overengineered compared to a (notional) similarly-designed event-sourcing system for use with the same sort of problems.


Sqlite is quite literally the most popular piece of software in the world. I find it a bit ironic to say that it's not fit for most purposes.


JQuery is similarly popular. Does that mean it's fit for most purposes?


I have run several production web apps on SQLite; it's fine unless you have enough traffic that you really need concurrent requests. If you do happen to get two requests at exactly the same time, one has to wait; this happens in "real" RDBMS systems under lock conditions anyway. I think you'd be surprised how much traffic it can handle.

A huge advantage is that backing up and restoring from backup with SQLite is trivially implemented with scp in a bash script. Ever tried to set up database restore from backup (that works) in AWS?

There are some real wins from using sqlite in prod.


> I have run several production web apps on SQLite; it's fine unless you have enough traffic that you really need concurrent requests. If you do happen to get two requests at exactly the same time, one has to wait; this happens in "real" RDBMS systems under lock conditions anyway. I think you'd be surprised how much traffic it can handle.

I've run apps that were written like that as well (though not by me). What you say is true, as far as it goes, but those apps have no business using an SQL database: they don't need ad-hoc queries (and usually don't offer the possibility of doing them), they don't need the SQL table model (and usually have to contort themselves to fit it), they don't need any but the crudest transactions... frankly in my experience all the developer needed was some way to blat structured data onto disk, and they reached for SQLite because it was familiar rather than because they'd carefully considered their options.

> A huge advantage is that backing up and restoring from backup with SQLite is trivially implemented with scp in a bash script. Ever tried to set up database restore from backup (that works) in AWS?

You're not comparing like with like though: if you're willing to stop all your server processes and shut down your database to do the backup and restore (which is what you have to do with sqlite, you just don't notice because it's something you do all the time anyway) then it's easy to back up practically any datastore.


It seems funny to compare Kafka to SQL for an OLTP use case. The main benefits of SQL are indexes and transactions.

If you don't need them, you might as well use a filesystem directly, but if you do need them, there is not much alternative :-)


Filesystems are a pain in various ways. The only decent HA option is AFS and it has performance issues and general awkwardness. In a lot of programming languages the filesystem APIs are surprisingly bad. And it's still really easy to have lost updates etc.


If you care about lost updates, you should use transactions, or at least an API that supports optimistic concurrency control.

The POSIX filesystem API doesn't offer much for concurrency control other than locks.

In my line of work we use HDFS or AWS S3 as the "filesystem", so we are not using the OS file access API at all, and most files are append-only, written by a single server.


I don't understand - you suggested using the filesystem, but filesystems don't have good concurrency control. S3's concurrency handling is even worse than filesystems. Something like Kafka is a much better way to do a stream of commands that need to be transformed to eventually produce some kind of result (which is really almost all workloads).


No, this was a misunderstanding. I said that if you need concurrency control/transactions and indexes, use a real database, not a filesystem, since that's the whole reason databases exist.

But if you don't need them, there are plenty of good distributed file systems that will not eat your data.


> No, this was a misunderstanding. I said that if you need concurrency control/transactions and indexes, use a real database, not a filesystem, since that's the whole reason databases exist.

Just because something is designed for x doesn't mean it's any good at x. Kafka is also designed for handling concurrent events, and it's significantly better at it.

> But if you don't need them, there are plenty of good distributed file systems that will not eat your data.

There really aren't any distributed filesystems that are nice to work with. S3 isn't a full filesystem and your writes take a while before they're visible in your reads, HDFS isn't a full filesystem and can be slow to failover, AFS has generally weird performance and locks up every so often. If your needs are simple then anything will work, including distributed filesystems, but it's very rare that a distributed filesystem is an actively good choice, IME.


I disagree pretty strongly with this. Things like indexes, uniqueness constraints, foreign keys, basically data schemas, are hugely useful. There was a big movement for a while in web dev to use Mongo instead of "the sledgehammer of an RDBMS"; people learned over time that actually they wanted a database, because you run into so many issues with querying and data consistency, or maybe you want to do a join. All of these things become huge problems without a database. An eventually consistent log of data is not a replacement for a database. Can you imagine designing a simple CRUD app with a couple of tables and different products, but now doing it with Kafka as your backing store? You've added so much more complexity to your application because you didn't want to use a database.


> Things like indexes, uniqueness constraints, foreign keys, basically data schemas, are hugely useful. There was a big movement for a while in web dev to use Mongo instead of "the sledgehammer of an RDBMS"; people learned over time that actually they wanted a database, because you run into so many issues with querying and data consistency, or maybe you want to do a join.

Schemas are a good idea, but SQL is bad at expressing them. Uniqueness constraints and foreign keys seem like a good idea until you think about how you want to handle them; often a stream transformation gives you better options. Unindexed joins and indexed joins are very different, and it's much easier to understand the difference in a Kafka-like environment than in an SQL database.

> Can you imagine designing a simple CRUD app with a couple of tables and different products, but now doing it with Kafka as your backing store?

Yes, that's exactly what I'm advocating. It's clearer and simpler and avoids a bunch of pitfalls - no lost updates, no deadlocks, your datastore naturally has decent failover.

> You’ve added so much more complexity to your application because you didn’t want to use a database.

On the contrary, if you use a database you add a huge amount of complexity that you don't want or need. You have to manage transactions even though you can't actually gain any of their benefits (if two users try to edit the same item they're going to race each other and one's update is going to get lost). You have to flatten your values into 2D tables and define mappings with join tables etc., rather than just serialising and deserialising in a standard way. Bringing in new indices in parallel is flaky, as is any change to data format. And getting proper failover set up is an absolute nightmare, to the point that a lot of people will set up failover for the webserver but leave the database server as a single point of failure because it's too hard to do anything else.


When you want to support random writes and random reads you don't need any "query" language. You need a dumb and performant key value store which is optimized for these and is easy to distribute to multiple machines.

Querying can be ad hoc - i.e. you don't know what your query will be - and in that case you want to offload your data to an OLAP system, which will perform about 1 or 2 orders of magnitude better than your typical row-oriented MySQL database. Moving the data out of your primary database will prevent deadlocks and keep you from shooting yourself in the foot.

If you know the query beforehand you simply compute and update a materialized index with results of your query.
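
A toy example of that "known query, precomputed answer" idea (plain Python; the event shape and the query are made up):

    from collections import defaultdict

    signup_counts = defaultdict(int)        # the precomputed "index"

    def on_event(event: dict) -> None:
        # Update the materialized result as each event arrives, instead of
        # running an ad-hoc GROUP BY later.
        if event.get("type") == "signup":
            signup_counts[event["country"]] += 1

    for e in [{"type": "signup", "country": "DE"},
              {"type": "signup", "country": "US"},
              {"type": "signup", "country": "DE"}]:
        on_event(e)

    print(dict(signup_counts))              # {'DE': 2, 'US': 1}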

Building systems in such a decoupled way is conceptually and implementationally much simpler than your typical overcomplicated SQL database with unpredictable execution plans. Those databases are good only if 1. you don't know what you are doing, or 2. you need some quick and dirty system to work without much thinking and you are fine with it never scaling up. Many small CRUD e-commerce apps are like that, and these databases get used because that is what your typical medium.com article recommends as best practice.


While I agree with all you said, materialized indexes are not enough; you also need transactions that span more than one entity.

You can't simply put every entity you might need to update atomically into the same key/document.


As long as you have a clear dataflow you don't need transactions, you just need to make sure there are no loops in your dependencies. And you have to do that with transactions as well, otherwise they'll deadlock - the difference is that if you make a mistake you get to find out in production rather than when designing.


I would be very curious to know how someone could design an airline Computer reservation system this way.

I agree that when using a SQL database that uses locks instead of snapshot isolation, you need to be careful about the order in which you scan rows and join tables to reduce the frequency of deadlocks.

My experience using PostgreSQL is that deadlocks are extremely rare, and when they happen they are automatically detected by the database; the client simply aborts and retries the transaction when this happens.


> I would be very curious to know how someone could design an airline Computer reservation system this way.

You make each operation an event, you make all your logic into stream transformations, and you make sure the data only ever flows through in one direction (and you have to either avoid "diamonds" in your dataflow, or make sure that the consequences of a single event always commute with each other).

So at its simplest, someone trying to make a reservation would commit a "reservation request" event and then listen for a "reservation confirmed" event (technically async, but it's fast enough that you can handle e.g. web requests this way - indeed even with a traditional database people recommend using async drivers), and on the other end there's a stream transformation that turns reservation requests (and plane scheduling events and so on) into reservation confirmations/denials. Since events in (the same partition of) a single stream are guaranteed to have a well-defined order, if there are two reservation requests for the last seat on a given plane then one of them will be confirmed and the other will be rejected because the plane is full.
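
A hedged sketch of that single transformation (kafka-python; topic names, message shapes, and seat counts are made up):

    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("reservation-requests",
                             bootstrap_servers="localhost:9092",
                             value_deserializer=lambda b: json.loads(b))
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=lambda v: json.dumps(v).encode())

    seats_left = {"UA123": 1}  # in-memory state; one partition per flight assumed

    # Requests for a given flight arrive in one well-defined order, so the
    # last seat can only ever be granted to one of two competing requests.
    for msg in consumer:
        req = msg.value                        # e.g. {"flight": ..., "passenger": ...}
        if seats_left.get(req["flight"], 0) > 0:
            seats_left[req["flight"]] -= 1
            status = "confirmed"
        else:
            status = "rejected"
        producer.send("reservation-results",
                      key=req["passenger"].encode(),
                      value={"flight": req["flight"], "status": status})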

> My experience using PostgreSQL is that deadlocks are extremely rare, and when they happen they are automatically detected by the database; the client simply aborts and retries the transaction when this happens.

That's no good. If you do that then you haven't solved the problem and you're guaranteed to get a full lockup as your traffic scales up. You have to actually understand which operations have caused the deadlock and rearrange your dataflow to avoid them. (Not to mention that if it's possible for the client to "simply abort and retry" then you've already proven that you didn't really need traditional database-style synchronous transactions).


100% agree, and that's why Google Spanner, CockroachDB, and all the NewSQL databases exist.

People found that they missed the abstractions that SQL and transactions offer, as they simplify your code a lot. They simply needed something that scales better than doing manual sharding on top of MySQL :-)


It's worth noting that most managed Kafka solutions offer a service which will transform a stream into a database table so you can query it later. We've also built our own tooling to do it. We read from the Kafka stream directly when we're building a pipeline between microservices; for almost everything else we query the database. It works pretty well and we haven't found ourselves wishing we could write pseudo-queries on streams yet.


At the moment you more or less need to write a DBMS in your app code, but I don't think that's the end state. I think we're seeing the beginnings of something big - it just might not seem like it yet because it's the v1 / nowhere-near-complete version. I think having all your data in a single system (Kafka, ksqlDB, ...) that allows you to work with it in cross-paradigm ways will turn out to be very compelling.


> At the moment you more or less need to write a DBMS in your app code

Since we're discussing misunderstandings in the community, it should be pointed out that a Database Management System (DBMS) is not merely a database, to say nothing of a data store. Oracle, Postgres, et al are genuine "DBMS". ATM you are very likely putting together a data store in your app code.


You're conflating DBMS with RDBMS.

I interpreted the OP as DBMS=database, which absolutely includes application code that stores and retrieves data in proprietary formats.

Even a linked list mmap'd to disk can be a database, just maybe not a very good one.


An RDBMS is simply a Relational Database Management System. A quick tour of CS history reveals ancient curiosities such as hierarchical database management systems. And there is more:

https://www.studytonight.com/dbms/database-model.php


There are many kinds of databases. Graph, key/value, relational, hierarchal etc. That doesn't change the fact that any app that writes code to store and retrieve data is creating their own database or using someone else's.


There are many kinds of data models. A DBMS is a DBMS, and typically a specific DBMS supports a specific data model. A log structured file is at most a data store.

To be fair, this is a somewhat fuzzy categorization (DBMS, DB, data store) and it can cause confusion. A DBMS is a system, just like it says right on the acronym tin: a system that provides capabilities such as DDL, DML, auxiliary processes, etc. to manage a database. There are various data models for databases, e.g. a triple store, but the data model X is orthogonal to the XDBMS.


Meteor did some pretty interesting work with this by tailing mongodb’s oplog. Definitely still room for innovation.


>You will more or less have to write a full-fledged DBMS in your application code.

Yes, but only if your data is highly relational, and needs to be operated on in that way (e.g. transactionally across records, needing joins to do anything useful, querying by secondary indices, etc.). Depending on your domain, this could be a large amount of your data, or not. Just like anything, Kafka and Kafka Streams are tools that can be used and misused.

There is perhaps an exciting future where something like ksqlDB makes even highly relational data sit well in Kafka, but we're definitely not there yet, and I'm not terribly confident about getting there.


> Yes, but only if your data is highly relational, and needs to be operated on in that way (e.g. transactionally across records, needing joins to do anything useful, querying by secondary indices, etc.).

Ah, you mean that if the data is in a normalized form. Relational data means, basically, data in tabular form. Tabular data can be normalized or not.

> Depending on your domain, this could be a large amount of your data, or not.

Assuming you are talking about data in normalized form, that's orthogonal to the domain. That's in fact the major breakthrough of relational databases–that data across domains can be modelled as relational and normalized.

Now, in the relatively rarer cases that you don't care about normalizing data, a stream can be useful. But in the vast majority of cases IMHO, we should be looking at a proper relational database like Postgres before thinking about streams.


I agree that _normalized_ is a more precise term for what I'm describing, I only used relational because contemporary discourse seems to favor that term in these kinds of discussions and seems to imply a high degree of normalization.

>Assuming you are talking about data in normalized form, that's orthogonal to the domain.

I don't agree with this though. Nothing about a software architecture is _ever_ orthogonal to the domain.


I work on a legacy CouchDB application and the same problem exists in that area. CouchDB is actually really amazing and is wonderful for some types of applications. It really sucks if you want a relational DB.

Keep in mind that I don't know anything about Kafka, but it seems like the same kind of thinking. You can think about your data as a log of events. Since you can always build the current state from the log of events, then you might as well just save the log. This works incredibly well when you treat your data as immutable and need to be able to see all of your history, or you want to be able to revert data to previous states. It also solves all sorts of problems when you want to do things in a distributed fashion.

The application which fits this model well and that almost every developer is familiar with is source control (and specifically git). If that is the kind of thing that you want to do with your data, then it can be truly wonderful. If it's not what you want to do, then it will be the lodestone that drags you underwater to your death.

The legacy application I work on is a sales system and actually it's not a bad area for this kind of data. You enter data. Modifications are essentially diffs. You can go back and forth through the DB to see how things have changed, when they were changed and often why they were changed. In many ways it is ideal for accounting data.

But as you say, if you want a full-fledged DBMS, this should not be your primary data store. It can act as the frontend for the data store -- with an eventually consistent DB that you use for querying (and you really have to be OK with that "eventually consistent" thing). But you don't want to build your entire app around it.


> Kafka is a message broker. It’s not a database and it’s not close to being a database.

Kafka is, in a sense, a database, even if specialized for logs of events. For a similar discussion, I would refer to Jim Gray's "Queues Are Databases": https://arxiv.org/pdf/cs/0701158.pdf.


I fully agree that this "stream-table duality" is misleading or at least quite vague.

The diagram juxtaposing a chess board (the table) and a sequence of moves (the stream) seems enlightening but actually hides several issues.

* An initial state is required to turn a stream into a table.

* A set of rules is required to reject any move which would break the integrity constraints.

* In the reverse direction, a table value cannot be turned into a stream. We need a history of table values for that.

Hence, the duality, or more accurately the correspondence, is not between stream and table but between constrained-stream and table history.

Even if these issues are not made explicit, the Confluent articles and the Kafka semantics are correct.

* "A table provides mutable data". This highlights that stream-table-duality is relative to a sequence of table states.

* "Another event with key bob arrives" is translated into an update ... to enforce an implicit uniqueness constraint on the key for the tables.


Reminds me of that time when database vendors were overreaching to be message queues/brokers - OracleAQ, MSMQ, etc.


That's why ksqlDB exists and handles all that for you, turning streams into tables that you can query.


ksql does not solve any of the hard consistency or contention problems you will face if you attempt to use Kafka as a datastore. Consider the simplest possible example: you write an "update event" to a topic and then read a ksql view of that topic. The view may or may not yet reflect the update. This is called read-after-write consistency, and you will need to create it in your application code.
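
A compilable sketch of that race, using the plain Java producer; queryView() is a hypothetical stand-in for whatever derived view (a ksqlDB pull query, a Streams state store, ...) you read from:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ReadAfterWriteRace {
        // Hypothetical: reads from whatever materialized view is derived from the topic.
        static String queryView(String key) { return null; }

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (var producer = new KafkaProducer<String, String>(props)) {
                // The update event is durable in the topic once this future completes...
                producer.send(new ProducerRecord<>("users", "bob", "{\"city\":\"Berlin\"}")).get();
            }

            // ...but the view is updated asynchronously, so this may return stale data.
            // Read-your-writes has to be built in application code, e.g. by waiting
            // until the view has consumed past the offset returned by send().
            System.out.println(queryView("bob"));
        }
    }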


It's not a real database and doesn't promise ACID though. I think it's fine with the understanding that it's an incremental eventually-consistent materialized view engine that works seamlessly with Kafka, especially if you're already a heavy user.

Otherwise loading data into a real database/data warehouse and using ETL with triggers/views/aggregations is better if you need advanced querying rather than a simple stream to table transforms.


The same thing happens if you write to an RDBMS master and then read from a read-only replica.


Allow me to submit that this obsession is... Kafka-esque? :D



Not super familiar with Kafka, but behind the scenes some RDBMS are log replay engines. Are there technical reasons to not slap a RDBMS frontend on top of a Kafka log and avoid dealing with raw streams in the application code?

Edit: From the sibling comments, the answer is https://ksqldb.io.


The technical reason is the inherently distributed and shared nature of Kafka setups makes writing an ACID-compliant frontend (basically?) impossible.

If your application assumes ACID compliance, it will fail. Also, I haven’t looked into this one as directly, but I imagine that query latency compared to an RDBMS would be quite poor.

IMO, most people reach for Kafka too soon, where an RDBMS with a proper schema that includes transaction and valid history would better suit the domain. It's possible to write very efficient and ergonomic "append-only", temporal data schemas in Postgres.


But there are a lot of usecases where ACID is not needed and we can do with eventually consistent states. Btw, you do get C and D.


>With Kafka, such a stream may record the history of your business for hundreds of years

Do not do this. Kafka is not a database! Kafka should never be the source of truth for your business. The source of truth should be in whatever consumes data from Kafka downstream when messages are committed as read. Why? Because in your middle layer you can do all the data normalization, sanity checking, processing, and interaction with a REAL database system downstream that can give you things like transactions, ACID, etc.

Of course Confluent wants you to try to use Kafka as a DB, so your usage of it is very high and you pay for the top support package and they have you by the cojones, but that doesn't mean you should do that. You will miss out on all the benefits of using a real database, and with what benefit? Having a simple client API?


So, I've been having a back and forth with a colleague on this and I'm genuinely interested in why you so strongly suggest this.

For the record, I have good real world experience with all kinds of databases (relational, NoSQL, and even legacy multivalue and hierarchical ones), and I don't see why what you say has to be "always true".

One way of looking at Kafka is that it is an unbundled transaction log, nothing more or less, so it could be used to permanently store and replay transactional activity, if one wishes. Note that even most databases don't store an immutable, permanent transaction log (as they often grow to be huge and are truncated every so often, and tables are used as the current state).

This article by Confluent seems to cover the topic (yes, recognizing it is written by the very vendor you suggest is trying to lock us in): https://www.confluent.io/blog/okay-store-data-apache-kafka/.

Ok, so how about the idea of a persistent, immutable, never-ending transaction log (uhoh, sounds like blockchain now!)? Setting aside Kafka for now, what do you think about the basic design pattern? To me it sounds a bit like it could represent a temporal database in raw transactional log form. Why not?

EDIT: After rereading your comment I see your main concern is using Kafka as a database management system (DBMS). I would agree, that's not what Kafka is for. But, I don't think Confluent intends that use case, do they? I look at it more as an unbundled single component that is very useful by itself, and is part of a more complex data platform/architecture (ex. Lambda or Kappa architecture).


> Ok, so how about the idea of a persistent, immutable, never-ending transaction log (uhoh, sounds like blockchain now!)? Setting aside Kafka for now, what do you think about the basic design pattern? To me it sounds a bit like it could represent a temporal database in raw transactional log form. Why not?

Because nobody wants to be replaying events all the time to get their actual data. They want the data to be, well, materialized. Replaying events can be helpful if you need an audit trail but the systems which need that have mostly all evolved their own audit trail techniques, e.g. double-entry accounting.

The people who invented these streaming event systems quickly realized that continually replaying events from the beginning would get absurdly expensive, so they even implemented 'checkpoint' events that snapshot the current state of the data every once in a while so that replays can start from (a hopefully recent) snapshot and finish quickly. At that point you have to encode the logic of how to roll up all events into the current state into your checkpointing code, which immediately enforces the notion of a global current state of the data, which is in fact what RDBMs solve anyway.
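
A minimal sketch of that checkpointing idea (all names and the roll-up logic here are made up, not a real Kafka API): snapshot the rolled-up state every N events so recovery only replays the tail. Note how the roll-up function is exactly the "global current state" logic an RDBMS would otherwise own.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SnapshottedLog {
        record Event(String account, long delta) {}
        record Snapshot(int nextOffset, Map<String, Long> balances) {}

        static final int SNAPSHOT_EVERY = 1000;

        final List<Event> log = new ArrayList<>();
        final List<Snapshot> snapshots = new ArrayList<>(List.of(new Snapshot(0, Map.of())));

        // Domain-specific roll-up logic: how events become "current state".
        static Map<String, Long> apply(Map<String, Long> state, Event e) {
            Map<String, Long> next = new HashMap<>(state);
            next.merge(e.account(), e.delta(), Long::sum);
            return next;
        }

        void append(Event e) {
            log.add(e);
            if (log.size() % SNAPSHOT_EVERY == 0) {
                snapshots.add(new Snapshot(log.size(), currentState())); // checkpoint
            }
        }

        // Recovery: start from the latest snapshot, replay only the tail.
        Map<String, Long> currentState() {
            Snapshot latest = snapshots.get(snapshots.size() - 1);
            Map<String, Long> state = latest.balances();
            for (Event e : log.subList(latest.nextOffset(), log.size())) {
                state = apply(state, e);
            }
            return state;
        }
    }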


RDBMSs do not generally (i.e. with a generic interface) allow access to the backing event stream, though, meaning you'll need to write this yourself when you need to sync changes across databases. There's no free lunch.


Across different databases as in Postgres-to-Maria, sure. But people who decide to do that should know what they are getting into. But across different instances of say Postgres, it is rather simple.


Debezium (debezium.io) might be interesting for you; it provides open-source change data capture for a variety of databases; together with the right Kafka Connect sink connectors (or using it via Pulsar etc.), setting up a Postgres-to-Maria data pipeline is quite easy. (Disclaimer: I work on Debezium)


I agree that you can use Kafka as a raw event log - not necessarily a transaction log unless you are basically putting records into Kafka that don't need to be transformed, which you can do but probably don't want/need to do. There are situations in which you want to replay raw events, but in most cases I think you want to replay actual transactions to your DB, so it makes more sense to log your progress as you commit using whatever consumes from Kafka rather than in Kafka itself.

My main concern is that when all you have is a hammer, or you already have a hammer lying around, everything looks like a nail. Kafka can be great as a raw event stream, and yes you can store raw events forever, but are raw events really the source of truth for your business? If your workflow is, as I think is appropriate in almost all cases, Kafka->Consumer service->DB, why do you want to rely on Kafka when you have a consumer that can have better logic and custom handling regarding how you actually interpret your events? Moreover, why keep the data in Kafka when you can just plug it into a temporal DB from your service?


To answer your last question; let’s say I have some shared state that multiple micro-services rely on and all need to fetch that state synchronously. My example is that you’re a loans company and almost everything is driven from the loan book. Reporting services need to know the active loan counts this month, invoicing services need to fetch the loan book to generate transaction lines, borrower services need to know whether the current user has any active loans.

The standard hub-and-spoke model would probably have a loan book service where all others call into it via HTTP requests, but you'll have to build a query language on top of that REST API, or share DB access I suppose.

Or in this model you would stream the loan book updates through Kafka and every consumer that needs a copy of the loan book can keep their own, query it against their own chosen materialised store in its native query language (some might just append to Mongo, or kSQL, some might always use event sourcing per-request). “select * from loans where user.id = current_user” becomes possible here rather than stitching the HTTP responses.
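
As a rough sketch of the "keep your own materialized copy" option with Kafka Streams (topic name, serdes, and store name are all assumptions; the loan book is assumed to live in a compacted topic keyed by loan id):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class ReportingService {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "reporting-service"); // each consumer keeps its own copy
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            // Materialize the loan-book topic into a local key/value store.
            StreamsBuilder builder = new StreamsBuilder();
            builder.table("loan-book",
                    Consumed.with(Serdes.String(), Serdes.String()), // key = loan id, value = JSON blob
                    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("loan-book-store"));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            // (in real code, wait for the RUNNING state before querying)

            // Query the local copy via interactive queries -- no HTTP call to a loan-book service.
            ReadOnlyKeyValueStore<String, String> store = streams.store(
                    StoreQueryParameters.fromNameAndType("loan-book-store", QueryableStoreTypes.keyValueStore()));
            System.out.println(store.get("loan-123"));
        }
    }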

Now I’m not saying one is right or wrong outright as you’re only trading a more native query interface with data consistency concerns, but this for me still feels like a valid architectural pattern and one I’d consider.


If you have a bunch of services reaching into the same data store though, you've effectively built a distributed monolith. Which if that works for you, more power to you, but you lose most of the Conway's law advantages of services with that approach imo.


I agree with this completely. In some business applications DDD only gets you so far.


In DBMS the "source of truth" is the log. Kafka can be used in exactly the same way, and in that sense, it can be the primary data store.

Whether that's useful for a given use case is certainly debatable. IIRC, the New York Times uses Kafka as the primary storage for their entire archive.


> In DBMS the "source of truth" is the log

that's only true in the (frequent) case of write-ahead logs. Some database engines may use Rollback logs, or undo logs.


A log is a log. Maybe read "I <3 Logs" by Jay Kreps, one of the founders of Confluent and creators of kafka.


In a proper Kafka setting you would use Avro and a schema registry to have defined schemas for your events and the ability to upgrade schemas automatically. For breaking changes you would use Kafka Streams to migrate the old topic into a new topic and then discard the old topic. Kafka has transactions, and with Kafka Streams you can create joined views that include those transactions.
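
The topic-migration step could look roughly like this in Kafka Streams (topic names and the upgrade function are made up; string serdes assumed for brevity):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    public class OrderTopicMigration {
        // Hypothetical converter for the breaking schema change.
        static String upgradeToV2(String v1Json) { return v1Json; }

        static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("orders-v1", Consumed.with(Serdes.String(), Serdes.String()))
                   .mapValues(OrderTopicMigration::upgradeToV2) // apply the breaking change record by record
                   .to("orders-v2", Produced.with(Serdes.String(), Serdes.String()));
            return builder.build();
        }
    }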

It’s a different way of working for sure, but for an event sourced system I think kafka is an excellent choice. In a sql database keeping the complete history of changes for years is much less practical than in kafka.

What you shouldn’t do is look at kafka as a database. It’s a message log. You push messages in, you eventually read them back out. The key word is eventually. You can’t read your writes, and that makes building an interactive system on top hard.


Just a remark to writers: when you redact an "introduction to smth", please refrain from writing down the name of your product 50 or more times in the first paragraph. It's totally frustrating and made me run away to just look up Wikipedia instead to get a grasp of the general idea.

Example: You've probably heard of smth thing and wonder how smth differ from smth else. In this article about smth we will dive into smth things to discover how smth is well better suited to do things thanks to smth things and other things that are really specific to smth! The power of smth enable things to things in a way things do things that make smth something you need to learn about.

So during this journey about smth will be cut in four part the first being an introduction.

(...) --sudo click the link for first part

This is the first part of a fourth series about smth to learn more about things and things in smth.

Smth is very specific about things, and that's specifically why smth is well suited for things. Smth is a new way to do something that will make you think more about things and things. Let's now dive into smth...

Smth use things as a things to do the smth things with tings on smth.........

--sudo repeat marketing ad nauseam

No matter whether the actual product is pure gold or pure garbage, you probably just lost 50% of the readers at some point.


i totally agree with what you're saying but you should also know that "redact" doesn't fit in your sentence


Too late to edit sorry. English is not my native language and I clearly see my translation mistake now. Thank for pointing out ;)


Something that I wish I had known about Apache Kafka a year or so ago is that it essentially has no support for long-running tasks, i.e. tasks where longest-possible-worker-execution-time >> longest-tolerable-group-rebalance-time.

After much angst in trying to work around this issue, I finally gave up and switched to Pulsar. Pulsar isn't without its own issues (mostly around bugs and general maturity) but it handles this particular scenario admirably.


It's true, message buses and work queues have different characteristics. It sounds like you want a work queue, not a message bus. I have very successful experience with using rabbitmq for work queueing, but as you mention there are others too.


Pulsar works quite well as a message queue: https://pulsar.apache.org/docs/en/cookbooks-message-queue/


Pulsar can also support infinite data retention using data tiering and by spreading data on all nodes.

It’s much better than Kafka when you don’t know when the consumer will come back looking for what changed


Spreading data on all BookKeeper nodes, yes :) Pulsar brokers are themselves stateless.


You’re right. The issue is that in this particular application I need _both_ a work queue and a message bus. I’ve also successfully used Rabbit as a work queue, but it’s not high-throughput enough to meet my messaging needs. Pulsar seems to cope well in both roles.

All that said, if I had my time again I’d probably just use one of the cloud providers’ solutions and spend my efforts elsewhere...


What every software engineer should know about Kafka: it's dead.

If you're not already technically chained to it and Confluent hasn't already upsold your poor organization, avoid it.

If you want the early flexibility and the rapid PoC, just look at AWS Kinesis/Firehose.

If you're looking at large scale (1+ Gbit ingest, 100k msgs/s, that kind of stuff) then Apache Pulsar is where to go.


I would argue that Kinesis is not the way to go for a quick POC unless you're tied to the JVM.

Pulsar is still niche in most enterprise.

Kafka is not dead; there are many enterprises (including 2 successful ones I've worked at in the past 5 years) that have built POCs and successful products on Kafka. It supports all languages performantly and has tons of community support. I would argue there is nothing better to build a POC on.


Why would you be tied to the JVM by using Kinesis? You can write a client library for any language. We did it for Go.


Because the official client from AWS is written in Java.

It is possible to write clients in any language; however, it is not that simple, especially when you need to handle the logic of scaling your Kinesis stream out or in (which will split or merge shards) and when you have multiple consumers in the same consumer group (you will need a distributed locking mechanism and logic to steal locks if one consumer dies).

So it is not trivial.


Sure it is. AWS ships a bunch of clients for Kinesis as well as a kinesis agent for shipping log files.

https://docs.aws.amazon.com/sdk-for-go/api/service/kinesis/


This is an SDK not a full client for Kinesis.

This is the only official implementation from AWS: https://github.com/awslabs/amazon-kinesis-client

The client handles horizontal scaling, checkpointing, shard splits and shard merges. Using just the SDK, you have to build this yourself (unless you are using Kinesis for use cases that don't need it to be done correctly).

This is the doc for developing consumers using the SDK https://docs.aws.amazon.com/streams/latest/dev/developing-co...

And in the second paragraph of this documentation: "These examples discuss the Kinesis Data Streams API and use the AWS SDK for Java to get data from a stream. However, for most use cases, you should prefer using the Kinesis Client Library (KCL) . For more information, see Developing KCL 1.x Consumers."


Kafka is just starting to be free of some of the fundamental shortcomings that hindered its adoption, similar to Mongo in the early days, I think. Anyway, I see it being used more widely these days for streaming data.

Apache Pulsar definitely has a better architecture, separating the ingestion from the storage, but once something starts to gain widespread adoption (like Kafka) that leaves little room for alternatives.


MongoDB is still pretty much broken. I really hope your bank is not using MongoDB.


Funny story, I sat through a MongoDB pitch yesterday and one of their selling points was how many banks use mongodb as their data store.


Some time ago I was talking to some bank software engineers and they loved mongodb because the schemaless nature let them avoid all the red tape which DBAs had created within the organization. Mongodb solves an organizational failure more than anything else.


I deal with text+discrete data. On the one hand this falls well into a document structure, on the other hand it wasn’t clear what exactly was offered on top of what we get from our current setup of elastic+rdbms (we are ok with data duplication and records are ~immutable).


wow, do you remember which bank and as the backend for what kind of system ?

if it’s just for some kind of cache or system that is ok to randomly corrupt and drop data I would sleep better :-)


Honestly I don’t have a clear recollection. Their point was to impress on us broad adoption so big names were dropped. This was in the context of Mongo Atlas, their cloud based solution, and how it is being used in heavily regulated/strong privacy environments.


Not nearly as broken as it was. It offers reasonable performance as a document database for unstructured data and surprisingly, works pretty well when containerized.

I wouldn't use it for transactions, though.


Kafka's fine, it's got a lot of market share and existing orgs with large deployments aren't going to go through the pain of switching needlessly.

Pulsar is great, I love how it combines message queue and pub-sub semantics. Tiered storage etc. The built-in cluster replication is brilliant, although it has some limitations - you can't replicate from cluster A to cluster B to cluster C due to how Pulsar avoids the active/active infinite replication circle of doom.

But, it's early days for it. Anyone adopting it will need to invest more time in learning the innards compared to Kafka.


I'm surprised by the amount of criticism in this thread. I've used Kafka in the past and it definitely got the job done (as a message bus, not using stream processing or the other more whiz-bang features). What do people use instead?


My experience is the opposite: unless you really need the whiz-bang features you should never use Kafka. It's the least reliable and hardest to run, and it requires active babysitting by a skilled team of admins and ops.

If you don't need low end-to-end latency (a tailing consumer blocked on long polling), it's better to use something like HDFS or AWS S3. If you do need it but have low throughput, it's better to use something like RabbitMQ. If you have both high throughput and need polling, then it's worth investing in Apache Pulsar.


Pulsar is half-baked, with a limited number of half-baked clients even for major languages. Kafka instead is a bulletproof solution with tons of clients in any flavor, and with tooling to detect issues and alert about them. Don't even compare them.


Pulsar is far better designed than Kafka and is much more reliable and scalable. Clients in every language are an entirely different issue and mostly down to developer bandwidth because it's a small team.

Kafka actually doesn't have that many great clients either, they all are just wrappers around the C++ librdkafka library.

Pulsar clients are also easy to create because of the stateless protocol, per-message acknowledgements, and optional websockets API. Or you can just use their Kafka bridge adapter and use your existing Kafka clients.


Pulsar itself is solid, but it's true the number of good clients is limited. The same is true for Kafka: the only reliable client is the Java one. All others are based on librdkafka, which still has major bugs that can cause it to get stuck and stop producing to the new leader during leader changes.

One benefit of Pulsar is that the client library is completely stateless and much simpler. It does not need to know where data is stored; that's all handled by the Pulsar broker.


I have used the C++, python and Java clients and they are all pretty usable. Pulsar deployment support is great. It seems the teams deploy it themselves.


I just checked and they have a websocket API which I imagine is usable in almost every language.

We might use it ourselves instead of the official Python library which is not async. Async publishing and consuming is a requirement for our project.


I don't think this is coming up enough in the comments! Kafka is a gigantic pain in the ass to run. Configuration is torture. I haven't used Pulsar, but please if someone has a message broker that's easy to run, let me know.


Pulsar currently is easy to setup if you're on Kubernetes, especially if you can dedicate an entire cluster to it. You can run a small 3-node cluster on GKE with local SSD instance storage and GCS/S3 tiering that can handle GBit throughput.

Otherwise look at NATS with NATS Streaming (although they're currently redesigning a better version of it). There's also Liftbridge which is an alternative NATS Streaming fork.

https://docs.nats.io/

https://docs.nats.io/nats-streaming-concepts/intro

https://liftbridge.io/


Like most Apache products it comes with a lot of knobs and switches, the more you tweak those the worse it gets. Kafka runs completely fine as long as you aren’t trying to tune every config option, unless you have a super custom use case and you know what you’re doing.


I had trouble accomplishing simple things like setting the maximum message size. You have to: 1. Either increase the default max message size on the broker, or set it on the specific topic(s) involved. 2. Set the max message size at every producer. 3. Set the max batch size at every consumer, which has some interaction with a message count (if I'm remembering correctly).

Because we increased the allowed message size, I wanted to make sure that disk space was freed after a given interval. This was super opaque to get working. I don't remember the specific config keys, but there were about 4 parameters involved, and 3 had no effect if the 4th wasn't set properly. They all used slightly different naming conventions, so they didn't appear in context in the documentation.

Still, I agree that defaults are usually the way to go with Apache stuff. If you find yourself turning a lot of knobs to stabilize something, there's probably a design issue.
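
For reference, these are roughly the knobs involved in the max-message-size case above (names are Kafka's documented configs; sizes are arbitrary and the exact interactions are worth verifying for your version):

    import java.util.Properties;

    public class MessageSizeKnobs {
        public static void main(String[] args) {
            // Producer side: cap on the serialized request size.
            Properties producer = new Properties();
            producer.put("max.request.size", "10485760"); // 10 MiB

            // Consumer side: must be able to fetch the largest message.
            Properties consumer = new Properties();
            consumer.put("max.partition.fetch.bytes", "10485760");
            consumer.put("fetch.max.bytes", "52428800");

            // Broker default:       message.max.bytes
            // Per-topic override:   max.message.bytes (via kafka-configs.sh or the AdminClient)
            // Freeing disk space:   retention.ms / retention.bytes are enforced per log
            //                       segment, so segment.bytes / segment.ms also matter.
        }
    }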


If you need an ordered log abstraction: Apache Pulsar, Facebook LogDevice, etc. If you need a real transactional and highly available message broker: any MQ/JMS server like RabbitMQ, IBM MQ, etc.

Kafka in my experience is always the worse solution, unless you need to aggregate HTTP server logs to do offline analytics using something like Spark.


For streaming data kafka is great. It gets misused in many ways. It can be made to guarantee order and has semantics for exactly once processing which may lead one to believe it can be used for transactions, but that forces all kinds of things to occur under the hood that makes performance at scale nearly impossible.

I know folks like Netflix are using it for streaming UI logs into Flink for real time sentiment analysis. There are some very serious use cases for Kafka at scale.

If one just needs a message broker, well, there are other, better ones.


My company uses it for batch file processing. The CIO demanded we use it instead of something like FTP, but didn't understand the software patterns it was built for.

We now have a very overly complex file transmission system and a new CIO.


Exactly-once messaging is a lie. It just means you have duplicate messages but let Kafka try to detect the duplicates for you, which is almost never what you want. It's much simpler to make your message processing idempotent.


Great comment. I have seen multiple people in my short career get burned by trying to figure out how to get exactly once semantics in a distributed system. The solution almost always, as you said, is idempotency (perhaps with the message producer appending a UUID field).
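
A minimal sketch of that idempotency approach (names made up; in practice the "seen" set would be a unique-keyed table updated in the same transaction as the side effect, not an in-memory set):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.UUID;

    public class IdempotentConsumer {
        record Message(UUID id, String payload) {}

        private final Set<UUID> processed = new HashSet<>();

        void handle(Message msg) {
            if (!processed.add(msg.id())) {
                return; // duplicate delivery (retry, redelivery, rebalance) -> no-op
            }
            applySideEffect(msg.payload());
        }

        void applySideEffect(String payload) {
            System.out.println("applied: " + payload);
        }
    }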


Kafka has exactly-once semantics that mostly work. They also have "idempotent producer" which ostensibly provides atomic transactions as well.

Sounds awesome, but the implementation pretty much breaks kafka in that one can no longer scale. Topic and partition are all constrained to a single broker. Hot spots abound!


AWS SQS, RabbitMQ, or Redis. I would build on one of those 3 first, then migrate when throughput became an issue. I've had good experiences with Kinesis Firehose as well.

I've yet to run into a workload that actually needed Kafka (and if you're not a Java shop don't add Java pieces unless you know you need what those Java pieces provide).


As a Kafka alternative, has anyone attempted to use PostgreSQL logical replication with table partitioning for async service communication?

Proof of concept (with diagrams in the comments): https://gist.github.com/shuber/8e53d42d0de40e90edaf4fb182b59...

Services would commit messages to their own databases along with the rest of their data (with the same transactional guarantees) and then messages are "realtime" replicated (with all of its features and guarantees) to the receiving service's database where their workers (e.g. https://github.com/que-rb/que, skip locked polling, etc) are waiting to respond by inserting messages into their database to be replicated back.

Throw in a trigger to automatically acknowledge/cleanup/notify messages and I think we've got something that resembles a queue? Maybe make that same trigger match incoming messages against a "routes" table (based on message type, certain JSON schemas in the payload, etc) and write matches to the que-rb jobs table instead for some kind of distributed/replicated work queue hybrid?

I'm looking to poke holes in this concept before sinking anymore time exploring the idea. Any feedback/warnings/concerns would be much appreciated, thanks for your time!

Other discussions:

* https://old.reddit.com/r/PostgreSQL/comments/gkdp6p/logical_...

* https://dba.stackexchange.com/questions/267266/postgresql-lo...

* https://www.postgresql.org/message-id/CAM8f5Mi1Ftj%2B48PZxN1...


Sounds a bit like change data capture (CDC), e.g. via Debezium [1] which is based on logical decoding for Postgres, sending change events to consumers via Kafka? In particular when using explicit events for downstream consumers using the outbox pattern [2]. Disclaimer: I work on Debezium.

[1] debezium.io/ [2] https://debezium.io/documentation/reference/configuration/ou...


I think you'll find Debezium (http://debezium.io) interesting. Especially its embedded mode, where it's not coupled to Kafka.

Also, do look at the outbox pattern. It's basically what you are describing.


Just as Kafka isn't a database, PostgreSQL isn't a queue/broker. You can use it that way, but you'll spend a lot of time tweaking it, and I suspect you'll find it's too slow for non-trivial workloads.


Skype at its earlier peak used Postgres as a queue at huge scale. PGQueue. It had a few tweaks and, sure, it is an anti-pattern but it can work. It is sure handy if you are already using postgres and want to maintain a small stack.


I believe you will enjoy this:

https://medium.com/revolut/recording-more-events-but-where-w...

Not exactly the same architecture you are proposing, and quite complex tbh, but it is working for them.


Interesting idea. I think one issue is that the write throughput of a single master instance is very limited.

But if the working set fits in memory on the instances processing the writes, you could use PostgreSQL logical replication to update materialized views on other servers easily.

But then it starts looking like Amazon Aurora or the Datomic database.


What every engineer should know about Kafka is that it should not be used for anything critical the way you would use Cassandra or HBase.

But if you are OK with partitions not being available for many hours, or with losing all written data because the cluster did not automatically move a partition to 3 new replicas after 2 of the replicas failed... then it's a good, scalable (speed-wise) product.

There is also no serious multi-tenant support. So if you need multitenancy you've got to use Kubernetes, do one cluster per tenant, and automate that yourself.


There seems to be this common problem among relatively new technologies, that they're not actually aware of what the average person knows about them. So let me be the moron in the room. I work at a company that uses Kafka. What I know so far is that Kafka is broken. It seems to me that this article is more about what every software engineer who plans to re-skill as a kafka engineer should know.


In which way is Kafka broken?


The main takeaway for me, after 5 years of running several Kafka clusters for production-critical systems at Microsoft, is that it's very fragile because of its bad design.

Distributed systems are hard to design correctly. When using HDFS you can create a single 1TB file if the cluster has enough capacity. And with HDFS, HBase, Cassandra, etc., if 5 servers permanently fail in the span of 24 hours you don't need manual intervention and your data is not lost.

In Kafka your partition data needs to fit entirely on a single server, and you might still run into issues if that server is also hosting other partitions.

And with Kafka, if the 3 servers manually assigned as replicas for one partition fail one by one before an engineer is woken up in the middle of the night and has time to fix it manually, you have just lost all your data.


Do you have examples of other systems that won’t lose all your data if all its replicas fail?

This seems unavoidable in any scenario.


Assuming a replication factor of 3, I'm not talking about 3 machines going down at the exact same time, but in sequence, like 5 minutes apart.

All other distributed systems handle this with no issue. Ex: Cassandra, HBase/HDFS, CockroachDB.


If you're losing machines like that, why not set replicas higher? I'm assuming you're running a rather large cluster if 3 machines failing is fairly routine.

Also, did you look into Cruise Control at all?

But I agree that having to fit any entire replica of a partition on one machine is a weakness of Kafka, and something that Pulsar improves on by using Bookkeeper which only needs to be able to store a segment of a ledger - BK's autorecovery daemon is nice too.


Because the only sane write mode in Kafka is acks=-1, which means waiting for confirmation from all in-sync replicas. The more replicas you have, the worse your write latency.

In the case of Pulsar, the ack quorum can be much smaller than the replication factor, so you don't have to wait for a slow replica. Also, no number of nodes going down will cause new writes to fail, and under-replicated segments are fixed automatically using the remaining healthy machines right away.


We use acks=1 just fine. At least once is pretty okay for a lot of use cases.

> In the case of Pulsar, the ack quorum can be much smaller than the replication factor, so you don't have to wait for a slow replica. Also, no number of nodes going down will cause new writes to fail, and under-replicated segments are fixed automatically using the remaining healthy machines right away.

I agree, all good points about Pulsar using BK, although it's worth noting the autorecovery daemon has to be explicitly enabled.

But I'll mention Cruise Control again - it solves all of the issues you've described without manual intervention. https://github.com/linkedin/cruise-control


Yes, I agree that if using an extremely small partition size (~100MB) and Cruise Control, it should be able to auto-heal quickly.

But you will still have brokers with ~100% CPU usage and high network usage while others are mostly idle.

Cruise Control tries to patch around the bad design of Kafka, but there is only so much it can do.

Any broker in the ISR going down still causes a big latency spike, and if the broker was leader for several partitions, all those partitions will be down until the single cluster controller finishes doing the leader switch sequentially for all of them.

Typically on a cluster with many partitions per topic, each broker would be leader for ~1000 partitions, which means it will take up to 15 minutes for the leader switch to complete for all partitions.


Yep, agree that Pulsar is far better for dealing with large amounts of topics than Kafka. Doesn't mean that Kafka is broken, it just means it has limitations.


If your use case is OK with acks=1 then Kafka is perfect for you.

Because this means you don't care about losing some of the messages for which the broker confirmed to the client library that the write was successful.

If you care more about maximum throughput and efficiency than about data integrity and availability, then Kafka is a good choice.


> because this mean you don’t care about losing some of the messages for which the broker confirmed to the client library that the write was successful.

Nope - the leader only acks when the write has been committed to disk. It's called at-least-once because it could fail between committing to disk and acking, and then the producer would try again.

You're thinking of acks=0, at-most-once.


No, the parent is correct. acks=1 can still lose data as the broker can lose topic leadership after write but before replication.

Roughly you can think of it as:

- acks=0 - can tolerate only a small (but most common!) subset of failures; better to keep going with the next datum than fail/retry if anything is wrong, stale data might as well be no data or you only need 99% of it anyway; very common for web traffic analysis, hardware sensor data, etc;

- acks=1 - can tolerate client, client/broker, or single broker failure as long as client can retry; nice for e.g. API uploads where the client has more work to get onto but also wants to provide some reasonable feedback to the caller; a good default;

- acks=-1 - can tolerate client, client/broker, broker/broker, or arbitrary broker failure as long as client can retry; good for events which should absolutely not be lost and small enough to keep locally e.g. "batch job finished", "IDS alerted".

"At least once" and "at most once" are more often used to describe consumer commit behaviors than producer.


In the way that, as an engineer not working directly on Kafka, the only thing I ever hear is when data is missing because Kafka is broken. That might be down to my company not using it correctly, or it being a finicky tool, but it doesn't look good.


I would say this reflects on it not being designed & setup properly on the ops/infra side of things at your company.


I was going to say, what every software engineer should know about Kafka is "don't use it". This is more a list of things people who are already stuck with Kafka should know.


So much criticism here! I've read a lot about Kafka over the last few years and I wish I had read this article earlier -- even basic questions like "Can Kafka store data persistently?" are not adequately answered in many intros to it.

That said, I do find the tutorial flip-flops a bit in target audience. It's mainly "this is what Kafka is", but sometimes has weird asides like "This is how to optimise Kafka" (redundancy, number of partitions, etc) which are pretty distracting from the more fundamental points.


I read the introduction to the series, and then the introduction to the first part, and I’m still not sure exactly what Kafka is, or why I (as part of ”every software engineer”) need to know anything about it. The title suggests that the article(s) will convey some concepts that are useful in a broad sense, but from a skim, this looks like a lot of details about some database-ish product, which probably are good to know if you use that product, but not so much otherwise.


Hmm, I'm pretty sure that a software engineer developing safety-critical firmware for embedded medical systems does not need to know anything about Apache Kafka. Or a game developer. Or a web frontend developer. Given the title it's surprising how many software engineers can in fact go through life and career without ever knowing anything about Apache Kafka.


By now, "What Every Software Engineer Should Know" headlines aren't intended to be serious.


I wish, instead they are just not intended to be taken literally.


One can only make assumptions about the needs of every engineer out of profound ignorance, which accordingly sets expectations for the article contents.

Conveniently so, I might add.


Is there a reason Kafka wouldn't choose to leverage an existing mature RDBMS for their table storage instead of rolling their own?


Well, they use RocksDB internally, though that is not an RDBMS.


Anyone know how large-scale chat systems (Facebook Messenger, etc.) are implemented? My guess is messages are in a data store like HBase and a very simple notification system lets users that are online know to fetch the new entries.



thanks a lot. This seems to be a very simple but smart design.


anyone wanna share their thoughts about deploying their own messaging system vs using a messaging system provided by their cloud provider?


The GCP Pub/Sub API has largely replicated all the features you'd want out of Kafka (including Consumer Groups). The primary consideration at this point is cost. There's an inflection point in size (at some very large message volume) where it makes sense to start running your own Kafka cluster and hire a dedicated person or two to manage it. Most companies will never get anywhere close.

Any project just starting out should use Pub/Sub. One thing I really like is that GCP provides emulators of Pub/Sub et al for local testing. That used to be a bit of an obstacle not too long ago.

In terms of lock-in, I don't see how that applies to an AMQ. The data moving through it should only be transiently persisted, up to a week or two at most in the usual case.

If you want to avoid cloud lock-in, have DB backups, use Postgres/MySQL/etc, containerize your service(s), replicate data in object storage, etc. Common sense stuff, if that's something that's of concern.

Personally, I've seen "vendor lock-in" weaponized as an excuse for a lot of costly NIH bullshit. It's painful to reflect back on a project that could have involved literally a tenth of the time and pain it ended up taking because of that one choice alone.


IMO lock-in fears are overblown. Build stuff out quickly and prove your idea and then refactor when you have customers and revenue.


Refactoring a growing system while maintaining bug-for-bug compatibility is extraordinarily hard. Most people I know who've gone through such a migration never want to do it again.


Most people are simply bad at migrations.


Wasn’t Facebook built on MySQL with the friend graph written in relational SQL and PHP? The idea was but out and then perfected later. In my head this is the same as doing it on AWS or GCP fully on their proprietary stack and then once making some coin you start a refactor or a migration in parallel to your own hardware and stack as costs warrant it.

Dropbox is famous for doing this when they did their move from S3 to pocket — an in house storage layer they built.


It's not likely your product will have enough financial success to afford the huge amount of engineering effort facebook has put into improving their platform. If your system falls apart before you are making enough off of it to fix it you're going to just run out of runway.


Or to add to that, there are often very good reasons to optimize well before facebook did. The sheer friction of making the transition to The Right Thing can often be too difficult to overcome.


Ultimately, building a maintainable system requires that you allow for people being bad. If 10 smart engineers contribute to your system, and they each assume future developers will be good in their own distinct way, the resulting system will be so complicated that none of them are good enough to maintain it.


Preferably, everything that can be is loosely coupled and extensible.


If most are bad at migrations then what should most do?


Learn how to do migrations well. There's plenty of literature on the topic.


I don't believe Pub/Sub has a way to ensure linearizability between messages, whereas Kafka does have this via deterministic hashing of keys to partitions. Otherwise yeah, Pub/Sub looks pretty solid and economical.

At my company I'm guessing we push over 200M messages a day through it, and it is a breeze to maintain and stable as hell. Maybe gets harder a couple orders of magnitude more, but no complaints here.


> The primary consideration at this point is cost.

Certainly. In my experience the operational cost of Kafka is enormous. I've seen incidents too often for a relatively small (<10 machine) cluster. I would expect that after a first long phase of finding the right configuration we'd move to a phase where it runs smoothly, but I haven't seen it yet.


GCP Pub/Sub is insanely expensive


I don't have a lot of devops experience, but I was just struck by how cheap it appeared to me. $40 / TB? I can't even imagine how much money is sunk into managing the Kafka clusters at my employer.


Can you explain? It doesn't seem that way at all.


If you are on GCP I think the choice is simple, use Cloud Pub/Sub. Extremely simple, extremely reliable, extremely performant, fairly inexpensive, multi-region (global). No maintenance, no scaling, almost no tunables, it just works.

Google provides a Pub/Sub emulator for local development.

I don't really buy the vendor lock-in thing for Pub/Sub-like systems. The Cloud Pub/Sub usage pattern is basically the same as Kafka, you can have a library that abstracts away the differences. There are open source libraries that do that[1]. If you ever need to switch cloud providers, or want a messaging system to span cloud providers, you can switch without changing lots of code.

[1] https://github.com/google/go-cloud/tree/master/pubsub


I don't think it's that simple unless I misunderstand how GCP Pubsub works. I don't think GCP PubSub will give you deterministic delivery order within a partition the way Kafka will: https://cloud.google.com/pubsub/docs/ordering


We have Kafka and GCP pubsub. Kafka is the way to go for us. In terms of reliability, performance, load, etc.


Are you saying GCP Pub/Sub has bad reliability? I've never used it, but I would be surprised if it was down for several hours each week or would just drop your data.

I would love to know your experience


One of the teams was complaining about behavior on load. Now another team will migrate that flow to Kafka. Another team was complaining about reliability/speed. I think eventually they stopped relying on that.


Distributed messaging is really hard to get right. It'll seem to work fine right up until you get weird bugs and unreliability at the worst moments.

I wouldn't recommend relying primarily on something vendor-specific like Amazon SQS, but there are very good out-of-the-box tools like RabbitMQ or Kafka available.

Writing your own messaging system is like writing your own database, it's the wrong choice 99.9% of the time.


While I agree, I see a big problem: people often pick Kafka when they really needed RabbitMQ, because of the hype around Kafka and not understanding the difference between a transactional message broker (RabbitMQ, IBM MQ, ...) as used in banking, and a distributed log store (Pulsar/Kafka).


Not sure that I’d want to rely too much on Rabbit’s transactional behavior based on their limitations.

https://www.rabbitmq.com/semantics.html


Testability. The single biggest reason for us not going with a proprietary queue. We can create, reset, and throw away queues as we want during testing, thousands of times a day. Even on a laptop.


Do you have any advice for setting up a test Kafka environment? I'd love to be able setup a lightweight in-memory Kafka for unit-y tests without going through the whole Docker compose rigamarole.


Depending on the complexity and specialisation of your use case, this is what may work for unit testing:

1) Abstract the real library behind a tech-agnostic interface to remove as many tech details as possible and focus on behaviour.

2) Create the simplest in-memory dummy implementation of that interface, basically a mock.

3) Write a test spec for that interface's behaviour (does xxx on yyy, returns zzz otherwise) and run it against both the real wrapped implementation & the mock to be sure they're consistent.

4) Make code to load mock in local testing env

5) Use real kafka when running it on a testing pipeline

In the same way you can abstract various things - e.g. once I just abstracted proprietary services and flushed everything into a local Postgres. You can emulate multiple services there - e.g. a database, a somewhat-usable queue, etc. - and get some extra control that's nice for tests (e.g. make it throw errors on demand or add timeouts to simulate network delays). That saved a lot of time for local development for me, without dropping quality.
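
A bare-bones sketch of steps 1 and 2 (all names made up; the real implementation would wrap the Kafka client, and both would be run against the same behavioural spec):

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.function.Consumer;

    // Tech-agnostic interface the application codes against.
    interface MessageBus {
        void publish(String topic, String message);
        void subscribe(String topic, Consumer<String> handler);
    }

    // In-memory dummy used in local/unit tests.
    class InMemoryMessageBus implements MessageBus {
        private final Map<String, Queue<Consumer<String>>> handlers = new ConcurrentHashMap<>();

        @Override
        public void publish(String topic, String message) {
            handlers.getOrDefault(topic, new ConcurrentLinkedQueue<>())
                    .forEach(h -> h.accept(message)); // synchronous delivery: fine for tests
        }

        @Override
        public void subscribe(String topic, Consumer<String> handler) {
            handlers.computeIfAbsent(topic, t -> new ConcurrentLinkedQueue<>()).add(handler);
        }
    }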


Check out the Kafka support in Testcontainers: https://www.testcontainers.org/modules/kafka/. It is container-based, but with a very simple-to-use API for spinning up Kafka in tests.

If you truly want something embedded, Debezium's KafkaCluster could be interesting: https://github.com/debezium/debezium/blob/master/debezium-co....

It's used for our own tests of Debezium, but I'm aware of several external users. It spins up AK and ZK embedded. Disclaimer: I work on Debezium


If you're using Kafka Streams it provides a TopologyTestDriver.


We use Amazon MSK and are pretty happy with it so far.


Much much cheaper.


Dev time would far outweigh the cost of a managed equivalent.


Disagree.

Dev time is virtually identical between the two kinds of infrastructure.


does that factor in development costs?


Is developing for NATS so much harder than for cloud-provided pub/sub?


Configuration & administration of the kafka cluster is more time-consuming.


Yes, but nats is dead simple


Avoiding vendor lock-in is the only quasi-sane reason I can think of.


"Every engineer".

The whole human activity of reducing science to practical art will do just fine without knowing Apache Kafka, thanks.


"Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should."


Cool click-bait title


Not necessarily, it's most likely a reference to the popular article from 2013: https://engineering.linkedin.com/distributed-systems/log-wha...


Or directly from 1991's "What Every Computer Scientist Should Know About Floating-Point Arithmetic".


Where the "every" was actually justified, contrarily to its descendants.


Sometimes you should just use Graylog (Kafka+Elastic), especially if you are already comfortable with Elastic. You get to scale, retain, and monitor your data in addition to stream processing. If I had a fairly small Go webapp that needed stream processing, I would just use Graylog instead of trying to use Kafka directly.


One thing that bit me with Kafka is that for each partition there can be only one consumer per consumer group. If your consumer has performance issues (e.g. using CPython) then you are out of luck.


The thing this article pointed out to me, which I didn't know before, is that this is why you should just set the partition count very high to begin with. Then you just horizontally scale consumers.
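
A sketch of over-provisioning partitions at topic-creation time with the plain Java AdminClient (topic name and counts are arbitrary):

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // 48 partitions caps the consumer group at 48 parallel consumers,
                // which leaves headroom to scale out later without repartitioning.
                admin.createTopics(List.of(new NewTopic("clickstream", 48, (short) 3)))
                     .all().get();
            }
        }
    }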


Yes, that's the lesson I learned the hard way. The publisher was from another department and had only 1 partition with a large volume of data.


Kafka for events, Flink for continuous SQL against topic events, Kudu for event storage. Databases for data storage and queries.


1st rule is you don't need kafka.

Second rule is don't forget about 1st rule.


can someone give me a tldr for what problem kafka solves / what situation calls for the usage of kafka?


this is so timely, thank you!


Use pulsar - so much better than kafka


I am all for self-reliance, but if you really want to influence someone else, you might want to include a link to the project, especially when the only word you share has a much more prevalent meaning.


There was a Pulsar post on HN a few months ago, with interesting comments, some of them related to Kafka. I've kept Pulsar on the radar since that post:

https://news.ycombinator.com/item?id=21936252


Why is that?


If anybody has seen a good detailed comparison, I'd love to read one. The first dozen hits were pretty weak.


Millions of topics, no ZooKeeper, etc. Kafka is addressing these shortcomings on the roadmap.


The Pulsar documentation says it requires Zookeeper: https://pulsar.apache.org/docs/en/administration-zk-bk/


Oh sorry, I meant storing topic info in ZooKeeper, which limits Kafka to a certain number of topics.


Nope, it also stores topic metadata in ZK - it's not exactly going to store that in the (near) stateless brokers, or in Bookkeeper - and BK also relies on ZK, but it's common to reuse the ZK quorum between the brokers and the bookies.

It also needs an additional ZK for cluster replication.


For a lot of projects this is hardly a problem. On the other hand, Kafka is more mature and has a huge ecosystem (Kafka Connect, Kafka Streams, KSQL, ...).


Kafka also has fewer moving parts, even today before the ZooKeeper removal is complete (2 vs. Pulsar's 3).


But one of those moving parts of Pulsar, BookKeeper, means that you're no longer storing data on message brokers. Worth the extra puzzle piece for a lot of use cases.


Pulsar is less mature but does provide functional equivalents to all of the above. Pulsar IO (Kafka Connect), Pulsar SQL (KSQL), Pulsar Functions (Kafka Streams).


Nah, Pulsar functions is nowhere near Kafka Streams - it's more like AWS lambda.

Example off the top of my head, is that you can't, in Pulsar Functions write the equivalent of "aggregate this stream across a 10 minute window and emit the results on window close".

But that's fine, it doesn't need to be like Kafka Streams, you can use Flink or Spark or Storm etc. to fill the same niche. In fact one of the founders of StreamNative (Pulsar's equivalent of Confluent) is a core committer on Flink.


Kafka is not more mature, just more hyped. I just wish the Aphyr Jepsen tests would also cover more scenarios, like:

- what happens to your data if X+1 servers permanently fail in a cluster with a replication factor of X

- what happens if a single partition's data size or request rate becomes 90% of the cluster capacity

- what happens in a multi-tenancy scenario to other users' throughput and latency when one user tries to use all the capacity of the cluster

- ...


It's way more mature. I just spent a week evaluating Pulsar vs Kafka for a client and the fact Kafka has been open sourced for 10 years vs. Pulsar's 1.5 really shows in documentation, community support etc.

> what happens to your data if X+1 servers permanently fail in a cluster with a replication factor of X

It depends on how many in sync replica sets existed entirely within those X+1 servers. Their partitions will go offline, and other ISRs will have underreplicated partitions, and the alerting you've set up as a good engineer will have told you this was happening.

> what happens in a multi-tenancy scenario to other users' throughput and latency when one user tries to use all the capacity of the cluster

Nothing, because you're using ACLs and have configured quotas appropriately.

Bad things otherwise.

PS, also been running Kafka since 0.8.


If the replication factor is 3 and 3 servers go down in the span of 1 or 2 hours, no alert will save you.


Yes, but this is true of any system offering N - 1 safety, e.g., HDFS, Vertica, Pulsar. It's not specific to Kafka.

And you can switch to your warm replicated cluster in this scenario, if you have one, Mirror Maker 2 supports replicated offsets so consumers can switch without losing state.

But what you're describing is going to shaft any replicated system.


Not true for HDFS, Cassandra, Pulsar, and most distributed file systems.

As soon as a segment is under-replicated, its replication factor is restored in less than 2 minutes by selecting a new machine as a replica.

Kafka tries to do this with Cruise Control, but adding a replica to the in-sync replica list takes several hours if partitions are 300GB and the servers are already busy handling regular live traffic.


> adding a replica to the in-sync replica list takes several hours if partitions are 300GB

I'd be curious to hear more about this, because I run several topics with similar partition sizes, and haven't encountered several hours for one replica, and I've routinely shifted 350GB partition replicas as part of routine maintenance.

I have encountered 2 hours to restore a broker that was shut down improperly, but yeah, assuming your replica fetchers aren't throttled to shit, or your brokers aren't overloaded (what's the request handler avg idle? 20% or lower is time to add another broker, 10% is time to add another broker right now), that's really extreme.



