Can someone share some long term event driven success stories? Almost everything...

ninkendo · on June 9, 2024

Chiming in with another “no” here. We adopted a message bus/event-driven architecture when moving a very popular piece of software from the cloud, to directly on the user’s device… it was a disaster IMO.

The core orchestration of the system was done via events on the bus, and nobody had any idea what was happening when a bug occurred. People would pass bugs around, “my code did the right thing given the event it got”, “well, my code did the right thing too”, and nobody understood the full picture because everyone was stuck in their own silo. Event driven architectures encourage this: events decouple systems such that you don’t know or care what happens when you emit a message, until one day it’s emitted with slightly different timing or ordering or different semantics, and things are broken and nobody knows why.

The worst part is that software is basically “take user input, do process A on it, then do process B on that, then do process C on that.” It could have so easily been a simple imperative function that called C(B(A(input))), but instead we made events for “inputWasEmitted”, “Aoutput”, “Boutput”, etc.
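
To make the contrast concrete, here is a minimal Python sketch. The step functions `a`, `b`, `c` and the tiny in-process `Bus` are hypothetical stand-ins, not the actual system described above; only the event names are taken from the comment.

```python
from collections import defaultdict

# Hypothetical steps standing in for "process A/B/C".
def a(text):
    return text.strip()

def b(text):
    return text.upper()

def c(text):
    return f"<{text}>"

# Imperative version: the whole flow is visible in one place.
def pipeline(user_input):
    return c(b(a(user_input)))

# Event-driven version: same work, but the flow is spread across subscriptions.
class Bus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def emit(self, topic, payload):
        for handler in self.handlers[topic]:
            handler(payload)

def run_evented(user_input):
    bus = Bus()
    results = []
    bus.subscribe("inputWasEmitted", lambda p: bus.emit("Aoutput", a(p)))
    bus.subscribe("Aoutput", lambda p: bus.emit("Boutput", b(p)))
    bus.subscribe("Boutput", lambda p: results.append(c(p)))
    bus.emit("inputWasEmitted", user_input)
    return results[0]
```

Both functions compute the same value, but in the second one the order of steps exists only as a chain of topic names, which is the part that becomes hard to follow once the handlers live in different repos.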

What happens when system C needs one more piece of metadata about the user input? 3 PRs into 3 repos to plumb the information around. Coordinating the release of 3 libraries. All around just awful to work with.

Oh and this is a very high profile piece of software with a user base in the 9 figure range.

(Wild tangent: holy shit it is hard to get iOS to accept “do process” in a sentence. I edited that paragraph at least 30 times, no joke, trying every trick I could to stop it correcting it to “due process”. I almost gave up. I used to defend autocorrect but holy shit that was a nightmare.)

tadfisher · on June 9, 2024

I think the true term for this phenomenon is "decoherence" rather than "decoupling". Your components are still as coupled as they ever were, but the coupling has moved from compile-time (e.g. function calls) to runtime. The component that "handles events" decoheres the entire system because it's now responsible for the entire messaging layer between components, rather than the individual components being responsible for their slice of the system.

jesse__ · on June 10, 2024

That's a great name for that property. I've always cringed when people say 'something something decoupling' because most of the time the end result is actually just as coupled, but ends up indirected or something. Now I have a more specific word for it, thanks!!

setr · on June 10, 2024

> (Wild tangent: holy shit it is hard to get iOS to accept “do process” in a sentence. I edited that paragraph at least 30 times, no joke, trying every trick I could to stop it correcting it to “due process”. I almost gave up. I used to defend autocorrect but holy shit that was a nightmare.)

can you not just pick the original spelling in the autocomplete menu above the keyboard?

sharlos201068 · on June 10, 2024

We use an event driven architecture at work and find it works quite well, however events are for communicating between services across business domains and owned by different teams.

If you have some logic A and B running on user input, I wouldn't be splitting that across different services.

Salgat · on June 9, 2024

https://www.eventstore.com/case-studies/insureon

I can attest to this case study being 100% true. Our platform has been using EventStore as our primary database for 9 years going strong, and I'm still very happy with it. The key thing is that it needs to be done right from the very beginning; you can't do major architecture reworks later on and you need an architect who really knows what they're doing. Also, you can't half-ass it; event sourcing, CQRS, etc. all had to be embraced the entire time, no shortcuts.

I will say though, the biggest downside is that scaling is difficult since you can't always rely on snapshots of data, sometimes you need to event source the entire model and that can get data heavy. If you're standing up a new projector, you could be going through tens of millions of events before it is caught up which requires planning. It is incredible though being able to have every single state change ever made on the platform available, the data guys love it and it makes troubleshooting way easier since there's no secrets on what happened. The biggest con is that most people don't really understand it intuitively, since it's a very different way of doing things, which is why so many companies end up fucking it up.

Spivak · on June 9, 2024

Am I dumb or is this basically the binlog of your database but without the tooling to let you do efficient querying?

Like I get the "message bus" architecture when you have a bunch of services emitting events and consumers for differing purposes but I don't think I would feel comfortable using it for state tracking. Especially when it seems really hard to enforce a schema / do migrations. CQRS also makes sense for this but only when it functions as a WAL and isn't meant to be stored forever but persisted by everyone who's interested in it and then eventually discarded.

ffsm8 · on June 9, 2024

> Especially when it seems really hard to enforce a schema / do migrations

Enforcing the schema isn't too hard ime. But every migration needs to be bi-directionally compatible. That's likely what they meant with "you need an architect and can't make major changes later on"

It's the same issue you've had with nosql, even though you technically do have a schema

_3u10 · on June 9, 2024

Pretty much. Also all your projectors need to be deterministic.

E.g. your commands ALWAYS have to do the same thing, or else replaying the event log does not produce the same output and then you’re back to square one.

It’s usually easier / more useful to just use an audit table.
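
A minimal Python sketch of the determinism requirement, assuming a hypothetical account balance projection. The event types `Deposited` and `BonusGranted` are invented for illustration. The idea is that any random or time-dependent value is decided once, stored in the event, and never recomputed during replay.

```python
import random

def apply(balance, event):
    """Pure projector step: the result depends only on its inputs."""
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "BonusGranted":
        # The bonus was chosen when the event was written and recorded in it,
        # so replaying the log never re-rolls the value.
        return balance + event["bonus"]
    return balance

def apply_nondeterministic(balance, event):
    """Anti-pattern: re-rolling during replay produces a different history."""
    if event["type"] == "BonusGranted":
        return balance + random.randint(1, 100)
    return apply(balance, event)

def replay(events, step=apply):
    balance = 0
    for event in events:
        balance = step(balance, event)
    return balance
```

Replaying the same log with `apply` always gives the same balance; with `apply_nondeterministic` two replays can disagree.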

blowski · on June 9, 2024

Yes, if version 1.3 of Command handler X was the active version when an event happened, then it needs to be replayed with that version, even if you’re now on v4.5.

lmm · on June 10, 2024

> Am I dumb or is this basically the binlog of your database but without the tooling to let you do efficient querying?

Yes, and I honestly think a traditional database that exposed this stuff would be a winner (but I guess it's too hard, and meanwhile event-sourcing frameworks are building better alternatives). Separating the low-level part from the high-level part that does things like indexing and querying has a lot of advantages: you decouple data storage from validation so you can have validated data without having to drop invalid data on the floor, you decouple index updates from data writes so your scaling problems get way simpler, you can get sensible happens-before semantics without needing transactions that can deadlock and all the crazy stuff that traditional databases do (secret MVCC implementations while the database goes out of its way to present an illusion of a single "current" version of the data, snapshotting that you can't access directly, ...).

Salgat · on June 12, 2024

For events you include a version. When you're only adding properties to the event or removing properties (assuming you defensively write the projectors), no need for a new version, but if you're creating a breaking change in event schema, you'd increment the version for the event and update your projector to handle each version. It's simpler than you'd think.
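
A rough Python sketch of that versioning approach, under assumed event shapes: here v1 of a hypothetical `AddressChanged` event carries a single `address` string, and v2 (the breaking change) splits it into `street` and `city`. None of these field names come from EventStore itself.

```python
def project_address(state, event):
    """Apply one AddressChanged event to a customer read model (a dict)."""
    version = event.get("version", 1)
    data = event["data"]
    if version == 1:
        street, _, city = data["address"].partition(", ")
    elif version == 2:
        street, city = data["street"], data["city"]
    else:
        raise ValueError(f"unsupported AddressChanged version: {version}")
    # Defensive read: a property added later without a version bump
    # simply defaults to None for older events.
    return {**state, "street": street, "city": city,
            "country": data.get("country")}

def replay(events):
    state = {}
    for event in events:
        state = project_address(state, event)
    return state
```

Adding the optional `country` property needs no new version because the projector reads it with `.get`; only the change in how the address is represented needs the version branch.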

therealdrag0 · on June 9, 2024

Note that event sourced data and event based architecture are different things. You can have one without the other.

Salgat · on June 9, 2024

It's a type of event driven architecture, since events generated both hydrate models and trigger event listeners/subscribers. For example, a command to update a customer's address might also have a process manager listening for that event that kicks off a process to send an email notification. That same event is still used to event source the customer model which includes an address.

I suppose you could have event sourcing in a purely isolated manner where the events aren't used for anything else, but you'd be severely limiting the advantages that come free with that.
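
As an illustration of one event serving both purposes, here is a small Python sketch. The in-memory `EventStore` class, the `AddressUpdated` event shape, and the notification handler are all hypothetical; a real store would persist the log and deliver to subscribers asynchronously.

```python
class EventStore:
    """Hypothetical in-memory append-only log with subscribers."""
    def __init__(self):
        self.events = []
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def append(self, event):
        self.events.append(event)
        for handler in self.subscribers:
            handler(event)

def hydrate_customer(events, customer_id):
    """Event sourcing: rebuild the customer model from the log."""
    customer = {"id": customer_id, "address": None}
    for e in events:
        if e["type"] == "AddressUpdated" and e["customer_id"] == customer_id:
            customer["address"] = e["address"]
    return customer

sent_emails = []

def address_notification_manager(event):
    """Process manager: reacts to the same event with a side effect."""
    if event["type"] == "AddressUpdated":
        sent_emails.append(f"Address changed for customer {event['customer_id']}")

store = EventStore()
store.subscribe(address_notification_manager)
store.append({"type": "AddressUpdated", "customer_id": 7, "address": "1 Main St"})
```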

therealdrag0 · on June 9, 2024

That’s what I mean. I worked on services that have event sourced db model but synchronous REST API. And I’ve worked on services that communicate with events for IPC but use relational sql for data store.

Your example uses the same events for both, so sure, that can be done, but it doesn’t have to be. I haven’t worked on a system like that personally, so maybe it can be fine.

But honestly I’m a bit skeptical since that removes services’ data sovereignty. Sounds like the recipe for “distributed monolith” architecture. And actually I just remembered another team at my company is ripping out their use of kafka as their data source on a green field project cuz it was a disaster, so skepticism emphasized.

blowski · on June 9, 2024

I was the lead developer on one for an insurance company a few years back, and it’s still in active use. Insurance is a heavily regulated domain, where an audit trail is more important than performance. There was a natural pattern for it to follow, as we were mapping a stable industry standard.

I also tried doing it in a property setting, where profit margins were tight. The effort needed wasn’t worth the cost, and clients didn’t really care about the value proposition anyway. We pretty much replaced the whole layer with a more traditional crud system.

chenster · on June 10, 2024

What did you mean by traditional crud as opposed to event-driven arch? How is it relevant to the subject under discussion?

blowski · on June 10, 2024

Event-driven: At runtime, the client tells the system what has happened, the system stores the event and is configured in advance for how to react to it.

CRUD: Imperative. Client tells us to create/update a specific entity with some data.

devdude1337 · on June 9, 2024

When I did game dev I often went for an event-driven approach or messaging-based systems combined with OOP and state machines to prevent eventual consistency locally. It works great in that domain, albeit not being the most performant solution.

In web or business systems it works well for some(!) parts. You just shouldn’t do everything that way - but often people get too excited about a solution and then they tend to overdo it and apply it everywhere, even when not appropriate.

Always choose the golden middle path and apply patterns where they fit well.

rswail · on June 9, 2024

Wrote a public transport ticketing system that processes 100-200K+ trips/day with sub-second push of notification to mobiles of trip/payments.

Event driven and CQRS "entities" made logic and processing much easier to create/test/debug.

Primary issues:

1. Making sure you focus on the "Nouns" (entities) not the "Verbs".

2. Kafka requiring polling for consumers sucks if you want to "scale to zero".

3. Sharding of event consumers can be complicated.

4. People have trouble understanding the concepts and keep wanting to write "ProcessX" type functions instead of state machines and event handlers.

5. Retry/replay is complicated, better to reverse/replay. Dealing with side effects in replay is also complicated (does a replay generate the output events which trigger state changes in other entities?)

Been running now for 6 years, minimal downtime except for maintenance/upgrades.

In the process of introducing major new entity and associated changes, most of the system unaffected due to the decoupling.

chenster · on June 10, 2024

Can you elaborate on #1, Nouns over Verbs?

rswail · on June 10, 2024

A lot of people focus on the process instead of the participating entities.

The focus when designing the system should be on the entities (Customer, Payment, Bill, Order, Inventory) instead of the processes (ordering, billing, fulfillment). I summarize that by saying "Nouns over Verbs".

The state of each of the entities is affected by the processes, but the effect happens from changes in other entities: Customers place an Order, Customers get a Bill for the Order, Customers make a Payment, etc.

The state of each of these entities is independent of the others and changes only as a result of two things: either an external "Command" or an "Event".

Commands are events that occur outside of the system boundary, usually visible as part of an API (if RESTful) that uses POST/PUT/DELETE or they are imperatives from one entity to another.

Commands are imperatives, Place Order, Pay Bill, Fulfill Order, etc.

Events are records of occurrences in the system, expressed in the past tense and are immutable. Order Placed, Bill Paid, Order Fulfilled.

Customers place an Order by POSTing to /orders (or potentially /customers/uuid/orders).

Events are generated from entities inside the system. (Order being placed generates an order_placed event).

The difference is that by focussing on the entities, and their state, independent of other entities, the entities can be created, tested, installed, evolved independently of other entities in the system.

The thinking about them is simplified and focussed, they are naturally decoupled because they can only find out about other entities by inquiring or affect other entities by generating a Command or an Event.

Any events they generate are processed asynchronously and can have multiple consumers.
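
One way to picture an entity as a state machine driven by commands and events is the Python sketch below. The `Order` class, its state names, and the `bill_paid` event shape are assumptions made for illustration; only `order_placed` is named in the comment above.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from an emitted event type to the resulting state.
STATE_AFTER = {"order_placed": "placed", "order_fulfilled": "fulfilled"}

@dataclass
class Order:
    order_id: str
    state: str = "new"
    emitted: list = field(default_factory=list)

    # Commands are imperatives and can be rejected.
    def place(self):
        self._require("new", "place")
        self._emit("order_placed")

    def fulfill(self):
        self._require("paid", "fulfill")
        self._emit("order_fulfilled")

    # Events from other entities are past-tense facts; the order only reacts.
    def on_event(self, event):
        if (event["type"] == "bill_paid"
                and event["order_id"] == self.order_id
                and self.state == "placed"):
            self.state = "paid"

    def _require(self, expected, command):
        if self.state != expected:
            raise ValueError(f"cannot {command} an order in state {self.state!r}")

    def _emit(self, event_type):
        self.state = STATE_AFTER[event_type]
        self.emitted.append({"type": event_type, "order_id": self.order_id})
```

The order never calls into a billing entity. It learns about payment only through the `bill_paid` event, so it can be tested by feeding it commands and events in isolation.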

cjk2 · on June 9, 2024

No. We have a complete fucking disaster on our hands.

macintux · on June 9, 2024

How old of a system? Do you feel it’s the implementation, the design, or the concept itself that went wrong? Is your system a good fit?

(No stake in this one way or another, just curious.)

cjk2 · on June 9, 2024

Less than 5 years. Vanity project. Built and maintained by astronaut architects. Entirely unnecessary. Poorly implemented down to the level of wire contracts being inconsistent. Overheads are insane both from engineering and operational POV.

moandcompany · on June 9, 2024

Resume driven development never goes out of style

cjk2 · on June 9, 2024

I call this one FDD: Fuckwit Driven Development. Because if it was resume-driven I'd expect it to be something that they would want to put on their resume. But this is unmentionable.

moandcompany · on June 9, 2024

There's sayings along the lines of "Victory/success has a thousand fathers, but defeat/failure is an orphan."

Chances are that system, and its outcomes are described very differently on a resume

OccamsMirror · on June 10, 2024

As long as the list of technologies used is impressive sounding you're on to a winner.

analognoise · on June 9, 2024

Hail, RDD, my favorite development style.

cweld510 · on June 9, 2024

I work on an event-based architecture that I think is successful, but that’s because our core primitives are event-based, so there is no impedance mismatch in the way that there can be if you migrate from a request-response architecture to an evented one. Specifically, we aren’t trying to deal with databases and HTTP (both of which are largely synchronous primitives). Instead, I work on a platform for somewhat arbitrary code execution; and the code we are executing depends on our code rather than vice versa. In general, the code we execute on the platform can run for an indeterminate amount of time, and it generally has control and calls back into our code rather than our code calling into it. So our control flow is naturally callback-based rather than request/response; as a result, our system is fundamentally event-based.

vmaurin · on June 9, 2024

I have been doing this kind of stuff both in ad tech and trust & safety industry, mainly to handle scalability. Something that looks like "Event-carried state transfer" here https://martinfowler.com/articles/201701-event-driven.html

These systems are working fine, but maybe there is some common ground:

* very few services

* the main throughput is "fact" events (so something that did happen)

* what you get as "Event carried state transfer" is basically the configuration. One service owns it, with a classical DB and a UI, but then exposes configuration to the whole system with this kind of event (and all the consumers consume these read only)

* usually you have to deal with eventual consistency a lot in this kind of setup (so it scales well, but there is a tradeoff)

jgraettinger1 · on June 9, 2024

PostgreSQL.

The WAL is an event log, and when you squint at its internal architecture, you’ll see plenty of overlap with distributed event sourcing.

mrkeen · on June 9, 2024

Likewise with git. There's the "top-level events" that you see (commits). But even when you're doing 'unsafe' operations, you're working with the lower-level reflog events.

marcosdumay · on June 9, 2024

Almost every modern software system. Anything running over the Web is event driven.

lolive · on June 10, 2024

99.99% of the data we consume on the Web comes out of databases [call it Transactional-SQL-xxx or ColumnBased-yyy or Elastic-SaaS-zzz].

marcosdumay · on June 10, 2024

Well, yes. Databases are event driven themselves.

As is any web application, because the web (at least without sockets) is constrained into communicating events only.

Also, most local GUI applications, because people just like events better for it.

lmm · on June 10, 2024

Right. The hard part is already done. Which makes it infuriating that it's all "internal". Every serious RDBMS already contains an implementation of an event-sourcing system, but you're not allowed to actually use it.

mrkeen · on June 9, 2024

We've had mistakes that we've been able to course-correct from.

Our users are small-businesses with organisation numbers, and we mostly think of them as unique. But they strictly aren't, so we 'overwrote' some companies with other companies.

Once we detected and fixed the bug, we just replayed the events with the fixed code, and we hadn't lost any data.
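
A toy Python version of that kind of recovery, with invented field names (`company_id`, `org_number`) and a hypothetical `CompanyRegistered` event. The point is only that the log still holds both companies, so a corrected projection can rebuild what the buggy one overwrote.

```python
events = [
    {"type": "CompanyRegistered", "company_id": "c1",
     "org_number": "556000", "name": "Acme"},
    {"type": "CompanyRegistered", "company_id": "c2",
     "org_number": "556000", "name": "Bolt"},
]

def project_buggy(events):
    # Bug: assumes organisation numbers are unique, so c2 overwrites c1.
    companies = {}
    for e in events:
        companies[e["org_number"]] = e["name"]
    return companies

def project_fixed(events):
    # Fix: key the read model by an identifier that really is unique.
    companies = {}
    for e in events:
        companies[e["company_id"]] = e["name"]
    return companies
```

Replaying the unchanged `events` list through `project_fixed` restores both companies without touching the stored history.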

lz400 · on June 10, 2024

AFAIK almost every stock market order processing system is event driven, and they are all usually very old systems that have been successfully running for years. I've seen some implementations in investment banks, and what you're usually told is that most exchanges and banks run similar architectures. The reason for this is partially that FIX, the protocol for electronic orders in markets, is event based.

lanstin · on June 9, 2024

It is a very convenient way to move higher-latency operations from the realtime path to a near-realtime path. E.g. you want to send an email when a payment is authorized, but you don’t want to wait for the whole SMTP transaction, so you just post an event and reply back to the user. Also settlements of captured auths, that sort of thing. Even saving some user pref: start the task, reply back to the user, and if it fails, send a failure msg asynchronously.
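
A small Python sketch of that shape, using an in-process `queue.Queue` as a stand-in for a real broker. The function names and the event payload are hypothetical, and `send_email` just records a string instead of talking SMTP.

```python
import queue
import threading

events = queue.Queue()
sent = []

def send_email(event):
    # Stand-in for the slow SMTP conversation.
    sent.append(f"receipt for payment {event['payment_id']}")

def worker():
    # Near-realtime path: drains the queue independently of the request.
    while True:
        event = events.get()
        try:
            if event is None:  # sentinel used to stop the worker
                return
            send_email(event)
        finally:
            events.task_done()

def authorize_payment(payment_id):
    # Realtime path: post the event and reply immediately.
    events.put({"type": "payment_authorized", "payment_id": payment_id})
    return {"status": "authorized", "payment_id": payment_id}
```

The request handler returns as soon as the event is queued; starting `worker` in a background thread (or a separate process with a real broker) takes care of the email later.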

nitwit005 · on June 9, 2024

I've seen successful, but flawed usage.

Every use I've seen sent events after database transactions, with the event not part of the transaction. This means you can get both dropped events, and out of order events.

My current company has analytics driven by a system like that. I'm sure there's some corrupted data as a result.

The main issue being people just don't know how to build and test distributed systems.

mrkeen · on June 9, 2024

I had an interview where I was asked how I would guarantee that an event happened in addition to a database update (transactionally).

It sounded kind of impossible, I said as much, and then proposed a different approach. The interviewer persisted and claimed that it could be done with 'the outbox pattern'.

I disagreed and ended the interview there. Later when I was chatting about it with a former colleague, he said "Oh, they solved the two generals problem?"

> Every use I've seen sent events after database transactions, with the event not part of the transaction.

Maybe this is what they were doing.

lastofus · on June 9, 2024

I don't quite see what the outbox pattern has to do with the two generals problem.

The point of the outbox pattern is that a durable record of the need to send an event is stored in the DB as part of the DB txn, taking advantage of ACID guarantees.

Once you have that durable record in your DB, you can essentially treat your DB as a queue (there's lot of great articles on how to do this with Postgres for instance) for some worker processes to later process the queue records, and send the events.

The worker processes in turn can decide if they want to attempt at least once, or at most once delivery of the message. Of course if you choose the latter, then maybe your event is never sent, and perhaps that was the point you were trying to make to the interviewer.

The key takeaway though is that you are no longer reliant on the original process that stores the DB txn to also send the event, which can fail for any number of reasons, and may have no path to recovery. In other words, at least once delivery is now an option on the table.

mrkeen · on June 11, 2024

> I don't quite see what the outbox pattern has to do with the two generals problem.

Well then, hopefully you would have found it an unsatisfactory 'solution' and walked away from that interview too ;)

> Once you have that durable record in your DB, you can essentially treat your DB as a queue (there's lot of great articles on how to do this with Postgres for instance) for some worker processes to later process the queue records, and send the events.

Yeah but I already have a queue I can treat as a queue. It's called Kafka.

nick__m · on June 10, 2024

You could use two-phase commit with an XA-compliant broker and database.

ramchip · on June 9, 2024

I'm a bit confused by the story. Why did you disagree?

mrkeen · on June 11, 2024

They asked how I would guarantee that a Postgres update and a Kafka event could both happen transactionally.

  (P, K)

Which sounds like one of those classical impossibility proofs.

Their solution was to introduce another part into the system, "the outbox":

  (P, O) K

P and O can form a transaction, but that still leaves the original problem unanswered, how to include K in the transaction.

ramchip · on June 11, 2024

The goal of the outbox pattern is at-least-once publishing though, not only-once. You either get P + (eventually) at least one copy of K, or you get no P and no K.

Without the outbox you can get P without K or K without P, which leads to consumers out of sync with the producer.

This requires the consumer to be able to deal with repeated events to some extent, but you usually want that anyway, since an event can be processed twice even if it appears only once in the topic. For instance if the consumer crashes after processing but before updating the offset.
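
A sketch of a consumer that tolerates repeats, assuming every event carries a unique `event_id` (an assumption, not something Kafka adds for you). In a real service the set of processed ids and the state would have to be stored together, in one transaction; here both live in memory to keep the example short.

```python
class IdempotentConsumer:
    def __init__(self):
        self.processed_ids = set()
        self.balance = 0

    def handle(self, event):
        """Apply the event once; return False if it was a duplicate."""
        if event["event_id"] in self.processed_ids:
            return False
        self.balance += event["amount"]
        self.processed_ids.add(event["event_id"])
        return True
```

Delivering the same event twice, for example after a crash before the offset commit, leaves the balance unchanged the second time.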

mrkeen · on June 11, 2024

> The goal of the outbox pattern is at-least-once publishing though, not only-once.

Right, which is why it's an unacceptable solution to 'transacting over postgres and Kafka', and why I wouldn't want to work for a company that wants me to believe differently.

And there's a better solution anyway: just K.

ramchip · on June 11, 2024

I think you're just using "transaction" in a different sense than what the interviewer meant; "guarantee that an event happened in addition to a database update" sounds like at-least-once to me, and it's normally what you would want in this kind of system.

simonbw · on June 9, 2024

It's been an incredibly useful pattern for me in game development. I have a hard time imagining making a game with any level of complexity without it. You can definitely go overboard with it, but I have a hard time even imagining how some systems like collision detection/a physics engine could even work without it.

ClimaxGravely · on June 9, 2024

That's generally been my experience as well.

However I've seen some frameworks where you can do collision imperatively. For example

if (sprite.collide(tilemap)) {do something}

These are generally on smaller less taxing frameworks (in this case I'm referring to haxeflixel) but they do exist!

TeeMassive · on June 9, 2024

I've worked in an embedded Linux system that was a greenfield project. We needed a library that was written in a certain language, but we also wanted Python for the rest because getting the logic right with a client that changed his mind often was top priority and the data crunching was minimal.

So we ended up using protobufs over a local MQTT broker and adopted a macro-service architecture. This suited the project very well because it had a handful of obvious distinct parts and we took full advantage of Conway's law by having each dev work on the part where their strengths and skills were maximized.

We made a few mistakes along the way but learned from them, most of them relating to inter-service asynchronous programming. This article put words on concepts we learned through trial and error, especially queries disguised as events.

liampulles · on June 10, 2024

Our system is command driven, and works well, but it is because we explicitly have less rigorous demands on the messages and the messages don't cross team boundaries. My past experience also makes me wary of event driven systems.

bob1029 · on June 9, 2024

I saw it done well in manufacturing.

I think it works well when it's the only thing that can work.

Osiris · on June 9, 2024

The project I'm working on is about 13 years old (ruby on rails) with over 260 engineers and the product has a very robust event driven system that is at the core of a lot of important features.

swistak35 · on June 10, 2024

Out of curiosity, what is the system? Is this based on Rails Event Store, or something else (custom?)?

tlarkworthy · on June 9, 2024

Webhooks? Slack automation? GitHub actions?

turkey99 · on June 9, 2024

Yes, it’s a great tool for integration. We have a product suite and it’s our chosen way to connect products.

tkiolp4 · on June 9, 2024

No. The usual pains are:

- Producer and consumer are decoupled. That’s a good thing, right? Good luck finding the consumer when you need to modify the producer (the payload). People usually don’t document these things

- Let’s use SNS/SQS because why not. Good luck reproducing producers and consumers locally in your machine. Third party infra in local env is usually an afterthought

- Observability. Or rather the lack of it. It’s never out of the box, and so usually nobody cares about it until an incident happens

throwaway24x7 · on June 9, 2024

> Good luck finding the consumer when you need to modify the producer

It sounds like your alternative is a producer that updates consumers using HTTP calls. That pushes a lot of complexity to the producer and the team that has to sync up with all of the other teams involved.

> Let’s use SNS/SQS because why not. Good luck reproducing producers and consumers locally in your machine

At work we pull localstack from a shared repo and run it in the background. I almost forget that it's there until I need to "git pull" if another team has added a new queue that my service is interested in. Just like using curl to call your HTTP endpoints, you can simply just send a message to localstack with the standard aws cli

https://github.com/localstack/localstack

> Observability. Or rather the lack of it. It’s never out of the box, and so usually nobody cares about it until an incident happens

I think it depends on what type of framework you use. At work we use a trace-id field in the header when making HTTP calls or sending a message (sqs) which is propagated automatically downstream. This enables us to easily search logs and see the flow between systems. This was just configured once and is added automatically for all HTTP requests and messages that the service produces. We have a shared dependency that all services use that handles logging, monitoring and other "plumbing". Most of it comes out of the box from Spring, and the dependency just needs to configure it. The code imports a generic sns/http/jdbc producer and doesn't have to think about it
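
The setup described is Spring-based; purely to illustrate the idea, here is a language-neutral sketch in Python using the standard `contextvars` module. The helper names `make_headers` and `handle_incoming` are hypothetical, and only the `trace-id` header name comes from the comment above.

```python
import contextvars
import uuid

# Holds the trace id of whatever request or message is being handled.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def make_headers():
    """Headers for any outgoing HTTP call or queue message."""
    return {"trace-id": trace_id_var.get() or uuid.uuid4().hex}

def handle_incoming(headers, handler):
    """Adopt the caller's trace id (or start a new one) while handling."""
    token = trace_id_var.set(headers.get("trace-id") or uuid.uuid4().hex)
    try:
        return handler()
    finally:
        trace_id_var.reset(token)
```

Because the shared plumbing sets and reads the id, handler code can produce downstream messages without mentioning tracing at all, and logs from every hop can be searched by the same value.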

thr0w · on June 10, 2024

> - Let’s use SNS/SQS because why not.

The amount of times I've come across someone who's inserted SQS into the mix to "speed things up"...

mrkeen · on June 9, 2024

> Good luck finding the consumer when you need to modify the producer (the payload)

I just grep for the event's class name.

thr0w · on June 10, 2024

> Can someone share some long term event driven success stories?

JavaScript

jesse__ · on June 10, 2024

I think we have very different definitions of success

richardw · on June 9, 2024

Banking, 7 ish years. Worked well for us in general. When I start needing increased confidence and truth the effort level goes way up but can be done. Definitely still worth it, has given us some solid benefits.

When I say increased, I mean we want the best answer but there are some answers the bank can’t know. If someone has transferred money into your account from another bank but we don’t know that yet, optimising for absolute correctness is pointless because the vast majority of wrong answers are baked in to the process. We can send you a message and you might read it a day later. Unless we delete the message from your phone, we can’t guarantee the message you read is fully consistent with our internal state.

Frankly our system is much better than the batch driven junk that is out of sync a second after it has executed. “Hey you have a reward.” “No I used it 2 hours ago you clowns.”

Note this isn’t cope. In some cases we started fully sync but relaxed it where there are tradeoffs that gave us better outcomes and we weren’t giving anything material up.

lanstin · on June 9, 2024

Or worse “hey you have a reward” but it doesn’t show up in the UI for three minutes. Twitter used to do this to me all the time.

richardw · on June 9, 2024

Eventually consistent means just that, I guess. But I’m sure their reasoning was a lot more sophisticated or impacted by scale than most. Cool problem.

throwawaymaths · on June 9, 2024

Does canbus count?