Hacker News
Seven Commandments for Event-Driven Architectures (rjzaworski.com)
142 points by n1bble on March 16, 2019 | 30 comments



I think this misses the main one - append vector clocks because your events will inevitably turn up out of sequence at some point.


I find vector clocks to introduce a lot of complexity. In our case, we just use PostgreSQL to handle ordering. When committing an event into PostgreSQL, you verify that the last committed event for your stream is still what you expect it to be (i.e. CAS), and you have strong ordering.

I typically want to stay as far away from vector clocks as possible.
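The CAS-on-append idea can be sketched in a few lines. This is a hypothetical in-memory stand-in (the `EventStore` name and API are illustrative, not from the original comment); in PostgreSQL the same check would be a conditional INSERT inside a single transaction.

```python
# In-memory sketch of compare-and-swap (CAS) event appends. In PostgreSQL
# this would be a conditional INSERT in one transaction: append only if the
# last committed version for the stream still matches what the writer read.

class ConflictError(Exception):
    pass

class EventStore:
    def __init__(self):
        self._streams = {}  # stream_id -> list of committed events

    def append(self, stream_id, event, expected_version):
        """Append only if the stream is still at expected_version."""
        stream = self._streams.setdefault(stream_id, [])
        if len(stream) != expected_version:
            raise ConflictError(
                f"expected version {expected_version}, found {len(stream)}"
            )
        stream.append(event)
        return len(stream)  # the new version

store = EventStore()
store.append("acct-1", {"type": "Deposited", "amount": 100}, expected_version=0)
store.append("acct-1", {"type": "Withdrew", "amount": 70}, expected_version=1)

# A concurrent writer that also read version 1 now loses the race:
try:
    store.append("acct-1", {"type": "Withdrew", "amount": 50}, expected_version=1)
except ConflictError:
    print("conflict detected")  # prints "conflict detected"
```

The losing writer re-reads the stream and retries (or rejects the command), which is what gives the strong per-stream ordering.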


What do you do in a network partition?


They go down. Or go read-only.

What do you do during a network partition? Accept writes that you’ll throw away eventually?


Yeah, accept those writes and favour availability. Customer was a wealth management company.

Imagine a customer has £100 in their account. System partitions. Customer withdraws £70, hitting one DC. Customer then hits the second DC, this time withdrawing £50. Each DC thinks the transaction is valid, and so serves it.

Later when the partition is restored, events are played back, and divergent history is detected via the vector clocks - the two withdrawals are not causally related. Remediative action can then be taken.

Transactions prevent bad things from happening, but require CP semantics. Eventual consistency allows AP but also allows bad things to happen, meaning you have to be able to detect them and clean them up later.
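For illustration, the causality check described here can be sketched as a pairwise vector-clock comparison (the `compare` helper and the DC clock values are hypothetical):

```python
# Sketch: vector clocks reveal that the two withdrawals are concurrent
# (not causally related), flagging the divergent history for remediation.

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for two vector clocks."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

# Both DCs started from the same history, then each took a write in isolation:
withdraw_70 = {"dc1": 3, "dc2": 2}  # seen only by DC1 during the partition
withdraw_50 = {"dc1": 2, "dc2": 3}  # seen only by DC2 during the partition

print(compare(withdraw_70, withdraw_50))  # "concurrent" -> divergent history
```

Neither clock dominates the other, so on replay the reconciler knows these writes happened independently and can trigger whatever remediation applies.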


There’s a long history on this debate, but in practice even the GOOG and AMZN have settled on transactional CP systems for handling the important stuff like money (Spanner, Aurora). Expecting app developers to roll their own transactions and conflict resolution at the app layer proved intractable even with all their resources.

> Remediative action can then be taken

Sounds expensive and error-prone; taking a “read only” outage makes more sense in many use cases.


Does using Kafka help mitigate this? Or should it be a producer-driven vector clock? I imagine the latter is the event-driven equivalent of optimistic locking.


Edit - yep, the event producer manages and increments the vector clock.

---

I've not really used Kafka, so I couldn't comment on that. I did some work for a customer that involved multi-DC microservices with isolated databases (i.e. one DB per DC). We used event sourcing with vector clocks to do manual reconciliation of the databases, including during partition. Reconciliation involved custom logic depending on the event type, so I'm not sure how a transport mechanism like Kafka would handle that.

A talk about the approach is here: https://youtu.be/hYVh8PbbeJw

I get to vector clocks at the end.


Neat! Thanks for the info.

Edit: to add on about Kafka, it guarantees a total ordering of messages within each partition, tracked by per-partition offsets rather than vector clocks. This may not be suitable for every application, but the throughput is high, and it allows multiple subscribers to see a consistent ordering.


The section on "Minimize state in-flight" [1] has an example of an event with both an amount withdrawn and the new balance, and the author recommends sending only the amount instead.

Wouldn't it be useful to send both, since if the expected balance does not match up when the event gets processed, that must mean something was processed out of order?

[1]: https://rjzaworski.com/2019/03/7-commandments-for-event-driv...


Well, when you work with distributed messaging systems it can be difficult (if not impossible) to guarantee that messages arrive or are processed in the same order in which they were generated. So I think the author is just saying that, generally speaking, it's better to carry as little state as possible if you don't need to.
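A rough sketch of the article's recommendation as I read it: carry only the delta in the event and let each consumer fold its own state. The event shapes here are made up for illustration.

```python
# Sketch: events carry only the delta ("amount"); the consumer derives the
# balance itself instead of trusting a precomputed balance that may be stale
# by the time the event is processed (or processed out of order).

events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrew", "amount": 70},
]

def apply(balance, event):
    """Fold one event into the consumer's running balance."""
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrew":
        return balance - event["amount"]
    return balance  # unknown event types leave state untouched

balance = 0
for e in events:
    balance = apply(balance, e)
print(balance)  # 30
```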


I don't think it is a good idea to mix payload data and consistency mechanisms. Better to keep the two separate. For instance, if you want to ensure that your system is consistent, you can identify your events in such a way that you can infer the previous one in the sequence (e.g. if you received events 1 and 3, you know that you've lost event 2).

This way you keep two separate, simple mechanisms to deal with two problems, rather than ending up with one more intricate problem.
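The gap-inference idea above might look something like this (the `find_gaps` helper is hypothetical):

```python
# Sketch: per-stream sequence numbers let a consumer infer gaps without
# mixing consistency metadata into the payload itself.

def find_gaps(received_seqs):
    """Return the sequence numbers missing between the lowest and highest seen."""
    seen = set(received_seqs)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]

print(find_gaps([1, 3]))     # [2] -> event 2 was lost or is still in flight
print(find_gaps([1, 2, 3]))  # []  -> nothing missing
```

On a detected gap, the consumer can pause, re-request the missing event, or alert, all without the payload needing to carry a balance.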


Excellent list, thank you. I’ve thought of most of them by experience but it’s great to know I’m not full of crap.


Related to ordering, I was mildly surprised that the "Minimize state in-flight" section didn't mention timestamping all events with a created (and possibly a separate effective) date.


A created date might encourage an assumption that the date is a reliable way to order events.


Some things are simply so obvious, that you forget to mention them :)


Curious about "minimize state in-flight": if the producer has state on hand, why not include it and just have a policy around when you trust it? If it's equally costly for the producer or the consumer to collect the state, then maybe, but how often is that the case?


Consider replacing "event" by "observation". What you really have is a set of observations, and a state that you infer from it. Observations you make may be reliable, but you have no guarantee that you can observe every fact in the problem.

For instance, if you're a payment processor in the EU, and a withdrawal was done in the US, you may not have received it yet, so you have an incorrect balance. If a withdrawal is made in the EU, and you add the balance to the event, you will have mixed a perfectly legit observation with something that is not an observation but an inference: an artifact computed from your incomplete set of observations.

Trust is something that changes over time. When your EU platform was implemented, maybe your US platform didn't exist. But the difference between observations and inference will remain true.


The event is independent; it is based on the state of the aggregate at the time the event occurred. It's better to have a vector clock for the aggregate as part of the event, to determine what the state was when the event occurred.

Even if you "have a policy" people will ignore it. Better to make it explicit.


Reminds me of this famous list from the 90s https://en.m.wikipedia.org/wiki/Fallacies_of_distributed_com...



The title is too broad, imo. This is clearly only about event-driven architectures for the web. Many of those 7 points are specific to the web.


The examples are web-based, but these points can still be adapted down to any scale, really. Things like staleness are still a concern in something like an event-driven UI, e.g. user input disrupting the current action.


Wrong. I like using an event-driven architecture in games as opposed to an update-driven one. There is no concept of staleness in a single-player game running on one machine. Many of the concepts the author mentions simply do not exist in that context. In games, event-driven = some event happens, and all the logic that needs to execute in response happens immediately, as opposed to, for example, being put on a blackboard and dealt with in some other update loop.

The reason a lot of games have frame-perfect bugs / exploits / glitches is that they use an update-driven architecture. So it's possible that for one frame after certain things happen, the game is in an inconsistent state.
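The "logic runs immediately" style described above can be sketched as a synchronous event bus (names are illustrative, not from any particular engine):

```python
# Sketch of synchronous, single-machine event dispatch: when an event fires,
# every handler runs immediately, so there is no frame in which the game
# sits in a half-updated state.

class EventBus:
    def __init__(self):
        self._handlers = {}  # event type -> list of handlers

    def on(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def emit(self, event_type, **payload):
        for handler in self._handlers.get(event_type, []):
            handler(**payload)  # runs now, not on the next update tick

bus = EventBus()
log = []
bus.on("enemy_killed", lambda name: log.append(f"score for {name}"))
bus.on("enemy_killed", lambda name: log.append(f"despawn {name}"))
bus.emit("enemy_killed", name="goblin")
print(log)  # both handlers have already run, in registration order
```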


Imo, the problem is not with the title being too broad, but with your experience with event-driven architectures being too narrow.

The author talks about general cases, and while respecting his rules in your particular situation is not going to give you much, it isn't going to cause trouble either. On the other hand, if you ignore this advice in a different problem, you may run into huge problems, because the simplifications you made don't hold.


You do run into the same issues in something like a game.

Let's say you have a character playing some animation clip, and at the end you want to run some callback.

That character could be killed by the player or the player could quit to the main menu or any number of things.

Now your callback is stale. You have to deal with cleaning it up, ensuring it does actually fire if you need it to, and handling any issues with the now-deceased character.


Wrong again. You are obviously commenting about an area you are not familiar with. That is a very weird scenario: you don't want stuff like animation affecting logic. The game logic should never be affected, no matter what the animations are doing. But if for some bizarre reason you did have that scenario, a way of doing it using a typical game engine approach would be to put that callback at the end of a coroutine. If the character dies then, no matter what it is, you definitely don't want that callback to fire. (There shouldn't be logic affected by an animation, but if that character is dead then it shouldn't be affecting anything at all.) Part of your game engine would be that coroutines do not run on things which have been destroyed.


Man, you're the one who said you were using events for game dev, not me. Coroutines or events, the idea that you need to reconsider your state when execution returns is still valid advice. Tying a coroutine to a game object's life cycle _is_ managing the event state. It's clearly not a weird scenario when entire game engines are based around this principle.
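One way to sketch "tying a callback to an object's life cycle", under the simplifying assumption of a hand-rolled pending-callback queue (all names here are hypothetical):

```python
# Sketch: deferred callbacks carry a reference to their owner, and the
# engine drops any callback whose owner was destroyed before it could run.

class Character:
    def __init__(self, name):
        self.name = name
        self.alive = True

def schedule_on_animation_end(character, callback, pending):
    # Pretend the engine queues this to run when the clip finishes.
    pending.append((character, callback))

def run_pending(pending):
    """Fire queued callbacks, skipping owners that have since been destroyed."""
    fired = []
    for character, callback in pending:
        if character.alive:  # life-cycle check: stale callbacks are dropped
            callback(character)
            fired.append(character.name)
    pending.clear()
    return fired

pending = []
hero, goblin = Character("hero"), Character("goblin")
schedule_on_animation_end(hero, lambda c: None, pending)
schedule_on_animation_end(goblin, lambda c: None, pending)
goblin.alive = False  # killed before its clip finished
print(run_pending(pending))  # ['hero'] -> the goblin's callback never fires
```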


Well, as I said, having some logic executed by an animation would be doing things backwards...

Most of the points in the article are about using an event-driven architecture in the world of distributed systems, where the messages triggering events are delayed by a certain amount, and can be lost, malformed, or different from what the receiver expects.

None of that applies to a single program running on one machine. That's why using an event-driven architecture for a game is so great: I never have to worry about most of the stuff in the article. Also (as opposed to the more common update-driven model), if I do a good job it's possible for my game's logic to be flawless; it can be impossible for it to be in an illegal or weird state. So an event-driven architecture in a game is fantastic for that reason, and the article is quite useless when it comes to event-driven architecture in games. I'm sure it's fine for event-driven architecture in distributed systems, but the title is certainly too broad.

Like with many of the articles posted here, CRUD / web devs forget there are other fields of programming.


To be fair, the author did state several times that this is about distributed event-driven architectures. You can find those outside of a web context quite easily.



