Data-Oriented Architecture

jayd16 · on March 8, 2020

We jump through so many hoops to avoid this terrible pattern. This puts more load and less scaling options on your DB. If you need to upgrade how something is stored, you need to touch every service. And they didn't really solve their problem.

If you can't keep track of what services call each other, what makes you think you'll be able to keep track of who is writing out what data?

Reading how they did SOA (spaghetti of services with highly coupled interactions) I don't think putting that at the data layer is useful in the long run. They even suggest many services operating on the same data as a way to decorate the data as it haphazardly makes it way through this mess of an app. Their original problem was a poorly planned separation of concerns leading to cross service coupling. In DOA that still have it. Its just coupled directly through the datastore with no hope of cleaning it up without a massive migration.

kasey_junk · on March 9, 2020

The simple answer to this question is that all mature data stores have extremely sophisticated tooling for multiple concurrent access. Namespacing, views, access rights, concurrency/locking, migrations, stored procedures, etc are all things you end up poorly implementing at the app layer that you get out of the box at the data layer.

Every rdbms left on the stage was built with the assumption that lots of disparate systems would be coordinating within them.

I’ve yet to see a SOA system with half the tooling to support this. That’s before you get into the performance advantages.

If you’ve done a bad job coupling your data tier it’s because you didn’t know/follow best practices we’ve had for at least 30 years. Don’t blame the architecture for that.

zie · on March 9, 2020

Agreed with kasey_junk.

Also assuming you are sane and give every service/user/etc their own user account into the DB, your DB logging will easily show user X did action Y.

Databases these days scale very well. PostgreSQL, which is a great OSS database can scale very well out of the box on any single system, and the size of X86 boxes are getting pretty giant these days(not to mention other platforms). Plus there are loads of 3rd party, but well supported, options for scaling past a single instance.

There are other systems like FoundationDB, etc that scale well out of the box, but I'd argue most people don't actually need to scale that large, 99% of us will never get to Google size and one can get very far on a single DB instance. Especially for new projects, scaling should be near the bottom of your todo list, until it starts to hurt, and then the general answer is, just throw money at the problem, since if you are having scaling problems, you better not also be having money problems or you likely have larger problems than scaling your DB.

bcrosby95 · on March 9, 2020

You can accomplish this with a monolith too though. The service part is doing no work. And a stateless monolith is even easier to scale than a database.

Monoliths certainly can reach their breaking point for other reasons. But they can take you very, very far too. Depending upon your exact use case, a stateless monolith, Postgresql, and something like Redis can handle millions of daily users.

It's been a while since I've had to do this, about 10 years. The breaking point for our use case was around 6 million daily users. I would imagine you can take it further today, but haven't had the need to so I don't really know.

machinecoffee · on March 10, 2020

I would say nowadays, from a scale-up perspective we've never had it better.

Multi-core CPUs are cheaper than ever, and terabytes of RAM are not uncommon.

The only problem is that to have an HA solution with these monster machines, you need at least 2 of them, and maybe a similarly sized test system.

jayd16 · on March 9, 2020

>If you’ve done a bad job coupling your data tier it’s because you didn’t know/follow best practices we’ve had for at least 30 years. Don’t blame the architecture for that.

Well that's exactly my point. The author is saying this architecture will save them and it wasn't the architecture at all.

bjornjajajaja · on March 8, 2020

Hmm. I think it could work well. Thinking if Qt signals/slots mechanism but using database. If you have separate tables for publish subscribe (one writer multiple readers) it would be quick.

And to be honest, the services are loosely coupled between each other that is a great benefit. It’s not a “spaghetti of services” because a good design will carefully consider them.

Plus it could give some good benefits for auditing trails too and consistency in data storage/archiving.

jayd16 · on March 9, 2020

>because a good design will carefully consider them.

I don't want patterns where I have to be good. Anything is possible if you're good but how does the pattern help and protect the developer?

That said, as a developer, how do I know some data should be owned by one service or another? Should I notify other services when I change data? Where is the separation of concerns? Is the DAL a monolithic codebase that knows who to notify or what event to write when things are updated?

I like event queues but instead of using something like ZeroMQ they seem to suggest using a custom rolled table. If I want to add a new event listener such that I have more than one, do I have to know that at event write time and write n messages? Do I store message completion separate from the message itself? Does the DAL handle that by knowing who needs what? Aren't we back to thinking about all those service interactions but in the DAL? So much work is glossed over that I can only assume the author hasn't worked with queues or an ESB very much.

The ultimate question though is, why are these services even separate if they're mutating the same data? Just merge the services (or at least the operation in question) instead of rewriting the entire app. Way easier than orchestrating all of your systems through a monolithic DAL.

code_biologist · on March 9, 2020

> The ultimate question though is, why are these services even separate if they're mutating the same data? Just merge the services (or at least the operation in question) instead of rewriting the entire app.

Because if you did that you'd have a monolith, and microservices are really fashionable right now.

collyw · on March 9, 2020

I am not sure if you are being serious or not.

Being fashionable is the worst possible reason for making technology choice, but sadly one which many developers mindlessly choose.

code_biologist · on March 9, 2020

I agree with you, but I am being serious that there are developers stumbling into poor microservice architectures based on "microservices are best practice", ie. fashion.

atomicity · on March 8, 2020

I liked the clarity of the article and the reasoning is solid. Still, I'm not too convinced:

- A key reason for splitting up a monolith into services is because collaboration becomes too costly with 100s to 1000s of developers working on the same code. Your build & test system can't handle the number of commits. Your code takes too long to build. Would the data-access layer not them become the development bottleneck in such scenarios, meaning that it doesn't scale as well as SOA?

- What are the advantages and disadvantages of centralizing in the data-oriented layer over centralizing through a network-related tool like Envoy, Kubernetes, or Istio?

- The database layer itself often becomes a performance bottleneck, which requires us to run a sharded database. Some companies go the extra distance by running in-memory databases, document databases, and time-series databases. In such cases, wouldn't the data access layer need to support federation, which is a hard problem according to database research?

- Is O(N^2) really that big of a problem? It seems like the problem can be reduced to something simpler: developers cannot easily understand which services communicate with one another. If that is the case, would a visualization tool be sufficient?

jeffffff · on March 8, 2020

yes, at scale this turns the database into the bottleneck. this is not strictly a downside though, as it means you can centralize ownership of your database to one team of experts who can handle optimization, capacity planning, sharding, multi-tenancy, security, monitoring, etc for everyone. most product teams do not and should not need people with that expertise, so if you have multiple products or services running into scalability issues this approach can be a far more cost effective way of solving them than having each product or service handle these issues independently.

while they do not use the term "data-oriented architecture", many of the largest web companies use what is effectively this approach and have teams dedicated to building and maintaining a shared data layer. some examples:

google - spanner

youtube - vitess, migrated to spanner

facebook - tao

uber - schemaless

dropbox - edgestore

twitter - manhattan

linkedin - espresso

notably absent is amazon. amazon has taken the full blown microservices approach where anyone can do whatever they want. worth noting is that amazon is in a very sad place when it comes to data warehousing and analyzing data across teams/products/etc. while the shared database approach is strictly intended for OLTP use cases and explicitly not meant for OLAP use cases, having a common interface to all data and something approaching a data model makes it extremely easy to replicate all your data out into a data warehouse or data lake or whatever you want to call your system for your OLAP workloads. with the 'every service has its own database' model, each team has to be responsible for replicating their data to analytics systems, and that is usually not super high on their priority list relative to product features. this problem is magnified when people from a different team want to consume data from that team's product/service but the team producing the data has no incentive to make it available. in large organizations (including amazon) this is a huge issue for teams who mostly do analysis, reporting, marketing, and other activities where they primarily consume data produced by others.

pm90 · on March 9, 2020

Choosing an architectural design simply because it makes data warehousing easier doesn’t seem like a good enough reason to me.

You give examples of all the Big Tech having such shared DBs but that seems like more of a reason to not use that pattern. Good DBAs are hard to find and not many people choose to become DBAs anymore. Big Tech can hire the experienced ones since they can compensate them pretty well; most companies can’t. The shared DB therefore becomes a critical bottleneck to the business.

jeffffff · on March 9, 2020

beyond some fairly large size of company it's less that it makes data warehousing easier and more that it makes centralized data warehousing possible.

fortunately this type of environment is available today as a managed service in a few different offerings. gcp has spanner and vitess is available as a managed service on multiple cloud providers from planetscale.

closeparen · on March 9, 2020

Centralized data warehousing is possible as long as you constrain the number of distinct database engines and provide connectors for those. Services having private databases doesn't preclude data warehousing. It's why we have data warehousing! To enable joins across data from different silos.

closeparen · on March 9, 2020

Those things are database engines. Services can and do get their own instances. What those managed storage teams provide is akin to Amazon RDS, not one big database.

jeffffff · on March 9, 2020

yes they are database engines, but in many if not most of these cases there is only a single instance that is shared across all products at the company. it is very different than the rds model. of course there are access controls and abstractions such as schemas and tables but there aren't silos between data from different services

closeparen · on March 9, 2020

I guess neither of us want to out our employment history here, but for the ones I know about, that's absolutely not true. Reading from another service's database is sometimes possible but always considered hacky tech debt.

jeffffff · on March 9, 2020

from what i've heard from reliable sources there are only 2 spanner clusters, one for ads and one for everything else. i'd be surprised if there isn't an isolated one for gcp but for internal google products there are only 2. i also have it on good authority that data sharing between services through edgestore and tao is common. i have less insight into the others.

collyw · on March 9, 2020

- A key reason for splitting up a monolith into services is because collaboration becomes too costly with 100s to 1000s of developers working on the same code. Your build & test system can't handle the number of commits. Your code takes too long to build. Would the data-access layer not them become the development bottleneck in such scenarios, meaning that it doesn't scale as well as SOA?

You can't actually remove complexity like this, just push it to the dev ops layer. And also it makes setting up a development environment a lot more difficult.

Eyas · on March 9, 2020

> Would the data-access layer not them become the development bottleneck in such scenarios, meaning that it doesn't scale as well as SOA?

In terms of development, no, the data layer code grows sublinearly with the with the size/breadth of the schema/data. The data access layer is not much more than a database (plus usually, to enable event-driven programming, some semblance of subscriptions/notifications when data in your query changes). But it's fairly generalizable, and doesn't depend on the size of the team or schema using it.

> Is O(N^2) really that big of a problem? It seems like the problem can be reduced to something simpler: developers cannot easily understand which services communicate with one another. If that is the case, would a visualization tool be sufficient?

It really depends on how complex the system is. At some point, a visualization stops being helpful. There's obviously room to simplify a SOA dependency graph to look reasonable, and many do this successfully. But DOA is another interesting option in the toolkit: turn the problem on its head and say: maybe there's no graph at all.

apalmer · on March 8, 2020

This was the predominant architectural design about 20 years ago.

There are many strong points for this the general challenges are...

Databases don't/didn't have too much in the way of integration primitives...

Databases can become a performance bottleneck that can only be scaled vertically...

SQL language is not that great as an application programming language

EDIT: forgot the biggest one which is, it is completely up to discipline to produce any kind of separation between implementation details and public api since everything lives in the database. That's really the biggest challenge.

marcosdumay · on March 9, 2020

> Databases don't/didn't have too much in the way of integration primitives...

Hum... They have the best and most diverse set of integration primitives available. Services architectures (micro, SOA, and whatever) did never actually reach parity to them.

Your other points are good (DBMSes do scale horizontally, but it's not easy nor nice), agreed on everything. But they still do not beat the capacity DBMSes have for integrating stuff on most applications, so this is still a good paradigm.

apalmer · on March 9, 2020

I am quite serious when I ask for more specifics on the types of integration primatives you are talking about?

My experience has been the opposite here are some examples:

Number of times I had to implement or maintain hand rolled queues in the database.

Number of times I had to implement a web service whos only purpose was to expose database data to the world.

Number of ETL processes I wrote just to handle some data daily data input from a third party.

Number of times I had to use comparatively complex SQL techniques to iterate over a list of rows because SQL is intended for set based operations not iterative processing ..

Number of times I had to do tedious text templating to generate HTML or XML or JSON or any kind of heirarchical data format that is easily consumable by a non database system.

That's what I am talking about

Which isn't to say this is a bad architecture. Just that database as integration platform has its challenges, a lot of them.

dathinab · on March 8, 2020

It's in a certain way a very roundabout way to describe a event based SOA system.

Many of the problems described are problems from SOA systems which are using a number of common anti-patterns for SOA systems.

Also the way this person describes DOA is prone to end up with a monolith in _data_ and a just the logic is not monolithic but might end up accidentally being quite tight coupled. (Through it does have some benefits).

Also even with DOA you can end up with internal state coupling between services if you do it wrong. It's harder then in some bad designed SOA systems but IMHO roughly as likely as a SOA system communicating with events.

---

- So use events if you do SOA (for inter service communication)! - Never rely on the internal state of another Service. - Make sure you don't send events to a specific service, instead "just" send events, then all services interested _in that event_ will receive it. (Make sure to subscribe for events independent of their source not services; E.g. use a appropriate event broker or service mech; Sometimes just broadcasting is fine; Oh and naturally storing the event and making that trigger other services works, too. At which point we are back at DOA )

----

DOA is not bad just IMHO misguiding. If you want to use it look at common problems with event based systems for vectors of potential problems wrt. accidental internal state coupling as many of this will apply to DOA, too (if your system becomes complex enough).

Note that I don't mean all problems caused by combining eventual consistency + high horizontal scaling with event systems. Sadly this is just very often mixed up.

meowface · on March 8, 2020

What are some good articles/resources on event-based SOA? I've been hearing about event-driven architectures a lot in the past few years.

dathinab · on March 8, 2020

Honestly I'm not sure. I listened to some grate talks on youtube about event sourcing mainly for with a scalability focus but I don't remember by whom they where. Most good articles I read about event sourcing for traceability/replyability where pretty old even when I read them a view years ago. Also many articles I read where pretty messy wrt. what applications of event sourcing help with which problems and have which consequences :=(

Give me a minute I will try to find at least some of the sources, but don't get your hopes up.

dathinab · on March 8, 2020

I think that one was good:

- https://www.youtube.com/watch?v=STKCRSUsyP0

I wasn't able to find any other talk I watched back then or any of the stuff I did read, but I then last time I looked into talks and reading material about this was ~2.5 Years ago and while a lot of new software and tooling was done since then the principles didn't change. Try some of the other GOTO; talks about it if you like listening to talks, they tend to be quite good.

This was interesting as far as I remember but not what I was looking for:

- https://www.youtube.com/watch?v=CZ3wIuvmHeM

Groxx · on March 12, 2020

Martin Fowler talks about it quite a lot, e.g. "event-sourcing" and "CQRS": https://martinfowler.com/articles/201701-event-driven.html (there are many articles, this just came up with a quick google)

zmmmmm · on March 8, 2020

I like this architecture but I worry that in sufficiently complex systems it can lead to some significant complexity if there are entities within the database that have a lot of contention around them. For example, it may end up translating to deadlocks etc. when two services that are naive to each other start locking tables in interleaved sequences of interactions. It still seems wise even if you follow this pattern to segment areas of the database to management of particular services or consumers and then have those present APIs or message passing interfaces to each other. Which leads you half way back to SOA or microservices.

Eyas · on March 8, 2020

Yeah, totally. The way I've seen this work is that a single service will "own" a record, so you never are running into a multi-writer situation for a given record.

While this seems closer to SOA, the key difference here is that single Type or Table can still have multiple produces (of non-overlapping records). In a trading system, you'd have producers of RFQs from makretplace A, B, C, etc. but for each single row in that table, the same service "owns" it. So you still get the benefits of not caring about the DAG/callgraph or knowing about the individual service that calls you.

Locking might be hard if you're doing some transactional change. Those become harder to do. But the half good news is that shifting to an "Event-based" programming mindset might mean you run into less of these.

But yeah, there's a whole new set of drawbacks.

nine_k · on March 8, 2020

But what's the point to keep them in the same table then?

Won't having several independent (and maybe physically remote) tables, one per service, solve the problem better? You can still `union all` them for analytical purposes.

Eyas · on March 8, 2020

Same schema, and the service doing the query doesn't need to worry about which tables to union. (Problem with union all is that it reintroduces addressability and a form of direct component interaction)

But sure, you can probably implement that with views etc. also.

bencollier49 · on March 8, 2020

Hmm. Does this not turn the data access layer into a service bus?

This could get very messy at scale.

jayd16 · on March 8, 2020

This is the correct take. They just reinvented the ESB.

james_s_tayler · on March 9, 2020

Sub that database graphic out for the Kafka logo and it sounds exactly like everyone else's recommendations lately.

sadness2 · on March 9, 2020

Yikes. You haven't reduced inter-component communication. You've basically got an interface distributed across all your components, and defined as a data schema. Any change to this can now potentially break communication between any two or more processes.

awinter-py · on March 8, 2020

using schemas rather than APIs to share information between components seems right

IMO one of the reasons CRUD is so hard now is that the schema is different at every layer of the product

Slight differences between layers are necessary for permissions / privacy, but there are probably better ways to get that done than to reimplement the schema at every layer.

dathinab · on March 8, 2020

I was thinking about this recently a lot as in my experience the difference in the API is often a major source of overhead (not necessary complexity).

- I think a major problem is in the difference between the way you can layout thinks in the storage layer and your application.

- Another is where to evaluate correctness (e.g. in service checks + DB-system constraints, etc.).

- Another major think is that different actions/taks take different slices of the same data. One area where dynamic typed languages can have a clear benefit.

- Different actions/tasks in the same system work better with different representations.

---

What I currently think is helpful is to:

- learn from Entity-Component Systems (from Games) for slicing data of the same entity. (but can lead to problems with consistency across slices, transactions can help if doable).

- I would love to have a DB which can somehow do algebraic data types/sum types/tagged union/rust enum (all different words for roughly the same concept).

- Be very strict about preventing coupling of internal state, mixup of service responsibilities in logic/endpoint _and_ mixup of this responsibilities in data. Which e.g. means that e.g. you avoid any FK between schemas owned by different services, even through it often seems usefull at the beginning.

- Have a _system wide_ schema for all entities split up into small slices which if combined together from the entity and only use that schemas in the system no service specific schemas.

- Specify this schema somewhere, preferably generate data types and similar from it, preferably have some opt-in strict schema validation (enabled during part of testing).

- Use events for communication, use the schemas from above here, too.

- Have a well defined way to represent "patch"/"update" queries, I have run to often into the problem with JSON of "reseting/deleting" a values vs. just not changing it (null value vs. field not given) and/or nested optionality.

dathinab · on March 8, 2020

Also depending on what you do using event sourcing can be helpful, too. Or can be a big chunk of unneeded additional work. Be aware that there are 2 ways to do event sourcing one focused on traceability and maybe even replay-ability which has a choke point in that you have a single sequential log, i.e. it doesn't scale horizontally. Or the one which uses it to archive huge horizontal scalability but on the cost of problems with (more) eventual consistency and a very very hard time to get system wide transactions right. (Best of both worlds is if you can shard your system into _independent_ subsystems and have a use case where you can rely on it to not need too much throughput per shard so that you can go with sequential. E.g. like a chat system which can shard per channel all messages and user management per workspace/group).

tyingq · on March 8, 2020

DOA isn't a great acronym :)

dathinab · on March 8, 2020

Yes it's _Domain_ orientated architecture. Which is a less well known think since I think 10+ years ;=)

james_s_tayler · on March 9, 2020

Dead On Arrival

evanmoran · on March 8, 2020

Well said, maybe Model Oriented Architecture (MOA) is better?

hestefisk · on March 8, 2020

Congrats, you just reinvented the monolithic database with a bunch of functions on top. Everything old is new again.

janci · on March 8, 2020

Would it be possible make interactive systems with this architecture? The requester needs to be notified about result availability, so some communication must be targeted to specific component (the requester) going against the foundation idea of this pattern.

Eyas · on March 8, 2020

I didn't cover it in the article, but the key to get interactivity would be to switch to a datastore layer that supports subscribable queries.

I'm aware of Esper[1] which tries to do this. And maybe/arguably Firebase (?)

[1]: http://www.espertech.com/esper/

bradleyankrom · on March 8, 2020

I believe RethinkDB also has this functionality, but I’m not 100% certain. Will check the docs later and update...

ChicagoDave · on March 8, 2020

This is nothing new and generally how SOA began in the 2000’s.

One of the critical improvements to SOA is DDD (domain-Driven Design) where context matters and boundaries should include the service and its data.

Data oriented architecture a bad idea. Period.

Eyas · on March 8, 2020

While it's true that SOA generally began by breaking up services into separate binaries while usually keeping one centralized database, a lot of the specific principles of DOA are a _response_ to the distribution/fragmentation of state seen in SOA and micrservices in the mid 2000s.

DOA is often implemented taking advantage of event-driven programming, pubsub, and message passing, which as generalized practices were not as prominent when SOA began, imo.

bestouff · on March 8, 2020

I thought it was what started the ECS architectures which are all the rage in game engines nowadays, and start bleeding in GUI and other fields too. Doesn't seem such a bad idea to me.

TeMPOraL · on March 8, 2020

Nah, ECS probably got backported. On top of that, ECS as a term suffers from multiple personality disorder - there are different architectures with differing goals all bundled under the same name "ECS".

See: https://news.ycombinator.com/item?id=21897824.

jayd16 · on March 8, 2020

The DOA in this article is not the same data orientation the game industry talks about. This article has nothing to do with storing your data in arrays for tight looping.

macintux · on March 8, 2020

I am skeptical of blanket statements like that. I’ve yet to meet a (real) language or tool or architecture that doesn’t have merit.

Groxx · on March 8, 2020

Even "Big Ball of Mud"-design has some benefits: http://www.laputan.org/mud/

ChicagoDave · on March 9, 2020

If you have a simple system, fine. If you have a complex system, any hint at a monolithic future has the makings of bad architecture.

Data storage should reflect a bound context and its domains. It could be relational, graph, document, or key/value.

Putting all data in one place just because it’s convenient is ignoring your business capabilities.

DDD proves that aligning your technology to our business, reducing complexity, having real boundaries, is the way forward.

banq · on March 9, 2020

functional architecture is the future: https://increment.com/software-architecture/primer-on-functi...

stilisstuk · on March 8, 2020

Any good sources on DDD?

adamnemecek · on March 8, 2020

I thought this book was solid

https://www.amazon.com/Data-oriented-design-engineering-reso...

james_s_tayler · on March 9, 2020

Domain Driven Design Distilled is the best place to start. It compresses the topic down into only 200 odd pages rather than the other giant books on the subject. I'm reading it at the moment and it's starting to click far more than it ever has.

Also, don't just try to watch conference talks and read blog posts to understand it. It leaves too many gaps and a fuzzy understanding.

stevetodd · on March 8, 2020

“Domain-Driven Design: Tackling Complexity in the Heart of Software“ by Eric Evans seems to be the preeminent book on the subject.

dzikimarian · on March 8, 2020

It's valuable, but very hard to read. I would recommend implementing DDD by Vaughn Vernon first. Also understanding event storming helps a lot.

ChicagoDave · on March 9, 2020

Google videos of Eric Evans and others talking about DDD, Event Storming, and CQRS.

WesternStar · on March 9, 2020

So in this scenario are we essentially reimplementing Apache Kafka? Producers who create events that are then read by consumers and persisted on a database layer. What am I missing?

Eyas · on March 9, 2020

Not really reimplementing, Kafka is a great example of a technology you can structure a DOA around.

james_s_tayler · on March 9, 2020

Reinventing a less scalable version of Kafka/ESB is basically what this is.

discreteevent · on March 8, 2020

A central message broker that intermediates between services also solves the problems described in the "Problems of scale" paragraph. It might be worth mentioning this.

fulafel · on March 9, 2020

If you put your data in CSV files on your file system and had different programs on the same server access those files.. would it be a data oriented architecture?

dksidana · on March 9, 2020

Another interesting point of this architecture, that you can plan your cache really well.

throwawaygo · on March 8, 2020

Cache rules everything around me.

thecleaner · on March 8, 2020

So on a high level redux but for backend (Store-reducers-components) ?

lincpa · on March 9, 2020

Industrial-grade Data-Oriented Architecture:

[The Pure Function Pipeline Data Flow v3.0 with Warehouse / Workshop Model](https://github.com/linpengcheng/PurefunctionPipelineDataflow)

1. Perfectly defeat other messy and complex software engineering methodologies in a simple and unified way.

2. Realize the unification of software and hardware on the logical model.

3. Achieve a leap in software production theory from the era of manual workshops to the era of standardized production in large industries.

4. The basics and the only way to `Software Design Automation (SDA)`, just like `Electronic Design Automation (EDA)`.