Few data mesh proponents ”force” a particular storage medium - and the concept i...

bitsondatadev · on July 30, 2021

You say, “one cannot argue performance of a data warehouse” but that’s precisely the issue with a DW. DW requires a lot of work to move data from the way that domains model their data on the service layer to how data is modeled in a central DW. You have to wait for data to become live to even begin running analysis on it. Setting up and worst of all maintaining pipelines is an expensive undertaking in both time and money.

It’s not to say the DW is bad and never the solution. The problem is making it the only solution and not providing domains the flexibility to model data the way they need it. You say it’s more complex to manage but that’s the idea behind data mesh, you don’t manage that part, the team with their domain knowledge and data solution does. They can make it as simple or complex as they want internally but if they follow the standards to play in your data mesh who cares? Not your problem. For example say a domain needs realtime data analytics and use something like Druid to store their data. That’s fine. If they want to play in the data mesh you’ve provided, they just need to follow the rules in their data model, but they don’t need to use a cloud DW to do that.

You can’t argue that avoiding the copying of terabytes of data a day from a domain to a DW is more performant than adhoc analysis (MB to GB) of that data. Why move or copy a dataset when you don’t need to? Why force domains to use any solution that’s not actually solving their domain problem?

verbbis · on July 30, 2021

I have to disagree with almost all of the points you raise.

The degree to which you wish to (re)model data is a design decision - also within a DW. This is what I mean by divorcing the tech from the practice. Methods to store and manipulate semi-structured data exist within cloud data warehouses and there is nothing inherent in the technology preventing from extending said support. Also, it seems to me, that even in advanced analytics incl. ML one eventually works with the data in a tabular form anyway.

When it comes to performance, I was referring primarily to e.g. cross-domain queries. That continues to be challenging in data virtualization / federated query engines.

You (and data mesh) focus on the domains - and nothing prevents carving out ample portions of storage and compute from a cloud DW to them and doing the exact things you propose. Except that "federated computational governance" and enforcing standards is now heaps easier since everyone is relying on the same substrate.

Data mesh targets analytical workloads - and surely you do not suggest e.g. hooking Trino up directly to operational, OLTP databases? One has to, as least how things stand today, copy the data somewhere anyway for not the least historical analysis, as well as often transform it to be understandable to a downstream (data product) consumer. So you will have "pipelines" no matter what - even in a data mesh.

Not only are the domain teams free to build the aforementioned pipelines and model data in any way they see fit - within a cloud-native DW, but also the DataOps tooling available in this area is already relatively mature.

The only part I can somewhat relate to is about "forcing" to rely on a specific piece of technology. But I think that is just something one needs to accept in a corporate setting and a balancing act. And I'm talking about the majority of the regular companies out there, not FAANG. Also, there other arguments to be made - such as avoiding vendor lock-in or if the capabilities of the DW simply does not cater to the specific problem your domain has. But these are not the arguments you made.

bitsondatadev · on Aug 1, 2021

Agree to disagree then I guess.

"When it comes to performance, I was referring primarily to e.g. cross-domain queries. That continues to be challenging in data virtualization / federated query engines."

Have you tried Trino lately? Data virtualization like Denodo still relies on moving data back between engines to execute a query, Trino pushes the queries down to both systems and processes the rest in flight. The fact that you use both of those interchangeably makes me think you may not have tried it.

"Data mesh targets analytical workloads - and surely you do not suggest e.g. hooking Trino up directly to operational, OLTP databases?" Not operational data. Are you saying teams aren't allowed to store immutable data in PostgreSQL?

"Not only are the domain teams free to build the aforementioned pipelines and model data in any way they see fit"

You will organically see teams build data infrastructure with different tooling. This happens when you don't tell a team they have to use a DW to play the analytics game. I have never seen a company that has multiple teams that organically land on using the same tech. So naturally (that is without forcing the to use one substrate i.e. DW) you will have different teams (or domains) using different databases. This is why we have DW to begin with. They were literally created to be a copy of domain data that naturally lied in many operational databases.

I get that in an ideal world, there would be some magical one size fits all solution that everyone would just use. However, that system doesn't exist. It's certainly not the DW. DW can be one of those solutions, just not THE one.

verbbis · on Aug 1, 2021

"Trino pushes the queries down to both systems and processes the rest in flight."

This is how many data virtualization techniques work as well (e.g. PolyBase from Microsoft) - predicate pushdown is not exactly new. That is why I mentioned both approaches. However, according to my experience, the degree to which this helps is highly workload-dependent. Do you happen to have some reference which would help me understand how e.g. cross-domain joins between large datasets can be optimized with this approach?

"Are you saying teams aren't allowed to store immutable data in PostgreSQL?"

Of course they can. But unless you assume everyone (incl. the numerous COTS and legacy applications in any "normal" enterprise) is practicing event sourcing and/or their internal data models are somehow inherently understandable and usable by downstream consumers, a pipeline of some sort is required. That was my point. And if so, what's the difference for the team to target a dedicated DB within a cloud DW instance that speaks the PostgreSQL dialect - or close vs. a separate PostgreSQL instance?

Furthermore, if you think standardization of tools is not possible within an enterprise and everyone just does their own thing anyway - and mind you, despite suggesting Trino as one such tool yourself, I have low hopes of getting the data mesh standards for governance ending up being adopted either.

bitsondatadev · on Aug 2, 2021

"Do you happen to have some reference which would help me understand how e.g. cross-domain joins between large datasets can be optimized with this approach?"

Check out the original Presto paper (To be clear Trino is formerly PrestoSQL). https://trino.io/Presto_SQL_on_Everything.pdf. Also check out the definitive guide https://trino.io/blog/2021/04/21/the-definitive-guide.html for an even deeper dive.

"(incl. the numerous COTS and legacy applications in any "normal" enterprise)"

Funny you bring this up. In my last company we had a couple legacy apps that we didn't even have the code for any more. Just the artifacts (the license for these services were expiring and we were just waiting to kill them off). For the year or so that we had, we were able to write a simple Trino connector to pull these values through the API, and represent them as a table to Trino. And that's the point to your last question. It's the fact that you can meet the legacy code, or RDBMS database, or NoSQL database, or whatever the heck all these different teams own and get access to it in one location without migrating data and maintaining pipelines.

"I have low hopes of getting the data mesh standards for governance ending up being adopted either."

My view is that data mesh should be opt-in and flexible. Each team can pick and choose what portion of their data is modeled and exposed. If a team models their data in a way that doesn't align with some central standard, then they just need to create a view or some mapping that exposes their internal setup to match the central standard. Per the data mesh principle of federated computational governance, each team has a seat as to what these standards are. There can be some strict standards, and some that are more open. Teams/Domains only need to opt-in or concern themselves with the standards that are being requested of their data by the consumers. All the rest they can opt out.

I personally think avoiding company-wide standards is likely the best way to approach adoption. You basically have a list of well documented standards that can easily be searched and understood by consumers (analysts, business folks, data scientists, etc..) and it's up to the domains/consumers to negotiate standards on more of an adhoc basis. Therefore participating in the data mesh isn't some giant meeting everyone needs to join to grow consensus around. It can be much more distributed and less invasive.