
Agree to disagree then I guess.

"When it comes to performance, I was referring primarily to e.g. cross-domain queries. That continues to be challenging in data virtualization / federated query engines."

Have you tried Trino lately? Data virtualization tools like Denodo still rely on moving data back between engines to execute a query; Trino pushes the queries down to both systems and processes the rest in flight. The fact that you use the two interchangeably makes me think you may not have tried it.
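
To make the pushdown point concrete, here is a minimal sketch of a cross-domain join issued through Trino's Python client (pip install trino). The coordinator host, catalog names, schemas and tables are all hypothetical placeholders, not anything from this thread; filters and projections get pushed down to each source where its connector supports it, and the join itself runs on Trino workers in flight.

    # Minimal sketch: a cross-domain join through the trino Python client.
    # Host, catalogs, schemas and tables are hypothetical placeholders.
    import trino

    conn = trino.dbapi.connect(
        host="trino.example.com",  # hypothetical coordinator
        port=8080,
        user="analyst",
        catalog="postgresql",      # assumed catalog backed by a domain's PostgreSQL
        schema="public",
    )
    cur = conn.cursor()

    # One side of the join lives in PostgreSQL, the other in object storage (Hive catalog).
    # Predicates are pushed down to each source; the join and aggregation run in Trino.
    cur.execute("""
        SELECT o.customer_id, count(*) AS order_count, max(e.event_ts) AS last_seen
        FROM postgresql.public.orders AS o
        JOIN hive.web.events AS e
          ON o.customer_id = e.customer_id
        WHERE o.order_date >= DATE '2021-01-01'
        GROUP BY o.customer_id
    """)
    for row in cur.fetchall():
        print(row)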

"Data mesh targets analytical workloads - and surely you do not suggest e.g. hooking Trino up directly to operational, OLTP databases?" Not operational data. Are you saying teams aren't allowed to store immutable data in PostgreSQL?

"Not only are the domain teams free to build the aforementioned pipelines and model data in any way they see fit"

You will organically see teams build data infrastructure with different tooling. That's what happens when you don't tell a team they have to use a DW to play the analytics game. I have never seen a company where multiple teams organically land on the same tech. So naturally (that is, without forcing them to use one substrate, i.e. a DW) you will have different teams (or domains) using different databases. This is why we have DWs to begin with: they were literally created to be a copy of domain data that naturally lives in many operational databases.

I get that in an ideal world, there would be some magical one-size-fits-all solution that everyone would just use. However, that system doesn't exist. It's certainly not the DW. A DW can be one of those solutions, just not THE one.



"Trino pushes the queries down to both systems and processes the rest in flight."

This is how many data virtualization techniques work as well (e.g. PolyBase from Microsoft) - predicate pushdown is not exactly new. That is why I mentioned both approaches. However, in my experience, the degree to which this helps is highly workload-dependent. Do you happen to have some reference which would help me understand how e.g. cross-domain joins between large datasets can be optimized with this approach?

"Are you saying teams aren't allowed to store immutable data in PostgreSQL?"

Of course they can. But unless you assume everyone (incl. the numerous COTS and legacy applications in any "normal" enterprise) is practicing event sourcing and/or their internal data models are somehow inherently understandable and usable by downstream consumers, a pipeline of some sort is required. That was my point. And if so, what's the difference between the team targeting a dedicated DB within a cloud DW instance that speaks the PostgreSQL dialect (or close to it) and targeting a separate PostgreSQL instance?

Furthermore, if you think standardization of tools is not possible within an enterprise and everyone just does their own thing anyway (mind you, despite suggesting Trino as one such tool yourself), then I have low hopes of the data mesh governance standards being adopted either.


"Do you happen to have some reference which would help me understand how e.g. cross-domain joins between large datasets can be optimized with this approach?"

Check out the original Presto paper (to be clear, Trino was formerly PrestoSQL): https://trino.io/Presto_SQL_on_Everything.pdf. Also check out the definitive guide for an even deeper dive: https://trino.io/blog/2021/04/21/the-definitive-guide.html.

"(incl. the numerous COTS and legacy applications in any "normal" enterprise)"

Funny you bring this up. At my last company we had a couple of legacy apps that we didn't even have the code for anymore, just the artifacts (the licenses for these services were expiring and we were just waiting to kill them off). For the year or so that we had left, we were able to write a simple Trino connector that pulled these values through the API and represented them as tables in Trino. And that's the point of your last question: you can meet the legacy app, or the RDBMS, or the NoSQL database, or whatever the heck all these different teams own, and get access to it in one location without migrating data and maintaining pipelines.

"I have low hopes of getting the data mesh standards for governance ending up being adopted either."

My view is that data mesh should be opt-in and flexible. Each team can pick and choose what portion of their data is modeled and exposed. If a team models its data in a way that doesn't align with some central standard, it just needs to create a view or some mapping that exposes its internal setup to match that standard. Per the data mesh principle of federated computational governance, each team has a seat at the table in deciding what these standards are. There can be some strict standards and some that are more open. Teams/domains only need to opt in to, or concern themselves with, the standards that consumers are actually requesting of their data. All the rest they can opt out of.
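
As a purely illustrative sketch of that "view or mapping" idea (every catalog, schema and column name below is made up, and the target catalog would need to be one that supports views, such as Hive or Iceberg), a team could publish a standard-conforming view over its internal table without touching its internal model:

    # Hypothetical sketch: expose an internal table under a centrally agreed shape.
    # Every identifier here is illustrative, not from the thread.
    import trino

    conn = trino.dbapi.connect(
        host="trino.example.com", port=8080, user="customer-team",
        catalog="hive", schema="customer_domain",  # assumed catalog with view support
    )
    cur = conn.cursor()
    cur.execute("""
        CREATE OR REPLACE VIEW hive.customer_domain.customers_std AS
        SELECT
            cust_id               AS customer_id,   -- internal name -> standard name
            lower(email_addr)     AS email,
            CAST(created AS date) AS signup_date
        FROM hive.customer_domain.cust_master       -- the team's internal table, unchanged
    """)
    cur.fetchall()  # consume the result so the statement fully executes
    print("customers_std published")

Consumers then query customers_std like any other table, and the team stays free to refactor its internal table as long as the view keeps honoring the agreed contract.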

I personally think avoiding company-wide standards is likely the best way to approach adoption. You basically have a list of well-documented standards that can easily be searched and understood by consumers (analysts, business folks, data scientists, etc.), and it's up to the domains and consumers to negotiate standards on more of an ad hoc basis. That way, participating in the data mesh isn't some giant meeting everyone needs to join to build consensus; it can be much more distributed and less invasive.




