
Agree to disagree then I guess.

"When it comes to performance, I was referring primarily to e.g. cross-domain queries. That continues to be challenging in data virtualization / federated query engines."

Have you tried Trino lately? Data virtualization tools like Denodo still rely on moving data back between engines to execute a query; Trino pushes the queries down to both systems and processes the rest in flight. The fact that you use the two interchangeably makes me think you may not have tried it.
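
To make the pushdown point concrete, here is a minimal sketch of a cross-domain join issued through Trino's Python client (pip install trino). The coordinator host, catalog names, schemas and tables are all hypothetical placeholders, not anything from this thread; filters and projections get pushed down to each source where its connector supports it, and the join itself runs on Trino workers in flight.

    # Minimal sketch: a cross-domain join through the trino Python client.
    # Host, catalogs, schemas and tables are hypothetical placeholders.
    import trino

    conn = trino.dbapi.connect(
        host="trino.example.com",  # hypothetical coordinator
        port=8080,
        user="analyst",
        catalog="postgresql",      # assumed catalog backed by a domain's PostgreSQL
        schema="public",
    )
    cur = conn.cursor()

    # One side of the join lives in PostgreSQL, the other in object storage (Hive catalog).
    # Predicates are pushed down to each source; the join and aggregation run in Trino.
    cur.execute("""
        SELECT o.customer_id, count(*) AS order_count, max(e.event_ts) AS last_seen
        FROM postgresql.public.orders AS o
        JOIN hive.web.events AS e
          ON o.customer_id = e.customer_id
        WHERE o.order_date >= DATE '2021-01-01'
        GROUP BY o.customer_id
    """)
    for row in cur.fetchall():
        print(row)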

"Data mesh targets analytical workloads - and surely you do not suggest e.g. hooking Trino up directly to operational, OLTP databases?" Not operational data. Are you saying teams aren't allowed to store immutable data in PostgreSQL?

"Not only are the domain teams free to build the aforementioned pipelines and model data in any way they see fit"

You will organically see teams build data infrastructure with different tooling. That's what happens when you don't tell a team they have to use a DW to play the analytics game. I have never seen a company where multiple teams organically land on the same tech. So naturally (that is, without forcing them to use one substrate, i.e. a DW) you will have different teams (or domains) using different databases. This is why we have DWs to begin with: they were literally created to be a copy of domain data that naturally lives in many operational databases.

I get that in an ideal world, there would be some magical one-size-fits-all solution that everyone would just use. However, that system doesn't exist. It's certainly not the DW. A DW can be one of those solutions, just not THE one.



"Trino pushes the queries down to both systems and processes the rest in flight."

This is how many data virtualization techniques work as well (e.g. PolyBase from Microsoft) - predicate pushdown is not exactly new. That is why I mentioned both approaches. However, in my experience, the degree to which this helps is highly workload-dependent. Do you happen to have some reference which would help me understand how e.g. cross-domain joins between large datasets can be optimized with this approach?

"Are you saying teams aren't allowed to store immutable data in PostgreSQL?"

Of course they can. But unless you assume everyone (incl. the numerous COTS and legacy applications in any "normal" enterprise) is practicing event sourcing and/or their internal data models are somehow inherently understandable and usable by downstream consumers, a pipeline of some sort is required. That was my point. And if so, what's the difference between the team targeting a dedicated DB within a cloud DW instance that speaks the PostgreSQL dialect (or close to it) and targeting a separate PostgreSQL instance?

Furthermore, if you think standardization of tools is not possible within an enterprise and everyone just does their own thing anyway (mind you, despite suggesting Trino as one such tool yourself), then I have low hopes of the data mesh governance standards being adopted either.


"Do you happen to have some reference which would help me understand how e.g. cross-domain joins between large datasets can be optimized with this approach?"

Check out the original Presto paper (to be clear, Trino was formerly PrestoSQL): https://trino.io/Presto_SQL_on_Everything.pdf. Also check out the definitive guide for an even deeper dive: https://trino.io/blog/2021/04/21/the-definitive-guide.html.

"(incl. the numerous COTS and legacy applications in any "normal" enterprise)"

Funny you bring this up. At my last company we had a couple of legacy apps that we didn't even have the code for anymore, just the artifacts (the licenses for these services were expiring and we were just waiting to kill them off). For the year or so that we had left, we were able to write a simple Trino connector that pulled these values through the API and represented them as tables in Trino. And that's the point of your last question: you can meet the legacy app, or the RDBMS, or the NoSQL database, or whatever the heck all these different teams own, and get access to it in one location without migrating data and maintaining pipelines.

"I have low hopes of getting the data mesh standards for governance ending up being adopted either."

My view is that data mesh should be opt-in and flexible. Each team can pick and choose what portion of their data is modeled and exposed. If a team models its data in a way that doesn't align with some central standard, it just needs to create a view or some mapping that exposes its internal setup to match that standard. Per the data mesh principle of federated computational governance, each team has a seat at the table in deciding what these standards are. There can be some strict standards and some that are more open. Teams/domains only need to opt in to, or concern themselves with, the standards that consumers are actually requesting of their data. All the rest they can opt out of.
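
As a purely illustrative sketch of that "view or mapping" idea (every catalog, schema and column name below is made up, and the target catalog would need to be one that supports views, such as Hive or Iceberg), a team could publish a standard-conforming view over its internal table without touching its internal model:

    # Hypothetical sketch: expose an internal table under a centrally agreed shape.
    # Every identifier here is illustrative, not from the thread.
    import trino

    conn = trino.dbapi.connect(
        host="trino.example.com", port=8080, user="customer-team",
        catalog="hive", schema="customer_domain",  # assumed catalog with view support
    )
    cur = conn.cursor()
    cur.execute("""
        CREATE OR REPLACE VIEW hive.customer_domain.customers_std AS
        SELECT
            cust_id               AS customer_id,   -- internal name -> standard name
            lower(email_addr)     AS email,
            CAST(created AS date) AS signup_date
        FROM hive.customer_domain.cust_master       -- the team's internal table, unchanged
    """)
    cur.fetchall()  # consume the result so the statement fully executes
    print("customers_std published")

Consumers then query customers_std like any other table, and the team stays free to refactor its internal table as long as the view keeps honoring the agreed contract.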

I personally think avoiding company-wide standards is likely the best way to approach adoption. You basically have a list of well-documented standards that can easily be searched and understood by consumers (analysts, business folks, data scientists, etc.), and it's up to the domains and consumers to negotiate standards on more of an ad hoc basis. That way, participating in the data mesh isn't some giant meeting everyone needs to join to build consensus; it can be much more distributed and less invasive.




