In some fields it's more important than others.

In life sciences research to support synthetic control arms, the FDA cares more about the lineage and manipulation of the data than about the data science models used to predict X/Y/Z.

I.e., what was the data originally, what did it end up as prior to ingestion into AI/ML, why was it changed, what steps were involved, etc.
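To make those questions concrete, here is a minimal sketch of what a hand-rolled lineage record could look like in Python. Everything in it is an assumption for illustration: `fingerprint`, `LineageStep`, and `LineageLog` are hypothetical names, the patient rows are made up, and this is not how any particular tool or regulator-approved process does it. It just records, per step, what the data was (input hash), what it became (output hash), what was done, and why.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, List


def fingerprint(rows: list) -> str:
    """Stable SHA-256 hash of a JSON-serializable dataset (hypothetical helper)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class LineageStep:
    name: str         # what step was involved
    reason: str       # why the data was changed
    input_hash: str   # what the data was before the step
    output_hash: str  # what the data ended up as after the step


@dataclass
class LineageLog:
    steps: List[LineageStep] = field(default_factory=list)

    def apply(self, name: str, reason: str,
              func: Callable[[list], list], rows: list) -> list:
        """Run one transformation and record a lineage entry for it."""
        out = func(rows)
        self.steps.append(
            LineageStep(name, reason, fingerprint(rows), fingerprint(out))
        )
        return out

    def to_json(self) -> str:
        """Serialize the full history so it can be stored next to the dataset."""
        return json.dumps([asdict(s) for s in self.steps], indent=2)


# Made-up example data, purely illustrative.
raw = [{"patient": "A", "age": 54}, {"patient": "B", "age": None}]

log = LineageLog()
cleaned = log.apply(
    "drop_missing_age",
    "downstream model cannot handle missing age",
    lambda rows: [r for r in rows if r["age"] is not None],
    raw,
)
print(log.to_json())
```

Chaining further `log.apply(...)` calls gives an ordered history where each step's input hash should match the previous step's output hash, which is the kind of audit trail the questions above are asking for. A real setup would also need timestamps, code versions, and tamper-evident storage, none of which this sketch covers.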

There are not a ton of good out-of-the-box solutions for data lineage, and it's driving me nuts.

We have Apache NiFi, which promises data lineage out of the box and _appears_ to deliver. I've never implemented it, though.

We have Pachyderm, which has some support here, but I don't know much about it.

Beyond that, it appears to be roll-your-own.

I kind of wish there was an accepted best practice for data lineage, but it's - surprisingly - the wild west. And it's completely, 100% required for industry use.