In life sciences research supporting synthetic control arms, the FDA cares more about the lineage and manipulation of the data than about the data science models used to predict X/Y/Z.
I.e., what was the data originally, what did it end up as prior to ingestion into AI/ML, why was it changed, what steps were involved, etc.
There aren't many good out-of-the-box solutions for data lineage, and it's driving me nuts.
There's Apache NiFi, which promises data lineage out of the box and _appears_ to deliver. I've never implemented it though.
There's Pachyderm, which has some support here, but I don't know much about it.
Beyond that, it appears to be roll-your-own.
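For the roll-your-own route, here is a rough sketch of what I mean by capturing "what it was, what it became, why, and through which steps". Everything in it is made up for illustration (`LineageLog`, `LineageStep`, `fingerprint`, the sample rows and step names); it is not how NiFi or Pachyderm do it, and it is nowhere near what a regulated pipeline would actually need.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Callable, List


def fingerprint(rows: List[dict]) -> str:
    """SHA-256 of a canonical JSON dump, so identical data yields an identical hash."""
    payload = json.dumps(rows, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class LineageStep:
    step: str         # what was done
    reason: str       # why it was done
    input_hash: str   # what the data was before this step
    output_hash: str  # what the data was after this step
    rows_in: int
    rows_out: int
    at: str           # UTC timestamp, ISO 8601


@dataclass
class LineageLog:
    steps: List[LineageStep] = field(default_factory=list)

    def apply(self, rows: List[dict], step: str, reason: str,
              fn: Callable[[List[dict]], List[dict]]) -> List[dict]:
        # Hash the input before calling fn, in case fn mutates rows in place.
        input_hash, rows_in = fingerprint(rows), len(rows)
        out = fn(rows)
        self.steps.append(LineageStep(
            step=step,
            reason=reason,
            input_hash=input_hash,
            output_hash=fingerprint(out),
            rows_in=rows_in,
            rows_out=len(out),
            at=datetime.now(timezone.utc).isoformat(),
        ))
        return out

    def to_json(self) -> str:
        return json.dumps([asdict(s) for s in self.steps], indent=2)


# Hypothetical usage: two cleaning steps before the data goes to a model.
raw = [{"id": 1, "age": 54}, {"id": 2, "age": None}, {"id": 3, "age": 61}]
log = LineageLog()

no_nulls = log.apply(
    raw, "drop_missing_age", "model cannot handle null age",
    lambda rs: [r for r in rs if r["age"] is not None],
)
final = log.apply(
    no_nulls, "bucket_age", "protocol defines age in decades",
    lambda rs: [{**r, "age": (r["age"] // 10) * 10} for r in rs],
)

print(log.to_json())
```

Each step's `output_hash` should match the next step's `input_hash`, which gives a chain you can walk from the raw data to whatever was ingested. It says nothing about who ran it, which code version, or where the files live, which are exactly the parts I'd want a real tool to handle.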
I kind of wish there were an accepted best practice for data lineage, but it's, surprisingly, the wild west. And it's 100% required for industry use.