Not to play devil's advocate here, but it is almost certainly sometimes the case that data was used to define a heuristic (potentially using automated statistical methods) that an engineer then formalized as code. Without that data, that specific heuristic wouldn't exist, or at least very likely not in that form. Yet that data does not have to be included in any open source release. And obviously you, as a recipient of the release, can modify the heuristic (or at least the version that was codified), but you cannot reconstruct it from the original data.
I know my example is not exactly what is happening here, but the two sound pretty similar to me, and there seems to be a fairly blurry line dividing them... so I would argue that where "this must be included in an open source release" ends and "this does not need to be included in an open source release" starts is not always so cut and dried.
(A variant of this that happens fairly frequently is when you find a commit that says something along the lines of "this change was made because it made an internal, non-public workload X% faster". If the data that measurement is based on did not exist, or if the workload itself didn't exist, that change wouldn't have been made, or maybe it would have been made differently... so again you end up with logic that exists because of data that is not in the open source release.)
If we want to go one step further, we could even ask: what about static assets (e.g. images, photographs, other datasets, etc.) included in an open source release? Maybe I'm dead wrong here, but I have never heard that such assets must themselves be "reproducible from source" (what even is, in this context, the "source" of a photograph?).
That being said, I sure wish the training data used for all of these models was available to everyone...