Seems to me like they need to back out of this fast and at very least limit it such that it is only trained and then used on "license compatible" projects. eg: train it in isolation on MIT licensed projects and then have the user explicitly confirm what license the code they are working on is to enable it. Possibly they even need to auto-enable a mechanism to detect when code has been reused verbatim and enable some kind of attribution (or respect for other constraints) where that is required by the license.
Alternatively, they'll take it head-on, pay their lawyers to argue fair use, and blaze a new trail through the understanding of copyright application that allows this ML model (and others like it) to exist.
This is ultimately a Microsoft project, and they have Microsoft money and Microsoft lawyers to defend their position.