The more you care about squeaky-clean training data, the more it costs. In return, of course, you get a better model.

ChatGPT was trained on a crawl of the internet, with the rough edges patched up afterwards using alignment techniques such as RLHF or DPO. Big players like Microsoft might have deals with publishers to get textbooks in bulk.

Content from moderated sites can be filtered using the platform's own signals, e.g., only including posts above a minimum length and upvote count.
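As a rough illustration of that kind of filter, here is a minimal Python sketch. The `Post` record, the field names, and the thresholds (200 characters, 5 upvotes) are all made up for the example; a real platform export will look different and the cut-offs would need tuning.

```python
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical record; real platform exports will have other fields.
    text: str
    upvotes: int


def keep_post(post: Post, min_chars: int = 200, min_upvotes: int = 5) -> bool:
    """Return True if the post clears both (illustrative) thresholds."""
    long_enough = len(post.text.strip()) >= min_chars
    well_received = post.upvotes >= min_upvotes
    return long_enough and well_received


def filter_posts(posts, min_chars: int = 200, min_upvotes: int = 5):
    """Keep only the posts that pass the length and upvote checks."""
    return [p for p in posts if keep_post(p, min_chars, min_upvotes)]
```

Calling `filter_posts(posts)` on a list of `Post` objects returns the subset that is both long enough and sufficiently upvoted. Raising either threshold trades data volume for (presumably) quality.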

LLMs can be used to generate and filter data as well. Humans have done this work so far, but they may have to do less of it in the future, mostly reviewing what the LLMs suggest.
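One way to picture that division of labour is a triage step: a model scores each sample, clear cases are accepted or rejected automatically, and only the uncertain middle goes to a human. The sketch below is just that idea in Python. `toy_score` is a stand-in for an LLM-based quality score (it is not a real model call), and the 0.8 / 0.2 thresholds are arbitrary assumptions.

```python
def toy_score(text: str) -> float:
    # Placeholder for an LLM quality score in [0, 1]; here simply
    # proportional to word count, capped at 1.0. Purely illustrative.
    return min(len(text.split()) / 10, 1.0)


def triage(texts, score_fn=toy_score, accept_at=0.8, reject_at=0.2):
    """Split texts into auto-accepted, human-review, and auto-rejected."""
    accepted, needs_review, rejected = [], [], []
    for text in texts:
        score = score_fn(text)
        if score >= accept_at:
            accepted.append(text)
        elif score <= reject_at:
            rejected.append(text)
        else:
            # Uncertain cases are the only ones a human has to look at.
            needs_review.append(text)
    return accepted, needs_review, rejected
```

Widening the gap between the two thresholds sends more samples to human review, while narrowing it leans harder on the model's judgement.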