What are the obvious reasons?

netdevphoenix · 2026-02-16T13:21:22 1771248082

I thought it would be obvious: OpenAI has used repos on GitHub as training data. It would be like testing someone with a publicly available past paper.

Are you planning on carrying out the experiment? Regardless of the outcome, it would be of value to developers.

simonw · 2026-02-16T15:53:57 1771257237

Why wouldn't they train on Codeberg too?

It's pretty hard to block automated uses of "git clone".

netdevphoenix · 2026-02-17T10:19:37 1771323577

Why would they? GitHub has 28 million public repos, Codeberg only hit 300k last year. Anyway, Codeberg was just a placeholder for 'repo source _less_ likely to be in their training data'. Codeberg was a quick candidate for a place to find a big old codebase with non-sensitive data.

It is indeed hard, but the guys at Codeberg are certainly an order of magnitude better at it than GitHub: they opted out of the main AI crawlers, regularly block IPs known to belong to AI startups, and let you make your repos accessible only to logged-in users.

You seem to be going off on a tangent here. The main point was about performing a well-documented test anyway.

simonw · 2026-02-17T12:33:14 1771331594

My question about the "obvious" thing was genuine - it wasn't obvious to me.