"Accidentally picked up" Somehow they must have fed it with some JavaScript and ...

moyix · on Feb 27, 2019

Yeah, the dataset was 40GB of text from pages linked from Reddit, so I imagine it was quite hard to clean it to just English text. They also noted in their paper that it "accidentally" learned to translate English into French, even though they removed non-English web pages, because of examples like

"I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool]."

tyingq · on Feb 26, 2019

Yes. But that's the sort of hype that seems to be desired. "OMG zombies, we accidentally put some bad samples in the training data"

tiborsaas · on Feb 26, 2019

With sloppy scraping, I guess JS was picked up alongside the text.

PHP should be intentionally used as training material.

lugg · on Feb 26, 2019

Sloppy scraping?

You forget that HTML and JS are perfectly valid syntax to find in a .php file.

tiborsaas · on Feb 26, 2019

Yes, of course. But JS is mostly frontend, so I can imagine it's easier to accidentally scrape some JS along with the text.

PHP, not so much: only if the server returns the source by accident.

yorwba · on Feb 28, 2019

What if the server is GitHub? Or some random blog about PHP development? There are lots of situations where it's very intentional that PHP is contained in HTML.
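
For what it's worth, here is a minimal sketch of how the "sloppy scraping" scenario discussed above could let JS leak into a text corpus. It is only an illustration, not the pipeline that was actually used for the dataset: the sample page and the names `naive_text`, `VisibleText`, and `careful_text` are hypothetical. It compares a regex-based tag stripper with a parser that skips `<script>` and `<style>` bodies.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one visible paragraph plus an inline script.
PAGE = """<html><body>
<p>Hello world</p>
<script>var x = 1; alert(x);</script>
</body></html>"""


def naive_text(html: str) -> str:
    """Strip tags with a regex. Script bodies are not tags, so they survive."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return " ".join(without_tags.split())


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def careful_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    print(naive_text(PAGE))    # the JS source ends up mixed in with the prose
    print(careful_text(PAGE))  # only the visible paragraph text
```

Note that neither approach would remove code that a page shows on purpose, such as PHP inside a `<pre>` block on a blog or on GitHub, because that is ordinary visible text.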