
"Accidentally picked up" Somehow they must have fed it with some JavaScript and some php, right ?


Yeah, the dataset was 40GB of text from pages linked from Reddit, so I imagine it was quite hard to filter it down to just English text. They also noted in their paper that it "accidentally" learned to translate English into French, even though they removed non-English web pages, because of examples like:

"I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool]."


Yes. But that's the sort of hype that seems to be desired. "OMG zombies, we accidentally put some bad samples in the training data"


With sloppy scraping, I guess JS was picked up along with the text.

PHP, though, would have had to be included as training material intentionally.


Sloppy scraping?

You forget that HTML and JS are perfectly valid syntax to find in a .php file.


Yes, of course. But JS is mostly frontend code, so I can imagine it's easier to accidentally scrape some JS along with the text.

PHP, not so much: only if the server returns the source by accident.


What if the server is GitHub? Or some random blog about PHP development? There are lots of situations where it's very intentional that PHP is contained in HTML.



