Yeah, the dataset was 40GB of text from pages linked from Reddit, so I imagine it was quite hard to filter it down to just English text. They also noted in their paper that it "accidentally" learned to translate English into French, even though they removed non-English web pages, because the remaining English pages still contained examples like:
"I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool]."
What if the server is GitHub? Or some random blog about PHP development? There are lots of situations where it's very intentional that PHP is contained in HTML.