I'm happy to see this Web implementation of ftfy! I especially appreciate how it converts ftfy's fixing steps into example Python code.
Here's an interesting follow-up question for HN: one of the things that makes ftfy work is the "ftfy.bad_codecs" subpackage. It registers new text codecs in Python for encodings Python doesn't support. Should I be looking into actually making this part of Python?
To elaborate: once ftfy detects that text has been decoded in the wrong encoding, it needs to decode it in the right encoding, but that encoding may very well be one that's not built into Python. CESU-8 (a brain-damaged way to layer UTF-8 on top of UTF-16) would be one example. That one, at least, is gradually going away in the wild (I thank emoji for this).
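For anyone who hasn't run into CESU-8, here is a rough sketch of what it does, using only the standard library. The function names `cesu8_encode` and `cesu8_decode` are made up for illustration and are not ftfy's API. The decoder is deliberately lenient and also accepts ordinary UTF-8.

```python
def cesu8_encode(text):
    """Encode text the CESU-8 way: go through UTF-16 first, then write
    each 16-bit code unit (including each surrogate) as its own UTF-8 sequence."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets us encode a lone surrogate as 3 bytes
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def cesu8_decode(data):
    """Decode CESU-8 (or plain UTF-8) bytes back to text."""
    # Step 1: read the bytes as UTF-8, tolerating encoded surrogates.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: a round trip through UTF-16 recombines surrogate pairs.
    utf16 = with_surrogates.encode("utf-16-be", "surrogatepass")
    return utf16.decode("utf-16-be")


emoji = "\U0001F600"
print(emoji.encode("utf-8"))   # 4 bytes in real UTF-8
print(cesu8_encode(emoji))     # 6 bytes in CESU-8: two 3-byte surrogates
```

Characters inside the Basic Multilingual Plane come out identical in both encodings, which is why the difference only became visible once emoji were everywhere.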
Other examples are the encodings that I've given names of the form "sloppy-windows-NNNN", such as "sloppy-windows-1252". This is where you take a Windows codepage with holes in it, such as the well-known Windows-1252 codepage, and fill the holes with the useless control characters that are there in Latin-1. (Why would you do such a thing? Well, because you get an encoding that's compatible with Windows and that can losslessly round-trip any bytes.)
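To make the hole-filling concrete, here is a minimal sketch of how such a table can be built on top of Python's built-in `cp1252` and `latin-1` codecs. The names `sloppy_table`, `sloppy_decode`, and `sloppy_encode` are hypothetical, chosen for this example, and this is not necessarily how ftfy implements it.

```python
def sloppy_table(codepage="cp1252"):
    """Return (table, holes): a 256-character decoding table and the list
    of byte values the strict codepage leaves undefined."""
    table, holes = [], []
    for value in range(256):
        byte = bytes([value])
        try:
            table.append(byte.decode(codepage))
        except UnicodeDecodeError:
            # A hole: fall back to the Latin-1 character with the same number.
            table.append(byte.decode("latin-1"))
            holes.append(value)
    return "".join(table), holes


TABLE, HOLES = sloppy_table("cp1252")
REVERSE = {char: value for value, char in enumerate(TABLE)}


def sloppy_decode(data):
    return "".join(TABLE[value] for value in data)


def sloppy_encode(text):
    return bytes(REVERSE[char] for char in text)


print([hex(value) for value in HOLES])
every_byte = bytes(range(256))
print(sloppy_encode(sloppy_decode(every_byte)) == every_byte)
```

The strict `cp1252` codec raises `UnicodeDecodeError` on the hole bytes, so it can't round-trip arbitrary bytes; the filled-in table can.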
This has become such common practice on the Web that it's actually been standardized by WHATWG [1].
If a Web page says it's in "latin-1", or "iso-8859-1", or "windows-1252", a modern Web browser will actually decode it as what I've called "sloppy-windows-1252". So perhaps this encoding needs a new name, such as "web-windows-1252" or maybe "whatwg-1252". And similarly for 1251 and all the others.
But instead of just doing this in the ftfy.bad_codecs subpackage, should I be submitting a patch to Python itself to add "web-windows-NNNN" encodings, because Python should be able to decode these now-standardized encodings? Feel free to bikeshed what the encoding name should be, too.
My observation here is that the number of text encodings is generally decreasing, because UTF-8 is obviously good. I want wacky encodings to die. But this is just a class of encodings that have existed for decades and that Python missed, perhaps on the basis that they were non-standard nonsense. Now they're standardized.
It could be argued that web-windows-1252 is the third most common encoding in the world.
If I'm giving directions for how to decode text in this encoding, those directions currently only work if you've imported ftfy first, even if you don't otherwise need ftfy.
Sounds to me like you've argued yourself around to pitching them for inclusion! I find the argument that web-windows-1252 is supported by modern browsers very convincing.
[1] https://encoding.spec.whatwg.org/#legacy-single-byte-encodin...