Ftfy and Unidecode[1] are two of the first libraries I reach for when munging around with a new dataset. My third favorite is the regex filter [^ -~]. If you know you’re working with ASCII data, that single regex against any untrusted data resolves soooo many potential headaches.
I manage and source data for a lead generation/marketing firm, and dealing with free text data from all over the place is a nightmare. Even when working with billion-dollar firms, data purchases take the form of CSV files with limited or no documentation on structure, formatting, or encoding, sales-oriented data dictionaries, and FTP drops. I have a preprocessing script in Python that strips out lines that can’t be parsed as UTF-8, stages the data into a Redshift cluster, then hits it with a UDF in Redshift called kitchen_sink that encapsulates a lot of my text cleanup and validation heuristics like the above.
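For what it's worth, that first pass looks roughly like this (a sketch with invented file names, not the actual script; the Redshift staging and the kitchen_sink UDF are not shown):

# Drop any line that isn't valid UTF-8, keep the rest.
def utf8_lines(path):
    with open(path, "rb") as f:            # read raw bytes, not text
        for raw in f:
            try:
                yield raw.decode("utf-8")  # keep lines that decode cleanly
            except UnicodeDecodeError:
                continue                   # drop the rest

with open("staged.csv", "w", encoding="utf-8") as out:
    out.writelines(utf8_lines("vendor_dump.csv"))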
I don't remember where I came across that regex, but it's saved me from so many headaches I quite literally get giddy any time I can insert it into a processing stream.
A developer upstream naively split on \n and left some errant \r characters everywhere? Fixed.
Embedded backspaces or nulls or tabs or any sort of control character? Gone.
Non-ASCII data in a field you know should be ASCII only? Ha, not today good sir!
Until you've had to deal with the hell that is raw, free form data from who-knows-where, you cannot even fathom how satisfying it is to be able to deploy that regex (when appropriate) and know beyond a doubt that you can't have any more gremlins hiding in that particular field/data that'll hit you later.
It’s nice not having to deal with languages other than English.
In Portuguese, which I’ve worked with, you develop other tricks, like replacing à, á, â or ã with a. But, in order to do this, you still need to find out the encoding used before you can create the “ASCII” equivalent.
Fun trivia: coco means coconut; cocô means poo. So, by replacing ô with o, you’re guaranteed a chuckle at some stage.
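For the replace-the-accents trick, once you've decoded the bytes with the correct encoding, a sketch using only the standard library (Unidecode does the same and more):

import unicodedata

def strip_accents(text):
    # Split accented letters into base letter + combining marks (NFD),
    # then drop the combining marks (Unicode category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("à á â ã"))   # a a a a
print(strip_accents("cocô"))      # coco -- hence the chuckle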
True. But if that specific case is dealing in, let's say, URLs, then all of the content should be ASCII - either as a direct character representation, or encoded, again to ASCII.
Never did I see the parent mention "this is sufficient for humans", or just "...for English" - even that would be a naïve assumption (see? ;)).
If your definition of "URLs" includes "IRIs" (a term nobody really uses but which encompasses the idea that you can use Unicode in URLs), then this isn't a good assumption to make.
If the URL is "on the wire" it had better be 7-bit ASCII. Actually even more restrictive than that. Because that's the spec. https://tools.ietf.org/html/rfc3986
In user interaction with a browser or wherever else it seems that anything goes.
It's true that protocols such as HTTP only use ASCII URIs on the wire. If you are implementing an HTTP client yourself, you will need to implement percent-encoding.
Which is different from saying "well, I know URLs should be ASCII on the wire, so I can safely delete all the non-ASCII characters from a URL." That's not true.
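To be concrete, a sketch of the percent-encoding route versus deleting characters (urllib.parse is the real module; the path is made up):

from urllib.parse import quote, unquote

path = "/wiki/café"                # not legal on the wire as-is
wire = quote(path, safe="/")       # percent-encode down to pure ASCII
print(wire)                        # /wiki/caf%C3%A9
print(unquote(wire))               # /wiki/café -- round-trips losslessly

# Deleting the non-ASCII characters instead would give "/wiki/caf",
# which is a different resource (if it exists at all).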
As soon as you add to or remove from the string, it no longer points to the same resource; I considered this too obvious to mention. OTOH, for _validation_, this is useful: "you have a ZWJ character in a URL, that's unlikely". And yes, I understand that there are protocols that allow you to pass around full Unicode or aunt Matilda or whatever - I should have been more specific.
The first version is much more readable and less hacky; alas, unless you are positive that all the software you're interfacing with can handle Unicode (never have I ever been in such a glorious situation), fallback to URL-encoding it is.
Matches anything that's not between the space and tilde in the ASCII code range[1], which is the entire range of printable ASCII characters. It's similar to the [a-zA-Z] you see a lot, but expanded to include the space character, digits, and punctuation. The regex [ -~] lets you match those characters, whereas [^ -~] negates that and matches anything that's not a printable ASCII character (useful for regex replace functions).
If you look at the table at the top of [1], you'll notice all of the characters at the beginning of the ASCII range which are non-printing and therefore invisible. Plus, at the end for some insane reason, the DELETE character. If there's no valid reason for any of these characters to exist in your dataset, nor for higher code point (UTF) characters to exist, then [^ -~] will match them and let you strip them out all in one go.
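In Python, the replace-style usage looks like this (the sample string is made up):

import re

printable_ascii_only = re.compile(r"[^ -~]")   # anything outside space..tilde

dirty = "ACME\u200b Corp\r\x00\x7f"            # zero-width space, CR, NUL, DEL
print(repr(printable_ascii_only.sub("", dirty)))   # 'ACME Corp'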
The reason for DEL being 0x7f is that it was originally intended as a marker for "this character was deleted and should be skipped over", not as a command to delete the previous/next character. And you can change any ASCII character into 0x7f by ORing it with 0x7f, that is, by punching all the holes in punched tape (or a punchcard, but ASCII usually wasn't used for those, and punching all the holes in a punchcard is inadvisable for mechanical reasons).
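A quick sanity check of that property:

# ORing any 7-bit ASCII code with 0x7f sets every bit -- the paper-tape
# equivalent of punching all the holes, i.e. "this character was deleted".
assert all((ord(c) | 0x7f) == 0x7f for c in "Az~#7 ")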
> The regex [ -~] lets you match those characters, whereas [^ -~] negates that and matches anything that's not a printable ASCII character
Oh dear, I think I finally understand that mysterious phenomenon on some websites where text I write in my native language gets saved and displayed as empty text, but writing in English works. I've come across this a few times over the years, in random forgotten corners of the Internet. It could well be that this regex got written down somewhere as good security practice, and some developers out there copied it into the code without thinking about whether the ASCII restriction is applicable.
It's probably a more mundane issue related to older versions of HTML (which did not support UTF-8[1]), an older database that stored things as latin1 or ASCII by default, or an older programming language that doesn't seamlessly handle non-latin1 or non-ASCII characters without deliberate effort (like Python 2).
That said, I do agree it's not something that should be used without thought. I mentioned that my usage generally involves knowledge that the data set has no valid reason to contain higher code points, and usually involves usage of Unidecode[2] to convert higher code points into ASCII equivalents (which in many cases strips out contextual knowledge, but is sometimes an acceptable trade off for stability and predictability of the data sent downstream).
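For example, a quick sketch of the Unidecode step (transliteration is inherently lossy, which is exactly the contextual-knowledge trade-off mentioned above):

from unidecode import unidecode   # pip install Unidecode

print(unidecode("café"))     # cafe
print(unidecode("König"))    # Konig
print(unidecode("№ 42"))     # No 42 (roughly -- the mapping is lossy)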
I'm not sure, but Google Search in general strips punctuation that isn't specifically designated as a search operator[1], and likely strips control/non-printing characters in a similar manner. Testing it[2] with a few URL-encoded control and invisible characters[3], they seem to be ignored. But they do make it into the query parameter, at least until you edit it for a subsequent search, at which point they get converted into spaces/+ symbols in the query.
- Submits issue to Chromium requesting it just run webpages through this before displaying them (I've stumbled on old webpages with legitimately broken encoding within the past 4-5 months)
- Creates Rube Goldberg machine to fix UTF-8 text copy-pasted through TigerVNC (which I was surprised to discover setting LC_ALL doesn't fix)
--
Fun trivia w/ VNC, because it's cute:
1. Example from site: #╨┐╤Ç╨░╨▓╨╕╨╗╤î╨╜╨╛╨╡╨┐╨╕╤é╨░╨╜╨╕╨╡
2. Fixed: #правильноепитание
3. What I get when I copy #2 through VNC: #пÑавилÑноепиÑание
4. What happens when ftfy sees #3: #правильноепитание
Each de-mangling step decreases a "cost" metric on the text, based on its length plus the number of unusual combinations of characters. It never really decides that text is "okay", but when there's no step it can take that decreases the cost, it's done.
This is an imperfect greedy strategy, incidentally. If it takes multiple steps to fix some text, it's possible that the first step it needs to take is not the one that decreases the cost as much as possible, and that it has to go through some awful-looking intermediate state so that everything falls into place for the next step. This is rare, though. I don't think I could come up with an example.
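The control flow described above, as a rough sketch (not ftfy's actual code; cost() and candidate_fixes() are stand-ins for its internal heuristics):

def fix_greedily(text, cost, candidate_fixes):
    # Keep applying whichever single candidate lowers the cost the most;
    # stop as soon as no candidate improves on the current text.
    while True:
        best = min(candidate_fixes(text), key=cost, default=text)
        if cost(best) >= cost(text):
            return text
        text = best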
I'm happy to see this Web implementation of ftfy! I especially appreciate how it converts ftfy's fixing steps into example Python code.
Here's an interesting follow-up question for HN: one of the things that makes ftfy work is the "ftfy.bad_codecs" subpackage. It registers new text codecs in Python for encodings Python doesn't support. Should I be looking into actually making this part of Python?
To elaborate: once ftfy detects that text has been decoded in the wrong encoding, it needs to decode it in the right encoding, but that encoding may very well be one that's not built into Python. CESU-8 (a brain-damaged way to layer UTF-8 on top of UTF-16) would be one example. That one, at least, is gradually going away in the wild (I thank emoji for this).
Other examples are the encodings that I've given names of the form "sloppy-windows-NNNN", such as "sloppy-windows-1252". This is where you take a Windows codepage with holes in it, such as the well-known Windows-1252 codepage, and fill the holes with the useless control characters that are there in Latin-1. (Why would you do such a thing? Well, because you get an encoding that's compatible with Windows and that can losslessly round-trip any bytes.)
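A quick demonstration of the difference, using the sloppy- codecs that ftfy.bad_codecs registers (the byte string is just an example):

import ftfy.bad_codecs   # registers the sloppy- codecs

raw = b"\x93quoted\x94 plus a hole: \x81"

# Strict windows-1252 refuses the undefined byte 0x81:
try:
    raw.decode("cp1252")
except UnicodeDecodeError as err:
    print("cp1252:", err)

# sloppy-windows-1252 fills the hole with the Latin-1 control character,
# so any byte string decodes, and round-trips losslessly:
text = raw.decode("sloppy-windows-1252")
print(repr(text))                                  # '“quoted” plus a hole: \x81'
assert text.encode("sloppy-windows-1252") == raw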
This has become such common practice on the Web that it's actually been standardized by WHATWG [1].
If a Web page says it's in "latin-1", or "iso-8859-1", or "windows-1252", a modern Web browser will actually decode it as what I've called "sloppy-windows-1252". So perhaps this encoding needs a new name, such as "web-windows-1252" or maybe "whatwg-1252". And similarly for 1251 and all the others.
But instead of just doing this in the ftfy.bad_codecs subpackage, should I be submitting a patch to Python itself to add "web-windows-NNNN" encodings, because Python should be able to decode these now-standardized encodings? Feel free to bikeshed what the encoding name should be, too.
My observation here is that the number of text encodings is generally decreasing, due to the fact that UTF-8 is obviously good. I want wacky encodings to die. But this is just a class of encodings that have existed for decades and that Python missed. Perhaps on the basis that they were non-standard nonsense, but now they're standardized.
It could be argued that web-windows-1252 is the third most common encoding in the world.
If I'm giving directions for how to decode text in this encoding, it currently only works if you've imported ftfy first, even if you don't need ftfy.
Sounds to me like you've argued yourself around to pitching them for inclusion! I find the argument that web-windows-1252 is supported by modern browsers very convincing.
I'm trying to learn Chinese. I wrote http://pingtype.github.io to parse blocks of text, and I'm now building up a large data set of movie subtitles, song lyrics, Bible translations, etc.
Try reading this in TextWrangler:
1 . 教會組織: 小會:代議長老郭倍宏
The box causes the following characters to be unreadable - it gets interpreted as a half-character. Deleting it makes the text show correctly.
I tried it with ftfy, but it just copied the input through to the output.
The anomalous character is U+F081, a character from the Private Use Area. TextWrangler is allowed to interpret it as whatever it wants, but I don't know why that would mess up all the following characters.
Here's my theory. The text probably started out in the GBK encoding (used in mainland China). GBK has had different versions, which supported slightly different sets of characters. A number of these characters (decreasing as both GBK and Unicode updated) have no corresponding character in Unicode, and the standard thing to do when converting them to Unicode has been to convert them into Private Use characters.
So that probably happened to this one, which may have started as a rare and inconsistently-supported character.
Python's implementation of GBK (or GB18030) doesn't know what it is. So maybe what we need to do is flip through this Chinese technical standard [1], or maybe an older version of it, and track down which codepoint was historically mapped to U+F081 and what it is now and hahaha oh god
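In the meantime, a sketch for at least flagging Private Use Area code points before they hit downstream tools:

def find_private_use(text):
    # Flag BMP Private Use Area code points (U+E000..U+F8FF); the two
    # supplementary PUA planes start at U+F0000.
    return [(i, hex(ord(c))) for i, c in enumerate(text)
            if 0xE000 <= ord(c) <= 0xF8FF or ord(c) >= 0xF0000]

print(find_private_use("小會\uf081:代議長老"))   # [(2, '0xf081')]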
A few months ago I built a simple web interface for ftfy so I don't have to start a Python interpreter whenever I need to decode mangled text: https://www.linestarve.com/tools/mojibake/
In fact, ftfy already figures that text out! Here are the recovery steps that the website outputs:
import ftfy.bad_codecs # enables sloppy- codecs
s = 'Ãƒâ€šÃ‚Â¡!HONDA POW'
s = s.encode('sloppy-windows-1252')
s = s.decode('utf-8')
s = s.encode('sloppy-windows-1252')
s = s.decode('utf-8')
s = s.encode('latin-1')
s = s.decode('utf-8')
print(s)
Originally, the text had one non-ASCII character, an upside-down exclamation point. A series of unfortunate (but typical) things happened to that character, turning it into 9 characters of nonsense, the 9th of which is also an upside-down exclamation point.
It looks like ftfy is just removing the first 8 characters, but it's reversing a sequence of very specific things that happened to the text (which just happens to be equivalent to removing the first 8 characters).
This is awesome; it reminds me of when we decided to add Unicode support to our API, but our code had been connecting to MySQL over a Latin-1 connection. As long as you read from a Latin-1 connection, it looked like everything was correct, but what was actually being stored was the UTF-8 bytes being decoded as a Latin-1 string and then re-encoded to UTF-8, since the column was UTF-8. Basically:
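Roughly this, in Python terms (a sketch of the effect, not our actual code; 'latin-1' stands in for the connection charset):

original = "café"

sent    = original.encode("utf-8")    # what our code actually sent
misread = sent.decode("latin-1")      # the connection said latin-1
stored  = misread.encode("utf-8")     # the column is utf8, so it gets re-encoded

print(stored.decode("utf-8"))         # cafÃ©  <- what is really in the table

# Reading back through the same latin-1 connection applies the inverse
# mangling, so everything *looks* correct until the connection charset is fixed:
print(stored.decode("utf-8").encode("latin-1").decode("utf-8"))   # café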
although technically what mysql calls latin-1 is actually using Windows-1252 :(
...and what mysql calls UTF-8 is a subset that only supports code points of up to three bytes! To get UTF-8 you need to use "utf8mb4". Why anybody uses mysql is beyond me.
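For reference, the per-character byte counts that make utf8mb4 necessary:

for ch in ["é", "漢", "😀"]:
    print(ch, len(ch.encode("utf-8")), "bytes")
# é 2 bytes / 漢 3 bytes / 😀 4 bytes -- the emoji needs utf8mb4 in MySQL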
I can’t help but think that this [library] merely gives people the excuse they need for not understanding this “Things-that-are-not-ASCII” problem. Using this library is a desperate attempt to have a just-fix-it function, but it can never cover all cases, and will inevitably corrupt data. To use this library is to remain an ASCII neanderthal, ignorant of the modern world and the difference of text, bytes and encodings.
Let me explain in some detail why this library is not a good thing:
In an ideal world, you would know what encoding bytes are in and could therefore decode them explicitly using the known correct encoding, and this library would be redundant.
If instead, as is often the case in the real world, the encoding is unknown, there exists the question of how to resolve the numerous ambiguities which result. A library such as this would have to guess what encoding to use in each specific instance, and the choices it ideally should make are extremely dependent on the circumstances and even the immediate context. As it is, the library is hard-coded with some specific algorithms to choose some encodings over others, and if those assumptions do not match your use case exactly, the library will corrupt your data.
A much better approach would perhaps involve machine learning, having the library be trained to deduce the probable encodings from a large set of example data from each user’s individual use case. Even this would occasionally be wrong, but at least it would be the best we could do with unknown encodings without resorting to manual processing.
However, a one-size-fits-all “solution” such as this is merely giving people a further excuse to keep not caring about encodings, to pretend that encodings can be “detected”, and that there exists such a thing as “plain text”.
[…]
I have […] two main arguments:
1. Due to its simplicity for a large group of naïve users, the library will likely be prone to over- and misuse. Since the library uses guessing as its method of decoding, and by definition a guess may be wrong, this will lead to some unnecessary data corruption in situations where use of this library (and the resulting data corruption) was not actually needed.
2. The library uses a one-size-fits-all model in the area of guessing encodings and language. This has historically proven to be less than a good idea, since different users in different situations use different data and encodings, and [the] library’s algorithm will not fit all situations equally well. I [suggest] that a more tunable and customizable approach would indeed be the best one could do in the cases where the encoding is actually not known. (This minor complexity in use of the library would also have the benefit of discouraging overuse in unwarranted situations, thus also resolving the first point, above.)
It's a little strange for you to be criticizing ftfy as an encoding guesser, given that ftfy is not an encoding guesser. Are you thinking of chardet?
> In an ideal world, you would know what encoding bytes are in and could therefore decode them explicitly using the known correct encoding, and this library would be redundant.
Twitter is in a known encoding, UTF-8. Most of ftfy's examples come from Twitter. ftfy is not redundant.
When ftfy gets the input "#╨┐╤Ç╨░╨▓╨╕╨╗╤î╨╜╨╛╨╡╨┐╨╕╤é╨░╨╜╨╕╨╡", it's not because this tweet was somehow in a different encoding, it's because the bot that tweeted it literally tweeted "#╨┐╤Ç╨░╨▓╨╕╨╗╤î╨╜╨╛╨╡╨┐╨╕╤é╨░╨╜╨╕╨╡", in UTF-8, due to its own problems. So you decode the text that was tweeted from UTF-8, and then you start fixing it.
I still think you're thinking of chardet.
> If instead, as is often the case in the real world, the encoding is unknown...
...then you will need to detect its encoding somehow. ftfy is now a Python 3-only library. If you try to pass bytes into the ftfy function, the Python language itself will stop you.
Are you hypothesizing that everyone dealing with unmarked bytes is passing them through a chain of chardet and ftfy, and blaming ftfy for all the problems that would result?
Incidentally, I do machine learning. (That's why I had to make ftfy, after all.) I have tried many machine learning solutions. They do not come close to ftfy's heuristics, which are designed to have extremely low false positive rates that are not attainable by ML. If you want one false positive per billion inputs... you're going to need like a quadrillion inputs, or you're going to need a lovingly hand-tuned heuristic.
A guesser answers the question: what encoding did they actually use?
FTFY answers the question: What horrifying sequence of encode/decode transforms could output this sequence of bytes in UTF-8 that, when correctly decoded as UTF-8, still results in total gibberish?
In other words...
The problem fixed by an encoding guesser:
1. I encode my text with something that's not UTF-8-compatible.
2. I lie to you and say it's UTF-8.
3. You decode it as UTF-8 and get nonsense. What the heck?
4. A guesser tells you what encoding I actually used.
5. You decode it from the guessed encoding and get text.
----
The problem fixed by FTFY:
1. I encode string S with non-UTF-8 codec C.
2. I lie that it's UTF-8.
3. Someone decodes it as UTF-8. It's full of garbage, but they don't care.
4. They encode that sequence of nonsense symbols, not the original text, as UTF-8. Let's charitably name this "encoding" C'.
5. They say: Here teddyh, take this nice UTF-8.
6. You decode it as UTF-8. What the heck?
7. Is it ISO-8859? Some version of windows-X? Nope. It's UTF-8 carrying C', a non-encoding someone's broken algorithm made up on the spot. There's no decoder that can turn your UTF-8 back into the symbols of S, because the text you got was already garbage.
8. FTFY figures out what sequence of mismatched encode/decode steps generates text in C' and does the inverse, giving you back C^-1( C'^-1( C'( C( S )))) = S.
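To make the contrast concrete in code (chardet and ftfy are the real libraries; the samples are illustrative, chardet may well guess wrong on something this short, and the mojibake line is the example from ftfy's own documentation):

import chardet
import ftfy

# Encoding guesser: the bytes are fine, only the label was missing or wrong.
mystery = "Привет, как дела?".encode("koi8-r")
print(chardet.detect(mystery))            # prints its best guess plus a confidence

# ftfy: the string is already valid decoded UTF-8 text, but it carries baked-in damage.
print(ftfy.fix_text("âœ” No problems"))   # ✔ No problems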
Damn, this is good. I faced a similar issue where a CSV had mixed encodings. At the time I never looked for a library; I read a few SO answers and created an ad hoc Python script to make the file encoding uniform. Ftfy would have made my work simpler.
No, Unicode text is not broken. It’s either your program that’s broken, for it is interpreting ISO-8859-X/Windows-12XX/whatever as UTF-X; or the program that produced said data.
I wrote the original library. Your statement is true, but in many cases, not useful.
As quoted from the documentation [1]:
> Of course you're better off if your input is decoded properly and has no glitches. But you often don't have any control over your input; it's someone else's mistake, but it's your problem now.