When I was working as an AIX kernel programmer in 1985, I set registers to a unique value so it would be easy to spot code that tried to use an uninitialized value. My choice: 0xdeadbeef. Good to see that constant is still in use.
Whenever I find myself having to change a MAC address, I end up using DEADBEEFCAFE.
I hope I never forget to change them back and end up having to debug two different machines with the same MAC (which has actually happened to me in the wild, with two machines coming out of the factory with the same MAC; talk about bad luck and shitty quality control).
I've seen duplicate MACs twice in the last few years, on two different lines of embedded/consumer electronics boards from two different factories. There was a kind of Abbott & Costello routine that went on the first time, when a Taiwanese colleague with limited English reported the problem to me.
This technique is useful in many different places. Using enums/defines that don't start at 0 in C/C++, for instance, helps debugging when you're dealing with possible memory corruption. Likewise, making sure related enums don't overlap in values helps disambiguate logic errors and other potential bugs when those enums are used in data structures.
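The tip above is about C/C++, but here's a minimal sketch of the same idea in Python's enum module (the enum names and values are made up for illustration): give each related enum its own distinct, nonzero range, so a bare integer seen in a log or a binary dump can only belong to one of them, and 0 (the usual "zeroed memory" value) is never legal.

    from enum import IntEnum, unique

    # Hypothetical enums: each gets its own nonzero range, so the two sets
    # never overlap and zeroed memory never looks like a valid state.
    @unique
    class ParserState(IntEnum):
        IDLE = 0x100
        HEADER = 0x101
        BODY = 0x102

    @unique
    class SocketState(IntEnum):
        CLOSED = 0x200
        LISTENING = 0x201
        CONNECTED = 0x202

    def describe(value: int) -> str:
        """Attribute a raw integer to whichever enum claims it, if any."""
        for enum_type in (ParserState, SocketState):
            try:
                return repr(enum_type(value))
            except ValueError:
                continue
        return f"0x{value:x}: not a valid state (corruption or logic error?)"

    print(describe(0x101))  # <ParserState.HEADER: 257>
    print(describe(0))      # 0x0: not a valid state (corruption or logic error?)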
This adds the original word to the left of the hex. It makes the list much more scannable.
Tweaking the dots to count 10 shows that there are apparently 33 choices for WEP codes, if you are so inclined. (Which of course you shouldn't be, but, well...) And, alas, there do not appear to be any 64-bit constants according to my dictionary, though there are enough 32-bit choices to have some fun with phrases ("collated catcalls", "sadistic sabotage", "fattiest feedbags", "besotted ascetics", etc.). And that's just the even 8-8 phrases; 9-7 has almost as many ("godless geodesics", "falsest statistic", 0x7a55e11edb00b1e5).
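For anyone who wants to reproduce the word-plus-hex listing without the shell pipeline, here's a rough Python equivalent. The word-list path and the letter-to-digit substitutions (o→0, i/l→1, s→5, t→7, g→9) are my assumptions, not anything taken from the commands above.

    # Map a handful of letters to look-alike digits; everything else must
    # already be a hex digit (a-f) for the word to qualify.
    LEET = str.maketrans("olistg", "011579")
    HEX_DIGITS = set("0123456789abcdef")

    def as_hex(word):
        candidate = word.lower().translate(LEET)
        return candidate if set(candidate) <= HEX_DIGITS else None

    with open("/usr/share/dict/words") as f:          # assumed word-list location
        words = sorted({w.strip() for w in f if w.strip().isalpha()})

    for word in words:
        h = as_hex(word)
        if h and len(h) == 10:     # 10 hex digits = 40 bits, WEP-key sized
            print(f"{word:<14} 0x{h}")

Change the length filter to 8 for the 32-bit constants or 16 for the (apparently nonexistent) 64-bit ones.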
Thanks to dhart for the introduction to the "< /file program" idiom that puts the file first without resorting to cat, and for column, for that matter. Neat. In return I'll offer Perl's new /r flag, which "returns" the result of tr/// or s/// rather than a count.
I'm not seeing any meaningful exploits coming from this. You can maybe send a request that will fail but I can't see any sort of injection taking place.
Modern browsers no longer support UTF-7, after a number of XSS attacks that relied on inserting UTF-7-encoded script elements which then caused the document to be sniffed as UTF-7.
The only place UTF-7 is still widely used is in email clients.
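For the curious, here's roughly what such a payload looks like, built by hand in Python. The markup is just an illustration; the stdlib utf_7 codec is only used at the end to show what a browser sniffing the page as UTF-7 would have seen.

    import base64

    def utf7_escape(text):
        # A UTF-7 "+...-" run is base64 of the UTF-16 big-endian bytes,
        # with the trailing '=' padding stripped.
        b64 = base64.b64encode(text.encode("utf-16-be")).decode("ascii").rstrip("=")
        return "+" + b64 + "-"

    # Angle brackets hidden inside base64 runs, so a naive filter never sees '<'.
    payload = utf7_escape("<") + "script" + utf7_escape(">")
    print(payload)                                   # +ADw-script+AD4-
    print(payload.encode("ascii").decode("utf-7"))   # <script>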
When UTF-8 was first defined, they didn't know how big the Unicode range was going to be, so they defined it as a 1-6 byte encoding that could encode any 31-bit codepoint (up to U+7FFFFFFF).
When Unicode was deemed to end at U+10FFFF (because that's the largest value that UTF-16 can encode), UTF-8 was revised to be a 1-4 byte encoding that ends in the same place.
Python clearly implements UTF-8 in a way that uses at most four bytes per codepoint (why support five- and six-byte sequences if they'll never be used?). I think what we're seeing in '\xfb\x9b\xbb\xaf' is four bytes of what the original definition would have treated as a five-byte sequence: 0xFB is a five-byte lead byte.
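A quick Python 3 sanity check of those boundaries (this just demonstrates the limits themselves, not the bug under discussion):

    # The largest codepoint, U+10FFFF, still fits in four UTF-8 bytes.
    print(chr(0x10FFFF).encode("utf-8"))           # b'\xf4\x8f\xbf\xbf' -- 4 bytes

    # Anything past the UTF-16 ceiling is rejected before encoding even starts.
    try:
        chr(0x110000)
    except ValueError as exc:
        print(exc)                                 # chr() arg not in range(0x110000)

    # 0xFB was a lead byte for five-byte sequences under the original 1-6 byte
    # definition; a modern UTF-8 decoder refuses it outright.
    try:
        b"\xfb\x9b\xbb\xaf".decode("utf-8")
    except UnicodeDecodeError as exc:
        print(exc)                                 # ... invalid start byte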
That's really cool! Character encoding issues are something we wrestle with all the time, and it is surprisingly hard to reason about all the ways supposedly "string" data are handled in the course of a typical workflow. I cringe; I hadn't even considered bugs in the encoding and decoding process itself.
This reminds me of Gödel's incompleteness theorem, which I'll poorly present as: any system that is sufficiently complex and complete will contain legal assertions that will disprove or destroy the system. (Those that do not are not complete.)
Neither throwing an exception nor exhibiting perfectly deterministic buggy behavior is what Gödel was referring to. This shouldn't remind you of the incompleteness theorem, because it's completely unrelated.
I don't want to be condescending, but that isn't what the theorem says. (I'm not even sure it's true.) Incompleteness means there is a true statement that cannot be proved true inside the system.
http://hg.python.org/cpython/file/tip/Objects/unicodeobject....
CPython supports several internal representations from one to four bytes per character to optimize for space and performance. There's also a nifty sort of Bloom filter for quick discrimination of strings that might contain characters of interest.
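You can see the per-character width choice from pure Python via sys.getsizeof, and the character-mask trick is easy to sketch too. This is just the idea, not CPython's actual code, and the strings below are arbitrary examples.

    import sys

    # PEP 393 flexible string representation: CPython stores a string with the
    # narrowest element width (1, 2, or 4 bytes) that fits its widest codepoint.
    for s in ["spam", "spam\xe9", "spam\u0394", "spam\U0001F40D"]:
        print(f"widest U+{max(map(ord, s)):06X}  sizeof={sys.getsizeof(s)}  {s!r}")

    # The Bloom-filter-style trick, roughly: fold the "characters of interest"
    # into a 64-bit mask. A miss means a character is definitely not in the
    # set; a hit means "maybe", and only then do you pay for a real lookup.
    def make_mask(chars):
        mask = 0
        for ch in chars:
            mask |= 1 << (ord(ch) & 63)
        return mask

    whitespace_mask = make_mask(" \t\r\n")
    print(bool(whitespace_mask & (1 << (ord("\n") & 63))))  # True  (maybe in the set)
    print(bool(whitespace_mask & (1 << (ord("x") & 63))))   # False (definitely not)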