Hacker News new | past | comments | ask | show | jobs | submit login
Deciphering Single-Byte XOR Ciphertexts (arpitbhayani.me)
29 points by arpitbbhayani on June 22, 2020 | hide | past | favorite | 15 comments



Note that in practice fancy frequency analysis is usually overkill, and may even lead you astray against texts that don't match your assumptions. A cheesy match against ASCII alphanumerics plus space is both simple and often effective.


For XOR you only need one letter to match to figure out the rest, and for any reasonable length plaintext that isn’t just padded with Qs or something is extremely likely to have E be the most common letter. So in this case it’s not a bad idea; in slightly more complicated scenarios (such as a substitution cipher) frequency analysis generally loses its accuracy past a couple letters.


The most frequent character in normal plaintext is not in fact 'e'.


Don't forget line endings! At least for texts using it, the CR-LF combination can make removing XOR obfuscation the work of just a few minutes.

But not as easy as binaries... I've seen a few binaries in the wild that were "protected" with XOR "encryption". Sure, you can go for the header and figure out the XOR constant that way. But no need to waste your time on that! Binaries are often full of long stretches of zeros... and guess what happens to those under XOR?


You wouldn't happen to have any of examples of this "cheesey matching" would you? :D


I really mean as simple as checking whether a byte is 0x41 to 0x5A (upper case Latin letters) 0x61 to 0x7A (lower case) or 0x30 to 0x39 (numerals) or 0x20 (space)

With XOR you've got 8 bits to correct, let's look at each in turn to see why my cheesy approach is enough on a chunk of likely ASCII or ASCII-compatible text:

0x80 the most significant bit - If this is wrong nothing will match because 0x80 isn't set in any of our matches

0x40 - If this is wrong digit zero and space won't match, upper case letters won't match, and lower case 'a' through 'o' plus 'z' won't match.

0x20 - If this is wrong space and numerals don't match. It also "flips" the case which means if you're just fishing for the plaintext and there were no spaces or numerals it's still easily readable.

0x10 - If this is wrong all numerals except zero don't match, and letters 'K' through 'O' in either case don't match

0x08 - If this is wrong space doesn't match. Digits 2 through 7, letters 'S' through 'W' in either case don't match

0x04 - If this is wrong space doesn't match. Digits 8 and 9 don't match. Letters 'X' through 'Z' in either case don't match.

0x02 - If this is wrong space doesn't match. Digits 8 and 9 again, and this time only the letter 'Y' (either case) doesn't match.

0x01 least significant - Only space doesn't match.

You can see that space is really doing a lot of work for you here, but on the other hand humans really like space. If the text you thought would be French poetry is actually Python it's likely still full of spaces even though the letter frequencies are way off.


Please don't use fake accounts to upvote things. You risk getting banned.

https://news.ycombinator.com/newsguidelines.html

https://news.ycombinator.com/newsfaq.html


Out of curiosity, is this message an automated one that triggers e.g. on IP matches across multiple upvotes, or did you post this one manually?


I always post manually. Automated posts aren't in the spirit of the site. If we ever did add them, we'd use a different username.

It probably wouldn't make sense to automate anti-abuse-related posts, though, since then people would have a new vector through which to game them.


My first contact with substitution cyphers and frequency analysis was at 11, reading the Gold Bug from Edgar Allan Poe.

I have fond memories of that time, writing and decoding messages with school friends.

I have always assume that a lot of people first contact with criptography was that short story.


I had a similar experience but with Sir Arthur Conan Doyle and Sherlock Holmes.

https://en.wikipedia.org/wiki/The_Adventure_of_the_Dancing_M...


WordPerfect through 5.1 used XOR encryption. With 5.1, as I recall, you had a dozen or so known bytes early on--nulls or spaces, I forget--so it was trivial to get any password of a dozen or fewer characters. With 4.2 it was simpler to build a lookup table of the keys--the key was formed by XORing the characters of the password with a logical shift after each character.

I believe that Sendero Luminoso made the mistake of trusting to WordPerfect's encryption, and suffered for that after one or more of its computers were seized in a raid.


Here are a couple of submissions on the cryptopals challenges:

https://news.ycombinator.com/item?id=8166064

https://news.ycombinator.com/item?id=12720009

One comment in the first discussion is from a user stuck on the problem discussed in this blog post.


When I was in high school, I wrote my own encryption algorithm which used a "hard to guess" function to generate bytes to XOR against a plaintext from a given key. I'm sure it had glaring vulnerabilities to cryptanalysis, but I knew enough about frequency analysis to avoid doing a simple substitution cipher. I won a gold medal in a com sci contest for high schoolers with the project. Fun times!


There's an utility I like, XORSearch




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: