How to break the 'rapper code'

coderdude · on Feb 27, 2012

Recently I began reverse engineering the game files for Circuit's Edge (1989). Most of the in-game text is stored in a separate file called DISKTEXT.TXT. The method Westwood Associates used seems to be a mix of encryption and compression, and looks very similar to the kind of encoding seen in this article.

The file basically uses byte values in the range 128-254 to represent two-letter combinations. After an hour of tweaking a Python script I finally had the file decrypted. As someone who had never before dabbled in this sort of thing I felt very accomplished, although I quickly realized how rudimentary their methods were.
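For illustration, here is a minimal Python sketch of that kind of byte-to-bigram expansion. The table entries are invented for the example and are not taken from DISKTEXT.TXT, and the sketch assumes that bytes outside 128-254 are ordinary ASCII text.

```python
# Hypothetical table: these byte-to-bigram pairs are invented for illustration,
# not recovered from the game file.
BIGRAMS = {
    0x80: "th",
    0x81: "e ",
    0x83: "Bu",
}

def decode(data, table):
    """Expand bytes 128-254 into two-letter pairs; pass other bytes through."""
    out = []
    for b in data:
        if 128 <= b <= 254:
            # Unknown codes become a visible placeholder instead of a guess.
            out.append(table.get(b, f"<{b:02x}>"))
        else:
            # Assumption: bytes outside the range are plain ASCII characters.
            out.append(chr(b))
    return "".join(out)

sample = bytes([0x80, 0x81, 0x83]) + b"dayeen"
print(decode(sample, BIGRAMS))
```

With a pair such as "e " in the table, a space never has to appear as a byte of its own, so word boundaries are invisible until the table is known.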

jgrahamc · on Feb 27, 2012

How did you come to realize that you were dealing with bigrams?

coderdude · on Feb 27, 2012

Trial and error for the most part.

First I tried to find sequences of characters that, without modification, already looked like real words. I found one in particular that turned out to be the name of the city in which the game takes place ("Budayeen", though it was completely garbled).

That was a stroke of luck because that word appeared many times in the text and gave me some clue about the adjacent words, since there are only so many words you could reasonably put around the name of the city.

I tried globally replacing the characters in DISKTEXT.TXT but was ending up with a word half as long as I thought it should be (for "Budayeen"). One of the fortunate things that happened though was that it revealed to me, by accident, a couple other words (even though the decryption wasn't correct, it made some previously indistinguishable sequences look more like real words, which I pursued).

I think it really clicked that I was dealing with bigram substitution when I had to come up with a theory of how whitespace was so cleverly hidden.

Here's the lookup dictionary if anyone is interested: http://pastebin.com/pxJU7q7F

And here's the script that does the substitution: http://pastebin.com/8eYrGbzQ

You know, now that I actually think back to it, the lower 4 bits represent one character and the higher 4 bits represent the other. And there I was thinking in bytes the whole time.
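A rough sketch of that nibble reading, in Python. The 16-symbol alphabet and the choice of which half comes first are both assumptions made for the example; neither is spelled out here.

```python
def split_nibbles(b):
    """Return the (high, low) 4-bit halves of a byte."""
    return (b >> 4) & 0x0F, b & 0x0F

# Hypothetical 16-symbol alphabet; the real assignment is not given here.
ALPHABET = " etaoinshrdlucmf"

def decode_byte(b, alphabet=ALPHABET, low_first=True):
    """Turn one byte into two characters, one per nibble."""
    high, low = split_nibbles(b)
    # Assumption: the low nibble is the first character unless told otherwise.
    first, second = (low, high) if low_first else (high, low)
    return alphabet[first] + alphabet[second]

print(split_nibbles(0xA7))
print(decode_byte(0x21))
```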

emillon · on Feb 27, 2012

I did something similar with SNES ROMs. Sometimes the in-game text is encoded using "DTE compression", a mapping from one hex character to an alphanumeric sequence (bounded Huffman coding, basically). In certain sentences you can notice that gaps are smaller (in terms of bytes) than what should fit in there, so you can deduce that it was a bigram (or trigram, or more).

evan_ · on Feb 27, 2012

"Dr Olsson, who has worked on hundreds of cases for police around the world, told the trial: ‘The thought process behind the code shows someone who is very able, very intelligent, very skillful.’"

I'm sure he means "relative to the typical criminal". The average criminal would probably not use a code at all, or would possibly use the classic A=1 B=2 cipher. Shuffling the numbers such that E=10, T=5, etc. is genius level compared to that.
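To make the comparison concrete, here is a small Python sketch of both keys. Only E=10 and T=5 come from the code as described; the rest of the shuffled assignment is an arbitrary fill-in for illustration.

```python
import string

# The "classic" key: A=1, B=2, ..., Z=26.
CLASSIC = {ch: i for i, ch in enumerate(string.ascii_uppercase, start=1)}

def shuffled_key(fixed):
    """Build a one-to-one letter-to-number key that honours the fixed pairs.
    Everything apart from the fixed pairs is a hypothetical assignment."""
    used = set(fixed.values())
    free = [n for n in range(1, 27) if n not in used]
    key = dict(fixed)
    for ch in string.ascii_uppercase:
        if ch not in key:
            key[ch] = free.pop()  # any unused number would do
    return key

SHUFFLED = shuffled_key({"E": 10, "T": 5})

def encode(text, key):
    """Replace each letter with its number; other characters are dropped."""
    return [key[ch] for ch in text.upper() if ch in key]

def decode(numbers, key):
    inverse = {n: ch for ch, n in key.items()}
    return "".join(inverse[n] for n in numbers)

print(encode("ET", CLASSIC), encode("ET", SHUFFLED))
```

Both keys are still one-letter-to-one-number substitutions, so letter frequencies survive unchanged under either of them.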

andrewtbham · on Feb 27, 2012

By building up the criminal's intelligence and skill, he is building up his own intelligence and skill in breaking the code.

bilbo0s · on Feb 27, 2012

Actually, the code maker IS pretty able.

You need to consider that this individual had to create the code and communicate its nature to compatriots on the outside. The code had to be simple enough for said compatriots to use in encoding information themselves, as well as being amenable to manual encryption and decryption, since we can assume said compatriots probably were not exactly computer programmers or mathematicians. All of these requirements had to be met while keeping the code reasonably difficult for police to decipher.

I think it is, too often, tempting to only consider one side of the creation process in situations like this without giving due consideration to context. Giving full consideration to context, this would, to me, seem a relatively dangerous individual.

On another note, this is another example of how far law enforcement is willing to go, in terms of resources, to get their man or woman. Normally, the vast majority of us are not worth the effort of listening to our phone calls, or reading our emails, snail-mail, texts, or web posts. HOWEVER, once a spouse's body turns up, or ANY bodies turn up ... or maybe a bank goes under ... all of that changes. They will look through EVERYTHING. And you will be worth the effort to decrypt it.

It's not only national security that will get you that level of resource allocation.

maxerickson · on Feb 27, 2012

It sounds like the codebreaker in the article only took a few hours to break it (but maybe something got lost in the writing). That's not a huge allocation of resources in a case involving assault and such.

If the police were in the habit of decoding simple ciphers like this, they could have put it in front of somebody that wouldn't describe this code breaking process as painstaking.

eggbrain · on Feb 27, 2012

I hope John doesn't mind, but I finished up the cipher and cleaned up some of the transcription of the symbols, as he got a few wrong (G, P, and a few others).

You can check out the modified code (with the translated solution as well) here: http://pastebin.com/umz5mM5F

jgrahamc · on Feb 27, 2012

Thanks. I literally spent about 15 minutes on this and I was sure there were transcription errors.

jcr · on Feb 27, 2012

This was fun. Thanks for leaving it to be an exercise for the reader.

The interesting bit is the underscore '_' --it can be either P or G depending on context, but I'm yet to figure out if there's some rhyme or reason to how it works.

BitMastro · on Feb 27, 2012

From what I remember, a more efficient way of decrypting simple alphabet substitution codes is to work on the frequency of bigrams and trigrams rather than single letters, because apart from E, even T and A can easily be swapped in the frequency order.
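A minimal Python sketch of counting bigrams and trigrams in a ciphertext. The number sequence is a toy example, not the text of the actual note.

```python
from collections import Counter

def ngram_counts(symbols, n):
    """Count every run of n consecutive symbols."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))

# Toy ciphertext as a list of numbers (hypothetical).
cipher = [5, 4, 10, 6, 5, 4, 10, 6, 1, 5, 4, 10]

print(ngram_counts(cipher, 2).most_common(3))
print(ngram_counts(cipher, 3).most_common(3))
```

A trigram that repeats far more often than the rest is a natural candidate for a very common three-letter word, which pins down three symbols at once.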

tptacek · on Feb 27, 2012

In practice, you spend a minute or two moving the dials ("E", or maybe " ", &c) and when you have the right one you can tell because the output starts making sense. Not saying you're wrong, just that you're overthinking it. :)

tripzilch · on Feb 27, 2012

Actually no. Trigram analysis is the best way to solve simple substitution ciphers. Most common trigram is nearly always "THE" with a much higher probability than single-letter trials will get you.

Sure, some trial and error will probably yield results, but what he's describing is (part of the) systematic approach, not "overthinking".

tptacek · on Feb 27, 2012

Um. Like. I'm sure you're right? But in practice, single-character substitution ciphers are so trivial to solve that you can more or less try "E", then space, then maybe "T" and have it.
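Roughly what that looks like as a Python sketch: map the most frequent ciphertext symbols to a few guessed letters and eyeball the result. The function name and the "." placeholder are choices made for this example.

```python
from collections import Counter

def try_guesses(ciphertext, guesses):
    """Map the most frequent symbols to the guessed letters, in order,
    and show every other symbol as '.' so partial words stand out."""
    ranked = [sym for sym, _ in Counter(ciphertext).most_common(len(guesses))]
    key = dict(zip(ranked, guesses))
    return "".join(key.get(sym, ".") for sym in ciphertext)

# Toy ciphertext (hypothetical): 10 is the most common symbol, then 5.
print(try_guesses([10, 10, 10, 5, 5, 7], "ET"))
```

If the output looks wrong, swap the order of the guesses and run it again; that is the whole "moving the dials" loop.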

You don't even need to know what a "trigram" is to solve single character substitution over English text; if you couldn't do it in a job interview in code on a whiteboard here, that'd be a "NO HIRE".

Also note that the "most efficient" ways of "breaking" (strange word to use when we're talking about Carmen Sandiego-grade ciphers) ciphers aren't always the best. For instance, the best way I know to break multi-character substitution in web apps is comically inefficient (in terms of ciphertexts required), but fits in just a couple lines of code.

jgrahamc · on Feb 27, 2012

Also, in this example the writer didn't use the word THE, he was writing in a vernacular and used DA.

omegant · on Feb 27, 2012

What tools do you use for this kind of deciphering? Pencil and paper, some kind of mode for vim (or any other editor), or special software?

jgrahamc · on Feb 27, 2012

I wrote a small program in Perl to do it. All I needed was the number frequencies and some way of doing substitutions.

When I did the New Scientist code breaking competition I just used an EMACS buffer and did M-% substitutions. http://blog.jgc.org/2011/06/how-to-break-new-scientist-ciphe... When I worked on the 'Reddit code' I think I just used EMACS again: http://blog.jgc.org/2010/12/breaking-reddit-code.html When I worked on how the Zodiac Killer enciphered the 408 message I started out by hand and then wrote a small program: http://blog.jgc.org/2011/06/how-zodiac-enciphered-zodiac-408...

For the hidden part of the GCHQ challenge that I discovered and reversed the key for I think I wrote some code in C: http://blog.jgc.org/2011/12/down-gchq-rabbit-hole-or-i-think...

In general, I like to stare at things, work by hand and write code to automate.

omegant · on Feb 27, 2012

Thank you! Ever since I read "The Code Book" by Simon Sinth, all this kind of deciphering just seems awesome to me. I'll check your articles for sure!

J3L2404 · on Feb 27, 2012

I think you mean Simon Singh, but yeah a great book.

omegant · on Feb 27, 2012

Yep, you are right. I am writing from the iPhone and made that mistake.

rmc · on Feb 27, 2012

If you're doing it yourself, you can just write simple programmes in a high-level language like Python or Perl. That lets you try things and change things that would have been very tedious even during WW2.

tikhonj · on Feb 27, 2012

Emacs has a mode for analyzing text encrypted with a simple monoalphabetic substitution cipher--like the one in the article. You can access it by putting the ciphertext in a buffer and using M-x decipher.

egometry · on Feb 27, 2012

The 'expert' comment at the end is interesting and sad... speaks to the fact that any field that deals with non-obvious knowledge will basically be saturated with people who present themselves as knowledgeable but provide little utility.

Jach · on Feb 27, 2012

This is awesome, both how easily it's broken and how primitive the methods that criminals still use are. Are there literally no calculators or cell phones that can calculate available in prison? You'd think they'd all use http://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exch... these days. I remember more than one documentary sensationalizing "the codes", especially focusing on an apparently frequent use of gang-symbolism instead of words...

tomjen3 · on Feb 27, 2012

I doubt you could get access to something like that in prison, but you could at least use the Vigenère cipher. It can be done by hand, but it is still moderately difficult.
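For reference, a minimal Python sketch of the Vigenère cipher. It assumes an all-letter key and uppercases the text; this is the textbook scheme, nothing specific to this case.

```python
import string

A = string.ascii_uppercase

def vigenere(text, key, decrypt=False):
    """Shift each letter by the matching key letter; non-letters pass through
    unchanged and do not consume a key letter. Assumes the key is letters only."""
    out, k = [], 0
    for ch in text.upper():
        if ch in A:
            shift = A.index(key[k % len(key)].upper())
            if decrypt:
                shift = -shift
            out.append(A[(A.index(ch) + shift) % 26])
            k += 1
        else:
            out.append(ch)
    return "".join(out)

secret = vigenere("ATTACKATDAWN", "LEMON")
print(secret)
print(vigenere(secret, "LEMON", decrypt=True))
```

By hand this is the same lookup in a 26x26 table, one letter at a time, which is why it is workable without a computer.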

Or be fancy and use the Solitaire algorithm. It is pretty much designed for this case.

FreeFull · on Feb 27, 2012

I wonder if you could change the Vigenère cipher to be homophonic. Would that be harder to decrypt than a standard homophonic cipher?

jcr · on Feb 28, 2012

I cleaned up the input a good deal by studying a much better image of the coded message (url in code and below). I also left the lines/rows as they were written, and added block/paragraph delimiters.

http://pastebin.com/6EfmSpyi

Though I might not still be alive in 25 years when he gets out of prison, I still didn't think it would be wise to automate the correction of his spelling and grammar. ;)

tripzilch · on Feb 27, 2012

For those who enjoy this sort of thing, check out the challenges at 3564020356. First one's a simple substitution cipher (plus links to tutorials on it--though the links are dead probably, so that's another challenge)

jcr · on Feb 27, 2012

For those playing along at home, there is a better image of the complete note in this article:

http://www.dailymail.co.uk/news/article-2106384/Rapper-Kieron-Bryan-jailed-25-years-codebreaker-exposes-gangland-hitman.html?ito=feeds-newsxml

The @code array in jgc's Perl script is a bit off. It seems his sister's name might be "Koh Koh", probably pronounced like the "Coco" of "Coco Chanel" fame. I haven't found any supporting evidence, but it could explain the "6 25 4, 6 25 4," at the start of the message.

TazeTSchnitzel · on Feb 27, 2012

For those outside the United Kingdom, the Mirror is a tabloid...

jrockway · on Feb 27, 2012

Surprisingly clean Perl script for a non-programming blog. Nice work!

jgrahamc · on Feb 27, 2012

I am a programmer: http://blog.jgc.org/2012/02/programmer.html :-)