Recently I began reverse engineering the game files for Circuit's Edge (1989). Most of the in-game text is stored in a separate file called DISKTEXT.TXT. The method Westwood Associates used seems to be a mix of encryption and compression, and looks very similar to the kind of encoding seen in this article.
The file basically uses a number of bytes between 128-254 to represent two-letter combinations. After an hour of tweaking a Python script I finally had the file decrypted. As someone who had never before dabbled in this sort of thing I felt very accomplished, although I quickly realized how rudimentary their methods were.
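A minimal sketch of what that kind of decoding could look like in Python. The table entries here are made up for illustration; the actual byte-to-bigram mapping in DISKTEXT.TXT isn't given in this thread, and `decode_dte` is a hypothetical helper name.

```python
# Hypothetical table: the real mapping used by the game is not known here.
BIGRAMS = {0x80: "th", 0x81: "e ", 0x82: "in"}

def decode_dte(data: bytes, table: dict[int, str]) -> str:
    """Expand each mapped byte into its two-letter pair."""
    out = []
    for b in data:
        if b in table:
            out.append(table[b])   # high byte stands for a bigram
        else:
            out.append(chr(b))     # anything else passes through unchanged
    return "".join(out)

sample = bytes([0x80, 0x81]) + b"city"
decoded = decode_dte(sample, BIGRAMS)
print(decoded)
```

Note that a bigram can include the space character, which is one way whitespace ends up "hidden" in the encoded text.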
First I tried to find sequences of characters that, without modification, already looked like real words. I found one in particular that turned out to be the name of the city in which the game takes place ("Budayeen", though it was completely garbled).
That was a stroke of luck because that word appeared many times in the text and gave me some clue about the adjacent words, since there are only so many words you could reasonably put around the name of the city.
I tried globally replacing the characters in DISKTEXT.TXT but was ending up with a word half as long as I thought it should be (for "Budayeen"). One of the fortunate things that happened though was that it revealed to me, by accident, a couple other words (even though the decryption wasn't correct, it made some previously indistinguishable sequences look more like real words, which I pursued).
I think it really clicked that I was dealing with bigram substitution when I had to come up with a theory of how whitespace was so cleverly hidden.
You know, now that I actually think back to it, the lower 4 bits represent one character and the higher 4 bits represent the other. And there I was thinking in bytes the whole time.
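For anyone who wants to try the nibble idea, splitting a byte into its two 4-bit halves is a one-liner each way. This is only a sketch of the bit manipulation; which half maps to which letter, and what the 16-entry letter table looks like, are unknowns that would have to be worked out from the file.

```python
def split_nibbles(byte: int) -> tuple[int, int]:
    """Return (high 4 bits, low 4 bits) of a single byte value."""
    high = (byte >> 4) & 0x0F
    low = byte & 0x0F
    return high, low

high, low = split_nibbles(0xA7)
print(high, low)
```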
I did something similar with SNES ROMs. Sometimes the in-game text is encoded using "DTE compression", a mapping from a single byte value to a short sequence of characters (a fixed dictionary substitution, loosely comparable to a bounded Huffman code). In certain sentences you can notice that a gap is smaller (in terms of bytes) than the text that should fit there, so you can deduce that a byte stands for a bigram (or trigram, or more).
"Dr Olsson, who has worked on hundreds of cases for police around the world, told the trial: ‘The thought process behind the code shows someone who is very able, very intelligent, very skillful.’"
I'm sure he means "relative to the typical criminal". The average criminal would probably not use a code at all, or would possibly use the classic A=1 B=2 cipher. Shuffling the numbers such that E=10, T=5, etc. is genius level compared to that.
You need to consider that this individual had to create the code and communicate its nature to compatriots on the outside. The code had to be simple enough for said compatriots to use in encoding information themselves, and amenable to manual encryption and decryption, since we can assume said compatriots were probably not exactly computer programmers or mathematicians. All of these requirements had to be met while keeping the code reasonably difficult for police to decipher.
I think it is, too often, tempting to only consider one side of the creation process in situations like this without giving due consideration to context. Giving full consideration to context, this would, to me, seem a relatively dangerous individual.
On another note, this is another example of how far law enforcement is willing to go, in terms of resources, to get their man or woman. Normally, the vast majority of us are not worth the effort of listening to our phone calls, or reading our emails, snail-mail, texts, or web posts. HOWEVER, once a spouse's body turns up, or ANY bodies turn up ... or maybe a bank goes under ... all of that changes. They will look through EVERYTHING. And you will be worth the effort to decrypt it.
It's not only national security that will get you that level of resource allocation.
It sounds like the codebreaker in the article only took a few hours to break it (but maybe something got lost in the writing). That's not a huge allocation of resources in a case involving assault and such.
If the police were in the habit of decoding simple ciphers like this, they could have put it in front of somebody that wouldn't describe this code breaking process as painstaking.
I hope John doesn't mind, but I finished up the cipher and cleaned up some of the transcribing of the symbols, as he got a few wrong (G, P, and a few others).
This was fun. Thanks for leaving it to be an exercise for the reader.
The interesting bit is the underscore '_' --it can be either P or G depending on context, but I'm yet to figure out if there's some rhyme or reason to how it works.
From what I remember, a more efficient way of decrypting simple alphabet substitution codes is to work on the frequency of bigrams and trigrams rather than single letters, because apart from E, even T and A can easily swap places in the frequency order.
In practice, you spend a minute or two moving the dials ("E", or maybe " ", &c) and when you have the right one you can tell because the output starts making sense. Not saying you're wrong, just that you're overthinking it. :)
Actually no. Trigram analysis is the best way to solve simple substitution ciphers. Most common trigram is nearly always "THE" with a much higher probability than single-letter trials will get you.
Sure, some trial and error will probably yield results, but what he's describing is (part of the) systematic approach, not "overthinking".
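A rough sketch of the trigram count in Python, for illustration. The sample sentence and the shift-by-3 substitution are made up, and `ngram_counts` is a hypothetical helper. It strips spaces and punctuation first, so some trigrams span word boundaries; on a real ciphertext you would treat the top trigram as a candidate for "THE", not a certainty.

```python
import string
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count overlapping n-letter sequences, ignoring non-letters."""
    letters = "".join(ch for ch in text.upper() if ch in string.ascii_uppercase)
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

# Made-up sample, enciphered with a simple shift just to have a ciphertext.
plain = "the cat and the dog saw the bird near the tree"
shifted = string.ascii_uppercase[3:] + string.ascii_uppercase[:3]
cipher = plain.upper().translate(str.maketrans(string.ascii_uppercase, shifted))

top, count = ngram_counts(cipher, 3).most_common(1)[0]
print(top, count)  # the most frequent ciphertext trigram is the guess for "THE"
```

Calling `ngram_counts(cipher, 2)` or `ngram_counts(cipher, 1)` gives the bigram and single-letter tables from the same function.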
Um. Like. I'm sure you're right? But in practice, single-character substitution ciphers are so trivial to solve that you can more or less try "E", then space, then maybe "T" and have it.
You don't even need to know what a "trigram" is to solve single character substitution over English text; if you couldn't do it in a job interview in code on a whiteboard here, that'd be a "NO HIRE".
Also note that the "most efficient" ways of "breaking" (strange word to use when we're talking about Carmen Sandiego-grade ciphers) ciphers aren't always the best. For instance, the best way I know to break multi-character substitution in web apps is comically inefficient (in terms of ciphertexts required), but fits in just a couple lines of code.
If you're doing it yourself, you can just write simple programmes in a high-level language like Python or Perl. That lets you try things and change things that would have been very tedious to do by hand, even as recently as WW2.
Emacs has a mode for analyzing text encrypted with a simple monoalphabetic substitution cipher--like the one in the article. You can access it by putting the ciphertext in a buffer and using M-x decipher.
The 'expert' comment at the end is interesting and sad... speaks to the fact that any field that deals with non-obvious knowledge will basically be saturated with people who present themselves as knowledgeable but provide little utility.
This is awesome, both how easily it's broken and how primitive the methods that criminals still use are. Are there literally no calculators or cell phones that can calculate available in prison? You'd think they'd all use http://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exch... these days. I remember more than one documentary sensationalizing "the codes", especially focusing on an apparently frequent use of gang-symbolism instead of words...
I doubt you could get access to something like that in prison, but you could at least use the Vigenère cipher. It can be done by hand, but it is still moderately difficult to break.
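For reference, the Vigenère cipher is just a per-letter shift that cycles through the letters of a key. A minimal Python sketch, assuming the key consists only of ASCII letters (the function name and signature are my own):

```python
def vigenere(text: str, key: str, decrypt: bool = False) -> str:
    """Shift each ASCII letter by the matching key letter; leave the rest alone."""
    out = []
    k = 0  # index into the key, advanced only when a letter is processed
    for ch in text:
        if ch.isascii() and ch.isalpha():
            shift = ord(key[k % len(key)].upper()) - ord("A")
            if decrypt:
                shift = -shift
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
            k += 1
        else:
            out.append(ch)
    return "".join(out)

secret = vigenere("ATTACKATDAWN", "LEMON")
print(secret)
print(vigenere(secret, "LEMON", decrypt=True))
```

By hand this is the same thing as looking each letter up in a 26x26 table, which is why it's feasible with pencil and paper.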
Or be fancy and use the Solitaire algorithm. It is pretty much designed for this case.
I cleaned up the input a good deal by studying a much better image of the coded message (url in code and below). I also left the lines/rows as they were written, and added block/paragraph delimiters.
Though I might not still be alive in 25 years when he gets out of prison, I still didn't think it would be wise to automate the correction of his spelling and grammar. ;)
For those who enjoy this sort of thing, check out the challenges at 3564020356. First one's a simple substitution cipher (plus links to tutorials on it--though the links are dead probably, so that's another challenge)
The @code array in jcg's perl script is a bit off. It seems his sister's name might be "Koh Koh" --probably pronounced like "Coco" of "Coco Chanel" fame. I haven't found any supporting evidence, but it could explain the "6 25 4, 6 25 4," at the start of the message.