Hacker News

Apropos of nothing in particular ...

Not bothering to use ROT13, I just dropped that into my generic substitution cipher decoder and it spat out the answer almost immediately - quite pleased with that.




Interesting. You cycle through substitutions and match against a dictionary to detect a hit?


Nope, I use the shotgun stochastic hill-climbing algorithm I wrote for something else, with the ballistic option disabled.

In short:

    While True:
        Generate a random key
        "Decode"
        Score
        Start timer
        While timer not expired:
            Perturb the key
            Score
            If better:
                Keep the new key
                Reset timer
        Print decrypt
In this case the first output was completely readable.
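That loop can be sketched in Python. The fitness function below is a toy letter-frequency stand-in (the trigram scorer described later in the thread would slot in the same way), and all the names and timings are illustrative, not the parent poster's actual code:

```python
import random
import string
import time

ALPHABET = string.ascii_uppercase

# Toy fitness table: frequency-weighted letter counts. A real attack
# would use something smoother, e.g. trigram statistics.
LETTER_FREQ = {"E": 0.127, "T": 0.091, "A": 0.082, "O": 0.075,
               "I": 0.070, "N": 0.067, "S": 0.063, "H": 0.061}

def decode(ciphertext, key):
    # key is a 26-letter permutation: ciphertext A..Z -> key[0..25].
    return ciphertext.translate(str.maketrans(ALPHABET, key))

def score(text):
    return sum(LETTER_FREQ.get(c, 0.0) for c in text)

def perturb(key):
    # Swap two letters of the key.
    i, j = random.sample(range(26), 2)
    k = list(key)
    k[i], k[j] = k[j], k[i]
    return "".join(k)

def crack(ciphertext, patience=0.5):
    while True:                                         # shotgun restarts
        key = "".join(random.sample(ALPHABET, 26))      # random key
        best = score(decode(ciphertext, key))           # "decode", score
        deadline = time.monotonic() + patience          # start timer
        while time.monotonic() < deadline:              # timer not expired
            cand = perturb(key)
            s = score(decode(ciphertext, cand))
            if s > best:                                # if better:
                key, best = cand, s                     #   keep the new key
                deadline = time.monotonic() + patience  #   reset timer
        print(decode(ciphertext, key))                  # print decrypt
```

The timer-reset is what makes a restart "give up": once no perturbation improves the score for `patience` seconds, the inner loop falls through, prints the best decrypt, and the outer loop reshuffles from scratch.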


I have no idea how you'd make a scoring system that works well with a hill-climbing algorithm. You could check each word for a match in the dictionary, but with a substitution cipher you can essentially turn one word into any other word of the same length (as long as the pattern of repeated letters matches). Without a way to check for "close" words, you wouldn't be able to climb the hill. Would "pello" score a 0.7 and "hello" a 1? I can't think of a fast way to get that number.
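For what it's worth, a "close word" number like the 0.7-for-"pello" example can be had cheaply from a similarity ratio; this is just one way to illustrate the idea, not what anyone in the thread actually used:

```python
from difflib import SequenceMatcher

def closeness(candidate, target):
    # Ratio of matched characters to total characters, in [0, 1]:
    # identical strings score 1.0, fully disjoint strings 0.0.
    return SequenceMatcher(None, candidate, target).ratio()
```

Here "pello" vs "hello" shares the four-character block "ello" out of ten total characters, so it comes out at 0.8. The thread ends up taking a different route (trigram statistics) that avoids per-word comparisons entirely.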

I'd like to try my own hand at this, but the way I'm thinking of making the fitness function doesn't seem like it would be smooth enough.

What did you do?


Take a frequency chart of all trigrams in English, and take the dot product of the trigram frequencies in the decrypt with the frequencies in English.

So, let E(t) be the frequency of occurrence of the trigram t in English, d(t) be how often it occurs in the decrypted text, and compute:

  Score = sum( E(t)*d(t) for each distinct trigram t in the decrypt )
If common trigrams turn up frequently in the decrypt, this will be "large". If uncommon trigrams turn up frequently, this will be "small".
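A minimal version of that scorer, with a toy frequency table standing in for a real English trigram corpus (the numbers and names here are illustrative):

```python
from collections import Counter

# Toy table; a real one would hold thousands of trigram frequencies.
E = {"THE": 0.0182, "AND": 0.0073, "ING": 0.0072, "HER": 0.0036}

def trigram_score(decrypt):
    # Count every trigram occurrence in the candidate decrypt...
    counts = Counter(decrypt[i:i + 3] for i in range(len(decrypt) - 2))
    # ...then dot against the English frequencies: common trigrams that
    # appear often push the score up; rare ones contribute ~nothing.
    return sum(E.get(t, 0.0) * n for t, n in counts.items())
```

Because the sum runs only over trigrams that actually appear, the cost per candidate key is linear in the length of the text, which is what makes thousands of hill-climbing steps per second feasible.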


...

that is a really good idea...

Turns out there's a fair amount of data to use as well.

http://www.cse.chalmers.se/edu/year/2010/course/TDA351/ass1/...

I was thinking too rigidly. Thanks!


Norvig has made n-gram data available:

http://norvig.com/ngrams/


Why not use a residual sum of squares here?

https://en.wikipedia.org/wiki/Residual_sum_of_squares


I don't understand your question.

My (admittedly naive) understanding of using RSS is this. I have a model of the data, which in this case will be expected frequencies of my n-grams. Then I compute the actual frequencies of the n-grams, take the difference, square the difference, and add up all the squares. A small number is then indicative of a good fit.

I don't see why this would be better. Not least, this has the problem of all those n-grams that don't turn up in my decrypt. I still have to square their expected frequency and add them into the sum. That contrasts with the method I'm using, where I just take the n-grams that do appear, multiply by the expected frequency, and add that into a running total.
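The sparsity point can be made concrete. With an assumed toy model table, the dot-product score touches only the trigrams that appear in the decrypt, while an RSS has a term for every trigram in the model:

```python
E = {"THE": 0.018, "AND": 0.007, "ING": 0.007}  # toy model frequencies

def dot_score(observed):
    # observed: trigram -> count. Only trigrams that actually occur
    # in the decrypt contribute a term to the sum.
    return sum(E.get(t, 0.0) * n for t, n in observed.items())

def rss(observed, total):
    # Every trigram in the model or the text contributes; a trigram
    # absent from the decrypt still adds E[t]**2 to the sum.
    keys = set(E) | set(observed)
    return sum((E.get(t, 0.0) - observed.get(t, 0) / total) ** 2
               for t in keys)
```

With a real table of tens of thousands of trigrams, `rss` pays for all of them on every candidate key, whereas `dot_score` stays proportional to the text length.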

So I guess I don't understand your suggestion.



