History PhD Student Reaches out to Cryptographers to Help Break a Civil War Code (diarycodebreak.wordpress.com)
115 points by justine on Feb 8, 2012 | 20 comments



Since he was using very uncommon symbols and the cipher is old, it's not unlikely that the author used a simple letter substitution table. Something like:

a -> #,

b -> *,

c -> =,

...

and so on. Unfortunately the handwriting is a bit messy and hard to read, but I'd try to make a frequency table for every symbol and compare that to a frequency table of letters in the English language:

http://en.wikipedia.org/wiki/Letter_frequency

Now try replacing the most frequent symbol with e, the second most frequent symbol with a or t, and so on. Try some variations and see whether any of them makes sense.
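A minimal sketch of that rank-matching step, assuming the symbols have already been transcribed into a list of names. The letter ordering `ENGLISH_BY_FREQUENCY` is one commonly quoted approximation (orderings differ by corpus), and `guess_key` and `apply_key` are hypothetical helper names:

```python
from collections import Counter

# English letters from most to least frequent (approximate; varies by corpus).
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"

def guess_key(symbols):
    """Map each cipher symbol to an English letter by frequency rank."""
    counts = Counter(symbols)
    # Most frequent first; ties broken by symbol name so the result is stable.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return {
        sym: ENGLISH_BY_FREQUENCY[i] if i < len(ENGLISH_BY_FREQUENCY) else "?"
        for i, sym in enumerate(ranked)
    }

def apply_key(symbols, key):
    """Decode a symbol sequence with a candidate key; unknown symbols become '?'."""
    return "".join(key.get(s, "?") for s in symbols)

sample = "# * # = # * #".split()
key = guess_key(sample)
print(key)
print(apply_key(sample, key))
```

The first guess is rarely right on a short text, so the "variations" would amount to swapping neighbouring entries in the key and re-reading the output.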


If those pictures are the only coded messages, there may not be enough ciphertext for letter frequency analysis to work reliably.


True, but that's why I'd try some variations.


A simple substitution cipher is of course a good guess, but I suspect a homophonic one is more likely if the author knew anything about cryptography and thought it might come under attack. It's simple enough to come up with a number of homophones to complicate frequency analysis.


> I suspect a homophonic one is more likely if the author knew anything about cryptography

How likely is it that he had a deeper knowledge of cryptography in the 1860s? But of course, if he had, it's still likely that he used something more sophisticated.

http://en.wikipedia.org/wiki/History_of_cryptography#Cryptog...

Given that "12345" is still a common password in 2012, I would at least give it a shot. ;)


Well, I'm just speculating idly of course, and yes, there's no sound reason to start suspecting anything more sophisticated than a substitution cipher yet, but cryptography was reasonably topical then. People used ciphers to conceal messages they had to send via telegraph (having to watch someone read and tap out a private message of yours can't have been pleasant) and to encrypt messages put in personals sections of newspapers. Things like Poe's contest in Philadelphia and the Beale Papers apparently caused a public stir.

The reason I suggest a homophonic cipher (or similar) rather than something like the much more secure Vigenère in use for telegraph messages (only recently broken at the time) is that the Vigenère system is more complicated, and requires more working out than I would expect someone writing a private message to tolerate.
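To give a feel for the extra working out, here is a rough sketch of the textbook Vigenère scheme on lowercase a-z only: every letter is shifted by a different amount taken from a repeating keyword. The function names are made up for illustration, and nothing here is a claim about what the diarist did:

```python
def vigenere_encrypt(plaintext, keyword):
    """Shift each lowercase letter by the matching keyword letter (a=0 ... z=25)."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = ord(keyword[i % len(keyword)]) - ord("a")
        out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
    return "".join(out)

def vigenere_decrypt(ciphertext, keyword):
    """Undo vigenere_encrypt by shifting each letter back."""
    out = []
    for i, ch in enumerate(ciphertext):
        shift = ord(keyword[i % len(keyword)]) - ord("a")
        out.append(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")))
    return "".join(out)

secret = vigenere_encrypt("attackatdawn", "lemon")
print(secret)  # lxfopvefrnhr
print(vigenere_decrypt(secret, "lemon"))
```

Done by hand, each letter needs a table lookup against the current keyword letter, which is a lot slower than writing down a memorised symbol.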

Homophonic systems, on the other hand, are fairly easy to invent and remember on a personal basis, and can offer some security against amateur analysis. Though the technique for solving them was known, they could still prove robust: the partially homophonic cipher of Louis XIV was still unbroken at this point, despite being over a century old.
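As a toy illustration of why homophones blunt frequency analysis, here is a sketch with an invented table (`HOMOPHONES` and its two-digit symbols are assumptions made up for this example) in which frequent letters get several interchangeable symbols:

```python
import random
from collections import Counter

# Invented homophone table: more frequent letters get more symbols.
HOMOPHONES = {
    "e": ["01", "02", "03", "04"],
    "t": ["05", "06", "07"],
    "a": ["08", "09"],
    "s": ["10"],
    " ": ["11"],
}
# Each symbol decodes to exactly one letter.
REVERSE = {sym: letter for letter, syms in HOMOPHONES.items() for sym in syms}

def encrypt(plaintext, rng):
    """Replace each character with one of its homophones, chosen at random."""
    return [rng.choice(HOMOPHONES[ch]) for ch in plaintext]

def decrypt(symbols):
    """Map every symbol back to its single plaintext character."""
    return "".join(REVERSE[s] for s in symbols)

rng = random.Random(0)  # fixed seed so runs are repeatable
message = "tea at sea eats state"
cipher = encrypt(message, rng)
print(Counter(message).most_common(3))
print(Counter(cipher).most_common(3))
print(decrypt(cipher))
```

Because each letter's count is split across its symbols, no single symbol can appear more often than the letter it stands for, so the peaks in the frequency table get lower.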

I'm very much an amateur (though I have read that book) but it doesn't seem that ridiculous to suggest a homophonic cipher. At any rate, it's only something to consider if it turns out to be something more complex than a simple substitution cipher.


I strongly agree with linguist on this point. It's very unlikely a homophonic substitution was used here. It's more complicated than you'd think. A great read on the subject can be found here: http://www.amazon.com/Code-Book-Science-Secrecy-Cryptography... if anyone's interested in learning more about cryptography. Very interesting.


Interesting thought. You could also use the letter frequencies found in the rest of the journal, in the passages that are not encoded. That may be a better indicator for the symbols, as it reflects the writer's own tendencies.


It will take a major feat of cryptanalysis just to decipher the guy's handwriting, let alone the code parts.

Thanks for reminding me why I work with computers.


Here's the first four images... ready for parsing by your program:

:::: image 1, left:

s-tac-toe equals minus-dot seven ex comma slash-slash-backslash gamma capital-l lower-j ex slash

:::: image 1, right, downward

capital-t capital-i equals four parallel-lines slash-slash-backslash equals-slash comma s-tac-toe slash-slash-backslash slash-slash-backslash-backslash comma

:::: image 1, left, upper

capital-l divided-by capital-i comma equivalent comma lower-j comma capital-f slash-slash-backslash capital-i squared-capital-n zee slash-slash-backslash-backslash comma divided-by slash-slash comma slash-backslash-backslash slash-slash minus-dot slash-slash-backslash

:::: image 2

plus-dot plus leaning-heart upsidedown-t minus vertical-line ex comma

leaning-heart capital-m capital-i capital-a minus three-peaks comma vertical-line crap capital-b close-bracket plus script-j script-s lower-d comma

u-bar three-peaks capital-i

:::: image 3

capital-l backslash divided-by c-slash-slash capital-i comma capital-i equivalent divided-by slash-slash-backslash-backslash comma capital-i minus-dot three-horizontal-two-vertical ex comma equals zee capital-l

l-in-l 11-over-1 comma y-slash-slash slash-slash-backslash-backslash comma capital-i divided-by comma minus-lower-dot c-omega slash-slash slash-i 11-over-1 slash-slash square-c equals capital-l equivalent slash-slash comma

capital-l slash-slash slash comma l-on-l plus slash-backslash-backslash slash-slash-backslash-backslash comma 1-slash-1 11-over-1 capital-z comma capital-i equals ex comma j divided-by c-slash-slash slash-slash capital-l divided-by slash-slash ex comma

:::: image 4 (repeats image 2)

plus-dot plus leaning-heart upsidedown-t minus vertical-line ex comma

leaning-heart capital-m capital-i capital-a minus three-peaks comma vertical-line crap capital-b close-bracket plus script-j script-s lower-d comma

u-bar three-peaks capital-i


Here's a frequency chart of the first 3 images:

(23, 'comma') (10, 'capital-i') (8, 'slash-slash') (7, 'divided-by') (7, 'capital-l') (6, 'ex') (5, 'slash-slash-backslash-backslash') (5, 'slash-slash-backslash') (5, 'equals') (3, 'plus') (3, 'minus-dot') (3, 'equivalent') (3, '11-over-1') (2, 'zee') (2, 'vertical-line') (2, 'three-peaks') (2, 'slash-backslash-backslash') (2, 'slash') (2, 's-tac-toe') (2, 'minus') (2, 'lower-j') (2, 'leaning-heart') (2, 'c-slash-slash') (1, 'y-slash-slash') (1, 'upsidedown-t') (1, 'u-bar') (1, 'three-horizontal-two-vertical') (1, 'squared-capital-n') (1, 'square-c') (1, 'slash-i') (1, 'seven') (1, 'script-s') (1, 'script-j') (1, 'plus-dot') (1, 'parallel-lines') (1, 'minus-lower-dot') (1, 'lower-d') (1, 'l-on-l') (1, 'l-in-l') (1, 'j') (1, 'gamma') (1, 'four') (1, 'equals-slash') (1, 'crap') (1, 'close-bracket') (1, 'capital-z') (1, 'capital-t') (1, 'capital-m') (1, 'capital-f') (1, 'capital-b') (1, 'capital-a') (1, 'c-omega') (1, 'backslash') (1, '1-slash-1')
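For anyone who wants to rebuild a chart like this, a short sketch using only the image 2 transcription from above. The ordering in the chart looks consistent with a reverse sort on (count, name) pairs, so the sketch assumes that; `frequency_chart` is a hypothetical helper name:

```python
from collections import Counter

# Image 2 tokens, copied from the transcription in this thread.
IMAGE_2 = """
plus-dot plus leaning-heart upsidedown-t minus vertical-line ex comma
leaning-heart capital-m capital-i capital-a minus three-peaks comma vertical-line crap capital-b close-bracket plus script-j script-s lower-d comma
u-bar three-peaks capital-i
"""

def frequency_chart(transcription):
    """Return (count, name) pairs, highest count first, then names in reverse order."""
    counts = Counter(transcription.split())
    return sorted(((n, name) for name, n in counts.items()), reverse=True)

chart = frequency_chart(IMAGE_2)
print(" ".join(str(pair) for pair in chart))
```

Concatenating the other transcriptions into the same string would give the combined counts.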


I did some very rough frequency analysis using this last night, but didn't get very much from it.

The comma symbol is more frequent than any letter usually is in English, but given the small corpus that's not too telling. It could stand for an 'e', or the coded text could be lists and they're just commas.

Someone commented on the article that he suspects the 'divided by' symbol might stand for 'i' due to its placement, which agrees roughly with the position it gets in the frequency table. Someone else has suggested that the language being masked might not be English, which is an intriguing possibility.

The frequencies aren't flat, which seems to suggest it's either not a very good homophonic cipher (he just threw some odd replacements and codeword-symbols in there, basically still a substitution cipher) or it's a very good one (he consciously aimed at misleading symbol frequencies).
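One way to put a number on "flat" is the index of coincidence: the chance that two symbols picked at random from the text are the same. This is a sketch of that standard measure, not something anyone in the thread has run on the diary. For reference, ordinary English text comes out around 0.067 and a uniform 26-letter alphabet around 0.038, though the baselines shift when the symbol set is larger than 26:

```python
from collections import Counter

def index_of_coincidence(symbols):
    """Probability that two symbols drawn without replacement are identical."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total < 2:
        raise ValueError("need at least two symbols")
    return sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))

flat = "abcdabcdabcd"    # every symbol equally common
peaked = "aaaaaaaabbcd"  # one symbol dominates
print(index_of_coincidence(flat))
print(index_of_coincidence(peaked))
```

The function accepts any sequence, so a list of symbol names works as well as a string. On a corpus this small the value will be noisy, so it is a hint rather than a verdict.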

The rough nature of the writing (also discussed on the article) suggests that the code was probably memorised, and thus not the result of a very laborious method.


There are definitely more symbols than there are letters in the alphabet. So that could mean he has multiple ciphers or that some symbols are substitutes for words, possibly common phrases.


Though I was advocating this as a possibility, I should point out that he might include some actual punctuation in the cipher as well as the standard 26 characters, so it's still possible that it's just a substitution cipher.

Additionally, it's possible that some of these characters aren't characters at all, but common repetitions. I'm thinking particularly of those slashes and backslashes.


I've somehow been conditioned to think of these Civil War diaries and letters as full of flowery prose and beautiful handwriting. It's actually amazing to me to see someone writing plainly and with handwriting almost as bad as my own.


Forwarded to a cryptography Professor and some of his students : )


Is this what is needed?

1. Assign ASCII to the symbols in the code

2. Transcribe the code to ASCII

3. Solve the code in ASCII using techniques from Snyder and Barzilay
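Steps 1 and 2 could be sketched roughly like this, starting from the symbol names used in the transcription earlier in the thread. The helper names are hypothetical, and letters plus digits are just one possible choice of 62 stand-in characters. Step 3 is not attempted here:

```python
import string

def build_symbol_table(tokens):
    """Assign one ASCII character per distinct symbol name, in order of first appearance."""
    alphabet = string.ascii_letters + string.digits  # 62 characters available
    table = {}
    for tok in tokens:
        if tok not in table:
            if len(table) >= len(alphabet):
                raise ValueError("more distinct symbols than available characters")
            table[tok] = alphabet[len(table)]
    return table

def transcribe(tokens, table):
    """Rewrite a sequence of symbol names as a plain ASCII string."""
    return "".join(table[t] for t in tokens)

tokens = "capital-l divided-by capital-i comma equivalent comma lower-j comma".split()
table = build_symbol_table(tokens)
print(transcribe(tokens, table))  # abcdedfd
```

The assigned characters carry no meaning of their own; they only make the text convenient to feed into whatever solver gets used next.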


Yes.



It's astonishing to me how little understanding people have of Unicode.



