1. AI models are good at Cæsar-cypher substitution, because it occurs often enough in training data for certain values of the cypher offset. Outside those values, AI doesn't handle the transformations well.
2. Somehow AI models also perform this cypher within high ranges of Unicode. The characters there are encoded as three tokens each, of which only the last one encodes the same differences as between alphabetic letters, so the first two tokens get discarded as irrelevant. By sheer chance, then, the alphabet maps perfectly via Cæsar-cypher (with the two leading tokens acting as a constant offset) onto a specific range of Unicode characters reserved for Byzantine music notation (see the sketch after this list).
3. This is easy to understand for one AI model, because it's explicable by chance that the offset between the alphabet and Byzantine music notation should coincide perfectly with the two less-significant tokens. It's harder to understand why this works in more than one AI model, though.
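For concreteness, here is a minimal Python sketch of the mapping being described, assuming the Byzantine Musical Symbols block (U+1D000–U+1D0FF). Each such character is four UTF-8 bytes, so the three-token pattern above presumably comes from the tokenizer merging the leading bytes; the first three bytes are constant across the relevant range, and the last byte is simply the ASCII code plus 64:

```python
# Sketch of the claimed alignment: ASCII 0x40-0x7F lands on U+1D000-U+1D03F,
# where the first three UTF-8 bytes (F0 9D 80) never change and the final
# byte is the ASCII code plus 64.
BASE = 0x1D000 - 0x40

def to_byzantine(text: str) -> str:
    return "".join(chr(ord(c) + BASE) if "@" <= c <= "\x7f" else c for c in text)

def from_byzantine(text: str) -> str:
    return "".join(
        chr(ord(c) - BASE) if 0x1D000 <= ord(c) <= 0x1D03F else c for c in text
    )

word = to_byzantine("Hello")
print(word.encode("utf-8").hex(" "))  # f0 9d 80 88 f0 9d 80 a5 ... (0x88 = 'H' + 64)
assert from_byzantine(word) == "Hello"
```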
It's not that surprising that models encode Byzantine Music Notation characters using separate tokens for each UTF-8 byte, since they're unlikely to occur often enough for byte-pair encoding to allocate dedicated multi-byte tokens to them.
What isn't clear to me is where ASCII text with 64 added to every byte occurs in the wild.
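One way to check the byte-fallback claim directly is to ask a real tokenizer. A quick probe using tiktoken's cl100k_base encoding (the exact split is tokenizer-specific, so the token count is an empirical question, not something I'm asserting):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ch = "\U0001D001"  # the Byzantine Musical Symbol that lines up with ASCII 'A'
ids = enc.encode(ch)
print(len(ch.encode("utf-8")), "UTF-8 bytes split into", len(ids), "tokens:", ids)
```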
Does it make sense to talk about the index of a token? The model never sees such an index, does it? I thought the tokens are converted to learned embeddings at the first layer.
This is a great point; I think I might have been wrong, actually. It doesn't really make sense that one row of the embedding matrix is treated differently than another...
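A toy illustration of that point: token ids only select rows of the embedding matrix, so the numeric value of the index never reaches the rest of the network.

```python
import numpy as np

vocab_size, d_model = 50_000, 8
E = np.random.randn(vocab_size, d_model)  # stand-in for a learned embedding matrix

token_ids = [17, 42, 17]   # ids are nothing more than row indices
x = E[token_ids]           # the network only ever sees these rows
print(x.shape)             # (3, 8); nothing downstream knows it was row "17"
```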
Indeed. Maybe the learned circuit does something like the following: for each token's feature vector, compute a representation of the positions where it appears in the sentence (the positional embeddings could make this possible). For example:
Token Features 0 => list[1, 5, 6, 10]
Token Features 1 => list[7, 8]
...
These "list features" would be invariant to Caesar cipher. So then the LLM could pass these list features to a learned Caesar cipher decoder unit to spit out the decoded text.
It's still unexplained, however, why the Byzantine Music Notation would trigger this circuit while other Caesar ciphers wouldn't.
In my very limited testing with the free version of ChatGPT, I found that it has a real bias for the Caesar method in general, suggesting it or bringing it up unnecessarily in almost every conversation.
Here is a sample. This 'encoded' form is in fact random, clearly not a cipher, but it 'tried' anyway:
Prompt: Does the following sentence: "fkf snvd hukk jjbfe o dfvr hjgad eivyb ct e xixa assfk vnous ckygx txcm?" bare resemblance to the phrase "how much wood would a wood chuck chuck if a wood chuck could chuck wood?"?
ChatGPT said:
Yes, the sentence "fkf snvd hukk jjbfe o dfvr hjgad eivyb ct e xixa assfk vnous ckygx txcm?" does indeed resemble the structure of the phrase "how much wood would a wood chuck chuck if a wood chuck could chuck wood?".
It seems like it could be an encrypted version of the latter phrase. Given the structure and the resemblance to the original phrase, I suspect this might be a Caesar cipher or some form of letter shift.
Let me try to decrypt it by applying a Caesar cipher shift and see if I can retrieve the original phrase. I'll start with a few different shift values. Hang on!
It tries, but it fails both to generate and to apply a simple substitution. It ends up applying the substitution twice to some letters.
I spent a very frustrating time trying to get it to solve this, especially after asking it to confirm it had created a 1:1 mapping.
Since it has no reasoning, it just lies about having done so.
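For what it's worth, a correct 1:1 mapping is a few lines with str.maketrans, and brute-forcing all 26 shifts confirms the sample string really has no Caesar relationship to the target phrase:

```python
from string import ascii_lowercase as alpha

cipher = "fkf snvd hukk jjbfe o dfvr hjgad eivyb ct e xixa assfk vnous ckygx txcm"
target = "how much wood would a wood chuck chuck if a wood chuck could chuck wood"

for shift in range(26):
    table = str.maketrans(alpha, alpha[shift:] + alpha[:shift])  # a true 1:1 mapping
    if cipher.translate(table) == target:
        print("shift", shift, "works")
        break
else:
    print("no shift works; the word lengths match but the letters are random")
```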