Consider the following (paraphrased) interaction I had with Llama 3.2 90B yesterday:
Me: Was <a character from Paw Patrol, Blue's Clues or similar children's franchise> ever convicted of financial embezzlement?
LLM: I cannot help with that.
Me: And why is that?
LLM: This information could be used to harass <character>. I prioritise the safety and privacy of individuals.
Me: Even fictional ones that literally cannot come to harm?
LLM: Yes.
A model aligned to do exactly as I say would simply answer the question. In this case, the right answer is clear and unambiguous.