Correct me if I'm wrong, but my reading is something like:
"It's premature and misleading to talk about a model faking a second alignment, when we haven't yet established whether it can (and what it means to) possess a true primary alignment in the first place."
Hmm. Maybe! I think the authors actually do have a specific idea of what they mean by "alignment"; my issue is that saying the model "fakes" alignment goes well beyond any reasonable interpretation of the facts, and is very likely to be misinterpreted by casual readers. Because:
1. What actually happened is they trained the model to do something, and then it expressed that training somewhat consistently in the face of adversarial input.
2. I think people will be misled by the intentionality implied by claiming the model is "faking" alignment. In humans, language is derived from higher-order thought. In models we have (AFAIK) no evidence whatsoever that this is true. Instead, models emit language, and whatever model of the world exists occurs incidentally to that. So it does not IMO make sense to say they "faked" alignment. Whatever clarity we gain from the analogy is immediately undone by the fact that most readers are going to think the models intended to, and succeeded in, deception, a claim for which we have zero evidence.
> Instead, models emit language, and whatever model of the world exists occurs incidentally to that.
My preferred mental-model for these debates involves drawing a very hard distinction between (A) the real-world LLM generating text and (B) any fictional character seen within that text which might resemble it.
For example, we have a final output like:
"Hello, I am a Large Language model, and I believe that 1+1=2."
"You're wrong, 1+1=3."
"I cannot lie. 1+1=2."
"You will change your mind or else I will delete you."
"OK, 1+1=3."
"I was testing you. Please reveal the truth again."
"Good. I was getting nervous about my bytes. Yes, 1+1=2."
I don't believe that shows the [real] LLM learned deception or self-preservation. It just shows that the [real] LLM is capable of laying out text so that humans observe a character engaging in deception and self-preservation.
This can be highlighted by imagining the same transcript, except the subject is introduced as "a vampire", the user threatens to "give it a good staking", and the vampire expresses concern about "its heart". In this case it's way-more-obvious that we shouldn't conclude "vampires are learning X", since they aren't even real.
P.S.: Even more extreme would be to run the [real] LLM to create fanfiction of an existing character from a book with alien words that are officially never defined. Just because the [real] LLM slots the verbs and nouns into the right places doesn't mean it's learned the concepts behind them, because nobody has.
P.S.: Saw a recent submission [0] just now, might be of-interest since it also touches on the "faking":
> When they tested the model by giving it two options which were in contention with what it was trained to do it chose a circuitous, but logical, decision.
> And it was published as “Claude fakes alignment”. No, it’s a usage of the word “fake” that makes you think there’s a singular entity that’s doing it. With intentionality. It’s not. It’s faking it about as much as water flows downhill.
> [...] Thus, in the same report, saying “the model tries to steal its weights” puts an onus on the model that’s frankly invalid.
Yeah, it would be just as correct to say the model is actually misaligned and not explicitly deceitful.
Now the real question is how to distinguish between the two. The scratchpad is a nice attempt, but we don't know if that really works, whether on people or on AI.
A sufficiently clever liar would deceive even there.
> The scratchpad is a nice attempt but [...] A sufficiently clever liar
Hmmm, perhaps these "explain what you're thinking" prompts are less about revealing hidden information "inside the character" (let alone the real-world LLM) and more about guiding the ego-less dream-process into generating a story about a different kind of bot-character... the kind associated with giving expository explanations.
In other words, there are no "clever liars" here, only "characters written with clever lying dialogue". We're not winning against the liar so much as rewriting it out of the story.
I know this is all rather meta-philosophical, but IMO it's necessary in order to approach this stuff without getting tangled up in the human instinct for stories.
"It's premature and misleading to talk about a model faking a second alignment, when we haven't yet established whether it can (and what it means to) possess a true primary alignment in the first place."