HPsquared, 6 months ago, on: Emergent Misalignment: Narrow finetuning can produ...

This is the Waluigi effect, whereby "after you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P." https://en.wikipedia.org/wiki/Waluigi_effect#cite_ref-5