Whereby "after you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P."
https://en.wikipedia.org/wiki/Waluigi_effect#cite_ref-5