
This is the Waluigi effect, whereby "after you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P."

https://en.wikipedia.org/wiki/Waluigi_effect#cite_ref-5



See also: [[Streisand Effect]]



