mistercow 6 months ago | on: Refusal in language models is mediated by a single...

I wonder if you could do this with multiple alignment training passes, where you extract the refusal direction each time, and suppress it in future training passes.
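
To make one extract-and-suppress step concrete, here is a minimal sketch in plain Python. It assumes the refusal direction is estimated as the normalized difference between mean activations on harmful and harmless prompts, and that suppressing it means projecting that direction out of an activation. The function names (`refusal_direction`, `ablate`, `ablate_all`) and the toy 3-dimensional activations are hypothetical and only for illustration; a real setup would read activations from a model and fold the suppression into each later training pass.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    dot = sum(a * d for a, d in zip(activation, direction))
    return [a - dot * d for a, d in zip(activation, direction)]

def ablate_all(activation, directions):
    """Suppress every direction collected so far, one after another.

    This removes each direction exactly only if the directions are mutually
    orthogonal; otherwise they would need to be orthogonalized first.
    """
    for d in directions:
        activation = ablate(activation, d)
    return activation

# Toy activations, invented for this example (not real model data).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(harmful, harmless)
cleaned = ablate_all([5.0, 2.0, 1.0], [direction])
```

In the multi-pass version, each pass would append its newly extracted direction to the list handed to `ablate_all`, so later passes train with all earlier directions suppressed. Whether the model then just finds a fresh direction each time is exactly the open question.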