
stainablesteel 6 months ago | on: Refusal in language models is mediated by a single...

If I'm understanding everything correctly, the abliteration concept scans the model's internal activations for something like the "direction" described in this one, and then blocks (projects out) that direction in order to "uncensor" the LLM.
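For what it's worth, here is a rough sketch of what "find the direction, then block it" could look like. It runs on made-up NumPy arrays, not real model activations. The names `refusal_direction` and `ablate` are hypothetical, the difference-of-means estimate is just one simple way to pick a direction, and the toy data is constructed so that a single axis separates the two prompt sets.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Hypothetical helper: estimate a unit direction as the difference
    between the mean activation of the two prompt sets."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Hypothetical helper: remove each activation's component along
    `direction` (assumed unit-norm) by orthogonal projection."""
    return acts - np.outer(acts @ direction, direction)

# Toy data (an assumption for illustration): 8-dim "activations" where
# the harmful set is shifted along axis 0.
rng = np.random.default_rng(0)
d = 8
true_dir = np.zeros(d)
true_dir[0] = 1.0
harmless = rng.normal(size=(100, d))
harmful = rng.normal(size=(100, d)) + 3.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
cleaned = ablate(harmful, r_hat)

# After ablation, no activation has any component left along r_hat.
print(np.abs(cleaned @ r_hat).max())
```

In an actual model the same projection would have to be applied to the activations (or folded into the weights) at the relevant layers; this only shows the linear algebra of the "blocking" step.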