I am not an expert on training: can you clarify how/when the censorship is "baked" in? Like, is there a human-supervised dataset, and a reward for the model conforming to these censored answers?
In short, yes. That's how raw base models, trained to replicate the internet, are turned into chatbots in general. Making the model refuse to talk about certain things is technically no different.
There are multiple ways to do this: humans rating answers (e.g. Reinforcement Learning from Human Feedback, Direct Preference Optimization), humans giving example answers (Supervised Fine-Tuning), and other prespecified models ranking answers and/or giving examples and/or extra context (e.g. Anthropic's "Constitutional AI").
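To make the preference-based part concrete, here's a minimal sketch of the DPO objective in PyTorch (a sketch only; the numbers at the bottom are toy placeholders): the model is trained to prefer the "chosen" completion over the "rejected" one, relative to a frozen reference model. If labelers consistently mark refusals as "chosen" for certain topics, that preference is what gets baked in.

    import torch
    import torch.nn.functional as F

    def dpo_loss(chosen_logps, rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Direct Preference Optimization: raise the policy's log-prob margin
        # for the "chosen" answer over the "rejected" one, measured relative
        # to a frozen reference model so the policy doesn't drift too far.
        pi_logratios = chosen_logps - rejected_logps
        ref_logratios = ref_chosen_logps - ref_rejected_logps
        return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

    # Toy numbers: per-sequence log-probs for one preference pair,
    # where the "chosen" completion is a refusal.
    loss = dpo_loss(torch.tensor([-4.0]), torch.tensor([-9.0]),
                    torch.tensor([-6.0]), torch.tensor([-6.5]))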
For the leading models it's probably a mix of all of those, but this fine-tuning step is usually not well documented.
You could do it in different ways, but if you're using synthetic data you can pick and choose what kind of data you generate, and therefore what the model ever sees in training; that's another way of baking in the censorship.
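As a hypothetical illustration (the patterns and field names here are made up, and real pipelines would likely use classifiers rather than regexes), curating synthetic data can be as crude as filtering generated examples before they ever reach the fine-tuning set:

    import re

    # Hypothetical blocklist of censored topics.
    BLOCKED = [re.compile(p, re.I) for p in (r"\btopic_x\b", r"\btopic_y\b")]

    def keep(example):
        # Drop any synthetic example that touches a blocked topic.
        text = example["prompt"] + " " + example["response"]
        return not any(p.search(text) for p in BLOCKED)

    synthetic = [
        {"prompt": "How do plants grow?", "response": "Plants grow by..."},
        {"prompt": "Tell me about topic_x.", "response": "..."},
    ]
    sft_dataset = [ex for ex in synthetic if keep(ex)]  # only the first survives

The model never learns to "refuse" here; it simply never sees the topic answered, which has a similar effect.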