I think this argument doesn't work if the model is open source, though.
First, it's unclear how all these defensive measures are supposed to help if a bad actor is using an LLM for evil on their personal machine. How do reflection types or watch lists help in that scenario?
Second, if the model is open source, a bad actor could use it for evil before good actors are able to devise, implement, and stress-test all the defensive measures you describe.