Honesty/truthfulness is indeed a difficult problem with any kind of fine-tuning. There is no way to incentivize the model to say what it believes to be true rather than what human raters would regard as true. Future models could become actively deceptive.