RLHF suggests that human feedback used to tune a model after plain-text pretraining is quite potent per sample. There may be some optimal ratio of pretraining data and model size to RLHF data size that works favorably for us in driving hallucinations to a minimum. Furthermore, there might be some "there" there in the hallucinations, something that has yet to be identified as valuable in itself. Either way, it seems like our ability to wrangle these models is getting better.
How would you "self-validate" against hallucinated facts?
What makes self-validation possible is a set of hard external rules that can be evaluated independently and automatically, like the rules of Chess or Go.
We don't have anything like that for LLMs and what people want to use them for.
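To make that concrete, here is a minimal sketch of what "hard external rules" buy you, using chess as the example. The python-chess library and the UCI move format are just illustrative choices, not anything specific to this thread: the point is that any move a model proposes can be checked mechanically against the rules, with no human judgment involved.

    # Minimal sketch: validating a model-proposed chess move against hard external rules.
    # Assumes the python-chess library and a model that emits moves in UCI notation
    # (both are illustrative assumptions).
    import chess

    def is_valid_move(fen: str, proposed_uci: str) -> bool:
        """Return True if the proposed move is legal in the given position."""
        board = chess.Board(fen)
        try:
            move = chess.Move.from_uci(proposed_uci)
        except ValueError:  # not even syntactically a move
            return False
        return move in board.legal_moves  # the rules of chess decide, automatically

    # Example: from the starting position, "e2e4" is legal, "e2e5" is not.
    start = chess.STARTING_FEN
    print(is_valid_move(start, "e2e4"))  # True
    print(is_valid_move(start, "e2e5"))  # False

There is no analogous oracle for "is this cited paper real" or "did this event actually happen", which is exactly the gap the parent comment is pointing at.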