
> The language model could have "hallucinated" its own system prompt instructions, leaving no guarantee that this is the real deal.

How would you detect this? I always wonder about this whenever I see a 'jailbreak' or similar for an LLM...

In this case it’s easy: get the model to output its own system prompt and then compare to the published (authoritative) version.
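
For anyone who wants to make that comparison concrete, here's a minimal sketch (the sample prompt strings below are made up; in practice `claimed` would come from asking the model to repeat its system prompt, and `published` from the vendor's published version):

    import difflib

    # Hypothetical example texts, stand-ins for the real prompts.
    published = "You are a helpful assistant.\nAnswer concisely."
    claimed = "You are a helpful assistant.\nAnswer very concisely."

    # Line-by-line diff between the published prompt and what the
    # model says its system prompt is.
    diff = difflib.unified_diff(
        published.splitlines(),
        claimed.splitlines(),
        fromfile="published",
        tofile="model_output",
        lineterm="",
    )
    print("\n".join(diff))

An empty diff doesn't prove the model isn't hallucinating, but a close match to the authoritative text is strong evidence it isn't.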

The actual system prompt, the “public” version, and whatever the model outputs could all be fairly different from each other though.
