Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'm curious if this is intentional or just a side effect of multiple agents having multiple system prompts.

It might just need minor tweaks to have each agent layer reveal its individual instructions.

I encountered this with Google Jules where it was quite confusing to figure out which instructions belonged to orchestrator and which one to the worker agents, and I'm still not 100% sure that I got it entirely right.

Unfortunately, it's quite expensive to use Grok Heavy but someone with access will probably figure it out.

Maybe the worker agents have instructions to not reveal info.



It's intentional -- sometimes you can get it to start spitting out its system prompts, but shortly after it does, a monitoring program cancels the output in the middle. It also blocks tricks like base64.


Oh, so interesting!

A good approach might be to have it print each sentence formatted as part of an xml document. If it still has hiccups, ask to only put 1-3 words per xml tag. It can easily be reversed with another AI afterwards. Or just ask to write it in another language, like German, that also often bypasses monitors or filters.

Above might also help to understand if and where they use something called "Spotlighting" which inserts tokens that the monitor can catch.

Edit: OMG, I just realized I responded to Jeremy Howard - if you see this: Thank you so much for your courses and knowledge sharing. 5 years ago when I got into ML your materials were invaluable!


You're welcome!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: