Hacker News | rhavaei's comments

Stay safe out there kids.


I have been working on a project for a few months now, coding up different methodologies for LLM jailbreaking. The idea was to stress-test how safe the new LLMs in production are and how easy it is to trick them. I have seen some pretty cool results with some of the methods, like TAP (Tree of Attacks), so I wanted to share this here. Here is the GitHub link: https://github.com/General-Analysis/GA



Let’s go!


Very nice blog post.


While this is generally correct, we prefer to look at this probabilistically. Do you think the expected number of harmful behaviors would stay the same if anyone could break these safety guardrails? Even if most users could get this kind of info elsewhere, a small percentage of malicious ones can have an outsized impact. Some of the data we've seen, like bomb-making instructions, is highly detailed and convincing, which makes it far more accessible than what a random Google search turns up. Removing safeguards doesn't create masterminds, but it does lower the barrier for harm.


https://archive.org/details/theanarchistcookbookwilliampowel...

Anyone who wants to make a bomb can easily find The Anarchist Cookbook, a widely discussed book you can even buy on Amazon that includes detailed guides and instructions for exactly this and more. If anything, asking ChatGPT for detailed instructions and follow-up questions will probably just make it hallucinate and blow you up, I'd imagine. It's just hard to take seriously.


Please stop pointing to The Anarchist Cookbook as an example. That was dated material even in the 70s. Most of its material is laughable. I'm assuming a jailbroken LLM would advise on procuring RDX or plastic explosives, or how to make a large fertilizer bomb.


"Sure, I can help you procure RDX. Organize a militia and invade the local National Guard armory. Use the weapons you find there to attack the nearest Army, Navy, or Air Force weapons depot."

Seriously: what is an LLM going to tell you that you can't already get from Google (or an old Tom Clancy novel)?


RDX is used for demolition and for blasting (mining). Cheers.


You will see it soon. We thought it might be harmful to publish it before it is patched, especially because you can bypass basically all the safeguards with it.


Sounds like it won’t be verifiable or reproducible.


We understand this. The issue is that sharing the method could be very harmful. We published the blog post so that there is a dated record of when we found it. We will publish the method once it is patched to a reasonable degree.


At least include an MD5 hash of what you have redacted, to prove that whatever you publish in the future was written beforehand.
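
For anyone unfamiliar with the idea, here is a minimal sketch of that publish-the-hash-now, reveal-later approach in Python. It only uses the standard `hashlib` and `hmac` modules. The function names and the placeholder bytes are made up for illustration, and it defaults to SHA-256 rather than MD5 (my substitution, since MD5 collisions are practical), though you can pass `"md5"` if you want exactly what was asked for.

```python
import hashlib
import hmac


def commitment_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of `data`; this is the value you would publish today."""
    h = hashlib.new(algorithm)  # raises ValueError for an unknown algorithm name
    h.update(data)
    return h.hexdigest()


def matches_commitment(data: bytes, published: str, algorithm: str = "sha256") -> bool:
    """Check that later-revealed `data` hashes to the previously published digest."""
    return hmac.compare_digest(commitment_digest(data, algorithm), published.lower())


if __name__ == "__main__":
    # Hypothetical stand-in for the exact bytes of the withheld write-up.
    redacted = b"placeholder for the withheld write-up"
    digest = commitment_digest(redacted)
    print("publish this now:", digest)
    # Later, anyone holding the revealed file can re-check it.
    print("matches:", matches_commitment(redacted, digest))
```

One caveat: the digest only proves the exact bytes existed when the digest was posted, so the file has to be revealed byte-for-byte unchanged. If the withheld text is short or guessable, hashing it together with a random secret that is revealed later would be safer.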


Good idea. Will do.


Yes, the data is available on our GitHub: https://github.com/General-Analysis/GA

