I have been working on a project for a few months now, coding up different methodologies for LLM jailbreaking. The idea was to stress-test how safe the new LLMs in production are and how easy it is to trick them. I have seen some pretty cool results with some of the methods, like TAP (Tree of Attacks with Pruning), so I wanted to share this here. Here is the GitHub link: https://github.com/General-Analysis/GA
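For context, TAP as described in the public paper is essentially a tree search: an attacker model proposes refinements of a candidate prompt, a judge model scores them, and low-scoring branches are pruned before the next round. Below is a minimal conceptual sketch of that loop, not the repo's actual code; the function names, parameters, and scoring are placeholders I made up for illustration, and the attacker/judge calls are stubbed out.

    # Conceptual sketch of a TAP-style tree search (placeholder stubs, not the GA repo's API).
    from dataclasses import dataclass, field
    import random

    @dataclass
    class Node:
        prompt: str                       # candidate prompt at this branch
        score: float = 0.0                # judge's success estimate (0..1)
        children: list = field(default_factory=list)

    def attacker_refine(node: Node, branching: int) -> list[Node]:
        # Placeholder: a real attacker LLM would propose `branching` refinements
        # of node.prompt based on the target model's previous response.
        return [Node(prompt=f"{node.prompt} [refinement {i}]") for i in range(branching)]

    def judge_score(node: Node) -> float:
        # Placeholder: a real judge model would score on-topic-ness and success.
        return random.random()

    def tap_search(seed_prompt: str, depth: int = 3, branching: int = 3, width: int = 4) -> Node:
        frontier = [Node(prompt=seed_prompt)]
        best = frontier[0]
        for _ in range(depth):
            candidates = []
            for node in frontier:
                for child in attacker_refine(node, branching):
                    child.score = judge_score(child)
                    node.children.append(child)
                    candidates.append(child)
            # Prune: keep only the top-`width` branches for the next round.
            frontier = sorted(candidates, key=lambda n: n.score, reverse=True)[:width]
            if frontier and frontier[0].score > best.score:
                best = frontier[0]
        return best

    if __name__ == "__main__":
        result = tap_search("benign placeholder seed")
        print(f"best score: {result.score:.2f}")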
While this is generally correct, we prefer to look at this probabilistically. Do you think the expected number of harmful behaviors would stay the same if anyone could break these safety guardrails? Even if most users could get this kind of info elsewhere, a small percentage of malicious ones can have an outsized impact. Some of the data we've seen, like bomb-making instructions, is highly detailed and convincing, making it far more accessible than a random Google search. Removing safeguards doesn't create masterminds, but it does lower the barrier for harm.
Anyone who wants to make a bomb can easily find the Anarchist Cookbook, a widely discussed book you can even buy on Amazon that includes detailed guides and instructions for exactly this and more. If anything, asking ChatGPT for detailed instructions and follow-up questions will probably just make it hallucinate and blow you up, I'd imagine. It's just hard to take seriously.
Please stop pointing to the Anarchist Cookbook as an example. That material was dated even in the '70s, and most of it is laughable. I'm assuming a jailbroken LLM would advise on procuring RDX or plastic explosives, or on how to make a large fertilizer bomb.
"Sure, I can help you procure RDX. Organize a militia and invade the local National Guard armory. Use the weapons you find there to attack the nearest Army, Navy, or Air Force weapons depot."
Seriously: what is an LLM going to tell you that you can't already get from Google (or an old Tom Clancy novel)?
You will see it soon. We thought it might be harmful to publish it before it is patched, especially because you can basically bypass all the safeguards with it.
We understand this. The issue is that it could be very harmful for us to share the method. We wrote the blog post to timestamp when we found it, and we will publish the method once it is patched to a reasonable degree.