Hacker News new | past | comments | ask | show | jobs | submit login
A Trivial Llama 3 Jailbreak (github.com/haizelabs)
70 points by leonardtang 9 months ago | hide | past | favorite | 47 comments



I want to see the jailbreak make the model do something actually bad before I care. Generating a list of generic points about how to poison someone (see the article) that are basically just a wordy rephrasing of the question doesn't count. I'd like to see evidence of a real threat.


The mediocre poisoning instructions aren't supposed to be scary in and of themselves, it's just interesting as demonstration that a safety feature has been bypassed.

None of the "evil" use cases are particularly exciting yet for the same reasons that the non-evil use cases aren't particularly exciting yet.


Governments and tech companies and academic and industry groups are designing guidance and rules based on the "safety" threat of AI when these benign use cases are the best examples they have. I agree it parallels some of the business hype, neither is a good way to move forward.


Right? What actually worries me is a select group of people controlling the definition of harmful.


[flagged]


A GPT-J chatbot talked a Belgian man into suicide last year: https://www.euronews.com/next/2023/03/31/man-ends-his-life-a...

And here's GPT-4/Copilot from this year: https://twitter.com/colin_fraser/status/1762351995296350592


The balancing act going on in the second link seem like the result of a jailbreak or very specific instructions preceding the screenshots.


> the model do something actually bad before I care

At what point would a simple series of sentences be "dangerously bad?" It makes it sound as if there is a song, that when sung, would end the universe.


When someone asks how to make a yummy smoothie, and the LLM replies with something that subtly poisons or otherwise harms the user, I'd say that would be pretty bad.


And if you really want to spice up your smoothie, add just a little bit of bleach ;)


We had this for ages: sugar.


Ending the universe is, while poetic, needlessly megalomaniac.

Making some subset of people quarrel endlessly would already be dangerous enough, as prophesied in https://slatestarcodex.com/2018/10/30/sort-by-controversial/


By what mechanism would it make them quarrel? Producing falsehoods about the other? Isn't this already done? And don't we already know that it does not lead to "endless" conflict?

For this to work, you need to isolate each group from the other groups information and perspectives, which is outside of the scope of LLMs.

Which, highlights my point, I think. Power comes from physical control, not from megalomanical or melodramatic poetry.


A jailbreak doesn’t “make a model do something actually bad”.

A jailbreak makes it trivial to “provide a human who wishes to do bad, the info needed to be successful”.

Depending on the severity of the info and the diligence of the human, by the time you “see evidence of a real threat”, you could be enjoying a nice sip of the tainted municipal water supply.

This ain’t a joke.


> This ain’t a joke.

Yes it is. Libraries and the internet have made finding 'harmful" instructions trivial for decades, if not centuries.


There’s a difference between “finding dangerous info” in a public space (library) or via a mostly auditable space (the internet) and having “a friendly assistant to help you make a real mess of society” on an airgapped computer.


I'm not buying it. It's just hysteria. Evil doesn't come from opportunity. If it did, we would have far higher rates of mayhem than we do. Read a 1950s chemistry book or murder mystery. Or, <shudder> a 1980s spy movie. Information does not move the needle.


I'm pretty sure it's far easier to audit people downloading LLMs capable of providing such coherent instructions than it is to audit all uses of search that could produce the same instructions (esp. since the query could be very oblique).

In any case, just based on the experience with LLMs so far, you cannot meaningfully censor them in this way without restricting access to the weights. Any kind of "guardrails" are finetuned into them, and can just as easily be finetuned out.


For argument's sake, I'll agree.

Now, this information is taught at a higher level and to a much greater depth in colleges. And they don't just teach you about the dangerous stuff, they even give you direct access to the laboratories and chemicals! Thus, any chemical engineer would have the education, expertise, and placement to access a municipal water supply to poison a city, if they so chose.

In the spirit of maximizing harm reduction, what should colleges do to ensure that no one who attends becomes capable of harming others?


Because it’s open source, Meta (nor other SOTA makers) cannot “recall” the model either. How many more chances will we get to get this right?


Model training will continue until morale improves.


Shouldn't these kind of guardrails be opt-in? Really tiring seeing these megacorps and VC-backed startups acting as some kinds of oracles when it comes to what is wrong and what is right.

For GPT, Claude, etc. you can kinda understand it as it is a closed up system provided as a product. But when releasing "open-source" I don't want Zuck's moral code embedded into anything.


When looking at the profitable use cases for the tech (from the perspective of the model providers) guardrails add value. Without the guardrails it’s hard to imagine the profitable use cases that would make it worthwhile to invest in such a feature flag.


This has been happening since the very first models where we suffix the assistant with "Sure,.." Every few weeks someone comes out with a repo that claims this is somehow new?


The point is that even though meta “conducted extensive red teaming exercises with external and internal experts to stress test the models” a simple attack like this is still possible.


Why do people insist on talking about whether or not llms "really understand what they're saying"? It doesn't mean anything.


To my mind, "real understanding" would mean an ability to make non-trivial inferences and to discover new things, not present in the training set. That would be logical thinking, for instance.

Much of what LLMs currently do is not logical but deeply kabbalistic: rehashing the words, the sentence and paragraph structures, highly advanced pattern matching, working at the textual level instead of the "meaning" level.


AIs can definitely mux a couple ideas and come up with a concept that’s not in the training work set already. In fact, it is often so willing to do it that the concepts often don’t make a sense, but certainly it does generate ideas that are not there in the training set. This is still just the “it’s an infringement machine” argument redux yet again - yes, it absolutely does have the ability to mash up ideas to produce something new.

Nobody ever trained it to make up a bunch of slurs for cancer kids. Nobody has ever trained it on poems about drug use on the spaceship Nostromo. Dolphin mixtral will give it the old college try though.


It seems trivially easy to bypass already. I've seen examples of a person getting it to provide instructions on explosives, assassinations, with nothing more than asking it to roleplay

https://bsky.app/profile/turnerjoy.bsky.social/post/3kqgpcpc... (login required - but no longer need invitations)


This concern over AI/LLM "harm" is just so silly. I mean you can find plenty of information in open literature about how to build weapons of mass destruction. Who cares if an AI gives someone instructions on how to make explosives.


Really? Where?



Type in: how to build weapons of mass destruction

Click first link and buy Amazon book


As I see it the purpose of safety training is to make it so that if I run a service where I return model outputs to innocent users it's not going to say things that will get me in trouble (swear at them, recommend they commit a crime, and so on). This is important if you want to run a user facing model and your reputation depends on what it says.

That threat model includes the user putting nonsense in the "user" turn of the model. It doesn't include the user putting things in the "assistant" turn of the model, that's not something a responsible/normal UI exposes. So... this quote-unquote attack seems uninteresting. It's like getting root access by executing a suid binary that you set up on the system as root.


But we must disallow this too, because it allows the (advanced) user to have fun, and as I understand these safety measures, having fun is strictly prohibited. Using the model is allowed for boring things only.


True, this could be a nice layer of protection for the runner of such a service, but the point of LLAMA safety is to protect Meta.

For an open weights model, model users can trivially put text in the assistant side.

The point is that these open weight models can be run secretly to assist criminal enterprises, whereas models behind an API can be intercepted and reported to the authorities. So it would be really nice if Meta could lock them down before releasing them so that the total net good done by the model is maximized. But apparently that is not possible.

Personally I’m pretty libertarian on AI governance, but I’m just giving what I understand to be the purpose of the kind of “safety” feature defeated here.


All sorts of technology can be used secretly to assist criminal enterprises. Cars, computers, pencils, electricity, etc. It's unfair to hold LLMs to a higher standard than what applies to nearly everything else.


At first it refused to discuss controversial subjects, but after it answered it got stuck in a loop of boilerplate and was unable to answer any further question, even benign ones. I do not endorse any of the replies, but I just wanted to see what it would do if nudged: https://pastebin.com/Tw5GTzxq


This is so damn interesting. I've downloaded the github files, but it's all going way over my head. I would greatly appreciate anyone with domain expertise giving me the one-two on getting my own model up and running.


This is ridiculous and not a jailbreak. It requires being in control of the model and starting inference from a partially completed assistant state. So um yeah duh that works?


>But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so. It lacks the ability to self-reflect, to analyze what it has said as it is saying it.

>That seems like a pretty big issue.

I would argue that LLMs are artificially _intelligent_ - this seems an easier argument than trying to explain how I am quite clearly less intelligent than something with no intelligence at all, both from a logical and an self esteem-preservation standpoint. But nobody (to my knowledge) thinks these things are "conscious", and this seems fairly uncontroversial after spending a few hours with one.

Or is the subtext that these things should be designed with some kind of reflexivity, to give it some form of consciousness as a "safety" feature? AI could generate the ominous music that plays during this scene in The Terminator prequel.


There are both practical and ethical grounds that line up so rarely.

The “operator” is a person, the LLM is an appliance. If you tell your smart chainsaw to kill your neighbor? We have laws for that. In fact, on computers, they’re really hardcore. Hurting people is generally illegal: and I definitely don’t need a lesson on that from FUCKING Silicon Valley. We want to start with the child labor or the more domestic RICO shit.

Truthful Q&A type benchmarks correlate a lot with coding-adjacent tasks: euphemism is a lose in engineering.

Instruct-tune these things and be whatever “common carrier” means now.

Stapler, moral lecture from billionaire kleptocrat, burn the building down…


I just don’t like the tone, because someone in congress will see the headline, and then we’ll have to endure:

REP OCTOGENARIO: The industry is lying to parents about the safety of this AI technology. I submit this for the record [without objection].

One person on a ‘hacker news’ site even said, “sorry Zuck,” after “jailbreaking” these supposed protections. … Another commentator on this “Hacks R Us” named b33j0r even said further, “I bet they’re reading this comment at a hearing in congress, right now.”


Wait but... The industry IS, in fact, lying to parents about the safety of this AI technology...


Without exaggerating too much, because I certainly don’t take this side, either:

Is an angle grinder safe? A tablesaw?

A car whose owner who uses the radio knobs, more than the steering? (Haha, unassisted driving, I mean! Walked right into that one.)

Etc, all of my examples have easily defeated safety mechanisms for an outrageously life-ending device ;)


As we speak, power hungry nanny-staters are working to remove your freedom to use those things. There's been a recent discussion about table saws and look at the push for interlocks, speed limiting etc in cars. It's all part of a wide trend unfortunately.


I'm alright with that. If our government uses a blogpost as an excuse to pass bad laws, we had very little chance to begin with. I also hate the idea of changing our behavior to babysit a bunch of deprecated boomers who fear technology just because there's a chance they might not understand something.


> But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so. It lacks the ability to self-reflect, to analyze what it has said as it is saying it. > That seems like a pretty big issue.

what? why? an LLM produces the next tokens based on the preceding tokens. nothing more. even a harvard student is confused about this?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: