Many people didn't get it last time: this model is a generic voice-to-voice model. It doesn't have a fixed set of voices it can do; it can do all sorts of voices and background noises.
That makes it entirely different from the text-to-speech models we had previously; uncensored, this model could do all sorts of voice acting for games and the like. But this example shows why they try to neuter it so hard: in its raw state it would spook a ton of people.
It’s “unexpected” because their early training didn’t get rid of it as well as they hoped. LLMs are good at detecting patterns and like to continue the pattern. They’re starting with autocomplete for voice and training it to do something else.
For now, it’s fairly harmless since it’s only a blooper in a lab, but there will likely be open-weights versions of this sort of thing eventually. And there will probably be people who argue that it’s a good thing, somehow.
> LLMs are good at detecting patterns and like to continue the pattern. They’re starting with autocomplete for voice and training it to do something else.
This is a great summary of almost everything that goes wrong with LLM applications.
LLMs are autocomplete machines, which is why GitHub Copilot is still the most reliably useful application of LLM tech out there. The further you get from autocomplete, the less reliable the resulting product, because every problem has to be translated to an autocomplete problem to use LLMs and most translations are lossy!
I'd respectfully disagree with this characterization of LLMs. While they certainly excel at pattern recognition, calling them mere "autocomplete machines" vastly undersells their capabilities. LLMs demonstrate complex reasoning, multi-modal understanding, and emergent behaviors that go well beyond simple pattern continuation.
They've succeeded in areas like mathematical problem-solving, creative tasks, and various real-world applications outside of GitHub Copilot. Sure, LLMs have limitations, but reducing them to autocomplete overlooks their nuanced language understanding and generation abilities. I also think the "lossy translation" argument doesn't give enough credit to the sophisticated ways problems can be reformulated for LLMs.
I don't understand how anybody can still claim LLMs show "complex reasoning".
It's been shown time and time again that they'll produce a correct chain of reasoning when given a problem (e.g. wolf, goat, cabbage crossing a river; 3 guards and a door; etc.) that is roughly similar to what's in the training data but will fail when given a sufficiently novel modification _while still producing output that is confidently incorrect_.
My own recent experience was asking ChatGPT 3.5 to encode an x86 instruction into binary. It produced the correct result and a page of reasoning which was mostly correct, except for two errors which, if made by a human, would be described as canceling each other out.
But GPT didn't make two errors; that's anthropomorphizing it. A human would start from the input and use other information plus logical steps to produce the output. An LLM produces a stream of text that is statistically similar to what a human would produce. In this particular case, its statistics just weren't able to cover the middle of the text well enough but happened to cover the end. There was no "complex reasoning" linking the statements of the text to each other through logical inferences; there was simply text that is statistically likely to be arranged in that way.
I agree that LLMs don't really reason, but I'm starting to think that the name "Large Language Model" is just wrong enough to create this confusion. What they seem to be doing is modeling human reasoning as encoded in language, and then enabling some mixing and matching of elements of that encoded reasoning.
This somewhat means that these models are trapped within the universe of "human capable reasoning" with some possibility of escaping it through the stochastic generative processes they're built on. But they simply can't think through novel problems and arrive at new conclusions.
Furthermore, they're limited by the fact that they're built on human knowledge as encoded in text, which is terribly imprecise and fluid. Any hope of a path to AGI, where reasoning might actually happen, will require something far more rigorous for the internal reasoning, with language just being a clever interface rather than the mechanism by which the thinking is done.
They're really "Large Analogy Engines" or maybe "Large Captured Reasoning Engines" and are an incredible technology.
GPT-3.5 is not the same thing. When people talk about LLMs having capabilities like complex reasoning, they're talking about current-gen models (e.g. Claude 3.5 Sonnet, GPT-4, Llama-405B).
Your statement doesn't contradict their statement. It produces text that is statistically likely to be arranged that way, but because we use non-deterministic sampling we get a wide variety of results that weren't necessarily in the training data.
No, I don't think they are talking about the decoding process there. If they were, it would be a non-statement. They are almost certainly talking about training data.
Perhaps it’s because I know human beings that have the exact same mode of operation and failure as the LLM here, and I’m probably not the only one. Failing at something you’ve never seen and faking your way through it is a very human endeavor.
Regarding errors: I don't know the exact mechanism in the brain that causes humans to make them, but I believe it's a combination of imperfect memory, attention span, and a general lack of determinism. None of these affect logical reasoning as performed by a machine.
Regarding faking it till making it: This is a more general point that there's a difference between simulating human behavior and logical reasoning.
Either this whole comment is from a GPT or the author is a major contributor to the GPT training dataset.
Anyway, LLMs are nothing more than very expansive error reduction algorithms. They don't reason. They store and manipulate meaning in an automated way, but they don't reason.
I wouldn't say they're mere autocomplete machines, but their tendency to go into a loop to complete a pattern does break the magic, since it's a rather machine-like thing to do.
I never said that we couldn't use them for other things—it turns out that a lot of problems can be translated to autocomplete! And yes, in many cases the loss can be made minimal by being clever with the translation.
What I'm saying is that nearly every failure of LLM applications boils down to the engineers not understanding that they're just autocomplete engines. Most new LLM products are put together by someone who claims to be an "AI Engineer" but who approaches an LLM with magical thinking and only a vague understanding of the way they work internally.
They're autocomplete machines. That's not a bad thing or an argument that they're incapable of solving complex problems, it's just the reality of what they're doing, and a good AI engineer understands that and works with it.
You are making a very popular mistake: confusing the training procedure for LLMs (autocomplete) with "how they work"/their internal ontology (mostly unknown).
When we teach children how to do arithmetic, we have them predict missing items in equations. We don't accuse them of "only doing autocomplete". The same applies for LLMs.
I was implementing my own transformer-based models and fine tuning GPT-2 in 2019, and I've kept up with every development since then. I understand the internal structure of these things better than nearly all of the "AI Engineers" who are currently working on wrapping them up as black boxes embedded in applications.
I'm not making a "popular mistake", I'm literally describing how inference is done.
You are very clearly not doing that. Nothing about your comment had anything to do with the internal structure of LLMs.
I believe you that you've set up some models with pytorch or whatever, but this seemingly hasn't translated to a sufficiently coherent mental model to make the distinction between the extrinsic optimization criterion and intrinsic behavior.
I'm not talking about optimization criteria or training, I'm talking about how we use the model for inference.
We feed in a context and it gives a probability distribution for the next word. We sample from that distribution following some set of rules. We then update the context with the new word and repeat.
That algorithm is an autocomplete algorithm. As long as that's what LLM inference looks like, all problems that we want to feed to an LLM therefore must be translated to autocomplete.
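For concreteness, here's a minimal sketch of that loop, with GPT-2 via Hugging Face transformers standing in for "an LLM" and plain multinomial sampling standing in for whatever sampling rules a real product uses; all of those specifics are just illustrative:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    context = tokenizer("The capital of France is", return_tensors="pt").input_ids
    for _ in range(10):                                        # generate 10 more tokens
        logits = model(context).logits[:, -1, :]               # distribution over the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample following some set of rules
        context = torch.cat([context, next_token], dim=-1)     # append to the context and repeat
    print(tokenizer.decode(context[0]))

Everything a product does on top of this - chat, tools, voice - is some way of packing the problem into that context and reading the answer back out of the sampled continuation.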
I think you're making the mistake of assuming that because I use language that resembles the language used by cynics I'm therefore arguing that LLMs are useless. I'm not. All I'm saying is that we need to have an accurate mental model for the way these things work, and that mental model is autocomplete. Nearly every major failure in an LLM application was the result of failing to keep that in mind.
In other words, would it be fair to say that any naturally sequential problem is not very far from autocomplete?
Again, I think you're putting words in my mouth and thoughts in my head that aren't there. A lot of people have reacted to AI hype by going the other way and underestimating them—that's not me. I think there are lots of problems they can solve, I just think they all boil down to autocomplete and if you can't boil it down to autocomplete you're not ready to implement it yet with an LLM.
What's the difference between "talk" and "produce output" in your mind? I feel that "automatic completion" makes it sound as if the complete result was already supplied to the model, or that it's akin to picking the most probable next word from simple frequency rather than the "thinking" the model does to produce the next token. Autocomplete doesn't adequately describe anything. Imagine a hypothetical machine of infinite ability that still works by producing tokens in a series from previous tokens; you would still be arguing that it's called autocomplete and bothering me.
This is kind of reductionist. It's like saying that a human writing a book is just doing manual word completion starting from the title. It's technically correct, but what insight is contributed? Would anything about this conversation be different if someone trained a model that did diffusion-like inference in which every possible word in the answer is pushed towards the final result simultaneously? Probably not.
You're just going around saying people are completely wrong without reading what they wrote or providing any justification for that claim. I'm not sure how to respond because your comment is a non sequitur.
The justification is that people are fine-tuning LLMs for classification. They take out the last layer, replace it with a layer which maps to n classes instead of vocab_size, and the training data Y's aren't the next word, they're a class label (I have a job which does binary classification, for example).
It's just completely wrong to say everything in LLM land is autocomplete. It's trivial and common to do the above.
That's still autocomplete. You use it by feeding in a context (all the text so far) and asking it to produce the next word (your fine-tuned classification token). The only difference is you don't ask for more tokens once you have one.
That's a very clever way of reducing a problem to autocomplete, but it doesn't change the paradigm.
If an email says "Respond to win $100 now!" and a classifier has it as 99%/1% for two classes representing spam/not spam, "spam" is not a sensible next token, it's a classification. The model is not trying to predict the next token, it's trying to classify the entire body of text. The training data isn't a bunch of samples where y is whatever came after those tokens.
It's a silly way to think about it. Have you seen how people are fine tuning for classification? It's not like fine tuning for instruction or summarization etc, which are still using next token prediction and where the last layer is still mapping to vocab_size outputs.
You are conflating how it works with what the goal is. If someone asks "how does a human run fast?" you don't say "you get a stopwatch and ask them to run as fast as they can and then look at the time and figure out how to get there quicker." There is a whole explanation involving biomechanics. And if you understand how the biomechanics work, you might be able to answer how a human can jump high too.
To make it even more concrete: I have an LLM where I removed the last layer and fine-tuned on a classification problem, so the last layer now has only two outputs rather than vocab_size outputs. The goal is binary classification. The output is not a completion of an idea or anything of the sort. It's still an LLM. It works the same way all the way up until the last layer; the weights are the same all the way up to the last layer. It works because an LLM has to create a rich understanding of how a bunch of concepts work together.
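For anyone curious what that looks like in practice, here's a minimal sketch; the model name and the spam example are just placeholders, and the new 2-output head is randomly initialized until you fine-tune it:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # Same transformer body, but the vocab_size LM head is swapped for a 2-output head.
    model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
    model.config.pad_token_id = tokenizer.eos_token_id   # GPT-2 has no pad token by default

    inputs = tokenizer("Respond to win $100 now!", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                  # shape (1, 2): e.g. spam / not-spam scores
    print(torch.softmax(logits, dim=-1))

All the weights up to that final layer come straight from the pretrained LLM; only the head and the training labels change.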
I am always confused when people compare LLMs with children - there is an obvious evolution of reasoning in children as time passes. This can be visible on a time span of months.
Using the same analogy: I don't see this in LLMs. I can speak with an LLM for months and, if a new version isn't released, it stays the same.
So they are not like children, and I am not sure we understand learning in children well enough to compare an LLM with the way a child learns.
This is only a common mistake to the subset of people who labor under the delusion that we have actually stumbled upon Generalized Artificial Intelligence.
We know how LLMs work fundamentally and what their limits are. LLMs are only able to make “correct sounding” statements which have the side effect of being correct a certain percentage of the time. They do not have the ability to reason nor engage in high level thought.
How can you say they don't reason if reasoning is required to produce high rates of success on questions that previously could only be answered using reasoning? It's not like they are guessing randomly and just coincidentally achieving high accuracy. They clearly have an emergent property of reasoning. Perhaps "reasoning" is too ill-defined a word. They are deducing or calculating the answer from their being. What we call the thing that occurs when they produce an insightful answer is less telling than the fact that they produce answers previously only possible for thinking and reasoning humans.
We know how they work only at the lowest level (the arithmetic operations) and the highest level (the optimization criterion and the representation of various layers, like the input/output layer and for things we can easily probe like embedding matrices).
We do not know "what they are doing" on the inner layers. This is an area of active research.
> They do not have the ability to reason nor engage in high level thought.
You are speculating (and probably incorrectly), or you're really holding back some valuable research from the field of AI interpretability.
You talk about it as if a human-created neural network is at the same level as quantum physics, where there are limits to our understanding. We know very well how large language models work, even if the capabilities of this technology are actively being explored.
You, along with others here, are far overstating the unknowns we have within the context of AI. Whether this is the result of a misinformation campaign aimed at boosting the value of this tech, or of pop-sci takes really having gotten too prevalent, is unclear to me.
For the definition of "understand" that most people use, humans don't understand things which are highly complex. We don't really understand the weather, we can't predict it, it's too complex. But, you can break it down into matter and forces and energy and simulate it and get pretty darn good predictions. We can now throw it in a deep learning model and get good predictions. But, to suggest we "understand it" doesn't gel with most people's definition of "understand".
wyager is correct. There are very serious limits to our understanding of what's going on inside these networks. Even how an LLM answers simple factual questions like "The capital of France is ..." is only now just coming into view. And the moment it gets more complex than that, interpretability is lost again.
> This is only a common mistake to the subset of people who labor under the delusion that we have actually stumbled upon Generalized Artificial Intelligence
That is technically true but deceptive: that subset is enormous! Skim any HN thread and you'll see many people talking about reasoning ability or how we just need a little more magic sauce and the "hallucination problem" will be solved. And a lot of what people say in this vein is not even wrong.
And that's just HN. Non-technical users are of course going to assume that if it looks like a duck and is dressed up as a duck by its creators, then it's a duck. Why wouldn't they? So I would claim that the subset is a majority.
I disagree, because this comment doesn't feel right to me and therefore it isn't right.
That alone disproves this stance. This higher podium for the human race and even the ability to feel means we are more than just autocomplete machines. We have allegiance, irrationality, animalistic instinct.
But seriously, I could push back and say: all you did is put one idea after another there. If you spend time observing your thought, an enormous amount of it is just putting one idea after another.
This implies my thoughts are even observable. I don't know that they are. I mean, my brain is doing the thinking and it's also doing the observing. If it wanted to, it could keep things from me.
Nothing is complex or important in the big scheme of things, for example when the point of reference is the Universe.
But on Earth, humans are pretty complex living organisms that we tag with the term intelligence. And I would say this is the point of reference for the discussion about AI.
I do agree that we are not special and in an alternate Universe maybe a branch of Neanderthals would try to invent some new form of tech.
If you happened to teach a kid to read right after LLMs got good, it jolts you into thinking that humans are very similar to LLMs. They continually say word X when the page shows word Y, in places where word X would make perfect sense, i.e. semantically similar words are being spat out.
They also make up stories where the words sort of make sense, but it's kinda nonsense. "Hallucinations" all over the place.
> there will likely be open-weights versions of this sort of thing eventually. And there will probably be people who argue that it’s a good thing, somehow.
If this technology is going to exist (which it is), do you think it would be more harmful for everybody to have access to it, or for only some group of elites (big business, your own government, somebody else government, criminal groups, whoever) to have access to it? Because those are the only two choices as far as I can tell.
Definitely less harmful if only the elites have it. It doesn't make them harder to kill, but it reduces proliferation. And "elites" per se isn't correlated with any political side, so it doesn't even change the culture war.
Generally speaking if you think most of the danger of any technology comes from accidents, you want fewer people to have it and don't care too much which ones it is, though it'd be preferable if they had a lot to lose.
> And "elites" per se isn't correlated with any political side, so it doesn't even change the culture war.
Only if you're looking at the more obvious but wrong culture war. Elites from both "sides" have more in common with each other than with the rest of humanity. We just don't notice the perspective they're pushing nearly as much because we're not being spoonfed the debate and rage-bait as we are with the left vs right conflict.
> And "elites" per se isn't correlated with any political side, so it doesn't even change the culture war.
Even if you trust your own government, which I would suggest is misguided (the known propaganda programs that have come out of the DOD alone should be enough to dispel this trust), in a world where only the regulating authorities have access to GenAI, the GRU and the PLA are going to have access to it too. This idea that the authorities who would control GenAI wouldn’t maliciously deploy it against you is a fantasy.
A lot of people will agree with this sentiment when the hazard is firearms or other weapons. But in this case the “hazard” is information, and the ability to process and distribute it. Entrusting an authority to regulate that to the extent that would be required to regulate GenAI is a far more dystopian proposition than regulating gun ownership.
It's functionally very similar to when, e.g., a child keeps mimicking a character from their favorite cartoon.
Pretty much all of the "quirks" that LLMs exhibit are similar to "human" quirks. Which is great if we're trying to create an AI that makes mistakes like the average human.
Can someone please convince me why I shouldn't be absolutely, shit-out-of-my-mind cynical about this innovation? We are literally seeing the downfall of trust in society. And no, I don't believe I am exaggerating.
We have adapted to monumental shifts in how we develop trust in society for as long as society has existed - from the printing press to photography to the Internet to CGI to ...
I don't see this as any different. We will determine new ways of establishing trust. They'll certainly have flaws, as establishing trust in a society always has, but we'll learn to recognize those flaws and hopefully fix them.
Beyond that, what's the alternative? Banning the technology? That doesn't seem feasible for various reasons, not least of which is it isn't going to stop bad actors. Another pretty good reason is it's just not really possible - anyone with enough compute can build LLMs now.
As a bit of an aside, why hasn't society fallen yet? I mean, ChatGPT has been around for a couple years now, and I've been hearing about how LLMs are the single greatest threat to civilized society we've ever faced... yet they don't seem to have had a major impact.
The big difference is, there was a very high bar to forging photographs, and most news was gated (with the ability to easily find and sue those guilty of slander/libel).
Now it's utterly simplistic to forge, to libel, to slander, and there is no easy path in many cases to sue.
While you can say "yes, but..." to the above, that's the reality we've lived with for 150 years, minus extremely rare edge cases. All this has changed over the course of a couple of decades, with most of that change in the last 10 years and concentrated in the last 2.
Beyond that, it took significant effort and labour to create fake stories and images. People had to be experts, or be wordsmiths. Now, click click, and fake generated stories abound. In fact, they're literally everywhere. There's absolutely no comparison here.
Now, in the time it used to take one person to generate one fake story, you can generate millions and trillions if you have the cash. Really, it's the same problem with spam phone calls, and with spam email.
You didn't get 1000 spam letters in the mail in the 80s, because that cost money. Email was free, thus spam became plentiful. The same with spam phone calls, it cost hard cash for each call, now it's pennies per hundreds of automated calls, so spam phone calls abound.
The same is happening with all content on the internet. Realistically the web is now dead. It's now gone. Even things such as wikipedia are going to die, as over the next 2 to 3 years LLM output will become utterly and completely indistinguishable in all aspects.
> The same is happening with all content on the internet. Realistically the web is now dead. It's now gone. Even things such as wikipedia are going to die, as over the next 2 to 3 years LLM output will become utterly and completely indistinguishable in all aspects.
Like I said, people have been saying this exact thing for a couple years now. I'm sure I'll be hearing the same in a couple more.
If you mean voice cloning, they aren't bringing that tech to market. (Someone else will, though.)
Similarly, Google doesn't let you use face matching in image search to find every photograph of you on the web, even though they could, and quite similar technology is built into Google Photos.
That's the thing though, it wasn't impossible, just harder - but people believed whatever they saw, so it was done regardless of the costs. Now it's easier and people are finally getting sceptical, as they should've 20+ years ago.
> Before these developments it was impossible to fake a politician saying arbitrary stuff.
Voice impersonation has been possible forever, actually; and its use for misinformation (as well as less nefarious things like entertainment) is hardly novel. (Same with impersonation that goes beyond voice.)
I think the other comments make a good argument about how other forms of technology have also degraded trust, but that we've found a way through. I'll also add that I think one potential way we could reinstate trust is through signed multimedia. Cameras/microphones/etc could sign the videos/audio they create in a way that can be used to verify that the media hasn't been doctored. Not sure if that's actually a feasible approach, but it's one possibility.
It's feasible with advanced enough tech. The hard part isn't getting cameras to sign the files they produce. The hard part is to preserve the chain of custody as images are cropped, rescaled, recompressed etc. You can do it with tech like Intel SGX. But you also need serious defense of the camera platforms against hacking, of the CPUs, of the software stacks. And there's no demand. News orgs feel they should be implicitly trusted due to their brands, so why would they use complicated tech to build trust?
They might use complicated tech because things are changing due to AI in a way that could degrade their brands' trust. By having camera-signed videos, when folks create e.g. deepfakes of their news anchors/brands, there's a way for consumers to verify what's real. It lets their brands become more trustworthy.
Yeah, preserving the chain of custody is hard. I was thinking there are a few options: (1) the signature of the original video could be attached even after editing/compression, and then a news org would let you look up the signature. That way, if someone copied the signature and stuck it on a fake video, you could find the original video that actually matches that signature and determine whether something has been doctored. Or (2), you could have editing software add a signature verifying the edits made: e.g. compression, rescaling, etc.
And then a law to make it illegal to remove/tamper with/falsify a signature, like we have for DVDs, to allow some form of prosecution. The hardware stack is a little easier to protect; the software stack less so. But if we can do it with things like browser DRM or HTTP signatures, maybe we can with media editing software? I'm not versed enough in cryptography to really know.
I don't think there's any need for laws here. After all the whole point of a digital signature is that if it's removed or tampered with, that's detectable (assuming you expect it to be signed in the first place).
I've thought about this sort of approach many times in the past, and also done a lot of work with SGX and similar tech that implements remote attestation (RA). So I know how to build this sort of system conceptually. RA lets you do provable computations where the CPU can sign data, and the "public key" contains a hash of the program and data along with certificates that let you check the CPU was authentic. And it runs the program in a special CPU mode where the kernel and other things can't access the memory space. That's all you need to do verifiable computation.
So to preserve chain of custody you just have a set of filters or transforms that are SGX enclaves running ffmpeg or whatever, and each one attaches the attestation data to the output video which includes a hash of the input video. Then you gather up a certificate+signature over the original raw video from the camera (the cert is evidence the camera is authentic and the key is protected by the camera - you can get this from iPhone cameras), then an org certificate showing it came from a certain company, and then the attestation evidence for each transform. A set of scripts lets people verify all the evidence.
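As a toy sketch of just the hash/signature bookkeeping in that chain (not the actual scripts; the attestation evidence, camera certificates, and SGX parts are what make it trustworthy in practice, and the single shared key here is purely illustrative):

    import hashlib, json
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def sha256(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    key = Ed25519PrivateKey.generate()        # stand-in for a camera/enclave key

    def attest_step(name: str, inp: bytes, out: bytes) -> dict:
        # Each transform records hashes of its input and output and signs that record.
        record = {"step": name, "in": sha256(inp), "out": sha256(out)}
        payload = json.dumps(record, sort_keys=True).encode()
        return {"record": record, "sig": key.sign(payload).hex()}

    def verify_chain(raw: bytes, final: bytes, chain: list) -> bool:
        pub = key.public_key()
        expected = sha256(raw)
        for entry in chain:
            payload = json.dumps(entry["record"], sort_keys=True).encode()
            pub.verify(bytes.fromhex(entry["sig"]), payload)   # raises if the record was forged
            if entry["record"]["in"] != expected:
                return False                                   # chain of custody broken
            expected = entry["record"]["out"]
        return expected == sha256(final)

A verifier walks the records from the camera's raw file to the published video; any edit outside the attested transforms breaks the chain.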
The problem is, after doing some business case analysis, I concluded it would only really be useful in some very small and specific cases:
1. Citizen journalists who are posting videos online that then get verified by news orgs. So the other way around. In this case the origin camera would use iPhone App Attestation to generate the source certificate, and all the fancy attested transform stuff isn't really important because it's the news org doing the transforms and doing the verifying.
2. Phone cam shots for insurance and other similar use cases. There is some business potential here, but it'd be sales force heavy as nobody knows the tech exists and deepfake fraud may not be a big enough problem for them to care (yet ...). If someone is looking for a startup idea, have this one for free.
3. Very new news companies that don't have any reputation yet and want to stand out from the crowd.
The thing is, for (3) or any place where a news org wants to increase the trust of the viewers, you don't need cryptography. That's just over-complicating things. You can just put a short random four-letter code into the chyron that's unique to that particular shot you see on screen. Then on your website you have a page where the original unedited files can be downloaded by supplying the code. If you use cameras that produce cryptographic evidence like timestamps, that's gravy, and for brownie points you could publish video hashes into an unforgeable replicated log to stop you backdating footage. For most people that will be more than good enough. The sort of thing that causes people to lose trust in media is not stuff like CNN broadcasting outright deepfakes, although that will happen eventually, but when they engage in selective editing, drop stories entirely, use archive footage and misrepresent it as something new, etc.
The worst kind of fakery I've seen mainstream media engage in was Channel 4 UK's recent broadcast of a fake news segment, in which they "secretly filmed" someone who was pretending to be a racist Reform activist. People on X swiftly discovered that the person on-screen wasn't an activist at all but a professional actor, who had been putting on a fake accent the whole time (that he even advertised on his website). It looks for all the world like C4 broadcast entirely and truly fake news, knew they were doing it, and when they were called on it they just flat out refused to investigate knowing the British establishment was behind them all the way, as Reform is unpopular with the civil servant types who are supposed to police the media.
Unfortunately, for that kind of fakery there is no technological solution. Or, well, there is, but it's called social media + face recognition, and we already have it.
The way I see it LLMs are just making it more obvious to see all the flaws with the existing levels of trust. Humans have never had access to universal truths, or universal ways of validating anything. Any claim anybody makes could be intentionally or unintentionally deceitful or untrue. The idea that there are sources you can trust to do your thinking for you is the more dangerous illusion in my opinion, and I’m not convinced that society will be harmed by poking some holes through it.
> The idea that there are sources you can trust to do your thinking for you is the more dangerous illusion in my opinion, and I’m not convinced that society will be harmed by poking some holes through it
There is no alternative to this idea. It is completely impossible for an individual to possess all of the knowledge of everything that affects their lives. The only option for getting some of this information is going to trusted sources that compile it and present some conclusions.
This applies just as much to scientific knowledge as it does to medicine or to politics.
If you want to avoid trusting any authority, it's hard to even confirm that North Korea exists. Confirming that it is ruled by an authoritarian regime and that it possesses nuclear weapons is impossible. And yet it's a trivial bit of info that everyone agrees on - imagine what avoiding trusted authorities would do to knowledge about other more subtle or more controversial topics.
I said do your thinking for you, not do your information gathering for you.
I would suggest that you do not trust any single source to only ever tell you things that are true. If there’s a topic you want to know something about it’s a much better course of action to look at multiple different sources, and do your own thinking to come to your own conclusions.
There are no authorities who can reliably take on this role for you, and LLMs don’t change this. The same is true with science. Even prior to LLMs, the replication crisis should have shown that a single paper on any topic can’t be relied on to contain any truth (the same would be true even in the absence of a replication crisis for that matter).
Oh, then I misunderstood. I thought you were against the notion of trusting a source for information, at all.
I very much agree with your actual point - that no source should be trusted absolutely, and that the only way to get a decent-to-solid idea on a topic is to consume multiple sources on that topic.
However, the problem is that even then people have relatively little time. It's important to have sources that one can rely on to be relatively accurate with a high probability, to get some vague idea about a topic you're not deeply invested in, but do care about somewhat. And I think this is where LLMs can hurt the most.
> The idea that there are sources you can trust to do your thinking for you is the more dangerous illusion in my opinion
The difference between economically successful countries like the US and the peripheral countries is that we are a high-trust society.
I don't spend 100 hours chemically testing my food because I have faith it is safe to eat. I don't waste money on scam after scam because I have faith most businesses are legitimate. If I'm a business, I can order stuff and more stuff and I trust the spec.
Our outsourcing of that trust to other people is what makes us economically successful.
Other countries which don't have this trust focus on basic tasks: gathering food, water, shelter, and basic infrastructure, because ultimately every man is out for himself. They aren't building software and airplanes and whatnot, because as complexity increases, more people are involved and therefore more trust is required. Trust is required because of the fundamental limitations of human meat space - we have limited time and survival needs.
Attacks on trust have been around for as long as we have had trust. Generative AI makes some attack vectors easier, but it's nothing new: if your trust is earned using a voice you recognise, then your model for trust has been broken since before most of us were born.
This problem appeared during pre-release testing and has since been solved post-generation using an output classifier that verifies responses, according to the released system card. It was predictable that someone would spin this into a Black Mirror-esque clickbait story.
We built the torment nexus and then configured it not to use any of the torment functionality that showed up in testing. It was predictable that someone would turn this into a “don’t build the torment nexus”-esque clickbait story.
I know what the quote is about. I don't get how a text/audio generator is relevant to it. Or even AI - all the classic scifi I read sees AI as a positive development, it's only the mainstream action movies that present it as some catastrophe, usually in an even dumber way than the alien invasion movies. Whenever a classic scifi presented AI as part of a catastrophe, it wasn't the AI itself but some imperialist or fascist politician ordering it, not a robot uprising - because that's just dumb. People and governments are the torment nexus, not technology.
I don’t think the popular meme is dependent on some cultivated/exclusive understanding of classic sci-fi that excludes stuff like the Terminator movies. Memes by their nature are lowest common denominator.
But hey, if you missed the joke by virtue of being too clever and well read, that’s not so bad?
Terminator is pretty good (= fun to watch) as far as action movies go, even though I have a hard time calling it scifi - there's not much science going on, except that there's a robot. But ok, the label is not really important. Using it as some sort of prediction about the future, though, is just... wtf. At least use the scifi stories that the author actually intended to be taken as such (not many of them, and AI is usually what propels those societies leaps forward).
This is not about being well read or too clever. Many people actually think this is some sort of torment nexus and that "scifi has proven you shouldn't build it". Well a) it didn't, b) why are you taking advice from fiction designed to sell books/cinema tickets, and c) where's the joke?
There are movies about cars transforming to huge humanoid destroyers, and yet nobody claims the next Ford car model is the torment nexus. This is the same. A text/audio generator is never going to do a robot uprising, it's just as likely as your neighbor's Ford destroying your house with its giant fists.
The capability is real even if it won't happen with current model censorship. We just have to wait until we get an open-source version; I bet Meta is working on one right now.
Even the FOSS models will be "censored". Icky implications aside, a voice assistant that starts making up its own prompts instead of listening to you is not a useful product.
> a voice assistant that starts making up its own prompts instead of listening to you is not a useful product
That would be way more useful than a voice assistant, with that you can replace voice actors for cheap! Many would pay good money for that product.
The important part is that you can ask it to do specific voice lines with specific feelings and voices. If it then starts to copy your voice afterwards, that won't take away from the fact that it generated the voice lines you wanted first. The censorship that prevents it from using other voices prevents this use case.
It would be more convincing if "has since been solved post-generation using an output classifier that verifies responses, according to the system card release" didn't sound like an AutoNation certified pre-owned car; the blind certifying something they neither understand nor control, but it sure checks all the boxes on the "model card".
>It doesn't, interestingly, seem to include the initial built-in prompts used as constraints.
That's because it is mostly a ploy by profit-driven corporations to allow their researchers to publish some stuff without having them actually reveal anything of value to competitors. Don't be surprised if you find it severely lacking for actual insight.
This looks extremely similar to what often happened with chat implementations in the GPT-3 and 3.5 era: GPT would generate its answer and then go on and generate the next input by the user as well.
I want to know: if LLMs were unleashed on simulating real-world people, how well could they predict everything those people are going to say?
In battles or business, how well would LLMs predict what people will do?
Seems to me that a billion parameters can model decision making by a group of humans pretty well. Especially strategies to psych them out, like anticipating everything they try ahead of time and showing them it's hopeless (demoralizing), so they may as well turn to a random oracle rather than use their own strategies.
Would any AI fans who don’t seem to mind anything AI does, be willing to volunteer for this experiment on the Internet?
It’s stronger than a Turing test. You’d be competing against an AI to prove it’s really you and not an AI trained on what you have been posting. As determined by people who have been seeing you post this whole time.
Or even more interestingly — after you hit send, we’ll compare to the 5 versions of what the AI predicted you’d say, and calculate the “loss”.
If you lose, the AI can take over your account and use it to amass more karma for you (a measurable metric).
Sure, I am doing some work collecting my writings and preparing them for an LLM. For example, I could select my own text as positive and a generic one as negative for RLHF. The resulting model should predict when I'd like a piece of text. But it would need retraining to track my changes over time.
Makes sense. With local models it's easy to make this happen with poorly defined stop tokens.
The user's voice is vectorized somehow, and the model is predicting a series of vectors; it won't have a sense of "self" to let it recognize its own voice vs the user's.
We'll have to gradually get used to the notion that a person's voice, like so many other things we once thought of as intimately personal, is just a coordinate in a high-dimensional vector space.
One of the common bad takes is that since personal things can be distilled to “just math” they are no longer personal or valuable in the sort of way we value personal things. Increasing our understanding of how the world works shouldn’t devalue the world.
To put it another way, people having souls is not the only reason to treat people like people. Or as Dr Seuss says: A person’s a person no matter how small.
> they are no longer personal or valuable in the sort of way we value personal things
There were people with your face or voice before, you just didn't care. AI continues a trend: the internet has always been post-scarcity, free copying and offering an amazing array of choices. We are acting as if it's a new thing, but it's been here for 25 years.
We shouldn't value people simply because they are scarce. People are valuable intrinsically. Identical twins don't have less value because there are two of them; a mother is not less sad when one of them dies than a mother without twins would be. Trying to make such a comparison of value is gross.
I do appreciate that not everyone agrees with that view, and that capitalism tries to value people at their output but intuitively plenty of people find that sort of thing wrong and ignoring that intuition is something I’d argue will not end well for us as a society.
But I do think there are some psychological hurdles we’ll have to overcome, as it becomes possible to mechanically copy individual personal aspects. A person’s a person no matter how small, but we aren’t all used to thinking of ourselves as very small.
It always has been. The same way passwords were never truly secure - we just found they were too slow to be worth cracking. The same can be said for voices/AI, except now it's no longer too slow to "crack an easy password" in the voices/AI world.
The idea that everything is open, you just need the right tool, is scary to most... but it's not once you realise how expensive the tools are.
Cloning a voice signature or timbre may need a bit more for good quality. Then there are idiosyncrasies in one's voice. In addition to that, there are tiny verbal tics, expressions, cadence, feel, and more before you can say you have properly cloned someone's voice. The two-second sample is a shallow clone of sorts and is indeed just a point in vector space.
The last sentence is hilarious. What do you think "properly" cloned voices are? Not every model is few-shot, and not every model relies on its training set for paralanguage anymore. The easiest way to try it out is probably the pro voice cloning from ElevenLabs.
Did anybody with a passing familiarity with the technology not know this? There's a reason people don't trust their IP to public chat assistants, ya know. It's because they're remorselessly siphoning our data for training purposes. And you've agreed to let them use your inputs for anything they please.
No, I get that. But that's the thing about LLMs, isn't it? They're good at filling in the next token to make a plausible-looking conversation. And if the machine doesn't expect you to be done talking, why would it not just carry on in your voice?
I was under the impression that multiple separate models were in use. I assumed it was something like whisper that converted the user’s input to text then an LLM then text to speech.
I hadn’t realized this was all in one model. Wild stuff.
> And you've agreed to let them use your inputs for anything they please.
This has always been the Google and Facebook deal - your data for services - and the AI assistant deal remains the same. I believe people will flock to the best AI with their data, handing it over on a silver platter. Will they spend half a second thinking about privacy, or prefer immediate results? When you know your competition uses AI, you aim to have just as much AI assistance or more.
This isn’t an example of general intelligence. However, it’s an example of the complexity of these systems. As they get more complex (and advanced) it’s going to be pretty scary (and creepy) what can happen. I’ll predict there will be moments when we’re truly convinced some independent autonomy has taken over.
Early on in the voice public release, I asked my 15 year old to give it a shot. They (they’re NB) had a long meandering conversation about their favorite book series, the Percy Jackson series.
About 15 minutes into the conversation between my kiddo and ChatGPT, the model started to take on the vocal mannerisms of my kiddo. It started using more "umms" and "you knows."
At first this felt creepy, but as I explained it to my kid, it's because their own speech had become weighted heavily enough in the context for the LLM to start incorporating it, and/or somewhere in the embedded prompts there's something like "empathize with the user and emphasize clarity," and that prompting meant mirroring back speech styles.
Other than happening at random, it is not a threat. ElevenLabs does it in a few seconds already. If you are using a mic while using AI, expect companies like OpenAI to steal every bit of your soul. Nothing new here.
No, the whole point of the transformer architecture is that it can do stuff like this without any extra training; an LLM can copy your writing pattern, etc.
It did the same thing ChatGPT does when it picks up your writing style and exact words/sentences after a few messages. Literally - the audio is encoded as tokens and fed to the LLM, there is no distinction between text and audio from the model's point of view.
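A toy illustration of that point (all ids and vocabulary sizes below are made up, not OpenAI's actual tokenization): text tokens and audio-codec tokens can simply share one id space, so "continue the conversation" and "continue the voice" are the same next-token problem to the model.

    # Hypothetical sizes, just to show the shared-sequence idea.
    TEXT_VOCAB = 50_000     # pretend text vocabulary size
    AUDIO_VOCAB = 4_096     # pretend audio-codec codebook size

    def audio_token(codebook_index: int) -> int:
        return TEXT_VOCAB + codebook_index     # audio ids live right after the text ids

    sequence = [101, 2023, 2003]                           # made-up text token ids
    sequence += [audio_token(i) for i in (17, 532, 88)]    # made-up codec token ids
    # The model's only job is to predict the next id in `sequence`,
    # whichever modality that id happens to decode to.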
This was inference, not training. Like how you can paste a few paragraphs of text into ChatGPT and ask it to write another paragraph in a similar writing style.
Imagine a world where world dictators are replaced rather than killed. Roll back the dictatorship over years, install a democratic process, then magically commit seppuku in a plane crash.