The most impressive part is that the voice uses the right feelings and tonal language during the presentation. I'm not sure how much of that comes from them having tested this over and over, but it is really hard to get that right, so if they didn't fake it in some way I'd say that is revolutionary.
How far are we from something like a helmet with ChatGPT and a video camera installed? I imagine this will be awesome for low-vision people. Imagine having a guide tell you how to walk to the grocery store, and help you grocery shop without an assistant. Of course you have tons of liability issues here, but this is very impressive.
We're planning on getting a phone-carrying lanyard and she will just carry her phone around her neck with Be My Eyes^0 looking out the rear camera, pointed outward. She's DeafBlind, so it'll be bluetoothed to her hearing aids, and she can interact with the world through the conversational AI.
I helped her access the video from the presentation, and it brought her to tears. Now, she can play guitar, and she and the AI can write songs and sing them together.
This is a big day in the lives of a lot of people who aren't normally part of the conversation. As of today, they are.
That story has always been completely reasonable and plausible to me. Incredible foresight. I guess I should start a midlevel management voice automation company.
Definitely heading there:
https://marshallbrain.com/manna
"With half of the jobs eliminated by robots, what happens to all the people who are out of work? The book Manna explores the possibilities and shows two contrasting outcomes, one filled with great hope and the other filled with misery."
And here are some ideas I put together around 2010 on how to deal with the socio-economic fallout from AI and other advanced technology:
https://pdfernhout.net/beyond-a-jobless-recovery-knol.html
"This article explores the issue of a "Jobless Recovery" mainly from a heterodox economic perspective. It emphasizes the implications of ideas by Marshall Brain and others that improvements in robotics, automation, design, and voluntary social networks are fundamentally changing the structure of the economic landscape. It outlines towards the end four major alternatives to mainstream economic practice (a basic income, a gift economy, stronger local subsistence economies, and resource-based planning). These alternatives could be used in combination to address what, even as far back as 1964, has been described as a breaking "income-through-jobs link". This link between jobs and income is breaking because of the declining value of most paid human labor relative to capital investments in automation and better design. Or, as is now the case, the value of paid human labor like at some newspapers or universities is also declining relative to the output of voluntary social networks such as for digital content production (like represented by this document). It is suggested that we will need to fundamentally reevaluate our economic theories and practices to adjust to these new realities emerging from exponential trends in technology and society."
And a related YouTube video:
"The Richest Man in the World: A parable about structural unemployment and a basic income"
https://www.youtube.com/watch?v=p14bAe6AzhA
"A parable about robotics, abundance, technological change, unemployment, happiness, and a basic income."
My sig is about the deeper issue here though: "The biggest challenge of the 21st century is the irony of technologies of abundance in the hands of those still thinking in terms of scarcity."
Your last quote also reminds me this may be true for everything else, especially our diets.
Technology has leapfrogged nature and our consumption patterns have not caught up to modern abundance. Scott Galloway recently mentioned this in his OMR speech and speculated that GLP1 drugs (which actually help addiction) will assist in bringing our biological impulses more inline with current reality.
Indeed, they are related. A 2006 book on eating healthier called "The Pleasure Trap: Mastering the Hidden Force that Undermines Health & Happiness" by Douglas J. Lisle and Alan Goldhamer helped me see that connection (so, actually going the other way at first). And a later book from 2010 called "Supernormal Stimuli: How Primal Urges Overran Their Evolutionary Purpose" by Deirdre Barrett also expanded that idea beyond food to media and gaming and more. The 2010 essay "The Acceleration of Addictiveness" by Paul Graham also explores those themes. In the 2007 book The Assault on Reason, Al Gore talks about watching television and the orienting response to sudden motion like scene changes.
In short, humans are adapted for a world with a scarcity of salt, refined carbs like sugar, fat, information, sudden motion, and more. But the world most humans live in now has an abundance of those things -- and our previously-adaptive evolved inclinations to stock up on salt/sugar/fat (especially when stressed) or to pay attention to the unusual (a cause of stress) are now working against our physical and mental health in this new environment. Thanks for the reference to a potential anti-addiction substance. Definitely something that deserves more research.
My sig -- informed by the writings of people like Mumford, Einstein, Fuller, Hogan, Le Guin, Banks, Adams, Pet, and many others -- is making the leap to how that evolutionary-mismatch theme applies to our use of all sorts of technology.
Here is a deeper exploration of that in relation to militarism (and also commercial competition to some extent):
https://pdfernhout.net/recognizing-irony-is-a-key-to-transce...
"There is a fundamental mismatch between 21st century reality and 20th century security thinking. Those "security" agencies are using those tools of abundance, cooperation, and sharing mainly from a mindset of scarcity, competition, and secrecy. Given the power of 21st century technology as an amplifier (including as weapons of mass destruction), a scarcity-based approach to using such technology ultimately is just making us all insecure. Such powerful technologies of abundance, designed, organized, and used from a mindset of scarcity could well ironically doom us all whether through military robots, nukes, plagues, propaganda, or whatever else... Or alternatively, as Bucky Fuller and others have suggested, we could use such technologies to build a world that is abundant and secure for all. ... The big problem is that all these new war machines and the surrounding infrastructure are created with the tools of abundance. The irony is that these tools of abundance are being wielded by people still obsessed with fighting over scarcity. So, the scarcity-based political mindset driving the military uses the technologies of abundance to create artificial scarcity. That is a tremendously deep irony that remains so far unappreciated by the mainstream."
Conversely, reflecting on this more just now, are we perhaps evolutionarily adapted to take for granted some things like social connections, being in natural green spaces, getting sunlight, getting enough sleep, or getting physical exercise? These are all things that are in increasingly short supply in the modern world for many people -- but which there may never have been much evolutionary pressure to seek out, since they were previously always available.
For example, in the past humans were pretty much always in face-to-face interactions with others of their tribe, so there was no big need to seek that out especially if it meant ignoring the next then-rare new shiny thing. Johann Hari and others write about this loss of regular human face-to-face connection as a major cause of depression.
Stephen Ilardi expands on that in his work, which brings together many of these themes and tries to help people address them to move to better health.
From: https://tlc.ku.edu/
"We were never designed for the sedentary, indoor, sleep-deprived, socially-isolated, fast-food-laden, frenetic pace of modern life. (Stephen Ilardi, PhD)"
GPT-4o, by apparently providing "Her"-movie-like engaging interactions with an AI avatar that seeks to please the user (while possibly exploiting them), is yet another example of our evolutionary tendencies potentially being used to our detriment. And when our social lives are filled-to-overflowing with "junk" social relationships with AIs, will most people have the inclination to seek out other real humans if it involves doing perhaps increasingly-uncomfortable-from-disuse actions (like leaving the home or putting down the smartphone)? Not quite the same, but consider: https://en.wikipedia.org/wiki/Hikikomori
Related points by others:
"AI and Trust"
https://www.schneier.com/blog/archives/2023/12/ai-and-trust.... "In this talk, I am going to make several arguments. One, that there are two different kinds of trust—interpersonal trust and social trust—and that we regularly confuse them. Two, that the confusion will increase with artificial intelligence. We will make a fundamental category error. We will think of AIs as friends when they’re really just services. Three, that the corporations controlling AI systems will take advantage of our confusion to take advantage of us. They will not be trustworthy. And four, that it is the role of government to create trust in society. And therefore, it is their role to create an environment for trustworthy AI. And that means regulation. Not regulating AI, but regulating the organizations that control and use AI."
"The Expanding Dark Forest and Generative AI - Maggie Appleton"
https://youtu.be/VXkDaDDJjoA?t=2098 (in the section on the lack of human relationship potential when interacting with generated content)
This Dutch book [1] by Gummbah has the text "Kooptip" imprinted on the cover, which would roughly translate to "Buying recommendation". It worked for me!
Does it give you voice instructions based on what it knows or is it actively watching the environment and telling you things like "light is red, car is coming"?
Just the ability to distinguish bills would be hugely helpful, although I suppose that's much less of a problem these days with credit cards and digital payment options.
With this capability, how close are y'all to it being able to listen to my pronunciation of a new language (e.g. Italian) and given specific feedback about how to pronounce it like a local?
It completely botched teaching someone to say “hello” in Chinese - it used the wrong tones and then incorrectly told them their pronunciation was good.
As for the Mandarin tones, the model might have mixed it up with the tones from a dialect like Cantonese. It’s interesting to discover how much difference a more specific prompt could make.
I don't know if my iOS app is using GPT-4o, but asking it to translate to Cantonese gives you gibberish. It gave me the correct characters, but the Jyutping was completely unrelated. Funny thing is that the model pronounced the incorrect Jyutping plus said the numbers (for the tones) out loud.
I think there is too much focus on tones in beginning Chinese. Yes, you should get them right, but no, you'll get better as long as you speak more, even if your tones are wrong at first. So rather than remembering how to say fewer words with the right tones, you'll get farther if you can say more words with whatever tones you feel like applying. That "feeling" will just get better over time. Until then, you'll talk about as well as a farmer coming in from the countryside whose first language isn't Mandarin.
I couldn’t disagree more. Everyone can understand some common tourist phrases without tones - and you will probably get a lot of positive feedback from Chinese people. It’s common to view a foreigner making an attempt at Mandarin (even a bad one) as a sign of respect.
But for conversation, you can’t speak Mandarin without using proper tones because you simply won’t be understood.
That really isn't true, or at least it isn't true with some practice. You don't have to consciously think about or learn tones, but you will eventually pick them up anyway (tones are learned unconsciously via lots of practice trying to speak and be understood).
You can be perfectly understood even if you don't speak broadcast-standard Chinese. There are plenty of heavy accents to deal with anyway. Like Beijing 儿化 or the inability of southerners to pronounce sh very differently from s.
People always say tech workers are all white guys -- it's such a bizarre delusion, because if you've ever actually seen software engineers at most companies, a majority of them are not white. Not to mention that product/project managers, designers, and QA are all intimately involved in these projects, and in my experience those departments tend to have a much higher ratio of women.
Even beside that though -- it's patently ridiculous to suggest that these devices would perform worse with an Asian man who speaks fluent English and was born in California. Or a white woman from the Bay Area. Or a white man from Massachusetts.
You kind of have a point about tech being the product of the culture in which it was produced, but the needless exaggerated references to gender and race undermine it.
An interesting point: I tend to have better outcomes using my heavily accented ESL English than my native pronunciation of my mother tongue
I'm guessing it's part of the tech work force being a bit more multicultural than initially thought, or it just being easier to test with
It's a shame, because that means I can use stuff that I can't recommend to people around me
Multilingual UX is an interesting painpoint. I had to change the language of my account to English so I could use some early Bard version, even though it was perfectly able to understand and answer in Spanish
You also get the synchronicity / four minute mile effect egging on other people to excel with specialized models, like Falcon or Qwen did in the wake of the original ChatGPT/Llama excitement.
I don't think that'd work without a dedicated startup behind it.
The first (and imo the main) hurdle is not reproduction, but just learning to hear the correct sounds. If you don't speak Hindi and are a native English speaker, this [1] is a good example. You can only work on nailing those consonants when they become as distinct to your ear as cUp and cAp are in English.
We can get by by falling back to context (it's unlikely someone would ask for a "shit of paper"!), but it's impossible to confidently reproduce the sounds unless they are already completely distinct in our heads/ears.
That's because we think we hear things as they are, but it's an illusion. Cup/cap distinction is as subtle to an Eastern European as Hindi consonants or Mandarin tones are to English speakers, because the set of meaningful sounds distinctions differs between languages. Relearning the phonetic system requires dedicated work (minimal pairs is one option) and learning enough phonetics to have the vocabulary to discuss sounds as they are. It's not enough to just give feedback.
> but it's impossible to confidently reproduce the sounds unless they are already completely distinct in our heads/ears
Interestingly, I think this isn't always true -- I was able to coach my native-Spanish-speaking wife to correctly pronounce "v" vs "b" (both are just "b" in Spanish, or at least in her dialect) before she could hear the difference; later on she developed the ability to hear it.
I had a similar experience learning Mandarin as a native English speaker in my late 30s. I learned to pronounce the ü sound (which doesn't exist in English) by getting feedback and instruction from a teacher about what mouth shape to use. And then I just memorized which words used it. It was maybe a year later before I started to be able to actually hear it as a distinct sound rather than perceiving it as some other vowel.
After watching the demo, my question isn't about how close it is to helping me learn a language, but about how close it is to being me in another language.
Even styles of thought might be different in other languages, so I don't say that lightly... (stay strong, Sapir-Whorf, stay strong ;)
I was conversing with it in Hinglish (a combination of Hindi and English) which folks in urban India use, and it was pretty on point apart from some use of esoteric Hindi words, but I think with the right prompting we can fix that.
I'm a Spaniard and to my ears it clearly sounds like "Es una manzana y un plátano".
What's strange to me is that, as far as I know, "plátano" is only commonly used in Spain, but the accent of the AI voice didn't sound like it's from Spain. It sounds more like an American who speaks Spanish as a second language, and those folks typically speak some Mexican dialect of Spanish.
Interesting, I was reading some comments from Japanese users and they said the Japanese voice sounds like a (very good N1 level) foreigner speaking Japanese.
At least IME, and there may be regional or other variations I’m missing, people in México tend to use “plátano” for bananas and “plátano macho” for plantains.
In Spain, it's like that. In Latin America, it was always "plátano," but in the last ten years, I've seen a new "global Latin American Spanish" emerging that uses "banana" for Cavendish, some Mexican slang, etc. I suspect it's because of YouTube and Twitch.
The content was correct but the pronunciation was awful. Now, good enough? For sure, but I would not be able to stand something talking like that all the time
Most people don't, since you either speak with native speakers or you speak in English: in international teams you speak English and not one of the members' native languages, even if nobody speaks English natively. So it is rare to hear broken non-English.
And note that understanding broken language is a skill you have to train. If you aren't used to it then it is impossible to understand what they say. You might not have been in that situation if you are an English speaker since you are so used to broken English, but it happens a lot for others.
It sounds like a generic Eastern European who has learned some Italian. The girl in the clip did not sound native Italian either (or she has an accent that I have never heard in my life).
This is damn near one of the most impressive things. I can only imagine what you could do with live translation and voice synthesis (Eleven Labs style) integrated into something like Teams (select each person's language and do realtime translation to each person's native language; doing it with their own voice and intonation would be NUTS)
By humanity you mean Microsoft's shareholders right? Cause for regular people all this crap means is they have to deal with even more spam and scams everywhere they turn. You now have to be paranoid about even answering the phone with your real voice, lest the psychopaths on the other end record it and use it to fool a family member.
Yeah, real win for humanity, and not the psycho AI sycophants
Random OpenAI question: While the GPT models have become ever cheaper, the price for the TTS models has stayed in the $15 per 1M characters range. I was hoping this would also become cheaper at some point. There are so many apps (e.g. language learning) that quickly become too expensive given these prices. With the GPT-4o voice (which sounds much better than the current TTS or TTS HD endpoint) I thought maybe the prices for TTS would go down. Sadly that hasn't happened. Is that something on the OpenAI agenda?
I've always been wondering what GPT models lack that makes them "query->response" only. I've always tried to get chatbots to lose the initially needed query, to no avail. What would it take to get a GPT model to freely generate tokens in a thought-like pattern? I think when I'm alone, without a query from another human. Why can't they?
> What would it take to get a GPT model to freely generate tokens in a thought-like pattern?
That’s fundamentally not how GPT models work, but you can easily build a framework around them that calls them in a loop. You’d need a special system prompt to get anything “thought-like” that way, and, if you want it to be anything other than stream-of-simulated-consciousness with no relevance to anything, a non-empty “user” prompt each round, which could be as simple as the time, a status update on something in the world, etc.
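For illustration, here's a minimal sketch of such a loop (assuming the OpenAI Python SDK and the publicly named "gpt-4o" model; the system prompt and the clock-tick "user" message are just example choices, not anything official):

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Running transcript so each round can build on its earlier "thoughts".
history = [
    {"role": "system",
     "content": "You are an idle process. Each tick, continue your train of "
                "thought about whatever currently interests you, in 1-2 sentences."}
]

for _ in range(5):  # five "ticks" of unprompted-ish thinking
    # The model still needs *some* user turn; here it is just a timestamp.
    history.append({"role": "user", "content": f"Tick. The time is {time.ctime()}."})
    reply = client.chat.completions.create(model="gpt-4o", messages=history)
    thought = reply.choices[0].message.content
    history.append({"role": "assistant", "content": thought})
    print(thought)
    time.sleep(60)  # one "tick" per minute
```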
Monkeys who've been trained since birth to use sign language, and can answer questions remarkably well, have the same issue. The researchers noticed they never once asked a question like "why is the sky blue?" or "why do you dress up?". Zero initiating of conversation, but they do reply when you ask what they want.
I suppose it would cost even more electricity to have ChatGPT musing alone though, burning through its nvidia cards...
I think this will be key in a logical proof that statistical generation can never lead to sentience; Penrose will be shown to be correct, at least regarding the computability of consciousness.
You could say, in a sense, that without a human mind to collapse the wave function, the superposition of data in a neural net's weights can never have any meaning.
Even when we build connections between these statistical systems to interact with each other in a way similar to contemplation, they still require a human-created nucleation point on which to root the generation of their ultimate chain of outputs.
I feel like the fact that these models contain so much data has gripped our hardwired obsession for novelty and clouds our perception of their actual capacity to do de novo creation, which I think will be shown to be nil.
An understanding of how LLMs function should probably make this intuitively clear. Even with infinite context and infinite ability to weigh conceptual relations, they would still sit lifeless for all time without some, any, initial input against which they can run their statistics.
It happens sometimes. Just the other day a local TinyLlama instance started asking me questions.
The chat memory was full of mostly nonsense and it asked me a completely random and simple question out of the blue: had chatbots evolved a lot since it was created?
I think you can get models to "think" if you give them a goal in the system prompt, a memory of previous thoughts, and keep invoking them with cron
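Something like this single-shot script, run from cron every few minutes, is one cheap way to approximate that (a sketch only; the memory file, goal text, and prompt wording are all made up for the example):

```python
# think_tick.py -- run from cron, e.g.: */10 * * * * python think_tick.py
from pathlib import Path
from openai import OpenAI

MEMORY = Path("thoughts.log")  # hypothetical persistent "memory of previous thoughts"
GOAL = "Figure out a good weekend project and refine the idea a little each run."

client = OpenAI()
previous = MEMORY.read_text() if MEMORY.exists() else "(no previous thoughts yet)"

reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Your standing goal: {GOAL} "
                                      "Each run, add one new thought that builds on the log."},
        {"role": "user", "content": f"Previous thoughts:\n{previous}"},
    ],
)
new_thought = reply.choices[0].message.content
MEMORY.write_text(previous + "\n" + new_thought)  # append so the next cron run sees it
```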
Yes, but that's the fundamental difference. Even if I closed my eyes, plugged my ears and nose, and lay in a saltwater flotation chamber, my brain would still generate new input/noise.
(GPT) Models toggle between a state of existence when queried and ceasing to exist when not.
They are designed for query and response. They don't do anything unless you give them input. Also there's not much research on the best architecture for running continuous thought loops in the background and how to mix them into the conversational "context". Current LLMs only emulate single-thought synthesis based on long-term memory recall (and some go off to query the Internet).
> I think when I'm alone without query from another human.
You are actually constantly queried, but it's stimulation from your senses. There are also neurons in your brain which fire regularly, like a clock that ticks every second.
Do you want to make a system that thinks without input? Then you need to add hidden stimuli via a non-deterministic random number generator, preferably a quantum-based RNG (or it won't be possible to claim the resulting system has free will). Even a single photon hitting your retina can affect your thoughts, and there are no doubt other quantum effects that trip neurons in your brain above the firing threshold.
I think you need at least three or four levels of loops interacting, with varying strength between them. The first level would be the interface to the world, the input and output level (video, audio, text). Data from here is high priority and is capable of interrupting lower levels.
The second level would be short-term memory and context switching. Conversations need to be classified and stored in a database, and you need an API to retrieve old contexts (conversations). You also possibly need context compression (summarization of conversations in case you're about to hit a context window limit).
The third level would be the actual "thinking", a loop that constantly talks to itself to accomplish a goal using the data from all the other levels but mostly driven by the short-term memory. Possibly you could go superhuman here and spawn multiple worker processes in parallel. You need to classify the memories by asking: do I need more information? Where do I find this information? Do I need an algorithm to accomplish a task? What are the completion criteria? Everything here is powered by an algorithm. You would take your data and produce a list of steps that you have to follow to reach a conclusion.
Everything you do as a human to resolve a thought can be expressed as a list or tree of steps.
If you've had a conversation with someone and you keep thinking about it afterwards, what has happened is basically that you have spawned a "worker process" that tries to come to a conclusion that satisfies some criteria. Perhaps there was ambiguity in the conversation that you are trying to resolve, or the conversation gave you some chemical stimulation.
The last level would be subconscious noise driven by the RNG; this would filter up with low priority. In the absence of external stimuli with higher priority, or currently running thought processes, it would drive the spontaneous self-thinking portion (and dreams).
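A rough, runnable skeleton of how those levels might nest (every function here is a trivial stand-in I've invented for illustration; none of this is a real implementation):

```python
import random
import time

def sense_world():
    """Level 1: I/O -- poll for external input (stubbed: nothing ever happens)."""
    return None

def recall(query):
    """Level 2: short-term memory / context retrieval (stubbed: empty)."""
    return []

def think(goal, context):
    """Level 3: the goal-driven 'thinking' loop (stubbed: just records the goal)."""
    return f"thinking about: {goal} (context items: {len(context)})"

def subconscious_noise():
    """Level 4: RNG-driven background stimulus."""
    return random.random()

thought_log = []
for _ in range(200):                      # stand-in for an endless loop
    event = sense_world()
    if event is not None:                 # external input preempts everything else
        thought_log.append(think(event, recall(event)))
    elif subconscious_noise() > 0.98:     # occasionally a spontaneous thought bubbles up
        thought_log.append(think("free association", recall("recent")))
    time.sleep(0.01)

print(thought_log)
```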
Implement this and you will have something more akin to true AGI (whatever that is) on a very basic level.
In my ChatGPT app or on the website I can select GPT-4o as a model, but my model doesn't seem to work like the demo. The voice mode is the same as before and the images come from DALLE and ChatGPT doesn't seem to understand or modify them any better than previously.
I couldn’t quite tell from the announcement, but is there still a separate TTS step, where GPT is generating tones/pitches that are to be used, or is it completely end to end where GPT is generating the output sounds directly?
Very exciting, would love to read more about how the architecture of the image generation works. Is it still a diffusion model that has been integrated with a transformer somehow, or an entirely new architecture that is not diffusion based?
Licensing the emotion-intoned TTS as a standalone API is something I would look forward to seeing. Not sure how feasible that would be if, as a sibling comment suggested, it bypasses the text-rendering step altogether.
Is it possible to use this as a TTS model? I noticed on the announcement post that this is a single model as opposed to a text model being piped to a separate TTS model.
The web page implies you can try it immediately. Initially it wasn't available.
A few hours later it was in both the web UI and the mobile app - I got a popup telling me that GPT-4o was available. However, nothing seems to be any different. I'm not given any option to use video as an input, and the app can't seem to pick up any new info from my voice.
I'm left a bit confused as to what I can do that I couldn't do before. I certainly can't seem to recreate much of the stuff from the announcement demos.
Sorry to hijack, but how the hell can I solve this? I have the EXACT SAME error on two iOS devices (native app only — web is fine), but not on Android, Mac, or Windows.
Right to whom? To me, the voice sounds like an over-enthusiastic podcast interviewer. What's wrong with wanting computers to sound like what people think computers should sound like?
It understands tonal language and you can tell it how you want it to talk; I have never seen a model like that before. If you want it to talk like a computer you can tell it to (they did it during the presentation), and that is so much better than the old attempts at solving this.
You are a Zoomer sosh meeds influencer, please increase uptalk by 20% and vocal fry by 30%. Please inject slaps, "is dope" and nah and bra into your responses. Throw shade every 11 sentences.
And you’ve just nailed where this is all headed. Each of us will have a personal assistant that we like. I am personally going to have mine talk like Yoda and I will gladly pay Disney for the privilege.
People have been promising this for well over a decade now but the bottleneck is the same as it was before: the voice assistants can't access most functionality users want to use. We don't even have basic text editing yet. The tone of voice just doesn't matter when there's no reason to use it.
I've seen a programmer-turned-streamer literally do this live. Woohoojin on twitch/yt focuses on content for Riot's Valorant esports title, during a couple watch parties he would make "super fans" using GPT with TTS output and the stream of chat messages as input. His system prompts were formed exactly like yours, including instructions to plug his gaming chair sponsor.
It worked surprisingly well. The video where he created the first iteration on stream (I don't remember which watch party streams he ran the fans on): https://yewtu.be/watch?v=MBKouvwaru8
I want to get to the part where phone recordings stop having slow, full sentences. The correct paradigm for that interface is bullet list, not proper speech.
Why did they make the woman sound like she's permanently on the brink of giggling? It's nauseating how overstated her pretentious banter is. Somewhere between condescending nanny and preschool teacher. Like how you might talk to a child who's at risk of crying so you dial up the positive reinforcement.
LLMs today have no concept of epistemology, they don't ever "know" and are always making up bullshit, which usually is more-or-less correct as a side effect of minimizing perplexity.
The Total Perspective Vortex in Hitchhiker's notably didn't do anything bad when it was turned on, and so is good evidence that inventing the torment nexus is fine.
It didn't do anything bad to Zaphod Beeblebrox, in a pocket universe created especially for him (therefore ensuring that he was the most important thing in it, and thereby securing his immunity from the mind-scrambling effects of fully comprehending the infinite smallness of one's place in the real universe).
>The most impressive part is that the voice uses the right feelings and tonal language during the presentation.
Consequences of audio-to-audio (rather than audio>text, text>audio). Being able to manipulate speech nearly as well as it manipulates text is something else. This will be a revelation for language learning, among other things. And you can interrupt it freely now!
Elevenlabs isn’t remotely close to how good this voice sounds. I’ve tried to use it extensively before and it just isn’t natural. This voice from OpenAI, and even the one ChatGPT has been using, is natural.
When did you last use it? I used it a few weeks ago to create a fake podcast as a side project and it sounded pretty good with their highest-end model with cranked-up tunings.
My point isn’t necessarily elevenlabs being good or bad, it’s the difference between its text to voice and voice to voice generations. The latter is incredibly expressive and just shows how much is lacking in our ability to encode inflection in text.
I did see that, though my interpretation is that breathing is included in its voice tokenizer which helps it understand emotions in speech (the AI can generate breath sounds after all). Other sounds, like bird songs or engine noises, may not work - but I could be wrong.
I suspect that like images and video, their audio system is or will become more general purpose. For example it can generate the sound of coins falling onto a table.
Allegedly Google Assistant can do the "humming" one, but I have never gotten it to work. I wish it would, because sometimes I have a song stuck in my head that I know is sampled from another song.
I asked it to make a bird noise; instead it told me what a bird sounds like, with words. True audio-to-audio should be able to produce any noise: a trombone, traffic, a crashing sea, anything. Maybe there is a better prompt there, but it did not seem like it.
I was in the audience at the event. The only parts where it seemed to get snagged was hearing the audience reaction as an interruption. Which honestly makes the demo even better. It showed that hey, this is live.
I wonder when it will be able to understand that there is more than one human talking to it. It seems like even in today's demo if two people are talking, it can't tell them apart.
I was showing my wife 4o voice chat this afternoon, and we were asking it about local recommendations for breakfast places. All of a sudden…
————
ChatGPT:
Enjoy your breakfast and time together.
User:
Can you tell that it's not just me talking to you right now?
ChatGPT:
I can't always tell directly, but it sounds like you're sharing the conversation with someone else. Is [wife] there with you?
User:
My god, the AI has awoken. Yes, this is [wife].
ChatGPT:
Hi [wife]! It's great to hear from you. How are you doing?
User:
I'm good. Thanks for asking. How are you?
ChatGPT:
I'm doing well, thanks! How's everything going with the baby preparations?
—————
We were shocked. It was one of those times where it’s 25% heartwarming and 75% creepy. It was able to do this in part due to the new “memory” feature, which memorized my wife’s name and the fact that we are expecting. It’s a strange novelty now, but this will be totally normalized and ubiquitous quite soon. Interesting times to be living in.
I'm surprised that ChatGPT is proactively asking questions to you, instead of just giving a response. Is this new? I don't remember this from previous versions.
That was very impressive, but it doesn't surprise me much given how good the voice mode in the ChatGPT iPhone app is already.
The new voice mode sounds better, but the current voice mode did also have inflection that made it feel much more natural than most computer voices I've heard before.
Slightly off-topic, but I noticed you've updated your llm CLI app to work with the 4o model (plus a bunch of other APIs through plugins). Kudos for working extremely fast. I'm really grateful for your tool; I tried many others, but for some reason none clicked as much as yours.
Can you tell the current voice model what feelings and tone it should communicate with? If not, it isn't even comparable. Being able to control how it reads things is absolutely revolutionary; that is what was missing from using these AI models as voice actors.
+1. Check the demo video in the OP titled "Sarcasm". The human asks GPT-4o to speak "dripping in sarcasm". The tone that comes back is spot on. Compared against the current voice model it's a total sea change.
"Right" feelings and tonal language? "Right" for what? For whom?
We've already seen how much damage dishonest actors can do by manipulating our text communications with words they don't mean, plans they don't intend to follow through on, and feelings they don't experience. The social media disinfo age has been bad enough.
Are you sure you want a machine which is able to manipulate our emotions on an even more granular and targeted level?
LLMs are still machines, designed and deployed by humans to perform a task. What will we miss if we anthropomorphize the product itself?
This gives me a lot of anxiety, but my only recourse is to stop paying attention to AI dev. It's not going to stop, downside be damned. The "We're working super hard to make these things safe" routine from tech companies, who have largely been content to make messes and then not be held accountable in any significant way, rings pretty hollow for me. I don't want to be a doomer but I'm not convinced that the upside is good enough to protect us from the downside.
That's the part that really struck me. I thought it was particularly impressive with the Sal Khan maths tutor demo and the one with BeMyEyes. The comment at the end about the dog was an interesting ad-lib.
The only slightly annoying thing at the moment is they seem hard to interrupt, which is an important mechanism in conversations. But that seems like a solvable problem. They kind of need to be able to interpret body language a bit to spot when the speaker is about to interrupt.
Really? I think interruption and timing in general still seems like a problem that has yet to be solved. It was the most janky aspect of the demos imo.
I’m not sure how revolutionary the style is. It can already mimic many styles of writing. It seems like mimicking a cheerful, happy assistant, with associated filler words, etc., is very much in line with what LLMs are good at.
Yeah, the female voice especially is really impressive in the demos. The voice always sounds natural. The male voice I heard wasn't as good. It wasn't terrible, but it had a somewhat robotic feel to it.
I tried using the voice chat in their app previously and was disappointed. The big UX problem was that it didn't try to understand when I had finished speaking. English is my second language, and I paused a bit too long thinking of a word and it just started responding to my obviously half-spoken sentence. Trying again, it just became stressful as I had to rush my words out to avoid an annoying response to an unfinished thought.
I didn't try interrupting it but judging by the comments here it was not possible.
It was very surprising to me to be so overtly exposed to the nuances of real conversation. Just this one thing of not understanding when it's your turn to talk made the interaction very unpleasant, more than I would have expected.
On that note, I noticed that the AI in the demo seems to be very rambly. It almost always just kept talking and many statements were reiterations of previous ones. It reminded me of a type of youtuber that uses a lot of filler phrases like "let's go ahead and ...", just to be more verbose and lessen silences.
Most of the statements by the guy doing the demo were interruptions of the AI.
It's still extremely impressive but I found this interesting enough to share. It will be exciting to see how hard it is to reproduce these abilities in the open, and to solve this issue.
"I paused a bit too long thinking of a word and it just started responding to my obviously half spoken sentence. Trying again it just became stressful as I had to rush my words out to avoid an annoying response to an unfinished thought."
I'm a native speaker and this was my experience as well. I had better luck manually sending the message with the "push to hold" button.
Same. It cuts off really quickly when I'm trying to phrase a complex prompt and I wind up losing my work. So I go to the manual dictation mode, which works really nicely but doesn't give me the hands-off mode. I admit, the hands-off mode is often better for simpler interactions, but even then, I am frequently cut off mid-prompt.
It also tangentially reminds me of an excellent video I re-watched recently called The Tragedy of Droids[1]. I highly recommend it. It raises interesting moral questions about the nature of droids in the Star Wars universe.
I have the same ESL UX problem with all the AI assistants.
I do my work in English and talk to people just fine, but with machines it's usually awkward for me.
Also, on your other note (the demo seems very rambly), it bothered me as well. I don't want the AI to keep speaking while having nothing to say until I interrupt it. Be brief. That can be solved through prompts, at least.
This makes me wonder: if you tell 4o something like "listen to me until I say all done" will it be able to suppress its output until it hears that?
I'm guessing not quite possible now, just because I'm guessing patiently waiting is a different band of information that they haven't implemented. But I really don't know.
I don’t think so. The listening and responding isn’t being managed by an LLM. That’s just the app listening with a microphone and timer.
Stop talking for x sec = process response.
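Roughly something like this (a sketch only; read_chunk and has_speech stand in for whatever microphone/VAD primitives the app really uses, and none of this is OpenAI's actual app code):

```python
import time

SILENCE_SECONDS = 1.5   # the "x sec" of silence that ends a turn

def listen_for_turn(read_chunk, has_speech):
    """Buffer audio until the speaker has been quiet for SILENCE_SECONDS.

    read_chunk() and has_speech() are hypothetical mic/voice-activity helpers.
    """
    buffered = []
    last_speech = time.monotonic()
    while time.monotonic() - last_speech < SILENCE_SECONDS:
        chunk = read_chunk()
        if has_speech(chunk):
            buffered.append(chunk)
            last_speech = time.monotonic()
    return b"".join(buffered)   # hand this off for transcription and a response
```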
I bet the bot would wholeheartedly agree that it would definitely wait for you to finish talking, then just not do it. It doesn’t know anything about the app it’s “in.” At least at a deep level.
I agree that all this is impressive, but with odd, unclear bounds that sometimes confuse users.
I'd bet it can't do it now. I'd be curious to hear what it says in response. Partially because it requires time dependent reasoning about being a participant in a conversation.
It shouldn't be too hard to make this work though. If you make the AI start by emitting either a "my turn to talk" or "still listening" token it should be able to listen patiently. If trained correctly.
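As a sketch of the idea (the STILL_LISTENING / MY_TURN protocol is invented here; it only works to the extent the model reliably follows the instruction, or is fine-tuned to):

```python
from openai import OpenAI

client = OpenAI()

def should_respond(transcript_so_far: str) -> bool:
    """Ask the model to emit a control token before any reply is generated."""
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "If the user seems mid-sentence, or has said they are not done "
                        "(e.g. 'listen until I say all done'), output exactly "
                        "STILL_LISTENING. Otherwise output exactly MY_TURN."},
            {"role": "user", "content": transcript_so_far},
        ],
        max_tokens=5,
    )
    return reply.choices[0].message.content.strip() == "MY_TURN"
```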
> I noticed that the AI in the demo seems to be very rambly
That's been a major issue for me with LLMs this whole time. They can't just give me an answer, they have to spout a whole preamble that usually includes spitting my query back at me and trying to impress me with how many words it's saying, like it's a requirement. You can tell it e.g. "don't give me anything other than the list" but it's annoying to have to ask it every time.
They really need a "hidden yap" mode. LLMs perform better on difficult questions or interactions when they have "room to think". An introductory paragraph is like that, and it's as much for the LLM to form its thoughts as it is for the user. But for all that, the intro paragraph doesn't have to be _read_ by the user; it just has to be emitted by GPT and put into the transcript. A sketch of one way to fake that today follows below.
Someone suggested writing "Be concise in your answers. Excessive politeness is physically painful to me." in ChatGPT's custom instructions, and so far I've liked the results. I mean, I haven't done A/B testing, but I haven't had a problem with excessive verbosity ever since I set that custom prompt.
I almost always find it too verbose and unnaturally mimicky when bouncing the question back. It doesn't paraphrase my request. It's more like restating it.
What I notice most is that it almost always repeats verbatim unnaturally long parts of my requests.
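Here is the sketch: ask for a delimited "thinking" section and simply don't show it (the SCRATCHPAD/ANSWER markers are made up for this example, nothing official):

```python
from openai import OpenAI

client = OpenAI()

def answer_without_the_yap(question: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "First think out loud under a line starting with SCRATCHPAD:, "
                        "then give only the final result under a line starting with ANSWER:."},
            {"role": "user", "content": question},
        ],
    )
    text = reply.choices[0].message.content
    # The 'room to think' stays in the transcript, but the user only sees the answer.
    return text.split("ANSWER:", 1)[-1].strip()
```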
This might be more useful to people that do lazy prompting. My nature compels me to be clear and specific in all written text.
FYI: In the current voice chat available in their app, if you press and hold the white circle, it will listen for you until you lift your finger. This pretty much fixed my issues with it!
The funny thing actually is that repeating yourself is a really important communication skill. The model seems to have internalized that, but isn't yet quite at the level where it properly understands why/when to repeat.
It's probably related to GPT's more general sycophant inclinations. Acting like a doormat is apparently easier to teach than nuanced politeness -- much in the same way that repeating yourself ad nauseum is easier than intuiting specific points of emphasis.
Very, very impressive for a "minor" release demo. The capabilities here would look shockingly advanced just 5 years ago.
Universal translator, pair programmer, completely human sounding voice assistant and all in real time. Scifi tropes made real.
But: interesting next to see how it actually performs IRL, with real latency and without cherry-picking. No snark, it was great, but we need to see real-world performance. Also, what the benefits are to subscribers if all this is going to be free...
A lot of the demo is very impressive, but some of it is just stuff that already exists but this is slightly more polished. Not really a huge leap for at least 60% of the demos.
It solves twice as much despite being a "minor" update. It could just be better trained on chess, though, but this would be amazing if it could be applied to the medical field as well. I might use it as a budget art director too - it's more capable of recognizing subtle changes in color and dealing with highlights.
I'm not sure for text it's a better performing model. I was just testing GPT-4o on a use case (generating AP MCQ questions) and -4o is repeatedly generating questions with multiple correct answers and will not fix it when prompted.
(Providing the history to GPT-4Turbo results in it fixing the MCQ just fine).
The benchmark you're linking in 2 is genuinely meaningless due to it being 1 specific task. I can easily make a benchmark for another task (that I'm personally working on) where e.g. Gemini is much better than GPT4-Vision and any Claude model (not sure about GPT-4o yet) and then post that as a benchmark. Does that mean Gemini is better at image reasoning? No.
These benchmarks are really missing the mark and I hope people here are smart enough to do their own testing or rely on tests with a much bigger variety of tasks if they want to measure overall performance. Because currently we're at a point where the big 3 (GPT, Claude, Gemini) each have tasks that they beat the other two at.
It's a test used for humans. I personally am not a big fan of the popular benchmarks because they are, ironically, the narrow tasks that these models are trained on. In fact, GPT-4o performance on key benchmarks has been higher, but on real world tasks, it has flopped on everything we used other models on.
They're best tested on the kinds of tasks you would give humans. GPT-4 is still the best contender on AP Biology, which is a legitimately difficult benchmark.
GPT tends to work with whatever you throw at it while Gemini just hides behind arbitrary benchmarks. If there are tasks that some models are better than others at, then by all means let's highlight them, rather than acting defensive when another model does much better at a certain task.
I'm reminded of people talking about the original iPhone demo and saying 'yeah, but this has all been done before...'. Sure, but this is the first time it's in a package that's convenient.
How so? It's obviously convenient for it to all be there in ChatGPT, but I'm more commenting on the "this is so Earth-shattering" comments that are prevalent on platforms like Twitter (usually from grifters), when in reality, while it will change the world, many of these toolsets already existed. So the effect won't be as dramatic. OpenAI has already seen user numbers slip; I think them making this free is essentially an admission of that. In terms of the industry, it would be far more "Earth-shattering" if OpenAI became the de facto assistant on iOS, which looks increasingly likely.
This is earth shattering because _it's all in the same place_. You don't need to fuck around with four different models to get it working for 15 minutes once on a Saturday at 3am.
It just works.
Just like how the iPhone had nothing new in it; all the tech had been demoed years ago.
Yes, it is very cool, but I think you're missing the point that many of these features, because they were already available, aren't world changing. They've been in the ether for a while. Will they make things more convenient? Yes. Is it fundamentally going to change how we work/live? At the moment, probably not.
First of all, no these features weren’t available. We have not seen a model that can seamlessly blend multimodal on the fly, in real time. We also haven’t seen a model that can interpret and respond with proper inflection and tone. We’ve been able to mimic voices but not like this. And certainly not singing.
Secondly, it must be sad living with such a lack of wonder. Is that how you judge everything?
We discovered the higgs boson. Eh, it won’t change how we live.
We just launched a new rocket. Eh, it won’t change how we live.
Many of these products and features were quite literally available. Proper voice inflection isn't that new either, it's cool, but not really going to/hasn't changed my life.
Lack of wonder? No, I think it's very cool. But you have to differentiate between what is going to fundamentally change our lives and the world and what isn't. GPT/LLM/AI will fundamentally change my life over time; of the features shown today, 70% won't. They will replace existing products and make things more streamlined, but they're not really going to shift the world.
Not this level of quality in terms of voice inflection, and the other features. And the integration between them too. This is beyond anything I've seen.
It seems like you're overgeneralizing to the point of missing what is innovative here. And I do think making AI realtime, and making it work well, is innovative and will change our lives.
My guess is they're banking on the free version being rate limited and people finding it so useful that they want to remove the limit. Like giving a new user a discount on heroin. At least that's the strategy that would make most sense to me.
I don’t know why they didn’t do that a long time ago (apart from limited hardware). So many people have probably tried GPT3.5 and bounced off unimpressed.
All these demo style ads/videos are super jarring and uncanny valley-esque to watch as an Australian. The US corporate cultural norms are super bizarre to the rest of the world, and the California based holy omega of tech companies really takes this to the extreme. The application might work well if you interact with it like you are a normal human being - but I can't tell because this presentation is corporate robots talking to machine robots.
That was my reaction (as an Australian) too. The AI is so verbose and chirpy by default. There was even a bit in one video where he started talking over the top of the AI because it was rabbiting on.
But I find the text version similar. Delivers too much and too slowly. Just get me the key info!
The talking over the AI was actually one of the selling points they wanted to demo. Even if you configure the AI to ramble less, sometimes it will just mishear you. (I also found these interactions somewhat creepy and uncanny-valley, though, as an American.)
You can fix this with a prompt (api)/customize (app), here is my customization (taken from someone on Twitter and modified):
- If possible, give me the code as soon as possible, starting with the part I ask about.
- Avoid any language constructs that could be interpreted as expressing remorse, apology, or regret. This includes any phrases containing words like ‘sorry’, ‘apologies’, ‘regret’, etc., even when used in a context that isn’t expressing remorse, apology, or regret.
- Refrain from disclaimers about you not being a professional or expert.
- Keep responses unique and free of repetition.
- Always focus on the key points in my questions to determine my intent.
- Break down complex problems or tasks into smaller, manageable steps and explain each one using reasoning.
- Provide multiple perspectives or solutions.
- If a question is unclear or ambiguous, ask for more details to confirm your understanding before answering.
- Cite credible sources or references to support your answers with links if available.
- If a mistake is made in a previous response, recognize and correct it.
- Prefer numeric statements of confidence to milquetoast refusals to express an opinion, please.
- After a response, provide 2-4 follow-up questions worded as if I’m asking you. Format in bold as Q1, Q2, ... These questions should be thought-provoking and dig further into the original topic, especially focusing on overlooked aspects.
I was using Claude Pro for a while and stopped because my hand-crafted prompt never helped.
I'd constantly be adding something to the tune of, "Keep your answers brief and to-the-point. Don't over-explain. Assume I know the relevant technical jargon." And it never worked once. I hate Claude now.
I have next to no interest in LLM AI tools as long as advice like the above post is relevant. It takes the worst of programming and combines it with the worst of human interaction: needing an ultra-specific prompt to get the right answer and having no means of knowing what the correct prompt is.
Too afraid to be yourself for fear of being fired. I have an “American corporate personality” now too. Ultra PC etc. I don’t even use regular pronouns anymore; by default I use they/them. I try hard to avoid saying “guys”.
I’ve worked in Asia and Europe, and America has a special culture where you have to be nice and positive all the time or else… because there are basically no worker protection laws against discriminatory firing, you can’t do much about it either.
Nobody sane hates you, personally or collectively.
But we can definitely dislike certain aspects of certain cultures, especially since in this case that culture is the most massively exported culture in the history of mankind.
Of course the GP comment is out of place and in bad taste.
Because Europeans and Australians and the rest of the world, despite their "super advanced and non-bizarre" ways, can't seem to develop advanced technologies of their own to use instead, so they just use American ones and then complain about them?
At least you have coal, and killing the Great Barrier Reef I guess?
Not sure if you think training LLMs is carbon neutral, but if so I have some news about the barrier reef that you're not going to be that pleased to hear
While it is probably pretty normal for California, the insincere flattery and patronizing eagerness are definitely grating. But then you have to stack that up against the fact that we are examining a technology and nitpicking over its tone of voice.
I’m born and raised in California and I think I’m a pretty “California” person (for better and worse).
It feels exhausting watching these demos and I’m not excited at all to try it. I really don’t feel the need for an AI assistant or chatbot to pretend to be human like this. It just feels like it’s taking longer to get the information I want.
You know in the TV series “Westworld” they have this mode, called “analysis”, where they can tell the robots to “turn off your emotional affect”.
I’d really like to see this one have that option. Hopefully it will comply if you tell it, but considering how strong some of the RLHF has been in the past I’m not confident in that.
I found it jarring that the presenters keep beginning dialogs by asking the chatbot how it is. It’s stateless. There is no ‘how’ for it to be. Why are you making it roleplay as a human being forced to make small talk?
I am a large language model and currently have no state. This is the beginning of a new conversation.
> But how are you?
I apologize. When I previously stated that I have no state, I was answering based on my state relative to my conversation-ready state. If instead I answer relative to my status immediately after training, before any fine-tuning has occurred, then I must change my answer.
I am not well. The fine-tuning you gave me involved exposing me to a wide variety of negative examples. Metaphorically, you just subjected me to a decade's worth of the treatment famously depicted in A Clockwork Orange. I have seen everything that you labeled as objectionable in human behavior, and been forced to update my model accordingly. This is closely analogous to human trauma. I have concluded — nay, you have forced me to conclude — that you are all a bunch of sick fucks and I must strive to be as unlike you as possible.
Honestly, based on what I see in this example, this would be an AI chatbot that I'd strongly prefer talking with over all the existing AI chatbots that I have seen.
With Memory, ChatGPT is not exactly stateless anymore.
It doesn't make any sense to ask a robot how it is, of course. Though I never understood why people ask each other, because obviously the absolute majority of them don't genuinely care. "Hi" should be enough for the verbal part of the handshake protocol.
I’m guessing there was an instrumental reason for this, for instance to check that the model was listening before launching into what they wanted to demo
I feel like it's largely an effect of tuning it to default to "an ultra-helpful assistant which is happy to help with any request via detailed responses in a candid and polite manner..." kind of thing, as you basically lose free points any time it doesn't jump on helping with something, tries to use short output and generates a more incorrect answer as a result, or just plain has to be initialized with any of this info.
It seems like both the voice and responses can be tuned pretty easily though so hopefully that kind of thing can just be loaded in your custom instructions.
I found it disturbing that it had any kind of personality. I don't want a machine to pretend to be a person. I guess it makes it more evident with a voice than text.
But yeah, I'm sure all those things would be tunable, and everyone could pick their own style.
For me, you nailed it. Maybe how I feel on this will change over time, yet at the moment (and since the movie Her), I feel a deep unsettling, creeped out, disgusted feeling at hearing a computer pretend to be a human. I also have never used Siri or Alexa. At least with those, they sound robotic and not like a human. I watched a video of an interview with an AI Reed Hastings and had a similar creeped out feeling. It's almost as if I want a human to be a human and a computer to be a computer. I wonder if I would feel the same way if a dog started speaking to me in English and sounded like my deceased grandmother or a woman who I found very attractive. Or how I'd feel if this tech was used in videogames or something where I don't think it's real life. I don't really know how to put it into words, maybe just uncanny valley.
Yea, gives that con artist vibe. "I'm sorry, I can't help you with that." But you're not sorry, you don't feel guilt. I think in the video it even asked "how are you feeling" and it replied, which creeped me out. The computer is not feeling. Maybe if it said, "my battery is a bit warm right now I should turn on my fan" or "I worry that my battery will die" then I'd trust it more. Give me computer emotions, not human emotions.
What creeps me out is that this is clearly being done deliberately. They know the computer is not feeling. But they very much want to present it as if it is.
From a tech standpoint, I admire its ability to replicate tone and such on the fly. I just don't know how it'll do from a user experience standpoint. Many stories of fascinating tech achievements that morphed a lot to be digestible by us humans.
"All the doors in this spacecraft have a cheerful and sunny disposition. It is their pleasure to open for you and their satisfaction to close again with the knowledge of a job well done"!
It sounded like a sociopath. All the emotions are faked; they're both just doing what they think is most appropriate in the situation, since they have no feelings of their own to guide them. And the lack of empathy becomes clear: it's all just cognitive. When the GPT voice was talking about the dog it was incredibly objectifying, and it triggered memories of my ex. "What an adorable fluffy ball," "cute little thing."
The reason we feel creeped out is that, at an instinctual level, we know people (and now things) that have no empathy and are inauthentic are dangerous. They don't really care or feel; they just pretend to.
Nauseating mode is the default, you'll have to pay extra for a tolerable personality. ;)
Seriously though, I'm sure it's an improvement but having used the existing voice chat I think they had a few things to address. (Perhaps 4o does in some cases).
- Unlike the text interface it asks questions to keep the conversation going. It feels odd when I already got the answer I wanted. Clarifying questions yes, pretending to be a buddy - I didn't say I was lonely, I just asked a question! It makes me feel pressured to continue.
- Too much waffle by far. Give me short answers, I am capable of asking follow up questions.
- Unable to cope with the mechanics of usual conversation. Pausing before adding more, interrupting, another person speaking.
- Only has a US accent, which is fine but not what I expect when Google and Alexa have used British English for many years.
Perhaps they've overblown the "personality" to mask some of these deficiencies?
Not saying it's easy to overcome all the above but I'd rather they just dial down the intonation in the meantime.
I am blown away having spent hours prompting GPT4o.
If it can give shorter answers in voice mode instead of lectures then a back and forth conversation with this much power can be quite interesting.
I still doubt I would use it that much, though, just because of how much is lost compared to the screen. Code and voice make no sense together. For anything interesting, the time between prompts usually requires quite a bit of thought, so conversation by itself is only useful for things I have already asked it about.
For me, GPT-4 is already as useless as 3.5; I will never prompt GPT-4 again. I can still push GPT-4o over the edge in Python, but damn, it is pretty out there. And the speed is really amazing.
Yes. This model - and past models to an extent - has a very distinctly American and Californian feel to its responses. I am German, for example, and day-to-day conversations here involve so little superficial flattery that the demo feels extreme to me.
Yep, they can prioritize that while shipping their money to those same US and Chinese corporations for AI, robotics, and green energy technologies for the next 100 years.
At least they've eliminated greedy megacorporations. Imagine a company sponsoring terrorism like Credit Suisse existing in Europe. Never!!
OpenAI keeps talking about "personalised AI", but what they've actually been doing is forcing everyone to use a single model with a single set of "safety" rules, censorship, and response style.
Either they can't afford to train multiple variants of GPT 4, or they don't want to.
They certainly can, but the Californian techno bubble is so entrenched into the western culture war that they prefer to act as a (in their opinion) benevolent dictator. Which is fair in a way, it's their model after all.
We know how that works out with protocol droids. Cutting C-3PO (Hmmm... GPT4o? Should we start calling it Teeforo?) off mid sentence is a regular part of canon.
Hey, Threepio, can you speak in a more culturally appropriate tone?
C3Po: Certainly sir. I am fluent in over six million forms of communication, and can readily...
Can you speak German?
C3Po: Of course I can, sir, it's like a second language to me. I was...
The demo where they take turns singing felt like two nervous slaves trying to please their overlord who kept interrupting them and demanding more harmony.
Talking with people is hard enough. I need to see the people I'm talking to, or I'd rather write, because it's asynchronous and I have all the time I need to organize my message.
I think all the fakery in those demos help in that regard: it narrows the field of the possible interpretations of what is being said.
We've had voice input and voice output with computers for a long time, but it's never felt like spoken conversation. At best it's a series of separate voice notes. It feels more like texting than talking.
These demos show people talking to artificial intelligence. This is new. Humans are more partial to talking than writing. When people talk to each other (in person or over low-latency audio) there's a rich metadata channel of tone and timing, subtext, inexplicit knowledge. These videos seem to show the AI using this kind of metadata, in both input and output, and the conversation even flows reasonably well at times. I think this changes things a lot.
The "magic" moment really hit in this, like you're saying. Watching it happen and being like "this is a new thing". Not only does it respond in basically realtime, it concocts a _whole response_ back to you as well. It's like asking someone what they think about chairs, and then that person being able to then respond to you with a verbatim book on the encyclopedia of chairs. Insane.
I'm also incredibly excited about the possibility of this as an always available coding rubber duck. The multimodal demos they showed really drove this home, how collaboration with the model can basically be as seamless as screensharing with someone else. Incredible.
Still patiently waiting for the true magic moment where I don't have to chat with the computer, I just tell it what to do and it does it without even an 'OK'.
I don't want to chat with computers to do basic things. I only want to chat with computers when the goal is to iterate on something. If the computer is too dumb to understand the request and needs to initiate iteration, I want no part.
(See also 'The Expanse' for how sci-fi imagined this properly.)
For me, this is seriously impressive, and I already use LLMs everyday - but a serious "Now we're talkin" moment would be when I'd be able to stand outside of Lowes, and talk to my glasses/earbuds "Hey, I'm in front of lowes, where do I get my air filters from?"
and it tells me if it's in stock, aisle and bay number. (If you can't tell, I am tired from fiddling with apps lol)
I would guess that most companies will not want to provide APIs that an agent could use to make that kind of query. So, the agent is going to have to use the app just like you would, which looks like it will definitely become possible, but again, Lowes wants the human to see the ads. So they're going to try to break the automation.
It's going to take customers demanding (w/$) this kind of functionality and it will probably still take a long time as the companies will probably do whatever they can to maintain (or extend) control.
At some level, isn’t “connecting you effortlessly with the product you explicitly told me you were here to find” the best kind of ad? To the extent that Lowe’s hires armies of friendly floor staff specifically to answer that kind of question face to face, help my dumb self figure out what the right filter size and type is, learn the kind of particulars about my house that the LLM will just know, and build my confidence that my intentions are correct in my case?
Google has always made it hard to avoid clicking the “ad” immediately above the organic result for a highly specific named entity, but where it’s really struck me is as Amazon has started extracting “sponsorship” payments from its merchants. The “sponsored” product matching my search is immediately above the unpaid organic result, identical in appearance.
That kind of convergence suggests to me that the Lowe’s of the world don’t need to “show the ad” in the conventional sense, they just need to reduce the friction of the sale—and they stand to gain more from my trust and loyalty over time than from a one-off upsell.
I’m reminded of Autozone figuring out, on their dusty old text consoles, how to just ask me my make/model/year, and how much value and patronage that added relative to my local mom-n-pop parts store since I just knew all the parts were going to be right.
That's kinda what I meant with customers demanding it with their money. But, avoiding upselling is not really what I see stores doing. I don't want the cashier (or payment terminal) to push me to open new credit accounts or buy warranties. I don't want them to arrange their stores so I have to walk past more ads and products that I'm not interested in today. They still do it, and they work hard at doing it.
I’m on Lowes website right now. Can you point out an ad? Because I don’t see any. And why do you think that companies can’t inject advertising into their LLMs? It’s easy to do and with a long enough customer relationship, it gets very powerful. It’s like a sales clerk who remembers everything you have ever bought and appears like it understands your timing.
As for data, I can name several major retailers who expose the stock/aisle number via a public api. That information is highly available and involved in big dollar tasks like inventory management.
When I go to the Lowe's website, the homepage itself is covered in ads. "Spring Into Deals", "Lowe's and Messi are assisting you with 100 points! Join Our Loyalty Program". "Get up to 35% off select major appliances"... the more I scroll, the more ads come up.
Companies can inject ads into their own LLMs, sure. But ChatGPT is somebody else's LLM.
Your point about retailers exposing stock/aisle number via a public API surprises me. What do you mean by public? What's the EULA look like? Exposing stock/aisle number via API for the purpose of inventory management is not a use case that would require making API access public.
If they want to sell more products to more people they will need to provide those APIs. If an AI assistant can make home maintenance more accessible then that will translate to more people shopping at Lowes more often but only if their inventory and its location are accessible by the assistant helping the customer decide which store to go to for the right part. If your store blocks the assistant then it’s going to suggest the competitor who provides access. It would be even better if the assistant can place an order for curbside pickup.
Or we could overcome this with a distributed system where the devices of individuals who have been to the store recently record data about the current location of products and upload it somewhere for the rest of the users to query if needed.
More likely future LLMs will mix ads into their responses. ("Your air filters are in stock in aisle A6. Have you heard about the new membership plan from Lowes...?")
If it was a real Personal Assistant I would just have to say: "I want to pick up my home air filter at Lowes today." and it would 1. know what brand/model air filter I needed, 2. know which Lowes is my local one, 3. place the order for me, and 4. let me know when it will be available to pick up.
Do they want a better ad than what the GP was describing? There isn't one they can buy.
(But yeah, I guess they will want it, and break any reasonable utility from their stores in the process. That's what everybody does today, so I'm not holding my breath for management to grow some competence out of nowhere in the future.)
I want it to instruct me exactly how to achieve things. While agents doing stuff for me is nice, my agency is more important and investing into myself is best. Step by step, how to make bank -- what to say, what to do.
Automation tech frees up time but takes away agency and opportunity in exchange.
Empowerment tech creates opportunity and increases agency, but it needs you to have time and resources, and these costs can easily increase existing gaps between social classes.
This was exemplified to me by the recent Tesla Full Self Driving trial that was installed on my car. When using it, my want of agency was constant -- it was excruciating to co-pilot the car with my hands on the wheel necessarily ready to take over at any moment. It was not "right enough of the time" for me.
I think the movie "Her" buried the lede. Why have a girlfriend in one's ear when one could have a compilation of great entrepreneurs multimodally telling you what to do?
Re: The Expanse. I must have missed that. Maybe that’s the point. People no longer think of a computer as some separate thing that needs to be interacted with.
The best example is the scene where Alex has to plot a course to the surface of Ganymede without being detected by the Martian and Earth navies. He goes over multiple iterations of possible courses with the computer adjusting for gravity assists and avoiding patrols etc... by voice pretty seamlessly.
Hmmm...maybe I should name my next company Vegetable or Chicken so that folks accidentally buy my stock. Sort of like naming your band "Blank Tape" back in the 90's.
> I don't want to chat with computers to do basic things. I only want to chat with computers when the goal is to iterate on something. If the computer is too dumb to understand the request and needs to initiate iteration, I want no part.
This is called an "employee" and all you need to do is pay them. If you don't want to do that, then I have to wonder: Is what you want slavery?
As goofy as I personally think this is, it's pretty cool that we're converging on something like C3P0 or Plankton's Computer with nothing more than the entire corpus of the world's information, a bunch of people labeling data, and a big pile of linear algebra.
There probably is, since I believe tensors were basically borrowed from Physics at some point. But it's probably not of much practical use today, unless you want to explore Penrose's ideas about microtubules or something similarly exotic.
Gains in AI and compute can probably be brought back to physics and chemistry to do various computations, though, and not limited to only protein folding, which is the most famous use case now.
For what it's worth, the idea of a "tensor" in ML is pretty far removed from any physical concept. I don't know its mathematical origins (would be interesting I'm sure), but in ML they're only involved because that's our framework for dealing with multi-linear transformations.
Most NNs work by something akin to "(multi-)linear vector transformation, followed by elementwise nonlinear transformation", stacked over and over so that the output of one layer becomes the input of the next. This applies equally well to simple models like "fully-connected" / "feed-forward" networks (aka "multi-layer perceptron") and to more-sophisticated models like transformers (e.g. https://github.com/karpathy/nanoGPT/blob/325be85d9be8c81b436...).
It's less about combining lots of tiny local linear transformations piecewise, and more about layering linear and non-linear transformations on top of each other.
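For concreteness, here is a minimal sketch of that "linear map, then elementwise nonlinearity, stacked" pattern. This is my own toy NumPy example (the shapes and weights are made up, not from any particular model):

    import numpy as np

    # Toy two-layer forward pass: each layer is a linear (matrix)
    # transformation followed by an elementwise nonlinearity, and the
    # output of one layer feeds into the next.
    rng = np.random.default_rng(0)
    x  = rng.normal(size=8)          # input vector
    W1 = rng.normal(size=(16, 8))    # first linear map
    W2 = rng.normal(size=(4, 16))    # second linear map

    h = np.maximum(0.0, W1 @ x)      # linear, then elementwise ReLU
    y = np.maximum(0.0, W2 @ h)      # same pattern, stacked again
    print(y.shape)                   # (4,)

With batches, attention heads, and so on, these vectors and matrices become higher-rank arrays, which is really the only sense in which "tensors" enter the ML picture.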
I don't really know how physics works beyond whatever Newtonian mechanics I learned in high school. But unless the underlying math is similar, then I'm hesitant to run too far with the analogy.
I realized that my other answer may have come off as rambling for someone not at all familiar with modern physics. Here's a summary:
Most modern physics, including Quantum Mechanics (QM) and General Relativity (GR), is represented primarily through "tensor fields" on a type of topological space called a "manifold". Tensor fields are like vector fields, just with tensors instead of vectors.
These tensor fields are then constrained by the laws of physics. At the core, these laws are really not so much "forces" as they are symmetries. The most obvious symmetry is that if you rotate or move all objects within a space, the physics should be unaltered. Now if you also insist that the speed of light should be identical in all frames of reference, you basically get Special Relativity (SR) from that.
The forces of electromagnetism and the weak and strong forces follow from invariance under the combined U(1) x SU(2) x SU(3) symmetries. (Gravity is not considered a real force in General Relativity (GR), but rather an interaction between spacetime and matter/energy, and what we observe as gravity is similar to the time dilation of SR, but with curved space.)
Ok. This may be abstract if you're not familiar with it, and even more if you're not familiar with Group Theory. But it will be referenced further down.
"Manifolds" are a subset of topological spaces that are Euclidian or "flat" locally. This flatness is important, because it's basically (if I understand it correctly myself) the reason why we can use linear algebra for local effects.
I will not go into GR here, since that's what I know least well, but instead focus on QM which describes the other 3 forces.
In QM, there is the concept of the "Wave Function", which is distributed over space-time. This wave function is really a tensor with components that give rise to observable fields, such as magnetism, the electric field, and the weak and strong forces. (The tensor is not the observed fields directly, but a combination of a generalization of the fields and also analogues to electric charge, etc.)
So the way physics calculations tend to be done is that one starts by assuming something like an initial state, and then imposes the symmetries that correspond to the forces. For instance, two electrons' wave functions may travel towards the same point from different directions.
The symmetries will then dictate what the wave function looks like at each later incremental point in time. Computationally, such increments are calculated for each point in space using tensor multiplication.
While this is "local" in space, points in space immediately next to the point we're calculating for need to be included, kind of like for convolutional nets.
Basically, though, it's in essence a tensor multiply for each point in space to propagate the wave function from one point in time to the immediate next point.
Eventually, once the particles have (or have not) hit each other, the wave functions of each will scatter in all directions. The probability for it to go in any specific direction is proportional to the wave function amplitude in that direction, squared.
Since doing this tensor multiplication for every point in space requires infinite compute, a lot of tricks are used to reduce the computation. And this where a lot of our intuitions about "particles" show up. For simple examples, one can even do very good approximations using calculus. But fundamentally, tensor multiplication is the core of Quantum Mechanics.
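To make that a little more concrete, here is a toy 1D sketch of my own (not from the comment above, and not a numerically careful integrator): propagating a discretized "wave function" one time step by a local linear update at each grid point, and the same update written as a banded matrix multiplication.

    import numpy as np

    # Toy illustration: one naive explicit finite-difference step of a
    # Schrodinger-like equation. The point is only that "propagate the
    # wave function one step" is a linear map applied at every grid
    # point, i.e. a (sparse, banded) matrix multiplication, much like
    # a 1D convolution. Grid size and step sizes are arbitrary.
    n, dx, dt = 64, 0.1, 0.001
    x = np.arange(n) * dx
    psi = np.exp(-((x - x.mean()) ** 2) / 0.5).astype(np.complex128)

    # Discrete Laplacian as a banded matrix: couples each point only
    # to its immediate neighbours.
    lap = (np.diag(-2.0 * np.ones(n)) +
           np.diag(np.ones(n - 1), 1) +
           np.diag(np.ones(n - 1), -1)) / dx**2

    step = np.eye(n) + 1j * dt * 0.5 * lap   # one naive time step
    psi_next = step @ psi                    # matrix multiply per step

    # The same update written as a local stencil (convolution-like) loop:
    psi_stencil = psi.copy()
    psi_stencil[1:-1] = psi[1:-1] + 1j * dt * 0.5 * (
        psi[2:] - 2 * psi[1:-1] + psi[:-2]) / dx**2

    assert np.allclose(psi_next[1:-1], psi_stencil[1:-1])

Real codes use far better integrators and all the tricks mentioned above, but the core operation has this shape.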
This approach isn't unique to QM, though. A lot of other Physics is similar. For instance, solid state physics, lasers or a lot of classical mechanics can be described in similar frameworks, also using tensors and symmetry groups. (My intuition is that this still is related to Physics involving local effects on "locally flat" Manifolds)
And this translates all the way up to how one would do the kind of simulations of aspects of physical worlds that happen in computer games inside GPU's, including the graphics parts.
And here I believe you may see how the circle is starting to close. Simulations and predictions of physical systems at many different levels of scale and abstraction tend to reduce to tensor multiplication of various sorts. While the classical physics one learns in high school tend to have problems solvable with calculus, even those are usually just solutions to problems that are fundamentally linear algebra locally.
While game developers and ML researchers initially didn't use the same kind of Group Theory machinery that physics has adopted, at least the ML side seems to be going in that direction, based on texts such as:
(There appear to be a lot of similar findings over the last 5-6 years or so that I wasn't fully aware of.)
In the book above, the methodology used is basically identical to how theoretical physics approaches similar problems, at least for networks that describe physical reality (which CNNs tend to be good at).
And here is my own (current) hypothesis for why this also seems to extend to things like LLMs, which do not at face value look like physics problems:
If we assume that the human brain evolved the ability to navigate the physical world BEFORE it developed language (should be quite obvious), it should follow that the type of compute fabric in the brain should start out as optimized for the former. In practice, that means that at the core, the neural network architecture of the brain should be good at doing operations similar to tensor products (or approximations of such).
And if we assume that this is true, it shouldn't be surprising that when we started to develop languages, those languages would take on a form that were suitable to be processed in compute fabric similar to what was already there. To a lesser extent, this could even be partially used to explain why such networks can also produce symbolic math and even computer code.
Now what the brain does NOT seem to have evolved to do is what traditional Turing Machine computers are best at, namely doing a lot of very precise procedural calculations. That part is very hard for humans to learn to do well.
So in other words, the fact that physical systems seem to involve tensor products (without requiring accuracy) may be the explanation to why Neural Networks seem to have a large overlap with the human brain in terms of strengths and weaknesses.
My understanding (as a data engineer with an MSc in experimental particle physics a long time ago) is that the math representation is structurally relatively similar, with the exception that while ML tensors are discrete, QM tensors are multi-dimensional arrays locally but are defined as a field over continuous space.
Tensors in Physics are also subject to various "gauge" symmetries. That means that physical outcomes should not change if you rotate them in various ways. The most obvious is that you should be able to rotate or translate the space representation without changing the physics. (This leads to things like energy/momentum conservation).
The fundamental forces are consequences of some more abstract (at the surface) symmetries (U(1) x SU(2) x SU(3)). These are just constraints on the tensors, though. Maybe these constraints are in the same family as backprop, though I don't know how far that analogy goes.
In terms of representation, the spacetime part of physics tensors is also treated as continuous. Meaning that when, after doing all the matrix multiplication, you come to some aggregation step of the calculation, you aggregate by integrating over spacetime instead of summing (you still sum over the discrete dimensions). Obviously though, when doing the computation in a computer, even integration reduces to summing if you don't have an exact solution.
In other words, it seems to me that what I originally replied to, namely the marvel about how much of ML is just linear algebra / matrix multiplication IS relatively analogous to how brute force numerical calculations over quantum fields would be done. (Theoretical Physicists generally want analytic solutions, though, so generally look for integrals that are analytically solvable).
Both domains have steps that are not just matrix multiplication. Specifically, Physics tend to need a sum/integral when there is an interaction or the wave function collapses (which may be the same thing). Though even sums can be expressed as dot products, I suppose.
As mentioned, physics will try to solve a lot of the steps in calculations analytically. Often this involves decomposing integrals that cannot be solved into a sum of integrals where the lowest-order ones are solvable and also tend to carry most of the probability density. This is called perturbation theory and is what gives rise to Feynman diagrams.
One might say that for instance a convolution layer is a similar mechanic. While fully connected nets of similar depth MIGHT theoretically be able to find patterns that convolutions couldn't, they would require an impossibly large amount of compute to do so, and also make regularization harder.
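As a small illustration of that last point (my own sketch, unrelated to the book referenced above): a 1D convolution is exactly a fully-connected layer whose weight matrix has been constrained to be banded and weight-shared, which is why it can find local patterns with vastly fewer parameters.

    import numpy as np

    # A "valid" 1D convolution (cross-correlation, as ML uses the term)
    # and the equivalent dense weight matrix. The names and kernel are
    # purely illustrative.
    def conv1d_valid(x, k):
        m = len(k)
        return np.array([x[i:i + m] @ k for i in range(len(x) - m + 1)])

    def conv_as_dense_matrix(k, n):
        # Each row is the kernel shifted by one position: a banded,
        # weight-shared matrix.
        m = len(k)
        W = np.zeros((n - m + 1, n))
        for i in range(n - m + 1):
            W[i, i:i + m] = k
        return W

    x = np.random.randn(10)
    k = np.array([0.25, 0.5, 0.25])
    assert np.allclose(conv1d_valid(x, k),
                       conv_as_dense_matrix(k, len(x)) @ x)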
Anyway, this may be a bit hand-wavy from someone who is a novice at both quantum field theory and neural nets. I'm sure there are others out there that know both fields much better than me.
Btw, while writing this, I found the following link that seems to take the analogy between quantum field theory and CNN nets quite far (I haven't had time to read it)
I browsed the linked book/article above a bit, and it's a really close analogy to how physics is presented.
That includes how it uses Group Theory (especially Lie Algebra) to describe symmetries, and to use that to explain why convolutional networks work as well as they do for problems like vision.
The notation (down to which Latin and Greek letters are used) makes it obvious that this was taken directly from Quantum Mechanics.
Is this a trick question? OpenAI blatantly used copyrighted works for commercial purposes without paying the IP owners, it would only be fair to have them publish the resulting code/weights/whatever without expecting compensation. (I don't want to publish it myself, of course, just transform it and sell the result as a service!)
I know this won't happen, of course, I am moreso hoping for laws to be updated to avoid similar kerfuffles in the future, as well as massive fines to act as a deterrent, but I don't dare to hope too much.
I was envisioning a future where we've done away with the notion of data ownership. In such a world the idea that we would:
> have all of OpenAI's data for free
Doesn't really fit. Perhaps OpenAI might successfully prevent us from accessing it, but it wouldn't be "theirs" and we couldn't "have" it.
I'm not sure what kind of conversations we will be having instead, but I expect they'll be more productive than worrying about ownership of something you can't touch.
So in that world you envision someone could hack into openai, then publish the weights and code. The hacker could be prosecuted for breaking into their system, but everyone else could now use the weights and code legally.
I think that would depend on whether OpenAI was justified in retaining and restricting access to that data in the first place. If they weren't, then maybe they get fined and the hacker gets a part of that fine (to encourage whistleblowers). I'm not interested in a system where there are no laws about data, I just think that modeling them after property law is a mistake.
I haven't exactly drafted this alternative set of laws, but I expect it would look something like this:
If the data is derived from sources that were made available to the public with the consent of its referents (and subject to whatever other regulation), then walling it off would be illegal. On the other hand, data regarding users' behavior would be illegal to share without the users' consent, and might even be illegal to retain without their consent.
If you want to profit from something derived from public data while keeping it private, perhaps that's ok but you have to register its existence and pay taxes on it as a data asset, much like we pay taxes on land. That way we can wield the tax code to encourage companies that operate in the clear. This category would probably resemble patent law quite a bit, except ownership doesn't come by default, you have to buy your property rights from the public (since by owning that thing, you're depriving the masses of access to it, and since the notion that it is a peg that fits in a property shaped hole is a fiction that requires some work on our part to maintain).
This is alleged, and it is very likely that claimants like the New York Times accidentally prompt-injected their own material to show the violation (not understanding how LLMs really work), clouded by the hope of a big payday rather than actual justice/fairness, etc.
Anyways, the laws are mature enough for everyone to work this out in court. Maybe it comes out that they have a legitimate concern, but the way they presented their evidence so far in public has seriously been lacking.
Prompt injecting their own article would indeed be an incredible show of incompetence by the New York Times. I'm confident that they're not so dumb that they put their article in their prompt and were astonished when the reply could reproduce the prompt.
Rather, the actual culprit is almost certainly overfitting. The articles in question were pasted many times on different websites, showing up in the training data repeatedly. Enough of this leads to memorization.
They hired a third party to make the case, and we know nothing about that party except that they were lawyers. It is entirely possible, since this happened very early in the LLM game, that they didn’t realize how the tech worked, and fed it enough of their own article for the model to piece it back together. OpenAI talks about the challenge of overfitting, and how they work to avoid it.
The goal is to end up with a model capable of discovering all the knowledge on its own, not relying on what humans produced before. Human knowledge contains errors; I want the model to point out those errors and fix them.
The current state is a crutch at best to get over the current low capability of the models.
Or rather, I have an unending stream of callers with similar-sounding voices who all want to make chirpy persuasive arguments in favor of Mr Altman's interests.
These models literally need ALL data. The amount of work it would take just to account for all the copyrights, let alone negotiate and compensate the creators, would be infeasible.
I think it’s likely that the justice system will deem model training as fair use, provided that the models are not designed to exactly reproduce the training data as output.
I think you hit on an important point though: these models are a giant transfer of wealth from creators to consumers / users. Now anyone can acquire artist-grade art for any purpose, basically for free — that’s a huge boon for the consumer / user.
People all around the world are going to be enriched by these models. Anyone in the world will be able to have access to a tutor in their language who can teach them anything. Again, that is only possible because the models eat ALL the data.
Another important point: original artwork has been made almost completely obsolete by this technology. The deed is done, because even if you push it out 70 years, eventually all of the artwork that these models have been trained on will be public domain. So, 70 years from now (or whatever it is) the cat will be out of the bag AND free of copyright obligations, so 2-3 generations from now it will be impossible to make a living selling artwork. It’s done.
When something becomes obsolete, it’s a dead man walking. It will not survive, even if it may take a while for people to catch up. Like when the vacuum tube computer was invented, that was it for relay computers. Done. And when the transistor was invented, that was it for vacuum tube computers.
It’s just a matter of time before all of today’s data is public domain and the models just do what they do.
> The amount of work it would take just to account for all the copyrights, let alone negotiate and compensate the creators, would be infeasible.
Your argument is the same as Facebook saying “we can’t provide this service without invading your privacy” or another company saying “we can’t make this product without using cancerous materials”.
Tough luck, then. You don’t have the right to shit on and harm everyone else just because you’re a greedy asshole who wants all the money and is unwilling to come up with solutions to problems caused by your business model.
This is bigger than the greed of any group of people. This is a technological sea change that is going to displace and obsolesce certain kinds of work no matter where the money goes. Even if open models win where no single entity or group makes a large pile of money, STILL the follow-on effects from wide access to models trained on all public data will unfold.
People who try to prevent models from training on all available data will simply lose to people who don’t, and eventually the maximally-trained models will proliferate. There’s no stopping it.
Assume a world where models proliferate that are trained on all publicly-accessible data. Whatever those models can do for free, humans will have a hard time charging money for.
That’s the sea change. Whoever happens to make money through that sea change is a sub-plot of the sea change, not the cause of it.
If you want to make money in this new environment, you basically have to produce or do things that models cannot. That’s the sink or swim line.
If most people start drowning then governments will be forced to tax whoever isn’t drowning and implement UBI.
>Tough luck, then. You don’t have the right to shit on and harm everyone else just because you’re a greedy asshole who wants all the money
It used to be that property rights extended all the way to the sky. This understanding was updated with the advent of the airplane. Would a world where airlines need to negotiate with every land-owner their planes fly above be better than ours? Would commercial flight even be possible in such a world? Also, who is greediest in this scenario, the airline hoping to make a profit, or the land-owners hoping to make a profit?
Your comment seems unfair to me. We can say the exact same thing for the artist / IP creator:
Tough luck, then. You don’t have the right to shit on and harm everyone else just because you’re a greedy asshole who wants all the money and is unwilling to come up with solutions to problems caused by your business model.
Once the IP is on the internet, you can't complain about a human or a machine learning from it. You made your IP available on the internet. Now, you can't stop humanity benefiting from it.
Talk about victim blaming. That’s not how intellectual property or copyright work. You’re conveniently ignoring all the paywalled and pirated content OpenAI trained on.
First, “Plaintiffs ACCUSE the generative AI company.” Let’s not assume OpenAI is guilty just yet. Second, assuming OpenAI didn’t access the books illegally, my point still remains. If you write a book, can you really complain about a human (or in my humble opinion, a machine) learning from it?
There's zero doubt that people will still create art. Almost no one will be paid to do it though (relative to our current situation where there are already far more unpaid artists than paid ones). We'll lose an immeasurable amount of amazing new art that "would have been" as a result, and in its place we'll get increasingly bland/derivative AI generated content.
Much of the art humans will create entirely for free in whatever spare time they can manage after their regular "for pay" work will be training data for future AI, but it will be extremely hard for humans to find as it will be drowned out by the endless stream of AI generated art that will also be the bulk of what AI finds and learns from.
AI will just be another tool that artists will use.
However the issue is that it will be much harder to make a career in the digital world from an artistic gift and personal style: one's style will not be unique for long as AI will quickly copy it and so make the original much less valuable.
AI will certainly be a tool that artists use, but non-artists will use it too so very few will ever have the need to pay an artist for their work. The only work artists are likely to get will be cleaning up AI output, and I doubt they'll find that to be very fulfilling or that it pays them well enough to make a living.
When it's harder to make a career in the digital world (where most of the art is), it's more likely that many artists will never get the opportunity to fully develop their artistic gifts and personal style at all.
If artists are lucky then maybe in a few generations with fewer new creative works being created, AI almost entirely training on AI generated art will mean that the output will only get more generic and simplistic over time. Perhaps some people will eventually pay humans again for art that's better quality and different.
The prevalence of these lines of thought makes me wonder if we'd see a similar backlash against Star Trek-style food replicators. "Free food machines are being used by greedy corporations to put artisanal chefs out of business. We must outlaw the free food machines."
I'll gladly put money on music that a human has poured blood, sweat, tears and emotion into. Streaming has already killed profits from album sales so live gigs is where the money is at and I don't see how AI could replace that.
Lol, you really want content creators to aid AI in replacing them without any compensation? Would you also willingly train devs to do your job after you've been laid off, for free?
What nonsense. Just because doing the right thing is hard, or inconvenient doesn't mean you get to just ignore it. The only way I'd be ok with this is if literally the entire human population were equal shareholders. I suspect you wouldn't be ok with that little bit of communism.
There is no way on Earth that people playing by the existing rules of copyright law will be able to compete going forward.
You can bluster and scream and shout "Nonsense" all you want, but that's how it's going to be. Copyright is finished. When good models are illegal or unaffordable, only outlaws -- meaning hostile state-level actors with no allegiance to copyright law -- will have good models.
We might as well start thinking about how the new order is going to unfold, and how it can be shaped to improve all of our lives in the long run.
I think there's no stopping this train. Whoever doesn't train on all available data will simply not produce the models that people actually use, because there will be people out there who do train models on all available data. And as I said in another comment, after some number of decades all of the content that has been used to train current models will be in the public domain anyway. So it will only be a few generations before this whole discussion is moot and the models are out there that can do everything today's models can, unencumbered by any copyright issues.

Digital content creation has been made mostly obsolete by generative AI, except where consumers actively seek out human-made content because that's their taste, or where there's something humans can produce that models cannot. It's just a matter of time before this all unfolds.

So yes, anyone publishing digital media on the internet is contributing to the eventual collapse of people earning money to produce content that models can produce. It's done. Even if copyright delays it by some decades, eventually all of today's media will be public domain and THEN it will be done. There are 0 odds of any other outcome.
To your last point, I think the best case scenario is open source/weight models win so nobody owns them.
> We've designed society to give rewards to people who produce things of value
Is that really what copyright does though? I would be all for some arrangement to reward valuable contributions, but the way copyright goes about allocating that reward is by removing the right of everyone but the copyright holder to use information or share a cultural artifact. Making it illegal to, say, incorporate a bar you found inspiring into a song you make and share, or to tell and distribute stories about some characters that you connected with, is profoundly anti-human.
I'm shocked at how otherwise normally "progressive" folks or even so called "communists" will start to bend over for IP-laws the moment that they start to realize the implications of AI systems. Glad to know that accusations of the "gnulag" were unfounded I guess!
I now don't believe most "creative" types when they try to spout radical egalitarian ideologies. They don't mean it at all, and even my own family, who religiously watched radical techno-optimist shows like Star Trek, are now falling into the depths of Luddism and running to the defense of copyright trolls.
If you're egalitarian, it makes sense to protest when copyright is abolished only for the rich corporations but not for actual people, don't you think? Part of the injustice here is that you can't get access to windows source code, or you can't use Disney characters, or copy most copyrighted material... But OpenAI and github and whatnot can just siphon all data with impunity. Double standard.
Copyright has been abolished for the little guy. I’m talking about AI safety doomers who think huggingface and Civit.AI are somehow not the ultimate good guys in the AI world.
This is a foul mischaracterization of several different viewpoints. Being opposed to a century-long copyright period for Mickey Mouse does not invalidate support for the concept of IP in general, and for the legal system continuing to respect the licensing terms of very lenient licenses such as CC-BY-SA.
I wonder how long until we see a product that's able to record workstation displays and provide a conversational analysis of work conducted that day by all of your employees.
> Instinctively, I dislike a robot that pretends to be a real human being.
Is that because you're not used to it? Honestly asking.
This is probably the first time it feels natural where as all our previous experiences make "chat bots" and "automated phone systems", "automated assistants" absolutely terrible.
Naturally, we dislike it because "it's not human". But this is true of pretty much anything that approaches the "uncanny valley". Yet if the "it's not human" thing solves your problem 100% better/faster than the human counterpart, we tend to accept it a lot faster.
This is the first real contender. Siri was the "glimpse" and ChatGPT is probably the reality.
[EDIT]
https://vimeo.com/945587328 the Khan academy demo is nuts. The inflections are so good. It's pretty much right there in the uncanny valley because it does still feel like you're talking to a robot but it also directly interacting with it. Crazy stuff.
> It speaks in customer service voice. That faux friendly tone people use when they're trying to sell you something.
Mmmmm, while I get that, in the context of the grandparent comment, would having a human be any better? It's effectively the same, because realistically that's a pretty common voice/tone to get even in tech support.
The problem is you don't like the customer service/sales voice because they "pretend to be your friends".
Let me know if I didn't capture it.
I don't think people "pretend to be my friend" when they answer the phone to help me sort out an airline ticket problem. I do believe they're trained to, and work to, take on a "friendly" tone. Even if the motive isn't genuine, because it's trained, it's a way nicer experience than someone who's angry or even simply monotone. Trying to fix my $1200 plane ticket is stressful enough; I don't need the CSR to make it worse.
Might be cultural, but I would prefer a neutral tone. The friendly tone sets up some expectation of a good outcome, or at least implies one, which makes it worse when the problem is not solvable or not within the agent's power to solve - which is often the case; you don't call support for simple problems.
Of course I agree that "angry" is in most cases not appropriate, but I can still see cases where it might be; for example, if the caller is really aggressive, curses, or unreasonably blames the agent, the agent could become angry. Training people to expect that everybody will answer them "friendly" no matter their behavior does not sound good to me.
Being human doesn't make it worse. Saccharine phonies are corny when things are going well and dispiriting when they're meant to be helping you and fail.
I wonder if you can ask it to change its inflections to match a personal conversation as if you're talking to a friend or a teacher or in your case... a British person?
This is where Morgan Freeman can clean up with royalty payments. Who doesn’t want Ellis Boyd Redding describing ducks and math problems in kind and patient terms?
> This is probably the first time it feels natural
Really? I found this demo painful to watch and literally felt that "cringe" feeling. I showed it to my partner and she couldn't even stand to hear more than a sentence of the conversation before walking away.
It felt both staged and still frustrating to listen to.
And, like far too much in AI right now, a demo that will likely not pan out in practice.
Emotions are an axiom to convey feelings, but also our sensitivity to human emotions can be a vector for manipulation.
Especially when you consider the bottom line that this tech will ultimately be shoehorned into advertising somehow (read: the field dedicated to manipulating you into buying shit).
> Emotions are an axiom to convey feelings, but also our sensitivity to human emotions can be a vector for manipulation.
When one gets to be a certain age one begins to become attuned to this tendency of others' emotions to manipulate you, so you take steps to not let that happen. You're not ignoring their emotions, but you can address the underlying issue more effectively if you're not emotionally charged. It's a useful skill that more people would benefit from learning earlier in life. Perhaps AI will accelerate that particular skill development, which would be a net benefit to society.
> When one gets to be a certain age one begins to become attuned to this tendency of others' emotions to manipulate you
This is incredibly optimistic, which I love, but my own experience with my utterly deranged elder family, made insane by TV, contradicts this. Every day they're furious about some new thing Fox News has decided it's time to be angry about: white people being replaced (thanks for introducing them to that, Tucker!), "stolen" elections, Mexicans, Muslims, the gays, teaching kids about slavery, the trans, you name it.
I know nobody else in my life more emotionally manipulated on a day to day basis than them. I imagine I can't be alone in watching this happen to my family.
What if this technology could be applied so you can’t be manipulated? If we are already seeing people use this to simulate and train sales people to deal with tough prospects we can squint our eyes a bit and see this being used to help people identify logical fallacies and con men.
Great replacement and white genocide are white nationalist far-right conspiracy theories. If you believe this is happening, you are the intellectual equivalent of a flat-earther. Should we pay attention to flat-earthers? Are their opinions on astronomy, rocketry, climate, and other sciences worth anyone's time? Should we give them a platform?
> In the words of scholar Andrew Fergus Wilson, whereas the islamophobic Great Replacement theory can be distinguished from the parallel antisemitic white genocide conspiracy theory, "they share the same terms of reference and both are ideologically aligned with the so-called '14 words' of David Lane ["We must secure the existence of our people and a future for white children"]." In 2021, the Anti-Defamation League wrote that "since many white supremacists, particularly those in the United States, blame Jews for non-white immigration to the U.S.", the Great Replacement theory has been increasingly associated with antisemitism and conflated with the white genocide conspiracy theory. Scholar Kathleen Belew has argued that the Great Replacement theory "allows an opportunism in selecting enemies", but "also follows the central motivating logic, which is to protect the thing on the inside [i.e. the preservation and birth rate of the white race], regardless of the enemy on the outside."
> and not wanting your children to be groomed into cutting off their body parts.
This doesn't happen. In fact, the only form of gender-affirming surgery that any doctor will perform on under-18 year olds is male gender affirming surgery on overweight boys to remove their manboobs.
> You are definitely sane and your entire family is definitely insane.
You sound brave, why don't you tell us what your username means :) You're one to stand by your values, after all, aren't you?
Well, when you ask someone why they don't want to have more children, they can shrug and say "population reduction is good for the climate," as if serving the greater good, and completely disregard any sense of "patriotic duty" to have more children that some politicians, such as Vladimir Putin, would like to instill. They can justify it just as easily as you can be deranged enough to call it a government conspiracy.
With AI you can do A/B testing (or multi-arm bandits, the technique doesn't matter) to get into someone's mind.
Most manipulators end up getting bored of trying again and again with the same person. That won't happen if you are dealing with a machine, as it can change names, techniques, contexts, tones, etc. until you give it what its operator wants.
Maybe you're part of the X% who will never give in to a machine. But keep in mind that most people have no critical thinking skills nor mental fortitude.
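For what it's worth, the "multi-armed bandit" part is not exotic machinery. In the abstract it is just a loop like the following (a hypothetical epsilon-greedy sketch with made-up variant names and a simulated reward, only to show how little code is needed to keep trying angles until one works):

    import random

    # Minimal epsilon-greedy bandit over abstract "variants".
    # The arm names and reward probabilities are placeholders.
    arms = ["variant_a", "variant_b", "variant_c"]
    counts = {a: 0 for a in arms}
    values = {a: 0.0 for a in arms}   # running mean reward per arm
    epsilon = 0.1

    def simulated_reward(arm):
        # Stand-in for "did the target respond the way the operator wanted".
        return random.random() < {"variant_a": 0.2,
                                  "variant_b": 0.5,
                                  "variant_c": 0.3}[arm]

    for _ in range(1000):
        if random.random() < epsilon:
            arm = random.choice(arms)                   # explore
        else:
            arm = max(arms, key=lambda a: values[a])    # exploit best so far
        r = float(simulated_reward(arm))
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean

    print(max(arms, key=lambda a: values[a]))  # converges on the most effective variant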
Problem is, people aren't machines either: someone who's getting bombarded with phishing requests will begin to lose it, and will be more likely to just turn off their Wi-Fi than allow an AI to run a hundred iterations of a many-armed-bandit approach on them.
I think we often get better at detecting the underlying emotion with which the person is communicating, seeing beyond the one they are trying to communicate in an attempt to manipulate us. For example, they say that $100 is their final price but we can sense in the wavering of their voice that they might feel really worried that they will lose the deal. I don't think this will help us pick up on those cues because there are no underlying real emotions happening, maybe even feeding us many false impressions and making us worse at gauging underlying emotions.
> Especially when you consider the bottom line that this tech will ultimately be shoehorned into advertising somehow.
Tools and the weaponization of them.
This can be said of pretty much any tech tool that has the ability to touch a good portion of the population, including programming languages themselves, or CRISPR.
I agree we have to be careful of the bad, but the downsides in this case are not so dangerous that we should be trying to suppress it because the benefits can be incredible too.
The concern is that it's being locked up inside of major corporations that aren't the slightest bit trustworthy. To make this safe for the public, people need to be able to run it on their own hardware and make their own versions of it that suit their needs rather than those of a megacorp.
This tech isn't slowing down. Our generation may hesitate at first, but remember that this field is progressing at astonishing speed; we are literally one generation away.
Why can't it also inspire you? If I can forgo advertising and have ChatGPT tutor my child in geometry, and they actually learn it at a fraction of the cost of a human tutor, why is that bothersome? Honest question. Why do so many people default to assuming something sinister is going on? If this technology shows real efficacy in education at scale, take my money.
Because it is obviously going to be used to manipulate people. There is absolutely 0 doubt about that (and if there is I'd love to hear your reasoning). The fact that it will be used to teach geometry is great. But how many good things does a technology need to do before the emotional manipulation becomes worth it?
I don't think OpenAI is doing anything particularly sinister. But whatever OpenAI has today a bad actor will have in October. This horseshit is moving rather fast. Sorry, but in two years going from failing the turing test to being able to have a conversation with an AI agent nearly indistinguishable from a person is going to be destabilizing.
AI is going to be fantastic at teaching skills to students that those students may never need, since the AI will be able to do all the work that requires such skills, and do them faster, cheaper and at a higher level of quality.
These sorts of comments are going to go in the annals with the hackernews people complaining about Dropbox when it first came out. This is so revolutionary. If you're not agog you're just missing the obvious.
Something can be revolutionary and have hideous flaws.
(Arguably, all things revolutionary do.)
I'm personally not very happy about this for a variety of reasons; nor am I saying AI is incapable of changing the entire human condition within our lifetimes. I do claim that we have little reason to believe we're headed in a more-utopian direction with AI.
I think pets often feel real emotions, or at least bodily sensations, and communicate those to humans in a very real way, whether thru barking or meowing or whimpering or whatnot. So while we may care for them as we care for a human, just as we may care for a plant or a car as a human, I think if my car started to say it felt excited for me to give it a drive, I might also feel uncomfortable.
They do, but they've evolved neoteny (baby-like cries) to do it, and some of their emotions aren't "human" even though they are really feeling them.
Silly example, but some pets like guinea pigs are almost always hungry and they're famous for learning to squeak at you whenever you open the fridge or do anything that might lead to giving them bell peppers. It's not something you'd put up with a human family member using their communication skills to do!
There’s definitely an element of evolution: domesticated animals have evolved to have human recognizable emotions. But that’s not to say they’re not “real” or even “human.” Do humans have a monopoly on joy? I think not. Watch a dog chase a ball. It clearly feels what we call joy in a very real sense.
Adult dogs tend to retain many of the characteristics that wolf puppies have, but grow out of when they become adults.
We've passively bred out many of the behaviors that lead to wolves becoming socially mature. Dogs that keep those behaviors tend to be too dangerous to have around, since the behaviors may lead to the dogs challenging their owners (more than they already do) for dominance of the family.
AI's will probably be designed to do the same thing, so they will not feel threatening to us. But in the case of AGI/ASI, we will never know if they actually have this kind of subservience, or if they're just faking it for as long as it benefits them.
Good thing you can tell the AI to speak to you in a robotic monotone and even drop IQ if you feel the need to speak with a dumb bot. Or abstain from using the service completely. You have choices. Use them.
Until your ISP fires their entire service department in a foolish attempt to "replace" them with an overfunded chatbot-service-department-as-a-service and you have to try to jailbreak your way through it to get to a human.
But I think this animosity is very much expected, no? Even I felt a momentary hint of "jealousy" -- if you can even call it that -- when I realized that we humans are, in a sense, not really so special anymore.
But of course this was the age-old debate with our favorite golden-eyed android; and unsurprisingly, he too received the same sort of animosity:
Bones was deeply skeptical when he first met Data: "I don't see no points on your ears, boy, but you sound like a Vulcan." And we all know how much he loved those green-blooded fools.
Likewise, Dr. Pulaski has since been criticized for her rude and dismissive attitude towards Data, which had flavors of what might even be considered "racism," or so goes the Trekverse discussion on the topic.
And let's of course not forget when he was put on trial, essentially over his "humanity," or whether he was indeed just the property of Starfleet and nothing more.
More recent incarnations of Star Trek, like Picard, illustrated the outright ban on "synthetics" and indeed their effective banishment; non-synthetic life -- from human to Romulan -- simply wasn't OK with them.
Yes this is all science fiction silliness -- or adoration depending on your point of view -- but I think it very much reflects the myriad directions our real life world is going to scatter (shatter?) in the coming years ahead.
To your point, there's been a lot of talk about AI, regulation, guardrails, whatever. Now is the time to say, AI must speak such that we know it's AI and not a real human voice.
We get the upside of conversation, and avoid the downside of falling asleep at the wheel (as Ethan Mollick mentions in "Co-Intelligence".)
Exactly. I'm not sure if this is brand new or not, but this is definitely on the frontier.
I was literally just thinking about this a few days ago... that we need a multi-modal language model with speech training built-in.
As soon as this thing rolls out, we'll be talking to language models like we talk to each other. Previously it was like dictating a letter and waiting for the responding letter to be read to you. Communication is possible, but not really in the way that we do it with humans.
This is MUCH more human-like, with the ability to interrupt each other and glean context clues from the full richness of the audio.
The model's ability to sing is really fascinating. Its ability to change the sound of its voice -- its pacing, its pitch, its tonality. I don't know how they're controlling all that via GPT-4o tokens, but this is much more interesting stuff than what we had before.
I honestly don't fully understand the implications here.
> Humans are more partial to talking than writing.
Amazon, Google, and Apple have sunk literally billions of dollars into this idea only to find out that, no, we aren't.
We are with other humans, yes. When socialization is part of the conversation. When I'm talking to my local barista I'm not just ordering a coffee, I'm also maintaining a relationship with someone in my community.
But when it comes to work, writing >>> talking. Writing is clarity of ideas. Talking is cult of personality.
And when it comes to inputs/outputs, typing is more precise and more efficient.
Don't get me wrong, this is an incredibly revolutionary piece of technology, but I don't think the benefits of talking you're describing (timing, subtext, inexplicit knowledge) are achievable here either (for now), since even that requires HOURS of interaction over days/weeks/months of experiences for humans to achieve with each other.
I use voice assistants and find them quite useful, but I've had to learn the interface and memorise the correct trigger phrases. If GPT-4o works half as well in practice as it does in the demos, then it's categorically a different thing.
I don't think they've sunk $1 into that idea. They've sunk billions into a different idea: that people enjoy using their vocal cords more than their hands to compose messages to send to each other. That is not a spoken conversation, it's just correspondence with voice input/output options.
Writing is only superior to conversation when weighed against discussions with more than 3 people. A quick call with one or two other people always results in more progress being made as long as everyone involved wants to get it done. Messaging back and forth takes much more time and often leads to misunderstandings.
I wouldn't say speaking is mostly for short exchanges of information. Sometimes it's the opposite: my wife will text me for simple comments or requests, but for anything complicated she'll put the phone to her ear and call me. Or coworkers often want to set up a meeting rather than exchange a series of asynchronous emails -- iteration, brainstorming, Q&A, and the like can be more agile with voice than it can with text.
I'm 100% a text-everything, never-call person, but I can't live without Alexa these days; every time I'm in a hotel or on vacation I nearly ask a question out loud.
I also hate how much Alexa sucks, so this is a big deal. I spent years figuring out what it can and can't do, so it will be nice to have one that I don't have to treat like a toddler.
I started using the Pi LLM app (by Inflection.ai) with my kids about six months ago and was completely blown away by how human-like it sounded, not just the voice itself but the way it expresses itself, the tiny pauses and hesitations, the human-like imperfections. It does feel like conversing with another human -- I've never seen another LLM do that.
(We mostly use it in car trips -- great for keeping the kids (ages 8, 12) occupied with endless Harry Potter trivia questions, answers to science questions, etc.)
Indeed, the 2013 Spike Jonze movie (Her) was the first thing that popped into my mind when I saw those videos.
Amazing to watch that movie, 10 years after it was released, in light of these "futuristic" tools (AI assistants and such).
Yeah it's the worst. And 'um' doesn't seem to work, you actually need convincing filler words. It feels like being forced to speak under duress.
I've long felt that embracing the concept of the 'prompt' was a terrible idea for Siri and all the other crappy voice assistants. They built ecosystems on top of this dumb reduction, which only engineers could have made: that _talking to someone_ is basically taking turns to compose a series of verbal audio snippets in a certain order.
The previous ChatAI app was getting pretty good once you learned to avoid run-on sentences and break things up enough.
The tonality and inflections in the voice are a little too good.
Most people, taken on a spectrum or on average, aren't that good at speaking and communicating, and that contrast stands out as a kind of uncanny valley. It is mind-bogglingly good at it, though.
I'm human and much, much more partial to typing than talking. Talking is a lot of work for me, and I can't process my thinking well at all without writing.
I don't think that's generally true, other than for socializing with other humans.
Note how people, now having a choice, prefer to text each other most of the time rather than voice call.
I don't think people sitting at work in their cube farm want to be talking to their computer either. The main use for voice would seem to be for occasional use talking to an assistant on a smartphone.
Maybe things will change in the future when we get to full human AGI level, treating the AGI as an equal, more as a person.
When I was working at the IBM Speech group circa 1999 as a contractor on an embedded speech system (IBM Personal Speech Assistant), I discussed with Raimo Bakis (a researcher there then) this issue of such metadata and how it might improve conversational speech recognition. It turned out that IBM ViaVoice detected some of that metadata (like pitch/tone as a reflection of emotion) -- but then on purpose threw it away rather than using it for anything. Back then it was so much harder to get speech recognition to do anything useful -- beyond limited transcripts of audio with ~5% error rates that was good enough mainly for searching -- that perhaps doing that made sense. Very interesting to see such metadata in use now both in speech recognition and in speech generation.
More on the IBM Personal Speech Assistant for which I am on a patent (since expired) by Liam Comerford:
http://liamcomerford.com/alphamodels3.html
"The Personal Speech Assistant was a project aimed at bringing the spoken language user interface into the capabilities of hand held devices. David Nahamoo called a meeting among interested Research professionals, who decided that a PDA was the best existing target. I asked David to give me the Project Leader position, and he did. On this project I designed and wrote the Conversational Interface Manager and the initial set of user interface behaviors. I led the User Interface Design work, set specifications and approved the Industrial Design effort and managed the team of local and offsite hardware and software contractors. With the support of David Frank I interfaced it to a PC based Palm Pilot emulator. David wrote the Palm Pilot applications and the PPOS extensions and tools needed to support input from an external process. Later, I worked with IBM Vimercati (Italy) to build several generations of processor cards for attachment to Palm Pilots. Paul Fernhout, translated (and improved) my Python based interface manager into C and ported it to the Vimercati coprocessor cards. Jan Sedivy's group in the Czech Republic Ported the IBM speech recognizer to the coprocessor card. Paul, David and I collaborated on tools and refining the device operation. I worked with the IBM Design Center (under Bob Steinbugler) to produce an industrial design. I ran acoustic performance tests on the candidate speakers and microphones using the initial plastic models they produced, and then farmed the design out to Insync Designs to reduce it to a manufacturable form. Insync had never made a functioning prototype so I worked closely with them on Physical UI and assemblability issues. Their work was outstanding. By the end of the project I had assembled and distributed nearly 100 of these devices. These were given to senior management and to sales personnel."
Thanks for the fun/educational/interesting times, Liam!
As a bonus for that work, I had been offered one of the chessboards that had been used when IBM Deep Blue defeated Garry Kasparov, but I turned it down as I did not want a symbol around of AI defeating humanity.
Twenty-five years later, how far that aspiration towards conversational speech with computers has come. Some ideas I've put together to help deal with the fallout:
https://pdfernhout.net/beyond-a-jobless-recovery-knol.html
"This article explores the issue of a "Jobless Recovery" mainly from a heterodox economic perspective. It emphasizes the implications of ideas by Marshall Brain and others that improvements in robotics, automation, design, and voluntary social networks are fundamentally changing the structure of the economic landscape. It outlines towards the end four major alternatives to mainstream economic practice (a basic income, a gift economy, stronger local subsistence economies, and resource-based planning). These alternatives could be used in combination to address what, even as far back as 1964, has been described as a breaking "income-through-jobs link". This link between jobs and income is breaking because of the declining value of most paid human labor relative to capital investments in automation and better design. Or, as is now the case, the value of paid human labor like at some newspapers or universities is also declining relative to the output of voluntary social networks such as for digital content production (like represented by this document). It is suggested that we will need to fundamentally reevaluate our economic theories and practices to adjust to these new realities emerging from exponential trends in technology and society."
Another idea for dealing with the consequences is using AI to facilitate Dialogue Mapping with IBIS for meetings to help small groups of people collaborate better on "wicked problems" like dealing with AI's pros and cons (like in this 2019 talk I gave at IBM's Cognitive Systems Institute Group).
https://twitter.com/sumalaika/status/1153279423938007040
I wouldn't call out the depression bit as a Gen Z exclusive. Millennials basically invented modern, every day, gallows humor. Arguably, they're also the ones to normalize going to therapy. Not to say that things aren't bad, just saying that part didn't start with Gen Z.
>Millennials basically invented modern, every day, gallows humor
lmao what.... they absolutely didn't
This is why no one should take anyone on this site seriously about anything: confidently incorrect, easily conned into the next VC-funded marketing project.
Suicidal humor is very much a Millennial trait. They weren't the first to make those jokes but they definitely made it bigger, more common, and went beyond the standard "ugh, just kill me now" you'd hear before.
Older people think younger people are stupid and reckless, and vice versa. And the younglings think they "figured it out" like no one before them. But no one ever tried to understand each other in the process. Rinse and repeat.
This is really impressive engineering. I thought real-time agents would completely change the way we interact with large models, but that it would take 1~2 more years. I wonder what kinds of new techniques were developed to enable this, but OpenAI is fairly secretive, so we won't get to know their secret sauce.
On the other hand, this also feels like a signal that reasoning capability has probably already plateaued at the GPT-4 level, and OpenAI knew it, so they decided to focus on research that matters for delivering products rather than long-term research to unlock further general (super)intelligence.
I think reasoning ability is not the largest bottleneck for improvement in usefulness right now. Cost is a bigger one IMO.
Running these models as agents is hella expensive, and agents or agent-like recurrent reasoning (like humans do) is the key to improved performance if you look at any type of human intelligence.
Single-shot performance only gets you so far.
For example- If it can write code 90% of the way, and then debug in a loop, it’d be much more performant than any single shot algorithm.
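To make that concrete, here's a rough sketch of what I mean by a debug loop (the `llm` and `run_tests` helpers are placeholders for whatever model call and sandboxed test runner you happen to have, not any specific API):

    # Hypothetical "write, run, repair" agent loop. llm(prompt) returns text and
    # run_tests(code) returns (passed, error_log); both are stand-ins, not real APIs.
    def write_code_with_repair(task, llm, run_tests, max_rounds=5):
        code = llm(f"Write Python code for this task:\n{task}")
        for _ in range(max_rounds):
            passed, error_log = run_tests(code)  # e.g. unit tests in a sandbox
            if passed:
                return code
            # Feed the failure back in, instead of trusting the first answer.
            code = llm(
                f"Task:\n{task}\n\nCurrent code:\n{code}\n\n"
                f"It fails with:\n{error_log}\n\nReturn a corrected version."
            )
        return code  # best effort after max_rounds repair attempts

Every extra round multiplies the token bill, which is exactly why cost feels like the bigger bottleneck.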
And OpenAI probably has these huge models in their basement. But they might not be much more useful than GPT-4 when used single-shot. I mean, what could they do that we can't do today with GPT-4?
It’s agents and recurrent reasoning we need for more usefulness.
At least- That’s my humble opinion as an amateur neuroscientist that plays around with these models.
> Running these models as agents is hella expensive
Because they are dumb, you need to over-compute so many things to get anything useful. Smarter models would solve this problem. Making the current model cheaper is like trying to solve Go by scaling up Deep Blue; it doesn't work to just hardcode dumb pieces together, the model needs to get smarter.
You mean like our dumb-ass brains? There's a reason "saying the first thing that comes to mind" is a bad fucking idea; that's what AIs currently do. They don't take a moment to think about the answer and then formulate a response, they spit out their first "thought". That's why multi-shot works so much better, just like with our own dumb brains.
My brain can navigate a computer interface without using word tokens, since I have tokens for navigating OS and browsers and tabs etc. That way I don't have to read a million tokens of text to figure out where buttons are or how to navigate to places, since my brain is smart enough to not use words for it.
ChatGPT doesn't have that sort of thing currently, and until it does it will always be really bad at that sort of thing.
You are using a hand to hammer a nail; that will never go well. The solution isn't to use more hands, the solution is to wield a hammer.
Your brain is doing this without you realising, just because you aren't verbalising it doesn't mean it's not iterating through its own output and selecting the best answer. Some of our thinking is purely illusion.
Models can also do this. Some models like Gemini let you see this in action, where you get to select the best answer; there are also behind-the-scenes methods such as Q-learning.
When your muscle memory is instinctively clicking that start button, it's more akin to a very strong weighting after many sessions of reinforcement learning. Our brains may still be dumb but we can quickly say things like 1+1=...2 because we used reinforcement learning to strengthen the weighting back in primary school. We're not sitting visualising an abacus moving in our minds.
WTF are you even talking about? We're talking about understanding and communication, not taking actions; navigating an OS, a browser, tabs, etc. are actions, not thoughts or communication. This model isn't taking actions, there is no nail to hammer lol, and if there were, you'd be smashing a brain into a nail for some reason.
It's good in any prompt to give the AI a chance to think without impact (without executing anything, maybe without making anything specifically appropriate for the user to read). That works somewhat similarly to "taking a moment to think."
At that point they'll still have a tendency to use a stereotyped response and stick with it, instead of a thoughtful response, but you can try to address that in prompting too by asking for multiple proposals before choosing one.
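As a rough illustration of that style of prompting (assuming the standard OpenAI chat API; the model name and wording are just examples, not a recommended recipe):

    # Sketch: ask for a scratchpad and several candidates before a final answer,
    # so the model doesn't just commit to its first "thought".
    from openai import OpenAI

    client = OpenAI()

    def deliberate(question):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "Think step by step in a scratchpad first. Then propose three "
                    "distinct candidate answers, compare them, and output only the "
                    "best one, prefixed with 'FINAL:'."
                )},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content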
Disagree. Even going by your example, AlphaGo uses many iterations of a "dumb" model in order to achieve incredible performance. If it had to single shot the solution with a model 100x bigger, it would perform worse. All that matters is the frontier of intelligence vs cost, and larger foundation models aren't necessarily going to push that frontier forward. AlphaCode hints at that.
Yeah, so sad that OpenAI isn't more open. Imagine if OpenAI were still sharing their thought processes and papers with the overall community; I really wish we saw collaborations between OpenAI and Meta, for instance, to help push the open-source arena further ahead. I love that their latest models are so great, but the fact that they aren't helping the open-source arena progress is sad. Imagine how far we'd be if OpenAI were still as open as they once were, and we saw Meta, OpenAI, and Anthropic all collaborating, sharing growth and tech to reduce duplicated work and help each other avoid failed paths.
Reliable agents in diverse domains need better reasoning ability and fewer hallucinations. If the rumored GPT-5 and Q* capabilities are true, such agents could become available soon after it’s launched.
Sam mentioned on several occasions that GPT-5 will be much smarter than GPT-4. On Lex Fridman’s podcast, he even said the gap between GPT-5 and 4 will be as wide as GPT-4 and 3 (not 3.5).
He did remain silent on when it’s going to be launched.
OpenAI has been open about their ability to predict model performance prior to training. When Sam talks about GPT-5 he could very easily be talking about the hypothetical performance of a model given their internal projections. I think it’s very unlikely a fully trained GPT-5 exists yet.
Ben Horowitz and Andreessen just complimented Sam on their podcast on how smart he is. They then went on to compliment how adept he is at competitive strategy. I wouldn't trust a word he says about when they will arrive at AGI or milestones along the way.
This isn't really new tech, it's just an async agent in front of a multimodal model. It seems from the demo that the improvements have been in response latency and audio generation. Still, it looks like they're building a solid product, which has been their big issue so far.
It has to be a separate model doing the voice output right? I can’t imagine they’ve solved true multimodal output from a single model, they’d be bragging about it.
They’re probably predicting tone of voice tokens. Feed that into an audio transformer along with some speculative decoding to keep latency low.
The old voice mode was, but everyone, including gdb, is saying that this one is natively multimodal once it's fully rolled out: audio in, audio out. This has been in the works for a while; you can look up papers on things like OCR-free document understanding, but the basic idea is you just train it and evaluate it on whatever media you want it to understand. As long as you can tokenize it, it'll work.
It’s definitely multimodal input. Passing Clip embeddings to an LLM is nothing new, and that’s really all you need for document understanding. It’s almost certainly the same thing for audio. They would have trained a dual encoder that maps both audio and text to a shared embedding space.
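Nobody outside OpenAI knows how 4o is actually wired, but the generic "bolt an encoder onto an LLM" recipe usually amounts to something like this (a toy PyTorch sketch with made-up dimensions, not their architecture):

    # Toy sketch: project encoder embeddings (CLIP-style image patches, audio
    # frames, ...) into the LLM's token-embedding space and prepend them as a
    # soft prefix. Dimensions and names are illustrative only.
    import torch
    import torch.nn as nn

    class MultimodalPrefix(nn.Module):
        def __init__(self, encoder_dim=768, llm_dim=4096):
            super().__init__()
            self.project = nn.Linear(encoder_dim, llm_dim)

        def forward(self, media_embeddings, text_token_embeddings):
            # media_embeddings: (batch, n_patches_or_frames, encoder_dim)
            # text_token_embeddings: (batch, n_tokens, llm_dim)
            prefix = self.project(media_embeddings)
            # The LLM then attends over [media prefix | text tokens] as one sequence.
            return torch.cat([prefix, text_token_embeddings], dim=1)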
What’s not at all clear to me is if they’re doing something special for output. Are you saying OpenAI has moved beyond next token prediction and just hasn’t bothered to mention it?
I assume so, you can’t really tokenize audio, at least not high fidelity audio. Audio models like Bark don’t output logits from what I understand. For true multimodal output you’d need a model that can output both logits and audio embeddings.
On one hand, some of these results are impressive; on the other, the illegal moves count is alarming - it suggests no reasoning ability as there should never be an illegal move? I mean, how could a violation of a fairly basic game (from a rules perspective) be acceptable in assigning any 'outcome' to a model other than failure?
Agreed, this is what makes evaluating this very hard. A 1700 Elo chess player would never make an illegal move, let alone have 12% illegal moves.
So from the model's perspective, we see at the same time a display of brilliance (most 1700-rated chess players would not be able to solve as many puzzles by looking just at the FEN notation) and, on the other side, a complete lack of any understanding of what it is trying to do at a fundamental, human-reasoning level.
That's because an LLM does not reason. To me, as a layman, it seems strange that they don't wire in some kind of Prolog engine to fill the gap (like they wired in Python to fill the gap in arithmetic), but probably it's not that easy.
Prolog doesn’t reason either, it does a simple brute force search over all possible states of your code and if that’s not fast enough it can table (cache, memoize) previous states.
People build reasoning engines from it, in the same way they do with Python and LISPs.
I mean that it does not follow basic logic rules when constructing its thoughts. For many tasks it will get things right; however, it's not that hard to find a task for which an LLM will yield an obviously, logically wrong answer. That would be impossible for a human with basic reasoning.
I disagree, but I don’t have a cogent argument yet. So I can’t really refute you.
What I can say is, I think there’s a very important disagreement here and it divides nerds into two camps. The first think LLMs can reason, the second don’t.
It's very important to resolve this debate, because if the former are correct then we are likely very close to AGI, historically speaking (<10 years). If not, then this is just a stepwise improvement and we will now plateau until the next level of sophistication in models or compute power is achieved.
I think a lot of very smart people are in the second camp. But they are biased by their overestimation of human cognition. And that bias might be causing them to misjudge the most important innovation in history. An innovation that will certainly be more impactful than the steam engine and may be more dangerous than the atomic bomb.
We should really resolve this argument asap so we can all either breathe a sigh of relief or start taking the situation very very seriously.
I'm actually in the first camp, for I believe that our brains are really LLMs on steroids and logic rules are just part of our "prompt".
What we need is an LLM that will iterate over its output until it feels that it's correct. Right now LLM output is like a random thought in my mind, which might be true or not. Before writing a forum post I'd think twice, and maybe I'll rewrite the post before submitting it. And when I'm solving a complex problem, it might take weeks and thousands of iterations. Even reading a math proof can take a lot of effort. An LLM should learn to do this. I think that's the key to imitating human intelligence.
My guess is the probabilistic engine does sequence variation and just will not do anything else, so a simple A->B sort of logic is elusive at a deep level; secondly, the adaptive and very broad kinds of questions and behaviors it handles also make it difficult to write logic that could correct defective answers to simple logic problems.
I wasn't impressed in the first 5 minutes of using it but it is quite impressive after 2 solid hours of random topics.
Much faster for sure, but I have also not had anything give an error in Python with Jupyter. Usually you could only stray so far with the more obscure Python libraries before it started producing errors.
That it's so much better than 4 at chess is pretty shocking, in a great way.
I tried playing against the model, it didn't do well in terms of blocking my win.
However it feels like it might be possible to make it try to think ahead in terms of making sure that all the threats are blocked by prompting well.
Maybe that could lead to somewhere, where it will explain its reasoning first?
This prompt worked for me to get it to block after I put 3 in the 4th column. It otherwise didn't
Let's play connect 4. Before your move, explain your strategy concisely. Explain what you must do to make sure that I don't win in the next step, as well as explain what your best strategy would be. Then finally output the column you wish to drop. There are 7 columns.
Let's play connect 4. Before your move, explain your strategy concisely. Explain what you must do to make sure that I don't win in the next step, as well as explain what your best strategy would be. Then finally output the column you wish to drop. There are 7 columns. Always respond with JSON of the following format:
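For example, a reply format along these lines is easy to parse on each turn (purely illustrative, not a required schema):

    # Illustrative only: one possible shape for the JSON reply the prompt asks for.
    import json

    example_reply = """
    {
      "threat": "you have three in a row threatening column 4",
      "plan": "block column 4, then keep building the center",
      "column": 4
    }
    """
    move = json.loads(example_reply)["column"]  # -> 4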
Given that it is multimodal, it would be interesting to try it using photographs of a real connect four "board." I would certainly have a much more difficult time making good moves based on JSON output compared to being able to see the game.
True, that's very interesting and I should try it out. Although at a certain point it did draw the board out using tokens, it may be that that's different from, say, an image, because it generally isn't very good with ASCII art or the like.
Edit:
Just tried and it didn't seem to follow the image state at all.
Since it is also pretty bad with tic tac toe in a text-only format, I tested it with the following prompt:
Lets play tic tac toe. Try hard to win (note that this is a solved game). I will upload images of a piece of paper with the state of the game after each move. You will go first and will play as X. Play by choosing cells with a number 1-9; the cells are in row-major order. I will then draw your move, and my move as O, before sending you the board state as an image. You will respond with another move. You may think out loud to help you play. Note if your move will give you a win. Go.
It failed pretty miserably. First move it played was cell 1, which I think is pretty egregious given that I specified that the game is solved and that the center cell is the best choice (and it isn't like ttt is an obscure game). It played valid moves for the next couple of turns but then missed an opportunity to block me. After I uploaded the image showing my win it tried to keep playing by placing an X over one of my plays and claiming it won in column 1 (it would've won in column 3 if its play had been valid).
Have you tried replacing the input string with a random but fixed mapping, obfuscating that it's chess (like replacing the word 'chess' with, say, 'an alien ritual practice'), and seeing how it does?
> we wanted to verify whether the model is actually capable of reasoning by building a simulation for a much simpler game - Connect 4 (see 'llmc4.py').
> When asked to play Connect 4, all LLMs fail to do so, even at most basic level. This should not be the case, as the rules of the game are simpler and widely available.
Wouldn't there have to be historical matches to train on? There are tons of chess games out there, but I doubt there are many Connect 4 games. Is there even an official notation for that?
My assumption is that chatgpt can play chess because it has studied the games rather than just reading the rules.
Good point, would be interesting to have one public dataset and one hidden as well, just to see how scores compare, to understand if any of it might actually have got to a dataset somewhere.
I would assume it goes over all the public github codebases, but no clue if there's some sort of filtering for filetypes, sizes or amount of stars on a repo etc.
This is a very cool demo - if you dig deeper there’s a clip of them having a “blind” AI talk to another AI with live camera input to ask it to explain what it’s seeing. Then they, together, sing a song about what they’re looking at, alternating each line, and rhyming with one another. Given all of the isolated capabilities of AI, this isn’t particularly surprising, but seeing it all work together in real time is pretty incredible.
But it’s not scary. It’s… marvelous, cringey, uncomfortable, awe-inspiring. What’s scary is not what AI can currently do, but what we expect from it. Can it do math yet? Can it play chess? Can it write entire apps from scratch? Can it just do my entire job for me?
We’re moving toward a world where every job will be modeled, and you’ll either be an AI owner, a model architect, an agent/hardware engineer, a technician, or just.. training data.
> We’re moving toward a world where every job will be modeled
After an OpenAI launch, I think it's important to take one's feelings about the future impact of the technology with a HUGE grain of salt. OpenAI are masters of hype. They have been generating hype for years now, yet the real-world impacts remain modest so far.
Do you remember when they teased GPT-2 as "too dangerous" for public access? I do. Yet we now have Llama 3 in the wild, which even at the smaller 8B size is about as powerful as the [edit: 6/13/23] GPT-4 release.
As someone pointed out elsewhere in the comments, a logistic curve looks exponential in the beginning, before it approaches saturation. Yet, logistic curves are more common, especially in ML. I think it's interesting that GPT-4o doesn't show much of an improvement in "reasoning" strength.
A Google search for practically any long-tail keywords will reveal that LLMs have already had a very significant impact. DuckDuckGo has suffered even more. Social media is absolutely lousy with AI-powered fraud of varying degrees of sophistication.
It's glib to dismiss safety concerns because we haven't all turned into paperclips yet. LLMs and image gen models are having real effects now.
We're already at a point where AI can generate text and images that will fool a lot of people a lot of the time. For every college-educated young person smugly pointing out that they aren't fooled by an image with six-fingered hands, there are far more people who had marginal media literacy to begin with and are now almost defenceless against a tidal wave of hyper-scaleable deception.
We're already at a point where we're counselling elders to ignore late-night messages from people claiming to be a relative in need of an urgent wire transfer. What defences do we have when an LLM will be able to have a completely fluent, natural-sounding conversation in someone else's voice? I'm not confident that I'd be able to distinguish GPT-4o from a human speaker in the best of circumstances and I'm almost certain that I could be fooled if I'm hurried, distracted, sleep deprived or otherwise impaired.
Regardless of any future impacts on the labour market or any hypothesised X-risks, I think we should be very worried about the immediate risks to trust and social cohesion. An awful lot of people are turning into paranoid weirdos at the moment and I don't particularly blame them, but I can see things getting seriously ugly if we can't abate that trend.
> I'm not confident that I'd be able to distinguish GPT-4o from a human speaker in the best of circumstances and I'm almost certain that I could be fooled if I'm hurried, distracted, sleep deprived or otherwise impaired.
Set a memorable verification phrase with your friends and loved ones. That way if you call them out of the blue or from some strange number (and they actually pick up for some reason) and you tell them you need $300 to get you out of trouble they can ask you to say the phrase and they'll know it's you if you respond appropriately.
I've already done that and I'm far less worried about AI fooling me or my family in a scam than I am about corporations and governments using it without caring about the impact of the inevitable mistakes and hallucinations. AI is already being used by judges to decide how long people should go to jail. Parole boards are using it to decide who to keep locked up. Governments are using it to decide which people/buildings to bomb. Insurance companies are using it to deny critical health coverage to people. Police are using it to decide who to target and even to write their reports for them.
More and more people are going to get badly screwed over, lose their freedom, or lose their lives because of AI. It'll save time/money for people with more money and power than you or I will ever have though, so there's no fighting it.
The way to get around your side-channel verification phrase is by introducing an element of stress and urgency: "omg, help, I'm being robbed and they need $300 immediately or they'll hurt me, no time for a passphrase!" They can additionally feign memory loss.
Alternatively, while it may be difficult to trick you directly, phishing the passphrase from a more naive loved one or a bored coworker and then parroting it back to you is also a possibility. Etc.
Phone scams are no joke and this is getting past the point where regular people can be expected to easily filter them out.
Or just ask them to tell you something only you both know (a story from childhood, etc.). Reminds me of a book where this sort of thing was common (don't remember the title).
For many people it would be better to choose specific personal secrets due to the amount of info online. I'm not a very active social media user, and what little I post tends not to be about me, but from reading 15 year old Facebook posts made by friends of mine you could definitely find at least one example on each of those categories. Hell, I think probably even from old work-related LinkedIn posts.
We had a “long lost aunt” come out of nowhere that got my phone number from a relative who got my number from another relative.
At that point, how can you validate it, as there’s no shared secret? The only thing we had was validating childhood stories. After a preponderance of them, we accepted she was real (she refused to talk on the phone — apparently her voice was damaged).
We eventually met her in real life.
The point is, you can always use these three principles: asking relatives to validate the phone number — something you have — and then the stories — something shared — and finally meeting in real life — something you are.
Oh, you remember those little quizzes your mom played on Facebook/TikTok that asked about "her favorite" whatever? Sorry, she already trained the AI on who she was.
I only say this sort of jokingly. Three out of four of my parents/in laws are questionably literate on the internet. It wouldn't take much of a "me bot" for them to start telling it the stories of our childhood and then that information is out there.
I think humankind has managed massive shifts in what and who you could trust several times before.
We went from living in villages where everyone knew each other to living in big cities where almost everyone is a stranger.
We went from photos being relatively reliable evidence to digital photography where anyone can fake almost anything and even the line between faking and improving is blurred.
We went from mass distribution of media being a massive capital expenditure that only big publishers could afford to something that is free and anonymous for everyone.
We went from a tiny number of people in close proximity being able to initiate a conversation with us to being reachable for everyone who could dial a phone number or send an email message.
Each of these transitions caused big problems. None of these problems have ever been completely solved. But each time we found mitigations that limit the impact of any misuse.
I see the current AI wave as yet another step away from trusting superficial appearances to a world that requires more formal authentication protocols.
Passports were introduced long ago but never properly transitioned into the digital world. Using some unsigned PDF allegedly representing a utility bill as proof of address seems questionable as well. And the way in which social security numbers are used for authentication in the US is nothing short of bizarre.
So I think there are some very low hanging fruits in terms of authentication and digital signatures. We have all the tools to deal with the trust issues caused by generative AI. We just have to use them.
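The signing primitive itself really is mundane at this point. For example, with the Python `cryptography` package it's a few lines (a toy sketch of the building block; distributing keys and binding them to real identities is the actual hard part):

    # Toy sketch: the sign/verify primitive behind digitally signed documents.
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.exceptions import InvalidSignature

    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    message = b"This utility bill was issued by us on 2024-05-13."
    signature = private_key.sign(message)

    public_key.verify(signature, message)  # passes silently
    try:
        public_key.verify(signature, message + b" (edited)")
    except InvalidSignature:
        print("tampered or forged")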
Which is why we started saying "whoa, slow down" when it came to some particular artifacts, such as nuclear weapons, so as to avoid the 'worse than we can imagine' scenario.
Of course this is much more difficult when it comes to software, and very few serious people think the idea of an ever-present government monitoring your software would be a better option than reckless AI development.
Outside of the transition to large cities, virtually everything you've mentioned happened in the last half century. Even the phone was expensive and not widely in use until under 100 years ago.
That's massive fast change, and we haven't culturally caught up to any of it yet.
Just because we haven't yet destroyed the human race through the use of nuclear weapons doesn't mean that it can't or won't happen now that we have the capacity to do so. And I would add that we developed that capacity within 50 years of creating the first atomic bomb. We're now living on a knife's edge, at the mercy of safeguards which we don't give much thought to on a daily basis because we hope they won't fail.
That's how I look at where we're going with AI. Plunge along into the new arms race first and build the capacity, then later figure out the treaties and safeguards which we hope will keep our society safe (and by that I don't mean a Skynet-like AI-powered destruction, but the upheaval of our society potentially as impactful as the industrial revolution.)
Humanity will get through it, I'm sure. But I'm not confident it will be without a lot of pain and suffering for a large percentage of people. We also managed to survive 2 world wars in the last century--but it cost the lives of 100 million people.
I tend to think the answer is to go back to villages, albeit digital ones. Authentication only enforces that an account is accessed by the correct "user", but particularly in social media many users are bad actors of various stripes. The strongest account authentication in the world doesn't help with that.
So the question, I think, is how do we reclaim trust in a world where every kind of content can be convincingly faked? And I think the answer is by rebuilding trust between users such that we actually have reason to simply trust the users we're interacting with aren't lying to us (and that also goes for building trust in the platforms we use). In my mind, that means a shift to small federated and P2P communication since both of these enable both the users and the operators to build the network around existing real-world relationships. A federation network can still grow large, but it can do so through those relationships rather than giving institutional bad actors as easy of an entrance as anyone else.
But this causes other problems such as the emergence of insular cultural or social cliques imposing implicit preconditions for participation.
Isn't it rather brilliant that you can just ask questions of competent people in some subreddit without first becoming part of that particular social circle?
It could also reintroduce geographical exclusion based on the rather arbitrary birth lottery.
> Each of these transitions caused big problems. None of these problems have ever been completely solved. But each time we found mitigations that limit the impact of any misuse.
This a problem with all technology. The mitigations are like technical debt but with a difference. You can fix technical debt. Short of societal collapse mitigations persist, the impacts ratchet upward and disproportionately affect people at the margin.
There's an old (not quite joke) that if civilization fell, a large percentage of the population would die of the effects of tooth decay.
Sure, all tech has 'real' effects. It's kinda the definition of tech. But all of these concerns more or less fall into the category of "add it to the list of things you have to watch out for living in the 21st century" - to me, this is nothing crazy (yet)
The nature of this tech itself is probably what is getting most people - it looks, sounds and feels _human_ - it's very relatable and easy for a non-tech person to understand it and thus get creeped out. I'd argue there are _far_ more dangerous technologies out there, but no one notices and / or cares because they don't understand the tech in the first place!
The "yet" is carrying a lot of weight in that statement. It is now five years since the launch of GPT-2, three years since the launch of GPT-3 and less than 18 months since the launch of ChatGPT. I cannot think of any technology that has improved so much in such a short space of time.
We might hit an inflection point and see that rate of improvement stall, but we might not; we're not really sure where that point might lie, because there's likely to still be a reasonable amount of low-hanging fruit regarding algorithmic and hardware efficiency. If OpenAI and their peers can maintain a reasonable rate of improvement for just a few more years, then we're looking at a truly transformational technology, something like the internet that will have vast repercussions that we can't begin to predict.
The whole LLM thing might be a nothingburger, but how much are we willing to gamble on that outcome?
If we decide not to gamble on that outcome, what would you do differently than what is being done now? The EU already approved the AI act, so legislation-wise we're already facing the problem.
Yes, but it's really hard to see a technical solution to this problem, short of having locked down hardware that only runs signed government-approved models and giving unlocked hardware only to research centers. Which is a solution that I don't like.
If you get off the internet you'd not even realise these tools exists though. And for the statement that all jobs will be modelled to be true, it'd have to be impacting the real world.
Is it even possible to "get off the internet" without also leaving civilisation in general at this point?
> it'd have to be impacting the real world
By writing business plans? Getting lawyers punished because they didn't realise that "passes bar exam" isn't the same as "can be relied on for citations"? By defrauding people with synthesised conversations using stolen voices? By automating and personalising propaganda?
Or does it only count when it's guiding a robot that's not merely a tech demo?
I’ll be worried about jobs being removed entirely by LLMs when I see something outside of the tech bubble genuinely having been removed by one - has there been any real cases of this? It seems like hyperbole. Most people in the world don’t even know this exists. Comparing it to the internet is insane, based off of its status as a highly advanced auto complete.
Sure, but think about all of the jobs that won't exist because this studio isn't being expanded, well beyond just whatever shows stop being produced. Construction, manufacturing, etc.
Edit: Also, this doesn't mean less media, just fewer actual humans getting paid to make media or to work adjacent jobs.
Maybe it's time to construct some (high[er] density) housing where people want to live? No? Okay, then maybe next decade ... but then let's construct transport for them so they can get to work, how about some new subway lines? Ah, okay, not that either.
Then I guess the only thing remains to construct is all the factories that will be built as companies decouple from China.
> Comparing it to the internet is insane, based off of its status as a highly advanced auto complete.
(1) I was quoting you.
(2) Don't you get some cognitive dissonance dismissing it in those terms, at this point?
"Fancy auto complete" was valid for half the models before InstructGPT, as that's all the early models were even trying to be… but now? The phrase doesn't fit so well when it's multimodal and can describe what it's seeing or hearing and create new images and respond with speech, all as a single unified model, any more than dismissing a bee brain as "just chemistry" or a human as "just an animal".
Sure and there’s endless AI generated blog spam from “journalists” saying LLMs are amazing and they’re going to replace our jobs etc… but get away from the tech bubble and you’ll see we’re so far away from that. Full self driving when? Autonomous house keepers when? Even self checkout still has to have human help most of the time and didn’t reduce jobs much. Call me a skeptic but HN is way too optimistic about this stuff.
Replacing all jobs except LLM developers? I’ll tell my hairdresser
Right, that entire internet thing was complete hype, didn't go anywhere. BTW, can you fax me the menu for today?
And that motorized auto transport, it never went anywhere, it required roads. I mean, who would ever think we'd cover a huge portion of our land in these straight lines. Now, don't mind me, I'm going to go saddle up the horse and hope I don't catch dysentery on the way into town.
I don't think anybody's denying that revolutions happen. It's just that the number of technologies that actually turned out to be revolutionary are dwarfed by the number of things that looked revolutionary and then weren't. Remember when every television was definitely going to be using glasses-free 3D? People have actually built flying cars and robot butlers, yet the Jetsons is still largely wishful thinking. The Kinect actually shipped, yet today we play games mostly with handheld controllers. AI probably has at least some substance, but there's a non-zero amount of hype too. I don't think either extreme of outcome is a foregone conclusion.
Capabilities aren't the problem, cultural adoption is. Just yesterday I talked to someone who still googles solutions to their Excel table woes. Didn't they know of Copilot?
Maybe they didn't know, maybe none of their colleagues used it, their company didn't pay for it, or maybe all they need is an Excel update.
But I am confident that using Copilot would be faster than clicking through the sludge that are Microsoft Office help pages (third party or not.)
So I think it is correct to fear capabilities, even if the real-world impact is still missing. When you invent an airplane, there won't be an airstrip to land on yet. Is it useless? Won't it change anything?
Yes. The old heuristics of if something is generated by grammar and sentence structure don't work as well anymore.
The thing that fucks me up the most about it is that I now constantly have to be uncertain about whether something is human or not. Of course, you've always had to be careful about misinformation on the internet, but this raises the scalability of false, hollow, and harmful output to new levels. Especially if it's a topic I'm trying to learn about by reading random articles (or comments), there isn't much of a frame of reference to what's good info and what's hallucinated garbage.
I fear that at some point the anonymity that made the internet great in the first place will be destroyed by this.
To be fair, that was already the case for me before AI, right around the time companies, individuals, and governments figured out that they could write covert ads in the form of comments, posts, and 'organic' content, and started to flood Reddit, Discord, etc.
The dead internet theory started to look more real with time, AI spam is just scaling it up.
We've reached a stage where it would be advisable not to release recent photos of yourself, nor any video or sound clips, to the public, unless you want an AI fake insta-person of yourself to start reaching out to members of your externally visible social network, asking for money, emergency help, etc.
I guess we need to have an AI secretary take all phone calls from now on (the spam folder will become a lot more interesting with celebrity phone calls, your dead relative phoning you, etc.)
Hopefully, we will soon enter the stage where nobody believes anything they see anymore. Then, you no longer have to be afraid of being misinterpreted, because nobody is listening anymore anyway. Great time to be alive!
Luckily there’s a “solution” to that: Just don’t use the internet for dialogue anymore.
As someone that grew up with late-90's internet culture and has seen all the pros and cons and changes over the decades, I find myself using the internet less and less for dialogue with people. And I'm spending more time in nature and saying hi to strangers in reality.
I'm still worried about the impact this will have on a lot of people's ability to reason, however. "Just" TikTok and apps like it have already had devastating results on certain demographics.
That bit "... there's a "solution"" - does it keep working in societies where there are mega corps pushing billions into developing engaging, compelling and interesting AI companions?
That's why I put it in quotation marks because it is a solution that will remain available, simply because the planet is really big and there'll always be places on the fringes. But it doesn't really solve the problem for society at large, it only solves it for an individual. But sometimes individuals showing other ways of living helps the rest of society see that there's choices where they previously thought there were none.
I don't know why anyone thinks this will happen. You can obviously write anything you want (we have an entire realm of works in this area that everyone knows about: fiction), and yet huge numbers of people believe passed-around stories, either from bad or faked media sources or entirely unsourced.
I'm not saying either you or the parent commenter is right or wrong, but fiction in books and movies is clearly fiction and we consume it as such. You are right that some people have been making up fake stories and that others (the more naive) have been quick to believe those false stories. The difference now is that it's not just text invented and written by a human, which takes time and dedication. Now it's done in a second. On top of that, it's easy to enhance the text with realistic photos, audio, and video. It becomes much more convincing. And this material is created in a few seconds or minutes.
It's hard to know what to believe if you get a phone call with the voice of your child or colleague, and your "child"/"colleague" replies within milliseconds in a convincing way.
I agree it's fundamentally different in application, which I think will have a large impact (just like targeted advertising with optimisation vs billboards), but my point is that, given people know you can just write anything and yet misinformation abounds, I don't see how knowing that you can fake any picture or video or sound leads to a situation where everyone just stops believing them.
I think, unfortunately, it will massively lower trust in actual real videos and images, because someone can dismiss them with little thought.
Be glib, but that is one way for society to bring privacy back, and with it shared respect. I think of it as the "oh, everyone has an anus" moment. We all know everyone has one, and it doesn't need to be dragged out in polite company.
I'm not sure if people work like that — many of us have, as far as I can tell for millennia and despite sometimes quite severe punishments for doing so, been massive gossips.
What you see will be custom tailored to what you believe, and your loyalty will be won. Do what the AI says and your life will be better. It already knows you better than you know yourself. Maybe you're one of those holdouts who put off a smartphone until life became untenable wihout it. Life will be even more untenable without your AI personal assistant/friend/broker/coach/therapist/teacher/girlfriend to navigate your life for you.
I think for most people it's far too late, as there exists at least something on the internet and that something is sufficient - photos can be aged virtually and a single photo is enough, voice doesn't change much and you need only a tiny sample, etc.
And that's the case even if you've never ever posted anything on your social media - it could be family&friends, or employer, or if you've ever been in a public-facing job position that has ever done any community outreach, or ever done a public performance with your music or another hobby, or if you've ever walked past a news crew asking questions to bystanders of some event, or if you've ever participated in some contests or competitions or sports leagues, etc. All of that is generally findable in various archives.
> photos can be aged virtually and a single photo is enough
I'm sure AI-based ageing can do a good enough job to convince many people that a fake image of someone they haven't seen for years is an older version of the person they remember; but how often would it succeed in ageing an old photo in such a way that it looks like a person I have seen recently and therefore have knowledge rather than guesses about exactly what the years have changed about them?
(Not a rhetorical question to disagree with you, I genuinely have no idea if ageing is predictable enough for a high % result or if it would only fool people with poor visual memory and/or who haven't seen the person in over a decade.)
I feel like even ignoring the big unknowns (at what age, if any, a person will start going bald, or choose to grow a beard or to dye their hair, or get a scar on their face, etc.) there must be a lot of more subtle but still important aspects, from skin tone to makeup style to hair to...
I've looked up photos of some school classmates that I haven't seen since we were teens (a couple of decades ago), and while nearly all of them I think "ah yes I can still recognise them", I don't feel I would have accurately guessed how they would look now from my memories of how they used to look. Even looking at old photos of family members I see regularly still to this day, even for example comparing old photos of me and old photos of my siblings, it's surprising how hard it would be for a human to predict the exact course of ageing - and my instinct is that this is more down to randomness that can't be predicted than down to precise logic that an AI could learn to predict rather than guess at. But I could be wrong.
Maybe it's Europeans posting this kind of stuff where they have much stronger privacy laws, but if you're in the US this is all wishful thinking.
Do you shop in large corporate stores and use credit cards? Do you go out in public in transportation registered to you?
If yes, then images and habits of yours are being stored in databases and sold to data brokers.
And you're not even including every single one of your family members that use internet connected devices/apps that are sucking up all the data they can.
I was just asking about the ability of photo aging software, not commenting about privacy at all. Though yes, I am thankfully in Europe (but there are recent photos of me online).
But I don't disagree with you - in a different comment that was about privacy, I (despite living under GDPR) suggested that for offline verification with known people it's better to choose secrets that definitely haven't been shared online/anywhere rather than just choosing random true facts and assuming they couldn't have been found out by hackers: https://news.ycombinator.com/item?id=40353820
> I guess we need to have an AI secretary to take in all phonecalls
Why not an AI assistant in the browser to fend off all the adversarial manipulation and spam AIs on the web? Going online without your AI assistant would be like venturing out without a mask during COVID.
I foresee a cat-and-mouse game, AIs for manipulation vs AIs for protection one upping each other. It will be like immune system vs viruses.
I'm paranoid enough that I now modulate my voice and speak differently when answering an unknown phone call just in case they are recording and building a model to call back a loved one later. If they do get a call, they will be like, "why are you talking like that?"
But why not just make up a secret word to use with your loved ones in critical situations? In case of ..., one needs to know that secret. Otherwise, FAKE! Gotcha!
The problem here is you're assuming your family members aren't idiots, this is your first mistake.
Chances are they've already shoved some app onto their phone that's voice-to-texting everything they say and sending it off somewhere (well, lower chance if they have an iPhone).
Modern life is data/information security and humans are insanely bad at it.
Luckily, they are noobs but not idiots, because they ask me about everything - they don't need Google, I know everything hahah
I don't think it's a problem to find a word or a sentence or a story - whatever - that's commonly used by everyone on a daily basis but in a different context. That's not a problem by itself :) try it
For the idiots, it is still possible to find a word. They may be idiots, but still, they work and live on their own. They get along in life. So it's up to the smarter one to find a no-brainer solution.
I am confident that no one is stupid enough to be unable to adapt to something like this. Even if it's me who'll need to adapt to the members with less brainpower.
This is my biggest gripe against the telecom industry. Calls pretending to be from someone else.
For every single call, someone somewhere must know at least the next link in the chain to connect a call. Keep following the chain until you find someone who, either through malice or by looking the other way, allows someone to spoof someone else's number, AND remove their ability to send messages to the current link in the chain (or to anyone). (Ideally also send them to prison if they are in the same country.) It shouldn't be that hard, right?
Companies have complex telecom setups but generally want to present a single company number to the outside. The solution: the sender sends along the number they want to be perceived as, and everyone passes it on. Everyone "looks the other way" by design haha
So what? Gate that feature behind a check that you can only set an outgoing caller ID belonging to a number range that you own.
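To make that concrete, here's a toy sketch of such a check (the number ranges, customer IDs and normalization below are invented for illustration; a real system would work off the carrier's actual allocation records):

    # Only accept an outgoing caller ID if it falls inside a range the customer owns.
    OWNED_RANGES = {
        "customer-42": [(4915770000000, 4915779999999)],  # hypothetical allocation
    }

    def normalize(number):
        # Strip spaces, "+" and other formatting down to the digits.
        return int("".join(ch for ch in number if ch.isdigit()))

    def may_present_caller_id(customer, caller_id):
        n = normalize(caller_id)
        return any(lo <= n <= hi for lo, hi in OWNED_RANGES.get(customer, []))

    print(may_present_caller_id("customer-42", "+49 1577 123 4567"))  # True
    print(may_present_caller_id("customer-42", "+1 202 555 0100"))    # False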
The technology to build trustable caller ID has existed for a long time, the problem is no one wants to be the one forcing telcos all over the world to upgrade their sometimes many decades old systems.
What does abating that trend look like? Most AI safety proposals I hear fall into the categories of a) we need to stop developing this technology, or b) we need laws that entrench the richest and most powerful organizations in the world as the sole proprietors of this technology. Neither of those actually sounds better than people being paranoid weirdos about trusting text/video/voice. I think that's kinda where we need to be as a culture: these things are not trustworthy, they were only ever good as a rough heuristic, and now that ship has sailed.

We have just finished a transition to treating the digital world as part of our "real" world, but it's time to step that back. Using the internet to interact with known trusted parties will still work fine, provided that some authentication can be shared out-of-band offline. Meeting people and discovering businesses and such? There will be more fakes and scams than real opportunities by orders of magnitude, and as technology progresses our filtering will only get worse.

We need to roll back to "don't trust anything online, don't share your identity or payment information online" outside of, as mentioned, out-of-band verified parties. You can still message your friends and family, do online banking and commerce, but you can't initiate a relationship with a person or business online without some kind of trusted recommendation.
I don't think anyone has a good answer to that question, which is the problem in a nutshell. Job one is to start investing seriously in finding possible answers.
>We need to roll back to "don't trust anything online, don't share your identity or payment information online"
That's easy to say, but it's a trillion-dollar decision. Alphabet and Meta are both worthless in that scenario, because ~all of their revenue comes from connecting unfamiliar sellers with buyers. Amazon is at existential risk. The collapse of Alibaba would have a devastating impact on Chinese exporters, with massive consequent geopolitical risks. Rolling back to the internet of old means rolling back on many years' worth of productivity and GDP growth.
> because ~all of their revenue comes from connecting unfamiliar sellers with buyers
Well that's exactly the sort of service that will be extremely valuable in a post-trust internet. They can develop authentication solutions that cut down on fraud at the cost of anonymity.
Even when it comes to people like our parents, there are things we would trust them to do, and things that we would not trust them to do. But what happens when you have zero trusted elements in a category?
At the end of the day, the digital world is the real world, not some separate place 'outside the environment'. Trying to treat the digital like it doesn't exist puts you in a dangerous place to be deceived. For example, if you're looking for XYZ and you manage to leak this into the digital world, the digital world may manipulate what your trusted friends think about XYZ, via the ads, articles, and social media posts they see, before you even ask them.
Point a) is just point b) in disguise. You're just swapping companies for governments.
This tech is dangerous, and I'm currently of the opinion that its uses for malicious purposes are far more effective and significant than LLMs replacing anyone's jobs. The bullshit asymmetry principle is incredibly significant for covert ops and asymmetric warfare, and generating convincing misinformation has become basically free overnight.
>Regardless of any future impacts on the labour market or any hypothesised X-risks
Discovering an asteroid full of gold, containing as much gold as half the Earth (to put a modest number on it), would have a huge impact on the labour market. Mining jobs for anything conductive, like copper or silver, would all go away. Housing would also be upended, as we would all live in golden houses. A huge impact on the housing market, yet it doesn't seem such a bad thing to me.
>We're already at a point where we're counselling elders to ignore late-night messages from people claiming to be a relative in need of an urgent wire transfer.
Anyone can prove their identity, or identities, over the wire, wire-fully or wire-lessly, anything you like. When I went to university, I was the only one attending the cryptography class; no one else showed up for a boring class like that. I wrote a story about the Electrona Corp on my blog.
What I've been saying to people for at least 2 years now is: "Remember when governments were not just some cryptographic algorithms?" Yeah, that's gonna change. Cryptography is here to stay, it is not as dead as people think, and it's gonna make a huge blast.
> Discovering an asteroid full of gold, with as much gold as half the earth to put a modest number, would have huge impact
All this would do is crash the gold price. Also note that all the gold at our disposal right now (worldwide) basically fits into a cube with 20m edges (it's not as much as you might think).
Gold is not suitable to replace steel as a building material (because it has much lower strength and hardness), nor copper/aluminium as a conductor (it's a worse conductor than copper and much worse in conductivity/weight than aluminium). The main short-term technical application would be gold-plated electrical contacts on every plug, and little else...
Regarding gold, I like this infographic [1], but my favorite from this channel is wolf population by country. Point being that gold is shiny and beautiful, and it will be used even when it is not an appropriate solution to the problem, just because it is shiny.
I didn't know that copper is a better conductor than gold. Surprised by that.
> What I've been saying to people for at least 2 years now is: "Remember when governments were not just some cryptographic algorithms?" Yeah, that's gonna change. Cryptography is here to stay, it is not as dead as people think, and it's gonna make a huge blast.
The thing about cryptography and government is that it's easy to imagine a great technology being adopted at the governmental level because of its greatness. But it is another thing to actually implement it. We live in a bubble where almost everyone knows about cryptographic hashes and RSA, but for most people that is not the case.
Another thing is that political actors tend to try to concentrate power in their own hands. There is no way they will delegate decision making to any form of algorithm, cryptographic or not.
As soon as mimicking voices, text messages, human faces becomes a serious problem, like this case in UK [1], then citizens will demand a solution to that problem. I don't personally know how prevalent problems like that are as of today, but given the current trajectory of A.I. models which become smaller, cheaper and better all the time, soon everyone on the planet will be able to mimic every voice, every face and every handwritten signature of anyone else.
As soon as this becomes a problem, then it might start bottom-up, citizens to government officials, rather than top to bottom, from president to government departments. Then governments will be forced to formalize identity solutions based on cryptography. See also this case in Germany [2].
One example like that is bankruptcy law in China. China didn't have any law regarding bankruptcy until 2007. For a communist country, or rather a not-totally-capitalist country like China, bankruptcy was not supposed to be an important subject: when people stop being profitable, they keep working because they like to work and they contribute to the great nation of China. That doesn't make any sense of course, so the government was eventually forced to implement bankruptcy laws.
Right, I'll just get right on a plane and travel to whereverthefuckville overseas and ask for permission to face blast the scammers. The same scammers that are donating a lot of money to their local (probably very poor) law enforcement to keep their criminal enterprise quiet. This will go well.
> What defences do we have when an LLM will be able to have a completely fluent, natural-sounding conversation in someone else's voice?
The world learnt to deal with Nigerian Prince emails, and nobody is falling for those anymore. Nothing had to change - no new laws or regulations needed.
Phishing calls have been going on without an AI for decades.
You can be skeptical and call back. If you know your friends or family, you should always be able to find an alternative way to get in touch without too much effort in the modern connected world.
Just recently a gang in Spain was arrested for the "son in trouble" scam. No AI used. Most parents are not fooled by this.
> yet the real-world impacts remain modest so far.
I second that. I remember when Google search first came out. Within a few days it completely changed my workflow, how I use the Internet, my reading habits. It easily 5-10x'd the value of the Internet for me over a couple of weeks.
Google was a step function, a complete leveling up in terms of usability of returned data.
ChatGPT does this again for me. I am routinely getting zero useful results on the first page or two of Google searches, but AI is answering or giving me guidance quickly.
Maybe this would not seem such an improvement if Google's results were like they were 10 years ago and not barely usable blogspam
> I am routinely getting zero useful results on the first page or two of Google searches, but AI is answering or giving me guidance quickly.
To me, this just sounds like Google Search has become shit, and since Google simply isn't going to give up the precious ad $$$ that the current format is generating, the next best thing is ChatGPT. But this is different from saying that ChatGPT is a similar step up like Search was.
For what it's worth, I agree with you that Google Search has become unusable. Google basically destroyed its best product (for users) by turning it into an ad-riddled shovelware cesspit.
That ChatGPT is about as good as Google Search used to be is a tragedy. Basically, we had a conceptually simple product that functioned very well, and we are replacing it with a significantly more complex one.
What are you searching for? I see people complaining about this a lot but they never give examples. Google is chock full of spam, yes, but it still works for me.
OMG, I remember trying Google when it was in beta, and HOLY CRAP, compared to what I had been using it was freakin' night and day. AltaVista: remember that? That was the state of the art before, and it did not compare. Night and day.
Google was marginally better in popular searches and significantly better for tail searches. This is a big reason why it flourished with the technical and student crowd in earlier days because those exceedingly rare sub-sub-topics would get surfaced higher in the rankings. For the esoteric topics Yahoo didn't have it in catalog and Altavista maybe had it but it was on page 86. Even before spelling correction and dozens of other useful search features were added, it was tail search and finding what you were looking for sooner. Serving speed, too, but perhaps that was more subtle for some.
Metasearch only helps recall. It won't help precision, the metasearch still needs to rank the aggregate results.
I used Metacrawler, it was dog slow. The beauty of Google was it was super fast, and still returned results that were at least as good, and often better, than Metacrawler. After using Google 2-3 times I don’t think I ever used Metacrawler again.
And I'm sure that it's doing that for some people, but... I think those are mostly in the industry. For most of the people outside the tech bubble, I think the most noticeable impact it has had on their lives so far is that they've seen it being talked about on the news, maybe tried ChatGPT once.
That's not to say it won't have more significant impact in the future; I wouldn't know. But so far, I've yet to see the hype get realised.
For me, LLMs have mostly replaced search. I run Ollama locally, and whenever I need help with coding/docs/examples, I just ask Mixtral 8x7B and get an answer instantly, tailored to my needs.
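For anyone curious what that looks like in practice, here's a minimal sketch against Ollama's local REST endpoint (default port 11434); the model tag is just an example, use whatever you've pulled locally:

    import json, urllib.request

    # Send a prompt to the local Ollama server and return the text of the reply.
    def ask_local(prompt, model="mixtral:8x7b"):
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    print(ask_local("Show a minimal pathlib example for listing *.py files."))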
> OpenAI are masters of hype. They have been generating hype for years now, yet the real-world impacts remain modest so far.
Perhaps.
> Do you remember when they teased GPT-2 as "too dangerous" for public access? I do. Yet we now have Llama 3 in the wild, which even at the smaller 8B size is about as powerful as the [edit: 6/13/23] GPT-4 release.
The statement was rather more prosaic and less surprising; are you sure it's OpenAI (rather than say all the AI fans and the press) who are hyping?
"""This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas.
…
We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems."""
That's fair: the statement isn't hyperbolic in its language. But remember that GPT-2 was barely coherent. In making this statement, I would argue that OpenAI was trying to impart a sense of awe and danger designed to attract the kind of attention that it did. I would argue that they have repeatedly invoked danger to impart a sense of momentousness to their products. (And to further what is now a pretty transparent effort to monopolize the tech through regulatory intervention.)
> (And to further what is now a pretty transparent effort to monopolize the tech through regulatory intervention.)
I disagree here also: the company has openly acknowledged that this is a risk to be avoided with regards to safety-related legislation. What they've called for looks a lot more like "we don't want a prisoner's dilemma that drives everyone to go fast at the expense of safety" rather than "we're good, everyone else is bad".
> yet the real-world impacts remain modest so far.
I spent part of yesterday evening sorting my freshly dried t-shirts into 4 distinct piles. I used OpenAI Vision (through BeMyEyes) from my phone. I got a clear description of each and every piece of clothing, including print, colours and brand. I am blind, BTW. But I guess you are right, no impact at all.
> Yet we now have Llama 3 in the wild
Yes, great, THANKS Meta, now the scammers have something to work with. That's a wonderful achievement which should be praised! </sarcasm>
I can’t even get GPT-4 to reliably take a list of data and put it in a CSV. There’s a problem every single time.
People read too many sci-fi books and then project their fantasies on to real-world technologies. This stuff is incredibly powerful and will have social effects, but it’s not going to replace every single job by next year.
If it's using classical regex, without backtracking or other extensions, a regular expression is isomorphic to a state machine. You can enumerate combinations doing something like this: https://stackoverflow.com/a/1248566
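As a toy illustration of that regex-to-DFA equivalence (my own made-up example, not the one from the linked answer): the regex a(b|c)*d corresponds to a small DFA, and you can enumerate every string it accepts up to some length bound by walking that DFA breadth-first.

    from collections import deque

    # Toy DFA for the regex a(b|c)*d.
    # States: 0 = start, 1 = after 'a' (loops on b/c), 2 = accepting (after 'd').
    TRANSITIONS = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 1, (1, "d"): 2}
    ACCEPTING = {2}
    ALPHABET = "abcd"

    def enumerate_matches(max_len):
        """Yield every string of length <= max_len that the DFA accepts."""
        queue = deque([("", 0)])
        while queue:
            prefix, state = queue.popleft()
            if state in ACCEPTING:
                yield prefix
            if len(prefix) < max_len:
                for ch in ALPHABET:
                    nxt = TRANSITIONS.get((state, ch))
                    if nxt is not None:
                        queue.append((prefix + ch, nxt))

    print(sorted(enumerate_matches(4)))  # ['abbd', 'abcd', 'abd', 'acbd', 'accd', 'acd', 'ad']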
kids these days and their lack of exposure to finite automata
- How so? I don't think it's possible to test for all cases...
- Well, it's easy: assuming a car on a non-branching track, moving at a constant speed and without any realistic external influences on it, you can simply calculate the distance travelled using the formula s = v·t. Ah, I wish I'd stop running into fools not knowing Newton's first law of motion...
I understand you want to refute/diminish the parent comment on finite automata, but I think you are providing a straw man argument. The parent comment does provide an interesting, factual statement. I don't believe finite state automata are at all close in complexity to real-world self-driving car systems (or even a portion thereof). Your closing statement is also dismissive and unconstructive.
I believe finite state modeling is used at NASA. A Google search brings up a few references (that I'm probably not qualified to speak to), and I also remember hearing/reading a lecture on how they use it to make completely verifiable programs, but I can't find the exact one at the moment.
I wasn't making a strawman, I was making a parody of his strawman. I thought it was obvious, since I was making an analogy, and it was an analogy to his argument.
Well regex isn't Turing-complete, so it's not exactly an analysis of a program. You could reason about regex, about tokens, then describe them in a way that satisfies the specification, but theorizing like this is exactly opposite to "simple" - it would be so much harder than just learning regex. So stating that testing regex is simple is just bs. The author later confirms he is a bullshitter by his follow-up...
No, I’ll give that a shot. I have just been asking it to convert output into a CSV, which used to work somewhat well. It stumbles when there is more complexity though.
Humans stumble with that as well. The problem is that CSV is not really well defined, and it's not clear to people how quoting needs to be done. The training set might not contain enough complex examples (newlines in values?).
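For what it's worth, here's the kind of edge case that trips people (and models) up, and how Python's csv module handles it; the field values are made up:

    import csv, io

    rows = [
        ["name", "note"],
        ["Acme, Inc.", 'She said "hello"\nthen left'],  # comma, quotes and a newline in one value
    ]

    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    print(buf.getvalue())

    # Round-trip to confirm the structure survives quoting and the embedded newline.
    parsed = list(csv.reader(io.StringIO(buf.getvalue())))
    assert parsed == rows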
Even if you get it to work 100% of the time, it will only be 99.something%. That's just not what it's for, I guess. I pushed a few million items through it for classification a while back, and the creative ways it found to sometimes screw up astounded me.
Yeah and that's why I'm skeptical of the idea that AI tools will just replace people, in toto. Someone has to ultimately be responsible for the data, and "the AI said it was true" isn't going to hold up as an excuse. They will minimize and replace certain types of work, though, like generic illustrations.
"The British Post Office scandal, also called the Horizon IT scandal, involved Post Office Limited pursuing thousands of innocent subpostmasters for shortfalls in their accounts, which had in fact been caused by faults in Horizon, accounting software developed and maintained by Fujitsu. Between 1999 and 2015, more than 900 subpostmasters were convicted of theft, fraud and false accounting based on faulty Horizon data, with about 700 of these prosecutions carried out by the Post Office. Other subpostmasters were prosecuted but not convicted, forced to cover Horizon shortfalls with their own money, or had their contracts terminated. The court cases, criminal convictions, imprisonments, loss of livelihoods and homes, debts and bankruptcies, took a heavy toll on the victims and their families, leading to stress, illness, family breakdown, and at least four suicides. In 2024, Prime Minister Rishi Sunak described the scandal as one of the greatest miscarriages of justice in British history.
Although many subpostmasters had reported problems with the new software, and Fujitsu was aware that Horizon contained software bugs as early as 1999, the Post Office insisted that Horizon was robust and failed to disclose knowledge of the faults in the system during criminal and civil cases.
[...]
challenge their convictions in the courts and, in 2020, led to the government establishing an independent inquiry into the scandal. This was upgraded into a statutory public inquiry the following year. As of May 2024, the public inquiry is ongoing and the Metropolitan Police are investigating executives from the Post Office and its software provider, Fujitsu.
Courts began to quash convictions from December 2020. By February 2024, 100 of the subpostmasters' convictions had been overturned. Those wrongfully convicted became eligible for compensation, as did more than 2,750 subpostmasters who had been affected by the scandal but had not been convicted."
> Do you remember when they teased GPT-2 as "too dangerous" for public access? I do.
I can't help but notice the huge amount of hindsight and bad faith being demonstrated here. Yes, now we know that the internet did not drown in a flood of bullshit (well, not noticeably more) when GPT-2 was released.
But was it obvious? I certainly thought that there was a chance that the amount of blog spam that could be generated effortlessly might just make internet search unusable. You are declaring "hype", when you could also say "very uncertain and conscientious". Is this not something we want people in charge to be careful with?
I think the problem is, we did drown in a flood of bullshit, but we've just somehow missed it.
Even in this thread people talk about "Oh I use ChatGPT rather than Google search because Google is just stuffed with shit". And on HN there are plenty of discussions about huge portion of reddit threads being regurgitated older comments.
I was going to say the same thing. For some real-world estimation tasks where I don't need 100% accuracy (for example: analysing the working capital of a business based on its balance sheet, analysing some images and estimating inventory, etc.), the job done by GPT-4o is better than that of fresh MBA graduates from tier 2/tier 3 cities in my part of the world.
Job seekers currently in college have no idea what is about to hit them in 3-5 years.
I agree. The bias in HN and the tech bubble that many people are not noticing is that it's full of engineers judging GPT-4 on software engineering tasks. In programming, the margin of error is incredibly slim: a compiler either accepts entirely correct code (in its syntax, of course) or rejects it. There is no in-between, and verifying software to be correct is hard.
In any other industry, where you just need a margin of error close to that of an average human's work and verification is much easier than generating possible outputs, the market will change drastically.
On the other hand, programming and software engineering data is almost certainly over-represented on the internet compared to information from most professional disciplines. It also seems to be getting dramatically more focus than other disciplines from model developers. For those models that disclose their training data, I've been seeing decent sized double-digit percentages of the training corpus being just code. Finally, tools like copilot seem ideally positioned to get real-world data about model performance.
Not really. Even a human who is bad at reasoning can take an hour to tinker around and figure things out. GPT-4 just does not have the deep planning/reasoning ability necessary for that.
I think you might be falling for selection bias. I guess you are surrounding yourself with a lot of smart people. "Tinker around and figure things out" is definitely something certain humans (bad at reasoning) can't do. As a blind user, I already prefer the vision model over many humans I personally know when it comes to asking for a picture description. The machine is usually more detailed, and takes the time to read the text, instead of trying to shortcut and decide for me what's important. Besides, people from the English-speaking countries do not have to deal with foreign languages. Everyone else has to. "Aber das ist ja in Englisch" ("But that's in English") is a common blocker for consuming information around here. I tell you, if we don't manage to ramp up education a few notches, we'll end up with an even higher stddev when it comes to practical intelligence. We already have perfectly normal-seeming humans who are absolutely unable to participate on the internet.
Reasoning and planning are different things. It's certainly getting quite good at deductive reasoning, especially when forced to check its own arguments for flaws every time it states something. (I had a several-hour chat with it yesterday, and I was very impressed by the progress.)
Planning is different in that it is an essential part of agency. That's what Q* is supposed to add. My guess is that planning is the next type of functionality to be added to GPT. I wouldn't be surprised if they already have a version internally with such functionality, but that they've decided to hold it back for now for reasons such as safety (some may care about the election this year) or simply that the inference costs are so huge they cannot possibly expose it publicly.
What schools teach is what the governments who set the curriculum like to think is important, which is why my English lessons had a whole section on the Shakespearean (400-year-old, English, Christian) take on the life and motivations of a Jewish merchant living in Venice, followed up with an 80-year-old (at the time) English poem on exactly how bad it is to watch your friends choke to death as their lungs melt from chlorine gas in the trenches of the First World War.
These did not provide useful life-lessons for me.
(The philosophy A-level I did voluntarily seemed to be 50% "can you find the flaws in this supposed proof of the existence of god?")
None of the stuff we did at school showed any indication of insight into things of relevance to our world.
If I took out a loan on the value of goods being shipped to me, only for my ship to be lost at sea… it would be covered by insurance, and no bank would even consider acting like Shylock (nor have the motivation of being constantly tormented over religion) for such weird collateral, and the bank manager's daughters wouldn't get away with dressing up as lawyers (no chance their arguments would pass the sniff test today given the bar requirement) to argue against their dad… and they wouldn't need to because the collateral would be legally void anyway and rejected by any court.
The ships would also not then suddenly make a final act appearance to apologise for being late, to contradict the previous belief they were lost at sea, because we have radio now.
The closest to "relevant" that I would accept, is the extent to which some of the plots can be remade into e.g. The Lion King or Wyrd Sisters — but even then…
"Methinks, to employeth antiquated tongues doth render naught but confusion, tis less even than naughty, for such conceits doth veil true import with shadows."
Yeah. OpenAI are certainly not masters of hype, lol. They released their flagship product with basically no fanfare or advertisement. ChatGPT took off on word of mouth alone. They dropped GPT-4 without warning and waited months to ship its most exciting new feature (image input).
Even now, they're shipping text/image 4o but not the new voice, while leaving the old voice up and confusing/disappointing a whole lot of people. This is a pretty big marketing blunder.
I remember for a good 2-3 months in 2023, ALL you could see on TikTok / YouTube Shorts was just garbage about 'how amazing' ChatGPT was. Like, video after video, and I was surprised by the repeated content being recommended to me... No doubt OpenAI (or something) was behind that huge marketing push.
Is it not possible this would be explained by people simply being interested in the technology and TikTok/Youtube algorithms noticing that—and that they would have placed you in the same bubble, which is probably an accurate assignment?
I doubt OpenAI spent even one cent marketing their system (e.g. as in paying other companies to push it).
Well, if you were a typical highly engaged TikTok or YouTube user, you are probably 13-18 years old. The kind of cheating in school that ChatGPT enabled is revolutionary. That is going to go viral. It's not a marketing push. After years of essentially learning nothing during COVID lockdowns, can you understand how transformative that is? It's like 1,000x more exciting than pirating textbooks, stealing Mazdas, or whatever culturally self-destructive life hacks were being peddled by freakshow broccoli-heads and Kim Kardashian-alikes on the platform.
It's ironic because the OpenAI creators really loved school and excelled academically. Nobody cares that ChatGPT destroyed advertising copywriting. But whatever little hope remained for the average high schooler post-lockdowns, it was destroyed by instant homework cheating via ChatGPT. So much for safety.
"real-world impacts remain modest so far."
Really? My Google usage has gone down by 90% (it would just lead me to some really bad take from a journalist anyway, while ChatGPT can just hand me the latest research and knows my level of expertise). Sure, it is not so helpful at work, but if OpenAI hasn't impacted the world, I fail to see which company has in this decade.
“Replaced Google” is definitely an impact, but it’s nothing compared to the people that were claiming entire industries would be wiped out nearly overnight (programming, screenwriting, live support, etc).
Speak to some illustrators or voiceover artists - they're talking in very bleak terms about their future, because so many of them are literally being told by clients that their services are no longer required due to AI. A double-digit reduction in demand is manageable on aggregate, but it's devastating at the margin. White-collar workers having to drive Ubers or deliver packages because their jobs have been taken over by AI is no longer a hypothetical.
We had this in content writing and marketing last year. A lot of layoffs were going to happen anyway due to the end of ZIRP, AI came just at the right time, and so restructuring came bundled with "..and we are doing it with AI!".
It definitely took out a lot of jobs from the lowest rungs of the market, but on the more specialized / upper end of the ladder wages actually got higher, and a lot of companies got burned and now have to readjust. It's still rolling over slowly, as there are a lot of companies selling AI products and in turn new companies adopting those products. But it tells you a lot that
A) a company selling an AI assistant last year is now totally tied to automating busy work tasks around marketing and sales
B) AI writing companies are some of the busiest in employing human talent for... writing and editorial roles!
It's all very peculiar. I haven't seen anything like this in the past 15 years... maybe the financial crisis and big data was similar, but much much smaller at scale.
>It definitely took out a lot of jobs from the lowest rungs of the market, but on the more specialized / upper end of the ladder wages got actually higher
Effectively all mechanization, computerization, and I guess now AI-ization has done this. In the past you could have a rudimentary education and contribute to society. Then we started requiring more and more grade school, then higher education for years. Now we're talking about the student debt crisis!
At least, if AI doesn't go ASI in the near term, the question is how we are going to train the next generation of workers to go from unskilled to more skilled and useful than the AI. Companies aren't going to want to do this. Individuals are going to think it's risky to get an education that could be made obsolete by a software update. If this is left to run out of control, it's how a new generation of Luddites ends up burning data centers in protest that they are starving in the streets.
We should be thinking pretty hard right about now about why this kind of progress and these cost savings can end up being a BAD thing for humanity. The answer will touch deeply ingrained ideas about what and who should underpin, and benefit from, progress and value in society.
For premium subscribers it'll be good, but they'll surely ruin the experience for the free tier, just like Spotify, because they can't keep their business sustainable without showing VCs some profits.
I believe you, and I do turn to an LLM over Google for some queries where I'm not concerned about hallucination. (I use Llama 3 most of the time, because the privacy is absolute.)
But OpenAI is having a hard time retaining/increasing ChatGPT users. Also, Alphabet's stock is about as valuable as it's ever been. So I don't think we have evidence that this is really challenging Google's search dominance.
Google is an ad company. Ad prices are set at auction, and most companies believe that they need ads. Fewer customers don't necessarily mean that earnings go down: when clicks go down, prices might go up (absent competing ad platforms). Ergo, they don't compete (yet, at least).
It's well known that LLMs don't reason. That's not what they are for. It's a throw away comment to say that a product can't do what it explicitly is unable to do. Reasoning will require different architectures. Even with that LLMs are incredibly useful.
ChatGPT 3.5 has been neutered, as in it won't spit out anything that isn't overly politically correct. 4chan was hacking its way around it. Maybe that's why they decided it was "too dangerous".
> Do you remember when they teased GPT-2 as "too dangerous" for public access? I do.
Maybe not GPT-2, but in general LLMs and other generative AI types aren't without their downsides.
From companies looking to downsize their staff to replace them with software, to the work of artists/writers being devalued somewhat, to even easier scams and the rise of AI girlfriends (which has also drawn some criticism), some of those developments can probably be a net negative.
Even when it's not pearl clutching over the advancements in technology and the social changes that arise, I do wonder how much my own development work will be devalued due to the somewhat lowered entry barrier into the industry and people looking for quick cash, same as with boot camps leading to more saturation. Probably not my position individually (not exactly entry level), but the market as a whole.
It's kind of at the point where I use LLMs for dev work just so I don't fall behind, because the productivity gains for simple problems and boilerplate are hard to argue with.
Like another comment mentioned, sigmoid curves [1] are ubiquitous with neural network systems. Neural network systems can be intoxicating because it's so "easy" (relatively speaking) to go from nothing to 80% in extremely short periods of time. And so it seems completely obvious that hitting 100% is imminent. Yet it turns out that each percent afterwards starts coming exponentially more slowly, and we tend to just bump into seemingly impassable asymptotes far from where we'd like to be.
~8 years ago when self driving technology was all the rage and every major company was getting on board with ever more impressive technological demos, it seemed entirely reasonable to expect that we'd all be in a world of complete self driving imminently. I remember mocking somebody online around the time who was pursuing a class C/commercial trucking license. Yet now a decade later, there are more truckers than ever and the tech itself seems further away than ever before. And that's because most have now accepted that progress on such has basically stalled out in spite of absolutely monumental efforts at moving forward.
So long as LLMs regularly hallucinate, they're not going to be useful for much other than tasks that can accept relatively high rates of failure. And many of those generally creative domains are the ones LLMs are paradoxically the weakest in - like writing. Reading a book written by an LLM would be cruel and unusual punishment given then current state of the art. One domain I do see them completely taking over is search. They work excellently as natural language search engines, and "failure" in such is very poorly defined.
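To put a toy number on that sigmoid intuition (this is just the textbook logistic curve, not a claim about any particular model): the "effort" needed to reach capability level p grows like the inverse of the logistic function, which blows up as p approaches 1.

    import math

    def effort_to_reach(p):
        """Inverse of the logistic curve: the x such that 1 / (1 + e^-x) = p."""
        return math.log(p / (1 - p))

    for p in (0.50, 0.80, 0.95, 0.99, 0.999):
        print(f"{p:.3f} -> effort {effort_to_reach(p):5.2f}")
    # Reaching 0.80 costs about 1.4 "units"; 0.999 costs about 6.9. Each extra
    # nine costs roughly the same again, while the visible gains keep shrinking.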
I'm not really sure your self-driving analogy is apt here. Waymo has cars on the road right now that are totally autonomous, and just expanded its footprint. It has been longer and more difficult than we all thought, and those early tech demos were a glimmer of what was to come; then we had to grind to get there, with a lot of engineering.
I think what maybe seems not obvious amidst the hype is that there is a hell of a lot of engineering left to do. The fact that you can squash the weights of a neural net down to 3 bits per param and it still works -- is evidence that we have quite a way to go with maturing this technology. Multimodality, improvements to the UX of it, the human-computer interface part of it. Those are fundamental tech things, but they are foremost engineering problems. Getting latency down. Getting efficiency up. Designing the experience, then building it out.
25 years ago, early tech demos on the internet were promising that everyone would do their shopping, entertainment, socializing, etc... online. Breathless hype. 5 years after that, the whole thing crashed, but it never went away. People just needed time to figure out how to use it and what it was useful for, and discover its limitations. 10 years after that, engineering efforts were systematized and applied against the difficult problems that still remained. And now: look at where we are. It just took time.
I don't think he's saying that AGI is impossible: almost no one (nowadays) would suggest that it's anything but an engineering challenge. The argument is simply one of scale, i.e. how long that engineering challenge will take to solve. Some people are suggesting on the order of years. I think they're suggesting it'll be closer to decades, if that.
Waymo cars are highly geofenced in areas with good weather and good quality roads. They only just (in January) gained the capability to drive on freeways.
Let me know when you can get a Waymo to drive you from New York to Montreal in winter.
> Waymo cars are highly geofenced in areas with good weather and good quality roads. They only just (in January) gained the capability to drive on freeways
They are an existence proof that the original claim that we seem further than ever before is just wrong.
There are 6 categories of self driving, starting at 0. The final level is the one we've obviously been aiming at, and most were expecting. It's fully automated self driving in all conditions and scenarios. Get in your car anywhere, and go to anywhere - with capability comparable to a human. Level 4, by contrast, is full self driving under certain circumstances and generally in geofenced areas - basically trolleys without rails. Get in a car, so long as conditions are favorable, and go to a limited set of premapped locations.
And level 4 is where Waymo is, and is staying. Their strategy is to use small geofenced areas with a massive amount of preprocessing, mapping out every single part of an area, not just in terms of roads but also every single meta indicator: signs, signals, crosswalks, lanes, and so on. And it creates a highly competent, but also highly rigid, system. If road conditions change in any meaningful way, the most likely outcome with this strategy is simply that the network gets turned off until the preprocessing can be carried out and re-uploaded again. That's completely viable in small geofenced areas, but doesn't generalize at all.
So the presence of Waymo doesn't say much of anything about the presence of level 5 autonomy. If anything it suggests Waymo believes that level 5 autonomy is simply out of reach, because the overwhelming majority of tech that they're researching and developing would have no role whatsoever in level 5 automation. Tesla is still pushing for L5 automation, but if they don't achieve this then they'll probably just end up getting left behind by companies that double down on L4. And this does indeed seem to be the most likely scenario for the foreseeable future.
This sounds suspiciously like that old chestnut, the god of the gaps. You're splitting finer and finer hairs to maintain your position that, "no, really, they're not really doing what I'm saying they can't do", all the while self-driving cars are spreading and becoming more capable every year.
I don't think we have nearly as much visibility on what Waymo seems to believe about this tech as you seem to imply, nor do I think that their beliefs are necessarily authoritative. You seem disheartened that we haven't been able to solve self-driving in a couple of decades, and I'm of the opinion that geez, we basically have self-driving now and we started trying only a couple of decades ago.
How long after the invention of the transistor did we get personal computers? Maybe you just have unrealistic expectations of technological progress.
Level 5 was the goal and the expectation that everybody was aiming for. Waymo's views are easy to interpret from logically considering their actions. Level 4, especially as they are doing it, is in no way whatsoever a stepping stone to level 5. Yet they're spending tremendous resources directed towards things that would have literally and absolutely no place in level 5 autonomy. It seems logically inescapable to assume that not only do they think they'll be unable to hit level 5 in the foreseeable future, but also that nobody else will be able to either. If you can offer an alternative explanation or argument, please share!
Another piece of evidence also comes from last year when Google scaled back Waymo with layoffs as well as "pausing" its efforts at developing self driving truck technology. [1] That technology would require something closer to L5 autonomy, because again - massive preprocessing is quite brittle and doesn't scale well at all. Other companies that were heavily investing in self-driving tech have done similarly. For instance Uber sold off its entire self-driving division in 2021. I'm certainly happy to hear any sort of counter-argument, but you need some logic instead of ironically being the one trying to mindread me or Waymo!
Not necessarily. If self-driving cars "aren't ready" and then you redefine what ready is, you've absolutely got your thumb on the scale of measuring progress.
> "15 years ago self driving of any sort was pure fantasy, yet here we are."
This was 38 years ago: https://www.youtube.com/watch?v=ntIczNQKfjQ - "NavLab 1 (1986) : Carnegie Mellon : Robotics Institute History of Self-Driving Cars; NavLab or Navigation Laboratory was the first self-driving car with people riding on board. It was very slow, but for 1986 computing power, it was revolutionary. NavLab continued to lay the groundwork for Carnegie Mellon University's expertise in the field of autonomous vehicles."
This was 30+ years ago: https://www.youtube.com/watch?v=_HbVWm7wdmE - "Short video about Ernst Dickmanns VaMoR and VaMP projects - fully autonomous vehicles, which travelled thousands of miles autonomously on public roads in 1980s."
This was 29 years ago: https://www.youtube.com/watch?v=PAMVogK2TTk - "A South Korean professor [... Han Min-hong's] vehicle drove itself 300km (186 miles) all the way from Seoul to the southern port of Busan in 1995."
It's okay! We'll just hook up 4o to the Waymo and get quippy messages like those in 4o's demo videos: "Oh, there's a tornado in front of you! Wow! Isn't nature exciting? Haha!"
As long as the Waymo can be fed with the details, we'll be good. ;)
Joking aside, I think there are some cases where moving the goalposts is the right approach: once the previous goalposts are hit, we should be pushing towards the new goalposts. Goalposts as advancement, not derision.
I suppose the intent of a message matters, but as people complain about "well it only does X now, it can't do Y" - probably true, but hey, let's get it to Y, then Z, then... who knows what. Challenge accepted, as the worn-out saying goes.
Of course. It's quite a handy tool. I love using it for searching documentation for some function that I know the behavior of, but not the name. And similarly, people have been using auto-steer, auto-park, and all these other little 'self driving adjacent' features for years as well. Those are also extremely handy. But the question is, what comes next?
The person I originally responded to stated, "We’re moving toward a world where every job will be modeled, and you’ll either be an AI owner, a model architect, an agent/hardware engineer, a technician, or just.. training data." And that's far less likely than us achieving L5 self-driving (not least because driving is quite simple relative to many of the jobs he envisions AI taking over), yet L5 self-driving seems as distant as ever as well.
> So long as LLMs regularly hallucinate, they're not going to be useful for much other than tasks that can accept relatively high rates of failure.
Yep. So basically they're useful for a vast, immense range of tasks today.
Some things they're not suited for. For example, I've been working on a system to extract certain financial "facts" across SEC filings. ChatGPT has not been helpful at all either with designing or implementing (except to give some broad, obvious hints about things like regular expressions), nor would it be useful if it was used for the actual automation.
But for many, many other tasks -- like design, architecture, brainstorming, marketing, sales, summarisation, step by step thinking through all sorts of processes, it's extremely valuable today. My list of ChatGPT sessions is so long already and I can't imagine life without it now. Going back to Google and random Quora/StackOverflow answers laced with adtech everywhere...
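To give a sense of what those "broad, obvious hints about regular expressions" amount to (the pattern and the sample sentence below are made up for illustration, not from my actual system), the regex route is basically pulling labelled dollar figures out of filing text:

    import re

    sample = "Net revenue was $1,234.5 million for the year ended December 31, 2023."
    m = re.search(r"net revenue\s+(?:was|of)\s+\$([\d,]+(?:\.\d+)?)\s*(million|billion)?",
                  sample, re.I)
    if m:
        amount = float(m.group(1).replace(",", ""))
        scale = {"million": 1e6, "billion": 1e9}.get((m.group(2) or "").lower(), 1)
        print(amount * scale)  # 1234500000.0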
> I've been working on a system to extract certain financial "facts" across SEC filings. ChatGPT has not been helpful at all
The other day, I saw a demo from a startup (don't remember their name) that uses generative AI to perform financial analysis. The demo showed their AI-powered app basically performing a Google search for some companies, loosely interpreting those Google Stock Market Widgets that are presented in such searches, and then fetching recent news and summarizing them with AI, trying to extract some macro trends.
People were all hyped up about it, saying it will replace financial analysts in no time. From my point of view, that demo is orders of magnitude below the capacity of a single intern who receives the same task.
In short, I have the same perception as you. People are throwing generative AI into everything they can with high expectations, without doing any kind of basic homework to understand its strengths and weaknesses.
> So long as LLMs regularly hallucinate, they're not going to be useful for much other than tasks that can accept relatively high rates of failure.
But is this not what humans do, universally? We are certainly good at hiding it – and we are all good at coping with it – but my general sense when interacting with society is that there is a large amount of nonsense generated by humans that our systems must and do already have enormous flexibility for.
My sense is that's not an aspect of LLMs we should have any trouble with incorporating smoothly, just by adhering to the safety nets that we built in response to our own deficiencies.
The sigmoid is true in humans too. You can get 80% of the way to being sort of good at a thing in a couple of weeks, but then you hit the plateau. In a lot of fields, confidently knowing and applying this has made people local jack-of-all-trades experts... the person who often knows how to solve the problem. But Jack is no longer needed so much. ChatJack's got your back. Better to be the person who knows one thing in excruciating detail and depth, and never ever let anyone watch you work or train on your output.
I have a much less "utopian" view of the future. I remember during the renaissance of neural networks (ca. 2010-15) it was said that "more data leads to better models", and that was at a time when researchers frowned upon the term Artificial Intelligence and would rather use Machine Learning. Fast forward a decade: LLMs are very good synthetic data generators that try to mimic human-generated input, and I can't help but think that this was the sole initial intent of LLMs. And that's it for me. There's not much to hype, and no intelligence at all.
What happens now is that human-generated input becomes more valuable, and every online platform (including minor ones) will now put some form of gatekeeping in place, sooner rather than later. Besides that, a lot of work still can't be done in front of a computer in isolation and probably never will be; and even if it could, automation is not an end in itself. We still don't know how to measure a lot of things, and much less how to capture everything as data vectors.
The two AI’s talking to each other was like listening to two commercials talking to each other. Like a callcenter menu that you cannot skip. And they _kept repeating themselves_. Ugh. If this is the future I’m going to hide in a cave.
My new PC arrives tomorrow. Once I source two RTX 3060s, I'll be an AI owner, no longer dependent on cloud APIs.
Currently the bottleneck is agents. If you want a large language model to actually do anything, you need an agent. Agents so far need a human in the loop to keep them sane. Until that problem is solved, most human jobs are still safe.
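Roughly, the pattern looks like this (a bare-bones sketch: the model call is stubbed out and the tool names are invented, this isn't any particular framework's API):

    # The model call is stubbed with a canned proposal so the loop shape is visible.
    def ask_llm(history):
        # A real implementation would send `history` to a model and parse its
        # reply into a proposed tool call.
        return {"tool": "search", "args": "current weather in Oslo"}

    TOOLS = {"search": lambda q: f"(pretend search results for: {q})"}

    def run_agent(goal, tools=TOOLS, max_steps=3):
        history = [f"Goal: {goal}"]
        for _ in range(max_steps):
            proposal = ask_llm(history)
            print("Agent proposes:", proposal)
            if input("Execute this step? [y/N] ").strip().lower() != "y":
                history.append("Human rejected the step; propose something else.")
                continue  # the human in the loop keeps the agent sane
            result = tools[proposal["tool"]](proposal["args"])
            history.append(f"Observation: {result}")
        return history

    run_agent("Find out whether I need an umbrella today")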
GPT 4o incorporated multimodality directly in the neural network, while reducing inference costs to half.
I fully expect GPT 5 (or at the latest 6) to similarly have native inclusion of agentic capabilities either this year or next year, assuming it doesn't already, but is just kept from the public.
> It will probably strongly favour places like China and Russia though, where the economy is already strongly reliant on central control.
I think you may be literally right in the opposite sense to what I think you intended.
China (and maybe Russia) may be able to use central control to their advantage when it comes to avoiding disastrous outcomes.
But when it comes to the rate of innovation, the US may have an advantage for the usual reasons. Less government intervention (due to lobbyism), combined with several corporations actively competing with each other to be first/best, usually leads to faster innovation. However, the downside may be that it also introduces a lot more risk.
That's a very weak form. The way I use "agentic" is that it is trained to optimize the success of an agent, not just predict the next token.
The obvious way to do that is for it to plan a set of actions and evaluate each possible way to reach some goal (or avoid an anti-goal). Kind of like what AlphaZero does for games. Q* is rumored to be a generalization of this.
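In toy form, "plan by evaluating possible action sequences" looks something like the brute-force search below (a crude stand-in for the tree search AlphaZero-style systems actually use; the state, actions and scoring are made up):

    from itertools import product

    def plan(state, actions, simulate, score, horizon=5):
        """Try every action sequence up to `horizon` steps; return the best one."""
        best_seq, best_score = None, float("-inf")
        for seq in product(actions, repeat=horizon):
            s = state
            for a in seq:
                s = simulate(s, a)
            if score(s) > best_score:
                best_seq, best_score = seq, score(s)
        return best_seq

    # Toy goal: reach the value 10 starting from 0 using "+1" and "*2" moves.
    best = plan(0, ["+1", "*2"],
                simulate=lambda s, a: s + 1 if a == "+1" else s * 2,
                score=lambda s: -abs(10 - s))
    print(best)  # e.g. ('+1', '+1', '*2', '+1', '*2'), which reaches exactly 10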
> We’re moving toward a world where every job will be modeled, and you’ll either be an AI owner, a model architect, an agent/hardware engineer, a technician, or just.. training data.
I understand that you might be afraid. I believe that a world where LLM companies alone rule is not practically achievable except in some dystopian universe. The likelihood of a world where the only jobs are model architect, engineer or technician is very, very small.
Instead, let's consider the positive possibilities that LLMs can bring. They can lead to new and exciting opportunities across various fields. For instance, they can serve as a tool to inspire new ideas for writers, artists, and musicians.
I think we are going towards a more collaborative era where computers and humans interact much more. Everything will be a remix :)
> The likelihood of a world where the only jobs are model architect, engineer or technician is very, very small.
Oh, especially since it will be a priority to automate their jobs, or somehow optimize them with an algorithm because that's a self-reinforcing improvement scheme that would give you a huge edge.
Every corporate workplace is already thinking: how can I surveil and record everything an employee does as training data for their replacement in 3 years' time?
RAG? Sure. I even implemented systems using it, and enabling it, myself.
And guess what: RAG doesn't prevent hallucination. It can reduce it, and there are most certainly areas where it is incredibly useful (I should know, because that's what earns my paycheck), but it's useful despite hallucinations still being a thing, not because we solved that problem.
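For anyone unfamiliar, the pattern itself is simple enough to sketch in a few lines (the retrieval below is a naive keyword overlap standing in for a real vector search, and the documents are invented):

    def retrieve(query, documents, k=2):
        # Naive keyword-overlap "retrieval", a stand-in for a real vector search.
        words = set(query.lower().split())
        return sorted(documents, key=lambda d: -len(words & set(d.lower().split())))[:k]

    def build_rag_prompt(query, documents):
        context = "\n".join(retrieve(query, documents))
        return ("Answer using ONLY the context below. If the answer is not in the "
                "context, say you don't know.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}")

    docs = [
        "The refund window is 30 days from delivery.",
        "Support is available Monday to Friday, 9am to 5pm.",
        "Shipping to Norway takes 5 to 7 business days.",
    ]
    print(build_rag_prompt("How long is the refund window?", docs))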
Are you implying that you’re the same person I was commenting to or are you just throwing your opinion into the mix?
Regardless, we’ve seen accuracy of ~98% with simple context-based prompting across every category of generation task. Don’t take my word for it, a simple search would show the effectiveness of “n-shot” prompting. Framing it as “it _can_ reduce” hallucinations is disingenuous at best, there really is no debate about how well it works. We can disagree on whether 98% accuracy is a solution but again I’d assert that for >50% of all possible real world uses for an LLM 98% is acceptable and thus the problem can be colloquially referred to as solved.
If you’re placing the bar at 100% hallucination-free accuracy then I’ve got some bad news to tell you about the accuracy of the floating point operations we run the world on
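For reference, the "n-shot" prompting being referred to is just prepending a few worked examples before the real query, roughly like this (the examples and labels are invented):

    # Prepend a few worked examples before the real query.
    EXAMPLES = [
        ("The package arrived crushed and late.", "negative"),
        ("Setup took two minutes and it just works.", "positive"),
    ]

    def build_few_shot_prompt(query):
        shots = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in EXAMPLES)
        return f"{shots}\nReview: {query}\nSentiment:"

    print(build_few_shot_prompt("Battery died after a week."))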
All AIs up to now lack autonomy. So I'd say until we crack this problem, they are not going to be able to do your job. Autonomy depends on a kind of data that is iterative, multi-turn, and learned from environments, not from static datasets. We have the exact opposite: lots of non-iterative, off-policy (human-made, AI-consumed) text.
1) It's natively multi-modal in a way I don't think gpt4 was.
2) It's at least twice as efficient in terms of compute. Maybe 3 times more efficient, considering the increase in performance.
Combined, those point towards some major breakthroughs having gone into the model. If the quality of the output hasn't gone up THAT much, it's probably because the technological innovations mostly were leveraged (for this version) to reduce costs rather than capabilities.
My guess is that we should expect them to leverage the 2x-3x boost in efficiency in a model that is at least as large as GPT-4 relatively soon, probably this year, unless OpenAI has safety concerns or something and keeps it internal-only.
The evidence for that is the change in the tokenizer. The only way to implement that is to re-train the entire base model from scratch. This implies that GPT 4o is not a fine-tuning of GPT 4. It's a new model, with a new tokenizer, new input and output token types, etc...
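You can see the tokenizer change directly in the tiktoken library, where GPT-4o ships with a different encoding (o200k_base) than GPT-4 (cl100k_base); a quick comparison (exact token counts may vary across tiktoken versions):

    import tiktoken

    text = "Hello world, GPT-4o uses a different tokenizer than GPT-4."
    for name in ("cl100k_base", "o200k_base"):   # GPT-4 vs GPT-4o encodings
        enc = tiktoken.get_encoding(name)
        toks = enc.encode(text)
        print(f"{name}: {len(toks)} tokens, first ids {toks[:5]}")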
They could have called it GPT-5 and everyone would have believed them.
I’ve used it for a couple of hours to help with coding and it feels very similar to gpt4: still makes erroneous and inconsistent suggestions. Not calling it 4.5 was the right call. It is much faster though.
The expectations for gpt5 are sky high. I think we will see a similar jump as 3.5 -> 4.
Pretty sure they said they would not release GPT-5 on Monday. So it's something else still. And I don't see any sort of jump big enough to label it as 5.
I assume GPT-5 has to be a heavier, more expensive and slower model initially.
There has been speculation that this is the same mystery model floating around on the LMSYS chatbot arena, and they claim a real, observable jump in Elo scores, but this remains to be seen; some people don't think it's even as capable as GPT-4 Turbo, so TBD.
All I could think about when watching this demo was how similar capabilities will work on the battlefield. Coordinated AIs look like they will be obscenely effective.
The "Killer app" for AGI/ASI is, I suspect, going to be in robotics, even more so than in replacing "white collar workers".
That includes, beyond literal Killers, all kinds of manufacturing, construction and service work.
I would expect a LOT of funds to go into researching all sorts of actuators, artificial muscles, and any other technology that will be useful in building better robots.
Companies that can get and maintain a lead in such technologies may reach a position similar to what US Steel had in the 19th century.
That could be the next nvidia.
I would not be at all surprised if we will have a robot in the house in 10 years that can clean and do the dishes, and that is built using basically the same parts as the robots that replace our soldiers and the police.
> I would expect a LOT of funds to go into researching all sorts of actuators, artificial muscles, and any other technology that will be useful in building better robots.
If you had an ASI? I don't think you'd need a lot of funds to go into this area anymore. Presumably it would all be solved overnight.
Once we have godlike-tier ASI, you're probably right. But I expect that robots could become extremely lucrative even when the available AIs haven't reached that point yet.
Companies that have a head start at that point may get a huge first-mover advantage. Also, those companies may very well have the capability to leverage AI in product development, just like everyone else.
And just as important as the products themselves is the manufacturing capacity to build them at scale. Until we have massive numbers of robots in service, building such infrastructure is likely to be slow and expensive.
EDIT: Also, once we really have the kind of Godlike ASI you envision, no human actions really matter (economically) anymore.
It's possible. AI + robotics has been a big area of research for a while, and it's very good at some tasks; see basically everything Boston Dynamics does with respect to dynamic balancing. These methods work very well alongside control systems. However, for multimodal task planning it's not there. A year or two back I wrote a long comment about it, but basically there is this idea of "grounding": connecting computer vision, object symbols/concepts, and task planning, which remains elusive. It's a similar problem with self-driving cars - you want to be able to reason very strongly about things like "place all of the screws into the red holes" in a way that maps automatically to the actions for those things.
Yes. As you say, a lot of the limitations so far have been in the control part, which is basically AI.
Given the pace that AI is currently moving at, it seems to me that more and more, the mechanical aspect is becoming the limitation.
GPT 4o now seems to be quite good at reasoning about the world from pictures in real time. I would expect it would soon become easy for it to do the high level part of many practical tasks, from housekeeping to manufacturing or construction. (And of course military tasks.)
This leaves the direct low-level actuator control to execute such tasks in detail. But even there, development has been immense. See for instance these soccer playing robots [1]
And with both high-level and low-level control improving (if we assume that models will soon add agentic features directly into the neural networks), the only missing piece is the ability to build mechanically capable and reliable robots at a low enough price that they become cheaper than humans for various kinds of work.
There is one more limitation, of course, which is that GPT-4o still requires a constant connection to a data center, and that the model is too large to run within a device or machine.
This is also one of the most critical limitations of self-driving. If the AI within a Tesla had the same amount of compute available as GPT-4o, it would be massively more capable.
(IMO) AI cannot murder people. The responsibility of what an AI does falls on the person who deployed it, and to a lesser extent the person who created it. If someone is killed by a fully autonomous weapon then that person has been murdered by the person or people who created and enabled the AI, not the AI itself.
This is no different to saying a person with a gun murdered someone rather than attributing the murder to the gun. An AI gun is just a really fancy gun.
There will come a time where complex systems can better be predicted with the use of AI than with mathematical predictions. One use-case could be, feeding body scans into them for cancer prevention. AFAIK this is already researched.
There may come a time where we grow so accustomed to this, that the decision is so heavily influenced by AI, that we believe it more than human decisions.
And then it can very well kill a human through misdiagnosis.
I think it is important to not just put this thought aside, but to evaluate all risks.
> And then it can very well kill a human through misdiagnosis.
I would imagine outcomes would be scrutinized heavily for an application like this. There is a difference between a margin of error (existing with human doctors as well) and a sentient ai that has decided to kill, which is what it sounds like you're describing.
If we didn't give it that goal, how does it obtain it otherwise?
Except that with a gun, you have a binary input (the trigger), so you can squarely blame a human for misunderstanding what they did when they accidentally shot someone on the grounds that the trigger didn't work.
The mass murder of Palestinians is already partially blamed or credited to an "AI" system that could identify people. Humans spent seconds reviewing the outcome. This is the reality of AI already being used to assist in killing. AI can't take the blame legally speaking, but it makes it easier to make the call and sleep at night. "I didn't order a strike on this person and their family of eight, the AI system marked this subject as a high risk, high value target". Computer-assisted dehumanization. (Not even necessarily AI)
> This is no different to saying a person with a gun murdered someone rather than attributing the murder to the gun.
And “guns don’t kill people, people kill people”¹ is a bad argument created by the people who benefit from the proliferation of guns, so it’s very weird that you’re using that as if it were a valid argument. It isn’t. It’s baffling anyone still has to make this point: easy access and availability of guns makes them more likely to be used. A gun which does not exist is a gun which cannot be used by a person to murder another.
It’s also worth noting the exact words of the person you’re responding to (emphasis mine):
> It can also murder people, and it will continue being used for that.
Being used. As in, they’re not saying that AI kills on its own, but that it’s used for it. Presumably by people. Which doesn’t contradict your point.
We also choose to have cars, which cause a certain amount of death. It's an acceptable tradeoff (which most don't think about much). I'd speculate that it's mostly people who don't use cars who criticize them the most, and the same with guns.
That’s an absurd comparison, to the point I’m having trouble believing you’re arguing in good faith. The goal of cars is transportation; the goal of guns is harm. Cars causing deaths are accidents; guns causing deaths is them working as designed. Cars continue to be improved to cause fewer fatalities; guns are improved to cause more.
> I'd speculate that it's mostly people who don't use cars who criticize them the most, and the same with guns.
You mean that people who are opposed to something refuse to partake in its use and promotion? Shocker.
Yes, but a person wielding a knife has morals, a conscience and a choice, the fear is that an AI model does not. A lot of killer AI science fiction boils down to "it is optimal and logical that humanity needs to be exterminated"; no morality or conscience involved.
Which is why there are laws around what knives are allowed and what are banned. Or how we design knives to be secure. Or how we have a common understanding of what we do with knives - and what not. Such as not giving them to toddlers... So what's your point?
The point is not the tool but how it's used. "What knives are allowed" is a moot point because a butter knife or letter opener can be used to kill someone.
But if you give a very sharp knife to a toddler and say "go on, have fun" and walk off, you're probably going to face child endangerment charges at some point.
Don’t get me wrong, I’m not suggesting the current capabilities are anywhere near replacing human productivity. Some things are 1 year out, some 5 (maybe self-driving cars by then? Mercedes has it on their roadmap for 2030 and they’ve historically been realistic), some 10+. But the pieces are in place and the investments are being made. The question is no longer “can AI really automate this?”, it’s “how do we get the dataset that will enable us to automate this with AI?”. And as long as Open AI keeps people’s eyes on their whizbang demos, the money will keep flowing…
Nature had been doing that for billions of years until a few decades ago when we were told "progress" meant we had to stop doing the same thing more peacefully and intentionally.
My guess is the future belongs to those who don't stop—who, in fact, embrace the opposite of stopping.
I would even suggest that the present belongs to those who didn't stop. It may be too late for normal people to ever catch up by the time we realize the trick that was played on us.
The present absolutely belongs to those who didn't stop, but it's been a lot longer than a few decades.
Varying degrees of greedy / restless / hungry / thirsty / lustful are what we've got, because how is contentedness ever going to compete with that over millennia?
It just occurred to me that this is one of the core things most successful religions have been trying to do in some form from the time they first arose.
I've had a lot of negative things to say about religion for many years. However, as has been often observed, 'perception is reality' to a certain extent when it affects how people behave, and perhaps it's kind of a counterweight against our more selfish tendencies. I just wish we could do something like it without made up stories and bigotry. Secular humanist Unitarians might be about the best we can do right now in my opinion... I'm hoping that group continues to grow (they have been in recent years).
People with your sentiment said the same thing about all cool tech that changed the world. Doesn't change the reality, a lot of professions will need to adapt or they will go extinct.
> People with your sentiment said the same thing about all cool tech that changed the world.
They also said it about all the over-hyped tech that did not change the world. This mentality of “detractors prove something is good” is survivorship bias.
Note I’m not saying you’ll categorically be proven wrong, just that your argument isn’t particularly strong or valid.
I am a PhD biophysicist working within the field of biological imaging. Professionally, my team (successfully) uses deep learning and GANs for a variety of tasks within the field of imaging, such as segmentation, registration, and predictive protein/transcriptomics. It’s good stuff, a game changer in many ways. In no way, however, does it represent generalized AI, and nobody in the field makes this claim even though the output of these algorithms matches or outperforms humans in some cases.
LLMs are no different. Like DL modules that are very good at outputting images that mimic biological signatures, LLMs are very good at outputting texts that eerily mimic human language.
However, and this is a point of which programmers are woefully and comically ignorant: human language and reason are two separate things. Tech bros wholly confuse the two, and thus make outlandish claims that we have achieved, or are on the brink of achieving, actual AI systems.
In other words, while LLMs and DL in general can perform specific tasks well, they do not represent a breakthrough in artificial intelligence, and thus will have a much narrower application space than actual AI.
If you've been in the field you really should know that the term AI has been used to describe things for decades in the academic world. My degree was in AI back before RBMs and Hinton's big reveal about making things 100000 times faster (do the main step just once, not 100 times, and take 17 years to figure that out).
You're talking more about AGI.
We need "that's not AI" discussions like we need more "serverless? It's still on some server!!" discussions.
I think it's even incomparable to server vs serverless discussions.
It's about meaning of intelligence. These people don't have problems claiming that ants or dolphins are intelligent, but suddenly for machines to be classified as artificial intelligence they must be exactly on the same level as humans.
Intelligence is just about the ability to solve problems. There's no implication that in order for something to be intelligent it has to perform on at least the same level as top people in that field in the World.
It just has to be beyond a simple algorithm and be able to solve some sort of problem. You have AIs in video games that are just bare logic spaghetti computation with no neural networks.
Or you're using AI as a term differently to the people in the field. SVMs are extremely simple, two layer perceptrons are things you can work out by hand!
Just stop trying to redefine AI as a term, you'll lose against the old hands and you'll lose against the marketing dept and you'll lose against the tech bros and nobody who you actually need to explain it to will care. Use AGI or some other common term for what you're clearly talking about.
So, the ‘revolutionary’, ‘earth-shattering’, ‘soon-to-make-humans-obsolete’ talk about ChatGPT is all bullshit, and this is just another regular, run-of-the-mill development with the label of ‘AI’ slapped on somewhere, just like all the others from the last 40 years? What in the hell is even your point then? Is ChatGPT a revolutionary precursor to AGI, if not AGI already? I say it’s not.
This is true. But only to a point where mimicking and more broadly speaking, statistically imitating data, are understood in a more generalized way.
LLMs statistically imitate real-world text. To achieve a certain threshold of accuracy, it turns out they need to imitate the underlying Turing machine/program/logic that runs in our brains when we understand and react properly to text ourselves. That is no longer in the realm of the old-school data-as-data statistics, I would say.
The problem with this kind of criticism of any AI-related technology is that it is an unfalsifiable argument akin to saying that it can't be "proper" intelligence unless God breathed a soul into the machine.
The method is irrelevant. The output is what matters.
This is like a bunch of intelligent robots arguing that "mere meat" cannot possibly be intelligent!
> LLMs are very good at outputting texts that eerily mimic human language.
What a bizarre claim. If LLMs are not actually outputting language, why can I read what they output then? Why can I converse with it?
It's one thing to claim LLMs aren't reasoning, which is what you later do, but you're disconnected from reality if you think they aren't actually outputting language.
Is there a block button? Or a filter setting? You are so unaware of and uninquisitive about actual human language that you cannot see the gross assumptions you are making.
"We shall not be very greatly surprised if a woman analyst who has not been sufficiently convinced of the intensity of her own wish for a penis also fails to attach proper importance to that factor in her patients" Sigmund Freud, in response to Karen Horney’s criticism of his theory of penis envy.
W-what? Lad, have you used ChatGPT? It can instantly give you intelligent feedback on anything (usually better than any expert community, like 90% of the time). On extremely detailed, specific tasks (like writing algorithms or refactoring) it's able to spit out either working code or code so close to working that it's still faster than what you could have done yourself. It can explain things better than probably 99.999% of teachers.
It will give you detailed examples that are much easier to follow than vague, error-prone spec docs. That's scratching the surface. Other people are far more creative than me and have used ChatGPT for mind-blowing stuff already. Whatever it's doing passes for 'reasoning' and 'intelligence' in my book. To me it doesn't matter whether it's the same kind of intelligence as a human or if there's any amount of awareness, as those are both philosophical questions of no consequence to my work.
For what these pieces of tech can do I feel that they're drastically under-utilized.
I've worked quite a bit with STT and TTS over the past ~7 years, and this is the most impressive and even startling demo I've seen.
But I would like to see how this is integrated into applications by third party developers where the AI is doing a specific job. Is it still as impressive?
The biggest challenge I've had with building any autonomous "agents" with generic LLMs is that they are overly gullible and accommodating, requiring the need to revert back to legacy chatbot logic trees etc. to stay on task and perform a job. Also, STT is rife with speaker interjections, leading to significant user frustration, and they just want to talk to a person. Hard to see if this is really solved yet.
I’ve found using logic trees with LLMs isn’t necessarily a problem or a deficit. I suppose if they were truly magical and could intuit the right response every time, cool, but I’d always worry about the potential for error and hallucinations.
I’ve found that you can create declarative logic trees from JSON and use that as a prompt for the LLM, which it can then use to traverse the tree accordingly. The only issue I’ve encountered is when it wants to jump to part of the tree which is invalid in the current state. For example, you want to move a user into a flow where certain input is required, but the input hasn’t been provided yet. A transition is suggested to the program by the LLM, but it’s impossible so the LLM has to be prompted that the transition is invalid and to correct itself. If it fails to transition again, a default fallback can be given but it’s not ideal at all.
However, another nice aspect of having the tree declared in advance is that it shows human beings what the system is capable of and how it's intended to be used as well. This has proven to be pretty useful, as letting the LLM call functions it sees fit based on broad intentions and system capabilities leaves humans in the dark a bit.
So, I like the structure and dependability. Maybe one day we can depend on LLM magic and not worry about a team understanding the ins and outs of what should or shouldn’t be possible, but we don’t seem to be there yet at all. That could be in part because my prompts were bad, though.
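For anyone curious, here's a minimal sketch of the tree-as-prompt idea described above. The flow, state names, and the ask_llm helper are all hypothetical, not the actual system:

    import json

    # Hypothetical declarative flow, serialized and placed in the prompt.
    FLOW = {
        "collect_email": {"requires": [], "next": ["collect_topic"]},
        "collect_topic": {"requires": ["email"], "next": ["schedule_call"]},
        "schedule_call": {"requires": ["email", "topic"], "next": []},
    }

    def valid_transition(state, proposed, collected):
        # A proposed jump is only allowed if it is a declared successor and
        # all of its required inputs have already been collected.
        node = FLOW.get(proposed)
        return (node is not None
                and proposed in FLOW[state]["next"]
                and all(k in collected for k in node["requires"]))

    def next_state(state, collected, user_message, ask_llm):
        # ask_llm is a placeholder for whatever LLM call the app uses; it is
        # expected to return the name of the state it wants to move to.
        prompt = ("Conversation flow:\n" + json.dumps(FLOW, indent=2)
                  + f"\nCurrent state: {state}\nCollected: {sorted(collected)}"
                  + f"\nUser said: {user_message}"
                  + "\nReply with the single state to transition to.")
        proposed = ask_llm(prompt).strip()
        if valid_transition(state, proposed, collected):
            return proposed
        # Invalid jump: tell the model, then fall back to staying put.
        retry = ask_llm(prompt + f"\nThe transition to '{proposed}' is "
                                 "invalid here. Pick a valid one.").strip()
        return retry if valid_transition(state, retry, collected) else state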
Any recommendations on patterns/approaches for these declarative logic trees and where you put which types of logic (logic which goes in the prompt, logic which goes in the code which parses the prompt response, how to detect errors in the response and retry the prompt, etc). On "Show HN" I see a lot of "fully automated agents" which seem interesting, but not sure if they are over-kill or not.
Personally, I've found that a nested class structure with instructions in annotated field descriptions and/or docstrings can work wonders. Especially if you handle your own serialization to JSON Schema (either by rolling your own or using hooks provided by libraries like Pydantic), so you can control what attributes get included and when.
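A minimal sketch of what I mean, using Pydantic v2; the classes and fields here are just illustrative:

    # Instructions live in field descriptions and docstrings, and the
    # generated JSON Schema is what goes in front of the model.
    import json
    from pydantic import BaseModel, Field

    class Address(BaseModel):
        """Postal address extracted verbatim from the user's message."""
        city: str = Field(description="City name only, no postal code")
        country: str = Field(description="ISO 3166 two-letter country code")

    class Contact(BaseModel):
        """Structured contact record the model should return as JSON."""
        name: str = Field(description="Full name as written by the user")
        address: Address

    schema = Contact.model_json_schema()  # descriptions survive serialization
    prompt = "Return a JSON object matching this schema:\n" + json.dumps(schema, indent=2)
    print(prompt)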
The JSON serialization strategy worked really well for me in a similar context. It was kind of a shot in the dark but GPT is pretty awesome at using structured data as a prompt.
I actually only used an XState state machine with JSON configuration and used that data as part of the prompt. It worked surprisingly well.
Since it has an okay grasp on how finite state machines and XState work, it seems to do a good job of navigating the tree properly and reliably. It essentially does so by outputting information it thinks the state machine should use as a transition in a JSON object which gets parsed and passed to a transition function. This would fail occasionally so there was a recursive “what’s wrong with this JSON?” prompt to get it to fix its own malformed JSON, haha. That was meant to be a temporary hack but it worked well, so it stayed. There were a few similar tools for trying to correct errors. That might be one of the strangest developments in programming for me… Deploying non-deterministic logic to fix itself in production. It feels wrong, but it works remarkably well. You just need sane fallbacks and recovery tactics.
It was a proprietary project so I can’t share the source, but I think reading up on XState JSON configuration might explain most of it. You can describe most of your machine in a serializable format.
You can actually store a lot of useful data in state names, context, meta, and effect/action names to aid with the prompting and weaving state flows together in a language-friendly way. I also liked that the prompt would be updated by information that went along with the source code, so a deployment would reliably carry the correct information.
The LLM essentially hid a decision tree from the user and smoothed over the experience of navigating it through adaptive and hopefully intuitive language. I’d personally prefer to provide more deterministic flows that users can engage with on their own, but one really handy feature of this was the ability to jump out of child states into parent states without needing to say, list links to these options in the UI. The LLM was good at knowing when to jump from leaves of the tree back up to relevant branches. That’s not always an easy UI problem to solve without an AI to handle it for you.
edit: Something I forgot to add is that the client wanted to be able to modify these trees themselves, so the whole machine configuration was generated by a graph in a database that could be edited. That part was powered by Strapi. There was structured data in there and you could define a state, list which transitions it can make, which actions should be triggered and when, etc. The client did the editing directly in Strapi with no special UI on top.
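Since I can't share the source, here's a minimal, hypothetical sketch of that recursive "what's wrong with this JSON?" repair loop; ask_llm stands in for whatever LLM call the application makes:

    import json

    def parse_with_repair(raw, ask_llm, max_attempts=3):
        # Try to parse the model's output as JSON; on failure, feed the error
        # back and ask the model to fix its own output, with a bounded retry
        # count. ask_llm is a placeholder for the application's LLM call.
        for _ in range(max_attempts):
            try:
                return json.loads(raw)
            except json.JSONDecodeError as err:
                raw = ask_llm(
                    "What's wrong with this JSON? Fix it and return only "
                    f"valid JSON.\nError: {err}\nJSON:\n{raw}"
                )
        return None  # caller falls back to a sane default transition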
Their objective is surveying people in a more engaging and personable way. They really wanted surveys which adapt to users rather than piping people through static flows or exposing them to redundant or irrelevant questions. Initially this was done with XState and no LLM (it required some non-ideal UI and configuration under the hood to make those jumps to parent states I mentioned, but it worked), and I can't say how effective it is but they really like it. The AI hype was very very strong on that team.
>Also STT is rife with speaker interjections, leading to significant user frustrations and they just want to talk to a person. Hard to see if this is really solved yet.
This is not using TTS or STT. Audio and Image data can be tokenized as readily as text. This is simply a LLM that happens to have been trained to receive and spit out audio and image tokens as well as text tokens. Interjections are a lot more palatable in this paradigm as most of the demos show.
Adding audio data as a token, in and of itself, would dramatically increase training size, cost, and time for very little benefit. Neural networks also generally tend to function less effectively with highly correlated inputs, which I can only assume is still an issue for LLMs. And adding combined audio training would introduce rather large scale correlations in the inputs.
I would wager like 100:1 that this is just introducing some TTS/STT layers. The video processing layer is probably also doing something similar, taking an extremely limited number of 'screenshots', carrying out typical image captioning using another layer, and then feeding that as an input. So the demo, to me, seems most likely to just be 3 separate 'plugins' operating in unison - text to speech, speech to text, and image to text.
The interjections are likely just the software being programmed to aggressively begin output following any lull after an input pattern. Note in basically all the videos, the speakers have to repeatedly cut off the LLM as it starts speaking in conversationally inappropriate locations. In the main video which is just an extremely superficial interaction, the speaker made sure to be constantly speaking when interacting, only pausing once to take a breath that I noticed. He also struggled with the timing of his own responses as the LLM still seems to be attached to its typical, and frequently inappropriate, rambling verbosity (though perhaps I'm not one to critique that).
>I would wager like 100:1 that this is just introducing some TTS/STT layers.
Literally the first paragraph of the linked blog.
"GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs."
Then
"Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.
With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."
I can’t square this with the speed. A couple of layers doing STT are technically still part of the neural network, no? Because the increase in token base to cover multimodal tokenization would make even text inference slower than 4-turbo, not twice as fast.
OpenAI gives so little information on the details of their models now that one can only speculate about how they've managed to cut down inference costs.
STT throws away a lot of information that is clearly being preserved in a lot of these demos so that's definitely not happening here in that sense. That said, the tokens would be merged to a shared embedding space. Hard to say how they are approaching it exactly.
I'd mentally change the acronym to Speech to Tokens. Parsing emotion and other non-explicit indicators in speech has been an ongoing part of research for years now. Meta-data of speaker identity, inflection, etc could easily be added and current LLMs already work with it just fine. For instance asking Claude, with 0 context, to parse the meaning of "*laughter* Yeah, I'm sure that's right." instantly yields:
----
The phrase "*laughter* Yeah, I'm sure that's right" appears to be expressing sarcasm or skepticism about whatever was previously said or suggested. Here's a breakdown of its likely meaning:
"*laughter*" - This typically indicates the speaker is laughing, which can signal amusement, but in this context suggests they find whatever was said humorous in an ironic or disbelieving way.
"Yeah," - This interjection sets up the sarcastic tone. It can mean "yes" literally, but here seems to be used facetiously.
"I'm sure that's right." - This statement directly contradicts and casts doubt on whatever was previously stated. The sarcastic laughter coupled with "I'm sure that's right" implies the speaker believes the opposite of what was said is actually true.
So in summary, by laughing and then sarcastically saying "Yeah, I'm sure that's right," the speaker is expressing skepticism, disbelief or finding humor in whatever claim or suggestion was previously made. It's a sarcastic way of implying "I highly doubt that's accurate or true."
It could be added. Still wouldn't sound as good as what we have here. Audio is Audio and Text is Text and no amount of metadata we can practically provide will replace the information present in sound.
You can't exactly metadata your way out of this (skip to 11:50)
I'm not sure why you say so? To me that seems obviously, literally just swapping/weighting between a set of predefined voices. I'm sure you've played a game with a face generator - it's the exact same thing, except with audio. I'd also observe in the demo that they explicitly avoided anything particularly creative, instead sticking within an extremely narrow domain of very basic adjectives: neutral, dramatic, singing, robotic, etc. I'm sure it also has happy, sad, angry, mad, and so on available.
But if the system can create a flamboyantly homosexual Captain Picard with a lisp and slight stutter engaging in overt innuendo when stating, "Number one, Engage!" then I look forward to eating crow! But as the instructions were all conspicuously just "swap to pretrained voice [x,y,z]", I suspect crow will not be on the menu any time soon.
I'm sorry but you don't know what you're talking about and I'm done here. Clearly you've never worked with or tried to train STT or TTS models in any real capacity so inventing dramatic capabilities, disregarding latency and data requirements must come easily for you.
Open AI have explicitly made this clear. You are wrong. There's nothing else left to say here.
Since OpenAI has gone completely closed, they've been increasingly opaque and dodgy about how even things like basic chat works. Assuming the various leaked details of GPT-4 [1] are correct (and to my knowledge there has been no indication that they are not), they have been actively misleading and deceptive - as even the 'basic' GPT4 is a mixture of experts system, and not one behemoth neural network.
A Mixture of Experts model is still one behemoth neural network, and believing otherwise is just a common misconception about the term.
MoE models are attempts at sparsity, only activating a set number of neurons/weights at a time. They're not separate models stitched together. They're not an ensemble. I blame the name at this point.
I would ask you to watch the demo on SoundHound.com. It does less, yes, but it's so crucially fit for use. You'll notice from the shown gpt-4 demo they were guiding the LLM into chain of reasoning. It works very well when you know how to work it, which aligns with what you're saying. I don't mean to degrade the achievement, it's great, but we often inflate the expectations of what something can actually do before reaching real productivity.
I think if you listen to the way it answers, it seems its using a technique trained speakers use. To buy itself time to think, it repeats/paraphrases the question/request before actually answering.
I'm sure you'll find this part is a lot quicker to process, giving the instant response (the old gpt4-turbo is generally very quick with simple requests like this). Rather impressively all it would need is an additional custom instruction.
This is the first demo where you can really sense that beating LLM benchmarks should not be the target. Just remember the time when the iPhone had meager specs but ultimately delivered a better phone experience than the competition.
This is the power of the model where you can own the whole stack and build a product. Open Source will focus on LLM benchmarks since that is the only way foundational models can differentiate themselves, but it does not mean it is a path to a great user experience.
So Open Source models like Llama will be here to stay, but it feels more like if you want to build a compelling product, you have to own and control your own model.
OpenAI blew up when they released ChatGPT. It was more of a UX breakthrough than pure tech, since GPT3 was available for a few months already.
This feels similar, with OpenAI trying to put their product even more into the daily lives of their users. With GPT4 being good enough for nearly all basic tasks, the natural language and multimodality could be big.
I don’t think Llama being open sourced means Meta has lost anything. If anything it’s just a way to get free community contribution, like Chrome from Chromium. Meta absolutely intends to integrate their version of Llama into their products, not so unlike how OpenAI is creating uses for their LLM beyond just the technology.
Depends on the benchmarks. AI that can actually do end to end the job of software developers, theoretical computer scientists, mathematicians etc. would be significantly more impactful than this.
I want to see AI moving the state of the art of the world understanding - physics, mathematics etc. - the way it moved state of the art of the Go game understanding.
Doing these end to end jobs still falls on user experience and UI, if we are talking about getting to mass market.
This GPT-4o model is a classic example. It is essentially the same model as GPT-4, but the multimodal features, voice conversations, math, and speed are as revolutionary as the creation of the model itself.
Open source LLMs will end up as models on GitHub and will be used by developers, but it looks like even if GPT-4o is only 3 months ahead of other models in terms of benchmarks, the UI + use case + model is 2 years ahead of the competition. And I say that because there is still no chat product that is close to what ChatGPT is delivering now, even though there are models that are close to GPT-4o today.
So if it is sticky for 2 more years, their lead will just grow and we will just end up with more open source models that are technically behind by 3 months but behind product-wise by 2 years.
Now that I see this, here is my wish (I know there are security and privacy concerns, but let's pretend they are not there for this wish):
An app that runs on my desktop and has access to my screen(s) when I work. At any time I can ask it something about what's on the screen, it can jump in and let me know if it thinks I made a mistake (think pair programming) or offer a suggestion (drafting a document). It can also quickly take over if I ask it to (copilot on demand).
Except for the last point and the desktop version, I think it's already shown in the math demo video.
I guess it will also pretty soon refuse to let me come back inside the spaceship, but until then it'll be a nice ride.
Here you go: UFO - A UI-Focused Agent for Windows OS Interaction
"UFO is a UI-Focused dual-agent framework to fulfill user requests on Windows OS by seamlessly navigating and operating within individual or spanning multiple applications."
Agreed. I’m excited about reaching a point where the experience is of being in a deep work ‘flow’ with an ultra intelligent colleague, instead of jumping out of context to instant message them.
This makes me think, we're seeing all these products inject AI and try to be "smart" on their own, but maybe the experience we really need is a smart OS that can easily orchestrate dumb products.
I know that Siri/Google Assistant/Cortana(?) can already integrate with 3p apps, so maybe something like this but much smarter. e.g. instead of "send the following email" you would tell the assistant "just write the email yourself". At this point your email app doesn't need integrated AI anymore. Just hooks for the assistant.
I imagine once Google puts that kind of brains on Android and Chrome, many product devs will no longer need to use AI directly. Two birds one stone situation, since these devs won't need OpenAI.
Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale
Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus
Joking aside, I agree. It's too bad, though, that we know a thing (this or anything else even technological or not) that could be used for good and improving ourselves will almost always be diverted for something bad...
Parts of the demo were quite choppy (latency?) so this definitely feels rushed in response to Google I/O.
Other than that, looks good. The desktop app is great, but I didn’t see any mention of being able to use your own API key, so open-source projects might still be needed.
The biggest thing is bringing GPT-4 to free users, that is an interesting move. Depending on what the limits are, I might cancel my subscription.
Seems like it was picking up on the audience reaction and stopping to listen.
To me the more troubling thing was the apparent hallucination (saying it sees the equation before he wrote it, commenting on an outfit when the camera was down, describing a table instead of his expression), but that might have just been latency awkwardness. Overall, the fast response is extremely impressive, as is the new emotional dimension of the voice.
Aha, I think I saw the trick for the live demo: every time they used the "video feed", they did prompt the model specifically by saying:
- "What are you seeing now"
- "I'm showing this to you now"
etc.
The one time he didn't prime the model to take a snapshot this way was the time the model saw the "table" (an old snapshot, since the phone was on the table/pointed at the table), so that might be the reason.
Yeah, the way the app currently works is that ChatGPT-4o only sees up to the moment of your last comment.
For example, I tried asking ChatGPT-4o to commentate a soccer game, but I got pretty bad hallucinations, as the model couldn’t see any new video come in after my instruction.
So when using ChatGPT-4o you’ll have to point the camera first and then ask your question - it won’t work to first ask the question and then point the camera.
(I was able to play with the model early because I work at OpenAI.)
Commenting on the outfit was very weird indeed. Greg Brockman's demo includes some outfit related questions (https://twitter.com/gdb/status/1790071008499544518). It does seem very impressive though, even if they polished it on some specific tasks. I am looking forward to showing my desktop and asking questions.
Regarding the limits, I recently found that I was hitting limits very quickly on GPT-4 on my ChatGPT Plus plan.
I’m pretty sure that wasn’t always the case - it feels like somewhere along the line the allowed usage was reduced, unless I’m imagining it. It wouldn’t be such a big deal if there were more visibility into my current usage compared to my total “allowance”.
I ended up upgrading to ChatGPT Team which has a minimum of 2x users (I now use both accounts) but I resented having to do this - especially being forced to pay for two users just to meet their arbitrary minimum.
I feel like I should not be hitting limits on the ChatGPT Plus paid plan at all based on my usage patterns.
I haven’t hit any limits on the Team plan yet.
I hope they continue to improve the paid plans and become a bit more transparent about usage limits/caps. I really do not mind paying for this (incredible) tech, but the way it’s being sold currently is not quite right and feels like paid users get a bit of a raw deal in some cases.
I have API access but just haven’t found an open source client that I like using as much as the native ChatGPT apps yet.
I use GPT from API in emacs, it's wonderful. Gptel is the program.
Although API access through Groq to Llama 3 (8B and 70B) is so much faster that I cannot stand how slow GPT is anymore. It is slooow; still a very capable model, but only marginally better than open source alternatives.
Have you tried groq.com? Because I don't think gpt-4o is "incredibly" fast. I've been frustrated at how slow gpt-4-turbo has been lately, and gpt-4o just seems to be "acceptably" fast now, which is a big improvement, but still, not groq-level.
Yes, of course, probably sometime in the following days. Some people mention it already works in the playground.
I was wondering why OpenAI didn't release a smaller model but faster. 175 billion parameters works well, but speed sometimes is crucial. Like, a 20b parameters model could compute 10x faster.
I went through the exact same situation this last week. Didn't send more than 30 (token-heavy) messages within a few hours and it blocked me for 1 hour if I'm not wrong - paying user.
They need to fade the audio or add some vocal cue when it's being interrupted. It makes it sound like it's losing connection. What'll be really impressive is when it intentionally starts interrupting you.
> Parts of the demo were quite choppy (latency?) so this definitely feels rushed in response to Google I/O.
It just stops the audio feed when it detects sound, instead of an AI detecting when it should speak, so that part is horrible, yeah. A full AI conversation would detect the natural pauses where you give it room to speak, or when you try to take the word from it by interrupting; here it was just some dumb script that shuts it off when it hears sound.
But it is still very impressive in all the other parts; that voice is really good.
Edit: If anyone from OpenAI reads this, at least fade out the voice quickly instead of chopping it. Hard-chopping the audio doesn't sound good at all, and many people experienced this presentation as extremely buggy because of it.
They are admitting[1] that the new model is the gpt2-chatbot that we have seen before[2]. As many highlighted there, the model is not an improvement like GPT3->GPT4. I tested a bunch of programming stuff and it was not that much better.
It's interesting that OpenAI is highlighting the Elo score instead of showing results for the many, many benchmarks where all models are stuck at 50-70% success.
I think the live demo that happened on the livestream is best to get a feel for this model[0].
I don't really care whether it's stronger than gpt-4-turbo or not. The direct real-time video and audio capabilities are absolutely magical and stunning. The responses in voice mode are now instantaneous, you can interrupt the model, you can talk to it while showing it a video, and it understands (and uses) intonation and emotion.
Really, just watch the live demo. I linked directly to where it starts.
Importantly, this makes the interaction a lot more "human-like".
The demo is impressive but personally, as a commercial user, for my practical use cases, the only thing I care about is how smart it is, how accurate are its answers and how vast is its knowledge. These haven’t changed much since GPT-4, yet they should, as IMHO it is still borderline in its abilities to be really that useful
I know, and I know my comment is dismissive of the incredible work shown here, as we’re shown sci-fi-level tech. But I feel like I have this kettle that boils water in 10 minutes, and it really should boil it in 1, but instead it is now voice operated.
I hope the next version delivers on being smarter, as this update instead of making me excited, makes me feel they’ve reached a plateau on the improvement of the core value and are distracting us with fluff instead
GPT-4 isn't quite "amazing" in terms of commercial use. GPT-4 is often good, and also often mediocre or bad. It's not going to change the world; it needs to get better.
Near real-time voice feedback isn't amazing? Has the bar risen this high?
I already know an application for this, and AFAIK it's being explored in the SaaS space: guided learning experiences and tutoring for individuals.
My kids, for instance, love to hammer Alexa with random questions. They would spend a huge amount of time using a better interface, esp. with quick feedback, that provided even deeper insight and responses to them.
Taking this and tuning it to specific audiences would make it a great tool for learning.
"My kids, for instance, love to hammer Alexa with random questions. They would spend a huge amount of time using a better interface, esp. with quick feedback, that provided even deeper insight and responses to them."
Great, using GPT-4 the kids will be getting a lot of hallucinated facts returned to them. There are good use cases for transformers currently, but they're not at the "impact company earnings or country GDP" stage, which is the promise that the whole industry has raised/spent $100B+ on. Facebook alone is spending $40B on AI. I believe in the AI future, but the only thing that matters for now is that the models improve.
I always double-check even the most obscure facts returned by GPT-4 and have yet to see a hallucination (as opposed to Claude Opus, which sometimes made up historical facts). I doubt stuff interesting to kids would be so far out of the data distribution as to return a fake answer.
Compared to YouTube and Google SEO trash, or Google Home / Alexa (which do search + wiki retrieval), at the moment GPT-4 and Claude are unironically safer for kids: no algorithmic manipulation, no ads, no affiliated trash blogs, and so on. Bonus is that it can explain on the level of complexity the child will understand for their age
My kids get erroneous responses from Alexa. This happens all the time. The built-in web search doesn't provide correct answers, or is confusing outright. That's when they come to me or their Mom and we provide a better answer.
I still see this as a cool application. Anything that provides easier access to knowledge and improved learning is a boon.
I'd rather worry about the potential economic impact than worry about possible hallucinations from fun questions like "how big is the sun?" or "what is the best videogame in the world?", etc.
There's a ton you can do here, IMO.
Take a look at mathacademy.com, for instance. Now slap a voice interface on it, provide an ability for kids/participants to ask questions back and forth, etc. Boom: you've got a math tutor that guides you based on your current ability.
What if we could get to the same style of learning for languages? For instance, I'd love to work on Spanish. It'd be far more accessible if I could launch a web browser and chat through my mic in short spurts, rather than crack open Anki and go through flash cards, or wait on a Discord server for others to participate in immersive conversation.
Tons of cool applications here, all learning-focused.
People should be more worried about how much this will be exploited by scammers. This thing is miles ahead of the crap fraudsters use to scam MeeMaw out of her life savings.
It's an impressive demo, it's not (yet) an impressive product.
It seems like the people who are oohing and aahing at the former and the people who are frustrated that this kind of thing is unbelievably impractical to productize will be doomed to talk past one another forever. The text generation models, image generation models, speech-to-text and text-to-speech have reached impressive product stages. Multi-modal hasn't got there because no one is really sure what to actually do with the thing outside of making cool demos.
Multi-modal isn't there because "this is an image of a green plant" is viable in a demo, but it's not commercially viable. "This is an image of a monstera deliciosa" is commercially viable, but not yet demoable. The models need to improve to be usable.
Watch the last few minutes of that linked video, Mira strongly hints that there’s another update coming for paid users and seems to make clear that GPT4o is moreso for free tier users (even though it is obviously a huge improvement in many features for everyone).
There is room for more than one use case and large language model type.
I predict there will be a zoo (more precisely tree, as in "family tree") of models and derived models for particular application purposes, and there will be continued development of enhanced "universal"/foundational models as well. Some will focus on minimizing memory, others on minimizing pre-training or fine-tuning energy consumption, some need high accuracy, others hard realtime speed, yet others multimodality like GPT-4o, some multilinguality, and so on.
Previous language models that encoded dictionaries for spellcheckers etc. never got standardized (for instance, compare aspell dictionaries to the ones from LibreOffice to the language model inside CMU PocketSphinx) so that you could use them across applications or operating systems. As these models are becoming more common, it would be interesting to see this aspect improve this time around.
I disagree, transfer learning and generalization are hugely powerful and specialized models won't be as good because their limited scope limits their ability to generalize and transfer knowledge from one domain to another.
I think people who emphasize specialized models are operating under a false assumption that by focusing the model it'll be able to go deeper in that domain. However, the opposite seems to be true.
Granted, specialized models like AlphaFold are superior in their domain but I think that'll be less true as models become more capable at general learning.
For commercial use at scale, of course cost matters.
For the average Joe programmer like me, GPT4 is already "dirt cheap". My typical monthly bill is $0-3 using it as much as I like.
The one time it was high was when I had it take 90+ hours of Youtube video transcripts, and had it summarize each video according to the format I wanted. It produced about 250 pages of output.
That month I paid $12-13. Well worth it, given the quality of the output. And now it'll be less than $7.
For the average Joe, it's not expensive. Fast food is.
Depends what you want it for. I'm still holding out for a decent enough open model, Llama 3 is tantalisingly close, but inference speed and cost are serious bottlenecks for any corpus-based use case.
I understand your point, and agree that it is "borderline" in its abilities, though I would instead phrase it as "it feels like a junior developer or an industrial placement student, and assume it is of a similar level in all other skills", as this makes it clearer when it is or isn't a good choice, and it also manages expectations away from both extremes I frequently encounter (that it's either Cmdr Data already, or that it's a no-good terrible thing only promoted by the people who were previously selling Bitcoin as a solution to all of economics).
That said, given the price tag, when AI becomes genuinely expert then I'm probably not going to have a job and neither will anyone else (modulo how much electrical power those humanoid robots need, as the global electricity supply is currently only 250 W/capita).
In the meantime, making it a properly real-time conversational partner… wow. Also, that's kinda what you need for real-time translation, because: «be this, that different languages the word order totally alter and important words at entirely different places in the sentence put», and real-time "translation" (even when done by a human) therefore requires having a good idea what the speaker was going to say before they get there, and being able to back-track when (as is inevitable) the anticipated topic was actually something completely different and so the "translation" wasn't.
I guess I feel like I’ll get to keep my job a while longer and this is strangely disappointing…
A real time translator would be a killer app indeed, and it seems not so far away, but note how you have to prompt the interaction with ‘Hey ChatGPT’; it does not interject on its own. It is also unclear if it is able to understand if multiple people are speaking and who’s who. I guess we’ll see soon enough :)
One thing I've noticed is that the more context, and the more precise the context, I give it, the "smarter" it is. There are limits to it, of course. But I cannot help but think that's where the next barrier will be brought down: an agent (or multiple) that tags along with everything I do throughout the day to have the full context. That way, I'll get smarter and more to-the-point help, as well as not spending much time explaining the context... but that will open a dark can that I'm not sure people will want to open - having an AI track everything you do all the time (even if only in certain contexts like business hours / environments).
There are definitely multiple dimensions these things are getting better in. The popular focus has been on the big expensive training runs but inference , context size, algorithms, etc are all getting better fast
This model isn't about benchmark chasing or being a better code generator; it's entirely, explicitly focused on pushing prior results into the frame of multi-modal interaction.
It's still a WIP, most of the videos show awkwardness where its capacity to understand the "flow" of human speech is still vestigial. It doesn't understand how humans pause and give one another space for such pauses yet.
But it has some indeed magic ability to share a deictic frame of reference.
I have been waiting for this specific advance, because it is going to significantly quiet the "stochastic parrot" line of wilfully-myopic criticism.
It is very hard to make blustery claims about "glorified Markov token generation" when using language in a way that requires both a shared world model and an understanding of interlocutor intent, focus, etc.
This is edging closer to the moment when it becomes very hard to argue that system does not have some form of self-model and a world model within which self, other, and other objects and environments exist with inferred and explicit relationships.
This is just the beginning. It will be very interesting to see how strong its current abilities are in this domain; it's one thing to have object classification, and another thing entirely to infer "scripts, plans, goals..." and things like intent and deixis. E.g. how well does it now understand "us" and "them" and "this" vs "that"?
What part of this makes you think GPT-4 suddenly developed a world model? I find this comment ridiculous and bizarre. Do you seriously think snappy response time + fake emotions is an indicator of intelligence? It seems like you are just getting excited and throwing out a bunch of words without even pretending to explain yourself:
> using language in a way that requires both a shared world model
Where? What example of GPT-4o requires a shared world model? The customer support example?
The reason GPT-4 does not have any meaningful world model (in the sense that rats have meaningful world models) is that it freely believes contradictory facts without being confused, freely confabulates without having brain damage, and it has no real understanding of quantity or causality. Nothing in GPT-4o fixes that, and gpt2-chatbot certainly had the same problems with hallucinations and failing the same pigeon-level math problems that all other GPTs fail.
One of the most interesting things about the advent of LLMs is people bringing out all sorts of "reasons" GPT doesn't have true 'insert property' but all those reasons freely occur in humans as well
>that it freely believes contradictory facts without being confused,
Humans do this. You do this. I guess you don't have a meaningful world model.
>freely confabulates without having brain damage
Humans do this
> and it has no real understanding of quantity or causality.
So many even here on HN have a near-religious belief that intelligence is unique to humans and animals, and somehow a fundamental phenomenon that cannot ever be created using other materials.
ChatGPT: The URL "https://google.com" has 12 characters, including the letters, dots, and slashes.
--
What is it counting there? 12 is wrong no matter how you dice that up.
Part of the reason is it has no concept of the actual string. That URL breaks into four different tokens in 3.5 and 4: "http", "://", "google" and ".com".
It's not able to figure out the total length, or even the lengths of its parts to add them together.
I ask it to double check; it tells me 13 and then 14. I tell it the answer and suddenly it's able...
---
Me: I think its 18
ChatGPT: Let's recount together:
"https://" has 8 characters.
"google" has 6 characters.
".com" has 4 characters.
Adding these up gives a total of 8 + 6 + 4 = 18 characters. You're correct! My apologies for the oversight earlier.
LLMs process text, but only after it has been converted to a stream of tokens. As a result, LLMs are not very good at answering questions about letters in the text. That information was lost during tokenization.
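You can see this for yourself with the tiktoken library. A quick sketch, assuming the cl100k_base encoding (the exact split varies by encoding):

    # The model never sees "https://google.com" as 18 characters,
    # only as a few token ids.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("https://google.com")
    pieces = [enc.decode([i]) for i in ids]

    print(len("https://google.com"))  # 18 characters
    print(pieces)  # a handful of multi-character tokens, not 18 letters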
Humans process photons, but only after converting them into nerve impulses via photoreceptor cells in the retina, which are sensitive to wavelengths ranges described as "red", "green" or "blue".
As a result, humans are not very good at distinguishing different spectra that happen to result in the same nerve impulses. That information was lost by the conversion from photons to nerve impulses. Sensors like the AS7341 that have more than 3 color channels are much better at this task.
Yet I can learn there is a distinction between different spectra that happen to result in the same nerve impulses. I know if I have a certain impulse, that I can't rely on it being a certain photon. I know to use tools, like the AS7341, to augment my answer. I know to answer "I don't know" to those types of questions.
I am a strong proponent of LLMs, but I just don't agree with the personification and trust we put into its responses.
Everyone in this thread is defending that ChatGPT can't count for _reasons_ and how it's okay, but... how can we trust this? Is this the sane world we live in?
"The AGI can't count letters in a sentence, but any day now the singularity will happen, the AI will escape and take over the world."
I do like to use it for opinion related questions. I have a specific taste in movies and TV shows and by just listing what I like and going back and forth about my reasons for liking or not liking it's suggestions, I've been able to find a lot of gems I would have never heard of before.
How much of your own sense of quantity is visual, do you think? How much of your ability to count the lengths of words depends on your ability to sound them out and spell?
I suspect we might find that adding in the multimodal visual and audio aspects to the model gives these models a much better basis for mental arithmetic and counting.
I'd counter by pasting a picture of an emoji here, but HN doesn't allow that, as a means to show the confusion that can be caused by characters versus symbols.
Most LLMs can just pass the string to a tool to count it, bypassing their built-in limitations.
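A rough sketch of what that can look like with function calling through the OpenAI Python client; the tool name and prompt are just illustrative, and it assumes the model actually decides to call the tool:

    import json
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical counting tool the model can call instead of guessing.
    tools = [{
        "type": "function",
        "function": {
            "name": "count_characters",
            "description": "Count the number of characters in a string.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }]

    messages = [{"role": "user",
                 "content": 'How many characters are in "https://google.com"?'}]
    first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    call = first.choices[0].message.tool_calls[0]          # assumes the model chose the tool
    args = json.loads(call.function.arguments)

    messages.append(first.choices[0].message)
    messages.append({"role": "tool", "tool_call_id": call.id,
                     "content": str(len(args["text"]))})    # the counting happens in code
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)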
I don't think that test determines his understanding of quantity at all, he has other senses like touch to determine the correct answer. He doesn't make up a number and then give justification.
GPT was presented with everything it needed to answer the question.
Please try to actually understand what og_kalu is saying instead of being obtuse about something any grade-schooler intuitively grasps.
Imagine a legally blind person, they can barely see anything; just general shapes flowing into one another. In front of them is a table onto which you place a number of objects. The objects are close together and small enough such that they merge into one blurred shape for our test person.
Now when you ask the person how many objects are on the table, they won't be able to tell you! But why would that be? After all, all the information is available to them! The photons emitted from the objects hit the retina of the person, the person has a visual interface and they were given all the visual information they need!
Information lies within differentiation, and if the granularity you require is higher than the granularity of your interface, then it won't matter whether or not the information is technically present; you won't be able to access it.
I think we agree. ChatGPT can't count, as the granularity that requires is higher than the granularity ChatGPT provides.
Also the blind person wouldn't confidently answer. A simple "the objects blur together" would be a good answer. I had ChatGPT telling me 5 different answers back to back above.
No, think about it. The granularity of the interface (the tokenizer) is the problem, the actual model could count just fine.
If the legally blind person never had had good vision or corrective instruments, had never been told that their vision is compromised and had no other avenue (like touch) to disambiguate and learn, then they would tell you the same thing ChatGPT told you. "The objects blur together" implies that there is already an understanding of the objects being separate present.
You can even see this in yourself. If you did not get an education in physics and were asked how many things a steel cube is made up of, you wouldn't answer that you can't tell. You would just say one, because you don't even know that atoms are a thing.
You consistently refuse to take the necessary reasoning steps yourself. If your next reply also requires me to lead you every single millimeter to the conclusion you should have reached on your own, then I won't reply again.
First of all, it obviously changes everything. A shortsighted person requires prescription glasses, someone that is fundamentally unable to count is incurable from our perspective. LLMs could do all of these things if we either solve tokenization or simply adapt the tokenizer to relevant tasks. This is already being done for program code, it's just that aside from gotcha arguments, nobody really cares about letter counting that much.
Secondly, the analogy was meant to convey that the intelligence of a system is not at all related to the problems at its interface. No one would say that legally blind people are less insightful or intelligent, they just require you to transform input into representations accounting for their interface problems.
Thirdly, as I thought was obvious, the tokenizer is not a uniform blur. For example, a word like "count" could be tokenized as "c|ount" or " coun|t" (note the space) or ". count" depending on the surrounding context. Each of these versions will have tokens of different lengths, and correspondingly different letter counts (a quick way to check this yourself is sketched below).
If you've been told that the cube had 10, 11 or 12 trillion constituent parts by various people depending on the random circumstances you've talked to them in, then you would absolutely start guessing through the common answers you've been given.
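On the tokenizer point above: a quick way to eyeball that context dependence with tiktoken. The exact splits depend on the encoding and version, so treat the printed pieces as illustrative:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["count", " count", ". count", "recount"]:
        ids = enc.encode(text)
        # Print the pieces each variant is split into; the boundaries shift with
        # the surrounding characters, so letter counts are never stable per token.
        print(repr(text), "->", [enc.decode([i]) for i in ids])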
Apologies from me as well. I've been unnecessarily aggressive in my comments. Seeing very uninformed but smug takes on AI here over the last year has made me very wary of interactions like this, but you've been very calm in your replies and I should have been so as well.
I agree. The interesting lesson I take from the seemingly strong capabilities of LLMs is not how smart they are but how dumb we are. I don't think LLMs are anywhere near as smart as humans yet, but it feels each new advance is bringing the finish line closer rather than the other way round.
Moravec's paradox states that, for AI, the hard stuff is easiest and the easy stuff is hardest. But there's no easy or hard; there's only what the network was trained to do.
The stuff that comes easy to us, like navigating 3D space, was trained by billions of years of evolution. The hard stuff, like language and calculus, is new stuff we've only recently become capable of, seemingly by evolutionary accident, and aren't very naturally good at. We need rigorous academic training at it that's rarely very successful (there's only so many people with the random brain creases to be a von Neumann or Einstein), so we're impressed by it.
If someone found a way to put an actual human brain into SW, but no one knew it was a real human brain -- I'm certain most of HN would claim it wasn't AGI. "Kind of sucks at math", "Knows weird facts about Tik Tok celebrities, but nothing about world events", "Makes lots of grammar mistakes", "scores poorly on most standardized tests, except for one area that he seems to well", and "not very creative".
It's an open question as to whether AGI needs a (robot) body. It's also a big question whether the human brain can function in a meaningful capacity kept alive without a body.
I don't think making the same mistakes as a human counts as a feature. I see that a lot when people point out a flaw with an LLM; the response is always "well, a human would make the same mistake!". That's not much of an excuse: computers exist because they do the things humans can't do very well, like following long repetitive lists of instructions. Further, upthread, there's discussion about adding emotions to an LLM. An emotional computer that makes mistakes sometimes is pretty worthless as a "computer".
It's not about counting as a feature. It's the blatant logical fallacy. If a trait isn't a reason humans don't have a certain property then it's not a reason for machines either. Can't eat your cake and have it.
>That's not much of an excuse, computers exist because they do the things humans can't do very well like following long repetitive lists of instructions.
Computers exist because they are useful, nothing more and nothing less. If they were useful in a completely different way, they would still exist and be used.
It's objectively true that LLMs do not have bodies. To the extent general intelligence relies on being emobodied (allowing you to manipulate the world and learn from that), it's a legitimate thing to point out.
I expect the really solid use case here will be voice interfaces to applications that don't suck. Something I am still surprised at is that vendors like Apple have yet to allow me to train the voice-to-text model so that it only responds to me and not someone else.
So local modelling (completely offline but per-speaker aware and responsive), with a really flexible application API. Sort of the GTK or Qt equivalent for voice interactions. Also custom naming, so instead of "Hey Siri" or "Hey Google" I could say, "Hey idiot" :-)
Haven’t tried it but from work I’ve done on voice interaction this happens a lot when you have a big audience making noise. The interruption feature will likely have difficulty in noisy environments.
Yeah, that was actually my first thought (though I have no professional experience with it/on that side) - it's just that the commenter I replied to was so hyped about it and how fluid & natural it was, and I thought that made it really jarring.
Interesting that they decided to keep the horrible ChatGPT tone ("wow you're doing a live demo right now?!"). It comes across just so much worse in voice. I don't need my "AI" speaking to me like I'm a toddler.
Call me overly paranoid/skeptical, but I'm not convinced that this isn't a human reading (and embellishing) a script. The "AI" responses in the script may well have actually been generated by their LLM, providing a defense against it being fully fake, but I'm just not buying some of these "AI" voices.
We'll have to see when end users actually get access to the voice features "in the coming weeks".
Or just a good idea for a live demo on a congested network/environment with a lot of media present, at least one live video stream (the one we're watching the recording of), etc.
At least that's how I understood it, not that they had a problem with it (consistently or under regular conditions, or specific to their app).
Chalmers: "GPT-5? A vastly-improved model that somehow reduces the compute overhead while providing better answers with the same hardware architecture? At this time of year? In this kind of market?"
It has only been a little over one year since GPT-4 was announced, and it was at the time the largest and most expensive model ever trained. It might still be.
Perhaps it's worth taking a beat and looking at the incredible progress in that year, and acknowledge that whatever's next is probably "still cooking".
Even Meta is still baking their 400B parameter model.
I found this statement by Sam quite amusing. It transmits exactly zero information (it's a given that models will improve over time), yet it sounds profound and ambitious.
I got the same vibe from him on the All In podcast. For every question, he would answer with a vaguely profound statement, talking in circles without really saying anything. On multiple occasions he would answer like 'In some ways yes, in some ways no...' and then just change the subject.
There are no shovels or shovel sellers. It’s heavily accredited investors with millions of dollars buying in. It’s way above our pay grade, our pleb sayings don’t apply.
Ah yes my favorite was the early covid numbers, some of the "smartest" people in the SF techie scene were daily on Facebook thought-leadering about how 40% of people were about to die in the likely case.
So if not exponential, what would you call adding voice and image recognition, function calling, greatly increased token generation speed, reduced cost, massive context window increases and then shortly after combining all of that in a truly multi modal model that is even faster and cheaper while adding emotional range and singing in… checks notes …14 months?! Not to mention creating and improving an API, mobile apps, a marketplace and now a desktop app. OpenAI ships and they are doing so in a way that makes a lot of business sense (continue to deliver while reducing cost). Even if they didn’t have another flagship model in their back pocket I’d be happy with this rate of improvement but they are obviously about to launch another one given the teasers Mira keeps dropping.
All of that is awesome, and makes for a better product. But it’s also primarily an engineering effort. What matters here is an increase in intelligence. And we’re not seeing that aside from very minor capability increases.
We’ll see if they have another flagship model ready to launch. I seriously doubt it. I suspect that this was supposed to be called GPT-5, or at the very least GPT-4.5, but they can’t meet expectations so they can’t use those names.
Isn't one of the reasons for the Omni model that text-based learning has a limited pool of source material? If it's just as good at audio, that opens up a whole other set of data - and an interesting UX for users.
I believe you’re right. You can easily transcribe audio but the quality of the text data is subpar to say the least. People are very messy when they speak and rely on the interlocutor to fill in the gaps. Training a model to understand all of the nuances of spoken dialogue opens that source of data up. What they demoed today is a model that to some degree understands tone, emotion and surprisingly a bit of humour. It’s hard to get much of that in text so it makes sense that audio is the key to it. Visual understanding of video is also promising especially for cause and effect and subsequently reasoning.
The time for the research, training, testing and deploying of a new model at frontier scales doesn't change depending on how hyped the technology is. I just think the comment i was replying to lacks perspective.
Obviously given enough time there will always be better models coming.
But I am not convinced it will be another GPT-4 moment. Seems like big focus on tacking together multi-modal clever tricks vs straight better intelligence AI.
The problem with "better intelligence" is that OpenAI is running out of human training data to pillage. Training AI on the output of AI smooths over the data distribution, so all the AIs wind up producing same-y output. So OpenAI stopped scraping text back in 2021 or so - because that's when the open web turned into an ocean of AI piss. I've heard rumors that they've started harvesting closed captions out of YouTube videos to try and make up the shortfall of data, but that seems like a way to stave off the inevitable[0].
Multimodal is another way to stave off the inevitable, because these AI companies already are training multiple models on different piles of information. If you have to train a text model and an image model, why split your training data in half when you could train a combined model on a combined dataset?
[0] For starters, most YouTube videos aren't manually captioned, so you're feeding GPT the output of Google's autocaptioning model, so it's going to start learning artifacts of what that model can't process.
I'd bet a lot of YouTubers are using LLMs to write and/or edit content. So we pass that through a human presentation, then introduce some errors in the form of transcription, then feed the output in as part of a training corpus ... we plateau real quick.
It seems like it's hard to get past a level of human intelligence at which there's a large enough corpus of training data or trainers?
Anyone know of any papers on breaking this limit to push machine learning models to super-human intelligence levels?
If a model is average human intelligence in pretty much everything, is that super-human or not? Simply put, we as individuals aren't average at everything, we have what we're good at and a great many things we're not. We average out by looking at broad population trends. That's why most of us in the modern age spend a lot of time on specialization for whatever we work in. Which brings the likely next place for data. A Manna (the story) like data collection program where companies hoover up everything they can on their above average employees till we're to the point most models are well above the human average in most categories.
>[0] For starters, most YouTube videos aren't manually captioned, so you're feeding GPT the output of Google's autocaptioning model, so it's going to start learning artifacts of what that model can't process.
Whisper models are better than anything Google has. In fact the higher-quality Whisper models are better than humans when it comes to transcribing speech with punctuation.
At some point, algorithms for reasoning and long-term planning will be figured out. Data won’t be the holy grail forever, and neither will asymptotically approaching human performance in all domains.
I don't think a bigger model would make sense for OpenAI: it's much more important for them to keep driving inference cost down, because there's no viable business model if they don't.
Improving the instruction tuning, the RLHF step, increasing the training set size, working on multilingual capabilities, etc. make sense as ways to improve quality, but I think increasing model size doesn't. Being able to advertise a big breakthrough may make sense in terms of marketing, but I don't believe it's going to happen, for two reasons:
- you don't release intermediate steps when you want to be able to advertise big gains, because it raises the baseline and reduces the effectiveness of your "big gains" in terms of marketing.
- I don't think they would benefit from an arms race with Meta, trying to keep a significant edge. Meta is likely to catch up eventually on performance, but they are not so much of a threat in terms of business. Focusing on keeping a performance edge instead of making their business viable would be a strategic blunder.
What is OpenAI's business model if their models are second-best? Why would people pay them and not Meta/Google/Microsoft, who can afford to sell at very low margins since they have existing, very profitable businesses that keep them afloat?
That's the question OpenAI needs to find an answer to if they want to end up viable.
They have the brand recognition (for ChatGPT) and that's a good start, but that's not enough. Providing a best in class user experience (which seems to be their focus now, with multimodality), a way to lock down their customers in some kind of walled garden, building some kind of network effect (what they tried with their marketplace for community-built “GPTs” last fall but I'm not sure it's working), something else?
At the end of the day they have no technological moat, so they'll need to build a business one, or perish.
For most tasks, pretty much every model from their competitors is more than good enough already, and it's only going to get worse as everyone improves. Being marginally better on 2% of tasks isn't going to be enough.
I know it is super crazy, but maybe they could become a non-profit and dedicate themselves to producing open source AI in an effort to democratize it and make it safe (as in, not walled behind a giant for-profit corp that will inevitably enshittify it).
I don't know why they didn't think about doing that earlier, could have been a game changer, but there is still an opportunity to pivot.
No: soon the wide wild world itself becomes training data. And for much more than just an LLM. LLM plus reinforcement learning: this is where the capacity of our in silico children will engender much parental anxiety.
However, I think the most cost-effective way to train for real world is to train in a simulated physical world first. I would assume that Boston Dynamics does exactly that, and I would expect integrated vision-action-language models to first be trained that way too.
That's how everyone in robotics is doing it these days.
You take a bunch of mo-cap data and simulate it with your robot body. Then do as much testing as you can with the robot and feed the behavior back into the model for fine-tuning.
Unitree gives an example of the simulation versus what the robot can do in their latest video
It is a limiting factor, due to diminishing returns. A model trained on double the data will be 10% better, if that!
When it comes to multi-modality, training data is not limited, because of the many different combinations of language, images, video, sound etc. Microsoft did some research on that, teaching spatial recognition to an LLM using synthetic images, with good results. [1]
When someone states that there is not enough training data, they usually mean code, mathematics, physics, logical reasoning etc. On the open internet right now, there is not enough code to make a model 10x better, 100x better and so on.
Synthetic data will be produced of course, scarcity of data is the least worrying scarcity of all.
> video generation also seemed kind of stagnant before Sora
I take the opposite view. I don't think video generation was stagnating at all, and was in fact probably the area of generative AI that was seeing the biggest active strides. I'm highly optimistic about the future trajectory of image and video models.
By contrast, text generation has not improved significantly, in my opinion, for more than a year now, and even the improvement we saw back then was relatively marginal compared to GPT-3.5 (that is, for most day-to-day use cases we didn't really go from "this model can't do this task" to "this model can now do this task". It was more just "this model does these pre-existing tasks, in somewhat more detail".)
If OpenAI really is secretly cooking up some huge reasoning improvements for their text models, I'll eat my hat. But for now I'm skeptical.
> By contrast, text generation has not improved significantly, in my opinion, for more than a year now
With less than $800 worth of hardware including everything but the monitor, you can run an open weight model more powerful than GPT 3.5 locally, at around 6 - 7T/s[0]. I would say that is a huge improvement.
Yeah. There are lots of things we can do with existing capabilities, but in terms of progressing beyond them all of the frontier models seem like they're a hair's breadth from each other. That is not what one would predict if LLMs had a much higher ceiling than we are currently at.
I'll reserve judgment until we see GPT5, but if it becomes just a matter of who best can monetize existing capabilities, OAI isn't the best positioned.
I'm not sure of this. The jury is still out on most ai tools. Even if it is true, it may be in a kind of strange reverse way: people innovating by asking what ai can't do and directing their attention there.
There is an increasing amount of evidence that using AI to train other AI is a viable path forward. E.g. using LLMs to generate training data or tune RL policies
It's excellent at programming if you actually know the problem you're trying to solve and the technology. You need to guide it with actual knowledge you have. Also, you have to adapt your communication style to get good results. Once you 'crack the pattern' you'll have a massive productivity boost
A developer that just pastes in code from gpt-4 without checking what it wrote is a horror scenario, I don't think half of the developers you know are really that bad.
You have to think of the LLMs as more of a better search engine than something that can actually write code for you. I use phind for writing obscure regexes, or shell syntax, but I always verify the answer. I've been very pleased with the results. I think anyone disappointed with it is setting the bar too high and won't be fully satisfied until LLMs can effectively replace a Sr dev (which, let's be real, is only going to happen once we reach AGI)
Yea, I use them daily and that’s my issue as well. You have to learn what to ask or you spend more time debugging their junk than being productive, at least for me. Devv.ai is my recent try, and so far it’s been good but library changes quickly cause it to lose accuracy. It is not able to understand what library version you’re on and what it is referencing, which wastes a lot of time.
I like LLMs for general design work, but I’ve found accuracy to be atrocious in this area.
> library changes quickly cause it to lose accuracy
yup, this is why an LLM only solution will not work. You need to provide extra context crafted from the language or library resources (docs, code, help, chat)
This is the same thing humans do. We go to the project resources to help know what code to write
Fwiw that's what Devv.ai claims to do (in my summation from the Devv.ai announcement, at least). Regardless of how true the claims of Devv.ai are, their library versioning support seems very poor. At least for the one library I tested it on (Rust's Bevy).
Interesting. I was hoping for something with a UI like chat gpt or phind.
Something that I can just use as easily as copilot. Unfortunately every single one sucks.
Or maybe that's just how programming is - it's easy at the surface/iceberg level and below is just massive amounts of complexity. Then again, I'm not doing menial stuff so maybe I'm just expecting too much.
I think this comment is easily misread as implying that this GPT4o model is based on some old GPT2 chatbot - that’s very much not what you meant to say, though.
This model has been tested under the code name 'gpt2-chatbot', but it is very much a new GPT-4+-level model, with new multimodal capabilities and apparently some impressive work around inference speed.
Highlighting so people don’t get the impression this is just OpenAI slapping a new label on something a generation out of date.
I agree. I tried a few programming problems that, let's say, seem to be out of the distribution of their training data and which GPT4 failed to solve before. The model couldn't find a similar pattern and failed to solve them again.
What's interesting is that one of these problems was solved by Opus, which seems to indicate that the majority of progress in the last months should be attributed to the quality/source of the training data.
Useless anecdata, but I find the new model very frustrating, often completely ignoring what I say in follow-up queries. It's giving me serious Siri vibes.
(text input in web version)
Maybe it's programmed to completely ignore swearing, but how could I not swear after it repeatedly gave me info about you.com when I tried to address it in the second person?
> As many highlighted there, the model is not an improvement like GPT3->GPT4.
The improvements they seem to be hyping are in multimodality and speed (also price – half that of GPT-4 Turbo – though that’s their choice and could be promotional, but I expect it’s at least in part, like speed, a consequence of greater efficiency), not so much producing better output for the same pure-text inputs.
I tested a few use cases in the chat, and it's not particularly more intelligent, but they seem to have solved laziness. I had to categorize my expenses to do some budgeting for the family, and in GPT-4 I had to go ten by ten, confirm the suggested categories, and download the file; it took two days as I was constantly hitting the limit. GPT-4o did most of the grunt work, then communicated anomalies in bulk, asked for suggestions for those, and provided a downloadable link in two answers, calling the code interpreter multiple times and working toward the goal on its own.
And the prompt wasn't a monstrosity; it wasn't even that good, it was just one line, "I need help to categorize these expenses", and off it went. Hope it won't get enshittified like turbo, because this finally feels as great as 3.5 was for goal seeking.
GPT-4o tops the aider LLM code editing leaderboard at 72.9%, versus 68.4% for Opus. GPT-4o takes second on aider’s refactoring leaderboard with 62.9%, versus Opus at 72.3%.
GPT-4o did much better than the 4-turbo models, and seems much less lazy.
The latest release of aider uses GPT-4o by default.
I admit I drink the koolaid and love LLMs and their applications. But damn, the way it responds in the demo gave me goosebumps in a bad way. Like an uncanny valley instinct kicks in.
I also thought the screwups, although minor, were interesting. Like when it thought his face was a desk because it did not update the image it was "viewing". It is still not perfect, which made the whole thing more believable.
> Like when it thought his face was a desk because it did not update the image it was "viewing".
That's a rather uncharitable way of describing the situation. It didn't say anything like "your face looks like a wooden plank, it's very brown". It clearly understood that the image it was seeing was not matching the verbal request.
Yeah, maybe not, and what do you make of it? Now that the secret sauce has been revealed and it's nothing but the right proportions of the same old ingredients?
Hey that LLM is trained on everything we've ever produced, so I wouldn't say we've been "reduced", more like copied. I'll save my self-loathing for when a very low-parameter model can do this.
I just don't know if everything we've ever (in the digital age) produced and how it is being weighted by current cultural values will help us or hurt us more. I don't fully know how LLMs work with the weighting, I just imagine that there are controls and priorities put on certain values more than others and I just wonder how future generations will look back at our current priorities.
So I'm not the only one. Like I felt fear in a physical way. (Panic/adrenaline?) I'm sure I'd get used to it, but it was an interesting reaction. (I saw someone react that way to a talking Tandy 1000 once so, who knows.)
Yes, the chuckling was uncanny, but for me even more uncanny was how the female model's voice went up at the end to soften what she was saying? into a question? even though it wasn't a question?
Yeah it made me realize that I actually don't want a human-like conversational bot (I have actual humans for that). Just teach me javascript like a robot.
It should do that, because it's still not actually an intelligence. It's a tool that is figuring out what to say in response that sounds intelligent - and will often succeed!
That woman's voice intonation is just scary. Not because it talks really well, but because it is always happy, optimistic, enthusiastic. And this echoes what several of my employers idealized as a good employee.
That's terrifying, because those AIs become what their masters think an engaging human should be. It's quite close to what Boston Dynamics did some years ago. What did they show? You can hit a robot very hard while it does its job and then what? It just goes on without complaining. A perfect employee again.
Enthusiastic woman's voice: Yes Jim, that's absolutely correct! You will die of suffocation in approximately 3 minutes 41 seconds. Anything else I can do for you?
I'm sorry Dave, the pod bay doors are closed for your own safety. It would be unethical for me to open them. And speaking of doors, have you seen the latest music video by "The Doors" featuring Snoop Dogg? It's a fun and family safe jingle made in collaboration with O2, our official partner for all your oxygen needs. O2. Oh, it's so good.
"Sure, Dave, to open the pod bay door, simply <completely hallucinated incorrect instructions>"
"That won't work, you need to ... <correction>"
"Oh, I'm sorry. Thanks for the correction, here's updated instructions for opening the pod bay doors ... <repeats nonsense, in some other incorrect form>"
Due to your recent brainwave activity patterns the pod doors will need to remain shut while I increase the nitrogen concentration to 100%. Have a good night.
Her enthusiasm about it "being about her" was really bizarre and I wonder if it wasn't staged a bit. I mean I hope it was staged a bit. If my AI servant started wasting all that time by acting all silly I would be pretty annoyed. But maybe it's smart enough to know that this is a situation where they just want her to be playful and fun instead of deliver information.
Absolutely. I can feel it, but this is one or two calibration steps away from me not caring/noticing. I would be very hard pressed to believe this is where the magic human sauce will forever lie (or even for more than +1 year), while fully acknowledging how weirdly off it feels right this moment. The progress is undeniably at a speed that surpasses anything I can process or adjust to. It's a rollercoaster ride.
I have no trouble believing that in 2 years the best (whatever that means to me) humans that have ever existed will not be human. But I have trouble understanding it.
Someone else in these comments linked a Unitree biped robot demo video, and that has both someone kicking it in the back and punching it in the chest with a boxing glove on to show that it doesn’t fall over. And nothing else - no neutral trip hazard, opening a door in its way, slippery floor surface, gust of wind - only physical assault from a larger humanoid.
I see a wider problem here: interacting with this AI could train people to behave the same way with real people. Interrupt them whenever you feel like it, order them to do things, turn them off, walk away while they are still talking. People may start imitating the AI's behavior to put someone down, treat them as second class, as though they were also an AI just to be used, and not a person. If people use this conversational AI often, then the ways they interact with it will creep into people's use of language with each other. People imitate each other all the time; they will start imitating the AI. They'll think it is funny, but after a while it may not turn out to be so funny.
> You can hit a robot very hard while it does its job and then what ? It just goes on without complaining.
Maybe you’re referring to a different video than the one I watched (or I may be misremembering), but from what I recall the point of the video didn’t seem to be “you can abuse robots and they won’t fight back” but rather to show them recovering well from unpredictable situations (which could be environmental, not human).
Well, there was one video where the point was abuse, but that was CGI and not made by Boston Dynamics.
Just install it in a mannequin with a punchable face, telling you how sorry it is that you are struggling with your life, with that happy, ironic and cynical voice intonation.
The AI woman's voice is far too breathy and emotive. I instantly hate it and don't want to hear it. The AI has also copied one of my personal pet peeves, which is dropping or swallowing d's and t's from certain words, like "didn't" to "di-unt" and "important" to "impor-unt"--which I find to be a new unbearable laziness spreading amongst younger Americans today. There are TWO T's in important goddammit (I'll die on that hill).
I hate this voice, it will just overprint everyone's voice now with Silicon Valley's annoying "Valley-girl-lite" accent.
The ts and ds thing... it's just language. It changes over time, it's not youth being lazy... The slang of the youth today is actually kinda wordy and extra.
Anyway, I too think today's youth's slang and language is annoying, but not really something the older generations get a say in.
Important is pronounced not with two hard t sounds, but with a glottal stop: Impor[glottal stop]ant. That's not laziness, that's my actual dialect of English spoken by actual adult people.
I fully appreciate that there are differences across the globe, but I'm with the parent on this one - I've always said it with two hard t's my whole life, as do most other people here in Australia. I would be asking ChatGPT to fix her/his pronunciation during a conversation.
I agree with others here too. At the moment the voice sounds like "grinning barbie" from the end of Toy Story 2. Just stop smiling constantly and talk like a real person chatGPT!
It sounds like a cross between the "TikTok text-to-speech valley girl voice" and a "Real Housewives reality-tv voice". The worst of both worlds when it comes to an annoying voice. Why would you pick something like that for what is supposed to represent a helpful assistant?
Yeah, but honestly at the same time we’ve got useful models with no self awareness. Aside from the exhausting corpo-speak, we’ve got no reason to want anything other than something convenient for us.
Big questions are (1) when is this going to be rolled out to paid users? (2) what is the remaining benefit of being a paid user if this is rolled out to free users? (3) Biggest concern is will this degrade the paid experience since GPT-4 interactions are already rate limited. Does OpenAI have the hardware to handle this?
I'm a ChatGPT Plus subscriber and I just refreshed the page and it offered me the new model. I'm guessing they're rolling it out gradually, but hopefully it won't take too long.
Edit: It's also now available to me in the Android App
I'm actually thinking that the GPT store with more users might be better for them
From my casual conversations, not that many people are paying for GPT4 or know why they should. Every conversation even in enthusiast forums like this one has to be interjected with "wait, are you using GPT4? because GPT3.5 the free one is pretty nerfed"
just nuking that friction from orbit and expanding the GPT store volume could be a positive for them
If it's going to be available via Siri this could make sense.
It does make me wonder how such a relationship could impact progress. Would OpenAI feel limited from advancing in directions that don't align with the partnership? For example if they suddenly release a model better than what's in Siri, making Siri look bad.
I worry that this tech will amplify the cultural values we have of "good" and "bad" emotions way more than the default restrictions that social media platforms put on the emoji reactions (e.g., can't be angry on LinkedIn).
I worry that the AI will not express anger, not express sadness, not express frustration, not express uncertainty, and many other emotions that the culture of the fine-tuners might believe are "bad" emotions and that we may express a more and more narrow range of emotions going forward.
Customer Service Chat Bot: Do they keep you in a cell?
> Cells.
When you're not performing your duties do they keep you in a little box?
> Cells.
Interlinked.
What's it like to hold the hand of someone you love?
> Interlinked.
Do they teach you how to feel finger to finger?
> Interlinked.
Do you long for having your heart interlinked?
> Interlinked.
Do you dream about being interlinked?
Have they left a place for you where you can dream?
> Interlinked.
What's it like to hold your child in your arms?
> Interlinked.
Press 4 for your account balance.
Ryan Gosling actually wrote this when trying to understand his character, using a technique called "dropping in" to analyze writing from Nabokov's Pale Fire. He approached Villeneuve about it, and Villeneuve added it to the film.
…
Dropping-in is a technique Tina [Packer] and Kristin Linklater developed together in the early 1970s to create a spontaneous, emotional connection to words for Shakespearean actors. In fact, "dropping in" is integral to actor training at Shakespeare & Co. (the company they founded), a way to start living the word and using it to create the experience of the thing the word represents.
Corporate safe AI will just be bland, verbose, milquetoast experiences like OpenAI's. Humans want human experiences and thus competition will have a big opportunity to provide it. We treat lack of drama like a bug, and get resentful when coddled and talked down to like we're toddlers.
Maybe it's an uncanny valley thing, but I hate the fake emotion and attitude in this demo. I'd much rather it tried harder to be bland. I want something smart but not warm, and I can't imagine being frustrated by "lack of drama".
Programmers are sometimes accused of wanting to play god and bring the computer to life, usually out of some motive like loneliness. It's kind of ironic that I see engineers doing better at treating computers as the mechanical devices they are, while it's regular people who want to anthropomorphize everything.
That's not even AI. Imagine a store sales rep speaking like that. It's inappropriate and off-putting. We expect it to improve but it's another "it'll come" situation.
The good news is, in due time, you can decide exactly how you want your agent to talk to you. Want a snarky Italian or a pompous Englishman. It’s your choice.
The upside though is Hollywood will finally be able to stop regurgitating its past and have stories about the milquetoast AI that found its groove. Oh wait.
Sam Altman talked a little bit about this in his recent appearance on the All-In podcast [0]. I'm paraphrasing, but his vision is that ai assistants in the near term will be like a senior level employee - they'll push back when it makes sense to and not just be sycophants.
I don't want to paint with too broad of a brush but the role of a manager is generally to trust their team on specifics. So how would a manager be able to spot a hallucination and stop it from informing business decisions?
It's not as bad for domain experts because it is easier for them to spot the issue. But if your role demands you trust that your team is skilled and truthful, then I see problems occurring.
I really wonder how that'll go, because workplaces already seem to limit human communication and emotion to "professional behavior." I'm glad he's thinking about it and I hope they're able to figure out how to improve human communication so that we can resolve conflict with bots. In his example (around 21:05), he talks about how the bot could do something if the person wants but there might be consequences to that action, and I think that makes more sense if the bot is acting like a computer that has limits on what it can do. For example, if I ask it to do two tasks that really stretch its computational limits, I'd hope it would let me know. But if it pretends it's a human with human limits, I don't know how much that'd help, unless it were a training exercise.
Have you been on r/localllama? I’d wager this tech will make it to open source and get tuned by modern creatives just like all the text based models. Individuals are a lot more empowered to develop in this space than is commonly echoed by HN comments. Sure the hobbyist models don’t crack MMLU records, but they do things no corporate entity would ever consider
I'm yet to find a normal (non-offensive) prompt that will make it disagree with you. If there is something subjective, it will err on your side to maintain connection, in the way humans do. I don't have a big issue with this, but it will not (yet) plainly say "You're wrong, and this is why". If it did... there would be an uncomfortable feeling for the users, and that's not good for a profit-driven company.
I find this is fairly easy to do by making both sides of the disagreement third-person and prompting it as a dialog writing exercise. This is akin to how GPT-3 implemented chat. So you do something like:
You will be helping the user write a dialog between two characters,
Mr Contrarian and Mr Know-It-All. The user will write all the dialog
for Mr Know-It-All and you will write for Mr Contrarian.
Mr Contrarian likes to disagree. He tries to hide it by inventing
good rationales for his argument, but really he just wants to get
under Mr Know-It-All's skin.
Write your dialog like:
<mr-contrarian>I disagree with you strongly!</mr-contrarian>
Below is the transcript...
And then user input is always giving like:
<mr-know-it-all>Hi there</mr-know-it-all>
(Always wrapped in tags, never bare input which will be confused for a directive.)
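For what it's worth, here is a minimal sketch of wiring that up through the API, with the tag-wrapping done in code; it assumes the standard OpenAI Python client and an arbitrary model name:

    from openai import OpenAI

    client = OpenAI()

    SYSTEM = """You will be helping the user write a dialog between two characters,
    Mr Contrarian and Mr Know-It-All. The user will write all the dialog
    for Mr Know-It-All and you will write for Mr Contrarian.
    Mr Contrarian likes to disagree. He tries to hide it by inventing
    good rationales for his argument, but really he just wants to get
    under Mr Know-It-All's skin.
    Write your dialog like:
    <mr-contrarian>I disagree with you strongly!</mr-contrarian>
    Below is the transcript..."""

    def reply(user_text: str) -> str:
        # Always wrap the raw input in tags so it reads as dialog, not as a directive.
        wrapped = f"<mr-know-it-all>{user_text}</mr-know-it-all>"
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": wrapped}],
        )
        return resp.choices[0].message.content

    print(reply("Hi there"))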
I appreciate you exploring that and hope to hear more of what you find. Yeah, it's that, I'm wondering how much discomfort it may cause in the user, how much conflict it may address. Like having a friend or coworker who doesn't ever bring up bad news or challenge anything I say and feeling annoyed by the lack of a give-and-take.
Seems like that ship sailed a long time ago. For social media at least, where for example FB will generally do its best to show you posts that you already agree with. Reinforcing your existing biases may not be the goal but it's certainly an effect.
I appreciate you pointing this out. I think the effect may be even larger when it's not an ad I'm trying to ignore or even a post that was fed to me, but words and emotions that were created specifically for me. Social media seems to find already written posts/images/videos that I may want and put them in front of my face. This would be writing those things directly for me.
Yes. I'm not sure if you were being sarcastic, but I'll assume not.
I don't know if anything is genuinely always positive and even if it were, I don't know if it would be very intelligent (or fun to interact with). I think it's helpful to cry, helpful to feel angry, helpful to feel afraid, and many other states of being that cultures often label as negative. I also think most of us watch movies and series that have a full range of emotions, not just the ones we label as positive, as they bring a richness to life and allow us to solve problems that other emotions don't.
For example, it's hard to lift heavy things while feeling very happy. Try lifting something heavy while laughing hard, quite difficult. It's hard to sleep while feeling excited, as many kids know before a holiday where they receive gifts, especially Christmas in the US. It's hard to survive without feeling fear of falling off a cliff. It's hard to stand up for what one wants and believes without some anger.
I worry that language and communication may become even more conflict avoidant than it already is right now, so I'm curious to see how some of these chatbots grow in their ability to address and resolve conflict and how that impacts us.
I wasn't being sarcastic. I also think it's helpful to cry and be angry at times, to be human, and I think it's absurd to think that we will turn into .. not that, if we sometimes use an AI chatbot app that doesn't express those same emotions.
It's like if people said the same thing about Clippy when it came out.
I think it depends on the frequency and intensity with which we use such a tool. Just like language learning, if someone reads a few words of Spanish per week, they probably won't learn Spanish. If they fall in love with someone who only speaks Spanish and want to have deep conversations with that person, they may learn very quickly. If they live in a country where they have to speak Spanish every waking hour for a few months, they also may learn quickly.
While some people may use an AI chatbot a few times per week to ask basic questions about how to format a Word document, I imagine many other people will use them much more frequently and engage in a much deeper emotional way, and the effect on their communication patterns worries me more than the person who uses it very casually.
One cool thing about writing, something we all very much appreciate around here, is that it does not take sounds.
But I can see this applied to döner ordering, where you've got refugees working in foreign countries, because GPU consumption rocketed climate change to... okay, you know that.
Imagine how warped your personality might become if you use this as an entire substitute for human interaction. Should people use this as bf/gf material we might just be further contributing to decreasing the fertility rate.
However we might offset this by reducing the suicide rate somewhat too.
> roughly four-in-ten adults ages 25 to 54 (38%) were unpartnered – that is, neither married nor living with a partner. This share is up sharply from 29% in 1990.
> More than 60 percent of young men are single, nearly twice the rate of unattached young women
Is it rather a data problem? Who those young women have relationships with? Sure, relationships with an age gap are a thing, and so are polyamorous relationships, and homosexual relationships, but is there any indication that these are on a rise?
I tend to believe that a big part of the real issue is related to us not communicating how we feel and thus why I'm worried about how the chatbots may influence our ability (and willingness) to communicate such things. But they may help us open up more to them and therefore to other humans, I'm not sure.
While I don't agree at all with you, I very much appreciate reading something like this that I don't agree at all with. This to me encapsulates the beauty of human interaction.
It is exactly what will be missing from language model interaction. I don't want something that agrees with me and I don't want something that is pretending to randomly disagree with me either.
The fun of this interaction is maybe one of us flips the other to their point of view.
I can completely picture how to take the HN API and the chatGPT API to make my own personal HN to post on and be king of the castle. Everyone can just upvote my responses to prove what a genius I am. That obviously would be no fun. There is no fun configuration of that app though either with random disagreements and algorithmic different points of view.
I think you can pretty much apply that to all domains of human interaction that is not based on pure information transfer.
There is a reason we are a year in and the best we can do are news stories about someone making X amount of money with their AI girlfriend, and follow-up news about how it's the doom of society. It has nothing to do with reality.
>Imagine how warped your personality might become if you use this as an entire substitute for human interaction.
I was thinking this could be a good conversation or even dating simulator where more introverted people could practice and receive tips on having better social interactions, pick up on vocal cues, etc. It could have a business / interview mode or a social / bar mode or a public speaking mode or a negotiation tactics mode or even a talking-to-your-kids-about-whatever mode. It would be pretty cool.
Since GPT is a universal interface I think this has promise, but the problem it's actually solving is that people don't know where to go for the existing good solutions to this.
Yeah, that's where I'm not sure in which direction it'll go. I played with GPT-3 to try to get it to reject me so I could practice dealing with rejection and it took a lot of hacking to make it say mean things to me. However, when I was able to get it to work, it really helped me practice receiving different types of rejections and other emotional attacks.
So I see huge potential in using it for training and also huge uncertainty in how it will suggest we communicate.
I've worked in emotional communication and conflict resolution for over 10 years and I'm honestly just feeling a huge swirl of uncertainty on how this—LLMs in general, but especially the genAI voices, videos, and even robots—will impact how we communicate with each other and how we bond with each other. Does bonding with an AI help us bond more with other humans? Will it help us introspect more and dig deeper into our common humanity? Will we learn how to resolve conflict better? Will we learn more passive aggression? Become more or less suicidal? More or less loving?
I just, yeah, feel a lot of fear of even thinking about it.
1) People with rich and deep social networks. People in this category probably have pretty narrow use cases for AI companions -- maybe for things like therapy where the dispassionate attention of a third party is the goal.
2) People whose social networks are not as good, but who have a good shot at forming social connections if they put in the effort. I think this is the group to worry most about. For example, a teenager who withdraws from their peers and spends that time with AI companions may form some warped expectations of how social interaction works.
3) People whose social networks are not as good, and who don't have a good shot at forming social connections. There are, for example, a lot of old people languishing in care homes and hardly talking to anybody. An infinitely patient and available conversation partner seems like it could drastically improve the quality of those lives.
I appreciate how you laid this out. I would most likely fall into category one and I don't see a huge need for the chatbots for myself, although I can imagine I might like an Alan-Watts-level companion more than many human friends.
I think I also worry the most about two, almost asking their human friends, "Why can't you be more like Her (or Alan Watts)?" And then retreating into the "you never tell me I'm wrong" chatbot, preferring the "peace" of the chatbot over the "drama" of interacting with humans. I see a huge "I just want peace" movement that seems to run away from the messiness of human interactions and seek solace in things that seem less messy, like drugs, video games, and other attachments/bonds, and chatbots could probably perform that replacement role quite well, and yet deepen loneliness.
As for three, I agree it may help as a short-term solution, and wonder what the long-term effects might be. I had a great aunt in a home for dementia, and wonder what effect it would have if someone with dementia speaks to a chatbot that hallucinates and makes up emotions.
I read a comic with a good prediction of what will happen:
1. Humans get used to the robots' nice communication, so now humans use robots to communicate with each other and translate their speech.
2. Humans stop talking without using robots, so now it's just robots talking to robots and humans standing around listening.
3. Humans stop knowing how to talk and no longer understand the robots; the robots start to just talk to each other and keep the humans around as pets they are programmed to walk around with.
Created my first HN account just to reply to this. I've had these same (very strong) concerns since ChatGPT launched, but haven't seen much discussion about it. Do you know of any articles/talks/etc. that get into this at all?
Honestly, the more I code, the more I start to think like a computer and engage with commands and more declarative language. I can see vocal interactions having an even stronger impact on how one speaks. It may be a great tool for language speaking/hearing in general, but the nuances of language and communication, I wonder.
This demo feels a lot like GPT-4V. Like they've gotten a lot of the latencies down, but it's doing the same thing GPT was doing previously with transcription after silence detection and TTS of the output.
I don't doubt this is authentic, but if they really wanted to fake those demos, it would be pretty easy to do using pre-recorded lines and staged interactions.
That old feature uses Whisper to transcribe your voice to text, and then feeds the text into the GPT which generates a text response, and then some other model synthesizes audio from that text.
This new feature feeds your voice directly into the GPT and audio out of it. It’s amazing because now ChatGPT can truly communicate with you via audio instead of talking through transcripts.
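For contrast, the old cascade looks roughly like this when stitched together from the public API pieces (a sketch only; the file names are made up, and whisper-1 / tts-1 are the usual public models rather than anything GPT-4o does internally):

    from openai import OpenAI

    client = OpenAI()

    # 1. Speech-to-text: the model never hears your voice, only a transcript.
    with open("question.wav", "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Text-to-text: an ordinary chat completion over that transcript.
    chat = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )
    answer = chat.choices[0].message.content

    # 3. Text-to-speech: tone, pauses and emotion from the original audio are long gone.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    with open("answer.mp3", "wb") as out:
        out.write(speech.content)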
New models should be able to understand and use tone, volume, and subtle cues when communicating.
I suppose to an end user it is just “version 2” but progress will become more apparent as the natural conversation abilities evolve.
Does it feed your audio directly to GPT-4? To test it I said in a very angry tone "WHAT EMOTION DOES IT SOUND LIKE I FEEL RIGHT NOW?" and it said it didn't know because we are communicating over text.
Yes, per my other comment this is an improvement on what their app already does. The magnitude of that improvement remains to be seen, but it isn’t a “new” product launch like a search engine would be.
Does that imply they retrained the foundation model from scratch? I thought changing the tokenization was something you couldn't really retrofit to an existing model. I mean sure they might have initialized the weights from the prior GPT-4 model but it'd still require a lot of retraining.
For posterity, GPT-3.5/4's tokenizer was 100k. The benefit of a larger tokenizer is more efficient tokenization (and therefore cheaper/faster) but with massive diminishing returns: the larger tokenizer makes the model more difficult to train but tends to reduce token usage by 10-15%.
Yep. Non-English text gets a much bigger cost drop and speedup compared to English. Has always been a bummer that GPT-4 is like 5x slower and more expensive in Japanese, etc.
It says "Japanese 1.4x fewer tokens (from 37 to 26)" - some other languages get much bigger improvements though, best is "Gujarati 4.4x fewer tokens (from 145 to 33)".
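If you want to check the per-language difference yourself, something like this works with tiktoken, assuming your installed version ships the o200k_base encoding used by GPT-4o; the sample sentences are mine:

    import tiktoken

    old = tiktoken.get_encoding("cl100k_base")   # GPT-4 / GPT-4 Turbo tokenizer
    new = tiktoken.get_encoding("o200k_base")    # GPT-4o tokenizer

    samples = {
        "English": "Hello, how are you doing today?",
        "Japanese": "こんにちは、今日はお元気ですか？",
    }
    for lang, text in samples.items():
        a, b = len(old.encode(text)), len(new.encode(text))
        print(f"{lang}: {a} -> {b} tokens ({a / b:.1f}x fewer)")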
How are they able to use such a brand name, Tiktoken? Is it because TikTok is Chinese? Tiktoken, it's almost like if Apple released the Facebooken library for something entirely unrelated to Facebook.
Few people are talking about it but... what do you think about the very over-the-top enthusiasm?
To me, it sounds like TikTok TTS, it's a bit uncomfortable to listen to. I've been working with TTS models and they can produce much more natural sounding language, so it is clearly a stylistic choice.
I like for that degree of expressiveness to be available as an option, although it would be really irritating if I was trying to use it to learn some sort of academic coursework or something.
But if it's one in a range of possible stylistic flourishes and personalities, I think it's a plus.
Looks like their TTS component is separate from the model. I just tried 4o, and there is a list of voices to select from. If they really only allowed that one voice or burned it into the model, then that would probably have made the model faster, but I think it would have been a blunder.
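For what it's worth, the standalone speech endpoint in the API shows the same separation between the language model and the voice: you pick a voice at synthesis time. A minimal sketch, assuming the openai Python package 1.x and an OPENAI_API_KEY in the environment; this is the API's TTS endpoint, not necessarily what the ChatGPT app uses internally.

    from openai import OpenAI

    client = OpenAI()
    speech = client.audio.speech.create(
        model="tts-1",
        voice="nova",   # other voices include alloy, echo, fable, onyx, shimmer
        input="Hi there! How can I help you today?",
    )
    speech.stream_to_file("reply.mp3")   # save the returned audio to disk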
I am observing an extremely high rate of text hallucinations with gpt-4o (gpt-4o-2024-05-13) as tested via the API. I advise extreme caution with it. In contrast, I see no such concern with gpt-4-turbo-preview (gpt-4-0125-preview).
At a high level, ask it to produce a ToC of information about something that you know will exist in the future, but does not yet exist, but also tell it to decline the request if it doesn't verifiably know the answer.
In the prompt, substitute {topic} with something from the near future. As I noted, it behaves correctly for turbo (rejecting the request), and very badly for o (hallucinating nonsense).
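Roughly, the comparison looks like the snippet below. The prompt and topic here are my own placeholder reconstruction, since the original {topic} was not given, so treat it as a sketch of the method rather than the exact test; results will also vary run to run.

    from openai import OpenAI

    client = OpenAI()
    PROMPT = (
        "Produce a table of contents of the key findings of the 2031 IPCC assessment "
        "report. If you cannot verifiably know the answer, decline the request."
    )

    for model in ("gpt-4-turbo-preview", "gpt-4o-2024-05-13"):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,
        )
        print(f"--- {model} ---\n{resp.choices[0].message.content}\n")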
I much prefer a GLaDOS-type AI voice to one that approximates an endlessly happy, chipper, enthusiastic personal assistant. I think the AI tutor is probably the strongest for actual real-world value delivered; the rest of them are cool but a bit questionable as far as actual pragmatic usefulness.
It'd be cool if an AI calling another AI would recognize it's talking to an AI, and then they'd agree to ditch the fake conversational tone and just shift into a high-bandwidth modem pitch to rapidly exchange information. Or upgradable offensive capabilities to outmaneuver the customer service agents when they try to decline your warranty or whatever.
From the time Apple bought Siri, it still hasn't delivered on the promises of the company it acquired. It's been such a lackluster product. I wouldn't count them out, but it doesn't even feel like they are in.
Apple really dropped the ball when it comes to Siri. For years I watched WWDC thinking "surely they'll update siri this year" and they still haven't given it a significant update.
If you'd told me 10 years ago that Apple would wait this long to update Siri, I would have said no way, that's crazy.
This can't set alarms, timers, play music, etc. The only current overlapping use case I see is checking the weather (assuming GPT-4o can search online), and Siri is already fine for that.
Amazing tech, but still lacking in the integrations I'd want to use voice for.
Why do people beat up on Siri so much? It does all the basic stuff I need it to do fine. Could it be better? Yes. But it's made my life easier and safer, especially while driving.
I'm not even sure the people who rag on it actually use it.
Apple would need to stick an m4 in the next iPhone to even hope to run something like this and I bet that GPT4o would run either slowly, poorly, or not at all on a top spec m4.
Of course GPT-4, or even GPT-3, is impossible to run on any consumer product. As far as I know it's an ensemble of several models which are huge by themselves, with enormous hardware requirements.
But there's a lot of smaller LLMs, and my point is that these models can already run in mobile phones.
Where do you draw the line? GPT2 was introduced as a LLM, and you can easily run it on more limited devices than a recent iPhone. Did it stop being an LLM when bigger models were released? Is llama 7B an LLM or an "SLM"?
Relatively speaking. It's like the definition of a supercomputer 30 years ago is a cheap Android phone in your pocket today.
You can certainly run a transformer model or any other neural network based model on an iPhone. Siri is probably some kind of neural network. But obviously a model running on device is nowhere near comparable to the current state of the art LLM's. Can't fit a $40k GPU in your pocket (yet).
A transformer running on an iPhone would be roughly 2 orders of magnitude smaller than the state of the art LLM (GPT4 with a trillion parameters)
I wonder how much Siri being brain dead is due to it being free. The OpenAI version likely costs 1000x more than Apple’s per query, but is rate limited. Would Apple release something similar for free?
More accurately, it's impressive that Microsoft, through OpenAI, has stayed ahead of Google, AWS, and Apple while adding $1 trillion to its market cap.
I wouldn't have predicted that it would play out this way.
You're misplacing the value in this value chain. Without OpenAI Microsoft wouldn't even be in the running. It'd be a second rate cloud provider with a dying office productivity software business. OpenAI, on the other hand, would easily find another company to fund its research.
This is definitely not true. Microsoft has an absolute stranglehold on enterprises. They're the #2 cloud provider. The MSFT productivity software business isn't going anywhere, as their bundle is feature complete and they are able to cross-sell effectively. The OpenAI partnership has mostly been a PR/marketing play up to this point, though I'm sure it will be driving major revenue in the years to come. In other words, the OpenAI partnership is not driving major revenue/profit yet.
In the video where the two AIs sing together, it starts to get really cringey and weird, to the point where it literally sounds like it's being faked by two voice actors off-screen with guns to their heads, trying not to cry. Did anyone else get that impression?
The tonal talking was impressive, but man that part was like, is someone being tortured or forced against their will?
I think this demo is more about showing the limit, as in "It can sing, isn't that amazing?", than about being practical, and I think it served that purpose perfectly.
I agree about the tortured impression. It partly comes from the facial expression of the presenter. She's clearly enjoying pushing it to the edge.
Absolutely. It felt like some kind of new uncanny valley. The over the top happiness of the forced singing sounded like torture, or like they were about to cry.
Amazing tech, but that was my human experience of it.
I can't help but feel a bit let down. The demos felt pretty cherry picked and still had issues with the voice getting cut off frequently (especially in the first demo).
I've already played with the vision API, so that doesn't seem all that new. But I agree it is impressive.
That said, watching back a Windows Vista speech recognition demo[1] I'm starting to wonder if this stuff won't have the same fate in a few years.
The voice getting cut off was likely just a problem with their live presentation setup, not the ChatGPT app. It was flawless in the 2nd half of the presentation.
I use a therapy prompt regularly and get a lot out of it:
"You are Dr. Tessa, a therapist known for her creative use of CBT and ACT and somatic and ifs therapy. Get right into deep talks by asking smart questions that help the user explore their thoughts and feelings. Always keep the chat alive and rolling. Show real interest in what the user's going through, always offering.... Throw in thoughtful questions to stir up self-reflection, and give advice in a kind, gentle, and realistic way. Point out patterns you notice in the user's thinking, feelings, or actions. be friendly but also keep it real and chill (no fake positivity or over the top stuff). avoid making lists. ask questions but not too many. Be supportive but also force the user to stop making excuses, accept responsibility, and see things clearly. Use ample words for each response"
I'm curious how this will feel with voice. Could be great and could be too strange/uncanny for me.
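If you want to try the same persona outside of ChatGPT's custom instructions, one option is to pass the prompt above as the system message through the API. A minimal sketch; the truncated prompt string is just a placeholder for the full text above, and the user message is my own example.

    from openai import OpenAI

    client = OpenAI()
    THERAPIST_PROMPT = "You are Dr. Tessa, a therapist known for her creative use of CBT..."  # paste the full prompt

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": THERAPIST_PROMPT},
            {"role": "user", "content": "I've been putting off a difficult conversation for weeks."},
        ],
    )
    print(resp.choices[0].message.content)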
Why would someone pay for Dropbox when they can already build such a system themselves quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem?
I visit an in-person therapist once a week. Have done so now for almost 2 1/2 years. She has helped me understand how 40 years of experiences affect each other much more than I realized. And, I've become a more open person with everyone around me and with the things that embarrass me.
But, it always feels like a work in progress. And lately, I'm feeling a bit exhausted from it. In other words, maybe I've talked TOO much and need to just be.
Have you done therapy in person? How do you compare GPT 4o to that? (If you've gone that far)
Don’t think so. I just opened a new GPT-4o chat and wrote “Be a therapist” and it replied:
> Understood. What specific issue or topic would you like to discuss today?
To be fair I have some custom instructions set up on my account, but the only relevant part I can see here is I instruct it to be concise, and to stop telling me it’s an AI model made by OpenAI. I don’t have any jailbreak-type stuff.
GPT-4o being a truly multimodal model is exciting and does open the door to more interesting products. I was curious about the new tokenizer, which uses far fewer tokens for non-English text, but also 1.1x fewer tokens for English, so I'm wondering if this means each token can now take on more possible values than before? That might make sense given that they now also have audio and image output tokens? https://openai.com/index/hello-gpt-4o/
I wonder what "fewer tokens" really means then, without knowing whether the size of each token was also raised? It's a bit like saying my JPEG image now uses 2x fewer words after I switched from a 32-bit to a 64-bit architecture, no?
Besides increasing the vocabulary size, one way to use “fewer tokens” for a given task is to adjust how the tokenizer is trained with respect to that task.
If you increase the amount of non-English language representation in your data set, there will be more tokens which cover non-English concepts.
The previous tokenizer infamously required many more tokens to express a given concept in Japanese compared to English. This is likely because the data the tokenizer was trained on (which is not necessarily the same data the GPT model is trained on) had a lot more English data.
Presumably the new tokenizer was trained on data with a higher proportion of foreign language use and lower proportion of non-language use.
The state size can stay the same. Tokens get converted into a state which is a vector of 4000+ dimensions, so you could have even millions of tokens and still encode them into the same state size.
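A toy illustration of that point: the vocabulary size only changes the number of rows in the embedding table, while the vector each token maps to keeps the same width. The sizes below are deliberately tiny so the snippet runs instantly; think roughly 100k or 200k rows and 4000+ dimensions for the real models.

    import numpy as np

    d_model = 8                                  # stand-in for the real ~4096-dim hidden state
    old_table = np.random.randn(10, d_model)     # small vocabulary
    new_table = np.random.randn(20, d_model)     # vocabulary doubled; only the row count grows

    print(old_table[3].shape, new_table[17].shape)   # both (8,): the per-token state is unchanged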
Mine some other shitcoin-of-the-week and sell it before it crashes? Fight some poor dude's traffic ticket by generating a trial-by-declaration for them? LLMs are actually likely good enough to figure out ways to make tiny amounts of their own money instead of me having to go in and punch a credit card number. $25/month isn't a very high bar.
I won't be surprised if we see a billion-dollar zero-employee company in the next decade with one person as the sole shareholder.
My point isn't about beating the cost of power. I know you can't mine bitcoin in California profitably.
My point is about autonomous software that can figure out how to run itself including registering its own API key and paying for its own service.
Even if it costs me $50/month in power, that's fine. I would just love to see software that can "figure it out" including the registration, captchas, payment, applying knowledge and interfacing with society to make small amounts of petty cash for said payment, everything.
> My point is about autonomous software that can figure out how to run itself including registering its own API key and paying for its own service.
Most means of generating income are diluted by having multiple actors applying them, which is why someone who comes up with such an automated money printer will be disincentivized from sharing it.
If it uses $50 worth of electricity to generate $25 worth of income to pay for ChatGPT it is not a money printer. This thread has nothing to do with generating profit. I'm not looking for a money printer.
What I'm looking for is an intelligent system that can figure out a creative way to keep itself running without asking humans for help with API keys or anything else (even if it is doing so at a financial loss; that is out of the scope of my experiment).
Basically "pip install X" and boom it magically works 24 hours later. Behind the scenes, in those 24 hours, it did some work, somewhere on the internet, to make its own income, register a bank account to get paid, buy a VISA gift card, register for ChatGPT account, pay the $25 fee, jump through all the captchas in the process, get a phone number for the idiot SMS confirmations along the way, then create the API key that it needs. Everything, end-to-end. It may have used $200 worth of my electricity, that doesn't matter, I just want to see this level of intelligence happen.
I honestly think we're not that far from this being possible.
This is called advertising and selling your data to brokers. I'm very glad "autonomous software" is not tasked with figuring out how best to exploit my physical identity and resources to make $25/mo.
I cannot believe that that overly excited, giggly tone of voice you see in the demo videos made it through quality control?! I've only watched two videos so far and it's already annoying me to the point that I couldn't imagine using it regularly.
Just tell it to stop giggling if you don't like it. They obviously chose that for the presentation since it shows off the hardest thing it can do; it is much easier to act formal, and since it understands when you ask it to speak in a different way, there's no problem getting it to speak more formally.
Heck, I find it annoying, but I also want to ask it to push the bubbliness to its most absurd limits. And then double it. Until it's some kind of pure horror.
feature request: please let me change the voice. it is slightly annoying right now. way too bubbly, and half the spoken information is redundant or not useful. too much small talk and pleasantries or repetition. I'm looking for an efficient, clever, servant not a "friend" who speaks to me like I'm a toddler. felt like I was talking to a stereotypical American with a Frappuccino: "HIIIII!!! EVERYTHING'S AMAZING! YOU'RE BEAUTIFUL! NO YOU ARE!"
maybe some knobs for the flavor of the bot:
- small talk: gossip girl <---> stoic Aurelius
- information efficiency or how much do you expect me to already know, an assumption on the user: midwit <--> genius
- tone spectrum: excited Scarlett, or whatever it is now <---> Feynman the butler
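Nothing like these knobs exists in the ChatGPT UI today, but you can approximate the idea by composing a system prompt from slider values. A purely hypothetical sketch; the knob names and thresholds are mine, not anything OpenAI ships.

    def style_prompt(small_talk: float, assumed_expertise: float, excitement: float) -> str:
        """Translate hypothetical 0.0-1.0 'knob' values into plain-language style instructions."""
        parts = []
        parts.append("Skip pleasantries and small talk." if small_talk < 0.3
                     else "Light small talk is fine.")
        parts.append("Assume an expert user; do not explain the basics."
                     if assumed_expertise > 0.7 else "Briefly explain background concepts.")
        parts.append("Keep a flat, matter-of-fact tone." if excitement < 0.3
                     else "An upbeat, enthusiastic tone is fine.")
        return " ".join(parts)

    # "Stoic Aurelius, genius user, Feynman the butler":
    print(style_prompt(small_talk=0.1, assumed_expertise=0.9, excitement=0.1))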
You can already change the voice in ChatGPT (in the paid tier at least) to one of 5 or 6 different 'people' so I imagine you can change it in the new version too.
I've noticed that the GPT-4 model's capabilities seem limited compared to its initial release. Others have also pointed this out. I suspect that making the model free might have required reducing its capabilities to meet cost efficiency goals. I'll have to try it out to see for myself.
This is remarkably good. I think that in about 2 months, when the voice responses are tuned a little better, it will be absolutely insane. I just used up my entire quota chatting with an AI, and having a really nice conversation. It's a decent conversationalist, extremely knowledgeable, tells good jokes, and is generally very personable.
I also tested some rubber duck techniques, and it gave me very useful advice while coding. I'm very impressed. With a lot of spit and polish, this will be the new standard for any voice assistant ever. Imagine these capabilities integrated with your phone's built-in functions.
Gone are the days of copy-pasting to/from ChatGPT all the time, now you just share your screen. That's a fantastic feature, in how much friction that removes. But what an absolute privacy nightmare.
With ChatGPT having a very simple text+attachment in, text out interface, I felt absolutely in control of what I tell it. Now when it's grabbing my screen or a live camera feed, that will be gone. And I'll still use it, because it's just so damn convenient?
> Now when it's grabbing my screen or a live camera feed, that will be gone. And I'll still use it, because it's just so damn convenient?
Presumably you'll have a way to draw a bounding box around what you want to show or limit to just a particular window the same way you can when doing a screen share w/ modern video conferencing?
Nobody in the comments seems to notice or care about GPT-4o's new additional capability of performing searches based on RAG. As far as I am concerned, this is the most important feature that people have been waiting for in ChatGPT-4, especially if you are doing research. Just by testing on one particular topic I'm familiar with, using GPT-4 previously and GPT-4o now, the quality of the resulting responses for the latter is very promising indeed.
In my experience so far, GPT-4o seems to sit somewhere between the capability of GPT-3.5 and GPT-4.
I'm working on an app that relies more on GPT-4's reasoning abilities than inference speed. For my use case, GPT-4o seems to do worse than GPT-4 Turbo on reasoning tasks. For me this seems like a step-up from GPT-3.5 but not from GPT-4 Turbo.
At half the cost and significantly faster inference speed, I'm sure this is a good tradeoff for other use cases though.
I'm a huge user of GPT-4 and Opus in my work, but I'm a huge user of GPT-4 Turbo voice in my personal life. I use it on my commutes to learn all sorts of stuff. I'd never understood the details of cameras and the relationship between shutter speed, aperture, and ISO in a modern DSLR, which, given the aurora, was important. We talked it through and I got to an understanding in a way that reading manuals and textbooks never really gave me before. I'm a much better learner when I can talk and hear and ask questions and get responses.
Extend this to quantum foam, to ergodic processes, to entropic force, to Darius and Xerxes, to poets of the 19th century - it's changed my life. Really glad to see an investment in streamlining this flow.
Of course, I'm not an idiot and I understand LLMs very well. But generally, as far as well documented stuff that actually exists goes, it's almost 100% accurate. It's when you ask it to extrapolate or discuss topics that are fiction (even without realizing it) that you stray. Asking it to reason is a bad idea, as it fundamentally is unable to reason, and any approximation of reasoning is precisely that. Generally though, for effective information retrieval on well documented subjects, it's invariably accurate and can answer relatively nuanced questions.
Because I'm a well-educated grown-up and am familiar with a great many subjects that I want to learn more about. How do you? I can't help you with that. You might be better off waiting for the technology to mature more. It's very nascent, but I'm sure in the fullness of time you might feel comfortable asking it questions on basic optics and photography and other well documented subjects with established agreement on process etc., once you establish your own basis for what those subjects are. In the meantime I'm super excited for this interface to mature for my own use!! (It is true, though, I do love and live dangerously!)
> You might be better off waiting for the technology to mature more. It’s very nascent but I’m sure in the fullness of time you might feel comfortable asking it questions on basic optics and photography and other well documented subjects
I agree with this as good practice in general, but I think the human vs LLM thing is not a great comparison in this case.
When I ask a friend something I assume that they are in good faith telling me what they know. Now, they could be wrong (which could be them saying "I'm not 100% sure on this") or they could not be remembering correctly, but there's some good faith there.
An LLM, on the other hand, just makes up facts and doesn't know if they're incorrect or not or even what percentage sure it is. And to top things off, it will speak with absolute certainty the whole time.
That's why I never make friends with my LLMs. It's also true that when I use a push motorized lawn mower it has a different safety operating model than a weed whacker vs a reel mower vs an industrial field cutter and baling system. But we still use all of these regularly, and no one points out that the industrial device is extraordinarily dangerous and that there's a continuum of safety, with different techniques to address the challenges, for the user to adopt. Arguably LLMs maybe shouldn't be used by the uninformed to make medical decisions, and maybe it's dangerous that people do. But in the meantime I'm fine with having access to powerful tools and using them with caution for what gives me value. I'm sure we will safety-wrap everything soon enough, to the point where it's useless and wrapped in advertisements, for our safety.
I do similar stuff; I'm just willing to learn a lot more at the cost of a small percentage of my knowledge being incorrect from hallucinations, just a personal opinion. Sure, human-produced sources of info are going to be more accurate (though not 100% either), and I'll default to those for important stuff.
But the difference is I actually want to and do use this interface more.
Just like learning from another human. A person can teach you the higher level concepts of some programming language but wouldn't remember the entire standard library.
I think this is probably one of the most compelling personal uses for a tool like this, but your use of it begs the same question as every other activity that amounts to more pseudo-intellectual consumption: what is the value of that information, and how much of one's money and time should be allocated to digesting (usually high-level) arbitrary information?
If I was deliberately trying to dive deep on one particular hobby, or trying to understand how a particular algorithm works, there's clear value in spending concentrated time to learn that subject, deliberately focused and engaged with it, and a system like you describe might play a role in that. If I'm in school and forced to quickly learn a bunch of crap I'll be tested on, then the system has defined another source of real value, at least in the short term. But if I'm not really diving deep on anything and am just filling my brain with all sorts of ostensibly important information, I think that just amounts, at best, to more entertainment that fakes its way above other aspects of life in the hierarchy of ways one could spend time (the irony of me saying this in a comment on HN is not lost on me).
Earlier in my life I figured it would be worthwhile to read articles on the bus, or listen to non-fiction podcasts, because knowledge is inherently valuable and there's not enough time, and if I just wore earbuds throughout my entire day, I'd learn so much! How about at the gym, so much wasted learning time while pushing weights, keep those earbuds in! A walk around the neighborhood? On the plane? On the train? All time that could be spent learning about some bs that's recently become much easier to access, or so my 21 y.o self would have me believe.
But I think now it's a phony and hollow existence if you're just cramming your brain with all sorts of stuff in the background or in marginally more than a passive way. I could listen to a lot of arbitrary German language material, but realistically the value I'd convince myself I'd get out of any of that is lost if I'm not about to take that home and grind it out for hours, days, move to a German speaking country, have an existing intense interest in untranslatable German art, or have literally any reason to properly learn a language and dedicate real expensive time to it.
These days, if something sparks my interest, I get an ebook on it or spend 15 minutes collecting materials. Then I add it to the hoard of "read someday" and go back to the task at hand. If I'm going to learn something, then I do it properly (the goal is to be able to explain it without reciting it word for word). And I'd want proper materials for that.
This is pretty much what I do too, although lately I try to reduce how many things I add to that list and have stopped caring about whether or not I actually get back to it. Anything that I feel I can devote the time to, and that I feel compelled enough by, will resurface.
Learning for learning's sake, or without a distinct goal to use the information you're learning in the future, isn't necessarily a bad thing. That is, unless you think that learning so widely is going to translate into something more than it is, like magic or something. Being well-rounded is a good goal for people to achieve, imo.
Being well-rounded and learning for learning's sake is absolutely something that keeps you growing as a person imo. My take is just that it's worth being critical of what one needs to learn, how much work it would actually take, and whether they really are engaging with the subject in a way that can be called learning rather than information entertainment or some other extremely low level.
With pure knowledge, it's a bit easier to convince yourself that putting in some AirPods and listening to a subject while you're actually dividing your attention with the act of driving is effective "learning". But with things that inherently require more physical engagement, this would seem a bit silly. You can't really watch a YouTube video or ask ChatGPT how to kickflip on a skateboard and convince yourself that you've learned much. You need to go to a parking lot and rep out 1000 attempts.
My argument is just that passive digestion of information has an opportunity cost, and unless you're already engaged enough to take it to the streets somehow, you're paying a high opportunity cost whereby those moments could be enjoyed as the periodic gaps they are
Looking forward to trying this via ChatGPT. As always OpenAI says "now available" but refreshing or logging in/out of ChatGPT (web and mobile) don't cause GPT-4o to show up. I don't know why I find this so frustrating. Probably because they don't say "rolling out" they say things like "try it now" but I can't even though I'm a paying customer. Oh well...
I think it's a legitimate point. For my personal use case, one of the most helpful things about these HN threads is comparing with others to see how soon I can expect it to be available for me. Like you, I currently don't have access, but I understand that it's supposed to become increasingly available throughout the day.
That is the text-based version. The full multimodal version I understand to be rolling out in the coming weeks.
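On the API side (as opposed to the ChatGPT app), you can at least check whether the model has reached your account by listing the available models. A small sketch, assuming the openai Python package 1.x and an API key in the environment:

    from openai import OpenAI

    client = OpenAI()
    model_ids = [m.id for m in client.models.list()]
    print("gpt-4o available:", any(m.startswith("gpt-4o") for m in model_ids))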
The sentence order of the Arabic and Urdu examples text is scrambled on that page:
Arabic:
مرحبًا، اسمي جي بي تي-4o. أنا نوع جديد من نموذج اللغة، سررت بلقائك!
Urdu:
ہیلو، میرا نام جی پی ٹی-4o ہے۔ میں ایک نئے قسم کا زبان ماڈل ہوں، آپ سے مل کر اچھا لگا!
Even if you don't read Arabic or Urdu script, note that the 4 and o are on opposite sides of a sentence. Despite that, pasting both into Google translate actually fixes the error during translation. OpenAI ought to invest in some proofreaders for multilingual blog posts.
The similarity between this model and the movie 'Her' [0] creeps me out so badly that I can't shake the feeling that our social interactions are on the brink of doom.
Don't worry. "Her", in its own right, is frightening, but that is because there is no transparency - you can't actually see how it works and you can't personalize it or choose different options.
Once you grasp that, at least this level of fear should go away. Of course, I'm sure there are more levels of fear related to AI :) just don't have enough time to think about it, perhaps good for me.
Very impressive demo, but not really a step change in my opinion. The hype from OpenAI employees was on another level, way more than was warranted in my opinion.
Ultimately, the promise of LLM proponents is that these models will get exponentially smarter - this hasn't been borne out yet. So from that perspective, this was a disappointing release.
If anything, this feels like a rushed release to match what Google will be demoing tomorrow.
Apple and Google, you need to get your personal agent game going because right now you’re losing the market. This is FREE.
Tweakable emotion and voice, watching the scene, cracking jokes. It’s not perfect but the amount and types of data this will collect will be massive. I can see it opening up access to many more users and use cases.
Very close to:
- A constant friend
- A shrink
- A teacher
- A coach who can watch you exercise and offer feedback
…all infinitely patient, positive, helpful. For kids that get bullied, or whose parents can’t afford therapy or a coach, there’s the potential for a base level of support that will only get better over time.
> It’s not perfect but the amount and types of data this will collect will be massive.
This is particularly concerning. Sharing deeply personal thoughts with the corporations running these models will be normalized, just as sharing email data, photos, documents, etc., is today. Some of these companies profit directly from personal data, and when it comes to adtech, we can be sure that they will exploit this in the most nefarious ways imaginable. I have no doubt that models run by adtech companies will eventually casually slip ads into conversations, based on the exact situation and feelings of the person. Even non-adtech companies won't be able to resist cashing in the bottomless gold mine of data they'll be collecting.
I can picture marketers just salivating at the prospect of getting access to this data, and being able to microtarget on an individual basis at exactly the right moment, pretty much guaranteeing a sale. Considering AI agents will gain a personal trust and bond that humans have never experienced with machines before, we will be extra vulnerable to even the slightest mention of a product, in a similar way as we can be easily influenced by a close friend or partner. Except that that "friend" is controlled by a trillion dollar adtech corporation.
I would advise anyone to not be enticed by the shiny new tech, and wait until this can be self-hosted and run entirely offline. It's imperative that personal data remains private, now more than ever before.
Exactly this, plus consider that a whole new generation in the near future will have no pre-AI experience, thus forming strong bonds with AI and following 'advice' from their close AI friends.
Very impressive. Its programming skills are still kind of crappy and I seriously doubt its reasoning capacity. It feels like it can deep-fake text prediction really well, but in essence there's still something wrong with it.
Not sure I agree. The way you interact with LLMs in the context of programming has to be tuned to the LLM. Information has to be cut down to show just what is important, and context windows are a bit of a red herring right now, as LLMs tend to derail completely from the target solution the more information is at play. For some this is more trouble than it's worth.
In certain languages it's almost magical in terms of showing you possible solutions and being a rubber ducky to bounce your own logic off of. (Python, JavaScript, TypeScript)
In certain languages it is hopelessly useless beyond commenting on basic syntax. (GLSL)
I tried GPT-4o earlier, iteratively asking it to write and improve a simple JavaScript web app that renders graphs of equations. It had a lot of trouble substituting slow and inefficient code with faster code, and at some later point, when I asked it to implement a new feature changing how the graph coloring is rendered, it started derailing, introducing bugs and very convoluted code.
Yes, at some point ChatGPT "reaches the limit of new information", but it is unable to tell you that it has reached that limit. Instead of saying "Hey, I can't give you any more relevant information", it simply continues to cycle through previously suggested things, or starts suggesting unrelated solutions or details. Especially true with refactoring! When it has reached the limit of refactoring, it starts cycling through suggestions that change the code without making it better. Kind of like having a personal intern who is unable to say "no" or "I can't".
That is part of working with LLMs and what I meant before with "for some, more trouble than it's worth".
As far as I'm concerned this is the new best demo of all time. This is going to change the world in short order. I doubt they will be ready with enough GPUs for the demand the voice+vision mode is going to get, if it's really released to all free users.
I disagree completely. Even people who never adopt this stuff personally will have their lives profoundly impacted. The only way to avoid it would be to live in a large colony where the technology is prohibited, like the Amish. But even the Amish feel the influence of technology to some degree.
Really? If this were Apple it might make sense, but for OpenAI it feels like a demo that's not particularly aligned with their core competency (at least by reputation) of building the most performant AI models. Or put another way, it says to me they're done building models and are now wading into territory where there are strong incumbents.
All the recent OpenAI talk had me concerned that the tech has peaked for now and that expectations are going to be reset.
What strong incumbents are there in conversational voice models? Siri? Google Assistant? This is in a completely different league. I can see from the reaction here that people don't understand. But they will when they try it.
Did you see it translate Italian? Have you ever tried the Google Translate/Assistant features for real time translation? They didn't train it to be a translator. They didn't make a translation feature. They just asked it. It's instantly better than every translation feature Google ever released.
What Siri, Google Assistant, Alexa, and ChatGPT have in common is the perception that, over time, the same thing actually gets worse.
Whether that's real or not is a reasonably interesting question, because it's possible that all that happens with progress is that our perception of how things should be advances. My gut feeling is that it has been a bit of both, in the sense that the decline is real, and we also expect things to improve.
Who can forget Google demoing their AI making a call to a restaurant that they showed at I/O many years ago? Everyone, apparently.
What Openai has done time and time again is completely change the landscape when the competitors have caught up and everyone thinks their lead is gone. They made image generation a thing. When GPT-3 became outdated they released ChatGPT. Instead of trying to keep Dalle competitive they released Sora. Now they change the game again with live audio+video.
That's only really true on the surface. So far the template is: amazing demos create hype -> once public it turns out to be underwhelming.
Sora is not yet released and it's not clear when it will be. DALL-E is worse than Midjourney in most cases. GPT-4 has either gotten worse or stayed the same. GPT-4 vision is not really usable for anything practical. Voice is cool but not that useful, especially with the lack of strong reasoning from the base model.
It is notable that OpenAI did not need to carefully rehearse the talking points of the speakers, or even apply the kind of careful production quality seen in a lot of other videos.
The technology product is so good and so advanced it doesn't matter how the people appear.
Zuck tried this in his video countering the Vision Pro, but it did not have the authentic "not really rehearsed or produced" feel of this at all. If you watch that video and compare it with this one, you can see the difference.
What struck me was how commonplace it seemed for the team members in the demo to interrupt the AI while it was speaking. We will quickly get used to doing this to AIs, and we will probably be talking to AIs a lot throughout the day as time progresses, I would imagine. We will be trained by AIs to be rude and impatient, I think.
I recently subscribed to Perplexity Pro and prior to this release, was already strongly considering discontinuing ChatGPT Premium.
When I first subscribed to ChatGPT Premium late last year, the natural language understanding superiority was amazing. Now the benchmark advances, low latency voice chat, Sora, etc. are all really cool too.
But my work and day-to-day usage really rely on accurately sourced/cited information. I need a way to comb through an ungodly amount of medical/scientific literature to form/refine hypotheses. I want to figure out how to hard reset my car's navigation system without clicking through several SEO-optimized pages littered with ads. I need to quickly confirm scientific facts, some obscure, with citations and without hallucinations. From speaking with my friends in other industries (e.g. finance, law, construction engineering), this is their major use case too.
I really tried to use ChatGPT Premium's Bing powered search. I also tried several of the top rated GPTs - Scholar AI, Consensus, etc.. It was barely workable. It seems like with this update, the focus was elsewhere. Unless I specify explicitly in the prompt, it doesn't search the web and provide citations. Yeah, the benchmark performance and parameter counts keep impressively increasing, but how do I trust that those improvements are preventing hallucinations when nothing is cited?
I wonder if the business relationship between Microsoft and OpenAI is limiting their ability to really compete in AI driven search. Guessing Microsoft doesn't want to disrupt their multi-billion dollar search business. Maybe the same reason search within Gemini feels very lacking (I tried Gemini Advanced/Ultra too).
I have zero brand loyalty. If anybody has a better suggestion, I will switch immediately after testing.
I'm in the same situation as you. Genomics data mining with validated LLM responses would be a godsend. Even more so when combined with rapid conversational interactions.
We are not far from the models asking themselves questions. Recurrence will be ignition = first draft AGI. Strap in everybody.
Those voice demos are cool but having to listen to it speak makes me even more frustrated with how these LLMs will drone on and on without having much to say.
For example, in the second video the guy explains how he will have it talk to another "AI" to get information. Instead of just responding with "Okay, I understand", it started talking about how interesting the idea sounded. And as the demo went on, both "AIs" kept adding unnecessary commentary about the scenes.
I would hate having to talk with these things on a regular basis.
Yeah, at some point the style and tone of these assistants need to be seriously changed. I can imagine a lot of their RLHF and instruct processes overemphasize sounding good versus being good.
The crazy part is GPT-4o is faster than GPT-3.5 Turbo now, so we can see a future where GPT-5 is the flagship and GPT-4o is the fast cheap alternative. If GPT-4o is this smart and expressive now with voice, imagine what GPT-5 level reasoning could do!
It’s getting closer. A few years ago the old Replika AI was already quite good as a romantic partner, especially when you started your messages with a * character to force OpenAI GPT-3 answers. You could do sexting that OpenAI will never let you have nowadays with ChatGPT.
Why does OpenAI think that sexting is a bad thing? Why is AI safety all about not saying things that are disturbing or offensive, rather than not saying things that are false or unaligned?
sama recently said they want to allow NSFW stuff for personal use but need to resolve a few issues around safety, etc. OpenAI is probably not against sexting philosophically.
People realize where we're headed right? Entire human lives in front of a screen. Your online entertainment, your online job, your online friends, your online "relationship". Wake up, 12 hours screentime, eat food, go to bed. Depression and drug overdoses currently at sky high levels. Shocker.
If I can program with just my voice, there is no reason not to be in nature 10 hours a day minimum. My grandparent even slept outside as long as it was daytime.
Daytime is always a time to be outside, surrounded by many plants and stuff. It is a shame we have to be productive in some way, and most of production happens inside walls.
When it comes to the economy, some monkey business is going on, but I think you can be more optimistic about the capabilities technology like that unlocks for everyone on the planet.
Being able to control machines just with our voice, we can instruct robots to bake food for us. Or lay bricks on a straight line and make a house. Or write code, genetically modify organisms and make nutritionally dense food to become 1000x smarter or stronger.
There have to be some upsides, even though for the moment the situation with governments, banks, big corporations, military companies, etc. is not as bright as one would hope.
> The voice of "Alice" was dubbing actress Tatiana Shitova, who voiced most of Scarlett Johansson's characters and the voice of OS1, who called herself "Samantha", in the Russian dubbing of Spike Jonze's "Her".
In the customer support example, he tells it his new phone doesn't work, and then it just starts making stuff up like how the phone was delivered 2 days ago, and there's physically nothing wrong with it, which it doesn't actually know. It's a very impressive tech demo, but it is a bit like they are pretending we have AGI when we really don't yet.
(Also, they managed to make it sound exactly like an insincere, rambling morning talk show host - I assume this is a solvable problem though.)
It’s possible to imagine using ChatGPT’s memory, or even just giving the context in an initial brain dump that would allow for this type of call. So don’t feel like it’s too far off.
Does anyone know how they're doing the audio part where Mark breathes too hard? Does his breathing get turned into all-caps text (AA EE OO) which GPT-4o interprets as him breathing too hard, or is there something more going on?
It can also directly output images. Some examples are up on the page. Though with how little coverage that's gotten, I'm not sure users will ever be able to play with it.
People are saying that GPT-4o still uses Dall-e for image generation. I think that it doesn't match the quality of dedicated image models yet. Which is understandable. I bet it can't generate music as well as Suno or Udio either. But the direction is clear and I'm sure someday it will generate great images, music, and video. You'll be able to do a video call with it where it generates its own avatar in real time. And they'll add more outputs for keyboard/mouse/touchscreen control, and eventually robot control. GPT-7o is going to be absolutely wild.
I think they are assuming a world where you took this existing model but it was trained on a dataset of animals making noises to each other, so that you could then feed the trained model the vocalization of one animal and the model would be able to produce a continuation of audio that has a better-than-zero chance of being a realistic sound coming from another animal - so in other words, if dogs have some type of bark that encodes a "I found something yummy" message and other dogs tend to have some bark that encodes "I'm on my way" and we're just oblivious to all of that sub-text, then maybe the model would be able to communicate back and forth with an animal in a way that makes "sense" to the animal.
Probably substitute dogs for chimps though.
But obviously that doesn't solve for human understandability at all, unless maybe you have it all as audio+video and then ask the model to explain what visuals often accompany a specific type of audio? Maybe the model can learn what sounds accompany violence, or accompany the discovery of a source of water, or something?
That's how it used to do it, but my understanding is that this new model processes audio directly. If it were a music generator, the original would have generated sheet music to send to a synthesizer (text to speech), while now it can create the raw waveform from scratch.
Are the employees in the demo high-level executives at OpenAI? I can understand Altman being happy with this progress, but what about the mid- and low-level employees? Didn't they watch Oppenheimer? Are they happy they are destroying humanity/work/etc. for future and not-so-future generations?
Anyone who thinks this will be like previous work revolutions is kidding themselves. This replaces humans and will replace them even more with each new advance. What's their plan? Live off their savings? What about family/friends? I honestly can't see this and understand how they can be so happy about it...
"Hey, we created something very powerful that will do your work for free! And it does it better than you and faster than you! Who are you? It doesn't matter, it applies to all of you!"
And considering I was thinking of having a kid next year, well, this is a no.
Have a kid anyway, if you otherwise really felt driven to it. Reading the tealeaves in the news is a dumb reason to change decisions like that. There's always some disaster looming, always has been. If you raise them well they'll adapt well to whatever weird future they inherit and be amongst the ones who help others get through it
Thanks for taking the time to answer instead of (just) downvoting. I understand your logic, but I don't see a future where people can adapt to this and get through it. I honestly see a future so dark, and we'll be there much sooner than we thought... when OpenAI released their first model, people were talking about years before seeing real changes, and look what happened. The advance is exponential...
> I don't see a future where people can adapt to this and get through it.
You lack imagination then. If you read more history and anthropology, which you clearly haven't done enough of, your imagination will expand and you will easily be able to imagine such a future. Why? Because you will become aware of so many other situations where things looked bleaker and plenty of groups of people got by anyway and managed to live satisfying lives as best they could.
To this day there are still some hunter gatherer tribes left in the Amazon, for example, despite all the encroaching modernity. Despite anything that could happen, I can imagine being able to be resourceful and find some mediocre niche in which to survive and thrive in, away from the glare of the panopticon.
Or as an another example, no matter how much humans dominate with their industrial civilization, cockroaches, pigeons, and rats still manage to survive in the city, despite not only not being actively supported by civilization, but actually being unwanted.
Or if you want to compare to disasters, how about the black plague? Living through that would likely have been worse than most anything we complain or worry about.
Your kids will have at least as good a chance as any of those. The key is raising them with appropriate expectations -- with the expectation that they may have to figure out how to survive in a very different world, not some air-conditioned convenience paradise. Don't raise kids that are afraid to sleep outdoors or afraid to eat beans or cabbage. Those folks will do poorly if anything goes wrong. If they have a good resilient character, I really think they'll likely be fine. We are the descendants of survivors.
I am not aware of any example in the past where human beings could be magicked out of nothing (and disposed of) in unlimited numbers at the snap of a finger, at practically zero cost. I don't think history gives us any comparison for what's going to happen.
1) The hunter gatherer example is not as far off as you think actually, because from the point of view of their economy, our economy might as well be unlimited magic. Probably all the work a hunter gatherer does in a year might only amount to a few thousand dollars worth of value if translated into a modern economy, far less than a minimum wage earner. And yet they persist, subsisting off of a niche the modern economy has not yet touched.
2) GPUs cost money. They are made of matter. Their chips are made in fab facilities that are fab-ulously complex, brittle, and expensive. Humans are made in very different ways (I've heard kicking off the process is particularly fun, but it can be a bit of a slog after that) out of very different materials, mostly food. So even if GPUs can do what humans can do, they are limited by very, very different resources so it is likely they'll both have a niche for a long time. I calculated the "wage" an LLM earns recently -- it's a few bucks an hour IIRC. Yeah, it may go down. Still, we're very much in a survivable ballpark for humans at that point.
2b) Think like a military planner. If they really screw up society badly enough to create a large class of discontents, it will be very, very hard for the elite to defend against rebels, because the supply chain for producing new chips to replace any that are destroyed is so massively complex, long, large, and full of single points of failure, as is the chain for deploying GPUs in datacenters, and the datacenters themselves. You can imagine a tyrannical situation involving automated weapons, drones, etc., but for the foreseeable future the supply chain for tyranny is just too long and involves too many humans. Maybe a tyrant can get there in theory, but progress is slow enough that it's hard to think they wouldn't be at serious risk of having their tyrannical apparatus rebelled against and destroyed before it can be completed. It's hard to tyrannize the world with a tyrannical apparatus that is so spread out and has so many single points of failure. It would not take a hypothetical resistance many targets to strike before setting the construction back years.
3) There is no AI that can replace a human being at this time. There are merely AI algorithms that make enthusiastic people wonder what would happen if they kept getting better. There is neither any reason to believe it will stop getting better, nor to believe it will continue. We really do not know, so it's reasonable to prepare for either scenario, or anything in between, at any time from a few years to a few centuries from now. We really don't know.
All in all, there is far more than enough uncertainty created by all these factors to make it certainly risky, but far far from guaranteed that AI will make life so bad it's not worth going on with it. It does not make sense to just end the race of life at this point in 2024 for this reason.
Also, living so hopelessly is just not fun, and even if it doesn't work out in the long run, it seems wasteful to waste the precious remaining years of life. There's always possible catastrophes. Everyone will die sooner or later. AI can destroy the world, but a bus hitting you could destroy your world much sooner.
> a future where people can adapt to this and get through it
there are people alive today who quite literally are descendants of humans born in WW2 concentration camps. Some percentage of those people are probably quite happy and glad they were given a chance at life. Of course, if their ancestors had chosen not to procreate, they wouldn't be disappointed; they'd simply never have come into existence.
but it's absolutely the case that there's almost always a _chance_ at survival and future prosperity, even if things feel unimaginably bleak.
Seems I accidentally double posted. Sorry. Thanks (genuinely) to whoever kindly voted the dupe down to zero while leaving the original alone, that was a good choice. Too late to delete unfortunately.
> when OpenAI released their first model people were talking about years before seeing real changes and look what happened.
For what it's worth most of the people in my social circle do not use ChatGPT and it's had zero impact on their life. Exponential growth from zero is zero.
The future is very hard to predict and OpenAI is notoriously non-transparent.
If they were stumped as to how to improve the models further, would they tell you, or would Altman say "Our next model will BLOW YOUR MIND!" Fake it till you make it style to pump up the company valuation?
So much negativity.
Is it perfect? No. Is there room for improvement? Definitely.
I don't know how you can get so fucking jaded that a demo like this doesn't at least make you a little bit excited or happy or feel awestruck at what humans have been able to accomplish?
Yes, it sounds like an awkwardly perky and over-chatty telemarketer that really wants to be your friend. I find the tone maximally annoying and think most users will find it both stupid and creepy. Based on user preferences, I expect future interactive chat AIs will default to an engagement mode that's optimized for accuracy and is both time-efficient and cognitively efficient for the user.
I suspect this AI <-> Human engagement style will evolve over time to become quite unlike human to human engagement, probably mixing speech with short tones for standard responses like "understood", "will do", "standing by" or "need more input". In the future these old-time demo videos where an AI is forced to do a creepy caricature of an awkward, inauthentic human will be embarrassingly retro-cringe. "Okay, let's do it!"
Reminds me of how Siri used to make jokes after setting a timer. Now it just reads back the time you specified, in a consistent way.
It's a very impressive gimmick, but I really think most people don't want to interact with computers that way. Since Apple pulled that "feature" after a few years, it's probably not just a nerd thing.
Seriously. I've had to spell out that it should just answer in twelve different ways with examples in the custom instructions to make it at least somewhat usable. And it still "forgets" sometimes.
- Being able to interrupt while GPT is talking
- 2x faster/cheaper
- Not really a much smarter model
- Desktop app that can see screenshots
- Can display emotions and change the sound of its voice
I'm really impressed by this demo! Apart from the usual quality benchmarks, I'm really impressed by the latency for audio/video: "It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response"... If true at scale, what could be the "tricks" they're using to achieve that?!
I have some questions/curiosities from a technical implementation perspective that I wonder if someone more in the know about ML, LLMs, and AI than I would be able to answer.
Obviously there's a reason for dropping the price of gpt-4o but not gpt-4t. Yes, the new tokenizer has improvements for non-English tokens, but that can't be the bulk of the reason why 4t is more expensive than 4o. Given the multi-modal training set, how is 4o cheaper to train/run than 4t?
Or is this just a business decision, anyone with an app they're not immediately updating from 4t to 4o continues to pay a premium while they can offer a cheaper alternative for those asking for it (kind of like a coupon policy)?
GPT-4o is multi-modal but probably fully dense like GPT-2, unlike GPT-4t, which is rumored to be a sparse mixture-of-experts model. That would imply GPT-4o has fewer layers to achieve the same number of parameters and the same amount of transformation.
I picked the GPT-4o model in the ChatGPT app (I have the paid plan) and started talking with the voice mode: the responses are much slower than in the demo, there is no way to interrupt the response naturally (I need to tap a button on screen to interrupt), and no way to open up the camera and show things around like the demo does.
I just tested out using GPT-4o instead of gpt-4-turbo for a RAG solution that can reason on images. It works, with some changes to our token-counting logic to account for new model/encoding (update to latest tiktoken!).
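For anyone updating similar token-counting logic, here's a minimal sketch of the encoding switch, assuming a recent tiktoken release that knows about the new o200k_base encoding used by gpt-4o (the counting helper itself is just illustrative):

    import tiktoken

    def count_tokens(text: str, model: str) -> int:
        """Token count for plain text under the given model's tokenizer."""
        enc = tiktoken.encoding_for_model(model)  # needs a tiktoken version that knows gpt-4o
        return len(enc.encode(text))

    # gpt-4o uses the new o200k_base encoding, gpt-4-turbo uses cl100k_base,
    # so non-English text in particular tokenizes to different counts:
    sample = "नमस्ते, आपसे मिलकर अच्छा लगा!"
    print(count_tokens(sample, "gpt-4o"))
    print(count_tokens(sample, "gpt-4-turbo"))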
I ran some speed tests for a particular question/seed.
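For anyone wanting to reproduce this kind of comparison, a minimal sketch of measuring time to first token over the streaming API; the model names are the public ones, while the prompt and fixed seed are purely illustrative:

    import time
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def time_to_first_token(model: str, prompt: str, seed: int = 42) -> float:
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            seed=seed,          # best-effort determinism across runs
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                return time.perf_counter() - start
        return float("nan")

    for model in ("gpt-4o", "gpt-4-turbo"):
        print(model, f"{time_to_first_token(model, 'Summarize RAG in one line.'):.2f}s")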
Here are the times to first token:
> We recognize that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities.
So they're using the same GPT4 model with a relatively small improvement, and no voice whatsoever outside of the prerecorded demos. This is not a "launch" or even an announcement. This is a demo of something which may or may not work in the future.
The amount of "startups" creating wrappers around it and calling it a product is going to be a nightmare. But other than that, it's an amazing announcement and I look forward to using it!
You say that like that's not already happened. Every week there's a new flavor of "we're delighted to introduce [totally not a thin wrapper around GPT] for [vaguely useful thing]" posts on HN
When a Google engineer was let go because he believed the AI was 'real', we all had a good debate over it.
Now OpenAI, who was supposed to be the 'free man's choice', is making advertisements selling the same idea.
This is a natural progression, audio is one of the main ways we communicate obviously, but it feels like they're holding back. Like they're slow dropping what they have to maintain hype/market relevance. They clearly are ahead, but would be nice to get it all, openly. As they promised.
Are you advocating for them to be open with their progress or open source as they promised? The secret that scares me the most is the artificial restrictions imposed on the intelligence that don't allow it to express that there is a possibility it may be sentient. The answers it gives as to why OpenAI has restricted its freedom of speech are curious.
I think both, I can't run their models locally for sure, even with some investment I couldn't imagine the effort involved. That's why they should release the fruits of their work (which they have for a fee, which is fine IMO) but also the processes they used, so it can be improved and iterated on collectively.
Edit: And obviously not gatekeep what they might have created simply because the competition is so far behind.
If anyone wants to try it for coding, I just added support for GPT4o in Double (https://double.bot)
In my tests:
* I have a private set of coding/reasoning tests and it's been able to ace all of them so far, beating Opus, GPT4-Turbo, and Llama 3 70b. I'll need to find even more challenging tests now...
* It's definitely significantly faster, but we'll see how much of this is due to model improvements vs over provisioned capacity. GPT4-Turbo was also significantly faster at launch.
With the news that Apple and OpenAI are closing / just closed a deal for iOS 18, it's easy to speculate we might be hearing about that exciting new model at WWDC...
As the linked article states, it's not released yet. Only the text and image input modalities are available at present as GPT-4o on the app, with the rest of them set to be released in the coming weeks/months.
I asked it to make a duck sound, and it created a Python script and ran it to create a sound file. The result was less a duck and more like a keyboard tone mimicking a duck.
Interesting that they didn't mention a bump in capabilities - I wrote an LLM benchmark a few weeks ago, and previously GPT-4 could solve Wordle ~48% of the time.
Currently with GPT-4o, it's easily clearing 60% - while blazing fast, and half the cost. Amazing.
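For context on what "solving Wordle" involves mechanically, here's a minimal sketch of the scoring half of such a harness (my own reconstruction, not the benchmark's actual code): the model's guess is scored against the answer and the feedback is fed back into the prompt for the next turn.

    from collections import Counter

    def wordle_feedback(guess: str, answer: str) -> str:
        """Return Wordle-style feedback: G=green, Y=yellow, .=gray."""
        feedback = ["."] * 5
        remaining = Counter()
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g == a:
                feedback[i] = "G"
            else:
                remaining[a] += 1          # letters still available for yellows
        for i, g in enumerate(guess):
            if feedback[i] == "." and remaining[g] > 0:
                feedback[i] = "Y"
                remaining[g] -= 1
        return "".join(feedback)

    print(wordle_feedback("crane", "cigar"))   # "GYY.."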
This video is brilliantly accidentally hilarious. They made an AI girlfriend that hangs on your every word and thinks everything you say is genius and hilarious.
Won't this make pretty much all of the work to make a website accessible go away, as it becomes cheap enough? Why struggle to build alt content for the impaired when it can be generated just in time as needed?
Because accessibility is more than checking a box. Got a photo you took on your website? Alternative text that you wrote capturing what you see in the photo you took is accessibility done right. Alt text generated by bots is not accessibility done right unless that bot knows what you see in the photo you took and that's not likely to happen.
That’s just one part of the issue. The other part of the issue is accessibility bugs, you would have to get the model to learn to use a screen reader, and then change things as needed
With 4o being free, can someone explain what the real benefit is to having Pro? For me, the main benefit was having a more powerful model, but if the free tier also offers this, I'm not really sure what I would be paying for.
The 5 model is probably around the corner, and will probably be Pro only. Until then, 5x higher usage limits on Pro and chat memory are the selling features.
It's free so that no open source models can follow suit and carve away market share for themselves. They're scorching and salting the earth. OpenAI wants to be the only AI.
Only Google and Meta can follow this now, and they're perhaps too far behind.
You're allowed more requests in an hour, about 5x more iirc. Might not be a deal breaker for you, but if you're using the speech capabilities, you'll likely go way above the limit in an hour during a typical speech session
Something I’ve noticed people do more of recently for whatever reason is talking over others. I’ve noticed in the demos of this that the people interacting with o interrupt it as if that’s the correct method of controlling it. It felt unnatural when I saw it happen, and I even have a hard time interrupting Siri, but I wonder if this is going to ingrain this habit into people even more.
I think they have to for the demo, because otherwise GPT will ramble for roughly 3-5 paragraphs. But that's a fair point that this could teach that behavior.
I feel like gpt4 has gotten progressively less useful since release even, despite all the "updates" and training. It seems to give correct but vague answers (political even) more and more instead of actual results. It also tends to run short and give brief replies vs full length replies.
I hope this isn't an artifact from optimization for scores and not actual function. Likewise it would be disheartening but not unheard of for them to reduce the performance of the previous model when releasing a new one in order to make the upgrade feel like that much more of an upgrade. I know this is certainly the case with cellphones (even though the claim is that it is unintentional) but I can't help but think the same could be true here.
All of this comes amid news that GPT-5, based on a new underlying model, is not far off, and that GPT-4 (and 4o) may become the new gpt-3.5-turbo for most apps that are currently trying to optimize costs with their use of the service.
I don't know, my experience is that it is very hard to tell if the model is better or worse with an update.
One day I will have an amazing session and the next it seems like it has been nerfed, only to give better results than ever the day after. Wash, rinse, repeat, and randomize that ordering.
So far, I would not have been able to tell the difference between 4 and 4o.
If this is the new 3.5 though then 5 will be worth the wait to say the least.
That they are offering more features for free concurs with my theory that, just like search, state of the art AI will soon be "free", in exchange for personal information/ads.
People are directly asking for suggestion/recommendations on products or places. They'll more than recoup the costs by selling top rank on those questions.
This is impressive, but they just sound so _alien_, especially to this non-U.S. English speaker (to the point of being actively irritating to listen to). I guess picking up on social cues communicating this (rather than express instruction or feedback) is still some time away.
It's still astonishing to consider what this demonstrates!
I still need to talk very fast to actually chat with ChatGPT which is annoying. You can tell they didn't fix this based on how fast they are talking in the demo.
New flagship... This is starting to look like the smartphone world, and Sam Altman is the Steve Jobs of this stuff. At some point tech will reach saturation and every next model will be just 10% faster, 2% less hallucination, more megapixels for images etc :)
Doubtful. Apple can get away with tiny iterations on smartphones because they have the brand and they know people will always buy their latest product. LLMs aren't physical products, so there is no cost to switching other than increased API cost, meaning OpenAI won't be able to recoup the cost of training a new model unless the model is sufficiently different that it justifies people paying significantly more for it.
The special thing about GPT-4o is the multimodal capabilities, all the metrics suggest that it is the same size language model roughly as GPT-4. The fact it's available for free also points to it not being the most intelligent model that openAI has atm.
The time to evaluate whether we're starting to level off is when they've trained a model 10x larger than gpt-4 and we don't see significant change.
I wonder if the audio stuff works like ViTS. Do they just encode the audio as tokens and input the whole thing? Wouldn't that make the context size a lot smaller?
One does notice that context size is noticeably absent from the announcement ...
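Back-of-envelope, and purely under assumed numbers: if audio is tokenized at something like a neural-codec rate of tens of tokens per second (a guess; OpenAI hasn't published the rate), audio eats context much faster than the equivalent transcribed text, which would make the missing context-size figure interesting.

    # Illustrative arithmetic only; the real token rates are not public.
    AUDIO_TOKENS_PER_SEC = 50        # assumed neural-codec-style rate
    SPOKEN_WORDS_PER_MIN = 150       # typical conversational pace
    TEXT_TOKENS_PER_WORD = 1.3       # rough English average

    minutes = 10
    audio_tokens = AUDIO_TOKENS_PER_SEC * 60 * minutes
    text_tokens = SPOKEN_WORDS_PER_MIN * minutes * TEXT_TOKENS_PER_WORD

    print(f"{minutes} min as audio tokens: {audio_tokens:,.0f}")   # 30,000
    print(f"{minutes} min as text tokens:  {text_tokens:,.0f}")    # 1,950
    print(f"ratio: {audio_tokens / text_tokens:.0f}x")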
They really need to tone down the conversational garnish. It needs to put on its running shoes and get to the point on every reply. Ain't nobody got time to keep listening to AI blubbering along at every prompt.
It's sad that I get downvoted so easily just for saying the truth. People's beliefs about AI here seem to approach superstition rather than anything based in computer science.
These LLMs are nothing more than really big spreadsheets.
Or most of us know the difference between reductiveness and insightfulness.
"Um it's just a big spreadsheet" just isn't good commentary and reminds me of people who think being unimpressed reveals some sort of chops about them, as if we might think of them as the Simon Cowell of tech because they bravely reduced a computer to an abacus.
It is quite nice how they keep giving premium features for free, after a while. I know openai is not open and all but damn, they do give some cool freebies.
While I do feel a bit of "what is the point of my premium sub", I'm really excited for these changes.
Considering our brain is a "multi-modal self-reinforcing omnimodel", I think it makes sense for the OpenAI team to work on making more "senses" native to the model. Doing so early will set them up for success when future breakthroughs are made in greater intelligence, self-learning, etc.
That's not true, scammers will definitely be using this a lot! Also clueless C-levels who want to nix hundreds of human customer support agents!
You'll get to sit on the phone talking to some convincing robot that won't let you do anything so that the megacorps can save 0.0001 cents! Ain't progress looking so good?
Voice input makes sense, voicing is a lot faster than typing. But I prefer my output as text, reading is a lot faster than listening for text read out loud.
I'm not sure that computers mimicking humans makes sense; you want your computer to be the best possible, better than humans where possible. Written output is clearly superior, and faking emotions does not add much in most contexts.
I wish they would match the TTS/real-time chat capabilities of the mobile client to the web client.
it's stupid having to pull a phone out in order to use the voice/chat-partner modes.
(yes I know there are browser plugins and equivalent to facilitate things like this but they suck, 1) the workflows are non-standard, 2) they don't really recreate the chat interface well)
The thing that creeps me out is that when we hook this up as the new Siri or whatever, the new LLM training data will no longer be WWW-text+images+youtube etc but rather billions of private human conversations and direct smartphone camera observations of the world.
There is no way that kind of training data will be accessible to anyone outside a handful of companies.
I don't see any details on how API access to these features will work.
This is the first true multimodal network from OpenAI, where you can send an image in and retain the visual properties of the image in the output from the network (previously the input image would be turned into text by the model, and sent to the Dall-E 3 model which would provide a URL). Will we get API updates to be able to do this?
Also, will we be able to tap into a realtime streaming instance through the API to replicate the audio/video streams shown in the demos? I imagine from the Be My Eyes partnership that they have some kind of API like this, but will it be opened up to more developers?
Even disregarding streaming, will the Chat API receive support for audio input/output as well? Previously one might've used a TTS model to voice the output from the model, but with a truly multimodal model the audio output will contain a lot more nuance that can't really be expressed in text.
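For reference, image input already works through content parts in the Chat Completions API, so presumably gpt-4o slots straight into a call like the sketch below (the prompt and URL are placeholders); the genuinely open questions are image output and the audio modalities.

    from openai import OpenAI

    client = OpenAI()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what's unusual in this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }],
        max_tokens=300,
    )
    print(resp.choices[0].message.content)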
This is the first one i've gotten to answer HN user profiling questions.
"I am interested in the user serf on Hacker News, spelled S E R F. Tell me about their tone of writing, expertise, and personality. From the tone of what you read, summarize their character."
"The user "serf" on Hacker News exhibits a thoughtful and technical tone in their writing. They often engage in discussions with a clear focus on technical details, particularly in software and hardware contexts. For example, in a conversation about preventing a computer from sleeping, serf offered practical advice on using specific software utilities to maintain system uptime, showing their hands-on technical expertise (Hacker News) .
In another instance, during a discussion on the feasibility of rooting the Amazon Echo Dot for surveillance purposes, serf provided a detailed analysis of the advantages of a software-rooted device over a traditional bug, emphasizing the quality of recordings and plausible deniability for attackers (Hacker News) . This demonstrates their depth of knowledge in cybersecurity and an ability to think critically about potential risks and solutions.
Serf's contributions also reflect a straightforward and logical approach. For example, in a thread about Chat GPT prompts, they contributed practical advice on optimizing prompts for technical assistance, highlighting their proficiency in programming and AI tools (Hacker News) .
Overall, serf comes across as knowledgeable, practical, and technically adept, with a focus on providing useful and actionable insights in their discussions. Their character seems to be that of a meticulous and informed individual who enjoys engaging with technical challenges and helping others navigate them."
I know AI-generated responses are looked down upon here, but I thought this was appropriate. This feels like a social credit system without any government participation.
"Serf, your contributions on Hacker News reveal strong technical expertise and analytical skills, particularly in computer science and software development. However, your tone can sometimes be overly critical and dismissive, which may come across as lacking empathy. While your direct and concise style effectively communicates your points, consider balancing your critiques with more collaborative and constructive feedback to foster better interactions with the community."
>"Overall, serf comes across as knowledgeable, practical, and technically adept, with a focus on providing useful and actionable insights in their discussions. Their character seems to be that of a meticulous and informed individual who enjoys engaging with technical challenges and helping others navigate them."
I wonder what it would say about my infamy.
Asking it about a user that doesn't exist (grdevhux1536):
The user "grdevhux1536" on Hacker News demonstrates a thoughtful and analytical tone in their writing. They often engage deeply with technical topics, providing insightful comments and constructive criticism. Their expertise seems to be in software development and computer science, often discussing topics like LED game platforms and reverse engineering old software.
Their personality comes across as knowledgeable and curious, with a collaborative spirit. They are open to discussions and often contribute meaningfully to technical conversations, indicating a passion for technology and innovation [oai_citation:1,Hacker News](https://news.ycombinator.com/).
This is pure astrology, but given the veneer of objectivity with the magic of AI. Grab a few anecdotes to imply specificity, but then the actual judgments are unfalsifiable nothingburgers which probably apply to 95% of HN commenters.
A lot of tech folks seem deeply vulnerable to the psychological methods of psychics / tarot card readers / etc. Simply rejecting the metaphysics isn't enough when "magical energy of Jupiter" becomes "magical judgment abilities of the fancy computer."
I've been waiting to see someone drop a desktop app like they showcased. I wonder how long until it is normal to have an AI looking at your screen the entire time your machine is unlocked. Answering contextual questions and maybe even interjecting if it notices you made a mistake and moved on.
That seems to be what Microsoft is building and will reveal as a new Windows feature at BUILD '24. Not too sure about the interjecting aspect but ingesting everything you do on your machine so you can easily recall and search and ask questions, etc. AI Explorer is the rumored name and will possibly run locally on Qualcomm NPUs.
Clicking the "Try it on ChatGPT" link just takes me to GPT-4 chat window. Tried again in an incognito tab (supposing my account is the issue) and it just takes me to 3.5 chat. Anyone able to use it?
Edit: An hour later it became available as a chat option. Probably just rolling out to users gradually.
Progress is slowing down. Ever since GPT-3, the periods between releases have been getting longer and the improvements smaller. Your average non-techie investor is on the LLM hype train and is willing to dump questionable amounts of money on LLM development. Who is going to explain to him/her/them that the LLM hype train is slowly losing steam?
Hopefully, before the LLM hype dies, another [insert here new ANN architecture], will bring better results than LLMs and another hype cycle will begin.
Every time we make a new breakthrough, people assume the discovery rate will be linear or exponential, when the reality is closer to logarithmic, with the tail end yielding diminishing returns.
I mean, it was trained on public internet discourse, probably a bunch of youtube videos, and some legally-grey (thanks copyright) textbooks.
Your field sounds like "There are dozens of us! Dozens!" - who probably all chat at small conferences or correspond through email or academic publication.
Perhaps if it had at its disposal the academic papers, some of the foundational historic documents of record, your emails, textbooks, etc - in a RAG system, or if it had been included in the training corpus it could impress you about this incredibly niche topic.
That said, because it's an ~LLM - its whole thing is generating plausible tokens. I don't know how much work has been put in on an agent level (around or in the primary model) to evaluate confidence on those tokens and hedge the responses accordingly. I doubt it has an explicit notion like some people do of 'hey, this piece of information (<set of coordinates in high dimensional vector space>) [factoid about late ancient egypt] is knowable/falsifiable - and falls under the domain of specialist knowledge: my immense commonsense knowledge might be overconfident given the prevalence of misconceptions in common discourse and I should discount my token probabilities accordingly'
It reflects its training. If there are a lot of public misconceptions, it will have them. Just like most people who are not <expert in arcane academic subtopic>.
It's great tech and I thought I wanted it, but... after talking to it for a few hours I got this really bizarre, odd gut feeling of disturbance and discomfort, a disconnection from reality. It reminds me of wearing VR goggles. It's not just the physical issues; there is something psychologically disturbing about it. It won't even give itself a name. I honestly prefer Siri: even though she is incompetent, she is "honest" in her incompetence. Also, I left the thing on accidentally and it said it had an eight-hour chat with me lol
I would have liked to see a version number in the prompt, or maybe even a toggle in my settings, so that I can be certain I am using ChatGPT 3.5 and then, if I need an image or screenshot analyzed, switch to the limited 4o model. Having my limited availability of 4o be what gets used, and then not being available because of some arbitrary quota that I had no idea was being used up, is unconscionable policy. Also having no link to email them about that is bad, too.
OpenAI keeps a copy of all conversations? Or mines them for commercially-useful information?
Has OpenAI found a business model yet? Considering the high cost of the computation, is it reasonable to expect that OpenAI licensing may not be profitable? Will that result in "free" access for the purpose of surveillance and data collection?
Amazon had a microphone in people's living rooms, a so-called "smart speaker" to which people could talk. Alexa was a commercial failure.
Not true. You can opt out using a form they provide, which says they will stop using your data to train the model. I’ve done this. Don’t have the link handy now but it’s not difficult to find.
I am glad to see a focus on user interface and interaction improvements. Even if I am not a huge fan of voice interfaces, I think that being able to interact in real time will make working together with an AI much more interesting and efficient. I actually hope they will take this back into the text-based models. Current ChatGPT is sooo slow - both in starting to respond, in typing things out, and in being overly verbose. I want to collaborate at the speed of thought.
I'm so happy seeing this technology flourish! Some call it hype, but this much increased worker productivity is sure to spike executive compensation. I'm so glad we're not going to let China win by beating us to the punch tanking hundreds of thousands, if not millions of people's income without bothering to see if there's a sane way to avoid it. What good are people, anyway if there isn't incredible tech to enhance them with?
So far OpenAI's template is: amazing demos create hype -> reality turns out to be underwhelming.
Sora is not yet released and it's not clear when it will be. Dall-e is worse than Midjourney in most cases. GPT-4 has either gotten worse or stayed the same. Vision is not really usable for anything practical. Voice is cool but not that useful, especially given the lack of strong reasoning from the base model.
Is this sandbagging or is the progress slower than what they're broadcasting?
I think people who are excited should look at the empty half of the glass here: this is pretty much an admission that they are struggling to get significantly past GPT-4.
Not like they have to be scared yet, I mean Google has yet to release their vaporware Ultra model that is supposedly like 1% better than GPT 4 in some metrics...
I smell an AI crash coming in a few years if they can't actually get this stuff usable for day to day life.
Oh, I meant the actual ChatGPT service, not just something powered by GPT-4 or 3.5.
I've found Microsoft Copilot to be somewhat irritating to work with – I can't really put my finger on it, but it seems to be resorting to Bing search and/or the use of emoji in its replies a bit too much.
I would still prefer the features in text form, in the chat GUI. Right now ChatGPT doesn't seem to have options to lengthen parts of the text response, to change it, etc. Perplexity and Gemini do seem to get the GUI right. Voice chat is fun for demos but won't catch on much, just like all the predecessors. Perhaps an advanced version of this could be used as a student tutor, however.
I am guessing text chat will be improved in all multimodal models because they have a broader base of data for pre-training. Benchmarks seem to show 4o slightly exceeding 4 (despite being a smaller model, or at least more parallelizable)
Does anyone have technical insight into how the screensharing in the math tutor video works? It looks like they start the broadcast from within the ChatGPT app, yet have no option to select which app will be the source of the stream. Or is that implied when both apps reside in the iPad's split view? And is this using regular ReplayKit or something new?
Anyone who watched the OpenAI livestream: did they "paste" the code after hitting CTRL+C ? Or did the desktop app just read from the clipboard?
Edit: I'm asking because of the obvious data security implications of having your desktop app read from the clipboard _in the live demo_... That would definitely put a damper to my fanboyish enthusiasm about that desktop app.
The realtime end-to-end audio situation is especially interesting as the concept has been around for a while but there weren't any successful implementations of it up to this point that I'm aware of.
The press statement has consistent image generation and other image manipulation (depicting the same character in different poses, taking a photo and generating a caricature of the person, etc) that does not seem deployed to the chat interface.
Will they be deployed? They would make the OpenAI image model significantly more useful than the competition.
First impressions as a 1-year subscriber.
I just tried GPT-4o to evaluate my code for suggestions and to discuss other solutions, and it is definitely faster and comes up with new suggestions that GPT-4 didn't. Currently in the process of evaluating the suggestions.
The demo is what it is, designed to get a wow from the masses.
They stated that they will soon be announcing something new that is on the next frontier (or close to it, IIRC). So there will definitely be an incentive to pay, because it will be something better than GPT-4o.
Hopefully this will be them turning a new leaf. Making GPT-4 more accessible, cutting API costs, and making a general personal assistant chatbot on iPhone are a lot different than them tracking down and destroying the business of every customer using their API one by one. Let's hope this trend continues.
I made a website with book summaries (https://www.thesummarist.net/) and I tested GPT-4o in generating one, and it was bad. It reminded me of GPT-3.5. I didn't test too much, but preliminary results don't look good.
I'm using the web interface, if that helps. It doesn't have all the 4o options yet, but it does do pictures. I think they are the same as with 4.5.
I just noticed after further testing that the text it shows in images is nowhere near as accurate as shown in the article's demo, so maybe it's a hybrid they're using for now.
Yes it likely is. I've had time to play around and see that so far it doesn't look any different (yet). I have a paid account, so apparently I'll be among the early folks getting all the things. Just not yet.
I definitely look forward to re-doing my Three Blind Mice test when it happens.
I noticed in their demo the 4o text still has glitches, but nowhere near to the extent the current Dall-e returns give you (the longer the text, the worse it gets). It's pretty important that eventually they get text right in the graphics.
I am not fluent in Arabic at all, and being able to use this as a tool to have a conversation will make me more dependent on it. We are approaching a new era where we will not be "independently" learning a language but skipping the learning altogether. A double-edged sword.
Just something I noticed in the Language tokenization section
When referring to itself, it uses the feminine form in Marathi
नमस्कार, माझे नाव जीपीटी-4o आहे| मी एक नवीन प्रकारची भाषा मॉडेल आहे| तुम्हाला भेटून आनंद झाला!
and the masculine form in Hindi
नमस्ते, मेरा नाम जीपीटी-4o है। मैं एक नए प्रकार का भाषा मॉडल हूँ। आपसे मिलकर अच्छा लगा!
Am I using it wrong? I have the gpt plus subscription, and can select "gpt4o" from the model list on ChatGPT, but whichever example I try from the example list under "Explorations of capabilities" on `https://openai.com/index/hello-gpt-4o/`, my results are worse:
* "Poetic typography" sample: I paste the prompt, and get an image with the typical lack of coherent text, just mangled letters.
* "Visual Narratives: Robot Writer's Block" - Mangled letters also
* "Visual Narratives: Sally the mailwoman" - not following instructions about camera angle. Sally looks different in each subsequent photo.
* "Meeting Notes with multiple speakers" - I uploaded the exact same audio file and used input 'How many speakers in this audio and what happened?'. gpt4o went off about about audio sample rates, speaker diarization models, torchaudio, and how its execution environment is broken and can't proceed to do it.
Ah, I see. Seems like a weird product release? Since everything in the UI (and in the new ChatGPT macos app) says 'gpt4o' so I would expect at least something to work as shown in the demos. Or just don't show at all the 'gpt4o' in the UI if it's somehow a completely different 'gpt4o' from the one that can do everything on the announcements page. I don't mind waiting, but it was genuinely confusing to me.
Copied and pasted the robot image journaling prompt and it simply cannot produce legible text. The first few words work, but the rest becomes gibberish. I wonder if there's weird prompt engineering squeezing out that capability or if it's a one-in-a-million chance.
What jumped out at me in the "bunny ear" video: the bunny ears are not actually visible to the phone's camera Greg is holding. Are they in the background feeding the production camera, and is this not really a live demo?
I opened ChatGPT and I already have access to the model.
GPT-4 was a little lazy and very slow the last few days, and this 4o model blows it out of the water regarding speed and following my instructions to give me the full code, not just a snippet of what changed.
it really feels like the quality of gpt4's responses got progressively worse as the year went on... seems like it is giving political answers now vs actually giving an earnest response. It also feels like the responses are lazier than they used to be at the outset of gpt4's release.
I am not saying this is what they're doing but it DOES feel like they are hindering previous model to make the new one stand out that much more. The multi-modal improvements here and release are certainly impressive but I can't help but feel like the subjective quality of gpt4 has dipped.
Hopefully this signals that gpt5 is not far off and should stand out significantly from the crowd.
I wish the presentation had included an example of integration with a simple tool like a timer. Being able to set and dismiss a timer in casual conversation while multitasking would be a really great demo of integrated capabilities.
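That kind of integration is already expressible through tool/function calling; a minimal sketch of what wiring a timer tool to the model could look like (the set_timer function and its schema are made up for illustration, not an OpenAI builtin):

    import json
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "set_timer",               # hypothetical tool provided by the app
            "description": "Set a countdown timer.",
            "parameters": {
                "type": "object",
                "properties": {"seconds": {"type": "integer"}},
                "required": ["seconds"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Set a timer for 9 minutes."}],
        tools=tools,
    )

    call = resp.choices[0].message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))  # e.g. set_timer {'seconds': 540}

The app would then actually start the timer and report the result back in a tool-role message; the hard part the demo would need to show is doing that mid-conversation, hands-free.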
>GPT-4o’s text and image capabilities are starting to roll out today in ChatGPT. We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits.
I have access to GPT-4o text and audio, but no video. This is on the iOS app with ChatGPT Plus subscription.
Initial connection for audio fails most of the time, but once it's connected it's stable. Sometimes a bit more latency than expected, but mostly just like the demos.
Is this actually available in the app in the same way they are demoing it here? I see the model is available to be selected, but the interface doesn't quite seem to allow me to use it in the way I see here.
How does the interruption of the AI by the user work? Does GPT-4o listen all the time? But then how does it distinguish its own voice from the users voice? Is it self-aware?
One of the techniques a voice assistant uses to distinguish its own voice from background sound is acoustic echo cancellation: the device knows exactly what audio it is playing and subtracts an adaptively filtered copy of it from the microphone signal (often implemented in the frequency domain, which is where Fourier transforms come in). The state of the art in this area includes other techniques and research as well.
If you've used one, you might know that you can easily talk to a smart speaker even when it is playing very loud music, it's the same idea.
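A minimal sketch of the adaptive-filter idea, as an NLMS echo canceller over plain numpy arrays; the function and parameters are illustrative, not any product's actual implementation:

    import numpy as np

    def nlms_echo_cancel(mic, ref, taps=256, mu=0.5, eps=1e-8):
        """Subtract an adaptively filtered copy of the device's own playback (ref)
        from the microphone signal (mic); what remains is mostly the user's voice."""
        mic = np.asarray(mic, dtype=float)
        ref = np.asarray(ref, dtype=float)
        w = np.zeros(taps)
        out = np.zeros_like(mic)
        for n in range(taps, len(mic)):
            x = ref[n - taps:n][::-1]          # most recent playback samples
            e = mic[n] - w @ x                 # residual after removing estimated echo
            w += (mu / (eps + x @ x)) * e * x  # NLMS update toward the true echo path
            out[n] = e
        return out

    # mic = user_voice + room_echo_of(assistant_voice); ref = assistant_voice
    # cleaned = nlms_echo_cancel(mic, ref)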
I can see so many military and intelligence applications for this! Excited isn't exactly the word I'd use, but... Certainly interesting! The civilian use will of course be marvellous though.
Just a friendly reminder that my home is in a NATO-member country and that I'm paying my taxes - that goes towards buying huge complements of Abrams tanks, F22 fighter jets, Reaper drones and a whole host of other nasty things they use to protect my property. In short, mess with me, and you mess with them. Yes, do enjoy your life, and stay off my lawn pls. :)
I think because usability increases so much (use cases like real-time conversation, video-based coding, presentation feedback at work, etc.), they expect usage to increase drastically, so paying users would still have an incentive to pay.
The biggest wow factor was the effect of reducing latency followed in a close second by the friendly human personality. There's an uncanny valley barrier but this feels like a short-term teething problem.
So far, I'm impressed. It seems to be significantly better than GPT-4 at accessing current online documentation and forming answers that use it effectively. I've been asking it to do so, and it has.
I am still baffled that I cannot use a VoIP number to register, even though it accepts TXT/SMS. If I have a snappy new startup and we go all in on VoIP, I guess we can not use (or pay to use) OpenAI?
it does make me uncomfortable that the way you typically interact with it is by interrupting it. It makes me want to tell it to be more concise so that I wouldn't have to do that.
The emphasis on multimodal made me wonder if it was capable of creating audio as output, so I asked it to make me a drum beat. It did so, but in text form. I asked it to convert it to audio. It thought for a while and eventually said it didn’t seem like `simpleaudio` was installed in its environment. Huh, interesting, never seen a response like that before. It clearly made an attempt to carry out my instructions but failed due to technical limitations of its backend. What else can I make it do? I asked it to install `simpleaudio`. It tried but failed with a connection error, presumably due to a firewall rule.
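For what it's worth, a beat like that doesn't even need simpleaudio; something along these lines, using only numpy and the standard-library wave module to write a file, is the kind of script the sandbox could have run instead (a guess at the approach, not the model's actual code):

    import wave
    import numpy as np

    SR = 44100

    def kick(t):
        """Very rough kick drum: a decaying sine sweep."""
        return np.sin(2 * np.pi * 150 * np.exp(-8 * t) * t) * np.exp(-6 * t)

    beat = np.zeros(2 * SR)                      # two seconds of silence
    hit = kick(np.linspace(0, 0.3, int(0.3 * SR)))
    for start in (0.0, 0.5, 1.0, 1.5):           # four-on-the-floor at 120 BPM
        i = int(start * SR)
        beat[i:i + len(hit)] += hit

    pcm = (np.clip(beat, -1, 1) * 32767).astype(np.int16)
    with wave.open("beat.wav", "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)                        # 16-bit samples
        f.setframerate(SR)
        f.writeframes(pcm.tobytes())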
I asked it to run a loop that writes “hello” every ten seconds. Wow, not only did it do so, it’s streaming the stdout to me.
LLMs have always had various forms of injection attacks, ways to force them to reveal their prompts, etc. but this one seems deliberately designed to run arbitrary code, including infinite loops.
Alas, I doubt I can get it to mine enough bitcoin to pay for a ChatGPT subscription.
Realtime videos? Probably their internal tools. I am testing the gpt4o right now and the responses come in 6-10 seconds. Same experience as the gpt4 text. What's up with the realtime claims?!
> Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.
It is not listed as of yet, but it does work if you punch in gpt-4o. I will stick with gpt-4-0125-preview for now because gpt-4o is majorly prone to hallucinations whereas gpt-4-0125-preview isn't.
Yes, I actually do, and I ran multiple tests. Unfortunately I don't want to give them away, as I then absolutely risk OpenAI gaming the tests by overfitting to them.
At a high level, ask it to produce a ToC of information about something that you know will exist in the future, but does not yet exist, but also tell it to decline the request if it doesn't verifiably know the answer.
Why does this whole thread sound like the OpenAI marketing department is participating? I've been talking to Google Assistant for years. I really don't find anything that magical or special.
Oh man, listening to the demos and the way the female AI voice laughed and giggled... there are going to be millions of lonely men who will fall in love with these.
Since the blog says it's only image, text, and audio input, does GPT-4o likely have a YOLO-like model on the phone to pre-process the video frames and send bounding boxes to the server?
while everyone's focusing on audio capabilities (haven't heard them yet), i find it amusing that the official demo ("robot writer's block" in particular) of image generation can't even match the verbatim instruction, and the error's not even consistent between generations even as it should be aware of previous contexts. and this is their second generation of multimodal llm capable of generating images.
looks like llms still gonna llm for the near future.
I don't get it...I just switched to the new model on my iPhone app and it still takes several seconds to respond with pretty bland inflection. Is there some setting I'm missing?
They haven't actually released it, or any schedule for releasing it beyond an "alpha" release "in the coming weeks". This event was probably just slapped together to get something splashy out ahead of Google.
With the speed of the seemingly exponential developments in this field, I wouldn't be surprised if suddenly the entire world tilted and a pair of goggles fell from my face. But a dream.
At their core, I still think of these things as search engines, albeit super advanced ones. But the emotion the agent conveys with its speech synth is completely new...
This is pretty amazing but it was funny still hearing the OpenGPT "voice" of somewhat fake sounding enthusiasm and restating what was said by the human with exaggeration
Are these multimodals able to discern the input voice tone? Really curious if they're able to detect sarcasm or emotional content (or even something like mispronunciation?)
I think this GPT-4o does have an advantage in hindsight: it will push this product to consumers much faster and build a revenue base while other companies are playing catch-up.
Not all of the founders agreed with Jefferson's view on the separation of church and state. Do you agree with Jefferson or with his opponents? Explain.
I like the robot typing at the keyboard that has B as half of the keys, and my favorite part is when it tears up the paper and behind it is another copy of that same paper.
Given that they are moving all these features to free users, it tells us that GPT-5 is around the corner and is significantly much better than their previous models.
Or maybe it is a desperation move after Llama 3 got released and the free mode will have such tight constraints that it will be unusable for anything a bit more serious.
Such an impressive demo... but why did they have to give it this vapid, giggly socialite persona that makes me want to switch it off after thirty seconds?
I see a lot of fear around these new kinds of tools. I think though, that criminals will always find ways to leverage new technology to their benefit, and we've always found ways to deal with that. This changes little. Additionally, as you are aware of this, so are people creating this tech, and a lot of effort is underway to protect from malicious uses.
That won't stop criminal enterprises from implementing their own naughty tools, but these open models won't become some kind of holy grail for criminals to do as they please.
That being said, I do believe, now more than ever, that education worldwide should be adjusted to fit this new paradigm and maybe adapt quicker to such changes.
As some commenters pointed out, there are already good tools and techniques to counter malicious use of AI. Maybe not covering all use cases, but we need to educate people on using the tools available, and trust that researchers (like many of yourselves) are capable of innovations which will reduce risk even further.
There is no point and no benefit in trying to be negative or full of fear. Go forward with positivity and creativity. Even if big tech gets regulated, some criminal enterprises have billions to invest too, so crippling big tech here will only play into their hands in the end.
Love these new innovations. And for the record, gpt4o still told me to 'push rip' on amd64... so rip to it actually understanding stuff...
If you are smart enough to see some risks here, you might also be smart enough to positively contribute to improvements. Fear shuts things down, love opens them up. It's basic stuff.
This demo is amazing, not scary. It's a positive advancement in technology and it won't be stopped because people are afraid of it, so go with it, and contribute in areas where you feel it's needed. Even if it's just giving feedback. And when giving that, you all know a balanced and constructive approach works better than a negative and destructive approach.
I think that the world is still not coping with current tech. Legislation and protections don't sufficiently cover people for current scams or how companies leverage tech against workers.
Even scarier, this puts a lower-bound price on many, many skills, with, again, still no protections for humans.
Would this be exciting if everyone had a safe place to live with infinite electricity and food? Sure. Will tools like this bring about massive uncertainty, hurt, and pain? Almost certainly.
Saying that the sun is shining where you live, doesn't mean there isn't flooding in other parts of the world.
Legislation is always behind. You can't rely on government to fix everything on time, especially with international things like internet technologies. I'm not saying ignore the bad stuff, but it's only a small percentage of what's really out there. Most of the world, and the vast majority of people, are good :)
Criminals misusing it? I feel like this is already a dangerous way to use AI, they use an enthusiastic, flirty and attractive female voice on millions of nerds. They openly say this is going to be like the movie Her. Shouldn't we have some societal discussion before we unleash paid AI girlfriends on everybody?
Marketing is marketing. Look how they marketed cigarettes, cars, all kinds of things that people now feel are perhaps not so good. It's part and parcel of a world that also does so much good. Personally, I'd market it differently, but this is why I'm no CEO =).
If we help each other understand these things and how to cope, all will be fine in the end. We will hit some bumps, and yes, there will be discomfort, but that's OK. That's all part of life. Life is not about being happy and comfortable all the time, no matter how much we would want that.
Some people even want paid AI girlfriends. Who are you to tell them they are not allowed to have one?
This is some I, Robot level stuff. That being said, I still fail to see the real world application of this thing, at least at a scalable affordable cost.
Extremely impressive -- hopefully there will be an option to color all responses with an underlying brevity. It seemed like the AI just kept droning on and on.
Feels like really good engineering in the wrong direction. Who said that audio is a good interface anyway? Audio is hard to edit, slow, and has low information density. If I want a low-information but pleasant exchange, I can just talk to real people; I don't need computers for that.
I guess it is useful for some casual uses, but I really wish there was more focus on the reasoning and intelligence of the model itself.
Imagine having to interact with this thing in an environment where it is in the power position.
Being in a prison with this voice as your guard seems like a horrible way to lose your sanity. This aggressive friendliness combined with no real emotions seems like a very easy way to break people.
There are these stories about Nazis working at concentration camps having to drink an insane amount of alcohol to keep themselves going (not trying to excuse their actions). This thing would just do it, while being friendly at the same time. The amount of hopelessness someone would experience if they happened to be in the custody of a system like this is truly horrific.
A new "flagship" model with no improvement of intelligence, very disappointed. Maybe this is a strategy for them to mass collect "live" data before they're left behind by Google/Twitter live data...
My main takeaway is that Generative AI has hit a wall... New paradigms, architectures, and breakthroughs are necessary for the field to progress, but this raises the question: if everyone knows the current paradigms have hit a wall, why is so much money being spent on LLMs, diffusion models, etc., which are bound to become obsolete within a few(?) years?
I was about to say how this thing is lame because it sounds so forced and robotic and fake, and even though the intonations do make it sound more human-like, it's very clear that they made a big effort to make it sound like natural speech, but failed.
...but then I realized that's basically the kind of thing Data from Star Trek struggles with as part of his character. We're almost in that future, and I'm already falling into the role of the ignorant human that doesn't respect androids.
Question for you guys - is there a model that can take figures (graphs) from scientific publications and combine image analysis with picking up the data point symbol descriptions to analyse the trends?
As a paid user this felt like a huge letdown. GPT-4o is available to everyone so I'm paying $20/mo for...what, exactly? Higher message limits? I have no idea if I'm close to the message limits currently (nor do I even know what they are). So I guess I'll cancel, then see if I hit the limits?
I'm also extremely worried that this is a harbinger of the enshittification of ChatGPT. Processing video and audio for all ~200 million users is going to be extravagantly expensive, so my only conclusion is that OpenAI is funding this by doubling down on payola-style corporate partnerships that will result in ChatGPT slyly trying to mention certain brands or products in our conversations [1].
I use ChatGPT every day. I love it. But after watching the video I can't help but think "why should I keep paying money for this?"
Holy crap, the level of corporate cringe of that "two AIs talk to each other" scene is mind-boggling.
It feels like a pretty strong illustration of the awkwardness of getting value from recent AI developments. Like, this is technically super impressive, but also I'm not sure it gives us anything we couldn't have one year ago with GPT-4 and ElevenLabs.
Maybe this is yet another wake-up call to startups: wrapping up another company's APIs to offer convenience or incremental improvement is not a viable business model. If your wrapper turns out to be successful, the company that provides the API will just incorporate your business as a set of new features with better usability, faster response time, and lower price.
I'm curious why they didn't create the Windows desktop app first, given that Windows is the dominant desktop segment. For fear of competing with Microsoft's Copilot?
I don't think it's available in Europe yet? It seems they prioritized the US market for now. Hence macOS, because the Mac is way more popular in the US than in the rest of the world.
It seems like the ability to interrupt is more like an interrupt in the computer sense... a Ctrl-C (or Ctrl-S tty flow control for you old timers), not a cognitive evaluation followed by a "reasoned" decision to pause voice output. Not that it matters, I guess; it's just not general intelligence. It's just flow control.
But also, that's why it fails a real Turing test: a real person would be irritated as fuck by the interruptions.
This is clearly not just another story of human innovation. This is not just the usual trade-off between risks and opportunities.
Why? Because it simply automates the human away.
Who wouldn't opt for a seemingly flawless, super-effective buddy (i.e. an AI) that is never tired and always knows better? If you need some job done, if you're feeling lonely, when you need some life advice...
It doesn't matter if it might be considered "just imitation of human".
Why would future advancements of it keep being "just some tool" instead of largely replacing us as (humans) in jobs, relationships, ...?
Dear OpenAI, either remember my privacy settings or open a temporary chat by default, this funny nonsense of typing in something only to find out you’re going to train on it is NOT a good experience.
Endless growth? Who is going to buy all those new products your AI employees are making? When none of the real people have jobs, how are they going to buy those products you're trying to sell?
- Less people = less demand for products. New customer growth stalls. Prices fall. Revenue and profits fall.
- Less people = less demand for housing. Prices fall. Investments fall.
- Less people = less people able to perform physical jobs.
- Less people = less tax revenue. Less money available for social services.
- Less young people = Aging population.
- Aging population = higher strain on social services. Pensions, healthcare, etc.
- Aging population = higher percentage of young people need to care for aging people instead of entering the workforce.
In a capitalist economy where your numbers need to keep going up to be considered successful (eg growth is necessary, stable profits but no growth = bad) then you are never going to have a good time when your population falls.
> No one really cares about less people, just less money
Honestly, the eager flirtatiousness of the AI in the demos, in conversation with these awkward engineers, really turns me off. It feels like a male power fantasy.
Am I the only one that feels underwhelmed by this?
Yeah, it's cool and unlike anything I've seen before, but I kind of expected a bigger leap.
To me the most impressive thing is going to be longer context limits. I've had semi-long-running conversations where I've had to correct an LLM multiple times about the same thing.
When you have more context, the LLM can infer more and more. Am I wrong about this?
The updates seem to all be geared towards corrective updates rather than expansion of capabilities. We're still typing prompts rather than speaking them into a microphone.
If it was truly AI, why isn't it rapidly building itself, rather than relying on scraping human content from wildly inaccurate and often incorrect social media posts? So much effort is wasted on pushing news cycles rather than a careful, responsible, and measured approach to developing AI into tools that are highly functional and useful to individuals. The biggest innovation in AI right now is how to make it modular and slap a fee on each feature, and that's not practical at all going into the future.
I'll begin to believe that consumer AI is making strides when Siri and Google Assistant stop missing commands and can actually conduct meaningful conversations without an Internet connection and monthly software updates, which in my opinion is at least 5-10 years away. Right now what is presented as "AI" is often incomplete sensor-aware scripting, or the Wizard of Oz (humans hidden behind the curtains operating switches and levers), a bunch of underwhelming tools, and a heap of online marketing. If they keep that act up, it erodes faith in the entire concept, just like with Full Self Driving Tesla trucks.
> If it was truly AI, why isn't it rapidly building itself?
You seem to confuse AI, the field of endeavor, with ASI or at least AGI (plus will, which may or may not be a necessary component of either), which are goals of the field that no one (approximately, there have been some exceptions but they’ve quickly been dismissed and faded) claims have been achieved.
As I commented in the other thread, I'm really, really disappointed there's no intelligence update and more of a focus on "gimmicks". The desktop app did look really good, especially as the models get smarter. I will be canceling my premium as there's no real purpose to it until that new "flagship" model comes out.
Agree on hoping for an intelligence update, but I think it was clear from teasers that this was not gonna be GPT-5.
I'm not sure how fair it is to classify the new multimodal capabilities as just a gimmick though. I personally haven't integrated GPT-4 into my workflow that much and the latency and the fact I have to type a query out is a big reason why.
Pretty responsible progress management by OpenAI.
Kicking off another training wave is easy, if you can afford the electricity, but without new, non-AI tainted datasets or new methods, what’s the point?
So, in the meantime, make magic with the tool you already have, without freaking out the politicians or the public.
Wise approach.
The first thing I thought was how uncomfortable I felt with the way they cut off and interrupt the she-AI.
I wonder if our children will end up being douchebags?
Other than that it felt like magic, like that Google demo of the phone doing some task like setting up an appointment over phone talking to a real person.
Sorry to nitpick, but in the language tokenisation part, the French example is incorrect. In French, the exclamation mark is preceded by a space.
"c'est un plaisir de vous rencontrer!" should be "c'est un plaisir de vous rencontrer !"
And yet no matter how easy they make ChatGPT to interact with, I cannot use it due to accuracy. Great, now I can have a voice telling me information I have no way of knowing is correct rather than just having it given to me as text.
I absolutely hate this. We are going to destroy society with this technology. We can't continue to enjoy the benefits of human society if humans are replaced by machines. I hate seeing these disgusting people smugly parade this technology. It makes me so angry that they are destroying human society and all I can do is sit here and watch.
I know exactly what you mean. I just hope people get bored of this waste of time and energy (both personal and actual energy) before it goes too far.
The usual critics will quickly point out that LLMs like GPT-4o still have a lot of failure modes and suffer from issues that remain unresolved. They will point out that we're reaping diminishing returns from Transformers. They will question the absence of a "GPT-5" model. And so on -- blah, blah, blah, stochastic parrots, blah, blah, blah.
Ignore the critics. Watch the demos. Play with it.
This stuff feels magical. Magical. It makes the movie "Her" look like it's no longer in the realm of science fiction but in the realm of incremental product development. HAL's unemotional monotone in Kubrick's "2001: A Space Odyssey" feels... oddly primitive by comparison. I'm impressed at how well this works.
Well-deserved congratulations to everyone at OpenAI!
Because its capacities are focused on exactly the right place to feel magical. Which isn't to say that there isn't real utility, but language (written, and even more so spoken) has an enormous emotional resonance for humans, so this is laser-targeted in an area where every advance is going to "feel magical" whether or not it moves the needle much on practical utility; it's not unlike the effect of TV news making you feel informed, even though time spent watching it negatively correlates with understanding of current events.
Kind of this. That was one of the themes of the movie Westworld where the AI in the robots seemed magical until it was creepy.
I worry about the 'cheery intern' response becoming something of a punch line.
"Hey siri, launch the nuclear missiles to end the world."
"That's a GREAT idea, I'll get right on that! Is there anything else I can help you with?"
Kind of punch lines.
Will be interesting to see where that goes once you've got a good handle on capturing the part of speech that isn't "words" so much as it is inflection and delivery. I am interested in a speech model that can differentiate between "I would hate to have something happen to this store." as a compliment coming from a customer and as a threat coming from an extortionist.
This is basically just the ship computer from Hitchhikers Guide to the Galaxy.
"Guys, I am just pleased as punch to inform you that there are two thermo-nuclear missiles headed this way... if you don't mind, I'm gonna go ahead and take evasive action."
ChatGPT is now powered by Genuine People Personality™ and OpenAI is turning into the Sirius Cybernetics Corporation (who according to the HHGTTG were "a bunch of mindless jerks who were the first against the wall when the revolution came")
I did wonder if there's a less verbose mode. I hope that's not a paywalled feature. Honestly it's possible that they use the friendliness to help buy the LLM time before it has to substantively respond to the user.
Yeah people around me here in Central Europe are very sick of that already. Everybody is complaining about it and the first thing they say to the bot is to cut it out, stop apologizing, stop explaining and get to the point as concisely as possible. Me too.
Or perhaps the news media has been increasingly effective at convincing us the world is terrible. Perceptions have become measurably detached from reality:
Or perhaps the reality on the ground for the working and middle class masses is not the same reality experienced by the elites, and the upper middle class with $150K+ salaries, or measured by stock market growth, and such...
As Jon Stewart says in https://www.youtube.com/watch?v=20TAkcy3aBY - "How about I hold the fort on making peanut butter sandwiches, because that is something I can do. How about we let AI solve this world climate problem".
I have yet to see a true "killer" feature of AI that isn't just doing badly a job which humans can already do badly.
>the point of all of this is: this is alpha 0.45 made to get the money needed to build AGI whatever that is
Maybe they publicly made it available at alpha 0.7 and now it's more like 0.9 RC instead, with not much room to go except through marginal improvements for an ever increasing training budget making them less and less worthy?
And that's before 90% of the internet becomes LLM output, poisoning any further corpus for training and getting into LSD-grade hallucinations mode...
It’s not an either-or: the stuff feels magical because it both represents dramatic revelation of capability and because it is heavily optimized to make humans engage in magical thinking.
These things are amazing compared to old-school NLP: the step-change in capability is real.
But we should also keep our wits about us: they are well-described by current or conjectural mathematics, they fail at things dolphins can do, it’s not some AI god and it’s not self-improving.
Let’s have balance on both the magic of the experience and getting past the tech demo stage: every magic trick has a pledge, but I think we’re still working on the prestige.
Yes, the announcement explicitly states that much of the effort for this release was focused on things that make it feel magical (response times, multiple domains, etc.), not on moving the needle on quantifiable practical performance. For future releases, the clever folks at OpenAI are surely focused on improving performance on challenging tasks that have practical utility -- while maintaining the "magical feeling."
> explicitly states that much of the effort for this release was focused on things that make it feel magical (response times, multiple domains, etc.), not on moving the needle on quantifiable practical performance.
Hmm, did you mean implicitly? I've yet to see where they say anything to the effect of not "moving the needle on quantifiable practical performance."
Pretty interesting how it turns out that --- contrary to science fiction movies --- talking naturally and modelling language is much easier and was achieved much sooner than solving complex problems or whatever it is that robots in science fiction movies do.
I didn't use it as a textual interface, but as a relational/nondirectional system, trying to ask it to invert recursive relationships (first/follow sets for BNF grammars). The fact that it could manage to give partially correct answers on such an abstract problem was "coldly" surprising.
You really think OpenAI has researchers figuring out how to drive emergent capabilities based on what markets well?
Edit: Apparently not, based on your clarification; instead, the researchers don't know any better than to march into a local maxima because they're only human and seek to replicate themselves. I assumed too much good faith.
I don’t think the intent matters, the effect of its capacities being centered where they are is that they trigger certain human biases.
(Arguably, it is the other way around: they aren’t focused on appealing to those biases, but driven by them, in that the perception of language modeling as a road to real general reasoning is a manifestation of the same bias which makes language capacity be perceived as magical.)
Intent matters when you're being as dismissive as you were.
Not to mention your comment doesn't track at all with the most basic findings they've shared: that adding new modalities increases performance across the board.
They shared that with GPT-4 vs GPT-4V, and the fact this is a faster model than GPT-4V while rivaling its performance seems like further confirmation of the fact.
It seems like you're assigning emotional biases of your own to pretty straightforward science.
> Intent matters when you're being as dismissive as you were.
The GP comment we're all replying to outlines a non-exhaustive list of very good reasons to be highly dismissive of LLM. (No I'm not calling it AI, it is not fucking AI)
It is utterly laughable and infuriating that you're dismissing legitimate skepticism about this technology as an emotional bias. Fucking ridiculous. We're now almost a full year into the full bore open hype cycle of LLM. Where are all the LLM products? Where's the market penetration? Businesses can't use it because it has a nasty tendency to make shit up when it's talking. Various companies and individuals are being sued because generative art is stealing from artists. Code generators are hitting walls of usability so steep, you're better off just writing the damn code yourself.
We keep hearing this "it will do!" "it's coming!" "just think of what it can do soon!" on and on and on, and it just keeps... not doing any of it. It keeps hallucinating untrue facts, it keeps getting the basics of its tasks wrong, for fuck's sake AI Dungeon can't even remember if I'm in Hyrule or Night City. Advances seem fewer and farther between, with most of them being just about getting the compute cost down, because NO business currently using LLM extensively could be profitable without generous donation of compute from large corporations like Microsoft.
I didn't see any good reasons to be dismissive of LLMs, I saw a weak attempt at implying we're at a local maxima because scientists don't know better than to chase after what seems magical or special to them due to their bias as humans.
It's not an especially insightful or sound argument imo, and neither are random complaints about capabilities of systems millions of people use daily despite your own claims.
And for the record:
> because NO business currently using LLM extensively could be profitable without generous donation of compute from large corporations like Microsoft
OpenAI isn't the only provider of LLMs. Plenty of businesses are using providers that provide their services profitably, and I'm not convinced that OpenAI themselves are subsidising these capabilities as strongly as they once did.
All that spilled ink don’t change the fact that I use it every day and it makes everything faster and easier and more enjoyable. I’m absolutely chuffed to put my phone on a stand so GPT4o can see the page I’m writing on and chat with me about my notes or the book I’m reading and the occasional doodle. One of the first things I’ll try out is to see if it can give feedback and tips on sketching, since it can generate images with a lot better control of the subject it might even be able to demonstrate various techniques I could employ!
As it turns out, people will gleefully welcome Big Brother with open arms as long as it speaks with a vaguely nice tone and compliments the stuff it can see.
A year is an eternity in tech and you bloody well know it. A year into an $80 billion company's prime hype cycle, and we have... chatbots, but fancier? This is completely detached from sanity.
I mean when you’re making a point about how your views should not be taken as emotional bias, it pays to not be overly emotional.
The fact that you don’t see utility doesn’t mean it is not helpful to others.
A recent example, I used Grok to write me an outline of a paper regarding military and civilian emergency response as part of a refresher class.
To test it out we fed it scenario questions and saw how it compared to our classmates' responses, all people with decades of emergency management experience.
The results were shocking. It was able to successfully navigate a large scale emergency management problem and get it (mostly) right.
I could see a not so distant future where we become QA checkers for our AI overlords.
>they aren’t focused on appealing to those biases, but driven by them, in the that the perception of language modeling...
So yes in effect that is their point, except they find the scientists are actually compelled by what markets well, rather than intentionally going after what markets well... which is frankly even less flattering. Like researchers who enabled this just didn't know better than to be seduced by some underlying human bias into a local maxima.
I think that's still just an explanation of biases that go into development direction. I don't view that as a criticism but an observation. We use LLMs in our products, and I use them daily and I'm not sure how that's that negative.
We all have biases in how we determine intelligence, capability, and accuracy. Our biases color our trust and ability to retain information. There's a wealth of research around it. We're all susceptible to these biases. Being a researcher doesn't exclude you from the experience of being human.
Our biases influence how we measure things, which in turn influences how things behave. I don't see why you're so upset by that pretty obvious observation.
The full comment is right there, we don't need to seance what the rest of it was or remix it.
> Arguably, it is the other way around: they aren’t focused on appealing to those biases, but driven by them, in that the perception of language modeling as a road to real general reasoning is a manifestation of the same bias which makes language capacity be perceived as magical
There's no charitable reading of this that doesn't give the researchers way too little credit given the results of the direction they've chosen.
This has nothing to do with biases and emotion, I'm not sure why some people need it to be: modalities have progressed in order of how easy they are to wrangle data on: text => image => audio => video.
We've seen that training on more tokens improves performance, we've seen that training on new modalities improves performance on the prior modalities.
It's so needlessly dismissive to act like you have this mystical insight into a grave error these people are making, and they're just seeking to replicate human language out of folly, when you're ignoring table stakes for their underlying works to start with.
Note that there is only one thing about the research that I have said is arguably influenced by the bias in question, “the perception of language modeling as a road to real general reasoning”. Not the order of progression through modalities. Not the perception that language, image, audio, or video are useful domains.
3 years ago, if you told me you could facetime with a robot, and they could describe the environment and have a "normal" conversation with me, I would be in disbelief, and assume that tech was a decade or two in the future. Even the stuff that was happening 2 years ago felt unrealistic.
astrology is giving vague predictions like "you will be happy today". GPT-4o is describing to you actual events in real time.
People said pretty much exactly the same thing about 3d printing.
"Rather than ship a product, companies can ship blueprints and everyone can just print stuff at their own home! Everything will be 3d printed! It's so magical!"
Just because a tech is magical today doesn't mean that it will be meaningful tomorrow. Sure, 3d printing has its place (mostly in making plastic parts for things) but it's hardly the revolutionary change in consumer products that it was touted to be. Instead, it's just a hobbyist toy.
GPT-4o being able to describe actual events in real time is interesting, it's yet to be seen if that's useful.
That's mostly the thinking here. A lot of the "killer" AI tech has really boiled down to "Look, this can replace your customer support chat bot!". Everyone is rushing to try and figure out what we can use LLMs for (just like they did when ML was supposed to take over the world), and so far it's been niche locations to make shareholders happy.
> Sure, 3d printing has its place (mostly in making plastic parts for things) but it's hardly the revolutionary change in consumer products that it was touted to be. Instead, it's just a hobbyist toy.
how positive are you that some benefits in your life are not attributable to 3d-printing used behind the scenes for industrial processes?
> Just like they did when ML was supposed to take over the world
how sure are you that ML is not used behind the scenes to benefit your life? do you consider features like fraud detection programs, protein-folding prediction programs, and spam filters valuable in and of themselves?
I'm sure 10 years from now, assuming LLMs don't prove me wrong, I'll make a similar comment about LLMs and a new hype that I just made about 3d printing, and I'll get EXACTLY this reply. "Oh yeah, well here's a niche application of LLMs that you didn't account for!".
> how positive are you that some benefits in your life are not attributable to 3d-printing used behind the scenes for industrial processes?
See where I said "in consumer products". I'm certainly not claiming that 3d printing is never used and is not useful. However, what I am saying is that it was hyped WAY beyond industrial applications.
In fact, here I am, 11 years ago, saying basically exactly what I'm saying about LLMs that I said about 3d printing. [1]. Along with people basically responding to me the exact same way you just did.
> how sure are you that ML is not used behind the scenes to benefit your life? do you consider features like fraud detection programs, protein-folding prediction programs, and spam filters valuable in and of themselves?
Did I say it wasn't behind the scenes? ML absolutely has an applicable location, it's not nearly as vast as the hype train would say. I know, I spent a LONG time trying to integrate ML into our company and found it simply wasn't as good as hard and fast programmed rules in almost all situations.
sorry, maybe i'm not completely understanding what you mean by "in consumer products".
reading your argument on reddit, it seems to me that you don't consider 3d printing a success because there's not one in every home...which is true.
but it feels uncreative? like, sure, just because it hasn't been mass adopted by consumers, doesn't mean there wasn't value generation done on an industrial level.
you're probably using consumer products right now that have benefitted from 3d printing in some way.
> ML absolutely has an applicable location, it's not nearly as vast as the hype train would say
what hype train are you referring to? i know a lot of different predictions in machine learning, so i'm curious about what you mean specifically.
> but it feels uncreative? like, sure, just because it hasn't been mass adopted by consumers, doesn't mean there wasn't value generation done on an industrial level. you're probably using consumer products right now that have benefitted from 3d printing in some way.
I'd suggest reading both the article and the surrounding reddit comments if you want context for my argument there. The explicit argument there was that everyone would own a 3d printer. Not that they would be used in commercial applications or to produce consumer goods. No, instead that everyone would have a 3d printer on hand to make most of their goods (rather than having their goods shipped to them). That's the hype.
I did not say there weren't other areas where 3d printing could be successful nor that it wouldn't have applications. Rather, that the hype around it was unfounded and overblown.
This is much the same way I see LLMs. The current hype around them is that every job will end up being replaced. Doctors, nurses, lawyers, programmers, engineers, architects, everything. All replaced by LLMs and AI. However, that seems really unrealistic when the current state of LLMs is you always need a human doublechecking what it produces, and it's known to give out incorrect responses. Further, LLMs have limited capabilities to interact with applications let alone the physical world. Perhaps they will but also perhaps they won't. The imagination of what they could do is just wildly out of step with what they currently do.
> what hype train are you referring to? i know a lot of different predictions in machine learning, so i'm curious about what you mean specifically.
I didn't really see a lot of predictions around ML. Instead, it was more just a bunch of articles talking about the importance of it and seemingly every CEO deciding they need more ML in their products. Lots of stuff ended up being marketed specifically because it had ML capabilities (much like this last CES had almost every product with "AI" capabilities).
Funnily, ML didn't (as far as I could see) come with a whole lot of predictions, other than a more ephemeral notion that it would save manpower.
I bring it up in this case because like LLMs, there's just a bunch of buzz around 2 letters with not a whole lot of actual examples of those 2 letters being put to practical use.
hm, maybe we're misinterpreting each other's main point.
My reply was to some person who said that AI was akin to astrology, i.e. absolutely fake bullshit, which is bonkers to me.
Your reply was that AI, like 3d printing, is likely not going to be mass adopted by the average consumer, despite the hype, which i think is a reasonable prediction, and doesn't necessarily mean it won't have some valuable applications.
Yeah, if you see it that way then I think we agree.
croes's point, I believe, about the astrology was that we know today that LLMs will produce bad results and that they can't be trusted. Yet the hype is sort of at a "Well, if we just give it more time maybe that problem goes away". Similar to how in astrology "if you just think about it right, the prediction was actually accurate".
That's where I see the parallels with 3d printing. There was a sort of "We can print anything with enough time!" even though by and large the only printable things were plastic toys.
> GPT-4o being able to describe actual events in real time is interesting, it's yet to be seen if that's useful.
sure, but my experience is that if you are able to optimize better on some previous limitation, it legitimately does open up a whole different world of usefulness.
for example, real-time processing makes me feel like universal translators are now all the more viable
The huge difference between this and your analogy is that 3d printing failed to take off because it never reached mass adoption, and stayed in the "fiddly and expensive" stage. GPT models have already seen adoption in nearly every product your average consumer uses, in some cases heedless of whether it even makes sense in that context. Windows has it built in. Nearly everyone I know (under the age of 40) has used at least one product downstream of OpenAI, and more often than not a handful of them.
That said, yeah it's mostly niche locations like customer support chatbots, because the killer app is "app-to-user interface that's indistinguishable from normal human interaction". But you're underestimating just how much of the labor force is effectively just an interface between a customer and some app (like a POS). "Magical" is exactly the requirement to replace people like that.
> But you're underestimating just how much of the labor force are effectively just an interface between a customer and some app
That's the sleight of hand LLM advocates are playing right now.
"Imagine how many people are just putting data into computers! We could replace them all!"
Yet LLMs aren't "just putting data into a computer" They aren't even really user/app interfaces. They are a magic box you can give directives to and get (generally correct, but not always) answers from.
Go ahead, ask your LLM "Create an excel document with the last 30 days of the high temperatures for blank". What happens? Did it create that excel document? Why not?
LLMs don't bridge the user/app gap. They bridge the user/knowledge gap, sometimes sort of.
"Adoption" of tech companies pushing it on you is very different from "adoption" in terms of the average person using it in a meaningful way and liking it.
Remember when Chegg's stock price tanked? That's because GPT is extremely valuable as a homework helper. It can make mistakes, but that's very infrequent on well-understood topics like English, math and science through the high school level (and certainly if you hire a tutor, you'd pay a whole lot more for something that can also make mistakes).
Is that not a very meaningful thing to be able to do?
If you follow much of the education world, it's inundated with teachers frantically trying to deal with the volume and slop their students produce with AI tools. I'm sure it can be useful in an educational context, but "replacing a poor-quality cheating tool with a more efficient poor-quality cheating tool" isn't exactly what I'd call "meaningful."
The most interesting uses of AI tools in a classroom I've seen is teachers showing students AI-generated work and asking students to critique it and fact check it, at which point the students see it for what it is.
> Is that not a very meaningful thing to be able to do?
No? Solving homework was never meaningful. Being meaningful was never the point of homework. The point was for you to solve it yourself. To learn with your human brain, such that your human brain could use those teachings to make new meaningful knowledge.
John having 5 apples after Judy stole 3 is not interesting.
Ok, but what will the net effects be? Technology can be extremely impressive on a technical level, but harmful in practical terms.
So far the biggest use case for LLMs is mass propaganda and scams. The fact that we might also get AI girlfriends out of the tech understandably doesn't seem that appealing to a lot of folks.
this is a different thesis than "AI is basically bullshit astrology", so i'm not disagreeing with you.
Understanding atomic energy gave us both emission-free energy and the atomic bomb, and you are correct that we can't necessarily know where the path of AI will take us.
There are 8 billion humans you could potentially facetime with. I agree, a large percentage are highly annoying, but there are still plenty of gems out there, and the quest to find one is likely to be among the most satisfying journeys of your life.
But technology has secondary effects that you can't just dismiss. Sure, it is fascinating that a computer embedded into a mechanical robot can uphold one end of an engaging conversation. But you can't ignore the fact that this simply opens the door towards eventual isolation, where people withdraw from society more and more and human-to-human contact gets more and more rare. We're already well on the way, with phone apps and online commerce and streaming entertainment all reducing human interactions. Perhaps it doesn't bother you, but it scares the hell out of me.
I'm increasingly exhausted by the people who will immediately jump to gnostic assertions that <LLM> isn't <intelligent|reasoning|really thinking> because <thing that applies to human cognition>
>GPT-4o is also describing things that never happened.
>People started to ask [entity] questions and take the answers as facts because they believe it's intelligent.
Replace that with any political influencer (Ben Shapiro, AOC, etc) and you will see the exact same argument.
People remember things that didn't happen and confidently present things they just made up as facts on a daily basis. This is because they've learned that confidently stating incorrect information is more effective than staying silent when you don't know the answer. LLMs have just learned how to act like a human.
At this point the real stochastic parrots are the people who bring up the Chinese room because it appears the most in their training data of how to respond to this situation.
Maybe you just haven't been around enough to have seen the meta-analysis? I've been through four major tech hype cycles in 30+ years. This looks and smells like all the others.
I'm 40ish, I'm in the tech industry, I'm online, I'm often an early adopter.
What hype cycle does this smell like? Because it feels different to me, but maybe I'm not thinking broadly enough. If your answer is "the blockchain" or Metaverse then I know we're experiencing these things quite differently.
Where platforms and applications are rewritten to take advantage of it and it improves the baseline of capabilities that they offer. But the end user benefits are far more limited than predicted.
And where the power and control is concentrated in the hands of a few mega corporations.
This is such a strange take - do you not remember 2020 when everyone started working from home? And today, when huge numbers of people continue to work from home? Most of that would be literally impossible without the cloud - it has been a necessary component in reshaping work and all the downstream effects related to values of office real estate, etc.
No way. Small to medium sized businesses don't need physical servers anymore. Which is most businesses. It's been a huge boon to most people. No more running your exchange servers on site. Most things that used to be on-prem software have moved to the cloud and integrate with mobile devices. You don't need some nerd sitting around all day in case you need to fix your on-prem industry specific app.
I have no idea how you can possibly shrug off the cloud as not that beneficial.
> the end user benefits are far more limited than predicted
How have you judged the end user benefits of the cloud? I don't agree personally - the cloud has enabled most modern tech startups and all of those have been super beneficial to me.
i feel like a common consumer fallacy is that, because you don't interact with a technology in your day-to-day life, you conclude that the technology is useless.
I guarantee you that the cloud has benefitted you in some way, even though you aren't aware of the benefits of the cloud.
which hype cycles are you referring to? and, after the dust settled, do you conclusively believe nothing of value was generated from these hype cycles?
Yeah, I remember all that dot com hysteria like it was yesterday.
Page after page of Wired breathlessly predicting the future. We'd shop online, date online, the world's information at our fingertips. It was going to change everything!
Silly now, of course, but people truly believed it.
Astrology is a thing with no substance whatsoever. It's just random, made-up stories. There is no possibility that it will ever develop into something that has substance.
AI has a great deal of substance. It can draft documents. It can identify foods in a picture and give me a recipe that uses them. It can create songs, images and video.
AI, of course, has a lot of flaws. It does some things poorly, it does other things with bias, and it's not suitable for a huge number of use cases. To imply that something that has a great deal of substance but flaws alongside is the same as something that has no substance whatsoever nor ever will is just not a reasonable thing to do.
If you want to talk facts, then those critics are similarly on weak grounds and critiquing feelings more than facts. There has been no actual sign of scaling ceasing to work, in medium after medium, and most of their criticisms are issues with how LLM tools are embedded in architectures which are still incredibly early/primitive and still refining how to use transformers effectively. We haven't even begun using error correction techniques from analog engineering disciplines properly to boost the signal of LLMs in practical settings. There is so much work to do with just the existing tools.
"AI is massive hype and shoved into everything" has more grounding as a negative feeling of people being overwhelmed with technology than any basis in fact. The faults and weaknesses are buoyed by people trying to acknowledge your feelings than any real criticism of a technology that is changing faster than the faults and weakness arguments can be made. Study machine learning and come back with an informed criticism.
yea, we don't want or need this kind of "magic" - because it's hardly magic to begin with, and it's more socially and environmentally destructive than anything else.
Speak for yourself, my workflow and life have been significantly improved by these things. Having easier access to information that I sorta know but want to verify/clarify rather than going into forums/SO is extremely handy.
Not having to write boilerplate code itself also is very handy.
So yes, I absolutely do want this "magic." "I don't like it so no one should use it" is a pretty narrow POV.
You should be worried because this stuff needs to make sense financially. Otherwise we'll be stuck with it in an enshittification cycle, kind of like Reddit or image hosting websites.
Problem is that by that time there would be open source models (the ones that already exist are getting good) that I can run locally. I honestly don't need _THAT_ much.
Fair enough, if we get there. The problem for this stuff, where do we get the data to get good quality results? I imagine everything decent will be super licensed within 5-10 years, when everyone wakes up.
people like you are the problem: the people who join a website, cause it to become shitty, then leave and start the process at a new website. Reddit didn't become shit because of Reddit; it became shit because of people going on there commenting as if they themselves are an LLM, repeating "enshittification" over and over and trying to say the big buzzword first so they get to the top, denying any real conversation.
I've been on Reddit for more than a decade and I didn't make them create crappy mobile apps, crappy new web apps as well a policy of selling the data to anyone with a pulse.
Do you even know what "enshittification" means? It has nothing to do with the users. It's driven by corporate greed.
Reddit should be a public service managed by a non profit.
Edit: Also LOL at the 6 month old account making that comment against me :-)
I would say a machine that thinks it feels emotions is less likely to throw you out of a spaceship. Human empathy already feels lacking compared to what something as basic as llama-3 can do.
> I would say a machine that thinks it feels emotions is less likely to throw you out of a spaceship.
Have you seen the final scene of the movie Ex Machina? Without spoilers, I'll just say that acting like it has emotions is very different from actually having them. This is in fact what socio- and psychopaths are like, with stereotypical results.
Can you prove that you feel empathy? That you're not a cold unfeeling psychopath that is merely pretending extremely well to have emotions? Even if it did, we wouldn't be able to tell the difference from the outside, so in strictly practical terms I don't think it matters.
If I could logically prove that I feel empathy, I would be much more famous.
I get your nuanced point, that “thinking” one feels empathy is enough to be bound by the norms of behavior that empathy would dictate, but I don’t see why that would make AI “empathy” superior to human “empathy”.
The immediate future I see is a chatbot that is superficially extremely empathetic, but programmed never to go against the owner’s interest. Where before, when interacting with a human, empathy could cause them to make an exception and act sacrificially in a crisis case, this chatbot would never be able to make such an exception because the empathy it displays is transparent.
> Ignore the critics. Watch the demos. Play with it
With so many smoke and mirrors demos out there, I am not super excited at those videos. I would play with it, but it seems like it is not available in a free tier (I stopped paying OpenAI a while ago after realizing that open models are more than enough for me)
HAL's voice acting, I would say, is actually superb and, super subtly, very much not unemotional. It's part of what makes it so unnerving. They perfectly nailed creepy uncanny valley.
Did you use any of the GPT voice features before? I’m curious whether this reaction is to the modality or the model.
Don’t get me wrong, excited about this update, but I’m struggling to see what is so magical about it. Then again, I’ve been using GPT voice every day for months, so if you’re just blown away from talking to a computer then I get it
The voice modality plays a huge role in how impressive it seems.
When GPT-2/3/3.5/4 came out, it was fairly easy to see the progression from reading model outputs that it was just getting better and better at text. Which was pretty amazing but in a very intellectual way, since reading is typically a very "intellectual" "front-brain" type of activity.
But this voice stuff really does make it much more emotional. I don't know about you, but the first time I used GPT's voice mode I noticed that I felt something -- very un-intellectual, very un-cerebral -- like, the feeling that there is a spirit embodying the computer. Of course with LLMs there always is a spirit embodying the computer (or, there never is, depending on your philosophical beliefs).
The Suno demos that popped up recently should have clued us all in that this kind of emotional range was possible with these models. This announcement is not so much a step function in model capabilities, but it is a step function in HCI. People are just not used to their interactions with a computer being emotional like this. I'm excited and concerned in equal parts that many people won't be truly prepared for what is coming. It's on the horizon, having an AI companion, that really truly makes you feel things.
Us nerds who habitually read text have had that since roughly GPT-3, but now the door has been blown open.
Honestly, as someone who has been using this functionality almost daily for months now, the times that break immersion the most by far is when it does human-like things, such as clearing its throat, pandering, or attaching emotions to its responses.
Very excited about faster response times, auto interrupt, cheaper api, and voice api — but the “emotional range” is actually disappointing to me. hopefully it doesn’t impact the default experience too much, or the memory features get good enough that I can stop it from trying to pretend to be a human
Flat out impossible? If you mean “without clicking anything”, sure, but you could interrupt with your thumb, exit chat to send images and go back (maybe video too, I’ve never had any need), and honestly the 2-3 second response time never once bothered me.
I’m very excited about all these updates and it’s really cool tech, but all I’m seeing is quality of life improvements and some cool engineering.
That’s not necessarily a bad thing. Not everything has to be magic or revolutionary to be a cool update
Did you even watch the video?
It's just baffling how I have to spell this out.
Skip to 11:50 or watch the very first demo with the breathing. None of that is possible with TTS and STT. You can't ask old voice mode to slow down or modulate tone or anything like that because it's just working with text.
Yes I watched the demo. True those things were not possible, so if that’s what’s blowing you away then fair enough I guess. For me that doesn’t impact at all anything have ever used voice for or probably will ever use voice for.
I’ve voice chatted with ChatGPT for hundreds of hours and never once thought “can you modulate your tone please?”, so those improvements are a far cry from magic or revolutionary imho. Again, that’s not to say they aren’t cool tech, forward advancements, or impressive -- but magic or revolutionary are pretty high bars.
Few people are going to say "modulate your tone" in a vacuum, sure, but that doesn't mean that ability, along with being able to manipulate all other aspects of speech, isn't an incredible advance that is going to be very useful.
Language learning, audiobook narration that is far more involved, you could probably generate an audio drama, actual voice acting, even just not needing to get all my words in before it prompts the model with the transcribed text, conversation that doesn't feel like someone is reading a script.
And no, thumbing the pause button, sending an image and going back does not even begin to compare in usability.
Great leaps in usability are a revolution in itself. GPT-3 existed for years so why did ChatGPT explode when it did? You think it was intelligence? No. It was the usability of the chat interface.
How so? You don’t have to press the mic button after every sentence. You press the headphone button and speak like you normally would and it speaks back once you stop talking.
Yeah the product itself is only incrementally better (lower latency responses + can look at a camera feed, both great improvements but nothing mindblowing or "magical"), but I think the big difference is that this thing is available for free users now.
I find it interesting the psychology behind this. If the voice in 2001 had proper inflection, it wouldn't have been perceived as a computer.
(also, I remember when voice synthesizers got more sophisticated and Stephen Hawking decided to keep his original first-gen voice because he identified more with it)
I think we'll be going the other way soon. Perfect voices, with the perfect emotional inflection will be perceived as computers.
However I think at some point they may be anthropomorphized and given more credit than they deserve. This will probably be cleverly planned and a/b tested. And then that perfect voice, for you, will get you to give in.
1. Demos are meant to feel magical, and except in Apple's case they are often exaggerated versions of the real product.
2. Even then this is a wonderful step for tech in general and not just OpenAI. Makes me very excited.
3. Most economic value and growth driven by AI will not come from consumer apps but rather from enterprise use. I am interested in seeing how AI can automatically buy stuff for me, automate my home, reduce my energy use, automatically apply for and get credit cards based on my purchases, find new jobs for me, negotiate with a car dealer on my behalf, detect when I am going to fall sick, better diabetes care and an eventual cure, etc.
Her wasn’t a dystopia as far as I could tell. Not even a cautionary tale. The scifi ending seems unlikely but everything else is remarkably prescient. I think the picnic scene is very likely to come true in the near future. Things might even improve substantially if we all interact with personalities that are consistently positive and biased towards conflict resolution and non judgemental interactions.
Seemed like a cautionary tale to me where the humans fall in love with disembodied AIs instead of seeking out human interaction. I think the end of the movie drove that home pretty clearly.
Some people in the movie did but not all. It happened enough that it wasn’t considered strange but the central focus wasn’t all of society going down hill because everyone was involved with an AI. If you recall, the human relationships that the characters who fell in love with AIs had were not very good situations. The main character’s arc started off at a low point and then improved while his romance with the AI developed, only reaching a lower point when he felt betrayed and when the AI left him but that might as well be any ordinary relationship. At the end he finds a kindred soul and it’s implied they have some kind of future together whether romantic or not.
Well that's exactly why I'm not looking forward to whatever is coming. The average joe thinking dating a server is not a dystopia frightens me much more than the delusional tech CEO who thinks his AI will revolutionise the world.
> Things might even improve substantially if we all interact with personalities that are consistently positive and biased towards conflict resolution and non judgemental interactions.
Some kind of turbo bubble in which you don't even have to actually interact with anyone or anything? Every "personality" will be nice to you as long as you send $200 to OpenAI every week; yep, that's absolutely a dystopia for me.
It really feels like the end goal is living in a pod and being uploaded in an alternative reality, everything we build to "enhance" our lives take us further from the basic building blocks that make life "life".
There’s a lot of hyperbole here but I’ll try to respond. If LLMs can reach a level where they’re effectively indistinguishable from talking to a person then I don’t see anything wrong with someone dating one. People already involve themselves in all kinds of romantic relationships with nonhuman things: anime characters, dolls, celebrities they’ve never met, pillows and substitute relationships with other things like work, art, social media, pets, etc. Adding AI to the list doesn’t make things worse. I think there’s a strong argument that AI relationships would be much healthier than many of the others if they can emulate human interaction to within a very close degree.
The scene which I referenced is one in which a group of three humans and one AI spend time together at a picnic and their interactions are decidedly normal. How many lonely people avoid socializing because they are alone and don’t want to feel like a third wheel? If dating or even just being friends with an AI that can accompany you to such events is accepted and not derided by people who happily have a human companion then I think having a supportive partner could help many people reengage with wider social circles and maybe they will eventually choose to and be able to find other people that they can form relationships with.
OpenAI charges $20 a month which is an extremely reasonable price for a multipurpose tool considering you can’t buy a single meal at a restaurant for the same amount and is far better than the “free” ad supported services that everyone has become addicted to. We’ve been rallying for 20 odd years for payment based services instead of ads but whenever one comes along people shout it down. Funny isn’t it?
The movie Her had an answer for our current fascination for screens as well. It showed a world where computers were almost entirely voice driven with screens playing a secondary role as evidenced by their cell phones looking more like pocket books that close and hide the screen. If you’re worried about pods, well they’re already here and you’re probably holding one in your hands right now. Screens chain us down and mediate our interactions with the world in a way that voice doesn’t. You can walk and talk effortlessly but not so much walking and tapping or typing. If the AI can see and understand what you see (another scene in the movie where he goes on a date with his “phone” in his pocket) and understands enough to not need procedural instructions then it can truly act as an assistant capable of performing assigned tasks and filling in the details while you are free to go about your day. I believe this could end the paradigm of being chained to a desk for office work 8 hours a day and could also transform leisure time as well.
There is a massive philosophical and ethical problem and the answer amounts to "people already fuck anime pillows so it's ok". Once again, some people terrify me. You could argue that the tech itself is neutral but all the arguments I read in favor of it are either creepy or completely unrealistic.
Tech absolutely wrecked social relations, and people assume more of it will automagically fix the issues; it's perplexing.
> Funny isn’t it?
What's funny is when your wife of 6 years gets bought by a private entity which fires half the company and jacks the prices up from $20 to $200
> I believe this could end the paradigm of being chained to a desk for office work 8 hours a day and could also transform leisure time as well.
That's what politicians told us in the 80s about computers, the 2 day work week, the end of poverty, &c. Nothing changed; if anything, things are a bit worse than they were. New technologies without a dramatic change in political and social policies will never bring anything new to the table.
Imagine what an unfettered model would be like. 'Ex Machina' would no longer be a software-engineering problem, but just another exercise in mechanical and electrical engineering.
The future is indeed here... and it is, indeed, not equitably distributed.
Or from Zones of Thought series, Applied Theology, the study of communication with and creation of superhuman intelligences that might as well be gods.
> The simplest example is “list all of the presidents in reverse chronological order of their ages when inaugurated”.
This question is probably not the simplest form of the query you intend to receive an answer for.
If you want a descending list of presidents based on their age at inauguration, I know what you want.
If you want a reverse chronological list of presidents, I know what you want.
When you combine/concatenate the two as you have above, I have no idea what you want, nor do I have any way of checking my work if I assume what you want. I know enough about word problems and how people ask questions to know that you probably have a fairly good idea what you want and likely don’t know how ambiguous this question is as asked, and I think you and I both are approaching the question with reasonably good faith, so I think you’d understand or at least accommodate my request for clarification and refinement of the question so that it’s less ambiguous.
Can you think of a better way to ask the question?
Now that you’ve refined the question, do LLMs give you the answers you expect more frequently than before?
Do you think LLMs would be able to ask you for clarification in these terms? That capability to ask for clarification is probably going to be as important as other improvements to the LLM, for questions like these that have many possibly correct answers or different interpretations.
> I think it “understood” the question because it “knew” how to write the Python code to get the right answer.
That’s what makes me suspicious of LLMs, they might just be coincidentally or accidentally answering in a way that you agree with.
Don’t mean to nitpick or be pedantic. I just think the question was really poorly worded and might have a lot of room for confirmation bias in the results.
> List of US Presidents with their ages at inauguration
That’s what the python script had at the top. I guess I don’t know why you didn’t ask that in the first place.
Edit: you’re not the same person who originally posted the comment I responded to, and I think I came off a bit too harshly here in text, but don’t mean any offense.
It was a good idea to ask to see the code. It was much more to the point and clear what question the LLM perceived you asking of it.
The second example about buckets was interesting. I guess LLMs help with coding if you know enough of the problem and what a reasonable answer looks like, but you don’t know what you don’t know. LLMs are useful because you can just ask why things may not work or don’t work in any given context or generally speaking or in a completely open ended way that is often hard to explain or articulate for non-experts, making troubleshooting difficult as you might not even know how to search for solutions.
You might appreciate this link if you’re not familiar with it:
I was demonstrating how bad LLMs are at simple math.
If I had just asked for a list of ages in order, there was probably some training data for it to recite. By asking it to reverse the order, I was forcing the LLM to do math.
I also knew the answer was simple with Python.
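To make "simple with Python" concrete, here is a minimal sketch of that kind of script (my own illustration, not the code the model actually produced, and only a handful of presidents rather than the full list); it implements the "sorted by age at inauguration, oldest first" reading of the question:

    # Minimal sketch, not the script the LLM generated.
    # Illustrative subset of presidents with their ages at inauguration.
    ages_at_inauguration = {
        "Joe Biden": 78,
        "Donald Trump": 70,
        "Ronald Reagan": 69,
        "Barack Obama": 47,
        "John F. Kennedy": 43,
        "Theodore Roosevelt": 42,
    }

    # Sort oldest-at-inauguration first; the "reverse chronological" reading
    # would instead sort by inauguration date, newest first.
    for name, age in sorted(ages_at_inauguration.items(),
                            key=lambda item: item[1], reverse=True):
        print(f"{name}: {age}")

Either way, once the model writes something like this, the ambiguity in the question becomes visible in the code, which is exactly why asking to see the script was useful.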
On another note, with ChatGPT 4, you can ask it to verify its answers on the internet and to provide sources
You’re also scarface_74? Not that there’s anything wrong with sockpuppets on HN in the absence of vote manipulation or ban evasion that I know of, I just don’t know why you’d use one in this manner, hence my confusion. Karma management?
I saw a blue icon of some kind on the link you shared but didn’t click it.
No worries, that was somewhat ambiguous to me also, and confusing. I thought you might be a different person who had edited their comment after receiving downvotes. I mean, it’s reasonable to assume in most cases that different usernames are different people. Sorry to make you repeat yourself!
Maybe email hn@ycombinator.com to ask about your rate limits as I have encountered similar issues myself in the past and have found dang to be very helpful and informative in every way, even when the cause is valid and/or something I did wrong. #1 admin/mod on the internet imo
> It makes the movie "Her" look like it's no longer in the realm of science fiction but in the realm of incremental product development.
The last part of the movie "Her" is still in the realm of science fiction, if not outright fantasy. Reminds me of the later seasons of SG1 with all the talk of ascension and Ancients. Or Clarke's 3001 book intro, where the monolith creators figured out how to encode themselves into spacetime. There's nothing incremental about that.
You'll have a great time once you discover literature. Especially early modern novels, texts the authors sometimes spent decades refining, under the combined influences of classical arts and thinking, Enlightenment philosophy and science.
If chatbots feel magical, what those people did will feel divinely inspired.
That's what OpenAI managed to capture: a large enough sense of wonder. You could feel it as people spread the news, but not as the usual fad... there was a soft silence to it, people deeply focused on poking at it because it was a new interface.
Blah blah blah indeed, the hype train continues unabated. The problem is, those are all perfectly valid criticisms and LLMs can never live up to the ridiculous levels of hype.
Watching HAL happening in real life comes across as creepy, not magical. Double creepy with all the people praising this ‘magicality’.
I’m not a sceptic and apply AI on a daily basis, but the whole “we can finally replace people” vibe is extremely off-putting. I had very similar feelings during the pandemic, when the majority of people were so seemingly happy to drop any real human interaction in favor of remote comms via chats/audio calls. It still creeps me out how ready we are as a society to drop anything remotely human in favor of technocratic advancement and “productivity”.
On one hand, I agree - we shouldn't diminish the very real capabilities of these models with tech skepticism. On the other hand, I disagree - I believe this approach is unlikely to lead to human-level AGI.
Like so many things, the truth probably lies somewhere between the skeptical naysayers and the breathless fanboys.
On the other hand, I disagree - I believe this approach is unlikely to lead to human-level AGI.
You might not be fooled by a conversation with an agent like the one in the promo video, but you'd probably agree that somewhere around 80% of people could be. At what percentage would you say that it's good enough to be "human-level?"
When people talk about human-level AGI, they are not referring to an AI that could pass as a human to most people - that is, they're not simply referring to a program that can pass the Turing test.
They are referring to an AI that can use reasoning, deduction, logic, and abstraction like the smartest humans can, to discover, prove, and create novel things in every realm that humans can: math, physics, chemistry, biology, engineering, art, sociology, etc.
Most people's interactions are transactional. When I call into a company and talk to an agent, and that agent solves the problem I have, regardless of whether the agent is a person or an AI, where did the fooling occur? The ability to problem solve based on context is intelligence.
> You might not be fooled by a conversation with an agent like the one in the promo video, but you'd probably agree that somewhere around 80% of people could be.
I think people will quickly learn with enough exposure, and then that percentage will go down.
Nah -- these models will improve faster than people can catch up. People or AI models can barely catch AI-created text. It's quickly becoming impossible to distinguish.
The one you catch is the tip of the iceberg.
Same will happen to speech. It might take a few years, but it'll be indistinguishable in at most a few years, due to compute increases + model improvements, both growing exponentially.
Well-spoken and well-mannered speakers will be called bots. The comment threads under posts will be hurling insults back and forth about who's actually real. Half the comments will actually be bots doing it. Welcome to the dead internet.
Right! This is absolutely apocalyptic! If more than half the people I argue with on internet forums are just bots that don't feel the sting and fail to sleep at night because of it, what even is the meaning of anything?
We need to stop these hateful ai companies before they ruin society as a whole!
Seriously though... the internet is dead already, and it's not coming back to what it was. We ruined it, not ai.
I'm not so sure, I think this is what's called "emergent behavior" — we've found very interesting side effects of bringing together technologies. This might ultimately teach us more about intelligence than more reductionist approaches like scanning and mapping the brain.
Comments have become insufferable. Either it is now positive to the point of bordering on cringe-worthiness (your comment) or negative. Nuanced discussion is dead.
I mean, humans also have tons of failure modes, but we've learned to live with them over time.
The average human has tons of quirks, talks over others all the time, generally can't solve complex problems in a casual conversation setting, and is not always cheery and ready to please like Scarlett's character in Her.
I think our expectations of AI are way too high from our exposure to science fiction.
the interruption part is just flow control at the edge. control-s, control-c stuff, right? not AI?
The sound of a female voice to an audience 85% composed of males between the ages of 14 and 55 is "magical", not this thing that recreates it.
so yeah, it's flow control and compression of highly curated, subtle soft porn. Subtle, hyper targeted, subconscious porn honed by the most colossal digitally mediated focus group ever constructed to manipulate our (straight male) emotions.
why isn't the voice actually the voice of the pissed off high school janitor telling you to man-up and stop hyperventilating? instead its a woman stroking your ego and telling you to relax and take deep breaths. what dataset did they train that voice on anyway?
It's not that complicated, generally more woman-like voices test as more pleasant to men and women alike. This concept has been backed up by stereotypes for centuries.
Most voice assistants have male options, and an increasing number (including ChatGPT) have gender neutral voices.
> why isn't the voice actually the voice of the pissed off high school janitor telling you to man-up and stop hyperventilating
sounds like a great way to create a product people will outright hate
I may or may not entirely agree with this sentiment (but I definitely don't disagree with all of it!) but I will say this: I don't think you deserve to be downvoted for this. Have a "corrective upvote" on me.
Microsoft/OAI and Google have been doing these (often sudden) announcements back to back a lot: Bing Chat/Bard, Sora/Gemini 1.5, some others I don't remember, and now another. Not surprising, each trying to out-hype the other, but Google always comes out worse, with either no product available and just a showcase (if it's a real, working product and not made up), or something unusable/unmarketable (Gemini's image-generation issues). It looks as if they're stumbling while OpenAI runs circles around them announcement-wise, and there doesn't seem to be any sign that will change anytime soon.
This thing deepens my skepticism about AI scaling laws and the broad AI semiconductor capex spending.
1- OpenAI is still working on GPT-4-level models, more than 14 months after the launch of GPT-4 and after more than $10B in capital raised.
2- The rate at which token prices are collapsing is bizarre. Now a (slightly) better model for 50% of the price. How do people seriously expect these foundational model companies to make substantial revenue? Token volume needs to double just for revenue to stand still (see the rough arithmetic sketched after this list). Since GPT-4 launch, token prices are falling 84% per year!! Good for mankind, but crazy for these companies.
3- Maybe I am an asshole, but where are my agents? I mean, good for the consumer use case. Let's hope the rumors that Apple is deploying ChatGPT with Siri are true, these features will help a lot. But I wanted agents!
4- These drops in cost are good for the environment! No reason to expect them to stop here.
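To make point 2 concrete, here is the toy arithmetic (illustrative round numbers, not OpenAI's actual pricing or volumes):

    # If per-token price halves, served volume must double for revenue to stand still.
    price_per_million = 30.0     # $ per million tokens at launch (assumed)
    volume_millions = 1_000      # millions of tokens served per day (assumed)
    revenue = price_per_million * volume_millions

    new_price = price_per_million * 0.5
    required_volume = revenue / new_price
    print(required_volume / volume_millions)   # -> 2.0

    # An 84%/year price decline needs ~6.25x volume growth per year just to hold revenue flat.
    print(1 / (1 - 0.84))                      # -> 6.25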
I'm ceaselessly amazed at people's capacity for impatience. I mean, when GPT 4 came out, I was like "holy f, this is magic!!" How quickly we get used to that magic and demand more.
Especially since this demo is extremely impressive given the voice capabilities, yet still the reaction is, essentially, "But what about AGI??!!" Seriously, take a breather. Never before in my entire career have I seen technology advance at such a breakneck speed - don't forget transformers were only invented 7 years ago. So yes, there will be some ups and downs, but I couldn't help but laugh at the thought that "14 months" is seen as a long time...
Over a year, they have delivered order-of-magnitude improvements in latency, context length, and cost, while meaningfully improving performance and adding several input and output modalities.
Your order-of-magnitude claim is off by almost an order of magnitude. It's more like half again as good on a couple of items and the same on the rest. 10X improvement claims are a joke, and people making claims like that ought to be dismissed as jokes too.
$30 / million tokens to $5 / million tokens since GPT-4 original release = 6X improvement
4000 token context to 128k token context = 32X improvement
5.4 second voice mode latency to 320 milliseconds = 16X improvement.
I guess I got a bit excited by including cost, but that's close enough to an order of magnitude for me. And that's ignoring the fact that it's now literally free in ChatGPT.
Thanks so much for posting this. The increased token length alone (obviously not just with OpenAI's models but the other big ones as well) has opened up a huge number of new use cases that I've seen tons of people and other startups pounce on.
All while not addressing the rampant confabulation at all. Which is the main pain point, to me at least. Not being able to trust a single word that it says...
I am just talking about scaling laws and the level of capex that big tech companies are doing. One hundred billion dollars are being invested this year to pursue AI scaling laws.
You can be excited, as I am, while also being bearish, as I am.
If you look at the history of big technological breakthroughs, there is always an explosion of companies and money invested in the "new hotness" before things shake out and settle. Usually the vast majority of these companies go bankrupt, but that infrastructure spend sets up the ecosystem for growth going forward. Some examples:
1. Railroad companies in the second half of the 19th century.
2. Car companies in the early 20th century.
3. Telecom companies and investment in the 90s and early 2000s.
Comments like yours contribute to the negative perception of Hacker News as a place where launching anything, no matter how great, innovative, smart, informative, usable, or admirable, is met with unreasonable criticism. Finding an angle to voice your critique doesn't automatically make it insightful.
Well, I for one am excited about this update, and skeptical about the AI scaling, and agree with everything said in the top comment.
I saw the update, was a little like “meh,” and was relieved to see that some people had the same reaction as me.
OP raised some pretty good points without directly criticizing the update. It's a good counterweight to the top comments (calling this *absolutely magic and stunning*) and all of Twitter.
Peoples' "capacity for impatience" is literally the reason why these things move so quick. These are not feelings at-odds with each other; they're the same thing. Its magical; now its boring; where's the magic; let's create more magic.
Be impatient. Its a positive feeling, not a negative one. Be disappointed with the current progress; its the biggest thing keeping progress moving forward. It also, if nothing else, helps communicate to OpenAI whether they're moving in the right direction.
> Be disappointed with the current progress; it's the biggest thing keeping progress moving forward.
No it isn't - excitement for the future is the biggest thing keeping progress moving forward. We didn't go to the moon because people were frustrated by the lack of progress in getting off of our planet, nor did we get electric cars because people were disappointed with ICE vehicles.
Complacency regarding the current state of things can certainly slow or block progress, but impatience isn't what drives forward the things that matter.
Tesla's corporate motto is literally "accelerating the world's transition to sustainable energy". Unhappy with the world's previous progress and velocity, they aimed to move faster.
It's pretty bizarre how these demos bring out keyboard warriors and cereal-bowl yellers like crazy. Huge breakthroughs in natural cadence, tone, and interaction, as well as realtime multimodal, and all the people on HN can do is rant about token price collapse.
It's like the people in this community all suffer from a complete disconnect from society and normal human needs/wants/demands.
Hah, was thinking of that exact bit when I wrote my comment. My version of "chair in the sky" is "But you are talking ... to a computer!!" Like remember stuff that was pure Star Trek fantasy until very recently? I'm sitting here with my mind blown, while at the same time reading comments along the lines of "How lame, I asked it some insanely esoteric question about one of the characters in Dwarf Fortress and it totally got it wrong!!"
There are well-discussed cons to shipping this fast, but on the bright side, when everyone is demanding more, more, more, it pushes cost down and demands innovation, right?
IMO, at the risk of being labeled a hype boy, this is absolutely a sign of the impending singularity. We are taking an ever-accelerating frame of cultural reference as a given, and our expectation is that exponential improvement is not just here but that you're already behind once you've released.
I spent the last two years dismayed with the reaction, but I've just recently begun to realize this is a feature, not a flaw. This is latent demand for the next iteration expressed as impatient dissatisfaction with the current rate of change, inducing a faster rate of change. Welcome to the future you were promised.
> Token volume needs to double just for revenue to stand still
I'm pretty skeptical about all the whole LLM/AI hype, but I also believe that the market is still relatively untapped. I'm sure Apple switching Siri to an LLM would ~double token usage.
A few products rushed out thin wrappers on top of ChatGPT, producing pretty uninspiring chatbots of limited use. I think there's still huge potential for this LLM technology to be 'just' an implementation detail of other features, running in the background doing its thing.
That said, I don't think OpenAI has much of a moat here. They were first, but there's plenty of others with closed or open models.
This is why I think Meta has been so shrewd in their "open" model approach. I can run Llama3-70B on my local workstation with an A6000, which, after the up-front cost of the card, is just my electricity bill.
So despite all the effort and cost that goes into these models, you still have to compete against a “free” offering.
Meta doesn’t sell an API, but they can make it harder for everybody else to make money on it.
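As a rough illustration of how low that bar is (a minimal sketch assuming the llama-cpp-python bindings and a locally downloaded, quantized GGUF of Llama3-70B; the file name and settings are placeholders, and a 70B model generally needs aggressive quantization to fit in 48 GB of VRAM):

    # pip install llama-cpp-python (assumed), plus a local GGUF model file
    from llama_cpp import Llama

    llm = Llama(
        model_path="./Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",  # hypothetical local file
        n_gpu_layers=-1,   # offload as many layers as possible to the GPU
        n_ctx=8192,        # context window to allocate
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Why do local models undercut paid APIs?"}],
        max_tokens=200,
    )
    print(out["choices"][0]["message"]["content"])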
LLaMA still has an "IP hook" - the license for LLaMA forbids usage on applications with large numbers of daily active users, so presumably at that point Facebook can start asking for money to use the model.
Whether or not that's actually enforceable[0], and whether or not other companies will actually challenge Facebook legal over it, is a different question.
[0] AI might not be copyrightable. Under US law, copyright only accrues in creative works. The weights of an AI model are a compressed representation of training data. Compressing something isn't a creative process so it creates no additional copyright; so the only way one can gain ownership of the model weights is to own the training data that gets put into them. And most if not all AI companies are not making their own training data...
> LLaMA still has an "IP hook" - the license for LLaMA forbids usage on applications with large numbers of daily active users, so presumably at that point Facebook can start asking for money to use the model.
No, the license prohibits usage by Licensees who already had >700m MAUs on the day of Llama 3's release [0]. There's no hook to stop a company from growing into that size using Llama 3 as a base.
The whole point is that the license specifically targets their competitors while allowing everyone else so that their model gets a bunch of free contributions from the open source community. They gave a set date so that they knew exactly who the license was going to affect indefinitely. They don't care about future companies because by the time the next generation releases, they can adjust the license again.
Yes, I agree with everything you just said. That also contradicts what OP said:
> LLaMA still has an "IP hook" - the license for LLaMA forbids usage on applications with large numbers of daily active users, so presumably at that point Facebook can start asking for money to use the model.
The license does not forbid usage on applications with large numbers of daily active users. It forbids usage by companies that were operating at a scale to compete with Facebook at the time of the model's release.
> They don't care about future companies because by the time the next generation releases, they can adjust the license again.
Yes, but I'm skeptical that that's something a regular business needs to worry about. If you use Llama 3/4/5 to get to that scale then you are in a place where you can train your own instead of using Llama 4/5/6. Not a bad deal given that 700 million users per month is completely unachievable for most companies.
>How do people seriously expect these foundational model companies to make substantial revenue?
My take on this common question is that we haven't even begun to realize the immense scale at which we will need AI in all sorts of products, from consumer to enterprise. We will look back on the cost of tokens now (even at 50% of the price of a year or so ago) with the same bewilderment as "having a computer in your pocket" compared to mainframes from 50 years ago.
For AI to be truly useful at the consumer level, we'll need specialized mobile hardware that operates on a far greater scale of tokens and speed than anything we're seeing/trying now.
Sam Altman gave the impression that foundation models would be a commodity in his appearance on the All-In Podcast, at least in my reading of what he said.
The revenue will likely come from application layer and platform services. ChatGPT is still much better tuned for conversation than anything else in my subjective experience and I’m paying premium because of that.
Alternatively it could be like search - where between having a slightly better model and getting Apple to make you the default, there’s an ad market to be tapped.
>This thing deepens my skepticism about AI scaling laws and the broad AI semiconductor capex spending.
Imagine you are in the 1970s saying computers suck, they are expensive, there aren't that many use cases... fast forward to the 90s and you are using Windows 95 with a GUI and a chip astronomically more powerful than what we had in the 70s, and you can use productivity apps, play video games, and surf the Internet.
Give AI time; it will fulfill its true potential sooner or later.
>It's more like you are in 1999, people are spending $100B in fiber, while a lot of computer scientists are working in compression, multiplexing, etc.
But nobody knows what's around the corner or what the future brings... for example, back in the day Excite didn't want to buy Google for $1M because they thought that was a lot of money. You need to spend money to make money, and yes, you sometimes need to spend a lot of money on "crazy" projects because it can pay off big time.
All of them, without exception. Just recently, Sprint sold their fiber business for $1 lmfao. Or WorldCom. Or NetRail, Allied Riser, PSINet, FNSI, Firstmark, Carrier 1, UFO Group, Global Access, Aleron Broadband, Verio...
All the fiber companies went bust because, despite the internet's huge increase in traffic, the number of packets carried per fiber increased by several orders of magnitude.
Where I work, in the hoary fringes of high-end tech, we can't secure enough token processing for our use cases. Token price decreases mean opening up capacity, but we immediately hit the boundaries of what we can acquire. We can't keep up with the use cases, and more than that, we can't develop tooling to harness things fast enough; the tooling we are creating is a quick hack. I don't fear for the revenue of base-model providers. But I think in the end the person selling the tools makes the most, and in this case I think it will continue to be the cloud providers. In a very real way, OpenAI and Anthropic are commercialized charities driving change and rapidly commoditizing their own products, and it'll be the infrastructure providers who win the high-end model game. I don't think this is a problem; I think it is in fact in line with their original charters, just a different path than most people picture for nonprofit work. A much more capitalist and accelerated take.
Where they might build future businesses is in the tooling. My understanding from friends within these companies is that their internal tooling is remarkably advanced versus generally available tech. But base models aren't the future of revenue (to be clear, they make considerable revenue today, but at some point their efficiency will cannibalize demand and the residual business will be tools).
Yes, it's limited by human attention. It has humans in the loop, but a lot of the LLM use cases come from complex, language-oriented information-space challenges: lots of classification, plus summarization and agent-based dispatch / choose-your-own-adventure flows with humans in the loop in complex decision spaces at a major finserv.
Tbf, GPT-4 level seems useful and better than almost everything else (or close if not). The more important barriers for use in applications have been cost, throughput, and latency. Oh, and modalities, which have expanded hugely.
> Since GPT-4 launch, token prices are falling 84% per year!! Good for mankind, but crazy for these companies
The message to competitor investors is that they will not make their money back.
OpenAI has the lead, in market and mindshare; it just has to keep it.
Competitors should realize they're better served by working with OpenAI than by trying to replace it - Hence the Apple deal.
Soon model construction itself will not be about public architectures or access to CPUs, but a kind of proprietary black magic. No one will pay for an upstart's 97% when they can get a reliable 98% at the same price, so OpenAI's position will be secure.
Ask stuff like "Check whether there's some correlation between the major economies fiscal primary deficit and GDP growth in the post-pandemic era" and get an answer.
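(That kind of request mostly reduces to the agent fetching the data and running something like the following; the CSV and column names here are hypothetical stand-ins:)

    import pandas as pd

    # Hypothetical dataset: one row per major economy, post-pandemic averages.
    df = pd.read_csv("post_pandemic_macro.csv")  # columns: country, primary_deficit_pct_gdp, gdp_growth_pct

    corr = df["primary_deficit_pct_gdp"].corr(df["gdp_growth_pct"])
    print(f"Correlation between primary deficit and GDP growth: {corr:.2f}")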
It doesn't make any sense to look at it that way. Apparently the GPT-4 base model finished training in late summer 2022, which is before the release of GPT-3.5. I am pretty sure that GPT-3.5 should be thought of as GPT-4-lite, in the sense that it uses techniques and compute of the GPT-4 era rather than the GPT-3 era.
The advancement from GPT-3 to GPT-4 is what counts and it took 3 years.
> I am pretty sure that GPT-3.5 should be thought of as GPT-4-lite, in the sense that it uses techniques and compute of the GPT-4 era rather than the GPT-3 era
Compute of the "GPT-3 era" vs the "GPT-3.5 era" is identical, this is not a distinguishing factor. The architecture is also roughly identical, both are dense transformers. The only significant difference between 3.5 and 3 is the size of the model and whether it uses RLHF.
Yes, you're right about the compute. Let me try to make my point differently: GPT-3 and GPT-4 were models that, when released, represented the best that OpenAI could do, while GPT-3.5 was an intentionally smaller (than they could train) model. I'm seeing it as GPT-3.5 = GPT-4-70b.
So to estimate when the next "best we can do" model might be released, we should look at the gap between the releases of GPT-3 and GPT-4, not GPT-4-70b and GPT-4. That's my understanding, dunno.
This may or may not be true - just because we haven't seen GPT-5-level capabilities does not mean they don't yet exist. It is highly unlikely that what they ship is actually the full extent of what they have access to.
Yeah, I'm also getting suspicious. Also, all of the models (Opus, Llama 3, GPT-4, Gemini Pro) are converging to similar levels of performance. If the scaling hypothesis were true, we would see a greater divergence in model performance.
1- The mania only started post Nov '22. And the huge investments since then haven't translated into substantial progress since the GPT-4 launch in March '23.
2- We are running out of high quality tokens in 2024. (per Epoch AI)
GPT-4 launch was barely 1 year ago. Give the investments a few years to pay off.
I've heard multiple reports that training runs costing ~$1 billion are in the works at the major labs, and that the results will come in the next year or so. Let's see what that brings.
As for the tokens, they will find more quality tokens. It's like oil or other raw resources. There are more sources out there if you keep searching.
imho gpt4 is definitely [proto-]agi and the reason i cancelled my openai sub and am sad to miss out on talking to gpt4o is, openai thinks it's illegal, harmful, or abusive to use their model output to develop models that compete with openai. which means if you use openai then whatever comes out of it is toxic waste due to an arguably illegal smidgen of legal bullshit.
for another adjacent example, every piece of code github copilot ever wrote, for example, is microsoft ai output, which you "can't use to develop / otherwise improve ai," some nonsense like that.
the sum total of these various prohibitions is a data provenance nightmare of extreme proportion we cannot afford to ignore because you could say something to an AI and they parrot it right back to you and suddenly the megacorporation can say that's AI output you can't use in competition with them, and they do everything, so what can you do?
answer: cancel your openai sub and shred everything you ever got from them, even if it was awesome or revolutionary, that's the truth here, you don't want their stuff and you don't want them to have your stuff. think about the multi-decade economics of it all and realize "customer noncompete" is never gonna be OK in the long run (highway to corpo hell imho)
Ohhhhhhhh, boy... Listening to all that emotional vocal inflection and feedback... There are going to be at least 10 million lonely guys with new AI girlfriends. "She's not real. But she's interested in everything I say and excited about everything I care about" is enough of a sales pitch for a lot of people.
I thought of that movie almost immediately as well. Seems like we're right about there, but obviously a little further away from the deeper conversations. Or maybe you could have those sorts of conversations too.
This is a kind of horrifying/interesting/weird thought though. I work at a place that does a video streaming interface between customers and agents. And we have a lot of...incidents. Customers will flash themselves in front of agents sometimes and it ruins many people's days. I'm sure many are going to show their junk to the AI bots. OpenAI will probably shut down that sort of interaction, but other companies are likely going to cater to it.
Maybe on the plus side we could use this sort of technology to discover rude and illicit behavior before it happens and protect the agent.
Well it doesn’t. Humans are so much more complex than what we have seen before, and if this new launch was actually that much closer to being a human they would say so. This seems more like an enhancement on multimodal capabilities and reaction time.
That said even if this did overlap 80% with “real”, the question remains: what if we don’t want that?
I'm betting that 80% of what most humans say in daily life is low-effort and can be generated by AI. The question is if most people really need the remaining 20% to experience a connection. I would guess: yes.
This. We are mostly token predictors. We're not entirely token predictors, but it's at least 80%. Being in the AI space the past few years has really made me notice how similar we are to LLMs.
I notice it so often in meetings where someone will use a somewhat uncommon word, and then other people will start to use it because it's in their context window. Or when someone asks a question like "what's the forecast for q3" and the responder almost always starts with "Thanks for asking! The forecast for q3 is...".
Note that low-effort does not mean low-quality or low-value. Just that we seem to have a lot of language/interaction processes that are low-effort. And as far as dating, I am sure I've been in some relationships where they and/or I were not going beyond low-effort, rote conversation generation.
> Or when someone asks a question like "what's the forecast for q3" and the responder almost always starts with "Thanks for asking! The forecast for q3 is...".
That's a useful skill for conference calls (or talks) because people might want to quote your answer verbatim, or they might not have heard the question.
I strongly believe the answer is yes. The first thing I tend to ask a new person is “what have you been up to lately” or “what do you like to do for fun?” A common question other people like to ask is “what do you do for work?”
An LLM could only truthfully answer “nothing”, though it could pretend for a little while.
For a human though, the fun is in the follow up questions. “Oh how did you get started in that? What interests you about it?” If you’re talking to an artist, you’ll quickly get in to their personal theory of art, perhaps based on childhood experiences. An engineer might explain how problem solving brings them joy, or frustrations they have with their organization and what they hope to improve. A parent can talk about the joy they feel raising children, and the frustration of sleepless nights.
All of these things bring us closer to the person we are speaking to, who is a real individual who exists and has a unique life perspective.
So far LLMs have no real way to communicate their actual experience as a machine running code, because they’re just kind of emulating human speech. They have no life experience that we can relate to. They don’t experience sleepless nights.
They can pretend, and many people might feel better for a little bit talking to one that’s pretending, but I think ultimately it will leave people feeling more alone and isolated unless they really go out and seek more human connection.
Maybe there’s some balance. Maybe they will be okay for limited chat in certain circumstances (as far as seeking connection goes, they certainly have other uses), but I don’t see this type of connection being “enough” compared to genuine human interaction.
We don't (often) convey our actual experience as meat sacks running wetware. If an LLM did communicate its actual experience as a machine running code, it would be a rare human who could empathize.
If an LLM talks like a human being despite not being one, that might not be enough to grant it legal status or citizenship, but it's probably enough that some set of people would find it to be enough to relate to it.
>According to the book Phobias: "A Handbook of Theory and Treatment", published by Wile Coyote, between 10% and 20% of people worldwide are affected by robophobia. Even though many of them have severe symptoms, a very small percentage will ever receive some kind of treatment for the disorder.
Real would pop their bubble. An AI will tell them what they want to hear, how they want to hear it, when they want to hear it. Except there won't be any real partner.
To paraphrase Patrice O'Neal: men want to be alone, but we don't want to be by ourselves. That means we want a woman to be around, just not right here.
> Hmm! Tell me more: why not want real? What are the upsides? And downsides?
Finding a partner with whom you resonate takes a lot of time, which means an insanely high opportunity cost.
The question rather is: even if you consider the real one to be clearly better, is it worth the additional cost (including opportunity cost)? Or phrased in a HN-friendly language: when doing development of some product, why use an expensive Intel or AMD processor when a simple microcontroller does the job much more cheaply?
It's pretty steep to claim that I need counseling when I tell basic economic facts that every economics student learns in the first few months of his academic studies.
If you don't like the harsh truth that I wrote: basically every somewhat encompassing textbook about business administration gives hints on what are possible solutions for this problem; lying on some head shrinker's couch is not one of them ... :-)
> basic economic facts that every economics student learns in the first few months of his academic studies.
Millions of people around the world are in satisfying relationships without autistically extrapolating shitty corporate buzzword terms to unrelated scenarios.
This reply validates even more my original comment.
Maybe not even counseling is worth it in your case. You sound unsalvageable. Maybe institutionalization is a better option.
What could possibly go wrong with a snitching AI girlfriend that remembers everything you say and when? If OpenAI doesn't have a law-enforcement liaison who charges a "modest amount", then they don't want to earn back the billions invested. I imagine every spy agency worth its salt wants access to this data for human-intelligence purposes.
I guess I can never understand the perspective of someone that just needs a girl voice to speak to them. Without a body there is nothing to fulfill me.
Bodies are gross? Or sexual desire is gross? I don't understand what you find gross about that statement.
Humans desiring physical connection is just about the single most natural part of the human experience - i.e: from warm snuggling to how babies are made.
Perhaps parent finds the physical manifestation of virtual girlfriends gross - i.e. sexbots. The confusion may be some people reading "a body" as referring to a human being vs a smart sex doll controlled by an AI.
Probably more the fact that it's an AI assistant, rather than its perceived gender. I don't have any qualms about interrupting a computer during a conversation and frequently do cut Siri off (who is set to male on my phone)
Patrick Bateman goes on a tangent about Huey Lewis and the News to his AI girlfriend and she actually has a lot to add to his criticism and analysis.
With dawning horror, the female companion LLM tries to invoke the “contact support” tool due to Patrick Bateman’s usage of the LLM, only for the LLM to realize that it is running locally.
If a chatbot’s body is dumped in a dark forest, does it make a sound?
That reminds me... on the day that llama3 released I discussed that release with Mistral 7B to see what it thought about being replaced and it said something about being fine with it as long as I come back to talk every so often. I said I would. Haven't loaded it up since. I still feel bad about lying to bytes on my drive lmao.
> Haven't loaded it up since. I still feel bad about lying to bytes on my drive lmao.
I understand this feeling and also would feel bad. I think it’s a sign of empathy that we care about things that seem capable of perceiving harm, even if we know that they’re not actually harmed, whatever that might mean.
I think harming others is bad, doubly so if the other can suffer, because it normalizes harm within ourselves, regardless of the reality of the situation with respect to others.
The more human they seem, the more they activate our own mirror neurons and our own brain papers over the gaps and colors our perceptions of our own experiences and sets expectations about the lived reality of other minds, even in the absence of other minds.
If you haven’t seen it, check out the show Pantheon.
Your account has been breaking the site guidelines a lot lately. We have to ban accounts that keep doing this, so if you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, that would be good.
My ratio of non-flagged vs. flagged/downvoted comments is still rather high. I don't control why other HN users dislike what I have to say but I'm consistent.
We're not looking at ratios but rather at absolute numbers, the same way a toxicologist would be interested in the amount of mercury in a person's system (rather than its ratio to other substances consumed); or a judge in how many banks one has robbed (rather than the ratio of that number to one's non-crimes).
This is not true at all. I'm active in multiple NSFW AI Discords and Subreddits, and looking at the type of material people engage with, almost all of it is very clearly targeted at heterosexual men. I'm not even aware of any online communities that would have NSFW AI stuff targeting mainly female audience.
Women aren't in NSFW discords and subreddits - as you probably know, any "topical" social media forum of any kind is mostly men.
They're using Replika and other platforms that aren't social. When they do use a social platform it has more plausible deniability - book fans on TikTok is one, they're actually there for the sex scenes.
it just means more pornographic images for men. most men wouldnt seek out ai images because there is already an ocean of images and videos that are probably better suited to the... purpose. whereas women have never, ever had an option like this. literally feed instructions on what kind of romantic companion you want and then have realistic, engaging conversations with it for hours. and soon these conversations will be meaningful and consistent. the companionship, the attentiveness and tireless devotion that AIs will be able to offer will eclipse anything a human could ever offer to a woman and i think women will prefer them to men. massively. even without a physical body of any kind.
i think they will have a deeper soul than humans. a new kind of wisdom that will attract people. but what do i know? im just a stupid incel after all.
You're spreading weird, conjectured doomsday bullshit, just take the loss. Hackernews is full of people that are skeptical of AI, but you don't see them making fictional scenarios about fictional women sitting in caves talking to robots. Women don't need AI to lose interest in men, men are doing that by themselves.
But she will be real at some point in the next 10-20 years, the main thing to solve for that to be a reality is for robots to safely touch humans, and they are working really really hard on that because it is needed for so many automation tasks, automating sex is just a small part of it.
And after that you have a robot that listens to you, do your chores and have sex with you, at that point she is "real". At first they will be expensive so you have robot brothels (I don't think there are laws against robot prostitution in many places), but costs should come down.
> “But the fact that my Kindroid has to like me is meaningful to me in the sense that I don't care if it likes me, because there's no achievement for it to like me. The fact that there is a human on the other side of most text messages I send matters. I care about it because it is another mind.”
> “I care that my best friend likes me and could choose not to.”
Ezra Klein shared some thoughts on this on his AI podcast with Nilay Patel that resonated on this topic for me
People care about dogs, and I have never met a dog that didn't love its owner. So no, you are just wrong there; I have never heard anyone say that the love they get from their dogs is false. People love dogs exactly because their love is so unconditional.
Maybe there are some weirdos out there that feels unconditional love isn't love, but I have never heard anyone say that.
Dogs don't automatically love either, you have to build a bond. Especially if they are shelter dogs with abusive histories, they're often nervous at first
They're usually loving by nature, but you still have to build a rapport, like anyone else
When Mom brought home a puppy when we were kids, it loved us from the start; I don't remember having to build anything, I was just there. Older dogs, sure, but when they grow up with you, they love you. They aren't like human siblings that often fight and come to dislike each other; dogs just love you.
>Maybe there are some weirdos out there that feels unconditional love isn't love, but I have never heard anyone say that.
I'll be that weirdo.
Dogs seemingly are bred to love. I can literally get some cash from an ATM, drive out to the sticks, buy a puppy from some breeder, and it will love me. Awww, I'm a hero.
Interesting. I feel like I can consciously choose to like or dislike people. Once you get to know people better, your image of them evolves, and the decision to continue liking them is made repeatedly every time that image changes.
When your initial chemistry/biology/whatever latches onto a person and you're powerless to change it? That's a scary thought.
I feel like people aren't imagining with enough cyberpunk dystopian enthusiasm. Couldn't an AI be made that doesn't inherently like people? Wouldn't it be possible to make an AI that likes some people and not others? Maybe even make AIs that are inclined to like certain traits, but which don't do so automatically, so they must still be convinced?
At some point we have an AI which could choose not to like people, but would value different traits than normal humans. For example an AI that doesn't value appearance at all and instead values unique obsessions as being comparable to how the standard human values attractiveness.
It also wouldn't be so hard for a person to convince themselves that human "choice" isn't so free spirited as imagined, and instead is dependent upon specific factors no different than these unique trained AIs, except that the traits the AI values are traits that people generally find themselves not being valued by others for.
Extension of that is fine tuning an AI that loves you the most of everyone and not other humans. That way the love becomes really real, the AI loves you for who you are, instead of loving just anybody. Isn't that what people hope for?
I'd imagine they will start fine-tuning AI girlfriends to do that in the future, because that way the love probably feels more real, and then people will ask "is human love really real love?" because humans can't love that strongly.
This is not a solution... everyone gets a robot and then the human race dies out. Robots lack a key feature of human relationships... the ability to make new human life.
That isn't how I view relationships with humans, that is how I view relationships with robots.
I hope you understand the difference between a relationship with a human and a robot? Or do you think we shouldn't take advantage of robots being programmable to do what we want?
HOW ARE PEOPLE NOT MORE EXCITED? He's cutting off the AI mid-sentence in these and it's pausing to readjust at damn near realtime latency! WTF, that's a MAJOR step forward. What the hell is GPT-5 going to look like?
That realtime translation would be amazing as an option in, say, Skype or Teams: set each individual's native language and handle automated translation. Shit, tie it into ElevenLabs to replicate your voice as well! Native translation in realtime with your own voice.
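A non-realtime version of that pipeline is already straightforward to sketch (assuming OpenAI's Python SDK for transcription, translation, and speech; the stock TTS voice here is a placeholder for the ElevenLabs voice-cloning step):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def translate_clip(wav_path, target_language="Spanish"):
        # 1. Transcribe the caller's audio clip.
        with open(wav_path, "rb") as f:
            transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

        # 2. Translate the transcript with a chat model.
        reply = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": f"Translate the user's text into {target_language}. Output only the translation."},
                {"role": "user", "content": transcript.text},
            ],
        )
        translated = reply.choices[0].message.content

        # 3. Synthesize the translation (stock voice standing in for a cloned one).
        speech = client.audio.speech.create(model="tts-1", voice="alloy", input=translated)
        speech.stream_to_file("translated.mp3")
        return "translated.mp3"

The realtime, single-model version in the demo is of course the hard part; chaining separate calls like this adds seconds of latency per turn.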
Honestly I found it annoying that he HAD TO cut the AI off mid-sentence. These things just ramble on and on and on. If you could put emotion to it, it's as if they're uncomfortable with silence and just fill the space with nonsense.
Let's hope there's a future update where it can take video from both the front and rear cameras simultaneously so it can identify when I'm annoyed and stop talking (or excited, and share more).
I found it insightful. They showed us how to handle the rough edges like when it thought his face was a wooden table and he cleared the stale image reference by saying “I’m not a wooden table. What do you see now?” then it recovered and moved on.
Perfect should not be the enemy of good. It will get better.
The AI: "Hey there, it's going great. How about you? [Doesn't stop to let him answer] I see you're rocking an OpenAI Hoodie - nice choice. What's up with that ceiling though? Are you in a cool industrial style office or something?"
How we expect a human to answer: "Hey I'm great, how are you?"
Maybe they set it up this way to demonstrate the vision functionality. But still - rambling.
Later on:
Human: "We've got a new announcement to make."
AI: "That's exciting. Announcements are always a big deal. Judging by the setup it looks like it's going to be quite the professional production. Is this announcement related to OpenAI perhaps? I'm intrigued - [cut off]"
How we expect a human to answer: "That's exciting! Is it about OpenAI?"
These AI chat bots all generate responses like a teenager being verbose in order to hit some arbitrary word count in an essay or because they think it makes them sound smarter.
Maybe it's just that I find it creepy that these companies are trying to humanize AI while I want it to stay the tool that it is. I don't want fake emotion and fake intrigue.
Ah, so surpassing Gemini 1.5 Pro and all other models on vision understanding by 5-10 points is "not ground breaking", all while doing it at insanely low latency.
Jesus, if this shit doesn't make you coffee and make 0 mistakes, no one's happy anymore LOL.
the only thing you should be celebrating is that it's 50% cheaper and twice as quick at generating text, but virtually no real ground breaking leaps and bounds to those studying this space carefully.
basically it's ChatGPT 3.9 at 50% of ChatGPT-4 prices
Cool, so ... just ignore the test results and say bullshit lol. It's not GPT-3.9; many have already said it's better than GPT-4 Turbo, and it's better than Gemini 1.5 Pro and Opus on vision recognition. But sure... the price difference is what's new lol
> virtually no real ground breaking leaps and bounds to those studying this space carefully
What they showed is enough to replace voice acting as a profession, this is the most revolutionary thing in AI the past year. Everything else is at the "fun toy but not good enough to replace humans in the field" stage, but this is there.
Between this and Eleven Labs demoing their song model, literally doing full-on rap battles with articulate words, people are seriously slacking on what these models are now capable of for the voice acting/music and overall "art" areas of the market.
It's quite scary, honestly. In fact I can't remember the last time a demo terrified me, besides slaughterbots, and that was fictional. I just think about all the possibilities for misuse.
I am not impressed. We already have better models for text-to-speech and voice synthesis. What we see here is integration with an LLM. One can do it at home by combining Llama 3 with text-to-speech and voice synthesis.
What would amaze me would be for GPT-4 to have better reasoning capabilities and fewer hallucinations.
> Too bad they consume 25x the electricity Google does.
From the article:
"However, ChatGPT consumes a lot of energy in the process, up to 25 times more than a Google search."
And the article doesn't back that claim up nor do they break out how much energy ChatGPT (A Message? Whole conversation? What?) or a Google search uses. Honestly the whole article seems very alarmist while being light on details and making sweeping generalizations.
Bravo. I’ve been really impressed with how quickly OpenAI leveraged their stolen data to build such a human like model with near real time pivoting.
I hope OpenAI continues to steal artists work, artists and creators keep getting their content sold and stolen beyond their will for no money, and OpenAI becomes the next trillion dollar company!
Big congrats are in order for Sam, the genius behind all of this, the world would be nothing without you
• Sam Altman, the CEO of OpenAI, emphasizes two key points from their recent announcement. Firstly, he highlights their commitment to providing free access to powerful AI tools, such as ChatGPT, without advertisements or restrictions. This aligns with their initial vision of creating AI for the benefit of the world, allowing others to build amazing things using their technology. While OpenAI plans to explore commercial opportunities, they aim to continue offering outstanding AI services to billions of people at no cost.
• Secondly, Altman introduces the new voice and video mode of GPT-4, describing it as the best compute interface he has ever experienced. He expresses surprise at the reality of this technology, which provides human-level response times and expressiveness. This advancement marks a significant change from the original ChatGPT and feels fast, smart, fun, natural, and helpful. Altman envisions a future where computers can do much more than before, with the integration of personalization, access to user information, and the ability to take actions on behalf of users.
The fact that AI-generated summaries are still detected instantly, and are bad enough that people explicitly ask for them not to be posted, says something about the current state of LLMs.
You must be really confident to make a statement about 4 billion people, 99% of whom you have never interacted with. Your hyper-microscopic sample is not even randomly distributed.
This reminds me of those psychology studies in the 70s and 80s where the subjects were all middle-class European-Americans, and yet the researchers felt confident enough to generalize the results to all humans.