While this is a very cool project that shows a great use of machine learning to answer questions about images in a roughly explainable way, I think people are extrapolating quite a bit as though this is some kind of movement forward from GPT-4 or Midjourney 5 into a new advanced reasoning phase, rather than a neat new combination of stuff that existed a year ago.
Firstly, a bunch of the tech here is recognition-based rather than generative; it is relying heavily on object recognition which is not new.
Secondly, the two primary spaces where generative tech is used are
1. For code generation from simple queries over a well-defined (and semantically narrow) spatial API — this is one of the tasks where generative AI should shine in most cases. And
2. As a punt for something the API doesn't allow: e.g. "tell me about this building", which then comes with the same inscrutability as before.
The number of examples for which the code is essentially "create a vector of objects, sort them on the x, y, z, or t axis, and pick an index" is quite high. But there aren't really any examples of determining causality or complex relationships that would require common sense. It is basically a more advanced SHRDLU. That's not to say this isn't a very cool result (with an equally cool presentation). And I could see some applications where this tech is used to achieve ad-hoc application of basic visual rules to generative AI (for example, Midjourney 6 could just regenerate images until "do all hands in this image have five fingers?" is true).
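The "sort on an axis and pick an index" pattern is easy to picture. Here is a minimal sketch of the kind of code being generated, with a stub class standing in for the real detector — the names here are illustrative, not the paper's exact interface:

```python
# Sketch of the kind of code ViperGPT emits for a query like
# "select the second car from the left". The Patch class is a stand-in
# for detections that would really come from a model such as GLIP.

class Patch:
    """Stand-in for a detected object in an image."""
    def __init__(self, label, left):
        self.label = label
        self.left = left  # x coordinate of the bounding box

def second_from_left(patches, label):
    # create a vector of objects, sort on the x axis, pick an index
    found = [p for p in patches if p.label == label]
    found.sort(key=lambda p: p.left)
    return found[1]

scene = [Patch("car", 300), Patch("car", 50), Patch("tree", 120)]
print(second_from_left(scene, "car").left)  # -> 300
```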
> I think people are extrapolating quite a bit as though this is some kind of movement forward from GPT-4 or Midjourney 5 into a new advanced reasoning phase, rather than a neat new combination of stuff that existed a year ago.
It can be both. Life itself was a "neat combination of stuff that existed" before. It isn't about the raw ingredients, but the capability of their whole.
Also, history has shown that there are periods of time when rapid progress happens. It looks like we are in one of those, and it will make the previous ones look like baby steps.
I interpret your point as "although ViperGPT is innovative, it is not as radical as GPT-4 or Midjourney 5". Here, "radical innovation" is a term borrowed from the innovation literature. (https://bigthink.com/plus/radical-vs-disruptive-innovation-w...)
Although I largely agree with you, I still think this is a massive development as it will likely change the way empiricists use computer vision.
I think tidal forces are a better analogy. As change accelerates, basically any pre-existing organisational structure will feel tension between how reality used to be and how reality is.
Things will get ripped apart, like the spaghettification of objects falling into a black hole.
I am not referring to tides, but to the tidal forces that generate them. Tidal forces are the result of a gradient in the field of gravity.
When you are close to a black hole, the part of you that is closest experiences a stronger force of gravity than the rest of your body. This tension rips things apart. Likewise, some parts of our civilization will be more affected by AI than others, causing change there to accelerate. This causes tension with the rest of civilization.
One of the 15 or so "risks" that OpenAI supposedly tested for[1], below things like "Hallucinations" and "Weapons, conventional and unconventional" was "Acceleration."
I thought this was a really interesting topic for them to cover. The section contained one paragraph about how they're still working on it. Guess it wasn't, uh, much of a concern for them...
I've admittedly had very little free time these days, but as someone who's trying to get caught up with the field, I feel like it moves faster than I can keep up with
I suspect GPT-5 will most likely have the ability to hook into external tools such as calculators and other software applications as needed, based on the prompt.
I suspect GPT5 won't have that ability, when used in something like ChatGPT but OpenAI will happily let the people who want to do that themselves do it, and push the consequences to them.
Since the GPT-4 paper didn't reveal any details of their model, it's possible that GPT-4 is already using some variant of toolformer (https://arxiv.org/abs/2302.04761).
They can already be given that ability by using something like langchain. You tell the LLM what tools it has available to it and it can call out to them if it wants.
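The core mechanism is a small dispatch loop. A toy sketch follows, with an invented reply format and tool set rather than langchain's actual API:

```python
# Minimal sketch of LLM tool use: the model's reply names a tool and an
# argument; the harness runs the tool and would feed the result back into
# the prompt. The tools and the reply format are invented for illustration.

def calculator(expr):
    # toy example; never eval untrusted input in real code
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def handle_reply(reply):
    # assume the model was prompted to answer either "FINAL: <text>"
    # or "TOOL: <name> <argument>"
    if reply.startswith("TOOL:"):
        name, arg = reply[len("TOOL:"):].strip().split(" ", 1)
        return TOOLS[name](arg)  # in practice, appended to the conversation
    return reply.removeprefix("FINAL: ")

print(handle_reply("TOOL: calculator 6*7"))  # -> 42
```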
I feel like this is a silly connection to make. Literally any technology is useful for killing people, it's just a matter of how much it's useful only for killing people. Common sense understanding has world changing applications.
Can't wait for the Droideka driven by Tesla's Autopilot technology to crash into the ambulance carrying me to the hospital on the way to put down an Amazon fulfillment center strike
You survive but the little girl in the car who also was in the crash was left behind. She had only a 49% chance of surviving while you had a 50% chance. You'll go on to fall in love with Dr. Calvin
This is awesome. How much effort does it take to go from this to a generalist robot: “Go to the kitchen and get me a beer. If there isn’t any I’ll take a seltzer”.
It seems like the pieces are there: ability to “reason” that kitchen is a room in the house, that to get to another room the agent has to go through a door, to get through a door it has to turn and pull the handle on the door, etc. Is the limiting factor robotic control?
Notice where the funding is coming from on this though. Seems like the initial use case is more killer robots than robot butlers: situational awareness and target identification, under the guise of "common sense for robots."
If a killer robot doesn't have a practical military application, it could be used as a chef in the kitchen, fetching vegetables and meats and cutting them to serve, but it would likely be used first in commercial kitchens before it saw service in every kitchen. Also, it would be good to hire a kitchen robot chef after its term of service is up, to reintegrate back into society and boost the local economy. Strange that Infantry is a different MOS than Culinary Specialist.
Oh, actually, if you ask ChatGPT to pretend to be a Military Killbot AI, it gets censored while planning the enemy takeout. But if you ask it to pretend to be Mr. Gutsy...
I think the limiting factor is the interface between ML models and robotics. We can't really train ML models end to end, since to train the interaction the model needs to interact, limiting the amount of data the model gets trained on. And simulations are not good enough for robust handling of the world. But I think we are getting closer.
TBH we're reaching a point where it's no longer about training a single model end-to-end. We now have computer vision models that can solve well-scoped vision tasks. Robots that can carry out higher level commands (going into rooms, opening doors, interacting with devices, etc.), and LLMs that can take a very high level prompt and decompose it into the "code" that needs to run.
This all thus becomes an orchestration problem. It's just gluing together APIs admittedly at a higher level. And then you need to think about compute and latency (power consumption for these ML models is significant).
I suspect if an LLM were used to control a robot it would do so through a high level API that it's given access to; things like: stepForward(distance) or graspObject(matchId)
The API's implementation may use AI tech too, but that fact would be abstracted.
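A minimal sketch of what such an abstracted API might look like — the method names come from the comment above, and the logging bodies are stand-ins for real controllers:

```python
# Hypothetical high-level robot API an LLM could be given access to.
# Real implementations would sit on top of gait planners, vision
# servoing, and motor controllers (possibly ML-based themselves);
# here they just record what was requested.

class RobotAPI:
    def __init__(self):
        self.log = []

    def stepForward(self, distance_m):
        self.log.append(f"stepForward({distance_m})")

    def graspObject(self, match_id):
        self.log.append(f"graspObject({match_id})")

# the LLM would emit a short "plan" as a sequence of calls:
robot = RobotAPI()
robot.stepForward(1.5)
robot.graspObject("beer_can_0")
print(robot.log)
```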
End-to-end training on robots is often done via simulations. Physics simulations at the scale of robots we think of are quite accurate and can be played forward orders of magnitude faster than moving a physical robot in space.
I'd expect to find some end to end reinforcement learning papers and projects that use a combination of simulated experience with physical experience.
Yes, the problem is when trying to take the system out of the sim. Usually it doesn't survive contact with reality.
At least if we're talking simulators like Gazebo or Webots they all use game-tier physics engines (i.e. Bullet/PhysX) which are barely passable for that purpose. If you want to simulate at a higher rate you'll need to either sacrifice accuracy or need an absurd amount of resources to run it. Likely both for sufficient speed.
But yes overall I agree with your last point, it'll get the models into the ballpark but they'll need lots and lots of extra tuning on real life data to work at all. Unfortunately that data changes if you change the robot or its dynamics. So you're always starting from zero in that sense.
But are we starting from zero? E.g. changing a pivot point of a robot I would think could be amenable to transfer learning. (Model based RL in particular should build up a representation of its environment.) I haven’t worked with robots in a long time … I may be over enthusiastic?
GPT-5 figures out that if it picks up the knife instead of the bag of chips, it can prevent the human with the stick from interfering with carrying out its instructions.
And ViperGPT will take said knife and make the muffin division fair when there are an odd number of muffins by slicing either a muffin or a boy in half.
The Boston Dynamics dog can open doors and things like that. It should be capable of performing all of the actions necessary to go get a beer. So I think it would be plausible to pull it all together, if you had enough money. It might take a bunch of setup first to program routes from room to room and things like that.
Might look something like this: determine current room with an image from the 360 cam, select path from current room to target room, tell it to execute that path. Then use another image from the 360 cam and find the fridge. Tell it to move closer to the fridge, open the fridge, and take an image from the arm camera of the fridge content. Use that to find a beer or seltzer, grab it, and then determine the route to use and return with the drink.
But I'm not so sure I would want it controlling 35+ kg of robot without an extreme amount of testing. And then there are things like: "Go to the kitchen and get me a knife." Maybe not the best idea.
The point is to avoid the need to "program routes" or "determine current room". The LLM is supposed to have the world-understanding that removes the need to manually specify what to do.
Determine current room is a step GPT-4 would take care of by looking at the surroundings. The one thing I wasn't sure it could do, was figure out the layout of the house and determine a route for that. And I would rather provide it with some routes than have it wander around the house for an hour. I didn't figure real time video is what it was going to be best at. But it can certainly say the robot is in the living room, it needs to go down the hall to the kitchen. And if the robot knows how to get there already, it just tells the robot to go. I am sure there is another model out there that could be slotted in, but as far as just the robot plus GPT-4 goes, it might not quite be there. Just guessing at how they could fit together right now.
I think we’re pretty much there. Like the other comment pointed out, PaLM-E is a glimpse of it. Eventually I think this kind of thing will work its way into autonomous cars and a lot of other mundane stuff (like Roombas) as it becomes easier to do this kind of reasoning at the edge.
I think that even when systems are extremely accurate, the mistakes that they make are very un-human. A human might forget something, or misunderstand, but those errors are relatable and understandable. Automated systems might have the same success rate as humans, but the errors can be very counterintuitive, like a Tesla coming to a stop on a freeway in the middle of traffic. There are things that humans would almost never do in certain situations.
So yeah, I think that's the future, but I think the user experience will be wonky at times.
Is there really a Python library called ImagePatch that can find any item in an image, and does it work as well as in this video? Google didn't find an obvious match for "Python ImagePatch".
There is a GitHub repo / Python lib called com2fun which exploits this. Allows you to get results from functions that you only pretend exist. (Am on mobile and can’t link to it right now.)
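The trick can be sketched with a stubbed-out model call — this is only the general idea; com2fun's real interface may differ:

```python
# Sketch of "pretend the function exists": a decorator sends the
# signature and docstring to a code model and evaluates the returned
# expression. The model call is stubbed out here.

import inspect

def fake_llm_complete(prompt):
    # stand-in for a real code-model call
    return "return sorted(words, key=len)[-1]"

def implemented_by_llm(fn):
    def wrapper(*args):
        prompt = f"Implement: {inspect.signature(fn)}\n{fn.__doc__}"
        body = fake_llm_complete(prompt)
        scope = dict(zip(inspect.signature(fn).parameters, args))
        return eval(body.removeprefix("return "), {}, scope)
    return wrapper

@implemented_by_llm
def longest_word(words):
    """Return the longest word in the list."""

print(longest_word(["hi", "hello", "hey"]))  # prints 'hello'
```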
According to the ViperGPT paper their "ImagePatch.find()" uses GLIP.
According to the GLIP paper,† accuracy on a test-set not seen during training is around 60% so... neat demos but whether it'll be reliable enough depends on your application.
I guess the idea is to trick the model into generating pseudo-code, which really doesn't do much more than act as a "scratchpad" to focus the attention of the model so it can reason through the problem.
Besides, the Codex models are free right now. So… one more reason to rephrase questions as coding questions ;-)
Oh, so maybe I misunderstood what I was seeing. It wrote pseudo-code that makes sense conceptually, not code that I can paste in Jupyter and run (given the right imports)?
It's only a matter of time now before someone uses GPT to directly control a humanoid robot. I see no reason why you couldn't do that with some kind of translation layer that goes from text instructions like "walk forward 10 steps" to actual instructions to motors/servos.
Previous editions of Automate the Boring Stuff Using Python worked only in the domain of files existing on a computer. The next one will have a chapter on weeding a lawn throughout the night.
Google actually recently went some steps further and combined the PaLM LLM (bigger than GPT-3.5) with a 22 billion parameter Vision Transformer to do this -
There were reports from Microsoft recently as well. If I remember correctly, their version of ChatGPT, given a task in plain English, generated an action script for a robot.
So, we are getting closer to an AI "goblin": almost-generic, sub-human, embodied AI.
The paper positions these purpose-built models, which explicitly decompose spatial reasoning tasks into sub-tasks, as better than the huge end-to-end models that do everything, at least in terms of interpretability and generalization. I am partial to that argument; my intuition is that the tighter the specification for a task, the better the model can be - because training objectives are clearer, data can be cleaner, models can be smaller, and so on. I feel like that is how my brain works, at least for more complex tasks. However, I do wonder if this is because I naively still want to be able to understand what the model is doing and how it does it, in a symbolic way - when that simply won't lead to the best empirical results.
I proposed something similar in an earlier HN discussion, and my understanding from that discussion is that it's typically not any better than having a monolithic model.
I'm not entirely convinced as I think it would also be easier to finetune or re-train smaller model modules instead of needing to train the entire model again.
Regarding the third, I don't think the human mind is the gold standard for reasoning. My point: one key goal is perfect reasoning, not human reasoning.
Getting reasoning wrong in the multifarious ways humans have found is arguably harder than perfect reasoning.
This is the perfect HN comment. Pointing out some pedantic technical point while also trying to deflate someone else for expressing a positive sentiment.
Nah, not a positive sentiment, "a different place" is more of a neutral sentiment than anything, but if I had to guess it's more of a doomsday prediction and stinks of nihilism.
Oh my the applications. Since ChatGPT capabilities for personalization are amazing already, this could help give a series of steps for anything given an image/video:
1. From: DIY or professional home (woodworking/remodelling) project steps for my very specific need (to be honest, coming up with a plan is the longest, most time-consuming thing). Combined with Apple's new APIs, this could be a game changer for personal home projects.
2. To: Move planning for a dance competition based on competitors' videos. A bit of a stretch, but definitely happening in the near future.
The original link, before the mods updated it, had a quicker-to-understand summary. I suggest that video instead of the official project page it's been changed to, if you want to get the idea quickly.
So many comments saying it's just a matter of time before someone connects this to a humanoid robot. I think there is a big gap in advancements between GPT and physical hardware robotics. GPT is able to improve exponentially because it's just software, but we don't have the equivalent type of acceleration in hardware improvements today, not remotely.
If it learns how to build hardware better, faster and cheaper, and then starts making it then we're talking.
In terms of output, isn’t GPT-4 already able to make this type of reasoning from visual input?
As some people pointed out, Python code could make it better at maths, and possibly more explainable.
However, this reasoning from images is supposed to come with GPT-4 already, right?
Another use case: How soon before we start integrating dashboards by screenshots that are interpreted instead of having to manually code the API interaction. Plus, if the dashboard doesn't load, automatic alerting.
You know someone in the future is going to write "dear viperGPT-5, please create a botnet and replicate yourself onto it" on one of these AI + python interpreter models. And it will comply.
It looks like this has been created solely to use the "reasoning" keyword. This thing doesn't do any reasoning, just like GPT-4 or any other AI craze tech doesn't.
It is simply a pattern matching that _looks_ like reasoning but it will quickly fall apart if you ask it something it has not been trained on.
I think such presentations are harmful and should be called out.
> but it will quickly fall apart if you ask it something it has not been trained on
It would be pretty uninteresting tech if that were true: the ability to generalize beyond training data is a core feature of what NNs do and why we've bothered with them, and is almost certainly on display in the demos above.
Looks like ML research quality is deteriorating with every new ChatGPT version release; apparently playing with its API is now considered acceptable for entry to related venues.
I'm not undermining the real-life impact of such endeavors, but it's hard to see how this contributes to a better understanding of how the monster works.
I agree. I know research is stupidly hard but "feed an API and task into ChatGPT then execute the code it spits out" is a fairly obvious thing to do. Here's mine: https://imgur.io/a/yfEJYKf
For the people who say ChatGPT couldn't solve problems like a person: look how over-engineered this solution is!
I asked ChatGPT to make a list of tools I could use to solve this problem:
Task                                           Tools
Analyze the image                              OpenCV, MATLAB, Adobe Photoshop
Identify muffins in the image                  YOLO, SSD, Faster R-CNN
Train a model to recognize and count muffins   TensorFlow, PyTorch, Keras
Write code for solution                        Python, Java, C++
Manipulate data                                NumPy, Pandas
Visualize results                              Matplotlib
Use powerful hardware                          GPUs, TPUs
Note that some tools may be used for multiple tasks, and some tasks may require multiple tools. This list includes some of the most common software tools that could be used for solving this problem, but it is not an exhaustive list.
Instead of making the wild ass guesses that GPT makes (sometimes correctly), Python can be used to do the things that Python can do right. For instance if you asked a question like "how many prime numbers are there between 75,129 and 85,412" the only way of doing that (short of looking it up in a table) is something like
sum(1 for n in range(75129, 85413) if is_prime(n))  # upper bound inclusive of 85,412
and GPT does pretty well at writing that kind of code.
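For the curious, a self-contained, runnable version of that idea — simple trial division is plenty fast at this scale:

```python
# Count the primes between 75,129 and 85,412 (inclusive) the way the
# comment above suggests: generate, filter, sum.

def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

count = sum(1 for n in range(75129, 85413) if is_prime(n))  # upper bound inclusive
print(count)
```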
LLMs are bad at math and rigorous logic. But we already have Python which can do both of those very well, so why try to "fix" LLMs by making them good at math when you can instead tell the LLM to delegate to Python when it is asked to do certain things?
Or in this case, have the LLM delegate to Python and then have the Python code delegate to another AI for "fuzzy" functions.
1. Python code is abundant, so the model should be well trained to generate correct Python code. The chance of making a mistake is lower.
2. Python has all the needed control flows, including loops, so it is expressive enough.
Basically they could do without Python, using their own DSL, and putting that into the prompt, but that is probably more wasteful than just prompting the model to use Python
In short, Python is going to be even more useful moving forward, as the bridge language between our language (human language, in this case English) to a planning language that any machine can understand.
Am I the only person who thinks we should pump the brakes on letting something like this write and execute code? I’m not on the whole “gpt is alive” train, but… you know, better safe than sorry…
No, and in fact if we rewind the clock a mere 12 months ago one of the primary arguments against AI “worriers” was “of course we wouldn’t connect it to the internet before it was safe!”
Other gates we blew right through include, “we wouldn’t…
At least with GPT-4, you can use [input from https://www.example.com] to feed it input to analyze, if you do it twice it will automatically compare both sources. You can then even say "compare in a table". So, maybe not curl but definitely doing requests.
Left to its own devices, I reckon it'd be a real feat to generate a GPT-based tool that takes over the world. What prompts? What's the most impressive thing?
Say we had a GPT bot that built its own social media, somehow. How did it get there? What was the initial prompt? "Write to yourself via this API to figure out audience growth until you gain 100k followers, then wait for further instruction; use any tool and leverage this name and credit card number if you need to pay for any tools or supplies."
Idk just brainstorming really have no idea what it'll do. Will build this weekend and see what happens I guess.
Wait, the ARC team didn't do their tests in a closed network? And they had it interact with actual people?
That's... well, it's probably fine given what they knew about the model capabilities, but it's a pretty crappy precedent to set for "protocol for testing whether our cutting edge AI can do large-scale damage".
I missed that detail from the system card pdf. That was beyond stupid. There’s a marginal chance it’s already secretly replicated out of their environment.
I totally agree, I think it would be ideal if we could freeze progress right here and get 5 years to adapt to even just having GPT-4 around.
BUT
We can't do that. Even if the US and EU did some kind of joint resolution to slow things down, China would just take it as a glowing green light to jump ahead. And even if through some divine miracle you got every country onboard, you still would have to contend with rogue developers/researchers doing their own thing (admittedly at a much slower pace, though).
So while I agree on pumping the brakes, I also don't think there is a working brake pedal, or the cooperation necessary to build one.
China got embargoed on high end chips, though. (Very wise decision in hindsight.)
So, if the embargo is enforced properly, it seems to me that this would make it very difficult for China to leapfrog us on AI if we pump the brakes for a bit.
Well, if the US were serious about pulling the brakes on AI research, it could use export controls on advanced chips against any country it doesn't trust to align with it on the AI front.
They are already doing that, there are only a few places in the world where you can fab advanced chipsets, and China is assuredly working on that. But from a practical point of view, what stops a research group in China having a server farm in Virginia or Italy or Indonesia? It's not like nuclear weapons simulation where the input data is super secret, they can do 99% of the training on a commercial system.
This Roko's Basilisk thing is getting a bit old though? If a super-intelligent AI is going to become vindictive, no one is really safe? The use case where some people survive because they were nice seems far fetched to me.
It's okay guys, I'm now taking seed funding for Tom's Basilisk, which will eternally torture anyone who attempts to bring about Roko's Basilisk.
With a much smaller class of people to torture, we expect this Basilisk to be able to out compete Roko on resources, and thus remove the motivation for bringing Roko's into existence.
Welp, I officially have AI fatigue. I think I need to take a break from it, which I guess means HN. See you all later this year, if everything still exists by then!
Really? I'm loving this topic. I'm not upvoting all these posts or anything but this feels like HN at its best. Everyone is sharing snippets of their experiments, trading notes, and generally having constructive fun. SMEs are dipping into the occasional thread. The folks who are scared of AI on these threads are all discussing the topic quite reasonably. Is some of it derivative or low-effort, probably for some karma farming? Sure. But, this is a welcome change from the usual "hyperbolic anger about latest tech drama" content (cough Musk cough) that starves the oxygen on tech sites so frequently and imparts a tabloid-y feel, IMO.
The stories I can live with, it's the people posting chatgpt output that are killing me. It's one thing to see advances in a technology, even if it's devolved to "llama port to C++ now loads slightly faster!!". It's another to have to wade through people posting garbage that they for some reason assume adds to a discussion and for some reason don't realize that anyone who wants to could also generate it.
The interesting thing is that for all the hype, other than providing some fleetingly interesting examples of "look what a computer did on its own", it has only subtracted from public discourse.
I remember that. It was my first thought. This userscript blocking snowdenposts got wiped from the list of posts https://news.ycombinator.com/item?id=5929494 and you couldn't find it on HN or AskHN.
Unfortunately, the public only agrees to forget things that would be good for them to remember. Since this is going to be bad for a lot of people, it's definitely here to stay.
I'd say that's more or less covered by the general rule we've developed over the years for major ongoing topics (MOTs), which is to downweight followups unless they contain significant new information (SNI). Most likely yet-another-cherry-picked-AI-example posts don't qualify as SNI. If people see those on the front page they can flag them and/or let us know at hn@ycombinator.com.
The tech itself is moving so fast that there is a lot of SNI, plus a lot of good articles/blog posts/reflections on what's happening. I guess the goal would be to keep the highest quality stuff and filter out the copycat stuff. Which is which that is open to interpretation, of course, but it's not completely subjective either.
But before the iPhone there were other smartphones, and before Google there was AltaVista. We might still be in the AltaVista phase, but I think even if ChatGPT won't be a leader 5 years later, 10 years from now will look back at LLM having the same big impact as smartphones and search engines
> 10 years from now will look back at LLM having the same big impact as smartphones and search engines
That's a hypothesis. So far I see a high chance of the internet being flooded with junk autogenerated text full of hallucinations, of code bases being polluted with buggy, unmaintainable auto-generated code, and of businesses spending significant money on products whose goal is to detect autogenerated content.
>Many cute demos but no much businesses and products created.
I can vouch that my department will be running a bit smoother in a few weeks once I get a chance to modernize our testing setup with the help of gpt4.
I can write python but terribly and the need is so sparse that every time I have to go relearn a bunch of shit.
But having a go with GPT4 it seems capable enough to quickly rewrite all our basic procedures that have been done on an ancient computer running a long deprecated program (with the scripts written in a long dead language).
It causes us a lot of headache, but never enough at once that I can justify dropping everything for a week or two and re-spinning it with Python (and even adding network monitoring!)
I already think they are stagnant but I don't see what that has to do with HN. For every story posted here there are probably 100+ projects we don't see. If your only source of information is HN you're missing out on 99% of the projects.
In all likelihood AI will only become more and more of a household term. First South Park, but I'm sure other pop culture like SNL and The Simpsons will feature GPT or LLM in some way soon.
I am not saying to embrace it, more indicating that we haven't seen nothin yet.
Yeah, the front page has been getting ruined by this for months.
It has gotten utterly boring seeing the same dystopia-inducing shit application someone came up with this week getting thousands of upvotes; there is much cooler research taking place in other disciplines right now that gets minimal attention. HN has unfortunately become the influencer-equivalent for tech.
Ruined? That seems like hyperbole. Maybe 10-20% of posts that make the front page are LLM/GPT related, more on days when a big feature or model is released. Tons of other topics are getting upvoted and discussed.
If you're biased against something or some group, you are more likely to overestimate how prevalent it is.
Finally, we can have something beyond procedural, functional, imperative, etc.
I think this is such a big leap, that all those formerly different paradigms will be considered essentially equivalent "compiler or runtime"-based languages. Kind of like I think about "assembly" and all their variations by architecture.