I always thought the potential for Open Interpreter would be something like an "open source ChatGPT desktop assistant" app with swappable LLMs, especially vision, since that (specifically the one teased at 4o's launch https://www.youtube.com/watch?v=yJHw33cVeHo) has not yet been released by OAI. They made some headway with the "01" device that they teased... and then canceled.
Instead, all the demo use cases seem very trivial: "Plot AAPL and META's normalized stock prices". "Add subtitles to all videos in /videos" is a bit more interesting, but honestly, trying to hack it together in a "code interpreter" inline in a terminal is strictly worse than just opening up Cursor, for me.
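(For reference, the "normalized stock prices" demo boils down to a few lines of pandas. A minimal sketch with synthetic prices; a real Open Interpreter session would fetch live quotes with some market-data library, which is an assumption here:)

```python
import pandas as pd

# Synthetic daily closing prices standing in for live AAPL/META quotes.
prices = pd.DataFrame(
    {"AAPL": [180.0, 182.5, 181.0, 185.0],
     "META": [470.0, 465.0, 480.0, 490.0]},
    index=pd.date_range("2024-01-02", periods=4, freq="B"),
)

# "Normalized" = each series rebased to 1.0 on its first day, so the
# two tickers' relative performance is directly comparable.
normalized = prices / prices.iloc[0]

# normalized.plot(title="Normalized closing prices") would render the chart.
print(normalized.round(4))
```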
I'd be interested to hear whether anyone here is an active user of OI, and what you use it for.
> "open source chatgpt desktop assistant" app with swappable llms
THIS. OpenAI, Claude, and Perplexity all have their own desktop apps now. It's time we got a generic, model-agnostic one.
On the other hand, we will likely see the underlying platforms (operating systems and browsers) ship with their own interface. I just hope they turn out to be model-agnostic.
I find the "Can you ..." phrasing used in this demo/project fascinating. I would have expected the LLM to basically say "Yes I can, would you like me to do it?" to most of these questions, rather than directly and immediately executing the action.
If an employer were to ask an employee, "can you write up this report and send it to me" and they said, "yes I can, would you like me to do it?", I think it would be received poorly. I believe this is a close approximation of the relationship people tend to have with ChatGPT.
It depends; the 'can you' (or 'can I get') phrasing appears to be a US English thing.
Managers often expect subordinates to just know what they mean, but checking instructions and requirements is usually essential and imo is a mark of a good worker.
"Can you dispose of our latest product in a landfill"...
Generally in the UK, unless the person is a major consumer of US media, "can you" is an enquiry as to capability, or whether an action is within the rules.
I'm very curious why you think that! Sincerely. These models undergo significant human-aided training where people express a preference for certain behaviours, and that is fed back into the training process: I feel like the behaviour you mention would probably be trained out pretty quickly since most people would find it unhelpful, but I'm really just guessing.
What distinguishes LLMs from classical computing is that they're very much not pedantic. Because the model is predicting what human text would follow a given piece of content, you can generally expect them to react approximately the way that a human would in writing.
In this example, if a human responded that way I would assume they were either being passive aggressive or were autistic or spoke English as a second language. A neurotypical native speaker acting in good faith would invariably interpret the question as a request, not a question.
I assume it's more a part of an explicitly programmed set of responses than standard inference. But you're right that I should be cautious.
ChatGPT, for example, says it can retrieve URL contents (for RAG). When it does an inference, it then shows a message indicating the retrieval is happening. In my very limited testing it has responded appropriately, e.g. it can talk about what's on the HN front page right now.
Similarly Claude.ai says it can't do such retrieval - except through API use? - and doesn't appear to do so either.
It's funny that we're getting so much attention funneled towards the thought-to-machine I/O problem now that LLMs are on the scene.
If the improvements are beneficial now, then surely they were beneficial before.
Prior to LLMs, though, we could have been making judicious use of simple algorithmic approaches to process natural language constructs as command language. We didn't see a lot of interest in it.
> Prior to LLMs, though, we could have been making judicious use of simple algorithmic approaches to process natural language constructs as command language. We didn't see a lot of interest in it.
Siri was released in 2011, and Alexa and Google Assistant followed soon thereafter. Companies spent tens of millions of dollars improving their algorithmic NLP because voice interfaces were "the future". I took a class in the late 2010s that went over all of the methodologies that they used for intent parsing and slot filling. All of that has been largely abandoned at this point in favor of LLMs for everything.
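(For anyone who missed that era: "intent parsing and slot filling" meant mapping an utterance to one of a fixed set of intents and extracting typed arguments, the "slots". A toy rule-based sketch, with made-up intents and patterns rather than any real assistant's grammar:)

```python
import re

# Toy grammar: each intent is a regex whose named groups are the slots.
# These two intents and their patterns are purely illustrative.
INTENTS = [
    ("set_timer", re.compile(r"set a timer for (?P<minutes>\d+) minutes?")),
    ("play_music", re.compile(r"play (?P<track>.+) by (?P<artist>.+)")),
]

def parse(utterance):
    """Return (intent, slots) for the first matching pattern, else (None, {})."""
    text = utterance.lower().strip()
    for intent, pattern in INTENTS:
        m = pattern.fullmatch(text)
        if m:
            return intent, m.groupdict()
    return None, {}

print(parse("Set a timer for 10 minutes"))  # ('set_timer', {'minutes': '10'})
```

The production systems layered statistical classifiers and sequence taggers on top of this idea, but the contract was the same: a closed set of intents, each with a fixed slot schema.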
My hope is that at some point people will come back to these UI paradigms as we realize the limitations of "everything is a chat bot". There's a simplicity to the context-free limited voice assistants that had a set of specific use cases they could handle, and the effort to chatbot everything is starting to destroy the legitimate use cases that came out of that era like timers and reminders.
I have a somewhat different perspective. The way I see it, for the past 10+ years, the major vendors were going out of their way to try for a generic NLP interface. At that point, it was already known that controlled language[0] + environmental context could allow for highly functional voice control. But for some reason[1], the vendors really wanted assistants to guess what people mean. As a result, we got 10+ years of shitty assistants that couldn't reliably do anything, not even set a goddamn timer, and weren't able to do much either - it's hard to have many complex features when you can't get the few simplest ones right.
This was a bad direction then. Now, for better or worse, all those vendors got their miracle: LLMs are literally plug-and-play boxes that implement the "parse arbitrary natural-language queries and map them to system capabilities" functionality. Thanks to LLMs, voice interfaces could actually start working. If vendors could also get the "having useful functionality" part right.
(Note: this is distinct from "everything is a chat bot". That's a bad idea simply because typing text sucks, specifically typing out your thoughts in prose form is about the least efficient way to interact with a tool. Voice interfaces are an exception here.)
[1] - Perhaps this weird idea that controlled languages are too hard for general population, too much like programming, or such. They're not. More generally, we've always had to "meet in the middle" with our machines, and it was - and remains - always a highly successful approach.
A lot of money was poured into that goal, but because every type of action required a handcrafted integration, they were either costly to develop or extremely limited. That’s no longer the case.
Complex digital assistants aiming to be do-everything secretaries are not what I had in mind when I said "simple algorithmic approaches".
That aside, which of those were attempts to improve input to a computer like the project submitted here? Everything you listed was mostly focused on (a) trying to establish voice as a valid input method, (b) to create a new class of applications, (c) for more-or-less locked-down devices. (The one assistant that's closest to what I'm referring to—but still misses the mark—is the one you didn't mention: Cortana.)
> because every type of action required a handcrafted integration, they were either costly to develop or extremely limited
That describes all conventional software—think of everything you do on your computer. How many lines of code across how many different software packages, each handcrafted, are on your computer? And how narrow versus broad and featureful is each one (calc.exe, for example)? "Do one thing and do it well" is an entire, highly regarded philosophical outlook on how to make great software.
Here's the transcript: https://gist.github.com/simonw/f78a2ebd2e06b821192ec91963995...