This is quite intriguing, mostly because of the author.
I don't understand very well how llamafiles work, so it looks a little suspicious to just call it every time you want completion (model loading etc), but I'm sure this is somehow covered within the llamafile's system. I wonder about the latency and whether it would be much impacted if a network call were introduced so that you could use a model hosted elsewhere. Say a team uses a bunch of models for development, shares them in a private cluster and uses them for code completion without the necessity of leaking any code to openai etc.
I've just added a video demo to the README. It takes several seconds the first time you do a completion on any given file, since it needs to process the initial system prompt. But it stores the prompt to a foo.cache file alongside your file, so any subsequent completions start generating tokens within a few hundred milliseconds, depending on model size.
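For readers who want to picture the caching described above, here is a minimal Python sketch of the sidecar-cache pattern. It is not the project's actual code: `cache_path_for`, `process_prompt` and `load_or_build_state` are hypothetical names, the `foo.py` to `foo.cache` naming is an assumption based on the description, and the "state" is just a stand-in for whatever the model stores after evaluating the prompt.

```python
import os

def cache_path_for(source_path):
    """Assumed naming scheme: 'foo.py' -> 'foo.cache' next to the source file."""
    root, _ext = os.path.splitext(source_path)
    return root + ".cache"

def process_prompt(prompt):
    """Hypothetical stand-in for the slow step (evaluating the system prompt)."""
    return prompt.encode("utf-8")[::-1]

def load_or_build_state(source_path, prompt):
    """Return (state, cache_hit). Only the first call per file pays the full cost."""
    cache = cache_path_for(source_path)
    if os.path.exists(cache):
        # Later completions: reuse the stored state instead of recomputing it.
        with open(cache, "rb") as f:
            return f.read(), True
    # First completion on this file: do the expensive work and save it.
    state = process_prompt(prompt)
    with open(cache, "wb") as f:
        f.write(state)
    return state, False
```

Calling `load_or_build_state("foo.py", prompt)` twice would take the slow path once and the cached path afterwards. A real implementation would also need to invalidate the cache when the prompt or model changes, which this sketch leaves out.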
Looks like I won't use it though, because I like how Microsoft's Copilot and its implementations in Emacs work: suggest completions with greyed out text after cursor, in one go, without the need to ask for it and discard it if it doesn't fit. Just accept the completion if you like it. For reference: https://github.com/zerolfx/copilot.el
That, coupled with speed, makes it usable for slightly extended code completion (up to one line of code), especially in highly dynamic programming languages that have worse completion support.
Fair enough. Myself on the other hand, I want the LLM to think when I tell it to think (by pressing the completion keystroke) and I want to be able to supervise it while it's thinking, and edit out any generated prompt content I dislike. The emacs-copilot project design lets me do that. While it might not be great for VSCode users, I think what I've done is a very appropriate adaptation of Microsoft's ideas that makes it a culture fit for the GNU Emacs crowd, because Emacs users like to be in control.
While I understand the general sentiment, I don't understand the specific point. After all, company-mode and its numerous lsp-based backends are often used for _unprompted_ completion (after typing 2 or 3 characters), which the user can either select or ignore. This is the first time I've heard of this being somehow against the spirit of GNU. Would you argue this is somehow relinquishing control? I like it, since it's very quick and cheap, I don't mind it running more often than I use it, because it saves me the keyboard clicks to explicitly ask for completion.
FYI I'm not trying to diminish your project, and I'm glad you've made something which scratches your exact itch. I'm also hopeful others will like it.
> Would you argue this is somehow relinquishing control? I like it, since it's very quick and cheap, I don't mind it running more often than I use it, because it saves me the keyboard clicks to explicitly ask for completion.
I can't answer for others, but personally I don't like the zsh-like approach of "show the possible completions in dark grey after the cursor" because it disrupts my thoughts.
It's pull vs push: whether on the commandline or using an AI, I want the results only when I feel I need them - not before.
If they are pushed onto me (like the mailbox count, or other irrelevant parameters), they distract me and interrupt my thoughts.
I love optimization and saving a few clicks, but here the potential for distraction during an activity that requires intense concentration would be much worse.
I don't mind a single completion so much, as long as there's a reasonable degree of precision there. But otherwise I agree with you. I feel like they're only useful if you start typing without knowing what you want to do or how to do it, but if that is the case I know that is the case. Having a keypress to turn on that behavior temporarily just for that might not be so bad.
It's a massive distraction to me, and I refuse to have it turned on anywhere I can turn it off and will actively avoid software that forces it on me.
I can somewhat accept it showing an option if 1) it's the only one, 2) it's not rapidly changing with my typing. I know what I want to type before I type it or know I'm unsure what to type. In the former, a completion is only useful if it correctly matches what I wanted to type.
In the latter, what I'm typing is effectively a search query, and then completion on typing might not be so bad, but that's the exception, not the norm.
Eh, it's a mixed bag. The way GitHub Copilot offers suggestions means that it's very easy to discover the sorts of things it can autocomplete well, which can be surprising. I've certainly had it make perfect suggestions in places I thought I was going to have to work at it a bit - like, say, thinking I'm going to need to insert a comment to tell it what to generate, pressing enter, and it offering the exact comment I was going to write. Having tried both push and pull modes, I found it much harder to build a good mental model of LLM capabilities in pull mode.
It's annoying when a pushed prediction is wrong, but when it's right it's like coding at the speed of thought. It's almost uncanny, but it gets me into flow state really fast. And part of that is being able to accept suggestions with minimal friction.
I agree with that. The constant stream of completions with things like VS Code even without copilot is infuriatingly distracting, and I don't get how people can work like that.
I don't use Emacs any more, but I'll likely take pretty much the same approach for my own editor.
Yes, I find it absolutely awful. It covers things I want to see, and on most keypresses it provides no value. I'm somewhat more sympathetic to UIs if they provide auto-complete in a static separate panel that doesn't change so quickly. It feels to me like a beginner's crutch, but even when I'm working in languages I don't know I'd much rather call it up as needed so I actually get a chance to improve my recall.
Also not familiar with llamafiles, but if it uses llama.cpp under the hood, it can probably make use of mmap to avoid fully loading on each run. If the GPU on Macs can access the mmapped file, then it would be fast.
Author here. It does make use of mmap(). I worked on adding mmap() support to llama.cpp back in March, specifically so I could build things like Emacs Copilot. See: https://github.com/ggerganov/llama.cpp/pull/613 Recently I've been working with Mozilla to create llamafile, so that using llama.cpp can be even easier. We've been upstreaming a lot of bug fixes too!
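To illustrate why mmap() helps with repeated short-lived runs, here is a small Python sketch using the standard library's `mmap` module. It is only an analogy for the idea, not llama.cpp's or llamafile's code, and `read_slice_mmap` is a hypothetical helper. Mapping a file does not read it up front: the OS pages data in on first access and can keep those pages in its cache, so a later process mapping the same file may find them already in memory.

```python
import mmap

def read_slice_mmap(path, offset, length):
    """Map the whole file read-only and copy out only the requested bytes."""
    with open(path, "rb") as f:
        # Length 0 means "map the entire file"; nothing is read eagerly here.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing touches only the pages that back this byte range.
            return mm[offset:offset + length]
```

For example, `read_slice_mmap("model.bin", 4096, 16)` (with a hypothetical `model.bin`) would return 16 bytes without reading the rest of the file into the process.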