Speech Dictation Mode for Emacs (lepisma.xyz)
127 points by adityaathalye 4 months ago | 32 comments



To run speech-to-text on my laptop, I've been using Justine Tunney's downloadable single-executable Whisper file.

I use it to transcribe audio, then copy the transcript into an LLM to get notes on whatever it is. That helps me decide whether to watch or listen to something and saves a bunch of time.

Her tweet: https://x.com/JustineTunney/status/1825551821857010143

Instructions from Simon Willison: https://simonwillison.net/2024/Aug/19/whisperfile/

Command line options: https://github.com/Mozilla-Ocho/llamafile/issues/544#issueco...


Amazing work.

I am also impressed by the advances in technology. 20 years ago, I had severe RSI problems and worked on "vx-mode", a package for interfacing XEmacs to Dragon NaturallySpeaking, the best speech-recognition solution available at the time. My goals were similar, although the result was nowhere near what the OP has done. Also, speech recognition tech was nowhere near what we have now: I still remember buying good microphones, worrying about microphone placement relative to mouth, endless training and re-training…

This kind of software can make a huge difference for many people.


I'm really happy about it but I'm not sure how game changing it would be for a blind person. It seems to require seeing what's on the page.


Perhaps not for a blind person, but for anyone with RSI or other hand/wrist impairments, this can make a huge difference. I speak from experience, having used dictation to work around RSI issues.


Year 2080: AGIs help you transcribe, structure, and lay out your code/text/thoughts. At the same time, HN posts: „New package for Emacs doing xyz“.


And all it requires is some emacs version bump, some dependency upgrades, some external servers and changing the default shortcut in a confusing lisp file to something that doesn't require pressing 8 keys at the same time


Fun fact: even pressing three keys at the same time is rare when using Emacs (although there are some three-key combos I use regularly); most shortcuts consist of consecutive key presses.


I sometimes feel like playing the piano :D But the UX is better than you'd think, there's packages that show you what options you have for what key to press next, and the sequences are generally quite logical (e.g. CTRL-x followed by "p" has all the commands related to projects).

Plus you can always just enter the command instead of using the key stroke for it. Again, the default UX for that is a bit weak, but with a few packages it becomes pretty strong.


> there's packages that show you what options you have for what key to press next

Rejoice! The excellent which-key package that does this comes bundled with Emacs 30! (Emacs 30 will probably be released soon.)
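A minimal sketch of turning it on, assuming Emacs 30 or later (where which-key is bundled, and `setopt` is available). The 0.5-second delay is an arbitrary illustration, not a recommended value:

```elisp
;; Assumption: Emacs 30+, where which-key ships with Emacs.
;; Enable the popup that lists possible next keys after a prefix.
(which-key-mode 1)

;; Assumption: 0.5 s is an arbitrary choice for how long to wait
;; after a prefix key before the popup appears.
(setopt which-key-idle-delay 0.5)
```

With this in place, typing a prefix such as C-x and pausing should list the available continuations.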

> enter command… default UX is a bit weak

Agreed: the packages Helm, Ivy, and Vertico make this interface much nicer. I use Vertico [1] personally. Though, from Emacs 29, there are some really nice options you can set. I used the following in my Bedrock starter kit [2] to get nicer tab-completion: as soon as you hit TAB twice you'll get bumped into the Completion buffer to select something with your cursor.

Here's the relevant config:

    (setopt completion-auto-help 'always)                  ; Open completion always; `lazy' another option
    (setopt completions-max-height 20)                     ; This is arbitrary
    (setopt completions-detailed t)
    (setopt completions-format 'one-column)
    (setopt completions-group t)
    (setopt completion-auto-select 'second-tab)            ; Much more eager
    ;(setopt completion-auto-select t)                     ; See `C-h v completion-auto-select' for more possible values

There are more configuration options, of course, but this is helpful.

[1]: https://github.com/minad/vertico

[2]: https://codeberg.org/ashton314/emacs-bedrock


which-key made it in? Sweet! I've been saying for years it should be in Emacs and turned on by default.


True. I oftentimes find myself typing out the command rather than using some obscure key sequence like C-c C-v n (case in point: https://orgmode.org/manual/Key-bindings-and-Useful-Functions...). Since Emacs does tab completion for the command name too, I personally find that a better UX than using the "shortcut" (if I can remember it at all).


I tend to use search for infrequently used stuff and stuff I'm just trying to learn for the first time, then if I find myself using it several times in a session I look up the keybind to start practicing that. If it sticks, it sticks, and if it doesn't... the search functionality is great!


> the sequences are generally quite logical (e.g. CTRL-x followed by "p" has all the commands related to projects).

They really are not.


Depends on whether you count shift. I use C-M-% (query-replace-regexp) fairly regularly, and that's 4.


Sure, shift counts. I suppose I would bind it to a more convenient key if I used query-replace-regexp regularly, but note that I didn't say there weren't any such keybindings, just that they're rare.
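For illustration, such a rebinding might look like the sketch below. The key `C-c x` is a hypothetical choice; any free key would do:

```elisp
;; Assumption: "C-c x" is an arbitrary, hypothetical key choice.
;; C-c followed by a letter is conventionally left free for user bindings.
(global-set-key (kbd "C-c x") #'query-replace-regexp)
```

The default C-M-% binding keeps working alongside the new one.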


I assume this varies widely across setups.

    (use-package visual-regexp
      :defer t
      :bind (("C-c r" . vr/replace)
             ("C-c q" . vr/query-replace)
             ("C-r" . vr/isearch-backward)
             ("C-s" . vr/isearch-forward)))

    (use-package visual-regexp-steroids
      :defer t)


Year 2080: "M-x ai: imagine you are a smart Emacs developer, write a configuration file that sets up LSP"

answer:

"I did it. Please note that you're using a Microsoft protocol. Microsoft has a long history of attacking the 4 core freedoms of the Free Software movement which are

The freedom to run the program as you wish, for any purpose (freedom 0). ..."


This is kinda ideal tbh. I like how, for instance, F-Droid warns users about anti-features and integrations with proprietary web services. Clear messaging about problematic software + freedom to nonetheless choose those problematic options is great.

That said, I don't think this is the way the FSF evaluates software, or that they'd treat an open protocol like this. I could imagine a warning like this about integrating with a proprietary language server in particular, though— and I'd be grateful for it! A locally-run AI assistant that cared about things like that would be super cool.



That AI would be running under GNU Hurd with Guix. Also, Scheme simplified itself so hard that it created something akin to the Common Lisp standard, unifying all ice's and srfi's into something manageable by humans in a single package.

Also it rewrote all of the legacy Emacs' Elisp into manageable Emacs Guile (with an uberfast JIT and/or libre Guile microcode from the FSF).


Hey, author here. Didn't notice this came up on HN.

I wrote a small follow-up about trying to write and speak at the same time here: https://lepisma.xyz/journal/2024/09/13/can-i-output-two-stre...


That's a cool idea. Could the LLM find the right location for the audio stream by simply having the context of the buffer, and the location of the text and audio cursor when the interaction starts?


I think it could work. In my example of writing a docstring, I can see this working out with high probability.


This looks very useful, and beautifully presented. Looking forward to being able to use it with a local model.


I would use this for edits that are hard to do otherwise. Like, instead of typing `M-x align-regexp` and then figuring out what regular expression to type, I would just highlight a passage and say to the LLM "Can you align all the library names in this import statement?"
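For comparison, here is roughly what the manual route looks like once wrapped in a command. This is only a sketch: `my/align-on-equals` is a hypothetical name, and aligning on `=` is an illustrative stand-in for whatever pattern the import statement would actually need.

```elisp
;; Hypothetical helper: the name and the choice of "=" are assumptions
;; made for illustration only.
(defun my/align-on-equals (beg end)
  "Align the first `=' on each line between BEG and END."
  (interactive "r")
  ;; Group 1 is the whitespace before "=" that align-regexp adjusts;
  ;; the second 1 is the spacing; nil means do not repeat per line.
  (align-regexp beg end "\\(\\s-*\\)=" 1 1 nil))
```

Working out the `\\(\\s-*\\)` part of that pattern is exactly the step a spoken instruction would skip.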


I did something similar here:

https://blog.nawaz.org/posts/2023/Dec/cleaning-up-speech-rec...

I now use Whisper with a much expanded prompt and have the flow integrated both in Emacs and my WM.

Prior HN discussion:

https://news.ycombinator.com/item?id=40174921

I've since done hours of transcription with it - often transcribing whole emails. The challenge is that my brain thinks very differently while talking compared to while typing. As a result, my output is very verbose, and is very different from what I would have typed. I haven't figured out how to speak as if I'm typing.


"Emacs: Upgrade to MELPA"

ELPA installed s/w suite: "I'm sorry Dave, I can't do that"


More like: Emacs: pull all the libre MELPA repos into a local .el file to be checked on demand. Hide all the proprietary-dependent or proprietary repos.


nerd-dictation is a decent offline speech dictation tool for Linux that I've used with Emacs https://github.com/ideasman42/nerd-dictation
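A rough sketch of driving it from Emacs, assuming `nerd-dictation` is on your PATH, a speech model is already installed where the tool expects it, and the `begin`/`end` subcommands behave as described in its README. The function names are hypothetical:

```elisp
;; Hypothetical wrappers; names and setup are assumptions, not part of
;; nerd-dictation itself.
(defun my/nerd-dictation-begin ()
  "Start nerd-dictation in the background."
  (interactive)
  ;; No output buffer is needed; the tool types into the focused window.
  (start-process "nerd-dictation" nil "nerd-dictation" "begin"))

(defun my/nerd-dictation-end ()
  "Stop a running nerd-dictation session."
  (interactive)
  ;; Destination 0 discards output and returns without waiting.
  (call-process "nerd-dictation" nil 0 nil "end"))
```

Binding the two commands to a pair of keys gives a simple push-to-start, push-to-stop flow.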


Has anyone gotten whisper.el/.cpp to work on OSX with the microphone permissions and Emacs?


Would the author mind sharing his Emacs configuration? So beautiful!




