Hacker News | mythz's comments

Big fan of Salvatore's voxtral.c and flux2.c projects - hope they continue to get optimized, as it'd be great to have lean options without external deps. Unfortunately voxtral.c is currently too slow for real-world use (AMD 7800X3D / BLAS) when adding Voice Input support to llms-py [1].

In the end Omarchy's new support for voxtype.io provided the nicest UX, followed by Whisper.cpp, and despite being slower, OpenAI's Whisper is still a solid local transcription option.

Also very impressed with both the performance and price of Mistral's new Voxtral Transcription API [2] - really fast/instant and really cheap ($0.003/min), IMO the best option in CPU/disk-constrained environments.

[1] https://llmspy.org/docs/features/voice-input

[2] https://docs.mistral.ai/models/voxtral-mini-transcribe-26-02
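
For anyone who wants to try it, here's a minimal sketch of calling the transcription API over HTTP with Python's requests library. The endpoint path and model id are assumptions on my part, so verify them against [2]:

    # Minimal sketch, assuming an OpenAI-style multipart transcription endpoint.
    # Verify the endpoint path and model id against the Mistral docs in [2].
    import os
    import requests

    with open("recording.wav", "rb") as f:
        resp = requests.post(
            "https://api.mistral.ai/v1/audio/transcriptions",  # assumed path
            headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
            files={"file": ("recording.wav", f, "audio/wav")},
            data={"model": "voxtral-mini-latest"},  # assumed model id, see [2]
            timeout=60,
        )
    resp.raise_for_status()
    print(resp.json().get("text", ""))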


Hi! This model is great, but it is too big for local inference. Whisper medium (the "base" model IMHO is not usable for most things, and "large" is too large) is a better deal for many environments, even if the transcription quality is noticeably lower (and even if it does not have a real online mode). But... it's time for me to check the new Qwen 0.6 transcription model. If it works as well as their benchmarks claim, it could be the target for very serious optimizations and a no-deps inference chain designed from the start for CPU execution, not just for MPS, since many times you want to install such transcription systems on servers rented online via Hetzner and other similar vendors. So I'm going to tackle it next, and if it delivers, it's really time for big optimizations covering specifically the Intel, AMD and ARM instruction sets, potentially also looking at 8-bit quants if the performance remains good.

Same experience here with Whisper: medium is often not good enough. The large-turbo model however is pretty decent, and on Apple silicon fast enough for real-time conversations. The prompt parameter can also help with transcription quality, especially when using domain-specific vocabulary (quick sketch below). In general Whisper.cpp is better at transcribing full phrases than at streaming.

And not to forget: for many use cases more than just English is needed. Unfortunately most STT/ASR and TTS models right now focus on English plus 0-10 other languages. Being able to add more languages or domain-specific vocabulary with reasonable effort would be a huge plus for any STT or TTS system.
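
For reference, here's what that prompt biasing looks like with the openai-whisper Python package (whisper.cpp's CLI exposes the same idea via its --prompt flag); the audio file and vocabulary below are just placeholders:

    # Bias Whisper toward domain vocabulary via the initial prompt.
    # Placeholder file name and vocabulary; swap in your own.
    import whisper

    model = whisper.load_model("turbo")  # alias for large-v3-turbo
    result = model.transcribe(
        "meeting.wav",
        initial_prompt="Kubernetes, OIDC, oauth2-proxy, Waybar, Omarchy",
    )
    print(result["text"])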


+1 for voxtype. With the Whisper base model it is quite fast and accurate.

One thing I keep looking for is transcribing while I'm talking. I feel like I need that visual feedback. Does voxtype support that?

(I wasn't able to find anything at a glance.)

Handy claims to have an overlay, but it seems to not work on my system.


Not sure how it works in other OSes, but in Omarchy [1] you hold down `Super + Ctrl + X` to start recording and release it to stop. While it's recording you'll see a red voice-recording icon in the top bar, so it's clear when it's recording.

Although as llms-py is a local web app I had to build my own visual indicator [2], which displays a red microphone next to the prompt when it's recording. It also supports both tap on/off and hold-down recording modes. When using voxtype I'm just using the tool for transcription (i.e. not Omarchy's OS-wide dictation feature), like:

$ voxtype transcribe /path/to/audio.wav

If you're interested the Python source code to support multiple voice transcription backends is at: [3]

[1] https://learn.omacom.io/2/the-omarchy-manual/107/ai

[2] https://llmspy.org/docs/features/voice-input

[3] https://github.com/ServiceStack/llms/blob/main/llms/extensio...
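
For a rough idea of how a backend like that can wrap the voxtype CLI, here's a hypothetical sketch (not the actual llms-py extension API, see [3] for the real code):

    # Hypothetical shape of a transcription backend that shells out to the
    # voxtype CLI; illustration only, not the actual llms-py extension API.
    import subprocess

    def transcribe_with_voxtype(audio_path: str) -> str:
        # Assumes voxtype prints the transcription to stdout.
        result = subprocess.run(
            ["voxtype", "transcribe", audio_path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    if __name__ == "__main__":
        print(transcribe_with_voxtype("/path/to/audio.wav"))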


Ah, the thing I really want is to see the words that I'm speaking being transcribed (i.e. in realtime). For some reason I rarely see that feature.


hahaha! plus ça change indeed.

(I keep coming back to this one so I've got half a dozen messages on HN asking for the exact same thing!).

It's a shame: Whisper is so prevalent and everyone uses it, but it's not great at actual streaming.

I'm hoping one of these might become a de facto realtime standard so we can actually get our realtime streaming API (and yep, I'd be perfectly happy with something just writing to stdout, but all the tools always end up batching because it's simpler!)


I am using a window manager with Waybar. Voxtype can display a status icon on Waybar [1]; that is enough for me to know what is going on.

[1] https://github.com/peteonrails/voxtype/blob/main/docs/WAYBAR...


I've shipped an Omarchy MCP Server that lets AI Assistants manage your Omarchy desktop themes - switch wallpapers, change color schemes, toggle dark mode and more, all from natural language:

https://llmspy.org/docs/mcp/omarchy_mcp
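
To give a flavor of how small an MCP tool like this can be, here's a minimal sketch using the FastMCP Python library; the tool shape and the omarchy-theme-set command are illustrative assumptions, not necessarily how the actual server linked above is implemented:

    # Minimal sketch of an MCP server exposing a theme-switching tool,
    # using the FastMCP library. The `omarchy-theme-set` command is an
    # assumption about Omarchy's CLI; the real server above may differ.
    import subprocess
    from fastmcp import FastMCP

    mcp = FastMCP("omarchy-themes")

    @mcp.tool()
    def set_theme(name: str) -> str:
        """Switch the Omarchy desktop to the named theme."""
        subprocess.run(["omarchy-theme-set", name], check=True)
        return f"Theme switched to {name}"

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default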


Domain bought too early, Clawdbot (fka Moltbot) is now OpenClaw: https://openclaw.ai

Yes, much like for the many enterprising grifters who squatted clawd* and molt* domains in the past 24h, the second name change is quite a surprise.

However: Moltbook is happy to stay Moltbook: https://x.com/moltbook/status/2017111192129720794

EDIT: Called it :^) https://news.ycombinator.com/item?id=46821564



EOL of Windows 10 forced me to, but I'm not mad - Desktop Linux is Great!

It's definitely the superior OS for modern development and general system admin; WSL/Docker always felt like an uncanny-valley kludge.


Doesn't work; looks like the link or SVG was cropped.


MCP support is available via the fast_mcp extension: https://llmspy.org/docs/mcp/fast_mcp

I use llms.py as a personal assistant, and MCP support is required to access the tools that are only available via MCP servers.

MCP is a great way to make features available to AI assistants, here's a couple I've created after enabling MCP support:

- https://llmspy.org/docs/mcp/gemini_gen_mcp - Give AI Agents ability to generate Nano Banana Images or generate TTS audio

- https://llmspy.org/docs/mcp/omarchy_mcp - Manage Omarchy Desktop Themes with natural language

I will say there's a noticeable delay in using MCP vs built-in tools, which is why I ended up porting Anthropic's Node filesystem MCP to Python [1] to speed up common AI Assistant tasks. So MCP servers aren't ideal for frequent access to small tasks, but they are great for long-running tasks like image/audio generation.

[1] https://github.com/ServiceStack/llms/blob/main/llms/extensio...
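
As a bare-bones illustration of what such a filesystem port can look like (a sketch assuming the FastMCP library; the actual extension at [1] exposes more tools and restricts access to allowed directories):

    # Sketch of filesystem tools served over MCP with FastMCP.
    # Illustration only: the real llms.py extension linked above exposes
    # more tools and limits access to allowed directories.
    from pathlib import Path
    from fastmcp import FastMCP

    mcp = FastMCP("filesystem")

    @mcp.tool()
    def read_text_file(path: str) -> str:
        """Return the full contents of a text file."""
        return Path(path).read_text(encoding="utf-8")

    @mcp.tool()
    def list_directory(path: str) -> list[str]:
        """List the entries in a directory."""
        return sorted(p.name for p in Path(path).iterdir())

    if __name__ == "__main__":
        mcp.run()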


Does the MCP implementation make it easy to swap out the underlying image provider? I've found Gemini is still a bit hit or miss for actual print-on-demand products compared to Midjourney. Since MJ still doesn't have a real API I've been routing requests to Flux via Replicate for higher quality automated flows. Curious if I could plug that in here without too much friction.


MCP gives AI models that don't support image generation the ability to generate images/audio via tool calling.

But you can also just directly select the image generation model you prefer to use [1]. It currently supports Google, OpenAI, OpenRouter, Chutes, Z.ai and Nvidia.

I tried Replicate's MCP, but it looks like it supports everything except generating images, which I didn't understand; surely image generation would be its most sought-after feature?

[1] https://llmspy.org/docs/v3#image-generation-support


I wouldn't use Claude API Key pricing, but I also wouldn't get a Claude Max sub unless it was the only AI tool I used.

Antigravity / Google AI Pro is much better value. I've been using it as my primary IDE assistant for a couple of months and have yet to hit a quota limit on my $16/mo sub (annual pricing), which also includes a tonne of other AI perks incl. Nano Banana, TTS, NotebookLM, storage, etc.

No need to use Anthropic's premium models for tool calling when Gemini/MiniMax are better value models that still perform well.

I still have a Claude Pro plan, but I use it much less than Antigravity, and thanks to Anthropic slashing their subscription usage limits, I no longer use it outside of CC.


Counterpoint: on the $20 monthly account I would hit my 5-hour limits within an hour on Antigravity. I end up spending half my time managing my context and keeping conversations short.


Yeah, I’ve hit this too. Once you do real agentic work or TDD, you’re optimizing context instead of code. That frustration is why we built Cortex: flat cost, no turn limits, runs locally, and git-aware context so you can just keep going. cortex.build


Couldn't think of a better title; do you have any suggestions?


No custom state machine or agent; it's only a copy of Anthropic's 3 computer-use tools: run_bash, edit, computer.

https://github.com/ServiceStack/llms/tree/main/llms/extensio...

It's run in the same process, there are no long agent loops, and everything's encapsulated within a single message thread.
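
To sketch what that looks like in practice (hypothetical code, not the actual llms.py implementation linked above), the tools are just Python callables dispatched inline while handling the current message thread:

    # Hypothetical sketch of in-process tool dispatch for computer-use
    # style tools; illustration only, not the actual llms.py code.
    import subprocess
    from pathlib import Path

    def run_bash(command: str) -> str:
        """Run a shell command and return its combined output."""
        proc = subprocess.run(command, shell=True, capture_output=True, text=True)
        return proc.stdout + proc.stderr

    def edit(path: str, old: str, new: str) -> str:
        """Replace `old` with `new` in a file."""
        p = Path(path)
        p.write_text(p.read_text().replace(old, new))
        return f"edited {path}"

    TOOLS = {"run_bash": run_bash, "edit": edit}  # "computer" omitted for brevity

    def handle_tool_call(name: str, args: dict) -> str:
        # Invoked inline while processing the current message thread;
        # no separate agent loop or external process involved.
        return TOOLS[name](**args)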


Yep, but it only supports GitHub OAuth, i.e. content is either saved under no user (anonymous) or under the authenticated GitHub user.

https://llmspy.org/docs/deployment/github-oauth


Thanks. Looks like this is purely to gatekeep internal access, but it isn't ready for any OIDC provider or a DB-backed session store.

All the best for the project, will check in later on these.


If you are looking for an open-source chat WebUI with support for OIDC, maybe you'd be interested in the one we are building? [0]

We are leveraging oauth2-proxy for the login here, so it should support all OIDC-compliant IdPs, and there are guides by oauth2-proxy on how to configure it for all the bigger providers. We do have customers using it with e.g. Azure, Keycloak and Google Directory.

[0]: https://erato.chat


I see you have a dockerfile.combined - is this built and served via gh artifacts? I can try it out.

Pros: Open source, and focus on lightweight. This is good.

Cons: "customers" - Ugh, no offense, but smells of going down the same path as "open" webui, with the services expanding to fill enterprise use cases, and simplicity lost.

LLMs.py seems to be focusing purely on simplicity + being OK with rewriting for it. This + 3-clause BSD is a solid ethos. Will await their story on a multi-user, hosted app. They have most of the things sorted anyway, including RAG, extensions, etc.


> I see you have a dockerfile.combined - is this built and served via gh artifacts? I can try it out.

Our recommended way of deploying is via Helm [0], with the latest version listed here [1].

> with the services expanding to fill enterprise use cases, and simplicity lost.

TBH, I don't think simplicity was lost for OpenWebUI because of trying to fill enterprise needs. Their product has felt like a mess of too many cooks and no consistent product vision from the start. That's also where part of our origin story comes from: we started out as freelancers in the space and got inquiries to set up a Chat UI for different companies, but didn't deem OpenWebUI and the other typical tools fit for the job, and they were too much of a mess internally to fork.

We are a small team (no VC funding), our customers' end-users are usually on the low end of AI literacy, and there is typically about one DevOps/sysadmin at the company where our tool is deployed, so we have many factors pushing us towards simplicity. Our main avenue of monetization is also via SLAs, so a simple product for which we can more easily have test coverage and feel comfortable about stability is also in our best interest.

[0]: https://erato.chat/docs/deployment/deployment_helm

[1]: https://artifacthub.io/packages/helm/erato/erato

