I'm eagerly awaiting Qwen 3 Coder becoming available on Cerebras.
I run plenty of agent loops, and the speed makes a noticeable difference in time "compression". Having a Claude 4 Sonnet-level model running at 1000-1500 tok/s would be extremely impressive.
To FEEL THE SPEED, you can try it yourself on the Cerebras Inference page, through their API, or on Mistral's Le Chat with their "Flash Answers" (powered by Cerebras). Iterating on code at 1000 tok/s makes it feel even more magical.
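If you want to script it rather than use the playground, here's a minimal sketch against their API, assuming the OpenAI-compatible endpoint at api.cerebras.ai/v1 (the model id is a placeholder; check their current model list):

```python
# Minimal sketch: stream tokens from Cerebras through the standard
# OpenAI Python SDK. The model id below is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key="csk-...",  # your Cerebras API key
)

resp = client.chat.completions.create(
    model="llama-3.3-70b",  # placeholder model id
    messages=[{"role": "user", "content": "Write a Python quicksort."}],
    stream=True,  # streaming is where the tok/s really shows
)
for chunk in resp:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```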
Exactly. I can see my efficiency going up a ton with this kind of speed. Every time I'm waiting on agents, my mind loses some focus and context. Running parallel agents buys speed, but at the cost of focus. Near-instant iteration loops in Cursor would feel magical (even more magical?).
It will also impact how we work: interactive IDEs like Cursor probably make more sense than CLI tools like Claude Code when answers are nearly instant.
I was just thinking the opposite. If the answers are this instant then, subject to cost, I'd be tempted to have the agent fork, go off and try a dozen different things, and run a review process to decide which approach(es), or parts of approaches, to present to the user.
It opens up a whole lot of use cases that'd be a nightmare if you had to look at each individual change.
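At 1000+ tok/s, that fork-and-review loop is cheap to prototype. A rough sketch, assuming an OpenAI-compatible endpoint (the base URL and model id are placeholders, not confirmed offerings):

```python
# Fork-and-review sketch: fan out N independent attempts concurrently,
# then ask the same model to judge which candidate to surface.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.cerebras.ai/v1", api_key="csk-...")
MODEL = "qwen-3-coder"  # placeholder model id

async def attempt(task: str) -> str:
    # One independent fork; high temperature to diversify approaches.
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": task}],
        temperature=1.0,
    )
    return resp.choices[0].message.content

async def fork_and_review(task: str, n: int = 12) -> str:
    # Fan out n attempts in parallel, then run a review pass over them.
    candidates = await asyncio.gather(*(attempt(task) for _ in range(n)))
    numbered = "\n\n".join(f"--- Candidate {i} ---\n{c}"
                           for i, c in enumerate(candidates))
    review = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
                   f"Task: {task}\n\n{numbered}\n\n"
                   "Which candidate solves the task best? "
                   "Reply with its number and a one-line justification."}],
    )
    return review.choices[0].message.content

print(asyncio.run(fork_and_review("Make this recursive function iterative: ...")))
```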
However, I think Cerebras first needs to make its APIs more OpenAI-compatible. I tried their existing models with a bunch of coding agents (including Cline, which they submitted a PR for), and they all failed, either with a 400 error or with incorrectly formatted tool calls. Very disappointed.
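For reference, "formatted correctly" means matching the shape the OpenAI spec defines for tool calls; clients like Cline parse the assistant message expecting exactly this structure (the tool name and arguments below are illustrative):

```python
# The assistant message shape coding agents expect from an
# OpenAI-compatible API when the model decides to call a tool.
# Note: "arguments" must be a JSON-encoded string, not a nested object --
# a common source of breakage in almost-compatible providers.
expected_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "read_file",  # illustrative tool name
                "arguments": '{"path": "src/main.py"}',  # illustrative args
            },
        }
    ],
}
```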
Have you used Claude Code, and how does the quality of these models compare to Claude's? I'm heavily invested in tooling around Claude and am still struggling to make the switch and start experimenting with other models.
I still exclusively use Claude Code. I have not yet experimented with these other models for practical software development work.
A workflow I've been hearing about is: use Claude Code until quota exhaustion, then use Gemini CLI with Gemini 2.5 Pro free credits until quota exhaustion, then use something like a cheap-ish K2 or Qwen 3 provider, with OpenCode or the new Qwen Code, until your Claude Code credits reset and you begin the cycle anew.
Are you using Claude Code or the web interface? I'd like to try this with CC myself; apparently, with a proxy, an OpenAI-compatible LLM can be swapped in.
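The trick those proxies pull off is translating Anthropic's Messages API into OpenAI's Chat Completions API. A very stripped-down sketch, assuming Claude Code honors the ANTHROPIC_BASE_URL environment variable (upstream URL, key, and model id are placeholders; real proxies also handle streaming and tool calls):

```python
# Anthropic -> OpenAI translation proxy (sketch). Run with:
#   uvicorn proxy:app --port 8082
# then launch Claude Code with ANTHROPIC_BASE_URL=http://localhost:8082
import httpx
from fastapi import FastAPI, Request

app = FastAPI()
UPSTREAM = "https://api.cerebras.ai/v1/chat/completions"  # any OpenAI-compatible endpoint
API_KEY = "sk-..."       # upstream provider key
MODEL = "qwen-3-coder"   # placeholder model id

@app.post("/v1/messages")
async def messages(req: Request):
    body = await req.json()
    # Map Anthropic's Messages format onto OpenAI's Chat Completions format.
    msgs = []
    if body.get("system"):
        msgs.append({"role": "system", "content": body["system"]})
    for m in body["messages"]:
        c = m["content"]
        if isinstance(c, list):  # flatten Anthropic content blocks to text
            c = "".join(b.get("text", "") for b in c)
        msgs.append({"role": m["role"], "content": c})
    async with httpx.AsyncClient() as client:
        r = await client.post(
            UPSTREAM,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": MODEL, "messages": msgs,
                  "max_tokens": body.get("max_tokens", 1024)},
        )
    text = r.json()["choices"][0]["message"]["content"]
    # Wrap the reply back into Anthropic's response shape.
    return {"id": "msg_proxy", "type": "message", "role": "assistant",
            "model": MODEL, "stop_reason": "end_turn",
            "content": [{"type": "text", "text": text}],
            "usage": {"input_tokens": 0, "output_tokens": 0}}
```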
I am using Claude Code, and my experience with it so far is great. I use it primarily from the terminal; that way I stay focused reading code while CC does its job in the background.
It'll be nice if this generates more pressure on programming language compilation times. If agentic LLMs get fast enough that compilation becomes the main blocker in the development loop, there'll be significant economic incentives for improving compiler performance.