Finally! I've been using the assistants api in building an ai mock interviewer (https://comp.lol) but the responses were painfully slow when using the latest iterations of the gpt-4 model. This will make things so much more responsive
I'd still want to see the entire response all at once. Having it stream in while I read it would be very distracting and make it difficult for me to read.
It's a request the front-end developer should be confronted with, not OpenAI.
The website could as well buffer the incoming stream until the used clicks an area to request the display of the next block of the response, once he has finished reading the initial sentences.
yes, it like surfing porn in the early internet year using a dialup modem. One line a the time until you finally can see enough of the picture (reply) to realize that is was not the reply you were looking for.
LLM streaming must be a cost saving feature to prevent you from overloading the servers by asking to many questions with in a short time frame. Annoying feature IMHO
How is hiding it behind a loading spinner any better? You still can't spam it with questions since you need to wait for it to finish. With streaming you can at least hit the stop button if it looks incorrect, so you actually spam it more with it enabled.
For me, the constant visual changes of new parts being streamed in are annoying, and straining on the eyes. Ideally, web frontends would honor `prefers-reduced-motion` and buffer the response when set.
Personally, I've fallen in love with that visual effect of streaming text you're talking about. It's a bit pavlovian, but I think in my head it signifies that I'm reading something high signal (even though it isn't always).
It's more about UX, to reduce the perceived delay. LLMs inherently stream their responses, but if you wait until the LLM has finished inference, the user is sitting around twiddling their thumbs.