Interesting approach to model serving - the 2-4x lower TTFT compared to vLLM is ...

adiraja · 2024-12-17T15:43:09 1734450189

Thanks for your comments. Absolutely, as we were mentioning in one of the other threads, we are really keen on building towards having a reproducible dashboard of efficiency and other metrics.

Also regarding the no rate limits, we agree this is a real challenge and it's part of why we're interested in building this as well. I think the clever GPU utilization tricks are exactly what we're building out and also looking forward to see what the various issues we're going to run into at such scale.