Interesting approach to model serving. The 2-4x lower TTFT compared to vLLM is impressive, but I'd be curious to see detailed benchmarks across different batch sizes and model architectures to validate those performance claims. The no-rate-limits policy is bold, but it could get expensive fast if you're not doing some clever GPU utilization under the hood.
Thanks for your comments. Absolutely. As we mentioned in one of the other threads, we're keen to build a reproducible dashboard of efficiency and other metrics.
As for the no-rate-limits policy, we agree this is a real challenge, and it's part of why we're interested in building this. The clever GPU utilization tricks are exactly what we're building out, and we're looking forward to seeing what issues we run into at that scale.