It looks like TensorRT-LLM (TRT-LLM) is the way to go for realtime APIs at more and more companies (e.g. Perplexity AI’s pplx-api, Mosaic’s, Baseten…). Would be super nice to find people deploying multimodal models (e.g. LLaVA or CLIP/BLIP) to discuss approaches (and cry a bit together!)