I wanted to share a project I've been working on for the past few weeks: llgtrt. It's a Rust implementation of an HTTP REST server for hosting Large Language Models with NVIDIA TensorRT-LLM, using the llguidance library for constrained output.
The server is compatible with the OpenAI REST API and supports structured JSON schema enforcement as well as full context-free grammars (via Guidance). It's similar in spirit to the Python-based TensorRT-LLM OpenAI server example, but it's written entirely in Rust and built with constraints in mind. No Triton Inference Server is involved.
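To give a feel for what a structured-output request could look like, here's a minimal sketch of a request body in the OpenAI chat-completions format with a JSON schema attached. The `response_format` field follows the OpenAI API convention; the model name, schema, and endpoint path are placeholders, and the exact fields llgtrt accepts may differ:

```python
import json

# Hypothetical JSON schema constraining the model's output.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Request body in the OpenAI chat-completions format; an OpenAI-compatible
# server would receive this as a POST to /v1/chat/completions.
request_body = {
    "model": "my-model",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Extract the person mentioned: Alice is 30."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": schema},
    },
}

print(json.dumps(request_body, indent=2))
```

With enforcement enabled server-side, the completion is guaranteed to parse as JSON matching the schema rather than merely being prompted toward it.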
This also serves as a demo for the llguidance library, which lets you apply sampling constraints through Rust, C, or Python interfaces, with minimal generation overhead and no startup cost.
Any feedback or questions are welcome!