We used candle[0], which uses cudarc and the metal crate under the hood. That means we run on NVIDIA hardware in production and can test locally on MacBooks with smaller models.
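For context, picking the backend in candle is roughly this simple (a rough sketch, not our actual setup code; the model code itself doesn't change between backends):

    use candle_core::utils::{cuda_is_available, metal_is_available};
    use candle_core::{Device, Result, Tensor};

    fn main() -> Result<()> {
        // CUDA (via cudarc) on the production boxes, Metal on a MacBook,
        // CPU as the fallback.
        let device = if cuda_is_available() {
            Device::new_cuda(0)?
        } else if metal_is_available() {
            Device::new_metal(0)?
        } else {
            Device::Cpu
        };

        // The same tensor code runs unchanged on whichever backend was picked.
        let a = Tensor::randn(0f32, 1f32, (4, 8), &device)?;
        let b = Tensor::randn(0f32, 1f32, (8, 4), &device)?;
        let c = a.matmul(&b)?;
        println!("device: {:?}, output shape: {:?}", device, c.shape());
        Ok(())
    }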
I would certainly like to use non-NVIDIA hardware, but at this point it's not a priority. The subset of tensor operations needed to run the forward pass of an LLM isn't as large as you'd think, though.
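To give a sense of what I mean, here's a rough sketch of one decoder block in candle (single head, no RoPE, masking, or KV cache, and not our real code): the whole forward pass basically comes down to matmuls, a softmax, a few elementwise ops, and norms.

    use candle_core::{Result, Tensor, D};
    use candle_nn::ops::softmax;

    // RMSNorm written out by hand so the underlying ops are visible:
    // square, mean, add-eps, sqrt, broadcast divide.
    fn rms_norm(x: &Tensor, eps: f64) -> Result<Tensor> {
        let variance = x.sqr()?.mean_keepdim(D::Minus1)?;
        x.broadcast_div(&(variance + eps)?.sqrt()?)
    }

    // One simplified decoder block: single-head attention plus a SwiGLU MLP.
    // x is (seq, hidden); the w_* tensors are the usual projection matrices.
    fn decoder_block(
        x: &Tensor,
        wq: &Tensor, wk: &Tensor, wv: &Tensor, wo: &Tensor,
        w_gate: &Tensor, w_up: &Tensor, w_down: &Tensor,
    ) -> Result<Tensor> {
        // Attention: four matmuls, one scale, one softmax.
        let h = rms_norm(x, 1e-5)?;
        let (q, k, v) = (h.matmul(wq)?, h.matmul(wk)?, h.matmul(wv)?);
        let scale = (q.dim(D::Minus1)? as f64).sqrt();
        let scores = (q.matmul(&k.t()?)? / scale)?;
        let attn = softmax(&scores, D::Minus1)?.matmul(&v)?.matmul(wo)?;
        let x = (x + attn)?;

        // MLP: three matmuls, a silu, and an elementwise multiply.
        let h = rms_norm(&x, 1e-5)?;
        let mlp = (h.matmul(w_gate)?.silu()? * h.matmul(w_up)?)?.matmul(w_down)?;
        x + mlp
    }

Real implementations layer RoPE, grouped-query attention, masking, and KV caching on top, but those reuse the same small set of primitives.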
Just trying to stay focused on launching first (https://docs.mixlayer.com) and keeping early customers happy, but would love to open source some of this work.
It'd probably be a separate crate from candle. If you haven't checked it out yet, mistral.rs implements some of these things (https://github.com/EricLBuehler/mistral.rs). Eric hasn't done multi-GPU inference yet, but I know it's on his roadmap. Not sure if it helped, but I shared an early version of my llama 3.1 implementation with him.
I also have a Rust LLM inference project. The overlap between what Mixlayer is doing and what my project is doing is very high; it's actually crazy how closely our feature sets match. [1] Right now I'm still using llama.cpp on the backend, but eventually I want to move to candle via mistral.rs.
We've also been building our inference stack on top of Candle, and I'm really happy with it.