Yeah, I’ve had to rewrite continuous batching and other scheduling logic. That a...

RRRozie · 2024-09-23T17:44:02 1727113442

Are you aiming for Nvidia hardware with rust-cuda, or looking to integrate with non-Nvidia hardware?

zackangelo · 2024-09-30T17:31:20 1727717480

We used candle[0], which uses cudarc and the metal crate under the hood. That means we run on Nvidia hardware in production and can test locally on MacBooks with smaller models.

I would certainly like to use non-Nvidia hardware, but at this point it's not a priority. That said, the subset of tensor operations needed to run the forward pass of LLMs isn't as large as you'd think.

[0] https://github.com/huggingface/candle
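
To make the point about the small op set concrete, here is a rough, dependency-free sketch of a few of the operations a Llama-style forward pass leans on: a matrix-vector product, RMSNorm, softmax, and SiLU. This is not candle's API and not the code discussed above; the function names (`matvec`, `rms_norm`, `softmax`, `silu`) are hypothetical and operate on plain `f32` slices for illustration only. A real forward pass would also need embedding lookup, rotary position embeddings, and elementwise add/multiply, all running on device tensors.

```rust
// Hypothetical CPU-only sketch. Names and signatures are illustrative,
// not taken from candle, cudarc, or any other crate.

/// Matrix-vector product y = W x, with W stored row-major as rows x x.len().
fn matvec(w: &[f32], x: &[f32], rows: usize) -> Vec<f32> {
    let cols = x.len();
    assert_eq!(w.len(), rows * cols, "weight shape does not match input");
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// RMSNorm: scale x by 1 / sqrt(mean(x^2) + eps), then by a learned weight.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}

/// Numerically stable softmax: subtract the max before exponentiating.
fn softmax(x: &[f32]) -> Vec<f32> {
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}
```

Attention is roughly `softmax` over scaled `matvec` scores, and the MLP block is `matvec` plus `silu` and an elementwise multiply, so most of the porting effort for a new backend goes into making a short list like this fast rather than into covering a large op surface.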

k2so · 2024-09-19T06:31:09 1726727469

This is awesome. Are you contributing this to candle, or is it a standalone package?

zackangelo · 2024-09-19T14:12:11 1726755131

Just trying to stay focused on launching first (https://docs.mixlayer.com) and keeping early customers happy, but would love to open source some of this work.

It'd probably be a separate crate from candle. If you haven't checked it out yet, mistral.rs implements some of these things (https://github.com/EricLBuehler/mistral.rs). Eric hasn't done multi-GPU inference yet, but I know it's on his roadmap. Not sure if it helped, but I shared an early version of my Llama 3.1 implementation with him.

J_Shelby_J · 2024-09-19T21:50:09 1726782609

Hey, mixlayer is really cool.

I also have a Rust LLM inference project [1]. The overlap between what mixlayer is doing and what my project is doing is very high. It's actually crazy how we basically have the same features. Right now I'm still using llama.cpp on the backend, but I eventually want to move to candle via mistral.rs.

[1] https://github.com/ShelbyJenkins/llm_client