What would be a reasonable throughput level to expect from running 8-bit or 16-bit versions on 8x H200 DGX systems?
You should be able to get at least 40 to 50 tokens/s. High-throughput mode plus a small draft model might get you to around 100 tokens/s for generation.
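If you want to check your own setup against these figures, here is a minimal sketch for measuring generation throughput. The `generate` callable is a hypothetical stand-in for whatever inference client you use, and it is assumed to return the number of tokens it generated. Nothing here is tied to a particular serving stack.

```python
import time
from typing import Callable


def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Return generation throughput in tokens/s."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_tokens / elapsed_s


def measure(generate: Callable[[], int]) -> float:
    """Time one call to `generate` and return its throughput.

    `generate` is a hypothetical placeholder: it should run one request
    against your deployment and return the count of generated tokens.
    """
    start = time.perf_counter()
    num_tokens = generate()
    elapsed = time.perf_counter() - start
    return tokens_per_second(num_tokens, elapsed)
```

For example, 500 tokens generated in 10 s works out to 50 tokens/s, the top of the baseline range mentioned above. Averaging over several requests should give a steadier number than a single run.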