
Yes, decoding is very I/O heavy: every decoded token requires streaming the entire set of model weights in from HBM. However, that cost can be shared among all the requests in the same batch, so if the system has more GPU RAM to hold larger batches, the I/O cost per request can be lowered.
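A quick back-of-envelope sketch of the amortization (the model size and bandwidth figures below are illustrative assumptions, not measurements of any particular GPU):

    # Why batching amortizes decode-time weight I/O.
    # Assumed numbers: a ~70B-parameter model in fp16 and
    # roughly HBM3-class bandwidth.

    WEIGHT_BYTES = 140e9   # 70B params * 2 bytes (fp16), assumption
    HBM_BW = 3.0e12        # assumed HBM bandwidth, bytes/sec

    def io_time_per_request(batch_size: int) -> float:
        """Seconds of weight-streaming I/O charged to each request
        per decoded token. Each decode step streams all weights once
        regardless of batch size, so the cost divides across the batch."""
        return WEIGHT_BYTES / HBM_BW / batch_size

    for b in (1, 8, 64):
        print(f"batch={b:3d}: {io_time_per_request(b) * 1e3:.2f} ms/token/request")

With those assumed numbers, batch size 1 pays about 47 ms of weight I/O per token, batch 8 about 5.8 ms, and batch 64 under 1 ms, which is the whole argument for packing as many concurrent requests into GPU RAM as the KV caches allow.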

