
Oh hey! :) TLDR: naive gradient accumulation was over-weighting short sequences and under-weighting long sequences in LLM finetuning and training runs.

For example, a batch with sequence lengths [1, 100] would scale each token's loss by 1/(100+1) in full-batch training, but grad accum of 2 would weight the length-1 sequence's token as 1/1 * 1/2 = 1/2, whilst each token of the length-100 sequence gets 1/100 * 1/2 = 1/200 (the 1/2 because grad accum divides by the number of grad accum steps).
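A minimal sketch of that arithmetic (not the actual trainer code; the per-token loss tensors here are just placeholders) showing how naive per-micro-batch averaging diverges from the full-batch objective, and how normalising by the total token count fixes it:

    import torch

    # Per-token losses for two sequences of length 1 and 100 (values arbitrary).
    losses = [torch.rand(1), torch.rand(100)]

    # Full-batch objective: one mean over all 101 tokens, so every token
    # is weighted 1/(100+1).
    full_batch = torch.cat(losses).mean()

    # Naive grad accum over 2 steps: mean within each micro-batch, then
    # divide by the number of accumulation steps. The length-1 sequence's
    # token gets weight 1/1 * 1/2 = 1/2; each token of the length-100
    # sequence gets 1/100 * 1/2 = 1/200.
    naive_accum = sum(l.mean() for l in losses) / len(losses)

    # Fixed grad accum: sum the losses per micro-batch and divide by the
    # total token count across all accumulation steps, which matches
    # full-batch training exactly.
    total_tokens = sum(l.numel() for l in losses)
    fixed_accum = sum(l.sum() for l in losses) / total_tokens

    print(full_batch.item(), naive_accum.item(), fixed_accum.item())
    # full_batch == fixed_accum, but naive_accum differs whenever
    # sequence lengths in the micro-batches differ.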



Is this a general issue rather than something Unsloth-specific? How widespread is this problem? Sounds wild if it has been affecting everyone's training.


Unfortunately it's not an Unsloth issue but a general one affecting nearly all trainers that use grad accum. We worked with Huggingface, so their trainers should now be fixed in the main branch.



