Surya here from the core Gemma team -- we can think of a distillation loss as learning to model the entire distribution of tokens likely to follow the prefix so far, instead of only the single token in the training example. If you do some back-of-the-envelope calculations, you can see that learning to model a larger distribution yields many more bits of information to learn from.
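
A rough sketch of the contrast (hypothetical JAX code, not the actual Gemma training loss; `student_logits` and `teacher_logits` are assumed to be per-position vocabulary logits):

```python
import jax.numpy as jnp
from jax.nn import log_softmax

def one_hot_ce(student_logits, target_id):
    # Standard next-token loss: the gradient only "sees" the single
    # token that appeared in the training example.
    return -log_softmax(student_logits)[target_id]

def distill_kl(student_logits, teacher_logits):
    # Distillation loss: KL(teacher || student). The student is pushed
    # toward the teacher's probability for *every* vocabulary entry,
    # so each position carries a full distribution's worth of signal.
    t_log = log_softmax(teacher_logits)
    s_log = log_softmax(student_logits)
    t = jnp.exp(t_log)
    # Guard against 0 * -inf when a teacher probability underflows.
    return jnp.sum(jnp.where(t > 0, t * (t_log - s_log), 0.0))
```

Note that distill_kl reduces to one_hot_ce when the teacher's distribution is one-hot at the target token, so ordinary cross-entropy is the special case where the distribution collapses to a single token.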


Gotcha. That makes sense. Thanks!

What are the theories as to why this works better than training on a larger quantity of non-simulated tokens?

Is it because the gradient from non-simulated tokens is too noisy for a small model to learn from?
