
Interesting. One thing I noticed is that Mistral has a `max_position_embeddings` of ~32K while these have it at 4096.

Any thoughts on that?



It's complicated.

The 7B model (cybertron) is trained on Mistral. Mistral is technically a 32K model, but it uses a sliding window beyond 32K, and for all practical purposes in current implementations it behaves like an 8K model.
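
You can see both numbers side by side in the config. A minimal sketch, assuming the HF transformers AutoConfig API and the public mistralai/Mistral-7B-v0.1 checkpoint:

    # The config advertises a large max_position_embeddings, but
    # sliding-window attention caps how far back any single layer attends.
    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
    print(cfg.max_position_embeddings)  # 32768 -- the "technically 32K" part
    print(cfg.sliding_window)           # 4096  -- per-layer attention window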

The 34B model is based on Yi 34B, which is inexplicably marked as a 4K model in the config but actually works out to 32K if you literally just edit that line. Yi also has a 200K base model... and I have no idea why they didn't just train on that. You don't need to finetune at long context to preserve its long context ability.
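
If anyone wants to try that one-line edit, here is a minimal sketch (assumes the HF transformers AutoConfig API; the local path is hypothetical and just stands in for wherever you downloaded the weights):

    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("path/to/Yi-34B")  # hypothetical local copy
    cfg.max_position_embeddings = 32768                 # shipped config says 4096
    cfg.save_pretrained("path/to/Yi-34B")               # rewrites config.json in place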


Did you mean "but it uses a sliding window beyond" *8K*? Because I don't understand how the sentence would work otherwise.


Yeah exactly, sorry.



