
Interesting. One thing I noticed is that Mistral has a `max_position_embeddings` of ~32K while these have it at 4096.

Any thoughts on that?



It's complicated.

The 7B model (cybertron) is trained on Mistral. Mistral is technically a 32K model, but it uses a sliding window beyond 32K, and for all practical purposes in current implementations it behaves like an 8K model.
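
You can see both numbers side by side in the config. A minimal sketch, assuming the HF transformers AutoConfig API and the public mistralai/Mistral-7B-v0.1 checkpoint:

    # The config advertises a large max_position_embeddings, but
    # sliding-window attention caps how far back any single layer attends.
    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
    print(cfg.max_position_embeddings)  # 32768 -- the "technically 32K" part
    print(cfg.sliding_window)           # 4096  -- per-layer attention window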

The 34B model is based on Yi 34B, which is inexplicably marked as a 4K model in the config but actually works out to 32K if you literally just edit that line. Yi also has a 200K base model... and I have no idea why they didn't just train on that. You don't need to finetune at long context to preserve its long context ability.
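
If anyone wants to try that one-line edit, here is a minimal sketch (assumes the HF transformers AutoConfig API; the local path is hypothetical and just stands in for wherever you downloaded the weights):

    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("path/to/Yi-34B")  # hypothetical local copy
    cfg.max_position_embeddings = 32768                 # shipped config says 4096
    cfg.save_pretrained("path/to/Yi-34B")               # rewrites config.json in place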


Did you mean "but it uses a sliding window beyond" *8K*? Because I don't understand how the sentence would work otherwise.


Yeah exactly, sorry.



