Mistral, which clocks in at 32k context
- Mistral can handle 32k context, but only via sliding window attention: each layer attends to a local window, so distant tokens influence each other only indirectly through stacked layers, and the model never attends over all 32k tokens at once.
- Mixtral (note the 'x') 8x7B can handle 32k context without resorting to sliding window attention.
I wonder whether Mistral would do a better job summarizing a long (32k-token) doc all at once or by summarizing it recursively (chunk the doc, summarize each chunk, then summarize the summaries).
Maybe a neat eval to try.
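A minimal sketch of what the recursive side of that eval could look like, assuming a map-reduce style loop. `summarize` is a hypothetical stand-in for a single Mistral completion call, and the character-based chunk size is an arbitrary placeholder (a real eval would split on token counts):

```python
# Hypothetical sketch: one-shot vs. recursive (map-reduce) summarization.
# `summarize` stands in for a single Mistral completion call; wire in a
# real client. Chunk sizes are character-based placeholders, not tokens.

def summarize(text: str) -> str:
    """One LLM call, e.g. 'Summarize the following text: ...'."""
    raise NotImplementedError("plug in a Mistral client here")

def chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size splits; a real eval would use token boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def one_shot(doc: str) -> str:
    """Baseline: hand the whole ~32k-token doc to the model in one call."""
    return summarize(doc)

def recursive(doc: str, size: int = 16_000) -> str:
    """Summarize chunks, concatenate the summaries, and repeat until the
    remaining text fits in a single call."""
    while len(doc) > size:
        doc = "\n\n".join(summarize(c) for c in chunks(doc, size))
    return summarize(doc)
```

Running both over the same set of long docs and scoring the outputs (e.g. against reference summaries or with a judge model) would make the comparison concrete.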