I'm pretty sure they don't do that, but for code the relevant relationship between two tokens is easy to determine from the semantics of the language alone: for instance, tokens belonging to a local variable have no relationship with tokens outside its scope. That would make the attention matrix sparse, substantially reducing the cost of long contexts. But it would require language-specific preprocessing, and whether it could be made fast is dubious. I don't think it's been tried so far. A rough sketch of the idea is below.
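To make the idea concrete, here is a minimal, hypothetical sketch (not anything an existing model does): assume a language-specific parser has already labeled each token with the scope it belongs to, and build a boolean mask that only allows attention within a scope plus to global tokens. The helper name `scope_attention_mask` and the scope-labeling convention are my own illustration.

```python
# Hypothetical sketch: derive a sparse attention mask from language semantics.
# Assumes a parser has assigned each token a scope id; tokens in unrelated
# local scopes never attend to each other, so those blocks of the attention
# matrix could in principle be skipped entirely.
import numpy as np

def scope_attention_mask(scope_ids, global_scope=0):
    """Return a boolean mask: True where attention is allowed.

    scope_ids[i] is the scope of token i. Tokens in the global scope
    (e.g. module level) stay visible to everyone; tokens inside a local
    scope only attend within that scope and to global tokens.
    """
    ids = np.asarray(scope_ids)
    same_scope = ids[:, None] == ids[None, :]   # attend within own scope
    sees_global = ids[None, :] == global_scope  # everyone sees global tokens
    return same_scope | sees_global

# Toy example: tokens 0-2 are module-level, 3-5 live in one function,
# 6-8 in another. Cross-function attention is masked out entirely.
mask = scope_attention_mask([0, 0, 0, 1, 1, 1, 2, 2, 2])
print(mask.astype(int))

# In an attention layer the mask would be applied to the score matrix
# before softmax (or used to avoid computing masked blocks at all):
scores = np.random.randn(9, 9)
scores = np.where(mask, scores, -np.inf)
```

The payoff only materializes if the kernel actually skips the masked blocks; applying the mask after computing dense scores, as above, saves nothing, which is part of why "can you make it fast" is the open question.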
What is the trick for achieving 100k context? They can't just use a 100k-wide transformer layer; that would be cost-prohibitive, right?