I'm pretty sure they don't do that, but for code the relevant relationship between two tokens is easy to determine from the semantics of the language alone: for instance, tokens belonging to a local variable have no relationship with tokens outside its scope. That would make the attention matrix sparse, substantially reducing the cost of long contexts. But it would require language-specific preprocessing, and whether it could be made fast is dubious. I don't think it's been tried so far. A rough sketch of the idea is below.
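To make the idea concrete, here is a minimal, hypothetical sketch (not anything an existing model does): assume a language-specific parser has already labeled each token with the scope it belongs to, and build a boolean mask that only allows attention within a scope plus to global tokens. The helper name `scope_attention_mask` and the scope-labeling convention are my own illustration.

```python
# Hypothetical sketch: derive a sparse attention mask from language semantics.
# Assumes a parser has assigned each token a scope id; tokens in unrelated
# local scopes never attend to each other, so those blocks of the attention
# matrix could in principle be skipped entirely.
import numpy as np

def scope_attention_mask(scope_ids, global_scope=0):
    """Return a boolean mask: True where attention is allowed.

    scope_ids[i] is the scope of token i. Tokens in the global scope
    (e.g. module level) stay visible to everyone; tokens inside a local
    scope only attend within that scope and to global tokens.
    """
    ids = np.asarray(scope_ids)
    same_scope = ids[:, None] == ids[None, :]   # attend within own scope
    sees_global = ids[None, :] == global_scope  # everyone sees global tokens
    return same_scope | sees_global

# Toy example: tokens 0-2 are module-level, 3-5 live in one function,
# 6-8 in another. Cross-function attention is masked out entirely.
mask = scope_attention_mask([0, 0, 0, 1, 1, 1, 2, 2, 2])
print(mask.astype(int))

# In an attention layer the mask would be applied to the score matrix
# before softmax (or used to avoid computing masked blocks at all):
scores = np.random.randn(9, 9)
scores = np.where(mask, scores, -np.inf)
```

The payoff only materializes if the kernel actually skips the masked blocks; applying the mask after computing dense scores, as above, saves nothing, which is part of why "can you make it fast" is the open question.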
What is the trick for achieving 100k context? They can't just use a 100k-wide transformer layer; that would be cost-prohibitive, right?