I suspect GPT-2's limitations with larger-scale structure have less to do with the capacity to track long-range dependencies (which shouldn't be a problem for an attention-based architecture), and more to do with language modeling itself as a task.
Language modeling is about predicting what can be predicted about the rest of a text, given the first N tokens. Not everything in text can be predicted this way, even by humans; the things we say to each other tend to convey novel information and thus aren't fully compressible. And indeed the compressibility of text varies across a text in a way that is itself relatively predictable. If someone writes "for all intents and" you can be pretty sure the next word is "purposes," i.e. you're unlikely to learn much when you read it; if someone is writing a dialogue between two characters, and you're about to see one of their names for the first time, you will learn something new and unpredictable when you read the next word, and you know that this will happen (and why).
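You can check this kind of thing directly by probing GPT-2's next-token distribution at a low-entropy spot and a high-entropy spot. The sketch below assumes a recent version of the Hugging Face transformers library; the prompts are made up for illustration, and the exact probabilities will vary with model size and version.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_next_tokens(prompt, k=5):
    """Return the k most probable next tokens and their probabilities."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the position after the prompt
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode([int(i)]), float(p)) for i, p in zip(top.indices, top.values)]

# Low-entropy spot: the idiom all but forces " purposes".
print(top_next_tokens("for all intents and"))

# High-entropy spot: we know a name is coming, but not which one.
print(top_next_tokens('She shook my hand and said, "Hi, my name is'))
```

In the first case you should see most of the probability mass piled on a single continuation; in the second, the mass should be spread thinly over many plausible names.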
A language modeling objective is only really natural for the first of these two cases. In the latter case, the "right" thing to do from the LM perspective is to output a fairly flat probability distribution over possible names (which is a lot of possibilities), assigning very low probability to any given name. But what this means is actually ambiguous between "I am unsure about my next observation because I don't understand the context" and "I understand the context, and it implies (predictably) that my next observation will be inherently unpredictable."
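Here's a toy illustration of that ambiguity, with a made-up five-name vocabulary: a model that is flat because it's confused and a model that is flat because it correctly knows the name is unpredictable pay exactly the same log loss, so the training signal cannot tell them apart.

```python
import math

# Toy set of possible next tokens (names only, for illustration).
names = ["Obama", "Smith", "Garcia", "Chen", "Okafor"]

# Model A is genuinely confused: it spreads probability uniformly because
# it hasn't understood the context at all.
confused = {n: 1 / len(names) for n in names}

# Model B understands the context perfectly and knows the next name is
# inherently unpredictable, so the *correct* distribution is also uniform.
well_calibrated = {n: 1 / len(names) for n in names}

# Whichever name actually appears, both models pay the same cross-entropy loss.
observed = "Chen"
for label, dist in [("confused", confused), ("well-calibrated", well_calibrated)]:
    print(label, "loss:", -math.log(dist[observed]))
```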
Since any model is going to be imperfect at judging whether it's about to see something unpredictable, it'll assign some weight to the next observation being predictable (say, a repeated topic or name) even when it's mostly sure it will be unpredictable. That weight pushes up the probability of whatever the predictable case would imply (i.e. the repeated topic/name), while the probability of everything else stays low, because if an observation is unpredictable it might well be anything.
I hypothesize that this is behind behavior like putting a single name ("Obama" in your earlier example) in too many roles in an article. If only Obama has been mentioned, then an upcoming name is either "Obama" (in which case we should guess "Obama") or some other name (in which case we should guess against Obama, in slight favor of any other name). But the second rule only reaches the model through the confusing signal "guess this arbitrary name! now this other one! now this one!", with the right trend emerging only in the average over numerous unpredictable cases, while the predictable-case rule (guess the name that has already been mentioned) is crystal-clear and reinforced in every case where it happens to be right.
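A toy calculation (all numbers made up) shows how lopsided this gets: even when the model thinks a repeat is unlikely, the already-mentioned name can tower over every individual alternative.

```python
# Suppose the model thinks there's a 20% chance the upcoming name is a repeat
# of a name already in the context ("Obama"), and an 80% chance it's a fresh,
# unpredictable name drawn from a large pool (say 10,000 plausible names).
p_repeat = 0.20
pool_size = 10_000

p_obama = p_repeat                             # all of the "repeat" mass lands on one name
p_any_other_name = (1 - p_repeat) / pool_size  # the "novel" mass is spread thin

print(p_obama)           # 0.2
print(p_any_other_name)  # 0.00008

# Even though the model thinks a repeat is *unlikely*, "Obama" is still
# 2500x more probable than any individual alternative, so greedy or
# low-temperature sampling will keep picking it.
print(p_obama / p_any_other_name)  # 2500.0
```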
I also suspect GPT-2's use of a sub-word encoding (BPE) exacerbates this issue during generation, because the model can initially guess only part of the high-entropy word without fully committing to a repeat (say, just the "O" in "Obama"), but once that piece becomes part of the context, the probability of a repeat is much higher: we already thought "Obama" was unusually probable, and now we're looking for a name that starts with "O".
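Continuing the toy numbers from above, and pretending "Obama" is split into the pieces "O" + "bama" (the real BPE split may differ): once the first piece is emitted, Bayes' rule pushes most of the remaining probability mass onto the repeat.

```python
p_repeat = 0.20       # prior probability the upcoming name is a repeat of "Obama"
pool_size = 10_000
p_novel_starts_with_O = 0.05  # assumed fraction of novel names starting with "O"

# Step 1: probability of emitting the piece "O" at all.
p_O = p_repeat + (1 - p_repeat) * p_novel_starts_with_O  # 0.24

# Step 2: once "O" is in the context, the repeat hypothesis absorbs most of the
# mass, because "Obama" predicts "O" with certainty while a novel name only
# sometimes starts with "O".
p_repeat_given_O = p_repeat / p_O
print(p_repeat_given_O)  # ~0.83: a soft guess of "O" has mostly committed us to "Obama"
```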