Hacker News new | past | comments | ask | show | jobs | submit login

Transformer-XL should be better at learning long-term dependencies.

See the last section of this paper: https://arxiv.org/pdf/1901.02860.pdf




Interesting, thanks!

I figured that expanding scope was probably a task we knew how to solve/improve, this makes sense. Beyond keeping antecedents, it seems like this could indirectly improve "essay structure" by retaining more knowledge of what's already been said.

I'm curious to see how much of GPT-2's length issues turn out to stem from long-range dependencies as opposed to other issues. Right now long output seems to suffer a combination of lost scope, lack of division/structure, diverging reference texts, self-contradiction, and redundancy/looping traps.

One of the most interesting patterns I've noticed is that GPT-2 "knows" which concepts relate to one another, but not which stances. Essays for and against Foo are both essays about Foo, so GPT-2 slips between them quite happily. It seems like the sort of problem that should be pretty tractable to improve, but all the obvious approaches would be pretty domain-specific.


Language models like GPT-2 have fundamental limitations for text generation. They simply model the most likely sequence of words, they don't have a "real understanding" of what they write. They don't have agency (a purpose), can't do explicit reasoning and don't have symbol grounding.

Few people would have thought that we can get this type of quality using just language modeling, so it's unclear how much further we can go before we have to start tackling those fundamental problems.


> Few people would have thought that we can get this type of quality using just language modeling, so it's unclear how much further we can go before we have to start tackling those fundamental problems.

Yep, that seems to be the core question. This isn't going to be the road to AGI, but it's already ahead of what most people thought could be done without underlying referents. As you say, GPT-2 has some fundamental limitations.

One is 'cognitive': it doesn't have referents or make any kind of logical assessments, only symbolic ones. Even if it develops to the point of writing publication-worthy essays, they'll be syntheses of existing texts, and inventing a novel argument (except by chance) should be essentially impossible. Even cross-format transference is basically out of the question; a news story reading "the mouse ate the tiger" won't be ruled out based on dictionary entries about mice and tigers.

The other, though, is 'intentful'. As Nostalgebraist points out, GPT-2 is essentially a text predictor impersonating a text generator. That difference creates an intrinsic limitation, because GPT-2 isn't trying to create output interchangeable with its inputs. It actively moves away from "write a novel" towards "write the most common/formulaic elements of a novel". At the current level of quality, this might actually improve results; GPT is less likely to make jarring errors when emulating LoTR's walking-and-eating sections than when emulating Gandalf's death. But as it improves and is used to generate whole texts, the "most likely" model leaves it writing unsurprising, mid-corpus text indefinitely.

Solving the cognitive limitation essentially requires AGI, and even adding crude referents in one domain would be quite tough. So I'll make a wild speculation that fixing the intentful limitation will be a big advance soon: training specifically for generation will produce models that more accurately recreate the structure and 'tempo' of human text, without being put deterred by excess novelty.


I suspect GPT-2's limitations with larger-scale structure have less to do with the capacity to track long-range dependencies (which shouldn't be a problem for an attention-based architecture), and more to do with language modeling itself as a task.

Language modeling is about predicting what can be predicted about the rest of a text, given the first N tokens. Not everything in text can be predicted in this way, even by humans; the things we say to each other tend to convey novel information and thus aren't fully compressible. And indeed the compressible-ness of text varies across a text in a way that it is itself relatively predictable. If someone writes "for all intents and" you can be pretty sure the next word is "purposes," i.e. you're unlikely to learn much when you read it; if someone is writing a dialogue between two characters, and you're about to see the see one of their names for the first time, you will learn something new and unpredictable when you read the next word, and you know that this will happen (and why).

A language modeling objective is only really natural for the first of these two cases. In the latter case, the "right" thing to do from the LM perspective is to output a fairly flat probability distribution over possible names (which is a lot of possibilities), assigning very low probability to any given name. But what this means is actually ambiguous between "I am unsure about my next observation because I don't understand the context" and "I understand the context, and it implies (predictably) that my next observation will be inherently unpredictable."

Since any model is going to be imperfect at judging whether it's about to see something unpredictable, it'll assign some weight to the next observation being predictable (say, a repeated topic or name) even if it's mostly sure it will be unpredictable. This will push up the probabilities of its predictions on the assumption of predictability (i.e. of a repeated topic/name), and meanwhile the probability of anything else is low, because if an observation is unpredictable then it might well be anything.

I hypothesize that this is behind behavior like putting a single name ("Obama" in your earlier example) in too many roles in an article: if only Obama has been mentioned, then either an upcoming name is "Obama" (in which case we should guess "Obama") or it's some other name (in which case we should guess against Obama in slight favor of any other name -- but this will only be conveyed to the model via the confusing signal "guess this arbitrary name! now this other one! now this one!", with the right trend only emerging in the average over numerous unpredictable cases, while the predictable-case rule where you guess the name that has already been mentioned is crystal-clear and reinforced in every case where it happens to be right).

I also suspect the use of a sub-word encoding (BPE) in GPT-2 exacerbates this issue once we are doing generation, because the model can initially guess only part of the high-entropy word without fully committing to a repeat (say just the "O" in "Obama"), but once this becomes part of the context the probability of a repeat is now much higher (we already thought "Obama" was unusually probable, and now we're looking for a name that starts with "O").




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: