Another great blog post on impressive research from the DeepMind team, who are simultaneously releasing a new dataset for long-range language modeling.
The post is worth reading in its entirety.
If I may summarize, the authors propose a transformer augmented with a short-term memory mechanism (analogous to Transformer-XL) as well as a new long-term memory mechanism that learns to 'compress and memorize' embeddings from the short-term memory. The model is trained on book-length samples (!!!!), and seems to perform significantly better than prior models at generating language with long-range contexts. To my eyes, text generated by the trained model is virtually indistinguishable from human output, and qualitatively superior to GPT-2 samples.
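To make the mechanism concrete, here is a rough sketch (PyTorch; all names and defaults are mine, not the authors' code) of the kind of memory update being described: activations that age out of the Transformer-XL-style short-term memory are compressed, e.g. by a strided 1D convolution (one of the compression functions the paper considers), and appended to a long-term compressive memory.

    import torch
    import torch.nn as nn

    class CompressiveMemory(nn.Module):
        # Illustrative sketch only, not the authors' implementation.
        def __init__(self, d_model, mem_len=512, cmem_len=512, rate=3):
            super().__init__()
            self.mem_len = mem_len
            self.cmem_len = cmem_len
            # Strided conv folds `rate` old memory slots into one compressed slot.
            self.compress = nn.Conv1d(d_model, d_model, kernel_size=rate, stride=rate)

        def update(self, memory, cmemory, new_hidden):
            # memory:     [mem_len, batch, d_model]  short-term (Transformer-XL style)
            # cmemory:    [cmem_len, batch, d_model] long-term compressed memory
            # new_hidden: [seq_len, batch, d_model]  activations from the current segment
            memory = torch.cat([memory, new_hidden], dim=0)
            overflow = memory.size(0) - self.mem_len
            if overflow > 0:
                # Oldest activations fall out of the short-term memory...
                old, memory = memory[:overflow], memory[overflow:]
                # ...and are compressed; Conv1d expects [batch, channels, length]
                # (assumes the overflow is at least `rate` steps long).
                compressed = self.compress(old.permute(1, 2, 0)).permute(2, 0, 1)
                cmemory = torch.cat([cmemory, compressed], dim=0)[-self.cmem_len:]
            return memory, cmemory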
Agreed that the generated sample is superior to similar outputs from GPT-2.
Looking at the additional samples in the publication, my first thought is that the model cannot easily stray from or modify the context. Once a fact is stored within the compressed memory, it seems the model cannot easily generate sentences contradictory to that fact.
This is problematic because frequent changes to relational information (e.g. the location where a character is standing) are fundamental to storytelling.
I believe pre-trained Transformer-XL weights can also be downloaded, providing long-term memory functionality similar to the Compressive Transformer's. I don't have a direct link, but they're available via huggingface: https://huggingface.co/transformers/pretrained_models.html
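For reference, something along these lines should load it (the usual checkpoint is 'transfo-xl-wt103'; exact class names may differ between library versions):

    # Rough example of loading pre-trained Transformer-XL via huggingface transformers.
    from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel

    tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
    model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")

    # Generate a short continuation from a prompt.
    inputs = tokenizer("The compressive transformer builds on", return_tensors="pt")
    outputs = model.generate(inputs["input_ids"], max_length=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))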
Yeah. I didn't mention Transformer-XL because I'm not sure how much of a long-range dependency it actually learns to handle. The only papers I've seen on recurrence indicate that recurrent models tend to learn very short-range dependencies, while something like Reformer, with direct access to thousands of timesteps, seems more likely to actually make use of them.
Will look into releasing some pre-trained weights, but the model trained on PG-19 is not really intended to be a general-purpose language generation model, so I'd prefer it not be picked up for downstream applications the way GPT-2 & BERT have been. The text from these old books contains some historical bias, etc.
Hopefully the model can be useful for people wanting to model long sequences generally, or build on other compressive memory ideas.
From the research paper's description of how they compress the memory, it sounds like a form of meta-learning.
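If I'm reading the paper right, the compression network isn't trained by the main language-modelling loss but by a separate attention-reconstruction objective: attending over the compressed memories should retrieve roughly what attending over the raw old memories would have. Very roughly (all names mine, just to convey the idea):

    import torch
    import torch.nn.functional as F

    def attention_reconstruction_loss(hidden, old_mem, compressed_mem, attend):
        # `attend` is any attention module mapping (queries, keys/values) -> outputs.
        with torch.no_grad():
            target = attend(hidden, old_mem)            # retrieval from uncompressed memories
        reconstruction = attend(hidden, compressed_mem)  # retrieval from compressed memories
        return F.mse_loss(reconstruction, target)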
Perhaps a network like this would be interested in reading the same books more than once. Perhaps it could find favorite books it wanted to read many times.
Thank you, this just made a huge connection for me between sleep, memory, and decision making (via the "consolidated episodic memories" link):
I was suffering from sleep apnea at this time last year and was on call one out of every three weeks, so I wasn't defragging my brain's hard drive. I got decision fatigue, my productivity fell to 10%, and I ended up unable to work for several months.