There's definitely active research on it. But here's the basic recipe that gets state-of-the-art accuracy on most tasks. It's been around for 5 years or so now, which is why I said "just".

You take an encoder transformer architecture like BERT and pretrain it on a (masked) language modelling objective. Then you discard the part of the network that does the token prediction and bolt on a small, randomly initialised head for the specific task. This head is generally kept minimal: for classification tasks, people often just connect a linear layer to the vector for a dummy sentinel token (BERT's [CLS], prepended to the sequence).
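
For concreteness, here's a minimal sketch of that recipe in PyTorch with the Hugging Face transformers library. The model name (bert-base-uncased) and the two-class task are just illustrative assumptions:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Pretrained encoder; the token-prediction (LM) head is simply not loaded.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")

    # Randomly initialised task head: one linear layer on the sentinel token's vector.
    classifier = torch.nn.Linear(encoder.config.hidden_size, 2)

    batch = tokenizer(["an example sentence"], return_tensors="pt")
    cls_vector = encoder(**batch).last_hidden_state[:, 0]  # vector at the [CLS] position
    logits = classifier(cls_vector)
    # Fine-tuning then trains the encoder and the new head jointly on the labelled task data.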

The general goal is to exploit the representations learned during the language modelling task, so that the network starts out knowing the general grammatical structure of the language, multi-word expressions, how to distinguish word senses from each other, etc. It then needs far fewer examples to learn a specific task. There's definitely transfer learning going on: if you initialise the network randomly instead of from the LM objective, results are much worse.
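
That ablation is easy to sketch with the same library (again assuming bert-base-uncased and a two-label head for illustration): build one model whose encoder weights come from the LM pretraining, and one with an identical architecture but random weights, then fine-tune both on the same labelled set.

    from transformers import AutoConfig, AutoModelForSequenceClassification

    # Encoder weights transferred from the (masked) LM pretraining, new head random:
    pretrained = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Same architecture, but every weight randomly initialised (no transfer):
    config = AutoConfig.from_pretrained("bert-base-uncased", num_labels=2)
    random_init = AutoModelForSequenceClassification.from_config(config)

    # Fine-tuning both on the same small labelled set is the usual comparison:
    # the pretrained model reaches good accuracy with far fewer examples.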

This is a good blog post from when the recipe was still fairly new: https://www.ruder.io/nlp-imagenet/
