The post mentions that there's work ongoing to make this available in languages other than English. If you'd like to help contribute a language, there's a discussion thread available with 20 languages under development so far - we'd love to have more folks join us: http://forums.fast.ai/t/language-model-zoo-gorilla/14623
I'm glad we're again concentrating on newer language models.
Curious how it'll perform compared to fasttext when used as encoding network in larger tasks. I can't help but notice the trend of going back to simpler models with smarter optimizations and regularization to achieve better results.
This is a frequent question of mine, which I ask everyone using RNNs - what do you think of the idea that CNNs will be able to replace RNNs for sequence tasks [0]? CNNs are also less computationally expensive, so there's a definite benefit to switching if the performance is on par.
fasttext is just an encoding of the first layer of a model (the word embeddings - or subword embeddings). Full multi-layer pre-trained models are able to do a lot more. For instance, on IMDb sentiment our method is about twice as accurate as fasttext.
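For what it's worth, a minimal illustration of that point using the official fasttext package (the corpus file name is just a placeholder): fasttext gives you one context-free vector per (sub)word, and everything above that layer still has to be learned from scratch by the downstream model.

    # Assumes `pip install fasttext` and a plain-text file corpus.txt (placeholder).
    import fasttext

    ft = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=300)
    vec = ft.get_word_vector("movie")   # a single 300-d vector, no sentence context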
As to whether CNNs can replace RNNs in general, the jury is still out. Over the last couple of years there have been some sequence tasks where CNNs are state of the art, some where RNNs are. Note that with stuff like QRNNs the assumption that CNNs are less computationally expensive is no longer necessarily true: https://github.com/salesforce/pytorch-qrnn
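For reference, basic QRNN usage from that repo's README looks roughly like this (the exact API may have drifted since; a CUDA GPU is assumed, as in the repo's own example):

    import torch
    from torchqrnn import QRNN

    seq_len, batch_size, hidden_size = 7, 20, 256
    x = torch.rand(seq_len, batch_size, hidden_size).cuda()

    # Convolution-based drop-in replacement for an nn.LSTM layer stack
    qrnn = QRNN(hidden_size, hidden_size, num_layers=2, dropout=0.4).cuda()
    output, hidden = qrnn(x)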
I'd be surprised if CNNs win out in the end for tasks that require long-term state (like sentiment analysis on large docs), since RNNs are specifically designed to be stateful - especially with the addition of an attention layer.
For instance, on IMDb sentiment our method is about twice as accurate as fasttext.
Seeing as fasttext accuracy is 90%+, does this mean your method achieves 180%?
I'm nitpicking of course, but lately I've seen claims like "20% improvement in accuracy", where on closer inspection, the authors mean error rate dropped from 5% to 4%.
Which is not bad of course, but in the grand scheme of things, a 1% absolute improvement may not be such a game-changer, especially if it comes at the cost of other relevant metrics like model complexity, developer sanity or performance.
(haven't read your paper yet, just a general sigh/rant)
This generally is the metric you care about - a difference of one percentage point can be an improvement of twenty percent, as that means that the total number of "bad events" that you expect to get when running the system is decreased by 20%. And it's quite reasonable to assume that here, as in almost all other domains, "x% improvement" means the percentage difference (multiplicative), not the percentage point difference (subtractive). For pretty much every percentage quantity, things like defect ratios, recidivism rates or financial interest rates, "20% increase" never means an increase of 20 percentage points but an increase by 20 percent of the starting value. If we're nitpicking, "1% absolute improvement" is an inaccurate statement, the improvement should be described as 1pp (or 20%), not 1%.
Especially for more well-defined problems, going from 98.5% to 99.5% is "just" 1pp absolute improvement, but the fact that you make a third as many mistakes can well justify a more complex model that requires ten times more hardware. The metric that you'd actually care about would often be something like "number of hours required to correct the mistakes" or "number of lost sales due to mistakes", both of which scale with the relative percentage change.
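A tiny, purely illustrative piece of arithmetic for the distinction above:

    # Error rate drops from 5% to 4%: 1 percentage point, but 20% fewer errors.
    old_err, new_err = 0.05, 0.04

    absolute = old_err - new_err              # 0.01 -> "1pp"
    relative = (old_err - new_err) / old_err  # 0.20 -> "20% relative improvement"

    print(f"{absolute * 100:.1f}pp absolute, {relative:.0%} relative")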
Your note on "more well defined problems" is spot on. Chasing single percent improvements and SOTA is indeed the name of the game there.
But defining the problem in the first place, figuring out the cost matrix and solution constraints, is typically the bigger challenge in highly innovative projects. Once you know what to chase, 80% of the job is done.
Disclosure: building commercial ML systems for the past 11 years, using deep learning and otherwise. What you call "metric you care about" is often not the metric you care about. This is why people coming from academia are sometimes taken by surprise that logistic regression, linear models, or heck, even rule-based systems (!) are still so popular. Model simplicity, developer sanity and performance do matter, too.
> but in the grand of scheme of things, 1% absolute improvement may not be such game-changer, especially if it comes at the cost of other relevant metrics like model complexity, developer sanity or performance
fasttext makes errors about 10% of the time, and our approach makes errors about 5% of the time. It's certainly fair to say (although nitpicky) that "accuracy" isn't quite the right term here (I should have said "half the error").
But as for your general sigh/rant... absolute improvement is very rarely the interesting measure. Relative improvement tells you how much your existing systems will change. So if your error goes from 5% to 4%, then you have 20% fewer errors to deal with than you used to.
An interesting example: the Kaggle Carvana segmentation competition had a lot of competitors complaining that the simple baseline models were so accurate that the competition was pointless (it was very easy to get 99% accuracy). The competition administrator explained, however, that the purpose of the segmentation model was to do automatic image pasting into new backgrounds, where every misclassified pixel would lead to visible image problems - and with a million-plus pixels per image, that demands a very low error rate!
Radim: please consider incorporating this into gensim. It really is superior to simpler classification models running on top of word/BPE/wordpiece embeddings and to classic machine learning algorithms used for text classification and topic modeling like HDP, LDA, LSI/LSA, etc. (You can see for yourself how well this works out-of-the-box with a simple exercise: grab a pretrained model from fast.ai, run a bunch of documents through it, grabbing and saving each time the last hidden-layer representation of each document, and then map these representations to a two-dimensional plot with, say, t-SNE.)
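(If you want to try the exercise above, the plotting end is just a few lines. This sketch assumes you've already saved the last-hidden-layer representations, one row per document, to a file; extracting them from the pre-trained model is omitted here.)

    import numpy as np
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    doc_vectors = np.load("doc_vectors.npy")   # placeholder file: (n_docs, hidden_dim)

    coords = TSNE(n_components=2).fit_transform(doc_vectors)
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    plt.show()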
I realize that outside of Silicon Valley and other technology centers, most established companies are far -- far -- from adopting deep learning for any application of importance, due partly to the current unavailability of developers with AI expertise, and partly to deep learning's so-called "unexplainability" (i.e., the inability of many corporate executives and machine learning practitioners to reason about it, and their resulting discomfort with it). But it's only a matter of time before Corporate America starts following the lead of companies like Google and Facebook, which today are aggressively using state-of-the-art AI in lots of important applications.
Why not get ahead of this multi-decade trend?
PS. For those who don't know, Radim is the creator of gensim, a popular, friendly Python library for text classification and topic modeling.[a]
Generally, % change in failure rate is what you care about, in most fields. E.g., if something "increases your chance of getting cancer by 50%", that doesn't mean it increases the risk to 1 in 2. It just means the risk goes from 1% to 1.5%.
When I think of state of the art in this area, I think of the Deep Contextualized Word Representations (ELMo) paper from Peters et al., which you cite but don't compare against.
The only point of reference I see between the two papers is the CoVe models, which you beat pretty handily - but the ELMo model also beats CoVe handily, just on different datasets, so it's not clear how they stack up.
Any chance you could do some more direct comparisons? You do say that it's a more complex architecture, and the tokenized char convolution stuff is a bit of a pain to do, but if that actually helps, it's not that bad to do once.
From an engineering perspective, not changing the LM weights is kind of nice because then you can train multiple separate models on top of the embeddings without needing to retrain everything (and deal with the associated "noise" when retraining models) and it gives some nice modularity. It would be nice to know how much of the performance is lost from having embeddings that can be shared across a lot of tasks.
Random note: it seems like in Table 7, you have bolded "Freez + discr + stlr" in the IMDb column which has a value of 5.00, whereas "Full + discr" has a value of 4.57, and so should probably be the bolded number.
Very interesting comments. Yes, we absolutely want to do comparisons to ELMo. It's a little tricky to do so on our datasets, since ELMo isn't really a complete method on its own, but more an addendum to existing methods. In the future we hope to do seq2seq and sequence labeling studies, and we can then ensure we pick datasets that the ELMo paper covered.
Using char tokens can definitely be helpful, as can sub-words. It's something we've been working on too, and hope to show results of this in the future.
I mainly disagree with your view of end-to-end training. In computer vision we pretty much gave up on trying to re-use hyper-columns without fine-tuning, because the fine-tuning just helps so much. It's really no trouble doing the fine-tuning - in fact the consistency of using a single model form across so many different datasets is really convenient and helpful for doing additional levels of transfer learning.
Thanks for the note about table 7 - it's actually an error (it should be 5.57, not 4.57; Sebastian is in the process of uploading a corrected version).
Perhaps even more interesting than comparison would be modifications to ULMFit to incorporate good ideas from the AllenNLP ELMo paper.
The learned weighting of representation layers seems like a decent candidate, as does giving the model flexibility to use something other than a concatenated [mean / max / last state] representation of final LSTM output layer (as is the case in some of ELMo's task models). I'm personally curious about using an attention mechanism in conjunction with something like ELMo's gamma task parameter (regularizer) for learning a weighted combination of outputs but haven't been able to get things to function well in practice.
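For concreteness, a learned weighting over layer outputs in the spirit of ELMo's scalar mix might look roughly like this (a rough PyTorch sketch, not the AllenNLP implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ScalarMix(nn.Module):
        # Softmax-normalized weights over the layer outputs, plus a gamma scale.
        def __init__(self, num_layers):
            super().__init__()
            self.scalars = nn.Parameter(torch.zeros(num_layers))
            self.gamma = nn.Parameter(torch.ones(1))

        def forward(self, layer_outputs):
            # layer_outputs: list of tensors, each (batch, seq_len, dim)
            weights = F.softmax(self.scalars, dim=0)
            mixed = sum(w * h for w, h in zip(weights, layer_outputs))
            return self.gamma * mixed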
The dataset the ELMo model is trained on might also be preferable to WikiText-103 for practical English tasks, although you lose the nice multilingual benefits you get from working with WikiText-103.
In general it seems like the format described in the ELMo paper is simply not designed to work at very low N, because the weights of the (often complex) task models used in ELMo's benchmarks are learned entirely from scratch. That's not possible without a decent amount of labeled training data.
Anyhow, thought the paper was very well put together, definitely an enjoyable read. Hope yourself and Sebastian collaborate on future papers, as good things certainly came of this one!
I just wanted to clear up my comments on fine-tuning. These LMs are huge. The ELMo paper uses 300-dimensional embeddings; yours uses 400 (which, btw, should probably be controlled for in a comparison). As an engineer, I don't really want to deploy a fine-tuned LM for every task I have. Especially on smartphones, I can barely deploy one of these.
The obvious answer is that I should just train a single joint model.
That's great, but when you retrain a model, even if you get similar accuracy, your actual predictions change. It's basically why same model ensembles help.
So if I am trying to improve predictions for a single task, but I have a joint model, then I have to deal with a whole pile of churn that I wouldn't if I had separate models.
This doesn't show up in academic metrics, but people care when things that used to work stop working for no real reason, even if an equal amount of new things started working.
So, I'm not saying we shouldn't fine-tune things; it's that I have a set of engineering challenges that make fine-tuning less ideal, and I'm curious how much we can get away with sharing. There are plenty of CV papers which indicate that the very first layers basically don't benefit from fine-tuning because they are so general. Is that true for NLP as well, or are word embeddings already quite domain specific?
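(For reference, the kind of partial sharing I'm describing is mechanically simple in PyTorch - a hypothetical sketch, not the fast.ai/ULMFiT code:)

    import torch
    import torch.nn as nn

    emb = nn.Embedding(30000, 400)          # "first layer": word embeddings
    rnn = nn.LSTM(400, 1150, num_layers=3)  # higher, more task-specific layers
    head = nn.Linear(1150, 2)               # per-task classifier

    for p in emb.parameters():              # keep the shared layer frozen
        p.requires_grad = False

    # Only pass the still-trainable parameters to the optimizer.
    trainable = [p for m in (rnn, head) for p in m.parameters()]
    opt = torch.optim.Adam(trainable, lr=1e-3)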
This method dramatically improves over previous approaches to text classification, and the code and pre-trained models allow anyone to leverage this new approach to better solve problems such as: finding documents relevant to a legal case; identifying spam, bots, and offensive comments; classifying positive and negative reviews of a product; and grouping articles by political orientation.
I'm starting a new project where I'm given many recipes and I need to take in a free form text of recipe ingredients (e.g. "1/2 cup diced onions", "two potatoes, cut into 1-inch cubes", etc.) and build a program that identifies the ingredient (e.g. onion, potato), as well as the quantity (e.g. 0.5 cup, 2.0 units). Could I use something like Fast.ai to tackle this problem?
CRF works quite well - it's actually what I use right now for recipe parsing on https://cookalo.com/. It's built on CRFsuite (via its Python bindings), trained on already-labeled recipes. If you build your own app and want to do some comparison, feel free to run some benchmarks against it.
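If you want a feel for the approach, a minimal sklearn-crfsuite sketch looks something like this (the feature function and tag set here are illustrative toys, not what cookalo actually uses):

    import sklearn_crfsuite

    def features(tokens, i):
        w = tokens[i]
        return {
            "word.lower": w.lower(),
            "is_number": w.replace("/", "").isdigit(),
            "position": i,
        }

    # One training sequence: tokens of an ingredient line and their tags.
    tokens = ["1/2", "cup", "diced", "onions"]
    labels = ["QTY", "UNIT", "COMMENT", "NAME"]

    X_train = [[features(tokens, i) for i in range(len(tokens))]]
    y_train = [labels]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))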
Yes, that's correct - it's similar to the mechanism the NY Times folks were using, and I've been focusing on the datasets to feed the CRF, since that's what drives the whole thing. This is the output I got based on your example:
[
  {
    "unit": "cup",
    "input": "1$1/2 cups seedless red or green grapes",
    "name": "red grapes",
    "qty": "1$1/2",
    "comment": "seedless or green"
  }
]
Don't hesitate to try the API out by pasting some examples to the white box on the site and pressing the "Try it out!" button, it's interactive :)
I'm not sure - what you're describing is information extraction. I haven't tried that yet, but I'm certainly interested in doing so (especially for medical data).
Similarly, could something like this be useful to extract out a command that a user wants to run from a transcription? For example, "Add a user named Jenny to our client list.", which results in the command, 'create user Jenny'. Or, "Could you add Jenny to our client list?", which results in the same command, 'create user Jenny'. Perhaps instead of outputting the next word, output the expected command from a set of commands?
Embeddings of OOV terms are initialized to the mean of the pre-trained embeddings, and the language model fine-tuning step then allows the model to adjust these values.
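In code, the idea looks something like this (toy data and illustrative variable names, not the actual fast.ai implementation):

    import numpy as np

    pretrained_embs = np.random.randn(4, 3).astype(np.float32)  # (old vocab, dim) - toy
    old_stoi = {"the": 0, "movie": 1, "was": 2, "good": 3}
    new_itos = ["the", "film", "was", "great"]                  # target-corpus vocab

    mean_emb = pretrained_embs.mean(axis=0)
    new_embs = np.zeros((len(new_itos), pretrained_embs.shape[1]), np.float32)
    for i, w in enumerate(new_itos):
        j = old_stoi.get(w, -1)
        # Known words copy their pre-trained vector; OOV words ("film", "great")
        # start at the mean and get adjusted during LM fine-tuning.
        new_embs[i] = pretrained_embs[j] if j >= 0 else mean_emb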
Hey Jeremy, thanks for sharing this awesome article!
Do you think this is also applicable to classify 1. readability and 2. entertainment/fun of a text? Thanks!
Yes I think it would work great for both of those applications. For readability, you'd ideally want a dataset showing how long someone took to read each document. For entertainment, some kind of review dataset would be fine (it's basically a type of sentiment analysis, which we've already tested on 3 datasets.)
Wow - great paper! Very readable / accessible. I'm working on some stuff for NLP in materials science academic literature, but we haven't tried anything beyond the usual word embedding -> supervised classifier approach. I'll have to give this a try!
Jeremy, after a quick first reading, I believe the following can be improved in the paper:
1. In the introduction, there seem to be contradictory statements about the status of inductive transfer for NLP. It is first stated that it "has had a large impact in practice", then in the next paragraph that "it has been unsuccessful for NLP". How can it have a large impact and at the same time be unsuccessful?
2. In the introduction, it is stated that "Research in NLP focused mostly on transductive transfer". Perhaps this statement was valid back in 2007, but it seems outdated to me. Recently, most transfer learning in NLP has involved using pre-trained embeddings in an inductive transfer setting.
3. In the beginning of the "2 Related Work" section, in the excerpt "Features in deep neural networks in CV have been observed to transition from task-specific to general from the first to the last layer", I believe the order "first to the last" should read "last to the first", since the last layers have the more task-specific features and the first layers have the more general.
Man that video at the bottom is great. I just finished up research on an image classification task in medical imaging and that tool would have helped out a lot in debugging and interpreting results - especially when you're working on image datasets of objects which you aren't very used to (like medical datasets where different tissue textures are important and only medical experts can distinguish between them).
Hi Jeremy,
I noticed you were active on the Kaggle toxic comments challenge though did not participate. Did you apply this model to that problem and if so, how were the results?
Is there an easily accessible API? And is this robust to bad labels - imperfect training data? I have a huge corpus of labeled descriptions for jobs and I want to categorize them as 'programming' or 'not programming'. The accuracy of my manual labeling is like 95%. Can that be used to train a classifier using this newly published technology?
I understand that you do mention the pre-training / transfer learning approach clearly, but isn't it disingenuous to claim that you provide better performance based on (only) 100 labeled examples, when the pre-training dataset (Wikitext-103) actually contains 103M words?
Of course not. The use of pre-training on a large unlabeled corpus and subsequent fine-tuning is what the paper is about. It is stated repeatedly in the paper and the post.
It is totally correct and in no way misleading to say we need only 100 labeled examples. Anyone can get similar results on their own datasets without even needing to train their own wikitext model, since we've made the pre-trained model available.
(BTW, I see you work at a company that sells something that claims to "categorize SKUs to a standard taxonomy using neural networks." This seems like something you maybe could have mentioned.)
Got it. I was looking for input on how generalizable the model is (i.e., how well the weights can change/adapt) when the labeled training data is 100x smaller than the initial pre-training dataset.
Also, I don't understand the need to be so defensive, or what my employer has to do with my post.
When you use the word disingenuous, you invited the response you got. Totally uncalled for to write that.
His response about your employer was likely driven by an assumption that you viewed this as free, open-source competition to your product, hence the negative comment.
To the OP:
I've done a lot of NLP, and this is phenomenal work.