A practical issue for Naive Bayes that also infects linear models is bias w.r.t. document length. Typically when you are detecting a rare, relatively compact class such as sports articles (or spam) you will tend to have a strongly negative prior, many positive features, and few negative ones. As a consequence, as the length of your text increases, not only does the variance of your prediction increase, but the mean tends to as well. This leads to all very long documents being classified as positive, regardless of their text. You can observe this by training your model and then classifying /usr/dict/words.
This is the most common mistake I've seen in production use of linear models on document text. Invariably, they'll misfire on any unusually long document.
I agree there are issues with NB such as the ones you brought up, but I don't think document length is the real offender here. This really boils down to noise and how well you filter and devalue it. Stacking more filters, like stemming, stopword removal, and dropping high-frequency features, definitely helps here, to the point where longer documents can actually improve accuracy. Additionally, tuning your ngram lengths or using variable lengths, choosing between word or character ngrams, and limiting your distribution size will all help, depending on what you're trying to categorize.
> 1. Use only the beginning of the document, as that's probably the most important part anyways, and it's fast.
That seems like a solution devised for news articles, as the standard news writing style puts answers to the Five Ws up front in the article.
I'll add to this that you can add a very crude (separate) model for the document length and number of distinct words, and use that to flag outlier documents that might bump into the known weaknesses with respect to document length.
This is not invariant to the size of the document (though agreed, generally better). It doesn't solve the problem of having mostly positive features and a negative prior.
Stated more formally, your model is b + wᵀx. Generally, b is < 0, and E[wᵀx] > 0. As the document grows, wᵀx tends to dominate b. You'll have bias with length as long as E[wᵀx]≠0 and there aren't any constraints on w that would force this.
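To make the bias concrete, here's a minimal sketch. The intercept, vocabulary, and weight values are all made up for illustration; the only property that matters is that b < 0 while the per-word weights are positive on average, as described above.

```python
# Hypothetical trained model: a negative intercept b and per-word weights
# whose average is positive. All numbers here are invented for illustration.
b = -3.0
vocab = [f"word{i}" for i in range(1000)]
# Positive weights outweigh negative ones on average, so E[w^T x] > 0 per word.
w = {word: (0.05 if i % 2 == 0 else -0.02) for i, word in enumerate(vocab)}

def score(doc_words):
    # The linear model b + w^T x on a bag of words
    return b + sum(w.get(word, 0.0) for word in doc_words)

short_doc = vocab[:20]   # a 20-word document
long_doc = vocab * 2     # a 2000-word document with the same word mix

# The mean of w^T x grows linearly with length, so the long document crosses
# the decision threshold regardless of its content.
print(score(short_doc), score(long_doc))
```

The word mix is identical in both documents; only the length differs, yet the sign of the score flips.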
If your data obeys the naive Bayes assumption, namely that each word is independently drawn from some distribution conditional on its class, then this model is mathematically optimal. E.g. suppose there was an exactly 1% chance that any given word in a spam email would be "viagra".
Now obviously real world data doesn't obey these assumptions perfectly. But I don't see how violating the independent features assumption would cause the problem you mention. A longer email does mean the word "viagra" is more likely to occur in a normal email just by random chance. But the model takes that into account by recording the frequency of "viagra" in normal emails and seeing if it's consistent with that.
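As a sketch of that accounting, with made-up illustrative probabilities: a neutral word that is equally common in both classes contributes a log-likelihood ratio of roughly zero per occurrence, so a long email doesn't accumulate spurious evidence just from its length.

```python
import math

# Illustrative per-class word probabilities (invented numbers, one
# distribution per class, as the NB generative story assumes).
p_word_spam = {"viagra": 0.01,   "the": 0.05}
p_word_ham  = {"viagra": 0.0001, "the": 0.05}

def log_likelihood_ratio(doc):
    # NB evidence: sum over words of log p(w|spam) - log p(w|ham)
    return sum(math.log(p_word_spam[w]) - math.log(p_word_ham[w]) for w in doc)

# A single "viagra" in a long, otherwise neutral email: the 200 neutral
# words each contribute exactly 0 here, so the evidence comes entirely
# from the one discriminating term.
doc = ["the"] * 200 + ["viagra"]
llr = log_likelihood_ratio(doc)
```

In a fitted model the neutral words won't contribute exactly zero, but as long as their class-conditional frequencies match, their net contribution stays small.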
For a simple example, imagine a dataset where the naive assumption is true if you split it into 100 classes, but false if you split it into one vs everything else. All of the conditional probabilities for the "everything else" class will be underestimated, biasing the weights towards the one.
This problem happens because the class you are interested in is more compact than its inverse.
It's also exacerbated by feature selection, as the negative features have smaller weights and thus lower information gain than the positive features.
I have an unusually poor understanding of this subject, so I know I am likely to be wrong and am seeking your correction. I just don't get it.
Naively, I was shocked that you suggest classifying /usr/dict/words as a test after training. To me, this seems like taking some exhaustive sample catalogue of microbes in the Centers for Disease Control's building, and classifying it against a model of anything microbe-related trained on the real world. I expect it would of course pass the test for (be a false positive for) a hospital, a leprosy ward, a middle school, a sewage processing plant, a factory, a kitchen, an African steppe, the ISS, a suburb, anything. If it's exhaustive, it'll have whatever is in those places! And if it's a catalogue, you've reduced the frequency information that could differentiate them.
Like, how could I expect an exhaustive list of unique words, with every count reduced to 1, to be classifiable in any way by a model trained on word frequencies? It just seems like a shocking suggestion, because it artificially reduces to a count of 1 the exact feature you are using, and then, to make sure you can't even use the presence or absence of anything, it includes 1 of everything.
I was thinking, it is like learning to classify a Romance language (French, Italian, Spanish, etc.) from the frequency of the letters A-Z in documents in those languages, and then "testing" that classifier on the text "abcdefghijklmnopqrstuvwxyz". If we trained on frequency, why would we expect useful output after reducing every frequency to 1? Actually, since a dictionary is exhaustive, it is like including the German ß (not in the other languages) and the French ç (not in the other languages), and then being surprised that the text passes the test for German (based on letter frequency) and for French. You've completely removed the information you trained on, then added an example of anything that could be used.
This seems really unrelated to document length.
A further example I thought of: suppose you were training something to classify professions from a database of classified CVs/resumes, to decide if someone is a programmer, cook, athlete, etc. If I fed it a list of every keyword from my entire database, one per line, then the answer to 'is this an embedded systems programmer?' is yes, the answer to 'is this a front-end developer?' is yes, and if your database is big enough, the answer to 'is this a historian specialized in medieval metallurgy?' is yes. Why would any of these be considered false positives, since I've just put them all in as skills?
So I realize I am trying to be intuitive and likely am very naive, but I am totally confused why you suggest usr/dict/words.
You seem to say it is about length, but my a-z text example is just 26 characters (plus special ones) whereas the training data could be thousands of pages in each language.
I simply don't follow at all, which is frustrating because I have such a poor understanding to begin with.
I will appreciate any correction!
Sorry this ended up a bit long. I tried to include 2 or 3 practical examples so you can better judge my understanding and maybe correct it. Thank you.
The examples other people are using are fairly narrow. I would like to substantiate that text categorization via a naive bayes classifier is surprisingly accurate and simple. This paper[1] uses ngrams and a simple out-of-place measure to compare articles against different verticals and often sees greater than 99% accuracy for relatively small blocks of text. The out-of-place measure also adds a penalty to features not found in the document, which helps establish the individuality of the category classification. Raw matching performance is also fairly impressive; a less naive implementation is also highly parallelizable.
From experience, I suspect NB does well on text primarily for two reasons:
(1) High dimensionality - this is also the reason why SVMs with linear kernels do so well on text (NBs find a linear boundary too - but a different linear boundary than linear-kernel SVMs)
(2) this reason applies to certain cases only: text where "token"-based estimates are sufficiently discriminatory. For example, take a dataset with two kinds of documents, one talking about product A and the other about product B, and you want to label the documents by which product they're talking about. Here, just noticing whether the term "A" or "B" appears in the document is good enough for classification. You don't need powerful models that infer "connotations" of words from context. NB will do well here, especially if bundled with a feature selection technique that weeds out noisy features (like something based on Normalized Mutual Information)
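A toy sketch of case (2), with hypothetical documents: when the discriminating token is the product name itself, presence alone is the whole decision rule.

```python
# Hypothetical labeled documents mentioning product "A" or product "B".
docs = [
    ("the new A model ships with better battery life", "A"),
    ("B users report the latest B firmware fixed the bug", "B"),
]

def classify(text):
    # Token presence is the entire rule: no context or "connotations" needed.
    return "A" if "A" in text.split() else "B"

results = [classify(text) == label for text, label in docs]
```

Any model that can pick up on single-token presence, NB included, will do fine on data like this.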
Considering the relative ease of implementation, classification accuracy with smaller datasets, and computational efficiency of Naive Bayes classifiers, I am surprised that they are not mentioned as often as other machine learning competitors, such as random forest.
Are there major drawbacks to Naive Bayes classifiers? Is it just that they aren't as accurate on large datasets?
The (conditional) independence assumptions it makes are pretty strong and usually inaccurate. In the text example, consider "Steph Curry" and "Warriors." These are going to show up very frequently together, and both provide strong evidence that the topic is sports. However, they don't provide strongly independent evidence of sports. Supposing Curry announced he was making an investment in a local restaurant, the article would probably mention the Warriors just because it's Curry.
The problem is that seeing "Warriors" when you see "Steph Curry" shouldn't make you that much more confident that it's sports than just seeing "Steph Curry" alone. Just like seeing "University" with "Stanford" shouldn't change any inferences much from seeing "Stanford" alone. In naive bayes models, these are treated as independent pieces of evidence and that can lead to overconfidence and errors.
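A numeric sketch of that overconfidence, with made-up conditional probabilities: NB adds the two log-likelihood ratios as if they were independent, so the correlated second term literally doubles the evidence.

```python
import math

# Illustrative conditional probabilities (invented numbers): both features
# are strong evidence for sports, but in reality they are highly correlated.
p_sports     = {"steph curry": 0.02,   "warriors": 0.02}
p_not_sports = {"steph curry": 0.0002, "warriors": 0.0002}

def nb_log_odds(features):
    # NB sums each feature's log-likelihood ratio as independent evidence.
    return sum(math.log(p_sports[f]) - math.log(p_not_sports[f]) for f in features)

one  = nb_log_odds(["steph curry"])
both = nb_log_odds(["steph curry", "warriors"])
# "warriors" adds almost no new information given "steph curry",
# yet the model's log-odds double.
```

A model that knew about the correlation would discount the second term instead of counting it in full.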
Naive bayes is a generative model, so in flipping bayes rule around to discriminate classes, you have to be sure your probability model is decent. In a discriminative model, you just go straight to learning the p(class | observations) and have no requirement for a decent model of p(observations). p(observations) is the kind of model that would have to know about "university" and "stanford" being likely to be observed together. Often, it's better to go straight to the discriminative model. In this case, logistic regression.
It doesn't have good accuracy. I have yet to see a real-life dataset where it's better than just calling LogisticRegressionCV from scikit-learn. For bigger datasets, you may use Vowpal Wabbit or fastText. LR may be a little slower to train (but not by much), and just as fast at prediction. What is the purpose of using an algorithm when another one is just better?
There's a reason for that. NB needs to fit more params than LR (and other linear discriminant learners): e.g. for binary classification, NB needs to fit 2*N params while LR needs just N.
Now, if you take the log of the final p(c_1|w) I derived, the product in p(w|c_0) / p(w|c_1) turns into an exponentiated sum, giving you one parameter per word, plus an intercept for the log of p(c_0) / p(c_1). You end up with exactly the same 1/(1+exp(linear stuff)) form and parametrization you have in logistic regression [0].
This is an instance of a broader fact: a given graphical model with a given fixed parametrization can be trained generatively or discriminatively. The two will end up learning different models, but that's because they make different assumptions, not because of different parametrizations.
[0] linear stuff = intercept + sum_i (coefficient of word_i)
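The equivalence can be checked numerically. The class priors and word probabilities below are invented illustrative numbers, not fitted values:

```python
import math

# Illustrative NB parameters (invented numbers) for two classes and 3 words.
p_c1, p_c0 = 0.3, 0.7
p_w_c1 = [0.02, 0.5, 0.1]   # p(word_i | c_1)
p_w_c0 = [0.001, 0.4, 0.2]  # p(word_i | c_0)

# Generative form: p(c_1|w) = 1 / (1 + p(c_0)/p(c_1) * prod_i p(w_i|c_0)/p(w_i|c_1))
stuff = (p_c0 / p_c1) * math.prod(a / b for a, b in zip(p_w_c0, p_w_c1))
posterior_nb = 1 / (1 + stuff)

# Logistic form: 1 / (1 + exp(-(intercept + sum of per-word coefficients)))
intercept = math.log(p_c1 / p_c0)
coeffs = [math.log(a / b) for a, b in zip(p_w_c1, p_w_c0)]
posterior_lr = 1 / (1 + math.exp(-(intercept + sum(coeffs))))

print(posterior_nb, posterior_lr)  # identical up to floating point
```

Both paths compute the same posterior; the logistic form just names the log-ratios as coefficients.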
I may be missing something, but how do you take a log on the RHS of p(c_1|w)? The denominator has a sum i.e. "1+ something", which the log doesn't distribute over.
Heh, nope, you're right, I was writing too quickly. I meant to say to take the log of the probability term in p(c_1|w). So you take p(c_1|w) = 1/(1+stuff) and turn it into p(c_1|w) = 1/(1+exp(log(stuff))). stuff is a product, so log(stuff) is a sum. Good catch.
In that case, you needn't have kept the denominator around at all. Since P(w) is not class-conditioned, the classifier could have directly calculated P(C_1|w)/P(C_0|w). The P(w) term cancels out, and you end up with the product of ratios of feature probabilities conditioned on the classes.
Note though that, for K classes, K>2, the number of parameters you need to store would blow up. You would need these N ratios: P(w_j|C_x)/P(w_j|C_y) for all possible class pairs x,y, where N is the number of features. So, N*C(K,2) values in all. On the contrary, multiclass softmax regression (the discriminative analogue of NB) would need NK parameters.
In the multiclass problem, they have the same number of parameters. You can see this by choosing a reference class c_y. The NB parameter for class c_x and feature w_j is f_jxy = log(p(w_j|c_x)/p(w_j|c_y)). If you want a new reference class c_w, you can see that f_jxw = f_jxy - f_jwy, so no new learned parameters are needed. You need one such parameter per feature (N of them) and per class that isn't c_y (K-1 of them), so you get N(K-1) parameters. This is the same as for multiclass softmax regression: N(K-1), not the NK you wrote (which you can see by working softmax out with a reference class c_y too). It really is the same parametrization. The analogy with NK-parametrized softmax may be more straightforward if you just use NK naive bayes features of the form f_jx = log(p(w_j|c_x)) for each class and check the equivalence with softmax.
It really is just a special case of a more general rule that a given PGM with a fixed parametrization can be trained discriminatively or generatively.
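The reference-class identity f_jxw = f_jxy - f_jwy is just log algebra, but a quick numeric check may help. The conditional probabilities are made-up values for one feature across three classes:

```python
import math

# Invented conditional probabilities p(w_j | c) for one feature, three classes.
p = {"c_x": 0.05, "c_w": 0.2, "c_y": 0.01}

def f(a, b):
    # The NB parameter log p(w_j|a) / p(w_j|b), with b as the reference class
    return math.log(p[a] / p[b])

# Changing the reference class from c_y to c_w requires no new parameters:
print(f("c_x", "c_w"), f("c_x", "c_y") - f("c_w", "c_y"))
```

Since every parameter for a new reference class is a difference of existing ones, only N(K-1) of them are independent.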
I'm sorry I didn't have a chance to respond before. You've been very patient with your responses, thanks!
Yes, this is quite neat!
I will look up the rule/theorem you mention. It seems "kind of" reasonable in the sense that the same number of independent parameters should show up in any representation, but (and this is probably me just being dumb) I need to think about why the form of the representation wouldn't affect the number of parameters.
Although, if m parameters in a representation can be mapped to n (m>n) parameters in another, the m parameters aren't really independent.
OK, now I am just talking to myself :-)
Also, in this context, the results showing discriminative classifiers learning better than their generative counterparts (asymptotically) are something I need to think about. [1]
Accuracy is a problem. NBs are linear classifiers, and moreover, the linear boundary is found by assuming features to be independent.
So, in cases:
(1) data cannot be separated linearly - NBs don't do well for the same reason as any other linear classifier wouldn't work well
(2) for linearly separable data - if the feature independence assumption is incompatible with the data, linear classifiers that don't make this assumption will beat it. For example, you can prove that logistic regression beats (or is at least as good as) Naive Bayes asymptotically. [1]
Many competition solutions use stacking, where the probabilities of several classifiers are used instead of their predictions.
Since naive bayes makes an independence assumption, and most competitions have features that are highly correlated, its probability / confidence in its prediction is probably a lot less useful than other classifiers'.
Besides that, NB doesn't do feature selection, regularization, or interactions out of the box, like many tree-based algorithms (such as XGBoost) do.
Interactions are a big one for me. With NB, you have to add an interaction feature by multiplying two together, but creating that feature steals some probability mass away from everything else, and finding a good interaction among many features is hard. The tree structure of random forests captures that naturally.
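For concreteness, a sketch of hand-crafting one such interaction feature; the tokens and feature name are hypothetical:

```python
def featurize(tokens):
    # Bag-of-words features, plus one hand-built interaction.
    feats = set(tokens)
    # The pair carries information neither token does alone; with NB you
    # must construct and count this combination explicitly, whereas a tree
    # model can discover the equivalent split on its own.
    if "stanford" in feats and "university" in feats:
        feats.add("stanford&university")
    return feats

with_pair    = featurize(["stanford", "university", "research"])
without_pair = featurize(["stanford", "research"])
```

With many features, the number of candidate pairs grows quadratically, which is why doing this by hand scales poorly.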
Edit: to your last point, NB does worse on larger datasets. When you don't have a lot of data, you have to make stronger assumptions. NB makes a big independence assumption, so it does well on very small datasets but falls short on large ones.
I would add that another practical aspect about Naive Bayes classifiers is that you can make use of the conditional probabilities for each feature that contributes to the predictions. That gives you some introspection on how the model is working and it's useful when "debugging" classifiers by finding features that should/shouldn't be used.
A side note here: the classification probabilities NB produces are not very accurate and usually need some correction using a process called "calibration".
Is Naive Bayes really ever the most practical choice? Yes, it is a simple, fast algorithm, but in my experience it's usually a non-trivial step below other simple models and doesn't show any major advantages. The results shown here seem good, but bag-of-words models usually do better than you might think on supervised NLP. So what's the motivation?
AFAIK it's by far the fastest machine learning method and one of the only ones that can be learned "online", i.e. it can update the model each time it gets a datapoint, then throw that datapoint away without saving it for future training. These are nice properties if you are doing something at very large scale or in an environment with very limited resources.
And if your data happens to actually meet the naive bayes assumptions (that all the features are conditionally independent) then it's literally mathematically optimal and you can't do any better than it. It seems to work fairly well even when that isn't the case though.
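A minimal sketch of that online property. This is a multinomial NB with Laplace smoothing; the class and method names are made up, and it's an illustration rather than a production implementation. Each example updates the counts and is then discarded:

```python
import math
from collections import Counter, defaultdict

class OnlineNB:
    def __init__(self):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)

    def update(self, tokens, label):
        # One-example update; the example itself is never stored.
        self.class_counts[label] += 1
        self.word_counts[label].update(tokens)

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        vocab = len(set().union(*self.word_counts.values()))
        best, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            n_words = sum(self.word_counts[c].values())
            score = math.log(n_c / total)
            for t in tokens:
                # Laplace smoothing so unseen words don't zero out the score
                score += math.log((self.word_counts[c][t] + 1) / (n_words + vocab))
            if score > best_score:
                best, best_score = c, score
        return best

nb = OnlineNB()
nb.update(["cheap", "viagra", "now"], "spam")
nb.update(["meeting", "tomorrow", "agenda"], "ham")
print(nb.predict(["viagra", "cheap"]))
```

The whole model state is two count tables, which is why a single pass over the stream is enough.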
Logistic regression can easily be made online too, keep in mind! sklearn has an implementation of online gradient descent, and vowpal wabbit is also excellent at those problems.
Naive bayes can be parallelized in ways that SGD can't, that's a whole other conversation.
Gradient descent can be made online. But it's very slow and suffers from catastrophic forgetting. Typical gradient descent needs to iterate over the dataset many times, while naive Bayes only needs one pass.