We must suspend disbelief a bit regardless: any "toxicity classifier" has a limited operational life, since people who want to say toxic things will simply adapt their language and run circles around it.
The workarounds range from simple letter substitution (sh!t) to completely different words/concepts (unalive) to "layer 2 sarcasm" (where someone adopts the persona of a supporter of the worldview they actually oppose, in a non-obvious attempt to rally people against that persona).
People have been getting away with being toxic in public for a long time. ML cannot keep up. Humans can’t even keep up.
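A toy illustration of how easily the simpler evasions slip past a keyword-style filter (the blocklist and example comments here are made up, not from any real system):

    # Toy keyword filter: flags a comment if it contains a blocklisted term.
    # The blocklist and test comments are hypothetical examples.
    BLOCKLIST = {"shit", "kill"}

    def is_toxic(comment: str) -> bool:
        tokens = comment.lower().split()
        return any(token.strip(".,!?") in BLOCKLIST for token in tokens)

    print(is_toxic("this is shit"))          # True  -- caught
    print(is_toxic("this is sh!t"))          # False -- letter substitution slips through
    print(is_toxic("go unalive yourself"))   # False -- euphemism slips through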
(Post author here.) Agree with both you and the parent here! We work a lot in the NLP and Trust & Safety space, and many of the models and datasets we see do ignore context -- so real-world "toxicity" models often end up simply as "profanity detectors" (https://www.surgehq.ai/blog/are-popular-toxicity-models-simp...). That would certainly happen with a Naive Bayes model as well.
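A minimal sketch of why that happens, assuming a bag-of-words Multinomial Naive Bayes trained on a tiny made-up labeled set: the features that most strongly predict the "toxic" class end up being surface-level profanity and insults, with no contextual signal at all.

    # Sketch: bag-of-words Naive Bayes "toxicity" model (training data is a
    # tiny hypothetical stand-in for a real labeled corpus).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    import numpy as np

    comments = [
        "you are a worthless idiot",        # toxic
        "this is complete shit",            # toxic
        "what a thoughtful, kind reply",    # not toxic
        "thanks, that was really helpful",  # not toxic
    ]
    labels = [1, 1, 0, 0]

    vec = CountVectorizer()
    X = vec.fit_transform(comments)
    clf = MultinomialNB().fit(X, labels)

    # Tokens most indicative of the "toxic" class: with context-free features,
    # the model is essentially a profanity/insult list.
    log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
    top = np.argsort(log_odds)[::-1][:5]
    print([vec.get_feature_names_out()[i] for i in top])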
Similarly, a lot of the training data/features ML engineers use ignore context -- for example, a Reddit comment may seem hateful in isolation, until you realize the subreddit it's in changes the meaning entirely (https://www.surgehq.ai/blog/why-context-aware-datasets-are-c...).
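One simple way to fold that context in (a sketch, not how any particular dataset is actually built) is to make the community part of the model input, so the same comment can be scored differently depending on where it was posted. The subreddit names and comment below are just illustrative:

    # Sketch: prepend the community to the comment text before featurizing,
    # so identical comments in different subreddits become different inputs.
    def build_input(subreddit: str, comment: str) -> str:
        return f"[subreddit: {subreddit}] {comment}"

    print(build_input("r/gaming", "I absolutely destroyed them last night"))
    print(build_input("r/news", "I absolutely destroyed them last night"))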
Regarding your point, we actually do a lot of "adversarial labeling" to try to make ML models robust to these countermeasures (e.g., making sure the models train on examples with in-word letter substitutions), but it's pretty tricky!
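For the letter-substitution piece specifically, a rough sketch of that kind of training-data augmentation (the substitution map and example sentence are made up; real pipelines are more involved):

    import random

    # Sketch: generate obfuscated variants of a training comment so the model
    # also sees common character substitutions. The map is a toy example.
    SUBS = {"i": "1", "a": "@", "s": "$", "e": "3", "o": "0"}

    def obfuscate(text: str, rate: float = 0.5) -> str:
        return "".join(
            SUBS[c] if c in SUBS and random.random() < rate else c
            for c in text.lower()
        )

    original = "this is shit"
    augmented = [obfuscate(original) for _ in range(3)]
    print(augmented)  # e.g. ['th1s is $hit', 'thi$ 1s sh1t', ...]
    # The augmented variants keep the original's "toxic" label when they are
    # added back into the training set.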