We must suspend disbelief a bit regardless: any "toxicity classifier" has a limited operational life, since people who want to say toxic things will simply adapt their language and run circles around it.
The workarounds range from simple letter substitution (sh!t) to completely different words/concepts (unalive) to "layer 2 sarcasm" (where someone adopts the persona of a supporter of the worldview they actually oppose, in a non-obvious attempt to rally people against that persona).
People have been getting away with being toxic in public for a long time. ML cannot keep up. Humans can’t even keep up.
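A toy illustration of how easily the simpler evasions slip past a keyword-style filter (the blocklist and example comments here are made up, not from any real system):

    # Toy keyword filter: flags a comment if it contains a blocklisted term.
    # The blocklist and test comments are hypothetical examples.
    BLOCKLIST = {"shit", "kill"}

    def is_toxic(comment: str) -> bool:
        tokens = comment.lower().split()
        return any(token.strip(".,!?") in BLOCKLIST for token in tokens)

    print(is_toxic("this is shit"))          # True  -- caught
    print(is_toxic("this is sh!t"))          # False -- letter substitution slips through
    print(is_toxic("go unalive yourself"))   # False -- euphemism slips through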
(Post author here.) Agree with both you and the parent here! We work a lot in the NLP and Trust & Safety space, and many of the models and datasets we see do ignore context -- so real-world "toxicity" models often end up simply as "profanity detectors" (https://www.surgehq.ai/blog/are-popular-toxicity-models-simp...). That would certainly happen with a Naive Bayes model as well.
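A minimal sketch of why that happens, assuming a bag-of-words Multinomial Naive Bayes trained on a tiny made-up labeled set: the features that most strongly predict the "toxic" class end up being surface-level profanity and insults, with no contextual signal at all.

    # Sketch: bag-of-words Naive Bayes "toxicity" model (training data is a
    # tiny hypothetical stand-in for a real labeled corpus).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    import numpy as np

    comments = [
        "you are a worthless idiot",        # toxic
        "this is complete shit",            # toxic
        "what a thoughtful, kind reply",    # not toxic
        "thanks, that was really helpful",  # not toxic
    ]
    labels = [1, 1, 0, 0]

    vec = CountVectorizer()
    X = vec.fit_transform(comments)
    clf = MultinomialNB().fit(X, labels)

    # Tokens most indicative of the "toxic" class: with context-free features,
    # the model is essentially a profanity/insult list.
    log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
    top = np.argsort(log_odds)[::-1][:5]
    print([vec.get_feature_names_out()[i] for i in top])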
Similarly, a lot of the training data/features ML engineers use ignore context -- for example, a Reddit comment may seem hateful in isolation, until you realize the subreddit it's in changes the meaning entirely (https://www.surgehq.ai/blog/why-context-aware-datasets-are-c...).
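One simple way to fold that context in (a sketch, not how any particular dataset is actually built) is to make the community part of the model input, so the same comment can be scored differently depending on where it was posted. The subreddit names and comment below are just illustrative:

    # Sketch: prepend the community to the comment text before featurizing,
    # so identical comments in different subreddits become different inputs.
    def build_input(subreddit: str, comment: str) -> str:
        return f"[subreddit: {subreddit}] {comment}"

    print(build_input("r/gaming", "I absolutely destroyed them last night"))
    print(build_input("r/news", "I absolutely destroyed them last night"))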
Regarding your point, we actually do a lot of "adversarial labeling" to try to make ML models robust to these countermeasures (e.g., making sure the models train on examples with in-word letter substitutions), but it's pretty tricky!
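For the letter-substitution piece specifically, a rough sketch of that kind of training-data augmentation (the substitution map and example sentence are made up; real pipelines are more involved):

    import random

    # Sketch: generate obfuscated variants of a training comment so the model
    # also sees common character substitutions. The map is a toy example.
    SUBS = {"i": "1", "a": "@", "s": "$", "e": "3", "o": "0"}

    def obfuscate(text: str, rate: float = 0.5) -> str:
        return "".join(
            SUBS[c] if c in SUBS and random.random() < rate else c
            for c in text.lower()
        )

    original = "this is shit"
    augmented = [obfuscate(original) for _ in range(3)]
    print(augmented)  # e.g. ['th1s is $hit', 'thi$ 1s sh1t', ...]
    # The augmented variants keep the original's "toxic" label when they are
    # added back into the training set.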