Decent intro to natural language processing, but scientifically rubbish. I think...

Yen · on April 20, 2017

Even more egregious than that:

> For quality control, I looked only at comments with Reddit score > 100

That's a non-trivial popularity threshold. Also, since it's an absolute score, it will bias against smaller subreddits, where getting 100 points on any comment is difficult.

This is much less "how people talk on reddit" and much more "the type of comment that gets upvotes on the default subreddits".
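
To make the bias concrete, here is a minimal sketch with invented scores and a hypothetical small subreddit. It compares the absolute cutoff with a per-subreddit relative cutoff; the relative filter is only my illustration of an alternative, not something the article does.

```python
from collections import defaultdict

# Hypothetical (subreddit, score) pairs; the numbers are invented for illustration.
comments = [
    ("AskReddit", 2500), ("AskReddit", 340), ("AskReddit", 120), ("AskReddit", 15),
    ("smallsub", 40), ("smallsub", 22), ("smallsub", 9), ("smallsub", 3),
]

def absolute_filter(rows, cutoff=100):
    """Keep comments whose raw score exceeds a fixed cutoff."""
    return [(sub, score) for sub, score in rows if score > cutoff]

def top_fraction_filter(rows, fraction=0.5):
    """Keep the top `fraction` of comments within each subreddit."""
    by_sub = defaultdict(list)
    for sub, score in rows:
        by_sub[sub].append(score)
    kept = []
    for sub, scores in by_sub.items():
        scores.sort(reverse=True)
        n_keep = max(1, int(len(scores) * fraction))
        kept.extend((sub, s) for s in scores[:n_keep])
    return kept

def subreddits(rows):
    """Sorted list of subreddits that survive a filter."""
    return sorted({sub for sub, _ in rows})

print(subreddits(absolute_filter(comments)))      # ['AskReddit']
print(subreddits(top_fraction_filter(comments)))  # ['AskReddit', 'smallsub']
```

With the absolute cutoff the small subreddit disappears from the sample entirely, even though it has its own well-received comments.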

PascLeRasc · on April 21, 2017

Yikes, that sounds like a great way to bias your data away from controversial opinions about weed. That would be like taking an exit poll of only people wearing lots of political apparel.

Danylon · on April 21, 2017

Using an innocuous encoding of a word is a form of encryption. People who expect to be under surveillance agree on a set of code words to denote illegal things. It is hard, but there are multiple ways to semi-automatically break such linguistic encryption.

Imputation. [1] Remove a word from a sentence, then try to predict it from its surrounding context: "when I get home tonight, i vape a ___ then space out". Assign predicted probabilities to candidate meanings of the imputed word ":leaf emoji:", e.g. ["marijuana cigarette", "electronic cigarette", "cigar"].
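
A rough sketch of the imputation idea, assuming a tiny invented corpus and a plain count-based context model. This is a stand-in for the neural language models benchmarked in [1], not a reproduction of them; `train` and `impute` are names I made up.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real system would train on far more text.
corpus = [
    "i smoke a joint then space out",
    "i smoke a joint then sleep",
    "i smoke a cigar then relax",
    "i vape a joint then space out",
    "i vape a cartridge then go out",
]

def train(sentences):
    """Count which word appears between each (left, right) pair of neighbours."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            table[(tokens[i - 1], tokens[i + 1])][tokens[i]] += 1
    return table

def impute(table, left, right):
    """Return candidate fillers for the gap with their relative frequencies."""
    counts = table.get((left, right), Counter())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

model = train(corpus)

# Gap from a sentence like "i vape a ___ then space out":
# the known words seen in this slot hint at what a code word placed there means.
print(impute(model, "a", "then"))
```

The same lookup works when the gap is occupied by an unknown token such as an emoji: ignore the token, predict from the neighbours, and read off the most probable known words.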

Active learning. Seed the algorithm with expert knowledge from law enforcement, drug users, and social workers, who know of the encryption keys.

Anomaly detection. Though perhaps easily confused with other, innocuous usage, street slang is a distinct form of language with its own properties and patterns. Compared to common discourse, it looks strange and random. This pattern could be measured.
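
One crude way to measure "strange compared to common discourse" is average surprisal under a language model of ordinary text. The sketch below assumes an invented reference corpus and an add-one-smoothed unigram model; both are my simplifications, and a serious attempt would need a far better model.

```python
import math
from collections import Counter

# Invented reference corpus standing in for "common discourse".
reference = [
    "i went to the store to buy some milk",
    "we went to the park after work",
    "i need to buy a gift for my friend",
    "the store was closed after work",
]

counts = Counter(word for line in reference for word in line.split())
total = sum(counts.values())
vocab_size = len(counts) + 1  # one extra slot for unseen words

def surprisal(sentence):
    """Average negative log2 probability per token (add-one-smoothed unigram model)."""
    tokens = sentence.split()
    bits = [-math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens]
    return sum(bits) / len(bits)

ordinary = "i went to the store after work"
slangy = "need a zip of loud before the sesh"  # invented example sentence

# Sentences full of words the reference model has rarely or never seen score higher.
print(surprisal(ordinary), surprisal(slangy))
```

A high score only says the text is unusual, not that it is about drugs, which is the "easily confused with innocuous usage" problem again.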

Doing this rigorously, like building search engines for illegal drugs or human trafficking on the deep web, requires a lot of expert knowledge. [2] Maybe future deep learning can do this end-to-end on arbitrary domains? [3] Let's see.

[1] https://arxiv.org/abs/1312.3005 "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling"

[2] http://www.darpa.mil/program/memex

[3] https://universe.openai.com/envs#world_of_bits

elif · on April 20, 2017

and then there is the next level:

intentional use of slang terms during serious discourse in order to subtly delegitimize your opponent's arguments. There is an Atlanta City Council member currently who, when responding to a question about e.g. medical marijuana, will always change the noun to "pot" or "weed" or even "weed.. or.. pot" with an enunciation implying the concept of medical marijuana is a joke.

Karawebnetwork · on April 20, 2017

Who would agree to legalize the "devil's tobacco"? It's clearly an evil plant! Think of the "reefer madness"!

treehau5 · on April 20, 2017

It's the devil's lettuce, not tobacco. :D

OldSchoolJohnny · on April 20, 2017

Satan's salad?

minimaxir · on April 20, 2017

What may be interesting is to use cosine similarity between the embeddings of these words to see if synonyms can be accurately identified.

A while ago, SpaCy set up a demo doing just that on the Reddit dataset:

https://demos.explosion.ai/sense2vec/?word=cannabis&sense=au...

https://demos.explosion.ai/sense2vec/?word=marijuana&sense=a...
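
For reference, the cosine similarity check itself is only a few lines. This sketch uses hand-made 3-d vectors as stand-ins for learned embeddings (the values are invented, and `nearest` is a hypothetical helper); it is not the sense2vec API.

```python
import math

# Hand-made 3-d vectors standing in for learned embeddings; the values are invented.
embeddings = {
    "marijuana": [0.90, 0.80, 0.10],
    "weed":      [0.85, 0.75, 0.20],
    "pot":       [0.80, 0.90, 0.15],
    "lettuce":   [0.10, 0.30, 0.90],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, k=2):
    """Return the k other words whose vectors are most similar to `word`."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

for other, score in nearest("marijuana"):
    print(f"{other}: {score:.3f}")
```

With real embeddings you would load trained vectors instead of the toy dictionary; whether the top neighbours are true synonyms or just co-occurring words is exactly the open question.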

discreditable · on April 21, 2017

It gets a little fuzzier when you consider that /r/marijuanaenthusiasts is a subreddit for the discussion of trees (and happens to be the subreddit of the day for 4/20).

at-fates-hands · on April 20, 2017

Having studied culture for more than a decade, this makes me feel some validation that even in 2017, computers have yet to unravel the intricacies of culture.

hl5 · on April 20, 2017

In a very limited sense they have, since the ads you end up seeing are targeted towards your culture based on your browsing history.