
Decent intro to natural language processing, but scientifically rubbish. I think using "cannabis" and "marijuana" skews heavily towards advocacy and serious discussion, since general chat about weed will use words such as, well, weed. Or pot, or hash, or herbs, or the maple leaf emoji, or a link to /r/trees. The problem with natural language processing on drugs is that the names people use for drugs are specifically chosen to be easily confused with other, innocuous usage. That's the entire point of street names - to hide the fact that you're talking about drugs. I think you would need some kind of AI leagues ahead of our technology to accurately analyse people colloquially chatting about any drug, let alone one as popular and ubiquitous as weed.



Even more egregious than that:

> For quality control, I looked only at comments with Reddit score > 100

That's a non-trivial popularity score. Also, since it's an absolute score, it will bias against smaller subreddits, where reaching 100 points on any comment is difficult.

This is much less "how people talk on reddit", and much more "the type of comment that gets upvotes on the default subreddits"
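To make the absolute-threshold point concrete, here is a minimal sketch. The subreddit names and scores are made up for illustration, and the per-subreddit "top fraction" filter is only one assumed alternative, not something the article did.

```python
from collections import defaultdict

# Hypothetical (subreddit, score) pairs; made-up numbers, for illustration only.
comments = [
    ("bigdefault", 950), ("bigdefault", 400), ("bigdefault", 120), ("bigdefault", 15),
    ("smallsub", 60), ("smallsub", 25), ("smallsub", 8), ("smallsub", 3),
]


def absolute_filter(rows, min_score=100):
    """Keep comments whose raw score exceeds a fixed cutoff (as in the article)."""
    return [(sub, score) for sub, score in rows if score > min_score]


def relative_filter(rows, top_fraction=0.25):
    """Keep the top fraction of comments within each subreddit (assumed alternative)."""
    by_sub = defaultdict(list)
    for sub, score in rows:
        by_sub[sub].append(score)
    kept = []
    for sub, scores in by_sub.items():
        scores.sort(reverse=True)
        n = max(1, int(len(scores) * top_fraction))
        kept.extend((sub, score) for score in scores[:n])
    return kept


def subs(rows):
    """Sorted list of subreddits that survive a filter."""
    return sorted({sub for sub, _ in rows})


abs_kept = absolute_filter(comments)
rel_kept = relative_filter(comments)
print("absolute cutoff keeps:", subs(abs_kept))
print("relative cutoff keeps:", subs(rel_kept))
```

With these toy numbers the fixed cutoff drops the small subreddit entirely, while the per-subreddit cutoff keeps at least its best comment.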


Yikes, that sounds like a great way to bias your data away from controversial opinions about weed. That would be like taking an exit poll of only people wearing lots of political apparel.


Using an innocuous encoding of a word is a form of encryption. People who expect to be under surveillance agree on a set of code words to denote illegal things. Though it is hard, there are multiple ways to semi-automatically break such linguistic encryption.

Imputation. [1] Remove a word from a sentence, then try to predict it from its surrounding context: "when I get home tonight, i vape a ___ then space out". Assign predicted probabilities to the candidate meanings of the imputed word ":leaf emoji:", e.g. ["marijuana cigarette", "electronic cigarette", "cigar"].

Active learning. Seed the algorithm with expert knowledge from law enforcement, drug users, and social workers, who know the encryption keys.

Anomaly detection. Though perhaps easily confused with other, innocuous usage, street slang is a distinct form of language with its own properties and patterns. Compared to common discourse, it is strange and random. This pattern could be measured.
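The imputation idea can be sketched roughly as follows. This is a toy, count-based stand-in for the statistical language models benchmarked in [1]; the corpus is made up, and a real system would use far more context than one neighbor on each side.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely illustrative.
corpus = [
    "i vape a joint then space out",
    "i vape a joint then relax",
    "i vape a cartridge then space out",
    "i smoke a cigar then go home",
    "i smoke a joint then space out",
]

# Count which word appears between each (left neighbor, right neighbor) pair.
context_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i in range(1, len(tokens) - 1):
        context_counts[(tokens[i - 1], tokens[i + 1])][tokens[i]] += 1


def impute(left, right):
    """Return candidate fillers for 'left ___ right' with their relative frequencies."""
    counts = context_counts.get((left, right), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.most_common()}


# Which words tend to fill "a ___ then" in this corpus?
probs = impute("a", "then")
print(probs)
```

An unknown token such as the leaf emoji could then be compared against the words that usually fill the same slot.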

Doing this rigorously, like building search engines for illegal drugs or human trafficking on the deep web, requires a lot of expert knowledge. [2] Maybe future deep learning can do this end-to-end on arbitrary domains? [3] Let's see.

[1] https://arxiv.org/abs/1312.3005 "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling"

[2] http://www.darpa.mil/program/memex

[3] https://universe.openai.com/envs#world_of_bits


and then there is the next level:

intentional use of slang terms during serious discourse in order to subtly delegitimize your opponent's arguments. There is currently an Atlanta City Council member who, when responding to a question about, e.g., medical marijuana, will always change the noun to "pot" or "weed" or even "weed.. or.. pot", with an enunciation implying that the concept of medical marijuana is a joke.


Who would agree to legalize the "devil's tobacco"? It's clearly an evil plant! Think of the "reefer madness"!


It's the devil's lettuce, not tobacco. :D


Satan's salad?


What may be interesting is to use cosine similarity between the embeddings of these words to see if synonyms can be accurately identified.

A while ago, spaCy set up a demo doing just that on the Reddit dataset:

https://demos.explosion.ai/sense2vec/?word=cannabis&sense=au...

https://demos.explosion.ai/sense2vec/?word=marijuana&sense=a...
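For reference, the cosine-similarity comparison mentioned above looks roughly like this. The three-dimensional vectors are invented for illustration; real embeddings (e.g. from sense2vec) have hundreds of dimensions and would be loaded from a trained model.

```python
import math

# Made-up toy embeddings; not taken from any trained model.
embeddings = {
    "cannabis": [0.9, 0.1, 0.3],
    "weed": [0.8, 0.2, 0.4],
    "lettuce": [0.1, 0.9, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose embedding is most similar to `word`."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(target, embeddings[w]))


print(nearest("cannabis"))
```

Whether the nearest neighbors are true synonyms or just words used in similar contexts is exactly what would need checking on real data.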


It gets a little more fuzzy when you consider that /r/marijuanaenthusiasts is a subreddit for the discussion of trees (and happens to be the subreddit of the day for 4/20).


Having studied culture for more than a decade, I feel some validation that even in 2017, computers have yet to unravel the intricacies of culture.


In a very limited sense they have, since the ads you end up seeing are targeted towards your culture based on your browsing history.



