How people talk about marijuana on Reddit: a natural language analysis

nonsince · on April 20, 2017

Decent intro to natural language processing, but scientifically rubbish. I think using "cannabis" and "marijuana" skews heavily towards advocacy and serious discussion, since general chat about weed will use words such as, well, weed. Or pot, or hash, or herbs, or the maple leaf emoji, or a link to /r/trees. The problem with natural language processing on drugs is that the names people use for drugs are specifically chosen to be easily confused with another, innocuous usage. That's the entire point of street names: to hide the fact that you're talking about drugs. I think you would need some kind of AI leagues ahead of our technology to accurately analyse people colloquially chatting about any drug, let alone one as popular and ubiquitous as weed.

Yen · on April 20, 2017

Even more egregious than that:

> For quality control, I looked only at comments with Reddit score > 100

That's a non-trivial popularity score. Also, since it's an absolute score, it will bias against smaller subreddits, where 100 points on any comment is a difficult task.

This is much less "how people talk on reddit", and much more "the type of comment that gets upvotes on the default subreddits"
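To make the bias concrete, here is a minimal Python sketch comparing an absolute score cutoff with a per-subreddit relative cutoff. The subreddit names and scores are invented toy data, and `top_fraction_per_subreddit` is a hypothetical helper, not something from the article.

```python
from collections import defaultdict

# Hypothetical (subreddit, score) pairs; the numbers are invented for illustration.
comments = [
    ("AskReddit", 5200), ("AskReddit", 340), ("AskReddit", 120), ("AskReddit", 15),
    ("trees", 95), ("trees", 40), ("trees", 12), ("trees", 3),
]

def absolute_filter(rows, threshold=100):
    """Keep comments whose score exceeds a fixed threshold (the article's approach)."""
    return [row for row in rows if row[1] > threshold]

def top_fraction_per_subreddit(rows, fraction=0.5):
    """Keep the top-scoring fraction of comments within each subreddit."""
    by_sub = defaultdict(list)
    for sub, score in rows:
        by_sub[sub].append(score)
    kept = []
    for sub, scores in by_sub.items():
        scores.sort(reverse=True)
        k = max(1, int(len(scores) * fraction))  # always keep at least one comment
        kept.extend((sub, s) for s in scores[:k])
    return kept

def subreddits(rows):
    """Sorted list of the distinct subreddits present in the rows."""
    return sorted({sub for sub, _ in rows})

print(subreddits(absolute_filter(comments)))
print(subreddits(top_fraction_per_subreddit(comments)))
```

With this toy data the absolute cutoff drops the smaller subreddit entirely, while the relative cutoff keeps comments from both.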

PascLeRasc · on April 21, 2017

Yikes, that sounds like a great way to bias your data away from controversial opinions about weed. That would be like taking an exit poll of only people wearing lots of political apparel.

Danylon · on April 21, 2017

Using an innocuous encoding of a word is a form of encryption. People who expect to be under surveillance agree on a set of code words to denote illegal things. Though hard, there are multiple ways to semi-automatically break such a linguistic encryption.

Imputation. [1] Remove a word from a sentence, then try to predict it from its surrounding context: "when I get home tonight, I vape a ___ then space out". Assign predicted probabilities to candidates for the imputed word ":leaf emoji:", e.g. ["marijuana cigarette", "electronic cigarette", "cigar"].

Active learning. Seed the algorithm with expert knowledge from law enforcement, drug users, and social workers, who know of the encryption keys.

Anomaly detection. Though perhaps easily-confused with other, innocuous usage, street slang is a distinct form of language with its own properties and patterns. Compared to common discourse, it is strange and random. This pattern could be measured.
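The imputation idea can be sketched with a toy count-based model in Python. This is only an illustration under strong assumptions: the four-sentence corpus is invented, and the model looks at just one word on each side of the gap, whereas a real system would use a trained language model such as those evaluated in [1].

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real model would be trained on millions of sentences.
corpus = [
    "i vape a joint then space out",
    "i smoke a joint then relax",
    "i smoke a cigar then cough",
    "i vape a joint then sleep",
]

def train(sentences):
    """Map (previous word, next word) to counts of the words seen between them."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens) - 1):
            table[(tokens[i - 1], tokens[i + 1])][tokens[i]] += 1
    return table

def impute(table, sentence, mask="___"):
    """Return a probability for each candidate filler of the masked position."""
    tokens = sentence.split()
    i = tokens.index(mask)
    if i == 0 or i == len(tokens) - 1:
        return {}  # no two-sided context available
    counts = table.get((tokens[i - 1], tokens[i + 1]), Counter())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

table = train(corpus)
print(impute(table, "tonight i vape a ___ then space out"))
```

A code word that keeps landing in slots where the model expects "joint" is a candidate synonym for a human expert to review.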

Doing this rigorously, like building search engines for illegal drugs or human trafficking on the deep web, requires a lot of expert knowledge. [2] Maybe future deep learning can do this end-to-end on arbitrary domains? [3] Let's see.

[1] https://arxiv.org/abs/1312.3005 "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling"

[2] http://www.darpa.mil/program/memex

[3] https://universe.openai.com/envs#world_of_bits

elif · on April 20, 2017

and then there is the next level:

intentional use of slang terms during serious discourse in order to subtly delegitimize your opponent's arguments. There is an Atlanta City Council member currently who, when responding to a question about, e.g., medical marijuana, will always change the noun to "pot" or "weed" or even "weed.. or.. pot" with an enunciation implying the concept of medical marijuana is a joke.

Karawebnetwork · on April 20, 2017

Who would agree to legalize the "devil's tobacco"? It's clearly an evil plant! Think of the "reefer madness"!

treehau5 · on April 20, 2017

It's the devil's lettuce, not tobacco. :D

OldSchoolJohnny · on April 20, 2017

Satan's salad?

minimaxir · on April 20, 2017

What may be interesting is to use cosine similarity between the embeddings of these words to see if synonyms can be accurately identified.

A while ago, spaCy set up a demo doing just that on the Reddit dataset:

https://demos.explosion.ai/sense2vec/?word=cannabis&sense=au...

https://demos.explosion.ai/sense2vec/?word=marijuana&sense=a...
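For anyone who hasn't seen it, cosine similarity is straightforward to compute. Below is a minimal Python sketch; the three-dimensional vectors are made up for illustration (real embeddings such as sense2vec's have hundreds of dimensions), and `most_similar` is a hypothetical helper, not part of spaCy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Made-up toy embeddings, not taken from any trained model.
embeddings = {
    "cannabis": [0.9, 0.1, 0.3],
    "weed": [0.8, 0.2, 0.4],
    "lettuce": [0.1, 0.9, 0.2],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for other, score in most_similar("cannabis", embeddings):
    print(f"{other}: {score:.3f}")
```

Whether slang terms actually rank near "cannabis" depends entirely on the trained embeddings, which is what the demo links above let you check.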

discreditable · on April 21, 2017

It gets a little more fuzzy when you consider that /r/marijuanaenthusiasts is a subreddit for the discussion of trees (and happens to be the subreddit of the day for 4/20).

at-fates-hands · on April 20, 2017

Having studied culture for more than a decade, I feel some validation that even in 2017, computers have yet to unravel the intricacies of culture.

hl5 · on April 20, 2017

In a very limited sense they have since the ads you end up seeing are targeted towards your culture based on your browsing history.

minimaxir · on April 20, 2017

A quick note about using natural language/sentiment APIs: trained machine learning models must be used apples-to-apples on similar datasets; for example, you can’t accurately perform Twitter sentiment analysis on a dataset using a model trained on professional movie reviews, since tweets do not follow AP Style guidelines. (e.g. for some reason, training Python's NLTK on the IMDb movie review dataset to predict the sentiment of Donald Trump's tweets is an oddly common Hello World, even though the results are misleading and may cause confirmation bias)

Reddit comments are very idiosyncratic, and in this particular case, even more so than usual. As a result, I am skeptical of trusting the output of such APIs as gospel, even one trained on massive datasets. (however, training a model on a Reddit-only dataset might be interesting, and is an idea I have in the pipeline.)

Last year, spaCy trained a model, sense2vec, on the Reddit dataset and got interesting results: https://explosion.ai/blog/sense2vec-with-spacy
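One rough way to gauge that kind of domain mismatch before trusting a model is to check how much of the target text falls outside the training vocabulary. The Python sketch below is just one crude proxy, not how any sentiment API works; the example sentences are invented and `oov_rate` is a hypothetical helper.

```python
def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never appear in the training texts."""
    vocab = {tok for text in train_texts for tok in text.lower().split()}
    target_tokens = [tok for text in target_texts for tok in text.lower().split()]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)

# Invented examples standing in for a movie-review corpus and Reddit comments.
reviews = [
    "the film was a moving and beautifully shot drama",
    "the plot was dull and the acting was poor",
]
reddit = [
    "that dank kush was fire tbh",
    "the acting was lit lol",
]

print(f"out-of-vocabulary rate: {oov_rate(reviews, reddit):.2f}")
```

A high rate suggests the model has never seen much of the vocabulary it is being asked to score, although a low rate does not prove the domains match, since words like "fire" or "lit" change meaning between them.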

ppod · on April 20, 2017

This is a complicated problem and is, I think, best thought of as a type of overfitting rather than a complete mistake. The dependent (output) variable, sentiment, does have an obvious generalisation from movies to politicians, unlike, for example, cinematography quality or trustworthiness. You are also overtraining when you test movie sentiment in the 2010s with a model trained on reviews from the 90s, as the concept of sentiment might have shifted if you look at it in that much detail.

(I don't disagree with anything you wrote, just expanding.)

iamacynic · on April 20, 2017

the good news is as long as the training data is known accurate (basically human-prepared), you can use a relatively tiny amount of it for very good results on huge datasets.

kennywinker · on April 20, 2017

Not directly related, but the podcast On The Media recently did a great episode on the origins of the war on drugs. One thing I didn't know? The word "marijuana" was actively popularized during the early days of the war on drugs to make the plant feel like a foreign import, despite the fact that it grew wild throughout the states.

http://www.wnyc.org/story/on-the-media-2017-04-14/

SmellTheGlove · on April 20, 2017

This seems more like a quick intro to some Google BigQuery and NLP capabilities using a keyword that will attract readers. Not a bad thing, but anyone expecting analysis of the topic in the headline should know it's really not about that.

However, it worked on me, I'll probably give these tools a spin in the near future.

rcarrigan87 · on April 20, 2017

I was hoping to see a little more analysis as well. We did a study[1] about people moving for marijuana and we were surprised to find people were fairly open about discussing the topic. But I'd be very curious to see more about how conversations online are forming around marijuana.

[1]https://www.movebuddha.com/blog/moving-for-marijuana/

inuhj · on April 20, 2017

'Marijuana' is considered a racist term in the industry. The preferred word is 'cannabis'.

metaphorm · on April 20, 2017

upvoted for truth. maybe people downvoting aren't familiar with the history.

so here's some history. the name "marijuana" was pushed by Harry Anslinger[0] as a way to trigger racial anxiety amongst conservative whites who held negative views of Mexicans. the other names for the plant being "hemp" (a non-psychoactive strain used as an industrial fiber crop) and "cannabis" (Latin name for the genus of the plant).

[0] - https://en.wikipedia.org/wiki/Harry_J._Anslinger

jat850 · on April 20, 2017

I didn't downvote, but some research shows that this is a controversial truth, even among those in the industry. The history is not controversial but the treatment of marijuana as a racist term is, from what I can see.

logicallee · on April 21, 2017

It has a Mexican name literally IN it. A conservative, white term would be Marijosepha.

maxerickson · on April 21, 2017

> It has a Mexican name literally IN it. A conservative, white term would be Marijosepha.

logicallee, I think you've misunderstood the point made in the gp post. Harry Anslinger wanted it to sound strange and foreign to white conservatives, not familiar.

logicallee · on April 21, 2017

I mean that it has Juan in the name (which I contrast with Joseph). So MariJUANa. Couldn't sound more Mexican if they tried.

jpttsn · on April 20, 2017

Do you have a source for the claim about downvoters' opinions?

strathmeyer · on April 20, 2017

Ganja is the drug harvested from the cannabis plant.

Muuuchem · on April 20, 2017

Funny how marijuana was most mentioned with Donald Trump. I know the /r/trees subreddit thought Trump would be good for recreational marijuana, but it doesn't look like that is true. The Trump subreddit (The_Donald) was also extremely pro-marijuana and often boasted that Trump was better for legal marijuana than Clinton. Then enter Jefferson Beauregard Sessions III.

cavanasm · on April 20, 2017

I don't really understand why people would think Trump would be supportive of any kind of intoxicant. His older brother was an alcoholic and died before their dad (which, depending on the source you're reading, had a big impact on him).

_mhyx · on April 20, 2017

Well, he said a bunch of pro-weed stuff at one point, so it's not coming from nowhere:

https://www.merryjane.com/news/want-marijuana-legalized-then...

OscarCunningham · on April 20, 2017

Many of the biggest benefits of marijuana legalisation come from the fact that it tends to displace alcohol.

code_duck · on April 20, 2017

Many believed Trump to be a libertarian or even a closet Democrat, who supported saner policy for the practical reasons that many do, setting aside their personal beliefs about whether individuals should consume drugs.

ktRolster · on April 20, 2017

More than a closet Democrat, he was an active Democrat who voted Democrat and donated to the party. Overall philosophically he's probably more 'opportunist' than anything.

code_duck · on April 21, 2017

Sure, in the past. If he's a Democrat now or in the past 8 years, it's definitely a well-kept secret.

jpttsn · on April 20, 2017

Trump's media prominence means Trump is mentioned a lot in any topic.

jerf · on April 20, 2017

And I'd note this is not a political point; it's a base-rate comment [1]. Political comments will tend to have other political topics arise in them, and it would require more detailed analysis to know if there is any true signal here. (Not necessarily a lot more detailed... just more than an eyeball glance.)

[1]: https://en.wikipedia.org/wiki/Base_rate
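A quick way to check for signal beyond the base rate is to compare how often the term shows up in on-topic comments against how often it shows up overall. The Python sketch below computes that ratio (often called lift); all the counts are hypothetical, chosen only to illustrate the arithmetic, and are not from the article.

```python
def lift(n_both, n_topic, n_term, n_total):
    """Ratio of P(term | topic) to the base rate P(term).

    A value near 1 means the term is no more common in on-topic comments
    than anywhere else; values well above 1 suggest a real association.
    """
    if n_topic == 0 or n_term == 0 or n_total == 0:
        raise ValueError("counts must be positive")
    p_term_given_topic = n_both / n_topic
    p_term = n_term / n_total
    return p_term_given_topic / p_term

# Hypothetical counts: 10,000 political comments, 2,000 of which mention Trump,
# 500 of which mention marijuana, and 100 of which mention both.
print(lift(n_both=100, n_topic=500, n_term=2000, n_total=10000))
```

In this made-up case Trump appears in 20% of marijuana comments and 20% of all comments, so the co-occurrence carries no signal despite looking large in absolute terms.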

wtf_is_up · on April 21, 2017

> Then enter Jefferson Beauregard Sessions III

People haven't been this obsessed with a politician's middle name since Barack hit the scene almost a decade ago. I'm really glad you felt the need to add it, since I don't think I would've felt the full dramatic effect of your comment otherwise.

abraves10001 · on April 21, 2017

I love the 3 named presidents.

Franklin Delano Roosevelt, Lyndon Baines Johnson, Warren Gamaliel Harding.

gozur88 · on April 21, 2017

Trump can't legalize marijuana. Only Congress can do that. If Sessions is enforcing the law, well, that's his job.

rwmj · on April 20, 2017

Is filtering by score > 100 a good idea? At least you might want to counterbalance that with negative-scoring comments, since people downvote to disagree, and it may be the Reddit audience are more likely to disagree with anyone thinking pot should remain illegal. In fact, why filter on score at all?

cool_shit · on April 21, 2017

Sometimes data is not beautiful, but very ugly. These results are based on a flawed premise. One red flag -- where is the word "dank" in your list? Where are words used by people who actually smoke weed? Also, is "where score > 100" a good heuristic for this kind of study? I would argue that "where score < 100" is a better heuristic.

For example, a shill or superuser (people getting top comment) will not be using domain specific language -- they will be using language that caters to a general audience. If this is true, you would end up squeezing most of the interesting language out of your study. Have you been to Grass City forums? I am guessing these people surely aren't using terms like "Donald Trump" in their everyday conversations about weed.

Reddit is a huge melting pot and probably isn't a good place for insight about potheads. Grass City might not be either -- Grass City users are not typical potheads. The best place would be 10th grade high school social circles and college dorms. It really is amazing how little data is produced by social networks, in the grand scheme of things. We are all so used to hearing about how much data is produced by the internet. There are orders of magnitude more data in the raw world just waiting to be scooped up.