Decent intro to natural language processing, but scientifically rubbish. I think using "cannabis" and "marijuana" skews heavily towards advocacy and serious discussion, since general chat about weed will use words such as, well, weed. Or pot, or hash, or herbs, or the maple leaf emoji, or a link to /r/trees. The problem with natural language processing on drugs is that the names people use for drugs are specifically chosen to be easily confused with another, innocuous usage. That's the entire point of street names - to hide the fact that you're talking about drugs. I think you would need some kind of AI leagues ahead of our technology to accurately analyse people colloquially chatting about any drug, let alone one as popular and ubiquitous as weed.
> For quality control, I looked only at comments with Reddit score > 100
That's a non-trivial popularity score. Also, since it's an absolute score, it will bias against smaller subreddits, where 100 points on any comment is a difficult task.
This is much less "how people talk on reddit", and much more "the type of comment that gets upvotes on the default subreddits".
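To make the bias concrete, here is a minimal sketch comparing the absolute cutoff (score > 100) with a relative, per-subreddit cutoff. The comment records, the field names (`subreddit`, `score`, `body`), and the helper `top_fraction_per_subreddit` are all made up for illustration; they are not the article's schema or method.

```python
from collections import defaultdict

def top_fraction_per_subreddit(comments, fraction=0.1):
    """Keep the top `fraction` of comments by score within each subreddit."""
    by_sub = defaultdict(list)
    for c in comments:
        by_sub[c["subreddit"]].append(c)
    kept = []
    for items in by_sub.values():
        ranked = sorted(items, key=lambda c: c["score"], reverse=True)
        # Always keep at least one comment so tiny subreddits are represented.
        n_keep = max(1, int(len(ranked) * fraction))
        kept.extend(ranked[:n_keep])
    return kept

# Hypothetical data: one large subreddit, one small one.
comments = [
    {"subreddit": "big", "score": 2500, "body": "a"},
    {"subreddit": "big", "score": 150, "body": "b"},
    {"subreddit": "big", "score": 90, "body": "c"},
    {"subreddit": "big", "score": 3, "body": "d"},
    {"subreddit": "small", "score": 40, "body": "e"},
    {"subreddit": "small", "score": 2, "body": "f"},
]

absolute = [c for c in comments if c["score"] > 100]
relative = top_fraction_per_subreddit(comments, fraction=0.5)
```

With the absolute cutoff the small subreddit disappears entirely; with the relative cutoff its best comment survives. Whether a relative cutoff is the right choice depends on what question you are asking.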
Yikes, that sounds like a great way to bias your data away from controversial opinions about weed. That would be like taking an exit poll of only people wearing lots of political apparel.
Using an innocuous encoding of a word is a form of encryption. People who expect to be under surveillance agree on a set of code words to denote illegal things. Though hard, there are multiple ways to semi-automatically break such a linguistic encryption.
Imputation. [1] Remove a word from a sentence then try to predict it from its surrounding context. "when I get home tonight, i vape a ___ then space out". Assign predicted probabilities to imputed word ":leaf emoji:" ["marijuana cigarette", "electronic cigarette", "cigar"].
Active learning. Seed the algorithm with expert knowledge from law enforcement, drug users, and social workers, who know of the encryption keys.
Anomaly detection. Though perhaps easily-confused with other, innocuous usage, street slang is a distinct form of language with its own properties and patterns. Compared to common discourse, it is strange and random. This pattern could be measured.
Doing this rigorously, like building search engines for illegal drugs or human trafficking on the deep web, requires a lot of expert knowledge. [2] Maybe future deep learning can do this end-to-end on arbitrary domains? [3] Let's see.
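As a rough illustration of the imputation idea, here is a toy sketch that predicts a blanked-out word purely from the words immediately to its left and right. The corpus and the function `impute` are invented for this example, and simple neighbour counting stands in for whatever language model a real system would use.

```python
from collections import Counter

def impute(corpus, left, right):
    """Estimate P(word | left neighbour, right neighbour) by counting."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens) - 1):
            if tokens[i - 1] == left and tokens[i + 1] == right:
                counts[tokens[i]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.most_common()}

# Hypothetical reference corpus where the plain word is used openly.
corpus = [
    "i vape a joint then space out",
    "i vape a joint then sleep",
    "i vape a cigar then cough",
    "we smoke a joint then eat",
]

# Candidates for the blank in "... vape a ___ then ...".
probs = impute(corpus, left="a", right="then")
```

An unknown token (say, an emoji) that shows up in the same slot can then be scored against these candidates. A two-word window is far too narrow for real slang, so treat this only as a picture of the mechanism.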
intentional use of slang terms during serious discourse in order to subtly delegitimize your opponent's arguments. There is currently an Atlanta City Council member who, when responding to a question about e.g. medical marijuana, will always change the noun to "pot" or "weed" or even "weed.. or.. pot", with an enunciation implying the concept of medical marijuana is a joke.
It gets a little more fuzzy when you consider that /r/marijuanaenthusiasts is a subreddit for the discussion of trees (and happens to be subreddit of the day for 4/20).
Having studied culture for more than a decade, this makes me feel some validation that even in 2017, computers have yet to unravel the intricacies of culture.
A quick note about using natural language/sentiment APIs: trained machine learning models must be used apples-to-apples on similar datasets; for example, you can’t accurately perform Twitter sentiment analysis on a dataset using a model trained on professional movie reviews since Tweets do not follow AP Style guidelines. (e.g. for some reason, training Python's NLTK on the IMDb movie review dataset to predict the sentiment of Donald Trump's tweets is an oddly common Hello World, even though the results are misleading and may cause confirmation bias)
Reddit comments are very idiosyncratic, and in this particular case, even more so than usual. As a result, I am skeptical of trusting the output of such APIs as gospel, even one trained on massive datasets. (however, training a model on a Reddit-only dataset might be interesting, and is an idea I have in the pipeline.)
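One cheap sanity check for this kind of mismatch is to measure how much of the target text the training data has never seen. The sketch below computes an out-of-vocabulary rate; the function `oov_rate` and the two tiny example texts are made up, and a high rate is only a warning sign, not a measure of how wrong the sentiment scores will be.

```python
def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never appear in the training texts."""
    vocab = {tok for text in train_texts for tok in text.lower().split()}
    target_tokens = [tok for text in target_texts for tok in text.lower().split()]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)

# Hypothetical stand-ins for "professional movie reviews" and "Reddit comments".
reviews = ["the film was a moving and well acted drama"]
reddit = ["that film was dank af lol"]

rate = oov_rate(reviews, reddit)
```

Whitespace tokenisation is a deliberate simplification here; any real comparison would use the same tokeniser as the model being evaluated.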
This is a complicated problem and is, I think, best thought of as a type of overfitting rather than a complete mistake. The dependent (output) variable, sentiment, does have an obvious generalisation from movies to politicians, unlike, for example, cinematography quality or trustworthiness. You are also overfitting when you test movie sentiment in the 2010s with a model trained on reviews from the 90s, as the concept of sentiment might have shifted if you look at it in that much detail.
(I don't disagree with anything you wrote, just expanding.)
the good news is as long as the training data is known accurate (basically human-prepared), you can use a relatively tiny amount of it for very good results on huge datasets.
Not directly related, but the podcast On The Media recently did a great episode on the origins of the war on drugs. One thing I didn't know? The word "marijuana" was actively popularized during the early days of the war on drugs to make the plant feel like a foreign import, despite the fact that it grew wild throughout the states.
This seems more like a quick intro to some Google BigQuery and NLP capabilities using a keyword that will attract readers. Not a bad thing, but anyone expecting analysis of the topic in the headline should know it's really not about that.
However, it worked on me, I'll probably give these tools a spin in the near future.
I was hoping to see a little more analysis as well. We did a study[1] about people moving for marijuana and we were surprised to find people were fairly open about discussing the topic. But I'd be very curious to see more about how conversations online are forming around marijuana.
upvoted for truth. maybe people downvoting aren't familiar with the history.
so here's some history. the name "marijuana" was pushed by Harry Anslinger[0] as a way to trigger racial anxiety amongst conservative whites who held negative views of Mexicans. the other names for the plant being "hemp" (a non-psychoactive strain used as an industrial fiber crop) and "cannabis" (Latin name for the genus of the plant).
I didn't downvote, but some research shows that this is a controversial truth, even among those in the industry. The history is not controversial but the treatment of marijuana as a racist term is, from what I can see.
It has a Mexican name literally IN it. A conservative, white term would be Marijosepha.
logicallee, I think you've misunderstood the point made in the gp post. Harry Anslinger wanted it to sound strange and foreign to white conservatives, not familiar.
Funny how marijuana was most mentioned with Donald Trump. I know the /r/trees subreddit thought Trump would be good for recreational marijuana, but it doesn't look like that is true. The Trump subreddit (The_Donald) was also extremely pro-marijuana and often boasted that Trump was better for legal marijuana than Clinton. Then enter Jefferson Beauregard Sessions III.
I don't really understand why people would think Trump would be supportive of any kind of intoxicant. His older brother was an alcoholic and died before their dad (which, depending on the source you're reading, had a big impact on him).
Many believed Trump to be a libertarian or even a closet Democrat, who supported saner policy for the practical reasons that many do, setting aside their personal beliefs about whether individuals should consume drugs.
More than a closet Democrat, he was an active Democrat who voted Democrat and donated to the party. Overall philosophically he's probably more 'opportunist' than anything.
And I'd note this is not a political point; it's a base-rate comment [1]. Political comments will tend to have other political topics arise in them, and it would require more detailed analysis to know if there is any true signal here. (Not necessarily a lot more detailed... just more than an eyeball glance.)
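A quick way to check for a base-rate effect is to compare how often a term appears in marijuana comments with how often it appears in comments overall. The sketch below computes that ratio (often called lift); the function name and every count are invented for illustration and do not come from the article's data.

```python
def lift(n_both, n_topic, n_term, n_total):
    """P(term | topic) divided by P(term); 1.0 means no association."""
    p_term_given_topic = n_both / n_topic
    p_term = n_term / n_total
    return p_term_given_topic / p_term

# Hypothetical counts over a pool of political comments.
n_total = 100_000   # all comments in the pool
n_term = 20_000     # comments mentioning "Trump"
n_topic = 1_000     # comments mentioning marijuana
n_both = 200        # comments mentioning both

ratio = lift(n_both, n_topic, n_term, n_total)
```

With these made-up numbers, 20% of marijuana comments mention Trump, but so do 20% of all comments, so the ratio comes out at 1.0 and the co-occurrence carries no signal. A ratio well above 1 would be the starting point for a closer look, not proof of a link.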
People haven't been this obsessed with a politician's middle name since Barack hit the scene almost a decade ago. I'm really glad you felt the need to add it, since I don't think I would've felt the full dramatic effect of your comment otherwise.
Is filtering by score > 100 a good idea? At least you might want to counterbalance that with negative-scoring comments, since people downvote to disagree, and it may be the Reddit audience are more likely to disagree with anyone thinking pot should remain illegal. In fact, why filter on score at all?
Sometimes data is not beautiful, but very ugly. These results are based on a flawed premise. One red flag -- where is the word "dank" in your list? Where are words used by people who actually smoke weed? Also, is "where score > 100" a good heuristic for this kind of study? I would argue that "where score < 100" is a better heuristic.
For example, a shill or superuser (people getting top comment) will not be using domain specific language -- they will be using language that caters to a general audience. If this is true, you would end up squeezing most of the interesting language out of your study. Have you been to Grass City forums? I am guessing these people surely aren't using terms like "Donald Trump" in their everyday conversations about weed.
Reddit is a huge melting pot and probably isn't a good place for insight about potheads. Grass City might not be either -- Grass City users are not typical potheads. The best place would be 10th grade high school social circles and college dorms. It really is amazing how little data is produced by social networks, in the grand scheme of things. We are all so used to hearing about how much data is produced by the internet. There are orders of magnitude more data in the raw world just waiting to be scooped up.