Nice read. I did something sort of similar with the same dataset about a year ago. I compared LDA (Latent Dirichlet Allocation) to TF-IDF as tools to find similar beers based on their review text. Lots of intuitive and funny topics discovered.
I suggest you play with LDA; it seemed to work really well at generating topics. There is also a lot of fascinating, very readable research using it. Check out SNAP's work on the same dataset [1] and some of the Yelp Dataset Challenge winners [2]. If you end up trying it, Gensim [3] was pleasant enough to work with.
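For anyone curious what the TF-IDF side of that comparison looks like, here is a minimal stdlib-only sketch, not the code I actually used. The review strings are made up for illustration, and the function names (`tfidf_vectors`, `cosine`, `most_similar`) are my own; the LDA side would need a topic-modelling library such as Gensim and is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Plain (unsmoothed) TF-IDF: a term found in every document gets weight 0.
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, vectors):
    """Index of the document closest to vectors[query_index], excluding itself."""
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]


# Made-up review text, one string per beer (illustrative only).
reviews = [
    "piney hoppy bitter citrus grapefruit finish",
    "roasty chocolate coffee dark malt",
    "hoppy citrus pine resin bitter",
    "coffee chocolate roasty smooth creamy",
]
vectors = tfidf_vectors(reviews)
print(most_similar(0, vectors))
```

With real data you would concatenate all reviews for a beer into one document first. TF-IDF only matches on shared words, which is where a topic model like LDA can do better: it can group "piney" and "pine" style vocabulary into one topic even when the exact tokens differ.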
Great post! I've been thinking about writing something similar with that same BeerAdvocate data. Good job beating me to it :)
Instead, I ended up writing a satirical beer snob bot [1] which tweets nonsensical beer reviews using Markov chains. Some are bad, but some are pure gold. You can read about it here [2]. The code's also on GitHub [3].
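If you haven't seen the Markov chain trick before, the idea fits in a few lines. This is a rough sketch of the general technique, not the bot's actual code: the sample sentences are invented, and `build_chain` and `generate` are names I made up for this example.

```python
import random
from collections import defaultdict


def build_chain(texts, order=1):
    """Map each run of `order` words to the list of words seen right after it."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain


def generate(chain, max_words=20, seed=None):
    """Random-walk the chain from a random state until a dead end or max_words."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    output = list(state)
    while len(output) < max_words:
        followers = chain.get(state)
        if not followers:
            break  # this state was only ever seen at the end of a text
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Invented sample reviews (illustrative only).
sample_reviews = [
    "pours a hazy gold with a thin head",
    "pours a deep amber with a thick head",
    "smells of pine with a hint of caramel",
]
chain = build_chain(sample_reviews)
print(generate(chain, max_words=12))
```

Duplicates are kept in the follower lists on purpose, so common continuations get picked more often. Raising `order` to 2 makes the output more coherent but, on a small corpus, also more likely to reproduce a source review word for word.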
Cool stuff, followed! Feel free to steal any parts of my work you think may improve it. It might be cool to be able to control the polarity of the review you're tweeting.
For anyone interested in beer and data science, my startup [1] uses machine learning and artificial intelligence to build flavor profiling and quality control tools for craft beverage producers.
Our models flag and predict flaws, taints, contaminations, and batch-to-batch deviations in real time from human sensory data. We then leverage our clients' quality control data for flavor profile optimization, demographic targeting, and cognitive marketing, helping them sell consistently better products to their most valuable consumers.
If the Author of the Software (the "Author") needs a place to crash and you have a sofa available, you should maybe give the Author a break and let him sleep on your couch.
If you are caught in a dire situation wherein you only have enough time to save one person out of a group, and the Author is a member of that group, you must save the Author.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO BLAH BLAH BLAH ISN'T IT FUNNY HOW UPPER-CASE MAKES IT SOUND LIKE THE LICENSE IS ANGRY AND SHOUTING AT YOU.
[1] http://snap.stanford.edu/data/web-BeerAdvocate.html
[2] http://www.yelp.com/dataset_challenge
[3] https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-a...