Data from Yelp's Dataset Challenge

glaugh · on Aug 28, 2014

Some notes about the data, and in particular differences between how it's presented here and its raw form via Yelp:

1. Businesses can be in multiple neighborhoods in the original dataset. In this version each business is in only one (the most common of the neighborhoods the business was listed in). There are some nice presentation and analysis advantages to this.

2. We dropped categories with fewer than 50 businesses in them because of some limitations of Statwing (it slowed us down a lot without much benefit, for reasons I'm happy to explain but that are pretty boring).

3. Instead of taking the number of stars typically presented on a business (1.0, 1.5, 2.0, etc.), we grabbed an average from Yelp's dataset of reviews for each of these businesses, so you end up having businesses with ratings like 1.37 or 3.22. There are spikes at 1, 1.5, 2, etc. because of businesses with very few reviews, so filtering to only include businesses with >25 reviews is pretty handy.

4. This is only one of several datasets Yelp provides (one for businesses, one for reviews, one for users, etc.): http://www.yelp.com/dataset_challenge
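
As a rough illustration of note 3, here is a minimal sketch of averaging review stars per business and then keeping only businesses with more than 25 reviews. It assumes the review file is one JSON object per line with `business_id` and `stars` fields; the function name `average_stars` is made up for this example and is not how Statwing actually built the dataset.

```python
import json
from collections import defaultdict


def average_stars(review_lines, min_reviews=25):
    """Return {business_id: (mean_stars, n_reviews)} for businesses
    that have more than `min_reviews` reviews.

    Assumes each line is a JSON object with "business_id" and "stars".
    """
    totals = defaultdict(lambda: [0.0, 0])  # business_id -> [sum of stars, count]
    for line in review_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        entry = totals[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {
        biz: (star_sum / count, count)
        for biz, (star_sum, count) in totals.items()
        if count > min_reviews
    }
```

Any iterable of lines works as input, so an open file handle for the review file can be passed in directly. The strict `>` matches the ">25 reviews" filter mentioned above.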

Final note is that we're of course always interested in feedback, so have at it.

minimaxir · on Aug 28, 2014

> 3. Instead of taking the number of stars typically presented on a business (1.0, 1.5, 2.0, etc.), we grabbed an average from Yelp's dataset of reviews for each of these businesses, so you end up having businesses with ratings like 1.37 or 3.22.

I don't believe that derivation is equivalent.

From my own tabulation of the data:

Number of reviews in Yelp's reviews dataset: 1,125,458 reviews

Sum of review counts across all businesses in Yelp's business dataset: 1,236,445 reviews

So the aggregate will fail to account for about 10% of the rating data.
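
For reference, the arithmetic behind that estimate can be checked from the two counts quoted above. The variable names are just labels for this example; the exact share comes out a little under the rounded "about 10%".

```python
# Counts quoted in the comment above.
reviews_in_review_file = 1_125_458   # rows in the reviews dataset
review_counts_summed = 1_236_445     # sum of per-business review counts

# Reviews counted on businesses but absent from the reviews dataset.
missing = review_counts_summed - reviews_in_review_file
share_missing = missing / review_counts_summed

print(missing, f"{share_missing:.1%}")
```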

glaugh · on Aug 28, 2014

There's definitely some inconsistency here.

An even larger issue is probably that the way Yelp calculates ratings for a business isn't a straight average; it involves a notion of a prior expectation. I'd go into more detail here, but I'm struggling to find the (I think official?) URL talking about this.
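
To make "a notion of a prior expectation" concrete, here is a generic sketch of a prior-weighted average, where a business's rating is pulled toward a prior mean until it has enough reviews. This is not Yelp's actual formula, which isn't given in this thread; the function name `shrunk_rating` and the `prior_mean` and `prior_weight` defaults are arbitrary assumptions for illustration.

```python
def shrunk_rating(star_sum, n_reviews, prior_mean=3.5, prior_weight=10):
    """Prior-weighted average of review stars.

    Behaves as if the business started with `prior_weight` phantom reviews
    at `prior_mean` stars. Both defaults are illustrative assumptions,
    not values used by Yelp.
    """
    return (prior_mean * prior_weight + star_sum) / (prior_weight + n_reviews)
```

With no reviews the result is just the prior mean, and as the review count grows it approaches the straight average, which is why a rating computed this way can differ from a plain mean of the reviews.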

thalesfc · on Aug 28, 2014

Wow, what a fantastic tool. I liked it.

minimaxir · on Aug 28, 2014

This is explicitly against Yelp's Terms of Use for the challenge dataset. Any redistribution of the raw data is disallowed.

Source: https://news.ycombinator.com/item?id=8121730

glaugh · on Aug 28, 2014

We have authorization from Yelp representatives to show their data in this fashion.