Some notes about the data, and in particular differences between how it's presented here and its raw form via Yelp:
1. Businesses can be in multiple neighborhoods in the original dataset. In this version each business is in only one (the most common of the neighborhoods it was listed in). There are some nice presentation and analysis advantages to this.
2. We dropped categories with fewer than 50 businesses in them because of some limitations of Statwing (it slowed us down a lot without much benefit, for reasons I'm happy to explain but that are pretty boring).
3. Instead of taking the number of stars typically presented on a business (1.0, 1.5, 2.0, etc.), we grabbed an average from Yelp's dataset of reviews for each of these businesses, so you end up having businesses with ratings like 1.37 or 3.22. There are spikes at 1, 1.5, 2, etc. because of businesses with very few reviews, so filtering to only include businesses with >25 reviews is pretty handy.
4. This is only one of several datasets Yelp provides (one for each business, one for each review, one for each user, etc.)
http://www.yelp.com/dataset_challenge
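For anyone who wants to reproduce points 1 and 3 from the raw files, here is a rough sketch in plain Python. It is only my reading of the description above, not the actual pipeline: the record shapes, the sample values, and the names `primary_neighborhood` and `mean_stars` are all made up for illustration, and "most common" is assumed to mean most frequent across the whole dataset.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical sample data; the real Yelp files are JSON with more fields.
businesses = {
    "a": ["North", "East"],
    "b": ["North"],
    "c": ["East", "North"],
}
reviews = [("a", 5), ("a", 4), ("a", 4), ("b", 1), ("c", 3), ("c", 2)]


def primary_neighborhood(businesses):
    """Keep one neighborhood per business: the listed one that is most
    frequent across all businesses (an assumed reading of point 1)."""
    counts = Counter(n for hoods in businesses.values() for n in hoods)
    return {
        b: max(hoods, key=lambda n: counts[n])
        for b, hoods in businesses.items()
        if hoods
    }


def mean_stars(reviews, more_than=25):
    """Average the per-review stars for each business, keeping only
    businesses with strictly more than `more_than` reviews."""
    by_business = defaultdict(list)
    for business_id, stars in reviews:
        by_business[business_id].append(stars)
    return {
        b: mean(s) for b, s in by_business.items() if len(s) > more_than
    }


print(primary_neighborhood(businesses))
print(mean_stars(reviews, more_than=1))
```

With the default `more_than=25`, the tiny sample above returns nothing, which is the point of the filter: businesses with one or two reviews are the ones producing the spikes at 1, 1.5, 2, etc.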
Final note is that we're of course always interested in feedback, so have at it.
> 3. Instead of taking the number of stars typically presented on a business (1.0, 1.5, 2.0, etc.), we grabbed an average from Yelp's dataset of reviews for each of these businesses, so you end up having businesses with ratings like 1.37 or 3.22.
I don't believe that derivation is equivalent.
From my own tabulation of the data:
# of reviews in Yelp's reviews dataset: 1,125,458 reviews
# of reviews summed over the review counts of all businesses in Yelp's business dataset: 1,236,445 reviews
So the aggregate will fail to account for about 9% of the rating data (110,987 reviews).
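The gap between the two counts can be checked directly. The two totals are the ones from my tabulation above; the variable names are just for illustration.

```python
# Totals taken from the tabulation above.
reviews_in_review_file = 1_125_458
review_counts_in_business_file = 1_236_445

# Reviews counted on business records but absent from the review file.
missing = review_counts_in_business_file - reviews_in_review_file
share = missing / review_counts_in_business_file

print(f"{missing:,} reviews missing ({share:.1%})")
# prints "110,987 reviews missing (9.0%)"
```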
An even larger issue is probably that the way Yelp calculates ratings for a business isn't a straight average; it involves a notion of a prior expectation. I'd go into more detail here, but I'm struggling to find the (I think official?) URL talking about this.
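To show what I mean by a prior expectation, here is a generic shrinkage average. To be clear, this is not Yelp's formula (I don't have it); `prior_mean` and `prior_weight` are made-up values, and the function name is hypothetical. The idea is only that a business with few reviews gets pulled toward an expected rating, so it won't match a straight average.

```python
def shrunk_rating(stars, prior_mean=3.5, prior_weight=5):
    """Average that behaves as if `prior_weight` extra reviews at
    `prior_mean` stars had been seen. Both defaults are assumptions."""
    total = prior_mean * prior_weight + sum(stars)
    return total / (prior_weight + len(stars))


print(shrunk_rating([5]))        # one 5-star review is pulled toward 3.5
print(shrunk_rating([5] * 100))  # many reviews dominate the prior
```

With one 5-star review the straight average is 5.0 but the shrunk value is 3.75; with a hundred of them the two nearly agree.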