Data from Yelp's Dataset Challenge (statwing.com)
56 points by glaugh on Aug 28, 2014 | 6 comments



Some notes about the data, and in particular differences between how it's presented here and its raw form via Yelp:

1. Businesses can be in multiple neighborhoods in the original dataset. In this version businesses can only be in one (the more common of the neighborhoods the business was listed in). There are some nice presentation and analysis advantages to this.

2. We dropped categories with fewer than 50 businesses in them because of some limitations of Statwing (it slowed us down a lot without much benefit, for reasons I'm happy to explain but that are pretty boring).

3. Instead of taking the number of stars typically presented on a business (1.0, 1.5, 2.0, etc.), we grabbed an average from Yelp's dataset of reviews for each of these businesses, so you end up having businesses with ratings like 1.37 or 3.22. There are spikes at 1, 1.5, 2, etc. because of businesses with very few reviews, so filtering to only include businesses with >25 reviews is pretty handy.

4. This is only one of several datasets Yelp provides (one for businesses, one for reviews, one for users, etc.): http://www.yelp.com/dataset_challenge
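
A rough sketch of how notes 1 to 3 could be reproduced from the raw files, in Python. This is not Statwing's actual pipeline. The field names (`business_id`, `neighborhoods`, `categories`, `stars`) and all function names are assumptions for illustration, and "more common" in note 1 is read here as "appears most often across the whole dataset".

```python
from collections import Counter, defaultdict


def pick_primary_neighborhoods(businesses):
    """Map each business_id to a single neighborhood: the one that appears
    most often across the whole dataset (None if the business lists none)."""
    counts = Counter(n for b in businesses for n in b["neighborhoods"])
    return {
        b["business_id"]: max(
            b["neighborhoods"], key=lambda n: counts[n], default=None
        )
        for b in businesses
    }


def common_categories(businesses, min_businesses=50):
    """Return the categories that contain at least `min_businesses` businesses."""
    # set() guards against a category being listed twice on one business.
    counts = Counter(c for b in businesses for c in set(b["categories"]))
    return {c for c, n in counts.items() if n >= min_businesses}


def mean_stars(reviews, more_than=25):
    """Average the review stars per business, keeping only businesses
    with more than `more_than` reviews."""
    stars = defaultdict(list)
    for r in reviews:
        stars[r["business_id"]].append(r["stars"])
    return {
        bid: sum(s) / len(s)
        for bid, s in stars.items()
        if len(s) > more_than
    }
```

Ties in `pick_primary_neighborhoods` go to whichever neighborhood is listed first on the business, which is an arbitrary choice.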

Final note is that we're of course always interested in feedback, so have at it.


> 3. Instead of taking the number of stars typically presented on a business (1.0, 1.5, 2.0, etc.), we grabbed an average from Yelp's dataset of reviews for each of these businesses, so you end up having businesses with ratings like 1.37 or 3.22.

I don't believe that derivation is equivalent.

From my own tabulation of the data:

Number of reviews in Yelp's reviews dataset: 1,125,458

Sum of review counts across all businesses in Yelp's business dataset: 1,236,445

So the aggregate will fail to account for about 9% of the rating data (110,987 of 1,236,445 reviews).
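
For anyone who wants to check the arithmetic, the gap between the two totals works out as follows. The variable names are illustrative; the two counts are the ones quoted above.

```python
# Counts quoted in the comment above.
reviews_in_review_file = 1_125_458
sum_of_business_review_counts = 1_236_445

# Reviews counted on business records but absent from the reviews file.
missing = sum_of_business_review_counts - reviews_in_review_file
missing_share = missing / sum_of_business_review_counts

print(f"{missing:,} reviews missing ({missing_share:.1%})")
# prints: 110,987 reviews missing (9.0%)
```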


There's definitely some inconsistency here.

An even larger issue is probably that the way Yelp calculates ratings for a business isn't a straight average; it involves a notion of a prior expectation. I'd go into more detail here but I'm struggling to find the (I think official?) URL talking about this.
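
To illustrate what a rating with a "prior expectation" can look like in general, here is a minimal sketch of a Bayesian-style average in Python. This is a generic textbook technique, not Yelp's formula, which isn't documented in this thread; `prior_mean` and `prior_weight` are hypothetical parameters.

```python
def bayesian_average(stars, prior_mean, prior_weight):
    """Blend the observed star ratings with a prior.

    The prior behaves like `prior_weight` imaginary reviews, each rated
    `prior_mean`, so businesses with few reviews are pulled toward it.
    """
    return (prior_mean * prior_weight + sum(stars)) / (prior_weight + len(stars))


# Two 5-star reviews against a prior worth two 3-star reviews.
print(bayesian_average([5, 5], prior_mean=3.0, prior_weight=2))  # 4.0
```

With no reviews at all the function returns `prior_mean`, and as the review count grows the result approaches the straight average.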


Wow, what a fantastic tool. I liked it.


This is explicitly against Yelp's Terms of Use for the challenge dataset. Any redistribution of the raw data is disallowed.

Source: https://news.ycombinator.com/item?id=8121730


We have authorization from Yelp representatives to show their data in this fashion.



