If you want more insight into the data processing (and code which didn't work out), I strongly recommend looking at the R Notebook for the post: http://minimaxir.com/notebooks/amazon-spark/
R Notebooks have been a tremendous help for my workflows. (I do have a post planned to illustrate their many advantages over Jupyter Notebooks)
This is really cool. Looking forward to the R Notebook vs. Jupyter shootout.
Question: under "Distribution of average scores", I notice both distributions have a trend of oscillating up/down on every other bar. Is that a binning artifact, or somehow inherent in the Amazon rating system? With counts of O(1e5) I was expecting much smoother histograms.
Keep in mind the cumulative distribution of reviews. With a minimum of 5 reviews, the average score only has about one decimal place of meaningful precision, which is why the bins are one decimal place wide as well. It also makes the chart more readable. (Binning at two decimal places would produce up to 10x as many columns, with potential gaps where no products fall in a bin.)
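To illustrate, here's a minimal dplyr sketch of that binning (the `reviews` data frame and the `asin`/`overall` column names are assumptions, not the post's actual code):

    library(dplyr)

    avg_scores <- reviews %>%
      group_by(asin) %>%
      summarize(n_reviews = n(), avg_score = mean(overall)) %>%
      filter(n_reviews >= 5) %>%                 # 5-review minimum, as in the post
      mutate(avg_bin = round(avg_score, 1))      # bin to one decimal place

Rounding to two decimals instead would multiply the number of bars by up to 10x, many of them empty.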
I haven't used it yet but I'm sure it's great!! (Five stars)
or
Looks neat! (Four stars)
or
GOT THIS LAPTOP FOR MY GIRLFRIEND AND SHE COULDNT FIGURE OUT HOW TO USE IT. I CANNOT BELIEVE THIS. FOR $200 I SHOULD GET A TOP-END MACHINE NOT THIS TRASH. I AM RETURNING. (ONE STAR)
It's difficult to sell an item when you have no ratings at all or very few ratings while a competitor has hundreds, so I can understand why they give products away for ratings. I think Amazon is looking into marking those reviews more clearly.
When I am looking for something, I go by the number of reviews too, so I really can't blame them.
Personally, I just ignore those reviews. Reviews in the 2-4 star range are more useful anyway.
I imagine the author used this as an excuse to play around with Spark. If it were me doing this for work, yeah I'd drop this in Postgres. Most of these analyses would be short SQL queries.
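For instance, a hedged sketch of what that would look like from R (the connection details and the `ratings` table/column names are assumptions):

    library(DBI)
    library(RPostgres)

    # Hypothetical Postgres database holding the ratings-only dataset
    con <- dbConnect(RPostgres::Postgres(), dbname = "amazon")

    # Review count and average score per category, for categories with >= 5 reviews
    dbGetQuery(con, "
      SELECT category,
             COUNT(*)                         AS n_reviews,
             ROUND(AVG(overall)::numeric, 2)  AS avg_score
      FROM ratings
      GROUP BY category
      HAVING COUNT(*) >= 5
      ORDER BY avg_score DESC;
    ")

    dbDisconnect(con)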
A bit of both. I wanted an excuse to test out Spark to find the kinks which were omitted from the documentation (and boy did I find kinks), and also to provide a practical demo.
Essentially the usual Spark caveats of lazy evaluation and cache immutability: neither is a big deal on small datasets, but making a mistake with either on a large dataset can result in a lot of lost time or confusion.
Then there are the massive shuffle reads/writes that result in 50GB of I/O, which is not great for SSDs.
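To make those two caveats concrete, here's a rough sparklyr sketch (the `reviews_df` data frame and column names are assumptions, not the post's actual code):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    reviews_tbl <- copy_to(sc, reviews_df, "reviews")

    # Lazy evaluation: this builds a query plan but runs nothing on the cluster...
    by_category <- reviews_tbl %>%
      group_by(category) %>%
      summarize(avg_score = mean(overall))

    # ...until you force it with collect() (or compute()), which is easy to forget
    # when timing or debugging a step.
    result <- collect(by_category)

    # Caching: tbl_cache() materializes the table in memory as-is; it won't track
    # later changes to the source, so a stale cache can silently skew results.
    tbl_cache(sc, "reviews")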
Let's say I tried to load up Postgres and a dataset on my (fairly powerful) laptop to run queries. How many records could I get up to? 100M? 1B? Say 16GB of RAM.
One would need to know the size of the record. This is an exercise you'll often do if you're doing capacity analysis / growth analysis for planning (or in conjunction with FP&A).
1,000,000 4KB records take up, as you'd guess, 4GB of RAM. You can obviously go well beyond your allocation of RAM and still have the database perform, but you'll find you're now bottlenecked on the speed of I/O from your SSD/HDD. Throughput will quickly decrease, queries will run slower, etc.
This is why you'll often find that DB benchmarks that never exceed RAM can be "false" comparisons if the expected workload will always exceed available memory, and data not resident in memory will need to be loaded.
So, to truly answer your question, it's actually less a question of RAM and more a question of disk. The 2015 ACS (American Community Survey) plus some geographic data is around 100GB, and I comfortably run analyses against it on my wimpy 2015 MacBook (8GB RAM, 1.3GHz Core M).
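A back-of-envelope version of that math (the record size and disk size here are assumptions):

    record_kb <- 4        # assumed average record size
    ram_gb    <- 16
    disk_gb   <- 500      # hypothetical SSD capacity

    ram_gb  * 1024^2 / record_kb    # ~4.2M records fit entirely in RAM
    disk_gb * 1024^2 / record_kb    # ~131M records of this size before the SSD is full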
You can load up a billion records with half that RAM, but the most important part is the type of queries you want to run. In most cases even the most complex SELECT queries are okay (as long as you're not running 100+ in parallel). If you are running a lot of inserts/updates, you'll probably run into issues, but SELECTs are fairly trivial as long as you have the space.
Spark has a number of features and constructs that can make it very powerful to work with, even on "small" data sets. Big data isn't just measured by size, it's also measured by computational complexity. 80,000,000 rows is massive if the operation you're performing against it is O(N^2), as an example.
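A quick scale check of that point:

    n <- 8e7      # 80,000,000 rows
    n             # one linear pass: 8e7 operations
    n^2           # an O(N^2) pass: 6.4e15 operations, roughly 74 days at 1e9 ops/sec on one core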
Very cool to see R playing nicely with Spark via the sparklyr package. The new flexdashboard feature out of the knitr/R Markdown toolchain is awesome. The R/RStudio team definitely knows what they're doing, and I'm very excited to see what's next for the data science community.
Shiny is slowly expanding its dashboarding capabilities. I have rolled out dashboards with hundreds of thousands of rows which don't need any explicit pagination and can be used simultaneously by dozens of internal customers. All this spun up in a matter of hours.
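As a rough illustration of how little code that kind of dashboard takes (the `reviews_df` data frame is an assumption; this is not the dashboard described above):

    library(shiny)
    library(DT)

    ui <- fluidPage(
      titlePanel("Review explorer"),
      DTOutput("reviews")
    )

    server <- function(input, output) {
      # DT's default server-side processing keeps the browser responsive
      # even with hundreds of thousands of rows.
      output$reviews <- renderDT(reviews_df, options = list(pageLength = 25))
    }

    shinyApp(ui, server)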
Can one use this data set for commercial purposes? May sound like a silly question, and the answer may be no, but this sort of data would be very useful to build something cool.
"I wrote a simple Python script to combine the per-category ratings-only data from the Amazon product reviews dataset curated by Julian McAuley, Rahul Pandey, and Jure Leskovec for their 2015 paper Inferring Networks of Substitutable and Complementary Products"
Just make a request. When I did so, I heard back within a few hours and the only requirement was that I cite their work, which is entirely reasonable. Similar datasets are more likely to be created if people are at least mentioned for their efforts.
We've used this dataset to build a product review classification pipeline as an example application that can be developed using our project, KeystoneML (which runs on spark) - code is here: https://github.com/amplab/keystone/blob/master/src/main/scal...
It would be really interesting to see the same analysis for verified reviews only, and contrast it with the overall numbers and non-verified reviews. I would actually want to read that more than this (which was still interesting).
Do you mean reviews done by verified purchasers? I believe I have read that firms buying reviews get around that qualifier by giving the reviewer money/gift cards to purchase the item, then write a positive review. They could easily set it up so that the reviewer has to send back the sample item, too, to reduce the cost.
Exactly this. Amazon claims they're fixing it, but I don't think that'll stop the problem. If anything, it'll push reviewers to post reviews without any notice that they're sponsored or potentially biased.
Does make you wonder what value the ratings have. Given that most of the ratings are 4-5, you would think most products on Amazon are wonderful. It also makes you wonder how many reviewers are real users and how many are paid.
I've noticed on a few applications/sites I've made...
Star ratings tend to bias high, but ratings attached to reviews tend to bias low. If you're happy with the product, you just give it a 5; if you're unhappy, you give it a 1 and a review, but that takes more effort, so the happy seems to outweigh the unhappy. Just my own theory, but it's played out in a few places...
What kind of API did you use to pull all the reviews out of Amazon? As far as I can see, they have blocked returning reviews through the API and nowadays only provide them via an iframe.
How did you curate the list of each and every product on Amazon?