Quite frankly, for data of this scale, PostgreSQL is more than adequate on a modern machine (although the OP's step-by-step guide is surely helpful for those new to Redshift).
I'm not sure I would want to try analyzing a 267 GB database on most modern machines. Redshift, on the other hand, is optimized quite well for rapidly querying flat data sets like this.
That said, since Redshift is PostgreSQL 8.0.2 under the hood, any code you write for analysis should be easy to move over (with the exception of having to add dist and sort keys).
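For anyone who hasn't used Redshift, here's a minimal sketch of what that porting step looks like (the table and columns are made up for illustration; drop the last two clauses and the same DDL runs on plain PostgreSQL):

    -- Hypothetical table, just to illustrate the port. The only
    -- Redshift-specific parts are the key clauses at the end.
    CREATE TABLE trips (
        trip_id     BIGINT,
        pickup_at   TIMESTAMP,
        fare_amount DECIMAL(8,2)
    )
    DISTKEY (trip_id)    -- controls which node each row lives on
    SORTKEY (pickup_at); -- controls on-disk ordering within each node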
Could not agree more. Production data needs to be bigger than a petabyte or three before my default response is anything other than 'let's load it into RAM.' :-)
Well shucks, this is why I try not to HN at 3am on the regular... Yes, terabyte, not petabyte. And I'm well past the edit interval at this point; how embarrassing, mea culpa :-)
I meant in the context of people playing along at home who might be working off a laptop or a desktop with only 16-32 GB of RAM, not someone with half a terabyte of RAM on a multi-socket server.
Really interesting article, but I wish there were more data on the speed/performance of the Redshift queries. It seems to cover everything but the actual performance metrics!
From my experience analysing similar-sized datasets, it really comes down to the structure of the query and the dataset. If you nail your dist keys and your queries don't have many large joins, it's incredibly quick. If your dist keys aren't great and you have many joins, it will take a long, long time.
Exactly. You have to keep in mind that other than distkeys / sortkeys, there are no indexes. So all you can do to speed things up is organize your data better.
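To make that concrete, here's a rough sketch (hypothetical tables) of what "organizing your data better" usually means: distribute the big fact table on the join column, and copy the small lookup table to every node so the join never has to shuffle the big table's rows between nodes.

    -- Big fact table: distributed and sorted on the join column.
    CREATE TABLE trips (
        zone_id   INT,
        pickup_at TIMESTAMP,
        fare      DECIMAL(8,2)
    )
    DISTKEY (zone_id)
    SORTKEY (zone_id);

    -- Small lookup table: replicated in full to every node.
    CREATE TABLE zones (
        zone_id INT,
        borough VARCHAR(32)
    )
    DISTSTYLE ALL;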
For a ballpark figure: we work with a ~500 million row / 12 column dataset, and complex queries with a lot of joins typically take around 10-20 minutes to complete on Redshift's smallest cluster ($250/month).
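If you want a cheap way to tell in advance which bucket a query falls into, EXPLAIN shows whether Redshift has to redistribute rows for the join (table and column names here are hypothetical):

    -- DS_DIST_NONE in the plan is the fast case (no data movement);
    -- DS_BCAST_INNER or DS_DIST_BOTH usually signals the slow case above.
    EXPLAIN
    SELECT t.pickup_date, COUNT(*)
    FROM trips t
    JOIN weather w ON t.pickup_date = w.date
    GROUP BY t.pickup_date;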
I've been curious whether this dataset could be exploited to invade privacy, e.g. by targeting rides to/from a sensitive address (a medical facility, a strip club). Or by cross-referencing it with other data, like a private detective matching a pick-up recorded on a security camera against this database to look up where the passenger went. I guess this makes that easier to find out.
I really wish the AWS tutorials were this clean. The smiling photo of the author in the left menu gives it a very human feel, and it made the material easier for me to understand.
For those interested in funny and insightful analyses of the data, check out this blog: http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-ta...