A Billion NYC Taxi and Uber Rides in AWS Redshift (marksblogg.com)
84 points by marklit on Feb 14, 2016 | 17 comments



Quite frankly, for data of this scale, PostgreSQL is more than adequate on a modern machine (although the OP's step-by-step guide is surely helpful for those new to Redshift).

For those interested in fun and insightful analyses of the data, check out this blog: http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-ta...


I'm not sure I would want to try analyzing a 267 GB database on most modern machines. Redshift, on the other hand, is optimized quite well for rapidly querying flat datasets like this.

That said, since Redshift is PostgreSQL 8.0.2 under the hood, any code you write for analysis should be easy to move over (apart from having to add dist and sort keys), as in the sketch below.
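
For example, a minimal sketch of what that migration step looks like (the table and column names here are illustrative, not taken from the post):

    -- Plain PostgreSQL DDL plus the two Redshift-specific attributes.
    -- Schema is hypothetical, loosely modelled on the trip data.
    CREATE TABLE trips (
        trip_id          BIGINT,
        pickup_datetime  TIMESTAMP,
        dropoff_datetime TIMESTAMP,
        total_amount     DECIMAL(8,2)
    )
    DISTKEY (trip_id)           -- controls which node each row lands on
    SORTKEY (pickup_datetime);  -- acts like a coarse index for range scans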


> I'm not sure I would want to try analyzing a 267 GB database on most modern machines.

Why? Modern servers come with more RAM than that. Even multi-terabyte datasets are not a problem. Today's hardware is far more capable than most people think.


Could not agree more. Production data needs to be bigger than a petabyte or three before my default response is anything other than 'let's load it into RAM.' :-)


You mean terabyte, right?


Well shucks, this is why I try not to HN at 3am on the regular... Yes, terabyte, not petabyte. And I'm well past the edit window at this point; how embarrassing, mea culpa :-)


I meant in the context of people playing along at home who might be working off a laptop or a desktop with only 16-32 GB of RAM, not someone with half a terabyte of RAM on a multi-socket server.


That's still fine; it will take longer but will still run without problems. A good SSD is recommended, though.


Really interesting article, but I wish there were more data on the speed/performance of the Redshift queries. It seems to cover everything but the actual performance metrics!


From my experience analysing similar-sized datasets, it really comes down to the structure of the query and the dataset. If you nail your dist keys and your queries don't have many large joins, it's incredibly quick. If your dist keys aren't great and you have many joins, it will take a long, long time.


Exactly. You have to keep in mind that other than distkeys/sortkeys, there are no indexes, so all you can do to speed things up is organize your data better, for example by distributing both sides of a join on the join column, as in the sketch below.
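
A hypothetical sketch (table and column names made up): when both tables use the join column as their distribution key, each node can join its own slice locally instead of shuffling rows across the cluster.

    -- Both tables distributed on the join column (schema is invented).
    CREATE TABLE zones (
        zone_id   INT,
        zone_name VARCHAR(64)
    )
    DISTKEY (zone_id);

    CREATE TABLE trips (
        trip_id     BIGINT,
        pickup_zone INT
    )
    DISTKEY (pickup_zone);

    -- Co-located join: matching rows already live on the same node.
    SELECT z.zone_name, COUNT(*) AS rides
    FROM trips t
    JOIN zones z ON z.zone_id = t.pickup_zone
    GROUP BY z.zone_name;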

For a ballpark figure: we work with a ~500 million row / 12 column dataset, and complex queries with a lot of joins typically take around 10-20 minutes to complete on Redshift's smallest cluster ($250/month).


Same here ^. This was great for "how to move Postgres -> Redshift", but I didn't learn anything about the title of the post ...


+1


    Issues often accumulate content-less “+1” comments which serve only to spam the maintainers and any others subscribed to the issue. These +1s serve a valuable function in letting maintainers know how widespread an issue is, but their drawbacks are too great. We’d like issues to gain a first-class voting system, and for content-less comments like “+1” or “:+1:” or “me too” to trigger a warning and instructions on how to use the voting mechanism.


I've been curious whether this dataset could be exploited to invade privacy, e.g. by targeting rides to/from a sensitive address (medical-related, a strip club), or by cross-checking it with other data, like a private detective comparing a pick-up recorded on a security camera with where the ride went in this database. I guess this makes it easier to find out.
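
Something like this hypothetical query is all it would take, assuming the dataset's raw pickup coordinates (all values below are invented):

    -- Hypothetical: drop-offs for trips starting near a known address
    -- in a known time window. Coordinates and timestamps are made up.
    SELECT dropoff_latitude, dropoff_longitude, dropoff_datetime
    FROM trips
    WHERE pickup_datetime BETWEEN '2015-06-01 22:00' AND '2015-06-01 22:15'
      AND pickup_latitude  BETWEEN 40.7580 AND 40.7590
      AND pickup_longitude BETWEEN -73.9860 AND -73.9850;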


Addresses are indeed anonymised somewhat, so it might be difficult.


I really wish the AWS tutorials were this clean. The smiling photo of the author in the left menu gives it a very human feel, and it made the guide easier for me to understand.



