A Billion NYC Taxi and Uber Rides in AWS Redshift (marksblogg.com)
84 points by marklit on Feb 14, 2016 | 17 comments



Quite frankly, for data of this scale, PostgreSQL is more than adequate on a modern machine (although the OP's step-by-step guide is surely helpful for those new to Redshift).

For those interested in fun and insightful analyses of the data, check out this blog: http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-ta...


I'm not sure I would want to try analyzing a 267 GB database on most modern machines. Redshift, on the other hand, is optimized quite well for rapidly querying flat datasets like this.

That said, since Redshift is PostgreSQL 8.0.2 under the hood, any code you write for analysis should be easy to move over (apart from having to add dist and sort keys), as in the sketch below.
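
For example, a minimal sketch of what that migration step looks like (the table and column names here are illustrative, not taken from the post):

    -- Plain PostgreSQL DDL plus the two Redshift-specific attributes.
    -- Schema is hypothetical, loosely modelled on the trip data.
    CREATE TABLE trips (
        trip_id          BIGINT,
        pickup_datetime  TIMESTAMP,
        dropoff_datetime TIMESTAMP,
        total_amount     DECIMAL(8,2)
    )
    DISTKEY (trip_id)           -- controls which node each row lands on
    SORTKEY (pickup_datetime);  -- acts like a coarse index for range scans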


> I'm not sure I would want to try analyzing a 267 GB database on most modern machines.

Why? Modern servers come with more RAM than that. Even multi-terabyte datasets are not a problem. Today's hardware is far more capable than most people think.


Could not agree more. Production data needs to be bigger than a petabyte or three before my default response is anything other than 'let's load it into RAM.' :-)


You mean terabyte, right?


Well shucks, this is why I try not to HN at 3am on the regular... Yes, terabyte, not petabyte. And I'm well past the edit window at this point; how embarrassing, mea culpa :-)


I meant in the context of people playing along at home who might be working off a laptop or a desktop with only 16-32 GB of RAM, not someone with half a terabyte of RAM on a multi-socket server.


That's still fine; it will take longer but will still run without problems. A good SSD is recommended, though.


Really interesting article, but I wish there were more data on the speed/performance of the Redshift queries. It seems to cover everything but the actual performance metrics!


From my experience analysing similar-sized datasets, it really comes down to the structure of the query and the dataset. If you nail your dist keys and your queries don't have many large joins, it's incredibly quick. If your dist keys aren't great and you have many joins, it will take a long, long time.


Exactly. You have to keep in mind that other than distkeys/sortkeys, there are no indexes, so all you can do to speed things up is organize your data better, for example by distributing both sides of a join on the join column, as in the sketch below.
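
A hypothetical sketch (table and column names made up): when both tables use the join column as their distribution key, each node can join its own slice locally instead of shuffling rows across the cluster.

    -- Both tables distributed on the join column (schema is invented).
    CREATE TABLE zones (
        zone_id   INT,
        zone_name VARCHAR(64)
    )
    DISTKEY (zone_id);

    CREATE TABLE trips (
        trip_id     BIGINT,
        pickup_zone INT
    )
    DISTKEY (pickup_zone);

    -- Co-located join: matching rows already live on the same node.
    SELECT z.zone_name, COUNT(*) AS rides
    FROM trips t
    JOIN zones z ON z.zone_id = t.pickup_zone
    GROUP BY z.zone_name;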

For a ballpark figure: we work with a ~500 million row / 12 column dataset, and complex queries with a lot of joins typically take around 10-20 minutes to complete on Redshift's smallest cluster ($250/month).


Same here ^. This was great for "how to move Postgres -> Redshift", but I didn't learn anything about the title of the post ...


+1


    Issues often accumulate content-less “+1” comments which serve only to spam the maintainers and any others subscribed to the issue. These +1s serve a valuable function in letting maintainers know how widespread an issue is, but their drawbacks are too great. We’d like issues to gain a first-class voting system, and for content-less comments like “+1” or “:+1:” or “me too” to trigger a warning and instructions on how to use the voting mechanism.


I've been curious whether this dataset could be exploited to invade privacy, e.g. by targeting rides to/from a sensitive address (medical-related, a strip club), or by cross-checking it with other data, like a private detective comparing a pick-up recorded on a security camera with where the ride went in this database. I guess this makes it easier to find out.
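
Something like this hypothetical query is all it would take, assuming the dataset's raw pickup coordinates (all values below are invented):

    -- Hypothetical: drop-offs for trips starting near a known address
    -- in a known time window. Coordinates and timestamps are made up.
    SELECT dropoff_latitude, dropoff_longitude, dropoff_datetime
    FROM trips
    WHERE pickup_datetime BETWEEN '2015-06-01 22:00' AND '2015-06-01 22:15'
      AND pickup_latitude  BETWEEN 40.7580 AND 40.7590
      AND pickup_longitude BETWEEN -73.9860 AND -73.9850;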


Addresses are indeed anonymised somewhat, so it might be difficult.


I really wish the AWS tutorials were this clean. The smiling photo of the author in the left menu gives it a very human feel, and it made the guide easier for me to understand.



