Amazon Redshift - What You Need To Know

monstrado · on Sept 30, 2013

These are very different databases, PostgreSQL is a transactional database, while Redshift (aka: ParAccel) is an analytical database. Each of these databases have implemented much different design decisions, which improve queries on certain type of workloads.

PostgreSQL is optimized for blazing fast record mutations, inserts, while maintaining adequate query response times on medium to large size data. Redshift (ParAccel), or your other analytical databases like Greenplum, Teradata, or Netezza make optimizations that make sense for queries that pertain to the majority of a tables data (full table scans).

For example, Redshift stores stores a tables columns in separate locations, allowing you to not only skip reading columns which don't pertain to the query, but also allows for easier disk parallelization. Keeping your columns separate from each other slows down things like record reconstruction, since you're performing n disk seeks, where n = # of columns...this is bad for databases where you need a single record. Databases like Redshift are meant to compliment your MySQL and PostgreSQL databases, not replace them.

Shameless plug (I work at Cloudera): For reasons unknown, the article dismissed Hadoop without listing any reasons. If you're interested in having a secondary system designed for doing ad-hoc queries over your large datasets (billions of rows), I suggest trying out Impala. You can run it across a few servers you have sitting around... http://rideimpala.com/

kungfooey · on Sept 30, 2013

> These are very different databases..

Sure, they are. The point is that the API is similar, so developers already familiar with SQL, or libraries that speak SQL (SQLAlchemy in our case) can speak to Redshift. I attempted to make that point in the post, hopefully I can make that more clear.

> For reasons unknown, the article dismissed Hadoop without listing any reasons.

I didn't feel that it was the place for it, but the basics are that we didn't have the skills in place to restructure our system to use Hadoop. Second, our data structures are constantly changing, which Hadoop didn't accommodate very well. Third, we were not able to accommodate ad-hoc queries as every reduce operation needed to be figured out prior to running. Last, we were looking for something to handle "large" data sets, but not "big data." Hadoop works great for "big data" (petabytes) but it was overkill for us (working with terabytes).

I'm sure all of these problems could have been resolved, but the point of my write-up is not to kick dirt at Hadoop so much as to explain how Redshift fit our needs. Other companies, I'm sure, have different needs and should of course make their own evaluations.

monstrado · on Sept 30, 2013

Thanks for the clarifications, I wasn't not meaning to discredit your article in anyway. I am just trying to help people understand that these databases are very different from eachother, and were created to solve different use cases.

> Hadoop works great for "big data" (petabytes) but it was overkill for us (working with terabytes).

This is a very common misunderstanding about Hadoop, terabytes of data is still very large when you want to apply complex transformations, or allow someone to run queries over it with low latency expectations (a few seconds).

Impala does not use MapReduce, it was designed with low latency goals in mind...speeds are comparable to RedShift (https://amplab.cs.berkeley.edu/benchmark/), although they are citing an older version.

kungfooey · on Sept 30, 2013

Interesting. I will definitely check out Impala. Thanks for bringing it up.

nemothekid · on Sept 29, 2013

In the COUNT(*) example, looks like you annotated the wrong snippet.

"5 seconds! That’s an improvement. Note that I didn’t make any adjustements to the data: no indexes, no differences in table structure." should be 1.5s.

jmduke · on Sept 29, 2013

Looks like he typo'd and put a space in between the "1." and the "5", which got interpreted as an ordered list.

kungfooey · on Sept 29, 2013

Whoops, thanks for catching that. I didn't actually have a space there, but it did get interpreted that way.

stevoski · on Sept 29, 2013

select count( * ) from dummy_table;

takes 10 minutes on PostgreSQL to return the result 21454134?

With H2, an open source embedded SQL database I use daily, a "select count( * ) from table_name" query returns the result instantly. I assumed therefore that this was the norm...

saurik · on Sept 30, 2013

As mentioned by sargun (who was irritatingly downvoted by someone) the difference is that PostgreSQL uses the MVCC concurrency model, which allows you to do quite complex queries concurrently with other people making writes to the database, but comes at the cost that there is no longer any objective count of the rows in a given table: the number of rows is related to which transactions are considered live at any given moment.

http://wiki.postgresql.org/wiki/Slow_Counting

kungfooey · on Sept 30, 2013

In my case, this was inadvertently a good comparison. Most of the "ad-hoc" queries we're doing are AVG, SUM, or similar operations that require scanning the entire table.

I may try to edit this post to compare those operations, as it may be more meaningful.

Thanks for posting the link. TIL.

scragg · on Sept 29, 2013

I thought this query was instant in most dbs because it reads an internal row count of the table since this particular query has no conditions or joins.

sargun · on Sept 29, 2013

Does H2 do MVCC?

ogrisel · on Sept 30, 2013

AFAIK not by default. Furthermore the MVCC mode is still considered experimental: http://www.h2database.com/html/advanced.html#mvcc

superails · on Sept 30, 2013

> Redshift smelled enough like PostgreSQL

I'm curious. How does something smell like Postgres? Does it use some of the same PG-specific datatypes?

kungfooey · on Sept 30, 2013

It uses similar data types and has a similar API. The differences between the two are summarized in the documentation.

http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and...

tmarthal · on Sept 30, 2013

Not only does it use the same datatypes, but I know that it uses the same pg wire format such that the java (and I'm sure python) postgres drivers work out of the box.

ceyhunkazel · on Sept 29, 2013

I think beside familiar PostgreSQL enviroment there is not much advantage over SAP HANA. I listed advantages of HANA on my posting https://news.ycombinator.com/item?id=6466222

mtam · on Sept 29, 2013

Redshift is a SQL relational database for large structured data volumes, whereas Amazon EMR (Elastic MapReduce) is for non-structured data. The Redshift FAQ has more details on this topic: http://aws.amazon.com/redshift/faqs/#0120

Other than the marketing buzz from SAP, I have not seen any reputable (neutral) comparison between Hana and Redshift that includes price. I also have never met anyone that uses/run Hana that was not already an SAP ERP client, albeit the fact that SAP keeps saying Hana is the fastest-growing product in the company's history.

tonywe64 · on Sept 30, 2013

I don't think it makes sense to compare Redshift to HANA. There is no way that HANA could scale the way ParAccel/Vertica scales within the same price range.

Besides all the marketing buzz SAP has been throwing it out there, how many companies actually use it with a decent amount of data (at least 10TB+ range)?