Back at my old job in ~2016, we built a cheap homegrown data warehouse via Postgres, SQLite and Lambda.
Basically, it worked like this:
- All of our data lived in compressed SQLite DBs on S3.
- Upon receiving a query, Postgres would use a custom foreign data wrapper we built.
- This FDW would forward the query to a web service.
- This web service would start one lambda per SQLite file. Each lambda would fetch the file, query it, and return the result to the web service.
- This web service would re-issue lambdas as needed and return the results to the FDW.
- Postgres (hosted on a memory-optimized EC2 instance) would aggregate.
It was straight magic. Separated compute + storage with basically zero cost and better performance than Redshift and Vertica. All of our data was time-series data, so it was extraordinarily easy to partition.
It was also considerably cheaper than Athena. On Athena, our queries would have cost us ~$5/TB scanned (a price that hasn't changed to this day!), so most queries were easily >$100, and we were running thousands of queries per hour.
I still think, to this day, that the inevitable open-source solution for DWs might look like this: insert your data as SQLite or DuckDB files into a bucket, pop in a Postgres extension, create an FDW, and `terraform apply` the lambdas + API gateway. It'll be harder for non-time-series data, but you could probably build something that handles other partitioning schemes.
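For anyone curious what the per-file Lambda side of a setup like this might look like, here's a minimal sketch assuming gzip-compressed SQLite partitions and a simple JSON event carrying the object key and the pushed-down query. The bucket name, event fields, and compression scheme are my own placeholders, not the original system's details:

```python
# Minimal sketch of a per-partition Lambda: fetch one compressed SQLite file
# from S3, run the pushed-down query against it, and return the rows to the
# coordinating web service. All names here are illustrative placeholders.
import gzip
import os
import sqlite3

import boto3

s3 = boto3.client("s3")
BUCKET = os.environ.get("PARTITION_BUCKET", "my-sqlite-partitions")  # hypothetical


def handler(event, context):
    key = event["key"]      # e.g. "events/2016-03-01.sqlite.gz"
    query = event["query"]  # per-partition SQL produced by the FDW service

    # Lambda only gives us writable scratch space under /tmp.
    gz_path, db_path = "/tmp/part.sqlite.gz", "/tmp/part.sqlite"
    s3.download_file(BUCKET, key, gz_path)
    with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
        dst.write(src.read())

    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(query).fetchall()
    finally:
        conn.close()

    # Return partial results; Postgres does the final aggregation.
    return {"key": key, "rows": rows}
```

The web service would invoke one of these per SQLite file and merge the partial results before handing them back to the FDW.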
For our scale and request patterns (easily partitioned, ~0.1 qps), we've had no major issues, but the JavaScript bindings I use (which are different from their WASM bindings) leave a lot to be desired. To DuckDB's credit, they seem to have top-notch C++ and Python bindings that even support the memory-mapped Arrow format, which is purpose-built for efficient cross-language / cross-process use, in addition to being a top-notch in-memory representation for Pandas-like dataframes.
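For reference, the Python-side Arrow interplay looks roughly like this. It's a small sketch with made-up data; I believe registering a pyarrow table and fetching results via `.arrow()` are both supported, but treat the exact calls as approximate:

```python
# Sketch of DuckDB's Python/Arrow interplay: an existing Arrow table can be
# queried with SQL, and results come back as a pyarrow Table.
import duckdb
import pyarrow as pa

con = duckdb.connect()  # in-memory database

# Hypothetical Arrow table standing in for a dataframe of events.
events = pa.table({"ts": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
con.register("events", events)  # expose the Arrow table to SQL

# Query it and fetch the result back as a pyarrow Table.
result = con.execute("SELECT sum(value) AS total FROM events").arrow()
print(result.to_pydict())  # {'total': [60.0]}
```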
Granted, DuckDB is under constant development, but it doesn't yet have a native cross-version export/import feature (its developers say DuckDB hasn't matured enough to stabilise its on-disk format just yet).
I also keep an eye on https://h2oai.github.io/db-benchmark/. As for Arrow-backed query engines, Pola.rs and DataFusion in particular sound the most exciting to me.
It also remains to be seen how Databricks' delta.io develops (it might come in handy for much, much larger data warehouses).
I've looked into this but saw hugely variable throughput, sometimes as little as 20 MB/s. Even at full throughput, I think S3 single-key performance maxes out at ~130 MB/s. How did you get these huge S3 blobs into Lambda in a reasonable amount of time?
* With larger Lambdas you get more predictable performance; 2 GB RAM Lambdas should get you ~90 MB/s [0].
* Assuming you can parse faster than you read from S3 (true for most workloads?), that read throughput is your bottleneck.
* Set a target query time, e.g. 1s. At ~90 MB/s, that means each record on S3 has to be 90 MB or smaller for queries to finish in 1s.
* Partition your data in such a way that each record on S3 is smaller than 90 MB.
* Forgot to mention: you can also do parallel reads from S3. Depending on your data format / parsing speed, that might be something to look into as well (rough sketch below).
This is a somewhat simplified guide (e.g. for some workloads merging data takes time, and we're not including that here), but it should be good enough to start with.
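As a rough illustration of the parallel-read point, ranged GETs against a single key can be fanned out across threads, something like this. Bucket, key, chunk size, and worker count are all placeholder values, and note that boto3's own transfer manager can also do multipart downloads for you:

```python
# Rough sketch of parallel ranged reads against one S3 object, to get past
# per-connection throughput limits. All names and sizes are placeholders.
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-data-bucket", "partitions/2016-03-01.dat"  # hypothetical
CHUNK = 16 * 1024 * 1024  # 16 MiB per range


def fetch_range(start, end):
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return resp["Body"].read()


size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(off, min(off + CHUNK, size) - 1) for off in range(0, size, CHUNK)]

with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(lambda r: fetch_range(*r), ranges))

blob = b"".join(parts)  # then decompress / parse / query downstream
```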
How large were the SQLite database files you were working with here?
I've been thinking about building systems that store SQLite databases in S3 and pull them to a Lambda for querying, but I'm nervous about how feasible that is given the database file sizes and how long the fetch would take.
I honestly hadn't thought about compressing them, but that would obviously be a big win.
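Sketching it out for myself, the compress-and-upload side would be something small like this (gzip + boto3, with invented paths and bucket names):

```python
# Sketch: gzip a per-partition SQLite file and upload it, so the querying
# Lambda only pays for the compressed transfer. Names are invented.
import gzip
import shutil

import boto3

s3 = boto3.client("s3")


def upload_partition(db_path: str, bucket: str, key: str) -> None:
    gz_path = db_path + ".gz"
    with open(db_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream-compress without loading it all
    s3.upload_file(gz_path, bucket, key)


upload_partition("events-2016-03-01.sqlite", "my-sqlite-partitions",
                 "events/2016-03-01.sqlite.gz")
```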
I've worked with Redshift for about 5-ish years and BigQuery for about a year. IMO BigQuery wins hands down.
From an operational perspective we've had almost zero issues with BQ, whereas with Redshift we had to constantly keep giving it TLC. Right from creating users/schemas to WLM tuning, structuring files as Parquet for Spectrum access, understanding why and how Spectrum performs in different scenarios, etc., everything was a chore. All this Redshift-specific specialization I was learning was not really contributing to the product in a meaningful way.
We switched to BQ a year ago, and it's been mostly self-driving. The only things we had to tend to were slots and a bit of a learning curve for the org around partitioning keys (there is a setting in BQ that fails your query if a partition key is not specified).
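For anyone curious, that setting is configurable per table; a rough sketch with the google-cloud-bigquery Python client (project/dataset/field names are made up, so double-check against the docs):

```python
# Sketch: create a date-partitioned BigQuery table that rejects queries
# lacking a partition filter. Project/dataset/table/field names are invented.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

table = bigquery.Table(
    "my-project.analytics.events",
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="event_ts")
table.require_partition_filter = True  # queries without a filter on event_ts fail

client.create_table(table)
```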
Having switched to BQ it's really hard for me to imagine going back to Redshift. It almost feels antiquated.
I briefly worked on Redshift and have used Athena/Presto and BigQuery. Redshift felt like an architectural middle ground. Presto can query almost anything, and BigQuery requires storing data in BigQuery, but like Presto, you don't have to pay for inactive compute. Redshift's scaling story is more complicated, and paying for inactive compute wasn't ideal. It sounds like it might have improved, but you're still essentially building BigQuery at that point. There might be some use cases that need a fast columnar store that's already online, so queries take 3s rather than 10s with BigQuery.
I generally prefer BigQuery, and between it and Bigtable, I actually prefer GCP over AWS because their offerings for hard-to-do things are really good. I'd honestly pick GCP just for those two products.
I work on BigQuery. All of these are great points; I just wanted to point out that BigQuery can federate into external data sources as well, e.g. files on Cloud Storage and Bigtable. The relevant feature is BigLake: https://cloud.google.com/bigquery/docs/biglake-intro
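As a concrete (non-BigLake) flavour of that federation, defining an external table over Parquet files on Cloud Storage looks roughly like this with the Python client; bucket, dataset, and table names are invented:

```python
# Sketch: define a BigQuery external table backed by Parquet files on GCS.
# Bucket, dataset, and table names are invented for illustration.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-lake-bucket/events/*.parquet"]

table = bigquery.Table("my-project.analytics.external_events")
table.external_data_configuration = external_config
client.create_table(table)

# Queries then read the files in place on Cloud Storage:
rows = client.query(
    "SELECT COUNT(*) FROM `my-project.analytics.external_events`"
).result()
```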
Are there any performance benefits of BigLake over external tables stored in Parquet governed by Hive? Or is the main benefit the governance flexibility?
Currently the main benefit of BigLake over the current external tables is governance: you get row and column level security over cloud storage data. The governance is uniformly enforced across BigQuery and also the BigQuery storage API. The storage API can be used by any engine and we have pre-built open source connectors available for Spark, Presto/Trino, Dataflow and Tensorflow.
We're constantly working on improving BigQuery performance over open file formats on cloud storage. Some of these features will be specific to BigLake. Please stay tuned.
Our most shocking discovery on Redshift was that primary key constraints are not honored. (Not sure why they even have PK identifiers given this; it just adds more confusion.)
This is the case for most column oriented data warehouses (including BigQuery, but Snowflake does allow for 1 PK). It's just the nature of the technology.
I worked on BQ, and I have to say it is a very poor product: poor performance, and only a few customers use it. Let's not preach BQ... After switching to Snowflake, I can tell how good Snowflake is in terms of performance.
Does anyone choose Redshift these days, aside from those completely tied into AWS? It feels like Redshift is playing catch-up but moving so slowly it will never catch up.
As another commenter noted, Redshift in my experience was an operational hassle.
Amazon is losing its edge; the toxic company culture is pushing away good engineers, and we are seeing more and more third-party companies begin to dominate spaces Amazon previously dominated, e.g. Snowflake.
Redshift had architectural issues from Day 1. Too many knobs to turn, and it's just not something customers wanted to do. I know, because we built a whole company around selling Redshift performance tuning.
Everything that's been added in terms of functionality is just window dressing, IMHO. BigQuery and in particular Snowflake have a superior architecture, with the separation of storage and compute.
Also, as someone else mentions in the thread, the Redshift marketing was horrible.
Features, features, features, features, etc.
Compare that to Snowflake, and how their message evolved:
"The Cloud Warehouse"
"Data Cloud"
So much more compelling. Also, Snowflake had a killer sales and marketing team.
Many other little things, too. The Redshift team was constrained by what the AWS Console would give them; Snowflake could build more and better admin features. Redshift tried to mitigate that by acquiring a client (Datarow), but from what I heard, the acquisition never got integrated.
Having said that, set up and configured the right way, Redshift was faster and cheaper than any other data warehouse on the market. Except nobody wanted to spend the time properly configuring their cluster. People just wanted their warehouse to work, and that's what BigQuery and Snowflake delivered, even when that meant paying more.
The time you would have spent tuning Redshift for performance, you spent tuning Snowflake and BigQuery for cost instead. Pick your poison. But again, people didn't care about the money; they just wanted things to work.
I didn't really see BigQuery as competition, simply because that meant switching clouds.
What I did hear is that analytics teams preferred GCP overall, and I think that has driven a lot of cloud workloads from AWS to GCP, because in the last few years analytics teams have become decision makers in many companies.
IMHO, GCP has a much better analytics portfolio than AWS, and by now also the better sales team. It was really smart of GCP to bet on data science and analytics because of data gravity: once you have the data in your cloud, it attracts workloads.
Snowflake is what eventually killed our performance tuning business. You can find my post-mortem on our company on Medium.
As you can probably tell, I know more about the whole analytics ecosystem than I bargained for.
I am glad there is some innovation happening. If you compare Redshift to BigQuery, the current production version feels quite archaic. I understand that Google commands a metric ton of engineering excellence, yet the difference should not be so extreme, especially when you consider that AWS is still the king of cloud. BigQuery is downright magic.
I would love to see more competition in this space as having large amounts of data with Google always makes me feel uneasy for all kinds of reasons.
In the last two years I have used both Redshift (AQUA) and Snowflake, and I prefer Snowflake by a mile. Snowflake is just a lot easier to use, scales better and has a much better permission model.