Ironically for many, the lack of support for arbitrary extensions on RDS, Google Cloud SQL, and the like means that if your data's stuck there, you'd need to migrate to a self-managed database instance - which means you'd need to use the strategies in the blog post anyways!
More generally, of course, it's not always a "free lunch" to have your analytics compute using the same CPU/RAM that runs your production transactional database. If you're at billion-row scales, you're likely to be at the point where mirroring to a data warehouse is becoming necessary.
Google Cloud SQL does support quite a few Postgres extensions, but unfortunately it does not support TimescaleDB, which is quite sad indeed. I hope that at some point in the future TimescaleDB could be merged into Postgres to make it a native feature.
I'm confused. Everything sounds very expensive to me.
The last table, which compares it with the other vendors, is surprising. Even Stitch Data (the cheapest) costs $1 to move 240K records (1B / $4,166.67 ≈ 240K). Is this real?
So, their solution costs $1 to process 13.6M records. Sounds like this is not very share-worthy.
Welcome to 2024 and the generation of developers raised in the cloud native world who think this is normal.
A billion rows is nothing and having $100 appear in conjunction with that is absurd unless you are doing some kind of really heavy compute or AI model training on that data.
By 2030 we’ll have those costs well up over a thousand dollars and it’ll take five or six separate SaaS systems wired together to do this. Progress!
To be fair, my machine that can process 1B rows an unlimited number of times still cost $1000 to build, so if you only need a one-off, maybe paying the $100 is better?
This isn't an accurate premise. Modern OLAP databases make dealing with billions to trillions of rows manageable, including on a single server. Exporting "select * from table" from an OLTP such as Postgres or MySQL into an OLAP is trivial and quite fast, and if 100M rows/sec on commodity servers isn't fast enough, there's always performance tuning [1].
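For instance, here's a rough sketch of pulling a Postgres table straight into ClickHouse with its postgresql() table function (the host, credentials, table, and ORDER BY column are placeholders I made up, not anything from the article):

  -- Run in ClickHouse; the analytics database must already exist.
  CREATE TABLE analytics.events_olap
  ENGINE = MergeTree
  ORDER BY event_time
  AS SELECT *
  FROM postgresql('pg-host:5432', 'appdb', 'events', 'readonly_user', 'secret');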
That sentence is meant in the context of the blog, i.e. moving a billion rows across data stores. We will edit it to make that clearer. Thanks for the feedback.
CSV-based export/import would involve writing to disk, sending over the network, then reading back from the file. A SELECT is just one copy over the network, with no intermediate CSV.
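A rough postgres_fdw sketch of that single-copy path (server, schema, and table names are placeholders):

  -- Run on the destination database.
  CREATE EXTENSION IF NOT EXISTS postgres_fdw;
  CREATE SCHEMA IF NOT EXISTS src;

  CREATE SERVER src_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'source-host', port '5432', dbname 'appdb');
  CREATE USER MAPPING FOR CURRENT_USER SERVER src_pg
    OPTIONS (user 'replicator', password 'secret');

  IMPORT FOREIGN SCHEMA public LIMIT TO (big_table)
    FROM SERVER src_pg INTO src;

  -- One streamed copy over the network, no CSV on disk:
  CREATE TABLE big_table_local AS SELECT * FROM src.big_table;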
I've already thought of the follow-up to this article, 'Querying a billion rows on a $XX budget.' Let me give you my answer directly: switch from Snowflake to StarRocks. It's an open-source project under the Linux Foundation, with speed that's more than adequate, especially for queries involving multiple tables. If you're interested, you might want to check it out, https://medium.com/starrocks-engineering/how-to-reduce-snowf....
Hard to set up in Kubernetes. Seems to want you to know too many specifics about how to shard your db ahead of time. Don't remember how it handles schema evolution. I settled for Databend because it just worked, easily and fast, with nothing more than table definitions.
Seems like the title would be more accurate if it specified Postgres to Snowflake.
In the end, there are companies paying to use Snowflake, and despite what one may believe, they aren't price-oblivious. Having their application in Postgres is a cost optimization, but they still rely on Snowflake for data warehouse integrations.
Based on my experience working with Postgres users for a decade (ex-Citus/Microsoft), I don't think Postgres is there yet to support every possible workload - e.g. medium to larger scale (1TB+) workloads in real-time analytics, data warehousing, search, etc. Sure, at smaller scales it is very versatile and can support almost any workload. That is why it is super common for companies to complement Postgres with other data stores. I don't mean to say that Postgres will not get there, I think it will! But I see that as more of a long-term thing.
We operate with 80 TB of data ATM. It is spread across several nodes and meta nodes (this is our own terminology). All Postgres.
Recently we needed to move data from one DB to another, about 600M records. It is not the biggest chunk of the data, but we needed it on a different server because we use FTS a lot, and we didn't want to interrupt other operations on the previous server. It took 3 days and cost 0.
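For what it's worth, a move like that can also be done with plain logical replication. A minimal sketch, assuming wal_level=logical on the source and made-up table/connection names:

  -- On the old server (publisher):
  CREATE PUBLICATION fts_pub FOR TABLE documents;

  -- On the new FTS server (subscriber), with the same table pre-created:
  CREATE SUBSCRIPTION fts_sub
    CONNECTION 'host=old-server dbname=appdb user=replicator password=secret'
    PUBLICATION fts_pub;
  -- The initial table copy streams in the background; drop the subscription when done.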
Thanks for the context! Totally understand where you are coming from. Postgres can be moulded to work for many use cases. However, it can take a good amount of effort to make that happen. For example, in your case, building and managing a sharded Postgres environment isn't straightforward; it requires quite a lot of time and expertise. Citus automated exactly this (sharding). However, it wasn't a fit for every workload. https://docs.citusdata.com/en/v12.1/get_started/what_is_citu...
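For reference, the Citus side of it boils down to something like this on the coordinator (placeholder host/table/column names; provisioning and operating the worker nodes is the part that takes the real effort):

  CREATE EXTENSION citus;
  SELECT citus_add_node('worker-1', 5432);
  SELECT citus_add_node('worker-2', 5432);
  -- Shard the table across workers by a hash of customer_id:
  SELECT create_distributed_table('events', 'customer_id');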
Tbh I've never been bothered by these mental restrictions. I can't say that any of our work requires more than a normal human brain can handle (and we aren't the smartest people on the planet).
I used to work on a project where we processed a big share of all shops' cash receipts in one of the biggest European countries. We didn't use any of these products, and it was done by one person.
The only stupid idea we had was to use AWS. Learned helplessness pushes people away from the best product on the market just because it doesn't come with a salesman to tickle your balls.
Postgres is one of the best products on the market. But so much FUD creates space for "problem solvers" solving problems that never existed. I'm not talking about Citus specifically; I'm talking about the idea that it requires a lot of effort to build something around Postgres.
Both. We process all the data in several blockchains and make public reports available to our users. Some of the data can be processed once; other data needs to be recalculated on a schedule. So we process every block and every log entity in a transactional manner, and we also have a lot of data that is reprocessed in several ways a 2nd/3rd/etc. time.
We also slowly evolve our internal analytics/intelligence. It is not something that generates high load yet, but it will at some point. Imagine something like dune.com.
> I don't think Postgres is there yet to support every possible workload
No system targets every possible workload. If that’s your goal, change your goal to make it more achievable.
With work (e.g. a cluster of Postgres instances), you can do quite a lot with Postgres if you think it through. If you want an out of the box solution to a specific problem, it may make sense to move away from Postgres. Of course another option is just to change the overall design/deployment of your Postgres cluster.
At some point you’ll need to do some custom work. Unless you plan on moving away from your new system next year because it doesn’t cover every possible workload.
Since when is 1TB+ medium to large scale? That easily fits on a single computer running Postgres; you can even keep almost all of it in memory, depending on the server.
Sure, though the workload could be compute bound rather than memory bound. It also depends on what query workloads the Postgres planner and executor are optimized for. The TPC-H benchmark is an example where vanilla Postgres is not as optimal as pure analytical stores.
OP here, thanks for chiming in. Yep, the blog talks about how efficient you can be while replicating data from Postgres to data warehouses. Also, just for some context, the inspiration behind writing this blog was a common concern from customers that data movement is expensive, plus the insanely crazy margins in this space.
Why not just set up a replica pgsql, then break the connection and upgrade it to primary?
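i.e. something like this on the standby once it has caught up (a sketch; pg_promote() needs PostgreSQL 12+):

  SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();  -- check replication lag
  SELECT pg_is_in_recovery();  -- true while it is still a standby
  SELECT pg_promote();         -- stop recovery and become a read/write primary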
It's amusing how much effort is put into ETL these days. I remember when ETL departments were filled with the sludge of the programming world. It took ETL departments weeks to generate a CSV, and it would inevitably have massive numbers of errors because they didn't actually follow the format that was specified on the form they forced everyone to use.
I only knew Snowflake the id selection algorithm, so was a bit confused, but googling "snowflake db" showed me this blurb and now I'm even more confused.
> Snowflake enables organizations to learn, build, and connect with their data-driven peers. Collaborate, build data apps & power diverse workloads in the ...
Synapse is such a disaster. I don't know of any example of a decent-sized successful deployment of it. It grinds to a halt and charges you a lot of money for it.
Snowflake is an amazing Agile database that has a great ecosystem.
Teradata still remains king if you want this type of cloud data warehouse, and it becomes price-competitive with Snowflake if you actually use it intensively (transactions are cheaper on Teradata than Snowflake).
But in the end, a good managed PostgreSQL is probably enough for 80% of the clients of Synapse/Snowflake anyway. It's just that CTOs are starting to lose technical knowledge and are more politicians nowadays.
This was a disappointing article. I was expecting it to explore validating the integrity and consistency of the data, but that just seems to be handwaved away by a short section saying PeerDB handles it. This is especially disappointing since the article calls that out as one of the cruxes that makes this so difficult.
Postgres seriously needs a columnstore-backed table type instead of just a rowstore.
MSSQL has this and it is magic.
SingleStore has it, and it is wonderful.
I'm willing to give a bounty of $1000 to whoever adds that to the main Postgres tree.
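It's not in the core tree, but the columnar table access method from the Citus extension gets reasonably close today. A rough sketch (the extension packaging and table definition here are my assumptions, not something from this thread):

  -- In recent Citus versions the columnar access method ships as its own extension.
  CREATE EXTENSION citus_columnar;

  CREATE TABLE events_columnar (
    event_time timestamptz,
    user_id    bigint,
    payload    jsonb
  ) USING columnar;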
Snowflake is great as a warehouse. Its latency is shit when it comes to fast lookups and aggregates. If you can tolerate >1s API calls, that is fine. It takes forever to insert a few rows into a large table.
If you want a proper live DB, snowflake is a rich man's poor database.
You can also connect other stores using foreign data wrappers, like parquet files stored on an object store, DuckDB, ClickHouse… though joins aren't optimised, as PostgreSQL will do a full scan on the external table when joining.
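A rough sketch with something like parquet_fdw (the column list, file path, users table, and even the option names are my assumptions and vary between wrappers):

  CREATE EXTENSION parquet_fdw;
  CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

  CREATE FOREIGN TABLE events_parquet (
    event_time timestamptz,
    user_id    bigint,
    amount     numeric
  ) SERVER parquet_srv
    OPTIONS (filename '/mnt/object-store/events.parquet');

  -- Joins work, but expect a full scan of the foreign table:
  SELECT u.name, sum(e.amount)
  FROM users u JOIN events_parquet e ON e.user_id = u.id
  GROUP BY u.name;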
SELECT create_hypertable('public.challenge_1br', by_range('time'));
Now, enjoy your better-than-Snowflake query performance at no extra cost.
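For completeness, roughly what that one-liner assumes (the column layout is my guess, and by_range() needs TimescaleDB 2.13 or newer):

  CREATE EXTENSION IF NOT EXISTS timescaledb;
  CREATE TABLE public.challenge_1br (
    time  timestamptz NOT NULL,
    value double precision
  );
  -- ...then run the create_hypertable() call above, load the data, and query:
  SELECT time_bucket('1 hour', time) AS bucket, avg(value)
  FROM public.challenge_1br
  GROUP BY bucket
  ORDER BY bucket;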