Moving a billion Postgres rows on a $100 budget (peerdb.io)
153 points by samaysharma 8 months ago | 68 comments



You can do something better for $0: just install the TimescaleDB Postgres extension and execute:

SELECT create_hypertable('public.challenge_1br', by_range('time'));

Now, enjoy your better-than-Snowflake query performance at no extra cost.


Ironically for many, the lack of support for arbitrary extensions on RDS, Google Cloud SQL, and the like means that if your data's stuck there, you'd need to migrate to a self-managed database instance - which means you'd need to use the strategies in the blog post anyways!

More generally, of course, it's not always a "free lunch" to have your analytics compute using the same CPU/RAM that runs your production transactional database. If you're at billion-row scales, you're likely to be at the point where mirroring to a data warehouse is becoming necessary.


Google Cloud SQL does support quite a few Postgres extensions, but unfortunately it does not support TimescaleDB, which is quite sad indeed. I hope that at some point in the future TimescaleDB can be merged into Postgres to make it a native feature.


I'm confused. Everything sounds very expensive to me.

The last table, which compares it with the other vendors, is surprising. Even Stitch Data (the cheapest) costs $1 to move 240K records (1B records / $4,166.67 ≈ 240K records per dollar). Is this real?

So, their solution costs $1 to process 13.6M records. Sounds like this is not very share-worthy.

What am I missing here?


What I want to know is why the fuck it takes 8 days to load 700MM records—in 2024.

I couldn’t even continue reading the article because it must be from 2006.


700M records in 8 days (1024 rps) is to mimic a real-world transactional (OLTP) workload. It doesn't define limits on what throughput can be achieved.


Ahhh. Thank you. That makes more sense.


Welcome to 2024 and the generation of developers raised in the cloud native world who think this is normal.

A billion rows is nothing and having $100 appear in conjunction with that is absurd unless you are doing some kind of really heavy compute or AI model training on that data.

By 2030 we’ll have those costs well up over a thousand dollars and it’ll take five or six separate SaaS systems wired together to do this. Progress!


To be fair, my machine that can process 1B rows an unlimited number of times still cost $1000 to build, so if you only need a one-off, maybe paying the $100 is better?


> Moving 1 billion rows is no easy task

This isn't an accurate premise. Modern OLAP databases make dealing with billions to trillions of rows manageable, including on a single server. Exporting "select * from table" from an OLTP database such as Postgres or MySQL into an OLAP store is trivial and quite fast, and if 100M rows/sec on commodity servers isn't fast enough, there's always performance tuning [1].

[1] https://altinity.com/blog/loading-100b-rows-in-minutes-in-al...


That sentence is meant more in the context of the blog, i.e. moving a billion rows across data stores. We will edit it to make that clearer. Thanks for the feedback.


I can't speak for Snowflake, but on Clickhouse it's quite literally [1]:

    insert into new_table select * from postgresql('postgres:5432', 'db', 'table', 'user', 'pass');
I assume it's similarly easy on Snowflake, Databricks, SingleStore, and the rest.

[1] https://clickhouse.com/docs/en/sql-reference/table-functions...


Yes, and there’s loads of opportunity to over engineer every step of the process.


data stores or data lakes with data rivers?


You’re forgetting the data dams on those data rivers


Please can we have data fish ladders next?


Here in Oregon we are removing the data dams to have more natural data flows.


Data dams adds to the visual metaphor of data breaches...


gotta stick to the rivers and lakes that you're used to


Or wouldn't COPY from CSV be much faster?


CSV-based export/import would involve writing to disk, sending the file over the network, then reading it back from the file. A SELECT is just one copy over the network, with no intermediate CSV.


You can stream CSV without writing it to a disk.
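For illustration, a minimal sketch of the streaming approach (assuming a table named challenge_1br exists on both sides, and the exporting and importing psql sessions are piped into each other so the CSV never touches disk):

    -- On the source: stream the table out as CSV over the wire, no intermediate file
    COPY (SELECT * FROM challenge_1br) TO STDOUT WITH (FORMAT csv);
    -- On the target: consume the same CSV stream directly into the table
    COPY challenge_1br FROM STDIN WITH (FORMAT csv);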


What would be the point in that?


COPY is way faster than INSERT, at least in PG.


FWIW you'd lose type information. Parquet / Arrow are much more appropriate here.


How about 11B rows and a horrible Python script over to Parquet on your wee little NAS in your homelab?


Your wee little NAS will probably still have 2x-3x better IO performance than cloud based services.


import multiprocessing and up the process count until you hit the network bottleneck.


now we're talking about innovation


I've already thought of the follow-up to this article, 'Querying a billion rows on a $XX budget.' Let me give you my answer directly: switch from Snowflake to StarRocks. It's an open-source project under the Linux Foundation, with speed that's more than adequate, especially for queries involving multiple tables. If you're interested, you might want to check it out, https://medium.com/starrocks-engineering/how-to-reduce-snowf....


StarRocks sounds too good to be true; what are the cons?


Hard to set up in Kubernetes. It seems to want you to know too many specifics about how to shard your DB ahead of time. I don't remember how it handles schema evolution. I settled for Databend because it just worked, easily and fast, with nothing more than table definitions.


Do you have a connection with StarRocks?


What do you want? You can connect with the StarRocks eng team on the slack channel. https://join.slack.com/t/starrocks/shared_invite/zt-z5zxqr0k...


If someone is price conscious, why move from postgres to snowflake?


That was my first thought as well — like who is the audience for this?

My first thought was that you could stay on Postgres and save that $100 by using the secret power of Open Source.

Well said.


Seems like the title would be more accurate if it specified Postgres to Snowflake.

In the end, there are companies paying to use Snowflake, and despite what one may believe, they aren't price-oblivious. Having their application in Postgres is a cost optimization, while still relying on Snowflake for data warehouse integrations.


Honestly, I think most Snowflake customers were a side effect of the cheap-money era.

I love snowflake but their pricing is, in fact, absurd.


Based on my experience working with Postgres users over the past decade (ex-Citus/Microsoft), I don't think Postgres is there yet to support every possible workload - e.g. medium-to-larger-scale (1TB+) workloads in real-time analytics, data warehousing, search, etc. Sure, at smaller scales it is versatile enough to support pretty much any workload. That is why it is super common for companies to complement Postgres with other data stores. I don't mean to say that Postgres won't get there - I think it will! But I see that as more of a long-term thing.


We operate with 80 TB of data at the moment. It is spread across several nodes and meta nodes (this is our own terminology). All Postgres.

Recently we needed to move data from one DB to another, about 600M records. It is not the biggest chunk of the data, but we needed it on a different server because we use FTS a lot and didn't want to interrupt other operations on the previous server. It took 3 days and cost $0.


Thanks for the context! Totally understand where you are coming from. Postgres can be moulded to work for many use cases. However, it can take a good amount of effort to make it happen. For example, in your case, building and managing a sharded Postgres environment isn't straightforward; it requires quite a lot of time and expertise. Citus automated exactly this (sharding), though it wasn't a fit for every workload. https://docs.citusdata.com/en/v12.1/get_started/what_is_citu...
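For anyone curious, the sharding step that Citus automates boils down to something like this (a minimal sketch; the table and distribution column are made up):

    CREATE EXTENSION citus;                                    -- on the coordinator node
    SELECT create_distributed_table('events', 'account_id');  -- shard the table across workers by account_id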


TBH I was never bothered by these mental restrictions. I can't say that anything we do requires more effort than a normal human brain can handle (and we aren't the smartest people on the planet).

I used to work on a project where we processed a big share of all shops' cash receipts in one of the biggest European countries. We didn't use any of these products, and it was done by one person.

The only stupid idea we had was to use AWS. Learned helplessness pushes people away from the best product on the market just because it doesn't come with a salesman who tickles your balls.

Postgres is one of the best products on the market. But so much FUD creates space for "problem solvers" for problems that never existed. I'm not talking about Citus specifically, I'm talking about the idea that it requires a lot of effort to build something around Postgres.


Is your workload analytical or transactional?


Both. We process all the data in several blockchains and make public reports available for our users. Some of the data can be processed once, while other data needs to be recalculated on a schedule. So we process every block and every log entity in a transactional manner, and we also have a lot of data that gets processed in several ways a 2nd/3rd/etc. time.

We are also slowly evolving our internal analytics/intelligence. It is not something that generates high load yet, but it will at some point. Imagine something like dune.com.


> I don't think Postgres is there yet to support every possible workload

No system targets every possible workload. If that’s your goal, change your goal to make it more achievable.

With work (e.g. a cluster of Postgres instances), you can do quite a lot with Postgres if you think it through. If you want an out of the box solution to a specific problem, it may make sense to move away from Postgres. Of course another option is just to change the overall design/deployment of your Postgres cluster.

At some point you’ll need to do some custom work. Unless you plan on moving away from your new system next year because it doesn’t cover every possible workload.


Since when is 1TB+ medium-to-large scale? That easily fits on a single computer running Postgres; you could even keep almost all of it in memory, depending on the server.


Sure, but the workload could be compute-bound rather than memory-bound. It also depends on which query workloads the Postgres planner and executor are optimized for. The TPC-H benchmark is an example where vanilla Postgres is not as optimal as pure analytical stores.


It's not about being price-conscious, it's about finding an efficient way to replicate data into your data warehouse.


OP here, thanks for chiming in. Yep, the blog talks about how efficient you can be while replicating data from Postgres to data warehouses. Also, just for some context, the inspiration behind writing this blog was a common concern from customers that data movement is expensive, along with the insanely crazy margins in this space.


Then they should take out the "on a $100 budget" from the title.


Thanks for chiming in. The concepts/ideas in this blog can be extended to other target data stores too, say ClickHouse.


Why not just set up a Postgres replica, then break the connection and promote it to primary?
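For illustration, a minimal sketch of that approach (assuming streaming replication is set up in the usual way; the slot name is made up):

    -- On the source: reserve WAL so the new replica can catch up without gaps
    SELECT pg_create_physical_replication_slot('analytics_clone');
    -- Seed the replica with pg_basebackup, point its primary_conninfo at the source,
    -- and once replication lag is near zero, cut it loose as its own primary:
    SELECT pg_promote();  -- run on the replica (PostgreSQL 12+)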

It's amusing how much effort is put into ETL these days. I remember when ETL departments were filled with the sludge of the programming world. It took ETL departments weeks to generate a CSV, and it would inevitably have massive numbers of errors because they didn't actually follow the format that was specified on the form they forced everyone to use.


When thinking about solving a problem, one of the questions to ask is whether you want a challenge or a solution.

Engineers, on average, want to be challenged, which impacts the options chosen.

There may also have been other factors, not mentioned here.


If someone asked me how much it would cost to move a billion Postgres rows, I would say "probably under $100."

I just had to move 500 million "rows" into S3, and it came in at about $100. I would expect S3 to be more expensive.


(edit: mostly offtopic observation follows)

I only knew Snowflake as the ID-generation algorithm, so I was a bit confused, but googling "snowflake db" showed me this blurb and now I'm even more confused.

> Snowflake enables organizations to learn, build, and connect with their data-driven peers. Collaborate, build data apps & power diverse workloads in the ...


Snowflake & Azure Synapse are competing products, if that helps.


Synapse is such a disaster. I don't know of any example of a decent-sized successful deployment. It grinds to a halt and charges you a lot of money for it.

Snowflake is an amazing agile database with a great ecosystem. Teradata still remains king if you want this type of cloud data warehouse, and it becomes price-competitive with Snowflake if you actually use it intensively (transactions are cheaper on Teradata than on Snowflake).

But in the end, a good managed PostgreSQL is probably enough for 80% of the clients of Synapse/Snowflake anyway. It's just that CTOs are starting to lose technical knowledge and are more like politicians nowadays.


This was a disappointing article. I was expecting it to explore validating the integrity and consistency of the data, but that just seems to be handwaved away by a short section saying PeerDB handles it. This is especially disappointing since the article calls that out as one of the cruxes that makes this so difficult.


Does the Elastic License fall under open source as defined by the OSI?


No it does not.


> Moving a Billion Postgres Rows on a $100 Budget

In a magical universe where your time is free.


Postgres seriously needs a columnstore-backed table instead of just a rowstore.

MSSQL has this and it is magic. SingleStore has it, and it is wonderful.

I'm willing to give a bounty of $1000 to whoever adds that into main postgres tree.

Snowflake is great as a warehouse, but its latency is shit when it comes to fast lookups and aggregates. If you can tolerate >1s API calls, that is fine. It takes forever to insert even a few rows into a large table.

If you want a proper live DB, snowflake is a rich man's poor database.


Columnar-store PostgreSQL extensions exist; here are two, but I think I'm missing at least one other:

https://github.com/citusdata/cstore_fdw

https://github.com/hydradatabase/hydra

You can also connect other stores using foreign data wrappers, like Parquet files stored on an object store, DuckDB, ClickHouse… though joins aren't optimised, as PostgreSQL will do a full scan on the external table when joining.
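For a rough idea of the extension route, here's a minimal sketch using the columnar table access method that Hydra (and Citus columnar) provide; the extension name and table are assumptions and may differ by distribution:

    CREATE EXTENSION IF NOT EXISTS columnar;             -- assumed name; ships with Hydra / Citus columnar
    CREATE TABLE events_columnar (
        ts      timestamptz,
        user_id bigint,
        amount  numeric
    ) USING columnar;                                     -- column-oriented storage instead of the default heap
    INSERT INTO events_columnar SELECT * FROM events;     -- backfill from an existing rowstore table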


ParadeDB is working on this problem: https://github.com/paradedb/paradedb/tree/dev/pg_analytics


> I'm willing to give a bounty of $1000 to whoever adds that into main postgres tree.

This is the most disrespectful thing I've read on HN.


I'm sorry my intent came off in the wrong way. I am not super wealthy.


Know anybody who could do it in a weekend? That’s about what you’re paying for.


For time series, you have the TimescaleDB extension.

https://www.timescale.com/



