Moving a billion Postgres rows on a $100 budget (peerdb.io)
153 points by samaysharma 8 months ago | 68 comments



You can do something better for $0: just install the TimescaleDB Postgres extension and execute:

SELECT create_hypertable('public.challenge_1br', by_range('time'));

Now, enjoy your better-than-Snowflake query performance at no extra cost.


Ironically for many, the lack of support for arbitrary extensions on RDS, Google Cloud SQL, and the like means that if your data's stuck there, you'd need to migrate to a self-managed database instance - which means you'd need to use the strategies in the blog post anyways!

More generally, of course, it's not always a "free lunch" to have your analytics compute using the same CPU/RAM that runs your production transactional database. If you're at billion-row scales, you're likely to be at the point where mirroring to a data warehouse is becoming necessary.


Google Cloud SQL does support quite a few Postgres extensions, but unfortunately it does not support TimescaleDB, which is quite sad indeed. I hope that at some point in the future TimescaleDB can be merged into Postgres to make it a native feature.


I'm confused. Everything sounds very expensive to me.

The last table, which compares it with the other vendors, is surprising. Even Stitch Data (the cheapest) costs $1 to move 240K records (1B records / $4,166.67 ≈ 240K records per dollar). Is this real?

So, their solution costs $1 to process 13.6M records. Sounds like this is not very share-worthy.

What am I missing here?


What I want to know is why the fuck it takes 8 days to load 700MM records—in 2024.

I couldn’t even continue reading the article because it must be from 2006.


700M records in 8 days (1024 rps) is to mimic a real-world transactional (OLTP) workload. It doesn't define limits on what throughput can be achieved.


Ahhh. Thank you. That makes more sense.


Welcome to 2024 and the generation of developers raised in the cloud native world who think this is normal.

A billion rows is nothing and having $100 appear in conjunction with that is absurd unless you are doing some kind of really heavy compute or AI model training on that data.

By 2030 we’ll have those costs well up over a thousand dollars and it’ll take five or six separate SaaS systems wired together to do this. Progress!


To be fair, my machine that can process 1B rows an unlimited number of times still cost $1000 to build, so if you only need a one-off, maybe paying the $100 is better?


> Moving 1 billion rows is no easy task

This isn't an accurate premise. Modern OLAP databases make dealing with billions to trillions of rows manageable, including on a single server. Exporting "select * from table" from an OLTP database such as Postgres or MySQL into an OLAP store is trivial and quite fast, and if 100M rows/sec on commodity servers isn't fast enough, there's always performance tuning [1].

[1] https://altinity.com/blog/loading-100b-rows-in-minutes-in-al...


That sentence is meant more in the context of the blog, i.e. moving a billion rows across data stores. We will edit it to make that clearer. Thanks for the feedback.


I can't speak for Snowflake, but on Clickhouse it's quite literally [1]:

    insert into new_table select * from postgresql('postgres:5432', 'db', 'table', 'user', 'pass');
I assume it's similarly easy on Snowflake, Databricks, SingleStore, and the rest.

[1] https://clickhouse.com/docs/en/sql-reference/table-functions...


Yes, and there’s loads of opportunity to over engineer every step of the process.


data stores or data lakes with data rivers?


You’re forgetting the data dams on those data rivers


Please can we have data fish ladders next?


Here in Oregon we are removing the data dams to have more natural data flows.


Data dams adds to the visual metaphor of data breaches...


gotta stick to the rivers and lakes that you're used to


Or wouldn't COPY from CSV be much faster?


CSV-based export/import would involve writing to disk, sending the file over the network, then reading it back from the file. A SELECT is just one copy over the network, with no intermediate CSV.


You can stream CSV without writing it to a disk.
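For illustration, a minimal sketch of the streaming approach (assuming a table named challenge_1br exists on both sides, and the exporting and importing psql sessions are piped into each other so the CSV never touches disk):

    -- On the source: stream the table out as CSV over the wire, no intermediate file
    COPY (SELECT * FROM challenge_1br) TO STDOUT WITH (FORMAT csv);
    -- On the target: consume the same CSV stream directly into the table
    COPY challenge_1br FROM STDIN WITH (FORMAT csv);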


What would be the point in that?


COPY is way faster than INSERT, at least in PG.


FWIW you'd lose type information. Parquet / Arrow are much more appropriate here.


How about 11B rows and a horrible Python script over to Parquet on your wee little NAS in your homelab?


Your wee little NAS will probably still have 2x-3x better IO performance than cloud based services.


import multiprocessing and up the process count until you hit the network bottleneck.


now we're talking about innovation


I've already thought of the follow-up to this article, 'Querying a billion rows on a $XX budget.' Let me give you my answer directly: switch from Snowflake to StarRocks. It's an open-source project under the Linux Foundation, with speed that's more than adequate, especially for queries involving multiple tables. If you're interested, you might want to check it out, https://medium.com/starrocks-engineering/how-to-reduce-snowf....


StarRocks sounds too good to be true; what are the cons?


Hard to set up in Kubernetes. It seems to want you to know too many specifics about how to shard your DB ahead of time. I don't remember how it handles schema evolution. I settled for Databend because it just worked, easily and fast, with nothing more than table definitions.


Do you have a connection with StarRocks?


What do you want? You can connect with the StarRocks eng team on the slack channel. https://join.slack.com/t/starrocks/shared_invite/zt-z5zxqr0k...


If someone is price conscious, why move from postgres to snowflake?


That was my first thought as well — like who is the audience for this?

My first thought was that you could stay on Postgres and save that $100 by using the secret power of Open Source.

Well said.


Seems like the title would be more accurate if it specified Postgres to Snowflake.

In the end, there are companies paying to use Snowflake, and despite what one may believe, they aren't price-oblivious. Having their application in Postgres is a cost optimization, while still relying on Snowflake for data warehouse integrations.


Honestly, I think most Snowflake customers were a side effect of the cheap-money era.

I love snowflake but their pricing is, in fact, absurd.


Based on my experience working with Postgres users over the past decade (ex-Citus/Microsoft), I don't think Postgres is there yet to support every possible workload - e.g. medium-to-larger-scale (1TB+) workloads in real-time analytics, data warehousing, search, etc. Sure, at smaller scales it is versatile enough to support pretty much any workload. That is why it is super common for companies to complement Postgres with other data stores. I don't mean to say that Postgres won't get there - I think it will! But I see that as more of a long-term thing.


We operate with 80 TB of data at the moment. It is spread across several nodes and meta nodes (this is our own terminology). All Postgres.

Recently we needed to move data from one DB to another, about 600M records. It is not the biggest chunk of the data, but we needed it on a different server because we use FTS a lot and didn't want to interrupt other operations on the previous server. It took 3 days and cost $0.


Thanks for the context! Totally understand where you are coming from. Postgres can be moulded to work for many use cases. However, it can take a good amount of effort to make it happen. For example, in your case, building and managing a sharded Postgres environment isn't straightforward; it requires quite a lot of time and expertise. Citus automated exactly this (sharding), though it wasn't a fit for every workload. https://docs.citusdata.com/en/v12.1/get_started/what_is_citu...
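For anyone curious, the sharding step that Citus automates boils down to something like this (a minimal sketch; the table and distribution column are made up):

    CREATE EXTENSION citus;                                    -- on the coordinator node
    SELECT create_distributed_table('events', 'account_id');  -- shard the table across workers by account_id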


TBH I was never bothered by these mental restrictions. I can't say that anything we do requires more effort than a normal human brain can handle (and we aren't the smartest people on the planet).

I used to work on a project where we processed a big share of all shops' cash receipts in one of the biggest European countries. We didn't use any of these products, and it was done by one person.

The only stupid idea we had was to use AWS. Learned helplessness pushes people away from the best product on the market just because it doesn't come with a salesman who tickles your balls.

Postgres is one of the best products on the market. But so much FUD creates space for "problem solvers" for problems that never existed. I'm not talking about Citus specifically, I'm talking about the idea that it requires a lot of effort to build something around Postgres.


Is your workload analytical or transactional?


Both. We process all the data in several blockchains and make public reports available for our users. Some of the data can be processed once, while other data needs to be recalculated on a schedule. So we process every block and every log entity in a transactional manner, and we also have a lot of data that gets processed in several ways a 2nd/3rd/etc. time.

We are also slowly evolving our internal analytics/intelligence. It is not something that generates high load yet, but it will at some point. Imagine something like dune.com.


> I don't think Postgres is there yet to support every possible workload

No system targets every possible workload. If that’s your goal, change your goal to make it more achievable.

With work (e.g. a cluster of Postgres instances), you can do quite a lot with Postgres if you think it through. If you want an out of the box solution to a specific problem, it may make sense to move away from Postgres. Of course another option is just to change the overall design/deployment of your Postgres cluster.

At some point you’ll need to do some custom work. Unless you plan on moving away from your new system next year because it doesn’t cover every possible workload.


Since when is 1TB+ medium-to-large scale? That easily fits on a single computer running Postgres; you could even keep almost all of it in memory, depending on the server.


Sure, but the workload could be compute-bound rather than memory-bound. It also depends on which query workloads the Postgres planner and executor are optimized for. The TPC-H benchmark is an example where vanilla Postgres is not as optimal as pure analytical stores.


It's not about being price-conscious, it's about finding an efficient way to replicate data into your data warehouse.


OP here, thanks for chiming in. Yep, the blog talks about how efficient you can be while replicating data from Postgres to data warehouses. Also, just for some context, the inspiration behind writing this blog was a common concern from customers that data movement is expensive, along with the insanely crazy margins in this space.


Then they should take out the "on a $100 budget" from the title.


Thanks for chiming in. The concepts/ideas in this blog can be extended to other target data stores too, say ClickHouse.


Why not just set up a Postgres replica, then break the connection and promote it to primary?
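For illustration, a minimal sketch of that approach (assuming streaming replication is set up in the usual way; the slot name is made up):

    -- On the source: reserve WAL so the new replica can catch up without gaps
    SELECT pg_create_physical_replication_slot('analytics_clone');
    -- Seed the replica with pg_basebackup, point its primary_conninfo at the source,
    -- and once replication lag is near zero, cut it loose as its own primary:
    SELECT pg_promote();  -- run on the replica (PostgreSQL 12+)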

It's amusing how much effort is put into ETL these days. I remember when ETL departments were filled with the sludge of the programming world. It took ETL departments weeks to generate a CSV, and it would inevitably have massive numbers of errors because they didn't actually follow the format that was specified on the form they forced everyone to use.


When thinking about solving a problem, one of the questions to ask is whether you want a challenge or a solution.

Engineers, on average, want to be challenged, which impacts the options chosen.

There may also have been other factors, not mentioned here.


If someone asked me how much it would cost to move a billion Postgres rows, I would say "probably under $100."

I just had to move 500 million "rows" into S3, and it came in at about $100. I would expect S3 to be more expensive.


(edit: mostly offtopic observation follows)

I only knew Snowflake as the ID-generation algorithm, so I was a bit confused, but googling "snowflake db" showed me this blurb and now I'm even more confused.

> Snowflake enables organizations to learn, build, and connect with their data-driven peers. Collaborate, build data apps & power diverse workloads in the ...


Snowflake & Azure Synapse are competing products, if that helps.


Synapse is such a disaster. I don't know of any example of a decent-sized successful deployment. It grinds to a halt and charges you a lot of money for it.

Snowflake is an amazing agile database with a great ecosystem. Teradata still remains king if you want this type of cloud data warehouse, and it becomes price-competitive with Snowflake if you actually use it intensively (transactions are cheaper on Teradata than on Snowflake).

But in the end, a good managed PostgreSQL is probably enough for 80% of the clients of Synapse/Snowflake anyway. It's just that CTOs are starting to lose technical knowledge and are more like politicians nowadays.


This was a disappointing article. I was expecting it to explore validating the integrity and consistency of the data, but that just seems to be handwaved away by a short section saying PeerDB handles it. This is especially disappointing since the article calls that out as one of the cruxes that makes this so difficult.


Does the Elastic License fall under open source as defined by the OSI?


No it does not.


> Moving a Billion Postgres Rows on a $100 Budget

In a magical universe where your time is free.


Postgres seriously needs a columnstore-backed table instead of just a rowstore.

MSSQL has this and it is magic. SingleStore has it, and it is wonderful.

I'm willing to give a bounty of $1000 to whoever adds that into main postgres tree.

Snowflake is great as a warehouse, but its latency is shit when it comes to fast lookups and aggregates. If you can tolerate >1s API calls, that is fine. It takes forever to insert even a few rows into a large table.

If you want a proper live DB, snowflake is a rich man's poor database.


Columnar-store PostgreSQL extensions exist; here are two, but I think I'm missing at least one other:

https://github.com/citusdata/cstore_fdw

https://github.com/hydradatabase/hydra

You can also connect other stores using foreign data wrappers, like Parquet files stored on an object store, DuckDB, ClickHouse… though joins aren't optimised, as PostgreSQL will do a full scan on the external table when joining.
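For a rough idea of the extension route, here's a minimal sketch using the columnar table access method that Hydra (and Citus columnar) provide; the extension name and table are assumptions and may differ by distribution:

    CREATE EXTENSION IF NOT EXISTS columnar;             -- assumed name; ships with Hydra / Citus columnar
    CREATE TABLE events_columnar (
        ts      timestamptz,
        user_id bigint,
        amount  numeric
    ) USING columnar;                                     -- column-oriented storage instead of the default heap
    INSERT INTO events_columnar SELECT * FROM events;     -- backfill from an existing rowstore table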


ParadeDB is working on this problem: https://github.com/paradedb/paradedb/tree/dev/pg_analytics


> I'm willing to give a bounty of $1000 to whoever adds that into main postgres tree.

This is the most disrespectful thing I've read on HN.


I'm sorry my intent came off in the wrong way. I am not super wealthy.


Know anybody who could do it in a weekend? That’s about what you’re paying for.


For time series, you have the TimescaleDB extension.

https://www.timescale.com/



