How to Distribute Postgres Geographically (xata.io)
129 points by mebcitto 8 months ago | 15 comments



(I work at Notion, one of the larger Notion clones)

We experimented with using partitioned tables and doing the fancy postgres_fdw thing when we sharded our tenant-clustered tables in 2020. Even with all the Postgres instances in the same region we found the approach unwieldy. Routing queries in the application gives you a lot more control versus needing to teach Postgres the exact logic to route a query, plus do a (dangerous? one-way? global?) schema migration whenever you want to change the routing logic.
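
To make "routing in the application" concrete, here's a rough sketch of what tenant-based routing can look like (illustrative only: the shard map, schema, and psycopg2 usage below are invented, not our actual code):

    # Rough sketch of application-side shard routing (hypothetical).
    import psycopg2

    # Static shard map; in practice this lives in config or a small metadata
    # service so it can change without any Postgres schema migration.
    SHARD_DSNS = {
        0: "host=shard0.db.internal dbname=app",
        1: "host=shard1.db.internal dbname=app",
    }

    def shard_for(workspace_id: int) -> int:
        # Route by tenant: all rows for a workspace live on one shard.
        return workspace_id % len(SHARD_DSNS)

    def fetch_blocks(workspace_id: int):
        dsn = SHARD_DSNS[shard_for(workspace_id)]
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id, type FROM blocks WHERE workspace_id = %s",
                (workspace_id,),
            )
            return cur.fetchall()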

We touched on this briefly in our sharding blog post here: https://www.notion.so/blog/sharding-postgres-at-notion

Another reason to avoid complex routing in Postgres is risk. If something goes wrong at 3am in a load-bearing cross-Postgres-instance query, how easy will it be to mitigate if the routing is happening in Postgres (with whatever advanced clustering system: Citus, pgzx, postgres_fdw) versus if the routing is happening in the application? For example, what if there’s a network partition between the “global” Postgres instance and the us-east-2 cluster? Maybe you’re Postgres wizards and know how to handle this with a quick schema change or something in the psql CLI, but I’d bet more on my team’s ability to write fault tolerance in the application than in Postgres internal logic.


Thanks a lot for commenting and pointing me to the blog post. I do think I've seen it before but forgot about it. I've re-read it now and it's a great read!

From what I understand you decided to do sharding in the application code, and given the current state I think that makes total sense and I'd have probably done the same.

Part of my point with the blog post is that there is a built-in horizontal sharding solution in vanilla Postgres (partitioning + FDW), but it's currently badly lacking when it comes to cluster management, schema changes, distributed transactions, and more. If we put work into it, perhaps we tip the balance and the next successful Notion clone could choose to do it at the DB level.
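
To give a flavour of that built-in path, here is a rough sketch of a list-partitioned table where one partition is local and another is a postgres_fdw foreign table on a remote instance. The schema, host names, and credentials are invented for illustration, and the gaps I mentioned (cluster management, schema changes, distributed transactions) are exactly what this doesn't cover:

    # Hypothetical DDL for partitioning + FDW, applied from a Python script.
    import psycopg2

    DDL = """
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER IF NOT EXISTS shard_us_east
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'us-east.db.internal', dbname 'app', port '5432');

    CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
        SERVER shard_us_east OPTIONS (user 'app', password 'app');

    CREATE TABLE IF NOT EXISTS documents (
        tenant_id bigint NOT NULL,
        region    text   NOT NULL,
        body      jsonb
    ) PARTITION BY LIST (region);

    -- Rows for this region stay local.
    CREATE TABLE IF NOT EXISTS documents_eu
        PARTITION OF documents FOR VALUES IN ('eu');

    -- Rows for the other region are read and written through the FDW.
    CREATE FOREIGN TABLE IF NOT EXISTS documents_us_east
        PARTITION OF documents FOR VALUES IN ('us-east')
        SERVER shard_us_east OPTIONS (table_name 'documents_us_east');
    """

    with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
        cur.execute(DDL)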


Thanks for this commentary! I'm at a startup where we are preparing to shard Postgres. I'd be curious whether you're familiar with AWS Limitless, and how you would have approached deciding whether to use it vs. the approach in the blog post had it existed back in 2021.


I’m a solid “no” on any primary-datastore database thingy that has under 5 years of industry-wide production-workload experience, and I would seriously weigh it against something with 10+ years of industry use.

In 2019 when I was interviewing at companies for my next position I heard from a few places that the original Aurora for Postgres lost their data. It seems like the sentiment on mainline Aurora has improved a bit, but I would never bet my company’s future on an AWS preview technology. Better the devil you know (and everyone else knows).


What about CockroachDB? There are real-world, large-scale deployments of it (e.g. Netflix) going back more than 5 years, easily.


It might be a good choice, I don't know enough about either the technology or the market for CockroachDB expertise.


My biggest concern with Limitless – other than inherent performance issues with Aurora – is that according to their docs, it’s built on Aurora Serverless. IME, Serverless anything tends to get absurdly expensive very quickly.


While it is fun to see how to creatively solve such issues, it does raise the question of manageability. When sharding data into loosely (FDW-)coupled silos, it becomes tricky to make consistent backups, ensure locking mechanisms work when sharded data might sometimes be directly related, handle zone/region failures gracefully, prevent hot spots, perform multi-region schema changes reliably, etc. I suppose this pattern principally only works when the data is in fact not strongly related and the silos are quite independent. I wouldn't call that a distributed system at all, really. This may be a matter of opinion, of course.

It does give a "When all you have is a hammer..." vibe to me and begs the question: why not use a system that's designed for use cases like this and do it reliably and securely? i.e.: https://www.cockroachlabs.com/docs/stable/multiregion-overvi... (yes, I know full data domiciling requires something even more strict, but I currently don't know of any system that can transparently span the globe and stay performant while not sharing any metadata or caching between regions)


> It does give a "When all you have is a hammer..." vibe to me and begs the question: why not use a system that's designed for use cases like this and do it reliably and securely?

(disclaimer: blog post author)

A reason would be that you want to stick to pure Postgres, for example because you want to use Postgres extensions, or prefer the liberal Postgres license.

It can also be a matter of performance: distributed transactions are necessarily slower, so if you can almost always avoid them by connecting to a single node that has all the data the transaction needs, you're going to get better performance.


Hi there! Thank you for the post and your work on pgzx! Those are of course valid reasons, though it depends on the system (CockroachDB can place leaders on specific nodes to speed up local queries, it has global tables, and otherwise there are always follower reads). Admittedly, if you want to keep data "pinned" you're into manual placement rather than horizontal scaling, but that's nitpicking and there are pros and cons. I do enjoy the freedom of Postgres and am hopeful that its virtually prehistoric storage design becomes a non-issue thanks to tech such as Neon and OrioleDB. The option to decouple storage would provide wonderful flexibility for solutions like yours too, I think.


Thanks for noticing pgzx! We are working on a project building on top of it and going into the direction hinted by this blog post.

I agree that the separation of storage and compute complements this nicely. In fact, we take advantage of it in the Xata platform which uses Aurora.


Ever since the war of stored procedures, I've been very reluctant to put any more logic than absolutely necessary into the database. The simple truth of the matter is that DB server codebases are awful, complicated places to develop this kind of functionality. I strongly believe these features should be handled at higher layers.

At my last job, in fintech, we used application-level shard routing, with each aggregate root served by a particular RPC service (users, accounts, transactions, etc.). When one of these aggregate services was asked to pull data, it would query a central routing service for the server and shard the requested data resided in.

Between them and the DB servers we had a cluster of pgbouncer instances to handle connection oversubscription and the routing of queries around DB instances in the face of failures or maintenance.
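
Schematically, the lookup flow was something along these lines (a simplified, hypothetical sketch rather than our real code; the routing service API and field names are invented):

    # An aggregate service asks the central routing service which DB
    # host/schema holds an entity, then queries that shard via the
    # pgbouncer tier. All names here are made up.
    import psycopg2
    import requests

    ROUTING_SERVICE = "http://routing.internal"

    def locate(entity: str, entity_id: int) -> dict:
        # e.g. {"host": "pgbouncer-3.internal", "schema": "shard_42"}
        resp = requests.get(f"{ROUTING_SERVICE}/locate/{entity}/{entity_id}")
        resp.raise_for_status()
        return resp.json()

    def get_account(account_id: int):
        loc = locate("accounts", account_id)
        dsn = f"host={loc['host']} dbname=app options='-c search_path={loc['schema']}'"
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT * FROM accounts WHERE id = %s", (account_id,))
            return cur.fetchone()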

While this was pretty effective, the work to manage sharded Postgres, DDL updates, caching, locating, and balancing data was still very complicated and error prone, and was the root of many production outages.

I didn't design this system, but after a couple years leading a platform migration, which necessitated getting into all this at a deep level, I would do it differently. Instead of query routing and DB sharding, I would shard the aggregate root services themselves. Each "sharded" set of these instances would have a simple PG instance (with replica chain) behind it that knows nothing about any other set.

At this point, instead of the aggregate root services locating and querying data from DB shards on huge vertically scaled servers, each set only pulls data from its dedicated DB. A routing facade is placed in front of all the sets and sends each consumer request to the set holding the desired data.

With this architecture, the complexity of sharding and vertically scaling at the DB layer, and handling connection oversubscription with a query router like PGbouncer, just falls away.
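
A facade along those lines could be as simple as this (hypothetical sketch; the endpoints and placement logic are invented):

    # Route each request to the service set that owns the tenant; each set
    # talks only to its own small Postgres instance. Names are made up.
    import requests

    SET_ENDPOINTS = {
        "set-a": "http://accounts.set-a.internal",
        "set-b": "http://accounts.set-b.internal",
    }

    def owning_set(tenant_id: int) -> str:
        # Placement could be hash-based or a small lookup table in the facade.
        return "set-a" if tenant_id % 2 == 0 else "set-b"

    def get_account(tenant_id: int, account_id: int):
        base = SET_ENDPOINTS[owning_set(tenant_id)]
        resp = requests.get(f"{base}/accounts/{account_id}")
        resp.raise_for_status()
        return resp.json()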

I would keep these sets, and their DBs, rather small to also reap the benefits of small datasets. One of the biggest issues we had with huge DBs was the time it takes to do anything; as an example, restarting replication in a failed chain could take many hours because of the huge amount of data, and any hiccups in connectivity would quickly overrun our capability to "catch" replication up, so we'd have to fall back to zfs-send to resnap.

A larger number of smaller DBs would not improve the total time needed to do something like backups or DDL mutations, but it would significantly reduce the time for any particular instance which reduces overall risk and blast radius of a DB failure.

Other things I think small DBs can help with are data locality, DB automation, and possibly making the instances effectively ephemeral. When your dataset is small, bringing an instance up from a hot backup can take on the order of a few seconds, which could allow you to schedule the root services and their DB on the same host.

For geographical distribution, the routing facade can also send writes to the set serving that shard in some other region.


> Ever since the war of stored procedures, I've been very reticent to put any more logic than absolutely necessary into the database.

Counterpoint: the fewer round trips to the DB you have to do, the faster your app is.
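
As a trivial illustration (made-up table and code), pushing a per-row loop into a single statement, or into a function on the server, turns N round trips into one:

    # Contrast N round trips with one, against a hypothetical "balances" table.
    import psycopg2

    ids = [1, 2, 3, 4, 5]

    with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
        # N round trips: one UPDATE per id.
        for i in ids:
            cur.execute("UPDATE balances SET amount = amount - 1 WHERE id = %s", (i,))

        # One round trip: the same work in a single statement.
        cur.execute("UPDATE balances SET amount = amount - 1 WHERE id = ANY(%s)", (ids,))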

My belief is that as DBAs more or less went away thanks to cloud providers and the rise of Full Stack Engineering, the quiet part that no one wanted to admit was that RDBMS are hideously complicated, and you really need SMEs to use them correctly. Thus, the use of stored procedures, triggers, advanced functions etc. went away in favor of just treating the DB as a dumb store, and doing everything in app logic.

As more and more companies are discovering that actually, you do need DB SMEs (I’m a DBRE; demand for this role has skyrocketed), my hope is we can push some of the logic back out to the DB. I am all for VCS, automated migrations, canary instances, and other careful patterns. I’d just like to see less of treating RDBMS as a document store, and embracing the advantages of the tech already in place.


I kind of agree with your assessment; it would even speak to the popularity of document DBs coming up around the same time. However, I would argue that the decline in deep RDBMS integrations is exactly because of their excessive complexity. Modern software development methods left them behind because their change velocity is awful. This is doubly apparent when you look at features like the ones TFA is talking about, exactly because they're extremely complex to stand up, manage, and make changes to.

I personally left that behind in the era of ORMs, exactly because they let me build systems in high-productivity languages and frameworks, but also because the cost-benefit of RDBMS integration just wasn't there. The DB is absolutely the highest-risk place to put logic in any stack: if something goes wrong, _everything_ breaks, and it's the most difficult place to fix things.

I don't know what the future looks like for DBs, but I think we'll continue to see them become even more transparent. Devs just want to persist entities, and integrating at the RDBMS level is a slog.


I’m obviously biased because of my career speciality, but I love everything about RDBMS. I love mastering the million knobs they have, their weird quirks, and how if you use them as designed with properly normalized schema, your data will ALWAYS be exactly what you told it to be, with no surprises.

To this end, “devs just want to persist entities” makes me both sad and frustrated. Everyone seems to default to “just use Postgres,” but then doesn't want to actually use an RDBMS. If you want a KV store, then use one. If you want a document store, then use one. Don't subject a relational database to storing giant JSON blobs with UUIDv4 keys, and then complain that it's slow because you've filled the memory with bloated indices and saturated the disk bandwidth with massive write amplification.



