Notion seems like an interesting data storage problem. The vast majority of the data is in the `block` entity, which are organized into a sort of nested set, written/updated individually (user edits one at a time) but read in ranged chunks (a doc).
Offhand this seems like almost a worst case for PG. Since updates to blocks can contain large values (causing rows to be moved often) and everything lives in one big table, the blocks for a single Notion document will likely end up non-contiguous on disk, so reading them back out requires a lot of IO and thrashes memory. PG doesn't have a way to tell it how to organize data on disk, so there's no good way around this (CLUSTER doesn't count; it's unusable in most use cases).
Armchair engineering of course, but my first thought would be to find another storage system for blocks that better fits the use case and leave the rest in PG. That introduces other problems, but storing data like this in PG just feels like a bad fit. Maybe storing an entire doc's worth of block entities in a jsonb column would avoid a lot of this?
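A minimal sketch of what that jsonb idea could look like (table and column names here are invented for illustration, not Notion's actual schema): keep one row per document and store its blocks as an ordered jsonb array, so rendering a doc is a single-row read.

```sql
-- Hypothetical schema: one row per document, blocks stored inline as jsonb.
CREATE TABLE docs (
    id     uuid  PRIMARY KEY,
    blocks jsonb NOT NULL DEFAULT '[]'  -- ordered array of block objects
);

-- Rendering a doc becomes a single-row lookup instead of N scattered block reads.
SELECT blocks FROM docs WHERE id = $1;

-- The trade-off: editing one block rewrites the whole jsonb value.
-- jsonb_set can at least target a single element by array index.
UPDATE docs
SET blocks = jsonb_set(blocks, '{3,text}', '"updated text"')
WHERE id = $1;
```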
KV means key-value. Essentially the suggestion is to organize your data as one big key-value structure (hashmap-like).
The term usually implies the data is not normalized/separated the way it is in regular SQL.
For instance, in SQL:
A User table with columns: id, email
An Article table with columns: id, text, user_id
The KV/NoSQL equivalent:
An Article document with properties: id, text, user_email
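Spelled out as schemas (these names are just the toy example above, nothing from Notion's actual model): the normalized SQL form keeps users and articles in separate tables joined by a foreign key, while the denormalized document copies the user's email into each article so a single key lookup returns everything.

```sql
-- Normalized SQL: two tables, joined at read time.
CREATE TABLE users (
    id    bigint PRIMARY KEY,
    email text   NOT NULL
);

CREATE TABLE articles (
    id      bigint PRIMARY KEY,
    text    text   NOT NULL,
    user_id bigint NOT NULL REFERENCES users (id)
);

-- Denormalized KV/document equivalent: one self-contained value per key,
-- shown here as a jsonb blob keyed by article id.
CREATE TABLE articles_kv (
    id  bigint PRIMARY KEY,
    doc jsonb  NOT NULL
);

INSERT INTO articles_kv VALUES
(1, '{"id": 1, "text": "hello world", "user_email": "alice@example.com"}');
```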
> PG doesn't have a way to tell it how to organize data on disk so there is no good way around this (CLUSTER doesn't count, it's unusable in most use cases).
Tablespaces allow you to store tables in defined directories on the file system.
What I was talking about is controlling the ordering of rows within a table on disk. If you are going to read some group of rows together often, ideally you want those rows to be contiguous on disk, since a sequential read of a range is much faster than bouncing around to dozens of locations to collect the needed rows. This matters more as tables get very large. Imagine a 5TB `blocks` table where you need to read 50 blocks to render a given Notion doc, but those blocks are scattered all over the place on disk; that's a lot more work and it thrashes the page cache.
PG doesn't normally make any guarantees about how rows are ordered on disk, and it may move rows around when updates are made. It does have a CLUSTER operation, which re-orders rows based on the order of an index you give it, but it's a one-time operation and it locks the table while running. That makes it functionally useless for large tables that are accessed and updated frequently.
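For reference, a rough sketch of what that looks like in Postgres (the index, table, and column names are made up for the example):

```sql
-- One-off physical re-ordering by (doc_id, position); CLUSTER takes an
-- ACCESS EXCLUSIVE lock on the table for the duration.
CREATE INDEX blocks_doc_idx ON blocks (doc_id, position);
CLUSTER blocks USING blocks_doc_idx;

-- New inserts and updates are NOT kept in this order; you'd have to
-- re-run CLUSTER periodically to restore it.
```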
Some other databases do give you control over disk ordering. SQL Server, for example, has a `CLUSTERED INDEX` that you can apply to a table, and it keeps the data on disk ordered by the index, even for new insertions and updates. It costs a bit more on the write side to maintain, but it can be worth it in some cases.
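The SQL Server equivalent, again with made-up names for the same hypothetical `blocks` table, would look something like:

```sql
-- T-SQL: the clustered index defines the physical order of the table itself,
-- so blocks for the same doc_id stay adjacent even as rows are added.
CREATE CLUSTERED INDEX ix_blocks_doc
    ON blocks (doc_id, position);
```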