I think the one big problem with BLOBs, especially if you have a heavily read-biased DB, is that you're going to run up against bandwidth/throughput as a bottleneck. One of the DBs I help maintain has some very large JSON columns, and we see this frequently when traffic is at its peak: simply pulling the data down from Postgres becomes the bottleneck.
If the data is frequently accessed, it also means there are extra hops the data has to take before getting to the user. It's a lot faster to pull static files from S3 or a CDN (or even just a dumb static file server) than it is to round trip through your application to the DB and back. For one, it's almost impossible to stream the response, so the whole BLOB needs to be copied in memory in each system it passes through.
It's rare that any request for, say, user data would also return the user avatar, and so you ultimately just end up with one endpoint for structured data and one to serve binary BLOB data which have very little overlap except for ACL stuff, but signed S3 URLs will get you the same security properties with much better performance overall.
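A minimal sketch of the signed-URL approach, assuming boto3 and a hypothetical bucket/key layout: the app still does the ACL check, but the bytes never pass through it.

```python
# Minimal sketch with boto3: instead of proxying avatar bytes through the
# app and DB, hand the client a short-lived signed URL and let it fetch
# the object from S3 directly. Bucket/key names are made up for illustration.
import boto3

s3 = boto3.client("s3")

def avatar_url(user_id: str) -> str:
    # The ACL check ("may the caller see this user?") stays in the app;
    # the signature just makes the URL unguessable and time-limited.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-avatars-bucket", "Key": f"avatars/{user_id}.jpg"},
        ExpiresIn=300,  # link expires after 5 minutes
    )
```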
I’m sure you’re right, but it’s unbelievable to me how cost effective people claim S3 is. I’ve just never been able to get the pricing calculator to show me “cheap”. And I guess I’ve never really gotten comfortable with the access patterns, as it’s not a file system.
Where could I explore situations where people have used S3 under extremely favorable conditions relative to local storage, in terms of price, access latency, and any other relevant dimension?
I want to believe, it’s just hard for me to go all in on an Amazon API.
That's the catch. S3 is ideal for when the sum total of your blobs can't easily fit on local storage - unless you want to use a NAS, SAN or something else with a load of spinning rust.
Storing your data on a single HDD you got off NewEgg will always win if you only use the one metric of $/GB.
S3's main draw isn't $/GB. It's actually more like ($/GB) * features
E.g. 11 9s of durability, Lambda events, bucket policies, object tags, object versioning, object lock, cross-region replication
Doing that with anything over 100TB starts to get very expensive very quickly. Especially if you need that data for, you know, your business to survive...
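To make "features" concrete, here's a rough boto3 sketch (bucket name hypothetical, IAM setup omitted) of the kind of thing you get with a couple of API calls and would otherwise have to build yourself:

```python
# Rough sketch (boto3) of two of the features mentioned above.
# Bucket name is hypothetical; permissions/IAM setup not shown.
import boto3

s3 = boto3.client("s3")
bucket = "my-important-data"

# Object versioning: keep every version, protecting against overwrites/deletes.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Lifecycle rule: age colder objects into a cheaper storage class automatically.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-objects",
            "Status": "Enabled",
            "Filter": {"Prefix": "archive/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```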
> Storing your data on a single HDD you got off NewEgg will always win if you only use the one metric of $/GB.
That right there is the mistake you're making. Storage is not the only way that AWS charges you for S3. You're also billed for stuff like each HTTP request, each metadata tag, and data transferred out once you drop off the free tier. You're basically charged every time you look at data you put in an S3 bucket the wrong way.
I strongly recommend you look at S3's pricing. You might argue that you feel S3 is convenient, but you pay through the nose for it.
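As a back-of-envelope sketch of what "paying through the nose" can look like: the per-unit rates below are approximate, region-dependent, and go stale, so treat them purely as placeholders and check the current pricing page.

```python
# Back-of-envelope only. These rates are approximate S3 Standard numbers,
# vary by region/tier, and change over time -- verify against the current
# pricing page before relying on them.
STORAGE_PER_GB_MONTH = 0.023   # approximate
EGRESS_PER_GB = 0.09           # data transfer out to the internet, approximate
GET_PER_1000 = 0.0004          # approximate
PUT_PER_1000 = 0.005           # approximate

def monthly_cost(stored_gb, egress_gb, gets, puts):
    return (stored_gb * STORAGE_PER_GB_MONTH
            + egress_gb * EGRESS_PER_GB
            + gets / 1000 * GET_PER_1000
            + puts / 1000 * PUT_PER_1000)

# 1 TB stored, 2 TB served out, 10M GETs, 1M PUTs in a month:
print(monthly_cost(1000, 2000, 10_000_000, 1_000_000))
# ~= 212: storage is only ~$23 of that; egress (~$180) and requests do the rest.
```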
Another pain is the testing story. I just want to be able to write to a FS. There are S3 FUSE bindings, though. Maybe I'm just a dinosaur these days.
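One way to get a sane testing story without FUSE (a sketch, all names invented): hide blob access behind a tiny interface so tests write to a temp directory and only production talks to S3; tools like MinIO or moto can also stand in for S3 locally.

```python
# Sketch of a minimal blob-store interface: tests use the local-FS version,
# production wires in the S3 one. All names here are illustrative.
import pathlib

class FsBlobStore:
    def __init__(self, root: str):
        self.root = pathlib.Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

class S3BlobStore:
    def __init__(self, bucket: str):
        import boto3  # only needed in production
        self.bucket = bucket
        self.s3 = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()
```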
Well in fairness, if you're comparing Postgres to S3, there's no universe where S3 doesn't come out ahead however you price it out (unless, perhaps, you're only ever using the data from the same machine running Postgres).
Also, cost is a factor: databases are usually attached to expensive disks, their storage layer is tuned for IOPS, and blobs will be using gigs of that just sitting there being sequentially scanned.
I haven't worked with BLOBs in DB fields in almost a decade, but when I did, one really annoying problem I ran into was that the DB wanted to morph the blob in transit by default. So JPEGs, PDFs, etc. would get corrupted because SQL Server wanted to add its own little byte signature to them during read/write operations.
Not sure if that was an endemic problem or specific SQL Server shiftiness.
Ignoring ACL stuff for the moment, if you put images for a user's blog post inside a database, then front it with a CDN, things might work out fine.
But if the usage pattern was such that the blog post and images were being transformed based on the time or something, then nothing would get cached, and you're back to the issue you describe.
That said... you could set up read replicas in groups and direct the blob stuff to a pool that is separate from the higher-priority data requests for normal DB stuff.
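A sketch of that routing, assuming psycopg2 connection pools and made-up DSNs/table names: normal queries hit the primary pool, blob reads go to a dedicated replica pool so they can't starve everything else.

```python
# Sketch: route blob reads to a dedicated replica pool.
# DSNs, tables and columns are hypothetical.
from psycopg2.pool import ThreadedConnectionPool

primary_pool = ThreadedConnectionPool(1, 10, dsn="postgresql://app@db-primary/app")
blob_pool = ThreadedConnectionPool(1, 5, dsn="postgresql://app@db-blob-replica/app")

def get_user(user_id):
    conn = primary_pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name, email FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        primary_pool.putconn(conn)

def get_avatar(user_id):
    conn = blob_pool.getconn()  # big binary reads stay off the primary pool
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT avatar FROM user_avatars WHERE user_id = %s", (user_id,))
            row = cur.fetchone()
            return bytes(row[0]) if row else None
    finally:
        blob_pool.putconn(conn)
```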
Do you have any thoughts on when a JSON column is too large? I've been wondering about the tradeoffs between a jsonb column in postgres that may have values, at the extreme, as large as 10 MB, usually just 100 KB, versus using S3.
Wouldn't the same reasoning for BLOBs apply to JSON columns? Unless you're frequently querying for data within those columns (e.g., filtering by one of the JSON fields), you probably don't need to store all the JSON data in the DB. And even if that is the case, you could probably work out a schema where the JSON data is stored elsewhere and only the relevant fields are stored in the DB.
At the same time, I'm working with systems where we often store MBs of data in JSON columns and it's working fine so it's really up to you to make the tradeoff.
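To make the "JSON stored elsewhere, relevant fields in the DB" split above concrete, here's a sketch with invented table/bucket names: the filterable fields become real columns, and the full document is written to object storage under a key the row points at.

```python
# Sketch of the split: queryable fields live in Postgres, the full JSON body
# lives in S3, referenced by key. Table, column and bucket names are made up.
import json
import boto3

s3 = boto3.client("s3")

def save_report(conn, report_id: str, payload: dict):
    key = f"reports/{report_id}.json"
    s3.put_object(Bucket="my-report-blobs", Key=key,
                  Body=json.dumps(payload).encode())
    with conn.cursor() as cur:
        # Only the fields we actually filter/sort on become columns.
        cur.execute(
            "INSERT INTO reports (id, customer_id, created_at, body_s3_key) "
            "VALUES (%s, %s, %s, %s)",
            (report_id, payload["customer_id"], payload["created_at"], key),
        )
    conn.commit()
```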
If you're only querying the JSON data and not returning it in full often, it's almost certainly fine. It's the cost of transit (and parsing/serialization) that's a problem.
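For illustration (hypothetical table/column names, psycopg2-style placeholders): extracting a single field server-side keeps the transfer tiny, whereas selecting the whole jsonb column ships the full document over the wire every time.

```python
# Illustrative only; table/column names are made up and conn is any
# psycopg2-style connection.
def get_status(conn, doc_id):
    with conn.cursor() as cur:
        # ->> extracts one text field server-side; only that small value is returned.
        cur.execute("SELECT payload->>'status' FROM documents WHERE id = %s", (doc_id,))
        return cur.fetchone()[0]

def get_full_payload(conn, doc_id):
    with conn.cursor() as cur:
        # This ships the entire (possibly multi-MB) jsonb value to the client.
        cur.execute("SELECT payload FROM documents WHERE id = %s", (doc_id,))
        return cur.fetchone()[0]
```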
It depends on how much you're getting back total. 100 rows returning a kb of data is the same as one row with 100kb. I get worried when the total expected data returned by a query is more than 200kb or so.
We had this problem as well. For us a big part of the latency was just the bandwidth required for Postgres’s verbose text mode. It’s a shame there’s no way to compress the data on the wire