I think the one big problem with BLOBs, especially if you have a heavily read-biased DB, is that you're going to run up against bandwidth/throughput as a bottleneck. One of the DBs I help maintain has some very large JSON columns, and we see this frequently when traffic is at its peak: simply pulling the data down from Postgres becomes the bottleneck.
If the data is frequently accessed, it also means there are extra hops the data has to take before getting to the user. It's a lot faster to pull static files from S3 or a CDN (or even just a dumb static file server) than it is to round trip through your application to the DB and back. For one, it's almost impossible to stream the response, so the whole BLOB needs to be copied in memory in each system it passes through.
It's rare that any request for, say, user data would also return the user avatar, and so you ultimately just end up with one endpoint for structured data and one to serve binary BLOB data which have very little overlap except for ACL stuff, but signed S3 URLs will get you the same security properties with much better performance overall.
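A minimal sketch of the signed-URL approach, assuming boto3 and a hypothetical bucket/key layout: the app still does the ACL check, but the bytes never pass through it.

```python
# Minimal sketch with boto3: instead of proxying avatar bytes through the
# app and DB, hand the client a short-lived signed URL and let it fetch
# the object from S3 directly. Bucket/key names are made up for illustration.
import boto3

s3 = boto3.client("s3")

def avatar_url(user_id: str) -> str:
    # The ACL check ("may the caller see this user?") stays in the app;
    # the signature just makes the URL unguessable and time-limited.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-avatars-bucket", "Key": f"avatars/{user_id}.jpg"},
        ExpiresIn=300,  # link expires after 5 minutes
    )
```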
I’m sure you’re right, but it’s unbelievable to me how cost effective people claim S3 is. I’ve just never been able to get the pricing calculator to show me “cheap”. And I guess I’ve never really gotten comfortable with the access patterns, as it’s not a file system.
Where could I explore situations where people have used S3 under extremely favorable conditions relative to local storage, in terms of price, access latency, and any other relevant dimension?
I want to believe, it’s just hard for me to go all in on an Amazon API.
That's the catch. S3 is ideal for when the sum total of your blobs can't easily fit on local storage - unless you want to use a NAS, SAN or something else with a load of spinning rust.
Storing your data on a single HDD you got off NewEgg will always win if you only use the one metric of $/GB.
S3's main draw isn't $/GB. It's actually more like ($/GB) * features
E.g. 11 9s of durability, Lambda events, bucket policies, object tags, object versioning, object lock, cross-region replication
Doing that with anything over 100TB starts to get very expensive very quickly. Especially if you need that data for, you know, your business to survive...
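To make "features" concrete, here's a rough boto3 sketch (bucket name hypothetical, IAM setup omitted) of the kind of thing you get with a couple of API calls and would otherwise have to build yourself:

```python
# Rough sketch (boto3) of two of the features mentioned above.
# Bucket name is hypothetical; permissions/IAM setup not shown.
import boto3

s3 = boto3.client("s3")
bucket = "my-important-data"

# Object versioning: keep every version, protecting against overwrites/deletes.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Lifecycle rule: age colder objects into a cheaper storage class automatically.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-objects",
            "Status": "Enabled",
            "Filter": {"Prefix": "archive/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```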
> Storing your data on a single HDD you got off NewEgg will always win if you only use the one metric of $/GB.
That right there is the mistake you're making. Storage is not the only way that AWS charges you for S3. You're also billed for stuff like each HTTP request, each metadata tag, and data transferred out once you drop off the free tier. You're basically charged every time you look at data you put in an S3 bucket the wrong way.
I strongly recommend you look at S3's pricing. You might argue that you feel S3 is convenient, but you pay through the nose for it.
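As a back-of-envelope sketch of what "paying through the nose" can look like: the per-unit rates below are approximate, region-dependent, and go stale, so treat them purely as placeholders and check the current pricing page.

```python
# Back-of-envelope only. These rates are approximate S3 Standard numbers,
# vary by region/tier, and change over time -- verify against the current
# pricing page before relying on them.
STORAGE_PER_GB_MONTH = 0.023   # approximate
EGRESS_PER_GB = 0.09           # data transfer out to the internet, approximate
GET_PER_1000 = 0.0004          # approximate
PUT_PER_1000 = 0.005           # approximate

def monthly_cost(stored_gb, egress_gb, gets, puts):
    return (stored_gb * STORAGE_PER_GB_MONTH
            + egress_gb * EGRESS_PER_GB
            + gets / 1000 * GET_PER_1000
            + puts / 1000 * PUT_PER_1000)

# 1 TB stored, 2 TB served out, 10M GETs, 1M PUTs in a month:
print(monthly_cost(1000, 2000, 10_000_000, 1_000_000))
# ~= 212: storage is only ~$23 of that; egress (~$180) and requests do the rest.
```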
Another pain is the testing story. I just want to be able to write to a FS. There are S3 FUSE bindings, though. Maybe I'm just a dinosaur these days.
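One way to get a sane testing story without FUSE (a sketch, all names invented): hide blob access behind a tiny interface so tests write to a temp directory and only production talks to S3; tools like MinIO or moto can also stand in for S3 locally.

```python
# Sketch of a minimal blob-store interface: tests use the local-FS version,
# production wires in the S3 one. All names here are illustrative.
import pathlib

class FsBlobStore:
    def __init__(self, root: str):
        self.root = pathlib.Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

class S3BlobStore:
    def __init__(self, bucket: str):
        import boto3  # only needed in production
        self.bucket = bucket
        self.s3 = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()
```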
Well in fairness, if you're comparing Postgres to S3, there's no universe where S3 doesn't come out ahead however you price it out (unless, perhaps, you're only ever using the data from the same machine running Postgres).
Also, cost is a factor: databases are usually attached to expensive disks, their storage layer is tuned for IOPS, and blobs will be using gigs of that just sitting there being sequentially scanned.
I haven't worked with BLOBs in DB fields in almost a decade, but when I did, one really annoying problem I ran into was that the DB wanted to morph the blob in transit by default. So JPEGs, PDFs, etc. would get corrupted because SQL Server wanted to add its own little byte signature to them during read/write operations.
Not sure if that was an endemic problem or specific SQL Server shiftiness.
Ignoring ACL stuff for the moment, if you put images for a user's blog post inside a database, then front it with a CDN, things might work out fine.
But if the usage pattern was such that the blog post and images were being transformed based on the time or something, then nothing would get cached, and you're back to the issue you describe.
That said... you could set up read replicas in groups and direct the blob stuff to a pool that is separate from the higher-priority data requests for normal DB stuff.
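A sketch of that routing, assuming psycopg2 connection pools and made-up DSNs/table names: normal queries hit the primary pool, blob reads go to a dedicated replica pool so they can't starve everything else.

```python
# Sketch: route blob reads to a dedicated replica pool.
# DSNs, tables and columns are hypothetical.
from psycopg2.pool import ThreadedConnectionPool

primary_pool = ThreadedConnectionPool(1, 10, dsn="postgresql://app@db-primary/app")
blob_pool = ThreadedConnectionPool(1, 5, dsn="postgresql://app@db-blob-replica/app")

def get_user(user_id):
    conn = primary_pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name, email FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        primary_pool.putconn(conn)

def get_avatar(user_id):
    conn = blob_pool.getconn()  # big binary reads stay off the primary pool
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT avatar FROM user_avatars WHERE user_id = %s", (user_id,))
            row = cur.fetchone()
            return bytes(row[0]) if row else None
    finally:
        blob_pool.putconn(conn)
```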
Do you have any thoughts on when a JSON column is too large? I've been wondering about the tradeoffs between a jsonb column in postgres that may have values, at the extreme, as large as 10 MB, usually just 100 KB, versus using S3.
Wouldn't the same reasoning for BLOBs apply to JSON columns? Unless you're frequently querying for data within those columns (e.g., filtering by one of the JSON fields), you probably don't need to store all the JSON data in the DB. And even if that is the case, you could probably work out a schema where the JSON data is stored elsewhere and only the relevant fields are stored in the DB.
At the same time, I'm working with systems where we often store MBs of data in JSON columns and it's working fine so it's really up to you to make the tradeoff.
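To make the "JSON stored elsewhere, relevant fields in the DB" split above concrete, here's a sketch with invented table/bucket names: the filterable fields become real columns, and the full document is written to object storage under a key the row points at.

```python
# Sketch of the split: queryable fields live in Postgres, the full JSON body
# lives in S3, referenced by key. Table, column and bucket names are made up.
import json
import boto3

s3 = boto3.client("s3")

def save_report(conn, report_id: str, payload: dict):
    key = f"reports/{report_id}.json"
    s3.put_object(Bucket="my-report-blobs", Key=key,
                  Body=json.dumps(payload).encode())
    with conn.cursor() as cur:
        # Only the fields we actually filter/sort on become columns.
        cur.execute(
            "INSERT INTO reports (id, customer_id, created_at, body_s3_key) "
            "VALUES (%s, %s, %s, %s)",
            (report_id, payload["customer_id"], payload["created_at"], key),
        )
    conn.commit()
```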
If you're only querying the JSON data and not returning it in full often, it's almost certainly fine. It's the cost of transit (and parsing/serialization) that's a problem.
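For illustration (hypothetical table/column names, psycopg2-style placeholders): extracting a single field server-side keeps the transfer tiny, whereas selecting the whole jsonb column ships the full document over the wire every time.

```python
# Illustrative only; table/column names are made up and conn is any
# psycopg2-style connection.
def get_status(conn, doc_id):
    with conn.cursor() as cur:
        # ->> extracts one text field server-side; only that small value is returned.
        cur.execute("SELECT payload->>'status' FROM documents WHERE id = %s", (doc_id,))
        return cur.fetchone()[0]

def get_full_payload(conn, doc_id):
    with conn.cursor() as cur:
        # This ships the entire (possibly multi-MB) jsonb value to the client.
        cur.execute("SELECT payload FROM documents WHERE id = %s", (doc_id,))
        return cur.fetchone()[0]
```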
It depends on how much you're getting back total. 100 rows returning a kb of data is the same as one row with 100kb. I get worried when the total expected data returned by a query is more than 200kb or so.
We had this problem as well. For us a big part of the latency was just the bandwidth required for Postgres’s verbose text mode. It’s a shame there’s no way to compress the data on the wire