You're saying it's impossible to have public write access to a table without also providing public read access?
"it can be output somewhere before you execute your logic" is a design choice that is orthogonal from whether you execute your logic before or after input into the database.
First of all, most database records couldn't fit child porn, unless it was somehow encoded across thousands of records, in which case you couldn't realize it was child porn until after you'd stored 99% of it.
Sure though, by putting "child porn" in a sentence, you can make anything seem bad. Tell me this, would you rather your application middleware was in the "copying child porn" business? ;-)
Actually, the more I think about it, the crazier this seems. You're going to hold all the "child porn" you receive in RAM until you've validated whether or not it is child porn?
I don’t get your tone or why you seem shocked that binary data can be stored in a database. Postgres and MySQL both have binary column types that can hold gigabytes.
Second, you generally need to hold the entire image in RAM to create the perceptual hash needed to check that the image is/isn’t child porn.
> I don’t get your tone or why you seem shocked that binary data can be stored in a database. Postgres and MySQL both have binary column types that can hold gigabytes.
My tone is shocked, because what you're describing seems totally removed from any system I've seen, and I've implemented a ton of systems. For performance reasons, you want to stream large uploads to storage (web servers, like nginx, are typically configured to do this even before the request is sent to any application logic). You invariably want to store UGC data that conforms to your schema, even if you're going to reject it for content. There's a whole process for contesting, reviewing and reversing decisions that requires the data be in persistent storage.
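To make that concrete, here's a minimal sketch of the streaming I mean, in Python (the chunk size, function name and path are all made up for illustration):

    # Stream a large upload to disk in fixed-size chunks instead of
    # buffering the whole body in RAM. All names here are illustrative.
    import shutil
    import tempfile

    CHUNK_SIZE = 64 * 1024  # 64 KiB per read keeps memory use flat

    def save_upload(stream, dest_dir="/var/tmp/uploads"):
        # Memory stays bounded at roughly CHUNK_SIZE no matter how big
        # the upload is, which is the whole point of streaming.
        with tempfile.NamedTemporaryFile(dir=dest_dir, delete=False) as out:
            shutil.copyfileobj(stream, out, length=CHUNK_SIZE)
            return out.name

nginx does the equivalent at the web-server layer, spilling large request bodies to temp files before your application ever sees the request.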
I think you misunderstood what I said. Yes, Postgres, MySQL and a variety of other databases have binary column types that can hold gigabytes. What I wouldn't agree with is that most database records can hold gigabytes, binary or otherwise. Heck, most database records aren't populated from UGC sources at all, let alone UGC sources where child porn is a risk.
But okay, let's assume, for argument's sake, most database records are happily accepting 4TB large objects, and you're accepting up to 4TB uploads (where Postgres' large objects max out). Do all your web & application servers have 4TB of memory? What if you're processing more than one request at once, do you have N*4TB of memory?
At least all the systems I've implemented that receive data from users enforce limits on request sizes, and with the exception of file uploads, which are typically directly streamed to the filesystem before processing, those limits tend to be quite small, often less than a kilobyte. Maybe someone could write some really terse child porn prose and compress it down to fit in that space, but pretty much any image would have to be spread across many records. By design, almost any child porn received would be put in persistent storage before being identified as such.
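For illustration, the kind of limit I mean is nothing fancier than this (the limit value, function name and exception are placeholders):

    MAX_BODY_BYTES = 1024  # most non-upload endpoints need far less

    def read_limited(stream, declared_length):
        # Reject up front when Content-Length is present and honest...
        if declared_length is not None and declared_length > MAX_BODY_BYTES:
            raise ValueError("request body too large")
        # ...and cap the actual read in case it lied or the body was chunked.
        body = stream.read(MAX_BODY_BYTES + 1)
        if len(body) > MAX_BODY_BYTES:
            raise ValueError("request body too large")
        return body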
> Second, you generally need to hold the entire image in RAM to create the perceptual hash needed to check that the image is/isn’t child porn.
This is one of many reasons that you generally want to stream file uploads to storage before performing analysis. Otherwise you're incredibly vulnerable to a DoS attack on your active memory resources. Even without a DoS attack, you're harming performance by unnecessarily evicting pages that could be used for caching/buffering for bytes that won't be served at least until you've finished receiving all the file's data.
[Note: Many media encodings tend to store neighbouring pixels together, so you can, conceptually, compute a perceptual hash progressively, without loading the entire file into active memory, which is often desirable, particularly with video content.]
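As a toy illustration of "progressively" (assuming rows of grayscale pixels arrive from a streaming decoder; a plain 8x8 average hash stands in here for real perceptual hashes like pHash or PDQ, which are more sophisticated):

    # Toy sketch only: an 8x8 "average hash" accumulated one row of
    # grayscale pixels at a time, so the whole image never sits in RAM.
    # Assumes width and height are known from the header and are >= 8.

    def progressive_ahash(rows, width, height, grid=8):
        sums = [[0.0] * grid for _ in range(grid)]
        counts = [[0] * grid for _ in range(grid)]
        for y, row in enumerate(rows):       # rows arrive incrementally
            gy = y * grid // height          # grid cell this row feeds
            for x, pixel in enumerate(row):
                gx = x * grid // width
                sums[gy][gx] += pixel
                counts[gy][gx] += 1
        cells = [sums[j][i] / counts[j][i]
                 for j in range(grid) for i in range(grid)]
        mean = sum(cells) / len(cells)
        # One bit per cell: brighter than the overall mean or not.
        return sum(1 << k for k, c in enumerate(cells) if c > mean)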
Thought about it some more... this whole scenario makes sense in only the narrowest of contexts. Very few applications directly serve UGC to the public, and a lot of applications are B2B. You're authenticated, and there's a link to your employer (or to you, if you're self-employed). Uploaded data isn't made visible to the public. Services are often limited to a legal jurisdiction. If you want to upload your unencrypted child porn to a record in Google's Firebase database, go ahead. The feds could use some easy cases.
There's little point in not writing it to disk; the question of holding it in RAM vs writing a file to disk is moot. You've got to handle it, and the best way of handling that kind of thing at scale is to write it to a temporary file on disk and then have a queue process work over the files, doing the analysis.
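E.g. (an in-process queue standing in for whatever broker you'd really use; analyze is a placeholder):

    import queue
    import threading

    work = queue.Queue()  # stand-in for SQS/RabbitMQ/etc.

    def analyze(path):
        ...  # placeholder: hash, match against known-bad lists, report

    def worker():
        while True:
            path = work.get()  # blocks until an upload lands on temp disk
            try:
                analyze(path)
            finally:
                work.task_done()

    threading.Thread(target=worker, daemon=True).start()
    # The upload handler just writes the temp file and does: work.put(path)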
No serious authority is going to hang you for having illegal UGC material in storage while you process it. Heck, you can even allow stuff to go straight to publicly accessible if you have robust mechanisms for matching and reporting. The authorities won't take a hard line against a platform that's open to the public as long as it has the right mitigations in place. And they won't immediately blame you unless you act as a safe haven.
A sensible architectural pattern for binary UGC upload data would plan to put it in object storage and then deal with it from there.
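Something like this sketch (boto3 against S3-compatible object storage; the bucket and key names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    def stash_ugc(stream, key):
        # Land the raw bytes in a quarantine bucket first; moderation
        # and any make-it-public step happen later, from a worker.
        # upload_fileobj streams the file-like object in parts rather
        # than reading it all into memory.
        s3.upload_fileobj(stream, "ugc-quarantine", key)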
I have never in my life written a "child porn validator" that restricts files uploaded by users to "non child porn". This sounds nontrivial and futile (every bad file can also be stored as a password-protected zip file). It sounds like an example of the "think of the children" fallacy.
I also find the Firebase model weird (though I haven't used it yet), but not for the child porn reasons.
"it can be output somewhere before you execute your logic" is a design choice that is orthogonal from whether you execute your logic before or after input into the database.