Just a pro tip: you don't want to use gzip, since it can't be split, and if you want to parallelize effectively you need to split. The workaround is many small files (<100MB each), but at scale that approach can make your cloud storage costs explode.
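To make the splittability point concrete, here's a minimal PySpark sketch (the paths and session setup are made up, just for illustration): a gzipped text file comes in as a single partition per file, while the same data compressed with bzip2 can be split across many partitions.

    # Minimal sketch; hypothetical paths, local SparkSession for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("splittability-check").getOrCreate()

    # gzip is not splittable: each .gz file becomes exactly one partition,
    # so one big file means one task no matter how many cores you have.
    gz_df = spark.read.text("s3://my-bucket/events/big-log.gz")    # hypothetical path
    print("gzip partitions:", gz_df.rdd.getNumPartitions())        # -> 1 per file

    # bzip2 is splittable: a large .bz2 of the same data can be broken into
    # multiple input splits and processed in parallel.
    bz_df = spark.read.text("s3://my-bucket/events/big-log.bz2")   # hypothetical path
    print("bzip2 partitions:", bz_df.rdd.getNumPartitions())       # -> >1 for a large file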
(I ingest, process, and warehouse 20TB of data a day)
Many small objects seem fine -- where "small" is 50-500MB per object. You can't have super huge compressed objects anyway, since BigQuery (and perhaps other systems) won't ingest a file with more than 1 or 2GB of decompressed data.
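If you control the writer, one way to stay in that range is to repartition to a target object size before writing the compressed objects -- a rough sketch, with made-up numbers and paths:

    # Sketch only: target size, estimated size, and paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://my-bucket/staging/")     # hypothetical input

    TARGET_OBJECT_MB = 256          # aim for the middle of the 50-500MB range
    estimated_total_mb = 50000      # assumes you can estimate the output size
    num_files = max(1, estimated_total_mb // TARGET_OBJECT_MB)

    (df.repartition(int(num_files))
       .write
       .option("compression", "gzip")       # fine for load-only objects, as long as
       .csv("s3://my-bucket/export/"))      # each decompresses to well under the limit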
20TB a day is a lot -- what is your stack, and what compression/file format are you using?
Well, it's only a few PB a year. But, I'll say that 20 TB is the upper bound per day. It's seasonal, and maybe the average is 10 TB per day.
Personally, I've found the best results with bzip2. But, we use LZO and LZ4 and even Snappy for different things.
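In Spark that mostly comes down to the compression option on each write; roughly like this (hypothetical paths, and note LZO isn't bundled -- it needs the hadoop-lzo codec on the classpath):

    # Sketch: picking a codec per output. Paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://my-bucket/clean/")       # hypothetical input

    # bzip2: splittable and good ratio, but slow to compress -- archives.
    df.write.option("compression", "bzip2").csv("s3://my-bucket/archive/")

    # snappy / lz4: much faster, weaker ratio -- intermediate/scratch data.
    df.write.option("compression", "snappy").parquet("s3://my-bucket/intermediate/")
    df.write.option("compression", "lz4").csv("s3://my-bucket/scratch/")

    # LZO needs the separate hadoop-lzo codec, and its files are only splittable
    # after running the LZO indexer over them.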
Spark is money, but we've forked it to make several important customizations/enhancements. We have proprietary indexing, and we do all of our processing in-memory, of course.
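In stock Spark terms, the in-memory part is mostly explicit caching; roughly like this (paths and columns are made up, and obviously our fork does more than this):

    # Sketch: keeping a hot dataset in memory across multiple jobs (stock Spark API).
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.parquet("s3://my-bucket/events/")    # hypothetical path

    # MEMORY_ONLY keeps deserialized partitions in RAM; partitions that don't fit
    # are recomputed rather than spilled to disk.
    events.persist(StorageLevel.MEMORY_ONLY)

    events.count()                            # first action materializes the cache
    events.groupBy("type").count().show()     # later jobs reuse the in-memory data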
Oh, and some of our data is sharded in Redshift. Redshift works well with gzip, so we do use it there. But, that particular part of our warehouse (or data lake, or whatever you call it these days) holds much less data than we ingest, since we reduce a lot.
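On the Redshift side it's just the GZIP option on COPY; something like this (endpoint, bucket, table, and IAM role are all made up):

    # Sketch: loading gzipped CSV objects into Redshift. Everything here is hypothetical.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
        port=5439, dbname="warehouse", user="loader", password="...",
    )
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY analytics.events
            FROM 's3://my-bucket/export/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            GZIP
            FORMAT AS CSV;
        """)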