Just a pro tip: you don't want to use gzip, since it can't be split, and if you want to parallelize effectively you need to split. The workaround is many small files (<100MB each), but at scale that approach can make your cloud storage costs explode.
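To make the splittability point concrete, here's a minimal PySpark sketch (the paths and session setup are made up, just for illustration): a gzipped text file comes in as a single partition per file, while the same data compressed with bzip2 can be split across many partitions.

    # Minimal sketch; hypothetical paths, local SparkSession for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("splittability-check").getOrCreate()

    # gzip is not splittable: each .gz file becomes exactly one partition,
    # so one big file means one task no matter how many cores you have.
    gz_df = spark.read.text("s3://my-bucket/events/big-log.gz")    # hypothetical path
    print("gzip partitions:", gz_df.rdd.getNumPartitions())        # -> 1 per file

    # bzip2 is splittable: a large .bz2 of the same data can be broken into
    # multiple input splits and processed in parallel.
    bz_df = spark.read.text("s3://my-bucket/events/big-log.bz2")   # hypothetical path
    print("bzip2 partitions:", bz_df.rdd.getNumPartitions())       # -> >1 for a large file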
(I ingest, process, and warehouse 20TB of data a day)
Many small objects seem fine -- where "small" is 50-500MB per object. You can't have super huge compressed objects anyway, since BigQuery (and perhaps other systems) won't ingest a file with more than 1 or 2GB of decompressed data.
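If you control the writer, one way to stay in that range is to repartition to a target object size before writing the compressed objects -- a rough sketch, with made-up numbers and paths:

    # Sketch only: target size, estimated size, and paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://my-bucket/staging/")     # hypothetical input

    TARGET_OBJECT_MB = 256          # aim for the middle of the 50-500MB range
    estimated_total_mb = 50000      # assumes you can estimate the output size
    num_files = max(1, estimated_total_mb // TARGET_OBJECT_MB)

    (df.repartition(int(num_files))
       .write
       .option("compression", "gzip")       # fine for load-only objects, as long as
       .csv("s3://my-bucket/export/"))      # each decompresses to well under the limit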
20TB a day is a lot -- what is your stack, and what compression/file format are you using?
Well, it's only a few PB a year. But, I'll say that 20 TB is the upper bound per day. It's seasonal, and maybe the average is 10 TB per day.
Personally, I've found the best results with bzip2. But, we use LZO and LZ4 and even Snappy for different things.
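In Spark that mostly comes down to the compression option on each write; roughly like this (hypothetical paths, and note LZO isn't bundled -- it needs the hadoop-lzo codec on the classpath):

    # Sketch: picking a codec per output. Paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://my-bucket/clean/")       # hypothetical input

    # bzip2: splittable and good ratio, but slow to compress -- archives.
    df.write.option("compression", "bzip2").csv("s3://my-bucket/archive/")

    # snappy / lz4: much faster, weaker ratio -- intermediate/scratch data.
    df.write.option("compression", "snappy").parquet("s3://my-bucket/intermediate/")
    df.write.option("compression", "lz4").csv("s3://my-bucket/scratch/")

    # LZO needs the separate hadoop-lzo codec, and its files are only splittable
    # after running the LZO indexer over them.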
Spark is money, but we've forked it to make several important customizations/enhancements. We have proprietary indexing, and we do all of our processing in-memory, of course.
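In stock Spark terms, the in-memory part is mostly explicit caching; roughly like this (paths and columns are made up, and obviously our fork does more than this):

    # Sketch: keeping a hot dataset in memory across multiple jobs (stock Spark API).
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.parquet("s3://my-bucket/events/")    # hypothetical path

    # MEMORY_ONLY keeps deserialized partitions in RAM; partitions that don't fit
    # are recomputed rather than spilled to disk.
    events.persist(StorageLevel.MEMORY_ONLY)

    events.count()                            # first action materializes the cache
    events.groupBy("type").count().show()     # later jobs reuse the in-memory data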
Oh, and some of our data is sharded in Redshift. Redshift works well with gzip, so we do use it there. But, that particular part of our warehouse (or data lake, or whatever you call it these days) holds much less data than we ingest, since we reduce a lot.
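On the Redshift side it's just the GZIP option on COPY; something like this (endpoint, bucket, table, and IAM role are all made up):

    # Sketch: loading gzipped CSV objects into Redshift. Everything here is hypothetical.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
        port=5439, dbname="warehouse", user="loader", password="...",
    )
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY analytics.events
            FROM 's3://my-bucket/export/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            GZIP
            FORMAT AS CSV;
        """)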