
Reading lots of small files on S3 or a local filesystem is tricky. A million devices with a dozen files each, so let's say 12 million files.

One thing locally is that each file takes up a full block. So even if you only need 500 bytes of data in a file and a block is 4 KB, you've wasted 3.5 KB of space and I/O. Multiply that by a million files and you're wasting gigabytes of space.
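A rough back-of-envelope of that block slack in Python (the block size and record size are the figures from above; the file count is the 12 million from the example):

    # Wasted space when each tiny record occupies a full filesystem block.
    BLOCK_SIZE = 4 * 1024        # 4 KiB block
    RECORD_SIZE = 500            # useful bytes per file
    NUM_FILES = 12_000_000       # a million devices x a dozen files

    wasted_per_file = BLOCK_SIZE - RECORD_SIZE              # ~3.5 KiB
    total_waste_gib = wasted_per_file * NUM_FILES / 1024**3

    print(f"~{total_waste_gib:.1f} GiB lost to block slack")  # roughly 40 GiB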

In S3, listing 12 million files takes 12 thousand HTTP requests (the max return per request is 1,000 items). At roughly 10 ms per round trip, that's about two minutes just to list them. Now say you want to read each file, and again each read takes 10 ms: you're looking at about 1.4 days. Obviously this can be parallelized, but compared to the raw byte size that's a huge overhead, and it's just to read one day of data.
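A minimal sketch of the parallelized version with boto3, assuming a hypothetical bucket and one-day prefix. Even with 100 threads, 12 million GETs at ~10 ms each is still around 20 minutes of pure request overhead:

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    BUCKET = "example-device-data"   # hypothetical bucket
    PREFIX = "2024/01/15/"           # hypothetical one-day prefix

    def read_object(key):
        # One GET per tiny file -- the round trip is the dominant cost.
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

    # Paginated listing: at most 1,000 keys come back per request.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    with ThreadPoolExecutor(max_workers=100) as pool:
        bodies = list(pool.map(read_object, keys))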

If you concatenate the files together to get a reasonable size and number of files, raw JSON on S3 is really powerful. Point Athena at it and you just write SQL; it handles the rest, and it's serverless. It does make single-row lookups more expensive, though (supplementing with DynamoDB could keep things serverless if single-row lookups are frequent).
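A sketch of the concatenation step, assuming the same hypothetical buckets as above: read the small per-device JSON files and roll them up into larger newline-delimited JSON objects, which Athena can query directly once a table is defined over the batched prefix.

    import boto3

    s3 = boto3.client("s3")
    SRC_BUCKET = "example-device-data"            # hypothetical
    DST_BUCKET = "example-device-data-batched"    # hypothetical
    TARGET_SIZE = 64 * 1024 * 1024                # aim for ~64 MB objects

    def flush(buffer, part):
        # Write one batched newline-delimited JSON object.
        s3.put_object(
            Bucket=DST_BUCKET,
            Key=f"dt=2024-01-15/batch-{part:05d}.json",
            Body=buffer.encode("utf-8"),
        )

    buffer, part = "", 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix="2024/01/15/"):
        for obj in page.get("Contents", []):
            record = s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
            buffer += record.decode("utf-8").rstrip("\n") + "\n"
            if len(buffer) >= TARGET_SIZE:   # rough size check, fine for a sketch
                flush(buffer, part)
                buffer, part = "", part + 1
    if buffer:
        flush(buffer, part)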

Lots of optimizations will get you further improvements, like the Parquet format tobilg mentioned (binary and columnar), but anything with a decent file size will work.
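For example, converting a day of newline-delimited JSON to Parquet is a one-liner with pandas and pyarrow (file names here are just illustrative):

    import pandas as pd

    df = pd.read_json("devices-2024-01-15.json", lines=True)  # newline-delimited JSON
    df.to_parquet("devices-2024-01-15.parquet", engine="pyarrow", compression="snappy")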




Yeah, this is what Kinesis Firehose is for. Send all of your messages there and it will batch them to S3.
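A minimal sketch of the producer side with boto3; the delivery stream name is hypothetical, and the stream itself (with its S3 destination and buffering limits, e.g. 128 MB / 300 s) has to be configured separately. Firehose then handles the batching into larger S3 objects.

    import json
    import boto3

    firehose = boto3.client("firehose")

    def send(device_id, payload):
        firehose.put_record(
            DeliveryStreamName="device-events-to-s3",   # hypothetical stream
            Record={"Data": (json.dumps({"device": device_id, **payload}) + "\n").encode()},
        )

    send("device-000001", {"temp_c": 21.4, "ts": "2024-01-15T00:00:00Z"})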



