You don't know what you need until you need it. The signal you need to dig out of the data often isn't known until some other event provides the context. Also, for some industries and some applications, there are regulatory reasons you retain the data. In some cases these are sampled data feeds, even at the extreme data rates seen, because the available raw feed would break everything (starting with the upstream network).
In virtually all real systems, data is aged off after some number of months, either truncated or moved to cold storage. Most applications are about analyzing recent history. Everyone says they want to store the data online forever but then they calculate how much it will cost to keep exabytes of data online and financial reality sets in. Several tens of petabytes is a more typical data model given current platform capabilities. Expensive but manageable.
I interesting worked on a project as a data scientist with a client who worked in high precision manufacturing. Their signals (sensors) and actuators were stored in a historian which couldn't handle data 100ms samples even though the data was collected at a 10ms rate. One of the problems required us to look at the process that took just 85ms. The problem was the historian was showing signals up to 20ms it took a while to realise that it was extrapolating when you tried getting finer resolution. The company was using this historian for more than 20 years they had to commission another project to change the historian. So you're right, you don't know what you need until you need it.
In virtually all real systems, data is aged off after some number of months, either truncated or moved to cold storage. Most applications are about analyzing recent history. Everyone says they want to store the data online forever but then they calculate how much it will cost to keep exabytes of data online and financial reality sets in. Several tens of petabytes is a more typical data model given current platform capabilities. Expensive but manageable.