> and I assume harddrive manufacturers try to compress and deduplicate data

You assume incorrectly. There have been attempts, like the SandForce drives in the early days of SSDs, but there are a bunch of reasons why you don't want to do opportunistic general-purpose compression at the lowest possible level of the storage chain.

The problem with compression* is that instead of having one drive capacity number (say, 500 GB), you have two: the drive's physical capacity and its "logical" capacity. Which is fine and dandy, except that the logical capacity varies wildly with what kind of data is stored on the drive. Highly patterned data (text and most executables) compresses very well. Data which is already compressed (most images, videos, archives, etc.) does not. So how do you report to the user how much space they have left on the drive? Even if you know the existing contents and current compression ratio, you don't know what they're going to put on the drive in the _future_. Your best guess would be, "250 GB, or maybe about a terabyte, lol i dunno".
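
To make the guessing problem concrete, here's a back-of-the-envelope sketch (Python, with made-up numbers) of how the same physical headroom turns into wildly different "free space" figures depending on what compression ratio you assume for future writes:

    # Hypothetical numbers: "free space" on a compressing drive depends
    # entirely on what you assume about data that hasn't been written yet.
    PHYSICAL_CAPACITY_GB = 500
    physical_used_gb = 100        # actually written to the media

    def free_space_estimate(assumed_ratio: float) -> float:
        """Logical free space if future writes compress at assumed_ratio:1."""
        physical_free = PHYSICAL_CAPACITY_GB - physical_used_gb
        return physical_free * assumed_ratio

    for ratio in (1.0, 2.0, 4.0):  # already-compressed media, mixed data, plain text
        print(f"assuming {ratio}:1 -> {free_space_estimate(ratio):.0f} GB 'free'")
    # assuming 1.0:1 -> 400 GB 'free'
    # assuming 2.0:1 -> 800 GB 'free'
    # assuming 4.0:1 -> 1600 GB 'free'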

There's also the fact that application-specific compression algorithms tend to do FAR better than general-purpose ones, which is why almost all of our media storage formats are built around them: JPEG, HEVC, and so on. Plus you get the benefit of having the data compressed all the time (even over the I/O channel or network) instead of just on disk.

Compressing data which is _already_ compressed often results in additional overhead. So unless the drive is testing the data to see how well it compresses before writing it to disk (which would murder performance), your 500 GB drive could actually end up being a 450 GB drive.
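
You can see that overhead with any general-purpose compressor; here's a small Python sketch using zlib as a stand-in, with random bytes standing in for already-compressed media:

    import os
    import zlib

    compressible = b"hello world " * 100_000        # highly patterned text
    incompressible = os.urandom(len(compressible))  # stands in for JPEG/HEVC/zip payloads

    for name, data in (("patterned", compressible), ("already compressed", incompressible)):
        out = zlib.compress(data, level=6)
        print(f"{name}: {len(data)} -> {len(out)} bytes ({len(out) / len(data):.2%})")
    # The patterned data shrinks to well under 1% of its size; the random data
    # comes out slightly *larger* than it went in, because the compressor still
    # has to wrap it in its own framing.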

Further, always-on compression would impose a substantial performance penalty unless dedicated silicon is added to handle it. Storage is already an industry with razor-thin margins; companies are not going to add to the BOM cost for a feature that could ultimately make people buy _fewer_ drives.

In the case of deduplication, there's no OS-level standard for it, which makes current implementations far less efficient than they could be.
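
For anyone who hasn't looked at how dedup works under the hood, here's a toy Python sketch of content-hash block dedup (the class and block size are mine, purely for illustration). It also shows why I lump it in with compression below: identical blocks collapse into one stored copy plus references.

    import hashlib

    BLOCK_SIZE = 4096

    class DedupStore:
        def __init__(self):
            self.blocks = {}   # hash -> block bytes, stored once
            self.files = {}    # filename -> list of block hashes

        def write(self, name: str, data: bytes) -> None:
            hashes = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                digest = hashlib.sha256(block).hexdigest()
                self.blocks.setdefault(digest, block)  # duplicate blocks cost nothing
                hashes.append(digest)
            self.files[name] = hashes

        def read(self, name: str) -> bytes:
            return b"".join(self.blocks[h] for h in self.files[name])

    store = DedupStore()
    store.write("a.bin", b"\x00" * 40960)          # ten identical zero-filled blocks
    store.write("b.bin", b"\x00" * 40960)          # an exact copy of the same file
    print(len(store.blocks))                       # 1 unique block stored for 80 KiB of data
    print(store.read("b.bin") == b"\x00" * 40960)  # True: reads reconstruct the data exactly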

That said, data compression is very popular in the enterprise storage space, but it is typically done at the pool or volume level (large groups of disks) rather than per-disk. These arrays usually combine compression and deduplication with other strategies like thin provisioning to optimize storage to an almost absurd level, and they typically require trained storage engineers to manage.

* _I lump deduplication into compression because deduplication is really just one kind of compression strategy, even though lots of things treat it like a separate feature._
