No, that's flat out not true.

I've seen that metric thrown around when talking about the "dedup" feature of ZFS, but honestly, don't use dedup unless you know what you're doing. It's way too easy for things to go wrong otherwise.
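
If you're on the fence, zdb can simulate dedup against an existing pool and report the ratio you'd actually get, before you turn anything on. A rough sketch, assuming a pool called "tank":

    # Simulate dedup on the pool and print a DDT histogram plus the
    # overall dedup ratio; nothing on disk is modified.
    zdb -S tank

If the reported ratio comes back close to 1.0x, the RAM cost of the dedup table is almost certainly not worth it.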



How much memory does ZFS (and/or ZoL) need per 1TB of storage when dedup is off?

Also, bup ("it backs things up!") efficiently dedups across an ssh connection (using bloom filters) Scales are differrent, but it might work for ZFS as well.


The recommendations I've read say you need 1GB RAM for system use (assuming a dedicated file server), and then as much RAM (ideally ECC RAM) as you want to give it for caching data.

If you're short on RAM (sub 4GB), you might need to change some of the default settings to avoid problems, but RAM's fairly cheap nowadays, so unless you've got an old machine, it's not likely to be a problem :)
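
For example, with ZoL the usual knob is the ARC size cap; on a low-memory box you might do something like this (the 2 GiB value is purely illustrative):

    # Cap the ZFS ARC at 2 GiB (value in bytes) so the cache doesn't
    # starve the rest of the system; applied at module load time.
    echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf

    # Or adjust it on a running system:
    echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max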


I'd point out that cheap RAM is about $10/GB right now, and a cheap HDD is $30/TB. So if you need an extra GB of RAM for each TB of storage, you're increasing storage costs by a third.
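
Back-of-the-envelope, using those prices and the oft-quoted 1 GB of RAM per 1 TB of deduped data:

    # cost per TB stored with the extra RAM vs. disk alone (illustrative)
    echo $(( (30 + 10) * 100 / 30 ))   # prints 133, i.e. roughly a third more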


I answered this question here:

https://news.ycombinator.com/item?id=8437921


Ah ha! Thanks. Data deduplication. Well, yeah, that makes sense to me. Thank you very much for clearing that up.


No problem. It's really a shame to have such a high bar for using dedup. It either fits a given workload extremely well, or can be extremely detrimental.

There's been talk in the developer community about ways to address the usability of dedup, but so far nothing has gone further than small prototypes.


Setting zfs_dedup_prefetch=0 has been found to help systems using ZFS data deduplication. There is a patch in ZoL HEAD for the next release that makes zfs_dedup_prefetch=0 by default:

https://github.com/zfsonlinux/zfs/commit/0dfc732416922e1dd59...
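
Until that ships, the parameter can be set by hand on ZoL:

    # Turn off DDT prefetch on a running system
    echo 0 > /sys/module/zfs/parameters/zfs_dedup_prefetch

    # Make it persistent across module reloads
    echo "options zfs zfs_dedup_prefetch=0" >> /etc/modprobe.d/zfs.conf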

Aside from that, you are right that there has not been much done here. Making data deduplication more performant is a difficult task and so far, no one has proposed improvements beyond trivial changes.


I think for many or most use cases it would make more sense to have "off-line" deduplication (as, I believe, BTRFS does), since you can free up space on demand, when you judge it would yield the most benefit and the system is least busy.

I'm not sure how much benefit this approach provides compared to "real time" deduplication, however, since the mapping table would still need to exist in memory, but I'd expect an increase in the write performance of non-duplicate data.
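
On BTRFS the offline pass is typically done with a userspace tool run when the machine is idle; for instance duperemove (just one example, not mentioned upthread) works roughly like this, with a made-up path:

    # Hash file extents under the given directory and ask the kernel to
    # share the duplicate ones via the btrfs same-extent ioctl.
    duperemove -dr /srv/containers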

PS. My use case is that I have a few dozen (linux-vserver) Gentoo containers that obviously share many files, and unfortunately maintaining a shared read-only mount of the core system doesn't seem practical/viable for them (as it does for, e.g., FreeBSD jails, mostly thanks to the clearer system separation there). The waste isn't significant enough to really bother me; it would just be nice to avoid. The solutions I'm aware of are (not sure if I'm missing any; I'd be happy to be pointed at something else if there is):

- integrated FS-level deduplication (ZoL, BTRFS)

- higher-level deduplication (lessfs and opendedup)

- hard-linking scripts (obviously at the file-level)
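
For the last option, a tool like rdfind can do the hard-linking pass (directory name is made up); note it only helps for files that are byte-identical and are never modified in place afterwards:

    # Replace duplicate files across the container roots with hard links.
    rdfind -makehardlinks true /srv/containers/*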


Cloned snapshots are a good way to deduplicate similar FS trees and have no RAM overhead. It's functionally the same as hardlinking.
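
E.g. with ZFS, assuming made-up pool/dataset names:

    # Build one "golden" tree, snapshot it, and clone the snapshot for
    # each container; clones share every block until it's overwritten.
    zfs snapshot tank/gentoo-template@base
    zfs clone tank/gentoo-template@base tank/containers/c1
    zfs clone tank/gentoo-template@base tank/containers/c2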


Thanks, you mean LVM2/ZFS/BTRFS-based snapshots, right? I've seen that mentioned, but I was thinking that eventually, with system updates, the snapshots will end up having fewer and fewer blocks in common with their original source, so I would have to keep recreating snapshots from a more similar source and copying the unique data on top of them again to raise the hit rate, and that sounds like a pretty inconvenient thing to do.
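
You can at least measure how far a clone has drifted before deciding it's worth re-cloning; with ZFS, something along these lines (names made up again):

    # For a clone, "used" is roughly the space unique to it (no longer
    # shared with its origin snapshot), while "refer" is its full size.
    zfs list -o name,origin,used,refer tank/containers/c1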



