Well, if the exact same semantics were required (i.e. always evict the oldest buffer), then it's really O(M*n), where M is the number of bytes to evict and n is the number of sublists; each sublist would have to be checked for each buffer that is evicted. But I agree, if that were the only concern, the extra comparisons _probably_ wouldn't be enough to be concerned about.
The bigger issue is how to lock the sublists. It seems heavy-handed to lock all sublists, find the oldest buffer, evict it, and then unlock all sublists. But if each sublist is locked, checked, and then unlocked in turn, a sublist's oldest buffer can change while the eviction thread is still iterating over the sublists to find the globally oldest buffer.
So, while the current solution is not very elegant, it is simple, requires minimal locking, and appears to work well enough.
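For illustration, here's a minimal C sketch of the per-sublist approach described above. The types and names (buf_t, sublist_t, evict_bytes) are made up for this sketch and are not the actual ARC code; the point is just that each sublist has its own lock and its own age ordering, so the evictor only ever holds one lock at a time, at the cost of not evicting in strict global age order.

    /*
     * Minimal sketch of per-sublist eviction.  Hypothetical types and
     * names, for illustration only; not the actual ARC code.
     */
    #include <pthread.h>
    #include <stddef.h>

    typedef struct buf {
        struct buf      *next;      /* next-newer buffer in this sublist */
        size_t          size;
    } buf_t;

    typedef struct sublist {
        pthread_mutex_t lock;       /* protects only this sublist */
        buf_t           *oldest;    /* head of this sublist's age-ordered list */
    } sublist_t;

    /*
     * Evict roughly 'bytes' by visiting each sublist in turn: lock it,
     * evict from its own oldest end, unlock, and move on.  Buffers are
     * not evicted in strict global age order, but no thread ever holds
     * more than one sublist lock at a time.
     */
    static size_t
    evict_bytes(sublist_t *lists, int nlists, size_t bytes)
    {
        size_t evicted = 0;

        for (int i = 0; i < nlists && evicted < bytes; i++) {
            sublist_t *sl = &lists[i];

            pthread_mutex_lock(&sl->lock);
            while (sl->oldest != NULL && evicted < bytes) {
                buf_t *b = sl->oldest;
                sl->oldest = b->next;
                evicted += b->size;
                /* free or recycle 'b' here */
            }
            pthread_mutex_unlock(&sl->lock);
        }
        return (evicted);
    }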
Currently, the number of sublists created is equal to the number of cores on the system.
While it'd be fairly trivial to extend the code to make it easy for an admin to define the number of sublists, I don't have any reason to believe that'd be a useful knob to export.
Indeed. This is the "Appeal to Lack of Authority" fallacy [1].
Authority has a reputation for being corrupt and inflexible, and this stereotype has been leveraged by some who assert that their own lack of authority somehow makes them a better authority.
Starling might say of the 9/11 attacks: "Every reputable structural engineer understands how fire caused the Twin Towers to collapse."
Bombo can reply: "I'm not an expert in engineering or anything, I'm just a regular guy asking questions."
Starling: "We should listen to what the people who know what they're talking about have to say."
Bombo: "Someone needs to stand up to these experts."
The idea that not knowing what you're talking about somehow makes you heroic or more reliable is incorrect. More likely, your lack of expertise simply makes you wrong.
Error detection can always be used, but error correction may or may not be available (it depends on the type of block). Metadata blocks are redundant even on a single-drive pool, so if you just have a partial failure (e.g. an overwritten metadata block) ZFS might be able to correct it using another redundant copy on the same drive. Data blocks, though, will require a redundant pool configuration (e.g. multiple drives in a raidz or mirror), as they are not stored redundantly by default.
Actually, you can correct data blocks on a single-drive ZFS pool if you set the attribute copies=2 (ditto blocks): https://blogs.oracle.com/relling/entry/zfs_copies_and_data_p... Obviously this redundancy feature makes you use twice the disk space you would normally use.
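To make the self-healing idea above concrete, here's a rough C sketch. The structures and helpers (blkref_t, read_at, checksum_of, write_at) are hypothetical stand-ins, not the real ZFS on-disk format or code; the point is just that a block pointer referencing more than one copy lets a checksum failure on one copy be repaired from another, while a single-copy data block can only be detected as bad.

    /*
     * Hypothetical types and helpers, for illustration only; not the
     * real ZFS structures or routines.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define MAX_COPIES  3

    typedef struct blkref {
        uint64_t    offset[MAX_COPIES]; /* on-disk location of each copy */
        int         ncopies;            /* 1..MAX_COPIES, i.e. copies=N */
        uint64_t    checksum;           /* expected checksum of the data */
    } blkref_t;

    /* Assumed to exist elsewhere in this sketch. */
    extern bool     read_at(uint64_t offset, void *buf, size_t len);
    extern uint64_t checksum_of(const void *buf, size_t len);
    extern void     write_at(uint64_t offset, const void *buf, size_t len);

    /*
     * Try each copy until one passes its checksum, then rewrite any
     * copies that failed before it.  With ncopies == 1 (the default
     * for data blocks) a bad read is detected but cannot be repaired
     * from the same drive.
     */
    static bool
    read_block(const blkref_t *bp, void *buf, size_t len)
    {
        for (int i = 0; i < bp->ncopies; i++) {
            if (read_at(bp->offset[i], buf, len) &&
                checksum_of(buf, len) == bp->checksum) {
                for (int j = 0; j < i; j++)
                    write_at(bp->offset[j], buf, len);
                return (true);
            }
        }
        return (false); /* every copy is bad: detectable, not correctable */
    }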
There is work in progress to add the ability to remove a device from a pool. So, hopefully in the not-too-distant future, that work will land in illumos and migrate to all the downstream implementations (e.g. Linux, Mac, and FreeBSD).
I've seen that metric thrown around when talking about the "dedup" feature of ZFS, but honestly, don't use dedup unless you know what you're doing. It's way too easy for things to go wrong otherwise.
How much memory does ZFS (and/or ZoL) need per 1TB of storage when dedup is off?
Also, bup ("it backs things up!") efficiently dedups across an ssh connection (using bloom filters). The scales are different, but the approach might work for ZFS as well.
The recommendations I've read say you need 1GB RAM for system use (assuming a dedicated file server), and then as much RAM (ideally ECC RAM) as you want to give it for caching data.
If you're short on RAM (sub 4GB), you might need to change some of the default settings to avoid problems, but RAM's fairly cheap nowadays, so unless you've got an old machine, it's not likely to be a problem :)
I'd point out that cheap RAM is about $10/GB right now, and a cheap HDD is $30/TB. So if you need an extra GB of RAM for each TB of storage, you're increasing storage costs by about a third.
No problem. It's really a shame to have such a high bar for using dedup. It either fits a given workload extremely well, or can be extremely detrimental.
There's been talk in the developer community about ways to address the usability of dedup, but so far nothing has gone further than small prototypes.
Setting zfs_dedup_prefetch=0 has been found to help systems using ZFS data deduplication. There is a patch in ZoL HEAD for the next release that makes zfs_dedup_prefetch=0 by default:
Aside from that, you are right that there has not been much done here. Making data deduplication more performant is a difficult task and so far, no one has proposed improvements beyond trivial changes.
I think for many or most use cases, it would make more sense to have "off-line" deduplication (like, I believe, BTRFS does), as you can free up space on demand, when you judge it would yield the most benefit and the system is least busy.
I'm not sure how much benefit this approach can provide compared to "real time" deduplication, however, as the mapping table would still need to exist in memory; but I think there should be an increase in the write performance of non-duplicate data.
PS. My use case is that I have a few dozen (linux-vserver) Gentoo containers that obviously share many files, and, unfortunately, trying to maintain a shared read-only mount of the core system doesn't seem to be practical/viable (as it is for, e.g., FreeBSD jails, mostly due to the clearer system separation there). The waste is not significant enough to really bother me; it would just be nice to avoid. The solutions that I am aware of are (not sure if I'm missing any; I'd be happy to be pointed to something else if there is):
- integrated FS-level deduplication (ZoL, BTRFS)
- higher-level deduplication (lessfs and opendedup)
- hard-linking scripts (obviously at the file-level)
Thanks, you mean LVM2/ZFS/BTRFS-based snapshots, right? I've seen that mentioned, but I was thinking that eventually, with system updates, the snapshots will end up having fewer and fewer blocks in common with their original source, so I would have to frequently recreate new snapshots from a more similar source and then copy the unique data on top of them again to raise the hit rate, but that sounds like a pretty inconvenient thing to do.
Yes, I run ZFS on Linux on my single drive laptop currently.
ZFS doesn't _need_ multiple drives to work well, but it is capable of using them if they're available. One can also run a HW RAID solution underneath ZFS, where the RAID engine just presents ZFS with a small number of LUNs. It all comes down to the admin and what sort of performance, redundancy, and maintenance guarantees are required.
FWIW, if there's anybody interested in learning about the ZFS code base, we'd love help porting patches from ZoL into Illumos and vice versa. That's a good way to get a new developer integrated with the code and process surrounding each platform.
I can't recall the exact details from memory, but I believe #1 has to do with the fact that ZFS creates/clones/snapshots/etc. are done in "syncing" context. Thus, each command has to wait for a full pool sync to complete, limiting the rate at which these can be done.
This is a known problem, and likely to be fixed in the not too distant future.
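Roughly, the mechanism looks like the sketch below. This is a simplified illustration with made-up names (pool_t, sync_task, sync_thread), not the actual DSL/txg code; it just shows why a command that has to wait for its transaction group to sync can complete no faster than the pool sync interval.

    /*
     * Hypothetical sketch: commands wait for a transaction group sync.
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>

    typedef struct pool {
        pthread_mutex_t lock;
        pthread_cond_t  synced;
        uint64_t        open_txg;   /* txg that new work lands in */
        uint64_t        synced_txg; /* last txg fully written to disk */
    } pool_t;

    /*
     * e.g. a create/clone/snapshot: its changes would be applied as
     * part of the open txg (omitted here), and the caller then blocks
     * until that whole txg has been synced.
     */
    static void
    sync_task(pool_t *p)
    {
        pthread_mutex_lock(&p->lock);
        uint64_t my_txg = p->open_txg;
        while (p->synced_txg < my_txg)
            pthread_cond_wait(&p->synced, &p->lock);
        pthread_mutex_unlock(&p->lock);
    }

    /*
     * The sync thread writes out the open txg and advances.  Since
     * each sync_task() caller waits for a full pass of this loop,
     * back-to-back creates/clones/snapshots complete no faster than
     * the pool can sync.
     */
    static void *
    sync_thread(void *arg)
    {
        pool_t *p = arg;

        for (;;) {
            pthread_mutex_lock(&p->lock);
            uint64_t txg = p->open_txg++;
            pthread_mutex_unlock(&p->lock);

            sleep(1);   /* stand-in for writing out everything dirtied in 'txg' */

            pthread_mutex_lock(&p->lock);
            p->synced_txg = txg;
            pthread_cond_broadcast(&p->synced);
            pthread_mutex_unlock(&p->lock);
        }
        return (NULL);
    }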