> I'm only creating /tempdir/$inodenumber if the hard-link count is >1
Which (again) is always the case for the OP.
> and once the destination hard-link count reaches the same count as the source it gets moved
Yes, I've thought of that optimization; I haven't checked whether cp does it already, but it might well.
Thing is, the OP said he's got an nlink count of about 20. This means that, if the links are evenly spread out over the file system, you only see the last link when you're about 95% through. That won't help much.
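Just to sanity-check that 95% figure, here's a throwaway simulation (nothing to do with cp or megacp, purely back-of-the-envelope):

    # If an inode's n links are spread uniformly at random over the traversal,
    # the expected position of the *last* link is n/(n+1). For n = 20 that's
    # about 95%, so the table entry can only be dropped near the very end.
    import random

    n, trials = 20, 100_000
    simulated = sum(max(random.random() for _ in range(n)) for _ in range(trials)) / trials
    print(f"analytic: {n / (n + 1):.3f}  simulated: {simulated:.3f}")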
> If I'm reading it right it requires doing a sort of the full file list.
Correct.
> This means you need to use enough space to list all the files and their inode number even if there are no hardlinks.
My algorithm works with the nlink==1 optimization just as well: just don't put those files into the file list that needs to be sorted; instead, copy them immediately or put them into a list that doesn't need sorting.
And the whole point of using the sort approach is not to make the list or index smaller (that's simply not possible); the point is that it works with lower RAM requirements. Because sorting needs mostly serial access and very little random access, it handles a list much bigger than RAM just fine.
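To make that concrete, here's a minimal sketch of the idea (Python standing in for the actual megacp code; error handling, symlinks, metadata and the external sort are elided):

    import os, shutil

    def copy_tree_sorted(src_root, dst_root):
        """Sketch of the sort-by-inode approach: nlink==1 files are copied
        immediately; paths of multiply-linked files are collected, sorted by
        (device, inode) so all links of one inode become adjacent, then the
        first path of each group is copied and the rest are hard-linked."""
        pending = []  # megacp would append these to a temp file for an external sort
        for dirpath, dirnames, filenames in os.walk(src_root):
            rel = os.path.relpath(dirpath, src_root)
            os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
            for name in filenames:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dst_root, rel, name)
                st = os.lstat(src)
                if st.st_nlink == 1:
                    shutil.copy2(src, dst)      # no bookkeeping needed
                else:
                    pending.append((st.st_dev, st.st_ino, src, dst))

        pending.sort()                          # external sort in the real thing
        prev_key, first_dst = None, None
        for dev, ino, src, dst in pending:
            if (dev, ino) != prev_key:
                shutil.copy2(src, dst)          # first occurrence: real copy
                prev_key, first_dst = (dev, ino), dst
            else:
                os.link(first_dst, dst)         # later occurrences: hard link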
PS. I wonder how much smaller the table gets when you always have nlinks==2 (the best case for the "done with the inode" optimization). The size of the table will follow some curve over the duration of the program; I'm not sure what it is exactly, perhaps half a sine wave? I expect its peak to be about half the peak of the unoptimized variant.
Is it a pity to lose that optimization by using my algorithm? Disk space is cheaper than RAM; dropping most of the RAM usage in exchange for using twice as much disk still seems worthwhile.
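For what it's worth, a throwaway simulation of that curve (random link positions, every inode having exactly two links):

    # Track how many table entries are live (first link seen, last link not
    # yet seen) over a random traversal order.
    import random

    n_inodes, nlinks = 100_000, 2
    events = []
    for ino in range(n_inodes):
        events.extend((random.random(), ino) for _ in range(nlinks))
    events.sort()

    seen, live, peak = {}, 0, 0
    for _, ino in events:
        seen[ino] = seen.get(ino, 0) + 1
        if seen[ino] == 1:
            live += 1            # first link: entry enters the table
        elif seen[ino] == nlinks:
            live -= 1            # last link: entry can be dropped
        peak = max(peak, live)
    print(f"peak: {peak} entries out of {n_inodes} inodes")

With random positions the expected number of live entries at fraction t of the traversal is n * 2t(1 - t), a parabola rather than half a sine wave, peaking at n/2 halfway through, which matches the "about half" guess.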
The "done with the inode" optimization will work better if hard linked paths appear close to each other. So perhaps sometimes still useful. It should be possible to come up with a hybrid: go with the hash table until it reaches a certain size; then convert the hash table into the start of the list, and continue by appending to the list. I'm not sure this is worth the more complex programming, though. If it looks nice just to avoid creation of a temp file in the common case, then a less complex alternative may be to store the list in the heap until a certain size (only then dump to file), otherwise still use my algorithm. That way swapping will behave nicer, too, in case part of the cp heap needs to be swapped out because of other system activity.
PPS. note that we're assuming that all the hard links are within the source tree that's being copied. If the files also have a path outside the source path, then the optimization won't work as you never reach zero.
> My algorithm works with the nlink==1 optimization just as well: just don't put those files into the file list that needs to be sorted; instead, copy them immediately or put them into a list that doesn't need sorting.
If you do that, then your extra storage requirement is now O(paths to inodes with nlinks > 1), which is still strictly higher than O(inodes with nlinks>1). You're also doing two passes, one to enumerate the files, another to do the copying and hardlinking, plus an extra operation (the sort). I'd be very surprised if that was faster, and unless you have an on-disk sort it will still be RAM+swap limited instead of just disk-space limited.
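To put rough numbers on that, using the OP's nlink of about 20 (the counts here are made up):

    inodes_with_links = 1_000_000                 # hypothetical number of inodes with nlinks > 1
    avg_nlink = 20                                # the OP's figure
    map_entries = inodes_with_links               # O(inodes with nlinks > 1)
    sort_records = inodes_with_links * avg_nlink  # O(paths to inodes with nlinks > 1)
    print(map_entries, sort_records)              # the sorted list has ~20x as many records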
> O(paths to inodes with nlinks > 1) .. O(inodes with nlinks>1)
Yes, that's right, I ignored that part in the calculation above.
That could be improved by using a filename + parent-pointer approach for path compression. (The current version of megacp also stores all the necessary stat information (size, uid, gid, perms) in the tempfile to avoid a second stat call; this could be dropped if it makes matters worse.)
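Roughly what I mean by that (a hypothetical sketch, not what megacp currently stores):

    import os

    def build_records(src_root):
        """Filename + parent-pointer sketch: each file record stores only
        (dev, ino, dir_id, name); a separate table maps dir_id to
        (parent_dir_id, dirname), so long common path prefixes are stored
        once per directory instead of once per file."""
        dir_ids = {src_root: 0}
        dir_table = {0: (None, src_root)}   # dir_id -> (parent_id, name)
        records = []                        # (dev, ino, dir_id, filename)
        for dirpath, dirnames, filenames in os.walk(src_root):
            this_dir = dir_ids[dirpath]
            for d in dirnames:
                new_id = len(dir_table)
                dir_ids[os.path.join(dirpath, d)] = new_id
                dir_table[new_id] = (this_dir, d)
            for name in filenames:
                st = os.lstat(os.path.join(dirpath, name))
                if st.st_nlink > 1:
                    records.append((st.st_dev, st.st_ino, this_dir, name))
        return dir_table, records

    def rebuild_path(dir_table, dir_id, name):
        # Follow the parent pointers back to the root to get the full path.
        parts = [name]
        while dir_id is not None:
            dir_id, dname = dir_table[dir_id]
            parts.append(dname)
        return os.path.join(*reversed(parts))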
Also, quite likely using a fast compression algorithm like lzop on the file would make disk access faster and disk storage requirements lower.
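For illustration only (gzip at its fastest level standing in for lzop, just because Python ships it; the principle is the same):

    import gzip

    # Stream the (inode, path) records through a compressor instead of
    # writing them raw; the tab-separated layout is a made-up example.
    records = [(2049, 123456, "src/a/x", "dst/a/x"), (2049, 123456, "src/b/y", "dst/b/y")]
    with gzip.open("/tmp/records.gz", "wt", compresslevel=1) as out:
        for dev, ino, src, dst in records:
            out.write(f"{dev}\t{ino}\t{src}\t{dst}\n")

    with gzip.open("/tmp/records.gz", "rt") as inp:
        for line in inp:
            dev, ino, src, dst = line.rstrip("\n").split("\t")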
> You're also doing two passes, one to enumerate the files, another to do the copying and hardlinking
Right now I'm doing exactly the same number of system calls as the cp approach needs. The second pass doesn't do readdir or, currently, even stat on the source file (that's why I saved the stat info above; which of the two is more beneficial would need to be tested). It may be beneficial to sort the sorted-by-inode file back into sorted-by-path order after the inode-cluster recognition step, to regain the original path locality during the copy phase (the included test/genfiles script already does that); I'll have to test that.
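Something like this (hypothetical record layout; both sorts would be external in practice):

    # Records are (dev, ino, src, dst), already sorted by (dev, ino, src).
    # The first path of each inode group becomes the "leader" (the real copy);
    # the others become hard links to the leader's destination.
    records = [
        (2049, 111, "src/dir2/b", "dst/dir2/b"),
        (2049, 111, "src/dir9/x", "dst/dir9/x"),
        (2049, 222, "src/dir1/a", "dst/dir1/a"),
    ]
    leader, plan = {}, []            # plan: (src, dst, link_target or None)
    for dev, ino, src, dst in records:
        if (dev, ino) not in leader:
            leader[(dev, ino)] = dst
            plan.append((src, dst, None))                  # real copy
        else:
            plan.append((src, dst, leader[(dev, ino)]))    # hard link to leader

    # Re-sort the plan by source path for the copy phase. Because the leader
    # is the lexicographically first path of its group, it always sorts before
    # the links that depend on it, so the plan stays executable in this order.
    plan.sort(key=lambda r: r[0])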
> unless you have an on-disk sort it will still be RAM+swap limited
Using an external (on-disk) sort algorithm is the whole point; I've been saying that from the start. I remember that Knuth wrote about external sorting, so I conclude such algorithms exist, and I quite expect that GNU sort(1) already sorts big files efficiently (limited testing seems to confirm that: files several times the size of RAM sort just fine in the low single-digit minutes).
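For reference, the kind of external merge sort I mean, reduced to a sketch (GNU sort(1) does essentially this, with many more refinements):

    import heapq, itertools, tempfile

    def external_sort(lines, chunk_lines=1_000_000, tmpdir=None):
        """Minimal external merge sort: sort fixed-size chunks in RAM, write
        each sorted run to a temp file, then lazily merge the runs. RAM usage
        is bounded by the chunk size, not by the total input size."""
        it = iter(lines)
        runs = []
        while True:
            chunk = sorted(itertools.islice(it, chunk_lines))
            if not chunk:
                break
            run = tempfile.TemporaryFile("w+", dir=tmpdir)
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
        return heapq.merge(*runs)

    # usage (every line must end in "\n" so the runs round-trip cleanly):
    # for line in external_sort(open("records.txt")): ...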
> I'd be very surprised if that was faster
You're ignoring how much slower disk seeks are compared to serial reading/writing. A lot of similar optimizations are done in OSes, databases, etc. for exactly that reason. I guess you'll argue next that SSDs don't have such high seek times (and that everybody should be using them); those points are fair, but SSDs are still a lot slower for random than for serial access, so whether the overhead of the sorting approach is warranted on SSDs remains to be tested. In any case, I very much expect it to be faster in the OP's scenario, which was with spinning disks (note how it took him many days; I expect the sort for the same data to take a few hours, quite likely less than an hour, given his RAID array).
> Also, quite likely using a fast compression algorithm like lzop on the file would make disk access faster and disk storage requirements lower.
> Using an external (on-disk) sort algorithm is the whole point; I've been saying that from the start. I remember that Knuth wrote about external sorting, so I conclude such algorithms exist, and I quite expect that GNU sort(1) already sorts big files efficiently (limited testing seems to confirm that: files several times the size of RAM sort just fine in the low single-digit minutes).
The cp algorithm already solves this just fine, and my proposed on-disk reimplementation of it solves the RAM usage problem. You're trying to make a strictly worse algorithm work: it both requires more storage and is heavier than it needs to be (a full sort versus just a map). What do you expect to gain from it?
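Concretely, the kind of thing I mean (a rough Python stand-in that keeps the inode map in a dbm file instead of the heap; not an actual cp patch):

    import dbm, os, shutil

    def copy_tree_diskmap(src_root, dst_root, mapfile="/tmp/linkmap.db"):
        """Same single-pass algorithm as cp: remember, per multiply-linked
        (dev, inode), the destination of its first copy and hard-link the
        rest, but with the map living on disk so RAM usage stays flat."""
        with dbm.open(mapfile, "n") as linkmap:        # "n": create a fresh database
            for dirpath, dirnames, filenames in os.walk(src_root):
                rel = os.path.relpath(dirpath, src_root)
                os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
                for name in filenames:
                    src = os.path.join(dirpath, name)
                    dst = os.path.join(dst_root, rel, name)
                    st = os.lstat(src)
                    if st.st_nlink == 1:
                        shutil.copy2(src, dst)
                        continue
                    key = f"{st.st_dev}:{st.st_ino}".encode()
                    if key in linkmap:
                        os.link(linkmap[key].decode(), dst)
                    else:
                        shutil.copy2(src, dst)
                        linkmap[key] = dst.encode()
        # cleanup of the dbm file(s) is omitted; the exact filenames depend on the backend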