
ROCm. About 40%. But there is duplication there as well. Two 16GB folders containing the exact same version.



Run rmlint on it; it will replace duplicate files with reflinks (if your fs supports them; xfs and btrfs do), or with hardlinks if not.


Thanks! Hearing about this for the first time. Never felt the need before.


Does uv have any plans for symlink/hardlink deduplication?


Not sure. The simplest solution is to store all files under a hashed name and sym/hardlink to them on a case-by-case basis. But some applications tend to behave weirdly with such files, and Windows has its own implementation of symlinks and hardlinks (it just calls them something else), so portability could be an issue.
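A hedged sketch of that idea in Python; the fallback chain, paths, and function name are illustrative, not how uv actually works:

    import os
    import shutil

    def link_from_store(store_path: str, target_path: str) -> None:
        """Place a file cached under a hashed name at target_path, preferring links to copies.

        Hard links need the same filesystem; symlinks on Windows may require Developer
        Mode or elevated privileges, so a plain copy is the last resort.
        """
        try:
            os.link(store_path, target_path)          # hard link: same inode, no extra space
        except OSError:
            try:
                os.symlink(store_path, target_path)   # symlink: works across filesystems
            except OSError:
                shutil.copy2(store_path, target_path) # copy: always works, costs space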


The article says it already hard-links duplicates, but that likely doesn't help if you are using multiple versions of the interpreter and libraries.


Sounds like a great use case for ZFS’s deduplication at block level.


I use ZFS everywhere EXCEPT on this drive. Not willing to have ZFS on the primary drive till native support lands in the kernel (so, never).


Have you tried borg [0]? Also, why not BTRFS?

[0] https://borgbackup.readthedocs.io/en/stable/index.html


Have been using ZFS for the past thirteen years, and all my workflows, including backup, are based on it. It just works.


Sure, I was just curious, since you mentioned not wanting to use ZFS without kernel support, and BTRFS does have that. Being familiar with ZFS is, I guess, a decent explanation.


When the topic of backups came up last year, I talked about my current solution: https://news.ycombinator.com/item?id=41042790. Someone suggested a workaround in the form of zfsbootmenu but I decided to stick to the simple way of doing things.


or, you know... symlinks


The main issue with symlinks is needing to choose the source of truth: one has to be the real file, and the others point to it. You also need to make sure they have the same lifetime to prevent dangling links.

A hardlink is somewhat better because both names point to the same inode, but it also won't work if the file needs different permissions or needs to be independently mutable from different locations.

A reflink hits the sweet spot: each copy can have its own permissions, updates trigger CoW so there are no confusing shared mutations, and total disk usage is still reduced.
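For concreteness, a minimal sketch of the hardlink/reflink difference on Linux; the FICLONE value is the usual ioctl number from linux/fs.h, and reflinks assume a CoW-capable filesystem such as btrfs or xfs with reflink support:

    import fcntl
    import os

    # ioctl request for FICLONE on Linux (_IOW(0x94, 9, int)); check linux/fs.h on your system.
    FICLONE = 0x40049409

    def hardlink(src: str, dst: str) -> None:
        """dst shares src's inode: same data, same permissions, writes visible through both names."""
        os.link(src, dst)

    def reflink(src: str, dst: str) -> None:
        """dst gets its own inode but shares src's data blocks; writes trigger copy-on-write."""
        with open(src, "rb") as s, open(dst, "wb") as d:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())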


I don't disagree, but I think some of these problems could potentially be solved by having something of a bird's nest of a filesystem for large blobs, e.g.

/blobs/<sha256_sum>/filename.zip

and then symlinking/reflinking filename.zip to wherever it needs to be in the source tree...

It's more portable than hardlinks, solves your "source of truth" problem and has pretty wide platform support.

Platforms that don't support symlinks/reflinks could copy the files to where they need to be then delete the blob store at the end and be no worse off than they are now.
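A minimal sketch of that layout in Python; the /blobs root and helper names are made up for illustration:

    import hashlib
    import os
    import shutil

    BLOB_ROOT = "/blobs"  # hypothetical blob store root

    def store_blob(path: str) -> str:
        """Move a file into /blobs/<sha256_sum>/<original name> and return the blob path."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        blob_dir = os.path.join(BLOB_ROOT, h.hexdigest())
        os.makedirs(blob_dir, exist_ok=True)
        blob_path = os.path.join(blob_dir, os.path.basename(path))
        if os.path.exists(blob_path):
            os.remove(path)               # same content already stored: drop the duplicate
        else:
            shutil.move(path, blob_path)  # first copy becomes the canonical blob
        return blob_path

    def place(blob_path: str, dest: str) -> None:
        """Symlink the blob into the source tree; platforms without symlinks could copy instead."""
        os.symlink(blob_path, dest)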

Anyway, I'm just a netizen making a drive-by comment.



