I think one of ZFS's most significant contributions was embracing the specific ways in which disks and HBAs often fail and then building mechanisms to ensure data integrity in the face of those failures. Bit rot and phantom writes are the filesystem's problem, even though they're not the filesystem's fault. ZFS did a lot of work to ensure that integrity: storing checksums in parent blocks, storing metadata redundantly, fixing bad copies with the good ones when corruption is detected, and scrubbing. In many filesystems still in use today, applications can easily receive garbage data from the system.
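To make that mechanism concrete, here's a rough sketch of the "checksum stored in the parent" idea in Rust. It's purely illustrative: the structure, names, and toy checksum are mine, not ZFS's (or TFS's) actual on-disk format.

```rust
// Hedged sketch of the "checksum in the parent" idea: a parent block's pointer
// to a child carries the child's checksum, so a corrupted child is detected
// before anything trusts its contents. The checksum here is a stand-in for
// something like fletcher4 or SHA-256.

struct BlockPointer {
    address: u64,  // where the child block lives on disk
    checksum: u64, // checksum of the child's contents, stored in the parent
}

fn toy_checksum(data: &[u8]) -> u64 {
    data.iter()
        .fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

/// Validate a child block that was just read from `ptr.address`.
fn verify_child(ptr: &BlockPointer, data: &[u8]) -> Result<(), &'static str> {
    if toy_checksum(data) == ptr.checksum {
        Ok(())
    } else {
        // A real self-healing filesystem would now try a redundant copy and,
        // if that copy verifies, rewrite the bad one; scrubbing does this
        // proactively for the whole pool.
        Err("checksum mismatch: block is corrupt")
    }
}
```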
I understand this filesystem is still nascent, but shouldn't data integrity at least be one of the design goals?
The first section in the README is called "Design goals", with 13 items. None of them is "data integrity", and none of them even talks about validating the data or handling any failures aside from power loss.
By contrast, in the canonical slide deck on ZFS[1], the first slide talks about "provable end-to-end data integrity". In the paper[2], "design principles" section 2.6 is "error detection and correction".
I'm glad to hear that's also a focus for TFS. With ZFS, the emphasis on data integrity resulted in significant architectural choices -- I'm not sure it's something that can just be bolted on later. As a reader, I wouldn't have assumed TFS had the same emphasis. I think it's pretty valuable to spell this out early and clearly, with details, because it's actually quite a differentiator compared with most other systems.
> The first section in the README is called "Design goals", with 13 items. None of them is "data integrity", and none of them even talks about validating the data or handling any failures aside from power loss.
Both ZFS and Btrfs were initially developed by really high-caliber people, experts with good track records. ZFS got five years of full-time development before release, and another five years to reach the feature set and stability it has now. Btrfs started 10 years ago and it's still trying to catch up.
Has anyone done an analysis of how many bugs in ZFS and btrfs are logic bugs vs. memory-safety issues, concurrent updates to memory, etc.?
Also, I'll point out that Apple just dropped a new FS on millions of devices, with no issues... that was developed in 3 or 4 years. I'm still blown away that they pulled that off.
They have full control of that environment, though. So they tested it on a few hundred devices, and by that exhausted all the possible configurations. And when everything worked, they knew they had a pretty good indicator of a successful deployment on all those millions of devices.
It's an impressive feat, regardless of the differences in target devices. Even with the hardware configurations well known, the fact that it was done at such a large scale successfully means that even unusual edge conditions didn't crop up.
This shouldn't be downplayed, it actually speaks to why it's important to have incremental stages of software delivery. First target highly constrained environments (iOS, Watch OS, tvOS), then work on the more difficult and less constrained general computing environment.
It's absolutely an impressive feat to pull this off. But it's not quite the same problem as building a robust general-purpose FS for a diverse ecosystem.
True; and yet, it feels like there is much low-hanging fruit left in filesystems built for specific vertically-integrated use cases. A NAS hardware-appliance company, for example, could likely pull off something similar to what Apple did, and to great benefit.
I'm not. Not only is the hardware behavior completely knowable and deterministic, but the user is seriously limited in how much they can directly affect either the hardware behavior or the filesystem.
Meanwhile, commodity hardware will accept a flush-to-disk command and return success even though it hasn't actually flushed to disk. This is done for performance reasons.
I also note that Apple deployed APFS by default only on the more locked-down of the two OSes they produce: it's not the default file system on macOS, where there's a much wider assortment of user intervention and hardware availability.
And HFS+ had enough issues of its own (hence its replacement) that it's entirely plausible the problems are fewer precisely because of the change in file systems. I imagine their intent and hope was exactly that; otherwise why bother?
Exactly. Bcachefs and HAMMER2 are other file-systems that are being worked on by some REALLY good programmers, for years already and for years to come. It seems like a file-system takes longer to stabilize than a kernel.
> It seems like a file-system takes longer to stabilize than a kernel.
For good reason, in my opinion. An unstable kernel will cause application glitches, unnecessary slowness, or OS crashes. Those are annoying but not persistent (i.e. a reboot and you can carry on for a bit). An unstable file system, however, is persistent: it could destroy all of your data and force you to recover from backups (assuming you're diligent enough to keep tested backups).
Plus file systems still have to deal with buggy consumer hardware and other similar edge cases (eg storage devices, power failures, etc) just like a kernel would.
It is even worse. You can recover from backups, but that implies you already know at which point you destroyed your data and how quickly you discovered it – if you ever do. Imagine you're collecting data over a substantial period of time (maybe doing calculations on it that influence how you collect further data), and you're very smart and back up all your data, let's say, every day. The problem is that if your data gets corrupted by the file-system, you are very unlikely to detect it. You can run your system for, say, a year without any apparent problems whatsoever. Let's say the errors accumulate and you eventually detect them by accident. Your backups are pretty much worthless ...
I've had some discussions with file system developers and a developer of a distributed file system. The sheer number of failure cases that need to be handled, because they are hit in production systems, is simply staggering.
I think you think too highly of the btrfs group. They're not bad people, but filesystems are HARD to get right, and they simply are not in the same league as the ZFS developers were. The well-known RAID 5 issue btrfs had is a prime example that btrfs simply is not in the same class as ZFS's design.
Btrfs started as a free and open source project from day 1, and people started using it from the moment it was merged mainline, which took something like a year.
ZFS was started internally at Sun with a bunch of resources dedicated by people who had been involved in and thinking about storage devices and file systems and their myriad problems for a long time before they started ZFS. And the cat wasn't out of the bag to end users for about 4 years, during which time there wasn't the contributor pile-on effect.
Btrfs had to walk a balancing act of not saying "no thanks" to patches; those contributions very likely cluttered up the code every bit as much as accepting them helped hype the project rather than turning people away from contributing at all.
As for raid56: ZFS emerged before cluster file systems like Ceph and Gluster were even conceived. The companies that need to store tons of data and fund various storage-related projects don't care nearly as much as they once did about raid56. They can just replicate the data elsewhere using Gluster, and if a whole brick, be it XFS or Btrfs, implodes, they can just make a new brick and the data gets replicated again.
Anyway, ZFS and Btrfs are really not comparable even though on the surface they seem to do really similar things.
I use ZFS under FreeBSD but I am wondering what the future of ZFS looks like.
I went to the OpenZFS website (http://open-zfs.org/wiki/Main_Page) to try and get a feel for the amount of work currently going on. They have some videos and slides, and they have a mailing list. The mailing list is mostly full of messages via GitHub, but at least that led me to https://github.com/openzfs/openzfs, which I was previously unaware of and had not found when looking for an OpenZFS-specific repository.
As far as I have come to understand, the goal of the OpenZFS project is to merge the changes made to ZFS by Illumos, FreeBSD, ZFS on Linux, and other projects. Their videos might answer this, but I wish their website had a clear and simple overview of what has been done. Most of what I can find on their wiki is ideas for things they want to do, so I can't tell whether the OpenZFS project is mostly talk and little action, or whether they have done a lot of those things but are too busy doing to write about it. It could also be that there are pages on the wiki I haven't seen, but in that case I think it should be organized better.
I know that Joyent picked up several highly skilled software engineers and programmers that used to work for Sun, and that their SmartOS operating system builds on code descendant from OpenSolaris and that ZFS is as integral on SmartOS as it was on Solaris, perhaps even more so on SmartOS. I have not seen mention of SmartOS on the OpenZFS wiki though.
All in all, I feel that ZFS is still being actively developed and maintained by many people. But I would like to know to what extent they are able to cooperate as much as many of them seem to want to. It would be a shame if btrfs overtook ZFS simply because the development of ZFS was too fragmented :/
The various Illumos distributions routinely upstream their changes into the master Illumos repository ( https://github.com/illumos/illumos-gate). I know that ZFS on Linux (https://github.com/zfsonlinux/zfs) routinely integrates those changes, as well as making some of its own and I believe pushing them upstream to Illumos. When I've looked at FreeBSD commit logs, they too are pulling in ZFS changes from Illumos on a frequent basis (but I know less about any independent development and upstreaming).
My strong impression is that the ZFS code base is a sufficiently big and tangled thing that no one wants it to fragment. With bug-fixes and improvements happening in Illumos, ZoL and FreeBSD both want to be able to incorporate them on a regular basis and to push their own changes upstream to reduce the maintenance burden of carrying those changes.
Illumos is the repo of record. Every one of the above OS distributions feeds changes upstream, keeping ZFS in sync between platforms. No one relying on ZFS wants forks or incompatibilities. That's one of ZFS's main selling points: you can yank a pool from FreeBSD and shove it into Illumos and vice versa.
I searched the github page and this HN comment thread for the string "frag" and got nothing ...
I don't know if the authors are here, but if they are - would you comment on fragmentation and the dangers of growing a filesystem past 95-98% full?
In the world of ZFS, performance can become significantly degraded with as little as 90% of the space filled. Further, our experience has been that you can permanently degrade filesystem performance by churning the usage above 95% for any significant amount of time. Which is to say, even if you reduce usage back down to 80%, the zpool retains poor performance until it is destroyed and recreated.
This is exactly what you would expect to see with a fragmenting filesystem that has no defrag tool.
Unfortunately, creating a defrag tool for ZFS is a very daunting technical hurdle and it appears that nobody is interested in pursuing it.
How does TFS behave? Does it have, or do you plan for it to have, a defrag utility?
> I don't know if the authors are here, but if they are - would you comment on fragmentation and the dangers of growing a filesystem past 95-98% full?
Fragmentation isn't an issue in TFS at all, because it is a cluster-based file system. Essentially that means that files aren't stored contiguously, but instead in small chunks. The allocation is done entirely on the basis of unrolled freelists.
This does cause a slight space overhead (only slight, since the file's metadata is stored in full form), but it completely eliminates any fragmentation.
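For anyone curious what that looks like, here's a minimal in-memory sketch of an unrolled-freelist allocator in Rust. The field names, sizes, and the use of a Vec are assumptions for illustration only; they don't reflect TFS's actual on-disk layout.

```rust
// Hedged sketch of cluster allocation with an unrolled freelist: each node of
// the list is itself a free cluster that records the addresses of many other
// free clusters, so allocating is a pop and freeing is a push.

const POINTERS_PER_NODE: usize = 63;

struct FreelistNode {
    self_addr: u64,                  // address of the cluster holding this node
    next: Option<Box<FreelistNode>>, // link to the next freelist node
    free: Vec<u64>,                  // addresses of other free clusters (bounded)
}

struct Allocator {
    head: Option<Box<FreelistNode>>,
}

impl Allocator {
    /// Hand out a free cluster address, or None if no space is left.
    fn allocate(&mut self) -> Option<u64> {
        let head = self.head.as_mut()?;
        if let Some(addr) = head.free.pop() {
            return Some(addr);
        }
        // The node has no spare pointers left, so the cluster holding the
        // node itself becomes the allocation and the list shrinks by one.
        let node = self.head.take()?;
        self.head = node.next;
        Some(node.self_addr)
    }

    /// Return a cluster to the free pool.
    fn free(&mut self, addr: u64) {
        if let Some(head) = self.head.as_mut() {
            if head.free.len() < POINTERS_PER_NODE {
                head.free.push(addr);
                return;
            }
        }
        // The current node is full (or there is none): start a new node
        // stored in the freed cluster itself.
        self.head = Some(Box::new(FreelistNode {
            self_addr: addr,
            next: self.head.take(),
            free: Vec::new(),
        }));
    }
}
```

Because any free cluster is as good as any other, there's no notion of a "contiguous run" to fragment in the first place; the trade-off, as the sibling comment notes, is that logically adjacent data can end up physically scattered.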
I only have a basic understanding of harddisks/filesystems, but won't that slow down reading/writing on harddisks since the chunks won't be in order and close together?
While there are many good arguments to be made against AES in favor of ARX construction ciphers, the choice of SPECK for this is not okay. The correct choice of an ARX cipher would have been something like ChaCha20 or Salsa20.
The choice of algorithm is less baffling than the choice of mode of operation. When designing a new file system, why in the world would you use unauthenticated XEX (or XTS) mode instead of an authenticated mode (SIV, HMAC+CTR, ChaCha20-Poly1305, Speck128+CMAC-Speck128, or whatever)? It's not like you need a one-to-one block mapping between encrypted and unencrypted data — you design the data structures yourself, it's a new filesystem! Can't they afford an additional 16 bytes?
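To make the space argument concrete, here's a hypothetical per-block record for an AEAD mode such as ChaCha20-Poly1305. The layout and names are made up for illustration; this isn't TFS's (or any real filesystem's) actual format.

```rust
// Hypothetical on-disk record for an authenticated block. The sizes match
// ChaCha20-Poly1305 (96-bit nonce, 128-bit tag).
const BLOCK_SIZE: usize = 4096;

struct EncryptedBlock {
    nonce: [u8; 12],              // unique per write, e.g. derived from a write counter
    tag: [u8; 16],                // Poly1305 tag: tampering or corruption is caught on read
    ciphertext: [u8; BLOCK_SIZE], // the encrypted payload
}
// Overhead: 28 bytes per 4 KiB block -- the "additional 16 bytes" (plus a
// nonce) referred to above.
```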
1) Are built from constant time operations, which means they are naturally resistant to side channel attacks (timing, cache, power, etc).
2) Are far simpler in their construction. This makes them easier to reason about and analyze.
3) Related to #2, this also makes them really easy to implement, which means less likelihood of some coding mistake.
Beyond that, most recent ARX ciphers also have a few other advantages over AES. For example, Threefish has a built-in tweak field, which makes using it infinitely easier in practice.
EDIT: In case you're hungry for more detailed explanations, I highly recommend reading the papers for Salsa/Chacha and Threefish. They're very well written, easy to understand even if you don't have a lot of experience with cryptography, and they have sections that explain the design decisions in enlightening detail.
ARX constructions are also easier to tune for high software performance, and generally don't require special hardware support, because all CPUs already have fast ARX operations built in.
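To make the add-rotate-xor point concrete, here's the ChaCha quarter-round, essentially the whole core of the cipher, sketched in Rust:

```rust
// The ChaCha quarter-round: nothing but 32-bit addition, rotation, and XOR.
// All three operations run in constant time on essentially every CPU.
fn quarter_round(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]); s[d] ^= s[a]; s[d] = s[d].rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]); s[b] ^= s[c]; s[b] = s[b].rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]); s[d] ^= s[a]; s[d] = s[d].rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]); s[b] ^= s[c]; s[b] = s[b].rotate_left(7);
}
```

Compare that to AES, which needs S-box lookups (the classic source of cache-timing leaks in software) or dedicated hardware instructions to be both fast and constant-time.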
Firstly, it's ChaCha20. I don't think anybody in their right mind would advocate a 2-round ChaCha. Secondly, there _are_ stream cipher constructions that achieve the design requirements for something like TFS.
On the contrary, it makes it suitable for file systems! File systems are not block devices, they are data structures on top of block devices — it's the job of these data structures to keep stuff, such as data, inodes, and... keys, and IVs, and MACs, checksums, etc.
"If you’re encrypting a filesystem and not disk blocks, still don’t use XTS! Filesystems have format-awareness and flexibility. Filesystems can do a much better job of encrypting a disk than simulated hardware encryption can."
Edit 2: also check out bcachefs encryption design doc: http://bcachefs.org/Encryption/ (also not perfect, but uses proper AEAD — ChaCha20-Poly1305. I sent some questions and suggestions to the author, but received no reply :/)
I wonder whether it would be possible to use it somehow with Linux (in kernel space, not with FUSE, because FUSE works more slowly due to the necessary context switches between kernel and user space). I mean, it would be interesting to know whether a wrapper kernel module can be written to interface with the Rust code, or whether there are obstacles that would prevent doing that efficiently.
The unstable kernel ABI is really interesting. It's one of the reasons we see all these shims between proprietary drivers and the kernel (nvidia/amd) - that and licensing, of course.
I know this is true on the nvidia side, but I think it is less true on the AMD side [0]. The old AMD drivers may have been like this, but it appears they have changed.
What cipher block mode is used with SPECK? Block-based disk encryption is complicated to get right due to replay attacks of blocks over time and IV issues. There are established best-practice compromises with AES, but I don't know if they apply to other block ciphers and doubt they are tested.
XEX, unfortunately. That's a mistake. Unauthenticated tweakable wide-block cipher modes are designed for simulated hardware disk encryption. That's not what an encrypted filesystem is: a filesystem knows where files begin and end, and has space for metadata. Filesystem encryption should use authenticated encryption.
I assume they're targeting more than just ARM and x86, given they're talking about "truely portable AES implementations" in the README[1].
That said, I'm very inclined to distrust a very young cipher... released by the NSA, no less[2]. It's an add-rotate-xor cipher[2], for which we already have the much better reviewed ChaCha20[3], suggested for TLS 1.3[4].
Or if they really want a lightweight cipher with a small block size, they should consider SPARX, especially since a Rust implementation is readily available: https://github.com/jedisct1/rust-sparx
Salsa20 and ChaCha20 are faster on modern hardware than built-in AES instructions. AES is just old and complex. As the Salsa family and other modern ciphers have shown, you don't have to use a complex function to achieve security.
- AES-256-CTR is about 25% faster with AES-NI than ChaCha20. The gap widens when we consider their respective AEAD constructions.
- The AES construction simply had different design goals than ChaCha20. Software implementations took a back seat; hardware implementations were important, and AES is unproblematic for those.
- AES is arguably a rather conservative design. To some extent even twenty years later.
Salsa20 and ChaCha20 aren't FIPS140-2 certified (that I'm aware of or can find documented) which is pretty much a basic requirement at this point for anything in the enterprise.
And they probably won't be, for political reasons. Salsa20 was created by DJB, who has previously fought the US government over its encryption export ban, and they also have their own suites they prefer you to use (for their own reasons).
It doesn't matter if I trust or distrust the US government for creating secure encryption. There are laws in place across a plethora of industries REQUIRING FIPS140-2. If you want to do business in the US, it's a basic requirement.
At the end of the day the odds of this filesystem gaining any traction are basically 0. I'm just pointing out that not supporting an encryption algorithm that can meet the FIPS requirement makes it basically a non-starter for any commercial application.
> At the end of the day the odds of this filesystem gaining any traction are basically 0. I'm just pointing out that not supporting an encryption algorithm that can meet the FIPS requirement makes it basically a non-starter for any commercial application.
In the US. In many other countries, it might gain lots of traction.
Improved caching
TFS puts a lot of effort into caching the disk to speed up disk accesses. It uses machine learning to learn patterns and predict future uses to reduce the number of cache misses.
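For a rough idea of what "learning patterns" can mean in practice, here's a toy Markov-style prefetch predictor in Rust. It's purely illustrative, an assumption about the general idea rather than TFS's actual algorithm:

```rust
// Toy sketch of pattern learning for prefetch: remember which block tends to
// be read after which, and prefetch the most likely successor.
use std::collections::HashMap;

struct PrefetchPredictor {
    // previous block -> (next block -> times that transition was observed)
    transitions: HashMap<u64, HashMap<u64, u32>>,
    last_read: Option<u64>,
}

impl PrefetchPredictor {
    /// Record that `addr` was read, updating the transition counts.
    fn record_read(&mut self, addr: u64) {
        if let Some(prev) = self.last_read {
            *self
                .transitions
                .entry(prev)
                .or_default()
                .entry(addr)
                .or_insert(0) += 1;
        }
        self.last_read = Some(addr);
    }

    /// The block most often read right after `addr`, if any -- a prefetch hint.
    fn predict_next(&self, addr: u64) -> Option<u64> {
        self.transitions
            .get(&addr)?
            .iter()
            .max_by_key(|&(_, &count)| count)
            .map(|(&next, _)| next)
    }
}
```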
See, that does sound like a good idea - I've always observed HDDs with an SSD cache to have a phenomenally useless caching system.
I'm not a big fan of putting machine learning into a file system. Usually you can't understand why a machine learning algorithm is doing what it's doing, and I would be worried about a production server suddenly having massively different performance because the cache learning algorithm started doing something differently.
Interesting idea, but I would want to battle test it before I bought in.
It's impossible to tell what's going on nowadays with optimizing compilers, memory overcommitment, hardware branch prediction, power-saving measures, etc.
This can't be seen as anything other than a research filesystem that tries several new things at once. Doomed from the start.
A good design would look at the state of the art and use the best techniques available. If the aim was research, then try one new thing, not a thousand.
For actual promising new filesystem efforts I'd be looking at HAMMER2, Tux3 and F2FS.
> A good design would look at the state of the art and use the best techniques available. If the aim was research, then try one new thing, not a thousand.
That's what it does: It takes from many sources (although mainly ZFS).
A power-off is a simple fault, with fairly well-defined effects. It's actually one of the easiest cases for a data-storage system to deal with. "Hardware failure" includes all manner of crazy Byzantine faults, many of which are literally impossible to deal with. What this is saying is that TFS's model of the underlying hardware is that it will either execute all writes correctly or stop executing any.
Sadly, a lot of hardware has much more complicated behavior than that. A lot of RAID cards in particular will lie about what was actually written, so in a power-off scenario later writes might have made it while earlier ones didn't, writes can be incomplete, etc. I don't mean this as a knock against TFS. It's more of a suggestion that the fault model be expanded to include at least a few more possibilities.
TFS provides the following guarantees:
- Unless data corruption happens, the disk should never be in
an inconsistent state \footnote{TFS achieves this without using
journaling or a transactional model.}. Power-off and the like
should not affect the system such that it enters an invalid or
inconsistent state.
Provided that the following premises hold:
- Any sector (assumed to be a power of two of at least
\minimumsectorsize bytes) can be read and written atomically,
i.e. it is never partially written, and interrupting the write
will never leave the sector in any state other than fully
written or still holding the old data.
Data corruption can break these guarantees or premises, and TFS
encompasses certain measures against such corruption, but they are
strictly speaking heuristic, like any error detection and correction
method, as the damage could be across all the disks.
An unexpected shutdown would just use the features of a journaling file system to recover the data. A hardware failure would be a RAID controller failure or malfunction, or a hard drive / SSD failure.
I'm fairly sure it's serious, though if you look at the specification it's made to be easily swappable. It's specified by a 'vdev', which is a layer on top of the core filesystem selected by a 16-bit int. You could easily make another one for ChaCha20 and use that instead (or also).
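As a rough illustration of that kind of layering (the IDs and names below are made up, not TFS's actual values), the dispatch could look something like this:

```rust
// Hedged sketch of a swappable cipher layer: the vdev header stores a 16-bit
// cipher ID, and the implementation dispatches on it at mount time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Cipher {
    Identity, // no encryption
    Speck128,
    ChaCha20, // a hypothetical future addition
}

fn cipher_from_id(id: u16) -> Option<Cipher> {
    match id {
        0 => Some(Cipher::Identity),
        1 => Some(Cipher::Speck128),
        2 => Some(Cipher::ChaCha20),
        _ => None, // unknown cipher: refuse to mount rather than misread data
    }
}
```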
TFS was created to speed up the development. The issue is that following the design specs makes it much slower to implement, and prevents a "natural" development (like, you cannot implement it like a tower; you need every component before completion). It was started[1] and got far enough to read images, but implementing it took ages, so we decided to put it off for now.
This doesn't quite seem to follow? ZFS's pool model has supported flagged-off features for a very long time - isn't the issue more that to do the things ZFS does, you need to implement all the other components? And since you're planning to do a lot of what ZFS does...
I don't consider that a bad thing. The project may pick up more contributors down the road — or it might not. Either way, that doesn't speak negatively of the project itself.
I certainly agree. I think there are compelling arguments both for keeping the work to a very small handful of people and for spreading the work across many contributors. It seems to be early days for TFS, but so far it looks like an impressive bit of work.
Well.. let's google it... "TFS" .. hmm, an abridged lead-in to the Wikipedia article at the top... And though I'm not a systems programmer likely to implement, or support the code for, a filesystem, I am A programmer, and work in IT... And systems operators are also likely to come across TFS (Team Foundation Server) in terms of supporting a deployment of it.
Though they've started to refer to the source control protocol implementation as TFVC, since TFS supports git as well now. The name does seem to have some conflicts, and the project itself even notes another file system called TFS.
In this case, I'm pretty sure another name might be a better idea. Hell, TFS the version control system and the other file system are better known than Firebird the database was when Mozilla renamed their shiny new browser.
The point is, there's already a prevalent technology in use by the same name... I wouldn't really expect to choose a programming language based on googleability... Go is to this day hard to search for on its own, "golang" being better.
That said, I wouldn't expect a "new" programming language called "Java-Script" (not the current ES/JavaScript) to gain traction. Or for that matter a programming language called "Coffee" to be very successful either.
You're tilting at windmills. Most threads where someone starts something new have someone making complaints similar to the one you're making here. Yet people's behavior doesn't change. And it won't -- it's just too much overhead to avoid the ever-expanding space of products that have the same name.
Yes, but there are at least two other filesystems called TFS as well, per other threads... so even then, it's still overloaded. If I were releasing something to the public, I would probably namespace it or consider something different in this case.
For what it's worth, I think you should have taken the advice of the guy you're replying to. The first result for "TFS file system" is https://github.com/redox-os/tfs.
I and everyone else on the team enjoy the fact that broken code produced by offshore developers, who should have been fired in the first place, never touches the official development repository.
Those developers will eventually produce something that passes the unit tests and gets merged, instead of borking the build for weeks.
Should the process be done in another form, with code reviews and such? Yes, but that isn't how many enterprise projects are managed.
Note that I am only referring to those that can't learn to code even if we ELI5 them.
There are others on the offshore teams that are highly skilled, but like us onsite, cannot do anything to change the rules of the game.
I read the TFS github landing page and it didn't explain what the 'T' stood for. I then noticed the author's name is "Ticki"[1]. Therefore, I can only guess that TFS stands for "Ticki's File System".