Given the age of these posts, many of the numbers used are outdated, but most of the general ideas are still relevant. There are a few other concepts that would be worth mentioning these days even in this kind of high-level overview: error handling and recovery has gotten really complex, and drives will adjust how they perform read and program operations as the flash ages. Some drives do proactive scanning for data degradation to catch errors before they become uncorrectable. There are new ways for the host OS to provide hints about data lifetime and preferred data placement, which can be very helpful in avoiding unnecessary write amplification. In the absence of such hints, some drives have heuristics that try to infer data lifetime information from IO patterns.
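One concrete example of those lifetime hints: on Linux, a process can tag a file descriptor with an expected write lifetime via fcntl(F_SET_RW_HINT), and devices that support write streams may use it to group data that will be invalidated together. A minimal sketch, assuming a reasonably recent kernel (4.13+) and a libc that exposes the hint constants; the filename is made up:

    /* Minimal sketch: tag a file as holding short-lived data so the
     * kernel/device can try to group writes with similar lifetimes.
     * Whether the SSD actually uses the hint is up to its firmware. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("scratch.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t hint = RWH_WRITE_LIFE_SHORT;   /* "expect this to be rewritten soon" */
        if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
            perror("F_SET_RW_HINT");            /* older kernels/libcs: not supported */

        /* ... write the short-lived data here ... */
        return 0;
    }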
I don't have enough knowledge in any of the areas involved, but my intuition makes me doubt the long-term security of these complexities, although I really appreciate the results.
Or is there no need for trusting the firmware, if all the I/O is encrypted?
I am fantasizing that all these data patterns could, combined with other information, lead to the data-storage/-encryption equivalent of branch-prediction exploits..
..but maybe it's just half-knowledge combined with fear of what you don't understand..
..and of course this wouldn't be in the first layer of attack.
Could you make money by producing a cheap "dumb" flash memory where you have raw access and leave all the wear leveling to software by using one of the many available log structured file systems?
Some people would rather have a fully transparent view of, and control over, the state and health of their storage, and make their own decisions about handling problems in software instead of talking to a black box.
FYI: This is already a thing in the embedded world and Linux already has corresponding software infrastructure:
The "mtd" subsystem which stands for "memory technology devices" is a thin hardware abstraction layer for primarily raw flash devices with an internal framework for NAND flash with driver backends for all sorts of controlers and devices and support for e.g. software error correction if the device/controller doesn't support it, etc...
The "ubi" subsystem which provides wear-leveling, bad block management and LVM style logical volume partitioning on top of mtd. It does not emulate a normal block device, it acts more like a flash device with idealized properties.
The "ubifs" file system which was specifically designed for working on top of ubi.
If you try to use this stack at a larger scale to build a DIY Linux-powered SSD, you might run into scaling problems with ubi though (and don't think companies haven't tried that).
While it basically works, pretty much every flash vendor, as always, guards the inner workings of their devices as trade secrets, and those parts have undocumented quirks as well as undocumented, secret commands for working around those quirks that are used by the vendors' own SSDs and eMMCs.
> Could you make money by producing a cheap "dumb" flash memory where you have raw access and leave all the wear leveling to software by using one of the many available log structured file systems?
Technically, the market exists, but whether you could get into it and make money is another question. I'm willing to bet that if you design a NAND flash from scratch and try to sell it, the existing vendors will clobber you with patent infringement lawsuits.
AFAIK the reasons why Linux has ubi instead of a full, in-kernel flash translation layer are also legal ones.
Most Linux-based home routers (which means most routers) have raw access to flash. If you ever want to play with it, you can find an old router at a local thrift shop, check whether it can run DD-WRT or OpenWRT, and then experiment.
You just have to be careful, because the bootloader and NVRAM partitions are crucial to the device booting. Serious experimenters will also find the serial port header pins and get a working serial console.
Most home routers use serial NOR flash, not NAND flash. Lower capacity, but easier to interface with. No wear leveling is supported or required, as the expected write count is much lower than the endurance of the device.
The cost savings probably wouldn't be as dramatic as you hope. You'd still need some sort of controller for serdes -- NAND flash chips tend to use big parallel busses which aren't suited for off-board connectors -- and, at that point, it's not that much of a jump to make that controller handle wear leveling and error correction.
(Having the storage device handle those tasks is pretty convenient, anyways. It means you don't have to perform error correction on the main CPU -- which is a nontrivial ask -- and it means the SSD can behave as a bootable device.)
Wear levelling and error correction are pretty different problems.
Error correction is well suited to hardware, whereas wear levelling in my opinion is not.
My ideal design would have basic hardware error detection and correction (hopefully configurable for how many data bits vs. ECC bits), maybe with a few hierarchical levels. Wear levelling and management of data layouts would all be done in software.
You could imagine a "bulk read" API which is given a block number to start at, with a bunch of ECC parameters, which would DMA a large block of error-corrected data to RAM, together with information about the error rate in each word/block read.
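To make that concrete, here's roughly the shape such an interface could take (everything here is hypothetical, names and parameters included; it's a sketch of the idea, not any real controller's API):

    /* Hypothetical sketch of a "bulk read" interface: the controller reads
     * 'block_count' blocks starting at 'start_block', applies ECC with the
     * requested data/parity split, DMAs the corrected data into 'dst', and
     * reports per-block error statistics so host software can decide when
     * to migrate or rewrite data. Not a real API. */
    #include <stdint.h>

    struct ecc_params {
        uint16_t data_bytes;     /* payload bytes per codeword */
        uint16_t parity_bytes;   /* ECC bytes per codeword */
        uint8_t  levels;         /* hierarchical ECC levels to apply */
    };

    struct read_stats {
        uint32_t corrected_bits; /* bit flips the ECC fixed in this block */
        uint8_t  uncorrectable;  /* nonzero if the ECC gave up */
    };

    /* Returns 0 on success; 'stats' must have room for one entry per block. */
    int flash_bulk_read(uint32_t start_block,
                        uint32_t block_count,
                        const struct ecc_params *ecc,
                        void *dst,
                        struct read_stats *stats);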
> You could imagine a "bulk read" API which is given a block number to start at, with a bunch of ECC parameters, which would DMA a large block of error-corrected data to RAM, together with information about the error rate in each word/block read.
There are a lot of existing SoCs that have built-in flash controller hardware that can interface with an external flash chip and, from a driver perspective, are controlled pretty much the way you describe.
The margins on flash media are razor thin already. This is a high-volume business where fractions of a cent per unit count. If you can shave off a little more by using a dumb controller that doesn't contain any IP that needs to be licensed, it is very easy to calculate how high the volume has to be in order to be profitable.
Consider that Olympus/Fujifilm tried this in the early 2000s with XD-Picture Card. (The cards were literally just a NAND flash chip wired up to the pins.) Even though this theoretically reduced the BOM cost for the cards, XD lost in the market to other solutions like Compact Flash and SD, which embedded a controller.
xD failed because it was a proprietary standard used exclusively by FujiFilm and Olympus, and the cards were generally more expensive for consumers than the completely open and free SD format.
The lack of a flash translation layer was a major advantage.
The main abstraction costs are excess DRAM and overprovisioning, and those don't come from wear levelling but from the block abstraction layer (i.e. pretending to support random writes). There used to be a buzz around 'open-channel SSDs', which give a lot of this control and responsibility to the host, but that often just meant the host was absorbing all those costs. The hot new thing is 'zoned namespaces', which keeps wear levelling on the host but provides an API without totally random writes that maps well to the hardware, and that seems to be a much better trade-off.
Flash can support random writes, but not overwrites. The zoned namespaces abstraction is a little bit more restrictive than is necessary for SSDs, but it's being developed that way so that the same abstraction can also be used with SMR hard drives.
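For anyone who hasn't looked at zoned devices, the semantics boil down to roughly the following (a conceptual sketch of the model only, not the actual NVMe ZNS command set):

    /* Conceptual model of a zone: writes are only accepted at the write
     * pointer, and space is reclaimed by resetting the whole zone, which
     * maps naturally onto flash erase blocks (and SMR shingles). */
    #include <stdint.h>

    struct zone {
        uint64_t start;     /* first LBA of the zone */
        uint64_t capacity;  /* writable LBAs in the zone */
        uint64_t wp;        /* write pointer, relative to start */
    };

    /* Sequential-only write: fails unless it lands exactly on the write pointer. */
    static int zone_write(struct zone *z, uint64_t lba, uint64_t nlb)
    {
        if (lba != z->start + z->wp)
            return -1;                 /* not at the write pointer */
        if (z->wp + nlb > z->capacity)
            return -1;                 /* would overflow the zone */
        z->wp += nlb;                  /* device advances the pointer */
        return 0;
    }

    /* Reset discards the zone's contents and makes it writable again. */
    static void zone_reset(struct zone *z)
    {
        z->wp = 0;
    }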
They're not quite append-only; there's a 'zone random write area' that you can update in place, which exposes the write buffer in front of a zone. To a lesser degree, it also supports parallel appends, which naturally aren't purely sequential either.
So you've got random writes both between zones and at the front of each zone. The one thing missing is being able to erase blocks within the committed portion of a zone, but idk how you'd support that without just reimplementing a block layer.
The ZNS spec doesn't seem to have been published quite yet, but from what I can find it seems like the zone random write area is an optional extension that won't necessarily be implemented by all vendors.
I have always wanted to do this. I would much rather manage redundancy and error correction at the application level, so that the application has a good idea of what data is being lost well in advance of it being gone for good.
The idea of having some chip running software I've never seen in charge of this seems crazy to me. You can look at how much data people lose on SD cards (every Raspberry Pi user will tell you they "go bad"); most of it is not loss of the content in the flash itself, but corruption from interrupting the internal MCU at a bad time. Remove that and use proper transactional writes, and they'd probably never instantly go bad. But random write benchmarks would not look as good.
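By "transactional writes" I mean nothing exotic, just the usual copy-on-write trick: build the new state off to the side and make one small write (sequence number plus checksum) the commit point, so a power cut leaves you with the old state rather than a half-updated one. A toy, in-memory illustration of the idea (not any vendor's actual FTL design):

    /* Toy illustration of a crash-safe, copy-on-write mapping update:
     * two fixed table slots, a sequence number, and a checksum acting as
     * the commit record. The newest slot with a valid checksum wins, so
     * interrupting a write never leaves a half-updated mapping visible. */
    #include <stdint.h>
    #include <stdio.h>

    #define MAP_ENTRIES 16

    struct table {
        uint64_t seq;                   /* newest valid copy wins */
        uint32_t map[MAP_ENTRIES];      /* logical -> physical */
        uint32_t sum;
    };

    static uint32_t checksum(const struct table *t)
    {
        uint32_t c = 0x9e3779b9u ^ (uint32_t)t->seq;   /* nonzero seed */
        for (int i = 0; i < MAP_ENTRIES; i++)
            c = c * 31u + t->map[i];
        return c;
    }

    static struct table slots[2];       /* stand-ins for two on-flash copies */

    /* Pick the newest slot whose checksum verifies (i.e. wasn't torn). */
    static int current_slot(void)
    {
        int best = -1;
        for (int i = 0; i < 2; i++)
            if (slots[i].sum == checksum(&slots[i]) &&
                (best < 0 || slots[i].seq > slots[best].seq))
                best = i;
        return best;
    }

    /* Copy-on-write update: the currently valid table is never modified. */
    static void commit(uint32_t logical, uint32_t physical)
    {
        int cur = current_slot();
        struct table next = slots[cur];
        next.map[logical] = physical;
        next.seq++;
        next.sum = checksum(&next);     /* only now is the copy "committed" */
        slots[1 - cur] = next;          /* written to the inactive slot */
    }

    int main(void)
    {
        slots[0].sum = checksum(&slots[0]);    /* initial empty table */
        commit(3, 42);
        printf("entry 3 -> %u\n", slots[current_slot()].map[3]);
        return 0;
    }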
I did something like that long ago: I stole the md code and wrote my own raw block I/O routines, where low-cost data was RAID 0 and could be regenerated on demand, while indexes and pointer tables were duplicated across all the spindles, which even allowed greater parallelism.
It was a lot of work; it had advantages, but today it probably wouldn't be worth the effort.
I don't think you could unless you had a flagship customer using it in a huge data centre. You'd need to write a "reference design" filesystem for it anyway. Since volume is so important in electronics, you'd need to scale up a lot in order to make it cost-effective.
Amazon, Google, Microsoft and Apple are in a position to do this for their cloud offerings, but so far as I can tell they haven't. Which tells you it's probably not cost effective.
(I think there are benefits to be had from giving even local storage a "blob" interface, as you reduce the amount of random writes, but we're not there yet)
> Could you make money by producing a cheap "dumb" flash memory where you have raw access
The problem is that low-level or "raw" access to flash is not that well-defined, and not "cheap" or "dumb" at all. SD cards and eMMC storage are cheap and use very basic FTLs, but what they provide is far from "low-level" access - you can't even erase a block and then write its sub-blocks (pages) individually, which is the basic operation in any flash translation layer.
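For contrast, this is roughly what that erase-then-program cycle looks like when you do have raw access, e.g. through the Linux mtd character device (a sketch only: real code would also check for bad blocks, and you'd point it at a scratch partition, not one holding anything you care about):

    /* Erase one whole erase block, then program it page by page, in order.
     * This is the low-level primitive an FTL is built on; SD/eMMC devices
     * never expose it to the host. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <mtd/mtd-user.h>

    int main(void)
    {
        int fd = open("/dev/mtd2", O_RDWR);          /* a scratch mtd partition */
        if (fd < 0) { perror("open"); return 1; }

        struct mtd_info_user info;
        if (ioctl(fd, MEMGETINFO, &info) < 0) { perror("MEMGETINFO"); return 1; }

        /* 1. Erase the first erase block (flash erases to all 0xFF). */
        struct erase_info_user ei = { .start = 0, .length = info.erasesize };
        if (ioctl(fd, MEMERASE, &ei) < 0) { perror("MEMERASE"); return 1; }

        /* 2. Program it one page (writesize) at a time, sequentially. */
        unsigned char *page = malloc(info.writesize);
        if (!page) return 1;
        memset(page, 0xA5, info.writesize);
        for (uint32_t off = 0; off < info.erasesize; off += info.writesize)
            if (pwrite(fd, page, info.writesize, off) != (ssize_t)info.writesize) {
                perror("pwrite");
                break;
            }

        free(page);
        close(fd);
        return 0;
    }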
Most SD cards do have debug modes for reading and writing the raw flash... but they're undocumented and manufacturer-specific, so you probably wouldn't want to rely on them in a shipping product.
I used to do SSD firmware development. The raw flash is quite unreliable, with a lot of non-ideal behavior to work around. Also, most of the cost is in the flash itself.
The T2 chip contains the functionality of a somewhat non-standard NVMe SSD controller, so the x86 CPU running macOS has pretty much the same responsibilities as when dealing with an off the shelf SSD.
Given the rate at which we are progressing on SSD endurance, these methods will become irrelevant. Even five years back, the old Tech Report endurance article showed that, for an average user, drives could handle a lot of writes.
The winner of that test, the 840 Pro, used 2-bit (MLC) cells.
Nowadays, 3-bit (TLC) and 4-bit (QLC) are much more common (not sure QLC even existed back then). So whilst each of these technologies has matured, the drive a consumer may consider today is more likely to be bigger, replacing their HDD entirely, and will probably be capable of fewer writes than an MLC drive from 5 years ago, iiuc.
The MLC->TLC transition decreases endurance, but the 2D->3D transition increases it a lot. Current 3D TLC drives handle more TBW than old consumer 2D MLC drives, which is why the industry is moving on to QLC chips.
The best SSD in that test lasted for roughly 1,000x its capacity in writes. Without a good wear-leveling SSD controller, you would see data loss within a day or two on e.g. log files. A good controller will never be irrelevant.
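To put rough, illustrative numbers on that (assuming on the order of 1,000-3,000 program/erase cycles per erase block, which is the right ballpark for consumer TLC): a log that rewrites the same filesystem block about once a minute performs ~1,440 erase cycles per day, so without wear leveling that single physical block would blow past its rated endurance in a day or two. With wear leveling, those writes get spread across every block in the drive, which is roughly where figures like ~1,000x the drive's capacity come from: cycles per block, times the number of blocks, divided by write amplification.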
A bit off-topic, but speaking of SSDs, how do you compare them when shopping around? I see different brands and models advertising similar features at different price points, and I know enough to look at throughput and IOPS, but beyond that it's hard to tell how to evaluate the differences. E.g. should consumers be wary of QLC? Do the models and brands matter a lot? Can you somewhat trust the listed lifetimes? Are there other specs worth paying attention to? How should they be evaluated? Anyone have thoughts/links on comparing modern SSDs?
The advertised performance specifications for consumer SSDs are mostly useless. They pick the metrics that produce the biggest numbers, without regard to whether those numbers have any relevance to real-world use. Those numbers are mostly useful for determining what class of controller and flash memory are used, in case the spec sheet doesn't list that information. Likewise for warranty period and write endurance ratings; those are more about signalling product segmentation than about actual expected lifetime.
For consumer SSDs, there's no need to worry about write endurance or QLC NAND unless you know you have a very atypical usage pattern and have actually measured your workload by e.g. tracking SMART indicators on your current drive(s) for several days.
Brand matters very little unless you really care about the experience of getting a warranty replacement in the rare event that your drive fails before the warranty expires. I'm not aware of any solid information indicating that certain SSD brands have consistently lower premature failure rates. My anecdotal experience from reviewing SSDs for several years is that all the top-tier brands have at some point sent me review samples that were either DOA or died during testing that shouldn't have killed a drive.
Outside the top tier brands belonging to the NAND flash memory manufacturers, everyone is buying NAND on the open market from the same 2-3 manufacturers and buying SSD controller solutions from the same 2-5 vendors. There are literally dozens of retail models all using the same combination of Phison E12 controller and Toshiba/Kioxia 3D NAND, and the differences between these are almost entirely cosmetic. All the PCIe 4.0 consumer SSDs that have been released so far are functionally identical.
This is great info, thank you! Regarding SMART indicators, do you know if they're generally reliable? I seem to recall there have been drives (at least HDD, not sure about SSD) that didn't report correct numbers. In fact, I just checked and my current SSD doesn't report SMART data at all. Have they gotten better over the years?
I don't make a habit out of doing sanity checks on the SMART reporting of drives I test. My tests log that data, but I haven't written anything to parse that into useful information. I can't recall noticing obviously wrong or entirely missing SMART data in any drive I've recently tested.
I'm curious what your SSD is and just how old it is, if it isn't giving you any SMART information. (Maybe you need to run `smartctl -s on` before it'll show you the stats it has probably been tracking all along?)
Ah I see. It's a PM981, from late 2017. Interestingly, the GSmartControl GUI says it doesn't support SMART. But now that I try smartctl -a, I see a bit of information, though not much. The only things regarding failures seem to be "Media and Data Integrity Errors" and "Error Information Log Entries" (which I'm not really sure how to interpret in terms of what's too high and what's too low). Other stuff is just temperature and other statistics. Not sure if I'm supposed to be seeing more, but I feel like hard disks used to report more than this!
Ah I see, thank you, that makes sense. Sadly it seems this model doesn't publish an endurance specification at all, so it's hard to tell how close I am to its end of life. Interestingly mine actually already has some errors:
Available Spare: 100%
Available Spare Threshold: 10%
Media and Data Integrity Errors: 10
Error Information Log Entries: 735
but I guess that's probably normal? At least unless it fails suddenly/catastrophically.
One word of warning: we often get bored of reviewing the same drive several times, with a different vendor's sticker over the same turnkey solution built by the same ODM. So if you run across a decent price on a drive that hasn't been reviewed in-depth by someone you trust, it may be worth checking if it's basically equivalent to something that has been well reviewed. On Reddit, NewMaxx has been doing a great job of publicly aggregating information about SSDs: https://old.reddit.com/r/NewMaxx/comments/dhvrdm/ssd_guides_...
This is especially important if you're not in North America because the brands that are available and affordable locally aren't always the same as what's good here in the US. (And on rare occasions, a product sold in multiple regions will have different internals depending on the region.)