
> In flash, seeks are essentially free. That means that it does no longer matter if blocks are adjacent.

> The net effect is that positioning as seen from the OS no longer matters at all from the OS layer, and that the OS layer has zero control over adjacency and erase at the PHY layer. Rewriting, defragging, or other OS level operations cannot control what happens physically at the flash layer.

I don't agree with this. The "OS visible position" is relevant, because it influences what can realistically be written together (multiple larger IOs targeting consecutive LBAs in close time proximity). Writing data in larger chunks is very important for good performance, particularly in sustained write workloads. And sequential IO (in contrast to small random IOs) does influence, to some degree, how the FTL will lay out the data.
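
For illustration, here's a toy sketch in Python (invented numbers, not how any real kernel scheduler is implemented) of why adjacency of OS-visible LBAs matters before the data even reaches the drive: requests whose LBA ranges touch can be coalesced into a few large commands, while scattered ones stay as many small commands.

    # Toy merge step, roughly what an IO scheduler/driver's merging does,
    # grossly simplified: sort by LBA and coalesce runs whose ranges touch.
    def merge_adjacent(requests):
        """requests: list of (lba, length_in_blocks); block size assumed 4 KiB."""
        merged = []
        for lba, nblocks in sorted(requests):
            if merged and merged[-1][0] + merged[-1][1] == lba:
                merged[-1] = (merged[-1][0], merged[-1][1] + nblocks)
            else:
                merged.append((lba, nblocks))
        return merged

    consecutive = [(i, 1) for i in range(256)]          # 256 x 4 KiB, adjacent LBAs
    scattered   = [(i * 1000, 1) for i in range(256)]   # 256 x 4 KiB, far apart

    print(len(merge_adjacent(consecutive)))   # -> 1   (one 1 MiB command)
    print(len(merge_adjacent(scattered)))     # -> 256 (256 tiny commands)

The drive then sees one big write it can stripe across dies and fill erase blocks with, versus hundreds of independent 4 KiB writes.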




Disagree, because my understanding is that OS-visible positions have zero relevance to where the data actually ends up physically.

If you feed your NVMe a stream of 1 GB writes spread out over completely randomised OS-visible places (LBAs), the FTL may very well write them sequentially and you get solid sustained write performance.

Conversely, you may try to write 1 GB of sequential LBAs, and your FTL may very well spread it out across the physical blocks simply because that's what's available.

What I'm saying is that sequential read and write workloads are good, but whether the OS considers them sequential or not in terms of LBAs is irrelevant. The controller ignores LBAs and abstracts everything away.
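
For concreteness, here's the (grossly simplified, purely illustrative) mental model I mean - a log-structured FTL that appends to the next free physical page no matter which LBA the host writes, and just updates a mapping table:

    class ToyFTL:
        """Toy log-structured FTL: LBAs are only keys into a mapping table."""
        def __init__(self, num_phys_pages):
            self.l2p = {}             # logical page -> physical page
            self.next_phys = 0        # current append point
            self.num_phys_pages = num_phys_pages

        def write(self, lba):
            phys = self.next_phys % self.num_phys_pages
            self.l2p[lba] = phys      # the old physical page (if any) becomes stale
            self.next_phys += 1
            return phys

    ftl = ToyFTL(num_phys_pages=1024)
    # "Random" LBAs still land on consecutive physical pages:
    print([ftl.write(lba) for lba in (907, 3, 512, 44)])   # -> [0, 1, 2, 3]

Real FTLs also track erase blocks, wear levelling, over-provisioning and so on; the point is only that the logical position doesn't dictate the physical one.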

My understanding could be wrong, so please correct me if I am.


That may sometimes be true the first time you write the random data (but in my experience it's often not true even then, and only if you carefully TRIMed the whole filesystem and it was mostly empty). On later random writes it's rarely true, unless your randomness pattern is exactly the same as in the first run. To make room, the FTL will (often in the background) need to read back the not-rewritten parts of the erase-block-sized chunks laid down in the previous runs, just to be able to write out the new random writes. At some point new writes have to wait for this, slowing things down.

Whereas with larger/sequential writes, there's commonly no need for read-modify-write cycles. Entire erase-block-sized chunks from the previous writes can simply be marked as reusable once they've been overwritten with new content - the old data isn't relevant anymore.
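
A back-of-the-envelope sketch of that cost (all parameters invented; assumes a full drive where logical page i initially sits in physical block i // PAGES_PER_BLOCK, and a greedy garbage collector that reclaims the blocks with the most stale pages):

    import random

    PAGES_PER_BLOCK = 128
    NUM_BLOCKS = 256
    TOTAL_PAGES = PAGES_PER_BLOCK * NUM_BLOCKS

    def relocation_cost(overwritten_lbas):
        # Count stale pages per erase block after the overwrites.
        stale = [0] * NUM_BLOCKS
        for lba in overwritten_lbas:
            stale[lba // PAGES_PER_BLOCK] += 1
        # Reclaim enough blocks to hold the new data; still-valid pages in the
        # victim blocks must be read and rewritten elsewhere first.
        blocks_to_free = len(overwritten_lbas) // PAGES_PER_BLOCK
        victims = sorted(stale, reverse=True)[:blocks_to_free]
        return sum(PAGES_PER_BLOCK - s for s in victims)

    n = 64 * PAGES_PER_BLOCK                            # overwrite 64 blocks' worth
    sequential = list(range(n))                         # consecutive LBAs
    scattered  = random.sample(range(TOTAL_PAGES), n)   # random LBAs

    print(relocation_cost(sequential))   # 0 extra page copies
    print(relocation_cost(scattered))    # typically several thousand extra copies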

This is pretty easy to see by just running benchmarks with sustained sequential and random write IO (rough sketch at the end of this comment). But on some devices it'll take a bit - initially the writes all land in a faster area (e.g. an SLC cache instead of denser/cheaper MLC/TLC/QLC flash).

Of course, if all the random writes are >= erase block size, with a consistent alignment to multiples of the write size, then you're not going to see this - it's essentially sequential enough.
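
A rough version of that benchmark in Python (Linux-only because of O_DIRECT; the path and sizes are placeholders - a serious test would use fio, pre-allocate the file or target the raw block device, and run long enough to exhaust any SLC cache):

    import mmap, os, random, time

    PATH = "/mnt/test/sustained.bin"   # hypothetical file on the SSD under test
    TOTAL = 8 << 30                    # write 8 GiB per pattern

    def run(name, io_size, offsets):
        buf = mmap.mmap(-1, io_size)   # page-aligned buffer, needed for O_DIRECT
        buf.write(os.urandom(io_size))
        fd = os.open(PATH, os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
        start = time.monotonic()
        for off in offsets:
            os.pwrite(fd, buf, off)
        os.fsync(fd)
        print(f"{name}: {TOTAL / (time.monotonic() - start) / (1 << 20):.0f} MiB/s")
        os.close(fd)

    seq = [i * (1 << 20) for i in range(TOTAL >> 20)]   # 1 MiB sequential writes
    rnd = [i * 4096 for i in range(TOTAL // 4096)]      # 4 KiB random writes
    random.shuffle(rnd)

    run("sequential 1 MiB", 1 << 20, seq)
    run("random 4 KiB", 4096, rnd)

On many drives the random run keeps slowing down once the drive runs out of pre-erased space and has to garbage-collect while you're still writing - which is exactly the effect described above.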



