
DirectStorage will allow for hardware-accelerated decompression straight from an NVMe drive into GPU memory, without involving the CPU and system RAM.

https://devblogs.microsoft.com/directx/directstorage-is-comi...



The more important thing about DirectStorage is probably that it will encourage games to use multithreaded async IO rather than serializing all their IO requests even when the underlying storage device requires dozens of simultaneous requests to deliver its full throughput.
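To illustrate the point about keeping multiple requests in flight, here is a minimal Python sketch (not how a game engine would actually do it, and the file/chunk layout is hypothetical) contrasting serialized reads with concurrent reads issued from a thread pool. `os.pread` takes an explicit offset, so threads share no file position and can keep the device's queue populated.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical asset file: 32 chunks of 4 KiB each.
CHUNK = 4096
data = os.urandom(CHUNK * 32)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

fd = os.open(path, os.O_RDONLY)

def read_chunk(i):
    # os.pread is thread-safe: each call carries its own offset,
    # so independent requests can be in flight simultaneously.
    return os.pread(fd, CHUNK, i * CHUNK)

# Serialized: one request in flight at a time.
serial = b"".join(read_chunk(i) for i in range(32))

# Concurrent: many requests in flight, closer to NVMe queue depths.
with ThreadPoolExecutor(max_workers=8) as pool:
    concurrent = b"".join(pool.map(read_chunk, range(32)))

os.close(fd)
os.unlink(path)
assert serial == concurrent == data
```

On a page-cached temp file the timing difference won't show, but against a cold NVMe device the concurrent version is what lets the drive deliver its rated throughput.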


I’m not entirely convinced that DirectStorage can do DMA directly from the device to the GPU. I suspect that even current NVMe devices aren’t quite fast enough for this to be a huge deal yet.

I think, but I’m not entirely sure, that Linux can do the peer-to-peer DMA trick. One nasty bit on any OS is that, if a portion of the data being read is cached, then some bookkeeping is needed to maintain coherence, and this adds overhead to IO. I wouldn’t be surprised if consoles had a specific hack to avoid this overhead for read-only game asset volumes.


Does it need "multithreaded async IO" or just "async IO"? It's usually async _or_ multithreaded: the native multi-request I/O APIs are single-threaded, and if you instead do multithreaded I/O using the simpler blocking APIs, the system batches those into multi-request submissions at the cost of a little latency.


Kernel-mediated async disk IO is still a mess on all major platforms except for newer Linux kernels with io_uring. There's no way to call the APIs in a way that won't block sometimes, and to even have a snowball's chance in hell of not blocking requires giving up on the kernel's disk cache.

Also you're probably going to want to do multithreaded decompression anyway, and it'll be more efficient if you have the threads completing the reads do the decompression themselves. So in any case you probably want multiple threads handling the completion events.
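A sketch of that completion-thread pattern, with a hypothetical packed-asset layout (independently compressed chunks plus an offset/length index): each worker thread performs its own read and then decompresses the result in place, so the decompressed data stays warm in that core's cache. `zlib` also releases the GIL on large buffers, so in CPython the decompression genuinely runs in parallel.

```python
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical packed asset file: independently compressed chunks
# stored back to back, located via an (offset, length) index.
chunks = [bytes([i]) * 50_000 for i in range(8)]
blobs = [zlib.compress(c) for c in chunks]
index, off = [], 0
for b in blobs:
    index.append((off, len(b)))
    off += len(b)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"".join(blobs))
    path = f.name

fd = os.open(path, os.O_RDONLY)

def load(i):
    start, length = index[i]
    raw = os.pread(fd, length, start)  # the read...
    return zlib.decompress(raw)        # ...and its decompression, same thread

with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = list(pool.map(load, range(8)))

os.close(fd)
os.unlink(path)
assert loaded == chunks
```

The alternative, funneling all completions to one thread and handing buffers off to a separate decompression pool, adds a cross-thread handoff (and likely a cache miss) per chunk for no benefit.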


NVMe is natively a multi-queue storage protocol, so there's no reason for the application or OS to collect IO requests into a single thread before issuing them to the lower layers. The normal configuration is for each CPU core to be allocated its own IO queue for the drive. But multithreaded synchronous (blocking) IO often isn't enough to keep a high-end NVMe SSD properly busy; you run out of cores and get bogged down in context switching overhead at a few hundred thousand IOPS even with a storage benchmark program that doesn't need any CPU time left over for productive work.

With a sufficiently low-overhead async API (i.e. io_uring) you can saturate a very fast SSD with a single thread, but I'm not sure it would actually make sense for a game engine to do this when it could just as easily have multiple threads independently performing IO, with most of it requiring no synchronization between cores/threads.


Lol

"Hey we have this great new tech that makes things even faaster!!"

2 years later:

"GTA 6 found to have doubled online load times; publisher denies claims that the game performs worse than GTA 5, tells people to upgrade their systems"

4 years later:

"Tech blogger reverse-engineers the code, realizes someone managed to loop access between the hard drive and GPU despite extremely common modern tech, gets a 10x boost after spending a day fixing the junk product"

Better technology is just no match for dumber management and more bureaucratic dev shops...



