Sure, if you're dealing with extremely high bandwidth apps (4GB/sec is pretty high bandwidth!) what I said doesn't apply.
THe number of people who are sustaining 4GB/sec (on a single machine/device array) is pretty small and they have a reason to go beyond the straightforward approaches the kernel makes available through a simple API (everything you described, like bypass, puts you in a rare category).
Anyway, when I was swapping to SSD, the kswapd process was using 100% of one core while swapping at 500MB/sec. I suspect many kernel threads haven't been CPU-optimized for high throughput.
THe number of people who are sustaining 4GB/sec (on a single machine/device array) is pretty small and they have a reason to go beyond the straightforward approaches the kernel makes available through a simple API (everything you described, like bypass, puts you in a rare category).
Anyway, when I was swapping to SSD, the kswapd process was using 100% of one core while swapping at 500MB/sec. I suspect many kernel threads haven't been CPU-optimized for high throughput.