> on modern ultra fast single user systems I wonder?
The latency of a syscall is on the order of a few hundred instructions: you're switching to a different privilege mode, with a different memory map, and your data ultimately has to leave the chip to reach the hardware.
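If you want a feel for that number on your own machine, here's a rough sketch (assuming Linux/glibc) that times a million one-byte write(2) calls against a single bulk write. It writes to /dev/null, so it measures only syscall overhead, not device latency, and it's nowhere near a rigorous benchmark:

```c
/* Rough sketch, not a rigorous benchmark: compare N one-byte write(2) calls
 * with a single N-byte write to /dev/null to see the per-syscall overhead. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    enum { N = 1000000 };
    static char buf[N];                     /* zero-filled payload */
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        write(fd, buf + i, 1);              /* one syscall per byte */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("per-byte writes: ~%.0f ns/syscall\n",
           ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)N);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    write(fd, buf, N);                      /* one syscall for the whole buffer */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("single bulk write: %.0f ns total\n",
           (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec));

    close(fd);
    return 0;
}
```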
It depends on what the consumer is doing with the data as it exits the buffer. If it's a terminal program printing every character, it's going to be slow. More generally, if it's any program that doesn't do its own buffering, the consumer becomes the bottleneck, and the slowdown depends on how it processes its input.
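For a rough feel of the consumer side, here's a sketch of the same drain loop done two ways: one read(2) call per byte versus 64 KiB chunks (the byte-at-a-time mode being toggled by any command-line argument is just for illustration):

```c
/* Sketch: two ways a consumer can drain a pipe. The per-byte version makes
 * one read(2) syscall per character; the chunked version amortises that cost
 * over 64 KiB, so the producer is rarely left waiting on it. */
#include <unistd.h>

int main(int argc, char **argv) {
    char buf[1 << 16];
    ssize_t n;

    if (argc > 1) {                                        /* any argument: byte at a time */
        while ((n = read(STDIN_FILENO, buf, 1)) > 0)
            ;                                              /* pretend to process one char */
    } else {                                               /* default: 64 KiB chunks */
        while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
            ;                                              /* pretend to process the chunk */
    }
    return n < 0;
}
```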
Ultimately even “no buffer” still has a buffer: the number of bits read at a time. Maybe that’s 1, maybe it’s 64, but there still has to be some boundary between iterations.
Those ultra fast systems also have ultra fast I/O. Buffering is critical to getting good performance out of, e.g., your NVMe drive. The difference between writing one character at a time and writing a few megabytes at a time is many orders of magnitude (x1000? x10000?), enough to make pipe processing of large files unacceptably slow. Even between processes you want to move large blocks of data, otherwise you're just context switching all the time. You can try this yourself by flushing after every character in a toy program and building some sort of chain like `toy largefile | toy | toy > mycopy`, as in the sketch below.
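Something like this would do as the `toy` in that chain. The `TOY_SLOW` environment variable is just my way of toggling the per-character flush for the comparison; any flag would do:

```c
/* Minimal "toy" filter for the pipeline experiment above: copies its input
 * (a file named on the command line, or stdin) to stdout. Set TOY_SLOW=1 to
 * fflush() after every character instead of letting stdio batch the writes,
 * then compare e.g.
 *   TOY_SLOW=1 ./toy largefile | ./toy | ./toy > mycopy
 * against the buffered default. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    FILE *in = stdin;
    if (argc > 1 && (in = fopen(argv[1], "rb")) == NULL) {
        perror(argv[1]);
        return 1;
    }

    int slow = getenv("TOY_SLOW") != NULL;   /* per-character flushing on demand */
    int c;
    while ((c = getc(in)) != EOF) {
        putchar(c);
        if (slow)
            fflush(stdout);                  /* force a write(2) for every byte */
    }

    fflush(stdout);
    if (in != stdin)
        fclose(in);
    return 0;
}
```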