In addition, the official 1BRC explicitly evaluated results on a RAM disk to avoid I/O speed entirely: https://github.com/gunnarmorling/1brc?tab=readme-ov-file#eva... "Programs are run from a RAM disk (i.o. the IO overhead for loading the file from disk is not relevant)"
Unfortunately there are still cases where local disk I/O can be a serious bottleneck that you do have to be aware of:
1. Cloud VMs sometimes have very slow access even to devices advertised as local disks, depending on which cloud
2. Windows
The latter often catches people out because they develop intuitions about file I/O speed on Linux, where it is extremely fast, and then their app runs 10x slower on Windows because opening file handles is much slower there (and if they don't sign their software, virus scanners can make things slower still).
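One way to see this effect directly is to measure the per-file cost of open/close on a batch of small files. This is a minimal sketch, not a rigorous benchmark; `time_opens` is a hypothetical helper name, and the absolute numbers depend entirely on the OS, filesystem, and any virus scanner in the path:

```python
import os
import tempfile
import time

def time_opens(directory: str, n_files: int = 1000) -> float:
    """Create n_files tiny files, then time opening and closing each one.
    Returns average seconds per open/close pair."""
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, f"f{i}.txt")
        with open(path, "w") as f:
            f.write("x")
        paths.append(path)

    start = time.perf_counter()
    for path in paths:
        with open(path, "rb"):
            pass  # open + close only; no reads
    elapsed = time.perf_counter() - start
    return elapsed / n_files

with tempfile.TemporaryDirectory() as d:
    per_open = time_opens(d)
    print(f"{per_open * 1e6:.1f} us per open/close")
```

Running the same script on a Linux box and a Windows box (with and without the directory excluded from the virus scanner) makes the gap described above very concrete.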
That's interesting. A project at work is affected by Windows' slow open() calls (compared to Linux/Mac), but we haven't found a better solution than "avoid open() as much as you can".
It's likely Windows Defender, which blocks on read I/O to scan files. Verify by adding a folder to its exclusion list, though this isn't helpful if it's a desktop app. The difference is most noticeable when you're reading many small files.
Really? Where did you hear that? Conventional wisdom is based on a post by an MS employee years ago that described the performance issues as architectural.
MS does have something called "Dev Drive" in Win11. Dev Drive is built on ReFS, an entirely different filesystem from NTFS that is better optimized for UNIX-style access patterns. The idea is that core Windows remains slow, but developers (apparently the only people who care about file IO performance) can format another partition and store their source/builds there.
I was surprised by this recently when optimizing a build process that uses an intermediate write-to-disk step. I replaced the intermediate filesystem API with an in-memory one and it was not measurably faster. Not even by a single millisecond.
How much data were you writing? If you don't fill the OS's page cache and the SSD controller's DRAM cache, and you're not blocking on fsync() or O_DIRECT or some other explicit flushing mechanism, then you're not going to see much of a difference in throughput.
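The distinction matters because a plain write() normally just lands in the OS page cache and returns; only an explicit flush forces the data to the device. A rough sketch of the difference, with `time_writes` as a hypothetical helper (exact timings are machine- and filesystem-dependent):

```python
import os
import tempfile
import time

def time_writes(path: str, n: int = 200, size: int = 4096,
                sync: bool = False) -> float:
    """Write n blocks of `size` bytes; optionally fsync after each write.
    Returns total elapsed seconds."""
    block = b"\0" * size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n):
            f.write(block)            # normally lands in the page cache
            if sync:
                f.flush()
                os.fsync(f.fileno())  # block until data reaches the device
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    buffered = time_writes(os.path.join(d, "buf"))
    synced = time_writes(os.path.join(d, "sync"), sync=True)
    print(f"buffered: {buffered:.4f}s, fsync each write: {synced:.4f}s")
```

If the build step never fsyncs and the data fits in the cache, the "disk" writes were effectively memory writes all along, which would explain seeing no speedup from an in-memory filesystem.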
I haven't looked at the problem closely enough to answer, but could we start from the other direction: what makes you think that memory I/O would be the bottleneck?
From my limited understanding, we sequentially bring a large text file into L1 and then do a single read for each value. On most processors we can do two of these per cycle. The slow part will be bringing it into L1 from RAM, but sequential reads are pretty fast.
We then do some processing on each read. At a glance, maybe 4 cycles worth in this optimized version? Then we need to write the result somewhere, presumably with a random read (or two?) first. Is this the part you are thinking is going to be the I/O bottleneck?
I'm not saying it's obviously CPU limited, but it doesn't seem obvious that it wouldn't be.
Edit: I hadn't considered that you might have meant "disk I/O". As others have said, that's not really a factor here.
It's quite a bit more than that: just the code discussed in the post is around 20 instructions, and there are more concerns besides, like finding the delimiter between the name and the temperature, and the hashtable operations. All put together, it comes to around 80 cycles per row.
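To make the per-row work concrete, here is a deliberately naive (not the optimized version from the post) sketch of parsing one 1BRC row of the form `Name;-12.3\n`: find the `;` delimiter, then read the temperature as a fixed-point integer in tenths of a degree. `parse_row` is an illustrative name, not from the post:

```python
# Naive sketch of per-row 1BRC parsing: delimiter search plus
# fixed-point temperature parsing (tenths of a degree).
def parse_row(row: bytes) -> tuple[bytes, int]:
    sep = row.index(b";")             # delimiter between name and temperature
    name = row[:sep]
    temp = row[sep + 1:].rstrip(b"\n")
    sign = -1 if temp.startswith(b"-") else 1
    if sign < 0:
        temp = temp[1:]
    whole, frac = temp.split(b".")    # format always has one decimal digit
    return name, sign * (int(whole) * 10 + int(frac))

print(parse_row(b"Hamburg;12.0\n"))   # (b'Hamburg', 120)
print(parse_row(b"Oslo;-3.4\n"))      # (b'Oslo', -34)
```

Even this simple version does a search, a branch on the sign, and two integer conversions per row, before any hashtable work, so tens of cycles per row is a plausible floor.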
When explaining the timing of 1.5 seconds, one must take into account that it's parallelized across 8 CPU cores.
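A quick back-of-envelope check that these numbers hang together (the 4 GHz clock is an assumption, not from the post):

```python
# Does ~80 cycles/row square with a time of ~1.5 s for a billion
# rows on 8 cores? The clock speed here is an assumed value.
rows = 1_000_000_000
cores = 8
ghz = 4.0                      # assumed clock: cycles per nanosecond
cycles_per_row = 80

total_cycles = rows * cycles_per_row
seconds = total_cycles / (cores * ghz * 1e9)
print(f"{seconds:.1f} s")      # -> 2.5 s under these assumptions
```

That lands in the right ballpark; hitting 1.5 s implies a higher clock, superscalar overlap bringing the effective cycles per row down, or both.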
You are right. In my defense, I meant to say "about 4 cycles per byte of input" but in my editing I messed this up. I'd just deleted a sentence talking about the number of bytes per cycle we could bring in from RAM, but was worried my estimate was outdated. I started trying to research the current answer, then gave up and deleted it, leaving the other sentence wrong.
Since the dataset is small enough to fit into the Linux kernel page cache, and since the benchmark is repeated for 5 consecutive iterations, the first iteration will be bottlenecked by disk I/O but the remaining 4 will not, i.e. all the data will already be in RAM (the page cache).
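This warm-up effect is easy to observe by timing repeated reads of the same file. A minimal sketch (`read_timings` is an illustrative helper; note that in this demo the file was just written, so even the first pass may already be cached):

```python
import os
import tempfile
import time

# Read the same file several times. After the first pass the data is
# served from the kernel page cache, so later passes measure memory
# bandwidth plus syscall overhead, not the disk. A truly cold first
# read would require dropping the cache first (e.g. via
# /proc/sys/vm/drop_caches on Linux, which needs root).
def read_timings(path: str, passes: int = 5) -> list[float]:
    timings = []
    for _ in range(passes):
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(1 << 20):  # read in 1 MiB chunks
                pass
        timings.append(time.perf_counter() - start)
    return timings

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(1 << 24))  # 16 MiB of test data
    for i, t in enumerate(read_timings(path), 1):
        print(f"pass {i}: {t * 1e3:.1f} ms")
```

This is why benchmark harnesses that average over several iterations effectively measure the cached case, not disk throughput.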