In addition, the official 1BRC explicitly evaluated results on a RAM disk to avoid I/O speed entirely: https://github.com/gunnarmorling/1brc?tab=readme-ov-file#eva... "Programs are run from a RAM disk (i.o. the IO overhead for loading the file from disk is not relevant)"
Unfortunately there are still cases where local disk I/O can be a serious bottleneck that you do have to be aware of:
1. Cloud VMs sometimes have very slow access even to devices advertised as local disks, depending on which cloud
2. Windows
The latter often catches people out because they develop intuitions about file I/O speed on Linux, where it is extremely fast, and then their app runs 10x slower on Windows because opening file handles is much slower there (and if they don't sign their software, virus scanners can make things slower still).
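One way to see this effect directly is to measure the per-file cost of open/close on a batch of small files. This is a minimal sketch, not a rigorous benchmark; `time_opens` is a hypothetical helper name, and the absolute numbers depend entirely on the OS, filesystem, and any virus scanner in the path:

```python
import os
import tempfile
import time

def time_opens(directory: str, n_files: int = 1000) -> float:
    """Create n_files tiny files, then time opening and closing each one.
    Returns average seconds per open/close pair."""
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, f"f{i}.txt")
        with open(path, "w") as f:
            f.write("x")
        paths.append(path)

    start = time.perf_counter()
    for path in paths:
        with open(path, "rb"):
            pass  # open + close only; no reads
    elapsed = time.perf_counter() - start
    return elapsed / n_files

with tempfile.TemporaryDirectory() as d:
    per_open = time_opens(d)
    print(f"{per_open * 1e6:.1f} us per open/close")
```

Running the same script on a Linux box and a Windows box (with and without the directory excluded from the virus scanner) makes the gap described above very concrete.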
That's interesting. A project at work is affected by Windows' slow open() calls (compared to Linux/Mac), but we haven't found a better solution than "avoid open() as much as you can".
It's likely Windows Defender, which blocks on read I/O to scan files. Verify by adding a folder to its exclusion list, though this isn't helpful if it's a desktop app. The difference is most noticeable when you're reading many small files.
Really? Where did you hear that? Conventional wisdom is based on a post by an MS employee years ago that described the performance issues as architectural.
MS does have something called "Dev Drive" in Win11. Dev Drive is built on ReFS, an entirely different filesystem from NTFS that is better optimized for UNIX-style access patterns. The idea is that core Windows remains slow, but developers (apparently the only people who care about file IO performance) can format another partition and store their source/builds there.
I was surprised by this recently when optimizing a build process that uses an intermediate write-to-disk step. I replaced the intermediate filesystem API with an in-memory one and it was not measurably faster. Not even by a single millisecond.
How much data were you writing? If you don't fill the OS's page cache and the SSD controller's DRAM cache, and you're not blocking on fsync() or O_DIRECT or some other explicit flushing mechanism, then you're not going to see much of a difference in throughput.
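The distinction matters because a plain write() normally just lands in the OS page cache and returns; only an explicit flush forces the data to the device. A rough sketch of the difference, with `time_writes` as a hypothetical helper (exact timings are machine- and filesystem-dependent):

```python
import os
import tempfile
import time

def time_writes(path: str, n: int = 200, size: int = 4096,
                sync: bool = False) -> float:
    """Write n blocks of `size` bytes; optionally fsync after each write.
    Returns total elapsed seconds."""
    block = b"\0" * size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n):
            f.write(block)            # normally lands in the page cache
            if sync:
                f.flush()
                os.fsync(f.fileno())  # block until data reaches the device
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    buffered = time_writes(os.path.join(d, "buf"))
    synced = time_writes(os.path.join(d, "sync"), sync=True)
    print(f"buffered: {buffered:.4f}s, fsync each write: {synced:.4f}s")
```

If the build step never fsyncs and the data fits in the cache, the "disk" writes were effectively memory writes all along, which would explain seeing no speedup from an in-memory filesystem.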
I haven't looked at the problem closely enough to answer, but could we start from the other direction: what makes you think that memory I/O would be the bottleneck?
From my limited understanding, we sequentially bring a large text file into L1 and then do a single read for each value. On most processors we can do two of these per cycle. The slow part will be bringing it into L1 from RAM, but sequential reads are pretty fast.
We then do some processing on each read. At a glance, maybe 4 cycles worth in this optimized version? Then we need to write the result somewhere, presumably with a random read (or two?) first. Is this the part you are thinking is going to be the I/O bottleneck?
I'm not saying it's obviously CPU limited, but it doesn't seem obvious that it wouldn't be.
Edit: I hadn't considered that you might have meant "disk I/O". As others have said, that's not really a factor here.
It's quite a bit more than that: just the code discussed in the post is around 20 instructions, and there are more concerns besides, like finding the delimiter between the name and the temperature, and the hashtable operations. All put together, it comes to around 80 cycles per row.
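To make the per-row work concrete, here is a deliberately naive (not the optimized version from the post) sketch of parsing one 1BRC row of the form `Name;-12.3\n`: find the `;` delimiter, then read the temperature as a fixed-point integer in tenths of a degree. `parse_row` is an illustrative name, not from the post:

```python
# Naive sketch of per-row 1BRC parsing: delimiter search plus
# fixed-point temperature parsing (tenths of a degree).
def parse_row(row: bytes) -> tuple[bytes, int]:
    sep = row.index(b";")             # delimiter between name and temperature
    name = row[:sep]
    temp = row[sep + 1:].rstrip(b"\n")
    sign = -1 if temp.startswith(b"-") else 1
    if sign < 0:
        temp = temp[1:]
    whole, frac = temp.split(b".")    # format always has one decimal digit
    return name, sign * (int(whole) * 10 + int(frac))

print(parse_row(b"Hamburg;12.0\n"))   # (b'Hamburg', 120)
print(parse_row(b"Oslo;-3.4\n"))      # (b'Oslo', -34)
```

Even this simple version does a search, a branch on the sign, and two integer conversions per row, before any hashtable work, so tens of cycles per row is a plausible floor.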
When explaining the timing of 1.5 seconds, one must take into account that it's parallelized across 8 CPU cores.
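A quick back-of-envelope check that these numbers hang together (the 4 GHz clock is an assumption, not from the post):

```python
# Does ~80 cycles/row square with a time of ~1.5 s for a billion
# rows on 8 cores? The clock speed here is an assumed value.
rows = 1_000_000_000
cores = 8
ghz = 4.0                      # assumed clock: cycles per nanosecond
cycles_per_row = 80

total_cycles = rows * cycles_per_row
seconds = total_cycles / (cores * ghz * 1e9)
print(f"{seconds:.1f} s")      # -> 2.5 s under these assumptions
```

That lands in the right ballpark; hitting 1.5 s implies a higher clock, superscalar overlap bringing the effective cycles per row down, or both.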
You are right. In my defense, I meant to say "about 4 cycles per byte of input" but in my editing I messed this up. I'd just deleted a sentence talking about the number of bytes per cycle we could bring in from RAM, but was worried my estimate was outdated. I started trying to research the current answer, then gave up and deleted it, leaving the other sentence wrong.
Since the dataset is small enough to fit into the Linux kernel page cache, and since the benchmark is repeated for 5 consecutive iterations, the first iteration will be bottlenecked by disk I/O but the remaining 4 will not, i.e. all the data will already be in RAM (the page cache).
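This warm-up effect is easy to observe by timing repeated reads of the same file. A minimal sketch (`read_timings` is an illustrative helper; note that in this demo the file was just written, so even the first pass may already be cached):

```python
import os
import tempfile
import time

# Read the same file several times. After the first pass the data is
# served from the kernel page cache, so later passes measure memory
# bandwidth plus syscall overhead, not the disk. A truly cold first
# read would require dropping the cache first (e.g. via
# /proc/sys/vm/drop_caches on Linux, which needs root).
def read_timings(path: str, passes: int = 5) -> list[float]:
    timings = []
    for _ in range(passes):
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(1 << 20):  # read in 1 MiB chunks
                pass
        timings.append(time.perf_counter() - start)
    return timings

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(1 << 24))  # 16 MiB of test data
    for i, t in enumerate(read_timings(path), 1):
        print(f"pass {i}: {t * 1e3:.1f} ms")
```

This is why benchmark harnesses that average over several iterations effectively measure the cached case, not disk throughput.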