If we mmap a lot of files, _exit(2) is not instantaneous but takes a few hundred milliseconds because the kernel has to clean up a lot of resources. As a workaround, we should organize the linker command as two processes; the first process forks the second process, and the second process does the actual work. As soon as the second process writes a result file to a filesystem, it notifies the first process, and the first process exits. The second process can take time to exit, because it is not an interactive process.
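The two-process arrangement described above can be sketched roughly as follows. This is an illustration, not the linker's actual code: the function name `link_in_background`, the pipe-based notification, and the `time.sleep` standing in for slow teardown are all assumptions made for the example. A real front end would exit right after the notification; here the function returns instead so the sketch can be exercised.

```python
import mmap
import os
import time


def link_in_background(output_path: str, payload: bytes) -> int:
    """Fork a worker and return its pid once it reports that the output is written.

    Hypothetical helper: `payload` (non-empty) stands in for the linked binary.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Second process: does the actual work.
        exit_code = 1  # reported if anything below fails
        try:
            os.close(read_fd)
            fd = os.open(output_path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
            os.ftruncate(fd, len(payload))
            mm = mmap.mmap(fd, len(payload))  # shared, read/write mapping by default
            mm[:] = payload
            mm.close()  # unmap before notifying the first process
            os.close(fd)
            os.write(write_fd, b"1")  # tell the first process the file is ready
            time.sleep(0.2)  # stand-in for slow resource cleanup at exit
            exit_code = 0
        finally:
            os._exit(exit_code)
    # First process: wait only for the notification, not for the worker to exit.
    os.close(write_fd)
    ready = os.read(read_fd, 1)
    os.close(read_fd)
    if ready != b"1":
        raise RuntimeError("worker exited before writing the output")
    return pid
```

If the worker dies early, its end of the pipe is closed, the read returns an empty byte string, and the caller sees an error instead of hanging.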

Is this safe? Are you sure that the _exit() delays are not part of the kernel committing all the pending mmap I/O to the buffer cache? If a build script links a binary using this and then immediately executes it, the second (background) process might still not be finished. Is it guaranteed that all of the mmap I/O will be visible, or might the binary appear incomplete?

I don't know the answer to this myself. It would be clear-cut if the linker were using write() calls, because UNIX guarantees that subsequent read() calls will see the results, but I believe the ordering guarantees for mmap I/O are far looser.




It is safe because the child process calls munmap before telling its parent process to exit. munmap is guaranteed to act as a commit operation. Alternatively, you can call msync (https://man7.org/linux/man-pages/man2/msync.2.html) if you want to keep it mmapped.
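As a rough sketch of the two options mentioned here (unmap before notifying, or msync and keep the mapping), assuming Python's `mmap` module, where `close()` unmaps and `flush()` calls msync. The helper name `write_and_commit` and its `keep_mapped` parameter are made up for illustration; the sketch shows only the ordering of calls and does not by itself prove what either call guarantees.

```python
import mmap
import os


def write_and_commit(path: str, payload: bytes, keep_mapped: bool = False):
    """Write a non-empty payload through a shared mapping, then commit it.

    Hypothetical helper: returns the live mapping if keep_mapped is true,
    otherwise None.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(fd, len(payload))
        mm = mmap.mmap(fd, len(payload), flags=mmap.MAP_SHARED)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed
    mm[:] = payload
    if keep_mapped:
        mm.flush()  # msync(2): commit while keeping the file mapped
        return mm
    mm.close()  # munmap(2): drop the mapping before anyone is notified
    return None
```

Either way, the notification to the parent would be sent only after this function returns.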


Linux gives much stronger guarantees than POSIX here. I wonder if you save measurable time by skipping munmap.


munmap is often a remarkably slow operation in a multi-threaded process because of TLB shootdowns: on each munmap, all the other threads get interrupted and their cached page mappings (TLB entries) get trashed.

It is usually much better to have multiple regular processes, instead of threads, that only share chosen mappings, if you want to use munmap. Or, you can terminate and join all your threads before you start munmapping.


Is that documented?


I would have said yes, but I can’t find it. That being said, Linux has a “unified page cache”, and MAP_SHARED is coherent with read(2) and write(2), at least on any local filesystem (not sure about FUSE) and when direct IO is not involved.
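A small experiment along these lines, assuming Linux and a local filesystem (the function name `check_coherence` is made up for this sketch). It stores through a MAP_SHARED mapping without calling msync and reads the bytes back with pread, then does the reverse with pwrite:

```python
import mmap
import os


def check_coherence(path: str):
    """Return (bytes seen by pread after an mmap store,
               bytes seen through the mapping after a pwrite)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(fd, mmap.PAGESIZE)
        mm = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED)
        try:
            mm[0:4] = b"mmap"                  # store through the mapping, no msync
            seen_by_read = os.pread(fd, 4, 0)  # read the same bytes via the file
            os.pwrite(fd, b"file", 4)          # write via the file descriptor
            seen_by_map = bytes(mm[4:8])       # load through the mapping
        finally:
            mm.close()
    finally:
        os.close(fd)
    return seen_by_read, seen_by_map
```

With a unified page cache, both views should agree. That is an observation about one system, not a documented guarantee, and network filesystems, FUSE, or direct I/O may behave differently.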

I could also easily believe that largeish pwrite(2) calls would be about as fast as mmap, since mmap needs to play with page tables, and page faults on x86 are expensive. MAP_POPULATE would also be worth trying if you’re not already using it.

I assume that copy_file_range(2) is out of the question due to relocations.


I once counted the number of 4 KiB blocks that have at least one relocation, using Chrome as a sample. It turned out that almost all 4 KiB blocks have at least one relocation. They mutate everywhere.
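The counting itself is simple. A minimal sketch, assuming the relocation offsets have already been extracted into a list of byte offsets into the output file (the function name and inputs are hypothetical, and a relocation that straddles a block boundary is counted only in the block where it starts):

```python
def count_blocks_with_relocs(reloc_offsets, file_size, block_size=4096):
    """Return (blocks containing at least one relocation, total blocks)."""
    total_blocks = (file_size + block_size - 1) // block_size  # round up
    touched = {
        offset // block_size
        for offset in reloc_offsets
        if 0 <= offset < file_size  # ignore offsets outside the file
    }
    return len(touched), total_blocks
```

Only blocks outside the touched set could be copied verbatim with something like copy_file_range(2), so when nearly every block is touched there is little left to gain.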



