There isn't exactly a shortage of examples where mmap() was quite convenient in the short term for getting to market quickly with something Good Enough to appear production-ready, but proved rather problematic in the long term.
MongoDB's original mmap storage engine (32-bit support, anyone?) ultimately required replacement (WiredTiger was purchased in part for this reason, IIRC).
systemd-journald read side (journalctl, `systemctl status`, sd-journal API), performance/scalability definitely suffers from the decision to use mmap for all IO.
mmap IO is extremely seductive in terms of programming style/convenience as it facilitates writing procedural, IO-naive code that assumes file-backed memory accesses Just Work. But this is transparently blocking, synchronous code.
Since your program isn't explicitly scheduling IO and continuations for execution upon completion of IO, there's zero application-level IO parallelism. The only asynchronous IO possible is what the kernel can automagically perform using relatively dumb page fault heuristics and madvise() hints.
When your process triggers a major fault on an mmap region, everything comes to a grinding halt; no speculative execution past that point, no potential for triggering other page faults in other files or independent regions of the same file, it all stops until the data is read from the backing store. There's no way around this without threads, which bring a host of other issues.
Multiple threads touching the same mapped page at the same time won't give you IO parallelism. You need different threads to be faulting on different pages at the same time in order to get multiple simultaneous IO requests sent to the storage device. But the IO queue depth is still limited by your thread count, and that limit can still be a serious problem even on the latest processors with the highest core counts.
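To make that last point concrete, here's a minimal sketch (hypothetical, not from any codebase mentioned here) of the only real way to get concurrent faults out of a single mapping: spread the touching across threads, one slice each. The helper names and the 64-thread cap are assumptions for illustration.

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical sketch: N threads each touch a different slice of one mapping so
       their major faults (and the reads behind them) can be in flight concurrently.
       The achievable queue depth is still bounded by the thread count. */
    struct slice { const unsigned char *base; size_t len; };

    static void *touch_slice(void *arg) {
        struct slice *s = arg;
        unsigned long sum = 0;
        for (size_t i = 0; i < s->len; i += 4096)  /* one touch per 4kB page */
            sum += s->base[i];                     /* each touch may major-fault */
        return (void *)sum;                        /* returning the sum keeps the loads from being optimized away */
    }

    static void prefault_parallel(const unsigned char *map, size_t len, int nthreads) {
        pthread_t tid[64];                         /* assumes nthreads <= 64 */
        struct slice s[64];
        size_t chunk = len / (size_t)nthreads;
        for (int i = 0; i < nthreads; i++) {
            s[i].base = map + (size_t)i * chunk;
            s[i].len  = (i == nthreads - 1) ? len - (size_t)i * chunk : chunk;
            pthread_create(&tid[i], NULL, touch_slice, &s[i]);
        }
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);
    }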
Claiming "mmap is faster than system calls" is dangerous.
I once worked for a company where they also heard someone say "mmap is faster than read/write" and as a consequence rewrote their while( read() ) loop into the following monstrosity (roughly the pattern sketched in the code below):
1. mmap a 4KB chunk of the file
2. memcpy it into the destination buffer
3. munmap the 4KB chunk
4. repeat until eof
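For concreteness, a rough C sketch of that pattern (a hypothetical reconstruction from the description above, not the original code):

    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    /* The "monstrosity": map each 4kB block, copy it out, unmap it. Three syscalls
       plus page-table churn and fresh page faults per block. Shown as a cautionary
       example, not a recommendation. */
    static ssize_t copy_file_in_4k_maps(int fd, size_t file_size, unsigned char *dst) {
        for (size_t off = 0; off < file_size; off += 4096) {
            size_t len = (file_size - off < 4096) ? file_size - off : 4096;
            void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, (off_t)off);  /* 1. mmap a 4kB chunk */
            if (p == MAP_FAILED)
                return -1;
            memcpy(dst + off, p, len);   /* 2. copy it into the destination buffer (faults happen here) */
            munmap(p, len);              /* 3. unmap the chunk */
        }                                /* 4. repeat until EOF */
        return (ssize_t)file_size;
    }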
This is different from the claim in the article -- the above monstrosity is individually mmaping each 4KB block, while I presume the article's benchmark is mmaping the entire file in memory at once, which makes much more sense.
After I claimed the "monstrosity" was absurdly stupid, someone pointed to a benchmark they made and found that the "monstrosity" version was actually faster. To me, this made no sense. The monstrosity has triple the syscall overhead vs the read() version, requires manipulating page tables for every 4KB block and as a consequence had several page faults for each 4KB block of the file. Yet it was true: their benchmarks showed the monstrosity version to be slightly faster.
The idealist in me couldn't stand this and I reverted this change; for my own (unrelated) experiments I used a binary with the older, classic read() loop instead of mmap.
Eventually I noticed I was getting results much faster using my build on my single-socket Xeon than they were getting on their $$$ server farms. Despite what the benchmark said.
Turns out, the "monstrosity" was indeed faster, but if you had several of these binaries running concurrently on the same machine, they would all slow down each other, as if the kernel was having scale issues with multiple processes constantly changing their page tables. The thing would slow down to single-core levels of performance.
I still have no idea why the benchmark was apparently slightly faster, but obviously they were checking it either isolated or on machines where the other processes were running read() loops. I guess that by wasting more kernel CPU time on yourself you may starve other processes in the system, leaving more user time for yourself. But once every process does it, the net result is still significantly lowered performance for everyone.
Just yet another anecdote for the experience bag...
Using mmap in an unusual way (to read chunks) on presumably legacy hardware doesn't generalize to using it in the obvious way (mmap entire files or at least larger windows) on modern hardware.
Even old operating systems like Windows NT4 could easily memory map a window into a file as large as 256 MB at a time, and could map multiple segments concurrently. It could do this even if there was much less physical memory available, because this is a virtual memory mapping only.
Well-written code would start with a sliding window of some reasonable size such as 64 MB, and if that failed would try halving it repeatedly down to some lower threshold.
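Something like this, as a minimal sketch (the function name, window sizes and fallback are assumptions; shown with POSIX mmap rather than the Win32 MapViewOfFile the comment is about):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    /* Try to map a window of up to `want` bytes at a page-aligned `offset`,
       halving the window on failure down to `floor`. On a 64-bit system this
       rarely fails; the pattern mattered when 32-bit address space was tight. */
    static void *map_window(int fd, off_t offset, size_t want, size_t floor, size_t *got) {
        for (size_t len = want; len >= floor; len /= 2) {
            void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);
            if (p != MAP_FAILED) {
                *got = len;
                return p;
            }
        }
        return NULL;  /* caller falls back to plain read() at this point */
    }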
Unfortunately, the 64-bit era has led to a "pit of failure" where many programmers incorrectly assume that this means that 2⁶⁴ bytes can be mapped reliably in a single call. This is never true, because of all sorts of operating system and hardware limitations.
I've seen "modern" code written with this assumption, such as a Rust library and a couple of C# libraries. They fail on older Xeons, some hypervisors, and 32-bit platforms.
Even in 2021 server applications run as 32-bit surprisingly often. For example, Azure's App Service defaults to 32-bit on the low-end tiers to save memory.
I think the story is to always benchmark first, and also to make sure your benchmarks reflect real-world use. What's dangerous is assuming something is faster without benchmarking.
I think many people reading that anecdote may come away with the idea that mmap is bad (and a monstrosity even) and read is good rather than your interpretation that you should benchmark better.
I dislike this kind of muddying the waters and I hope my comment provides another perspective for readers.
The whole point of the story seems to be that you shouldn't just read a story like this or the linked article and take away a simplistic data point like "mmap is faster" or "mmap is bad"
Best place to start is to have a good mental model of how things work and why they would be performant or not for a particular use case. Otherwise you're just taking shots in the dark.
It seems a more apples-to-apples comparison would be to open a file, seek(), read() a block, then close() the file. Just as bizarre as the repeated mmap, of course.
Regardless of how bizarre it is, I've seen this in real code in embedded applications before. It's a workaround of buggy serial port drivers (flow control or buffering is probably broken): You open the port, read/write a line, close it, and open it again...
Hah I came here to say pretty much the same thing! Recently ran into it and coding that workaround on a resource constrained system felt absolutely bonkers.
If someone claimed running was faster than walking and then I told a story about how I once saw someone running in snow shoes on the grass and it was slower than walking then that would just be muddying the waters.
Yet the article is literally also showing a current peculiarity (user-space memcpy being faster than kernel-space memcpy) that may or may not be true for your platform _now_ or in 3 years from now.
If someone hears "mmap is faster than system calls" and then mmaps and munmaps 4KB chunks at a time in a loop, not realizing that mmap and munmap are actually system calls and that the benefit is not about calling those functions as much as possible, there is no saving them.
That's not the fault of a 'dangerous' claim, that's the fault of people who go in head first without taking 20 minutes to understand what they are doing or 20 minutes to profile after.
You'd need significantly more than 20 minutes to form an informed opinion on the topic if you don't already know basically everything about it.
The only thing you could do in that timespan is reading a single summary on the topic and hope that this includes all relevant information. Which is unlikely and the reason why things are often mistakenly taken out of context.
And as the original comment mentioned: they did benchmark and it showed an improvement. They just didn't stress-test it, but that's unlikely to be doable within 20 minutes either.
In 20 minutes you can read what mmap does and see that you can map a file and copy it like memory.
In another 20 minutes you can compile and run a medium sized program.
Neither of those is enough time for someone to go deep into something, but you can look up the brand new thing you're using and see where your bottlenecks are.
But without going deeper you are likely to have preconceived notions like "mmapping a gigabyte file into memory actually uses a gigabyte of memory, oh noes!" I can totally see the thinking behind this case.
I don't think an unsubstantiated general prediction about people making a wild assumption has much to do with the point that nothing is going to protect people from performance pitfalls who don't learn the bare basics about what they are doing.
Who is 'they'? Is it still the made up future scenario? The bare basics of mmap on a file is that it maps the file into a process's memory space instead of copying it into memory.
Was it perhaps a multi-threaded task? Because that would almost definitely crawl.
In general, unmapping is expensive, much more expensive than mapping memory, because you need to do a TLB shootdown/flush/whatever to make sure a cached version of the old mapping is not used. A read/write does a copy, so there's no need to mess with mappings and TLBs, hence it can scale very well.
It was multi-processing. I guess mmap (or at least munmap) also needs to send an IPI even if no other processor currently has the same VM, to avoid race conditions.
The fact that you once met some idiots who wrote code that wouldn't pass the review of any experienced Unix programmer doesn't really invalidate the fact that faulting in a mapped page is generally faster than getting the same data with a page-sized read. The one thing doesn't have anything to do with the other.
Right, especially if the alignment within the file is a multiple of the page size, and it's a shared mapping, many OSes can make a zero-copy read directly from non-volatile storage into the process's memory (buffer cache pages mapped directly into the process's address space).
On Linux mmap_sem contention is a well-known concurrency bottleneck, you may have been hitting that. Multiple efforts over the years have failed to fix it, IIRC. I guess one day they'll find a good solution, but until then, take care.
mmap is a more ergonomic interface than read too. How often are people copying a file to a local buffer, or the whole file a chunk at a time, in order to use the file like an array of bytes. mmap gives us a range of bytes right off the start. Even if not optimal, the simplicity in usage often means less room for bugs.
That can sometimes be a feature, not a bug. I've worked on systems that used memory mapped files as a form of persistent shared memory between processes.
I'm also puzzled as to why the benchmark of the monstrosity would have been faster. How is it not doing the exact same work as read()'ing 4KB, and then writing out the same to the destination file? And, as you said, 3x the system call overhead doesn't seem good, either....
Isn't the whole point of mmap to randomly access the data needed in memory? Did they think memcpy is a totally free operation or something, without any side effects?
I'm not sure the conclusion that vector instructions are responsible for the speed-up is correct. Both implementations seem to use ERMS (using REP MOVSB instructions)[0]. Looking at the profiles, the syscall implementation spends time in the [xfs] driver (even in the long test), while the mmap implementation does not. It appears the real speed-up is related to how memory-mapped pages interact with the buffer cache.
I might be misunderstanding things. What is really going on here?
The copy hypothesis also doesn't explain the difference between warm and cold start performance, at least for the "random read" case.
For warm start, the speed for read vs mmap is 3 GB/s vs 6 GB/s (i.e. mmap is 2x faster).
For cold start, that changes to 0.05 GB/s vs 0.25 GB/s (i.e. mmap is 5x faster).
If the only significant difference between read() and mmap() is how data is copied, then shouldn't the performance gap get much smaller in the cold start scenario? After all, you're adding the identical (and pretty large) overhead of reading from disk. Instead, the gap gets much larger. That's strange. (The sequential read case looks more reasonable.)
So there must be something else going on besides more efficient copying with mmap.
Summary: Mostly, syscalls and mmap do the same things, just substituting a page fault for a syscall to get to kernel mode, but… in user space her code is using AVX-optimized memory copy instructions, which are not accessible in kernel mode, yielding a significant speed-up.
Bonus summary: She didn’t use the mmapped data in place in order to make a more apples-to-apples comparison. If you can use the data in place then you will get even better performance.
Sasha is universally applied to both males and females, although to be fair, in Russian, it's culturally much more acceptable to call Alexander Sasha in any context whatsoever, whereas Sasha as in female Alexandra is reserved for informal communication.
Sasha is derived from Alexander via its diminutive but obsolete form, Aleksashka, shortened to Sashka and further simplified to Sasha, as per the established name format of Masha (Maria), Dasha (Daria), Pasha (Pavel, Paul), Glasha (Glafira), Natasha (Natalia), etc.
The bonus is what makes this writeup silly. I've always seen mmap presented as a way to reduce copies. If you're not using it for that, then yes, it's not a given it will be faster.
Yes, the test is sort of a strange one because the read test is reading directly into the final destination whereas mmap is not and is doing a redundant copy instead. So it's not your typical "eliminate a copy and go faster" usage of mmap.
What it actually seems to be comparing is whether its faster if the copy happens in the kernel, in the read() system call, or whether it's faster in userspace.
The results are puzzling because I would expect the kernel to be faster, since it would incur no page faults to access the pages from the page cache: those ought to fall under the page-global mappings, or at least be in the kernel's linear RAM map which uses 1GB pages (edit: in either case, not incur faults for demand map/demand load). But maybe that's no longer the case in a post-Spectre world?
So I think actually it's a test of "does AVX go enough-faster to outweigh the costs of the extra-pagefaults incurred by copying from a user-space mapping which has not been prefaulted"
Sounds like the answer is yes (which still surprises me). And I expect if they MAP_POPULATE it would pull even further ahead...
Yes, 1,167 pagefaults for the read() version and 66,704 pagefaults for the mmap() version. With MAP_POPULATE it's back down to 1,167 faults and it goes a bit faster.
My libc is using rep/mov, same as the kernel, however. So I can't conclude that AVX memcpy is the win.
From the profile of readsyscall version I see this:
42.12% fa [kernel.kallsyms] [k] copy_user_enhanced_fast_string
14.45% fa [kernel.kallsyms] [k] syscall_exit_to_user_mode
11.00% fa [kernel.kallsyms] [k] find_get_entry
6.28% fa [kernel.kallsyms] [k] syscall_return_via_sysret
5.14% fa [kernel.kallsyms] [k] entry_SYSCALL_64
nearly 15% of the time is spent on a single verw instruction. Spectre mitigations.
When I reboot with mitigations=off, sure enough, the difference goes away and read() and mmap() perform identically.
What OP has discovered is that spectre mitigations make syscalls enough slower than pagefaults that what used to be a slower way to do things is now faster.
If you want the mmap version to perform faster when spectre mitigations are off, you need to use MAP_POPULATE.
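For reference, a minimal sketch of what "use MAP_POPULATE" looks like in practice (the helper name is made up; error handling trimmed):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a whole file read-only and ask the kernel to fault all of it in up front,
       instead of paying one major fault at a time during the copy loop. */
    static void *map_whole_file(const char *path, size_t *len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);  /* prefault the mapping */
        close(fd);                                          /* the mapping stays valid after close */
        if (p == MAP_FAILED)
            return NULL;
        *len_out = (size_t)st.st_size;
        return p;
    }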
The kernel does have access to these instructions. It is a deliberate choice by kernel developers not to use them in the case discussed here. In other cases the kernel does use such instructions.
Thanks. Curse this language. I just want to refer to people! It’s simple encapsulation and abstraction. I shouldn’t have to care about implementation details irrelevant to the context.
Curiously, the page you've linked explains that it isn't normal to use it in this case: "can only be used with a morphologically and syntactically singular antecedent when what it refers to is semantically collective and/or generic and/or indefinite and/or unknown".
> According to the third edition, The New Fowler's Modern English Usage (edited by Burchfield and published in 1996) singular they has not only been widely used by good writers for centuries, but is now generally accepted, except by some conservative grammarians, including the Fowler of 1926, who, it is argued, ignored the evidence:
Anyone arguing against singular 'they' may as well be arguing for a gender neutral 'he' - an argument lost over a century ago by virtually any modern English standard.
Absolutely true. At least one whole generation grew up taught in school that singular "they" isn't a thing. I was on the tail end of that and read a lot of books written by the previous generation, so it never stuck strongly in my head, but that's what GP is referring to.
I don't think that's the case. I think many people (myself included) were taught only that "they" is plural, not that it explicitly cannot be singular. The distinction there may be too subtle to matter for some people, though.
Either way, clinging to a frozen-in-time interpretation of a living, changing language is foolish for anyone who would like to easily understand and be easily understood. And not waste time in our too-short lives on pointless language nitpickery like we're doing here.
I suppose I should have said it's doubtful that you lived at a time when the argument against singular they was legitimate. I'm sure many teachers are teaching students incorrect things all the time.
But by modern English standards, which I would define through works like Fowlers', the legitimate argument against singular they is long over.
> Anyone arguing against singular 'they' may as well be arguing for a gender neutral 'he'
That seems like a flawed analogy; many people abuse the traditional generic usage of 'they' but I've seen few people abuse the traditional generic usage of 'he'.
Mostly they're bad ones, you can read about them in the linked articles, specifically what's noted in Fowler's 1926 is probably the best place to start.
But the usage of they to refer to a single person is older than anyone alive, if that makes you feel better about sticking it to your picky grade school teachers.
Don't worry about it. Some people lose their marbles because they think females get erased when male language is used. Just erase both genders and you'll be fine. Use singular they.
Did some research on the topic of high-bandwidth/high-IOPS file access (some of my conclusions could be wrong), and as I discovered, modern NVMe drives need to have some queue pressure on them to perform at advertised speeds; at the hardware level they are essentially a separate CPU running in the background with command queue(s). They also need requests aligned with the flash memory hierarchy to perform at advertised speeds. That puts quite a finicky limitation on your access patterns: 64-256kB aligned blocks, 8+ accesses in parallel. To see this, just try CrystalDiskMark and set queue depth to 1-2, and/or block size to something small like 4kB, and watch your random speed plummet.
So given the limitations on the access pattern, if you just mmap your file and memcpy the pointer, you'll get roughly one access request in flight if I understand right. And since the default page size is 4kB, that will be a 4kB request size. Then your mmap relies on IRQs to get completion notifications (instead of polling the device state), somewhat limiting your IOPS. Sure, prefetching will help, of course, but it relies on a lot of heuristic machinery to get the correct access pattern, which sometimes fails.
As 7+GB/s drives and 10+GbE networks become more and more mainstream, the main place people will notice these requirements is file copying; for example, Windows Explorer struggles to copy files at 10-25GbE+ rates simply because of how its file access architecture is designed. And hopefully then we will be better equipped to reason about "mmap" vs "read" (really should be pread here to avoid the offset sem in the kernel).
Yep, mmap is really bad for performance on modern hardware because you can only fault on one page at a time (per thread), but SSDs require a high queue depth to deliver the advertised throughput. And you can't overcome that limitation by using more threads, because then you spend all your time on context switches. Hence, io_uring.
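A rough sketch of what "hence, io_uring" means for reads, using liburing (the queue depth and block size are illustrative numbers, not tuned values; error handling and buffer reuse are omitted):

    #include <liburing.h>
    #include <stdlib.h>

    enum { QD = 32, BLOCK = 128 * 1024 };   /* illustrative queue depth and request size */

    /* Assumes the caller has done: struct io_uring ring; io_uring_queue_init(QD, &ring, 0);
       One thread keeps up to QD reads in flight, so the SSD actually sees queue depth. */
    static void read_with_queue_depth(struct io_uring *ring, int fd, size_t file_size) {
        off_t submitted = 0;
        int inflight = 0;
        while ((size_t)submitted < file_size || inflight > 0) {
            while (inflight < QD && (size_t)submitted < file_size) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                if (!sqe)
                    break;
                void *buf = malloc(BLOCK);
                io_uring_prep_read(sqe, fd, buf, BLOCK, submitted);
                io_uring_sqe_set_data(sqe, buf);
                submitted += BLOCK;
                inflight++;
            }
            io_uring_submit(ring);
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(ring, &cqe) == 0) {
                void *buf = io_uring_cqe_get_data(cqe);
                /* ... process cqe->res bytes in buf here ... */
                free(buf);
                io_uring_cqe_seen(ring, cqe);
                inflight--;
            }
        }
    }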
Can't you just use MAP_POPULATE which asks the system to populate the entire mapped address range, which is kind of like page-faulting on every page simultaneously?
That usually works if you have sufficient RAM, and do plan to touch substantially all of the file, and don't have any tight QoS targets to meet around the time you map the file.
Even if you use madvise() for a large sequential read, the kernel will often restrict its behavior to something suboptimal with respect to performance on modern hardware.
Yeah. Linux will end up splitting the requests down to typically 128kB blocks, but they're submitted to the SSD as a batch rather than one at a time, so there's sufficient work to keep the drive properly busy. But only do this if you actually need all 100MB. If you're randomly accessing only bits and pieces of the file, it's usually better to stick with 4kB requests (or larger if your file format and access patterns make that appropriate).
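In code, the hinting referred to above looks roughly like this (these are hints the kernel is free to cap or ignore, as noted; the helper name is made up):

    #include <stddef.h>
    #include <sys/mman.h>

    /* After mmap()ing a region you intend to scan linearly, advise the kernel so it
       can read ahead aggressively; Linux will typically split this into ~128kB
       requests but submit them as a batch. */
    static void hint_sequential(void *addr, size_t length) {
        madvise(addr, length, MADV_SEQUENTIAL);  /* expect a linear scan */
        madvise(addr, length, MADV_WILLNEED);    /* start asynchronous read-ahead of the range now */
    }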
As a word of warning, mmap is fine if the semantics match the application.
mmap is not a good idea for a general purpose read()/write() replacement, e.g. as advocated in the 1994 "alloc stream facility" paper by Krieger et al. I worked with an I/O library that followed this strategy, and we had no end of trouble figuring out how to robustly deal with resizing files, and also how to do the windowing in a good way (this was in the time when we needed to care about systems with 32-bit pointers, VM space getting tight, but still needed to care about files larger than 2 GB). And then we needed the traditional read/write fallback path anyway, in order to deal with special files like ttys, pipes etc. In the end I ripped out the mmap path, and we saw a 300x perf improvement in some benchmark.
Also error handling. read and write can return errors, but what happens when you write to a mmaped pointer and the underlying file system has some issue? Assigning a value to a variable cannot return an error.
So you get a fine SIGBUS to your application and it crashes. Just the other day I used imagemagick and it always crashed with a SIGBUS and just when I started googling the issue I remembered mmap, noticed that the partition ran out of space, freed up some more and the issue was gone.
So you might want to set up a handler for that signal, but now the control flow suddenly jumps to another function if an error occurs, and you have to somehow figure out where in your program the error occurred and then what? Then you remember that longjmp exists and you end up with a steaming pile of garbage code.
Only use mmap if you absolutely must. Don't just "mmap all teh files" as it's the new cool thing you learned about.
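To illustrate what the SIGBUS-plus-longjmp pattern described above ends up looking like, here's a minimal sketch (simplified, not thread-safe, helper names made up):

    #include <setjmp.h>
    #include <signal.h>
    #include <stddef.h>

    static sigjmp_buf read_env;               /* global: one guarded read at a time */

    static void bus_handler(int sig) {
        (void)sig;
        siglongjmp(read_env, 1);              /* jump back out of the faulting access */
    }

    /* Copy len bytes out of a mapping, returning -1 instead of crashing if the
       access raises SIGBUS (e.g. the backing file was truncated or the fs errored). */
    static int guarded_copy(const unsigned char *mapped, size_t len, unsigned char *out) {
        struct sigaction sa = {0}, old;
        sa.sa_handler = bus_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, &old);

        int rc = 0;
        if (sigsetjmp(read_env, 1) == 0) {
            for (size_t i = 0; i < len; i++)
                out[i] = mapped[i];           /* may fault */
        } else {
            rc = -1;                          /* arrived here via siglongjmp from the handler */
        }
        sigaction(SIGBUS, &old, NULL);        /* restore the previous handler */
        return rc;
    }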
Indeed. The issue with file resizing I mentioned was mostly related to error handling (what if another process/thread/file descriptor/ truncates the file, etc.). But yes, there are of course other errors as well, like the fs running out of space you mention.
Yeah, this is the biggest reason I stay the hell away from mmap now. Signal handlers are a much worse minefield than error handling in any standard file I/O API I've seen.
Then what do you think happens when you read from your mapped memory and the file system is corrupted and returns an error, or the drive has a bad sector, or the nfs server acts up...
Reading from a busted file system is a problem to be dealt with (or not), yes, and I certainly wouldn’t recommend mmaping a file shared over nfs if you can help it. I’m not sure what the use case is where that would seem like a good idea.
C, pointers, and mmap are dangerous, sharp instruments but I have to wonder who some of these dramatic warnings are for.
In general we can probably agree you should always check the return value of read() and write() and handle errors. At least just perror() and abort(), so the user has a chance at finding the problem. Similarly, using mmaped files without handling errors is user hostile since it just crashes the app and SIGBUS gives absolutely no clue to the user what happened. As said, my point is to use mmap when it really makes sense and is worth the hassle, not just because it seems cool and makes the code look a little simpler, exactly because you omit error handling. Especially if you don't know how people will use your software. As you said, mmap on nfs is bad, so you'd basically have to forbid users from using your software with network shares.
Except for running out of VM space, all the other issues are still there. And even if you have (for the time being) practically unlimited VM space, you may still not want to mmap a file of unbounded size, since setting up all those mappings takes quite a lot of time if you're using the default 4 kB page size. So you probably want to do some kind of windowing anyway. But then if the access pattern is random and the file is large, you have to continually shift the window (munmap + mmap) and performance goes down the drain. So I don't think going to 64-bit systems tilts the balance in favor of mmap.
Linux allocates page tables lazily, and fills them lazily. The only upfront work is to mark the virtual address range as valid and associated with the file. I'd expect mapping giant files to be fast enough to not need windowing.
There are still some cases where you'd not want unlimited VM mapping, but those are getting a bit esoteric and at least the most obvious ones are in the process of getting fixed.
Oh uh, IIRC 2004/2005 or thereabouts. Personally I was using PC HW running an up to date Linux distro, as I guess was the vast majority of the userbase, but there was a long tail of all kinds of weird and wonderful targets where the software was deployed.
I replicated the tests and discovered kernel/libc are both using rep/mov for the copy so copy-speed isn't the cause of the difference.
The cause of the difference is spectre mitigations which now make system call overhead much higher than it should be. So much so that the cost of the additional pagefaults incurred by the inefficient mmap version get dwarfed.
With mitigations=off they both perform roughly identically.
You could smell a rat anyway because AVX copies may be faster, but not that much faster.
It’s a bit old, but it should still apply. I remember that in general he was annoyed when seeing people recommend mmap instead of read/write for basic I/O usecases.
In general, it’s almost always better to use the specialized API (read, write etc.) instead of reinventing the wheel on your own.
LMDB (and its modern fork MDBX), and kdb+/shakti make incredibly good use of mmap - I suspect it is possible to get similar performance from read(), but probably at 10x the implementation complexity.
This is a poor explanation and poor benchmarking. Let’s see:
copy_user_enhanced_fast_string uses a CPU feature that (supposedly) is very fast. Benchmarking it against AVX could be interesting, but it would need actual benchmarking instead of handwaving. It’s worth noting that using AVX at all carries overhead, and it’s not always the right choice even if it’s faster in a tight loop.
Page faults, on x86_64, are much slower than syscalls. KPTI and other mitigations erode this difference to some extent. But surely the author should have compared the number of page faults to the number of syscalls. Perf can do this.
Finally, munmap() is very, very expensive, as is discarding a mapped page. This is especially true on x86. Workloads that do a lot of munmapping need to be careful, especially in multithreaded programs.
Last year we were migrating part of YouTube's serving to a new system and we were observing unexplainable high tail latency. It was eventually attributed to mlock()ing some mmap()ed files, which ended up freezing the whole process for significant amounts of time.
Memory mapped files are very tricky outside the happy path. In particular recovery from errors and concurrent modification leading to undefined behaviour. It's a good choice for certain use-cases, such as reading assets shipped with the application, where no untrusted process can write to the file and errors can be assumed to not happen.
The boost.interprocess library presents the capability to keep data structures (std::list, std::vector, ...) in shared memory (i.e. a memory-mapped file)- "offset pointers" are key to that. I can think of no other programming language that can pull this off, with such grace.
the functions (for copying data) used for syscall and mmap are very different, and not only in the name.
__memmove_avx_unaligned_erms, called in the mmap experiment, is implemented using Advanced Vector Extensions (AVX) (here is the source code of the functions that it relies on).
The implementation of copy_user_enhanced_fast_string, on the other hand, is much more modest. That, in my opinion, is the huge reason why mmap is faster. Using wide vector instructions for data copying effectively utilizes the memory bandwidth, and combined with CPU pre-fetching makes mmap really really fast.
Why can’t the kernel implementation use AVX? Well, if it did, then it would have to save and restore those registers on each system call, and that would make domain crossing even more expensive. So this was a conscious decision in the Linux kernel
You're completely ignoring MAP_POPULATE being used to aggressively asynchronously read-ahead the data on the kernel side, which the read-based implementation isn't attempting any equivalent of (posix_fadvise(2) and readahead(2) come to mind).
Frankly TFA is relatively useless as an mmap vs. read comparison.
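For reference, the read()-side equivalent hinted at above would look roughly like this (sketch only; like madvise, these are hints the kernel may ignore, and the helper name is made up):

    #include <fcntl.h>
    #include <sys/types.h>

    /* Before a sequential read loop over fd, tell the kernel what's coming so the
       page cache is already warm when read() asks for each block. */
    static void prefetch_fd(int fd, off_t offset, off_t len) {
        posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);  /* expect sequential access */
        posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);    /* start read-ahead into the page cache now */
    }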
> if it did, then it would have to save and restore those registers on each system call
Stupid question: Does the entire AVX state need to be saved and restored for a memcpy routine? It seems like it would be sufficient to save at most a couple of registers for a tight copy loop. Even for 512-byte copies that ought to be a net win...
The glibc implementation[0] uses Enhanced REP MOVSB when the array is long enough. It takes a few cycles to start up the ERMS feature, so it's only used on longer arrays.
Edit: Wait a minute... if this is true, then how can AVX be responsible for the speed up? Is it related to the size of the buffers being copied internally?
> The glibc implementation[0] uses Enhanced REP MOVSB when the array is long enough. It takes a few cycles to start up the ERMS feature, so it's only used on longer arrays.
That isn't true anymore either, on sufficiently recent processors with "Fast Short REP MOVSB (FSRM)". If the FSRM bit is set (which it is on Ice Lake and newer), you can just always use REP MOVSB.
Still waiting for the "Yes, This Time We Really Mean It Fast REP MOVSB" (YTTWRMIFRM) bit.
More seriously, if REP MOVSB can be counted on always being the fastest method that's fantastic. One thing that easily gets forgotten in microbenchmarking is I$ pollution by those fancy unrolled SIMD loops with 147 special cases.
It should be, I think, though it's a complicated question whose answer varies on so much, cpu architecture and how it is used. There is a great discussion on it here, too.
I'm a bit of an idiot; when I think of AVX I think of something that speeds up computation (specifically matrix stuff), not memory access. How wrong am I?
Its registers are just larger. The way x86 moves memory is through registers, register-to-register or register-to/from-memory. The AVX registers move up to 64 bytes in one move. A general purpose register moves at most 8 bytes.
It's a set of SIMD (single instruction, multiple data) extensions to the amd64 instruction set. They allow you to operate on larger chunks of data with a single instruction - for example, do 16 integer multiplications in parallel, etc.
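As a small illustration of what that means for copying (illustrative only; a real memcpy also handles alignment, tails and runtime feature selection, and this needs -mavx2 to compile):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One 32-byte AVX2 register moves 32 bytes per load/store pair.
       Assumes n is a multiple of 32. */
    static void copy_avx2(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t i = 0; i < n; i += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
            _mm256_storeu_si256((__m256i *)(dst + i), v);
        }
    }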
Back in the days, this was on my 486 IIRC, using floating-point to copy was faster than using the regular integer instructions for pretty much the same reason, you could copy 64bit at a time rather than "just" 32bit.
> Why can’t the kernel implementation use AVX? Well, if it did, then it would have to save and restore those registers on each system call, and that would make domain crossing even more expensive. So this was a conscious decision in the Linux kernel.
I don't follow. So a syscall that could profit from AVX can't use it because then all syscalls would have to restore AVX registers? Why can't the restoring just happen specifically in those syscalls that make use of AVX?
It's not just syscalls. It's every context switch. If the process is in the midst of using AVX registers in kernel code, but is suddenly descheduled, those registers have to be saved/restored. You can't know if the task is using AVX or not, so you have to either always save/restore them, or adopt the policy that these registers are not saved/restored.
I think by "each system call" she meant it like "every time it calls read()", since it would be read() that was using the AVX registers. Since the example program just calls read() over and over, this could add a significant amount of overhead.
You'd have to have a static analysis of which syscalls can transitively reach which functions, which is probably not possible because linux uses tables of function pointers for many purposes. Also if thread 1 enters the kernel, suspends waiting for some i/o, and the kernel switches to thread 2, how would it know it needed to restore thread 2's registers because of AVX activity of thread 1? And if it did how would it have known to save them?
Yeah actually now that I'm part way through that first cup of coffee, the 2nd part of my comment doesn't make sense, the kernel already has to do a heavier save of a task's register state when it switches tasks.
That's not what they mean. You set up a memory map with the mmap system call, yes, but that's not the point.
The point is then that you can read and write mapped files by reading and writing memory addresses directly - you do not have to use a system call to perform each read and write.
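A minimal sketch of that usage pattern (hypothetical example, error handling trimmed):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* One mmap() call up front; after that the file's bytes are read with plain
       pointer accesses, no read() per chunk. Page faults fetch the data on demand. */
    static long count_newlines(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        const unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (p == MAP_FAILED)
            return -1;
        long n = 0;
        for (off_t i = 0; i < st.st_size; i++)
            n += (p[i] == '\n');          /* just memory accesses */
        munmap((void *)p, (size_t)st.st_size);
        return n;
    }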
Using mmap is like pretending you read your entire file into memory but all the memory got swapped to disk, so the pages have to be swapped back into memory when you access them.
Except you didn't do the initial read into memory, and instead of the system swap file it's using the file you actually want to read when paging data back into memory.
DMA is more like a bulk memory transfer operation usually facilitated by specific hardware that generally is asynchronous and requires manual synchronization. Usually hardware devices perform DMAs of memory regions, like a memcpy() but between physical memories.
Memory mappings established via mmap() more so set up the kernel to map in pages when faults occur on access. In this case you're not calling into the kernel; the MMU generates a page fault when you go to read an address referring to memory that's not yet paged in, which the kernel then handles before restoring control flow to userspace, without userspace being any wiser (unless userspace is keeping track of time). Handling page faults is faster than the syscalls involved in read() calls, it would seem.
For storage engines that prioritize performance and scalability, mmap() is a poor choice. Not only is it slower and less scalable than alternatives but it also has many more edge cases and behaviors you have to consider. Compared to a good O_DIRECT/io_submit storage engine design, which is a common alternative, it isn't particularly close. And now we have io_uring as an alternative too.
If your use case is quick-and-dirty happy path code then mmap() works fine. In more complex and rigorous environments, like database engines, mmap() is not well-behaved.
I use it a bit. The transactional aspect of it requires a bit of consideration, but generally the performance is good. I'd originally used libJudy in a bunch of places for fast lookups, but the init time for programs was being slowed by having to preload everything. Using an mmap/LMDB is a decent middle ground.
Like others, it's unclear to me that AVX is responsible for the speedup. copy_user_enhanced_fast_string seems to be a beast. To copy data from kernel memory to userland (code link below), for every byte it copies, it checks if it's crossing a page boundary and about to cause a page fault. I don't see how this whole function could get compiled down to just a REP MOVSB.
There are different benefits and drawbacks of mmap. I only use mmap because it simplifies my code (I write experimental code).
But the OP and commenters are mostly concerned about the extra copy of read().
For video memory, DMABUF has been used to avoid extra copying. The idea is that the hardware directly DMAs into a buffer mapped in the userspace instead of copying into a kernel buffer and then copying into userspace. A system call is still involved, but no extra copying of data.
I wonder if anyone has actually implemented a disk I/O interface using DMABUF. There are a couple of low-level disk interfaces. But I don't know if any of them uses DMABUF.
The author says that in userspace memcpy, AVX is used, but
> The implementation of copy_user_enhanced_fast_string, on the other hand, is much more modest.
Why is that? I mean, if you compiled your kernel for a wide range of machines, then fine, but if you compiled targeting your actual CPU, why would the kernel functions not use AVX?
On a related tangent, why is mmap() needed to begin with? Is it not an example of a leaky abstraction?
I know, I know, how else would a process communicate its needs to the memory management system. I have seen shellcode that marks writeable pages executable using mmap, so it is certainly useful.
It's called REP MOVSB (or MOVSW, MOVSD, maybe also MOVSQ?). It has existed since the 8086 days; and for reasons I don't know, it supposedly works well for big blocks these days (>1K or so) but is supposedly slower than register moves for small blocks.
> it supposedly works well for big blocks these days (>1K or so) but is supposedly slower than register moves for small blocks.
On current processors with Fast Short REP MOVSB (FSRM), REP MOVSB is the fastest method for all sizes. On processors without FSRM, but with ERMS, REP MOVSB is faster for anything longer than ~128 bytes.
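For reference, this is all REP MOVSB amounts to at the source level (x86-64 GCC/Clang inline asm, illustrative; glibc and the kernel wrap it in feature checks and tail handling):

    #include <stddef.h>

    /* Copy n bytes with a single REP MOVSB. On ERMS/FSRM hardware the microcode
       chooses the wide internal moves for you. */
    static void copy_rep_movsb(void *dst, const void *src, size_t n) {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }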
The 4-16kB buffer sizes are all rather tiny and inefficient for high throughput use-cases, which makes those results not that relevant. Something between 64kB to 1MB seems more applicable.
It's all very benchmark-chasing theoretical. In practice performance is more complicated, and mmap is this weird corner-case, over-engineered, inconsistent optimization thing that often wastes or even "leaks" memory that could instead be used for caches that actually matter for performance; it's also awful at error handling, and so on. I had to literally patch LevelDB to disable mmap on amd64 once, which eliminated OOMs on those servers, allowed me to run way more LevelDB instances and overall improved performance so significantly that I had to write this comment.
But they might "leak" it. What parent meant is that as an mmap() user you have no control how much of a mapping takes actual memory while you're visiting random memory-mapped file locations. Is that documented somewhere?
The paging system will only page in what's being used right now though, and paging out has zero cost. Old data will naturally be paged out. To put it directly, the answer is that each mmapped file will need 1 page of physical memory (the area currently being read/written). There may be old pages left around, since there's no reason for the OS to page anything out unless some other application asks for the memory. But if one does, mmap will go down to 1 page just fine, and there's zero cost to paging out.
I feel mmap gets a bad reputation when people look at memory usage tools that look at total virtual memory allocated.
I can mmap 100GB of files, use 0 physical memory, and a lot of memory usage tools will report 100GB of memory usage of a certain type (virtual memory allocated). You then get articles about application X using GBs of memory. Anyone trying to correct this is ignored.
Google Chrome is somewhat unfairly hit by this. All those articles along the lines of "Why is Chrome using 4GB with no tabs after I viewed some large PDFs". The answer is that it reserved 4GB of 'addresses' that it has mapped to files. If another application wants to use that memory there's zero cost to paging out those files from memory. The OS is designed to do this and it's what mmap is for.
Paging out, as in removing a mapping, can be surprisingly costly, because you need to invalidate any cached TLB entries, possibly even in other CPUs.
> each mmap file will need 1 page of physical memory
Technically, a lower limit would be about 2 or so usable pages, because you can't use more than that simultaneously. However unmaps are expensive, so the system won't be too eager to page out.
Also, for pages to be accessible, they need to be specified in the page table (actually tree, of virtual -> physical mappings). A random address may require about 1-3 pages for page table aside from the 1 page of actual data (but won't need more page tables for the next MB).
> application X using GB of memory
I think there is a difference between reserved, allocated and file-backed mmapped memory. Address space, file-backed mmapped memory is easily paged-out, not sure what different types of reserved addresses/memory are, but chrome probably doesn't have lots of mmapped memory that can be paged out. If it's modified, then it must be swapped, otherwise it's just reserved and possibly mapped, but never used memory.
I'd argue the costs with paging out are already accounted for by the other process paging in though. The other process that paged in and led to the need to page out had already led to the need to change the page table and flush cache.
Paging in free memory (adding a mapping) is cheap (no need to flush). Removing a mapping is expensive (need to flush). Also, processes have their own (mostly) independent page tables.
I don't think it would be reasonable accounting, when paging-in is cheap, but only if there is no need to page out (available free memory). Especially when trying to argue that paging out is zero-cost.
madvise gives you pretty good control over the paging, no? Generally I think you can madvise(MADV_DONTNEED) to page out content if you need to do it more aggressively, no? The benefit is that the kernel understands this enough that it can evict those page buffers, things it can't do when those buffers live in user-space.
No, madvise() does not give good control over paging behavior. As the syscall indicates, it is merely a suggestion. The kernel is free to ignore it and frequently does. This non-determinism makes it nearly useless for many types of page scheduling optimizations. For some workloads, the kernel consistently makes poor paging choices and there is no way to force it to make good paging choices. You have much better visibility into the I/O behavior of your application than the kernel does.
In my experience, at least for database-y workloads, if you care enough about paging behavior to use madvise(), you might as well just use any number of O_DIRECT alternatives that offer deterministic paging control. It is much easier than trying to cajole the kernel into doing what you need via madvise().
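As a sketch of the O_DIRECT style being referred to (Linux-specific; the 512-byte or 4kB alignment requirement depends on the device, error handling is trimmed, and the helper name is made up):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read len bytes at offset off, bypassing the page cache entirely: the
       application now decides what gets cached and when, rather than relying on madvise hints. */
    static ssize_t direct_pread(const char *path, void **buf_out, size_t len, off_t off) {
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0)
            return -1;
        void *buf;
        if (posix_memalign(&buf, 4096, len) != 0) { close(fd); return -1; }  /* O_DIRECT needs aligned buffers */
        ssize_t n = pread(fd, buf, len, off);   /* len and off must also meet the alignment rules */
        close(fd);
        *buf_out = buf;
        return n;
    }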
I write software for a HPC centre and we noticed this as well. The speed of our programs heavily using memory maps can vary by an order of magnitude based upon the memory pressure on the node.
This non-determinism in runtime is a show stopper for us so we ripped out most memory maps from our codebase. There is just too much magic happening under the hood to have control over your program.
Curious now - were you running an unconventional workload that stressed LevelDB, or do you think some version of this advice could be applicable to typical workloads?
LevelDB is kinda like a single-tablet bigtable, but because of that its mmap i/o is not a result of battle hardening in production systems. bigtable doesn't use local unix i/o for any purpose at all, so I'm not surprised to hear that leveldb's local i/o subsystem is half baked.
One mechanism we developed was to build a variant of our storage node that could run in isolation. This meant that synthetic testing would give us some optimal numbers for hardware vetting and performance changes.
I proved quite quickly that our application was quite thread-poor and that the cost of fixing it was well worth it, using other synthetic benchmarks to compare what the systems were capable of.
I was gone before that was finished, but it was quite an improvement. It also allowed cold volumes to exist in an over subscription model.
None of this is a substitute for good real-world telemetry and evaluation of your outliers.
The DMA needs to cooperate with the MMU which is on the CPU these days (and has been for almost 3 decades now). It's a lot of work to set up DMA correctly given physical<->logical memory mapping - so it's only worth it if you have a really big chunk to copy.
This is quite interesting. This, to me, seems like a systems bug. In the Embedded World, it is exceedingly common to use DMA for all high-speed transfers -- it's effectively a specialized parallel hardware 'MOV' instruction. Also, I have never had an occasion on modern PC hw to need mmap; read() lseek() are clean and less complex overall. Maybe I lack imagination.
mmap is being used by libraries under you; it's useful for files on internal drives that won't be deleted and that you want to access randomly, when you don't want to allocate buffers to copy things out of them.
For instance, anytime you call a shared library it's been mmapped by dyld/ld.so.
>Further, since it is unsafe to directly dereference user-level pointers (what if they are null — that’ll crash the kernel!) the data referred to by these pointers must be copied into the kernel.
False. If the file was opened with O_DIRECT, then the kernel uses the user-space buffer directly.
From man write(2):
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
I don't think O_DIRECT makes any guarantees about zero-copy operation. It merely disallows kernel-level caching of that data. But the kernel may make a private copy that isn't caching.
The original article said that the data must be copied based on some bogus handwavy argument, and I’ve pointed out that the manpage of write(2) contradicts this when it says the following:
>File I/O is done directly to/from user-space buffers.