There isn't exactly a shortage of examples where mmap() was quite convenient in the short term for getting to market quickly with something Good Enough to appear production-ready, but proved rather problematic in the long term.
MongoDB's original mmap storage engine (32-bit support, anyone?) ultimately required replacement (WiredTiger was purchased in part for this reason, IIRC).
systemd-journald read side (journalctl, `systemctl status`, sd-journal API), performance/scalability definitely suffers from the decision to use mmap for all IO.
mmap IO is extremely seductive in terms of programming style/convenience as it facilitates writing procedural, IO-naive code that assumes file-backed memory accesses Just Work. But this is transparently blocking, synchronous code.
Since your program isn't explicitly scheduling IO and continuations for execution upon completion of IO, there's zero application-level IO parallelism. The only asynchronous IO possible is what the kernel can automagically perform using relatively dumb page fault heuristics and madvise() hints.
When your process triggers a major fault on an mmap region, everything comes to a grinding halt; no speculative execution past that point, no potential for triggering other page faults in other files or independent regions of the same file, it all stops until the data is read from the backing store. There's no way around this without threads, which bring a host of other issues.
Multiple threads touching the same mapped page at the same time won't give you IO parallelism. You need different threads to be faulting on different pages at the same time in order to get multiple simultaneous IO requests sent to the storage device. But the IO queue depth is still limited by your thread count, and that limit can still be a serious problem even on the latest processors with the highest core counts.
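To make that last point concrete, here's a minimal sketch (hypothetical, not from any codebase mentioned here) of the only real way to get concurrent faults out of a single mapping: spread the touching across threads, one slice each. The helper names and the 64-thread cap are assumptions for illustration.

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical sketch: N threads each touch a different slice of one mapping so
       their major faults (and the reads behind them) can be in flight concurrently.
       The achievable queue depth is still bounded by the thread count. */
    struct slice { const unsigned char *base; size_t len; };

    static void *touch_slice(void *arg) {
        struct slice *s = arg;
        unsigned long sum = 0;
        for (size_t i = 0; i < s->len; i += 4096)  /* one touch per 4kB page */
            sum += s->base[i];                     /* each touch may major-fault */
        return (void *)sum;                        /* returning the sum keeps the loads from being optimized away */
    }

    static void prefault_parallel(const unsigned char *map, size_t len, int nthreads) {
        pthread_t tid[64];                         /* assumes nthreads <= 64 */
        struct slice s[64];
        size_t chunk = len / (size_t)nthreads;
        for (int i = 0; i < nthreads; i++) {
            s[i].base = map + (size_t)i * chunk;
            s[i].len  = (i == nthreads - 1) ? len - (size_t)i * chunk : chunk;
            pthread_create(&tid[i], NULL, touch_slice, &s[i]);
        }
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);
    }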
Claiming "mmap is faster than system calls" is dangerous.
I once worked for a company where they also heard someone say "mmap is faster than read/write" and as a consequence rewrote their while( read() ) loop into the following monstrosity (roughly the pattern sketched in the code below):
1. mmap a 4KB chunk of the file
2. memcpy it into the destination buffer
3. munmap the 4KB chunk
4. repeat until eof
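For concreteness, a rough C sketch of that pattern (a hypothetical reconstruction from the description above, not the original code):

    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    /* The "monstrosity": map each 4kB block, copy it out, unmap it. Three syscalls
       plus page-table churn and fresh page faults per block. Shown as a cautionary
       example, not a recommendation. */
    static ssize_t copy_file_in_4k_maps(int fd, size_t file_size, unsigned char *dst) {
        for (size_t off = 0; off < file_size; off += 4096) {
            size_t len = (file_size - off < 4096) ? file_size - off : 4096;
            void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, (off_t)off);  /* 1. mmap a 4kB chunk */
            if (p == MAP_FAILED)
                return -1;
            memcpy(dst + off, p, len);   /* 2. copy it into the destination buffer (faults happen here) */
            munmap(p, len);              /* 3. unmap the chunk */
        }                                /* 4. repeat until EOF */
        return (ssize_t)file_size;
    }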
This is different from the claim in the article -- the above monstrosity is individually mmaping each 4KB block, while I presume the article's benchmark is mmaping the entire file in memory at once, which makes much more sense.
After I claimed the "monstrosity" was absurdly stupid, someone pointed to a benchmark they made and found that the "monstrosity" version was actually faster. To me, this made no sense. The monstrosity has triple the syscall overhead vs the read() version, requires manipulating page tables for every 4KB block and as a consequence had several page faults for each 4KB block of the file. Yet it was true: their benchmarks showed the monstrosity version to be slightly faster.
The idealist in me couldn't stand this and I reverted this change; for my own (unrelated) experiments I used a binary with the older, classic read() loop instead of mmap.
Eventually I noticed I was getting results much faster using my build on my single-socket Xeon than they were getting on their $$$ server farms. Despite what the benchmark said.
Turns out, the "monstrosity" was indeed faster, but if you had several of these binaries running concurrently on the same machine, they would all slow down each other, as if the kernel was having scale issues with multiple processes constantly changing their page tables. The thing would slow down to single-core levels of performance.
I still have no idea why the benchmark was apparently slightly faster, but obviously they were checking it either isolated or on machines where the other processes were running read() loops. I guess that by wasting more kernel CPU time on yourself you may starve other processes in the system, leaving more user time for yourself. But once every process does it, the net result is still significantly lowered performance for everyone.
Just yet another anecdote for the experience bag...
Using mmap in an unusual way (to read chunks) on presumably legacy hardware doesn't generalize to using it in the obvious way (mmap entire files or at least larger windows) on modern hardware.
Even old operating systems like Windows NT4 could easily memory map a window into a file as large as 256 MB at a time, and could map multiple segments concurrently. It could do this even if there was much less physical memory available, because this is a virtual memory mapping only.
Well-written code would start with a sliding window of some reasonable size such as 64 MB, and if that failed would try halving it repeatedly down to some lower threshold.
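Something like this, as a minimal sketch (the function name, window sizes and fallback are assumptions; shown with POSIX mmap rather than the Win32 MapViewOfFile the comment is about):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    /* Try to map a window of up to `want` bytes at a page-aligned `offset`,
       halving the window on failure down to `floor`. On a 64-bit system this
       rarely fails; the pattern mattered when 32-bit address space was tight. */
    static void *map_window(int fd, off_t offset, size_t want, size_t floor, size_t *got) {
        for (size_t len = want; len >= floor; len /= 2) {
            void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);
            if (p != MAP_FAILED) {
                *got = len;
                return p;
            }
        }
        return NULL;  /* caller falls back to plain read() at this point */
    }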
Unfortunately, the 64-bit era has led to a "pit of failure" where many programmers incorrectly assume that this means that 2⁶⁴ bytes can be mapped reliably in a single call. This is never true, because of all sorts of operating system and hardware limitations.
I've seen "modern" code written with this assumption, such as a Rust library and a couple of C# libraries. They fail on older Xeons, some hypervisors, and 32-bit platforms.
Even in 2021 server applications run as 32-bit surprisingly often. For example, Azure's App Service defaults to 32-bit on the low-end tiers to save memory.
I think the story is to always benchmark first, and also to make sure your benchmarks reflect real-world use. What's dangerous is assuming something is faster without benchmarking.
I think many people reading that anecdote may come away with the idea that mmap is bad (and a monstrosity even) and read is good rather than your interpretation that you should benchmark better.
I dislike this kind of muddying the waters and I hope my comment provides another perspective for readers.
The whole point of the story seems to be that you shouldn't just read a story like this or the linked article and take away a simplistic data point like "mmap is faster" or "mmap is bad"
Best place to start is to have a good mental model of how things work and why they would be performant or not for a particular use case. Otherwise you're just taking shots in the dark.
It seems a more apples-to-apples comparison would be to open a file, seek(), read() a block, then close() the file. Just as bizarre as the repeated mmap, of course.
Regardless of how bizarre it is, I've seen this in real code in embedded applications before. It's a workaround of buggy serial port drivers (flow control or buffering is probably broken): You open the port, read/write a line, close it, and open it again...
Hah I came here to say pretty much the same thing! Recently ran into it and coding that workaround on a resource constrained system felt absolutely bonkers.
If someone claimed running was faster than walking and then I told a story about how I once saw someone running in snow shoes on the grass and it was slower than walking then that would just be muddying the waters.
Yet the article is literally also showing a current peculiarity (user-space memcpy being faster than kernel-space memcpy) that may or may not be true for your platform _now_ or in 3 years from now.
If someone hears "mmap is faster than system calls" and then mmaps and munmaps 4KB chunks at a time in a loop, not realizing that mmap and munmap are actually system calls and that the benefit is not about calling those functions as much as possible, there is no saving them.
That's not the fault of a 'dangerous' claim, that's the fault of people who go in head first without taking 20 minutes to understand what they are doing or 20 minutes to profile after.
You'd need significantly more than 20 minutes to form an informed opinion on the topic if you don't already know basically everything about it.
The only thing you could do in that timespan is reading a single summary on the topic and hope that this includes all relevant information. Which is unlikely and the reason why things are often mistakenly taken out of context.
And as the original comment mentioned: they did benchmark and it showed an improvement. They just didn't stress-test it, but that's unlikely to be doable within 20 minutes either.
In 20 minutes you can read what mmap does and see that you can map a file and copy it like memory.
In another 20 minutes you can compile and run a medium sized program.
Neither of those is enough time for someone to go deep into something, but you can look up the brand new thing you're using and see where your bottlenecks are.
But without going deeper you are likely to have preconceived notions like "mmapping a gigabyte file into memory actually uses a gigabyte of memory, oh noes!" I can totally see the thinking behind this case.
I don't think an unsubstantiated general prediction about people making a wild assumption has much to do with the point that nothing is going to protect people from performance pitfalls who don't learn the bare basics about what they are doing.
Who is 'they'? Is it still the made up future scenario? The bare basics of mmap on a file is that it maps the file into a process's memory space instead of copying it into memory.
Was it perhaps a multi-threaded task? Because that would almost definitely crawl.
In general, unmapping is expensive, much more expensive than mapping memory, because you need to do a TLB shootdown/flush/whatever to make sure a cached version of the old mapping is not used. A read/write does a copy, so there's no need to mess with mappings and TLBs, hence it can scale very well.
It was multi-processing. I guess mmap (or at least munmap) also needs to send an IPI even if no other processor currently has the same VM, to avoid race conditions.
The fact that you once met some idiots who wrote code that wouldn't pass the review of any experienced Unix programmer doesn't really invalidate the fact that faulting in a mapped page is generally faster than getting the same data with a page-sized read. The one thing doesn't have anything to do with the other.
Right, especially if the alignment within the file is a multiple of the page size, and it's a shared mapping, many OSes can make a zero-copy read directly from non-volatile storage into the process's memory (buffer cache pages mapped directly into the process's address space).
On Linux mmap_sem contention is a well-known concurrency bottleneck, you may have been hitting that. Multiple efforts over the years have failed to fix it, IIRC. I guess one day they'll find a good solution, but until then, take care.
mmap is a more ergonomic interface than read too. How often are people copying a file to a local buffer, or the whole file a chunk at a time, in order to use the file like an array of bytes. mmap gives us a range of bytes right off the start. Even if not optimal, the simplicity in usage often means less room for bugs.
That can sometimes be a feature, not a bug. I've worked on systems that used memory mapped files as a form of persistent shared memory between processes.
I'm also puzzled as to why the benchmark of the monstrosity would have been faster. How is it not doing the exact same work as read()'ing 4KB, and then writing out the same to the destination file? And, as you said, 3x the system call overhead doesn't seem good, either....
Isn't the whole point of mmap to randomly access the data needed in memory? Did they think memcpy is a totally free operation or something, without any side effects?
I'm not sure the conclusion that vector instructions are responsible for the speed-up is correct. Both implementations seem to use ERMS (using REP MOVSB instructions)[0]. Looking at the profiles, the syscall implementation spends time in the [xfs] driver (even in the long test), while the mmap implementation does not. It appears the real speed-up is related to how memory-mapped pages interact with the buffer cache.
I might be misunderstanding things. What is really going on here?
The copy hypothesis also doesn't explain the difference between warm and cold start performance, at least for the "random read" case.
For warm start, the speed for read vs mmap is 3 GB/s vs 6 GB/s (i.e. mmap is 2x faster).
For cold start, that changes to 0.05 GB/s vs 0.25 GB/s (i.e. mmap is 5x faster).
If the only significant difference between read() and mmap() is how data is copied, then shouldn't the performance gap get much smaller in the cold start scenario? After all, you're adding the identical (and pretty large) overhead of reading from disk. Instead, the gap gets much larger. That's strange. (The sequential read case looks more reasonable.)
So there must be something else going on besides more efficient copying with mmap.
Summary: Mostly, syscalls and mmap do the same things, just substituting a page fault for a syscall to get to kernel mode, but… in user space her code is using AVX-optimized memory copy instructions, which are not accessible in kernel mode, yielding a significant speed-up.
Bonus summary: She didn’t use the mmapped data in place in order to make a more apples-to-apples comparison. If you can use the data in place then you will get even better performance.
Sasha is universally applied to both males and females, although to be fair, in Russian, it's culturally much more acceptable to call Alexander Sasha in any context whatsoever, whereas Sasha as in female Alexandra is reserved for informal communication.
Sasha is derived from Alexander via its diminutive but obsolete form, Aleksashka, shortened to Sashka and further simplified to Sasha, as per the established name format of Masha (Maria), Dasha (Daria), Pasha (Pavel, Paul), Glasha (Glafira), Natasha (Natalia), etc.
The bonus is what makes this writeup silly. I've always seen mmap presented as a way to reduce copies. If you're not using it for that, then yes, it's not a given it will be faster.
Yes, the test is sort of a strange one because the read test is reading directly into the final destination whereas mmap is not and is doing a redundant copy instead. So it's not your typical "eliminate a copy and go faster" usage of mmap.
What it actually seems to be comparing is whether its faster if the copy happens in the kernel, in the read() system call, or whether it's faster in userspace.
The results are puzzling because I would expect the kernel to be faster, since it would incur no page faults to access the pages from the page cache: those ought to fall under the page-global mappings, or at least be in the kernel's linear RAM map which uses 1GB pages (edit: in either case, not incur faults for demand map/demand load). But maybe that's no longer the case in a post-Spectre world?
So I think actually it's a test of "does AVX go enough-faster to outweigh the costs of the extra-pagefaults incurred by copying from a user-space mapping which has not been prefaulted"
Sounds like the answer is yes (which still surprises me). And I expect if they MAP_POPULATE it would pull even further ahead...
Yes, 1,167 pagefaults for the read() version and 66,704 pagefaults for the mmap() version. With MAP_POPULATE it's back down to 1,167 faults and it goes a bit faster.
My libc is using rep/mov, same as the kernel, however. So I can't conclude that AVX memcpy is the win.
From the profile of readsyscall version I see this:
42.12% fa [kernel.kallsyms] [k] copy_user_enhanced_fast_string
14.45% fa [kernel.kallsyms] [k] syscall_exit_to_user_mode
11.00% fa [kernel.kallsyms] [k] find_get_entry
6.28% fa [kernel.kallsyms] [k] syscall_return_via_sysret
5.14% fa [kernel.kallsyms] [k] entry_SYSCALL_64
nearly 15% of the time is spent on a single verw instruction. Spectre mitigations.
When I reboot with mitigations=off, sure enough, the difference goes away and read() and mmap() perform identically.
What OP has discovered is that spectre mitigations make syscalls enough slower than pagefaults that what used to be a slower way to do things is now faster.
If you want the mmap version to perform faster when spectre mitigations are off, you need to use MAP_POPULATE.
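For reference, a minimal sketch of what "use MAP_POPULATE" looks like in practice (the helper name is made up; error handling trimmed):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a whole file read-only and ask the kernel to fault all of it in up front,
       instead of paying one major fault at a time during the copy loop. */
    static void *map_whole_file(const char *path, size_t *len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);  /* prefault the mapping */
        close(fd);                                          /* the mapping stays valid after close */
        if (p == MAP_FAILED)
            return NULL;
        *len_out = (size_t)st.st_size;
        return p;
    }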
The kernel does have access to these instructions. It is a deliberate choice by kernel developers not to use them in the case discussed here. In other cases the kernel does use such instructions.
Thanks. Curse this language. I just want to refer to people! It’s simple encapsulation and abstraction. I shouldn’t have to care about implementation details irrelevant to the context.
Curiously, the page you've linked explains that it isn't normal to use it in this case: "can only be used with a morphologically and syntactically singular antecedent when what it refers to is semantically collective and/or generic and/or indefinite and/or unknown".
> According to the third edition, The New Fowler's Modern English Usage (edited by Burchfield and published in 1996) singular they has not only been widely used by good writers for centuries, but is now generally accepted, except by some conservative grammarians, including the Fowler of 1926, who, it is argued, ignored the evidence:
Anyone arguing against singular 'they' may as well be arguing for a gender neutral 'he' - an argument lost over a century ago by virtually any modern English standard.
Absolutely true. At least one whole generation grew up taught in school that singular "they" isn't a thing. I was on the tail end of that and read a lot of books written by the previous generation, so it never stuck strongly in my head, but that's what GP is referring to.
I don't think that's the case. I think many people (myself included) were taught only that "they" is plural, not that it explicitly cannot be singular. The distinction there may be too subtle to matter for some people, though.
Either way, clinging to a frozen-in-time interpretation of a living, changing language is foolish for anyone who would like to easily understand and be easily understood. And not waste time in our too-short lives on pointless language nitpickery like we're doing here.
I suppose I should have said it's doubtful that you lived at a time when the argument against singular they was legitimate. I'm sure many teachers are teaching students incorrect things all the time.
But by modern English standards, which I would define through works like Fowlers', the legitimate argument against singular they is long over.
> Anyone arguing against singular 'they' may as well be arguing for a gender neutral 'he'
That seems like a flawed analogy; many people abuse the traditional generic usage of 'they' but I've seen few people abuse the traditional generic usage of 'he'.
Mostly they're bad ones, you can read about them in the linked articles, specifically what's noted in Fowler's 1926 is probably the best place to start.
But the usage of they to refer to a single person is older than anyone alive, if that makes you feel better about sticking it to your picky grade school teachers.
Don't worry about it. Some people lose their marbles because they think females get erased when male language is used. Just erase both genders and you'll be fine. Use singular they.
Did some research on the topic of high-bandwidth/high-IOPS file access (some of my conclusions could be wrong), and as I discovered, modern NVMe drives need to have some queue pressure on them to perform at advertised speeds; at the hardware level they are essentially a separate CPU running in the background with command queue(s). They also need requests aligned with the flash memory hierarchy to perform at advertised speeds. That puts quite a finicky limitation on your access patterns: 64-256kB aligned blocks, 8+ accesses in parallel. To see this, just try CrystalDiskMark and set queue depth to 1-2, and/or block size to something small like 4kB, and watch your random speed plummet.
So given the limitations on the access pattern, if you just mmap your file and memcpy the pointer, you'll get roughly one access request in flight if I understand right. And since the default page size is 4kB, that will be a 4kB request size. Then your mmap relies on IRQs to get completion notifications (instead of polling the device state), somewhat limiting your IOPS. Sure, prefetching will help, of course, but it relies on a lot of heuristic machinery to get the correct access pattern, which sometimes fails.
As 7+GB/s drives and 10+GbE networks become more and more mainstream, the main place people will notice these requirements is file copying; for example, Windows Explorer struggles to copy files at 10-25GbE+ rates simply because of how its file access architecture is designed. And hopefully then we will be better equipped to reason about "mmap" vs "read" (really should be pread here to avoid the offset sem in the kernel).
Yep, mmap is really bad for performance on modern hardware because you can only fault on one page at a time (per thread), but SSDs require a high queue depth to deliver the advertised throughput. And you can't overcome that limitation by using more threads, because then you spend all your time on context switches. Hence, io_uring.
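A rough sketch of what "hence, io_uring" means for reads, using liburing (the queue depth and block size are illustrative numbers, not tuned values; error handling and buffer reuse are omitted):

    #include <liburing.h>
    #include <stdlib.h>

    enum { QD = 32, BLOCK = 128 * 1024 };   /* illustrative queue depth and request size */

    /* Assumes the caller has done: struct io_uring ring; io_uring_queue_init(QD, &ring, 0);
       One thread keeps up to QD reads in flight, so the SSD actually sees queue depth. */
    static void read_with_queue_depth(struct io_uring *ring, int fd, size_t file_size) {
        off_t submitted = 0;
        int inflight = 0;
        while ((size_t)submitted < file_size || inflight > 0) {
            while (inflight < QD && (size_t)submitted < file_size) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                if (!sqe)
                    break;
                void *buf = malloc(BLOCK);
                io_uring_prep_read(sqe, fd, buf, BLOCK, submitted);
                io_uring_sqe_set_data(sqe, buf);
                submitted += BLOCK;
                inflight++;
            }
            io_uring_submit(ring);
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(ring, &cqe) == 0) {
                void *buf = io_uring_cqe_get_data(cqe);
                /* ... process cqe->res bytes in buf here ... */
                free(buf);
                io_uring_cqe_seen(ring, cqe);
                inflight--;
            }
        }
    }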
Can't you just use MAP_POPULATE which asks the system to populate the entire mapped address range, which is kind of like page-faulting on every page simultaneously?
That usually works if you have sufficient RAM, and do plan to touch substantially all of the file, and don't have any tight QoS targets to meet around the time you map the file.
Even if you use madvise() for a large sequential read, the kernel will often restrict its behavior to something suboptimal with respect to performance on modern hardware.
Yeah. Linux will end up splitting the requests down to typically 128kB blocks, but they're submitted to the SSD as a batch rather than one at a time, so there's sufficient work to keep the drive properly busy. But only do this if you actually need all 100MB. If you're randomly accessing only bits and pieces of the file, it's usually better to stick with 4kB requests (or larger if your file format and access patterns make that appropriate).
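In code, the hinting referred to above looks roughly like this (these are hints the kernel is free to cap or ignore, as noted; the helper name is made up):

    #include <stddef.h>
    #include <sys/mman.h>

    /* After mmap()ing a region you intend to scan linearly, advise the kernel so it
       can read ahead aggressively; Linux will typically split this into ~128kB
       requests but submit them as a batch. */
    static void hint_sequential(void *addr, size_t length) {
        madvise(addr, length, MADV_SEQUENTIAL);  /* expect a linear scan */
        madvise(addr, length, MADV_WILLNEED);    /* start asynchronous read-ahead of the range now */
    }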
As a word of warning, mmap is fine if the semantics match the application.
mmap is not a good idea for a general purpose read()/write() replacement, e.g. as advocated in the 1994 "alloc stream facility" paper by Krieger et al. I worked with an I/O library that followed this strategy, and we had no end of trouble figuring out how to robustly deal with resizing files, and also how to do the windowing in a good way (this was in the time when we needed to care about systems with 32-bit pointers, VM space getting tight, but still needed to care about files larger than 2 GB). And then we needed the traditional read/write fallback path anyway, in order to deal with special files like ttys, pipes etc. In the end I ripped out the mmap path, and we saw a 300x perf improvement in some benchmark.
Also error handling. read and write can return errors, but what happens when you write to a mmaped pointer and the underlying file system has some issue? Assigning a value to a variable cannot return an error.
So you get a fine SIGBUS to your application and it crashes. Just the other day I used imagemagick and it always crashed with a SIGBUS and just when I started googling the issue I remembered mmap, noticed that the partition ran out of space, freed up some more and the issue was gone.
So you might want to set up a handler for that signal, but now the control flow suddenly jumps to another function if an error occurs, and you have to somehow figure out where in your program the error occurred and then what? Then you remember that longjmp exists and you end up with a steaming pile of garbage code.
Only use mmap if you absolutely must. Don't just "mmap all teh files" as it's the new cool thing you learned about.
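To illustrate what the SIGBUS-plus-longjmp pattern described above ends up looking like, here's a minimal sketch (simplified, not thread-safe, helper names made up):

    #include <setjmp.h>
    #include <signal.h>
    #include <stddef.h>

    static sigjmp_buf read_env;               /* global: one guarded read at a time */

    static void bus_handler(int sig) {
        (void)sig;
        siglongjmp(read_env, 1);              /* jump back out of the faulting access */
    }

    /* Copy len bytes out of a mapping, returning -1 instead of crashing if the
       access raises SIGBUS (e.g. the backing file was truncated or the fs errored). */
    static int guarded_copy(const unsigned char *mapped, size_t len, unsigned char *out) {
        struct sigaction sa = {0}, old;
        sa.sa_handler = bus_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, &old);

        int rc = 0;
        if (sigsetjmp(read_env, 1) == 0) {
            for (size_t i = 0; i < len; i++)
                out[i] = mapped[i];           /* may fault */
        } else {
            rc = -1;                          /* arrived here via siglongjmp from the handler */
        }
        sigaction(SIGBUS, &old, NULL);        /* restore the previous handler */
        return rc;
    }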
Indeed. The issue with file resizing I mentioned was mostly related to error handling (what if another process/thread/file descriptor/ truncates the file, etc.). But yes, there are of course other errors as well, like the fs running out of space you mention.
Yeah, this is the biggest reason I stay the hell away from mmap now. Signal handlers are a much worse minefield than error handling in any standard file I/O API I've seen.
Then what do you think happens when you read from your mapped memory and the file system is corrupted and returns an error, or the drive has a bad sector, or the nfs server acts up...
Reading from a busted file system is a problem to be dealt with (or not), yes, and I certainly wouldn’t recommend mmaping a file shared over nfs if you can help it. I’m not sure what the use case is where that would seem like a good idea.
C, pointers, and mmap are dangerous, sharp instruments but I have to wonder who some of these dramatic warnings are for.
In general we can probably agree you should always check the return value of read() and write() and handle errors. At least just perror() and abort(), so the user has a chance at finding the problem. Similarly, using mmaped files without handling errors is user hostile since it just crashes the app and SIGBUS gives absolutely no clue to the user what happened. As said, my point is to use mmap when it really makes sense and is worth the hassle, not just because it seems cool and makes the code look a little simpler, exactly because you omit error handling. Especially if you don't know how people will use your software. As you said, mmap on nfs is bad, so you'd basically have to forbid users from using your software with network shares.
Except for running out of VM space, all the other issues are still there. And even if you have (for the time being) practically unlimited VM space, you may still not want to mmap a file of unbounded size, since setting up all those mappings takes quite a lot of time if you're using the default 4 kB page size. So you probably want to do some kind of windowing anyway. But then if the access pattern is random and the file is large, you have to continually shift the window (munmap + mmap) and performance goes down the drain. So I don't think going to 64-bit systems tilts the balance in favor of mmap.
Linux allocates page tables lazily, and fills them lazily. The only upfront work is to mark the virtual address range as valid and associated with the file. I'd expect mapping giant files to be fast enough to not need windowing.
There are still some cases where you'd not want unlimited VM mapping, but those are getting a bit esoteric and at least the most obvious ones are in the process of getting fixed.
Oh uh, IIRC 2004/2005 or thereabouts. Personally I was using PC HW running an up to date Linux distro, as I guess was the vast majority of the userbase, but there was a long tail of all kinds of weird and wonderful targets where the software was deployed.
I replicated the tests and discovered kernel/libc are both using rep/mov for the copy so copy-speed isn't the cause of the difference.
The cause of the difference is spectre mitigations which now make system call overhead much higher than it should be. So much so that the cost of the additional pagefaults incurred by the inefficient mmap version get dwarfed.
With mitigations=off they both perform roughly identically.
You could smell a rat anyway because AVX copies may be faster, but not that much faster.
It’s a bit old, but it should still apply. I remember that in general he was annoyed when seeing people recommend mmap instead of read/write for basic I/O usecases.
In general, it’s almost always better to use the specialized API (read, write etc.) instead of reinventing the wheel on your own.
LMDB (and its modern fork MDBX), and kdb+/shakti make incredibly good use of mmap - I suspect it is possible to get similar performance from read(), but probably at 10x the implementation complexity.
This is a poor explanation and poor benchmarking. Let’s see:
copy_user_enhanced_fast_string uses a CPU feature that (supposedly) is very fast. Benchmarking it against AVX could be interesting, but it would need actual benchmarking instead of handwaving. It’s worth noting that using AVX at all carries overhead, and it’s not always the right choice even if it’s faster in a tight loop.
Page faults, on x86_64, are much slower than syscalls. KPTI and other mitigations erode this difference to some extent. But surely the author should have compared the number of page faults to the number of syscalls. Perf can do this.
Finally, munmap() is very, very expensive, as is discarding a mapped page. This is especially true on x86. Workloads that do a lot of munmapping need to be careful, especially in multithreaded programs.
Last year we were migrating part of YouTube's serving to a new system and we were observing unexplainable high tail latency. It was eventually attributed to mlock()ing some mmap()ed files, which ended up freezing the whole process for significant amounts of time.
Memory mapped files are very tricky outside the happy path. In particular recovery from errors and concurrent modification leading to undefined behaviour. It's a good choice for certain use-cases, such as reading assets shipped with the application, where no untrusted process can write to the file and errors can be assumed to not happen.
The boost.interprocess library presents the capability to keep data structures (std::list, std::vector, ...) in shared memory (i.e. a memory-mapped file)- "offset pointers" are key to that. I can think of no other programming language that can pull this off, with such grace.
the functions (for copying data) used for syscall and mmap are very different, and not only in the name.
__memmove_avx_unaligned_erms, called in the mmap experiment, is implemented using Advanced Vector Extensions (AVX) (here is the source code of the functions that it relies on).
The implementation of copy_user_enhanced_fast_string, on the other hand, is much more modest. That, in my opinion, is the huge reason why mmap is faster. Using wide vector instructions for data copying effectively utilizes the memory bandwidth, and combined with CPU pre-fetching makes mmap really really fast.
Why can’t the kernel implementation use AVX? Well, if it did, then it would have to save and restore those registers on each system call, and that would make domain crossing even more expensive. So this was a conscious decision in the Linux kernel
You're completely ignoring MAP_POPULATE being used to aggressively asynchronously read-ahead the data on the kernel side, which the read-based implementation isn't attempting any equivalent of (posix_fadvise(2) and readahead(2) come to mind).
Frankly TFA is relatively useless as an mmap vs. read comparison.
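For reference, the read()-side equivalent hinted at above would look roughly like this (sketch only; like madvise, these are hints the kernel may ignore, and the helper name is made up):

    #include <fcntl.h>
    #include <sys/types.h>

    /* Before a sequential read loop over fd, tell the kernel what's coming so the
       page cache is already warm when read() asks for each block. */
    static void prefetch_fd(int fd, off_t offset, off_t len) {
        posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);  /* expect sequential access */
        posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);    /* start read-ahead into the page cache now */
    }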
> if it did, then it would have to save and restore those registers on each system call
Stupid question: Does the entire AVX state need to be saved and restored for a memcpy routine? It seems like it would be sufficient to save at most a couple of registers for a tight copy loop. Even for 512-byte copies that ought to be a net win...
The glibc implementation[0] uses Enhanced REP MOVSB when the array is long enough. It takes a few cycles to start up the ERMS feature, so it's only used on longer arrays.
Edit: Wait a minute... if this is true, then how can AVX be responsible for the speed up? Is it related to the size of the buffers being copied internally?
> The glibc implementation[0] uses Enhanced REP MOVSB when the array is long enough. It takes a few cycles to start up the ERMS feature, so it's only used on longer arrays.
That isn't true anymore either, on sufficiently recent processors with "Fast Short REP MOVSB (FSRM)". If the FSRM bit is set (which it is on Ice Lake and newer), you can just always use REP MOVSB.
Still waiting for the "Yes, This Time We Really Mean It Fast REP MOVSB" (YTTWRMIFRM) bit.
More seriously, if REP MOVSB can be counted on always being the fastest method that's fantastic. One thing that easily gets forgotten in microbenchmarking is I$ pollution by those fancy unrolled SIMD loops with 147 special cases.
It should be, I think, though it's a complicated question whose answer varies on so much, cpu architecture and how it is used. There is a great discussion on it here, too.
I'm a bit of an idiot; when I think of AVX I think of something that speeds up computation (specifically matrix stuff), not memory access. How wrong am I?
Its registers are just larger. The way x86 moves memory is through registers, register-to-register or register-to/from-memory. The AVX registers move up to 64 bytes in one move. A general purpose register moves at most 8 bytes.
It's a set of SIMD (single instruction, multiple data) extensions to the amd64 instruction set. They allow you to operate on larger chunks of data with a single instruction - for example, do 16 integer multiplications in parallel, etc.
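As a small illustration of what that means for copying (illustrative only; a real memcpy also handles alignment, tails and runtime feature selection, and this needs -mavx2 to compile):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One 32-byte AVX2 register moves 32 bytes per load/store pair.
       Assumes n is a multiple of 32. */
    static void copy_avx2(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t i = 0; i < n; i += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
            _mm256_storeu_si256((__m256i *)(dst + i), v);
        }
    }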
Back in the days, this was on my 486 IIRC, using floating-point to copy was faster than using the regular integer instructions for pretty much the same reason, you could copy 64bit at a time rather than "just" 32bit.
> Why can’t the kernel implementation use AVX? Well, if it did, then it would have to save and restore those registers on each system call, and that would make domain crossing even more expensive. So this was a conscious decision in the Linux kernel.
I don't follow. So a syscall that could profit from AVX can't use it because then all syscalls would have to restore AVX registers? Why can't the restoring just happen specifically in those syscalls that make use of AVX?
It's not just syscalls. It's every context switch. If the process is in the midst of using AVX registers in kernel code, but is suddenly descheduled, those registers have to be saved/restored. You can't know if the task is using AVX or not, so you have to either always save/restore them, or adopt the policy that these registers are not saved/restored.
I think by "each system call" she meant it like "every time it calls read()", since it would be read() that was using the AVX registers. Since the example program just calls read() over and over, this could add a significant amount of overhead.
You'd have to have a static analysis of which syscalls can transitively reach which functions, which is probably not possible because linux uses tables of function pointers for many purposes. Also if thread 1 enters the kernel, suspends waiting for some i/o, and the kernel switches to thread 2, how would it know it needed to restore thread 2's registers because of AVX activity of thread 1? And if it did how would it have known to save them?
Yeah actually now that I'm part way through that first cup of coffee, the 2nd part of my comment doesn't make sense, the kernel already has to do a heavier save of a task's register state when it switches tasks.
That's not what they mean. You set up a memory map with the mmap system call, yes, but that's not the point.
The point is then that you can read and write mapped files by reading and writing memory addresses directly - you do not have to use a system call to perform each read and write.
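A minimal sketch of that usage pattern (hypothetical example, error handling trimmed):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* One mmap() call up front; after that the file's bytes are read with plain
       pointer accesses, no read() per chunk. Page faults fetch the data on demand. */
    static long count_newlines(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        const unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (p == MAP_FAILED)
            return -1;
        long n = 0;
        for (off_t i = 0; i < st.st_size; i++)
            n += (p[i] == '\n');          /* just memory accesses */
        munmap((void *)p, (size_t)st.st_size);
        return n;
    }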
Using mmap is like pretending you read your entire file into memory but all the memory got swapped to disk, so the pages have to be swapped back into memory when you access them.
Except you didn't do the initial read into memory, and instead of the system swap file it's using the file you actually want to read when paging data back into memory.
DMA is more like a bulk memory transfer operation usually facilitated by specific hardware that generally is asynchronous and requires manual synchronization. Usually hardware devices perform DMAs of memory regions, like a memcpy() but between physical memories.
Memory mappings established via mmap() more so set up the kernel to map in pages when faults occur on access. In this case you're not calling into the kernel; the MMU generates a page fault when you go to read an address referring to memory that's not yet paged in, which the kernel then handles before restoring control flow to userspace, without userspace being any wiser (unless userspace is keeping track of time). Handling page faults is faster than the syscalls involved in read() calls, it would seem.
For storage engines that prioritize performance and scalability, mmap() is a poor choice. Not only is it slower and less scalable than alternatives but it also has many more edge cases and behaviors you have to consider. Compared to a good O_DIRECT/io_submit storage engine design, which is a common alternative, it isn't particularly close. And now we have io_uring as an alternative too.
If your use case is quick-and-dirty happy path code then mmap() works fine. In more complex and rigorous environments, like database engines, mmap() is not well-behaved.
I use it a bit. The transactional aspect of it requires a bit of consideration, but generally the performance is good. I'd originally used libJudy in a bunch of places for fast lookups, but the init time for programs was being slowed by having to preload everything. Using an mmap/LMDB is a decent middle ground.
Like others, it's unclear to me that AVX is responsible for the speedup. copy_user_enhanced_fast_string seems to be a beast. To copy data from kernel memory to userland (code link below), for every byte it copies, it checks if it's crossing a page boundary and about to cause a page fault. I don't see how this whole function could get compiled down to just a REP MOVSB.
There are different benefits and drawbacks of mmap. I only use mmap because it simplifies my code (I write experimental code).
But the OP and commenters are mostly concerned about the extra copy of read().
For video memory, DMABUF has been used to avoid extra copying. The idea is that the hardware directly DMAs into a buffer mapped in the userspace instead of copying into a kernel buffer and then copying into userspace. A system call is still involved, but no extra copying of data.
I wonder if anyone has actually implemented a disk I/O interface using DMABUF. There are a couple of low-level disk interfaces. But I don't know if any of them uses DMABUF.
The author says that in userspace memcpy, AVX is used, but
> The implementation of copy_user_enhanced_fast_string, on the other hand, is much more modest.
Why is that? I mean, if you compiled your kernel for a wide range of machines, then fine, but if you compiled targeting your actual CPU, why would the kernel functions not use AVX?
On a related tangent, why is mmap() needed to begin with? Is it not an example of a leaky abstraction?
I know, I know, how else would a process communicate its needs to the memory management system. I have seen shellcode that marks writeable pages executable using mmap, so it is certainly useful.
It's called REP MOVSB (or MOVSW, MOVSD, maybe also MOVSQ?). It has existed since the 8086 days; and for reasons I don't know, it supposedly works well for big blocks these days (>1K or so) but is supposedly slower than register moves for small blocks.
> it supposedly works well for big blocks these days (>1K or so) but is supposedly slower than register moves for small blocks.
On current processors with Fast Short REP MOVSB (FSRM), REP MOVSB is the fastest method for all sizes. On processors without FSRM, but with ERMS, REP MOVSB is faster for anything longer than ~128 bytes.
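For reference, this is all REP MOVSB amounts to at the source level (x86-64 GCC/Clang inline asm, illustrative; glibc and the kernel wrap it in feature checks and tail handling):

    #include <stddef.h>

    /* Copy n bytes with a single REP MOVSB. On ERMS/FSRM hardware the microcode
       chooses the wide internal moves for you. */
    static void copy_rep_movsb(void *dst, const void *src, size_t n) {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }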
The 4-16kB buffer sizes are all rather tiny and inefficient for high throughput use-cases, which makes those results not that relevant. Something between 64kB to 1MB seems more applicable.
It's all very benchmark-chasing theoretical. In practice performance is more complicated, and mmap is this weird corner-case, over-engineered, inconsistent optimization thing that often wastes or even "leaks" memory that could instead be used for caches that actually matter for performance; it's also awful at error handling, and so on. I had to literally patch LevelDB to disable mmap on amd64 once, which eliminated OOMs on those servers, allowed me to run way more LevelDB instances and overall improved performance so significantly that I had to write this comment.
But they might "leak" it. What parent meant is that as an mmap() user you have no control how much of a mapping takes actual memory while you're visiting random memory-mapped file locations. Is that documented somewhere?
The paging system will only page in what's being used right now though, and paging out has zero cost. Old data will naturally be paged out. To put it directly, the answer is that each mmapped file will need 1 page of physical memory (the area currently being read/written). There may be old pages left around, since there's no reason for the OS to page anything out unless some other application asks for the memory. But if one does, mmap will go down to 1 page just fine, and there's zero cost to paging out.
I feel mmap gets a bad reputation when people look at memory usage tools that look at total virtual memory allocated.
I can mmap 100GB of files, use 0 physical memory, and a lot of memory usage tools will report 100GB of memory usage of a certain type (virtual memory allocated). You then get articles about application X using GBs of memory. Anyone trying to correct this is ignored.
Google Chrome is somewhat unfairly hit by this. All those articles along the lines of "Why is Chrome using 4GB with no tabs after I viewed some large PDFs". The answer is that it reserved 4GB of 'addresses' that it has mapped to files. If another application wants to use that memory there's zero cost to paging out those files from memory. The OS is designed to do this and it's what mmap is for.
Paging out, as in removing a mapping, can be surprisingly costly, because you need to invalidate any cached TLB entries, possibly even in other CPUs.
> each mmap file will need 1 page of physical memory
Technically, a lower limit would be about 2 or so usable pages, because you can't use more than that simultaneously. However unmaps are expensive, so the system won't be too eager to page out.
Also, for pages to be accessible, they need to be specified in the page table (actually tree, of virtual -> physical mappings). A random address may require about 1-3 pages for page table aside from the 1 page of actual data (but won't need more page tables for the next MB).
> application X using GB of memory
I think there is a difference between reserved, allocated and file-backed mmapped memory. Address space, file-backed mmapped memory is easily paged-out, not sure what different types of reserved addresses/memory are, but chrome probably doesn't have lots of mmapped memory that can be paged out. If it's modified, then it must be swapped, otherwise it's just reserved and possibly mapped, but never used memory.
I'd argue the costs with paging out are already accounted for by the other process paging in though. The other process that paged in and led to the need to page out had already led to the need to change the page table and flush cache.
Paging in free memory (adding a mapping) is cheap (no need to flush). Removing a mapping is expensive (need to flush). Also, processes have their own (mostly) independent page tables.
I don't think it would be reasonable accounting, when paging-in is cheap, but only if there is no need to page out (available free memory). Especially when trying to argue that paging out is zero-cost.
madvise gives you pretty good control over the paging, no? Generally I think you can madvise(MADV_DONTNEED) to page out content if you need to do it more aggressively, no? The benefit is that the kernel understands this enough that it can evict those page buffers, things it can't do when those buffers live in user-space.
No, madvise() does not give good control over paging behavior. As the syscall indicates, it is merely a suggestion. The kernel is free to ignore it and frequently does. This non-determinism makes it nearly useless for many types of page scheduling optimizations. For some workloads, the kernel consistently makes poor paging choices and there is no way to force it to make good paging choices. You have much better visibility into the I/O behavior of your application than the kernel does.
In my experience, at least for database-y workloads, if you care enough about paging behavior to use madvise(), you might as well just use any number of O_DIRECT alternatives that offer deterministic paging control. It is much easier than trying to cajole the kernel into doing what you need via madvise().
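As a sketch of the O_DIRECT style being referred to (Linux-specific; the 512-byte or 4kB alignment requirement depends on the device, error handling is trimmed, and the helper name is made up):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read len bytes at offset off, bypassing the page cache entirely: the
       application now decides what gets cached and when, rather than relying on madvise hints. */
    static ssize_t direct_pread(const char *path, void **buf_out, size_t len, off_t off) {
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0)
            return -1;
        void *buf;
        if (posix_memalign(&buf, 4096, len) != 0) { close(fd); return -1; }  /* O_DIRECT needs aligned buffers */
        ssize_t n = pread(fd, buf, len, off);   /* len and off must also meet the alignment rules */
        close(fd);
        *buf_out = buf;
        return n;
    }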
I write software for a HPC centre and we noticed this as well. The speed of our programs heavily using memory maps can vary by an order of magnitude based upon the memory pressure on the node.
This non-determinism in runtime is a show stopper for us so we ripped out most memory maps from our codebase. There is just too much magic happening under the hood to have control over your program.
Curious now - were you running an unconventional workload that stressed LevelDB, or do you think some version of this advice could be applicable to typical workloads?
LevelDB is kinda like a single-tablet bigtable, but because of that its mmap i/o is not a result of battle hardening in production systems. bigtable doesn't use local unix i/o for any purpose at all, so I'm not surprised to hear that leveldb's local i/o subsystem is half baked.
One mechanism we developed was to build a variant of our storage node that could run in isolation. This meant that synthetic testing would give us some optimal numbers for hardware vetting and performance changes.
I proved quite quickly that our application was quite thread-poor and that the cost of fixing it was well worth it, using other synthetic benchmarks to compare what the systems were capable of.
I was gone before that was finished, but it was quite an improvement. It also allowed cold volumes to exist in an over subscription model.
None of this is a substitute for good real-world telemetry and evaluation of your outliers.
The DMA needs to cooperate with the MMU which is on the CPU these days (and has been for almost 3 decades now). It's a lot of work to set up DMA correctly given physical<->logical memory mapping - so it's only worth it if you have a really big chunk to copy.
This is quite interesting. This, to me, seems like a systems bug. In the Embedded World, it is exceedingly common to use DMA for all high-speed transfers -- it's effectively a specialized parallel hardware 'MOV' instruction. Also, I have never had an occasion on modern PC hw to need mmap; read() lseek() are clean and less complex overall. Maybe I lack imagination.
mmap is being used by libraries under you; it's useful for files on internal drives that won't be deleted and that you want to access randomly, when you don't want to allocate buffers to copy things out of them.
For instance, anytime you call a shared library it's been mmapped by dyld/ld.so.
>Further, since it is unsafe to directly dereference user-level pointers (what if they are null — that’ll crash the kernel!) the data referred to by these pointers must be copied into the kernel.
False. If the file was opened with O_DIRECT, then the kernel uses the user-space buffer directly.
From man write(2):
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
I don't think O_DIRECT makes any guarantees about zero-copy operation. It merely disallows kernel-level caching of that data. But the kernel may make a private copy that isn't caching.
The original article said that the data must be copied based on some bogus handwavy argument, and I’ve pointed out that the manpage of write(2) contradicts this when it says the following:
>File I/O is done directly to/from user-space buffers.