Apex memmove – fast memcpy/memmove on x86/x64

userbinator · on July 7, 2016

I am unconvinced. REP MOVS is tiny (it is literally 2 bytes, a single instruction), extremely fast on modern x86, and in my experience what little gain (if any) you get from the giant blob of code will be offset by all the instruction cache bloat it causes. If you search for "enhanced REP MOVSB" you will find some further reading.

slashdev · on July 7, 2016

According to Intel themselves[1], REP MOVS is optimal only for large copies, above about 2kb. Below that the large startup costs of the microcode sequence can be beaten by hand-written code for the target architecture. Which explains why glibc takes that approach. The linux kernel on the other hand, just goes with:

  shrq $3, %rcx
  andl $7, %edx
  rep movsq
  movl %edx, %ecx
  rep movsb

Torvalds justified it as a way to force Intel and AMD to do the right thing and optimize the rep movs* instructions. He was annoyed at the code complexity of "more optimal" apporaches, and like you, concerned about icache effects.

A hand-written memmove is faster in microbenchmarks, but the icache effects may make the overall performance difference smaller (or even negative.) That's harder to measure.

AFAIK glibc does get better results than the kernel approach, but they've also introduced bugs[3] that way and the code is very complex by comparison. Also worth noting is that the kernel usually runs with a cold(er) icache, which is why they compile for size rather than speed (last I checked anyway, which was long ago) and why their memmove may make more sense in the context of the kernel. Also I suspect copies in the kernel are more likely to be larger (e.g. packets, pages, blocks) as opposed to small objects in userspace.

[1] http://www.intel.com/content/www/us/en/architecture-and-tech...

[2] https://github.com/torvalds/linux/blob/master/arch/x86/lib/m...

[3] https://bugzilla.redhat.com/show_bug.cgi?id=638477#c99

valarauca1 · on July 7, 2016

I believe this [1] is what you are referencing. Additional reading for the truly knowledge craving [2] (section 3.7.6).

For the lazy:

On aligned data REP MOVS will automatically select the largest available register/load/store instruction to use. So invoking this instruction will also determine SSE2 vs AVX vs AVX512 (or more over its burned into the silicon). Furthermore if your allocation is large enough it will by-pass caching mechanisms for an additional speed up.

For the VERY lazy:

REP MOVS will attempt to saturate DRAM bandwidth. The best hand coded ASM or C can hope to do is be as fast as it.

[1] https://stackoverflow.com/questions/26246040/whats-missing-s...

[2] https://www-ssl.intel.com/content/www/us/en/architecture-and...

stephencanon · on July 7, 2016

REP MOVS is the fastest option for huge copies on Ivybridge and later by a large margin, as it uses special cache semantics that aren't available via any other instructions.

For small copies with buffers that aren't aligned or cacheline-multiple length and are resident in L1 cache, it's possible to be significantly faster than REP MOVS using a software sequence of AVX instructions. This is because the branches in these sequences are usually perfectly predictable in the real world, but the "microcode sequence"[1] used by REP MOVS does not benefit from the branch predictor. This imposes a static startup cost (a few tens of cycles) that exceeds function-call overhead, which keeps software implementations of memcpy in business.

[1] Not exactly microcode as that term is classically understood, but there isn't really a better term for it that's widely used.

nkurz · on July 7, 2016

[1] Not exactly microcode as that term is classically understood, but there isn't really a better term for it that's widely used.

What are technically correct but less widely used terms? And how does the current Intel approach differ from classical microcode? Most of my knowledge is from the Intel manuals, so I don't have a context for how it is different than other approaches.

stephencanon · on July 7, 2016

I don't think there is one, really. "Fast microcode" or something, maybe intel has an internal name.

ori_b · on July 7, 2016

REP MOVS is the fastest for large moves, but it seems to have a higher fixed overhead. Intel's optimization manual goes into depth with comparisons between the various methods of blitting.

ataylor284_ · on July 7, 2016

I am skeptical as well, and I agree with you in principle. However I'd be interested if a 3rd party managed to reproduce his benchmarks. He credits using various prefetch hints for the performance gains, which sounds somewhat plausible.

JoeAltmaier · on July 7, 2016

The code may bloat, but only because of the permutations. Each call, you execute only the small case that applies. Maybe not that bad; probably it all fits in the nearest instruction cache?

jhallenworld · on July 7, 2016

Latency for short copies is very important. I think "rep movs" is always going to win in this case.

bboreham · on July 7, 2016

I couldn't believe it was possible to write a new article about faster copies. Reminds me of reading "Inner Loops" by Rick Booth [0] in 1996, which first opened my mind to the darker corners of x86 optimizing.

BTW, the article is quite strangely structured. The initial piece sounds like he has been up for four days straight and he just wants to get his thoughts down. But he forgot a few points, so they are added in five updates. Arguably that's the best place to start reading - around the middle of the page.

Then, he goes back in time two years and adds an earlier draft of the same piece, noting "don't even look at the original article! It's very confusing!".

[0] https://books.google.co.uk/books/about/Inner_Loops.html?id=v...

wscott · on July 7, 2016

I don't see how you can write an article on x86 memcpy without having benchmarks for 'rep movs' as a baseline. That is the old x86 assembly instruction that does the entire copy in one instruction. With modern processors, that instruction drops into a microcode loop to move the data using the internal instructions.

Microcode has a bit of a delay getting in there and so 'rep movs' is not so great for short moves, but normally does pretty well for large transfers.

That is the big problem with memcpy, the optimal code depends on how it is used. Block copies of large aligned structures wants different code than moving short unaligned strings. Ideally, the compiler has more context and can select the different versions for different call sites.

tedunangst · on July 7, 2016

Among other problems with benchmarking SSE versions, your operating system may implement lazy FPU save/restore. SSE memcpy suddenly turns every process into an FPU using process, increasing context switch overhead. This is something you won't see in single process microbenchmarks.

pcwalton · on July 7, 2016

Are there many processes out there that don't use SSE these days? LLVM, for example, will aggressively use SSE for memcpy/memmove of medium sized values (e.g. copies of 128 byte long structs), which almost every nontrivial app will end up using at some point. Likewise, if the autovectorizer succeeds even once, then you'll also end up using SSE.

tedunangst · on July 7, 2016

True. I'm still living in the Bronze Age, but we may already be past the tipping point. I wonder if any consideration was given. Personally I don't know what the cost is, but I know I've never seen anybody try to measure it.

fsfod · on July 7, 2016

I remember Andy Glew discussing a little bit of the history of the behaviour of REP MOVS microcode in the comments of this stack overflow question http://stackoverflow.com/questions/8858778/why-are-complicat...

wscott · on July 8, 2016

Yup, we spend a bunch of time trying different options.

But they never gave Andy the uop he really wanted which would zero the next cache line. Normally when you start writing to memory the machine will populate the cache line to be written from memory. He wanted the ability to say he plans on writing the whole thing so don't bother and if something happens then assume it is all zeros. (or something like that)

Perhaps the current machines have this now... but probably not.

bboreham · on July 7, 2016

If you use more registers as storage it can go faster. 'rep movs' uses no registers for storage; only source, destination and count.

There are many, many tricks like doing 'look ahead' to start the L1 cache bringing in the next line before we actually need it, to hide that latency.

And the article describes how there are various code paths for big blocks, short blocks, aligned, unaligned, etc.

mikeash · on July 7, 2016

The hardware can do whatever it wants as long as it preserves the semantics. I'm sure modern CPUs use registers and perform prefetching as appropriate when they encounter a rep movs.

masklinn · on July 7, 2016

Surely the CPU hardware isn't limited to the ABI either and microcode can directly work on the physical registers?

mikeash · on July 7, 2016

Indeed. It could even conceivably use some special memcpy hardware. I doubt it goes that far, but another comment here mentions that it will use SSE or AVX registers when appropriate.

`rep movsb` shouldn't be thought of as repeatedly copying single bytes from source to destination, but rather as a `memcpy` instruction with a weird name due to history. If it's used frequently enough to be worth optimizing (and it seems like it is) then Intel and AMD can easily recognize it as such and make their hardware perform the operation as efficiently as possible.

gens · on July 7, 2016

I don't want to register to download the code, so take my comment as such.

I wrote an (actually) fast memcpy() in assembly. There are more problems then just different size-s. Hardest problem is actually alignment, particularly 1 byte unaligned buffers.

A couple things (if hn doesn't autoformat): 1. SSE2 non-temporal MOV (one that doesn't go through cache) is basically required when going over ~cache_size. 2. One byte unaligned (or any odd number) to cpu alignment buffers need to have the first n bytes copied first to get to that alignment (16byte for SSE; also couple bytes at the end). 3. Buffers not aligned to each other are the biggest mess. SSSE3 actually helps with that as it can shuffle bytes around, but i found it easier (and probably faster) to shift-or the things.

There's probably more that i forgot. I lost the original code but i have a half-arsed version somewhere around here if someone wants to see. It is only for 8 or 16 bytes aligned buffers (to each other) but the grand structure is there.

PS To those saying "REP MOV* is faster": In theory, yes, but no, not really.

JoeAltmaier · on July 7, 2016

Such a fascinating problem! For efficiency you want large aligned loads and stores. The source and destination alignment are independent issues; so the CPU uses registers for a sort of shift-and-realign engine.

And you want to unroll each case, because speed is the goal.

The longer the alignment size (4-byte? 8-byte? 16-byte?) the more permutations of start and end conditions there are. E.g. moving from a 15-byte aligned source to a 3-byte aligned destination. Maybe 256 combinations? Only 1 of which is zero-aligned source and destination (full-register move) which is all most folks think of writing to begin with.

Add in cache-line size considerations, and overlap possibilities (are we moving a string in-place left or right by 7 characters?) Can the bus do unaligned loads? Stores? How do they compare in performance to multiple single-byte transfers? So many things can affect performance, each case becomes a heuristic. After all the theorizing, testing is the only real metric.

I've dived down this rabbit hole before. Wrote a test app to move 0-128 bytes from source aligned 0-128 to destination aligned 0-128. The memcpy I was testing on a RISC architecture hit 11 bugs (processor faults, incorrect result) before I was through.

Its refreshing to find somebody else with an appreciation for the depth of this problem. Thanks for posting, gens!

bjourne · on July 8, 2016

> I wrote an (actually) fast memcpy() in assembly. There are more problems then just different size-s. Hardest problem is actually alignment, particularly 1 byte unaligned buffers.

It is, but in performance critical code there is never any reason to use unaligned buffers or copying byte sizes not a multiple of the machine's register size. E.g you can have a memcpy_fast() which demands correct alignment, and the standard memcpy() which is slower but has no such requirements.

Kristine1975 · on July 7, 2016

>I don't want to register to download the code

BugMeNot has you covered: http://bugmenot.com/view/codeproject.com

grenoire · on July 7, 2016

His writing is difficult to read, and makes him sound like a snake oil salesman. Good work nevertheless.

imglorp · on July 7, 2016

Important safety tip! Exclaiming half your sentences make you sound breathless!

Kristine1975 · on July 7, 2016

>Although these are C/C++ functions, they were `designed by disassembly` ... meaning I was paying close attention to the compiler output. This is NOT normally advisable (we don't want the compiler to dictate how we design our functions), however, when you are designing things for maximum performance, it's important to pay attention to how the code will compile! The strange layout of the functions is a testament to my close observation of the compiler output.

Why not write in assembly then? And this sounds like a recipe for slower code when using a different compiler or a different version of the same compiler:

>The way instructions are ordered also prevents the compiler from making some `assumptions`!

graycat · on July 7, 2016

Gee, for a fast memory move of blocks of data that contained some whole virtual memory pages the fast way was just to have an API to the OS that tweaked the virtual memory page-segment tables! I'm not sure that simple idea has ever been used -- maybe there are some problems with it.

mikeash · on July 7, 2016

I recall that macOS's memcpy does this for some cases. There's overhead for the syscalls and TLB invalidation, so it's not efficient for smaller copies, but it's a win for bigger ones. I think it considers the breakeven point to be 40kB. And of course the source and destination both must be page-aligned.

wmf · on July 7, 2016

That idea has definitely been used for IPC (instead of copying data you can just "donate" pages to the receiving process), but it's often slower due to TLB invalidation. Sadly, real-world performance is much harder than "zero copy = faster".

ndesaulniers · on July 7, 2016

Does anyone have any reading material that goes into the issues with unaligned load/store? I'm trying to do some research for a blog post.

nkurz · on July 8, 2016

You may have already found it, but I posted some experimental results here last year, and others contributed additional useful additional information: http://www.agner.org/optimize/blog/read.php?i=415#423

_nalply · on July 7, 2016

As for size = -size I think he was inverting the bits then adding 1 in one step (or something similar). Just a guess.

efaref · on July 7, 2016

This is a good example of a pointless "optimization" making the code harder to read.

If he had indeed meant that, he should have written

    x = ~x + 1;

GCC, even with optimisation turned off, will turn that into a negl instruction.

joosters · on July 7, 2016

He mentions Visual Studio often in the page, and the code is several years old. Perhaps your alternative doesn't optimise well across the compilers he used, back several years ago when the author was developing the code.

In any case, there are unknowns here. Perhaps you should be less hasty in declaring people's work 'pointless' and telling others what they 'should' have done?

nkurz · on July 7, 2016

I didn't look at the code, but I often use something similar in assembly when hyper-optimizing loops. The fastest memcpy-like loop structure I've found on x64 involves using a negative offset from the end of the region, which is then incremented until it hits zero.

The advantage of this construct is that you can use the flags set by the increment to test for loop termination, rather than needing an additional comparison. The problem in C is that the compiler tends to "unoptimize" it for you after seemingly innocuous changes.

Anyway, the loop ends up looking something like this:

  char *end = start + size;
  size_t neg = -size;
  do { 
    register = *(end + neg);
    process(register);
    neg += advance;
  } while (neg);

For certain loops this construct shaves off 1 cycle per iteration, at the cost of ~1 cycle of initial setup cost. I'm wagering that he was doing something similar, but just kept the name "size" after the negation, rather than switching to "neg" as I did here.

DblPlusUngood · on July 7, 2016

> The advantage of this construct is that you can use the flags set by the increment to test for loop termination, rather than needing an additional comparison.

I don't understand your optimization. ZF is also set for dec, so why not just use dec and not negate the size?

nkurz · on July 8, 2016

Sticking with a positive value and using dec or sub works if you just have a loop counter or if it's acceptable to process the data in your buffer backwards, but switching to negative lets you go through the buffer in standard ascending order. Sometimes data dependencies require this, and sometimes ascending works better for prefetching.

_nalply · on July 10, 2016

Your theory sounds way better than mine.

kstrauser · on July 7, 2016

That paragraph lost my interest in the project. He's writing a function that has to work perfectly, every time, in all situations, and gets used A LOT. And yet, he never bothered to document his reasoning and no longer knows why it works? No thanks.

PaulHoule · on July 7, 2016

Reminds me of replacing the memcpy function from turbo Pascal with run that ran 2x as fast on a 286 processor.