Why is that an achievement? Isn't it the case that during a memory copy the CPU is basically idling because memory is so much slower than the CPU? Isn't that hundreds of instructions' worth of time per memory access? So instead of having it wait, it could also do computations (just standard instructions). Or is base64 computationally so heavy that it cannot fit into the "gaps"? I certainly have not tried; I'm just thinking from the high-level view that CPUs are incredibly fast and main memory is incredibly slow in comparison. And of course assuming that the data does not fit into any cache.
There are tables in the paper comparing its speed to other implementations. This is more than an order of magnitude faster than the implementation used in Chrome, and for small data more than twice as fast as their previous AVX2-based implementation (https://arxiv.org/abs/1704.00605). The paper even notes that in some cases, for large inputs, it decodes base64 faster than you could memcpy the data, for the simple reason that decoding only needs to write 6/8ths of the bytes. In their tests, it's only when the data fits in the tiny L1 cache that memcpy is consistently somewhat faster (44 GB/s vs 42 GB/s). So this is not only fast when you have to wait for a slow memory access, but also when the loads hit the caches.
Or are you asking in more of a rhetorical sense? Like, why hasn't this been achieved before, that it's obvious or something? AVX-512 is only some 3 years old and base64 is only one of many problems that could benefit from vectorized processing.
You're mostly right. It's often not that hard to completely saturate the memory bus.
However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.
Also, the data you'll need to process is often already hot in the CPU caches. Disk/network I/O can be DMA'd directly into the L3 cache.
So I think this decoder is a useful contribution. But it's also somewhat of a no-brainer once you have those AVX-512 instructions available.
> However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.
Base64 uses lookup tables, and the bit manipulations required are standard shifts and ANDs, which are basic, fast instructions on any CPU. That seems to be exactly what they do here, with an efficient use of AVX-512 to make it fast(er).
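To make that concrete, here's a minimal scalar sketch of the kind of decode loop I mean (names and simplifications are mine, padding and error handling reduced to the bare minimum): one table lookup per input byte, then shifts and ANDs to pack four 6-bit values into three output bytes.

    #include <stdint.h>
    #include <string.h>

    static uint8_t dec_table[256];

    static void init_table(void) {
        const char *alphabet =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz0123456789+/";
        memset(dec_table, 0xFF, sizeof(dec_table)); /* 0xFF marks invalid bytes */
        for (int i = 0; i < 64; i++)
            dec_table[(uint8_t)alphabet[i]] = (uint8_t)i;
    }

    /* Decode one 4-character group into 3 bytes; returns 0 on success. */
    static int decode_quad(const char in[4], uint8_t out[3]) {
        uint32_t v = 0;
        for (int i = 0; i < 4; i++) {
            uint8_t s = dec_table[(uint8_t)in[i]];
            if (s == 0xFF) return -1;    /* not a base64 character */
            v = (v << 6) | s;            /* append 6 bits */
        }
        out[0] = (uint8_t)(v >> 16);     /* repack 24 bits as 3 bytes */
        out[1] = (uint8_t)(v >> 8);
        out[2] = (uint8_t)v;
        return 0;
    }

Fast as scalar code goes, but the per-byte table access is the part that doesn't map directly onto SIMD registers.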
Because table lookups don't vectorize. You could try to use vectorized gather (VPGATHERDD [0]), but so far in my experience it's been just as slow as using scalar code. (Then again, even gather instructions operate just on 32-bit data, so it wouldn't help anyways.)
So to perform multiple operations in parallel per core (SIMD), you'll have to write some code to perform the "lookup" transformation instead.
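For illustration, here's roughly what that computed "lookup" can look like in scalar form (a sketch under my own assumptions, not the paper's actual kernel): classify each ASCII byte into its alphabet range and add a per-range offset, all branch-free, so the same per-byte logic can be expressed with SIMD compares and blends instead of a 256-entry table.

    #include <stdint.h>

    /* Map one ASCII base64 character to its 6-bit value without a big table.
       Offsets per range: 'A'-'Z' -> -65, 'a'-'z' -> -71, '0'-'9' -> +4,
       '+' -> +19, '/' -> +16. Input validation omitted for brevity. */
    static uint8_t ascii_to_sextet(uint8_t c) {
        int is_upper = (c >= 'A') & (c <= 'Z');
        int is_lower = (c >= 'a') & (c <= 'z');
        int is_digit = (c >= '0') & (c <= '9');
        int is_plus  = (c == '+');
        int is_slash = (c == '/');
        int offset = is_upper * -65 + is_lower * -71 + is_digit * 4
                   + is_plus * 19 + is_slash * 16;
        return (uint8_t)(c + offset);
    }

The AVX-512 work presumably goes further than this, doing the translation with wide in-register byte shuffles across 64 input bytes at once, but the basic idea is the same: turn the lookup into arithmetic that vectorizes.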
I think you've got it all wrong; the memcpy comparison is given as a baseline.
It just shows that even if you do not do anything (just copy the memory from place A to place B), you have some baseline CPU usage per GB. So it is kind of saying: with this algorithm, you have very minimal CPU usage.
This algorithm is not doing excessive memory accesses and idling the CPU because of memory bandwidth; it is just using very little CPU.
While it is correct that memory is orders of magnitude slower than modern CPUs, this is also why CPUs sport multiple levels of caches.
memcpy operations (and base64 encoding/decoding) are highly "local" operations. When the segment of memory they operate on has been placed in the CPU's L1 or L2 cache, the operation is much faster and will not need to access main memory for each byte/word. Only when a memory barrier is encountered will the CPU need to flush/refresh to/from main memory.
You are thinking of random memory access latency, which is very slow. Reading memory in contiguous spans is very fast in comparison because of prefetching. Writing benefits to a certain degree from the same effects (e.g. TLB locality), but it is not as fast as prefetched reads.