Why is that an achievement? Isn't it the case that during a memory copy the CPU is basically idling because memory is so much slower than the CPU? Isn't that hundreds of instructions' worth of time per memory access? So instead of having it wait, it could also do computations (just standard instructions). Or is base64 computationally so heavy that it cannot fit into the "gaps"? I certainly have not tried; I'm just thinking from the high-level view that CPUs are incredibly fast and main memory is incredibly slow in comparison. And of course assuming that the data does not fit into any cache.
There are tables in the paper comparing its speed to other implementations. This is more than an order of magnitude faster than the implementation used in Chrome, and for small data more than twice as fast as their previous AVX2-based implementation (https://arxiv.org/abs/1704.00605). The paper even notes that in some cases, for large inputs, it decodes base64 faster than you could memcpy the data, for the simple reason that decoding only needs to write 6/8ths of the bytes. In their tests, it's only when the data fits in the tiny L1 cache that memcpy is consistently somewhat faster (44 GB/s vs 42 GB/s). So this is not only fast when you have to wait for a slow memory access, but also when the loads hit the caches.
Or are you asking in more of a rhetorical sense? Like, why hasn't this been achieved before, that it's obvious or something? AVX-512 is only some 3 years old and base64 is only one of many problems that could benefit from vectorized processing.
You're mostly right. It's often not that hard to completely saturate the memory bus.
However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.
Also, the data you'll need to process is often already hot in the CPU caches. Disk/network I/O can be DMA'd directly into the L3 cache.
So I think this decoder is a useful contribution. But it's also somewhat of a no-brainer once you have those AVX-512 instructions available.
> However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.
Base64 uses lookup tables, and the bit manipulations required are standard shifts and ANDs, which are basic, fast instructions on any CPU. That seems to be exactly what they do here, with an efficient use of AVX-512 to make it fast(er).
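To make that concrete, here's a minimal scalar sketch of the kind of decode loop I mean (names and simplifications are mine, padding and error handling reduced to the bare minimum): one table lookup per input byte, then shifts and ANDs to pack four 6-bit values into three output bytes.

    #include <stdint.h>
    #include <string.h>

    static uint8_t dec_table[256];

    static void init_table(void) {
        const char *alphabet =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz0123456789+/";
        memset(dec_table, 0xFF, sizeof(dec_table)); /* 0xFF marks invalid bytes */
        for (int i = 0; i < 64; i++)
            dec_table[(uint8_t)alphabet[i]] = (uint8_t)i;
    }

    /* Decode one 4-character group into 3 bytes; returns 0 on success. */
    static int decode_quad(const char in[4], uint8_t out[3]) {
        uint32_t v = 0;
        for (int i = 0; i < 4; i++) {
            uint8_t s = dec_table[(uint8_t)in[i]];
            if (s == 0xFF) return -1;    /* not a base64 character */
            v = (v << 6) | s;            /* append 6 bits */
        }
        out[0] = (uint8_t)(v >> 16);     /* repack 24 bits as 3 bytes */
        out[1] = (uint8_t)(v >> 8);
        out[2] = (uint8_t)v;
        return 0;
    }

Fast as scalar code goes, but the per-byte table access is the part that doesn't map directly onto SIMD registers.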
Because table lookups don't vectorize. You could try to use vectorized gather (VPGATHERDD [0]), but so far in my experience it's been just as slow as using scalar code. (Then again, even gather instructions operate just on 32-bit data, so it wouldn't help anyways.)
So to perform multiple operations in parallel per core (SIMD), you'll have to write some code to perform the "lookup" transformation instead.
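For illustration, here's roughly what that computed "lookup" can look like in scalar form (a sketch under my own assumptions, not the paper's actual kernel): classify each ASCII byte into its alphabet range and add a per-range offset, all branch-free, so the same per-byte logic can be expressed with SIMD compares and blends instead of a 256-entry table.

    #include <stdint.h>

    /* Map one ASCII base64 character to its 6-bit value without a big table.
       Offsets per range: 'A'-'Z' -> -65, 'a'-'z' -> -71, '0'-'9' -> +4,
       '+' -> +19, '/' -> +16. Input validation omitted for brevity. */
    static uint8_t ascii_to_sextet(uint8_t c) {
        int is_upper = (c >= 'A') & (c <= 'Z');
        int is_lower = (c >= 'a') & (c <= 'z');
        int is_digit = (c >= '0') & (c <= '9');
        int is_plus  = (c == '+');
        int is_slash = (c == '/');
        int offset = is_upper * -65 + is_lower * -71 + is_digit * 4
                   + is_plus * 19 + is_slash * 16;
        return (uint8_t)(c + offset);
    }

The AVX-512 work presumably goes further than this, doing the translation with wide in-register byte shuffles across 64 input bytes at once, but the basic idea is the same: turn the lookup into arithmetic that vectorizes.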
I think you've got it all wrong; the memcpy comparison is given as a baseline.
It just shows that even if you do not do anything (just copy the memory from place A to place B), you have some baseline CPU usage per GB. So it is kind of saying: with this algorithm, you have very minimal CPU usage.
This algorithm is not doing excessive memory accesses and idling the CPU because of memory bandwidth; it is just using very little CPU.
While it is correct that memory is orders of magnitude slower than modern CPUs, this is also why CPUs sport multiple levels of caches.
memcpy operations (and base64 encoding/decoding) are highly "local" operations. When the segment of memory they operate on has been placed in the CPU's L1 or L2 cache, the operation is much faster and will not need to access main memory for each byte/word. Only when a memory barrier is encountered will the CPU need to flush/refresh to/from main memory.
You are thinking of random memory access latency, which is very slow. Reading memory in contiguous spans is very fast in comparison because of prefetching. Writing benefits to a certain degree from the same effects (e.g. TLB locality), but it is not as fast as prefetched reads.