One thing the article doesn’t mention is how they figured out that waiting for external memory access is the bottleneck. Are there any profiling tools available that would tell the developer that the cpu is waiting for external memory x% of the time?
The other comments have mentioned the tools. On linux, there's good old perf.
There's `perf stat` that uses CPU performance counters to give you a high-level view if your workload is stalling due to waiting on memory: https://stackoverflow.com/questions/22165299/what-are-stalle.... However it won't tell you exactly where the problem is, just that there is a problem.
You can do `perf record` on your process, and then running `perf report` on the generated data. You'll see what functions and what lines/instructions are taking the most time. Most of the time it will be pretty obvious that it's a memory bottleneck because it will be some kind of assignment or lookup.
If you're using an intel processor, VTune is extremely detailed. Here's a nice article from Intel on using it: https://www.intel.com/content/www/us/en/docs/vtune-profiler/... . You'll see one of the tables in the articles lists functions as "memory bound" - most time is spent waiting on memory, as opposed to executing computations.
I am using perf. On Graviton, the event you want to look for is LLC-load-misses, which tracks last-level cache (LLC) misses (i.e., external memory accesses). The command "perf record -e LLC-load-misses -t $(pgrep valley-server)" will record the number of LLC misses per instruction during the execution of the valkey-server main thread. Please note that LLC-load-misses events are not collectable when running on instances that only use a subset of the processor's core. "perf list" provides the events that can be collected on your machine
Not directly related to the question. A few years ago, I read about that B-tree is also recommended for in-memory operations because the latency between CPU/memory is high now.