Valkey achieved one million RPS 6 months after forking from Redis (valkey.io)
113 points by reconditerose 12 months ago | 37 comments


One thing the article doesn’t mention is how they figured out that waiting for external memory access is the bottleneck. Are there any profiling tools available that would tell the developer that the cpu is waiting for external memory x% of the time?


The other comments have mentioned the tools. On linux, there's good old perf.

There's `perf stat`, which uses CPU performance counters to give you a high-level view of whether your workload is stalling while waiting on memory: https://stackoverflow.com/questions/22165299/what-are-stalle.... However, it won't tell you exactly where the problem is, just that there is a problem. You can do `perf record` on your process and then run `perf report` on the generated data. You'll see which functions and which lines/instructions are taking the most time. Most of the time it will be pretty obvious that it's a memory bottleneck because the hot spot will be some kind of assignment or lookup.
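If you want something throwaway to practice that workflow on, here's a hypothetical toy (not related to Valkey; names and sizes are made up) that is deliberately latency-bound, so `perf stat` should show poor IPC with lots of backend stalls, and `perf report` should pin nearly all the time on the chasing loop:

    /* toy_stall.c - a deliberately memory-bound toy to practice the perf workflow on.
     *
     * Build:   gcc -O2 -o toy_stall toy_stall.c
     * Inspect: perf stat ./toy_stall                     (IPC, stalled cycles)
     *          perf record ./toy_stall && perf report    (hot loop attribution)
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N ((size_t)1 << 24)   /* ~16M entries (~128 MB), far larger than the LLC */

    int main(void) {
        /* Build a single random cycle of indices (Sattolo's shuffle), so the chase
           below visits every slot in random order and each load depends on the
           previous one: a classic pointer-chasing pattern. */
        size_t *next = malloc(N * sizeof *next);
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        size_t pos = 0, sum = 0;
        for (size_t i = 0; i < N; i++) {   /* essentially every iteration waits on memory */
            pos = next[pos];
            sum += pos;
        }
        printf("%zu\n", sum);              /* keep the loop from being optimized away */
        free(next);
        return 0;
    }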

If you're using an Intel processor, VTune is extremely detailed. Here's a nice article from Intel on using it: https://www.intel.com/content/www/us/en/docs/vtune-profiler/... . You'll see that one of the tables in the article lists functions as "memory bound" - most time is spent waiting on memory, as opposed to executing computations.


I am using perf. On Graviton, the event you want to look for is LLC-load-misses, which tracks last-level cache (LLC) misses (i.e., external memory accesses). The command "perf record -e LLC-load-misses -t $(pgrep valkey-server)" will record LLC-load-misses events for the valkey-server main thread. Please note that LLC-load-misses events are not collectable when running on instances that only use a subset of the processor's cores. "perf list" shows the events that can be collected on your machine.


Linux perf would do this if you profile for "backend stalls".


Intel VTune Profiler can do that.


Not directly related to the question, but a few years ago I read that B-trees are also recommended for in-memory use because the latency gap between CPU and memory is so high now.


This is really cool work! I am surprised to see this level of tuning without using cache profiling or other performance counters to identify the bottleneck and quantify the improvement.


They probably did.


Haha, we definitely used profiling tools. :) Perf is great and definitely the first tool to use. I know vtune has been used as well.


Redis' biggest flaw is its single threaded design. We end up having to run separate redis processes on each core and have client side sharding. We're lucky that our data allows this.
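Roughly what that client-side sharding looks like on the client (a simplified sketch, not our production code; the shard count, ports, and hash are just placeholders):

    /* Sketch of client-side sharding: one Redis/Valkey process per core,
     * and the client picks the instance by hashing the key.
     * Any stable hash works, as long as every client agrees on it. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SHARDS 8          /* e.g. one process per core */
    #define BASE_PORT  6379       /* instances listen on 6379..6386 */

    static uint64_t fnv1a(const char *key) {    /* FNV-1a string hash */
        uint64_t h = 1469598103934665603ULL;
        for (; *key; key++) {
            h ^= (unsigned char)*key;
            h *= 1099511628211ULL;
        }
        return h;
    }

    static int shard_port(const char *key) {
        return BASE_PORT + (int)(fnv1a(key) % NUM_SHARDS);
    }

    int main(void) {
        const char *keys[] = { "user:1001", "session:abc", "queue:emails" };
        for (size_t i = 0; i < sizeof keys / sizeof *keys; i++)
            printf("%-14s -> 127.0.0.1:%d\n", keys[i], shard_port(keys[i]));
        return 0;
    }

The obvious catch is that changing the shard count reshuffles almost every key, which is part of why being lucky with the data matters.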

We experimented with KeyDB too, but I'm not sure what state that project is in.


KeyDB is awesome. Works great.


Great to see the Valkey team making progress well beyond keeping the old Redis version security-patched.


Who in their right mind uses linked lists for a database style workload? Try doing this with arrays to get a reasonable baseline.


Who in their right mind would use linked lists in linux kernel... hehe


That is not a database style workload like the one in the example with millions of elements that all had to be scanned.


A server with 100K TCP connections isn't much different, and you might be surprised (or maybe not) by how extensively dynamic data structures are used in the TCP stack implementation. I'm not sure all of them can be made 'cache-friendly,' and embedding all layers together would make the code unmaintainable. Intensive access to external memory is inevitable when accessing a large enough number of unrelated states.


Nice, how does that compare with Pedis, which was written in C++ several years ago? It's an incomplete Redis lookalike that isn't getting current development, but it uses Seastar, the same parallelism framework as ScyllaDB.

https://github.com/fastio/1store


It is a completely different kind of parallelism. Seastar makes it easier to leverage parallelism across many cores. Here Valkey is leveraging instruction-level parallelism within a single core.

A single CPU core can actually execute more than one instruction at a time by leveraging out-of-order execution. The trick to exploiting out-of-order execution is avoiding local data dependencies. By swapping the iteration order, they allow the CPU core to continue with the next iteration before the previous one has finished. Why? Because there is no data dependency anymore!

I haven't profiled that code, but I guess that now the bottleneck would be the sum. But it doesn't matter, as accessing a register is the fastest operation. Accessing the memory cache is slower, and accessing RAM is slower still.
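Roughly what the swapped iteration order looks like (a sketch, not the actual Valkey patch; the struct and function names are made up):

    #include <stddef.h>

    struct node { long value; struct node *next; };

    /* Version 1: walk each list to completion. Every load of n->next depends on
     * the previous load, so the core mostly waits on one cache miss at a time. */
    long sum_sequential(struct node **lists, int nlists) {
        long sum = 0;
        for (int i = 0; i < nlists; i++)
            for (struct node *n = lists[i]; n != NULL; n = n->next)
                sum += n->value;
        return sum;
    }

    /* Version 2: advance all lists one step per pass. The loads issued within a
     * pass don't depend on each other, so the out-of-order core can have several
     * cache misses in flight at once. */
    long sum_interleaved(struct node **lists, int nlists) {
        long sum = 0;
        struct node *cur[nlists];              /* C99 VLA: scratch cursors */
        for (int i = 0; i < nlists; i++)
            cur[i] = lists[i];
        int active = nlists;
        while (active > 0) {
            active = 0;
            for (int i = 0; i < nlists; i++) {
                if (cur[i] != NULL) {
                    sum += cur[i]->value;
                    cur[i] = cur[i]->next;
                    active++;
                }
            }
        }
        return sum;
    }

In the interleaved version the only loop-carried dependency is the running sum, which lives in a register, so the cache misses from the different lists can overlap.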


I'm sure if they approach enough people offering to show them their Pedis, they will receive some feedback.


Ain’t nobody adopting anything named Pedis


I’m sorry, pedis?


I wonder why it didn’t get traction?


It looks to me like they have changed the name to 1store but haven't updated the web pages completely.


Parallel Redis!


Could have called it Threadis


What would be the point of comparing a popular active fork to an unheard-of non-maintained one?


Usually the point is the same as with a very-heard-of well-maintained one: how did they do it?

Algorithms and methodologies don't require being well-known or being well-maintained to be valid and useful comparisons.


I would argue that doing a comparison requires the author to actually have heard of the other option first. The world has infinite things, we can only focus on so much.


I mean, it's sort of what it says on the tin: it's parallelized. Valkey is essentially just an optimized Redis, which is largely single threaded. Pedis is multithreaded. That comes with tradeoffs under certain workloads, which you don't even have to get too creative to imagine.


Yeah, right now the valkey project is trying to keep full compatibility with the last OSS version of Redis while improving where we can. There are plenty of concurrent hash maps attached to a TCP server that can outperform it on raw throughput, but they're usually missing a lot of features. One day valkey might be fully multithreaded, but not anytime soon. (Unless someone has a good idea for it)


The two articles on your epoll command queuing and prefetching have a lot of observations similar to the BP-Wrapper approach [1], so you might find that to be an interesting paper to read. It is used by the Caffeine cache [2, 3], which pairs a concurrent hash map with lossy striped ring buffers for reads and a lossless write buffer to record & replay policy updates. On my M3 Max 14-core, a 16-thread in-process zipf benchmark achieved 900M reads/s, 585M r/w per s, and 40M writes/s (100% hit rate, so updates only). Of course the majority of your cost is in your I/O threads, but there are a few fun ideas nonetheless.

[1] https://dgraph.io/blog/refs/bp_wrapper.pdf

[2] https://highscalability.com/design-of-a-modern-cache/

[3] https://highscalability.com/design-of-a-modern-cachepart-deu...
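If it helps to see the shape of the lossy read-buffer idea, here's a stripped-down sketch in C (nothing like Caffeine's actual implementation: a single buffer instead of striped per-thread ones, initialization omitted, and the "policy" is just a callback):

    /* Readers record the entry they touched with a cheap atomic append and simply
     * drop the record when the buffer is full (lossy). A drainer occasionally
     * replays the recorded accesses into the eviction policy under a try-lock,
     * so readers never block on the policy. */
    #include <stdatomic.h>
    #include <stddef.h>

    #define BUF_SIZE 256                      /* power of two */
    #define BUF_MASK (BUF_SIZE - 1)

    struct read_buffer {
        _Atomic(void *) slots[BUF_SIZE];
        atomic_size_t   write_idx;            /* claimed by readers */
        atomic_size_t   read_idx;             /* advanced by the drainer */
        atomic_flag     drain_lock;
    };

    /* Called on every cache hit: O(1) and allowed to lose records. */
    static void record_read(struct read_buffer *b, void *entry) {
        size_t w = atomic_load_explicit(&b->write_idx, memory_order_relaxed);
        size_t r = atomic_load_explicit(&b->read_idx, memory_order_acquire);
        if (w - r >= BUF_SIZE)
            return;                           /* full: drop the event (lossy) */
        w = atomic_fetch_add_explicit(&b->write_idx, 1, memory_order_acq_rel);
        atomic_store_explicit(&b->slots[w & BUF_MASK], entry, memory_order_release);
    }

    /* Called occasionally (e.g. when the buffer looks full): replay into the policy. */
    static void drain(struct read_buffer *b, void (*on_access)(void *entry)) {
        if (atomic_flag_test_and_set(&b->drain_lock))
            return;                           /* someone else is draining */
        size_t r = atomic_load_explicit(&b->read_idx, memory_order_relaxed);
        size_t w = atomic_load_explicit(&b->write_idx, memory_order_acquire);
        while (r != w) {
            void *e = atomic_exchange_explicit(&b->slots[r & BUF_MASK], NULL,
                                               memory_order_acquire);
            if (e != NULL)
                on_access(e);                 /* e.g. bump the entry in the policy */
            r++;
        }
        atomic_store_explicit(&b->read_idx, r, memory_order_release);
        atomic_flag_clear(&b->drain_lock);
    }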


I wonder if it is feasible to implement a shared memory interface instead of TCP. I'm not sure whether unix domain sockets are currently supported. I remember that Pedis/1Store can use DPDK to get a significant boost compared with the kernel network stack. I don't know if Redis/Valkey can do that.


The interleave thing isn't intuitive to me.

The problem with linked lists is that the memory addresses of the nodes aren't necessarily contiguous because of malloc, and the value could be NULL? Why does the interleaved loop make it faster for the CPU? It's still a linked list, arbitrary memory, could be NULL? Not sure what I'm missing here?


It's a bit like row vs column store.

If there are two lists, in the first example, you're doing:

    a1 -> b1 -> c1 -> d1
    a2 -> b2 -> c2
In the 2nd example, you're doing

    [a1, a2]
    [b1, b2]
    [c1, c2]
    [d1]
Visiting the same number of nodes, but because the nodes are referenced from an array, when you load `a1` you're [probably] also going to load `a2` into the cache.


> Visiting the same number of nodes, but because the nodes are referenced from an array, when you load `a1` you're [probably] also going to load `a2` into the cache.

Yep, the problem with a linked list is that you don't know where the next node lives; it could be allocated anywhere. However, if you have a contiguous array, when you read the first item you're also fetching the next few items in the same cache line.
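A toy sketch of the two layouts (illustrative only, not Valkey code):

    #include <stddef.h>

    struct node { long value; struct node *next; };   /* nodes may live anywhere on the heap */

    long sum_list(const struct node *n) {
        long sum = 0;
        for (; n != NULL; n = n->next)     /* every ->next hop can be a fresh cache miss */
            sum += n->value;
        return sum;
    }

    long sum_array(const long *vals, size_t count) {
        long sum = 0;
        for (size_t i = 0; i < count; i++) /* neighboring values share a cache line, and */
            sum += vals[i];                /* the hardware prefetcher follows the pattern */
        return sum;
    }

On a typical 64-byte cache line, one miss in the array case brings in roughly eight longs and the prefetcher runs ahead of the sequential scan; in the list case each hop can be a brand-new miss.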


[flagged]


What makes it repulsive?


I don’t think I’ll be able to explain it to someone to whom it’s not already apparent, but let me try:

it’s ugly, yet cutesy, clumsy to say, and mainly sounds like a careless or trifling afterthought based on “key/val”, because the fork just needed some name or other that didn’t matter.



