What kind of instances are you guys running for Redis/memcached? I am a bit surprised by the numbers here, but to be fair I don't do much in the virtualization world. With low CPU overhead, it sounds like you might be saturating the number of interrupts the network card can handle, if it's not a bandwidth issue. Memcache can usually push 100-300k requests/s on an 8-core Westmere (it could go higher if you removed the big lock). Redis, on the other hand, with one process pinned to each physical core, can do about 500,000/s. We (Twitter) saw saturation at around 100,000 on CPU0; what tipped us off was ksoftirqd spinning at 100%. If you have a modern server and network card, just pin the IRQ for every TX/RX queue to its own physical core.
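To check whether one core is taking all the network interrupts, you can total the per-CPU counts in /proc/interrupts for the NIC's queues. Below is a minimal sketch of that check, not what we actually ran: the function name, the `eth0` device name, and the sample text are made up for illustration, and the real file's layout can vary between kernels.

```python
# Made-up sample in the usual /proc/interrupts layout (4 CPUs, 2 NIC queues).
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:         36          0          0          0   IO-APIC   2-edge      timer
 24:        900          0          0          0   PCI-MSI 524288-edge      eth0-TxRx-0
 25:        100          5          0          0   PCI-MSI 524289-edge      eth0-TxRx-1
NMI:          0          0          0          0   Non-maskable interrupts
"""


def per_cpu_irq_totals(interrupts_text, device="eth0"):
    """Sum interrupt counts per CPU over every row that mentions `device`."""
    lines = interrupts_text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        _label, _, rest = line.partition(":")
        fields = rest.split()
        if device not in rest or len(fields) < ncpu:
            continue                      # not a row for this device
        counts = fields[:ncpu]
        if not all(c.isdigit() for c in counts):
            continue                      # skip anything that is not a counter row
        for cpu, count in enumerate(counts):
            totals[cpu] += int(count)
    return totals


if __name__ == "__main__":
    # On a live Linux box, pass open("/proc/interrupts").read() instead of SAMPLE.
    print(per_cpu_irq_totals(SAMPLE))
```

If nearly all of the total lands on CPU0, as in the sample, the queues are not spread across cores.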
Those are really useful numbers. I think a lot of it can be chalked up to virtualization, but we should definitely explore IRQ pinning for queues more. Any good starting points or reading? Are you mostly using taskset?
Taskset is fine for the process pinning. Don't forget about hyperthreading: you want to keep each thread on its own hardware thread. For IRQ pinning, see an example script I have:
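The script referred to above is not included in this thread, and the following is not it. It is only a minimal sketch of the general idea, assuming a Linux host where each IRQ exposes /proc/irq/<n>/smp_affinity_list. The function names, the IRQ numbers, and the CPU list are hypothetical; real IRQ numbers for the NIC queues come from /proc/interrupts, and the CPU list should hold one logical CPU per physical core.

```python
import os


def plan_irq_pinning(irqs, cpus):
    """Assign each IRQ to one CPU, round-robin. Returns {irq: cpu}."""
    return {irq: cpus[i % len(cpus)] for i, irq in enumerate(irqs)}


def apply_plan(plan, proc_root="/proc/irq"):
    """Write each CPU number to <proc_root>/<irq>/smp_affinity_list.

    Needs root on a real system. A running irqbalance daemon may
    overwrite these settings, so it is usually stopped first.
    """
    for irq, cpu in plan.items():
        path = os.path.join(proc_root, str(irq), "smp_affinity_list")
        with open(path, "w") as f:
            f.write(str(cpu))


if __name__ == "__main__":
    # Hypothetical values: four queue IRQs, two physical cores (logical CPUs 0 and 2).
    plan = plan_irq_pinning([24, 25, 26, 27], [0, 2])
    print(plan)
    # apply_plan(plan)  # uncomment on a real host, as root
```

With more queues than cores, the round-robin assignment wraps around, so several queues share a core.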
Because in-house we have a custom version of memcache. We rewrote memcache's slab allocator, and for some use cases it is more memory-efficient than Redis.