Hacker News new | past | comments | ask | show | jobs | submit login
Otter, Fastest Go in-memory cache based on S3-FIFO algorithm (github.com/maypok86)
184 points by rickette on Dec 23, 2023 | hide | past | favorite | 71 comments



For a long time Go did not expose their super efficient internal map hashing algorithm that, if possible, would use AES instructions from the processor to hash even faster. That is, they did not expose it as an official API. People usually did some "unsafe" usage to link to that internal hashing function for their own super-fast map implementations.

That was changed some time ago. They released official maphash package: https://pkg.go.dev/hash/maphash

Otter author could probably look into replacing their 3rd party hashing dependency (https://github.com/dolthub/maphash) with the official one and knock off an unneeded dependency :)


hash/maphash isn't a very good replacement if you're trying to hash simple structs, since they still require conversion to byte slice by some means.


Hash functions all work on bytes sequence or string, hashing a struct/class/object is a different use case.


You use unsafe for that bit. Still less gross than DoltHub's take on it.


https://news.ycombinator.com/item?id=36434358

Note that S3-FIFO has no loop resistance. Nevertheless, the result that this library is functioning in a loop indicates that some important changes have been made. Also, the result that Ristretto, which has loop resistance, is not functioning in a loop is clearly an anomaly and likely not being measured correctly. Ristretto's benchmark results on S3, DS1, and OLTP differ markedly from the official ones (https://github.com/dgraph-io/ristretto). Otter's benchmark results are quite suspicious.


I think you may be right. The library author said he made many changes because implementing the eviction policy following the paper's design was pretty awful.

> First, I tried to implement cache on hash table and lock free queues, as the authors wrote. I was annoyed, s3-fifo was indeed many times faster than lru cache implemented on hash table with global mutex, but still lost to ristretto and theine, which use bp-wrapper techniques. The profiler also showed that it was the eviction policy that was spending the bulk of the time. And I was only able to beat ristretto and theine after rewriting the implementation using bp-wrapper.

> In summary, I completely disagree about the scalability advantage of S3-FIFO over the other policies

> Though to be honest, what was causing the hit ratio to go down in DS1 I still don't understand

https://github.com/Yiling-J/theine-go/issues/29#issuecomment...


1. Unfortunately, ristretto has been showing hit ratio around 0 on almost all traces for a very long time now and the authors don't respond to this in any way. Vitess for example has already changed it to another cache. Here are two issues about it: https://github.com/dgraph-io/ristretto/issues/346 and https://github.com/dgraph-io/ristretto/issues/336. That is, ristretto shows such results even on its own benchmarks. You can see it just by running hit ratio benchmarks on a very simple zipf distribution from the ristretto repository: https://github.com/dgraph-io/ristretto/blob/main/stress_test.... On this test I got the following: stress_test.go:75: actual: 0.07, optimal: 0.68

2. Yes, otter contains a number of changes to the algorithm that in my opinion improve it. For example, S3-FIFO in otter does not have a worst case time complexity equal to O(n), there O(1).

3. But the fact that S3-FIFO performs worse on DS1 than LRU sounds very doubtful. It sounds more like a bug in the implementation, since even on the caffeine simulator S3-FIFO wins.

linked.Lru 20.24 % product.Caffeine 45.82 % sampled.sampled.Lru 20.55 % two-queue.Qdlp 30.73 %


Could you share some pointers where I can read more on “loop resistance?” Does that just refer to whether it does well on a looping access pattern?


The usual phrase is "scan resistance", which is especially important for databases. A pure LRU policy does poorly on scans; LFU does better but performs worse than LRU on other workloads. The original ARC paper[0] has a good discussion of the tradeoffs.

[0] https://www.usenix.org/legacy/events/fast03/tech/full_papers...


Scan resistance and loop resistance are completely different. They are also distinguished in the papers. Note that ARC has scan resistance but no loop resistance.


None of the papers linked above or linked in the GitHub repo mention “loop resistance.”


Not completely different; they're both sequential access patterns and equivalent whenever the loop size exceeds cache size.


Not equivalent completely. Scan is a one-time access, so it never hits in scanning. Loop is multi-time accesses, so it can get many hits in looping if it has loop resistance. No hits on scan resistance in looping. Over.


I am disgusted by people who are repulsed by even the correction of a complete error. They must have bad grades in college. From now on, we have no choice but to leave it alone even if a false rumor is spread.


Sometimes referred to as tolerance. There is no established terminology, so the only way to find out is to use scan or loop as keywords.


So why did you say S3-FIFO has no loop resistance (or loop tolerance)?


There is an established benchmark for testing loop resistance (GLI, Loop). S3-FIFO has not passed this test. And I have already confirmed that S3-FIFO does not pass the test.


The design of S3-FIFO seems to imply it should handle all loops less than 90% of the cache size pretty well, perhaps even up to 100%.

Did you figure out why it did not pass the test? Was there a bug in the implementation, perhaps?

Also, where can I find the benchmark you speak of? My web search did not find anything.


My implementation is reviewed and linked from the official S3-FIFO page.

https://s3fifo.com

A benchmark is as follows.

https://github.com/ben-manes/caffeine/wiki/Efficiency

I can't deal with you any longer. Over.


/u/someplaceguy,

Those LIRS traces, along with many others, are available at this page [1]. I did a cursory review using their traces with Caffeine's and the author's simulators to avoid bias or a mistaken implementation. In their target workloads Caffeine was on par or better [2]. I have not seen anything novel in this or their previous works and find their claims to be easily disproven, so I have not implement this policy in Caffeine’s simulator yet.

[1]: https://github.com/ben-manes/caffeine/wiki/Simulator

[2]: https://github.com/1a1a11a/libCacheSim/discussions/20


Hi Ben, "I have not seen anything novel in this or their previous works and find their claims to be easily disproven" is a big statement.

We evaluated over 6000 workloads from 14 sources (Meta, Twitter, Microsoft, Wikipedia, VMWare, multiple CDNs, Tencent, Alibaba...), collected from 2007 to 2023. Most of the workloads we used are open-source and available to be verified [1]. If you want to claim that TinyLFU is better, you cannot just show it is better on one trace from 20 years ago. I wish this is not how you disprove a better algorithm.

Disclaim: this is the author of S3-FIFO. We have never claimed that S3-FIFO is the best on every trace. We observed that quick demotion[2] is important to achieve a low miss ratio in modern cache workloads, and existing algorithms such as TinyLFU and LIRS have lower miss ratios because of the small 1% window they use. This motivated us to design S3-FIFO, which uses simple FIFO queues to achieve low miss ratios. It is true that compared to state-of-the-art, S3-FIFO does not use any fancy techniques, but this does not mean it has bad performance.

In our large-scale evaluations, we found that the fancy techniques in LIRS, ARC, and TinyLFU can sometimes increase the miss ratio. But simple FIFO queues are more robust. However, *it is not true that S3-FIFO is better on every trace*.

* Note that some of the S3-FIFO results in Otter's repo are not updated and have an implementation bug, and we are working with the owner to update them.

[1] https://github.com/Thesys-lab/sosp23-s3fifo?tab=readme-ov-fi... [2] https://dl.acm.org/doi/10.1145/3593856.3595887


Hmm, I'd like to know more about the bugs and not updated results. It seems that all the bugs that were there long ago have been fixed (and even with some improvements). And the results show the latest version's performance. Especially since S3-FIFO shows very good results, in fact, inferior only to the adaptive version of W-TinyLFU on S3 and DS1 traces.

And about lock-free queues I don't agree, what they do with the implementation can be seen on the example of theine Reads 100% graph in the otter repository, if it wasn't there, it would be equal in speed to ristretto. And the BP-Wrapper technique actually has a huge engineering plus: it allows you to focus on optimizing different components of the system individually and easily make changes to the eviction policy that lock-free queues can't give to the developer.


Hi Alexey, I was referring to the bug you fixed two weeks ago. This comment was for the weird results that Ben's cited. I apologize here because I did not check the commit carefully, and I thought you had not updated the results (because the results on Zipf are worse than I expected). Thank you for clarifying this!

Can you clarify what you disagree about on lock-free queues? Do you mean that S3-FIFO cannot be implemented without locks?


In fact, lock-free queues have several problems at once, which prompted me to give up on them almost immediately.

1. Yes, S3-FIFO can be implemented using lock-free queues, but the problem is that each write to a filled cache using this design will cause a large number of additional atomic operations not friendly to the processor's cache, while bp-wrapper on the contrary amortizes this load. And reading with frequency update on hot entries can have a bad effect on performance. In many ways this is exactly what the last posts in my discussion with Ben are about (not really about this, but the current problem with otter read speed is caused by a similar problem). https://github.com/Yiling-J/theine-go/issues/29#issuecomment...

2. But the main problem for me is not even that. Lock-free queues work fine as long as you only need to support Get and Set operations, but as soon as you want to add features to your cache, the complexity of the implementation starts to increase, and some features are very hard to add to such a structure. Also, improving the eviction policy is under a big question mark, because not only do you have to think about how to improve the eviction policy, but also how to avoid locks while doing so or how not to slow down the implementation with your improvements. BP-Wrapper has no such problems at all, allows you to use any eviction policy and focus on improving different parts of your cache independently of each other.


I like the discussions with you. I have a few comments and questions.

1. why does it require multiple atomic operations at each insert? Our implementation in cachelib only used one CAS operation. I will get back to you with an illustration. Maybe you can show me something I missed.

2. I like the bp-wrapper idea. It is intuitive and can be used with most/all algorithms. Batching does reduce/amortize the overhead of locking for LRU-based algorithms. But IIRC, all the promotions are still serialized. Because the promotions are still protected by the lock, and only one core can promote objects at the same time. Caffeine achieves amazing scalability because it drops all the elements that need to be updated/promoted under high contention, which means it effectively becomes a FIFO. I don't think this is the correct way to claim good scalability.

3. "reading with frequency update on hot entries can have a bad effect on performance" I don't understand this. The frequency of hot entries should quickly reach 1 (if using a one-bit counter) or 3 (if using a two-bit counter) and will not be updated anymore. Why would read and update conflict?

4. Yes, implementing get and set with locking is trivial; adding support for delete requires a bit of work (10s lines) but should be manageable. But why would lock-free queues make other features hard to implement? Can you give an example?


1. Here I don't understand how you can do with one cas on a filled cache. You need to evict items on every insert, reinsert items into the main queue, etc., and if you add support for item weights, the number of xadd and cas operations will be even greater.

2. It's a bit non-obvious here. Caffeine does not lose a single insert, update, delete operation and moreover, keeps track of their correct execution with fantastic scrupulosity. Also the loss of some reads is inherent in many eviction policies. And caffeine's lossy buffers also have a great ability: the number of these buffers is automatically adjusted at runtime based on contention, which minimizes losses.

3. I specifically mentioned here that it's not the same, but it is possible to design a load that will force as many xadd operations as possible and as few load operations as possible. But in principle you can ignore this point, as it is rather artificial.

4. Here I rather mean that all features should live in the lock-free model or somehow be separately integrated into the architecture (which only complicates things). Plus, for example, a person wants to modify the eviction policy a bit, for example, by adding the lfu part to it and then it's not clear how to do it.


1. I wrote a bit on how to implement lock-free FIFO queues here https://blog.jasony.me/system/cache/2023/12/28/fifo, let me know if any part is not clear or not correct. Moving an object from the small queue to the main queue can be viewed as two operations: evicting the item from the small queue and inserting it into the main queue.

2. I am not criticizing the design decision in Caffeine. It is good because most people just don't need crazy scalability (>16 threads) or do not run it at such high throughput. I understand that critical operations are not lost. But the promotions are lost under contention, which makes comparison unfair. S3-FIFO is still the same algorithm under contention, but The W-TinyLFU in Caffeine becomes FIFO under high contention. I do not think that it can claim a low miss ratio and good scalability at the same time. The relationship is rather a low miss ratio OR good scalability. But as I mentioned, it does not matter for most use cases, and the design is elegant.

3. skipped

4. It is true that if one wants to port a lock-based new algorithm, it would require extra effort. But I guess this is a decision you have to make: only support lock-free eviction algorithms (FIFO-based) or support more algorithms with locks. But I do want to mention that S3-FIFO is not the only lock-free algorithm and there are and will be more.


I did a comparison on your data sets such as DS1, and S3-FIFO had very lower hit ratios than LRU in most cases. I think there are very limited situations where S3-FIFO is best.


You evaluated a single trace (from twenty years ago) and claimed that S3-FIFO is worse; why not show more traces? Most of the traces we used are open-source and can be found here.

https://github.com/Thesys-lab/sosp23-s3fifo?tab=readme-ov-fi...

Disclaimer: This is the author of S3-FIFO.


S3-FIFO has scan resistance. What do you mean by "loop resistance" and why do you say S3-FIFO doesn't have it?


Hi falsandru, I do not know where you can the feeling that S3-FIFO is not loop-resistant, can you elaborate more?

There is an adversarial pattern where each object is only requested twice, and the second request comes very late (out of the small queue). But such a pattern is very rare (if it ever exists at all), and other algorithms, such as TinyLFU, also suffer from this access pattern.

You claimed that you verified it, but there is no evidence given so far except on the single trace, which S3-FIFO is indeed worse. Frankly speaking, I don't understand why you get irritated by other users in this post.


I didn't know what the S3-FIFO algorithm was. https://s3fifo.com/ gives a great overview with some good visuals.


This entire time, I thought it had something to do with Amazon S3.


I wrote up this analysis of cache one-hits after reading the FIFO is all you need paper.

https://www.polyscale.ai/blog/one-hit-expectation


Sorry if I missed it, but does Otter do anything special to limit GC-impact when caching lots of references? That's the main selling point for bigcache and freecache.


No, and it's not planned (otter tries to avoid additional pressure on gc if possible, but that's not always possible). And I'm extremely frustrated that I had to include bigcache and fastcache in these benchmarks, since many people don't know the differences at all and just focus on performance and other metrics. Especially since bigcache and fastcache will only have an advantage when storing a huge number of items (well over 10 million). So I'll probably just add a P.S. in the README about it.


Yeah, they're fundamentally solving different problems. I was confused to see it benchmarking against them in the first place.

The number of items isn't the best metric alone, making it harder to demonstrate in a benchmark. I've reaped benefits from bigcache with ~600k-900k items — but there's a lot of eviction, each item is a pointer and can have a bunch of nested references itself. (protobufs)


Pretty much 2Q without the LRU.


I never really understood why anyone wants to maintain LRU properties in caches. It doesn't make sense today and it didn't make sense decades ago, either. You do need some rational basis for eviction, but ranking your hot set on every access was always just weird. I am glad this is being fixed in the literature lately.


What would you suggest as an alternative basis for eviction?


You can evict based on recency without maintaining LRU order at all times.


There are some difference, the most notable one is that 2Q evicts all objects from the small queue (does not move to the large LRU).


The issue is Go stdlib does not have parallel hash map.

We have https://github.com/puzpuzpuz/xsync#map a different Cache line hashmap impl.



The api doc there says why but not fully in depth:

> The Map type is optimized for two common use cases: (1) when the entry for a given key is only ever written once but read many times, as in caches that only grow, or (2) when multiple goroutines read, write, and overwrite entries for disjoint sets of keys. In these two cases, use of a Map may significantly reduce lock contention compared to a Go map paired with a separate Mutex or RWMutex.

sync#Map works by having two maps internally, one which receives writes and then periodically copies its values to the larger map and clears itself. This works on the two workloads mentioned above. It falls rather flat under write contention. This is by design. syncMap is really intended for maps that are mostly read only.



Hi, I made this repo. I'm here to drop my opinions as below.

[1]

- golang-fifo provides also SIEVE algorithm which is more effieicnt than S3-FIFO under web cache workloads(follows zipf's distribution)

[2]

- I added otter to my cache benchmark. It shows less efficiency than mine. I'm not sure why this happens.

See results here: https://github.com/scalalang2/go-cache-benchmark

[3]

- If the behavior of the two algorithms is the same, they should produce similar results. but they shows different results.

- Otter is faster with more concurrency, but misses more stuff (shows less hit/rate).


In fact, to say that sieve is better is fundamentally wrong.

1. You only test the hit ratio on a synthetic zipf distribution and claim an advantage without saying a word about the problems (which there are).

2. Checking implementation speed looks very questionable on such benchmarks, because you spend a huge amount of CPU time on synchronization and item generation.

3. Your implementation has at least three problems that otter doesn't: worst case O(n) time complexity of the eviction policy, disgusting scalability, also golang-fifo simply doesn't have as many features as otter, and adding each of them will only slow golang-fifo down even more.

4. Yes, when increasing contention otter sacrifices 1-2 percent (I'll have to see if I can reduce this loss) hit ratio to maintain response speed, but that's what caffeine, ristretto and other famous cache libraries do, because in this case it's more important to maintain scalability than to keep hundreds of goroutines waiting for a mutex to unlock.

5- This structure looks highly questionable to support ghost queue: https://github.com/scalalang2/golang-fifo/blob/main/s3fifo/b.... You are wasting a very large number of bytes just to maintain an item footprint.


Hello, Thank you for replying here :)

Many of answers you replied are reasonable and good.

---------

And I just want to add more comments for others.

1. SIEVE is not scan-resistant, so that, I think it should only be applied for web cache workloads (typlically follows power-law distribution)

2. SIEVE is somewhat scalalbe for read-intensive applications (e.g. blog, shop and etc), because it doesn't require to hold a lock on cahce hit.

3. The purpose of golang-fifo is to provide simple and efficient cache implementation (e.g. hashicorp-lru, groupcache).

-> I don't want to beat high-performant, bla-bla, specialized in-memory caches (bigcache, fastcache and etc).

-> Both S3-FIFO and SEIVE have O(n) worst-case cache eviction. and it only happens when all of objects in the cache are hit, which means, a perfect(100%) hit rate in sequences.

4. when increasing contention otter sacrifices 1-2 percent

-> I think that the statement is not enough. The hit rate varies depending on the total number of objects and the size of the cache, so it should be compared relatively. for example, otter's efficiency decreased by 5% compared to single-threaded when lock contention increased (decreased efficiency makes a mean network latency higher, because it may need to conduct heavy operation e.g. re-calculation, database access and so on)

-> Therefore, in certain situations, the efficiency of a cache library may be more important than its QPS. This is not a matter of which one is better, but rather that other people should be able to choose a cache library that suits the specific needs of their application.

5. ghost queue : honetly at that time of writing the code, I didn't deep dive into the bucket table implementation, it may not work same as actual bucket hash table (see here: https://github.com/scalalang2/golang-fifo/issues/16)

---

If anyone here want to raise issues or give contributions for golang-fifo, please visit here https://github.com/scalalang2/golang-fifo


1. That's not the problem, yes, most web traces aim for zipf distribution, but you can't say it will work well in real life just based on this check (this was also mentioned in the S3-FIFO discussion here). Because there are many ways to break this best case.

2. There is a small problem here, yes, on 100% reads workload this will show good results, but there is a catch here, you should add even one percent of writes and this cache degrades very badly.

3. This makes perfect sense! But the O(n) worst case is a bit scarier than you describe :(. There it is quite enough to have just a large enough sequence of elements in the main queue with a frequency greater than zero (100000 for example will suffice), in this case you will just reinsert all these elements with global locking, although hashicorp-lru or groupcache will not do this, but just move one element.

4. it's not quite true, losses depend on contention and nothing else, and in your benchmarks you make otter withstand 4 million qps and at such a load it lost only 2% (probably this value can be reduced a lot, I'll have to look at it at the holidays), which at such a load hardly matters much.

5. My point is that you store at least 2 * key size + 8*2 + 8 bytes just to store the fingerprint. And if you also count the pointers that are held for the sake of it, it becomes quite sad.


I was curious what function was used for hashing (how do you write a generic hash function in go?) and it's pretty disgusting.

https://github.com/dolthub/maphash/blob/main/runtime.go


Could you provide more context on what’s disgusting? I’m not a go programmer but this looks like it’s just a wrapper around a library called ‘maphash’ and doesn’t do any of the hashing here? This is more of a service wrapper around that lib so that it can have a consistent seed value etc.?


Constructing a map (an opaque type) just so you cast it to a carefully duplicated struct so you can steal the function pointer is kinda nutso.

Did you look at the linked code in maphash?


I wouldn't call it disgusting. It's just typical unsafe go code. The named return parameter in `getRuntimeHasher` is irritating, as all such parameters in go code are. If you're going to return values, do so explicitly in the return statement. In go code, seeing `return` at the end of a function does not mean that the function doesn't return any values, and that can lead to harder-to-read code.


> It's just typical unsafe go code.

But... just to hash something?


Go has a lot of lower level internal implementations of things that unsafe code like this serves as a thin shim for. It’s an almost zero-cost abstraction to use such a builtin.

I’m not saying it’s good: it just is what it is. It’s silly: but so is most of go. I’ve worked with this language for several years now: I don’t like it at all. It presents a facade of simplicity, but has all sorts of hidden issues that affect performance and behavior in surprising ways.


I think we're all saying the same thing then.


I suppose the use of unsafe pointers liberally is disgusting to some.


Disgusting is a word, but I'm guessing you'd say the same about most unsupported feature code (aka unsafe). You can make code like this reliable in the ways it needs to be while still being "unsafe". That's kind of table stakes when you start dabbling in unsupported things.


And what does this code do to make itself reliable in the event the go authors change the internal structure of a map?


Nothing. That’s the trade off. What is not clear about “unsafe” in that context?


The part about making it reliable in the ways it needs to be.



That's annoying, but at least reliable. Thank you.


You may want to vet this as a third-party dependency due to supply chain attack risks.


I always thought Go as a GC'ed language is unsuitable as a caching instrument?


I never understood this cost nonsense. TTL is the only thing that makes any sense to me. And implementation is trivial. All these caches are just a map underneath anyway, pointing to some data somewhere in memory. That's about it. Not much else to it.


I don’t have any practical experience but if I were to guess it’s relatively easy for a developer to estimate a cost of retrieval in terms of CPU/time or something, which can let the cache make smarter predictions about eviction depending on access patterns. Typically, a longer chained primary db call could be higher cost than say a quick lookup in a local kv store. That said, perhaps those should simply be different caches instead.

TTL doesn’t say anything about eviction among objects that are still live, no? For that you need an eviction policy.


> TTL doesn’t say anything about eviction among objects that are still live, no?

TTL IS eviction policy. You just store a timestamp along the value. When you access it, you check the timestamp and if it is stale, you delete it and return not found error(or whatever not found should return). And you have a simple bg worker that scans the cache periodically and deletes these stale entries to free up the memory.

This is the basis for all caches. You can add cost on top of this as yet another eviction policy but that only adds another worker to scan the cache for items to purge and you have to store the cost along with the entry so you are increasing the each entry by at least 4 bytes, if we're talking uint32. Not to mention that for this eviction policy, you need sorted cache because you need to know where the cut off for cost is.

Walking the entire cache to purge expired TTL entries is one thing, but keeping ordered cache is a whole another thing.


Sorry. TTL is not an eviction policy, as the others pointed out, you need an eviction algorithm to decide which objects to keep in the cache when it is full. Your mindset is about a key-value store, not kv cache, which is always full. I agree that TTL is useful, but it is NOT a replacement for the eviction algorithm. Moreover, when you have a huge multi-tenanted cache, different engineers set TTLs in different ways, some use TTLs to capture data lifetime, some use TTLs to guard around staleness, some use TTLs for legal compliance. Therefore, TTLs do not indicate eviction priority.


What I mean is that when cache is full you need an eviction policy, like LRU for instance. If everything fits in memory there’s not much to discuss, yes then TTL with periodic scan is reasonable, yes.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: