If you wanted to scale this to multiple proxies without each one holding a duplicate cache, take a consistent hash of the URL and proxy the request to its owner.
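A minimal sketch of that routing, assuming a hash ring over proxy names (the proxy names, vnode count, and helper names here are illustrative, not from any particular proxy implementation):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable across processes, unlike Python's built-in hash().
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps URLs to the proxy that 'owns' them; adding or removing a
    proxy only remaps roughly 1/N of the keys."""

    def __init__(self, proxies, vnodes=100):
        # Virtual nodes smooth out the key distribution per proxy.
        self._ring = sorted(
            (_hash(f"{p}#{i}"), p)
            for p in proxies
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def owner(self, url: str) -> str:
        # First ring position at or after the URL's hash, wrapping around.
        idx = bisect.bisect(self._keys, _hash(url)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["proxy-a", "proxy-b", "proxy-c"])
print(ring.owner("https://example.com/video/123.mp4"))
```

Any proxy that receives a request can compute the same owner and forward to it, so there is no coordination needed beyond agreeing on the proxy list.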
We used to do this to cache the first chunk of videos. Encoding with `-movflags faststart` typically ensured that the moov atom was cached at an edge, which dramatically decreased the wait before video playback started (while still supporting arbitrary seeking within large/long video files).
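The reason this works: faststart moves the moov atom (the index the player needs before it can start) to the front of the file, so the edge only has to hold the first range of bytes. A small sketch of checking that property by walking top-level MP4 boxes (a simplified parser, not a full MP4 reader):

```python
import struct

def top_level_boxes(data: bytes):
    """Yield (box_type, size) for each top-level MP4 box in `data`."""
    pos = 0
    while pos + 8 <= len(data):
        # Each box starts with a 4-byte big-endian size and 4-byte type.
        size, = struct.unpack(">I", data[pos:pos + 4])
        box_type = data[pos + 4:pos + 8].decode("ascii", "replace")
        if size < 8:  # extended/zero sizes: out of scope for this sketch
            break
        yield box_type, size
        pos += size

def moov_is_first(head: bytes) -> bool:
    # With -movflags faststart, moov follows ftyp at the front,
    # so the first cached chunk is enough to begin playback.
    types = [t for t, _ in top_level_boxes(head)]
    return "moov" in types[:2]
```

Without faststart, moov sits after the (huge) mdat box at the end of the file, and the player has to fetch the tail before it can start.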
CDNs aren’t intended as a “canonical store”; content can be invalidated from a CDN’s caches at any time, for any reason (e.g. because the CDN replaced one of their disk nodes), and the CDN expects to be able to re-fetch it from the origin. You need to maintain the canonical store yourself — usually in the form of an object store. (Also, because CDNs try to be nearly-stateless, they don’t tend to be built with an architecture capable of fetching one “primary” copy of your canonical-store data and then mirroring it from there; but rather they usually have each CDN node fetch its own copy directly from your origin. That can be expensive for you, if this data is being computed each time it’s fetched!)
Your own HTTP reverse-proxy caching scheme, meanwhile, can be made durable, such that the cache is guaranteed to only re-fetch at explicit controlled intervals. In that sense, it can be the “canonical store”, replacing an object store — at least for the type of data that “expires.”
This provides a very nice pipeline: you can write “reporting” code in your backend, exposed on a regular HTTP route, that does some very expensive computations and then just streams them out as an HTTP response; and then you can put your HTTP reverse-proxy cache in front of that route. As long as the cache is durable, and the caching headers are set correctly, you’ll only actually have the reporting endpoint on the backend re-requested when the previous report expires; so you’ll never do a “redundant” re-computation. And yet you don’t need to write a single scrap of rate-limiting code in the backend itself to protect that endpoint from being used to DDoS your system. It’s inherently protected by the caching.
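A minimal sketch of such a reporting endpoint as a plain WSGI handler (the route, TTL, and "expensive computation" are illustrative): the only protection it needs is a correct Cache-Control header, which the reverse-proxy cache in front honors.

```python
import json

REPORT_TTL = 3600  # seconds; the report is recomputed at most once per hour

def expensive_report() -> dict:
    # Stand-in for heavy database / upstream-API work.
    return {"total": sum(range(1_000_000))}

def report_app(environ, start_response):
    """Stateless WSGI handler: compute, set caching headers, stream out.
    The reverse proxy in front absorbs all repeat traffic."""
    body = json.dumps(expensive_report()).encode()
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        # s-maxage targets shared caches (the reverse proxy),
        # independent of what end-user browsers do.
        ("Cache-Control", f"public, s-maxage={REPORT_TTL}"),
    ])
    return [body]
```

The backend stays a pure request/response function; the schedule ("no more than once per hour") lives entirely in the cache header.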
You get essentially the same semantics as if the backend itself was a worker running a scheduler that triggered the expensive computation and then pushed the result into an object store, which is then fronted by a CDN; but your backend doesn’t need to know anything about scheduling, or object stores, or any of that. It can be completely stateless, just doing some database/upstream-API queries in response to an HTTP request, building a response, and streaming it. It can be a Lambda, or a single non-framework PHP file, or whatever-you-like.
> Also, because CDNs try to be nearly-stateless, they don’t tend to be built with an architecture capable of fetching one “primary” copy of your canonical-store data and then mirroring it from there.
True, though some CDNs let you replicate data up to 25MB in size globally; others support bigger sizes depending on your spend with them.
> And yet you don’t need to write a single scrap of rate-limiting code in the backend itself to protect that endpoint from being used to DDoS your system. It’s inherently protected by the caching.
Systems in steady state aren't what cause extended outages; the recovery phase can also DDoS your systems. For example, when a cache goes cold, the resulting thundering herd can overwhelm an underscaled datastore [0].
> You get essentially the same semantics as if the backend itself was a worker running a scheduler that triggered the expensive computation and then pushed the result into an object store, which is then fronted by a CDN.
Agree, but running one's own high-availability infrastructure is hard for a small team. In some applications, even when scale isn't important, availability almost always is.
> It can be a Lambda, or a single non-framework PHP file, or whatever-you-like.
Agree, a serverless function a CDN runs (Lambda@Edge, Workers, StackPath EdgeEngine etc) would accomplish a similar feat; and that's what we do for our data workloads that front S3 through Cloudflare Workers.
This comment touches on a lot of topics that I'm interested in learning more about but don't have the vocabulary to google or find resources about. Any chance you could point me at somewhere I could read about these patterns in more depth? Handling expensive computations and preventing re-computation on the backend is something I don't know how to solve effectively, because there are several different approaches I can think of (an in-memory cache on the backend, persisting the result and returning it the next time it's asked for, caching the response in front of the service, etc.)
Because we invalidate some SSDB keys in various cases, for example: old code stats data arriving late, renaming a project, etc.
HTTP proxies with ETags seem like a good public-facing cache option, but our caching is internal to our DigitalOcean servers. Here's more info on our infra, which might help with our decision:
Consider using opaque symbols for your object/cache identifiers instead of using meaningful literals. That way, you can simply update the identifier mapping (Project A cache prefix moves from ajsdf09yu8sbvoihjasdg -> klajnsdg9fasf8avby), and the old values will naturally be evicted from the cache as they become less frequently accessed.
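A sketch of that indirection, assuming a small prefix-mapping table (the names and prefixes are illustrative; in practice the map lives somewhere durable): rotating the opaque prefix orphans the old entries, which then age out of an LRU cache on their own.

```python
import secrets

# project -> opaque cache prefix (would be stored durably in practice)
prefix_map = {"project-a": "ajsdf09yu8sbvoihjasdg"}

def cache_key(project: str, item: str) -> str:
    # Callers never see or hard-code the opaque prefix.
    return f"{prefix_map[project]}:{item}"

def rotate_prefix(project: str) -> None:
    """'Invalidate' everything under a project by pointing it at a new
    prefix; old keys are never touched and simply fall out of the LRU."""
    prefix_map[project] = secrets.token_urlsafe(16)

old = cache_key("project-a", "stats")
rotate_prefix("project-a")
new = cache_key("project-a", "stats")
assert old != new  # all lookups now miss the stale entries
```

The trade-off is one extra lookup per cache access and a window where both generations of keys occupy space until the old ones are evicted.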
In my experience, having to perform cache invalidation is usually a sign of design immaturity. Senior engineers have been through this trial before :-)
nginx has proxy_cache_purge (1), but I agree that nginx doesn't expose a particularly good API around its cache. Plus, the awkwardness of doing disk IO from OpenResty (unless this has been improved), is a drag.
Maybe this is something envoy or the newer breeds do better.
Definitely a strong use case for a caching proxy like Varnish or Nginx. The post describes using Redis as an application cache which means your application needs custom logic. A cache lookup before going to S3, and then it needs to write back out to cache when done. Simple, but possibly a lot more work than necessary. A caching proxy does away with this, you make a single call to the proxy, which behaves identically to S3, and it does all the work for you. No application changes needed other than changing the S3 endpoint address. Also, much greater performance because you are potentially cutting out multiple costly round trips.
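The application-cache pattern the post describes boils down to cache-aside. A minimal sketch with in-memory stand-ins for Redis/SSDB and S3 (the dicts and counter are illustrative stand-ins, not real clients):

```python
cache = {}                                              # stand-in for Redis/SSDB
object_store = {"reports/2021.json": b'{"total": 42}'}  # stand-in for S3

fetches = {"s3": 0}  # counts the costly round trips

def get_object(key: str) -> bytes:
    """Cache-aside: check the cache, fall back to the object store,
    then write the result back into the cache for next time."""
    if key in cache:
        return cache[key]
    fetches["s3"] += 1
    data = object_store[key]  # the expensive remote call
    cache[key] = data
    return data

get_object("reports/2021.json")
get_object("reports/2021.json")
assert fetches["s3"] == 1  # second call is served from the cache
```

A caching proxy moves all of this logic out of the application: the app makes one plain GET, and hit/miss/write-back happen transparently in the proxy.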
ETags are not cache invalidation. ETags are a way for a client to validate the cache in the client, while cache invalidation for a http proxy would involve telling the proxy that it should discard a cached file.
---
Basically:
cache validation: Check if a file matches a cached version
cache invalidation: Preemptively tell a cache to discard a cached file
---
For example:
Client A requests file X. Proxy caches file X.
File X changes in the upstream.
Client B requests file X. Proxy does not know the file changed upstream, so it sends a stale version.
---
In this case either the proxy needs to revalidate each request with upstream (which is expensive) or we need some way to tell the proxy that it should discard its cache for file X since it has changed.
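The distinction can be sketched with a toy proxy (everything here is illustrative): revalidation asks upstream "is my copy still good?" per request, while invalidation is a purge pushed to the proxy.

```python
class ToyProxy:
    def __init__(self, upstream):
        self.upstream = upstream  # {path: (etag, body)}
        self.cache = {}           # {path: (etag, body)}

    def get(self, path: str) -> bytes:
        if path in self.cache:
            return self.cache[path][1]  # may be stale!
        self.cache[path] = self.upstream[path]
        return self.cache[path][1]

    def revalidate(self, path: str) -> bytes:
        """Cache *validation*: like a conditional GET with If-None-Match."""
        etag, _ = self.cache.get(path, (None, None))
        up_etag, up_body = self.upstream[path]
        if etag != up_etag:  # upstream says the file changed
            self.cache[path] = (up_etag, up_body)
        return self.cache[path][1]

    def purge(self, path: str) -> None:
        """Cache *invalidation*: upstream tells the proxy to forget the file."""
        self.cache.pop(path, None)

upstream = {"/x": ("v1", b"old")}
proxy = ToyProxy(upstream)
proxy.get("/x")                          # cached as v1
upstream["/x"] = ("v2", b"new")          # file changes upstream
assert proxy.get("/x") == b"old"         # stale: neither mechanism ran
assert proxy.revalidate("/x") == b"new"  # validation fixes it, per request
```

Validation costs an upstream round trip on every request; invalidation is free on the read path but requires the upstream to know who to notify.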
I think I'll have to look into this to see if we can get away with a document/blob cache in one of our solutions actually. I've resisted adding Varnish into the mix, but it might be time to reconsider.
You should really just foot the bill for AWS. This is going to be technical debt for you later on that will need to be paid. As a software/systems engineer I'll walk away from anywhere doing stuff like this to save a few dollars. I've worked a lot of janky places and now I work somewhere that goes the extra mile and puts that little bit extra in to do things right the first time. I know the on-going costs of doing stuff like this as well as the fact that you will have to clean it up at some point.
EDIT: Not necessarily YOU will have to clean it up at some point, but the next guy most likely will.
It's just me, nobody else is working on WakaTime, so it's ok. Also, S3 scales beautifully and has been the best database decision ever. Built-in reliability and redundancy.
SSDB has also been wonderful. It's just as easy and powerful as Redis, without the RAM limitation. I've had zero issues with SSDB regarding maintenance and reliability, as long as you increase your ulimit [1] and run ssdb-cli compact periodically [2].
I've tried dual-writing in production to many databases including Cassandra, RethinkDB, CockroachDB, TimescaleDB, and more. So far, this setup IS the right solution for this problem.
[2] ssdb-cli compact runs garbage collection and needs to be run every day/week/month depending on your writes load, or you'll eventually run out of disk space. Check the blog post for my crontab automating ssdb-cli compact.
I don't think there's anything wrong with this. The correct answer is not always "outsource to AWS"; dependence on AWS (and any external dependency, especially those you have to pay for) is also tech debt. Everything you do is tech debt, and since this is just you, building it yourself is not only respectable (here, have an upvote!) but could also perhaps be something you end up doing 2-3 years down the line, and you've just gotten ahead of yourself.
For many startups, that code that you mentioned as technical debt will be deleted en masse as the business pivots and adapts.
For startups, moving fast and getting 80% done in 20% of the time is the correct choice. It's frustrating to look at the result as a software developer, but the engineer is there to serve the business, which pays the salary. As a software engineer, I very much disliked working in engineering teams that were engineering for the sake of engineering. It quickly devolves into meaningless arguments and drama.
I'm in the process of purging data off S3. I suspect it will be about two weeks of work, and I calculate it will save me roughly $250k over the "lifetime" of my product.
What I think too few people understand about s3 is that costs on S3 compound. Every month you not only pay for data you stored in the current month but you also pay for the data stored in every month previously.
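A quick illustration of the compounding (the upload rate is illustrative; the price is roughly S3's public standard-tier storage rate):

```python
gb_per_month = 100          # new data added each month
price_per_gb_month = 0.023  # USD; approx. S3 standard storage pricing

total_stored = 0
bill = []
for month in range(1, 13):
    total_stored += gb_per_month
    # Each month you pay for everything stored so far, not just new data.
    bill.append(total_stored * price_per_gb_month)

# Month 1 bills for 100 GB; month 12 bills for 1,200 GB.
print(f"month 1: ${bill[0]:.2f}, month 12: ${bill[-1]:.2f}")
```

A flat upload rate therefore produces a linearly growing monthly bill, and the cumulative cost grows quadratically until you start deleting.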
S3 is a de facto standard supported by multiple vendors and open source technology. You can pick lots of other cloud vendors and use S3 there, or you can self-host S3. It's one of many AWS technologies that is not vendor-locked.
Yep, for ex: we use the exact same code to get/put files to DigitalOcean Spaces... the only thing that changes is a config value for the S3 endpoint. S3 is basically an open standard. The only vendor lock-in would be the time/money it takes to copy your data out of S3, but that's the same for anywhere you store your data... even when self-hosting.
At around, say, 10k requests/sec, by far the dominant cost is the S3 API cost for GET requests at $0.0004 per 1,000 calls. A cache in front saves on the order of $10k/month in this scenario, and if scaling is expected, then planning ahead like this is a good idea.
Wish they would have given an idea of how much this helped improve their AWS bill. Hard to judge the value of this implementation without knowing the $$$ saved.
How is $200/mo savings worth spending more than like 10 seconds of engineering time on? If you have to support the solution at all you wipe out all the savings almost instantly.
We're a one-man shop, so the cost savings made sense in the long run. I already use SSDB and Redis in production so I'm very familiar with supporting it.
Generally good thinking, but a few "real world" notes:
A) you first need to run the service to get any customers
B) this might take long
C) you maybe don't want / can't get VC money at this stage
D) you maybe are not the most advanced dev who can properly utilize S3 from an I/O perspective, leaving you with higher costs than necessary
E) there might be a time period between introducing the service and getting traction which yields enough feedback, so you can start adding more business features
F) when you are burning your own money, you are more sensitive to the cost side - which is not ultimately wrong
No, pengaru's right. They're not mutually exclusive. For ex: making the site faster and more stable means I spend less time fixing and keeping things running and more time building new features.
Nah. As someone who has gone from a 1 person project to ~10 people and funding, taking the time to tackle problems like this is critical - it frees up an order of magnitude of time later, helps make the product experience smoother, saves you real money that, when you're one person, really can make a difference.
Which could have been gained by.... hosting on AWS.
Which ends up being back to purely cost-optimizing the initial cost optimization.
It seems like the whole product is focusing on cost optimization. I suspect that the OP would be better off making money by doing cost optimizations for third parties.
The economics don't work out on that... we migrated away from our $3k/mo AWS bill back in the day to save $2k/mo with DO. Going back to EC2 would save $200/mo but cost $2,000/mo in compute.
This is the problem with a lot of the dev ecosystem. Many of the tool-makers don't understand the concept of a one-man shop, so scale to/from zero is never given the first class priority it deserves
You can look at this as time wasted on premature optimization. That time could have been spent on building valuable functionality, that would actually grow the business.
How much experience bootstrapping a product do you have? It's not always a choice between spending time optimizing vs. new features. Sometimes you have time for both. Sometimes optimization means a more stable product that frees up your time for the new features.
I started a few projects from scratch... with stricter financial restrictions as well.
Splitting up the storage and execution between cloud providers was clearly an early cost optimization.
Then this cost optimization was needed to fix the result of the previous cost optimization.
Adding more code that has no clear end value to your service* never makes anything more stable. Now OP has to maintain three things, instead of two. It's a classic "look at how smart I am" overengineering.
*OP's customers definitely give 0 f's if everything is on AWS or split between AWS and DO.
I have multiple small side projects running, and cannot afford to have an expense of more than a few dollars each per month. It’s a fun project to try and reduce costs as long as your priorities are straight.
I remember seeing your comment on that previous post. I enjoy seeing those sorts of "behind the curtain" details that break down the stack & cost of applications.
I'm curious if you've tested what it would cost to just host on EC2 (or something potentially even cheaper in AWS like ECS Fargate) with a savings plan. At a glance, it looks like AWS would be cheaper than DO if you can commit to 1-yr reserved instances.
That would seem like an easier (and possibly more effective) way to get around costly AWS outbound data costs compared to running a separate cache and sending data across the internet between cloud providers just to save $200/mo.
Back in the day I did use EC2, but to get the same performance (especially SSD IOPs) it cost a lot more than DigitalOcean. Back then, the monthly bill for EC2 was over $3,000/mo. Switching to DigitalOcean we got better performance for under $1k/mo.
Hadn't heard of MinIO. We looked into another S3 caching solution that someone linked in comments here, but it came down to we already used SSDB in production. Also, a personal preference of mine is to keep all logic in the same layer: Python.
For ex: We use SQLAlchemy as the source of truth for our relational database schema, and we use Alembic to manage database schema changes in Python. We even shard our Postgres tables in Python [1]. It fits with my preference to keep this caching logic also in Python, by reading/writing directly to the cache and S3 from the Python app and workers.
Very cool! Thanks for sharing, I hadn't seen this before. Edit: Actually I looked at this before using SSDB and it was overly complicated for my use case.
We went with SSDB because we already used Redis and SSDB, were very familiar with them, and had already worked out any pitfalls with using them. Wish we had found this sooner, thanks!
Hadn't heard of SSDB before, looks promising, active, variety of users, broad client support.
Off the top of my head, if I were trying to front AWS S3 from DigitalOcean, I'd have gone with MinIO and their AWS S3 gateway[1]. It appears to be purpose built for this exact kind of problem.
You might face a language barrier when trying to find answers from the community: For example it looks like most of the discussion on the issues and pull requests are in Chinese. I do not believe Google Translate does well with technical terms.
(Of course this is a concern only if you don't read Chinese.)
Last I checked, DigitalOcean has a "Spaces" feature for file storage, which is compatible with the AWS S3 API (and they even suggest using the AWS client library).
I always found it odd that we can easily port apps, databases, etc., from one cloud to another, but for file storage/CDN it's always some proprietary solution like S3. AFAIK open source solutions never really took off.
Just use a CDN like CloudFront or Cloudflare. S3 is for storage, not retrieval. Building your own poorly distributed, not-at-edge pseudo-CDN won't help as much as using a real edge CDN. Edge CDNs are closer.
Aren’t both of the problems posed solvable with stock Redis? Using an append-only file for persistence and enabling virtual memory support so the dataset can grow beyond RAM size.
What do you mean by deprecated? As far as I know, Swap partition/files are alive and kicking in Linux. There are even recommendations of the amount of swap you should setup depending on the available RAM.
How would SSDB compare to using etcd? They both are disk backed key-value stores, but SSDB doesn't seem to be under active maintenance, with most of the recent merges being about fixing build issues.
The main difference is you can use existing code/libraries without changing anything if you're already using Redis. That was a big win for us, since we could just swap out the Redis instances that suffered from low hit rates (because the data didn't fit in RAM and was purged) for SSDB easily.
This article seems to take it out on Redis, as if Redis poorly supports this use case, but Redis was never the right choice for a cheap, vertically scalable k-v store. I didn't know SSDB, but really you could have chosen among many other k-v stores, or even memcached with disk support.
I guess the point is that someone chose Redis and soon realized that RAM was not going to be the cheapest store for 500GB of data. But that is not Redis' fault.
There was an analysis done on the Redis API in Scylla back in 2019 (https://siddharthc.medium.com/redis-on-nvme-with-scylladb-5e...). The implementation, while still experimental, has come a long way since. I would be very interested to hear additional feedback from Redis users.
Not getting how this helps. How do they know if the data is stale in SSDB? If the data is immutable I get that caching it locally speeds things up, but then couldn't they merely mirror files and check if the file exists?
The majority of code stats (the data being stored) is received for the current day. Older data still comes in, but in that case we just update the SSDB cache after writing the new data to S3.
The reason we don't use SSDB as the main source of truth is because S3 provides replication and resilience.
Sure, I meant a mirror is just a cache for data that doesn't change. I edited the original comment, because it doesn't help explain why we used SSDB here.
Hm. If a disk-based redis clone is actually more performant than redis... what are the reasons to use redis instead of SSDB? Why is anyone using redis instead of SSDB? Should I stop?
Because files are hard, a lot of projects use an embedded storage library called LevelDB/RocksDB [1]. They interact with the library instead of directly with files. CockroachDB (RocksDB) and SSDB (LevelDB) are examples of projects built on them.
Minio is not a caching layer, it's pretty much spinning up your own object storage. I have used it and SSDB, and SSDB is a better drop-in replacement for Redis. Minio is a drop-in for S3 if you don't want to pay for S3 and aren't serving a large amount of data.
Yup, nothing will match the redundancy and reliability of S3. If you use Minio and have a hard drive issue, you now need to restore. If you really want to cut down on S3 costs some more, then you might want to use Wasabi or Backblaze with Minio as the S3 layer in front. If you're profitable though, the peace of mind of S3 is worth more than trying to save another $100.
Yep, it's true SSDB is a hidden gem that deserves more exposure. LevelDB/RocksDB has been around for a while with benchmarks. Here's a related reply about it from earlier https://news.ycombinator.com/item?id=26957339
Since they are using Redis as a simple key-value store, I wonder if they looked into something like LMDB which is disk-backed and has even better performance?
I don't know Redis' design, but all sorts of things can slow down a program's reads from memory: CPU cache misses, mutexes, garbage collection, CPU utilization, or something else. Whereas the page cache and disk cache can be very efficient, with nothing but more reads/writes to get in your way. Or maybe the program reading from memory is single-threaded, and the one reading from disk isn't.
The magic of Google's LevelDB? Seriously though, I don't know and I'm not sure if it's an accurate benchmark. I just know our statsd metrics show it performing just as fast as official Redis on much larger data sets, so that benchmark is probably correct.
You are decreasing your bill at the cost of your reliability: you are moving from the reliability of S3 to the single-node reliability of your reverse proxy.
Maybe, if you have the technological knowledge, start your own block storage using Rook/Ceph object storage on DO. This will reduce your bill even further, and if you know what you are doing, you can improve the reliability.
No, the SSDB cache can go offline and the app just reads from S3 directly. I appreciate everyone trying to suggest better ways to solve this, but know that I've gone through many solutions dual-writing in production, comparing latency and throughput on production loads, and this is the best so far.
That's kind of what we're doing, until we start using SSDB Clustering. Filesystem as cache would be good for a local same-server cache, but it doesn't scale well when used over the network.
I want to hate AWS as much as the next guy, but this is nearly the dumbest thing I've seen trending lately. They are paying money for something that is FREE to save a dime.
AWS S3 -> AWS EC2 bandwidth is FREE. To pay money to send the data to DO EC2 and then build a Redis cache to save a few dimes on EC2 that the D.O. Redis cluster now spends is...
Can you please not post shallow dismissals and put other people's work down like this? Putdowns tend to get upvoted, and then they sit at the top of threads, letting off toxic fumes and making this a nasty place. We're trying to avoid that here. (I've downweighted this subthread now.)
I'm not sure why name-calling and meanness attract upvotes the way they do—it seems to be an unfortunate bug in how upvoting systems interact with the brain—but HN members need to realize that posts like this exert a strong conditioning effect on the community.
It's always possible to rephrase a comment like this as, for example, a curious question—you just have to remember that maybe you're not 100% aware of every consideration that went into someone's work.
Totally get that. The journey here wasn't straightforward. I've tried dual-writing in production to many databases including Cassandra, RethinkDB, CockroachDB, TimescaleDB, and more. Haven't tried ClickHouse/VictoriaMetrics, but probably won't now because S3 scales beautifully. The main reason not using EC2 is compute and attached SSD IOPs costs. This balance of DO compute and AWS S3 is the best combination so far.