Most production storage systems/databases built on top of S3 spend a significant amount of effort building an SSD/memory caching tier (e.g. on top of RocksDB) to make them performant enough for production. But it's not easy to keep it in sync with blob...
Even with the cache, the cold query latency lower-bound to S3 is subject to ~50ms roundtrips [0]. To build a performant system, you have to tightly control roundtrips. S3 Express changes that equation dramatically, as S3 Express approaches HDD random read speeds (single-digit ms), so we can build production systems that don't need an SSD cache—just the zero-copy, deserialized in-memory cache.
Many systems will probably continue to have an SSD cache (~100 us random reads), but now MVPs can be built without it, and cold query latency goes down dramatically. That's a big deal.
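Concretely, the read path looks roughly like this (a hedged sketch: bucket/key names are made up, and it assumes a boto3 version that can talk to S3 Express directory buckets):

    import boto3

    s3 = boto3.client("s3")
    _cache = {}  # zero-copy, deserialized objects would live here in practice

    def get_block(key):
        if key in _cache:                        # warm path: pure memory
            return _cache[key]
        resp = s3.get_object(                    # cold path: single-digit-ms GET
            Bucket="my-index--usw2-az1--x-s3",   # hypothetical directory bucket
            Key=key,
        )
        data = resp["Body"].read()
        _cache[key] = data
        return data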
We're currently building a vector database on top of object storage, so this is extremely timely for us... I hope GCS ships this ASAP. [1]
We built HopsFS-S3 [0] for exactly this problem, and have been running it as part of Hopsworks for a number of years now. It's a network-aware, write-through cache for S3 with an HDFS API. Metadata operations are performed on HopsFS, so you don't have the other problems like list operations returning a max of 1000 files/dirs.
NVMe is what is changing the equation, not SSD. NVMe disks now do up to 8 GB/s, although the crap the cloud providers offer barely goes to 2 GB/s - and only for expensive instances. So, instead of 40X better throughput than S3, we can get something like 10X. Right now, these workloads are much better on-premises on the cheapest M.2 NVMe disks ($200 for 4TB with 4 GB/s read/write) backed by an S3 object store like Scality.
The numbers you're giving are throughput (bytes/sec), not latency.
The comment you're replying to is mostly talking about latency - reporting that S3 Express object GET latencies (time to open the object and return its head) are in the single-digit ms, where S3 was ~50ms before.
BTW EBS can do 4GB/sec per volume. But you will pay for it.
Very excited about being able to build scalable vector databases on DiskANN like turbopuffer or lancedb. These changes in latency are game-changing. The best server is no server. The capability of a low-latency vector database application that runs on Lambda and S3 and is dirt cheap is pretty amazing.
The clear use case is serverless, without the complications of DynamoDB (expensive, $0.03/GB read), DynamoDB+DAX (VPC complications), or Redis (again, VPC requirements).
This instantly makes a number of applications able to run directly on S3, sans any caching system.
Similar to vector databases, this could be really useful for hosting cloud-optimized GeoTIFFs for mapping purposes. At a previous job we were able to do on-the-fly tiling in about 100ms, but with this new storage class you could probably make something that tiles just as fast as, or even faster than, ArcGIS with all of its proprietary optimizations and goop.
Take it a step further: GDAL has supported S3 raster data sources out of the box for a while now, so any GDAL-powered system may be able to operate on S3 files as if they were local.
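For reference, GDAL's /vsis3/ virtual filesystem is how that works in practice; a rough sketch (bucket and key are made up, credentials come from the usual AWS environment variables or GDAL config options):

    # Open a cloud-optimized GeoTIFF sitting in S3 as if it were a local file.
    from osgeo import gdal

    gdal.UseExceptions()
    ds = gdal.Open("/vsis3/my-tiles-bucket/imagery/scene.tif")  # hypothetical object
    band = ds.GetRasterBand(1)
    # COGs support HTTP range reads, so only the needed tiles/overviews are
    # fetched -- which is exactly where lower per-request latency helps.
    window = band.ReadAsArray(xoff=0, yoff=0, win_xsize=256, win_ysize=256)
    print(window.shape)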
> “Of course the AWS S3 Express storage costs are still 8x higher than S3 standard, but that’s a non issue for any modern data storage system. Data can be trivially landed into low latency S3 Express buckets, and then compacted out to S3 Standard buckets asynchronously. Most modern data systems already have a form of compaction anyways, so this “storage tiering” is effectively free.”
This is the key insight. The data storage cost essentially becomes negligible and latency goes down by an order of magnitude by using S3 Express as buffer storage and then moving data to standard S3. I see a future where most data-intensive apps use S3 as their main storage layer.
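As a rough illustration of that tiering pattern (a sketch only: bucket names are hypothetical, and it assumes a boto3 version that supports S3 Express directory buckets):

    import boto3

    s3 = boto3.client("s3")
    HOT = "ingest--usw2-az1--x-s3"   # hypothetical S3 Express directory bucket
    COLD = "ingest-archive"          # hypothetical S3 Standard bucket

    def compact(keys, merged_key):
        # Merge many small, recently-landed objects into one larger object in
        # the cheap tier, then delete them from the expensive low-latency tier.
        merged = b"".join(
            s3.get_object(Bucket=HOT, Key=k)["Body"].read() for k in keys
        )
        s3.put_object(Bucket=COLD, Key=merged_key, Body=merged)
        s3.delete_objects(Bucket=HOT,
                          Delete={"Objects": [{"Key": k} for k in keys]})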
We tested S3 Express for our search engine quickwit [0] a couple of weeks ago.
While this was really satisfying on the performance side, we were a bit disappointed by the price, and I mostly agree with the article on this matter.
I can see some very specific use cases where the pricing should be OK, but currently I would say most of our users will just stay on classic S3 and add some local SSD caching if they have a lot of requests.
I'd be fascinated if you could share your insights from using this. Where does the pricing fall down? And is the latency/throughput a big improvement for this use case? (ie. externalizing a search index).
I ran the benchmark at Quickwit. I confirm it works as intended.
I was extremely excited about this feature, primarily interested in the decreased GET request cost, and secondly the lower latency.
Unfortunately the price model puts it in a place where it is the right technology for only some very rare use cases.
In a nutshell the key thing you need to know is:
- The storage is 6.4x more expensive than classic S3.
- The GET requests are 2x cheaper (with additional cost for large requests).
- Your data is replicated within a single region.
- Latency is single-digit ms.
From a pure cost point of view, the realm where it makes sense to use it exists, but it is small, and it often competes more with EBS than with S3; the rough numbers below sketch why.
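To make that concrete, here is a back-of-the-envelope calculation using the ratios above. The baseline S3 Standard prices are assumed (roughly us-east-1 list prices) and the extra per-GB charge on large Express requests is ignored, so treat it as a sketch, not a quote:

    # Assumed baseline prices (S3 Standard); may be out of date.
    STD_STORAGE = 0.023               # $/GB-month
    STD_GET = 0.0004 / 1000           # $/GET
    EXP_STORAGE = STD_STORAGE * 6.4   # "6.4x more expensive" from above
    EXP_GET = STD_GET / 2             # "GETs are 2x cheaper" from above

    def monthly_cost(gb, gets, storage_price, get_price):
        return gb * storage_price + gets * get_price

    # The storage premium is ~$0.124/GB-month while each GET saves $0.0000002,
    # so Express only wins on cost above roughly 620k GETs per stored GB per month.
    print(monthly_cost(100, 1_000_000_000, STD_STORAGE, STD_GET))  # ~$402 standard
    print(monthly_cost(100, 1_000_000_000, EXP_STORAGE, EXP_GET))  # ~$215 express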
Some additional context here is that WarpStream is building a Kafka-compatible streaming system that uses S3 as the object store. This allows them to leverage cheap zone-transfer costs for redundancy, plus automatic storage tiering, to cut down on the costs of running and maintaining these systems. This has previously come at the cost of latency due to S3's read/write speeds, but with S3 Express it makes them more competitive with Confluent's managed Kafka offerings for latency-sensitive applications.
IMO warpstream is a really cool product and this new S3 offering makes them even better
I am eager to hear how it will affect their latency numbers:
> Engineering is about trade-offs, and we’ve made a significant one with WarpStream: latency. The current implementation has a P99 of ~400ms for Produce requests because we never acknowledge data until it has been durably persisted in S3 and committed to our cloud control plane. In addition, our current P99 latency of data end-to-end from producer-to-consumer is around 1s
I solved this problem locally. When a file is uploaded to the server, it is cached in Redis before going to S3. Whenever the codebase needs to use the file, it checks Redis first, and if it is not there it fetches it from S3 and caches it again.
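Roughly like this, if it helps (a sketch with redis-py and boto3; the bucket name and TTL are made up):

    import boto3
    import redis

    s3 = boto3.client("s3")
    r = redis.Redis()
    BUCKET = "my-app-files"   # hypothetical
    TTL = 3600                # arbitrary, 1 hour

    def upload(key, data):
        r.set(key, data, ex=TTL)                         # cache on the way in
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)

    def fetch(key):
        cached = r.get(key)
        if cached is not None:
            return cached
        data = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        r.set(key, data, ex=TTL)                         # re-cache on a miss
        return data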
Exactly. Write-through cache is exactly how Userify[0] used to work for self-hosted versions. (when it was Python, we used Redis to keep state synced across multiple processes, but now that it's a Go app, we do all the caching and state management in memory using Ristretto[1])
However, we now install by default to local disk filesystem, since it's much faster to just do a periodic S3 hot sync, like with restic or aws-cli, than to treat S3 as the primary backing store, or just version the EBS or instance volume. The other reason you might want to use S3 as a primary is if you use a lot of disk, but our files are compressed and extremely small, even for a large installation with tens of thousands of users and instances.
What were the reasons to move from Redis to Ristretto? The two seem very different, since Redis is distributed whereas Ristretto is local to the process.
In our case, Python (because of the GIL) required us to have a single python process per core in order to take advantage of multiple cores, and so we needed Redis to maintain a unified memory state across all the cores, but Go can automatically span across multiple cores.
We also saw about a 10x speedup by moving all caching into the server process, and since it was all in the same process, we no longer had to compress and encrypt data before sending to Redis. We still checkpoint the moving server state, encrypted and compressed, to disk every sixty seconds, just like Redis would do with BGSAVE, so we can start back up within a few seconds (actually faster than the old Redis after a restart.)
Files are just a bunch of bytes. No harm in putting them in a database.
There were some benchmarks (I couldn't find them) where SQLite was faster than the native file system at retrieving, searching, and adding files to a large directory.
SQLite reads and writes small blobs (for example, thumbnail images) 35% faster¹ than the same blobs can be read from or written to individual files on disk using fread() or fwrite().
Furthermore, a single SQLite database holding 10-kilobyte blobs uses about 20% less disk space than storing the blobs in individual files.
The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.
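If you want to see the effect yourself, a minimal (unscientific) reproduction with Python's sqlite3 module looks roughly like this; it is not SQLite's actual benchmark harness:

    import os, sqlite3, time

    N, SIZE = 10_000, 10_000          # 10k blobs of ~10 KB each
    blob = os.urandom(SIZE)

    con = sqlite3.connect("blobs.db")
    con.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v BLOB)")
    con.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                    [(i, blob) for i in range(N)])
    con.commit()

    os.makedirs("blobs", exist_ok=True)
    for i in range(N):
        with open(f"blobs/{i}", "wb") as f:
            f.write(blob)

    t0 = time.perf_counter()
    for i in range(N):                 # the database file is opened once
        con.execute("SELECT v FROM kv WHERE k = ?", (i,)).fetchone()
    t1 = time.perf_counter()
    for i in range(N):                 # one open()/close() per blob
        with open(f"blobs/{i}", "rb") as f:
            f.read()
    t2 = time.perf_counter()
    print(f"sqlite: {t1 - t0:.3f}s   files: {t2 - t1:.3f}s")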
I don't understand why EFS never gets major shout-outs - it's way better than S3: systems can mount it as a drive, it can be shared across systems, and it has always had super low latency... Not sure what S3 Express is really useful for if EFS already exists.
Note that EFS One Zone is priced the same as S3 Express One Zone with similar latency. One isn't better or worse than the other, it only depends on what kind of access your application needs.
Yeah, the main reason is that it's incredibly expensive. You can improve performance by allocating throughput ahead of time, but NFS has never been at its best when working with a bunch of tiny files.
Yes, I built a moderately large system on it that used lots of small shared files. The performance was fairly terrible. There's weird little niggles with it--we had random slowdowns, throughput issues, and things just didn't work quite right.
It was an ok solution for what we were doing, but several times I came really close to just dumping it and standing up an NFS server using EBS volumes.
I also used it a couple of times to store webroots and that was a complete disaster with systems that had lots of small files (Drupal I'm looking at you).
Throughput scales with the amount of data stored in it; it's in the docs. So depending on the application, even if latency is better, the throughput is atrocious at lower volumes of persisted data.
I'm sure there are some cases where mounting storage as a disk is desirable, but from my perspective "systems can mount it as a drive" is a negative, not a positive.
Treating storage as an application-controlled thing that doesn't need systems management is a good thing. I want "put this file in this spot" logic in my application code, not "put this file in this spot on the filesystem, and hope that location is backed by the correct storage layer".
I'm quite curious about this too - both from a cost and performance perspective. If S3 Express is close enough to EFS on these metrics, then I'd say it wins out due to the sheer ubiquity and portability of S3 these days.
In my experience the biggest drawback with EFS is startup time for systems that mount it.
For example, a container or EC2 instance might only need a tiny bit of your storage, and with S3 it can just download what it needs when it needs it.
With EFS, by contrast, the container or instance needs to load in the entire datastore on startup, which can add minutes to startup time if the EFS volume is large.
My understanding is that EFS is exposed as an NFS share. I haven't used it personally, but NFS mounting is generally fast, nearly instant. What does "load in the entire datastore" mean?
Many servers start up, load a ton of data from storage into RAM, and then happily serve that data for a long time. The latency of the server when starting up before it can service its first request is entirely based on the throughput of the data load.
Often these servers will load 128+GB of data into RAM (crazy, huh?) and even if you have 1GB/sec it's still two minutes for the server to start up.
S3 Standard has slow first byte latency for three reasons:
1. All data is stored on old school spinning HDDs with multi millisecond seek times
2. There's still Java (and garbage collection) on the request path. There has been a multi-year effort to move the request path entirely to Rust to eliminate GC but Java still remains.
3. To reduce storage costs, objects are erasure-coded "wide", which means many hosts are involved in servicing a request, and only one of those sub-requests has to be slow to slow your whole request down (see the toy model below).
The new storage class is SSD-backed, presumably doesn't use Java anywhere, and doesn't stripe your data across as many hosts. It's more expensive because SSDs are more expensive than HDDs and narrow erasure codes are more costly than wide erasure codes.
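To illustrate that third point, here's a toy model (the stripe widths and per-host slow probability are made up, purely for intuition):

    # Toy model of the tail-latency effect of wide erasure coding: a GET fans
    # out to k storage hosts, each independently "slow" with probability p,
    # and the request is slow if any one of them is. Numbers are illustrative.
    def p_request_slow(k, p=0.01):
        return 1 - (1 - p) ** k

    for k in (4, 12, 24):                        # hypothetical stripe widths
        print(k, round(p_request_slow(k), 3))    # 0.039, 0.114, 0.214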
Of course not, it's designed differently from the original S3. AWS came out with this to compete with Azure premium blob storage, which has very good first-byte latency, and Azure had it 4 years ago.
Nah, that looks very different; one of the stated goals of S3 Express is to minimize latency, which is the only thing about the Rust S3 work that I remember.
Unfortunately I don't; this is already internal information that I'm not sure I should share here. I never worked on S3 and I no longer work at AWS, so someone from within would have to weigh in.
I saw "X is all you Need" with the "Attention is all you need" paper [1], which launched the Transformer upon the world. Is it the first instance of that phrase?
I think the key benefit touched on by this article is the potential 10x improvement in access speeds (which has many applications beyond reducing your S3 op charges).
> S3 Express One Zone can improve data access speeds by 10x and reduce request costs by 50% compared to S3 Standard and scales to process millions of requests per minute.
10x reduction in latency, higher storage costs with lower access costs (SSD instead of spinning disks). So high-I/O, small-file situations (with no need for cross-AZ access) are where the benefits can be found.
This will work great with the S3 mount point (Mountpoint for Amazon S3) that AWS recently released. It will outperform EFS if your application does not require full POSIX compatibility.
If it's only a cache it should be on EBS, which is still way faster and 2x less expensive.
I started a migration to s3 for such a project (container image caching) but then stopped when I realized what I was doing.
Yes, EBS is the gold standard, but managing EBS to scale up and down instantly, be available to multiple instances, handle lifecycle management, manage replicas, do switchovers, etc. is definitely not easy. And EBS is a bad choice when the throughput you need is very spiky.
> However, the new storage class does open up an exciting new opportunity for all modern data infrastructure: the ability to tune an individual workload for low latency and higher cost or higher latency and lower cost with the exact same architecture and code.
I get it, but at the same time that is also what you lose when you lock yourself in with a particular vendor.
There's not much to the S3 API, and data import/export even at massive scale is available with Snowball. Sure, there's many other AWS services that aren't available at other vendors, but blob storage is commodified at this point.
I suppose "super low latency" is behaviour, in the sense that "a large enough quantitative difference is a qualitative difference". If you rely on the performance and only S3 provides it, then you are effectively locked into the S3 implementation.
[0]: https://github.com/sirupsen/napkin-math
[1]: https://turbopuffer.com/