Most production storage systems/databases built on top of S3 spend a significant amount of effort building an SSD/memory caching tier (e.g. on top of RocksDB) to make them performant enough for production. But it's not easy to keep it in sync with blob...
Even with the cache, the cold query latency lower-bound to S3 is subject to ~50ms roundtrips [0]. To build a performant system, you have to tightly control roundtrips. S3 Express changes that equation dramatically, as S3 Express approaches HDD random read speeds (single-digit ms), so we can build production systems that don't need an SSD cache—just the zero-copy, deserialized in-memory cache.
Many systems will probably continue to have an SSD cache (~100 us random reads), but now MVPs can be built without it, and cold query latency goes down dramatically. That's a big deal.
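Concretely, the read path looks roughly like this (a hedged sketch: bucket/key names are made up, and it assumes a boto3 version that can talk to S3 Express directory buckets):

    import boto3

    s3 = boto3.client("s3")
    _cache = {}  # zero-copy, deserialized objects would live here in practice

    def get_block(key):
        if key in _cache:                        # warm path: pure memory
            return _cache[key]
        resp = s3.get_object(                    # cold path: single-digit-ms GET
            Bucket="my-index--usw2-az1--x-s3",   # hypothetical directory bucket
            Key=key,
        )
        data = resp["Body"].read()
        _cache[key] = data
        return data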
We're currently building a vector database on top of object storage, so this is extremely timely for us... I hope GCS ships this ASAP. [1]
We built HopsFS-S3 [0] for exactly this problem, and have been running it as part of Hopsworks for a number of years now. It's a network-aware, write-through cache for S3 with an HDFS API. Metadata operations are performed on HopsFS, so you don't have the other problems like list operations returning a max of 1000 files/dirs.
NVMe is what is changing the equation, not SSD. NVMe disks now do up to 8 GB/s, although the crap the cloud providers offer barely goes to 2 GB/s - and only for expensive instances. So, instead of 40X better throughput than S3, we can get something like 10X. Right now, these workloads are much better on-premises on the cheapest M.2 NVMe disks ($200 for 4TB with 4 GB/s read/write) backed by an S3 object store like Scality.
The numbers you're giving are throughput (bytes/sec), not latency.
The comment you're replying to is mostly talking about latency - reporting that S3 Express object GET latencies (time to open the object and return its head) are in the single-digit ms, where S3 was ~50ms before.
BTW EBS can do 4GB/sec per volume. But you will pay for it.
Very excited about being able to build scalable vector databases on DiskANN like turbopuffer or lancedb. These changes in latency are game-changing. The best server is no server. The capability of a low-latency vector database application that runs on Lambda and S3 and is dirt cheap is pretty amazing.
The clear use case is serverless, without the complications of DynamoDB (expensive, $0.03/GB read), DynamoDB+DAX (VPC complications), or Redis (again, VPC requirements).
This instantly makes a number of applications able to run directly on S3, sans any caching system.
Similar to vector databases, this could be really useful for hosting cloud-optimized GeoTIFFs for mapping purposes. At a previous job we were able to do on-the-fly tiling in about 100ms, but with this new storage class you could probably make something that tiles just as fast as, or even faster than, ArcGIS with all of its proprietary optimizations and goop.
Take it a step further: GDAL has supported S3 raster data sources out of the box for a while now, so any GDAL-powered system may be able to operate on S3 files as if they were local.
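For reference, GDAL's /vsis3/ virtual filesystem is how that works in practice; a rough sketch (bucket and key are made up, credentials come from the usual AWS environment variables or GDAL config options):

    # Open a cloud-optimized GeoTIFF sitting in S3 as if it were a local file.
    from osgeo import gdal

    gdal.UseExceptions()
    ds = gdal.Open("/vsis3/my-tiles-bucket/imagery/scene.tif")  # hypothetical object
    band = ds.GetRasterBand(1)
    # COGs support HTTP range reads, so only the needed tiles/overviews are
    # fetched -- which is exactly where lower per-request latency helps.
    window = band.ReadAsArray(xoff=0, yoff=0, win_xsize=256, win_ysize=256)
    print(window.shape)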
> “Of course the AWS S3 Express storage costs are still 8x higher than S3 standard, but that’s a non issue for any modern data storage system. Data can be trivially landed into low latency S3 Express buckets, and then compacted out to S3 Standard buckets asynchronously. Most modern data systems already have a form of compaction anyways, so this “storage tiering” is effectively free.”
This is the key insight. The data storage cost essentially becomes negligible and latency goes down by an order of magnitude by using S3 Express as buffer storage and then moving data to standard S3. I see a future where most data-intensive apps use S3 as their main storage layer.
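As a rough illustration of that tiering pattern (a sketch only: bucket names are hypothetical, and it assumes a boto3 version that supports S3 Express directory buckets):

    import boto3

    s3 = boto3.client("s3")
    HOT = "ingest--usw2-az1--x-s3"   # hypothetical S3 Express directory bucket
    COLD = "ingest-archive"          # hypothetical S3 Standard bucket

    def compact(keys, merged_key):
        # Merge many small, recently-landed objects into one larger object in
        # the cheap tier, then delete them from the expensive low-latency tier.
        merged = b"".join(
            s3.get_object(Bucket=HOT, Key=k)["Body"].read() for k in keys
        )
        s3.put_object(Bucket=COLD, Key=merged_key, Body=merged)
        s3.delete_objects(Bucket=HOT,
                          Delete={"Objects": [{"Key": k} for k in keys]})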
We tested S3 Express for our search engine quickwit [0] a couple of weeks ago.
While this was really satisfying on the performance side, we were a bit disappointed by the price, and I mostly agree with the article on this matter.
I can see some very specific use cases where the pricing should be OK, but currently I would say most of our users will just stay on classic S3 and add some local SSD caching if they have a lot of requests.
I'd be fascinated if you could share your insights from using this. Where does the pricing fall down? And is the latency/throughput a big improvement for this use case? (ie. externalizing a search index).
I ran the benchmark at Quickwit. I confirm it works as intended.
I was extremely excited about this feature, primarily interested in the decreased GET request cost, and secondly the lower latency.
Unfortunately the price model puts it in a place where it is the right technology for only some very rare use cases.
In a nutshell the key thing you need to know is:
- The storage is 6.4x more expensive than classic S3.
- The GET requests are 2x cheaper (with additional cost for large requests).
- Your data is replicated within a single region.
- Latency is single-digit ms.
From a pure cost point of view, the realm where it makes sense to use it exists, but it is small, and it often competes more with EBS than with S3; the rough numbers below sketch why.
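To make that concrete, here is a back-of-the-envelope calculation using the ratios above. The baseline S3 Standard prices are assumed (roughly us-east-1 list prices) and the extra per-GB charge on large Express requests is ignored, so treat it as a sketch, not a quote:

    # Assumed baseline prices (S3 Standard); may be out of date.
    STD_STORAGE = 0.023               # $/GB-month
    STD_GET = 0.0004 / 1000           # $/GET
    EXP_STORAGE = STD_STORAGE * 6.4   # "6.4x more expensive" from above
    EXP_GET = STD_GET / 2             # "GETs are 2x cheaper" from above

    def monthly_cost(gb, gets, storage_price, get_price):
        return gb * storage_price + gets * get_price

    # The storage premium is ~$0.124/GB-month while each GET saves $0.0000002,
    # so Express only wins on cost above roughly 620k GETs per stored GB per month.
    print(monthly_cost(100, 1_000_000_000, STD_STORAGE, STD_GET))  # ~$402 standard
    print(monthly_cost(100, 1_000_000_000, EXP_STORAGE, EXP_GET))  # ~$215 express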
Some additional context here is that WarpStream is building a Kafka-compatible streaming system that uses S3 as the object store. This allows them to leverage cheap zone-transfer costs for redundancy, plus automatic storage tiering, to cut down on the costs of running and maintaining these systems. This has previously come at the cost of latency due to S3's read/write speeds, but with S3 Express it makes them more competitive with Confluent's managed Kafka offerings for latency-sensitive applications.
IMO warpstream is a really cool product and this new S3 offering makes them even better
I am eager to hear how it will affect their latency numbers:
> Engineering is about trade-offs, and we’ve made a significant one with WarpStream: latency. The current implementation has a P99 of ~400ms for Produce requests because we never acknowledge data until it has been durably persisted in S3 and committed to our cloud control plane. In addition, our current P99 latency of data end-to-end from producer-to-consumer is around 1s
I solved this problem locally. When a file is uploaded to the server, it is cached in Redis before going to S3. Whenever the codebase needs to use the file, it checks Redis first, and if it is not there it fetches it from S3 and caches it again.
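Roughly like this, if it helps (a sketch with redis-py and boto3; the bucket name and TTL are made up):

    import boto3
    import redis

    s3 = boto3.client("s3")
    r = redis.Redis()
    BUCKET = "my-app-files"   # hypothetical
    TTL = 3600                # arbitrary, 1 hour

    def upload(key, data):
        r.set(key, data, ex=TTL)                         # cache on the way in
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)

    def fetch(key):
        cached = r.get(key)
        if cached is not None:
            return cached
        data = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        r.set(key, data, ex=TTL)                         # re-cache on a miss
        return data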
Exactly. Write-through cache is exactly how Userify[0] used to work for self-hosted versions. (when it was Python, we used Redis to keep state synced across multiple processes, but now that it's a Go app, we do all the caching and state management in memory using Ristretto[1])
However, we now install by default to local disk filesystem, since it's much faster to just do a periodic S3 hot sync, like with restic or aws-cli, than to treat S3 as the primary backing store, or just version the EBS or instance volume. The other reason you might want to use S3 as a primary is if you use a lot of disk, but our files are compressed and extremely small, even for a large installation with tens of thousands of users and instances.
What were the reasons to move from Redis to Ristretto? The two seem very different, since Redis is distributed whereas Ristretto is local to the process.
In our case, Python (because of the GIL) required us to have a single python process per core in order to take advantage of multiple cores, and so we needed Redis to maintain a unified memory state across all the cores, but Go can automatically span across multiple cores.
We also saw about a 10x speedup by moving all caching into the server process, and since it was all in the same process, we no longer had to compress and encrypt data before sending to Redis. We still checkpoint the moving server state, encrypted and compressed, to disk every sixty seconds, just like Redis would do with BGSAVE, so we can start back up within a few seconds (actually faster than the old Redis after a restart.)
Files are just a bunch of bytes. No harm in putting them in a database.
There were some benchmarks (I couldn't find them) where SQLite was faster than the native file system at retrieving, searching, and adding files to a large directory.
SQLite reads and writes small blobs (for example, thumbnail images) 35% faster¹ than the same blobs can be read from or written to individual files on disk using fread() or fwrite().
Furthermore, a single SQLite database holding 10-kilobyte blobs uses about 20% less disk space than storing the blobs in individual files.
The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.
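If you want to see the effect yourself, a minimal (unscientific) reproduction with Python's sqlite3 module looks roughly like this; it is not SQLite's actual benchmark harness:

    import os, sqlite3, time

    N, SIZE = 10_000, 10_000          # 10k blobs of ~10 KB each
    blob = os.urandom(SIZE)

    con = sqlite3.connect("blobs.db")
    con.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v BLOB)")
    con.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                    [(i, blob) for i in range(N)])
    con.commit()

    os.makedirs("blobs", exist_ok=True)
    for i in range(N):
        with open(f"blobs/{i}", "wb") as f:
            f.write(blob)

    t0 = time.perf_counter()
    for i in range(N):                 # the database file is opened once
        con.execute("SELECT v FROM kv WHERE k = ?", (i,)).fetchone()
    t1 = time.perf_counter()
    for i in range(N):                 # one open()/close() per blob
        with open(f"blobs/{i}", "rb") as f:
            f.read()
    t2 = time.perf_counter()
    print(f"sqlite: {t1 - t0:.3f}s   files: {t2 - t1:.3f}s")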
I don't understand why EFS never gets major shout-outs - it's way better than S3: systems can mount it as a drive, it can be shared across systems, and it has always had super low latency... Not sure what S3 Express is really useful for if EFS already exists.
Note that EFS One Zone is priced the same as S3 Express One Zone with similar latency. One isn't better or worse than the other, it only depends on what kind of access your application needs.
Yeah, the main reason is that it's incredibly expensive. You can improve performance by allocating throughput ahead of time, but NFS has never been at its best when working with a bunch of tiny files.
Yes, I built a moderately large system on it that used lots of small shared files. The performance was fairly terrible. There's weird little niggles with it--we had random slowdowns, throughput issues, and things just didn't work quite right.
It was an ok solution for what we were doing, but several times I came really close to just dumping it and standing up an NFS server using EBS volumes.
I also used it a couple of times to store webroots and that was a complete disaster with systems that had lots of small files (Drupal I'm looking at you).
Throughput scales with the amount of data stored in it; it's in the docs. So depending on the application, even if latency is better, the throughput is atrocious at lower volumes of persisted data.
I'm sure there are some cases where mounting storage as a disk is desirable, but from my perspective "systems can mount it as a drive" is a negative, not a positive.
Treating storage as an application-controlled thing that doesn't need systems management is a good thing. I want "put this file in this spot" logic in my application code, not "put this file in this spot on the filesystem, and hope that location is backed by the correct storage layer".
I'm quite curious about this too - both from a cost and performance perspective. If S3 Express is close enough to EFS on these metrics, then I'd say it wins out due to the sheer ubiquity and portability of S3 these days.
In my experience the biggest drawback with EFS is startup time for systems that mount it.
For example, a container or EC2 instance might only need a tiny bit of your storage, and with S3 it can just download what it needs when it needs it.
With EFS, by contrast, the container or instance needs to load in the entire datastore on startup, which can add minutes to startup time if the EFS volume is large.
My understanding is that EFS is exposed as an NFS share. I haven't used it personally, but NFS mounting is generally fast, nearly instant. What does "load in the entire datastore" mean?
Many servers start up, load a ton of data from storage into RAM, and then happily serve that data for a long time. The latency of the server when starting up before it can service its first request is entirely based on the throughput of the data load.
Often these servers will load 128+GB of data into RAM (crazy, huh?) and even if you have 1GB/sec it's still two minutes for the server to start up.
S3 Standard has slow first byte latency for three reasons:
1. All data is stored on old school spinning HDDs with multi millisecond seek times
2. There's still Java (and garbage collection) on the request path. There has been a multi-year effort to move the request path entirely to Rust to eliminate GC but Java still remains.
3. To reduce storage costs, objects are erasure-coded "wide", which means many hosts are involved in servicing a request, and only one of those sub-requests has to be slow to slow your whole request down (see the toy model below).
The new storage class is SSD-backed, presumably doesn't use Java anywhere, and doesn't stripe your data across as many hosts. It's more expensive because SSDs are more expensive than HDDs and narrow erasure codes are more costly than wide erasure codes.
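To illustrate that third point, here's a toy model (the stripe widths and per-host slow probability are made up, purely for intuition):

    # Toy model of the tail-latency effect of wide erasure coding: a GET fans
    # out to k storage hosts, each independently "slow" with probability p,
    # and the request is slow if any one of them is. Numbers are illustrative.
    def p_request_slow(k, p=0.01):
        return 1 - (1 - p) ** k

    for k in (4, 12, 24):                        # hypothetical stripe widths
        print(k, round(p_request_slow(k), 3))    # 0.039, 0.114, 0.214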
Of course not, it's designed differently from the original S3. AWS came out with this to compete with Azure premium blob storage, which has very good first-byte latency, and Azure had it 4 years ago.
Nah, that looks very different; one of the stated goals of S3 Express is to minimize latency, which is the only thing about the Rust S3 work that I remember.
Unfortunately I don't; this is already internal information that I'm not sure I should share here. I never worked on S3 and I no longer work at AWS, so someone from within would have to weigh in.
I saw "X is all you Need" with the "Attention is all you need" paper [1], which launched the Transformer upon the world. Is it the first instance of that phrase?
I think the key benefit touched on by this article is the potential 10x improvement in access speeds (which has many applications beyond reducing your S3 op charges).
> S3 Express One Zone can improve data access speeds by 10x and reduce request costs by 50% compared to S3 Standard and scales to process millions of requests per minute.
10x reduction in latency, higher storage costs with lower access costs (SSD instead of spinning disks). So high-I/O, small-file situations (with no need for cross-AZ access) are where the benefits can be found.
This will work great with the S3 mount point (Mountpoint for Amazon S3) that AWS recently released. It will outperform EFS if your application does not require full POSIX compatibility.
If it's only a cache it should be on EBS, which is still way faster and 2x less expensive.
I started a migration to s3 for such a project (container image caching) but then stopped when I realized what I was doing.
Yes, EBS is the gold standard, but managing EBS to scale up and down instantly, be available to multiple instances, handle lifecycle management, manage replicas, do switchovers, etc. is definitely not easy. And EBS is a bad choice when the throughput you need is very spiky.
> However, the new storage class does open up an exciting new opportunity for all modern data infrastructure: the ability to tune an individual workload for low latency and higher cost or higher latency and lower cost with the exact same architecture and code.
I get it, but at the same time that is also what you lose when you lock yourself in with a particular vendor.
There's not much to the S3 API, and data import/export even at massive scale is available with Snowball. Sure, there's many other AWS services that aren't available at other vendors, but blob storage is commodified at this point.
I suppose "super low latency" is behaviour, in the sense that "a large enough quantitative difference is a qualitative difference". If you rely on the performance and only S3 provides it, then you are effectively locked into the S3 implementation.
[0]: https://github.com/sirupsen/napkin-math
[1]: https://turbopuffer.com/