If anyone is wondering why this is huge: it's because of the problems you face when you don't have strong consistency. See this picture to get an understanding of the kinds of problems you might run into with eventual consistency:
I believe this makes Amazon S3 behave more similar to Azure blob storage[1] and Google Cloud Storage[2], which is pretty convenient for folks who are abstracting blob stores across different offerings.
For what it's worth, consistency in S3 was usually pretty good anyway, but in the past I ran into issues where it could vary a bit. If you designed your application with this in mind, of course, it shouldn't be an issue. In my case I believe we just had to add some retry logic. (And of course, that is no longer necessary.)
This is super awesome for customers. I am also beyond excited for all the open-source connectors to finally be simplified so that they don't have to deal with the "oh right, gotta be careful because of list consistency". It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
I'm now much more inclined to make my GCS FUSE rewrite in Rust also support S3 directly :).
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop
It's been a few years, but I lost customer data on gs:// due to listing consistency - create a bunch of files, list dir, rename them one by one to commit the dir into Hive - listing missed a file & the rename skipped the missing one.
Put in a breakpoint and the bug disappears; run the rename and the listing from the same JVM and the bug disappears.
Consistency issues are hell.
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
Yes!
S3Guard was a complete pain to secure: since it was built on top of DynamoDB, and since any client could be a writer for some file (any file), you had to give all clients write access to the entire DynamoDB table, which was quite a security headache.
Do you have a link to the commits that removed the code? It'd be good to see what sort of complexity this kind of strong consistency can make redundant.
In the case of Hadoop's S3 connector, this could eliminate this entire directory, plus its tests, plus a bunch of hooks in the main code: https://github.com/apache/hadoop/tree/trunk/hadoop-tools/had.... There's an argument in favor of keeping it in case other S3-compatible stores need it (though you'd still need DynamoDB or some equivalent) and because it makes metadata lookups so much faster than S3 scans, which helps with query planning performance. But I imagine even fewer people will take the trade-off now that Amazon S3 itself is consistent.
Not yet, it’s too crappy :). I only get to it every once in a while and it shows.
As an example, until last weekend, I would just hardcode an access token rather than doing an oauth dance. Luckily, tame-oauth [1] from the folks at Embark was reasonably easy to integrate with.
I also got a little depressed that zargony/rust-fuse was stuck on an ancient version until I learned that Chris Berner from OpenAI had forked it earlier this year to modernize / update it [2].
That's pretty incredible. With no price increase either. Using s3 as a makeshift database is possible in a lot more scenarios now.
I especially like using s3 as a db when dealing with ETLs, where the state of some data is stored in its key prefix. This means that the etl can stop / resume at any point with a single source of truth as the database.
The potential drawback is cost of course; moving (renaming) is free but copying is not. S3's biggest price pain is always its PUTs. In many ETLs this is usually a non-issue because you have to do those PUTs regardless, as you probably want the data to be saved at the various stages for future debugging and recovery.
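Roughly the pattern I mean, as a minimal boto3 sketch (the bucket and prefix names are made up; since S3 has no native rename, a "move" is a copy followed by a delete):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-etl-bucket"  # hypothetical bucket name

    def advance_state(key, old_state, new_state):
        """'Rename' an object so its key prefix reflects the new ETL state."""
        assert key.startswith(old_state + "/")
        new_key = new_state + "/" + key[len(old_state) + 1:]
        # S3 has no rename: copy to the new key, then delete the old one.
        s3.copy_object(Bucket=BUCKET, Key=new_key,
                       CopySource={"Bucket": BUCKET, "Key": key})
        s3.delete_object(Bucket=BUCKET, Key=key)
        return new_key

    def pending_work(state="pending"):
        """Resuming the ETL is just listing whatever is still under 'pending/'."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=state + "/"):
            for obj in page.get("Contents", []):
                yield obj["Key"]

With strongly consistent listing, pending_work() now reliably reflects every advance_state() call that has already returned.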
Yup, me too! Check out this state_doc interface I wrote in Python for building a state machine where the state is stored in S3. I never bothered to solve for the edge case of inconsistent reads because my little state machine slept for hours waiting for work and then only took minutes between waking up and doing the work to verify restores.
With the news of S3's strong consistency, my program is immediately safer against some set of defects: the new properties of the underlying datastore make my code's naive/invalid assumptions true.
This is one reason I have been a big fan of Google Cloud Storage over AWS S3: at a past company AWS consistency was a huge pain, and GCS has had this for years.
Ask HN: I have a question regarding strong consistency. The question is related to all geo distributed strong consistent storages, but let me pick Google Spanner as example.
Spanner has a maximum upper bound of 7 milliseconds of clock offset between nodes, thanks to TrueTime. To achieve external consistency, Spanner simply waits 7 ms before committing a transaction. After that amount of time the window of uncertainty is over and we can be sure that every transaction is ordered correctly. So far so good, but how does Spanner deal with cross-datacenter network latency? Let's say I commit a transaction to Spanner in Canada and after 7 ms I get my confirmation, but now in Australia someone also makes a conflicting transaction and also gets their confirmation after 7 ms. Spanner, however, bound to the laws of physics, can only see this conflict after a 100+ ms delay due to network latency between the datacenters. How is that problem solved?
The simple answer is that there are round trips required between datacenters when you have a quorum that spans data centers. Additionally, one replica of a split is the leader, and the leader holds write locks. So already you have to talk to the leader replica, even if it's out of the DC you're in. Getting the write lock overlaps with the transaction commit though. So for your example if we say the leader is in Canada and the client is in Australia, and we're doing a write to row 'Foo' without first reading (a so called blind write):
Client (Australia) -> Leader (Canada): Take a write lock on 'Foo' and try to commit the transaction
Leader -> other replicas: Start commit, timestamp 123
Other replicas -> Leader: Ready for commit
Leader waits for a majority to be ready to commit and for timestamp 123 to be before the current TrueTime interval
Leader -> Other replicas: Commit transaction, and in parallel replies to client.
Of course there are things you can do to mitigate this depending on your needs, but there's no free lunch. If you have a client in Australia and a client in Canada writing to the same data you're going to pay for round trips for transactions.
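If it helps, here's a toy sketch of the commit-wait idea (the TrueTime-style interval and the replica prepare/apply methods are made up for illustration, not Spanner's actual code):

    import time
    from dataclasses import dataclass

    @dataclass
    class TTInterval:
        earliest: float  # true time is guaranteed to be >= earliest
        latest: float    # ... and <= latest (bounded clock uncertainty)

    def tt_now(uncertainty=0.007):
        """Stand-in for TrueTime: returns an interval containing the true time."""
        t = time.time()
        return TTInterval(t - uncertainty, t + uncertainty)

    def leader_commit(replicas, txn):
        """Leader-side commit: pick a timestamp, replicate, then commit-wait."""
        ts = tt_now().latest                                # chosen commit timestamp
        acks = sum(r.prepare(txn, ts) for r in replicas)    # hypothetical replica API
        if acks < len(replicas) // 2 + 1:
            raise RuntimeError("no quorum")
        # Commit-wait: don't acknowledge until ts is definitely in the past, so no
        # later-starting transaction anywhere can be assigned an earlier timestamp.
        while tt_now().earliest < ts:
            time.sleep(0.001)
        for r in replicas:
            r.apply(txn, ts)                                # hypothetical replica API
        return ts

The conflicting write in the other region has to go through the same leader (and take the same lock), which is where the cross-datacenter round trips come in.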
Writes are implemented with pessimistic locking. A write in a Canada datacenter may have to acquire a lock in Australia as well. See page 12 of the paper for the F1 numbers (mean of 103.0ms with standard deviation of 50ms for multi-site commits).
I am curious how much Dropbox pays for data ingress/egress; they migrated storage to their own on-premises data centers, and now they're moving data back to S3 for the data lake.
For any AWS EMR/hadoop users out there, this means the end of emrfs.
For the uninitiated, emrfs recognizes consistency problems but does not fix them. It'll throw an exception if there's a consistency problem, even give you some tools to fix the problem, but the tools may give false positives. The result is you've got to fix some consistency problems yourself, parse items out of the emrfs DynamoDB catalog, match them up to s3, then make adjustments where needed. It's an unnecessary chore.
It surprises me that this issue has not got more attention over the years and thankfully it'll be solved soon.
Side question -- why is s3fs so slow? I get only about 1 put per 10 seconds with s3fs. One would hope that you could use something like s3fs as a makeshift S3 API, but its performance is really horrible.
I can't answer this directly, but I thought I'd chime in with a possible alternative, or at least something else to test.
I'm using rclone with S3[1] (and others). There are commands[2] you can use to just sync/copy files, but it also has a mount option[3]. It also has caching[4] and other things that might help.
This is not expected behavior. An older version had pessimistic locking which prevented concurrently adding a file to a directory and listing that directory, which is the closest matching symptom. I recommend upgrading to the latest version, running with debug flags `-f -d -o curldbg`, and observing what's happening. Please open an issue at https://github.com/s3fs-fuse/s3fs-fuse/issues if your symptoms persist.
One thing I noticed when I went digging through the s3fs source trying to answer that myself is that its uploader is single-threaded. Part of what makes AWS's CLI reasonably fast is that it uses several threads to use as much bandwidth as possible. If I limit AWS's tool to one thread on a box with plenty of bandwidth, it drops my upload from around 135 MiB/s to around 20 MiB/s.
Not sure if that's all of it, or even the majority of the slowness.
Could you expand on this comment? s3fs uses multiple threads for uploading and populating readdir metadata via S3fsMultiCurl::MultiPerform. Earlier versions used curl_multi_perform which may have hidden some of the parallelism from you.
Huh, right you are. I did a cursory search, but whatever I saw clearly was wrong.
That said, something is going on that I can't explain:
# Default settings for aws
$ time aws s3 cp /dev/shm/1GB.bin s3://test-kihaqtowex/a
upload: ../../dev/shm/1GB.bin to s3://test-kihaqtowex/a
real 0m10.312s
user 0m5.909s
sys 0m4.204s
$ aws configure set default.s3.max_concurrent_requests 4
$ time aws s3 cp /dev/shm/1GB.bin s3://test-kihaqtowex/b
upload: ../../dev/shm/1GB.bin to s3://test-kihaqtowex/b
real 0m26.732s
user 0m4.989s
sys 0m2.741s
$ time cp /dev/shm/1GB.bin s3fs_mount_-kihaqtowex/c
real 0m26.368s
user 0m0.006s
sys 0m0.699s
(I did multiple runs, though not hundreds so I can't say with certainty this would hold)
I swear the difference was worse in the past, so maybe things have improved. Still not worth the added slowness to me for my use cases.
Thanks for testing! You may improve performance via increasing -o parallel_count (default="5"). Newer versions of s3fs have improved performance but please report any cliffs at https://github.com/s3fs-fuse/s3fs-fuse/issues .
One fun fact that I learned recently: a prefix is not strictly path-delimited. I would think of /foo/bar and /foo/baz and /bar/baz as having two prefixes, but it could be anywhere from one to three, depending on how S3 has partitioned your data.
I asked around and it seems that `prefix` here is the exact value of the `prefix` or `key` query parameter in the API. So you can make 5500 ListObjects requests per second for prefix=/a/b and simultaneously make 5500 GetObject requests per second for key=/a/b/file1 and not be rate-limited.
In other words, the rate-limiting key is `${HTTP_METHOD}(${QUERY_PARAM_KEY}|${QUERY_PARAM_PREFIX})`.
This is a major issue when turning batch data into individual files, especially with dynamic prefixes. You can and will hit this edge pretty quickly using EMR or Glue, which try to run as fast as possible. The only answer is... slow down your writes or just try again, which is a bit frustrating.
Neat! Would this make it possible in the future for S3 to support CAS like operations for PutObject? Something like supporting `if-match` and `if-none-match` headers on the etag of an object would be so useful.
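Something like this, purely hypothetical since S3 didn't honor these precondition headers at the time of writing (sketched as a plain HTTP PUT with the requests library against an object URL you're authorized for):

    import requests

    def cas_put(url, body, expected_etag=None):
        """Hypothetical compare-and-swap PUT using standard HTTP preconditions."""
        headers = {}
        if expected_etag is None:
            headers["If-None-Match"] = "*"       # only create if it doesn't exist yet
        else:
            headers["If-Match"] = expected_etag  # only overwrite this exact version
        resp = requests.put(url, data=body, headers=headers)
        if resp.status_code == 412:              # Precondition Failed: lost the race
            return False
        resp.raise_for_status()
        return True

With strong consistency in place, the hard part (knowing which version you're conditioning on) seems much more tractable.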
Listing and update consistency is great, but the ACID SQL-on-S3 frameworks (Apache Hudi, Delta Lake) also require atomic rename to support transactions.
HopsFS-S3 solves the ACID SQL-on-S3 problem discussed here on HN last week ( https://news.ycombinator.com/item?id=25149154 ) by adding a metadata layer over S3 and providing a POSIX-like HDFS API to clients. Operations like rename, mv, chown, chmod are metadata operations - rename is not copy and delete.
Once libraries and tools start relying on this, it is going to make life interesting for the S3-compatible players. The API remains the same, but the behavior is quite different.
It is the smaller players I'm thinking of, which needed S3 compatibility for uptake. Openstack Swift, Backblaze, Ceph... I am not sure if any of these suddenly stopped being S3 compatible due to this announcement.
Ceph has strong consistency by design. S3 compatibility in Ceph comes via the Rados Gateway (RGW) which just sits atop normal Ceph. This means that the strong consistency is carried over to RGW. So, no change for Ceph.
OpenStack Swift has an eventually consistent design. If the Swift community desires to maintain strict compatibility with S3 regarding consistency model, it would be a challenge. Despite being a challenge, I would never bet against the Swift developers. If they want to do it, I'm quite sure that they can. Having said all of that, I doubt that the Swift community would consider this a burning issue that needs to be solved. They've existed for over 10 years with an eventually consistent model and their user base is well aware of it.
Not necessarily. I think Spanner's magic is most useful for bounded-staleness reads and scaling reads through non-leader replicas. Otherwise I think you're looking at normal "NewSQL" guarantees.
The previous behavior (new keys weren't strongly consistent) was weird, and it broke a lot of patterns that AWS itself sort of guides you towards.
For example, you can set up an SQS queue to be informed of writes into a bucket. The idea being that some component writes into the bucket, and a consumer then fetches the data & processes it. Except — gotcha! — there wasn't a guarantee of consistency. It would usually work, though, and so you'd get these weird failures that you couldn't explain unless you already knew the answer, really.
Same thing with how S3 object names sort of feel like file names, and that you might use them for something useful, and you design around that, and then you hit the section of the docs that says "no, you shouldn't, they should be uniformly balanced". (Though honestly, I've never felt any negative side-effects from ignoring this, unlike the consistency guarantees.)
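To make the SQS example concrete, consumers used to need a retry along these lines (a rough boto3 sketch; the backoff numbers are arbitrary and the bucket/key come from the S3 event notification):

    import time
    import boto3

    s3 = boto3.client("s3")

    def fetch_with_retry(bucket, key, attempts=5):
        """Under the old model, the object named in the event might not be
        visible yet on the endpoint you hit, so retry (or dead-letter) on 404."""
        for i in range(attempts):
            try:
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except s3.exceptions.NoSuchKey:
                time.sleep(2 ** i)  # back off and hope the write becomes visible
        raise RuntimeError("%s still not visible after %d attempts" % (key, attempts))

With read-after-write consistency, the happy path is now also the only path.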
> then you hit the section of the docs that says "no, you shouldn't, they should be uniformly balanced". (Though honestly, I've never felt any negative side-effects from ignoring this
You likely won’t, unless you hit a bucket size that has at least a billion small files inside. If you get into the trillion and above range and are heavily unbalanced due to a recent deploy of yours, ooops, there goes S3. Or it did in 2013 anyway.
Are there any scenarios where you would want eventual consistency over strong consistency? (Assuming pricing, performance, replication etc are the same.)
Your last sentence is the answer I've always heard given: it's a performance trade-off. Do you wait for every node in the system to acknowledge an update and purge caches, or return as soon as a sufficient durability threshold has been hit?
These are synonyms.
How do you scale? Horizontally. And how do you do that without degrading performance? The same way you increase availability: eventual consistency.
No. A magical do-everything datastore would have strong consistency. Eventual consistency is something you might settle for given the performance cost trade offs.
To see this, consider that strong consistency has all the guarantees of eventual consistency, plus some more. And the additional guarantees might make your application much easier to write.
It wasn’t a bug, it was well-documented behavior. However, it was the root of many bugs by people building on top of S3 that did not take the eventual consistency into account.
I’ll bet this release is going to fix a lot of weird bugs in systems out there using S3.
I'm pulling my hair out because people are using eventual consistency to mean "eventual consistency with a noticeable lag before the consistency is achieved", which bothers the pedant in me.
I mean, technically, strong consistency implies eventual consistency (with lag = 0). But everyone's equating eventual consistency with the noticeable lag itself, implying that EC per se is a bad thing.
For an analogy, it would be like if people were talking about how rectangular cakes suck, because <problems of unequal width vs length>, and thus they use square cakes, but ... square cakes are rectangular too.
(What's going on is that people use <larger set> to mean <larger set, excluding members that qualify for smaller set>. <larger set> = eventual consistency, rectangle; <smaller set> = strong consistency, square.)
Obviously, it's not a problem for communication because everyone implicitly understands what's going on and uses the words the same way, but I wish people spoke in terms of "has anyone actually seen a consistency lag?" rather than "has anyone actually seen eventual consistency?", since the latter is not the right way to frame it, and is actually a good thing to have, which happened both before and after this development.
Eventual consistency with zero lag is not strong consistency. Two simultaneous conflicting writes with eventual consistency will appear successful. Strong consistency guarantees that one of the writes will fail.
EDIT: I think the reason people care about the lag is specifically for this case; if you want to know if your eventually consistent write actually succeeded then you need to know how long to wait to check.
Well, in all the threads, they're describing situations with lag. That is, they're equating eventual consistency with the lag, and implying it's therefore bad, when eventual consistency can include unnoticeably small lag.
And thus it’s the lag that they’re actually bothered by, not the EC per se. Hence my clawed out hair.
S3 is almost certainly not fully partition tolerant at the node level and requires some sort of quorum. Other “magical” data stores like Spanner also retain this limitation, they just have very reliable replication strategies.
I’m not claiming it’s a CA system, and the terminology “partition intolerant” is not verboten by Kyle Kingsbury. From your link: “Specifically, partition-intolerant systems must sacrifice invariants when partitions occur. Which invariants?”
The answer in this case is that availability is sacrificed, unless Amazon is making a very misleading claim of strong consistency (per the submission title/link). So it’s CP.
In case “not fully partition tolerant” is what is throwing you off, all I mean by that is that there are likely network partitions of the S3 nodes that will not cause outward failure (e.g. partition of a single node), though some will (e.g. too many partitions to form a quorum, or whatever their underlying consensus implementation relies on).
I guess the simple answer to the GP is "Availability is likely sacrificed", and the clue's in the submission title being all about consistency. I haven't looked too deep into it and am sure there are also still Consistency sacrifices too.
I'm still a little niggled though... "almost certainly not fully partition tolerant at the node level and requires some sort of quorum" - what does this actually mean? I'm confused why you raise the tolerance of a single node to partitions, when the partitions by definition require >1 node?
When I say “at the node level” I’m trying to draw a distinction with the service level view where a client is talking to S3 as a (from their perspective) single entity per region. So a better way of putting it may be the internal view of the service, where partitions are between nodes that are part of S3 itself (as opposed to between a client and S3) and they affect the overall health of the system. I’m not referring to a single node being partition resistant (hence why I said node level, and never mentioned single) which as you said wouldn’t make any sense.
>> Amazon S3 delivers strong read-after-write consistency automatically for all applications, without changes to performance or availability, without sacrificing regional isolation for applications, and at no additional cost.
So they claim performance and availability will remain the same while claiming strong consistency. I was confused at first, but then "same" availability isn't 100% availability. So it is indeed CP.
This is great news for application developers! And probably hard work for the AWS implementors. I will try to update this over the weekend: https://github.com/gaul/are-we-consistent-yet
It will not write the whole file. Instead, each page of the database (default = 4 KB) will be a separate PUT.
The use case is: one-writer multiple readers.
The readers will read an old snapshot of the database (from S3) and the writer will write to the active one. This way a lot of readers can open the DB in read-only mode.
As for the writer, each page will be a separate PUT. Writes will be appended to local log(s) and pages will be synced in the background. This helps exploit lots of parallel PUTs to S3. Once the sync happens, the disk space for that page will be reclaimed.
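A bare-bones sketch of the page-per-object layout I have in mind (the key scheme and class are just illustrative):

    import boto3

    PAGE_SIZE = 4096  # SQLite's default page size
    s3 = boto3.client("s3")

    class S3PageStore:
        """Each database page lives at its own key, so pages can be PUT in
        parallel by the writer and fetched individually by readers."""
        def __init__(self, bucket, db_name):
            self.bucket = bucket
            self.prefix = db_name + "/pages/"

        def _key(self, page_no):
            return "%s%010d" % (self.prefix, page_no)

        def write_page(self, page_no, data):
            assert len(data) == PAGE_SIZE
            s3.put_object(Bucket=self.bucket, Key=self._key(page_no), Body=data)

        def read_page(self, page_no):
            resp = s3.get_object(Bucket=self.bucket, Key=self._key(page_no))
            return resp["Body"].read()

Strong read-after-write means a reader that learns about a page (say, from a freshly written manifest) will actually see that page's latest contents.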
You could shard your SQLite files to only contain the data for a single user, if your application allows you to do that. It would make it a bit painful to run migrations, but other than that you've now got an infinitely scaling database with 9 nines of uptime for no money at all.
Obviously, a sane implementation would divide the db file into blocks and write them to separate objects. Not as small as traditional FS blocks; 1-4 MB, perhaps.
Before this update, S3 provided "read-after-write" consistency for object creation at a unique path[0] with PUT and subsequent GET operations. Replacement wasn't consistent. So I think you could have worked out some kind of atomic pattern with S3 objects, using new keys for each state change.
The ETag is not a reliable hash of the file contents. It will be different if the file was uploaded in multiple parts (as the CLI does) than if it was written in a single operation (like copying from one bucket to another). You can tell the difference because the ETag will have a "-" in it if it came from a multipart upload.
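For example, a quick check along these lines before trusting the ETag as a whole-file checksum:

    def etag_is_plain_md5(etag):
        """A multipart upload's ETag looks like '<md5-of-part-md5s>-<part count>',
        so a '-' means it cannot be compared against the MD5 of the whole file."""
        return "-" not in etag.strip('"')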
This is really cool. We actually got burned on weak S3 consistency a few weeks ago when generating public download links for a customer. Took us a few hours of troubleshooting to realize they had downloaded a cached/older version of the software we had uploaded to the same URL just a few minutes prior. Resolution was to use unique paths per version to guarantee we were talking to the right files each time.
One potentially related item I was thinking about - How does HN feel about the idea of a system that has eventual durability guarantees which can be inspected by the caller? I.e. Call CloudService.WriteObject(). It writes the object to the local datacenter and asynchronously sends it to the remote(s). It returns some transaction id. You can then pass this id to CloudService.TransactionStatus() to determine if a Durable flag is set true. Or, have a convenience method like CloudService.AwaitDurabilityAsync(txnid). In the most disastrous of circumstances (asteroid hits datacenter 1ms after write returns to caller), you would get an exception on this await call after a locally-configured timeout in which case you can assume you should probably retry the write operation.
I was thinking this might be a way to give the application a way to decide how to deal with the concept of latency WRT cross-site replication. You could not await the durability for 4 nines or wait the additional 0-150 ms to see more nines. I wonder how this risk profile sits with people on the business side. I feel like having a window of reduced durability per object that is only ~150ms wide and up-front can be safely ignored. Especially considering the hypothetical ways in which you could almost instantaneously interrupt the subsequent activities/processing with the feedback mechanism proposed above.
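Roughly the API shape I'm imagining, as a toy in-process model (all of the names and the 150 ms figure are placeholders):

    import asyncio
    import uuid

    class CloudService:
        """Toy model: writes ack locally right away and become 'durable' once
        the asynchronous replication to the remote site(s) finishes."""
        def __init__(self):
            self._durable = {}  # txn_id -> asyncio.Event

        async def write_object(self, key, data):
            txn_id = str(uuid.uuid4())
            self._durable[txn_id] = asyncio.Event()
            # ... write to the local datacenter here, then replicate in the background.
            asyncio.create_task(self._replicate(txn_id, key, data))
            return txn_id

        async def _replicate(self, txn_id, key, data):
            await asyncio.sleep(0.150)  # stand-in for cross-site replication latency
            self._durable[txn_id].set()

        def transaction_is_durable(self, txn_id):
            return self._durable[txn_id].is_set()

        async def await_durability(self, txn_id, timeout=1.0):
            # Times out -> the caller assumes the worst and retries the write.
            await asyncio.wait_for(self._durable[txn_id].wait(), timeout)

The caller gets to choose, per object, whether "locally durable" is good enough or whether to await the extra nines.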
> they had downloaded a cached/older version of the software we had uploaded to the same URL
This is a great example (to me) of how weak consistency is useful. It exposed a problem you otherwise wouldn't notice.
You were deploying offline artifacts to customers without giving the customer the version of the new artifact. Regardless of backend consistency, the customer will not be able to tell (from the URL, anyway) what version they are downloading. Nor will they be able to, say, download the old version if the new version introduces bugs. By changing your deployment method to use unique URLs per version, your customer gains useful functionality and you avoid having to depend on strong consistency (which actually removes expensive requirements).
If you design for weak consistency, your system ends up more resilient to multiple problems.
> I was thinking this might be a way to give the application a way to decide how to deal with the concept of latency WRT cross-site replication.
What if your application crashes in the middle of waiting?
Ultimately if you want full consistency you need it everywhere. Your workflow writing to consistent storage needs to be an atomic output of its own inputs with a scheduler that will re-run the processing if it hasn't committed its output durably.
The application crashing while awaiting durability is equivalent to not awaiting the assurance mechanism in the first place.
Ways to recover from this could include simply scanning through entities modified previously to determine durability status. Other application nodes could participate in this effort if the original is not recoverable. An explicit transaction should be used in cases where partial completion is not permissible, as with any complex database operation.
This is a tricky way to think about the problem because you are still getting transactional semantics regardless. The only thing that would be in question is if your data can survive an asteroid impact. With this model, even with an application crashing, there is still an exceedingly high probability that any given object will survive into additional asynchronous destinations (assuming a transaction scope is implicit or externally completed).
I think the biggest problem in selling this concept is with the fatalists in the developer/business community who are unwilling to make compromises in their application domain logic in order to conform to our physical reality. From a risk modeling perspective, I feel like we have way more dangerous things to worry about on a daily basis than an RPO in which the latest ~150ms of business data at one site might be lost in an extremely catastrophic scenario. Trying to synchronously replicate data to 2+ sites for literally everything a business does is probably going to cause more harm than good in most shops. If you operate nuclear reactors or manage civilian airspace, then perhaps you need something that can bridge that gap. But, most aren't even remotely close to this level of required assurance.
This was a problem with data lakes / analytics. Small inconsistencies would trash runs; that problem now goes away and removes a lot of janky half-fixes to work around the issue.
For most common use cases it's not really an issue.
Of course the real issue there is those tools treating S3 as something it's not (a filesystem) and building things on it with the expectation of guarantees that it explicitly didn't provide until now.
I dig the feature (strong consistency), but in some ways it just lets tools that were abusing it do so more easily.
That doesn't mean it blindly "mounts" it as a disk and attempts to use it as a filesystem. They're almost certainly using it in a manner that is aware of the specificities of how S3 works.
What if the blind mount works 99.9% of the time? I knew what I was doing was risky, but it was well within my tolerance for pain. Now that we have strong consistency my code is fixed. That's pretty impressive.
While I don't know the specifics I can say that availability is down to engineering practices.
Let's say that their consistency model is achieved via quorum, that is writes write to a quorum of nodes while reads read from a quorum of nodes (of their metadata database) then this guarantees read after write consistency.
The availability aspect of this is just engineering, making sure you're never down to less than a quorum of nodes. E.g. by having redundant power supplies, generators, networks, or whatever engineering it takes to reduce the probability of failure.
There's other aspects, e.g. latency, that may suffer, but again this is solved via engineering. Just throw more hardware at it to bring the latency down. The only time where you absolutely can't solve it is if you provide strong consistency across geographical regions that are far apart, there's no way then not to pay that latency.
This is just another example of why the CAP theorem isn't really as useful for determining the limitations of practical systems as it may seem at first sight.
I hope they write a paper or at least do a talk about how they went about it. Usually these things are more involved than an outsider would think. Agreed that you can beat most of the CAP tradeoffs with significant engineering effort, but it’s usually pretty interesting how they accomplish it. Especially at S3’s scale where the COGs impact of “throw more hardware at it” can cost on the order of tens or hundreds of millions.
>> Let's say that their consistency model is achieved via quorum, that is writes write to a quorum of nodes while reads read from a quorum of nodes (of their metadata database) then this guarantees read after write consistency.
If reads are reading only from a quorum of nodes how do you guarantee they have latest data? In theory, while a node is servicing a read request wouldn’t you need to query “all” other nodes to verify that the data in current node is latest? What if the quorum of nodes don’t have the latest data yet?
You write to a quorum of nodes and you read from a quorum of nodes. Latest is determined by a clock. Say you wrote to 2 out of 3 nodes and read from 2 out of 3 nodes: at least one of the 2 you read from is going to have the latest data. This is how consistency works in Cassandra, for example.
There are other models, for example by using a consensus protocol.
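A toy illustration of the overlap argument (three replicas, W = 2, R = 2, newest timestamp wins):

    import random

    N, W, R = 3, 2, 2                 # replicas, write quorum, read quorum (W + R > N)
    replicas = [dict() for _ in range(N)]

    def write(key, value, ts):
        for r in random.sample(replicas, W):    # any write quorum will do
            r[key] = (ts, value)

    def read(key):
        # Every read quorum overlaps every write quorum in at least one replica,
        # so the highest timestamp seen is the latest committed value.
        seen = [r[key] for r in random.sample(replicas, R) if key in r]
        return max(seen)[1] if seen else None

    write("x", "old", ts=1)
    write("x", "new", ts=2)
    assert read("x") == "new"         # holds no matter which quorums were chosen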
With that you have just moved the question to: how do they ensure that the metadata database is available _and_ strongly consistent at the same time for all the requests?
Because the metadata database is certainly also a distributed one, hence you need to query _all_ the nodes or you can end up in a split-brain situation and lose consistency (or availability, if you choose to take down the system or a part of it).
All you really care about is that P(split_brain) < your availability guarantee. This is when this turns into an engineering problem. If you have redundant top of rack switches, redundant power supplies, redundant network paths, you're not going to end up with a split brain.
An example I like to use is integrated circuits, like your CPU, sure, it can fail in a mode where you lose the internal connectivity between the transistors, and then your CPU is dead. It's a very complex distributed system where engineers have basically taken out any practical possibility of an internal failure. So just because CAP says that if you lose connectivity to some portion of transistors you can't maintain consistency or availability (which is true) doesn't mean a thing in practice. Nobody even worries about how to keep a CPU going if you lose one core.
> All you really care about is that P(split_brain) < your availability guarantee
In other words: availability suffers from it.
> An example I like to use is integrated circuits, like your CPU, sure, it can fail in a mode where you lose the internal connectivity between the transistors, and then your CPU is dead
In that case the system is dead, it doesn't become inconsistent. I get your point though: you can reduce the chance of these cases happening, but this is orthogonal to CAP. CAP says you have to balance between availability and consistency (assuming partitioning is required here). You cannot have both at 100%, no matter how efficient you get.
And if the efficiency stays the same (which we should assume when comparing S3 with and without strong consistency) then giving better consistency guarantees inevitably must reduce availability - that's what my point is.
> hence you need to query _all_ the nodes or can end up in a split-brain situation and lose consistency
I am not sure this is true. It is prohibitively expensive to read from all replicas. All you need to ensure is that any replica has the latest state. See how Azure Storage does it [0]. Another example is Spanner [1] that uses timestamps and not consensus for consistent reads.
> This decoupling and targeting a specific set of faults allows our system to provide high availability and strong consistency in face of various classes of failures we see in practice.
What that says is: for the failures we have seen so far, we guarantee consistency and availability. But how about the other kind of failures? The answer is clear: strong consistency only is guaranteed for _certain expected_ errors, not in the general case. Hence generally speaking, strong consistency is _not_ guaranteed. As simple as that. Saying it is guaranteed "in practice" is, pardon, stupid wording.
> hence you need to query _all_ the nodes or can end up in a split-brain situation and lose consistency (or availability if you choose to down the system or a part of the system).
No. Depending on the type of consensus algorithm, you can get away with reading from a majority of nodes or even fewer: for example, in a Multi-Paxos system you just read from the master.
One side of the partition has quorum and the other does not.
Processing keeps happening on the partition with quorum.
A practical example I worked on is Google's Photon; multi-region logjoining with Paxos for consistency. When the network caused a region to become unavailable all the logjoining work happened in the other regions. Resources were provisioned for enough excess capacity to run at full speed in the case of a single region failure and the N-region processing allowed the redundancy cost to be as low as 100%/N overhead.
The most devious s3 consistency issue I encountered was that S3 bucket configs are actually stored in S3, and were not read-after-write consistent.
It was bad. In order to "update" a bucket config programmatically you need to read the entire existing config, make an update, and then PUT it back (overwriting what's there). The problem was that when you went to read the config it was possibly 15 minutes old or more, and when you put it back you overwrote any changes that hadn't become consistent yet.
I had to work around the issue by storing a high-water mark in the config ID, and then poll an external system to find the expected high water mark before I knew I could safely read and update.
This is awesome, I could have used it for a project we did a while back.
I'm sure that somewhere there is a Product Manager at Amazon who talks to customers they find out are using Google Cloud or Azure and asks them "Why not use AWS?" and they mumble some feature of the other service that they need.
Sometime later, said service shows up as part of AWS. :-)
For me this is the best part of tech rivalries, whether they are in CPU performance/architecture, programming language features/tools, or network services. Pressure to improve.
You have a processing job which dumps a bunch of output files in a directory. A downstream job uses these files as input, sees a new directory, and pulls all the files in the new directory.
Because S3 was not strongly consistent, the downstream job would see an arbitrary subset of the files for a short while after creation, and not just the oldest files. This could cause your job to skip processing input files unless you provided some sort of manifest of all the files it should expect in that batch. So then you'd have to load the manifest, then keep retrying until all the input files showed up on whatever S3 node you were hitting.
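The manifest dance looked roughly like this (a boto3 sketch; names and timeouts are made up):

    import json
    import time
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def wait_for_batch(bucket, manifest_key, timeout=300.0):
        """Old-world workaround: read the manifest, then poll until every listed
        input file is actually visible on the S3 endpoint we're talking to."""
        manifest = json.loads(
            s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read())
        missing = set(manifest["files"])
        deadline = time.time() + timeout
        while missing and time.time() < deadline:
            for key in list(missing):
                try:
                    s3.head_object(Bucket=bucket, Key=key)
                    missing.discard(key)
                except ClientError:
                    pass                        # not visible yet, keep polling
            if missing:
                time.sleep(5)
        if missing:
            raise RuntimeError("batch incomplete, still missing: %s" % sorted(missing))
        return manifest["files"]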
If this is an issue, you can wait X seconds after the file has been created and then start processing it. That gives the file time to become consistent before you process it.
A better idea would be triggering Lambda jobs which either directly process the files as they are added to S3, or which add the files to SQS so that each item in the queue is processed by another Lambda job.
2. If you just take the maximum recorded time and say add 20% padding, then waiting that amount of time to process every dataset could be detrimental to performance.
The example I gave happened to the team I was on in 2017/2018. We had thousands of files totaling terabytes of data in a given batch. The 90th percentile time to consistency was in the low tens of seconds; the 99th percentile was measured in minutes. The manifest-plus-retry-until-present method avoids having to put in a sleep(5 minutes) for the 1% of cases.
Our image processing worker queues/servers write an S3 object and dispatch a follow-up job. Currently we have to delay the next job (we use 5 seconds), otherwise the next job may start processing before the S3 object is available (it 404's if the next job runs straight away).
Same issue updating a file: you'd update it on S3, then trigger processing, and even though it was 50 ms later you'd get the OLD version. I didn't know GCP didn't have this issue; that would have been a big selling point.
Okay, that makes sense. When we had very low throughput it looks like consistency wasn't an issue, but at that rate (especially if it occurs in bursts) I can see how the prior S3 implementation could fail to scale fast enough.
No, we didn't try retrying (well actually if the job failed because of a 404 it would retry the whole job based on exponential backoff). When we encountered the error we simply added the delay as it was an easy config change to our SQS queues and doesn't meaningfully affect our use case.
We used to use S3 for Maven artifact storage. This is mostly an append-only workload, however Maven updates maven-metadata.xml files in place. These files contain info about what versions exist, when they were updated, what the latest snapshot version is, etc. We would see issues where a Maven build publishes to S3, and then a downstream build would read an out-of-date maven-metadata.xml and blow up. Or worse, it could silently use an older build and cause a nasty surprise when you deploy. It only happened a small percentage of the time, but when you’re doing tens of thousands of builds per day it ends up happening every day.
We switched to GCS for our Maven artifacts and the problem went away.
To be clear: the "blowing up" would occur when a client observed the new maven-metadata.xml file, but old ("does-not-exist") records for the newly uploaded artifact, correct?
With this update, ordering the metadata update after the artifact upload means this failure is now impossible.
For some context, S3 used to provide read-after-write consistency on new objects, but only if you didn't do a GET or HEAD to the key before it was created. So this access pattern is all good:
PUT new-key 200
GET new-key 200
However, this access pattern would be unreliable:
GET new-key 404
PUT new-key 200
GET new-key 200/404
This last GET could return a 200, or it could return a cached 404. Unfortunately, Maven uses this access pattern when publishing maven-metadata.xml files, because it needs to know whether it should update an existing maven-metadata.xml file or create one if it doesn't exist yet. So when publishing a new version, it does something like this:
GET com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 404
PUT com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 200
And then downstream builds would try to resolve version 1.0-SNAPSHOT, and the first thing Maven does is fetch the maven-metadata.xml:
GET com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 404
So when publishing a new version, you could get a cached 404 and fail loudly. However, when updating an existing maven-metadata.xml file you could silently read an old maven-metadata.xml, and end up using an out-of-date artifact (which is even more concerning). Here's what one of the maven-metadata.xml files looks like:
https://oss.sonatype.org/content/repositories/snapshots/org/...
Because updating an existing object in S3 didn't have read-after-write consistency, we could have a publishing flow that looked like:
GET com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 200 (v1)
PUT com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 200 (v2)
And then downstream builds would fetch the maven-metadata.xml:
GET com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 200 (v1)
So downstream builds could read a stale maven-metadata.xml file, which results in silently using an out-of-date artifact.
We ended up just switching to GCS because it was relatively straight-forward and gave us the consistency guarantees we want.
One very common one is for situations where you might have a multi-step pipeline to process data
- step 1 generates/processes data, stores it in S3, overwriting the previous copy. triggers step 2 to run
- step 2 runs, fetches the data from s3 for its own processing. However, because only a few seconds have elapsed, step 2 fetches the old version of data from the S3 bucket
You can work around this by, for example, always using unique S3 object keys, but then you have to coordinate across the data processing steps, and it becomes harder to manage things like storing only the 10 latest versions.
The Argo workflow tool (https://argoproj.github.io/) is one example of a tool that can suffer from this problem.
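A sketch of the unique-key workaround plus pruning (the prefix layout is made up):

    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3")

    def publish(bucket, dataset, body, keep=10):
        """Write every run to a fresh key so readers can never see a half-updated
        object, then prune everything beyond the newest `keep` runs."""
        ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
        key = "%s/run-%s.parquet" % (dataset, ts)
        s3.put_object(Bucket=bucket, Key=key, Body=body)

        # List existing runs (keys sort by timestamp) and delete the oldest ones.
        keys = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=dataset + "/run-"):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        for old_key in sorted(keys, reverse=True)[keep:]:
            s3.delete_object(Bucket=bucket, Key=old_key)
        return key

(Of course, before this announcement the listing in the pruning step was itself only eventually consistent, which is part of what made this coordination painful.)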
we use s3 in a repo pattern pretty heavily and I had no idea they didn't have read-after-write consistency. I've noticed s3 will hold connections open by sending back blank lines and always assumed it was synchronizing for consistency.
Has anyone ever seen S3 behave eventually consistent? I have not seen a lot of eventual consistency in the real world but I wonder if I'm just working on the wrong problems?
Yes. In fact I spent months working on a system to workaround all the problems it caused for Hadoop: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hado.... We had a lot of automated tests that would fail because of missing (or more recently, out of date) files a significant portion (maybe 20-30%) of the time. We believed that S3 inconsistency was to blame but I always felt like that was the typical "blame flaky infra" cop-out. As soon as S3Guard was in place all those tests were constantly green. It really was S3 inconsistency, routinely.
To be fair to S3, Hadoop pretending S3 is a hierarchical filesystem is a bit of a hack. But I had cases where new objects wouldn't be listed for hours. There's only so much you can do to design an application around that, especially when Azure and Google Storage don't have that problem.
Yes, a ton. The big issue for me was: Does key exist? -> No -> Upload object with key. Followed by: Get key -> Not found. Less often, I'd see an upload of an object followed by an overwrite of that object, followed by a download getting some combination of the two if the download was segmented (like AWS's CLI will do).
The first one happened often enough to cause problems, the second one was a fairly rare event, but still had to be handled.
It might have been more of an issue with "large" buckets. The bucket in particular where I had to dance around the missing object issue had ~100 million fairly small objects. I ended up having to create a database to track if objects exist so no consumer would ever try to see if a non-existent object existed.
Even more confusing, the 404 on GET is caused by doing the HEAD when the object doesn't exist. Without the previous HEAD, the GET would usually succeed.
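In other words, the workaround was to never probe first: just attempt the GET and treat the 404 as the answer, e.g.:

    import boto3

    s3 = boto3.client("s3")

    def get_if_exists(bucket, key):
        """Avoid HEAD-then-GET: a HEAD on a not-yet-visible key could poison the
        negative cache, so just GET and treat NoSuchKey as 'not there'."""
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3.exceptions.NoSuchKey:
            return None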
Yes. Writing objects from EMR Spark to S3 buckets in a different account. Immediately setting ACLs on the new objects fail as the objects are not yet “there”. Using EMRFS Consistent View was previously the obvious solution here, but it adds significant complexity.
EMRFS consistent view is a mess. Changes to S3 files are likely to break EMRFS, the metadata needs to be kept in sync (otherwise the table will keep increasing forever) with tools that are a pain to install, and when I used it to run Hive queries from a script, it didn't work.
Yup, the correct answer in the face of a network partition is for the PUT to return 500, even though the file really was uploaded. The API handler can't prove success, due to the network partition, so the correct move is to declare failure and let the client activate their contingencies.
I've never seen the minutes long delays that you were theoretically supposed to accommodate, but second-scale delays were pretty common in my team's experience. (We're writing a lot of files though, probably hundreds or thousands per second aggregated across all our customers.)
I ran across issues where files that we PUT in S3 were still not visible in Hive 6+ hours later. It only happened a few times out of hundreds of jobs per day for jobs running daily the last 2-3 years, but it was annoying to deal with when it happened because there wasn't a lot we could do beyond open tickets with AWS.
I've observed eventual consistency with S3 prior to this change where it took on the order of several hundred milliseconds to observe an update of an existing key.
There used to be a behavior with 404 caching that was pretty easy to trigger: if you race requesting an object and creating it, subsequent reads could continue to 404.
Not quite _always_. There were some documented caveats... the one I hit before was:
Read nonexistent key, followed by a write of that same key, and then a subsequent read could return a stale read saying it didn’t exist. (even though it had just been written for the first time.)
Anyway I am glad to see these gaps and caveats have been closed.
I think they use cloud providers in regions where they don't have physical footprints, to offer data locality guarantees.
But the testimonial only talks about Dropbox's data lake. So this is not about dropbox's main product storage, it's "just" their data lake for analytics.
(which is apparently 32PB !)
In the early days of S3 this was a real problem. Over the years they've beefed up the backend so objects became consistent faster, but this is a huge leap.
I already had enough of a "catching up with the latest AWS tech" problem with Amazon. This is a welcome addition, but with daily announcements like this, life gets tougher for an AWS architect who likes to stay up to date!
Just today Proton was announced, and now this! This is going to be a hectic month :|
If you want consistency, you have to pay for it with latency, since the system has to do work in the background to ensure that the data you read is really the latest.
If you care more about latency than consistency you have an 'eventually consistent' system, where a write will eventually propagate, but a read might get stale data.
> If you care more about latency than consistency you have an 'eventually consistent' system, where a write will eventually propagate, but a read might get stale data.
Not just stale data, you can also have states which never actually existed. I'll steal the example from Doug Terry's paper "Replicated Data Consistency Explained Through Baseball" because it's really good. Linked below.
Say you have a baseball game which is scored by innings. It's the middle of the 7th inning, and the true write log for the state of the game puts the score at 2-5.
If you were to read the score at this point in time, and your system is strongly consistent, the score can only be 2-5 or a refusal to serve the request. If your system is eventually consistent, the score can be any of the following: 0-0, 0-1, 0-2, 0-3, 0-4, 0-5, 1-0, 1-1, 1-2, 1-3, 1-4, 1-5, 2-0, 2-1,2-2, 2-3, 2-4, 2-5.
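To make that concrete, here's a little simulation where each key may be read from a replica that has applied a different prefix of the write log (the run sequence itself is made up for illustration; the final score is 2-5 as above):

    import random

    # One write per run scored, in game order.
    write_log = [("home", 1), ("home", 2), ("visitors", 1), ("home", 3),
                 ("home", 4), ("visitors", 2), ("home", 5)]

    def eventually_consistent_read():
        """Each key is served by a replica that has applied some arbitrary
        prefix of the write log, and the two reads need not agree."""
        score = {}
        for team in ("visitors", "home"):
            applied = write_log[:random.randint(0, len(write_log))]
            score[team] = max([runs for t, runs in applied if t == team], default=0)
        return score["visitors"], score["home"]

    # A strongly consistent read can only return (2, 5); this can return anything
    # from (0, 0) up to (2, 5), including scores the game never actually had.
    print(sorted({eventually_consistent_read() for _ in range(10000)}))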
If you've ever worked on distributed storage you'll know that this isn't fabricated. It just takes effort to fix, and S3 is pretty much the oldest distributed store in existence, with lots of tech debt to overcome.
Mmyeah, never had this issue on non-AWS distributed storage. Luckily for AWS, there are plenty of inexperienced devs who think this is a real issue in a properly designed large storage system.
I agree nobody knew what the cloud was becoming. But it feels like AWS is selling an unfinished product, and people are wasting money and time working around non-issues. AWS is great for medium-sized projects, but as soon as your project grows you bump into silly issues like this one. Given that S3 is more or less a file store, you don't really expect issues such as "weak" consistency. You expect a file to be committed once written (yes, AWS can do queueing behind the scenes), but when it returns a 200 OK it means the file was stored. Otherwise the good folks at AWS should return a 202 Accepted response, as all properly designed APIs do, and let the user know that a delay in reading is expected.
https://i.ibb.co/DtxrRH3/eventual-consistency.png