If anyone is wondering why this is huge: it's because of the problems you face when you don't have strong consistency. See this picture to get an understanding of the kinds of problems you might run into with eventual consistency:
I believe this makes Amazon S3 behave more similar to Azure blob storage[1] and Google Cloud Storage[2], which is pretty convenient for folks who are abstracting blob stores across different offerings.
For what it's worth, consistency in S3 was usually pretty good anyway, but in the past I ran into issues where it could vary a bit. If you designed your application with this in mind, of course, it shouldn't be an issue. In my case I believe we just had to add some retry logic. (And of course, that is no longer necessary.)
This is super awesome for customers. I am also beyond excited for all the open-source connectors to finally be simplified so that they don't have to deal with the "oh right, gotta be careful because of list consistency". It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
I'm now much more inclined to make my GCS FUSE rewrite in Rust also support S3 directly :).
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop
It's been a few years, but I lost customer data on gs:// due to listing consistency - create a bunch of files, list dir, rename them one by one to commit the dir into Hive - listing missed a file & the rename skipped the missing one.
Put in a breakpoint and the bug disappears; run the rename and the listing from the same JVM and the bug disappears.
Consistency issues are hell.
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
Yes!
S3Guard was a complete pain to secure: since it was built on top of DynamoDB, and since any client could be a writer for some file (any file), you had to give all clients write access to the entire DynamoDB table, which was quite a security headache.
Do you have a link to the commits that removed the code? It'd be good to see what sort of complexity this kind of strong consistency can make redundant.
In the case of Hadoop's S3 connector, this could eliminate this entire directory, plus its tests, plus a bunch of hooks in the main code: https://github.com/apache/hadoop/tree/trunk/hadoop-tools/had.... There's an argument in favor of keeping it in case other S3-compatible stores need it (though you'd still need DynamoDB or some equivalent) and because it makes metadata lookups so much faster than S3 scans, which helps with query planning performance. But I imagine even fewer people will take the trade-off now that Amazon S3 itself is consistent.
Not yet, it’s too crappy :). I only get to it every once in a while and it shows.
As an example, until last weekend, I would just hardcode an access token rather than doing an oauth dance. Luckily, tame-oauth [1] from the folks at Embark was reasonably easy to integrate with.
I also got a little depressed that zargony/rust-fuse was stuck on an ancient version until I learned that Chris Berner from OpenAI had forked it earlier this year to modernize / update it [2].
That's pretty incredible. With no price increase either. Using s3 as a makeshift database is possible in a lot more scenarios now.
I especially like using s3 as a db when dealing with ETLs, where the state of some data is stored in its key prefix. This means that the etl can stop / resume at any point with a single source of truth as the database.
The potential drawback is cost of course; moving (renaming) is free but copying is not. S3's biggest price pain is always its PUTs. In many ETLs this is usually a non-issue because you have to do those PUTs regardless, as you probably want the data to be saved at the various stages for future debugging and recovery.
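Roughly the pattern I mean, as a minimal boto3 sketch (the bucket and prefix names are made up; since S3 has no native rename, a "move" is a copy followed by a delete):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-etl-bucket"  # hypothetical bucket name

    def advance_state(key, old_state, new_state):
        """'Rename' an object so its key prefix reflects the new ETL state."""
        assert key.startswith(old_state + "/")
        new_key = new_state + "/" + key[len(old_state) + 1:]
        # S3 has no rename: copy to the new key, then delete the old one.
        s3.copy_object(Bucket=BUCKET, Key=new_key,
                       CopySource={"Bucket": BUCKET, "Key": key})
        s3.delete_object(Bucket=BUCKET, Key=key)
        return new_key

    def pending_work(state="pending"):
        """Resuming the ETL is just listing whatever is still under 'pending/'."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=state + "/"):
            for obj in page.get("Contents", []):
                yield obj["Key"]

With strongly consistent listing, pending_work() now reliably reflects every advance_state() call that has already returned.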
Yup, me too! Check out this state_doc interface I wrote in Python for building a state machine where the state is stored in S3. I never bothered to solve for the edge case of inconsistent reads because my little state machine slept for hours waiting for work and then only took minutes between waking up and doing the work to verify restores.
With the news of S3's strong consistency, my program is immediately safer against some set of defects: the new properties of the underlying datastore make my code's naive/invalid assumptions true.
This is one reason I have been a big fan of Google Cloud Storage over AWS S3: at a past company AWS consistency was a huge pain, and GCS has had this for years.
Ask HN: I have a question regarding strong consistency. The question is related to all geo distributed strong consistent storages, but let me pick Google Spanner as example.
Spanner has a maximum upper bound of 7 milliseconds of clock offset between nodes, thanks to TrueTime. To achieve external consistency, Spanner simply waits 7 ms before committing a transaction. After that amount of time the window of uncertainty is over and we can be sure that every transaction is ordered correctly. So far so good, but how does Spanner deal with cross-datacenter network latency? Let's say I commit a transaction to Spanner in Canada and after 7 ms I get my confirmation, but now in Australia someone also makes a conflicting transaction and also gets their confirmation after 7 ms. Spanner, however, bound to the laws of physics, can only see this conflict after a 100+ ms delay due to network latency between the datacenters. How is that problem solved?
The simple answer is that there are round trips required between datacenters when you have a quorum that spans data centers. Additionally, one replica of a split is the leader, and the leader holds write locks. So already you have to talk to the leader replica, even if it's out of the DC you're in. Getting the write lock overlaps with the transaction commit though. So for your example if we say the leader is in Canada and the client is in Australia, and we're doing a write to row 'Foo' without first reading (a so called blind write):
Client (Australia) -> Leader (Canada): Take a write lock on 'Foo' and try to commit the transaction
Leader -> other replicas: Start commit, timestamp 123
Other replicas -> Leader: Ready for commit
Leader waits for a majority to be ready to commit and for timestamp 123 to be before the current TrueTime interval
Leader -> Other replicas: Commit transaction, and in parallel replies to client.
Of course there are things you can do to mitigate this depending on your needs, but there's no free lunch. If you have a client in Australia and a client in Canada writing to the same data you're going to pay for round trips for transactions.
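If it helps, here's a toy sketch of the commit-wait idea (the TrueTime-style interval and the replica prepare/apply methods are made up for illustration, not Spanner's actual code):

    import time
    from dataclasses import dataclass

    @dataclass
    class TTInterval:
        earliest: float  # true time is guaranteed to be >= earliest
        latest: float    # ... and <= latest (bounded clock uncertainty)

    def tt_now(uncertainty=0.007):
        """Stand-in for TrueTime: returns an interval containing the true time."""
        t = time.time()
        return TTInterval(t - uncertainty, t + uncertainty)

    def leader_commit(replicas, txn):
        """Leader-side commit: pick a timestamp, replicate, then commit-wait."""
        ts = tt_now().latest                                # chosen commit timestamp
        acks = sum(r.prepare(txn, ts) for r in replicas)    # hypothetical replica API
        if acks < len(replicas) // 2 + 1:
            raise RuntimeError("no quorum")
        # Commit-wait: don't acknowledge until ts is definitely in the past, so no
        # later-starting transaction anywhere can be assigned an earlier timestamp.
        while tt_now().earliest < ts:
            time.sleep(0.001)
        for r in replicas:
            r.apply(txn, ts)                                # hypothetical replica API
        return ts

The conflicting write in the other region has to go through the same leader (and take the same lock), which is where the cross-datacenter round trips come in.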
Writes are implemented with pessimistic locking. A write in a Canada datacenter may have to acquire a lock in Australia as well. See page 12 of the paper for the F1 numbers (mean of 103.0ms with standard deviation of 50ms for multi-site commits).
I am curious how much Dropbox pays for data ingress/egress; they migrated storage to their own on-premises data centers, and now they're moving data back to S3 for the data lake.
For any AWS EMR/hadoop users out there, this means the end of emrfs.
For the uninitiated, emrfs recognizes consistency problems but does not fix them. It'll throw an exception if there's a consistency problem, even give you some tools to fix the problem, but the tools may give false positives. The result is you've got to fix some consistency problems yourself, parse items out of the emrfs DynamoDB catalog, match them up to s3, then make adjustments where needed. It's an unnecessary chore.
It surprises me that this issue has not got more attention over the years and thankfully it'll be solved soon.
Side question -- why is s3fs so slow? I get only about 1 put per 10 seconds with s3fs. One would hope that you could use something like s3fs as a makeshift S3 API, but its performance is really horrible.
I can't answer this directly, but I thought I'd chime in with a possible alternative, or at least something else to test.
I'm using rclone with S3[1] (and others). There are commands[2] you can use to just sync/copy files, but it also has a mount option[3]. It also has caching[4] and other things that might help.
This is not expected behavior. An older version had pessimistic locking which prevented concurrently adding a file to a directory and listing that directory, which is the closest matching symptom. I recommend upgrading to the latest version, running with debug flags `-f -d -o curldbg`, and observing what's happening. Please open an issue at https://github.com/s3fs-fuse/s3fs-fuse/issues if your symptoms persist.
One thing I noticed when I went digging through the s3fs source trying to answer that myself is that its uploader is single-threaded. Part of what makes AWS's CLI reasonably fast is that it uses several threads to use as much bandwidth as possible. If I limit AWS's tool to one thread on a box with plenty of bandwidth, it drops my upload from around 135 MiB/s to around 20 MiB/s.
Not sure if that's all of it, or even the majority of the slowness.
Could you expand on this comment? s3fs uses multiple threads for uploading and populating readdir metadata via S3fsMultiCurl::MultiPerform. Earlier versions used curl_multi_perform which may have hidden some of the parallelism from you.
Huh, right you are. I did a cursory search, but whatever I saw clearly was wrong.
That said, something is going on that I can't explain:
# Default settings for aws
$ time aws s3 cp /dev/shm/1GB.bin s3://test-kihaqtowex/a
upload: ../../dev/shm/1GB.bin to s3://test-kihaqtowex/a
real 0m10.312s
user 0m5.909s
sys 0m4.204s
$ aws configure set default.s3.max_concurrent_requests 4
$ time aws s3 cp /dev/shm/1GB.bin s3://test-kihaqtowex/b
upload: ../../dev/shm/1GB.bin to s3://test-kihaqtowex/b
real 0m26.732s
user 0m4.989s
sys 0m2.741s
$ time cp /dev/shm/1GB.bin s3fs_mount_-kihaqtowex/c
real 0m26.368s
user 0m0.006s
sys 0m0.699s
(I did multiple runs, though not hundreds so I can't say with certainty this would hold)
I swear the difference was worse in the past, so maybe things have improved. Still not worth the added slowness to me for my use cases.
Thanks for testing! You may improve performance via increasing -o parallel_count (default="5"). Newer versions of s3fs have improved performance but please report any cliffs at https://github.com/s3fs-fuse/s3fs-fuse/issues .
One fun fact that I learned recently: a prefix is not strictly path-delimited. I would think of /foo/bar and /foo/baz and /bar/baz as having two prefixes, but it could be anywhere from one to three, depending on how S3 has partitioned your data.
I asked around and it seems that `prefix` here is the exact value of the `prefix` or `key` query parameter in the API. So you can make 5500 ListObjects requests per second for prefix=/a/b and simultaneously make 5500 GetObject requests per second for key=/a/b/file1 and not be rate-limited.
In other words, the rate-limiting key is `${HTTP_METHOD}(${QUERY_PARAM_KEY}|${QUERY_PARAM_PREFIX})`.
This is a major issue when turning batch data into individual files, especially with dynamic prefixes. You can and will hit this edge pretty quickly using EMR or Glue, which try to run as fast as possible. The only answer is... slow down your writes or just try again, which is a bit frustrating.
Neat! Would this make it possible in the future for S3 to support CAS like operations for PutObject? Something like supporting `if-match` and `if-none-match` headers on the etag of an object would be so useful.
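Something like this, purely hypothetical since S3 didn't honor these precondition headers at the time of writing (sketched as a plain HTTP PUT with the requests library against an object URL you're authorized for):

    import requests

    def cas_put(url, body, expected_etag=None):
        """Hypothetical compare-and-swap PUT using standard HTTP preconditions."""
        headers = {}
        if expected_etag is None:
            headers["If-None-Match"] = "*"       # only create if it doesn't exist yet
        else:
            headers["If-Match"] = expected_etag  # only overwrite this exact version
        resp = requests.put(url, data=body, headers=headers)
        if resp.status_code == 412:              # Precondition Failed: lost the race
            return False
        resp.raise_for_status()
        return True

With strong consistency in place, the hard part (knowing which version you're conditioning on) seems much more tractable.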
Listing and update consistency is great, but the ACID SQL-on-S3 frameworks (Apache Hudi, Delta Lake) also require atomic rename to support transactions.
HopsFS-S3 solves the ACID SQL-on-S3 problem discussed here on HN last week ( https://news.ycombinator.com/item?id=25149154 ) by adding a metadata layer over S3 and providing a POSIX-like HDFS API to clients. Operations like rename, mv, chown, chmod are metadata operations - rename is not copy and delete.
Once libraries and tools start relying on this, it is going to make life interesting for the S3-compatible players. The API remains the same, but the behavior is quite different.
It is the smaller players I'm thinking of, which needed S3 compatibility for uptake. Openstack Swift, Backblaze, Ceph... I am not sure if any of these suddenly stopped being S3 compatible due to this announcement.
Ceph has strong consistency by design. S3 compatibility in Ceph comes via the Rados Gateway (RGW) which just sits atop normal Ceph. This means that the strong consistency is carried over to RGW. So, no change for Ceph.
OpenStack Swift has an eventually consistent design. If the Swift community desires to maintain strict compatibility with S3 regarding consistency model, it would be a challenge. Despite being a challenge, I would never bet against the Swift developers. If they want to do it, I'm quite sure that they can. Having said all of that, I doubt that the Swift community would consider this a burning issue that needs to be solved. They've existed for over 10 years with an eventually consistent model and their user base is well aware of it.
Not necessarily. I think Spanner's magic is most useful for bounded-staleness reads and scaling reads through non-leader replicas. Otherwise I think you're looking at normal "NewSQL" guarantees.
The previous behavior (new keys weren't strongly consistent) was weird, and it broke a lot of patterns that AWS itself sort of guides you towards.
For example, you can set up an SQS queue to be informed of writes into a bucket. The idea being that some component writes into the bucket, and a consumer then fetches the data & processes it. Except — gotcha! — there wasn't a guarantee of consistency. It would usually work, though, and so you'd get these weird failures that you couldn't explain unless you already knew the answer, really.
Same thing with how S3 object names sort of feel like file names, and that you might use them for something useful, and you design around that, and then you hit the section of the docs that says "no, you shouldn't, they should be uniformly balanced". (Though honestly, I've never felt any negative side-effects from ignoring this, unlike the consistency guarantees.)
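To make the SQS example concrete, consumers used to need a retry along these lines (a rough boto3 sketch; the backoff numbers are arbitrary and the bucket/key come from the S3 event notification):

    import time
    import boto3

    s3 = boto3.client("s3")

    def fetch_with_retry(bucket, key, attempts=5):
        """Under the old model, the object named in the event might not be
        visible yet on the endpoint you hit, so retry (or dead-letter) on 404."""
        for i in range(attempts):
            try:
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except s3.exceptions.NoSuchKey:
                time.sleep(2 ** i)  # back off and hope the write becomes visible
        raise RuntimeError("%s still not visible after %d attempts" % (key, attempts))

With read-after-write consistency, the happy path is now also the only path.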
> then you hit the section of the docs that says "no, you shouldn't, they should be uniformly balanced". (Though honestly, I've never felt any negative side-effects from ignoring this
You likely won’t, unless you hit a bucket size that has at least a billion small files inside. If you get into the trillion and above range and are heavily unbalanced due to a recent deploy of yours, ooops, there goes S3. Or it did in 2013 anyway.
Are there any scenarios where you would want eventual consistency over strong consistency? (Assuming pricing, performance, replication etc are the same.)
Your last sentence is the answer I've always heard given: it's a performance trade-off. Do you wait for every node in the system to acknowledge an update and purge caches, or return as soon as a sufficient durability threshold has been hit?
These are synonyms.
How do you scale? Horizontally. And how do you do that without degrading performance? The same way you increase availability: eventual consistency.
No. A magical do-everything datastore would have strong consistency. Eventual consistency is something you might settle for given the performance cost trade offs.
To see this, consider that strong consistency has all the guarantees of eventual consistency, plus some more. And the additional guarantees might make your application much easier to write.
It wasn’t a bug, it was well-documented behavior. However, it was the root of many bugs by people building on top of S3 that did not take the eventual consistency into account.
I’ll bet this release is going to fix a lot of weird bugs in systems out there using S3.
I'm pulling my hair out because people are using eventual consistency to mean "eventual consistency with a noticeable lag before the consistency is achieved", which bothers the pedant in me.
I mean, technically, strong consistency implies eventual consistency (with lag = 0). But everyone's equating eventual consistency with the noticeable lag itself, implying that EC per se is a bad thing.
For an analogy, it would be like if people were talking about how rectangular cakes suck, because <problems of unequal width vs length>, and thus they use square cakes, but ... square cakes are rectangular too.
(What's going on is that people use <larger set> to mean <larger set, excluding members that qualify for smaller set>. <larger set> = eventual consistency, rectangle; <smaller set> = strong consistency, square.)
Obviously, it's not a problem for communication because everyone implicitly understands what's going on and uses the words the same way, but I wish people spoke in terms of "has anyone actually seen a consistency lag?" rather than "has anyone actually seen eventual consistency?", since the latter is not the right way to frame it, and is actually a good thing to have, which happened both before and after this development.
Eventual consistency with zero lag is not strong consistency. Two simultaneous conflicting writes with eventual consistency will appear successful. Strong consistency guarantees that one of the writes will fail.
EDIT: I think the reason people care about the lag is specifically for this case; if you want to know if your eventually consistent write actually succeeded then you need to know how long to wait to check.
Well, in all the threads, they're describing situations with lag. That is, they're equating eventual consistency with the lag, and implying it's therefore bad, when eventual consistency can include unnoticeably small lag.
And thus it’s the lag that they’re actually bothered by, not the EC per se. Hence my clawed out hair.
S3 is almost certainly not fully partition tolerant at the node level and requires some sort of quorum. Other “magical” data stores like Spanner also retain this limitation, they just have very reliable replication strategies.
I’m not claiming it’s a CA system, and the terminology “partition intolerant” is not verboten by Kyle Kingsbury. From your link: “Specifically, partition-intolerant systems must sacrifice invariants when partitions occur. Which invariants?”
The answer in this case is that availability is sacrificed, unless Amazon is making a very misleading claim of strong consistency (per the submission title/link). So it’s CP.
In case “not fully partition tolerant” is what is throwing you off, all I mean by that is that there are likely network partitions of the S3 nodes that will not cause outward failure (e.g. partition of a single node), though some will (e.g. too many partitions to form a quorum, or whatever their underlying consensus implementation relies on).
I guess the simple answer to the GP is "Availability is likely sacrificed", and the clue's in the submission title being all about consistency. I haven't looked too deep into it and am sure there are also still Consistency sacrifices too.
I'm still a little niggled though... "almost certainly not fully partition tolerant at the node level and requires some sort of quorum" - what does this actually mean? I'm confused why you raise the tolerance of a single node to partitions, when the partitions by definition require >1 node?
When I say “at the node level” I’m trying to draw a distinction with the service level view where a client is talking to S3 as a (from their perspective) single entity per region. So a better way of putting it may be the internal view of the service, where partitions are between nodes that are part of S3 itself (as opposed to between a client and S3) and they affect the overall health of the system. I’m not referring to a single node being partition resistant (hence why I said node level, and never mentioned single) which as you said wouldn’t make any sense.
>> Amazon S3 delivers strong read-after-write consistency automatically for all applications, without changes to performance or availability, without sacrificing regional isolation for applications, and at no additional cost.
So they claim performance and availability will remain the same while claiming strong consistency. I was confused at first, but then "same" availability isn't 100% availability. So it is indeed CP.
This is great news for application developers! And probably hard work for the AWS implementors. I will try to update this over the weekend: https://github.com/gaul/are-we-consistent-yet
It will not write the whole file. Instead, each page of the database (default = 4 KB) will be a separate PUT.
The use case is: one-writer multiple readers.
The readers will read an old snapshot of the database (from S3) and the writer will write to the active one. This way a lot of readers can open the DB in read-only mode.
As for the writer, each page will be a separate PUT. Writes will be appended to local log(s) and pages will be synced in the background. This helps exploit lots of parallel PUTs to S3. Once the sync happens, the disk space for that page will be reclaimed.
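A bare-bones sketch of the page-per-object layout I have in mind (the key scheme and class are just illustrative):

    import boto3

    PAGE_SIZE = 4096  # SQLite's default page size
    s3 = boto3.client("s3")

    class S3PageStore:
        """Each database page lives at its own key, so pages can be PUT in
        parallel by the writer and fetched individually by readers."""
        def __init__(self, bucket, db_name):
            self.bucket = bucket
            self.prefix = db_name + "/pages/"

        def _key(self, page_no):
            return "%s%010d" % (self.prefix, page_no)

        def write_page(self, page_no, data):
            assert len(data) == PAGE_SIZE
            s3.put_object(Bucket=self.bucket, Key=self._key(page_no), Body=data)

        def read_page(self, page_no):
            resp = s3.get_object(Bucket=self.bucket, Key=self._key(page_no))
            return resp["Body"].read()

Strong read-after-write means a reader that learns about a page (say, from a freshly written manifest) will actually see that page's latest contents.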
You could shard your SQLite files to only contain the data for a single user, if your application allows you to do that. It would make it a bit painful to run migrations, but other than that you've now got an infinitely scaling database with 9 nines of uptime for no money at all.
Obviously, a sane implementation would divide the db file into blocks and write them to separate objects. Not as small as traditional FS blocks; 1-4 MB, perhaps.
Before this update, S3 provided "read-after-write" consistency for object creation at a unique path[0] with PUT and subsequent GET operations. Replacement wasn't consistent. So I think you could have worked out some kind of atomic pattern with S3 objects, using new keys for each state change.
The ETag is not a reliable hash of the file contents. It will be different if the file was uploaded in multiple parts (as the CLI does) than if it was written in a single operation (like copying from one bucket to another). You can tell the difference because the ETag will have a "-" in it if it came from a multipart upload.
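For example, a quick check along these lines before trusting the ETag as a whole-file checksum:

    def etag_is_plain_md5(etag):
        """A multipart upload's ETag looks like '<md5-of-part-md5s>-<part count>',
        so a '-' means it cannot be compared against the MD5 of the whole file."""
        return "-" not in etag.strip('"')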
This is really cool. We actually got burned on weak S3 consistency a few weeks ago when generating public download links for a customer. Took us a few hours of troubleshooting to realize they had downloaded a cached/older version of the software we had uploaded to the same URL just a few minutes prior. Resolution was to use unique paths per version to guarantee we were talking to the right files each time.
One potentially related item I was thinking about - How does HN feel about the idea of a system that has eventual durability guarantees which can be inspected by the caller? I.e. Call CloudService.WriteObject(). It writes the object to the local datacenter and asynchronously sends it to the remote(s). It returns some transaction id. You can then pass this id to CloudService.TransactionStatus() to determine if a Durable flag is set true. Or, have a convenience method like CloudService.AwaitDurabilityAsync(txnid). In the most disastrous of circumstances (asteroid hits datacenter 1ms after write returns to caller), you would get an exception on this await call after a locally-configured timeout in which case you can assume you should probably retry the write operation.
I was thinking this might be a way to give the application a way to decide how to deal with the concept of latency WRT cross-site replication. You could not await the durability for 4 nines or wait the additional 0-150 ms to see more nines. I wonder how this risk profile sits with people on the business side. I feel like having a window of reduced durability per object that is only ~150ms wide and up-front can be safely ignored. Especially considering the hypothetical ways in which you could almost instantaneously interrupt the subsequent activities/processing with the feedback mechanism proposed above.
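Roughly the API shape I'm imagining, as a toy in-process model (all of the names and the 150 ms figure are placeholders):

    import asyncio
    import uuid

    class CloudService:
        """Toy model: writes ack locally right away and become 'durable' once
        the asynchronous replication to the remote site(s) finishes."""
        def __init__(self):
            self._durable = {}  # txn_id -> asyncio.Event

        async def write_object(self, key, data):
            txn_id = str(uuid.uuid4())
            self._durable[txn_id] = asyncio.Event()
            # ... write to the local datacenter here, then replicate in the background.
            asyncio.create_task(self._replicate(txn_id, key, data))
            return txn_id

        async def _replicate(self, txn_id, key, data):
            await asyncio.sleep(0.150)  # stand-in for cross-site replication latency
            self._durable[txn_id].set()

        def transaction_is_durable(self, txn_id):
            return self._durable[txn_id].is_set()

        async def await_durability(self, txn_id, timeout=1.0):
            # Times out -> the caller assumes the worst and retries the write.
            await asyncio.wait_for(self._durable[txn_id].wait(), timeout)

The caller gets to choose, per object, whether "locally durable" is good enough or whether to await the extra nines.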
> they had downloaded a cached/older version of the software we had uploaded to the same URL
This is a great example (to me) of how weak consistency is useful. It exposed a problem you otherwise wouldn't notice.
You were deploying offline artifacts to customers without giving the customer the version of the new artifact. Regardless of backend consistency, the customer will not be able to tell (from the URL, anyway) what version they are downloading. Nor will they be able to, say, download the old version if the new version introduces bugs. By changing your deployment method to use unique URLs per version, your customer gains useful functionality and you avoid having to depend on strong consistency (which actually removes expensive requirements).
If you design for weak consistency, your system ends up more resilient to multiple problems.
> I was thinking this might be a way to give the application a way to decide how to deal with the concept of latency WRT cross-site replication.
What if your application crashes in the middle of waiting?
Ultimately if you want full consistency you need it everywhere. Your workflow writing to consistent storage needs to be an atomic output of its own inputs with a scheduler that will re-run the processing if it hasn't committed its output durably.
The application crashing while awaiting durability is equivalent to not awaiting the assurance mechanism in the first place.
Ways to recover from this could include simply scanning through entities modified previously to determine durability status. Other application nodes could participate in this effort if the original is not recoverable. An explicit transaction should be used in cases where partial completion is not permissible, as with any complex database operation.
This is a tricky way to think about the problem because you are still getting transactional semantics regardless. The only thing that would be in question is if your data can survive an asteroid impact. With this model, even with an application crashing, there is still an exceedingly high probability that any given object will survive into additional asynchronous destinations (assuming a transaction scope is implicit or externally completed).
I think the biggest problem in selling this concept is with the fatalists in the developer/business community who are unwilling to make compromises in their application domain logic in order to conform to our physical reality. From a risk modeling perspective, I feel like we have way more dangerous things to worry about on a daily basis than an RPO in which the latest ~150ms of business data at one site might be lost in an extremely catastrophic scenario. Trying to synchronously replicate data to 2+ sites for literally everything a business does is probably going to cause more harm than good in most shops. If you operate nuclear reactors or manage civilian airspace, then perhaps you need something that can bridge that gap. But, most aren't even remotely close to this level of required assurance.
This was a problem with data lakes / analytics. Small inconsistencies would trash runs; that problem now goes away and removes a lot of janky half-fixes to work around the issue.
For most common use cases it's not really an issue.
Of course the real issue there is those tools treating S3 as something it's not (a filesystem) and building things on it with the expectation of guarantees that it explicitly didn't provide until now.
I dig the feature (strong consistency), but in some ways it just lets tools that were abusing it do so more easily.
That doesn't mean it blindly "mounts" it as a disk and attempts to use it as a filesystem. They're almost certainly using it in a manner that is aware of the specificities of how S3 works.
What if the blind mount works 99.9% of the time? I knew what I was doing was risky, but it was well within my tolerance for pain. Now that we have strong consistency my code is fixed. That's pretty impressive.
While I don't know the specifics I can say that availability is down to engineering practices.
Let's say that their consistency model is achieved via quorum, that is writes write to a quorum of nodes while reads read from a quorum of nodes (of their metadata database) then this guarantees read after write consistency.
The availability aspect of this is just engineering, making sure you're never down to less than a quorum of nodes. E.g. by having redundant power supplies, generators, networks, or whatever engineering it takes to reduce the probability of failure.
There's other aspects, e.g. latency, that may suffer, but again this is solved via engineering. Just throw more hardware at it to bring the latency down. The only time where you absolutely can't solve it is if you provide strong consistency across geographical regions that are far apart, there's no way then not to pay that latency.
This is just another example of why the CAP theorem isn't really as useful for determining the limitations of practical systems as it may seem at first sight.
I hope they write a paper or at least do a talk about how they went about it. Usually these things are more involved than an outsider would think. Agreed that you can beat most of the CAP tradeoffs with significant engineering effort, but it’s usually pretty interesting how they accomplish it. Especially at S3’s scale where the COGs impact of “throw more hardware at it” can cost on the order of tens or hundreds of millions.
>> Let's say that their consistency model is achieved via quorum, that is writes write to a quorum of nodes while reads read from a quorum of nodes (of their metadata database) then this guarantees read after write consistency.
If reads are reading only from a quorum of nodes how do you guarantee they have latest data? In theory, while a node is servicing a read request wouldn’t you need to query “all” other nodes to verify that the data in current node is latest? What if the quorum of nodes don’t have the latest data yet?
You write to a quorum of nodes and you read from a quorum of nodes. Latest is determined by a clock. Say you wrote to 2 out of 3 nodes and read from 2 out of 3 nodes: at least one of the 2 you read from is going to have the latest data. This is how consistency works in Cassandra, for example.
There are other models, for example by using a consensus protocol.
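A toy illustration of the overlap argument (three replicas, W = 2, R = 2, newest timestamp wins):

    import random

    N, W, R = 3, 2, 2                 # replicas, write quorum, read quorum (W + R > N)
    replicas = [dict() for _ in range(N)]

    def write(key, value, ts):
        for r in random.sample(replicas, W):    # any write quorum will do
            r[key] = (ts, value)

    def read(key):
        # Every read quorum overlaps every write quorum in at least one replica,
        # so the highest timestamp seen is the latest committed value.
        seen = [r[key] for r in random.sample(replicas, R) if key in r]
        return max(seen)[1] if seen else None

    write("x", "old", ts=1)
    write("x", "new", ts=2)
    assert read("x") == "new"         # holds no matter which quorums were chosen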
With that you have just moved the question to: how do they ensure that the metadata database is available _and_ strongly consistent at the same time for all the requests?
Because the metadata database is certainly also a distributed one, hence you need to query _all_ the nodes or you can end up in a split-brain situation and lose consistency (or availability, if you choose to take down the system or a part of it).
All you really care about is that P(split_brain) < your availability guarantee. This is when this turns into an engineering problem. If you have redundant top of rack switches, redundant power supplies, redundant network paths, you're not going to end up with a split brain.
An example I like to use is integrated circuits, like your CPU, sure, it can fail in a mode where you lose the internal connectivity between the transistors, and then your CPU is dead. It's a very complex distributed system where engineers have basically taken out any practical possibility of an internal failure. So just because CAP says that if you lose connectivity to some portion of transistors you can't maintain consistency or availability (which is true) doesn't mean a thing in practice. Nobody even worries about how to keep a CPU going if you lose one core.
> All you really care about is that P(split_brain) < your availability guarantee
In other words: availability suffers from it.
> An example I like to use is integrated circuits, like your CPU, sure, it can fail in a mode where you lose the internal connectivity between the transistors, and then your CPU is dead
In that case the system is dead, it doesn't become inconsistent. I get your point though: you can reduce the chance of these cases happening, but this is orthogonal to CAP. CAP says you have to balance between availability and consistency (assuming partitioning is required here). You cannot have both at 100%, no matter how efficient you get.
And if the efficiency stays the same (which we should assume when comparing S3 with and without strong consistency) then giving better consistency guarantees inevitably must reduce availability - that's what my point is.
> hence you need to query _all_ the nodes or can end up in a split-brain situation and lose consistency
I am not sure this is true. It is prohibitively expensive to read from all replicas. All you need to ensure is that any replica has the latest state. See how Azure Storage does it [0]. Another example is Spanner [1] that uses timestamps and not consensus for consistent reads.
> This decoupling and targeting a specific set of faults allows our system to provide high availability and strong consistency in face of various classes of failures we see in practice.
What that says is: for the failures we have seen so far, we guarantee consistency and availability. But how about the other kind of failures? The answer is clear: strong consistency only is guaranteed for _certain expected_ errors, not in the general case. Hence generally speaking, strong consistency is _not_ guaranteed. As simple as that. Saying it is guaranteed "in practice" is, pardon, stupid wording.
> hence you need to query _all_ the nodes or can end up in a split-brain situation and lose consistency (or availability if you choose to down the system or a part of the system).
No. Depending on the type of consensus algorithm, you can get away with reading from a majority of nodes or even fewer: for example, in a Multi-Paxos system you just read from the master.
One side of the partition has quorum and the other does not.
Processing keeps happening on the partition with quorum.
A practical example I worked on is Google's Photon; multi-region logjoining with Paxos for consistency. When the network caused a region to become unavailable all the logjoining work happened in the other regions. Resources were provisioned for enough excess capacity to run at full speed in the case of a single region failure and the N-region processing allowed the redundancy cost to be as low as 100%/N overhead.
The most devious s3 consistency issue I encountered was that S3 bucket configs are actually stored in S3, and were not read-after-write consistent.
It was bad. In order to "update" a bucket config programmatically you need to read the entire existing config, make an update, and then PUT it back (overwriting what's there). The problem was that when you went to read the config it was possibly 15 minutes old or more, and when you put it back you overwrote any changes that hadn't become consistent yet.
I had to work around the issue by storing a high-water mark in the config ID, and then poll an external system to find the expected high water mark before I knew I could safely read and update.
This is awesome, I could have used it for a project we did a while back.
I'm sure that somewhere there is a Product Manager at Amazon who talks to customers they find out are using Google Cloud or Azure and asks them "Why not use AWS?" and they mumble some feature of the other service that they need.
Sometime later, said service shows up as part of AWS. :-)
For me this is the best part of tech rivalries, whether they are in CPU performance/architecture, programming language features/tools, or network services. Pressure to improve.
You have a processing job which dumps a bunch of output files in a directory. A downstream job uses these files as input, sees a new directory, and pulls all the files in the new directory.
Because S3 was not strongly consistent, the downstream job would see an arbitrary subset of the files for a short while after creation, and not just the oldest files. This could cause your job to skip processing input files unless you provided some sort of manifest of all the files it should expect in that batch. So then you'd have to load the manifest, then keep retrying until all the input files showed up on whatever S3 node you were hitting.
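The manifest dance looked roughly like this (a boto3 sketch; names and timeouts are made up):

    import json
    import time
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def wait_for_batch(bucket, manifest_key, timeout=300.0):
        """Old-world workaround: read the manifest, then poll until every listed
        input file is actually visible on the S3 endpoint we're talking to."""
        manifest = json.loads(
            s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read())
        missing = set(manifest["files"])
        deadline = time.time() + timeout
        while missing and time.time() < deadline:
            for key in list(missing):
                try:
                    s3.head_object(Bucket=bucket, Key=key)
                    missing.discard(key)
                except ClientError:
                    pass                        # not visible yet, keep polling
            if missing:
                time.sleep(5)
        if missing:
            raise RuntimeError("batch incomplete, still missing: %s" % sorted(missing))
        return manifest["files"]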
If this is an issue, you can wait X seconds after the file has been created and then start processing it. That gives the file time to become consistent before you process it.
A better idea would be triggering Lambda jobs which either directly process the files as they are added to S3, or which add the files to SQS so that each item in the queue is processed by another Lambda job.
2. If you just take the maximum recorded time and say add 20% padding, then waiting that amount of time to process every dataset could be detrimental to performance.
The example I gave happened to the team I was on in 2017/2018. We had thousands of files totaling terabytes of data in a given batch. The 90th percentile time to consistency was in the low tens of seconds; the 99th percentile was measured in minutes. The manifest-plus-retry-until-present method avoids having to put in a sleep(5 minutes) for the 1% of cases.
Our image processing worker queues/servers write an S3 object and dispatch a follow-up job. Currently we have to delay the next job (we use 5 seconds), otherwise the next job may start processing before the S3 object is available (it 404's if the next job runs straight away).
Same issue updating a file: you'd update it on S3, then trigger processing, and even though it was 50 ms later you'd get the OLD version. I didn't know GCP didn't have this issue; that would have been a big selling point.
Okay, that makes sense. When we had very low throughput it looks like consistency wasn't an issue, but at that rate (especially if it occurs in bursts) I can see how the prior S3 implementation could fail to scale fast enough.
No, we didn't try retrying (well actually if the job failed because of a 404 it would retry the whole job based on exponential backoff). When we encountered the error we simply added the delay as it was an easy config change to our SQS queues and doesn't meaningfully affect our use case.
We used to use S3 for Maven artifact storage. This is mostly an append-only workload, however Maven updates maven-metadata.xml files in place. These files contain info about what versions exist, when they were updated, what the latest snapshot version is, etc. We would see issues where a Maven build publishes to S3, and then a downstream build would read an out-of-date maven-metadata.xml and blow up. Or worse, it could silently use an older build and cause a nasty surprise when you deploy. It only happened a small percentage of the time, but when you’re doing tens of thousands of builds per day it ends up happening every day.
We switched to GCS for our Maven artifacts and the problem went away.
To be clear: the "blowing up" would occur when a client observed the new maven-metadata.xml file, but old ("does-not-exist") records for the newly uploaded artifact, correct?
With this update, ordering the metadata update after the artifact upload means this failure is now impossible.
For some context, S3 used to provide read-after-write consistency on new objects, but only if you didn't do a GET or HEAD to the key before it was created. So this access pattern is all good:
PUT new-key 200
GET new-key 200
However, this access pattern would be unreliable:
GET new-key 404
PUT new-key 200
GET new-key 200/404
This last GET could return a 200, or it could return a cached 404. Unfortunately, Maven uses this access pattern when publishing maven-metadata.xml files, because it needs to know whether it should update an existing maven-metadata.xml file or create one if it doesn't exist yet. So when publishing a new version, it does something like this:
GET com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 404
PUT com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 200
And then downstream builds would try to resolve version 1.0-SNAPSHOT, and the first thing Maven does is fetch the maven-metadata.xml:
GET com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 404
So when publishing a new version, you could get a cached 404 and fail loudly. However, when updating an existing maven-metadata.xml file you could silently read an old maven-metadata.xml, and end up using an out-of-date artifact (which is even more concerning). Here's what one of the maven-metadata.xml files looks like:
https://oss.sonatype.org/content/repositories/snapshots/org/...
Because updating an existing object in S3 didn't have read-after-write consistency, we could have a publishing flow that looked like:
GET com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 200 (v1)
PUT com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 200 (v2)
And then downstream builds would fetch the maven-metadata.xml:
GET com.example/example-artifact/1.0-SNAPSHOT/maven-metadata.xml 200 (v1)
So downstream builds could read a stale maven-metadata.xml file, which results in silently using an out-of-date artifact.
We ended up just switching to GCS because it was relatively straight-forward and gave us the consistency guarantees we want.
One very common one is for situations where you might have a multi-step pipeline to process data
- step 1 generates/processes data, stores it in S3, overwriting the previous copy. triggers step 2 to run
- step 2 runs, fetches the data from s3 for its own processing. However, because only a few seconds have elapsed, step 2 fetches the old version of data from the S3 bucket
You can work around this by, for example, always using unique S3 object keys, but then you have to coordinate across the data processing steps, and it becomes harder to manage things like storing only the 10 latest versions.
The Argo workflow tool (https://argoproj.github.io/) is one example of a tool that can suffer from this problem.
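A sketch of the unique-key workaround plus pruning (the prefix layout is made up):

    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3")

    def publish(bucket, dataset, body, keep=10):
        """Write every run to a fresh key so readers can never see a half-updated
        object, then prune everything beyond the newest `keep` runs."""
        ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
        key = "%s/run-%s.parquet" % (dataset, ts)
        s3.put_object(Bucket=bucket, Key=key, Body=body)

        # List existing runs (keys sort by timestamp) and delete the oldest ones.
        keys = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=dataset + "/run-"):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        for old_key in sorted(keys, reverse=True)[keep:]:
            s3.delete_object(Bucket=bucket, Key=old_key)
        return key

(Of course, before this announcement the listing in the pruning step was itself only eventually consistent, which is part of what made this coordination painful.)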
we use s3 in a repo pattern pretty heavily and I had no idea they didn't have read-after-write consistency. I've noticed s3 will hold connections open by sending back blank lines and always assumed it was synchronizing for consistency.
Has anyone ever seen S3 behave eventually consistent? I have not seen a lot of eventual consistency in the real world but I wonder if I'm just working on the wrong problems?
Yes. In fact I spent months working on a system to workaround all the problems it caused for Hadoop: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hado.... We had a lot of automated tests that would fail because of missing (or more recently, out of date) files a significant portion (maybe 20-30%) of the time. We believed that S3 inconsistency was to blame but I always felt like that was the typical "blame flaky infra" cop-out. As soon as S3Guard was in place all those tests were constantly green. It really was S3 inconsistency, routinely.
To be fair to S3, Hadoop pretending S3 is a hierarchical filesystem is a bit of a hack. But I had cases where new objects wouldn't be listed for hours. There's only so much you can do to design an application around that, especially when Azure and Google Storage don't have that problem.
Yes, a ton. The big issue for me was: Does key exist? -> No -> Upload object with key. Followed by: Get key -> Not found. Less often, I'd see an upload of an object followed by an overwrite of that object, followed by a download getting some combination of the two if the download was segmented (like AWS's CLI will do).
The first one happened often enough to cause problems, the second one was a fairly rare event, but still had to be handled.
It might have been more of an issue with "large" buckets. The bucket in particular where I had to dance around the missing object issue had ~100 million fairly small objects. I ended up having to create a database to track if objects exist so no consumer would ever try to see if a non-existent object existed.
Even more confusing, the 404 on GET is caused by doing the HEAD when the object doesn't exist. Without the previous HEAD, the GET would usually succeed.
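In other words, the workaround was to never probe first: just attempt the GET and treat the 404 as the answer, e.g.:

    import boto3

    s3 = boto3.client("s3")

    def get_if_exists(bucket, key):
        """Avoid HEAD-then-GET: a HEAD on a not-yet-visible key could poison the
        negative cache, so just GET and treat NoSuchKey as 'not there'."""
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3.exceptions.NoSuchKey:
            return None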
Yes. Writing objects from EMR Spark to S3 buckets in a different account. Immediately setting ACLs on the new objects fail as the objects are not yet “there”. Using EMRFS Consistent View was previously the obvious solution here, but it adds significant complexity.
EMRFS consistent view is a mess. Changes to S3 files are likely to break EMRFS, the metadata needs to be kept in sync (otherwise the table will keep increasing forever) with tools that are a pain to install, and when I used it to run Hive queries from a script, it didn't work.
Yup, the correct answer in the face of a network partition is for the PUT to return 500, even though the file really was uploaded. The API handler can't prove success, due to the network partition, so the correct move is to declare failure and let the client activate their contingencies.
I've never seen the minutes long delays that you were theoretically supposed to accommodate, but second-scale delays were pretty common in my team's experience. (We're writing a lot of files though, probably hundreds or thousands per second aggregated across all our customers.)
I ran across issues where files that we PUT in S3 were still not visible in Hive 6+ hours later. It only happened a few times out of hundreds of jobs per day for jobs running daily the last 2-3 years, but it was annoying to deal with when it happened because there wasn't a lot we could do beyond open tickets with AWS.
I've observed eventual consistency with S3 prior to this change where it took on the order of several hundred milliseconds to observe an update of an existing key.
There used to be a behavior with 404 caching that was pretty easy to trigger: if you race requesting an object and creating it, subsequent reads could continue to 404.
Not quite _always_. There were some documented caveats... the one I hit before was:
Read nonexistent key, followed by a write of that same key, and then a subsequent read could return a stale read saying it didn’t exist. (even though it had just been written for the first time.)
Anyway I am glad to see these gaps and caveats have been closed.
I think they use cloud providers in regions where they don't have physical footprints, to offer data locality guarantees.
But the testimonial only talks about Dropbox's data lake. So this is not about dropbox's main product storage, it's "just" their data lake for analytics.
(which is apparently 32PB !)
In the early days of S3 this was a real problem. Over the years they've beefed up the backend so objects became consistent faster, but this is a huge leap.
I already had enough of a "catching up with the latest AWS tech" problem with Amazon. This is a welcome addition, but with daily announcements like this, life gets tougher for an AWS architect who likes to stay up to date!
Just today Proton was announced, and now this! This is going to be a hectic month :|
If you want consistency, you have to pay for it with latency, since the system has to do work in the background to ensure that the data you read is really the latest.
If you care more about latency than consistency you have an 'eventually consistent' system, where a write will eventually propagate, but a read might get stale data.
> If you care more about latency than consistency you have an 'eventually consistent' system, where a write will eventually propagate, but a read might get stale data.
Not just stale data, you can also have states which never actually existed. I'll steal the example from Doug Terry's paper "Replicated Data Consistency Explained Through Baseball" because it's really good. Linked below.
Say you have a baseball game which is scored by innings. It's the middle of the 7th inning, and the true write log for the state of the game puts the score at 2-5.
If you were to read the score at this point in time, and your system is strongly consistent, the score can only be 2-5 or a refusal to serve the request. If your system is eventually consistent, the score can be any of the following: 0-0, 0-1, 0-2, 0-3, 0-4, 0-5, 1-0, 1-1, 1-2, 1-3, 1-4, 1-5, 2-0, 2-1,2-2, 2-3, 2-4, 2-5.
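To make that concrete, here's a little simulation where each key may be read from a replica that has applied a different prefix of the write log (the run sequence itself is made up for illustration; the final score is 2-5 as above):

    import random

    # One write per run scored, in game order.
    write_log = [("home", 1), ("home", 2), ("visitors", 1), ("home", 3),
                 ("home", 4), ("visitors", 2), ("home", 5)]

    def eventually_consistent_read():
        """Each key is served by a replica that has applied some arbitrary
        prefix of the write log, and the two reads need not agree."""
        score = {}
        for team in ("visitors", "home"):
            applied = write_log[:random.randint(0, len(write_log))]
            score[team] = max([runs for t, runs in applied if t == team], default=0)
        return score["visitors"], score["home"]

    # A strongly consistent read can only return (2, 5); this can return anything
    # from (0, 0) up to (2, 5), including scores the game never actually had.
    print(sorted({eventually_consistent_read() for _ in range(10000)}))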
If you've ever worked on distributed storage you'll know that this isn't fabricated. It just takes effort to fix, and S3 is pretty much the oldest distributed store in existence, with lots of tech debt to overcome.
Mmyeah, never had this issue on non-AWS distributed storage. Luckily for AWS, there are plenty of inexperienced devs who think this is a real issue in a properly designed large storage system.
I agree nobody knew what the cloud was becoming. But it feels like AWS is selling an unfinished product, and people are wasting money and time working around non-issues. AWS is great for medium-sized projects, but as soon as your project grows you bump into silly issues like this one. Given that S3 is more or less a file store, you don't really expect issues such as "weak" consistency. You expect a file to be committed once written (yes, AWS can do queueing behind the scenes), but when it returns a 200 OK it means the file was stored. Otherwise the good folks at AWS should return a 202 Accepted response, as all properly designed APIs do, and let the user know that a delay in reading is expected.
https://i.ibb.co/DtxrRH3/eventual-consistency.png