AWS has GPS/atomic clocks in each datacenter that provide an accurate reference time. Recent Linux distros use chrony instead of ntpd to synchronize with the reference time, which should introduce only microseconds of error between the reference time and the system clock.
Am I missing something? I am not an expert, I'm just not seeing where the 100s of ms of error is going to enter this system.
I spent some time looking at TimeSync and my primary takeaway was that while it's nice, there are no hard numbers on how accurate it really is. I suspect it is very accurate, but without details or insight into the service, proving this (to yourself) is going to be challenging if you want to rely on global clocks to avoid consensus. You are essentially making a huge bet: trading consensus for clocks buys performance at the expense of a far, far higher bar for correctness.
It seems very likely, based on when it was rolled out, that it underpins AWS tech like DynamoDB Global Tables, so it almost certainly powers critical infrastructure. But there's no SLA, and no reports on what tolerances you can expect without doing a lot of work on your own. In that sense it's more of a nice bonus than a "product" they offer you, so being wary maybe isn't unwarranted.
IIRC from the original Spanner/TrueTime (TT) paper, they had a general error window of ~10ms from the TT daemons, and I would be extremely surprised if Google hasn't pushed that even lower by now, so the bar you have to clear is much tighter than 100s of ms of error. And yes, the clocks are in the same DC and accurate to within a very tight window, but bugs happen throughout the stack: your hypervisor bugs out, systems get misconfigured, whatever. Your process will fuzz out, especially as you begin to tighten things. You don't have the QA/testing of Spanner or DynamoDB, basically.
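To make the "trading consensus for clocks" bet concrete, here is a minimal sketch of the commit-wait idea from the Spanner paper, written in Python. It is not Google's implementation: `TimeInterval`, `now_interval`, `commit_wait`, and `commit` are hypothetical names, and `epsilon_s` is an error bound you would have to measure or justify yourself, which is exactly the hard part.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest: float  # true time is assumed to be no earlier than this
    latest: float    # ... and no later than this


def now_interval(epsilon_s, clock=time.time):
    """Local clock reading widened by an ASSUMED error bound epsilon_s."""
    t = clock()
    return TimeInterval(t - epsilon_s, t + epsilon_s)


def commit_wait(commit_ts, epsilon_s, clock=time.time, sleep=time.sleep):
    """Block until commit_ts is in the past even under worst-case clock error.

    Returns the total time spent waiting, in seconds.
    """
    waited = 0.0
    while True:
        earliest = now_interval(epsilon_s, clock).earliest
        if earliest > commit_ts:
            return waited
        pause = max(commit_ts - earliest, 1e-4)  # always make some progress
        sleep(pause)
        waited += pause


def commit(epsilon_s, clock=time.time, sleep=time.sleep):
    """Pick a timestamp no earlier than true time, then wait out the uncertainty."""
    ts = now_interval(epsilon_s, clock).latest
    waited = commit_wait(ts, epsilon_s, clock, sleep)
    return ts, waited
```

Every commit pays roughly twice the uncertainty in latency, so a 5 ms bound costs about 10 ms per commit. If the real clock error ever exceeds `epsilon_s` (hypervisor hiccup, misconfiguration), the ordering guarantee silently breaks, which is why the correctness bar is so high.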
None of this is insurmountable, I think, though. It's just not easy any way you cut it. Even a few people doing the work to test and experiment with this would be very valuable. (It would be even better if AWS would make it a real product with real SLAs/numbers to back it up.) It's just a lot of work no matter what.
The fact that it is limited to AWS (for now) is a bit of a shame. I do hope other cloud providers start thinking about providing precise clocks in their datacenters, as well as accompanying software to go with it.
> Recent Linux distros use chrony instead of ntpd to synchronize with the reference time, which should introduce only microseconds of error between the reference time and the system clock.
To be fair, not everyone uses chrony; a lot of systems still use just ntpd or timesyncd. (I spent a lot of time lately fixing time-sync related issues in our Linux distro across all our supported daemons, so I can at least say chrony is a wonderful choice: accurate and very easy to use! I actually found out about it when looking up TimeSync.)
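If you want a first rough number rather than taking anyone's word for it, chrony reports enough to compute its own bound. The chronyc documentation gives the bound as |system time offset| + root dispersion + half the root delay, assuming the reference itself is correct. A sketch in Python follows; the function names are mine, and `SAMPLE` uses made-up values in the `chronyc tracking` field layout, not a measurement from any real host.

```python
import subprocess

# Illustrative values only (NOT a real measurement), in `chronyc tracking` layout.
SAMPLE = """\
Reference ID    : A9FEA97B (169.254.169.123)
Stratum         : 4
System time     : 0.000001501 seconds slow of NTP time
Root delay      : 0.000373169 seconds
Root dispersion : 0.000109000 seconds
Leap status     : Normal
"""


def parse_tracking(text):
    """Split 'Key : value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def clock_error_bound(text):
    """|system time offset| + root dispersion + root delay / 2, in seconds."""
    fields = parse_tracking(text)

    def seconds(name):
        # The numeric value is the first token, e.g. '0.000373169 seconds'.
        return float(fields[name].split()[0])

    return (abs(seconds("System time"))
            + seconds("Root dispersion")
            + 0.5 * seconds("Root delay"))


def read_tracking():
    """Run chronyc on this machine (assumes chrony is installed and running)."""
    result = subprocess.run(["chronyc", "tracking"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `clock_error_bound(read_tracking())` on a live box gives chrony's own estimate. That is still only what the daemon believes; it says nothing about hypervisor stalls or a wrong reference, so treat it as a starting point and not as proof.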
If you need strongly consistent data, you must write to a global table in only one region, and then clocks don't matter because replication does not create conflicts.
If you can survive lost writes, clock skew just makes one region win more or less often. Even if the clocks were in perfect sync, you still wouldn't observe causality across regions (changes to different items can replicate out of order).
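A toy Python model of the "skew just changes who wins" point, assuming last-writer-wins reconciliation on a timestamp taken from each region's local clock. The `Write` class, the tie-break on region name, and the numbers are all hypothetical, chosen only to illustrate the effect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    value: str
    true_time: float  # when the write really happened (not observable in practice)
    skew: float       # that region's clock error, in seconds

    @property
    def stamp(self):
        # The timestamp the region actually attaches to the write.
        return self.true_time + self.skew


def last_writer_wins(writes):
    """Keep the write with the largest timestamp; ties broken by region name."""
    return max(writes, key=lambda w: (w.stamp, w.region))


# Two conflicting writes to the same item, 20 ms apart in real time.
first = Write("us-east-1", "A", true_time=10.000, skew=0.050)  # clock runs 50 ms fast
second = Write("eu-west-1", "B", true_time=10.020, skew=0.0)

winner = last_writer_wins([first, second])
```

Here `winner.value` is `"A"`: the earlier write survives because its region's clock runs fast, and the later write `"B"` is lost. With zero skew on both sides, `"B"` would win. Either way one of the two writes is dropped, so tighter clocks change which one you lose, not whether you lose one.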
(edit: thanks for the great explanation aesipp!)