Making every (leap) second count with our new public NTP servers (googleblog.com)
110 points by scommab on Dec 1, 2016 | 69 comments



“Leap Smearing must not be used for public-facing NTP servers” - https://tools.ietf.org/html/draft-ietf-ntp-bcp-02


There's good reason it's in the standard. Many applications of computer science rely on being able to accurately measure time.

Having every second increased by a non-trivial amount (~0.001%) on some days, and not on others, will produce subtly wrong results in all kinds of fields, from manufacturing to astronomy.

This was a bad, bad choice by Google.


I don't see any possibility for problems, as long as you just do what makes sense.

https://developers.google.com/time/

"We recommend that you don’t configure Google Public NTP together with non-leap-smearing NTP servers."

If you point your NTP clients to time1.google.com through time4.google.com, then don't point them to anything else.

That's all.

If you use Google's time servers you are using them either to be fully in sync with Google or because you like the time-smearing feature. In both cases, just use them, don't mix. Think of it as a non-standard service that is, for convenience, API-compatible with "standard" NTP.

As magicalist pointed out, there are already other smearing algorithms online:

https://developers.google.com/time/smear#othersmears

and Google plans to switch to the new algorithm soon. If all those who need smear standardize around one algorithm, it's going to be even better: there will be one more standard, with the new name, but then it will be even more obvious to everybody what's going on. Obviously both approaches are needed, depending on the usage scenario.


Wow, that's a really boneheaded thing to put in a standard. I think we can all agree that it's important to make leap smearing available for those who want to use it, especially considering the bugs in leap second handling for common NTP clients.


I disagree. The point of NTP, and of time services in general, is that everyone agrees about the time. If an organisation wants to use non-standard time it can, but public-facing NTP servers should all agree and all provide the standard time. Google, for whatever reasons, is making its NTP servers deliberately wrong, and there is no mechanism in NTP for a server to say "I'm using time-smearing". So they shouldn't be doing this on public-facing NTP.


Then NTP has already failed. Most systems are already incapable of agreeing on whether it is 23:59:59 or 23:59:60 on days with leap seconds. There is simply not an API that will let you distinguish the two.

It is better to be deliberately wrong in a controlled fashion than to be accidentally wrong because you never expected your clock to be non-monotonic. You seem to be arguing for the status quo. Are you aware of just how deeply broken the status quo is?


What is your definition of "most systems"? Because we had very few (if somewhat high-profile) leap second bugs since its introduction in 1972.


Unix, for example. That's a pretty big example. Look at gettimeofday. Completely incapable of handling leap seconds in any reasonable way, except if you use smoothing.

Windows, for example. That's another pretty big example. Just ignores the leap second bit and goes backwards at the next synchronization.

I'm not even talking about bugs here—these are straight up design flaws.


There are actually two reasonable ways of handling leap seconds with gettimeofday(). The first, which is in actual use by a range of people, is to define that the kernel time is actually a TAI-10 count not a UTC count. Arthur David Olson's "right" timezone system does this. The second is to allow the microseconds count to go up to 2,000,000.

* http://www.madore.org/~david/computers/unix-leap-seconds.htm...
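A rough sketch of the second option, written in Python rather than C for brevity. It assumes a hypothetical clock that keeps the seconds count at 23:59:59 during the leap second and lets the microseconds field run from 1,000,000 to 1,999,999; `format_with_leap` is a made-up helper, not an existing API.

```python
import time


def format_with_leap(tv_sec, tv_usec):
    """Format a (seconds, microseconds) pair as HH:MM:SS.ffffff in UTC.

    Assumption: during a leap second tv_sec stays at the 23:59:59 value
    and tv_usec carries on past 999_999, up to 1_999_999.
    """
    if not 0 <= tv_usec < 2_000_000:
        raise ValueError("tv_usec out of range")
    tm = time.gmtime(tv_sec)
    second = tm.tm_sec
    if tv_usec >= 1_000_000:
        if second != 59:
            raise ValueError("overflowing microseconds only make sense at :59")
        second = 60              # label the extra second as :60
        tv_usec -= 1_000_000
    return "%02d:%02d:%02d.%06d" % (tm.tm_hour, tm.tm_min, second, tv_usec)
```

A consumer that ignores the overflow and prints the two fields as-is would not produce the `:60` label, so every reader of the clock would need a step like this.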


I wonder how many clients handle that correctly. How many log files will have timestamps at "23:59:59.1500" instead of "23:59:60.500"? If you are going to break APIs you might as well make a new one instead.

And if you replace a simple API with one that requires distributing leap-second tables…


> And if you replace a simple API with one that requires distributing leap-second tables…

Not much worse than the distributed time zone tables we already need to update thrice a year. At least leap seconds aren't decided on by politicians.


> and there is no mechanism in NTP for a server to say "I'm using time-smearing".

That should most definitely be in the standard, along with communicating to the client full details about how smearing is configured.


Sure, but until then, public-facing NTP servers should stick to the current standard.


Is that a standard?

              Network Time Protocol Best Current Practices
                         draft-ietf-ntp-bcp-02
1. Introduction

   NTP Version 4 (NTPv4) has been widely used since its publication as
   RFC 5905 [RFC5905].  This documentation is a collection of Best
   Practices from across the NTP community.


Don't think it is a standard, looks like work in progress.

Imho a bit short notice to publish something like that on Nov. 30th, just when they had to start advertising the leap second in NTP (or not advertise it when smearing). Not sure, but draft-ietf-ntp-bcp-02 makes it sound as if clients may need attention.

   Clients that are connected to leap smearing servers must not apply
   the "standard" NTP leap second handling.  So if they are using ntpd,
   these clients must not have a leap second file loaded, and the
   smearing servers must not advertise that a leap second is pending.


That's an Internet-Draft, which is a work-in-progress of the IETF. It's not a formal specification. https://www.ietf.org/id-info/


> Instead of adding a single extra second to the end of the day, we'll run the clocks 0.0014% slower across the ten hours before and ten hours after the leap second, and “smear” the extra second across these twenty hours.

Holy leaping second, batman! Unilaterally being off by up to a half second from the rest of the world's clocks is a pretty aggressive step. I think I would have preferred to see a resolution made by an independent body on something this drastic.
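For a sense of scale, the quoted numbers can be checked with a few lines of arithmetic. This is a minimal sketch assuming a purely linear smear over the 20-hour window described in the quote; the function names are made up for illustration and this is not Google's implementation.

```python
SMEAR_WINDOW_S = 20 * 3600  # ten hours before plus ten hours after the leap second
LEAP_S = 1.0                # one inserted second to absorb


def slowdown_fraction():
    """Fraction by which the smeared clock runs slow during the window."""
    return LEAP_S / SMEAR_WINDOW_S


def absorbed_seconds(elapsed_s):
    """Seconds of the leap second absorbed after elapsed_s seconds of smearing.

    Assumes a linear smear; values outside the window are clamped.
    """
    clamped = min(max(elapsed_s, 0.0), float(SMEAR_WINDOW_S))
    return LEAP_S * clamped / SMEAR_WINDOW_S


def max_disagreement_s():
    """Largest gap from an unsmeared clock, reached at the middle of the window."""
    return absorbed_seconds(SMEAR_WINDOW_S / 2)
```

Here `slowdown_fraction()` is 1/72000, which rounds to the 0.0014% in the announcement, and `max_disagreement_s()` is the half second mentioned above.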


You're going to have a bad time if you assume "the rest of the world" isn't doing their own, different adjustment

https://developers.google.com/time/smear#othersmears

> preferred to see a resolution made by an independent body

Independent bodies have spent the last decade debating if leap seconds should even exist. Agreeing on how to treat them if we keep them is way down the priority list.


> You're going to have a bad time...

Hilarious. Was this intentional?


Smearing leap seconds does make sense, but it's an odd step to take unilaterally, rather than coordinating with other NTP servers and with Linux timekeeping (which currently handles leap seconds via a 61-second minute instead).


I think taking this step unilaterally is the only way it's going to be taken. Given that Google doesn't want to deal with leap seconds [1], and that the standards organizations have been debating removing leap seconds for years, at least they're publicizing what they're doing.

[1] for good reasons


It isn't even so simple as "work with others" - the other people involved here aren't even thinking at this level. Leap seconds are an ugly hack that were inserted without any thought as to the impact on computer systems. We should not use them, they exist as a vanity project imo.

Remember when leap seconds caused ALL JVMs to lock up until restarted? Or kernel bugs? ick!


> Leap seconds are an ugly hack that were inserted without any thought as to the impact on computer systems.

No. Leap seconds were a rationalization of a prior system that really was ugly for computers. The conversion between TAI and UT(n) that was used before UTC involved table-driven algorithms with multiple rules and microsecond adjustments.

If you thought that six months' notice to add a leap second to /etc/leapsecs.dat is a huge imposition, then you should try creating a computer system that can cope with rules like "for the next three months, from January the 1st to March the 31st 1964, you must add 0.001296 of a second for each day since 38761 and then add a further 3.240130 seconds".
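To make that rule concrete, here is a small sketch of just this one table entry. It assumes that "38761" is a Modified Julian Date (1 January 1964) and that the result is the TAI minus UTC difference in seconds; the function name is invented for the example.

```python
MJD_1964_JAN_1 = 38761   # assumed: the day number in the rule is a Modified Julian Date
DAYS_IN_RULE = 91        # 1 January to 31 March 1964 inclusive (1964 was a leap year)


def tai_minus_utc_early_1964(mjd):
    """Offset in seconds given by the quoted rule for a day in Jan-Mar 1964."""
    day_index = mjd - MJD_1964_JAN_1
    if not 0 <= day_index < DAYS_IN_RULE:
        raise ValueError("rule only covers 1 January to 31 March 1964")
    return 3.240130 + day_index * 0.001296
```

A full implementation would need one such entry per rule period, which is the table-driven approach described above.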

Ironically, UTC and the leap second system are geared towards the same sort of timekeeping that computers do and away from the civil timekeeping that preceded it: a constant length second that can be measured with oscillators and electronic counters, being the basis for civil time; rather than astronomical calculation.


In effect, leap time has been in use since the middle of the first millennium BC: the Babylonians discovered the difference between mean solar time and sidereal time, and clocks have been corrected ever since. Reconciling relative earth surface time with earth mean solar time meant that time had to be inserted or removed somewhere.

In the spirit of the DevOps maxim that small frequent changes are better than big infrequent changes, we have leap seconds instead of leap minutes, hours, days, etc. This way noon is still when the sun is at its highest (+/- 0.5 relative earth surface seconds).


Leap minutes could be publicized a century before being implemented, making sure all libraries accounted for them, just like leap hours.


Which then means, if you have a library that can support leap hours, why not leap seconds? More frequent small changes are much better than infrequent large changes; at the very least it will disabuse programmers of poor understanding of time and how to track it properly.


>Linux timekeeping (which currently handles leap seconds via a 61-second minute instead)

Google doesn't think so: "No commonly used operating system is able to handle a minute with 61 seconds"


Bits and pieces of the operating systems might handle leap seconds properly, but it's doubtful that every single component that uses time does the right thing. The last two leap seconds have revealed bugs in the kernel: https://lwn.net/Articles/504744/ for the one in 2012 and https://lwn.net/Articles/648313/ for the one in 2015, and I don't think it's unlikely that the one scheduled for December will reveal another.


Some things did go wrong on a few of the previous leap-second injections, and the Linux timekeeping maintainer had talked about changing the approach to handling them (which has already changed at least once in the past).

I don't, however, think it makes sense to unilaterally change this, without (any obvious signs of) coordination with the timekeeping maintainers and the maintainers of major NTP servers.


Things go wrong on every leap second, sometimes catastrophically. They go wrong on non-leap-seconds because of falsely advertised leap seconds. They go wrong 4 months before a leap second because a leap indicator got set and some software had an incorrect idea of when it was due.

Never mind the theory, the practice is a clusterfuck.


One of the big problems is application support. How many will break by seeing 60 as current second as opposed to 59 twice?


I am sure tons of applications that use gettimeofday() to keep track of time can break in subtle ways when seeing 59 twice. Of course, they're broken considering that there is clock_gettime(), however this is a POSIX interface that is not really monotonic too by default, and the monotonic versions of it are Linux-only implementations.
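As an illustration of the difference, here is a minimal sketch of interval measurement that does not depend on the wall clock at all. It uses Python's `time.monotonic()` as a stand-in for a monotonic clock interface; `timed` is a hypothetical helper.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds).

    Elapsed time comes from a monotonic clock, so a wall-clock step
    (such as a repeated second) cannot make the result negative.
    """
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start
    return result, elapsed
```

Code that subtracts two wall-clock readings instead is the kind that can see a zero or negative duration across a repeated second.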


> I am sure tons of applications that use gettimeofday() to keep track of time can break in subtle ways when seeing 59 twice.

gettimeofday doesn't return hour/minute/second divisions; it just returns seconds/microseconds since the epoch. Functions like strftime and gmtime handle the components of time. And leap seconds don't make applications see 59 twice; they make them see 60 once (58, 59, 60, 0, 1, ...).

Quoting the manpages for gmtime and strftime:

> tm_sec The number of seconds after the minute, normally in the range 0 to 59, but can be up to 60 to allow for leap seconds.

> %S The second as a decimal number (range 00 to 60). (The range is up to 60 to allow for occasional leap seconds.) (Calculated from tm_sec.)
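The same split is easy to see from Python, which wraps these C interfaces. This is only a sketch: the helper names are made up, and it relies on `calendar.timegm` doing plain 86400-seconds-per-day arithmetic.

```python
import calendar
import time


def second_field(year, month, day, hour, minute, second):
    """Format only the seconds component of a broken-down UTC time."""
    # The weekday, day-of-year and DST fields do not affect %S.
    return time.strftime("%S", (year, month, day, hour, minute, second, 0, 1, 0))


def posix_seconds(year, month, day, hour, minute, second):
    """Seconds since the epoch, counting every day as exactly 86400 seconds."""
    return calendar.timegm((year, month, day, hour, minute, second, 0, 0, 0))
```

The broken-down form can represent 23:59:60 (`second_field(2016, 12, 31, 23, 59, 60)` gives `"60"`), but `posix_seconds` maps that instant and 00:00:00 of the next day to the same number, so the distinction is lost once the time is stored as a single count.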


Break them, fix them, and move on. Must we coddle every programmer's incompetence?


I'd assume it's because, at Google scale, you can dictate what "time" is considered internally.

> All Google services, including all APIs, will be synchronized on smeared time, as described above. You’ll also get smeared time for virtual machines on Compute Engine if you follow our recommended settings.


This seems to be the Google way sometimes. "We're going to take a standard and change it and do things our way. Toodles!" Just like what they did with IMAP & Gmail.


It's far better than what POSIX clocks do. They'll just drop back a second and you'll get the same time twice.


That is a historical artifact. The original Unix developers decided to treat time as seconds since the start of 1970, implicitly assuming that every day has 86400 seconds. Back then UTC was in its infancy, most programmers had not even heard of leap seconds, and most computer clocks were set by the sysadmin looking at his watch. If we were starting from scratch we would have a date-time type with a day number field and a seconds-since-midnight field. However that would be a breaking change for every piece of software out there, so we are stuck with a time_t that cannot handle leap seconds.


Or we would use the NTP timestamp directly (or a lower precision version) as time_t, which (AFAIK) doesn't suffer from leap seconds. One can always convert time_t to struct tm; tm_sec is defined to be in the range [0..60].



They've been doing this since at least 2011.


That doesn't make it any less unilateral. IMO, it makes it rather worse to have been doing this for five years. The situation would be much better if they had been working to build a broader consensus over all that time. As near as I can tell, they don't even have consensus within Linux, let alone POSIX or the ITU.


In the absence of a published and agreed standard, every approach is unilateral. My company is taking a similar approach - disconnecting from external NTP servers on 31st December, stepping the change in gradually, and reconnecting when we're "right" again.

Google have never been in the NTP business - there's no reason for them to have worked towards a consensus on this. But when a company their size makes their approach publicly available to all, it starts to pave the way for a consistent standard for everyone.


Google is in the NTP business. Chromebooks sync time from their servers, Android devices can too (though they also take time signals from the carrier), and so do hosts in their cloud services.


Honest question: what's so bad about others' systems running on slightly off time? I get why people care about internal consistency, and why deviations should be quite small, but this?


During the last leap second, I had servers configured against google's semi-public servers and some other good sources of time. ntpd marked the google servers as a false ticker sometime during the distortion, and when it was done, was happy with it again. However, I have more non-google servers than google servers, and high minpoll times which tends to result in time checks between servers happening far apart in time, so even if I had multiple google servers, they wouldn't look very close together.


Slightly contrived example: Let's say that you were running a distributed database, and you had distributed instances across different cloud providers for increased reliability. If your database relies on high-resolution timestamps for distributed conflict resolution, then you're going to have a hard time.

Another example: Suppose that a portion of an industrial monitoring system processes remote sensor data in a cloud datacenter with smeared time, while the sensor nodes keep strict UTC time. Your SCADA system had better not have any hard-baked assumptions like "messages cannot come from the future", or you're going to have a hard time, too.

Let's say that a company's internal NTP servers include several sources for reliability and redundancy. Much like Google DNS, perhaps one of the sources is Google NTP, while another is derived from the NTP pool. How do you expect the NTP daemon to behave in this situation? It will certainly be able to observe a 500ms difference between its source timeservers.


Both of those examples strike me as very contrived.

I can't think of anyone who cares that much about timekeeping who isn't running their own internal NTP infrastructure.

Google's Spanner requires accurate global time, so they deployed GPS and atomic clocks. Same for CDMA. There are some applications for high-resolution time (eg finance), so protocols like PTP exist.

A smeared NTP source in an otherwise normal list of time sources doesn't seem like that big of a deal either - eventually the daemon is just going to mark it as a falseticker and life goes on.
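A toy model of that outcome, for intuition only. This is not ntpd's actual selection algorithm; it simply compares each source's measured offset against the median and flags anything further away than a tolerance. The source names, the tolerance, and the function name are all invented.

```python
from statistics import median


def flag_falsetickers(offsets_s, tolerance_s=0.1):
    """Return the names of sources whose offset is far from the median offset.

    offsets_s maps a source name to its measured clock offset in seconds.
    """
    mid = median(offsets_s.values())
    return sorted(name for name, off in offsets_s.items()
                  if abs(off - mid) > tolerance_s)
```

With three strict sources near zero and one smeared source at 0.5 s, only the smeared one is flagged. With the majority reversed, the strict source is the one that looks wrong, which is why the mix of sources matters.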


Everywhere Google documents the service, they clearly state you should not mix their smearing NTP servers with non-smearing NTP servers.


I'm not convinced it would do anything harmful, so long as you have enough NTP sources (which you should have anyway).

From their FAQ:

> We recommend that you do not mix smeared and non-smeared NTP servers. The results during a leap second may be unpredictable.

I read that as a soft SHOULD NOT, not MUST NOT. Would be a fun exercise to try doing it intentionally with common NTP implementations and see what happens.


first example: ok, yes, if they offer the DB as a service, that would be bad.

If you run it on a VM it's IMHO your responsibility to make sure your time sensitive database nodes have shared time.

Same for the second example, interesting point for SaaS scenario, although it seems like that could break through normal deviations already.

EDIT: ok, the blog post actually mentions "local clocks in sync with VM instances running on Google Compute Engine", my bad. Not sure what to think about that. In comparison, Amazon recommends running NTP on your VMs and their Linux AMIs come with pool.ntp.org configured as default. </edit>

Third: It's going to figure out some solution (if Google is only one source it's probably going to drop it as faulty), but you probably should not have added a time source that's officially documented to not strictly follow standards. It's not like Google offered an NTP service for years and now suddenly switched how it works.

I guess I underestimate the amount of trust people put into random time sources: practice is probably messier than theory.


I predicted this for leap smear a while back-- we have time sync because having systems with different times is a source of problems... logical fix: get them onto the same time.

Smear is a workaround for those who care about phase alignment but don't care about frequency error. ... and who don't need to exchange times with anyone else. This last point reduces the set to no one, since it can't extend to everyone (some parties care a lot more about frequency error than phase error!).

This circus is enhanced by NTP's inability to tell you what timebase it's using (or, god forbid, offsets between what it's giving you and other timebases...)

It's going to be especially awesome when NTP daemons with both smear and non-smear peers get both the smear frequency error AND get a leap second.

I for one welcome this great opportunity for an enhanced trash fire to help convince the world that we need to stop issuing leap seconds. (It's absurd-- causes tens of millions in disruption easily, -- and it would take 4000 years to even drift an hour off solar time, at which point timezones could be rotated if anyone really cared).


> Smear is a workaround for those who care about phase alignment but don't care about frequency error. ... and who don't need to exchange times with anyone else. This last point reduces the set to no one, since it can't extend to everyone (some parties care a lot more about frequency error than phase error!).

I don't quite understand that point. E.g. the typical web server doesn't have much of a need to exchange precise time with others. HTTP, TLS, ... require timestamps, timestamps are shown to users occasionally, but as long as they are roughly right that is enough. As long as all internal systems work off the same standard it is fine. Which seems to be the reasoning under which Google chose to use it, even though one might argue that with their cloud offerings they are not as insular.


A lot of interesting geophysics in the unpredictable need for leap seconds. I mention Google's "smearing" approach here:

http://arstechnica.com/science/2016/04/the-leap-second-becau...


Why the hell aren't time servers and clients synced to TAI instead? Dealing with leap seconds should be a client side problem.


At least the IEEE 1588-2008 (PTPv2) protocol uses TAI time (with the POSIX epoch). The current UTC offset is then passed in the Announce messages (as well as some flags for indicating an upcoming leapsecond) which allows the slave to derive UTC if it wants to.
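A minimal sketch of the client-side derivation described here. The `Announce` class and its field names are simplified stand-ins for illustration, not the wire format of a real PTP message.

```python
from dataclasses import dataclass


@dataclass
class Announce:
    """Illustrative subset of what an Announce message conveys."""
    current_utc_offset: int   # TAI minus UTC, in whole seconds
    leap_pending: bool        # hypothetical flag: a leap second is coming up


def utc_from_tai(tai_seconds, announce):
    """Derive a UTC-style count from a TAI count using the announced offset."""
    return tai_seconds - announce.current_utc_offset
```

The TAI count itself never jumps; only the announced offset changes, for example from 36 to 37 across the leap second at the end of 2016.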


Because leap seconds are not deterministic, kind of like time zone changes. It would make timestamp <--> date calculations A) harder and B) need constant updates to work. Hell for firmware and embedded code.

Edit: the worst thing is: it would make those calculations harder to do correctly, but it would be too seductive to not care for leap seconds. After all, what's a few seconds error, really? This leaves you in a state where, guaranteed, 99% of time software will not be correct and the error will compound over time.


Yup, it seems awesome, but software needs to be written to handle it properly. I think it'd work no problem with software already written against monotonic clocks, but everything else would probably need some fixing.


The problem is that time_t (seconds since 1970) implicitly assumes 86400 seconds per day. You would have to redefine time_t and rewrite every piece of code that uses it.


Leap seconds are what allow that assumption to work.

Leap seconds exist only in real time, not in historic recorded time.

There are in fact 86400 "calendar seconds" in a day, exactly.

Essentially, when a day is done, we call it 86400, even though it's actually 86400.epsilon.

Only special applications need to know the exact physical number of seconds between two calendar times, rather than the calendar seconds.


Right... Because leap seconds are high on everyone's priority list. It's the programmer's fault!

Or... recognize that super rare bugs are inevitable and create a higher level way to avoid them entirely. I vote for option 2.


That "new higher level way" is leap seconds. The concept of leap seconds allows us to do date and time calculations in "calendar seconds", and not care about the discrepancy between physical seconds and calendar seconds.

Leap seconds basically add a corrective jump to physical time (what is measured by our super accurate clocks that use physical seconds and not calendar seconds) to match calendar time.

Leap seconds matter if you're doing some scientific or engineering calculation (astronomy, aerospace or whatever) and you need an exact physical time down to the fraction of a second between two events that are far apart in the calendar.

They do not enter into everyday calculations, like using time_t seconds to calculate the number of days between two dates.
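A short sketch of that everyday calculation, assuming a time_t-style count in which every calendar day is exactly 86400 seconds long. `days_between` is a hypothetical helper.

```python
import calendar

SECONDS_PER_DAY = 86400  # "calendar seconds" per day, leap seconds not counted


def days_between(date_a, date_b):
    """Whole days from date_a to date_b, each given as (year, month, day) in UTC."""
    t_a = calendar.timegm(date_a + (0, 0, 0))
    t_b = calendar.timegm(date_b + (0, 0, 0))
    return (t_b - t_a) // SECONDS_PER_DAY
```

`days_between((2016, 12, 31), (2017, 1, 1))` is 1 even though that day physically lasted 86401 seconds. The division only comes out exact because the count ignores the leap second.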


Google introduced this exactly to not have to deal with it client-side, while still having UTC timestamps that match the rest of the world most of the time.


You have to deal with it at client side. It doesn't make sense to fix broken software by introducing broken time servers. It could make sense to provide a wrapper library that provides smeared leap seconds and use that for broken software only (like libfaketime).

Now if I want to write software that uses precise TAI, I can't do that because of broken UTC from time servers and TAI is defined as UTC+tai_offset on my side.


> Now if I want to write software that uses precise TAI, I can't do that because of broken UTC from time servers and TAI is defined as UTC+tai_offset on my side.

Yup. Even if your NTP UTC source is perfect (lol), there aren't any cryptographically authenticated sources for the offset as far as I'm aware. (and NTP doesn't carry even an unauthenticated one).

GPS carries an offset between UTC and the (leapsecondless) GPS timescale... but the GPS signal is a bit of a pain to get to, and also unauthenticated...


For people talking about Google unilaterally doing this, it has been common to smear the leap second for the last couple years. Usually companies do it internally by having their NTP servers skew time, either with Chrony or `ntpd -x`. Standards bodies have not been able to react quickly enough to the need to smear the leap second in a consistent way. I'm thankful that Google has decided to run public NTP servers with consistently smeared leap seconds.

Here are two Red Hat articles on how to deal with the leap second, from 2016 and 2015:

https://access.redhat.com/articles/15145

http://developers.redhat.com/blog/2015/06/01/five-different-...


I hope that there are people on standards bodies who remember or learned what it was like before UTC when civil time seconds were not one SI second long, and in effect "smearing" happened all the time.


Does anyone know if Google has open sourced the time smearing algorithm?


They discuss various ways of smearing here, but I haven't seen their actual implementation code: https://developers.google.com/time/smear



