I'm not sure about the details, but as far as I remember Windows CE uses a brilliant approach to catch this bug early: the system tick count is initialized to a value three minutes before overflow, so the counter wraps around three minutes after the OS starts. Three minutes is usually enough to load all the applications that could be buggy, yet it's shorter than a typical debug session.
All timers should either be really tiny (in which case they are good subjects for test cases) or really huge and not subject to possible rollover (64 bits of nanoseconds is roughly 584 years, and should serve for an interval counter).
128 bits of nanoseconds is about 10^22 years, and should serve to drive calendar time, unless you're doing cosmology.
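To sanity-check those figures, here's a quick back-of-the-envelope calculation (the 365.25-day year is my own assumption):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* nanoseconds in an average (365.25-day) year: ~3.156e16 */
        const double ns_per_year = 365.25 * 24.0 * 3600.0 * 1e9;
        printf("2^64  ns ~ %.1f years\n", pow(2.0, 64.0) / ns_per_year);  /* ~584.5   */
        printf("2^128 ns ~ %.2e years\n", pow(2.0, 128.0) / ns_per_year); /* ~1.08e22 */
        return 0;
    }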
The API call in question is called GetTickCount[1], and it's still really popular - especially for quick things like timeout comparisons and so forth. It returns the number of milliseconds since boot as a 32-bit unsigned value, which wraps after about 49.7 days.
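For what it's worth, the difference between the fragile and the wrap-safe way of using it is tiny. A sketch (the function names are mine, purely for illustration):

    #include <windows.h>

    /* Fragile: compares against an absolute deadline. If 'deadline' wraps past
       0xFFFFFFFF, or the tick counter wraps before the deadline is reached,
       the comparison gives the wrong answer. */
    BOOL timed_out_fragile(DWORD start, DWORD timeout_ms)
    {
        DWORD deadline = start + timeout_ms;
        return GetTickCount() > deadline;
    }

    /* Wrap-safe: unsigned subtraction yields the elapsed time modulo 2^32,
       so a single wrap of the tick counter doesn't break the test. */
    BOOL timed_out_safe(DWORD start, DWORD timeout_ms)
    {
        return (DWORD)(GetTickCount() - start) >= timeout_ms;
    }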
There's a replacement named, funnily enough, GetTickCount64, but iirc it's only present on Vista and newer, so it hasn't found its way into a lot of software yet. The Windows Performance counters probably provide better metrics for people actually interested in this data.
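If you want the 64-bit counter where it exists without giving up on older systems, the usual trick is a run-time lookup - a sketch, with error handling and thread safety left out:

    #include <windows.h>

    typedef ULONGLONG (WINAPI *GetTickCount64Fn)(void);

    /* Use GetTickCount64 where the OS provides it (Vista and later),
       fall back to the 32-bit counter elsewhere. */
    ULONGLONG tick_count_ms(void)
    {
        static GetTickCount64Fn fn;
        static BOOL looked_up;
        if (!looked_up) {
            fn = (GetTickCount64Fn)GetProcAddress(
                     GetModuleHandleW(L"kernel32.dll"), "GetTickCount64");
            looked_up = TRUE;
        }
        return fn ? fn() : (ULONGLONG)GetTickCount();
    }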
I recently had to implement a version of GetTickCount64 for older platforms that only support GetTickCount(32). It works great as long as you remember to call it at least every 49.7 days. :-)
(Luckily the process already had a thread which wakes up to perform such maintenance every hour or so.)
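For anyone curious, the whole emulation can be a handful of lines - a sketch under the same assumptions (something calls it more often than once per 49.7 days, and the calls are serialized; no locking here):

    #include <windows.h>

    /* Emulated GetTickCount64: extends the 32-bit tick counter by counting
       rollovers. Only correct if called at least once per ~49.7-day wrap. */
    ULONGLONG tick_count64_emulated(void)
    {
        static DWORD last;      /* previous 32-bit reading        */
        static ULONGLONG high;  /* accumulated wraps, in ms units */
        DWORD now = GetTickCount();
        if (now < last)         /* the counter wrapped since last call */
            high += 0x100000000ULL;
        last = now;
        return high + now;
    }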
Speaking of uptimes - where I work, the Cisco switches and load balancers in our data center always become flaky after 300+ days of uptime for some reason - weird resets, close_waits and other such things. Older Sun OS releases (8 and below) also get flaky after 200+ days of uptime (we've had apps doing zero-size reads on log files in a loop, processes hanging on startup, etc.). Linux boxes get so many updates that they hardly ever cross 60 days of uptime. The only shining stars of rock-solid uptime are the heavily loaded HP-UX 11i DB server boxes - 1000+ days of uptime and they literally work as if they were freshly booted!
As a rule of thumb, I have all of my equipment and servers reboot every 30 days. You never know what sort of cruft you'll run into if you run your box long enough.
I never got this approach, for three reasons. The first one: sure, you could say that some "cruft" accumulates. But by rebooting, you're guaranteeing that if something goes only slightly wrong you may not notice it, because every month you start with a clean system that no longer shows the issue. So you're choosing to potentially ignore tiny issues instead of letting them crash the system in a visible way and fixing them properly, for good.
The second is that there shouldn't be any "cruft". Servers are not running Win95, which reliably crashed given enough time to run. "Cruft" should be fixable - and if it isn't, you're running a system that cannot really be supported.
The third one is that if you cannot say anything more specific than "cruft", then your system is badly managed. Are you restarting because your app leaks memory? Is it leaving zombie processes? Is it leaving dead connections to the database? Or maybe something else entirely? Restarting can be a short-term solution for some specific issue, but if it's there to remove "cruft" and "you never know" what it is, then you might as well try arranging your server room according to feng-shui or using voodoo healing to make your app run better. Either you control your system, or you don't.
> So you're choosing to potentially ignore tiny issues instead of letting them crash the system in a visible way and fixing them properly, for good.
This makes the assumption that I am able to fix them. Few people realize it, but you have very little control over the system. You can easily fix configuration errors, but if the error is due to a fundamental defect in the source code of a critical package, there's nothing you can do. I do not have the time to write patches for every defect in the system, nor do I have the luxury to tolerate them. Quite a conundrum, isn't it?
> The second is that there shouldn't be any "cruft".
There's always cruft. If you plot the uptime for servers, you will find that it looks like a very steep bell curve. Very few servers run for a few days, but also, very few servers run for years at a time. Most run for a month or two, or three.
In practical terms, this means that the bugs that get fixed first are the ones that crop up immediately on boot (everyone experiences them). The bugs that get fixed next are the ones that crop up for the average user or the middle of the bell curve (server running for a month or two). The bugs that get fixed last, if at all, are the ones that crop up for the fewest users (server running for years). This article is an example of that.
So by running your server for years at a time, you are exposing yourself to a greater number of unfixed bugs. Also, memory leaks get worse over time, and even a really minor one can mean serious issues over the span of years.
> The third one is that if you cannot say anything more specific than "cruft", then your system is badly managed.
I am not restarting because there is anything wrong with the system. It is not a solution to anything. It is a preventative measure. Preventative maintenance.
First, if there is a power failure or some other problem that forces the server to reboot unexpectedly, I know it will come back up. I know that because I have designed the system to reboot regularly and I test that capability on every server, after every update.
Second, I am avoiding problems that crop up for systems that are operating outside of the average uptime.
Third, my servers boot in at most a minute. What's my cost for doing this? One minute of downtime in the middle of the night once per month? So be it.
> Few people realize it, but you have very little control over the system.
That might be the difference in our POVs... Most of the time I work in environments where we do have control over the whole system, or at least aim for it.
> If you plot the uptime for servers, you will find that it looks like a very steep bell curve.
Unless they kernel-panicked, the systems I took care of ran from one kernel update to the next. I never experienced the "cruft" in any way.
> I am not restarting because there is anything wrong with the system. It is not a solution to anything. It is a preventative measure.
So you don't know of anything going wrong. You're not fixing anything by restarting. You're restarting just in case... it prevents something from breaking. I guess I just disagree with that reasoning.
How do you know you can restore your system to a working state in the event of an unscheduled outage, cruft or not?
You should discriminate between services and systems - make your service available 100% of the time, but you should be able to kill and restart/reload/replace systems for maintenance or other reasons at nearly any time. And you SHOULD do that, because without proof that you can do it, your DR solution is simply a best guess.
By enforcing configuration management software (Puppet, cfengine), so that one-off fixes don't get hot-patched onto the production server and left undocumented.
You can test for that in a staging environment. Restarting live services doesn't help you guarantee anything; it just makes it more likely that you'll restart some server in a specific situation you can't recover from. (You'll never guarantee that you can recover from all situations.)
Tick counters are not much of an issue. Packet or octet counters overflowing are much more interesting, especially when they're somehow connected to billing...
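For illustration, the wrap handling for a 32-bit octet counter looks much the same as the tick-count case - a generic sketch, not tied to any particular device or SNMP API; it's only correct if you poll faster than the counter can wrap:

    #include <stdint.h>

    /* Accumulate a 64-bit octet total from a 32-bit interface counter. */
    typedef struct {
        uint32_t last_raw;  /* previous raw counter reading */
        uint64_t total;     /* running 64-bit total         */
    } octet_accumulator;

    void octet_accumulator_poll(octet_accumulator *acc, uint32_t raw)
    {
        /* unsigned subtraction gives the delta modulo 2^32 */
        acc->total += (uint32_t)(raw - acc->last_raw);
        acc->last_raw = raw;
    }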
edit: found details:
http://msdn.microsoft.com/en-us/library/ms885645.aspx
For Debug configurations, 180 seconds is subtracted to check for overflow conditions in code that relies on GetTickCount. If this code started within 3 minutes of the device booting, it will experience an overflow condition if it runs for a certain amount of time.