I'm not sure about the details, but as far as I remember Windows CE uses a brilliant approach to catch this bug early: the system tick count is initialized to a value three minutes before overflow, so the counter wraps around three minutes after the OS starts. Three minutes is usually enough to load all the applications that could be buggy, yet it's shorter than a typical debug session.
All timers should either be really tiny (in which case they are good subjects for test cases) or really huge and not subject to possible rollover (64 bits of nanoseconds is roughly 584 years, and should serve for an interval counter).
128 bits of nanoseconds is about 10^22 years, and should serve to drive calendar time, unless you're doing cosmology.
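To sanity-check those figures, here's a quick back-of-the-envelope calculation (the 365.25-day year is my own assumption):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* nanoseconds in an average (365.25-day) year: ~3.156e16 */
        const double ns_per_year = 365.25 * 24.0 * 3600.0 * 1e9;
        printf("2^64  ns ~ %.1f years\n", pow(2.0, 64.0) / ns_per_year);  /* ~584.5   */
        printf("2^128 ns ~ %.2e years\n", pow(2.0, 128.0) / ns_per_year); /* ~1.08e22 */
        return 0;
    }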
The API call in question is called GetTickCount[1], and it's still really popular - especially for quick things like timeout comparisons and so forth. It returns the number of milliseconds since boot as a 32-bit unsigned value, which wraps after about 49.7 days.
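For what it's worth, the difference between the fragile and the wrap-safe way of using it is tiny. A sketch (the function names are mine, purely for illustration):

    #include <windows.h>

    /* Fragile: compares against an absolute deadline. If 'deadline' wraps past
       0xFFFFFFFF, or the tick counter wraps before the deadline is reached,
       the comparison gives the wrong answer. */
    BOOL timed_out_fragile(DWORD start, DWORD timeout_ms)
    {
        DWORD deadline = start + timeout_ms;
        return GetTickCount() > deadline;
    }

    /* Wrap-safe: unsigned subtraction yields the elapsed time modulo 2^32,
       so a single wrap of the tick counter doesn't break the test. */
    BOOL timed_out_safe(DWORD start, DWORD timeout_ms)
    {
        return (DWORD)(GetTickCount() - start) >= timeout_ms;
    }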
There's a replacement named, funnily enough, GetTickCount64, but iirc it's only present on Vista and newer, so it hasn't found its way into a lot of software yet. The Windows Performance counters probably provide better metrics for people actually interested in this data.
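If you want the 64-bit counter where it exists without giving up on older systems, the usual trick is a run-time lookup - a sketch, with error handling and thread safety left out:

    #include <windows.h>

    typedef ULONGLONG (WINAPI *GetTickCount64Fn)(void);

    /* Use GetTickCount64 where the OS provides it (Vista and later),
       fall back to the 32-bit counter elsewhere. */
    ULONGLONG tick_count_ms(void)
    {
        static GetTickCount64Fn fn;
        static BOOL looked_up;
        if (!looked_up) {
            fn = (GetTickCount64Fn)GetProcAddress(
                     GetModuleHandleW(L"kernel32.dll"), "GetTickCount64");
            looked_up = TRUE;
        }
        return fn ? fn() : (ULONGLONG)GetTickCount();
    }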
I recently had to implement a version of GetTickCount64 for older platforms that only support GetTickCount(32). It works great as long as you remember to call it at least every 49.7 days. :-)
(Luckily the process already had a thread which wakes up to perform such maintenance every hour or so.)
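For anyone curious, the whole emulation can be a handful of lines - a sketch under the same assumptions (something calls it more often than once per 49.7 days, and the calls are serialized; no locking here):

    #include <windows.h>

    /* Emulated GetTickCount64: extends the 32-bit tick counter by counting
       rollovers. Only correct if called at least once per ~49.7-day wrap. */
    ULONGLONG tick_count64_emulated(void)
    {
        static DWORD last;      /* previous 32-bit reading        */
        static ULONGLONG high;  /* accumulated wraps, in ms units */
        DWORD now = GetTickCount();
        if (now < last)         /* the counter wrapped since last call */
            high += 0x100000000ULL;
        last = now;
        return high + now;
    }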
Speaking of uptimes - where I work, the Cisco switches and load balancers in our data center always become flaky after 300+ days of uptime for some reason - weird resets, close_waits and other such things. Older Sun OS releases (8 and below) also get flaky after 200+ days of uptime (we've had apps doing zero-size reads on log files in a loop, processes hanging on startup, etc.). Linux boxes get so many updates that they hardly ever cross 60 days of uptime. The only shining stars of rock-solid uptime are the heavily loaded HP-UX 11i DB server boxes - 1000+ days of uptime and they literally work as if they were freshly booted!
As a rule of thumb, I have all of my equipment and servers reboot every 30 days. You never know what sort of cruft you'll run into if you run your box long enough.
I never got this approach, for three reasons. The first one: sure, you could say that some "cruft" accumulates. But by rebooting, you're guaranteeing that if something goes only slightly wrong you may not notice it, because every month you start with a clean system that no longer shows the issue. So you're choosing to potentially ignore tiny issues instead of letting them crash the system in a visible way and fixing them properly, for good.
The second is that there shouldn't be any "cruft". Servers are not running Win95, which reliably crashed given enough time to run. "Cruft" should be fixable - and if it isn't, you're running a system that cannot really be supported.
The third one is that if you cannot say anything more specific than "cruft", then your system is badly managed. Are you restarting because your app leaks memory? Is it leaving zombie processes? Is it leaving dead connections to the database? Or maybe something else entirely? Restarting can be a short-term solution for some specific issue, but if it's there to remove "cruft" and "you never know" what it is, then you might as well try arranging your server room according to feng-shui or using voodoo healing to make your app run better. Either you control your system, or you don't.
> So you're choosing to potentially ignore tiny issues instead of letting them crash the system in a visible way and fixing them properly, for good.
This makes the assumption that I am able to fix them. Few people realize it, but you have very little control over the system. You can easily fix configuration errors, but if the error is due to a fundamental defect in the source code of a critical package, there's nothing you can do. I do not have the time to write patches for every defect in the system, nor do I have the luxury to tolerate them. Quite a conundrum, isn't it?
> The second is that there shouldn't be any "cruft".
There's always cruft. If you plot the uptime for servers, you will find that it looks like a very steep bell curve. Very few servers run for a few days, but also, very few servers run for years at a time. Most run for a month or two, or three.
In practical terms, this means that the bugs that get fixed first are the ones that crop up immediately on boot (everyone experiences them). The bugs that get fixed next are the ones that crop up for the average user or the middle of the bell curve (server running for a month or two). The bugs that get fixed last, if at all, are the ones that crop up for the fewest users (server running for years). This article is an example of that.
So by running your server for years at a time, you are exposing yourself to a greater number of unfixed bugs. Also, memory leaks get worse over time, and even a really minor one can mean serious issues over the span of years.
> The third one is that if you cannot say anything more specific than "cruft", then your system is badly managed.
I am not restarting because there is anything wrong with the system. It is not a solution to anything. It is a preventative measure. Preventative maintenance.
First, if there is a power failure or some other problem that forces the server to reboot unexpectedly, I know it will come back up. I know that because I have designed the system to reboot regularly and I test that capability on every server, after every update.
Second, I am avoiding problems that crop up for systems that are operating outside of the average uptime.
Third, my servers boot in at most a minute. What's my cost for doing this? One minute of downtime in the middle of the night once per month? So be it.
> Few people realize it, but you have very little control over the system.
That might be the difference in our POVs... Most of the time I work in environments where we do have control over the whole system, or at least aim for it.
> If you plot the uptime for servers, you will find that it looks like a very steep bell curve.
Unless they kernel-panicked, the systems I took care of ran from one kernel update to the next. I never experienced the "cruft" in any way.
> I am not restarting because there is anything wrong with the system. It is not a solution to anything. It is a preventative measure.
So you don't know of anything going wrong. You're not fixing anything by restarting. You're restarting just in case... it prevents something from breaking. I guess I just disagree with that reasoning.
How do you know you can restore your system to a working state in the event of an unscheduled outage, cruft or not?
You should discriminate between services and systems - make your service available 100% of the time, but you should be able to kill and restart/reload/replace systems for maintenance or other reasons at nearly any time. And you SHOULD do that, because without proof that you can do it, your DR solution is simply a best guess.
By enforcing configuration management software (Puppet, cfengine), so that one-off fixes don't get hot-patched onto the production server and left undocumented.
You can test for that in a staging environment. Restarting live services doesn't help you guarantee anything; it just makes it more likely that you'll restart some server in a specific situation you can't recover from. (You'll never guarantee that you can recover from all situations.)
Tick counters are not much of an issue. Packet or octet counters overflowing are much more interesting, especially when they're somehow connected to billing...
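For illustration, the wrap handling for a 32-bit octet counter looks much the same as the tick-count case - a generic sketch, not tied to any particular device or SNMP API; it's only correct if you poll faster than the counter can wrap:

    #include <stdint.h>

    /* Accumulate a 64-bit octet total from a 32-bit interface counter. */
    typedef struct {
        uint32_t last_raw;  /* previous raw counter reading */
        uint64_t total;     /* running 64-bit total         */
    } octet_accumulator;

    void octet_accumulator_poll(octet_accumulator *acc, uint32_t raw)
    {
        /* unsigned subtraction gives the delta modulo 2^32 */
        acc->total += (uint32_t)(raw - acc->last_raw);
        acc->last_raw = raw;
    }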
edit: found details:
http://msdn.microsoft.com/en-us/library/ms885645.aspx
For Debug configurations, 180 seconds is subtracted to check for overflow conditions in code that relies on GetTickCount. If this code started within 3 minutes of the device booting, it will experience an overflow condition if it runs for a certain amount of time.