The cause of the Zune leap year bug has been isolated to a Freescale date routine

aston · on Jan 1, 2009

The thing that kills me about this bug is that it's such an easy bug to find in a code review. Some person totally unfamiliar with the code asks "What happens when that inner if condition is false?" and MSFT is saved from yet another embarrassing gadget glitch.
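For readers who haven't opened the linked source, here is a sketch of the loop shape as it has been widely reported, not a verbatim copy of the Freescale routine. The function names, the 1-based day count starting at Jan 1, 1980, and the `max_iter` cap are assumptions added so the hang can be observed without actually hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Gregorian leap-year rule. */
bool is_leap_year(int year) {
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/*
 * Loop shape as commonly reported for the buggy routine. `days` is assumed
 * to be a 1-based day count starting at Jan 1, 1980. Returns the year, or
 * -1 if the loop is still spinning after max_iter passes. The max_iter cap
 * is hypothetical and exists only to make the hang visible.
 */
int year_from_days_buggy(int days, int max_iter) {
    int year = 1980;
    assert(days >= 1);
    while (days > 365) {
        if (max_iter-- <= 0) {
            return -1;
        }
        if (is_leap_year(year)) {
            if (days > 366) {
                days -= 366;
                year += 1;
            }
            /* days == 366: nothing changes, so the loop never exits. */
        } else {
            days -= 365;
            year += 1;
        }
    }
    return year;
}

/* One possible fix: compare against the length of the current year. */
int year_from_days_fixed(int days) {
    int year = 1980;
    assert(days >= 1);
    for (;;) {
        int len = is_leap_year(year) ? 366 : 365;
        if (days <= len) {
            break;
        }
        days -= len;
        year += 1;
    }
    return year;
}
```

Under these assumptions Dec 31, 2008 is day 10593: the buggy version never gets past it, while Dec 30 (10592) and Jan 1, 2009 (10594) both come out fine.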

demallien · on Jan 1, 2009

What kills me is the fact that they have a while loop on the main thread of the Zune, which wasn't tested for all possible values of input. I don't know about anyone else here, but as an embedded software developer, that sort of thing automatically sets off alarm bells for me. Such a situation would definitely be code-reviewed, and heavily unit tested in a sane development environment.

thras · on Jan 1, 2009

If stories are true, Microsoft has some of the most code-reviewed software in the world. The levels of bureaucracy they have got to go through are tremendous.

First-generation Zune software was so bad that it made iTunes look good. On the other hand, there was a seismic rift between first-gen Zune software and second-gen. The new Zune client software has got one of the nicest, prettiest UIs on my system.

I think that in order to generate the second-gen Zune software, a lot of the bureaucracy must have been chucked. Apparently too much. It's really too bad. I suppose that we will have to make hard choices between stability, features, and beauty forever.

mixmax · on Jan 1, 2009

Bet there's a developer somewhere having a really shitty New Year's Eve right now...

DenisM · on Jan 1, 2009

Bugs happen. The root cause of this disaster is an incomplete test plan.

gruseom · on Jan 1, 2009

There is no such thing as a complete test plan.

lionheart · on Jan 1, 2009

But isn't this exactly the kind of thing that should be in a test plan? Run through all the possible dates?

I mean, didn't we learn our lesson from Y2K?

jerf · on Jan 1, 2009

Sure thing. We'll run it through all possible dates.

Well, you know, some actions may only have problems on certain dates, so make that all possible actions on all possible dates.

Well, you know, maybe the meridian counts. So, make that all possible actions on all possible dates in both the morning and the afternoon.

Well, we better test that with all the pathological media files we've built up. So, make that all possible actions on all possible dates in the morning and the afternoon with every one of our three hundred test music files.

Well... what if...

In hindsight it's always easy to identify the test that would have revealed the problem. Given the ability of bugs to only manifest when five particularly tricky conditions occur (and let's not even TALK about race conditions!), it simply isn't possible to test them all exhaustively. Unit testing helps, sort of, because it can run through a matrix faster than a human can, but it's also dumber, and even unit testing can't exhaustively search a twelve-dimensional space in any reasonable period of time if it's anything more than binary values in those twelve dimensions.

That said, for date processing code, this is definitely something I would have unit tested; that style of date processing is OK if you get it perfect, but definitely inherently problematic. One case where I did write a fairly exhaustive unit test was for billing code with a variable bill-on date; getting every single case exactly correct took me a long time. I was up to about three thousand permutations when I was done, and yeah, I'd see bugs that only struck if you signed up on the 31st of a month followed by a month of 30 days, for instance. Easy to miss if you're writing on a 15th. (And that code was probably thrown away... :( )

stcredzero · on Jan 1, 2009

At least analyze boundary conditions and special cases like leap years.

Apparently, we don't learn the Y2K lesson. I keep hearing about date rollover bugs in the industry press. At a company I worked for, there was a 45 day internal clock rollover bug in one version of our virtual machine. AFAIK, no one ever ran into this, except for one system that controlled automated people-mover trains.

rbanffy · on Jan 1, 2009

Maybe, but there are good-enough test plans. This one wasn't.

DenisM · on Jan 3, 2009

Running the single most common use case for every day of the next 10 years is a simple, doable test.
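A sweep like that can be sketched in a few lines. This is a minimal illustration, not anyone's actual test plan: `split_ok`, `split_buggy`, `first_failing_day`, the 1980 origin, and the guard counter are all hypothetical names and values. The oracle advances the calendar one day at a time, independently of the routine under test.

```c
#include <assert.h>
#include <stdbool.h>

enum { ORIGIN_YEAR = 1980 };  /* assumed epoch for the day counter */

typedef bool (*split_fn)(int days, int *year, int *day_of_year);

bool leap_year(int y) {
    return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

/* Reference conversion: 1-based day count -> year and 1-based day of year. */
bool split_ok(int days, int *year, int *day_of_year) {
    int y = ORIGIN_YEAR;
    for (;;) {
        int len = leap_year(y) ? 366 : 365;
        if (days <= len) break;
        days -= len;
        y++;
    }
    *year = y;
    *day_of_year = days;
    return true;
}

/* Reported loop shape, with a guard so a hang shows up as `false`. */
bool split_buggy(int days, int *year, int *day_of_year) {
    int y = ORIGIN_YEAR;
    int guard = 100000;
    while (days > 365) {
        if (guard-- == 0) return false;
        if (leap_year(y)) {
            if (days > 366) { days -= 366; y++; }
        } else {
            days -= 365;
            y++;
        }
    }
    *year = y;
    *day_of_year = days;
    return true;
}

/* Returns the first day count that fails, or 0 if every day passes. */
int first_failing_day(split_fn f, int total_days) {
    int exp_year = ORIGIN_YEAR, exp_doy = 1;
    assert(total_days >= 0);
    for (int d = 1; d <= total_days; d++) {
        int y = 0, doy = 0;
        if (!f(d, &y, &doy) || y != exp_year || doy != exp_doy) return d;
        /* Advance the oracle by one calendar day. */
        exp_doy++;
        if (exp_doy > (leap_year(exp_year) ? 366 : 365)) {
            exp_doy = 1;
            exp_year++;
        }
    }
    return 0;
}
```

With these assumptions the buggy shape fails on day 366, the last day of the very first leap year in the sweep, so even a one-year run starting in a leap year would have caught it.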

jyothi · on Jan 1, 2009

Gosh, can't believe this went through. You don't need a test case to catch something like "> 366"; a code review or just a conscious glance would do.

The test plan, as noted, cannot be complete, but this is a very apparent case. If the code was written to accommodate leap years, how could there be no testing for one?

mtw · on Jan 1, 2009

I'd also say the developer who did this will be in the 17% of MS staff laid off next month.

wmf · on Jan 1, 2009

It sounds like the code was written by Freescale, not Microsoft. You could argue that someone at MS should have tested it.

zitterbewegung · on Jan 1, 2009

This sounds like a Daily WTF posting in the making.

zhyder · on Jan 1, 2009

The correct way to do this is to define a quadyear=365*4+1, and then just use mods & divides (quadyear -> year-within-quad -> day) with no loop, right?
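A minimal sketch of that loop-free idea, under stated assumptions: the day count is zero-based, the origin year is a leap year (1980 here), and the range stays within 1901-2099 so the century exception never applies. `split_quad` and `QUAD_DAYS` are hypothetical names.

```c
#include <assert.h>

enum { QUAD_DAYS = 365 * 4 + 1 };  /* one leap year plus three common years */

/*
 * days0: zero-based days since Jan 1 of origin_year.
 * origin_year must be a leap year, so each four-year block has year
 * lengths 366, 365, 365, 365. Only valid while no non-leap century
 * year (such as 2100) falls inside the range.
 */
void split_quad(int days0, int origin_year, int *year, int *day_of_year0) {
    assert(days0 >= 0);
    int quad = days0 / QUAD_DAYS;  /* whole four-year blocks elapsed */
    int rem = days0 % QUAD_DAYS;   /* day offset inside the current block */
    int year_in_quad;
    int doy;

    if (rem < 366) {
        /* Still inside the leading leap year of the block. */
        year_in_quad = 0;
        doy = rem;
    } else {
        /* Past the leap year: the remaining three years are 365 days each. */
        rem -= 366;
        year_in_quad = 1 + rem / 365;
        doy = rem % 365;
    }
    *year = origin_year + 4 * quad + year_in_quad;
    *day_of_year0 = doy;
}
```

Because every path is straight-line arithmetic, there is no loop left to hang; the cost is the explicit special case for the longer first year of each block.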

marcus · on Jan 1, 2009

Not exactly, it's a bit more difficult: leap years occur once every four years, unless the year is divisible by 100 and not by 400.

So 1900 wasn't a leap year, 2000 was one, and 2100 won't be.

Although in most places I've seen, the code just ignored this, since something that's valid between 1901 and 2099 is considered good enough.

Edit: sorry, shouldn't post this early in the morning; it's 400, not 1000.

dwarry · on Jan 1, 2009

No, century years are leap years if they are divisible by 400, so 2400 will be the next such, not 3000!
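The rule as corrected here fits in one expression. The sketch below also compares it with the divide-by-four shortcut mentioned above, to show where the two part ways; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Gregorian rule: every fourth year, except century years not divisible by 400. */
bool is_leap_gregorian(int year) {
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* The shortcut that ignores the century exception. */
bool is_leap_naive(int year) {
    return year % 4 == 0;
}

/* First year in [from, to] where the two rules disagree, or 0 if none. */
int first_disagreement(int from, int to) {
    assert(from <= to);
    for (int y = from; y <= to; y++) {
        if (is_leap_gregorian(y) != is_leap_naive(y)) {
            return y;
        }
    }
    return 0;
}
```

The shortcut agrees with the full rule for every year from 1901 through 2099, and first goes wrong in 2100.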

marcus · on Jan 1, 2009

Sorry, you're right. Changed the original comment.

zhyder · on Jan 1, 2009

Ah, so we can maybe expect a Y2.1K problem then

marcus · on Jan 1, 2009

Yep, but we'll have the Y2K38 bug before that, because of the original Unix time representation as a signed 32-bit int.

http://en.wikipedia.org/wiki/Year_2038_problem
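To see where that limit lands on the calendar, here is a small sketch that converts a seconds-since-1970 count to a UTC date using 64-bit arithmetic, so it does not depend on the platform's `time_t` width. `struct utc` and `from_unix` are hypothetical names, and only non-negative inputs are handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct utc { int year, month, day, hour, minute, second; };

bool is_leap(int y) {
    return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

/* Convert non-negative seconds since 1970-01-01 00:00:00 UTC to a date. */
struct utc from_unix(int64_t secs) {
    static const int month_days[12] =
        {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    struct utc t;
    assert(secs >= 0);

    int64_t days = secs / 86400;      /* whole days since the epoch */
    int rem = (int)(secs % 86400);    /* seconds into the current day */

    /* Peel off whole years, using each year's real length. */
    int year = 1970;
    for (;;) {
        int len = is_leap(year) ? 366 : 365;
        if (days < len) break;
        days -= len;
        year++;
    }

    /* Peel off whole months within that year. */
    int month = 0;
    for (;;) {
        int len = month_days[month] + ((month == 1 && is_leap(year)) ? 1 : 0);
        if (days < len) break;
        days -= len;
        month++;
    }

    t.year = year;
    t.month = month + 1;
    t.day = (int)days + 1;
    t.hour = rem / 3600;
    t.minute = (rem % 3600) / 60;
    t.second = rem % 60;
    return t;
}
```

Feeding it `INT32_MAX` gives 2038-01-19 03:14:07 UTC, the last second a signed 32-bit counter can represent.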

tocomment · on Jan 1, 2009

Why do they have access to the source code? Since when is Zune open source?

redorb · on Jan 1, 2009

On zune.net it says this bug will work itself out by noon tomorrow... still leaves a bad taste.