Why are the Microsoft Office file formats so complicated? (And some workarounds) (joelonsoftware.com)
60 points by unfoldedorigami on Feb 19, 2008 | 20 comments



Joel starts out by telling you that Microsoft's file formats are not the product of a demented Borg mind and are not impossible to read or create correctly.

And then he spends the rest of the article explaining why the file formats are so demented and impossible to read or create correctly.

Which isn't the same thing at all. Just because there is a rich history of it-seemed-like-a-good-idea-at-the-time decisions doesn't mean the end result is any good.

There is a serious, deep, and interesting problem of scaling and complexity management that could be discussed here. But Microsoft's approach seems to have been one of embracing complexity. And Joel's role, today, is just defending that approach.


I call bullshit. A lot of his arguments perhaps make sense for early versions of Office, but for '97 and later? It didn't occur to anyone on the Word or Excel teams in 1997 that maybe interoperability would be an issue?

I really enjoy Joel's blog, but sometimes I can't stand the clubbishness of the old-school Microsoft brigade. "We were a bunch of geniuses, doing the best we could..." Yeah, well, the software is usable, but it kinda sucks. Don't be surprised if the people MS victimized for a decade sound critical now that we can all see how the sausage gets made.


Sure, they realized interoperability would be important. But by 1995 customers were already whining about how often the file format changed. Nobody would buy Office (n+1) if it started writing documents in a format that Office (n) didn't understand, because interoperability means precisely that you can't get everyone to upgrade at the same time. So they did a remarkably good job of holding the file formats stable and backwards and forwards compatible for about ten years. Look in the Excel spec and you'll see how much work went into allowing old versions of Excel to preserve and round-trip future features they don't understand.


OK, then take a look at how HTML does forward compatibility, for example. Extremely simple and elegant. Not that I'm insisting on lexable/parsable text formats - I get the point that speed was important on stone-age computers. No, it's about architecture.

Here is what I think: nobody at Microsoft ever cared about making anything simpler than what they got after the first approximation, because complexity looks impressive and hence sells better. Making things simpler, on the other hand, requires more intellectual effort and usually doesn't sell well, especially in the consumer product business.


Interoperability does not mean forward/backward compatibility for releases of a single vendor's software.

Forward/backward compatibility does not require a lot of work or intellect, unless you're deliberately obfuscating your format.


I agree. This is bullshit. Xerox PARC solved a superset of these compatibility problems by 1985. It was called InterScript ( http://hopl.murdoch.edu.au/showlanguage.prx?exp=5455 ) and it allows editors to preserve unknown elements within compound documents. This makes downward compatibility quite trivial.


What is the motivation for Joel Spolsky's post? Sympathy for MS Office file formats? It reads like a manifesto for never ever changing or cleaning up the file formats because it took a thousand man-years to come up with the current incarnation. He even goes so far as to defend the Office file formats' complexity and resulting maintainability hell by saying that portability was not an issue fifteen-plus years ago - well, it is now. His suggestions for file format conversion solutions are OK, but doesn't it occur to him that developers don't want to be locked into shelling out money for MS APIs forever? Ties in well with Bruce Schneier's latest post on vendor lock-in: http://www.schneier.com/blog/archives/2008/02/lockin.html


Normally I agree with Joel's articles, but having dealt with the new garbage that Microsoft calls a standard, Office Open XML, I can tell you that none of this applies. Yet the Open XML spec is over 6000 pages long (compared to these binary specs, which are a measly 100-300 pages each).

Open XML is not designed for performance... it's XML, and today's computers are fast enough. It IS [supposed to be] designed for interoperability (somehow they managed to get ECMA to put their stamp of approval on it), but in reality it feels like a half-assed attempt to wrap all of Office's legacy formats in XML. For example, to import documents you still need WMF support, because a lot of the graphics (including all clip art) are WMF.


Joel suggests using one licensed copy of Office to run as a web service backend but I'm not so sure Microsoft's EULA allows that.


Does anyone know for sure if this is true or not?


"The idea of things like SGML and HTML—interchangeable, standardized file formats—didn’t really take hold until the Internet made it practical to interchange documents in the first place; this was a decade later than the Office binary formats were first invented."

Put another way, Microsoft didn't care about interop until they were convicted of illegal monopolistic practices and they couldn't get away with the kind of shit reflected in this spec anymore.


MSFT is releasing specs only after the EU is basically forcing them to, and in the year 2008. Please don't try to tell me their formats are not deliberately obfuscated.

If a format's goal truly was to maintain forward/backward compatibility while remaining friendly toward low-end hardware, then it wouldn't be so hard to reverse engineer.


"If you simply have to produce tabular data for use in Excel, consider CSV."

That's what most of us have been doing for years, simply to avoid the mess discussed in this article.

After reading the article, I, for one, will keep on doing it.


I ran into a nasty bug(?) recently in the way Excel processes CSVs. I create a well-formed CSV from a Java web app which the user downloads. They open the CSV in Excel and modify the data for an eventual upload. However, during the save process, Excel truncates trailing commas, which turns my well-formed CSV into a poorly formed CSV at upload and parsing time. Not fun.


For this sort of thing you should always have something in the last column; even a carriage return might be enough:

    A,B,,,,,,,,"
    "
    C,D,,,,,,,,"
    "

That's a newline there between the quotes.

I guess Excel loads that internally through a converter, trims the rightmost unused columns, and then saves it that way.
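
A rough Java sketch of the same idea (hypothetical names; it uses a literal sentinel string instead of the quoted newline suggested above, and real code would still need proper quoting/escaping of the data fields):

    import java.io.PrintWriter;
    import java.util.List;

    class PaddedCsvWriter {
        // Pad every row out to a fixed column count and put a non-empty
        // sentinel in the last column, so Excel has no empty trailing
        // fields to trim away when the user re-saves the file.
        static void writeRow(PrintWriter out, List<String> fields, int totalColumns) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < totalColumns - 1; i++) {
                if (i < fields.size()) {
                    line.append(fields.get(i));
                }
                line.append(',');
            }
            line.append("\"x\"");   // sentinel: the last column is never empty
            out.println(line.toString());
        }
    }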


Never knew that. Once I cut a CSV for someone else, I never wanted to see it again. Now I know why.


You can't blame Microsoft for awful Office file formats. Remember, they wrote this stuff before the web came along, and before they generally started sucking really bad. It's just the accumulation of decades of feature creep, add-ons, re-dos, compatibility hacks, bug fixes, and workarounds.

Eventually, all software needs a rewrite. Not just MS's.


One option Joel doesn't mention is to use OpenOffice. It has its own object model (called UNO), which has bindings in many languages.

It's not a perfect implementation of the formats, but it's good enough for most things. And has the rather large advantage that you can run it under your favorite *nix.
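
For example, here's a rough Java sketch of driving it through UNO to load a .doc and export it (the service and filter names are from memory and should be checked against the UNO docs, so treat them as assumptions rather than copy-paste code):

    import com.sun.star.beans.PropertyValue;
    import com.sun.star.comp.helper.Bootstrap;
    import com.sun.star.frame.XComponentLoader;
    import com.sun.star.frame.XStorable;
    import com.sun.star.lang.XComponent;
    import com.sun.star.lang.XMultiComponentFactory;
    import com.sun.star.uno.UnoRuntime;
    import com.sun.star.uno.XComponentContext;

    class DocConverter {
        public static void main(String[] args) throws Exception {
            // Bootstrap a local OpenOffice process and get its component context.
            XComponentContext ctx = Bootstrap.bootstrap();
            XMultiComponentFactory smgr = ctx.getServiceManager();

            // The Desktop service is what loads documents.
            Object desktop = smgr.createInstanceWithContext("com.sun.star.frame.Desktop", ctx);
            XComponentLoader loader =
                (XComponentLoader) UnoRuntime.queryInterface(XComponentLoader.class, desktop);

            // Load the .doc, then write it back out through an export filter.
            XComponent doc = loader.loadComponentFromURL(
                "file:///tmp/in.doc", "_blank", 0, new PropertyValue[0]);
            XStorable storable = (XStorable) UnoRuntime.queryInterface(XStorable.class, doc);

            PropertyValue filter = new PropertyValue();
            filter.Name = "FilterName";
            filter.Value = "writer_pdf_Export";   // assumed filter name for PDF export
            storable.storeToURL("file:///tmp/out.pdf", new PropertyValue[]{filter});
        }
    }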


It seems to me like Joel has a rather warped view of how software should be built. He claims:

It means you have to rewrite all of your date display and parsing code to handle both epochs. That would take several days to implement, I think.

That's just ridiculous. In my mind, you need a piece of code that reads the 1904 record and sets a flag in the code. Then, your date display and parsing code should all call one, or maybe two, functions which handle the conversion for you based on this flag. Thus, supporting two different epochs requires at most three components: one to read it and set a flag, one to convert a numerical argument to a date based on that flag, and one to convert a date back to a numerical argument (again, based on the flag). Should this really take "several days to implement"? It seems to me that an hour should be plenty of time.
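
Something like this minimal sketch is all I have in mind (class and method names are hypothetical, and it glosses over Excel's phantom 1900 leap day and the time-of-day fraction):

    import java.time.LocalDate;
    import java.time.temporal.ChronoUnit;

    class SerialDateConverter {
        // Set once, while reading the workbook's 1904 record.
        private final boolean use1904Epoch;

        SerialDateConverter(boolean use1904Epoch) {
            this.use1904Epoch = use1904Epoch;
        }

        private LocalDate epoch() {
            // 1900 system: serial 1 is 1900-01-01, so the base is 1899-12-31.
            // 1904 system: serial 0 is 1904-01-01.
            return use1904Epoch ? LocalDate.of(1904, 1, 1) : LocalDate.of(1899, 12, 31);
        }

        // Serial number -> calendar date, for display.
        LocalDate toDate(long serial) {
            return epoch().plusDays(serial);
        }

        // Calendar date -> serial number, for parsing.
        long toSerial(LocalDate date) {
            return ChronoUnit.DAYS.between(epoch(), date);
        }
    }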

It's going to be hard for me to continue to take him seriously if this is his view of how software should be built.


I'm fine with being downmodded, but I'm curious why. I know that Joel has a lot of articles submitted here and that me questioning his insights is probably not a good idea (in terms of karma, at least), but does someone actually think that implementing code to handle different epochs should take several days?



