Hacker News new | past | comments | ask | show | jobs | submit login
Gene name errors can be introduced inadvertently when using Excel in bioinformatics (nih.gov)
38 points by soundsop on March 4, 2009 | hide | past | favorite | 40 comments



I can't get into specifics per confidentiality at my job, but I'll tell you this: if you saw some of the things being done in the financial sector with Excel that I have you'd cry and go into a fit of rage.


Whenever I think about the financial sector at all, I cry and/or go into a fit of rage already. So it's probably for the best that I don't know.


If you can't go into specifics, can you at least go into generalities? What sorts of things about Excel give rise to the worst stuff you've observed?


A few years ago I had an office at an incubator, and my neighbour company made a better spreadsheet. He had worked in the financial sector before, and had conducted research that showed that around 90% (Don't remember the exact number) of excel files in the financial sector had mistakes in them. Scary stuff..


If you are doing any sort of data analysis you really need to avoid spreadsheets.

Unfortunately, most data analysis is just as bad. I have worked with SAS & R code, thousands of lines long, all without a single test. Basically a big series of hacks. Mostly, statisticians I have worked with have no concept of testing, or even why it's necessary.

I shudder to think of the gross error in research results from all I have seen!


Yeah...we do most aircraft structural analysis in excel. It scares the hell out of me.


That is just scary.


Well the thing to remember is that everything is checked twice and there's a 1.5 safety factor on top to protect against mistakes but the state of the industry is poor.

Engineers often get away with creating monstrous spreadsheets that are difficult to follow:

- scattered logic - methodology obsficated with complex excel formulas - magic numbers that you have a hell of a time tracking down the source of - links all over the place. Are they still valid or have they gone bad due to all the copy/pasting? etc etc

As a programmer who has to write something that has logic and no ambiguity for a computer to understand it, it makes you want to beat your head against a wall trying to understand and check someones work.

Until recently (victim of the downturn) I was in a small dev team looking to automate and improve things. We had some good wins but it's a struggle.

If someone like toyota starts making aircraft seriously and develops more standardized methods rather than simply employing piles of engineers to punch numbers wildly into excel they may be able to crush everyone else in time to market.


magic numbers that you have a hell of a time tracking down the source of - links all over the place. Are they still valid or have they gone bad due to all the copy/pasting? etc etc

Would it help to have a sheet of centralized assumptions (constants, magic numbers, etc.) that people could link to from any spreadsheet?

Would it help to be able to reuse formulas from another sheet without copy/pasting them? i.e. link in such a way that you can fill in your own numbers, but keep the original logic?

Would it help to be able to "lock down" certain models that people could use to do their analysis, but not make arbitrary changes to?

Would it help to be able to build some computations in a spreadsheet and wrap them up as a function that other spreadsheets could call (passing in their own arguments)?

I'm interested in what can be done to raise the level of abstraction (and consistency) without giving up the spreadsheet UI which is so productive for many users.


Yes to all of those and that's part of what we try and do. There are also many models/ideas from programming that are applicable to the way engineering should be done I feel, such as don't repeat yourself, reducing dependencies etc

There are software tools for analysis developed to standardise what can be standardised and to force people to use templates for their analysis, but they are generally not great and getting them developed is always politically and financially challenging.

IMHO engineers find it difficult to identify and extract the common processes because every part is slightly different and complex and usually requires its own judgement calls to be made about how to deal with it. This leads naturally to everyone doing things their own way and therefore using the most flexible and cheap tool there is...Excel.

That's not to say there aren't common processes and higher abstractions to be made, it just requires people to look hard for them and think about things in a way that does not come naturally to the industry. They also need to have the clout to get people using them.

I think things are changing some what with modern aircraft though. My opinion is that judgement is playing less of a role in many cases and we are moving to brute force analysis i.e. Just throw it at the computer and analyse it all in great detail.

This leads to more common processes and standard ways of sharing and moving data around.

It's a big problem and I enjoy working on it but the support isn't always there to invest in it.

The cost of designing an aircraft is tiny compared to the overall lifecycle of the aircraft and the costs involved in manufacturing. So the industry would generally prefer to keep operating in the same old ways rather than take risks on new things. Especially new things that involve upfront investment. Just my opinion anyway.


IMHO engineers find it difficult to identify and extract the common processes because every part is slightly different and complex and usually requires its own judgement calls to be made about how to deal with it. This leads naturally to everyone doing things their own way and therefore using the most flexible and cheap tool there is...Excel.

This touches on one of my pet theories. Spreadsheets were remarkably successful because they never require you to express a model abstractly (i.e. they never require you to write the formulas up front and plug in data later - which is the way that programming languages work). You're always dealing with concrete numbers. It's example-driven. I believe that this is one of the reasons why the spreadsheet is the computing tool of choice for non-programmers. Good programmers "get" abstraction; most non-programmers don't.

The trouble is that when you need to make many variations on the same theme ("every part is slightly different"), the only facility provided by the spreadsheet is copy/paste. Now you can modify Examples 1 thru N to your heart's content, but you've lost any link between the stuff that's common to them all. There's lots of research showing that this kind of spreadsheet development leads to billions of dollars in errors every year. Basically, imagine programming where you have no subroutines and nothing but copy/paste. We know what a mess that would be, the minute you tried to do anything complex.

I believe an improvement would be to allow a spreadsheet to clone ranges from another spreadsheet, thus inheriting their data and formulas, but preserving the linkage back to the original. You can override things in the clone to make the parts different that need to be different. But anything that you haven't overridden (or added yourself) is a live copy of the thing you cloned. That means if you change the master, the copies get updated. In fact, you haven't actually copied anything - you're just referring back. The system would be better able to trace the true dependencies of a sheet, and users would still be able to base new work on old examples.

Since you've struggled with these issues, I'd like to hear your thoughts.


This touches on one of my pet theories. Spreadsheets were remarkably successful because they never require you to express a model abstractly (i.e. they never require you to write the formulas up front and plug in data later - which is the way that programming languages work).

My pet theory is that people have a love of grids and graph paper -- I remember in grade school how graph paper was, for some reason, a rare thing, and it was treasured. The kid with a whole stack of graph paper and who DREW PICTURES OF TANKS AND DRAGONS ON IT was considered some kind of blasphemous rouge, but still respected because of access to graph paper. It brought some indication of order to the otherwise sloppy handwriting on looseleaf we had to do.

And that love of the grid translated into a love of spreadsheets. It's so neat and clean, ready to be filled up with MY organized madness! All those empty boxes, what fun! Look, I made a list of items! And it's arranged in neat rows and columns! And besides, who needs to be able to join these product ids with sales numbers? We can do that with a hundred clicks or so and create a pivot table!

Everything begins to look like a nail when your hammer of choice is Excel.


Now I am going to have to say that the spreadsheet UI is hugely valuable for three reasons: it fits the way non-programmers' minds work, it's familiar, and... it feels like graph paper!


Considering the kinds of things I've seen people put in spreadsheets, and the way they format them, I'm sure I could dig up a few examples of spreadsheets NOT fitting the way non-programmer's minds work, because they try to make it do things it wasn't meant to and it ends up being MORE work.


I don't doubt that such examples exist, but I'm adamant on this point: the spreadsheet is the invention that gave non-programmers the power to define their own computations. That's why it was so successful. It's the only "programming for non-programmers" invention I know of, of the zillions that have been tried, that doesn't suck. Can you name any others? The ones that come to my mind are the laughable boxes-and-lines "visual programming" gimmicks.


Oh, I agree with that. But I think the perception of (mock) organization it provides and its visual similarity to graph paper is what draws people to it. The power of spreadsheets are in the computations, and of course people use it for that, but I bet there are just as many spreadsheets that contain entirely static content as there exist spreadsheets that actually use spreadsheets as more than just a glorified grid formatting engine. I've seen people do calculations with desk calculators and then type the results into their spreadsheets. It does let non-programmers define their own computations, but I wouldn't call what non-programmers do with it to be programming, "computation for non-programmers", most definitely. I believe this may come down to how the both of us define programming, however.

Interesting contrast, though, is the relative avoidance of accounting ledgers, which spreadsheets are a closer cousin to than they are to graph paper.


My pet theory is that if people weren't able to to spend time setting up complex and impressive looking spreadsheets they would be bored out of their minds.


To address your idea specifically, I definitely think something like that is possible. If the formulas and data where separated such that one spreadsheet could inherit formulas from a master etc. There is so much variation in engineering though its just a matter of working it through and seeing where the limitations are.

In making things better in engineering the first goal is first try and stop people spending so much time managing and moving data around. This really takes up a large percentage of the work done, unnecessarily.

Second is to start getting engineers to see things more like programming where things aren't repeated and data is separated out from the method. If you receive new data (e.g. new aircraft loads and geometry) your goal is for there to be nothing whatsoever required for you to update your analysis. Its already setup, you just point at the new data and it's done. This is rarely the case I find.

Finally, I think is the issue you are talking about which is to tackle the problem of engineering by forming new abstractions and figuring out how to best represent those abstractions with existing tools or something new. This is definitely promising but honestly there's lots of easier stuff to do first that would make a big difference.

The thing that makes it so hard in engineering is that every job has its own set of "approved" methods and the amount of work required for each part varies for a number of reasons. There's a lot stuff you're never going to get the world to agree on but a spreadsheet type thing that separates the data and formulas completely is a good start.


I'm interested in what can be done to raise the level of abstraction (and consistency) without giving up the spreadsheet UI

The formatting of your comment reminded me of one of the many reasons I actively avoid Excel and other Microsoft products - those "helpful" agents. The ones that always pop up when I least expect and offer suggestions that are never useful.

Whatever path you decide to take on your quest for abstraction, please don't pursue that particular route. That way lies madness.


Oh God, anything but that. Sorry about the sucky formatting.

I think you're on to something here, actually. The great advances in the first generation of desktop apps were done by the early 1990s. Instead of making the tools more powerful by evolving the core concepts, MS (and other vendors) piled on feature after feature and gimmick after gimmick (like those "agents"). Now they have bloated monsters that can't be evolved any which way. It reminds me of programs that are heaps of special cases piled one on top of the other with no thought for how to effectively generalize (nor would it be surprising if that's just what these particular applications are). MS are now so hamstrung by this legacy that perhaps a window of opportunity for real innovation in the space has finally opened.


Do you have something particular in mind? Its an interesting opportunity and problem. All the solutions I have been considering and working on are in custom applications specific to the problem domain of aerospace.

The general one size fits all solution that is useful but does not lead to the mess Excel creates seems to be what you have in mind.

In aerospace the second most popular tool is Mathcad it seems. Have you seen that?

Although there's rarely enough licenses around (why spend when everyone has Office installed) and a lot of people aren't familiar with it.


Yeah, I have something very particular in mind :) It's nearing launch in fact. You're right that it's a horizontal solution (like Excel, applicable to many domains) and not a vertical one. A lot of custom vertical software is really crappy and really expensive. Vertical systems also tend to be rigid and too hard for people to coax into doing what they need. The more flexible you make them, the more they approach spreadsheets... but then people complain that it's a crappy spreadsheet! (If you've ever heard of Greenspun's 10th Law, one can say that all sufficiently complicated vertical applications end up being a poorly implemented version of half of a spreadsheet.)

The core principle of what we're doing is to build everything around the familiar spreadsheet UI. That UI is hugely valuable for two reasons: it fits the way that users' minds work, and after nearly 30 years it's what everybody knows and likes. The problem isn't with the spreadsheet UI as such, it's with the current generation of spreadsheets that were designed for the problems of 10-20 years ago. The problems have gotten way more complex, but the tool hasn't evolved.

If you think about all the different issues with Excel that lead to messes, in each case you can kind of imagine a spreadsheet that doesn't have that particular problem. Consider the problem of copy-paste creating massive duplication: you can imagine a spreadsheet that gave you the ability to copy something without duplicating it (the cloneage idea). Here's another example: in Excel, it's really hard to do analysis over top of a bunch of different spreadsheets. Even just adding up the totals of the same variables in different sheets is a pain. But one can imagine a spreadsheet that didn't have that problem. Other examples: it's hard to track the history of who changed what in an Excel spreadsheet; it's hard to share Excel spreadsheets or link from one file to another; it's hard to restrict access to certain ranges while allowing it to others; it's hard to modularize complexity. All of these issues together mean that people get into big trouble when they try to do serious work with Excel. Yet it's possible to imagine a modern spreadsheet application that doesn't have these weaknesses.

The question is, if you build this new thing, will it still be a spreadsheet that users recognize and feel comfortable with? Or will it be some foreign thing that nobody wants to learn? That's what killed many attempts to improve on spreadsheets in the past, so we're focused on making something that looks and feels a lot like Excel (but which doesn't create so many messes... or at least makes it easier to create non-messes).

Good lord, I did not intend to write this much. Thanks for listening!

p.s. I'm not familiar with Mathcad. It may well be a good vertical app. Those exist too!

pp.s. Email me (address is in my profile) if you'd like to take a look at what we're working on.


Trying to do real programming in SAS is a nightmare. The language sucks. Also, it's proprietary and expensive.

It's surprising that statisticians would fail to understand the importance of testing, given that a lot of the root issues in statistics (e.g. the Bayesian/frequentist dispute over "what is a probability?") are epistemological.


One of my undergrad classes was for SAS, and they mixed MIS, CS, math, and statistics people in this class. It took weeks for those on the CS track to get that when your code says "proc average", you're _calling_ a procedure, not defining a function. We had no idea where the code was supposed to go. There was a serious disconnect between the students and the professor in that class, who, I later realized would have just needed to say "this calls the procedure average" and things would have been fine. We had no idea, initially, how we were getting results.

Of course, we had to run all this on VMS and print it out on a fan-fold line printer, and it wasn't very interactive. SAS products seem to be more visual and interactive now.


Agreed entirely.

I don't understand why they don't get testing either. Not one I have met realises just how error ridden their scripts probably are. Still, a lot of programmers don't understand the value of testing yet.


This paper is from 2004, any updates? Did microsoft add an option to disable the conversion? Are public databases cleaner?


Excel has a cell format called "TEXT" that prevents all of these problems. If you plan to use a range of your spreadsheet as text (and not as numbers or dates), format it as Text and excel will never misinterpret what you type.

You can also start the cell entry with a single quote to make a single cell store text.


Excel shouldn't be used for science. Or for that matter anything important and numerically intensive.

But there's silly people everywhere, even in science.


This is the sort of error that worries me a lot about "programmable" business systems.

That people who don't know the domain and tools well enough, and don't have the pernickety edge-case worrying mindset will introduce errors like this. Writing a report that doesn't take into account a new database table, editing permissions and leaving a big hole in them, etc.

What I'm not sure about though, is whether it really matters. Whether the tradeoff with only allowing certain people to make changes saves enough trouble that it's worth the limitations. Businesses with people are very good at surviving errors, after all...


Beyond the floating-point problems, there's also the automatic name capitalization, the automatic replacement of "teh" for "the", the automatic conversions of numbers into dates, and many more.

Everyday practice gets me a bit more and more annoyed at all the automatisms that software office packages come with. And they are nearly all on by default! It takes a skilled person about 10 minutes to disable nearly all automatisms in MSWord or in OpenOffice Write, in several menus. In MSExcel, I don't even know where to look myself (having switched to Gnumerics long ago).

We would all appreciate one single button that says: "Don't do anything unless I told you to."


"Don't do anything unless I told you to."

For a moment there you made me have a bad flashback to Microsoft Frontpage circa 1997.


I worked at a hedge fund where one of the phone-screen brainteasers was "How many zeros are at the end of 35!?"

The very not-smart people would try to use Excel for this, and get hilariously (and obviously) incorrect results due to rounding errors.


Took me a few seconds to parse the question (that it was about "35 factorial" and not about some binary representation of "35" ended with interrobang "?!")

One possible solution in Python:

  import operator
  fact = str(reduce(operator.mul, xrange(2,36)))
  print len(fact) - len(fact.rstrip('0'))

  >>> 8
Please forgive me for using strings for numerical operation, it was for a brevity's sake (I still remember Leah Culver getting chastised here on HN).


Took me a while to figure out where the "extra" 8th zero came out.... namely 25. That there are at least 7 zeros is evident, since 5 and 2 is needed for each 0, and 5s are more scarce than 2s in 35!


Prime p (e.g. 5) occurs exactly k times in factors of n! .

k being a sum from 1 to m of a sequence whose term is [n / p^m] ([] meaning taking inteer part) where m is such that p^m =< n that is m being the integer part of logarithm base p of n.

In this example

  [log5(35)] = 2
so

  k = [35/5^1] + [35/5^2]
Sory for bump-posting, but I had perfectly nothing better to do.


Aha, so that's the elegant way how this was supposed to be solved (in your head and on the spot).


It's basically an honesty test, since the interviewer explicitly says not to use paper or a calculator (which would include Excel). Quants are given harder questions, and I doubt that a trader would be dinged for not knowing the correct approach, since it's not a brainteaser that would correlate highly to trading acumen. But the Excel answer (14, I believe) is a very wrong one.


Interesting, 35! according to various tools:

  10333147966386100000000000000000000000000 (Open Office Calc)
  10333147966386100000000000000000000000000 (Excel)
  10333147966386144000000000000000000000000 (Gnumeric)
  10333147966386145000000000000000000000000 (Windows Calculator)
  10333147966386144929666651337523200000000 (Python)


Python is correct. It's promoting to bignum, while the others are using floats, presumably, and passing them off as integers. I think they ought to display 1.0333147966386144e40.

Your average wannabe-banker/Wharton undergrad has never heard of Python, however, but has used Excel. Quants are familiar with Python, but they get harder problems.

Most quants don't, however, get to use Python. For some inexplicable reason, a lot of them are mired in C++, of all languages, and tend to be poor-to-mediocre programmers. There are exceptions, though; the one I worked at used an FP language and had excellent programmers.


I think they ought to display 1.0333147966386144e40.

Actually, they did show the result by default as floats in scientific notation, it was me who changed number display options for easier visual comparison.

I assumed quants preferred C++ because of performance. When I occasionally have to go back from Python to C/C++, I'm surprised much faster it is.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: