What's the point of writing good scientific software? (bioinformaticszen.com)
40 points by michaelbarton on Feb 8, 2013 | 51 comments



A large portion of my job, and my team's job, is programming for research at a university. We get "researcher code" that was used to write a paper, and turn it into something for the next set of researchers to build on. There are several other groups at this university that do the same thing.

I'm starting to think there may be a field of study here. I regularly take software that "proves" some hypothesis or "shows some good results", tear it down, and throw some engineering at it, only to find that the results are not reproduced, or the benefits are severely diminished. Then I have to go track down the why... because until I can show it, it is assumed to be my fault.[1] The results of some of this are probably paper-worthy themselves.

I think a useful field, or at least a useful conference, could be built for people in this type of position, studying the meta-effects of software on research. How can we report the issues found, or the updates to the numbers, without putting black marks on the reputations of people who are actually doing good work?

Another interesting phenomenon worth studying is that the refactoring/rewriting process often gets real results, but it turns out the mechanism for the improvement isn't what the original researcher thought/claimed. It is something perhaps related, perhaps a side effect, and so on. There needs to be a way to recognize the original researcher, the programmer who found the issues, and the follow-up researchers who pinned down the actual cause.

[1] This isn't as antagonistic as it sounds. It is actually a nice check on my own mistakes. Did the differences in what the researcher did and what I did introduce some strange side effect? Did I remove a shortcut that wasn't actually a shortcut and I misunderstood? A hundred other things on both sides... Research has a large component of "we don't know what we're doing, axiomatically so", and as such it is a decent way of finding out more info.


case in point:

The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements

http://dx.plos.org/10.1371/journal.pone.0038234


I've been debating about whether I want to open-source the scientific code I've been writing. A lot of it could be useful to other people in the molecular dynamics field.

I recently introduced my advisor to Github, and he thought it was a good idea; however, there were a few hesitations. The first, and most important, is the likelihood of a bug. If you put your code on a very public website like Github, there is a chance it's going to be scrutinized by everyone in your field.

Now, unless you are one of the best programmers who has ever lived, there are bound to be bugs in your software, and when someone discovers them, it could have a deleterious effect on any journal articles you've written that used that code. The issue is that even though most bugs do not lead to significant changes in results, you would still need to re-run all of your analyses to make sure that is the case. The software industry has long recognized buggy software as a reality, but I don't think the scientific community is as tolerant of it (which is why a lot of people hide their code).

For my MD simulations, I use the well-known LAMMPS package. Bugs in it are discovered all the time! (http://lammps.sandia.gov/bug.html). So I think there needs to be a collective realization among the scientific community that these are bound to occur and authors of journal articles can't be persecuted all the time for it. A lot of computational work is the art of approximation so I would just lump "human incompetency" under one of those approximation factors.

Despite this risk, I think I'm still going to release my code at some point as I would personally welcome critique and improvement suggestions. I'd like to think I'm a better coder than most scientists since I've been coding since I was twelve in multiple language paradigms and have won a major hackathon, but eh, who knows. I'm quite sure my environment isn't up to industry standards because I've always coded solo rather than in a team.


holy moly. It sounds like you don't want to release your code because someone else might find a bug that you missed (and that might have a "deleterious" effect on your journal articles).

You would rather leave potentially incorrect work standing than have a bug corrected?

What the f*ck has happened to science.


> What the fuck has happened to science.

This is fairly standard in the science industry (I use the term deliberately). I get rated (in nuclear and particle physics) on how many papers I've published, what indications of recognition I get from my peers, and at which conferences I've been invited to speak. Short of something like a Millennium Prize or a Nobel (one-in-a-million, and both still very political), there are almost no direct rewards for accuracy or importance.

Academics tend not to be the people with the most understanding or the best direct insight - they're the people with the most friends on committees and who do the most fashionable things (usually badly). And so they appoint new academics who are a little bit worse than them (no point in opening yourself to competition - instead bring in people who are going to be grateful to you!) and the cycle continues.


I strongly agree with your statement. I am the author of the original post. I removed a sentence stating that job search committees care about the number of publications and citations more than they do about whether your software is well documented and has a good command-line interface. I feel that as a post doc I am in a vulnerable career position, and I always have to think 'will this lead to a publication?' before anything else.


I can identify with that... for better or for worse, the currency of academia (at least research-based academia in North America) today is publications in high-impact peer-reviewed journals. Period. If there are exceptions, they are exceedingly rare.

The most sickening part of it all: who perpetuates this system? We do (other academics).


Exactly. There are more and more scientists entering the job market every year, making funding more and more competitive. I feel like it's not enough to be good anymore; you have to be really good to get funding.

What can we do to change it? I love research. Having almost 10 years of experience but still earning less than if I had gone straight into industry after a master's degree wears you down, though.

I think ultimately it is because scientists add little of value to society in the short term. An MD with a comparable number of years of education can expect to earn 10x what an equivalent scientist does, because they add value immediately.


This is what turns me off so much from academia. Is this really that typical at most universities? I know that there's always been a large emphasis placed on publishing in high-impact journals (and often), but isn't the open publishing movement beginning to have an effect, albeit a rather small one?


Not that I have seen. In fact, in my experience, even publications are not all that important! In the UK and Germany both it seems to be 30% of what you know, 60% of who you know and 10% luck.


Yeah, this is what makes me fed up with academia -- it's no longer really about the work.

I like Feynman's quote "It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated." [1]

[1] http://neurotheory.columbia.edu/~ken/cargo_cult.html


Whoa, hold on there! No need to swear.

> It sounds like you don't want to release your code because someone else might find a bug that you missed

I already said at the end of my post that I decided to release my code. Everyone will be able to see it.

> You would rather leave potentially incorrect work standing than have a bug corrected?

Of course not. You missed the whole point of my post. My point was "Here's a situation in science that needs fixing". "People are hesitant to fix it because...". "I'm going to personally work towards a solution."

Let me ask: what incentive does any scientist have at all to publish their code? You're not going to make money off of it. It's in an obscure niche so you're not going to be world-famous with it. You may get citations to your work, but you're just making yourself vulnerable to having your reputation destroyed because of a bug that nullified all of your articles' results. This is why nobody wants to do it. I'm not saying it's right; I'm saying that it's the status quo.

Most big scientific packages are funded by the DOE, NSF, and others. That's likely the only reason they are even out there.


"What incentive does any scientist have at all to publish their code?"

holy moly x 2

How about that it represents a fuller account of what you did and how you did it? (bugs or no bugs). Isn't scientific publishing supposed to be about reporting what you did as accurately as possible so that others can (1) understand and (2) replicate?

BTW the * in my f*ck from above stands for "ra", what did you think it stood for?


Alright, I'm going to play devil's advocate since you're obviously not getting the point.

> How about that it represents a fuller account of what you did and how you did it?

So what? Journals never require your code to be submitted. It's not going to increase your article's chance of acceptance. And nobody asks for your code anyway. Why should I publish it if it's not going to bring me any benefits?

> Isn't scientific publishing supposed to be about reporting what you did as accurately as possible so that others can (1) understand and (2) replicate?

In an idealized world, yes. But nobody else does it so why should I?


From a game theory perspective, where you are seeking tangible rewards for your work, I totally see your point of the disadvantages of publishing your code: it's purely a short-term weakness.

From the perspective of what "science" is claimed to mean, namely the advancement of human knowledge in a way which is repeatable and verifiable, it seems axiomatic that sharing your algorithms, code, and data are necessary and beneficial to the scientific community.

As a researcher, you don't gain a lot from publishing YOUR code ... but you sure might gain a lot from being able to re-use someone else's code in your domain, or more easily replicate someone else's experiment.

In short: You should share your code because it's the Right Thing to Do if you want to grow human knowledge.


> it seems axiomatic that sharing your algorithms, code, and data are necessary and beneficial to the scientific community.

It seems that way, but it isn't. If a cultural foolishness causes you to lose significant credibility undeservedly, it's actually more beneficial to withhold things that could damage it so that you can continue your essentially still-very-useful work.

You could spearhead the fight against said cultural foolishness, but that's time and energy spent doing something other than the work you want to do and are best at.

Someone has to, but why you? This is the Bystander Effect.


My lawyer friend always said 'the probative value must outweigh the prejudicial value' for evidence to be accepted. So the fact that something is true and relevant is not, by itself, sufficient to make it admissible as evidence.

So, I guess I can see that argument here. The mere fact that the software has a "bug" isn't sufficient to conclude the results are inaccurate. The prejudicial value outweighs the probative value.


"Why should I publish it if it's not going to bring me any benefits?"

wow. has it come to this? really? maybe you should reconsider your career choice


I think it's a fair point. The time spent documenting and releasing code could be spent producing and finishing another manuscript to add to your CV. If you're not behaving like this, then someone else probably is. Job search and tenure committees demand more and more publications the higher the institution.

Science shouldn't be like this, but I think it's a zero-sum game.


I don't think you know what "devil's advocate" means...


Suppose that your code does become widely visible. How is that a bad thing for you? Your reputation will be enhanced, you will have contributed something that other people find useful, and if there are serious bugs affecting your results, it'll be nice to find those.

You might want to automate the process of re-running your analysis, though. It's a good idea in general, and especially if you anticipate needing to make minor tweaks to the software involved.
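
To make that concrete, here's a minimal sketch of what I mean (the step commands and file names are placeholders, not anyone's actual pipeline): one small driver script that captures every step, so re-running after a tweak is a single command.

    #!/usr/bin/env python
    # rerun.py -- re-run the whole analysis from raw data to final figures.
    # The commands and paths below are placeholders, not a real pipeline;
    # the point is that one script captures every step, so a tweak to the
    # software only costs "python rerun.py" instead of a day of guesswork.
    import subprocess
    import sys

    STEPS = [
        ["./preprocess.sh", "data/raw.fastq", "data/clean.fastq"],
        ["python", "analyse.py", "data/clean.fastq", "results/table.csv"],
        ["Rscript", "figures.R", "results/table.csv", "results/fig1.pdf"],
    ]

    for step in STEPS:
        print("running: " + " ".join(step))
        if subprocess.call(step) != 0:  # stop at the first failing step
            sys.exit("step failed: " + " ".join(step))
    print("analysis re-run complete")

A Makefile does the same job; the form matters less than being able to replay the whole analysis without remembering anything.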


I think that is a big 'if.' If I create a piece of bioinformatics software that becomes widely used, then that is great for my career. However, I believe the large majority of research software goes unused and therefore uncited. The extra effort I made to create a website and documentation is therefore time I could have spent more usefully on creating another publishable unit.


I have a feeling that there is a great deal of buggy software in biology. Take non-scientific software such as Rails, for example: if there is a bug, it becomes obvious when a page loads incorrectly or a model is pulled incorrectly from the database.

In contrast, imagine an academic scenario where I'm testing a hypothesis using someone else's software. How do I notice a bug? I'll notice the bugs where the output is in the wrong format, for example. More subtle bugs I won't notice, because I have no expectations about the results -- yet they will still affect the conclusions I draw.
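
The only partial defence I can think of is a tiny regression test pinned to an input small enough to check by hand, so a silently changed answer at least gets flagged. A rough sketch (the gc_content function and its expected value are invented for illustration, not taken from any real package):

    # test_sanity.py -- regression check against a hand-verified toy input.
    # gc_content stands in for whatever the tool really computes; the
    # expected value was worked out by hand once, so any silent change
    # in behaviour shows up as a failing assertion.

    def gc_content(seq):
        """Fraction of G/C bases in a DNA sequence (placeholder analysis)."""
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / float(len(seq))

    def test_gc_content_on_known_sequence():
        # 4 of the 8 bases are G or C, so the answer must be exactly 0.5
        assert abs(gc_content("ATGCATGC") - 0.5) < 1e-9

    if __name__ == "__main__":
        test_gc_content_on_known_sequence()
        print("sanity check passed")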


I would say that you should look at what stage of your career you are at and what your goals are. If your goal is to become research faculty, you should focus on getting high-impact papers out the door--software is a tool for helping you do so.

If you find yourself re-using that bit of code, then it may be worth cleaning it up and making it maintainable. If people start sending you requests for it, then it may be worthwhile open sourcing it, documenting it, maintaining it, etc.--but only if you have time.

I do make open source scientific software as part of my job, but I'm at a later stage in my career and it's not something I would have a science postdoc work on--it's just not fair to them and their career prospects within science...

Recently, someone asked for some reduction code that I've developed and I realized that while it was documented, I didn't have time to refactor it and clean it up--finally, I just put it on github and told them to contact me if they had questions--they were happy to have it as a starting point for what they wanted to work on. So, if you believe that you've made something worthwhile, but don't have the bandwidth to maintain it and other people might find it useful, sometimes it might be better to just put it out there and let people play with it--no guarantees, but it may help someone else get started...

You can get a large number of citations in some subfields for writing commonly used software--but it may or may not help your career. For example, I have friends at various institutions around the world that tell me that their management gives them no credit for developing useful software (complete with lectures, updates, documentation, etc.)--they just release it because they feel they should and most of them are also already tenured in their positions.

Good luck!!!


I agree there's little direct credit given for developing software, but it can be helpful for early-career researchers to gain name recognition. Even if nobody knows your actual work, if you wrote some software lots of people use, it makes people feel like they know you from somewhere.

Admittedly, it's tricky to do, since that name recognition only really matters if you can also manage to publish enough papers. Realistically, grad school is a more likely time for that than a postdoc or an assistant professorship. Some grad students manage to release some widely used software (well, usually "widely" within a particular niche), which I think does help them build up more prominence than someone at their career stage might otherwise have had.

On a different angle, having produced some reasonably decent software can be a nice thing to have in your back pocket if you ever consider moving to industry. Having N papers and one decent software package is probably a better academia-to-industry transition CV than N+2 papers and no software packages.


I agree that producing a piece of software many people use is great for your career. I think, though, that on average most software is never downloaded or used. So, given this prior, is it a good investment of my time to flesh out documentation and examples rather than doing just enough to get it published?


I am the original author. Thank you for writing this. I wish I had known this six years ago. I am coming to the end of my first post doc and I wish I had just done the minimum work to get it published. I could have then used the extra free time to work on finishing other manuscripts. I am especially feeling this pinch now as I apply for jobs.

I also strongly agree with your other point about just making it open source: if anyone needs it, they can download it and ask you questions.


"I have previously believed that converting any code you've created into a open-source library benefits the community and prevents reinvention of the wheel [...]

I have however started to realise that perhaps something I thought would be very useful may be of little interest to anyone else. Furthermore the effort I have put into testing and documentation may not have been the best use of my time if no one but I will use it. As my time as a post doc is limited, the extra time and effort spent on improving these tools could instead have been spent elsewhere."

From a purely selfish perspective, I've found that documenting and cleaning up my own code benefits me in the future. Even if it's a one-off, single-purpose utility that I'll never use again in the future, I often find myself needing to borrow bits of code from my old projects. ("Oh, I solved this problem before. How did I do it? Let's dig up that old, old project...") At which point, present-day me benefits if my past self bothered to actually document things and make sure they're reasonably robust.

There are countless other reasons (moral and pragmatic) to document, test, and open-source one's code, of course! Many of them more important than the ability to crib one's old code, I'd argue.

But the author seems to have considered (and discarded) them...


I am the author of this post. I don't disagree with anything you wrote. Several posts on my blog say exactly this. The software I wrote, however, is open source on github [1], has its own website [2], man pages for each command [3], example projects [4], and three screencasts showing how to install and use it [5].

I did this because I wished that all bioinformatics software had this attention to usability and documentation. However, now I wonder: what was the point of all of this if no one ever ends up using it? I could have done the minimum for publication and then spent this time working on finishing the other manuscripts I have waiting.

As I wrote though, I agree with what you wrote in your comment. I just don't think there is any incentive for post-docs in academia to prioritise writing good software over pushing out additional papers.

[1]: https://github.com/michaelbarton/genomer

[2]: http://next.gs

[3]: https://github.com/michaelbarton/genomer-plugin-view/tree/ma...

[4]: https://github.com/michaelbarton/chromosome-pfluorescens-r12...

[5]: http://www.youtube.com/user/BioinformaticsZen


Instead of, "But the author seems to have considered (and discarded) them"

I ought to have written, "But the author has considered them and concluded that their benefits don't outweigh their costs in his case."

I totally support what you're saying! IMHO it's definitely not worth it to document/open-source/etc code at the cost of one's career or happiness, especially when the code is of questionable utility to others.


I used to call the scientific software that I was writing, "paper-ware".

You aren't building a system for other users; you aren't really doing anything other than one-off analysis to create charts, which will be explained in a paper.

Things have changed somewhat since the early 2000s, but the concept remains the same. Nowadays, for interesting or controversial results, other scientists want to be able to verify them. However, that is usually more related to your data and how you processed it than to your software algorithms (which should be explained in the paper, and can be recreated from that).

So do these systems need to have reams of documentation? Probably not. However, if you leave the system for two years and come back to work on it, or need to figure out how it used to work, then you'd best have enough comments, plus a thorough readme, about some of the decisions you made and why. It's more analogous to scripting than to software engineering.


On the other hand, if you put good software out that people use, it counts as a citation. The most cited prof in the dept I graduated from maintained a widely used program for astronomical simulations.


Seriously? That's intriguing... If you wrote a very popular scientific library, how much would that impact your career compared to, for example, a Nature publication?

(I ask this because, for me, I think the former is much more likely than the latter.)


They serve somewhat different purposes, but an influential library will have far far more citations. At least in bioinformatics, that's the only way to get a large number of citations, because at their core bioinformaticians are tool builders! For example, samtools has ~1400 citations [1]. The original BLAST paper is around ~44k citations [2].

However, there are a ton of bioinformatics libraries released every year, and almost none of them gain any traction. Nature publications are far more frequent than important new libraries, and you need more political clout to get the library popular than you need to get a Nature paper.

Really, you need to gauge usefulness and interest of your library before you devote a ton of time to it. It's a lot like a startup's product.

[1] http://scholar.google.com/scholar?q=samtools

[2] http://scholar.google.com/scholar?q=BLAST


Thank you. This is the point I tried to make when I wrote the original post. What's the point of spending extra time creating documentation and examples if, on average, no one is going to use or try your software? Perhaps it's just best to create a small publication for the software and then push it out the door.


I think this is more a comment on how the system is broken. Researchers should be notified of new libraries in their area. At the very least, they should be able to consult a single site that everyone uploads their code to (think Github for science with more emphasis on exploration). Academic journals are not the only channels carrying useful information.


It's not the discoverability of the libraries that's the problem, it's that the utility of these libraries is generally not that great for anyone except the authors. One common type of library handles data transformation, normalization, and maybe even workflows. These abound. But they are rarely useful in other people's hands, because to extend them and actually get any work done, you need to spend as much time learning them as it would take to write it from scratch. And the advantage of writing it from scratch is that you know it intimately, and all of its assumptions and flaws, which you don't know about somebody else's code, even if it's extremely well documented.

Take something like Taverna [1], which is probably very useful to some people, and had been recommended enthusiastically to me by many people, but after spending three hours reading documents and searching the web, I could not get it to do what I needed to do, so I wrote a simple one-off bash script that interfaced with our cluster system. Alternatively I could try to hack in loops, but that's going to take me 10x as long, will require me to interact with many other people who obviously don't understand my problem since they did not consider it a fundamental need, and may not even be accepted back into the mainline, at which point I'm off on my own fork and lose the benefit of using a common code base. Waiting 1-10 hours to hear back from the dev mailing list is unacceptable when you're trying to get work done.

Is it more important to get the result, or to use other people's code? Reinventing the wheel is a minor sin compared to not getting results.

[1] http://www.taverna.org.uk


I just think that very much depends on the field and the problem domain. Taverna seems like it's more targeted towards academics that don't know how to code, and that most people that use it are comfortable staying within its limits. I mean, you definitely are going to have a level of project specificity that is much higher than, say, that found in the web development world. In science, many people are searching for the existence of new problems, not just the answers. Why build a gem for email integration if the next best method of communication will likely come out next week?

The problem with this thinking is that it perpetuates itself. I don't write the library that only you would find useful because I don't think it's worth my time. In return, I never receive anything useful because everyone else has adopted that same mindset. As some others pointed out, I think the problem rests in the lack of best practices and poor comp sci education among researchers. Teach proper library construction and test-driven philosophy, and I think you'll see a lot more people become comfortable writing and publishing libraries. Cobble together some basic documentation, keep an eye on its use, and contribute more accordingly. You're never going to escape writing custom scripts, but there are more well-defined problems out there that could use standard solutions.


Within machine learning, Peter Cheeseman developed AutoClass, a very popular clustering program which is probably his most highly cited work. Kevin Murphy used to be very well known, when he was a postdoc, for his Bayes Net Toolbox for Matlab. The author of svmlight is another example; the package was developed during his PhD. I can't recall for sure, but I think FFTW started out as a student's project.

The key is to find a modular component that is difficult or novel, and at the same time broadly applicable and hopefully extensible.


Like everything else in software, code quality should be feature-driven. Write the minimum to do what you need to. If you find that your code's poor quality is becoming a problem (whether because it's slowing your own development down, or other people aren't using it and you want them to, or whatever reason), do something about it then, but not before.


I agree. As an individual, this is what everyone should do: just write the software needed to get to the next publishable unit. I think, however, that this leads to poor-quality software for the field as a whole.


"I have begun to think now that the most important thing when writing software is to write the usable minimum. If then the tool becomes popular and other people begin to use it, then I should I work on the documentation and interface."

That. Like someone pointed out, I find that documenting and testing the key parts (that is, the ones I know at least I will reuse) is always a good investment of my time and prevents major headaches down the road. I've been experimenting with project structures that clearly separate the set of tools and functions that will be reusable from those that are one-shot. I focus all my testing efforts on the former, and cut myself some slack on the latter.
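
Roughly, the split I'm experimenting with looks like the sketch below (every name is invented for illustration): the reusable helpers live in a small package with tests, and the one-shot scripts just call into it.

    # Sketch of the layout (names invented):
    #
    #   myproject/
    #       mylib/        <- reusable helpers: documented and tested
    #       scripts/      <- one-shot analyses: minimal polish
    #       tests/        <- tests cover mylib only
    #
    # Squashed into a single file for illustration, the tested half
    # might be as small as:

    def zscore(values):
        """Reusable helper: centre and scale a list of numbers."""
        n = float(len(values))
        mean = sum(values) / n
        sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
        return [(v - mean) / sd for v in values]

    def test_zscore_is_centred():
        # z-scores always sum to (numerically) zero
        assert abs(sum(zscore([1.0, 2.0, 3.0]))) < 1e-9

    if __name__ == "__main__":
        test_zscore_is_centred()
        print("ok")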

Btw, I speak from a "scientist" perspective, and nothing I say applies to professional software engineering (I mean, I don't think it does).


This is debatable, but IMHO your job as a post-doc is to learn new things about biology and publish papers on what you've learned. If you can document your code along the way, that's great. But if it's taking up a bunch of your time then it's probably a misguided effort.


No, your job is to contribute to the science of biology (or whatever). If you can have a large impact building specialist tools, that's a legitimate scientific contribution. This is commonplace in e.g. astronomy, where a PhD or postdoc could be part of a team building an instrument. The student is long graduated before the thing gets first light, so doesn't directly discover a damn thing about galaxies or whatever, but their instrument is a fine contribution. Science software is the telescope of tomorrow. (And today, but it doesn't scan so nicely.)

My bona fides: I helped create a widely-used software system in my field, and have received reasonable credit for it as a scientific contribution.


A big concern for me has always been correctness. You're more likely to make mistakes and miss edge conditions in sloppy code. There's nothing worse than communicating some positive/inspiring results, only to find out later that you had an elusive computational bug in there that invalidates the results.

This reminds me: a man was seen cutting down a tree with a dull-bladed axe. A bystander asked him, "Why not sharpen your axe first?" The cutter responded, "I don't have the time!"


I agree with respect to writing correct scientific software. I think in many cases, though, bugs are not found, because who is going to download the software and test the conclusions? Unless it's a very big result, e.g. an arsenic backbone for DNA, few academics will spend the time to validate others' results at this level of detail.


I'm curious: is there a place where you can submit your software to the community and tag it as relevant for doing A, B, C, so that others can use it for the same purpose or even build on it further? I have limited experience with software in your field, but it seems like there isn't a good way to find tools already built to address your needs, or at least come close enough. Am I wrong or missing something?


As far as I know, nothing like this exists. It could be useful, though. There are a few publications that do critical comparisons of scientific software. The Assemblathon is one example of this.


github and add one of several science-related tags


I was just thinking the other day about how good academic software is getting, and how useful it is to society that master's and PhD students are producing software for their research.

Look at RapidMiner (developed at U. Dortmund), Stanford's CoreNLP, and the brat rapid annotation tool. These are better than a lot of commercial tools. They are more text-analytics than bioinformatics, but same diff.


Funny, I was just comparing the incentives for releasing scientific software to those for releasing data: http://multiplecomparisons.blogspot.com/2013/02/making-data-...

And now I hear this questioning the value of writing up and polishing scientific software!



