As someone who has worked with bits of scientific code: "Does the code you write right now work on another machine?" might be the more appropriate challenge. I've seen a lot of hardcoded paths, unmentioned dependencies, and monkey-patched libraries downloaded from somewhere; just getting the code to run at all is hard enough. And let's not even begin to talk about versioning or magic numbers.
Similar to other comments, I don't mean to fault scientists for this - their job is not coding, and some of the dependencies come from earlier papers or proprietary cluster setups and are therefore hard to avoid - but the situation is not good.
To me, that's like a theoretical physicist saying "My job is not to do mathematics" when asked for a derivation of a formula he put in the paper.
Or an experimental physicist saying "My job is not mechanical engineering" when asked for details of their lab equipment (almost all of which is typically custom built for the experiment).
On one hand, yes. But on the other hand, reusable code, dependency management, linting, portability etc. are not easy problems, and they're something junior developers tend to struggle with (and it's not like the problem never pops up for seniors, either). I really can't fault non-compsci scientists for not handling them well. Of course, part of it (like publishing the relevant code) is far easier and should be done, but some aspects are really hard.
IMO the incentive problem in science (basically number of papers and new results is what counts) also plays into this, as investing tons of time in your code gives you hardly any reward.
> But on the other hand, reusable code, dependency management, linting, portability etc. are not easy problems and something junior developers tend to struggle with
On the original hand, these are easier problems than all the years of math education they have. Once you're relying on simulations to get results to explain natural phenomena, it needs to be put on the same pedestal as mathematics.
There are tons of tutorials on using conda for dependency management, it's not rocket science. And using a linter is difficult? If a scientist needs to read and write code as part of their job then they should learn the basics of programming - that includes tools and 'best practices'.
The point is that as a scientist your code is a tool to get the job done, not the product. I can't spend 48 hours writing unit tests for my library (even though I want to) if it's not going to give me results. It's literally not my job and not an efficient use of my time.
If the code you base your work on is horrible it definitely makes me question your results. That's why it's called the reproducibility crisis.
Writing some tests, using a linter, commenting your code, and learning about best programming practices doesn't take long and pays off - even for yourself while writing the code, or when you need to touch it again. "48 hours writing unit tests" is a ridiculous comparison.
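To make the "doesn't take long" point concrete: a unit test for a small analysis helper can be a handful of lines. The `rescale` function and its test below are a hypothetical sketch, not taken from any particular codebase.

```python
def rescale(values, lo=0.0, hi=1.0):
    """Linearly map a sequence of numbers onto the interval [lo, hi]."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]

def test_rescale_endpoints():
    # The smallest input should map to lo, the largest to hi.
    out = rescale([2.0, 5.0, 8.0])
    assert out[0] == 0.0
    assert out[-1] == 1.0
    assert out[1] == 0.5

test_rescale_endpoints()
```

A runner like pytest would pick up `test_` functions automatically, but even calling them by hand at the bottom of a script catches regressions the next time the code is touched.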
This is the same as any other argument against testing. Unless you are actually selling a library, code is not the product. Customers are buying results, not your code base. Yet, we've discovered the importance of testing to make sure customers get the right results without issues.
If you want your results to be usable by others, the quality of the code matters. If all you care about is publishing a paper, then I guess sure, it doesn't matter if anyone else can build off your work.
But the results are usable by others, in most fields of science the code is not part of these results and is not needed to enjoy, use and build upon the research results.
The only case where the code would be used (which is a valid reason why it should be available somehow) is to check whether your particular results are flawed or fraudulent; otherwise the quality of the code (or its availability, or even its existence - perhaps you could have had a bunch of people do all of it on paper without any code) is simply irrelevant to making your results usable by others.
> The only case where the code would be used (which is a valid reason why it should be available somehow) is to assert that your particular results are flawed or fraudulent;
Not true. Code is often used and reused to churn out many more results than the initial paper. A flaw in the code doesn't just show one paper/result to be problematic; it can show a large chunk of a researcher's work in their area of expertise to be problematic.
> The point is that as a scientist your code is a tool to get the job done and not the product.
Everything you say is just as true of experimental equipment and mathematical tools. Physicists are fantastic at mathematics, yet are among the most anti-math people I know - in the sense of "Mathematics is just a tool to get results that explain nature! Doing mathematics for its own sake is a waste of time!"
The equation is not the product - the explanation of physical phenomena is. If the attitude of "I don't need to show how I got this equation" is unacceptable, the same should go for code.
>Yeah, we built it with duct tape and there's hot glue holding the important bits that kept falling off. Don't put anything metal in there - we use it as a tea heater, but there's 1000A running through it, so it shoots spoons out when we turn the main machine on.
Lots of people are saying it is the scientist's job to produce reproducible code. It is, and the benefits of reproducible code are many. I have been a big proponent of it in my own work.
But not with the current mess of software frameworks. If I am to produce reproducible scientific code, I need an idiot-proof method of doing it. Yes, I can put in the 50-100 hours to learn how [1], but guess what: in about 3-5 years a lot of that knowledge will be outdated. People are comparing it with math, but the math proofs I produce will still be readable and understandable a century from now.
Regularly used scientific computing frameworks like MATLAB, R, the Python ecosystem, and Mathematica need a dumb, guided method of producing releasable and reproducible code. I want to click through a bunch of Next buttons that help me fix the problems you indicate, and finally release a final version that has all the information necessary for someone else to reproduce the results.
[1] I have. I would put myself in the 90th percentile of physicists familiar with best practices for coding. I speak for the 50th percentile.
(1) Use a package manager that stores hash sums in a lock file.
(2) Install your dependencies from the lock file as the spec.
(3) Do not trust version numbers; trust hash sums. Do not believe in "But I set the version number!"
(4) Do not rely on download URLs. Again, trust hash sums, not URLs.
(5) Hash sums!!!
(6) Wherever there is randomness, as in random number generators, use a seed. If the interface does not allow specifying the seed, throw the trash away and use another generator. Be careful when concurrency is involved; it can destroy reproducibility. For example, this was the case with TensorFlow - not sure it still is.
(7) Use a version control system.
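Points (4) through (6) can be sketched in a few lines of Python using only the standard library; the file name and the commented-out expected hash below are placeholders, not real artifacts:

```python
import hashlib
import random

def sha256_of(path):
    """Hash a file in chunks so large artifacts don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# (4)/(5): trust the hash, not the URL the file came from.
# expected = "..."  # the hash you recorded when you first fetched data.bin
# assert sha256_of("data.bin") == expected, "artifact changed under you!"

# (6): seed the generator so a rerun draws exactly the same numbers.
rng = random.Random(42)
draws = [rng.random() for _ in range(3)]
rerun = random.Random(42)
assert draws == [rerun.random() for _ in range(3)]
```

The same idea carries over to NumPy (`numpy.random.default_rng(seed)`) and most other scientific libraries that expose a seedable generator.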
> in about 3-5 years a lot of that knowledge will be outdated
Yup, and most of the points you mentioned will probably not be outdated for quite a while. Every package manager I'm aware of can still consume lock files that old today.
Definitely makes you question it more. Does the paper not explain the contents of the MATLAB code? That's all that is usually needed for reproducibility. You should be able to get the same results no matter who writes the code to do what is explained in their methods.
Of course, I have no idea about the paper you're talking about and just want to say that reproducibility isn't dependent on releasing code. There could even be a case where it's better if someone reproduces a result without having been biased by someone else's code.
I think the idea that scientific code should be judged by the same standards as production code is a bit unfair. The point when the code works for the first time is when an industry programmer starts to refactor it - because he expects to use and work on it in the future. The point when the code works for the first time is when a scientist abandons it - because it has fulfilled its purpose. This is why the quality is lower: lots of scientific code is a first iteration that never got a second.
(Of course, not all scientific code is discardable; large quantities of reusable code are reused every day - we have many frameworks, and the code quality of those is completely different.)
But it often is. For most non-CS papers I've read (mostly biosciences), there are specific authors whose contribution was, to a large degree, the coding.