I think there's a fundamental misunderstanding and mismatch between what you wa...

spiralk · 2024-09-10T22:47:54 1726008474

Not having the outputs tied into the code is actually preferable if the ultimate goal is reproducible science. Code should be code, documentation should be documentation, and outputs should be outputs. Having multiple copies of important code in non-version controlled files is not a good practice. Having documentation dispersed with questionable organization in unsearchable files is not a good practice. Having outputs without run information and timestamps is not a good practice. It's easy to fall into those traps with Jupyter notebooks. It might speed up initial setup and experimentation, but I've been working in academic labs long enough to see the downstream effects.

majormajor · 2024-09-10T22:55:50 1726008950

Having the outputs recorded alongside specific versions of the code can actually be very valuable.

But since most uses of Jupyter notebooks I've seen don't version control them much at all, it's often not as useful in practice.

spiralk · 2024-09-10T23:30:09 1726011009

Yeah, jupyter notebooks don't guarantee any specifics about versions of code used for that output. In the real world you can expect everyone in the lab including all of the students to be editing jupyter notebooks at whim. The only way to do this would be to have proper version control of your code, a snapshot of the environment, and to log all this along with the run that generated the output. This is possible with regular python using git, proper log files, etc. Jupyter notebooks seem like an extra roadblock.
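A minimal sketch of that kind of run logging, assuming the script is run from inside a git checkout. The helper names (`run_metadata`, `write_run_log`) and the `.run.json` suffix are made up for illustration, not an existing tool:

```python
import json
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def _git(*args):
    """Return stdout of a git command, or None if git or the repo is unavailable."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def _installed_packages():
    """Snapshot the environment as sorted 'name==version' strings."""
    pins = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
            version = dist.version
        except Exception:
            continue  # skip distributions with unreadable metadata
        if name:
            pins.add(f"{name}=={version}")
    return sorted(pins)


def run_metadata():
    """Collect the commit, environment, and timestamp for the current run."""
    status = _git("status", "--porcelain")
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "git_commit": _git("rev-parse", "HEAD"),
        # True means uncommitted changes, so the commit alone does not describe the code.
        "git_dirty": None if status is None else bool(status),
        "python": sys.version,
        "packages": _installed_packages(),
    }


def write_run_log(output_path):
    """Write the run metadata next to an output file (hypothetical naming scheme)."""
    output_path = Path(output_path)
    log_path = output_path.with_name(output_path.name + ".run.json")
    log_path.write_text(json.dumps(run_metadata(), indent=2))
    return log_path
```

Calling `write_run_log("results/fig1.png")` after saving a figure would leave `results/fig1.png.run.json` beside it, provided the `results` directory already exists.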

paddy_m · 2024-09-11T00:39:29 1726015169

Ooh. That's a nice utility function that I will write soon. We tend to look at requirements as something we hope the package manager gets right, and then ignore at runtime, but there are a bunch of errors we could avoid if we verified at runtime. Sometimes when writing a library you have to have different code paths for different versions.

Something like `if check_versions(pandas__gt="2.0.0", pandas__lt="3.0.0"):`
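A rough sketch of what such a helper might look like, using only the standard library. The `package__op` keyword convention follows the line above; the `lookup` parameter and the simplified numeric-only version parsing are assumptions of this sketch (a real implementation would more likely lean on the `packaging` library):

```python
import operator
import re
from importlib import metadata

_OPS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
}


def _parse(version):
    """Turn '2.1.4' into (2, 1, 4). Suffixes such as 'rc1' are ignored (a simplification)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_versions(lookup=metadata.version, **constraints):
    """Return True only if every constraint such as pandas__gt="2.0.0" holds.

    `lookup` maps a package name to its installed version string; it defaults
    to importlib.metadata.version and is injectable for testing.
    """
    for key, wanted in constraints.items():
        package, _, op_name = key.rpartition("__")
        if not package or op_name not in _OPS:
            raise ValueError(f"bad constraint name: {key!r}")
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            return False  # a missing package satisfies no constraint
        have, want = _parse(installed), _parse(wanted)
        # Pad with zeros so that "2.0" and "2.0.0" compare as equal.
        width = max(len(have), len(want))
        have += (0,) * (width - len(have))
        want += (0,) * (width - len(want))
        if not _OPS[op_name](have, want):
            return False
    return True
```

One limitation of the keyword form: distribution names that are not valid Python identifiers (for example, ones containing a hyphen) would have to be passed as `check_versions(**{"some-package__ge": "1.0"})`.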

yunohn · 2024-09-11T08:16:39 1726042599

Often the notebook was run on a beefy server with GPUs attached, potentially taking hours/days of compute. It would be senseless to force every viewer of a Jupyter notebook to have the same setup and time just to read through the results and output.

epistasis · 2024-09-12T00:40:53 1726101653

> Not having the outputs tied into the code is actually preferable if the ultimate goal is reproducible science.

What a strange thing to assert, especially as a general overarching truth.

The best reports I have ever seen have matched code and output in the same file. There's never a question of what code generated a plot or a table with a notebook.

With .py files and separate outputs there's far more chance for unreproducible science, it's far messier, and for someone who doesn't appear to respect the organizational capabilities of academic labs, you are condemning them to far more poorly organized outputs.

> Having multiple copies of code

That doesn't have anything to do with notebooks. It's as silly as saying that a Python package is a poor idea because you saw somebody repeat code across multiple places.

> non-version controlled files

Notebooks are no less version controllable than .py files.

> outputs with timestamps and run information

Jupyter notebooks are perfect for this, far superior to a directory of cryptically named outputs that need to be strung together in some order.

> documentation dispersed with questionable organization

Using separate Python files rather than a notebook means that documentation can never be where it needs to be: next to the output. This is one of the ways that Python files are strictly inferior for generating results.

There are roughly two modes for notebooks: exploration with a REPL, and well-documented reports. The best scientific reports I have ever seen are notebooks (or R Markdown output) that are the full report text plus code plus figures.

spiralk · 2024-09-17T16:17:11 1726589831

> someone who doesn't appear to respect the organizational capabilities of academic labs, you are condemning them to far more poorly organized outputs.

This is not a great way to make your argument, though you are not the only one here making a personal judgement without even knowing about my background. These are all issues I have seen firsthand. With most academic labs being funding limited, the "organizational capabilities of academic labs" seems irrelevant to me. In our field, no one is getting grants to manage code of any kind, .py or .ipynb, and I suspect it's the same at most university labs. It's effort wasted that ultimately does take time away from the actual research that's fundable and publishable. As someone who has been responsible for wrangling people's notebooks in the past, it's enough of a problem that I would encourage removing all .ipynb files.

> That doesn't have anything to do with notebooks. It's as silly as saying that a Python package is a poor idea because you say somebody repeat code across multiple places.

Human factors make jupyter notebooks lead to the problems I have listed. The issues are most apparent with large groups and over long periods of time. Python and other programming languages already solved most of these problems with git. There isn't a tool that is as elegant and scales from individuals to massive organizations.

> There are roughly two modes for notebooks: exploration with a REPL, and well-documented reports. The best scientific reports I have ever seen are notebooks (or R Markdown output) that are the full report text plus code plus figures.

The REPL functionality is handled by .py cell execution, as I’ve mentioned in other comments. It baffles me how the minimal effort saved by not using separate tools -- one for code, one for documentation -- justifies the issues it introduces.
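For anyone who hasn't seen it, a minimal illustration of the cell convention: editors such as VS Code and Spyder treat each `# %%` comment as the start of a cell that can be run interactively, while the file remains an ordinary .py script that diffs cleanly in git. The data values here are made up for the example:

```python
# %% Load data (hypothetical values standing in for a real dataset)
import statistics

samples = [2.1, 2.4, 1.9, 2.2]

# %% Compute a summary
mean = statistics.mean(samples)
spread = statistics.stdev(samples)

# %% Report
summary = f"mean={mean:.2f} stdev={spread:.2f}"
print(summary)
```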