Git and Jupyter Notebooks Guide

enriquto · on July 7, 2023

Curious that they discuss several options, but ignore the totally obvious one: just use jupytext [0]. Jupytext is a (tiny) jupyter extension that reads/writes notebooks as python files, with text cells being represented as comments. With jupytext, you do away with the stupid .ipynb format. As long as you don't need to save the cell outputs, which is the case for version control, jupytext is the way to go.

People: pip install jupytext. All your python files will become notebooks, and your notebooks will become python files.

[0] https://jupytext.readthedocs.io/en/latest/

pletnes · on July 7, 2023

What happens to the outputs in this case? I found the outputs to be both the most useful parts of notebooks, but also the most troublesome for diffing and versioning.

enriquto · on July 7, 2023

Why would you commit the outputs into git? That would be like committing compiled binary objects or pdfs. Of course the outputs are useful, but you just want to commit the sources.

The .ipynb stores inputs and outputs together in an unholy way. It is much cleaner to separate them. The inputs are python (or markdown) files that you can edit with a text editor and version control with git. The outputs are html, pdf, or whatever you want to nbconvert to and share.

The .ipynb file would only be useful if you want to share a stateful notebook, whose state cannot be easily reproduced by the people who you share it with. But that would be really bizarre and definitely in bad taste. Sharing the .ipynb is akin to sharing your .pyc files.

I love working with notebooks, but as a measure of hygiene I avoid .ipynb files altogether.

sfpotter · on July 7, 2023

"unholy" "cleaner" "stateful" "bizarre" "bad taste" "hygiene"

I'm not sure whether you're unaware or just feigning ignorance, but notebooks are frequently used to share partial results, often in the context of "research", however you may interpret it. Imagine a grad student or data scientist preparing some code and plots to show during a weekly meeting.

In this context, the only thing that matters is quick progress and advancing understanding of a problem. The highly loaded words you employ while blasting the idea of uploading Jupyter notebooks are not relevant here. Wasting time on these things is seen as a bad thing. It's clear why someone using notebooks this way would want the interaction with Git and GitHub to be as seamless as possible: uploading something to GitHub is a very easy way to share it, even if this isn't the platonic ideal.

It will probably cause you some pain, but I've known people to commit binary objects and PDFs to git to accomplish the same ends. ;-)

lynndotpy · on July 7, 2023

I agree entirely, outputs are great and I consider it best practice to provide notebooks with outputs provided.

As a concrete example, this one-liner of Python code is much more interesting to those who don't recognize it when it's presented with the associated output.

    4*sum([(random.random()**2 + random.random()**2)**.5 < 1 for _ in range(10**7)])/10**7

This is also useful, e.g. when viewing the read-only export of a notebook.

(The one-linear above is a monte-carlo simulation which approximates Pi. On one run, this result came to 3.1410416.)

tnecniv · on July 7, 2023

As someone who uses notebooks for research, outputs don’t play well with git. You can end up with very large commits that GitHub or wherever your repo lives may not like if you have a lot of plots and animations.

Moreover, while research moves fast, reproducibility remains important. If your notebook is stateful, then when you share it I may not be able to recreate your result or you might have a bug due to something lingering in the notebook state. Having your outputs is convenient, but if I download the notebook, run it myself, and find that the code doesn’t run because there’s some variable that got defined earlier in your session but that code got deleted during iteration, that’s really not helpful. It’s the equivalent of handing someone your lab notebook but you kept erasing over early pages to make room for new content.

That’s one example of a bug. You could easily introduce more subtle bugs where the state leads to invalid results.

sfpotter · on July 7, 2023

There are as many different "research workflows" as there are researchers.

I'm a researcher and don't use notebooks for all the reasons you outlined and more. I have my own approach to dealing with reproducibility which is low tech and works for me and my collaborators.

My comment is meant to point out that there are many researchers who view all of the problems you describe as unimportant and not worth spending time on.

pletnes · on July 7, 2023

Sometimes I work on software development, and this mindset («the only valuable asset is the code») makes total sense. But if I work on analytics / datascience projects, the analysis including outputs could be time consuming to run, validate, and visualize. In these cases, it might be required to version the outputs.

I’ve never used jupyter for taking notes in a lab setting, but with more and more instruments being computer/network connected, I imagine this would make total sense - put your data and notes with your analytics work.

Many jupyter users are not «software developers», they just use code to perform their work.

Zandikar · on July 7, 2023

Precisely this. When your output is something like research data, or even just something that generally takes a long human time to complete (hours vs Microseconds) it makes a lot of sense to version and keep outputs, at least on major "versions".

adament · on July 7, 2023

But would you version it by storing it as output in an ipnyb file where it is overwritten if you rerun that cell? I would store the data in a versioned database or as separate data files in the repository (possibly stored in git-lfs). And I would store results of the analysis as data files / image files / whatever else, NOT as ephemereal outputs in an ipynb file. But I am pretty far down the “ipynb files are for local use only” path.

tnecniv · on July 7, 2023

Yeah if your analysis to takes hours to run, you should really split up the number crunching code and result analysis / visualization. Not only does it make version control of the code easier, you can save the output in an organized labeled manner (time-stamped, etc.) and, if you lose power or the kernel crashes, you don’t need to rerun the lengthy analysis if you want to make a change further down the pipeline.

Zandikar · on July 10, 2023

It was an extreme example to drive home the point that one is "human scale time" and one is "computer scale time", people are reading far far too much into my choice of hours specifically there.

aqsalose · on July 7, 2023

However, I wouldn't then use version control software like Git for versioning analysis objects, as it is designed for text file source control and diffs.

(How one does a diff of a data object look like? If there is a natural text format to save it in, it still is usually quite messy, and Git doesn't really like Gb sized csvs.)

My preferred workflow is to version the source files in Git and store the associated data objects in a separate archive directory with meaningful name and the hash of commit of generating code as metadata attribute.

Now if you had a version control "IDE" software that would render changes in figures and other blobs nicely, then it would make sense to build a workflow around it.

tnecniv · on July 7, 2023

What’s your solution for hosting the archive of the data if you want to share it? That’s a weakness in my workflow.

chthonicdaemon · on July 8, 2023

I use nbdime and have integrated it into my git diff so I see graphical diffs. Github renders notebooks with their output intact. They recently announced [rich notebook diffs](https://github.blog/changelog/2023-03-01-feature-preview-ric...) as well.

CardenB · on July 7, 2023

Simply write the outputs to a formatted file and keep it separately?

benrutter · on July 7, 2023

I think it depends a lot on what your git repository is.

If it's specifically source code for anything that's intended to run, then avoiding including the outputs is a smart move. But then, if that's the case, there's a good chance you'd just be committing a .py file.

I like notebooks because they include output alongisde input. For example, Peter Norvig's Pytudes are all brilliant, quick notebooks that solve a particular puzzle[0]. The code itself might not be that interesting to run (unless you really want to confirm his strategy for wordle checks out) but reading through the notebooks makes for a great experience of simultaneously understanding his thought process, and seeing the solution.

I do a bunch of generative art stuff and have recently been experimenting with using notebooks as quick sketches[1]. I really like the workflow and end up with something like a journal that isn't necessarily intended to be ran repeatedly, but read over, where I can see the visual output created, as well as the method for it.

[0] Norvig's extremely cool pytudes, wordle example: https://github.com/norvig/pytudes/blob/main/ipynb/Wordle.ipy... [1] My not anywhere near as cool as Norvig's pytudes example: https://github.com/benrutter/jupyter-sketches

WCSTombs · on July 7, 2023

You don't lose the outputs, they just aren't committed into Git. So for each new clone, you'd need to regenerate the outputs, but on a single clone, the outputs exist and are persistent in the .ipynb form of the notebook (which is not committed). You are correct that the .py version of the notebook is exactly what Git ends up tracking, with the .ipynb being essentially a build product.

(Note that the jupytext paradigm does assume that the outputs can always be recomputed as a function of the inputs. I consider that a best practice, but some might disagree.)

chthonicdaemon · on July 7, 2023

How do you capture things like charts or tables produced from long-running notebooks? Do you have a separate system to keep track of these? I prize notebooks with outputs in our data science repo since I can see the results of our analyses years later without having to re-run the notebook. In some cases, the notebooks don't even run anymore since our environment has moved on, but I can still see the graphs and read the text that was associated with that analysis.

apwheele · on July 7, 2023

Some people use the outputs like documentation, since github renders the notebook contents nicely in the browser. I agree it is not the best practice in many situations, I like using it on occasion though.

Another alternative if you want the outputs is to use nbconvert to convert the output to markdown, https://andrewpwheeler.com/2021/09/06/using-jupyter-notebook...

westurner · on July 7, 2023

If you want to store (e.g. base64-encoded) outputs in Markdown, you're going to have to scroll past a lot of data to get to the next input cell in a notebook.

Jupyter notebooks store which Jupyter kernel they were run with to generate the outputs.

nbformat (.ipynb with inlined base64 outputs) isn't a sufficient package format: https://github.com/jupyter/enhancement-proposals/pull/103#is...

Papermill is one tool for running Jupyter notebooks as reports; with the date in the filename. https://papermill.readthedocs.io/en/latest/

RobinL · on July 7, 2023

Most of the time I agree, but if you want to e.g present a tutorial as a webpage, having an ipynb with both inputs and outputs becomes a feature. You can even bundle into docs e.g. https://moj-analytical-services.github.io/splink/demos/02_Ex...

Helmut10001 · on July 7, 2023

I use Jupytext since years. It allows me to have three types of synced notebook versions: 1) .ipynb (for opening/running), 2) .md (formatted code+comments, without outputs) and 3) *.py (python formatted, code+comments).

I commit the Markdown-version, but I also use the py-version of notebooks for chained notebook imports. Allows me to split larger notebooks into multiple smaller ones. Both of these options are a blessing and Jupytext works super-robust.

Finally, when I want to archive (and share) notebooks _with_ outputs once in a while, I have a cell at the end to convert (nbconvert) to HTML, and I commit this html file. The Markdown-version remains as a clean basis for commit history. The HTML file is much better suited for sharing and archiving than the ipynb file.

Helmut10001 · on July 7, 2023

Here's another HN comment with links of an example repo [1].

[1]: https://news.ycombinator.com/item?id=36516836

kzrdude · on July 7, 2023

I use jupytext paired with ipynb files. Only store the .py files in git. The ipynb files act as a local cache of outputs. Outputs are loaded from the ipynb even if you open the .py notebook.

bootsmann · on July 7, 2023

We use jupytext with dvc. You can generate the notebook in dvc.yaml using the jupytext cli and then push this alongside the .py file.

cycomanic · on July 7, 2023

This was the first thing I wanted to post when reading the article. Jupytext is excellent, although i typically use MyST (an extended Markdown syntax).

joouha · on July 7, 2023

Euporie (my terminal Jupyter notebook editor) also supports Jupytext

kortex · on July 7, 2023

Wow, no mention of DVC (http://www.dvc.org)? That has been invaluable for data scientist workflows.

I definitely do like to strip notebooks and make them run-idempotent to the best of my ability, but sometimes you just need stateful notebooks. And since .ipynb are technically json but in reality act more like a binary file format (with respect to diffing), DVC is the ideal tool to store them. Don't get me started on git annex or LFS, both of those took years off my life due to stress of using them and them bugging out.

Also I am hardly a fan of XML, but does anyone feel like notebook files would have been a near-ideal use-case of it? It's literally a collection of markup. The fact that json was chosen over xml I think is somewhat damning of xml as an application data storage format. I think xml is perfectly cromulent as a write-once-read-many presentation format or rendering target (html, svg, GeniCam api info), but it seems to flounder in virtually every other domain it's been shoehorned into, with the exception of office application formats.

Actually, downthread there is a link to a jupyer enhancement proposal for a .nb.md markdown based format. I think this is great. One theme I keep coming across in my computer science journey is that formats which have mandatory closing endcaps are kind of a PITA. It seems the stream-of-containers (with state machines as needed) is all-around better. JSON-LD is better than JSON, streaming video formats are better than ones that stick metadata at the end, zip is... an eldritch horror, etc.

wdroz · on July 7, 2023

If you don't need to "commit" the output, you can just use nbconvert[0]:

    jupyter nbconvert --clear-output --inplace my_notebook.ipynb

So you can use git as usual, like for code.

[0] -- https://nbconvert.readthedocs.io/en/latest/

andrecosta · on July 7, 2023

nbstripout[0] does that and installs a pre-commit hook

[0] -- https://github.com/kynan/nbstripout

milliams · on July 7, 2023

There is a draft JEP (Jupyter Enhancement Proposal) for Markdown-based notebooks (https://github.com/jupyter/enhancement-proposals/pull/103) which will make it a little more RMarkdown-like.

nvy · on July 7, 2023

Seems to me that this article does a great job explaining why jupyter notebooks are a poor collaboration tool.

I wish that non-emacs implementations of org were more commonplace, as it's a pretty sane markup language and supports embedded code and graphics, diffs nicely, and doesn't introduce the insanity of JSON.

joelschw · on July 7, 2023

The native GitHub feature in preview will make this a lot better for those able to use it https://github.blog/changelog/2023-03-01-feature-preview-ric...

SalsaCrotch · on July 7, 2023

This feature has resolved the problem for our team.

sashk · on July 7, 2023

You don't need to commit output into the git. I used pre-commit filter in git, where it will strip all output from the notebook before it was committed into repository. This allowed us to review the code changes of notebooks.

TeeWEE · on July 7, 2023

My quick solution is to not commit the result cells, only the commands. So its just code.

DryLabRebel · on July 7, 2023

You forgot another issue:

- containing potentially sensitive data in your notebook

sdfghswe · on July 7, 2023

I haven't read the link and I'm not going to.

I realized that jupyter notebooks are a flawed idea when I've tried vs code. vs code uses jupyter-the-protocol (as opposed to jupyter-the-notebooks) in order to give you a notebook-like experience that doesn't involve the jupyter notebook file format. VS code's interactive files are valid python code.

To me that killed jupyter notebooks. Why use something that is strictly worse in every respect?

kortex · on July 7, 2023

It sounds like you are using the tool wrong. Jupyter notebooks are strictly superior to anything else (namely: code only, spreadsheets, matlab/octave) at their primary use case, which is interactive data science (writing code to manipulate some data, while actively revising the code, or sharing the results of that code with others).

Nothing even comes close. There's a reason it's dominant in the data science field.

Your workflow works for you but the jupyter workflow works for millions of students, data scientists, and even developers. Heck I even know all the ways to avoid jupyter, and I still use it often, because it's so convenient.

steve_gh · on July 7, 2023

Yeah but jupyter notebooks suck at providing reproducible data science. I encourage my team's not to use Jupyter for data science.

Our preferred toolchain is based on make to build data science pipelines. Every step is scripted, and make ensures that upstream changes or script changes trigger downstream changes, ending with charting with gnuplot or similar. Our output charts all are not only timestamped but have a git commit id. And our source repositories contain a data manifest so we have commit IDs right into ETL stages into the DB.

End result is that in a couple of months, when the CxOb asks about some piece of work and pulls out a chart, we can trace the entire data pipeline used to create it, and reproduce it if required. That saves so much hassle!

enriquto · on July 7, 2023

> Yeah but jupyter notebooks suck at providing reproducible data science.

That depends on how you use the notebooks.

With just a tiny bit of discipline, you can integrate notebook users into your sane workflow. For example, encourage people to restart the kernel and run all cells a few times per day (and definitely, before sharing anything). Meaningful output artifacts can be saved into files, that are later read by the notebook and displayed.

Then, when users are satisfied with their notebook, they save it as a python file thanks to jupytext, and commit it to git.

This workflow integrates well with your makefile setup: to reproduce the notebook and obtain its results you simply run it as a script. If you want a pdf or a static html that shows the notebook as-is, you can nbconvert it from your makefile.

For example, if your makefile has lines like these:

    %.ipynb : %.py    ; jupytext $< --to notebook
    %.html  : %.ipynb ; jupyter nbconvert --execute --to html $<

Then you run "make foo.html" and it will convert "foo.py" to "foo.ipynb", run all the cells, and produce a static visualization "foo.html". Since the intermediary notebook is not marked as a precious file, it is deleted automatically by make.

Notice that you can simply run "python foo.py" as well, to produce the valuable output artifacts.

In the end, jupyter becomes just an editor of python files. A fancy editor, that allows interactive execution of pieces of code, which is great.

kortex · on July 8, 2023

> Yeah but jupyter notebooks suck at providing reproducible data science.

Why? I have no problem with reproducibility when I use a little bit of discipline.

Your workflow does indeed sound nice but also sounds like it involves way more tooling and institutional knowledge. Anywhere I can learn more about it or see the scripts you use?

steve_gh · on July 8, 2023

I completely agree that a bit of discipline does wonders. But it is not just your discipline, it is the team's discipline.

What I need to ensure is that anyone picking up a piece of analysis 3 months later can reproduce exactly what was done. I've been burnt in the past by having to go back to the original analyst and be told "oh you run this bit of this notebook, then paste the results in over here, then run that". By insisting that everything is scripted and that there are no manual steps, we get a reproducible analytics pipeline.

The starting point for our methodology is the book "Guerilla Analytics" by Enda Ridge. It's worth reading.

kevinskii · on July 7, 2023

I agree with the OP. VS Code using the Jupyter protocol is superior to notebooks in almost every respect in my experience. It gives you an excellent debugger, the ability to track changes in Git without any modification, and you can also run as a regular Python script.

esafak · on July 7, 2023

Jupyter offers nothing that Mathcad and Mathematica didn't in the 80s. We should be using open source, git-friendly file formats so we can edit them collaboratively in our editor of choice; e.g., our IDEs. We are not using it wrong; Jupyter notebooks reflect an archaic product philosophy and way of working. Kill it with fire.

sdfghswe · on July 7, 2023

> It sounds like you are using the tool wrong. Jupyter notebooks are strictly superior to anything else (namely: code only, spreadsheets, matlab/octave) at their primary use case, which is interactive data science (writing code to manipulate some data, while actively revising the code, or sharing the results of that code with others).

> Nothing even comes close. There's a reason it's dominant in the data science field.

> Your workflow works for you but the jupyter workflow works for millions of students, data scientists, and even developers. Heck I even know all the ways to avoid jupyter, and I still use it often, because it's so convenient.

Copy pasting your comment here so when you eventually delete it people can still see the ignorance.

You have absolutely no clue what you're talking about. Worse, it seems like you didn't read what you're responding to.