The Jupyter+Git problem is now solved (2022) (fast.ai)
107 points by skadamat on July 19, 2023 | 42 comments



We all know the .ipynb JSON format is not a great fit for Git. The Jupyter ecosystem has come a long way in the last few years. Solving this really comes down to a few tools:

- JupyterLab Git Extension[1] for local diffs (pre-commit diffs)

- nbdime[2] / nbdev[3] for resolving .ipynb git merge conflicts

- GitHub PR code reviews with ReviewNB[4]

- Alternatively, if you don't care about cell outputs then Jupytext[5] to sync .ipynb JSON to markdown

Disclaimer: I built ReviewNB. It's a completely bootstrapped business, 5 years in the making and now used by leading DS teams at Meta, AWS, NASA JPL, AirBnB, Lyft, Affirm, AMD, Microsoft & more[6] for Jupyter Notebook code reviews on GitHub / Bitbucket.

[1] https://github.com/jupyterlab/jupyterlab-git

[2] https://nbdime.readthedocs.io

[3] https://nbdev.fast.ai

[4] https://www.reviewnb.com

[5] https://github.com/mwouts/jupytext

[6] https://www.reviewnb.com/#customers
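For the "don't care about cell outputs" case in the list above, here is a minimal sketch of the idea using only the standard library. To be clear, this is not how Jupytext works (Jupytext pairs the notebook with a text file); it only illustrates that outputs live in the `outputs` and `execution_count` fields of code cells in the v4 notebook JSON, so clearing them leaves just the source to diff. The function name `strip_outputs` and the `example` notebook are made up for illustration.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a v4 notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook, for illustration only.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
            "source": ["print('hi')"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(example)
print(json.dumps(stripped, indent=1))
```

The cost is the one discussed elsewhere in this thread: whoever opens the stripped notebook has to re-run it to see any results.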


> Alternatively, if you don't care about cell outputs then Jupytext[5] to sync .ipynb JSON to markdown

Notice that using markdown is a possibility for jupytext, but not the only one. More interestingly, you can also store your notebooks as plain python files, whose comments are interpreted as the markdown cells of the notebook.

This is very useful, and not only for version control: if your notebooks are python files they can be executed easily in CI or by third parties just by launching the interpreter. You don't even need the jupyterlab dependency.

With some care, you can craft a single python file "foo.py" that can be used at the same time as

1. an executable command-line program (that happens to be written in python)

2. an importable python module

3. a jupyter notebook (to open it you need the jupytext extension of jupyter)

4. the documentation with auto-generated figures, convertible to html or to pdf using "jupyter nbconvert --execute"

5. a regular .ipynb if for some reason you want to distribute the outputs in a re-executable format

For small simple projects, to showcase, describe and illustrate an independent algorithm, we have found this structure invaluable.
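A rough sketch of what such a "foo.py" might look like, using the percent format (`# %%` for code cells, `# %% [markdown]` for markdown cells). The `running_mean` function and the command-line interface are invented for illustration, and the `"ipykernel" in sys.modules` check is only a heuristic assumption for detecting that the file is running inside a notebook kernel, where `__name__` is also `"__main__"`.

```python
# %% [markdown]
# # Running mean
# A small algorithm, documented in the same file that implements it.

# %%
import argparse
import sys


def running_mean(values):
    """Return the cumulative means of `values` as a list of floats."""
    means, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


# %%
def main(argv=None):
    """Command-line entry point: print the running means of the arguments."""
    parser = argparse.ArgumentParser(description="Print running means.")
    parser.add_argument("values", nargs="*", type=float, default=[1, 2, 3, 4])
    args = parser.parse_args(argv)
    print(" ".join(f"{mean:g}" for mean in running_mean(args.values)))


# %%
if __name__ == "__main__":
    if "ipykernel" in sys.modules:
        # Heuristic: inside a notebook kernel, show a demo instead of parsing argv.
        print(running_mean([1, 2, 3, 4]))
    else:
        main()
```

Run as a script it behaves like a command-line tool, `import foo` exposes `running_mean` without side effects, and opened through Jupytext each `# %%` block becomes a cell.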


And VS Code supports the py-percent format as a notebook too (that jupytext can use)


This is a post from my Linkedin page on my hopes for Jupyter notebooks and git. Anyone know of progress along this line?

#Jupyter notebook and git

As much as Jupyter Notebooks have been a great tool for data science, the transition to deployment, and the general software engineering friendliness of Jupyter Notebooks could use some work. From time to time, I have explored how others have dealt with turning notebooks into an organized codebase and outputs. To date, I have not found a comfortable approach for me. The ideal approach for me would be to use something like 'node metadata' in the way of [Leo Editor](https://leo-editor.github.io/leo-editor/) to function as 'decorators' for a notebook cell for integration with git.

By this I mean using something like special markers in Python comments (since much of data science is done with Python) to map the content of a cell (or output) to a git repository. Better yet, define a special cell type for git metadata preceding a code cell. Then implement some basic git operations on the contents of a cell. Let's suppose we use @@git as a marker for metadata in comments for git.

--- beginning of cell ---
# @@git %upstream%=https://github.com/pyro-ppl/pyro
# @@git %local%=~/repo/pyrodev
# @@git %branch%=burnburnburn
# @@git %file%=examples/cvae/util.py

# Here begins the contents of the util.py file
...
--- end of cell ---

An extension would implement items in the menubar for various git operations:

- stage - stage the content as util.py file
- checkout - checkout from upstream, replace local copy, and refresh content of cell
- commit - commit stage file specified by %file%
- status - ...
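As a sketch of how an extension might read the proposed `# @@git %key%=value` comments, here is a small parser. Everything in it is hypothetical: the marker syntax is just the one suggested in this post, and `parse_git_metadata` is an invented name, not part of any existing Jupyter extension.

```python
import re

# Matches lines such as: # @@git %file%=examples/cvae/util.py
GIT_META = re.compile(r"^#\s*@@git\s+%(\w+)%=(.*)$")


def parse_git_metadata(cell_source):
    """Split a cell into (metadata dict, remaining cell body)."""
    metadata, body = {}, []
    for line in cell_source.splitlines():
        match = GIT_META.match(line.strip())
        if match:
            metadata[match.group(1)] = match.group(2).strip()
        else:
            body.append(line)
    return metadata, "\n".join(body).strip()


cell = """# @@git %upstream%=https://github.com/pyro-ppl/pyro
# @@git %local%=~/repo/pyrodev
# @@git %branch%=burnburnburn
# @@git %file%=examples/cvae/util.py

# Here begins the contents of the util.py file
x = 1
"""

metadata, body = parse_git_metadata(cell)
print(metadata["file"])
print(body)
```

A "stage" menu item could then write `body` to the path given by `%local%` plus `%file%` and hand it to git; that part is left out here.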

Imagined workflow is that once a working idea scattered throughout a notebook has been sketched out, the user would mark the notebook cells that should be mapped to files in a git repository. Also this could be used in a mixed dev/data science environment where library code under development can be pulled right into a notebook.

Yes, there will be problems with committing code with comments that are specific to one user which is why a special cell type makes sense. Yes, there will be problems that I can't even imagine right now but ...

Please message me if you know of a cell-based git extension for Jupyter Notebooks.


I like JSON for data exchange, but I will never be able to endorse the "notebook is a JSON document" idea. It is the poster child for why you would want a document markup language, and they said "why not encode it in a textual data language instead?"

Yes, even in a document model, merge conflicts can give you invalid documents. Programmers deal with this every day when they create invalid programs. Trying to hide the document into data complicates that in ways that are obvious in hindsight, and not that surprising with foresight.


Totally agree with this. I like the Markdown subset approach LiveBook (https://livebook.dev/) has taken to play nicely with version control in comparison.


I suspect that .ipynb will eventually be displaced by good old HTML documents loading PyScript (or more specifically, a notebook framework built on PyScript).


Ok but I’m still not sold on the whole NBDev ecosystem. Is it worth building up lots of custom infra just so we can build in notebooks? Why should DS software dev be in a different special set of tools distinct from the other software it interacts with?

And a host of other reasons

https://gist.github.com/softwaredoug/d527a18643f29832b0f41af...


The premise behind nbdev isn’t that DS software should be a different set of tools - it’s that we should all consider using a different set of tools, based on literate and exploratory programming principles.

I build all my projects, including a SQL lib, an EC2 interface, http and fastcgi frameworks, various web apps, and a mail merge system in notebooks, in nbdev and it’s made me much more productive.

Amongst folks that have used nbdev for 1+ years that I’ve spoken to, all report a 3+ multiple of improvement in productivity (based on non rigorous self assessment). This could however be biased because these people also report enjoying coding much more, so it’s possible some of that effect is the time just doesn’t seem as long.

Personally, after coding previously for over 20 years in various IDEs and editors (and for instance being prolific enough in vim I often gave talks about it) I wouldn’t ever want to go back to that old pre-notebooks time.

But it’s a huge investment and requires a lot of relearning - to get the most out of it you have to rethink just about everything. So it might be better for those earlier in their careers that haven’t got as many habits to change.


Switching entirely to notebooks for development just feels like requiring everyone to use your favorite IDE for software development. It binds the underlying representation to ipynb files, which will be inscrutable to a lot of developers and can create an additional barrier to entry.

I like the ideas of literate programming, but I think there should be a way to do it independent of the "editor" being used.


org mode ftw


Why shouldn't they use nbdev if it works for them? Plain Git obviously does not work and not only for data scientists. Git, with its inability to deal with non-textual artifacts, is completely unsuited to a modern software development workflow. Modern software development contains interactive graphs, audio and video assets, editable diagrams, vast amounts of binary data for reproduction of experiments, etc. etc. So much custom tooling has been built to somehow shoehorn this stuff into git and it never works well. I'm sick of diagramming software and IaC tools with horrible user experience just to keep a plaintext representation that no human dares to read anyway.


I'm not against notebooks.

I'm against the idea of doing all of your software development in notebooks.

There's a sane way to use notebooks. For me, nbdev is a step too far, as it really pushes notebooks as the primary dev artifact.

(All my opinions, people can do whatever they want)


I am assuming you've seen this? https://www.youtube.com/watch?v=7jiPeIFXb6U :D


Yes I'm a big fan :) I also like the "Why I like notebooks" counterpoint talk. I'm just not personally convinced.

While they make some great arguments about where notebooks can be really powerful for software development, I don't think the speaker makes entirely valid counterpoints to the original "I don't like notebooks" talk and most of the problems of notebooks still stand.

See this gist I posted above https://gist.github.com/softwaredoug/d527a18643f29832b0f41af...


I've written two projects with nbdev. It's great, but after a while we moved off of it: too limiting.


Cool to see that this is moving along - Jupyter merge conflicts have caused me a huge amount of headache over the years.

My solution has been to switch over to Quarto notebooks (mentioned in the post with Jupytext), but I see the issue around saving cell outputs.

I'm curious why one would specifically want to save cell outputs as is in the Jupyter notebook, rather than archiving that in some other format. Sure, that might require putting a lot of information in one page (e.g., if that output is dependent on many other code cells and their outputs), but that just moves the linkage problem around - you'd have to have some way of indicating that the specific cell output was generated by a specific version of cell code (and the order in which they were run, sometimes multiple times).


Strongly agree with this. IIRC, in RMarkdown state is treated as a separate cache stored outside the notebook and loaded as needed. You could use something like dvc or gitlfs to manage those cache files, and since the Markdown file is plain text, use regular git to inspect changes to the notebook implementation.

I feel like Jupyter notebooks are the PDFs of data science. They are super useful for displaying results, but bake that data in a super inconvenient way for doing anything but rendering the data to look nice.


You want to share the outputs so that you can share the results while showing your work. That jupyter also inlines everything, such that even charts are stored in the document, makes that even more necessary.

Though, I can see your point, I think. Why not include a build step that moves from your document to the generated output? My gut feeling is that a large part of why the system got popular is that they worked hard on removing the friction that such a step would add.

As a comparison and to your point, I've seen people try to build "literate test suites" that were in a notebook, very happy with how the output looked. Only to find later that if they had used some of the more common test frameworks, those already create very nice reports. And moving the report format/creation out of the specification allowed a ton of flexibility.


> You want to share the outputs so that you can share the results while showing your work.

What's wrong with rendering to HTML?


You often share with people who want to play with the inputs or the code, while at the same time you want to share with them what your choice of inputs produced.


right, so what's the issue with sharing a notebook (code) and rendered html (results)?

If someone starts playing with the inputs, they're going to lose the outputs you've created unless you have a saved rendered copy anyways


Depends what you mean by "what's wrong?" Conceptually, absolutely nothing. In practice, many of the folks we are talking about have been bitten by mismatched files already. Why add one more set of files to juggle?


> I'm curious why one would specifically want to save cell outputs as is in the Jupyter notebook

My blind guess is that it improves the readability of the notebook / promotes the literate programming mindset. But just a guess


Glad to see a custom merge driver being used here - they’re one of the most powerful of git’s obscure features. Large teams working on a monorepo inevitably start noticing that particular files are magnets for conflicts (or other times, in cases like this, some files are a huge pain to resolve whenever they conflict).

This happens especially frequently if your team uses a lot of CIGARs (checked in generated artifacts)

In most cases writing a simple driver to automatically handle the conflict resolution is pretty straightforward (especially if the resolution is usually just to regenerate a generated file) and well worth the up front effort to eliminate ongoing conflict headaches.

https://git-scm.com/docs/gitattributes#_defining_a_custom_me...

The main hassle is that for security reasons all developers need to opt in by registering the merge driver, which you can put in a project bootstrap script if you have one. Would be great if GitHub (disclaimer: where I used to work) would integrate custom merge drivers in their internal conflict resolution flow.
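To make the idea concrete, here is a minimal sketch of a custom merge driver for JSON files. It is an illustration under stated assumptions, not nbdev's driver: the names `jsonmerge` and `jsonmerge.py` are hypothetical, lists are treated as atomic values, and on a conflict it simply keeps "ours" and reports the path. It would be wired up with a `.gitattributes` line such as `*.json merge=jsonmerge` plus the opt-in `git config merge.jsonmerge.driver "python jsonmerge.py %O %A %B"`, where git substitutes the ancestor, current and other versions for `%O`, `%A` and `%B`, expects the result to be written to the `%A` file, and treats a non-zero exit status as a conflict.

```python
import json
import sys

_MISSING = object()  # marks a key that is absent on one side


def merge3(base, ours, theirs, path="", conflicts=None):
    """Three-way merge of JSON values; returns (merged, list of conflict paths)."""
    if conflicts is None:
        conflicts = []
    if ours == theirs or theirs == base:
        return ours, conflicts  # same change on both sides, or only we changed it
    if ours == base:
        return theirs, conflicts  # only they changed it
    if isinstance(ours, dict) and isinstance(theirs, dict):
        base = base if isinstance(base, dict) else {}
        merged = {}
        for key in list(ours) + [k for k in theirs if k not in ours]:
            value, _ = merge3(
                base.get(key, _MISSING),
                ours.get(key, _MISSING),
                theirs.get(key, _MISSING),
                f"{path}/{key}",
                conflicts,
            )
            if value is not _MISSING:  # a _MISSING result means the key was deleted
                merged[key] = value
        return merged, conflicts
    conflicts.append(path or "/")  # both sides changed it differently: keep ours
    return ours, conflicts


def main(argv):
    """Entry point for git: argv is [script, %O, %A, %B]."""
    base_path, ours_path, theirs_path = argv[1:4]
    documents = []
    for file_path in (base_path, ours_path, theirs_path):
        with open(file_path, encoding="utf-8") as handle:
            documents.append(json.load(handle))
    merged, conflicts = merge3(*documents)
    with open(ours_path, "w", encoding="utf-8") as handle:
        json.dump(merged, handle, indent=1)
        handle.write("\n")
    for conflict in conflicts:
        print(f"conflict at {conflict}", file=sys.stderr)
    return 1 if conflicts else 0


if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv))
```

A real notebook driver would need to be smarter about the `cells` list (matching cells rather than comparing the whole list), which is the kind of work dedicated tools take on.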


seems very odd. the issue was that when Git marked lines in a file as being in conflict then the file was no longer a valid Jupyter file. and their solution was... to change the language so that Git's conflict syntax is valid?


No the answer was simply that git doesn’t come with a merge driver for json, only one for line oriented text. So nbdev provides one for ipynb json documents.
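A small illustration of why the default line-oriented merge is a problem for JSON documents: once conflict markers are written into the file, it no longer parses, so a notebook frontend cannot open it. The `CONFLICTED` text below is a hand-made example and `is_valid_json` is a helper invented for this sketch.

```python
import json

# What a line-oriented merge leaves behind when both branches touched the same lines.
CONFLICTED = """{
 "cells": [],
 "metadata": {},
<<<<<<< HEAD
 "nbformat": 4,
 "nbformat_minor": 4
=======
 "nbformat": 4,
 "nbformat_minor": 5
>>>>>>> feature
}"""

# The same document after someone picks one side by hand.
RESOLVED = """{
 "cells": [],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}"""


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(CONFLICTED))
print(is_valid_json(RESOLVED))
```

A format-aware merge driver sidesteps this by resolving the conflict inside the JSON structure instead of at the line level.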


Yes, I think you are right. Seems like quite a good plan to me. Nothing else obvious jumps to mind, as anyone using git is going to end up with git conflict markers in their files at some point.


yeah it's definitely some classic engineering around the problem, but hard to fix Github I guess. At a previous company, we switched to Gitlab for the specific reason that they handled notebook diffs slightly better. It's a struggle out here


Note that the git conflict markers have nothing to do with Github but are generated by git itself (the command line client running locally on your machine).


I think a lot of us have our own solutions to Jupyter Workbooks in Git. My personal one is this: https://win-vector.com/2022/08/20/an-effective-personal-jupy... (which itself is sub-tooling for running lots of copies of a parameterized notebook, all easy as so much of nbconvert and Jupyter are exposed as APIs).


Just a data point: at Hex we built a .yaml based import/export for our notebooks, more than partially to make it friendlier to work with git (also partially because as we added new features it became harder to express custom cell types in .ipynb).

We still support ipynb import/export, but using yaml for our internal representation of notebooks has made it hugely easier to do human-readable diffs and makes git operations way easier. (https://hex.tech/blog/github-sync/)


nbdev has been a god send and I'm really enjoying using it. Great to see fast.ai investing more into making notebooks even more usable


Worth noting that GitHub has been rendering ipynb diffs for a while now: https://github.blog/changelog/2023-03-01-feature-preview-ric...


Jupyter+Git has no problem.

The problem is Jupyter notebooks+Git.

The solution is use Jupyter, but not Jupyter notebooks.


Thanks, but this is a bit pedantic. For most people Jupyter is the notebook. You could just as well call the other one the "Jupyter protocol".


It's not pedantic, it's a huge difference.

If I said "the problem is now solved, just use VS Code", most people would say "but that's not jupyter, i need the interactivity aspect", not realizing that VS Code is just as interactive (thanks to jupyter).


What is the difference between the two? Googling it shows some comparisons between jupyter lab vs notebook but I can't find anything on plain jupyter.


Jupyter is the protocol to communicate between a frontend and a backend.

Jupyter notebooks is an interactive notebook that implements that protocol.

Do you know what also implements the protocol? VS Code. And doesn't have any of the stupid problems that the Jupyter notebooks do.


Are you saying that VS Code also lets you do all of the "Run just this cell" that notebooks do?


Yes.

In VS Code if you do as below:

#%%
print("my python print")
#%%

Then you can run it as a cell. Crucially however, note that those delimiters are ALSO valid python code. So you just version as with everything else.
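To illustrate the point that the delimiters are plain comments, here is a small sketch that splits such a file into cells and also compiles the whole thing as ordinary Python. The `split_cells` helper is hypothetical and much simpler than what editors actually do; it only recognises lines that start with `#%%` or `# %%`.

```python
import re

CELL_MARKER = re.compile(r"^#\s*%%")


def split_cells(source):
    """Split percent-format source into a list of non-empty cell bodies."""
    cells, current = [], []
    for line in source.splitlines():
        if CELL_MARKER.match(line):
            text = "\n".join(current).strip()
            if text:
                cells.append(text)
            current = []
        else:
            current.append(line)
    text = "\n".join(current).strip()
    if text:
        cells.append(text)
    return cells


source = '#%%\n\nprint("my python print")\n\n#%%\nx = 1\n'

cells = split_cells(source)
print(cells)

# The markers are comments, so the file as a whole is still valid Python.
code = compile(source, "<cells>", "exec")
```

Since the file is just Python text, a regular line-based git diff and merge apply to it directly.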


There isn’t a problem any more with notebooks and git because nbdev now provides a merge driver. So it’s fine to use notebooks with git now.



