
I work with a bunch of 'data scientists' / 'strategists' and the like who love their notebooks but it's a pain to convert their code into an application!

In particular:

* Notebooks store code and data together, which is very messy if you want to look at [only] code history in git.

* It's hard to turn a notebook into an assertive test.

* Converting a notebook function into a python module basically involves cutting and pasting from the notebook into a .py file.

These must be common issues for anyone working in this area. Are there any guides on best practices for bridging from notebooks to applications?

Ideally I'd want to build a python application that's managed via git, but some modules / functions are lifted exactly from notebooks.




> Are there any guides on best practices for bridging from notebooks to applications?

The main point of friction is that the "default" format for storing notebooks is not valid, human-readable Python code, but an unreadable JSON mess. The situation would be much better if a notebook were stored as a Python file, with code cells verbatim and markdown cells inside Python comments with appropriate line breaking. That way, you could run and edit notebooks from outside the browser, and let git track them easily. Ah, what a nice world that would be.

But this is exactly the world we already live in, thanks to jupytext!

https://github.com/mwouts/jupytext
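
For reference, jupytext's "percent" representation is just a .py file with cell markers (a rough sketch; the cell contents here are made up):

    # %% [markdown]
    # ## Load the data
    # This whole cell round-trips to a markdown cell in the .ipynb.

    # %%
    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical input file
    df.describe()

Plain comments, diffs like any other Python file, and jupytext keeps it in sync with the .ipynb.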


There's also org mode in emacs.

https://github.com/nnicandro/emacs-jupyter

I'm not a great fan of notebooks though; I keep using the REPL with X forwarding for matplotlib, alongside a code editor.


Or you could do what I do, and write the report as specially marked comments in the actual code, which can be grepped out later to create a valid markdown document.

Pipe into pandoc, prepend some css, optionally a mathjax header, done. Beautiful reports.
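
A minimal sketch of the extraction step (the "#: " marker is an arbitrary choice of mine, not any standard):

    # extract_report.py - print the "#: " comment lines of a script as markdown
    import sys

    for line in open(sys.argv[1]):
        if line.lstrip().startswith("#: "):
            print(line.lstrip()[3:], end="")

Then something like `python extract_report.py analysis.py | pandoc -s --mathjax -o report.html`.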

Honestly I've yet to be convinced there's good reason for anything more than this.


Yes, I use a very similar setup with a three-line makefile to test and build. But the OP wanted to use the in-browser notebook interface, and this is still possible via jupytext (while allowing collaboration with out-of-browser users).



To your pain points:

1) This is painful. There are tools to help, but the most effective measure I've found is a policy of only committing notebooks in a reset, clean state, enforced with a git hook (a sketch follows after this list).

2) I don't understand. I've written full testing frameworks for applications as notebooks, as a means of having code documentation that enforced/tested the non-programmatic statements in the document. Using tools like papermill (https://papermill.readthedocs.io/en/latest/), you can easily write a unit test as a notebook with a whole host of documentation around what it's doing, execute it, and inspect the result (failed execution vs. final state of the notebook vs. whatever you want).

3) Projects like ipynb (https://ipynb.readthedocs.io/en/stable/) allow you to import notebooks as if they were Python modules. Different projects have different opinions of what that means, to match different use cases. Papermill lets you interface with a notebook more like a system call than an imported module. I've personally used papermill and ipynb and found both enjoyable for different flavors of blending applications and notebooks.
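
To make 1) concrete, here is a minimal pre-commit hook as a sketch (it ignores edge cases; nbstripout is the more robust off-the-shelf option). It simply rejects staged notebooks that still contain outputs:

    #!/usr/bin/env python3
    # .git/hooks/pre-commit (must be executable): reject non-reset notebooks.
    import json, subprocess, sys

    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    dirty = [
        path for path in staged
        if path.endswith(".ipynb")
        and any(cell.get("outputs") for cell in json.load(open(path))["cells"]
                if cell["cell_type"] == "code")
    ]
    if dirty:
        sys.exit("commit rejected, notebooks not reset: " + ", ".join(dirty))

And for 2) and 3), a sketch of the papermill/ipynb usage (the notebook, parameter, and function names are hypothetical):

    # test_analysis.py - run a notebook as a pytest unit test via papermill
    import papermill as pm

    def test_analysis_notebook(tmp_path):
        # Raises PapermillExecutionError if any cell fails.
        pm.execute_notebook(
            "analysis.ipynb",
            str(tmp_path / "out.ipynb"),
            parameters={"n_rows": 100},
        )

    # Or import just the definitions from a notebook via the ipynb package:
    # from ipynb.fs.defs.analysis import clean_data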


This problem is one reason why I'm a little mystified by Jupyter's widespread adoption. It's got a lot of neat features, but the RStudio/RMarkdown combo solves the above problem, and for me at least, that's decisive. As a tradeoff, you deal with an IDE that, in a bunch of ways, adds friction to writing Python code; but I gather that the RStudio team is working on that (https://www.rstudio.com/solutions/r-and-python/). Not trying to start a flamewar here, I actually just don't get why Jupyter has become the default.

(Caveat that Jupyter is way better with e.g. Julia, in my (limited) experience)


For R&D the feedback loops are much tighter for sketching an algorithm line by line in Jupyter vs. a Python file. Error in the 20th function? Ok fine, then I'll just change the cell it's defined in and continue from the state after the 19th. If I forget the layout or type of an object, I just inspect it right there in a new cell.

Especially if it deals with multimedia, can just blit images or audio or HTML applications inline.

And it’s fairly trivial to go from Jupyter Notebook -> Python file once you’re done.


I think the author was comparing R and Python, not Python and Jupyter.


Specifically I think they were comparing rmarkdown vs jupyter. And it's really no contest, all the things people hate about jupyter are solved by rmarkdown (and org mode, but that's a harder sell)


The problem with RStudio is that it uses R, which while excellent at numerical calculations, is terrible at everything else - data parsing, string munging, file processing, ...

As the joke goes: The best thing about R is that it's designed by statisticians. The worst thing about R is that it's designed by statisticians.


evidence that R is terrible at everything else?

specifically "data parsing", "string munging", and "file processing"?

I've used R extensively for all of these, and having recently re-visited the python world, I don't see any advantage that Python has over R for any of these tasks.


RStudio has pretty amazing python support now FYI


My wife has been learning Python (not a programmer) and is now looking at R. I thought she was going to like it, as I personally think RStudio is nice. I was surprised she didn't like RMarkdown after being exposed to Python notebooks; in particular, she loved vscode + notebooks and the immediate feedback, and didn't like at all that RStudio doesn't render the markdown interactively, nor the R REPL. I have used very little R and I'm a heavy Python user, so maybe I didn't know how to help her more effectively. I think I helped solve the main Python pain points: installing anaconda, vscode, the python extension, and some additional auto-completion. I don't use vscode (I use Emacs) but it's great that it's available for newbie users :p. Also, having Colab was nice for simple things.

To summarize: I think notebooks are great for newcomers. It requires more maturity to appreciate more principled programming.


I wonder what she would make of writing Python in RStudio, to an Rmd? RStudio is trying to get people on board with this, e.g. https://support.rstudio.com/hc/en-us/articles/360023654474-I...


Well, it would be the opposite of what we want: R inside vscode. Anyway, we have yet to try R inside Jupyter notebooks.


"Avoid if possible" is the easiest answer. Encourage your colleagues to move their code into proper packages when they're happy with it, and restrict notebooks to _use_ of their code.

Failing that, I think fast.ai's nbdev[0] is probably the most persuasive attempt at making notebooks a usable platform for library/application development. Netflix has also reported[1] substantial investment in notebooks as a development platform, and open-sourced many/most of their tools.

[0]: https://nbdev.fast.ai

[1]: https://netflixtechblog.com/notebook-innovation-591ee3221233


I've worked as a data scientist for quite a while now, in IC, lead, and manager roles, and the biggest thing I've found is that data scientists cannot be allowed to live exclusively in notebooks.

Notebooks are essential for the EDA and early prototyping stages, but all data scientists should be enough of a "software engineer" to get their code out of their notebook and into a reusable library/package of tools shared with engineering.

On the best teams I've worked on, the hand-off between DS and engineering is not a notebook, it's a pull request, with code review from engineers. Data scientists must put their models in a standard format in a library used by engineering, they must create their own unit tests, and they must be subject to the same code review that an engineer would be. This last step is important: my experience is that many data scientists, especially those coming from academic research, are scared of writing real code. However, after a few rounds of getting helpful feedback from engineers, they quickly learn to write much better code.

This process is also essential because if you are shipping models to production, you will encounter bugs that require a data scientist to fix, that an engineer cannot solve alone. If the data scientists aren't familiar with the model part of the code base, this process is a nightmare, as you have to ask them to dust off questionable notebooks from months or years ago.

There are lots of parts of the process of shipping a model to production that data scientists don't need to worry about, but they absolutely should be working as engineers at the final stage of the hand-off.


I agree with everything you said above, and that is exactly how we have always had things at my place of employment (I work at a small ML/algorithm/software development shop). That being said, the one thing I really don't understand is why notebooks are considered essential even for EDA. I guess if you were doing things in Notepad++ or a pure REPL shell they would be handy, but using a powerful IDE like PyCharm makes notebooks feel very, very limiting in comparison.

Browsing code, underlying library imports and associated code, type hinting, error checking, etc. are so vastly superior in something like PyCharm that it is really hard to see why one would give it all up to work in a notebook, unless they never matured their skillsets enough to see the benefits afforded by a more powerful IDE. I think notebooks can have their place and are certainly great for documenting things with a mix of Markdown, LaTeX, and code, as well as for tutorials that someone else can directly execute. And some of the interactive widgets can also make for nice demos when needed.

Notebooks also often make for poor habits, and as you mentioned, having data scientists and ML engineers write code as modules and commit it via pull requests helps them grow into better software engineers, which in my experience is almost a necessity for being a good and effective data scientist or ML engineer.

And lastly, version-controlling notebooks is such a nightmare. Nor are they conducive to code reviews.


There's an advantage to long-lived interpreters/REPLs on remote machines for the kind of work done in notebooks. Significant amounts of data may have to be read, expensive computation performed, etc. before the work can begin. Notebooks are an ergonomic interface to that sort of environment if one isn't comfortable with ssh/screen/X-forwarding/etc, and frankly nice for some tasks even if one is.

There's also a tacit advantage to notebooks specifically for Python, as the interface encourages the user to write all of their definitions in a single namespace. So the user can define and re-define things at their leisure within a single REPL/interpreter lifetime. A user developing against import-ed modules can quickly get stuck behind Python's inability to cleanly re-import a module, or be forced to rely on flaky hacks to the import system.
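
The usual partial workaround, assuming IPython/Jupyter, is the autoreload extension, which papers over, but doesn't fully fix, the re-import problem:

    %load_ext autoreload
    %autoreload 2  # re-import all modules before executing each cell

    import mylib  # hypothetical module; edits to mylib.py are now picked up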

It pains me a bit to make the argument _for_ notebooks, but it's important to understand the attractions.


Thanks for sharing that perspective! It was helpful to get that POV. I agree that a requirement for long lived interpreters and a simpler UX to get up and running probably makes it an attractive option.

With VSCode having such excellent remote development capabilities now, however, it feels like a nicer option these days, though I guess only if you really care about the benefits that brings. Agreed that re-importing libraries is still a major pain point in Python, but the "advantage" of Jupyter notebooks is also unfortunately what leads to terrible practices and bad engineering, as most non-disciplined engineers end up treating a notebook as one giant script of spaghetti code to get the job done.


When EDA involves rendering tables or graphics, notebooks provide a faster default feedback loop. Part of this comes from the assumption that the kernel holds state, so data loading, transformations, and viz can be run incrementally and without switching context. That's not to say it's not possible with a python repl and a terminal with image support, but that's essentially the value prop of notebooks. Terrible for other things like shipping code, but very good for interactive sessions like EDA work.

Personally, I find myself prototyping in notebooks and then refactoring into scripts very often and productively.


I've found myself in a data science group by merger, and this (what type of artifact to ship) is a current team discussion point. Would you be willing to let me pick your brain on this topic in depth?


This is how my lab works. We do a lot of prototyping, exploring, making sure everything seems to be working, etc. and then pack it all into reasonably well documented standard code.

Learned this the hard way after working for a while in a group with a single shared notebook I had nicknamed "the wall of madness".


Not sure if this is similar, but my janky setup:

Atom (editor) + Hydrogen (Atom plugin). I like Hydrogen over more notebook-like plugins that exist for VSCode because it's nothing extra (no 'cells') beyond executing the line under your cursor/selection.

Then I just start coding, executing/testing, refactoring, moving functions to separate files, importing, calling my own APIs... rinse, repeat.

I tend to maintain 3 'types' of .py files.

1. First-class python modules - the refactored and nicely packaged re-usable code from all my tinkering.

2. Workspace files - these are my working files. I solve problems here. It gets messy, and doesn't necessarily execute top to bottom properly (I'm often highlighting a line and running just it, in the middle of the file).

3. Polished workspaces - once I've solved a problem ("pull all the logs from this service and compute average latency, print a table"), I take the workspace file and turn it into a script that executes top to bottom, so I can run it in any context.


This is a daily pain we've experienced working in industry! Our projects would usually allocate a few weeks to refactoring notebooks before deployment. So we started working on an open-source framework to help us produce maintainable work from Jupyter. It allows easy git collaboration and eases deployment. https://github.com/ploomber/ploomber


I've been using ploomber for a month and so far, I really like it. The developers have been super helpful. It hits the sweet spot for writing developer-friendly, maintainable scientific code. Our data science team is looking at adopting it as our team's standard for deployments.


Admittedly, I'm one of those people. This problem also applies to the use of Excel for exploratory programming and analysis.

There are no guides that I'm aware of. Part of the reason may be a mild "culture" divide between casual and professional programmers, for lack of better terms. Any HN thread about "scientific" programming will include some comments to the effect that we should just leave programming to the pros.

My advice is to immerse yourself in the actual work environment of the casual programmers: Observe how we work, what pressures and obstacles we face, what makes our domain unique, and so forth. Figure out what solutions work for the people in the trenches. My team hired an experienced dev, and I asked him specifically to help me with this. One thing I can say for sure is that practical measures will be incremental -- ways that we can improve our code on the fly. They will also have to recognize a vast range of skills, ranging from raw beginners to coders with decades of experience (and habits).

Jot down what you learn, and share it. I think our side of the cultural divide needs help, and would welcome some guidance.


I agree with you, having been on both sides of the divide and having researched and written my master's thesis on teaching programming to undergrad science students.

Are you aware of https://software-carpentry.org/? It started after I graduated and I knew people who were involved with it at the time. It seemed like a good idea.


Care to share a link to your thesis? I'm always interested in work in this area.


It looks like I didn't put it on arXiv, so I need to find a copy and then put it back online :) Will reply here when I do, but it's likely to be a week+ before I do.


Any luck?


There's nothing wrong with Excel (as long as you stay below the 64k limit). People use it because it works. That is almost tautologically close to whatever it is that software aspires to.

Excel has gotten more people to write code than all other programming environments together. And they’ve often enjoyed doing it. It’s a fantastic success story.


Quite agreed, Excel is great, more important than Word or PowerPoint if you ask me.

But in terms of writing organized, readable code that can be used by other people, there's very little guidance.


- We mostly use notebooks as scratchpads or alpha prototypes.

- Papermill is a great tool when setting up a scheduled notebook and then shipping the output to S3: https://papermill.readthedocs.io/en/latest/

- When turning notebooks into more user-facing prototypes, I've found Streamlit is excellent and easy to use (see the sketch after this list). Some of these prototypes have stuck around as Streamlit apps when there are 1-3 users who need to use them regularly.

- Moving to full-blown apps is much tougher and time-consuming.
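
A sketch of the Streamlit point, for anyone who hasn't seen it (the file and column names are made up):

    # app.py - a notebook fragment turned into a tiny Streamlit prototype;
    # run with: streamlit run app.py
    import pandas as pd
    import streamlit as st

    st.title("Latency explorer")
    df = pd.read_csv("latencies.csv")
    cutoff = st.slider("Min latency (ms)", 0, 500, 100)
    st.dataframe(df[df.latency_ms > cutoff])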


This is a great insight! I think parameterizing the notebooks is part of the solution; moving to production shouldn't be time-consuming, and there's definitely no need to refactor the code like I've seen some people do. I'd love to get your feedback. We're building a framework to help people develop maintainable work from Jupyter! https://github.com/ploomber/ploomber


First, yes, this is a common question. IPython does not try to deal with that; it's just the execution engine.

Notebooks do not have to be stored in ipynb form; I would suggest looking at https://github.com/mwouts/jupytext. But the notebook UI is inherently not designed for multi-file and application development, so training humans will always be necessary.

Technically, Jupyter Notebook does not even care that notebooks are files; you could save them using, say, postgres (https://github.com/quantopian/pgcontents), and even sync content between notebooks.

I'm not too well informed on this particular topic anymore, but there are other folks at https://www.quansight.com/ who might be more aware; you can also ask on discourse.jupyter.org. I'm pretty sure you can find threads on those issues.

I think on the Jupyter side we could do a better job curating and exposing many tools to help with that, but there are just so many hours in the day...

I also recommend "I Don't Like Notebooks" by Joel Grus, https://www.youtube.com/watch?v=7jiPeIFXb6U. It's a really funny talk; a lot of the points are IMHO invalid, as Joel is misinformed about how things can be configured, but it's still a great watch.


I see where you're coming from. From where you sit, Jupyter is a language-agnostic tool and so on. But the fact that there are dozens of solutions in this space is surely a problem?

I'd have thought there would be some things you could strongly encourage:

1. Come up with some standard format where the code and the data live in separate files.

2. Come up with some standard format where you can load a regular .py script as a cell-based notebook using metadata comments (and save it again).

If these came out of the box it would solve most of the issues.


Funny you should ask. I just wrote a book called Effective Pandas[0] that discusses ways to use pandas (in Jupyter) that lead to easy re-use, sharing, production, and testing. Here's a video with many of the ideas if you prefer [1].

People tend to have strong feelings when they see my pandas code, as it is different from much of the (bad) advice in the Medium echo chamber. Generally, most who try it out are very happy.

The basics are embrace chaining, avoid .apply, and organize notebooks with functions (using the chain).
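
A minimal sketch of that chaining style (the column names are made up):

    import pandas as pd

    def tweak_sales(raw: pd.DataFrame) -> pd.DataFrame:
        # One readable chain per logical step, instead of mutating
        # the dataframe across a dozen cells.
        return (
            raw
            .query("amount > 0")
            .assign(month=lambda df_: pd.to_datetime(df_.date).dt.month)
            .groupby("month", as_index=False)
            .agg(total=("amount", "sum"))
        )

Each cell then calls a function like this, so lifting the code into a module later is a copy-paste of a function, not surgery on cell state.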

Oh, and Jupytext is a life saver if you are someone who uses source control.

0 - https://store.metasnake.com/effective-pandas-book

1 - https://www.youtube.com/watch?v=zgbUk90aQ6A


The whole point of notebooks is to focus only on exploration of data, making some nice plots, adding some explanatory text, and NEVER think about software engineering.

A decent data scientist who also understands software engineering will sooner or later take the prototype code from the notebook and refactor it into proper modules. Either this or the notebook will become an unrunnable mess as it is developed further. Reusing code and functions in a grown notebook is just too fragile.


I would suggest you take a look at the nbdev library:

https://github.com/fastai/nbdev

I have been using it for more than a year and it has been a great experience


I'm working on a solution that helps with transforming notebooks into web applications (with a GUI). You just need to define a YAML config (similar to R Markdown) and the framework will generate a web app with interactive widgets. After changing the widgets, the user clicks the Run button, and the whole notebook is executed, converted to HTML, and displayed to the user.

The framework is called Mercury and is open-source https://github.com/mljar/mercury


The problems you mention are solved by auxiliary tools in the notebook ecosystem.

- Look at nbdime & ReviewNB for git diffs

- Check out treon & nbdev for testing

- See jupytext for keeping .py & .ipynb in sync

I agree it's a bit of a pain to install & configure a bunch of auxiliary tools but once set up properly they do solve most of the issues in the Jupyter notebook workflow.

Disclaimer: I built ReviewNB & Treon


This is only a plan so far (partially implemented). I separate code into clean and ad-hoc. Clean code is "supported" - maintained (jobs monitored, failures handled, bugs fixed) by more professional developers; if somebody wants a custom job, they are more or less on their own. When I am asked to fix a problem in such a "custom" job, the first thing I do is refactor the code to follow standards (configuration, no hardcoded paths and values, logging, alert notifications to a predefined list of people related to the project, recovery handling, etc.); then it becomes part of the main pool of "maintained" code.


In VS Code, a .py file can work like a notebook. VS Code treats #%% as the start of a cell, while it remains a plain comment when the file is run as a .py script. VS Code can also convert an existing Jupyter notebook to a .py file in this format.


Instead of looking for a quick 1:1 conversion from notebook --> app, it should be a line-by-line re-creation, using the notebook as more of a whiteboard.

This approach, while much slower, limits errors and ensures sustainability, because both the notebook creator and the app creator will know what's going on.

I think solutions like papermill and others only work when you have infinite money and time.


I agree with the idea of using it as a whiteboard. When I need to do casual programming and data analysis for my non-software job, I tend to work it out in a notebook first, then start combining all the smaller cells into larger chunks that I can then move into a proper python script.


I use DVC to store periodic snapshots of raw notebooks, and export them to .py files to be tracked by plain git.

They are still kind of a mess because I use them as scratch space. Anything worthwhile gets polished and put into a package manually.


> Ideally I'd want to build a python application that's managed via git, but some modules / functions are lifted exactly from notebooks

Write libraries, track them in git and call them in notebooks?


This is a fundamental problem for me too. No source control, no tests, hard to extract into libraries. I'm surprised there isn't a better tool already.


We'd love to get your feedback. We're building a framework to help people develop maintainable work from Jupyter! https://github.com/ploomber/ploomber


if you are "cutting and pasting from the notebook into a .py file" you should look at `jupyter nbconvert` on the CLI.

I think there are ways to feed it a template that basically metaprograms what you want the output .py file to look like (e.g. render markdown cells as comments vs. just removing them), but I've never quite figured that out.
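
For the simple case, `jupyter nbconvert --to script notebook.ipynb` does a verbatim conversion. If you'd rather post-process the source yourself, a sketch of the same thing via nbconvert's Python API (the notebook name is hypothetical):

    # Convert a notebook to a .py script programmatically.
    import nbformat
    from nbconvert import PythonExporter

    nb = nbformat.read("analysis.ipynb", as_version=4)
    source, _resources = PythonExporter().from_notebook_node(nb)
    with open("analysis.py", "w") as f:
        f.write(source)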




