How we made Jupyter notebooks load faster

davidgomes · 2024-09-10T15:52:22 1725983542

I was on this team at SingleStore and I can vouch for how hard this team worked on this project. I just opened a couple notebooks in production and they loaded *instantly*, so kudos to the team for seeing this project through.

(If you're not familiar with SingleStore's Jupyter Notebooks, they're similar to Databricks Notebooks[1] or Azure Synapse Notebooks[2]).

[1]: https://docs.databricks.com/en/notebooks/index.html

[2]: https://learn.microsoft.com/en-us/azure/synapse-analytics/sp...

yunohn · 2024-09-11T08:13:37 1726042417

I appreciate the write-up, it's very insightful. But I was quite concerned about the amount of response mocking being used as a solution. Did the team ever consider instead making upstream changes to jupyter-lab (which is FOSS), so that these requests are either deferred or can be configured to not run? That seems like it would benefit everyone, including your company - and might even uncover further optimizations.

tgrine · 2024-09-11T10:37:06 1726051026

Hi, I'm one of the co-authors of the blog post. You raise a valid point and, in large part, I agree with you. To offer some explanation, there are essentially 2 main reasons for us taking this approach:

1. Contributing to an open source project, especially one the size and complexity of jupyter-lab, is generally going to be a slower process than finding a solution "in house". Improving the load times became a priority once we realized our notebooks were bringing value to users, and we wanted to deliver a better experience as soon as possible; 2. It's not always apparent if the changes you are looking for from an open sourced project are useful to a more general audience or if it's very specific to the way you are using the project. A lot of the requests that were mocked could only be so because we either didn't use them in our implementation (for example, users and workspaces) or because we know the response won't change (for example, some extension settings which we don't allow users to change). Is this a common situation for others or is it a niche circumstance of how we are using jupyter-lab? If it's not common, then adding these options in jupyter-lab itself could just increase its complexity while not bringing that much benefit (not saying this is necessarily the case here);

To your point though, a good example of this is the checkpoints feature. There is an open issue requesting the option to disable checkpoints[1] as it is not always useful for people. We had the same issue, since we are not using checkpoints, but the requests were always being made. Ultimately, we just mocked the checkpoints requests, but it's probably the case that making the changes to jupyter-lab to disable this would benefit us and other people as well.

[1]: https://github.com/jupyterlab/jupyterlab/issues/11826

yunohn · 2024-09-11T12:13:47 1726056827

Thanks for the clarification, that's fair enough. But I hope you do look into upstreaming, maybe even just by opening an Issue to guage community interest. Like the pre-existing checkpoints Issue you linked to, opening some for your functionality might show others what is possible.

spiralk · 2024-09-10T20:35:51 1726000551

I dislike how Jupyter notebooks have become normalized. Yes, the interactive execution and visuals are nice for more academic workflows where the priority is quick results over code organization. However, when it comes to sharing code with others for the sake of doing reproducible science, jupyter notebooks cause more trouble than they are worth. Using cell based execution with python is so elegant with '# %%' lines in regular .py files (though it requires using VSCode or fiddling with vim plugins which not all scientists want to do I suppose). No .ipynb is necessary, .py files can be version controlled and shared like normal code while sill retaining the ability to use interactively, cell by cell.

Its much easier to organize .py files into a proper python module, and then share and collaborate with others. Instead, groups will collect jumbles of slightly different versions of the same jupyter notebooks that progressively become more complex and less manageable over time. It's not a hypothetical unfortunately, I've seen this happen at major university labs. I'm not blaming anyone because I understand -- the funding is there to do science and not rewrite code to build convenient software libraries. Yet, I can't help but wish jupyter notebooks could be removed from academic workflows.

epistasis · 2024-09-10T22:03:30 1726005810

I think there's a fundamental mistunderstanding and mismatch between what you want to do, and what Jupyter notebooks are for. The distinction is between code versus the results.

If the code is the end product, sure, use a python package.

But does your .py with `# %%` in it also store the outputs? If not, why even bring this up? A .py output without the plots tied to the code doesn't meet the basic use case.

If the end product is the plot, I want to see how that plot was generated. And a Jupyter notebook is a much much better artifact than a Python package, unless that Python package hard codes the inputs and execution path like a notebook would.

Over the past 20 years of my career I have run into this divergence of use cases a lot. Software engineers seem to not understand the end goals, how it should be performed, and the learnings of the practitioners that have been generating results for a long time. It's hard to protect data scientists from these inflexible software engineers that see "aha that's code, I know this!" without bothering to understand the actual use case at hand.

spiralk · 2024-09-10T22:47:54 1726008474

Not having the outputs tied into the code is actually preferable if the ultimate goal is reproducible science. Code should be code, documentation should be documentation, and outputs should be outputs. Having multiple copies of important code in non-version controlled files is not a good practice. Having documentation dispersed with questionable organization in unsearchable files is not good a practice. Having outputs without run information and timestamps is not a good practice. Its easy to fall in to those traps with Jupyter notebooks. It might speed up initial set up and experimentation, but I've been working academic labs long enough to see the downstream effects.

majormajor · 2024-09-10T22:55:50 1726008950

Having the outputs recorded alongside specific versions of the code can actually be very valuable.

But since most uses of Jupyter notebooks I've seen don't version control them much at all, it's not as useful in practice often.

spiralk · 2024-09-10T23:30:09 1726011009

Yeah, jupyter notebooks don't guarantee any specifics about versions of code used for that output. In the real world you can expect everyone in the lab including all of the students to be editing jupyter notebooks at whim. The only way to do this would be to have proper version control and of your code, a snapshot of the environment, and to log all this along with the run that generated the output. This is possible with regular python using git, proper log files, etc. Jupyter notebooks seem like an extra roadblock.

paddy_m · 2024-09-11T00:39:29 1726015169

Ooh. That's a nice utility funtion that I will write soon. We tend to look at requirements as something we hope the package manager gets right, and then we ignore at runtime, but there are a bunch of errors we could avoid if we verified at runtime. Sometimes when writing a library you have to have different code paths for different versions.

Something like `if check_versions(pandas__gt="2.0.0", pandas__lt="3.0.0"):`

yunohn · 2024-09-11T08:16:39 1726042599

Often the notebook was run on a beefy server with GPUs attached, potentially taking hours/days of compute. It would be senseless to force every viewer of a Jupyter notebook to have the same setup and time just to read through the results and output.

epistasis · 2024-09-12T00:40:53 1726101653

> Not having the outputs tied into the code is actually preferable if the ultimate goal is reproducible science.

What a strange thing to assert, especially as a general overarching truth.

The best reports I have ever seen have matched code and output in the same file. There's never a question of what code generated a plot or a table with a notebook.

With .py files and separate outputs there's far more change for unreproducibke science, it's far messier, and for someone who doesn't appear to respect the organizational capabilities of academic labs, you are condemning them to far more poorly organized outputs.

> Having multiple copies of code

That doesn't have anything to do with notebooks. It's as silly as saying that a Python package is a poor idea because you say somebody repeat code across multiple places.

> non-version controlled files

Notebooks are no less version controllable than .py files.

> outputs with timestamps and run information

Jupyter notebooks are perfect for this, far superior to a directory of cryptically named outputs that need to be strung together in some order

> documentation dispersed with questionable organization

Using separate Python files rather than a notebook means that documentation can never be where it needs to be: next to the output. This is one of the ways that Python files are strictly inferior for generating results.

There are roughly two modes for notebooks: exploration with a REPL, and well-documented reports. The best scientific reports I have ever seen are notebooks (or R Markdown output) that are the full report text plus code plus figures.

spiralk · 2024-09-17T16:17:11 1726589831

> someone who doesn't appear to respect the organizational capabilities of academic labs, you are condemning them to far more poorly organized outputs.

This is not a great way to make your argument, though you are not the not only one here making a personal judgement without even knowing about my background. These are all issues I have seen first hard. With most academic labs being funding limited, the "organizational capabilities of academic labs" seems irrelevant to me. In our field, no one is getting grants to manage code of any kind .py or .ipynb and I suspect its the same at most university labs. It's effort wasted that ultimately does take time away from the actual research that's fundable and publishable. As someone who has been responsible for wrangling people's notebooks in the past, it's enough of a problem that I would encourage to remove all .ipynb.

> That doesn't have anything to do with notebooks. It's as silly as saying that a Python package is a poor idea because you say somebody repeat code across multiple places.

Human factors make jupyter notebooks lead to the problems I have listed. The issues are most apparent with large groups and over long periods of time. Python and other programming languages already solved most of these problems with git. There isn't a tool that is as elegant and scales from individuals to massive organizations.

> There are roughly two modes for notebooks: exploration with a REPL, and well-documented reports. The best scientific reports I have ever seen are notebooks (or R Markdown output) that are the full report text plus code plus figures.

The REPL functionality is handled by .py cell execution, as I’ve mentioned in other comments. It baffles me how the minimal effort saved by not using separate tools -- one for code, one for documentation -- justifies the issues it introduces.

Twirrim · 2024-09-10T21:01:52 1726002112

I use jupyter notebooks at work, not so much for academic stuff, but often to help build and show a narrative to folks, including executives (where I have any even remotely technical leadership). It's great for narrative stuff, especially being able to emit PDFs and what not. I've been in a number of meetings where I've got the code up in Jupyter, sharing the screen, and leadership want us to tweak numbers and see the consequences.

It's great for exploring code and data too, especially situations where I'm really trying to feel my way towards a solution. I get to merrily intermingle rich text narrative and code so I explain how I got to where I got to and can walk people through it (I did that with some experimenting with an SMT solver several months ago, meant that people that had no experience with an SMT solver could understand the model I built).

I'd never use it to share code though. If we get to that stage, it's time to export from jupyter (which it natively supports), and then tidy up the code and productionise it. There's no way jupyter should be the deployed thing.

spiralk · 2024-09-10T22:55:15 1726008915

That seems like a reasonable way to use jupyter notebooks since you have an actual plan to move beyond it when necessary. My issue is mostly with the way its misused, often by people who are arguably at the top of the field.

ants_everywhere · 2024-09-10T23:50:00 1726012200

We've seen how this ends because mathematicians have been sharing Mathematica notebooks forever. It's not pretty.

Like you I see the appeal, but they're a usability nightmare beyond a few lines. Part of the problem, I think, is that you can't really incrementally improve them. Who wants to refactor a notebook and deal with all the cell dependency breakage?

So they start off okay and then slowly become terrible until they're either irreplaceable or too terrible to work with and a new one is started.

abdullahkhalids · 2024-09-10T22:00:17 1726005617

The same problem exists with spreadsheets. Should we get rid of excel (the single tool that literally runs half the world), and start manually writing markdown tables in text files?

The tool and the tool maker are supposed to serve the user. The user is not supposed to conform to the whims of the tool maker.

ants_everywhere · 2024-09-10T23:56:27 1726012587

Since 94% of business spreadsheets contain errors [0], then probably yes we should get rid of or significantly improve spreadsheets.

Probably the solution is that things like Jupyter notebooks and spreadsheets should be views into some better source of truth rather than the source of truth themselves.

[0] https://phys.org/news/2024-08-business-spreadsheets-critical.... I remember a similar figure from studies a decade or so ago.

glzone1 · 2024-09-11T01:35:37 1726018537

The funny thing is I've seen folks try to deploy software to get rid of spreadsheets. It always ends badly to terribly.

Spreadsheets are the nonprogrammers programming / modeling tool in business.

It does presentation, data filtering / sorting, modeling and more.

No AI needed (and you can now plug AI in in some cases).

ants_everywhere · 2024-09-11T04:00:08 1726027208

Sure, but here's the basic problem I think:

Suppose you have some formula that computes a financial metric for your company. Someone you've shared it with drunkenly fat-fingers the formula 3/4 of the way down a long row, and that causes all entries below it to recompute with the wrong formula. Unless the change is really drastic, you may never know it happened.

And this sort of mistake -- basically a typo or a bad mouse movement -- happens daily in every company in the world in some spreadsheet. Often people will notice the mistake, but not with probability 1.

Software engineers have mechanisms to guard against some of these mistakes, and even we have a hard time getting people to take code review or tests seriously. What is the guard in the spreadsheet world?

anthk · 2024-09-12T10:08:59 1726135739

Have you seen the Excel disaster on genomics? Enough to ban Excel for anything serious except for accounting.

ants_everywhere · 2024-09-12T11:35:11 1726140911

I had not heard of that, but it appears to be this

https://www.nature.com/articles/d41586-021-02211-4

paddy_m · 2024-09-11T00:42:56 1726015376

Another issue is that jupyter, pandas, and polars don't take displaying tabular data seriously. Just have a better default table display widget. Look at ipydatagrid, perspective, or buckaroo (my project) for examples of how it could be done better.

KolenCh · 2024-09-10T21:26:23 1726003583

I don’t disagree anything you said. Jupytext can be a good tool to bridge some gap, where you pair ipynb to a py script and can then commit the py only (git-ignore all ipynb for your collaborators.)

Also, while many practices out there is questionable, in alternative scenarios where ipynb doesn’t exist, they might have been using something like matlab for example. Eg, in my field (physics), often time there are experimentalists doing some coding. Ipynb can be very enabling for them.

I think a piece of research should be broken down and worked by multiple people to improve the state of the project. Some scientists might be passing you the initial prototype in the form of a notebook, and some others should be refactoring to something more suitable for deployment and archival purpose. Properly funding these roles is important, and is lacking but improving (eg hiring RSE.)

In my field, the most prominent way when ipynb is shared a lot is for training. It’s a great application as that becomes literate programming. In this sense notebook is highly underused as literate programming still hasn’t got mainstream.

spiralk · 2024-09-10T23:13:55 1726010035

I've looked into Jupytext, but ultimately decided to go with pure python. Most of the practical functionality can be replicated, but I do admit there isn't a easy single install tool or guide to replace notebooks at the moment.

I think the notebooks are a fine learning tool to introduce people to programming initially, but I'm afraid it doesn't allow for growth beyond a certain level. You have a good point about funding for those software roles. Perhaps this may not be as big of a concern if there were more software talent in these labs to handle the issues that arise.

KolenCh · 2024-09-11T02:33:52 1726022032

In an ideal world that we control everything and/or don’t need to collaborate with others, then whatever tooling one use is actually not that important (and each can choose the best fitting their needs.) So Jupyter+Jupytext is useful in the context of collaboration, where you can’t control your collaborators but want something from them.

While in an ideal world scientists who write softwares should write professionally, the same goes for anything they do, including math and stats used in their research, writing and typesetting and generates publishing quality visualization… That rarely happens because of how the academic world is financed, and the incentives associated with it. I can certainly complain about that all days, but in short a researcher hired by a research university, especially with a tenured track position in the US, will not be successful to get such position, let alone getting tenured, if they had not focused their scarce resource of time to maximize their “research output” (publications, grant, etc.), where software engineering is not part of. (Sorry, sentence too complicated.)

luplex · 2024-09-10T20:49:45 1726001385

In the end, usability wins. In a Jupyter notebook, you have a much better idea of state between cells, you can iterate much faster, you can write documentation in readable markdown. Often, Jupiter notebooks are more like interactive markdown than they are like python scripts.

dxbydt · 2024-09-10T20:55:36 1726001736

> In a Jupyter notebook... > Often, Jupiter notebooks...

Everytime I search my Slack, I have to run two searches because DS can't agree on how to spell the damn thing.

ambicapter · 2024-09-10T21:33:07 1726003987

The form factor of Jupyter notebooks seems to fit well with peoples workflows though. Looks like you just wish the internals of Jupyter were better architected.

spiralk · 2024-09-10T22:05:16 1726005916

Imo, the better architected .ipynb is simply .py with '# %%' blocks. It does almost everything a .ipynb can do with the right VSCode extensions. Even interactive visualizations can be sent to a browser window or saved to disk with plotly. Though I do wish '# %%' cell based execution was accessible to more people.

There isn't a single install tool that "just works" for this at the moment. If editors came with more robust support for it by default, I think the notebook format wouldn't be needed at that point and people could use regular python and interactive cell based python more interchangeably. I've seen important code get buried under collections of jupyter notebooks across different users so I have a good reason for this. Notebooks simply dont scale beyond a certain complexity.

paddy_m · 2024-09-11T00:45:58 1726015558

The two can coexist. store libraries in python code that is versioned and deployed properly. Notebooks with their data ingest, code, then output should read cleanly. Making the ingest and code readable is the job of library writers. A clean and elegantly coded notebook with inline outputs is a substantively different experience than searching all over the place for the correct browser window that corresponds to the output from a given piece of code.

pizza · 2024-09-11T02:25:55 1726021555

While we're on the topic of jupyter enhancements, would really love to be able to pop a cell off the pending execution stack if I realized running it would be a mistake and still have time before it gets there.. :^)

sa-code · 2024-09-11T06:02:44 1726034564

And also preserving cell output if you happen to reload the page

paddy_m · 2024-09-10T18:47:39 1725994059

Impressive work. I love jupyter, but it's a bear to work on.

What mix of JS packages do you see? Could that be built into one uber package?

lneves12 · 2024-09-10T19:09:00 1725995340

We are actually bundling everything inside one big main.js file, compared to jupyter-lab app that loads each extension from a different file using webpack federated modules. We did some benchmarking and it was actually faster than having one file per extension. There is definitely still some room for improvement here, but we have some other places we would like to optimize first, like, optimising the fetching of the notebooks content.

You can take a look at the notebooks entrypoint network request: https://portal-notebooks.singlestore.com/

jasongrout · 2024-09-11T02:40:35 1726022435

Nice. Bundling everything together was how JupyterLab used to work before version 3, but it required a compilation step to install extensions, which made it inconvenient for users. With JupyterLab 3 (and maybe 4?), if I recall correctly, you could have both worlds - compile some extensions into a base js bundle, then install other extension to be loaded as federated webpack modules.

lneves12 · 2024-09-11T08:59:42 1726045182

Thanks for the context, that makes a lot of sense. In our case since we have a more controlled and less generic environment we have a bit more flexibility/control in what we can do.

canucker2016 · 2024-09-10T22:27:31 1726007251

I don't see a Content-Encoding header on the response for the JS and HTML files, which suggests the 11.5MB JS and the HTML files aren't compressed.

Not much of a worry on the tiny HTML file, but the 11.5MB JS file should compress to a much smaller file on the wire.

lneves12 · 2024-09-12T18:19:56 1726165196

ah I thought I was going crazy (I was sure this was working at some point). The compression just stopped working because of a cloudfront limitation eheh

`CloudFront compresses objects that are between 1,000 bytes and 10,000,000 bytes in size.` - since the file became bigger than 10mb, cloudfront stopped compressing it...

canucker2016 · 2024-09-13T21:35:51 1726263351

Well at least you know you weren't lazy/ignorant.

It's just that your tools make you look that way... and cost you more money in extra bandwidth as well. I hope that's the last of the surprises in that area.

lneves12 · 2024-09-11T14:47:12 1726066032

ah that's an embarrassing oversight! we reuse the same cdn configuration for multiple projects and for some reason the compression isn't properly configured for our portal-notebooks.singlestore.com entry point. It's funny that I reconfirmed that before publishing the blogpost, but mistakenly looked at the request headers and not the response headers (facepalm). We are fixing that now, thank you! This will be helpful for cases where you access the notebooks UI directly. For cases where you come from other page, it shouldn't make that much difference, since the iframe is already pre rendered.

paddy_m · 2024-09-10T20:08:14 1725998894

I take it that's supposed to be the pre-loading page to just look at the requests, not a full working UI?

After cursory googling I couldn't find one, do you have a public notebook gallery?

lneves12 · 2024-09-10T20:09:44 1725998984

Yeah, sorry should have made that more clear. This doesn’t really load anything, it’s just the entry point for our iframe. To try our notebooks you can create an account at portal.singlestore.com (we have a gallery of notebooks there)