JupyterHub 1.0 (jupyter.org)
215 points by jonbaer on May 4, 2019 | 53 comments



I love JupyterHub, but we've hit some real headaches scaling it. At Virginia Tech, we have an introductory course for non-computing majors where students use Jupyter through JH. At around the 70-student mark, we hit performance issues. Considering that the course is eventually meant to scale much further (hundreds of students), we're not really sure how we can make further progress with our current resources. I hope this new version has some performance enhancements (though I don't see any in the changelog). The last discussion I had about this was in the NBGrader project[1], where other schools were hitting walls with scaling.

[1] https://github.com/jupyter/nbgrader/issues/530#issuecomment-...


Give us some details about your setup. Are you running JH in the cloud somewhere, or is this running on a single box at VT? We've got JH running on a Kubernetes cluster with persistent storage on the NSF Jetstream cloud [1] (which, being from VT, you could qualify for free resources on). This setup should theoretically address the scalability concerns you are having. See the work Andrea Zonca and I have been doing [2, 3, 4, 5]. Contact me for additional details. (A minimal config sketch follows the links below.)

[1] https://jetstream-cloud.org/

[2] https://zonca.github.io/2018/09/kubernetes-jetstream-kubespr...

[3] https://zonca.github.io/2018/09/kubernetes-jetstream-kubespr...

[4] https://zonca.github.io/2018/09/kubernetes-jetstream-kubespr...

[5] https://github.com/Unidata/xsede-jetstream/blob/master/vms/j...
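For anyone curious what this looks like in config, here's a minimal sketch of a jupyterhub_config.py using KubeSpawner - not our exact deployment (the tutorials above use the zero-to-jupyterhub Helm chart), and the image, sizes, and storage values below are placeholders:

    # jupyterhub_config.py - minimal KubeSpawner sketch
    # assumes jupyterhub-kubespawner is installed; all values are placeholders
    c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'

    c.KubeSpawner.image = 'jupyter/scipy-notebook:latest'  # hypothetical image
    c.KubeSpawner.cpu_limit = 1
    c.KubeSpawner.mem_limit = '1G'

    # give each user a persistent volume so work survives pod restarts
    c.KubeSpawner.storage_pvc_ensure = True
    c.KubeSpawner.pvc_name_template = 'claim-{username}'
    c.KubeSpawner.storage_capacity = '2Gi'
    c.KubeSpawner.volumes = [{
        'name': 'home',
        'persistentVolumeClaim': {'claimName': 'claim-{username}'},
    }]
    c.KubeSpawner.volume_mounts = [{'name': 'home', 'mountPath': '/home/jovyan'}]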


I'm not as up on the details as you'd probably want, but my understanding is that we're running it on a virtual server with a nice chunk of RAM and CPU locally at VT. I don't think we want to be at all reliant on external servers - this is FERPA-protected data. Plus, the long-term goal was to find a solution that other schools could adopt without being an R1.


Are you running into scaling issues because all students are on a single box? I think the JupyterHub proxying service should be able to handle way more than 70 users, but I could see 70 users on a single machine being problematic. I'm doing a lot of work around JupyterHub - happy to help out if I can. I can be reached at hugo@saturncloud.io


Well, it's a virtual server, so I'm not sure how that plays with things. I suppose getting more servers in play would help, but that feels more like throwing hardware at a software problem. It doesn't feel like JH should have this much overhead - is the kernel really doing so much work?


Depends on what your students are doing, right? Also, I'd bet you're running out of RAM before CPU. You should try doing the same workload without JH involved - I'd bet you'd still run out of RAM. Also - I'd throw hardware at software problems all day long if I could. Hardware is cheap compared to developer time.


When would someone use jupyterhub? I've been running my own notebook server for years, but it's single-user, single machine. Is hub for like providing separate jupyterlab instances for a bunch of different users/different machines?


We have it set up at work, as an alternative way to use our HPC resource compared to the traditional Linux shell + Slurm usage.

The user goes to our JupyterHub URL in a web browser, logs in with their usual credentials, and selects a job type (amount of memory and max duration); JupyterHub takes care of launching a Jupyter kernel as a Slurm batch job on a compute node in the cluster and proxies HTTP I/O via the JupyterHub node to the user's web browser. In the Jupyter notebook, users have access to the same cluster filesystems as if they had logged in traditionally via SSH.
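Not our literal config, but the core of it is batchspawner's SlurmSpawner. A minimal sketch (partition name, memory, and runtime below are placeholders):

    # jupyterhub_config.py - minimal batchspawner/Slurm sketch
    # assumes batchspawner is installed; resource values are placeholders
    import batchspawner  # registers batchspawner's API handlers with the hub

    c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
    c.SlurmSpawner.req_partition = 'interactive'  # hypothetical partition name
    c.SlurmSpawner.req_memory = '4G'
    c.SlurmSpawner.req_runtime = '8:00:00'

    # notebooks running on compute nodes must be able to reach the hub's API
    c.JupyterHub.hub_ip = '0.0.0.0'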


I'd like to set up exactly this - are any more details on your setup available?


None publicly available AFAIK. There were some fixes required to the slurmspawner, but those should all be upstream now.

Then there's the PITA of integration with the site auth system, but that tends to be site-specific.


Yes exactly. At my work when a new scientist joins us we just create an account and she can get started on her research within minutes. Each user gets a contained environment in which we mount a disk of shared data.
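For anyone wanting to replicate this: not our actual config, but a hedged sketch of the pattern with DockerSpawner (image name and host paths below are placeholders):

    # jupyterhub_config.py - minimal DockerSpawner sketch
    # assumes dockerspawner is installed; image and paths are placeholders
    c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
    c.DockerSpawner.image = 'ourlab/datasci-notebook:latest'  # hypothetical

    # per-user work volume plus a read-only mount of the shared data disk
    c.DockerSpawner.volumes = {
        'jupyterhub-user-{username}': '/home/jovyan/work',
        '/srv/shared-data': {'bind': '/home/jovyan/shared', 'mode': 'ro'},
    }

    # put the hub and user containers on one docker network so they can talk
    c.DockerSpawner.network_name = 'jupyterhub'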


I just did this in my University lab as well. Most people aren't savvy with Linux, so having normal accounts with Jupyter port forwarding is out of the question. JupyterHub is just about the lowest friction I can possibly make it for introducing the Python data science stack to non data scientists.


And just to be explicit for readers, Jupyter and JupyterHub also let you work with other data science stacks, R in particular.


IMO, Jupyter Notebook is to Python roughly what RStudio is to R. While PyCharm and VS Code are also preferred by some Python-based data scientists, JupyterHub offers almost everything a typical IDE does, along with the traditional notebook environment that a lot of beginners start with these days. Thus there is much less friction while getting started.



I would be really hesitant to compare Jupyter Notebook to an IDE. One example is the debugger: the only visual debugger I have come across for Jupyter is PixieDebugger, which is miles behind the debugger of an IDE like PyCharm. There is a huge list of features Jupyter needs before you can compare it to an IDE.


It is an interactive environment (not much use for a debugger).


FWIW I use the %debug magic command in Jupyter and it has been a great experience. I'm pretty ignorant of the enterprise debugging tools so take that with a grain of salt.
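For anyone who hasn't tried it, the flow is roughly like this (the two cells below are just an illustration):

    # Cell 1: something that raises partway through a computation
    totals = [10, 20, 0, 40]
    rates = [100 / t for t in totals]  # ZeroDivisionError on the third element

    # Cell 2: open a post-mortem debugger at the frame that raised
    %debug
    # at the ipdb prompt:
    #   p t    -> inspect the value that triggered the error
    #   u / d  -> walk up/down the stack
    #   q      -> quit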


Debuggers are only really useful if you're trying to figure out why some object in your server doesn't do what you want it to.

I'd wager that almost no data scientists write object-oriented code; it's probably mostly done one calculation at a time, executed in the notebook's REPL. So the value you get from IDE debuggers is tiny, as you're already doing everything one step at a time.


You still write functions and may want to inspect variable state in the middle of function execution.


Correct. RStudio has this feature, where variable values can be inspected in a sidebar. This would be a really useful feature for Jupyter, especially when running a Python kernel.


There is a JupyterLab extension for that: https://github.com/lckr/jupyterlab-variableInspector


Does it work with variables that are local to a function? I don't mean inspecting global variables after having executed a cell, but local variables in the middle of a function execution.


This was my use case for it.

I ended up getting really frustrated with setting it up. Followed several different tutorials, had it blow up in a different way each time.

Going to have to revisit this to see if the documentation has gotten better.


My school uses this for its data science classes, so let me provide an example. Say that you have a lab or homework assignment done in Python. It may require a lot of storage (a large dataset), compute (maybe you have a large model), or dependencies. JupyterHub is one way to set up once and then deploy to multiple users regardless of their computer.


I give talks in math classes about how math is used in the real world, and I've been looking at JupyterHub as a way to let everyone in the class work through an activity with instructions & equations, diagrams, places to enter their own data, a little mathematical program to run, etc.


I use JupyterHub on my laptop because it makes automatically starting a server a lot easier.


It's to share work with others, I believe. Nothing more, really.


This is really cool and I'm impressed by the Jupyter team. My favorite part is that it's a product good enough to beat the commercial offerings; it's hard, I think, to find commercial models that support this wide range of collaborators (people who view once a month to people who author every day).

I was trying to read about whether jupyterhub is included in RStudio Connect, or if they are competing products.


You can use RStudio Connect to publish Jupyter notebooks, see https://blog.rstudio.com/2019/01/17/announcing-rstudio-conne...


Thanks. That’s why I’m trying to figure out if it’s actually jupyterhub under the covers so they’ll get these new features. Or if they are competing and RStudio will have something similar or I have to check their dev schedule.


I’d say products like DataBricks do have the model figured out


Congratulations! JupyterHub is a great project with high-quality code and docs. I'm looking forward to trying the named servers feature, as I run a JupyterHub instance that spawns servers inside containers based on a single image, which inevitably tends to grow as I add libraries. Being able to manage multiple servers should allow me to split the image into smaller specialized images.


Any recommendation on how to set up an environment with Jupyter and nvidia/cuda? Is it worth using a Docker container, or should everything be installed system-wide?


I once tried running Jupyter inside Docker. The downside I saw was that a conda install of Jupyter is around 1.2 GB, making the container pretty heavy. If you are fine with this, then all that's left is to expose the container port on which you want your Jupyter server to run, and to change some config after generating the jupyter_config file for remote access.


Hi, I’ve been working on setting up JupyterHub on a DGX-1 for the past week. It’s straightforward: you run JupyterHub on your machine and have it launch servers for each user with nvidia-docker.
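In case it helps, the GPU part mostly comes down to pointing DockerSpawner at the nvidia runtime. A hedged sketch, assuming dockerspawner and the nvidia-docker 2.x runtime are installed (the image name is a placeholder):

    # jupyterhub_config.py - GPU sketch, assuming the nvidia-docker 2.x runtime
    c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
    c.DockerSpawner.image = 'ourteam/gpu-notebook:latest'  # hypothetical image

    # ask Docker for the nvidia runtime so containers can see the GPUs
    c.DockerSpawner.extra_host_config = {'runtime': 'nvidia'}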


At work we use conda, which works great, though NVIDIA libraries are usually installed system-wide by admins. But for anything Python-related, conda is perfect.


Thanks for the answer :) What are the advantages of using conda vs. installing the packages by hand? I tried to read up on it a bit before, but I still don't know the real advantages on a modern Linux system.


There are at least three:

1. Anaconda sells support that many companies will value, https://www.anaconda.com/support/

2. Anaconda checks that the installed versions of packages in the distribution are compatible.

3. Conda has an “environments” feature so a developer/scientist can switch between many, project-specific development environments, https://docs.anaconda.com/ae-notebooks/user-guide/adv-tasks/...

Edit: Also, there is a distribution for Windows that’s handy when your employer has you using Windows. And, depending on your software approval process, it can be convenient to get one package approved (Anaconda) instead of every package Anaconda includes.


Conda gives you two real advantages: the first is straightforward virtual environments (you could also use venv), and the second is better dependency management than pip in some cases.

There are certain packages that aren't available on pip but that conda-forge provides, e.g. for a while OpenCV wasn't available through PyPI.

Remember you can use pip within conda, and you can install directly (e.g. setup.py) within a conda env.

Docker is mostly useful if you want to mix and match different versions of CUDA/cudnn etc. If you just want to run an isolated Python environment, then Conda will do the job.


Conda is "just" a distribution and package management tool. As a beginner it is a good idea to just go with it, but installing all packages using pip is just as good, in the end. For reference: https://www.xkcd.com/1987/


In bioinformatics there is a trend for systems like qiime2 to be basically impossible to install via pip, with conda and Docker as the only options. In part this is because bioinformatics pipelines rarely rely only on Python; they depend on a multitude of existing programs that need to be installed.


When will Jupyter have highlight-to-execute?


The Script of Scripts project adds this feature: https://vatlab.github.io/sos-docs/doc/user_guide/multi_kerne...


Great news! Question for the fellow JupyterHub users: how do I expose users’ Conda environments to them?


`nb_conda_kernel` should add the kernels installed in both system and user conda environments as notebook kernels.

EDIT: I meant `nb_conda_kernels` https://github.com/Anaconda-Platform/nb_conda_kernels


Install the ipykernel or IRkernel packages in the environments you want JupyterLab to know about.


You'll also need to register it against Jupyter.

For Python:

    $ python -m ipykernel install --user --name myenv --display-name "My environment"
For R:

    > IRkernel::installspec()
As mentioned previously, nb_conda_kernels automates this step.


Congrats to the team! This is a major productivity milestone for teams using Jupyter.


Jupyter Notebooks / JupyterHub and BinderHub are, in my humble opinion, highly relevant to the future of education: as a tool for teaching, for science, as a format for replication, and for everything in between (see the data sciences). And even more!


How would JupyterHub compare with something like AWS SageMaker or EMR notebooks?


First of all, AWS SageMaker is really an ML system that happens to include Jupyter notebooks as a component. But if you are talking about just the Jupyter notebook part, then I would say you could use JupyterHub to build your own implementation of SageMaker (you would want to use kubespawner and some deployment of Kubernetes if you wanted to scale to multiple nodes). For example, I run https://www.saturncloud.io/, and we orchestrate JupyterHub to do just that.

JupyterHub is more flexible - for example, you could deploy JupyterHub to one beefy server and have Jupyter deployed for many users, who could all read data from a shared filesystem. That kind of thing is not easy to do with SageMaker, since everything runs on a separate EC2 instance.

I can't comment on EMR notebooks.


Congrats on a great product!



