Why Jupyter is data scientists’ computational notebook of choice (nature.com)
606 points by sohkamyung on Oct 30, 2018 | 300 comments



The majority of the complaints I hear about notebooks I think come from a misunderstanding of what they're supposed to be. It's a mashup between a scientific paper and a REPL. So it's useful for a bit of both:

a) Just like with a paper, you can present scientific or mathematical ideas with accompanying visualizations or simulations. From the REPL side, as a bonus, you get interactivity, and the reader can pause and experiment with the examples you're giving to improve their understanding or test their hypotheses. If I change this variable, how will the system react? You can just try it!

b) Just like with a REPL, you can type in and execute commands step by step, viewing the output of the previous command instead of running the whole thing at once. From the document side, as a bonus, you get nicer presentation (charts, interactivity, nice and wide sortable tables, etc) than you would in a shell, which comes in handy when doing things like data exploration or mathematical simulation.

It's decidedly NOT there for you to type all your code in like an editor and make a huge mess. It's apples and oranges compared with, and a poor substitute for, something like PyCharm or VS Code or vim. It is there for you to a) try things out yourself, with whatever you discover hopefully eventually making it into proper Python modules, and b) make interesting ideas presentable and explorable for others. That's all!

When I see stuff like "out of order execution is confusing", I don't disagree, but it does make me wonder how long and convoluted the notebooks these people work with are - probably a ripe candidate to refactor stuff out into python modules as functions. When I see stuff around notebooks for "reproducibility", I'm a bit confused in that notebooks often don't specify any guidance on installation and dependencies, let alone things like arguments and options that a regular old script would. In that regard I think it's barely an improvement over .py files lying around. When I hear "how do I import a notebook like a python module", I'm very very scared.

Granted, I've seen huge notebooks that are a mess, so I understand the frustration, but it's not like we all haven't seen the single file of code with 5000 lines and 10 nested layers of conditionals at some point in our lives.


> When I see stuff around notebooks for "reproducibility", I'm a bit confused in that notebooks often don't specify any guidance on installation and dependencies, let alone things like arguments and options that a regular old script would.

At the core of this, as some others have already alluded to, is that many academic scientists have not been socialized to make a distinction between development and production environments. Jupyter notebooks are clearly beneficial for sandboxing and trying out analyses creatively (with many wrong turns) before running "production" analyses, which ideally should be the ones that are reproducible. For many scientific papers, the analysis stops at "I was messing around in SPSS and MATLAB at 3 AM and got this result" without much consideration for reformulating what the researcher did and rewriting code/scripts so that they can be re-run consistently.


> many academic scientists have not been socialized to make a distinction between development and production environments

Geologist here - definitely true in my field. Nonetheless, while I don't develop in notebooks at all, I do use them for "reproducibility" in a sense -- by putting a bit of dependency info in a github repo along with a .ipynb file, I can do things like this: https://mybinder.org/v2/gh/brenhinkeller/Chron.jl/master?fil...

Which ends up being useful when a lot of folks in my field don't do any computational work at all, so being able to just click on a link and have something work in browser is a big help.


Don't know if it is something you actively need, but the image did not load for me. (Even after re-evaluating the cell).

The image after: "For example (KJ04-70)"

(I also re-ran the preceding cells).


Thanks for the tip! Interestingly enough, this appears to be browser-dependent. Apparently including a PDF in notebook markdown using an img tag like

<img src="DenverUPbExampleData/KJ04-70_distribution.pdf" align="left"/>

works in Safari but not Chrome or Firefox. Switched to SVGs for now.
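
For reference, the SVG version of the tag would presumably just swap the extension (exact filename assumed from the PDF one above):

    <img src="DenverUPbExampleData/KJ04-70_distribution.svg" align="left"/>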

So much for reproducibility :/


This is kind of a broad observation, but scientists tend to borrow tools from a huge variety of fields, and use them in ways that seem un-disciplined to the practitioners of those fields. For instance, an engineer would be horrified to see me working in the machine shop without a fully dimensioned and toleranced drawing. A project manager would be disturbed to learn that I don't have a pre-written plan for my next task. How do I even know what I'm going to do? If we adopted the most disciplined processes from every field, we'd grind to a halt.

In fact, there might be something about what attracts people to be scientists rather than engineers, that makes us bristle at doing what engineers consider to be "good" engineering.


I agree that science can't be bound by the rigid structures of most applied disciplines, and that the freedom to combine technologies in novel ways is a pre-requisite to novel findings.

What I find objectionable is the inability of scientists to explicitly delegate tasks to domain specialists in their everyday work when it makes sense. I think it's unrealistic of you to believe that engineers always work with "a fully dimensioned and toleranced drawing" before starting work on a project, or that your work would "grind to a halt". Indeed, there's a reason for the qualifier rapid in the term "rapid prototyping". If you can give an engineer general specifications for what you want and then leave him/her alone, he/she should be able to produce something that mostly fits your needs while avoiding all of the pitfalls that wouldn't have occurred to you. It would also be incorrect to assume that engineering does not involve creativity and is purely bound by rigid processes - if your requirements were strange enough, something fresh would inevitably be built.

This sort of delegation, of course, is actually more efficient, since you can work on other tasks in parallel with the engineer (such as writing your next grant proposal or article or gasp teaching). Most scientists also already do this implicitly by choosing to purchase instrumentation from manufacturers like Olympus, Philips, or Siemens rather than building it themselves.

Part of the reason why I have such strong opinions about this matter is that I've actually witnessed scientists waste more time messing around in fields where they were clearly out of their depth. As an example, there was a thread on a listserv in my (former) field that lasted for literally months and was solely devoted to the appearance of a website. Everyone wanted to turn the website design into an academic debate, when the website's creation (which had little to do with the substance of the scholarship itself) could have been turned over to a seasoned web developer and finished in less than a week or two.


But in the case of dev and prod distinction it has nothing to do with fitting some over-constrained engineering principle, but about fitting actual science: if you cannot reproduce something, you don't have a result, you have a fluke.


I think the GP here is an insightful comment. Reproducing things is indeed important, but re-running code is much too narrow a definition, and possibly distractingly narrow.

Maybe your awful notebook gets the same answer you got the day before on the blackboard. Or the same answer your collaborator got independently, perhaps with different tools. Those might be great checks that you understand what you're doing. Spending time on them might be more valuable for finding errors than spending time on making one approach run without human intervention.

Not to say that there aren't some scientists who would benefit from better engineering. But it's too strong to say that fixing everything that looks wrong to an engineer's eyes is automatically a good idea.


I find that with Jupyter, re-running code does serve one useful purpose, which is to make sure that your result isn't affected by out-of-order execution or a global that you declared and forgot about. That is a real pitfall of Jupyter that has to be explained to beginners.

For my work, reproducing a result may involve collecting more data, because a notebook might be a piece of a bigger puzzle that includes hardware and physical data. This is where scripting is a two-edged sword. On the one hand, it's easy to get sloppy in all of the ways that horrify real programmers. On the other hand, scripting an experiment so it runs with little manual intervention means that you can run it several times.


Huge fan of just including an environment.yml for a conda virtual env in the repo you store your notebooks in, but the challenge there is that the reproducibility is OS-specific. I've had no luck creating a single yml for all OSes, and the overhead of creating similar ymls for (say) Mac and Windows is a lot unless you plan on sharing your notebook widely.
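
For anyone who hasn't seen the pattern, a minimal environment.yml looks roughly like this (the name and packages here are just placeholders):

    name: my-analysis
    channels:
      - conda-forge
    dependencies:
      - python=3.6
      - numpy
      - pandas
      - jupyter

Anyone cloning the repo can then recreate the environment with `conda env create -f environment.yml`, though as noted, pinned builds tend to make the result OS-specific.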


If you have ever used an R Notebook written in R Markdown, then it's pretty easy to see why Jupyter Notebooks putting everything in JSON is just... infuriatingly wrong-headed. In an R Notebook, I can see my code, I can see my text, everything is exceedingly simple to understand, and I can edit it in any of the fantastic text editors out there (Jupyter's editor is not among them).
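
To illustrate the point, a single code cell in the .ipynb JSON looks roughly like this (nbformat 4, trimmed), with every source line stored as an escaped JSON string:

    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {},
      "outputs": [],
      "source": [
        "x = 1\n",
        "print(x * 2)"
      ]
    }

The equivalent in an R Markdown file is just a fenced chunk sitting in plain text, which is why it diffs and edits so much more cleanly.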


RStudio is also my favorite editor. All my work is data science / stats related, where I like the workflow of writing/modifying code in a .R (or .py) file, and being able to quickly experiment by running chunks in a REPL with Ctrl + Enter.

R and Python are supported. No Julia, unfortunately. VS Code and Atom support similar workflows with Julia. However, the Julia language server in VS Code is extremely unstable and I regularly lose LaTeX completions. The REPL in Atom is mind-bogglingly laggy and slow, to the point that it is much less frustrating to copy and paste code into a REPL running in your favorite terminal emulator.


Is that Atom's fault or just the Julia REPL's fault? I use the REPL directly on Windows and it seems to be really slow, as it will take something like "using JuMP" and precompile the module, which takes time.


I think it is Juno (the Julia package for Atom)'s fault. Atom is fine on its own, as is the Julia REPL after compilation.

I just looked through Julia's settings tab in Atom, and saw the option "Fallback Renderer" with the note "Enable this if you're experiencing slowdowns in the built-in terminals." It was disabled by default, so I've just enabled it.

Subjectively, I think it feels fine now. Longer use will tell, but I suspect I was just running into a known issue some setups run into, and they already provided the workaround.

EDIT: Comparing running some code in Atom's terminal and a REPL running in the GNOME Terminal, the regular REPL still feels notably snappier -- even though I'm `using OhMyREPL`, which makes the REPL a bit less responsive.

I'd say Atom feels acceptable (and definitely not "mind boggling laggy" right now), and shift/ctrl + enter more convenient than switching tabs. So I will stick with it (for Julia). More time shall tell.


I wonder if Windows is slower. It isn't terrible, but waiting 8 seconds after typing in "using JuMP" is kind of long.


RStudio supports Python? Does it do completion and stuff like that?


It uses reticulate for REPL support, and has basic Python completions and snippets.

However, it doesn't seem to complete variable names. Eg, if I defined "foobar", it won't let me tab-complete that later when I start typing "foo".


You can use RStudio to execute the code, and your preferred editor to do the editing, if you like.



As a result of the serialize-to-JSON approach, Jupyter supports R, Python, Scala, Go, Lua, Bash, Julia, and Haskell, among others. It's accessible to a much wider range of programmers, at the cost of version control being a bit weirder.


That is a complete non sequitur. JSON in no way enables that. Just having a defined format enables that.

Emacs org-mode is proof that a simple text format with markup rules is all you really need to support multiple languages in a single file. You lose some of the simplicity of parsing the file, but you gain a ton more.


This isn't true though, right? If you were writing an org mode document about org mode, you now need an escaping mechanism to not mix your structure and text

Multi-language parsing is much harder to solve than simply enforcing some escaping mechanism in the inner protocol level and having tools do the "heavy" lifting (basically a solved problem).


Amusingly, no. Folks have done just that.

That said, you can define away a large part of the problem.

Edit: For trivial examples of "org-mode" in an org-mode document, you need only look at the documentation of org-mode. That said, I expect there to be limitations, because they make sense. Similar to how you can pretty print json inside a jupyter notebook, but don't expect to have a notebook interpreted in the notebook. (If that makes sense.)


JSON enables you to chuck the notebook to any browser client anywhere on the net.

Org-mode is great, but you still have to install emacs.


Emacs is, amusingly, a lighter client than most browsers nowadays.

On point, a browser cannot render a notebook; it can just parse the JSON. It can also parse text/plain, so it could show the org document without styling. The org document is actually readable. The JSON... not so much.

To see the notebook, you have to have a Jupyter setup somewhere.

Edit: For example, see https://raw.githubusercontent.com/taeric/taeric.github.io/ma... which is the source for http://taeric.github.io/ChangeForDollar.html Not styled, and that is a short document, so it probably wouldn't be that tough to read as a json document, but I'm glad I don't have to.


How is a notebook without proper software to handle it in any way more useful than any other structured plaintext file? Yes, JSON can be pretty-printed in a browser, but what then? It's still a useless mess you can't work with.


It might be the case that serializing to json facilitates support for multiple languages, though I wonder how.

With the reticulate package in R Markdown you can run python chunks, by putting, e.g.

```{python}
for i in range(1, 10):
    print("{}:{}".format(i, i*i))
# etc
```

And in emacs org-mode you can:

#+begin_src python
for i in range(1, 10):
    print("{}:{}".format(i, i*i))
#+end_src

Language support in org-mode is pretty comprehensive, afaik.

I do not know the details of the implementations behind these, but my own source code is plain and simple unserialized text, and that means a lot to me.


Ah, actually it looks like I'm somewhat mistaken. R Markdown supports other languages as well. I think the real difference is that it doesn't look like R Markdown supports partial evaluation.

By that I mean that to share the R Markdown doc, it appears you need to rerun the whole thing. It does some tricks to do concurrent visualization, but to actually share the doc you have to rerun all the R/Python from scratch.

In jupyter OTOH, if I have a long running ML pipeline as part of my doc, I can render without rerunning the pipeline.


You can cache the results of an rmd cell, and you can also share the rendered version of the doc first. You're right that there's a higher emphasis on "run the whole thing," and I think that that's a conscious (and acceptable) design choice vs not being sure that the shared doc will run as provided.


Yeah, I don't disagree about that, but when the question is why this ugly format is more popular, "one has a share button and the other can be edited in emacs" sort of gives you the answer.


YES! R Notebook is so much better in my opinion than Jupyter. I definitely prefer to work with a simple format versus JSON for this kind of work.


Agreed, but I'm coming from the org-mode side of things. Have you ever tried git diff on a JSON file? It's not always pretty.


The main reason for json, I believe, is that the Jupyter client is separate from the backend. It's actually pretty trivial to run the engine on a beefy box while interacting on a light laptop (on the same subnet). With Jupyter Lab and some fiddling, you can put the server anywhere.

It's also trivial to export notebooks to .py files.
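
For example, with the stock nbconvert tool that ships with Jupyter (notebook name here is just a placeholder):

    jupyter nbconvert --to script analysis.ipynb   # writes analysis.py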

That said, my goodness do notebooks wreak havoc on git. I hope this in particular gets fixed as popularity grows.


Having a proper server available is even more reason to use a proper file format. The client doesn't care what the server handles, and the server doesn't need to send raw data structures directly from storage.

Actually, fixing the file format mess should be very simple. Just change the file load/save functions. Use a folder structure with every cell being a separate file. Or switch to XML. Or make a generic interface and allow saving in whatever format people want. Saving notebooks in MongoDB or some SQL database seems like a good goal for dedicated services.


> It's also trivial to export notebooks to .py files.

But this is useless if you cannot edit those py files and obtain notebooks from them.


RStudio has a server mode as well


Here is a link to why the R Notebook is a much better format for doing science. I also prefer it since I use Git.

https://rviews.rstudio.com/2017/03/15/why-i-love-r-notebooks...

This is the reason why it is great.

> 1) Plain text representation


Your parting "Granted..." is precisely what fills me with dread when I see notebooks. Yes, I have seen poorly done source files. I made more than a few myself. However, many of the practices we have grown into as sound programming advice seem to be largely thrown out the window for these notebooks.

The irony, to me, is that I actually typically argue for the mixing of presentation and content. But to me, notebooks look like an attempt by people to make a WYSIWYG out of JUnit/TestNG/whatever style reports. Only, without the repeatability.

There is also the entire bend where these are taking off in a way that doesn't make sense. Do they do the things you are saying? Well, yeah. But no better than plenty of tools before them. Mathematica and Matlab both had "notebook" like features for a long long time. Complete with optimized libraries. And this is ignoring the interactivity of the old LISP machines. (You can see from my history I have a soft spot for emacs org-mode.)

Jupyter is a lot of things. Bad isn't necessarily one of them, but exceptional isn't, either. Heavily marketed is.


There is also the entire bend where these are taking off in a way that doesn't make sense.

It makes perfect sense. Just not to a lot of HN readers.

The average HN reader is approaching this from a perspective of "I am a professional programmer who might occasionally dabble in scientific computing, and therefore I hate this thing because it's not a professional programmer's tool designed by and for professional programmers according to the best practices of professional programmers".

The people who are actually using notebooks, meanwhile, are not professional programmers. They're scientists who increasingly have to do programming as part of their science. And notebooks are a godsend for them. We don't need to drag them all the way into our world; we need to pay attention to what they actually want, need, and find useful, and accept that it's going to differ from what we want, need, and find useful.


They're scientists who increasingly have to do programming as part of their science. And notebooks are a godsend for them

2/3 of scientific research cannot be reproduced by other scientists. But tell us more about why scientists should ignore best practices from other fields.


Because I don't have four years to get something done, that doesn't do what I want when I finally get it, if it even works at all, and that I can't fix myself.

Okay, that was extreme, and if you think I was talking about programming, it's because you have a guilty conscience. ;-) It actually applies to all interesting fields -- programming, engineering, management, classical music composition, etc. Those fields don't even know what their best practices are, and acknowledge that things take too long and can't be managed. No manager would say: "Our programmers have best practices, so the work will be done next week." Why should scientists have such faith?

Meanwhile, do you trust Maxwell's Equations, Darwinian evolution, quantum mechanics, etc.? How did we establish the physical constants to mostly better than 8 digits of precision? Science has somehow figured out how to make progress despite the messy business of research.

For me, it's not that I "have" to do programming, but that physical science has been computation driven since before the 1940s. Programming is how I think and work. With apologies to Richelieu, "programming is too important to be left to the programmers."


> Those fields don't even know what their best practices are

"Best practices" are a chimera. The issue at hand isn't about what is "best", but whether or not a software engineer's "good enough" practices are more likely to achieve science's goals than a graduate student's "good enough" practices.

It's also disingenuous to claim that classical music composition doesn't have "best practices" when the field of music theory exists as an explicit manifestation of "best practices" in music. Having gone to a school with a conservatory, I also believe that I know several individuals who would disagree with your mindset regarding how the creative process can't be managed. Indeed, if creativity, as it relates to musical composition, couldn't be managed, most orchestras would be brimming with anger at the number of commissions that weren't finished on time for the concert, and most Hollywood studios and Broadway shows would screech to a halt.


Okay, that's fair. I should not have included classical composition in that list.


Show me the reproducible research in programming about the merits of different type systems (murky at best). Or of different approaches to testing. Or software architecture. Or... well, most of the stuff day-to-day working programmers actually do. There are barely even attempts at rigor in most of our practices, let alone the kind of reviewed and reproduced results we demand from the sciences.


I run my unit and integration tests with every build, and they reproducibly pass if my code is working. If you have code, it doesn't take much to make it able to run again and get the same result, and it's frustrating to see Jupyter users mess it up.


In Jupyter, I do a "restart kernel and run all cells." While it's not to the level of test driven development, it catches the worst of the issues.


I'm approaching this from the "I was an electrical engineer and we had better tooling back when I was in undergrad" perspective. They just weren't free.


> Mathematica and Matlab both had "notebook" like features for a long long time.

They probably didn't take off to the same extent as Jupyter because they're not free. IIRC MATLAB was quite expensive, particularly if you wanted to do anything specialised.


Yes, both are expensive outside the student licenses, but Mathematica is significantly cheaper and has a lot more built into the language, so you don't have to turn around and buy expensive "toolboxes" for all the functionality missing in Matlab.

Notebooks have been in Mathematica for ages and are really powerful and difficult to describe to those who haven't used them. To give an example, I was building a tool and embedded images as variables in a way reminiscent of being an engineer on the USS Enterprise. You can point to a file in Python as a variable, but you can't just copy-paste an image in as a variable last I checked (don't think Jupyter is there yet).


> However, many of the practices we have grown into as sound programming advice seem to be largely thrown out the window for these notebooks.

The exact same thing happened with the arrival of the www.


I work in development of scientific equipment. Jupyter is my lab notebook. I think that to make good use of Jupyter for this purpose, you have to be a good programmer and a good scientist. No tool will turn us into these things against our will.

With that said, Jupyter has greatly improved my ability to find my own mistakes, and to reproduce my own results later on.


I think it speaks to people's desire for a quick, easy-to-set-up basic GUI creator with an editor that allows inline code editing, and no need to deal explicitly with the client-server interaction.

I myself, as someone who likes to create really solid and maintainable tools, have fallen into the notebook trap and written things like "change the month in cell 22 then execute cells 1 through 3 and 20 through 27 to update the report".

The notebook format was great for prototyping what was really a small app. You don't really have those problems when you're just generating a document.


Yes. It's called Microsoft Excel. Software engineers don't like VB for the same reason they don't like Python-in-a-notebook but you cannot deny its effectiveness.


You're right about what excel is (and the whole VB ecosystem for that matter), but I think the critical difference is that the language and environment are very different. If I know the smallest amount of python (or R) I can leverage Jupyter notebooks and it is intuitive.

To really get something great out of excel you have to learn excel. I think that difference is almost as important as the excel stigma.


Interestingly, excel has the advantage that the data has to always be visible. This is also a disadvantage, because it can't work with too much data.

My gut is that the amount of data it can work with more than compensates for the disadvantage.


I agree completely on your first point - notebooks are a poor substitute for proper software tooling. I wrote this recently [1]

> In the case of an analyst, the domain of "software engineering" lies close to their own domain. Projects in both areas require code which (ideally) exhibits clarity and reproducibility. Obfuscated software is bad [...] and idempotency is good.

> The problem, then, is when the analyst takes a core tool from their domain and applies it to a slightly different domain like software engineering. Things go south fast: your notebook has not-quite-imperative code that is untested and unmonitored. It is, in other words, bad software.

As for the point about "refactoring stuff out into python modules as functions," the problem is that the new crop of data scientists aren't learning how to do this. The role of "machine learning engineer" is emerging to address this shortcoming in SWE skill throughout the data science community. It honestly cannot happen quickly enough.

[1] https://buttondown.email/oneshotlearning/archive/c06a0ded-74...


I fundamentally agree with you, but I have the feeling that some of the major proponents of notebooks belong to the category of people who misunderstand them, and simply use them for everything, and write long and convoluted notebooks; I've definitely seen my share of those in my domain (bioinformatics, AI) and elsewhere. By contrast, Joel Grus for instance perfectly understands their strengths and weaknesses.

As for being a good REPL, I feel that an actual REPL (+ editor integration) works better than notebooks: you can combine a literate document with a REPL but still get the benefits of a proper editor/IDE and a proper execution environment, rather than a half-hearted mix of both that's hosted inside an HTML contenteditable (= Jupyter), and you also get "charts, interactivity, nice and wide sortable tables, etc" if you want. RMarkdown inside RStudio or Nvim-R does this well. I just don't want to give up the advantages of a proper editor for the very slight increase in integration that Jupyter gives me.


My complaint is that people use notebooks as production systems.


I think we can all agree some notebooks are shit storms and should not be relied upon at ALL for production. At my job we started using notebooks as an 'in-repo', 'interactive' documentation of sorts: showcase various modules and give simple usage examples of them. It was pretty awesome. I love using notebooks as a more advanced scratch pad, for the times when the ipython shell isn't enough and you want something extra. Also I had to install the vim bindings ASAP, gotta have that vim.


Is it actually a common skill to write meaningful non-helloworldish Python code that yields expected results without a number of iterations of debugging and correcting, and without PyCharm's intelligent completion, hinting and correcting features? I understand the value of Jupyter notebooks for publishing your work results, but I find it almost impossible to use them to actually do the work - it feels a million times more convenient to code in PyCharm and then copy-paste the code to Jupyter once it's ready.


I'd say it's more of a shell than a REPL. For most languages that provide a shell, there isn't a real separation between the reader, evaluator, and the printer. Being able to interact with those components separately is the real advantage of a REPL over a shell.


The majority of the complaints I hear about notebooks I think come from a misunderstanding of what they're supposed to be

No, the majority of complaints are that notebooks are great, but Jupyter is a bad notebook. I mean maybe it’s impressive to someone who’s never seen a notebook before but to someone used to Mathematica, MathCAD, RMarkdown, org-mode, whatever, it just seems clunky as hell. I wonder how many “data scientists” claiming it as their top choice have ever tried anything else?


Version control for Jupyter notebooks was one of the biggest complaints I had. Specifically, diff and merge with the JSON files (.ipynb) is ugly.

I built ReviewNb[1] to solve one of those problems (diff). Note that, there is nbdime[2] which works well for local diff/merge. The idea for ReviewNb is to have much tighter integration with GitHub etc.

[1] https://reviewnb.com

[2] https://nbdime.readthedocs.io/en/latest/


The hard part is that introducing a tool like git (which requires you to choose moments to take a snapshot of the file, and then add some commit message) breaks the flow of interactive experimentation that notebooks are so good for. And then we need to find a way to make those commits useful, because the time ordering of commits could be different from the time order in which cells were run! That is what is crucial to making computations reproducible — viewers should be able to replay the history of how a notebook result came to be. (EDIT: Note that this is the case only for stateful computations -- if a notebook interface was used to construct a dataflow graph (like spreadsheets) with values updating live, then this wouldn't be so much of a problem. More fundamentally, it is not at all obvious that thinking of notebook contents as akin to code is the best way to use version control)

I wonder whether there is a solution along the lines of auto-committing each cell before it’s executed and the results just after the cell is executed. Otherwise a user has to do too much manual organizing, which is a problem the notebook should ideally solve. When a user is happy with the experiments and the provenance of their results, they should be able to use an interactive rebase to create a cleaner version to share/archive.


I'm not a Jupyter user, but I solve the reproducibility problem with Make.

As a project moves from exploration toward production, the entire thing is wrapped into a Makefile that can flow from raw data to publication in a single call to make.
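
A minimal sketch of that pattern, with placeholder file and script names (recipe lines must be indented with tabs):

    # everything from raw data to the final paper in a single call to `make`
    all: paper.pdf

    clean.csv: raw.csv clean.py
    	python clean.py raw.csv clean.csv

    figure.png: clean.csv plot.py
    	python plot.py clean.csv figure.png

    paper.pdf: paper.tex figure.png
    	pdflatex paper.tex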


To have reproducible prototypes, I use Make to wrap the whole workflow in Docker. Then I push the code to a gist and forget about it. Although GitHub gist doesn't allow binary files, images embedded in the .ipynb (JSON), on the other hand, work in a gist. Here is an example.

https://gist.github.com/ontouchstart/854a3c280b81f530d3ae9cb...

The notebook generated by nbconvert (see the instruction in the Makefile) is too big to display in GitHub gist Web UI but works fine in nbviewer.

https://nbviewer.jupyter.org/gist/ontouchstart/854a3c280b81f...


This has been my solution as well. There’s little that feels as good as running `make -B report` and watching the whole thing be rebuilt from scratch.

How do you manage encapsulating each step, and passing data between them?


I don't do it well. See

https://github.com/4kbt/ReplicableAnalysis

and

https://github.com/4kbt/PlateWash

as examples. The former is smaller/less complicated. The latter was my thesis work -- more complicated and (unfortunately) abuses recursive calls to Make.


Would that be easier with rake than with make?


This is a super valuable perspective! We (https://qri.io) are building a kind of git/github for datasets and are hoping to talk to would-be users about just this issue. Would love to have your feedback on it (particularly on how commits are registered). Mind if I ping you at the email address you listed? - Rico


Tighter integration with git is very interesting, but this is sadly just integration with github.

I think coupling to GitHub makes sense if you are building a dev-support service, but for an end user it makes little sense to wed the VCS to a specific website.


The RCloud project covered some of this ground https://cscheid.net/2015/08/17/collaborative-visual-analysis... It takes the view that everything should be saved and versioned. In hindsight it seems obvious that this can overwhelm people with dead ends and scratch work and in general the flat workbook space doesn't provide enough help with organizing results. There are some other ideas mentioned in the conclusion of the RCloud paper.


RStudio’s Markdown notebooks do not suffer from this and save a separate output file that can be gitignored.


And they pay for this in other respects:

No inline rendering of markdown.

Opening an .Rmd file is a lottery to see if rendered graphs and tables still exist.

Tables render completely differently in the editor, HTML, and PDF.


For me, markdown is meant to be readable even when not rendered. I could see how not having persistent graphs and tables might be an issue, but my own philosophy is to start fresh each time - I treat it like a templating language with some convenient rendering features for prototyping, rather than like an IDE.

Your last point also has an upside - it's using different engines (Rmarkdown vs. Sweave). I can write whatever HTML or LaTeX code I want, depending on what's appropriate. I wouldn't want to have to make web documents with LaTeX, nor would I want to make PDFs with HTML.


> No inline rendering of markdown.

That's incorrect, take a look here -> https://blog.rstudio.com/2016/10/05/r-notebooks


RMarkdown and Knitr are dramatic improvements in terms of final outputs and VC relative to notebooks. Notebook believers (Satan worshippers, imho) would suggest that notebooks are best for developing in and not primarily made for use as final outputs.


I have used jupytext (https://github.com/mwouts/jupytext) for this and it seems to work great - it outputs a separate .py file which is easily diff-able.
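
If I remember the CLI correctly, basic usage is along these lines (the notebook name is a placeholder):

    jupytext --to py notebook.ipynb                 # one-off conversion to a diffable .py
    jupytext --set-formats ipynb,py notebook.ipynb  # pair the notebook with a .py file
    jupytext --sync notebook.ipynb                  # keep the pair in sync after edits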


Thanks for the note; this looks good.


I think that fundamentally JSON is just the wrong format for these files. Speaking from (ancient and limited) experience, I made a little notebook-style interpreter for learning Scala back in 2009 or so called scalide. It saved its files ("scalapads") to XML. XML actually worked better in some ways since most of the code could live between the tags unescaped (sans < > &), so the user code merged/diffed well. The meta-level stuff (cell boundaries etc) needed by the notebook... not so much.

In json the code has to be escaped into strings, and json is really finicky about syntax (e.g. no trailing commas). So it doesn't work well.

I never got the chance to redo it, however the solution I was leaning to for my post "I won the lottery, I can work on fun stuff" attempt was to store the meta-code in a version of the host language(s), with some simple syntax that could live comfortably in the comments of various different languages to do things like encode the cell divisions and so on.

Basically something like:

    #notebook[lang=python]

    #cell[lang=python]
    def add(x, y): return x + y
    #endcell

or, in a Scala host file:

    //notebook[lang=scala]

    //cell[lang=scala]
    def add(x: Int, y: Int) = x + y
    //endcell

This I think would be beneficial for a couple of reasons.

1. Better diffing / merging.

2. One click toggle between show source and view as notebook mode, which would really allow this to work in an IDE like vscode pretty seamlessly. The cells become something akin to //#regions in the IDE. But at the end of the day you are still editing a source code file, so you can edit the whole file easily.

3. The keyboard shortcuts for executing and jumping between cells would generally work in raw code mode, so you could just edit there continuously, manually writing out //cell //endcell. Also, the execution results could appear in block comments inline in the editor, off to the side, or in a popup above the code you are editing.

4. The IDEs could uprender the comment syntax into cells as they gained better support for the paradigm (similar to how they do for code folding / syntax highlighting already).

5. Eventually, perhaps a cross language, metasyntax could be established to make things a bit more concrete than magic comments (get ready for some serious bikeshed painting though!)

The closest I have seen anything come in this regard is Quokka; however, it's not quite all the way there.


The closest thing I've seen to what you described would be... Emacs. It actually uses the "metadata in file-specific comments" paradigm. You can put file-local values for Emacs variables in comments at the top or bottom of your file, like described in [0].

Your example could be rewritten as:

  # -*- notebook-lang: python -*-
or

  // -*- notebook-lang: scala -*-
Still, the usual way of using Emacs for "interactive notebooks" is via org-mode, which is a better Markdown with support for (among other things) executing code blocks straight in the org document you're writing. This way, Emacs supports all your points 1 to 5, and is generally more powerful than Jupyter or other similar things, but it also means you can kiss any kind of collaboration goodbye.

For some weird reason, the more powerful a tool, the less likely it is other people will be using it.

--

[0] - https://www.gnu.org/software/emacs/manual/html_node/emacs/Sp...


Well, the reason is not so weird. Generally speaking, the more powerful the tool, the higher the bar. More time and effort is required to learn it and become proficient with it. When it comes to the very powerful tools, few will have the aptitude or be prepared to put in the effort to learn them.

For those who don't, less powerful tools take their place and proliferate.

As you say, emacs checks all the boxes but the majority is not prepared to learn it and prefers to program through their browser.


Along the same lines, I would love to see a syntax something like this:

mynotebook.py

    ### (cell boundary)
    """Top-level unused strings (docstring-esque) rendered as markdown"""
    def add(x, y):
      return x + y
    # jupyter-output-hash: 0123abc (which would link to some external key-value storage for the project)
Anything in something other than the primary language could be in something like `execute_scala(""" scala code """)` - which would execute properly given proper globals.

As long as the output-hash storage is treated as append-only and is highly available (output cells could even be encrypted for security if this was a public cloud service, or you could even use a local or shared filesystem), then this file would not only parse and run as a perfectly valid Python file, but it would also hold references to outputs in a source-control friendly way. IDEs could show the cell outputs inline. If you rerun your notebook and get different outputs for some reason, `git diff` tells you exactly where things changed without being too messy. Basically, put outputs in off-chain storage, and just be a literate code file.

I feel like this would address most people's needs, no?


I nicknamed one we used during the Ebola epidemic (tight deadlines, lots of people working, etc.) "The Wall of Madness".

There were tons of

## JOHN: DONT RUN PAST HERE, EVERYTHING BROKEN

comments.


I’m glad I’m not the only one. When I inherited some “production notebooks” (if that’s a thing) I couldn’t believe it was nearly impossible to do basic things such as test and review changes (via version control).


At our company if it's in a notebook it's not considered ready for production, it must run as a script before being considered for Eng to take over from DS. It's actually not that hard to write a notebook in such a way that it converts easily to a script. Just check and make sure that your variables/functions/whatever are initialized above the cell(s) they're used in, declare all imports in the top cell, and periodically move cells to fix any inconsistencies with these rules (checking that you didn't break anything of course). I've always said that Data Scientist doesn't mean, "I don't do engineering," good basic eng practice helps make more productive data science and brings it into production more robustly. How do you know your models work well if the code that generated them is inscrutable?

I wonder how much of the "3 engineers for 1 data scientist" ratio I hear all the time is due to Data Engineering being assigned the role of cleanup to code that should be better in the first place.


I think cleanup is part of it. I also have noticed as a guy on the DS job family, but who has taken a large interest in SDE work, separating the job families can result in churn. For example, I might think of three model choices A, B and C. C may be the worst of the three, but only very marginally worse. It can also be the case that C is an order of magnitude easier to keep and maintain in production.

I've seen cases where the wrong choice here ends up requiring three SDEs for half a year, where if they gave up a tiny benefit of the best model, they could have done it with 1 SDE in 1 month.


You don't use Jupyter notebooks in production; they are super useful for pitching ideas to clients/bosses and doing some early prototyping. I feel sorry for anyone that has to work with "pure data scientists" that have no clue about software engineering practices...


FWIW, Netflix uses Jupyter notebooks in production, using nteract UI:

https://medium.com/netflix-techblog/notebook-innovation-591e...

https://nteract.io/

This approach seems promising, particularly as it facilitates cross-disciplinary collaboration.


It depends on what you're doing, yeah? In RMarkdown notebooks... yeah, I wouldn't write models in one. But if the focus is on embedding some visualizations and tables into a document, and then refreshing the document every so often to pull in new data, I can see that as a production use for a notebook. TL;DR: Can be useful for reporting, wouldn't use it anywhere else in the pipeline.


"Production notebooks" should not be a thing... Unless prefaced by a bold RUN ALL.


I'm coming to realize one of the key skills for a data engineer to have nowadays is "productionizing" notebook code from data scientists and PMs and teaching them to make it more testable and modular in the first place.


Though the name "data engineer" may be newish, the role is really an old one - and this aspect has always been the single most important part of the role.


If you are interested in workbooks which are collaborative and versioned, take a look at http://datalore.io/

Version control is transparent and integrated and it's possible to work with workbooks collaboratively.


Any chance of this being offered for on-prem install in the future? Looks interesting but cloud only makes it a no go for my team.


We're building something just like that at qri (https://qri.io), a free and open source dataset version control system. Right now all datasets on qri are public by default, but we're working toward supporting encryption and private networks.


We are seriously considering such a possibility. Do you have any specific requirements for on prem installation?


Basically just the ability to run on Linux.


Jupytext, linked elsewhere in the thread, seems like a step in the right direction. Instead of changing the whole tool, accept that you're always going to be married to GitHub and change the serialization layer to be source control friendly. Basically, split the input from the output+metadata and flatten it all to text. Then you can source control it fine and, if you need to, use the output+metadata to fold them back in.


Diffing JSON as text must be painful. Diffing JSON as data should be somewhat simple.


Are there any merge tools that offer features for this a lot more sophisticated than basic text comparison?


Not merge, but jq can diff, and it can also do a consistent dump (jq -cS) for you.

Edit: to clarify, jq -S does deep key sorting.

  $ echo '{"z":{"b": "second", "a": "first"}, "x": 4, "y": 7}' | jq -S
  {
    "x": 4,
    "y": 7,
    "z": {
      "a": "first",
      "b": "second"
    }
  }
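
So a poor man's notebook diff is just normalizing both files before comparing, e.g. (file names are placeholders; base64-encoded image outputs will still make the diff noisy):

  diff <(jq -S . old.ipynb) <(jq -S . new.ipynb)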


Powershell has Compare-Object, which will diff .NET objects. It has the convenient alias "diff". JSON can be converted to .NET objects by ConvertFrom-JSON.

So you can import 2 JSON files and diff them in Powershell.


I'm a huge fan of Databricks; it's got GitHub sync built into it.


As with so many things Python related (including Python itself), I am perplexed by how willing people seem to be to fall in love with solutions that have so many limitations and problems. I find Jupyter just barely usable. I constantly have issues with editing in the cells, diagrams not sizing correctly, cells accidentally displaying huge amounts of data and freezing my browser, complete failure of autocompletion in many languages, a very awkward security model involving manual cutting and pasting of auth tokens around, and it's nearly impossible to get a reasonable rendering of the notebook into something like PDF (yes, there are attempts at solutions; they are full of problems). Many limitations derive directly from the architecture, where the kernels are limited in what they can do because language-specific parts have to be interpreted in the browser.

From my perspective, it's a dumpster fire - in 2018 there should be something so much better than this. RStudio is a thousand times better but only does R. I used to like Beaker Notebook but it gave up due to Jupyter's popularity and converted itself into a bunch of Jupyter extensions which now have all of Jupyter's limitations.

Yet despite all this I can see that there's this enormous community that loves this and keeps developing and contributing to it.


I feel the same way, especially as an emacs user. Org-babel seems to be a superior implementation of the same idea. Org is just a text document, so git and git diffs work. I can use any combination of languages I want in a document and have them running in different sessions. And best of all I can edit code blocks using my customized major mode for that language. On top of that you get all the goodness that comes with org-mode, not least of which is the ability export it to dozens of other human readable formats for easy sharing. I think there's even an exporter for jupyter notebooks (there's at least one for ipython notebooks).


I expect Jupyter notebooks to keep an environment consistent between cells. Org-Babel doesn't generally do this. A good notebook environment is more like a Lisp buffer with block comments.


Org Babel source blocks can take a session property to maintain consistent environment(s) across code blocks. The session property can also be set as a language-specific file-level property, e.g.: https://orgmode.org/manual/Header-arguments-in-Org-mode-prop...
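
E.g. (the session name here is arbitrary):

  #+begin_src python :session mysession
    x = 41
  #+end_src

  #+begin_src python :session mysession
    x + 1   # sees the x defined in the previous block
  #+end_src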


I like Jupyter and use it all the time, mostly because I can log what I do for future reference. But I have to work around so much stuff that it can get really annoying sometimes. Yeah, compared to the Matlab IDE and how easy it is to use, it's not even close. But it's an open source project, so people tend to find it awkward to criticize it a lot (I mean, that sentiment is justified since it's mostly volunteer work, but sometimes it can get to be too much).


I love everything about RStudio, except for all the R stuff.


it's a dumpster fire - in 2018 there should be something so much better than this

In 1998 I was using a tool called MathCAD that provided a notebook interface running as a plugin to MS Word. In 2018, Jupyter is still not as good as that. Some things are just not meant to be webpages.


> Some things are just not meant to be webpages.

This is how I feel about most of the single page apps I've worked on.


If I recall correctly, MathCAD was so much point and click to enter mathematical formulas that I found it very cumbersome to use.


As soon as a git repo involves Jupyter notebooks, I move on. They're ugly, they don't let me learn how the code works properly, and in general they look awful. Why not just give me some code to run??


Why should there be something better? Just because you want it? Someone has to make it.


There already are many better things as many commenters have pointed out


Are there? For python? People like python.


Link to the deck from Joel Grus' talk that is mentioned in the article: https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUh...


Really good talk. Here's the video: https://www.youtube.com/watch?v=7jiPeIFXb6U

And all JupyterCon 2018 talks if anyone is interested: https://www.youtube.com/playlist?list=PL055Epbe6d5b572IRmYAH...


I love Jupyter Notebook for experimenting and rapid creation of reports, but dislike it for not being able to use my editor and for intermingling inputs and outputs in a single file. So I'm working on an alternative frontend to Jupyter kernels, which is heavily inspired by KnitR: https://github.com/azag0/knitj It is still being developed, but it's functional and I use it every day.


> for not being able to use my editor

At least Atom has good integration with hydrogen


This looks incredible! Does your project already support other language kernels than the Python kernel?

I use R for 90% of my work, but most of it has been happening in Jupyter notebooks (which I'm not a huge fan of, despite practically living in them for the past 4 years of my life).

Thanks for sharing!


I have not tested it with anything else than the Python kernel, but it uses Jupyter Client to communicate with the kernel, which is kernel agnostic. So you should be able to do just “knitj -k <kernel name> ...”.


If you use R so much, why don't you use RStudio instead?


your project also seems to play much more nicely with git!


True. Actually that's what I meant by "intermingling inputs and outputs". KnitJ still shows both code and its output in the rendered HTML, but unlike in Jupyter Notebook, the code is stored and edited separately in a single source file.


That's my biggest frustration by far with trying to use Jupyter for anything Serious Business. Will definitely check out your project; thanks for sharing.


I've found ob-ipython [1] within Org Mode to be one of the best options for interfacing with Jupyter. If you're sick of the limitations of working in a browser, it's worth checking out.

Scimax version: https://github.com/jkitchin/scimax/blob/master/scimax-ipytho...

Video of Scimax version: https://www.youtube.com/watch?v=dMira3QsUdg

Previous HN discussion highlighting key features: https://news.ycombinator.com/item?id=17839926

Some relevant blog posts:

- https://vxlabs.com/2017/11/24/getting-ob-ipython-to-show-doc...

- https://vxlabs.com/2017/11/30/run-code-on-remote-ipython-ker...

- https://kozikow.com/2016/05/21/very-powerful-data-analysis-e...

[1] https://github.com/gregsexton/ob-ipython


Here are the issues with Jupyter, and most other flavors of notebook:

1. Variables have to be explicitly output

The most important tool for programming, for me, is that window that shows you the current state of all the variables. When I step through a program, I look at the state. 90% of my debugging solutions come from seeing that a variable doesn't have the right state.

2. Intellisense

For the love of god, I do not want to remember if it is len(), length(), .len(), .length(), .size(), size(1) or whatever.

That's it. But those two are so big that I have to code and debug in Spyder and then paste the code into notebook. I feel sorry for people who are new who think that all the debugging is happening in the notebook.


Hey There! I'm trying to solve the issue of IntelliSense.. I'm building/improving Jupyter Notebooks inside VSCode: https://github.com/pavanagrawal123/VSNotebooks . It's a fork from another extension somebody already built, but all activity is dead, so I'm starting up dev on an active fork. I'd love to hear any feedback y'all have! :)

Also planning to add some nice debug features, plus hopefully integration into the inbuilt VSCode debugger!


Microsoft just announced an initiative like this (unfunded, community-based, likely at risk of becoming abandonware); perhaps you could combine your efforts with theirs? (The issue in their code that I'm personally most impacted by is lack of support for conda [0].)

[0]https://github.com/lorenzo2897/vscode-ipe/issues/162


HN thread for that announcement: https://news.ycombinator.com/item?id=18346198


Fixed this in my fork :)

I want to combine, but Neuron has stated they will not be accepting any PRs until December.


How do you see the idea of a vs code notebook comparing to or being different from the goals of the hydrogen editor?


To me, my favorite part of the design of Hydrogen is that it's entirely language agnostic, and can be used with _any_ Jupyter kernel.


I'm planning on adding better language support in a couple of weeks! Don't want to be limited to Python and R.


One thing you may want to be aware of is that the python language server used for vscode runs pylint which performs static analysis on the code. However, Jupyter Notebook uses autocomplete by actually introspecting the variables as they are defined. This creates large differences when doing things such as selecting a column in a pandas dataframe. In jupyter if you press tab on the column name, it can autocomplete and also assumes you are getting a series, which leads to autocomplete on things like .min, .max, etc... In pylint you don't get any of this autocomplete since pylint cannot statically determine the column names so you lose the intellisense.
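
A concrete case where the two approaches diverge (column name made up):

    import pandas as pd

    df = pd.DataFrame({"price": [9.99, 4.50]})
    # In a live kernel, tab completion on df["pr... fills in the column name, and
    # df["price"].<TAB> offers Series methods like .min()/.max(), because the objects
    # actually exist. A static linter can't infer the columns or the Series type.
    df["price"].min()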


yep! this is something I will be focussing on in VSNotebooks. I noticed this as a huge drawback to the current implementation, so I will be fixing it :)


Better debugging is something I want to focus on in VSNotebooks.

Personally, I like VSCode more than Atom, so this was one of the reasons I started working on this extension!


I think the biggest thing for me personally is that I like VScode a lot more than Atom. Also, I'm going to be focussing more on better debugging, which AFAIK, is not heavily emphasized in hydrogen.


Thanks for working on this. We’re converging on the Bret Victor IDE


Re #2, if you haven't tried the newest versions recently (and especially with the jupyterlab beta which has a nicer completion GUI), I'd encourage you to take a look! It's come a long way, along with the library that's doing the completions under the hood.


Thanks for the tip. Will take a look. I thought I was using the latest version.


Re #1: There's a plugin in nbextensions that shows variable values a la spyder.


Good to know. Will check it out.


For people who prefer to code in JS, there is a similar application called Observable notebooks that has recently come out: https://beta.observablehq.com/

It offers some nifty things including, well, observables where cells of the scratchpad can automatically update by observing changes from other cells.


Thx for the pointer. Seems this is related to Mike Bostock, the guy behind d3.js; will definitely look into it.


If you want to have interactive computations in Python, http://datalore.io/ has support for this. It feels more or less like Excel with ability to write Python in there.


More of an online service, though.


true true.


I like R for many things, but Python just keeps getting more compelling, particularly given the excellent machine learning packages. As these sorts of toolchain elements get better and better, and as more people realize that there's a benefit to simultaneously training researchers to run code as well as stats, I suspect we'll start to see an exodus from pure R solutions.

The real question is when (and whether) new social scientist stats courses will start teaching Python stats toolchains, rather than R. That seemed to be an inflection point for R (as folks moved away from SAS), and could be for stats-centric Python too.


I'm not sure what about Jupyter makes Python more compelling in comparison to R. R is entirely usable in Jupyter Notebooks, and R Notebooks are, in my opinion, possibly superior to Jupyter notebooks in many ways.

> and as more people realize that there's a benefit to simultaneously training researchers to run code as well as stats, I suspect we'll start to see an exodus from pure R solutions

I'm not sure what you are saying here.

I would actually argue that most of the Python data science toolchain is years behind what is available in R.


> I would actually argue that most of the Python data science toolchain is years behind what is available in R.

I do not want to litigate this on HN, but the problem with R is the toolchain around your data science work.

You've fit a model in R, and that's great! Now how do you get it into a real-time system? Or how do you test the software you wrote to train the model?


> R is entirely usable in Jupyter Notebooks

Not for everybody, e.g., the Swirl R package (https://swirlstats.com/) doesn't work in Jupyter, since Jupyter has limited support for R's many ways of getting interactive input from users.


s/possibly/definitely/


It seems to depend on what you’re doing.

Python definitely has more mindshare for machine learning, and particularly deep learning. However, that’s not all of statistics. For things like mixed-effects modeling, I think R still has a clear lead. There are some Python packages (e.g., statsmodels) but R’s lme4 has more features, like custom covariance structures, and virtually every textbook and tutorial currently uses R. I’m not sure I’ve ever actually encountered statsmodels in the wild. PyMC is relatively popular, but I think BUGS/JAGS are more common.


And then there is the Zelig modeling framework for R that I can’t imagine not using after having used it.

Don’t get me wrong, I like Python well enough, and knew it before I coded R. But Python is really behind R in stats support. I’d also add the tidyverse in there for general data munging.

If I want libraries I’ll use R; if I want a programming language I love I'll use Racket or maybe Clojure; if I want some libraries and an okay programming language I’ll use Python, I guess.


Woa, thanks for pointing out Zelig, I needed that relogit and I didn't even know it :)


The counter factual simulation features are amazing and easy.


Wondering why it hasn't got more publicity?


The migration between languages is also industry specific. They are still teaching SAS to finance and healthcare analysts, for instance, and R and Python are still rising in healthcare specifically. Keep in mind all the legacy code and all the coders who just know SAS and don't need to change. It'll take longer for the transition than you think.


I have tried to move from SAS to R a few times; at this point it's largely inertia. There is so much in my org already written in SAS that it's very difficult to convince people to change.

I do like the SAS dev tools (especially Enterprise Guide). I'd really love it if R had some sort of GUI front end for non-technical people. E.g. I know the finance analysts in our org wouldn't have a clue how to configure their own ODBC sources, which you need to do with RStudio. Until it's as easy for them as SAS, convincing them to switch won't get any traction.


You may want to try Julia


I'm definitely well in the R camp but keep feeling this nagging pull from Python. Especially for trading...it would be so nice to have a language for both research and production, as right now I translate all my research into scala for production.


I hate to be the stereotypical Julia recommender, but it is made for this use case, more so than Python, which isn’t all that much faster than R if speed matters. (Unless you want to try Cython but that’s a whole bag of worms.)


I'd second that. R and Python both have the same pre-LLVM performance issues.

I don't expect either R or Python to go away any time soon, nor would I want them to, but I would like to see people moving to things like Julia and Nim, which have the same level of expressivity but are much more performant. I have difficulty imagining many people saying "I love programming in R and Python, but don't like Julia or Nim."

I like Python, but at least for stats/numerics there isn't a big reason to move away from R except for specific libraries (especially DL stuff) or front-end integration with web-land (and even then, things like Jupyter mitigate that).


I would also add two good reasons to stick with R: RStudio and Hadley Wickham.

In theory, there are Python and Julia equivalents to RStudio (JupyterLab, Spyder, PyCharm, Juno, whatever) but RStudio is just so, so, so good. A truly great piece of software.

And of course if you have a data pipeline type workflow, and it fits into the Hadleyverse paradigm and isn't too performance intensive, there's nothing better.


Julia has its own "*verse"-style data pipeline framework, with an even greater variety of backends and plotting solutions than R.

It's still in development (mutate and select have PRs) but it's almost there.

https://github.com/queryverse/Query.jl


I've tried it while it was unstable and it was excellent (I remember some random forest training that went from a few days down to half an hour). Since everything was unstable at the time, one update was all it took to break everything. Now that 1.0 is out, I'd definitely like to pick it up again. Unfortunately for my current use cases, the ecosystem just doesn't exist like it does with R or Python.


It's easy to grade student assignments in notebooks with https://github.com/jupyter/nbgrader, which also makes it great for teaching.


Meh. As long as you have defined deliverables between grader and student, grading programming-based assignments is relatively easy. Coursera has been around longer than Jupyter has been popular, after all. (And they aren't all just multiple choice.)

Being interactive is what makes it good for teaching. But there are plenty of interactive options. And for a certain class of teaching, it is not "on rails" enough: people will need a ramp-up period on Jupyter before they can really get into their topic.


The only thing that stops me from being able to use notebooks full time is that their intellisense is horrible compared to IDEs'. I like being able to use them for demos/presentations, but I can't imagine trying to code in one primarily. Especially when it comes to tracking results.

How do people cope with this? Do you supplement it with other tools? I spend a lot of my time in an IDE and then just paste some of the code in to cells. That seems easier.


I do the opposite, my job is kind of bad data engineer/scientist/etl minion so it's a lot of dataframes.

Work (and often debug) in jupyter -> open the notebook from pycharm when it's got some completed thoughts and write into a python module + test module, tidying up and adding type annotations.

Sometimes doing that multiple times so that the notebook is importing from modules which were originally pulled out of the notebook.

It sucks having to use two tools but I don't think there's any one tool that can do both as well as pycharm/jupyter, short of me getting a lot better at emacs or writing a lot of custom Atom extensions (I think).


I am very hopeful that JupyterLab will get support for the Language Server Protocol sometime soon. That would make all the difference in the world for me. I'd still have to use a terminal to build and run tests, but I wouldn't be surprised if a test runner comes along fairly quickly after that.

(Relevant issue: https://github.com/jupyterlab/jupyterlab/issues/2163)


Data frame rendering in the various notebooks (beaker,jupyter,zeppelin,..) is wonderful. Your workflow sounds closest to what I do. If I want to visualize something I tend to compile my thoughts/imports and organize things in an editor first and put it in a notebook in parallel. It helps with version control as well.


I am a Spark data engineer and spend a lot of time in Scala / Python IDEs & browser notebooks. Databricks lets you package code as JAR / wheel files & attach the binaries to the cluster. I write all the complicated code in tested projects that are checked into GitHub & use the notebooks to invoke the functions and visualize results.

Folks that try to do all programming in notebooks typically drown in complexity and suffer.


Yeah I agree. We do something similar if we're using zeppelin or beaker. I organize it, put an uber jar in there and then run everything from there. That's a ton easier.


When you're processing a lot of data, it can be expensive to keep re-running your whole script every time you make a change. The notebook keeps the results of your earlier steps in memory when you want to change and re-run a later step.

This is a trade-off between how much code you're writing and how much data you're processing. If you're writing maybe 20 lines of code but you have enough input that it takes several minutes to run, the notebook becomes a clear win for your development process.


So does the standard terminal REPL in Python. You can achieve the same workflow by having a plain old Python file and then using your favorite editor's "send block of code to console" function. This way, you retain your editor's functionality while working just as interactively as with a notebook.


But then there are plotting and interactive widgets in the notebook.


You can generally persist the results yourself to disk, though. Especially since a lot of things end up being numpy arrays. So you run one script that saves all the results, and another that loads them and runs just the part of your workflow you want. Bonus: it's persisted to disk on top of that! I know things get more complicated than that, but I'd say the compelling use case for notebooks isn't the state saving but more the whole package in one place (state persistence, visualization, interactive repl, ...).
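E.g. something as simple as this (toy numbers, made-up file names):

    import numpy as np

    # script 1 (run once): the slow part, persisted to disk
    np.random.seed(0)
    features = np.random.normal(size=(1_000_000, 10)) ** 2   # stand-in for expensive work
    np.save("features.npy", features)

    # script 2 (re-run as often as you like): cheap reload, then iterate
    # on just the part of the workflow you care about
    features = np.load("features.npy")
    print(features.mean(axis=0))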


Yeah, I often do that myself, but it's not as convenient for a quick one-off data exploration.


If you miss intellisense, you can try datalore (https://datalore.io/).

P.S. Disclaimer: I lead this project at JetBrains, Inc.


Is your plan with this to always have it as what seems like a hosted service?

Is it possible to use it as what seems like a drop-in replacement for jupyter notebooks?

We have more data than I think would make sense to transfer out of our clusters/datacenter, and privacy issues would probably be raised, but I would love to use something like this.


>Is your plan with this to always have it as what seems like a hosted service?

We are seriously considering an on-premises version.

>Is it possible to use it as what seems like a drop-in replacement for jupyter notebooks?

Jupyter import/export will be released soon.


Already a customer, you have nothing you can sell me :).


Even IDEs like RStudio pale in comparison to proper text editors when it comes to actually editing the code.


I've become a big fan of Hydrogen recently. It's Jupyter notebooks for Atom.

https://nteract.io/atom


I find this odd because I am the opposite - one of my primary use cases for Jupyter/ipython in general is the ease with which I can get 'live' code introspection and intellisense. It's often my prototyping sandbox for python code that I then move into my IDE once it's close to being ready.

I also notice that developing in this way encourages me to create smaller, more testable functions that I can easily work with inside a single notebook cell.


Doesn't PyCharm provide IntelliSense?


It's not about writing code as much as it is about exploring the data.

If you're writing a lot of code in them, it's probably better to put that code into libraries that get imported and reused.

And I do agree that default code environment is unbearable. Particularly the auto insertion of completing quotation marks, which has me continually fighting with the editor to get correct code into a tiny web text box.


Oh, I won't argue with you there. I just find myself rotating quite a bit because I have to do both deployment and writing code for experimentation.

What I'm specifically talking about is even that kinda hacky experiment code you end up writing. I don't try to implement whole projects in there, but even just "train this model" type code ends up being a hassle because of how bad the editors are.

My above comment was more referencing wishing I could spend more time writing experiment code in jupyter without copying and pasting all the time.


That's surprising, because I have the opposite experience! Since my first cell imports all of the libraries I want to use into memory, the intellisense works without fail, regardless of how big the libraries are. Compare that with my VS Code experience, where using intellisense to pull up functions' doc strings takes an age for all but the built-in Python libraries.


I'm not a Python dev. Is it not common to just type and let the editor auto-import the required libraries for you?


Java's tooling for this is among the top. We're spoiled compared to the dynamically typed languages :)


I don't tend to do so in Python whereas I do in Java.

Maybe due to often importing and naming (something you don't do in Java.)

E.g

    import matplotlib.pyplot as plt
vs.
    import java.util.*;


You'd think so. Maybe my setup is faulty. Something for me to look into.


Hey there! I'm trying to solve this right now in VSCode's built-in editor: https://github.com/pavanagrawal123/VSNotebooks . It's a fork of another extension somebody already built, but all activity there is dead, so I'm starting up dev on an active fork. I'd love to hear any feedback y'all have! :)


NBextensions and doing mostly data analysis in notebooks then building actual code in a text editor. I would do this even if notebooks had perfect intellisense support.


Dividing code between models/data-pipelines and experiments. Notebooks are used for visualization and for telling a story to the other team members about why you tried what.


Yeah but the whole point is "interactive coding". It doesn't feel very interactive when I have to context switch all the time :). I'd prefer something closer to what the lisp folks get to do with the repl where you can scratch out an idea and see it working without leaving your environment.


well, I don't think so. Not everything you do is interactive. Data exploration and basic model selection is, but complex models and more complicated data-pipelines/preprocessing isn't, I think. Tensorflow is the opposite of interactive, even in a notebook.

Putting models (in the sense of more complicated models, not just an SVM), data pipelines, and shared visualization code in src folders, and experimenting in the notebook, divides stuff that's interactive by nature from "real" coding. I don't context-switch that much, to be honest.

I don't really copy code into cells, because I only experiment there.

Also, what happens if you need to share code between notebooks?

I think notebooks should be simple and explain the experiments and the reasoning behind them to your coworkers. Otherwise it's hard to coordinate and learn from each other's insights into the data.


I've switched largely to Jupyter / Python for computational linguistics / psycholinguistics because of the pandas / numpy /numba stack, decent off-the-shelf NLP (spacy and gensim), and the ease of moving data into an R kernel for specific analyses and plots. Also nice that any reasonably sized notebook will render on GitHub (and access can be controlled through the accounts system, until something is ready to be public).

One thing I haven't figured out how to do is to generate fully styled LaTeX manuscripts from notebooks (like papaja for RStudio). Is there a way to do this with pandoc?


Having NLTK and scikit-learn in the same environment as my stats tools is... tantalizing. And if I could write Markdown-to-TeX docs straight from Jupyter, rather than the R -> TeX tables/variables read into LaTeX I'd used before, that'd be a massive win.


Yes! Jupyter notebook has an export to .tex in its export menu. If you install the right stuff you can render the output on the server.

However, the output of this isn't nearly as well formed as a hand-written LaTeX document.
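You can also drive the same export from Python if you want it in a pipeline; roughly like this (a sketch using nbconvert's LatexExporter, with a made-up notebook name; template customization differs between nbconvert versions):

    from nbconvert import LatexExporter

    exporter = LatexExporter()
    body, resources = exporter.from_filename("analysis.ipynb")  # hypothetical notebook

    with open("analysis.tex", "w") as f:
        f.write(body)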


Yeah but I believe you can customize the export template, right?


I see Jupyter notebooks as the next step in spreadsheets with more code foundation and different media support. For me the interface doesn't work as well as I would like but I can see the potential.

My brother in law wants something like this for structural analysis reports where the code, data and report are all one thing that can be pulled out and examined.

Watching the new iPad announcement today I think this is something that would make an excellent iPad app as well.


I used to use Excel extensively. I've started using Jupyter as a replacement. Some things are great, like Python modules that can do anything, and visualizations. But if there are fewer than 100k rows, it's still much easier to just use Excel. I'm kinda disappointed. If you have more than 100k rows then Excel starts to be cumbersome. That is the sweet spot for me.


If you are interested in spreadsheets but with Python, try http://datalore.io/ It tracks dependencies automatically and recalculates them when needed (this behavior can also be made less automatic).


Yeah, I have a use case now where we pull data from a database, manipulate it, and then have a final table/csv/dataframe/whatever. The problem is then how to share this with non-technical users. In an ideal world, this would get inserted into a Google Sheet, and that sheet would just update daily after new data is loaded into the database.

I'm pretty sure this is a use case which others have, and I'm curious what people use to solve it. I've heard, variously, that some options are to use Tableau or similar, or to email a CSV and ask the end user to import it into Google Sheets/Excel.


Google Sheets has an API. So if putting the output there is the ideal, just use that API from Python?
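Something roughly like this with the gspread wrapper, assuming a recent gspread, a service-account credentials file, and a sheet shared with that account (a sketch; the exact update signature has shifted between gspread versions):

    import gspread
    import pandas as pd

    df = pd.DataFrame({"date": ["2018-10-30"], "revenue": [1234.5]})  # stand-in for the query result

    gc = gspread.service_account(filename="service_account.json")
    ws = gc.open("Daily report").sheet1

    ws.clear()
    ws.update(range_name="A1",
              values=[df.columns.tolist()] + df.values.tolist())

Schedule that script daily (cron, Airflow, whatever) and the sheet stays current without anyone touching a CSV.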


yeah, issue is that if you're inputting more than ~500 rows, you'll be rate limited :|


Your brother in law’s wish is granted. Blockpad is specifically targeted towards such reports

https://blockpad.net


Already today you can run a Jupyter notebook on an HTTP server and serve it to Safari on an iPad.


RStudio with RMarkdown is also popular. Both workflows are language agnostic.


Putting aside the R vs Python question (as noted in this thread, you can use R in a Jupyter notebook and Python in an RMarkdown notebook), I much prefer RMarkdown notebooks. RMarkdown notebooks are plain text, so you can read them easily in any text editor (which also means they play well with git, unlike Jupyter notebooks).

And it's meant to work with the RStudio IDE, so I get a much more seamless experience going between regular code and notebooks (although this is admittedly a more R-centric benefit, at least until and unless RStudio adds Python support outside of notebooks).


Using markdown for python notebooks makes so much sense. What was the thinking behind encoding everything inside JSON?


A benefit I like about rmarkdown is that it makes it very easy for me to create templated reports. They're built in a way that makes it easy for me to work either iteratively (due to caching of blocks) or rerun the whole thing and get an output.


Of note--while Rmarkdown's python support used to be pretty bad, the most recent Rstudio uses the package reticulate behind the scenes, and it works really well!

It's easy to share data between the R and python session, and calling python from R, or R from python is straightforward.


IMO, Jupyter is nice for presenting the final results of research (like LaTeX), but it is often not the right tool to get there. It's good for professors who teach and publish but it's bad for students to learn and research.

Frankly, I find that all programming environments for scientific computing are deficient in some way or another. If you look at the set of features in Visual Studio, RStudio and Jupyter notebooks, you will see that the union of useful features is large, and the intersection is almost empty.


IMO Jupyter notebooks are popular because they're open source, they are convenient, and they help a lot in illustrating an idea.

The results of a notebook can be shared more easily than a plain repository (via nbviewer or Binder) and, more importantly, the science there is reproducible.


Question/idea: could a notebook model supplant bespoke photo-processing software such as the "darkroom" mode of Lightroom (or darktable)? The extant programs essentially take a lot of data (a camera's raw output) and apply a configurable recipe to produce intelligible output (an image). Each recipe (stored as an XMP sidecar) is essentially a list of math operations (increase brightness, wavelet decompose, change color model, etc.) and their parameters.

Obviously a great part of why we use Lightroom/darktable is because of the speed with which the recipe-processing occurs. Plus a smooth UI, a catalog-viewing feature, and a well vetted choice of image operations. The appeal of moving this work to a notebook would be that an actively maintained Jupyter ecosystem could supplant lock-in to a specific software, and open up the underlying math magic.

At the very least, this could be an interesting platform for experimenting with image processing methods. And the reordering of cells could become a virtue, to run an image processing pipeline out of the standard order.

I'm curious if anyone has already worked along these lines. I find through a quick web search that people are doing some image processing, but more in the face detection or ML for medical imaging aspects. I see as a basic toolkit that http://scikit-image.org/docs/dev/auto_examples/ is something, though this isn't the whole range of operations needed for, say, fine art image tuning.
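To make that concrete, a recipe could be as simple as an ordered list of (operation, parameters) pairs applied with scikit-image (a toy sketch, nowhere near fine-art grade):

    import numpy as np
    from skimage import data, exposure, filters, img_as_float, img_as_ubyte, io

    img = img_as_float(data.camera())   # grayscale sample image bundled with scikit-image

    # the "recipe" plays the role of the XMP sidecar: an ordered list of
    # (operation, parameters)
    recipe = [
        (exposure.adjust_gamma, {"gamma": 0.8}),               # brighten midtones
        (filters.unsharp_mask, {"radius": 2, "amount": 1.0}),  # sharpen
    ]

    for op, params in recipe:
        img = op(img, **params)

    io.imsave("out.png", img_as_ubyte(np.clip(img, 0.0, 1.0)))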


Yes and no. I do computer vision and like photography as well. I use Jupyter notebooks extensively for computer vision, and they work pretty well for (semi) interactive manipulation of image data with code. But as a general purpose tool, it's too clunky for anything more than prototyping. I don't see them replacing darktable/lightroom anytime soon.


Zeppelin[1] is another great tool of a similar nature. It leans a little bit more towards the Scala / Spark world, for people who like that stack. That said, you can use Python, R, etc. with Zeppelin as well.

[1]: https://zeppelin.apache.org/


Does anybody know of a good hosted solution for JupyterHub? I made a neat notebook that I needed to share with my non-technical team; it was using ipywidgets to do some interactive modeling, but they each needed to be able to use it independently. It has private data so I couldn't use Binder. I've been following Zepl.com for a long time, but couldn't use them here because Zeppelin doesn't support ipywidgets. Pretty soon I found myself installing helm and trying to follow along with a tutorial on how to deploy JupyterHub on a Kubernetes cluster. That started to add an unmanageable level of complexity to own, especially to share a simple notebook. And while spinning up a GKE node per user is the whole point of Kubernetes, it got expensive quickly in my test. We cannot spend $75K a year on Domino. Any other options?


Do you have some server to host it on?

If so, I’d run the notebook on the remote server and just teach them whatever command they need to make an ssh tunnel there. Something like what is described here: https://techtalktone.wordpress.com/2017/03/28/running-jupyte...

So they would utter the unknowable incantation and then point their browser at localhost:8000 or whatever and then use their version of the notebook.


I think that would be within reason for most of our users, but we have a couple of Windows users. I'm not particularly keen on telling them to install putty or WSL.


Ah, well, they can always borrow a less disadvantaged coworker's computer until they get the sysadmins to install and configure a shortcut on their desktops via Active Directory, I guess.


Polyaxon, https://github.com/polyaxon/polyaxon, is an open source platform that tries to simplify not only running notebooks on kubernetes, but also tries to solve issues related to scaling, tracking, and reproducibility.

Disclaimer, I work on Polyaxon.


This isn't exactly JupyterHub but I run https://nextjournal.com which is a fully hosted notebook platform.

Sharing articles with team members and letting them run them is trivial. We automatically version the article, the data and the environment (docker image), and you can remix (fork) other articles. `xoxo` is a signup code you can use if you want to give it a try.

We're not far out (~two weeks) from launching our beta for private research. Here, you'll get your own private data store and docker registry as well as secrets management (stored securely in HashiCorp's Vault).


I'm literally building this right now to fulfill this need. If you shoot me an email to hugo@opensourceanswers.com, I can let you know when it's ready. My plan is to charge a premium (similar to github prices) per user, and pass on compute costs directly to the customer with no markup


Contacted. I think that is a great business model.


I'd be interested in understanding your use case better. Can you send me an email? ben@kaggle.com


After laying that all out there I now realize I could have just given my handful of users their own files. Doh.

Still, the fact remains that JupyterHub is powerful but difficult to install and manage if you're not a university IT dept. Any SMB solutions?


https://the-littlest-jupyterhub.readthedocs.io/ is really easy to deploy on your own server!


Now this is my speed, thanks for sharing!


Notebooks are great for invoking existing functions and exploring data.

Notebooks aren't ideal for creating functions (standard text editor features are lacking and testing is impossible).

Notebooks encourage an "order dependent variable assignment" programming style without abstractions. Here's what you'll commonly see in a notebook:

df = spark.read.csv("some_data")

df2 = df.withColumn("clean_name", trim(col("name")))

df3 = df2.filter(col("clean_name") == "Mark")

I've found that notebooks are very useful if you write all the complicated code in separate GitHub repos and attach binary executables to the cluster. If you try to write all your logic in notebooks, you'll quickly struggle with order-dependent, messy code.
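For example, the same logic pulled into an importable, testable function might look like this in PySpark (a sketch; the function name is made up):

    from pyspark.sql import DataFrame
    from pyspark.sql.functions import col, trim

    def filter_marks(df: DataFrame) -> DataFrame:
        """Trim names and keep only the rows for Mark."""
        return (df
                .withColumn("clean_name", trim(col("name")))
                .filter(col("clean_name") == "Mark"))

Then the notebook is just filter_marks(spark.read.csv("some_data")) plus whatever you want to look at, and the function itself lives in a repo with tests.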


I had the same trouble with order dependence as notebooks got to a certain size, so my team and I created and open-sourced a library, Loman, to help with that. It allows you to interactively create a graph, where nodes represent inputs or functions, and then keeps track of state as you change or add inputs, intermediate functions and request recalculations. Our experience has been broadly positive with this way of working. As graphs get larger, it's easy to lift them into code files in libraries, while continuing to modify or extend them in notebooks. The graph structure and visualization make it easy to return to loman graphs with up to low hundreds of nodes, which would make for a fearsome notebook otherwise. It also makes it easy to bolt Qt or Bokeh UIs onto them for interactive dashboards - just bind UI widgets and events to the inputs and widgets to the outputs. They can be serialized, which is useful for tracking exceptions in intermediate calculations when we put them in airflow to run periodically, as you can see all the inputs to the failing calculation, and its upstreams.
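Stripped of Loman's specific API, the core idea is just a graph of named nodes in which changing an input marks everything downstream of it stale, and only stale nodes get recomputed on demand. A toy sketch of that idea (not our actual implementation):

    class Graph:
        def __init__(self):
            self.funcs = {}    # node name -> (function, list of upstream node names)
            self.values = {}   # node name -> current value
            self.stale = set()

        def add_input(self, name, value):
            self.values[name] = value
            self._invalidate(name)

        def add_node(self, name, func, deps):
            self.funcs[name] = (func, deps)
            self.stale.add(name)

        def _invalidate(self, name):
            for node, (_, deps) in self.funcs.items():
                if name in deps:
                    self.stale.add(node)
                    self._invalidate(node)

        def value(self, name):
            if name in self.stale:
                func, deps = self.funcs[name]
                self.values[name] = func(*[self.value(d) for d in deps])
                self.stale.discard(name)
            return self.values[name]

    g = Graph()
    g.add_input("x", 3)
    g.add_node("y", lambda x: x + 1, deps=["x"])
    g.add_node("z", lambda y: y * 10, deps=["y"])
    print(g.value("z"))   # 40
    g.add_input("x", 5)   # only x's downstream nodes become stale
    print(g.value("z"))   # 60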

[1] GitHub: https://github.com/janushendersonassetallocation/loman
[2] Quickstart/Docs: https://loman.readthedocs.io/en/latest/user/quickstart.html


The function bit has always confused me. I tend to write code with lots of functions/modules/classes for handling various aspects of the analysis and I just don't understand how that's supposed to be integrated. Instead notebooks seem better designed to handle small code snippets that rely on well known libraries. I'd be happy to jump on the notebook bandwagon but I'm having trouble seeing how I could adapt my code to the notebook style.


Having spent a decent amount of time learning to be a programmer while doing scientific image analysis in Matlab (shudders from the real programmers), and with a decent amount of time spent in Mathematica as well, I just can't seem to buy into the Jupyter/notebook-based programming enthusiasm. The talk linked in the article explains it better than I ever could, but for me, when I am leaving data in memory, it is much more convenient to have a completely linear history, ordered by command execution time.

In Python I have found the best way to do this is writing standard Python functions and scripts, and running them in an IPython environment with the %run magic. You have the linear history, git works well on standard .py files, and you can interactively work with the data at the IPython prompt without worrying that something is proceeding nonlinearly. What I find works best is to explore with the data in the live prompt, which gives you interactivity, and then slowly build up a master collection of functions and commands that, when run with a single command, can reproduce the results you got while exploring. Then, to come back to the data at a later point in time, you only have to run one script file on the raw data.

Of course, this is kind of the point of the Jupyter notebook, but I find that when I want to change parameters, the ability to jump around and redefine things means I often do. By moving from IPython to a script/functions I run, I ensure that everything progresses linearly. Idk, just my two cents.


Which talk? I didn't see any link to a video.


Here they are - they're definitely irreverent, but I find myself strongly agreeing with everything he says.

https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUh...

https://www.youtube.com/watch?v=7jiPeIFXb6U&feature=youtu.be


Many comments are about implementation details, such as the JSON format of Jupyter, or compare the user experience with IDEs and shells, and fail to see the fundamental difference between Emacs and Jupyter, which is captured in this quote from the article:

“In many cases, it’s much easier to move the computer to the data than the data to the computer,” says Pérez of Jupyter’s cloud-based capabilities. “What this architecture helps to do is to say, you tell me where your data is, and I’ll give you a computer right there.”


Jupyter is lovely (and JupyterLab looks delicious), but the setup required to achieve a reproducible local server with the R kernel and versioned R packages is 100% not.

Installing R packages through anaconda is like pulling teeth and the docker images for my Jupyter notebooks push past 6GB and take multiple cups of tea to build.

Is there a good solution I'm missing? A good hosted solution perhaps?


I run https://nextjournal.com

We achieve full-stack reproducibility by allowing you to install arbitrary software and version these environments using docker. You can reuse these in other articles or pull and use them locally. `xoxo` is a signup code you can use if you want to give it a try.


Looks cool, do you have an article that explains the stack from a technical perspective? Is this based on Jupyter?


Not yet. It isn't based on Jupyter but it's all written in Clojure. Been meaning to do a writeup on our stack for some time…

In addition to our own runtime protocol, we support Jupyter kernels and you can import Jupyter and (R)markdown documents.


I work for a company, Code Ocean, that aims to solve this issue. We have a custom-UI for installing packages through a variety of package managers, including Conda, CRAN, etc. Here's an example of some Julia notebooks being rendered to HTML with the environment fully configured and accessible https://codeocean.com/2018/08/16/counterexamples-on-the-mono...


you have google colab notebooks: https://colab.research.google.com



I run https://rnotebook.io -- it's free for the foreseeable future.

The Docker image is 6GB.


I recently wrote on how to build/grow clean software out of Jupyter notebooks and on pitfalls to avoid when coding like that: https://github.com/guillaume-chevalier/How-to-Grow-Neat-Soft...


I submitted it to HN too, why not! Here: https://news.ycombinator.com/item?id=18339703


What was the earliest of these tools? Mathcad? Mathematica? Maple?


I don’t know, but I found this: https://patents.google.com/patent/US8407580

Also: https://www.theatlantic.com/science/archive/2018/04/the-scie...

“The notebook interface was the brainchild of Theodore Gray, who was inspired while working with an old Apple code editor. Where most programming environments either had you run code one line at a time, or all at once as a big blob, the Apple editor let you highlight any part of your code and run just that part. Gray brought the same basic concept to Mathematica, with help refining the design from none other than Steve Jobs.”


The original IPython notebook was consciously imitating Mathematica.

I don’t think it quite makes sense to compare these notebooks to Knuth’s literate programming. The whole point of that was that you could present things out of order, which is impossible and actually a huge pain point for notebooks.


> What was the earliest of these tools? Mathcad? Mathematica? Maple?

The first version of MathCad (for DOS) came out in 1986, but it's difficult to find info on what it looked like. Did it already have the notebook interface? This is how MathCad looked in 1989:

https://en.wikipedia.org/wiki/File:Mathcad_252_screenshot.pn...

Mathematica 1.0 came in 1988, and it definitely had the notebook-interface.

https://reference.wolfram.com/legacy/v1/contents/whatis.html

Wikipedia says Maple got its first graphical interface in 1989.


There’s a relatively esoteric paradigm known as “literate programming” which has been around since Knuth (he wrote the book [0]) and has some software tools associated with it, of which Jupyter is a particularly web-age example.

[0]: https://en.m.wikipedia.org/wiki/Literate_programming


Literate programming was esoteric, true, but the concept saw a huge renaissance in academia and data science with the advent of RMarkdown¹ which for many of my colleagues is the default way of preparing technical documents. Another area in which literate programming has become hugely popular is Emacs' Org-mode ecosystem which has fantastic support in the form of Org Babel². I use literate programming for almost everything. Research papers, tech reports, notes, experiments, teaching materials, letters, student evaluations, and so on. It's completely ridiculous how useful it is once you get the hang of it and make it your default document type.

[1] https://rmarkdown.rstudio.com/ [2] https://orgmode.org/worg/org-contrib/babel/


I'm still not sold on writing complete programs this way (I did try, with various levels of success), but even a partially literate approach is ridiculously convenient if you happen to live in Emacs.

I do my task management and note-taking in Org-mode, and recently I found myself doing things like jotting in the middle of my notes[0]:

  #+BEGIN_SRC http
    GET address.to.api:123/sth
  #+END_SRC
and tapping CTRL+C twice, to get the actual response of the API I was debugging.

Or, the other day I was making notes about gravity batteries, and was wondering how efficient one startup's solution is. I briefly thought about firing up Jupyter, but then simply wrote the following[1]:

   these guys power a LED (or three?) with a 0.1W, generated through dropping
   a 12kg weight down 1.8 meters over 20 minutes.

   Doing some basic math on that:
   #+BEGIN_SRC elisp
     (let* ((m 12)
            (g 9.81)
            (h 1.8)
            (_t (* 20 60))
            (E (* m g h))                    ; E = m*g*h
            (P (/ E _t))                     ; P = E/t
            (efficiency (/ 0.1 P))           ; efficiency = Pout/Pin
            )
       `("ideal power [W]" ,P
         "efficiency [1]" ,efficiency))
   #+END_SRC
Typing CTRL+C twice, out pops:

  #+RESULTS:
  | ideal power [W] | 0.17658000000000001 | efficiency [1] | 0.5663155510250312 |
(which is automatically rendered as an org-mode table I can operate on, or even reference in other code snippets).

Point being, note-taking in org mode makes it ridiculously easy to invoke any programming language you hooked up to Emacs without breaking your flow, and you get to edit the code in the mode specific to that programming language - so everything from autocomplete to linters work.

I know Emacs is niche, but I can't recommend it enough.

--

[0] - BEGIN/END_SRC block is under convenient autocomplete of "<s TAB".

[1] - this is a real note, so if I got the physics wrong, I just made a fool of myself publicly -.-


  > 10 PRINT "HELLO WORLD"
  > RUN
  HELLO WORLD
  >
Just being a smart-aleck, of course, but scientists have been using interactive programming tools since they became available, and they have only grown in sophistication.



tangle and weave, by knuth


emacs? hmm


I kind of find Jupyter an indictment of other coding tools, really; it's 2018 and they're normally kind of weak or kind of unprogrammable. It feels like we're waiting for someone to really reinvent Emacs, preferably using web tech.

Most editors can't open a terminal that you can use vim keybindings on to search/navigate history and treat like any other buffer.

VSCode -> not currently possible, because they wrote it in a restrictive way with Panel as a special case, very different to the code window.
Atom -> probably possible, but I don't think terminal-plus is quite it.
Any IDE I've tried -> not possible.
Emacs -> possible.

Not that this is the be all end all feature but it is useful as hell and kind of a litmus test for whether you can program your environment.

edit: LightTable seemed kind of cool but became abandonware like the author's other projects


I use Jupyter in emacs: ein-mode. The whole concept of programming in a browser sounds bizarre to me. I have a tool that's designed for programming (emacs) and a tool that's designed for streaming cat videos (firefox), and I should use the latter for programming? Thanks but no thanks. IME emacs works pretty much perfectly with Jupyter too; there is no need to use firefox for something it's not designed to do for my full-time job.


I think web browsers just end up being a convenient place to start a user interface that targets a bunch of different platforms. I am 100% on the emacs bandwagon myself, but it's hard to ignore the ubiquity of a browser.


I have also tried ein-mode, and sometimes use it. I haven't used it enough to have a super educated opinion but it didn't blow me away.

I don't think it supports cell folding (?), as an example of a missing feature.

Minor point, but it also can't/shouldn't support widgets, which we use at work. Any extension to Jupyter is going to be written in javascript, so there's an element of locking myself out of the ecosystem.

I didn't mean this as a criticism of emacs, my post said that the best thing I know is emacs, just that it's (probably) not the future of editors imo so I'm reluctant to throw 100 hours into it.


> I don't think it supports cell folding (?) as an example missing feature

If I understand what you mean correctly, that would be handled by built-in outline-minor-mode, or by a third-party Emacs module like yafolding or fold-this.el. Emacs packages tend to be made to compose well with other packages (it's a requirement given how everyone's Emacs is a special snowflake, unlike any other Emacs).

As for widgets/Jupyter extensions, then yes. Emacs can't really help you there AFAIK.


Hey there! I'm trying to solve this right now in VSCode's built-in editor: https://github.com/pavanagrawal123/VSNotebooks . It's a fork of another extension somebody already built, but all activity there is dead, so I'm starting up dev on an active fork. I'd love to hear any feedback y'all have! :)


I recently got a Jupyter Notebook, and found it's a large JSON document, with some sections in markdown and some in Python. A browser could omit the Python, and an interpreter could omit the markdown.

Would this work for other languages? Maybe JavaScript or Powershell instead of Python?


Can and does! Jupyter notebooks have a flexible architecture for running anything (called "kernels").

Notable examples...

R: https://github.com/IRkernel/IRkernel

node: https://github.com/notablemind/jupyter-nodejs

In my mind, one of the big advantages of Jupyter is its extensibility. I was able to quickly modify notebooks to run unit tests, so we could use them for projects at DataCamp: https://www.datacamp.com/projects.
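The building blocks for that are pleasantly small. One way to execute a notebook from Python and then poke at its outputs looks roughly like this (a sketch, not our exact setup; the notebook name is made up):

    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor

    nb = nbformat.read("lesson.ipynb", as_version=4)   # hypothetical notebook
    ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": "."}})

    # after execution every code cell carries its outputs, so a test
    # harness can assert on them
    for cell in nb.cells:
        if cell.cell_type == "code":
            print(cell.outputs)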


Yes! Jupyter is actually designed to take in multiple kernel types and supports a ton of different languages. It’s not just limited to Python.

https://github.com/jupyter/jupyter/wiki/Jupyter-kernels


Jupyter = Julia + Python + R


You may be interested in jupytext https://github.com/mwouts/jupytext


"Two additional tools have enhanced Jupyter’s usability. One is JupyterHub, a service that allows institutions to provide Jupyter notebooks to large pools of users. The IT team at the University of California, Berkeley, where Pérez is a faculty member, has deployed one such hub, which Pérez uses to ensure that all students on his data-science course have identical computing environments. “We cannot possibly manage IT support for 800 students, helping them debug why the installation on their laptop is not working; that’s simply infeasible,” he says." I think this result is a real winner, I recall the problems of setup in university student labs. Good win for reducing teaching friction.


Have people been finding that AWS SageMaker disconnects you and you have to restart after a couple hours?

I am curious people's thoughts on using Jupyter for long-running code. Having a totally self-contained experiment in one notebook, even if it long-running, is very useful for reproducibility. It works fine on my local laptop and a remote server, but not with SageMaker.


The problem with Jupyter isn't with what it does. It's the people who use it.

My experience as a data engineer/architect/application developer attached to data science teams for a while now is that most really good data scientists are very good at what they do, write somewhat competent code, and do not--in any way--care about writing good software or good application code.

Jupyter is a bane of my existence because people who use it want to use it for everything. Oh, it can have a web interface? Okay. The app is done. DEPLOY TO WEB USERS! NOW!!

It's a great tool. A lot of the people who use it are not software engineers, and they don't want to be. For a lot of people it's the straight line from point a to point b.

But in my experience, legit data scientists are pretty smart and are willing to learn a little if you're willing to give a little. This is a good exercise because they are typically skeptical about everything. So you have to be really secure about why you want certain things done certain ways, and why you definitely don't want things done other ways.

It's a good exercise for everyone involved if you have the right team dynamic and mutual, healthy respect for each other.

If you don't . . . well, then Jupyter notebooks completely suck.


I love notebooks as a way to present information, data, code and computations.

However, I cannot stand typing any text into a web browser window. Is there any way to edit a jupyter notebook with a text editor and then run it in the browser? The native json is not really human-editable.


I have felt the same way in the past. There are some ways to do this, but none is great. Unfortunately, getting a text editor to drive the notebook text areas is not that straightforward because of security features in modern browsers. Since Jupyter is actually a server (usually running locally), it's possible to communicate directly with it from a sufficiently advanced editor, but I haven't seen any good execution of that idea. There's also the ipymd (https://github.com/rossant/ipymd) format, which is just markdown and seems more or less what you want, but you lose the interactivity and display of images/plots/HTML/etc. Personally, I've found "jupyter-vim-binding" (https://github.com/lambdalisue/jupyter-vim-binding) to be a relatively acceptable emulation of vim keybindings for the code editing.


This is not what I mean. Mine is a problem of file formats, not of interactivity. I want to edit a text file alone, without needing any web browser on my computer. Then I push the notebook to git, and somewhere else it is opened by the browser.

This would be possible today if the notebook file were Python code with comments, for example, instead of hard-to-edit JSON.
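For example, the "percent" cell convention that several editors already understand would be a perfectly good on-disk format; a sketch of what such a file could look like:

    # %% [markdown]
    # # Exploration notes
    # Plain-text cells like this diff cleanly in git.

    # %%
    import numpy as np

    x = np.linspace(0, 1, 11)
    print(x.mean())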


This is quite a neat idea. A couple of people have mentioned Jupytext around here. I found a guide with animations which looks like it might do what you want. [0]

I personally look forward to trying this out, as it means I can use Jupyter in a way that doesn't require adapting my workflow to the tool so much.

[0] https://towardsdatascience.com/introducing-jupytext-9234fdff...


It's not everyone's cup of tea, but emacs can connect to notebooks: https://github.com/millejoh/emacs-ipython-notebook

If I am just poking around I use a Jupyter notebook. If I have to do a lot of prototyping, I use the emacs plugin, so the muscle-memory typing works.

I find the whole "notebooks are a revelation!" thing kind of amusing, given that we have had REPLs for a long time. emacs is just a big REPL if you know elisp. But, yeah, ein is great.


As a matter of fact, you can treat an .ipynb notebook as a data format (https://nbformat.readthedocs.io/en/latest/format_description...) and build systems that generate/consume nbformat data. Jupyter/JupyterLab are examples of nbformat generators/consumers. GitHub/Gist/nbviewer (http://nbviewer.jupyter.org/) are examples of nbformat consumers.


If you are comfortable writing Python programs, you can use the nbformat package (https://github.com/jupyter/nbformat) to generate Jupyter notebooks:

http://nbviewer.jupyter.org/gist/fperez/9716279
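In miniature, generating a notebook from Python looks something like this (sketch):

    import nbformat

    nb = nbformat.v4.new_notebook()
    nb.cells = [
        nbformat.v4.new_markdown_cell("# A generated notebook"),
        nbformat.v4.new_code_cell("print('hello from a generated cell')"),
    ]
    nbformat.write(nb, "generated.ipynb")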


Here is another quick example based on the gist above:

https://nbviewer.jupyter.org/gist/ontouchstart/58c62c8248540...


There is no "why" because it is not. An awkward JSON format plus the inability to survive network interruptions. A WiFi router goes down during a long computation, and voila -- the results are lost, which is unimaginable with RStudio.


Out of curiosity, what other options are there?



For Python, org-mode in emacs will do it. But that ties you down to emacs.


Org-mode was a total game changer for my work life. I use it for almost everything. Yes, it ties me to Emacs. I don't see this as a problem.


The dreaded emacs lock-in: before you know it, you're browsing the web and editing your photos in emacs.


I run https://nextjournal.com Use `xoxo` as a signup code if you want to try it, thanks!


Link is dead?


Jupyter could be viewed as a modern reminiscence of Lisp Machine UI, without the elegance of homoiconicity, of course.

Python is a good "glue" for the optimized C++ or Fortran libraries which form the core of things like tensorflow or numpy. Everything fits together nicely.


> Jupyter could be viewed as a modern reminiscence of Lisp Machine UI

maybe some part of it, but the Lisp Machine UI has a full window system, many different applications based on it with different UIs (font editor, file system browser, process overview, chat program, terminal, Zmacs editor, debugger, documentation browser, documentation editor, drawing program, ...)


That is what Jupyter Lab is becoming.


Great, but how is it like the Lisp Machine UI?


I hate Jupyter notebooks; I seriously do. Maybe that's an irrational sentiment, but I find them nauseating to me in the same vein that I find country music nauseating.



