Root: CERN's scientific data analysis framework for C++ (cern.ch)
137 points by z3phyr on Aug 13, 2019 | 39 comments



ROOT is my go-to example for peak Object Oriented in the 90s. Look at the inheritance for something like a 2D histogram of doubles[1].

Trying to find the documentation for how to draw an arrow on a plot was always fun, because searching for "TArrow ROOT" would inevitably get results for "taro root".

The XRootD[2] project is pretty interesting, though, and I feel like the software industry is going to have to start dealing with similar data problems before too long.

[1] https://root.cern.ch/doc/master/classTH2D.html
[2] http://www.xrootd.org/


Object oriented is part of the name! If I remember correctly it's some combination of the original developers' names/initials with OO sandwiched between.


Ooh, boy, rant time. I used ROOT all through my PhD, and it is a royal mess.

For starters, there is a mass of global state. ROOT maintains a "gDirectory", the currently active directory, which refers to a location either on disk or in memory. Many objects in ROOT register themselves with the current gDirectory on creation and are destructed when the gDirectory closes. This includes objects that were declared on the stack, leading to destructors being called twice.
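For anyone who hasn't run into this, here is a toy Python model of the ownership problem, not ROOT code. `Directory`, `Hist`, and `current_directory` are hypothetical names invented for the sketch, and since Python has no stack destructors, the "scope exit" is simulated with an explicit `destroy()` call.

```python
# Toy model only: Directory, Hist and current_directory are hypothetical
# names for illustration, not ROOT's API.

class Directory:
    """Owns every object registered with it and destroys them on close()."""

    def __init__(self, name):
        self.name = name
        self.objects = []

    def register(self, obj):
        self.objects.append(obj)

    def close(self):
        for obj in self.objects:
            obj.destroy()
        self.objects.clear()


# Global mutable state, standing in for the "currently active directory".
current_directory = Directory("memory")


class Hist:
    def __init__(self, name):
        self.name = name
        self.destroy_calls = 0
        # Implicit side effect: the constructor hands ownership to whatever
        # directory happens to be current at this moment.
        current_directory.register(self)

    def destroy(self):
        self.destroy_calls += 1
        if self.destroy_calls > 1:
            raise RuntimeError(f"{self.name} destroyed twice")


def scope_with_local_histogram():
    h = Hist("h_local")
    current_directory.close()  # the directory destroys h ...
    try:
        h.destroy()  # ... and the simulated scope exit destroys it again
    except RuntimeError as err:
        return str(err)
    return "ok"
```

The point of the sketch is that nothing at the call site says the histogram has a second owner; the hand-off happens inside the constructor through global state.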

Since large data requires good performance, you might be wondering how this interacts with multithreading. Not well at all. There are many ROOT internals that assume a single-threaded environment. If you call TThread::Init(), most of those are avoided. However, ROOT has some memory tracking code that gets called on the constructor/destructor of all ROOT objects, and that memory tracking code is entirely thread unsafe. In any program that links against ROOT libraries, the memory tracking would be enabled based on a user's .rootrc configuration file. Tracking down why a program segfaults for some users but not others, and those segfaults happen at any point in the code that interacts with ROOT, was quite irritating.

The histogramming, which is the most used form of plotting in high-energy physics, has a broken class hierarchy. TH2, a 2-d histogram as a function of x and y, is implemented as a subclass of TH1, a 1-d histogram. As a result, all manner of functions that are appropriate only for a TH2 can also be called on a TH1. This includes getting a Z axis for something that inherently doesn't have a Z axis.
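A rough Python sketch of why that hierarchy hurts, next to one possible alternative. All class names here are made up for illustration and none of this is ROOT's actual code: the first pair mirrors "2-d inherits from 1-d", the second uses a single class parameterised by its axes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    label: str
    nbins: int
    low: float
    high: float


class InheritedHist1D:
    """Base class that carries accessors for dimensions it does not have."""

    def __init__(self, x_axis):
        self.x_axis = x_axis

    def get_z_axis(self):
        # A 1-d histogram has no z axis, so all it can return is a placeholder.
        return Axis("z (placeholder)", 1, 0.0, 1.0)


class InheritedHist2D(InheritedHist1D):
    def __init__(self, x_axis, y_axis):
        super().__init__(x_axis)
        self.y_axis = y_axis


class Hist:
    """Alternative: one class parameterised by its axes, no inheritance."""

    def __init__(self, *axes):
        self.axes = axes

    @property
    def ndim(self):
        return len(self.axes)

    def axis(self, index):
        if not 0 <= index < self.ndim:
            raise IndexError(f"{self.ndim}-d histogram has no axis {index}")
        return self.axes[index]
```

With the second design, asking a 1-d histogram for axis 2 fails loudly instead of handing back something meaningless.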

And it has a web browser, TGHtmlBrowser. I don't know why it has a web browser. It doesn't support https, css, or javascript, so I couldn't see how it performs on any modern website, but it was rather amusing to see it break on the Acid tests.

Many of the issues with the interpreter were fixed with ROOT6, which uses a Clang-based interpreter. However, every command executed increases the memory usage of the program. This becomes an issue with automatically updating histograms, because the main way that GUIs can be updated is by issuing a command that is then run through the interpreter.


I always name my TBrowser T_T to express my annoyance at having to use it. ':(' is sadly not a valid name :/


It's like the graphical file open window is designed to troll the user. It opens in icon view sorted by name with hidden files visible. Like who would ever want that? It's guaranteed to never be useful. Switching to list view and sorting by date requires 3 clicks, and date sorting only works in one direction. Trying to type to highlight a directory or file by name often doesn't work.


ROOT has some features that are unique and powerful.

It’s used in particle physics today mostly because it allows performant out-of-core, on-disk data processing.

With frameworks like Python pandas, you always end up having to manually partition your data if it doesn’t fit in memory. And of course, it’s C++, so by default the data analysis code is pretty performant. This makes a difference when you can iterate your analysis in one hour instead of 20.
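To make "doesn't need to fit in memory" concrete, here is a minimal stdlib-only Python sketch of streaming a histogram fill over a file one row at a time. It only illustrates the general idea; it is not how ROOT's I/O works, and the function names and the CSV layout are assumptions made up for the example.

```python
import csv


def fill_histogram(rows, column, nbins, low, high):
    """Histogram one column while holding only a single row in memory."""
    counts = [0] * nbins
    width = (high - low) / nbins
    for row in rows:
        value = float(row[column])
        if low <= value < high:
            # min() guards against rounding pushing a value into bin nbins.
            index = min(int((value - low) / width), nbins - 1)
            counts[index] += 1
    return counts


def histogram_csv(path, column, nbins, low, high):
    """Stream a CSV file from disk; memory use is independent of file size."""
    with open(path, newline="") as handle:
        return fill_histogram(csv.DictReader(handle), column, nbins, low, high)
```

Because the reader is consumed lazily, the only state that grows is the fixed-size list of bin counts, so no manual partitioning of the input is needed.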

That being said, when I last worked with it, ROOT was a scrambled mess with terrible interfaces and way too many fringe features, e.g. around plotting, that are better handled by Python nowadays. It even has a C++ command line!!!

I wrote a blog post back then about how I thought it could be fixed: https://www.konstantinschubert.com/2016/06/18/root8-what-roo...


Let's be honest: it's used today because it was used yesterday, and there is a lot of useful legacy code. Not many like plotting with root, or faffing about with memory allocation.

The reason it started getting use is that in the 1990s, when the current generation of experiments were starting up, C++ was hot and Fortran was not. PAW was old and in Fortran, and so the young ones wanted to work with the new hip ROOT instead[1].

[1]: https://www.quora.com/Why-does-CERN-use-ROOT/answer/Mario-Al...


Back when I was at HLT, I remember many talking about ROOT but we didn't use it much in TDAQ.


Oh man, that’s too hardcore for me. I bow to your superior inside knowledge and ask for enlightenment on the meaning of HLT & TDAQ.


It is a Google search away. :)


No it isn't.


Not to mention that the actual code of Pandas and half of the data crunching tools in Python are actually C/C++ tools with a Python interface anyways.


> With frameworks like Python pandas, you always end up having to manually partition your data if it doesn’t fit in memory.

"Pandas Docs > Pandas Ecosystem > Out of Core" lists a number of solutions for working with datasets that don't fit into RAM: Blaze, Dask, Dask-ML (dask-distributed; Scikit-Learn, XGBoost, TensorFlow), Koalas, Odo, Ray, Vaex https://pandas-docs.github.io/pandas-docs-travis/ecosystem.h...

The dask API is very similar to the pandas API.

Are there any plans for ROOT to gain support for Apache Parquet, and/or Apache Arrow zero-copy reads and SIMD support, and/or https://RAPIDS.ai (Arrow, numba, Dask, pandas, scikit-learn, XGboost, spark, CUDA-X GPU acceleration, HPC)? https://arrow.apache.org/


Around 2014 we used root to build predictive models for lending. It was introduced to the company by some physicists. It was good and powerful but man was it messy.

Later we briefly moved to R and finally settled into Python and friends.


https://root.cern.ch/root-has-its-jupyter-kernel (2015)

> Yet another milestone of the integration plan of ROOT with the Jupyter technology has been reached: ROOT now offers a Jupyter kernel! You can try it already now.

> ROOT is the 54th entry in this list and this is pretty cool. Now not only the PyROOT, the ROOT Python bindings, are integrated with notebooks but it's also possible to express your data mining in C++ within a notebook, taking advantage of all the powerful features of ROOT - plotting (now also interactive thanks to [Javascript ROOT](https://root.cern.ch/js/)), multivariate analysis, linear algebra, I/O and reflection: all available within a notebook.

Does this work with JupyterLab now? (edit) Here's the JupyterLab extension developer guide: https://jupyterlab.readthedocs.io/en/stable/developer/extens... (edit) here's the gh issue: https://github.com/root-project/jsroot/issues/166

...

ROOT is now installable with conda: `conda install -c conda-forge root metakernel jupyterlab # notebook`


For c/c++ in Jupyter, see xeus-cling https://github.com/QuantStack/xeus-cling


Coincidentally, cling (wrapped by xeus-cling) is also a product from CERN.


Many of the CERN researchers are pretty deep into C++.

It was there that I got my template meta-programming baptism, back in 2002, when gcc was still trying to cope with template heavy code.

And curiously, also where I got my first safety heavy code reviews of C++ best practices.


I don't believe those are coincidences, but more collaborations and the right people working together :-)


This brings back memories from my undergraduate days as a physics student, where we made extensive use of root in labs. It was a kind of badge of nerd honor to do analyses in root as opposed to Matlab.

It also came with a sort of C++ interpreter that gave you a repl. I remember that kind of blew my mind back then.


> It also came with a sort of C++ interpreter that gave you a repl. I remember that kind of blew my mind back then.

That would be Cling (or CINT): https://root.cern.ch/cling


They also have been building a javascript version! A few years ago (as part of the GSOC program) I worked on parts of their webgl renderer, and ended up adding a feature to threejs as a result! Sergey has done a great job with it.

Try it in your browser:

https://root.cern.ch/js/


I'm a PhD student in physics, working on CAST [1], a small experiment at CERN. That fortunately means I don't depend on other people's code too much.

With the C++ interpreter of ROOT you can run so-called ROOT macros. They are basically just C++ files containing your code at top level in a pair of `{}`. The "funny" thing is half the time you run a piece of code it'll work the first time. But then running it again will cause the interpreter to segfault, due to the mess of global state and the half-baked memory management of ROOT.

And don't get me started on the usability of the ROOT classes. A huge amount of stuff is string based, so throw away your type checking guarantees. I mean why bother...
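To illustrate the general complaint about string-based interfaces (this is a made-up Python toy, not a ROOT API; `draw_stringly`, `draw_typed`, and `DrawOption` are hypothetical names): a misspelled string option can slip through silently, while a misspelled enum member fails at the point of use and can be flagged by a static checker.

```python
from enum import Enum


def draw_stringly(option):
    """String-based option: in this toy, unknown strings fall back silently."""
    known = {"hist", "errors"}
    return option if option in known else "default"


class DrawOption(Enum):
    HIST = "hist"
    ERRORS = "errors"


def draw_typed(option):
    """Typed option: anything that is not a DrawOption is rejected."""
    if not isinstance(option, DrawOption):
        raise TypeError(f"expected DrawOption, got {type(option).__name__}")
    return option.value
```

Here `draw_stringly("hsit")` quietly returns the fallback, whereas `DrawOption.HSIT` raises `AttributeError` before the call is even made.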

Since I really didn't want to continue working with ROOT for my PhD (after having used it before), I decided to ditch the previous data analysis code and start from scratch [2]. I decided to use Nim for it, because it provides me with a powerful language (on par with python for my purposes - aside from admittedly certain libraries, which I had to wrap), yet still being as fast as the old ROOT code (in fact faster for my data analysis, but the comparison is problematic).

Yeah, it cost me a lot of time to get "back to where I started", the code itself might also not be a shining example of perfection, but I learned a huge amount about programming and I understand every piece of it. It was totally worth it. The gains in efficiency thanks to Nim and due to the fact I know the codebase means I make up a lot of the lost time anyways.

So to any people out there who may be in a similar position, don't be afraid to throw out what you don't like. :)

[1] https://home.cern/science/experiments/cast
[2] https://github.com/Vindaar/TimepixAnalysis


Ah, axion searches! Who said you can't get paid for staring at a wall?

How come you didn't go with Julia? (or Fortran for that matter?)


Haha, well you can get paid for it, but you won't be paid well.

It's hard to give an easy answer for that.

But let's step back a little: I played around with Julia several years ago. I don't know when exactly, probably in 2015 (?). Back then I thought of it as a faster python for science. Many things I would use python for back then I could have done in Julia too. Others I couldn't though, because the ecosystem was still too small. And while Julia was somewhat faster for the things I tried, it wasn't amazingly faster. My interest in the language dropped sometime after that. But still, I felt like it was a language targeted specifically to scientists. Many things I wanted from my analysis framework, however, were outside that bubble, I felt. I didn't simply want to write a faster "analysis script". I do realize, however, that the language has evolved a lot and I'm happy it's finding acceptance!

Fortran on the other hand, I never really considered. I suppose modern Fortran is a pretty good language though.

So why did I choose Nim then? It gave me:
- the ability to produce standalone binaries I could just put on any data acquisition pc without a hassle (well ok, be careful about old glibcs)
- it's fast, and I felt right at home syntax wise coming from python
- the community is amazing. The first time I entered Nim's IRC channel I noticed Araq, the creator of the language, answering random people's questions! In general the community allows for a super quick feedback loop to learn the language
- it's a pretty concise language. The whole manual can be read in less than a day
- having written some Clojure, I loved the idea of a powerful macro system
- after seeing mratsim's arraymancer library [1] I was happy to 1) have a numpy substitute and 2) thought if one person could write such a great library in O(1 year) it must be a pretty awesome language to work with :)
- being able to trivially wrap any C code around is super helpful
- a pretty strong type system! No more annoying implicit conversions from any type to a bool, error prone implicit int <-> float conversions etc.

I probably forgot many points, but well. The truth is of course, many languages could have worked for me, but the time I spent with Nim in the beginning was just super pleasant.

[1] https://github.com/mratsim/Arraymancer


I followed the opposite route: tried Nim and liked it a lot, then switched to Julia. Perhaps my typical usage of a scientific language is quite different from yours. In my case, what drove me away from Nim was the fact that even small features of the language kept changing in an uncontrolled way. It was 2014, and commits related to some obscure feature were subtly changing the behaviour of apparently unrelated stuff. Julia has changed a lot in recent years, but in a very controlled way, and always sticking to semantic versioning. In 2014, Nim 1.0 was said to be just around the corner, yet only a few weeks ago the first RC for version 1.0 was released (Nim 0.20), and it still broke some basic stuff like bitshift operators.

Moreover, the lack of scientific libraries was more severe in Nim than in Julia. Sure, writing C bindings in Nim is easy, but it is not a zero-effort job: you have to properly test them to check that types get converted correctly, and you have to write some documentation. Some guy is currently checking the quality of Nim libraries [1], and he gave very low scores to a few libraries of mine (rightly so, IMO, e.g., [2]) because they lack documentation.

[1] https://forum.nim-lang.org/t/5092
[2] https://github.com/ziotom78/nimcfitsio


Sorry for the missing line breaks in there, typed this on my phone. And for some reason my edits don't stick. :|


I remember when ROOT came out. I remember downloading it, going through it a bit, and wondering why anyone would leave PAW for it. To this day, I think PAW is probably a superior system and the ROOT project is best considered an expensive failure.

The one guy doing C++ (on a different experiment) skipped ROOT entirely and used Python to orchestrate C++ code.


I can't recommend PyROOT highly enough. All the benefits of Python while removing some of the worst usability problems in ROOT. Compiling with Cython meshes well, as it tends to provide the biggest performance gains for the types of tasks people frequently need to perform (like checking lots of conditionals).

Advice to people starting with ROOT who have lots of data to process: if at all possible, don't mess with multithreading. Make a single-threaded process that grabs only the data needed and makes its own output file. Then combine the outputs with the hadd tool. You can run the single-threaded program en masse with HTCondor or a similar scheme and it 'just works' while remaining scalable.
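The pattern is easy to sketch without ROOT at all. Below is a small Python stand-in, with hypothetical function names and in-memory lists playing the role of input files: each job histograms only its own input, and a separate merge step sums the partial results bin by bin, which is the role hadd plays for histograms stored in ROOT files.

```python
from collections import Counter


def process_file(values, bin_width=1.0):
    """One single-threaded job: histogram its own input, return its own output."""
    hist = Counter()
    for value in values:
        hist[int(value // bin_width)] += 1
    return hist


def merge(partials):
    """Sum partial histograms bin by bin (the hadd-like step)."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not replace
    return total
```

Since the jobs share no state, they can run in any order or on any number of machines, and merging the pieces should give the same histogram as a single pass over all the data.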


Are there any other academic fields so dependent on a single piece of software the way that (experimental) particle physics is dependent on ROOT?


ROOT is not really a single piece of software.

On one side, you have the analysis libraries, which inherited the knowledge contained in the older FORTRAN libraries. From this point of view, both ROOT and its predecessor PAW are exactly what you could expect from CERN: highest quality, thoroughly debugged and everything you may need as a particle physicist.

On the other side, you have the programming framework... and that's a different matter!

When I worked in the ATLAS data acquisition group, ROOT was frowned upon! Libraries were ok, but we needed a sane environment and we built it.

ROOT as a framework gave me the feeling it had been designed by inexperienced self-taught-from-trade-magazines developers who thought learning a bit of programming was too much hassle for the average physicist and thus they had to dumb ROOT down a lot.

But, as Mr. T reportedly said: it takes a smart guy to play dumb!


My field is control systems. Every academic I know, and every paper I’ve read which mentions a software stack, uses matlab/simulink. Simulink appears to me to have no good alternative (maybe jmodelica or something?) There are some python/Julia alternatives to matlab, but the existing control libraries are really pretty limited in comparison.

I’m not sure exactly how dependent particle physics is on ROOT, so direct comparison is difficult.


The Modelica systems are a good alternative, but don't really exist in high-level languages yet, other than some transpilers which are a little iffy. We are planning to change that with Julia, though, which, unlike Python or R, has enough of an ecosystem to easily build such an open source tool.


Basically all of modern convex optimization is heavily dependent on SNOPT. Deep learning research is heavily dependent on cuDNN.


Not sure I agree with you on SNOPT. There are many many tools out there that I see people use for convex (and non convex) optimization. Especially considering a single license costs $6000 I would be surprised if SNOPT even captured a plurality of solvers in use.

FWIW, in my department (Electrical Engineering), no one that I know of is using SNOPT for their work.


Also in EE, and I don’t know anybody using SNOPT for solving any kind of optimization (of which there are many).


Very nice to see this trend on HN. I have been using croot as a REPL for C/C++ for close to a decade.


I like the user interface. It was built on xclass (a FVWM95 like UI toolkit). They ported it to Windows by emulating the xlib portions using gdk 1.3.


Why would someone use this over, say, Spark? What are its unique capabilities?



