CERN Open Data Portal: Explore more than 2 PB of open data from particle physics

hi41 · on Aug 21, 2019

What software is being used to return the results? I tried it and the response feels very quick. 12 PB is such a huge amount of data.

dingalapadum · on Aug 22, 2019

First of all, it's 2 PB, not 12. But more importantly, the search doesn't go through those 2 PB. The search goes through the different 'experiments' (or whatever you call them), and the dataset for one of those experiments may easily be hundreds of GB. Your hits are 'experiments', not the content of the large datasets obtained during those experiments. This reduces the size of the search index by several orders of magnitude compared to the datasets themselves. So if, for instance, each dataset were 10 GB on average, you'd 'only' be going through roughly 200,000 entries. So let's make a very conservative estimate of the lower and upper bounds, say something like 2,000 to 2,000,000 entries (although 2 million datasets/experiments would be A LOT, like 500 experiments each day since 2008). Anyway, that being said, the search feels snappy indeed, which is nice. I agree with the other comment that ES is a good guess.
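The back-of-the-envelope numbers can be checked with a few lines of Python. This is only a sketch of the estimate: it assumes decimal units (1 PB = 10^6 GB) and 365-day years from 2008 to 2019, and the function names are made up for illustration. With those assumptions, 2 PB at an average of 10 GB per dataset comes to about 200,000 entries, and an average of 100 GB would give 20,000.

```python
# Rough estimate of how many records a metadata search index would hold.
# Assumption: decimal units, 1 PB = 10**6 GB.
PB_IN_GB = 1_000_000


def estimated_entries(total_pb, avg_dataset_gb):
    """Number of datasets if total_pb of data is split into equal-sized datasets."""
    return int(total_pb * PB_IN_GB / avg_dataset_gb)


def records_per_day(entries, start_year=2008, end_year=2019):
    """Average number of new records per day, assuming 365-day years."""
    days = (end_year - start_year) * 365
    return entries / days


print(estimated_entries(2, 10))            # 2 PB at 10 GB per dataset
print(estimated_entries(2, 100))           # 2 PB at 100 GB per dataset
print(round(records_per_day(2_000_000)))   # upper bound spread over 11 years
```

Either way, the index holds at most a few hundred thousand small metadata records, which is a modest size for a search engine.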

Untit1ed · on Aug 21, 2019

Judging by the shape of the API responses, it looks like Elasticsearch is handling the metadata querying.

floxtor · on Aug 22, 2019

CERN Open Data is built using Invenio (https://inveniosoftware.org/). Invenio has a search module (https://invenio-search.readthedocs.io/en/latest/) that uses Elasticsearch.
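For anyone curious what a metadata query of this kind might look like, here is a minimal sketch of an Elasticsearch request body built in Python. It is not taken from the portal or from Invenio: the helper name and the field names `title` and `description` are hypothetical, and only the general query DSL shape (`multi_match`, `from`, `size`) is standard Elasticsearch.

```python
import json


def build_metadata_query(text, fields=("title", "description"), page=1, size=10):
    """Build an Elasticsearch query body that searches metadata fields only.

    The field names are hypothetical; a real index defines its own mapping.
    """
    return {
        "query": {"multi_match": {"query": text, "fields": list(fields)}},
        "from": (page - 1) * size,  # offset of the first hit on this page
        "size": size,               # number of hits per page
    }


body = build_metadata_query("Higgs", page=2, size=5)
print(json.dumps(body, indent=2))
```

The point is that a query like this runs against short metadata records, never against the multi-gigabyte files the records describe.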

samstave · on Aug 21, 2019

ELI5: what should anyone be looking for, how, and with what tools?

Also explain why they don't already have an API/tool/integration with *.edu. Looker, Tableau, Wolfram?

goldenbeet · on Aug 21, 2019

Looks like the site has some resources for helping to get started.

I've also got a separate resource from a Meetup talk I went to a while back. The speaker is an ML engineer who looked into some LHC datasets and posted a writeup of her talk here: https://lavanya.ai/2019/05/31/searching-for-dark-matter/

tiagval · on Aug 21, 2019

Tools are provided, along with some benchmark analyses as starting points. I know for sure that is the case with ATLAS open data.

marksbrown · on Aug 21, 2019

Geant4 should have been open source a decade ago.

rkwasny · on Aug 21, 2019

What do you mean? Geant4 was always open source; I did my bachelor thesis 15 years ago using it.

sdwa · on Aug 21, 2019

Now, FLUKA on the other hand... no idea what their deal is.

wbl · on Aug 22, 2019

The fact that it's very useful for studying ultra-small, highly metallic supernovas is probably part of the issue.

saboot · on Aug 21, 2019

I just downloaded and installed Geant4 from their website yesterday. It's also on GitHub.