Reading CMS Open Data over HTTPS, on demand, no download

2026-06-21 — a screencast: pull the first events out of a 2 GB CMS Open Data NanoAOD file straight from the CERN open-data server, in pure Rust, fetching ~1.3 MB and storing nothing.

A recurring friction in HEP analysis is getting the data to the code: staging multi-GB files before you can touch a single event. nano.rust's owned ROOT I/O reads remotely on demand: it issues HTTPS byte-range requests for only the file metadata and the baskets it actually needs. So "open this 2 GB file and give me 5 events" fetches about a megabyte, not gigabytes, with no local copy.
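
To make "byte-range request" concrete, here is a minimal std-only sketch of the two header values involved: the `Range` value a client sends and the `Content-Range` value a server returns with a 206 response. The `ByteRange` type and `parse_content_range` function are illustrative names invented for this post, not part of nano-io's API, and the sketch covers only the single-range form.

```rust
/// Inclusive byte span, matching the `bytes=first-last` form of the HTTP
/// `Range` header. Hypothetical helper type, not taken from nano-io.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteRange {
    pub start: u64,
    pub end_inclusive: u64,
}

impl ByteRange {
    /// Build a range from an offset and a length; a zero length has no
    /// representation in the `first-last` form, so it yields `None`.
    pub fn from_offset_len(offset: u64, len: u64) -> Option<Self> {
        if len == 0 {
            return None;
        }
        let end_inclusive = offset.checked_add(len - 1)?;
        Some(Self { start: offset, end_inclusive })
    }

    /// Number of bytes covered by the range.
    pub fn len(&self) -> u64 {
        self.end_inclusive - self.start + 1
    }

    /// Value to send in the request's `Range` header.
    pub fn header_value(&self) -> String {
        format!("bytes={}-{}", self.start, self.end_inclusive)
    }
}

/// Parse a `Content-Range` value such as `bytes 0-1023/2016828178`.
/// Returns the served range and the total size (`None` if the server sent `*`).
pub fn parse_content_range(value: &str) -> Option<(ByteRange, Option<u64>)> {
    let rest = value.trim().strip_prefix("bytes ")?;
    let (span, total) = rest.split_once('/')?;
    let (start, end) = span.split_once('-')?;
    let start = start.trim().parse::<u64>().ok()?;
    let end_inclusive = end.trim().parse::<u64>().ok()?;
    if end_inclusive < start {
        return None;
    }
    let total = match total.trim() {
        "*" => None,
        t => Some(t.parse::<u64>().ok()?),
    };
    Some((ByteRange { start, end_inclusive }, total))
}
```

The total size after the slash is how a reader can learn the full file length from its very first small request, without a separate round trip.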

This is the real thing — the file is a public CMS Open Data NanoAODv9 file on eospublic.cern.ch, and the read is the pure-Rust nano-io reader (no ROOT, no xrootd).

(No player? Raw cast: demo-opendata.cast.)

What you're seeing

$ read_url_json "https://eospublic.cern.ch//eos/opendata/cms/Run2016H/DoubleMuon/.../*.root" 5 --insecure
  run=281616 event=59740  nMuon=0 Muon_pt=[]
  run=281616 event=172857 nMuon=1 Muon_pt=[...]
  ...
  fetched 1,323,577 bytes of 2,016,828,178  =  0.066% of the file

Real Run2016H events (run 281616), read by streaming only the baskets touched: 1.3 MB of a 2 GB file, or 0.066%. No file was downloaded or stored. The --insecure flag is needed only for the EOS grid TLS certificate chain; it has nothing to do with the read itself.
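
The 0.066% figure follows directly from the two byte counts printed above. As a quick check, a small sketch (the function name `fetched_percent` is made up for illustration):

```rust
/// Share of the file that was actually transferred, in percent.
/// Hypothetical helper, not part of nano-io.
pub fn fetched_percent(bytes_fetched: u64, file_size: u64) -> f64 {
    bytes_fetched as f64 / file_size as f64 * 100.0
}
```

Calling `fetched_percent(1_323_577, 2_016_828_178)` gives roughly 0.0656, which rounds to the 0.066% shown in the output.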

How it works, briefly: nano-rootio opens the file via an HTTP Source that serves Range requests. The TKey/TTree/TBranch metadata is read first (a couple of small ranges), then each requested branch's baskets are fetched lazily as the event iterator advances. The _meta.bytes_fetched and file_size values in the output are measured, not estimated.
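
The shape of that design can be sketched in a few lines of std-only Rust. Everything here is an assumption made for illustration: the `Source` trait, `MemSource`, `Counting`, `BasketLoc` and `lazy_baskets` are hypothetical names, not nano-rootio's real types, and an in-memory buffer stands in for the remote file so the sketch needs no network. The point is only that baskets are fetched when the iterator advances, and that the byte counter is incremented by what was actually read.

```rust
use std::cell::Cell;
use std::io;

/// Hypothetical random-access source; a local file or an HTTP client
/// issuing Range requests could both sit behind this trait.
pub trait Source {
    fn size(&self) -> u64;
    fn read_range(&self, offset: u64, len: u64) -> io::Result<Vec<u8>>;
}

/// In-memory stand-in for a remote file.
pub struct MemSource {
    data: Vec<u8>,
}

impl MemSource {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data }
    }
}

impl Source for MemSource {
    fn size(&self) -> u64 {
        self.data.len() as u64
    }

    fn read_range(&self, offset: u64, len: u64) -> io::Result<Vec<u8>> {
        let start = offset as usize;
        let end = start
            .checked_add(len as usize)
            .filter(|&e| e <= self.data.len())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end of source"))?;
        Ok(self.data[start..end].to_vec())
    }
}

/// Wrapper that counts the bytes actually returned by the inner source,
/// so the reported total is measured rather than estimated.
pub struct Counting<S> {
    inner: S,
    fetched: Cell<u64>,
}

impl<S: Source> Counting<S> {
    pub fn new(inner: S) -> Self {
        Self { inner, fetched: Cell::new(0) }
    }

    pub fn bytes_fetched(&self) -> u64 {
        self.fetched.get()
    }
}

impl<S: Source> Source for Counting<S> {
    fn size(&self) -> u64 {
        self.inner.size()
    }

    fn read_range(&self, offset: u64, len: u64) -> io::Result<Vec<u8>> {
        let buf = self.inner.read_range(offset, len)?;
        self.fetched.set(self.fetched.get() + buf.len() as u64);
        Ok(buf)
    }
}

/// Where one basket lives in the file (as would be known from the metadata).
#[derive(Debug, Clone, Copy)]
pub struct BasketLoc {
    pub offset: u64,
    pub len: u64,
}

/// Iterator that fetches a basket only when `next` is called.
pub struct LazyBaskets<'a, S: Source> {
    source: &'a S,
    locs: std::slice::Iter<'a, BasketLoc>,
}

impl<'a, S: Source> Iterator for LazyBaskets<'a, S> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        let loc = self.locs.next()?;
        Some(self.source.read_range(loc.offset, loc.len))
    }
}

pub fn lazy_baskets<'a, S: Source>(source: &'a S, locs: &'a [BasketLoc]) -> LazyBaskets<'a, S> {
    LazyBaskets { source, locs: locs.iter() }
}
```

With a `Counting<MemSource>` and three basket locations, creating the iterator fetches nothing; each `next()` call adds exactly that basket's length to `bytes_fetched()`. Stopping after the first few events therefore leaves the rest of the file untouched.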

It's the same reader, validated

This isn't a special "remote mode" — it's the same nano-io reader with an HTTP source instead of a file source (events_url / events_url_chunked behind the http feature). And it is value-validated against uproot on this exact open-data file in CI on every push (scripts/bench_vs_uproot.py): our remote read and uproot's agree event-for-event. So the convenience (no staging) comes with no correctness asterisk.
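
For a sense of what "event-for-event" agreement means, here is a rough sketch of such a comparison. It is not the logic of scripts/bench_vs_uproot.py, which is a Python script; the `Event` struct and `first_mismatch` function are hypothetical, and comparing floats bit for bit is an assumption that fits the case where both readers decode the same stored bytes.

```rust
/// Minimal event record for illustration: only the fields shown in the
/// demo output. Hypothetical type, not nano-io's event representation.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub run: u32,
    pub event: u64,
    pub muon_pt: Vec<f32>,
}

/// Bitwise float comparison: two readers decoding the same bytes should
/// agree exactly, and this also treats identical NaN patterns as equal.
fn same_floats(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Index of the first event where the two reads disagree, or `None` if
/// they match in length and content.
pub fn first_mismatch(ours: &[Event], reference: &[Event]) -> Option<usize> {
    let common = ours.len().min(reference.len());
    for i in 0..common {
        let (a, b) = (&ours[i], &reference[i]);
        if a.run != b.run || a.event != b.event || !same_floats(&a.muon_pt, &b.muon_pt) {
            return Some(i);
        }
    }
    if ours.len() != reference.len() {
        Some(common)
    } else {
        None
    }
}
```

A validation run passes when `first_mismatch` returns `None`; otherwise the returned index points at the first event to inspect.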

Reproduce it

$ cargo run -p nano-io --example read_url_json --features http -- \
    "https://eospublic.cern.ch//eos/opendata/cms/Run2016H/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/2510000/127C2975-1B1C-A046-AABF-62B77E757A86.root" \
    5 --insecure

The same capability is what lets CI read open data with no checked-in data files, and it's the on-ramp to running the whole pipeline — selection, weights, skim — directly against remote datasets.