Multiprocessing uproot approach

I just started looking into parallelizing the analysis of ROOT files, i.e. trees. I have worked with RDataFrame, where implicit multithreading can be enabled with one line of code (EnableImplicitMT()). This works quite well. Now I want to experiment with explicit multiprocessing and uproot to see whether efficiency can be boosted even further. I just need some guidance on a sensible approach.
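For reference, the one-liner I mean is the following; a minimal sketch, assuming a tree named "tree" in "data.root" with a branch "pt" (all hypothetical names):

```python
import ROOT

ROOT.EnableImplicitMT()                    # use all cores; pass an int to limit them

df = ROOT.RDataFrame("tree", "data.root")  # hypothetical tree/file names
h = df.Filter("pt > 20").Histo1D("pt")     # lazy; the event loop runs multithreaded
h.Draw()                                   # triggers the (parallel) event loop
```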

Say I have a really large dataset (cannot be read in at once) stored in a ROOT file with a couple of branches. Nothing too crazy has to be done for the analysis: some calculations, filtering, and then filling some histograms, maybe.

The ideas I have:

  1. Trivial parallelization: somehow splitting the ROOT file into many smaller files and running the same analysis in parallel on all of them, then recombining the respective results at the end.

  2. Maybe it is possible to read in the file and analyze it in batches, as described in the uproot docs, but distribute the batches and the operations on them to different cores? One could use the Python multiprocessing package (see the sketch after this list).

  3. Similar to 2., read in the file in batches, but rather than distributing whole batches to the cores, slice up the arrays of one batch and distribute the slices and the operations on them to the cores.
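A sketch of what I have in mind for idea 2, assuming a flat branch "pt" in a tree named "tree" inside "data.root" (all hypothetical names): each worker process opens the file itself, reads one entry range, and fills a partial histogram; the parent sums the partial counts.

```python
import multiprocessing as mp

import numpy as np
import uproot

FILE, TREE = "data.root", "tree"   # hypothetical file/tree names
BINS, RANGE = 100, (0.0, 200.0)    # assumed histogram binning


def process_range(bounds):
    # Each worker opens the file independently and reads only its entry range.
    start, stop = bounds
    with uproot.open(FILE) as f:
        arrays = f[TREE].arrays(["pt"], entry_start=start, entry_stop=stop,
                                library="np")
    pt = arrays["pt"]
    pt = pt[pt > 20.0]             # example filter
    counts, _ = np.histogram(pt, bins=BINS, range=RANGE)
    return counts


if __name__ == "__main__":
    with uproot.open(FILE) as f:
        n_entries = f[TREE].num_entries

    n_workers = 4
    edges = np.linspace(0, n_entries, n_workers + 1, dtype=int)
    chunks = list(zip(edges[:-1], edges[1:]))

    with mp.Pool(n_workers) as pool:
        total = sum(pool.map(process_range, chunks))  # elementwise sum of counts
```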

I would appreciate feedback on whether these approaches are worth trying, or whether there are better ways of handling large files efficiently.



Solution 1:[1]

A key thing to keep in mind about Uproot is that it isn't a framework for doing HEP analysis; it only reads ROOT files. The HEP analysis is the next step: code beyond any interaction with Uproot.

For the record, Uproot's file-reading can be parallelized, but that just means that multiple threads can be employed to wait for disk/network and to decompress data; the effect is the same in the end: you wait for all the threads to finish before you get the chunk of data, just maybe a little faster. That's not what you're asking about.
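For concreteness, that reading-side parallelism looks like this; a minimal sketch, assuming hypothetical file/tree/branch names, where the thread pool is used only for decompression, not for any analysis:

```python
import concurrent.futures

import uproot

# Threads here only overlap I/O waiting and decompression.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)

with uproot.open("data.root") as f:            # hypothetical file name
    arrays = f["tree"].arrays(
        ["pt"], decompression_executor=executor, library="np"
    )
```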

You want your analysis code to be running in parallel, and that's a generic question about parallel processing in Python, not about Uproot. You can break your work up into pieces (explicitly) and have each of those pieces independently use Uproot to read the data. Or you can use a Python parallel-processing library, such as Dask, to do it implicitly, or a HEP-specific library that pulls these parts together, such as Coffea.
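One way the Dask route can look, sketched with dask.delayed under the same hypothetical file/tree/branch names as above (dask.delayed is just one of several Dask interfaces, and recent Uproot versions also ship native Dask integration); note that each task still opens the file independently with Uproot, exactly as in the explicit approach:

```python
import dask
import numpy as np
import uproot


@dask.delayed
def partial_hist(start, stop):
    # Each task reads its own entry range and returns a partial histogram.
    with uproot.open("data.root") as f:
        pt = f["tree"].arrays(["pt"], entry_start=start, entry_stop=stop,
                              library="np")["pt"]
    counts, _ = np.histogram(pt[pt > 20.0], bins=100, range=(0.0, 200.0))
    return counts


with uproot.open("data.root") as f:
    n_entries = f["tree"].num_entries

edges = np.linspace(0, n_entries, 9, dtype=int)  # 8 tasks
tasks = [partial_hist(a, b) for a, b in zip(edges[:-1], edges[1:])]

# Dask builds the task graph lazily and runs it across processes on compute().
total = dask.delayed(sum)(tasks).compute(scheduler="processes")
```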

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Jim Pivarski