How to distribute files on a cluster and process each file entirely on a node?

I have a Ray cluster and want to distribute the input data evenly across its nodes. The input data consists of self-contained chunks that each node can process independently (a chunk can consist of multiple files, which should be stored together). Ideally, the chunks would be spread evenly over the cluster so that each node has a subset of chunks stored locally for faster access.

Currently, I have a Ceph file system mounted on each node and store the entire data set there. Is there a good way to distribute the chunks evenly across the nodes so that each node can access its subset of chunks locally? I don't have to use Ceph for storage or distribution; if there is a better way, I'm open to recommendations.
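To make the layout I'm after concrete, here is a minimal sketch of the kind of assignment I mean. It assumes each chunk is identified by a name (for example, a directory holding that chunk's files) and each node by a hostname. The function `assign_chunks` and the chunk and node names are hypothetical and only for illustration; this is not an existing Ray or Ceph feature.

```python
from typing import Dict, List


def assign_chunks(chunk_ids: List[str], node_ids: List[str]) -> Dict[str, List[str]]:
    """Round-robin chunks over nodes so per-node counts differ by at most one."""
    if not node_ids:
        raise ValueError("need at least one node")
    assignment: Dict[str, List[str]] = {node: [] for node in node_ids}
    # Sort the chunk names so every node can compute the same mapping on its own.
    for i, chunk in enumerate(sorted(chunk_ids)):
        assignment[node_ids[i % len(node_ids)]].append(chunk)
    return assignment


# Hypothetical chunk and node names, for illustration only.
chunks = [f"chunk_{i:03d}" for i in range(7)]
nodes = ["node-a", "node-b", "node-c"]
plan = assign_chunks(chunks, nodes)
for node, owned in plan.items():
    print(node, owned)
```

With a mapping like this, each node would copy its own chunks from the shared file system to local disk, and the processing task for a chunk would need to run on the node that owns it. As far as I can tell, Ray's `NodeAffinitySchedulingStrategy` could be used to pin tasks to nodes this way, but I have not verified that it is the recommended approach.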



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
