How to distribute files on a cluster and process each file entirely on one node?
I have a Ray cluster and want to distribute the input data evenly across its nodes. The input data consists of self-contained chunks (a chunk can consist of multiple files, which should be stored together) that each node can process independently. Ideally, the chunks would be spread evenly across the cluster so that each node has a subset of chunks stored locally for faster access.
Currently, I have a Ceph file system mounted on each node and store the entire data set there. Is there a good way to distribute the chunks evenly among the nodes so that each node can access its subset of chunks locally? I don't have to use Ceph for storage or distribution; if there is a better way, I'm open to recommendations.
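To make the goal concrete, here is a minimal sketch of the even split described above, assuming a simple round-robin assignment. The function name `assign_chunks`, the chunk names, and the node labels are hypothetical and only illustrate the idea; this is not a complete solution for copying data to local disks.

```python
from typing import Dict, List, Sequence


def assign_chunks(chunk_ids: Sequence[str], node_ids: Sequence[str]) -> Dict[str, List[str]]:
    """Assign chunks to nodes round-robin (hypothetical helper).

    Each chunk ID stands for a whole chunk, so all files of a chunk
    stay together. Per-node counts differ by at most one.
    """
    if not node_ids:
        raise ValueError("need at least one node")
    assignment: Dict[str, List[str]] = {node: [] for node in node_ids}
    # Sort so the plan is deterministic regardless of listing order.
    for i, chunk in enumerate(sorted(chunk_ids)):
        assignment[node_ids[i % len(node_ids)]].append(chunk)
    return assignment


# Illustrative names only; real chunk and node IDs would come from the cluster.
chunks = [f"chunk_{i:03d}" for i in range(10)]
nodes = ["node-a", "node-b", "node-c"]
plan = assign_chunks(chunks, nodes)
for node, owned in plan.items():
    print(node, len(owned), owned)
```

On a Ray cluster, the node IDs could presumably be taken from `ray.nodes()`, and the task for each chunk could be pinned to its owning node with `NodeAffinitySchedulingStrategy` from `ray.util.scheduling_strategies`. How the chunk files get onto each node's local disk is left open here.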
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
