Distribute data for hyperparameter optimization in PyTorch

Question

How do I set up hyperparameter optimization over the same dataset on multiple servers (or containers) without duplicating the data preprocessing?

The solution sketched below feels useful and like a common task, so I would rather not reimplement existing code. Which framework am I searching for, or, if there are several, what is the keyword I am missing?

Setting the scene

  • I built a package that uses neural networks for classification and want to run hyperparameter optimization.
  • I use PyTorch Lightning and PyTorch Geometric, combined with Hydra.
  • I have multiple on-premise servers with multiple GPUs each.
  • Data preprocessing takes a fair amount of time.
  • The preprocessed data fits onto a single GPU.
  • The hyperparameters cover the network architecture, but also variants of the data preprocessing (a minimal config sketch follows below).
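
For illustration, here is a minimal Hydra structured-config sketch of how I separate the two kinds of hyperparameters; all option names are placeholders from my setup, not part of any framework:

    from dataclasses import dataclass, field

    from hydra.core.config_store import ConfigStore


    @dataclass
    class PreprocConfig:
        # Preprocessing variants: changing these changes the stored dataset.
        node_feature_scaling: str = "standard"
        edge_threshold: float = 0.5


    @dataclass
    class ModelConfig:
        # Architecture hyperparameters: changing these only affects training.
        hidden_channels: int = 64
        num_layers: int = 3
        dropout: float = 0.1


    @dataclass
    class Config:
        preproc: PreprocConfig = field(default_factory=PreprocConfig)
        model: ModelConfig = field(default_factory=ModelConfig)


    cs = ConfigStore.instance()
    cs.store(name="config", node=Config)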

Possible solution?

Usually, a torch dataset (sketched in the snippet after this list):

  1. downloads the raw dataset,
  2. preprocesses it,
  3. stores the result on disk or loads it into memory to be fed into the neural network.
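
A minimal sketch of that pattern with a PyTorch Geometric InMemoryDataset, following the usual download/process/cache layout from the PyG docs (file names and features are placeholders):

    import torch
    from torch_geometric.data import Data, InMemoryDataset


    class MyGraphDataset(InMemoryDataset):
        """Usual pattern: download once, preprocess once, cache to disk."""

        def __init__(self, root, transform=None, pre_transform=None):
            super().__init__(root, transform, pre_transform)
            # Step 3: load the cached, preprocessed tensors from disk.
            self.data, self.slices = torch.load(self.processed_paths[0])

        @property
        def raw_file_names(self):
            return ["raw_data.csv"]  # placeholder

        @property
        def processed_file_names(self):
            return ["data.pt"]

        def download(self):
            # Step 1: fetch the raw dataset into self.raw_dir (omitted here).
            pass

        def process(self):
            # Step 2: the expensive preprocessing that I want to run only once.
            data_list = [
                Data(x=torch.randn(10, 3),
                     edge_index=torch.tensor([[0, 1], [1, 0]])),
            ]
            data, slices = self.collate(data_list)
            torch.save((data, slices), self.processed_paths[0])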

Now that feels inefficient when I want to optimize hyperparameters, because the preprocessing would have to be repeated at least once per server, if not once per container instance. Therefore I feel the following workflow would be optimal (roughly sketched after the list):

  1. Preprocess the raw dataset on one server with set i of the preproc-hparams.
  2. Distribute the preprocessed data to all other servers.
  3. Test all network hyperparameters.
  4. Go back to 1. with the next set of preproc-hparams.
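
To make that concrete, here is a rough sketch of the loop I have in mind; the host names, paths, grids, and helper functions are all placeholders, not an existing framework, and rsync over SSH is just one way to do step 2:

    import itertools
    import subprocess
    from pathlib import Path

    # Placeholder infrastructure.
    WORKERS = ["server-01", "server-02"]
    PROCESSED_DIR = Path("data/processed")

    PREPROC_GRID = [{"edge_threshold": t} for t in (0.3, 0.5)]
    MODEL_GRID = [{"hidden_channels": c} for c in (32, 64, 128)]


    def preprocess(preproc_hparams):
        # Step 1: run the expensive preprocessing once, locally (stub).
        ...


    def distribute():
        # Step 2: copy the processed files to every other server.
        for host in WORKERS:
            subprocess.run(
                ["rsync", "-az", f"{PROCESSED_DIR}/", f"{host}:{PROCESSED_DIR}/"],
                check=True,
            )


    def launch_trials(preproc_hparams):
        # Step 3: fan the network hyperparameters out over the servers (stub).
        for model_hparams, host in zip(MODEL_GRID, itertools.cycle(WORKERS)):
            ...


    # Step 4: repeat for every set of preproc-hparams.
    for preproc_hparams in PREPROC_GRID:
        preprocess(preproc_hparams)
        distribute()
        launch_trials(preproc_hparams)

This is exactly the kind of glue code I would rather not write and maintain myself, hence the question.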

