Distribute data for hyperparameter optimization in PyTorch

Question

How do I set up hyperparameter optimization over the same dataset on multiple servers (or containers) without duplicating the data preprocessing?

The solution sketched below feels useful and like a common task, so I would rather not reimplement existing code. Which framework am I searching for, or, if there are several, what is the keyword I am missing?

Setting the scene

  • I built a package that uses neural networks for classification and want to run hyperparameter optimization.
  • I use PyTorch Lightning and PyTorch Geometric, combined with Hydra.
  • I have multiple on-premise servers with multiple GPUs each.
  • Data preprocessing takes a fair amount of time.
  • The preprocessed data fits onto a single GPU.
  • The hyperparameters cover the network architecture, but also variants of the data preprocessing (a minimal config sketch follows below).
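
For illustration, here is a minimal Hydra structured-config sketch of how I separate the two kinds of hyperparameters; all option names are placeholders from my setup, not part of any framework:

    from dataclasses import dataclass, field

    from hydra.core.config_store import ConfigStore


    @dataclass
    class PreprocConfig:
        # Preprocessing variants: changing these changes the stored dataset.
        node_feature_scaling: str = "standard"
        edge_threshold: float = 0.5


    @dataclass
    class ModelConfig:
        # Architecture hyperparameters: changing these only affects training.
        hidden_channels: int = 64
        num_layers: int = 3
        dropout: float = 0.1


    @dataclass
    class Config:
        preproc: PreprocConfig = field(default_factory=PreprocConfig)
        model: ModelConfig = field(default_factory=ModelConfig)


    cs = ConfigStore.instance()
    cs.store(name="config", node=Config)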

Possible solution?

Usually, a torch dataset (sketched in the snippet after this list):

  1. downloads the raw dataset,
  2. preprocesses it,
  3. stores the result on disk or loads it into memory to be fed into the neural network.
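
A minimal sketch of that pattern with a PyTorch Geometric InMemoryDataset, following the usual download/process/cache layout from the PyG docs (file names and features are placeholders):

    import torch
    from torch_geometric.data import Data, InMemoryDataset


    class MyGraphDataset(InMemoryDataset):
        """Usual pattern: download once, preprocess once, cache to disk."""

        def __init__(self, root, transform=None, pre_transform=None):
            super().__init__(root, transform, pre_transform)
            # Step 3: load the cached, preprocessed tensors from disk.
            self.data, self.slices = torch.load(self.processed_paths[0])

        @property
        def raw_file_names(self):
            return ["raw_data.csv"]  # placeholder

        @property
        def processed_file_names(self):
            return ["data.pt"]

        def download(self):
            # Step 1: fetch the raw dataset into self.raw_dir (omitted here).
            pass

        def process(self):
            # Step 2: the expensive preprocessing that I want to run only once.
            data_list = [
                Data(x=torch.randn(10, 3),
                     edge_index=torch.tensor([[0, 1], [1, 0]])),
            ]
            data, slices = self.collate(data_list)
            torch.save((data, slices), self.processed_paths[0])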

Now that feels inefficient when I want to optimize hyperparameters, because the preprocessing would have to be repeated at least once per server, if not once per container instance. Therefore I feel the following workflow would be optimal (roughly sketched after the list):

  1. Preprocess the raw dataset on one server with set i of the preproc-hparams.
  2. Distribute the preprocessed data to all other servers.
  3. Test all network hyperparameters.
  4. Go back to 1. with the next set of preproc-hparams.
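
To make that concrete, here is a rough sketch of the loop I have in mind; the host names, paths, grids, and helper functions are all placeholders, not an existing framework, and rsync over SSH is just one way to do step 2:

    import itertools
    import subprocess
    from pathlib import Path

    # Placeholder infrastructure.
    WORKERS = ["server-01", "server-02"]
    PROCESSED_DIR = Path("data/processed")

    PREPROC_GRID = [{"edge_threshold": t} for t in (0.3, 0.5)]
    MODEL_GRID = [{"hidden_channels": c} for c in (32, 64, 128)]


    def preprocess(preproc_hparams):
        # Step 1: run the expensive preprocessing once, locally (stub).
        ...


    def distribute():
        # Step 2: copy the processed files to every other server.
        for host in WORKERS:
            subprocess.run(
                ["rsync", "-az", f"{PROCESSED_DIR}/", f"{host}:{PROCESSED_DIR}/"],
                check=True,
            )


    def launch_trials(preproc_hparams):
        # Step 3: fan the network hyperparameters out over the servers (stub).
        for model_hparams, host in zip(MODEL_GRID, itertools.cycle(WORKERS)):
            ...


    # Step 4: repeat for every set of preproc-hparams.
    for preproc_hparams in PREPROC_GRID:
        preprocess(preproc_hparams)
        distribute()
        launch_trials(preproc_hparams)

This is exactly the kind of glue code I would rather not write and maintain myself, hence the question.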

