tf.data pipeline from large NumPy arrays for a multiple-input, multiple-output Keras model and distributed training
This question concerns the optimal setup for a multiple-input, multiple-output Keras (TensorFlow) model given corresponding NumPy arrays.
For example, suppose input arrays x1 and x2 and output arrays y1 and y2. A tf.data.Dataset can be built as follows:
train_data = tf.data.Dataset.zip(
    (
        tf.data.Dataset.from_tensor_slices((x1, x2)),
        tf.data.Dataset.from_tensor_slices((y1, y2)),
    )
)
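For context, such a zipped dataset can be fed directly to a two-input, two-output functional model. The shapes, layer sizes, and layer names below are illustrative placeholders, not part of the original question:

import tensorflow as tf

# Hypothetical two-input, two-output model; shapes and sizes are placeholders
# chosen only to make the example runnable.
in1 = tf.keras.Input(shape=(32,), name="x1")
in2 = tf.keras.Input(shape=(16,), name="x2")
h = tf.keras.layers.Concatenate()([in1, in2])
h = tf.keras.layers.Dense(64, activation="relu")(h)
out1 = tf.keras.layers.Dense(1, name="y1")(h)
out2 = tf.keras.layers.Dense(1, name="y2")(h)
model = tf.keras.Model(inputs=[in1, in2], outputs=[out1, out2])
model.compile(optimizer="adam", loss=["mse", "mse"])

# train_data yields ((x1, x2), (y1, y2)) tuples, which Keras maps to the
# model's inputs and outputs positionally once the dataset is batched.
model.fit(train_data.batch(256), epochs=2)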
This approach works for small arrays and single-GPU training. Two constraints make it impossible or inadvisable for the full data and model:
(1) The NumPy arrays are large enough to exceed the 2 GB protobuf limit.
(2) tf.distribute.MirroredStrategy is used to distribute training over multiple GPUs.
What is the best way to pipeline the data?
Solution 1:[1]
The best approach as of mid-2022 seems to be to:
(1) Create a tf.data dataset from each array independently and then use tf.data.Dataset.zip to combine them into a single dataset. This should get around the 2 GB limit.
(2) Define and compile the Keras model under the distribution strategy's scope and then train it with model.fit; the distribution strategy handles most of the heavy lifting. A sketch of both steps follows this list.
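A minimal sketch of both steps, assuming x1, x2, y1, y2 are already in memory and that build_model() is a hypothetical factory returning an uncompiled two-input, two-output Keras model:

import tensorflow as tf

# (1) One dataset per array, zipped into ((x1, x2), (y1, y2)) elements.
#     Building them independently keeps each from_tensor_slices call small.
train_data = tf.data.Dataset.zip(
    (
        (tf.data.Dataset.from_tensor_slices(x1),
         tf.data.Dataset.from_tensor_slices(x2)),
        (tf.data.Dataset.from_tensor_slices(y1),
         tf.data.Dataset.from_tensor_slices(y2)),
    )
)

# (2) Create and compile the model inside the strategy scope;
#     model.fit then splits each batch across the available GPUs.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()  # hypothetical factory for the Keras model
    model.compile(optimizer="adam", loss=["mse", "mse"])

global_batch = 64 * strategy.num_replicas_in_sync  # scale batch with GPU count
model.fit(
    train_data.shuffle(10_000).batch(global_batch).prefetch(tf.data.AUTOTUNE),
    epochs=10,
)

Scaling the global batch size by num_replicas_in_sync keeps the per-GPU batch constant as GPUs are added; that is a common convention rather than something the answer prescribes.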
If an individual array is still too large to form a tf.data dataset via from_tensor_slices (https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices), this will not work. In that case I found it better to split the NumPy arrays into multiple smaller arrays and continue with the approach above (a sketch follows) than to write custom loading code, since many of the typical solutions (e.g. https://www.pythonfixing.com/2022/01/fixed-using-datasets-from-large-numpy.html) are not designed for distributed training.
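One way to do that splitting, sketched under the assumption that each array can be divided along the sample axis; the helper dataset_from_large_array and the chunk count are hypothetical, not from the original answer:

import numpy as np
import tensorflow as tf

def dataset_from_large_array(arr, n_chunks=8):
    """Build one dataset from an oversized array by concatenating
    datasets made from smaller chunks along the sample axis."""
    chunks = np.array_split(arr, n_chunks)
    ds = tf.data.Dataset.from_tensor_slices(chunks[0])
    for chunk in chunks[1:]:
        ds = ds.concatenate(tf.data.Dataset.from_tensor_slices(chunk))
    return ds

# Same ((x1, x2), (y1, y2)) element structure as before, but each constant
# embedded in the graph now stays under the 2 GB limit.
train_data = tf.data.Dataset.zip(
    (
        (dataset_from_large_array(x1), dataset_from_large_array(x2)),
        (dataset_from_large_array(y1), dataset_from_large_array(y2)),
    )
)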
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
