SageMaker - Distributed training

I can't find documentation on the behavior of SageMaker when distributed training is not explicitly specified.

Specifically,

  1. When SageMaker distributed data parallel (SMDataParallel) is enabled via the distribution parameter, the documentation states that each instance processes different batches of data:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    role=role,
    py_version="py37",
    framework_version="2.4.1",
    # For training with multinode distributed training, set this count. Example: 2
    instance_count=4,
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
  2. I am not sure what happens when the distribution parameter is not specified but instance_count > 1, as below:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    py_version="py3",
    entry_point="mnist.py",
    role=role,
    framework_version="1.12.0",
    instance_count=4,
    instance_type="ml.m4.xlarge",
)

Thanks!



Solution 1:[1]

In the training code, if you initialize smdistributed.dataparallel without the corresponding distribution configuration, you get a runtime error: RuntimeError: smdistributed.dataparallel cannot be used outside smddprun for distributed training launch.

The distribution parameter you pass to the estimator selects the appropriate runner.
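For illustration, here is a minimal sketch (assuming the TensorFlow binding of SMDataParallel; verify the module and function names against your library version) of the initialization call in the training script that triggers this error when the smdistributed runner is not selected:

# train.py - minimal sketch, assuming the TensorFlow binding of SMDataParallel
import smdistributed.dataparallel.tensorflow as sdp

# Succeeds only when the estimator was created with
# distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
# which makes SageMaker launch this script through the smddprun launcher.
# Launched any other way, it raises the RuntimeError quoted above.
sdp.init()

print(f"rank {sdp.rank()} of {sdp.size()}, local rank {sdp.local_rank()}")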

Solution 2:[2]

"I am not sure what happens when distribution parameter is not specified but instance_count > 1 as below" -> SageMaker will run your code on 4 machines. Unless you have code purpose-built for distributed computation this is useless (simple duplication).

It gets really interesting when:

  • you parse the resource configuration (resourceconfig.json, or the corresponding environment variables) so that each machine is aware of its rank in the cluster, and you can write arbitrary custom distributed logic (see the first sketch after this list)
  • you run the same code over input that is distributed as ShardedByS3Key, so that each machine processes a different part of your S3 data, spread homogeneously across the cluster. This makes SageMaker Training/Estimators a great place to run arbitrary shared-nothing distributed tasks such as file transformations and batch inference (see the second sketch below).
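As an illustration of the first bullet, a minimal sketch of rank discovery inside a SageMaker training container. SM_HOSTS and SM_CURRENT_HOST are set by the SageMaker framework containers, and /opt/ml/input/config/resourceconfig.json is the conventional location of the resource configuration file; verify both for your particular image.

import json
import os

# Environment variables set by SageMaker framework containers.
hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1"]'))
current_host = os.environ.get("SM_CURRENT_HOST", "algo-1")

rank = hosts.index(current_host)   # this machine's position in the cluster
world_size = len(hosts)            # total number of training instances
print(f"{current_host} is rank {rank} of {world_size}")

# The same information is available in the resource configuration file:
# /opt/ml/input/config/resourceconfig.json (keys: "current_host", "hosts")

For the second bullet, a sketch of sharding an input channel by S3 key when calling fit. TrainingInput is part of the SageMaker Python SDK; the bucket path below is a placeholder.

from sagemaker.inputs import TrainingInput

# Each instance receives a different subset of the objects under this prefix.
train_input = TrainingInput(
    s3_data="s3://my-bucket/my-prefix/",  # placeholder path
    distribution="ShardedByS3Key",
)
estimator.fit({"train": train_input})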

Having machines clustered together also allows you to launch open-source distributed training software such as PyTorch DDP.
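As a hedged sketch of that option: recent versions of the SageMaker Python SDK expose a pytorchddp distribution option on the PyTorch estimator (check your SDK and framework versions; the entry point, role, instance type, and versions below are placeholders).

from sagemaker.pytorch import PyTorch

# Placeholder entry point and versions; the "pytorchddp" option makes
# SageMaker launch the script with a DDP-aware launcher on every host.
estimator = PyTorch(
    entry_point="train_ddp.py",
    role=role,
    framework_version="1.12.0",
    py_version="py38",
    instance_count=4,
    instance_type="ml.g4dn.12xlarge",
    distribution={"pytorchddp": {"enabled": True}},
)
estimator.fit()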

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Anoop
Solution 2: Olivier Cruchant