AWS SageMaker Estimator cannot access the internet

I'm trying to run a training job on a SageMaker TensorFlow estimator. Before starting the training job, I need to install some dependencies. As suggested in the SageMaker Python SDK documentation, I put a requirements.txt file in the code root directory.

The training job fails upon trying to install these dependencies with the following error:

sagemaker.exceptions.UnexpectedStatusException: Error for Training job tensorflow-training-2021-09-15-10-34-05-979: Failed. Reason: AlgorithmError: InstallRequirementsError:
Command "/usr/local/bin/python3.7 -m pip install -r requirements.txt"
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f41d0448550>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/efficientnet/

I've specified the subnet and security group in the estimator constructor:

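from os import environ

from sagemaker.tensorflow import TensorFlow

# job_dir, role and instance_type are defined elsewhere in the script
estimator = TensorFlow(
    entry_point="train.py",
    source_dir=job_dir,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    py_version="py37",
    framework_version="2.4",
    # Run the training job inside the VPC
    subnets=[environ["SUBNET_ID"]],
    security_group_ids=[environ["SECURITY_GROUP_ID"]],
)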

The security group allows all outbound IPv4 traffic, and the subnet is public and has an internet gateway.

Moreover, I've tested this networking configuration by launching an EC2 instance in the same subnet and security group, connecting via SSH, and successfully installing a pip package.

I can't understand why the SageMaker instance can't connect to pypi.org, and I can't find a way to debug this issue.



Solution 1:[1]

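It's possible that you don't have a NAT. The ENIs that SageMaker launches in your subnet only have a private IP, i.e. they need a NAT to communicate with the internet. An internet gateway on its own only serves interfaces that have a public IP, which would also explain why your EC2 test worked: an instance you can SSH into from outside has a public IP.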
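One way to check this is to look at where the subnet's default route (0.0.0.0/0) points. The following is only a rough sketch, assuming boto3 is installed and your credentials are allowed to call ec2:DescribeRouteTables and ec2:DescribeSubnets. The helper names default_route_target and route_tables_for_subnet are made up for this example; they are not part of boto3 or the SageMaker SDK.

```python
from typing import Any, Dict, List


def default_route_target(route_table: Dict[str, Any]) -> str:
    """Classify the 0.0.0.0/0 route of one route table.

    `route_table` is one entry of the "RouteTables" list returned by
    EC2 describe_route_tables. Returns "nat", "igw", "other" or "none".
    """
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("NatGatewayId"):
            return "nat"  # private-IP ENIs can reach the internet
        if str(route.get("GatewayId", "")).startswith("igw-"):
            return "igw"  # only works for ENIs that have a public IP
        return "other"  # e.g. NAT instance, transit gateway, peering
    return "none"  # no default route at all


def route_tables_for_subnet(subnet_id: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: fetch the route table(s) that apply to a subnet."""
    import boto3  # imported here so the classifier above works without AWS access

    ec2 = boto3.client("ec2")
    explicit = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )["RouteTables"]
    if explicit:
        return explicit
    # No explicit association: the subnet falls back to the VPC's main route table.
    vpc_id = ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]["VpcId"]
    return ec2.describe_route_tables(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "association.main", "Values": ["true"]},
        ]
    )["RouteTables"]
```

Calling default_route_target on each table returned by route_tables_for_subnet(environ["SUBNET_ID"]) should tell you which case you are in. If it reports "igw", the training job's private-IP ENIs most likely cannot reach pypi.org, and the job would need to run in a subnet whose default route goes to a NAT gateway.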

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Raghu Ramesha