Trouble training YoloV5 on AWS SageMaker | AlgorithmError: , exit code: 1

I'm trying to train YoloV5 on AWS SageMaker with custom data (stored in S3) via a Docker image (ECR), and I keep getting "AlgorithmError: , exit code: 1". Can someone please tell me how to debug this problem?

Here's the Dockerfile:

# GET THE AWS IMAGE
FROM 763104351884.dkr.ecr.eu-west-3.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker

# UPDATES
RUN apt update

RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt install -y tzdata
RUN apt install -y python3-pip git zip curl htop screen libgl1-mesa-glx libglib2.0-0
RUN alias python=python3

# INSTALL REQUIREMENTS
COPY requirements.txt .

RUN python3 -m pip install --upgrade pip
RUN pip install --no-cache -r requirements.txt albumentations gsutil notebook \
    coremltools onnx onnx-simplifier onnxruntime openvino-dev tensorflow-cpu tensorflowjs
    



COPY code /opt/ml/code
WORKDIR /opt/ml/code


RUN git clone https://github.com/ultralytics/yolov5 /opt/ml/code/yolov5

ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM trainYolo.py


ENTRYPOINT ["python", "trainYolo.py"]

And here's trainYolo.py:


import json 
import os
import numpy as np
import cv2 as cv
import subprocess
import yaml
import shutil


trainSet = os.environ["SM_CHANNEL_TRAIN"]
valSet = os.environ["SM_CHANNEL_VAL"]

output_dir = os.environ["SM_CHANNEL_OUTPUT"]

#Creating the data.yaml for yolo
dict_file = [{'names' : ['block']},
{'nc' : ['1']}, {'train': [trainSet]}
             , {'val': [valSet]}]

with open(r'data.yaml', 'w') as file:
    documents = yaml.dump(dict_file, file)
    
    
#Execute this command to train Yolo
res = subprocess.run(["python3", "yolov5/train.py",  "--batch", "16" "--epochs", "100", "--data", "data.yaml", "--cfg", "yolov5/models/yolov5s.yaml","--weights", "yolov5s.pt"  "--cache"], shell=True)
                  

shutil.copy("yolov5", output_dir)

Note: I'm not sure whether subprocess.run() works in an environment such as SageMaker.

Thank you

Solution 1:^[1]

Your training script is not set up the way SageMaker expects. Whether you use a SageMaker estimator in Script Mode or your own container, the script has to follow SageMaker's conventions so that the model is saved properly; in particular, SageMaker packages the contents of the model directory (/opt/ml/model) as the model artifact. The Script-Mode link is an example notebook with TensorFlow and Script Mode. If you would like to build your own Dockerfile (Bring Your Own Container), configure your train file as shown in the BYOC link.
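As a rough illustration of what "configured properly" can mean for a container like the one in the question, the sketch below reads its locations from the SM_* variables when they are present and otherwise falls back to the standard /opt/ml paths (with a custom ENTRYPOINT the SM_* variables may not be set at all, which is an assumption worth checking in the logs). It copies the training results into the model directory and writes any traceback to /opt/ml/output/failure, so that the failure reason is no longer empty. The names resolve_paths, run_job and train_fn are hypothetical and not part of any SageMaker API.

```python
import os
import shutil
import traceback
from pathlib import Path


def resolve_paths(env=os.environ):
    """Return input/output locations, preferring SM_* variables if they are set."""
    return {
        "train": Path(env.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")),
        "val": Path(env.get("SM_CHANNEL_VAL", "/opt/ml/input/data/val")),
        "model": Path(env.get("SM_MODEL_DIR", "/opt/ml/model")),
    }


def run_job(train_fn, paths, failure_file="/opt/ml/output/failure"):
    """Run train_fn(paths), which must return a results directory.

    Returns a process exit code: 0 on success, 1 on failure.
    """
    try:
        results_dir = train_fn(paths)
        # SageMaker archives the model directory to S3 as the model artifact.
        # dirs_exist_ok requires Python 3.8+ (the base image above uses py38).
        shutil.copytree(results_dir, paths["model"], dirs_exist_ok=True)
        return 0
    except Exception:
        # The start of this file is reported as the job's failure reason.
        failure = Path(failure_file)
        failure.parent.mkdir(parents=True, exist_ok=True)
        failure.write_text(traceback.format_exc())
        traceback.print_exc()  # also ends up in the job's logs
        return 1
```

A script could then end with something like sys.exit(run_job(my_train_function, resolve_paths())), where my_train_function is whatever wraps the actual YOLOv5 call.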

Script-Mode: https://github.com/RamVegiraju/SageMaker-Deployment/tree/master/RealTime/Script-Mode/TensorFlow/Classification

BYOC: https://github.com/RamVegiraju/SageMaker-Deployment/tree/master/RealTime/BYOC/Sklearn/Sklearn-Regressor/container/randomForest
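
Independently of the container layout, a few details in the posted trainYolo.py would keep the training command from running as intended. The sketch below shows one possible way to address them, not a verified fix for this particular job. It writes data.yaml as a single mapping with an integer nc, and it passes the command as a list with a comma between every argument and without shell=True (on Linux, a list combined with shell=True runs only its first element as the command). The helper names are hypothetical; the file names and flags are the ones from the question, except that --batch is spelled out as --batch-size.

```python
import subprocess
import sys


def make_data_config(train_dir, val_dir):
    """Describe the dataset as one mapping, not a list of one-key mappings."""
    return {
        "train": str(train_dir),
        "val": str(val_dir),
        "nc": 1,             # number of classes, as an integer
        "names": ["block"],  # one name per class
    }


def write_yaml(config, path="data.yaml"):
    """Dump the mapping to a YAML file."""
    import yaml  # PyYAML; imported here so the pure helpers work without it

    with open(path, "w") as fh:
        yaml.safe_dump(config, fh)


def build_train_command(data_yaml="data.yaml", epochs=100, batch_size=16):
    """Build the argument list for yolov5/train.py, one string per argument."""
    return [
        sys.executable, "yolov5/train.py",
        "--batch-size", str(batch_size),
        "--epochs", str(epochs),
        "--data", data_yaml,
        "--cfg", "yolov5/models/yolov5s.yaml",
        "--weights", "yolov5s.pt",
        "--cache",
    ]


def run_training(command):
    """Run the command without a shell; raises CalledProcessError on failure."""
    subprocess.run(command, check=True)
```

With these helpers the script body reduces to write_yaml(make_data_config(trainSet, valSet)) followed by run_training(build_train_command()). The final copy also needs attention: shutil.copy() only handles single files, so copying a results directory calls for shutil.copytree().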