AWS SageMaker TensorFlow Serving - Endpoint failure - CloudWatch log ref: "NET_LOG: Entering the event loop ..."

It's my first time using SageMaker to serve my own custom TensorFlow model, so I have been using the following Medium articles to get me started:

How to Create a TensorFlow Serving Container for AWS SageMaker
How to Push a Docker Image to AWS ECS Repository
How to Deploy an AWS SageMaker Container Using TensorFlow Serving
How to Make Predictions Against a SageMaker Endpoint Using TensorFlow Serving

I managed to create my serving container, push it successfully to ECR, and create the SageMaker model from my Docker image. However, when I tried to create the endpoint, it began creating but after 3-5 minutes failed with the message:

"The primary container for production variant Default did not pass the ping health check. Please check CloudWatch logs for this endpoint."


I then checked my CloudWatch logs, which ended with:

"NET_LOG: Entering the event loop ..."

I tried searching for more information about this log message in relation to deploying SageMaker models with TF Serving, but could not find any helpful solutions.

To give more context, before running into this problem I encountered 2 other issues:

  1. "FileSystemStoragePathSource encountered a file-system access error: Could not find base path

    ‹MODEL_PATH›/‹MODEL_NAME›/ for ‹MODEL_NAME›"

  2. "No versions of servable found under base path"

I managed to solve both of those using the following links:

[Documentation] TensorFlowModel endpoints need the export/Servo folder structure, but this is not documented

Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check.

It's also worth noting that my TensorFlow model was created with TF version 2.0 (which is why I needed the custom Docker container). I used the AWS CLI exclusively for the TensorFlow Serving deployment instead of the SageMaker SDK.

Here are snippets of my configuration files and scripts:

nginx.conf

events {
    # determines how many requests can simultaneously be served
    # https://www.digitalocean.com/community/tutorials/how-to-optimize-nginx-configuration
    # for more information
    worker_connections 2048;
}

http {
  server {
    # configures the server to listen to the port 8080
    # Amazon SageMaker sends inference requests to port 8080.
    # For more information: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-container-response
    listen 8080 deferred;

    # redirects requests from SageMaker to TF Serving
    location /invocations {
      proxy_pass http://localhost:8501/v1/models/pornilarity_model:predict;
    }

    # Used by SageMaker to confirm if server is alive.
    # https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-algo-ping-requests
    location /ping {
      return 200 "OK";
    }
  }
}
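
For reference, with this config SageMaker's health check hits /ping and inference requests go to /invocations, which nginx forwards to the TensorFlow Serving REST API. A rough local check of both routes (a sketch only, assuming the container is already running with port 8080 published; the "instances" payload below is a placeholder, not my model's real input shape):

# Liveness check: should return HTTP 200 with body "OK"
curl -i http://localhost:8080/ping

# Inference: proxied to TF Serving's REST predict endpoint; the payload
# shape is a placeholder and must match the model's actual signature
curl -X POST http://localhost:8080/invocations \
     -H "Content-Type: application/json" \
     -d '{"instances": [[0.1, 0.2, 0.3]]}'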

Dockerfile


# RUN pip install sagemaker-containers

# Installing NGINX, used to reverse proxy the predictions from SageMaker to TF Serving
RUN apt-get update && apt-get install -y --no-install-recommends nginx git

# Copy our model folder to the container 
# NB: Tensorflow serving requires you manually assign version numbering to models e.g. model_path/1/
# see below links: 

# https://stackoverflow.com/questions/45544928/tensorflow-serving-no-versions-of-servable-model-found-under-base-path
# https://github.com/aws/sagemaker-python-sdk/issues/599
COPY pornilarity_model /opt/ml/model/export/Servo/1/

# Copy NGINX configuration to the container
COPY nginx.conf /opt/ml/code/nginx.conf

# Copies the hosting code inside the container
# COPY serve.py /opt/ml/code/serve.py

# Defines serve.py as script entrypoint
# ENV SAGEMAKER_PROGRAM serve.py

# starts NGINX and TF serving pointing to our model
ENTRYPOINT service nginx start | tensorflow_model_server --rest_api_port=8501 \
 --model_name=pornilarity_model \
 --model_base_path=/opt/ml/model/export/Servo/
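
Running the image locally is one way to reproduce the same startup logs (including the NET_LOG line) without involving SageMaker. A rough sketch, using the image name from the build script below:

# Build and run the container locally, exposing nginx's port 8080
docker build -t sagemaker-tf-serving .
docker run --rm -p 8080:8080 sagemaker-tf-serving

From a second terminal, the /ping and /invocations checks shown above should then answer, provided nginx is actually serving the copied config.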

Build and push

%%sh

# The name of our algorithm
ecr_repo=sagemaker-tf-serving
docker_image=sagemaker-tf-serving

cd container

# chmod a+x container/serve.py

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to eu-west-2 if none defined)
region=$(aws configure get region)
region=${region:-eu-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${ecr_repo}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${ecr_repo}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${ecr_repo}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)
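# NB: `aws ecr get-login` exists only in AWS CLI v1; on AWS CLI v2 the
# equivalent login step (left commented out here) would be:
# aws ecr get-login-password --region ${region} | \
#     docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com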

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${docker_image} .
# docker tag ${docker_image} ${fullname}
docker tag ${docker_image}:latest ${fullname}

docker push ${fullname}

Create SageMaker Model

#!/usr/bin/env bash

CONTAINER_NAME="Pornilarity-Container"
MODEL_NAME=pornilarity-model-v1

# the name of the role created with
# https://gist.github.com/mvsusp/599311cb9f4ee1091065f8206c026962
ROLE_NAME=AmazonSageMaker-ExecutionRole-20191202T133391

# the name of the image created with
# https://gist.github.com/mvsusp/07610f9cfecbec13fb2b7c77a2e843c4
ECS_IMAGE_NAME=sagemaker-tf-serving
# the ARN of the role
EXECUTION_ROLE_ARN=$(aws iam get-role --role-name ${ROLE_NAME} | jq -r .Role.Arn)

# the ECS image URI
ECS_IMAGE_URI=$(aws ecr describe-repositories --repository-names ${ECS_IMAGE_NAME} |\
jq -r .repositories[0].repositoryUri)

# defines the SageMaker model primary container image as the ECS image
PRIMARY_CONTAINER="ContainerHostname=${CONTAINER_NAME},Image=${ECS_IMAGE_URI}"

# Creating the model
aws sagemaker create-model --model-name ${MODEL_NAME} \
--primary-container=${PRIMARY_CONTAINER}  --execution-role-arn ${EXECUTION_ROLE_ARN}
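
Not part of the original gist, but a quick sanity check I can run afterwards to confirm the model was registered and points at the right image:

# Optional: confirm the model exists and inspect its container image
aws sagemaker describe-model --model-name ${MODEL_NAME} \
    --query '{Image: PrimaryContainer.Image, Arn: ModelArn}'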

Endpoint config

#!/usr/bin/env bash

MODEL_NAME=pornilarity-model-v1

ENDPOINT_CONFIG_NAME=pornilarity-model-v1-config

ENDPOINT_NAME=pornilarity-v1-endpoint

PRODUCTION_VARIANTS="VariantName=Default,ModelName=${MODEL_NAME},"\
"InitialInstanceCount=1,InstanceType=ml.c5.large"

aws sagemaker create-endpoint-config --endpoint-config-name ${ENDPOINT_CONFIG_NAME} \
--production-variants ${PRODUCTION_VARIANTS}

aws sagemaker create-endpoint --endpoint-name ${ENDPOINT_NAME} \
--endpoint-config-name ${ENDPOINT_CONFIG_NAME}
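
Not in the original script either, but one way to watch the rollout and surface the failure reason from the CLI (the waiter exits non-zero if the endpoint ends up in the Failed state):

# Optional: block until the endpoint is InService (or fails), then show status/reason
aws sagemaker wait endpoint-in-service --endpoint-name ${ENDPOINT_NAME}
aws sagemaker describe-endpoint --endpoint-name ${ENDPOINT_NAME} \
    --query '{Status: EndpointStatus, Reason: FailureReason}'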

Docker Container Folder Structure

├── container
│   ├── Dockerfile
│   ├── nginx.conf
│   ├── pornilarity_model
│   │   ├── assets
│   │   ├── saved_model.pb
│   │   └── variables
│   │       ├── variables.data-00000-of-00002
│   │       ├── variables.data-00001-of-00002
│   │       └── variables.index

Any guidance would be much appreciated!!



Solution 1:[1]

Your web server must expose a liveness endpoint at

<public_address>/ping

that returns 200 when everything is running. Right now that response is missing, so SageMaker does not see the container as valid for inference. It is that simple :)

There are prebuilt containers whose web server already does this for you; in your case the TensorFlow Serving container is publicly available at:

763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.4.1-cpu-py37-ubuntu18.04

This example is for us-east-1 and for CPU inference. You can add the model under /opt/ml/model/ as you did in your example Dockerfile.

For all the available containers, see the Deep Learning Containers list at AWS.
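
If you go that route, a minimal sketch of a Dockerfile based on the prebuilt image might look like the following (a sketch only: the tag and region must match your account, pulling the image requires an ECR login against the 763104351884 registry, and the exact model path the prebuilt server expects should be verified against the Deep Learning Containers documentation):

# Sketch only: prebuilt TF Serving inference image (region/tag must match yours)
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.4.1-cpu-py37-ubuntu18.04

# Bake the model into the image, as in the question's original Dockerfile;
# double-check the model layout the prebuilt container expects
COPY pornilarity_model /opt/ml/model/export/Servo/1/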

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: zhrist