Seamless Deployment of Hugging Face Models on AWS SageMaker with Terraform: A Comprehensive Guide

akoshel

Posted on February 18, 2024

When integrating SageMaker with Hugging Face models, the default setup provided by the sagemaker-huggingface-inference-toolkit can be a good starting point. For an IaC setup, the terraform-aws-sagemaker-huggingface module is a handy resource: https://github.com/philschmid/terraform-aws-sagemaker-huggingface/blob/master/main.tf

However, in my experience, I ran into a few issues with the sagemaker-huggingface-inference-toolkit:

Deployment Flexibility: According to the docs, the toolkit only supports deployment through the Python SDK, which is quite restrictive. (In practice you can work around this, for example with the Terraform module mentioned above.)
Code and Model Packaging: Customizing the inference code requires packaging it together with the model weights in a single tar file, which felt clunky. I prefer having the code as part of the image itself.
Custom Environments: The sagemaker-huggingface-inference-toolkit doesn't allow for custom environment setups, like installing the latest Transformers directly from GitHub.

One specific issue was the lack of support for setting torch_dtype to half precision for the pipelines, which was crucial for my project but not straightforward to implement.

Given these limitations, I decided against rewriting everything on top of the default sagemaker-inference-toolkit and instead explored a solution that simply overrides the get_pipeline function in sagemaker-huggingface-inference-toolkit. Using the following example, you can customize it any way you like.

How to Deploy

Load model weights

The first step is to upload the model weights to an S3 bucket as a model.tar.gz file. Instructions on how to do this are here: https://huggingface.co/docs/sagemaker/inference
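If you don't already have the archive in S3, here is a minimal sketch of that step using tarfile and boto3; the local model directory, bucket name, and key are placeholders you would replace with your own.

import tarfile

import boto3

# Package the local model directory (config, tokenizer, weights) into model.tar.gz.
# SageMaker expects the model files at the root of the archive.
# "my-model-dir" and "my-bucket" are placeholders.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("my-model-dir", arcname=".")

# Upload the archive to S3 so SageMaker can pull it when the endpoint is created.
s3 = boto3.client("s3")
s3.upload_file("model.tar.gz", "my-bucket", "models/model.tar.gz")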

Make entrypoint

The deployment starts with setting up an entrypoint script. This script acts as the bridge between your model and SageMaker, telling SageMaker how to run your model. Here's a basic template I used:

from pathlib import Path

import torch
from transformers import Pipeline, pipeline
from sagemaker_huggingface_inference_toolkit import transformers_utils, serving


def _get_pipeline(task: str, device: int, model_dir: Path, **kwargs) -> Pipeline:
    # Custom pipeline factory: load the model from the unpacked model.tar.gz directory
    # with bfloat16 weights and automatic device placement. The signature must match
    # transformers_utils.get_pipeline, so task and device are accepted but unused here.
    return pipeline(model=model_dir, device_map="auto", model_kwargs={"torch_dtype": torch.bfloat16})


# Monkey-patch the toolkit's pipeline factory so the rest of the serving stack
# (MMS handlers, request/response handling) keeps working unchanged.
transformers_utils.get_pipeline = _get_pipeline


if __name__ == "__main__":
    serving.main()

Build image

Next, you'll need to build a Docker image that SageMaker can use to run your model. This involves starting from the basic transformers PyTorch image (https://github.com/huggingface/transformers/blob/main/docker/transformers-pytorch-gpu/Dockerfile), then installing sagemaker-huggingface-inference-toolkit with MMS (Multi Model Server) and OpenJDK, and configuring the entrypoint.

FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04
LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive

RUN apt update
RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg
RUN python3 -m pip install --no-cache-dir --upgrade pip

ARG REF=main
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF

# If set to nothing, will install the latest version
ARG PYTORCH='1.13.1'
ARG TORCH_VISION=''
ARG TORCH_AUDIO=''
# Example: `cu102`, `cu113`, etc. Make sure the CUDA tag matches a wheel that exists
# for the chosen torch version (cu121 wheels are only published for torch >= 2.1).
ARG CUDA='cu121'

RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' ||  VERSION='torch'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
# RUN [ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='$TORCH_VISION'.*' ||  VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
# RUN [ ${#TORCH_AUDIO} -gt 0 ] && VERSION='torchaudio=='$TORCH_AUDIO'.*' ||  VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA

RUN python3 -m pip install --no-cache-dir -e ./transformers

# When installing in editable mode, `transformers` is not recognized as a package.
# this line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop


# MMS (Multi Model Server) runs on the JVM, so it needs a JDK.
RUN apt-get install -y \
    openjdk-8-jdk-headless
RUN pip install "sagemaker-huggingface-inference-toolkit[mms]"

COPY ./entrypoint.py /usr/local/bin/entrypoint.py
RUN chmod +x /usr/local/bin/entrypoint.py

RUN mkdir -p /home/model-server/


# Define an entrypoint script for the docker image
ENTRYPOINT ["python3", "/usr/local/bin/entrypoint.py"]


Now, build the image and push it to your ECR repository.
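If you prefer to script this step, here is a rough sketch that fetches a temporary ECR login with boto3 and drives the Docker CLI from Python; the region, repository URI, and tag are placeholders you would replace with your own.

import base64
import subprocess

import boto3

# Placeholders: replace with your own region, account, repository and tag.
region = "eu-west-1"
repo_uri = "<YOUR_ACCOUNT>.dkr.ecr.<REGION>.amazonaws.com/<REPO>:<TAG>"

# Get a temporary ECR login token and decode it into username/password.
ecr = boto3.client("ecr", region_name=region)
token = ecr.get_authorization_token()["authorizationData"][0]
username, password = base64.b64decode(token["authorizationToken"]).decode().split(":")
registry = token["proxyEndpoint"]

# Log in, build, and push the image via the Docker CLI.
subprocess.run(["docker", "login", "--username", username, "--password-stdin", registry],
               input=password.encode(), check=True)
subprocess.run(["docker", "build", "-t", repo_uri, "."], check=True)
subprocess.run(["docker", "push", repo_uri], check=True)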

Deploy using terraform

Finally, you'll use Terraform to deploy everything to AWS. This includes setting up the execution role, the model, its endpoint configuration, and the endpoint itself. Here's a simplified version of what the Terraform setup might look like:

resource "aws_sagemaker_model" "customHuggingface" {
  name = "custom-huggingface"

  primary_container {
    image          = "<YOUR_ACCOUNT>.dkr.ecr.<REGION>.amazonaws.com/<REPO>:<TAG>"
    model_data_url = "s3://<BUKET>/<PATH>/model.tar.gz"
  }
}


data "aws_iam_policy_document" "assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}


resource "aws_iam_role" "yourRole" {
  name               = "yourRole"
  assume_role_policy = data.aws_iam_policy_document.assume_role.json
}

data "aws_iam_policy_document" "InferenceAcess" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::<yourBucket>/*"]
  }
  statement {
    actions = [
      "ecr:GetAuthorizationToken",
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:GetRepositoryPolicy",
      "ecr:SetRepositoryPolicy",
      "ecr:DescribeRepositories",
      "ecr:ListImages",
      "ecr:DescribeImages",
      "ecr:BatchGetImage",
      "ecr:GetLifecyclePolicy",
      "ecr:GetLifecyclePolicyPreview",
      "ecr:ListTagsForResource",
      "ecr:DescribeImageScanFindings",
      "ecr:InitiateLayerUpload",
    ]

    resources = ["<YOUR_ECR>"]
  }
  statement {
    resources = ["*"]
    actions = [
      "cloudwatch:PutMetricData",
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "logs:CreateLogGroup",
      "logs:DescribeLogStreams",
    ]
  }
}

resource "aws_iam_policy" "InferenceAcess" {
  name        = "InferenceAcess"
  policy      = data.aws_iam_policy_document.InferenceAcess.json
}

resource "aws_iam_role_policy_attachment" "InferenceAcess" {
  role       = aws_iam_role.yourRole.name
  policy_arn = aws_iam_policy.InferenceAcess.arn
}
resource "aws_sagemaker_endpoint_configuration" "customHuggingface" {
  name = "customHuggingface"

  production_variants {
    variant_name           = "variant-1"
    model_name             = aws_sagemaker_model.customHuggingface.name
    initial_instance_count = 1
    instance_type          = "ml.g4dn.xlarge"
  }

}

resource "aws_sagemaker_endpoint" "customHuggingface" {
  name                 = "customHuggingface"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.customHuggingface.name
}
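
Once these resources are defined, a standard terraform init and terraform apply creates the role, model, endpoint configuration, and endpoint; endpoint creation usually takes a few minutes while SageMaker pulls the image and the model artifact.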


Invoke your endpoint

After everything is deployed, you can test the endpoint with a simple request to make sure it's working as expected.

import json

import boto3

# Client for calling the deployed SageMaker endpoint.
runtime = boto3.client("sagemaker-runtime")

body = json.dumps({"inputs": "<YOUR_TEXT>"})
response = runtime.invoke_endpoint(EndpointName="customHuggingface", ContentType="application/json", Body=body)
print(json.loads(response["Body"].read()))

