Deploy Mistral LLM on Google Compute Engine with Docker, GPU Support, and Hugging Face Inference Server

Mahmoud Sehsah

Posted on January 21, 2024

Introduction

This is a practical guide to setting up Large Language Models (LLMs) on Google Compute Engine using GPUs. It walks you through the process step by step, making it easy to take advantage of the powerful combination of Google's cloud infrastructure and NVIDIA's GPU technology.

Machine Specs for the Tutorial

Hardware Specifications:

GPU Information:

  • GPU Type: Nvidia T4
  • Number of GPUs: 2
  • GPU Memory: 16 GB GDDR6 (per GPU)

Google Compute Engine Machine Type:

  • Type: n1-highmem-4
  • vCPUs: 4
  • Cores: 2
  • Memory: 26 GB

Disk Information:

  • Disk Type: Balanced Persistent Disk
  • Disk Size: 150 GB

Software Specifications:

Operating System:

  • Ubuntu Version: 20.04 LTS

CUDA version:

  • CUDA version: 12.3

1. Setting Up Docker

Follow these simple steps to get Docker up and running on your system:

1.1 Adding Docker's Official GPG Key

Add Docker's official GPG key to your system. This step is crucial for validating the authenticity of the Docker packages you'll be installing:

sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

1.2 Adding Docker Repository to Apt Sources

Add Docker's repository to your system's Apt sources. This allows you to fetch Docker packages from their official repository:

echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
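
Optionally, you can confirm that Apt now resolves the docker-ce package from Docker's repository rather than Ubuntu's default packages (the exact version string you see will vary):

apt-cache policy docker-ce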

1.3 Installing Docker

Install the Docker engine and its companion packages:

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

1.4 Create Docker Group

If not already present, add the 'docker' group to your system:

sudo groupadd docker

1.5 Add Default User to Docker Group

Add your default user to the 'docker' group to manage Docker as a non-root user:

sudo usermod -aG docker $USER
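
Note that group membership is only re-evaluated at login, so the change will not affect your current shell session. Log out and back in, or activate the new group immediately with:

newgrp docker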

1.6 Check Docker Status

Confirm that the Docker daemon is installed and running:

systemctl status docker
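
As an extra sanity check, you can run Docker's standard test image, which prints a confirmation message if the installation works end to end:

docker run hello-world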

2. Install NVIDIA Container Toolkit

2.1 Add NVIDIA GPG Key and NVIDIA Container Toolkit Repository

Start by adding the NVIDIA GPG key to ensure the authenticity of the software packages and add the NVIDIA Container Toolkit repository to your system's software sources:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

2.2 Enable Experimental Features (Optional)

If you wish to use experimental features, uncomment the respective lines in the sources list:

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

2.3 Update Package Index and Install NVIDIA Container Toolkit

Update your package index and install the NVIDIA Container Toolkit:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
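
As a quick check that the toolkit landed correctly, you can print the version of its CLI (the exact number depends on when you install):

nvidia-ctk --version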

3. Configure Container Toolkit

3.1 Configure NVIDIA Container Toolkit

Configure the NVIDIA Container Toolkit to work with Docker:

sudo nvidia-ctk runtime configure --runtime=docker
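
Under the hood, this command registers the NVIDIA runtime in Docker's daemon configuration at /etc/docker/daemon.json. If you inspect that file, it should now contain an entry along these lines (exact contents may vary slightly between toolkit versions):

cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}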

3.2 Restart Docker Service

Apply the changes by restarting the Docker service:

sudo systemctl restart docker

4. Prerequisites Before Installing CUDA Drivers

Ensure your system meets the following prerequisites before proceeding with the CUDA driver installation. For detailed guidance, refer to the official NVIDIA CUDA installation guide (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions).

4.1 Verify CUDA-Capable GPU

First, confirm that your system has an NVIDIA GPU installed. The following command should return information about the NVIDIA graphics card if one is present:

lspci | grep -i nvidia

4.2 Confirm Supported Linux Version

Ensure your Linux distribution is supported by checking its version. The following command displays your system's architecture and details about your Linux distribution:

uname -m && cat /etc/*release

4.3 Check Kernel Headers and Development Packages

Verify that your system has the appropriate kernel headers and development packages, which are essential for building the NVIDIA kernel module. The following command prints the running kernel version, which the headers must match:

uname -r
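
If you want to check up front whether the headers for your running kernel are already present (they will be installed in the next section either way), you can query the package directly:

dpkg -s linux-headers-$(uname -r)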

5. Installing NVIDIA Drivers

Follow these steps to install NVIDIA drivers on your system. For detailed instructions, you can refer to the NVIDIA Tesla Installation Notes (https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html).

5.1 Install Required Kernel Headers

Start by installing the Linux kernel headers corresponding to your current kernel version:

sudo apt-get install linux-headers-$(uname -r)

5.2 Add the NVIDIA CUDA Repository

Identify your distribution's version and add the NVIDIA CUDA repository to your system:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb

5.3 Update and Install NVIDIA Drivers

Finally, update your package lists and install the CUDA drivers. A reboot is required after the installation completes:

sudo apt-get update
sudo apt-get -y install cuda-drivers
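
The driver is built against the running kernel, so reboot the machine before continuing:

sudo reboot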

6. Post-Installation Steps for NVIDIA Driver

After successfully installing the NVIDIA drivers, perform the following post-installation steps to ensure everything is set up correctly. For a comprehensive guide, consult the NVIDIA CUDA Installation Guide for Linux (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions).

6.1 Verify NVIDIA Persistence Daemon

Check the status of the NVIDIA Persistence Daemon to ensure it's running correctly:

systemctl status nvidia-persistenced

6.2 Monitor GPU Utilization

To confirm that your GPU is recognized and monitor its utilization, use:

nvidia-smi
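
With the drivers and the container toolkit both in place, it is also worth confirming that containers can see the GPUs. A minimal check, assuming the CUDA base image tag below is available for your CUDA version and OS:

docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi

If everything is configured correctly, this prints the same nvidia-smi table from inside the container, listing both T4 GPUs.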


7. Define Model Configuration

To set up the model configuration, define the following environment variables. The variable model is set to mistralai/Mistral-7B-v0.1, the model used in this tutorial, and the variable volume is set to the present working directory ($PWD) followed by /data, the directory where the model weights will be stored:

export model=mistralai/Mistral-7B-v0.1
export volume=$PWD/data 
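
Mistral-7B-v0.1 is publicly downloadable, but if you later switch to a gated model you will also need a Hugging Face access token. The inference server reads it from the HUGGING_FACE_HUB_TOKEN environment variable, which you can forward to the container by adding -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN to the docker run command in the next step:

export HUGGING_FACE_HUB_TOKEN=<your_token>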

8. Run text-generation-inference Using Docker

To serve the model, we will use the Hugging Face text-generation-inference server (for more details, see https://huggingface.co/docs/text-generation-inference/index). Execute the Docker command that follows the parameter list below; here is what each parameter does:

  • --gpus all: Enables GPU support for Docker containers.
  • --shm-size 1g: Sets the shared memory size to 1 gigabyte.
  • -p 8080:80: Maps port 8080 on the host to port 80 in the Docker container.
  • -v $volume:/data: Mounts the local data volume specified by $volume inside the Docker container at the /data path.
  • ghcr.io/huggingface/text-generation-inference:1.3: Specifies the Docker image for text-generation-inference with the version tag 1.3.
  • --model-id $model: Passes the specified model identifier ($model) to the text-generation-inference application.

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id $model
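
Note that this command runs in the foreground, and the first start downloads the model weights into $volume, which can take several minutes. If you prefer to run the server in the background and follow its logs, the standard Docker flags work as expected:

docker run -d --name tgi --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id $model
docker logs -f tgi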

9. Check GPU Utilization

Run the GPU monitoring command again to check memory utilization after the model weights have been loaded into GPU memory:

nvidia-smi

10. Test the API Endpoint

To test the API endpoint, use the following curl command:

curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
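
The server also exposes a streaming variant of the same endpoint, which returns tokens one by one as server-sent events rather than waiting for the full generation:

curl 127.0.0.1:8080/generate_stream -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'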
