Benoit COUETIL 💫
Posted on February 17, 2024
- Initial thoughts
- 1. The right EC2 instance at the right price
- 2. Scripting the GitLab runner installation and configuration
- 3. Deploying the auto-stopping architecture with Terraform
- Further reading
Initial thoughts
In GitLab CI: The Majestic Single Server Runner, we found that a single server runner outperforms a Kubernetes cluster with equivalent node specifications until approximately 200 jobs are requested simultaneously! This is typically beyond the average daily usage for most software teams. Equally important, when there are 40 or fewer queued jobs to process, the single server runner is twice as fast. This scenario is quite common for most teams, even during the busiest days.
This article will help you deploy this no-compromise runner on AWS, at a reasonable price, thanks to multiple optimizations. Part of it applies to any Cloud, public or private.
The deployment is automated and optimized as much as possible:
- Infrastructure is provisioned with Terraform
- A spot instance is used
- The EC2 instance is stopped at night and on weekends
- EC2 boot script (re)installs everything and registers to GitLab
- The runner is tagged with a few interesting EC2 characteristics
1. The right EC2 instance at the right price
An AWS spot instance is a cost-effective option that allows you to leverage spare EC2 capacity at a discounted price. By choosing spot instances, you can significantly reduce your Amazon EC2 costs. Since our deployment is automated and downtime is not critical, opting for spot instances is an optimal choice for cost optimization.
To fully utilize the capabilities of a single server runner while keeping costs reasonable, it is essential to select an EC2 instance with a local NVMe SSD disk. These instances are identified by the 'd' in their name (as in r5d), indicating that they come with local instance store volumes.
When choosing an EC2 instance, the following conditions should be considered:
- The instance should have the 'd' letter to indicate NVMe local disk support.
- It should be available in our usual region.
- The CPU specifications should match our usage requirements. For the CI/CD of Java/JavaScript applications, about 1 core per parallel job is good. Here we choose 16 CPUs for 20 parallel jobs.
- The spot price should be reasonable.
For the purpose of this article, we have selected the r5d.4xlarge instance type. At the time of writing, the spot price for this instance in us-east-1 is approximately $370/month. It might seem high to you.
But when compared to the monthly cost of our development team, this price is relatively low. However, we can further optimize costs by automatically stopping the EC2 instance outside of working hours using daily CloudWatch executions. Since it is a local disk instance, the state will be lost every day, but we have nothing to lose except some cache, which can be warmed up with a scheduled pipeline every morning.
Let's calculate the cost: $0.5045/hour × 12 hours per working day × 21 working days per month ≈ $127/month. This brings the cost even lower than the already acceptable price. To put it into perspective, this represents an 85% discount compared to running the same instance full-time on-demand ($841/month).
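The arithmetic above can be replayed with other rates or schedules using a small shell sketch. The function names monthly_cost and discount_percent are made up for this illustration, and the figures are the ones quoted in this article at the time of writing, so treat the result as an estimate rather than a quote.

```shell
#!/bin/bash
# Monthly cost (rounded to the dollar) of an instance billed per hour.
# Arguments: hourly rate in USD, hours per day, days per month.
monthly_cost() {
  awk -v rate="$1" -v hours="$2" -v days="$3" \
    'BEGIN { printf "%.0f\n", rate * hours * days }'
}

# Discount (in percent) of a reduced monthly cost versus a reference monthly cost.
discount_percent() {
  awk -v reduced="$1" -v reference="$2" \
    'BEGIN { printf "%.0f\n", (1 - reduced / reference) * 100 }'
}

# Spot instance running 12 hours a day, 21 days a month
scheduled=$(monthly_cost 0.5045 12 21)
echo "Scheduled spot instance: \$${scheduled}/month"
echo "Discount vs full-time on-demand (\$841/month): $(discount_percent "$scheduled" 841)%"
```

With the numbers from this article, the script reports $127/month and an 85% discount. The hourly rate changes over time; it can be looked up in the AWS console or with the aws ec2 describe-spot-price-history command.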
2. Scripting the GitLab runner installation and configuration
To streamline the process of deploying the EC2 instance, we will create a script that can be used as the user_data to bootstrap the server anytime it (re)boots. This script will handle the installation of Docker, the GitLab Runner, and the configuration required to connect to the GitLab instance.
The script is designed to handle reboots and stop/start actions, which may result in the deletion of local disk data on the NVMe EC2 instance.
Make sure to adapt the variables at the start of the script (MAINTAINER, RUNNER_NAME, GITLAB_URL and GITLAB_TOKEN) to your specific requirements:
aws-ec2-init-nvme-and-gitlab-runner.sh
#!/bin/bash
#
### Script to initialize a GitLab runner on an existing AWS EC2 instance with NVME disk(s)
#
# - script is not interactive (can be run as user_data)
# - NVME mounting is performed by a systemd service, created below and run on each boot
# - first NVME disk will be used for GitLab custom cache
# - last NVME disk will be used for Docker data (if only one NVME, the same will be used without problem)
# - robust: on each reboot and stop/start, disks are mounted again (but data may be lost if stop and then start after a few minutes)
# - runner is tagged with multiple instance data (public dns, IP, instance type...)
# - works with a single spot instance
# - should work even with multiple ones in a fleet, with same user_data (not tested for now)
#
# /!\ There is no prerequisite, except these needed variables:
MAINTAINER=zenika
RUNNER_NAME="majestic-runner"
GITLAB_URL=https://gitlab.com/
GITLAB_TOKEN=XXXX
# prepare docker (re)install
sudo apt-get update # refresh package lists, which may be stale on a fresh instance
sudo apt-get -y install apt-transport-https ca-certificates curl gnupg lsb-release sysstat
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt-get update # needed to use the docker.list
# install gitlab runner
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt-get -y install gitlab-runner
# create NVME initializer script
cat <<EOF >/home/ubuntu/nvme-initializer.sh
#!/bin/bash
#
# To be run on each fresh start, since NVME disks are ephemeral
# so first start, start after stop, but not on reboot
# inspired by https://stackoverflow.com/questions/45167717/mounting-a-nvme-disk-on-aws-ec2
#
date | tee -a /home/ubuntu/nvme-initializer.log
### Handle NVME disks
# get NVME disks bigger than 100 GB (some small disks may be there for root, depending on server type)
NVME_DISK_LIST=\$(lsblk -b --output=NAME,SIZE | grep "^nvme" | awk '{if(\$2>100000000000)print\$1}' | sort)
echo "NVME disks are: \$NVME_DISK_LIST" | tee -a /home/ubuntu/nvme-initializer.log
# there may be 1 or 2 NVME disks, then we split (or not) the mounts between GitLab custom cache and Docker data
export NVME_GITLAB=\$(echo "\$NVME_DISK_LIST" | head -n 1)
export NVME_DOCKER=\$(echo "\$NVME_DISK_LIST" | tail -n 1)
echo "NVME_GITLAB=\$NVME_GITLAB and NVME_DOCKER=\$NVME_DOCKER" | tee -a /home/ubuntu/nvme-initializer.log
# format disks if not already formatted (mkfs.xfs refuses to overwrite an existing filesystem without -f)
# stderr is redirected to the log: with a pipe, the exit status is the one of tee, so a "|| echo" fallback would never run
sudo mkfs -t xfs /dev/\$NVME_GITLAB 2>&1 | tee -a /home/ubuntu/nvme-initializer.log # this may already be done
sudo mkfs -t xfs /dev/\$NVME_DOCKER 2>&1 | tee -a /home/ubuntu/nvme-initializer.log # disk may be the same, then already formatted by previous command
# mount on /gitlab/ and /var/lib/docker/
sudo mkdir -p /gitlab
sudo mount /dev/\$NVME_GITLAB /gitlab | tee -a /home/ubuntu/nvme-initializer.log
sudo mkdir -p /gitlab/custom-cache
sudo mkdir -p /var/lib/docker
sudo mount /dev/\$NVME_DOCKER /var/lib/docker | tee -a /home/ubuntu/nvme-initializer.log
### reinstall Docker (which data may have been wiped out)
# docker (re)install
sudo apt-get -y reinstall docker-ce docker-ce-cli containerd.io docker-compose-plugin | tee -a /home/ubuntu/nvme-initializer.log
echo "NVME initialization successful" | tee -a /home/ubuntu/nvme-initializer.log
EOF
# set NVME initializer script as startup script
sudo tee /etc/systemd/system/nvme-initializer.service >/dev/null <<EOS
[Unit]
Description=NVME Initializer
After=network.target
[Service]
ExecStart=/home/ubuntu/nvme-initializer.sh
Type=oneshot
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOS
sudo chmod 744 /home/ubuntu/nvme-initializer.sh
sudo chmod 664 /etc/systemd/system/nvme-initializer.service
sudo systemctl daemon-reload
sudo systemctl enable nvme-initializer.service
sudo systemctl start nvme-initializer.service
sudo systemctl status nvme-initializer.service
# tail -f /var/log/syslog
### Runner registration is done last, so that its appearance in GitLab signals that the whole process completed
echo "gitlab-runner ALL=(ALL) NOPASSWD:ALL" | sudo tee -a /etc/sudoers
RUNNER_VERSION_DETAILS=$(sudo gitlab-runner --version)
### Example
# Version: 15.10.1
# Git revision: dcfb4b66
# Git branch: 15-10-stable
# GO version: go1.19.6
# Built: 2023-03-29T13:01:22+0000
# OS/Arch: linux/amd64
RUNNER_VERSION=$(echo "$RUNNER_VERSION_DETAILS" | grep -oP 'Version:\s+\K[\d\.]+')
RUNNER_VERSION_DATE=$(echo "$RUNNER_VERSION_DETAILS" | grep -oP 'Built:\s+\K.+')
RUNNER_OS_ARCH=$(echo "$RUNNER_VERSION_DETAILS" | grep -oP 'OS/Arch:\s+\K.+')
# 169.254.169.254 IP is always the same whatever the instance
# EC2 IP and hostname will change on AWS if VM is restarted but may not be elsewhere
RUNNER_TAGS="$MAINTAINER,$RUNNER_VERSION,$RUNNER_VERSION_DATE,$RUNNER_OS_ARCH,$(curl --silent http://169.254.169.254/latest/meta-data/instance-type),$(curl --silent http://169.254.169.254/latest/meta-data/instance-life-cycle),$(curl --silent http://169.254.169.254/latest/meta-data/public-ipv4),$(curl --silent http://169.254.169.254/latest/meta-data/public-hostname)" && echo $RUNNER_TAGS
# to start as paused (only if on-demand ec2): --paused
sudo gitlab-runner register --name "$RUNNER_NAME" --url "$GITLAB_URL" --registration-token "$GITLAB_TOKEN" --executor "docker" --docker-image "ubuntu:20.04" --docker-volumes "/var/run/docker.sock:/var/run/docker.sock" --docker-volumes "/gitlab/custom-cache/:/host/" --run-untagged=true --custom_build_dir-enabled=true --tag-list "$RUNNER_TAGS" --docker-privileged --docker-pull-policy "if-not-present" --non-interactive
# replace "concurrent = 1" with "concurrent = 20"
sudo sed -i '/^concurrent /s/=.*$/= 20/' /etc/gitlab-runner/config.toml
# replace "check_interval = 0" with "check_interval = 2"
sudo sed -i '/^check_interval /s/=.*$/= 2/' /etc/gitlab-runner/config.toml
### from https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4036#note_1083142570
# replace "/cache" technical volume with one mounted on disk to avoid cache failure when several jobs in parallel
# this could also have been a mounted docker volume: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/1151#note_1019634818 but this does not make it faster if 2 different NVME disks (gitlab + docker)
sudo sed -i 's#"/cache"#"/gitlab/cache:/cache"#' /etc/gitlab-runner/config.toml
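The script above reads instance metadata with plain GET requests (IMDSv1). If your account or AMI enforces IMDSv2, which is an assumption about your setup and not something the original script handles, those calls would need a session token first. A minimal sketch, with imds_get being a hypothetical helper name:

```shell
#!/bin/bash
# Read one EC2 instance metadata value using IMDSv2 (session-token based).
# Usage: imds_get instance-type
imds_get() {
  local token
  # Step 1: request a short-lived session token (5 minutes here)
  token=$(curl --silent --fail -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300") || return 1
  # Step 2: pass the token along with the actual metadata request
  curl --silent --fail -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/$1"
}
```

In the RUNNER_TAGS line, a call such as $(curl --silent http://169.254.169.254/latest/meta-data/instance-type) could then be written $(imds_get instance-type). This only works from inside an EC2 instance.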
3. Deploying the auto-stopping architecture with Terraform
To quickly deploy the architecture, we will be using Terraform. With Terraform, we can automate the deployment process and have our infrastructure up and running in minutes.
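Before we proceed, please ensure that you have an existing VPC created as a prerequisite. You can refer to the examples provided in the official GitHub repo for guidance on creating the VPC. The configuration below expects it to be declared as module.vpc (exposing vpc_id and public_subnets), and also expects a local.tags map to be defined.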
Here is the gitlab-runner.tf file that contains the Terraform configuration:
################################################################################
# Gitlab Runner EC2 Spot instance (with security group)
################################################################################
resource "aws_security_group" "in-ssh-out-all" {
  name   = "in-ssh-out-all"
  vpc_id = module.vpc.vpc_id
  ingress {
    cidr_blocks = [
      "0.0.0.0/0"
    ]
    from_port = 22
    to_port   = 22
    protocol  = "tcp"
  } // Terraform removes the default rule
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_spot_instance_request" "gitlab-runner" {
  ami                    = "ami-04ab94c703fb30101" # us-east-1, Canonical, Ubuntu, 22.04 LTS, amd64 jammy build on 2024-01-26. Choose here: https://cloud-images.ubuntu.com/locator/ec2/
  instance_type          = "r5d.4xlarge"
  key_name               = "my-key" # create a key and put it here if you want to connect to your EC2 in SSH
  availability_zone      = "us-east-1a" # sadly only one possible for now
  subnet_id              = module.vpc.public_subnets[0] # sadly only one possible for now
  vpc_security_group_ids = [aws_security_group.in-ssh-out-all.id]
  user_data              = file("aws-ec2-init-nvme-and-gitlab-runner.sh")
  valid_until            = "2030-01-01T00:00:00Z"
  wait_for_fulfillment   = true
  tags = merge(
    local.tags,
    {
      Scheduled = "working-hours"
    }
  )
}
# Stop runner nightly and start it daily on working days
# from https://github.com/popovserhii/terraform-aws-lambda-scheduler
module "runner-stop-nightly" {
  source                         = "popovserhii/lambda-scheduler/aws"
  name                           = "stop-runner"
  aws_regions                    = ["us-east-1"]
  cloudwatch_schedule_expression = "cron(0 20 ? * MON-SUN *)"
  schedule_action                = "stop"
  spot_schedule                  = true
  ec2_schedule                   = false
  rds_schedule                   = false
  autoscaling_schedule           = false
  cloudwatch_alarm_schedule      = false
  resource_tags = [
    {
      Key   = "Scheduled"
      Value = "working-hours"
    }
  ]
}
module "runner-start-daily" {
  source                         = "popovserhii/lambda-scheduler/aws"
  name                           = "start-runner"
  aws_regions                    = ["us-east-1"]
  cloudwatch_schedule_expression = "cron(0 08 ? * MON-FRI *)"
  schedule_action                = "start"
  spot_schedule                  = true
  ec2_schedule                   = false
  rds_schedule                   = false
  autoscaling_schedule           = false
  cloudwatch_alarm_schedule      = false
  resource_tags = [
    {
      Key   = "Scheduled"
      Value = "working-hours"
    }
  ]
}
The runner starts at 08:00 from Monday to Friday and stops at 20:00 every day. Both times are in UTC, since CloudWatch cron expressions are evaluated in UTC. Feel free to change them according to your requirements.
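CloudWatch cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year), with a ? in whichever of the two day fields is unused. The sketch below builds such an expression from a local hour and a fixed UTC offset. The schedule_expression helper is hypothetical, and it deliberately ignores daylight saving time and the case where the conversion crosses midnight and shifts the weekday.

```shell
#!/bin/bash
# Build a CloudWatch schedule expression firing at minute 0 of a given local hour.
# Arguments: local hour (0-23, no leading zero), UTC offset in hours (e.g. 1 for UTC+1),
#            day-of-week field (e.g. MON-FRI).
schedule_expression() {
  local local_hour="$1" utc_offset="$2" days="$3"
  # Convert to UTC and wrap around the 24-hour clock
  local utc_hour=$(( (local_hour - utc_offset + 24) % 24 ))
  printf 'cron(0 %d ? * %s *)\n' "$utc_hour" "$days"
}

schedule_expression 8 1 MON-FRI    # start at 08:00 in a UTC+1 timezone
schedule_expression 20 1 MON-SUN   # stop at 20:00 in a UTC+1 timezone
```

The resulting strings can be pasted into the cloudwatch_schedule_expression attributes of the two modules.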
Once you have created and adapted the configuration, follow these steps:
- Run terraform init to initialize the Terraform configuration.
- Run terraform apply to apply the configuration and deploy the infrastructure.
With these commands, Terraform will handle the deployment process, and your autonomous architecture will be up and running in no time.
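Once the instance is up, it can be worth checking over SSH that the boot script did its job. The sketch below verifies that the two mount points used by the script (/gitlab and /var/lib/docker) are present in the mounts table. The is_mounted helper is a hypothetical name, not part of the original script.

```shell
#!/bin/bash
# Succeeds when the given mount point appears in the mounts table.
# The table defaults to /proc/mounts; a second argument overrides it (handy for dry runs).
is_mounted() {
  awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Report the state of the two NVMe-backed mount points
for mount_point in /gitlab /var/lib/docker; do
  if is_mounted "$mount_point"; then
    echo "$mount_point is mounted"
  else
    echo "$mount_point is NOT mounted"
  fi
done
```

If a mount is missing, /home/ubuntu/nvme-initializer.log and systemctl status nvme-initializer.service are the first places to look. Running sudo gitlab-runner list on the instance shows whether the runner was registered.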
Illustrations generated locally by DiffusionBee using FLUX.1-schnell model
Further reading
🔀 Efficient Git Workflow for Web Apps: Advancing Progressively from Scratch to Thriving
Benoit COUETIL 💫 for Zenika ・ Oct 10
🔀🦊 GitLab: Forget GitKraken, Here are the Only Git Commands You Need
Benoit COUETIL 💫 for Zenika ・ Aug 31
🦊 GitLab: A Python Script Displaying Latest Pipelines in a Group's Projects
Benoit COUETIL 💫 for Zenika ・ Jun 29
🦊 GitLab: A Python Script Calculating DORA Metrics
Benoit COUETIL 💫 for Zenika ・ Apr 5
🦊 GitLab CI: The Majestic Single Server Runner
Benoit COUETIL 💫 for Zenika ・ Jan 27
🦊 GitLab CI YAML Modifications: Tackling the Feedback Loop Problem
Benoit COUETIL 💫 for Zenika ・ Dec 18 '23
🦊 GitLab CI Optimization: 15+ Tips for Faster Pipelines
Benoit COUETIL 💫 for Zenika ・ Nov 6 '23
🦊 GitLab CI: 10+ Best Practices to Avoid Widespread Anti-Patterns
Benoit COUETIL 💫 for Zenika ・ Sep 25 '23
🦊 GitLab Pages per Branch: The No-Compromise Hack to Serve Preview Pages
Benoit COUETIL 💫 for Zenika ・ Aug 1 '23
🦊 ChatGPT, If You Please, Make Me a GitLab Jobs YAML Attributes Sorter
Benoit COUETIL 💫 for Zenika ・ Mar 30 '23
🦊 GitLab Runners Topologies: Pros and Cons
Benoit COUETIL 💫 for Zenika ・ Feb 7 '23
This article was enhanced with the assistance of an AI language model to ensure clarity and accuracy in the content, as English is not my native language.