Deep dive: optimizing self-hosted GitHub Actions runners on AWS and GCP for cost efficiency

Running self-hosted GitHub Actions runners in the cloud gives you a lot of control, but costs can spiral if the setup isn't optimized. In this post, I’ll share how we achieved 30% cost savings on AWS and how you can replicate similar strategies on GCP, with a touch of technical fun and actionable advice. Let's dive in!

Why Self-Hosted Runners?

GitHub Actions' default hosted runners are convenient but can be expensive for large workloads, especially for compute-intensive tasks like integration tests or builds. Self-hosted runners, deployed on AWS or GCP, offer:

Cost Control: Pay only for what you use.
Custom Environments: Tailored to specific workflows.
Scalability: Dynamically scale based on workload.

However, running self-hosted runners at scale comes with its own challenges: idle resources, inefficient configurations, and escalating network costs.

Architecture Overview

Here’s a high-level architecture for both AWS and GCP self-hosted runners:

Challenges in Cost Management

Idle Resources: Pre-provisioned runners waiting for jobs lead to unnecessary costs.
Networking Overheads: High outbound traffic, especially for Docker pulls.
Instance Type Selection: Choosing cost-effective and performant instance types.
Preemption Risks: Spot instances (AWS) or preemptible VMs (GCP) can fail mid-job.

Optimization Strategies

1. Dynamic Scaling

Both AWS and GCP allow scaling instances based on demand.

AWS

Use Auto Scaling Groups (ASGs) with Lambda functions triggered by workflow_job webhooks.
Leverage tools like philips-labs/terraform-aws-github-runner to simplify management.
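
As a rough illustration of the webhook-driven approach, here is a minimal sketch of the decision logic such a Lambda might run. It assumes the function receives the raw webhook body plus GitHub's X-Hub-Signature-256 header; the function names and the "self-hosted" label filter are illustrative, not taken from our setup.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), (header or "").encode())


def wants_runner(event: dict, required_labels: frozenset = frozenset({"self-hosted"})) -> bool:
    """True when a workflow_job event means one more runner is needed."""
    if event.get("action") != "queued":
        return False  # in_progress / completed events do not need new capacity
    labels = set(event.get("workflow_job", {}).get("labels", []))
    return required_labels <= labels


def handle_webhook(secret: bytes, body: bytes, signature_header: str) -> int:
    """Return how many runners to add (0 or 1) for one webhook delivery."""
    if not signature_is_valid(secret, body, signature_header):
        raise PermissionError("bad webhook signature")
    return 1 if wants_runner(json.loads(body)) else 0
```

In a real deployment the returned count would feed whatever raises the ASG's desired capacity (for example boto3's set_desired_capacity); that call is left out here so the sketch runs without AWS credentials.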

GCP

Use Managed Instance Groups (MIGs) with custom autoscaler policies based on job queue size or CPU load.
Cloud Functions or Cloud Run can handle scaling triggers.
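
A queue-based policy can be as simple as the following sketch. It assumes the scaling trigger already knows how many jobs are queued and running; the function name, the one-job-per-runner default, and the bounds are placeholders to adapt.

```python
import math


def target_runner_count(queued_jobs: int, running_jobs: int,
                        jobs_per_runner: int = 1,
                        min_runners: int = 0, max_runners: int = 20) -> int:
    """Runners needed to cover queued plus running jobs, clamped to the group's bounds."""
    needed = math.ceil((queued_jobs + running_jobs) / jobs_per_runner)
    return max(min_runners, min(max_runners, needed))
```

A Cloud Function or Cloud Run service could compute this number on each trigger and resize the MIG to match; the resize call itself is omitted here.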

2. Spot Instances (AWS) / Preemptible VMs (GCP)

These offer significant cost savings but require careful handling of preemptions.

AWS Spot Instances

Mix instance types in Spot Pools for better availability:
- m5, m6i, m7i (Intel)
- m5a, m6a (AMD)
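
To make the mix concrete, here is a sketch that builds the MixedInstancesPolicy structure an Auto Scaling group accepts (for example via boto3's create_auto_scaling_group). The launch template name, the xlarge size, and the allocation strategy are assumptions for illustration; only the instance families come from the list above.

```python
def mixed_instances_policy(launch_template_name: str,
                           families=("m5", "m6i", "m7i", "m5a", "m6a"),
                           size: str = "xlarge",
                           on_demand_base: int = 0) -> dict:
    """Build a MixedInstancesPolicy that spreads Spot requests across several families."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,  # hypothetical template name
                "Version": "$Latest",
            },
            # One override per family; more pools means more chances to find Spot capacity.
            "Overrides": [{"InstanceType": f"{family}.{size}"} for family in families],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,      # fallback capacity that is never Spot
            "OnDemandPercentageAboveBaseCapacity": 0,    # everything above the base runs on Spot
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }
```

Setting on_demand_base to 1 or more keeps a small on-demand floor for critical jobs while the rest of the pool stays on Spot.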

GCP Preemptible VMs

Use diverse instance types:
- e2-standard, n2-highmem
- t2d-standard (AMD)

Jobs must checkpoint regularly, or be safe to retry from the start, to handle interruptions gracefully.
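
One way to approach this on GCP is to poll the VM's metadata server between job steps and stop cleanly once the instance is flagged as preempted. The sketch below assumes the job can be split into named steps; run_steps and its step list are hypothetical helpers, while the metadata URL and the Metadata-Flavor header are the ones Compute Engine documents.

```python
import urllib.request

PREEMPTED_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"


def fetch_metadata(url: str = PREEMPTED_URL, timeout: float = 2.0) -> str:
    """Read one value from the metadata server (only reachable from inside a VM)."""
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode().strip()


def is_preempted(fetch=fetch_metadata) -> bool:
    """True once Compute Engine has flagged this VM for preemption."""
    return fetch().upper() == "TRUE"


def run_steps(steps, fetch=fetch_metadata):
    """Run (name, callable) steps in order, stopping early if the VM is preempted.

    Returns the names of the steps that completed, which acts as the checkpoint.
    """
    done = []
    for name, step in steps:
        if is_preempted(fetch):
            break
        step()
        done.append(name)
    return done
```

The returned list can be persisted (to Cloud Storage, for instance) so a retried job skips the steps that already finished.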

Pro Tip: Always have fallback capacity with on-demand instances or higher-priority pools for critical workloads.

3. Caching and Artifact Management

Networking Optimization

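AWS: Implement S3-based caching with an S3-backed alternative to actions/cache (the stock action stores caches in GitHub's own cache service rather than in your bucket).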
GCP: Use Cloud Storage or Artifact Registry for similar functionality.

Docker Pulls

Reduce Docker pull costs by:
- Setting up a pull-through cache in GCP Artifact Registry or AWS ECR.
- Using VPC endpoints (AWS) or private access (GCP) to minimize outbound traffic.
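
With a pull-through cache in place, workflows need to reference images through the cache rather than Docker Hub directly. Here is a small sketch of that rewrite, assuming the cache exposes upstream images under cache_host/prefix/repository, which is how an ECR pull-through cache rule lays them out. The host and the "docker-hub" prefix are placeholders for whatever your rule uses.

```python
def to_pull_through_ref(image: str, cache_host: str, prefix: str = "docker-hub") -> str:
    """Rewrite a Docker Hub image reference so it is pulled through the cache."""
    first, _, rest = image.partition("/")
    # A first path component with a dot, a port, or "localhost" names an explicit registry.
    has_registry = bool(rest) and ("." in first or ":" in first or first == "localhost")
    if has_registry:
        return image  # not a bare Docker Hub reference; leave it alone
    if not rest:
        image = f"library/{image}"  # official images live under library/ on Docker Hub
    return f"{cache_host}/{prefix}/{image}"
```

For example, with a hypothetical host of 123456789012.dkr.ecr.us-east-1.amazonaws.com, nginx:1.25 becomes 123456789012.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/nginx:1.25.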

4. Cost Monitoring and Analysis

Both cloud providers offer tools to analyze costs:

AWS: Cost Explorer + CloudWatch for EC2 usage.
GCP: Billing Reports + Cloud Monitoring (formerly Stackdriver).

Key Metrics to Watch:

Idle instance time
Spot/preemptible interruption rates
Network egress traffic
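
Idle instance time is easy to compute once you have instance uptime and job timestamps. The sketch below assumes both are available as numeric timestamps (for example epoch seconds exported from your monitoring data); the function name and input shape are illustrative.

```python
def idle_fraction(uptime_start: float, uptime_end: float, jobs: list) -> float:
    """Share of an instance's uptime during which no job was running.

    jobs is a list of (start, end) timestamps; overlapping jobs are counted once.
    """
    uptime = uptime_end - uptime_start
    if uptime <= 0:
        return 0.0
    busy = 0.0
    cursor = uptime_start  # everything before the cursor is already accounted for
    for start, end in sorted(jobs):
        start, end = max(start, cursor), min(end, uptime_end)
        if end > start:
            busy += end - start
            cursor = end
    return 1.0 - busy / uptime
```

An instance that was up for 100 seconds and ran jobs covering 40 of them has an idle fraction of 0.6; tracking this per runner pool shows where pre-provisioned capacity is oversized.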

Case Study: AWS Optimization Outcomes

Idle Runners Reduced: Adjusted runner pools based on org activity.
Spot Pools Optimized: Added AMD-based m6a instances, reducing costs by 30%.
Networking Costs: Introduced Docker pull-through caching with S3.

Case Study: GCP Adaptation

Dynamic Scaling: Managed Instance Groups with preemptible VMs.
Networking: Switched to Private Google Access for traffic to Google APIs and services.
Preemptible Instances: n2-highmem provided a balance of cost and performance.

Results

Cost reduction metrics

Cloud Provider	Baseline Cost	Optimized Cost	Savings (%)
AWS	$10,000	$7,000	30%
GCP	$9,500	$6,500	32%
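
For reference, the savings column is the cost difference divided by the baseline. A quick check of the arithmetic, using the figures from the table:

```python
def savings_percent(baseline: float, optimized: float) -> float:
    """Percentage saved relative to the baseline cost."""
    return (baseline - optimized) / baseline * 100


for provider, baseline, optimized in [("AWS", 10_000, 7_000), ("GCP", 9_500, 6_500)]:
    print(f"{provider}: {savings_percent(baseline, optimized):.1f}%")
# AWS: 30.0%
# GCP: 31.6%
```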

User Experience Improvements

Reduced job interruptions.
Faster job execution due to optimized runner configurations.

Future Opportunities

IPv6 and NAT Gateway Optimization:
- Both AWS and GCP support IPv6, so traffic to IPv6-capable endpoints can bypass the NAT gateway and its per-GB processing charges.
Machine Learning for Scaling Decisions:
- Use historical data to predict demand spikes.
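
Before reaching for a real model, a simple baseline is worth having. The sketch below averages past queue depth per hour of day and flags the hours worth pre-warming; it is plain averaging rather than machine learning, and the input shape and threshold are assumptions.

```python
from collections import defaultdict


def hourly_forecast(history: list) -> dict:
    """Average queued-job count per hour of day from (hour, jobs) samples."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [sum of jobs, number of samples]
    for hour, jobs in history:
        totals[hour][0] += jobs
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in totals.items()}


def prewarm_hours(forecast: dict, threshold: float) -> list:
    """Hours of the day whose average demand meets or exceeds the threshold."""
    return sorted(hour for hour, avg in forecast.items() if avg >= threshold)
```

Any predictive model would need to beat this baseline to justify its complexity.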

Conclusion

Optimizing self-hosted GitHub Actions runners on AWS and GCP can cut costs significantly while improving performance. By dynamically scaling resources, leveraging spot/preemptible instances, and optimizing network usage, you can achieve a highly efficient setup tailored to your workloads.

Feel free to experiment with these strategies and share your results. Happy optimizing! 🚀
