Deep dive: optimizing self-hosted GitHub Actions Runners on AWS and GCP for cost efficiency

siddhantkcode

Siddhant Khare

Posted on November 20, 2024


Running self-hosted GitHub Actions runners in the cloud gives you a lot of control, but the costs can spiral if the setup is not optimized. In this post, I’ll share how we achieved 30% cost savings on AWS and how you can apply similar strategies on GCP, with a touch of technical fun and actionable advice. Let's dive in!


Why Self-Hosted Runners?

GitHub-hosted runners are convenient but can be expensive for large workloads, especially for compute-intensive tasks like integration tests or builds. Self-hosted runners, deployed on AWS or GCP, offer:

  • Cost Control: Pay only for what you use.
  • Custom Environments: Tailored to specific workflows.
  • Scalability: Dynamically scale based on workload.

However, running self-hosted runners at scale comes with its own challenges: idle resources, inefficient configurations, and escalating network costs.


Architecture Overview

Here’s a high-level architecture for both AWS and GCP self-hosted runners:

[Figure: high-level architecture]


Challenges in Cost Management

  1. Idle Resources: Pre-provisioned runners waiting for jobs lead to unnecessary costs.
  2. Networking Overheads: High outbound traffic, especially for Docker pulls.
  3. Instance Type Selection: Choosing cost-effective and performant instance types.
  4. Preemption Risks: Spot instances (AWS) or preemptible VMs (GCP) can fail mid-job.

Optimization Strategies

1. Dynamic Scaling

Both AWS and GCP allow scaling instances based on demand.

AWS

  • Use Auto Scaling Groups (ASGs) with Lambda functions triggered by workflow_job webhooks.
  • Leverage tools like philips-labs/terraform-aws-github-runner to simplify management.
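As a rough illustration of the webhook side, here is a minimal sketch of what a Lambda handler might do with an incoming workflow_job event: verify the X-Hub-Signature-256 header GitHub sends, then map the event to a scaling action. The function names and the "scale_up"/"scale_down" strings are hypothetical, chosen for this example only; they are not taken from philips-labs/terraform-aws-github-runner.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, signature_header)


def scaling_action(body: bytes) -> str:
    """Map a workflow_job payload to a (hypothetical) scaling action name."""
    event = json.loads(body)
    labels = event.get("workflow_job", {}).get("labels", [])
    if "self-hosted" not in labels:
        return "ignore"  # the job targets GitHub-hosted runners
    # "queued" means a job is waiting for a runner; "completed" frees one up.
    actions = {"queued": "scale_up", "completed": "scale_down"}
    return actions.get(event.get("action"), "ignore")
```

Verifying the signature before parsing matters here because the webhook endpoint is public, and an unauthenticated request should never be able to launch instances.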

GCP

  • Use Managed Instance Groups (MIGs) with custom autoscaler policies based on job queue size or CPU load.
  • Cloud Functions or Cloud Run can handle scaling triggers.
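Whichever trigger you use, the scaling decision itself usually comes down to a small calculation. The sketch below assumes you can obtain the number of queued jobs and busy runners from somewhere (for example, by counting workflow_job events); the function name, parameters, and default limits are illustrative assumptions, not part of any AWS or GCP API.

```python
import math


def desired_runners(queued_jobs: int, busy_runners: int,
                    min_size: int = 0, max_size: int = 20,
                    jobs_per_runner: int = 1) -> int:
    """Target group size: runners already busy plus enough for the queue."""
    needed = busy_runners + math.ceil(queued_jobs / jobs_per_runner)
    # Clamp to the group's limits so a burst cannot scale without bound.
    return max(min_size, min(max_size, needed))
```

The result would be written to the ASG's desired capacity or the MIG's target size. Keeping min_size at 0 outside working hours is what removes idle cost; raising it trades money for lower job start latency.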

[Figure: scaling decision]


2. Spot Instances (AWS) / Preemptible VMs (GCP)

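Both offer deep discounts over on-demand pricing, but the provider can reclaim the capacity at short notice, so preemptions need careful handling. GCP now also offers Spot VMs, the successor to preemptible VMs, which do not have the 24-hour maximum runtime.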

AWS Spot Instances

  • Mix instance types in Spot Pools for better availability:
    • m5, m6i, m7i (Intel)
    • m5a, m6a (AMD)

GCP Preemptible VMs

  • Use diverse instance types:
    • e2-standard, n2-highmem, and t2d-standard (the last one is AMD-based)
  • Long-running jobs should checkpoint regularly, or be safe to retry, so that interruptions are handled gracefully.
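To react to a preemption, the runner first has to notice it. Both clouds expose this through the instance metadata service; the sketch below polls the EC2 spot instance-action endpoint (using an IMDSv2 token) and the GCE preempted flag. The helper names are my own, and what you do on a positive result (stop accepting jobs, upload a checkpoint) is left out.

```python
import json
import urllib.error
import urllib.request

AWS_TOKEN_URL = "http://169.254.169.254/latest/api/token"
AWS_SPOT_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
GCP_PREEMPTED_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"


def aws_interruption_pending(status: int, body: str) -> bool:
    """EC2 answers 404 until an interruption notice exists, then returns JSON."""
    if status != 200:
        return False
    return json.loads(body).get("action") in {"terminate", "stop", "hibernate"}


def gcp_preempted(body: str) -> bool:
    """The GCE metadata server returns the literal text TRUE or FALSE."""
    return body.strip().upper() == "TRUE"


def fetch(url: str, headers: dict, method: str = "GET", timeout: float = 2.0):
    """Return (status, body) from a metadata endpoint; only works on an instance."""
    req = urllib.request.Request(url, headers=headers, method=method)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, ""


def check_aws() -> bool:
    # IMDSv2: obtain a short-lived session token, then query the notice.
    _, token = fetch(AWS_TOKEN_URL, {"X-aws-ec2-metadata-token-ttl-seconds": "60"}, "PUT")
    status, body = fetch(AWS_SPOT_URL, {"X-aws-ec2-metadata-token": token})
    return aws_interruption_pending(status, body)


def check_gcp() -> bool:
    _, body = fetch(GCP_PREEMPTED_URL, {"Metadata-Flavor": "Google"})
    return gcp_preempted(body)
```

A small daemon on the runner could call check_aws() or check_gcp() every few seconds. AWS gives roughly two minutes of warning and GCP about thirty seconds, so the reaction has to be quick.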

Pro Tip: Always have fallback capacity with on-demand instances or higher-priority pools for critical workloads.
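To make the fallback idea concrete, here is a small sketch of how a desired runner count could be split between on-demand and spot capacity. The two parameters mirror the OnDemandBaseCapacity and OnDemandPercentageAboveBaseCapacity settings of an AWS Auto Scaling mixed instances policy, but the rounding shown is my own assumption rather than documented AWS behavior.

```python
import math


def split_capacity(desired: int, on_demand_base: int,
                   on_demand_pct_above_base: int) -> tuple[int, int]:
    """Return (on_demand, spot) instance counts for a desired group size."""
    base = min(desired, on_demand_base)
    above = desired - base
    # Rounding up favors on-demand; this rounding choice is an assumption.
    extra_on_demand = math.ceil(above * on_demand_pct_above_base / 100)
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


print(split_capacity(10, 2, 25))  # → (4, 6)
```

With a base of 2 and 25% above base, a group of 10 keeps 4 runners on on-demand capacity for critical jobs and runs the remaining 6 on spot.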


3. Caching and Artifact Management

Networking Optimization

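  • AWS: Implement S3-based caching. Note that the stock actions/cache action stores entries in GitHub's own cache service, so an S3-backed cache needs an S3-compatible alternative action or custom upload/download steps.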
  • GCP: Use Cloud Storage or Artifact Registry for similar functionality.
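Whichever bucket stores the cache, the hit rate depends mostly on how the cache key is built. A minimal sketch, assuming the key is derived from the content of dependency lockfiles (similar in spirit to the hashFiles expression in workflows); the function name and key layout are illustrative only.

```python
import hashlib


def cache_key(prefix: str, platform: str, lockfiles: dict[str, bytes]) -> str:
    """Build a key that changes only when a lockfile's name or content changes."""
    digest = hashlib.sha256()
    # Sort by file name so the key does not depend on dict ordering.
    for name in sorted(lockfiles):
        digest.update(name.encode())
        digest.update(lockfiles[name])
    return f"{prefix}-{platform}-{digest.hexdigest()[:16]}"
```

The key can then be used as the object name in S3 or Cloud Storage, for example by reading the bytes of package-lock.json or requirements.txt and passing them in as lockfiles. Unchanged dependencies produce the same key, so the runner downloads one archive from a bucket in its own region instead of refetching every package from the internet.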

Docker Pulls

  • Reduce Docker pull costs by:
    • Setting up a pull-through cache in GCP Artifact Registry or AWS ECR.
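    • Using VPC endpoints (AWS) or Private Google Access (GCP) so that pulls from the provider's own registry and storage services stay on the private network instead of going through NAT or the public internet.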
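With a pull-through cache in place, image references have to point at the cache rather than at Docker Hub. The sketch below rewrites Docker Hub references into the form ECR pull-through cache expects, where official images sit under library/. The account ID, region, and the "docker-hub" prefix are placeholders, and the function itself is a hypothetical helper, not part of any AWS tooling.

```python
def to_pull_through(image: str, cache_host: str, prefix: str = "docker-hub") -> str:
    """Rewrite a Docker Hub image reference to go through a pull-through cache."""
    first, _, rest = image.partition("/")
    # A first path component with a dot or port (or "localhost") is a registry host.
    has_registry = bool(rest) and ("." in first or ":" in first or first == "localhost")
    if has_registry:
        if first != "docker.io":
            return image  # other registries are left untouched
        image = rest
    if "/" not in image:
        image = "library/" + image  # official images live under library/
    return f"{cache_host}/{prefix}/{image}"


host = "123456789012.dkr.ecr.us-east-1.amazonaws.com"  # placeholder account/region
print(to_pull_through("nginx:1.25", host))
# → 123456789012.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/nginx:1.25
```

A rewrite like this could be applied to workflow files or Dockerfiles in bulk. Artifact Registry remote repositories use a different host layout, so treat the GCP equivalent as something to check against its documentation.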

4. Cost Monitoring and Analysis

Both cloud providers offer tools to analyze costs:

  • AWS: Cost Explorer + CloudWatch for EC2 usage.
  • GCP: Billing Reports + Cloud Monitoring (formerly Stackdriver).

Key Metrics to Watch:

  1. Idle instance time
  2. Spot/preemptible interruption rates
  3. Network egress traffic
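These three metrics are simple to compute once you have per-runner records. The sketch below assumes a record shape of my own invention (uptime, busy time, spot flag, interruption flag, egress); in practice the fields would be assembled from CloudWatch or Cloud Monitoring data plus runner logs.

```python
from dataclasses import dataclass


@dataclass
class RunnerRecord:
    uptime_s: float     # total time the instance was running
    busy_s: float       # time spent executing jobs
    is_spot: bool       # spot/preemptible capacity
    interrupted: bool   # reclaimed by the provider before normal shutdown
    egress_gb: float    # outbound traffic


def summarize(records: list[RunnerRecord]) -> dict[str, float]:
    """Aggregate idle fraction, spot interruption rate, and total egress."""
    uptime = sum(r.uptime_s for r in records)
    busy = sum(r.busy_s for r in records)
    spot = [r for r in records if r.is_spot]
    return {
        "idle_fraction": (uptime - busy) / uptime if uptime else 0.0,
        "interruption_rate": sum(r.interrupted for r in spot) / len(spot) if spot else 0.0,
        "egress_gb": sum(r.egress_gb for r in records),
    }
```

Tracking idle_fraction over time is the quickest way to see whether changes to pool sizes are paying off.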

[Figure: cost breakdown]


Case Study: AWS Optimization Outcomes

  • Idle Runners Reduced: Adjusted runner pools based on org activity.
  • Spot Pools Optimized: Added AMD-based m6a instances, reducing costs by 30%.
  • Networking Costs: Introduced Docker pull-through caching with S3.

Case Study: GCP Adaptation

  • Dynamic Scaling: Managed Instance Groups with preemptible VMs.
  • Networking: Switched to Private Google Access so that runners without external IPs can reach Google APIs and services directly.
  • Preemptible Instances: n2-highmem provided a balance of cost and performance.

Results

Cost reduction metrics

Cloud Provider   Baseline Cost   Optimized Cost   Savings (%)
AWS              $10,000         $7,000           30%
GCP              $9,500          $6,500           ~32%
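The savings column can be rechecked from the two cost columns. A quick calculation using only the figures in the table:

```python
def savings_pct(baseline: float, optimized: float) -> float:
    """Percentage saved relative to the baseline cost."""
    return (baseline - optimized) / baseline * 100


for provider, baseline, optimized in [("AWS", 10_000, 7_000), ("GCP", 9_500, 6_500)]:
    print(f"{provider}: {savings_pct(baseline, optimized):.1f}%")
# AWS: 30.0%
# GCP: 31.6%
```

Both providers save the same $3,000 in absolute terms; the GCP percentage is slightly higher only because its baseline is lower.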

User Experience Improvements

  • Reduced job interruptions.
  • Faster job execution due to optimized runner configurations.

Future Opportunities

  1. IPv6 and NAT Gateway Optimization:
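    • Both AWS and GCP support IPv6. Traffic sent over IPv6 can bypass the NAT gateway (on AWS, through an egress-only internet gateway), which avoids NAT data processing charges for destinations that are reachable over IPv6.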
  2. Machine Learning for Scaling Decisions:
    • Use historical data to predict demand spikes.
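Before reaching for a full machine learning model, a plain statistical baseline is a reasonable first step for the scaling idea. The sketch below is not ML; it averages historical job counts per hour of day and adds headroom to suggest how many runners to pre-warm. The input format, function names, and the 120% headroom default are all assumptions made for this example.

```python
import math
from collections import defaultdict
from statistics import mean


def hourly_profile(history: list[tuple[int, int]]) -> dict[int, float]:
    """Average job count per hour of day from (hour, job_count) samples."""
    buckets: dict[int, list[int]] = defaultdict(list)
    for hour, jobs in history:
        buckets[hour % 24].append(jobs)
    return {hour: mean(counts) for hour, counts in buckets.items()}


def prewarm_target(profile: dict[int, float], hour: int,
                   headroom_pct: int = 120) -> int:
    """Runners to keep warm for the given hour, scaled by a headroom percentage."""
    # Hours with no history default to zero pre-warmed runners.
    return math.ceil(profile.get(hour % 24, 0) * headroom_pct / 100)
```

Feeding prewarm_target into the minimum size of the runner group a few minutes before each hour would absorb predictable spikes, such as the start of the working day, while reactive scaling handles the rest.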

Conclusion

Optimizing self-hosted GitHub Actions runners on AWS and GCP can save significant costs while improving performance. By dynamically scaling resources, leveraging spot/preemptible instances, and optimizing network usage, you can achieve a highly efficient setup tailored to your workloads.

Feel free to experiment with these strategies and share your results. Happy optimizing! 🚀
