Improving Our GitHub Actions Runner Orchestrator

wayofthepie

Posted on February 8, 2020

In the last post we got a simple actions runner orchestrator running with bash and cron. We also noted a few issues with that version. In this post we will fix up the following issues:

  • Instead of launching a runner per commit, we will launch a runner per check request.
  • Instead of running local docker containers, we will run kubernetes jobs.
  • Instead of just running locally with cron, we will create a kubernetes CronJob.

Let's do it!

GitHub Checks API

When a check run is requested, the GitHub Checks API will reflect this. For example:

$ curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/wayofthepie/gh-app-test/commits/e13119b0/check-runs
{
  "total_count": 1,
  "check_runs": [
    {
      "id": 433544203,
      "node_id": "MDg6Q2hlY2tSdW40MzM1NDQyMDM=",
      "head_sha": "e13119b07d81e4c587882b2f7c9d7a730810f709",
      "external_id": "ca395085-040a-526b-2ce8-bdc85f692774",
      "url": "https://api.github.com/repos/wayofthepie/gh-app-test/check-runs/433544203",
      "html_url": "https://github.com/wayofthepie/gh-app-test/runs/433544203",
      "details_url": "https://help.github.com/en/actions",
      "status": "queued",
      "conclusion": null,
      "started_at": "2020-02-08T14:40:27Z",
      "completed_at": null,
...

This returns the status of all the check runs for the given commit (e13119b0 in the url above is the short ref for that commit). As you can see, the status of the first check run is queued. In this case it means the check run is waiting for a runner to execute on.

Using this information also allows our script to be completely stateless. In the previous post it had to keep track of the last commit in a file. With the checks API we no longer need this.
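
To make that concrete, here is a minimal sketch of picking the queued check runs out of a response with jq. The helper name queued_check_run_ids and the second entry in the sample response are made up for illustration; only the id and status fields are taken from the response shown earlier.

```shell
#!/usr/bin/env bash
# Sketch: list the ids of check runs that are still waiting for a runner.
# queued_check_run_ids is a hypothetical helper, it is not part of orc.sh.

# Reads a check-runs API response on stdin, prints one id per queued run.
queued_check_run_ids() {
    jq -r '.check_runs[] | select(.status == "queued") | .id'
}

# A trimmed-down sample response. The second entry is invented
# so that the filter has something to skip.
sample_response='{
  "total_count": 2,
  "check_runs": [
    {"id": 433544203, "status": "queued", "conclusion": null},
    {"id": 1, "status": "completed", "conclusion": "success"}
  ]
}'

echo "${sample_response}" | queued_check_run_ids
```

Only the queued run's id is printed, so an empty result means there is nothing to launch a runner for.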

Making use of this new information

Now we can change the logic in our orchestration script as follows:

  1. Get the latest commit.
  2. Get all requested check runs for that commit.
  3. For each requested run launch an actions runner.

Here is the updated script:

#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z "${PAT}" ] || [ -z "${OWNER}" ] || [ -z "${REPO}" ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the latest commit
latest_commit=$(curl -s -H "authorization: token ${PAT}"  \
    https://api.github.com/repos/${OWNER}/${REPO}/commits |\
    jq -r .[0].sha)

# for each check run requested for this commit, get the "status"
# field and assign to the "check_status" variable 
for check_status in $(curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}"\
    https://api.github.com/repos/${OWNER}/${REPO}/commits/${latest_commit}/check-runs \
    | jq -r '.check_runs[] | "\(.status)"'); do

    # if "check_status" is queued launch an action runner
    if [ "${check_status}" == "queued" ]; then
        echo "Launching actions runner ..."
        docker run -d --rm actions-image \
            ${OWNER} \
            ${REPO} \
            ${PAT} \
            $(uuidgen)
    fi
done

The code up to this point can be found here.

Add a new commit to the repository you have been running the actions against and run ./orc.sh ${PAT} ${OWNER} ${REPO}. It should start a container and run the build.

Using kubernetes jobs to schedule actions runners

If we move to using kubernetes instead of our local docker daemon we can scale out much more easily. Let's update the script to launch the actions runners as kubernetes jobs instead of direct docker containers.

First we need a cluster. There are many ways to create one; I use Google Cloud, so I'm going to create a cluster there. This has a cost. See https://kubernetes.io/docs/setup/ for examples of local cluster setups.

To create a cluster on google cloud:

$ gcloud container clusters create actions-spawner --zone europe-west2-c --num-nodes 1
WARNING: Currently VPC-native is not ...
WARNING: Newly created clusters ...
...
Creating cluster actions-spawner in europe-west2-c... Cluster is being health-checked (master is healthy)...done.
Created [https://container.googleapis.com/v1/projects/monthly-hacking/zones/europe-west2-c/clusters/actions-spawner].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/europe-west2-c/actions-spawner?project=monthly-hacking
kubeconfig entry generated for actions-spawner.
NAME             LOCATION        MASTER_VERSION  MASTER_IP     MACHINE_TYPE   NODE_VERSION    NUM_NODES  STATUS
actions-spawner  europe-west2-c  1.13.11-gke.23  35.189.78.16  n1-standard-1  1.13.11-gke.23  1          RUNNING

This will create a cluster with one node; we don't need more than that to test. Make sure you have kubectl installed. Auth should be set up automatically for kubectl. To test, let's try to list the nodes:

$ kubectl get nodes
NAME                                             STATUS   ROLES    AGE     VERSION
gke-actions-spawner-default-pool-f1380f72-3wm5   Ready    <none>   4m47s   v1.13.11-gke.23

Looks good! Now, let's update our orchestration script:

#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z "${PAT}" ] || [ -z "${OWNER}" ] || [ -z "${REPO}" ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the latest commit
latest_commit=$(curl -s -H "authorization: token ${PAT}"  \
    https://api.github.com/repos/${OWNER}/${REPO}/commits |\
    jq -r .[0].sha)

# for each check run requested for this commit, get the "status"
# field and assign to the "check_status" variable
for check_status in $(curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}"\
    https://api.github.com/repos/${OWNER}/${REPO}/commits/${latest_commit}/check-runs \
    | jq -r '.check_runs[] | "\(.status)"'); do

    # if "check_status" is queued launch an action runner
    if [ "${check_status}" == "queued" ]; then
        echo "Found check run request with status ${check_status}, launching job ..."
        cat job.yaml \
            | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/; s/\{TOKEN\}/${PAT}/" \
            | kubectl apply -f -
    else
        echo "Found check run request with status '${check_status}', nothing to do ..."
    fi
done

And create the job.yaml, the specification for our kubernetes job:

apiVersion: batch/v1
kind: Job
metadata:
  name: {NAME}
spec:
  template:
    spec:
      containers:
      - name: {NAME}
        image: wayofthepie/actions-image
        args: ["{OWNER}", "{REPO}", "{TOKEN}"]
      restartPolicy: Never
  backoffLimit: 4

⚠️ WARNING ⚠️
The token here should be stored as a kubernetes secret. Using the token as I have above is not good practice. I will fix this later in this post.

Let's test it:

$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status 'completed', nothing to do ...

Great, it works when there are no runs requested. Commit to the repo you are testing against and run again:

$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status queued, launching job ...
job.batch/990a0d3d-bb98-419e-abc5-ca4fa48ca328 created

Looks like it worked. Let's see what's running:

$ kubectl get jobs
NAME                                   COMPLETIONS   DURATION   AGE
990a0d3d-bb98-419e-abc5-ca4fa48ca328   0/1           5s         5s

$ kubectl get pods
NAME                                         READY   STATUS      RESTARTS   AGE
990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c   1/1     Running     0          8s

$ kubectl logs 990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c -f
Unrecognized command-line input arguments: 'name'. For usage refer to: .\config.cmd --help or ./config.sh --help

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication


√ Connected to GitHub

# Runner Registration

Enter the name of runner: [press Enter for 990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c]
√ Runner successfully added
√ Runner connection is good
# Runner settings


√ Settings Saved.


√ Connected to GitHub

2020-02-08 16:35:49Z: Listening for Jobs
2020-02-08 16:35:53Z: Running job: build

Great! It kicked off a build. However, notice the warning at the start:

Unrecognized command-line input arguments: 'name'. For usage refer to: .\config.cmd --help or ./config.sh --help

Something is wrong... A quick look through orc.sh and job.yaml highlights the issue - we are missing the fourth argument for the wayofthepie/actions-image image! This sets the name of the actions runner. Let's fix it up:

apiVersion: batch/v1
kind: Job
metadata:
  name: {NAME}
spec:
  template:
    spec:
      containers:
      - name: {NAME}
        image: wayofthepie/actions-image
        # here we add the name argument
        args: ["{OWNER}", "{REPO}", "{TOKEN}", "{NAME}"] 
      restartPolicy: Never
  backoffLimit: 4

Let's commit to the test repo and run again:

$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status queued, launching job ...
job.batch/7abcb7a1-b1bb-4641-88af-fc4562e29bb7 created

$ kubectl get jobs
NAME                                   COMPLETIONS   DURATION   AGE
7abcb7a1-b1bb-4641-88af-fc4562e29bb7   0/1           4s         4s
990a0d3d-bb98-419e-abc5-ca4fa48ca328   1/1           46s        12m

$ kubectl get pods
NAME                                         READY   STATUS      RESTARTS   AGE
7abcb7a1-b1bb-4641-88af-fc4562e29bb7-q5jdd   1/1     Running     0          13s
990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c   0/1     Completed   0          12m

$ kubectl logs 7abcb7a1-b1bb-4641-88af-fc4562e29bb7-q5jdd

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication


√ Connected to GitHub

# Runner Registration


√ Runner successfully added
√ Runner connection is good

# Runner settings


√ Settings Saved.


√ Connected to GitHub

2020-02-08 16:48:30Z: Listening for Jobs
2020-02-08 16:48:34Z: Running job: build
2020-02-08 16:48:52Z: Job build completed with result: Succeeded

# Runner removal


√ Runner removed successfully
√ Removed .credentials
√ Removed .runner

Great! No warnings, all working as expected. The code up to this point can be seen here.
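
Because job.yaml is rendered with plain sed, a misspelled or forgotten substitution would pass silently through to kubectl. One possible guard, sketched here under the assumption that every placeholder in the template looks like {UPPERCASE}, is a small filter between sed and kubectl. The name assert_rendered is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: fail loudly if a rendered manifest still contains a {PLACEHOLDER}.
# assert_rendered is a hypothetical helper, it is not part of orc.sh.

# Reads a rendered manifest on stdin. Prints it unchanged if no
# {UPPERCASE} tokens remain, otherwise lists them on stderr and fails.
assert_rendered() {
    local manifest
    manifest=$(cat)
    if grep -Eq '\{[A-Z_]+\}' <<< "${manifest}"; then
        echo "Unreplaced placeholders:" >&2
        grep -Eo '\{[A-Z_]+\}' <<< "${manifest}" | sort -u >&2
        return 1
    fi
    printf '%s\n' "${manifest}"
}

# demo: the first manifest is fully rendered, the second still contains {REPO}
printf 'args: ["wayofthepie", "gh-app-test"]\n' | assert_rendered
printf 'args: ["wayofthepie", "{REPO}"]\n' | assert_rendered \
    || echo "render check failed"
```

In orc.sh this would sit in the middle of the pipeline, as in cat job.yaml | sed ... | assert_rendered | kubectl apply -f -. On failure nothing is written to stdout, so kubectl receives no manifest to apply.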

Storing our token properly

Right now our personal access token gets stored in the definition of our job! If we retrieve our job we can see it:

$ kubectl get jobs 990a0d3d-bb98-419e-abc5-ca4fa48ca328 -o json
{
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        ...
    },
    "spec": {
        ...
        "template": {
           ...
            "spec": {
                "containers": [
                    {
                        "args": [
                            "wayofthepie",
                            "gh-app-test",
                            "THE TOKEN IS IN HERE!!!"
                        ],
                        "image": "wayofthepie/actions-image",
                        "imagePullPolicy": "Always",
                        "name": "990a0d3d-bb98-419e-abc5-ca4fa48ca328",
                        ...
                    }
                ],
    ...
}

This is not good! We should be storing this as a kubernetes secret. Let's do that. First create a secret.yaml defining our secret:

apiVersion: v1
kind: Secret
metadata:
  name: github-token
type: Opaque
stringData:
  token: {TOKEN}

To create the secret:

$ cat secret.yaml \
    | sed -r "s/\{TOKEN\}/${YOUR_PAT}/"\
    | kubectl apply -f -
secret/github-token created

$ kubectl get secrets
NAME                  TYPE                                  DATA   AGE
...
github-token          Opaque                                1      30s

With that created let's update our job spec in job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: {NAME}
spec:
  template:
    spec:
      containers:
      - name: {NAME}
        image: wayofthepie/actions-image
        args: ["{OWNER}", "{REPO}", "$(GITHUB_TOKEN)", "{NAME}"]
        env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-token
              key: token
      restartPolicy: Never
  backoffLimit: 4

See these docs for more on reading secret data into env vars.

Also note the syntax used to reference GITHUB_TOKEN: it is $(GITHUB_TOKEN), not ${GITHUB_TOKEN}, see here.

Re-run and everything should work:

$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status queued, launching job ...
job.batch/eb1e314d-594b-4253-ae8a-74c797a2cd76 created

$ kubectl get pods
NAME                                         READY   STATUS              RESTARTS   AGE
eb1e314d-594b-4253-ae8a-74c797a2cd76-x47lr   0/1     ContainerCreating   0          3s

$ kubectl logs -f eb1e314d-594b-4253-ae8a-74c797a2cd76-x47lr
...
2020-02-08 17:41:08Z: Listening for Jobs
2020-02-08 17:41:12Z: Running job: build
2020-02-08 17:41:30Z: Job build completed with result: Succeeded

# Runner removal


√ Runner removed successfully
√ Removed .credentials
√ Removed .runner

Great! We should also clean up orc.sh:

#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z "${PAT}" ] || [ -z "${OWNER}" ] || [ -z "${REPO}" ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the latest commit
latest_commit=$(curl -s -H "authorization: token ${PAT}"  \
    https://api.github.com/repos/${OWNER}/${REPO}/commits |\
    jq -r .[0].sha)

# for each check run requested for this commit, get the "status"
# field and assign to the "check_status" variable
for check_status in $(curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}"\
    https://api.github.com/repos/${OWNER}/${REPO}/commits/${latest_commit}/check-runs \
    | jq -r '.check_runs[] | "\(.status)"'); do

    # if "check_status" is queued launch an action runner
    if [ "${check_status}" == "queued" ]; then
        echo "Found check run request with status ${check_status}, launching job ..."
        # we removed the {TOKEN} replacement here
        cat job.yaml \
            | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/" \
            | kubectl apply -f -
    else
        echo "Found check run request with status '${check_status}', nothing to do ..."
    fi
done

The code up to this point can be found here.

Running our script as a kubernetes cronjob

To run our orchestrator script as a kubernetes cronjob we first need to create a docker image:

FROM ubuntu

RUN useradd -m actions \
    && apt-get update \
    && apt-get install -y \
    curl \
    jq \
    uuid-runtime

RUN curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.17.0/bin/linux/amd64/kubectl \
    && mv kubectl /usr/local/bin \
    && chmod +x /usr/local/bin/kubectl

WORKDIR /home/actions

USER actions

# orc.sh reads job.yaml from its working directory, so copy both
COPY orc.sh job.yaml ./
ENTRYPOINT ["./orc.sh"]

I built this as wayofthepie/actions-orchestrator and pushed it to the public docker registry. Next, let's create a CronJob spec:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: actions-orchestrator
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: actions-orchestrator
            image: wayofthepie/actions-orchestrator
            args: ["$(GITHUB_TOKEN)", "{OWNER}", "{REPO}"]
            env:
            - name: GITHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: github-token
                  key: token
          restartPolicy: Never

This will run our orchestrator every minute. To create the cronjob, substituting your own owner and repo:

$ cat cron.yaml \
    | sed -r "s/\{OWNER\}/wayofthepie/; s/\{REPO\}/gh-app-test/" \
    | kubectl apply -f -
cronjob.batch/actions-orchestrator created

$ kubectl get cronjob
NAME                   SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
actions-orchestrator   */1 * * * *   False     0        <none>          7s

$ kubectl get pods
NAME                                         READY   STATUS              RESTARTS   AGE
actions-orchestrator-1581193620-4j8jz        0/1     ContainerCreating   0          1s

$ kubectl logs actions-orchestrator-1581193620-4j8jz
Found check run request with status queued, launching job ...
Error from server (Forbidden): error when retrieving current configuration of:
...
from server for: "STDIN": jobs.batch "316b08ed-89e7-4321-a521-897c7a40fa50" is forbidden: User "system:serviceaccount:default:default" cannot get resource "jobs" in API group "batch" in the namespace "default"

An error! Let's delete the cronjob so it doesn't keep running (kubectl delete cronjob actions-orchestrator) and investigate.

Assigning the correct permissions

It seems the default service account we get in the pod does not have access to the jobs resource. To fix this we need to create a ClusterRole and ClusterRoleBinding:

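kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  # assumed name, the binding's metadata is missing from this listing
  name: jobs-manager
subjects:
# the service account named in the Forbidden error above
- kind: ServiceAccount
  name: default
  namespace: default
roleRef: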
  kind: ClusterRole
  name: jobs-manager
  apiGroup: rbac.authorization.k8s.io
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: jobs-manager
rules:
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

And create it:

$ kubectl apply -f cluster-role.yaml
clusterrole.rbac.authorization.k8s.io/default created

Re-create our cronjob:

$ cat cron.yaml | sed -r "s/\{OWNER\}/wayofthepie/; s/\{REPO\}/gh-app-test/" | kubectl apply -f -
cronjob.batch/actions-orchestrator created

# it should run every minute
$ kubectl get pods
NAME                                    READY   STATUS    RESTARTS   AGE
actions-orchestrator-1581196020-tmbll   1/1     Running   0          4s


Great! Now if we commit to the test repo it should create a new job for the requested check runs.
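
Rather than waiting for the next cron run to find out whether the permissions are right, we can ask the API server directly with kubectl auth can-i. The sketch below is one way to wrap that. The function name can_manage_jobs and the KUBECTL variable are made up for this example, and the verbs checked (get, create, patch) are my assumption about what kubectl apply needs.

```shell
#!/usr/bin/env bash
# Sketch: check whether a service account may manage jobs.
# can_manage_jobs and KUBECTL are hypothetical names for this example.

# allow the kubectl binary to be overridden, e.g. for testing
KUBECTL="${KUBECTL:-kubectl}"

# Usage: can_manage_jobs <namespace> <service-account>
# Prints one line per verb and returns non-zero if any verb is denied.
can_manage_jobs() {
    local namespace=$1 account=$2 verb answer denied=0
    for verb in get create patch; do
        # can-i prints "yes" or "no" and exits non-zero on "no"
        answer=$("${KUBECTL}" auth can-i "${verb}" jobs \
            --namespace "${namespace}" \
            --as "system:serviceaccount:${namespace}:${account}") || true
        echo "${verb} jobs: ${answer}"
        [ "${answer}" = "yes" ] || denied=1
    done
    return "${denied}"
}
```

Running can_manage_jobs default default against the cluster should report yes for each verb once the binding is in place. Note that --as impersonates the service account, so your own user needs permission to impersonate other accounts.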

Fixing a bug in our logic

We still only check the last commit for check requests, meaning we can still miss requests and leave check runs for some commits idle. This is a bug. The real fix would require either a lot of API calls or webhooks. For now, we can look at the last 5 minutes of commits rather than just the last commit. If we run the script every minute, there is a much smaller chance of missing check runs.

The updates to orc.sh are:

#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z "${PAT}" ] || [ -z "${OWNER}" ] || [ -z "${REPO}" ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the time five minutes ago, in UTC, in the format the github api wants
function five_minutes_ago {
    date --utc --date='5 minutes ago' +%Y-%m-%dT%H:%M:%SZ
}

echo "Getting commits from the last 5 minutes ..."
commits=$(curl -s -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    "https://api.github.com/repos/${OWNER}/${REPO}/commits?since=$(five_minutes_ago)" \
    | jq -r '.[].sha')

# commits is a whitespace separated list of shas, so iterate over it unquoted
for commit in ${commits}; do
    echo "Checking ${commit} for check requests ..."

    # for each check run requested for this commit, get the "status"
    # field and assign to the "check_status" variable
    for check_status in $(curl -s \
        -H "accept: application/vnd.github.antiope-preview+json" \
        -H "authorization: token ${PAT}"\
        https://api.github.com/repos/${OWNER}/${REPO}/commits/${commit}/check-runs \
        | jq -r '.check_runs[] | "\(.status)"'); do

        # if "check_status" is queued launch an action runner
        if [ "${check_status}" == "queued" ]; then
            echo "Found check run request with status ${check_status}, launching job ..."
            cat job.yaml \
                | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/" \
                | kubectl apply -f -
        else
            echo "Found check run request with status '${check_status}', nothing to do ..."
        fi
    done
done

Rebuild the actions orchestrator image, push and it should all work! The image up to this point is tagged as wayofthepie/actions-orchestrator:8-2-2020.

The code up to this point can be found here.

Conclusion

We now have a way of orchestrating actions runners on kubernetes. There are still a few issues, however:

  • There is no error recovery and the error messages are pretty bad. For example if for some reason the cronjob does not run for 5+ minutes we may miss commits and check runs.
  • It would be much better to use webhooks here.
  • We currently only support watching a single repository.
  • Things are getting complicated with bash!

I will tackle some of these in the next post.
