Retry automatically with Exponential Backoff in Cloud Workflows

koshilife

Kenji Koshikawa

Posted on February 13, 2022

Recently I built workflows with Cloud Workflows to combine several Cloud Functions and Cloud Run services on Google Cloud.

Sometimes I encountered 429 or 500 errors caused by scaling issues when calling a Cloud Functions or Cloud Run endpoint.

According to the official documentation, the solution is to retry with exponential backoff.

The solution:
For HTTP trigger-based functions, have the client implement exponential backoff and retries for requests that must not be dropped.

This post shows how to implement that solution in Cloud Workflows.
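Before moving to Cloud Workflows, the quoted advice can be illustrated with a minimal client-side Python sketch. Everything named here is an assumption made for illustration: HttpStatusError and call_with_backoff are hypothetical, send stands for whatever performs the HTTP request, and the defaults (5 attempts, 10 seconds initial wait, retry on 429 and 500) simply mirror the values used later in this post.

```python
import time


class HttpStatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"HTTP server responded with error code {code}")
        self.code = code


def call_with_backoff(send, max_attempts=5, first_sleep_sec=10,
                      retry_codes=(429, 500), sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each retryable error."""
    sleep_time = first_sleep_sec
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except HttpStatusError as e:
            # Give up on the last attempt or on a status code we do not retry.
            if attempt >= max_attempts or e.code not in retry_codes:
                raise
            sleep(sleep_time)
            sleep_time *= 2  # exponential backoff: 10, 20, 40, ...
```

Passing sleep as a parameter is only a convenience so the waits can be observed or skipped when trying the sketch out.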

An example of Cloud Functions

I made a simple Cloud Function that makes the scaling errors easy to reproduce:

  • it just sleeps for 3 seconds
  • its scaling settings are set to the minimum (min instances: 0, max instances: 1)

foobar/main.py

import time

import flask

def main(request):
    time.sleep(3)
    return flask.jsonify({'result': 'ok'})

The following commands deploy it to Cloud Functions.

# Deploys the function
$ gcloud functions deploy foobar \
  --entry-point main \
  --runtime python39 \
  --trigger-http \
  --region asia-northeast1 \
  --timeout 120 \
  --memory 128MB \
  --min-instances 0 \
  --max-instances 1 \
  --source ./foobar

# Grants the service account used by the workflow permission to invoke the function
$ gcloud functions add-iam-policy-binding foobar \
    --region=asia-northeast1 \
    --member=serviceAccount:${YOUR-SERVICE-ACCOUNT} \
    --role=roles/cloudfunctions.invoker

A workflow for reproducing the scaling errors

First, here is a workflow that reproduces the scaling errors.

main:
    params: [input]
    steps:
    - callFunc:
        call: http.get
        args:
            url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
            auth:
                type: OIDC
        result: api_result
    - returnOutput:
        return: ${api_result.body}

The following command deploys it to Cloud Workflows.

$ gcloud workflows deploy v1 \
                --source=v1.yml \
                --location=asia-southeast1 \
                --service-account=${YOUR-SERVICE-ACCOUNT}

To reproduce the scaling errors, I ran the following command in the background more than 20 times.

$ gcloud workflows run --project=${YOUR-PROJECT} --location=asia-southeast1 v1 --data='{}' &
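If repeating the command by hand is tedious, a small Python helper can start a batch of executions at once. This is a rough sketch, not something I used for the results in this post: build_run_command and launch_executions are hypothetical names, the flags are the ones from the command above, and it assumes gcloud is installed and authenticated.

```python
import subprocess


def build_run_command(project, location, workflow, data="{}"):
    """Build the gcloud invocation shown above as an argument list."""
    return [
        "gcloud", "workflows", "run", workflow,
        f"--project={project}",
        f"--location={location}",
        f"--data={data}",
    ]


def launch_executions(count, command, popen=subprocess.Popen):
    """Start `count` executions without waiting, then collect their exit codes."""
    processes = [popen(command) for _ in range(count)]
    return [p.wait() for p in processes]
```

For example, launch_executions(20, build_run_command("your-project", "asia-southeast1", "v1")) starts 20 executions of the workflow v1 in parallel, where "your-project" is a placeholder for your own project ID.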

As expected, the 429 error was reproduced many times. Only about 6 in 20 workflow executions succeeded.

The Cloud Workflows and Cloud Functions consoles showed the following error information.

HTTP server responded with error code 429
in step "callFunc", routine "main", line: 5
{
  "body": "Rate exceeded.",
  "code": 429,
  "headers": {
    "Alt-Svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\"",
    "Content-Length": "14",
    "Content-Type": "text/html",
    "Date": "Wed, 09 Feb 2022 08:17:19 GMT",
    "Server": "Google Frontend",
    "X-Cloud-Trace-Context": "2a8e4ba95570e4a6585a0b678d7f3b98"
  },
  "message": "HTTP server responded with error code 429",
  "tags": [
    "HttpError"
  ]
}

The solution

Next, the following workflow is the solution: it retries automatically with exponential backoff.

The sub-workflow call_api retries with exponential backoff when the endpoint returns HTTP status 429 or 500. I set retry_count to 5 (at most 5 attempts in total) and the initial sleep time to 10 seconds.

main:
    params: [input]
    steps:
    - callFunc:
        call: call_api
        args:
            url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
        result: api_result
    - return_output:
        return: ${api_result.body}

call_api:
    params: [url]
    steps:
        - setup:
            assign:
                - retry_count: 5
                - first_sleep_sec: 10
                - sleep_time: ${first_sleep_sec}
        - try_many_times:
            for:
                value: count
                range: [1, ${retry_count}]
                steps:
                    - log_before_call:
                        call: sys.log
                        args:
                            text: ${"call_api url=" + url + " (" + string(count) + "/" + string(retry_count) + ")"}
                    - try_call_block:
                        try:
                            steps:
                                - request_url:
                                    call: http.get
                                    args:
                                        url: ${url}
                                        auth:
                                            type: OIDC
                                    result: api_result
                                - return_result:
                                    return: ${api_result}
                        except:
                            as: e
                            steps:
                                - handle_error:
                                    switch:
                                        - condition: ${count >= retry_count}
                                          raise: ${e}
                                        - condition: ${not("HttpError" in e.tags)}
                                          raise: ${e}
                                        - condition: ${(e.code == 429 or e.code == 500)}
                                          next: log_sleep_time
                                        - condition: true
                                          raise: ${e}
                                - log_sleep_time:
                                    call: sys.log
                                    args:
                                        severity: 'WARNING'
                                        text: ${"got HTTP status " + string(e.code) + ". waiting " + string(sleep_time) + " seconds."}
                                - wait:
                                    call: sys.sleep
                                    args:
                                        seconds: ${sleep_time}
                                - update_sleep_time:
                                    assign:
                                        - sleep_time: ${sleep_time * 2}
                                - next_continue:
                                    next: continue

In the same way, I deployed this workflow and executed it more than 20 times.

This time all of the workflow executions succeeded. One of them took over 3 minutes, but the retries worked as expected.

According to the logs, the wait time doubled on each retry: 10, 20, 40, and 80 seconds.
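The wait times follow directly from the settings in call_api. As a quick check, here is a small Python sketch of the arithmetic; backoff_schedule is a hypothetical helper, and it assumes, as in the workflow above, that no wait follows the final attempt.

```python
def backoff_schedule(first_sleep_sec, max_attempts):
    """Return the waits between attempts; the last attempt has no wait after it."""
    return [first_sleep_sec * 2 ** i for i in range(max_attempts - 1)]


waits = backoff_schedule(first_sleep_sec=10, max_attempts=5)
print(waits)       # [10, 20, 40, 80]
print(sum(waits))  # 150
```

So in the worst case an execution spends 150 seconds just waiting, on top of the time taken by the calls themselves.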

Conclusion

This post introduced a way to retry automatically with exponential backoff in Cloud Workflows and showed that it is effective: processing can continue even when scaling problems occur.


(Addition) A simpler way using the try/retry statement

There is a simpler way: the try/retry statement. Thanks to @krisbraun, who shared it in a comment on this article.

Pattern 1: using the default retry policy (very simple)

If your function is idempotent, I think the default ${http.default_retry} covers most use cases. The documentation describes it as follows:

Simple default retry policy for idempotent targets.
Retries on 429 (Too Many Requests), 502 (Bad Gateway), 503 (Service unavailable), and 504 (Gateway Timeout), as well as on any ConnectionError and TimeoutError.
Uses max_retries of 5, and backoff as per retry.default_backoff.

main:
    params: [input]
    steps:
    - call_api:
        try:
            call: http.get
            args:
                url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
                auth:
                    type: OIDC
            result: api_result
        retry: ${http.default_retry}
    - return_value:
        return: ${api_result.body}

Note: ${http.default_retry} does not retry on a 500 status code.
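To make the difference concrete, the retry conditions quoted above can be written as a small Python predicate. This is only an illustration of the documented conditions, not the actual implementation of ${http.default_retry}; should_retry is a hypothetical name, and the error is assumed to be a dict shaped like the error map shown earlier (code and tags).

```python
DEFAULT_RETRY_CODES = {429, 502, 503, 504}
DEFAULT_RETRY_TAGS = {"ConnectionError", "TimeoutError"}


def should_retry(error, retry_codes=DEFAULT_RETRY_CODES):
    """Return True if the error matches the documented default retry conditions."""
    if error.get("code") in retry_codes:
        return True
    # Connection and timeout errors are identified by tag, not by status code.
    return bool(DEFAULT_RETRY_TAGS & set(error.get("tags", [])))


print(should_retry({"code": 429, "tags": ["HttpError"]}))  # True
print(should_retry({"code": 500, "tags": ["HttpError"]}))  # False
```

The second call shows why a function that answers with 500 while scaling is not covered by the default policy.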

Pattern 2: using a custom policy

The following workflow retries on a 500 status code in addition to the conditions covered by ${http.default_retry}. It also sets a custom backoff with an initial delay of 10 seconds, a maximum delay of 300 seconds, and a multiplier of 2.

main:
    params: [input]
    steps:
    - call_api:
        try:
            call: http.get
            args:
                url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
                auth:
                    type: OIDC
            result: api_result
        retry:
            predicate: ${custom_retry_policy}
            backoff:
                initial_delay: 10
                max_delay: 300
                multiplier: 2
    - return_value:
        return: ${api_result.body}

custom_retry_policy:
    params: [e]
    steps:
    - assign_retry_codes:
        assign:
            - retry_codes: [429, 500, 502, 503, 504]
    - what_to_repeat:
        switch:
            - condition: ${("code" in e) and (e.code in retry_codes)}
              return: True
            - condition: ${("tags" in e) and ("ConnectionError" in e.tags)}
              return: True
            - condition: ${("tags" in e) and ("TimeoutError" in e.tags)}
              return: True
    - otherwise:
        return: False
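For a feel of how these backoff settings play out, here is a small Python sketch of the delay sequence. It assumes each delay is the previous one times multiplier, capped at max_delay, which is my reading of the settings rather than something verified against the service; retry_delays is a hypothetical helper, and the count of 7 retries is arbitrary, chosen only so the cap becomes visible.

```python
def retry_delays(initial_delay, max_delay, multiplier, retries):
    """Return the delay before each retry, capped at max_delay (assumed behavior)."""
    delays = []
    delay = initial_delay
    for _ in range(retries):
        delays.append(min(delay, max_delay))
        delay *= multiplier
    return delays


print(retry_delays(initial_delay=10, max_delay=300, multiplier=2, retries=7))
# [10, 20, 40, 80, 160, 300, 300]
```

The first four delays match the 10, 20, 40, and 80 seconds of the hand-written sub-workflow.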