Retry automatically with Exponential Backoff in Cloud Workflows
Kenji Koshikawa
Posted on February 13, 2022
Recently I have been building workflows with Cloud Workflows to combine modules running on Cloud Functions or Cloud Run on Google Cloud.
Sometimes I encountered 429 or 500 errors related to scaling issues when calling a Cloud Functions or Cloud Run endpoint.
This official document introduces retrying with exponential backoff as the solution:
For HTTP trigger-based functions, have the client implement exponential backoff and retries for requests that must not be dropped.
So this post shows how to implement that solution in Cloud Workflows.
An example Cloud Function
I made a simple Cloud Function to reproduce the scaling errors easily, as shown below.
- It just sleeps for 3 seconds.
- The scaling settings are set to the minimum (min instances: 0, max instances: 1).
foobar/main.py
import time

import flask


def main(request):
    # Sleep for 3 seconds to make the endpoint slow on purpose
    time.sleep(3)
    return flask.jsonify({'result': 'ok'})
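If you want to try the function locally before deploying it, the Functions Framework can serve it; the port below is arbitrary:

# Serves the function locally with the Functions Framework
$ pip install functions-framework
$ functions-framework --source ./foobar/main.py --target main --port 8080

# In another terminal: should return {"result":"ok"} after about 3 seconds
$ curl http://localhost:8080/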
The following commands deploy the function to Cloud Functions and grant the workflow's service account permission to invoke it.
# Deploys the function
$ gcloud functions deploy foobar \
--entry-point main \
--runtime python39 \
--trigger-http \
--region asia-northeast1 \
--timeout 120 \
--memory 128MB \
--min-instances 0 \
--max-instances 1 \
--source ./foobar
# Grants the service account used by the workflow permission to invoke the function
$ gcloud functions add-iam-policy-binding foobar \
--region=asia-northeast1 \
--member=serviceAccount:${YOUR-SERVICE-ACCOUNT} \
--role=roles/cloudfunctions.invoker
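As a quick sanity check, the deployed endpoint can be called directly with an identity token; note that your own account also needs the roles/cloudfunctions.invoker role for this to work:

# Calls the function directly; should return {"result":"ok"} after about 3 seconds
$ curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  https://asia-northeast1-xxx.cloudfunctions.net/foobar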
A workflow for reproducing the scaling errors
First, the following workflow reproduces the scaling errors.
main:
  params: [input]
  steps:
    - callFunc:
        call: http.get
        args:
          url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
          auth:
            type: OIDC
        result: api_result
    - returnOutput:
        return: ${api_result.body}
The following command deploys the workflow to Cloud Workflows.
$ gcloud workflows deploy v1 \
--source=v1.yml \
--location=asia-southeast1 \
--service-account=${YOUR-SERVICE-ACCOUNT}
In order to reproduce the scaling errors, I executed the following shell command more than 20 times.
$ gcloud workflows run --project=${YOUR-PROJECT} --location=asia-southeast1 v1 --data='{}' &
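For example, a small shell loop like the following (just one way to do it) kicks off 20 executions in parallel:

# Runs the workflow 20 times in parallel
$ for i in $(seq 1 20); do
    gcloud workflows run --project=${YOUR-PROJECT} --location=asia-southeast1 v1 --data='{}' &
  done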
As expected, the 429 error was reproduced many times. Only about 6 of the 20 workflow executions succeeded.
In the Cloud Workflows and Cloud Functions consoles, I could see error information like the following.
HTTP server responded with error code 429
in step "callFunc", routine "main", line: 5
{
  "body": "Rate exceeded.",
  "code": 429,
  "headers": {
    "Alt-Svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\"",
    "Content-Length": "14",
    "Content-Type": "text/html",
    "Date": "Wed, 09 Feb 2022 08:17:19 GMT",
    "Server": "Google Frontend",
    "X-Cloud-Trace-Context": "2a8e4ba95570e4a6585a0b678d7f3b98"
  },
  "message": "HTTP server responded with error code 429",
  "tags": [
    "HttpError"
  ]
}
The solution
Next, the following workflow is the solution: it retries automatically with exponential backoff.
In the sub-workflow call_api, the request is retried with exponential backoff when an HTTP 429 or 500 status code is returned. I set the retry count to 5 and the initial sleep time to 10 seconds.
main:
  params: [input]
  steps:
    - callFunc:
        call: call_api
        args:
          url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
        result: api_result
    - return_output:
        return: ${api_result.body}

call_api:
  params: [url]
  steps:
    - setup:
        assign:
          - retry_count: 5
          - first_sleep_sec: 10
          - sleep_time: ${first_sleep_sec}
    - try_many_times:
        for:
          value: count
          range: [1, ${retry_count}]
          steps:
            - log_before_call:
                call: sys.log
                args:
                  text: ${"call_api url=" + url + " (" + string(count) + "/" + string(retry_count) + ")"}
            - try_call_block:
                try:
                  steps:
                    - request_url:
                        call: http.get
                        args:
                          url: ${url}
                          auth:
                            type: OIDC
                        result: api_result
                    - return_result:
                        return: ${api_result}
                except:
                  as: e
                  steps:
                    - handle_error:
                        switch:
                          - condition: ${count >= retry_count}
                            raise: ${e}
                          - condition: ${not("HttpError" in e.tags)}
                            raise: ${e}
                          - condition: ${(e.code == 429 or e.code == 500)}
                            next: log_sleep_time
                          - condition: true
                            raise: ${e}
                    - log_sleep_time:
                        call: sys.log
                        args:
                          severity: 'WARNING'
                          text: ${"got HTTP status " + string(e.code) + ". waiting " + string(sleep_time) + " seconds."}
                    - wait:
                        call: sys.sleep
                        args:
                          seconds: ${sleep_time}
                    - update_sleep_time:
                        assign:
                          - sleep_time: ${sleep_time * 2}
                    - next_continue:
                        next: continue
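The updated workflow can be deployed just like the first one; the workflow name v2 and the file name v2.yml below are only example placeholders:

# Deploys the workflow that retries with exponential backoff
$ gcloud workflows deploy v2 \
  --source=v2.yml \
  --location=asia-southeast1 \
  --service-account=${YOUR-SERVICE-ACCOUNT}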
After deploying this workflow, I executed it more than 20 times in the same way.
As a result, all of the workflow executions succeeded. One of them took over 3 minutes, but the retries worked as expected.
According to the logs, the wait time grew exponentially on each retry: 10, 20, 40, 80 seconds, and so on.
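To pull these retry logs from the command line, a query along the following lines should work; the resource filter is an assumption on my side rather than something from the setup above:

# Reads recent log entries written by the workflow executions
$ gcloud logging read 'resource.type="workflows.googleapis.com/Workflow"' \
  --limit=50 --freshness=1h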
Conclusion
This post introduced an implementation of automatic retries with exponential backoff in Cloud Workflows and showed that the solution is effective: processing can continue even when scaling problems occur.
(Addition) A simpler way using the try/retry statement
There is a simpler way that uses the try/retry statement. Thanks to @krisbraun, who shared it in this article's comments.
Pattern 1: using the default retry policy (very simple)
If your function is idempotent, I think most use cases can be covered by the default ${http.default_retry} policy.
Simple default retry policy for idempotent targets.
Retries on 429 (Too Many Requests), 502 (Bad Gateway), 503 (Service unavailable), and 504 (Gateway Timeout), as well as on any ConnectionError and TimeoutError.
Uses max_retries of 5, and backoff as per retry.default_backoff.
main:
  params: [input]
  steps:
    - call_api:
        try:
          call: http.get
          args:
            url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
            auth:
              type: OIDC
          result: api_result
        retry: ${http.default_retry}
    - return_value:
        return: ${api_result.body}
Notice: ${http.default_retry} doesn't retry on a 500 code.
Pattern 2: using a custom retry policy
The following workflow adds a retry condition for the 500 code on top of the ${http.default_retry} retry conditions. The custom backoff setting uses an initial delay of 10 seconds, a maximum delay of 300 seconds, and a multiplier of 2.
main:
  params: [input]
  steps:
    - call_api:
        try:
          call: http.get
          args:
            url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
            auth:
              type: OIDC
          result: api_result
        retry:
          predicate: ${custom_retry_policy}
          backoff:
            initial_delay: 10
            max_delay: 300
            multiplier: 2
    - return_value:
        return: ${api_result.body}

custom_retry_policy:
  params: [e]
  steps:
    - assign_retry_codes:
        assign:
          - retry_codes: [429, 500, 502, 503, 504]
    - what_to_repeat:
        switch:
          - condition: ${("code" in e) and (e.code in retry_codes)}
            return: True
          - condition: ${("tags" in e) and ("ConnectionError" in e.tags)}
            return: True
          - condition: ${("tags" in e) and ("TimeoutError" in e.tags)}
            return: True
    - otherwise:
        return: False
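This pattern can be deployed and executed in the same way as the earlier workflows; the workflow name v3 and the file name v3.yml are only example placeholders:

# Deploys and runs the workflow with the custom retry policy
$ gcloud workflows deploy v3 \
  --source=v3.yml \
  --location=asia-southeast1 \
  --service-account=${YOUR-SERVICE-ACCOUNT}
$ gcloud workflows run --project=${YOUR-PROJECT} --location=asia-southeast1 v3 --data='{}'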