Varnish Caching and 503 | Backend Fetch Failed


Nekhil Malik

Posted on May 7, 2023


Introduction

Hello fellow developers!

In this post, I will discuss the infamous Varnish error that haunts people under various names:

  • 5xx(503)
  • Backend Fetch Failed
  • Guru meditation error

We faced the same issue; it was very difficult to understand, and there was no direct fix for it. So I would like to share the steps we took to find out why the error occurred.


Prerequisites

You should have working knowledge of the following:

  • Varnish
  • Nginx (basics; relevant if your configuration uses it as an intermediate service between Varnish and the backend)
  • Logs (I use Elasticsearch and Kibana)

I am assuming you already know all of the above, so I will add only brief notes about the stack and the way the problem will be addressed.

What is Varnish?

Varnish is a caching HTTP reverse proxy. You can install it on any HTTP server and configure it to cache the contents.
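
To make this concrete, here is a minimal sketch of a Varnish configuration (default.vcl) that points Varnish at a backend. The host and port are hypothetical; adjust them to your setup.

vcl 4.1;

# Hypothetical backend; replace host/port with your own service
backend default {
    .host = "nginx";
    .port = "8080";
}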


What is Nginx?

Nginx is open source software for web serving, reverse proxying, caching, load balancing, media streaming, and more. It started out as a web server designed for maximum performance and stability.

Although it can fill many of these roles, I am treating it purely as a web server here; for the caching reverse proxy, we are using Varnish.
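
For reference, here is a minimal sketch of what that middle Nginx service could look like; the upstream address backend:9000 is hypothetical:

# Nginx as a plain web server between Varnish and the application backend
server {
    listen 8080;

    location / {
        proxy_pass http://backend:9000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}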


What is Kibana?

Kibana is a free and open frontend application that sits on top of the Elastic Stack, providing search and data visualization capabilities for data indexed in Elasticsearch. It is commonly known as the charting tool for the Elastic Stack.

I am using Elasticsearch for the logs and Kibana for the visualization; you can use any other stack of your choice. Our main purpose is to understand the logs and gain meaningful insight.


Problem Statement

The website shows a 5xx error with the message "Backend fetch failed". You will usually find it discussed in blogs or GitHub issues for various platforms, such as WordPress and Magento, when backend teams are trying to enable Varnish, but it is not specific to either platform. In my case, the issue occurred because of changes to the data that increased the request, response, and header sizes. It can also happen on a well-run system, so we must monitor it constantly.

Backend

I have the following environment as a reference:

  • Varnish in a Docker container (it uses Nginx to communicate with the backend)
  • The backend is in a different container
  • For logging, I use Elasticsearch with Kibana

Error Behaviour

This issue can happen for various reasons, and the errors can behave differently:

  • If the error is persistent, i.e. showing for all requests, there is a high chance that Varnish is unable to reach the backend at all. In that case, we need to focus on the connection between these services (see the quick check after this list).

  • If the error is intermittent, the connection between Varnish and the backend is working fine and there are other things to consider. A few requests may be incompatible with the current Varnish configuration, and we need to tweak it, since it is not failing all the time.
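
For the persistent case, a quick connectivity check is worth doing before anything else. A sketch, assuming Varnish runs in a container named varnish and Nginx is reachable as nginx:8080 (both names hypothetical):

# Ask Varnish how it sees its backends (look for Healthy/Sick)
varnishadm backend.list

# Test the raw connection from inside the Varnish container
docker exec -it varnish curl -sv -o /dev/null http://nginx:8080/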

Solution

To find the solution, we can consider two options:

  • Hit and try
  • Find the truth.

Before that, let's discuss the reasons for this error.

Understand the reason

There could be various reasons for these errors. Let me give a few examples, which I found to be the common cases most people face. We can get this error whenever one of these values exceeds the limit defined in the Varnish config:

  1. A delay in the first byte, or between bytes, that Varnish receives from the backend. Here we will focus on the bytes and time_first_byte log values.
  2. The header size of your request or response. Here we will focus on the headers of requests and responses.
  3. The size of the request or response body. Here we will focus on the content of the whole request or response body; for example, API responses that are not paginated at all.
  4. The request taking too long to travel between Varnish and the backend. Here we will focus on request_time and work on Varnish's connect_timeout parameter.
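
Before tweaking anything, it helps to know that Varnish usually records why a fetch failed. A sketch using varnishlog's FetchError tag; look for messages like "first byte timeout" (Reason 1) or header and workspace overflow errors (Reasons 2 and 3):

# Show only the recorded failure reason for 5xx transactions
varnishlog -g request -q 'RespStatus >= 500' -i FetchError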

Hit and try

Based on the previously mentioned reasons, we can tweak the Varnish configuration, test whether it works for us, and then proceed. Increasing these values slightly will often resolve the issue. Below is a sample covering all the mentioned parameters.

# Mapping of parameters to the reasons above:
#   Reason 1: first_byte_timeout, between_bytes_timeout
#   Reason 2 & 3: http_req_hdr_len, http_req_size, http_resp_hdr_len,
#                 http_resp_size, workspace_backend, workspace_client
#   Reason 4: connect_timeout (timeout_idle also works with it)
varnishd \
    -p connect_timeout=15 \
    -p first_byte_timeout=90 \
    -p between_bytes_timeout=120 \
    -p http_req_hdr_len=64k \
    -p http_req_size=256k \
    -p http_resp_hdr_len=64k \
    -p http_resp_size=256k \
    -p timeout_idle=10 \
    -p workspace_backend=256k \
    -p workspace_client=256k
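
Since my Varnish runs in Docker, here is a sketch of passing these parameters through docker-compose. It assumes the official varnish image, whose entrypoint appends the container command to varnishd; verify this against your own image:

# docker-compose.yml (fragment) -- assumes the official "varnish" image
services:
  varnish:
    image: varnish:stable
    command:
      - "-p"
      - "first_byte_timeout=90"
      - "-p"
      - "http_resp_hdr_len=64k"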

Note: all the parameter definitions are available in the Varnish documentation.

To find the default Varnish values, we can look at these files:

CentOS 6: /etc/sysconfig/varnish
CentOS 7: /etc/varnish/varnish.params
Debian: /etc/default/varnish
Ubuntu: /etc/default/varnish
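
If Varnish is already running, you can also query the live values directly instead of digging through these files:

# Show the current value of one parameter, or list all of them
varnishadm param.show first_byte_timeout
varnishadm param.show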

Find the truth.

In the previous step, I proposed that we directly try these values, increase them a bit, and proceed. But what if we want to know the exact reason, and set values close to what our application actually needs rather than something random? Most importantly, what if we want to understand the actual cause?

To find the solution, we first need to enable logs. It is always helpful to have logs for the service; otherwise, it is very challenging to proceed and we would just be trying blindly. Logs play a crucial role in finding the real culprit and are useful for both error behaviours. We will try to make sense of the logs and find what values are appropriate for our Varnish config. For this, we will need the following logs.

Enable Varnish logs

Varnish logs consist of the requests that Varnish handles and provide various information, such as the status (503), the caching status, or the bytes transferred. There are several ways to get the Varnish logs.

  • varnishlog is easy to use and gives clear, detailed information, but I found it challenging to use with a logging service because I couldn't format the output.
varnishlog -q 'RespStatus > 499' -g request
  • varnishncsa is easy to use, and you can provide a format for the logs which you can easily parse in your logging service.
varnishncsa -q 'RespStatus > 499' -g request

To format the logs with more information, we can use something similar to this:

# Please remove the line breaks in the format string; they are only here
# for readability and could confuse your logging service's parsing
# -c: client requests
# -b: backend requests
# -E: ESI requests
# -F: format for the output
varnishncsa -c -b -E -F '{
        "@timestamp": "%{%Y-%m-%dT%H:%M:%S%z}t",
        "@fields" : {
        "client":"%h",
        "remote_user" : "%u",
        "x_forwarded_for" : "%{X-Forwarded-For}i",
        "hit_miss":" %{Varnish:hitmiss}x",
        "bytes": %b,
        "time_first_byte" : "%{Varnish:time_firstbyte}x",
        "duration_usec": %D,
        "status": %s,
        "request":"%r",
        "url":"%U",
        "query":"%q",
        "virtualhost":"%{host}i",
        "method":"%m",
        "handling" : "%{Varnish:handling}x",
        "referrer":"%{Referrer}i",
        "user_agent":"%{User-agent}i"
    }
}'
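
To keep these logs flowing into your logging service, varnishncsa can also run as a daemon and append to a file that your log shipper tails; a sketch (the file path is just an example):

# -D: daemonize, -a: append, -w: write to the given file
varnishncsa -D -a -w /var/log/varnish/access.json -c -b -E -F '<same JSON format as above>'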

Generated logs in Kibana:

[Screenshot: Varnish logs visualized in Kibana]

You can use a similar format to generate logs that will give you a clear idea about the requests. Based on the status, you can filter the requests.

You can filter in Kibana with the query below; it will display all logs with a status above 499, i.e. all 5xx responses.

status > 499

Enable Nginx logs (optional)

If you are using Nginx, its logs give us even more information about the requests. With them, we can focus on a few parameters that are not available in the Varnish logs.

log_format apm '{'
    '"logsType":"Nginx", '
    '"time":"$time_local", '
    '"client":"$remote_addr", '
    '"method":"$request_method", '
    '"host":"$host", '
    '"request":"$request", '
    '"request_length":"$request_length", '
    '"status":"$status", '
    '"bytes_sent":"$bytes_sent", '
    '"body_bytes_sent":"$body_bytes_sent", '
    '"referer":"$http_referer", '
    '"user_agent":"$http_user_agent", '
    '"upstream_addr":"$upstream_addr", '
    '"upstream_status":"$upstream_status", '
    '"request_time":"$request_time", '
    '"upstream_response_time":"$upstream_response_time", '
    '"upstream_connect_time":"$upstream_connect_time", '
    '"upstream_header_time":"$upstream_header_time"'
'}';

access_log /dev/stdout apm;
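
For placement, the log_format directive belongs in the http context of nginx.conf; a sketch of where it sits:

# nginx.conf (fragment)
http {
    log_format apm '{ ... }';     # the JSON format shown above
    access_log /dev/stdout apm;   # or a file for your log shipper to tail

    # server { ... } blocks follow
}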

Note: all the variable definitions are available in the Nginx documentation.

Generated logs in Kibana:

[Screenshot: Nginx logs visualized in Kibana]

Update the configuration.

After checking the logs, we can track down the actual parameter causing the issue and set its value a bit higher than what our logs are showing. As an example, here is the configuration for a case where the logs point to the size-related parameters.

varnishd \
    -p http_req_size=256k \
    -p http_resp_size=256k \
    -p workspace_backend=256k \
    -p workspace_client=256k
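
After restarting with the new values, it is worth watching Varnish's own counter for failed backend fetches to confirm the errors actually stop; a quick check:

# MAIN.fetch_failed counts failed backend fetches; it should stop
# increasing once the right parameter has been raised
varnishstat -1 -f MAIN.fetch_failed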


Final notes

I appreciate you reading the article, and I hope it helps you fix the error. Feel free to comment and let me know if you have any questions.

I'd be pleased to connect with you on LinkedIn
