Real Graceful Shutdown in Kubernetes and ASP.NET Core

Armin Shoeibi

Posted on June 22, 2024

Our team recently developed a "Payment as a Service" solution for our company. This service aims to provide seamless payment integration for other microservices. We built it with ASP.NET Core 8 and deployed it on Kubernetes (K8s).

However, we faced significant stress during deployments. We often had to stay up late to perform near-rolling updates. The process wasn't a true rolling update, and it caused us considerable frustration.

Initially, our application had a 30-second graceful shutdown period, because .NET's Generic Host defaults HostOptions.ShutdownTimeout to 30 seconds. This default wasn't suitable for our application, as we had long-running tasks and API calls.

We increased the shutdown timeout to 90 seconds.

builder.Host.ConfigureHostOptions(ho => 
{
    ho.ShutdownTimeout = TimeSpan.FromSeconds(90);
});

However, we still experienced several SIGKILLs after 30 seconds during our rolling updates. When Kubernetes terminates a pod, it first sends a SIGTERM signal and, by default, gives the pod 30 seconds to shut down before sending SIGKILL. Our pods needed up to 90 seconds, not 30.

To address this, we needed to configure this behavior in Kubernetes. After some research, we discovered the terminationGracePeriodSeconds setting, which defaults to 30 seconds and was causing the SIGKILLs. We set it to 120 seconds, thirty seconds more than the maximum time our application needs to shut down.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ---
spec:
  template:
    spec:
      # Pod-level field: it sits next to "containers", not inside a container entry.
      terminationGracePeriodSeconds: 120
      containers:
      - name: ---
        image: ---

So far, we've made two key changes.

  1. Increased the HostOptions.ShutdownTimeout
  2. Increased the terminationGracePeriodSeconds in the k8s manifest

After making these changes, we tested our application and everything worked flawlessly.

To validate these changes, we created a straightforward action method.

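using Microsoft.AspNetCore.Mvc;

[Route("api/v1/graceful-shutdown")]
[ApiController]
public class GracefulShutdownController : ControllerBase
{
    // Simulates a long-running request that must survive a rolling update.
    [HttpGet]
    public async Task<IActionResult> TestAsync()
    {
        await Task.Delay(TimeSpan.FromSeconds(75));
        return Ok();
    }
}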

We called the 'TestAsync' endpoint and immediately deployed a new version using Kubernetes. Our pod entered the terminating state with a 120-second grace period provided by Kubernetes, while our application's shutdown timeout was set to 90 seconds. The 'TestAsync' action method, designed to run for 75 seconds, executed smoothly during this transition.
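A manual test like this can also be scripted. The following is a minimal sketch, not the tool we used: the ShutdownProbe class and its ProbeAsync method are hypothetical names introduced for illustration. It sends one GET request and reports whether the call completed successfully and how long it took, treating network failures and client-side timeouts as failed calls.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper for exercising a long-running endpoint during a rollout.
public static class ShutdownProbe
{
    // Sends one GET and returns whether it finished with a 2xx status,
    // together with the elapsed time of the call.
    public static async Task<(bool Succeeded, TimeSpan Elapsed)> ProbeAsync(HttpClient client, Uri uri)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            using var response = await client.GetAsync(uri);
            return (response.IsSuccessStatusCode, stopwatch.Elapsed);
        }
        catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
        {
            // Connection refused/reset, or the HttpClient timeout fired.
            return (false, stopwatch.Elapsed);
        }
    }
}
```

To use it, call ProbeAsync with the URL of the test endpoint in your cluster and start the deployment while the request is in flight. HttpClient's default timeout is 100 seconds, which covers the 75-second action, but setting HttpClient.Timeout explicitly makes the intent clearer.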

However, after several updates, our downstream microservices (mostly front-end applications) reported that some of their HTTP calls failed during our rolling updates. After further investigation, we discovered a gap between the Nginx Ingress controller and the pod states: the ingress could still route new requests to a pod that had already received SIGTERM and started shutting down.

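We found issues on GitHub related to this. The .NET team's guidance addresses it by replacing the default IHostLifetime with an implementation that delays the reaction to the SIGTERM signal, so the pod keeps serving requests for a short while after termination begins.
We set the delay to 10 seconds.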

using System.Runtime.InteropServices;

namespace OPay.API.K8s;

public class DelayedShutdownHostLifetime(IHostApplicationLifetime applicationLifetime) : IHostLifetime, IDisposable
{
    private IEnumerable<IDisposable>? _disposables;

    public Task StopAsync(CancellationToken cancellationToken)
    {
        return Task.CompletedTask;
    }

    public Task WaitForStartAsync(CancellationToken cancellationToken)
    {
        _disposables =
        [
            PosixSignalRegistration.Create(PosixSignal.SIGINT, HandleSignal),
            PosixSignalRegistration.Create(PosixSignal.SIGQUIT, HandleSignal),
            PosixSignalRegistration.Create(PosixSignal.SIGTERM, HandleSignal)
        ];
        return Task.CompletedTask;
    }

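    protected void HandleSignal(PosixSignalContext ctx)
    {
        // Suppress the default handling so the process does not begin shutting down immediately.
        ctx.Cancel = true;
        // Keep serving for 10 seconds so the ingress can stop routing here, then start the host shutdown.
        Task.Delay(TimeSpan.FromSeconds(10)).ContinueWith(t => applicationLifetime.StopApplication());
    }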

    public void Dispose()
    {
        foreach (var disposable in _disposables ?? Enumerable.Empty<IDisposable>())
        {
            disposable.Dispose();
        }
    }
}

Then register this implementation in the DI container.

builder.Services.AddSingleton<IHostLifetime, DelayedShutdownHostLifetime>();

The code above is adapted from the .NET team's graceful-shutdown sample (the first link below).

After implementing this shutdown delay, we eliminated deployment-related issues and significantly reduced our stress levels.
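The three timings in this post depend on each other: in the worst case the pod needs roughly the signal delay (10 seconds) plus the host's ShutdownTimeout (90 seconds) after SIGTERM, and that sum has to fit inside terminationGracePeriodSeconds (120 seconds). The sketch below spells out that arithmetic. ShutdownBudget is a hypothetical helper written for this illustration, not part of .NET or Kubernetes.

```csharp
using System;

// Hypothetical helper that checks whether the configured timings fit together.
public static class ShutdownBudget
{
    // Worst-case time needed after SIGTERM: the delay before the host starts
    // stopping, plus the time the host allows its services to stop.
    public static TimeSpan Required(TimeSpan signalDelay, TimeSpan shutdownTimeout)
        => signalDelay + shutdownTimeout;

    // True when the Kubernetes grace period covers that worst case.
    public static bool FitsWithin(TimeSpan gracePeriod, TimeSpan signalDelay, TimeSpan shutdownTimeout)
        => Required(signalDelay, shutdownTimeout) <= gracePeriod;
}
```

With our values, Required returns 100 seconds, which leaves about 20 seconds of headroom within the 120-second grace period. With the Kubernetes default of 30 seconds, the same check fails, which matches the SIGKILLs we saw at the start.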

Navigate through these links to learn more:

  1. https://github.com/dotnet/dotnet-docker/blob/main/samples/kubernetes/graceful-shutdown/graceful-shutdown.md#adding-a-shutdown-delay

  2. https://github.com/dotnet/runtime/blob/v8.0.6/src/libraries/Microsoft.Extensions.Hosting/src/HostOptions.cs

  3. https://github.com/dotnet/runtime/blob/v8.0.6/src/libraries/Microsoft.Extensions.Hosting/src/Internal/ConsoleLifetime.netcoreapp.cs

  4. https://github.com/dotnet/runtime/blob/v8.0.6/src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs#L235

  5. https://github.com/dotnet/runtime/blob/v8.0.6/src/libraries/Microsoft.Extensions.Hosting/src/Internal/ApplicationLifetime.cs

  6. https://learn.microsoft.com/en-us/dotnet/core/extensions/generic-host?tabs=appbuilder#hosting-shutdown-process
