Real Graceful Shutdown in Kubernetes and ASP.NET Core
Armin Shoeibi
Posted on June 22, 2024
Our team recently developed a "Payment as a Service" solution for our company. This service provides seamless payment integration for other microservices. We built it with ASP.NET Core 8 and deployed it on Kubernetes (K8s).
However, deployments caused us significant stress. We often had to stay up late to perform near-rolling updates; the process was not a true rolling update, and it caused us considerable frustration.
Initially, our application had a 30-second graceful shutdown period. Why 30 seconds? Because .NET's Generic Host defaults HostOptions.ShutdownTimeout to 30 seconds. That default wasn't suitable for our application, as we had long-running tasks and API calls.
We increased the shutdown timeout to 90 seconds.
builder.Host.ConfigureHostOptions(ho =>
{
    ho.ShutdownTimeout = TimeSpan.FromSeconds(90);
});
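If you prefer not to hard-code the value, the timeout can also be bound from configuration. The sketch below is a minimal variation; the "HostOptions" section name and the appsettings entry are assumptions, not part of the original setup.
// Minimal sketch: bind HostOptions from configuration instead of hard-coding the timeout.
// Assumes an appsettings.json entry such as: "HostOptions": { "ShutdownTimeout": "00:01:30" }
builder.Services.Configure<HostOptions>(builder.Configuration.GetSection("HostOptions"));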
However, we still saw SIGKILLs after 30 seconds during our rolling updates. Kubernetes first sends a SIGTERM signal and then gives the pod 30 seconds to stop and shut down, but our pods needed up to 90 seconds.
To address this, we needed to configure this behavior in Kubernetes. After some research, we discovered the terminationGracePeriodSeconds setting, which defaults to 30 seconds and was causing the SIGKILLs. We set it to 120 seconds, thirty seconds more than our application's maximum shutdown time.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ---
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: ---
          image: ---
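As a quick sanity check, the host's effective shutdown timeout can be logged at startup and compared against the manifest's terminationGracePeriodSeconds. This is a minimal sketch, not part of the original setup.
using Microsoft.Extensions.Options;

// Sketch only: log the effective shutdown timeout at startup so it can be
// cross-checked against terminationGracePeriodSeconds in the manifest.
var app = builder.Build();
var shutdownTimeout = app.Services.GetRequiredService<IOptions<HostOptions>>().Value.ShutdownTimeout;
app.Logger.LogInformation("Host ShutdownTimeout: {ShutdownTimeout}", shutdownTimeout);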
So far, we've made two key changes:
- Increased the HostOptions.ShutdownTimeout in the application
- Increased the terminationGracePeriodSeconds in the K8s manifest
After making these changes, we tested our application and everything worked flawlessly.
To validate the new timeouts, we created a straightforward action method.
[Route("api/v1/graceful-shutdown")]
[ApiController]
public class GracefulShutdownController : ControllerBase
{
    [HttpGet]
    public async Task<IActionResult> TestAsync()
    {
        // Simulate a long-running request that must survive a rolling update.
        await Task.Delay(TimeSpan.FromSeconds(75));
        return Ok();
    }
}
We called the 'TestAsync' endpoint and immediately deployed a new version using Kubernetes. Our pod entered the terminating state with a 120-second grace period provided by Kubernetes, while our application's shutdown timeout was set to 90 seconds. The 'TestAsync' action method, designed to run for 75 seconds, executed smoothly during this transition.
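To make the shutdown sequence easier to follow in the pod logs during such a test, the host lifetime events can also be hooked in Program.cs. This is a minimal sketch with illustrative log messages, not something from the original code.
// Sketch only: log lifetime events so a rolling update can be followed in the pod logs.
app.Lifetime.ApplicationStopping.Register(() =>
    app.Logger.LogInformation("Shutdown requested; waiting for in-flight requests to finish."));
app.Lifetime.ApplicationStopped.Register(() =>
    app.Logger.LogInformation("Shutdown completed."));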
However, after several updates, our downstream services (mostly front-end applications) reported that some of their HTTP calls failed during our rolling updates. After further investigation, we found a gap between the Nginx Ingress controller and the pod state: the ingress keeps routing requests to a pod for a short while after the pod has received SIGTERM, because the updated endpoint list takes time to propagate.
We found related issues on GitHub, where the .NET team's suggested workaround is to replace the default IHostLifetime with an implementation that delays reacting to the SIGTERM signal.
We set the delay to 10 seconds.
using System.Runtime.InteropServices;

namespace OPay.API.K8s;

public class DelayedShutdownHostLifetime(IHostApplicationLifetime applicationLifetime) : IHostLifetime, IDisposable
{
    private IEnumerable<IDisposable>? _disposables;

    public Task StopAsync(CancellationToken cancellationToken)
    {
        return Task.CompletedTask;
    }

    public Task WaitForStartAsync(CancellationToken cancellationToken)
    {
        // Take over handling of the shutdown signals from the default host lifetime.
        _disposables =
        [
            PosixSignalRegistration.Create(PosixSignal.SIGINT, HandleSignal),
            PosixSignalRegistration.Create(PosixSignal.SIGQUIT, HandleSignal),
            PosixSignalRegistration.Create(PosixSignal.SIGTERM, HandleSignal)
        ];
        return Task.CompletedTask;
    }

    protected void HandleSignal(PosixSignalContext ctx)
    {
        // Cancel the default handling so the process doesn't stop immediately,
        // then trigger application shutdown after a 10-second delay.
        ctx.Cancel = true;
        Task.Delay(TimeSpan.FromSeconds(10)).ContinueWith(t => applicationLifetime.StopApplication());
    }

    public void Dispose()
    {
        foreach (var disposable in _disposables ?? Enumerable.Empty<IDisposable>())
        {
            disposable.Dispose();
        }
    }
}
Then register this implementation in the DI container:
builder.Services.AddSingleton<IHostLifetime, DelayedShutdownHostLifetime>();
You can find the original source of the DelayedShutdownHostLifetime implementation here.
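One thing worth keeping in mind is the overall timing budget: the 10-second delay runs before the host's 90-second ShutdownTimeout kicks in, so the worst case is roughly 100 seconds, which still fits inside the 120-second terminationGracePeriodSeconds. A small optional convention (not from the original code) is to keep those numbers in one place so they don't drift apart; a sketch:
// Sketch only: keep the shutdown timing budget in one place.
// terminationGracePeriodSeconds (120s) >= SIGTERM delay (10s) + ShutdownTimeout (90s)
internal static class ShutdownBudget
{
    public static readonly TimeSpan SigtermDelay = TimeSpan.FromSeconds(10);
    public static readonly TimeSpan HostShutdownTimeout = TimeSpan.FromSeconds(90);

    // Must match terminationGracePeriodSeconds in the K8s manifest.
    public static readonly TimeSpan TerminationGracePeriod = TimeSpan.FromSeconds(120);
}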
After implementing this shutdown delay, we eliminated deployment-related issues and significantly reduced our stress levels.